Class %DeepSee.extensions.clusters.CLARA

class %DeepSee.extensions.clusters.CLARA extends PAM

This class provides an implementation of the CLARA (Clustering for Large Applications) algorithm.

An obvious way of clustering larger datasets is to extend existing methods so that they can cope with a larger number of objects. The focus is on clustering large numbers of objects rather than a small number of objects in high dimensions. Kaufman and Rousseeuw (1990) suggested the CLARA (Clustering for Large Applications) algorithm for tackling large applications. CLARA extends their k-medoids approach to a large number of objects. It works by clustering a sample from the dataset and then assigning all objects in the dataset to these clusters.

CLARA (CLustering LARge Applications) relies on the sampling approach to handle large data sets. Instead of finding medoids for the entire data set, CLARA draws a small sample from the data set and applies the PAM algorithm to generate an optimal set of medoids for the sample. The quality of the resulting medoids is measured by the average dissimilarity between every object in the entire data set D and the medoid of its cluster.
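The quality measure described above can be written out as follows. The symbols M (the set of medoids found on a sample) and d (the dissimilarity function) are notation introduced here for illustration. The formula assumes each object belongs to the cluster of its nearest medoid.

```latex
\mathrm{Cost}(M, D) \;=\; \frac{1}{|D|} \sum_{x \in D} \min_{m \in M} d(x, m)
```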

To alleviate sampling bias, CLARA repeats the sampling and clustering process a pre-defined number of times and subsequently selects as the final clustering result the set of medoids with the minimal cost.
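The sample, cluster, and select loop described above can be sketched in a few lines. This is a minimal, unoptimized illustration in Python, not the code of this class. All function and parameter names (clara, pam, total_cost, sample_size, n_samples) are hypothetical, the distance is assumed to be Euclidean, and the PAM step is a plain greedy swap search.

```python
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples (an assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def total_cost(data, medoids, dist):
    """Average dissimilarity between each object and its nearest medoid."""
    return sum(min(dist(x, m) for m in medoids) for x in data) / len(data)

def pam(sample, k, dist, rng):
    """Greedy swap search on a small sample (illustrative PAM-style step)."""
    medoids = rng.sample(sample, k)
    best = total_cost(sample, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in sample:
                if candidate in medoids:
                    continue
                # Try replacing medoid i with the candidate object.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(sample, trial, dist)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids

def clara(data, k, sample_size=40, n_samples=5, dist=euclidean, seed=0):
    """Run PAM on several random samples; keep the medoids that are cheapest on ALL data."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(data, min(sample_size, len(data)))
        medoids = pam(sample, k, dist, rng)
        cost = total_cost(data, medoids, dist)  # scored on the full data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    # Assign every object in the data set to its nearest final medoid.
    labels = [min(range(k), key=lambda j: dist(x, best_medoids[j])) for x in data]
    return best_medoids, labels, best_cost
```

In this sketch, sample_size plays a role loosely comparable to the SampleSize property documented below. How the class itself combines sampling with its stopping properties is not shown here.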

Inventory

Parameters	Properties	Methods	Queries	Indices	ForeignKeys	Triggers
	5	11

Summary

Properties
CacheCost	DSName	Dim	K	NIdle
Normalize	P	SampleSize	Treshold	UseSA
Verbose

Methods
%%OIDGet	%AddToSaveSet	%ClassIsLatestVersion	%ClassName
%ConstructClone	%DispatchClassMethod	%DispatchGetModified	%DispatchGetProperty
%DispatchMethod	%DispatchSetModified	%DispatchSetMultidimProperty	%DispatchSetProperty
%Extends	%GetParameter	%IsA	%IsModified
%New	%NormalizeObject	%ObjectModified	%OriginalNamespace
%PackageName	%RemoveFromSaveSet	%SerializeObject	%SetModified
%ValidateObject	ById	ClusterCost	CurrentTotalCost
Delete	Distance	Distance1	Distance12
Execute	Exists	GetASWIndex	GetCalinskiHarabaszIndex
GetCentroid	GetCluster	GetClusterSize	GetCost
GetCount	GetData	GetDimensions	GetId
GetNumberOfClusters	GetPearsonGammaIndex	GlobalCentroid	IsPrepared
New	Open	Prepare	RelativeClusterCost
Reset	SetData	SubsetCentroid	TotalCost
iterateCluster	printAll	printCluster	randomSubset

Properties

• property CacheCost as %Integer [ InitialExpression = -1 ];

Unused in the current implementation.

• property NIdle as %Integer [ InitialExpression = 5 ];

A minimum number of idle iterations (i.e. iterations that do not improve the total cost).

• property SampleSize as %Integer [ InitialExpression = 100 ];

Sample size to use for one PAM run.

• property Treshold as %Double [ InitialExpression = 0.001 ];

Threshold used to determine when to stop.

• property UseSA as %Boolean [ InitialExpression = 0 ];

Whether to use Simulated Annealing in each PAM run for a sample (not recommended).

Methods

• method Execute() as %Status

• method IsPrepared() as %Boolean

Checks whether the model is ready for an analysis to be executed. This depends on the specific algorithm, so this method is overridden by subclasses.

• classmethod New(dsName As %String, Output sc As %Status) as CLARA

• classmethod Open(dsName As %String, Output sc As %Status) as CLARA

• method Prepare() as %Status

