MUG'04: Clustering

Improvements to Daylight Clustering

Jack Delany, John Bradshaw

DAYLIGHT Chemical Information Systems, Inc. Mission Viejo, CA USA

User-defined Similarity Measures

Daylight currently supports three different similarity measures: Tanimoto, Tversky, and Euclidean distance. There are numerous other measures described and studied in the literature (Holliday).

At EMUG01 John spoke about a general facility for computing similarity values; this has now been implemented in the clustering package. The nearneighbors, spherex, and kmodes programs all accept a user-specified similarity measure. Daycart also has a new function, "user_similarity()", which accepts a user-provided similarity measure and performs a generic similarity search.

                       OBJECT B
                  0        1            Totals
OBJECT A     0    d        b            b + d
             1    a        c            a + c = A
Totals            a + d    b + c = B    n

a is the count of bits on in object A but not in object B.
b is the count of bits on in object B but not in object A.
c is the count of bits on in both object A and object B.
d is the count of bits off in both object A and object B.
n = a + b + c + d is the total number of bits.
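
The four counts can be derived from a pair of fingerprints with bitwise operations. The following is a minimal sketch, not Daylight code: it assumes each fingerprint is held as a Python integer whose bits are the fingerprint bits, and the function name `contingency_counts` and its `nbits` parameter are hypothetical.

```python
def contingency_counts(fp_a: int, fp_b: int, nbits: int):
    """Return (a, b, c, d) for two fingerprints of nbits bits each.

    Assumption: fingerprints are non-negative Python ints; bit i of the
    integer is bit i of the fingerprint.
    """
    mask = (1 << nbits) - 1               # keep only the lowest nbits bits
    fp_a &= mask
    fp_b &= mask
    a = bin(fp_a & ~fp_b & mask).count("1")   # on in A but not in B
    b = bin(fp_b & ~fp_a & mask).count("1")   # on in B but not in A
    c = bin(fp_a & fp_b).count("1")           # on in both
    d = nbits - (a + b + c)                   # off in both
    return a, b, c, d


# Two illustrative 8-bit fingerprints (made-up values).
print(contingency_counts(0b11010010, 0b10011010, 8))  # (1, 1, 3, 3)
```

Computing d as the remainder guarantees that the four counts always sum to the fingerprint length n.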

Currently, a number of expressions with symbolic names are built into the expression evaluator:

Symbolic name      Expression

TANIMOTO           c/(a+b+c)
EUCLID             sqrt((c+d)/(a+b+c+d))
DICE               (2.0*c)/(a+c+b+c)
COSINE             c/sqrt((a+c)*(b+c))
KULCZYNSKI         0.5*((c/(a+c))+(c/(b+c)))
JACCARD            c/(a+b+c)
RUSSELL/RAO        c/(a+b+c+d)
MATCHING           (c+d)/(a+b+c+d)
HAMMAN             ((c+d)-(a+b))/(a+b+c+d)
ROGERS/TANIMOTO    (c+d)/((a+b)+(a+b+c+d))
FORBES             (c*(a+b+c+d))/((a+c)*(b+c))
SIMPSON            c/min((a+c),(b+c))
PEARSON            (c*d-a*b)/sqrt((a+c)*(b+c)*(a+d)*(b+d))
YULE               (c*d-a*b)/(c*d+a*b)
MANHATTAN          (a+b)/(a+b+c+d)
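
To make the table concrete, here is a small sketch that evaluates a few of the listed expressions for one set of counts. It is an illustration only and not the Daylight expression evaluator: the `MEASURES` dictionary and the `evaluate` helper are hypothetical names, the formulas are transcribed from the table, and the counts a = 1, b = 1, c = 3, d = 3 are made up. Zero denominators (for example, two empty fingerprints under TANIMOTO) are not handled here.

```python
import math

# A subset of the built-in expressions, transcribed from the table above.
MEASURES = {
    "TANIMOTO":  lambda a, b, c, d: c / (a + b + c),
    "DICE":      lambda a, b, c, d: (2.0 * c) / (a + c + b + c),
    "COSINE":    lambda a, b, c, d: c / math.sqrt((a + c) * (b + c)),
    "MATCHING":  lambda a, b, c, d: (c + d) / (a + b + c + d),
    "MANHATTAN": lambda a, b, c, d: (a + b) / (a + b + c + d),
}


def evaluate(name, a, b, c, d):
    """Evaluate one named measure for the given bit counts."""
    return MEASURES[name](a, b, c, d)


for name in MEASURES:
    print(name, evaluate(name, a=1, b=1, c=3, d=3))
```

For these counts TANIMOTO gives 0.6, DICE, COSINE, and MATCHING each give 0.75, and MANHATTAN gives 0.25.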

The programs take two new options: -expr and -comparison. If -expr is not specified, it defaults to TANIMOTO. The -comparison option should be either "DISTANCE" or "SIMILARITY". If it is not specified, the program chooses between them by evaluating the expression at two limit cases: a = b = 1, c = d = 0 (no bits in common) versus a = b = 0, c = d = 1 (identical objects).

$ nearneighbors -expr FORBES wdi.fp
nearneighbors: expression (FORBES) identified as a SIMILARITY
nearneighbors: reading input file (wdi.fp)
...
$ nearneighbors -expr "(a+b)/(a+b+c+d)" wdi.fp
nearneighbors: expression ((a+b)/(a+b+c+d)) identified as a DISTANCE
...
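
The automatic choice between DISTANCE and SIMILARITY can be sketched as follows. The source names only the two limit cases; the comparison rule used here (an expression that scores the identical case higher than the disjoint case is a similarity, and the reverse is a distance) is an assumption, as are the function name `classify` and the use of Python callables in place of expression strings.

```python
def classify(expr):
    """Guess whether expr(a, b, c, d) behaves as a similarity or a distance.

    Assumed rule: compare the value for two objects with no bits in
    common against the value for two identical objects.
    """
    disjoint = expr(a=1, b=1, c=0, d=0)    # limit case: nothing shared
    identical = expr(a=0, b=0, c=1, d=1)   # limit case: no differing bits
    if identical > disjoint:
        return "SIMILARITY"
    if identical < disjoint:
        return "DISTANCE"
    raise ValueError("limit cases are equal; specify the comparison explicitly")


# The two expressions from the console example above, written as callables.
forbes = lambda a, b, c, d: (c * (a + b + c + d)) / ((a + c) * (b + c))
manhattan = lambda a, b, c, d: (a + b) / (a + b + c + d)

print(classify(forbes))     # SIMILARITY
print(classify(manhattan))  # DISTANCE
```

Under this rule FORBES evaluates to 0 for the disjoint case and 2 for the identical case, while (a+b)/(a+b+c+d) evaluates to 1 and 0 respectively, which agrees with the classifications reported by nearneighbors above.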

User-defined measures are useful as input to "Russian Doll" (MacCuish, MUG03), and may also be useful as a tool for exploring "Data Fusion" (Ginn).