Improvements to Daylight Clustering

DAYLIGHT Chemical Information Systems, Inc. Mission Viejo, CA USA

User-defined Similarity Measures

Daylight currently supports three different similarity measures: Tanimoto, Tversky, and Euclidean distance. There are numerous other measures described and studied in the literature (Holliday).

At EMUG01 John spoke about a general facility for computing similarity values, which has now been implemented in the clustering package. Nearneighbors, spherex, and kmodes all accept a user-specified similarity measure. Daycart also has a new function, "user_similarity()", which accepts a user-provided similarity measure and performs a generic similarity search.

                      Object B
                   0          1
Object A    0      d          b          b + d
            1      a          c          a + c = A
                 a + d    b + c = B        n

a is the count of bits on in object A but not in object B.
b is the count of bits on in object B but not in object A.
c is the count of bits on in both object A and object B.
d is the count of bits off in both object A and object B.
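To make the four counts concrete, here is a minimal sketch that derives them from two fingerprints. It assumes each fingerprint is held as a Python integer whose low `nbits` bits are the fingerprint; the function name `contingency_counts` and that representation are illustrative choices, not part of the Daylight toolkit.

```python
def contingency_counts(fp_a: int, fp_b: int, nbits: int):
    """Return (a, b, c, d) for two fingerprints of width nbits.

    Hypothetical helper: fingerprints are plain Python ints here,
    which is an assumption made for illustration only.
    """
    mask = (1 << nbits) - 1          # keep only the low nbits bits
    fp_a &= mask
    fp_b &= mask
    a = bin(fp_a & ~fp_b & mask).count("1")  # on in A, off in B
    b = bin(fp_b & ~fp_a & mask).count("1")  # on in B, off in A
    c = bin(fp_a & fp_b).count("1")          # on in both
    d = nbits - a - b - c                    # off in both
    return a, b, c, d


# Two 8-bit example fingerprints.
print(contingency_counts(0b11010010, 0b01110001, 8))  # (2, 2, 2, 2)
```

Note that a + c recovers the number of bits on in object A, and b + c the number on in object B, matching the marginal totals A and B.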

Currently, a number of named expressions are built into the expression evaluator:

Symbolic name      Expression
TANIMOTO           c/(a+b+c)
EUCLID             sqrt((c+d)/(a+b+c+d))
DICE               (2.0*c)/(a+c+b+c)
COSINE             c/sqrt((a+c)*(b+c))
KULCZYNSKI         0.5*((c/(a+c))+(c/(b+c)))
JACCARD            c/(a+b+c)
RUSSELL/RAO        c/(a+b+c+d)
MATCHING           (c+d)/(a+b+c+d)
HAMMAN             ((c+d)-(a+b))/(a+b+c+d)
ROGERS/TANIMOTO    (c+d)/((a+b)+(a+b+c+d))
FORBES             (c*(a+b+c+d))/((a+c)*(b+c))
SIMPSON            c/min((a+c),(b+c))
PEARSON            (c*d-a*b)/sqrt((a+c)*(b+c)*(a+d)*(b+d))
YULE               (c*d-a*b)/(c*d+a*b)
MANHATTAN          (a+b)/(a+b+c+d)
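As a rough illustration of how these expressions behave, the sketch below transcribes the table into Python functions of the four counts. It is not the Daylight expression evaluator: the `EXPRESSIONS` dictionary and the `evaluate` helper are hypothetical names, and no attempt is made to handle zero denominators.

```python
from math import sqrt

# Transcription of the named expressions listed above.
# Illustrative only; the real evaluator parses expression strings.
EXPRESSIONS = {
    "TANIMOTO": lambda a, b, c, d: c / (a + b + c),
    "EUCLID": lambda a, b, c, d: sqrt((c + d) / (a + b + c + d)),
    "DICE": lambda a, b, c, d: (2.0 * c) / (a + c + b + c),
    "COSINE": lambda a, b, c, d: c / sqrt((a + c) * (b + c)),
    "KULCZYNSKI": lambda a, b, c, d: 0.5 * ((c / (a + c)) + (c / (b + c))),
    "JACCARD": lambda a, b, c, d: c / (a + b + c),
    "RUSSELL/RAO": lambda a, b, c, d: c / (a + b + c + d),
    "MATCHING": lambda a, b, c, d: (c + d) / (a + b + c + d),
    "HAMMAN": lambda a, b, c, d: ((c + d) - (a + b)) / (a + b + c + d),
    "ROGERS/TANIMOTO": lambda a, b, c, d: (c + d) / ((a + b) + (a + b + c + d)),
    "FORBES": lambda a, b, c, d: (c * (a + b + c + d)) / ((a + c) * (b + c)),
    "SIMPSON": lambda a, b, c, d: c / min((a + c), (b + c)),
    "PEARSON": lambda a, b, c, d: (c * d - a * b)
    / sqrt((a + c) * (b + c) * (a + d) * (b + d)),
    "YULE": lambda a, b, c, d: (c * d - a * b) / (c * d + a * b),
    "MANHATTAN": lambda a, b, c, d: (a + b) / (a + b + c + d),
}


def evaluate(name, a, b, c, d):
    """Evaluate a named expression on the counts a, b, c, d."""
    return EXPRESSIONS[name](a, b, c, d)


# Example counts: a = b = c = d = 2.
print(evaluate("DICE", 2, 2, 2, 2))    # 0.5
print(evaluate("FORBES", 2, 2, 2, 2))  # 1.0
```

As the table shows, TANIMOTO and JACCARD are the same expression under two names, so they always agree.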

The programs take two new options: -expr and -comparison. If -expr is not specified, it defaults to TANIMOTO. The -comparison option should be either "DISTANCE" or "SIMILARITY". If it is not specified, the program chooses between the two by evaluating the expression at two limit cases: a = b = 1, c = d = 0 (the objects agree on no bits) and a = b = 0, c = d = 1 (the objects agree on every bit).
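The automatic choice can be sketched as follows. The decision rule used here is an assumption: an expression that scores higher when the objects agree on every bit than when they agree on none is treated as a similarity, otherwise as a distance. The source only states that the two limit cases are used, so the exact rule, and the helper name `infer_comparison`, are hypothetical.

```python
def infer_comparison(expr):
    """Guess whether expr(a, b, c, d) is a similarity or a distance.

    Assumed rule (not confirmed by the source): compare the value at
    the fully disagreeing limit case with the fully agreeing one.
    """
    disagree = expr(1, 1, 0, 0)  # a = b = 1, c = d = 0
    agree = expr(0, 0, 1, 1)     # a = b = 0, c = d = 1
    return "SIMILARITY" if agree > disagree else "DISTANCE"


# Three of the built-in expressions, transcribed by hand.
def tanimoto(a, b, c, d):
    return c / (a + b + c)


def forbes(a, b, c, d):
    return (c * (a + b + c + d)) / ((a + c) * (b + c))


def manhattan(a, b, c, d):
    return (a + b) / (a + b + c + d)


print(infer_comparison(tanimoto))   # SIMILARITY
print(infer_comparison(forbes))     # SIMILARITY
print(infer_comparison(manhattan))  # DISTANCE
```

Under this assumed rule FORBES evaluates to 0 at the first limit case and 2 at the second, which is consistent with it being reported as a similarity.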

```
$ nearneighbors -expr FORBES wdi.fp
nearneighbors: expression (FORBES) identified as a SIMILARITY