Jack Delany, John Bradshaw
DAYLIGHT Chemical Information Systems, Inc. Mission Viejo, CA USA
User-defined Similarity Measures
Daylight currently supports three different similarity measures: Tanimoto, Tversky, and Euclidean distance. There are numerous other measures described and studied in the literature (Holliday).
At EMUG01 John spoke about a general facility for computing similarity values; it has been implemented in the clustering package. Nearneighbors, spherex, and kmodes all accept a user-specified similarity measure. Daycart also has a new function, "user_similarity()", which accepts a user-provided similarity measure and performs a generic similarity search.
| | OBJECT B = 0 | OBJECT B = 1 | Totals |
|---|---|---|---|
| OBJECT A = 0 | d | b | b + d |
| OBJECT A = 1 | a | c | a + c = A |
| Totals | a + d | b + c = B | n |
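Read from the table, a counts the bits set only in object A, b the bits set only in object B, c the bits set in both, and d the bits set in neither. The following is a minimal Python sketch of how these counts could be obtained for two fingerprints. It is not Daylight toolkit code: the function name `contingency_counts` and the storage of fingerprints as Python integers are assumptions made for illustration.

```python
def contingency_counts(fp_a, fp_b, nbits):
    """Return (a, b, c, d) for two fingerprints held as integers of nbits bits.

    a: bits set in A only    b: bits set in B only
    c: bits set in both      d: bits set in neither
    """
    mask = (1 << nbits) - 1          # keep only the low nbits bits
    fp_a &= mask
    fp_b &= mask
    c = bin(fp_a & fp_b).count("1")
    a = bin(fp_a & ~fp_b & mask).count("1")
    b = bin(fp_b & ~fp_a & mask).count("1")
    d = nbits - (a + b + c)          # n = a + b + c + d
    return a, b, c, d


# Two hypothetical 8-bit fingerprints.
a, b, c, d = contingency_counts(0b10110100, 0b10011100, 8)
print(a, b, c, d)  # 1 1 3 3
```

Here a + c = 4 is the number of bits set in A and b + c = 4 is the number set in B, matching the row and column totals of the table.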
Currently, a number of named expressions are built into the expression evaluator:
Symbolic name | Expression |
---|---|
TANIMOTO | c/(a+b+c) |
EUCLID | sqrt((c+d)/(a+b+c+d)) |
DICE | (2.0*c)/(a+c+b+c) |
COSINE | c/sqrt((a+c)*(b+c)) |
KULCZYNSKI | 0.5*((c/(a+c))+(c/(b+c))) |
JACCARD | c/(a+b+c) |
RUSSELL/RAO | c/(a+b+c+d) |
MATCHING | (c+d)/(a+b+c+d) |
HAMMAN | ((c+d)-(a+b))/(a+b+c+d) |
ROGERS/TANIMOTO | (c+d)/((a+b)+(a+b+c+d)) |
FORBES | (c*(a+b+c+d))/((a+c)*(b+c)) |
SIMPSON | c/min((a+c),(b+c)) |
PEARSON | (c*d-a*b)/sqrt((a+c)*(b+c)*(a+d)*(b+d)) |
YULE | (c*d-a*b)/(c*d+a*b) |
MANHATTAN | (a+b)/(a+b+c+d) |
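To show how the table reads, the sketch below evaluates a few of the named expressions for one set of counts. It is plain Python rather than the Daylight expression evaluator. The dictionary `NAMED`, the function `evaluate`, and the counts are hypothetical, and only a subset of the table is transcribed.

```python
from math import sqrt

# A subset of the table above, transcribed as Python callables of (a, b, c, d).
NAMED = {
    "TANIMOTO": lambda a, b, c, d: c / (a + b + c),
    "DICE": lambda a, b, c, d: (2.0 * c) / (a + c + b + c),
    "COSINE": lambda a, b, c, d: c / sqrt((a + c) * (b + c)),
    "MATCHING": lambda a, b, c, d: (c + d) / (a + b + c + d),
    "MANHATTAN": lambda a, b, c, d: (a + b) / (a + b + c + d),
}


def evaluate(name, a, b, c, d):
    """Look up a named expression and evaluate it for the given counts."""
    return NAMED[name](a, b, c, d)


# Hypothetical counts: a = 1, b = 1, c = 3, d = 3 (n = 8).
counts = (1, 1, 3, 3)
for name in NAMED:
    print(f"{name:10s} {evaluate(name, *counts):.3f}")
```

For these counts TANIMOTO gives 0.600, DICE, COSINE, and MATCHING give 0.750, and MANHATTAN gives 0.250. MATCHING and MANHATTAN always sum to 1, since their numerators (c+d) and (a+b) together make up n.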
For programs, there are two new options: -expr and -comparison. The -expr option takes either a symbolic name or an expression; if not specified, it defaults to TANIMOTO. The -comparison option should be either "DISTANCE" or "SIMILARITY". If it is unspecified, the program chooses based on two limit cases: a = b = 1, c = d = 0 (no bits in common) versus a = b = 0, c = d = 1 (identical objects).
    $ nearneighbors -expr FORBES wdi.fp
    nearneighbors: expression (FORBES) identified as a SIMILARITY
    nearneighbors: reading input file (wdi.fp)
    ...
    $ nearneighbors -expr "(a+b)/(a+b+c+d)" wdi.fp
    nearneighbors: expression ((a+b)/(a+b+c+d)) identified as a DISTANCE
    ...
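The limit-case test can be sketched in a few lines of Python. The decision rule here is an assumption, since the text says only that the choice is based on the two limit cases: the expression is taken to be a SIMILARITY if it scores the identical case (a = b = 0, c = d = 1) higher than the disjoint case (a = b = 1, c = d = 0), and a DISTANCE if it scores it lower. The function names are hypothetical, and Python's `eval` stands in for the real expression parser on the assumption that the expression syntax is also valid Python.

```python
from math import sqrt

# Functions an expression may call (the ones that appear in the table).
_FUNCS = {"sqrt": sqrt, "min": min}


def evaluate_expr(expr, a, b, c, d):
    """Evaluate an expression string in a, b, c, d.

    Builtins are disabled, but eval is still not safe for untrusted input.
    """
    env = dict(_FUNCS, a=a, b=b, c=c, d=d)
    return eval(expr, {"__builtins__": {}}, env)


def classify(expr):
    """Guess whether expr is a similarity or a distance (assumed rule)."""
    disjoint = evaluate_expr(expr, a=1, b=1, c=0, d=0)
    identical = evaluate_expr(expr, a=0, b=0, c=1, d=1)
    if identical == disjoint:
        raise ValueError("limit cases do not distinguish the two kinds")
    return "SIMILARITY" if identical > disjoint else "DISTANCE"


print(classify("(c*(a+b+c+d))/((a+c)*(b+c))"))  # FORBES -> SIMILARITY
print(classify("(a+b)/(a+b+c+d)"))              # -> DISTANCE
```

Under this rule the FORBES expression evaluates to 0 for the disjoint case and 2 for the identical case, so it is classified as a similarity, while (a+b)/(a+b+c+d) evaluates to 1 and 0 respectively and is classified as a distance, in agreement with the program output shown above.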
User-defined measures are useful as input to "Russian Doll" (MacCuish, MUG03) and may also be useful as a tool for exploring "Data Fusion" (Ginn).