Jack Delany, John Bradshaw
DAYLIGHT Chemical Information Systems, Inc. Mission Viejo, CA USA
User-defined Similarity Measures
Daylight currently supports three different similarity measures: Tanimoto, Tversky, and Euclidean distance. There are numerous other measures described and studied in the literature (Holliday).
At EMUG01 John spoke about a general facility for computing similarity values, which has been implemented in the clustering package. Nearneighbors, spherex, and kmodes all accept the user-specified similarity measure. Also, Daycart has a new function, "user_similarity()", which accepts a user-provided similarity measure and performs the generic similarity search.
|          |        | OBJECT B |           |           |
|          |        | 0        | 1         | Totals    |
| OBJECT A | 0      | d        | b         | b + d     |
|          | 1      | a        | c         | a + c = A |
|          | Totals | a + d    | b + c = B | n         |
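As a rough illustration of the table above, the four counts can be derived from two fixed-width bit fingerprints. This is only a sketch: the function names, the use of Python integers as fingerprints, and the value returned for two empty fingerprints are assumptions made for the example, not part of the Daylight toolkit.

```python
def contingency_counts(fp_a, fp_b, nbits):
    """Return (a, b, c, d) for two fingerprints stored as Python ints.

    a = bits set in A only, b = bits set in B only,
    c = bits set in both,   d = bits set in neither.
    """
    mask = (1 << nbits) - 1          # keep only the low nbits bits
    fp_a &= mask
    fp_b &= mask
    a = bin(fp_a & ~fp_b & mask).count("1")
    b = bin(fp_b & ~fp_a & mask).count("1")
    c = bin(fp_a & fp_b).count("1")
    d = nbits - a - b - c            # n = a + b + c + d
    return a, b, c, d


def tanimoto(a, b, c, d):
    """Tanimoto in this notation: bits in common over bits set in either."""
    union = a + b + c
    # Returning 1.0 for two empty fingerprints is an arbitrary choice here.
    return c / union if union else 1.0


counts = contingency_counts(0b1100, 0b1010, 4)
score = tanimoto(*counts)
```

Note that d does not appear in the Tanimoto expression, while measures that use n = a + b + c + d depend on the fingerprint width.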
A number of symbolic named expressions, such as TANIMOTO and FORBES, are currently built into the expression evaluator.
The programs take two new options: -expr and -comparison. If -expr is not specified, it defaults to TANIMOTO. The -comparison option should be either "DISTANCE" or "SIMILARITY". If it is unspecified, the program chooses by evaluating the expression at two limit cases: a = b = 1, c = d = 0 (the two objects disagree on every bit) versus a = b = 0, c = d = 1 (the two objects agree on every bit).
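The limit-case test can be sketched as follows. The text does not give the exact decision rule, so this example assumes that an expression scoring higher when the objects agree is a similarity and one scoring lower is a distance. The function name classify and the UNDETERMINED result are hypothetical, and eval() merely stands in for the real expression evaluator.

```python
def classify(expr):
    """Label an expression DISTANCE or SIMILARITY from two limit cases."""

    def value(a, b, c, d):
        env = {"a": a, "b": b, "c": c, "d": d}
        try:
            # eval() keeps the sketch short; no builtins are exposed to it.
            return float(eval(expr, {"__builtins__": {}}, env))
        except ZeroDivisionError:
            return float("nan")

    disagree = value(1, 1, 0, 0)   # a = b = 1, c = d = 0
    agree = value(0, 0, 1, 1)      # a = b = 0, c = d = 1

    if agree > disagree:
        return "SIMILARITY"
    if agree < disagree:
        return "DISTANCE"
    # Equal or undefined values: the limit cases cannot decide.
    return "UNDETERMINED"


kind_fraction = classify("(a+b)/(a+b+c+d)")
kind_tanimoto = classify("c/(a+b+c)")
```

Under this assumption the expression (a+b)/(a+b+c+d) evaluates to 1 when the objects disagree and 0 when they agree, so it is labeled a distance. An expression that is undefined or equal at both limit cases would need an explicit -comparison setting.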
$ nearneighbors -expr FORBES wdi.fp
nearneighbors: expression (FORBES) identified as a SIMILARITY
nearneighbors: reading input file (wdi.fp)
...
$ nearneighbors -expr "(a+b)/(a+b+c+d)" wdi.fp
nearneighbors: expression ((a+b)/(a+b+c+d)) identified as a DISTANCE
...
This facility is useful as input to "Russian Doll" (MacCuish, MUG03), and may also be useful as a tool for exploring "Data Fusion" (Ginn).