Tie handling in Jarvis-Patrick

Jarvis-Patrick clustering is based on the premise that items with shared neighbors must also be close to one another. The method, as described in the original reference (Jarvis), is:

It is a non-parametric method; the clustering operates on the lists of shared neighbors. Provided that one can generate lists of near neighbors for a given set of items, they can be clustered. Generating the near-neighbor lists is O(n^2) and is typically the slow step. In our world, we run Tanimoto similarity searches to generate the near-neighbor lists. The method is fast for large datasets and has been applied in numerous ways to clustering and compound selection. It does have some peculiarities in its behavior:
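To make the shared-neighbor idea concrete, here is a minimal sketch in Python. It is not the production implementation: fingerprints are modeled as sets of on-bit indices, and the names `tanimoto`, `near_neighbors`, `jarvis_patrick`, `list_len`, and `min_shared` are illustrative assumptions. It uses one common formulation of the rule (two items merge when each appears in the other's neighbor list and they share at least `min_shared` neighbors); details such as whether an item counts as its own neighbor vary between implementations.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def near_neighbors(fps, list_len):
    """Brute-force O(n^2) search: the list_len most similar items for each item."""
    lists = []
    for i, fp in enumerate(fps):
        scored = [(tanimoto(fp, other), k) for k, other in enumerate(fps) if k != i]
        # Equal similarities are ordered by index here; that is an arbitrary
        # choice, and exactly the kind of tie this note is concerned with.
        scored.sort(key=lambda t: (-t[0], t[1]))
        lists.append([k for _, k in scored[:list_len]])
    return lists


def jarvis_patrick(neighbors, min_shared):
    """Cluster items from their neighbor lists alone (no similarities needed)."""
    parent = list(range(len(neighbors)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    nsets = [set(n) for n in neighbors]
    for a, b in combinations(range(len(neighbors)), 2):
        mutual = a in nsets[b] and b in nsets[a]
        if mutual and len(nsets[a] & nsets[b]) >= min_shared:
            parent[find(a)] = find(b)

    clusters = {}
    for i in range(len(neighbors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy data: two groups of three fingerprints that overlap only within a group.
fps = [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {10, 11, 12}, {10, 11, 13}, {10, 11, 14}]
nn = near_neighbors(fps, list_len=2)
print(jarvis_patrick(nn, min_shared=1))
```

Note that `jarvis_patrick` never looks at the similarity values, only at list membership. That is why anything that changes which items make it into a neighbor list can change the clusters.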

Fingerprints are an information-reduced representation of structural information. It is not rare for different structures to produce the same fingerprint (path length, repeated structural units). Furthermore, fingerprint generation parameters (folding, size) will result in collisions between fingerprints. On top of this comes the occurrence of ties in proximity between items within the dataset.
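The effect of ties on a fixed-length neighbor list can be sketched as follows. Items with identical fingerprints have identical similarity to any query, so they always tie; when a tie straddles the end of the list, a fixed-length cutoff has to pick among equals. The function name `top_neighbors` and the `keep_ties` flag are hypothetical, and this shows only one plausible way to keep all ties, not the behavior of any particular program option.

```python
def top_neighbors(scored, list_len, keep_ties=False):
    """Return ids of the nearest neighbors from (similarity, item_id) pairs.

    With keep_ties=False the list is cut at exactly list_len entries, so the
    choice among equally similar items depends on the sort order (here: by id).
    With keep_ties=True every item tied with the last kept entry is included,
    so the list may be longer than list_len.
    """
    if list_len <= 0:
        return []
    ranked = sorted(scored, key=lambda t: (-t[0], t[1]))
    if len(ranked) <= list_len:
        return [item for _, item in ranked]
    if not keep_ties:
        return [item for _, item in ranked[:list_len]]
    cutoff = ranked[list_len - 1][0]
    return [item for sim, item in ranked if sim >= cutoff]


# Three items tie at 0.80, but only one slot is left after "a".
scored = [(0.90, "a"), (0.80, "b"), (0.80, "c"), (0.80, "d"), (0.50, "e")]
print(top_neighbors(scored, 2))                  # ['a', 'b']
print(top_neighbors(scored, 2, keep_ties=True))  # ['a', 'b', 'c', 'd']
```

In the truncated case, "c" and "d" are dropped only because of the ordering of their ids, even though they are exactly as close to the query as "b".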

Database  Unique SMILES (#1)  Unique FPS (2048 bits) (#2)  Unique FPS (folded) (#3)
WDI034    63009               58786                        57870
ACD033    167768              153559                       149949
SPRESI00  3082319             2869375                      2795899

1. thorlist dbname | grep -c "^FP<"
2. thorlist dbname | fingerprint -z -b 2048 -c 2048 | \
     grep "^FP<" | sort -u | wc
3. thorlist dbname | grep "^FP<" | cut -d ';' -f1 | \
     sort -u | wc

J/K, version, options          Avg cluster size  # Clusters  # Singletons  Ten largest cluster sizes
10/16, v4.8                    8.6               15884       31181         132,111,104,102,102,99,99,93,90,89
10/15, v4.8                    7.1               18376       37088         89,76,73,72,67,66,65,64,63,63
10/16, v4.9                    9.2               15025       29138         211,164,150,145,133,128,128,126,116,111
10/15, v4.9                    7.7               17211       34373         111,99,92,83,83,81,81,80,80,77
10/16, v4.9, -JP_USE_ALL_TIES  9.6               14529       28227         276,261,168,159,158,157,156,150,144,142
10/15, v4.9, -JP_USE_ALL_TIES  8.2               16517       33065         215,144,136,120,111,103,102,101,99,96
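
The columns in the results table above can be reproduced from a list of cluster sizes with a few lines of Python. This is a sketch under one assumption that the source does not state: the average cluster size is taken over non-singleton clusters only. The function name `summarize` is hypothetical. Under that assumption, average size times number of clusters plus singletons for the first row (8.6 x 15884 + 31181) comes to roughly 167,800, which is close to the ACD033 SMILES count in the earlier table, although the source does not say which database was clustered.

```python
def summarize(cluster_sizes, top=10):
    """Summarize a clustering given the size of every cluster, singletons included."""
    # Assumption: singletons are reported separately and excluded from the average.
    multi = sorted((s for s in cluster_sizes if s > 1), reverse=True)
    singletons = sum(1 for s in cluster_sizes if s == 1)
    avg = sum(multi) / len(multi) if multi else 0.0
    return {
        "avg_size": round(avg, 1),
        "clusters": len(multi),
        "singletons": singletons,
        "largest": multi[:top],
    }


print(summarize([5, 3, 1, 1, 2, 1]))
```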