Where Did Stigmata Come From and What Exactly Does It Do?

Below is a quick introduction to the program. For a more thorough treatment, the most useful document is the paper:

N.E. Shemetulskis, D. Weininger, C.J. Blankley, J.J. Yang, and C. Humblet, "Stigmata: An Algorithm To Determine Structural Commonalities in Diverse Datasets", Journal of Chemical Information and Computer Sciences, 36(4), 1996, 862-871.


Stigmata was motivated by the following structural diversity problem: can one write an algorithm that will search through a structurally diverse collection of compounds and find common connectivity paths that exist in either the whole collection or some fraction of it? The need for such a program can arise from the results of a high-throughput screen, from an analysis of known ligands for a particular receptor, or from any set of compounds that share a similar functional activity (receptor/ligand binding, toxicity, etc.), where it would be interesting to know whether that activity correlates across the set with 2-dimensional structural commonalities.

Key Features of the Algorithm

Stigmata has three key features. The first is flexibility in the commonality search, based on a user-defined threshold. The second is the color mapping of the results onto the chemical structures. The third is the capability to search a secondary database using the commonalities found in the analyzed collection.

The flexibility feature is driven by a user-defined threshold value, which determines what fraction of the structures must contain a feature for it to count as one of the common features of the set. For a highly structurally diverse collection, a low threshold value (e.g. 0.5) is appropriate, since it gives the algorithm more freedom to find common features. With a threshold of 0.5, a feature is considered common if it exists in at least half of the dataset. The commonalities are accumulated into a single representation called a modal fingerprint. The modal fingerprint is then used both to generate similarity metrics for each structure in the dataset and for the database searching option.
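The thresholding step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Stigmata's actual implementation: each fingerprint is modeled as a set of "on" bit indices, and the names modal_fingerprint and toy_fps are hypothetical.

```python
from collections import Counter


def modal_fingerprint(fingerprints, threshold):
    """Return the bits that are set in at least `threshold` (a fraction
    between 0 and 1) of the input fingerprints.

    Assumption: a fingerprint is a set of "on" bit indices. The real
    program works on Daylight fingerprints; this is only a toy model.
    """
    if not fingerprints:
        return set()
    # Count how many structures set each bit.
    counts = Counter(bit for fp in fingerprints for bit in set(fp))
    needed = threshold * len(fingerprints)
    return {bit for bit, n in counts.items() if n >= needed}


# Four hypothetical structures, described by the bits they set.
toy_fps = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {3, 6}]

modal_half = modal_fingerprint(toy_fps, 0.5)  # bits in at least half the set
modal_all = modal_fingerprint(toy_fps, 1.0)   # bits in every structure
```

In this toy dataset, lowering the threshold from 1.0 to 0.5 grows the modal fingerprint from a single bit to three, which is the extra freedom the threshold is meant to provide for diverse collections.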

Each structure in the analyzed dataset can be viewed with the visualization routine, xvstigmata. Each atom in each structure is given a color based on its Stigmata-generated atom score. The scores range from zero to one, and the colors follow a temperature scale in which zero is red and one is white. The full mapping runs red-orange-yellow-green-blue-white. Atoms that are part of many of the common paths of the set appear in green, blue, or white; those not in any common path appear in red. The xvstigmata display also shows two similarity metrics, MSIM and MODP. MSIM is the Tanimoto similarity between the molecular fingerprint of the structure and the modal fingerprint. MODP also ranges from zero to one and represents the fraction of the bits set in the modal fingerprint that are also set in the structure's fingerprint. Together, these two values show how much of the commonality found in the dataset is also present in the structure, and how many novel features the structure contains. The visualization, coupled with these two metrics, provides a means for quick structural analysis of the input dataset.
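As a rough illustration of the two metrics and the color scale, the sketch below again models fingerprints as sets of "on" bit indices. The function names are hypothetical, and the equal-width binning of scores into the six colors is an assumption: the text gives only the order of the colors and the two endpoints, not how xvstigmata divides the range.

```python
def msim(fp, modal):
    """Tanimoto similarity: shared bits divided by bits set in either."""
    union = fp | modal
    return len(fp & modal) / len(union) if union else 0.0


def modp(fp, modal):
    """Fraction of the modal fingerprint's bits also set in fp."""
    return len(fp & modal) / len(modal) if modal else 0.0


# Color order taken from the text; the equal-width bins are an assumption.
COLORS = ["red", "orange", "yellow", "green", "blue", "white"]


def atom_color(score):
    """Map an atom score in [0, 1] to a color name (assumed binning)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("atom score must lie between 0 and 1")
    return COLORS[min(int(score * len(COLORS)), len(COLORS) - 1)]


modal = {2, 3, 4}
structure = {2, 3, 4, 7, 8}  # hypothetical structure fingerprint

structure_msim = msim(structure, modal)  # 3 shared / 5 in union
structure_modp = modp(structure, modal)  # 3 shared / 3 modal bits
```

Here the hypothetical structure has a MODP of 1.0 but an MSIM of only 0.6: it contains every common feature and also sets bits that the modal fingerprint lacks, which is how the pair of metrics separates shared features from novel ones.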

The third main feature enables the modal fingerprint to be used as a database query. The top hits from the search (determined from their MODP and MSIM scores) can be visualized with xvstigmata. This feature is useful for finding database structures of potential interest that contain the common features along with some unique features of their own. Such structures may prove interesting for idea generation, further screening, etc.
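A database search of this kind might be sketched as follows. The text does not say how Stigmata combines MODP and MSIM when picking the top hits, so the ordering used here (MODP first, MSIM as tie-breaker) is purely an assumption, as are the name rank_hits, the toy database, and the set-of-bits fingerprint model.

```python
def rank_hits(database, modal, top_n=10):
    """Score every database entry against the modal fingerprint and
    return the best `top_n` as (name, modp, msim) tuples.

    Assumption: hits are ordered by MODP, then MSIM, both descending.
    The real program's ranking rule may differ.
    """
    scored = []
    for name, fp in database.items():
        common = len(fp & modal)
        union = len(fp | modal)
        modp = common / len(modal) if modal else 0.0
        msim = common / union if union else 0.0
        scored.append((name, modp, msim))
    # Sort by descending MODP, then descending MSIM, then name for stability.
    scored.sort(key=lambda hit: (-hit[1], -hit[2], hit[0]))
    return scored[:top_n]


# Hypothetical secondary database: name -> set of "on" bit indices.
toy_db = {
    "cmpd_a": {2, 3, 4},
    "cmpd_b": {2, 3, 4, 7, 8},
    "cmpd_c": {3, 9},
    "cmpd_d": {10, 11},
}

hits = rank_hits(toy_db, modal={2, 3, 4}, top_n=3)
hit_names = [name for name, _, _ in hits]
```

In this toy search, cmpd_b is the kind of hit the text describes: it matches the whole modal fingerprint (MODP of 1.0) while its lower MSIM signals additional features not shared by the original collection.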

Running Stigmata
