This program takes a class descriptor on stdin and looks up all members of that class in a merlin pool derived from the thor database named on the command line.

Each class member is then used in turn as a target, and the database is sorted by similarity to that target.

For each target, the rank sum of all the other members of the class is calculated. Using a Mann-Whitney statistic, a test is carried out for each target in turn to see whether it retrieves the class from the population in the same way that the original classification did.

The test is one-sided, as we are testing not only that the two groups are separated but also that the separation is towards the top of the ranked list. Ranks start at 0, which is the target itself. z-scores in excess of 1.65 are significant at the 95% level and z-scores in excess of 2.33 are significant at the 99% level.
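
The ranking step described above can be sketched in Python. This is only an illustration, not the program's own code: fingerprints are modelled as sets of "on" bit positions, the names `tanimoto`, `rank_sum`, `pool` and `members` are hypothetical, and the handling of tied similarities (input order is kept) is an assumption.

```python
from typing import Dict, FrozenSet, Set


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def rank_sum(target: str, members: Set[str], pool: Dict[str, FrozenSet[int]]) -> int:
    """Sum of the ranks of the other class members for one target."""
    target_fp = pool[target]
    others = [name for name in pool if name != target]
    # Most similar first. The sort is stable, so ties keep input order
    # (an assumption: the program's tie handling is not documented here).
    others.sort(key=lambda name: tanimoto(target_fp, pool[name]), reverse=True)
    ranked = [target] + others  # the target itself takes rank 0
    return sum(rank for rank, name in enumerate(ranked)
               if name in members and name != target)


# Hypothetical five-compound pool; "a", "b" and "e" form the class.
pool = {
    "a": frozenset({1, 2, 3, 4}),
    "b": frozenset({1, 2, 3, 5}),
    "c": frozenset({1, 2, 6, 7}),
    "d": frozenset({8, 9}),
    "e": frozenset({1, 2, 3, 4, 5}),
}
members = {"a", "b", "e"}

for target in sorted(members):
    print(target, rank_sum(target, members, pool))
```

A low rank sum means the other class members sit near the top of the list sorted on that target, which is what the one-sided test rewards.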

``` Formulae
For each of m targets in a population of size N
N = m + n

Tm = targets[i].rank_sum

Mann-Whitney        U = m*n + m*(m+1)/2 - Tm

Mean               mu = m*n / 2

Standard error  sigma = sqrt( m*n*(N + 1)/12 )

z-score = ( U - mu ) / sigma
=( ( m*n + m*(m+1))/2 - Tm ) / sigma

Let
const = ( m*n + m*(m+1))/2
= m*(N+1)/2

For the purists, this is the formulation of the Wilcoxon
rank-sum test: const is the expected (mean) rank sum of a
random sample of size m.

So for each target in class
z-score  = ( const - Tm ) /sigma

As this is a one-tailed test the critical values are
1.65 ( alpha = 0.05 ) and 2.33 ( alpha = 0.01 )

A z value which exceeds the critical value indicates that,
with the given level of confidence, the target is separating
the population into two classes.

Program zeemer.

Calling syntax: $ zeemer [options] databasename

Options:-
-tag  <tag>
the thor database tag of the datatype containing the activity/class
-field <field>
the field of the activity/class datatype to use
Default tag is "AC" and default field is "1", i.e. the first.
-coeff <coefficient>
the type of similarity measure to be used.
Valid values are TANIMOTO, EUCLID or TVERSKY.
Default is TANIMOTO.
-alpha <value>
Tversky alpha coefficient. Must be in the range 0.0 to 1.0.
-beta <value>
Tversky beta coefficient. Must be in the range 0.0 to 1.0.
-tdt
prints the output in tdt format. Default is a table.
$SMI<SMILES>
RANK<target;no_of_hits;sum of ranks;z-score>
|
-zscore <value>
Critical value of z-score used to filter compounds. Useful values are
1.65 and 2.33. If the value is exceeded, you can be 95% or 99% sure
(respectively) that the separation into two classes has not occurred
by chance. When zscore is non-zero the program acts as a simple filter
and prints out a file of SMILES.
prints a header on the csv file. Default is to omit it.
-sort
sorts the output by the rank sum. Default is to leave it in input order.
Unsorted is useful for comparing data from different measures as the
output can be joined. See join().
Sorting is not useful for tdt output.
-dos
produces output with CR/LF at the end of each data line to facilitate
transfer to a Windows statistics/graphics package.
Database specification format is: database%basepw@host:service:user%user-hostpw
The database name must be specified (other fields are optional).

```
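
As a worked illustration of the formulae above, the following Python sketch computes the z-score from a rank sum `Tm`, using `const = m*(N+1)/2` and `sigma = sqrt(m*n*(N+1)/12)`. The function names `z_score` and `significance` are hypothetical, and whether the program counts the target itself in `m` is not stated here, so `m` and `n` are simply taken as inputs.

```python
import math
from typing import Optional


def z_score(rank_sum: float, m: int, n: int) -> float:
    """z-score for one target, following the formulae above."""
    total = m + n                                  # N = m + n
    const = m * (total + 1) / 2                    # (m*n + m*(m+1)) / 2
    sigma = math.sqrt(m * n * (total + 1) / 12)    # standard error
    return (const - rank_sum) / sigma


def significance(z: float) -> Optional[str]:
    """One-tailed significance level reached by z, or None if below 1.65."""
    if z > 2.33:
        return "99%"
    if z > 1.65:
        return "95%"
    return None


# Hypothetical case: 5 class members in a population of 100, rank sum 15.
z = z_score(15, m=5, n=95)
print(round(z, 2), significance(z))
```

A rank sum equal to `const` gives z = 0, meaning the class members are spread through the ranked list no better than a random sample would be.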

The program reads an activity type on stdin and, unless the -tdt option is set, returns a comma-delimited table. Each row gives a SMILES, the sum of the ranks of the other compounds in the same class, and the z-score.
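
For downstream processing of the -tdt output, a reader along the following lines may be enough. It is a minimal sketch that assumes one datatype per line in the `$SMI<...>` / `RANK<target;no_of_hits;sum of ranks;z-score>` / `|` layout shown in the option help. The names `RankRecord` and `parse_tdt` and the sample records are hypothetical, not real program output.

```python
import re
from typing import Dict, Iterator, NamedTuple


class RankRecord(NamedTuple):
    smiles: str
    target: str
    hits: int
    rank_sum: float
    z_score: float


# One datatype per line: TAG<value>
_ITEM = re.compile(r"^(\$?\w+)<(.*)>$")


def parse_tdt(text: str) -> Iterator[RankRecord]:
    """Yield one RankRecord per '|'-terminated record that has $SMI and RANK."""
    current: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line == "|":  # end of record
            if "$SMI" in current and "RANK" in current:
                target, hits, rank_sum, z = current["RANK"].split(";")
                yield RankRecord(current["$SMI"], target, int(hits),
                                 float(rank_sum), float(z))
            current = {}
            continue
        match = _ITEM.match(line)
        if match:
            current[match.group(1)] = match.group(2)


# Made-up sample in the documented layout, for illustration only.
sample = """$SMI<CCO>
RANK<1;3;12;2.51>
|
$SMI<c1ccccc1>
RANK<2;3;40;0.35>
|
"""

significant = [r.smiles for r in parse_tdt(sample) if r.z_score > 1.65]
print(significant)
```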