CLOGP3 was written in 1983-1984 at Pomona College by
Drs. Leo (Pomona College), Weininger, and Veith (USEPA) to provide
a formal method for the rationalization of hydrophobicity.
CLOGP3 is built in the form of a non-automatic learning machine,
i.e., its calculations are entirely based on extraction of empirical
properties from observations.
CLOGP3 is not designed to produce the minimal average residual.
The strategy of CLOGP3 is to formalize Dr. Leo's intuition.
Although it is based on a fragment-additive model,
in practice this is little more than a framework upon which
formalized generalizations about interactions can be applied.
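A fragment-additive model of this sort can be sketched in a few lines. The fragment names, values, and correction below are purely illustrative and are not CLOGP3's actual constants:

```python
# Toy fragment-additive logP model: sum per-fragment contributions,
# then layer interfragment corrections on top of the additive framework.
# All values here are invented for illustration.

FRAGMENTS = {"CH3": 0.89, "CH2": 0.66, "OH": -1.64}

# Hypothetical correction for two polar groups in proximity
CORRECTIONS = {("OH", "OH"): -0.40}

def clogp_estimate(fragments):
    """Sum fragment values, then apply pairwise corrections."""
    total = sum(FRAGMENTS[f] for f in fragments)
    for i, a in enumerate(fragments):
        for b in fragments[i + 1:]:
            total += CORRECTIONS.get((a, b), CORRECTIONS.get((b, a), 0.0))
    return total

# n-propanol as CH3-CH2-CH2-OH
print(round(clogp_estimate(["CH3", "CH2", "CH2", "OH"]), 2))  # → 0.57
```

The additive sum is the "framework"; all of the interesting chemistry lives in the correction terms applied on top of it.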
CLOGP3's capabilities include: formalization of the fragment method,
canonical fragment generation, an extended electronic (sigma-rho)
method, and support for a host of submethods describing fragment
interactions.
CLOGP3 was written in FORTRAN-77 under Unix Version 7
using the `f77' compiler.
It was then ported to VMS and PR1MOS and widely distributed.
New fragment knowledge can be incorporated into CLOGP3
without divulging the original measured structures.
This feature allows proprietary measurements to contribute to the
machine, and led to a large and rapid increase in the knowledge base.
Since 1987, the CLOGP3 algorithm has remained nearly unchanged,
although the knowledge base (fragdb) has steadily increased
in size and additional interfragment corrections have been added.
Daylight has delivered batch and X-versions of CLOGP3
(clogp, xvpcmodels) since 1990.
An object-level interface (clogptalk)
and CGI interfaces (daycgi/clogp, daycgi/medchem)
were added to the lineup in 1995 and 1996.
CLOGP3 was ported to PC's and Mac's by the MedChem Project
at Pomona College.
In 1996, BioByte took over development and support of CLOGP3
Today (1998), Daylight and BioByte have "synched" versions in
preparation for further development.
The synchronized version of CLOGP3 will be provided with
Daylight Release 4.61.
CLOGP3 -- computational advantages
Availability -- currently available for Macintosh, Windows and
Unix architectures, in a number of environments (local/web interactive,
batch).
Reliability -- very large batch runs have been done regularly
for a decade, updates are "plug and play".
Extensibility -- currently handles more than triple the number
of fragments and corrections of the original, with the same underlying
algorithm.
Speed -- slow to initialize the algorithm (1-10 s),
then fast to process structures (10-100 ms).
Robustness -- computation is generally not sensitive to
individual measurements, because intermediate properties are generally
overdetermined by observations (Example 3).
Longevity -- survived ~15 years (so far) with no major
algorithmic changes and no reimplementation.
Utility -- thousands of QSARs have been published which
use CLOGP3 results as parameters.
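The robustness point above follows from simple linear algebra: when the same fragment appears in many measured compounds, its value is effectively a least-squares solution of an overdetermined system, so one noisy measurement barely moves it. A minimal sketch with invented numbers:

```python
# Three measured compounds, two unknown fragment values (CH2, OH).
# One measurement is deliberately perturbed; the least-squares fit
# still lands close to the "true" values 0.66 and -1.64.

# Rows: (count of CH2, count of OH) per measured compound
A = [(1, 1), (2, 1), (3, 1)]
# Measured logP values (first one perturbed by +0.03)
y = [-0.95, -0.35, 0.34]

# Solve the 2x2 normal equations (A^T A) x = A^T y directly
ata = [[sum(r[i] * r[j] for r in A) for j in range(2)] for i in range(2)]
aty = [sum(r[i] * v for r, v in zip(A, y)) for i in range(2)]
det = ata[0][0] * ata[1][1] - ata[0][1] * ata[1][0]
f_ch2 = (ata[1][1] * aty[0] - ata[0][1] * aty[1]) / det
f_oh = (ata[0][0] * aty[1] - ata[1][0] * aty[0]) / det

print(round(f_ch2, 3), round(f_oh, 3))  # close to 0.66 and -1.64
```

With thousands of observations per common fragment, the effect of any single measurement is diluted even further.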
CLOGP3 -- computational disadvantages
CLOGP3 is not a predictive program by design, e.g., no attempt
is made to minimize the average residual. The goals of rationalization
and prediction often conflict: in CLOGP3, rationalization always wins.
Fortran-77 implementation is getting long in the tooth.
It's getting difficult to obtain good Fortran development environments
and good Fortran programmers.
CLOGP3 is written in a modular, but non-object-oriented, 1980's
code style. The algorithms used in CLOGP3 (e.g., graph canonicalization,
lexical processing) suffer from this heavy-handed software architecture
(especially in a "formula translation" language!). One result
is that most recent improvements have not taken advantage of the
underlying design philosophy.
The CLOGP3 implementation is very output-oriented, i.e.,
the internal computations are responsible for producing results for
a particular (and primitive) human interface. For such a kludge,
this actually has worked pretty well (for over a decade!), but has
stymied attempts to deliver such results via more modern interfaces.
CLOGP3 doesn't produce results for structural classes with no
observations. This is largely due to the nature of the algorithm.
clogp4 design goals
Retain all advantages of CLOGP3.
Availability: native Unix, Windows, and Macintosh versions
Extensibility: data-driven algorithm using an extended
fragment database
Speed: retain/regain absolute speed of original CLOGP3 algorithm
Robustness: retain observation-based fragment methodology
Longevity: avoid transient technologies except in transient
code (i.e., applications)
Utility: only time will tell
Divorce computational code from input/output.
The oopish programming model implemented by the Daylight toolkits
provides all required critical capabilities. But perhaps we can
go farther than that ...
We must continue to provide clogp access from Fortran, C and C++.
Beyond that, we must be prepared to support access from higher
levels (e.g., scripting languages like perl) and lower/deeper
interface levels (e.g., Java Beans).
Provide methods for
both rationalization and prediction
of logPo/w. The rationalization capability of CLOGP3 is what
makes it possible to extend its range far beyond what was
put into the program originally: we are not only able to make
predictions about structural classes that we know about, but
also to find out when we don't know something.
However that may be, the fact of the matter is that many (most?)
people need a true logP prediction method, i.e., knowing only
what we know right now, what is the best guess for this new
structure? These goals may not be as incompatible as we think.
Create a modern implementation of the clogp algorithm.
It will be object-oriented, data-driven, and re-entrant.
Primary clogp4 objects will be the fragment database
and its children: constants, fragments, environments, etc.
clogp4 will use the Daylight oopish toolkits and will emerge as
an oopish toolkit itself.
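The object hierarchy described above might look roughly like the following. Class and attribute names here are hypothetical, chosen only to illustrate the fragment-database-and-children structure, and are not the actual clogp4 toolkit API:

```python
# Hypothetical sketch of the clogp4 primary objects: a fragment
# database with constants, fragments, and environments as children.
from dataclasses import dataclass, field

@dataclass
class Fragment:
    pattern: str       # substructure pattern the fragment matches
    value: float       # additive logP contribution

@dataclass
class FragmentDatabase:
    constants: dict = field(default_factory=dict)     # global constants
    fragments: list = field(default_factory=list)     # Fragment entries
    environments: dict = field(default_factory=dict)  # context corrections

    def add(self, frag):
        self.fragments.append(frag)

db = FragmentDatabase()
db.add(Fragment("[CH3]", 0.89))
print(len(db.fragments))  # → 1
```

Because the objects are plain data, several databases (several "algorithms") can coexist in one process, which is what re-entrancy requires.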
Create a clogp4 fragment database which emulates the CLOGP3
algorithm exactly. clogp4 will be capable of computing the
results of many clogp "algorithms" ... by guaranteeing that the
CLOGP3 algorithm is one of them, we provide backwards compatibility
(In fact, there is no obvious reason that the CLOGP3 database format
should not be directly acceptable to clogp4.)
At this point we can produce production applications which use the
clogp4 toolkit and call the job "finished". Aside from documentation,
filling in new types of annotations and implementing Al's most recent
craze in logP, we'll have a refreshed clogp which is both backwards
compatible and supportable for another decade. (But it's unlikely
to work much better than CLOGP3.)
Create a clogp4 fragment database which implements a predictive
clogp algorithm (i.e., minimizes the average residual). This is
simple to state and theoretically feasible (we now have about 10,000
reliable logpstar data points to work with). However, it's a fairly hairy
non-linear optimization problem. The first line of attack will be
to implement a GA with populations of fragment databases operating
on partial logpstar datasets, evolving to minimize the computational
residual. [Granted, the GA is my personal favorite method for large
nonlinear optimizations; other methods such as SA or NN might be more
appropriate. The proof, as they say, is in the pudding.]
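The proposed GA can be sketched in miniature: a population of candidate fragment "databases" (here just two fragment values) evolves by selection, crossover, and mutation to minimize the residual against a logpstar-style dataset. All data and GA parameters below are synthetic and illustrative:

```python
# Toy GA over fragment databases: minimize mean squared residual
# against synthetic logP observations. Not the production method.
import random

random.seed(7)

# Synthetic observations: (CH2 count, OH count) -> measured logP,
# generated from "true" values 0.66 and -1.64 plus small noise.
DATA = [((n, 1), 0.66 * n - 1.64 + random.gauss(0, 0.02))
        for n in range(1, 9)]

def residual(db):
    f_ch2, f_oh = db
    return sum((f_ch2 * n + f_oh * m - obs) ** 2
               for (n, m), obs in DATA) / len(DATA)

def crossover(a, b):
    # Uniform crossover: each gene taken from either parent
    return tuple(random.choice(pair) for pair in zip(a, b))

def mutate(db, scale=0.05):
    return tuple(v + random.gauss(0, scale) for v in db)

# Random initial population of candidate fragment databases
pop = [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(30)]
for gen in range(200):
    pop.sort(key=residual)
    survivors = pop[:10]                       # truncation selection
    pop = survivors + [mutate(crossover(*random.sample(survivors, 2)))
                       for _ in range(20)]     # elitist replacement

best = min(pop, key=residual)
print(tuple(round(v, 3) for v in best))  # should land near (0.66, -1.64)
```

The real problem replaces two genes with a full fragment database and the synthetic data with partial logpstar datasets, but the loop structure is the same.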
The next (optional) step will be to analyze successful fragment databases
in terms of the semantics of the adjustable variables. This won't
directly improve predictive performance, but it may teach us something
about hydrophobic interactions.
Deliver a single clogp4 core algorithm with two
separate fragment databases
which instantiate clogp4 algorithms for rationalization and prediction.
Reuse code in the oopish toolkits.
Improve reliability, e.g., use of dt_cansmiles().
Simplify implementation, e.g., use of dt_match().
Provide a tool, not a separate universe
Create a clogp4 toolkit.
Provide platform independence (Unix, Windows, Macintosh, JVM)