MUG '98 - Weininger

clogp4

Dave Weininger
Daylight CIS

CLOGP3 Synopsis

CLOGP3 was written in 1983-1984 at Pomona College to provide a formal method for rationalization of hydrophobicity. Contributors: Drs. Leo (Pomona College), Weininger and Veith (USEPA).
CLOGP3 is built in the form of a non-automatic learning machine, i.e., its calculations are entirely based on extraction of empirical properties from observations. CLOGP3 is not designed to produce the minimal average residual.
The strategy of CLOGP3 is to formalize Dr. Leo's intuition. Although it is based on a fragment-based additive model, in practice this is little more than a framework to which formalized generalizations about interactions can be applied.
CLOGP3's capabilities include: formalization of the fragment method, canonical fragment generation, an extended electronic (sigma-rho) method, and support for a host of submethods describing fragment interactions.
CLOGP3 was written in FORTRAN-77 under Unix Version 7 using the `f77' compiler. It was then ported to VMS and PR1MOS and widely distributed.
New fragment knowledge can be incorporated into CLOGP3 without divulging the original measured structures. (Example 1, Example 2) This feature allows proprietary measurements to contribute to the machine, and led to a large and rapid increase in the knowledge base.
Since 1987, the CLOGP3 algorithm has remained nearly unchanged, although the knowledge base (fragdb) has steadily increased in size and additional interfragment corrections have been added.
Daylight has delivered batch and X-versions of CLOGP3 (clogp, xvpcmodels) since 1990. An object-level interface (clogptalk) and CGI interfaces (daycgi/clogp, daycgi/medchem) were added to the lineup in 1995 and 1996.
CLOGP3 was ported to PCs and Macs by the MedChem Project at Pomona College.
In 1996, BioByte took over development and support of CLOGP3.
Today (1998), Daylight and BioByte have "synched" versions in preparation for further development. The synchronized version of CLOGP3 will be provided with Daylight Release 4.61.
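As a rough illustration of the fragment-based additive framework mentioned in the synopsis above, the sketch below sums per-fragment values and then applies interaction corrections on top. It is a minimal sketch, not the CLOGP3 implementation: the function name `additive_logp`, the fragment labels and every number are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def additive_logp(fragment_values, fragment_counts, corrections):
    """Sum count * value over fragments, then add interaction corrections."""
    base = sum(fragment_values[name] * n for name, n in fragment_counts.items())
    return base + sum(corrections)


# Toy numbers for illustration only; these are NOT real CLOGP3 constants.
fragment_values = {"A": 1.0, "B": -0.5}   # hypothetical fragment -> value
fragment_counts = {"A": 2, "B": 1}        # occurrences in one structure
corrections = [0.25]                      # one hypothetical interaction term

print(additive_logp(fragment_values, fragment_counts, corrections))  # 1.75
```

In this picture the fragment sum is only the scaffold; the corrections list is where generalizations about interactions would enter.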

CLOGP3 -- computational advantages

Availability -- currently available for Macintosh, Windows and Unix architectures, in a number of environments (local/web interactive, local/remote batch).
Reliability -- very large batch runs have been done regularly for a decade; updates are "plug and play".
Extensibility -- currently handles more than three times as many fragments and corrections as the original, with the same underlying code.
Speed -- slow to initialize the algorithm (1-10 s), then fast to process structures (10-100 ms each).
Robustness -- computation is generally not sensitive to individual measurements in that intermediate properties are generally overdetermined by observations. Example 3
Longevity -- survived ~15 years (so far) with no major algorithmic changes and no reimplementation of the algorithm.
Utility -- literally thousands of QSARs have been published which use CLOGP3 results as parameters.

CLOGP3 -- computational disadvantages

CLOGP3 is not a predictive program by design, e.g., no attempt is made to minimize the average residual. The goals of rationalization and prediction often conflict: in CLOGP3, rationalization always wins.
Fortran-77 implementation is getting long in the tooth. It's getting difficult to obtain good Fortran development environments and good Fortran programmers.
CLOGP3 is written in a modular, but non-object-oriented, 1980s code style. The algorithms used in CLOGP3 (e.g., graph canonicalization, lexical processing) suffer from this heavy-handed software architecture (especially using a "formula translation" language!). One result of this is that most recent improvements have not taken advantage of the underlying design philosophy.
The CLOGP3 implementation is very output-oriented, i.e., the internal computations are responsible for producing results for a particular (and primitive) human interface. For such a kludge, this actually has worked pretty well (for over a decade!), but has stymied attempts to deliver such results via more modern interfaces.
CLOGP3 doesn't produce results for structural classes with no observations. This is largely due to the nature of the algorithm.

clogp4 design goals

Retain all advantages of CLOGP3.
- Availability: native Unix, Windows, and Macintosh versions
- Extensibility: data-driven algorithm using extended fragment database
- Speed: retain/regain absolute speed of original CLOGP3 algorithm
- Robustness: retain observation-based fragment methodology
- Longevity: avoid transient technologies except in transient code (i.e., applications)
- Utility: only time will tell
Divorce computational code from input/output. The oopish programming model implemented by the Daylight toolkits provides all required critical capabilities. But perhaps we can go farther than that ...
Language independence. We must continue to provide clogp access from Fortran, C and C++. Beyond that, we must be prepared to support access from higher levels (e.g., scripting languages like Perl) and lower/deeper interface levels (e.g., Java Beans).
Provide methods for both rationalization and prediction of logPo/w. The rationalization capability of CLOGP3 is what makes it possible to extend its range far beyond what was put into the program originally: we are not only able to make predictions about structural classes that we know about, but also find out when we don't know something. However that may be, the fact of the matter is that many (most?) people need a true logP prediction method, i.e., knowing only what we know right now, what is the best guess for this new structure? These goals may not be as incompatible as we once imagined.

clogp4 strategy

Create a modern implementation of the clogp algorithm. It will be object-oriented, data-driven, and re-entrant. Primary clogp4 objects will be the fragment database and its children: constants, fragments, environments, etc. clogp4 will use the Daylight oopish toolkits and will emerge as an oopish toolkit itself.
Create a clogp4 fragment database which emulates the CLOGP3 algorithm exactly. clogp4 will be capable of computing the results of many clogp "algorithms"; by guaranteeing that the CLOGP3 algorithm is one of them, we provide backwards compatibility. (In fact, there is no obvious reason that the CLOGP3 database format should not be directly acceptable to clogp4.)
At this point we can produce production applications which use the clogp4 toolkit and call the job "finished". Aside from documentation, filling in new types of annotations and implementing Al's most recent craze in logP, we'll have a refreshed clogp which is both backwards compatible and supportable for another decade. (But it's unlikely to work much better than CLOGP3.)
Create a clogp4 fragment database which implements a predictive clogp algorithm (i.e., minimizes the average residual). This is simple to state and theoretically feasible (we now have about 10,000 reliable logpstar data to work with). However, it's a fairly hairy non-linear optimization problem. The first line of attack will be to implement a genetic algorithm (GA) with populations of fragment databases operating on partial logpstar datasets, evolving to minimize the computational residual. [Granted, the GA is my personal favorite method for large nonlinear optimizations; other methods such as simulated annealing (SA) or neural networks (NN) might be more appropriate. The proof, as they say, is in the pudding.]
The next (optional) step will be to analyze successful fragment databases in terms of the semantics of the adjustable variables. This won't directly improve predictive performance, but it may teach us something about hydrophobic interactions.
Deliver a single clogp4 core algorithm with two separate fragment databases which instantiate clogp4 algorithms for rationalization and prediction.
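The genetic-algorithm idea in the strategy above (populations of fragment databases, scored on partial datasets, evolving toward a smaller residual) can be sketched as follows. This is a toy under stated assumptions, not the clogp4 design: a "fragment database" is reduced to a dict of fragment name to value, the model is purely additive with no corrections, the residual is taken to be the mean absolute difference between observed and calculated values, and the data are synthetic values generated from hidden toy constants rather than logpstar measurements. All function names are hypothetical.

```python
import random


def predict(fragdb, counts):
    """Additive estimate: sum of count * fragment value."""
    return sum(n * fragdb[name] for name, n in counts.items())


def mean_abs_residual(fragdb, dataset):
    """Average |observed - calculated| over (counts, observed) pairs."""
    return sum(abs(obs - predict(fragdb, c)) for c, obs in dataset) / len(dataset)


def mutate(fragdb, rng, scale=0.1):
    """Copy a database and perturb one randomly chosen fragment value."""
    child = dict(fragdb)
    name = rng.choice(sorted(child))
    child[name] += rng.gauss(0.0, scale)
    return child


def crossover(a, b, rng):
    """Take each fragment value from one parent or the other."""
    return {name: (a if rng.random() < 0.5 else b)[name] for name in a}


def evolve(dataset, names, generations=200, pop_size=30, seed=0):
    rng = random.Random(seed)
    population = [{n: rng.uniform(-1.0, 1.0) for n in names}
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Score every database on a random half of the data ("partial dataset").
        sample = rng.sample(dataset, k=max(1, len(dataset) // 2))
        population.sort(key=lambda db: mean_abs_residual(db, sample))
        survivors = population[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    # Report the database that does best on the full dataset.
    return min(population, key=lambda db: mean_abs_residual(db, dataset))


# Synthetic data from hidden toy constants; NOT real measurements.
true_db = {"X": 0.5, "Y": -0.25, "Z": 1.0}
structures = [{"X": 1}, {"Y": 2}, {"Z": 1},
              {"X": 1, "Y": 1}, {"X": 2, "Z": 1}, {"Y": 1, "Z": 2}]
dataset = [(s, predict(true_db, s)) for s in structures]

best = evolve(dataset, sorted(true_db))
print(best, mean_abs_residual(best, dataset))
```

A real fragment database has many more adjustable quantities (constants, environments, corrections), which is what makes the actual problem hard. The sketch only shows the shape of the loop: score, select, recombine, mutate.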

clogp4 tactics

Reuse code in the oopish toolkits.
- Improve reliability, e.g., use of dt_cansmiles()
- Simplify implementation, e.g., use of dt_match()
- Provide a tool, not a separate universe
Create a clogp4 toolkit.
- Provide platform independence (Unix, Windows, Macintosh, JVM)
- Improve stability (target: 10 years)
- Simplify algorithm management (fragdb object manipulation)
- Allow higher level processing (e.g., GA on fragdb populations)
Implement clogp4 as an "object property"-oriented method.
- Suitable for modern user interfaces, e.g., table object
- Detach computation from I/O environment
- Further simplify access
- Allow more sophisticated reporting (Example 4)
- Model for generic computational process
Implement clogp4 as a fully object-oriented, output-independent method.
- Instead of (e.g.) a table of contributions containing string values of table cells, a sequence of contribution objects (with fragment SMILES, etc.) is provided as a computational property.
- I/O issues disappear completely: there is no I/O
- Provides everything that object-oriented communications provides.
- Requires an object-oriented computational model (System 5.x) not just an object-oriented communications model.
- Example 5
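To make the output-independent idea in the tactics above concrete, here is a minimal sketch in which the computation returns a sequence of contribution objects and performs no I/O, while a separate consumer decides how to present them. It does not use or imitate the Daylight toolkit API: the class `Contribution`, its fields, both functions and all values are hypothetical, and the numbers are toy values rather than real fragment constants.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Contribution:
    kind: str      # "fragment" or "correction" (hypothetical categories)
    smiles: str    # fragment SMILES; empty for a non-structural correction
    value: float   # toy number, not a real CLOGP constant


def compute_contributions() -> List[Contribution]:
    """Stand-in for the computation: returns objects and does no I/O."""
    return [
        Contribution("fragment", "c1ccccc1", 2.0),
        Contribution("fragment", "O", -1.5),
        Contribution("correction", "", 0.25),
    ]


def total(contributions: List[Contribution]) -> float:
    """The computed value is just another property derived from the objects."""
    return sum(c.value for c in contributions)


def as_text_table(contributions: List[Contribution]) -> str:
    """One possible consumer: a plain-text report built from the objects."""
    rows = [f"{c.kind:<12}{c.smiles:<10}{c.value:>8.2f}" for c in contributions]
    rows.append(f"{'total':<22}{total(contributions):>8.2f}")
    return "\n".join(rows)


contributions = compute_contributions()
print(as_text_table(contributions))
```

Because the table is produced by a consumer rather than by the computation, a different interface (a table widget, a web page, another program) could use the same objects without touching the computational code.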

Daylight Chemical Information Systems, Inc.
info@daylight.com