MUG '04 -- 24 - 27 Feb, 2004

QSAR in fedora

Dave Weininger
Metaphorics LLC

Introduction to QSAR

QSAR stands for Quantitative Structure-Activity Relationship, i.e., a relationship between one or more molecular parameters (i.e., properties of the structure) and an observed molecular property (the activity) for a set of molecules.

When the activity is a reaction rate or equilibrium constant, the QSAR is known as a physicochemical QSAR. An example of a physicochemical QSAR is the correlation of a dissociation constant (e.g., pKa) with σ (sigma, electron-withdrawing potential), as in the Hammett equation.
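As a rough sketch of the Hammett relationship mentioned above: the equation is log10(K/K0) = ρσ, which for dissociation constants rearranges to pKa = pKa0 - ρσ. The function name and the input values below are illustrative assumptions, not entries from the qsar database.

```python
def hammett_pka(pka_parent: float, rho: float, sigma: float) -> float:
    """Estimate the pKa of a substituted acid from the Hammett equation.

    log10(K / K0) = rho * sigma and pKa = -log10(K), so
    pKa = pKa0 - rho * sigma.
    """
    return pka_parent - rho * sigma


# Illustrative inputs only: a parent acid with pKa 4.20, a reaction
# constant rho of 1.0, and an electron-withdrawing substituent with
# sigma = 0.50.
predicted = hammett_pka(pka_parent=4.20, rho=1.0, sigma=0.50)
print(f"predicted pKa = {predicted:.2f}")
```

A positive σ (electron-withdrawing) lowers the estimated pKa when ρ is positive, and a negative σ raises it.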

When the activity is an observed biological endpoint or is measured in a biological system, the QSAR is known as a biological QSAR.

QSARs are most commonly derived for structurally related compounds, e.g., congeneric series, but can also be derived for miscellaneous compounds using molecular parameters.

A number of QSAR models are used, the most common being a "single parameter linear model" (i.e., a simple correlation). More complex, non-linear multiparameter models can be supported when enough data are available.
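To make the "single parameter linear model" concrete, here is a minimal sketch of an ordinary least-squares fit of an activity against one molecular parameter. The function name and the data are assumptions made up for illustration. Nothing here comes from the qsar database or describes how any particular QSAR program does its regression.

```python
from statistics import mean


def fit_single_parameter(x, y):
    """Least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    # Two points always fit a line exactly, so ask for at least three.
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need at least three (parameter, activity) pairs")
    x_mean, y_mean = mean(x), mean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("parameter and activity must both vary across the set")
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Synthetic values for illustration only.
sigma = [-0.27, -0.17, 0.00, 0.23, 0.50, 0.78]
activity = [0.31, 0.22, 0.02, -0.20, -0.52, -0.75]

slope, intercept, r2 = fit_single_parameter(sigma, activity)
print(f"activity = {slope:.2f} * sigma + {intercept:.2f}   (r^2 = {r2:.3f})")
```

A multiparameter model follows the same idea with one coefficient per parameter, which is why it needs correspondingly more data to support it.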

Two important QSAR research techniques are comparative QSAR (comparison of QSARs for similar activities in different systems or for different kinds of structures in the same system) and lateral validation of biological activities with physicochemical reactivities (IMHO, a bit of a misnomer, but the name has stuck).

It is important to understand that a QSAR is simply an observation of a correlation between one or more molecular parameters and an observed property. By themselves, QSARs do not establish cause-and-effect relationships between parameters and properties. However, QSARs can be predictive in the statistical sense.

Isn't QSAR a specialized technique for drug discovery and refinement?

Sort of yes but definitely no. That was its origin, but there's much more to it than that.

The qsar database contains about 20,000 QSAR relationships, roughly evenly divided between physicochemical (45%) and biological (55%) QSARs. A huge amount of data is represented: 265,785 molecular structures with 2,022,106 total parameter values recorded (of which 51,755 are used in final QSARs). These QSAR data were obtained from 567 journals and 50 books, representing 20,032 specific citations written by 24,700 different authors.

This database represents the bulk of what has been observed to date about the effects of molecular structure on chemical reactivity and biological activity (discounting proprietary information). For anyone involved in any kind of molecular discovery, access to this data should be the "ante". I.e., before setting out to discover something new, one should have access to what has already been noticed.

QSAR in fedora

Fedora (federation of research assets) is a federated database system which delivers researcher-friendly access to a large amount of disparate information. Fedora combines a straightforward web interface with high-performance informatics methods to provide simplified yet rigorous access to research data.

Existing programs which provide access to QSAR data (e.g., C-QSAR) can only be effectively used by a dedicated and well-trained specialist. Part of this problem is that the nature of QSAR information is not as "flat" as most data -- each entry in a QSAR database represents a relationship with a wealth of underlying information. Another aspect leading to apparent QSAR complexity is that a large number of molecular parameters have evolved over the last 30-40 years, many of which are poorly documented and thus poorly understood. Additionally, QSARs have been usefully applied to many sub-fields but QSAR remains on the fringe of each.

The QSAR dataset is therefore a perfect candidate for inclusion in a federated database system such as fedora. The qsar fedora service implements a primary object database in which each entry is a QSAR relationship with component objects such as molecular structures, observed properties, molecular parameters, equations, and references. Internal object databases are maintained for QSAR classification, molecular parameters, enzyme functionality, publications, and authors.
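To illustrate why a QSAR entry is not "flat", here is a hypothetical sketch of one entry as a nested object. The class and field names are my own assumptions for illustration and are not the fedora schema. The equation is assumed to be a simple linear sum of terms, which will not cover the non-linear models, and the numbers are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Structure:
    smiles: str
    observed: Optional[float]            # dependent variable, if measured
    parameters: Dict[str, float] = field(default_factory=dict)
    used_in_fit: bool = True             # False for omitted structures


@dataclass
class Term:
    parameter: str                       # name of a molecular parameter
    coefficient: float


@dataclass
class Reference:
    authors: List[str]
    publication: str
    year: int


@dataclass
class QsarEntry:
    classification: str                  # "physicochemical" or "biological"
    system: str
    terms: List[Term]
    intercept: float
    structures: List[Structure]
    references: List[Reference]

    def compute(self, parameters: Dict[str, float]) -> float:
        """Evaluate the (assumed linear) equation for one parameter set."""
        return self.intercept + sum(
            t.coefficient * parameters[t.parameter] for t in self.terms
        )


# Placeholder entry: one term, one structure, one made-up reference.
entry = QsarEntry(
    classification="physicochemical",
    system="ionization in water",
    terms=[Term("sigma", -1.0)],
    intercept=4.20,
    structures=[Structure("OC(=O)c1ccccc1", 4.20, {"sigma": 0.0})],
    references=[Reference(["A. Author"], "Hypothetical Journal", 2000)],
)
print(f"{entry.compute({'sigma': 0.5}):.2f}")
```

Even this toy version has four kinds of component object hanging off each entry, which is the property that makes a flat table an awkward fit.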

The overall design goals of the qsar fedora service are divided into four levels:

  1. Access to underlying data

    All QSAR structures (structures used to derive a QSAR and omitted ones). All observed values (dependent variables). All available molecular parameters (independent variables, whether used in the final QSAR or not). Complete references, indexed by publication and author. Automatic structure cross-referencing with other fedora servers.

  2. Access to QSAR correlations

    QSAR classification and system. The QSAR equation, indexed by parameter (term). Regression statistics, including confidence limits on the coefficient of each term and on predicted values. Worked-out computation for each structure. Enrichment with related enzyme functionality assignments and automatic enzyme cross-referencing with other fedora servers.

  3. Computation of QSAR applicability

    Evaluation of the structural scope and parametric scope of each QSAR.

  4. Property predictions for new structures

    Given the computation of applicability above, calculate all required molecular parameters and estimate predicted value and error of prediction. Note that, by their nature, not all QSARs are predictive.

The current fedora qsar service implements the first two goals (access to QSAR data and correlations) and most of the third (applicability). Quite a bit of collaborative work needs to be done with BioByte to achieve the final goal (prediction).
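The idea behind goals 3 and 4 can be sketched in a few lines. This is only an assumption about what a minimal parametric-scope check might look like: a new structure is "in scope" if each of its parameter values falls inside the range spanned by the structures used to derive the QSAR. It is not a description of how the fedora qsar service evaluates applicability, and it ignores structural scope entirely. All names and values are hypothetical.

```python
def parametric_scope(training_rows, names):
    """Return {name: (min, max)} over the structures used in the fit."""
    return {
        n: (min(r[n] for r in training_rows), max(r[n] for r in training_rows))
        for n in names
    }


def in_scope(scope, candidate):
    """True if every required parameter is present and within range."""
    return all(
        n in candidate and lo <= candidate[n] <= hi
        for n, (lo, hi) in scope.items()
    )


def predict(coefficients, intercept, scope, candidate):
    """Evaluate a linear QSAR, or return None when it does not apply."""
    if not in_scope(scope, candidate):
        return None
    return intercept + sum(c * candidate[n] for n, c in coefficients.items())


# Placeholder training parameters and equation.
training = [{"sigma": -0.3}, {"sigma": 0.0}, {"sigma": 0.8}]
scope = parametric_scope(training, ["sigma"])

print(predict({"sigma": -1.0}, 4.20, scope, {"sigma": 0.5}))   # inside the range
print(predict({"sigma": -1.0}, 4.20, scope, {"sigma": 1.5}))   # outside: None
```

Returning None rather than an extrapolated number reflects the note above that not all QSARs are predictive: a value outside the observed range is not something the correlation supports.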

What fedora QSAR is not

The fedora qsar service is not a tool for deriving new QSARs from raw data.

The fedora qsar interface

Take a tour of the fedora qsar server.


Daylight Chemical Information Systems, Inc.