40,000 small molecules docked to 10 proteins using DockIt with results stored in a standard Thor/Merlin database.
DockIt: fast distance-geometry-based docking
Robust program designed for production work
Analytical DOCK solution
Fastest distance-geometry-based docking program yet
Three scoring functions: DOCK, PMF, PLP
New Metaphorics product
One protein, 1hpv, was selected "on purpose"
Others selected randomly from ~150 known protein-ligand structures
1add 1ela 1hpv 1hyt 1rbp 1rds 1swd 2cbr 3cla 4hvp
Small & large, simple & complex, well-known & obscure, one pair
All bound to ligands (which were also docked)
40,000 small molecules
Database of diverse compounds intended for screening
Well-formed data, e.g., isomers, salt data
Plated samples are available from InterBioscreen
Freely available as Thor database
10 by 40,000 by 100 trials each makes 40,000,000 dockings
Done on ~240 of the 256 R12000 processors of an SGI/Cray supercomputer
CEX used for communication between programs
Results as CEX objects converted to Thor datatrees
Consensus pick: ~400,000 conformations
Thor: $D3D-rooted subtrees contain scores, rms, etc.
Merlin: SMILES/conformation rows, 3-D data not in pool
Rasmol: widely used, free visualization program, local "expert"
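As a rough illustration of the storage layout described above, the sketch below models one SMILES-rooted datatree with conformation subtrees and flattens it into Merlin-style rows. Only the $D3D name comes from the slides; every field name (docked_to, dock, pmf, plp, coords) and all values are hypothetical placeholders, and plain Python dictionaries stand in for the real Thor/Merlin interfaces, which are not shown here.

```python
# Hypothetical in-memory picture of one datatree; this is not the Thor API.
def make_datatree(smiles, conformations):
    """Return a dict with a SMILES root and one $D3D subtree per conformation."""
    return {
        "root": smiles,
        "$D3D": [dict(c) for c in conformations],
    }


def merlin_rows(tree):
    """Flatten a datatree into one row per (SMILES, conformation).

    The 3-D coordinates are dropped from each row, mirroring the note
    that 3-D data is not kept in the Merlin pool.
    """
    rows = []
    for sub in tree["$D3D"]:
        row = {"smiles": tree["root"]}
        row.update({k: v for k, v in sub.items() if k != "coords"})
        rows.append(row)
    return rows


example = make_datatree(
    "CCO",  # placeholder molecule, not taken from the database
    [
        {"docked_to": "1hpv", "dock": -1.0, "pmf": -2.0, "plp": -3.0,
         "coords": [(0.0, 0.0, 0.0)]},
        {"docked_to": "4hvp", "dock": -1.5, "pmf": -2.5, "plp": -3.5,
         "coords": [(1.0, 0.0, 0.0)]},
    ],
)
rows = merlin_rows(example)
print(len(rows))  # one row per conformation
```

The point of the split is that searchable values (scores, IDs) travel with each row while bulky coordinates stay in the tree.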
How many structures/dockings does it take to generate interesting results? Is 40,000 not enough, enough, more than enough?
Are some proteins very specific and some promiscuous, or more-or-less the same?
Are some small molecules very specific and some promiscuous, or more-or-less the same?
Do current (faddish?) scoring functions produce sensible results?
Is RMS a sensible gold standard?
Does consensus scoring make sense?
Does DockIt/docking eliminate obvious losers?
Does DockIt/docking find known binders?
Can "normal" informatics be wrapped around docking data?
Are generic communication (CEX, XML) protocols suitable for production work?
Is DockIt ready for prime-time?
Do you need DockIt in your shop?
Project started less than a month ago, done in a couple weeks!
Started with a close-to-production CEX-talking DockIt, production Daylight small-molecule database and DBMS software, etc.
Would have taken a long time on our machines (45 days / many months). Even using local compute resources (NCGR), it would have taken ~9 days, and we would have been hard-pressed to get it onto CD-ROM by MUG.
SGI generously provided time on a brand new (~1 week old) 256-processor R12000 machine. Compute time using ~240 of the CPUs was less than 24 hours.
Access to the machine needed to be via private SGI network, so Scott camped out at the SGI sales office in Portland and got it done.
The result set was ~54 GB. Getting the results backed out and transferred was as difficult and time-consuming as everything else combined.
Reduction of the data to Thor/Merlin database format was relatively easy: done in a day.
CD-ROMs are available now.
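The figures quoted in the slides imply some rough per-docking budgets, worked out in the sketch below. The inputs are taken from the slides; treating "less than 24 hours" as exactly 24 hours gives an upper bound only, and reading 1 GB as 10**9 bytes is an assumption.

```python
# Figures taken from the slides.
PROTEINS = 10
MOLECULES = 40_000
TRIALS = 100
CPUS_USED = 240
WALL_HOURS = 24        # "less than 24 hours", so this is a ceiling
RESULT_BYTES = 54e9    # assumption: 1 GB = 10**9 bytes

dockings = PROTEINS * MOLECULES * TRIALS
cpu_seconds = CPUS_USED * WALL_HOURS * 3600
max_seconds_per_docking = cpu_seconds / dockings
bytes_per_docking = RESULT_BYTES / dockings

print(dockings)                  # 40000000
print(max_seconds_per_docking)   # 0.5184
print(bytes_per_docking)         # 1350.0
```

So each docking trial averaged about half a CPU-second at most, and produced on the order of a kilobyte of raw output.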
10 proteins as bound to ligands
10 ligands as bound to proteins
lowest RMS docking of 10 ligands to their proteins
consensus docking of 38,123 small molecules to each of 10 proteins
heavy atom count
RMS for bindings
DOCK, PMF and PLP scores for dockings
docking consensus score (and ranks)
docked-to and bound-to IDs
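For readers unfamiliar with the RMS value stored for each ligand, a minimal sketch follows. It assumes heavy atoms only, identical atom ordering in both coordinate lists, and both poses expressed in the protein's coordinate frame with no superposition; the slides do not say exactly how DockIt computes its RMS, and the coordinates below are made up.

```python
import math


def rms_deviation(docked, bound):
    """Root-mean-square distance between matched atom positions."""
    if len(docked) != len(bound) or not docked:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(docked, bound):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(docked))


# Made-up two-atom example: the docked pose is shifted 2 units along z.
bound = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
docked = [(0.0, 0.0, 2.0), (1.5, 0.0, 2.0)]
print(rms_deviation(docked, bound))  # 2.0
```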
m4x10x40k database contents:
38,143 SMILES-rooted datatrees: 10 are proteins, 10 are observed ligands, the rest (38,123) are from bioscr99sc.
381,113 conformation-rooted subtrees: 20 are bindings, the rest (381,093) are dockings.
381,103 sets of DOCK, PMF, PLP & consensus scores
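The counts above can be cross-checked with a few lines of arithmetic. All numbers come from the slides; the comparison against a full 38,123 x 10 grid is only an illustration, since the slides do not explain why the docking count differs from it or why there are 10 more score sets than dockings.

```python
# Counts as stated in the slides.
proteins = 10
observed_ligands = 10
screening_molecules = 38_123
smiles_trees = 38_143

bindings = 20
dockings = 381_093
conformation_subtrees = 381_113
score_sets = 381_103

# The stated breakdowns add up.
assert proteins + observed_ligands + screening_molecules == smiles_trees
assert bindings + dockings == conformation_subtrees

# Illustrative comparisons only; the slides do not explain these gaps.
full_grid = screening_molecules * proteins
print(full_grid)               # 381230
print(full_grid - dockings)    # 137
print(score_sets - dockings)   # 10
```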
m4x10x40k database sizes:
m4x10x40k.DP, biggest file
all database files
merlin pool (3,620,250 dataitems)
merlin total, including overhead
Docking informatics is definitely doable.
However, this represents a big change from how things are currently done.
Handling tens of GB of data is work, but not "rocket science".
Modern disks have enough capacity.
Daylight DBMS works fine with proteins.
Proteins are just big molecules!
(Most are not really so big, at that.)
Almost no changes were needed to the Thor/Merlin system.
CEX makes such informatics possible.
CEX could be faster
XML might work if it came of age
Pushing around 40,000,000 PDBs would not have been reasonable
DockIt is ready for production
DockIt definitely cranks.
Scale up is not a factor: large-scale performance is within a few percent of predictions.
Can expect to complete large jobs w/o crashes, blowups, etc.
Automated sphere generation (sphinx) and refinement (sublime) are 90% but not 100% ... yet.
We still have the choice between best (with H's) or fast (without H's), but not "best and fast" at the same time.
Lots of interesting dockings crop up
Could these strange and wonderful scaffolds really work?
The current crop of scoring functions isn't wonderful.
Size bias. Surprisingly expensive. False positives.
Between-molecule scoring comparisons are non-intuitive.
The three scoring functions provided with DockIt are very different from each other. In so far as they represent the current world-range in scoring, these three are about as good as one can do.
Between-molecule consensus scoring has practical (rather than theoretical) merit.
Conformational consensus scoring seems to work well (e.g., among 100 docked conformations of the same molecule). However, that's not what most people want to know.
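To make conformational consensus concrete, here is a minimal rank-sum sketch that picks one pose from several docked conformations of the same molecule. The slides do not say how DockIt forms its consensus score, so rank-sum is an assumed stand-in, as is the convention that a lower score is better for all three functions; the score values are invented.

```python
def ranks(values):
    """Rank values so the lowest (assumed best) score gets rank 1.

    Ties receive distinct consecutive ranks in input order, which is a
    simplification.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def consensus_pick(poses, names=("dock", "pmf", "plp")):
    """Return (index of the pose with the smallest rank sum, all rank sums)."""
    per_function = [ranks([p[name] for p in poses]) for name in names]
    rank_sums = [sum(r[i] for r in per_function) for i in range(len(poses))]
    best = min(range(len(poses)), key=lambda i: rank_sums[i])
    return best, rank_sums


# Invented scores for three conformations of one molecule.
poses = [
    {"dock": -10.0, "pmf": -5.0, "plp": -30.0},
    {"dock": -12.0, "pmf": -7.0, "plp": -20.0},
    {"dock": -8.0, "pmf": -6.0, "plp": -25.0},
]
best, sums = consensus_pick(poses)
print(best, sums)  # 1 [6, 5, 7]
```

Ranking within one molecule sidesteps the size bias that makes raw scores hard to compare between molecules, which is one plausible reason the within-molecule case behaves better.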
Tom Magdziarz, SGI, Mountain View
David Zirl, SGI, Mountain View
Tim Tomasi, SGI, Portland
Bill Durch, SGI, Chippewa Falls
Pete Wargo, NCGR, Santa Fe
Metaphorics Krewe, Santa Fe & Mission Viejo