EuroMUG '04 -- 4 - 5 Nov, 2004

GENSMI: Exhaustive Enumeration of Simple Graphs

Michael A. Kappler
Daylight CIS, Inc.

ABSTRACT

An algorithm has been created to generate all possible simple graphs (no atom or bond types). Topological indices are used to characterize frameworks and select "drug-like" scaffolds. As a result of processing a massive data set, Daylight tools have improved.


Motivation

At EuroMUG '03, a challenge was offered ``to construct all possible compounds containing carbon, nitrogen, and oxygen with a molecular weight between 150 and 250.'' Such a challenge arises from the exploratory nature of drug discovery endeavors such as the "Screensaver Lifesaver" project, a community effort aimed at discovery of new drug candidates and involves over a 2.5 million computers, 3.5 billion molecules and 16 protein targets. Motivation stems from the implied fact that the challenge hasn't been done before, and the possibility to suggest new "drug-like" candidates. To rise to the challenge of the latter, a collaboration was initiated between people at Daylight and the University of New Mexico School of Medicine.

Introduction

Counting

Enumerating

Sampling

Duplication

Applications

Goals

Methods

A constructive algorithm was developed to assemble simple graphs by adding (and removing) one vertex and one edge at a time, called orderly generation. The algorithm proceeds as follows:

 1 Routine: Main
 2 G <- NewGraph(1,0)
 3 Output(G)
 4 Enumerate-Vertices(G,1)
 5 done
 6
 7 Routine: EnumerateVertices(G,k)
 8 set <- SelectVertices(G,k)
 9 for i <- set[1], ... set[m]
10   if AcceptableVertex(G,i)
11     G <- AddVertex(G,i)
12     if GraphIsUnique(G)
13       Output(G)
14       EnumerateEdge(G,i)
15       EnumerateVertices(G,i)
16     end
17     G <- RemoveVertex(G,i)
18   end
19 end
20 return
21
22 Routine: EnumerateEdges(G,k)
23 setA <- SelectVertices(G,k)
24 setB <- SelectVertices2(G,k)
25 for i <- setA[1], ... setA[m]
26   for j <- setB[1], ... setB[n]
27     if AcceptableEdge(G,i,j)
28       G <- AddEdge(G,i,j)
29       if GraphIsUnique(G)
30         Output(G)
31         EnumerateEdges(G,i)
32       end
33       G <- RemoveEdge(G,i,j)
34     end
35   end
36 end
37 return

Algorithm Description

Graphs are enumerated by systematic assembly of acyclic graphs, and as they are constructed, each one is used to enumerate cyclic graphs. The algorithm begins by connecting a vertex to a graph in all possible ways, resulting in a set of acyclic graphs, one of which is G. As each acyclic graph is constructed, an edge is added in all possible ways resulting in set of monocyclic graphs. As each monocyclic graph is constructed, an edge is added in all possible ways resulting in a set of bicyclic graphs, and so on, until constraints are violated. Eventually, the algorithm enumerates all cyclic graphs based on G. Then, another vertex is connected in all possible ways, resulting in a new set of acyclic graphs, one of which is G'. As each G' is constructed, it is treated as G and the process recurs until constraints are violated. Eventually, the algorithm enumerates all acyclic graphs based on G. This implementation enumerates acyclic branched graphs before linear ones. For example, neopentane and 2,2-dimethylbutane occur early in the enumeration and n-pentane and n-hexane occur late.

Use of Symmetry

Order of Vertices and Edges

Use of Constraints

From Graph Theory to Chemistry

Results and Discussion

Table 1: Number of Simple Graphs with a Given Vertex and Cycle Count
Exhaustive Enumeration

                                                         Cycles
             0          1            2           3           4           5          6            7         8

    1        1
    2        1
    3        1          1
    4        2          2            1           1
    5        3          5            5           4           2           1           1
    6        5         12           17          18          14           8           3          1
V   7        9         29           56          79          79          59          31          9          2
e   8       18         73          182         326         430         427         298        134         35
r   9       35        185          573       1,278       2,161       2,768       2,616      1,714        707
t  10       75        475        1,792       4,875      10,162      16,461      20,346     18,436     11,477
i  11      159      1,231        5,533      17,978      45,282      90,111     140,605    167,702    146,427
c  12      355      3,232       16,977      64,720     192,945     460,699     880,737  1,327,242  1,530,906
e  13      802      8,506       51,652     227,842     790,849   2,222,549   5,082,279  9,372,118 13,663,758
s  14    1,858     22,565      156,291     787,546   3,138,808  10,216,607  27,402,524 60,319,259
   15    4,347     60,077      470,069   2,678,207  12,116,550  45,076,266 139,571,729
   16   10,359    160,629    1,407,264   8,982,754  45,675,153
   17   24,894    430,724    4,193,977  29,761,361
   18   60,523  1,158,502   12,451,760  97,557,854
   19  148,284  3,122,949   36,838,994
   20  366,319  8,437,289  102,733,261

 Total=834,181,624

Table 2: Number of Simple Graphs with a Given Vertex and Cycle Count
World Drug Index 2004.2 Database

                                         Cycles
            0       1       2       3       4       5       6       7       8       9+

    0       0
    1     113
    2     120
    3      78       3
    4     109       0       0       0
    5     183      13       0       0       0       0       0
    6     215      48       0       0       0       0       0       0
V   7     199      71       3       0       0       0       0       0       0
e   8     200     151       5       0       0       0       0       0       0       0
r   9     209     219      18       1       0       0       0       0       0       0
t  10     236     324      83       3       0       0       0       0       0       0
i  11     248     428     146      11       0       0       0       0       0       0
c  12     196     518     215      13       1       0       0       0       0       0
e  13     171     500     294      42       0       0       0       0       0       0
s  14     207     526     448     105       1       0       0       0       0       0
   15     145     508     482     168       3       0       0       0       0       0
   16     151     514     579     307      13       1       0       0       0       0
   17     111     439     748     380      34       0       0       0       0       0
   18      87     387     812     515      91       1       0       0       0       0
   19     113     330     765     672     121       5       0       0       0       0
   20     127     306     791     770     200      25       0       1       0       0
   21+   1335    3051    7146   11568   10773    6133    2746    1562     874    1509

 Total=64073 Subset(20/8)=17376 (27.1%)

Table 3: Number of Simple Graphs with a Given Vertex and Cycle Count
MedChem 2003 Database

                                         Cycles
            0       1       2       3       4       5       6       7       8       9+

    0       0
    1      14
    2      43
    3      64       4
    4     167      11       0       0
    5     311      58       2       0       0       0       0
    6     485     145       3       0       0       0       0       0
V   7     585     338       7       2       1       0       0       0       0
e   8     568     678      22       3       0       0       0       0       0       0
r   9     524    1089      68       1       0       0       0       0       0       0
t  10     460    1316     233       9       0       0       0       0       0       0
i  11     395    1451     512      10       0       0       0       0       0       0
c  12     313    1423     728      18       0       0       0       0       0       0
e  13     268    1204     714      62       0       0       0       0       0       0
s  14     222    1089     905     143       0       0       0       0       0       0
   15     195    1019     906     217       8       0       0       0       0       0
   16     150     841     956     311       9       0       0       0       0       0
   17     118     692     992     353      38       0       0       0       0       0
   18      88     567    1032     456      51       2       0       0       0       0
   19      72     471     947     527      73      12       0       0       0       0
   20      72     385     825     643     127      18       0       0       0       0
   21+    490    1537    4455    5820    4147    1723     499     143      79      42

 Total=48776 Subset(20/8)=29841 (61.2%)

Table 4: Number of Simple Graphs with a Given Vertex and Cycle Count
National Cancer Institute 2000 Database

                                         Cycles
            0       1       2       3       4       5       6       7       8       9+

    0       0
    1       4
    2      41
    3      62       2
    4     134       7       0       0
    5     301      50       0       0       0       0       0
    6     545     184       7       0       0       0       0       1
V   7     845     467      22       1       0       0       0       0       0       1
e   8    1099    1030      82      10       0       1       0       0       0       0
r   9    1254    1766     227       9       4       1       0       0       0       0
t  10    1302    2482     646      34       4       7       0       0       0       0
i  11    1212    3219    1299      85      10       6       1       1       0       3
c  12    1196    3903    2008     113       6      16       1       0       0       3
e  13    1000    3753    2360     246      19      12       9       2       0      16
s  14     946    3752    2830     561      22      11       7       0       0      13
   15     684    3188    3289     905      44       7       8       0       0      14
   16     687    2670    3813    1367      88      21      12       3       0      11
   17     543    2052    3951    1722     214      14       6       0       2       9
   18     437    1718    4071    2337     382      33       8       1       0      12
   19     349    1218    3564    2518     586      41      11       2       1      15
   20     352    1091    3195    2857     764      76      14       3       0      20
   21+   1950    4168   13553   20117   16934    6661    3610    1149     735     998

 Total=162148 Subset(20/8)=92156 (56.8%)

Table 5: Number of Simple Graphs with a Given Vertex and Cycle Count
Wombat 2004.1 Database

                                         Cycles
            0       1       2       3       4       5       6       7       8       9+

    0       0
    1       0
    2       0
    3       0       0
    4       0       0       0       0
    5       4       0       0       0       0       0       0
    6      10       4       0       0       0       0       0       0
V   7      15      17       0       0       0       0       0       0       0
e   8      20      45       2       0       0       0       0       0       0       0
r   9      32      60       6       0       0       0       0       0       0       0
t  10      37      70      15       0       0       0       0       0       0       0
i  11      33     111      49       1       0       0       0       0       0       0
c  12      29     161     134       9       0       0       0       0       0       0
e  13      31     219     246      55       0       0       0       0       0       0
s  14      37     230     325     117       0       0       0       0       0       0
   15      30     234     432     178       0       0       0       0       0       0
   16      35     214     489     327       6       0       0       0       0       0
   17      22     176     583     528      25       0       0       0       0       0
   18      22     181     631     707      81       0       0       0       0       0
   19      31     168     566     877     137       3       0       0       0       0
   20      29     183     590    1092     293      20       0       0       0       0
   21+    503    1864    6438   16471   16628    8089    2438     800     161     160

 Total=64566 Subset(20/8)=11014 (17.1%)

Table 6: Number of Simple Graphs with a Given Vertex and Cycle Count
Spresi 1995 Database

                                         Cycles
            0       1       2       3       4       5       6       7       8       9+

    0      21
    1     152
    2     624
    3    1198      15
    4    1792      61       5       2
    5    3152     326      21       2       0       0       0
    6    4751    1346      59       6       1       0       0       0
V   7    7555    3610     221      32       5       0       1       0       0
e   8   10342    7929     865     116      21       3       0       0       0       0
r   9   13440   14276    2175     268      63      10       0       0       0       0
t  10   15316   21917    5431     660     123      32       4       0       0       0
i  11   16412   29705   10626    1171     195      51       5       0       0       0
c  12   16129   37335   17728    2201     324     119      16       1       0       0
e  13   15590   40557   25484    3635     402     137      29       1       0       0
s  14   14078   42871   34280    7036     577     194      50       9       2       0
   15   12091   41193   41966   11044     820     174      47       6       3       1
   16   10742   38391   48140   16681    1453     229      63      16       3       5
   17    9537   33254   51497   21981    2399     224      94      18       1       5
   18    7781   29445   53497   29108    4308     410     104      38       4       6
   19    6770   24307   51201   34060    6183     601      80      41       3       0
   20    6035   20558   48724   38578    9272     961     143      53      10      21
   21+  42037   92696  278555  388071  288685  121001   51728   20218   10836   19719

 Total=2462825 Subset(20/8)=1149220 (46.7%)

Table 7: Number of Simple Graphs with a Given Vertex and Cycle Count
ChemNavigator 2004 Database

                                         Cycles
            0       1       2       3       4       5       6       7       8       9+

    0     223
    1      40
    2      52
    3     102       2
    4     242       6       0       1
    5     420      55       0       0       0       0       0
    6     703     200       2       0       0       0       0       0
V   7    1010     667       8       3       1       0       0       0       0
e   8    1384    1769      43       3       0       0       0       0       0       0
r   9    1543    3328     199       5       1       0       0       0       0       0
t  10    1854    5529     858      26       2       0       0       0       0       1
i  11    1858    7825    2166      90       1       1       0       0       0       0
c  12    2066   10641    4360     212       6      11       1       0       0       0
e  13    1948   12790    7474     552      18      17       0       0       0       0
s  14    1954   15770   12350    1363      28      24       2       0       0       1
   15    1668   18696   19647    2751      87      19       8       0       0       0
   16    1589   21709   28996    5344     199      24      14       0       0       0
   17    1284   23867   42812    9666     551      10      16       0       0       0
   18    1147   25806   62743   17053    1130      38      15       3       0       0
   19     891   26254   87480   29058    2469      68      12       3       0       0
   20     853   26497  115160   47718    5287     165      23       3       1       2
   21+   4845  116773 1759394 4212968 3986746 1972355  503728   65068    9067    2713

 Total=13366081 Subset(20/8)=732420 (5.5%)

Table 8: Characterization of Databases
All Compounds

Database           nCmpds    AMU    nAtom  Brnch  Cycli  nChir  nHdon  nHacc  nRots  nRing  sRing

WDI 2004.2         64,073  380.82   26      2.47   1.09   0      2      9      4      3      6
                          (552.86) (38.77) (0.25) (0.33) (6.92) (9.81)(23.38) (6.86) (2.17) (2.33)

MedChem 2003       48,766  265.35   18      2.40   1.00   n/a    1      6      2      2      6
                          (140.24)  (9.57) (0.22) (0.37)        (1.89) (5.16) (4.03) (1.40) (1.05)

NCI 2000          162,148  281.26   19      2.41   1.00   n/a    1      6      3      2      6
                          (154.37) (10.56) (0.22) (0.36)        (1.97) (5.67) (3.91) (1.73) (1.11)

Wombat 2004.1      64,566  401.48   28      2.40   1.09   0      2      8      3      3      6
                          (215.06) (15.29) (0.14) (0.17) (2.68) (4.01) (8.82) (3.94) (1.42) (1.12)

Spresi 1995     2,462,825  313.31   21      2.41   1.00   0      1      6      3      2      6
                          (194.35) (12.85) (0.24) (0.34) (2.25) (2.06) (6.83) (4.98) (1.78) (1.29)

ChemNav 2004   13,366,081  455.41   31      2.41   1.00   n/a    1      9      2      3      6
                           (96.71)  (6.56) (0.12) (0.09)        (1.01) (3.60) (3.14) (1.27) (0.53)

Notes:
  median values
  standard deviation in parenthesis

Key:
  nCmpds  - number of compounds
  Avg MW  - average molecule weight
  Brnch   - number of bonds per non-terminal atom (branching)
  Cycli   - number of cycles per ring atom (cyclization)
  nChir   - number of chiral center per molecule
  nRots   - number of rotatable bonds
  nRing   - number of cycles per molecule
  sRing   - size of cycles
  n/a     - not applicable

Table 9: Characterization of Databases
Compounds With 20 Vertices and 8 Cycles or Less

Database           nCmpds    AMU    nAtom  Brnch  Cycli  nChir  nHdon  nHacc  nRots  nRing  sRing

WDI 2004.2         17,376  236.07   16      2.43   1.00   0      2      6      2      2      6
                           (67.01)  (4.38) (0.40) (0.46) (1.20) (1.66) (3.38) (2.53) (1.16) (0.82)

MedChem 2003       29,841  204.26   14      2.40   1.00   n/a    1      5      2      1      6
                           (68.07)  (4.08) (0.25) (0.43)        (1.39) (3.14) (2.34) (0.97) (0.66)

NCI 2000           92,156  224.28   15      2.40   1.00   n/a    1      5      2      2      6
                           (62.55)  (3.63) (0.24) (0.41)        (1.48) (3.01) (2.42) (1.09) (0.76)

Wombat 2004.1      11,014  249.32   18      2.40   1.13   0      1      5      2      2      6
                           (47.39)  (3.01) (0.14) (0.25) (1.02) (1.69) (2.80) (1.98) (0.95) (0.64)

Spresi 1995     1,149,220  236.39   16      2.40   1.00   0      0      5      2      2      6
                           (63.32)  (3.55) (0.29) (0.42) (0.84) (1.44) (2.86) (2.56) (1.10) (0.85)

ChemNav 2004      732,420  267.33   18      2.38   1.00   n/a    1      5      2      2      6
                           (52.17)  (2.78) (0.15) (0.21)        (1.14) (2.57) (1.78) (0.78) (0.67)

Notes:
  median values
  standard deviation in parenthesis

Key:
  nCmpds  - number of compounds
  Avg MW  - average molecule weight
  Brnch   - number of bonds per non-terminal atom (branching)
  Cycli   - number of cycles per ring atom (cyclization)
  nChir   - number of chiral center per molecule
  nRots   - number of rotatable bonds
  nRing   - number of cycles per molecule
  sRing   - size of cycles
  n/a     - not applicable

Table 10: Number of Simple Graphs with a Given Vertex and Cycle Count
Exhaustive Enumeration Within Constraints

                                                         Cycles
             0          1            2           3           4           5           6            7         8

    1        0
    2        0
    3        1          0
    4        2          0            0           0
    5        2          1            0           0           0           0           0
    6        5          2            0           0           0           0           0          0
V   7        8          6            1           0           0           0           0          0          0
e   8       17         15            6           1           0           0           0          0          0
r   9       31         45           28           8           0           0           0          0          0
t  10       73        124          115          56           0           0           0          0          0
i  11      146        351          449         337          50           0           0          0          0
c  12      346        970        1,650       1,785         691           0           0          0          0
e  13      798      2,711        5,802       8,544       6,083         208           0          0          0
s  14    1,821      7,433       19,593      37,306      41,017       3,463           0          0
   15    4,326     20,615       64,475     152,299     232,931      42,982           0
   16   10,184     56,130      205,565     586,821   1,168,284
   17   24,790    154,750      647,676   2,158,602
   18   59,680    419,620    2,000,657   7,662,010
   19  147,722  1,152,845    6,077,679
   20  362,008  3,147,135   17,261,193

 Total=43,963,080 Subset(enumeration)=5.3%

Constraints:
  Molecular Weight: 0-282
  # of atoms:       0-20
  branching:        2.00-3.23
  cyclization:      0.00-1.92
  # of cycles:      0-8
  size of cycles:   5-7

Findings

Conclusion

Phase I of this project, enumeration of simple graphs, has lead to a huge amount of information. Characterization of frameworks using topological indices has lead to a selection of "drug-like" scaffolds. Daylight tools have improved as a result of the shear magnitude of generating and processing this massive data set.

Future Direction

Acknowledgements


Daylight Chemical Information Systems, Inc.
info@daylight.com