There are an estimated
1,453 drugs that have been approved by the US Food and Drug Administration (FDA) in its entire history. If we take small molecules as a case study (as opposed to protein therapeutics), most drug libraries that can be purchased from suppliers come in batches of 1,000 different molecules. This is a drop in the ocean: computational estimates suggest that the number of possible drug-like small molecules could be on the order of 10^60 (i.e. 1 million billion billion billion billion billion billion). That is roughly 20 orders of magnitude fewer than some estimates of the number of atoms in the observable universe (for readers who are familiar with the AlphaGo story, this comparison may sound familiar). The American Chemical Society
Chemical Abstracts Service database contains almost 146 million organic and inorganic substances, which suggests that
almost all small molecules (>99.9%) have
never been synthesized, studied, or tested.
Taken together, these statistics mean that research labs can only test a few thousand drug candidates in the real world. Once they have found a few candidates that look functionally promising, they often iteratively explore structural variants to determine whether better versions exist. This again requires synthesis experiments and chemical assays. As such, to even make a dent in the estimated 10^60 possible drug-like small molecules, we need to use software to rationally and automatically explore the chemical design space for candidates that have the desired properties for a specific indication.
To solve this problem, ML researchers have turned to sequence-based generative models trained on large databases of chemical structures (often using SMILES representations). Here, chemistry is effectively treated as a language with its own grammar and rule set. Generative models can learn the probability distribution over molecule space such that they implicitly internalise the rules of chemistry, somewhat analogous to how generative models trained on a corpus of English text learn English grammar. To this end,
a 2017 paper from the University of Münster and AstraZeneca presented a recurrent neural network (RNN) trained as a generative model for molecular structures. While the performance wasn't state-of-the-art, the system could reproduce a small percentage of molecules known to have antibacterial effects.
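To make the "chemistry as a language" idea concrete, here is a minimal sketch of a character-level SMILES language model. This is not the architecture from the paper: it assumes PyTorch, uses a three-molecule toy corpus in place of the large chemical databases described above, and adds start (^) and end ($) tokens so the network can learn where strings begin and terminate.

```python
# Minimal character-level SMILES language model (illustrative only).
import torch
import torch.nn as nn

# Toy corpus; real systems train on hundreds of thousands of SMILES strings.
smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
chars = sorted(set("".join(smiles)) | {"^", "$"})   # ^ = start, $ = end token
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for c, i in stoi.items()}

class SmilesRNN(nn.Module):
    def __init__(self, vocab, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, 64)
        self.gru = nn.GRU(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, x, h=None):
        out, h = self.gru(self.embed(x), h)
        return self.head(out), h

model = SmilesRNN(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Next-character prediction: the network learns p(next character | prefix),
# i.e. a probability distribution over SMILES strings.
for epoch in range(100):
    for s in smiles:
        seq = torch.tensor([[stoi[c] for c in "^" + s + "$"]])
        logits, _ = model(seq[:, :-1])
        loss = loss_fn(logits.squeeze(0), seq[0, 1:])
        opt.zero_grad(); loss.backward(); opt.step()

def sample(max_len=100):
    # Generate a new string by sampling one character at a time until "$".
    x, h, out = torch.tensor([[stoi["^"]]]), None, []
    for _ in range(max_len):
        logits, h = model(x, h)
        idx = torch.multinomial(torch.softmax(logits[0, -1], dim=-1), 1).item()
        if itos[idx] == "$":
            break
        out.append(itos[idx])
        x = torch.tensor([[idx]])
    return "".join(out)
```

With enough training data, strings sampled this way increasingly respect SMILES syntax and, implicitly, the rules of chemistry.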
Related work from AstraZeneca expanded on the RNN-based molecule generator by using a policy-based reinforcement learning approach to tune the RNN to generate molecules with predetermined desirable properties. The authors showed that the model can be trained to generate analogues of the anti-inflammatory drug
Celecoxib, implying that this system could be used for scaffold hopping or library expansion starting from a single molecule.
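The policy-gradient idea can be sketched on top of the toy language model above. The reward function below is a deliberately crude placeholder (a REINVENT-style system would instead score candidates with a trained property or activity predictor, such as similarity to Celecoxib), and `model`, `stoi` and `itos` refer to the previous sketch.

```python
# Policy-gradient fine-tuning sketch: treat the SMILES generator as a policy,
# each sampled string as an episode, and a property score as the reward.
# Reuses `model`, `stoi` and `itos` from the language-model sketch above.
import torch
from torch.distributions import Categorical

def reward(smiles_string):
    # Placeholder objective; real systems use a trained property/activity
    # predictor rather than a substring check.
    return 1.0 if "c1ccccc1" in smiles_string else 0.0

def sample_with_logprob(max_len=100):
    x, h, out, logps = torch.tensor([[stoi["^"]]]), None, [], []
    for _ in range(max_len):
        logits, h = model(x, h)
        dist = Categorical(logits=logits[0, -1])
        idx = dist.sample()
        logps.append(dist.log_prob(idx))
        ch = itos[idx.item()]
        if ch == "$":
            break
        out.append(ch)
        x = idx.view(1, 1)
    return "".join(out), torch.stack(logps).sum()

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for step in range(200):
    s, logp = sample_with_logprob()
    # REINFORCE: increase the log-probability of high-reward molecules.
    loss = -reward(s) * logp
    opt.zero_grad(); loss.backward(); opt.step()
```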
Other ML methods have also been used to build molecule generators. Researchers at the University of Tokyo and RIKEN borrowed from the AlphaGo architecture to build a system called ChemTS, in which a SMILES-based molecule generator combines Monte Carlo Tree Search for shallow search with an RNN to roll out each downstream path.
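A drastically simplified, self-contained sketch of that search loop is shown below: plain Monte Carlo Tree Search over SMILES characters, with random rollouts standing in for the RNN rollouts and a toy scoring function standing in for the property objective ChemTS optimises.

```python
# Drastically simplified ChemTS-style search: Monte Carlo Tree Search over
# SMILES characters. Random rollouts stand in for the RNN rollouts and a toy
# scoring function stands in for the property objective used in the paper.
import math
import random

ALPHABET = list("CNOc1()=")   # toy character vocabulary
MAX_LEN = 12

def score(s):
    # Placeholder: ChemTS scores valid molecules with a property model.
    return s.count("c") - s.count("(")

def rollout(prefix):
    # Complete the string at random; ChemTS uses its RNN to do this.
    s = prefix
    while len(s) < MAX_LEN:
        s += random.choice(ALPHABET)
    return s, score(s)

class Node:
    def __init__(self, prefix, parent=None):
        self.prefix, self.parent = prefix, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def ucb(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(iterations=500):
    root, best = Node(""), ("", float("-inf"))
    for _ in range(iterations):
        node = root
        # Selection: descend by UCB until reaching an unexpanded node.
        while node.children:
            node = max(node.children, key=Node.ucb)
        # Expansion: add one child per possible next character.
        if len(node.prefix) < MAX_LEN:
            node.children = [Node(node.prefix + ch, node) for ch in ALPHABET]
            node = random.choice(node.children)
        # Simulation and backpropagation.
        s, r = rollout(node.prefix)
        if r > best[1]:
            best = (s, r)
        while node:
            node.visits += 1
            node.value += r
            node = node.parent
    return best[0]

print(mcts())
```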
The same group also published an evolutionary approach called ChemGE, which relies on genetic algorithms to improve the diversity of the molecules that are ultimately generated.
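ChemGE itself uses grammatical evolution over the SMILES grammar; the toy loop below captures only the mutate-score-select rhythm, with raw character mutations and a placeholder fitness function.

```python
# Toy genetic-algorithm loop in the spirit of ChemGE. The real system encodes
# molecules via grammatical evolution over the SMILES grammar and optimises a
# property objective; here mutation acts on raw characters and fitness is a
# placeholder.
import random

ALPHABET = list("CNOc1()=")

def fitness(s):
    # Placeholder for a property score computed on valid molecules.
    return s.count("C") + s.count("c")

def mutate(s, rate=0.2):
    return "".join(random.choice(ALPHABET) if random.random() < rate else ch
                   for ch in s)

def evolve(seed="CCOc1ccccc1", pop_size=50, generations=100):
    population = [mutate(seed) for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the fittest half, then refill the population with fresh mutants.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in range(pop_size - len(survivors))]
    return max(population, key=fitness)

print(evolve())
```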
Researchers at the University of Cambridge, meanwhile, applied a pair of deep networks trained as an autoencoder to convert molecules represented as SMILES strings into a continuous vector representation; the autoencoder was trained jointly on a property prediction task.
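A minimal sketch of that setup, assuming PyTorch and a toy corpus, is shown below; the published models are variational autoencoders trained on far larger datasets, and string length stands in here for a real molecular property.

```python
# Minimal SMILES autoencoder with a joint property-prediction head (PyTorch).
# Toy corpus and a stand-in "property" (string length) are used purely to
# show how the joint objective shapes the continuous latent space.
import torch
import torch.nn as nn

smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
props = torch.tensor([[float(len(s))] for s in smiles])      # toy property
chars = sorted(set("".join(smiles)) | {"^", "$", " "})        # " " = padding
stoi = {c: i for i, c in enumerate(chars)}
max_len = max(len(s) for s in smiles) + 2

def encode(s):
    return torch.tensor([stoi[c] for c in ("^" + s + "$").ljust(max_len)])

class SmilesAE(nn.Module):
    def __init__(self, vocab, latent=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, 32)
        self.encoder = nn.GRU(32, hidden, batch_first=True)
        self.to_latent = nn.Linear(hidden, latent)
        self.from_latent = nn.Linear(latent, hidden)
        self.decoder = nn.GRU(32, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)
        self.property_head = nn.Linear(latent, 1)    # joint property task

    def forward(self, x):
        e = self.embed(x)
        _, h = self.encoder(e)
        z = self.to_latent(h[-1])                    # continuous representation
        h0 = self.from_latent(z).unsqueeze(0)
        dec, _ = self.decoder(e[:, :-1], h0)         # teacher-forced decoding
        return self.out(dec), self.property_head(z)

data = torch.stack([encode(s) for s in smiles])
model = SmilesAE(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

for epoch in range(200):
    logits, pred = model(data)
    # Joint objective: reconstruct the string and predict its property, which
    # organises the continuous latent space by the property of interest.
    loss = (ce(logits.reshape(-1, len(chars)), data[:, 1:].reshape(-1))
            + mse(pred, props))
    opt.zero_grad(); loss.backward(); opt.step()
```

The intuition is that sharing one latent vector between the decoder and the property head encourages molecules with similar properties to sit close together, which is what makes continuous optimisation over the latent space attractive.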
Finally, even generative adversarial networks (GANs) have been adapted to operate directly on graph-structured molecule data: the MolGAN system from the University of Amsterdam combines a GAN with a reinforcement learning objective to encourage the generation of molecules with specific desired chemical properties. Taken together, these papers illustrate the promise of accelerating chemical search space exploration using virtual screening.