Briefing Document: Explainable Deep Learning for Molecular Discovery
Citation: Wong, F., Omori, S., Li, A. et al. An explainable deep learning platform for molecular discovery. Nat Protoc (2024). https://doi.org/10.1038/s41596-024-01084-x
Dates: Received: 11 March 2024 | Accepted: 26 September 2024 | Published: 9 December 2024
Source: Excerpts from "An explainable deep learning platform for molecular discovery" (s41596-024-01084-x.pdf)
Executive Summary
This briefing document summarizes a research protocol for leveraging explainable deep learning (DL) to accelerate the discovery of novel molecules with desired properties, particularly focusing on antibiotic discovery as a case study. The protocol utilizes Graph Neural Networks (GNNs) and a software package called Chemprop to build predictive models of molecular properties based on experimental data. A key innovation is the integration of Monte Carlo Tree Search (MCTS) to provide "rationales" – specific chemical substructures that explain the model's predictions. This explainability contrasts with traditional "black box" DL approaches and allows researchers to gain chemical insights, efficiently narrow chemical spaces, and prioritize compounds with promising structural features for experimental validation. The protocol encompasses data generation, model training and benchmarking, rationale analysis, and experimental validation, and is designed to be accessible even without extensive coding expertise or specialized hardware.
Main Themes and Important Ideas
1. The Power and Limitation of Deep Learning in Molecular Discovery
- DL approaches have demonstrated significant success in modeling compounds, leading to increased true discovery rates and the identification of drug candidates across various therapeutic areas.
- "DL approaches have accurately modeled compounds, with increased true discovery rates, and resulted in the discovery of drug candidates including antibiotics7,17–22, senolytics23 and anticancer and antiviral combinations24,25."
- However, a major drawback of many DL models is their "black box" nature, lacking transparency in how they arrive at predictions.
- "However, a major limitation to DL approaches is that they are typically black box in nature or unable to provide reasoning behind model predictions."
2. Explainable Deep Learning (xDL) as a Solution
- Explainable DL is an emerging field focused on providing the reasoning behind DL model predictions. It contrasts with interpretable DL, which aims to reveal the decision-making steps a model performs to arrive at its predictions.
- "Explainable DL is an emerging field that aims to open up the black box by providing this reasoning7,34–37. Explainable DL contrasts with interpretable DL, which aims to reveal patterns of decision-making steps the models perform to arrive at their predictions38 (see Box 2 for further comparison)."
- In the context of molecular discovery, xDL identifies chemical substructures ("rationales") with positive predictive value for a property of interest.
- "As applied to molecular discovery, explainable DL identifies patterns of chemical atoms and bonds—chemical substructures—that have positive predictive value for a property of interest."
- This enables generalization beyond individual compounds to classes of compounds sharing these substructures, leading to specific chemical insights and efficient narrowing of chemical spaces.
- "By enabling models to predict structural classes directly, explainable DL can produce specific chemical insights, efficiently narrow chemical spaces and lead researchers toward effective chemical scaffolds."
- Explainability and interpretability are driving research towards increased comprehension, trustworthiness, and accountability of AI/ML models (Box 2).
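"Narrowing a chemical space" with rationales can be pictured as filtering a compound library for candidates that share a predictive substructure. The sketch below is a toy: it uses naive SMILES substring matching purely for illustration, whereas real substructure matching would use a cheminformatics toolkit (e.g., RDKit's `HasSubstructMatch`), and the rationale fragment here is hypothetical, not one reported in the paper.

```python
# Toy library of candidate SMILES strings (contents are illustrative).
library = [
    "CCOc1ccccc1",
    "CC(=O)Nc1ccccc1",
    "CCCCCC",
    "O=C(N)c1ccccc1",
]

# Hypothetical rationale substructure, written as a SMILES fragment.
rationale = "c1ccccc1"  # a benzene ring

# Naive substring match as a stand-in for real substructure search
# (in practice, RDKit would handle aromaticity and atom mapping).
prioritized = [smi for smi in library if rationale in smi]
print(prioritized)  # the three aromatic compounds are retained
```

The point is the workflow, not the matching: once a rationale is known, an entire library can be triaged by one structural test rather than compound-by-compound inspection.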
3. The Proposed Explainable DL Platform and Protocol
- The protocol introduces a general-purpose explainable DL platform for molecular discovery, using GNNs as the DL model architecture.
- "In this protocol, we present an explainable DL platform for molecular discovery (Fig. 1)."
- "The platform uses graph neural networks (GNNs) as a DL model architecture. GNNs model chemical structures as collections of nodes (chemical atoms) and edges (chemical bonds) and perform computations that pool together information from neighboring atoms and bonds16,39–42."
- The protocol provides a practical guide to Chemprop, a user-friendly software package implementing GNNs, specifically directed Message Passing Neural Networks (MPNNs).
- "We provide a practical introduction to Chemprop16,43–45, a software package implementing GNNs..."
- "Chemprop is a software package implementing a variant of MPNNs, in which messages are associated with directed edges in contrast to nodes16,45."
- The protocol covers experimental data generation (using high-throughput screening as an example), model implementation, model explainability (using MCTS), evaluation, and experimental validation.
- "This protocol includes both computational and experimental components, and it does not require coding proficiency or specialized hardware. Starting from data generation and ending in the testing of model predictions, this protocol can be executed in as little as 1–2 weeks, depending on dataset size and the time required for experiments."
- The protocol is exemplified through antibiotic discovery, but its applicability extends to discovering molecules with other biological or chemical properties.
4. Key Components of the Protocol
- Graph Neural Networks (GNNs) and Message Passing Neural Networks (MPNNs): GNNs process molecular structures represented as graphs (atoms as nodes, bonds as edges) through message passing between neighboring atoms and bonds. Chemprop implements directed MPNNs.
- "GNNs are a type of ANN designed to operate on input data represented as graphs. Graphs are mathematical objects that consist of nodes and edges. For molecular discovery, these nodes and edges represent chemical atoms and bonds, respectively."
- "To incorporate local connectivity information, message passing—a step where embeddings from a node or edge’s neighbors are pooled with its preexisting embeddings—is often incorporated into a GNN layer’s update function, resulting in an architecture called MPNNs."
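The message-passing update described above can be sketched in plain Python. The graph, embeddings, and pooling rule below are illustrative toys, assuming a simple undirected molecular graph; Chemprop's actual implementation associates messages with directed edges and uses learned update functions rather than the fixed halving step shown here.

```python
# Toy message-passing layer on a molecular graph (illustrative only;
# Chemprop's real variant passes messages along directed edges).

# A small graph: nodes are atoms, adjacency lists are bonds.
adjacency = {0: [1], 1: [0, 2], 2: [1]}  # a 3-atom chain

# Initial node embeddings (in practice, learned atom features).
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}

def message_pass(h, adjacency):
    """One round of message passing: each node pools (sums) its
    neighbors' embeddings with its own, then halves the result as a
    stand-in for a learned update function."""
    new_h = {}
    for node, neighbors in adjacency.items():
        pooled = list(h[node])
        for nbr in neighbors:
            pooled = [p + q for p, q in zip(pooled, h[nbr])]
        new_h[node] = [0.5 * p for p in pooled]
    return new_h

h1 = message_pass(h, adjacency)
print(h1[0])  # node 0's embedding now mixes in node 1's features
```

Stacking several such rounds lets each atom's embedding absorb information from progressively larger neighborhoods, which is how MPNNs encode local connectivity.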
- Chemprop: A user-friendly, constantly maintained software package implementing MPNNs with diverse functionalities and customization options. It was successfully used in the discovery of halicin as an antibacterial compound.
- "Key features of Chemprop are that it is user friendly, possesses diverse functionalities and customization options and, since its development in 2019, has been constantly maintained. Its application to antibiotic discovery, resulting in the identification of halicin as an antibacterial compound17, was reported by our laboratory in 2020."
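For orientation, Chemprop's classification mode takes a CSV with a SMILES column and a binary label column. The sketch below writes such a file with toy rows; the training command it prints reflects the Chemprop v1-style CLI and is an assumption, as flags may differ across releases, so consult the Chemprop documentation for the installed version.

```python
import csv

# Toy training data: SMILES strings with binary activity labels
# (1 = active in the screen, 0 = inactive). Real datasets come from
# high-throughput screening, as described in the protocol.
rows = [
    ("smiles", "activity"),
    ("CCO", 0),
    ("c1ccccc1O", 0),
    ("CC(=O)Oc1ccccc1C(=O)O", 1),  # label is illustrative, not measured
]

with open("train.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Chemprop v1-style training invocation (flags may vary by version):
print("chemprop_train --data_path train.csv "
      "--dataset_type classification --save_dir model_checkpoints")
```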
- Monte Carlo Tree Search (MCTS) for Explainability: For molecules predicted to be active ("hits"), MCTS is used to identify the smallest substructure ("rationale") that still results in a high predicted activity. This provides an explanation for the model's prediction.
- "For each molecule with a suitably high prediction score (‘hit’), we aim to determine the smallest substructure resulting in the molecule being classified as active (the molecule’s ‘rationale’)."
- "The MCTS algorithm solves the discrete optimization problem of finding the rationale of a molecule... each state in the search tree is a substructure..."
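The objective MCTS optimizes here, the smallest substructure whose predicted activity remains high, can be illustrated with a brute-force toy. The `predict` function, motif, and threshold below are stand-ins for a trained GNN; real rationales are connected subgraphs found efficiently by tree search, not atom subsets enumerated exhaustively.

```python
from itertools import combinations

# Stand-in "model": scores a candidate substructure highly if it
# contains the motif the (toy) model associates with activity.
# A real scorer would be a trained GNN's prediction on the subgraph.
ACTIVE_MOTIF = {"N", "O2"}  # hypothetical activity-driving atoms

def predict(substructure):
    return 1.0 if ACTIVE_MOTIF <= set(substructure) else 0.1

def find_rationale(atoms, threshold=0.5):
    """Return the smallest atom subset still predicted active.
    Brute force for clarity; the protocol uses MCTS to search this
    combinatorial space efficiently."""
    for size in range(1, len(atoms) + 1):
        for subset in combinations(atoms, size):
            if predict(subset) >= threshold:
                return set(subset)
    return None

hit = ["C1", "C2", "N", "O1", "O2"]  # toy "molecule" as atom labels
print(find_rationale(hit))  # -> {'N', 'O2'}
```

The returned minimal subset plays the role of the rationale: the structural core that, by itself, keeps the model's prediction above the activity threshold.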