ML in discovery and design
Applying ML for discovery research is challenging. Contrary to some expectations, ML tools do not implicitly reveal how complex systems work, even if they appear to successfully model them. Recently, I shared brief thoughts about the challenges of applying ML for target discovery, prompting me to reflect on the specific areas of drug development where ML is having an impact.
In a previous post on this topic, I observed that ML has yet to make a substantial impact on discovery research and has so far found application mainly in design (of small molecules, antibodies, and vectors). This observation aligns with the trend of new companies focusing on "AI for chemistry" (design) rather than "AI for biology" (discovery). That being said, "AI for chemistry" does not properly capture the rapidly expanding fields of antibody and gene therapy engineering.
My post focused on how current market dynamics are affecting the application of ML in drug development. However, more important scientific challenges remain in demonstrating the usefulness of ML in the discovery stage of drug development.
Discovery research
Discovery is the crucial initial stage of drug development, the "top of the funnel". It paves the way to creating a significant strategic advantage in the competitive scientific arena. Understanding novel disease biology equips a team with insights into the root cause of a condition. However, even after you have done the hard work of target discovery, there is a long way to go. A pharmaceutical industry veteran recently told me that only 1 in 50 “promising” small molecules designed for a given target are ultimately successful. That isn’t counting the many molecules it takes to get to a “promising” one. Over 20 years, he was involved in 37 development programs. One of them made it to market.
Discovery research encompasses more than just uncovering new aspects of disease biology. It also includes alternative development strategies, such as repurposing existing drugs or widening the range of their therapeutic uses. These strategies acknowledge the variability in biological responses to therapies and the possibility of incomplete understanding of a drug's mechanisms of action (MoA). An interesting case is polypharmacology, where a drug affects multiple targets. Previously seen as unfavorable, this trait is now recognized in many approved drugs as potentially beneficial rather than detrimental.
Problem statement
To explore the technicalities in more depth, I will frame the problem of identifying targets with a platform technology. It is important to acknowledge that different platforms have inherent trade-offs that favor either discovery or design.
Imagine a scenario where you have the ability to measure a phenotype in a high-throughput manner. This enables the screening of numerous chemicals within a system to assess their biological responses. Here, we assume you have a model system (e.g., a cell line) that approximates the diseased phenotype. Many discovery efforts aim to shift a "disease" state to either "dead" or "normal." While there are many phenotypes to avoid, such as toxicity and adverse events, we will disregard them for simplicity.
Given a disease, there may be several molecular targets that can be modulated to achieve the desired phenotype. In our setup, let's assume that the targets are among the ~20k mammalian proteins. This scenario overlooks various levels of regulatory biology that can be targeted (DNA, RNA; transcription, translation) and the possibility that upregulating or stabilizing a target could hold the key to the desired phenotype instead of inhibition. The final goal is to identify the set of proteins that maximizes the desired phenotype, given the diseased phenotype and a chemical intervention.
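One rough way to write this goal down is shown here. The symbols are illustrative assumptions of mine, not established notation: $\mathcal{P}$ is the set of candidate proteins, $x_{\text{disease}}$ the diseased phenotype, $c$ the chemical intervention, $y$ the phenotype measured after the intervention, and $s(y)$ a hypothetical score of how closely $y$ matches the desired state.

```latex
T^{*} \;=\; \underset{T \subseteq \mathcal{P}}{\arg\max}\;
\mathbb{E}\!\left[\, s(y) \;\middle|\; x_{\text{disease}},\, c,\, T \,\right],
\qquad |\mathcal{P}| \approx 20{,}000
```

Here conditioning on $T$ is shorthand for "the intervention $c$ acts by modulating the proteins in $T$". In practice neither $s$ nor this conditional expectation is known, which is why the frameworks that follow each approximate a different piece of the problem.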
Frameworks
Here are four ways to frame target identification with machine learning tools. These are general approaches, so we will look at them within the context of target identification, outlining their strengths and weaknesses:
Discriminative modeling (supervised approaches)
You are attempting to directly predict target(s).
Advantages: Direct and straightforward prediction of targets; offers probabilistic target estimates for ranking and prioritization.
Disadvantages: Often faces a scarcity of well-characterized and accurately labeled data (weak labels); limited prior information on drug-target interactions, particularly in context-specific scenarios.
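To make the "probabilistic target estimates for ranking" point concrete, here is a minimal sketch of the discriminative framing. Everything in it is an assumption for illustration: the phenotype profiles are synthetic, the target names are placeholders, and a plain softmax regression stands in for whatever classifier a real platform would use.

```python
import numpy as np

def predict_proba(X, W, b):
    """Softmax probabilities over candidate targets for each phenotype profile."""
    logits = X @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax(X, y, n_classes, lr=0.2, n_steps=500):
    """Multinomial logistic regression trained by batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]               # one-hot target labels
    for _ in range(n_steps):
        grad = (predict_proba(X, W, b) - Y) / n
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def rank_targets(x, W, b, target_names):
    """Return (target, probability) pairs sorted from most to least likely."""
    p = predict_proba(x[None, :], W, b)[0]
    order = np.argsort(-p)
    return [(target_names[i], float(p[i])) for i in order]

# Synthetic stand-in for labeled data: each hypothetical target produces
# phenotype profiles scattered around its own center.
rng = np.random.default_rng(0)
target_names = ["TARGET_A", "TARGET_B", "TARGET_C"]   # placeholder names
centers = rng.normal(size=(3, 8)) * 3.0
y_train = rng.integers(0, 3, size=300)
X_train = centers[y_train] + rng.normal(size=(300, 8))

# Standardize features so the fixed learning rate behaves well.
mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
W, b = fit_softmax((X_train - mu) / sd, y_train, n_classes=3)

# Rank candidate targets for a new profile (here, the TARGET_B center).
ranking = rank_targets((centers[1] - mu) / sd, W, b, target_names)
print(ranking)
```

The ranking is only as trustworthy as the labels: with weak or context-specific labels, a confident-looking probability may reflect label noise rather than biology.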
Generative modeling (unsupervised approaches)
You are attempting to generate novel phenotype readouts, conditioned on intervention variable(s), and interpret them.
Advantages: Operates without labeled data, suitable for situations with sparse ground truth; potentially unbiased data exploration.
Disadvantages: Requires high accuracy for simulating biological processes and in silico experiments; presents the traditional challenge of distilling and interpreting high-dimensional phenotypes; only as good as the data – and you need a lot.
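As a toy illustration of generating readouts conditioned on an intervention, the sketch below fits a linear-Gaussian model p(y | c) and samples phenotypes for an intervention combination that was never measured. The data are synthetic, and the additivity of intervention effects is an assumption baked into both the simulation and the model; real biology frequently violates it, which is exactly the accuracy problem noted above.

```python
import numpy as np

def fit_conditional_gaussian(C, Y):
    """Fit p(y | c) = N(c @ W + b, diag(sigma^2)) by least squares."""
    C1 = np.hstack([C, np.ones((C.shape[0], 1))])   # append intercept column
    coef, *_ = np.linalg.lstsq(C1, Y, rcond=None)
    sigma = (Y - C1 @ coef).std(axis=0)             # per-feature residual spread
    return coef[:-1], coef[-1], sigma

def sample_phenotypes(c, W, b, sigma, n_samples, rng):
    """Draw synthetic phenotype readouts for intervention vector c."""
    mean = c @ W + b
    return mean + sigma * rng.normal(size=(n_samples, mean.shape[0]))

# Synthetic experiment: control, compound A alone, compound B alone.
rng = np.random.default_rng(1)
true_W = rng.normal(size=(2, 5))        # assumed additive effect of each compound
true_b = rng.normal(size=5)             # assumed baseline phenotype
C_train = np.repeat(np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]), 100, axis=0)
Y_train = C_train @ true_W + true_b + 0.1 * rng.normal(size=(300, 5))

W, b, sigma = fit_conditional_gaussian(C_train, Y_train)

# "In silico experiment": simulate the unmeasured A+B combination.
combo = np.array([1.0, 1.0])
samples = sample_phenotypes(combo, W, b, sigma, n_samples=2000, rng=rng)
print(samples.mean(axis=0))
```

The generated A+B profiles look plausible only because the simulation is additive by construction. Nothing in the fit itself can tell you whether that extrapolation holds in a real system.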
Retrieval (contrastive approaches)
You are attempting to build a similarity map of phenotypes and interpret relative relationships.
Advantages: Transforms the problem into similarity search after establishment of a metric; leverages phenotype relationships for target function insights; useful in undersampled areas or few-shot settings (e.g., few MoA examples).
Disadvantages: Dependent on a useful notion of similarity; relies on known phenotypes/MoA as landmarks for interpretation.
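Once an embedding exists, the retrieval step itself is simple. The sketch below assumes phenotype embeddings have already been learned (for example by a contrastive model, which is not shown) and uses cosine similarity as the metric. The vectors and MoA labels are made up for illustration.

```python
import numpy as np

def l2_normalize(X):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def retrieve(query, landmarks, labels, k=3):
    """Return the k landmark labels most similar to the query embedding."""
    sims = l2_normalize(landmarks) @ l2_normalize(query[None, :])[0]
    top = np.argsort(-sims)[:k]
    return [(labels[i], float(sims[i])) for i in top]

# Hypothetical landmark embeddings for compounds with known MoA.
landmarks = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
labels = ["MoA_1", "MoA_1", "MoA_2", "MoA_3"]

# Embedding of an uncharacterized compound.
query = np.array([0.8, 0.2, 0.0])
hits = retrieve(query, landmarks, labels, k=2)
print(hits)
```

Both nearest neighbors carry the same label here, which is the kind of evidence one would use to propose a MoA for the query. If the query fell far from every landmark, the lookup would still return neighbors, so similarity scores need to be read alongside the labels.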
Model interpretation (feature attribution approaches)
You are attempting to interpret which features a model relies upon for making a given prediction.
Advantages: Identifies critical features used in predictions; can be used in discriminative/generative/contrastive frameworks; offers insights into model behavior and confounding variables.
Disadvantages: Lacks causal assurance and is prone to model shortcuts; empirical nature of the field; risks needing to interpret high-dimensional feature sets.
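Permutation importance is one simple feature attribution method and serves as a small example here. The sketch uses synthetic data and a least-squares model as stand-ins; the choice of method and all names are assumptions for illustration, not a recommendation over gradient-based or Shapley-style alternatives.

```python
import numpy as np

def permutation_importance(predict, X, y, rng, n_repeats=10):
    """Mean increase in squared error when each feature is shuffled."""
    base = np.mean((predict(X) - y) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break the feature-response link
            scores[j] += np.mean((predict(Xp) - y) ** 2) - base
    return scores / n_repeats

# Synthetic readout: depends strongly on feature 0, weakly on feature 2,
# and not at all on features 1 and 3.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=500)

# Stand-in model: ordinary least squares.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(X_new):
    return X_new @ coef

importance = permutation_importance(predict, X, y, rng)
print(importance)
```

The scores describe what the fitted model relies on, not what causes the phenotype. If feature 0 were a batch or plate artifact correlated with the label, it would score just as highly.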
Additional considerations
Now, are targets being discovered in the wild with the help of ML tools? Probably, but not in the “genie” type of way. A lot of distillation, interpretation, and follow-up work is required. Here are a few other thoughts that may help put the challenge of applying ML for target ID into perspective:
ML's effectiveness hinges on access to substantial, high-quality data from carefully designed experiments.
The vastness of chemical space necessitates non-random exploration strategies to extract meaningful utility from any model that hopes to extrapolate along this axis.
Compared to other fields (analogies are often drawn to the progress in NLP and CV), biology suffers from significant undersampling. Frankly, there is a scarcity of comprehensive data for “blue-sky” discovery. Focusing the problem on biological niches is important.
Biology is inherently dynamic, and current tools do not capture these dynamics at any significant scale. The predominant high-throughput measurements provide static snapshots (t=0) and fail to capture the breadth of these dynamic systems.