Can Machine Learning Give Us Faster and Cheaper Clinical Trials?
On Trial Design, Historical Controls, and Clinical Prediction Models
Introduction
In drug development, clinical trials carry a hefty price tag with a median cost of $48 million, or $41,413 per patient.[i] Usually, the more patients involved, the more expensive the trial. Crucially, biotech companies bringing novel therapeutics to market must recruit enough patients to adequately demonstrate efficacy and safety while reducing costs and minimizing time to market.
Biotech and pharma executives frequently focus on streamlining operations such as patient recruitment and coordination of clinical sites. Yet another compelling avenue for cost savings lies in the statistical methodology used to analyze the data once they are collected. Recently, such methodology has been extended to incorporate machine learning (ML) prediction models. This has led companies such as Unlearn, a San Francisco-based technology start-up, to offer tools and services that use ML to optimize clinical trials and, in their words, “advance artificial intelligence to eliminate trial and error in medicine.”[ii]
In this post, I critically evaluate Unlearn’s most mature product, prognostic covariate adjustment (PROCOVA).[iii] The value proposition is that by using PROCOVA, trial investigators get the same level of information with fewer patients recruited, resulting in potentially faster and cheaper trials. The conditions surrounding how this can be done – as well as the caveats – are the topic of this piece. The more cutting-edge “TwinRCT” product will be discussed in my next post, which will build upon many of the concepts introduced in this piece. Briefly, PROCOVA works in three steps:
1. Model Development: Utilize historical data from previous trials and observational data to construct a model that predicts patients’ outcomes, or prognosis, under the control treatment.
2. Prognostic Scoring: Apply the trained model to compute prognostic scores for the patients in the current trial.
3. Treatment Effect Estimation: Adjust for the prognostic score in the final statistical model used to estimate the treatment effect.
In theory, when designing the trial, investigators can examine the model's predictive performance in step one above to infer how much PROCOVA can reduce statistical noise, or variance, when estimating the treatment effect. This quantified variance reduction can then be incorporated into the trial design by reducing the number of patients who will need to be recruited while still maintaining the same chance of detecting a signal (i.e. a beneficial treatment effect) – hence, cost-savings.
While PROCOVA represents a promising step toward more efficient clinical trials, there are many practical barriers to its widespread implementation. First, gathering high-quality data to train ML models that meet the requirements of trial protocol (e.g. inclusion criteria, treatment regimens, comparator group) could be difficult. I argue that only randomized trial data can be reasonably used to train prognostic models as opposed to easier-to-obtain and relatively inexpensive observational data like electronic health records and registries. Second, highly qualified expertise is needed to scientifically justify variables inputted into prognostic models and ensure they are collected correctly in the clinical trial. Third, PROCOVA may complicate the interpretation of results in trials with non-continuous outcomes, including the ubiquitous “time-to-event” endpoint. Most of all, any potential modelling challenges must be addressed and hedged prior to recruiting patients to achieve any cost savings. Trial protocols are seldom set in stone and often change in response to events that occur during the trial (e.g. recruitment challenges, newly approved treatments, COVID-19). It is not clear how adaptable and robust the PROCOVA framework is to such changes.
Clinical trials carry important scientific and financial implications. Introducing complex methodology invites esoteric risk that executives may not know how to guard against. In this sense, the theoretical often conflicts with the practical. While existing methods like covariate adjustment are theoretically “suboptimal,” they have stood the test of time and are widely accepted. Thus, developers of ML methods in clinical trials must grapple with not only technical barriers but, more crucially, cultural ones. In the remainder of this piece, I first outline some key statistical concepts around trial design, ML, and historical data before diving into my key points in more detail.
Clinical Trial First Principles: Treatment Effects, Power, and Covariate Adjustment
Suppose a biotech company wished to test whether their novel colorectal cancer (CRC) drug reduces tumor size and apply for regulatory approval. One or more “Phase 3” trials may be conducted pitting their drug against the current standard of care to see which is more efficacious – that is, which shrinks tumors more – while maintaining an acceptable level of safety for the patient. The difference in shrinkage between the two groups is called the “treatment effect,” a measure of efficacy. These trials are usually large, but how many patients does a trial really need? It boils down to statistical power: the probability that a statistical test detects that the proposed drug works better than the control, assuming it truly does. In essence, high power tells us that if a signal is there, there is a good chance of detecting it amongst the noise. All else equal, the more patients recruited, the higher the power. This is primarily because the overall noise, or “variability,” in the treatment effect estimate goes down as the sample size increases. Imagine how certain you would be that a coin is unfair if 60% of flips landed heads after 10 flips versus after 1,000 flips. The rule of thumb is that Phase 3 trials should aim for 80-90% power.
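To make this concrete, here is a minimal sketch of a standard power calculation using statsmodels; the effect size, power targets, and significance level are illustrative choices of mine, not figures from any real CRC trial.

```python
# A minimal sketch of a standard power calculation for a two-arm trial
# with a continuous endpoint (e.g. change in tumor size). The effect
# size of 0.3 standard deviations is a hypothetical illustration.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Solve for the number of patients per arm needed to reach 80% power
# at a two-sided 5% significance level (roughly 175 per arm here).
n_80 = analysis.solve_power(effect_size=0.3, alpha=0.05, power=0.80)
print(f"Patients per arm for 80% power: {n_80:.0f}")

# Power rises with sample size: the same effect at 90% power needs
# roughly 60 more patients per arm.
n_90 = analysis.solve_power(effect_size=0.3, alpha=0.05, power=0.90)
print(f"Patients per arm for 90% power: {n_90:.0f}")
```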
In standard practice, the power level is set before the trial begins and one solves for the number of patients that should be recruited. The choice of statistical methodology can help maintain the same power while recruiting fewer patients. Analysis of covariance (ANCOVA) is one such methodology: it can increase the power to detect treatment effects by adjusting for factors that influence a patient’s outcome independent of the treatment. In our example, these would be variables that affect the rate of tumor growth, like medical history, genetic mutations, and lifestyle choices. By holding these covariates constant, one removes statistical noise unrelated to the novel drug, yielding a more precise estimate of the treatment effect and, in turn, more power.
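As a sketch of what ANCOVA looks like in code (the column names and data file are hypothetical placeholders, not from an actual trial), the adjusted treatment effect is simply the coefficient on the treatment indicator in a linear model that also includes the baseline covariates.

```python
# A minimal ANCOVA sketch: the treatment effect is the coefficient on
# `treatment` after holding baseline covariates constant. Column names
# are hypothetical placeholders for a CRC trial dataset.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("trial_data.csv")  # one row per patient, `treatment` coded 0/1

model = smf.ols(
    "tumor_size_change ~ treatment + baseline_tumor_size + age + kras_mutation",
    data=df,
).fit()

# The coefficient (and confidence interval) on `treatment` is the
# covariate-adjusted estimate of the treatment effect.
print(model.summary())
```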
PROCOVA: ANCOVA meets Machine Learning and Historical Controls
PROCOVA is designed to mitigate an important limitation of its namesake, ANCOVA. How much power is gained from covariate adjustment depends on whether one has accurately captured the relationship between the covariates and the outcome. To characterize relationships with continuous outcomes, investigators often use linear models because of their ease of interpretation. However, this means that the covariates must enter the model as linear terms too. Consider a variable like BMI: at lower values, its effect on tumor growth might increase at a small but constant rate as BMI increases (i.e. linear), yet at higher BMIs, the effect on tumor growth may grow exponentially with each increase (i.e. non-linear). Prognostic variables can also interact: the protective effect of exercise against tumor growth may be enhanced by the presence of a certain gene that affects inflammation. While in traditional modeling these complexities would need to be explicitly defined, ML can learn such non-linear relationships without specifying them a priori, theoretically avoiding what is called “model misspecification.”
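To illustrate the misspecification point with simulated data (the data-generating process below is invented purely for illustration and makes no claim about real BMI effects), a flexible learner can recover a non-linear prognostic relationship that a straight-line model only partially captures.

```python
# Simulated illustration: a linear model underfits a non-linear
# prognostic relationship, while a flexible ML model captures it.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
bmi = rng.uniform(18, 40, size=2000)
# Hypothetical data-generating process: roughly linear at low BMI,
# rising sharply at high BMI, plus noise.
tumor_growth = 0.1 * bmi + 0.05 * np.maximum(bmi - 30, 0) ** 2 + rng.normal(0, 1.0, size=2000)
X = bmi.reshape(-1, 1)

print("linear R^2: ", cross_val_score(LinearRegression(), X, tumor_growth, scoring="r2").mean())
print("boosted R^2:", cross_val_score(GradientBoostingRegressor(random_state=0), X, tumor_growth, scoring="r2").mean())
# The boosted model's higher cross-validated R^2 reflects the extra
# variance reduction a correctly specified adjustment could capture.
```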
To be sure, adjusting for variables as if they were linearly related to the outcome – even if they are not – will usually still reduce variance, but additional variance reduction is left on the table by modeling them incorrectly. This matters because investigators planning the trial want to minimize the number of patients needed, which requires an accurate assessment of model misspecification. A needle must be threaded: if one is too optimistic about the covariate adjustment, the trial will be underpowered and have a smaller chance of detecting a positive effect if one is present; on the other hand, one would like to avoid recruiting more patients than necessary.
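One way to see how this feeds into trial design is a back-of-the-envelope relationship under a simple linear-model approximation (this is not the exact PROCOVA sample-size formula): adjusting for a prognostic score that correlates with the outcome at level ρ shrinks the required sample size by roughly a factor of (1 − ρ²).

```python
# Back-of-the-envelope sketch: how the prognostic model's predictive
# performance (correlation rho between score and outcome) translates
# into a smaller required sample size. An approximation, not the exact
# PROCOVA sample-size formula.
def adjusted_sample_size(n_unadjusted: int, rho: float) -> int:
    """Approximate sample size after adjusting for a prognostic score
    correlated with the outcome at level rho."""
    return round(n_unadjusted * (1 - rho ** 2))

n = 350  # hypothetical unadjusted total sample size
for rho in (0.0, 0.3, 0.5, 0.7):
    print(f"rho = {rho:.1f} -> {adjusted_sample_size(n, rho)} patients")
# Overstating rho at the design stage (e.g. because the model was
# evaluated too optimistically) leaves the trial underpowered.
```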
To recap: investigators usually fit linear models for ease of interpretation, yet it is desirable to capture complex, non-linear relationships between the prognostic variables and the outcome. Furthermore, when designing the trial, investigators would like to quantify the sample size reduction gained by adjusting for the prognostic score. This is where PROCOVA comes in. The bedrock of PROCOVA is a growing trend in evidence generation: historical controls.
Historical controls encompass data already collected, either observational or randomized, in which patients received the control treatment of the current trial and their outcomes were observed. In our CRC trial, this would be data on the tumor growth of individuals who received the control, i.e. the current standard of care. These data can be used to build a model that predicts patient outcomes as a function of prognostic factors. The trained model is then applied to the current trial: each patient’s outcome is predicted as if they had received the control treatment, and this prediction is adjusted for in a linear model. This is just covariate adjustment, except that the only covariate adjusted for is the prognostic score, which is particularly useful when there are many covariates to account for. For trial design purposes, by examining the model’s predictive performance on the historical data, one may infer its performance in the trial and thus calculate the potentially reduced sample size.[iv]
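Putting the three steps together, a minimal sketch might look like the following. This is my own illustrative reconstruction rather than Unlearn’s implementation; the model choice, file names, and covariates are hypothetical.

```python
# Illustrative reconstruction of the PROCOVA workflow (not Unlearn's code).
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.ensemble import GradientBoostingRegressor

# Step 1: Model development on historical control-arm data
# (hypothetical file with baseline covariates and the outcome observed
# under the current standard of care).
hist = pd.read_csv("historical_controls.csv")
covariates = ["baseline_tumor_size", "age", "kras_mutation", "ecog_score"]
prognostic_model = GradientBoostingRegressor().fit(
    hist[covariates], hist["tumor_size_change"]
)

# Step 2: Prognostic scoring - predict each trial patient's outcome
# as if they had received the control treatment.
trial = pd.read_csv("current_trial.csv")  # `treatment` coded 0/1
trial["prognostic_score"] = prognostic_model.predict(trial[covariates])

# Step 3: Treatment effect estimation - a linear model adjusting for
# the single prognostic score instead of many individual covariates.
fit = smf.ols("tumor_size_change ~ treatment + prognostic_score", data=trial).fit()
print(fit.summary())  # the `treatment` coefficient is the adjusted effect estimate
```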
While PROCOVA can potentially reduce the number of patients needed by using ML and historical data to capture the relationships of prognostic variables, this comes with its own set of trade-offs. Most of all, to deliver real savings, PROCOVA must demonstrate its value before any trial data have been collected. This may be more of an art than a science. I dive into some important considerations and nuances for those wishing to use PROCOVA below.
Practical Considerations for Building Prognostic Models
Consideration 1: Justifying the Covariates for Model Training
Suppose our hypothetical biotech company wished to use PROCOVA in their CRC drug trial. The first step is to decide which covariates should be included in the prognostic model. The company must keep in mind that these covariates must also be collected in the planned clinical trial in order to derive the prognostic scores. Additionally, the more covariates utilized, the more data will be needed to train the ML model. Importantly, the chosen covariates have to be predictive of patient outcomes independent of the novel treatment; in other words, the response to treatment should not vary with the values of the covariates. Such variation is called treatment effect heterogeneity, and if it is present, interaction terms should be specified in the statistical model, which may change the interpretation of the treatment effect. Proving the scientific relevance of each covariate may be challenging in many settings and could be outside the scope of the data.
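If heterogeneity is a concern, one simple way to probe it (again a sketch with hypothetical names, continuing from the earlier pipeline) is to add a treatment-by-score interaction to the final model and inspect that term.

```python
# Sketch of checking for treatment effect heterogeneity with respect to
# the prognostic score by adding a treatment-by-score interaction.
import pandas as pd
import statsmodels.formula.api as smf

trial = pd.read_csv("current_trial_with_scores.csv")  # hypothetical scored trial data

fit = smf.ols("tumor_size_change ~ treatment * prognostic_score", data=trial).fit()
# A non-negligible `treatment:prognostic_score` coefficient suggests the
# effect varies with prognosis, changing how the treatment effect is read.
print(fit.summary())
```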
Some practical trade-offs arise: while some variables are predictive of the outcome, they may be expensive or difficult to collect – for example, sequencing genetic mutations or deriving tumor characteristics from imaging, which may incur radiation exposure. There is also the issue of missing data, as a patient may not consent to, or may be contraindicated for, certain collection procedures. This induces selection bias, even when using advanced imputation methods, unless one makes restrictive missing data assumptions.
Recall that the reduction in sample size, and therefore the cost savings, is directly related to the predictive ability of the prognostic model. Thus, to achieve adequate performance, companies must spend money and effort on collecting covariates and ensuring they are not missing. Realistically, modelers may face a scenario where only a few key demographic variables can be reliably measured. This raises the question of whether one should simply adjust for these few variables during the analysis of the trial results rather than go through the comparatively complicated PROCOVA workflow with ML modeling. The difference in sample size reduction may not be material at the end of the day and may not justify the additional complexity.
Consideration 2: Obtaining the Proper Training Data
Assuming that our company has decided upon the covariates they both should and could collect, they must now gather the historical data to train the ML prognostic model. It is important to emphasize that the purpose of this model is to predict what would happen to patients in the trial if they received the control. One must therefore ensure that the trained model applies directly to the population and intervention that will be studied, which takes precedence over optimizing predictive performance in the historical data sample. Data curation in this step is necessary but potentially fraught, as the historical data must meet a stringent set of conditions: (i) the patients must have received the exact control regimen used in the trial, (ii) the patients must meet the inclusion-exclusion criteria of the trial, (iii) the covariates that the company wishes to model are present in the data and have been measured in the same manner, and (iv) the outcome is measured by the same criteria and at the same frequency as planned in the trial.
There are two categories of data one can use to train the ML models. The first is randomized controlled trial (RCT) data, where a drug was compared to a control treatment via randomization. The second is observational data, also called “real-world data” (RWD), from registries, insurance claims, or electronic health records (EHRs). Unlike RCT data, there is no randomization; rather, cohorts of patients are simply tracked over time as they are treated in the health system. Considering the curation conditions I outlined in the previous paragraph, only RCT data can reasonably be used to train prognostic models. This statement may receive pushback, especially because observational data is much easier to access than RCT data (particularly RCT data on competing products) and is more likely to be available in the large quantities needed to adequately train ML models. Thus, PROCOVA may be limited in many cases. I should reiterate that the purpose of the prognostic model is to predict the outcomes of patients in the planned study, which will ultimately be used to make a strong case for the efficacy and safety of the novel drug.
To explain why observational data is suboptimal for PROCOVA, let us return to the curation conditions I listed above. In particular, I focus on the first two that require a match between the observational data and planned trial for (i) the control intervention and (ii) the profile of patients being treated.
For the first condition, well-controlled trials should always compare the experimental treatment to the standard of care, which is carefully defined in the trial’s protocol. “Standard of care” is a multifaceted, all-encompassing term that includes subsequent care after receiving the treatment, even if that “treatment” is placebo. As an illustrative example in CRC, physicians treat patients initially showing symptoms with “first-line” treatments and, if those fail, “second-line” treatments are pursued. Even if our company’s drug is a first-line treatment, investigators in a trial are responsible for managing patients’ care if the first line fails, not only for ethical reasons but also because that subsequent care will impact long-term efficacy and safety data.
However, when it comes to standard of care in the real world, practical considerations such as cost, physician ability, and access mean that patients may not receive the highest quality of care available, or even the same dosage at the same time intervals. In addition, patients are not blinded to the treatment they are receiving, and compliance is often difficult to track. Consider also that the standard of care constantly shifts over time as new evidence is presented to the medical community. This means there may be time-varying effects on the control of interest and concomitant medications that one must account for, which can be difficult to capture with observational data. While such developments are rigorously discussed within the various committees that monitor clinical trials, there is often no such consensus in the real world. These factors all limit the generalizability of RWD-trained prognostic models to randomized trials. This also raises the possibility that protocol changes mid-trial will require retraining of a prognostic model, compromising the initial sample size estimates.
The type of patients being treated is defined by the inclusion-exclusion criteria of the trial. Matching these criteria requires that the same variables, or suitable proxies for them, used to select patients in our company’s trial are present in the observational data. Even screening for eligible patients may differ between the real world and a trial. For CRC, a patient may be screened via colonoscopy, sigmoidoscopy, or computed tomography, with each method having a different diagnostic accuracy. Usually, a trial will dictate that the most accurate method be used; yet, in the real world, patients weigh the costs and benefits and choose accordingly. This means there may be significant heterogeneity in false-positive and false-negative patients in the historical sample that is correlated with the screening methods used. Consequently, the population eligible to receive the first-line treatment differs between the trial and the RWD.
Without making restrictive assumptions or oversimplifying the study protocol, observational data appears untenable for training prognostic models. Of course, this is the best-case scenario that assumes there are no issues with data quality, which is usually much worse in observational data than in clinical trials. Many of the same points have also been raised in FDA guidance documents on historical control trials.[v] There are sensitivity analyses that can quantify how results depend on assumptions about the aforementioned issues, but, ultimately, sensitivity analyses are subjective and would have to cover a large number of scenarios. This only adds more overhead and complexity to regulatory approval when the key selling point is efficiency and cost-savings. It is worth remembering that simply adjusting for covariates is always available free of charge and requires no data curation or training of ML models.
Scientifically, the discussion so far motivates some interesting research questions. I am unaware of any literature that compares the predictions of ML models trained only on observational data versus those trained on RCT data. First, it would be helpful to quantify the difference in, and sensitivity of, predictions under different strategies for curating observational datasets to a target trial. Second, building on those findings, an adjustment procedure might be developed to mitigate these differences under some conditions. The results from this research would give us a sober perspective on historical controls in the context of prediction modeling.
Consideration 3: Estimation and Interpretability with Non-continuous Endpoints
A large proportion of clinical trials, if not the lion’s share, are event-driven (e.g. disease progression or death) and thus rely on binary or time-to-event endpoints as opposed to continuous ones. Thus, statistics like the odds ratio (OR), risk ratio (RR), risk difference (RD), or hazard ratio (HR) are commonly utilized. When this is the case, significant caution must be exercised when using PROCOVA. When the endpoint is non-continuous, adjusting for prognostic factors will still increase the power but may change the interpretation of the treatment effect due to a mathematical phenomenon called “non-collapsibility.”[vi] This is important because the use of PROCOVA will influence what information the medical community can gather from trial results.
First, I must review some technical details about covariate adjustment. If one does not adjust for any factors, one obtains the “marginal,” or average, treatment effect for the entire trial population. However, when adjusting for the prognostic variable, this is no longer the case when estimating the OR or HR. Rather, a conditional treatment effect is obtained, comparing a patient from the treatment group to a patient from the control group who share the same value of the prognostic variable.[vii],[viii] For the RD and RR, the marginal interpretation is maintained so long as there is no treatment effect heterogeneity – that is, if one created bins of the prognostic score and estimated the treatment effect within each bin, the effect would be the same across all bins. Whether this assumption is met depends on the disease area and is largely untestable.
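A toy numerical example (the risks are made up purely to illustrate the arithmetic) shows how the odds ratio fails to collapse even when the conditional effect is identical in every stratum.

```python
# Toy illustration of non-collapsibility: the odds ratio is 6 within
# each prognostic stratum, yet the pooled (marginal) odds ratio is
# smaller, even with equal-sized strata and no confounding.
def odds(p):
    return p / (1 - p)

# Control-arm risks in a low-risk and a high-risk stratum (hypothetical).
control = {"low": 0.10, "high": 0.50}
# Treated-arm risks chosen so the conditional odds ratio is exactly 6.
treated = {s: 6 * odds(p) / (1 + 6 * odds(p)) for s, p in control.items()}

for s in control:
    print(s, odds(treated[s]) / odds(control[s]))  # 6.0 in both strata

# Marginal risks, assuming the two strata are the same size.
p_control = sum(control.values()) / 2   # 0.30
p_treated = sum(treated.values()) / 2   # ~0.63
print(odds(p_treated) / odds(p_control))  # ~3.95, not 6
```

Even though the treatment multiplies the odds by six in both strata, the population-level odds ratio is closer to four – no bias is involved, only the mathematics of the odds scale.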
Going back to our CRC drug, it is common practice to measure the time-to-event endpoint progression-free survival (PFS), which measures how long a patient lives without the CRC progressing from its current state. The treatment effect estimate will therefore be based on the HR, which, in simple terms, compares the risk of CRC progression between the treatment groups over time. The marginal treatment effect is the HR for the entire CRC population recruited into the trial. In contrast, the conditional treatment effect obtained with PROCOVA tells us the HR when comparing patients across treatment arms who share the same characteristics. The marginal effect is useful for policy-making (e.g. Medicare deciding whether to reimburse) and safety profiles, while the conditional effect gives more information on individual treatment effects, which may be more useful in medical practice (see ref. viii). Furthermore, conditional effects are more easily compared across trials with different patient populations (e.g. geographic areas), a property known as “transportability.”[ix],[x]
The debate over whether marginal or conditional estimands should be preferred is far from settled, but even if one decides to pursue a conditional estimate, PROCOVA may muddy the waters. If one did not summarize all prognostic variables in a single score, treatment effects in different subgroups based on patient characteristics could be calculated, which physicians could then use to gauge the cost-benefit for the patients they see. The issue with PROCOVA is that one no longer maintains that ease of interpretation, and doctors will very likely not be computing prognostic scores of their own. As a result, for the results to be useful to the medical community, work would need to be done to adequately describe what each level of the prognostic score means and which patients fall within it.
PROCOVA arguably makes trials less transportable, not more. This is crucial at a time when the replicability of scientific results has been put under question.[xi],[xii] Furthermore, transportability is important when comparing competing products across different companies. Ultimately, any given PROCOVA trial’s prognostic score is specific to that trial because the model is tailored to that specific population. If the model were trained on a different population to match a similar study’s inclusion-exclusion criteria, it would not be unreasonable to expect the prognostic model itself to differ – for example, the effect of diet may be weighted differently across geographical areas in CRC settings – and thus the interpretation of the conditional treatment effect would differ as well. In contrast, established risk scores like the Charlson Comorbidity Index are derived from the same model across studies and do not face this problem. Lastly, one of the main appeals of ML is that one can re-train models with more data to ostensibly achieve better performance. This means that results from trials with prognostic models trained on older data will not be comparable to those trained on newer data. If PROCOVA were the standard methodology for clinical trials, this would be a pervasive and serious scientific issue.
Putting it All Together: Have We Reached the Future of Clinical Trials?
Unlearn has put together an impressive set of technical papers that demonstrate the potential of PROCOVA. Yet, as I have highlighted, multiple practical barriers stand between PROCOVA and widespread adoption. First, there must be a sufficient number of covariates that can be used to build a model that predicts prognostic outcomes; this requires additional research and clinical validation on the company’s part. Second, the historical data must be available in a large enough quantity and must match the patients in the current trial – observational data is largely inappropriate for these purposes, while RCT data can be difficult to obtain. Third, the endpoint must be amenable to covariate adjustment: with non-continuous outcomes, companies must first decide whether a marginal or conditional effect is desirable and then overcome the interpretability issues that PROCOVA introduces. One large unanswered question is whether PROCOVA is compatible with other cost-saving trial designs like interim analyses, which stop trials early if substantial benefit or harm is demonstrated.
With these considerations, it is important to contextualize PROCOVA within the scope of competing methodologies. Covariate adjustment is free, its properties are well-researched, and it does not require historical data. Additionally, validated prognostic scores already exist, like the Framingham risk score in cardiology or genetic-based polygenic risk scores in oncology; custom-built prognostic scores may be unnecessarily reinventing the wheel. Lastly, it is not clear what prevents large companies with built-out data infrastructure from creating a custom version of PROCOVA themselves.
Ultimately, for any statistical method to have compelling upfront value, a company needs to have the utmost confidence in the methodology prior to patient recruitment. If one recruits too few patients, the trial will be underpowered and may miss a beneficial treatment effect that is indeed there. Phase 3 trials are often pivotal and have large financial implications for biotech start-ups and pharma giants alike. This leads many executives to be risk-averse rather than optimistic about novel methodology – a cultural fact that may be extremely difficult to change. While covariate adjustment may be “suboptimal,” it is simple and, for that reason, a trade-off often worth making. While the merits of covariate adjustment continue to be debated in the statistical community, Unlearn is on a mission to trailblaze new ways of conducting clinical trials. The medical community may simply not be ready to adopt such developments. It may be beneficial for Unlearn to publish work that further contextualizes the financial and operational efficiencies of utilizing PROCOVA.
Join the “Machine Learning in Healthcare” Substack for my next blog post where I discuss Unlearn’s cutting-edge iteration on PROCOVA called DigitalTwins.
[i] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7295430/
[ii] https://www.linkedin.com/pulse/top-secret-plan-ai-medicine-part-1-charles-fisher/
[iii] https://www.unlearn.ai/resources/summary-of-the-ema-september-2022-qualification-opinion-for-procova-tm
[iv] https://www.degruyter.com/document/doi/10.1515/ijb-2021-0072/html
[v] https://www.fda.gov/media/164960/download
[vi] https://www.jstor.org/stable/2676645
[vii] https://journals.sagepub.com/doi/full/10.1177/0049124121995548
[viii] https://www.tandfonline.com/doi/full/10.1080/19466315.2023.2292774
[ix] https://pubmed.ncbi.nlm.nih.gov/32978962/
[x] https://www.fharrell.com/post/ipp/
[xi] https://www.nature.com/articles/483531a
[xii] https://jamanetwork.com/journals/jama/fullarticle/201218