The Evidence Base
May 15, 2023
Machine learning (ML), an application of artificial intelligence, is increasingly being investigated for potential roles in healthcare, including use in diagnostics, patient monitoring and drug discovery. Predictive analytics is another possible application, for example using ML to analyze real-world data (RWD) sources such as electronic health records to predict disease risk in a population.
In this interview, we speak to Jessica Paulus (VP of Research, OM1 Inc.) about the work of Costas Boussios (Chief Data Scientist, OM1 Inc.) and colleagues, presented at ISPOR 2023 (7–10 May 2023, Boston, MA, USA), on the use of ML to estimate scores for validated measures in autoimmune diseases and their potential for future research studies.
Please could you introduce yourself and your organization?
I am Jessica Paulus, an epidemiologist and Vice President of Research at OM1 Inc., a real-world evidence (RWE) health technology company headquartered in Boston (MA, USA).
In your research presented at ISPOR 2023 [1], you developed ML models to estimate disease activity measures in RWD sources. What are the current challenges in extracting this sort of information from RWD?
I will begin by talking about why we endeavored to take on this challenge. There are measures of patient disease activity that are critical for understanding the burden of a patient's condition and how well potential therapies may be helping, but they are just not captured as well as one might expect in RWD.
In this project we focused on disease activity scores that are very important for studying the effectiveness of therapies in several autoimmune conditions. Clinical practice guidelines suggest that these disease activity measures are useful indicators of a patient's burden of disease, and they are routinely captured in clinical trials to measure the efficacy of treatments; however, they are often not easily accessible in RWD due to a host of issues, including clinical practice variation and technological challenges related to how and when these measures are captured in usual clinical care.
This was the reason we took on the challenge of trying to estimate some of these disease activity scores. The challenge has several dimensions. The first is ensuring you have RWD that are clinically rich enough to support estimating these scores. This might include information on the patients' symptoms, such as changes in balance, gait or sleep, other medications they are receiving, and their other comorbidities. Having very rich clinical data – often available only in unstructured clinical notes – is a prerequisite to developing high-performing models that estimate these scores with high validity.
A second challenge is being able to parse the clinical notes where this information is typically stored. A lot of the valuable information needed to estimate these scores sits within the clinical narratives – the notes that clinicians make while they are having an encounter with a patient. This poses a technical challenge: identifying the notes that are most useful to mine for information among the many notes that are taken (a simple illustration of this kind of note triage follows the challenges below).
A final challenge involves making sure these models perform very well, with high correlation between the observed disease activity score and the estimated score. This is a technical challenge: building a highly predictive model that estimates these scores with fidelity, so that ultimately the algorithms perform well on patients who do not have an observed score.
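As a purely illustrative sketch of what that note-triage challenge involves, the snippet below flags notes that look worth mining with a simple keyword pattern. The terms and threshold are hypothetical stand-ins; a production system of the kind described here would rely on trained medical language processing models rather than a hand-written pattern.

```python
import re

# Hypothetical terms suggestive of disease-activity content; a real system
# would use trained medical language processing, not a hand-written pattern.
ACTIVITY_TERMS = re.compile(
    r"\b(flare|remission|relapse|swollen joints?|gait|fatigue|disease activity)\b",
    re.IGNORECASE,
)

def is_candidate_note(note_text: str, min_hits: int = 2) -> bool:
    """Flag notes likely to contain disease-activity information."""
    return len(ACTIVITY_TERMS.findall(note_text)) >= min_hits

notes = [
    "Patient reports a flare with two swollen joints and worsening gait.",
    "Front desk confirmed the next appointment and updated insurance details.",
]
print([is_candidate_note(n) for n in notes])  # [True, False]
```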
How were your ML models developed?
We used a combination of medical language processing and machine learning techniques, coupled with an amazing repository of RWD that includes rich clinical notes – a key ingredient in successfully executing this work. By contrast, if you were to work with data that only included insurance claims information, for example, building a good predictive model would be a much greater challenge.
The process involved identifying patients who have both an observed score and clinical notes. These notes might include information on the patient's symptoms, what medications they are on, signs of progression or stabilization of their condition, social factors, and many other clues within the clinical narrative that can be highly predictive of the disease activity score. The model-building technology also features a critical layer of clinical review to ensure that the identified predictive factors comport with expert clinical judgment. The last step is using statistical measures to assess how well the algorithm performs in predicting the observed scores.
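The interview does not specify the modeling stack, so purely as an assumption-laden sketch, the general shape of such a pipeline might look like the following: train a text-based regression model on patients who have both notes and an observed score, then check agreement between observed and estimated scores on held-out patients. The data, features and model choices (TF-IDF, ridge regression) are illustrative stand-ins, and the clinical-review layer described above has no code analogue here.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-ins for patients who have BOTH a clinical note and an
# observed disease activity score (the training population described above).
templates = [
    ("severe flare, multiple swollen joints, poor sleep, worsening gait", 7.5),
    ("mild morning stiffness, stable on current therapy", 3.0),
    ("in remission, no active symptoms reported at this visit", 0.8),
]
notes, scores = [], []
for text, base in templates * 40:
    notes.append(text)
    scores.append(base + rng.normal(0, 0.5))  # noisy "observed" score

X_train, X_test, y_train, y_test = train_test_split(
    notes, scores, test_size=0.25, random_state=0
)

# Text features extracted from the notes feed a regression model that
# estimates the disease activity score.
model = make_pipeline(TfidfVectorizer(), Ridge())
model.fit(X_train, y_train)

# Last step from the interview: statistical agreement between observed and
# estimated scores on held-out patients.
estimated = model.predict(X_test)
r, _ = pearsonr(y_test, estimated)
print(f"Pearson r between observed and estimated scores: {r:.2f}")
```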
Your study focuses on four autoimmune diseases – why were these selected?
These conditions – rheumatoid arthritis, systemic lupus erythematosus, ankylosing spondylitis and multiple sclerosis – were selected because they all feature disease activity scores that are validated and used by clinicians to make decisions about patient care. As outlined in the poster, these scores are very common outcome measures of how well a patient is doing, both in clinical trials and in real-world clinical practice. They are useful metrics for providers making decisions with patients about selecting or altering the therapies and care they receive, so the scores have a very high degree of clinical utility. They are also critical for the research enterprise and for those developing and evaluating novel therapies, because they can be a key measure of potential effectiveness.
In a nutshell, can you describe some of the findings from the study?
One of the key findings is that we were able to develop estimation algorithms for each of these four conditions with very high performance characteristics. Some of the area under the curve (AUC) statistics were quite high – on the order of 0.9 for some of these models – which indicates these algorithms were highly accurate in predicting the observed scores. Another really powerful finding is that we confirmed some of these measures are not captured in RWD to the degree one might expect, but by using the estimated scores we can increase the available sample size by a factor of 15 or 20. With the estimated scores, we can include many more patients in our studies, which has a really meaningful impact on generalizability as well as statistical power.
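For context on those AUC figures, the snippet below shows how such a statistic is typically computed on held-out patients. One assumption to flag: AUC applies to binary outcomes, so it presumably reflects the score binarized into categories (e.g., high versus not-high disease activity); the interview does not give the thresholds, and the numbers below are hypothetical.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical held-out patients: 1 = high disease activity per the observed
# score, 0 = otherwise; probabilities come from the estimation model.
observed_high_activity = [1, 0, 1, 1, 0, 0, 1, 0]
estimated_probability = [0.92, 0.10, 0.78, 0.85, 0.35, 0.20, 0.64, 0.15]

auc = roc_auc_score(observed_high_activity, estimated_probability)
print(f"AUC: {auc:.2f}")  # 1.00 here; values near 0.9 indicate strong discrimination
```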
What implications do these results have for the future use of ML in healthcare and RWD sources?
I think these technologies are extremely powerful and exciting. To begin, they allow us to further capitalize on the power of RWD by including many more patients in research studies who might otherwise have been excluded. This is attractive for a number of reasons. The simplest is statistical power: if you're including a million patients rather than 100,000 patients, you're able to study that condition with much more nuance. For example, in multiple sclerosis, having a much bigger starting base of patients will allow more precise estimation of treatment effectiveness in subgroups of those patients.

Statistical power is very meaningful, but there's also a validity advantage unlocked with this technology. You might have concerns about the representativeness of the patients who have an observed disease activity score. How do they differ from other patients? How might their providers differ from providers who do not capture the scores in their clinical practice?

This also has an important implication for the promotion of health equity in research, which is quite important to the work we do at OM1. Again, thinking about representativeness, we may be able to study underrepresented patients to a more meaningful degree by including more patients who lacked scores to begin with.
This poster reports on the development of these estimated scores. To have real translational potential, we need to ask the question: do the estimated scores perform as well as the observed scores in, for example, predicting a key outcome of interest, such as a relapse in multiple sclerosis? That's one of the important next steps for this work, and our results in this area have been really promising. We're going to present this at a conference later this year to showcase the translational utility of the estimated scores and their powerful use in pharmacoepidemiologic research.