Every now and then a scientific paper makes it to the news headlines. Recently, the paper ‘Learning the natural history of human disease with generative transformers’ appeared in Nature. The corresponding AI-tool, Delphi-2M, was praised by the media$ for its unprecedented forecasting power for human health. Let’s take a (not too) deep dive into the article to judge whether it lives up to this promise.
What does Delphi-2M do? Delphi-2M predicts the disease course for ~1,000 human diseases using a large language model (LLM) trained on UK Biobank registry data. The authors adapted the LLM so that it can handle a continuous time scale. Moreover, they combined it with a dedicated survival model to deal with competing risks (of multiple diseases) and time-to-event data. Disclaimer: I’m not an expert in LLMs, but I do highly appreciate this part of the work, as well as the careful effort the authors took to handle the registry data and to validate Delphi-2M on an independent Danish data set. But…
Do we really know Delphi-2M is accurate? The article claims that Delphi-2M is about as accurate as most of the separate models trained on individual diseases. That is indeed an achievement. But let us zoom in on how they evaluated performance.
Predicting the next event First, they claim about 76% accuracy in forecasting the next medical event, including death. This number is pretty useless. Hopefully, there are many fairly healthy individuals in the database. For them, it matters a lot when an event will happen; after all, we all die at some point. In fact, to deal with these healthy individuals, the authors invent a trick: they randomly add relatively short-term ‘healthy events’. Of course, this helps tremendously in increasing the model’s apparent performance for healthy people. But it’s fake. And it clouds the explanation and interpretation of Delphi-2M’s predictions, a topic that receives extensive attention in the paper. The toy simulation below shows how cheap this kind of accuracy is.
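To see why, consider a toy simulation (all numbers invented; nothing below comes from the paper). A ‘model’ that simply predicts the most frequent token already scores highly once artificial healthy tokens dominate the event stream:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary: 0 = artificial 'healthy' token, 1-5 = disease codes, 6 = death.
n_people = 10_000

def event_stream(with_padding):
    tokens = []
    for _ in range(n_people):
        seq = list(rng.integers(1, 7, rng.poisson(1.5)))  # a few real events each
        if with_padding:
            seq += [0] * rng.poisson(5)  # inject ~5 'healthy' tokens per person
        tokens.extend(seq)
    return np.array(tokens)

for with_padding in (False, True):
    stream = event_stream(with_padding)
    # Accuracy of the laziest possible model: always predict the modal token.
    acc = np.bincount(stream).max() / stream.size
    print(f"padding={with_padding}: majority-class accuracy = {acc:.2f}")
```

The absolute numbers are meaningless here; the point is that next-event accuracy tracks the composition of the target distribution, not forecasting skill.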
Long-term predictions A more important claim concerns the accuracy of the predictions after 10 years, which is said to drop by only 6 percentage points, to 70%. The problem here is that they evaluate the model as if it returns binary (yes/no) predictions, while in fact time-to-event models require evaluations that take the time aspect into account. Indeed, even after reading the ‘Methods’ section of the article it was not clear to me what they did. Consider diabetes. Did they evaluate whether someone will develop diabetes after 10 years from now, or whether someone has or will have diabetes 10 years from now? I really don’t know. The former requires excluding people who developed diabetes within 10 years from the evaluation (technically: from the risk set), whereas the latter includes short-term predictions and is hence a much easier task. The sketch below contrasts the two task definitions.
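To make the distinction concrete, here is an illustrative sketch of the two task definitions, with hypothetical onset times and ignoring censoring and competing risks:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
# Hypothetical disease onset times in years (np.inf = never develops it).
onset = np.where(rng.random(n) < 0.15, rng.exponential(12.0, n), np.inf)
t = 10.0

# Task A: onset *after* year 10, among those still disease-free at year 10.
# People with onset before year 10 leave the risk set.
at_risk = onset >= t
label_A = np.isfinite(onset[at_risk])

# Task B: onset *by* year 10 — short-term, easy-to-predict cases included.
label_B = onset < t

print(f"Task A: n={at_risk.sum():,}, event rate={label_A.mean():.3f}")
print(f"Task B: n={n:,}, event rate={label_B.mean():.3f}")
```

Under Task B, many positives are people whose diagnosis is imminent at prediction time, which any reasonable model picks up easily; a reported 70% therefore means something quite different under the two definitions.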
What does Delphi-2M not do?
Doctor, are you sure? Nowadays, long-term weather forecasts are often accompanied by uncertainty plumes. These convey one important message: beyond 10 days, one shouldn’t trust the prediction. Delphi-2M does not provide any uncertainty statements#. So, basically, you’re telling someone that they will likely develop cardiovascular disease without knowing how certain you are about that statement. Why does it not provide uncertainties? Because computing uncertainties is much more difficult than computing predictions. The absence of uncertainty quantification is a known shortcoming of many machine learning models, although there has been progress in recent years, for example through the development of conformal prediction methods&, sketched below.
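For illustration, here is a minimal split-conformal sketch for a single binary 10-year risk prediction; the calibration data are simulated, and nothing below reflects how Delphi-2M actually works:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical held-out calibration set: predicted 10-year risks + observed outcomes.
n_cal = 5_000
risk = rng.random(n_cal)
y = (rng.random(n_cal) < risk).astype(int)  # toy data from a well-calibrated model

# Split conformal: nonconformity score = 1 - predicted probability of the true label.
scores = np.where(y == 1, 1 - risk, risk)
alpha = 0.10  # target 90% coverage
q = np.quantile(scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)

def prediction_set(r):
    """All labels whose nonconformity score stays below the calibrated threshold."""
    labels = []
    if 1 - r <= q:  # score if the true label were 1
        labels.append(1)
    if r <= q:      # score if the true label were 0
        labels.append(0)
    return labels

print(prediction_set(0.95))  # confident: [1]
print(prediction_set(0.50))  # honest answer when uncertain: [1, 0]
```

The ambiguous output [1, 0] is exactly the honesty that a bare point prediction lacks.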
Will that treatment benefit me? The authors are somewhat cautious about causal claims, but also state that “An evident application of Delphi-type models is to support medical decision-making”. For medical decision making, however, one needs a different framework: a model that predicts the effect of a potential treatment. That is a predictive rather than a prognostic model. The FDA wrote a very nice note on the difference between the two. Without going into details: estimating the effect of a potential treatment is much more difficult, and more disease-specific, than building a prognostic model. Prognostic models like Delphi-2M can be part of such a predictive model (see the sketch below), but a lot more is needed to make the forecasts of use in the consulting room.
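As a minimal sketch of the difference, assuming simulated randomized data and a hypothetical prognostic score as input, a so-called T-learner fits one outcome model per treatment arm and contrasts their predictions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Simulated randomized trial. 'prog_risk' stands in for a prognostic score such
# as a Delphi-type 10-year risk; all names and numbers are hypothetical.
n = 20_000
prog_risk = rng.random(n)
treated = rng.integers(0, 2, n)
p_event = prog_risk * np.where(treated == 1, 0.5, 1.0)  # treatment halves risk
y = (rng.random(n) < p_event).astype(int)

# T-learner: fit one outcome model per arm; the prognostic score is a feature.
X = prog_risk[:, None]
m1 = LogisticRegression().fit(X[treated == 1], y[treated == 1])
m0 = LogisticRegression().fit(X[treated == 0], y[treated == 0])

def predicted_benefit(r):
    """Estimated absolute risk reduction for a patient with prognostic risk r."""
    x = [[r]]
    return m0.predict_proba(x)[0, 1] - m1.predict_proba(x)[0, 1]

print(f"benefit at low risk (0.2): {predicted_benefit(0.2):.3f}")
print(f"benefit at high risk (0.8): {predicted_benefit(0.8):.3f}")
```

Note the extra ingredients a prognostic model alone does not provide: treatment assignments and per-arm outcome models (and, outside randomized data, much more care about confounding).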
For whom is Delphi-2M useful? So, certainly not yet for doctors, nor for patients. In fact, I would argue its biggest stakeholders might be insurance companies and health policy makers. For them, such a supermodel is great for creating forecasts for large parts of the population. Uncertainties are less relevant in that setting, because one is averaging over large numbers of people. As mentioned by the authors, biases in the training set (the UK Biobank) remain relevant, as these do not disappear with larger numbers. But they can be assessed (and sometimes corrected for) by carefully comparing properties of the training set with one’s own target population, as illustrated below.
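As a toy illustration of such a correction, post-stratification reweights stratum-level forecasts from the training mix to the target population’s mix; all numbers below are invented:

```python
import numpy as np

# Toy post-stratification over four age strata. UK Biobank participants are
# known to be healthier than the general population; the shares are made up.
p_train  = np.array([0.10, 0.25, 0.40, 0.25])   # age-stratum shares, training set
p_target = np.array([0.30, 0.30, 0.25, 0.15])   # age-stratum shares, target pop.
risk     = np.array([0.02, 0.05, 0.11, 0.20])   # forecast 10-year risk per stratum

naive = p_train @ risk      # population risk if the training mix is assumed
adjusted = p_target @ risk  # same forecasts, reweighted to the target mix
print(f"naive: {naive:.3f}  adjusted: {adjusted:.3f}")
```

The forecasts themselves are untouched; only the population composition they are averaged over changes.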
Conclusion Delphi-2M is a great technical achievement that will likely stimulate the development and use of LLMs for longitudinal predictions in health care. But it is not of use in the clinical decision-making process (yet). For that, it requires appropriate validation and non-trivial adjustments to support trustworthy treatment decisions.*
Notes
* See also this post by Maarten van Smeden, with discussion. It addresses other important issues, like (mis)calibration for rare diseases and the absence of references to joint models (which can also generate disease trajectories).
$ BBC article: ‘AI can forecast your future health – just like the weather’
# They actually provide uncertainty statements (confidence intervals) of the performance metrics. I doubt, however, whether these are correct as they are based on a questionable independence assumption, which renders these intervals too short.
& See ‘A gentle introduction to conformal prediction and distribution-free uncertainty quantification’.