COVID-19 and the epistemology of epidemiological models at the dawn of AI

Ellison, George (ORCID: 0000-0001-8914-6812) (2020) COVID-19 and the epistemology of epidemiological models at the dawn of AI. Annals of Human Biology, 47 (6). pp. 506-513. ISSN 0301-4460

PDF (Author Accepted Manuscript) - Accepted Version
Available under License Creative Commons Attribution Non-commercial No Derivatives.

PDF (Supplementary Material) - Supplemental Material
Available under License Creative Commons Attribution Non-commercial No Derivatives.

The models used to estimate disease transmission, susceptibility and severity determine what epidemiology can (and cannot) tell us about COVID-19. These include: ‘model organisms’ chosen for their phylogenetic/aetiological similarities; multivariable statistical models to estimate the strength/direction of (potentially causal) relationships between variables (through ‘causal inference’), and the (past/future) value of unmeasured variables (through ‘classification/prediction’); and a range of modelling techniques to predict beyond the available data (through ‘extrapolation’), compare different hypothetical scenarios (through ‘simulation’), and estimate key features of dynamic processes (through ‘projection’). Each of these models: addresses different questions using different techniques; involves assumptions that require careful assessment; and is vulnerable to generic and specific biases that can undermine the validity and interpretation of its findings. It is therefore necessary that the models used: can actually address the questions posed; and have been competently applied. In this regard, it is important to stress that extrapolation, simulation and projection cannot offer accurate predictions of future events when the underlying mechanisms (and the contexts involved) are poorly understood and subject to change. Given the importance of understanding such mechanisms/contexts, and the limited opportunity for experimentation during outbreaks of novel diseases, the use of multivariable statistical models to estimate the strength/direction of potentially causal relationships between two variables (and the biases incurred through their misapplication/misinterpretation) warrants particular attention.
Such models must be carefully designed to address: ‘selection-collider bias’, ‘unadjusted confounding bias’ and ‘inferential mediator adjustment bias’ – all of which can introduce effects capable of enhancing, masking or reversing the estimated (true) causal relationship between the two variables examined. Selection-collider bias occurs when these two variables independently cause a third (the ‘collider’), and when this collider determines/reflects the basis for selection in the analysis. It is likely to affect all incompletely representative samples, although its effects will be most pronounced wherever selection is constrained (e.g. analyses focusing on infected/hospitalised individuals). Unadjusted confounding bias disrupts the estimated (true) causal relationship between two variables when: these share one (or more) common cause(s); and when the effects of these causes have not been adjusted for in the analyses (e.g. whenever confounders are unknown/unmeasured). Inferentially similar biases can occur when: one (or more) variable(s) (or ‘mediators’) fall on the causal path between the two variables examined (i.e. when such mediators are caused by one of the variables and are causes of the other); and when these mediators are adjusted for in the analysis. Such adjustment is commonplace when: mediators are mistaken for confounders; prediction models are mistakenly repurposed for causal inference; or mediator adjustment is used to estimate direct and indirect causal relationships (in a mistaken attempt at ‘mediation analysis’). These three biases are central to ongoing and unresolved epistemological tensions within epidemiology. All have substantive implications for our understanding of COVID-19, and the future application of artificial intelligence to ‘data-driven’ modelling of similar phenomena. 
Nonetheless, competently applied and carefully interpreted, multivariable statistical models may yet provide sufficient insight into mechanisms and contexts to permit more accurate projections of future disease outbreaks.
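
The three biases described above can be illustrated with a small simulation. The following is a minimal sketch and is not taken from the article: the data-generating processes, the unit coefficients, the standard normal noise, the selection threshold and all function names are assumptions chosen purely for illustration. It uses simple linear regression, with adjustment for a third variable carried out by regressing both variables on it and then regressing the residuals on each other.

```python
import random


def mean(v):
    return sum(v) / len(v)


def slope(x, y):
    """Least-squares slope from regressing y on a single predictor x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(y, z):
    """Residuals of y after a simple linear regression on z."""
    b = slope(z, y)
    a = mean(y) - b * mean(z)
    return [yi - a - b * zi for yi, zi in zip(y, z)]


def adjusted_slope(x, y, z):
    """Coefficient on x when y is regressed on x and z together.

    Computed by residualising x and y on z (Frisch-Waugh-Lovell).
    """
    return slope(residuals(x, z), residuals(y, z))


def noise(rng, n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def confounding_demo(rng, n=20000):
    """Assumed structure: C causes X and C causes Y; X has NO effect on Y."""
    c = noise(rng, n)
    x = [ci + e for ci, e in zip(c, noise(rng, n))]
    y = [ci + e for ci, e in zip(c, noise(rng, n))]
    # Crude estimate is biased away from zero; adjusting for C removes it.
    return slope(x, y), adjusted_slope(x, y, c)


def collider_demo(rng, n=20000, threshold=1.0):
    """Assumed structure: X and Y are independent; both cause S (the collider).

    Only records with S above the threshold are 'selected' into the analysis.
    """
    x = noise(rng, n)
    y = noise(rng, n)
    s = [xi + yi + e for xi, yi, e in zip(x, y, noise(rng, n))]
    kept = [(xi, yi) for xi, yi, si in zip(x, y, s) if si > threshold]
    xs = [p[0] for p in kept]
    ys = [p[1] for p in kept]
    # Full sample: no association. Selected sample: a spurious negative one.
    return slope(x, y), slope(xs, ys)


def mediator_demo(rng, n=20000):
    """Assumed structure: X causes M and M causes Y (X acts only through M)."""
    x = noise(rng, n)
    m = [xi + e for xi, e in zip(x, noise(rng, n))]
    y = [mi + e for mi, e in zip(m, noise(rng, n))]
    # Unadjusted slope recovers the total effect; adjusting for M masks it.
    return slope(x, y), adjusted_slope(x, y, m)


if __name__ == "__main__":
    rng = random.Random(2020)  # arbitrary seed, for reproducibility only
    print("confounding (crude, adjusted for C):", confounding_demo(rng))
    print("collider (full sample, selected only):", collider_demo(rng))
    print("mediator (total, adjusted for M):", mediator_demo(rng))
```

Under these assumed structures, the crude estimate in the confounding example should sit well away from the true (null) effect until the common cause is adjusted for; the collider example should show an association among selected records that is absent from the full sample; and the mediator example should show a real total effect shrinking towards zero once the mediator is adjusted for. Real analyses rarely have the luxury of knowing the true structure, which is why the choice of adjustment set needs to be justified rather than assumed.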

1. These biases, and the terminology involved, may be challenging to readers who are unfamiliar with the use of causal path diagrams (such as Directed Acyclic Graphs; DAGs) which have been instrumental in identifying the different roles that variables can play in causal processes (whether as ‘exposures’, ‘outcomes’, ‘confounders’, ‘mediators’, ‘colliders’, ‘competing exposures’ or ‘consequences of the outcome’) and revealing hitherto under-acknowledged sources of bias in analyses designed to support causal inference. For what we hoped might offer accessible introductions to DAGs (and how [not] to use these) please see: Ellison (2020); and Tennant et al. (2019). For more technical detail on ‘collider bias’, ‘unadjusted confounding bias’ and ‘inferential mediator adjustment bias’ (and its related concern, the ‘Table 2 fallacy’), please refer to: Cook and Ranstam (2017); Munafò et al. (2018); Tennant et al. (2017); VanderWeele and Arah (2011); and Westreich and Greenland (2013).
