The Perils and Pleasures of Modeling
The term ‘model’ is much in the news, and I’m not talking about @RightAngles’ trade. It’s the term apparently favored by the media to describe a general area that also goes by other names: cybernetics, system dynamics, advanced statistics, simulation, control theory, and others. Drawing on some academic and professional background in the domain, I offer this (inevitably simplified) attempt to sketch its limits, so you can be smarter than the average journalist.
So, simplifying, as warned: There are two types of models. One is broadly statistical in approach. The other attempts to be more mechanistic.
And there are two major uses of models. One is descriptive: What’s going on here? The other is control: What can we do about it?
Statistical Modeling
This may also be labeled curve fitting, black-box models, deep learning, stochastic models, and more. It means taking as large a sample as possible of system inputs over time, and correlated outputs over time, and building a statistical description of how they relate to one another.
The farthest the mass media go into this territory is the canonical bell curve: “Here is the distribution of salaries for purple humped clerics. Here is the distribution for green crested clerics. They are different -> discrimination!” Having tried to explain the output of complex statistical models to state-level legislators, I have a bit of empathy.
In our current situation, the best-known statistical model is the IHME model, being used by both media and government to estimate where the pandemic is headed and, importantly, what resources will be required to meet it. IHME is only slightly more complex than the standard bell curve model; it uses something called a logistic curve, or S-curve. Statistical modeling is also widely used in another domain temporarily shoved off the front pages: climate.
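For readers who have not met an S-curve before, here is a minimal sketch of a generic logistic curve. It is not the IHME formulation; the function name and all three parameter values are made up for illustration. It shows the one property that matters for forecasting: cumulative totals level off at a ceiling, and the daily increments peak around the midpoint.

```python
import math

def logistic(t, ceiling, rate, midpoint):
    """Cumulative total at time t for a generic logistic (S-shaped) curve."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))

# Hypothetical parameters, chosen only for illustration.
CEILING = 1000.0   # level the cumulative total flattens out at
RATE = 0.3         # steepness of the rise, per day
MIDPOINT = 30.0    # day on which half the ceiling is reached

cumulative = [logistic(day, CEILING, RATE, MIDPOINT) for day in range(61)]

# Daily increments are the differences between consecutive cumulative totals.
daily = [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]
peak_day = max(range(len(daily)), key=daily.__getitem__)

print(f"half the ceiling is reached on day {MIDPOINT:.0f}")
print(f"largest daily increment starts on day {peak_day}")
```

Fitting such a curve to real data means estimating the ceiling, rate, and midpoint, and the forecast of the final total is only as good as the estimate of the ceiling.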
Why use this? It’s easy to get running – just start watching and recording what’s going on. No need for fancy experiments to isolate cause and effect – which might not be possible anyway – just watch the trends. You can refine things as you go along and get more data. Note that IHME is doing exactly that as more data comes in from states and from countries that are further along in the pandemic. These are very compelling arguments when you are under the gun for forecasts and lives depend on it.
What can go wrong? Just a few things…
Biased or inaccurate sampling. All statistical techniques depend on having a representative sample of the domain in question. What happens if some of the data going into the model has been deliberately perturbed (*cough* China *cough*)? What happens if your sampling space, say South Korea or northern Italy, has economic or social practices that differ from where you are attempting to forecast, North America? Nothing good.
Under-sampling and over-extrapolation. These often go together. An unbiased statistical model may be good in areas where you have lots of data, but fall apart outside that sample space, quite a problem if you are trying to forecast extreme conditions. Climate models are notorious for this, using techniques like principal components analysis on limited historical records, and attempting to extrapolate the results into extreme conditions of CO2 and temperature.
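A toy illustration of over-extrapolation, under assumptions that are entirely mine: the "true" process is a logistic curve with made-up parameters, and the modeler only sees the first eleven days, where the data look exponential. An exponential fit matches that early window almost perfectly and then runs wildly past the ceiling once it is pushed outside it.

```python
import math

def logistic(t, ceiling=1000.0, rate=0.3, midpoint=30.0):
    """Hypothetical 'true' process; parameters are illustrative only."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# The modeler only has data from days 0 through 10.
early_days = list(range(11))
early_counts = [logistic(day) for day in early_days]

# Fit an exponential by fitting a straight line to log(count).
slope, intercept = fit_line(early_days, [math.log(c) for c in early_counts])

def exponential_forecast(t):
    return math.exp(intercept + slope * t)

# Worst relative error inside the sampled window.
in_sample_error = max(abs(exponential_forecast(d) - c) / c
                      for d, c in zip(early_days, early_counts))

forecast_day_60 = exponential_forecast(60)
truth_day_60 = logistic(60)

print(f"worst in-sample relative error: {in_sample_error:.4%}")
print(f"day 60 forecast: {forecast_day_60:,.0f}  truth: {truth_day_60:,.0f}")
```

Nothing in the early data warns you about the ceiling. The fit statistics inside the window look excellent right up until the forecast is off by orders of magnitude.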
Overfitting and incorrect model assumptions. Again, these often go together. It’s an aphorism in the field that you can fit an elephant with enough parameters, meaning roughly that you can always pile on fudge factors to conceal the fact that your underlying system concept is wrong. Hockey sticks come to mind. Simplified statistical epidemic models may fall apart if we try to restart an economy without reaching a steady-state of virus.
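The elephant aphorism can be shown with six data points. In this sketch (my own toy data, not from any real model) the underlying truth is a straight line plus a small alternating error. A two-parameter line fits the points roughly; a six-parameter polynomial passes through every point exactly and then falls apart one step beyond the data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def lagrange_interpolate(xs, ys, x):
    """Evaluate the unique polynomial through all (xs, ys) points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total

# Toy data: the truth is y = x, measured with a fixed alternating error.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
errors = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
train_y = [x + e for x, e in zip(train_x, errors)]

slope, intercept = fit_line(train_x, train_y)

# Error on the data the models were fitted to.
line_train_error = max(abs(slope * x + intercept - y)
                       for x, y in zip(train_x, train_y))
poly_train_error = max(abs(lagrange_interpolate(train_x, train_y, x) - y)
                       for x, y in zip(train_x, train_y))

# Error on a point neither model has seen, where the truth is y = 6.
held_out_x, truth = 6.0, 6.0
line_error = abs(slope * held_out_x + intercept - truth)
poly_error = abs(lagrange_interpolate(train_x, train_y, held_out_x) - truth)

print(f"training error  line: {line_train_error:.2f}  polynomial: {poly_train_error:.2f}")
print(f"held-out error  line: {line_error:.2f}  polynomial: {poly_error:.2f}")
```

The polynomial wins on the training data and loses badly on the held-out point, because its extra parameters were spent fitting the measurement error rather than the underlying line.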
Hidden variables and lack of understanding. These are not the same thing, but they will both destroy attempts to use a statistical model for control purposes. Something you can’t currently observe (asymptomatic carriers?) may turn out to be a major driving variable. If you don’t really know what’s in the black box, attempts to drive its inputs to create desired outputs may not go well, particularly when there are inevitable time delays between taking an action and seeing its results.
Simulations
Also known as mechanistic models or just plain science. This is where you attempt to understand cause and effect in some detail, going into internal processes of the system as necessary, and build a mathematical replica. If you’re doing climatology you’ll model things like carbon fixing by plants depending on temperature and CO2 levels. If you are doing epidemiology, you’ll have things like social network density and incubation periods. In the current situation, the best-known model of this type comes from Imperial College London. This is a simulation that was constructed after the H1N1 pandemic and embeds a detailed model of epidemic spread that was retrospectively tested against the pandemic records.
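To give a flavor of what "mechanistic" means, here is about the simplest epidemic simulation there is: a textbook SIR (susceptible, infected, recovered) compartment model stepped forward one day at a time. It bears no resemblance in scale to the Imperial College simulation, and the population, contact rate, and recovery rate below are assumptions picked for illustration.

```python
def simulate_sir(population, initial_infected, contact_rate, recovery_rate, days):
    """Step a basic SIR model forward one day at a time (simple Euler steps)."""
    susceptible = population - initial_infected
    infected = float(initial_infected)
    recovered = 0.0
    history = [(susceptible, infected, recovered)]
    for _ in range(days):
        # Cause and effect, spelled out: infections need contact between
        # a susceptible person and an infected one.
        new_infections = contact_rate * susceptible * infected / population
        new_recoveries = recovery_rate * infected
        susceptible -= new_infections
        infected += new_infections - new_recoveries
        recovered += new_recoveries
        history.append((susceptible, infected, recovered))
    return history

# Hypothetical values: 1,000 people, one initial case, each case infecting
# 0.3 people per day early on and staying infectious for about 10 days.
history = simulate_sir(population=1000.0, initial_infected=1.0,
                       contact_rate=0.3, recovery_rate=0.1, days=200)

peak_day = max(range(len(history)), key=lambda day: history[day][1])
final_susceptible, final_infected, final_recovered = history[-1]

print(f"infections peak on day {peak_day}")
print(f"never infected after 200 days: {final_susceptible:.0f}")
```

Because each term stands for a mechanism, you can ask control questions directly, for example by lowering contact_rate partway through the run, which a pure curve fit cannot answer.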
Why use this? In a word, understanding. If you have some validation of cause and effect mechanisms, you are on firmer ground trying to reach beyond your previous experiences, and in coming up with control strategies, which are both perilous with purely statistical models.
What could go wrong? Just a few things…
Taking the model out of context. What worked very well for Carboniferous forests may not work when we have things like managed tree farms. H1N1 is all well and good, but the Wuflu isn’t actually a flu and propagates differently.
Time, we have no time! Understanding takes time and often controlled experiments, and often neither is or will ever be available before decisions must be made.
Incompleteness. There are very few simulations of any complexity that are completely mechanistic. There’s always some statistical modeling buried in there. The Imperial College model doesn’t actually have all the churches, schools and airports described; instead, it has a ‘synthetic population’ generated in accordance with a statistical description. Components like that are subject to all the problems described above for statistical models.
There’s no neat conclusion to this post. None of the models being tossed about are completely right or wrong. They are incomplete. This might give you some sympathy, beyond what the MSM spin will ever provoke, for those modelers being sweated by decision-makers who have trillions of dollars and thousands of lives on the line.
Published in Healthcare
This so needs dissemination. I’ve been noticing the word “models” too, and couldn’t help being reminded of their use in the ongoing “Climate Change” debate. The hard fact is that they can be used to show whatever someone wants them to show.
Good post. I don’t think the average journalist or politician has any idea what people are talking about when they use the word ‘model’. (Well, this kind of model, at least)….It is just some sort of magical crystal ball developed by people with the Right Credentials.
Interestingly enough, there’s not much difference between Right Angle’s type of model and statistical models. Both are representations of some real thing in the world. Some models are more representative of reality than others, and some real events are well represented by models…
But those are quite rare.
As I’ve learned more about tailoring and pattern making in sewing, I’ve come to appreciate models and “generic”, average measurements. But these measurements only get you so far. The model shows off fashion in its best way, and general measurements capture a large group, never really fitting anyone perfectly. If I try to make something fit one specific person and “overfit” without enough allowance in the measurements, my real person would never be able to move or eat while wearing what I make.
So, the model is only useful as a narrow representative, and all my curve fitting should provide enough wiggle room for real life scenarios.
See, not so different.
Good post. You have a typo. It’s the IHME model, not IMHE. I made the same error myself in one of my prior posts on the subject.
Good catch, fixed. Thank you!
It is the habitual abuse of the correlation coefficient that would scare most observers. (Not all correlations are equal.)
And now to add an extra wrinkle. Listening to the description of the way the IHME model is constantly updated, based on the latest observed data, we are not even talking simple statistical modeling. It is, apparently, Bayesian. That is, there is a special area of econo/socio/biometric analysis, not to be confused with the basic “statistics” necessary to enter the subject area, that provides theoretical underpinnings for adding in information while your “statistical model” is running. You get to continuously improve, if all the conditions are right, your predictions.
I’m seeing that as a good thing. They bootstrapped the model using data reports of pandemic behavior out of other countries as their ‘priors’, did a minimum variance fit to their proposed logistic model, and then mapped the states against that model. If they are rerunning the fit as more data comes in, it may average out any bias in the original samples. Unfortunately, as I’ve watched that process over the last couple of days, their forecast of total deaths has trended up (perhaps an effect of washing out understated Chinese numbers??).
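The general idea of starting from a prior and revising it as data arrive can be sketched with the simplest textbook case, a Beta prior on a rate updated with counted outcomes. To be clear, this is not how IHME does it; the prior and the data batches below are invented for illustration.

```python
def update_beta(prior_alpha, prior_beta, events, trials):
    """Conjugate update of a Beta prior after observing events in trials."""
    return prior_alpha + events, prior_beta + (trials - events)

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical prior: equivalent to having seen 2 events in 100 cases
# elsewhere, i.e. a prior rate of 2 percent.
alpha, beta = 2.0, 98.0
print(f"prior estimate: {beta_mean(alpha, beta):.3f}")

# Hypothetical local data arriving in batches of (events, trials).
batches = [(10, 100), (12, 100)]
for events, trials in batches:
    alpha, beta = update_beta(alpha, beta, events, trials)
    print(f"after {events} events in {trials}: {beta_mean(alpha, beta):.3f}")
```

If the prior came from understated numbers, each new batch of honest data pulls the estimate upward, which is at least consistent with a forecast that keeps creeping up as it is rerun.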
I saw the title but didn’t have time to read the post – does this help?
Agreed, I’m always going to advocate Bayesian modeling. And totally not because Bayesian statistics essentially sums up all of my grad school work and now professional work…
Some interesting news about the coronavirus model developed at Imperial College London, the projections of which (initially 500K+ dead, later reduced with different assumptions about social distancing) have received wide publicity and have influenced UK government policy…
Several researchers have apparently asked to see Imperial’s calculations, but Prof. Neil Ferguson, the man leading the team, has said that the computer code is 13 years old and thousands of lines of it “undocumented,” making it hard for anyone to work with, let alone take it apart to identify potential errors. He has promised that it will be published in a week or so….
https://www.wsj.com/articles/coronavirus-lessons-from-the-asteroid-that-didnt-hit-earth-11585780465?mod=searchresults&page=1&pos=1
My master’s thesis was a model of the first wall of a fusion reactor undergoing a plasma disruption. Because there were no analytical solutions for the equations I used, I wrote a computer program (i.e., a model) using the Crank-Nicolson method, a central finite difference approach.
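For anyone curious what a Crank-Nicolson program looks like, here is a minimal sketch on the simplest case I could pick: one-dimensional heat conduction in a bar with both ends held at zero. It is emphatically not the thesis code; the equation, grid, time step, and initial temperature profile are all assumptions chosen so that the result can be checked against a known exact solution.

```python
import math

def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm. lower[i] multiplies x[i-1], upper[i] multiplies x[i+1]."""
    n = len(rhs)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def crank_nicolson_step(u, r):
    """Advance u one time step; r = alpha * dt / dx**2, ends held at zero."""
    n = len(u) - 2  # number of interior grid points
    # Explicit half: central difference applied to the current values.
    rhs = [u[i] + 0.5 * r * (u[i - 1] - 2.0 * u[i] + u[i + 1])
           for i in range(1, n + 1)]
    # Implicit half: the same central difference applied to the new values.
    interior = solve_tridiagonal([-0.5 * r] * n, [1.0 + r] * n,
                                 [-0.5 * r] * n, rhs)
    return [0.0] + interior + [0.0]

# Hypothetical setup: unit-length bar, unit diffusivity, sine-shaped start.
ALPHA, NX, DT, STEPS = 1.0, 51, 0.001, 100
DX = 1.0 / (NX - 1)
r = ALPHA * DT / DX ** 2

u = [math.sin(math.pi * i * DX) for i in range(NX)]
u[0] = u[-1] = 0.0
for _ in range(STEPS):
    u = crank_nicolson_step(u, r)

# Exact midpoint temperature for this initial profile.
exact_mid = math.exp(-ALPHA * math.pi ** 2 * DT * STEPS)
error = abs(u[NX // 2] - exact_mid)
print(f"numerical: {u[NX // 2]:.5f}  exact: {exact_mid:.5f}")
```

With these settings r works out to 2.5, well past the 0.5 limit at which a purely explicit scheme goes unstable, which is a large part of why the method is popular.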
One of the things I had to do was a sensitivity analysis, which was to vary the input parameters one at a time by 10%. This way, I could determine what inputs were most important when running the program to get results. However, many of the input parameters were coefficients for the equations. Once I realized this, I knew I could make whatever outcome I wanted (within reason) if I could justify the parameters used. When I coupled this with knowledge gained by reading How To Lie With Statistics by Darrell Huff, I knew I was home free.
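The one-at-a-time procedure described above is short enough to sketch. The helper name and the toy model are hypothetical stand-ins; any function that takes named numeric inputs and returns a single number would do.

```python
def one_at_a_time_sensitivity(model, params, fraction=0.10):
    """Bump each input by `fraction`, one at a time, and record the
    relative change in the model output."""
    baseline = model(**params)
    effects = {}
    for name, value in params.items():
        perturbed = dict(params)
        perturbed[name] = value * (1.0 + fraction)
        effects[name] = (model(**perturbed) - baseline) / baseline
    return effects

def toy_model(a, b, c):
    """Hypothetical stand-in for a real simulation."""
    return a * b ** 2 + c

effects = one_at_a_time_sensitivity(toy_model, {"a": 2.0, "b": 3.0, "c": 1.0})

# Rank inputs from most to least influential.
ranking = sorted(effects, key=lambda name: abs(effects[name]), reverse=True)
for name in ranking:
    print(f"{name}: {effects[name]:+.1%} change in output for a +10% input change")
```

The ranking tells you which coefficients most need justifying. It is also, as noted, a map of which knobs to turn if you want a particular answer.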
The bottom line is this: many of those jokes and sayings about statistics are true:
“If you torture the data long enough, it will say whatever you want.”
“There are lies, damn lies, and statistics.”
“Facts are stubborn things, but statistics are pliable.”
And I believe this one quote sums it all up:
“Statisticians, like artists, have the bad habit of falling in love with their models.”
Drop this silly constraint from your resume and you have the makings of a rather serviceable climate scientist.
I posted this EconTalk from last year on another post, about how bad Google’s H1N1 models were in 2009. Gerd Gigerenzer instead used flu-related doctor visits in a region from the two previous weeks and tested it. Guess what: their model was better at predicting the flu than Google’s peer-reviewed (Nature) model.
He explains Google’s model below:
See the OP, under ‘overfitting’….
the website is healthdata.org which is easier to remember
45 variables?
the curse of multidimensionality
more variables = higher r^2 or correlation
more variables = less predictive value
the devil is in the assumptions underlying the model