As many of you know, I’m organizing the 2013 CADD Gordon conference (July 21st to 25th, Mount Snow, Vermont). A Gordon conference is a rare opportunity to actually “confer”, and there’s a lot to talk about! The subject will be the use of statistics in molecular modeling. Not molecular modeling. The statistics of molecular modeling. Not statistics. The statistics of molecular modeling. I repeat myself because in the course of organizing the conference I’ve needed to. Many times!
“So you mean a conference on modeling with some statistics.” Er, no.
“You do realize this is the CADD Gordon meeting, not the Statistics Gordon conference, right?”
Yes, I do realize this. That’s why it’s a conference on statistics in molecular modeling. I’ve come to realize that this is a strange concept to most in our field. And that, actually, is the point: it shouldn’t be a strange concept. This summer, I hope to make a dent in that perception—not by talking about boring statistics, but about how to be better at what we do.
Hal Varian, the chief economist at Google, has said that the “sexy” job of the next decade will be that of statistician. When people laugh, he points out that few would have predicted computer engineering’s rise to preeminence in the 1990s. His rationale is straightforward: from finance to sports to politics, statistics has suddenly become the “it” thing. And what do we have in molecular modeling? Not much.

Take one very simple aspect of statistics: the concept of confidence limits, a.k.a. error bars. Simple, but fairly central to two aspects of modeling: (A) what the expected range in performance of a technique might be, and (B) whether method X is better than method Y with some confidence. It’s difficult to see why this would be a controversial or technically demanding requirement of publications in our area, and yet it is astounding how rarely it is seen or correctly interpreted. A classic example of the importance of (A) in the wider world is the 1997 Red River flood of Grand Forks, North Dakota. The predicted flood level from the National Weather Service (NWS) was 49 feet, leading the citizens of Grand Forks to prepare barriers of about 51 feet. What the NWS did not include was the error bar on that prediction, which was plus or minus nine feet! In fact, the river crested at 54 feet, causing an estimated $3.5 billion in damages. One explanation offered for the omission was that the NWS did not want to look uncertain! Yet they effectively left out the most important part of the prediction.
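To make the point concrete, here is a minimal sketch of (A) with entirely made-up numbers: given per-compound prediction errors from some method, report not just the mean error but its 95% confidence limits.

```python
# A minimal sketch of reporting a result with its confidence limits,
# not just the point estimate. The "errors" below are synthetic,
# standing in for per-compound prediction errors of some method.
import numpy as np

rng = np.random.default_rng(0)
errors = rng.normal(loc=0.0, scale=1.2, size=50)  # made-up prediction errors

mean_err = errors.mean()
std_err = errors.std(ddof=1) / np.sqrt(len(errors))  # standard error of the mean

# 95% confidence limits on the mean error (normal approximation)
lo, hi = mean_err - 1.96 * std_err, mean_err + 1.96 * std_err
print(f"mean error = {mean_err:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

The interval, not the point estimate, is what a reader needs to judge whether the method’s performance is distinguishable from anything else.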
Physics uses statistics too!
Of course, error bars in molecular modeling are not quite such a matter of life and death (we hope), but the desire to look more precise than we truly are does come into play. I think this is especially true in areas of modeling that claim to be physics-based. In fact, I recently heard one young modeler declare that he had no need to learn statistics because he did structure-based modeling! I can only assume this perception arises because real physics can have startling accuracy: precision tests of quantum electrodynamics, for instance, agree with theory to a few parts per billion. But this is irrelevant. By the evidence of any and all blind challenges, the physics-based approaches are just as imprecise in their predictions of our “wet” world as empirical methods. We have seen this repeatedly in SAMPL events. And it’s not as though physics itself doesn’t require good statistics. How else but a lack of statistical grounding can one explain the ubiquitous pronouncements in our journals of the superiority of one method over another, as opposed to the more nuanced and infinitely more useful, but perhaps less fundable, probability of superiority?
Error bars, or the lack of them, are just the tip of the iceberg. Take, for example, the quite common case of measurements that lie within some range, with a substantial subset of results outside of that range (e.g., affinity measurements). How are the out-of-range results treated? (A) Set to the limit of the assay, e.g. 10 millimolar or whatever; (B) Ignored. Clearly neither (A) nor (B) is quite ideal: the first introduces unknown error into any regression model, and the second leaves out information. Did you know that there are variants of R-squared, called pseudo-R-squareds, that allow you to use both? Shouldn’t this be something our field investigates?
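A toy simulation shows why neither option is innocent. Assume (hypothetically) a linear relationship with a hard assay cap; both clamping out-of-range values to the limit and dropping them bias the fitted slope downward relative to the truth.

```python
# Sketch of how out-of-range ("censored") measurements distort a fit.
# Everything here is synthetic: y depends linearly on x with slope 0.8,
# but the assay can only report values up to a cap.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y_true = 0.8 * x + rng.normal(0, 0.5, 200)

cap = 6.0                       # hypothetical assay upper limit
censored = y_true > cap

# Option (A): clamp out-of-range values to the assay limit
y_clamped = np.where(censored, cap, y_true)
slope_a = np.polyfit(x, y_clamped, 1)[0]

# Option (B): drop out-of-range values entirely
slope_b = np.polyfit(x[~censored], y_true[~censored], 1)[0]

print(f"true slope ~0.8; clamped fit: {slope_a:.2f}; dropped fit: {slope_b:.2f}")
```

Censored-regression methods (and the pseudo-R-squareds that go with them) exist precisely to use both in-range and out-of-range results without this bias.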
Speaking of the limitations of experiments, when was the last time you saw a modeling paper that actually acknowledged such a thing? There was a nice paper by the Abbott crew (Hajduk, Muchmore & Brown, DDT, 14, April 2009) that tried to quantify the effect of experimental noise and assay range on R-squared, which was unfortunately widely ignored as far as I can tell. Crystal structures permit an estimation of coordinate precision, but this is rarely used in assaying the RMSD of pose prediction. When is electron density used, as it should be, to quantify this? Even the simple procedure of sampling from the potential experimental uncertainty is lacking. How much of the widely discussed concept of “activity cliffs” is due to experimental error? In the world according to Reverend Bayes there are ways to interpret experimental results that include our expectations, yet this is never discussed, nor is the concept of experimental design (in any formal sense).
Suppose you have multiple measurements, each with a different reported error. How do you average them? The standard answer is to weight each by the inverse square of its error, so that more uncertain measurements count for less. But suppose we don’t know the error for one of the measurements, and that measurement is askew from the others: is there not a risk that including it in the average actually worsens the estimate? “Any measurement is better than none” is a falsehood, yet it has been the mantra driving the industry toward cheaper but less accurate assays for years. Meanwhile, no measurement does not mean no information; molecular similarity, for example, often gives you a respectable null-model estimate.
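The inverse-variance recipe is two lines of arithmetic; here it is with made-up pIC50 values from three hypothetical assays:

```python
# Minimal sketch of inverse-variance weighting: each measurement is
# weighted by 1/sigma**2, so noisier values count for less.
# The numbers are invented for illustration.
import numpy as np

values = np.array([6.1, 6.4, 5.9])   # e.g. pIC50s from three assays (made up)
sigmas = np.array([0.1, 0.3, 0.2])   # their reported errors

weights = 1.0 / sigmas**2
wmean = np.sum(weights * values) / np.sum(weights)
wmean_err = 1.0 / np.sqrt(np.sum(weights))   # error of the weighted mean

print(f"weighted mean = {wmean:.2f} +/- {wmean_err:.2f}")
# → weighted mean = 6.09 +/- 0.09
```

Note that the combined error is smaller than any single measurement’s error, which is exactly why throwing in a value of unknown (and possibly huge) error can do more harm than good.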
And what about null models? Untold numbers of papers proclaim the greatness of their particular method but fail to compare it to anything simpler, such as 2D fingerprint searches as a null model in virtual screening. This illustrates the problem that even a very basic aspect of the scientific method—the concept of a control experiment—is typically ignored by our journals. While it’s great that a method works (ignoring the probable cherry-picking that goes on, especially in academia), it should matter if a (statistically) equivalent result arises from a (reliable) method such as 2D (or ROCS, if you’ll forgive me some bias!). There are many flavors of null models—molecular weight, for instance, acts as a useful one for scoring functions for affinity prediction—but many papers don’t seem to include any.
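A null model need be nothing more than a trivial property plus a ranking metric. As a sketch (with synthetic actives and decoys, not real data), here molecular weight alone produces a perfectly respectable AUC on a toy virtual-screening set:

```python
# Sketch of a null-model check: before crediting a fancy scoring
# function, see what a trivial property achieves. All data synthetic;
# the actives are simply made a bit heavier on average.
import numpy as np

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney): P(random active outscores random decoy)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    return np.mean(pos[:, None] > neg[None, :])

rng = np.random.default_rng(2)
labels = np.array([1] * 30 + [0] * 70)          # 30 actives, 70 decoys
mol_wt = np.where(labels == 1,
                  rng.normal(420, 40, 100),     # actives: heavier (toy assumption)
                  rng.normal(380, 40, 100))     # decoys

print(f"null-model AUC from molecular weight alone: {auc(mol_wt, labels):.2f}")
```

If a published method’s AUC is statistically indistinguishable from a number like this, the method has not demonstrated anything.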
A trickier concept, but one we ought to appreciate in an empirical field, is the future risk of adjustable parameters. Future risk just means performance on data we have yet to see, and it is typically misjudged by how training and test sets are constructed: the training set includes examples from all time periods, rather than being separated into a true “past” and “future” dataset. While I think the field is slowly learning this one, there is still a reliance on cross-validation or y-scrambling to “prove” you haven’t over-trained. It’s not that these techniques are bad; they are just incomplete. Cross-validation might suggest that a model is over-trained, but how often will that verdict be wrong? Bad models can appear good, and good models bad. How do you know the chance of this for your model? And how does this classification error change with the size and composition of the dataset? There are some quite lovely approaches from information theory that give bounds on expected performance, given a certain number of parameters, but they are mostly unknown in CADD.
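How often can cross-validation flatter a worthless model? A quick synthetic experiment: run 5-fold cross-validation of a straight-line fit on small, pure-noise datasets, many times over, and look at the spread of “q2” values.

```python
# Sketch of cross-validation's noise on small datasets: repeat 5-fold
# CV of a linear fit on data where y is UNRELATED to x, and watch the
# spread of cross-validated q2 values. Entirely synthetic.
import numpy as np

rng = np.random.default_rng(3)

def cv_q2(x, y, folds=5):
    """Cross-validated q2 = 1 - (CV residual sum of squares) / (total SS)."""
    idx = rng.permutation(len(x))
    resid, total = 0.0, np.sum((y - y.mean()) ** 2)
    for test in np.array_split(idx, folds):
        train = np.setdiff1d(idx, test)
        slope, intercept = np.polyfit(x[train], y[train], 1)
        resid += np.sum((y[test] - (slope * x[test] + intercept)) ** 2)
    return 1.0 - resid / total

q2s = []
for _ in range(200):
    x = rng.normal(size=15)
    y = rng.normal(size=15)          # labels have nothing to do with x
    q2s.append(cv_q2(x, y))

q2s = np.array(q2s)
print(f"pure-noise q2 range over 200 runs: [{q2s.min():.2f}, {q2s.max():.2f}]")
```

On small sets the occasional pure-noise run comes out with a positive q2, i.e. a meaningless model that cross-validation calls predictive; knowing how often that happens for your dataset size is exactly the missing step.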
And this is only explicit parameterization. What about implicit parameters, such as the choice of parameters from a large set, or the choice of method in the first place, or the choice of system? These can all be handled within the context of a larger (Bayesian) framework. But that’s the keyword: work. And who, other than the more puritanical, is going to do more work than necessary to get published, especially if journals continue to have such low standards? There are societal forces at work here, not just ones of rigor. And even with standards, how useful are they if we can’t reproduce the work of others? One of Barry Honig’s aphorisms that I try to pass along is “It’s not science until someone else does it.” It is when a group of us starts to use an approach that we really begin to learn, and reproducibility is a good place to start.
Hype, glorious hype
Another issue is the ability to see through hype. I’ve commented in a previous entry about how management in pharma finds hype totally irresistible, whether it be about experimental approaches, classes of targets, management techniques (how’s that lean six sigma working out for you?) or simply the latest in computer innovation. There’s some real progress in the latter: Google Translate or IBM’s Watson come to mind. However, there is a lot of nonsense out there too. How do you distinguish between the two? Well, a firm grounding in statistics doesn’t hurt. I’ll try to give an example from the hype surrounding “Deep Learning” in the near future.
I could go on. But let’s get back to my plans for the Gordon conference. There are clearly a lot of simple methods that ought to be standard in modeling: how to calculate and interpret error bars, how to fit straight lines, how to deal with outliers, etc. To address some of these, I plan morning sessions where members of our community can present methods they have found useful, along with the science such approaches enabled. It’s my hope to capture these and other approaches in written form and also in a web-accessible interface for the community. There are also the bigger issues, such as how non-ideal our data is for standard statistical tests, how to deal with parameter risk, experimental error and null models. Each of these will have its own session, with two speakers and plenty of time for discussion. I don’t know if we can “solve” these issues at such a short meeting, but we can perhaps make a start. Then there are the societal issues, in which journals feature prominently. A skeptic might claim that no matter how happy-clappy a meeting we have amongst the faithful, it is all for naught if the journals don’t get on board. Well, I have a backup plan. I won’t elaborate just yet, but don’t worry, I’m not starting my own journal. Thought of that. Came to my senses long ago.
Finally, as I alluded to earlier, there are other fields out there, either having success with a more statistical approach or perhaps suffering because of a lack of one. I’ve invited a panel of five external speakers: Steve Ziliak, an economist from Roosevelt University, coauthor of the excellent The Cult of Statistical Significance; Cosma Shalizi from CMU and the field of machine learning, who has written popular articles on bootstrapping and Bayesian reasoning; George Wolford II, from the department of psychology at Dartmouth, an Ig Nobel Prize winner for a priceless paper on fMRI studies; Carson Chow from the NIH, who has applied Bayesian analysis to a variety of topics, including obesity in the United States; and, finally, Elizabeth Iorns, a medical researcher who has set up a company, ScienceExchange, to help reproduce experimental results for companies that can no longer trust published findings. These five will give evening talks, each bringing a different and, we hope, illuminating perspective.
I think statistics is an important part of what we ought to do, as modelers, but usually don’t. The cost of not using statistics is less thorough work and shakier progress. So come to Mount Snow in July (incongruous as that sounds) and we’ll try to make a difference. Knowing some statistics will not only make us better scientists, but is also a part of the rich intellectual heritage of the twentieth century. It’s time to catch up!
Visit the GRC CADD 2013 website.
The Shalizi bootstrapping article appeared in American Scientist 98:186 (2010).
Where is the Peter Kenny "The wrong things" paper published? - jbbrown
I have just started studying -- but after the bootstrapping paper, I am really wondering why it wouldn't be pretty much useful in most situations. What I got from the article is that all the complicated formulas (e.g. comparing AUCs using those crazy approximate formulas) are what you would have done before computers. Now with computers bootstrapping is so easy (and generally applicable) that there really is no reason to use complicated approximate formulas that assume underlying distributions.
I am guessing this is really naive POV -- but not sure what I am missing. - Brock
I'm glad you liked Cosma's excellent paper- he will be around the entire conference, I'm glad to say. As for not needing anything else, it depends. One of my speakers in the Practical Session suggested a talk title, "Bootstrapping Solves Everything", so there are certainly those who see it as a "Universal Wrench". However, there are some caveats. For one thing, you don't always have the primary data- if you read a paper which quotes an AUC or an enrichment, how can you assess the confidence limits without the data from which it was drawn? That is, unfortunately, more usual than not. Secondly, to paraphrase Wigner (I think), it's nice that the computer understands the problem, but I'd like to too- knowing the analytic approximations tends to allow you to do this. There'll be a nice example of this where R**2 is concerned from Scott Brown. Finally, bootstrapping doesn't save you from outliers, non-ideal data, distribution shifts, over-parameterization etc. It *can* help include experimental error in model building, which I like a lot, but it doesn't (I think) solve everything. Of course, Cosma might disagree- which is why we turn up at the conference, to confer. - Anthony Nicholls
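[For readers following this exchange, a minimal sketch of the bootstrap being discussed, applied to an AUC on synthetic scores: resample the (score, label) pairs with replacement, recompute the statistic each time, and read the confidence limits off the spread.]

```python
# Percentile bootstrap of an AUC, as discussed in the thread above.
# Scores and labels are synthetic stand-ins for a virtual screen.
import numpy as np

rng = np.random.default_rng(4)
labels = np.array([1] * 25 + [0] * 75)
scores = np.where(labels == 1,
                  rng.normal(1.0, 1.0, 100),   # actives score higher (toy setup)
                  rng.normal(0.0, 1.0, 100))

def auc(s, l):
    """Rank-based AUC: P(random active outscores random decoy)."""
    return np.mean(s[l == 1][:, None] > s[l == 0][None, :])

boot = []
for _ in range(2000):
    i = rng.integers(0, len(scores), len(scores))  # resample with replacement
    if labels[i].min() == labels[i].max():         # need both classes present
        continue
    boot.append(auc(scores[i], labels[i]))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"AUC = {auc(scores, labels):.2f}, 95% bootstrap CI = [{lo:.2f}, {hi:.2f}]")
```

This also illustrates Nicholls' first caveat: without the primary per-compound scores, none of this resampling is possible.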
For those of us completely ignorant of statistics in molecular modeling, how would you suggest we effectively prepare? I.e. If I can spend 1-2 hours a week until the conference, what do I read? - Brock
Sorry to have taken some time to reply- I need to set the mechanism to let me know when someone has commented! Yes, there is the article I wrote for Jurgen Bajorath a couple of years ago that Christian references- in it I summarize a lot of the basics I (re)learnt. That article also has a resource section at the end which summarizes a lot of what I read to become less ignorant. I've a great fondness for the article by astrophysicist Tom Loredo I mention there, but it's by no means a simple read. I tried hard to get Tom to come to the conference but I think he felt I was crazy- actually that has been a common response from most of the external speakers I attempted to contact- they literally do not believe a field such as ours can be as bad as it is. Thankfully a few allowed me to convince them, but not Tom, unfortunately. I would advise looking up Cosma Shalizi's article on bootstrapping from American Scientist. It's easy to read and really useful, and Cosma is one of the few "externals" I did convince. Another, Steve Ziliak, has a book, "The Cult of Statistical Significance", that is fascinating but not a general statistics text. People speak very highly of Nate Silver's book, "The Signal and the Noise", but I haven't read it. What I will do, though, is give the matter some more thought and come up with a list a month or so before the conference. I also plan on writing a glossary of terms used in the statistical world, because I found that mastering the language was really the hardest part. - Anthony Nicholls
I would love to go back to the primary sources and learn -- but I am lazy and trying to shift the work onto you. What I was hoping to get was a list that included things like the Peter Kenny articles -- which look like a must read for our domain. And to exploit your natural tendencies to maybe give a frank two sentence review of a number of papers that we might not want to believe so heavily in.
I promise to take time in retirement (if I live that long) to read the primary literature :)
(And thank you to Christian -- I find Anthony's paper quite enjoyable -- but I wish I understood a bit more of it -- old folks need crayon drawings ;) - Brock
Certainly Peter Kenny's paper is a masterclass on what is wrong, but it doesn't suggest what to do right (other than stop doing wrong!). Peter is a lead-off speaker, Monday morning, for the conference. The Abbott paper I mention on experimental noise and R**2 is a rare example of a paper trying to address the issues that should be obvious. Ajay Jain and I tried to make some sensible points in our editorial of the JCAMD issue on our symposium on evaluations. But really, there is no convincing list of papers to read from within our field- which is the problem, I think. The paper Christian refers to, which I wrote, was an attempt to present the simple concepts from which we build- e.g. variance, standard error, confidence limits, covariance, etc.- and where they come from. Those you just have to know- you can't do chemistry without knowing your elements. As for a list of papers not to read- that's easy, but I could never claim to be exhaustive! I almost never see papers with solid statistics- the best are ones that have at least tried! Let's just say that of all the papers that claim to have 'solved' some important part of our field- scoring functions, loop modeling, selectivity or off-target effects- none seem in the least statistically sound. I'm sure the authors of such papers believe in what they publish, but Ajay's point about statistics is that it cannot save us if we don't want to be saved. While true, I do believe it can help readers judge the saints from the sinners! - Anthony Nicholls
I just stumbled over this 2010 paper from Ant: What Do We Know?: Simple Statistical Techniques that Help. It looks like a nice (and free?) introduction to the basic stats part to me.
Thanks for making the GRC-CADD statistics conference. Very much looking forward to it...
- Christian Kramer