The address of the Shalizi article (American Scientist 98:186, 2010) is here:
Where is the Peter Kenny "The wrong things" paper published? - jbbrown
I have just started studying -- but after the bootstrapping paper, I am really wondering why it wouldn't be useful in pretty much every situation. What I got from the article is that all the complicated formulas (e.g. comparing AUCs using those crazy approximate formulas) are what you would have done before computers. Now that we have computers, bootstrapping is so easy (and so generally applicable) that there really is no reason to use complicated approximate formulas that assume underlying distributions.
I am guessing this is a really naive POV -- but I'm not sure what I am missing. - Brock
I'm glad you liked Cosma's excellent paper- he will be around the entire conference, I'm glad to say. As for not needing anything else, it depends. One of my speakers in the Practical Session suggested a talk title, "Bootstrapping Solves Everything", so there are certainly those who see it as a "Universal Wrench". However, there are some caveats. For one thing, you don't always have the primary data- if you read a paper that quotes an AUC or an enrichment, how can you assess the confidence limits if you don't have the data from which it was drawn? That is, unfortunately, more usual than not. Secondly, to paraphrase Wigner (I think), it's nice that the computer understands the problem, but I'd like to too- knowing the analytic approximations tends to allow you to do this. There'll be a nice example of this where R**2 is concerned from Scott Brown. Finally, bootstrapping doesn't save you from outliers, non-ideal data, distribution shifts, over-parameterization, etc. It *can* help include experimental error in model building, which I like a lot, but it doesn't (I think) solve everything. Of course, Cosma might disagree- which is why we turn up at the conference, to confer. - Anthony Nicholls
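[Editor's note: to make the exchange above concrete, here is a minimal sketch of the kind of bootstrap the commenters are discussing- a percentile confidence interval for an AUC, resampled from the primary data. It assumes NumPy; the function names `auc` and `bootstrap_auc_ci` are illustrative, not from any paper mentioned here.]

```python
import numpy as np

def auc(scores, labels):
    """Mann-Whitney AUC: probability a random positive outscores a random negative."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Count pairwise wins; ties count half.
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = np.random.default_rng(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, n)        # resample with replacement
        if labels[idx].min() == labels[idx].max():
            continue                       # skip resamples missing one class
        stats.append(auc(scores[idx], labels[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

This illustrates both sides of the exchange: no distributional assumption is needed, but you do need the per-compound scores and labels- exactly the primary data that, as noted above, a published AUC figure usually doesn't come with.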
For those of us completely ignorant of statistics in molecular modeling, how would you suggest we effectively prepare? I.e., if I can spend 1-2 hours a week until the conference, what should I read? - Brock
Sorry to have taken some time to reply- I need to set up the mechanism that lets me know when someone has commented! Yes, there is the article I wrote for Jurgen Bajorath a couple of years ago that Christian references- in it I summarize a lot of the basics I (re)learnt. That article also has a resource section at the end that summarizes a lot of what I read to become less ignorant. I've a great fondness for the article by astrophysicist Tom Loredo I mention there, but it's by no means a simple read. I tried hard to get Tom to come to the conference, but I think he felt I was crazy- actually, that has been a common response from most of the external speakers I attempted to contact- they literally do not believe a field such as ours can be as bad as it is. Thankfully a few allowed me to convince them, but not Tom, unfortunately. I would advise looking up Cosma Shalizi's article on bootstrapping from American Scientist. It's easy to read and really useful, and Cosma is one of the few "externals" I did convince. Another, Steve Ziliak, has a book, "The Cult of Statistical Significance", that is fascinating but not a general statistics text. People speak very highly of Nate Silver's book, "The Signal and the Noise", but I haven't read it. What I will do, though, is give the matter some more thought and come up with a list a month or so before the conference. I also plan on writing a glossary of terms used in the statistical world, because I found that mastering the language was really the hardest part. - Anthony Nicholls
I would love to go back to the primary sources and learn -- but I am lazy and trying to shift the work onto you. What I was hoping to get was a list that included things like the Peter Kenny articles -- which look like a must-read for our domain. And to exploit your natural tendencies to maybe give a frank two-sentence review of a number of papers that we might not want to believe in so heavily.
I promise to take time in retirement (if I live that long) to read the primary literature :)
(And thank you to Christian -- I find Anthony's paper quite enjoyable -- but I wish I understood a bit more of it -- old folks need crayon drawings ;) - Brock
Certainly Peter Kenny's paper is a masterclass on what is wrong, but it doesn't suggest what to do right (other than to stop doing wrong!). Peter is a lead-off speaker, Monday morning, for the conference. The Abbott paper I mention on experimental noise and R**2 is a rare example of a paper trying to address the issues that should be obvious. Ajay Jain and I tried to make some sensible points in our editorial for the JCAMD issue on our symposium on evaluations. But really, there is no convincing list of papers to read from within our field- which is the problem, I think. The paper Christian refers to that I wrote was an attempt to present the simple concepts from which we build- e.g. variance, standard error, confidence limits, covariance, etc., and where they come from. Those you just have to know- you can't do chemistry without knowing your elements. As for a list of papers not to read- that's easy, but I could never claim to be exhaustive! I almost never see papers with solid statistics- the best are the ones that have at least tried! Let's just say that of all the papers claiming to have 'solved' some important part of our field- e.g. scoring functions, loop modeling, selectivity or off-target effects- none seem in the least statistically sound. I'm sure the authors of such papers believe in what they publish, but Ajay's point about statistics is that it cannot save us if we don't want to be saved. While true, I do believe it can help readers judge the saints from the sinners! - Anthony Nicholls
I just stumbled upon this 2010 paper from Ant: "What Do We Know?: Simple Statistical Techniques that Help". It looks like a nice (and free?) introduction to the basic stats to me.
Thanks for organizing the GRC-CADD statistics conference. Very much looking forward to it...
- Christian Kramer