Document:Five More Nails
From AIDS Wiki
NOTWITHSTANDING ANY OTHER NOTICE ON THIS PAGE, the material on this page is NOT available under the GNU Free Documentation License; in accordance with Title 17 U.S.C. section 107, it is posted in the manner of bulletin boards in schools and workplaces, to encourage public education and citizen awareness, without profit or payment, for persons and entities engaging in non-profit research and educational activities and purposes only.
"You Bet Your Life"
29 October 2006
[postscript 6 November 2008]
Hi Darin,
I noticed in your recent post that you considered Figure 1 of the Sept 27 JAMA article to be a "mathematical artifact". Take a look at this chart on Dr. Bennett’s blog, which he maintains is “proof” of a correlation between CD4 cell count and HIV viral load... He calculates a magic R2 value of .93 from Figure 1.
I am leaning towards thinking you are correct in your analysis, but Bennett seems to have a point (or five – not to be gratuitously humorous about the matter because there is nothing at all funny about AIDS, really.)
Any comment? Thanks.
A concerned reader.
Dear Concerned Reader,
He has NOT proved a correlation between viral load and CD4 cell loss in individual patients. He has proved a correlation between the viral load and CD4 cell loss for five values that represent the median CD4 cell loss for their respective viral load subgroups. It is statistical trickery of the most transparent kind.
Here is the best way I can put it: ANY data set such as the one under consideration has a "line of best fit"; these lines appear in Figure 3. ANY data set at all. The coefficient of determination (R2) tells you how well the data set as a whole "fits" this line of best fit. In Figure 3, the points clearly don't fit it at all.
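For reference, the standard textbook definition of the coefficient of determination (written R2 in this post) for a least-squares line is sketched below. The symbols are labels chosen here for illustration, not notation from the JAMA article: y_i is an observed response (here, a patient's yearly CD4 cell loss), \hat{y}_i is the value the line of best fit predicts for that patient, and \bar{y} is the mean of all observed responses.

```latex
R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}
```

For a straight-line fit with a single predictor, this equals the square of the ordinary correlation coefficient r. An R2 of 0.04 therefore means the line accounts for about 4% of the variation in the responses, and the scatter around the line accounts for the rest.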
Now here is a more mathematical explanation of what I meant by "it's the statistical equivalent of squinting your eyes so hard you can't see any details anymore". Let's say you take a more-or-less random data set (as Figure 3 almost is) and break it up into subgroups by intervals of the predictor variable. All of the data points in any one of these subgroups come from patients who presented with roughly similar HIV viral load levels. Within any one of these subgroups, though, the data points are still more-or-less randomly scattered. But there IS a general pattern, because the line of best fit does have negative slope. In other words, if you look at the total set, the points do (very generally) slope down slightly. The same holds for each subgroup: the data points are scattered, but (very generally) have a slight downward trend. The point is that they don't "fit" that trend very well, if at all, and the R2 values for the total data set AND for each subgroup reflect this.
Now, we have decided to choose the "median" response for each subgroup. It does not take a rocket scientist to figure out that if you choose the median response for each subgroup, those medians are going to lie somewhere very close to the line of best fit. The only way a median point could lie FAR from the line of best fit would be if the points had some strange distribution, like 2/3 of them very low and then a big jump and the other 1/3 way up high. Just a quick glance at Figure 3 shows they're not strangely distributed like this.
So, here is the net effect of considering the five points in Figure 1: you have a cloud of almost random data points; you plot the line of best fit through those points (which looks almost as absurd as some of the graphs in Ho/Wei); then you "choose" five data points out of the cloud which all happen to lie very close to the line of best fit. It's no surprise then that they all lie nearly in a straight line and hence show a high correlation. The "correlation" does not reflect any real correlation in the data set itself – it's a mathematical artifact of the way the medians were chosen. It's the statistical equivalent of squinting your eyes real hard and picking five points with your finger. It's ridiculous.
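The effect described above can be reproduced with made-up numbers. The following is a minimal sketch, not a reanalysis of the JAMA data: the sample size, slope, noise level, subgroup boundaries, and variable names are all assumptions chosen only so that the full cloud has an R2 of a few percent. It assumes NumPy is available.

```python
import numpy as np

def r_squared(x, y):
    """R2 of a straight-line least-squares fit (the squared correlation coefficient)."""
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical cloud: a slight downward trend buried in heavy scatter.
n = 5000
log_vl = rng.uniform(2.0, 6.0, size=n)                        # made-up log10 viral load
cd4_change = -15.0 * log_vl + rng.normal(0.0, 80.0, size=n)   # made-up yearly CD4 change

r2_cloud = r_squared(log_vl, cd4_change)

# Break the cloud into five subgroups by intervals of the predictor,
# then keep only the median point of each subgroup.
edges = np.linspace(2.0, 6.0, 6)
group = np.digitize(log_vl, edges[1:-1])  # subgroup labels 0..4
median_x = np.array([np.median(log_vl[group == g]) for g in range(5)])
median_y = np.array([np.median(cd4_change[group == g]) for g in range(5)])

r2_medians = r_squared(median_x, median_y)

print(f"R2 for all {n} points: {r2_cloud:.3f}")
print(f"R2 for the 5 medians:  {r2_medians:.3f}")
```

With these assumed settings, the five medians line up far more tightly than the individual points do, even though nothing about the underlying cloud has changed.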
So, all Figure 1 reflects is the slope of the line of best fit from Figure 3, with the lack of correlation obscured, and with some "error bars". The error bars have absolutely no biological meaning; they are confidence intervals for the median points. They are just saying, "look, the median point lies in here somewhere". So what??
Then people like Bennett point out the "simple linear relationship" in Figure 1 and claim it's evidence of some kind of "correlation". It's NOT reflecting the correlation of the total data set; it's just reflecting the small slope of the line of best fit. But every data set has such a slope value. You can do this trick with ANY data set that's more-or-less randomly distributed. It's clear why the authors of the study couldn't just put the four clouds of data points upfront in the article and report the R2 values in the abstract: they had to concoct Figure 1 to distract people at the start of the article, and then report the median values in the abstract to give the idea there was some "correlation".
Actually, that wasn't good enough, because the median values were still too close to each other. They had to go back over each subgroup individually and run a different model with each one, and I can hear their collective sighs of relief when they finally got numbers spaced out from each other a little more. (Meaning, more than 10-15 cells/mm3/year difference between the most extreme groups.) Then they tried to "rescue" the R2 = 0.04 in several ways, but could only get to 0.08 or 0.10 at best. I can just see them after they first saw the data and the actual R2 values – OMG, we have to put this in JAMA... what do we do??
This all might mean something if there were any reason to look at the subgroups this way. But I can't find any. The only reason I can find is to smooth the data out and have a nice-looking graph like Figure 1. In my 9 Oct post, I point out why I think the boundaries chosen are arbitrary and why I don't think there's any good biological reason to group them this way. And choices like this have to rest on biological reasons. The reasons can't be purely mathematical or just arbitrary. This is all standard stuff. Do a Google search on "subgroup analysis" (in quotes). You'll come up with a slew of articles on how subgroup analysis gets misused and abused. This paper should go down in history as Exhibit A.
Postscript: 6 November 2008
Objective: To estimate the proportion of variability in rate of CD4 cell loss predicted by presenting plasma HIV RNA levels in untreated HIV-infected persons.... Main Outcome Measures: The extent to which presenting plasma HIV RNA level could explain the rate of model-derived yearly CD4 cell loss, as estimated by the coefficient of determination (R2). Results: In both cohorts, higher presenting HIV RNA levels were associated with greater subsequent CD4 cell decline. In the study cohort, median model–estimated CD4 cell decrease among participants with HIV RNA levels of 500 or less, 501 to 2,000, 2,001 to 10,000, 10,001 to 40,000, and more than 40,000 copies/mL were 20, 39, 48, 56, and 78 cells/μL, respectively. Despite this trend across broad categories of HIV RNA levels, only a small proportion of CD4 cell loss variability (4%-6%) could be explained by presenting plasma HIV RNA level. — Rodríguez et al., JAMA, 2006; 296: 1498-1506.
Common Errors Involving Correlation — We now identify three of the most common sources of errors made in interpreting results involving correlation... 2. Another error arises with data based on averages. Averages suppress individual variation and may inflate the correlation coefficient. One study produced a 0.4 linear correlation coefficient for paired data relating income and education among individuals, but the linear correlation coefficient became 0.7 when regional averages were used. — Elementary Statistics, 10th edition, Mario F. Triola
Ecological Correlation: Correlations based on averages can be arbitrarily misleading if they are interpreted to be about individuals. Correlations based on averages are usually too high, because they ignore the variability across individuals. Correlation of averages is called ecological correlation.... Ecological correlations are correlation coefficients of averages across groups of individuals, rather than correlation coefficients for individuals. Ecological correlations tend to be stronger than the correlation coefficient for individuals, although the opposite is also possible. Beware arguments about association that rely on ecological correlations. — "Statistics Tools for Internet and Classroom Instruction with a Graphical User Interface", Philip B. Stark, Professor of Statistics, UC Berkeley
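The two textbook passages above can be illustrated with a small simulation. This is a minimal sketch with invented numbers, loosely patterned on the income and education example; it does not use data from Triola, Stark, or the JAMA study. The number of regions, the people per region, the slope, and the noise levels are all assumptions. It assumes NumPy is available.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable

# Hypothetical setup: 20 regions, 500 individuals in each.
n_regions, per_region = 20, 500
region_level = rng.normal(0.0, 1.0, size=n_regions)        # made-up regional education level
region_id = np.repeat(np.arange(n_regions), per_region)    # region label for each individual

# Individuals vary around their region's level; income depends weakly on education.
education = region_level[region_id] + rng.normal(0.0, 1.0, size=region_id.size)
income = 0.5 * education + rng.normal(0.0, 1.5, size=region_id.size)

# Correlation computed on individuals.
r_individuals = np.corrcoef(education, income)[0, 1]

# Correlation computed on regional averages (the "ecological" correlation).
mean_education = np.array([education[region_id == k].mean() for k in range(n_regions)])
mean_income = np.array([income[region_id == k].mean() for k in range(n_regions)])
r_averages = np.corrcoef(mean_education, mean_income)[0, 1]

print(f"correlation across individuals:       {r_individuals:.2f}")
print(f"correlation across regional averages: {r_averages:.2f}")
```

Averaging within each region cancels most of the individual scatter, so the averages correlate more strongly than the individuals do. This sketch shows only the typical direction; as the Stark passage notes, the opposite is also possible.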
© 2006 by Darin Brown
Originally published at "You Bet Your Life"

