Data dredging

An example of data produced by data dredging, showing a correlation between the number of letters in a spelling bee's winning word and the number of people in the United States killed by venomous spiders.

Data dredging (also data fishing, data snooping, and p-hacking) is the misuse of data analysis to find patterns in data that can be presented as statistically significant when in fact there is no real underlying effect. This is done by performing many statistical tests on the data and only paying attention to those that come back with significant results, instead of stating a single hypothesis about an underlying effect before the analysis and then conducting a single test for it.

The process of data dredging involves automatically testing huge numbers of hypotheses about a single data set by exhaustively searching—perhaps for combinations of variables that might show a correlation, and perhaps for groups of cases or observations that show differences in their mean or in their breakdown by some other variable. Conventional tests of statistical significance are based on the probability that a particular result would arise if chance alone were at work, and necessarily accept some risk of mistaken conclusions of a certain type (mistaken rejections of the null hypothesis). This level of risk is called the significance. When large numbers of tests are performed, some produce false results of this type, hence 5% of randomly chosen hypotheses turn out to be significant at the 5% level, 1% turn out to be significant at the 1% significance level, and so on, by chance alone. When enough hypotheses are tested, it is virtually certain that some will be statistically significant but misleading, since almost every data set with any degree of randomness is likely to contain (for example) some spurious correlations. If they are not cautious, researchers using data mining techniques can be easily misled by these results.

Panel from a 2011 xkcd cartoon explaining p hacking, in which scientists look for relationships between many colors of jelly beans and acne, and find a p value <0.05 only for green ones.[1]

The multiple comparisons hazard is common in data dredging. Moreover, subgroups are sometimes explored without alerting the reader to the number of questions at issue, which can lead to misinformed conclusions.[2]

Drawing conclusions from data

The conventional frequentist statistical hypothesis testing procedure is to formulate a research hypothesis, such as "people in higher social classes live longer", then collect relevant data, followed by carrying out a statistical significance test to see how likely such results would be found if chance alone were at work. (The last step is called testing against the null hypothesis.)

A key point in proper statistical analysis is to test a hypothesis with evidence (data) that was not used in constructing the hypothesis. This is critical because every data set contains some patterns due entirely to chance. If the hypothesis is not tested on a different data set from the same statistical population, it is impossible to assess the likelihood that chance alone would produce such patterns. See testing hypotheses suggested by the data.

Here is a simple example. Throwing a coin five times, with a result of 2 heads and 3 tails, might lead one to hypothesize that the coin favors tails by 3/5 to 2/5. If this hypothesis is then tested on the existing data set, it is confirmed, but the confirmation is meaningless. The proper procedure would have been to form in advance a hypothesis of what the tails probability is, and then throw the coin various times to see if the hypothesis is rejected or not. If three tails and two heads are observed, another hypothesis, that the tails probability is 3/5, could be formed, but it could only be tested by a new set of coin tosses. It is important to realize that the statistical significance under the incorrect procedure is completely spurious – significance tests do not protect against data dredging.

Hypothesis suggested by non-representative data

Suppose that a study of a random sample of people includes exactly two people with a birthday of August 7: Mary and John. Someone engaged in data snooping might try to find additional similarities between Mary and John. By going through hundreds or thousands of potential similarities between the two, each having a low probability of being true, an unusual similarity can almost certainly be found. Perhaps John and Mary are the only two people in the study who switched minors three times in college. A hypothesis, biased by data snooping, could then be "People born on August 7 have a much higher chance of switching minors more than twice in college."

The data itself very strongly supports that correlation, since no one with a different birthday had switched minors three times in college. However, if (as is likely) this is a spurious hypothesis, this result will most likely not be reproducible; any attempt to check if others with an August 7 birthday have a similar rate of changing minors will most likely get contradictory results almost immediately.

Bias

Bias is a systematic error in the analysis. For example, doctors directed HIV patients at high cardiovascular risk to a particular HIV treatment, abacavir, and lower-risk patients to other drugs, preventing a simple assessment of abacavir compared to other treatments. An analysis that did not correct for this bias unfairly penalised abacavir, since its patients were more high-risk so more of them had heart attacks.[2] This problem can be very severe, for example, in the observational study.[2][3]

Missing factors, unmeasured confounders, and loss to follow-up can also lead to bias.[2] By selecting papers with a significant p-value, negative studies are selected against—which is the publication bias. This is also known as File Cabinet Bias, because less significant p-value results are left in the file cabinet and never published.

Multiple modelling

Another aspect of the conditioning of statistical tests by knowledge of the data can be seen while using the frequency of data flow in a system or machine in the data analysis linear regression. A crucial step in the process is to decide which covariates to include in a relationship explaining one or more other variables. There are both statistical (see Stepwise regression) and substantive considerations that lead the authors to favor some of their models over others, and there is a liberal use of statistical tests. However, to discard one or more variables from an explanatory relation on the basis of the data, means one cannot validly apply standard statistical procedures to the retained variables in the relation as though nothing had happened. In the nature of the case, the retained variables have had to pass some kind of preliminary test (possibly an imprecise intuitive one) that the discarded variables failed. In 1966, Selvin and Stuart compared variables retained in the model to the fish that don't fall through the net—in the sense that their effects are bound to be bigger than those that do fall through the net. Not only does this alter the performance of all subsequent tests on the retained explanatory model—it may introduce bias and alter mean square error in estimation.[4][5]

Examples in meteorology and epidemiology

In meteorology, hypotheses are often formulated using weather data up to the present and tested against future weather data, which ensures that, even subconsciously, future data could not influence the formulation of the hypothesis. Of course, such a discipline necessitates waiting for new data to come in, to show the formulated theory's predictive power versus the null hypothesis. This process ensures that no one can accuse the researcher of hand-tailoring the predictive model to the data on hand, since the upcoming weather is not yet available.

As another example, suppose that observers note that a particular town appears to have a cancer cluster, but lack a firm hypothesis of why this is so. However, they have access to a large amount of demographic data about the town and surrounding area, containing measurements for the area of hundreds or thousands of different variables, mostly uncorrelated. Even if all these variables are independent of the cancer incidence rate, it is highly likely that at least one variable correlates significantly with the cancer rate across the area. While this may suggest a hypothesis, further testing using the same variables but with data from a different location is needed to confirm. Note that a p-value of 0.01 suggests that 1% of the time a result at least that extreme would be obtained by chance; if hundreds or thousands of hypotheses (with mutually relatively uncorrelated independent variables) are tested, then one is likely to obtain a p-value less than 0.01 for many null hypotheses.

Remedies

Looking for patterns in data is legitimate. Applying a statistical test of significance, or hypothesis test, to the same data that a pattern emerges from is wrong. One way to construct hypotheses while avoiding data dredging is to conduct randomized out-of-sample tests. The researcher collects a data set, then randomly partitions it into two subsets, A and B. Only one subset—say, subset A—is examined for creating hypotheses. Once a hypothesis is formulated, it must be tested on subset B, which was not used to construct the hypothesis. Only where B also supports such a hypothesis is it reasonable to believe the hypothesis might be valid. (This is a simple type of cross-validation and is often termed training-test or split-half validation.)

Another remedy for data dredging is to record the number of all significance tests conducted during the study and simply divide one's criterion for significance ("alpha") by this number; this is the Bonferroni correction. However, this is a very conservative metric. A family-wise alpha of 0.05, divided in this way by 1,000 to account for 1,000 significance tests, yields a very stringent per-hypothesis alpha of 0.00005. Methods particularly useful in analysis of variance, and in constructing simultaneous confidence bands for regressions involving basis functions are the Scheffé method and, if the researcher has in mind only pairwise comparisons, the Tukey method. The use of Benjamini and Hochberg's false discovery rate is a more sophisticated approach that has become a popular method for control of multiple hypothesis tests.

When neither approach is practical, one can make a clear distinction between data analyses that are confirmatory and analyses that are exploratory. Statistical inference is appropriate only for the former.[5]

Ultimately, the statistical significance of a test and the statistical confidence of a finding are joint properties of data and the method used to examine the data. Thus, if someone says that a certain event has probability of 20% ± 2% 19 times out of 20, this means that if the probability of the event is estimated by the same method used to obtain the 20% estimate, the result is between 18% and 22% with probability 0.95. No claim of statistical significance can be made by only looking, without due regard to the method used to assess the data.

Academic journals increasingly shift to the registered report format, which aims to counteract very serious issues such as data dredging and HARKing, which have made theory-testing research very unreliable: For example, Nature Human Behaviour has adopted the registered report format, as it “shift[s] the emphasis from the results of research to the questions that guide the research and the methods used to answer them”[6]. The European Journal of Personality defines this format: “In a registered report, authors create a study proposal that includes theoretical and empirical background, research questions/hypotheses, and pilot data (if available). Upon submission, this proposal will then be reviewed prior to data collection, and if accepted, the paper resulting from this peer-reviewed procedure will be published, regardless of the study outcomes.”[7]

References

1. ^ "Significant". xkcd. 2011-04-06.
2. ^ a b c d Young, S. S.; Karr, A. (2011). "Deming, data and observational studies" (PDF). Significance. 8 (3): 116–120. doi:10.1111/j.1740-9713.2011.00506.x.
3. ^ Davey Smith, G.; Ebrahim, S. (2002). "Data dredging, bias, or confounding". BMJ. 325 (7378): 1437–1438. doi:10.1136/bmj.325.7378.1437. PMC . PMID 12493654.
4. ^ Selvin, H.C.; Stuart, A. (1966). "Data-Dredging Procedures in Survey Analysis". The American Statistician. 20 (3): 20–23. doi:10.1080/00031305.1966.10480401. JSTOR 2681493.
5. ^ a b Berk, R.; Brown, L.; Zhao, L. (2009). "Statistical Inference After Model Selection". J Quant Criminol. 26: 217–236. doi:10.1007/s10940-009-9077-7.
6. ^ "Promoting reproducibility with registered reports". Nature Human Behaviour. 1 (1): 0034. 10 January 2017. doi:10.1038/s41562-016-0034. More than one of |website= and |journal= specified (help)
7. ^ "Streamlined review and registered reports soon to be official at EJP". ejp-blog.com.