Why Research Findings Are Usually Wrong

by John Hardie BDS, MSc, PhD, FRCD(C)

Reading articles in scientific journals and attending lectures by published researchers are considered hallmarks of the knowledgeable professional. Indeed, ignorance of the advice gleaned from such publications or from the verbal exhortations of their authors might be grounds for investigating professional competence. Questioning the value of these sources of learning would appear to be counterintuitive. However, as this article will describe, there are justified reasons for doing so.

“Statistics can prove almost anything” was a headline in the National Post on November 21, 2011. It referred to a new study in Psychological Science1 demonstrating that, through data manipulation, it is easy to publish statistically significant evidence supporting any hypothesis. In October 2011, the Journal of the Canadian Dental Association published an article containing the following: “It would therefore seem logical to make every possible effort to reduce the chances of false or unreliable data being published in the scientific literature.”2

These publications suggest that an air of unreliability surrounds virtually all scientific findings. It would be useful to know whether such a dramatic claim has additional support.

In November 2010, The Atlantic published an extensive report, “Lies, Damned Lies, and Medical Science,” on the work of Dr. John Ioannidis, clinical and molecular epidemiologist at Tufts University School of Medicine.3 In plain language, the article explains why Dr. Ioannidis, a researcher with a sterling reputation in the medical community, has concluded that 90% of the published medical information that physicians rely on is false, and that the advice given to us by experts on health, nutrition and drugs is often misleading, false or flat-out wrong.

The article supports these contentions by summarizing two significant papers published by Dr. Ioannidis. The first appeared in PLoS Medicine in 2005.4 In it he used mathematical reasoning to correctly predict that 80% of non-randomized studies (the most common kind), 25% of small- to medium-sized randomized trials, and 10% of large randomized trials would have their results convincingly refuted by later studies.3 The second paper also appeared in 2005, in the Journal of the American Medical Association.5 This time Ioannidis concentrated on 49 of the most significant findings in medicine during the preceding 13 years, as determined by two factors: the pertinent articles had been published in the journals most often cited by the research community, and the 49 papers themselves were the most widely cited articles in those journals. The subjects covered included the widespread use of HRT during menopause, vitamin E to reduce heart disease, coronary stents to lower the risk of heart attacks, and daily doses of aspirin to reduce the risk of heart disease and strokes. Forty-five of the 49 articles provided methods to verify the effectiveness of their respective claims. When 34 such claims were retested, 14 of them (41%) were shown quite convincingly to be wrong or grossly exaggerated.5

If between a third and a half of the most prestigious, highly accepted medical research is untrustworthy, it is reasonable to question the reliability of findings in papers that are infrequently cited or appear in minor publications.

The relevance of this to dentistry can be gauged by referring to the recent article by Faggion.2 In it he cites Journal Citation Reports (JCR), a systematic, objective and quantifiable means of assessing the research influence and impact of scientific journals. The highest-ranking medical journals, with their Impact Factors (IF), are: New England Journal of Medicine (IF 47), Lancet (IF 30.7) and Journal of the American Medical Association (IF 28.8). Ioannidis selected his 49 papers from these three top journals and others with Impact Factors greater than seven according to the JCR. As noted by Faggion,2 the highest-ranking dental journals are: Journal of Clinical Periodontology (IF 3.5), Journal of Dental Research (IF 3.4) and Oral Oncology (IF 3.1). As an aside, the Journal of the Canadian Dental Association has an Impact Factor of 0.95, in contrast to the Canadian Medical Association Journal at 7.2.

The research impact and influence of the three most prestigious medical journals are approximately 10 times greater than those of the three most frequently cited dental publications. Since Ioannidis has shown that 30-50% of highly respected medical research is faulty, it is a safe assumption that at least 50% of dental research findings are highly questionable. In fact, the situation is probably much worse. As stated earlier, Ioannidis, supported by a large swathe of the medical community, believes that as much as 90% of the published professional information that physicians rely on is flawed.3 It is highly likely that at least a similar, if not greater, fault level applies to the research findings that dentists use to guide their practices.

Faggion suggests that, despite the peer review process, fraud and misconduct are unfortunate realities of medical research.2 It would be naïve to believe that dental research is immune to similar abuses. Although fraud and misconduct will produce false findings, other, less malicious aspects of research methodology are believed to be responsible for inaccurate and untrustworthy results.

Simmons et al1 and Ioannidis4 discuss the term “statistically significant” and its relevance to incorrect findings. To appreciate why, a brief understanding of the null hypothesis, p-values, false positives and publication bias is required.

The null hypothesis is the proposition that there is no effect or no relationship between the phenomena or data being investigated. It is usually expressed as a negative; an example would be, “Hyperactivity is unrelated to eating sugar.” If statistical testing indicates that the hypothesis is probably false, the null hypothesis is rejected, or nullified, suggesting that there might be a connection between hyperactivity and sugar intake. It is the statistical significance of the test that is used either to reject the null hypothesis or to fail to reject it. Accordingly, a null hypothesis is a statistical construct that can never be proven, since in reality there might or might not be a relationship between hyperactivity and sugar ingestion.

P-values and Statistical Significance. A p-value is the probability of obtaining a result at least as extreme as the one observed if only normal random variation, in other words chance, were at work. A p-value near 0 (the lowest possible) means the observed result would be very unlikely to arise by chance alone and is considered significant; a p-value near 1 (the highest possible) means the result is entirely consistent with random variation and is not significant.

By convention, a value of 0.05 is commonly chosen as the critical p-value, or significance level, at which the null hypothesis may be rejected. A result meeting this threshold would be expected to occur by chance alone less than 5% of the time. When the null hypothesis is rejected, the result is said to be statistically significant, implying that it is probably real. Other critical p-values can be used; the higher the chosen threshold, the more readily chance findings will be accepted as significant. It is important to realize, however, that a p-value is not the probability that a finding is true: a p-value of 0.05 does not mean there is a 95% probability that the result is correct. As will become clear below, this widespread misreading is itself one reason why so many “significant” findings fail to hold up.
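To make the definition concrete, the arithmetic can be checked directly. The short Python sketch below is purely illustrative (it is not taken from any of the cited studies): it computes the two-sided p-value for a hypothetical coin tested for fairness after showing 60 heads in 100 flips.

```python
import math

def binom_pmf(n, k, p=0.5):
    """Probability of exactly k successes in n trials under the null (fair coin)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Null hypothesis: the coin is fair (p = 0.5).
# Observation: 60 heads in 100 flips.
# The two-sided p-value is the probability, if the null is true, of a
# result at least this far from 50 in either direction (<= 40 or >= 60 heads).
p_value = sum(binom_pmf(100, k) for k in range(101) if k <= 40 or k >= 60)
print(round(p_value, 4))  # about 0.057 — just above the conventional 0.05 cut-off
```

A run of 60 heads therefore narrowly fails to reach conventional significance. The point worth remembering is that the p-value measures how surprising the data would be under the null hypothesis; it is not the probability that the coin is biased.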

False positives are incorrect rejections of the null hypothesis. If a treatment known to be clinically ineffective is tested and, at the 0.05 level, appears to be effective, the null hypothesis (“the treatment is clinically ineffective”) will be rejected and a false positive created. As a consequence, the therapy will be accepted because it appears, statistically, to be useful. The pervasive presence of false positives is considered to be among the most serious errors in medical (and dental) research.1,3,4

Publication Bias is the known tendency to report research with statistically significant positive results (p < 0.05) up to ten times more frequently than results that are negative (i.e., support the null hypothesis) or inconclusive.6 Since current research practice favours using statistical significance to “prove” theories, and because publication bias exists, Berlin and others believe that widespread manipulation of data to produce positive results has led to a preponderance of false positives in the literature.7-9

A more extreme example will emphasize the magnitude of the problem. Suppose a team is charged with investigating the ability of 100 mouthwashes to control or prevent gingivitis, without knowing that all of them are clinically ineffective. On average, one out of every 20 tests will produce a p-value of 0.05 or smaller by pure chance, so about five of the 100 tests (100/20) will yield “statistically significant” results suggesting effectiveness. This is a false positive rate of 5%. Even though all of the mouthwashes are useless for the purpose being tested, the hypothetical researchers will ignore the 95 negative tests and seek publication of the five positive results, knowing that journal editors overwhelmingly favour positive (albeit false) findings. In turn, the five products with the “statistically significant” results will receive commercial endorsements accompanied by expensive promotions to the profession and the public.
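The 100-mouthwash scenario is easy to reproduce in simulation. In the hedged sketch below (the simple z-test helper and all numbers are illustrative assumptions, not data from any real trial), 100 “trials” compare treated and control groups drawn from the same distribution, so every apparent effect is, by construction, a false positive.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

def p_two_sample(a, b):
    """Two-sided z-approximation p-value for a difference in means."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = abs(statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# 100 "mouthwashes", all truly ineffective: treated and control gingivitis
# scores are drawn from the very same distribution.
false_positives = 0
for _ in range(100):
    control = [random.gauss(50, 10) for _ in range(50)]
    treated = [random.gauss(50, 10) for _ in range(50)]  # no real effect
    if p_two_sample(control, treated) < 0.05:
        false_positives += 1

print(false_positives)  # typically around 5 of the 100 tests come out "significant"
```

Each “significant” result here is guaranteed to be false, yet those are precisely the tests most likely to be written up and published.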

This exaggerated scenario shows how it is possible for entirely false but “statistically significant” findings to enter the literature with a level of credibility that is extremely difficult to discredit or dispute.1

Begg concurs with others that the support, publication and acceptance of false positives are “deeply embedded in current research practice,” with the potential to produce wrong findings on the scale identified by Ioannidis.4-7 The scale of the problem becomes apparent on noting that, among the most cited research findings in the most prestigious journals investigated by Ioannidis, 32% of those with “significant” results were found to be incorrect or exaggerated, and an incredible 74% of those using the conventionally accepted p-value of 0.05 were subsequently proven wrong even though their test results had been accepted as “statistically significant.”5 Considering that most of these studies were randomized controlled trials, the “gold standard” for substantiated evidence, there is a distinct possibility that a critical analysis of statistically significant dental research findings would reveal a similar pattern of errors.

A certain laxity regarding the publication of false positive results appears to be a critical reason why research findings are flawed. Understanding why this has occurred has been a major focus of the work of Simmons, Ioannidis and Faggion.1-5

In their paper, Simmons et al identify “researcher degrees of freedom” as a major reason why research is flawed.1 This concept centres on two aspects of an investigator’s behaviour. The first relates to data collection and observations: researchers rarely decide beforehand which specific data to collect or reject, which observations to include or exclude, and which confounding variables to control or ignore. The second is that, when faced with making these decisions during the course of a study, investigators have an inbuilt desire to reach a statistically significant result.1 Thus, when confronting analytical decisions about data, observations and variables, researchers will, with convincing self-justification, choose those that produce results with a statistical significance of p < 0.05. This manipulation of the evidence and its interpretation is driven not by maliciousness but by an innate conviction that whatever decisions produce the most favourable (publishable) results are entirely appropriate.

Thus, while some of the decisions that researchers make might be innocent and entirely reasonable to them, the “degrees of freedom” they are permitted allow them to extract “statistically significant” results from almost any test. Indeed, Simmons was able to show that relatively minor manipulations would produce false positives 60% of the time at the p < 0.05 level and 21% of the time at the p < 0.01 level.1
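The mechanism behind the Simmons et al result can be illustrated, though not exactly reproduced (their paper used a different set of manipulations), with a small simulation. All data below are generated under a true null hypothesis; the only difference between the two study styles is the analytic flexibility permitted.

```python
import math
import random
import statistics

random.seed(7)

def p_two_sample(a, b):
    """Two-sided z-approximation p-value for a difference in means."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = abs(statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def rigid_study():
    """One pre-specified outcome, fixed sample size: false positive
    rate close to the nominal 5%."""
    a = [random.gauss(0, 1) for _ in range(20)]
    b = [random.gauss(0, 1) for _ in range(20)]
    return p_two_sample(a, b) < 0.05

def flexible_study():
    """Three interchangeable outcome measures, plus the option of adding
    10 more subjects per group and looking again if the first pass fails."""
    a = [[random.gauss(0, 1) for _ in range(20)] for _ in range(3)]
    b = [[random.gauss(0, 1) for _ in range(20)] for _ in range(3)]
    if any(p_two_sample(x, y) < 0.05 for x, y in zip(a, b)):
        return True
    for x, y in zip(a, b):  # "collect more data and look again": optional stopping
        x += [random.gauss(0, 1) for _ in range(10)]
        y += [random.gauss(0, 1) for _ in range(10)]
    return any(p_two_sample(x, y) < 0.05 for x, y in zip(a, b))

trials = 2000
rigid = sum(rigid_study() for _ in range(trials)) / trials
flexible = sum(flexible_study() for _ in range(trials)) / trials
print(f"rigid design: {rigid:.1%} false positives; flexible design: {flexible:.1%}")
```

Even these two mild freedoms, choosing among outcomes and peeking twice at the data, multiply the false positive rate well beyond the nominal 5%, with no fraud involved anywhere.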

The “degrees of freedom” are similar to the “bias” in research noted by Ioannidis in 2005. According to Ioannidis, “bias” is the selective manipulation and distortion of a study’s design, data, analysis and presentation to produce results that correspond to what the researchers expected or hoped to find and what editors will publish. As a consequence, while the results might be appealing because they appear to support a favourite hypothesis, they are not necessarily true.4 The chance that the results are true decreases as the level of “bias” increases.4

Therefore, it seems that the “degrees of freedom” and “bias” that researchers are afforded in designing their studies and in interpreting their results are significant factors in producing flawed research.

Apart from “bias,” Ioannidis identified six other factors that increase the probability of a research finding being untrue.

Sample Size. The smaller the sample size, the less likely it is that the research findings are true. A small sample might not detect important differences between its members, resulting in false conclusions. Ioannidis has noted that research findings are more likely to be true in scientific fields that undertake large studies (several thousand subjects) than in those involving samples of 100 or fewer.4 It is recommended that a competent statistician be consulted regarding the appropriate sample size for the type of study design.10 Unfortunately, the sample size is often dictated by the resources and time available, the inconvenience of gathering a large sample, the experience of the researchers and the sample sizes used in previous similar studies.10 It is accepted that a failure to make a correct sample size calculation will adversely affect the value of the study.4,10

Apart from major epidemiologic studies involving, for example, the incidence of caries in specific populations, it is unlikely that any significant amount of dental research has sample sizes in the thousands.

Effect Size. This is a measure of the magnitude of a result. For example, studies of a compound that reduces caries by 60-80% are more likely to be true than those showing only a 5-10% decrease. According to Ioannidis, any scientific field (including dental research) that produces small effect sizes is “plagued by almost ubiquitous false positive claims.”4
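Sample size and effect size can be sketched together, since both feed into statistical power, the probability that a study detects a real effect. The simulation below uses purely illustrative numbers (a modest effect of 0.3 standard deviations and an assumed simple z-test), not data from any dental study.

```python
import math
import random
import statistics

random.seed(11)

def p_two_sample(a, b):
    """Two-sided z-approximation p-value for a difference in means."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = abs(statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def power(n_per_group, effect, trials=1000):
    """Fraction of simulated trials in which a genuine effect of the
    given size (in standard-deviation units) reaches p < 0.05."""
    hits = 0
    for _ in range(trials):
        control = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [random.gauss(effect, 1.0) for _ in range(n_per_group)]
        if p_two_sample(control, treated) < 0.05:
            hits += 1
    return hits / trials

# A modest real effect (0.3 SD) is usually missed by a small trial
# but reliably detected by a large one.
small = power(25, 0.3)
large = power(250, 0.3)
print(f"n=25 per group: {small:.0%} power; n=250 per group: {large:.0%} power")
```

An underpowered study that nonetheless reports a “significant” result deserves suspicion: when power is this low, a significant finding is disproportionately likely to be a false positive or an exaggerated effect, which is Ioannidis’s point about both factors.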

Previous Studies. Well-designed randomized controlled trials (RCTs) and meta-analyses will generally produce more accurate results than a single or simple study that attempts to challenge a null hypothesis.4 While this may be true, RCTs and meta-analyses are not without their faults. For example, a meta-analysis using the combined data from a number of studies is only as good as each of the studies supplying the pooled information, and it is subject to the same degrees of freedom and bias described above.11 Although considered the “gold standard,” RCTs are fallible. RCTs involving thousands of subjects are complex, expensive and time consuming, and the large numbers in these mega-trials do not insulate them from the same human judgments that govern the analysis of data, observations and variables, and the subsequent statistical computations, in simpler investigations. There are ongoing debates about the merits of RCTs over observational studies.12,13 Therefore, it might be unwise to consider RCTs the last word in the design of clinical trials.

Flexibility in Design. The greater the flexibility in the design, definitions, acceptable outcomes and analytical methods of a study, the less likely the results are to be true.4 Having common standards apply to studies would be beneficial, as would unequivocal outcomes. For example, if the outcome of a study is death, the results are liable to be more accurate than those based on scales of pain perception following surgery.

Finances and Prejudices. The greater the financial interests or conflicts of interest associated with an investigation, the more likely it is that the results will be false.4 It is reasonable to assume that, where financial benefits flow from a result, investigator degrees of freedom are capable of the manipulation required to produce a favourable outcome. Ioannidis notes that if a pet belief or hypothesis of a researcher is converted into a study simply to satisfy a criterion for tenure, the inevitable self-interest biases will almost certainly produce a false result.4 Conflicts of interest also arise when, via the peer review process, a study is quashed in favour of one that complies with the beliefs of the reviewers, even if those beliefs are based on faulty research. Such action perpetuates the acceptance of untrue findings.4

Popularity of the Topic. Ioannidis has shown that when the same question is being pursued by a number of research teams, the validity of the results decreases as the number of investigations increases.4 The probable reason is that, since prestige attaches to the first team to produce a “positive” result, compromises and biases will be employed to hasten a favourable, albeit probably false, outcome.

While the degree to which these factors operate in dentistry is unknown, their very existence is reason enough to question the validity of most, if not all, dental research. Presumably it was this concern that caused Faggion to conduct his study.2

How the imperfections in research methodology might apply to dentistry is illustrated by the following.

In the January 2012 edition of the Journal of the Canadian Dental Association there is a brief article titled “Benefits of Flossing for Reducing Gingivitis.”14 It cites a recent Cochrane systematic review of randomized controlled trials to suggest that “flossing remains an effective adjunct to toothbrushing” because, according to the review, “people who brush and floss regularly have less gum bleeding compared to toothbrushing alone.”15 Does the nature of the review justify these conclusions?

The systematic review was a meta-analysis of 12 previously conducted RCTs. Although the combined total of participants in the 12 trials was 1,083, the individual trials had sample sizes ranging from 24 to 218, and none of the trials reported how its sample size was calculated. Five of the trials had a high risk of bias, with the remainder having an unclear risk. Although flossing appeared to have a statistically significant effect on reducing gingivitis, the effect size was an 8% reduction.15 The sample sizes, the presence of bias and the small effect size suggest that the findings of each of the 12 trials are probably false, and performing an analysis on the pooled results cannot correct for the defects in the original trials. Thus, the findings of the systematic review are more than likely false. Indeed, the authors of the review must be applauded for recognizing that the “trials were of poor quality and conclusions must be viewed as unreliable.”15 Interestingly, in 2008, Berchier and colleagues published a meta-analysis of 11 studies on the efficacy of dental flossing.16 They concluded that flossing had no effect on reducing gingivitis. This would seem to support the claim by Ioannidis that flaws in research methodology are the reason why apparently well-intentioned investigators studying the same topic often arrive at dramatically different results.4,5 Perhaps the brevity of the article in the January edition of the Journal of the CDA prevented the inclusion of the study by Berchier. Nevertheless, the omission perpetuates the still unsubstantiated belief that flossing is effective and demonstrates the role that publications have in spreading (presumably unintentionally) claims that, in all likelihood, are false.

The above supports the opinion that dental research is subject to the same faults in design and interpretation that Simmons and Ioannidis have identified in medical studies.1,4,5 There would therefore appear to be justification for suggesting that dentists be highly suspicious of all research findings. As an aside, dental floss was introduced in 1819.15 That after 193 years its efficacy remains unresolved is an uninspiring reflection on the state of dental research.

Methods of increasing the probability that research results are true have been provided by Faggion, Simmons and Ioannidis.1-4 The suggestions focus on improving the ability of studies to be replicated: if subsequent investigators using exactly the same methodology as the original researchers arrive at the same results, there is an increased probability that the results are true. Faggion refers to the methodology of an investigation as its “raw data.”2 Publishing the raw data would require researchers to: identify, before data collection begins, how and why it would be terminated; list all the variables influencing the study; report all experimental manipulations, even those that failed to produce the desired result; and include the statistical results of observations that were subsequently eliminated.1,2,4 It is believed that these requirements would decrease the selective manipulation of research to arrive at a preconceived or favourable outcome. Peer reviewers could assist by ensuring that these requirements are included as part of the research protocol in the final manuscript.1,2 The fundamental aim is to reduce investigator-induced degrees of freedom and bias. Had these requirements been in place before the meta-analysis on flossing was performed, the amount of bias in the 12 RCTs would have been reduced, increasing the probability that the results of the analysis were correct.

None of the 64 dental journals reviewed by Faggion required publication of the raw data with the manuscript submission.2 To be fair, among the 10 top-ranked medical journals only three suggest that the raw data be published.2 Until publications demand the inclusion of the raw data, this omission is reason enough to question even the most prestigious research.

There appear to be sufficient justifications for doubting or, at least, questioning the veracity of most medical research findings. It is highly likely that dental research is plagued by the same faults that infect medical investigations. Accordingly, it would be important to cast a critical eye on all dental studies, especially those that advance or support the preconceived ideas or biases of their authors, because they are more than likely wrong.

Dr. Hardie was intimately involved in the development of the RCDSO 1996 evidence-based guidelines.

Oral Health welcomes this original article.


1. Simmons JP et al. False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science 2011; 22(11): 1359-1366.

2. Faggion CM. Improving Transparency in Dental Research by Making the Raw Data Available. J Can Dent Assoc 2011; 77: b122.

3. Freedman DH. Lies, Damned Lies, and Medical Science. The Atlantic 2010; Nov: 76-86.

4. Ioannidis JPA. Why most published research findings are false. PLoS Med 2005; 2(8): e124.

5. Ioannidis JPA. Contradicted and Initially Stronger Effects in Highly Cited Clinical Research. JAMA 2005; 294(2): 218-228.

6. Sackett DL. Bias in analytic research. J Chronic Dis 1979; 32(1-2): 51-63.

7. Begg CB, Berlin JA. Publication Bias and Dissemination of Clinical Research. J Natl Cancer Inst 1989; 81(2): 107-115.

8. Hopewell S et al. Publication bias in clinical trials due to statistical significance or direction of trial results. Cochrane Database of Systematic Reviews 2009; Art. No.: MR000006. DOI: 10.1002/14651858.MR000006.pub3.

9. Dickersin K et al. Publication bias and clinical trials. Controlled Clinical Trials 1987; 8(4): 343-353.

10. Malhotra RK, Indrayan A. A simple nomogram for sample size for estimating sensitivity and specificity of medical tests. Indian J Ophthalmol 2010; 58(6): 519-522.

11. Linde K, Willich SN. How objective are systematic reviews? Differences between reviews on complementary medicine. J Roy Soc Med 2003; 96(1): 17-22.

12. Black N. Why we need observational studies to evaluate effect of health care. BMJ 1996; 312:1215-1218.

13. Sanson-Fisher RW et al. Limitations of the randomized controlled trial in evaluating population-based health interventions. Am J Prev Med 2007; 33(2): 155-161.

14. Benefits of Flossing for Reducing Gingivitis. J Can Dent Assoc 2012; 78(1): 18.

15. Sambunjak D et al. Flossing for the management of periodontal diseases and dental caries in adults. Cochrane Database of Systematic Reviews 2011, Issue 12. Art. No.: CD008829. DOI: 10.1002/14651858.CD008829.pub2.

16. Berchier C et al. The efficacy of dental floss in addition to a toothbrush on plaque and parameters of gingival inflammation: a systematic review. International J Dent Hyg 2008; 6: 265-279.