No Evidence for a Reproducibility Crisis in Psychology
4 March 2016
A recent article by the Open Science Collaboration (a group of 270 coauthors) gained considerable academic and public attention due to its sensational conclusion that the replicability of psychological science is surprisingly low. Science magazine lauded this article as one of the top 10 scientific breakthroughs of the year across all fields of science, reports of which appeared on the front pages of newspapers worldwide. The comment by Gilbert, King, Pettigrew and Wilson (2016) on "Estimating the reproducibility of psychological science," which appeared in Science, shows that OSC's article contains three major statistical errors and, when corrected, provides no evidence of a replication crisis. Indeed, the evidence is consistent with the opposite conclusion -- that the reproducibility of psychological science is quite high and, in fact, statistically indistinguishable from 100%. The moral of the story is that meta-science must follow the rules of science.
“This paper has had extraordinary impact,” Gilbert said. “It was Science magazine's number three ‘Breakthrough of the Year’ across all fields of science. It led to changes in policy at many scientific journals, changes in priorities at funding agencies, and it seriously undermined public perceptions of psychology. So it is not enough now, in the sober light of retrospect, to say that mistakes were made. These mistakes had very serious repercussions. We hope the OSC will now work as hard to correct the public misperceptions of their findings as they did to produce the findings themselves.”
Read the full Press Release here.
Social Media and the Crowd-Sourcing of Social Psychology
18 November 2014
[NOTE: Below are some elaborations on issues raised during the recent Headcon '14 Conference, organised by the Edge Foundation. The video and transcript of my talk entitled “Moral Intuitions, Replication and the Scientific Study of Human Nature,” together with a discussion involving Fiery Cushman, David Pizarro, Laurie Santos and L.A. Paul can be found here.
Update 22 December 2014: Coverage of Edge Presentation in The Psychologist (British Psychological Society): "Speaking Out on the 'Replication Crisis'" and on NPR: "Culture of Psychological Science at Stake"]
An essential part of science is to extend and replicate earlier findings. This has always been the process of doing research in social psychology, so it is surprising that the message that psychologists almost never try to replicate their findings gets constantly repeated in the media. In reality most of our papers contain at least one replication: We look at the same research question from different angles, using different methods. This is called a conceptual replication. For example, instead of repeating a single method 5 times in one paper, we report 5 studies using different methods to test the same question because it gives us confidence that a finding is not due to any one method. In addition, in order to test new questions we routinely build on other researchers’ work and therefore replicate their earlier findings (for a list of replications of some of my work, see *Footnote).
Recently so-called "direct replications" have attracted a lot of attention. Since they usually only involve a repetition of one specific method out of many potential methods, a more accurate term is a "method repetition." Although it can help establish whether a method is reliable, it cannot say much about the existence of a given phenomenon, especially when a repetition is only done once (Stanley & Spence, 2014). Nevertheless, method repetitions are often presented as if their findings reflect the general existence or absence of an effect.
Several months ago I noted the editorial irregularities involving the publication of method repetitions in a scientific journal in the absence of any peer review. Part of the issue was some people’s inappropriate interpretation of findings on social media. But the larger point was lost on many, including the Chronicle of Higher Education reporter who concluded his article by saying "while being nice is important, so is being right." That is precisely the problem: Without due scientific process we will learn little about the true reproducibility of our phenomena. Let's review what happened after I expressed concerns because scientific findings entered the archived journal record without the most minimal quality control. Unfortunately, despite heated discussions, not much has changed.
The Crowd-Sourcing of Social Psychology
Like other scientists, social psychologists study highly specialized topics based on expertise accumulated in years of advanced training. But as the sociologist Harry Collins has argued, nowadays on the internet everybody can be a scientist. Indeed, on social media many people have strong opinions on social psychology. After all, the topics sound intuitive, and the methods appear simple. The large-scale replication efforts such as the Reproducibility Project, Many Labs and the Replication Special Issue take the democratization of social psychology to a whole new level: Anybody can nominate a research topic, anybody can be a researcher, and anybody can be a reviewer.
Selection of Replication Topics
Original research requires an up-to-date understanding of the relevant literature and appropriate methods. The goal is to address a gap in the knowledge base. In contrast, there appears to be no systematic approach to the selection of replication topics. It is even possible to anonymously nominate replication targets without any scientific reason, and despite claims to openness and transparency it is not clear how targets are subsequently chosen. The main criteria are that a study is "important" and "feasible for replication." Since "important" is impossible to define, replicators cherry-pick mostly studies that seem straightforward to run. For example, despite the Reproducibility Project's goal of coverage across 493 articles published in three top journals in 2008, only some are singled out and made "available for replication." Since January 2012 experiments for less than 8% of eligible articles have been completed.
Furthermore, it is irrelevant if an extensive literature has already confirmed the existence of a phenomenon. Although the original study was conducted in 1935 and textbooks already concluded that "the Stroop effect is the most reliable and robust phenomenon of reaction time research (Rosenbaum, 2009, p. 287)," Many Labs 3 will conduct a further replication. Norbert Schwarz has noted that the selection of many replication topics appears to follow the "Priors of Ignorance," namely Bayesian priors in the absence of any knowledge of the literature.
If somebody asks for your materials to do a method repetition you have two options: You cooperate, and help the replicators with their experiment. In that case sharing your time and expertise means that if the study fails, it will be considered even stronger evidence against your work because you confirmed the method. If you do not cooperate then this information will be exposed on the Reproducibility Project’s website, or may appear on social media, so it is difficult to say "no". Even then, the replicators may still patch together a study in the absence of the guidelines suggested by Danny Kahneman. Subsequently the original author will need to make sense of the results, or if no effect was observed, may want to conduct additional experiments to show that the effect is robust after all. Apart from reputational considerations, being targeted for replication involves a lot of work because the original author is often the only person with expertise on the topic.
Various large-scale replication efforts are coordinated via the Center for Open Science, a technology company with almost 50 people and several million dollars of funding, and replicators are well-connected on social media. Thus, it would be easy to ensure that the replication burden is evenly distributed. Instead, the same topics, and therefore the same scientists, are singled out again and again. For example, my work was included in both the Replication Special Issue and the Reproducibility Project. Furthermore, there is a disproportionate number of replications on psychological phenomena that occur outside of conscious awareness (e.g., embodied cognition and priming), which -- despite having been supported by considerable evidence -- also are the topics about which some replicators make condescending comments on social media.
It is a curious coincidence that people who have publicly criticized some aspects of replication efforts have had their work subjected to multiple replication attempts. Norbert Schwarz was targeted in Many Labs and Many Labs 2. Fritz Strack published a critical piece (Stroebe & Strack, 2014), and is targeted in Many Labs, Many Labs 2 and in another attempt for Perspectives on Psychological Science. Monin and Oppenheimer (2014) wrote a somewhat critical commentary on the successful replication of their research in Many Labs. Both authors already had several of their papers scrutinized in the Reproducibility Project. In addition, Oppenheimer's work is again featured in Many Labs 2 and Monin's work in Many Labs 3.
The same authors who produced a ceiling effect in the replication of my work but in print accused me of "hunting for artifacts" and mocked my research on social media have turned their attention to another paper of mine. They have many suspicions on topics on which they never published any successful research and within a very short time have moved from one phenomenon to another and produced many null results. They single out specific individuals, and for example, targeted several of John Bargh’s papers while ridiculing this work on social media. Further examinations often show that there actually is an effect when experiments and analyses are done carefully, both in the case of Bargh’s work and mine.
Another prolific replicator mocked my research on Twitter while at the same time his student demanded replication materials. By now at least three of my papers are under scrutiny by people who despite coming from outside the research area already made their negative expectations clear. In addition, many well-intentioned students are apparently no longer taught how to develop an understanding of a given literature. Instead they email original authors to get replication materials plus "all relevant information." Since many go for findings that may be counterintuitive if one is unfamiliar with the literature and for which studies are seemingly straightforward to run, the same scientists are constantly expected to provide expertise and serve as surrogate supervisors to orphaned students. Some universities even teach courses on replication, as if re-running somebody else's experiment was a transferable skill that can be divorced from the in-depth understanding of a highly specialized literature.
If there really were no other "important" findings, then being repeatedly targeted would be less of an issue if method repetitions were conducted to the same high standard as the original work. Unfortunately, that is often not the case.
Quality of Replication Evidence
Research normally is done by experts and the students and trainees working under their close supervision. Indeed, as noted in the context of biology, a lack of familiarity with the subject matter poses the risk that replication testing conditions are not optimal. Danny Kahneman encouraged replications done in collaboration with other experts to give replications the best possible chance of success. None of the replication efforts follow this approach. Instead, anybody can sign up to get involved, including "citizen scientists" without any academic affiliation.
While replicators are encouraged to contact the original authors to get details on the precise method, there appears to be no requirement that replicators themselves have successfully conducted related experiments. For example, for the Reproducibility Project they are merely instructed to "Gather all relevant materials: Whether shared by the original authors or not, prepare the relevant materials and instrumentation to conduct the replication. Keep notes of where there was insufficient information to reconstruct precisely as in the original report (and include this information in your intro/methods)."
Research requires meticulous control of all aspects of the experimental protocol and the resulting social context. Indeed, even in studies with mice seemingly irrelevant factors such as the gender of the experimenter can make a difference (Sorge et al., 2014). Instead, for method repetitions efficiency in data collection often trumps attention to detail. For example, some Many Labs investigations include up to 15 computerized experiments in the same session lasting just 30 minutes. Although this alone makes the testing conditions completely different from regular psychology research, the influence of the order in which so many experiments are presented is largely ignored. Furthermore, studies that were initially carried out in highly controlled laboratory contexts are sometimes repeated under conditions that offer no comparable experimental control, such as via online surveys.
Method repetitions often emphasize a large sample size while neglecting other aspects of experimental design. However, Westfall, Kenny and Judd (2014) and Westfall, Judd and Kenny (in press) demonstrated that with a small set of stimuli, the chances of finding an effect can be low even when a large number of participants are tested. Monin and Oppenheimer (2014) make a related point. Westfall et al. (in press) conclude: "if a direct replication of a study involving stimulus sampling follows the traditional advice about increasing the number of participants but using the exact same stimulus set, then it would often be theoretically impossible for the power of the replication study to be much higher than that of the original study." Thus, simply re-using earlier materials and testing a large sample of participants does not guarantee that a study is well-controlled and highly-powered.
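The arithmetic behind this point can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the model used by Westfall and colleagues: it assumes a within-participant design with two conditions, a separate small set of stimuli in each condition, a random stimulus intercept, independent residual noise, and a normal approximation to the significance test. The function names and all numeric values (effect size, variances, four stimuli per condition) are hypothetical choices made for the example.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_from_variance(effect, var_diff, alpha=0.05):
    """Two-sided power of a z-test for a mean difference with known variance."""
    ncp = effect / var_diff ** 0.5
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(ncp - z_crit) + _STD_NORMAL.cdf(-ncp - z_crit)


def approx_power(n_participants, n_stimuli=4, effect=0.5,
                 var_stimulus=0.2, var_residual=1.0):
    """Approximate power when stimuli are treated as a random sample.

    Assumed (hypothetical) variance of the condition difference:
    stimulus noise shrinks only with the number of stimuli, while
    residual noise shrinks with stimuli times participants.
    """
    var_diff = 2 * (var_stimulus / n_stimuli
                    + var_residual / (n_stimuli * n_participants))
    return power_from_variance(effect, var_diff)


def power_ceiling(n_stimuli=4, effect=0.5, var_stimulus=0.2):
    """Power in the limit of infinitely many participants."""
    return power_from_variance(effect, 2 * var_stimulus / n_stimuli)


if __name__ == "__main__":
    for n in (20, 200, 2000, 20000):
        print(n, "participants:", round(approx_power(n), 3))
    print("ceiling with 4 stimuli per condition:", round(power_ceiling(), 3))
```

Under these assumptions, adding participants moves the power toward a ceiling set by the number of stimuli and never past it; only a larger or different stimulus set raises the ceiling itself.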
When experiments are reconstructed by replicators without previous expertise, subsequent errors often limit the interpretability even for "successfully replicated" findings. For example, in Many Labs the wrong procedure was used for Kahneman's anchoring effect so the claim that the replication yielded a larger effect than the original study is incorrect. Similarly, the original work by Schwarz et al. used a scale designed to capture people's TV consumption in Germany in 1983, when there were only three TV stations. Despite his recommendation to update the scale to make it appropriate for TV watching habits across different countries thirty years later, Many Labs used the outdated scale, thus making the result uninformative although it was reported to have "successfully replicated" the effect.
Evaluation of Replication Evidence
There is a long tradition in science to withhold judgment on findings until they have survived expert peer review. But some of the large-scale replication efforts have intentionally removed such quality control. This was the case for the Replication Special Issue; similarly, findings from the Reproducibility Project are simply uploaded to a public website without having gone through the same kind of expert review as the original research. There still is no recognition of the reputational damage done to original authors once unverified claims of "failed replications" enter the public domain.
The errors regarding findings published in the Replication Special Issue, including the editor’s own paper Many Labs, already show that "crowd-sourcing" without peer review does not work. But instead of editorial corrections or retractions readers are encouraged to run their own analyses on the public data files, or wait for further data to become available. But what will be considered the valid finding? My blog, your blog, or another blog in 10 years' time? In any case, if an experiment failed to capture the relevant psychological process, no amount of analysis will help.
Furthermore, reviewers with relevant expertise do not arrive at a judgment by group consensus; instead, it is essential that they give their evaluations independently, usually anonymously so they can be candid. Then an editor weighs the strength of their arguments and makes a judgment. However, it is not about determining whether an effect is "real" and exists for all eternity; the evaluation instead answers a simple question: Does a conclusion follow from the evidence in a specific paper? This needs to be established before a finding is published and therefore presented as valid and reliable.
Findings from single experiments are often declared as if they speak to the reproducibility of an effect, when in reality they are unable to do so (Stanley & Spence, 2014). For example, only 38 of the 493 eligible studies have been completed and analyzed for the Reproducibility Project, and none of the findings have undergone any expert review. Yet in a BBC interview it was already announced that only "one third of results are reproducing the original results" and that there is "less reproducibility than assumed." A more appropriate, albeit less media-friendly, conclusion would be:
"When somebody without a track record of successfully having conducted research on a certain psychological phenomenon cherry-picks a topic and with little or no expert involvement attempts to reconstruct one randomly selected method out of many possible methods in one single experiment, and when no experts independently confirm whether methods and results were valid, and whether the conclusions follow given the larger context of the literature, then for such an experiment an effect is reported a third of the time."
In other words, when method repetitions omit all scientific rigor typically used for psychological research, many studies will not work. While it may be curious to see what happens under such conditions, they hardly constitute a fair attempt to reveal the "overall rate of reproducibility in a sample of the published psychology literature." Instead, if one wanted to raise doubts about a research area, or about specific researchers' work, this would be the way to do it.
Some have wondered whether original authors should be "allowed" to review method repetitions of their work. Due process has an easy answer: What is the most relevant evidence when making a judgment? Editors get input from the experts who know the most about a topic. By definition there is nobody in the world who knows more about the research than the person who developed it. As always, there should be additional independent reviewers. Editors are in the business of objectively evaluating the input provided by reviewers. Due process does not suppress evidence just because of its source; it considers the strength of it, along with all other relevant evidence.
The Model of Clinical Trials
Method repetitions in psychology appear to be modelled after the rigorous clinical trials in medicine. Now, let’s imagine a "crowd-sourced" approach. We could have an open call to anybody, regardless of relevant training, to approach pharmaceutical companies for samples of their pills in order to conduct replications. It is unlikely that companies would cooperate if no pharmacologists with expertise on a specific drug were involved. But one could still buy certain drugs at the pharmacy. Of course this could only be done for drugs that are easy to obtain, and effects that are easy to measure.
So replicators could test again and again whether Aspirin has blood-thinning properties, irrespective of the fact that this has already been established. Bayer would be expected to provide guidance on how to run the "trials," and to subsequently interpret the resulting "data." Even without Bayer’s input, one could move ahead and simplify the method as long as it was pre-registered and documented. For example, instead of bringing participants to a lab, they could be asked to complete an online survey.
Once the "trial" is complete, instead of expert review one could upload the "findings" to a website, and encourage doctors to analyze publicly available data files in order to determine the best course of treatment for their patients. It is unlikely that anybody would tell Bayer to not worry about their reputation because "replication is not about you" and that they should take the crowd-sourced investigations of their product as a "compliment". Similarly, claims from such "trials" in the public domain would be confusing to patients, who may no longer be confident in their medication. The absurdity of such an approach is self-evident. Science is not done via crowd-sourcing regarding the generation of evidence, and the evaluation of it. But such an approach is actively encouraged for method repetitions: They suggest that psychology is not a science at all.
The Incentive of Method Repetitions
Why are method repetitions so popular? One possibility is that while most journals reject 80-90% of submissions, some large-scale replication efforts offer replicators a guaranteed publication, plus funding to do the experiment. In addition, many journals now accept method repetitions. Many colleagues have remarked, however, that it is difficult to publish a successful replication of a previously observed effect. It shows nothing new. In contrast, a single failed study now is treated as if it does show something new: A well-known effect has been "debunked"! Replication papers often just superficially describe whether a study "worked" and make little attempt to integrate findings into the literature. Some papers describing single experiments conclude that "more research is needed" when in reality many conceptual replications already provided plenty of evidence for the phenomenon because the literature moved on long ago.
A method repetition involving one study can result in a publication in a top journal with only a fraction of the effort required for an original paper involving several conceptual replications in the same journal. In principle one could get many publications by producing one failed study after another. Although this can be productive for replicators, it creates a bottleneck for expert reviewers. Since many method repetitions are done on similar topics, the same researchers are repeatedly expected to review resulting studies. Some of us can no longer keep up with the flood of studies that are conducted without an in-depth understanding of the literature and therefore often have little in common with the objectives tested in the original research.
Method repetitions involve an interesting dilemma not only regarding workload, but also regarding authorship, which is usually only conferred when somebody makes a significant intellectual contribution. For example, a research assistant who had no input into the research and merely follows a script developed by others is usually not an author on the resulting paper. By definition there is no novel intellectual contribution in a method repetition: All the work was done previously by the original author. This person is expected to provide further expertise by helping replicators with their study, or the interpretation of the results, but there is no credit; instead, the original author ends up with a question mark next to their name. Authorship issues could be resolved by adversarial collaborations – if by "collaboration" one means that one party contributes their scepticism while the other party provides all relevant knowledge.
The So-Called "Replication Debate"
Supporters of method repetitions maintain a powerful social media presence that constantly repeats one message: We need more replications. In particular, whenever somebody raises concerns about how method repetitions are carried out, many voices repeat the party line. Instead of addressing arguments, a key strategy is to attack the doubters' credibility. Kahneman is talking nonsense, Lieberman has a Harry Potter Theory of social psychology, Mitchell's opinion is ludicrous, pseudo-scientific and awe-inspiringly clueless. There have been plenty of insults by "intuitive" social psychologists, who are incredulous at why anybody could be "against replications." Then the replication efforts carry on precisely as before. In any case, many highly vocal supporters do research that is unlikely to ever be targeted by the "crowd-sourced" method repetitions that they recommend for the work of others.
This suppresses any meaningful public discussion. Just count the number of social media posts that support direct replications, and then count the ones that share concerns; there are very few of the latter. Has nobody even noticed who has been silent in the so-called "debate"? Although many young people are highly connected on social media, there is practically no direct criticism from junior social psychologists without tenure or other job security. Public concerns have come only from a few people at the top professorial rank who are so established that they can afford to be labelled "anti-replication." For the rest of us who need to build a career on "replication-feasible" research it is best to keep quiet: Since there are no criteria for the selection of replication targets one should not attract any unnecessary attention.
Instead, it is a useful strategy to either tacitly or actively support the replication efforts. For example, if you participate in one of the large replication projects such as Many Labs, you may have an influence on the selection of targets. You could ensure that your work is not chosen, or if it is, suggest the most robust finding and the most reliable method to boost the reputation of your own work while other people won't have that option. Furthermore, if you work in the 150km radius around Tilburg, Diederik Stapel's former institution, then your public support of replications demonstrates your commitment to scientific integrity.
All this creates the illusion of a controversy. But replication is not controversial – novel findings routinely build on earlier findings by using the methods developed by other researchers, therefore replicating and extending the work. Similarly, publications usually include multiple studies that build on one another, replicating central findings of the earlier studies. Social psychologists appreciate the value of reproducibility and replicate their own, and each other’s, findings all the time. At issue is not the value of replications, but the media hype with which method repetitions are marketed despite obvious limitations. Moreover, many original authors have cooperated with replicators in every possible way, only to find that the replicators are quick to make exaggerated claims about their findings while concerns are dismissed.
Social Psychologists and the Moral Values of Harm and Fairness
Social psychologists tend to be liberals, so their most important moral values are to prevent harm, and to ensure fairness (Graham, Haidt, & Nosek, 2009). Why do they not speak out when the replication burden is shouldered by just a small group of scientists while we learn little about the true reproducibility of all of our phenomena? How does it make sense to apply a rigorous scientific approach only to original research, but not to the "crowd-sourcing" by non-experts that allegedly aims to establish the reliability of the original research? Is nobody concerned when misleading claims about the reproducibility of phenomena are not just made in the media but even in scientific journals while conceptual replications are ignored?
#repligate provides the answer. If your career depends on "replication-feasible" research, having an opinion is not worth it. There will be an aggressive social media response that instead of addressing concerns involves accusations and insults regarding your professional and moral integrity. This will be followed by seemingly scientific public scrutiny of your work by "intuitive" experts, and "crowd-sourced" replication attempts that have little resemblance to your original research. You will need to conduct additional studies to demonstrate the integrity of your earlier work, but at some point it will become impossible to keep up with the demands for replication materials and data files, and the rate at which some people produce "evidence".
Overall, if you say there is something wrong with how some method repetitions are done, many people will try to show that there must be something wrong with you. In any case, voicing a concern is futile: Despite my explicit request the Replication Special Issue editors refused to subject the replication of my work to independent peer review. Similarly, results for the Reproducibility Project continue to be published before any experts confirm their validity.
As a community, it would be useful to collectively come up with guidelines on how to best conduct replications that follow a scientific approach and help our science become more rigorous. Doing research requires looking at a comprehensive body of evidence, including conceptual replications, in order to determine a knowledge gap. What is the concern we want to address: the reliability of a specific method, or the reproducibility of a certain phenomenon? The cherry-picked single-study method repetition cannot address either issue, especially when research quality is low. As scientists it is our obligation to society to address meaningful questions in order to advance knowledge, not just re-run outdated studies merely because they are "feasible" to conduct. Indeed, these are the skills we need to teach to our next generation of scientists so they can make their own unique contribution.
But it appears that no critical discussion is possible. Perhaps our field is experiencing a phenomenon that it discovered: Pluralistic ignorance. Many colleagues are probably concerned, but because criticism is not tolerated and everybody keeps quiet, the show is allowed to go on. In the war against false positives, you are either with us, or against us. Having an opinion is quickly translated into being "against replications." So, regardless of what you really think, repeat after me: Direct replications are saving our science. Most importantly, make sure to keep your head down.
An Example of Conceptual Replications: Is Physical Purity Related to Moral Purity?
My research has approached this question from two angles: First, is physical disgust related to moral outcomes?
Schnall, Haidt, Clore & Jordan (2008) showed across 4 experiments using different methods that feelings of disgust can make moral judgments more severe. According to Google Scholar, since 2008 this paper has already been cited over 550 times. The reason is that many other researchers built on the work and replicated and extended it, using a variety of different methods. Indeed, there are additional findings to suggest that induced disgust relates to moral outcomes (e.g., Bonini, Hadjichristidis, Mazzocco, Demattè, Zampini, Sbarbati & Magon, 2011; Eskine, Kacinik, & Prinz, 2011; Harle & Sanfey, 2010; Horberg, Oveis, Keltner & Cohen, 2009; Moretti & di Pellegrino, 2010; Ugazio, Lamm & Singer, 2012; Wheatley & Haidt, 2005; Winterich, Mittal & Morales, 2014; Whitton, Henry, Rendell & Grisham, 2014).
Looking at the reverse relationship, it has also been shown that exposure to morally bad behavior leads to disgust (e.g., Cannon, Schnall, & White, 2011; Chapman, Kim, Susskind, & Anderson, 2009; Danovitch & Bloom, 2009; Gutierrez, Giner-Sorolla, & Vasiljevic, 2012; Hutcherson & Gross, 2011; Jones & Fitness, 2008; Moll, de Oliveira-Souza, Moll, Ignacio, Bramati, Caparelli-Daquer, & Eslinger, 2005; Parkinson, Sinnott-Armstrong, Koralus, Mendelovici, McGeer, & Wheatley, 2011; Russell & Piazza, 2014; Schaich Borg, Lieberman, & Kiehl, 2008; Young & Saxe, 2011).
The relationship between disgust and morality has further been examined via differences in people's sensitivity to disgust. Indeed, whether one gets easily physically disgusted or not is linked to moral considerations (e.g., Crawford, Inbar, & Maloney, 2014; Horberg, Oveis, Keltner & Cohen, 2009; Hodson & Costello, 2007; Inbar, Pizarro & Bloom, 2009; Inbar, Pizarro, Knobe, & Bloom, 2009; Inbar, Pizarro, & Bloom, 2012; Inbar, Pizarro, Iyer, & Haidt, 2012; Jones & Fitness, 2008; Graham, Haidt & Nosek, 2009; Olatunji, Abramowitz, Williams, Connolly, & Lohr, 2007; Tybur, Lieberman, Griskevicius, 2009).
Summarizing all these findings, Chapman and Anderson (2013, p. 322) concluded in a recent review paper: "We have described a growing body of research that has investigated the role of disgust in moral cognition. Taken together, these studies converge to support the notion that disgust does play an important role in morality."
Second, I have looked at the opposite of disgust, to investigate: Is physical cleanliness related to moral outcomes?
Schnall, Benton & Harvey (2008) showed in 2 experiments that an experimentally created sense of cleanliness has the opposite effect on moral judgment to that of disgust, and makes judgments less severe. One experiment had participants wash their hands after having been exposed to disgust; the other experiment induced a sense of cleanliness using cognitive priming. This paper already involved a conceptual replication, because it tested whether effects would be comparable for actual hand-washing and for merely activating thoughts related to cleanliness.
Various other studies followed up on this work and used different methods to also show a link between physical cleanliness and moral cleanliness (Cramwinckel, De Cremer & van Dijke, 2012; Cramwinckel, Van Dijk, Scheepers & Van den Bos, 2013; Denke, Rotte, Heinze & Schaefer, 2014; Golec da Zavala, Waldzus, & Cypryanska, 2014; Gollwitzer & Melzer, 2012; Helzer & Pizarro, 2011; Jones & Fitness, 2008; Lee & Schwarz, 2010; Liljenquist, Zhong & Galinsky, 2010; Lobel, Cohen, Shahin, Malov, Golan & Busnach, 2014; Ritter & Preston, 2011; Reuven, Liberman, & Dar, 2013; Tobia, Chapman & Stich, 2013; Winterich, Mittal & Morales, 2014; Xie, Yu, Zhou & Sedikides, 2013; Xu, Begue & Bushman, 2014; Zhong & Liljenquist, 2006; Zhong, Strejcek, Sivanathan, 2010).
In addition to these conceptual replications, there have been multiple successful direct replications involving cleanliness that used the same materials as Schnall, Benton & Harvey (2008). They were done in different labs in the United States, namely by Huang (in press) and Daubman (replication 1, replication 2), and further by Leung & Wong in Hong Kong.
Overall, the finding that physical purity and moral purity are related has been replicated many times by different investigators across the world, using various different methods.
Recent Discussions on Replications in Psychology: 15 June 2014
1) Nate Kornell on Replication Bullying
2) Allen McConnell on New Media and Scientific Discourse
3) Dan Gilbert on Replication Bullying
4) Jim Coan on Negative Psychology
5) Tim Wilson on False Negatives in Psychology
Further Thoughts on Replications, Ceiling Effects and Bullying
University of Cambridge
31 May 2014
Recently I suggested that the registered replication data of Schnall, Benton & Harvey (2008), conducted by Johnson, Cheung & Donnellan (2014) had a problem that their analyses did not address. I further pointed to current practices of "replication bullying." It appears that the logic behind these points is unclear to some people, so here are some clarifications.
The Ceiling Effect in Johnson, Cheung & Donnellan (2014)
Compared to the original data, the participants in Johnson et al.'s (2014) data gave significantly more extreme responses. With a high percentage of participants giving the highest score, it is likely that any effect of the experimental manipulation will be washed out because there is no room for participants to give a higher score than the maximal score. This is called a ceiling effect. My detailed analysis shows how a ceiling effect could account for the lack of an effect in the replication data.
Some people have wondered why my analysis determines the level of ceiling across both the neutral and clean conditions. The answer is that one has to use an unbiased indicator of ceiling that takes into account the extremity of responses in both conditions. In principle, any of the following three possibilities can be true: The percentage of extreme responses is a) unrelated to the effect of the manipulation, b) positively correlated with the effect of the manipulation, or c) negatively correlated with the effect of the manipulation. In other words, this ceiling indicator takes into account the fact that it could go either way: Having many extreme scores could help or hinder in terms of finding the effect.
Importantly, the prevalence of extreme scores is neither mentioned in the replication paper, nor do any of the reported analyses take into account the resulting severe skew of data. Thus, it is inappropriate to conclude that the replication constitutes a failure to replicate the original finding, when in reality the chance of finding an effect was compromised by the high percentage of extreme scores.
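To make the idea of a pooled ceiling indicator concrete, here is a minimal sketch in Python. It is not the analysis from my commentary: the ratings are made-up illustrative numbers, and the function name `ceiling_share` and the 7-point scale are assumptions chosen only for this example. It simply computes the share of responses at the top of the scale, once per condition and once across both conditions combined.

```python
def ceiling_share(responses, scale_max):
    """Return the fraction of responses that sit at the top of the scale."""
    return sum(1 for r in responses if r == scale_max) / len(responses)


# Hypothetical ratings on an assumed 7-point scale (not real data).
SCALE_MAX = 7
neutral = [7, 7, 6, 7, 5, 7, 6, 7]
clean = [7, 6, 7, 5, 7, 6, 4, 7]

# Per-condition shares, shown for comparison.
neutral_share = ceiling_share(neutral, SCALE_MAX)
clean_share = ceiling_share(clean, SCALE_MAX)

# Pooled share: computed across both conditions, so the indicator does not
# presuppose in which condition the extreme responses occur.
pooled_share = ceiling_share(neutral + clean, SCALE_MAX)

print(f"neutral: {neutral_share:.1%}, clean: {clean_share:.1%}, "
      f"pooled: {pooled_share:.1%}")
```

The pooled figure is the one that corresponds to determining the level of ceiling across both conditions; the per-condition figures are printed only to show how the pieces relate.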
The Replication Authors' Rejoinder
The replication authors responded to my commentary in a rejoinder. It is entitled "Hunting for Artifacts: The Perils of Dismissing Inconsistent Replication Results." In it, they accuse me of "criticizing after the results are known," or CARKing, as Nosek and Lakens (2014) call it in their editorial. In the interest of "increasing the credibility of published results" interpretation of data evidently needs to be discouraged at all costs, which is why the special issue editors decided to omit any independent peer review of the results of all replication papers.
In their rejoinder Johnson et al. (2014) make several points:
First, they note that there was no a priori reason to suspect that my dependent variables would be inappropriate for use with college students in Michigan. Indeed, some people have suggested that it is my fault if the items did not work in the replication sample. Apparently there is a misconception among replicators that first, all materials from a published study must be used in precisely the same manner as in the original study, and second, as a consequence, that "one size fits all", and all materials have to lead to the identical outcomes regardless of when and where you test a given group of participants.
This is a surprising assumption, but it is wide-spread: The flag-priming study in the "Many Labs" paper involved presenting participants in Italy, Poland, Malaysia, Brazil and other countries with the American flag, and asking them questions such as what they thought about President Obama, and the United States' invasion of Iraq. After all, these were the materials that the original authors had used, so the replicators had no choice. Similarly, in another component of "Many Labs" the replicators insisted on using an outdated scale from 1983, even though the original authors cautioned that this makes little sense, because, well, times have changed. Considering that our field is about the social and contextual factors that shape all kinds of phenomena, this is a curious way to do science indeed.
Second, to "fix" the ceiling effect, in their rejoinder Johnson et al. (2014) remove all extreme observations from the data. This unusual strategy was earlier suggested to me by editor Lakens in his editorial correspondence and then implemented by the replication authors. It involves getting rid of a huge portion of the data: 38.50% from Study 1, and 44.00% of data in Study 2.
Even if sample size is still reasonable, this does not solve the problem: The ceiling effect indicates response compression at the top end of the scale, where variability due to the manipulation would be expected. If you get rid of all extreme responses then you are throwing away exactly that portion of the data where the effect should occur in the first place. I have yet to find a statistician or a textbook that advocates removing nearly half of the data as an acceptable strategy to address a ceiling effect.
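The logic can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reanalysis of either data set: the latent means (6.5 and 6.0), the standard deviation (1.5), the 7-point scale, and the sample size are arbitrary values chosen so that many responses land at the top of the scale. Each simulated participant has an unbounded "latent" judgment that is rounded and clipped to the scale, and the code compares the group difference in the latent scores, in the observed ratings, and in the ratings that remain after all top-of-scale responses are dropped.

```python
import random
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 7  # assumed 7-point response scale


def to_rating(latent):
    """Round a latent judgment and clip it to the response scale."""
    return min(max(round(latent), SCALE_MIN), SCALE_MAX)


def simulate(n=10_000, seed=0):
    rng = random.Random(seed)
    # Assumed latent distributions: the neutral group judges more severely.
    neutral_latent = [rng.gauss(6.5, 1.5) for _ in range(n)]
    clean_latent = [rng.gauss(6.0, 1.5) for _ in range(n)]

    # What the scale can actually record.
    neutral = [to_rating(x) for x in neutral_latent]
    clean = [to_rating(x) for x in clean_latent]

    # Drop every response at the top of the scale.
    neutral_trimmed = [r for r in neutral if r < SCALE_MAX]
    clean_trimmed = [r for r in clean if r < SCALE_MAX]

    return {
        "latent": mean(neutral_latent) - mean(clean_latent),
        "observed": mean(neutral) - mean(clean),
        "trimmed": mean(neutral_trimmed) - mean(clean_trimmed),
    }


diffs = simulate()
for label, value in diffs.items():
    print(f"{label:>8} difference in means: {value:.3f}")
```

Under these assumptions the observed difference is smaller than the latent one because the scale compresses the top end, and dropping the top-of-scale responses does not restore the latent difference. Different assumed distributions would give different numbers, so this shows only that the concern is coherent, not how large the distortion was in any actual study.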
Now, imagine that I, or any author of an original study, had done this in a paper: It would be considered "p-hacking" if I had excluded almost half my data to show an effect. No reviewer would have approved of such an approach. By the same token, excluding almost half the data to show the absence of an effect should be considered "reverse p-hacking." Yet the replication authors are allowed to do so and claim that there "still is no effect."
Third, the replication authors selectively analyze a few items regarding the ceiling effect because they consider them "the only relevant comparisons." But it does not make sense to cherry-pick single items: The possible ceiling effect can be demonstrated in my analyses that aggregate across all items. As in their replication paper, the authors further claim that only selected items on their own showed an effect in my original paper. This is surprising considering that it has always been standard practice to collapse across all items in an analysis, rather than making claims about any given individual item. Indeed, there is a growing recognition that rather than focussing on individual p-values one needs to instead consider overall effect sizes (Cumming, 2012). My original studies have Cohen’s d of .61 (Experiment 1), and .85 (Experiment 2).
Again, imagine that I had done this in my original paper, to only selectively report the items for which there was an effect, "because they were the only relevant comparisons" and ignored all others. I would most certainly be condemned for "p-hacking", and no peer reviewer would have let me get away with it. Thus, it is stunning that the replication authors are allowed to make such a case, and it is identical to the argument made by the editor in his emails to me.
In summary, there is a factual error in the paper by Johnson et al. (2014) because none of the reported analyses take into account the severe skew in distributions: The replication data show a significantly greater percentage of extreme scores than the original data (and the successful direct replication by Arbesfeld, Collins, Baldwin, & Daubman, 2014). This is especially a problem in Experiment 2, which used a shorter scale with only 7 response options, compared to the 10 response options in Experiment 1. The subsequent analyses in the rejoinder, which remove extreme scores or selectively focus on single items, are not sound. A likely interpretation is a ceiling effect in the replication data, which makes it difficult to detect the influence of a manipulation, and is especially a problem with short response scales (Hessling, Traxel & Schmidt, 2004). There can be other reasons for why the effect was not observed, but the above possibility should have been discussed by Johnson et al. (2014) as a limitation of the data that makes the replication inconclusive.
An Online Replication?
In their rejoinder the replication authors further describe an online study, but unfortunately it was not a direct replication: It lacked the experimental control that was present in all other studies that successfully demonstrated the effect. The scrambled sentences task (e.g., Srull & Wyer, 1979) involves underlining words on a piece of paper, as in Schnall et al. (2008), Besman et al. (2013) and Arbesfeld et al. (2014). Whereas the paper-based task is completed under the guidance of an experimenter, for online studies it cannot be established whether participants exclusively focus on the priming task. Indeed, results from online versions of priming studies systematically differ from lab-based versions (Ferguson, Carter, & Hassin, 2014). Priming involves inducing a cognitive concept in a subtle way. But for it to be effective, one has to ensure that there are no other distractions. In an online study it is close to impossible to establish what else participants might be doing at the same time.
Even more importantly for my study, the specific goal was to induce a sense of cleanliness. When participants do the study online they may be surrounded by mess, clutter and dirt, which would interfere with the cleanliness priming. In fact, we have previously shown that one way to induce disgust is to ask participants to sit at a dirty desk, where they were surrounded by rubbish, and this made their moral judgments more severe (Schnall, Haidt, Clore & Jordan, 2008). So a dirty environment while doing the online study would counteract the cleanliness induction. Another unsuccessful replication has surfaced that was conducted online. If somebody had asked me whether it makes sense to induce cleanliness in an online study, I would have said "no," and they could have saved themselves some time and money. Indeed, the whole point of the registered replications was to ensure that replication studies had the same amount of experimental control as the original studies, which is why original authors were asked to give feedback.
I was surprised that the online study in the rejoinder was presented as if it can address the problem of the ceiling effect in the replication paper. Simply put: A study with a different problem (lack of experimental control) cannot fix the clearly demonstrated problem of the registered replication studies (high percentage of extreme scores). So it is a moving target: Rather than addressing the question of "is there an error in the replication paper," to which I can show that the answer is "yes," the replication authors were allowed to shift the focus to inappropriate analyses that "also don't show an effect," and to new data that introduce new problems.
Guilty Until Proven Innocent
My original paper was published in Psychological Science, where it was subjected to rigorous peer-review. A data-detective some time ago requested the raw data, and I provided them but never heard back. When I checked recently, this person confirmed the validity of my original analyses; this replication of my analyses was never made public.
So far nobody has been able to see anything wrong in my work. In contrast, I have shown that the analyses reported by Johnson et al. (2014) fail to take into account the severe skew in distributions, and I alerted the editors to it in multiple attempts. I shared all my analyses in great detail but was told the following by editor Lakens: "Overall, I do not directly see any reason why your comments would invalidate the replication. I think that reporting some additional details, such as distribution of the ratings, the interitem correlations, and the cronbachs alpha for the original and replication might be interesting as supplementary material."
I was not content with this suggestion of simply adding a few details while still concluding the replication failed to show the original effect. I therefore provided further details on the analyses and said the following:
"Let me state very clearly: The issues that my analyses uncovered do not concern scientific disagreement but concern basic analyses, including examination of descriptive statistics, item distributions and scale reliabilities in order to identify general response patterns. In other words, the authors failed to provide the high-quality analysis strategy that they committed to in their proposal and that is outlined in the replication recipe (Brandt et al., 2014). These are significant shortcomings, and therefore significantly change the conclusion of the paper." I requested that the replication authors are asked to revise their paper to reflect these additional analyses before the paper goes into print, or instead, that I get the chance to respond to their paper in print. Both requests were denied.
Now, let's imagine the reverse situation: Somebody, perhaps a data detective, found a comparable error in my paper and claims that it calls into question all reported analyses, and alerted the editors. Or let's imagine that there was not even a specific error but just a "highly unusual" pattern in the data that was difficult to explain. Multiple statistical experts would be consulted to carefully scrutinize the results, and there would be immediate calls to retract the paper, just to be on the safe side, because other researchers should not rely on findings that may be invalid. There might even be an ethics investigation to ascertain that no potential wrong-doing has occurred. But apparently the rules are very different when it comes to replications: The original paper, and by implication, the original author, is considered guilty until proven innocent, while apparently none of the errors in the replication matter; they can be addressed with reverse p-hacking. Somehow, even now, the onus still is on me, the defendant, to show that there was an error made by the accusers, and my reputation continues to be on the line.
Multiple people, including well-meaning colleagues, have suggested that I should collect further data to show my original effect. I appreciate their concern, because publicly criticizing the replication paper brought focused attention to the question of the replicability of my work. Indeed, some colleagues had advised against raising concerns for precisely this reason: Do not draw any additional attention to it. The mere fact that the work was targeted for replication implies that there must have been a reason for it to be singled out as suspect in the first place. This is probably why many people remain quiet when somebody claims a failed replication to their work. Not only do they fear the mocking and ridicule on social media, they know that while a murder suspect is considered innocent until proven guilty, no such allowance is made for the scientist under suspicion: She must be guilty of something – "the lady doth protest too much", as indeed a commentator on the Science piece remarked.
So, about 10 days after I publicly raised concerns about a replication of my work, I have not only been publicly accused of various questionable research practices or potential fraud, but there have been further calls to establish the reliability of my earlier work. Nobody has shown any error in my paper, and there are two independent direct replications of my studies, and many conceptual replications involving cleanliness and morality, all of which corroborate my earlier findings (e.g., Cramwinckel, De Cremer & van Dijke, 2012; Cramwinckel, Van Dijk, Scheepers & Van den Bos, 2013; Gollwitzer & Melzer, 2012; Jones & Fitness, 2008; Lee & Schwarz, 2010; Reuven, Liberman & Dar, 2013; Ritter & Preston, 2011; Xie, Yu, Zhou & Sedikides, 2013; Zhong & Liljenquist, 2006; Zhong, Strejcek, Sivanathan, 2010). There is an even bigger literature on the opposite of cleanliness, namely disgust, and its link to morality, generated by many different labs, using a wide range of methods (for reviews, see Chapman & Anderson, 2013; Russell & Giner-Sorolla, 2013).
But can we really be sure, I'm asked? What do those dozens and dozens of studies really tell us, if they have only tested new conceptual questions but have not used the verbatim materials from one particular paper published in 2008? There can only be one conclusive way to show that my finding is real: I must conduct one pre-registered replication study with identical materials, but this time using a gigantic sample. Ideally this study should not be conducted by me, since due to my interest in the topic I would probably contaminate the admissible evidence. Instead, it should be done by people who have never worked in that area, have no knowledge of the literature or any other relevant expertise. Only if they, in that one decisive study, can replicate the effect, then my work can be considered reliable. After all, if the effect is real, with identical materials the identical results should be obtained anywhere in the world, at any time, with any sample, by anybody. From this one study the truth will be so obvious that nobody has to even evaluate the results; the data will speak for themselves once they are uploaded to a website. This truth will constitute the final verdict that can be shouted off the roof-tops, because it overturns everything else that the entire literature has shown so far. People who are unwilling to accept this verdict may need to be prevented from "critiquing after the results are known" or CARKing (Nosek & Lakens, 2014) if they cannot live up to the truth. It is clear what is expected from me: I am a suspect now. More data is needed, much more data, to fully exonerate me.
In contrast, I have yet to see anybody suggest that the replication authors should run a more carefully conceived experiment after developing new stimuli that are appropriate for their sample, one that does not have the problem of the many extreme responses they failed to report, indeed an experiment that in every way lives up to the confirmed high quality of my earlier work. Nobody has demanded that they rectify the damage they have already done to my reputation by reporting sub-standard research, let alone the smug presentation of the conclusive nature of their results. Nobody, not even the editors, asked them to fix their error, or called on them to retract their paper, as I would surely have been asked to do had somebody found problems with my work.
What is Bullying?
To be clear: I absolutely do not think that trying to replicate somebody else's work should be considered bullying. The bullying comes in when failed replications are announced with glee and with direct or indirect implications of "p-hacking" or even fraud. There can be many reasons for any given study to fail; it is misleading to announce replication failure from one study as if it disproves an entire literature of convergent evidence. Some of the accusations that have been made after failures to replicate can be considered defamatory because they call into question a person's professional reputation.
But more broadly, bullying relates to an abuse of power. In academia, journal editors have power: They are the gatekeepers of the published record. They have two clearly defined responsibilities: To ensure that the record is accurate, and to maintain impartiality toward authors. The key to this is independent peer-review. All this was suspended for the special issue: Editors got to cherry-pick replication authors with specific replication targets. Further, only one editor (rather than a number of people with relevant expertise, which could have included the original author, or not) unilaterally decided on the replication quality. So the policeman not only rounded up the suspects, he also served as the judge to evaluate the evidence that led him to round up the suspects in the first place.
Without my intervention with the Editor-in-Chief, 15 replication papers would have gone into print at a peer-reviewed journal without a single finding having been scrutinized by any expert outside of the editorial team. For original papers such an approach would be unthinkable, even though original findings do not nearly have the same reputational implications as replication findings. I spent a tremendous amount of time doing the work that several independent experts should have done instead of me. I did not ask to be put into this position, nor did I enjoy it, but in the end there simply was nobody else to independently check the reliability of the findings that concern the integrity of my work. I would have preferred to spend all that time doing my own work instead of running analyses for which other people get authorship credit.
How do we know whether any of the other findings published in that special issue are reliable? Who will verify the replications of Asch's (1946) and Schachter's (1951) work, since they no longer have a chance to defend their contributions? More importantly, what exactly have we learned if Asch's specific study did not replicate when using identical materials? Do we have to doubt his work and re-write the textbooks despite the fact that many conceptual replications have corroborated his early work? If it was true that studies with small sample sizes are "unlikely to replicate", as some people falsely claim, then there is no point in trying to selectively smoke out false positives by running one-off large-sample direct replications that are treated as if they can invalidate a whole body of research using conceptual replications. It would be far easier to just declare 95% of the published research invalid, burn all the journals, and start from scratch.
Publications are about power: The editors invented new publication rules and in the process lowered the quality well below what is normally expected for a journal article, despite their alleged goal of "increasing the credibility of published results" and the clear reputational costs to original authors once "failed" replications go on the record. The guardians of the published record have a professional responsibility to ensure that whatever is declared as a "finding," whether confirmatory, or disconfirmatory, is indeed valid and reliable. The format of the Replication Special Issue fell completely short of this responsibility, and replication findings were by default given the benefit of the doubt.
Perhaps the editors need to be reminded of their contractual obligation: It is not their job to adjudicate whether a previously observed finding is "real." Their job is to ensure that all claims in any given paper are based on solid evidence. This includes establishing that a given method is appropriate to test a specific research question. A carbon copy of the materials used in an earlier study does not necessarily guarantee that this will be the case. Further, when errors become apparent, it is the editors' responsibility to act accordingly and update the record with a published correction, or a retraction. It is not good enough to note that "more research is needed." By definition, that is always the case.
Errors have indeed already become apparent in other papers in the replication special issue, such as in the "Many Labs" paper. They need to be corrected, otherwise false claims, including those involving "successful" replications, continue to be part of the published literature on which other researchers build their work. It is not acceptable to claim that it does not matter whether a study is carried out online or in the lab when this is not correct, or to present a graph that includes participants from Italy, Poland, Malaysia, Brazil and other international samples who were primed with an American flag to test the influence on Republican attitudes (Ferguson, Carter & Hassin, 2014). It is not good enough to mention in a footnote that in reality the replication of Jacowitz and Kahneman (1995) was not comparable to the original method, and therefore the reported conclusions do not hold. It is not good enough to brush aside original authors' concerns that despite seemingly identical materials the underlying psychological process was not fully captured (Crisp, Miles, & Husnu, 2014; Schwarz & Strack, 2014).
The printed journal pages constitute the academic record, and each and every finding reported in those pages needs to be valid and reliable. The editors not only owe this to the original authors whose work they targeted, but also to the readers who will rely on their conclusions and take them at face value. Gross oversimplifications of which effects were "successfully replicated" or "unsuccessfully replicated" are simply not good enough when there are key methodological errors. In short, replications are not good enough when they are held to a much lower standard than the very findings they end up challenging.
It is difficult not to get the impression that the format of the special issue meant that the editors sided with the replication authors, while completely silencing original authors. In my specific case, the editor even helped them make their case against me in print, by suggesting analyses to address the ceiling effect that few independent peer reviewers would consider acceptable. Indeed, he publicly made it very clear whose side he is on.
Lakens has further publicly been critical of my research area, embodied cognition, because according to him, such research "lacks any evidential value." He came to this conclusion based on a p-curve analysis of a review paper (Meier, Schnall, Schwarz & Bargh, 2012), in which Schnall, Benton & Harvey (2008) is cited. Further, he mocked the recent SPSP Embodiment Preconference at which I gave a talk, a conference that I co-founded several years ago. In contrast, he publicly praised the blog by replication author Brent Donnellan as "excellent"; it is the very blog for which the latter by now apologized because it glorifies failed replications.
Bullying involves an imbalance of power and can take on many forms. It does not depend on your gender, skin colour, whether you are a graduate student or a tenured faculty member. It means having no control when all the known rules regarding scientific communication appear to have been re-written by those who not only control the published record, but in the process also hand down the verdicts regarding which findings shall be considered indefinitely "suspect."
University of Cambridge
22 May 2014
Recently I was invited to be part of a "registered replication" project of my work. It was an interesting experience, which in part was described in a published article. My commentary describing specific concerns about the replication paper is available here.
Some people have asked me for further details. Here are my answers to specific questions.
Question 1: “Are you against replications?”
I am a firm believer in replication efforts and was flattered that my paper (Schnall, Benton & Harvey, 2008) was considered important enough to be included in the special issue on "Replications of Important Findings in Social Psychology." I therefore gladly cooperated with the registered replication project on every possible level: First, following their request I promptly shared all my experimental materials with David Johnson, Felix Cheung and Brent Donnellan and gave detailed instructions on the experimental protocol. Second, I reviewed the replication proposal when requested to do so by special issue editor Daniel Lakens. Third, when the replication authors requested my SPSS files I sent them the following day. Fourth, when the replication authors told me about the failure to replicate my findings, I offered to analyze their data within two weeks’ time. This offer was declined because the manuscript had already been submitted. Fifth, when I discovered the ceiling effect in the replication data I shared this concern with the special issue editors, and offered to help the replication authors correct the paper before it went into print. This offer was rejected, as was my request for a published commentary describing the ceiling effect.
I was told that the ceiling effect does not change the conclusion of the paper, namely that it was a failure to replicate my original findings. The special issue editors Lakens and Nosek suggested that if I had concerns about the replication, I should write a blog; there was no need to inform the journal's readers about my additional analyses. Fortunately Editor-in-Chief Unkelbach overruled this decision and granted published commentaries to all original authors whose work was included for replication in the special issue.
Of course replications are much needed and as a field we need to make sure that our findings are reliable. But we need to keep in mind that there are human beings involved, which is what Danny Kahneman's commentary emphasizes. Authors of the original work should be allowed to participate in the process of having their work replicated. For the Replication Special Issue this did not happen: Authors were asked to review the replication proposal (and this was called "pre-data peer review"), but were not allowed to review the full manuscripts with findings and conclusions. Further, there was no plan for published commentaries; they were only implemented after I appealed to the Editor-in-Chief.
Various errors in several of the replications (e.g., in the "Many Labs" paper) became apparent only once original authors were allowed to give feedback. Errors were uncovered even for successfully replicated findings. But since the findings from "Many Labs" were already heavily publicized several months before the paper went into print, the reputational damage for some people behind the findings had already started well before they had any chance to review the findings. "Many Labs" covered studies on 15 different topics, but there was no independent peer review of the findings by experts on those topics.
For all the papers in the special issue the replication authors were allowed the "last word" in the form of a rejoinder to the commentaries; these rejoinders were also not peer-reviewed. Some errors identified by original authors were not appropriately addressed so they remain part of the published record.
Question 2: “Have the findings from Schnall, Benton & Harvey (2008) been replicated?”
We reported two experiments in this paper, showing that a sense of cleanliness leads to less severe moral judgments. Two direct replications of Study 1 have been conducted by Kimberly Daubman and showed the effect (described on Psych File Drawer: Replication 1, Replication 2). These are two completely independent replications that were successfully carried out at Bucknell University. Further, my collaborator Oliver Genschow and I conducted several studies that also replicated the original findings. Oliver ran the studies in Switzerland and in Germany, whereas the original work had been done in the United Kingdom, so it was good to see that the results replicated in other countries. These data are so far unpublished.
Importantly, all these studies were direct replications, not conceptual replications, and they provided support for the original effect. Altogether there are seven successful demonstrations of the effect, using identical methods: Two in our original paper, two by Kimberly Daubman and another three by Oliver Genschow and myself. I am not aware of any unsuccessful studies, either by myself or others, apart from the replications reported by Johnson, Cheung and Donnellan.
In addition, there are now many related studies involving conceptual replications. Cleanliness and cleansing behaviors, such as hand washing, have been shown to influence a variety of psychological outcomes (Cramwinckel, De Cremer & van Dijke, 2012; Cramwinckel, Van Dijk, Scheepers & Van den Bos, 2013; Florack, Kleber, Busch & Stöhr, 2014; Gollwitzer & Melzer, 2012; Jones & Fitness, 2008; Kaspar, 2013; Lee & Schwarz, 2010a; Lee & Schwarz, 2010b; Reuven, Liberman & Dar, 2013; Ritter & Preston, 2011; Xie, Yu, Zhou & Sedikides, 2013; Xu, Zwick & Schwarz, 2012; Zhong & Liljenquist, 2006; Zhong, Strejcek & Sivanathan, 2010).
Question 3: “What do you think about ‘pre-data peer review’?”
There are clear professional guidelines regarding scientific publication practices and in particular, about the need to ensure the accuracy of the published record by impartial peer review. The Committee on Publication Ethics (COPE) specifies the following in its Principles of Transparency and Best Practice in Scholarly Publishing: "Peer review is defined as obtaining advice on individual manuscripts from reviewers expert in the field who are not part of the journal’s editorial staff." Peer review of only methods but not full manuscripts violates internationally acknowledged publishing ethics. In other words, there is no such thing as "pre-data peer review."
Editors need to seek the guidance of reviewers; they cannot act as reviewers themselves. Indeed, "One of the most important responsibilities of editors is organising and using peer review fairly and wisely." (COPE, 2014, p. 8). The editorial process implemented for the Replication Special Issue went against all known publication conventions that have been developed to ensure impartial publication decisions. Such a breach of publication ethics is ironic considering the stated goals of transparency in science and increasing the credibility of published results. Peer review is not censorship; it concerns quality control. Without it, errors will enter the published record, as has indeed happened for several replications in the special issue.
Question 4: “You confirmed that the replication method was identical to your original studies. Why do you now say there is a problem?”
I indeed shared all my experimental materials with the replication authors, and also reviewed the replication proposal. But human nature is complex, and identical materials will not necessarily have the identical effect on all people in all contexts. My research involves moral judgments, for example, judging whether it is wrong to cook and eat a dog after it died of natural causes. It turned out that for some reason in the replication samples many more people found this to be extremely wrong than in the original studies. This was not anticipated and became apparent only once the data had been collected. There are many ways for any study to go wrong, and one has to carefully examine whether a given method was appropriate in a given context. This can only be done after all the data have been analyzed and interpreted.
My original paper went through rigorous peer-review. For the replication special issue all replication authors were deprived of this mechanism of quality control: There was no peer-review of the manuscript, not by authors of the original work, nor by anybody else. Only the editors evaluated the papers, but this cannot be considered peer-review (COPE, 2014). To make any meaningful scientific contribution the quality standards for replications need to be at least as high as for the original findings. Competent evaluation by experts is absolutely essential, and is especially important if replication authors have no prior expertise with a given research topic.
Question 5: “Why did participants in the replication samples give much more extreme moral judgments than in the original studies?”
Based on the literature on moral judgment, one possibility is that participants in the Michigan samples were on average more politically conservative than the participants in the original studies conducted in the UK. But given that conservatism was not assessed in the registered replication studies, it is difficult to say. Regardless of the reason, however, analyses have to take into account the distributions of a given sample, rather than assuming that they will always be identical to the original sample.
Question 6: “What is a ceiling effect?”
A ceiling effect means that responses on a scale are truncated toward the top end of the scale. For example, if the scale had a range from 1-7, but most people selected "7", this suggests that they might have given a higher response (e.g., "8" or "9") had the scale allowed them to do so. Importantly, a ceiling effect compromises the ability to detect the hypothesized influence of an experimental manipulation. Simply put: With a ceiling effect it will look like the manipulation has no effect, when in reality the study was unable to test for such an effect in the first place. When a ceiling effect is present no conclusions can be drawn regarding possible group differences.
Research materials have to be designed such that they adequately capture response tendencies in a given population, usually via pilot testing (Hessling, Traxel, & Schmidt, 2004). Because direct replications use the materials developed for one specific testing situation, it can easily happen that materials that were appropriate in one context will not be appropriate in another. A ceiling effect is only one example of this general problem.
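The way a ceiling can mask a real difference can be illustrated with a minimal simulation. This is only a sketch under stated assumptions, not an analysis of the original or the replication data: it assumes that each participant has a continuous underlying judgment that is normally distributed, that the manipulation lowers this judgment by a fixed amount, and that the response scale runs from 1 to 7. The function name and all numeric values are hypothetical.

```python
import random
import statistics


def observed_gap(latent_mean, true_shift, scale_max=7, n=5000, seed=0):
    """Simulate two conditions and return the difference in mean observed ratings.

    All parameter values are hypothetical; nothing here is taken from the
    original studies or from the replication data.
    """
    rng = random.Random(seed)

    def rate(mean):
        # The underlying judgment is continuous; the scale clips it to 1..scale_max.
        latent = rng.gauss(mean, 1.5)
        return min(scale_max, max(1, round(latent)))

    control = [rate(latent_mean) for _ in range(n)]
    treated = [rate(latent_mean - true_shift) for _ in range(n)]
    return statistics.mean(control) - statistics.mean(treated)


# Same true shift in both cases; only the position on the scale differs.
mid_scale = observed_gap(latent_mean=4.0, true_shift=0.5)
near_ceiling = observed_gap(latent_mean=8.5, true_shift=0.5)
print(f"observed gap, responses mid-scale:   {mid_scale:.2f}")
print(f"observed gap, responses at ceiling:  {near_ceiling:.2f}")
```

In this sketch the underlying shift is identical in both runs, yet the observed gap shrinks considerably when most underlying judgments lie above the top of the scale, because both conditions pile up on the highest response.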
Question 7: “Can you get rid of the ceiling effect in the replication data by throwing away all extreme scores and then analysing whether there was an effect of the manipulation?”
There is no way to fix a ceiling effect. It is a methodological rather than statistical problem, indicating that survey items did not capture the full range of participants’ responses. Throwing away all extreme responses in the replication data of Johnson, Cheung & Donnellan (2014) would mean getting rid of 38.50% of data in Study 1, and 44.00% of data in Study 2. This is a problem even if sample sizes remain reasonable: The extreme responses are integral to the data sets; they are not outliers or otherwise unusual observations. Indeed, for 10 out of the 12 replication items the modal (=most frequent) response in participants was the top score of the scale. Throwing away extreme scores gets rid of the portion of the data where the effect of the manipulation would have been expected. Not only would removing almost half of your sample be considered "p-hacking," most importantly, it does not solve the problem of the ceiling effect.
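A check of this kind at the item level can be sketched as follows. The ratings below are made up for illustration and are not the replication data; the item names, the 1-7 scale and the helper function are assumptions. For each item the sketch reports the modal response and the share of responses at the top of the scale.

```python
from collections import Counter


def ceiling_summary(responses, scale_max):
    """Return (modal response, share of responses at the top of the scale)."""
    counts = Counter(responses)
    modal_response = counts.most_common(1)[0][0]
    share_at_top = counts[scale_max] / len(responses)
    return modal_response, share_at_top


# Hypothetical item-level ratings, one list per item (not the replication data).
items = {
    "item_a": [7, 7, 6, 7, 5, 7, 7, 4, 7, 6],
    "item_b": [3, 4, 5, 4, 2, 6, 4, 5, 3, 4],
}
SCALE_MAX = 7  # assumed top of the response scale

summaries = {name: ceiling_summary(r, SCALE_MAX) for name, r in items.items()}
for name, (mode, share) in summaries.items():
    print(f"{name}: modal response {mode}, {share:.0%} at top of scale")
```

An item whose modal response equals the top of the scale, like the hypothetical item_a here, is the pattern described above. Dropping those responses would remove the most typical answers, not unusual ones.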
Question 8: “Why is the ceiling effect in the replication data such a big problem?”
Let me try to illustrate the ceiling effect in simple terms: Imagine two people are speaking into a microphone and you can clearly understand and distinguish their voices. Now you crank up the volume to the maximum. All you hear is this high-pitched sound (“eeeeee”) and you can no longer tell whether the two people are saying the same thing or something different. Thus, in the presence of such a ceiling effect it would seem that both speakers were saying the same thing, namely “eeeeee”.
The same thing applies to the ceiling effect in the replication studies. Once a majority of the participants are giving extreme scores, all differences between two conditions are abolished. Thus, a ceiling effect means that all predicted differences will be wiped out: It will look like there is no difference between the two people (or the two experimental conditions).
Further, there is no way to just remove the “eeeeee” from the sound in order to figure out what was being said at that time; that information is lost forever and hence no conclusions can be drawn from data that suffer from a ceiling effect. In other words, you can’t fix a ceiling effect once it’s present.
Question 9: “Why do we need peer-review when data files are made available online anyway?”
It took me considerable time and effort to discover the ceiling effect in the replication data because it required working with the item-level data, rather than the average scores across all moral dilemmas. Even somebody familiar with a research area will have to spend quite some time trying to understand everything that was done in a specific study, what the variables mean, etc. I doubt many people will go through the trouble of running all these analyses and indeed it’s not feasible to do so for all papers that are published.
That is the assumption behind peer-review: You trust that somebody with the relevant expertise has scrutinized a paper regarding its results and conclusions, so you don’t have to. If instead only data files are made available online there is no guarantee that anybody will ever fully evaluate the findings. This puts specific findings, and therefore specific researchers, under indefinite suspicion. This was a problem for the Replication Special Issue, and is also a problem for the Reproducibility Project, where findings are simply uploaded without having gone through any expert review.
Question 10: “What has been your experience with replication attempts?”
My work has been targeted for multiple replication attempts; by now I have received so many such requests that I stopped counting. Further, data detectives have demanded the raw data of some of my studies, as they have done with other researchers in the area of embodied cognition because somehow this research area has been declared "suspect." I stand by my methods and my findings and have nothing to hide and have always promptly complied with such requests. Unfortunately, there has been little reciprocation on the part of those who voiced the suspicions; replicators have not allowed me input on their data, nor have data detectives exonerated my analyses when they turned out to be accurate.
I invite the data detectives to publicly state that my findings lived up to their scrutiny, and more generally, to share all their findings of secondary data analyses. Otherwise only errors get reported and highly publicized, when in fact the majority of research is solid and unproblematic.
With replicators, too, this has been a one-way street, not a dialogue, which is hard to reconcile with the alleged desire for improved research practices. So far I have seen no actual interest in the phenomena under investigation, as if the goal were to declare the verdict of "failure to replicate" as quickly as possible. None of the replicators gave me any opportunity to evaluate the data before claims of "failed" replications were made public. Replications could tell us a lot about boundary conditions of certain effects, which would drive science forward, but this needs to be a collaborative process.
The most stressful aspect has not been to learn about the "failed" replication by Johnson, Cheung & Donnellan, but to have had no opportunity to make myself heard. There has been the constant implication that anything I could possibly say must be biased and wrong because it involves my own work. I feel like a criminal suspect who has no right to a defense and there is no way to win: The accusations that come with a "failed" replication can do great damage to my reputation, but if I challenge the findings I come across as a "sore loser."
Question 11: “But… shouldn’t we be more concerned about science rather than worrying about individual researchers’ reputations?”
Just like everybody else, I want science to make progress, and to use the best possible methods. We need to be rigorous about all relevant evidence, whether it concerns an original finding, or a "replication" finding. If we expect original findings to undergo tremendous scrutiny before they are published, we should expect no less of replication findings.
Further, careers and funding decisions are based on reputations. The implicit accusations that currently come with failure to replicate an existing finding can do tremendous damage to somebody's reputation, especially if accompanied by mocking and bullying on social media. So the burden of proof needs to be high before claims about replication evidence can be made. As anybody who's ever carried out a psychology study will know, there are many reasons why a study can go wrong. It is irresponsible to declare replication results without properly assessing the quality of the replication methods, analyses and conclusions.
Let's also remember that there are human beings involved on both sides of the process. It is uncollegial, for example, to tweet and blog about people's research in a mocking tone, accuse them of questionable research practices, or worse. Such behavior amounts to bullying, and needs to stop.
Question 12: “Do you think replication attempts selectively target specific research areas?”
Some of the findings obtained within my subject area of embodied cognition may seem surprising to an outsider, but they are theoretically grounded and there is a highly consistent body of evidence. I have worked in this area for almost 20 years and am confident that my results are robust. But somehow recently all findings related to priming and embodied cognition have been declared "suspicious." As a result I have even considered changing my research focus to "safer" topics that are not constantly targeted by replicators and data detectives. I sense that others working in this area have suffered similarly, and may be rethinking their research priorities, which would result in a significant loss to scientific investigation in this important area of research.
Two main criteria appear to determine whether a finding is targeted for replication: Is a finding surprising, and can a replication be done with few resources, i.e., is a replication feasible. Whether a finding surprises you or not depends on familiarity with the literature: the lower your expertise, the higher your surprise. This makes counterintuitive findings a prime target, even when they are consistent with a large body of other findings. Thus, there is a disproportionate focus on areas with surprising findings for which replications can be conducted with limited resources. It also means that the burden of responding to "surprised" colleagues is very unevenly distributed and researchers like myself are targeted again and again, which will do little to advance the field as a whole. Such practices inhibit creative research on phenomena that may be counterintuitive because they motivate researchers to play it safe to stay well below the radar of the inquisitors.
Overall it's a base rate problem: If replication studies are cherry-picked simply due to feasibility considerations, then a very biased picture will emerge, especially if failed replications are highly publicized. Certain research areas will suffer from the stigma of "replication failure" while other research areas will remain completely untouched. A truly scientific approach would be to randomly sample from the entire field, and conduct replications, rather than focus on the same topics (and therefore the same researchers) again and again.
Question 13: “So far, what has been the personal impact of the replication project on you?”
The "failed" replication of my work was widely announced to many colleagues in a group email and on Twitter already in December, before I had the chance to fully review the findings. So the defamation of my work already started several months before the paper even went into print. I doubt anybody would have widely shared the news had the replication been considered "successful." At that time the replication authors also put up a blog entitled "Go Big or Go Home," in which they declare their studies an "epic fail." Considering that I helped them in every possible way, even offered to analyze the data for them, this was disappointing.
At a recent interview for a big grant I was asked to what extent my work was related to Diederik Stapel’s and therefore unreliable; I did not get the grant. The logic is remarkable: if researcher X faked his data, then the phenomena addressed by that whole field are called into question.
Further, following a recent submission of a manuscript to a top journal a reviewer raised the issue about a "failed" replication of my work and therefore called into question the validity of the methods used in that manuscript.
The constant suspicions create endless concerns and second thoughts that impair creativity and exploration. My graduate students are worried about publishing their work out of fear that data detectives might come after them and try to find something wrong in their work. Doing research now involves anticipating a potential ethics or even criminal investigation.
Over the last six months I have spent a tremendous amount of time and effort on attempting to ensure that the printed record regarding the replication of my work is correct. I made it my top priority to try to avert the accusations and defamations that currently accompany claims of failed replications. Fighting on this front has meant that I had very little time and mental energy left to engage in activities that would actually make a positive contribution to science, such as writing papers, applying for grants, or doing all the other things that would help me get a promotion.
Question 14: “Are you afraid that being critical of replications will have negative consequences?”
I have already been accused of "slinging mud", "hunting for artifacts", "soft fraud" and it is possible that further abuse and bullying will follow, as has happened to colleagues who responded to failed replications by pointing out errors of the replicators. The comments posted below the Science article suggest that even before my commentary was published people were quick to judge. In the absence of any evidence, many people immediately jumped to the conclusion that something must be wrong with my work, and that I'm probably a fraudster. I encourage the critics to read my commentary so they can arrive at a judgment based on the facts.
I will continue to defend the integrity of my work, and my reputation, as anyone would, but I fear that this will put me and my students even more directly into the cross hairs of replicators and data detectives. I know of many colleagues who are similarly afraid of becoming the next target and are therefore hesitant to speak out, especially those who are junior in their careers or untenured. There now is a recognized culture of "replication bullying:" Say anything critical about replication efforts and your work will be publicly defamed in emails, blogs and on social media, and people will demand your research materials and data, as if they are on a mission to show that you must be hiding something.
I have taken on a risk by publicly speaking out about replication. I hope it will encourage others to do the same, namely to stand up against practices that do little to advance science, but can do great damage to people's reputations. Danny Kahneman, who so far has been a big supporter of replication efforts, has now voiced concerns about a lack of "replication etiquette," where replicators make no attempts to work with authors of the original work. He says that "this behavior should be prohibited, not only because it is uncollegial but because it is bad science. A good-faith effort to consult with the original author should be viewed as essential to a valid replication."
Let's make replication efforts about scientific contributions, not about selectively targeting specific people, mockery, bullying and personal attacks. We could make some real progress by working together.