Recent Discussions on Replications in Psychology: 15 June 2014
1) Nate Kornell on Replication Bullying
2) Allen McConnell on New Media and Scientific Discourse
3) Dan Gilbert on Replication Bullying
4) Jim Coan on Negative Psychology
5) Tim Wilson on False Negatives in Psychology
Further Thoughts on Replications, Ceiling Effects and Bullying
University of Cambridge
31 May 2014
Recently I suggested that the data from the registered replication of Schnall, Benton & Harvey (2008), conducted by Johnson, Cheung & Donnellan (2014), had a problem that their analyses did not address. I further pointed to current practices of "replication bullying." It appears that the logic behind these points is unclear to some people, so here are some clarifications.
The Ceiling Effect in Johnson, Cheung & Donnellan (2014)
Compared to the original data, the participants in Johnson et al.'s (2014) data gave significantly more extreme responses. With a high percentage of participants giving the highest score, it is likely that any effect of the experimental manipulation will be washed out because there is no room for participants to give a higher score than the maximal score. This is called a ceiling effect. My detailed analysis shows how a ceiling effect could account for the lack of an effect in the replication data.
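The logic of a ceiling effect can be illustrated with a small simulation. This is only a toy sketch, not a reanalysis of either data set: the scale (1 to 7), the latent means, the standard deviation and the size of the shift are all hypothetical values chosen for illustration. The same latent shift between conditions is applied twice, once in the middle of the scale and once near its top, and the observed difference is compared.

```python
import random
import statistics

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

def simulate(latent_mean, shift=1.0, sd=1.5, n=5000, top=7, seed=1):
    """Return (proportion at ceiling, raw mean difference, Cohen's d).

    All parameter values are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)

    def rate(mu):
        # Round the latent judgment and clamp it into the 1..top scale.
        return [min(top, max(1, round(rng.gauss(mu, sd)))) for _ in range(n)]

    neutral = rate(latent_mean)
    clean = rate(latent_mean - shift)  # the manipulation lowers severity
    at_ceiling = sum(x == top for x in neutral + clean) / (2 * n)
    raw_diff = statistics.mean(neutral) - statistics.mean(clean)
    return at_ceiling, raw_diff, cohens_d(neutral, clean)

# Same latent shift, once mid-scale and once pushed against the top.
mid_ceiling, mid_diff, mid_d = simulate(latent_mean=4.5)
high_ceiling, high_diff, high_d = simulate(latent_mean=8.5)

print(f"mid-scale: {mid_ceiling:.0%} at ceiling, diff={mid_diff:.2f}, d={mid_d:.2f}")
print(f"near top:  {high_ceiling:.0%} at ceiling, diff={high_diff:.2f}, d={high_d:.2f}")
```

In this toy setup the observed difference shrinks when most responses sit at the maximum, even though the underlying shift is identical. How strong the attenuation is depends entirely on the assumed values.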
Some people have wondered why my analysis determines the level of ceiling across both the neutral and clean conditions. The answer is that one has to use an unbiased indicator of ceiling that takes into account the extremity of responses in both conditions. In principle, any of the following three possibilities can be true: The percentage of extreme responses is a) unrelated to the effect of the manipulation, b) positively correlated with the effect of the manipulation, or c) negatively correlated with the effect of the manipulation. In other words, this ceiling indicator takes into account the fact that it could go either way: Having many extreme scores could help or hinder in terms of finding the effect.
Importantly, the prevalence of extreme scores is neither mentioned in the replication paper, nor do any of the reported analyses take into account the resulting severe skew of data. Thus, it is inappropriate to conclude that the replication constitutes a failure to replicate the original finding, when in reality the chance of finding an effect was compromised by the high percentage of extreme scores.
The Replication Authors' Rejoinder
The replication authors responded to my commentary in a rejoinder. It is entitled "Hunting for Artifacts: The Perils of Dismissing Inconsistent Replication Results." In it, they accuse me of "critiquing after the results are known," or CARKing, as Nosek and Lakens (2014) call it in their editorial. In the interest of "increasing the credibility of published results," interpretation of data evidently needs to be discouraged at all costs, which is why the special issue editors decided to omit any independent peer review of the results of all replication papers.
In their rejoinder Johnson et al. (2014) make several points:
First, they note that there was no a priori reason to suspect that my dependent variables would be inappropriate for use with college students in Michigan. Indeed, some people have suggested that it is my fault if the items did not work in the replication sample. Apparently there is a misconception among replicators that first, all materials from a published study must be used in precisely the same manner as in the original study, and second, as a consequence, that "one size fits all", and all materials have to lead to the identical outcomes regardless of when and where you test a given group of participants.
This is a surprising assumption, but it is widespread: The flag-priming study in the "Many Labs" paper involved presenting participants in Italy, Poland, Malaysia, Brazil and other countries with the American flag, and asking them questions such as what they thought about President Obama, and the United States' invasion of Iraq. After all, these were the materials that the original authors had used, so the replicators had no choice. Similarly, in another component of "Many Labs" the replicators insisted on using an outdated scale from 1983, even though the original authors cautioned that this makes little sense, because, well, times have changed. Considering that our field is about the social and contextual factors that shape all kinds of phenomena, this is a curious way to do science indeed.
Second, to "fix" the ceiling effect, in their rejoinder Johnson et al. (2014) remove all extreme observations from the data. This unusual strategy was earlier suggested to me by editor Lakens in his editorial correspondence and then implemented by the replication authors. It involves getting rid of a huge portion of the data: 38.50% from Study 1, and 44.00% of data in Study 2.
Even if sample size is still reasonable, this does not solve the problem: The ceiling effect indicates response compression at the top end of the scale, where variability due to the manipulation would be expected. If you get rid of all extreme responses then you are throwing away exactly that portion of the data where the effect should occur in the first place. I have yet to find a statistician or a textbook that advocates removing nearly half of the data as an acceptable strategy to address a ceiling effect.
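The point about discarding extreme responses can also be sketched in a toy simulation. Again, this is not a reanalysis of the replication data: the 1 to 7 scale, the latent means and the standard deviation are hypothetical, chosen so that roughly 40% of responses land on the highest score. The sketch compares the standardized difference between conditions before and after every top-score response is removed.

```python
import random
import statistics

TOP = 7  # highest point of the hypothetical response scale

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

def ratings(mu, rng, n=5000, sd=1.5):
    """Latent judgments rounded and clamped into the 1..TOP scale."""
    return [min(TOP, max(1, round(rng.gauss(mu, sd)))) for _ in range(n)]

rng = random.Random(2)
neutral = ratings(6.5, rng)  # hypothetical latent mean, neutral condition
clean = ratings(5.5, rng)    # hypothetical latent mean, clean condition

# Share of all responses that would be discarded by dropping the top score.
dropped = sum(x == TOP for x in neutral + clean) / len(neutral + clean)

d_all = cohens_d(neutral, clean)
d_trimmed = cohens_d([x for x in neutral if x < TOP],
                     [x for x in clean if x < TOP])

print(f"dropped {dropped:.0%} of responses")
print(f"d with all data:        {d_all:.2f}")
print(f"d after dropping top:   {d_trimmed:.2f}")
```

Under these assumptions, more top-score responses are removed from the neutral condition than from the clean condition, so the trimming itself removes part of the difference between the groups rather than correcting for the ceiling.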
Now, imagine that I, or any author of an original study, had done this in a paper: It would be considered "p-hacking" if I had excluded almost half my data to show an effect. No reviewer would have approved of such an approach. By the same token, excluding almost half the data to show the absence of an effect should be considered "reverse p-hacking." Yet the replication authors are allowed to do so and claim that there "still is no effect."
Third, the replication authors selectively analyze a few items regarding the ceiling effect because they consider them "the only relevant comparisons." But it does not make sense to cherry-pick single items: The possible ceiling effect can be demonstrated in my analyses that aggregate across all items. As in their replication paper, the authors further claim that only selected items on their own showed an effect in my original paper. This is surprising considering that it has always been standard practice to collapse across all items in an analysis, rather than making claims about any given individual item. Indeed, there is a growing recognition that rather than focussing on individual p-values one needs to instead consider overall effect sizes (Cumming, 2012). My original studies have Cohen’s d of .61 (Experiment 1), and .85 (Experiment 2).
Again, imagine that I had done this in my original paper, to only selectively report the items for which there was an effect, "because they were the only relevant comparisons" and ignored all others. I would most certainly be condemned for "p-hacking", and no peer reviewer would have let me get away with it. Thus, it is stunning that the replication authors are allowed to make such a case, and it is identical to the argument made by the editor in his emails to me.
In summary, there is a factual error in the paper by Johnson et al. (2014) because none of the reported analyses take into account the severe skew in distributions: The replication data show a significantly greater percentage of extreme scores than the original data (and the successful direct replication by Arbesfeld, Collins, Baldwin, & Daubman, 2014). This is especially a problem in Experiment 2, which used a shorter scale with only 7 response options, compared to the 10 response options in Experiment 1. The subsequent analyses in the rejoinder, namely getting rid of extreme scores and selectively focusing on single items, are not sound. A likely interpretation is a ceiling effect in the replication data, which makes it difficult to detect the influence of a manipulation, and is especially a problem with short response scales (Hessling, Traxel & Schmidt, 2004). There can be other reasons why the effect was not observed, but the above possibility should have been discussed by Johnson et al. (2014) as a limitation of the data that makes the replication inconclusive.
An Online Replication?
In their rejoinder the replication authors further describe an online study, but unfortunately it was not a direct replication: It lacked the experimental control that was present in all other studies that successfully demonstrated the effect. The scrambled sentences task (e.g., Srull & Wyer, 1979) involves underlining words on a piece of paper, as in Schnall et al. (2008), Besman et al. (2013) and Arbesfeld et al. (2014). Whereas the paper-based task is completed under the guidance of an experimenter, for online studies it cannot be established whether participants exclusively focus on the priming task. Indeed, results from online versions of priming studies systematically differ from lab-based versions (Ferguson, Carter, & Hassin, 2014). Priming involves inducing a cognitive concept in a subtle way. But for it to be effective, one has to ensure that there are no other distractions. In an online study it is close to impossible to establish what else participants might be doing at the same time.
Even more importantly for my study, the specific goal was to induce a sense of cleanliness. When participants do the study online they may be surrounded by mess, clutter and dirt, which would interfere with the cleanliness priming. In fact, we have previously shown that one way to induce disgust is to ask participants to sit at a dirty desk, where they were surrounded by rubbish, and this made their moral judgments more severe (Schnall, Haidt, Clore & Jordan, 2008). So a dirty environment while doing the online study would counteract the cleanliness induction. Another unsuccessful replication has surfaced that was conducted online. If somebody had asked me whether it makes sense to induce cleanliness in an online study, I would have said "no," and they could have saved themselves some time and money. Indeed, the whole point of the registered replications was to ensure that replication studies had the same amount of experimental control as the original studies, which is why original authors were asked to give feedback.
I was surprised that the online study in the rejoinder was presented as if it can address the problem of the ceiling effect in the replication paper. Simply put: A study with a different problem (lack of experimental control) cannot fix the clearly demonstrated problem of the registered replication studies (high percentage of extreme scores). So it is a moving target: Rather than addressing the question of "is there an error in the replication paper," to which I can show that the answer is "yes," the replication authors were allowed to shift the focus to inappropriate analyses that "also don't show an effect," and to new data that introduce new problems.
Guilty Until Proven Innocent
My original paper was published in Psychological Science, where it was subjected to rigorous peer-review. A data-detective some time ago requested the raw data, and I provided them but never heard back. When I checked recently, this person confirmed the validity of my original analyses; this replication of my analyses was never made public.
So far nobody has been able to see anything wrong in my work. In contrast, I have shown that the analyses reported by Johnson et al. (2014) fail to take into account the severe skew in distributions, and I alerted the editors to it in multiple attempts. I shared all my analyses in great detail but was told the following by editor Lakens: "Overall, I do not directly see any reason why your comments would invalidate the replication. I think that reporting some additional details, such as distribution of the ratings, the interitem correlations, and the cronbachs alpha for the original and replication might be interesting as supplementary material."
I was not content with this suggestion of simply adding a few details while still concluding the replication failed to show the original effect. I therefore provided further details on the analyses and said the following:
"Let me state very clearly: The issues that my analyses uncovered do not concern scientific disagreement but concern basic analyses, including examination of descriptive statistics, item distributions and scale reliabilities in order to identify general response patterns. In other words, the authors failed to provide the high-quality analysis strategy that they committed to in their proposal and that is outlined in the replication recipe (Brandt et al., 2014). These are significant shortcomings, and therefore significantly change the conclusion of the paper." I requested that the replication authors are asked to revise their paper to reflect these additional analyses before the paper goes into print, or instead, that I get the chance to respond to their paper in print. Both requests were denied.
Now, let's imagine the reverse situation: Somebody, perhaps a data detective, found a comparable error in my paper and claims that it calls into question all reported analyses, and alerted the editors. Or let's imagine that there was not even a specific error but just a "highly unusual" pattern in the data that was difficult to explain. Multiple statistical experts would be consulted to carefully scrutinize the results, and there would be immediate calls to retract the paper, just to be on the safe side, because other researchers should not rely on findings that may be invalid. There might even be an ethics investigation to ascertain that no potential wrong-doing has occurred. But apparently the rules are very different when it comes to replications: The original paper, and by implication, the original author, is considered guilty until proven innocent, while apparently none of the errors in the replication matter; they can be addressed with reverse p-hacking. Somehow, even now, the onus still is on me, the defendant, to show that there was an error made by the accusers, and my reputation continues to be on the line.
Multiple people, including well-meaning colleagues, have suggested that I should collect further data to show my original effect. I appreciate their concern, because publicly criticizing the replication paper brought focused attention to the question of the replicability of my work. Indeed, some colleagues had advised against raising concerns for precisely this reason: Do not draw any additional attention to it. The mere fact that the work was targeted for replication implies that there must have been a reason for it to be singled out as suspect in the first place. This is probably why many people remain quiet when somebody claims a failed replication of their work. Not only do they fear the mocking and ridicule on social media, they know that while a murder suspect is considered innocent until proven guilty, no such allowance is made for the scientist under suspicion: She must be guilty of something – "the lady doth protest too much", as indeed a commentator on the Science piece remarked.
So, about 10 days after I publicly raised concerns about a replication of my work, I have not only been publicly accused of various questionable research practices or potential fraud, but there have been further calls to establish the reliability of my earlier work. Nobody has shown any error in my paper, and there are two independent direct replications of my studies, and many conceptual replications involving cleanliness and morality, all of which corroborate my earlier findings (e.g., Cramwinckel, De Cremer & van Dijke, 2012; Cramwinckel, Van Dijk, Scheepers & Van den Bos, 2013; Gollwitzer & Melzer, 2012; Jones & Fitness, 2008; Lee & Schwarz, 2010; Reuven, Liberman & Dar, 2013; Ritter & Preston, 2011; Xie, Yu, Zhou & Sedikides, 2013; Zhong & Liljenquist, 2006; Zhong, Strejcek & Sivanathan, 2010). There is an even bigger literature on the opposite of cleanliness, namely disgust, and its link to morality, generated by many different labs, using a wide range of methods (for reviews, see Chapman & Anderson, 2013; Russell & Giner-Sorolla, 2013).
But can we really be sure, I'm asked? What do those dozens and dozens of studies really tell us, if they have only tested new conceptual questions but have not used the verbatim materials from one particular paper published in 2008? There can only be one conclusive way to show that my finding is real: I must conduct one pre-registered replication study with identical materials, but this time using a gigantic sample. Ideally this study should not be conducted by me, since due to my interest in the topic I would probably contaminate the admissible evidence. Instead, it should be done by people who have never worked in that area, have no knowledge of the literature or any other relevant expertise. Only if they, in that one decisive study, can replicate the effect can my work be considered reliable. After all, if the effect is real, with identical materials the identical results should be obtained anywhere in the world, at any time, with any sample, by anybody. From this one study the truth will be so obvious that nobody has to even evaluate the results; the data will speak for themselves once they are uploaded to a website. This truth will constitute the final verdict that can be shouted off the roof-tops, because it overturns everything else that the entire literature has shown so far. People who are unwilling to accept this verdict may need to be prevented from "critiquing after the results are known" or CARKing (Nosek & Lakens, 2014) if they cannot live up to the truth. It is clear what is expected from me: I am a suspect now. More data is needed, much more data, to fully exonerate me. I must repent, for I have sinned, but redemption remains uncertain.
In contrast, I have yet to see anybody suggest that the replication authors should run a more carefully conceived experiment after developing new stimuli that are appropriate for their sample, one that does not have the problem of the many extreme responses they failed to report, indeed an experiment that in every way lives up to the confirmed high quality of my earlier work. Nobody has demanded that they rectify the damage they have already done to my reputation by reporting sub-standard research, let alone the smug presentation of the conclusive nature of their results. Nobody, not even the editors, asked them to fix their error, or called on them to retract their paper, as I would surely have been asked to do had anybody found problems with my work.
What is Bullying?
To be clear: I absolutely do not think that trying to replicate somebody else's work should be considered bullying. The bullying comes in when failed replications are announced with glee and with direct or indirect implications of "p-hacking" or even fraud. There can be many reasons for any given study to fail; it is misleading to announce replication failure from one study as if it disproves an entire literature of convergent evidence. Some of the accusations that have been made after failures to replicate can be considered defamatory because they call into question a person's professional reputation.
But more broadly, bullying relates to an abuse of power. In academia, journal editors have power: They are the gatekeepers of the published record. They have two clearly defined responsibilities: To ensure that the record is accurate, and to maintain impartiality toward authors. The key to this is independent peer-review. All this was suspended for the special issue: Editors got to cherry-pick replication authors with specific replication targets. Further, only one editor (rather than a number of people with relevant expertise, which could have included the original author, or not) unilaterally decided on the replication quality. So the policeman not only rounded up the suspects, he also served as the judge to evaluate the evidence that led him to round up the suspects in the first place.
Without my intervention to the Editor-in-Chief, 15 replication papers would have gone into print at a peer-reviewed journal without a single finding having been scrutinized by any expert outside of the editorial team. For original papers such an approach would be unthinkable, even though original findings do not nearly have the same reputational implications as replication findings. I spent a tremendous amount of time to do the work that several independent experts should have done instead of me. I did not ask to be put into this position, nor did I enjoy it, but at the end there simply was nobody else to independently check the reliability of the findings that concern the integrity of my work. I would have preferred to spend all that time doing my own work instead of running analyses for which other people get authorship credit.
How do we know whether any of the other findings published in that special issue are reliable? Who will verify the replications of Asch's (1946) and Schachter's (1951) work, since they no longer have a chance to defend their contributions? More importantly, what exactly have we learned if Asch's specific study did not replicate when using identical materials? Do we have to doubt his work and re-write the textbooks despite the fact that many conceptual replications have corroborated his early work? If it were true that studies with small sample sizes are "unlikely to replicate", as some people falsely claim, then there is no point in trying to selectively smoke out false positives by running one-off large-sample direct replications that are treated as if they can invalidate a whole body of research using conceptual replications. It would be far easier to just declare 95% of the published research invalid, burn all the journals, and start from scratch.
Publications are about power: The editors invented new publication rules and in the process lowered the quality well below what is normally expected for a journal article, despite their alleged goal of "increasing the credibility of published results" and the clear reputational costs to original authors once "failed" replications go on the record. The guardians of the published record have a professional responsibility to ensure that whatever is declared as a "finding," whether confirmatory, or disconfirmatory, is indeed valid and reliable. The format of the Replication Special Issue fell completely short of this responsibility, and replication findings were by default given the benefit of the doubt.
Perhaps the editors need to be reminded of their contractual obligation: It is not their job to adjudicate whether a previously observed finding is "real." Their job is to ensure that all claims in any given paper are based on solid evidence. This includes establishing that a given method is appropriate to test a specific research question. A carbon copy of the materials used in an earlier study does not necessarily guarantee that this will be the case. Further, when errors become apparent, it is the editors' responsibility to act accordingly and update the record with a published correction, or a retraction. It is not good enough to note that "more research is needed" to fully establish whether a finding is "real."
Errors have indeed already become apparent in other papers in the replication special issue, such as in the "Many Labs" paper. They need to be corrected, otherwise false claims, including those involving "successful" replications, continue to be part of the published literature on which other researchers build their work. It is not acceptable to claim that it does not matter whether a study is carried out online or in the lab when this is not correct, or to present a graph that includes participants from Italy, Poland, Malaysia, Brazil and other international samples who were primed with an American flag to test the influence on Republican attitudes (Ferguson, Carter & Hassin, 2014). It is not good enough to mention in a footnote that in reality the replication of Jacowitz and Kahneman (1995) was not comparable to the original method, and therefore the reported conclusions do not hold. It is not good enough to brush aside original authors' concerns that the replication of their work did not constitute an appropriate replication and therefore did not fully test the research question (Crisp, Miles, & Husnu, 2014; Schwarz & Strack, 2014).
The printed journal pages constitute the academic record, and each and every finding reported in those pages needs to be valid and reliable. The editors not only owe this to the original authors whose work they targeted, but also to the readers who will rely on their conclusions and take them at face value. Gross oversimplifications of which effects were "successfully replicated" or "unsuccessfully replicated" are simply not good enough when they do not even take into account whether a given method was appropriate to test a given research question in a given sample, especially when not a single expert has verified these claims. In short, replications are not good enough when they are held to a much lower standard than the very findings they end up challenging.
It is difficult not to get the impression that the format of the special issue meant that the editors sided with the replication authors, while completely silencing original authors. In my specific case, the editor even helped them make their case against me in print, by suggesting analyses to address the ceiling effect that few independent peer reviewers would consider acceptable, analyses that would have been unthinkable had I dared to present them as an original author. Indeed, he publicly made it very clear whose side he is on.
Lakens has further publicly been critical of my research area, embodied cognition, because according to him, such research "lacks any evidential value." He came to this conclusion based on a p-curve analysis of a review paper (Meier, Schnall, Schwarz & Bargh, 2012), in which Schnall, Benton & Harvey (2008) is cited. Further, he mocked the recent SPSP Embodiment Preconference at which I gave a talk, a conference that I co-founded several years ago. In contrast, he publicly praised the blog by replication author Brent Donnellan as "excellent"; it is the very blog for which the latter has since apologized because it glorifies failed replications.
Bullying involves an imbalance of power and can take on many forms. It does not depend on your gender, skin colour, whether you are a graduate student or a tenured faculty member. It means having no control when all the known rules regarding scientific communication appear to have been re-written by those who not only control the published record, but in the process also hand down the verdicts regarding which findings shall be considered indefinitely "suspect."
University of Cambridge
22 May 2014
Recently I was invited to be part of a “registered replication” project of my work. It was an interesting experience, which in part was described in an article in Science. My commentary describing specific concerns about the replication paper is available here.
Some people have asked me for further details. Here are my answers to specific questions.
Question 1: “Are you against replications?”
I am a firm believer in replication efforts and was flattered that my paper (Schnall, Benton & Harvey, 2008) was considered important enough to be included in the special issue on “Replications of Important Findings in Social Psychology.” I therefore gladly cooperated with the registered replication project on every possible level: First, following their request I promptly shared all my experimental materials with David Johnson, Felix Cheung and Brent Donnellan and gave detailed instructions on the experimental protocol. Second, I reviewed the replication proposal when requested to do so by special issue editor Daniel Lakens. Third, when the replication authors requested my SPSS files I sent them the following day. Fourth, when the replication authors told me about the failure to replicate my findings, I offered to analyze their data within two weeks’ time. This offer was declined because the manuscript had already been submitted. Fifth, when I discovered the ceiling effect in the replication data I shared this concern with the special issue editors, and offered to help the replication authors correct the paper before it went into print. This offer was rejected, as was my request for a published commentary describing the ceiling effect.
I was told that the ceiling effect does not change the conclusion of the paper, namely that it was a failure to replicate my original findings. The special issue editors Lakens and Nosek suggested that if I had concerns about the replication, I should write a blog; there was no need to inform the journal's readers about my additional analyses. Fortunately Editor-in-Chief Unkelbach overruled this decision and granted published commentaries to all original authors whose work was included for replication in the special issue.
Of course replications are much needed and as a field we need to make sure that our findings are reliable. But we need to keep in mind that there are human beings involved, which is what Danny Kahneman's commentary emphasizes. Authors of the original work should be allowed to participate in the process of having their work replicated. For the Replication Special Issue this did not happen: Authors were asked to review the replication proposal (and this was called "pre-data peer review"), but were not allowed to review the full manuscripts with findings and conclusions. Further, there was no plan for published commentaries; they were only implemented after I appealed to the Editor-in-Chief.
Various errors in several of the replications (e.g., in the "Many Labs" paper) became apparent only once original authors were allowed to give feedback. Errors were uncovered even for successfully replicated findings. But since the findings from "Many Labs" were already heavily publicized several months before the paper went into print, the reputational damage for some people behind the findings already started well before they had any chance to review the findings. "Many Labs" covered studies on 15 different topics, but there was no independent peer review of the findings by experts in those topics.
For all the papers in the special issue the replication authors were allowed the "last word" in the form of a rejoinder to the commentaries; these rejoinders were also not peer-reviewed. Some errors identified by original authors were not appropriately addressed so they remain part of the published record.
Question 2: “Have the findings from Schnall, Benton & Harvey (2008) been replicated?”
We reported two experiments in this paper, showing that a sense of cleanliness leads to less severe moral judgments. Two direct replications of Study 1 have been conducted by Kimberly Daubman and showed the effect (described on Psych File Drawer: Replication 1, Replication 2). These are two completely independent replications that were successfully carried out at Bucknell University. Further, my collaborator Oliver Genschow and I conducted several studies that also replicated the original findings. Oliver ran the studies in Switzerland and in Germany, whereas the original work had been done in the United Kingdom, so it was good to see that the results replicated in other countries. These data are so far unpublished.
Importantly, all these studies were direct replications, not conceptual replications, and they provided support for the original effect. Altogether there are seven successful demonstrations of the effect, using identical methods: Two in our original paper, two by Kimberly Daubman and another three by Oliver Genschow and myself. I am not aware of any unsuccessful studies, either by myself or others, apart from the replications reported by Johnson, Cheung and Donnellan.
In addition, there are now many related studies involving conceptual replications. Cleanliness and cleansing behaviors, such as hand washing, have been shown to influence a variety of psychological outcomes (Cramwinckel, De Cremer & van Dijke, 2012; Cramwinckel, Van Dijk, Scheepers & Van den Bos, 2013; Florack, Kleber, Busch, & Stöhr, 2014; Gollwitzer & Melzer, 2012; Jones & Fitness, 2008; Kaspar, 2013; Lee & Schwarz, 2010a; Lee & Schwarz, 2010b; Reuven, Liberman & Dar, 2013; Ritter & Preston, 2011; Xie, Yu, Zhou & Sedikides, 2013; Xu, Zwick & Schwarz, 2012; Zhong & Liljenquist, 2006; Zhong, Strejcek & Sivanathan, 2010).
Question 3: “What do you think about “pre-data peer review?”
There are clear professional guidelines regarding scientific publication practices and in particular, about the need to ensure the accuracy of the published record by impartial peer review. The Committee on Publications Ethics (COPE) specifies the following Principles of Transparency and Best Practice in Scholarly Publishing: “Peer review is defined as obtaining advice on individual manuscripts from reviewers expert in the field who are not part of the journal’s editorial staff.” Peer review of only methods but not full manuscripts violates internationally acknowledged publishing ethics. In other words, there is no such thing as "pre-data peer review."
Editors need to seek the guidance of reviewers; they cannot act as reviewers themselves. Indeed, "One of the most important responsibilities of editors is organising and using peer review fairly and wisely." (COPE, 2014, p. 8). The editorial process implemented for the Replication Special Issue went against all known publication conventions that have been developed to ensure impartial publication decisions. Such a breach of publication ethics is ironic considering the stated goals of transparency in science and increasing the credibility of published results. Peer review is not censorship; it concerns quality control. Without it, errors will enter the published record, as has indeed happened for several replications in the special issue.
Question 4: “You confirmed that the replication method was identical to your original studies. Why do you now say there is a problem?”
I indeed shared all my experimental materials with the replication authors, and also reviewed the replication proposal. But human nature is complex, and identical materials will not necessarily have the identical effect on all people in all contexts. My research involves moral judgments, for example, judging whether it is wrong to cook and eat a dog after it died of natural causes. It turned out that for some reason in the replication samples many more people found this to be extremely wrong than in the original studies. This was not anticipated and became apparent only once the data had been collected. There are many ways for any study to go wrong, and one has to carefully examine whether a given method was appropriate in a given context. This can only be done after all the data have been analyzed and interpreted.
My original paper went through rigorous peer-review. For the replication special issue all replication authors were deprived of this mechanism of quality control: There was no peer-review of the manuscript, not by authors of the original work, nor by anybody else. Only the editors evaluated the papers, but this cannot be considered peer-review (COPE, 2014). To make any meaningful scientific contribution the quality standards for replications need to be at least as high as for the original findings. Competent evaluation by experts is absolutely essential, and is especially important if replication authors have no prior expertise with a given research topic.
Question 5: “Why did participants in the replication samples give much more extreme moral judgments than in the original studies?”
Based on the literature on moral judgment, one possibility is that participants in the Michigan samples were on average more politically conservative than the participants in the original studies conducted in the UK. But given that conservatism was not assessed in the registered replication studies, it is difficult to say. Regardless of the reason, however, analyses have to take into account the distributions of a given sample, rather than assuming that they will always be identical to the original sample.
Question 6: “What is a ceiling effect?”
A ceiling effect means that responses on a scale are truncated toward the top end of the scale. For example, if the scale had a range from 1-7, but most people selected “7”, this suggests that they might have given a higher response (e.g., “8” or “9”) had the scale allowed them to do so. Importantly, a ceiling effect compromises the ability to detect the hypothesized influence of an experimental manipulation. Simply put: With a ceiling effect it will look like the manipulation has no effect, when in reality the study was unable to test for such an effect in the first place. When a ceiling effect is present no conclusions can be drawn regarding possible group differences.
Research materials have to be designed such that they adequately capture response tendencies in a given population, usually via pilot testing (Hessling, Traxel, & Schmidt, 2004). Because direct replications use the materials developed for one specific testing situation, it can easily happen that materials that were appropriate in one context will not be appropriate in another. A ceiling effect is only one example of this general problem.
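To make this concrete, here is a minimal simulation sketch. It assumes that each participant has a continuous latent judgment that is rounded and clipped to a 1-7 scale, and that the manipulation shifts the latent judgment by a fixed amount. The function name `observed_difference` and all parameter values (effect size, standard deviation, sample size) are illustrative assumptions, not estimates from any of the studies discussed here.

```python
import random
import statistics

def observed_difference(latent_mean, effect=0.5, sd=1.5, n=20000, seed=1):
    """Mean difference on a 1-7 scale between two simulated groups.

    All parameters are hypothetical and chosen for illustration only.
    """
    rng = random.Random(seed)

    def respond(mu):
        # The latent judgment is continuous; the recorded response is
        # rounded to a whole number and clipped to the 1-7 scale.
        latent = rng.gauss(mu, sd)
        return min(7, max(1, round(latent)))

    control = [respond(latent_mean) for _ in range(n)]
    treated = [respond(latent_mean - effect) for _ in range(n)]
    return statistics.mean(control) - statistics.mean(treated)

# Same latent effect in both cases; only the location on the scale differs.
mid_scale = observed_difference(latent_mean=4.0)
near_ceiling = observed_difference(latent_mean=7.5)

print(f"observed difference, responses mid-scale:    {mid_scale:.2f}")
print(f"observed difference, responses near ceiling: {near_ceiling:.2f}")
```

Under these assumptions the same latent shift produces a clearly smaller observed difference when most responses pile up at the top of the scale than when they fall in the middle of it.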
Question 7: “Can you get rid of the ceiling effect in the replication data by throwing away all extreme scores and then analysing whether there was an effect of the manipulation?”
There is no way to fix a ceiling effect. It is a methodological rather than statistical problem, indicating that survey items did not capture the full range of participants’ responses. Throwing away all extreme responses in the replication data of Johnson, Cheung & Donnellan (2014) would mean getting rid of 38.50% of data in Study 1, and 44.00% of data in Study 2. This is a problem even if sample sizes remain reasonable: The extreme responses are integral to the data sets; they are not outliers or otherwise unusual observations. Indeed, for 10 out of the 12 replication items the modal (=most frequent) response in participants was the top score of the scale. Throwing away extreme scores gets rid of the portion of the data where the effect of the manipulation would have been expected. Not only would removing almost half of your sample be considered “p-hacking,” most importantly, it does not solve the problem of the ceiling effect.
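The following sketch illustrates the point under simple assumptions: a continuous latent judgment, rounded and clipped to a 1-7 scale, with a fixed latent shift between conditions. The helper `simulate_group`, the constant `TRUE_EFFECT` and all numeric values are hypothetical and are not taken from the replication data.

```python
import random
import statistics

def simulate_group(latent_mean, sd=1.5, n=20000, seed=0):
    """Simulated 1-7 responses from a normal latent judgment (hypothetical)."""
    rng = random.Random(seed)
    return [min(7, max(1, round(rng.gauss(latent_mean, sd)))) for _ in range(n)]

TRUE_EFFECT = 0.5  # assumed latent shift between conditions
control = simulate_group(6.2, seed=1)
treated = simulate_group(6.2 - TRUE_EFFECT, seed=2)

# "Fix" the ceiling by discarding every response at the top of the scale.
kept_control = [r for r in control if r < 7]
kept_treated = [r for r in treated if r < 7]

share_discarded = 1 - (len(kept_control) + len(kept_treated)) / (
    len(control) + len(treated)
)
trimmed_difference = statistics.mean(kept_control) - statistics.mean(kept_treated)

print(f"share of data discarded: {share_discarded:.0%}")
print(f"difference after trimming: {trimmed_difference:.2f} "
      f"(latent effect: {TRUE_EFFECT})")
```

In this sketch a large share of the simulated responses is discarded, and the difference between the remaining responses still falls well short of the latent shift that was built into the simulation.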
Question 8: “Why is the ceiling effect in the replication data such a big problem?”
Let me try to illustrate the ceiling effect in simple terms: Imagine two people are speaking into a microphone and you can clearly understand and distinguish their voices. Now you crank up the volume to the maximum. All you hear is this high-pitched sound (“eeeeee”) and you can no longer tell whether the two people are saying the same thing or something different. Thus, in the presence of such a ceiling effect it would seem that both speakers were saying the same thing, namely “eeeeee”.
The same thing applies to the ceiling effect in the replication studies. Once a majority of the participants are giving extreme scores, all differences between two conditions are abolished. Thus, a ceiling effect means that all predicted differences will be wiped out: It will look like there is no difference between the two people (or the two experimental conditions).
Further, there is no way to just remove the “eeeeee” from the sound in order to figure out what was being said at that time; that information is lost forever and hence no conclusions can be drawn from data that suffer from a ceiling effect. In other words, you can’t fix a ceiling effect once it’s present.
Question 9: “Why do we need peer-review when data files are anyway made available online?”
It took me considerable time and effort to discover the ceiling effect in the replication data because it required working with the item-level data, rather than the average scores across all moral dilemmas. Even somebody familiar with a research area will have to spend quite some time trying to understand everything that was done in a specific study, what the variables mean, etc. I doubt many people will go through the trouble of running all these analyses and indeed it’s not feasible to do so for all papers that are published.
That is the assumption behind peer-review: You trust that somebody with the relevant expertise has scrutinized a paper regarding its results and conclusions, so you don’t have to. If instead only data files are made available online there is no guarantee that anybody will ever fully evaluate the findings. This puts specific findings, and therefore specific researchers, under indefinite suspicion. This was a problem for the Replication Special Issue, and is also a problem for the Reproducibility Project, where findings are simply uploaded without having gone through any expert review.
Question 10: “What has been your experience with replication attempts?”
My work has been targeted for multiple replication attempts; by now I have received so many such requests that I stopped counting. Further, data detectives have demanded the raw data of some of my studies, as they have done with other researchers in the area of embodied cognition because somehow this research area has been declared “suspect.” I stand by my methods and my findings and have nothing to hide and have always promptly complied with such requests. Unfortunately, there has been little reciprocation on the part of those who voiced the suspicions; replicators have not allowed me input on their data, nor have data detectives exonerated my analyses when they turned out to be accurate.
I invite the data detectives to publicly state that my findings lived up to their scrutiny, and more generally, to share all findings from their secondary data analyses. Otherwise only errors get reported and highly publicized, when in fact the majority of research is solid and unproblematic.
With replicators, too, this has been a one-way street, not a dialogue, which is hard to reconcile with the alleged desire for improved research practices. So far I have seen no actual interest in the phenomena under investigation, as if the goal were to declare the verdict of “failure to replicate” as quickly as possible. None of the replicators gave me any opportunity to evaluate the data before claims of “failed” replications were made public. Replications could tell us a lot about boundary conditions of certain effects, which would drive science forward, but this needs to be a collaborative process.
The most stressful aspect has not been to learn about the “failed” replication by Johnson, Cheung & Donnellan, but to have had no opportunity to make myself heard. There has been the constant implication that anything I could possibly say must be biased and wrong because it involves my own work. I feel like a criminal suspect who has no right to a defense and there is no way to win: The accusations that come with a "failed" replication can do great damage to my reputation, but if I challenge the findings I come across as a “sore loser.”
Question 11: “But… shouldn’t we be more concerned about science rather than worrying about individual researchers’ reputations?”
Just like everybody else, I want science to make progress, and to use the best possible methods. We need to be rigorous about all relevant evidence, whether it concerns an original finding, or a “replication” finding. If we expect original findings to undergo tremendous scrutiny before they are published, we should expect no less of replication findings.
Further, careers and funding decisions are based on reputations. The implicit accusations that currently come with failure to replicate an existing finding can do tremendous damage to somebody’s reputation, especially if accompanied by mocking and bullying on social media. So the burden of proof needs to be high before claims about replication evidence can be made. As anybody who’s ever carried out a psychology study will know, there are many reasons why a study can go wrong. It is irresponsible to declare replication results without properly assessing the quality of the replication methods, analyses and conclusions.
Let's also remember that there are human beings involved on both sides of the process. It is uncollegial, for example, to tweet and blog about people's research in a mocking tone, accuse them of questionable research practices, or worse. Such behavior amounts to bullying, and needs to stop.
Question 12: “Do you think replication attempts selectively target specific research areas?”
Some of the findings obtained within my subject area of embodied cognition may seem surprising to an outsider, but they are theoretically grounded and there is a highly consistent body of evidence. I have worked in this area for almost 20 years and am confident that my results are robust. But somehow recently all findings related to priming and embodied cognition have been declared “suspicious.” As a result I have even considered changing my research focus to “safer” topics that are not constantly targeted by replicators and data detectives. I sense that others working in this area have suffered similarly, and may be rethinking their research priorities, which would result in a significant loss to scientific investigation in this important area of research.
Two main criteria appear to determine whether a finding is targeted for replication: Is a finding surprising, and can a replication be done with few resources, i.e., is a replication feasible? Whether a finding surprises you or not depends on familiarity with the literature: the lower your expertise, the higher your surprise. This makes counterintuitive findings a prime target, even when they are consistent with a large body of other findings. Thus, there is a disproportionate focus on areas with surprising findings for which replications can be conducted with limited resources. It also means that the burden of responding to “surprised” colleagues is very unevenly distributed and researchers like myself are targeted again and again, which will do little to advance the field as a whole. Such practices inhibit creative research on phenomena that may be counterintuitive because they motivate researchers to play it safe and stay well below the radar of the inquisitors.
Overall it's a base rate problem: If replication studies are cherry-picked simply due to feasibility considerations, then a very biased picture will emerge, especially if failed replications are highly publicized. Certain research areas will suffer from the stigma of "replication failure" while other research areas will remain completely untouched. A truly scientific approach would be to randomly sample from the entire field, and conduct replications, rather than focus on the same topics (and therefore the same researchers) again and again.
Question 13: “So far, what has been the personal impact of the replication project on you?”
The “failed” replication of my work was widely announced to many colleagues in a group email and on Twitter already in December, before I had the chance to fully review the findings. So the defamation of my work already started several months before the paper even went into print. I doubt anybody would have widely shared the news had the replication been considered “successful.” At that time the replication authors also put up a blog entitled “Go Big or Go Home,” in which they declare their studies an “epic fail.” Considering that I helped them in every possible way, even offered to analyze the data for them, this was disappointing.
At a recent interview for a big grant I was asked to what extent my work was related to Diederik Stapel’s and therefore unreliable; I did not get the grant. The logic is remarkable: if researcher X faked his data, then the phenomena addressed by that whole field are called into question.
Further, following a recent submission of a manuscript to a top journal a reviewer raised the issue about a “failed” replication of my work and therefore called into question the validity of the methods used in that manuscript.
The constant suspicions create endless concerns and second thoughts that impair creativity and exploration. My graduate students are worried about publishing their work out of fear that data detectives might come after them and try to find something wrong in their work. Doing research now involves anticipating a potential ethics or even criminal investigation.
Over the last six months I have spent a tremendous amount of time and effort on attempting to ensure that the printed record regarding the replication of my work is correct. I made it my top priority to try to avert the accusations and defamations that currently accompany claims of failed replications. Fighting on this front has meant that I had very little time and mental energy left to engage in activities that would actually make a positive contribution to science, such as writing papers, applying for grants, or doing all the other things that would help me get a promotion.
Question 14: “Are you afraid that being critical of replications will have negative consequences?”
I have already been accused of “slinging mud”, “hunting for artifacts” and “soft fraud”, and it is possible that further abuse and bullying will follow, as has happened to colleagues who responded to failed replications by pointing out errors of the replicators. The comments posted below the Science article suggest that even before my commentary was published people were quick to judge. In the absence of any evidence, many people immediately jumped to the conclusion that something must be wrong with my work, and that I'm probably a fraudster. I encourage the critics to read my commentary so they can arrive at a judgment based on the facts.
I will continue to defend the integrity of my work, and my reputation, as anyone would, but I fear that this will put me and my students even more directly into the cross hairs of replicators and data detectives. I know of many colleagues who are similarly afraid of becoming the next target and are therefore hesitant to speak out, especially those who are junior in their careers or untenured. There now is a recognized culture of “replication bullying:” Say anything critical about replication efforts and your work will be publicly defamed in emails, blogs and on social media, and people will demand your research materials and data, as if they are on a mission to show that you must be hiding something.
I have taken on a risk by publicly speaking out about replication. I hope it will encourage others to do the same, namely to stand up against practices that do little to advance science, but can do great damage to people's reputations. Danny Kahneman, who so far has been a big supporter of replication efforts, has now voiced concerns about a lack of "replication etiquette," where replicators make no attempts to work with authors of the original work. He says that "this behavior should be prohibited, not only because it is uncollegial but because it is bad science. A good-faith effort to consult with the original author should be viewed as essential to a valid replication."
Let's make replication efforts about scientific contributions, not about selectively targeting specific people, mockery, bullying and personal attacks. We could make some real progress by working together.