We have a problem: What 34% reproducibility actually means

Last month, a consortium of roughly 800 researchers published three papers in Nature through a project called SCORE (Systematizing Confidence in Open Research and Evidence). I was part of the team that examined analytical robustness in social science — and, at the risk of sounding alarmist, the results weren’t pretty.

Here's the core finding: we took 100 highly cited studies, gave 200 researchers the exact same original data and hypothesis, and asked them to analyze it independently. Only 34% reproduced the original effect exactly. Effect sizes shrank by a median of 50%. Most results stayed qualitatively similar, but the numerical variation was substantial; in fact, 2% of analysts found the opposite effect from what the original study reported.

This isn't about researcher dishonesty. It's about a systemic problem: we've built science that permits multiple legitimate analytical choices to produce different conclusions from identical data.

Why This Matters

You've probably heard the term "researcher degrees of freedom." It means that when a dataset lands on your desk, there are dozens of defensible choices waiting to be made: How do you handle missing data? Which outliers do you exclude? Do you log-transform this variable? Which covariates go into the model? Do you include interaction terms? We talk through all of these questions, plus more, in my Advanced Statistics class.

Each choice is reasonable. Each has justification. But collectively, they create a landscape where the "same" analysis can legitimately arrive at different answers.

This is the analytical flexibility problem. And the SCORE findings suggest it's not a theoretical concern — it's a documented feature of how we actually do science. Variation exists when we conduct data analysis; that’s normal. The troubling part is how often we pretend it doesn't.
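
To make this concrete, here is a minimal sketch of what is often called a multiverse (or specification-curve) analysis: run every defensible combination of choices and see how much the estimate moves. Everything in it is illustrative; the data are simulated, and the three choices (log-transforming the outcome, excluding outliers, adding a covariate) are hypothetical stand-ins for the decisions above, not the SCORE teams' actual pipeline.

```python
# Illustrative multiverse sketch on simulated data; not the SCORE analysis pipeline.
import itertools

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "x": rng.normal(size=n),  # predictor of interest
    "z": rng.normal(size=n),  # an optional covariate
})
latent = 0.3 * df["x"] + 0.5 * df["z"] + rng.normal(scale=2, size=n)
df["y"] = np.exp(latent / latent.std())  # skewed outcome, so a log-transform is defensible

results = []
# Three defensible choices give eight combinations; each is a "legitimate" analysis.
for log_y, drop_outliers, add_cov in itertools.product([False, True], repeat=3):
    d = df.copy()
    if log_y:
        d["y"] = np.log(d["y"])
    if drop_outliers:
        d = d[np.abs(d["y"] - d["y"].mean()) < 2.5 * d["y"].std()]
    fit = smf.ols("y ~ x + z" if add_cov else "y ~ x", data=d).fit()
    results.append({
        "log_y": log_y, "drop_outliers": drop_outliers, "covariate": add_cov,
        "effect_of_x": round(fit.params["x"], 3),
        "p_value": round(fit.pvalues["x"], 3),
    })

print(pd.DataFrame(results))  # same data, eight different answers
```

Run it and the estimated effect of x (and its p-value) shifts across the eight specifications, even though every row starts from the exact same raw data.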

What the Full SCORE Project Tells Us

This robustness study is just one component of the larger SCORE project. The full project examined three dimensions of credibility:

  1. Reproducibility (same data, same code): About half of claims were precisely reproduced. Three-quarters were at least approximately reproduced.

  2. Replicability (new data, same question): About half of claims replicated with statistical significance. But median effect sizes shrank by more than 50%.

  3. Robustness (same data, different analyses): The study I was part of, which found that reasonable variations in how the analysis was conducted led to wildly different results from the same data.

What's striking is that these three don't correlate strongly with each other. A study could be reproducible but not robust. Robust but not replicable. The implication is uncomfortable: there's no single "credibility score" for research. Different dimensions fail in different ways, and we can't predict which ones will be vulnerable just by looking at one metric.

The research also shows that credibility is heterogeneous across disciplines. Political science and economics showed higher reproducibility. But no field consistently outperformed others across all dimensions.

So What’s Next?

Here's the hardest part to sit with: many of the findings we've relied on in social science for the last two decades may be credible in some dimensions and fragile in others.

The SCORE project doesn't invalidate past work. But it does suggest that we've been systematically overconfident in claims that look solid under one lens but shatter under another.

That's not a reason to burn it all down. It's a reason to be more careful with what comes next.

Maybe that means putting more effort into requiring preregistration and openly available data and code, and into training future scholars to do the same. Maybe that means changing how journals decide “what’s publishable” to incentivize replication and robustness over novelty.

The Nature papers, press coverage, and full datasets are all available now. The work is public, and the methods are transparent. That's what makes this different from just complaining — it's a documented map of where the fragility lives, and it points us towards potential next steps.

Why We Need “Big Team Science”

One last reflection: it was an honor for me to be part of this massive team with amazing leadership coming out of the Center for Open Science. But when I tell people that I was “part of a 200-person authorship team,” many raise their eyebrows.

I believe this is the future of scientific research. The biggest, most groundbreaking studies require far more effort than any one scientist or lab can provide on their own. We need to create, invest in, and engage in these multi-lab, multi-country collaborations to pool our resources and collectively work towards important research that otherwise would never see the light of day.

Luckily, this isn’t the only group engaging in this “big team science” approach. I’m part of the leadership team for a study at the Psychological Science Accelerator investigating gender differences in leadership across 30+ countries with 40+ labs; I get to coordinate and clean the data coming in from all of these sources, across a variety of platforms and languages, so we can test the same fundamental question. Other groups working on similar initiatives include the Advancement of Replications Initiative in Management and the Berkeley Initiative for Transparency in the Social Sciences.
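
For a flavor of what that coordination looks like, here is a minimal, hypothetical sketch of pooling per-lab files into one analysis table. The file layout, column names, and rescaling rule below are illustrative assumptions for this post, not our study's actual pipeline.

```python
# Hypothetical sketch of pooling per-lab data files; not our study's actual pipeline.
from pathlib import Path

import pandas as pd

# Labs export slightly different column names; map them onto one shared schema.
COLUMN_ALIASES = {
    "participant": "participant_id", "subj_id": "participant_id",
    "leader_rating": "leadership_score", "lead_score": "leadership_score",
    "lang": "language", "idioma": "language",
}

def load_lab_file(path: Path) -> pd.DataFrame:
    """Read one lab's CSV, standardize its columns, and tag rows with the lab ID."""
    df = pd.read_csv(path)
    df = df.rename(columns=COLUMN_ALIASES)
    df["lab_id"] = path.stem  # e.g. "lab_BR_03.csv" -> "lab_BR_03"
    # Rescale labs that used a 1-100 slider onto the shared 1-7 rating scale.
    if df["leadership_score"].max() > 7:
        df["leadership_score"] = 1 + 6 * (df["leadership_score"] - 1) / 99
    return df[["lab_id", "participant_id", "language", "leadership_score"]]

# Pool every lab's file so the same model can be fit to one combined dataset.
combined = pd.concat(
    [load_lab_file(p) for p in sorted(Path("data/labs").glob("*.csv"))],
    ignore_index=True,
)
print(combined.groupby("lab_id")["leadership_score"].describe())
```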

For those beginning their scholarly journey who are interested in contributing to these bigger, system-wide efforts to improve how we do science, I strongly recommend exploring these groups.

It may not be as flashy as leading your own major study, but sometimes being a small cog in the wheel (as I was for these Nature papers!) brings more long-term rewards for science as a whole.

Read the papers:

Learn more about SCORE:
