Researcher Degrees of Freedom and Systematically Approaching Robustness Checks: An Example

Recently, I shared my reactions to some new political science research, paying special attention to researcher degrees of freedom and how the strength of empirical results was characterized. That research could’ve followed many different analytical paths, yet readers saw results from only a few combinations of analytical choices. In this context, showing the distribution of results across various specifications would add some clarity. What kind of exercise did I have in mind (and why is it useful)? I’ll show an example using replication data from a completely different study (note: this comes from a course assignment that involved replicating and extending the study). I hope this not only clarifies the value of specification curve-type analyses, but also shows that studies with strong research designs for causal inference — experiments, regression discontinuity designs — are not immune to problems related to researcher degrees of freedom.

Lehmann and Masterson (2020) have a very interesting recent political science article on the relationship between refugee aid and violence toward refugees. In an RDD setup where household altitude determined refugee aid allocation (making for an as-if randomly assigned treatment), the authors find that “cash transfers [aid] did not increase hostility toward refugees, and if anything they may have reduced it.” In addition to the results in the main text, they include several appendix results to argue that their “results are robust to numerous modeling choices.” While useful, these robustness checks aren’t systematic — they single out a few analytical choices and show results using them, but do not cover the full space of analytical combinations. Showing each combination in its own table would be cumbersome, but a specification curve — showing the distribution of results across all defensible ways of analyzing the data — conveys this systematic coverage clearly and easily, and would make us more confident in the robustness of their results.

Analysis could’ve diverged along five paths:

  • 1) functional form: linear vs. quadratic model
  • 2) including vs. excluding outlier data
  • 3) including vs. excluding potentially contaminated units
  • 4) control variables: none vs. only those showing treatment assignment imbalance vs. all household-level traits
  • 5) outcome measurement: a few formulations of the four-point scale in categorical or continuous form, plus a version with violence from fellow Syrians subtracted out, yielding five possible DVs
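To make the combinatorics concrete, here is a minimal sketch (in Python, not the authors’ code) of how these five choices multiply into a full grid of specifications; all of the labels and names below are hypothetical placeholders rather than variables from the replication data:

```python
from itertools import product

functional_forms = ["linear", "quadratic"]
keep_outliers = [True, False]          # include vs. exclude outlier data
keep_contaminated = [True, False]      # include vs. exclude potentially contaminated units
control_sets = ["none", "imbalanced_only", "all_household"]
dv_codings = ["dv_1", "dv_2", "dv_3", "dv_4", "dv_5"]  # five hypothetical DV codings

# One dict per specification: 2 * 2 * 2 * 3 * 5 = 120 combinations per outcome.
specs = [
    {"functional_form": f, "keep_outliers": o, "keep_contaminated": c,
     "controls": ctrl, "dv": dv}
    for f, o, c, ctrl, dv in product(functional_forms, keep_outliers,
                                     keep_contaminated, control_sets, dv_codings)
]
print(len(specs))  # 120
```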

For each of the two key outcomes — verbal assault and physical violence by Lebanese natives toward Syrian refugees — this amounted to 120 analytical combinations (2 × 2 × 2 × 3 × 5). The plots below show the distribution of t-statistics for the refugee aid treatment effect across every model, by DV, along with x’s indicating the t-stats highlighted in the main text. Dashed lines mark critical values of -1.96 and 1.96.
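Before turning to the plots themselves: as a rough sketch (again, not the authors’ code), the plotting step could look something like this, assuming `t_stats` holds the t-statistics collected from the 120 models for one DV and `highlighted` holds the main-text values (both names are hypothetical):

```python
import matplotlib.pyplot as plt

def plot_t_distribution(t_stats, highlighted, title):
    """Histogram of t-statistics across specifications for one DV.

    t_stats: t-statistics from all 120 models; highlighted: the values
    emphasized in the main text. Both are hypothetical placeholder inputs.
    """
    fig, ax = plt.subplots()
    ax.hist(t_stats, bins=30, color="grey", edgecolor="white")
    for tv in highlighted:                  # x's for the main-text t-stats
        ax.plot(tv, 0, marker="x", color="black", markersize=10, clip_on=False)
    for crit in (-1.96, 1.96):              # conventional critical values
        ax.axvline(crit, linestyle="--", color="red")
    ax.set_xlabel("t-statistic for the refugee aid treatment effect")
    ax.set_ylabel("Number of specifications")
    ax.set_title(title)
    return fig
```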

Let’s start with the first one:

What do we learn? For the verbal assault DV, nearly all of the t-stats fall between -1.96 and 1.96; only 7% of models produce treatment effects that are significant at conventional levels. The t-stats emphasized in the main text, however, all cluster around -1.96 — they are not representative of what the full set of models shows. Verbal assault results are thus weaker and less robust to modeling decisions than previously realized. It’s worth emphasizing that this does not necessarily imply ill intent on the part of the researchers. While it could’ve resulted from intentional modeling decisions (e.g., p-hacking), it could just as easily have been unintentional and simply unlucky — a certain modeling approach was chosen and run with, and the researchers had no idea it produced an unrepresentative result. But herein lies the value of exercises like this (and why it’s always worth probing further on empirical results that involve many researcher degrees of freedom); without them, we might be misled.

For the physical violence outcome, we get a more mixed bag in terms of whether the main text results were representative of all combinations. The distribution of t-stats is bimodal, with peaks around 0 and just outside -1.96. While the emphasized t-stats appear somewhat more representative here, it turns out that all of the strongest results (t < -1.96) are driven entirely by the quadratic modeling approach, as this second graph shows (plotting the violence outcome t-stat distribution, colored by choice of functional form):

This is not reassuring for assessing the strength of results on this DV: once again, we find weak robustness to modeling decisions.
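For anyone reproducing this kind of split, a hedged sketch of the grouped plot might look like the following, assuming a `results` data frame (a hypothetical name) with one row per specification and `t_stat` and `functional_form` columns:

```python
import matplotlib.pyplot as plt

def plot_by_functional_form(results):
    """Overlay t-stat histograms by functional form for the physical violence DV.

    results: hypothetical pandas DataFrame with one row per specification and
    't_stat' and 'functional_form' ("linear" / "quadratic") columns.
    """
    fig, ax = plt.subplots()
    for form, color in [("linear", "steelblue"), ("quadratic", "darkorange")]:
        subset = results.loc[results["functional_form"] == form, "t_stat"]
        ax.hist(subset, bins=30, alpha=0.6, color=color, label=form)
    for crit in (-1.96, 1.96):
        ax.axvline(crit, linestyle="--", color="red")
    ax.set_xlabel("t-statistic (physical violence DV)")
    ax.set_ylabel("Number of specifications")
    ax.legend(title="Functional form")
    return fig
```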

In sum, this type of exercise makes clear the value of taking a systematic approach to assessing the robustness of empirical results. From a statistical standpoint, we learn that treatment effects on both outcomes have a very wide distribution of t-statistic values; results are pretty sensitive to model specification. At the same time, verbal assault DV results do tend to fall around a t-stat of -1 (making us more sure of a null result there), while for physical violence things are much less clear given the especially high sensitivity to functional form. From a substantive standpoint, we learn that results might be even more consistent with a null effects story than the “null and maybe negative effects” frame the authors take. Mechanism tests in the paper center heavily on the idea that aid reduced violence — with a fuller view of results in hand, it would’ve been good to more closely consider why the treatment seemed to register so little effect (Too weak for the native population to notice? Too weak for refugees to meaningfully contribute to the native community?).

Thus, we leave with somewhat different takeaways than previously acknowledged, and a much clearer sense of what refugee aid’s treatment effect is under various ways of analyzing the data. Hopefully, more empirical papers will take this approach of executing robustness checks more systematically in the face of many researcher degrees of freedom, and show the results. Oftentimes, carrying it out is just a matter of automating the different ways of estimating the treatment effect (e.g., I created a function and used for loops to run the analysis above). Other resources also exist that help on this front.
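As an illustration of the “function plus loops” idea, a hedged sketch of the core estimation step (not the code actually used for the analysis above) could map a single fitting function over the specification grid; here I use statsmodels’ formula-based OLS, and every column and helper name (`treatment`, `running_var`, `subset_data`, the control columns) is a hypothetical placeholder:

```python
import statsmodels.formula.api as smf

# Hypothetical column names standing in for the household-level covariates.
CONTROL_SETS = {
    "none": [],
    "imbalanced_only": ["imbalanced_covariate_1"],
    "all_household": ["household_covariate_1", "household_covariate_2"],
}

def estimate_one(data, spec):
    """Fit one specification and return the t-statistic on the aid treatment.

    `data` is assumed to already reflect the sample choices (outliers,
    contaminated units) and to contain the chosen DV column; every name
    here is a placeholder, not a variable from the replication files.
    """
    rhs = "treatment + running_var"                 # running variable in the RDD
    if spec["functional_form"] == "quadratic":
        rhs += " + I(running_var ** 2)"             # add the quadratic term
    controls = CONTROL_SETS[spec["controls"]]
    if controls:
        rhs += " + " + " + ".join(controls)
    fit = smf.ols(f"{spec['dv']} ~ {rhs}", data=data).fit()
    return fit.tvalues["treatment"]

# One t-statistic per specification (with specs built as in the earlier grid sketch):
# t_stats = [estimate_one(subset_data(raw_data, spec), spec) for spec in specs]
```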

Update: A “multiverse analysis” (see here) better fits what I have in mind than the term “specification curve” used above.
