A recent political science working paper titled “Racial Equality Frames and Public Policy Support: Survey Experimental Evidence” by English and Kalla (whom I’ll refer to as EK) has garnered a lot of attention and generated plenty of interesting discussion. Many important sticking points have already been debated at length, but I wanted to highlight a few reactions I had while reading the paper that I haven’t seen mentioned much. They include both supportive and critical perspectives, and while they are specific to this paper, they also apply broadly to how we do survey experiments and empirical research.
1. In survey experiments, testing stimuli drawn from the real world is both valuable and justifiable. One of the more common critiques of EK’s work has been the choice of treatment frames used. There will always be substantial room for researcher discretion in how experimental stimuli are created. EK are motivated by real world messaging appeals, made by current politicians, that increasingly invoke race. In this sense, the stimuli of interest are already clearly defined, and the jump from real world construct to experimental representation is much less ambiguous. This is the defense one of the authors offers for EK’s frame choices: these are frames from the real world, which in turn confers benefits such as stronger external validity. I confronted similar critiques in a messaging-type survey experiment I did in the past and had a similar defense: if your experimental design connects clearly to your research question, and that question seeks to test reactions to pieces of real world information environments, stimuli based on real world content are very useful.
2. We should be cautious with judging effect sizes in messaging survey experiments, which can underestimate influence. Another common pushback EK received centered on the seemingly minor size of the treatment effects on public opinion. Moving beyond only considering statistical significance is important, but oftentimes people have benchmarks in mind for opinion change (e.g. those from observational polling trends) that are not comparable to experimental treatment effect magnitudes. Survey experiments give a good, but far from perfect, sense of how large an effect treatments would have in the real world. I can think of at least two reasons why effects may appear smaller in experiments, mainly in the context of treatments based on messaging, cues, and information. First, subjects may have received the treatment before entering the experiment and thus be less moveable by it once in the study. This is known as “pretreatment effects.” For example, imagine that some subjects already heard the class- or race-based appeals in EK’s design before the study, and updated their policy opinions in response. Had this not happened (and those persuadable by the frames weren’t already persuaded), we’d observe larger effects from the experiment. Second, a one-shot treatment like EK’s is not the same as cumulative, repeated exposure to it, which is the type of “treatment receipt” that’s more likely to occur in the real world if a certain treatment frame is adopted widely (see here for a related perspective). Whether or not these issues are the cause, it might be unsurprising that recent research argues survey experiments might actually be underestimating the effects of cue- and information-based treatments.
3. The difference between a significant effect and insignificant effect is not necessarily itself significant. At the heart of EK’s study is the finding that class frames are more effective than race or race+class ones. Nearly all public discussion of the paper revolves around this point too. After all, testing which message is most effective is a question of comparison. Unfortunately, EK do not provide the statistical test that is central to their study’s purpose and necessary for properly understanding its findings: whether treatment effect A (e.g. class frame) is statistically significantly different from treatment effect B (e.g. race frame). This is of course not a fatal error, since the key test can easily be done and added to the paper, but it’s worth pointing out that it is omitted. Sadly I can’t say it was strange to see it missing, as this slippage (seeing one significant effect, another insignificant effect, and discussing results as if the two effects are themselves distinguishable) is endemic in quantitative social science.
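To make the logic concrete, here is a minimal sketch of the missing comparison, using a large-sample z-test on the difference in means between two arms. The summary statistics are invented for illustration and are not taken from EK’s paper; the names `compare_arms`, `class_frame`, and `race_frame` are likewise my own.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

def compare_arms(arm_x, arm_y):
    """Difference in means (x - y) and its two-sided p-value (normal approximation)."""
    diff = arm_x["mean"] - arm_y["mean"]
    se = math.sqrt(arm_x["sd"] ** 2 / arm_x["n"] + arm_y["sd"] ** 2 / arm_y["n"])
    return diff, two_sided_p(diff / se)

# Hypothetical summary statistics, NOT from EK's paper.
control = {"mean": 0.50, "sd": 0.30, "n": 1000}
class_frame = {"mean": 0.53, "sd": 0.30, "n": 1000}
race_frame = {"mean": 0.52, "sd": 0.30, "n": 1000}

_, p_class = compare_arms(class_frame, control)   # class frame vs. control
_, p_race = compare_arms(race_frame, control)     # race frame vs. control
_, p_diff = compare_arms(class_frame, race_frame) # the comparison of interest

print(f"class vs. control: p = {p_class:.3f}")
print(f"race vs. control:  p = {p_race:.3f}")
print(f"class vs. race:    p = {p_diff:.3f}")
```

With these made-up numbers, the class frame clears p<.05 against control and the race frame does not, yet the class-versus-race difference is nowhere near significant. Comparing the two treatment arms directly also sidesteps the fact that two treatment-versus-control estimates share a control group and so are not independent.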
4. In the presence of many researcher degrees of freedom, we need a clear sense of what all combinations show and the (in)consistency of results. Experiments are often nice because, compared to observational work, researchers have far fewer degrees of freedom (i.e. choices in how to conduct analysis), analysis is straightforward, and as a result we might be more inclined to believe the result is not a false positive or p-hacked. When reading EK’s paper, which uses an experiment, I was surprised to see how many different analytical choices were being used. For example, analysis could diverge on the following dimensions: 1) continuous outcome vs. binary measure, 2) survey weights vs. no weights (the construction of which, a little concerningly, was not specified in the pre-analysis plan), 3) pretreatment covariates as controls vs. none, 4) pooled across issues vs. first issue (or other issue subsets) only, and 5) many different subgroups where effects were checked. Despite many researcher degrees of freedom, we only see a small slice of all possible combinations. It would be unwieldy to individually present results from all combinations, but this issue could be better handled (e.g. maybe showing a distribution of effects/t-statistics/p-values for all combinations). At the very least, we need to have a much better sense of how consistent results from all combinations are, as this bears directly on any conclusions we can draw.
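As a rough sketch of what “showing all combinations” could look like, the snippet below enumerates every combination of a few analytic choices and collects the estimate and p-value from each. Everything here is an assumption for illustration: the data are simulated, the `run_spec` helper is hypothetical, and the grid covers only three of the five dimensions listed above (weights and covariate adjustment are left out to keep it short).

```python
import itertools
import math
import random
import statistics

random.seed(0)

# Simulated respondents (hypothetical data, not EK's).
rows = []
for _ in range(4000):
    treated = random.random() < 0.5
    y = random.gauss(0.50 + 0.02 * treated, 0.25)
    rows.append({
        "treated": treated,
        "issue": random.randrange(3),         # which policy issue was shown
        "group": random.choice(["a", "b"]),   # stand-in for a subgroup variable
        "y": min(1.0, max(0.0, y)),           # continuous outcome clipped to 0-1
    })

# Analytic choices; labels are illustrative stand-ins.
choices = {
    "outcome": ["continuous", "binary"],
    "issues": ["pooled", "first_only"],
    "sample": ["all", "a", "b"],
}

def run_spec(spec):
    """Treated-minus-control difference in means and two-sided p-value for one spec."""
    data = rows
    if spec["issues"] == "first_only":
        data = [r for r in data if r["issue"] == 0]
    if spec["sample"] != "all":
        data = [r for r in data if r["group"] == spec["sample"]]

    def outcome(r):
        return float(r["y"] > 0.5) if spec["outcome"] == "binary" else r["y"]

    t = [outcome(r) for r in data if r["treated"]]
    c = [outcome(r) for r in data if not r["treated"]]
    diff = statistics.mean(t) - statistics.mean(c)
    se = math.sqrt(statistics.variance(t) / len(t) + statistics.variance(c) / len(c))
    return diff, math.erfc(abs(diff / se) / math.sqrt(2))

# One dict per combination of choices.
specs = [dict(zip(choices, combo)) for combo in itertools.product(*choices.values())]
results = [(spec, *run_spec(spec)) for spec in specs]

n_sig = sum(p < 0.05 for _, _, p in results)
print(f"{n_sig} of {len(results)} specifications have p < .05")
for spec, diff, p in sorted(results, key=lambda r: r[2]):
    print(f"{spec}  diff={diff:+.3f}  p={p:.3f}")
```

The sorted list (or a histogram of the p-values) gives readers a quick read on whether a headline result holds across reasonable specifications or depends on one particular set of choices.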
5. Many borderline significant results, a weird distribution of p-values, and a large number of statistical tests without correction for multiple testing are all concerning. As I was reading EK’s paper, I started noticing a peculiar number of results that were significant at the p<.05 level. (Given that the authors distinguish between significance at the .01, .05, and .10 levels, I assume these p<.05 results mean the p-value was between .01 and .05, and not just anywhere below .05.) By my count, out of the 18 reported p-values in the paper’s main text, 4 were <.01, 7 were between .01 and .05, 3 were between .05 and .10, and 4 were >.10. I also counted 51 different statistical tests (e.g. treatment X vs. control comparisons). Why is this concerning? A concentration of results right near the .05 significance threshold is odd and typically seen as a telltale sign of false positive results. Indeed, lower p-values correlate with higher replicability, and even small shifts within the p = .01 to .10 range seem to matter. These concerns are exacerbated by the large number of researcher degrees of freedom that I noted earlier (e.g. what if one small change in choice of analysis moves a p<.05 result above the threshold?). Importantly, EK do not correct for multiple testing in any way, which is especially problematic given the number of tests here (more tests mean more opportunity to stumble on false positives). It’s not clear how many of their significant results would survive multiple testing corrections, but things don’t bode well in light of the number of borderline significant results.
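For readers unfamiliar with the mechanics, here is a small sketch of two standard corrections: Holm (which controls the family-wise error rate) and Benjamini-Hochberg (which controls the false discovery rate). The paper reports significance bins rather than exact values, so the p-values below are invented; I only made them match the bin counts I tallied above (4, 7, 3, 4). The output therefore says nothing definitive about which of EK’s actual results would survive.

```python
def holm(pvals):
    """Holm step-down adjusted p-values (family-wise error rate control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k + 1), enforce monotonicity.
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (false discovery rate control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        # Scale the k-th smallest p-value by m / k, working down from the largest.
        running_min = min(running_min, pvals[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values chosen only to match the bin counts (4, 7, 3, 4).
pvals = [
    0.001, 0.003, 0.006, 0.009,                       # < .01
    0.012, 0.018, 0.024, 0.031, 0.037, 0.042, 0.048,  # .01 to .05
    0.060, 0.070, 0.090,                              # .05 to .10
    0.150, 0.300, 0.500, 0.800,                       # > .10
]

raw_sig = sum(p < 0.05 for p in pvals)
holm_sig = sum(p < 0.05 for p in holm(pvals))
bh_sig = sum(p < 0.05 for p in benjamini_hochberg(pvals))

print(f"significant at .05: raw={raw_sig}, Holm={holm_sig}, BH={bh_sig}")
```

The pattern to notice is that results clustered just under .05 are the first to fall once any correction is applied, and the penalty grows with the number of tests in the family (18 here; it would be harsher still with all 51 tests).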