Placebo statements in list experiments: Evidence from a face-to-face survey in Singapore

Abstract List experiments are a widely used survey technique for estimating the prevalence of socially sensitive attitudes or behaviors. Their design, however, makes them vulnerable to bias: because treatment group respondents see more items (J + 1) than control group respondents (J), the treatment group mean may be mechanically inflated simply by the longer list. The few previous studies that directly examine this possibility do not arrive at definitive conclusions. We find clear evidence of inflation in an original dataset, though only among respondents with low educational attainment. Reanalyzing available data from previous studies, we find similar heterogeneous patterns. This evidence of heterogeneous effects has implications for the interpretation of previous research using list experiments, especially in developing world contexts. We recommend a simple solution: a necessarily false placebo statement for the control group equalizes list lengths, thereby protecting against mechanical inflation without imposing costs or altering interpretations.


Introduction
List experiments (also known as the item count technique, ICT) are a widely used survey technique designed to elicit true preferences on sensitive topics that are vulnerable to social desirability bias (Rosenfeld et al. 2016). They work as follows: respondents are divided into control and treatment groups. The control group is shown J non-sensitive statements and asked to indicate how many are true. The treatment group is shown J + 1 statements, where the J statements are the same as for the control group and the + 1 is a sensitive item that may elicit socially desirable responses if asked directly. The difference in the mean number of true statements between the control and treatment groups, referred to as the difference-in-means estimator, is interpreted as the percentage of the population for whom the sensitive statement is true. This technique has been used to estimate the prevalence of a wide range of socially sensitive attitudes and behaviors, from ethnic prejudice to sexual practices and voting behavior.[1]

List experiments are subject to both strategic and non-strategic respondent error (Ahlquist 2018). Strategic errors arise when respondents lie to conceal their position on the sensitive issue, which is revealed when all or none of the statements are indicated as true. To prevent these ceiling and floor effects, best practice calls for one relatively rare and one relatively common item (Blair and Imai 2012; Glynn 2013). Non-strategic error includes such things as coding errors and poor-quality responses that arise when respondents do not understand or rush through the list experiment. As noted by Ahlquist (2018), previous work on ICT has generally disregarded the implications of non-strategic errors.
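In code, the difference-in-means estimator described above reduces to a comparison of group averages. The sketch below illustrates it with made-up item counts (the data and function name are our own, not from any study):

```python
import numpy as np

def diff_in_means(treatment_counts, control_counts):
    """Difference-in-means estimator for a list experiment.

    The estimated prevalence of the sensitive item is the mean item
    count in the J + 1 treatment group minus the mean count in the
    J-item control group.  Returns the estimate and its (unpooled)
    standard error, as used in a Welch t-test.
    """
    t = np.asarray(treatment_counts, dtype=float)
    c = np.asarray(control_counts, dtype=float)
    est = t.mean() - c.mean()
    se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
    return est, se

# Illustrative (fabricated) counts: treatment sees 5 items, control 4.
treat = [2, 3, 1, 2, 4, 3, 2, 2]
ctrl = [1, 2, 2, 1, 3, 2, 1, 2]
est, se = diff_in_means(treat, ctrl)  # est is the estimated prevalence
```

With these toy counts the estimate is 0.625, i.e., the sensitive statement would be inferred to be true for roughly 62.5 percent of respondents.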
We draw attention to a potential non-strategic error that emerges from the differential list lengths in typical ICT designs: the higher number of statements in the J + 1 treatment group relative to the J control group may produce an artificial inflation of "true" statements in the treatment group if respondents resort to satisficing, for example by selecting the perceived middle point (Krosnick 1999). Despite the potential for this error, only a few studies (Holbrook and Krosnick 2010; Ahlquist et al. 2014; Kiewiet de Jong and Nickerson 2014) have directly examined the effect of ICT design on responses. They take the following approach: a placebo statement that is exceedingly rare or impossible is added to an alternative control group. Since it should be false for all respondents, the mean of the J + 1 alternative control group with the placebo statement should be the same as that of the J control group. Any significant difference in means is the result of bias from the standard ICT design.
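The mechanics of midpoint satisficing can be made concrete with a small simulation. The sketch below assumes a hypothetical population in which 15 percent of respondents skim the list and report its midpoint; the extra item is false for everyone, so any gap between groups is purely mechanical (all numbers here are invented for illustration):

```python
import math
import random

def simulate_response(true_count, n_items, p_satisfice):
    """One respondent: with probability p_satisfice they skim the list and
    report the perceived midpoint of the 0..n_items scale; otherwise they
    report their true count."""
    if random.random() < p_satisfice:
        return math.ceil(n_items / 2)  # midpoint satisficing
    return true_count

random.seed(0)
J = 4          # number of neutral items
n = 20000
p = 0.15       # assumed share of satisficers (hypothetical)

# True counts over the J neutral items; the placebo item is false for
# everyone, so an attentive respondent reports the same count in both groups.
true_counts = [random.randint(1, 3) for _ in range(n)]

control = [simulate_response(t, J, p) for t in true_counts]      # J items
placebo = [simulate_response(t, J + 1, p) for t in true_counts]  # J + 1 items

gap = sum(placebo) / n - sum(control) / n  # mechanical inflation
```

Even though no one's true count changes, the longer list shifts the satisficers' midpoint upward, so `gap` comes out close to the satisficing share `p` rather than zero.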
Individually, the studies are inconclusive. Holbrook and Krosnick (2010) find a difference in means that suggests inflation; the effect, however, does not reach statistical significance in a two-tailed t-test. Ahlquist et al. (2014) likewise find evidence of inflation, this time at statistically significant levels.[2] Kiewiet de Jong and Nickerson (2014) explicitly look for inflation or deflation; they find "little evidence of an upward bias in estimates" (p. 662). None of the studies note strong evidence of heterogeneous effects.

This paper brings a representative sample that provides substantial statistical power to bear on the question of non-strategic bias in ICT design.[3] As with the previous studies, we use a placebo statement to identify the potential effects of differential list lengths. We find strong evidence of mechanical inflation, though only among the subgroup with relatively low levels of educational attainment. This finding is consistent with previous research showing that response quality varies with cognitive ability and education levels (see Krosnick 1991). As list experiments require greater attention to detail and concentration than conventional questions, this subgroup may have an increased propensity to resort to satisficing (Kramon and Weghorst 2012), which in turn can drive mechanical inflation.
We also conduct a meta-analysis of previous work, finding inflation to be more likely than not, and roughly the size of many reported treatment effects at around 7-8 percent. Details are in Section 4 of the supplementary materials. Moreover, we reanalyze data from Ahlquist et al. (2014) for heterogeneous effects and find, consistent with our study, evidence of inflation among the subgroup with relatively low levels of educational attainment.[4]

Our findings have important implications for list experiment best practices. They suggest that the conventional J/J + 1 design is vulnerable to bias toward positive findings, at least in contexts where some respondents have low levels of formal education or are especially prone to satisficing. To protect against this bias, we recommend inclusion of a placebo statement in the control group that equalizes the list lengths at J + 1/J + 1. The placebo statement should be false for all or nearly all respondents, and should not be so disruptive that it triggers a low-quality response to the remaining list items.[5] Ultimately, the inclusion of a placebo statement is a costless preventative measure that does not increase cognitive demands or alter the interpretation of survey experiments, but does protect against the observed mechanical bias among vulnerable subgroups.

[2] Ten percent of voters are estimated to agree to an impossible statement, with a p-value of 0.017 (7 percent and p = 0.051 with unweighted data). These figures are reported in Zigerell (2017).
[3] Assuming the observed effect size, power in our sample is at least 40 percent greater than in previous studies. All calculations were done with the sampsi command in Stata. See Section 3 in the supplementary materials for details concerning the power calculations.
[4] The previous non-findings for heterogeneous effects are unsurprising. Studies like Holbrook and Krosnick (2010) that rely on internet convenience samples in fully industrialized countries are likely to under-represent respondents with low educational attainment, making them less vulnerable to heterogeneous effects through selection bias. While Kiewiet de Jong and Nickerson (2014) use a representative sample in a developing context, their limited sample size may not be sufficiently sensitive to subgroup variation, which they concede (p. 671).

Data and survey design
The data described below come from a list experiment embedded in a survey on social attitudes and community relations in Singapore, conducted from 2 March to 18 April 2019. The survey was administered in person by a multi-ethnic team of enumerators made up of local university students, either on weekdays (6-8pm) or weekends (10am-6pm). Most respondents required between 5 and 10 minutes to complete the questionnaire, which consisted of closed questions. Buildings were randomly selected to approximate a representative sample of the resident population. The response rate was 38.6 percent, marginally above the typical rate of surveys carried out by official institutions in Singapore.[6] In total, the dataset contains 1,278 observations. Full details of the survey methodology are included in the supplementary materials.[7]

The list experiment was designed to estimate mechanical inflation. The four-item control group received four neutral statements, while the five-item placebo group received the same four neutral statements plus a (necessarily false) placebo statement.
All groups received the same instructions: "Look at the following statements below. Can you tell us how many statements are true for you? Please don't tick individual statements, just tell us the total number" [Emphasis in the original]. The four neutral statements were chosen using the generally accepted criteria for list experiments: natural fit into the context of the survey, uncorrelated (both with one another and with other broader socio-economic characteristics), and resistant to ceiling and floor effects.
The placebo statement was designed to be plausible but false for all respondents: "I have been invited to have dinner with PM Lee at Sri Temasek next week."[8] This is the equivalent of being invited to dinner with the President of the United States in the White House, or some other equally improbable event. Hence, we assume that it is false for all respondents and easily recognized as such.

Table 1 provides a summary of the overall findings. For the whole sample, the mean number of reported true statements is higher (1.89) in the placebo group than in the standard control group (1.77). The magnitude is substantial: it suggests that the inclusion of the + 1 placebo statement induces roughly 12 percent of respondents in the five-item placebo group to increase their reported number of true statements by 1 relative to their counterparts in the four-item control group. Figure 1 in the supplementary materials provides the frequency distribution for both the four-item and five-item groups. Few respondents in either group indicate 0 or all statements to be true, which suggests that the presence of a clearly false placebo statement does not induce respondents to report extreme counts.[9]

[5] Examples include "I moved to my current home less than one week ago"; "I spent last New Year's Eve at the top of the Eiffel Tower"; or "I had dinner with the President of my country last week". Though highly unlikely, these are all plausible. We caution against false but potentially disruptive statements like "I have the ability to teleport myself to different countries" because they may reduce the seriousness with which respondents approach the remaining items.
[6] For instance, 24.6 percent in the Institute of Policy Studies "Post-Election Survey 2015" (Institute of Policy Studies 2015).
[7] The original survey questionnaire is available upon request.
[8] Lee Hsien Loong is the Prime Minister of Singapore. Sri Temasek is the Prime Minister's official residence.
[9] In the four-item control group, 8.67 percent of respondents indicated zero or four statements to be true. In the five-item placebo group, 11.55 percent indicated zero, four, or five to be true. See Figure 1 in the supplementary materials for the full distribution of responses.

Table 1 also reports the mean number of true statements by subgroup along the dimensions of political knowledge, educational attainment, household income, and age.[10] We opt for simple categories to facilitate comparisons: respondents are coded as having high political knowledge when they are able to correctly name their electoral district; household income is split at 3,500 Singapore dollars per month (which represents roughly the bottom third); and age is split at 60 years.

Results
The findings suggest that the treatment effect of the placebo statement is highly heterogeneous: for the politically knowledgeable, relatively educated, and middle and upper income, the difference in means between the four-item control and five-item placebo groups is insignificant, meaning that the inclusion of the placebo statement does not inflate the reported number of true statements. By contrast, the difference in means is statistically significant and substantively meaningful among the counterpart subgroups. This provides a strong initial indication of which respondent types are most vulnerable to mechanically inflating their true statement count in conventional list experiments.
In order to check the robustness of these findings in a different context, we examine data from Ahlquist et al. (2014), which are available online at the Harvard Dataverse.[11] The study likewise uses a standard four-item control group and a five-item placebo group, in which the extra placebo statement is necessarily false for all respondents. The 3,000 responses were collected via an online survey in the United States. The results of the replication study are broadly in line with our general conclusions. The mean item count in the five-item placebo group is 0.07 points higher than in the four-item control group; the difference reaches conventional levels of statistical significance. Furthermore, the elderly and those with lower levels of formal education are more likely to mechanically increase their reported number of true statements in response to the placebo statement, supporting our finding of heterogeneous treatment effects. The effect of income, however, is inconclusive. Details of the replication study and further discussion can be found in Section 3.3 of the supplementary materials.
[Notes to Table 1: Reported p-values are from a one-sided difference-in-means t-test between the four-item control and five-item placebo groups. Political knowledge: "1" if respondents know the electoral district in which they reside, "0" otherwise.]

We return to our dataset to examine the heterogeneous treatment effects more precisely. Since formal education, age, and income may themselves be correlated, we estimate an OLS regression model using the following specification, originally from Holbrook and Krosnick (2010) and later adopted by Imai (2011) and Blair and Imai (2012):

LIST_i = α + β′X_i + δ PLACEBO_i + γ′(X_i × PLACEBO_i) + ε_i,
where LIST_i is the number reported in the list experiment, X_i is a vector of sociodemographic variables, and PLACEBO_i is a dummy that takes the value "1" if the respondent was part of the five-item placebo group and "0" if part of the four-item control group. γ is the vector of our coefficients of interest: we expect it to be significant for the variables specified in Table 1. Table 2 reports the results. Panel A captures the interaction between individual characteristics and the placebo statement, which can be read as the propensity to inflate the number of "true" statements in the five-item placebo list. Panel B captures the baseline relationship, i.e., the correlation between individual characteristics and the number of "true" statements in the four-item control list. For example, column 1 in Panel B indicates that an elderly respondent from the four-item group reports on average 0.071 fewer items than a younger counterpart from the four-item group, though the difference does not reach conventional levels of statistical significance. Column 1 in Panel A indicates that an elderly respondent from the five-item group reports on average 0.181 more items than a younger counterpart from the five-item group. Note that the baseline for Panel A (five-item placebo group) is 1.89 items, i.e., 0.12 higher than the Panel B (four-item control group) baseline of 1.77.
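The interaction specification can be sketched in code. The block below simulates a single covariate (years of education) with an education-dependent placebo effect and recovers the coefficients by least squares; all coefficient values and variable names here are hypothetical, not the paper's estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000

# Hypothetical data: years of education and random placebo-group assignment.
educ = rng.integers(6, 21, size=n).astype(float)
placebo = rng.integers(0, 2, size=n).astype(float)

# Simulated outcome: the baseline count rises slightly with education, and
# the placebo list inflates counts only for the less educated (interaction).
y = 1.5 + 0.02 * educ + placebo * (0.45 - 0.025 * educ) + rng.normal(0, 0.7, n)

# Design matrix for LIST_i = a + b*X_i + d*PLACEBO_i + g*(X_i x PLACEBO_i) + e_i
X = np.column_stack([np.ones(n), educ, placebo, educ * placebo])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b, d, g = coef  # g is the interaction coefficient of interest
```

A negative estimated `g` with a positive `d` reproduces the paper's qualitative pattern: the placebo list inflates reported counts at low education levels, with the inflation fading as education rises.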
Specifications (1)-(4) confirm the unconditional results of Table 1 using fixed effects and clustered standard errors: age, education levels, political sophistication, and income are associated with mechanical inflation, although only education reaches conventional levels of statistical significance. Results also suggest that, on average, respondents with only a primary school education report 0.32 more true items when presented with the placebo statement, whereas the predicted difference between responses to the four-item and five-item lists is a negligible 0.02 points for those with a college degree.[12] Specifications (5)-(9) further add sociodemographic controls (gender, ethnicity, and apartment size): the earlier findings are robust to their inclusion. Specification (9), which includes all controls and variables of interest, reveals that educational attainment is the strongest predictor of inflating the number of true statements in response to the inclusion of the placebo statement.
Other variables (especially income and political knowledge) likely lose their significance due to power and multicollinearity issues. Finally, note that the R²s are generally quite low: this is evidence that, as intended by design, agreement with the statements in our list experiment is randomly distributed across the population and hard to correlate with observables.[13] To illustrate the effect of education and income on the propensity to inflate item counts, we predict the number of "true" statements using specification (9) from Table 2 and present this through a smooth polynomial fit. Figure 1 shows the results, with the left panels showing responses from the five-item placebo group and the right panels responses from the four-item control group.
We see that, with the full set of controls, a respondent with primary school education or below (i.e., six years of schooling) is likely to report around 0.3 more true items on average when presented with the five-item list than when presented with the four-item list, a difference that disappears for those with a bachelor's degree or higher (i.e., 18+ years of schooling) (panels a and b). A similar effect can be seen for respondents in the lowest income groups, who report on average 0.2 more true items when presented with the five-item list than with the four-item list, an effect that vanishes for higher income groups (panels c and d). This suggests mechanical inflation in the lower socioeconomic strata.

[12] Using results from specification (2): average response to the five-item list for those with six years of education: 1.474 + 0.463 + 0.017 × 6 − 0.024 × 6 = 1.89. Average response to the four-item list for those with six years of education: 1.474 + 0.017 × 6 = 1.57. Difference: 1.89 − 1.57 = 0.32. Similar computations apply for those with a college degree (20 years of education).
[13] To further check the robustness of our results, we also use the non-linear specification suggested by Imai (2011) and Blair and Imai (2012). Results can be found in the supplementary materials (Table 4 in Section 3.2).
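The back-of-the-envelope calculation in footnote 12 can be reproduced directly from the reported coefficients of specification (2) (the helper function below is our own notation):

```python
# Coefficients from specification (2) as reported in footnote 12:
# intercept, placebo dummy, education slope, education x placebo interaction.
a, d, b, g = 1.474, 0.463, 0.017, -0.024

def predicted(years, placebo):
    """Predicted item count for a respondent with `years` of education,
    in the placebo (1) or control (0) group."""
    return a + d * placebo + (b + g * placebo) * years

gap_primary = predicted(6, 1) - predicted(6, 0)    # six years of schooling
gap_college = predicted(20, 1) - predicted(20, 0)  # college degree (20 years)
```

The primary-education gap works out to roughly 0.32 items, while the college-degree gap is about 0.02 in absolute size, matching the figures quoted in the text.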

Conclusion
This paper uses original data to provide evidence of mechanical inflation in conventional list experiments. It finds evidence of heterogeneous effects, with inflation most pronounced among respondents with low educational attainment, who may be most inclined toward satisficing. We find additional evidence for this conclusion in a replication exercise using data from Ahlquist et al. (2014). Moreover, we conduct a meta-analysis using results from Ahlquist et al. (2014), Holbrook and Krosnick (2010), and Kiewiet de Jong and Nickerson (2014). This shows inflation to be more likely than not and roughly the size of many reported treatment effects; that is, around 0.074 points when pooling all studies together and weighting by the number of observations.[14]

The findings have clear implications. Studies that rely on list experiments in contexts where low educational attainment is widespread may have artificially inflated treatment effects that lead to invalid conclusions. By contrast, studies using convenience sampling that over-represents young and educated respondents are comparatively less vulnerable, though they may likewise be problematic if respondents resort to satisficing, for example, when incentives for providing accurate responses are inadequate or when the questionnaire is particularly long or cognitively demanding.

[14] All details are in Section 4 of the supplementary materials. Note that we have focused only on a placebo that is false for all respondents. Kiewiet de Jong and Nickerson (2014) also test a placebo that is true for nearly all respondents; it increases the number of true statements by significantly less than 1, which further supports the notion that satisficing is responsible for the bias from unequal list lengths in treatment and control groups.
We suggest a simple preventative solution. Inclusion of a placebo statement in the control group equalizes the control and treatment list lengths, thereby preventing artificial inflation of the treatment group when respondents resort to satisficing. The placebo statement should: (i) be false for all or nearly all respondents and be easily recognized as such; (ii) be orthogonal to other items in the list to avoid interactions that may themselves introduce bias; and (iii) not be so outlandish or disruptive that respondent seriousness declines, which may increase the risk of extreme responses like "all" or "none". When samples are sufficiently large, requirement (ii) can be confirmed by randomly alternating between different placebo statements and ensuring there is no difference in means.
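The randomization check for requirement (ii) reduces to a difference-in-means test between placebo variants. A minimal sketch, using hypothetical item counts from two alternating variants and a hand-rolled Welch t-statistic:

```python
import numpy as np

def welch_t(a, b):
    """Welch t-statistic for the difference in mean item counts between
    two randomly assigned placebo-statement variants."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

# Hypothetical counts from two alternating placebo variants; in practice
# each vector would hold one variant's responses from the survey.
variant_a = [1, 2, 2, 3, 1, 2, 2, 1, 3, 2]
variant_b = [2, 1, 2, 2, 3, 1, 2, 2, 1, 3]
t = welch_t(variant_a, variant_b)
```

A |t| well below conventional critical values (here the toy means are identical, so t is zero) is consistent with the placebo variants being interchangeable, i.e., with no variant-specific interaction with the other list items.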
Placebo statements are essentially costless, as they do not alter the mechanics, cognitive demands, or interpretation of list experiments. Given their potential benefits, we see no reason to forgo their use in any setting, but they are especially valuable in contexts where educational attainment is low, or with instruments that are unusually vulnerable to satisficing.