From reading numbers to seeing ratios: a benefit of icons for risk comprehension

Promoting a better understanding of statistical data is becoming increasingly important for improving risk comprehension and decision-making. In this regard, previous studies on Bayesian problem solving have shown that iconic representations help infer frequencies in sets and subsets. Nevertheless, the mechanisms by which icons enhance performance remain unclear. Here, we tested the hypothesis that the benefit offered by icon arrays lies in a better alignment between presented and requested relationships, which should facilitate the comprehension of the requested ratio beyond the represented quantities. To this end, we analyzed individual risk estimates based on data presented either in standard verbal presentations (percentages and natural frequency formats) or as icon arrays. Compared to the other formats, icons led to estimates that were more accurate, and importantly, promoted the use of equivalent expressions for the requested probability. Furthermore, whereas the accuracy of the estimates based on verbal formats depended on their alignment with the text, all the estimates based on icons were equally accurate. Therefore, these results support the proposal that icons enhance the comprehension of the ratio and its mapping onto the requested probability and point to relational misalignment as potential interference for text-based Bayesian reasoning. The present findings also argue against an intrinsic difficulty with understanding single-event probabilities.


Introduction
Everyday decision-making, in areas ranging from healthcare to finance, often requires the integration of different pieces of statistical information to infer the probability of relevant outcomes. For example, the decision to participate in breast cancer screening (i.e., mammograms) depends, among other factors, on the perceived predictive value of the test (e.g., Navarrete et al., 2015). This commonly requires considering different data: the breast cancer prevalence (the base rate), and the conditional probabilities of a positive mammogram in the presence (hit rate) and in the absence (false alarm rate) of breast cancer. The cognitive demands involved in this inference (comprehension of the data and the corresponding arithmetic calculations), as well as potentially helpful strategies, have been the subject of widespread research in the field of Bayesian reasoning (e.g., see recent reviews in Mandel & Navarrete, 2015).
In a typical Bayesian problem, solvers are presented with the above information (base rate, hit rate and false alarm rate) and required to calculate the Bayesian probability (e.g., the posterior probability of having breast cancer in the case of receiving a positive mammogram; see examples in the "Appendix"). It is well known that this is no trivial task; the percentage of participants producing accurate estimates is often lower than 40%, even for problems in which the data are presented in natural frequency format (frequencies that preserve the reference class as "3 of 4" instead of normalized ratios such as "75%"; e.g., Barbey & Sloman, 2007; Brase, 2009; Chapman & Liu, 2009; Evans et al., 2000; Gigerenzer, 1991; Gigerenzer & Hoffrage, 1995; Hoffrage et al., 2015; Pighin et al., 2016; Sirota et al., 2014a, b; Sloman et al., 2003; see also the meta-analysis by McDowell & Jacobs, 2017).
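As a worked illustration of the inference just described, the following minimal Python sketch computes the posterior probability for the mammogram problem in two ways: from normalized rates via Bayes' rule, and directly from natural frequencies. The figures (a sample of 100 women, 4 with breast cancer, 3 of them with a positive mammogram, and 12 positives among the 96 women without breast cancer) are those of the mammogram problem discussed in this article; the function names are hypothetical and serve only the illustration.

```python
from fractions import Fraction


def posterior_from_rates(base_rate, hit_rate, false_alarm_rate):
    """Bayes' rule: P(H | D) from P(H), P(D | H) and P(D | not H)."""
    true_positives = base_rate * hit_rate
    false_positives = (1 - base_rate) * false_alarm_rate
    return true_positives / (true_positives + false_positives)


def posterior_from_frequencies(hits, false_alarms):
    """Same inference with natural frequencies: hits / all positives."""
    return Fraction(hits, hits + false_alarms)


# Mammogram problem: base rate 4/100, hit rate 3/4, false alarm rate 12/96.
from_rates = posterior_from_rates(Fraction(4, 100), Fraction(3, 4), Fraction(12, 96))
from_freqs = posterior_from_frequencies(3, 12)

print(from_rates, from_freqs, float(from_freqs))
```

Both routes give the same value (1/5, i.e., 0.2); exact fractions are used here only to avoid floating-point rounding in the comparison.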
As suggested recently by Johnson & Tubau (2017), Bayesian word problems reporting natural frequencies can be as difficult as the ones reporting percentages because they do not eliminate the relational misalignment between presented and requested relationships [...] receive a positive mammogram" and "of 96 women without breast cancer, 12 receive a positive mammogram" specify the size of the corresponding subset. As observed in similarity judgments and analogical reasoning, "objects are placed in correspondence based on their roles within the matching relational structure" (Markman & Gentner, 1993, p. 459). Accordingly, number-relational role associations induced from standard presentations make it possible to use such numbers in a similar role ("of 4 + 96 women, 3 + 12 receive a positive mammogram"), but hinder their use in a different role, as required in the Bayesian inference "of 3 + 12 women with positive mammogram, three have breast cancer," where the bold part indicates the set (posterior reference class in Fig. 1). In this regard, inaccurate Bayesian reasoning might be caused not only by a limited understanding of the nested-set structure of the data (Barbey & Sloman, 2007), or of how to translate frequencies into probabilities (Cosmides & Tooby, 1996; Gigerenzer, 1991), but also by difficulties involved in mapping the presented relationships onto the requested one (preliminary evidence supporting this proposal, using natural frequencies, is reported in Johnson & Tubau, 2017). Interestingly, and also consistent with this proposal, problems that present the sample statistics as frequency grids or icon arrays make it easier to solve the Bayesian question in frequency format, compared to verbal presentations alone (e.g., Brase, 2009, 2014; Galesic, Garcia-Retamero & Gigerenzer, 2009; Garcia-Retamero & Hoffrage, 2013; Garcia-Retamero et al., 2015; Sedlmeier & Gigerenzer, 2001; but see Brase & Hill, 2017; Cosmides & Tooby, 1996; and Sirota et al., 2014b, for mixed findings).
Critically, by explicitly presenting the requested set and subset in overlapping areas, icons reduce the relational reasoning demand of the Bayesian inference (i.e., the posterior ratio can be seen at a glance; see Fig. 2). The benefit of icons seems to be stronger when they are presented without the redundant text (e.g., Khan et al., 2015; Ottley et al., 2016), a fact that suggests that standard verbal presentations promote the formation of misleading associations (Barbey & Sloman, 2007; Johnson & Tubau, 2015). Indeed, the null benefit of icons (see references above) was observed in icons + text presentations. Hence, a better relational alignment, together with reduced interference from misleading verbal associations, might explain the benefit of icons for Bayesian problem solving.
Previous studies have demonstrated the benefit of icons for inferring either frequencies in subsets or individual chances, but in both cases through questions that refer to the specific quantities represented in the array (e.g., "Imagine Michael is tested now. Out of a total of 100 chances, Michael has _____ chance(s) of a positive reaction from the test, _____ of which will be associated with actually having the infection"; Brase, 2009). However, stronger evidence of the usefulness of icons for understanding individual risks would be provided by requesting estimates of individual probabilities, without prompting a specific reference class. Using more ambiguous questions, it would be possible to study the extent to which icons enhance comprehension of the ratio, beyond the represented quantities. If this were the case, icons might induce more accurate estimates and the use of equivalent expressions; that is, the use of different numbers to express the same probability (e.g., the posterior probability in the mammogram problem presented in the "Appendix" can be expressed as "3 of 15", "1 of 5", "20 of 100" or "2 of 10"). The present research aimed to test this hypothesis by analyzing the form and the accuracy of participants' probability estimates, based on data presented either in iconic or verbal formats. Given that natural frequency problems commonly prompt solvers to infer frequencies in a specific set and subset (frequency question), the extent to which natural frequencies, compared to percentages, facilitate inferring single-event probabilities is uncertain. A second goal of this research was therefore to shed light on this issue using a probability question. Finally, we also aimed to test whether the differences between formats might depend on the alignment of the request (aligned or misaligned with the presented relationships).
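To make the notion of equivalent expressions concrete, the short Python sketch below checks that the four example responses mentioned above denote the same probability. The helper `parse_x_of_y` is a hypothetical name introduced for this illustration and assumes responses typed exactly in the "X of Y" form.

```python
from fractions import Fraction


def parse_x_of_y(response):
    """Turn a response such as '3 of 15' into an exact fraction."""
    numerator, denominator = response.split(" of ")
    return Fraction(int(numerator), int(denominator))


expressions = ["3 of 15", "1 of 5", "20 of 100", "2 of 10"]
values = {expr: parse_x_of_y(expr) for expr in expressions}

# All expressions are equivalent if they reduce to a single fraction.
all_equivalent = len(set(values.values())) == 1
print(values, all_equivalent)
```

Only "3 of 15" uses the quantities represented in the problem; the other three require seeing the ratio itself, which is the distinction the analysis of response forms relies on.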

Fig. 1
Schematic representation of the numerical relationships presented in the "Appendix" (mammogram problem, NF format). Note that the difficulty of calculating the posterior reference class does not stem from the addition of the two focal subsets, but from the role change from subset to reference class (numbers relevant for the posterior ratio are shown in bold; further details can be read in Experiment 2)

Experiment 1
To test the benefit of icons for ratio comprehension, problems that presented the sample statistics in one of three formats (icon arrays: IA; natural frequencies: NF; and percentages: PE) and that requested a single-event probability were presented to three groups (see the "Appendix"). In contrast to PE problems, which unambiguously prompt the use of a percentage, we expected to observe different interpretations of the requested subset and reference class in the responses to IA and NF problems. Due to the increased computational demands, we also expected less accurate normalized responses (e.g., percentages) than non-normalized ones (e.g., ratios of represented frequencies). Nevertheless, if icons enhance comprehension of the requested ratio, compared to the verbal formats, they should facilitate its expression in any equivalent form.

Participants
One hundred and forty (36 men and 104 women; mean age = 22.87 years, SD = 5.57) psychology undergraduates from the University of Barcelona took part in this experiment before being introduced to Bayes' rule. All of them provided written consent and the research was approved by the University of Barcelona's Bioethics Commission.
Participants were randomly assigned to three groups according to the format in which the data were presented (icon arrays N = 49; natural frequencies N = 49; percentages N = 42). Given that each participant solved two problems (see below), we analyzed more than 80 responses for each condition.

Materials and procedure
All participants had to evaluate the two health scenarios shown in the "Appendix", with the data presented in one of the three formats: IA, NF or PE. The single-event probability question was identical across the three groups. Participants were tested collectively, but each had their own computer and solved the task individually by typing the requested responses as "X of Y", or as "%" in the PE format (see question in the "Appendix"). There were no time limits for responses, but all the participants finished the whole exercise in less than 20 min.

Results and discussion
Responses were coded as correct when the division of the proposed numbers matched the mathematical probability, i.e., 0.2 in the mammogram problem and 0.33 in the hypertension problem. For the latter, responses rounded to 0.3 were also considered correct (IA five responses; NF one response; PE four responses). Given the match between this rounded response and the false alarm rate, rounded responses expressed as "24 of 80" were not counted as correct (one response in NF format). 1 Accuracy levels for the two scenarios were similar in each format (ps > 0.11), so the analyses were performed by taking the total of correct responses (0, 1 or 2) for each participant into account.
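The coding rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the scoring procedure actually used: the function name `is_correct`, the tolerance of 0.005 used to decide that a ratio "matches" the target, and the way the accepted rounded value and the excluded response are passed in are all hypothetical choices.

```python
from fractions import Fraction

# Assumed tolerance for "matching" the target (e.g., 33/100 vs 1/3).
TOLERANCE = Fraction(1, 200)


def is_correct(numerator, denominator, target, rounded=None, excluded=()):
    """Code an 'X of Y' response as correct or incorrect."""
    if denominator == 0 or (numerator, denominator) in excluded:
        return False
    value = Fraction(numerator, denominator)
    if rounded is not None and value == rounded:
        return True  # accepted rounded response (e.g., 0.3 for 0.33)
    return abs(value - target) <= TOLERANCE


MAMMOGRAM = dict(target=Fraction(1, 5))
# "24 of 80" equals the false alarm rate, so it is not counted as correct.
HYPERTENSION = dict(target=Fraction(1, 3), rounded=Fraction(3, 10),
                    excluded={(24, 80)})

print(is_correct(3, 15, **MAMMOGRAM), is_correct(24, 80, **HYPERTENSION))
```

Checking the excluded response before the rounded value keeps "24 of 80" from being accepted merely because it equals 0.3.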
An analysis of the errors showed differences between the scenarios (see Fig. 4). For the mammogram scenario, the most common errors were caused by confusion with the hit rate (12 and 36% for NF and PE formats, respectively), the total positive rate (22 and 10% for NF and PE formats, respectively) and the base rate (10% for either NF or PE format). For the hypertension scenario, the most common sources of confusion were the base rate (30 and 29% for NF and PE formats, respectively) and the hit rate (13 and 20%, for NF and PE formats, respectively). The few errors [...]
In sum, the probability estimates based on icons were more accurate and expressed in more diverse equivalent forms than those based on verbal formats. These findings support the hypothesis that icons facilitate the comprehension of the ratio beyond the represented quantities. Furthermore, the wide distribution of errors in both verbal formats confirmed the suggestion that verbal presentations induce superficial reasoning and misleading associations (Barbey & Sloman, 2007; Johnson & Tubau, 2017). Results also suggest that natural frequencies, like percentages, are unhelpful for inferring single-event probabilities. Nevertheless, as previously shown in the context of frequency estimates (Johnson & Tubau, 2017), this limitation might be related to the misalignment of the Bayesian inference. Experiment 2 aimed to test this hypothesis.

Experiment 2
Based on the alignment hypothesis (Johnson & Tubau, 2017), we hypothesized that icons would facilitate the comprehension of the ratio through a more direct mapping between the data and the request. Nevertheless, alternative explanations might also hold. The advantage of icons has been attributed to their role in enhancing a frequentist interpretation of the requested chances (Brase, 2009, 2014; Cosmides & Tooby, 1996), or a clearer representation of the nested-set structure of the data (Barbey & Sloman, 2007; Reyna, 2004). Therefore, besides theoretical discrepancies between these accounts (see for example the comments on Barbey & Sloman, 2007), both would predict differences between iconic and verbal formats for estimates requiring identical computation. In contrast, the relational alignment hypothesis would predict differences between formats mainly in case of misalignment between presented and requested relationships.
To test these hypotheses, new groups of participants saw/read the previous IA or NF data, but were asked for two single-event probability estimates requiring the same arithmetical steps but differing in their alignment with the text (see "Materials and procedure"). Based on the frequentist or nested-set accounts, we expected a significant effect of format for both estimates. Nevertheless, from the relational alignment hypothesis, we expected differences between formats mainly for the misaligned estimate.

Participants
One hundred and sixteen students (16 men and 100 women; mean age = 21.28 years, SD = 2.68) from the same population as in Experiment 1 took part in this experiment. All of them also provided written consent. Participants were randomly assigned to two groups according to the format in which the data were presented (icon arrays N = 57; natural frequencies N = 59). None of them had participated in similar experiments before.

Materials and procedure
As in the previous experiment, all the participants had to evaluate the two health scenarios shown in the "Appendix", with the data presented in one of two formats: IA or NF. The problems ended with two single-event probability requests: the aligned one required estimating the probability of the datum (e.g., what is the probability of a woman at that age to get a positive mammogram?), whereas the misaligned one required estimating the posterior probability of suffering the disease, knowing the datum (see the "Appendix"). Note that both responses require adding the same subsets but for a different role: as a new subset "(3 + 12) of 100" in the aligned response, or as a new reference class "3 of (3 + 12)" in the misaligned one. Furthermore, as shown in the "Appendix", we changed the numbers in the hypertension scenario to avoid the coincidence between the false positive rate and the rounded Bayesian estimate.
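The role change between the two requests can be illustrated with a minimal Python sketch based on the mammogram numbers given above (3 hits and 12 false alarms in a sample of 100). The function names are hypothetical; the point is only that the same sum, 3 + 12, serves as a numerator in one estimate and as a denominator in the other.

```python
from fractions import Fraction


def aligned_estimate(hits, false_alarms, total):
    """Probability of the datum: the summed subsets act as a new subset."""
    return Fraction(hits + false_alarms, total)


def misaligned_estimate(hits, false_alarms):
    """Posterior probability: the summed subsets act as the reference class."""
    return Fraction(hits, hits + false_alarms)


aligned = aligned_estimate(3, 12, 100)   # "(3 + 12) of 100"
misaligned = misaligned_estimate(3, 12)  # "3 of (3 + 12)"

print(float(aligned), float(misaligned))
```

Both estimates involve one addition and one ratio, so any difference in accuracy between them is unlikely to reflect arithmetic load alone.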

Results and discussion
Responses were coded as correct when the division of the proposed numbers matched the mathematical probability (0.15 and 0.24 for the aligned estimate; 0.2 and 0.33 for the misaligned one for mammogram and hypertension scenarios, respectively). For the hypertension scenario, the posterior probability estimate rounded to 0.3 was also considered correct (IA four responses; NF three responses). Accuracy levels for the two scenarios were similar in each format (ps > 0.48), so the analyses were performed with the total of correct responses (0, 1 or 2) for each participant and question (aligned and misaligned). Figure 5 shows the percentage of participants in each category.
For the aligned requests, the mean number of correct estimates was similar in both groups: 1.44 and 1.38 for the IA and the NF groups, respectively (p = .25). For the misaligned ones, results replicated previous findings: the mean number of correct estimates was higher for the IA group than for the NF group [1.36 vs 0.44; χ²(2) = 32.64, p < .001, V = 0.53; see Fig. 5], and the difference between groups was significant for either normalized [Mammogram scenario χ²(1) = 21.32, p < .001, ϕ = 0.50; Hypertension scenario χ²(1) = 19.10, p < .001, ϕ = 0.48], or non-normalized responses [Mammogram scenario χ²(1) = 4.08, p = .04, ϕ = 0.36; Hypertension scenario χ²(1) = 5.26, p = .02, ϕ = 0.41]. Correct simplifications of the ratio were also more common among responses to IA problems (ten responses in total) 3 than to NF problems (two responses; see Table 2). Hence, these findings supported the hypothesis that natural frequencies verbally presented would be particularly misleading for misaligned requests, being as useful as icons for inferring single-event probabilities from aligned relationships. In contrast, by avoiding the directionality of the text, icons were equally helpful for any probability estimate.

Fig. 5
Percentage of participants who correctly solved none, one, or both problems in each group (IA icon arrays; NF natural frequencies) and for each question in Experiment 2 (alignment refers to the relational match between the question and the text of NF problems; aligned question: probability of the datum; misaligned question: posterior probability)

Table 2
Overall percentage of correct responses to the misaligned question, and the corresponding frequencies for each scenario, among normalized (including 100 or 10 as denominator) and non-normalized ratios in Experiment 2 (frequencies of correct simplified ratios, as "1/3" instead of "8/24", are shown within parentheses)
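As a consistency check on the effect size reported for the misaligned estimates, the following Python sketch recomputes Cramér's V from the reported χ² value using the standard formula V = sqrt(χ² / (N × (min(rows, columns) − 1))). It assumes that the test compared the 2 groups (IA, NF) across the 3 outcome categories (0, 1 or 2 correct responses) and included all 57 + 59 = 116 participants; these are inferences from the reported degrees of freedom and group sizes rather than stated details of the analysis.

```python
import math


def cramers_v(chi_square, n, n_rows, n_cols):
    """Cramér's V = sqrt(chi2 / (N * (min(rows, cols) - 1)))."""
    return math.sqrt(chi_square / (n * (min(n_rows, n_cols) - 1)))


# Assumed design: 2 groups x 3 outcome categories, N = 57 + 59 participants.
v = cramers_v(32.64, 57 + 59, n_rows=2, n_cols=3)
print(round(v, 2))
```

Under these assumptions the recomputed value agrees with the reported V = 0.53.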

General discussion
The present research aimed to test the hypothesis that icon arrays facilitate Bayesian reasoning by enhancing comprehension of the ratio, beyond the represented quantities. This proposal was supported by the results of both experiments, which showed that icon arrays not only promoted selection of the correct numerator and denominator, but also induced further numerical processing, as demonstrated by the use of correct equivalent expressions for the requested probability.
Results of Experiment 2 also showed that icons promoted equally accurate estimates for any request (probability of the datum or posterior probability). In contrast, for natural frequencies, whereas estimates aligned with the text (probability of the datum) were as accurate as the ones based on icons, the misaligned estimates (posterior probability) were mostly inaccurate. Therefore, a critical difficulty for Bayesian reasoning based on verbal formats (including either percentages or natural frequencies) seems to be the misalignment between presented and requested relationships (Johnson & Tubau, 2017). This relational misalignment might explain the dependency of the Bayesian inference on capacities and skills beyond numeracy (e.g., Chapman & Liu, 2009), such as working memory or reflective thinking (e.g., Lesage et al., 2013; Sirota et al., 2014a). Importantly, the single-event posterior probability estimates based on icons were as accurate as frequency estimates, as shown in a pilot experiment. 4 This finding argues against an intrinsic difficulty with understanding single-event probabilities (e.g., Cosmides & Tooby, 1996; see similar claims in Girotto & Gonzalez, 2001; Johnson-Laird et al., 1999; Pighin et al., 2017). Specifically, the tendency to simplify the description of the sample statistics (e.g., from "3 of 15" to "1 of 5" or "20%"), observed in half (Experiment 1) or in most (Experiment 2) of the correct responses to IA problems, might point towards the conceptualization of probability as an individual propensity (Gillies, 2000), or as a subjective degree of confidence induced from the sample statistics (Cosmides & Tooby, 1996, p. 66, note 19). 5 Accordingly, icons might promote a gist comprehension of the risk; that is, a comprehension of the numerical relation beyond the specific numbers (e.g., Reyna, 2004). Equivalent expressions were less frequent among the responses to NF problems, which would suggest that verbal presentations promote a more superficial processing of the data.
Of note was the finding that, among correct estimates of the posterior ratio in Experiment 2, normalizations were more frequent than in Experiment 1 in both formats (NF 2 vs 10; IA 17 vs 42, for Experiments 1 vs 2). This might stem from the influence of the first request, which prompted the use of the total sample of 100 as reference class. Nevertheless, the posterior probability (misaligned) estimates based on natural frequencies were still mostly inaccurate, even for participants who produced accurate probability (aligned) estimates of the datum. 6 It is also worth noting that, in line with previous observations (e.g., Evans et al., 2000; Hafenbrädl & Hoffrage, 2015; Johnson & Tubau, 2017; Pennycook & Thompson, 2012), the analysis of incorrect responses to either NF or PE problems of Experiment 1 showed highly variable estimates of the posterior probability (between 0.03 and 0.95; see Fig. 4). This is also coherent with superficial processing of relevant numerals, without the necessary integration, which might be caused by the relational misalignment between presented and requested set-subset relationships (see also Holyoak & Koh, 1987, for similar arguments in other problem-solving tasks). Differences in the frequency of specific errors also confirm the influence of superficial traits such as the numerical format (the hit rate was more often selected when presented as a percentage; see also Hafenbrädl & Hoffrage, 2015), the relative magnitude of presented numbers (the base rate 0.2 in the hypertension scenario was selected more often than the base rate 0.04 in the mammogram scenario), or the ease of performing certain arithmetic calculations (e.g., the total positive 3 + 12 in the mammogram scenario was selected more often than the total high sodium diet 12 + 24 in the hypertension scenario; see Fig. 4). The responses to IA problems were affected by these superficial traits to a much lesser degree, with only a few attributed to the abovementioned errors. Accordingly, more integrated pictures such as icon arrays might be useful tools for overcoming misleading "associative tendencies" (Barbey & Sloman, 2007).
In conclusion, an important step towards facilitating probabilistic reasoning consists of enhancing the comprehension of statistical data and the corresponding mapping onto the required estimate. In that regard, the present findings confirm the advantage of icon arrays for fostering a visual grasp of quantitative relationships. Importantly, as demonstrated by the use of equivalent expressions for the requested probability, the present findings suggest that icons enhance the comprehension of the ratio beyond the represented frequencies. Therefore, a much better understanding of individual risks can be achieved by promoting the apprehension of ratios rather than numbers. Whether icons would enhance single-event probabilistic reasoning in other, non-university populations, and to what extent they would benefit actual decision making, are some of the remaining questions that require further research.

4 In the pilot experiment, different groups received the same IA and NF problems of Experiment 1 (see "Appendix"), but were asked for frequencies (e.g., "of the women who test positive, how many have breast cancer?"), in the IA (N = 20) and NF (N = 22) formats. For the IA group, the mean number of correct responses was similar to that in the present experiments (1.42). For the NF group, it was higher than in the present experiments (0.82), but still lower than for the IA group (p = .02).
5 Although a default frequency-based representation is defended by these authors, it is also claimed that a frequentist mechanism might produce subjective confidence for single-event probabilities: "even though it might initially output a frequency, and perhaps even store the information as such, other mechanisms may make that frequentist output consciously accessible in the form of a subjective degree of confidence" (Cosmides & Tooby, 1996, p. 66, note 19).

Funding
ET was funded by Secretaría de Estado de Investigación, Desarrollo e Innovación of Spanish Ministerio de Economía y Competitividad (PSI2013-41568-P). AC and ET were also funded by the Catalan Government (2014-SGR-79).