YOUNG LEARNERS’ PROCESSING OF MULTIMODAL INPUT AND ITS IMPACT ON READING COMPREHENSION

Abstract
Theories of multimedia learning suggest that learners can form better referential connections when verbal and visual materials are presented simultaneously. Furthermore, the addition of auditory input in reading-while-listening conditions benefits performance on a variety of linguistic tasks. However, little research has been conducted on the processing of multimedia input (written text and images) with and without accompanying audio. Eye movements were recorded during young L2 learners’ (N = 30) processing of a multimedia story text in reading-only and reading-while-listening conditions to investigate looking patterns and their relationship with comprehension using a multiple-choice comprehension test. Analysis of the eye-movement data showed that the presence of audio in reading-while-listening conditions allowed learners to look at the image more often. Processing time on text was related to lower levels of comprehension, whereas processing time on images was positively related to comprehension.


INTRODUCTION
Reading materials for young learners of English as a foreign language (EFL) often come with pictures that illustrate and support the content of the text and that make the reading passages more engaging. Research has shown that pictures play a major role in the development of listening and reading skills, as they contribute to the creation of contexts that affect the meanings derived from words (Wright, 2010). The nonverbal information in pictures allows learners to predict what the text is about, making the construction of meaning easier, and helps them keep the overall context in mind as well as information about the characters in the text and the situations in which they find themselves (Wright, 2010).
The benefit of the simultaneous presentation of verbal and nonverbal information is supported by the multimedia learning hypothesis, which suggests that people learn more deeply from input that includes both words (written or spoken) and pictures (including static graphs, pictures, and dynamic videos or animations) than from words alone (Mayer, 2001, 2009, 2014). The verbal input in multimedia materials can be presented using different sensory modalities, that is, visual and auditory. Materials that involve multiple sensory modalities (i.e., visual, auditory, and kinaesthetic) have also been referred to as multimodal learning environments (e.g., Massaro, 2012). Thus, multimodal learning involves learning from a combination of sensory modes, while multimedia learning refers to learning from text (written or spoken) and pictures. In the context of content learning and knowledge construction in a first language (L1), theories of multimedia learning and empirical investigations supporting those theories have suggested that presenting a text through auditory and written modes, when pictures are also present, results in redundant information that might be detrimental for learning and comprehension (e.g., Kalyuga & Sweller, 2014). However, ample evidence for the positive effect of this redundancy has been provided in the second language (L2) context. Many studies have shown the advantage of reading-while-listening (RWL) conditions for the acquisition of a variety of linguistic components in an L2, including reading fluency, comprehension, and vocabulary learning (e.g., Brown et al., 2008; Chang, 2009; Chang & Millett, 2014; Webb & Chang, 2015).
Despite the reported benefits of combining text and pictures in multimedia reading materials, very little is known about how learners process the different input sources in these learning conditions. In the L1, studies have used eye-tracking to explore learners' online processing of text and pictures in the context of science and maths learning (e.g., Johnson & Mayer, 2012; Mason et al., 2013, 2015) and have provided useful insights about how learners integrate the different input sources. Unfortunately, we do not have a clear picture yet of how verbal and nonverbal input sources are processed in the context of L2 learning in the presence of auditory input and, importantly, how potential processing differences might be related to learning and comprehension. In a recent exploratory study, Serrano and Pellicer-Sánchez (2019) showed that the presence of auditory input led to differences in the allocation of attention in an illustrated graded reader in an L2, with more looks to the pictures in the RWL condition than in a reading-only (RO) condition. Importantly, Serrano and Pellicer-Sánchez provided initial evidence for the relationship between processing patterns and comprehension, suggesting that longer processing times on the text in both RWL and RO reflected processing difficulties that were then related to lower comprehension scores. However, the authors call for more research on this topic, as their results could be due to the small set (N = 10) of rather challenging questions used in the study. In addition, despite the reported positive effect that pictures have on reading comprehension (e.g., Elley & Mangubhai, 1983; Omaggio, 1979), no previous studies have looked at the relationship between the processing of pictures and comprehension in both RO and RWL conditions. Thus, our understanding of the relationship between the allocation of attention to the different input sources in multimedia materials and comprehension is rather limited.
The present study addresses this gap by using eye-tracking to examine how young EFL learners (11- to 12-year-olds with 5 years of English instruction) process text and pictures in RO and RWL conditions and the impact that potential differences in the allocation of attention to both text and pictures have on comprehension.

PRINCIPLES OF MULTIMEDIA AND MULTIMODAL LEARNING
The multimedia principle, put forward by Mayer (2001), states that people learn better from words and pictures than from words alone. Multimedia learning has been shown to lead not only to better learning outcomes but also to higher levels of motivation for learners (e.g., Sung & Mayer, 2013). Forms of presentation in multimedia environments are categorized according to the presentation modes (i.e., pictorial and verbal presentation) and the sensory modalities (i.e., auditory and visual presentation) (Mayer, 2014). As Mayer (2014) explains, the presentation mode relates to Paivio's (1986, 2006) dual coding theory, which suggests that the two modes (i.e., verbal and nonverbal) are processed through two different channels, each with limited processing capacity. The simultaneous activation of the verbal and nonverbal systems fosters learning. Notably, learning from a combination of sensory modalities (i.e., visual, auditory, kinaesthetic) is also referred to as multimodal learning (e.g., Massaro, 2012; Niegeman & Heidig, 2012).
Based on the available empirical evidence, Mayer (2009) identified 12 principles for the creation of effective multimedia learning environments. Two of those principles are particularly relevant for the present study, that is, the redundancy principle and the modality principle. One of the most important principles of multimedia learning is the redundancy principle, which suggests that redundant material (i.e., material that is concurrently presented in different forms or unnecessarily elaborated) interferes with learning (Kalyuga & Sweller, 2014). According to this principle, "[P]eople learn better from graphics and narration than from graphics, narration, and printed text" (Niegeman & Heidig, 2012, p. 2374). Duplication of the same information may overload working memory, inhibiting comprehension and learning (Kalyuga & Sweller, 2014). According to cognitive load theory (Sweller, 1988), the need to coordinate this redundant information involves a higher cognitive demand, which can have a detrimental effect on learning and comprehension (Kalyuga & Sweller, 2014). Interestingly, Kalyuga and Sweller (2014) claim that the negative effects of the simultaneous presentation of written and spoken text might be particularly evident in second or foreign language learning. However, empirical studies supporting the redundancy principle have mainly been conducted in the context of information acquisition and knowledge construction in an L1 (e.g., Kalyuga et al., 1999; Jamet & Le Bohec, 2007; Mayer et al., 2001). Similar evidence in the L2 context is scarce (e.g., Moussa-Inaty et al., 2012) and contrasts with the positive effect of combining written and spoken texts found in RWL studies in the L2 (see review in the next section), as well as with the evidence provided by studies supporting the role of subtitles and captions on comprehension and L2 vocabulary learning (e.g., Montero Perez et al., 2014; Peters, 2019).
Also relevant for the present study is the modality principle, which suggests that people learn better when pictures are presented with auditory text than with written text (Mayer & Moreno, 1998; Moreno & Mayer, 1999; Schnotz, 2014). According to the modality principle, the simultaneous presentation of written text and illustrations involves split attention that could negatively impact learning. Interestingly, Schnotz (2014) predicts a reversed modality effect, by which, in certain situations, written text with illustrations might be better than spoken text. As Schnotz (2014) explains, written text allows learners to pause and reread difficult passages and gives readers the opportunity to adapt their perceptual processing to their needs. These opportunities to pause and reread could be particularly useful for L2 learners.
It is important to note that the principles of multimedia learning were introduced to explain processing of multimodal and multimedia materials in L1 learning, specifically the learning of science and maths, where a complex integration of sources is needed and where the text and illustrations are specifically designed to teach content (e.g., a figure showing an engine and the teacher's oral explanation about how it functions). However, the types of multimedia materials that are often used in the L2 context serve a different purpose and require a different level of integration. For example, in a graded reader, like the one used in the present study, the content of the text can be understood without processing the pictures. Thus, it requires less complex integration of different information sources.

READING-WHILE-LISTENING IN A SECOND LANGUAGE
Research on the effectiveness of combining auditory and visual modes in RWL in an L2 abounds. There is some empirical evidence questioning the effectiveness of RWL, suggesting that it has a detrimental effect on learning and comprehension. In Diao and Sweller's (2007) study, for example, EFL adult learners (first-year university students) were asked to read two texts (two experimental sessions) in one of two instructional conditions, that is, RO or RWL. Participants were asked to read the text twice and to complete a comprehension recall test after the reading. Results of the study showed that RWL led to lower reading comprehension scores than RO.
Despite the negative evidence provided by Diao and Sweller (2007), the majority of investigations in the L2 context have suggested a positive effect of RWL. Studies conducted with adult learners have shown that RWL interventions led to improvements on a range of linguistic components, including listening fluency (e.g., Chang, 2009), vocabulary learning (e.g., Webb & Chang, 2015; Webb et al., 2013), listening comprehension (e.g., Chang, 2009), and reading rates and reading comprehension (e.g., Chang & Millett, 2015), with RWL often showing an advantage over other modalities such as RO or listening only. Although the evidence is scant, a few studies have also shown the beneficial effects of RWL for young learners. Lightbown (1992) compared the effects of an extensive RWL instructional intervention to teacher-led instruction with primary schoolchildren and showed that RWL was at least as effective as teacher-led treatment for the acquisition of receptive and productive skills. In a follow-up study, Lightbown et al. (2002) found that after six years of the extensive RWL intervention, learners performed as well as comparison groups in receptive measures and in measures of oral production, although the approach was not as effective for written production. Similarly, Trofimovich et al. (2009) showed positive effects of RWL for young learners' pronunciation accuracy.
In general, the studies suggest not only that RWL is beneficial for a range of L2 tasks but also that learners have positive attitudes toward this mode of instruction, both adult (e.g., Brown et al., 2008; Chang, 2009; Chang & Millett, 2014) and young learners (e.g., Lightbown et al., 2002; Tragant et al., 2016; Tragant & Vallbona, 2018).

EYE-TRACKING STUDIES ON MULTIMODAL INPUT
Eye-tracking allows researchers to examine the cognitive effort involved in processing different types of stimuli (i.e., written/spoken verbal stimuli, as well as nonverbal, visual stimuli) (Pellicer-Sánchez & Conklin, 2020). It provides measures of different elements of the eye-movement record: saccades, that is, the rapid movements of the eyes; fixations, that is, when the eyes stop; as well as regressions, that is, movements back in a text while reading. Eye-tracking research has shown interesting differences in processing patterns for text and images/scenes (for a review of research see Conklin et al., 2018). Research has shown that average fixation duration on images (260-330 ms) tends to be longer than fixation durations on text in silent reading (225-250 ms) because during scene perception useful information is gained from a fairly wide field of view (Rayner, 2009). Eye-tracking studies of reading have also shown that, when compared to adult readers, children have slower reading rates, more fixations, longer fixation durations, less skipping, and more saccades (Rayner, 1998, 2009; Whitford & Joanisse, 2018). Different processing patterns have been found for monolingual and bilingual children, with bilingual children having longer fixation durations and longer reading times than monolingual children when reading in their L1 (e.g., Whitford & Joanisse, 2018). Longer fixation durations, more saccades, and a higher number of fixations have also been found when bilingual children read in their L2 than in their L1 (e.g., Whitford & Joanisse, 2018).
There has recently been a growing interest in the use of eye-tracking in the context of multimedia and multimodal learning, but there is still fairly little research (Alemdag & Cagiltay, 2018). The use of eye-tracking in multimedia learning overcomes many of the limitations imposed by self-report measures and allows for a direct indication of cognitive processing during multimedia learning (Mayer, 2017). Eye-tracking can demonstrate how learners integrate the different sources of input that are presented simultaneously and use this to explore their potential impact on performance measures (see Alemdag & Cagiltay, 2018, for a review of eye-tracking research in the domain of multimedia learning).
The majority of studies investigating eye movements during multimedia learning focused on science and maths learning in the L1 with the aim of providing empirical evidence for the principles of multimedia learning. Studies conducted with adult learners have demonstrated that presenting spoken rather than written text alongside visuals results in more processing time on the visualizations, whereas presenting written text results in more time spent reading than looking at the visualizations (e.g., Schmidt-Weigand et al., 2010). Research has also shown that learning improves when text and pictures are presented close to each other, that is, the spatial contiguity principle (e.g., Johnson & Mayer, 2012). Studies conducted with young learners in the L1 have also demonstrated that the integration of text and pictures supports retention and the application of newly learned knowledge (Mason et al., 2015). Young learners' attention to relevant pictures seems to be positively related to learning scores (e.g., Eitel, 2016), and a better integration of text and pictures is associated with enhanced performance (Mason et al., 2015).
In the L2 context, eye-tracking studies on multimedia and multimodal materials have mainly been concerned with the processing of subtitled videos. Previous research conducted with adult learners has shown that both the animation and subtitles are processed and that learners process subtitles regardless of the language of the soundtrack and the language of the subtitles (e.g., Bisson et al., 2014). In general, irregular reading patterns (i.e., higher skipping rate, fewer fixations, longer latencies) have been shown in the processing of subtitles (e.g., d'Ydewalle & De Bruycker, 2007), with processing patterns differing by L1 background (e.g., Winke et al., 2013) and by proficiency level (e.g., Muñoz, 2017). Some evidence for the relationship between subtitle reading and learning measures has also been provided in the L2 context (e.g., Montero Perez et al., 2015). Apart from the examination of subtitled videos, very few studies have used eye-tracking to examine adult learners' processing of text and static pictures in multimodal materials in the L2 context. In a recent study, Warren et al. (2018) examined L2 adult learners' eye movements to different gloss types (i.e., text only, picture only, and text + picture). They found that the presence of pictures in multimodal glosses led to less attention paid to the text, although these processing differences were not reflected in the general comprehension of the text. Bisson et al. (2015) examined L2 learners' eye movements when they learned new words that were presented in an L2 auditory mode with written L1 translations and pictures. They found that the presence of pictures reduced attention to the translations and that the time spent processing pictures was positively related to learning gains.
Very few eye-tracking studies on multimedia learning have been conducted with young L2 learners. Similar to research with adult learners, most of the available studies have focused on the processing of subtitled videos. This research has demonstrated that younger learners also show irregular reading patterns (i.e., higher skipping rate, fewer fixations, longer latencies) when processing subtitles (e.g., d'Ydewalle & De Bruycker, 2007), and when compared to adult learners, they skip fewer subtitles and spend more time on them (e.g., Muñoz, 2017). When the use of dynamic images has been compared to static pictures, eye-tracking has shown more processing of the visuals in the dynamic condition (e.g., Tragant & Pellicer-Sánchez, 2019).
While these studies focused on the processing of subtitles, a recent exploratory study by Serrano and Pellicer-Sánchez (2019) examined young L2 learners' eye movements to text and pictures in an illustrated graded reader and found that the presence of auditory text led to more time spent processing the images. They also explored the relationship between processing time on the text and reading comprehension and reported a negative correlation between them. However, despite the differences reported in the processing of pictures in the presence of auditory input, the study did not examine whether the processing of pictures was also related to comprehension. As argued earlier, given the central role that pictures play in reading comprehension, it would be important to examine, not only how they are processed but also how their processing is related to comprehension. In addition, as acknowledged by the authors, the negative relationship between processing of text and comprehension reported in the study could be attributed to the relatively small and challenging set of comprehension questions. The present study uses eye-tracking to examine young learners' processing of text and pictures in an illustrated graded reader as well as the relationship between processing patterns (on both text and pictures) and comprehension.

THE STUDY
The aim of the present study was to examine young L2 learners' processing of visual text and images in the presence and absence of the auditory text, as well as the relationship between viewing patterns on both text and pictures and comprehension. The following research questions were addressed:
1. Does the presence of auditory input affect young learners' allocation of attention to the text and pictures in multimodal reading conditions?
2. Is the amount of attention allocated to the text and images related to comprehension?
To address these questions, participants were asked to read a multimodal text in RO and RWL conditions while their eye movements were recorded and to complete a comprehension test. Eye movements to text and image areas in the two multimodal conditions were analyzed, and these were related to performance on the comprehension test. Based on the results of the study by Serrano and Pellicer-Sánchez (2019), it was hypothesized that the presence of audio would lead to differences in the amount of time allocated to text and pictures. Regarding the second research question, the available findings suggest a negative relationship between time spent processing the text and comprehension (Serrano & Pellicer-Sánchez, 2019). However, eye-tracking studies in other contexts have shown a positive relationship between processing time and performance measures (e.g., Godfroid et al., 2018; Pellicer-Sánchez, 2016). Thus, it was hypothesized that the relationship between processing time and comprehension could go in either direction. Although no previous studies have looked at the relationship between the processing of pictures and reading comprehension, there is evidence showing a positive relationship between processing of pictures and learning gains (e.g., Bisson et al., 2015). Based on these findings, a positive relationship between processing of pictures and comprehension was hypothesized.

PARTICIPANTS
Participants in this study were 30 EFL Catalan-Spanish bilinguals in a primary school in Barcelona (Spain). They were all in grade 6 and their ages ranged from 11 to 12. All participants had the expected level of L1 literacy for their age group (as reported by the class teacher). They all had received 5 years of English instruction and their proficiency level was A1.1 according to the Common European Framework of Reference for Languages (CEFR). Prior to the experiment, participants' vocabulary knowledge was assessed using the X-Lex vocabulary size test (Meara & Milton, 2003). Results revealed that all participants had a mean vocabulary size between 1K and 2K (max = 2,600 words, min = 1,100, M = 1,985, SD = 443). Data from two participants were removed from the analysis because they failed to complete one of the measurement instruments. Data from 28 learners (14 male, 14 female) were included in the analyses.

Reading Materials
The graded reader The Canterville Ghost (Wilde, 2012) (level A1.1, 300 headwords) was modified for the purposes of our study. The text from this graded reader was shortened to fit the length of the experiment and some of the lower frequency words were deleted to ensure that the vocabulary included in the story was within learners' level of proficiency. For example, the words pumpkin and crayon, which have a lower frequency (8K and 9K), were deleted from the original story. The final version of the text had 566 words, 94.2% of which were within the first 1,000 most frequent words (lexical profile of the text analyzed with Lextutor, Cobb, n.d.). Our aim was to have 95-98% of the words in the text from the 1K level, as participants in the study (with a mean vocabulary size of 1,985) were likely to know the words at this level, and this would indicate adequate comprehension of the text (Hu & Nation, 2000). One of the most frequent words in the text, that is, ghost, was from the 2K level but, given its centrality in the narrative, we confirmed knowledge of this word with each participant before starting the reading activity. Thus, counting ghost as a known word meant a lexical coverage of 96.5%.
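The coverage figures above are proportions of running words: the share of tokens in the text that fall within a given set of known words. A minimal sketch of that calculation follows, assuming exact word-form matching on a toy sentence. The function name `lexical_coverage`, the toy text, and the toy word list are illustrative only; they are not the study's materials, and the sketch does not reproduce the Lextutor procedure, which profiles words against frequency bands.

```python
import re


def lexical_coverage(text, known_words):
    """Return the percentage of running words (tokens) found in known_words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    known = sum(1 for token in tokens if token in known_words)
    return 100 * known / len(tokens)


# Toy data (hypothetical): not the study's text or the 1K frequency list.
toy_text = "The ghost walks in the old house. The family sees the ghost."
toy_1k = {"the", "walks", "in", "old", "house", "family", "sees"}

# Coverage when "ghost" is treated as unknown, then as a known word.
coverage_1k = lexical_coverage(toy_text, toy_1k)
coverage_with_ghost = lexical_coverage(toy_text, toy_1k | {"ghost"})
```

On the figures reported above, the move from 94.2% to 96.5% coverage of a 566-word text corresponds to roughly 13 running words, which is in line with ghost being one of the most frequent words in the story.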
The text was presented across 14 pages, which constituted the 14 screens/trials in the eye-tracking experiment (see Appendix for a sample of the experimental trials). Fourteen images were selected from the original graded reader to be presented alongside the text. The selected images accompanied the same part of the text as in the original graded reader. The text and image stimuli were designed to control for many of the factors that are known to affect eye-movement behavior. We wanted to ensure that the text that appeared on each page of the reading experiment had the same or very similar number of words and appeared in the same format (same font size and font style). The size of the original images was modified so that all the images had the same size. The position of text and images was counterbalanced so that both types of stimuli appeared at the right and left of the display. The design of the illustrated story followed the spatial and temporal contiguity principles of the cognitive theory of multimedia learning, which suggest that people learn better when words and pictures are presented near to each other and simultaneously, rather than successively (Mayer, 2009).
Finally, the auditory stimuli for the RWL condition were recorded by a native speaker of British English for each page of text at a speech rate of 113 words per minute (wpm), similar to the rate of speech of the original audio provided by the publishers for this and other graded readers.

Comprehension Test
Because we wanted to explore the relationship between processing of text and pictures and comprehension, two types of questions were created: questions that could be answered by reading the text and questions that could only be answered by extracting information from the pictures. As explained earlier, pictures help readers to predict the content of the text, keep information about the overall context and characters in mind, and facilitate meaning construction (Wright, 2010). In this sense, the images could also be helpful in answering the text-related questions, as they supported the content of the text. However, the image-related questions focused on specific visual features that were not reported in the text. Thus, we will refer to two types of questions: text + image questions and image-only questions. The narrative was first parsed into idea units (i.e., distinct events or actions that occurred in the course of the story) and these were then used to create multiple-choice questions. Each test item provided three options and a fourth "I don't know." The test was in Catalan to ensure comprehension of the content. Questions that related to the images could only be answered by having looked at the pictures (e.g., what a character was wearing, where a scene took place). A battery of 26 multiple-choice items was piloted prior to the experiment with a group of learners of similar characteristics (N = 46). The results of the pilot allowed us to examine the quality of the items in terms of discrimination and level of difficulty. Based on these analyses, 18 of the 26 questions were kept unchanged and 10 new questions were created. The final test included a total of 28 items, that is, 19 text + image questions and 9 image-only questions. After administration of the final test, the level of difficulty and discrimination was checked again. Only three items in the final test had a low level of discrimination and they were discarded from analysis. 
Consequently, all analyses in the study are based on written responses to 25 multiple-choice items (16 text + image questions and 9 image-only items) (Cronbach's alpha = .80).
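The reliability figure above is computed from a participants-by-items matrix of item scores. The sketch below illustrates one common way to obtain Cronbach's alpha from dichotomously scored multiple-choice responses, using only the Python standard library. The function names, the four-item answer key, and the five response patterns are hypothetical toy data, not the study's test or results.

```python
from statistics import pvariance


def score_responses(responses, answer_key):
    """Dichotomous scoring: 1 for a correct response, 0 for anything else."""
    return [1 if given == correct else 0
            for given, correct in zip(responses, answer_key)]


def cronbach_alpha(score_matrix):
    """Cronbach's alpha for a participants x items matrix of item scores."""
    n_items = len(score_matrix[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across participants.
    item_vars = [pvariance([row[i] for row in score_matrix])
                 for i in range(n_items)]
    # Variance of participants' total scores.
    total_var = pvariance([sum(row) for row in score_matrix])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


# Toy data (hypothetical): 4 items, 5 participants; "d" stands for "I don't know".
answer_key = ["a", "c", "b", "a"]
raw_responses = [
    ["a", "c", "b", "a"],
    ["a", "c", "b", "d"],
    ["a", "c", "a", "d"],
    ["a", "b", "a", "d"],
    ["b", "b", "a", "a"],
]
scores = [score_responses(r, answer_key) for r in raw_responses]
alpha = cronbach_alpha(scores)
```

Because the same variance estimator is used in the numerator and the denominator, using population or sample variance gives the same alpha.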

PROCEDURE AND ANALYSIS
Data were collected individually in a quiet room in the participants' school. Instructions were provided orally in the children's L1. Participants were then asked to read the story for comprehension while their eye movements were recorded. The experiment followed a within-subjects design, with all participants being exposed to both RO and RWL conditions. Half the story was presented in RO and the other half in RWL in a counterbalanced design. Participants were told about the existence of the two conditions and that they would have to answer some comprehension questions after reading the story. Although the audio was only included in one part of the story, participants were asked to wear the headphones for the duration of the story as it aided concentration and also helped to isolate any potential noise.
The story was presented on a 1280 × 1024 monitor and displayed over 14 screens. In the RO condition pages advanced with a mouse click, whereas in the RWL condition the pages advanced automatically when the audio recording finished. Eye movements were recorded with a Tobii T120 at a sampling rate of 120 Hz, which has a typical accuracy of .5° (measured in ideal conditions) and a resolution of .2°. A 5-point calibration and validation procedure was performed at the beginning of the experiment. No other calibrations were performed during the experiment. After the reading activity participants were asked to complete the comprehension test with no time pressure. The reading task lasted around 20 minutes and the whole procedure around 50 minutes.
For the analysis of eye movements, two regions of interest were defined for each trial, surrounding the image and the block of text. Fixations shorter than 80 ms were removed from the dataset (1% of the data). Three eye-movement measures were extracted and analyzed as measures of attention allocation: dwell time %, fixation count %, and average fixation duration. A dichotomous scoring system was used to score the comprehension test (1 for correct responses and 0 for incorrect responses). In response to the first research question, we examined the effect of two independent variables, that is, condition (RWL and RO) and region (text and picture), on the dependent variables, that is, the three eye-movement measures, through linear mixed-effects models using the lme4 (v 1.1-21; Bates et al., 2015) package for R (v 3.6.1; R Core Team, 2019). The p values for the effects were obtained using the lmerTest package (3.1-0; Kuznetsova et al., 2017). Separate models were fitted for each of the dependent variables. Because the duration of the trials (and hence the total dwell time and total fixation count) was limited by the duration of the audio recordings in the RWL condition whereas reading in RO trials was self-paced, percentage measures were entered in the models as a way of controlling for differences in trial length. Following Chang and Choi's (2014) approach, the proportion of the amount of time spent gazing at texts and pictures was used, instead of the raw total reading time, for two main reasons: first, attention is typically used as a relative term (Cowen, 1995), and second, percentage measures have also been used in studies on multimedia learning to study attention allocation to pictures and text (e.g., Chang & Choi, 2014; d'Ydewalle & De Bruycker, 2007; Johnson & Mayer, 2012; Yang et al., 2013). To answer the second research question, we fitted logistic regression models to the response accuracy data using the glm function from the base R stats package.
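The preprocessing described above (dropping fixations shorter than 80 ms, then expressing time and fixation counts per region as percentages of the trial total) can be sketched as follows. This is an illustration under stated assumptions rather than the study's pipeline: the function name, the tuple layout, and the region labels are hypothetical, and the denominators are assumed to be the summed duration and count of all retained fixations in the trial.

```python
from collections import defaultdict

MIN_FIXATION_MS = 80  # threshold reported in the text


def attention_measures(fixations):
    """Summarise one trial's fixations per region of interest.

    `fixations` is a list of (region, duration_ms) tuples, where region is
    a label such as "text" or "image" (hypothetical labels).
    """
    # Remove fixations shorter than the threshold.
    kept = [(region, dur) for region, dur in fixations
            if dur >= MIN_FIXATION_MS]
    if not kept:
        raise ValueError("no fixations left after filtering")

    # Assumed denominators: all retained fixations in the trial.
    total_time = sum(dur for _, dur in kept)
    total_count = len(kept)

    by_region = defaultdict(list)
    for region, dur in kept:
        by_region[region].append(dur)

    return {
        region: {
            "dwell_time_pct": 100 * sum(durs) / total_time,
            "fixation_count_pct": 100 * len(durs) / total_count,
            "avg_fixation_ms": sum(durs) / len(durs),
        }
        for region, durs in by_region.items()
    }


# Toy trial (hypothetical durations in ms); the 60 ms fixation is dropped.
trial = [("text", 250), ("text", 350), ("image", 300),
         ("text", 60), ("text", 100)]
measures = attention_measures(trial)
```

Expressing each region's share relative to the trial total is what makes self-paced RO trials comparable with RWL trials, whose length is fixed by the audio recording.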

PROCESSING OF TEXT AND IMAGES
The total time that learners spent processing the text and image areas in the two conditions (RO vs. RWL) was explored first (see Table 1). As explained in the previous section, percentage measures were entered in the models as a way of controlling for differences in trial length and as a measure of relative attention distribution. The three dependent measures were checked for normality using the fitdistrplus package (v 1.0-14; Delignette-Muller & Dutang, 2015) and the method outlined by Cullen and Frey (1999). They were found to significantly deviate from normality. Shapiro-Wilk tests confirmed significant deviations from normality for Dwell Time %: W = .76, p < .0001; Fixation Count %: W = .77, p < .0001; and Average Fixation Duration: W = .98, p < .0001. There is, however, marked disagreement as to the importance of parametric assumptions (e.g., McCulloch & Neuhaus, 2011), with evidence pointing to the relative robustness of linear mixed models to violations of normality (Arnau et al., 2013), and debate as to the cost-benefits of data transformations to address issues of nonnormality in terms of interpretability of the effects (Liceralde & Gordon, 2018). While some authors have recommended alternative statistical approaches (e.g., GLMMs; Lo & Andrews, 2015), particularly in cases of small sample sizes (Arnau et al., 2013), in this instance, we have instead opted to fit a robust linear mixed model to the data using the robustlmm package (v 2.3; Koller, 2016) as a check for our analysis. A comparison of the coefficients produced by the lme4 and robustlmm package showed broad agreement. Thus, the coefficients produced by the lme4 package are reported here.
A first model was fitted to the Dwell time % data, modeling the interaction between condition (RO vs. RWL) and region of interest (TEXT vs. IMAGE) as fixed effects, with random intercepts for the effect of trial and participant (a model with random slopes for the effect of trial within participants was not found to be a better fit for the data by computing an ANOVA between the two models, χ² = 0, p = 1). The model (see Table S1 in the supplementary materials) revealed significant main effects of Condition, β = .03, t(836) = 4.97, p < .0001, d = .34, and of Region, β = .84, t(836) = 122.04, p < .0001, d = 8.44, as well as a significant interaction between the two, β = −.06, t(836) = −6.87, p < .0001, d = −.47. To decompose the interaction, we ran Bonferroni-corrected post-hoc comparisons between all levels of the two factors (i.e., condition and region) using the emmeans package (v. 1.3.5.1). These revealed that, proportionally, the participants spent more time on the text region (compared to the picture region) in the RO condition than in the RWL condition, β = −.03, z = −4.73, p < .0001, whereas more time was spent fixating the images during RWL trials than during RO trials, β = .03, z = −4.97, p < .0001.
Fixation counts on the two regions of interest were then examined (see descriptive statistics in Table 1). As explained in the preceding text, these were computed as percentages of the total number of fixations recorded during each trial. The same model structures were fitted to this dependent variable as for Dwell time %. This model (see Table S2 in the supplementary materials) revealed significant main effects of both Condition, β = .03, t(836) = 5.71, p < .0001, d = .39, and Region, β = .80, t(836) = 120.46, p < .0001, d = 8.33, as well as a significant interaction between the two factors, β = −.07, t(836) = −7.93, p < .0001, d = −.54. Post-hoc comparisons on the interaction revealed the same pattern of results as the Dwell time % measure, whereby learners spent proportionally more time fixating the text region during RO trials than during RWL trials, β = −.03, z = −5.51, p < .0001, whereas more time was spent fixating the images during RWL trials than during RO trials, β = .03, z = 5.71, p < .0001.
It is worth noting that the main effects of Region, both for Dwell time % (d = 8.44) and fixation counts (d = 8.33), were considerably larger than those included in the size estimates commonly used in the literature (e.g., Cohen's levels). Effect sizes larger than 1 have been found in previous research (Hattie, 2009) and those above 2 have been described as huge in the applied statistical literature (Sawilowsky, 2009).
The average duration of fixations in the two regions of interest was then analyzed (see descriptive statistics in Table 1). The model (see Table S3 in the supplementary materials) revealed a main effect of Condition, β = 25.29, t(807) = 5.31, p < .0001, d = .37, a main effect of Region, β = 83.78, t(807) = 17.24, p < .0001, d = 1.21, and an interaction between Condition and Region, β = −15.77, t(807) = −2.33, p < .05, d = −.16. Post-hoc comparisons showed that the difference in average fixation duration between RWL and RO trials was significant for the pictures, β = 25.29, z = 5.31, p < .0001, with longer average fixations on the images during trials with audio. Mean fixations on the text during RWL trials and RO trials were not significantly different, β = 9.51, z = 1.97, p = .29.
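The Bonferroni adjustment applied to the post-hoc comparisons can be illustrated with the nonsignificant text contrast above (z = 1.97, p = .29). The sketch below assumes that the correction covered the six pairwise comparisons among the four condition-by-region cells and that p values came from a two-sided normal approximation; neither detail is stated in the text, and the function names are hypothetical.

```python
import math


def two_sided_p_from_z(z):
    """Two-sided p value under a standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))


def bonferroni(p, n_comparisons):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p * n_comparisons)


# Assumption: 4 cells (RO/RWL x text/image) give 4 * 3 / 2 = 6 pairwise contrasts.
n_contrasts = 4 * 3 // 2
raw_p = two_sided_p_from_z(1.97)
adjusted_p = bonferroni(raw_p, n_contrasts)
print(f"raw p = {raw_p:.3f}, adjusted p = {adjusted_p:.2f}")
```

Under these assumptions, an unadjusted p value just below .05 becomes clearly nonsignificant after correction, which is consistent with the value reported for the text contrast.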

TEXT COMPREHENSION
Finally, learners' scores in the comprehension test and their relationship with viewing patterns were analyzed (see response accuracy descriptive statistics in Table 2). We first looked at the relationship between processing time on a particular region (picture or text) and response accuracy for comprehension questions pertaining to that region type (image-only or text + image questions). Because there was not one text + image and one image-only question per page/trial, dwell times were computed per type of region of interest (picture or text) but averaged across trials. Similarly, participants' mean response accuracy was computed (as a correct/incorrect ratio bounded between 0 and 1) per question type (i.e., text + image questions and image-only questions). Because the data was averaged across trials, linear mixed models were no longer viable as a statistical method; furthermore, because of the bounded nature of the outcome variable, a simple linear regression would similarly not be indicated. We therefore opted to fit logistic regression models to the response accuracy data using the glm function from the base R stats package.
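The averaging step described above can be sketched as follows: dwell-time percentages are averaged per participant and region, response accuracy is averaged per participant and question type, and each region is then paired with the question type it is assumed to inform. All names and values below are hypothetical, and condition is left out for brevity; this is an illustration of the data layout, not the original R code.

```python
from collections import defaultdict
from statistics import mean

# Region of interest paired with the question type it is assumed to inform.
REGION_TO_QUESTION = {"text": "text+image", "image": "image-only"}


def average_by_key(rows):
    """Average the values of (participant, key, value) rows per (participant, key)."""
    grouped = defaultdict(list)
    for participant, key, value in rows:
        grouped[(participant, key)].append(value)
    return {k: mean(v) for k, v in grouped.items()}


def build_model_rows(dwell_rows, answer_rows):
    """Pair mean dwell-time share per region with mean accuracy per question type."""
    dwell = average_by_key(dwell_rows)
    accuracy = average_by_key(answer_rows)
    rows = []
    for (participant, region), dwell_pct in sorted(dwell.items()):
        question_type = REGION_TO_QUESTION[region]
        if (participant, question_type) in accuracy:
            rows.append(
                (participant, region, dwell_pct, accuracy[(participant, question_type)])
            )
    return rows


# Made-up data for one participant across two trials.
dwell_rows = [("p01", "text", 0.75), ("p01", "image", 0.25),
              ("p01", "text", 0.25), ("p01", "image", 0.75)]
answer_rows = [("p01", "text+image", 1), ("p01", "text+image", 0),
               ("p01", "image-only", 1)]
model_rows = build_model_rows(dwell_rows, answer_rows)
print(model_rows)
```

Each resulting row holds an accuracy ratio bounded between 0 and 1, which is why a model with a logit link is a more natural fit than an ordinary linear regression.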
Results of the logistic regression models reported in Table 3 revealed that, of the main effects, only Region was significant, suggesting overall better response accuracy for the text + image questions compared to image-only questions. There was no main effect of Condition, suggesting a similar response accuracy in the RWL and RO conditions. There was also a significant interaction between Dwell time % and Region, with a higher percentage of dwell time on the text related to lower accuracy. The interaction between Condition and Region was also significant, suggesting that there is a stronger relationship between Dwell time % and comprehension accuracy in the RO compared to the RWL condition (see Figure 1).
Because the three-way interaction was not significant, a final model was fitted with only the two-way interactions between Dwell time % and Region, and between Condition and Region (see results in Table 4). In line with the findings of the previous model, this model revealed a main effect of Region and a significant interaction between Dwell time % and Region. The interaction between Condition and Region was not significant in this model, suggesting that the relationship between Dwell time % and comprehension accuracy does not differ by condition.
This analysis shows the connection between percentage of time on a particular type of region and accuracy in responding to questions that referred to that area (i.e., percentage of time on images and accuracy on image-only questions, as well as percentage of time on the text and accuracy on the text + image questions). However, it could also be hypothesized that processing time on a type of region (picture or text) might also support comprehension of questions related to the opposite target type. The percentage of time on images might support comprehension of text + image questions, and percentage of time on text could also support comprehension of image-only questions. Thus, as a final analysis we tested whether Dwell time % on the opposite target type was related to response accuracy (i.e., whether percentage of time spent looking at the picture could help in correctly answering text + image questions, and vice versa). To do this, we swapped the eye-movement measures between regions of interest and used the new resulting variable as a predictor in the same logistic regression structure used in the previous analyses.
The results showed that there was a significant interaction between Dwell time % and Region, β = 12.59, z = 2.85, p = .004, suggesting that spending proportionally more time looking at the images was related to greater accuracy on text + image questions, but proportionally more time spent looking at the text was related to lower accuracy on image-only questions (see Figure 2). It must be pointed out that the logistic regressions reported here produced odds ratios that were unusually large and, in a couple of cases, unusually small. We report them in the table because, for significant effects, they nevertheless had confidence intervals that, however wide, reliably did not cross 0. This is also consistent with the wide CIs observed in Figure 1 and Figure 2, and likely the result of a low number of observed events for each accuracy level.
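To see why odds ratios from these models can be very large, the sketch below converts the interaction coefficient reported above (β = 12.59, z = 2.85) into an odds ratio with an approximate 95% Wald interval. It assumes that z is the coefficient divided by its standard error and that Dwell time % was entered as a proportion between 0 and 1, so that a one-unit change spans the whole scale; both are assumptions rather than details given in the text, and the function name is hypothetical.

```python
import math


def odds_ratio_wald(beta, z, z_crit=1.96):
    """Return (odds ratio, CI lower, CI upper, SE) from a log-odds coefficient.

    Assumes a Wald statistic, so SE = beta / z.
    """
    se = beta / z
    lower = math.exp(beta - z_crit * se)
    upper = math.exp(beta + z_crit * se)
    return math.exp(beta), lower, upper, se


odds_ratio, ci_lower, ci_upper, se = odds_ratio_wald(12.59, 2.85)
# Rescaled effect for a 10-percentage-point (0.1) change in the predictor.
odds_ratio_per_10_points = math.exp(12.59 * 0.1)

print(f"SE = {se:.2f}")
print(f"OR per unit = {odds_ratio:.3g} (95% CI {ci_lower:.3g} to {ci_upper:.3g})")
print(f"OR per 0.1 = {odds_ratio_per_10_points:.2f}")
```

Under these assumptions the interval on the log-odds scale stays above 0, which is equivalent to the odds-ratio interval staying above 1, and the effect looks far less extreme once it is expressed per 10 percentage points of dwell time rather than per full unit.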

DISCUSSION
The current study contributes important knowledge about the processing of multimodal materials in L2 and EFL contexts. Importantly, it responds to the call for more research on the relationship between eye-movement patterns and learning outcomes (Alemdag & Cagiltay, 2018). To achieve this, we examined young learners' processing of images and written text in the presence and absence of an auditory version of the text. More specifically, we looked at the proportion of fixations and proportion of fixation duration on the text and image areas. We also explored the relationship between reading/viewing patterns and performance on a comprehension test. Results of the analysis of Dwell time % and Fixation count % showed that, in general, young L2 learners are likely to spend proportionally more time on the processing of the text than the images in both multimodal conditions (RO and RWL). This confirms previous research showing that, when presented with pictures and text, learners tend to spend more time on the text (Schmidt-Weigand et al., 2010). It is important to note that these patterns could be explained by the degree of informativeness of the different input sources in the materials. In this study, the text carried most of the information and learners were also aware that they would be answering some comprehension questions after the reading task. Thus, it is not surprising that they spent more time processing the text. Different patterns would be expected with other types of reading materials where the visual input carries most of the information, such as in comic books.
In response to the first research question, results showed that processing patterns were clearly affected by the presentation of the auditory input. The analyses demonstrated that in the RWL mode young learners spend proportionally more time and have more fixations on the pictures than in RO conditions, while in the RO mode they spend proportionally more time and have a higher percentage of fixations on the text than they do in the RWL mode. Average fixation durations were also longer in general in the RWL condition, particularly for the images. As Serrano and Pellicer-Sánchez (2019) argue, because the verbal input is presented auditorily, learners can look at the pictures more often and make better use of them, which allows them to better integrate the verbal and nonverbal sources of input. Looking more at the pictures in the RWL condition does not seem to hinder comprehension. The lack of differences in comprehension between the RO and RWL conditions does not support an advantage of RWL over RO, as was the case in earlier investigations (e.g., Chang, 2009; Chang & Millett, 2015), nor a detrimental effect of RWL on comprehension (e.g., Diao & Sweller, 2007). Importantly, it supports results of previous investigations showing the beneficial effect of RWL for young learners' comprehension (e.g., Lightbown, 1992).
The current findings go against the redundancy principle, which suggests a negative effect of presenting a text in both written and spoken modalities. Kalyuga and Sweller (2014) argued that this negative effect should be particularly evident for L2 readers because of the difficulty they have linking auditory and written input. However, results of the present study show similar levels of comprehension in the RO and RWL conditions. This suggests that the negative effect of the presentation of written and spoken input found in the context of L1 content learning (e.g., Kalyuga et al., 1999; Jamet & Le Bohec, 2007; Mayer et al., 2001) is not applicable to the L2 reading context, at least with young learners. Our results are also problematic for the dual modality principle, which predicts a detrimental effect of dual modality presentation. The results of the present study show processing differences when the verbal input is provided through two modalities (i.e., spoken and written), but this does not seem to have a detrimental effect on comprehension, supporting previous research findings (e.g., Chang & Millett, 2015). Again, this calls into question the applicability of principles of multimedia learning in L2 contexts.
In addition, the proportionally longer reading times for the text in the RO condition could be a reflection of the possibility of pausing and rereading parts of the passage; this would allow readers to adapt their processing speed to their needs, as suggested by Schnotz (2014). Results of the present study also confirm the findings of the study by Serrano and Pellicer-Sánchez (2019), which was also conducted with young learners of similar age and proficiency, and demonstrate similar reading and viewing patterns with their unmodified, more authentic materials and our less authentic, more controlled ones.
Concerning the eye movements examined in the present study, the analyses of the average fixation durations have shown that, contrary to what was expected, fixations on images were shorter than fixations on the text. This is in contrast to what has been suggested for adult readers and might be a reflection of the developing reading skills of the participants in the present study. As expected, average fixation durations on the text were longer than typical reading times in adult readers (Rayner, 1998, 2009; Whitford & Joanisse, 2018).
In response to the second research question, results of this study have shown that a higher percentage of total dwell time on the text was related to lower accuracy on the text + image questions. This is in line with previous findings suggesting that longer relative gaze duration (i.e., proportion of the amount of time gazing at text) (e.g., Chang & Choi, 2014) and longer total dwell time (e.g., Serrano & Pellicer-Sánchez, 2019) are a sign of processing difficulties that are then reflected in lower comprehension scores. Importantly, results of the present study have provided initial evidence of a relationship between the amount of attention allocated to images and scores on text + image questions. A higher proportion of processing time on images supported comprehension of the text, providing evidence for the positive role of images in reading comprehension. This finding is also in line with previous studies showing a positive relation between total dwell time (Bisson et al., 2015) as well as relative attention (i.e., fixation count on picture-fixation count on text) (Eitel, 2016) on pictures and learning scores. It could also be hypothesized that the relationship between the processing of the images and accuracy on text + images-related questions is a consequence of faster readers spending proportionally less time on the text. Spending less time on the text, and consequently having more time available to look at the images, could be a sign of higher proficiency in reading that is then reflected in accuracy scores. Interestingly, more time on images was not related to higher accuracy on the image-only questions, suggesting that when processing the images, learners did not pay particular attention to the specific visual features that the questions addressed and that they used images mainly as support for text comprehension.
The results of this study have important implications for teaching. The present study indicates that while the children spent more time on the images in the RWL mode, their comprehension was equally good in both modes. We know from previous studies that RWL is popular with young learners and that they generally show positive attitudes toward RWL (e.g., Lightbown et al., 2002; Tragant et al., 2016; Tragant & Vallbona, 2018). Based on this evidence, it seems advisable for teachers to promote RWL among young learners. It is a powerful language learning tool in less formal contexts both at school and at home. Importantly, the present study also supports the use of images to support young learners' comprehension. Despite claims that images may pull attention away from the text (e.g., Hill, 2013), this study has shown that the amount of attention allocated to images seems to support reading comprehension.
It is important to acknowledge the limitations of the present study. This investigation is the first to examine the potential role that the percentage of dwell time allocated to images has on comprehension. However, the number of images used in the present study, as well as the content they depicted, did not allow us to have a larger number of image-only questions. Having enough questions that can only be answered by processing the images might only be possible in a much longer experiment and/or with images drawn specifically for the purposes of the study. Finally, the results of the present study shed light on our understanding of young L2 learners' processing of multimodal input, but it remains to be demonstrated whether similar patterns would be observed with learners of different ages and proficiency levels. Future studies should examine the relationship between reading and viewing patterns and comprehension with L2 learners having a wide variety of characteristics.

CONCLUSION
This study has provided further evidence for the benefits of using eye-tracking to examine processing during multimodal learning (Mayer, 2017). The results of the present study show that the addition of auditory input leads to processing differences, with proportionally more time spent on images in the RWL condition than in RO. These processing differences suggest a better integration of the verbal and pictorial sources of information in multimedia materials with auditory input, without having a negative impact on comprehension. Importantly, this study has shown that proportionally more time on text is related to lower levels of comprehension, whereas more time on images is related to better comprehension, revealing interesting differences in the relationship between reading/viewing patterns and comprehension.

SUPPLEMENTARY MATERIALS
To view supplementary material for this article, please visit http://dx.doi.org/10.1017/S0272263120000091.

NOTES
1 While it is likely that participants with a vocabulary size of 1,985 words would know most of the words in the 1K, a VST does not provide information about word knowledge at different frequency levels. Future studies should use a vocabulary measure that provides more reliable information about knowledge at different frequency levels.
2 Normal speech rate in English is approximately 150 wpm (Buck, 2001; Chang, 2011; Griffiths, 1990). This is the speech rate followed in audiobooks for adults. Examination of the audio recordings of graded readers at this level showed that the speech rate ranged from 90-120 wpm.