The Quarterly Journal of Experimental Psychology
Speech Segmentation Is Facilitated by Visual Cues


speech signal, both infants and adults are sensitive to the distributional regularities of speech and can exploit this information in language learning. This important learning mechanism has been termed statistical learning. It involves the ability to learn from regular patterns presented in different sensory modalities (Conway & Christiansen, 2006). Concerning speech, statistical learning has been demonstrated to assist language learning at the phonetic level in infants (Jusczyk & Luce, 1994), as well as to be a sufficient cue for children and adults to segment word candidates in fluent speech (Aslin, Saffran, & Newport, 1998; Saffran, Aslin, & Newport, 1996a). For instance, the computation of the transitional probabilities of syllables (i.e., the likelihood that one syllable is followed by another) has been shown to help language learners locate word boundaries (see, e.g., Saffran et al., 1996a; Saffran, Newport, & Aslin, 1996b).
In language learning, however, there are often multiple cues besides transitional probabilities of syllables that can help to solve the word identification problem. In a recent review, Kuhl (2004) proposed that joint visual attention to an object that is named by an adult might help infants to segment words from ongoing speech. This idea is based on previous results (Baldwin, 1991, 1993; see also Tomasello & Barton, 1994), which found that 18-month-old children tend to follow the speaker's eye gaze to infer the referent of a novel word. Thus, at the earliest stages of language acquisition, a plausible strategy for learning words is to extract referents from direct visual observations of objects, scenes, or events, guided by joint visual attention to an object that is named by an adult (Kuhl, 2004). This is further supported by the demonstration that receptive vocabulary skills are related to an infant's tendency to follow the gaze of an adult (Baldwin, 1995; Brooks & Meltzoff, 2002).
In convergence with the idea that multiple cues could be used to enhance speech segmentation, several studies have provided evidence that infants are sensitive to audio-visual synchrony in speech (see, e.g., Dodd, 1979; Gogate & Bahrick, 1998; Gogate, Bahrick, & Watson, 2000). For example, infants gaze longer at a speaking face when the audio and visual sources are synchronous than when they are not, and they are sensitive to asynchronies as small as 400 ms (Dodd, 1979). Similarly, Kuhl and Meltzoff (1982, 1984) reported that 4-month-old infants gazed longer at video images that were vocally compatible with an auditory signal than at incompatible ones. Related to these studies, Aronson and Rosenbloom (1971) reported that 10-day-old infants showed distress when their mother's voice was heard to emanate from a location distal to her face. In a more recent study, Hollich, Newman, and Jusczyk (2005) showed that 7.5-month-olds were able to segregate speech in a noisy environment when seeing a video of the talker's face synchronized with the target passage. However, they were not able to accomplish this task when the video was unsynchronized or when there was a static face during the familiarization phase. When the synchronized face was replaced by a synchronized signal from an oscilloscope, their performance was also facilitated. The authors interpreted this finding in favour of the existence of a special sensitivity in infants to synchronized multimodal information that helps them to segregate the target speech signal from other sound sources in a noisy environment. Moreover, blind children have trouble acquiring certain phonemic distinctions (Mills, 1987), highlighting the importance of vision in language acquisition.
It thus seems that infants' word comprehension develops from the early detection of intersensory associations between auditory speech patterns (words) and visible objects or actions. Accordingly, language learning may depend on the dynamic and reciprocal interaction between intersensory perception, selective attention, and memory mechanisms. Consider, for example, the synchronous appearance of a car and the mouth movements and vocalizations of a caretaker together with the corresponding sounds: This information can be used as multimodal cues by an infant to isolate the word "car" from the continuous speech stream and ultimately to comprehend speech.
At the theoretical level, a recent model of language learning claims that infants' sensitivity to joint visual and auditory attention, together with their imitative abilities, may explain their capability to appreciate the communicative intentions of other persons (Tomasello, 2003). Temporal contiguity in the form of simultaneous appearance of an object and a word (its label) can be argued to play a central role at the early phase of infant word learning and constitutes an important element in the emergentist coalition model of word learning (Hollich et al., 2000). One of the tenets of this model is that infants' word learning relies on a perceptual subset of the available cues in the coalition, and social cues, like the eye gaze direction of others, are recruited later on during development. Thus, temporal contiguity, together with perceptual salience, would guide word learning early on in child development, followed by a shift at 12 months of age towards a greater dependency on social cues, like following adults' eye gaze or handling of objects (Golinkoff & Hirsh-Pasek, 2006).
The studies reviewed above emphasize the importance of the presence of multimodal cues that can initially guide infants' selective attention and enhance speech segmentation (e.g., Bloom, 1998; Gogate et al., 2000; Hollich et al., 2005). The use of multimodal information in language processing has also been documented in adults learning a second language (Davis & Kim, 2001). In fact, a number of studies have demonstrated that adult listeners increase their identification rate of speech sounds when they also have access to visual information such as the dynamics of the facial articulators (Rosenblum & Saldana, 1996). In addition, in noisy environments, the intelligibility of speech increases when the speaker's face is present (Dodd, 1977; Macleod & Summerfield, 1987; Sumby & Pollack, 1954). Using nondegraded auditory information, Reisberg, McLean, and Goldfield (1987) observed that when listening to a speaker with a strong foreign accent or to a passage with a complex semantic message, seeing the speaker's lips helped language comprehension (see also Dodd, 1977; Sanders & Goodrich, 1971). In a similar vein, Thompson and Ogden (1995) reported that participants' memory of spoken sentences using native language materials was facilitated by showing the face of the speaker. The impact of "visible speech" in processing a foreign language has also been documented in adult language learners, which could serve to compensate for the weaker information accrued in the lexicon. Reisberg et al. (1987) showed that seeing the speaker's lips improved the performance of two groups of second-language learners (native English speakers learning French or German). Interestingly, this effect was larger for second-language learners than for native speakers.
Also, in a more recent study in which adults were asked to repeat and memorize phrases of a language that they had not heard before (a foreign language), their performance improved when at the learning phase they had visual access to a video of the lower part (from the nose to the chin) of the speaker's face (Davis & Kim, 2001).
Given these findings on the impact of visible speech in first- and second-language processing, it is possible that visual information may also aid adults in initially segmenting the words of a new language. This hypothesis favours the idea that all available visual and auditory cues might be employed in order to better understand speech in noisy situations or to learn a new language (e.g., Davis & Kim, 2001). Moreover, as has been successfully implemented in a computational multisensory language interface (Yu & Ballard, 2004), it is possible that language learning could take place in an unsupervised mode with the collection of acoustic signals in concert with multisensory information from other sensory modalities, such as the speaker's eye gaze direction, head and hand movements, and so on. The fundamental idea is that to acquire a language, the learner can make use of nonspeech contextual information to facilitate speech segmentation.
In order to study this issue in relation to second-language learning, we investigated whether adults' speech segmentation was facilitated by visual cues (images of objects) that were synchronized to the onset/offset of the words embedded in an artificial language stream (see Figure 1 for a summary of all experimental conditions). We hypothesized that speech segmentation in adults would benefit from the temporal contiguity of visual and auditory information and that this facilitation would occur even when there was no association between the novel words and object images.

EXPERIMENT 1
In this experiment, we explored whether speech segmentation is facilitated by synchronously presented visual object images. Participants were exposed to a continuous auditory speech signal composed of nonsense words. This artificial language stream is, at first, usually perceived as a long string of syllables but, after a short period of exposure, the nonsense words can be segmented from the syllable stream by computing the transitional probabilities of the syllables (Aslin et al., 1998; Saffran et al., 1996b). In the critical experimental condition, visual stimuli were added and delivered in synchrony with the word onsets and offsets in the auditory language stream (see Figure 1). The visual information consisted of real drawings of objects, presented one at a time, with each one remaining on the screen for the entire duration of each word in the acoustic stream. The pictures were presented in a pseudorandom order, and thus they were not associated with specific words in the speech stream. In this way, we ensured that the only useful visual information provided by the object images was the temporal contiguity between words and pictures.

Method
Participants
A total of 52 students at the University of Barcelona participated in the study. Participants were randomly assigned to one of the two conditions: auditory language stream and audio-visual language stream. All participants were native speakers of Spanish or Catalan, and all of them received extra course credits for their participation.
Figure 1. Illustration of the procedure used for language exposure in the different experimental conditions. In all conditions, auditory information was presented (the uppermost row shows an auditory stream composed of four words). In the first experiment, the auditory alone condition (audio) was compared to the audio-visual synchrony condition (synchronous). In the latter condition, the onset of the picture perfectly matches the onset of each word. In the second experiment, the duration of each picture varied, and each picture was synchronized with the first, second, and third syllables of an auditory word (arrhythmic). In the third experiment, in two separate conditions pictures were synchronized with the onset of the second syllable (asynchronous 2nd-syllable) or with the onset of the third syllable (asynchronous 3rd-syllable) of the auditory words.
THE QUARTERLY JOURNAL OF EXPERIMENTAL PSYCHOLOGY, 2010, 63 (2)

Stimuli
A total of 24 different consonant-vowel syllables were used to create two language streams (see Appendix). For each stream, four trisyllabic nonsense words were concatenated to form a continuous speech stream. The acoustic streams were first created by using the speech synthesizer MBROLA (Dutoit, Pagel, Pierret, Bataille, & van der Vreken, 1996), and then the duration of the streams was adjusted to millisecond precision using the Cooledit software. The use of the artificial language learning methodology enables us to control for potential segmentation cues, such as word-stress or coarticulation. Thus, all phonemes had the same duration (116 ms) and pitch (200 Hz; equal pitch rise and fall, with pitch maximum at 50% of the phoneme) in the language streams. The only reliable cue that could help to discover word boundaries was the statistical structure of the language. In all streams the transitional probability of the syllables forming a word was 1.0, while for syllables spanning word boundaries it was .33. The duration of the acoustic stream was 2 min 24 s and 768 ms. The duration of each word was 696 ms, and each one was repeated 52 times along the stream.
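The stream statistics described above can be reproduced with a short sketch. It is only an illustration under stated assumptions: the four words are hypothetical placeholders (the actual items are listed in the Appendix), and the shuffled-block randomization with no immediate word repetition is assumed here, since the text does not describe how word order was generated.

```python
import random
from collections import Counter

PHONEME_MS = 116               # from the text: every phoneme lasts 116 ms
SYLLABLE_MS = 2 * PHONEME_MS   # one consonant-vowel syllable = 232 ms
REPETITIONS = 52               # from the text: each word occurs 52 times

# Hypothetical placeholder words, NOT the items from the paper's Appendix.
WORDS = [("pa", "bi", "ku"), ("ti", "bu", "do"),
         ("go", "la", "tu"), ("da", "ro", "pi")]


def build_word_order(words, repetitions, rng):
    """Return a word sequence in which no word immediately repeats.

    Assumption: the stream is made of shuffled blocks that contain each
    word once; the paper does not specify its randomization procedure.
    """
    order = []
    for _ in range(repetitions):
        block = list(words)
        rng.shuffle(block)
        # Reshuffle if the block would start with the word that just ended.
        while order and block[0] == order[-1]:
            rng.shuffle(block)
        order.extend(block)
    return order


def transitional_probabilities(syllables):
    """TP(x -> y) = count(x followed by y) / count(x followed by anything)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(x, y): n / first_counts[x] for (x, y), n in pair_counts.items()}


rng = random.Random(0)
word_order = build_word_order(WORDS, REPETITIONS, rng)
syllables = [syllable for word in word_order for syllable in word]
tps = transitional_probabilities(syllables)

# Transitions inside a word versus transitions that span a word boundary.
within_word = [tps[(w[0], w[1])] for w in WORDS] + [tps[(w[1], w[2])] for w in WORDS]
final_syllables = {w[2] for w in WORDS}
across_boundary = [p for (x, _), p in tps.items() if x in final_syllables]

stream_ms = len(syllables) * SYLLABLE_MS
print(min(within_word), max(within_word))             # within-word TPs
print(sorted(round(p, 2) for p in across_boundary))   # boundary TPs, roughly 1/3
print(stream_ms)                                      # total duration in ms
```

With 4 words repeated 52 times at 696 ms each, the total is 208 x 696 ms = 144,768 ms, which matches the reported 2 min 24 s and 768 ms. Because each word can be followed by any of the three other words, boundary transitional probabilities hover around 1/3 in this sketch rather than being fixed at exactly .33.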
In addition, for each language stream 24 part-words were created by recombining the syllables of the 4 words. Thus, 12 part-words were made by concatenating the last two syllables of a word and the first one of another (part-words 2-3-1), and the other 12 were made by concatenating the last syllable of a word and the first two syllables of another (part-words 3-1-2).
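The count of 24 part-words follows from pairing every word with each of the three other words in both recombination patterns. A minimal sketch, again using hypothetical placeholder words rather than the Appendix items, and a helper name (`make_part_words`) introduced only for illustration:

```python
from itertools import permutations

# Hypothetical placeholder words, NOT the items from the paper's Appendix.
WORDS = ["pabiku", "tibudo", "golatu", "daropi"]


def syllabify(word):
    """Split a word into consonant-vowel syllables of two letters each."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]


def make_part_words(words):
    """Build 2-3-1 and 3-1-2 part-words from every ordered pair of words."""
    part_231, part_312 = [], []
    for first, second in permutations(words, 2):
        a, b = syllabify(first), syllabify(second)
        part_231.append(a[1] + a[2] + b[0])   # last two syllables + first of next
        part_312.append(a[2] + b[0] + b[1])   # last syllable + first two of next
    return part_231, part_312


part_231, part_312 = make_part_words(WORDS)
print(len(part_231), len(part_312))   # 12 12
print(part_231[0], part_312[0])       # bikuti kutibu
```

Four words give 4 x 3 = 12 ordered word pairs, hence 12 part-words of each type.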

Procedure
Each participant was exposed to either a single auditory speech stream or a single audio-visual speech stream delivered through headphones at a comfortable sound pressure level. All participants were instructed to listen carefully to the syllable stream and to identify novel words appearing in it. To ensure that the participants in the audiovisual condition were paying attention to both stimulus streams, they were instructed to try to associate the novel words with the pictures that were simultaneously presented on the screen.
Pictures in the audio-visual condition were presented in pseudorandom order, with the constraint that each picture appeared equally often with each word in the acoustic stream (13 times). In other words, there were no associative relationships between the pictures and the words. The visual stimuli were displayed on a white background with each picture extending for a 3.8° × 3.8° visual angle and were presented at the centre of the screen at an average viewing distance of 75 cm. Picture and word durations were equal (696 ms); that is, pictures were presented in a perfect onset-offset synchrony with words. Auditory streams and picture pools were counterbalanced across participants and conditions. Immediately after the auditory or audio-visual stream in each experimental condition, a test phase was presented. The test consisted of a standard auditory two-alternative-forced-choice (2AFC) test. Test items were composed of the 4 words of each stream and 4 part-words randomly selected from the pool of 24 part-words of the same stream (2 part-words corresponding to the syllable structure 2-3-1 and 2 to the syllable structure 3-1-2; see the Stimuli section for more details). Words and part-words were exhaustively combined, rendering a total of 16 pairs presented in random order. After hearing each test item pair, the participants were asked to decide, by pressing a button corresponding to the first or the second item of the pair, which item was a word of the language stream. The presentation of the items of a pair was separated by a 400-ms pause.
It should be noted that the frequencies of words and part-words are not equated in this sort of paradigm, and words appear much more often than part-words in the language stream (the word/part-word ratio is 26/9). Nevertheless, it has been demonstrated that the same results (words selected more often than part-words in the recognition test) are found when controlling for word and part-word frequency in the stream (e.g., Aslin et al., 1998; Graf, Evans, Alibali, & Saffran, 2007).
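The construction of the 2AFC test can be sketched as follows. The word and part-word strings are hypothetical placeholders, the function names are introduced only for illustration, and randomizing whether the word comes first or second within a pair is an assumption, since the text only states that the 16 pairs were presented in random order.

```python
import random
from itertools import product

# Hypothetical placeholder items, NOT the stimuli from the paper's Appendix.
WORDS = ["pabiku", "tibudo", "golatu", "daropi"]
PART_231 = ["bikuti", "budogo", "latuda", "ropipa"]
PART_312 = ["kutibu", "dogola", "tudaro", "pipabi"]


def build_2afc_trials(words, part_231, part_312, rng):
    """Pair every word with 4 sampled part-words (2 of each type): 16 trials."""
    foils = rng.sample(part_231, 2) + rng.sample(part_312, 2)
    trials = []
    for word, foil in product(words, foils):
        word_first = rng.random() < 0.5   # assumption: item position is randomized
        trials.append({
            "first": word if word_first else foil,
            "second": foil if word_first else word,
            "correct_button": 1 if word_first else 2,
        })
    rng.shuffle(trials)                   # pairs are presented in random order
    return trials


def percent_correct(trials, responses):
    """Percentage of trials on which the pressed button picked the word."""
    hits = sum(r == t["correct_button"] for t, r in zip(trials, responses))
    return 100.0 * hits / len(trials)


rng = random.Random(1)
trials = build_2afc_trials(WORDS, PART_231, PART_312, rng)
perfect = [t["correct_button"] for t in trials]
print(len(trials), percent_correct(trials, perfect))   # 16 100.0
```

Chance performance in this design is 50%, which is the reference value used in the analyses.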

Results and discussion
The mean percentages of correctly segmented words were as follows (see Figure 2): 68.75 ± 18.29% for the auditory condition and 85.58 ± 12.72% for the audio-visual condition. Both values were significantly different from chance (50%), p values < .001. Significantly more words were successfully segmented by the participants who were exposed to the audio-visual streams than by those exposed to the auditory speech streams, t(50) = -3.85, p < .001.
In support of the hypothesis, the participants' recognition performance showed a clear-cut beneficial effect of combined auditory and visual information on word segmentation when compared to purely auditory speech input. We hypothesize that this reflects multimodal sensory integration used in language learning. However, it might also simply result from heightened attention/motivation due to the presence of additional visual stimulation. To rule out the latter alternative, we ran an additional experiment.
Figure 2. Distributions of the percentages of correctly segmented nonsense words in the auditory two-alternative-forced-choice (2AFC) test administered after the auditory and audio-visual conditions (Experiment 1: audio alone condition vs. synchronous condition; Experiment 2: arrhythmic condition; Experiment 3: asynchronous 2nd-syllable vs. asynchronous 3rd-syllable). Each point corresponds to an individual participant score, and stars denote the mean values for each condition. All conditions were significantly different from chance (50%, all ps < .001).
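As a consistency check, the between-group comparison can be recomputed from the summary values alone. The sketch below assumes that the dispersion values are standard deviations, that the 52 participants were split evenly (26 per condition, consistent with the 50 degrees of freedom), and that a pooled-variance independent-samples t test was used; the function name is introduced only for illustration.

```python
import math


def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Independent-samples t statistic with pooled variance, from summary data."""
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, df


# Auditory condition vs. audio-visual condition (assumed n = 26 each).
t_value, df = pooled_t(68.75, 18.29, 26, 85.58, 12.72, 26)
print(round(t_value, 2), df)   # -3.85 50
```

Under these assumptions the recomputed statistic agrees with the reported t(50) = -3.85.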

EXPERIMENT 2
We designed a new experiment in which the duration of the image exposure was varied while the same number of picture exposures as that in the previous experiment was maintained. This manipulation yielded an audio-visual arrhythmic condition (see Figure 1). However, each picture along the visual stream continued to be synchronized with the onset of one syllable (the first, the second, or the third syllable) of each word.

Method
Participants
A new group of 24 native speakers of Spanish or Catalan who were students at the University of Barcelona participated in the study. All participants received extra course credits for their participation.

Stimuli
The audio-streams, words, part-words, pictorial stimuli, and overall set-up were the same as those in Experiment 1.

Procedure
The procedure was the same as that in Experiment 1 except that in Experiment 2 each picture in the visual stream was synchronized equally often with the second syllable onset (232 ms from word onset) and with the third syllable onset (464 ms from word onset). Moreover, each picture was displayed in the visual stream for 464, 696, and 928 ms, that is, for the duration of two, three, and four syllables, respectively (see Figure 1). In addition, a constraint was introduced in the setting so that two consecutive pictures with the same duration were not permitted in the visual sequence. The same speech segmentation test as that used in Experiment 1 was given to participants.
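The relation between picture durations and the syllable on which each picture starts can be illustrated with a small sketch. It does not generate the schedule used in the experiment; it only checks a candidate duration sequence (a made-up example) against the no-repetition constraint and reports the within-word syllable position of each picture onset, assuming that the first picture starts at the onset of the stream.

```python
from itertools import accumulate

SYLLABLE_MS = 232          # one syllable, as in the auditory stream
SYLLABLES_PER_WORD = 3


def onset_syllable_positions(durations_ms):
    """Within-word syllable position (1, 2, or 3) at which each picture starts.

    Assumption: the first picture starts at stream onset (time 0).
    """
    onsets = [0] + list(accumulate(durations_ms))[:-1]
    return [(t // SYLLABLE_MS) % SYLLABLES_PER_WORD + 1 for t in onsets]


def has_consecutive_repeats(durations_ms):
    """True if two neighbouring pictures share the same duration."""
    return any(a == b for a, b in zip(durations_ms, durations_ms[1:]))


# Made-up example schedule using the three durations named in the text.
example = [928, 696, 928, 464, 696, 464]
print(has_consecutive_repeats(example))    # False
print(onset_syllable_positions(example))   # [1, 2, 2, 3, 2, 2]
```

A three-syllable picture keeps the onset at the same within-word position, whereas two- and four-syllable pictures shift it, which is what breaks the regular word-picture rhythm.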

Results and discussion
The mean percentage of correctly segmented words was 71.88 ± 19.2% (see Figure 2). This percentage was different from chance (50%), p < .001. The arrhythmic condition did not differ from the audio-alone condition of Experiment 1, t(48) = -0.6, p > .5, but when the arrhythmic condition was compared to the synchronous condition of Experiment 1, a statistically significant difference was observed: arrhythmic vs. synchronous, t(48) = 3.0, p < .01.
These results rule out the possibility that the observed facilitation of speech segmentation in the audio-visual condition of Experiment 1 was due to general attentional/motivational effects of multimodal stimulation. Rather, the changing visual stimuli presumably catch attention, which helps in determining word onset/offset when coinciding with changes in transitional probabilities of syllables. It is worth noting here that the arrhythmic condition did not interfere with speech segmentation as compared to the audio-alone baseline. In other words, visual attention was not driving speech segmentation performance, but it was effective only when it provided congruent, useful information for the task at hand.
While the present results show a theoretically important audio-visual synchrony effect on speech segmentation performance, in real-life learning situations audio-visual information is not synchronized at the millisecond level. One would thus expect to find a temporal window within which coinciding auditory and visual information could facilitate speech segmentation. To explore this issue, we ran an additional experiment using a less than perfect word-picture synchrony.

EXPERIMENT 3
This audio-visual speech segmentation experiment involved a systematic displacement of the visual stream so that it was delayed from the onset of the auditory language stream by one or two syllables (see Figure 1). Thus, novel words and pictures were no longer synchronized. On one hand, if the effect encountered in the first experiment was merely a laboratory finding that is obtained only when there is a perfect synchrony of the auditory and visual information, it should have been abolished here, as was the case with the arrhythmic audio-visual condition of the second experiment. On the other hand, if the auditory and visual information can interact within a certain temporal window, the facilitatory effect should be observed even when the visual information is somewhat displaced in time.

Participants
Another 48 students at the University of Barcelona participated in the study. All participants were native speakers of Spanish or Catalan and received extra course credits for their participation. Participants were randomly assigned to one of the two conditions: one-syllable asynchrony and two-syllable asynchrony.

Stimuli
The audio-streams, words, part-words, pictures, and overall set-up were the same as those in Experiments 1 and 2.

Procedure
The procedure was otherwise identical to that of Experiment 1, but in two separate conditions: The visual stimuli were synchronized with the onset of the second syllable (asynchronous-2nd-syllable: 232 ms from word onset and 464 ms from word offset) or with the onset of the third syllable (asynchronous-3rd-syllable: 464 ms from word onset and 232 ms from word offset) of each word in the auditory stream (see also Figure 1). Each picture remained for a constant duration of 696 ms (three syllables) along the visual stream. Finally, the same speech segmentation test was administered to participants as that in Experiment 1, but with the main constraint that only one type of part-word was used in each condition. Thus, when the visual stream was synchronized with the onset of the second syllable, four 2-3-1 part-words (i.e., the ones synchronized with the visual stimuli in that condition) were exhaustively paired with words to create the test items. In the same way, in the asynchronous-3rd-syllable condition, 3-1-2 part-words were used to create the test pair items.

Results and discussion
The mean percentage of correctly segmented words for each experimental condition was as follows (see Figure 2): 73.96 ± 19.8% for the asynchrony-2nd-syllable condition and 79.95 ± 13.5% for the asynchrony-3rd-syllable condition. Both percentages were different from chance (50%), p < .001. The two conditions did not differ from each other, t(46) = -1.22, p > .2. The present results and the ones from Experiment 1 (the audio-alone and the synchronous conditions) were compared by a between-group one-way analysis of variance (ANOVA). The results revealed a clear task effect, F(3, 96) = 5.13, p < .01. Further pairwise t tests showed that the difference between the auditory-alone and the asynchronous-3rd-syllable condition was significant, t(48) = -2.44, p < .02. The audio-alone condition did not differ from the asynchronous-2nd-syllable condition, t(48) = -0.97, p > .3. The asynchronous-3rd-syllable condition was also not different from the synchronous condition, t(48) = -1.52, p > .1.
These results indicate that the word-picture synchronous and asynchronous-3rd-syllable conditions led to the highest word segmentation performance. Why would the word segmentation performance in the asynchronous-3rd-syllable condition be closer to perfect synchrony than that of the asynchronous-2nd-syllable condition? In the former condition, picture onset is closer to the next word boundary. Therefore, we suggest that picture onset synchronized with the last syllable highlights an upcoming low-probability syllable transition indicating a word boundary. In the asynchronous-2nd-syllable condition, the visual cue onset highlights the middle syllable of a possible word, an irrelevant position for detecting word boundaries. It should be noted that an opposite explanation (that is, the auditory information capturing attention and directing it to the visual stimuli) cannot account for the present pattern of results.
In conclusion, the audio-visual facilitation effect we report does not hinge upon perfect synchrony. Instead, there appears to be a time window (of at least 232 ms in the case of the present manipulation) within which relevant cross-modal information can be integrated with speech-related segmentation cues. This is in line with studies that have investigated the effects of lip synchrony on speech recognition. Interestingly, Hashimoto and Kumashiro (2004) found that a delay up to 120 ms (corresponding to the mean duration of the mora, the Japanese equivalent of the syllable) did not disrupt the lip-reading advantage. They concluded that visual and auditory information in speech is integrated on a syllabic time scale. It has also been shown that a strict temporal synchrony between the visual and auditory speech stimuli is not necessary for the McGurk effect to occur (Munhall, Gribble, Sacco, & Ward, 1996; Soto-Faraco & Alsius, 2009).
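The one-way ANOVA can likewise be approximated from the reported means and dispersion values. This is a rough check under assumptions: the dispersion values are taken to be standard deviations, and the group sizes are taken to be 26, 26, 24, and 24 (consistent with the reported degrees of freedom). Because the inputs are rounded, the result can differ slightly from the published F value.

```python
def one_way_anova(groups):
    """One-way between-group ANOVA from (mean, sd, n) summaries per group."""
    total_n = sum(n for _, _, n in groups)
    grand_mean = sum(mean * n for mean, _, n in groups) / total_n
    ss_between = sum(n * (mean - grand_mean) ** 2 for mean, _, n in groups)
    ss_within = sum((n - 1) * sd ** 2 for _, sd, n in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


groups = [
    (68.75, 18.29, 26),   # audio alone (Experiment 1), assumed n = 26
    (85.58, 12.72, 26),   # synchronous (Experiment 1), assumed n = 26
    (73.96, 19.80, 24),   # asynchronous 2nd syllable, assumed n = 24
    (79.95, 13.50, 24),   # asynchronous 3rd syllable, assumed n = 24
]
f_value, df_between, df_within = one_way_anova(groups)
print(round(f_value, 1), df_between, df_within)   # 5.1 3 96
```

Under these assumptions the recomputed value is close to the reported F(3, 96) = 5.13.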

GENERAL DISCUSSION
We sought to study the importance of temporal contiguity of visual information in speech segmentation. The present three experiments show that audio-visual temporal contiguity helps in segmenting words from the continuous auditory stream, but only when the audio-visual information is synchronized with word onset/offset or when the visual information changes close to the word offset. This must be a perceptual/attentional effect, as the visual information in the audio-visual condition in Experiment 1 provided only word onset-offset cues: The images appeared in random order and thus had no relationship to the specific words.
The present pattern of results thus emphasizes the importance of temporal contiguity of visual information in speech segmentation and bears relevance to the study of how the integration of different types of information or multimodal cues facilitates language learning (Hollich et al., 2000, 2005) and specifically speech segmentation. Furthermore, the present results provide the basis for a new paradigm that can be extended to study other specific aspects of multimodal language learning in well-controlled laboratory settings in adults and infants. Although infants and adults are able to track computational probabilities across syllables and are able to segment artificial speech when this statistical information is the only available segmentation cue (Saffran et al., 1996a, 1996b), it is evident that in natural learning contexts, multiple and multimodal cues are used to segment real speech (Hollich et al., 2000). This corresponds well with the everyday experience of learning a new language. Thus, the temporal contiguity between auditory and visual information such as lip movements (visible speech) or a teacher's gaze to objects, pictures, or other persons in a context provides cues that facilitate speech segmentation. The present results support previous findings in which native-language processing (Dodd, 1977; Reisberg et al., 1987; Sanders & Goodrich, 1971; Thompson & Ogden, 1995) or foreign-language learning (Davis & Kim, 2001; Reisberg et al., 1987) was facilitated with visible speech.
The facilitatory temporal contiguity effect on speech segmentation may rely on domain-general capabilities that also benefit language learning (Bloom, 2002). The temporal contiguity of the stimuli probably acts as an attentional cue that highlights the words embedded in the speech stream. We observed this effect even though in our experiments the visual cues were void of any associative relationship with the specific words. Importantly, the facilitation effect does not require perfect millisecond-level synchrony to appear. We observed a significant increase in the number of segmented words when the visual cue appeared together with the last syllable of each word in the audio-visual stream. This indicates that visual cues facilitate speech segmentation when the cue is near to an upcoming word boundary. This is also in line with previous studies in language learning suggesting that learners pay more attention to the end of words and benefit more from salient syllables (i.e., syllables carrying word-stress) placed at the end of words (see, e.g., Cunillera, Gomila, & Rodriguez-Fornells, 2008; Echols, 1993; Echols & Newport, 1992; Saffran et al., 1996b).
An important aspect of the present experiments is to understand the underlying mechanism responsible for the facilitation of speech segmentation when redundant intersensory information is provided. In principle, it is possible that the attentional cue provided by the synchrony between the onset/offset of each picture and word facilitates the computation of statistical probabilities across word boundaries. Alternatively, this attentional cue could also act independently of the statistical computation process, simply helping to identify word boundaries. However, the results from the arrhythmic and asynchronous audio-visual conditions speak against an independent attentional process that bypasses statistical learning. If that were the case, interference would have been observed in the arrhythmic and asynchronous conditions because the onset-offset of the pictures would have captured incorrect syllable transitions as the onset-offset of the words. The results depicted in Figure 2 do not show any interference in the arrhythmic or asynchrony condition as compared to the audio-alone baseline.
Our interpretation of the present results favours a model in which segmentation of continuous speech is facilitated only when visual and auditory information temporally coincide within a given time window that encompasses at least 200 ms (van Wassenhove, Grant, & Poeppel, 2007), possibly the syllable preceding the onset of a word. When these cues do not co-occur within this time range, participants might favour a default statistical learning mode, which provides enough information to be able to isolate words from the speech stream based on transitional probabilities. Disregarding visual cues when they do not temporally match the information present in the speech signal itself might be important in everyday language-learning situations. Imagine a situation where a teacher is speaking about a static object without pointing or gazing at it, or when unrelated visual cues come and go in an asynchronous fashion. In such situations, filtering out unnecessary visual information would be important in order to be able to segment the speech correctly. However, in other cases, such as lip-reading and speech, visual and auditory information tend to coincide, and, therefore, the system would benefit from the temporal contiguity between both visual and auditory cues (Rosenblum & Saldana, 1996). As we have seen, this strategy is in fact used in infant-directed communication (i.e., "multimodal motherese", Gogate et al., 2000), in which mothers provide multimodal redundant information. In the same way, if one wanted to teach someone a new word for a visual object, and that object was present, one would most likely point to it when pronouncing its name. Interestingly, it has even been found that the head movements that naturally co-occur with speech improve auditory speech perception (Munhall, Jones, Callan, Kuratate, & Vatikiotis-Bateson, 2004).
It is also important to consider that intersensory temporal synchrony does not require the different audio-visual components to occur at the same instant in time, as the perceptual system tolerates a certain amount of temporal discrepancy. This audio-visual discrepancy window is larger when a visual stimulus is presented before an auditory one (112 ms) than in the reverse, audio-visual order (65 ms; Lewkowicz, 1996). These temporal synchrony windows are very similar to the ones obtained by McGrath and Summerfield (1985). These authors presented adult participants with lip-like figures that mimicked the opening of lips, together with a tone that appeared either before or after the opening of the lips at systematically varied intervals. Asynchrony was detected only when the auditory event preceded the visual event by about 79 ms, while for the reverse (visual-auditory) order, the integration window was about 138 ms (see also Dixon & Spitz, 1980).
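The asymmetry of the synchrony window described above can be made concrete with a short sketch. It takes the adult values reported by Lewkowicz (1996) as default parameters and treats them as hard cutoffs, which is a simplifying assumption: perceived synchrony is more likely to fall off gradually than at a fixed boundary. The function and parameter names are hypothetical.

```python
def within_synchrony_window(audio_onset_ms, visual_onset_ms,
                            visual_lead_ms=112.0, audio_lead_ms=65.0):
    """Return True if two onsets fall inside an asymmetric synchrony window.

    Defaults are the adult values cited in the text (Lewkowicz, 1996);
    using them as sharp thresholds is an assumption of this sketch.
    """
    lag = audio_onset_ms - visual_onset_ms  # positive: the visual event came first
    if lag >= 0:
        return lag <= visual_lead_ms
    return -lag <= audio_lead_ms


# A picture appearing 100 ms before the word falls inside the window,
# whereas a word starting 100 ms before the picture falls outside it.
picture_first = within_synchrony_window(audio_onset_ms=100, visual_onset_ms=0)
word_first = within_synchrony_window(audio_onset_ms=0, visual_onset_ms=100)
```

Passing larger values for the two lead parameters (for example, the infant estimates reported by Lewkowicz, 1996) would model a more tolerant window under the same assumptions.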
The reason for these differences in the intersensory temporal synchrony window has to be related to the faster processing of auditory information in the central nervous system. Thus, if the auditory information is presented faster than the visual information, the temporal window that allows the creation of a unified perceptual experience is reduced when compared to visual-audio presentations. Interestingly, Lewkowicz (1996) has shown that the size of the audio-visual temporal synchrony window is larger in infants than in adults (audio-visual: 350 ms; visual-audio: 450 ms). These differences might reflect infants' inexperience with temporal discriminations and their slower rate of transmission of information in the nervous system (Lewkowicz, 1996). However, the advantage for infants of having a larger time synchrony window is that it could facilitate the identification of certain relationships in the environment that occur close together in time but with a certain degree of asynchrony. For example, Gogate et al. (2000) have shown that mothers of prelexical infants (between 5 and 8 months old) tend to use multimodal communication styles to teach labels for novel objects and actions (see also Zukow-Goldring, 1997). In particular, named labels are produced in synchrony with the actions exerted on the objects. This synchrony might help infants to detect and infer word-referent relations. Curiously enough, this tendency by mothers to use multimodal communication styles is reduced when the infant becomes older (e.g., at 21-30 months).
These lines of evidence suggest that infants might take advantage of a larger intersensory temporal window than do adults, which probably provides more capacity to unify perceptually relevant elements in the context. For example, it has been shown that, between two-and-a-half and four months of age, infants attended more to synchronous than asynchronous visible lip movements and audible speech patterns (Dodd, 1979). In addition, at four months, infants are able to recognize the correspondence between the sight of a bouncing object and a sound (Spelke, 1979). In line with this, the intersensory redundancy hypothesis (IRH; Bahrick & Lickliter, 2000, 2002) claims that overlapping of information provided by different senses helps in focusing attention on critical aspects of the environment. The redundancy, which includes synchrony, rhythm, tempo, and so on, across more than one sensory modality, is considered to be an advantage for the perceptual processing system involved in learning. It is also a possible cornerstone of perceptual development, allowing learners to selectively attend to related aspects of the multimodal information found in the input that represent unitary events and at the same time to ignore the information from unrelated events nearby (Bahrick & Lickliter, 2002; Gibson & Pick, 2000). Similarly, it is easy to conceive that mothers, who use a slower tempo when speaking to their infants ("motherese" or "infant-directed speech"; Fernald & Simon, 1984), might provide supportive multimodal information (e.g., visible speech, gestures) that could be integrated in wider time-synchrony windows. This multimodal information might help the process of identifying the boundaries of the words and, ultimately, the speech segmentation process. A similar idea has been proposed by Hollich et al. (2005) in order to explain their results of better speech segmentation in 7.5-month-old infants when synchronous visual information was provided. The authors suggest that infants might tolerate larger temporal asynchronies than do adults, increasing their capacity to segregate speech streams and segment speech, especially in noisy environments.
Thus, the infant's initial sensitivity to multimodal information provides an economical way of guiding perceptual processing to focus on meaningful, unitary events. It would be reasonable to assume that the advantage of exploiting cross-modal redundant information is preserved throughout the life span and that an adult acquiring a second language might be able to exploit the redundancy of information found in the learning environment to boost the perception of unitary speech units.
Finally, a variable that we manipulated in the present study was the visual rhythmicity. It is important to note that the auditory stream we applied provided no rhythmic properties besides the syllabic pattern, as the speech streams were synthesized with a constant syllabic length and a flat stress pattern. Rhythm is considered by linguists to be paramount for distinguishing one family of languages from another (Abercrombie, 1967; Pike, 1945), with Spanish, like most of the Romance languages, being classified as a syllable-timed language. Rhythmicity is probably the first source of information that language learners can detect in the speech signal, as newborns are able to discriminate their native language from a nonnative one when the two languages belong to different rhythmic classes (Mehler et al., 1988). In addition, Ramus and coworkers (Ramus, Hauser, Miller, Morris, & Mehler, 2000) found that cotton-top tamarin monkeys were able to discriminate continuous speech from two rhythmically distinct languages (see also Tincoff et al., 2005), indicating that rhythmicity is a general property of the auditory signal detected by, at least, the mammalian auditory system.
However, the exact properties that correspond to the perception of rhythm in the speech signal are still not well understood. Rhythm could emerge from the succession of syllables, vowels, stress patterns, pitch features, or any repeated perceptual changes detected in the speech input (Ramus, Nespor, & Mehler, 1999). In spite of the lack of a coherent definition of speech rhythmicity, several studies have explored the role of rhythm in speech segmentation. The underlying idea is that rhythm might aid in acquiring some phonological properties and that speakers of different languages might use different segmentation units, with rhythm being the cue that guides infants to select the proper unit (Cutler, Mehler, Norris, & Segui, 1986; Otake, Hatano, Cutler, & Mehler, 1993). Other studies have shown that rhythmicity provides critical cues for segmenting utterances into constituents such as clauses, phrases, and words. For example, infants at the age of 6 to 7 months can exploit overall rhythm to predict clause and phrase boundaries (Hirsh-Pasek et al., 1987), and at the age of 9 months, infants coordinate the statistical and the rhythmic structures of speech input to identify possible "word-like" multisyllabic rhythmic units (Morgan & Saffran, 1995). It seems that infants progress from detecting large rhythmic linguistic units such as clauses to finally detecting multisyllabic words, the smallest meaningful rhythmic units. This might be possible due to the progressive development of a more flexible attentional system that enables faster changes of attentional allocation.
It is evident that, in real-life word learning, multimodal cues do not appear in perfect (millisecond) synchrony. Moreover, it is possible to perceive synchronicity even if a perfect multimodal synchrony is lacking, as has been demonstrated in studies on the phenomenon known as perceptual centres (P-centres; Morton, Marcus, & Frankish, 1976). The P-centres are subjective moments of occurrence based on properties of regularity and synchrony found in production and perception (Scott, 1998). Thus, the present set-up does not claim ecological validity but was rather designed to test the role of multimodal perception in speech segmentation in a strictly controlled situation. The demonstration that the beneficial effect of audio-visual synchrony also exists in the absence of perfect synchrony and cannot be explained by more general attentional/motivational factors paves the way to research in the context of more natural language learning. More studies are also needed to further specify the time window for integration of auditory and visual information in speech segmentation. For instance, an interesting experiment would be one in which the auditory stream is accompanied by visual lip movement (visible speech) pronouncing the same syllabic stream.
In summary, temporal contiguity of intersensory information will probably sharpen and tune the efficacy of the underlying learning mechanisms, in this case the statistical learning process. Furthermore, the detection of temporal synchrony might also be very useful for both infants and second-language learners, not only to improve speech segmentation but also to detect word-object relations in natural environments. Our results support the importance of visual cues in language learning and, in line with the emergentist coalition model of infants' acquisition of native language (Hollich et al., 2000), also emphasize the potential importance of temporal contiguity at the early phase of second-language learning in adults.