THE PHONETICS AND PHONOLOGY OF ENGLISH CASUAL SPEECH: LEARNING FROM L2 LEARNERS



INTRODUCTION: ARTICULATORY PHONOLOGY AND CASUAL SPEECH
For many phoneticians who study English, interest in second-language speech is primarily practical, focused on how to change learners' pronunciation for the better. In contrast, this paper will take a more theoretical stance. Rather than focusing on what phoneticians can teach second-language (L2) learners, this paper will focus on what phoneticians can learn from L2 learners. Because L2 pronunciation combines aspects of different linguistic systems, studying that interaction can provide insight into the structure of Language (with a capital L) in general. Just as physicists can learn about the basic structure of atoms by investigating what happens when particles collide, so the study of L2 speech production provides an opportunity to learn about the basic elements of Language, by examining native patterns that persist or new patterns that are created when language systems collide. This paper, specifically, will discuss acoustic data from Korean and Russian learners of English, with an emphasis on English "connected" or "casual" speech.
The term "casual speech" refers to a speech style that contains the optional assimilations and deletions that are more common in less formal pronunciations. These often occur at word boundaries, so this speech style can also be called "connected speech." Examples include final consonant deletion, as in "mashed potatoes" pronounced as "ma[ʃ-p]otatoes," or assimilations, as in "tin pans" pronounced as "ti[m-p]ans," or a combination, as in "sandwich" pronounced as "sa[m]wich." In understanding these kinds of alternative pronunciations, I follow the theory of Articulatory Phonology, expounded by Louis Goldstein and Catherine Browman in a series of articles beginning in 1986. The theory has been widely adopted and updated since then, including work summarized in this paper. (For overviews, see Browman / Goldstein 1990, 1992, Goldstein et al. 2006, Hall 2010, Zsiga 2011.)
In the theory of Articulatory Phonology, the basic "atoms" of phonological organization are articulatory gestures: movements of articulators toward the goal of making a constriction in the vocal tract at a particular place with a particular degree of closure. Examples include labial closing, velum opening, or pharyngeal constriction. Lexical items can contrast in the presence or absence of gestures, in gestural specification, and in the way the gestures are coordinated in time. Allophony is the result of differences in gestural organization.
Figure 1. Gestural scores for the words "bad," "ban," "pan" and "span"
Gestural organization can be graphically represented in a "gestural score," as shown in Figure 1. As in an orchestral score, the x-axis is time, and the different rows of the score indicate the parts played by different articulators, sometimes sequential, sometimes overlapping. In Figure 1A, the word "bad" is shown to have three gestures: labial closing, lowering and fronting of the tongue body, and closure of the tongue tip at the alveolar ridge.
Note that the consonant gestures overlap in time with the gesture for the vowel. Because the labial closure and tongue body gesture begin at approximately the same time, the tongue body will fully reach the vowel "target" by the time the lips open. The vocal tract remains open for a time for the vowel, and then the tongue tip gesture for [d] begins as the vowel gesture ends. In Figure 1B, a velum opening gesture has been added to the end of the word, creating "ban." (If the gesture had been added at the beginning, the word would be "mad.") The velum opening overlaps not only with the consonant, making a fully nasal [n], but begins during the vowel. This pattern of overlap creates partial vowel nasalization. The allophonic "rule" of vowel nasalization comes about because of this pattern of gestural timing.
In Figure 1C, a laryngeal opening gesture has been added, which turns "ban" into "pan." In the current model, the default state for speech is voicing, so only gestures for devoicing (laryngeal opening) are indicated. Note that the laryngeal opening gesture extends into the beginning of the vowel. The result of this pattern of co-ordination is aspiration: a period of time after the labial closure has been released, during which the larynx remains open. Again, this allophonic "rule" comes about because of a specific pattern of gestural timing.
Finally, Figure 1D shows the word "span." A gesture for an alveolar fricative (a "critical" constriction of the tongue tip, meaning exactly the right constriction to create turbulence) has been added. In an initial consonant cluster, the two consonant gestures are re-organized: [p] moves somewhat to the right and [s] somewhat to the left, so that the laryngeal opening is centered in the middle of the cluster, not on any one gesture (Browman/Goldstein 1988). The result is that the laryngeal opening no longer extends into the vowel: [p] is unaspirated after [s].
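The timing relations described for Figure 1 can be made concrete with a small sketch. The Python fragment below is not part of the Articulatory Phonology literature or of any existing toolkit: the `Gesture` class, the tier names, and every millisecond value are invented for illustration. It only shows how aspiration can be read off a gestural score as the stretch of laryngeal opening that outlasts the release of the stop closure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gesture:
    tier: str     # articulator row in the score, e.g. "lips"
    goal: str     # constriction goal, e.g. "closure"
    start: float  # onset in ms (invented value)
    end: float    # offset, i.e. release, in ms (invented value)


def aspiration_ms(score, stop_tier):
    """Time the larynx stays open after the stop closure on stop_tier is released."""
    stop = next(g for g in score if g.tier == stop_tier and g.goal == "closure")
    glottal = next(g for g in score if g.tier == "larynx")
    return max(0.0, glottal.end - stop.end)


# "pan": the laryngeal opening extends past the labial release into the vowel.
PAN = [
    Gesture("lips", "closure", 0, 100),
    Gesture("larynx", "opening", 0, 160),
    Gesture("tongue body", "low front vowel", 0, 300),
    Gesture("tongue tip", "closure", 280, 380),
    Gesture("velum", "opening", 200, 380),
]

# "span": [s] and [p] are re-timed so the laryngeal opening is centered
# on the cluster (0-180 ms) and ends before the labial release.
SPAN = [
    Gesture("tongue tip", "critical", 0, 100),
    Gesture("lips", "closure", 80, 180),
    Gesture("larynx", "opening", 20, 160),
    Gesture("tongue body", "low front vowel", 80, 380),
    Gesture("tongue tip", "closure", 360, 460),
    Gesture("velum", "opening", 280, 460),
]

pan_lag = aspiration_ms(PAN, "lips")
span_lag = aspiration_ms(SPAN, "lips")
```

With these invented timings, "pan" shows a 60 ms lag between labial release and the end of laryngeal opening, while "span" shows none, mirroring the contrast between aspirated and unaspirated [p].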
Patterns of gestural overlap can also account for allophonic variation at word boundaries. For example, Zsiga (1995, 2000) argued that palatalization of [s] to [ʃ] in American English phrases like "miss you" is the result of gestural overlap. Figure 2A shows gestures for [s] (critical tongue tip) and [j] (palatal tongue body) as they might be organized in the words "miss" and "you" pronounced with a pause between. There is no overlap between the final [s] and the initial [j]. But in "connected speech" in English, the words overlap, as shown in Figure 2B. As a result, the two gestures blend together, producing a fricative that sounds most like [ʃ], though careful acoustic analysis shows that it is not exactly the same as [ʃ].
Figure 3 shows a spectrogram of the phrase "press you" pronounced in connected speech by a native English speaker (Zsiga 2000). The first part of the fricative is high-pitched, corresponding to [s]. The second part, however, shows overlap with [j]: note the high F2 and F3 formants visible on the spectrogram, co-extensive with the second half of fricative noise. The acoustic result of the overlap is lowered pitch, more similar to [ʃ]. The pitch lowering, however, is partial and gradual. Zsiga (2000) interprets the gradient change as evidence against a phonological rule substituting [ʃ] for [s], and in favor of an analysis based on gestural overlap.
Figure 3. Spectrogram of the phrase "press you" pronounced in connected speech by an English speaker, showing gradient palatalization due to gestural overlap (Zsiga 2000: 85)
It is important to keep in mind, however, that patterns of gestural overlap are language specific. Figure 4 (also from Zsiga 2000) shows an [s#j] sequence in Russian (from the phrase /pas jejo/ "tended it"). For a Russian speaker, words do not overlap: the fricative /s/ is completed before the fricative /j/ begins.
This pattern of overlap is not just a fact about /s/ and /j/: Russian speakers in general keep their words separate, and thus Russian does not show the type of connected speech assimilations typical of English. Note that palatalized [sʲ] is a different story: for this contrastive segment the two gestures are simultaneous, and both are carefully maintained, a pattern of articulation that English speakers learning Russian find very hard to accomplish. English learners of Russian, however, are liable to pronounce the name "Boris Yeltsin" as Bori[ʃ] Yeltsin, transferring the English pattern of overlap to their Russian pronunciation.
Is the assimilation seen in American English palatalization, and other casual speech processes, phonology or phonetics? Traditionally, such alternations are described in terms of phonological rules. The lexicon specifies a basic form of each word, and then a set of rules indicates how to pronounce these words in different environments, by inserting, deleting, or changing one segment into another: [n] becomes [m] before a labial, or [s] becomes [ʃ] before a palatal, either within words or at word boundaries. Only then, after the phonological computation is completed, does the speaker's brain send instructions to the articulators to produce the specified sequence.
Articulatory Phonology argues that phonological rules such as these are not needed. According to the theory, the lexicon specifies different basic word forms that the speaker memorizes. Some of these stored lexical items may be semantically or syntactically related to each other, such as "press" and "pressure" or "balance" and "imbalance," but they are not derived from one another by phonological rule. Articulatory Phonology explains the variable and partial assimilations and deletions of connected speech in terms of gestural reorganization and overlap, as described above, not segment substitution. The question then arises of whether there is space for "phonology" in Articulatory Phonology. Does a speaker make generalizations that are language-specific, but independent of lexical specification, "rules" that are more general than the pronunciation of specific words, but planned at a higher level than the unintended consequences of coarticulation?
To answer this question, we can learn by looking at the speech of language learners. L2 learners often transfer segmental pronunciation and allophony from the L1. What do they do with connected, casual speech? If the assimilations of casual speech are associated with specific lexical items, we would expect that no transfer from L1 to L2 would occur, because the lexical items in the two languages are different. If assimilations do transfer, and are found to be partial and gradient, that would be evidence for an account based in phonetic coarticulation and gestural overlap. The third possibility is that casual speech alternations do transfer, but look like categorical substitutions rather than gestural blending. If that is the case, then a phonological level separate from both the lexicon and articulatory organization (a level of productive phonological rules or constraints) must be posited.
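The three-way prediction just outlined can be restated as a small lookup. The sketch below is only a compact paraphrase of the argument: the dictionary, the account labels, and the function name are hypothetical conveniences, not terminology from the literature, and real data may of course be mixed rather than falling into a single cell.

```python
# Hypothetical encoding of the three predictions; the labels are illustrative.
PREDICTIONS = {
    "lexical listing": {"transfers": False, "pattern": None},
    "gestural overlap": {"transfers": True, "pattern": "gradient"},
    "phonological rule or constraint": {"transfers": True, "pattern": "categorical"},
}


def compatible_accounts(transfers, pattern=None):
    """Return the accounts whose predictions match an observed outcome.

    transfers: whether the L1 casual-speech alternation shows up in L2 speech.
    pattern:   "gradient" or "categorical"; ignored when nothing transfers.
    """
    observed = {"transfers": transfers, "pattern": pattern if transfers else None}
    return [name for name, predicted in PREDICTIONS.items() if predicted == observed]
```

For example, `compatible_accounts(True, "categorical")` singles out the phonological account, while `compatible_accounts(True, "gradient")` singles out gestural overlap.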

CONNECTED SPEECH IN L2
Studies on connected speech in L2 are rather sparse, and results are not uniform. Most studies have found that learners often don't use the relevant rules of casual speech from either the L1 or L2. Instead they may use an "interlanguage" pattern that corresponds to neither. The usual interlanguage pattern seems to keep words separate, even if applying rules from the L1 would make L2 pronunciation more native-like.
For example, Weinberger (1994) reported on the speech of Mandarin speakers of English. In careful speech, Mandarin does not allow word-final obstruents. But in casual speech, Mandarin speakers can delete certain final vowels, leaving obstruents in word-final position, so that /tòufu/ may be pronounced as [tòuf]. (Vowels are deleted in casual speech if they are stressless and toneless, and share the place of articulation of the preceding consonant: [u] after labials, [i] after dentals, and retroflex after retroflex.) Final epenthesis of vowels is found in neither L1 English nor L1 Mandarin. Nonetheless, Weinberger found that the same speakers who pronounce /tòufu/ as [tòuf] in their L1 pronounce "loaf" as [lofu].
As (1) shows, voiced obstruents in Catalan are devoiced in word-final position: /vaz/ "glass" is pronounced as [vas]. But in connected speech, as shown in (2), the word-final obstruent assimilates to the voicing of a following consonant: [vaz gran] "big glass," [vas petit] "small glass." Cebrian (2000) found that Catalan learners of English transfer word-final devoicing, but not cross-word voicing assimilation, to L2 English, as shown in (3).
(3) Catalan speakers' pronunciation of English phrases
wise guy      wi[s] guy
proud girl    prou[t] girl
In the cases of "wise guy" and "proud girl," transferring the Catalan rule of voicing assimilation would have resulted in the correct English pronunciation, but the Catalan speakers in Cebrian's study did not do that. Cebrian argues that the speakers were obeying an "interlanguage prosodic constraint" he terms "Word Integrity." The principle of Word Integrity requires that words in L2 be unconnected. Word Integrity "treats every word as a separate unit and prevents the synchronization of sounds belonging to different words" (2000: 19). Zsiga (2003) also found evidence for Word Integrity in pronunciations by Russian learners of English. These speakers tended to release final consonants in clusters, for example pronouncing "make parts" as m[ekʰ pʰ]arts, rather than the more overlapped L1 English pattern that results in unreleased final consonants. But does Word Integrity always hold of L2 speech? A subsequent study, reported in Zsiga (2011) and summarized below, found that it does not.
Zsiga (2011) recorded the speech of 12 native Korean learners of English. Six were classified as advanced L2 speakers, with a mean of 15.4 years of English instruction (usually beginning in middle school), and a mean of 4.1 years of residence in the United States. These were speakers with very high TOEFL scores or the equivalent, who were studying in degree programs at U.S. universities. The other six speakers were classified as intermediate. They also had many years of English instruction (mean of 8.5 years) but had lived in the U.S. for a year or less (mean of 5.9 months). They were for the most part family members of students or diplomats, and all were enrolled in intermediate-level ESL classes rather than degree programs. In addition to the 12 native Korean speakers, three native English speakers were recorded as controls.

AN EXPERIMENT ON NASALIZATION IN KOREAN ENGLISH
The study focused on one aspect of Korean and Korean-English pronunciation: word-final voiceless stops followed by a word-initial nasal. In Korean, these stops undergo nasal assimilation, as shown in (4).
(4) Nasal assimilation in Korean
[pap] "rice"       [pam mekta] "eat rice"
[ot] "clothes"     [on man] "only clothes"
The 12 speakers read sentences in Korean and English that contained stop#nasal sequences and nasal#nasal sequences, interspersed with other sentences that targeted obstruent#obstruent sequences, in randomized order. There were eight nasal-target sentences in Korean and 16 in English. Korean examples included [kimpap mekta] "eat sushi" and [ot neta] "put clothes in," matched with structurally similar English phrases such as "keep Matt awake" and "bought nine." Korean and English were recorded in separate blocks, with the Korean sentences read first. Instructions were given in Korean by a native Korean-speaking research assistant, and the Korean sentences were written in Korean orthography. Three repetitions of each sentence were recorded. The three native English speakers read only the English sentences.
The recordings were then transcribed by two listeners, one native Korean, one native English, focusing on the word-final consonant. Acoustic measures were also taken, including consonant duration, duration of voicing, duration of nasalization, and presence or absence of audible release on the word-final consonant. The question to be answered was: how did the Korean speakers pronounce these word-final consonants? Did they follow the native Korean pattern, the native English pattern, or an interlanguage pattern that corresponds to neither L1 nor L2?
The native Korean tokens were analyzed to establish the L1 baseline. As predicted, in native Korean, 93% of the obstruent#nasal sequences were pronounced as nasal#nasal. An example is shown in Figure 6, which shows the syllables [on man] from underlying /os mantulta/ "make clothes." There was no measurable difference between underlying and derived [n#m]. A few tokens in the dataset (5%) were pronounced with the word-final consonant fully or partially voiced but not nasalized, and a few others (2%) remained fully voiceless. The categorical change in 93% of tokens argues for a phonological substitution: in Korean, [n] is substituted for [t] and [m] for [p] when a nasal consonant begins the next word.
Figure 6. Spectrogram of [on man] from /os mantulta/ "make clothes" (Zsiga 2011: 307)
Examples of stop#nasal sequences in native American English are shown in Figure 7. Consistent with previous studies by Cohn (1993) and by Huffman (1989), 76% of the obstruent#nasal sequences produced by the native English speakers had voiceless closure with no release, usually with some glottalization on the preceding vowel. An additional 13% had creaky voicing throughout the closure. Only 11% showed an audible release between stop and nasal. None of the tokens produced by native English speakers had modal voicing or nasalization during the stop closure.
Figure 7. Spectrograms of two repetitions of the phrase "ate mine" by native English speakers (Zsiga 2011: 309)
In contrast to the consistency shown by native speakers of both languages, the realization of pre-nasal stops by Korean speakers of English was highly variable, as shown in Table 1. Though no speaker was completely consistent, each speaker did have a predominant or preferred pattern of pronunciation: four speakers produced a majority of tokens with nasal consonants, seven produced a majority of tokens with unreleased consonants, and one produced a majority of tokens with released final consonants.
It is worth noting that there was no significant effect of level of instruction. The additional years of English instruction that the advanced speakers had received, and their high levels of proficiency in written English, did not make a significant difference in their pronunciation of these sequences.
Figure 8 shows an example spectrogram of a final voiceless stop with audible release, from the phrase "ate mine" by Speaker K8. This is the pattern that is predicted by the principle of Word Integrity proposed by Cebrian (2000) and given preliminary support in the Russian-English data collected by Zsiga (2003). Although these released final stops are not typical of either L1 Korean or L1 American English, they do signal a clear separation between words. Tokens exhibiting Word Integrity were a minority in these data, however, at only 15% of tokens, mostly produced by only one speaker.
Figure 9 shows four examples of unreleased final stops, with variable amounts of voicing during the oral closure. This type of token represented the plurality (49%) of the data. Some tokens, such as 9A, showed no voicing during the oral closure and some, such as 9D, showed voicing throughout the oral closure. Most tokens, however, showed partial voicing of variable duration, as in 9B and 9C.
Figure 9. Unreleased stops in Korean English, with variable voicing: A. "bought nine"; B, C, D. "keep Matt" (Zsiga 2011: 318-320)
In addition, voicing of obstruents was found not only in pre-nasal position, but in all intersonorant environments. An example is shown in Figure 10. The phrase is "that ticket will," extracted from the target sentence "I hope that ticket will stop Nan from speeding," which was included in the dataset for the [p#n] sequence. The beginning words of the phrase, however, illustrate the pattern of voicing that was typical throughout the dataset. The initial /t/ and medial /k/ of "ticket" both show voicing for about half the closure duration, and the final /t/ of the word, preceding /w/, is fully voiced. The existence of gradient intersonorant voicing in the L2 English data indicates the transfer of coarticulatory routines (that is, patterns of gestural overlap) from native Korean to L2 English. The transfer applies here not just within words, but across word boundaries. Overall, the voicing data support an Articulatory Phonology account. What about the pattern of nasalization, which in native Korean appears to be category-changing phonology? Will connected speech phonological rules also transfer?
As was shown in Table 1, 23% of pre-nasal obstruents in this L2 dataset were pronounced as "full nasals," identical to an underlying nasal sequence. The formant patterns and amplitude are consistent with 100% nasality during the consonant sequence, and the durations of underlying and derived nasal sequences do not differ. For example, the mean duration of an [n#m] sequence derived from [t#m] was 149 ms, exactly the same as the mean for an underlying [n#m] sequence. Figure 11 compares an underlying nasal sequence from the phrase "train Matt" to a derived nasal sequence from the phrase "ate mine." Both are pronounced as [n#m]. These "full nasals" are indicative of a categorical phonological substitution, not gestural overlap. The phonological rule of nasalization has transferred from L1 Korean to L2 English.
Figure 11. [n#m] sequences in Korean English (Zsiga 2011: 312)
Another 9% of the data, however, can be categorized as "partial nasals." These segments were transcribed as nasal consonants, but further acoustic analysis showed that the nasalization was partial and gradient. An example is shown in Figure 12, a spectrogram of the phrase "pick Nat." This is not a case of substitution, however. In this case, the consonant sequence begins with a very short oral period, transitioning quickly to nasality. These partial nasals are better analyzed in terms of gestural overlap rather than phonological substitution. They sound like nasal consonants, but the nasalization is gradient and variable.
Figure 12. A spectrogram of the phrase "pick Nat" in Korean English showing partial nasalization (Zsiga 2011: 313)
Comparison of Figure 12 and Figure 7 shows similarity between the "partial nasal" realizations in L2 English and typical L1 English pronunciations. Both show close connection between words, and presumed gestural overlap at word boundaries. The difference is that during the oral part of the sequence, Korean native speakers show modal voicing, consistent with the pattern of their L1, while the American English speakers show glottalization.
These data, then, show evidence for two kinds of nasalization at word boundaries in Korean-English. In many cases, derived nasal sequences show formant patterns and amplitude consistent with nasality for 100% of closure. These are indistinguishable from underlying nasal sequences and are best analyzed as a categorical alternation. There are other sequences, however, that sound like nasals but show a gradient and variable pattern instead. This pattern is more like L1 English, but with modal voicing instead of glottalization. This pattern is best analyzed in terms of gestural overlap. There is evidence in this data for transfer of both L1 phonological substitution, for the full nasals, and of L1 gestural coordination, for the partial nasals and the voiced stops.
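The distinction between the token types discussed above can be sketched as a simple decision procedure over acoustic measurements. This is an illustration only, not the procedure used in the study: the `Token` fields, the 0.95 cutoff for treating nasality as covering the whole closure, and the example measurements are all assumptions introduced here. In particular, the study identified partial nasals by transcription followed by acoustic inspection, which the sketch simplifies to "some but not all of the closure is nasal."

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    closure_ms: float  # duration of the word-final consonant closure
    nasal_ms: float    # portion of the closure showing nasalization
    released: bool     # audible release before the following nasal


FULL_NASAL_MIN = 0.95  # assumed cutoff, not a value from the study


def classify(token):
    """Assign a token to one of four descriptive categories."""
    if token.closure_ms <= 0:
        raise ValueError("closure duration must be positive")
    if token.released:
        return "released"
    share = token.nasal_ms / token.closure_ms
    if share >= FULL_NASAL_MIN:
        return "full nasal"      # nasal throughout: categorical substitution
    if share > 0:
        return "partial nasal"   # brief oral portion: gestural overlap
    return "unreleased"          # oral closure, with or without voicing


# Invented measurements, one per category.
tokens = [
    Token(closure_ms=150, nasal_ms=150, released=False),
    Token(closure_ms=140, nasal_ms=110, released=False),
    Token(closure_ms=120, nasal_ms=0, released=False),
    Token(closure_ms=130, nasal_ms=0, released=True),
]
tally = Counter(classify(t) for t in tokens)
```

A per-speaker `Counter` of this kind is one way to arrive at the "predominant pattern" summaries reported for each speaker.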

LEARNING FROM LEARNERS
In conclusion, what does the study of L2 English teach us about phonology? First, Word Integrity is at least partially wrong. Although one speaker did keep her words separate, she was in the minority. The majority of L2 English speakers in this experiment did "synchronize words," as evidenced by relatively high rates of nasalization and voicing across word boundaries. It remains unclear why the data in this experiment showed closer connections between words than previous experiments on Russian and Catalan speakers of English (Zsiga 2003, Cebrian 2000). It might be that the greater prevalence of connected speech assimilations is a fact particular to Korean or to Korean English. More data on different language pairs will help shed further light on this question.
Second, Articulatory Phonology is at least partially right. Most subjects showed gradient and variable intersonorant voicing, and many showed gradient and variable nasalization. These processes are consistent with the Articulatory Phonology analysis of gestural overlap at word boundaries.
However, there is also categorical assimilation across word boundaries. In native Korean, cross-word nasalization is consistently categorical. And in some cases, cross-word categorical nasalization transfers to L2 English. If nasalization was due to alternants listed in the lexicon, it would not transfer to English words. If it was due to gestural reorganization, it would not be categorical. Because it is in many cases categorical, and because it does transfer, these results show that Korean nasalization is a phonological alternation that is neither pre-specified in the lexicon nor the result of simple coarticulation. There is still a real phonology between the lexicon and speech production.

SUMMARY
This paper examines processes of "connected" or "casual" speech in second language pronunciation, focusing on the speech of Korean learners of English. The paper begins with the point of view of Articulatory Phonology, which argues that many assimilations and deletions in casual speech are the result of overlap between articulatory gestures. Examples from English and Russian illustrate gestural overlap. Further examples are provided from a more detailed phonetic study of processes of nasalization and voicing assimilation in Korean and Korean-accented English (Zsiga 2011). The Korean-English data show evidence of gradient gestural overlap in voicing assimilation, and in some instances of partial nasal assimilation, supporting the Articulatory Phonology approach. Many instances of categorical nasal substitution were also found, however. It is argued that a more traditional phonological feature-changing analysis better accounts for the categorical changes. Both Articulatory Phonology and traditional feature-based phonology are required to account for the full set of data.