1.0. Overview and some basics
ToBI (Tones and Break Indices) transcription is a method for transcribing the distinctive prosodic events of a spoken utterance. These distinctive events fall into two main categories:
a) accents, which contribute to a word's relative prominence in an utterance and
b) phrasing, which creates a grouping of words.
These suprasegmental aspects of spoken language can signal semantic, syntactic, or even morphological distinctions between renditions of a spoken utterance. For these reasons, speech scientists are interested in annotating prosodic structure.
A challenging aspect of transcribing prosody is its substantial variability. For example, an f0 that is high in a speaker's range on a salient syllable can mark a pitch accent, but so can a lower, though still high-for-this-speaker, f0 on another salient syllable. Moreover, a high f0 might mark a pitch-accented word, or it could be a phrase accent or boundary tone. To determine the appropriate ToBI transcription, the entire utterance must be parsed, that is, understood as a whole, to determine how the high and low tones are implemented. This tutorial begins by introducing utterances with unambiguous ToBI annotations and gradually includes utterances on which even experienced labellers might, in part, disagree.
Such disagreement should not be disturbing if one recognizes that ambiguous cues are present in other types of speech annotation as well. For example, the acoustic cues for a canonical /t/ might not be present in some renditions of the word "butter". However, the existence of a well-established lexicon allows annotators to recognize this as a variation of a single lexical item and to label it consistently. ToBI labelling, by contrast, is a new endeavor compared to, say, phonetic transcription. At this writing, the extent of variation in ToBI labels is not fully understood, and this results in some reasonable disagreement about the prosodic parses of some renditions of some utterances, particularly in spontaneous speech.
1.1. The basic parts of ToBI
A ToBI transcription of an utterance consists minimally of a recording of the speech, its fundamental frequency contour, and (the transcription proper) symbolic labels for prosodic events. The transcription is usually arranged in four time-aligned parallel panels or tiers, so that the symbolic labels can be easily paired with the corresponding f0 track and speech waveform. (Other tiers can be added for the needs of particular sites -- see Section xx.) The four labeling tiers, which appear in their own window, are:
(1) a Tone tier, for transcribing tonal events
(2) an Orthographic tier, for transcribing words
(3) a Break-Index tier, for transcribing boundaries between words
(4) a Miscellaneous tier, for recording additional observations
In addition, two new tiers have been suggested for labelers who find them useful. These are not used in the examples in the beginning of this tutorial, but will be explained in Section 2.12:
(5) an Alternative tier, for transcribing alternative labels in the case of ambiguity
(6) a Discussion tier, for recording data for selected research issues
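As a rough illustration of how time-aligned tiers fit together, the sketch below models a transcription as a set of named tiers, each holding time-stamped labels. Only the four core tier names come from the list above. The class names, fields, times, and placeholder labels are hypothetical and invented for illustration; this is not the file format of any particular labelling program.

```python
from dataclasses import dataclass, field

# The four core tier names come from the list above; everything else
# (class names, fields, times, placeholder labels) is hypothetical.
CORE_TIERS = ("tone", "orthographic", "break-index", "miscellaneous")


@dataclass
class Label:
    time: float  # seconds from the start of the recording
    text: str    # the symbolic label typed by the labeller


@dataclass
class Transcription:
    tiers: dict = field(default_factory=lambda: {name: [] for name in CORE_TIERS})

    def add(self, tier: str, time: float, text: str) -> None:
        """Add a label to a tier, keeping the tier sorted by time."""
        if tier not in self.tiers:
            raise KeyError(f"unknown tier: {tier}")
        self.tiers[tier].append(Label(time, text))
        self.tiers[tier].sort(key=lambda label: label.time)

    def labels_between(self, tier: str, start: float, end: float) -> list:
        """Return the label texts on a tier whose times fall in [start, end]."""
        return [lab.text for lab in self.tiers[tier] if start <= lab.time <= end]


# Placeholder labels at invented times, not a transcription of a real recording.
t = Transcription()
t.add("orthographic", 0.90, "word2")
t.add("orthographic", 0.40, "word1")
t.add("break-index", 0.40, "1")
t.add("break-index", 0.90, "4")
print(t.labels_between("orthographic", 0.0, 1.0))
print(t.labels_between("break-index", 0.5, 1.0))
```

Because every label carries a time, labels on different tiers can be paired with each other and with the f0 track simply by comparing times.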
One popular program for labeling and displaying ToBI transcriptions is Praat (available at http://www.praat.org). Here is an example that uses Praat to display the waveform, the f0 contour, the spectrogram, and the first four tiers for the utterance "Armani knew the millionaire".
Play armani10.wav:
The first window in this display is the waveform of the recorded utterance; later, when you learn how to expand the horizontal time scale in this display, you will see more clearly the individual pitch pulses that are made by the vocal folds as they vibrate. The vertical axis corresponds to the amplitude of the vibration and the horizontal axis to time, and you can see in the waveform that the amplitude varies roughly with the syllable structure of the utterance: large for vowels (where the mouth is open relatively wide), smaller for consonants (which have a constriction in the vocal tract), and zero (or nearly zero) when there is no speech signal.
The rate of vibration of the vocal folds is what we hear as pitch, and this is represented in the second window as a semi-continuous blue line superimposed on a different representation of the speech signal, the spectrogram (see "What is the spectrogram?" below). When you play this utterance, notice where you hear a higher pitch. Do these words and syllables correspond to the places where the f0 contour (also called the pitch track or the f0 track, although f0 and perceived pitch are not exactly the same thing) is higher, as indicated by the blue line?
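To make the idea of f0 as a rate of vibration concrete, here is a minimal sketch that estimates f0 from a synthetic periodic signal by finding the time shift at which the signal best matches itself. The function name, the 75 to 500 Hz search range, and the test signal are all assumptions made for illustration; real pitch trackers are considerably more elaborate, and this is not a description of how any particular program computes its pitch track.

```python
import math


def estimate_f0(samples, sample_rate, f0_min=75.0, f0_max=500.0):
    """Return a rough f0 estimate in Hz for a voiced stretch of signal.

    Tries every candidate period (lag) between 1/f0_max and 1/f0_min and
    keeps the lag at which the signal is most similar to a shifted copy of
    itself (highest mean autocorrelation).
    """
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        n = len(samples) - lag
        if n <= 0:
            break  # the signal is too short to test longer periods
        score = sum(samples[i] * samples[i + lag] for i in range(n)) / n
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic stand-in for a vowel: 0.1 s of a 100 Hz sine sampled at 8 kHz.
SAMPLE_RATE = 8000
TRUE_F0 = 100.0
signal = [math.sin(2 * math.pi * TRUE_F0 * i / SAMPLE_RATE) for i in range(800)]

estimated_f0 = estimate_f0(signal, SAMPLE_RATE)
print(f"estimated f0: {estimated_f0:.1f} Hz")
```

The estimate should land close to the 100 Hz used to build the signal. In real speech the signal is only roughly periodic, and not at all during voiceless sounds and silence, which is one reason the pitch track is semi-continuous rather than a single unbroken line.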
Underneath the spectrogram with the blue pitch contour is the set of four thinner white boxes that make up the four tiers in the ToBI labeling window. Unlike the speech displays, these boxes are editable: they are the tiers where the labeller types in the ToBI transcription labels. The top white box is the Tone tier, and the third box is the Break-Index tier. These two tiers represent the core prosodic analysis. The Tone tier is the part of the transcription that corresponds most closely to a phonological analysis of the utterance's intonation pattern. It consists of labels for distinctive pitch events, transcribed as a sequence of high (H) and low (L) tones marked with diacritics indicating their intonational function. Tones function as pitch accents, as parts of pitch accents, or as phrase accents and boundary tones marking the edges of two types of intonationally marked prosodic units (phrases). These categories are based on the work of Janet Pierrehumbert (1980) and joint work by Mary Beckman and Janet Pierrehumbert (1986, 1988).
The Break-Index tier captures the prosodic grouping of the words in an utterance by labelling the end of each word for the subjective strength of its association with the next word, on a scale from 0 (for the strongest perceived conjoining/weakest perceived boundary, as in "doncha" for "don't you") to 4 (for the most disjoint, i.e. at the end of an intonationally marked major phrase). These categories of association strength, or 'break indices', are based on work by Mari Ostendorf, Patti Price, Stefanie Shattuck-Hufnagel, and their associates (Price et al., 1991). The two highest break indices (3 and 4) are equated with two kinds of prosodic groupings (or phrases) that are marked intonationally; groupings of Intonational Phrases are not marked in the ToBI system.
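The way break indices induce a grouping of words can be sketched in a few lines. In the example below, each word carries the break index marked at its end, and a group is closed after any word whose index reaches a chosen threshold. The function name, the example words, and their break indices are invented for illustration and are not a transcription of a real recording.

```python
def group_words(words, break_indices, min_break):
    """Split words into groups, closing a group after any word whose
    break index is at least min_break."""
    if len(words) != len(break_indices):
        raise ValueError("each word needs exactly one break index")
    groups, current = [], []
    for word, index in zip(words, break_indices):
        if not 0 <= index <= 4:
            raise ValueError(f"break index out of range 0-4: {index}")
        current.append(word)
        if index >= min_break:
            groups.append(current)
            current = []
    if current:  # trailing words with no closing boundary
        groups.append(current)
    return groups


# Invented example: one break index per word, marked at the end of the word.
words = ["when", "you", "go", "call", "me"]
breaks = [1, 1, 3, 1, 4]

print(group_words(words, breaks, min_break=3))
# [['when', 'you', 'go'], ['call', 'me']]
print(group_words(words, breaks, min_break=4))
# [['when', 'you', 'go', 'call', 'me']]
```

With a threshold of 3 the words fall into two smaller groups; with a threshold of 4 they form a single larger group. This mirrors the two kinds of intonationally marked groupings associated with the two highest break indices.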
The Orthographic tier is the second white box. It contains a straightforward transcription of all of the words in the utterance, in ordinary English orthography. The word transcriptions are aligned with their locations in the speech waveform. For labellers using Praat or a similar labelling application, the convention is to place the orthographic label for a word between two marks that delineate the approximate time interval in the signal that corresponds to the utterance of that word, and to place a silence label to mark silence between words, if any. The Orthographic tier is arguably not part of any core prosodic analysis, except inasmuch as its labels can be used to interface the transcription to dictionary entries. Such entries do indicate things like which syllable is likely to be most stressed in each word, prosodic information which is not otherwise included in the ToBI system (more on this below).
The Miscellaneous tier is the bottom white box in this display. It is essentially a 'comment' tier that can be used to mark events such as coughs, laughter, long silences and other non-speech events and is traditionally marked with angle brackets (e.g. cough< ... >cough). Like the orthographic tier, it can include many events that are arguably not part of prosody per se. However, many events that are typically marked on the Miscellaneous tier are important for interpreting the analyses on the Tone tier and Break-Index tier, because they disrupt the smooth rhythm of the utterance or interrupt the intonation contour. Labels on this tier usually come in pairs, to mark the beginning and end of each event interval; one exception is the label 'disfl', which often stands alone to flag the occurrence of a perceived disfluency of some type.
1.2. Guiding principles
As the preceding discussion shows, ToBI does not try to transcribe all aspects of prosody, or even all aspects that are amenable to symbolic transcription. In deciding what to include and what to leave out, we are guided by three principles. First, we want to be able to distinguish in our transcription all of the categorically distinct intonation patterns and prosodic units of the language (in this case, Mainstream American English, or MAE; see Jun 2005 for ToBI systems for other languages and dialects). Second, we do not transcribe aspects of prosody which are more amenable to continuous-valued quantitative measures than to the categorical divisions of a symbolic transcription, such as the slope of the changing f0 curve. Finally, we do not want to squander the user's energies in transcribing even categorical aspects of prosody which are predictable from other parts of the transcription or from auxiliary tools such as dictionaries that can be used to determine the location of lexical stress within words.
The categorical aspects of prosody which we try to capture completely (according to the first principle) are of two types. The first is the prosodic structure -- the alternating rhythm of more and less prominent words and syllables and the grouping of words into prosodic constituents of various sizes -- and the second is the intonation pattern -- the sequence of contrastive pitch events that we call pitch accents, phrase accents, and boundary tones.
[The next two paragraphs contain further discussion of what is not captured by a ToBI transcription; read them if you are curious about this question. Otherwise, skip directly to Chapter 2.0.]
An example of the non-categorical aspects of prosody which we leave out (in accordance with the second principle) is the local tempo of each word in the utterance, which we feel could be more accurately and directly captured by some quantitative measure such as normalized segment duration (e.g., Campbell, 1992) than by any symbolic transcription such as an arbitrary division into, say, categories '1', '2', and '3' (for 'slow', 'medium', and 'fast' tempi).
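To give a sense of what a continuous-valued measure looks like in contrast to categories such as '1', '2', and '3', the sketch below expresses a segment's duration as a z-score relative to a set of reference durations for the same segment type. This illustrates the general idea of a normalized duration under our own assumptions; it is not a reproduction of the specific procedure in Campbell (1992), and the function name and all numbers are invented.

```python
import statistics


def normalized_duration(duration_ms, reference_durations_ms):
    """Express a duration as a z-score relative to reference durations
    for the same segment type (hypothetical helper)."""
    mean = statistics.mean(reference_durations_ms)
    spread = statistics.stdev(reference_durations_ms)  # sample standard deviation
    return (duration_ms - mean) / spread


# Invented reference durations (ms) for one segment type, and one observed token.
reference = [80, 100, 120]
z = normalized_duration(130, reference)
print(f"normalized duration: {z:+.2f}")
```

A positive value means the token is longer than is typical for that segment type (locally slower speech) and a negative value means it is shorter, with no need to decide where 'slow' ends and 'medium' begins.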
A categorical aspect of prosody which we leave out (in accordance with the third principle) because it should be fairly predictable is the marking of the stressed and unstressed syllables within each word. By this level of stress we mean the word-internal alternation between more stressed and less stressed syllables, where the relative prominence of any pair of syllables is fairly fixed and can be thought of as inherent to the word's dictionary entry.
1.3. ToBI features
As you may be beginning to see and will certainly appreciate as you begin to transcribe, the ToBI transcription system is theoretically based but also driven by practical constraints. For example, some possible tone combinations are not part of the ToBI transcription because even experienced labellers had difficulty agreeing on whether or not they were present in some transcriptions. Along these lines, ToBI also includes labels for uncertainty which can be used when an unassailable transcription is difficult, e.g. in intervals of minimal prosody or poor recording situations.