Speech emotion analysis
Speech emotion analysis refers to the use of various methods to analyze vocal behavior as a marker of affect (e.g., emotions, moods, and stress), focusing on the nonverbal aspects of speech. The basic assumption is that there is a set of objectively measurable voice parameters that reflect the affective state a person is currently experiencing (or expressing for strategic purposes in social interaction). This assumption appears reasonable given that most affective states involve physiological reactions (e.g., changes in the autonomic and somatic nervous systems), which in turn modify different aspects of the voice production process. For example, the sympathetic arousal associated with an anger state often produces changes in respiration and an increase in muscle tension, which influence the vibration of the vocal folds and the shape of the vocal tract. These changes affect the acoustic characteristics of the speech, which the listener can in turn use to infer the respective state (Scherer, 1986). Speech emotion analysis is complicated by the fact that vocal expression is an evolutionarily old nonverbal affect signaling system coded in an iconic and continuous fashion, which carries emotion and meshes with verbal messages that are coded in an arbitrary and categorical fashion. Voice researchers still debate the extent to which verbal and nonverbal aspects can be neatly separated. However, that there is some degree of independence is illustrated by the fact that people can perceive mixed messages in speech utterances, in which the words convey one thing but the nonverbal cues convey something quite different.
Levels of description
How emotions are expressed in the voice can be analyzed at three different levels:
- the physiological level (e.g., describing nerve impulses or muscle innervation patterns of the major structures involved in the voice-production process)
- the phonatory-articulatory level (e.g., describing the position or movement of the major structures such as the vocal folds)
- the acoustic level (e.g., describing characteristics of the speech wave form emanating from the mouth)
Most of the current methods for measurement at the physiological and phonatory-articulatory levels are rather intrusive and require specialized equipment as well as a high level of expertise. In contrast, acoustic cues of vocal emotion expression may be obtained objectively, economically, and unobtrusively from speech recordings, and allow some inferences about voice production and physiological determinants. Hence, acoustic measurement of voice cues, requiring basic training in voice physiology and acoustics but no special equipment, is perhaps the method that holds the greatest promise for interdisciplinary research on emotional speech.
Voice cues are commonly divided into those related to: (a) fundamental frequency (F0, a correlate of the perceived pitch), (b) vocal perturbation (short-term variability in sound production), (c) voice quality (a correlate of the perceived ‘timbre’), (d) intensity (a correlate of the perceived loudness), and (e) temporal aspects of speech (e.g., speech rate), as well as various combinations of these aspects (e.g., prosodic features).
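Two of the cue families listed above, F0 and intensity, can be illustrated with a minimal sketch. It assumes a single voiced frame held as a list of samples scaled to the range -1 to 1, estimates F0 from the strongest autocorrelation peak, and reports level in decibels relative to full scale rather than calibrated sound pressure. The function names and the 75-500 Hz search range are illustrative choices, not taken from any particular analysis package.

```python
import math


def rms_level_db(samples, reference=1.0):
    """Root-mean-square level of a frame in dB relative to `reference`."""
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")  # a silent frame has no finite level
    return 10.0 * math.log10(mean_square / (reference ** 2))


def estimate_f0(samples, sample_rate, f0_min=75.0, f0_max=500.0):
    """Estimate F0 (Hz) of one voiced frame from its autocorrelation peak.

    The search range is an assumed default; returns None if no positive
    autocorrelation peak is found within that range.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(samples) - 1)
    best_lag, best_value = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag.
        value = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else None


# Synthetic 50 ms frame: a 200 Hz sinusoid sampled at 8 kHz.
frame = [math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(400)]
f0_hz = estimate_f0(frame, 8000)
level_db = rms_level_db(frame)
```

Real speech requires voicing decisions, windowing, and safeguards against octave errors, so estimates from a sketch like this should be checked against a dedicated tool.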
The primary question is whether it is possible to find distinct voice profiles for different emotions, such that the voice can be used to infer what the speaker is feeling. This has proved difficult due to both practical problems and the complex nature of the voice production process. There are several sources of variability that complicate the search for voice profiles, such as individual differences among speakers, effects of the verbal content, interactions between spontaneous and strategic expression, as well as important variations within particular emotion families (e.g., hot vs. cold anger). Predictably, reviews of the literature commonly mention inconsistent data regarding voice cues to specific emotions. Only correlates of overall arousal level such as high F0 and fast tempo are very consistently replicated, which has led some to propose that only arousal is coded in voice. However, there is considerable evidence that voice cues can differentiate affective states beyond the simple affective dimensions of activation (aroused/sleepy) and valence (pleasant/unpleasant). Table 1 presents a set of empirically derived predictions for patterns of voice cues for different emotions based on over a hundred studies of emotion in speech. Most studies have used emotion portrayals by professional actors, and a crucial question is to what degree such portrayals differ from natural expressions. The jury is still out because few attempts have been made to directly compare the two types of speech samples (but see Table 3.3 in Juslin & Scherer, 2005). In addition, most studies have focused on only a few “basic emotions”, while neglecting more complex emotions. Hence, much of the pertinent work on emotion differentiation in the voice remains to be done.
Although there seems to be wide agreement among researchers regarding the pertinence of adaptive evolutionary approaches to an understanding of emotional speech, most previous work has been atheoretical in nature (see Scherer, 2003). Most theories of emotion do not make explicit and detailed predictions regarding the specificity of the states conveyed in emotional speech, but they can be expected to differ in terms of whether only dimensions like arousal and valence or discrete basic emotion categories are assumed to be vocally differentiated. Component appraisal theories predict that emotional speech will also convey finer nuances that reflect the precise cognitive appraisals and consequent action tendencies that underlie each emotion. The only theory so far developed specifically to account for emotion in speech is Scherer’s (1986) component process theory, which has received some support in preliminary studies (Scherer, Johnstone, & Klasmeyer, 2003).
Emotion in speech may be regarded as a communication system featuring several parts:
- the expression or portrayal of the emotion by the speaker (the encoding)
- the acoustic cues (e.g., sound intensity) that convey the felt or intended emotion
- the proximal perception of these cues by the perceiver (e.g., perceived loudness)
- the inference about the expressed emotion by the perceiver (the decoding)
Studies may focus on any part of this process, but a thorough understanding may require that all parts are investigated in a combined fashion, in accordance with Brunswik’s lens model (see Juslin & Scherer, 2005).
Several studies have explored affect inferences from voice cues in listening tests, where the participants are required to judge the emotions expressed in speech samples using various response formats (e.g., forced choice, quantitative ratings and free labeling). Various content-masking procedures (e.g., low-pass filtering) that disrupt or degrade individual voice cues can be used to study which voice cues are used by listeners to infer specific emotions (Scherer, 2003). In the most extensive review to date (Juslin & Laukka, 2003), 39 studies of vocal expression featuring 60 listening experiments were included in a meta-analysis of decoding accuracy based on forced-choice judgments. The meta-analysis included both within-cultural and cross-cultural studies, and both portrayed and natural expressions. Results revealed that (overall) decoding accuracy for within-cultural expression was equivalent to a score of 70% correct in a forced-choice task with five response alternatives. Decoding accuracy was 7% higher for within-cultural than for cross-cultural expressions.
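Expressing accuracy as "equivalent to" a score with a fixed number of response alternatives implies a conversion between forced-choice tasks of different sizes. The text does not say which conversion was used. The sketch below assumes Rosenthal and Rubin's proportion index (pi), which re-expresses proportion correct on a scale where chance is always .50, and is offered only as an illustration of how such a conversion works. The function names are hypothetical.

```python
def proportion_index(p_correct, n_alternatives):
    """Rosenthal and Rubin's pi: accuracy on a two-alternative scale.

    Assumed here for illustration; chance performance maps to 0.5
    regardless of the number of response alternatives.
    """
    k = n_alternatives
    return p_correct * (k - 1) / (1 + p_correct * (k - 2))


def equivalent_accuracy(pi, n_alternatives):
    """Inverse mapping: proportion correct with k alternatives for a given pi."""
    k = n_alternatives
    return pi / ((k - 1) - pi * (k - 2))


# 70% correct with five alternatives, and chance level (20%) for comparison.
pi_observed = proportion_index(0.70, 5)
pi_chance = proportion_index(0.20, 5)
```

Under this index, 70% correct with five alternatives corresponds to a pi of about .90, whereas chance performance (20%) corresponds to .50.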
The above results suggest that estimates of decoding accuracy partly depend on the type of speech sample used. Three types of samples have been used in previous research:
- Emotion portrayals by professional actors in a laboratory, allowing experimental control and ensuring strong effects on voice cues but raising doubts about ecological validity.
- Natural vocal expressions recorded in the field or from reality media broadcasts. Ecological validity can be expected to be high (at least for unobtrusive recordings) but it is difficult to determine what affective state is felt or portrayed by the speaker (often inferred from situational cues).
- Experimentally manipulated affect expressions in the laboratory. This method combines experimental control with the possibility of obtaining spontaneous affect expressions. However, the induced affective states may be weak and unspecific.
Three factors help to reveal emotion differences in a speech study: (a) analysis of a large number of voice cues, (b) precision in the labeling of the affective states expressed, and (c) a proper research design based on explicit predictions. In particular, it appears necessary to reach beyond single measures of the most common voice cues (e.g., F0, speech rate, intensity), which may involve similar cue levels for different emotions. It further seems necessary to control for emotion intensity, which may affect voice cues in a differential fashion (for further discussion of design considerations in studies of emotion in speech, see Juslin & Scherer, 2005).
Speech emotion analysis requires basic knowledge about voice and speech production and speech acoustics to interpret the meaning of particular acoustic parameters and to make informed choices about analytic techniques (for detailed reviews, see Titze, 1994, and Kent, 1997). The basis of all sound making with the human vocal apparatus is air flow through the vocal tract, powered by respiration. The type of sound produced depends on whether the air flow is set into vibration by rapid opening and closing of the glottis – phonation – producing quasi-periodic voiced sounds; or whether it passes freely through the lower part of the vocal tract and is transformed into turbulent noise by friction at the mouth opening, thus producing nonperiodic, unvoiced sounds. The quality of the sound is further determined by the acoustic filter characteristics of the vocal tract, as outlined in Fant’s source-filter model of speech.
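The source-filter idea can be sketched in a few lines: a periodic source stands in for phonation, and a cascade of resonances stands in for the vocal tract filter. This is a deliberately crude illustration, assuming an impulse train as the glottal source and simple two-pole resonators as formants. The formant frequencies and bandwidths in the example are illustrative values for an open vowel, not measurements from the literature cited here.

```python
import math


def glottal_pulse_train(f0, duration, sample_rate):
    """Source: one unit impulse per glottal cycle (a crude model of phonation)."""
    n_samples = round(duration * sample_rate)
    period = round(sample_rate / f0)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(signal, centre_hz, bandwidth_hz, sample_rate):
    """Filter: two-pole resonance standing in for one vocal tract formant."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)   # pole radius
    theta = 2.0 * math.pi * centre_hz / sample_rate       # pole angle
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out


def synthesize_vowel(f0, formants, duration=0.2, sample_rate=16000):
    """Pass the source through one resonator per (centre, bandwidth) pair."""
    signal = glottal_pulse_train(f0, duration, sample_rate)
    for centre_hz, bandwidth_hz in formants:
        signal = resonator(signal, centre_hz, bandwidth_hz, sample_rate)
    return signal


# Illustrative (assumed) formant settings, in Hz, for an open vowel.
vowel = synthesize_vowel(100.0, [(700.0, 100.0), (1200.0, 120.0), (2600.0, 150.0)])
```

Changing `f0` alters only the source, while changing the formant list alters only the filter, which is the separation the model describes.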
Speech emotion is often analyzed using dedicated software for acoustic analyses of the speech signal. One of the most commonly used software packages is PRAAT, developed by Boersma and Weenink, which can be downloaded at http://www.fon.hum.uva.nl/praat/. The first stage of the analysis is to segment the speech sample into voiced and unvoiced sounds, words, and syllables to allow a quantitative description of relatively homogeneous and thus comparable parts of each utterance. Then one can extract various voice cues of relevance to speech emotion including fundamental frequency, speech rate, pauses, voice intensity, voice onset time, jitter (pitch perturbations), shimmer (loudness perturbations), voice breaks, pitch jumps, and measures of voice quality (e.g., the relative extent of high- versus low-frequency energy in the spectrum, the frequency and bandwidth of energy peaks in the spectrum due to natural resonances of the vocal tract called formants). Several measures may be obtained for each type of cue. Although most of the algorithms for automatic extraction of voice cues are fairly reliable, it is nonetheless recommended that automatic measures are carefully checked. (For practical advice regarding acoustic analyses, see Owren and Bachorowski, 2007.)
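As an illustration of the perturbation measures mentioned above, the sketch below computes jitter and shimmer in one commonly used "local" form: the mean absolute difference between consecutive cycles divided by the mean value across cycles. It assumes that the glottal period durations and per-cycle peak amplitudes have already been extracted by some other tool. The function names are hypothetical, and other definitions of both measures exist.

```python
def local_perturbation(values):
    """Mean absolute cycle-to-cycle difference, relative to the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two consecutive cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def jitter_local(periods_s):
    """Relative cycle-to-cycle variability of glottal period durations."""
    return local_perturbation(periods_s)


def shimmer_local(peak_amplitudes):
    """Relative cycle-to-cycle variability of per-cycle peak amplitudes."""
    return local_perturbation(peak_amplitudes)


# Hypothetical measurements from four consecutive glottal cycles.
periods = [0.010, 0.011, 0.010, 0.011]      # seconds
amplitudes = [0.50, 0.52, 0.49, 0.51]       # linear amplitude units
jitter = jitter_local(periods)
shimmer = shimmer_local(amplitudes)
```

Both values are ratios and are often reported as percentages. Because they depend heavily on how the cycles were detected, this is one place where checking the automatic measures by hand is especially worthwhile.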
In addition to acoustic analyses, there are several coding schemes that may be used for auditory assessments of voice characteristics. However, it is unclear whether the dimensions and categories that people use in processing the voice cues they hear are congruent with the concepts that phoneticians, acousticians, and voice therapists are using. This question has sparked interest in how voice cues are proximally represented by the perceiver, and whether this representation may be captured by developing standardized rating scales that include verbal labels (e.g., ‘harsh’ or ‘shaky’) commonly used to describe voice characteristics in everyday life.
Current research directions include cross-cultural studies, comparisons of emotion portrayals with natural expressions, comparisons of different theoretical approaches, attempts to develop tools for automatic decoding of emotions, and multimodal approaches in emotional expression. So far, the vocal channel of emotional expression has received less attention than the facial channel, mirroring the relative emphasis placed on these modalities by the pioneers in the field, such as Charles Darwin. This situation is beginning to change, and one important impetus for the recent proliferation of studies on speech emotion has been the strong interest in applications of speech technology in automatic speech and speaker recognition and speech synthesis. For example, based on the results in encoding studies, researchers have been able to synthesize emotions in speech, sometimes yielding recognition accuracy scores comparable to those obtained with human speech. Synthesis of emotions in speech may be useful both for researchers who wish to test predictions about which voice cues are used to make inferences about emotions and for engineers developing practical applications in robots, communication systems for motor- or vocally-impaired individuals, call centers, lie detection, airport security, and computer games.
Speech emotion is a multidisciplinary field of research with contributions coming from psychology, acoustics, speech science, linguistics, medicine, engineering, and computer science. More intensive and well-coordinated collaboration between researchers from these fields would facilitate a convergence and consolidation of research results and consequently result in a better understanding of how emotions are revealed by various aspects of the voice.
(* suggested introductory readings)
- Juslin, P. N., & Laukka, P. (2003). Communication of emotions in vocal expression and music performance: Different channels, same code? Psychological Bulletin, 129, 770-814.
- *Juslin, P. N., & Scherer, K. R. (2005). Vocal expression of affect. In J. A. Harrigan, R. Rosenthal, & K. R. Scherer (Eds.), The new handbook of methods in nonverbal behavior research (pp. 65-135). New York: Oxford University Press.
- Kent, R. D. (1997). The speech sciences. San Diego, CA: Singular Press.
- Owren, M. J., & Bachorowski, J.-A. (2007). Measuring emotion-related vocal acoustics. In J. Coan & J. Allen (Eds.), Handbook of emotion elicitation and assessment (pp. 239-266). New York: Oxford University Press.
- Scherer, K. R. (1986). Vocal affect expression: A review and a model for future research. Psychological Bulletin, 99, 143-165.
- Scherer, K. R. (2003). Vocal communication of emotion: A review of research paradigms. Speech Communication, 40, 227-256.
- *Scherer, K. R., Johnstone, T., & Klasmeyer, G. (2003). Vocal expression of emotion. In R. J. Davidson, K. R. Scherer, & H. Goldsmith (Eds.), Handbook of the Affective Sciences (pp. 433-456). New York and Oxford: Oxford University Press.
- Titze, I. R. (1994). Principles of voice production. Englewood Cliffs, NJ: Prentice-Hall.
- Furui, S. (2008). Speaker recognition. Scholarpedia, 3(4), 3715.