Motivation and Emotion, Vol. 21, No. 4, 1997

Arousal and Valence in the Direct Scaling of Emotional Response to Film Clips1

Nancy Alvarado2
University of California, San Francisco

Contributions of differential attention to valence versus arousal (Feldman, 1995) in self-reported emotional response may be difficult to observe due to (1) confounding of valence and arousal in the labeling of rating scales, and (2) the assumption of an interval scale type. Ratings of emotional response to film clips (Ekman, Friesen, & Ancoli, 1980) were reanalyzed as categorical (nominal) in scale type using consensus analysis. Consensus emerged for valence-related scales but not for arousal scales. Scales labeled Interest and Arousal produced a distribution of idiosyncratic responses across the scale, whereas scales labeled Happiness, Anger, Sadness, Fear, Disgust, Surprise, and Pain produced consensual response. Magnitude of valenced response varied with both stimulus properties and self-reported arousal.

Feldman (1995) presented evidence that individuals differ in their attention to two orthogonal dimensions of emotion: valence (evaluation) and arousal. These differences were noted when subjects were asked to make periodic mood ratings using scales that confound these two aspects of affective experience. Feldman analyzed these ratings in the context of Russell's (1980) circumplex model and Watson and Tellegen's (1985) dimensions of positive affect (PA) and negative affect (NA) and suggested that the structure of affect changes with the focus of attention. She speculated that valence focus "may be associated with the tendency to attend to environmental, particularly social cues" (p. 163) whereas arousal focus may be related to internal (somesthetic) cues, citing Blascovich (1990; Blascovich et al., 1992). This paper presents support for Feldman's views, in a direct-scaling self-report context where valence and arousal are reported independently and the environmental cues are held constant, using data originally collected by Ekman, Friesen, and Ancoli (1980).

1Preparation of this article was supported in part by National Institute of Mental Health (NIMH) grant MH18931 to Paul Ekman and Robert Levenson for the NIMH Postdoctoral Training Program in Emotion Research. I thank Paul Ekman for permitting access to the data analyzed here. I also thank Jerome Kagan and several anonymous reviewers for their helpful comments on this manuscript.
2Address all correspondence concerning this article to Nancy Alvarado, who is now at the Department of Psychology (0109), University of California at San Diego, 9500 Gilman Drive, La Jolla, California 92093-0109.

0146-7239/97/1200-0323$12.50/0 © 1997 Plenum Publishing Corporation

Direct Scaling Assumptions

Direct scaling of emotional response occurs when a subject is exposed to an affect-inducing stimulus, then asked to introspect and rate the amount of some affect using a rating scale, often labeled with the name of an emotion to be reported, and typically numbered in intervals, such as from 1 to 7. Researchers frequently anchor the endpoints of such scales with descriptive phrases such as not at all angry, extremely angry, or most anger ever felt in my life. These ratings are treated as judgments on an interval, continuous scale. They are then averaged to produce means which are compared using analysis of variance (ANOVA) or t-test.

There is some evidence that self-report judgments of emotional response are consistent across time for the same individual (Larsen & Diener, 1985, 1987), that self report varies systematically with certain physiological changes associated with emotion and thus may be a valid indicator of emotional response (Levenson, 1992), and that higher ratings on a scale do correspond to greater emotional experience for the same individual (monotonicity).
These findings justify assumption of an ordinal scale type during data analysis. On the other hand, there is no evidence that the subjective distances between adjacent numbers on every portion of the scale are equal, as would be necessary in order to assume that the data are interval in nature. Further, aggregation of data and interrater comparisons are problematic because it is unclear how individual differences in emotional response are related to individual differences in the use of rating scales. Nor have the distances between numbers been shown to correspond to the same subjective differences in response for each individual in a study.

Consider temperature as an analogy. We can use an objective scale, such as the Fahrenheit scale, to evaluate the accuracy of subjective judgments. However, if we had no such scale, but instead asked subjects to rate temperature based upon the hottest or coldest temperatures they had ever experienced, their subjective experience would be confounded with variations in their devised scales. Unless we know the anchor points and scale intervals, we cannot know whether two subjects reporting different temperature ratings for the same stimulus are using the same scale but experiencing the temperature differently, or experiencing the temperature as the same but using different scales. If we ignore these difficulties and average their ratings, we obtain a measure that is useful in certain experimental contexts but insensitive to individual variations in subjective experience. Rather, we have a scale that assumes that individual differences are unimportant or nonexistent.

No objective physical unit of measurement exists to compare against self-reported emotional experience.
Even when we supply a 7-point scale anchored by descriptive phrases, we have no way of knowing how the individual interprets such phrases, e.g., how much anger one person has ever felt in his or her lifetime, compared to the maximum experienced by another. Further, anchoring using descriptive phrases such as most emotion ever felt in your life invites subjects to apply a scale with unequal distances between intervals, such that the most emotion ever felt on a 10-point scale is not 10 times the amount felt when 1 is reported, but probably far greater. Use of a scale with 100 rather than 10 divisions does not remedy this problem.

Use of rating scales to describe emotion is further complicated if magnitude is part of the meaning of the label used to identify the scale itself. For example, it is unclear how the difference in meaning between scale labels such as anxiety and fear, or annoyance and fury, would affect the judgments of magnitude made using that scale. Would an experience rated in the middle of an annoyance scale be rated lower if the scale were labeled frustration, anger, or rage?

Given these difficulties, the direct scaling of emotional response appears to be, at best, ordinal. As Townsend and Ashby (1984) noted, ". . . if the strength of one's data is only ordinal, as much of that in the social sciences seems to be, then even a comparison of group mean differences via the standard Z or t test or by analysis of variance is illegitimate. Only those statements and computations that are invariant under monotone (order is preserved) transformations are permissible" (p. 395). When the purpose of a study is merely to demonstrate a difference using self report as a dependent variable, then the measurement concerns described above are unlikely to affect the validity of the findings.
However, when these means tests are used to assert the equality of stimuli presented to evoke emotional response, or the efficacy of such stimuli as an elicitor of a specific emotion, then the concerns raised above become crucial to the findings. Everything that follows in such a study rests upon an initial assumption that mean self-report values are an accurate index of emotional response.

This problem is relevant to several recent studies investigating the congruence between facial activity and self-reported emotional response, as noted by Ruch (1995). In an ongoing controversy over whether smiling is an indicator of expressed feeling, Fridlund (1991) reported that happiness ratings did not parallel electromyograph (EMG) monitoring of smiling among subjects viewing film clips, but seemed related instead to the sociality of the viewing condition. Hess, Banse, and Kappas (1995) improved the measurement of facial activity by monitoring Duchenne versus non-Duchenne smiling and varied the amusement level of the film stimuli presented as well as the viewing context. They found a more complex relationship between social context and smiling. In both studies, the crucial comparison between facial activity and emotional response rested on the accuracy of the self-report ratings, analyzed using an ANOVA across viewing conditions, and assumed to be a valid measure of emotional response.

Use of Direct Rating to Norm Film Clips

This study reanalyzes self-report ratings of emotional response to film clips, originally collected by Ekman et al. (1980). These data have been frequently cited by Fridlund (1994) because they contain anomalies that he considers support for his view that smiling is related to social context rather than emotional response. Fridlund's larger issue of the sociality of smiling was addressed by Hess et al. (1995) and will not be discussed further here.
This discussion instead will focus upon the complexity involved in demonstrating congruence between self-report ratings and facial activity (or other behavior), and the need to improve methods of collecting and analyzing self-report data. The stimulus set used by Ekman et al. (1980) provides a useful illustration of the methodological and theoretical issues discussed earlier because, unlike many similar studies, it includes both baseline self-report ratings and concurrent ratings using multiple, separately labeled rating scales.

Ekman et al. (1980) compared self-report judgments for 35 subjects with their measured facial expressions when viewing pleasant and unpleasant film clips selected for their ability to evoke emotion. Fridlund (1994) noted that facial expression and direct ratings agreed only for the film stimuli with social content, but not for a third film for which the mean rated happiness was the same. At issue were three pleasant film clips: (1) a gorilla playing in a zoo, (2) ocean waves, and (3) a puppy playing with a flower. All three films evoked the same mean ratings when subjects were asked to rate their response on a scale labeled Happiness. However, as Fridlund noted, the film clips evoked differential amounts of facial activity, with the gorilla film evoking the greatest duration and intensity of facial activity, the puppy film showing the greatest frequency of facial activity, and only seven subjects showing any facial response to the ocean film. From this, Fridlund argued that the gorilla and puppy films were somehow more social in nature, evoking more facial expression because such expressions only arise from social antecedents. However, this is only true if the films did in fact evoke the same emotional responses. As will be argued later, I believe they did not.
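
The scale-type concern raised under Direct Scaling Assumptions can be made concrete with a small sketch. It shows that a comparison of means is not invariant under an order-preserving rescaling, whereas a comparison of medians is. The ratings and the squaring transformation below are hypothetical and chosen only for illustration; they are not data from Ekman et al. (1980).

```python
from statistics import mean, median

# Hypothetical ratings of two stimuli on a 0-8 scale (illustrative only).
film_a = [1, 1, 1, 1, 8]
film_b = [3, 3, 3, 3, 3]


def stretch(rating):
    """An assumed monotone rescaling: order is preserved, spacing is not."""
    return rating ** 2


stretched_a = [stretch(r) for r in film_a]
stretched_b = [stretch(r) for r in film_b]

# Does film A fall below film B? Ask with means, then with medians.
mean_order_raw = mean(film_a) < mean(film_b)
mean_order_stretched = mean(stretched_a) < mean(stretched_b)
median_order_raw = median(film_a) < median(film_b)
median_order_stretched = median(stretched_a) < median(stretched_b)

print("means agree before and after rescaling:",
      mean_order_raw == mean_order_stretched)
print("medians agree before and after rescaling:",
      median_order_raw == median_order_stretched)
```

If subjects differ in how they space the numbers on their internal scales, a difference or an equality between mean ratings may therefore reflect the spacing rather than the underlying response.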
Consensus Modeling

The assumptions of the random-effects ANOVA model are that responses are drawn from a normal distribution and that they are made using an interval scale. The model further assumes that all individuals use the same scale in the same manner (implicit to the assumption of equal variance).3 The point here is not whether analysis of variance has been correctly applied in psychological research, but rather whether a model that assumes minimal individual differences is suitable for exploring whether such individual differences in fact exist. The analysis below applies consensus modeling to explore (1) whether the averaging of ratings produced misleading norms for the various film clips, (2) whether subject ratings were idiosyncratic or consensual (as is implicitly assumed by the averaging of data), and (3) whether subjects used all scales in an equivalent manner across the rating contexts.

Consensus analysis is a formal computational model which uses the pattern of responses within a data set to predict the likelihood of correct response for each subject (called the competence rating), provide an estimate of the homogeneity of response among subjects (the mean competence), and provide confidence intervals for the correctness of each potential response to a set of questions. While this model also makes certain assumptions, discussed in greater detail below, it incorporates goodness-of-fit measures that permit an analysis of the extent to which those assumptions have been met. Thus the model can be used to investigate the nature of response using rating scales, and thereby to address the issues raised above. A formal description of the model has been provided by Batchelder and Romney (1988, 1989). Equations are provided in the Appendix.

3According to Hays (1988), these assumptions can be violated without greatly affecting results when a fixed-effects model is used to test inferences about specific means. Violating assumptions of normality and equal variance has serious consequences for a random-effects model used to test inferences about the variance of the population effects.

Consensus modeling assumes that subjects draw upon shared latent knowledge when making their responses. The source of this shared knowledge may be cultural or may be derived from shared physiology or common humanity. The model cannot distinguish between these sources of homogeneous responding. It assumes that intercorrelation of subject responses across a data set occurs because subjects are drawing upon the same latent answer key when making their responses. Therefore, the latent answer key can be recreated using the pattern of intercorrelation. The model assumes that subjects vary in their performance and in their access to shared knowledge, but that subjects with higher correlation to the group are more expert because they have greater access to shared knowledge. The answer key confidence intervals are estimated using Bayes' theorem. Each subject's competence score is used as a probability of correctness. Subjects who are more expert because they agree more with the group are given greater weight in producing the estimated answer key. Thus, consensus emerges not from majority response to a particular question, but from patterns of agreement across the entire data set.

For purposes of this study, the question was: "What number on this rating scale best describes the emotional response to this film clip?" This analysis assumes that there is a single correct number on each scale, for each rating context, that characterizes the group. This is the same assumption made when a group mean is used as a normative rating. Use of such a mean implies that one number (e.g., 4.5) best predicts the potential response of any individual selected at random from the population.
Using consensus analysis, we can test whether subjects assign the same stimulus the same number on their internal subjective scales, or whether their scales are calibrated such that the same stimulus may produce widely varying response. This is important because it tells us something about the consistency of emotional response across individuals. Previous studies have also assumed that individual scales are calibrated in a similar enough manner to justify the aggregation of data across subjects and the use of ANOVA models. This approach tests whether that assumption is justified. In the study that follows, consensus analysis results are supplemented by analysis of the normality of the distribution of responses, and of the patterns of correlation among the scales.

METHOD

This analysis was performed upon the original self-report data collected by Ekman et al. (1980), rather than the summaries provided by the resulting article. Additional details about the data collection procedures were provided in that article and are omitted here, except where relevant to the arguments presented.

Subjects

Subjects were 35 female volunteers, ages 18 to 35 years, recruited through advertisements to participate in a study of psychophysiology.

Stimuli

Stimuli consisted of five films of 1-min duration, three intended to be pleasant and two intended to be unpleasant. The three pleasant films (described above) were created by Ekman and Friesen and were always shown in the same order: gorilla, ocean, puppy. The two unpleasant films were edited versions of a workshop accident film designed to evoke fear and disgust. The first film depicts a man sawing off the tip of his finger. The second shows a man dying when a plank of wood is thrust through his chest by a circular saw. These films were always shown in this same order.
Procedure

Subjects rated their emotional responses for two baseline periods and five film-viewing periods using a series of nine unipolar 9-point scales, labeled with the following terms: Interest, Anger, Disgust, Fear, Happiness, Pain, Sadness, Surprise, and Arousal. Pain was defined for subjects as "the experience of empathetic pain" and Arousal was explained as applying to the total emotional state rather than to any one of the other scales presented. The other terms were not explained to subjects. Scales ranged from 0 (no emotion) to 8 (strongest feeling). Instructions explained how the ratings were to be made (Ekman et al., 1980): "... strength of a feeling should be viewed as a combination of (a) the number of times you felt the emotion—its frequency; (b) the length of time you felt the emotion—its duration; and (c) how intense or extreme the emotions [sic] was—its intensity" (p. 1127).

The first baseline occurred during a 20-min period in which the subject was instructed to relax. The presentation of pleasant or unpleasant films first was counterbalanced. Ratings for all three pleasant films were made after viewing all three films. Similarly, ratings for the two unpleasant films were made after viewing both films. A second baseline rating was made after rating of the first set of films, during a 5-min interval before starting the second series of films.

RESULTS

Consensus Analysis

The following discussion is adapted from the description of consensus modeling provided by Weller and Romney (1988). Consensus analysis provides a measure of reliability in situations where correct responses to items are not already known. Mathematically, it closely parallels item response theory or reliability theory, except that data are coded as given by subjects rather than as "correct" or "incorrect," and the reliability of the subjects is measured instead of the reliability of the items.
The formal model is described in Batchelder and Romney (1988, 1989). Additional description of the model is provided in the Appendix. The main idea of the model is that when correct answers exist, the answers given by subjects are likely to be positively correlated with that correct answer key. Thus, in situations where correct answers are unknown but assumed to exist, the pattern of intercorrelations or agreement among subjects (called consensus) can be used to reconstruct the latent answer key. This is similar to the idea in reliability theory that correlations among items reflect their independent correlation with an underlying trait or ability. Similarly, high agreement among subjects about the answers to a set of items measuring a coherent domain suggests the likelihood that shared knowledge exists and provides information about what that knowledge is. In the words of Weller and Romney (1988), "A consensus analysis is a kind of reliability analysis performed on people instead of items" (p. 75). This reliability analysis is used to make inferences about the nature of the domain or to determine the correct answers. When a correct answer key does not exist, as when subjects belong to subcultures drawing upon different sources of shared knowledge, or when subjects draw upon idiosyncratic knowledge, that violation of the model's assumptions is readily apparent in the measures provided by the model.

Ratings for each of the nine emotion-labeled scales were analyzed separately; thus the data consisted of seven numerical ratings (one for each rating period) for each of the 35 subjects, for each labeled scale (nine scales). The data were treated as multiple-choice responses to the implied question "Which number corresponds to the correct emotional response rating for this particular film segment or baseline period?"
Given the preceding discussion about scale types, it would have been preferable to analyze the data using an ordinal consensus model, but such a model has not yet been developed. The categorical, multiple-choice model used here assumes an equal probability of guessing the alternatives in its correction for guessing. The analysis of normality (presented later) suggests that this assumption is appropriate for some but not all of the rating scales. With ordinal data, it is more likely that guessing biases differ among the rating alternatives (e.g., the probability of guessing 5 may be different than the probability of guessing 0). A model incorporating such biases had not been developed at the time this analysis was performed, but now exists (see Klauer and Batchelder, 1996). In general, the application of a categorical model to what we suspect is ordinal data tends to work against a finding of consensus because subjects must agree on the exact rating number given to each stimulus out of nine alternatives (0 to 8).

The measures used to evaluate results are (1) individual competence scores, (2) mean competence, (3) eigenvalues produced during the principal component analysis used to estimate the solution to the model's equations, and (4) answer key confidence estimates. Competence scores range from -1.00 to 1.00 and are maximum-likelihood parameter estimates. They are best understood as estimated probabilities rather than correlation coefficients. A negative competence score indicates extreme and consistent disagreement with the group across rating periods.
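
The categorical model described above can be sketched in a few lines of Python. This is a simplified illustration, not the estimation procedure used for the analysis reported here (the formal equations are in the Appendix and in Batchelder and Romney, 1988, 1989). The function name `consensus_sketch`, the iterated principal-axis step used to fill the unknown diagonal, the clipping of competence to the open interval (0, 1), and the uniform prior over alternatives are all assumptions of this sketch, and the example ratings are hypothetical.

```python
import numpy as np


def consensus_sketch(responses, n_alternatives, n_iter=100):
    """Rough consensus estimate for answers coded 0..n_alternatives-1.

    responses: array of shape (n_subjects, n_items).
    Returns (competence, answer_key, key_confidence, eigenvalues).
    """
    responses = np.asarray(responses)
    n_items = responses.shape[1]

    # Proportion of items on which each pair of subjects gave the same answer.
    matches = (responses[:, None, :] == responses[None, :, :]).mean(axis=2)

    # Correction for guessing, assuming every alternative is equally likely
    # to be guessed (the assumption of the categorical model).
    corrected = (n_alternatives * matches - 1.0) / (n_alternatives - 1.0)

    # Under the model, corrected[i, j] is roughly d[i] * d[j] for i != j.
    # The diagonal is unknown, so it is re-estimated on every pass
    # (iterated principal-axis factoring; an assumption of this sketch).
    work = corrected.copy()
    for _ in range(n_iter):
        eigenvalues, eigenvectors = np.linalg.eigh(work)  # ascending order
        lead = eigenvectors[:, -1] * np.sqrt(max(eigenvalues[-1], 0.0))
        if lead.sum() < 0:  # eigenvector sign is arbitrary
            lead = -lead
        np.fill_diagonal(work, lead ** 2)
    competence = lead
    eigenvalues = np.linalg.eigvalsh(work)[::-1]  # largest first

    # Bayesian answer key with a uniform prior over alternatives. Competence
    # is clipped so that it can be used as a probability inside a logarithm.
    d = np.clip(competence, 1e-3, 1.0 - 1e-3)
    log_right = np.log(d + (1.0 - d) / n_alternatives)[:, None]
    log_wrong = np.log((1.0 - d) / n_alternatives)[:, None]
    log_post = np.zeros((n_items, n_alternatives))
    for alt in range(n_alternatives):
        chose_alt = responses == alt
        log_post[:, alt] = np.where(chose_alt, log_right, log_wrong).sum(axis=0)
    log_post -= log_post.max(axis=1, keepdims=True)
    posterior = np.exp(log_post)
    posterior /= posterior.sum(axis=1, keepdims=True)

    return competence, posterior.argmax(axis=1), posterior.max(axis=1), eigenvalues


# Hypothetical ratings (0 to 8) from six raters over seven rating periods.
example = np.array([
    [0, 0, 5, 1, 6, 0, 7],
    [0, 0, 5, 1, 6, 0, 7],
    [0, 0, 5, 1, 6, 0, 7],
    [0, 0, 5, 1, 6, 0, 7],
    [0, 0, 5, 1, 6, 0, 8],  # disagrees with the others on one period
    [3, 4, 1, 7, 2, 5, 0],  # idiosyncratic rater
])
competence, answer_key, key_confidence, eigenvalues = consensus_sketch(
    example, n_alternatives=9
)
print("competence:", np.round(competence, 3))
print("answer key:", answer_key)
```

In this sketch the raters who agree with one another receive high competence and dominate the estimated key, while the idiosyncratic rater receives a low or negative score and contributes almost nothing to it.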
Batchelder and Romney (1988, 1989) established three criteria for judging whether consensus exists in subject responses to questions about a domain: (1) eigenvalues showing a single dominant factor (a ratio greater than 3:1 between the first and second factors), (2) a mean competence greater than .500, and (3) absence of negative competence scores in the group of subjects. While failure to meet these criteria does not necessarily rule out consensus, it can indicate a poor fit between the data and the model.

Consensus analysis results for the nine scales across the seven rating periods are summarized in Table I. All scales except those labeled Interest and Arousal met the criteria for consensus. In contrast, the scales for Interest and Arousal showed nearly half the group with negative consensus scores, indicating severe disagreement about the correct responses on those scales. The scales for Anger, Disgust, and Pain showed the greatest consensus, with the highest mean consensus scores and with eigenvalue ratios indicating a single dominant factor in the data. While the scales for Sadness and Surprise each showed a single negative consensus score, the otherwise high mean consensus scores and ratios between the eigenvalues suggest that consensus also existed for those scales.

Table I. Consensus Analysis of Nine Rating Scales Across Seven Rating Periods

               Consensus        Ratio of      Negative         Confidence
Scale label    Mean     SD      eigenvalues   scores     N     level
Anger          .831     .179    13.348         0         35    1.0000      1.702
Disgust        .795     .082    13.620         0         35     .9478      1.180
Fear           .699     .155     8.926         0         35     .9392      1.332
Happy          .580     .131     5.998         0         35     .9943      1.200
Pain           .793     .106    14.511         0         35     .9841      1.223
Sadness        .674     .290     7.136         1         35    1.0000      1.584
Surprise       .657     .230     8.959         1         35     .9838      1.096
Interest       .101     .288     1.382        16         35     .9363      1.144
Arousal        .150     .230     1.087        17         35     .8486      1.720

This finding of consensus for seven of the nine scales suggests that subjects agreed strongly in their emotional responses to the stimuli presented, particularly with respect to the scales labeled Anger, Disgust, and Pain. Lesser agreement existed for Surprise and Fear, and for Happiness and Sadness. Based upon the measures provided by this model, consensual emotional response did not exist for the two scales labeled Arousal and Interest. The importance of this finding will be discussed later.

Answer key confidence levels were high (M = .95), even when emotional response was reported, but consensus appeared to be largely governed by agreement about the absence of negative emotion during the pleasant film clips, and the absence of positive emotion during the unpleasant film clips.4 The scales showing lower consensus (but nevertheless meeting the criteria for consensus), Sadness, Happiness, and Surprise, showed minor violations of this pattern. Because the presentation of films was counterbalanced, half of the subjects saw pleasant films and half saw unpleasant films before the second baseline. From the ratings, several subjects appeared to have carried residual negative emotional response into this second baseline period, producing mixed ratings. They may also have carried such response into the pleasant film ratings, as Ekman et al. (1980)

4This is far from a trivial finding, as several emotion theorists have hypothesized that complex emotional responses may be blends of basic emotions and thus have insisted that multiple scales be provided to permit subjects to express such complexity. A lack of response is thus as meaningful as positive response on each single scale with respect to each rating context.
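
As a worked check, the three consensus criteria stated above can be applied mechanically to the mean consensus, eigenvalue ratio, and negative score columns of Table I. The sketch below is illustrative only: the class and function names are hypothetical, and the thresholds are coded literally as stated in the text (ratio greater than 3, mean greater than .500, no negative scores).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleResult:
    label: str
    mean_competence: float
    eigen_ratio: float
    negative_scores: int


# Values copied from Table I.
TABLE_I = [
    ScaleResult("Anger", 0.831, 13.348, 0),
    ScaleResult("Disgust", 0.795, 13.620, 0),
    ScaleResult("Fear", 0.699, 8.926, 0),
    ScaleResult("Happy", 0.580, 5.998, 0),
    ScaleResult("Pain", 0.793, 14.511, 0),
    ScaleResult("Sadness", 0.674, 7.136, 1),
    ScaleResult("Surprise", 0.657, 8.959, 1),
    ScaleResult("Interest", 0.101, 1.382, 16),
    ScaleResult("Arousal", 0.150, 1.087, 17),
]


def check_criteria(result):
    """Evaluate each of the three criteria separately for one scale."""
    return {
        "single_dominant_factor": result.eigen_ratio > 3.0,
        "mean_competence_above_half": result.mean_competence > 0.5,
        "no_negative_scores": result.negative_scores == 0,
    }


# Names of the criteria each scale fails, keyed by scale label.
failed = {
    r.label: [name for name, ok in check_criteria(r).items() if not ok]
    for r in TABLE_I
}
meets_all = [label for label, misses in failed.items() if not misses]

for label, misses in failed.items():
    print(f"{label:9s}", "meets all criteria" if not misses else f"fails: {misses}")
```

Read literally, Sadness and Surprise miss only the third criterion, by one subject each, which is why the text treats them as consensual on the strength of their mean scores and eigenvalue ratios. Interest and Arousal miss all three.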