Hindawi Publishing Corporation
EURASIP Journal on Audio, Speech, and Music Processing
Volume 2010, Article ID 735854, 19 pages
doi:10.1155/2010/735854

Research Article
Determination of Nonprototypical Valence and Arousal in Popular Music: Features and Performances

Björn Schuller, Johannes Dorfner, and Gerhard Rigoll
Institute for Human-Machine Communication, Technische Universität München, 80333 München, Germany
Correspondence should be addressed to Björn Schuller, schuller@tum.de

Received 27 May 2009; Revised 4 December 2009; Accepted 8 January 2010
Academic Editor: Liming Chen

Copyright © 2010 Björn Schuller et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Mood of music is among the most relevant and commercially promising, yet challenging, attributes for retrieval in large music collections. In this respect this article first provides a short overview on methods and performances in the field. While most past research so far dealt with low-level audio descriptors to this aim, this article reports on results exploiting middle-level information such as the rhythmic and chordal structure or the lyrics of a musical piece. Special attention is given to realism and nonprototypicality of the selected songs in the database: all feature information is obtained by fully automatic preclassification apart from the lyrics, which are automatically retrieved from on-line sources. Furthermore, instead of exclusively picking songs with agreement of several annotators upon perceived mood, a full collection of 69 double CDs, or 2648 titles, respectively, is processed. Due to the severity of this task, different modelling forms in the arousal and valence space are investigated, and relevance per feature group is reported.
1. Introduction

Music is ambient. Audio encoding has enabled us to digitise our musical heritage and new songs are released digitally every day. As mass storage has become affordable, it is possible for everyone to aggregate a vast amount of music in personal collections. This brings with it the necessity to somehow organise this music.

The established approach for this task is derived from physical music collections: browsing by artist and album is of course the best choice for searching familiar music for a specific track or release. Additionally, musical genres help to overview similarities in style among artists. However, this categorisation is quite ambiguous and difficult to carry out consistently.

Often music is not selected by artist or album but by the occasion like doing sports, relaxing after work, or a romantic candle-light dinner. In such cases it would be handy if there was a way to find songs which match the mood which is associated with the activity, like "activating", "calming", or "romantic" [1, 2]. Of course, manual annotation of music would be a way to accomplish this. There also exist on-line databases with such information like Allmusic (http://www.allmusic.com/). But the information which can be found there is very inaccurate because it is available on a per-artist instead of a per-track basis. This is where an automated way of classifying music into mood categories using machine learning would be helpful. Shedding light on current well-suited features and performances, and improving on this task, is thus the concern of this article. Special emphasis is thereby laid on sticking to real world conditions by absence of any preselection of "friendly" cases, either by considering only music with majority agreement of annotators or by random partitioning of train and test instances.

1.1. State of the Art

1.1.1. Mood Taxonomies. When it comes to automatic music mood prediction, the first task that arises is to find a suitable mood representation. Two different approaches are currently established: a discrete and a dimensional description.

A discrete model relies on a list of adjectives each describing a state of mood like happy, sad, or depressed. Hevner [3] was the first to come up with a collection of 8 word clusters consisting of 68 words. Later Farnsworth [4] regrouped them in 10 labelled groups, which were used and
expanded to 13 groups in recent work [5]. Table 1 shows those groups. Also MIREX (Music Information Retrieval Evaluation eXchange) uses word clusters for its Audio Mood Classification (AMC) task as shown in Table 2.

Table 1: Adjective groups (A–J) as presented by Farnsworth [4]; K–M were extended by Li and Ogihara [5].

A  cheerful, gay, happy      H  dramatic, emphatic
B  fanciful, light           I  agitated, exciting
C  delicate, graceful        J  frustrated
D  dreamy, leisurely         K  mysterious, spooky
E  longing, pathetic         L  passionate
F  dark, depressing          M  bluesy
G  sacred, spiritual

Table 2: MIREX 2008 Mood Categories (aggr.: aggressive, bittersw.: bittersweet, humor.: humorous, lit.: literate, rollick.: rollicking).

A  passionate, rousing, confident, boisterous, rowdy
B  rollick., cheerful, fun, sweet, amiable/good natured
C  lit., poignant, wistful, bittersw., autumnal, brooding
D  humor., silly, campy, quirky, whimsical, witty, wry
E  aggr., fiery, tense/anxious, intense, volatile, visceral

However, the number and labelling of adjective groups suffers from being too ambiguous for a concise estimation of mood. Moreover, different adjective groups are correlated with each other, as Russell showed [6]. These findings implicate that a less redundant representation of mood can be found.

Dimensional mood models are based on the assertion that different mood states are composed by linear combinations of a low number (i.e., two or three) of basic moods. The best known model is the circumplex model of affect presented by Russell in 1980 [7], consisting of a "two-dimensional space of pleasure-displeasure and degree of arousal" which allows one to identify emotional tags as points in the "mood space" as shown in Figure 1(a). Thayer [8] adopted this idea and divided the "mood space" into four quadrants as depicted in Figure 1(b). This model has mainly been used in recent research [9–11], probably because it leads to two binary classification problems with comparably low complexity.

1.1.2. Audio Features and Metadata. Another task involved in mood recognition is the selection of features as a base for the used learning algorithm. This data either can be directly calculated from the raw audio data or metadata about the piece of music. The former further divide into so-called high- and low-level features. Low-level refers to the characteristics of the audio wave shape like amplitude and spectrum. From these characteristics more abstract—or high-level—properties describing concepts like rhythm or harmonics can be derived. Metadata involves all information that can be found about a music track. This begins at essential information like title or artist and ranges from musical genre to lyrics.

Li and Ogihara [5] extracted a 30-element feature vector containing timbre, pitch, and rhythm features using Marsyas [12], a software framework for audio processing with specific emphasis on Music Information Retrieval applications.

Liu [9] used music in a uniform format (16 kHz, 16 bits, mono channel) and divided into non-overlapping 32 ms long frames. Then timbre features based on global spectral and subband features were extracted. Global spectrum features were centroid, bandwidth, roll-off, and spectral flux. Subband features were octave-based (7 subbands from 0 to 8 kHz) and consist of the minimum, maximum, and average amplitude value for each subband. The root mean square of an audio signal is used as an intensity feature. For extracting rhythm information only the audio information of the lowest subband was used. The amplitude envelope was extracted by use of a hamming window. Edge detection with a Canny estimator delivered a so-called rhythm curve in which peaks were detected as bass instrumental onsets. The average strength of peaks then was used as an estimate for the strength of the rhythm. Auto-correlation delivered information about the regularity of the rhythm, and the common divisor of the correlation peaks was interpreted as the average tempo. Lu et al. [10] continued the work of Liu using the same preprocessing of audio files. Also the timbre and intensity features were identical. To calculate the rhythm curve this time, all subbands were taken into account. The amplitude envelope was extracted for each subband audio signal using a half-Hanning window. A Canny edge detector was used on it to calculate an onset curve. All subband onset curves were then summed up to deliver the rhythm curve from which strength, regularity, and tempo were calculated as explained above.

Trohidis et al. [13] also used timbre and rhythm features, which were extracted as described in the following: two estimates for tempo (bpm, beats per minute) were calculated by identifying peaks in an autocorrelated beat histogram. Additional rhythmic information from the beat histogram was gathered by calculating amplitude ratios and summing of histogram ranges. Timbre features were extracted from the Mel Frequency Cepstral Coefficients (MFCC) [14] and the Short-Term Fourier Transform (FFT), which were both calculated per sound frame of 32 ms duration. From the MFCCs the first 13 coefficients were taken, and from the FFT the spectral characteristics centroid, roll-off, and flux were derived. Additionally, mean and standard deviation of these features were calculated over all frames.

Peeters [15] used the following three feature groups in his submission for the MIREX 2008 (http://www.music-ir.org/mirex/2008/) audio mood classification task: MFCC, SFM/SCM, and Chroma/PCP. The MFCC features were 13 coefficients including the DC component. SFM/SCM are the so-called Spectral Flatness and Spectral Crest Measures. They capture information about whether the spectrum energy is concentrated in peaks or if it is flat. Peaks are characteristic for sinusoidal signals while a flat spectrum indicates noise. Chroma/PCP or Pitch Class Profile represents the distribution of signal energy among the pitch classes (refer to Section 2.3).
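For illustration of the flatness idea only—this is a minimal numpy sketch, not Peeters' actual SFM/SCM implementation (which operates per frequency band); the frame length and sampling rate below are arbitrary assumptions:

import numpy as np

def spectral_flatness(frame, eps=1e-12):
    """Spectral Flatness Measure of one audio frame: ratio of geometric to
    arithmetic mean of the power spectrum. A perfectly flat spectrum gives 1,
    a single spectral peak gives a value near 0."""
    power = np.abs(np.fft.rfft(frame)) ** 2 + eps
    geometric_mean = np.exp(np.mean(np.log(power)))
    arithmetic_mean = np.mean(power)
    return geometric_mean / arithmetic_mean

# toy usage on one 32 ms frame at 16 kHz
sr = 16000
t = np.arange(0, 0.032, 1 / sr)
print(spectral_flatness(np.sin(2 * np.pi * 440 * t)))   # near 0 (tonal)
print(spectral_flatness(np.random.randn(len(t))))       # much larger (noise-like)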
1.1.3. Algorithms and Results. Like with mood taxonomies there is still no agreed consensus on the learning algorithms to use for mood prediction. Obviously, the choice highly depends on the selected mood model. Recent research, which deals with a four-class dimensional mood model [9, 10], uses Gaussian Mixture Models (GMM) as a base for a hierarchical classification system (HCS): at first a binary decision on arousal is made using only rhythm and timbre features. The following valence classification is then derived from the remaining features. This approach yields an average classification accuracy of 86.3%, based on a database of 250 classical music excerpts. Additionally, the mood tracking method presented there is capable of detecting mood boundaries with a high precision of 85.1% and a recall of 84.1% on a base of 63 boundaries in 9 pieces of classical music.

Recently the second challenge in audio mood classification was held as a part of the MIREX 2008. The purpose of this contest is to monitor the current state of research: this year's winner in the mood classification task, Peeters [15], achieved an overall accuracy of 63.7% on the five mood classes shown in Table 2, before the second placed participant with 55.0% accuracy.

1.2. This Work. Having presented the current state of research in automatic mood classification, the main goals for this article are presented.

1.2.1. Aims. The first aim of this work is to build up a music database of annotated music with sufficient size. The selected music should cover today's popular music genres. So this work puts emphasis on popular rather than classical music. In contrast to most existing work no preselection of songs is performed, which is presently also considered a major challenge in the related field of emotion recognition in human speech [16, 17]. It is also attempted to deal with ambiguous songs. For that purpose, a mood model capable of representing ambiguous mood is searched.

Most existing approaches exclusively use low-level features. So in this work middle-level features that are partly based on preclassification are additionally used and tested for suitability to improve the classification. Another task is the identification of relevant features by means of feature relevance analysis. This step is important because it can improve classification accuracy while reducing the number of attributes at the same time. Also, all feature extraction is based on the whole song length rather than selecting excerpts of several seconds and operating only on them.

The final and main goal of this article is to predict a song's mood under real world conditions, that is, by using only meta information available on-line, no preselection of music, and compressed music, as reliably as possible. Additionally, factors limiting the classification success shall be identified and addressed.

1.2.2. Structure. Section 2 deals with the features that are used as the informational base for machine learning. Section 3 contains a description of the music database and all experiments that are conducted. Finally, Section 4 presents the experiments' results, and Section 5 concludes the most important findings.

2. Features

Like in every machine learning problem it is crucial for the success of mood detection to select suitable features. Those are features which convey sufficient information on the music in order to enable the machine learning algorithm to find correlations between feature and class values. Those features either can be extracted directly from the audio data or retrieved from public databases. Both types of features are used in this work and their use for estimating musical mood is investigated. Concerning musical features, both low-level features like spectrum and middle-level features like chords are employed.

2.1. Lyrics. In the field of emotion recognition from speech it is commonly agreed that textual information may help improve over mere acoustic analysis [18, 19]. For 1937 of 2648 songs in the database (cf. Section 3.1) lyrics can automatically be collected from two on-line databases: in a first run lyricsDB (http://lyrics.mirkforce.net/) is applied, which delivers lyrics for 1779 songs; then LyricWiki (http://www.lyricwiki.org/) is searched for all remaining songs, which delivers lyrics for 158 additional songs. The only post-processing needed is to remove obvious "stubs", that is, lyrics containing only some words when the real text is much longer. However, this procedure does not ensure that the remainder of the lyrics is complete or correct at all. It has to be remarked that not only word by word transcripts of a song are collected, but that there are inconsistent conventions used among the databases. So some lyrics contain passages like "Chorus x2" or "(Repeat)", which makes the chorus appear less often in the raw text than it can be heard in a song. To extract information from the raw text that is usable for machine learning, two different approaches are used, as follows.

2.1.1. Semantic Database for Mood Estimation. The first approach is using ConceptNet [20, 21], a text-processing toolkit that makes use of a large semantic database automatically generated from sentences in the Open Mind Common Sense Project (http://openmind.media.mit.edu/). The software is capable of estimating the most likely emotional affect in a raw text input. This has already been shown quite effective for valence prediction in movie reviews [21]. Listing 1 displays the output for an example song.

The underlying algorithm profits from a subset of concepts that are manually classified into one of six emotional categories (happy, sad, angry, fearful, disgusted, and surprised). Now the emotional affect of unclassified concepts that are extracted from the song's lyrics can be calculated by finding and weighting paths which lead to those classified concepts.

The program output is directly used as attributes. Six nominal attributes with the emotional category names as
possible values indicate which mood is the most, second, ..., least dominant in the lyrics. Six additional numeric attributes contain the corresponding probabilities. Note that other alternatives exist, as the word lists found in [22], which directly assign arousal and valence values to words, yet consist of more limited vocabulary.

("sad", 0.579)
("happy", 0.246)
("fearful", 0.134)
("angry", 0.000)
("disgusted", 0.000)
("surprised", 0.000)

Listing 1: ConceptNet lyrics mood estimation for the song "(I Just) Died In Your Arms" by Cutting Crew.
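As a sketch of how such an output can be turned into the twelve attributes just described, the following snippet assumes the (category, probability) pairs of Listing 1 are already available (calling ConceptNet itself is not shown, and the attribute names are chosen freely, not taken from this work):

# derive six rank attributes and six probability attributes
# from a ConceptNet-style mood estimate (cf. Listing 1)
scores = {"sad": 0.579, "happy": 0.246, "fearful": 0.134,
          "angry": 0.0, "disgusted": 0.0, "surprised": 0.0}

# nominal attributes: category names of the most, second, ..., least dominant mood
ranked = sorted(scores, key=scores.get, reverse=True)
rank_attributes = {f"mood_rank_{i + 1}": name for i, name in enumerate(ranked)}

# numeric attributes: the corresponding probabilities
prob_attributes = {f"p_{name}": value for name, value in scores.items()}

print(rank_attributes)  # {'mood_rank_1': 'sad', 'mood_rank_2': 'happy', ...}
print(prob_attributes)  # {'p_sad': 0.579, 'p_happy': 0.246, ...}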
2.1.2. Text Processing. The second approach uses text processing methods introduced in [23] and shown efficient for sentiment detection in [19, 21]. The raw text is first split into words while removing all punctuation. In order to recognise different flexions of the same word (e.g., loved, loving, loves should be counted as love), the conjugated word has to be reduced to its word stem. This is done using the Porter stemming algorithm [24]. It is based on the following idea: each (English) word can be represented in the form [C](VC)^m[V], where C (V) denotes a sequence of one or more consecutive consonants (vowels) and m is called the measure of the word ((VC)^m here means an m-fold repetition of the string VC). Then, in five separate steps, replacement rules are applied to the word. The first step deals with the removal of plural and participle endings. The steps 2 to 5 then replace common word endings like ATION → ATE or IVENESS → IVE. Many of those rules contain conditions under which they may be applied. For example, the rule "(m > 0) TIONAL → TION" is only applied when the remaining stem has a measure greater than zero. This leaves the word "rational" unmodified while "occupational" is replaced. If more than one rule matches in a step, the rule with the biggest matching suffix is applied.

A numerical attribute is generated for each word stem that is not in the list of stop words and occurs at least ten times in one class. The value can be zero if the word stem cannot be found in a song's lyrics. Otherwise, if the word occurs, the number of occurrences is ignored, and the attribute value is set to one, only normalised to the total length of the song's lyrics. This is done to estimate the different prevalence of one word in a song dependent on the total amount of text.

The mood associated with this numerical representation of words contained in the lyrics is finally learned by the classifier as for any acoustic feature. Note that the word order is neglected in this modelling. One could also consider compounds of words by N-grams, that is, N consecutive words. Yet, this usually demands considerably higher amounts of training material as the feature space is blown up exponentially. In our experiments this did not lead to improvements on the tasks presented in the ongoing.
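A minimal sketch of this lyric representation is given below; it uses NLTK's PorterStemmer as a stand-in for the stemmer described above, and the stop word list, the stem vocabulary, and the attribute naming are assumptions rather than the exact setup of this work:

import re
from nltk.stem import PorterStemmer

def lyric_stem_attributes(lyrics, vocabulary, stopwords=frozenset()):
    """Binary stem presence, normalised by the lyric length in words, for a
    fixed stem vocabulary (selected beforehand, e.g. stems occurring at least
    ten times within one class)."""
    stemmer = PorterStemmer()
    words = re.findall(r"[a-z']+", lyrics.lower())   # split words, drop punctuation
    stems = [stemmer.stem(w) for w in words if w not in stopwords]
    n = max(len(stems), 1)
    present = set(stems)
    return {f"stem_{s}": (1.0 / n if s in present else 0.0) for s in vocabulary}

attrs = lyric_stem_attributes("Loved and loving, love is all you need",
                              vocabulary=["love", "need", "sweet"])
print(attrs)  # e.g. {'stem_love': 0.125, 'stem_need': 0.125, 'stem_sweet': 0.0}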
2.2. Metadata. Additional information about the music is sparse in this work because of the large size of the music collection used (refer to Section 3.1): besides the year of release, only the artist and title information is available for each song. While the date is directly used as a numeric attribute, the artist and title fields are processed in a similar way as the lyrics (cf. Section 2.1.2 for a more detailed explanation of the methods): only the binary information about the occurrence of a word stem is obtained. The word stems are generated by string to word vector conversion applied to the artist and title attributes. Standard word delimiters are used to split multiple text strings to words, and the Porter stemming algorithm [24] reduces words to common stems in order to map different forms of one word to their common stem. To limit the number of attributes that are left after conversion, a minimum word frequency is set, which determines how often a word stem must occur within one class. While the artist word list looks very specific to the collection of artists in the database, the title word list seems to have more general relevance with words like "love", "feel", or "sweet". In total, the metadata attributes consist of one numeric date attribute and 152 binary numeric word occurrence attributes.

2.3. Chords. A musical chord is defined as a set of three (sometimes two) or more simultaneously played notes. A note is characterised by its name—which is also referred to as pitch class—and the octave it is played in. An octave is a so-called interval between two notes whose corresponding frequencies are at a ratio of 2 : 1. The octave is a special interval as two notes played in it sound nearly equal. This is why such notes share the same name in music notation. The octave interval is divided into twelve equally sized intervals called semitones. In western music these are named as shown in Figure 2, which visualises these facts. In order to classify a chord, only the pitch classes (i.e., the note names without octave number) of the notes involved are important. There are several different types of chords depending on the size of intervals between the notes. Each chord type has a distinct sound which makes it possible to associate it with a set of moods as depicted in Table 3.

2.3.1. Recognition and Extraction. For chord extraction from the raw audio data a fully automatic algorithm as presented by Harte and Sandler [26] is used. Its basic idea is to map signal energy in frequency subbands to their corresponding pitch class, which leads to a chromagram [27] or pitch class profile (PCP). Each possible chord type corresponds to a specific pattern of tones. By comparing the chromagram with predefined chord templates, an estimate of the chord type can be made. However, also data-driven methods can be employed [28]. Table 4 shows the chord types that are recognised. To determine the tuning of a song for a correct estimation of semitone boundaries, a 36-bin chromagram is calculated first. After tuning, an exact 12-bin chromagram can be generated which represents the 12 different semitones.
The resulting estimate gives the chord type (e.g., major, minor, diminished) and the chord base tone (e.g., C, F, G♯) (cf. [29] for further details).

Table 3: Chord types and their associated emotions [25].

Chord Type            Example   Associated Emotions
Major                 C         Happiness, cheerfulness, confidence, satisfaction, brightness
Minor                 Cm        Sadness, darkness, sullenness, apprehension, melancholy, depression, mystery
Seventh               C7        Funkiness, moderate edginess, soulfulness
Major Seventh         Cmaj7     Romance, softness, jazziness, serenity, exhilaration, tranquillity
Minor Seventh         Cm7       Mellowness, moodiness, jazziness
Ninth                 C9        Openness, optimism
Diminished            Cdim      Fear, shock, spookiness, suspense
Suspended Fourth      Csus4     Delightful tension
Seventh, Minor Ninth  C7/9♭     Creepiness, ominousness, fear, darkness
Added Ninth           Cadd9     Steeliness, austerity

Figure 1: Dimensional mood model development: (a) shows a multidimensional scaling of emotion-related tags suggested by Russell [7]; (b) is Thayer's model [8] with four mood clusters (Exuberance: calm-energy, Anxious: tense-energy, Contentment: calm-tiredness, Depression: tense-tiredness).

Figure 2: The pitch helix as presented in [26]. The height axis is associated with a note's frequency and the rotation corresponds to the pitch class of a note. Here, B_n is one octave below B_{n+1}.

Table 4: Chord types which are recognised and extracted.

Chord Type     Example
Augmented      C+
Diminished     Adim
Diminished 7   Cdim7
Dominant 7     G7
Major          F♯
Major 7        D♯maj7
Minor          Gm
Minor 7        Cm7
Minor Major 7  F♯mmaj7
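To illustrate the template-matching step described in Section 2.3.1 only, the following toy sketch classifies a single 12-bin chroma vector against binary triad templates; it is not the Harte and Sandler algorithm, and the template set is reduced to major and minor triads for brevity:

import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]

def chord_templates():
    """Binary 12-bin templates for major and minor triads only
    (a toy subset of the chord types listed in Table 4)."""
    templates = {}
    for root in range(12):
        for name, intervals in (("maj", (0, 4, 7)), ("min", (0, 3, 7))):
            t = np.zeros(12)
            t[[(root + i) % 12 for i in intervals]] = 1.0
            templates[f"{PITCH_CLASSES[root]}:{name}"] = t
    return templates

def classify_chord(chroma):
    """Pick the template with the highest correlation to the chroma vector."""
    chroma = np.asarray(chroma, dtype=float)
    chroma = chroma / (np.linalg.norm(chroma) + 1e-12)
    return max(chord_templates().items(),
               key=lambda kv: float(np.dot(chroma, kv[1])))[0]

# toy usage: energy on C, E, and G is recognised as C major
chroma = np.zeros(12)
chroma[[0, 4, 7]] = (1.0, 0.8, 0.9)
print(classify_chord(chroma))  # C:maj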
2.3.2. Postprocessing. Timing information is withdrawn and only the sequence of recognised chords is used subsequently. For each chord name and chord type the number of occurrences is divided by the total number of chords in a song. This yields 22 numeric attributes, 21 describing the proportion of chords per chord name or type, and the last one is the number of recognised chords.

2.4. Rhythm Features. Widespread methods for rhythm detection make use of a cepstral analysis or autocorrelation in order to perform tempo detection on audio data. However, cepstral analysis has not proven satisfactory on music without strong rhythms and suffers from slow performance. Both methods have the disadvantages of not being applicable to continuous data and not contributing information to beat tracking.

The rhythm features used in this article rely on a method presented in [30, 31] which itself is based on former work by Scheirer [32]. It uses a bank of comb filters with different resonant frequencies covering a range from 60 to 180 bpm. The output of each filter corresponds to the signal energy belonging to a certain tempo. This approach has several advantages: it delivers a robust tempo estimate and performs well for a wide range of music. Additionally, its output can be used for beat tracking, which strengthens the results by being able to make easy plausibility checks on the results. Further processing of the filter output determines the base meter of a song, that is, how many beats are in each measure and what note value one beat has. The implementation used can recognise whether a song has duple (2/4, 4/4) or triple (3/4, 6/8) meter.

The implementation executes the tempo calculation in two steps: first, the so-called "tatum" tempo is searched. The tatum tempo is the fastest perceived tempo present in a song. For its calculation 57 comb filters are applied to the (preprocessed) audio signal. Their outputs are combined in the unnormalised tatum vector T̃.

(i) The meter vector M = [m_1 ··· m_19]^T consists of normalised entries of score values. Each score value m̃_i determines how well the tempo θ_T · i resonates with the song.

(ii) The tatum vector T = [t_1 ··· t_57]^T is the normalised vector of filter bank outputs.

(iii) Tatum candidates θ_T1, θ_T2 are the tempi corresponding to the two most dominant peaks of T̃. The candidate with the higher confidence is called the tatum tempo θ_T.

(iv) The main tempo θ_B is calculated from the meter vector M. Basically, the tempo which resonates best with the song is chosen.

(v) The tracker tempo θ_BT is the same as the main tempo, but refined by beat tracking. Ideally, θ_B and θ_BT should be identical or vary only slightly due to rhythm inaccuracies.

(vi) The base meter M_b and the final meter M_f are the estimates whether the song has duple or triple meter. Both can have one of the possible values 3 (for triple) or 4 (for duple).

(vii) The tatum maximum T_max is the maximum value of T̃.

(viii) The tatum mean T_mean is the mean value of T̃.

(ix) The tatum ratio T_ratio is calculated by dividing the highest value of T̃ by the lowest.

(x) The tatum slope T_slope is the first value of T̃ divided by the last value.

(xi) The tatum peak distance T_peakdist is the mean of the maximum and minimum values of T̃ normalised by the global mean.

This finally yields 87 numeric attributes, mainly consisting of the tatum and meter vector elements.
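The comb-filter idea can be sketched with numpy as follows: an onset-strength envelope is scored against candidate tempi between 60 and 180 bpm by summing the envelope at equally spaced (comb) lags. This is a simplified stand-in for the filter bank of [30–32], not its implementation; the frame rate and the impulse-train test signal are assumptions for the toy example only.

import numpy as np

def comb_tempo_scores(onset_env, frame_rate, tempi_bpm):
    """Score each candidate tempo by summing the onset envelope at comb
    positions (one beat period apart), keeping the best phase per tempo."""
    n = len(onset_env)
    scores = []
    for bpm in tempi_bpm:
        period = frame_rate * 60.0 / bpm              # beat period in frames
        best = 0.0
        for phase in np.arange(0.0, period):
            idx = np.round(np.arange(phase, n, period)).astype(int)
            idx = idx[idx < n]
            best = max(best, float(onset_env[idx].sum()))
        scores.append(best)
    return np.array(scores)

# toy usage: an impulse train at 120 bpm (one onset every 0.5 s)
frame_rate = 100                                      # onset frames per second
env = np.zeros(30 * frame_rate)
env[::frame_rate // 2] = 1.0
tempi = np.arange(60, 181)                            # candidate tempi in bpm
scores = comb_tempo_scores(env, frame_rate, tempi)    # analogue of the tatum vector
print(tempi[int(np.argmax(scores))])                  # 120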
2.5. Spectral Features. First the audio file is converted to mono, and then a fast Fourier transform (FFT) is applied [33]. For an audio signal which can be described as x : [0,T] → ℝ, t ↦ x(t), the Fourier transform is defined as X(f) = ∫_0^T x(t) e^{−j2πft} dt:

E := ∫_0^∞ |X(f)|² df,  (1)

and with the centre of gravity f_c the nth central moment is introduced as

M_n := (1/E) ∫_0^∞ (f − f_c)^n |X(f)|² df.  (2)

To represent the global characteristics of the spectrum, the following values are calculated and used as features.

(i) The centre of gravity f_c.

(ii) The standard deviation, which is a measure for how much the frequencies in a spectrum can deviate from the centre of gravity. It is equal to √M_2.

(iii) The skewness, which is a measure for how much the shape of the spectrum below the centre of gravity is different from the shape above the mean frequency. It is calculated as M_3/(M_2)^1.5.

(iv) The kurtosis, which is a measure for how much the shape of the spectrum around the centre of gravity is different from a Gaussian shape. It is equal to M_4/(M_2)² − 3.

(v) Band energies and energy densities for the following seven octave-based frequency intervals: 0 Hz–200 Hz, 200 Hz–400 Hz, 400 Hz–800 Hz, 800 Hz–1.6 kHz, 1.6 kHz–3.2 kHz, 3.2 kHz–6.4 kHz, and 6.4 kHz–12.8 kHz.
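A compact numpy sketch of these spectral statistics for a single audio frame is given below; the discrete power spectrum replaces the integrals of (1) and (2), and the frame length and sampling rate are arbitrary assumptions:

import numpy as np

def spectral_features(frame, sample_rate):
    """Discrete versions of the moment-based features (i)-(v):
    the power spectrum stands in for |X(f)|^2, sums for the integrals."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    energy = power.sum() + 1e-12
    f_c = (freqs * power).sum() / energy                   # centre of gravity

    def moment(n):
        return (((freqs - f_c) ** n) * power).sum() / energy

    m2, m3, m4 = moment(2), moment(3), moment(4)
    feats = {"centroid": f_c,
             "std_dev": np.sqrt(m2),
             "skewness": m3 / m2 ** 1.5,
             "kurtosis": m4 / m2 ** 2 - 3.0}
    # band energies and densities for the seven octave-based intervals
    edges = [0, 200, 400, 800, 1600, 3200, 6400, 12800]
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = power[(freqs >= lo) & (freqs < hi)].sum()
        feats[f"energy_{lo}_{hi}"] = band
        feats[f"density_{lo}_{hi}"] = band / (hi - lo)
    return feats

print(spectral_features(np.random.randn(4096), sample_rate=44100)["centroid"])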
3. Experiments

3.1. Database. For building up a ground truth music database the compilation "Now That's What I Call Music!" (U.K. series, volumes 1–69, double CDs each) is selected.
It contains 2648 titles—roughly a week of continuous total play time—and covers the time span from 1983 until now. Likewise it represents very well most music styles which are popular today; these range from Pop and Rock music over Rap and R&B to electronic dance music such as Techno or House. The stereo sound files are MPEG-1 Audio Layer 3 (MP3) encoded using a sampling rate of 44.1 kHz and a variable bit rate of at least 128 kBit/s, as found in many typical use-cases of an automatic mood classification system.

As outlined in Section 1.1.1, a mood model based on the two dimensions valence (=: ν) and arousal (=: α) is used to annotate the music. Basically, Thayer's mood model is used, but with only four possible values (ν, α) ∈ {(1,1), (−1,1), (−1,−1), (1,−1)} it seems not to be capable of covering the musical mood satisfyingly. Lu backs this assumption:

"[...] We find that sometimes the Thayer's model cannot cover all the mood types inherent in a music piece. [...] We also find that it is still possible that an actual music clip may contain some mixed moods or an ambiguous mood." [10]

A more refined discretisation of the two mood dimensions is needed. First a pseudo-continuous annotation was considered, that is, (ν, α) ∈ [−1,1] × [−1,1], but after the annotation of 250 songs that approach showed to be too complex in order to achieve a coherent rating throughout the whole database. So the final model uses five discrete values per dimension. With D := {−2,−1,0,1,2} all songs receive a rating (ν, α) ∈ D² as visualised in Figure 3.

Figure 3: Dimensional mood model with five discrete values for arousal and valence (valence on the horizontal axis from negative to positive, arousal on the vertical axis from passive to active).

Songs were annotated as a whole: many implementations have used excerpts of songs to reduce computational effort and to investigate only on characteristic song parts. This either requires an algorithm for automatically finding the relevant parts as presented, for example, in [34–36] or [37], or needs selection by hand, which would be a clear simplification of the problem. Instead of performing any selection, the songs are used in full length in this article to stick to real world conditions as closely as possible.

Respecting that mood perception is generally judged as highly subjective [38], we decided for four labellers. As stated, mood may well change within a song, as a change of more and less lively passages or a change from sad to a positive resolution. Annotation in such detail is particularly time-intensive, as it not only requires multiple labelling, but additional segmentation, at least on the beat-level. We thus decided in favour of a large database where changes in mood during a song are tried to be "averaged" in annotation, that is, assignment of the connotative mood one would have at first related to a song that one is well familiar with. In fact, this can be very practical and sufficient in many application scenarios, as for automatically suggesting music that fits a listener's mood. A different question though is whether a learning model would benefit from a "cleaner" representation. Yet, we are assuming the addressed music type—mainstream popular and by that usually commercially oriented music—to be less affected by such variation as, for example, found in longer arrangements of classical music. In fact, a similar strategy is followed in the field of human emotion recognition: it has been shown that often up to less than half of the duration of a spoken utterance portrays the perceived emotion when annotated on isolated word level [39]. Yet, emotion recognition from speech by and large ignores this fact by using turn-level labels as the predominant paradigm rather than word-level based ones [40].

Details on the chosen raters (three male, one female, aged between 23 and 34 years; average: 29 years) and their professional and private relation to music are provided in Table 5. Raters A–C stated that they listen to music several hours per day and have no distinct preference of any musical style, while rater D stated to listen to music every second day on average and prefers Pop music over styles as Hard-Rock or Rap.

As can be seen, they were picked to form a well-balanced set spanning from rather "naive" assessors without instrument knowledge and professional relation to "expert" assessors including a club disc jockey (D.J.). The latter can thus be expected to have a good relationship to music mood, and its perception by the audiences. Further, young raters prove a good choice, as they were very well familiar with all the songs of the chosen database. They were asked to make a forced decision according to the two dimensions in the mood plane, assigning values in {−2,−1,0,1,2} for arousal and valence, respectively, as described. They were further instructed to annotate according to the perceived mood, that is, the "represented" mood, not the induced, that is, "felt" one, which could have resulted in too high labelling ambiguity: while one may know the represented mood, it is not mandatory that the intended or equal mood is felt by the raters. Indeed, depending on perceived arousal and valence, different behavioural, physiological, and psychological mechanisms are involved [41].

Listening was done via external sound-proof headphones in an isolated and silent laboratory environment. The songs were presented in MPEG-1 Audio Layer 3 compression
in stereo variable bit rate coding with 128 kBit/s minimum, as for the general processing afterwards. Labelling was carried out individually and independently of the other raters within a period of maximum 20 consecutive working days. A continuous session thereby took a maximum time of two hours. Each song was fully listened to with a maximum of three times forward skipping by 30 seconds, followed by a short break, though the raters knew most songs in the set very well in advance due to their popularity. Playback of songs was allowed, and the judgments could be reviewed—however, without knowledge of the other raters' results. For the annotation a plugin (available at http://www.openaudio.eu/) to the open source audio player Foobar (http://www.foobar2000.org/) was provided that displays the valence-arousal plane colour coded as depicted in Figure 3 for clicking on the appropriate class. The named skip of 30 seconds forward was obtained via hotkey.

Table 5: Overview on the raters (A–D) by age, gender, ethnicity, professional relation to music, instruments played, and ballroom dance abilities. The last column indicates the cross-correlation (CC) between valence (V) and arousal (A) for each rater's annotations.

Rater  Age       Gender  Ethnicity  Prof. Relation  Instruments               Dancing         CC(V,A)
A      34 years  m       European   club D.J.       guitar, drums/percussion  Standard/Latin  0.34
B      23 years  m       European   —               piano                     Standard        0.08
C      26 years  m       European   —               piano                     Latin           0.09
D      32 years  f       Asian      —               —                         —               0.43

Based on each rater's labelling, Table 5 also depicts the correlation of valence and arousal (rightmost column): though the raters were well familiar with the general concept of the dimensions, clear differences are indicated already looking at the variance among these correlations. The distribution of labels per rater as depicted in Figure 4 further visualizes the clear differences in perception. (The complete annotation by the four individuals is available at http://www.openaudio.eu/.)

In order to establish a ground truth that considers every rater's labelling without exclusion of instances, or songs, respectively, that do not possess a majority agreement in label, a new strategy has to be found: in the literature such instances are usually discarded, which however does not reflect a real world usage where a judgment is needed on any musical piece of a database, as its prototypicality is not known in advance, or, in rare works, subsumed as a novel "garbage" class [17]. The latter was found unsuited in our case, as the perception among the raters differs too strongly, and a learnt model is potentially corrupted too strongly by such a garbage class that may easily "consume" the majority of instances due to its lack of sharp definition.

We thus consider two strategies that both benefit from the fact that our "classes" are ordinal, that is, they are based on a discretised continuum: mean of each rater's label or median, which is known to better take care of outliers. To match from mean or median back to classes, a binning is needed, unless we want to introduce novel classes "in between" (consider the example of two raters judging "0" and two "1": by that we obtain a new class "0.5"). We choose a simple round operation to this aim of preserving the original five "classes".

Table 6: Mean kappa values over the raters (A–D) for four different calculations of ground truth (GT) obtained either by employing rounded mean or median of the labels per song. Reduction to three classes by clustering of the negative or positive labels, that is, division by two. κ denotes unweighted (Cohen's) kappa, κ1 linearly weighted, and κ2 quadratically weighted kappa.

No. of Classes  GT      κ      κ1     κ2
Valence
5               mean    0.307  0.453  0.602
5               median  0.411  0.510  0.604
3               mean    0.440  0.461  0.498
3               median  0.519  0.535  0.561
Arousal
5               mean    0.328  0.477  0.634
5               median  0.415  0.518  0.626
3               mean    0.475  0.496  0.533
3               median  0.526  0.545  0.578

To evaluate which of these two types of ground truth calculation is to be preferred, Table 6 shows mean kappa values with none (Cohen's), linear, and quadratic weighting over all raters and per dimension. In addition to the five classes (in the ongoing abbreviated as V5 for valence and A5 for arousal), it considers a clustering of the positive and negative values per dimension, which resembles a division by two prior to the rounding operation (V3 and A3, resp.).

An increasing kappa coefficient by going from no weighting to linear to quadratic thereby indicates that confusions between a rater and the established ground truth occur rather between neighbouring classes, that is, a very negative value is less often confused with a very positive than with a neutral one. Generally, kappa values larger than 0.4 are considered as good agreement, while such larger than 0.7 are considered as very good agreement [42].
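A minimal sketch of this ground-truth construction and its evaluation is given below, using scikit-learn's cohen_kappa_score; the toy labels are made up (not taken from the database), and note that numpy's rounding sends exact halves to the nearest even value, which is one possible reading of the "simple round operation" above:

import numpy as np
from sklearn.metrics import cohen_kappa_score

# toy labels: rows = raters A-D, columns = songs, values in {-2, ..., 2}
labels = np.array([[ 2,  1, -1, 0, 2],
                   [ 1,  1, -2, 0, 1],
                   [ 2,  0, -1, 1, 1],
                   [ 1,  2, -1, 0, 2]])

# rounded median of the four labels per song as ground truth
median = np.median(labels, axis=0)
gt5 = np.round(median).astype(int)        # five classes (V5/A5 style)
gt3 = np.round(median / 2.0).astype(int)  # clustered to three classes (V3/A3 style)

# agreement of one rater with the ground truth: unweighted, linear, quadratic
rater_b = labels[1]
for w in (None, "linear", "quadratic"):
    print(w, cohen_kappa_score(rater_b, gt5, weights=w))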
Table 7: Overview on the raters (A–D) by their kappa values for agreement with the median-based inter-labeller agreement as ground truth for three classes per dimension.

Rater  Valence                Arousal
       κ      κ1     κ2      κ      κ1     κ2
A      0.672  0.696  0.734   0.499  0.533  0.585
B      0.263  0.244  0.210   0.471  0.491  0.524
C      0.581  0.605  0.645   0.512  0.524  0.547
D      0.559  0.596  0.654   0.620  0.633  0.656

Figure 4: 5×5 class distributions of the music database (2648 total instances) for the annotation of each rater (a)–(d). Rows correspond to arousal (from 2 at the top to −2 at the bottom), columns to valence (from −2 on the left to 2 on the right):

(a) Rater A           (b) Rater B
 6  24 126 147  74     1  32  87  23   3
11 124 434 288  81    13 110 390 116  14
 7 110 333 163  49    28 303 658 324  39
41 179 183  71  28    20  80 145 139  15
43  55  38  19  14     4  14  35  44  11

(c) Rater C           (d) Rater D
 1   8  23  24   3     2   1   3  61  50
 4 121 303 132  22    15  74 121 641 232
86 446 617 323  30     6  31 157 366  93
37 132 159 101  34    63 176 202 286  17
 2  17  15   8   0    12  23   9   7   0

Obviously, choosing the median is the better choice—may it be for valence or arousal, five or three classes. Further, three classes show better agreement except when considering quadratic weighting. The latter is however obvious, as fewer confusions with far spread classes can occur for three classes. The choice of ground truth for the rest of this article thus is either (rounded) median after clustering to three classes, or each rater's individual annotation.

In Table 7 the differences among the raters with respect to accordance to this chosen ground truth strategy—three
degrees per dimension and rounded median—are revealed. In particular rater B notably disagrees with the valence ground truth established by all raters. Other than that, generally good agreement is observed.

The preference of three over five classes is further mostly stemming from the lack of sufficient instances for the "extreme" classes. This becomes obvious looking at the resulting distribution of instances in the valence-arousal plane by the rounded median ground truth for the original five classes per dimension as provided in Figure 5. This distribution shows a further interesting effect: though practically no correlation between valence and arousal was measured for the raters B and C, and not too strong such for raters A and D (cf. rightmost column in Table 5), the agreement of raters seems to be found mostly in favour of such a correlation: the diagonal reaching from low valence and arousal to high valence and arousal is considerably more present in terms of frequency of musical pieces. This may either stem from the nature of the chosen compilation of the CDs, which however well cover the typical chart and aired music of their time, or from the fact that generally music with lower activation is rather found connotative with negative valence and vice versa (consider hereon ballads or "happy" disco or dance music as examples).

Figure 5: 5×5 class distribution of the music database (2648 total instances) after annotation based on the rounded median of all raters. Rows correspond to arousal (2 at the top to −2 at the bottom), columns to valence (−2 left to 2 right):

 0   1   9  54  37
 3  57 298 638 101
 8 139 362 270  23
45 322 144  76   6
24  22   6   3   0

The distributions among the five and three classes (as mentioned, by clustering of negative and positive values, each) individually per dimension shown in Figure 6 further illustrate the reason to be found in choosing the three over the five classes in the ongoing.

3.2. Datasets. First all 2648 songs are used in a dataset named AllInst. For evaluation of "true" learning success, training, development, and test partitions are constructed: we decided for a transparent definition that allows easy reproducibility and is not optimized in any respect: training and development are obtained by selecting all songs from odd years, whereby development is assigned by choosing every second odd year. By that, test is defined using every even year. The distributions of instances per partition are displayed in Figure 7 following the three degrees per dimension.

Once development was used for optimization of classifiers or feature selection, the training and development sets are united for training. Note that this partitioning resembles roughly 50%/50% of overall training/test. Performances could probably be increased by choosing a smaller test partition and thus increasing the training material. Yet, we felt that more than 1000 test instances favour statistically more meaningful findings.
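A sketch of this year-based split could look as follows; it follows the rule described above (even years to test, odd years to training, every second odd year to development), but which odd year starts the alternation, and the field names, are assumptions:

def assign_partition(year):
    """Even release year -> test; odd years -> train,
    with every second odd year (here 1983, 1987, ...) used as development."""
    if year % 2 == 0:
        return "test"
    return "develop" if (year - 1983) % 4 == 0 else "train"

# toy usage on a few release years of the compilation (1983 onwards)
songs = [("song_a", 1983), ("song_b", 1984), ("song_c", 1985), ("song_d", 1987)]
for title, year in songs:
    print(title, year, assign_partition(year))
# song_a 1983 develop, song_b 1984 test, song_c 1985 train, song_d 1987 develop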
To reveal the impact of prototypicality, that is, limiting to instances or musical pieces with clear agreement by a majority of raters, we additionally consider the sets Min2/4 for the case of agreement of two out of four raters, while the other two have to disagree among each other, resembling unity among two and a draw between the others, and the set Min3/4, where three out of four raters have to agree. Note that the minimum agreement is based on the original five degrees per dimension and that we consider this subset only for the testing instances, as we want to keep training conditions fixed for better transparency of effects of prototypization. The according distributions are shown in Figure 8.
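A small sketch of such an agreement filter on the original five-point labels is given below; reading Min2/4 as a minimum-agreement criterion that also admits stronger agreement is our assumption, as are the function and set names:

from collections import Counter

def in_subset(ratings, subset):
    """Membership test for the prototypical test subsets, given the four
    raters' labels of one song on the original five-point scale.
    Min3/4: at least three identical labels.
    Min2/4: two identical labels while the other two disagree with them and
            with each other (or any stronger agreement)."""
    counts = sorted(Counter(ratings).values(), reverse=True)
    if subset == "Min3/4":
        return counts[0] >= 3
    if subset == "Min2/4":
        return counts == [2, 1, 1] or counts[0] >= 3
    raise ValueError(subset)

print(in_subset((1, 1, 0, -2), "Min2/4"))  # True: unity of two, draw otherwise
print(in_subset((1, 1, 0, 0), "Min2/4"))   # False: a two-against-two tie
print(in_subset((2, 2, 2, 0), "Min3/4"))   # True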
3.3. Feature Subsets. In addition to the data partitions, the performance is examined in dependence on the subset of attributes used. Refer to Table 8 for an overview of these subsets. They are directly derived from the partitioning in the features section of this work. To better estimate the influence of lyrics on the classification, a special subset called NoLyrics is introduced, which contains all features except those derived from lyrics. Note in this respect that for 25% (675) of the songs no lyrics are available within the two used on-line databases, which was intentionally left as is to again further realism.

3.4. Training Instance Processing. Training on the unmodified training set is likely to deliver a highly biased classifier due to the unbalanced class distribution in all training datasets. To overcome this problem, three different strategies are usually employed [16, 21, 43]: the first is downsampling, in which instances from the overrepresented classes are randomly removed until each class contains the same number of instances. This procedure usually withdraws a lot of instances and with them valuable information, especially in highly unbalanced situations: it always outputs a training dataset size equal to the number of classes multiplied by the number of instances in the class with the least instances. In highly unbalanced experiments, this procedure thus leads to a pathologically small training set. The second method used is upsampling, in which instances from the classes with proportionally low numbers of instances are duplicated to reach a more balanced class distribution. This way no instance is removed from the training set and all information can contribute to the trained classifier. This is why random