Hindawi Publishing Corporation
EURASIP Journal on Audio, Speech, and Music Processing
Volume 2010, Article ID 735854, 19 pages
doi:10.1155/2010/735854

Research Article
Determination of Nonprototypical Valence and Arousal in Popular Music: Features and Performances

Björn Schuller, Johannes Dorfner, and Gerhard Rigoll
Institute for Human-Machine Communication, Technische Universität München, München 80333, Germany

Correspondence should be addressed to Björn Schuller, [email protected]

Received 27 May 2009; Revised 4 December 2009; Accepted 8 January 2010

Academic Editor: Liming Chen

Copyright © 2010 Björn Schuller et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Mood of music is among the most relevant and commercially promising, yet challenging, attributes for retrieval in large music collections. In this respect this article first provides a short overview on methods and performances in the field. While most past research so far dealt with low-level audio descriptors to this aim, this article reports on results exploiting middle-level information such as the rhythmic and chordal structure, or the lyrics of a musical piece. Special attention is given to realism and nonprototypicality of the selected songs in the database: all feature information is obtained by fully automatic preclassification, apart from the lyrics, which are automatically retrieved from on-line sources. Furthermore, instead of exclusively picking songs with agreement of several annotators upon perceived mood, a full collection of 69 double CDs, or 2648 titles, respectively, is processed. Due to the severity of this task, different modelling forms in the arousal and valence space are investigated, and relevance per feature group is reported.

1. Introduction

Music is ambient. Audio encoding has enabled us to digitise our musical heritage, and new songs are released digitally every day. As mass storage has become affordable, it is possible for everyone to aggregate a vast amount of music in personal collections. This brings with it the necessity to somehow organise this music.

The established approach for this task is derived from physical music collections: browsing by artist and album is of course the best choice when searching familiar music for a specific track or release. Additionally, musical genres help to overview similarities in style among artists. However, this categorisation is quite ambiguous and difficult to carry out consistently.

Often music is not selected by artist or album but by the occasion, like doing sports, relaxing after work, or a romantic candle-light dinner. In such cases it would be handy if there was a way to find songs which match the mood associated with the activity, like "activating", "calming", or "romantic" [1, 2]. Of course, manual annotation of music would be a way to accomplish this. There also exist on-line databases with such information, like Allmusic (http://www.allmusic.com/). But the information which can be found there is very inaccurate because it is available on a per-artist instead of a per-track basis. This is where an automated way of classifying music into mood categories using machine learning would be helpful. Shedding light on currently well-suited features and performances, and improving on this task, is thus the concern of this article. Special emphasis is thereby laid on sticking to real-world conditions by absence of any preselection of "friendly" cases, either by considering only music with majority agreement of annotators or by random partitioning of train and test instances.

1.1. State of the Art

1.1.1. Mood Taxonomies. When it comes to automatic music mood prediction, the first task that arises is to find a suitable mood representation. Two different approaches are currently established: a discrete and a dimensional description.

A discrete model relies on a list of adjectives, each describing a state of mood like happy, sad, or depressed. Hevner [3] was the first to come up with a collection of 8 word clusters consisting of 68 words.
Later, Farnsworth [4] regrouped them in 10 labelled groups, which were used and expanded to 13 groups in recent work [5]. Table 1 shows those groups. Also MIREX (Music Information Retrieval Evaluation eXchange) uses word clusters for its Audio Mood Classification (AMC) task, as shown in Table 2.

Table 1: Adjective groups (A–J) as presented by Farnsworth [4]; K–M were extended by Li and Ogihara [5].

A: cheerful, gay, happy
B: fanciful, light
C: delicate, graceful
D: dreamy, leisurely
E: longing, pathetic
F: dark, depressing
G: sacred, spiritual
H: dramatic, emphatic
I: agitated, exciting
J: frustrated
K: mysterious, spooky
L: passionate
M: bluesy

Table 2: MIREX 2008 mood categories (aggr.: aggressive, bittersw.: bittersweet, humor.: humorous, lit.: literate, rollick.: rollicking).

A: passionate, rousing, confident, boisterous, rowdy
B: rollick., cheerful, fun, sweet, amiable/good natured
C: lit., poignant, wistful, bittersw., autumnal, brooding
D: humor., silly, campy, quirky, whimsical, witty, wry
E: aggr., fiery, tense/anxious, intense, volatile, visceral

However, the number and labelling of adjective groups suffers from being too ambiguous for a concise estimation of mood. Moreover, different adjective groups are correlated with each other, as Russell showed [6]. These findings implicate that a less redundant representation of mood can be found.

Dimensional mood models are based on the assertion that different mood states are composed as linear combinations of a low number (i.e., two or three) of basic moods. The best known model is the circumplex model of affect presented by Russell in 1980 [7], consisting of a "two-dimensional space of pleasure-displeasure and degree of arousal" which allows to identify emotional tags as points in the "mood space", as shown in Figure 1(a). Thayer [8] adopted this idea and divided the "mood space" into four quadrants, as depicted in Figure 1(b). This model has mainly been used in recent research [9–11], probably because it leads to two binary classification problems with comparably low complexity.

1.1.2. Audio Features and Metadata. Another task involved in mood recognition is the selection of features as a base for the used learning algorithm. This data can either be directly calculated from the raw audio data or taken from metadata about the piece of music. The former further divide into so-called high- and low-level features. Low-level refers to the characteristics of the audio wave shape, like amplitude and spectrum. From these characteristics, more abstract, or high-level, properties describing concepts like rhythm or harmonics can be derived. Metadata involves all information that can be found about a music track. This begins at essential information like title or artist and ranges from musical genre to lyrics.

Li and Ogihara [5] extracted a 30-element feature vector containing timbre, pitch, and rhythm features using Marsyas [12], a software framework for audio processing with specific emphasis on Music Information Retrieval applications.

Liu [9] used music in a uniform format (16 kHz, 16 bits, mono channel), divided into non-overlapping 32 ms long frames. Then timbre features based on global spectral and subband features were extracted. Global spectrum features were centroid, bandwidth, roll-off, and spectral flux. Subband features were octave-based (7 subbands from 0 to 8 kHz) and consist of the minimum, maximum, and average amplitude value for each subband. The root mean square of an audio signal is used as an intensity feature. For extracting rhythm information, only the audio information of the lowest subband was used. The amplitude envelope was extracted by use of a Hamming window. Edge detection with a Canny estimator delivered a so-called rhythm curve, in which peaks were detected as bass instrument onsets. The average strength of peaks then was used as an estimate for the strength of the rhythm. Auto-correlation delivered information about the regularity of the rhythm, and the common divisor of the correlation peaks was interpreted as the average tempo. Lu et al. [10] continued the work of Liu using the same preprocessing of audio files. Also the timbre and intensity features were identical. To calculate the rhythm curve this time, all subbands were taken into account. The amplitude envelope was extracted for each subband audio signal using a half-Hanning window. A Canny edge detector was used on it to calculate an onset curve. All subband onset curves were then summed up to deliver the rhythm curve, from which strength, regularity, and tempo were calculated as explained above.

Trohidis et al. [13] also used timbre and rhythm features, which were extracted as described in the following: two estimates for tempo in bpm (beats per minute) were calculated by identifying peaks in an autocorrelated beat histogram. Additional rhythmic information from the beat histogram was gathered by calculating amplitude ratios and summing of histogram ranges. Timbre features were extracted from the Mel Frequency Cepstral Coefficients (MFCC) [14] and the Short-Term Fourier Transform (FFT), which were both calculated per sound frame of 32 ms duration. From the MFCCs the first 13 coefficients were taken, and from the FFT the spectral characteristics centroid, roll-off, and flux were derived. Additionally, mean and standard deviation of these features were calculated over all frames.

Peeters [15] used the following three feature groups in his submission for the MIREX 2008 (http://www.music-ir.org/mirex/2008/) audio mood classification task: MFCC, SFM/SCM, and Chroma/PCP. The MFCC features were 13 coefficients including the DC component. SFM/SCM are the so-called Spectral Flatness and Spectral Crest Measures. They capture information about whether the spectrum energy is concentrated in peaks or if it is flat. Peaks are characteristic for sinusoidal signals, while a flat spectrum indicates noise. Chroma/PCP, or Pitch Class Profile, represents the distribution of signal energy among the pitch classes (refer to Section 2.3).
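Several of the systems cited above reduce each 32 ms frame to a handful of scalar spectral descriptors and then take statistics over all frames. The following minimal sketch illustrates how such frame-level features (centroid, roll-off, flux) might be computed; it is not code from any of the cited systems, and the 85% roll-off threshold is a common default assumed here, not a value stated in the article:

```python
import numpy as np

def spectral_stats(frames, sr):
    """Per-frame spectral centroid, 85% roll-off, and flux.

    `frames`: 2-D array (n_frames x frame_len) of mono samples.
    Illustrative sketch only; windowing is omitted and the 0.85
    roll-off threshold is a common default, not taken from the paper.
    """
    mag = np.abs(np.fft.rfft(frames, axis=1))           # magnitude spectra
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    power = mag ** 2
    total = power.sum(axis=1) + 1e-12

    centroid = (freqs * power).sum(axis=1) / total      # centre of gravity
    cum = np.cumsum(power, axis=1)                      # roll-off frequency
    rolloff = freqs[np.argmax(cum >= 0.85 * total[:, None], axis=1)]
    # flux: spectral change between consecutive frames (first frame: 0)
    flux = np.r_[0.0, np.sqrt((np.diff(mag, axis=0) ** 2).sum(axis=1))]
    return centroid, rolloff, flux

# mean and standard deviation over all frames, as in the cited setups
sr = 16000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 440 * t)                       # 1 s test tone
frames = sig[: sr // 512 * 512].reshape(-1, 512)        # 32 ms at 16 kHz
c, r, f = spectral_stats(frames, sr)
stats = np.array([c.mean(), c.std(), r.mean(), r.std(), f.mean(), f.std()])
```

Pooling the per-frame values into means and standard deviations, as in the last line, is what turns a variable-length song into a fixed-length feature vector for the classifier.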
1.1.3. Algorithms and Results. Like with mood taxonomies, there is still no agreed consensus on the learning algorithms to use for mood prediction. Obviously, the choice highly depends on the selected mood model. Recent research, which deals with a four-class dimensional mood model [9, 10], uses Gaussian Mixture Models (GMM) as a base for a hierarchical classification system (HCS): at first a binary decision on arousal is made using only rhythm and timbre features. The following valence classification is then derived from the remaining features. This approach yields an average classification accuracy of 86.3%, based on a database of 250 classical music excerpts. Additionally, the mood tracking method presented there is capable of detecting mood boundaries with a high precision of 85.1% and a recall of 84.1% on a base of 63 boundaries in 9 pieces of classical music.

Recently, the second challenge in audio mood classification was held as a part of the MIREX 2008. The purpose of this contest is to monitor the current state of research: this year's winner in the mood classification task, Peeters [15], achieved an overall accuracy of 63.7% on the five mood classes shown in Table 2, before the second-placed participant with 55.0% accuracy.

1.2. This Work. Having presented the current state of research in automatic mood classification, the main goals for this article are presented.

1.2.1. Aims. The first aim of this work is to build up a music database of annotated music with sufficient size. The selected music should cover today's popular music genres. So this work puts emphasis on popular rather than classical music. In contrast to most existing work, no preselection of songs is performed, which is presently also considered a major challenge in the related field of emotion recognition in human speech [16, 17]. It is also attempted to deal with ambiguous songs. For that purpose, a mood model capable of representing ambiguous mood is searched.

Most existing approaches exclusively use low-level features. So in this work, middle-level features that partly base on preclassification are additionally used and tested for suitability to improve the classification. Another task is the identification of relevant features by means of feature relevance analysis. This step is important because it can improve classification accuracy while reducing the number of attributes at the same time. Also, all feature extraction is based on the whole song length rather than selecting excerpts of several seconds and operating only on them.

The final and main goal of this article is to predict a song's mood under real-world conditions, that is, by using only meta information available on-line, no preselection of music, and compressed music, as reliably as possible. Additionally, factors limiting the classification success shall be identified and addressed.

1.2.2. Structure. Section 2 deals with the features that are used as the informational base for machine learning. Section 3 contains a description of the music database and all experiments that are conducted. Finally, Section 4 presents the experiments' results, and Section 5 concludes the most important findings.

2. Features

Like in every machine learning problem, it is crucial for the success of mood detection to select suitable features. Those are features which convey sufficient information on the music in order to enable the machine learning algorithm to find correlations between feature and class values. Those features can either be extracted directly from the audio data or retrieved from public databases. Both types of features are used in this work, and their use for estimating musical mood is investigated. Concerning musical features, both low-level features like spectrum and middle-level features like chords are employed.

2.1. Lyrics. In the field of emotion recognition from speech, it is commonly agreed that textual information may help improve over mere acoustic analysis [18, 19]. For 1937 of 2648 songs in the database (cf. Section 3.1), lyrics can automatically be collected from two on-line databases: in a first run lyricsDB (http://lyrics.mirkforce.net/) is applied, which delivers lyrics for 1779 songs; then LyricWiki (http://www.lyricwiki.org/) is searched for all remaining songs, which delivers lyrics for 158 additional songs. The only post-processing needed is to remove obvious "stubs", that is, lyrics containing only some words when the real text is much longer. However, this procedure does not ensure that the remainder of the lyrics is complete or correct at all. It has to be remarked that not only word-by-word transcripts of a song are collected, but that there are inconsistent conventions used among the databases. So some lyrics contain passages like "Chorus x2" or "(Repeat)", which makes the chorus appear less often in the raw text than it can be heard in a song. To extract information from the raw text that is usable for machine learning, two different approaches are used, as follows.

2.1.1. Semantic Database for Mood Estimation. The first approach is using ConceptNet [20, 21], a text-processing toolkit that makes use of a large semantic database automatically generated from sentences in the Open Mind Common Sense Project (http://openmind.media.mit.edu/). The software is capable of estimating the most likely emotional affect in a raw text input. This has already been shown quite effective for valence prediction in movie reviews [21]. Listing 1 displays the output for an example song.

("sad", 0.579)
("happy", 0.246)
("fearful", 0.134)
("angry", 0.000)
("disgusted", 0.000)
("surprised", 0.000)

Listing 1: ConceptNet lyrics mood estimation for the song "(I Just) Died In Your Arms" by Cutting Crew.

The underlying algorithm profits from a subset of concepts that are manually classified into one of six emotional categories (happy, sad, angry, fearful, disgusted, and surprised). Now the emotional affect of unclassified concepts that are extracted from the song's lyrics can be calculated by finding and weighting paths which lead to those classified concepts.

The program output is directly used as attributes: six nominal attributes with the emotional category names as possible values indicate which mood is the most, second, ..., least dominant in the lyrics. Six additional numeric attributes contain the corresponding probabilities. Note that other alternatives exist, as the word lists found in [22], which directly assign arousal and valence values to words, yet consist of more limited vocabulary.

2.1.2. Text Processing. The second approach uses text processing methods introduced in [23] and shown efficient for sentiment detection in [19, 21]. The raw text is first split into words while removing all punctuation. In order to recognise different flexions of the same word (e.g., loved, loving, loves should be counted as love), the conjugated word has to be reduced to its word stem. This is done using the Porter stemming algorithm [24]. It is based on the following idea: each (English) word can be represented in the form [C](VC)^m[V], where C (V) denotes a sequence of one or more consecutive consonants (vowels) and m is called the measure of the word ((VC)^m here means an m-fold repetition of the string VC). Then, in five separate steps, replacement rules are applied to the word. The first step deals with the removal of plural and participle endings. The steps 2 to 5 then replace common word endings like ATION → ATE or IVENESS → IVE. Many of those rules contain conditions under which they may be applied. For example, the rule "(m > 0) TIONAL → TION" is only applied when the remaining stem has a measure greater than zero. This leaves the word "rational" unmodified, while "occupational" is replaced. If more than one rule matches in a step, the rule with the biggest matching suffix is applied.

A numerical attribute is generated for each word stem that is not in the list of stop words and occurs at least ten times in one class. The value can be zero if the word stem cannot be found in a song's lyrics. Otherwise, if the word occurs, the number of occurrences is ignored, and the attribute value is set to one, only normalised to the total length of the song's lyrics. This is done to estimate the different prevalence of one word in a song dependent on the total amount of text.

The mood associated with this numerical representation of words contained in the lyrics is finally learned by the classifier as for any acoustic feature. Note that the word order is neglected in this modelling. One could also consider compounds of words by N-grams, that is, N consecutive words. Yet, this usually demands considerably higher amounts of training material, as the feature space is blown up exponentially. In our experiments this did not lead to improvements on the tasks presented in the ongoing.

2.2. Metadata. Additional information about the music is sparse in this work because of the large size of the music collection used (refer to Section 3.1): besides the year of release, only the artist and title information is available for each song. While the date is directly used as a numeric attribute, the artist and title fields are processed in a similar way as the lyrics (cf. Section 2.1.2 for a more detailed explanation of the methods): only the binary information about the occurrence of a word stem is obtained. The word stems are generated by string-to-word-vector conversion applied to the artist and title attributes. Standard word delimiters are used to split multiple text strings into words, and the Porter stemming algorithm [24] reduces words to common stems in order to map different forms of one word to their common stem. To limit the number of attributes that are left after conversion, a minimum word frequency is set, which determines how often a word stem must occur within one class. While the artist word list looks very specific to the collection of artists in the database, the title word list seems to have more general relevance, with words like "love", "feel", or "sweet". In total, the metadata attributes consist of one numeric date attribute and 152 binary numeric word occurrence attributes.

2.3. Chords. A musical chord is defined as a set of three (sometimes two) or more simultaneously played notes. A note is characterised by its name, which is also referred to as pitch class, and the octave it is played in. An octave is a so-called interval between two notes whose corresponding frequencies are at a ratio of 2 : 1. The octave is a special interval, as two notes played in it sound nearly equal. This is why such notes share the same name in music notation. The octave interval is divided into twelve equally sized intervals called semitones. In western music these are named as shown in Figure 2, which visualises these facts. In order to classify a chord, only the pitch classes (i.e., the note names without octave number) of the notes involved are important. There are several different types of chords depending on the size of intervals between the notes. Each chord type has a distinct sound which makes it possible to associate it with a set of moods, as depicted in Table 3.

2.3.1. Recognition and Extraction. For chord extraction from the raw audio data, a fully automatic algorithm as presented by Harte and Sandler [26] is used. Its basic idea is to map signal energy in frequency subbands to their corresponding pitch class, which leads to a chromagram [27] or pitch class profile (PCP). Each possible chord type corresponds to a specific pattern of tones. By comparing the chromagram with predefined chord templates, an estimate of the chord type can be made. However, also data-driven methods can be employed [28]. Table 4 shows the chord types that are recognised. To determine the tuning of a song for a correct estimation of semitone boundaries, a 36-bin chromagram is calculated first. After tuning, an exact 12-bin chromagram can be generated which represents the 12 different semitones.
The resulting estimate gives the chord type (e.g., major, minor, diminished) and the chord base tone (e.g., C, F, G♯) (cf. [29] for further details).

Table 3: Chord types and their associated emotions [25].

Major (C): happiness, cheerfulness, confidence, satisfaction, brightness
Minor (Cm): sadness, darkness, sullenness, apprehension, melancholy, depression, mystery
Seventh (C7): funkiness, moderate edginess, soulfulness
Major Seventh (Cmaj7): romance, softness, jazziness, serenity, exhilaration, tranquillity
Minor Seventh (Cm7): mellowness, moodiness, jazziness
Ninth (C9): openness, optimism
Diminished (Cdim): fear, shock, spookiness, suspense
Suspended Fourth (Csus4): delightful tension
Seventh, Minor Ninth (C7/9♭): creepiness, ominousness, fear, darkness
Added Ninth (Cadd9): steeliness, austerity

Table 4: Chord types which are recognised and extracted.

Augmented (C+), Diminished (Adim), Diminished7 (Cdim7), Dominant7 (G7), Major (F♯), Major7 (D♯maj7), Minor (Gm), Minor7 (Cm7), MinorMajor7 (F♯mmaj7)

Figure 1: Dimensional mood model development: (a) shows a multidimensional scaling of emotion-related tags suggested by Russell [7]; (b) is Thayer's model [8] with the four mood clusters anxious (tense-energy), exuberance (calm-energy), depression (tense-tiredness), and contentment (calm-tiredness).

Figure 2: The pitch helix as presented in [26]. The height axis is associated with a note's frequency and the rotation corresponds to the pitch class of a note. Here, B_n is one octave below B_n+1.

2.3.2. Postprocessing.
Timing information is withdrawn, and only the sequence of recognised chords is used subsequently. For each chord name and chord type, the number of occurrences is divided by the total number of chords in a song. This yields 22 numeric attributes: 21 describing the proportion of chords per chord name or type, and the last one being the number of recognised chords.

2.4. Rhythm Features. Widespread methods for rhythm detection make use of cepstral analysis or autocorrelation in order to perform tempo detection on audio data. However, cepstral analysis has not proven satisfactory on music without strong rhythms and suffers from slow performance. Both methods have the disadvantages of not being applicable to continuous data and of not contributing information to beat tracking.

The rhythm features used in this article rely on a method presented in [30, 31], which itself is based on former work by Scheirer [32]. It uses a bank of comb filters with different resonant frequencies covering a range from 60 to 180 bpm. The output of each filter corresponds to the signal energy belonging to a certain tempo. This approach has several advantages: it delivers a robust tempo estimate and performs well for a wide range of music. Additionally, its output can be used for beat tracking, which strengthens the results by being able to make easy plausibility checks on the results. Further processing of the filter output determines the base meter of a song, that is, how many beats are in each measure and what note value one beat has. The implementation used can recognise whether a song has duple (2/4, 4/4) or triple (3/4, 6/8) meter.

The implementation executes the tempo calculation in two steps: first, the so-called "tatum" tempo is searched. The tatum tempo is the fastest perceived tempo present in a song. For its calculation, 57 comb filters are applied to the (preprocessed) audio signal. Their outputs are combined in the unnormalised tatum vector T'. The following features are derived:

(i) The meter vector M = [m_1 · · · m_19]^T consists of normalised entries of score values. Each score value m_i determines how well the tempo θ_T · i resonates with the song.
(ii) The tatum vector T = [t_1 · · · t_57]^T is the normalised vector of filter bank outputs.
(iii) The tatum candidates θ_T1, θ_T2 are the tempi corresponding to the two most dominant peaks of T'. The candidate with the higher confidence is called the tatum tempo θ_T.
(iv) The main tempo θ_B is calculated from the meter vector M. Basically, the tempo which resonates best with the song is chosen.
(v) The tracker tempo θ_BT is the same as the main tempo, but refined by beat tracking. Ideally, θ_B and θ_BT should be identical or vary only slightly due to rhythm inaccuracies.
(vi) The base meter M_b and the final meter M_f are the estimates whether the song has duple or triple meter. Both can have one of the possible values 3 (for triple) or 4 (for duple).
(vii) The tatum maximum T_max is the maximum value of T'.
(viii) The tatum mean T_mean is the mean value of T'.
(ix) The tatum ratio T_ratio is calculated by dividing the highest value of T' by the lowest.
(x) The tatum slope T_slope is the first value of T' divided by the last value.
(xi) The tatum peak distance T_peakdist is the mean of the maximum and minimum peak values of T', normalised by the global mean.

This finally yields 87 numeric attributes, mainly consisting of the tatum and meter vector elements.

2.5. Spectral Features. First the audio file is converted to mono, and then a fast Fourier transform (FFT) is applied [33]. For an audio signal which can be described as x : [0, T] → R, t ↦ x(t), the Fourier transform is defined as X(f) = ∫_0^T x(t) e^(−j2πft) dt. The signal energy is

E := ∫_0^∞ |X(f)|² df, (1)

and with the centre of gravity f_c, the nth central moment is introduced as

M_n := (1/E) ∫_0^∞ (f − f_c)^n |X(f)|² df. (2)

To represent the global characteristics of the spectrum, the following values are calculated and used as features:

(i) The centre of gravity f_c.
(ii) The standard deviation, which is a measure for how much the frequencies in a spectrum can deviate from the centre of gravity. It is equal to √M_2.
(iii) The skewness, which is a measure for how much the shape of the spectrum below the centre of gravity differs from the shape above the mean frequency. It is calculated as M_3/(M_2)^1.5.
(iv) The kurtosis, which is a measure for how much the shape of the spectrum around the centre of gravity differs from a Gaussian shape. It is equal to M_4/(M_2)² − 3.
(v) Band energies and energy densities for the following seven octave-based frequency intervals: 0 Hz–200 Hz, 200 Hz–400 Hz, 400 Hz–800 Hz, 800 Hz–1.6 kHz, 1.6 kHz–3.2 kHz, 3.2 kHz–6.4 kHz, and 6.4 kHz–12.8 kHz.

3. Experiments

3.1. Database. For building up a ground truth music database, the compilation "Now That's What I Call Music!" (U.K. series, volumes 1–69, double CDs each) is selected. It contains 2648 titles, roughly a week of continuous total play time, and covers the time span from 1983 until now. Likewise it represents very well most music styles which are popular today; that ranges from Pop and Rock music over Rap and R&B to electronic dance music as Techno or House. The stereo sound files are MPEG-1 Audio Layer 3 (MP3) encoded, using a sampling rate of 44.1 kHz and a variable bit rate of at least 128 kBit/s, as found in many typical use-cases of an automatic mood classification system.

Like outlined in Section 1.1.1, a mood model based on the two dimensions valence (=: ν) and arousal (=: α) is used to annotate the music. Basically, Thayer's mood model is used, but with only four possible values (ν, α) ∈ {(1,1), (−1,1), (−1,−1), (1,−1)} it seems not to be capable to cover the musical mood satisfyingly. Lu backs this assumption:

"[...] We find that sometimes the Thayer's model cannot cover all the mood types inherent in a music piece. [...] We also find that it is still possible that an actual music clip may contain some mixed moods or an ambiguous mood." [10]

A more refined discretisation of the two mood dimensions is needed. First a pseudo-continuous annotation was considered, that is, (ν, α) ∈ [−1,1] × [−1,1], but after the annotation of 250 songs that approach showed to be too complex in order to achieve a coherent rating throughout the whole database. So the final model uses five discrete values per dimension. With D := {−2, −1, 0, 1, 2}, all songs receive a rating (ν, α) ∈ D², as visualised in Figure 3.

Figure 3: Dimensional mood model with five discrete values for arousal (passive to active) and valence (negative to positive).

Songs were annotated as a whole: many implementations have used excerpts of songs to reduce computational effort and to investigate only characteristic song parts. This either requires an algorithm for automatically finding the relevant parts, as presented, for example, in [34–36] or [37], or needs selection by hand, which would be a clear simplification of the problem. Instead of performing any selection, the songs are used in full length in this article to stick to real-world conditions as closely as possible.

Respecting that mood perception is generally judged as highly subjective [38], we decided for four labellers. As stated, mood may well change within a song, as a change of more and less lively passages or a change from sad to a positive resolution. Annotation in such detail is particularly time-intensive, as it not only requires multiple labelling, but additional segmentation, at least on the beat level. We thus decided in favour of a large database where changes in mood during a song are "averaged" in annotation, that is, assignment of the connotative mood one is well familiar with. In fact, this can be very practical and sufficient in many application scenarios, as for automatic suggestion that fits a listener's mood. A different question though is whether a learning model would benefit from a "cleaner" representation. Yet, we are assuming the addressed music type (mainstream popular and by that usually commercially oriented) to be less affected by such variation as, for example, found in longer arrangements of classical music. In fact, a similar strategy is followed in the field of human emotion recognition: it has been shown that often up to less than half of the duration of a spoken utterance portrays the perceived emotion when annotated on isolated word level [39]. Yet, emotion recognition from speech by and large ignores this fact by using turn-level labels as the predominant paradigm rather than word-level based ones [40].

Details on the chosen raters (three male, one female, aged between 23 and 34 years; average: 29 years) and their professional and private relation to music are provided in Table 5. Raters A–C stated that they listen to music several hours per day and have no distinct preference for any musical style, while rater D stated to listen to music every second day on average and prefers Pop music over styles as Hard-Rock or Rap.

Table 5: Overview on the raters (A–D) by age, gender, ethnicity, professional relation to music, instruments played, and ballroom dance abilities. The last column indicates the cross-correlation (CC) between valence (V) and arousal (A) for each rater's annotations.

Rater  Age       Gender  Ethnicity  Prof. Relation  Instruments               Dancing         CC(V,A)
A      34 years  m       European   club D.J.       guitar, drums/percussion  Standard/Latin  0.34
B      23 years  m       European   —               piano                     Standard        0.08
C      26 years  m       European   —               piano                     Latin           0.09
D      32 years  f       Asian      —               —                         —               0.43

As can be seen, they were picked to form a well-balanced set spanning from rather "naive" assessors without instrument knowledge and professional relation to "expert" assessors, including a club disc jockey (D.J.). The latter can thus be expected to have a good relationship to music mood and its perception by the audiences. Further, young raters prove a good choice, as they were very well familiar with all the songs of the chosen database. They were asked to make a forced decision according to the two dimensions in the mood plane, assigning values in {−2, −1, 0, 1, 2} for arousal and valence, respectively, as described. They were further instructed to annotate according to the perceived mood, that is, the "represented" mood, not the induced, that is, "felt" one, which could have resulted in too high labelling ambiguity: while one may know the represented mood, it is not mandatory that the intended or equal mood is felt by the raters. Indeed, depending on perceived arousal and valence, different behavioural, physiological, and psychological mechanisms are involved [41].

Listening was chosen via external sound-proof headphones in an isolated and silent laboratory environment. The songs were presented in MPEG-1 Audio Layer 3 compression in stereo variable bit rate coding and 128 kBit/s minimum, as for the general processing afterwards. Labelling was carried out individually and independently of the other raters within a period of maximum 20 consecutive working days. A continuous session thereby took a maximum time of two hours. Each song was fully listened to, with a maximum of three times forward skipping by 30 seconds, followed by a short break, though the raters knew most songs in the set very well in advance due to their popularity. Playback of songs was allowed, and the judgments could be reviewed, however without knowledge of the other raters' results. For the annotation, a plugin (available at http://www.openaudio.eu/) to the open source audio player Foobar (http://www.foobar2000.org/) was provided that displays the valence-arousal plane, colour coded as depicted in Figure 3, for clicking on the appropriate class. The named skip of 30 seconds forward was obtained via hotkey.

Based on each rater's labelling, Table 5 also depicts the correlation of valence and arousal (rightmost column): though the raters were well familiar with the general concept of the dimensions, clear differences are indicated already by looking at the variance among these correlations. In a nonprototypical setting, every musical piece of a database has to be annotated, as its prototypicality is not known in advance, or, in rare works, it is subsumed as a novel "garbage" class [17]. The latter was found unsuited in our case, as the perception among the raters differs too strongly, and a learnt model is potentially corrupted too strongly by such a garbage class that may easily "consume" the majority of instances due to its lack of sharp definition.

We thus consider two strategies that both benefit from the fact that our "classes" are ordinal, that is, they are based on a discretised continuum: mean of each rater's labels, or median, which is known to better take care of outliers. To match from mean or median back to classes, a binning is needed, unless we want to introduce novel classes "in between" (consider the example of two raters judging "0" and two "1": by that we obtain a new class "0.5"). We choose a simple round operation to this aim of preserving the original five "classes".

To evaluate which of these two types of ground truth calculation is to be preferred, Table 6 shows mean kappa values with none (Cohen's κ), linear (κ1), and quadratic (κ2) weighting over all raters and per dimension.

Table 6: Mean kappa values over the raters (A–D) for four different calculations of ground truth (GT), obtained either by employing rounded mean or median of the labels per song. Reduction of classes by clustering of the negative or positive labels, that is, division by two.

No. of Classes  GT      κ      κ1     κ2
Valence
5               mean    0.307  0.453  0.602
5               median  0.411  0.510  0.604
3               mean    0.440  0.461  0.498
3               median  0.519  0.535  0.561
Arousal
5               mean    0.328  0.477  0.634
5               median  0.415  0.518  0.626
3               mean    0.475  0.496  0.533
3               median  0.526  0.545  0.578

In addition to the five classes (in the ongoing abbreviated as V5 for valence and
The A5forarousal),itconsidersaclusteringofthepositiveand distribution of labels per rater as depicted in Figure 4 negative values per dimensions, which resembles a division further visualizes the clear differences in perception. (The bytwopriortotheroundingoperation(V3andA3,resp.). complete annotation by the four individuals is available at Anincreasingkappacoefficientbygoingfromnoweight- http://www.openaudio.eu/.) ing to linear to quadratic thereby indicates that confusions Inordertoestablishagroundtruththatconsidersevery betweenaraterandtheestablishedgroundtruthoccurrather rater’s labelling without exclusion of instances, or songs, betweenneighbouringclasses,thatis,averynegativevalueis respectively, that do not possess a majority agreement in less often confused with a very positive than with a neutral label, a new strategy has to be found: in the literature such one.Generally,kappavalueslarger0.4areconsideredasgood instances are usually discarded, which however does not agreement,whilesuchlarger0.7areconsideredasverygood reflectarealworldusagewhereajudgmentisneededonany agreement[42]. EURASIPJournalonAudio,Speech,andMusicProcessing 9 Table7:Overviewontheraters(A–D)bytheirkappavaluesforagreementwiththemedian-basedinter-labelleragreementasgroundtruth forthreeclassesperdimension. 
Rater | Valence: κ | κ1    | κ2    | Arousal: κ | κ1    | κ2
A     | 0.672      | 0.696 | 0.734 | 0.499      | 0.533 | 0.585
B     | 0.263      | 0.244 | 0.210 | 0.471      | 0.491 | 0.524
C     | 0.581      | 0.605 | 0.645 | 0.512      | 0.524 | 0.547
D     | 0.559      | 0.596 | 0.654 | 0.620      | 0.633 | 0.656

(a) Rater A             (b) Rater B
  6  24 126 147  74   |   1  32  87  23   3
 11 124 434 288  81   |  13 110 390 116  14
  7 110 333 163  49   |  28 303 658 324  39
 41 179 183  71  28   |  20  80 145 139  15
 43  55  38  19  14   |   4  14  35  44  11

(c) Rater C             (d) Rater D
  1   8  23  24   3   |   2   1   3  61  50
  4 121 303 132  22   |  15  74 121 641 232
 86 446 617 323  30   |   6  31 157 366  93
 37 132 159 101  34   |  63 176 202 286  17
  2  17  15   8   0   |  12  23   9   7   0

Figure 4: 5×5 class distributions of the music database (2648 total instances) for the annotation of each rater (a)–(d). Rows: arousal from 2 (top) to −2 (bottom); columns: valence from −2 (left) to 2 (right).

Obviously, choosing the median is the better choice—be it for valence or arousal, five or three classes. Further, three classes show better agreement, unless quadratic weighting is considered. The latter is however obvious, as fewer confusions with far-spread classes can occur for three classes. The choice of ground truth for the rest of this article thus is either (rounded) median after clustering to three classes, or each rater's individual annotation.

In Table 7 the differences among the raters with respect to accordance with this chosen ground truth strategy—three degrees per dimension and rounded median—are revealed. In particular rater B notably disagrees with the valence ground truth established by all raters. Other than that, generally good agreement is observed.

The preference of three over five classes further mostly stems from the lack of sufficient instances for the "extreme" classes. This becomes obvious looking at the resulting distribution of instances in the valence-arousal plane by the rounded median ground truth for the original five classes per dimension as provided in Figure 5.

  0   1   9  54  37
  3  57 298 638 101
  8 139 362 270  23
 45 322 144  76   6
 24  22   6   3   0

Figure 5: 5×5 class distribution of the music database (2648 total instances) after annotation based on rounded median of all raters. Rows: arousal from 2 (top) to −2 (bottom); columns: valence from −2 (left) to 2 (right).

This distribution shows a further interesting effect: though practically no correlation between valence and arousal was measured for the raters B and C, and not too strong such for raters A and D (cf. the rightmost column in Table 5), the agreement of raters seems to be found mostly in favour of such a correlation: the diagonal reaching from low valence and arousal to high valence and arousal is considerably more present in terms of frequency of musical pieces. This may either stem from the nature of the chosen compilation of the CDs, which however well cover the typical chart and aired music of their time, or from the fact that generally music with lower activation is rather found connotative with negative valence and vice versa (consider hereon ballads or "happy" disco or dance music as examples).

The distributions among the five and three classes (as mentioned, obtained by clustering of negative and positive values, each), shown individually per dimension in Figure 6, further illustrate the reason for choosing the three over the five classes in the ongoing.

3.2. Datasets. First all 2648 songs are used in a dataset named AllInst. For evaluation of "true" learning success, training, development, and test partitions are constructed: we decided for a transparent definition that allows easy reproducibility and is not optimized in any respect: training and development are obtained by selecting all songs from odd years, whereby development is assigned by choosing every second odd year. By that, test is defined using every even year. The distributions of instances per partition are displayed in Figure 7 following the three degrees per dimension. Once development was used for optimization of classifiers or feature selection, the training and development sets are united for training. Note that this partitioning resembles roughly 50%/50% of overall training/test. Performances could probably be increased by choosing a smaller test partition and thus increasing the training material. Yet, we felt that more than 1000 test instances favour statistically more meaningful findings.

To reveal the impact of prototypicality, that is, limiting to instances or musical pieces with clear agreement by a majority of raters, we additionally consider the sets Min2/4 for the case of agreement of two out of four raters, while the other two have to disagree among each other, resembling unity among two and a draw between the others, and the set Min3/4, where three out of four raters have to agree. Note that the minimum agreement is based on the original five degrees per dimension and that we consider this subset only for the testing instances, as we want to keep training conditions fixed for better transparency of the effects of prototypization. The according distributions are shown in Figure 8.

3.3. Feature Subsets. In addition to the data partitions, the performance is examined in dependence on the subset of attributes used. Refer to Table 8 for an overview of these subsets. They are directly derived from the partitioning in the features section of this work. To better estimate the influence of lyrics on the classification, a special subset called NoLyrics was introduced, which contains all features except those derived from lyrics. Note in this respect that for 25% (675) of the songs no lyrics are available within the two used on-line databases, which was intentionally left as is to, again, further realism.

3.4. Training Instance Processing. Training on the unmodified training set is likely to deliver a highly biased classifier due to the unbalanced class distribution in all training datasets. To overcome this problem, three different strategies are usually employed [16, 21, 43]: the first is downsampling, in which instances from the overrepresented classes are randomly removed until each class contains the same number of instances. This procedure usually withdraws a lot of instances and with them valuable information, especially in highly unbalanced situations: it always outputs a training dataset size equal to the number of classes multiplied by the number of instances in the class with the least instances. In highly unbalanced experiments, this procedure thus leads to a pathologically small training set. The second method used is upsampling, in which instances from the classes with proportionally low numbers of instances are duplicated to reach a more balanced class distribution. This way no instance is removed from the training set and all information can contribute to the trained classifier. This is why random
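The two balancing strategies described in this subsection can be sketched as follows. This is a minimal illustration under our own naming (the function names and the label accessor are not from the article's implementation); downsampling draws randomly, while upsampling here duplicates cyclically until the majority-class size is reached:

```python
import random
from collections import defaultdict

def downsample(instances, label_of):
    """Randomly drop instances of overrepresented classes until every
    class matches the size of the smallest class."""
    by_class = defaultdict(list)
    for x in instances:
        by_class[label_of(x)].append(x)
    n_min = min(len(items) for items in by_class.values())
    balanced = []
    for items in by_class.values():
        balanced.extend(random.sample(items, n_min))
    return balanced

def upsample(instances, label_of):
    """Duplicate instances of underrepresented classes until every
    class matches the size of the largest class; no instance is lost."""
    by_class = defaultdict(list)
    for x in instances:
        by_class[label_of(x)].append(x)
    n_max = max(len(items) for items in by_class.values())
    balanced = []
    for items in by_class.values():
        balanced.extend(items)
        # cycle through the class again until it reaches n_max
        for i in range(n_max - len(items)):
            balanced.append(items[i % len(items)])
    return balanced
```

Note how downsampling always returns (number of classes) × (size of the smallest class) instances—exactly the pathologically small set the text warns about—whereas upsampling grows every class to the majority size.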
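Looking back at the annotation procedure, the rounded-median ground truth (with the division by two that yields the three-class variants V3/A3) and Cohen's kappa with linear or quadratic weights can be sketched as below. This is our own minimal reconstruction, not the authors' code: in particular, the article does not specify how half-valued medians are rounded, so Python's round-half-to-even tie-breaking applies here.

```python
import statistics
from itertools import product

def ground_truth(labels, n_classes=5):
    """Rounded median of the raters' labels in {-2, ..., 2}.
    For three classes the labels are halved before rounding, which
    clusters the negative and the positive values (V3/A3).
    Caveat: half-valued medians fall back to Python's banker's
    rounding; the article leaves this tie-breaking unspecified."""
    if n_classes == 3:
        labels = [l / 2.0 for l in labels]
    return int(round(statistics.median(labels)))

def weighted_kappa(y1, y2, categories, power=2):
    """Cohen's kappa with |i-j|**power disagreement weights:
    power=1 -> linear, power=2 -> quadratic, power=0 -> unweighted."""
    idx = {c: i for i, c in enumerate(categories)}
    k, n = len(categories), len(y1)
    obs = [[0.0] * k for _ in range(k)]  # observed confusion matrix
    for a, b in zip(y1, y2):
        obs[idx[a]][idx[b]] += 1
    row = [sum(r) for r in obs]  # marginal counts of y1
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]  # of y2
    num = den = 0.0
    for i, j in product(range(k), repeat=2):
        w = abs(i - j) ** power if power else float(i != j)
        num += w * obs[i][j]            # observed weighted disagreement
        den += w * row[i] * col[j] / n  # chance-expected disagreement
    return 1.0 - num / den
```

Mapping ground_truth over the four labels of every song yields the 5- or 3-class ground truth evaluated in Tables 6 and 7; for instance, two raters judging "0" and two judging "1" give a median of 0.5, which is exactly the tie discussed in the text (Python rounds it down to 0).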
