ebook img

Rhythm Transcription of Polyphonic Piano Music Based on Merged-Output HMM for Multiple Voices PDF

1.2 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Rhythm Transcription of Polyphonic Piano Music Based on Merged-Output HMM for Multiple Voices

IEEE/ACMTRANSACTIONSONAUDIO,SPEECH,ANDLANGUAGEPROCESSING,VOL.XX,NO.YY,2017 1 Rhythm Transcription of Polyphonic Piano Music Based on Merged-Output HMM for Multiple Voices Eita Nakamura, Member, IEEE, Kazuyoshi Yoshii, Member, IEEE, and Shigeki Sagayama, Member, IEEE Abstract—In a recent conference paper, we have reported a n) 7 rMsothfrayucrtcohktnmouvvreetmnrotaoifnodpnseocalrll(yiHppmthMieootMnnhioc)mdtmehstauhttshoiecdax.tpTblcahiocsiiuestdllmdyoodnndeoesatclrsmpiobrleveorsepgsteehardel-myomuadutjepolstruicptprlirhebo-ievdbodlteihecmnee V(leofitc-hea PHnodMls ypcMpaohr r1toe)nic &?######4433 œœœ3œ œ œœœ3œœœ œœŒ#3œœœ (rVigohicte-h HanMd Mpa r2t) oice separatio 29 Jan 201 nppaHaoasinlfnahrxMsetdeotusniohMenrvtoeenefhoxmsetieamrcofaremeoefnencraitosocmnhlewnemgeopuohodaptlrtifterlchiiectaihlpenhtohtleiomoenoitdsnuineseqveflt.rpusosuamcUiuyenecr,nstnsdeiipcspcwnohtpergfiharoeooiosrbnacMnffhcaoioycbnItrfuDhiimblstrepiIehattoiveccawrelaoieyprseelmccrriseddohhonperiypoptadftvfoeroehiocsnanrrmetitgcduhdiiasevyrcosmneeto.nhysofeIamcnpvdnomcaaedrltlltaseheurtsarsipaasgsneniatevocdiprsdoeraacadnn-lrpmoeitsinpeusvpe.rwietttapWilteiwonohutrnhpoeeest Generative process ?#####0# (s43ec)œ3œ œ1œ3œ œ œ#2T#3œemœpo p(sehraMfSrMoceerdoIrmD rgbeaIyen bcoeth vo&i#c###e#0#s (s)43ec)œœ 1œœ œ œŒ2 œ on (rhythm transcription + v I] pmpieeertcfheosor,dmswabenycfemosuonardnedtthhapanetr1fto2hrepmopeirndotspaoilnmsetodhsetmaaocscduegrloaoocudytpfaoesrrfptohorelmyrbehdeysttohtmohneierc Poplyeprfhoornmica nMceIDI ######0 (sec) 1 #2 Merged-output HMM Estimati A Fig. 1. Overview of the proposed model describing the generation of for non-polyrhythmic performances. This reveals the state-of- polyphonicperformances. . the-art methods of rhythm transcription for the first time in the s c literature. Publicly available source codes are also provided for [ future comparisons. evaluations on rhythm transcription have been reported in the literature and the state-of-the-art method has not been known. 1 IndexTerms—Rhythmtranscription,statisticalmusiclanguage Rhythm transcription also raises challenging problems of v model, model for polyphonic music scores, hidden Markov 3 models, music performance model. representingandmodellingscoresandperformancesforpoly- 4 phonic music. This is because a polyphonic score has multi- 3 layerstructure,whereconcurrentlysoundingnotesaregrouped 8 I. INTRODUCTION intoseveralstreams,orvoices1.AsexplainedinSec.II,acon- 0 . MUSIC transcription is one of the most challenging ventional way of representing a polyphonic score as a linear 1 sequence of chords [10] may not retain sequential regularities problems in music information processing. To obtain 0 within voices, such as those in polyrhythmic scores, nor it musicscores,weneedtoextractpitchinformationfrommusic 7 can capture the loose synchrony between voices [21], [22] 1 audio signals. Recently pitch analysis for polyphonic (e.g. : piano) music has been receiving much attention [1], [2]. To in polyphonic performances. Therefore solutions to explicitly v describe the multiple-voice structure must be sought. i solvetheotherpartofthetranscriptionproblem,manystudies X From this point of view, in a recent conference [19], we have been devoted to so-called rhythm transcription, that is, r the problem of recognising quantised note lengths (or note reported a statistical model that can describe the multiple- a voicestructureofpolyphonicmusic.Themodelisbasedonthe values) of the musical notes in MIDI performances [3]–[19]. merged-output HMM [23], [24], which describes polyphonic Sinceearlystudiesinthe1980s,variousmethodshavebeen performances as merged outputs from multiple component proposed for rhythm transcription. As we explain in detail in HMMs, called voice HMMs, each of which describes the Sec.II,thegeneraltrendhasshiftedtousingmachinelearning generative process of music scores and performances of one techniques to capture what natural music scores are and how voice (Fig. 1). It was confirmed that the model outperformed musicperformancesfluctuateintime.Oneofthemodelsmost conventionalHMM-basedmethodsfortranscribingpolyrhyth- frequently used in recent studies [8]–[13], [18], [19] is the mic performances. hidden Markov model (HMM) [20]. In spite of its importance Thepurposeofthispaperistodiscussindetailthemerged- and about 30 years of history, however, little comparative output HMM and its inference technique. Due to the large E. Nakamura and K. Yoshii are with the Graduate size of the state space and the complex dependencies between School of Informatics, Kyoto University, Kyoto 606-8501, variables, the standard Viterbi algorithm or its refined version Japan (e-mail: [email protected], [23] cannot be applied and a new inference technique is [email protected]). E. Nakamura is supported by the JSPSresearchfellowship(PD). S. Sagayama is with the Graduate School of Advanced Mathemati- 1In this paper, a ‘voice’ means a unit stream of musical notes that can cal Sciences, Meiji University, Nakano, Tokyo 164-8525, Japan (e-mail: containchords.ThescoreinFig.2,forexample,hastwovoicescorresponding [email protected]). totheleftandrighthands. IEEE/ACMTRANSACTIONSONAUDIO,SPEECH,ANDLANGUAGEPROCESSING,VOL.XX,NO.YY,2017 2 necessary. This problem typically arises when a voice HMM œ isanautoregressiveHMM,whichiscommonlyusedasmusic b 2 œ œ score/performancemodelswhereoutputprobabilitiesofevents & b b 4 œ ˙ (e.g. pitch, time, etc.) depend on past events. Using a trick of œ œ introducing an auxiliary variable to trace the history of output ? b 2 œ œ œ œ symbols similarly as in Ref. [24], we develop an inference b b 4 technique that can work in a practical computer environment and could be applied for any merged-output HMMs with autoregressive voice HMMs. Note model j j j j œ œ œ œ œ œ We provide a complete description of the proposed model and examine the influence of its architecture and parame- ters. First, we explain details omitted in the previous paper Metrical model including the description of the chord model and a switch of a coupling parameter between voice HMMs depending Fig.2. Twodifferentrepresentationsofamusicscoreinpreviouslyproposed on pitch contexts. The effects are examined in terms of HMMs. accuracies. Second, the determination of model parameters basedonsupervisedlearningisdiscussedandtheinfluenceof metre structure simultaneously by recursively subdividing a parameters of the performance model is investigated. Finally, time interval into two or three almost equally spaced parts a feature of the proposed method is its simultaneous voice that are likely to begin at note onsets. A similar method of separation and rhythm recognition. We examine this effect dividing a time interval using template grids and an error by evaluating accuracies of both voice separation and rhythm functionofonsetsandinter-onsetintervals(IOIs)hasalsobeen recognition and comparing with a cascading algorithm that proposed [4]. Methods using preference rules for the ratios of performs voice separation first and then recognises rhythm. quantised note durations have been developed by Chowning Another contribution of this paper is to present results etal.[5]andTemperleyetal.[6].Desainetal.[7]proposeda of systematic comparative evaluations to find the state-of- connectionist approach that iteratively converts note durations the-art method. In addition to two HMM-based methods so that adjacent durations tend to have simple integral ratios. [8], [9], [11], [12] previously tested in Ref. [19], we tested Despite some successful results, these methods have lim- frequently cited methods and theoretically important methods itations in principle. First, they use little or no information whose source codes were available: Connectionist Quantizer about sequential regularities of note values in music scores. [7], Melisma Analyzers (version 1 [6] and version 2 [14]) Since there are many logically possible sequences of note and two-dimensional (2D) probabilistic context-free grammar values, such sequential regularities are important clues to (PCFG) model [16], [17], [25]. An evaluation measure for finding the one that is most likely to appear in actual music rhythm transcription, which is briefly sketched in Ref. [19], is scores. Second, tendencies of temporal fluctuations in human explainedin fulldetail togetherwithits calculationalgorithm. performances are described only roughly. In particular, the Wemakepublicthesourcecodesforthebestmodelsfound chord clustering—that is, the identification of notes whose (the proposed model and other two HMMs) as well as the onsets are exactly simultaneous in the score—is handled with evaluation tool to enable future comparisons [26]. We hope thresholding or is not treated at all. Finally, the parameters of that these materials would encourage researchers interested in most of those algorithms are tuned manually and optimisation music transcription and symbolic music processing. methodshavenotbeendeveloped.Thismeansthatonecannot utilise a data set of music scores and performances to learn II. RELATEDWORK the parameters, or only inefficient optimisation methods like Previousstudiesonrhythmtranscriptionarereviewedinthis grid search can be applied. section.Thepurposeistwo-fold:First,wedescribethehistor- ical development of models for rhythm transcription, some of B. Statistical Methods which form bases of our model and some are subjects of our Since around the year 2000, it has become popular to use comparative evaluation. Second, we review how polyphony statistical models, which enable us to utilise the statistical has been treated in previous studies in rhythm transcription natureofmusicscoresandperformances.Usuallytwomodels, and related fields and explain in details the motivations for onedescribingtheprobabilityofascore(scoremodel)andthe explicitly modelling multiple voices. Part of discussions in otherdescribingtheprobabilityofaperformancegivenascore Secs. II-B and II-C and the figures are quoted from Ref. [19] (performance model), are combined as a Bayesian model, and to make this section more informative and self-contained. rhythm transcription can be formulated as maximum a poste- riori estimation. Below we review representative models for A. Early Studies rhythm transcription. We here consider only monophonic per- Until the late 1990s, studies on rhythm transcription used formances; polyphonic extensions are described in Sec. II-C. models describing the process of quantising note durations In one class of HMMs for rhythm transcription, which we and/or recognising the metre structure. Longuet-Higgins [3] call note HMMs, a score is represented as a sequence of note developed a method for estimating the note values and the values and described with a Markov model (Fig. 2) [8], [9]. IEEE/ACMTRANSACTIONSONAUDIO,SPEECH,ANDLANGUAGEPROCESSING,VOL.XX,NO.YY,2017 3 State representation of merged-output HMM s 1/2 1 2 1 1 2 2 1 i(1) i(1) i(1) i(1) i(1) i(1) i(1) i(1) i(1) 0 1 1 2 3 3 3 4 i(2) i(2) i(2) i(2) i(2) i(2) i(2) i(2) i(2) 0 0 1 1 1 2 3 3 HMM 1 i(1) i(1) i(1) i(1) i(1) 0 1 2 3 4 HMM 2 i(2) i(2) i(2) i(2) 0 1 2 3 Outputs x1 x2 x3 x4 x5 x6 x7 Fig. 3. A 3 against 4 polyrhythmic passage (top; Chopin’s Fantaisie Fig. 4. A schematic illustration of the merged-output HMM. The symbols Impromptu)representedasasequenceofnoteclusters(bottom). i(1) andi(2) representauxiliarystatestodefinetheinitialtransitions. 0 0 To describe the temporal fluctuations in performances, one the importance of incorporating the multiple-voice structure introduces a latent variable corresponding to a (local) tempo in the presence of voice asynchrony is well investigated in that is also described with a Markov model. An observed studies on score-performance matching [21], [22]. duration is described as a product of the note value and the To describe the multiple-voice structure of polyphonic tempo that is exposed to noise of onset times. scores, an extension of the PCFG model called 2D PCFG In another class of HMMs, which we call metrical HMMs, model has been proposed [16], [25]. This model describes, a different description is used for the score model [11]–[13]. in addition to the divisions of a time interval, duplications of Instead of a Markov model of note values, a Markov process intervals into two voices. Unfortunately, a tractable inference on a grid space representing beat positions of a unit interval, algorithmcouldnotbeobtainedforthemodel,andthecorrect suchasabar,isconsidered(Fig.2).Thenotevaluesaregiven voice information had to be provided for evaluations. In a asdifferencesbetweensuccessivebeatpositions.Incorporation recent report, Takamune et al. [17] state that this problem is of the metre structure is an advantage of metrical HMMs. solved using a generalised LR parser. However, as we shall PCFG models have also been proposed [15], [16]. As in see in Sec. IV-B, their algorithm often fails to output results [3], a time interval in a score is recursively divided into and the computational cost is quite high. shorter intervals until those corresponding to note values are obtained, and probabilities describe what particular divisions D. Merged-Output HMM are likely. As an advantage, modifications of rhythms by Based on the fact that HMM is effective for monophonic inserting(splitting)notescanbenaturallydescribedwiththese music [8]–[13], an HMM-based model that can describe models. multiple-voice structure of symbolic music, called merged- output HMM, has been proposed [23], [24]. In the model, C. Polyphonic Extensions each voice is described with an HMM, called a voice HMM, andthetotalpolyphonicmusicsignalisrepresentedasmerged The note HMM has been extended to handle polyphonic outputs from multiple voice HMMs (Fig. 4). performances [10]. This is done by representing a polyphonic Mathematically the model is described as follows. Let us score as a linear sequence of chords or, more precisely, note clusters consisting of one or more notes. Such score considerthecaseoftwovoicesindexedbyavariables=1,2, and let i(s) denote the state variable, let π (i(cid:48),i) =P(i|i(cid:48),s) representation is also familiarly used for music analysis [27] s denote the transition probability and let φ (x;i) = P(x|i,s) andscore-performancematching[28],[29].Chordalnotescan s denote the output probability of each voice HMM (for some be represented as self-transitions in the score model (Fig. 2) outputsymbolx).Foreachinstancen,onevoices ischosen and their IOIs can be described with a probability distribution n by a Bernoulli process as s ∼ Ber(α ,α ) where Ber is with a peak at zero. Polyphonic extension of metrical HMMs n 1 2 the Bernoulli distribution and its probability parameter α is possible in the same way. sn represents how likely the n-th output is generated from the Althoughthissimplifiedrepresentationofpolyphonicscores HMM of voice s . The chosen voice HMM then makes a is logically possible, there are instances in which score and n state transition and outputs x while the other voice HMM performance models based on this representation cannot de- n stays at the current state. The whole process is described as scribe the nature of polyphonic music appropriately. First, an HMM with a state space indexed by k =(s,i(1),i(2)) and complex polyphonic scores such as polyrhythmic scores are the transition and output probabilities (in the non-interacting forced to have unrealistically small probabilities. This is be- case [23]) are given as cause such scores consist of rare rhythms in the simplified representation even if the component voices have common P(k =k|k =k(cid:48)) n n−1 rhythms(Fig.3).Second,thephenomenonofloosesynchrony =α π (i(cid:48)(s),i(s)) δ δ +δ δ , (1) between voices (e.g. two hands in piano performances [21]), s s s1 i(cid:48)(2)i(2) s2 i(cid:48)(1)i(1) P(x |k =k)=φ (x ;i(s)) (2) called voice asynchrony, cannot be described. For example, n n s (cid:0)n (cid:1) IEEE/ACMTRANSACTIONSONAUDIO,SPEECH,ANDLANGUAGEPROCESSING,VOL.XX,NO.YY,2017 4 where δ is Kronecker’s delta. A merged-output HMM with more than two voices can be constructed similarly. High-level model ˙˙˙ seeAnsadsisacuvsasreidanitnoRfeffa.c[t1o9ri]a,lthHeMmMerg[e3d0-]ouintpiuttsHmMosMt gceannerbael œœœ œœœ... & 42 œœœ œ œœ sveonicsee.HUMnMliksemthaekesstaansdtaartde tfraacntsoirtiiaolnHanMdMou,tpountlsyaosnyemobfolthaet (cid:9463)œœœj œœœr r œ⎧⎨ œ⎩j⎧⎨ ⎩œj⎧⎨ ⎩ Note each instant. Owing to this property the sequential regularity Low-level model onsets Time g within each voice can be described efficiently in the merged- œœœ = in 0 1 out 0 0 1 1 0 1 output HMM, even when notes in one voice are interrupted (in the time order) by notes of other voices. Accordingly necessary inference algorithms are also different as we will Fig.5. HierarchicalMarkovmodelforsequencesofnoteclusters.Left:The high-levelmodeldescribesthesequenceinunitsofnoteclustersandthelow- see in Sec. III-C. level model describes internal structure of notes in each note cluster. Right: Anexamplepolyphonicscoreanditsnoteonsetsrepresentedbythemodel. III. PROPOSEDMETHOD Wepresentacompletedescriptionofarhythmtranscription note cluster or not: If g =0 the n-th and (n+1)-th notes are method based on merged-output HMM [19] that describes n in a note cluster and if g = 1, they belong to different note polyphonic performances with multiple-voice structure. The n clusters. The variable g also takes the values ‘in’ and ‘out’ to generative model is presented in Sec. III-A, the determination definetheinitialandexitingprobabilities.TheinternalMarkov ofmodelparametersisdiscussedinSec.III-Banditsinference model has the topology illustrated in Fig. 5 and is described algorithm that simultaneously yields rhythm transcription and withthefollowingtransitionprobabilities(ρ(r) =P(g(cid:48)|g;r)): voice separation is derived in Sec. III-C. g,g(cid:48) ρ(r) =β , ρ(r) =1−β , (5) in,0 r in,1 r A. Model Formulation ρ(r) =γ , ρ(r) =1−γ , ρ(r) =0, (6) 0,0 r 0,1 r 0,out A merged-output HMM for rhythm transcription proposed ρ(r) =ρ(r) =0, ρ(r) =1 (7) in Ref. [19] is reviewed here with additional details. First, the 1,0 1,1 1,out where β and γ are parameters controlling the number of description of the chord model is given, which was explained r r notes in a note cluster. Denoting w =(r,g) and w(cid:48) =(r(cid:48),g(cid:48)), as a ‘self-transition’ in the note-value state space. Since self- the transition probability of the hierarchical model is given as transitionisalsousedtorepresentrepeatednotevaluesoftwo note clusters, it should be treated with care and we introduce ξ(w(cid:48),w):=P(w|w(cid:48))=ρ(gr(cid:48),(cid:48)o)utπ(r(cid:48),r)ρ(inr,)g+δrr(cid:48)ρ(gr(cid:48),)g. (8) a two-level hierarchical Markov model to solve the problem. The initial probability is given as ξini(w ) := P(w ) = Second, a refinement of switching the probability of choosing 1 1 πini(r )ρ(r1) . We notate w =(w )N+1 =(r ,g )N+1. voice HMMs is given, which was not mentioned previously 1 in,g1 n n=1 n n n=1 Todescribethetemporalfluctuations,weintroduceatempo but necessary to improve the accuracy of voice separation. variable, denoted by v , that describes the local (inverse) In the following, a music score is specified by multiple n tempo for the time interval between the n-th and (n+1)-th sequences,correspondingtovoices,ofpitchesandnotevalues note onsets [28], [31]. To represent the variation of tempos, and a MIDI performance signal is specified by a sequence of we put a Gaussian Markov process on the logarithm of the pitches and onset times. In this paper we only consider note tempo variable u =lnv as onsets and thus note length and IOI mean the same thing. n n 1) Model for Each Voice: A voice HMM is constructed P(u )=N(u ;u ,σ2 ), (9) 1 1 ini v,ini based on the note HMM [9], which is extended to explicitly P(u |u )=N(u ;u ,σ2) (10) n+1 n n+1 n v modelpitchesinordertoappropriatelydescribevoices.Ifthere where N(·;µ,Σ) denotes the normal distribution with mean are no chords, a score note is specified by a pair of pitch and µ and variance Σ, and u , σ and σ are parameters. The note value. Note that to define N note lengths we need N+1 ini v,ini v parameter σ describes the amount of tempo changes. If the note onsets and thus N+1 score notes should be considered. v n-thand(n+1)-thnotesbelongtoanotecluster(i.e.g =0), LetN+1bethenumberofscorenotesinonevoiceandletr n n theirIOIapproximatelyobeysanexponentialdistribution[28] denote the note value of the n-th note. If there are no chords, thenotevaluesr =(r )N+1 aregeneratedbyaMarkovchain and the probability of the onset time of the (n+1)-th note, n n=1 denoted by t , is then given as with the following probabilities: n+1 P(t |t ,v ,r ,g =0)=Exp(t ;λ) (11) P(r )=πini(r ), (3) n+1 n n n n n+1 1 1 where Exp denotes the exponential distribution and λ is P(r |r )=π(r ,r ) (n=1,...,N) (4) n+1 n n n+1 the scale parameter, which controls the asynchrony of note where πini is the initial probability and π is the (stationary) onsets in a note cluster. Otherwise, t −t has a duration n+1 n transition probability. correspondingtonotevaluer andtheprobabilityisdescribed n To describe chords, we extend the above Markov model with a normal distribution as to a two-level hierarchical Markov model with state variables P(t |t ,v ,r ,g =1)=N(t ;t +r v ,σ2). (12) (r,g).Thevariablerrepresentsthenotevalueofanotecluster n+1 n n n n n+1 n n n t andgindicateswhetherthenextnoteonsetbelongstothesame Intuitivelytheparameterσ describestheamountofonset-time t IEEE/ACMTRANSACTIONSONAUDIO,SPEECH,ANDLANGUAGEPROCESSING,VOL.XX,NO.YY,2017 5 r r however, dependence of the parameter α on features of the Note value Init g1 g2 s 1 2 giveninput(MIDIperformance)canbeintroducedtoimprove the accuracy of voice separation. As such a feature, we use Tempo Init v1 v2 contexts of pitch that reflects the constraint on pitch intervals thatcanbesimultaneouslyplayedbyonehand.Definingphigh n and plow as the highest and lowest pitch that is sounding Onset time t1 t2 t3 simultnaneously (but not necessarily having a simultaneous onset) with p , we switch the value of α depending on n sn Pitch Init p1 p2 p3 whether pn−plnow >15 or not and whether phnigh−pn >15 or not (total of four cases), reflecting the fact that a pitch interval larger than 15 semitones is rarely played with one Fig. 6. Graphical representation of the autoregressive HMM for one voice. hand at a time. The effect of using this context-dependent α The label ‘Init’ indicates the initial probability and the dotted circle of the s firstonsettimet1 indicatesthatadistributionisnotgivenforthisvariable. is examined in Sec. IV-C. If voice s is chosen, then the HMM of voice s n n outputs a note, and the hidden state of the other voice fluctuationsduetohumanmotornoisewhenaperformerkeeps HMM is unchanged. Such a model can be described a tempo. We do not put a distribution on the onset time of the with an HMM with a state space labelled by k = first note t1 because we formulate the model to be invariant (s,p(1),w(1),t(1),p(2),w(2),t(2),v). Here we have a single under time translations and this value would not affect any tempo variable v that is shared by the two voices in order resultsofinference.Wenotatev =(v )N andt=(t )N+1. n n=1 n n=1 to assure loose synchrony between them. The transition prob- Finally we describe the generation of pitches p=(pn)Nn=+01 ability P(kn=k|kn−1=k(cid:48)), for n≥2, is given as as a Markov chain (we introduce an auxiliary symbol p for 0 α P(v|v(cid:48))A (w(s),p(s),t(s)|w(cid:48)(s),p(cid:48)(s),t(cid:48)(s);v(cid:48)) later convenience). The probabilities are s s P(p1)=θ(p0,p1), (13) · δs1δw(cid:48)(2)w(2)δp(cid:48)(2)p(2)δ(t(cid:48)(2)−t(2))+(1↔2) (15) P(p |p )=θ(p ,p ) (n=2,...,N+1) (14) where(cid:104)the expression ‘(1↔2)’ means that the previ(cid:105)ous term n n−1 n−1 n where θ(p(cid:48),p) denotes the (stationary) transition probability is repeated with 1 and 2 interchanged and we have defined and if p(cid:48) =p it denotes the initial probability. A (w(s),p(s),t(s)|w(cid:48)(s),p(cid:48)(s),t(cid:48)(s);v(cid:48)) 0 s The above model can be summarised as an autoregressive =ξ (w(cid:48)(s),w(s))θ (p(cid:48)(s),p(s))P(t(s)|t(cid:48)(s),v(cid:48),w(cid:48)(s)) (16) s s HMM with hidden states (r,g,v) and outputs (p,t) (Fig. 6), and δ denotes Kronecker’s delta for discrete variables and whichwillbeavoiceHMM.Althoughsofartheprobabilities Dirac’sdeltafunctionforcontinuousvariables.Theprobability of pitches are independent of other variables, they will be P(v|v(cid:48)) is defined in Eq. (10), and P(t(s)|t(cid:48)(s),v(cid:48),w(cid:48)(s)) is significant once multiple voice HMMs are merged and the defined in Eqs. (11) and (12). For note values the initial posterior probabilities are inferred. probabilityisgivenasP(r(s))=πini(r(s)),andforpitchesthe 2) Model for Multiple Voices: We combine the multiple 1 s 1 initialprobabilityisgiveninEq.(13).Thefirstonsettimest(1) voice HMMs in Sec. III-A1 using the framework of merged- 1 outputHMMs(Sec.II-D).Sinceinpianoperformances,which andt(12) donothavedistributions,asexplainedinSec.III-A1, are our main focus, polyrhythm and voice asynchrony usually andwepracticallysett(11) =t(12) =t1 (thefirstobservedonset involve the two hands, we consider a model with two voices, time). Finally the output of the model is given as leaving a note that it is not difficult to formalise a model with p =p(sn), t =t(sn), (17) n n n n more than two voices. In what follows, voices are indexed by and thus the complete-data probability is written as avariables=1,2,correspondingtotheleftandrighthandin N+1 practice. All the variables and parameters are now considered for each voice and thus r(s) =(rn(s))Nn=s+11 is the sequence of P(k,p,t)= P(kn|kn−1)δpnp(nsn)δ(tn−t(nsn)). (18) note values in voice s, π (r(cid:48),r) their transition probability, n(cid:89)=1 s Here N = N + N denotes the total number of score etc. Simply speaking, the sequence of merged outputs is 1 2 notes, and the following notations are used: v = (v )N , obtained by gathering the outputs of the voice HMMs and n n=1 p = (p )N+1, t = (t )N+1, and k = (k )N+1. Note that sorting them according to onset times. To derive inference n n=1 n n=1 n n=1 whereas p and t are observed quantities, p(1),p(2),t(1) and algorithms that are computationally tractable, however, we t(2) are not because we cannot directly observe the voice should formulate a model that outputs notes incrementally in information encrypted in s. The graphical representation of the order of observations. This can be done by introducing the model is illustrated in Fig. 7. stochastic variables s = (s )N+1, which indicate that the n- n n=1 thobservednotebelongstovoices andfollowtheprobability n s ∼ Ber(α ,α ). The parameter α represents how likely n 1 2 sn B. Model Parameters and Their Determination the n-th note is generated from the HMM of voice s . n The variable s is determined in advance to the pitch, note We here summarise model parameters, explain how they n valueoronsettimeofthecorrespondingnoteinthegenerative can be determined from data and describe some reasonable process (which is described below). For rhythm transcription, constraints to improve the efficiency of parameter learning. IEEE/ACMTRANSACTIONSONAUDIO,SPEECH,ANDLANGUAGEPROCESSING,VOL.XX,NO.YY,2017 6 Voice s1=1 s2=2 s3=1 s4=1 s5=2 Finally, for πini(r), the stationary (unigram) probability ob- tained from π(r,r(cid:48)) can be used. Note that the pitch proba- N(ovtoei cvea 1lu)e Init(1) gr˜˜11((11)) gr˜˜22((11)) bilities are only used to improve voice separation and their precise values do not much influence the results of rhythm N(ovtoei cvea 2lu)e Init(2) gr˜˜11((22)) gr˜˜22((22)) transcription.Likewisetheinitialprobabilitiesdonotinfluence the results for most notes due to the Markov property. Tempo Init v1 v2 v3 v4 The performance model has the following five parameters: σ ,σini,v (=exp(u )),σ andλ.Thesecanbedetermined v v ini ini t Onset time t1 t2 t3 t4 t5 fromperformancedata,forexample,MIDIrecordingsofpiano Init(1) performances whose notes are matched to the corresponding Pitch p1 p2 p3 p4 p5 notes in the scores. Among these the initial values, σini and v Init(2) v (=exp(u ))aremostdifficulttodeterminefromdatabut ini ini againhavelimitedinfluenceasapriorforglobaltempo,which Fig.7. Graphicalrepresentationoftheproposedmerged-outputHMMwhen is supposed to be musically less important (see discussion in the voice information is fixed. The variables with a tilde (r˜n(s) and g˜n(s)) Sec. IV-A). In our implementation, they are simply set by represent note values for each voice without redundancies (see Sec. III-C). hand. The method for determining the other parameters based SeealsothecaptionofFig.6.Herewehaveindependentinitialdistributions fornotevaluesandpitchesofdifferentvoiceHMMs. on a principle of minimal prediction error is discussed in a previous study [28] and will not be repeated here. Anadditionalparameterforthemerged-outputHMMisα , s Let N and N be the number of pitches and note values, p r which is generally obtained by simply counting the number which are set as 88 and 15 in our implementation in Sec. IV. of notes in each voice or can be approximated simply by The score model for each voice HMM has the following α = α = 1/2. In our implementation, we obtain four α ’s 1 2 s parameters: πini(r) [Nr], π(r,r(cid:48)) [Nr2], βr [Nr], γr [Nr], depending on the context as described in Sec. III-A2, which θ(p0,p) [Np] and θ(p,p(cid:48)) [Np2], where the number in square is also straightforward. brackets indicate the number of parameters. (The number of independent parameters may reduce because of normalisation C. Inference Algorithm conditions,whichweshallnotcarehereforsimplicity.)These Toobtaintheresultofrhythmtranscriptionusingthemodel parameters can be determined with a data set of music scores just described, we must estimate the most probable hidden with voice indications. For piano pieces, the two staffs in the state sequence kˆ =argmax P(k|p,t) given the observations grand staff notation can be used for the two voice HMMs. k (p,t). This gives us the voice information sˆ and the esti- After representing notes in each voice as a sequence of note clusters as in Fig. 3, πini(r) and π(r,r(cid:48)) can be obtained mated note values rˆ(1) and rˆ(2). Let w˜(s) = (w˜n(s))Nn=s1 = in a standard way. Determining θ(p0,p) and θ(p,p(cid:48)) is also (r˜n(s),g˜n(s))Nn=s1 be the reduced sequence of note values for straightforward. To determine the parameters β and γ , we voice s, which is obtained by, for all n, deleting the n-th r r first define the frequency of note clusters containing m notes element with sn (cid:54)=s in wˆ(s). Then the score time τn(s) of the with note value r as f(r). Since β is the proportion of note n-th note onset in voice s is given by m r clusters containing more than one notes, it is given by n−δss1 ∞ ∞ τ(s) = g˜(s)r˜(s). (21) β = f(r) f(r). (19) n m m r m m(cid:48) m(cid:88)=1 m=2 (cid:30)m(cid:48)=1 The inference algorithm of merged-output HMM has been (cid:88) (cid:88) Theγr canbeobtainedbymatchingtheexpectedstayingtime discussed previously [24]. Since a merged-output HMM can at state with g =0 (Fig. 5) as follows: be seen as an HMM with a product state space, the Viterbi ∞ algorithm [20] can be applied for inference in principle. It mfm(r) ∞ 2−γ was shown that owing to the specific form of transition m=2 =(cid:104)m(cid:105) = mγm−2(1−γ )= r. (cid:80)∞ fm(r(cid:48)) m≥2 m(cid:88)=2 r r 1−γr pfororboanbeiliVtyitmerabtiriuxpadsatienEcaqn.(b1e),rthedeuccoemdpfurotamtioOna(l4cNom2Npl2e)xittyo m(cid:48)=2 1 2 O(2N N (N +N )) where N is the size of the state space (cid:80) 1 2 1 2 s In practice, the transition probability θ(p,p(cid:48)) and the initial of the s-th voice HMM. However, since the state space of the probabilities θ(p ,p) and πini(r) are often subject to the 0 model in Sec. III-A2 involves both discrete and continuous sparseness problem since the first one has a rather large variables, an exact inference in this way is difficult. number of parameters and for the last two only one sample Tosolvethis,wediscretisethetempovariable,whichpracti- from each piece can be used. To overcome this problem, we callyhaslittleinfluencewhenthestepsizeissufficientlysmall can reduce the number of parameters in the following way, since tempo is restricted in a certain range in conventional which is used in our implementation. First, we approximate music and v always has uncertainty of O(σ /r(sn)). Dis- θ(p,p(cid:48)) as a function of the interval p(cid:48) −p, which reduces n t n−1 cretisationoftempovariableshasalsobeenusedforaudio-to- the number of parameters from N2 to 2N . Second, we can p p score alignment [32] and beat tracking [33]. Other continuous approximate θ(p0,p) by a gaussian function as variables t,t(1) and t(2) can take only values of observed θ(p ,p)∝N(p;µ ,σ2). (20) onset times and thus can, in effect, be treated as discrete 0 p p IEEE/ACMTRANSACTIONSONAUDIO,SPEECH,ANDLANGUAGEPROCESSING,VOL.XX,NO.YY,2017 7 variables.UnfortunatelythedirectuseoftheViterbialgorithm is impractical even with this discretisation. Let us roughly Estimated estimatethecomputationalcosttoseethis.LetN ,N andN p w v Scaling bethesizesofthestatespaceforpitch,notevalueandtempo. Sinceonsettimest(s) couldtakeN values,thesizeofthestate n spacefork isO(2N2N2N2N ),andthecomputationalcost Shift n p w v for one Viterbi update is C = O(4N4N4N4N2). A rough p w v estimation (N ∼ 100, N ∼ 30, N ∼ 300, N ∼ 50) True p w v yields C ∼ 1028, which is intractable. Even after using the constraints of the transition probabilities in Eq. (15), we have Fig. 8. Example of scaling and shift operations to recover the correct C =O(4Np3Nw3N3Nv2)∼1022, which is still intractable. transcriptionfromanestimatedresult. We can avoid this intractable computational cost and derive an efficient inference algorithm by appropriately relating the hidden variables (p(1),p(2),t(1),t(2)) to observed quantities operation, which changes the score time of a particular note (p,t). We first introduce a variable h = 1,2,···, which is or equivalently, changes a note value, to count the number of n defined as the smallest h ≥ 1 satisfying s (cid:54)= s for each note-wise rhythmic errors. Musically speaking, on the other n n−h n. We find the following relations: hand, the relative note values are more important than the absolute note values, and the tempo error should also be h +1, s =s ; h = n−1 n n−1 (22) considered. This is because there is arbitrariness in choosing n (cid:40)1, sn (cid:54)=sn−1, theunitofnotevalues:Forexample,aquarternoteplayedina tempoof60BPMhasthesamedurationasahalfnoteplayed (p ,t ), s=s ; (p(ns),t(ns))=(cid:40)(pnn−hnn,tn−hn), s(cid:54)=snn. (23) ionftaentecmopnotaionfn1o2t0eBvPalMue.sStihnactearreesuultnsifoofrmrhlyythsmcalterdanfsrcormiptitohne This means that (p(1),p(2),t(1),t(2)) are determined if we correct values, which should not be considered as completely are given (s,h,p,t) and the effective number of variables is incorrect estimations [8], [34], we must take into account the reduced by using h. With this change of variables, we find scaling operation as well as the shift operation. P(k,p,t)=P(s,w(1),w(2),v,p(1),p(2),t(1),t(2),p,t) AsshownintheexampleinFig.8,therecanbelocalscaling operations and shift operations, and a reasonable definition =P(s,h,w(1),w(2),v,p,t) of the editing cost is the least number N of operations o consisting of N scaling operations and N shift operations = α P(v |v ) δ δ +(1↔2) sc sh n (cid:26) sn n n−1 sn1 wn(2)wn(2−)1 (No = Nsc +Nsh). As explained in detail in the appendix, (cid:89) (cid:2) (cid:3) this rhythm correction cost can be calculated by a dynamic · δsnsn−1δhn(hn−1+1)Asname+(1−δsnsn−1)δhn1Adniff , programmingsimilarlyastheLevenshteindistance.Definition (cid:104) (cid:105)(cid:27) andcalculationoftherhythmcorrectioncostinthepolyphonic Asname =Asn(wn(sn),pn,tn|wn(s−n1),pn−1,tn−1;vn−1), case are also discussed there. We use the rhythm correction Adniff =Asn(wn(sn),pn,tn|wn(s−n1),pn˜,tn˜;vn−1) (24) rate R =No/N as an evaluation measure. where n˜ in the last line should be replaced by n−h −1. n−1 We can now apply the Viterbi algorithm on the state B. Comparisons with Other Methods space(s,h,w(1),w(2),v).Notingthatthemaximumpossible We first present results of comparative evaluations. The value of h is N and using the constraints of the transition n purpose is to find out the state-of-the-art method of rhythm probabilities, one finds that C = O(4NN3N2)(∼ 1011), w v transcription and its relation to the proposed model. Among which is significantly smaller than the previous values. Note previous methods described in Sec. II, the following six were that so far no ad-hoc approximations have been introduced to directlycompared:ConnectionistQuantizer[7],MelismaAna- reducethecomputationalcomplexity.Practically,wecanseta lyzers (the first [6] and second [14] versions), the note HMM smallermaximalvalueN (<N)ofh toobtainapproximate h n [9], the metrical HMM [11] and the 2D PCFG model [16], optimisation, which further reduces the computational cost [17]. The first five are relatively frequently cited and the last to O(4N N3N2). The number N can be regarded as the h w v h oneistheoreticallyimportantasitprovidesanalternativeway maximum number of succeeding notes played by one hand of statistically modelling multiple-voice structure (Sec. II-C). without being interrupted by the other hand. The choice of 1) Setup: Two data sets of MIDI recordings of classical N and its dependency is discussed in Sec. IV-C1. h pianopieceswereused.One(polyrhythmicdataset)consisted of 30 performances of different (excerpts of) pieces that IV. EVALUATION contained 2 against 3 or 3 against 4 polyrhythmic passages, A. Methodology for Systematic Evaluation and the other (non-polyrhythmic data set) consisted of 30 In a few studies that reported systematic evaluations of performances of different (excerpts of) pieces that did not rhythm transcription [8], [9], [13], editing costs (i.e. the contain polyrhythmic passages. Pieces by various composers, numberofnecessaryoperationstocorrectanestimatedresult) ranging from J. S. Bach to Debussy, were chosen and the are used as evaluation measures. These studies used the shift players were also various: Some of the performances were IEEE/ACMTRANSACTIONSONAUDIO,SPEECH,ANDLANGUAGEPROCESSING,VOL.XX,NO.YY,2017 8 takenfromthePEDBdatabase[35],afewwereperformances AVE (%) 15.01 27.24 28.16 37.61 32.74 36.02 STD (%) 13.14 17.09 19.23 11.32 20.00 16.28 we recorded, and the rests were taken from collections in 0.8 public domain websites2. e0.7 Fortheproposedmethod,allnormal,dotted,andtripletnote at r values ranging from the whole note to the 32nd note were n 0.6 o usedascandidatenotevalues.Parametersforthescoremodel, ti0.5 c π(r,r(cid:48)), βr, γr and θ(p,p(cid:48)), and the value of αs were learned rre0.4 o fwroitmh tahedatteastsedtatoaf2.clWasesicsaelt p(µian,oσp)ie=ces(5th4a,t12h)adfonrothoevelrelaftp- m c0.3 p p h0.2 hand voice HMM and (70,12) for the right-hand voice HMM yt h (seeSec.III-B).Valuesfortheparametersfortheperformance R0.1 model, σ , σ and λ, were taken from a previous study [28] 0 v t Proposed Note Metrical 2D PCFG Melisma Melisma (which used performance data different from ours). The used model HMM [9] HMM [11] [17] ver 1 [6] ver 2 [14] values were σ¯v = 3.32×10−2, σ¯t = 0.02 s and λ¯ = 0.0101 (a) Polyrhythmicdata s. For the tempo variable, we discretised v into 50 values n logarithmically equally spaced in the range of 0.3 to 1.5 s per AVE (%) 8.25 7.43 7.46 8.96 9.21 9.10 STD (%) 9.33 8.56 8.01 10.51 9.12 10.72 quarter note (corresponding to 200 BPM and 40 BPM). We 0.5 set vini as the central value of the range (0.671 s per quarter te note or 89.4 BPM) and σvini = 3σ¯v. Nh was chosen as 30, n ra0.4 which will be explained later (Sec. IV-C1). o For other methods, we used the default values provided in cti0.3 e the source codes, except for the note HMM and the metrical rr o HMM, which are closely related to our method. For these m c0.2 models, the parameters of the score models were also trained h t y0.1 with the same score data set and the performance model was h R the same as that for the proposed model. The metrical HMM 0 was build and learned for two cases, duple metres (2/4, 4/4, Proposed Note Metrical 2D PCFG Melisma Melisma etc.)andtriplemetres(3/4,6/4,etc.),andoneofthesemodels model HMM [9] HMM [11] [17] ver 1 [6] ver 2 [14] werechosenforeachperformanceaccordingtothelikelihood. (b) Non-polyrhythmicdata For Melisma Analyzers, results of the metre analysis were Fig. 9. Rhythm correction rates (lower is better). The circle indicates the used and the estimated tactus was scaled to a quarter note average(AVE),theblueboxindicatestherangefromthefirsttothirdquartiles. ThestandarddeviationisindicatedasSTD. to use the results as rhythm transcriptions. For Connection- ist Quantizer, which accepts only monophonic inputs, chord Dataset Model Unprocessed R−Rprop [%] clustering was performed beforehand with a threshold of 35 Polyrhythmic NoteHMM[9] 0 12.2±2.8 ms on the IOIs of chordal notes. The algorithm was run for MetricalHMM[11] 0 13.1±3.1 2-dimPCFG[17] 18 23.8±3.9 100 iterations for each performance. Because this algorithm Melismaver1[6] 9 17.7±3.7 outputs note lengths in units of 10 ms without indications for Melismaver2[14] 4 21.4±3.3 tactus, the most frequent note length was taken as the quarter Connectionist[7] 0 38.7±3.2 Non-polyrhythmic NoteHMM[9] 0 −0.82±0.50 note value. MetricalHMM[11] 0 −0.79±0.61 2) Results: The distributions of rhythm correction rates, 2-dimPCFG[17] 25 1.80±1.64 their averages and standard deviations are shown in Fig. 9. Melismaver1[6] 4 1.29±1.12 Melismaver2[14] 11 −0.09±1.33 For clear illustration, the results for Connectionist Quantizer, Connectionist[7] 0 30.6±2.95 which was much worse than the others, were omitted: The TABLEI average (standard deviation, first, third quantiles) was 53.7% AVERAGESANDSTANDARDERRORSOFDIFFERENCESOFTHERHYTHM (18.5%, 43.8%, 67.3%) for the polyrhythmic data and 38.9% CORRECTIONRATEFORLISTEDMODELSRANDTHATOFTHEPROPOSED (13.9%, 28.2%, 47.3%) for the non-polyrhythmic data. MODELRprop(LOWERISBETTER).THENUMBEROFUNPROCESSED As shown in Table I, some performances were not properly PIECES(SEETEXT)ISSHOWNINTHETHIRDCOLUMN.VALUESWITHA STATISTICALSIGNIFICANCE≥3σAREILLUSTRATEDINBOLDFONT. processed by the 2D PCFG model and Melisma Analyzers. For the 2D PCFG model, because it took much time in processing some performances (executions lasted more than a ‘unprocessed’ cases. To compare the results in the presence week for some performances), every performance was run for of these unprocessed cases, we calculated for each of the at least 24 hours and only performances for which execution algorithms and for successfully processed performances the ended were treated as processed cases. Among 29 (out of 60) differencesinrhythmcorrectionsratesrelativetotheproposed performancesforwhichexecutionended,12performancesdid model. Their average and standard error (corresponding to 1σ not receive any results (because the parser did not succeed in deviation in the t-test) are shown in Table I. accepting the performances) and those were also treated as For the polyrhythmic data, it is clear that the proposed model outperformed the other methods, by more than 12 2Thelistofusedpiecesisavailableonourwebpage[26]. points in the accuracies and more than 3σ deviations in the IEEE/ACMTRANSACTIONSONAUDIO,SPEECH,ANDLANGUAGEPROCESSING,VOL.XX,NO.YY,2017 9 Correct transcription Input MIDI performance 3 (sec) # # #4# # # # 5 # # #6# # # # &####C ≈œœœ ‹œ œœ œ œœ œœ#œ œ œ œ ≈œœ œ ‹œ œ œ œœœ œœ#œ œœ œ & # # # # # # # # # # # # # # œ œ œ œ œ œ œ œ œ œ œ œ ? # # # # # # # # # # ?####C œœ œ œ œ œ œœ œ œ œ œ 6 6 6 6 R =60.00% Merged-output HMM (scaled by 2) R =4.55% Note HMM (scaled by 3/2) &#### ≈œœœ ‹œ œœ œ œœ œœ#œ œ œ œ Rœ œ œ œ‹œ œ œœœ œœ œ#œœ œ &#### œ œœ‹œœœ œ œœœ œœ#œœœ œ œ œœ œ‹œœœœ œœœ œœ#œœ œœ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ ?#### œœ œ œ œ œ œœ œ œ œ œ ?#### œ œ œ œ œ œ œœ œ œ œ œ 6 6 6 6 Metrical HMM R =63.64% 2D PCFG R =61.82% &#### œœœ‹œœœ œœ œœ œœ#œœœ œ œ œœ œ ‹œœœ œ œœœ œœ#œœ œœ œ &#### œœœ ‹œœœœ œ œœ œœ#œœœœ œ œ œœ‹œœœ œ œœœ œœ#œœœ œœ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ ?#### œ œ œ œ œ œ œœ œ œ œ œ ?#### œœ œ œ œ œ œœ œ œ œ œ Melisma Analyzer ver 1 (scaled by 3/2) R =70.91% R =64.55% Melisma Analyzer ver 2 &#### œœœ‹œœœœ œ œœœœ #œœ œœ œ œœœ ‹œœ œœ œœœ œœ#œœ œ œœ &#### œœœ‹œœœ œœœœ œœ#œœ œœ œ œ œœ ‹œœœ œ œœœœ œ#œœ œœœ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ ?#### œ œ œ œ œ œ œœ œ œ œ œ ?#### œ œ œ œ œ œ œœ œ œ œ œ Fig.10. Transcriptionresultsofapolyrhythmicpassage(Chopin’sFantaisieImpromptu).Onlypartoftheperformanceisillustrated,buttherhythmcorrection ratesforthewholeperformanceareshown.Fortheresultwiththeproposedmodel(merged-outputHMM),thestaffsindicatetheestimatedvoices. Input MIDI performance Correct transcription & 1 (sbec) 2 3 4b 5 b 6 b 7 b 8 b 9 &bbbc œ œ œ œ œ œ œ œ œ œ œ œ œnœ œ œ œ œ œ œ œ œ œœœœ b b b b b ? b b b b ?bbbc Rœ ≈ Rœ ≈ œœ œ œœnœ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œnœ R =8.26% R =4.59% Merged-output HMM 3 3 3 3 Note HMM 3 &bbb ®œ. œ. œ. œœ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œœœœ &bbb œœ.œ. œœœœ œ œ œ œ œ œœ œœnœœ œœ œœ œœ œœ œœ œœ œœ œœ œœœœ ? b œ. œ œœ œ œœnœ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ œ ? b œœ œ œœnœ œ œ œ œ œ œ nœ bb bb 3 3 3 3 Metrical HMM 3R =10.09% 2D PCFG (scaled by 1/2) R =4.59% 3 3 &bbb œœ œ œœœœ œ œ œ œ œ œœœœnœœ œœ œœ œœœœ œœ œœ œœ œœ œœœœ &bbb œœœ œœœœ œ œ œ œ œ œœ œœnœœ œœ œœ œœ œœ œœ œœ œœ œœ œœœœ ? b œœ œ œœnœ œ œ œ œ œ œ nœ ? b œœ œ œœnœ œ œ œ œ œ œ nœ bb bb Melisma Analyzer ver 1 (scaled by 1/2) R =19.27% R =5.50% Melisma Analyzer ver 2 (scaled by 3/4) 3 &bbb œœ œ œœœœ œ œ œ œ œ œœ œœnœœ œœ œœ œœ œœ œœ œœ œœ œœ œœœœ &bbb œœ œœ œœœ œ œ œ œ œ œœ œœnœœ œœ œœ œœ œœ œœ œœ œœ œœ œœœœ ? b œœ œ œœnœ œ œ œ œ œ œ nœ ? b œ œ œœœnœ œ œ œ œ œ œ nœ bb bb Fig.11. Transcriptionresultsofanon-polyrhythmicpassage(J.S.Bach’sInventionNo.2).SeethecaptionofFig.10. statistical significances. Among the other algorithms, the note the statistical models with (almost fully) learned parameters HMM and metrical HMM had similar accuracies and were (the proposed model, the note HMM and the metrical HMM) secondbest,andConnectionistQuantizerwastheworst.These had better accuracies than the other statistical models with results quantitatively confirmed that modelling multiple-voice partly learned parameters or without parameter learning (the structure is indeed effective for polyrhythmic performances. 2D PCFG model and Melisma Analyzer version 2) and other Contrarytoourexpectation,theresultforthe2DPCFGmodel methods.Atypicalexampleofpolyrhythmicperformancethat was second worst for these data. This might be because that is almost correctly recognised by the proposed model but not the algorithm using pruning cannot always find the optimal by other methods is shown in Fig. 10 3. One finds that the result and the model parameters have not been trained from data, both of which are difficult to ameliorate currently but 3Otherexamplesandsoundfilesareaccessibleinourdemonstrationweb couldpossiblybeimprovedinthefuture.Theresultsshowthat page[26]. IEEE/ACMTRANSACTIONSONAUDIO,SPEECH,ANDLANGUAGEPROCESSING,VOL.XX,NO.YY,2017 10 e 20 As shown in Fig. 12, the results were similar for Nh ≥20 t on ra 15 Polyrhythmic awneduwseedretheexavcatlluyesNamhe=fo3r0Nfhor≥all3o0t.hBeraseevdaluoantiothnissirnestuhlits, ecti paper. Note that the sufficient value of Nh for exact inference rr 10 may depend on data and that smaller values with sub-optimal o c estimationscouldyieldbetteraccuracies(asthecaseofN = m Non-polyrhythmic h h 5 20 for our algorithm and data). t y 2) Effect of the Chord Model: As explained in Sec. III-A, h R 0 we propose a two-level hierarchical HMM for the description 0 10 20 30 40 50 N of chords, replacing self-transitions used in previous studies h [10], [19]. To examine its effect in terms of accuracies, we Fig.12. AveragerhythmcorrectionratesforvaryingNh. directly compared the two cases implemented in the merged- outputHMM.Sinceintheformercaseaself-transitionisalso 3 against 4 polyrhythm was properly recognised only by the used to describe repeated note values of two note clusters, proposed model (cf. Fig. 3). post-processing using the onset-time output probabilities was For the non-polyrhythmic data, Connectionist Quantizer performed on the results of Viterbi decoding to determine again had a far worst result and the differences among the whether a self-transition describe chordal notes or not. other methods were much smaller (within 2 points) compared The average rhythm correction rate by the chord model to the polyrhythmic case. The note HMM and metrical HMM using self-transitions was 15.69% for the polyrhythmic data had similar and best accuracies and the proposed model was and 9.12% for the non-polyrhythmic data. By comparing with thethirdbest.Thedifferenceintheaveragevaluesbetweenthe the values in Fig. 9, our chord model was slightly better and proposedmodelandthenoteHMMorthemetricalHMMwas the differences are 0.87 ± 0.46 and 0.68 ± 2.47 (statistical less than 1 point and the statistical significance was 1.6σ and significance1.9σand0.3σ)forthetwodatasets.Theseresults 1.3σ, respectively. Presumably, the main reason that the note indicate that our chord model is not only conceptually simple andmetricalHMMsworkedbetteristhattherhythmicpattern but also seems to improve the accuracy slightly. in the reduced sequence of note clusters is often simpler than 3) Effect of Joint Estimation of Voice and Note Values that of melody/chords in each voice in the non-polyrhythmic and Voice Separation Accuracy: A feature of our method case because of the principle of complementary rhythm [36]. is the simultaneous estimation of voice and note values. An Inparticular,notes/chordsinavoicecanhavetiednotevalues alternative approach is to use a cascading algorithm that that are not contained in our candidate list (e.g. a quarter note performs voice separation first and then estimates note values +16thnotevalueforthelastnoteofthefirstbarintheupper using the estimated voices. To examine the effectiveness of staff in the example of Fig. 11), which can also appear as a the joint estimation approach, we implemented a cascading result of incorrect voice separation. algorithm consisting of voice separation using only the pitch It is observed that the transcription by the merged-output partofthemodelinSec.III-A2andrhythmtranscriptionusing HMM can produce desynchronised cumulative note values in two note HMMs with coupled tempos and compared it with differentvoices(e.g.,thequarternoteE(cid:91)5intheuppervoicein the proposed method. Fig.11hasatimespandifferentfromthatofthecorresponding The average rhythm correction rate by the cascading algo- notesinthelowervoice).Thisisduetothelackofconstraints rithm was 16.89% for the polyrhythmic data and 9.67% for to assure the matching of these cumulative note values and the non-polyrhythmic data. By comparing with the values in the simplification of independent voice HMMs. For the note Fig. 9, we see that the proposed method was slightly better HMM and the proposed model, there were grammatically and the differences are 1.88±1.64 and 1.42±0.47 (statistical wrong sequences of note values, for example, triplets that significance1.1σand3.0σ)forthetwodatasets.Theseresults appearinsingleortwonoteswithoutcompletingaunitofbeat indicate the effectiveness of the joint estimation approach of (e.g. the triplet notes in the left-hand part in Fig. 11). Further the proposed method while the cascading algorithm may have improvements are desired by incorporating such constraints practicalimportancebecauseofitssmallercomputationalcost. and interactions between voices into the model. Wealsomeasuredtheaccuracyofvoiceseparation(intotwo hands). The accuracy with the proposed model was 94.2% for the polyrhythmic data and 88.0% for the non-polyrhythmic C. Examining the Proposed Method data and with the cascading algorithm it was 93.8% and We here examine the proposed method in more details. 92.5%.Thisindicatesfirstlythatasimilar(orhigher)accuracy 1) DependencyonN : InSec.III-Cweintroducedacutoff can be obtained by using only the pitch information and h N in the inference algorithm to reduce the computational secondly that a higher accuracy of voice separation does not h cost. In a previous study [24] that discussed the same cutoff, necessarily lead to a better rhythm recognition accuracy. it has been empirically confirmed that N =50 yields almost 4) Influence of the Model Parameters: In our implemen- h exactresultsforpianoperformances.Sinceitisdifficultforour tation, parameters of the tempo variables (mainly σ , σ and v t model to run the exact algorithm corresponding to N →∞, λ) were not optimised but adjusted to values measured in a h we compared results of varying N up to 50 to investigate its completely different experiment [28]. Since these parameters h dependency. play important roles of describing ‘naturalness’ of temporal

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.