END-TO-END ATTENTION BASED TEXT-DEPENDENT SPEAKER VERIFICATION

Shi-Xiong Zhang, Zhuo Chen†, Yong Zhao, Jinyu Li and Yifan Gong
Microsoft Corporation, One Microsoft Way, Redmond, WA 98052
† Columbia University, New York, NY, USA
{zhashi, yonzhao, jinyli, ygong}@microsoft.com, [email protected]

arXiv:1701.00562v1 [cs.CL] 3 Jan 2017

ABSTRACT

A new type of end-to-end system for text-dependent speaker verification is presented in this paper. Previously, using phonetic/speaker discriminative DNNs as feature extractors for speaker verification has shown promising results. The extracted frame-level features (DNN bottleneck, posterior or d-vector) are equally weighted and aggregated to compute an utterance-level speaker representation (d-vector or i-vector). In this work we use speaker discriminative CNNs to extract noise-robust frame-level features. These features are then combined to form an utterance-level speaker vector through an attention mechanism. The proposed attention model takes the speaker discriminative information and the phonetic information to learn the weights. The whole system, including the CNN and attention model, is jointly optimized using an end-to-end criterion. The training algorithm imitates exactly the evaluation process, directly mapping a test utterance and a few target speaker utterances into a single verification score. The algorithm can automatically select the most similar impostor for each target speaker to train the network. We demonstrate the effectiveness of the proposed end-to-end system on the Windows 10 "Hey Cortana" speaker verification task.

Index Terms: speaker verification, end-to-end training, attention model, deep learning, CNN

1. INTRODUCTION

Speaker verification (SV) is a binary classification problem in which a person's identity is verified based on his/her voice. Speaker verification can be categorized into text-dependent and text-independent [1]. In text-dependent systems, the same set of phrases is used for enrollment and recognition. In text-independent systems, on the other hand, different phrases are used. Text-dependent SV usually outperforms text-independent SV because of the constraint on phonetic variability [2, 3, 4]. Especially with the increasing popularity of mobile/wearable devices, it is competitively beneficial to enable full voice interaction beginning from a fixed-phrase (keyword) voice-authenticated wake-up [5]. At Microsoft, we are interested in text-dependent speaker verification with the global keyword "Hey Cortana" (see Fig. 1).

Fig. 1. The keyword spotting and speaker verification (with the fixed phrase "Hey Cortana") in Windows 10.

Mathematically, the relationship between text-dependent/independent SV and a Keyword Spotting (KWS) wake-up system can be described by the following equation:

    P(spk | O_{1:T}) = \sum_{w} P(spk, w_{1:L} | O_{1:T}) = \sum_{w} P(spk | w_{1:L}, O_{1:T}) \, P(w_{1:L} | O_{1:T})

where the left-hand side is the text-independent SV posterior, the joint term under the sum corresponds to the joint speaker & KWS wake-up, the first factor is text-dependent SV and the second factor is KWS / speech recognition. Here O_{1:T} is the observation sequence and w_{1:L} stands for the word/phone/state sequence. One major advantage of text-dependent speaker verification systems is that they have knowledge of the utterance's phonetic content and can achieve robust verification results with very short enrollment utterances [6].
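As a concrete, purely illustrative example of this decomposition, the short sketch below marginalizes a text-dependent speaker posterior over two keyword-spotting hypotheses; the hypotheses and probabilities are made up and do not come from the paper.

```python
import numpy as np

# Toy illustration of P(spk | O) = sum_w P(spk | w, O) * P(w | O).
# The hypothesis set and all probabilities below are invented for illustration.
word_hypotheses = ["hey cortana", "other speech"]

# KWS / speech-recognition term: P(w | O) over the two hypotheses.
p_w_given_O = np.array([0.9, 0.1])

# Text-dependent SV term: P(spk | w, O) for each hypothesis.
p_spk_given_w_O = np.array([0.8, 0.5])

# Text-independent SV posterior obtained by marginalizing over w.
p_spk_given_O = np.sum(p_spk_given_w_O * p_w_given_O)
print(p_spk_given_O)  # 0.8*0.9 + 0.5*0.1 = 0.77
```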
Previously, the traditional techniques [7, 8] used for text-independent speaker verification have been found ineffective for text-dependent tasks [4]. Better performance can be achieved by slightly modifying the older techniques such as GMM-UBM [4], GMM-SVM [9, 10] and i-vector/PLDA [11]. More recently, deep neural networks (DNNs) [12] and recurrent neural networks (RNNs) [13] have been successfully applied to text-dependent speaker verification. Two types of DNNs, speaker discriminative DNNs [6] and phonetic discriminative DNNs [14], have been investigated to extract frame-level features, such as d-vectors [6], bottleneck features [14, 15] and phonetic posterior features (alignments) [12, 15]. These features are treated as equally important and aggregated to compute an utterance-level speaker representation, such as an i-vector [12, 15] or d-vector [5, 13]. The utterance-level features from the test speaker and the enrolled speaker are then scored using a similarity measure, e.g., cosine distance/PLDA [16, 17]. The DNNs/RNNs used to extract phonetic/speaker features are fixed after the training stage. Google recently proposed an end-to-end method to train the DNNs/RNNs for text-dependent speaker verification [5]. This is in contrast to the established approach of training DNNs to discriminate between speakers at the frame level. In [5] the cosine distance score of two utterance-level representations is passed to a logistic regression layer to produce the final loss -log P(accept/reject). The parameters of the whole network are learned by minimizing this end-to-end loss.

In this paper we use speaker discriminative CNNs to extract noise-robust frame-level features. This is inspired by recent work in [18, 19] showing that a deep convolutional neural network (CNN) architecture outperforms LSTMs in many speech recognition tasks. Another contribution of this work is that the CNN-extracted frame-level features are smartly combined through an attention mechanism to generate an utterance-level speaker representation, instead of just equally weighting and averaging all the frames [5, 6]. The proposed attention model takes the speaker discriminative information and phonetic information to learn the attention weights. The third contribution is that the whole system, including the CNN and attention model, is jointly optimized using a novel end-to-end training algorithm. Unlike Google's end-to-end training [5], which randomly samples the test speaker and the target speaker, our algorithm uses the most competing impostor for each target speaker (in the case of rejection). Finally, the end-to-end system is evaluated on the Windows 10 "Hey Cortana" speaker verification task.
2. END-TO-END SPEAKER VERIFICATION

This section describes the overall architecture of our end-to-end speaker verification system. The details of its important components are discussed in Sections 3 and 4. The experimental results and analysis can be found in Section 5.

2.1. End-to-End architecture

A typical speaker verification protocol includes three phases: training, enrollment, and evaluation. In the training phase, our network learns to extract an internal speaker representation from the utterance. The learning network includes two parts, a CNN and an attention network, as shown in Figure 2. This network learning stage is analogous to the UBM training stage. The details of our CNN model and attention network are discussed in Sections 3 and 4. Note that the networks are not trained to discriminate between speakers at the frame level. Instead, all the parameters in the whole system are jointly trained using the end-to-end criterion described in Section 2.2.

Fig. 2. The architecture of our end-to-end attention based system for speaker verification. The green blocks indicate the models that can be learned in the training phase. All the green and grey blocks are fixed in the enrollment and verification phases. During enrollment, the extracted speaker supervectors are stored as a speaker model. The red dashed and blue solid arrows indicate the representations of the test utterance and enrolled utterances, respectively. The speaker vector pool contains all the speaker representations in the training set and is used to select the most similar impostor during training.

After training, the CNN and attention networks are used to extract the utterance-level speaker vectors (see Fig. 2) for the enrollment and test speakers. This is done by freezing the parameters of the learning network. In the enrollment phase, each speaker provides a few utterances and each utterance is converted to a supervector through the trained network. The speaker model is obtained by averaging over a small amount of enrollment supervectors.

In the evaluation phase, a scoring function S(O_tst, O_spk^{1:N}) is used to compute the similarity between a test utterance O_tst and a claimed speaker spk. The final score is compared against a pre-defined (possibly speaker-dependent) threshold. The system accepts if the score exceeds the threshold, i.e., the utterance O_tst belongs to spk, and rejects otherwise. Typically, a simple scoring function is the cosine similarity [16] between the test speaker representation f_{cnn|att}(O_tst) and the speaker model f_{cnn|att}(O_spk^{1:N}),

    S(O_tst, O_spk^{1:N}) = \frac{f_{cnn|att}(O_tst)^T f_{cnn|att}(O_spk^{1:N})}{\| f_{cnn|att}(O_tst) \| \, \| f_{cnn|att}(O_spk^{1:N}) \|}    (1)

where the speaker model f_{cnn|att}(O_spk^{1:N}) is precomputed in the enrollment stage and f_{cnn|att}(·) indicates that the feature mapping depends on the CNN and attention networks. In [5] the similarity score S is then passed to a logistic regression σ(·), which includes a linear layer with a bias. Thus,

    P(accept | O_tst, O_spk^{1:N}) = σ(S) = \frac{1}{1 + e^{-w S(O_tst, O_spk^{1:N}) - b}}    (2)

where the scalar w is a score normalizer trained using the criterion described in the next section, and -b/w can be viewed as a verification threshold. The reject probability can be computed by P(reject | O_tst, O_spk^{1:N}) = 1 - σ(S).

2.2. End-to-End training

Given the current CNN and attention model parameters, W_{cnn|att}, the similarity score S(O_tst, O_spk^{1:N}) between the test utterance and the target speaker utterances can be computed and fed to the logistic regression σ(·). All the parameters in the whole network (see the architecture in Fig. 2) can be jointly optimized using the following end-to-end criterion [5]

    F(W_{cnn|att}) = -\frac{1}{I} \sum_{i=1}^{I} \big[ y_i \log σ(x_i) + (1 - y_i) \log(1 - σ(x_i)) \big]    (3)

where W_{cnn|att} represents the parameters of the CNN, attention model and logistic regression σ(·), and F(·) is a function of W_{cnn|att}. The input x_i = S_i(O_tst, O_spk) is the similarity score of the i-th (tst, spk) pair, which depends on W_{cnn|att}. The label y_i = 1 if the test utterance O_tst belongs to spk, and y_i = 0 otherwise. I is the total number of (tst, spk) pairs in the training set, which is huge. To use the data effectively, an algorithm that smartly picks the most "confusing" pairs is proposed; the details of the approach are given in Alg. 1. Note that training with the criterion in Eq. 3 actually imitates the end-to-end evaluation metric: it directly maps a test utterance and a few target utterances to a single score for verification, and the whole network is trained by minimizing this end-to-end loss.

Algorithm 1: End-to-End Training for SV
    Data: 10k speakers, each speaker has 10-50 utterances.
    Result: speaker discriminative CNN, attention network and logistic regression model.
    initialize W_{cnn|att}
    initialize speaker vector pool V ← i-Vectors
    foreach full sweep do
        foreach minibatch in a full sweep do
            foreach spk in a minibatch do
                1) sample N + T1 utterances of spk
                   enrollment: O_spk^{1:N}
                   test: {x_i = S_i(O_tst^i, O_spk^{1:N}), y_i = 1}_{i=1}^{T1}
                2) search the k most similar impostors in V
                3) sample T2 utterances belonging to these impostors
                   test: {x_i = S_i(O_imp^i, O_spk^{1:N}), y_i = 0}_{i=1}^{T2}
                4) gather (x_i, y_i), accumulate ∇F(W_{cnn|att})
            update W_{cnn|att} ← W_{cnn|att} - η ∇F(W_{cnn|att})
        update V ← f_{cnn|att}(O_spk^{1:N}) ∀ spk
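The following is a minimal NumPy sketch of Eqs. (1)-(3): cosine scoring against an averaged enrollment model, the logistic-regression acceptance probability, and the binary cross-entropy loss over (test, speaker) pairs. The feature extractor is stubbed out with a simple average and the values of w and b are arbitrary placeholders; in the paper these are the trained CNN/attention network and learned scalars.

```python
import numpy as np

def speaker_vector(utterance_frames):
    """Stand-in for f_cnn|att(.): the paper uses the CNN + attention network;
    a simple frame average is used here only so the shapes are runnable."""
    return np.mean(utterance_frames, axis=0)

def cosine_score(f_test, f_model):
    # Eq. (1): cosine similarity between test vector and speaker model.
    return float(f_test @ f_model /
                 (np.linalg.norm(f_test) * np.linalg.norm(f_model)))

def accept_probability(score, w=5.0, b=-2.0):
    # Eq. (2): logistic regression on the similarity score; w, b are learned
    # in the paper, fixed to placeholder values here.
    return 1.0 / (1.0 + np.exp(-(w * score + b)))

def end_to_end_loss(scores, labels, w=5.0, b=-2.0):
    # Eq. (3): binary cross-entropy over (tst, spk) pairs;
    # labels are 1 for target trials and 0 for impostor trials.
    p = np.array([accept_probability(s, w, b) for s in scores])
    y = np.asarray(labels, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Toy usage with random 64-dim frame features.
rng = np.random.default_rng(0)
enroll = [speaker_vector(rng.normal(size=(80, 64))) for _ in range(6)]
speaker_model = np.mean(enroll, axis=0)      # average of enrollment supervectors
test_vec = speaker_vector(rng.normal(size=(90, 64)))
s = cosine_score(test_vec, speaker_model)
print(s, accept_probability(s), end_to_end_loss([s], [1]))
```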
To make sure the parameters are updated using sufficient information from diversified speakers, we group the speakers into mini-batches. Each mini-batch contains 64 speakers as targets. For each target speaker, we sample N utterances as enrollment data and T1 test utterances as "acceptance" data. For each target speaker, we also select the k most similar speakers as impostors to make sure the network learns to discriminate between the most challenging samples. For these k impostors we randomly sample T2 utterances as "reject" data. The exact values of these hyperparameters, N, T1, T2 and k, are discussed in Section 5.2. In order to efficiently search for the k most similar impostors, a query table is built to store the k nearest neighbors for each speaker. The table is built using the all-nearest-neighbors algorithm [20] with the cosine-distance metric in O(k·n log n) time. The speaker vector pool used to compute the table is initialized with i-vectors [21]. The pool can be updated periodically (each full sweep) using the speaker supervectors f_{cnn|att}(O_spk^{1:N}).
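A brute-force sketch of this impostor search is given below. The paper builds the query table with the all-nearest-neighbors algorithm of [20]; here the k most similar speakers are simply found by ranking cosine similarities over the pool, which gives the same neighbors at higher cost.

```python
import numpy as np

def k_most_similar_impostors(pool, target_idx, k=3):
    """Return indices of the k speakers whose pooled vectors are closest
    (by cosine similarity) to the target speaker.  Brute-force stand-in for
    the all-nearest-neighbors query table described in Section 2.2."""
    normed = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    sims = normed @ normed[target_idx]      # cosine similarity to the target
    sims[target_idx] = -np.inf              # exclude the target itself
    return np.argsort(-sims)[:k]

# Toy pool of 10 speaker vectors (i-vectors initially, later f_cnn|att vectors).
rng = np.random.default_rng(1)
speaker_pool = rng.normal(size=(10, 64))
print(k_most_similar_impostors(speaker_pool, target_idx=0, k=3))
```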
3. NEURAL NETS FOR SPEAKER VERIFICATION

This section presents a survey of DNN/RNN based methods for speaker verification. A new speaker discriminative CNN model is then proposed and applied to the end-to-end system described in the previous section. DNNs/RNNs have been successfully integrated into speaker verification paradigms. The existing neural networks for speaker verification can be categorized into two types: phonetic discriminative DNNs and speaker discriminative DNNs [13] (see Fig. 2). These two types of DNNs are discussed in Sections 3.1 and 3.2.

3.1. Learning Phonetic Representation

Previous research has demonstrated that using the phonetic information from DNNs yields significant gains for SV [22, 15]. These phonetic discriminative DNNs refer to neural networks that can classify each frame of speech as a specific phoneme/senone. Basically, they are DNNs trained for speech recognition [23] or keyword spotting [24]. The gain of this approach is mainly due to the fact that the phonetic representation extracted from these DNNs helps to align (or normalize) different speakers into the same phonetic space for comparison.

3.1.1. Phonetic Bottleneck Features b_t^phn

The most successful and simple approach of using phonetic discriminative DNNs for speaker verification is the so-called bottleneck feature approach [25]. These DNNs are learned in a supervised manner to classify the phoneme/senone labels. Once the network is trained, the bottleneck features b_t^phn can be extracted for every frame from the outputs of the last hidden layer of the DNN before the sigmoid nonlinearity [26]. The bottleneck features can also be incorporated together with the input MFCC features as Tandem features [27]. These features are then used to train a back-end classifier like a GMM-UBM [22] or i-vector/PLDA [15].

3.1.2. Phonetic Posterior Features γ_t^phn

Another approach that makes use of a phonetic discriminative DNN for speaker verification is the phonetic posterior feature approach. The basic idea is to replace the frame alignment posteriors γ_t^UBM generated by the UBM with the senone posteriors γ_t^phn produced by the DNN [12]. These posteriors γ_t^phn can be directly used to compute the first-order and second-order statistics for i-vector extraction [12]. In this work the phonetic posterior features γ_t^phn are also used to provide context information in the attention model described in Section 4.

3.2. Learning Speaker Representation

As an alternative to the phonetic discriminative DNNs, this section discusses speaker discriminative DNNs, which represent a more natural configuration for speaker verification. A speaker discriminative DNN is a neural network trained to discriminate between speakers. This type of neural network has been successfully trained to extract speaker information, such as speaker articulatory features [28] and d-vectors [6].

3.2.1. dnn-vectors

In this approach DNNs are trained to discriminate between speakers at the frame level. Basically, the model tries to classify each frame as belonging to one of N speakers, where N is the number of background speakers [6]. After training, each frame of an utterance is forward propagated through the network, and the output of the last hidden layer is used to produce a frame-level speaker representation. All the frame-level features are then averaged to form an utterance-level speaker supervector called a d-vector. The main drawback of d-vectors is that the speaker discrimination is achieved at a time-scale at which phonetic variability is dominant.

3.2.2. rnn-vectors

Previously, an RNN based approach for text-dependent speaker verification that makes use of utterance-level features has been proposed [5, 13]. This approach addresses the main criticism of the d-vector approach, i.e., the frame-level speaker classification. In speaker discriminative RNNs, there is a single label associated with each utterance, instead of one for every frame. The hidden vector (after the activation function) of the RNN at the last frame, h_T, is an effective summary of the entire utterance. One drawback of this utterance-level h_T is that it can only be used for text-dependent speaker verification.

3.2.3. cnn-vectors

Recently, deep CNNs with small kernels have been shown to achieve better performance than LSTMs in many speech recognition tasks [18, 19]. Inspired by these works, a deep CNN architecture is proposed here to extract the speaker discriminative information. The rnn-vector approach described in Section 3.2.2 can only be applied to text-dependent speaker verification, while the proposed CNN is suitable for both text-dependent and text-independent tasks.

Fig. 3. The architecture of the CNN model for extracting the discriminative speaker representations. The green blocks indicate the operations in the network, such as time-frequency convolutions, ReLU activation function, max pooling and fully-connected linear projection. The blue blocks show the number of parameters in each operation. Batch normalization is applied.

CNNs are neural networks with special structures. Fig. 3 illustrates the proposed deep CNN with a VGG-style architecture [29]. In this CNN the first layer is a convolution layer, which consists of a number of feature maps. Each neuron in the convolution layer receives input from a local field (e.g. a 3×3 window) representing features of a specific frequency range. These local windows shift across time and frequency, so the neurons receive different shifted local fields as inputs. Neurons in the convolution layer that belong to the same feature map share the same weights, known as kernels or filters. Thus, the convolution layer yields a convolution of the kernels with the inputs. One specialty of our CNN is that an asymmetric context window (30 frames in the history and 5 frames in the future) is applied to the inputs to control the latency, as shown in Fig. 3. Note that LSTMs actually use the information from O_{1:t} for each frame t, which can also be viewed as an asymmetric context window. Zero-padding is used during the convolution. Batch normalization [30] is applied to improve the training convergence.
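The sketch below shows one possible Keras realization of such a VGG-style front-end, assuming the 36-frame asymmetric input window (30 history + current + 5 future) and 3 input channels described above. The filter counts, number of blocks and 64-dimensional output are placeholders, since the exact parameterization appears only in Fig. 3, which is not reproduced here.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_speaker_cnn(context_frames=36, feat_dim=12, emb_dim=64):
    """VGG-style sketch of the speaker discriminative CNN front-end.

    Input per frame t: a (context_frames x feat_dim x 3) patch built from an
    asymmetric context window with static / delta / delta-delta channels.
    The block and filter sizes below are placeholders, not the paper's.
    """
    inp = keras.Input(shape=(context_frames, feat_dim, 3))
    x = inp
    for filters in (32, 64):                              # placeholder block sizes
        # 3x3 time-frequency convolutions with zero padding and stride 1.
        x = layers.Conv2D(filters, (3, 3), padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        # 2x2 max pooling with 2-step striding.
        x = layers.MaxPooling2D(pool_size=(2, 2), strides=2)(x)
    x = layers.Flatten()(x)
    # Fully-connected projection to the 64-dim frame-level speaker vector
    # h_t^spk, which feeds the attention network in the end-to-end system.
    h_t = layers.Dense(emb_dim, name="frame_speaker_vector")(x)
    return keras.Model(inp, h_t)

model = build_speaker_cnn()
model.summary()
```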
Thebasicideaistouseanattentionnet- t worktolearnthebestwaytocombinetheframe-levelspeaker featuresbyutilizingthephoneticcontextinformation. Theapproachismotivatedbytheideaofvisualattention intheimagecaptioningproblem[31]. Intheimagecaption- ing, when the model is trying to generate the next word of thecaption,thatwordisusuallydescribingonlyapartofthe image. Thus the attention network serves as a selection and combinationmodel. Inthecontextofourproblem,whenthe CNNmodelistryingtogenerateaspeakerfeaturevector,that vectorisonlydescribingthespeakercharacteristicsinaspe- cificcontextoraspecificphoneticspace. Thustheattention Fig. 4. The architecture of the general attention model and networkserversasanalignmentandcombinationmodel. two proposed attention networks. The network can learn to Generally,anattentionmodelisanapproachthattakesT alignandcombinetheframe-levelspeakerrepresentationh arguments h ,...,h , and a context c. It return a vector f t 1 T toformaspeakersupervectorf . whichissupposedtobethesummaryoftheh ,focusingon cnn|att 1:T informationcorrespondingtothecontextc. Moreformally,it returnsaweightedmeanoftheht,andtheweightsarechosen Eq.4theframe-levelspeakerfeaturesareweightedbyphone according the relevance of each ht given the context c, as posteriors. Howevertheframecombinationweightscanalso showninFig.4(a). belearnedthroughanattentionnetwork. Thisapproachwill bediscussedinthenextsection. 4.1. attentionnetworkwithposteriorweights The first proposed attention network simply uses the DNN 4.2. attentionnetworkwithlearnedweights posteriorfeaturesasthecontextinformationshowninFig.4 (b).TheposteriorprobabilitiesfromtheKWS-DNNnaturally Inthissection,anadvancedattentionnetworkisproposed. It providethealignmentinformationandcombinationweights. notonlyutilizesthealignmentinformationbutcanalsolearn Thusthespeakersupervectorf canbecomputedby the combination weights. This is because different frames cnn|att maycontaindifferentamountofspeakerdiscriminativeinfor- P(hh|o )·hspk mation. The architecture of this attention network is shown T T t t f =(cid:88)hspk⊗γphn =(cid:88) ..  (4) in Fig. 4 (c). In this model the context information (shown cnn|att t t  .  as blue arrows) contains two sources, the posterior features t=1 t=1 P(ey|ot)·hstpk 640 γphn andthebottleneckfeaturebphn. Thesparsevectorγphn t t t isusedtomappingthespeakercharacteristicsintothecorre- where⊗istheKroneckerproduct[32],hstpk isa64-dimfea- sponding phonetic space. The bottleneck features bphn and t tureextractedfromthespeakerdiscriminativeCNN,andfea- hspk are used together to learn the combination weights for tureγphn has10dimensions,eachrepresentsaprobabilityof t t betterspeakerdiscrimination. phonemein“HeyCortana”1,   hstpk =(cid:20)cnn(ot)(cid:21) , γtphn =P(hh...|ot) . αet ==tanhex(Wpehthstpk+Wbbpthn) ((56)) 64 P(ey|o ) t (cid:80)T expe t 10 j=1 j Note that the γphn is normally a sparse vector. The Eq.4 is (cid:88)T (cid:16) (cid:17) t f = α ·hspk⊗γphn (7) basicallystackingthespeakervectorhspk tothecorrespond- cnn|att t t t t t=1 ingphoneticspace,withaprobabilityweight.Forexample,if P(hh|o1)=1,thenthesupervectorfcnn|attwillonlyhavethe where the ⊗γphn serves as an alignment model as discribed t top64-dimnon-zerofeatures. Bythisway,differentspeaker inprevioussectionandthee servesasacombinationmodel. t vectorswillbecomparedonlyinthesamephoneticspace. In α is the softmax distribution over the input sequence. 
The tensor product and the softmax functions are differentiable; therefore the entire network, including the CNN, attention model and logistic regression, can be jointly trained using stochastic gradient descent.

In this work we apply the attention model to text-dependent speaker verification. However, as the attention model can learn to map the speaker features into the corresponding phonetic space, the proposed approach would work even better for text-independent tasks.

5. EXPERIMENTS AND RESULTS

In this section we present the details of the network architectures, network training and experimental results. All the models investigated in this work were implemented and trained using Theano [33] and the Keras package [34].

5.1. Data Sets

The proposed approach is evaluated on the Microsoft internal text-dependent speaker verification task. A set of utterances beginning with the voice activation phrase "Hey Cortana" is collected from the Windows 10 desktop Cortana service logs. The Hey Cortana segments are extracted using a keyword detector. The length of these segments is around 65 to 110 frames. The evaluation comprises about 60k utterances from 3k target speakers and 3k impostors. The enrollment set comprises 6 utterances of Hey Cortana per speaker. We then selected about 200k utterances from 10k speakers, each with 10 to 30 utterances, for UBM training. There is no overlap of speakers between the training and testing (enrollment/evaluation) data.

5.2. Experiment Setups

The spectral features used in the KWS-DNN consist of 38-dimensional mel-frequency cepstral coefficients (MFCCs) computed with 13 filterbanks. The MFCCs contain 12 static (without energy C0), 13 delta and 13 delta-delta features. The baseline GMM-UBM and i-Vector systems use the same MFCCs as the KWS-DNN. A rolling-window based cepstral mean normalization (CMN) and feature warping [35] are applied in these systems. The size of the rolling window is 41. The CNN model has three input channels. The first channel contains the 12-dimensional static features (d = 12 in Fig. 3). The second and third channels use the delta and delta-delta features [36], each with 12 dimensions. To control the run-time latency, an asymmetric context window, o_{t-25}, ..., o_{t+5}, is used (see Fig. 3).

The KWS-DNN consists of two hidden layers with 128 and 64 nodes, respectively. The output layer has 10 classes. The GMM-UBM system has 128 Gaussian mixtures. The i-vector/PLDA system uses 300-dimensional i-vectors, which are then projected down to 64 dimensions using LDA [21]. The bottleneck features b_t^phn from the KWS-DNN have 64 dimensions. These bottleneck features are decorrelated using PCA for the GMM-UBM systems; the PCA also reduces the dimension of b_t^phn to 32. The CNN model in the end-to-end system uses zero padding and 1-step striding during the convolution operation [18]. A 2×2 pooling window and 2-step striding are used during the max pooling operation. N = 6 enrollment utterances are selected during the end-to-end network training (see Alg. 1). This simulates the real enrollment scenario, as Windows 10 asks users to enroll 6 utterances in order to let Cortana try to respond only to that user. To avoid a mismatch between the negative/positive sample ratio in training and evaluation, T1 = 1 and T2 = 5 are used in the experiments.
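For illustration, a small NumPy helper that assembles the three-channel CNN input with the asymmetric o_{t-25}, ..., o_{t+5} window described in this subsection is sketched below; the edge-padding strategy is an assumption, as the paper does not specify how utterance boundaries are handled.

```python
import numpy as np

def stack_context(features, left=25, right=5):
    """Build the 3-channel CNN input for every frame t from an asymmetric
    context window o_{t-left}, ..., o_{t+right} (Section 5.2: left=25, right=5).

    features : (T, 12, 3) array of per-frame [static, delta, delta-delta] MFCCs.
    Returns  : (T, left+1+right, 12, 3) array, one context window per frame.
    Edge frames are padded by repeating the first/last frame (an assumption).
    """
    padded = np.concatenate([np.repeat(features[:1], left, axis=0),
                             features,
                             np.repeat(features[-1:], right, axis=0)], axis=0)
    T = features.shape[0]
    return np.stack([padded[t:t + left + 1 + right] for t in range(T)])

# Toy usage: 100 frames of 12-dim MFCCs with delta and delta-delta channels.
rng = np.random.default_rng(3)
mfcc = rng.normal(size=(100, 12, 3))
windows = stack_context(mfcc)
print(windows.shape)   # (100, 31, 12, 3)
```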
5.3. Results

First, we compare the results between different systems. The equal error rates (EERs) are shown in Table 1. The KWS-DNN bottleneck feature is effective for all the systems. The end-to-end system outperforms the GMM and i-vector systems by 25% and 9%, respectively. Second, we compare different attention mechanisms and speaker discriminative neural networks (see Table 2). The proposed attention network with learned weights (Section 4.2) works best. The proposed CNN model also outperforms the DNN (with the same amount of parameters) by 6%. Compared with LSTMs, the gain of the proposed CNN is very small; this may be because the task is text-dependent speaker verification. Larger gains can be expected for text-independent tasks (future work). Third, Table 3 shows the importance of using the speaker vector pool to select the most similar impostor during the end-to-end training.

Table 1. The results of the GMM-UBM, i-vector/PLDA and End-to-End systems in EER%. The CNN model shown in Fig. 3 and the learning-weight scaled attention network shown in Fig. 4 (c) are used in the End-to-End system.

    Features                | GMM-UBM | i-vector/PLDA | End-to-End
    MFCC o_t                |  5.8%   |     4.7%      |   4.2%
    MFCC o_t + BN b_t^phn   |  4.8%   |     4.3%      |   4.1%

Table 2. EERs of different attention networks and speaker models in the End-to-End system with the same [o_t + b_t] features.

    Attention Mechanism               | CNN  | DNN  | LSTM
    posterior scaled attention        | 4.5% | 5.0% | 4.5%
    learning-weight scaled attention  | 4.1% | 4.5% | 4.2%

Table 3. The effectiveness of the speaker vector pool in Alg. 1.

    Training Algorithm           | best End-to-End system
    random sample test speakers  |         4.6%
    using speaker vector pool    |         4.1%

6. CONCLUSION

A novel end-to-end system for text-dependent SV is described in this paper. A speaker discriminative CNN is used to extract frame-level features. Another contribution of this work is that the extracted frame-level speaker features are combined through an attention network to generate an utterance-level speaker representation. The proposed attention model takes the speaker discriminative information and phonetic information to learn the attention weights. The third contribution is that the whole system, including the CNN and attention models, is jointly learned using an end-to-end training algorithm. The algorithm can automatically select the most similar impostors for each target speaker during training. We show the effectiveness of the proposed system on the Windows 10 "Hey Cortana" speaker verification task.

7. REFERENCES

[1] J. P. Campbell Jr., "Speaker recognition: A tutorial," Proc. IEEE, vol. 85, no. 9, pp. 1437-1462, 1997.
[2] R. Auckenthaler, E. Parris, and M. Carey, "Improving a GMM speaker verification system by phonetic weighting," in Proc. ICASSP, 1999, pp. 1440-1444.
[3] T. Matsui and S. Furui, "A text-independent speaker recognition method robust against utterance variations," in Proc. ICASSP, 1991, pp. 377-380.
[4] Anthony Larcher, Kong-Aik Lee, Bin Ma, and Haizhou Li, "Text-dependent speaker verification: Classifiers, databases and RSR2015," Speech Communication, vol. 60, pp. 56-77, 2014.
[5] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam M. Shazeer, "End-to-end text-dependent speaker verification," in Proc. ICASSP, 2016.
[6] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Proc. ICASSP, 2014, pp. 4052-4056.
[7] D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models," Speech Communications, vol. 17, pp. 91-108, 1995.
[8] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 4, pp. 1435-1447, May 2007.
[9] Sergey Novoselov, Timur Pekhovsky, Andrey Shulipa, and Alexey Sholokhov, "Text-dependent GMM-JFA system for password based speaker verification," in Proc. ICASSP, 2014, pp. 729-737.
[10] Shi-Xiong Zhang and Man-Wai Mak, "Optimized discriminative kernel for SVM scoring and its application to speaker verification," IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 173-185, 2011.
[11] T. Stafylakis, Patrick Kenny, P. Ouellet, J. Perez, M. Kockmann, and Pierre Dumouchel, "Text-dependent speaker recognition using PLDA with uncertainty propagation," 2013.
[12] Yun Lei, Nicolas Scheffer, Luciana Ferrer, and Mitchell McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network," in Proc. ICASSP, 2014, pp. 1695-1699.
[13] Gautam Bhattacharya, Patrick Kenny, Jahangir Alam, and Themos Stafylakis, "Deep neural network based text-dependent speaker verification: Preliminary results," in Odyssey 2016: The Speaker and Language Recognition Workshop, Bilbao, Spain, June 21-24, 2016, pp. 9-15.
[14] Fred Richardson, Douglas Reynolds, and Najim Dehak, "Deep neural network approaches to speaker and language recognition," IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1671-1675, 2015.
[15] Hossein Zeinali, Lukas Burget, Hossein Sameti, Ondrej Glembek, and Oldrich Plchot, "Deep neural networks and hidden Markov models in i-vector-based text-dependent speaker verification," in Odyssey 2016: The Speaker and Language Recognition Workshop, Bilbao, Spain, June 21-24, 2016, pp. 24-30.
[16] Najim Dehak, Reda Dehak, James R. Glass, Douglas A. Reynolds, and Patrick Kenny, "Cosine similarity scoring without score normalization techniques," in Odyssey, 2010, p. 15.
[17] Patrick Kenny, Themos Stafylakis, Pierre Ouellet, Md Jahangir Alam, and Pierre Dumouchel, "PLDA for speaker verification with utterances of arbitrary duration," in Proc. ICASSP, 2013, pp. 7649-7653.
[18] Tom Sercu and Vaibhava Goel, "Advances in very deep convolutional neural networks for LVCSR," arXiv preprint arXiv:1604.01792, 2016.
[19] George Saon, Hong-Kwang J. Kuo, Steven Rennie, and Michael Picheny, "The IBM 2015 English conversational telephone speech recognition system," arXiv preprint arXiv:1505.05899, 2015.
[20] Pravin M. Vaidya, "An O(n log n) algorithm for the all-nearest-neighbors problem," Discrete & Computational Geometry, vol. 4, no. 2, pp. 101-115, 1989.
[21] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
[22] Fred Richardson, Douglas Reynolds, and Najim Dehak, "A unified deep neural network for speaker and language recognition," arXiv preprint arXiv:1504.00923, 2015.
[23] George E. Dahl, Dong Yu, Li Deng, and Alex Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30-42, 2012.
[24] Guoguo Chen, Carolina Parada, and Georg Heigold, "Small-footprint keyword spotting using deep neural networks," in Proc. ICASSP, 2014, pp. 4087-4091.
[25] Tianfan Fu, Yanmin Qian, Yuan Liu, and Kai Yu, "Tandem deep features for text-dependent speaker verification," in Interspeech, 2014, pp. 1327-1331.
[26] Frantisek Grézl, Martin Karafiát, Stanislav Kontár, and Jan Cernocky, "Probabilistic and bottle-neck features for LVCSR of meetings," in Proc. ICASSP, 2007, vol. 4, pp. IV-757.
[27] Qifeng Zhu, Barry Chen, Nelson Morgan, and Andreas Stolcke, "Tandem connectionist feature extraction for conversational speech recognition," in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2004, pp. 223-231.
[28] S. X. Zhang, M. W. Mak, and H. M. Meng, "Speaker verification via high-level feature based phonetic-class pronunciation modeling," IEEE Trans. on Computers, vol. 56, no. 9, pp. 1189-1198, 2007.
[29] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[30] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[31] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156-3164.
[32] Alexander Graham, Kronecker Products and Matrix Calculus: With Applications, John Wiley & Sons, New York, NY, 1982.
[33] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.02688, May 2016.
[34] François Chollet, "Keras," https://github.com/fchollet/keras, 2015.
[35] Jason Pelecanos and Sridha Sridharan, "Feature warping for robust speaker verification," in Interspeech, 2001.
[36] Ossama Abdel-Hamid, Abdel-Rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533-1545, 2014.
