Interspeech 2018, 2-6 September 2018, Hyderabad. DOI: 10.21437/Interspeech.2018-34

Real-Time Scoring of an Oral Reading Assessment on Mobile Devices

Jian Cheng, Analytic Measures Inc., Palo Alto, California, U.S.A.
[email protected]

Abstract

We discuss the real-time scoring logic for a self-administered oral reading assessment on mobile devices (Moby.Read) that measures the three components of children's oral reading fluency skills: words correct per minute, expression, and comprehension. Critical techniques that make the assessment run in real-time on-device are discussed in detail. We propose the idea of producing comprehension scores by measuring the semantic similarity between the prompt and the retelling response, utilizing recent advances in document embeddings in natural language processing. By combining features derived from word embeddings with the normalized number of common types, we achieved a human-machine correlation coefficient of 0.90 at the participant level for comprehension scores, which was better than the human inter-rater correlation of 0.88. We also achieved a better human-machine correlation coefficient than the human inter-rater one for expression scores. Experimental results demonstrate that Moby.Read can provide highly accurate words-correct-per-minute, expression, and comprehension scores in real-time, and they validate the use of machine scoring methods to automatically measure oral reading fluency skills.

Index Terms: assessment, oral reading fluency, literacy, expression, comprehension

1. Introduction

Moby.Read is a new, self-administered, fully automated oral reading fluency assessment developed for K-5 students [1, 2, 3]. The prototype system was built on an iPad mini 4 as a stand-alone app. In each test session, students are asked to read a word list, read an easy practice passage, and read three grade-level passages. After reading a passage aloud, students are asked to retell the passage in their own words, putting in all the details they can remember, and then to answer two short questions aloud.

Fluency is the ability to "read text with speed, accuracy, and proper expression" [4]. In this paper, we focused our automatic scoring logic on passage reading (PRead) to produce words correct per minute (WCPM) and reading expression scores, and on passage retelling (PRetell) to produce reading comprehension scores. WCPM is a score based on the number of words read correctly in a minute of reading, an informative measure of oral reading fluency [5]. Expression is the degree to which a student can clearly express the meaning and structure of the text through appropriate intonation, rhythm, phrasing, and emphasis that enhance understanding and enjoyment in a listener [6]. Comprehension is the degree to which a student can retell the major and minor concepts/themes/facts in the original passage. Scoring expression and comprehension emphasizes reading for meaning instead of reading for speed. Automatic scoring can reduce the need for teacher training and help ensure consistency.

Scores are produced in real-time on-device. The advantages of real-time on-device scoring are that we may provide scores and feedback immediately; that we may select appropriate reading materials adaptively, based on the past or real-time performance of a specific student, to level the student more accurately; etc.

2. The mobile speech recognition system

Although automatic speech recognition (ASR) is a compute-intensive process, feasible on-device speech recognition has been researched [7, 8, 9, 10, 11]. With the recent introduction of the neural processing unit (NPU), or neural engine, to mobile devices, we expect that complex acoustic models and language models can be implemented on the latest mobile devices to achieve better performance. On-device speech recognition will not be a barrier for a wide class of mobile applications. In the following subsections, we introduce our system.

2.1. Acoustic models

The acoustic model used for on-device speech recognition is a Deep Neural Network-Hidden Markov Model (DNN-HMM) [12] that contains 4 hidden layers with 300 p-norm (p = 2) nonlinearity neurons (group size G = 10 [12]) per hidden layer, trained using all of Librispeech's training sets [13]: 960 hours of clean native (L1) reading data. The sample rate for on-device speech recognition is 8,000 Hz. The inputs of the DNN are 40-dimensional log mel-filterbank energies calculated on a 25 ms window every 10 ms, and the output dimension is 2,064 context-dependent triphone states. Both the left and the right context are 6 frames.

There are several model mismatch issues that may degrade ASR performance: 1) an adult acoustic model was used to recognize children's speech; 2) narrowband audio was used; 3) the acoustic model was trained using very clean/quiet recordings, so ASR accuracy may diminish on very noisy data. Despite these issues, the overall on-device acoustic model performance is good, since we deal with very low-perplexity situations with suitable language models.

In our previous work [14, pp. 24], the model mismatch effect was researched, such as by checking child test-set performance when adult acoustic models are used. We concluded that DNNs seem to be good at learning invariant representations of speech signals, and that adult data could be more suitable for learning speech representations. When using mismatched adult acoustic models, the performance damage on constrained item types is not severe. Still, after we collect enough child responses and transcriptions, we plan to train a better acoustic model by combining Librispeech's data with children's speech data. Domain-specific training data always help [14].
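For concreteness, the front end described above can be sketched as follows. The 8 kHz sample rate, 40 mel bands, and 25 ms / 10 ms framing come from the paper; the Hamming window, 256-point FFT, and the triangular-filter construction are implementation assumptions for illustration, not details of the actual engine.

```python
import numpy as np

def log_mel_filterbank(signal, sr=8000, win_ms=25, hop_ms=10,
                       n_mels=40, n_fft=256):
    """Frame a waveform and return log mel-filterbank energies."""
    win = int(sr * win_ms / 1000)          # 200 samples at 8 kHz
    hop = int(sr * hop_ms / 1000)          # 80 samples at 8 kHz
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop:i * hop + win] for i in range(n_frames)])
    frames = frames * np.hamming(win)                # taper each window
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2  # power spectrum

    # Triangular mel filters spanning 0 .. Nyquist.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):
            fbank[m - 1, k] = (k - lo) / max(ctr - lo, 1)   # rising edge
        for k in range(ctr, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - ctr, 1)   # falling edge

    # Floor the energies so silence does not produce log(0).
    return np.log(np.maximum(power @ fbank.T, 1e-10))

# One second of audio at 8 kHz -> 98 frames of 40 log energies.
x = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
feats = log_mel_filterbank(x)
```

Stacking such frames with a context of 6 on each side gives the DNN input described in the text.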
2.2. Language models

Item-specific rule-based language models (RBLMs) [15, 16] were built for PRead. No data from this study was used to tune the RBLMs. Item-specific 3-gram language models were built for PRetell, using all the human transcriptions we have for the specified item: around 59 transcriptions per item, with an average vocabulary size of 230. The comprehension scores reported in this paper are biased by the fact that we used the same data to train the language models. We assume that this bias is not big, since we deal with a very narrow domain.

2.2.1. Advantages of item-specific language models

Given the constraints that decoding should finish in real-time and that the decoding devices are mobile, ASR accuracy decreases significantly when the language model is big. We deal with a narrow domain with expected answers. Building a small and constrained language model helps to decode quickly and to achieve good accuracy. It helps to overcome the challenge that children's speech contains larger acoustic variability, because of their variable vocal tract lengths and formant frequencies [17], as well as other model mismatch challenges. It gives the benefit of the doubt to accented or non-native speech. Many spoken assessment applications fall into this category of a narrow domain with expected answers. Item-specific language models with smaller vocabulary sizes are preferred, and they are often used in practice for spoken assessment applications [14].

2.2.2. The rule-based language model

For each expected passage, sentence, phrase, or token sequence, a simple directed graph is built that has a path from the first word in the sequence to the last word [15, 16]. Different directed arcs with probabilities are added to represent different classes of changes made by subjects, such as skipping, repeating, inserting, and substituting. Adding a back-off arc allows domain words to be spoken in any order. Both the changes and the probabilities can be learned from data; using domain data can help to build better language models. The graphs generated from several different expected answers can be combined, with their expected probabilities, into the final RBLM. Naturally, RBLMs give the expected answer sequences higher probabilities and the less likely orders lower probabilities. RBLMs can be compiled on-line, which gives us the flexibility to recognize any content that is generated dynamically. Humans can also add arbitrary reasonable rules to be used by RBLMs directly.
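To make the idea concrete, the sketch below builds a toy directed graph over one expected sentence, with read, skip, and repeat arcs, and scores hypotheses against it. The arc classes come from the text above; the specific probabilities, state layout, and scoring routine are illustrative assumptions, not the production RBLM (which is compiled to an FST for decoding).

```python
import math

def build_arcs(prompt, p_read=0.90, p_skip=0.04, p_repeat=0.03):
    """Arcs are (src_state, dst_state, word, prob); state i means 'i prompt
    words consumed'. The probabilities are illustrative, not learned."""
    arcs = []
    n = len(prompt)
    for i in range(n):
        arcs.append((i, i + 1, prompt[i], p_read))          # read next word
        if i + 1 < n:
            arcs.append((i, i + 2, prompt[i + 1], p_skip))  # skip one word
        if i > 0:
            arcs.append((i, i, prompt[i - 1], p_repeat))    # repeat last word
    return arcs, n

def score(hyp, arcs, final_state):
    """Best log-probability of a path that emits `hyp` and ends at the
    final state; -inf if the graph cannot generate the hypothesis."""
    best = {0: 0.0}
    for word in hyp:
        nxt = {}
        for (src, dst, lab, p) in arcs:
            if lab == word and src in best:
                cand = best[src] + math.log(p)
                if cand > nxt.get(dst, -math.inf):
                    nxt[dst] = cand
        best = nxt
    return best.get(final_state, -math.inf)

prompt = "the quick brown fox".split()
arcs, final = build_arcs(prompt)
exact = score("the quick brown fox".split(), arcs, final)
skipped = score("the brown fox".split(), arcs, final)       # skips "quick"
scrambled = score("fox the brown quick".split(), arcs, final)
```

As intended, the expected reading gets the highest probability, a reading with a skipped word gets a lower but non-zero probability, and an order the graph does not model is rejected.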
2.3. The ASR decoder

The decoding engine [18] is based on KALDI [19]. Modifications were made to utilize mobile devices' single-instruction-multiple-data (SIMD) and digital-signal-processor (DSP) frameworks. Supporting utilities were built to convert RBLMs to finite state transducers (FSTs) for decoding. When we start to record responses, the engine decodes progressively every 0.128 seconds. The decoding real-time factor floats around 0.2 on an iPad mini 4. We chose the acoustic scale so that insertions and deletions are balanced, to avoid ignoring the speech signal.

2.4. An ASR performance comparison

Although the Google cloud speech API [20] (GSpeech) can be used off-the-shelf without any additional modifications, its word error rates (WERs) are rather high on our children's reading and retelling tasks. The main reason is that GSpeech is designed to recognize general English with a broad language model; its purpose is too general for it to perform well in this narrow domain. For the 282 PRead responses, GSpeech achieved a WER of 34.8% (n = 23,736), while our on-device ASR engine achieved a WER of 10.7%. For the 282 PRetell responses, GSpeech achieved a WER of 32.9% (n = 11,070), while our on-device ASR engine achieved a WER of 16.3%. Our server-side ASR engine, which used broadband speech, featured more elaborate acoustic models, and utilized a larger beam value, can achieve a WER of 9.6% for the 282 PRetell responses.

3. Machine scoring methods

3.1. Words correct per minute

The number of correct words is derived by using the ASR result to compute the edit distance to the prompt. Insertions caused by disfluencies are ignored. The time duration runs from the beginning of the first correct word to the end of the last correct word, according to the ASR result. The session-level WCPM is the median WCPM value over the three passages, a widely used procedure in measuring oral reading fluency [21].
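A minimal sketch of this computation, assuming the recognizer output is a list of (word, start, end) tuples; the alignment tie-breaking and the toy timings are illustrative, not the engine's actual interface.

```python
from statistics import median

def matched_hyp_indices(ref, hyp):
    """Levenshtein alignment; return indices of hypothesis words that
    exactly match a prompt word (insertions and misreadings excluded)."""
    n, m = len(ref), len(hyp)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1,        # deletion (word left unread)
                          D[i][j - 1] + 1,        # insertion (disfluency)
                          D[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    matches, i, j = [], n, m
    while i > 0 and j > 0:
        if ref[i - 1] == hyp[j - 1] and D[i][j] == D[i - 1][j - 1]:
            matches.append(j - 1)                 # correct word
            i, j = i - 1, j - 1
        elif D[i][j] == D[i - 1][j] + 1:
            i -= 1                                # prompt word never read
        elif D[i][j] == D[i][j - 1] + 1:
            j -= 1                                # inserted disfluency: ignore
        else:
            i, j = i - 1, j - 1                   # misread word
    return matches[::-1]

def wcpm(prompt, hyp_words):
    """hyp_words: [(word, start_sec, end_sec), ...] from the recognizer."""
    idx = matched_hyp_indices(prompt.split(), [w for w, _, _ in hyp_words])
    if not idx:
        return 0.0
    # Duration: first correct word onset to last correct word offset.
    duration = hyp_words[idx[-1]][2] - hyp_words[idx[0]][1]
    return len(idx) / duration * 60.0

# Toy example: "uh" is an inserted disfluency, "jumps" was never read.
hyp = [("the", 0.2, 0.4), ("uh", 0.45, 0.55), ("quick", 0.6, 0.9),
       ("brown", 1.0, 1.3), ("fox", 1.4, 1.7)]
passage_wcpm = wcpm("the quick brown fox jumps", hyp)   # 4 words in 1.5 s
session_wcpm = median([passage_wcpm, 145.2, 151.0])     # median of 3 passages
```

Preferring the deletion branch before the substitution branch in the backtrace keeps a correct word at the end of the response from being consumed by an unread prompt word.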
3.2. Expression

Although our engine generates many features, only relevant and normalized ones were used (Table 1). These features do not depend on the length of the materials produced. The difference between log_seg_prob and nlog_seg_prob, and between iw_log_seg_prob and niw_log_seg_prob, is that for the latter: a) when we built the native duration statistics, the durations were normalized by the articulation rate of each response; b) when we computed the segmental probabilities, the durations of segments were normalized by the articulation rate of the corresponding response. These normalizations help remove the effects of speaking rate, which usually has a strong correlation with human expression scores.

Table 1: The features used to predict expression scores.
  log_leading_sil: Log of the leading silence duration.
  art: Articulation rate: the number of phonemes per second of speech.
  ros: Rate of speech: the number of phonemes per second of speech and inter-word pauses.
  log_seg_prob: The averaged log-likelihood segmental probability for phonemes [22], based on Librispeech native statistics.
  iw_log_seg_prob: The averaged log-likelihood segmental probability for inter-word silences [22], based on Librispeech native statistics.
  nlog_seg_prob: The normalized version of log_seg_prob.
  niw_log_seg_prob: The normalized version of iw_log_seg_prob.
  am_loglike: The acoustic model log-likelihood of the recognized result, normalized by the total number of frames.
  lm_loglike: The language model log-likelihood of the recognized result, normalized by the total number of frames.

3.3. Comprehension

In natural language processing (NLP), it is becoming popular to use neural-network-based unsupervised learning algorithms to represent variable-length pieces of text (words, sentences, passages, and documents) as fixed-length, real-valued feature representations that encode the meaning of the texts. For these methods (e.g. word2vec [23] or doc2vec [24]), the training objective is usually to learn word/document vector representations that can be used to predict the nearby words with higher probabilities. As a consequence, in the trained continuous vector space, semantically similar words or documents are mapped to similar positions. Meaningful results (e.g. king - man + woman = queen) can be obtained by adding/subtracting these vectors. These methods have achieved better performance in many NLP tasks [23, 24].

We seek semantic similarity measurements between the prompt passages and the retelling responses that are automated and objective. Vector representations of documents could be a good fit. Compared to previous work [25, 26], our task could be easier to handle, since the domain has been constrained by the prompts.

The number of words spoken, or the number of different words used, could be a good indicator of similarity if the subjects are in good faith, although such non-linguistic surface features are too superficial. We are more interested in measurements that check semantic similarity directly and do not have strong correlations with these surface features. The number of common spoken tokens, or the number of common spoken types, could be a good semantic similarity indicator, but it can fail when a subject uses semantically similar words or phrases that are not in the prompt. Using word2vec or doc2vec may fix some of these issues. Assuming that every response can be converted to a vector that represents the whole content of the response, the semantic similarity between two documents may be computed by checking the distance or similarity between the two vectors. As a result, we proposed the features w2v_ed, d2v_cos, and LSI_cos. All the potentially useful similarity metrics for the comprehension scores that we are interested in are listed in Table 2.

As words can be represented by real-number vectors, we may use the centroid of the word vectors of a text to represent the text. Usually it makes sense to remove stop words before computing the centroid; we did observe a performance gain by doing so. We can use either the cosine similarity or the Euclidean distance between two vectors as a measure of the similarity between two texts. For w2v, we observed a significant performance gain by using Euclidean distance. We used the simple average of the word vectors of the text as the centroid to represent the text; we did not observe any performance gain when using a TF-IDF-weighted average of the word vectors. Furthermore, we may use different statistical functions to aggregate the word vectors to represent the document. We concatenated 4 statistical vectors (mean, minimum, maximum, median) to form a 4 x 300 = 1,200-dimensional vector for a document, which can produce better results. We hypothesize that the distribution of the word embedding vectors plays an important role in representing the document; the statistical vectors may capture some properties of that distribution.

We used Google's word2vec pre-trained vectors, which were trained on part of the Google News dataset (about 100 billion words). The archive is available online as GoogleNews-vectors-negative300.bin.gz [27]. The model contains 300-dimensional vectors for 3 million words and phrases.

Table 2: Some potentially useful similarity metrics between prompts and responses for comprehension scores.
  ntokens_n: The number of words spoken in the response, normalized by the number of words in the prompt.
  ntypes_n: The number of different words spoken in the response, normalized by the number of different words in the prompt. It is a measure of the vocabulary size of the response.
  nctypes_n: The number of words shared between the prompt and the response, normalized by the number of different words in the prompt. It is a measure of the vocabulary overlap between the prompt and the response.
  w2v_ed: The Euclidean distance between the two documents' statistical vector representations derived from word embeddings after removing stop words.
  d2v_cos: The cosine similarity between the two documents' vector representations based on doc2vec [24].
  wmd: The word mover's distance [28] between the two documents based on word embeddings after removing stop words.
  LSI_cos: The cosine similarity between the two documents' vector representations derived from Latent Semantic Indexing based on the term vector model [29].
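The statistical-aggregation representation and the w2v_ed distance can be sketched as follows. A tiny random embedding table and stop-word list stand in for the 300-dimensional GoogleNews vectors, so the resulting numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 5                       # stands in for the 300-dim GoogleNews vectors
VOCAB = "boy dog park ball child puppy field played ran".split()
EMB = {w: rng.normal(size=DIM) for w in VOCAB}
STOP = {"the", "a", "and", "to", "in"}   # toy stop-word list

def doc_vector(text):
    """Concatenate mean/min/max/median of the word vectors (stop words and
    out-of-vocabulary words removed) into a 4*DIM document representation.
    Assumes the document contains at least one in-vocabulary content word."""
    vecs = [EMB[w] for w in text.lower().split()
            if w in EMB and w not in STOP]
    M = np.stack(vecs)
    return np.concatenate([M.mean(0), M.min(0), M.max(0), np.median(M, 0)])

def w2v_ed(a, b):
    """Euclidean distance between two documents' aggregated vectors."""
    return float(np.linalg.norm(doc_vector(a) - doc_vector(b)))

prompt = "the boy ran to the park and played with a dog"
retell = "a child played in the field with a puppy"
d_same = w2v_ed(prompt, prompt)     # identical documents -> distance 0
d_retell = w2v_ed(prompt, retell)   # paraphrase -> some positive distance
```

With real embeddings, a paraphrase that shares no surface vocabulary with the prompt (child/boy, puppy/dog) can still land close to it, which is exactly what nctypes_n misses.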
4. Experimental results

A preliminary study of the Moby.Read system was conducted with a sample of 99 children in grades 2-4 from four different elementary schools [1, 2]. Recordings of PRead and PRetell for three grade-level unpracticed passages from the preliminary study were used. 5 children who barely produced meaningful responses (silence or inaudible speech) were excluded from this study, leaving a total of 94 subjects. The details about the raters' qualifications, the training procedures, the rubrics, and how the human WCPM, expression, and comprehension scores were derived can be found in [2].

The results reported here were produced on a simulation platform that mimicked the conditions as if the system were running on mobile devices. The real-time turnaround was verified on mobile devices. The system was published on the Apple App Store [3].

4.1. Words correct per minute scores

A scatter plot of the WCPM values from human and machine scoring is shown in Figure 1. The correlation between the two expert raters is 0.991. This repeats the conclusion we knew: machines can produce reliable WCPM scores for oral reading fluency [16, 30, 31].

Figure 1: Session-level scatter of median WCPM.

We listened to the 6 recordings of two outliers. Both children spoke quite softly, and one mumbled the readings for certain time periods. The resulting low signal-to-noise ratio (SNR) makes it difficult to recognize certain parts of the signals; the Google cloud speech API recognized only 25.2% of the words correctly and ignored most of the signal for these 6 recordings. The same kind of outliers and issues for kids were identified before [32]. Addressing low SNR effectively through instructions, e.g. avoiding high background noise and low speech volume, is the key solution we are looking for. "Be in a QUIET place" is on the sign-in page of the app, and speaking clearly instead of mumbling, so that others can hear, is a requirement. Reliable scores depend on audible speech.

4.2. Oral reading expression scores

Every recording of PRead was rated by 3 different human raters for its 'Oral Reading Expression' score on 6 categories, with 5 representing the best rating and 0 representing silence or irrelevant or completely unintelligible material. The rating distribution is: 0, n=5; 1, n=37; 2, n=95; 3, n=263; 4, n=304; 5, n=142. The average correlation of each human rater with the average of the other raters, at the response level, is 0.795. For the 3 pairs of raters, the average of the inter-rater correlations at the response level was 0.740.

The correlations among the features discussed in Subsection 3.2 and the human ratings at the response level are shown in Table 3. The speech rate features have the highest correlations with the human ratings. Putting more weight on the features normalized by speaking rate may downplay the role of rate.

Table 3: Feature cross-correlations for expression scores (hv = human ratings; columns 1-8 = the features in the same order).
                        hv     1     2     3     4     5     6     7     8
  1: log_leading_sil  -0.18
  2: ros               0.82 -0.22
  3: art               0.77 -0.20  0.92
  4: log_seg_prob      0.81 -0.19  0.90  0.92
  5: iw_log_seg_prob   0.50 -0.17  0.62  0.38  0.45
  6: nlog_seg_prob     0.60 -0.11  0.65  0.69  0.65  0.40
  7: niw_log_seg_prob  0.31 -0.12  0.39  0.14  0.23  0.87  0.27
  8: am_loglike        0.59 -0.10  0.61  0.49  0.58  0.40  0.60  0.28
  9: lm_loglike       -0.57  0.25 -0.82 -0.69 -0.68 -0.70 -0.62 -0.53 -0.55

The final session-level expression score is the average of the individual expression scores. If a response does not have enough information to generate an expression score, it is ignored when computing the final session-level expression score.

Using a neural network model with 10-fold cross-validation, we achieved a correlation of 0.856 (0.902) at the response (session) level. This is better than a linear regression model: 0.840 (0.887). We made sure that different folds had no overlap of the same subjects.

4.3. Comprehension scores

Every recording of PRetell was rated by at least 4 different human raters for its 'Retelling Comprehension' score on 7 categories, with 6 representing the best rating and 0 representing silence or irrelevant or completely unintelligible material. On average, there are 4.5 ratings per response. The rating distribution is: 0, n=25; 1, n=128; 2, n=173; 3, n=288; 4, n=266; 5, n=203; 6, n=179. The average correlation of each human rater with the average of the other raters, at the response level, is 0.842. For the 11 pairs of raters who have more than 100 common ratings, the average of the inter-rater correlations at the response level was 0.786.

We calculated the cross-correlations (Table 4) at the response level among the features discussed in Subsection 3.3 and the human ratings. These features were extracted using human transcriptions. The performance of d2v_cos and LSI_cos depends on the training settings, e.g. the training corpus, the random seeds, and the setting parameters. In Table 4 we report the best results we achieved for d2v_cos and LSI_cos over several trials. Because of the limited domain data, the potential overfitting, and their weaker correlations compared to the other features, we did not explore them further. None of the other results involved overfitting.

Table 4: Feature cross-correlations for comprehension scores (hv = human ratings; columns 1-6 = the features in the same order).
                  hv     1     2     3     4     5     6
  1: ntokens_n   0.82
  2: ntypes_n    0.84  0.95
  3: nctypes_n   0.87  0.85  0.90
  4: w2v_ed     -0.83 -0.78 -0.84 -0.84
  5: wmd        -0.85 -0.75 -0.79 -0.92  0.88
  6: d2v_cos     0.71  0.61  0.60  0.73 -0.68 -0.79
  7: LSI_cos     0.65  0.52  0.54  0.72 -0.65 -0.83  0.83

It can be seen that the normalized number of different words spoken in the response is a good indicator of comprehension. When the subject is in good faith (which is almost always the case for K-5 grade kids), this makes sense, since comprehension will depend on the complexity of the materials produced. By the nature of PRetell, a lot of term overlap is expected; the table reflects that considering only the words in the prompt improves the performance significantly. Among the features that utilize word embedding similarities, by considering and weighting the semantically similar words that are not in the prompt and are ignored by nctypes, w2v_ed and wmd are good ones.

There is no setting parameter for the features nctypes_n, w2v_ed, and wmd. We noticed that wmd has a strong correlation with nctypes_n; at the same time, wmd uses the same word embeddings as w2v_ed. In that sense, w2v_ed could be the better one for enhancing the final performance. We scaled nctypes_n and w2v_ed to the range [0, 1] and then used their simple average as our final metric, achieving r = 0.888 at the response level.
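A minimal sketch of that final combination, assuming min-max scaling over the data set and assuming the distance w2v_ed is flipped so that larger values mean more similar (the paper does not spell out the flip, so treat that as an assumption):

```python
import numpy as np

def minmax(x):
    """Scale a feature vector to the range [0, 1] over the data set."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())   # assumes x is not constant

def comprehension_metric(nctypes_n, w2v_ed):
    """Simple average of scaled nctypes_n and scaled, flipped w2v_ed.
    Flipping the distance is an assumption: higher output = more similar."""
    return (minmax(nctypes_n) + (1.0 - minmax(w2v_ed))) / 2.0

# Toy values for three responses: more vocabulary overlap and a smaller
# embedding distance should yield a higher combined comprehension metric.
nctypes_n = [0.10, 0.50, 0.90]
w2v_ed = [4.0, 2.5, 1.0]
scores = comprehension_metric(nctypes_n, w2v_ed)
```

Because both inputs are scaled to [0, 1] before averaging, neither feature's raw dynamic range dominates the combined metric.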
All the comprehension performance discussed so far is based on human transcriptions. In the real application, we used the ASR recognition results to do the computation, which drags down the performance a little. Following the same procedures, but using the ASR transcriptions, we achieved r = 0.903 at the session level (Figure 2). After collecting enough data, using a more complex supervised machine learning model that combines the different features discussed in Table 2 may further improve the final performance.

Figure 2: Session-level scatter of comprehension.

5. Conclusions

We built an oral reading assessment system on mobile devices that delivers reliable WCPM, expression, and comprehension scores in real-time for first-language learners in grades 2-4 [3]. Our RBLMs relieve the requirement of field data collection for new reading passages to produce WCPM and expression scores; however, data collection is still required for passage retellings, in order to build suitable language models that achieve the best performance. The proposed idea of producing comprehension scores by measuring the semantic similarity between the prompt passage and the retelling response with document embeddings works well. For both expression and comprehension scores, the human-machine correlations are better than the human inter-rater ones, which validates the effectiveness of the system. The findings support the use of machine scoring methods to measure oral reading fluency skills automatically.

We expect the system can be highly useful beyond the application discussed here, such as in second-language learning for adults as well as children. Assessing in real-time means the system can rapidly adapt to a learner's performance, which can be used by learning systems to condition immediate, personalized feedback and to select the next challenge within a session.

6. Acknowledgements

The research described here was supported by the Institute of Education Sciences, U.S. Department of Education, through Small Business Innovation Research (SBIR) program contracts ED-IES-16-C-0004 and ED-IES-17-C-0030 to Analytic Measures Inc. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.

7. References

[1] J. Bernstein, J. Cheng, J. Balogh, and E. Rosenfeld, "Studies of a self-administered oral reading assessment," in SLaTE, 2017.
[2] J. Bernstein, J. Cheng, J. Balogh, and R. Downey, "Artificial intelligence for scoring oral reading fluency," in Applications of Artificial Intelligence to Assessment, H. Jiao and R. W. Lissitz, Eds. Charlotte, NC: Information Age Publisher, 2018, forthcoming.
[3] Analytic Measures Inc., "Moby.Read Lite," 2018, Apple App Store. [Online]. Available: https://itunes.apple.com/us/app/moby-read-lite/id1292957097?mt=8
[4] National Institute of Child Health and Human Development, "Report of the national reading panel. Teaching children to read: An evidence-based assessment of the scientific research literature on reading and its implications for reading instruction," U.S. Government Printing Office, Tech. Rep. NIH Publication No. 00-4769, 2000.
[5] M. M. Wayman, T. Wallace, H. I. Wiley, R. Tich, and C. A. Espin, "Literature synthesis on curriculum-based measurement in reading," Journal of Special Education, vol. 41, no. 2, pp. 85–120, 2007.
[6] J. Miller and P. J. Schwanenflugel, "A longitudinal study of the development of reading prosody as a dimension of oral reading fluency in early elementary school children," Reading Research Quarterly, vol. 43, no. 4, pp. 336–354, 2008.
[7] D. Huggins-Daines, M. Kumar, A. Chan, A. W. Black, M. Ravishankar, and A. I. Rudnicky, "Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices," in ICASSP, 2006, pp. 185–188.
[8] B. Zhou, X. Cui, S. Huang, M. Cmejrek, W. Zhang, J. Xue, J. Cui, B. Xiang, G. Daggett, U. V. Chaudhari, S. Maskey, and E. Marcheret, "The IBM speech-to-speech translation system for smartphone: improvements for resource-constrained tasks," Computer Speech and Language, vol. 2, pp. 592–618, 2013.
[9] X. Lei, A. Senior, A. Gruenstein, and J. Sorensen, "Accurate and compact large vocabulary speech recognition on mobile devices," in Interspeech, 2013, pp. 662–665.
[10] I. McGraw, R. Prabhavalkar, R. Alvarez, M. G. Arenas, K. Rao, D. Rybach, O. Alsharif, H. Sak, A. Gruenstein, F. Beaufays, and C. Parada, "Personalized speech recognition on mobile devices," in ICASSP, 2016, pp. 5955–5959.
[11] CMUSphinx, "Pocketsphinx," Computer Software. GitHub. [Online]. Available: https://github.com/cmusphinx/pocketsphinx
[12] X. Zhang, J. Trmal, D. Povey, and S. Khudanpur, "Improving deep neural network acoustic models using generalized maxout networks," in ICASSP, 2014, pp. 215–219.
[13] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in ICASSP, 2015, pp. 5206–5210.
[14] J. Cheng, X. Chen, and A. Metallinou, "Deep neural network acoustic models for spoken assessment applications," Speech Communication, vol. 73, no. C, pp. 14–27, Oct. 2015.
[15] J. Cheng and B. Townshend, "A rule-based language model for reading recognition," in SLaTE, 2009.
[16] J. Cheng and J. Shen, "Towards accurate recognition for children's oral reading fluency," in IEEE-SLT, 2010, pp. 91–96.
[17] H. Vorperian and R. Kent, "Vowel acoustic space development in children: A synthesis of acoustic and anatomic data," Journal of Speech, Language, and Hearing Research, vol. 50, no. 6, pp. 1510–1545, 2007.
[18] Analytic Measures Inc., "amiASR," 2018, Computer Software. Apple App Store, Version 1.0. [Online]. Available: https://itunes.apple.com/us/app/amiasr/id1032618938?mt=8
[19] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The KALDI speech recognition toolkit," in IEEE-ASRU, 2011.
[20] Google, "Google cloud speech API." [Online]. Available: https://cloud.google.com/speech/
[21] Y. Petscher and Y.-S. Kim, "The utility and accuracy of oral reading fluency score types in predicting reading comprehension," Journal of School Psychology, vol. 49, no. 1, pp. 107–129, 2011.
[22] J. Cheng, "Automatic assessment of prosody in high-stakes English tests," in Interspeech, 2011, pp. 1589–1592.
[23] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," CoRR, vol. abs/1301.3781, 2013.
[24] Q. V. Le and T. Mikolov, "Distributed representations of sentences and documents," in ICML, 2014, pp. 1188–1196.
[25] E. Agirre, M. Diab, D. Cer, and A. Gonzalez-Agirre, "SemEval-2012 Task 6: a pilot on semantic textual similarity," in SemEval, 2012, pp. 385–393.
[26] E. Agirre, C. Banea, D. M. Cer, M. T. Diab, A. Gonzalez-Agirre, R. Mihalcea, G. Rigau, and J. Wiebe, "SemEval-2016 Task 1: semantic textual similarity, monolingual and cross-lingual evaluation," in SemEval, 2016, pp. 497–511.
[27] Google, "word2vec," Computer Data. [Online]. Available: https://code.google.com/archive/p/word2vec/
[28] M. J. Kusner, Y. Sun, N. I. Kolkin, and K. Q. Weinberger, "From word embeddings to document distances," in ICML, 2015, pp. 957–966.
[29] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990.
[30] J. Balogh, J. Bernstein, J. Cheng, A. Van Moere, B. Townshend, and M. Suzuki, "Validation of automated scoring of oral reading," Educational and Psychological Measurement, vol. 72, no. 3, pp. 435–452, 2012.
[31] D. Bolanos, R. A. Cole, W. H. Ward, G. A. Tindal, J. Hasbrouck, and P. J. Schwanenflugel, "Human and automated assessment of oral reading fluency," Journal of Educational Psychology, vol. 105, no. 4, pp. 1142–1151, 2013.
[32] J. Cheng, Y. Zhao D'Antilio, X. Chen, and J. Bernstein, "Automatic spoken assessment of young English language learners," in Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, 2014, pp. 12–21.