Kernel Approximation Methods for Speech Recognition
Avner May
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences
COLUMBIA UNIVERSITY
2018
© 2017
Avner May
All Rights Reserved
ABSTRACT
Kernel Approximation Methods for Speech Recognition
Avner May
Over the past five years or so, deep learning methods have dramatically improved the state-of-the-art performance in a variety of domains, including speech recognition, computer vision, and natural language processing. Importantly, however, they suffer from a number of drawbacks:
1. Training these models is a non-convex optimization problem, and thus it is difficult to guarantee that a trained model minimizes the desired loss function.
2. These models are difficult to interpret. In particular, it is difficult to explain, for a given model, why the computations it performs make accurate predictions.
In contrast, kernel methods are straightforward to interpret, and training them is a convex optimization problem. Unfortunately, solving these optimization problems exactly is typically prohibitively expensive, though one can use approximation methods to circumvent this problem. In this thesis, we explore to what extent kernel approximation methods can compete with deep learning, in the context of large-scale prediction tasks. Our contributions are as follows:
1. We perform the most extensive set of experiments to date using kernel approximation methods in the context of large-scale speech recognition tasks, and compare performance with deep neural networks.
2. We propose a feature selection algorithm which significantly improves the performance of the kernel models, making their performance competitive with fully-connected feedforward neural networks.
3. We perform an in-depth comparison between two leading kernel approximation strategies, random Fourier features [Rahimi and Recht, 2007] and the Nyström method [Williams and Seeger, 2001], showing that although the Nyström method is better at approximating the kernel, it performs worse than random Fourier features when used for learning.
We believe this work opens the door for future research to continue to push the boundary of what is possible with kernel methods. This research direction will also shed light on the question of when, if ever, deep models are needed for attaining strong performance.
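As a concrete illustration of the random Fourier feature construction referenced above, the following is a minimal sketch for a Gaussian (RBF) kernel, following Rahimi and Recht [2007]. The function name rff_features, the bandwidth parameter gamma, and the toy data are illustrative assumptions and are not taken from the thesis.

```python
import numpy as np

def rff_features(X, D, gamma, seed=0):
    """Map X (n x d) to D random Fourier features whose inner products
    approximate the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    rng = np.random.RandomState(seed)
    d = X.shape[1]
    # Frequencies are drawn from the kernel's spectral density (a Gaussian
    # with standard deviation sqrt(2 * gamma) per coordinate); phases are
    # drawn uniformly from [0, 2*pi).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    # z(x) = sqrt(2 / D) * cos(W^T x + b), so that E[z(x)^T z(y)] = k(x, y).
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Small check: with many features, z(x)^T z(y) is close to k(x, y).
X = np.random.RandomState(1).randn(5, 3)
Z = rff_features(X, D=20000, gamma=0.5)
K_exact = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
print(np.abs(Z @ Z.T - K_exact).max())  # typically on the order of 1e-2
```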
Table of Contents
List of Figures iv
List of Tables vii
1 Introduction 1
2 Preliminaries 8
2.1 Notation 8
2.2 Kernel methods 9
2.2.1 Primal formulation 10
2.2.2 Dual formulation 12
2.3 Kernel approximation 13
2.3.1 Random Fourier features (RFF) 14
2.3.2 Nyström method 16
2.4 Reproducing kernel Hilbert spaces (RKHS) 19
2.4.1 Representer Theorem 22
2.5 Neural networks 25
2.5.1 Backpropagation 26
2.5.2 Other architectures 29
2.6 Automatic speech recognition (ASR) 32
2.6.1 Acoustic model training 34
2.6.2 Using neural networks for acoustic modeling 35
3 Related work 37
4 Random Fourier features for acoustic modeling 42
4.1 Methods 43
4.1.1 Using kernel approximation methods for acoustic modeling 43
4.1.2 Linear bottlenecks 43
4.1.3 Entropy regularized perplexity (ERP) 44
4.2 Tasks, datasets, and evaluation metrics 46
4.3 Details of acoustic model training 48
4.4 Results 50
4.5 Other Possible Improvements to DNNs and Kernels 53
4.6 Conclusion 55
5 Compact kernel models via random feature selection 56
5.1 Random feature selection 56
5.2 A sparse Gaussian kernel 58
5.3 Results 59
5.4 Analysis: Effects of random feature selection 64
5.5 Conclusion 66
6 Nyström method vs. random Fourier features 67
6.1 Review of Nyström method properties 68
6.2 Experiments 69
6.2.1 Task and dataset details 69
6.2.2 Training details 70
6.2.3 Results 71
6.3 Nyström method error analysis 79
6.4 Conclusion 82
7 Conclusion 83
7.1 Future work 84
Bibliography 86
Appendices 103
A Definitions 104
B Derivation for random Fourier features 106
C Detailed results 108
C.1 Results from Section 4 108
C.2 Results from Section 5 111
D Nyström Appendix 114
D.1 Datasets 114
D.2 Hyperparameter Choices 115
D.3 Results 115
D.4 Background for Proofs 122
D.4.1 Definitions of a couple infinite-dimensional Hilbert Spaces 122
D.4.2 Review of Reproducing Kernel Hilbert Space Definitions 122
D.5 Proofs: Nyström Background Section 123
D.6 Proofs: Nyström Error Analysis 125
D.6.1 Theorem 4 125
D.6.2 Theorem 5 128
D.6.3 Theorem 6 130
D.6.4 Theorem 7 130
D.7 Other ways of understanding the Nyström method 132
D.7.1 Nyström method as a projection onto a subspace 132
D.7.2 Nyström method as a solution to an optimization problem 132
D.7.3 Nyström method as a preconditioner 133
D.7.4 Nyström method as eigenfunction approximator 134
List of Figures
1.1 Impact of deep learning methods on state-of-the-art performance in speech recognition and computer vision. 4
4.1 Kernel-acoustic model seen as a shallow neural network 44
4.2 Performance of kernel acoustic models on the BN50 dataset, as a function of the number of random features D used. Results are reported in terms of heldout cross-entropy as well as development set TER. The color and shape of the markers indicate the kernel used. 54
5.1 Performance of kernel acoustic models on the BN50 dataset, as a function of the number of random features D used. Results are reported in terms of heldout cross-entropy as well as development set TER. Dashed lines signify that feature selection was performed, while solid lines mean it was not. The color and shape of the markers indicate the kernel used. 65
5.2 Fraction of the s_t features selected in iteration t that are in the final model (survival rate) for the Cantonese dataset. 66
5.3 The relative weight of each input feature in the random matrix Θ, for the Cantonese dataset, D = 50,000. 66
6.1 Kernel approximation errors for the Nyström method vs. random Fourier features, in terms of the total numbers of features (top) and the total memory requirement (bottom) in the respective models. Error is measured as mean squared error on the heldout set. For Nyström experiments with D ≤ 2500, and RFF experiments with D ≤ 20000, we run the experiments 10 times, and report the median, with error bars indicating the minimum and maximum. Note that due to small variance, error bars are often not clearly visible. 73
6.2 Spectrum of kernel matrices generated from N = 20k random training points. 73
6.3 Heldout classification or regression performance for the Nyström method vs. random Fourier features, in terms of the total numbers of features (left), total memory requirement (middle), and kernel approximation error (right) of the corresponding models. For Nyström experiments with D ≤ 2500, and RFF experiments with D ≤ 20000, we run the experiments 10 times, and report the median performance, with error bars indicating the minimum and maximum. Note that due to small variance, error bars are often not clearly visible. 74
6.4 Histograms of kernel approximation errors for Nyström features vs. random Fourier features. The different histograms correspond to a partition of the k(x,y) − z(x)^T z(y) values based on the values of k(x,y). Note that the Nyström method has many outliers for k(x,y) ≥ 0.25, some of which are truncated from the histogram. 76
6.5 Histograms of the feature vector norms for Nyström (left) and RFF (right), for various numbers of features. Note that for the RBF kernel, k(x,x) = 1 ∀x ∈ X, so a feature vector z(x) of norm close to 1 approximates this self-similarity measure well. 76
6.6 Heldout classification or regression performance for the Nyström method vs. random Fourier features, in terms of the average kernel approximation errors, measured as |k(x,y) − z(x)^T z(y)|^r for r ∈ {2.5, 3.5, 5.5}. Note that due to numeric underflow, some of the models with lowest approximation error sometimes do not appear in the r = 5.5 plots. 79
D.1 Kernel approximation error, in terms of the number of features. 116
D.2 Kernel approximation error, in terms of the total memory requirement. 116
D.3 Heldout classification or regression performance for the Nyström method vs. random Fourier features, in terms of the total numbers of features (left), total memory requirement (middle), and kernel approximation error (right) of the corresponding models. Results reported on Adult, Cod-RNA, CovType, and Forest. 117
D.4 Heldout classification or regression performance for the Nyström method vs. random Fourier features, in terms of the total numbers of features (left), total memory requirement (middle), and kernel approximation error (right) of the corresponding models. Results reported on TIMIT, Census, CPU, and YearPred. 118
D.5 Spectrum of kernel matrices generated from N = 20k random training points. 119
D.6 Heldout classification or regression performance for the Nyström method vs. random Fourier features, in terms of the average kernel approximation errors, measured as |k(x,y) − z(x)^T z(y)|^r for r ∈ {2.5, 3.5, 5.5}. Note that due to numeric underflow, some of the models with lowest approximation error sometimes do not appear in the plots. Results reported on Adult, Cod-RNA, CovType, and Forest. 120
D.7 Heldout classification or regression performance for the Nyström method vs. random Fourier features, in terms of the average kernel approximation errors, measured as |k(x,y) − z(x)^T z(y)|^r for r ∈ {2.5, 3.5, 5.5}. Note that due to numeric underflow, some of the models with lowest approximation error sometimes do not appear in the plots. Results reported on TIMIT, Census, CPU, and YearPred. 121