Kernel Approximation Methods for Speech Recognition
Avner May
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences
COLUMBIA UNIVERSITY
2018
© 2017
Avner May
All Rights Reserved
ABSTRACT
Kernel Approximation Methods for Speech Recognition
Avner May
Over the past five years or so, deep learning methods have dramatically improved the state-of-the-art performance in a variety of domains, including speech recognition, computer vision, and natural language processing. Importantly, however, they suffer from a number of drawbacks:
1. Training these models is a non-convex optimization problem, and thus it is difficult to guarantee that a trained model minimizes the desired loss function.
2. These models are difficult to interpret. In particular, it is difficult to explain, for a given model, why the computations it performs make accurate predictions.
In contrast, kernel methods are straightforward to interpret, and training them is a convex optimization problem. Unfortunately, solving these optimization problems exactly is typically prohibitively expensive, though one can use approximation methods to circumvent this problem. In this thesis, we explore to what extent kernel approximation methods can compete with deep learning, in the context of large-scale prediction tasks. Our contributions are as follows:
1. We perform the most extensive set of experiments to date using kernel approximation methods in the context of large-scale speech recognition tasks, and compare performance with deep neural networks.
2. We propose a feature selection algorithm which significantly improves the performance of the kernel models, making their performance competitive with fully-connected feedforward neural networks.
3. We perform an in-depth comparison between two leading kernel approximation strategies, random Fourier features [Rahimi and Recht, 2007] and the Nyström method [Williams and Seeger, 2001], showing that although the Nyström method is better at approximating the kernel, it performs worse than random Fourier features when used for learning.
We believe this work opens the door for future research to continue to push the boundary of what is possible with kernel methods. This research direction will also shed light on the question of when, if ever, deep models are needed for attaining strong performance.
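As a concrete illustration of the random Fourier feature construction referenced above, the following is a minimal sketch for a Gaussian (RBF) kernel, following Rahimi and Recht [2007]. The function name rff_features, the bandwidth parameter gamma, and the toy data are illustrative assumptions and are not taken from the thesis.

```python
import numpy as np

def rff_features(X, D, gamma, seed=0):
    """Map X (n x d) to D random Fourier features whose inner products
    approximate the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    rng = np.random.RandomState(seed)
    d = X.shape[1]
    # Frequencies are drawn from the kernel's spectral density (a Gaussian
    # with standard deviation sqrt(2 * gamma) per coordinate); phases are
    # drawn uniformly from [0, 2*pi).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    # z(x) = sqrt(2 / D) * cos(W^T x + b), so that E[z(x)^T z(y)] = k(x, y).
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Small check: with many features, z(x)^T z(y) is close to k(x, y).
X = np.random.RandomState(1).randn(5, 3)
Z = rff_features(X, D=20000, gamma=0.5)
K_exact = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
print(np.abs(Z @ Z.T - K_exact).max())  # typically on the order of 1e-2
```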
Table of Contents
List of Figures iv
List of Tables vii
1 Introduction 1
2 Preliminaries 8
2.1 Notation 8
2.2 Kernel methods 9
2.2.1 Primal formulation 10
2.2.2 Dual formulation 12
2.3 Kernel approximation 13
2.3.1 Random Fourier features (RFF) 14
2.3.2 Nyström method 16
2.4 Reproducing kernel Hilbert spaces (RKHS) 19
2.4.1 Representer Theorem 22
2.5 Neural networks 25
2.5.1 Backpropagation 26
2.5.2 Other architectures 29
2.6 Automatic speech recognition (ASR) 32
2.6.1 Acoustic model training 34
2.6.2 Using neural networks for acoustic modeling 35
3 Related work 37
4 Random Fourier features for acoustic modeling 42
4.1 Methods 43
4.1.1 Using kernel approximation methods for acoustic modeling 43
4.1.2 Linear bottlenecks 43
4.1.3 Entropy regularized perplexity (ERP) 44
4.2 Tasks, datasets, and evaluation metrics 46
4.3 Details of acoustic model training 48
4.4 Results 50
4.5 Other Possible Improvements to DNNs and Kernels 53
4.6 Conclusion 55
5 Compact kernel models via random feature selection 56
5.1 Random feature selection 56
5.2 A sparse Gaussian kernel 58
5.3 Results 59
5.4 Analysis: Effects of random feature selection 64
5.5 Conclusion 66
6 Nyström method vs. random Fourier features 67
6.1 Review of Nyström method properties 68
6.2 Experiments 69
6.2.1 Task and dataset details 69
6.2.2 Training details 70
6.2.3 Results 71
6.3 Nyström method error analysis 79
6.4 Conclusion 82
7 Conclusion 83
7.1 Future work 84
Bibliography 86
Appendices 103
A Definitions 104
B Derivation for random Fourier features 106
C Detailed results 108
C.1 Results from Section 4 108
C.2 Results from Section 5 111
D Nyström Appendix 114
D.1 Datasets 114
D.2 Hyperparameter Choices 115
D.3 Results 115
D.4 Background for Proofs 122
D.4.1 Definitions of a couple infinite-dimensional Hilbert Spaces 122
D.4.2 Review of Reproducing Kernel Hilbert Space Definitions 122
D.5 Proofs: Nyström Background Section 123
D.6 Proofs: Nyström Error Analysis 125
D.6.1 Theorem 4 125
D.6.2 Theorem 5 128
D.6.3 Theorem 6 130
D.6.4 Theorem 7 130
D.7 Other ways of understanding the Nyström method 132
D.7.1 Nyström method as a projection onto a subspace 132
D.7.2 Nyström method as a solution to an optimization problem 132
D.7.3 Nyström method as a preconditioner 133
D.7.4 Nyström method as eigenfunction approximator 134
List of Figures
1.1 Impact of deep learning methods on state-of-the-art performance in speech recognition and computer vision. 4
4.1 Kernel-acoustic model seen as a shallow neural network 44
4.2 Performance of kernel acoustic models on the BN50 dataset, as a function of the number of random features D used. Results are reported in terms of heldout cross-entropy as well as development set TER. The color and shape of the markers indicate the kernel used. 54
5.1 Performance of kernel acoustic models on the BN50 dataset, as a function of the number of random features D used. Results are reported in terms of heldout cross-entropy as well as development set TER. Dashed lines signify that feature selection was performed, while solid lines mean it was not. The color and shape of the markers indicate the kernel used. 65
5.2 Fraction of the s_t features selected in iteration t that are in the final model (survival rate) for the Cantonese dataset. 66
5.3 The relative weight of each input feature in the random matrix Θ, for the Cantonese dataset, D = 50,000. 66
6.1 Kernel approximation errors for the Nyström method vs. random Fourier features, in terms of the total numbers of features (top) and the total memory requirement (bottom) in the respective models. Error is measured as mean squared error on the heldout set. For Nyström experiments with D ≤ 2500, and RFF experiments with D ≤ 20000, we run the experiments 10 times, and report the median, with error bars indicating the minimum and maximum. Note that due to small variance, error bars are often not clearly visible. 73
6.2 Spectrum of kernel matrices generated from N = 20k random training points. 73
6.3 Heldout classification or regression performance for the Nyström method vs. random Fourier features, in terms of the total numbers of features (left), total memory requirement (middle), and kernel approximation error (right) of the corresponding models. For Nyström experiments with D ≤ 2500, and RFF experiments with D ≤ 20000, we run the experiments 10 times, and report the median performance, with error bars indicating the minimum and maximum. Note that due to small variance, error bars are often not clearly visible. 74
6.4 Histograms of kernel approximation errors for Nyström features vs. random Fourier features. The different histograms correspond to a partition of the k(x,y) − z(x)^T z(y) values based on the values of k(x,y). Note that the Nyström method has many outliers for k(x,y) ≥ 0.25, some of which are truncated from the histogram. 76
6.5 Histograms of the feature vector norms for Nyström (left) and RFF (right), for various numbers of features. Note that for the RBF kernel, k(x,x) = 1 ∀x ∈ X, so a feature vector z(x) of norm close to 1 approximates this self-similarity measure well. 76
6.6 Heldout classification or regression performance for the Nyström method vs. random Fourier features, in terms of the average kernel approximation errors, measured as |k(x,y) − z(x)^T z(y)|^r for r ∈ {2.5, 3.5, 5.5}. Note that due to numeric underflow, some of the models with lowest approximation error sometimes do not appear in the r = 5.5 plots. 79
D.1 Kernel approximation error, in terms of the number of features. 116
D.2 Kernel approximation error, in terms of the total memory requirement. 116
D.3 Heldout classification or regression performance for the Nyström method vs. random Fourier features, in terms of the total numbers of features (left), total memory requirement (middle), and kernel approximation error (right) of the corresponding models. Results reported on Adult, Cod-RNA, CovType, and Forest. 117
D.4 Heldout classification or regression performance for the Nyström method vs. random Fourier features, in terms of the total numbers of features (left), total memory requirement (middle), and kernel approximation error (right) of the corresponding models. Results reported on TIMIT, Census, CPU, and YearPred. 118
D.5 Spectrum of kernel matrices generated from N = 20k random training points. 119
D.6 Heldout classification or regression performance for the Nyström method vs. random Fourier features, in terms of the average kernel approximation errors, measured as |k(x,y) − z(x)^T z(y)|^r for r ∈ {2.5, 3.5, 5.5}. Note that due to numeric underflow, some of the models with lowest approximation error sometimes do not appear in the plots. Results reported on Adult, Cod-RNA, CovType, and Forest. 120
D.7 Heldout classification or regression performance for the Nyström method vs. random Fourier features, in terms of the average kernel approximation errors, measured as |k(x,y) − z(x)^T z(y)|^r for r ∈ {2.5, 3.5, 5.5}. Note that due to numeric underflow, some of the models with lowest approximation error sometimes do not appear in the plots. Results reported on TIMIT, Census, CPU, and YearPred. 121