Knowledge Distillation for Small-footprint Highway Networks

Liang Lu†, Michelle Guo‡ and Steve Renals∗
†Toyota Technological Institute at Chicago   ‡Stanford University   ∗The University of Edinburgh
6 March 2017

Why smaller model?
• Deep learning has made a tremendous impact
  ◦ Large amount of data for training
  ◦ Powerful computational devices
  ◦ Connected to a server
• Embedded (client-side) deep learning
  ◦ Local inference
  ◦ Small footprint
  ◦ Energy efficient

Why smaller model for speech recognition?
• Speech recognition as an interface (requiring an internet connection)
  ◦ Google Home
  ◦ Amazon Alexa
  ◦ Microsoft Cortana
  ◦ Apple Siri
  ◦ ...
• Local speech service
  ◦ Internet is unavailable
  ◦ Privacy issues
  ◦ Low latency

Background: smaller models
• Low-rank matrices for DNNs
  ◦ J. Xue, J. Li, and Y. Gong, "Restructuring of deep neural network acoustic models with singular value decomposition," in Proc. INTERSPEECH, 2013
  ◦ T. N. Sainath, B. Kingsbury, et al., "Low-rank matrix factorization for deep neural network training with high-dimensional output targets," in Proc. ICASSP, IEEE, 2013

Background: smaller models
• Structured linear layers
  ◦ V. Sindhwani, T. N. Sainath, and S. Kumar, "Structured transforms for small-footprint deep learning," in Proc. NIPS, 2015
  ◦ M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas, "ACDC: A Structured Efficient Linear Layer," in Proc. ICLR, 2016

Background: smaller models
[Figure: a teacher network T and a student network S receive the same input x; the student prediction y is trained to match the teacher output ŷ through the loss L(ŷ, y).]
• FitNet by teacher-student training
  ◦ J. Li, R. Zhao, J.-T. Huang, and Y. Gong, "Learning small-size DNN with output-distribution-based criteria," in Proc. INTERSPEECH, 2014
  ◦ R. Adriana, B. Nicolas, K. Samira Ebrahimi, C. Antoine, G. Carlo, and B. Yoshua, "FitNets: Hints for thin deep nets," in Proc. ICLR, 2015

Model

  h_l = \sigma(h_{l-1}, \theta_l) \circ \underbrace{T(h_{l-1}, W_T)}_{\text{transform gate}} + h_{l-1} \circ \underbrace{C(h_{l-1}, W_c)}_{\text{carry gate}}    (1)

• Shortcut connections with gates (a minimal code sketch of this layer is given after the slide overview below)
• Similar to residual networks
• W_T and W_c are layer independent

[1] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Training very deep networks," in Proc. NIPS, 2015
[2] Y. Zhang, et al., "Highway Long Short-Term Memory RNNs for Distant Speech Recognition," in Proc. ICASSP, 2015
[3] L. Lu and S. Renals, "Small-footprint deep neural networks with highway connections for speech recognition," in Proc. INTERSPEECH, 2016

Loss Functions
• Cross Entropy Loss

  \mathcal{L}^{(CE)}(\theta) = -\sum_j \underbrace{\hat{y}_{jt}}_{\text{label}} \log \underbrace{y_{jt}}_{\text{prediction}}    (2)

  where j is the class index and t is the time step.
• Teacher-Student Loss (KL divergence)

  \mathcal{L}^{(KL)}(\theta) = -\sum_j \underbrace{\tilde{y}_{jt}}_{\text{teacher prediction}} \log \underbrace{y_{jt}}_{\text{student prediction}}    (3)

  (a sketch combining Eqs. (2) and (3) in code is given after the slide overview below)

Loss Functions
• Sequence-level teacher-student loss

  \mathcal{L}^{(sKL)}(\theta) \approx -\sum_{\mathcal{W} \in \Phi} P^{*}(\mathcal{W} \mid X) \log P(\mathcal{W} \mid X, \theta)    (4)

  where P(\mathcal{W} \mid X, \theta) is the posterior given by MMI or sMBR.
• Teacher-student training to sequence training

  \hat{\mathcal{L}}(\theta) = \mathcal{L}^{(sMBR)}(\theta) + p \, \mathcal{L}^{(KL)}(\theta)    (5)

  where p is the regularization weight.

[1] J. Wong and M. Gales, "Sequence Student-Teacher Training of Deep Neural Networks," in Proc. INTERSPEECH, 2016
[2] Y. Kim and A. Rush, "Sequence-Level Knowledge Distillation," arXiv, 2016

Experiments
• AMI corpus: 80h training data (28 million frames)
• Using the standard Kaldi recipe
  ◦ fMLLR acoustic features
  ◦ 3-gram language models
• CNTK was used to build HDNN models
• The same decision tree was used
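To make Eq. (1) concrete, here is a minimal sketch of a stack of highway layers in PyTorch. This is illustrative code, not from the slides: the sigmoid hidden activation and fully connected transforms are assumptions, while the sharing of the gate parameters W_T and W_c across layers follows the "Model" slide.

```python
import torch
import torch.nn as nn

class HighwayStack(nn.Module):
    """Stack of highway layers following Eq. (1):
    h_l = sigma(h_{l-1}, theta_l) * T(h_{l-1}, W_T) + h_{l-1} * C(h_{l-1}, W_c).
    The transform/carry gate parameters are shared across layers, as stated
    on the "Model" slide; layer sizes and nonlinearities are illustrative.
    """
    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        # Per-layer hidden transformations sigma(., theta_l)
        self.hidden = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        # Layer-independent (shared) transform and carry gates
        self.transform_gate = nn.Linear(dim, dim)  # W_T
        self.carry_gate = nn.Linear(dim, dim)      # W_c

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for layer in self.hidden:
            t = torch.sigmoid(self.transform_gate(h))  # T(h_{l-1}, W_T)
            c = torch.sigmoid(self.carry_gate(h))      # C(h_{l-1}, W_c)
            h = torch.sigmoid(layer(h)) * t + h * c    # Eq. (1)
        return h
```

Because the gates are shared, adding depth only adds the per-layer hidden transforms, which is one reason the highway architecture suits small-footprint models.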
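A similar sketch for the frame-level losses of Eqs. (2)-(3), again as illustrative PyTorch rather than the authors' CNTK setup: the student is trained on a weighted combination of cross entropy against the hard labels and KL divergence towards the teacher posteriors. The interpolation weight p here mirrors the regularization weight of Eq. (5), although Eq. (5) combines the KL term with sMBR sequence training rather than plain cross entropy.

```python
import torch
import torch.nn.functional as F

def frame_level_kd_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        labels: torch.Tensor,
                        p: float = 1.0) -> torch.Tensor:
    """Combine Eq. (2) (cross entropy with hard labels) and Eq. (3)
    (teacher-student KL term) for a batch of frames.
    student_logits, teacher_logits: (num_frames, num_classes); labels: (num_frames,).
    """
    # Eq. (2): cross entropy with the hard alignment labels
    ce = F.cross_entropy(student_logits, labels)
    # Eq. (3): -sum_j y~_jt log y_jt, i.e. KL divergence up to the teacher entropy
    log_y_student = F.log_softmax(student_logits, dim=-1)
    y_teacher = F.softmax(teacher_logits, dim=-1)
    kl = F.kl_div(log_y_student, y_teacher, reduction="batchmean")
    return ce + p * kl
```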
