
Improving Hindi Speech Recognition Using Filter Bank Optimization and Acoustic Model ... PDF

206 Pages·2012·7.77 MB·English


Improving Hindi Speech Recognition Using Filter Bank Optimization and Acoustic Model Refinement

A Thesis Submitted in fulfillment of the requirement of the degree of Doctor of Philosophy
by Rajesh Kumar Aggarwal
Reg. No. 2K07-NITK-Ph.D.-1142-CO
Under the supervision of Dr. Mayank Dave, Associate Professor
Department of Computer Engineering
National Institute of Technology
Kurukshetra-136119, India
December 2012

Candidate’s Declaration

I hereby certify that the work which is being presented in the thesis entitled “Improving Hindi Speech Recognition Using Filter Bank Optimization and Acoustic Model Refinement”, in partial fulfillment of the requirements for the award of the degree of Doctor of Philosophy of National Institute of Technology, Kurukshetra, is an authentic record of my own work carried out during the period March 2007 to November 2012, under the supervision of Dr. Mayank Dave, Associate Professor, National Institute of Technology, Kurukshetra. The matter presented in this thesis has not been submitted by me for the award of any other degree of this or any other Institute/University.

(Rajesh Kumar Aggarwal)
Regn. No. 2K07-NITK-Ph.D.-1142-CO

This is to certify that the above statement made by the candidate is correct to the best of my knowledge.

Date:
(Dr. Mayank Dave)
Associate Professor
National Institute of Technology, Kurukshetra

Acknowledgement

Success in life is never attained single-handedly. My deepest gratitude goes to my thesis advisor, Dr. Mayank Dave, for his guidance, help and encouragement throughout my research work. His enlightening ideas, comments, interpretations and suggestions increased my cognitive awareness and helped considerably in the fruition of my objective. He gave me all the freedom I needed for doing what I liked. I really enjoyed working with him.

Words are not enough to express my gratitude to Dr. J. K. Chhabra, Head, Department of Computer Engineering, for his insightful comments and administrative help on various occasions. His hard-working attitude and high expectations towards research have inspired me to mature into a better researcher. I would also like to thank my DRC members, Dr. A. K. Singh and Dr. S. K. Jain, for stimulating questions and invaluable feedback. I am also grateful to Dr. O. P. Sahu, Head, EC & CE Department, for many wonderful tips and technical discussions that helped me to accomplish my goals. Special thanks to some eminent personalities, Dr. S. S. Agrawal and Dr. B. Yegnanarayana, from whom I have drawn inspiration for research.

Doing a thesis is a long journey, and ups and downs are common during this period. Fortunately I have many understanding friends, students and well-wishers who have helped me a lot in many critical situations. In particular, I would like to thank Mr. Gaurav Leekha and Mr. Yogesh Soni for careful reviews of many of my papers and chapters. My sincere thanks go to my family members and all those who have directly and indirectly provided me moral support and other kinds of help.

Finally, I wish to thank the Almighty, who has blessed me with the power to think and to articulate my thoughts. The thesis would never have been possible without His grace and guidance.

(Rajesh Kumar Aggarwal)

Abstract

Although enormous progress has been made during the last four decades in the domain of Automatic Speech Recognition (ASR) systems, there is still a considerable gap between human and machine performance, because ASR systems lack robustness against speech variability, especially in noisy environments. While many algorithms have been proposed to cope with these problems, they tend to be effective mainly in the presence of stationary noise (such as white or pink noise) rather than in real-world scenarios such as background music, background speech, and reverberation.
The goal of this thesis is to resolve these critical robustness issues by combining acoustic features, integrating acoustic and language models, and optimizing the front-end and back-end. Open issues and the pros and cons of the different methodologies and techniques are also highlighted.

Two new feature-related methods are proposed at the front-end: optimization of filter banks and sequential combination of heterogeneous feature vectors. The parameters of Mel filter banks are optimized in two separate ways, using a genetic algorithm (GA) and particle swarm optimization (PSO). These filters are optimized in a clean environment and at a 12 dB noise level using white and babble noise sources. The proposed optimization methods are applied to Hindi vowel recognition. To derive a new feature representation, the Mel frequency cepstral coefficient (MFCC) and perceptual linear prediction (PLP) feature extraction techniques are combined by concatenating the two feature streams generated from the same data source. To reduce the size of the concatenated feature vectors, a heteroscedastic linear discriminant analysis (HLDA) projection scheme is used. With the new features, experiments are performed on isolated Hindi words.

At the back-end, two main novel approaches are proposed, one based on system combination and the other on optimization of neural network classifiers. Three different subsystems are developed and integrated into one system using combination techniques such as recognizer output voting error reduction (ROVER) and confusion network combination (CNC). The proposed techniques are applied to the continuous Hindi speech recognition system under normal field conditions as well as in noisy environments. The algorithms normally work in real time, and a significant improvement in system performance is achieved. A self-developed speech database and text corpus are used for this experimental work.
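
As an illustration of the Mel filter bank that the abstract says is tuned by GA and PSO, the following is a minimal sketch of a conventional triangular filter bank. It is not the thesis's implementation: the abstract does not say which parameters are optimized, so the sketch assumes the filter edge frequencies are the tunable quantity. The function names, the 24-filter count, the 512-point FFT, and the 16 kHz sampling rate are all illustrative assumptions, and the Hz-to-Mel formula is the commonly used 2595 log10(1 + f/700) variant.

```python
import math

def hz_to_mel(f_hz):
    # Commonly used Mel-scale formula (assumed here, not taken from the thesis).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_edges(n_filters, f_low, f_high):
    """Return n_filters + 2 edge frequencies (Hz), equally spaced on the Mel scale."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (n_filters + 1)
    return [mel_to_hz(m_low + i * step) for i in range(n_filters + 2)]

def triangular_filterbank(edges_hz, n_fft, sample_rate):
    """Build one triangular weight vector per filter over the FFT bins.

    Filter i rises from edges_hz[i] to edges_hz[i + 1] and falls to
    edges_hz[i + 2]. An optimizer could perturb edges_hz (hypothetical
    choice of tunable parameters) and rebuild the bank for each candidate.
    """
    n_bins = n_fft // 2 + 1
    bin_hz = sample_rate / n_fft
    bank = []
    for left, center, right in zip(edges_hz, edges_hz[1:], edges_hz[2:]):
        row = []
        for k in range(n_bins):
            f = k * bin_hz
            if left < f <= center:
                row.append((f - left) / (center - left))
            elif center < f < right:
                row.append((right - f) / (right - center))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

# Baseline (unoptimized) bank with illustrative settings.
edges = mel_edges(n_filters=24, f_low=0.0, f_high=8000.0)
bank = triangular_filterbank(edges, n_fft=512, sample_rate=16000)
```

Under this assumption, a GA or PSO search would treat the list `edges` as one candidate solution, rebuild the bank for each candidate, and score it by recognition accuracy on held-out data; the scoring step is omitted because it depends on the recognizer.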
To cope with the limitations of the back-propagation (BP) training technique, the multilayer perceptrons (MLPs) are optimized. GA and a hybrid of PSO and BP are used to optimize the weights between the nodes of different layers of the MLPs. The optimized neural network techniques are compared with standard neural networks and statistical hidden Markov models (HMMs) for Hindi vowel recognition in different environments.

Speech is a natural medium of communication for humans, and applications of speech technology (e.g. voice response systems) can work reliably only if the performance of ASR improves. In a developing country like India, there is vast potential to use speech effectively as a medium of communication between man and machine, enabling the common man to reap the benefits of Information and Communication Technology. This is the key focus of the thesis.

Table of Contents

List of Figures
List of Tables
List of Acronyms

1 Introduction
  1.1 Motivation and Application
  1.2 Framework of ASR
    1.2.1 Feature Extraction
    1.2.2 Acoustic Modeling
    1.2.3 Pronunciation Models
    1.2.4 Language Models
    1.2.5 Decoder
  1.3 Articulatory Phonetics
    1.3.1 Vocal Tract Physiology and Speech Generation
    1.3.2 Manner of Articulation
    1.3.3 Place of Articulation
    1.3.4 Speech Sound Classification
  1.4 Issues in Speech Recognition
    1.4.1 Mismatched and Noisy Environment
    1.4.2 Lack of Resources for Indian Languages
    1.4.3 Vocabulary Size and Operating Mode
    1.4.4 Speaker Variability
    1.4.5 Metrics and Tools for ASR Development
  1.5 Thesis Outline and Contribution
    1.5.1 Scope of the Present Work
    1.5.2 Organization of the Thesis

2 Literature Review and Analysis
  2.1 Introduction
  2.2 Front-End Signal Processing
    2.2.1 Short Term Standard Features
    2.2.2 Posterior Features
    2.2.3 Long Temporal Context Features and Hybrids
    2.2.4 Wavelet based Features
    2.2.5 Localized Spectro-Temporal Features (LSTFs)
  2.3 Back-End Classification
    2.3.1 Conventional Methods
    2.3.2 Limitations of HMM/GMM Framework
      2.3.2.1 Assumptions of Independence
      2.3.2.2 Poor Discrimination
      2.3.2.3 Weak Duration Modeling
    2.3.3 Refinements of HMM
      2.3.3.1 Trajectory Modeling
      2.3.3.2 Discriminative Techniques
      2.3.3.3 Trended HMM
      2.3.3.4 Second Order HMM
      2.3.3.5 Connectionist HMM
    2.3.4 Advanced Acoustic Models
      2.3.4.1 Margin based Approach
      2.3.4.2 HMM with Wavelet Networks
      2.3.4.3 Dual Stream Approach
      2.3.4.4 Conditional Random Field based Models
  2.4 Research Work for Indian Languages
  2.5 LVCSR Challenges
    2.5.1 Computational Overhead in GMM
    2.5.2 Variability from Environment
    2.5.3 Microphone Quality and Distance
    2.5.4 Integrating Multiple Knowledge Sources
