i YET ANOTHER ALGORITHM FOR PITCH TRACKING (YAAPT) by Kavita Kasi B.Eng. June 1999, Andhra University, India A Thesis Submitted to the Faculty Of Old Dominion University in Partial Fulfillment of the Requirement for the Degree of MASTER OF SCIENCE ELECTRICAL ENGINEERING OLD DOMINION UNIVERSITY December 2002 Approved by: ______________________________ Stephen A. Zahorian (Director) ______________________________ Vijayan Asari (Member) ______________________________ Glenn Gerdin (Member) ii ABSTRACT YET ANOTHER ALGORITHM FOR PITCH TRACKING (YAAPT) Kavita Kasi Old Dominion University, 2002 Director: Dr. Stephen A. Zahorian This thesis presents a pitch detection algorithm that is extremely robust for both high quality and telephone speech. The kernel method for this algorithm is the Normalized Cross Correlation (NCCF) reported by David Talkin [16]. Major innovations include: processing of the original acoustic signal and a nonlinearly processed version of the signal to partially restore very weak F0 components; intelligent peak picking to select multiple F0 candidates and assign merit factors; and, incorporation of highly robust pitch contours obtained from smoothed versions of low frequency portions of spectrograms. Dynamic programming is used to find the “best” pitch track among all the candidates, using both local and transition costs. The algorithm has been evaluated using the Keele pitch extraction reference database as “ground truth” for both “high quality” and “telephone” speech. For both types of speech, the error rates obtained are lower than the lowest reported in the literature. iii © 2002 Kavita Kasi. All Rights Reserved. iv This thesis is dedicated to Sri Saibaba, my parents and my advisor, Dr.Stephen Zahorian. v ACKNOWLEDGMENTS I wish to express my sincere gratitude to my thesis advisor, Dr. Stephen A. Zahorian, for his invaluable advice, guidance, motivation and patience throughout the research work. Without his support, this thesis would not have been possible. I also thank the National Science Foundation, which partially funded the research under grant BES-9977260. I would like to thank Dr.Vijayan Asari and Dr.Glenn Gerdin for graciously agreeing to be on my committee and for their valuable time and generous assistance. I would like to thank all my friends in the Speech Communication Lab for establishing a positive working environment. Finally, I wish to thank my family for their love and support in completion of my thesis. vi TABLE OF CONTENTS Page LIST OF TABLES ..……........…………………………………………….............................ix LIST OF FIGURES ...………………………………………………………………......…….x LIST OF EQUATIONS ..………………………………………........................……............xii CHAPTERS I. INTRODUCTION ..........................................................................................................1 1.1 General Introduction………………………………………....................................1 1.2 Speech Production and Various Properties of Speech Signal..................................2 1.3 How Does Pitch Relate to Speech Production and what is Pitch Tracking.……....7 1.4 The Difficulty of Pitch Tracking……………………………………......................8 1.5 Basic Overview of Pitch Tracking Approaches.....……………………………......9 1.5.1 Pre-processing.....................…………………………………………............9 1.5.2 F0 Candidate Estimation.………………………………………….....................11 1.5.3 Post Processing..............……………………………………………..................13 1.6 Objectives and Overview of the Thesis.……………………………………..............14 II. BACKGROUND...........................................................................................................16 2.1 General Information……………………………………………………………...16 2.2 Survey of Existing Pitch Tracking Algorithms..……………………………........16 2.2.1 Comparison Study..............………………………………………...............17 2.2.2 Average Magnitude Difference Function (AMDF) Approach …………..................……………………………………………..........................18 2.2.3 Maximum Posteriori Pitch Tracking.....………………………………........20 2.2.4 Pitch Tracking for Telephone Speech Using Frequency Domain Methods ………................…………………………………………………………...........23 2.2.5 Application of Pitch Tracking for Prosodic Modeling to Improve ASR for Telephone Speech.………………………………………………………….........26 2.3 Summary............................……………………………………………................27 III. THE ALGORITHM..........................................................................................................28 3.1 General Overview of Pitch Estimation Steps..……………………………….......29 3.1.1 Pre-processing..............…………………………………………...............30 3.1.2 F0 Candidate Estimation..........…………………………………...............37 3.1.3 Candidate Refinement Based on Spectral Information…………...............42 3.1.4 Candidate Modification Based on Plausibility and Continuity Constraints …………......................…………………………………………………………..45 3.1.5 Final Path Determination Using Dynamic Programming ………...….........49 vii 3.1.6 Determination of Pitch Period Markers for Voiced Regions of the Speech Signals...........…………………………………………………………….............50 3.2 Summary……………………………………………............................................57 IV. EXPERIMENTAL VERIFICATION..............................................................................59 4.1 Introduction...........................…………………………………………….............59 4.2 Brief Description of The Test Database....………………………………….........61 4.3 Error Measures.....…………………………………………..................................63 4.3.1 Error type I.......…………………………………………….........................65 4.3.2 Error type II.........…………….…………………………….........................66 4.3.3 Error type III.....................…………………………………………............68 4.3.4 Error type VI....………………………………………….............................69 4.3.5 Error type V.......…………………………………………...........................70 4.3.6 Error type VI.............…………………………………………....................71 4.4 Results and Discussion........………………………………………......................72 4.4.1 Comparison With Results from Literature Survey……………………….. 82 4.5 Summary........................……………………………………………....................84 V. CONCLUSIONS AND FUTURE IMPROVEMENTS.....................................................85 5.1 Overview...........…………………………………………….................................85 5.2 Conclusions and Future Work…………………………………………………..85 REFERENCES.......................................................................................................................88 APPENDIX VARIABLE NAMES AND HELP FILE........................................................91 1. Variable Names……..............….……………………………………….......................91 2. Help File……................…………………………………………….............................92 CURRICULAM VITA...........................................................................................................95 viii LIST OF TABLES Table Page 2.1 Error rates indicating relative comparision of Maximum Likelihod,Vwaves and MAP Pitch Trackers……. …...............…………………………………..............23 2.2 Error table for the performance comparision of DLFT and Xwaves..…………...26 4.1 Pitch Tracking error summary for several databases to minimize BIG errors ….77 4.2 Pitch Tracking error summary for several databases to minimize GROSS errors ……….….………………………………………………………………………..78 4.3 Pitch Tracking error summary for several databases to minimize overall BIG errors………………………………………………………………………….….79 4.4 Summary of BIG errors and Gross voiced errors for Keele database with different controls….......................……………………………………………....................80 4.5 Parameter values for experiments….……….………………………....................81 4.6 Table of summary of performance of available algorithms……….......…............84 ix LIST OF FIGURES Figure Page 1.1 Illustrations of physiologic components of human speech production ...................3 1.2 Illustrations of an acoustic speech signal, the spectrogram view and the voiced spectrogram view and the voiced-unvoiced portions of speech.….........................5 1.3 Illustration of adult human vocal track elaborating the mechanism of human speech....………………………………………………………………...................6 1.4 Illustration of effect of center clipping on a typical speech signals.......................10 3.1 Illustrations of effects of nonlinear processing of speech signals on the F0 of spectrogram of the signals….………………………............................................31 3.2 Presence of five prominent peaks for a pulse train with fundamental of 100 Hz for four duty cycles……………………………………………………………..........33 3.3 Spectra of a highpass filtered pulse train illustrating the nearly eliminated fundamental at 100Hz....……………………………............................................34 3.4 Spectra of absolute valued highpass filtered pulse train, illustrating the restoration of missing fundamental………………………………………………………......35 3.5 Illustration of effect autocorrelation and NCCF signals on a typical voiced frame of speech……………………………………………………………………........39 3.6 Illustration of the overall pitch tracking algorithm…............................................50 3.7 Illustration of the identified locations of the pitch period markers in the LPC residual signal………………………………………………………………........52 3.8 Depiction of the overall pitch tracking algorithm in the form of flowchart ........54 4.1 Illustration where the reference contour contains non-zero pitch values while the estimated contour contains zero pitch values (voiced to unvoiced errors) ...........66 4.2 Illustration where the reference contour contains zero pitch values while the estimated contour contains the zero pitch values (unvoiced to voiced errors)......67 4.3 Illustration of final pitch contour estimation by the algorithm to represent the lack of gross errors estimations as pitch halving or pitch doubling for a typical case ……………………………………………………………………..……………..73 x 4.4 Illustration of cases of pitch halving for Keele reference (Cntrl_1) and the subsequent correction by setting a threshold to create a second reference (Cntrl_2) ………………………………………………………............................................76
Description: