Acknowledgements

First and foremost, I would like to express my deep and sincere gratitude to my supervisor, Dr. Dijana Petrovska-Delacrétaz, for her support and wise supervision throughout my thesis. I am especially grateful for her constructive criticism and for her confidence in my work and my ideas.

I am deeply grateful to my supervisor, Prof. Gérard Chollet, for his review and his constructive comments and suggestions, which have been of great value to me.

My sincere thanks are due to Prof. Geneviève Baudoin, Prof. Hermann Ney, Dr. Xavier Anguera, Prof. Laurent Besacier, Prof. Geoffroy Peeters and Prof. Gaël Richard for being members of my PhD committee and for their valuable comments and suggestions for improving this thesis.

Financial support from the French National Research Agency (ANR) SurfOnHertz project under contract number ARPEGE 2009 SEGI-17, the vAssist project (AAL-2010-3-106) and the FUI Arhome project is gratefully acknowledged.

My thanks are also due to Frédéric Bimbot, Jan Černocký, Asmaa El Hannani, Guido Aversano and the members of the French SYMPATEX project, who provided me with some of the programs used for this research.

My warmest gratitude is due to the head of the TSI research group at Télécom ParisTech, Prof. Yves Grenier, and all its members. I owe particular thanks to Leila, Joseph, Asmaa, Daniel, Pierre, Pierrick, Manel, Mounir, Sébastien, Sathish, Patrick, Jacques, Stephan, Jirasri and Yafei for their encouragement, reviews and interesting discussions.

Special thanks go to Prof. Bernadette Dorizzi and all the members of the EPH department at Télécom SudParis, who have been very supportive whenever it was necessary, especially Sanjay, Sarra, Monia, Maher, Mouna, Nadia, Toufik and Mohamed.

I would also like to thank Hugues Sansen, founder and CEO of SHANKAA, for his relevant comments, suggestions, help and interesting discussions.

My deep and sincere gratitude goes to all my friends, especially Haythem, Borhen, Sana, Hamdi, Rachid, Kamel, Takoua, Ahlem, Azza, Meriem, Hamed, Maher, Slim, Mehdi, Marwen, Rim, Alae, Dali, Zied, Walid and Bilel, for their continuous support, particularly during the difficult moments.

Finally, my most sincere gratitude goes to my wife, my lovely daughter, my parents, and my sister and brother for their encouragement and the tremendous support they provided me throughout my entire life, especially during my PhD. To them I dedicate this thesis.

Abstract

The amount of available audio data, such as broadcast news archives, radio recordings, music and song collections, podcasts and various internet media, is constantly increasing. At the same time, few audio classification and retrieval tools exist to help users browse audio documents.

Content-based audio retrieval is a less mature field than image and video retrieval. Some applications already exist, such as song classification, advertisement (commercial) detection, and speaker diarization and identification, with various systems being developed to automatically analyze and summarize audio content for indexing and retrieval purposes. Within these systems, audio data is treated differently depending on the application. For example, song identification systems are generally based on audio fingerprinting using energy and spectrogram peaks (as in the SHAZAM and Philips systems), while speaker diarization and identification systems use cepstral features and machine learning techniques such as Gaussian Mixture Models (GMMs) and/or Hidden Markov Models (HMMs).
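To make the fingerprinting idea concrete, here is a minimal sketch of spectrogram peak pairing in Python, in the spirit of (but not reproducing) the SHAZAM landmark scheme; the function name, the peak-picking rule and the fan-out value are illustrative assumptions:

    import numpy as np
    from scipy.signal import stft

    def landmark_hashes(samples, rate, fan_out=3):
        # Magnitude spectrogram of the input signal.
        _, _, spec = stft(samples, fs=rate, nperseg=512)
        mag = np.abs(spec)
        floor = mag.mean()
        # Crude peak picking: a time-frequency bin is a peak if it
        # dominates its 3x3 neighbourhood and the global mean magnitude.
        peaks = []
        for f in range(1, mag.shape[0] - 1):
            for t in range(1, mag.shape[1] - 1):
                if mag[f, t] == mag[f-1:f+2, t-1:t+2].max() and mag[f, t] > floor:
                    peaks.append((f, t))
        peaks.sort(key=lambda p: p[1])  # chronological order
        # Pair each peak with the next few peaks; the triple
        # (f1, f2, time delta) acts as a hash, stored with its offset t1.
        hashes = []
        for i, (f1, t1) in enumerate(peaks):
            for f2, t2 in peaks[i + 1:i + 1 + fan_out]:
                hashes.append(((f1, f2, t2 - t1), t1))
        return hashes

Identification then reduces to looking up the query hashes in a table of reference hashes and voting for a consistent time offset.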
The diversity of audio indexing techniques makes it difficult to treat, within a single framework, audio streams in which different types of audio content (music, commercials, jingles, speech, laughter, etc.) coexist.

In this thesis we report our recent efforts in extending the ALISP (Automatic Language Independent Speech Processing) approach, developed for speech, into a generic method for audio indexing, retrieval and recognition. ALISP is a data-driven technique that was first developed for very low bit-rate speech coding and then successfully adapted to other tasks such as speaker verification and forgery, and language identification. The particularity of the ALISP tools is that no textual transcriptions are needed during the learning step; raw audio data alone is sufficient. Any input speech data is transformed into a sequence of arbitrary symbols, and these symbols can be used for indexing purposes. The main contribution of this thesis is the exploitation of the ALISP approach as a generic method for audio (and not only speech) indexing and recognition. To this end, an audio indexing system based on the ALISP technique is proposed. It is composed of the following modules:

• Automated acquisition (with unsupervised machine learning methods) and Hidden Markov Modeling (HMM) of ALISP audio models.

• A segmentation module (also referred to as sequencing or transcription) that transforms the audio data into a sequence of symbols, using the previously acquired ALISP Hidden Markov Models.

• A comparison and decision module, including approximate matching algorithms inspired by the Basic Local Alignment Search Tool (BLAST), widely used in bioinformatics, and the Levenshtein distance, to search for the sequence of ALISP symbols of unknown audio data in the reference database (related to different audio items).

Our main contributions in this PhD can be divided into three parts:

1. Improving the ALISP tools by introducing a simple method to find stable segments within the audio data. This technique, referred to as spectral stability segmentation, replaces the temporal decomposition previously used for speech processing. The main advantage of this method is its very low computational cost compared to temporal decomposition.

2. Proposing an efficient technique to retrieve relevant information from ALISP sequences using the BLAST algorithm and the Levenshtein distance. This method speeds up the retrieval process without affecting the accuracy of the audio indexing process (a sketch of this idea is given after the abstract).

3. Proposing a generic audio indexing system, based on data-driven ALISP sequencing, for radio stream indexing. This system is applied to different fields of audio indexing, covering the majority of audio items that could be present in a radio stream:

- audio identification: detection of occurrences of a specific audio content (music, advertisements, jingles) in a radio stream;

- audio motif discovery: detection of repeating objects in audio streams (music, advertisements, and jingles);

- speaker diarization: segmentation of an input audio stream into homogeneous regions according to speaker identities, in order to answer the question "Who spoke when?";

- nonlinguistic vocalization detection: detection of nonlinguistic sounds such as laughter, sighs, coughs, or hesitations.

The evaluations of the proposed systems are done on the YACAST database (a working database for the SurfOnHertz project) and other publicly available corpora. The experimental results show excellent performance in audio identification (for advertisements and songs), audio motif discovery (for advertisements and songs), speaker diarization and laughter detection. Moreover, the ALISP-based system obtained the best results for the speaker diarization task in the ETAPE 2011 (Évaluations en Traitement Automatique de la Parole) evaluation campaign.
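To illustrate the comparison module described above, the following is a minimal Python sketch of BLAST-inspired seed-and-extend search over ALISP symbol sequences, combined with the Levenshtein distance. The function names, the fixed-size window heuristic and the thresholds are assumptions made for this example, not the implementation evaluated in the thesis:

    from collections import defaultdict

    def levenshtein(a, b):
        # Dynamic-programming edit distance between two symbol sequences.
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            cur = [i]
            for j, y in enumerate(b, 1):
                cur.append(min(prev[j] + 1,               # deletion
                               cur[j - 1] + 1,            # insertion
                               prev[j - 1] + (x != y)))   # substitution
            prev = cur
        return prev[-1]

    def build_index(reference, k=4):
        # Hash table of every k-gram in the reference ALISP transcription.
        index = defaultdict(list)
        for i in range(len(reference) - k + 1):
            index[tuple(reference[i:i + k])].append(i)
        return index

    def search(query, reference, index, k=4, max_dist=5):
        # Seed: exact k-gram hits, as in BLAST. Extend: score the aligned
        # reference window with the Levenshtein distance.
        hits = set()
        for i in range(len(query) - k + 1):
            for pos in index.get(tuple(query[i:i + k]), ()):
                start = max(0, pos - i)
                if levenshtein(query, reference[start:start + len(query)]) <= max_dist:
                    hits.add(start)
        return sorted(hits)

In practice, the index would be built once over the transcriptions of the reference database and each unknown transcription searched against it; the exact-match seeds prune the search so that the costly edit-distance computation runs only on a few candidate windows.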
Glossary

Automatic speech recognition: Conversion of a speech signal into a textual representation by automated methods.

Audio fingerprint: A compact content-based signature that represents an audio recording.

Audio identification: Detection and location of occurrences of a specific audio content (music, advertisement, jingle, etc.) in audio streams or audio databases.

Audio indexing: Extraction of relevant information from unknown audio data.

Audio motif discovery: Detection of repeating audio objects in audio streams or audio databases.

Basic Local Alignment Search Tool (BLAST): An algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences.

Data-driven approaches: Techniques that automatically learn the linguistic units and information required from representative examples of data, without human expertise.

Hidden Markov Model (HMM): A statistical model used to model a process that evolves over time, where the exact state of the process is unknown, or "hidden".

High-level information: A set of information that reflects behavioral traits such as prosody, phonetic information, pronunciation, idiolectal word usage, conversational patterns, topics of conversation, etc.

Levenshtein distance: A string metric for measuring the difference between two sequences. The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, substitutions) required to change one word into the other.

Mel-Frequency Cepstral Coefficients (MFCC): Coefficients of the cepstrum of the short-term spectrum, downsampled and weighted according to the Mel scale, which follows the sensitivity of the human ear.

Nonlinguistic vocalization: Very brief, discrete, nonverbal expressions related to human behavior.

Precision: The fraction of retrieved documents that are relevant to the search (a set formulation of precision and recall is given after this glossary).

Recall: The fraction of the documents relevant to the query that are successfully retrieved.

Reference database: Contains all the audio items to be identified by an audio identification system.

Speaker diarization: Segmenting input audio data into homogeneous regions according to speaker identities, in order to answer the question "Who spoke when?".

Speaker identification: Determining which registered speaker provided a given utterance.

Speaker verification: Accepting or rejecting the identity claim of a speaker.
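For reference, the precision and recall entries above can be written in the standard set notation:

\[
\mathrm{Precision} = \frac{|\{\mathrm{relevant}\} \cap \{\mathrm{retrieved}\}|}{|\{\mathrm{retrieved}\}|},
\qquad
\mathrm{Recall} = \frac{|\{\mathrm{relevant}\} \cap \{\mathrm{retrieved}\}|}{|\{\mathrm{relevant}\}|}
\]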
Contents

List of Figures .................................................. 13
List of Tables ................................................... 16

1 Résumé Long .................................................... 18
  1.1 Introduction ............................................... 18
  1.2 État de l'Art des Systèmes d'Indexation Audio par Extraction d'Empreinte ... 21
    1.2.1 Techniques Basées sur la Représentation Spectrale ...... 22
    1.2.2 Techniques Basées sur la Vision par Ordinateur ......... 23
    1.2.3 Techniques Basées sur la Modélisation Statistique ...... 24
    1.2.4 Étude Comparative ...................................... 25
  1.3 Contributions à l'Indexation Audio Non Supervisée .......... 27
    1.3.1 Amélioration des Outils ALISP .......................... 28
    1.3.2 Appariement Approximatif des Séquences ALISP ........... 31
      1.3.2.1 Recherche Exhaustive ............................... 31
      1.3.2.2 BLAST Algorithm .................................... 32
      1.3.2.3 Méthode Proposée pour l'Appariement Approximatif ... 32
    1.3.3 Système Générique d'Indexation Audio à Base d'ALISP .... 34
  1.4 Évaluations et Résultats ................................... 35
    1.4.1 Identification Audio ................................... 35
    1.4.2 Découverte des Motifs Audio Récurrents ................. 38
    1.4.3 Segmentation et Regroupement en Locuteurs .............. 39
    1.4.4 Détection du Rire ...................................... 42
  1.5 Conclusions et Perspectives ................................ 45

2 General Introduction ........................................... 47
  2.1 Context and Motivation ..................................... 47
  2.2 Audio Indexing: Problematic ................................ 48
  2.3 Contributions .............................................. 48
  2.4 Thesis Structure ........................................... 50