Signals and Communication Technology
Mourad Abbas, Editor
Analysis and Application of Natural Language and Speech Processing
Signals and Communication Technology
Series Editors
Emre Celebi, Department of Computer Science, University of Central Arkansas,
Conway, AR, USA
Jingdong Chen, Northwestern Polytechnical University, Xi’an, China
E. S. Gopi, Department of Electronics and Communication Engineering, National
Institute of Technology, Tiruchirappalli, Tamil Nadu, India
Amy Neustein, Linguistic Technology Systems, Fort Lee, NJ, USA
H. Vincent Poor, Department of Electrical Engineering, Princeton University,
Princeton, NJ, USA
Antonio Liotta, University of Bolzano, Bolzano, Italy
Mario Di Mauro, University of Salerno, Salerno, Italy
This series is devoted to fundamentals and applications of modern methods of
signal processing and cutting-edge communication technologies. The main topics
are information and signal theory, acoustical signal processing, image processing
and multimedia systems, mobile and wireless communications, and computer and
communication networks. Volumes in the series address researchers in academia
and industrial R&D departments. The series is application-oriented. The level of
presentation of each individual volume, however, depends on the subject and can
range from practical to scientific.
Indexing: All books in “Signals and Communication Technology” are indexed
by Scopus and zbMATH.
For general information about this book series, comments or suggestions, please
contact Mary James at mary.james@springer.com or Ramesh Nath Premnath at
ramesh.premnath@springer.com.
Editor
Mourad Abbas
High Council of Arabic
Algiers, Algeria
ISSN 1860-4862 ISSN 1860-4870 (electronic)
Signals and Communication Technology
ISBN 978-3-031-11034-4 ISBN 978-3-031-11035-1 (eBook)
https://doi.org/10.1007/978-3-031-11035-1
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The increased access to powerful processors has made possible significant progress
in natural language processing (NLP). We find more research in NLP targeting a
diverse spectrum of major industries that use voice recognition, text-to-speech
(TTS) solutions, speech translation, natural language understanding (NLU), and
many other applications and techniques related to these areas.
This book presents the latest research related to natural language processing and
speech technology and sheds light on the main topics for readers interested in this
field. For TTS and automatic speech recognition, the book shows how to use
transfer learning to generate speech in other voices from a TTS system for a specific
language (Italian) and to improve speech recognition for non-native English.
Language resources are the cornerstone for building high-quality systems; however,
some languages, such as Arabic, are considered under-resourced compared to English.
Thus, a new Arabic linguistic pipeline for NLP is presented to enrich the Arabic
language resources and to solve common NLP issues, like word segmentation, POS
tagging, and lemmatization. Arabic named entity recognition, a challenging task,
is addressed in this book using a transformer-based CRF model.
In addition, the readers of this book will discover concepts and solutions for
other NLP issues such as language modeling, question answering, dialog systems,
and sentence embeddings.
Mourad Abbas
Contents
ITAcotron 2: The Power of Transfer Learning in Expressive TTS Synthesis
Anna Favaro, Licia Sbattella, Roberto Tedesco, and Vincenzo Scotti
Improving Automatic Speech Recognition for Non-native English with Transfer Learning and Language Model Decoding
Peter Sullivan, Toshiko Shibano, and Muhammad Abdul-Mageed
Kabyle ASR Phonological Error and Network Analysis
Christopher Haberland and Ni Lao
ALP: An Arabic Linguistic Pipeline
Abed Alhakim Freihat, Gábor Bella, Mourad Abbas, Hamdy Mubarak, and Fausto Giunchiglia
Arabic Anaphora Resolution System Using New Features: Pronominal and Verbal Cases
Abdelhalim Hafedh Dahou, Mohamed Abdelmoazz, and Mohamed Amine Cheragui
A Commonsense-Enhanced Document-Grounded Conversational Agent: A Case Study on Task-Based Dialogue
Carl Strathearn and Dimitra Gkatzia
BloomQDE: Leveraging Bloom’s Taxonomy for Question Difficulty Estimation
Sabine Ullrich, Amon Soares de Souza, Josua Köhler, and Michaela Geierhos
A Comparative Study on Language Models for Dravidian Languages
Rahul Raman, Danish Mohammed Ebadulla, Hridhay Kiran Shetty, and Mamatha H. R.
Arabic Named Entity Recognition with a CRF Model Based on Transformer Architecture
Muhammad Al-Qurishi, Riad Souissi, and Sarah Al-Qaseemi
Static Fuzzy Bag-of-Words: Exploring Static Universe Matrices for Sentence Embeddings
Matteo Muffo, Roberto Tedesco, Licia Sbattella, and Vincenzo Scotti
Index
ITAcotron 2: The Power of Transfer
Learning in Expressive TTS Synthesis
Anna Favaro, Licia Sbattella, Roberto Tedesco, and Vincenzo Scotti
Abstract A text-to-speech (TTS) synthesiser has to generate intelligible and natural
speech while modelling linguistic and paralinguistic components characterising
human voice. In this work, we present ITAcotron 2, an Italian TTS synthesiser able
to generate speech in several voices. In its development, we explored the power
of transfer learning by iteratively fine-tuning an English Tacotron 2 spectrogram
predictor on different Italian data sets. Moreover, we introduced a conditioning
strategy to enable ITAcotron 2 to generate new speech in the voice of a variety
of speakers. To do so, we examined the zero-shot behaviour of a speaker encoder
architecture, previously trained to accomplish a speaker verification task with
English speakers, to represent Italian speakers’ voiceprints. We asked 70 volunteers
to evaluate intelligibility, naturalness, and similarity between synthesised voices
and real speech from target speakers. Our model achieved a mean opinion score (MOS)
of 4.15 in intelligibility, 3.32 in naturalness, and 3.45 in speaker similarity. These
results showed the successful adaptation of the refined system to the new language and
its ability to synthesise novel speech in the voice of several speakers.
1 Introduction
The development of text-to-speech (TTS) synthesis systems is one of the oldest
problems in the natural language processing (NLP) area and has a wide variety of
applications [14]. Such systems are designed to output the waveform of a voice
uttering the input text string. In recent years, the introduction of approaches based
on deep learning (DL), and in particular the end-to-end ones [11, 20, 22, 25], led to
significant improvements.
A. Favaro (✉)
Center for Language and Speech Processing (CLSP), Johns Hopkins University, Baltimore, MD,
USA
e-mail: afavaro1@jhu.edu
L. Sbattella · R. Tedesco · V. Scotti
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano,
Milano, MI, Italy
e-mail: licia.sbattella@polimi.it; roberto.tedesco@polimi.it; vincenzo.scotti@polimi.it
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M. Abbas (ed.), Analysis and Application of Natural Language and Speech
Processing, Signals and Communication Technology,
https://doi.org/10.1007/978-3-031-11035-1_1
Most of the evaluations carried out on these models are performed on languages
with many available resources, like English. Therefore, it is hard to tell how good
these models are and whether they are general across languages. With this work,
we propose to study how these models behave with less-resourced languages,
leveraging the transfer learning approach.
In particular, we evaluated the effectiveness of transfer learning on a TTS
architecture, experimenting with the English and Italian languages. Thus, we started
from the English TTS Tacotron 2 and fine-tuned it on a collection of Italian
corpora. Then, we extended the resulting model with speaker conditioning; the
result was an Italian TTS we named ITAcotron 2.
ITAcotron 2 was evaluated, through human assessment, on intelligibility and
naturalness of the synthesised audio clips, as well as on speaker similarity between
target and different voices. In the end, we obtained reasonably good results, in line
with those of the original model.
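For illustration, a mean opinion score (MOS) of the kind reported for such human assessments is the arithmetic mean of the listeners' ratings. The following minimal Python sketch shows the computation together with an approximate 95% confidence interval. It assumes a 1 to 5 rating scale; the ratings and the function name `mean_opinion_score` are hypothetical and are not taken from this study.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the assumed 1-5 scale")
    mos = mean(ratings)
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from eight listeners for one synthesised clip.
ratings = [4, 5, 4, 3, 4, 5, 4, 3]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean makes it easier to judge whether differences between intelligibility, naturalness, and similarity scores exceed listener variability.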
The rest of this paper is divided into the following sections: In Sect. 2, we
introduce the problem. In Sect. 3, we present some available solutions. In Sect. 4, we
detail the aim of the paper and the experimental hypotheses we assumed. In Sect. 5,
we present the corpora employed to train and test our model. In Sect. 6, we explain
the structure of the synthesis pipeline we are proposing and how we adapted it to
Italian from English. In Sect. 7, we describe the experimental approach we followed
to assess the model quality. In Sect. 8, we comment on the results of our model.
Finally, in Sect. 9, we sum up our work and suggest possible future extensions.
2 Background
Every TTS synthesiser represents an original imitation of the human reading
capability, and, to be implemented, it has to cope with the technological and
imaginative constraints characterising the period of its creation.
In the mid-1980s, the concomitant developments in NLP and digital signal
processing (DSP) techniques broadened the applications of these systems. Their
first use was in screen reading systems for blind people, where a TTS was
in charge of reading user interfaces and textual contents (e.g. websites and books),
converting them into speech. Even though the early screen readers (e.g. JAWS¹)
sounded mechanical and robotic, they represented a valuable alternative for blind
people to the usual braille reading.
¹ https://www.freedomscientific.com/products/software/jaws
Since the quality of TTS systems has been progressively enhanced, their adoption
was later extended to other practical domains such as telecommunications services,