Political Speech Generation

Valentin Kassarnig
College of Information and Computer Sciences
University of Massachusetts Amherst
[email protected]

arXiv:1601.03313v2 [cs.CL] 20 Jan 2016

Abstract

In this report we present a system that can generate political speeches for a desired political party. Furthermore, the system allows the user to specify whether a speech should hold a supportive or an opposing opinion. The system relies on a combination of several state-of-the-art NLP methods which are discussed in this report. These include n-grams, the Justeson & Katz POS tag filter, recurrent neural networks, and latent Dirichlet allocation. Sequences of words are generated based on probabilities obtained from two underlying models: a language model takes care of the grammatical correctness while a topic model aims for textual consistency. Both models were trained on the Convote dataset, which contains transcripts from US congressional floor debates. Furthermore, we present a manual and an automated approach to evaluate the quality of generated speeches. In an experimental evaluation, generated speeches have shown very high quality in terms of grammatical correctness and sentence transitions.

1 Introduction

Many political speeches show the same structures and the same characteristics regardless of the actual topic. Some phrases and arguments appear again and again and indicate a certain political affiliation or opinion. We want to use these remarkable patterns to train a system that generates new speeches. Since there are major differences between the political parties, we want the system to consider the political affiliation and the opinion of the intended speaker. The goal is to generate speeches that no one can tell apart from hand-written ones.

In this report we first discuss related work dealing with similar or related methods. Then we describe and analyze the dataset we use. Next, we present the methods we used to implement our system. We also describe investigated methods that were not used in the final implementation. Then we describe the experiment we performed and how we evaluated the results.
Finally, we conclude our work and give an outlook. The appendix of this report contains the generated speeches from the experiment.

2 Related work

Creating models for a corpus that allow retrieving certain information is a major part of this project, as it is in the entire NLP domain. Blei et al. [1] present in their paper a model which is known as latent Dirichlet allocation (LDA). LDA has become one of the most popular topic models in the NLP domain. LDA is a generative probabilistic model that automatically discovers the underlying topics. Each document is modeled as a mixture of various topics. These topics can be understood as a collection of words that have different probabilities of appearance. The words with the highest probabilities represent the topics.

However, LDA is a bag-of-words model, which means that word order is not preserved. Hence, LDA does not capture collocations or multiword named entities. Lau et al. [2] claim that collocations empirically enhance topic models. In an experiment they replaced the top-ranked bigrams with single tokens, deleted the 200 most frequent terms from the vocabulary, and performed ordinary LDA. The results from experiments on four distinct datasets have shown that this bigram variant is very beneficial for LDA topic models.

Fürnkranz [3] has studied the usage of n-grams in the text-categorization domain. He has shown that using bi- and trigrams in addition to the set-of-words representation improves the classification performance significantly. Furthermore, he has shown that sequences longer than three words reduce the classification performance. That also indicates that collocations play a crucial role when it comes to inferring the latent structure of documents.

Cavnar and Trenkle [4] have also used an n-gram-based approach for text categorization. Their system is based on calculating and comparing profiles of n-gram frequencies. They compute for every category a representative profile from the training data. Then the system computes a profile for a particular document that is to be classified.
Finally, the system computes a distance measure between the document's profile and each of the category profiles and selects the category whose profile has the smallest distance.

Smadja [5] presents a tool, Xtract, which implements methods to extract variable-length collocations. The extraction process is done in several stages. In the first stage the system determines the top-ranked bigrams of the corpus. In the second stage Xtract examines the statistical distribution of words and part-of-speech tags around the bigrams from the previous stage. Compounds with a probability above a certain threshold are retained while the others are rejected. In the third stage they enrich the collocations with syntactical information obtained from Cass [6]. The syntactical information helps to evaluate the candidate collocations and to decide whether they should be rejected or not.

Wang et al. [7] propose a topical n-gram model that is capable of extracting meaningful phrases and topics. It combines the bigram topic model [8] and the LDA collocation model [9]. One of the key features of this model is to decide whether two consecutive words should be treated as a single token or not, depending on their nearby context. Compared to LDA, the extracted topics are semantically more meaningful. This model also shows very good results in information retrieval (IR) tasks.

Justeson and Katz [10] present a method to extract technical terms from documents. Their approach is not restricted to technical terms but applies to all multiword named entities of length two or three. The foundations of their method are bi- and trigrams which have a certain POS tag structure. That is, they extract all bi- and trigrams from the corpus, identify their POS tags, and check them against a predefined list of accepted POS tag patterns. In their experiment this method identifies 99% of the technical multiword terms in the test data.

Wacholder [11] presents an approach for identifying significant topics within a document. The proposed method is based on the identification of noun phrases (NPs) and consists of three steps. First, a list of candidate significant topics consisting of all simplex NPs is extracted from the document.
Next, these NPs are clustered by head. Finally, a significance measure is obtained by ranking the frequency of heads. NPs whose heads occur with greater frequency in the document are more significant than NPs whose heads occur less frequently.

Blei and Lafferty [12] propose their correlated topic model (CTM). While LDA assumes all latent topics are independent, the CTM aims to capture correlations between them. They argue that a document about genetics is more likely to also be about disease than about X-ray astronomy. The CTM builds on the LDA model but uses a hierarchical topic model of documents that replaces the Dirichlet distribution of per-document topic proportions with a logistic normal. According to their results, the model gives better predictive performance and uncovers interesting descriptive statistics.

Iyyer et al. [19] apply recursive neural networks (RNNs) to political ideology detection. The RNNs were initialized with word2vec embeddings. The word vector dimension was set to 300 to allow direct comparison with other experiments. However, they claim that smaller vector sizes (50, 100) do not significantly change accuracy. They performed experiments on two different datasets: the Convote dataset [25] and the Ideological Books Corpus (IBC) [21]. They claim that their model outperforms existing models on these two datasets.

                               Party
                        Republicans   Democrats
Opinion  Support (Yea)       RY           DY
         Opposition (Nay)    RN           DN

Table 1: Speech classes

There has been a lot of research in the field of Natural Language Generation (NLG). The paper Building Applied Natural Language Generation Systems [13] discusses the main requirements and tasks of NLG systems. Among others, they investigate a so-called corpus-based approach. That is, a collection of example inputs is mapped to output texts of the corpus. This is basically what we plan to do, because we already have all the speech segments labeled with the political party and the opinion. However, our generator will have a simpler architecture, but we will use the described list of tasks as a guideline.

Most NLG systems are designed to create a textual representation of some input data. That is, the input data determines the content.
For example, SumTime-Mousam [14] generates a textual weather forecast based on numerical weather simulations. Another example is the ModelExplainer system [15], which takes as input a specification of an object-oriented class model and produces as output a text describing the model. Other NLG systems are used as authoring aids, for example to help personnel officers write job descriptions [16] or to help technical authors produce instructions for using software [17].

An NLG system that follows a different approach is SciGen [22]. SciGen is an automatic computer science research paper generator developed by three MIT students. It creates random papers which actually show a very high quality in terms of structuring and lexicalization, and they even include graphs, figures, and citations. SciGen became quite famous after some of its generated papers got accepted at conferences and published in journals. In particular, their paper Rooter: A Methodology for the Typical Unification of Access Points and Redundancy raised a lot of attention because it was accepted to the 2005 World Multiconference on Systemics, Cybernetics and Informatics (WMSCI) and the authors were even invited to speak at the conference. SciGen requires as input only the names of the authors; all the content is generated randomly. Our generator will follow the same approach since we also do not specify the content of the generated speech. The content is determined by the training data and requires no further specification.

3 Data set

The main data source for this project is the Convote dataset [25]. It contains a total of 3857 speech segments from 53 US congressional floor debates from the year 2005. Each speech segment can be referred to its debate, its speaker, the speaker's party, and the speaker's vote, which serves as the ground-truth label for the speech. The dataset was originally created in the course of the project Get Out The Vote [18]. The authors used the dataset to train a classifier in order to determine whether a speech represents support of or opposition to proposed legislation. They did not only analyze the speeches individually but also investigated agreements and disagreements with the opinions of other speakers.
That is, they identified references in the speech segments, determined the targets of those references, and decided whether a reference represents an instance of agreement or disagreement. However, we focus only on the individual speech segments and disregard references.

For our work we have removed single-sentence speeches and HTML tags, and corrected punctuation marks. In order to enable simple sentence splitting we replaced all sentence delimiters by a stop token. Furthermore, we inserted special tokens which indicate the start and the end of a speech. Then we divided all the speeches into the four classes given by the combination of possible political parties and speech opinions. Table 1 shows the four speech classes, and Table 2 gives a quantitative overview of the corpus content. It can be seen that the classes RY and DN contain the majority of the speeches.

         #Speeches   #Sentences   Avg. speech length   Avg. sentence length
RY       1259        20697        16.4 sentences       23.0 words
RN       151         3552         23.5 sentences       23.5 words
DY       222         4342         19.6 sentences       23.6 words
DN       1139        22280       19.6 sentences       22.7 words
Total    2771        50871        18.4 sentences       23.0 words

Table 2: Corpus overview

4 Method

4.1 Language Model

We use a simple statistical language model based on n-grams. In particular, we use 6-grams. That is, for each sequence of six consecutive words we calculate the probability of seeing the sixth word given the previous five. That allows us to determine very quickly all words which can occur after the previous five words and how likely each of them is.

4.2 Topic Model

For our topic model we use the Justeson and Katz (J&K) POS tag filter for two- and three-word terms [10]. As suggested by WordHoard [23], we expanded the list of POS tag patterns by the sequence Noun-Conjunction-Noun. We determined the POS tags for each sentence in the corpus and then identified all two- and three-word terms that match one of the patterns. For the POS tagging we used the maxent treebank POS-tagging model from the Natural Language Toolkit (NLTK) for Python. It uses a maximum entropy model and was trained on the Wall Street Journal subset of the Penn Treebank corpus [24].
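As an illustration, the filtering step of our topic model can be sketched as follows. The pattern set below is our own abbreviated rendering of a subset of the Justeson & Katz patterns plus the Noun-Conjunction-Noun extension mentioned above; all function and variable names are ours, and in the full pipeline the tagged input would come from NLTK's tagger.

```python
# Simplified POS-tag classes (Penn Treebank tags, as produced by NLTK's tagger).
NOUN = {"NN", "NNS", "NNP", "NNPS"}
ADJ = {"JJ", "JJR", "JJS"}
CONJ = {"CC"}

# Illustrative subset of the accepted patterns (A = adjective, N = noun),
# including the Noun-Conjunction-Noun sequence (C).
ACCEPTED_PATTERNS = {
    ("A", "N"), ("N", "N"),
    ("A", "A", "N"), ("A", "N", "N"), ("N", "A", "N"), ("N", "N", "N"),
    ("N", "C", "N"),
}

def tag_class(tag):
    if tag in NOUN:
        return "N"
    if tag in ADJ:
        return "A"
    if tag in CONJ:
        return "C"
    return "O"  # any other part of speech

def candidate_terms(tagged_sentence):
    """Extract all two- and three-word terms whose POS pattern is accepted.

    `tagged_sentence` is a list of (word, POS tag) pairs; in practice it
    would be produced by nltk.pos_tag(nltk.word_tokenize(sentence)).
    """
    words = [w.lower() for w, _ in tagged_sentence]
    classes = [tag_class(t) for _, t in tagged_sentence]
    terms = []
    for n in (2, 3):
        for i in range(len(words) - n + 1):
            if tuple(classes[i:i + n]) in ACCEPTED_PATTERNS:
                terms.append(" ".join(words[i:i + n]))
    return terms
```

Running the extractor over every sentence of a class yields the raw term counts that the significance scoring below operates on.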
Some of the terms are very generic and appear very often in all classes. In order to find those terms that appear particularly often in a certain class, we calculate a significance score. Our significance score Z is defined by the ratio of the probability of seeing a word w in a certain class c to the probability of seeing the word in the entire corpus:

    Z(w, c) = P(w | c) / P(w)

This significance score gives information about how often a term occurs in a certain class compared to the entire corpus. That is, every score greater than 1.0 indicates that in the given class a certain term occurs more often than average. We consider all phrases which occur at least 20 times in the corpus and have a ratio greater than 1. These terms represent the topics of the corpus. Table 3 lists the top ten topics of each class, ordered by their score. All these terms represent meaningful topics, and it seems reasonable that there were debates about them.

4.3 Speech Generation

For the speech generation one has to specify the desired class, which consists of the political party and the intended vote. Based on the selected class, the corresponding models for the generation are picked. From the language model of the selected class we obtain the probabilities of each 5-gram that starts a speech. From that distribution we pick one of the 5-grams at random and use it as the beginning of our opening sentence. Then the system starts to predict word after word until it predicts the token that indicates the end of the speech. In order to predict the next word we first determine what topics the speech generated so far is about. This is done by checking for every topic term whether it appears in the speech. For every occurring term we calculate the topic coverage TC in our speech. The topic coverage is an indicator of how well a certain topic t is represented in a speech S.
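A minimal sketch of this scoring, assuming term frequencies per class have already been counted (all names here are our own):

```python
from collections import Counter

def significance_scores(term_occurrences, min_count=20):
    """Compute Z(w, c) = P(w | c) / P(w).

    `term_occurrences` maps each class label to a Counter of term frequencies.
    Returns {class: {term: Z}}, keeping only terms that occur at least
    `min_count` times in the whole corpus and have a score greater than 1.
    """
    corpus = Counter()
    for counts in term_occurrences.values():
        corpus.update(counts)
    corpus_total = sum(corpus.values())

    scores = {}
    for c, counts in term_occurrences.items():
        class_total = sum(counts.values())
        scores[c] = {}
        for w, n in counts.items():
            if corpus[w] < min_count:
                continue  # too rare in the corpus to be considered
            z = (n / class_total) / (corpus[w] / corpus_total)
            if z > 1.0:
                scores[c][w] = z
    return scores
```

The surviving terms per class are exactly the class topics that the generation step ranks by coverage.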
The following equation shows the definition of the topic coverage:

    TC(S, t, c) = (#occurrences of t in S) / (#occurrences of t in all speeches of class c)

    RY                          RN                        DY                    DN
    cell research enhancement   head start program        inner cell            school of law
    research enhancement act    public law                inner cell mass       cbc alternative
    death tax                   human embryo              spinal cord           cbc budget
    budget request              human life                vitro fertilization   professor of law
    community protection act    adult stem cell           clean coal            republican budget
    community protection        world trade organization  stem cell             gun industry
    gang deterrence             adult stem                air national guard    big oil
    human embryonic stem        federal jurisdiction      world trade           judicial conference
    committee on homeland       democratic alternative    sickle cell           stem cell research
    deterrence and community    associate professor       heart disease         middle class
    Total: 205 topics           Total: 84 topics          Total: 98 topics      Total: 182 topics

Table 3: Top topics per class

We rank all topics by their topic coverage values and pick the top 3 terms as our current topic set T. For these 3 terms we normalize the values of the ratios so that they sum up to 1. This gives us the probability P(t | S, c) of seeing a topic t in our current speech S of class c.

The next step is to find our candidate words. All words which have been seen in the training data following the previous 5-gram are our candidates. For each candidate we calculate the probability of the language model, P_language, and the probability of the topic model, P_topic.

P_language tells how likely this word is to occur after the previous five ones. This value can be directly obtained from the language model of the specified class. P_topic tells how likely the word w is to occur in a speech which covers the current topics T. The following equation shows the definition of P_topic, where X denotes our dataset and X(c) is the subset containing only speeches of class c:

    P_topic(w | T, c) = [ Σ_{S' ∈ X(c)} (#occurrences of w in S') · Σ_{t ∈ T} P(t | S', c) ]
                        / [ Σ_{S' ∈ X(c)} (#words in S') · Σ_{t ∈ T} P(t | S', c) + ε ]

The factor ε prevents a division by zero and is set to a very small value (ε = 0.001). The probabilities of all candidate words are normalized so that they sum up to 1.
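To make the definitions above concrete, here is a small sketch of the topic-coverage ranking and of P_topic. The data structures (plain lists of extracted topic terms and words per speech) and all names are our own assumptions:

```python
def topic_coverage(speech_terms, term, class_term_counts):
    """TC(S, t, c): occurrences of t in speech S, divided by the occurrences
    of t in all speeches of class c."""
    return speech_terms.count(term) / class_term_counts[term]

def topic_distribution(speech_terms, class_term_counts, k=3):
    """Rank all topics by coverage, keep the top k, and normalize their
    values so that they sum to 1, yielding P(t | S, c)."""
    topics = {t for t in speech_terms if t in class_term_counts}
    coverage = {t: topic_coverage(speech_terms, t, class_term_counts)
                for t in topics}
    top = sorted(coverage, key=coverage.get, reverse=True)[:k]
    total = sum(coverage[t] for t in top)
    return {t: coverage[t] / total for t in top}

def p_topic(word, topics, class_speeches, eps=0.001):
    """P_topic(w | T, c). `class_speeches` holds, for every speech S' of
    class c, a pair (word list of S', dict of P(t | S', c)); eps prevents
    a division by zero."""
    weight = lambda p: sum(p.get(t, 0.0) for t in topics)
    num = sum(s.count(word) * weight(p) for s, p in class_speeches)
    den = sum(len(s) * weight(p) for s, p in class_speeches) + eps
    return num / den
```

The resulting P_topic values are then interpolated with the language-model probabilities as described next.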
With the probabilities from the language model and the topic model we can now calculate the probability of predicting a certain word. This is done by combining those two probabilities. The weighting factor λ balances the impact of the two probabilities. Furthermore, we want to make sure that a phrase is not repeated again and again. Thus, we check how often the phrase consisting of the previous five words and the current candidate word has already occurred in the generated speech, and divide the combined probability by this value squared plus 1. So if this phrase has not been generated yet, the denominator of this fraction is 1 and the original probability remains unchanged. The following equation shows how to calculate for a word w the probability of being predicted as the next word of the incomplete speech S:

    P_word(w | S) = [ λ · P_language + (1 − λ) · P_topic ] / [ 1 + (#occurrences of (S[−5:] + w) in S)² ]

From the distribution given by the normalized probabilities of all candidate words we then pick one of the words at random. Then the whole procedure starts again with assessing the current topics. This is repeated until the end-of-speech token is generated or a certain word limit is reached.

Instead of using the probability distribution of the candidates we could have also just picked the word with the highest probability. But then the method would be deterministic. Using the distribution to pick a word at random enables the generator to produce a different speech every time.

    Good examples   – iraq war us presid support vote administr congress
                    – job make work compani busi right american good
                    – economi job see need percent continu import now
                    – program fund educ cut provid health help million
                    – cut republican billion will pay percent benefit cost

    Bad examples    – issu countri address us today talk need can
                    – go peopl get know want happen can say
                    – go say talk want peopl get just said
                    – member committe hous chang process standard vote
                    – bill can mean use vote just first take

Table 4: Results from LDA

5 Alternative Methods

In this section we present some alternative approaches which were pursued in the course of this project. These methods have not shown sufficiently good results and were therefore not pursued further.
5.1 Recurrent Neural Networks

Instead of using n-grams, we also considered using recurrent neural networks (RNNs) as language models. Our approach was heavily based on the online tutorial by Denny Britz [26]. The RNN takes as input a sequence of words and outputs the next word. We limited the vocabulary to the 6000 most frequent words. Words were represented by one-hot encoded feature vectors. The RNN had a hidden layer of size 50 and used tanh as activation function. For assessing the error we used the cross-entropy loss function. Furthermore, we used stochastic gradient descent (SGD) to minimize the loss and backpropagation through time (BPTT) to calculate the gradients.

After training the network for 100 epochs (∼ 14 h) the results were still pretty bad. Most of the generated sentences were grammatically incorrect. There are many options to improve the performance of RNNs, but due to the good performance shown by n-grams, the time-consuming training, and the limited time for this project, we decided not to pursue this approach further.

5.2 Latent Dirichlet Allocation

As an alternative to the J&K POS tag filter we used LDA as topic model. In particular, we used the approach of Lau et al. [2]. That is, we removed all occurrences of stop words, stemmed the remaining words, replaced the 1000 most frequent bigrams with single tokens, and deleted the 200 most frequent terms from the vocabulary before applying ordinary LDA. Since our dataset contains speech segments from 53 different debates, we set the number of underlying topics to 53. Some of the results represented quite meaningful topics. However, the majority did not reveal any useful information. Table 4 shows some examples of good and bad results from LDA. It can be seen that the extracted terms of the bad examples are very generic and do not necessarily indicate a meaningful topic.

5.3 Sentence-based Approach

For the speech generation task we also pursued a sentence-based approach at the beginning of this project.
The idea of the sentence-based approach is to take whole sentences from the training data and concatenate them in a meaningful way. We start by picking a speech of the desired class at random and take its first sentence. This will be the start sentence of our speech. Then we pick 20 speeches at random from the same class. We compare our first sentence with each sentence in those 20 speeches by calculating a similarity measure. The next sentence is then determined by the successor of the sentence with the highest similarity. In case no sentence shows sufficient similarity (similarity score below a threshold), we just take the successor of our last sentence. In the next step we pick again 20 speeches at random and compare each sentence with the last one in order to find the most similar sentence. This is repeated until we come across the speech-termination token or the generated speech reaches a certain length.

The crucial part of this method is the measure of similarity between two sentences. Our similarity is composed of structural and textual similarity. Both are normalized to a range between 0 and 1 and weighted through a factor λ. We compute the similarity between two sentences A and B as follows:

    Sim(A, B) = λ · Sim_struct(A, B) + (1 − λ) · Sim_text(A, B)

For the structural similarity we compare the POS tags of both sentences and determine the longest sequence of congruent POS tags. The length of this sequence, normalized by the length of the shorter sentence, gives us the structural similarity. The structural similarity measure aims to support smooth sentence transitions. That is, if we find sentences which have a very similar sentence structure, it is very likely that they connect well to either of their following sentences. The textual similarity is defined by the number of trigrams that occur in both sentences, normalized by the length of the longer sentence. This similarity aims to find sentences which use the same words.

The obvious advantage of the sentence-based approach is that every sentence is grammatically correct since they originate directly from the training data. However, connecting sentences reasonably is a very challenging task.
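The two similarity components can be sketched as follows. We read "longest sequence of congruent POS tags" as the longest common contiguous tag sequence; that reading, and all names below, are our own assumptions.

```python
def pos_similarity(tags_a, tags_b):
    """Structural similarity: length of the longest common contiguous
    sequence of POS tags, normalized by the length of the shorter sentence."""
    if not tags_a or not tags_b:
        return 0.0
    best = 0
    for i in range(len(tags_a)):
        for j in range(len(tags_b)):
            k = 0
            while (i + k < len(tags_a) and j + k < len(tags_b)
                   and tags_a[i + k] == tags_b[j + k]):
                k += 1
            best = max(best, k)
    return best / min(len(tags_a), len(tags_b))

def trigram_similarity(words_a, words_b):
    """Textual similarity: number of shared word trigrams, normalized by
    the length of the longer sentence."""
    tri = lambda w: {tuple(w[i:i + 3]) for i in range(len(w) - 2)}
    if not words_a or not words_b:
        return 0.0
    return len(tri(words_a) & tri(words_b)) / max(len(words_a), len(words_b))

def sentence_similarity(sent_a, sent_b, lam=0.5):
    """Sim(A, B) = lambda * Sim_struct + (1 - lambda) * Sim_text.
    Each sentence is a (word list, POS-tag list) pair."""
    words_a, tags_a = sent_a
    words_b, tags_b = sent_b
    return (lam * pos_similarity(tags_a, tags_b)
            + (1 - lam) * trigram_similarity(words_a, words_b))
```

Both components lie in [0, 1] by construction, so the weighted combination stays in that range as well.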
A further step to improve this approach would be to extend the similarity measure by a topical similarity and a semantic similarity. The topical similarity should measure the topical correspondence of the originating speeches, while the semantic similarity should help to find sentences which express the same meaning although using different words. However, the results from the word-based approach were more promising, and therefore we decided to discard the sentence-based approach.

6 Experiments

This section describes the experimental setup we used to evaluate our system. Furthermore, we present two different approaches to evaluating the quality of generated speeches.

6.1 Setup

In order to test our implemented methods we performed an experimental evaluation. In this experiment we generated ten speeches, five for class DN and five for class RY. We set the weighting factor λ to 0.5, which means the topic and the language model both have equal impact on predicting the next word. The quality of the generated speeches was then evaluated. We used two different evaluation methods: a manual evaluation and an automatic evaluation. Both methods are described in more detail in the following paragraphs of this section. The generated speeches can be found in the appendix of this report.

6.2 Manual Evaluation

For the manual evaluation we have defined a list of evaluation criteria. That is, a generated speech is evaluated by assessing each of the criteria and assigning a score between 0 and 3 to it. Table 5 lists all evaluation criteria and describes the meaning of the different scores.

6.3 Automatic Evaluation

The automatic evaluation aims to evaluate both the grammatical correctness and the consistency of the speech in terms of its content. For evaluating the grammatical correctness we identify for each sentence of the speech its POS tags. Then we check for all sentences of the entire corpus whether one has the same sequence of POS tags. Having a sentence with the same POS tag structure does not necessarily mean that the grammar is correct. Neither does the lack of finding a matching sentence imply the existence of an error. But it points in a certain direction.
Furthermore, we let the system output the sentences for which it could not find a matching sentence so that we can evaluate those sentences manually.

In order to evaluate the content of the generated speech we determine the mixture of topics covered by the speech and order them by their topic coverage. That gives us information about the primary topic and the secondary topics. Then we do the same for each speech in our dataset which is of the same class and compare the topic order with the one of the generated speech. We sum up the topic coverage values of each topic that occurs in both speeches at the same position. The highest achieved value is used as evaluation score. That is, finding a speech which covers the same topics with the same order of significance gives us a score of 1.

Grammatical correctness: Are the sentences grammatically correct?
    0 ... The majority of the sentences are grammatically incorrect
    1 ... More than 50% of the sentences contain mistakes
    2 ... Some sentences contain minor mistakes
    3 ... The majority of the sentences are grammatically correct

Sentence transitions: How well do consecutive sentences connect? Reasonable references to previous sentences?
    0 ... Bad / no transitions, incorrect use of references
    1 ... A few meaningful transitions, partly incorrect use of references
    2 ... Most transitions and references are meaningful
    3 ... Very good transitions. Contains meaningful references to previous sentences.

Speech structure: Reasonable start / end of speech? Reasonable structuring of arguments (claim, warrant/evidence, conclusion)? Good flow?
    0 ... No clear structure. Random sentences.
    1 ... Contains some kind of structure. Arguments are very unclear.
    2 ... Clear structure. Most of the arguments are meaningfully arranged.
    3 ... Very good structure. Contains a meaningful start and end of speech. Clear breakdown of arguments.

Speech content: Same topic in the entire speech? Reasonable arguments?
    0 ... No clear topic. Random arguments.
    1 ... Arguments cannot be assigned to a single topic.
    2 ... The majority of the speech deals with one topic. Most arguments are meaningful.
    3 ... Speech covers one major topic and contains meaningful arguments.

Table 5: Evaluation criteria
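A sketch of this content comparison; we assume the per-speech coverage values are the normalized ones described in Section 4.3, so a perfect match over all positions sums to 1. All names are our own:

```python
def content_score(generated_topics, dataset_topic_lists):
    """Automatic content score for a generated speech.

    `generated_topics` holds the topics of the generated speech together
    with their (normalized) coverage values, ordered by decreasing coverage,
    e.g. [("stem cell", 0.5), ("death tax", 0.3), ...].
    `dataset_topic_lists` holds, for every speech of the same class, its
    ordered topic list. For each dataset speech we sum the coverage values
    of topics that appear at the same position in both orderings; the best
    match over the whole class is the evaluation score.
    """
    best = 0.0
    for topics in dataset_topic_lists:
        score = sum(cov
                    for pos, (topic, cov) in enumerate(generated_topics)
                    if pos < len(topics) and topics[pos] == topic)
        best = max(best, score)
    return best
```

A dataset speech with exactly the same topics in the same order of significance thus yields the maximum score of 1.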
7 Results

In this section we present the results from our experiments. Table 6 shows the results from the manual evaluation. Note that each criterion is scored between 0 and 3, which leads to a maximum total score of 12. The achieved total scores range from 5 to 10, with an average of 8.1. In particular, the grammatical correctness and the sentence transitions were very good. Each of them scored on average 2.3 out of 3. The speech content yielded the lowest scores. This indicates that the topic model may need some improvement.

Table 7 shows the results from the automatic evaluation. The automatic evaluation confirms pretty much the results of the manual evaluation. Most of the speeches which achieved a high score in the manual evaluation scored also high in the automatic evaluation. Furthermore, it also confirms that overall the grammatical correctness of the speeches is very good while the content lags a bit behind.

          Grammatical   Sentence      Speech      Speech    Total
          correctness   transitions   structure   content
DN#1      2             2             3           2         9
DN#2      3             3             3           1         10
DN#3      2             2             1           1         6
DN#4      3             3             2           1         9
DN#5      3             3             2           1         9
RY#1      2             1             1           1         5
RY#2      2             2             1           1         6
RY#3      2             3             2           2         9
RY#4      2             2             3           3         10
RY#5      2             2             2           2         8
Average   2.3           2.3           2           1.5       8.1

Table 6: Results from manual evaluation

          Grammatical   Speech     Mean
          correctness   content
DN#1      0.65          0.49       0.57
DN#2      0.5           0.58       0.54
DN#3      0.86          0.25       0.56
DN#4      0.61          0.34       0.48
DN#5      0.70          0.25       0.48
RY#1      0.52          0.28       0.4
RY#2      0.65          0.68       0.67
RY#3      0.25          0.95       0.6
RY#4      0.63          0.8        0.72
RY#5      0.5           0.21       0.36
Average   0.59          0.48       0.54

Table 7: Results from automatic evaluation

8 Conclusion

In this report we have presented a novel approach of training a system on speech transcripts in order to generate new speeches. We have shown that n-grams and the J&K POS tag filter are very effective as language and topic models for this task. We have shown how to combine these models into a system that produces good results. Furthermore, we have presented different methods to evaluate the quality of generated texts. In an experimental evaluation our system performed very well. In particular, the grammatical correctness and the sentence transitions of most speeches were very good.
However, there are no comparable systems which would allow a direct comparison.

Despite the good results, it is very unlikely that these methods will actually be used to generate speeches for politicians. However, the approach applies to the generation of all kinds of texts given a suitable dataset. With some modifications it would be possible to use the system to summarize texts about the same topic from different sources, for example when several newspapers report about the same event. Terms that occur in the report of every newspaper would get a high probability of being generated.

All of our source code is available on GitHub [27]. We explicitly encourage others to try using, modifying and extending it. Feedback and ideas for improvement are most welcome.

References

[1] Blei, D., Ng, A., Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022.

[2] Lau, J. H., Baldwin, T., Newman, D. (2013). On collocations and topic models. ACM Transactions on Speech and Language Processing (TSLP), 10(3), 10.

[3] Fürnkranz, J. (1998). A study using n-gram features for text categorization. Austrian Research Institute for Artificial Intelligence, 3(1998), 1-10.

[4] Cavnar, W. B., Trenkle, J. M. (1994). N-gram-based text categorization. Ann Arbor MI, 48113(2), 161-175.

[5] Smadja, F. A. (1991, June). From n-grams to collocations: An evaluation of Xtract. In Proceedings of the 29th Annual Meeting on Association for Computational Linguistics (pp. 279-284). Association for Computational Linguistics.

[6] Abney, S. (1990, October). Rapid incremental parsing with repair. In Proceedings of the 6th New OED Conference: Electronic Text Research (pp. 1-9).

[7] Wang, X., McCallum, A., Wei, X. (2007, October). Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on (pp. 697-702). IEEE.

[8] Wallach, H. M. (2006, June). Topic modeling: beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning (pp. 977-984). ACM.

[9] Griffiths, T. L., Steyvers, M., Tenenbaum, J. B. (2007).
Topics in semantic representation. Psychological Review, 114(2), 211.

[10] Justeson, J. S., Katz, S. M. (1995). Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1(1), 9-27.

[11] Wacholder, N. (1998). Simplex NPs clustered by head: a method for identifying significant topics within a document. In The Computational Treatment of Nominals: Proceedings of the Workshop (pp. 70-79).

[12] Blei, D. M., Lafferty, J. D. (2007). A correlated topic model of science. The Annals of Applied Statistics, 17-35.

[13] Reiter, E., Dale, R. (1997). Building applied natural language generation systems. Natural Language Engineering, 3(1), 57-87.

[14] Reiter, E., Sripada, S., Hunter, J., Yu, J., Davy, I. (2005). Choosing words in computer-generated weather forecasts. Artificial Intelligence, 167(1), 137-169.

[15] Lavoie, B., Rambow, O., Reiter, E. (1996, June). The ModelExplainer. In Proceedings of the 8th International Workshop on Natural Language Generation (pp. 9-12).

[16] Caldwell, D. E., Korelsky, T. (1994, October). Bilingual generation of job descriptions from quasi-conceptual forms. In Proceedings of the Fourth Conference on Applied Natural Language Processing (pp. 1-6). Association for Computational Linguistics.