Henning Wachsmuth

Text Analysis Pipelines
Towards Ad-hoc Large-Scale Text Mining

Lecture Notes in Computer Science 9383
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, Lancaster, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Zürich, Switzerland
John C. Mitchell, Stanford University, Stanford, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany

More information about this series at http://www.springer.com/series/7407

Author
Henning Wachsmuth
Bauhaus-Universität Weimar
Weimar, Germany

This monograph constitutes a revised version of the author’s doctoral dissertation, which was submitted to the University of Paderborn, Faculty of Electrical Engineering, Computer Science and Mathematics, Department of Computer Science, Warburger Straße 100, 33098 Paderborn, Germany, under the original title “Pipelines for Ad-hoc Large-Scale Text Mining”, and which was accepted in February 2015.

Cover illustration: The image on the front cover was created by Henning Wachsmuth in 2015. It illustrates the stepwise mining of structured information from large amounts of unstructured text with a sequential pipeline of text analysis algorithms.

ISSN 0302-9743, ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-319-25740-2, ISBN 978-3-319-25741-9 (eBook)
DOI 10.1007/978-3-319-25741-9
Library of Congress Control Number: 2015951772
LNCS Sublibrary: SL1 – Theoretical Computer Science and General Issues
Springer Cham Heidelberg New York Dordrecht London
© Springer International Publishing Switzerland 2015

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com)

To Max, my son.

Foreword

The last few years have given rise to a new and lasting technology hype that is receiving much attention, not only in academia and industry but also in the news and politics: big data. Hardly anything in the computer science world has been as controversial as the ubiquitous storage and analysis of data. While activists keep on warning that big data will take away our privacy and freedom, industry celebrates it as the holy grail that will enhance everything from decisions and processes to products. The truth lies somewhere in the middle.

Extensive data is collected nowadays about where we are, what we do, and how we think. Given that, dangers like not getting a job or paying higher health insurance fees just because of one’s private behavior seem real and need to be tackled. In this respect, big data indeed reduces freedom in that we are forced to refrain from doing things not accepted by public opinion. On the upside, big data has the potential to improve our lives and society in many respects, including health care, environment protection, problem solving, decision making, and so forth. It will bring unprecedented insights into diseases and it will greatly increase the energy efficiency of infrastructure. It will provide immediate information access to each of us and it will let influencers better understand what people really need.
Striving for such goals definitely makes working on big data worthwhile and honorable.

Mature technologies exist for storing and analyzing structured data, from databases to distributed computing clusters. However, most data out there (on the web, in business clouds, on personal computers) is in fact unstructured, given in the form of images, videos, music, or, above all, natural language text. Today, the most evolved technologies to deal with such text are search engines. Search engines excel in finding texts with the information we need in real time, but they do not understand what information is actually relevant in a text. Here, text mining comes into play.

Text mining creates structured data from information found in unstructured text. For this purpose, analysis algorithms are executed that aim to understand natural language to some extent. Natural language is complex and full of ambiguities. Even the best algorithms therefore create incorrect data from time to time, especially when a processed text differs from expectation. Usually, a whole series of algorithms is assembled in a sequential pipeline. Although text mining targets large-scale data, such pipelines still tend to be too inefficient to cope with the scales encountered today in reasonable time. Moreover, the assembly of algorithms in a pipeline depends on the information to be found, which is often only known ad-hoc.

Henning Wachsmuth addresses these problems of pipelines for ad-hoc large-scale text mining in his dissertation, on which the book at hand is based. He systematically follows engineering principles to design pipelines automatically and to execute them optimally. A focus is put on the optimization of the run-time efficiency of pipelines, both in theory and in different practically relevant scenarios. In addition, Henning pays special attention to the fundamental question of how to make the processing of natural language robust to text of any kind.

Owing to its interdisciplinarity, the work of Henning Wachsmuth was supervised cooperatively by my colleague Benno Stein from the Bauhaus-Universität Weimar and me. In particular, Henning leverages artificial intelligence (AI) to solve software engineering tasks. The design and execution of efficient pipelines benefit from classic AI techniques that represent and reason about expert knowledge and environment information. For robustness, Henning uses modern AI techniques, summarized under the term “machine learning.” Machine learning solves tasks automatically based on statistical patterns found in data. It is of utmost importance for text mining and the analysis of big data as a whole. According to current trends at leading companies like Google or Yahoo, its combination with classic AI techniques and engineering methods becomes more and more necessary for facing the challenges of search engines and many other software technologies.

In my work as a professor at the University of Paderborn, I experience the growing relevance of big data for our industrial partners. Still, major challenges remain, especially when it comes to the analysis of text. This book is one of the first that brings together cutting-edge research from natural language processing with the needs of real-world applications. As such, the results presented here are of great practical importance. They have been evaluated in close collaboration with companies in technology transfer projects at our s-lab – Software Quality Lab. At the same time, Henning developed novel text analysis approaches of great scientific value. All his main findings have been published in proceedings of renowned international conferences. At one of these conferences, Henning Wachsmuth received the “Best Presentation Award,” exemplifying that his work is acknowledged by the community and that he knows how to make it understandable for a large audience.

Overall, Henning shows with his dissertation that his research achieves a high level of excellence. He deeply investigates the problem of performing text mining ad-hoc in the large. Building upon the state of the art, he provides both theoretical solutions and practical approaches for each major facet of this problem. All approaches are implemented as open-source software applications. Some properties of the approaches are proven formally, others are evaluated in extensive experiments. In doing so, Henning Wachsmuth demonstrates his broad knowledge in computer science and his expertise in the area of natural language processing. The book at hand proves that he can make original advances in this and other areas.

August 2015
Gregor Engels

Preface

People search on the web to find relevant information on topics they wish to know more about. Accordingly, companies analyze big data to discover new information that is relevant for their business. Today’s search engines and big data analytics seek to fulfill such information needs ad-hoc, i.e., immediately in response to a search query or similar. Often, the relevant information is hidden in large numbers of natural language texts from web pages and other documents. Instead of returning potentially relevant texts only, leading search and analytics applications have recently started to return relevant information directly. To obtain the information sought from the texts, they perform text mining.

Text mining deals with tasks that target the inference of structured information from collections and streams of unstructured input texts. It covers all techniques needed to identify relevant texts, to extract relevant spans from these texts, and to convert the spans into high-quality information that can be stored in databases and analyzed statistically. Text mining requires task-specific text analysis processes that may consist of several interdependent steps. Usually, these processes are realized with text analysis pipelines.
A text analysis pipeline employs a sequence of natural language processing algorithms where each algorithm infers specific types of information from the input texts. Although effective algorithms exist for various types, the use of text analysis pipelines is still restricted to a few predefined information needs. We argue that this is due to three problems:

First, text analysis pipelines are mostly constructed manually for the tasks to be addressed, because their design requires expert knowledge about the algorithms to be employed. When information needs have to be fulfilled that are unknown beforehand, text mining hence cannot be performed ad-hoc.

Second, text analysis pipelines tend to be inefficient in terms of run-time, because their execution often includes analyzing texts with computationally expensive algorithms. When information needs have to be fulfilled ad-hoc, text mining hence cannot be performed in the large.

And third, text analysis pipelines tend not to robustly achieve high effectiveness on all input texts (in terms of the correctness of the inferred information), because they often include algorithms that rely on domain-dependent features of texts. Generally, text mining hence cannot guarantee to infer high-quality information at present.

In this book, we tackle the outlined problems by investigating how to fulfill information needs from text mining ad-hoc in a run-time efficient and domain-robust manner. Text mining is studied within the broad field of computational linguistics, bringing together research from natural language processing, information retrieval, and data mining.
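
The notion of a text analysis pipeline described above can be made concrete with a small sketch. It is only an illustration under simplifying assumptions, not the implementation provided with this book: the names `split_sentences`, `find_money`, and `run_pipeline` are hypothetical, and the two regular expressions are crude stand-ins for real natural language processing algorithms. What the sketch shows is the structure: a pipeline is an ordered list of algorithms, each of which reads the annotations produced so far and adds one further type of information.

```python
import re
from typing import Any, Callable, Dict, List

# A document holds the input text plus all annotations inferred so far.
Document = Dict[str, Any]
# An algorithm reads the document and adds one annotation type to it.
Algorithm = Callable[[Document], None]


def split_sentences(doc: Document) -> None:
    """Hypothetical sentence splitter: breaks after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", doc["text"])
    doc["sentences"] = [p.strip() for p in parts if p.strip()]


def find_money(doc: Document) -> None:
    """Hypothetical entity recognizer: dollar amounts per sentence.

    Depends on the 'sentences' annotation, so it must run after
    split_sentences. This is the kind of interdependency that
    constrains the order of algorithms in a pipeline.
    """
    doc["money"] = [
        (index, match)
        for index, sentence in enumerate(doc["sentences"])
        for match in re.findall(r"\$\d+(?:\.\d+)?", sentence)
    ]


def run_pipeline(text: str, pipeline: List[Algorithm]) -> Document:
    """Apply the algorithms to the text in the given order."""
    doc: Document = {"text": text}
    for algorithm in pipeline:
        algorithm(doc)
    return doc


doc = run_pipeline(
    "Acme earned $5 million. It rained.",
    [split_sentences, find_money],
)
print(doc["sentences"])
print(doc["money"])
```

Swapping the two algorithms in the list would fail, because `find_money` needs the sentence annotations as input. Constructing a valid order by hand requires knowing such dependencies, which is the expert knowledge referred to in the first problem above.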
On the basis of a concise introduction to the foundations and the state of the art of text mining, we observe that knowledge about a text analysis process as well as information obtained within the process can be exploited in order to improve the design, the execution, and the output of the text analysis pipeline that realizes the process. To do this fully automatically, we apply different techniques from classic and statistical artificial intelligence.

In particular, we first develop knowledge-based artificial intelligence approaches for an ad-hoc pipeline construction and for the optimal execution of a pipeline on its input. Then, we show how to theoretically and practically optimize and adapt the schedule of the algorithms in a pipeline based on information in the analyzed input texts in order to maximize the pipeline's run-time efficiency. Finally, we learn novel patterns in the overall structures of input texts statistically that remain strongly invariant across the domains of the texts and that, thereby, allow for more robust analysis results in a specific set of text analysis tasks.

We analyze all the developed approaches formally and we sketch how to implement them in software applications. On the basis of respective Java open-source applications that we provide online, we empirically evaluate the approaches on established and on newly created collections of texts. In our experiments, we address scientifically and industrially important text analysis tasks, such as the extraction of financial events from news articles or the fine-grained sentiment analysis of reviews.

Our findings presented in this book show that text analysis pipelines can be designed automatically, which process only portions of text that are relevant for the information need to be fulfilled.
Through an informed scheduling, we improve the run-time efficiency of pipelines by up to more than one order of magnitude without compromising their effectiveness. Even on heterogeneous input texts, efficiency can be maintained by learning to predict the fastest pipeline for each text individually. Moreover, we provide evidence that the domain robustness of a pipeline’s effectiveness substantially benefits from focusing on overall structure in argumentation-related tasks such as sentiment analysis.

We conclude that the developed approaches constitute essential building blocks for enabling ad-hoc large-scale text mining in web search and big data analytics applications. In this regard, the book at hand serves as a guide for practitioners and interested readers who wish to know what to pay attention to in the context of text analysis pipelines. At the same time, we are confident that our scientific results prove valuable for other researchers who work on the automatic understanding of natural language and on the future of information search.
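
The effect of scheduling on run-time efficiency mentioned above can be illustrated with a toy calculation. The following sketch is an assumption-laden illustration, not the scheduling approach developed in this book: the three stages, their cost values, and the example sentences are all invented. Each stage is applied only to the sentences that passed all earlier stages, so the set of surviving sentences is the same for every order, while the amount of work differs.

```python
from typing import Callable, List, Tuple

# (name, assumed cost per sentence, filter deciding if a sentence stays relevant)
Stage = Tuple[str, int, Callable[[str], bool]]


def mentions_year(sentence: str) -> bool:
    """Crude stand-in for a time expression recognizer."""
    tokens = (token.strip(".,") for token in sentence.split())
    return any(len(token) == 4 and token.isdigit() for token in tokens)


def run(sentences: List[str], schedule: List[Stage]) -> Tuple[List[str], int]:
    """Run the stages in order on the still-relevant sentences only.

    Returns the sentences that passed every stage and the total cost.
    """
    remaining = list(sentences)
    total_cost = 0
    for _name, cost, keep in schedule:
        total_cost += cost * len(remaining)
        remaining = [s for s in remaining if keep(s)]
    return remaining, total_cost


# Hypothetical stages; the costs are made-up units, not measurements.
money: Stage = ("money", 1, lambda s: "$" in s)
year: Stage = ("year", 1, mentions_year)
forecast: Stage = ("forecast", 10, lambda s: "expects" in s)

sentences = [
    "Acme expects revenues of $5 million in 2016.",
    "The weather was fine.",
    "Acme expects rain.",
    "Tickets cost $20.",
]

expensive_first = run(sentences, [forecast, year, money])
cheap_first = run(sentences, [money, year, forecast])
print(expensive_first)
print(cheap_first)
```

Both schedules keep only the first sentence, but the schedule that runs the cheap filters first applies the expensive stage to far fewer sentences. In practice the best order depends on both the cost of each algorithm and how much text it filters out, which in turn depends on the input texts.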