Determining Authorship of Anonymous Texts

68 Pages·2013·2.06 MB·English

MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Determining Authorship of Anonymous Texts

PHD THESIS PROPOSAL

Jan Rygl

Brno, January 2013

Supervisor: doc. PhDr. Karel Pala, CSc.
Consultant: doc. RNDr. Aleš Horák, Ph.D.

Declaration

I hereby declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

January 9, 2013, Brno
Jan Rygl

Acknowledgement

I would like to express my gratitude to doc. Aleš Horák and doc. Karel Pala for their support, motivation and mentoring. I would also like to thank my colleagues at the Natural Language Processing Centre for their help and friendship. Finally, I want to thank my family for their kindness and encouragement.

This work has been partly supported by the Ministry of the Interior of CR within the project VF20102014003.

Contents

1 Introduction
  1.1 Problem Definition
  1.2 Motivation
  1.3 Problems of Authorship Recognition
  1.4 Goals
  1.5 Thesis Structure
2 State of the Art
  2.1 History
  2.2 Authorship Attribution
  2.3 Authorship Verification
  2.4 Authorship Clustering
  2.5 Selected Research Groups
3 Aims of the Thesis
  3.1 Objectives
  3.2 Proposed Plan of Work
  3.3 Future Publications
4 Achieved results
  4.1 Authorship Recognition Tool (ART)
  4.2 Suggested Improvements
  4.3 Author's Characteristic based on Syntax Features
  4.4 Presentations of Results
5 Author's Publications
Bibliography
Appendices

Chapter 1
Introduction

1.1 Problem Definition

Authorship recognition consists of three main problems: authorship attribution, authorship verification and authorship clustering. Plagiarism is rarely incorporated into this field. Authorship recognition is language-dependent, therefore all techniques need to be optimised for the Czech language.

• Authorship Attribution

  1. Given a particular sample of a text known to be by one author from a set of authors, determine which one [26, p. 238].
  2. Given a particular sample of the text believed to be by one author from a set of authors, determine which one, if any [26, p. 238].
  3. Here is a document, tell me who wrote it [26, p. 238].

  The first variant is a “closed class” and is considered to be the most straightforward one. The author whose stylometric features are the most similar to the sample text is selected.

  The second and third problem definitions are “open classes”. The second variant adds the possibility that none of the candidates is the author of the sample text. Once the most probable author is selected, it is decided whether the candidate's similarity exceeds the confidence threshold.

  The last task extends the problem by searching available sources for samples of all possible candidates. Advanced methods of artificial intelligence, Internet crawlers and human experts are required to solve the third variant.

• Authorship Verification

  1. Confirming or denying authorship by a single known author [62, p. 1].
  2. Given a set of documents written by a suspect along with a document data set collected from the sample population, we want to determine whether or not an anonymous document is written by the suspect [24, p. 1591].

  In the first variant, only documents of suspected authors are used to verify the authorship, therefore a threshold must be set to decide whether documents were written by one author.

  In the second scenario, the similarity between the suspected author and the anonymous document is compared with the similarities between the anonymous document and other reference authors. If the similarity of the suspected author outperforms other samples significantly, it can be concluded that the anonymous document was written by the suspected author.

• Authorship Clustering

  Was the document singly or multiply authored [14]?

  Regions of the text are compared with others to determine whether the sample is stylistically homogeneous. If the distance between two regions is defined in terms of the similarity score obtained by authorship verification, we can approach the problem with established clustering algorithms (e.g. hierarchical clustering, centroid-based clustering, etc.). Easier tasks such as clustering by gender [2] can also increase the accuracy of the authorship clustering methods.

• Plagiarism

  Submitting someone else's ideas or work [63].

  The problem of authorship recognition can also include the frequently mentioned plagiarism. In most cases, the plagiarism field is not concerned with whether the author publishes anonymously. It instead attempts to detect the use of texts written by other authors, therefore the utilized methods are different from the previous approaches. For this reason, and because this field has been thoroughly investigated (e.g. Masaryk University integrated plagiarism detection methods into its Information System in 2006 [3]), my thesis will not deal with plagiarism.

1.2 Motivation

• Literary authorship:

  Authorship attribution of historical documents and literary works has been explored by many research studies since the 18th century [40]. Most authorship problems have been successfully solved, therefore this topic is out of the scope of my work.

• Legal authorship (forensic linguistics):

  – Criminal investigations:
    National security agencies and police units (e.g. FBI [13]) build repositories for all communicated threats and other criminally oriented communications. Collected data can be effectively used for purposes of authorship recognition in projects such as the Dark Web project.
    The Dark Web project is a research initiative to identify individuals and groups using anonymous online communication to support extremist and terrorist activities [1].

  – Prosecutions:
    A forged check, a ransom note, a farewell letter printed at a university computer lab, a threatening letter, a diary of crimes, motel receipts, suicide notes on computer disks – such documents create paper trails leading to suspects. The authorship of documents has played an important role in the investigation of such recent high-profile cases as the Unabomber, the Oklahoma City bombing, the World Trade Center bombing, and the murder of JonBenet Ramsey [7].
    Stylometric techniques are currently used as evidence in courts of law in the UK, the U.S., and Australia [47].

1.3 Problems of Authorship Recognition

Obstacles depend on the definition of the problem, as the authorship recognition field encompasses many tasks with varied approaches.

The writing style of individuals tends to change over time, and compensating for that is a difficult task [4].

In open-class authorship attribution, it is impossible to search all documents of every candidate author. Relevant documents can be published under a different identity, password protected or destroyed. Potential sources such as the Internet, archives and the subject's property cannot be searched exhaustively.

Most documents examined by forensic linguistics do not contain enough stylistic information to use authorship recognition methods effectively. Threatening letters, anonymous communications, business e-mails, etc. are short in comparison to the essays, articles and books for which authorship recognition methods are designed and evaluated.

Attacks against authorship recognition are possible and have been exploited both in literary writing (Manuscripts of Dvůr Králové and of Zelená Hora [35], Federalist Papers [48], etc.) and in prosecution. Obfuscation attacks attempt to hide the identities of their authors. Experiments indicate that current approaches are very susceptible to this form of attack [4]. Statistics show that shorter sentences and less descriptive words are mostly used during obfuscation attacks, therefore it may at least be possible to detect obfuscation attempts. Imitation attacks make an effort to frame other subjects by imitating their writing style. Current techniques cannot effectively distinguish between imitated and real documents [4].

The absence of a Czech corpus of signed documents makes designing and testing new approaches for the Czech language difficult. Many corpora of e-mails (the Enron corpus was made public during the legal investigation concerning the Enron corporation [28]), blogs (The Blog Authorship Corpus – collected posts of 19320 bloggers gathered from blogger.com in August 2004 [29]) and literary works (The Federalist Papers – a series of 85 articles or essays [18]) are available for English.

1.4 Goals

My work focuses on authorship recognition in the Internet environment. The research is supported by the Ministry of the Interior of the CR. The goals of my thesis are:

• to create tools to collect, store and categorize documents, and to build a corpus of tagged Czech documents with these tools;
• to develop new methods and improve current approaches of authorship verification, attribution and clustering;
• to propose and evaluate new characteristics of authors (e.g. stylometric features; typography and error analysis and text formatting statistics);
• to create an authorship recognition tool with a web interface.

1.5 Thesis Structure

This work is divided into five parts including the Introduction. In the following chapter, State of the Art, related works are mentioned and the current state of the art is described. In the third chapter, Aims of the Thesis, thesis goals are presented and my approach to the problem is explained. Results of my research, presentations and implemented applications are listed in the fourth chapter, Achieved results. The last chapter, Author's Publications, contains a list of my publications with comments.

Chapter 2
State of the Art

2.1 History

The beginning of interest in determining authorship goes back to the late 18th century. The authorship of some of Shakespeare's plays was questioned (E. Malone in 1787) and the first authorship verification procedures were proposed [40]. This topic gradually began to draw the attention of the public, and over time the authorship of many other works was challenged.

The first attempt to quantify stylometry utilized word-length frequencies. T. C. Mendenhall discovered that the word-length frequency distribution tends to be consistent for one author and differs for different authors (1887, [44]).

Originally, all the analyses were done manually by counting various statistics. The efficiency of verification of proposed algorithms was limited and experiments were performed on a small number of documents. As a result, most attention was paid to literary works – new theories were usually applied to already examined books to allow comparison with other results.

The Bible, as one of the most influential books, soon came to the attention of researchers. Vocabulary statistics of the Pastorals were analyzed by the New Testament scholar Kenneth Greyston and the statistician Gustav Herdan (1959–1960, [16]).

Authorship recognition was substantially marked by the seminal study of Mosteller and Wallace on the authorship of the disputed Federalist Papers (1964, [48]). It was one of the first publications dealing with Bayesian methods applied to large-scale data analysis.

In the second half of the 20th century, authorship recognition gained enough public attention to be admitted as evidence in court [17, p. 46]. The most famous work from this time is probably Word Detective Proves the Bard wasn't Bacon written by A. Q. Morton (1976, [46]) – Morton was invited to several judicial hearings to act as an expert for the defence, where he refuted authorship testimony against the accused by applying quantitative methods of authorship recognition. This period can be described as the beginning of forensic linguistics.

Over the past few years, authorship recognition has developed significantly, taking advantage of the research advances in natural language processing and machine learning.

2.2 Authorship Attribution

Techniques such as authorial fingerprint [22] and writeprint [51] are predominantly used to solve the authorship attribution problem. These techniques combine quantified authors' characteristics and use them as the input for heuristic, statistical and machine learning methods.

Authors' characteristics describe authors according to stylometric features. Features are designed to be consistent for the author (a good feature of authorship is one which will show between-author variation and within-author consistency across a sample of authors [15, p. 521]). Each author's characteristic deterministically assigns one or more numbers to the document. If two documents are assigned similar numbers for each author's characteristic, it is more probable that the texts were written by the same author.

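The idea of mapping documents to numbers and comparing them can be illustrated with a minimal sketch. It is not the method used in this thesis: the single feature (a word-length frequency profile, in the spirit of Mendenhall's observation in Section 2.1), the cosine similarity measure, the bucket limit MAX_LEN, and the names word_length_profile, cosine_similarity and attribute are all assumptions chosen for illustration. With threshold=None the function behaves like closed-class attribution; with a threshold it may return no author, as in the open-class variant of Section 1.1.

```python
import math
import re
from collections import Counter

MAX_LEN = 12  # assumption: words longer than this share one bucket


def word_length_profile(text):
    """Map a document to relative frequencies of word lengths 1..MAX_LEN."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(min(len(w), MAX_LEN) for w in words)
    total = sum(counts.values()) or 1  # avoid division by zero on empty text
    return [counts.get(n, 0) / total for n in range(1, MAX_LEN + 1)]


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def attribute(sample, candidates, threshold=None):
    """Pick the candidate whose known text is most similar to the sample.

    candidates maps an author name to a text known to be by that author.
    If threshold is given and the best score is below it, no author is
    returned (open-class behaviour). Returns (author_or_None, best_score).
    """
    profile = word_length_profile(sample)
    scores = {
        author: cosine_similarity(profile, word_length_profile(text))
        for author, text in candidates.items()
    }
    best = max(scores, key=scores.get)
    if threshold is not None and scores[best] < threshold:
        return None, scores[best]
    return best, scores[best]


# Toy data, invented for this example only.
candidates = {
    "A": "I go to my red car so we can be at sea by ten.",
    "B": "Comprehensive investigations frequently necessitate "
         "considerable preparation beforehand.",
}
sample = "We sat on the big log and ate an egg pie at two."

closed_class = attribute(sample, candidates)
open_class = attribute(sample, candidates, threshold=0.99)
```

Here closed_class names the closest candidate regardless of how weak the match is, while open_class declines to name anyone because the best score stays below the (arbitrary) threshold of 0.99. A realistic system would combine many characteristics and would need far longer texts than these toy sentences.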
The list of possible authors' characteristics (stylometric features) presented by Walter Daelemans at the TSD 2012 conference [12]:

• letter frequency, punctuation, spelling errors
• character n-grams, word n-grams
• distributions of function words, content words, frequent words, pronouns
• morphology (inflections, prefixes, adverb/adjective substitution, plural/possessive confusion [41, p. 499–500])
• syntax and POS tag distributions
• semantic subclasses (wordnet)
• stable words (stay the same if translated and translated back, will probably survive editing)
• complexity measures (readability, average word length, average sentence length)
• vocabulary richness
• discourse level features (connectors)

Most implementations of authors' characteristics are designed for English or other major languages (all mentioned results have been measured for English documents). For our research, it is necessary to modify established characteristics and filter out those describing linguistic phenomena not present in the Czech language. New authors' characteristics dedicated to the Czech (or generally Slavonic) language can be proposed.

The accuracy of authorship attribution methods depends not merely on the selected methods, but also on the training and test data. Most studies use sizes of training data that are unrealistic for most situations in which stylometry is applied (e.g., forensics), and
