MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Determining Authorship of Anonymous Texts

PHD THESIS PROPOSAL

Jan Rygl

Brno, January 2013

Supervisor: doc. PhDr. Karel Pala, CSc.
Consultant: doc. RNDr. Aleš Horák, Ph.D.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

January 9, 2013, Brno
Jan Rygl

Acknowledgement

I would like to express my gratitude to doc. Aleš Horák and doc. Karel Pala for their support, motivation and mentoring. I would also like to thank my colleagues at the Natural Language Processing Centre for their help and friendship. Finally, I want to thank my family for their kindness and encouragement.

This work has been partly supported by the Ministry of the Interior of CR within the project VF20102014003.

Contents

1 Introduction
  1.1 Problem Definition
  1.2 Motivation
  1.3 Problems of Authorship Recognition
  1.4 Goals
  1.5 Thesis Structure
2 State of the Art
  2.1 History
  2.2 Authorship Attribution
  2.3 Authorship Verification
  2.4 Authorship Clustering
  2.5 Selected Research Groups
3 Aims of the Thesis
  3.1 Objectives
  3.2 Proposed Plan of Work
  3.3 Future Publications
4 Achieved results
  4.1 Authorship Recognition Tool (ART)
  4.2 Suggested Improvements
  4.3 Author's Characteristic based on Syntax Features
  4.4 Presentations of Results
5 Author's Publications
Bibliography
Appendices

Chapter 1
Introduction

1.1 Problem Definition

Authorship recognition consists of three main problems: authorship attribution, authorship verification and authorship clustering. Plagiarism is rarely incorporated into this field. Authorship recognition is language-dependent, therefore all techniques need to be optimised for the Czech language.

• Authorship Attribution
  1. Given a particular sample of a text known to be by one author from a set of authors, determine which one [26, p. 238].
  2. Given a particular sample of the text believed to be by one author from a set of authors, determine which one, if any [26, p. 238].
  3. Here is a document, tell me who wrote it [26, p. 238].
The first variant is a "closed class" problem and is considered to be the most straightforward one. The author whose stylometric features are the most similar to the sample text is selected.

The second and third problem definitions are "open class" problems. The second variant adds the possibility that none of the candidates is the author of the sample text. Once the most probable author is selected, it is decided whether the candidate's similarity exceeds the confidence threshold.

The last task extends the problem by searching available sources for samples of all possible candidates. Advanced methods of artificial intelligence, Internet crawlers and human experts are required to solve the third variant.

• Authorship Verification
  1. Confirming or denying authorship by a single known author [62, p. 1].
  2. Given a set of documents written by a suspect along with a document data set collected from the sample population, we want to determine whether or not an anonymous document is written by the suspect [24, p. 1591].

In the first variant, only documents of the suspected author are used to verify the authorship, therefore a threshold must be set to decide whether the documents were written by one author.

In the second scenario, the similarity between the suspected author and the anonymous document is compared with the similarities between the anonymous document and other reference authors. If the similarity of the suspected author significantly exceeds that of the other samples, it can be concluded that the anonymous document was written by the suspected author.

• Authorship Clustering
  Was the document singly or multiply authored [14]?

Regions of the text are compared with others to determine whether the sample is stylistically homogeneous. If the distance between two regions is defined as a similarity score obtained by authorship verification, we can approach the problem with established clustering algorithms (e.g. hierarchical clustering, centroid-based clustering, etc.). Easier tasks such as clustering by gender [2] can also increase the accuracy of the authorship clustering methods.

• Plagiarism
  Submitting someone else's ideas or work [63].
The problem of authorship recognition can also include the frequently mentioned plagiarism. The plagiarism field does not, in most cases, consider whether the author publishes anonymously to be a problem. It instead attempts to detect the use of texts written by other authors, therefore the methods used differ from the previous approaches. For this reason, and because this field has been thoroughly investigated (e.g. Masaryk University integrated plagiarism detection methods into its Information System in 2006 [3]), my thesis will not deal with plagiarism.

1.2 Motivation

• Literary authorship:
  Authorship attribution of historical documents and literary works has been explored by many research studies since the 18th century [40]. Most authorship problems have been successfully solved, therefore this topic is out of the scope of my work.

• Legal authorship (forensic linguistics):
  – Criminal investigations:
    National security agencies and police units (e.g. FBI [13]) build repositories for all communicated threats and other criminally oriented communications. Collected data can be effectively used for purposes of authorship recognition in projects such as the Dark Web project.
    The Dark Web project is a research initiative to identify individuals and groups using anonymous online communication to support extremist and terrorist activities [1].
  – Prosecutions:
    A forged check, a ransom note, a farewell letter printed at a university computer lab, a threatening letter, a diary of crimes, motel receipts, suicide notes on computer disks – such documents create paper trails leading to suspects. The authorship of documents has played an important role in the investigation of such recent high-profile cases as the Unabomber, the Oklahoma City bombing, the World Trade Center bombing, and the murder of JonBenet Ramsey [7].
    Stylometric techniques are currently used as evidence in courts of law in the UK, the U.S., and Australia [47].

1.3 Problems of Authorship Recognition

Obstacles depend on the definition of the problem, as the authorship recognition field encompasses many tasks with varied approaches.
The writing style of individuals tends to change over time, and compensating for that is a difficult task [4].

It is impossible to search all of a candidate author's documents in open-class authorship attribution. Relevant documents can be published under a different identity, password protected or destroyed. Potential sources such as the Internet, archives and the subject's property cannot all be searched exhaustively.

Most documents examined by forensic linguistics do not contain enough stylistic information to use authorship recognition methods effectively. Threatening letters, anonymous communications, business e-mails, etc. are short in comparison to essays, articles and books, for which authorship recognition methods are designed and evaluated.

Attacks against authorship recognition are possible and have been exploited both in literary writing (Manuscripts of Dvůr Králové and of Zelená Hora [35], Federalist Papers [48], etc.) and prosecution. Obfuscation attacks attempt to hide the identities of their authors. Experiments indicate that current approaches are very susceptible to this form of attack [4]. Statistics show that shorter sentences and less descriptive words are mostly used during obfuscation attacks, therefore it may at least be possible to detect obfuscation attempts. Imitation attacks make an effort to frame other subjects by imitating their writing style. Current techniques cannot effectively distinguish between imitated and real documents [4].

The absence of a Czech corpus of signed documents makes designing and testing new approaches for the Czech language difficult. Many corpora of e-mails (the Enron corpus was made public during the legal investigation concerning the Enron corporation [28]), blogs (The Blog Authorship Corpus: collected posts of 19320 bloggers gathered from blogger.com in August 2004 [29]) and literary works (The Federalist Papers: a series of 85 articles or essays [18]) are available for English.
1.4 Goals

My work focuses on authorship recognition in the Internet environment. The research is supported by the Ministry of the Interior of the CR. The goals of my thesis are:

• to create tools to collect, store and categorize documents, and to build a corpus of tagged Czech documents with these tools;
• to develop new methods and improve current approaches of authorship verification, attribution and clustering;
• to propose and evaluate new characteristics of authors (e.g. stylometric features; typography and error analysis and text formatting statistics);
• to create an authorship recognition tool with a web interface.

1.5 Thesis Structure

This work is divided into five parts including the Introduction. In the following chapter, State of the Art, related works are mentioned and the current state of the art is described. In the third chapter, Aims of the Thesis, the thesis goals are presented and my approach to the problem is explained. Results of my research, presentations and implemented applications are listed in the fourth chapter, Achieved results. The last chapter, Author's Publications, contains a list of my publications with comments.

Chapter 2
State of the Art

2.1 History

The beginning of interest in determining authorship goes back to the late 18th century. The authorship of some of Shakespeare's plays was questioned (E. Malone in 1787) and the first authorship verification procedures were proposed [40]. This topic gradually began to draw the attention of the public, and over time the authorship of many other works was challenged.

The first attempt to quantify stylometry utilized word-length frequencies. T. C. Mendenhall discovered that the word-length frequency distribution tends to be consistent for one author and differs for different authors (1887, [44]).

Originally, all the analyses were done manually by counting various statistics. The efficiency of verification of proposed algorithms was limited and experiments were performed on a small number of documents. As a result, most attention was paid to literary works: new theories were usually applied to already examined books to allow comparison with other results.
The Bible, as one of the most influential books, soon came to the attention of researchers. Vocabulary statistics of the Pastorals were analyzed by the New Testament scholar Kenneth Grayston and the statistician Gustav Herdan (1959–1960, [16]).

Authorship recognition was substantially marked by the seminal study of Mosteller and Wallace on the authorship of the disputed Federalist Papers (1964, [48]). It was one of the first publications dealing with Bayesian methods applied to large-scale data analysis.

In the second half of the 20th century, authorship recognition gained enough public attention to be admitted as evidence in court [17, p. 46]. The most famous work from this time is probably Word Detective Proves the Bard wasn't Bacon, written by A. Q. Morton (1976, [46]). Morton was invited to several judicial hearings to act as an expert for the defence, where he refuted authorship testimony against the accused by applying quantitative methods of authorship recognition. This period can be described as the beginning of forensic linguistics.

Over the past few years, authorship recognition has developed significantly, taking advantage of research advances in natural language processing and machine learning.

2.2 Authorship Attribution

Techniques such as authorial fingerprint [22] and writeprint [51] are predominantly used to solve the authorship attribution problem. These techniques combine quantified authors' characteristics and use them as the input for heuristic, statistical and machine learning methods.

Authors' characteristics describe authors according to stylometric features. Features are designed to be consistent for the author (a good feature of authorship is one which will show between-author variation and within-author consistency across a sample of authors [15, p. 521]). Each author's characteristic deterministically assigns one or more numbers to the document. If two documents are assigned similar values for each author's characteristic, it is more probable that the texts were written by the same author.
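The idea that a characteristic maps each document to numbers, and that similar numbers suggest a shared author, can be sketched in a few lines. This is only an illustration under stated assumptions, not the method of the thesis: it uses a single characteristic (the word-length frequency distribution that Mendenhall studied), compares documents with cosine similarity, and all names (word_length_profile, cosine_similarity, attribute, MAX_LEN) and any threshold value are hypothetical choices made for this example.

```python
import math
import re
from collections import Counter

MAX_LEN = 12  # assumption: words longer than this share the last bucket


def word_length_profile(text):
    """Map a document to relative frequencies of word lengths 1..MAX_LEN."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(min(len(w), MAX_LEN) for w in words)
    total = sum(counts.values()) or 1  # avoid division by zero on empty text
    return [counts.get(n, 0) / total for n in range(1, MAX_LEN + 1)]


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def attribute(sample, candidates, threshold=None):
    """Pick the candidate whose known text is most similar to the sample.

    candidates: non-empty dict mapping author name -> known text.
    threshold=None -> closed-class variant: always return the best candidate.
    threshold=t    -> open-class variant: return None if the best score is below t.
    Returns (author_or_None, scores).
    """
    profile = word_length_profile(sample)
    scores = {
        author: cosine_similarity(profile, word_length_profile(text))
        for author, text in candidates.items()
    }
    best = max(scores, key=scores.get)
    if threshold is not None and scores[best] < threshold:
        return None, scores
    return best, scores
```

Calling attribute(sample, candidates) corresponds to the closed-class variant of attribution, while passing a threshold gives the open-class variant in which "none of the candidates" is a possible answer. A single characteristic on short texts is a weak signal; a practical system would combine many characteristics and tune the threshold on held-out data.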
The list of possible authors' characteristics (stylometric features) presented by Walter Daelemans at the TSD 2012 conference [12]:

• letter frequency, punctuation, spelling errors
• character n-grams, word n-grams
• distributions of function words, content words, frequent words, pronouns
• morphology (inflections, prefixes, adverb/adjective substitution, plural/possessive confusion [41, p. 499–500])
• syntax and POS tag distributions
• semantic subclasses (WordNet)
• stable words (stay the same if translated and translated back, will probably survive editing)
• complexity measures (readability, average word length, average sentence length)
• vocabulary richness
• discourse level features (connectors)

Most implementations of authors' characteristics are designed for English or other major languages (all mentioned results have been measured for English documents). For our research, it is necessary to modify established characteristics and filter out those describing linguistic phenomena not present in the Czech language. New authors' characteristics dedicated to Czech (or Slavonic languages in general) can be proposed.

The accuracy of authorship attribution methods depends not merely on the selected methods, but also on the training and test data. Most studies use sizes of training data that are unrealistic for most situations in which stylometry is applied (e.g., forensics), and

