MASARYK UNIVERSITY
FACULTY OF INFORMATICS
Determining Authorship of
Anonymous Texts
PHD THESIS PROPOSAL
Jan Rygl
Brno, January 2013
Supervisor: doc. PhDr. Karel Pala, CSc.
Consultant: doc. RNDr. Aleš Horák, Ph.D.
Declaration
I hereby declare that this paper is my original authorial work, which I have worked out
on my own. All sources, references and literature used or excerpted during the elaboration
of this work are properly cited and listed in complete reference to the due source.
January 9, 2013, Brno
Jan Rygl
Acknowledgement
I would like to express my gratitude to doc. Aleš Horák and doc. Karel Pala for their
support, motivation and mentoring. I would also like to thank my colleagues at the
Natural Language Processing Centre for their help and friendship. Finally, I want to
thank my family for their kindness and encouragement.
This work has been partly supported by the Ministry of the Interior of CR within the
project VF20102014003.
Contents
1 Introduction
1.1 Problem Definition
1.2 Motivation
1.3 Problems of Authorship Recognition
1.4 Goals
1.5 Thesis Structure
2 State of the Art
2.1 History
2.2 Authorship Attribution
2.3 Authorship Verification
2.4 Authorship Clustering
2.5 Selected Research Groups
3 Aims of the Thesis
3.1 Objectives
3.2 Proposed Plan of Work
3.3 Future Publications
4 Achieved results
4.1 Authorship Recognition Tool (ART)
4.2 Suggested Improvements
4.3 Author’s Characteristic based on Syntax Features
4.4 Presentations of Results
5 Author’s Publications
Bibliography
Appendices
Chapter 1
Introduction
1.1 Problem Definition
Authorship recognition consists of three main problems: authorship attribution,
authorship verification and authorship clustering. Plagiarism is rarely incorporated into this
field. Authorship recognition is language-dependent, therefore all techniques need to be
optimised for the Czech language.
• Authorship Attribution
1. Given a particular sample of a text known to be by one author from a set of
authors, determine which one [26, p. 238].
2. Given a particular sample of the text believed to be by one author from a set of
authors, determine which one, if any [26, p. 238].
3. Here is a document, tell me who wrote it [26, p. 238].
The first variant is a “closed class” problem and is considered to be the most
straightforward one. The author whose stylometric features are the most similar to the
sample text is selected.
The second and third problem definitions are “open class” problems. The second variant
adds the possibility that none of the candidates is the author of the sample text. Once the
most probable author is selected, it is decided whether the candidate’s similarity
exceeds the confidence threshold.
The last task extends the problem by searching available sources for samples of all
possible candidates. Advanced methods of artificial intelligence, Internet crawlers
and human experts are required to solve the third variant.
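The difference between the closed-class and the open-class variant can be illustrated by a minimal sketch. It assumes, purely for illustration, that authors are represented by character trigram frequencies and compared by cosine similarity; the function names and the threshold parameter are hypothetical and are not taken from any cited work.

```python
import math
from collections import Counter


def profile(text, n=3):
    """Relative frequencies of character n-grams (one possible stylometric feature)."""
    text = " ".join(text.lower().split())
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values()) or 1
    return {g: c / total for g, c in grams.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def attribute(sample, candidates, threshold=None):
    """Return the most similar candidate author.

    threshold=None  -> closed class: the best candidate is always returned.
    threshold=float -> open class: None is returned if the best score is too low.
    """
    sample_profile = profile(sample)
    scores = {a: cosine(sample_profile, profile(t)) for a, t in candidates.items()}
    best = max(scores, key=scores.get)
    if threshold is not None and scores[best] < threshold:
        return None  # assumed meaning: none of the candidates is the author
    return best
```

In practice the threshold would have to be estimated from documents of known authorship; the sketch only shows where it enters the decision.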
• Authorship Verification
1. Confirming or denying authorship by a single known author [62, p. 1].
2. Given a set of documents written by a suspect along with a document data set
collected from the sample population, we want to determine whether or not an
anonymous document is written by the suspect [24, p. 1591].
In the first variant, only documents of the suspected author are used to verify the
authorship, therefore a threshold must be set to decide whether the documents were
written by one author.
In the second scenario, the similarity between the suspected author and the anonymous
document is compared with the similarities between the anonymous document and other
reference authors. If the similarity of the suspected author significantly exceeds that of
the other samples, it can be concluded that the anonymous document
was written by the suspected author.
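The second scenario can be sketched as follows. The sketch assumes that similarity scores have already been computed by some method; the decision rule (a margin expressed in standard deviations of the reference scores) and the parameter min_z are illustrative assumptions, not the procedure of the cited works.

```python
from statistics import mean, stdev


def verify(suspect_score, reference_scores, min_z=2.0):
    """Accept authorship if the suspect's similarity stands out from reference authors.

    suspect_score: similarity between the anonymous document and the suspect.
    reference_scores: similarities between the anonymous document and other authors.
    min_z: hypothetical margin, in standard deviations of the reference scores.
    """
    if len(reference_scores) < 2:
        raise ValueError("need at least two reference authors")
    mu = mean(reference_scores)
    sigma = stdev(reference_scores)
    if sigma == 0:
        # All reference authors score the same; fall back to a plain comparison.
        return suspect_score > mu
    return (suspect_score - mu) / sigma >= min_z
```

The first variant corresponds to comparing suspect_score with a fixed threshold, because no reference authors are available to calibrate against.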
• Authorship Clustering
Was the document singly or multiply authored [14]?
Regions of the text are compared with one another to determine whether the sample
is stylistically homogeneous. If the distance between two regions is defined by the
similarity score obtained by authorship verification, we can approach the problem
with established clustering algorithms (e.g. hierarchical clustering, centroid-based
clustering, etc.). Easier tasks such as clustering by gender [2] can also increase the
accuracy of the authorship clustering methods.
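A minimal sketch of the hierarchical approach is given below. It assumes that a distance between two regions is available as a function (for instance one minus a verification similarity score); the average linkage and the cut-off max_distance are illustrative choices.

```python
def cluster_regions(distance, n, max_distance):
    """Average-linkage agglomerative clustering of n text regions.

    distance(i, j): dissimilarity of regions i and j, e.g. 1 - verification similarity.
    max_distance: hypothetical cut-off; merging stops once clusters are farther apart.
    Returns a list of clusters, each a sorted list of region indices.
    """
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(distance(i, j) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [(linkage(a, b), x, y)
                 for x, a in enumerate(clusters)
                 for y, b in enumerate(clusters) if x < y]
        d, x, y = min(pairs)
        if d > max_distance:
            break
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters
```

If more than one cluster remains, the regions are not stylistically homogeneous under the chosen cut-off, which suggests multiple authorship.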
• Plagiarism
Submitting someone else’s ideas or work [63].
The problem of authorship recognition can also include the frequently mentioned
plagiarism. The plagiarism field is not, in most cases, concerned with
whether the author publishes anonymously. It instead attempts to detect the use of
texts written by other authors, therefore the utilized methods differ from
the previous approaches. For this reason, and because this field has been thoroughly
investigated (e.g. Masaryk University integrated plagiarism detection methods
into its Information System in 2006 [3]), my thesis will not deal with plagiarism.
1.2 Motivation
• Literary authorship:
Authorship attribution of historical documents and literary works has been
explored by many research studies since the 18th century [40]. Most authorship
problems have been successfully solved, therefore this topic is out of the scope of my
work.
• Legal authorship (forensic linguistics):
– Criminal investigations:
National security agencies and police units (e.g. FBI [13]) build repositories
for all communicated threats and other criminally oriented communications.
Collected data can be effectively used for purposes of authorship recognition
in projects such as the Dark Web project.
The Dark Web project is a research initiative to identify individuals and groups
using anonymous online communication to support extremist and terrorist
activities [1].
– Prosecutions:
A forged check, a ransom note, a farewell letter printed at a university
computer lab, a threatening letter, a diary of crimes, motel receipts, suicide
notes on computer disks – such documents create paper trails leading to
suspects. The authorship of documents has played an important role in
the investigation of such recent high-profile cases as the Unabomber, the
Oklahoma City bombing, the World Trade Center bombing, and the murder
of JonBenet Ramsey [7].
Stylometric techniques are currently used as evidence in courts of law in the
UK, the U.S., and Australia [47].
1.3 Problems of Authorship Recognition
Obstacles depend on the definition of the problem, as the authorship recognition field
encompasses many tasks with varied approaches.
The writing style of individuals tends to change over time, and compensating for
that is a difficult task [4].
It is impossible to search all of a candidate author’s documents in open-class
authorship attribution. Relevant documents can be published under a different identity,
password protected or destroyed. Potential sources such as the Internet, archives
and the subject’s property cannot be searched exhaustively.
Most documents examined by forensic linguistics do not contain enough stylistic
information to use authorship recognition methods effectively. Threatening letters,
anonymous communications, business e-mails, etc. are short in comparison to the essays,
articles and books for which authorship recognition methods are designed and evaluated.
Attacks against authorship recognition are possible and have been exploited both
in literary writing (Manuscripts of Dvůr Králové and of Zelená Hora [35], Federalist
Papers [48], etc.) and in prosecution. Obfuscation attacks attempt to hide the identities of their
authors. Experiments indicate that current approaches are very susceptible to this form
of attack [4]. Statistics show that shorter sentences and fewer descriptive words are mostly
used during obfuscation attacks, so it may at least be possible to detect obfuscation
attempts. Imitation attacks make an effort to frame other subjects by imitating their
writing style. Current techniques cannot effectively distinguish between imitated and
real documents [4].
The absence of a Czech corpus of signed documents makes designing and testing
new approaches for the Czech language difficult. For English, many corpora are available:
e-mails (the Enron corpus was made public during the legal investigation concerning
the Enron corporation [28]), blogs (The Blog Authorship Corpus, the collected posts of
19,320 bloggers gathered from blogger.com in August 2004 [29]) and literary works
(The Federalist Papers, a series of 85 articles or essays [18]).
1.4 Goals
My work focuses on authorship recognition in the Internet environment. The research is
supported by the Ministry of the Interior of the CR. The goals of my thesis are:
• to create tools to collect, store and categorize documents, and to build a corpus of
tagged Czech documents with these tools,
• to develop new methods and improve current approaches to authorship
verification, attribution and clustering,
• to propose and evaluate new characteristics of authors (e.g. stylometric features,
typography and error analysis, and text formatting statistics),
• to create an authorship recognition tool with a web interface.
1.5 Thesis Structure
This work is divided into five parts including the Introduction. In the following chapter,
State of the Art, related works are mentioned and the current state of the art is described.
In the third chapter, Aims of the Thesis, the thesis goals are presented and my approach
to the problem is explained. Results of my research, presentations and implemented
applications are listed in the fourth chapter, Achieved results. The last chapter, Author’s
Publications, contains a list of my publications with comments.
Chapter 2
State of the Art
2.1 History
Interest in determining authorship goes back to the late 18th century.
The authorship of some of Shakespeare’s plays was questioned (E. Malone in 1787) and
the first authorship verification procedures were proposed [40]. This topic gradually
began to draw the attention of the public and over time the authorship of many other works was
challenged.
The first attempt to quantify stylometry utilized word-length frequencies. T. C.
Mendenhall discovered that the word-length frequency distribution tends to be consistent
for one author and differs between authors (1887, [44]).
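A word-length frequency distribution of this kind is simple to compute. The following sketch is only an illustration of the idea; the tokenization (runs of letters) is an assumption and does not reproduce Mendenhall’s manual procedure.

```python
import re
from collections import Counter


def word_length_distribution(text):
    """Relative frequency of each word length in the text."""
    # Runs of Unicode letters are treated as words; digits and underscores are skipped.
    words = re.findall(r"[^\W\d_]+", text)
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    if not total:
        return {}
    return {length: c / total for length, c in sorted(counts.items())}
```

Two such distributions, one per text, can then be compared, for example by plotting them or by summing the absolute differences of their values.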
Originally, all the analyses were done manually by counting various statistics. The
efficiency of verification of proposed algorithms was limited and experiments were
performed on a small number of documents. As a result, most attention was
paid to literary works: new theories were usually applied to already examined
books to allow comparison with other results.
The Bible, as one of the most influential books, soon came to the attention of
researchers. Vocabulary statistics of the Pastorals were analyzed by the New Testament
scholar Kenneth Grayston and the statistician Gustav Herdan (1959–1960, [16]).
Authorship recognition was substantially marked by the seminal study of Mosteller
and Wallace on the authorship of the disputed Federalist Papers (1964, [48]). It was
one of the first publications dealing with Bayesian methods applied to large-scale data
analysis.
In the second half of the 20th century, authorship recognition gained enough
public attention to be admitted as evidence in court [17, p. 46]. The most famous work
from this time is probably Word Detective Proves the Bard wasn’t Bacon written by A. Q.
Morton (1976, [46]). Morton was invited to several judicial hearings to act as an expert
for the defence, where he refuted authorship testimony against the accused by applying
quantitative methods of authorship recognition. This period can be described as the
beginning of forensic linguistics.
Over the past few years, authorship recognition has developed significantly,
taking advantage of research advances in natural language processing and machine
learning.
2.2 Authorship Attribution
Techniques such as authorial fingerprint [22] and writeprint [51] are predominantly used
to solve the authorship attribution problem. These techniques combine quantified authors’
characteristics and use them as the input for heuristic, statistical and machine learning
methods.
Authors’ characteristics describe authors according to stylometric features. Features
are designed to be consistent for the author (a good feature of authorship is one which
will show between-author variation and within-author consistency across a sample of
authors [15, p. 521]). Each author’s characteristic deterministically assigns one or more
numbers to the document. If two documents are assigned similar values for
each author’s characteristic, it is more probable that the texts were written by the same
author.
The list of possible authors’ characteristics (stylometric features) presented by Walter
Daelemans at the TSD 2012 conference [12]:
• letter frequency, punctuation, spelling errors
• character n-grams, word n-grams
• distributions of function words, content words, frequent words, pronouns
• morphology (inflections, prefixes, adverb/adjective substitution, plural/possessive
confusion [41, p. 499–500])
• syntax and POS tag distributions
• semantic subclasses (wordnet)
• stable words (stay the same if translated and translated back, will probably survive
editing)
• complexity measures (readability, average word length, average sentence length)
• vocabulary richness
• discourse level features (connectors)
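How a few of these characteristics can assign numbers to a document is illustrated by the following sketch. The selection of features, the naive tokenization and the function-word list are illustrative assumptions (the list is an English-only placeholder); a real implementation for Czech would need language-specific resources.

```python
import re


def stylometric_features(text, function_words=("a", "the", "of", "and", "to")):
    """Map a document to a small dictionary of numeric stylometric features.

    function_words is a hypothetical, English-only placeholder list.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Naive sentence split on terminal punctuation; abbreviations are not handled.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words) or 1
    features = {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": len(words) / (len(sentences) or 1),
        "type_token_ratio": len(set(words)) / n_words,  # simple vocabulary richness
        "punctuation_rate": len(re.findall(r"[,;:.!?]", text)) / (len(text) or 1),
    }
    for fw in function_words:
        features["fw_" + fw] = words.count(fw) / n_words
    return features
```

Each document is thus mapped to a fixed set of numbers, which can serve as the input for statistical or machine learning methods.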
Most implementations of authors’ characteristics are designed for the English
language or other major languages (all mentioned results have been measured for
English documents). For our research, it is necessary to modify established characteristics
and filter out those describing linguistic phenomena not present in the Czech language.
New authors’ characteristics dedicated to the Czech (or generally Slavonic) language
can be proposed.
The accuracy of authorship attribution methods depends not merely on the selected
methods, but also on the training and test data. Most studies use sizes of training data that
are unrealistic for most situations in which stylometry is applied (e.g., forensics), and