A Data-Oriented Model of Literary Language

Andreas van Cranenburgh
Institut für Sprache und Information
Heinrich Heine University Düsseldorf
[email protected]

Rens Bod
Institute for Logic, Language and Computation
University of Amsterdam
[email protected]

arXiv:1701.03329v2 [cs.CL] 26 Jan 2017

Abstract

We consider the task of predicting how literary a text is, with a gold standard from human ratings. Aside from a standard bigram baseline, we apply rich syntactic tree fragments, mined from the training set, and a series of hand-picked features. Our model is the first to distinguish degrees of highly and less literary novels using a variety of lexical and syntactic features, and explains 76.0% of the variation in literary ratings.

1 Introduction

What makes a literary novel literary? This seems first of all to be a value judgment; but to what extent is this judgment arbitrary, determined by social factors, or predictable as a function of the text? The last explanation is associated with the concept of literariness, the hypothesized linguistic and formal properties that distinguish literary language from other language (Baldick, 2008). Although the definition and demarcation of literature is fundamental to the field of literary studies, it has received surprisingly little empirical study. Common wisdom has it that literary distinction is attributed in social communication about novels and that it lies mostly outside of the text itself (Bourdieu, 1996), but an increasing number of studies argue that in addition to social and historical explanations, textual features of various complexity may also contribute to the perception of literature by readers (cf. Harris, 1995; McDonald, 2007). The current paper shows that not only lexical features but also hierarchical syntactic features and other textual characteristics contribute to explaining judgments of literature.

Our main goal in this project is to answer the following question: are there particular textual conventions in literary novels that contribute to readers judging them to be literary? We address this question by building a model of literary evaluation to estimate the contribution of textual factors. This task has been considered before with a smaller set of novels (restricted to thrillers and literary novels), using bigrams (van Cranenburgh and Koolen, 2015). We extend this work by testing on a larger, more diverse corpus, and by applying rich syntactic features and several hand-picked features to the task. This task is first of all relevant to literary studies: it reveals to what extent literature is empirically associated with textual characteristics. However, practical applications are also possible; e.g., an automated model could help a literary publisher decide whether the work of a new author fits its audience, or it could be used as part of a recommender system for readers.

Literary language is arguably a subjective notion. A gold standard could be based on the expert opinions of critics and literary prizes, but we can also consider the reader directly, which, in the form of a crowdsourced survey, more easily provides a statistically adequate number of responses. We therefore base our gold standard on a large online survey of readers with ratings of novels.

Literature comprises some of the richest and most sophisticated language, yet stylometry typically does not exploit linguistic information beyond part-of-speech (POS) tags or grammar productions, when syntax is involved at all (cf. e.g., Stamatatos, 2009; Ashok et al., 2013). While our results confirm that simple features are highly effective, we also employ full syntactic analyses and argue for their usefulness. We consider tree fragments: arbitrarily-sized connected subgraphs of parse trees (Swanson and Charniak, 2012; Bergsma et al., 2012; van Cranenburgh, 2012).
Such features are central to the Data-Oriented Parsing framework (Scha, 1990; Bod, 1992), which postulates that language use derives from arbitrary chunks (e.g., syntactic tree fragments) of previous language experience. In our case, this suggests the following hypothesis.

HYPOTHESIS 1: Literary authors employ a distinctive inventory of lexico-syntactic constructions (e.g., a register) that marks literary language.

Next we provide an analysis of these constructions which supports our second hypothesis.

HYPOTHESIS 2: Literary language invokes a larger set of syntactic constructions when compared to the language of non-literary novels, and therefore more variety is observed in the parse tree fragments whose occurrence frequencies are correlated with literary ratings.

The support provided for these hypotheses suggests that the notion of literature can be explained, to a substantial extent, from textual factors, which contradicts the belief that external, social factors are more dominant than internal, textual factors.

2 Task, experimental setup

We consider a regression problem of a set of novels and their literary ratings. These ratings have been obtained in a large reader survey (about 14k participants) [1], in which 401 recent, bestselling Dutch novels (as well as works translated into Dutch) were rated on a 7-point Likert scale from definitely not to highly literary. The participants were presented with the author and title of each novel, and provided ratings for novels they had read. The ratings may have been influenced by well-known authors or titles, but this does not affect the results of this paper because the machine learning models are not given such information. The task we consider is to predict the mean [2] rating for each novel. We exclude 16 novels that have been rated by fewer than 50 participants. 91% of the remaining novels have a t-distributed 95% confidence interval < 0.5; e.g., given a mean of 3, the confidence interval typically ranges from 2.75 to 3.25. Therefore for our purposes the ratings form a reliable consensus. Novels rated as highly literary have smaller confidence intervals, i.e., show a stronger consensus. Where a binary distinction is needed, we call a rating of 5 or higher 'literary.'

[Figure 1: A parse tree fragment from Franzen, The Corrections. Dutch: "er ging iets verschrikkelijks gebeuren" (gloss: there going something terrible to-happen); original sentence: "something terrible was going to happen". Labels include SMAIN-sat:inf:pv, INF-vc:inf, NP-mod, NP-su, VNW[pron]-hd, WW[pv]-hd, ADJ-mod, and WW[inf,vrij]-hd.]

Since we aim to extract relevant features from the texts themselves and the number of novels is relatively small, we apply cross-validation, so as to exploit the data to the fullest extent while maintaining an out-of-sample approach. We divide the corpus in 5 folds of roughly equal size, with the following constraints: (a) novels by the same author must be in the same fold, since we want to rule out any influence of author style on feature selection or model validation; (b) the distribution of literary ratings in each fold should be similar to the overall distribution (stratification).

We control for length and potential particularities of the start of novels by considering sentences 1000–2000 of each novel. 18 novels with fewer than 2000 sentences are excluded. Together with the constraint of at least 50 ratings, this brings the total number of novels we consider to 369.

We evaluate the effectiveness of the features using a ridge regression model, with 5-fold cross-validation; we do not tune the regularization. The results are presented incrementally, to illustrate the contribution of each feature relative to the features before it. This makes it possible to gauge the effective contribution of each feature while taking any overlap into account.

We use R² as the evaluation metric, expressing the percentage of variance explained (perfect score 100); this shows the improvement of the predictions over a baseline model that always predicts the mean value (4.2, in this dataset). A mean baseline model is therefore defined to have an R² of 0. Other baseline models, e.g., always predicting 3.5 or 7, attain negative R² scores, since they perform worse than the mean baseline. Similarly, a random baseline will yield a negative expected R².

[1] The survey was part of The Riddle of Literary Quality, cf. http://literaryquality.huygens.knaw.nl
[2] Strictly speaking the Likert scale is ordinal and calls for the median, but the symmetric 7-point scale and the number of ratings arguably make using the mean permissible; the latter provides more granularity and sensitivity to minority ratings.
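As a concrete illustration of the metric, the following minimal Python sketch (with made-up ratings, not taken from the survey data) shows that the mean baseline scores exactly 0, while a constant prediction of 7 scores negative:

    import numpy as np

    def r_squared(y_true, y_pred):
        """R^2 = 1 - SS_res / SS_tot, reported as a percentage in the paper."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        ss_res = ((y_true - y_pred) ** 2).sum()
        ss_tot = ((y_true - y_true.mean()) ** 2).sum()
        return 100 * (1 - ss_res / ss_tot)

    ratings = np.array([2.5, 3.8, 4.2, 5.1, 6.3])          # hypothetical mean ratings
    print(r_squared(ratings, np.full(5, ratings.mean())))  # 0.0: the mean baseline
    print(r_squared(ratings, np.full(5, 7.0)))             # negative: worse than mean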
3 Basic features

Sentence length, direct speech, vocabulary richness, and compressibility are simple yet effective stylometric features. We count direct speech sentences by matching on specific punctuation; this provides a measure of the amount of dialogue versus narrative text in the novel. Vocabulary richness is defined as the proportion of words in a text that appear in the top 3000 most common words of a large reference corpus (Sonar 500; Oostdijk et al., 2013); this shows the proportion of difficult or unusual words. Compressibility is defined as the bzip2 compression ratio of the texts; the intuition is that a repetitive and predictable text will be highly compressible. CLICHES is the number of cliché expressions in the texts based on an external dataset of 6641 clichés (van Wingerden and Hendriks, 2015); clichés, being marked as informal and unoriginal, are expected to be more prevalent in non-literary texts. Table 1 shows the results of these features. Several other features were also evaluated but were either not effective or did not achieve appreciable improvements when these basic features are taken into account; notably Flesch readability (Flesch, 1948), average dependency length (Gibson, 2000), and D-level (Covington et al., 2006).

                                     R²
  MEAN SENT. LEN.                  16.4
  + % DIRECT SPEECH SENTENCES      23.1
  + TOP 3000 VOCAB.                23.5
  + BZIP2 RATIO                    24.4
  + CLICHES                        30.0

Table 1: Basic features, incremental scores.
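A minimal sketch of how these basic features could be computed; the paper does not spell out its tokenizer or the exact punctuation patterns used to detect direct speech, so the regular expressions below are assumptions:

    import bz2
    import re

    def basic_features(sentences, top3000):
        """Basic stylometric features for one novel; `sentences` is a list of
        sentence strings, `top3000` a set of common words from a reference corpus."""
        tokens = [w.lower() for s in sentences for w in re.findall(r"\w+", s)]
        # Direct speech approximated as sentences opening with a quote or dash.
        direct = sum(bool(re.match(r"[\"'\u2018\u201c\u2013-]", s.lstrip()))
                     for s in sentences)
        raw = "\n".join(sentences).encode("utf8")
        return {
            "mean_sent_len": len(tokens) / len(sentences),
            "pct_direct_speech": 100 * direct / len(sentences),
            "top3000_vocab": sum(w in top3000 for w in tokens) / len(tokens),
            "bzip2_ratio": len(bz2.compress(raw)) / len(raw),
        }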
4 Automatically induced features

In this section we consider extracting syntactic features, as well as three (sub)lexical baselines.

TOPICS is a set of 50 topic weights induced with Latent Dirichlet Allocation (LDA; Blei et al., 2003) from the corpus (for details, cf. Jautze et al., 2016). Furthermore, we use character and word n-gram features. For words, bigrams present a good tradeoff in terms of informativeness (a bigram frequency is more specific than the frequency of an individual word) and sparsity (three or more consecutive words result in a large number of n-gram types with low frequencies). For character n-grams, n = 4 achieved good performance in previous work (e.g., Stamatatos, 2006).

We note three limitations of n-grams. First, the fixed n: larger or discontiguous chunks are not extracted. Combining n-grams does not help, since a linear model cannot capture feature interactions, nor is the consecutive occurrence of two features captured in the bag-of-words representation. Second, a larger n implies a combinatorial explosion of possible features, which makes it desirable to select the most relevant features. Finally, word and character n-grams are surface features without linguistic abstraction. One way to overcome these limitations is to turn to syntactic parse trees and mine them for relevant features unrestricted in size.

Specifically, we consider tree fragments as features, which are arbitrarily-sized fragments of parse trees. If a parse tree is seen as consisting of a sequence of grammar productions, a tree fragment is a connected subsequence thereof. Compared to bag-of-words representations, tree fragments can capture both syntactic and lexical elements; and these combine to represent constructions with open slots (e.g., to take NP into account), or sentence templates (e.g., "Yes, but ...", he said). Tree fragments are thus a very rich source of features, and larger or more abstract features may prove to be more linguistically interpretable.

We present a data-driven method for extracting and selecting tree fragments. Due to combinatorics, there is an exponential number of possible fragments given a parse tree. For this reason it is not feasible to extract all fragments and select the relevant ones later; we therefore use a strategy to directly select fragments for which there is evidence of re-use, by considering commonalities in pairs of trees. This is done by extracting the largest common syntactic fragments from pairs of trees (Sangati et al., 2010; van Cranenburgh, 2014). This method is related to tree-kernel methods (Collins and Duffy, 2002; Moschitti, 2006), with the difference that it extracts an explicit set of fragments. The feature selection approach is based on relevance and redundancy (Yu and Liu, 2004), similar to Swanson and Charniak (2013). Kim et al. (2011) also use tree fragments, for authorship attribution, but with a frequent tree mining approach; the difference with our approach is that we extract the largest fragments attested in each tree pair, which are not necessarily the most frequent.
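Before turning to the syntactic pipeline, here is a minimal sketch of the two n-gram baselines with scikit-learn (an assumed implementation, not the paper's code; the toy texts are made up). Disabling idf weighting and using L1 normalization turns the raw counts into relative frequencies:

    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = ["Yes , but he said nothing .",
             "Something terrible was going to happen ."]
    word_bigrams = TfidfVectorizer(analyzer="word", ngram_range=(2, 2),
                                   use_idf=False, norm="l1")
    char_4grams = TfidfVectorizer(analyzer="char", ngram_range=(4, 4),
                                  use_idf=False, norm="l1")
    X_bigrams = word_bigrams.fit_transform(texts)   # novels x bigram types
    X_chars = char_4grams.fit_transform(texts)      # novels x char 4-gram types
    print(X_bigrams.shape, X_chars.shape)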
Thefirstpart that fragments are extracted from pairs of folds, concernsfragmentextraction: while selection is constrained to fragments that areattestedandsignificantlycorrelatedacrossthe 1. Given texts divided in folds F1...Fn, each wholetrainingset. Ci isthesetofparsetreesobtainedfrompars- Thevaluesforthethresholdswerechosenman- ingalltextsinFi. Extractthelargestcommon ually and not tuned, since the limited number of fragments of the parse trees in all pairs of novels is not enough to provide a proper tuning folds xCi,Cjy with i ă j. A common frag- set. Table2liststhenumberoffragmentsextracted ment f of parse trees t1,t2 is a connected fromfolds2–5aftereachofthesesteps. subgraph of t and t . The result is a set of 1 2 initialcandidatesthatoccurinatleasttwodif- recurringfragments 3,193,952 ferenttexts,storedseparatelyforeachpairof occursiną5%oftexts 375,514 totalfreq.ą50acrosscorpus 98,286 foldsxC ,C y. i j relevance:correlateds.t.pă0.05 30,044 2. Countoccurrencesofallfragmentsinalltexts. redundancy:|r|ă0.5 7,642 Fragmentselectionisdoneseparatelyw.r.t.each Table 2: The number of fragments in folds 2–5 test fold. Given test fold i, we consider the frag- aftereachfilteringstep. mentsfoundintrainingfoldst1..nu z i;e.g.,given n “ 5,fortestfold1weselectonlyfromthefrag- mentsandtheircountsasobservedintrainingfolds 4.3 Evaluation 2–5. Givenasetoffragmentsfromtrainingfolds, Duetothelargenumberofinducedfeatures,Sup- selectionproceedsasfollows: port Vector Regression (SVR) is more effective than ridge regression. We therefore train a linear 1. Zerocountthreshold: removefragmentsthat SVRmodelwiththesamecross-validationsetup, occur in less than 5 % of texts (too specific and feed its predictions to the ridge regression toparticularnovels);frequencythreshold: re- model (i.e., stacking). Feature counts are turned movefragmentsthatoccurlessthan50times intorelativefrequencies. Themodelhastwohyper- acrossthecorpus(tooraretoreliablydetecta parameters: C determinestheregularization,and correlationwiththeratings). (cid:15)isathresholdbeyondwhichpredictionsarecon- 2. Relevancethreshold: selectfragmentsbycon- sidered good enough during training. Instead of sidering the correlation of their counts with the literary ratings of the novels in the train- 4Ifwewereactuallytestinghypotheseswewouldneedto ing folds. Apply a simple linear regression applyBonferronicorrectiontoavoidtheFamily-WiseError duetomultiplecomparisons;however,sincetheregression asinfiniteverbs,auxiliaryverbs,propernouns,subordinating hereisonlyameanstoanend,weleavethep-valuesuncor- conjunctions,personalpronouns,andpostpositions. rected. 1 2 3 4 5 Mean Barnes: Sense of an ending WordBigrams 59.8 47.0 58.0 63.6 50.7 55.8 Murakami: 1q84 Char.4-grams 58.6 50.4 54.2 65.0 56.2 56.9 Voskuil: Buurman Fragments 61.6 53.4 58.7 65.8 46.5 57.2 Franzen: Freedom Murakami: Norwegian wood Table3: Regressionevaluation. R2 scoresonthe5 Grunberg: Huid en haar true cross-validationfolds. Voskuijl: Dorp pred_frag Smeets: Afrekening pred_bigram Ammaniti: Me and you R2 Bakker: Omweg BASICFEATURES(TABLE1) 30.0 1 2 3 4 5 6 7 +TOPICS 52.2 +BIGRAMS 59.5 Figure2: Thetennovelswiththelargestprediction +CHAR.4-GRAMS 59.9 error(usingbothfragmentsandbigrams). +FRAGMENTS 61.2 Table4: Automaticallyinducedfeatures;incremen- residual mean %Top (true- sent.%direct 3000 bzip2 talscores. Novel pred.) len. speech vocab. 
4.3 Evaluation

Due to the large number of induced features, Support Vector Regression (SVR) is more effective than ridge regression. We therefore train a linear SVR model with the same cross-validation setup, and feed its predictions to the ridge regression model (i.e., stacking). Feature counts are turned into relative frequencies. The model has two hyperparameters: C determines the regularization, and ε is a threshold beyond which predictions are considered good enough during training. Instead of tuning these parameters we pick fixed values of C=100 and ε=0, reducing regularization compared to the default of C=1 and disabling the threshold.

Cf. Table 3 for the scores. The syntactic fragments perform best, followed by char. 4-grams and word bigrams. We report scores for each of the 5 folds separately because the variance between folds is high. However, the differences between the feature types are relatively consistent. The variance is not caused by the distribution of ratings, since the folds were stratified on this. Nor can it be explained by the agreement in ratings per novel, since the 95% confidence intervals of the individual ratings for each novel were of comparable width across the folds. Lastly, author gender, genre, and whether the novel was translated do not differ markedly across the folds. It seems most likely that the novels simply differ in how predictable their ratings are from textual features.

                    1      2      3      4      5    Mean
  Word Bigrams    59.8   47.0   58.0   63.6   50.7   55.8
  Char. 4-grams   58.6   50.4   54.2   65.0   56.2   56.9
  Fragments       61.6   53.4   58.7   65.8   46.5   57.2

Table 3: Regression evaluation. R² scores on the 5 cross-validation folds.

In order to gauge to what extent these automatically induced features are complementary, we combine them in a single model together with the basic features; cf. the scores in Table 4. Both character 4-grams and syntactic fragments still provide a relatively large improvement over the previous features, taking into account the inherent diminishing returns of adding more features.

                                  R²
  BASIC FEATURES (TABLE 1)      30.0
  + TOPICS                      52.2
  + BIGRAMS                     59.5
  + CHAR. 4-GRAMS               59.9
  + FRAGMENTS                   61.2

Table 4: Automatically induced features; incremental scores.
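A minimal sketch of this stacked setup with scikit-learn (toy data; the paper's folds are additionally grouped by author and stratified by rating, which this sketch omits):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_predict
    from sklearn.svm import LinearSVR

    rng = np.random.default_rng(0)
    X_induced = rng.random((50, 500))   # e.g., fragment relative frequencies
    X_basic = rng.random((50, 5))       # the basic features of Sec. 3
    y = rng.random(50) * 6 + 1          # mean literary ratings

    # Out-of-fold SVR predictions over the induced features...
    svr = LinearSVR(C=100, epsilon=0)
    svr_pred = cross_val_predict(svr, X_induced, y, cv=5)
    # ...become one input column of the ridge model (stacking).
    X_stacked = np.column_stack([X_basic, svr_pred])
    final_pred = cross_val_predict(Ridge(), X_stacked, y, cv=5)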
[Figure 2: The ten novels with the largest prediction error (using both fragments and bigrams), showing true ratings against the predictions of the fragment and bigram models (pred_frag, pred_bigram): Barnes, The sense of an ending; Murakami, 1q84; Voskuil, Buurman; Franzen, Freedom; Murakami, Norwegian wood; Grunberg, Huid en haar; Voskuijl, Dorp; Smeets, Afrekening; Ammaniti, Me and you; Bakker, Omweg.]

Figure 2 shows a bar plot of the ten novels with the largest prediction error with the fragment and word bigram models. Of these novels, 9 are highly literary and underestimated by the model. For the other novel (Smeets, Afrekening) the literary rating is overestimated by the model. Since this top 10 is based on the mean prediction from both models, the error is large for both models. This does not change when the top 10 errors using only fragments or bigrams are inspected; i.e., the hardest novels to predict are hard with both feature types.

What could explain these errors? At first sight, there is no obvious commonality between the literary novels that are predicted well, or between the ones with a large error; e.g., whether the novels have been translated or not does not explain the error. A possible explanation is that the successfully predicted literary novels share a particular (e.g., rich) writing style that sets them apart from other novels, while the literary novels that are underestimated by the model are not marked by such a writing style. It is difficult to confirm this directly by inspecting the model, since each prediction is the sum of several thousand features, and the contributions of these features form a long tail. If we define the contribution of a feature as the absolute value of its weight times its relative frequency in the document, then in case of Barnes, The sense of an ending, the top 100 features contribute only 34% of the total prediction.

Table 5 gives the basic features for the top 4 literary novels with the largest error and contrasts them with 4 literary novels which are well predicted.

                                residual     mean    % direct   top 3000   bzip2
  Novel                      (true − pred.) sent. len. speech     vocab.   ratio
  Rosenboom: Zoete mond           0.075       23.5      24.7      0.80     0.31
  Mortier: Godenslaap             0.705       24.9      25.2      0.77     0.34
  Lewinsky: Johannistag           0.100       18.3      28.6      0.85     0.32
  Eco: The Prague cemetery        0.148       24.5      15.7      0.79     0.33
  Franzen: Freedom                2.154       16.2      56.8      0.84     0.33
  Barnes: Sense of an ending      2.143       14.1      23.1      0.85     0.32
  Voskuil: Buurman                2.117        7.66     58.0      0.89     0.28
  Murakami: 1q84                  1.870       12.3      20.4      0.84     0.32

Table 5: Comparison of baseline features for novels with good (1–4) and bad (5–8) predictions.

The most striking difference is sentence length: the underestimated literary novels have markedly shorter sentences. Voskuil and Franzen have a higher proportion of direct speech (they are in fact the only literary novels in the top 10 novels with the most direct speech). Lastly, the underestimated novels have a higher proportion of common words (lower vocabulary richness). These observations are compatible with the explanation suggested above, that a subset of the literary novels share a simple, readable writing style with non-literary novels. Such a style may be more difficult to detect than a literary style with long and complex sentences, or rich vocabulary and phraseology, because a simple, well-crafted sentence may not offer overt surface markers of stylization. Book reviews appear to support this notion for The sense of an ending: "A slow burn, measured but suspenseful, this compact novel makes every slyly crafted sentence count" (Tonkin, 2011); and "polished phrasings, elegant verbal exactness and epigrammatic perceptions" (Kemp, 2011).

In order to test whether the amount of data is sufficient to learn to predict the ratings, we construct a learning curve for different training set sizes; cf. Figure 3. The set of novels is shuffled once, so that initial segments of different size represent random samples. The novels are sampled in 5% increments (i.e., 20 models are trained). The graphs show the cross-validated scores.

[Figure 3: Learning curve when varying training set size (x-axis: proportion of the training set; y-axis: cross-validated R² score). The error bars show the standard error.]

The graphs show that increasing the number of novels has a large effect on performance. The curve is steep up to 30% of the training set, and the performance keeps improving steadily but more slowly up to the last data point. Since the performance is relatively flat starting from 85%, we can conclude that the k-fold cross-validation with k = 5 provides an adequate estimate of the model's performance as if it were trained on the full dataset; if the model were still gaining performance significantly with more training data, the cross-validation score would underestimate the true prediction performance.

A similar experiment was performed varying the number of features. Here the performance plateaus quickly, reaching an R² of 53.0 with 40% of the features, and grows only slightly from that point.

5 Metadata features

In addition to textual features, we also include three (categorical) metadata features not extracted from the text, but still an inherent feature of the novel in question: GENRE, TRANSLATED, and AUTHOR GENDER; cf. Table 6 for the results. Figure 4 shows a visualization of the predictions in a scatter plot.

                                        R²
  BASIC FEATURES (TABLE 1)            30.0
  + AUTO. INDUCED FEAT. (TABLE 4)     61.2
  + GENRE                             74.3
  + TRANSLATED                        74.0
  + AUTHOR GENDER                     76.0

Table 6: Metadata features; incremental scores.

GENRE is the coarse genre classification Fiction, Suspense, Romantic, Other, derived from the publisher's categorization. Genre alone is already a strong predictor, with an R² of 58.3 on its own. However, this score is arguably misleading, because the predictions are very coarse due to the discrete nature of the feature.

A striking result is that the variables AUTHOR GENDER and TRANSLATED increase the score, but only when they are both present. Inspecting the mean ratings shows that translated novels by female authors have an average rating of 3.8, while originally Dutch male authors are rated 5.0 on average; the ratings of the other combinations lie in between these extremes. This explains why the combination works better than either feature on its own, but due to possible biases inherent in the makeup of the corpus, such as which female or translated authors are published and selected for the corpus, no conclusions on the influence of gender or translation should be drawn from these data points.
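Since the final model is linear, the categorical metadata must be encoded as indicator variables; a minimal sketch with pandas (the column names and values are illustrative, not the corpus metadata):

    import pandas as pd

    meta = pd.DataFrame({
        "genre": ["Fiction", "Suspense", "Romantic", "Other"],
        "translated": [True, False, True, False],
        "author_gender": ["female", "male", "female", "male"],
    })
    # One-hot encoding gives each category its own binary column,
    # so the linear model can assign a separate weight per category.
    X_meta = pd.get_dummies(meta, columns=["genre", "translated", "author_gender"])
    print(X_meta.columns.tolist())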
For JoLheawninnsisktyag GoMdoerntiselarap 7 Fiction Suspense Rosenboom Zoete mond Romantic Koch Other The dinner Gilbert 6 Eat pray love French Blue monday s ment5 AfreSkmeneientgs SeBnasren eosf an g ending der jud Donoghue FFrreaendzoemn a4 Room e d r Stockett e The help ct di e pr3 Fifty shaJdaemse osf Grey Baldacci Hell's corner Kinsella Remember me? 2 1 1 2 3 4 5 6 7 actual reader judgments Figure 4: A scatter plot of regression predictions and actual literary ratings. Original/translated titles. Notethehistogramsbesidetheaxesshowingthedistributionofratings(top)andpredictions(right). example, regression is a more difficult task than Binary Dataset,task Acc. classification binary classification, and recognizing the differ- Louwerseetal. 119all-timeliteraryclassics 87.4 encebetweenanaverageandhighlyliterarynovel (2008) and55othertexts,literary ismoredifficultthandistinguishingeitherfroma novelsvs.non-fiction/sci-fi differentdomainorgenre(e.g.,newswire). Ashoketal.(2013) 80019thcenturynovels, 75.7 lowvs.highdownload Louwerse et al. (2008) discriminate literature count from other texts using Latent Semantic Analysis. vanCranenburgh 146recentnovels,lowvs. 90.4 Ashok et al. (2013) use bigrams, POS tags, and andKoolen(2015) highsurveyratings grammarproductionstopredictthepopularityof Regressionresult Dataset,task R2 Gutenberg texts. van Cranenburgh and Koolen vanCranenburgh 146recentnovels, 61.3 (2015) predict the literary ratings of texts, as in andKoolen(2015) surveyratings thepresentpaper,butonlyusingbigrams,andona Thiswork 401recentnovels, 76.0 surveyratings smaller,lessdiversecorpus. Comparedtoprevious work, thispapergivesamorepreciseestimateof Table7: Overviewofpreviousworkonmodeling how well shades of literariness can be predicted (literary)qualityofnovels. fromadiverserangeoffeatures,includinglarger andmoreabstractsyntacticconstructions. by the correlation metric, as extracted from the 7 Analysisofselectedtreefragments firstfold. Thefirstfragmentshowsanincomplete Anadvantageofparsetreefragmentsisthatthey constituent, indicated by the ellipses as first and offer opportunities for interpretation in terms of lastleaves. Suchincompletefragmentsaremade linguistic aspects as well as basic distributional possiblebythebinarizationscheme(cf.Sec.4.1). aspectssuchasshapeandsize. Table8showsabreakdownoffragmenttypesin Figure 5 shows three fragments ranked highly thefirstfold. Incontrastwithn-grams,wealsosee ROOT PP-mod NP-obj1 SMAIN NP-obj1 ...N-hd LET ... LET NP-su SMAIN LET VZ-hd LID-det N-hd ... , ’ ... ... . ... een ... 120 r=0.526 12 r=-0.417 60 r= 0.4 100 10 50 nt nt nt ou 80 ou 8 ou40 c c c ent 60 ent 6 ent 30 m m m g g 4 g a 40 a a20 fr fr fr 2 20 10 0 0 0 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 mean literary rating mean literary rating mean literary rating Figure 5: Three fragments whose frequencies in the first fold have a high correlation with the literary ratings. Notethedifferentscalesonthey-axis. Fromlefttoright;Blue: complexNPwithcomma;Green: quotedspeech;Red: AdjunctPPwithindefinitearticle. fullylexicalized 1,321 1400 syntactic(nolexicalitems) 2,283 positive corr. mixed 4,038 1200 negative corr. discontinuous 684 ments1000 dtoistaclontinuoussubstitutionsite 7,369462 mber of frag 680000 nu 400 Table8: Breakdownoffragmenttypesselectedin 200 thefirstfold. 0 1 3 5 7 9 11 13 15 17 19 21 fragment size (non-terminals) Figure6: Breakdownbyfragmentsize(numberof a large proportion of purely syntactic fragments, non-terminals). 
In the case of discontinuous fragments, it turns out that the majority has a positive correlation; this might be due to their being associated with more complex constructions.

Figure 6 shows a breakdown by fragment size (defined as number of non-terminals), distinguishing fragments that are positively versus negatively correlated with the literary ratings.

[Figure 6: Breakdown by fragment size (x-axis: number of non-terminals, 1–21; y-axis: number of fragments), split into positively and negatively correlated fragments.]

Note that 1 and 3 are special cases corresponding to lexical productions (e.g., DT → the) and binary grammar productions (e.g., NP → DT N), respectively. The fragments with 2, 4, and 6 non-terminals are not as common because an even number implies the presence of unary nodes. Except for fragments of size 1, the frontier of fragments can consist of either substitution sites or terminals (since we distinguish only the number of non-terminals). On the one hand, smaller fragments corresponding to one or two grammar productions are most common, and are predominantly positively correlated with the literary ratings. On the other hand, there is a significant negative correlation between fragment size and literary ratings (r = −0.2, p < 0.001); i.e., smaller fragments tend to be positively correlated with the literary ratings.

It is striking that there are more positively than negatively correlated fragments, while literary novels are a minority in the corpus (88 out of 369 novels are rated 5 or higher). Additionally, the breakdown by size shows that the larger number of positively correlated fragments is due to a large number of small fragments of size 3 and 5; however, combinatorially, the number of possible fragment types grows exponentially with size (as reflected in the initial set of recurring fragments), so larger fragment types would be expected to be more numerous. In effect, the selected negatively correlated fragments ignore this distribution by being relatively uniform with respect to size, while the literary fragments actually show the opposite distribution.
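As a small illustration of the size measure used in Figure 6, assuming fragments in a bracketed tree notation (the actual representation in the paper's toolchain may differ):

    import re

    def fragment_size(fragment):
        """Fragment size as in Figure 6: the number of non-terminals,
        counted as opening brackets followed by a label."""
        return len(re.findall(r"\(\s*[^\s()]+", fragment))

    # A hypothetical fragment with a substitution site (NP-su) and two terminals:
    print(fragment_size("(SMAIN (NP-su ) (WW-hd ging) (INF-vc gebeuren))"))  # 4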
These erarylanguagetendstousealargersetofsyntac- fragments contain indicators of unusual or more ticconstructionsthanthelanguageofnon-literary complexsentencestructure: novels. Thisalsoprovidesevidenceforthehypoth- esisthatliteratureemploysaspecificinventoryof • DU,dp: discoursephenomenaforwhichno constructions. All evidence points to a notion of specific relation can be assigned (e.g., dis- literature which to a substantial extent can be ex- courserelationsbeyondthesentencelevel). plainedpurelyfrominternal,textualfactors,rather • appositiveNPs,e.g.,‘Johntheartist.’ thanbeingdeterminedbyexternal,socialfactors. • a complex NP, e.g., containing punctuation, nestedNPs,orPPs. Code and details of the experimental setup • anNPcontaininganadjectiveusednominally are available at https://github.com/ oraninfinitiveverb. andreasvc/literariness Ontheotherhand,mostnon-literaryfragmentsare Acknowledgments top-levelproductionscontainingROOTorclause- levellabels,forexampletointroducedirectspeech. Anotherwayofanalyzingtheselectedfragments We are grateful to David Hoover, Patrick Juola, is by frequency. When we consider the total fre- CorinaKoolen,LauraKallmeyer,andthereview- quenciesofselectedfragmentsacrossthecorpus, ers for feedback. This work is part of The Rid- thereisarangeof50to107,270. Thebulkoffrag- dleofLiteraryQuality,aprojectsupportedbythe mentshavealowfrequency(beforefragmentselec- RoyalNetherlandsAcademyofArtsandSciences tion2isbyfarthemostdominantfrequency),but through the Computational Humanities Program. thetailisverylong. Exceptforthefactthatthereis In addition, part of the work on this paper was alargernumberofpositivelycorrelatedfragments, fundedbytheGermanResearchFoundationDFG thehistogramshaveaverysimilarshape. (DeutscheForschungsgemeinschaft). Lastly,Figure7showsabreakdownbythesyn- References Peter Kemp. 2011. The sense of an ending by Julian Barnes. Book review, The Sunday Times, Vikas Ashok, Song Feng, and Yejin Choi. 2013. July 24. http://www.thesundaytimes. Success with style: using writing style to pre- co.uk/sto/culture/books/fiction/ dict the success of novels. In Proceedings of article674085.ece. EMNLP, pages 1753–1764. http://aclweb. org/anthology/D13-1181. Sangkyum Kim, Hyungsul Kim, Tim Weninger, and Jiawei Han. 2011. Authorship classification: A ChrisBaldick. 2008. Literariness. InTheOxfordDic- syntactic tree mining approach. In Proceedings of tionary of Literary Terms. Oxford University Press, SIGIR, pages 455–464. ACM. http://dx.doi. USA. org/10.1145/2009916.2009979. ShaneBergsma,MattPost,andDavidYarowsky. 2012. DanKleinandChristopherD.Manning. 2003. Accu- Stylometric analysis of scientific articles. In Pro- rate unlexicalized parsing. In Proceedings of ACL, ceedings of NAACL, pages 327–337. http:// volume 1, pages 423–430. http://aclweb. aclweb.org/anthology/N12-1033. org/anthology/P03-1054. David M. Blei, Andrew Y. Ng, and Michael I. Jor- Max Louwerse, Nick Benesh, and Bin Zhang. 2008. dan. 2003. Latent Dirichlet allocation. the Computationally discriminating literary from non- Journal of machine Learning research, 3:993– literarytexts. InS.Zyngier,M.Bortolussi,A.Ches- 1022. http://www.jmlr.org/papers/ nokova, and J. Auracher, editors, Directions in em- volume3/blei03a/blei03a.pdf. piricalliterarystudies: InhonorofWillieVanPeer, pages 175–191. John Benjamins Publishing Com- RensBod. 1992. Acomputationalmodeloflanguage pany,Amsterdam. performance: Data-oriented parsing. In Proceed- ingsCOLING,pages855–859. http://aclweb. RonanMcDonald. 2007. Thedeathofthecritic. 
Alessandro Moschitti. 2006. Making tree kernels practical for natural language learning. In Proceedings of EACL, pages 113–120. http://aclweb.org/anthology/E06-1015.

Nelleke Oostdijk, Martin Reynaert, Véronique Hoste, and Ineke Schuurman. 2013. The construction of a 500-million-word reference corpus of contemporary written Dutch. In Essential speech and language technology for Dutch, pages 219–247. Springer.

Federico Sangati, Willem Zuidema, and Rens Bod. 2010. Efficiently extract recurring tree fragments from large treebanks. In Proceedings of LREC, pages 219–226. http://dare.uva.nl/record/371504.

Remko Scha. 1990. Language theory and language technology; competence and performance. In Q.A.M. de Kort and G.L.J. Leerdam, editors, Computertoepassingen in de Neerlandistiek, pages 7–22. LVVN, Almere, the Netherlands. Original title: Taaltheorie en taaltechnologie; competence en performance. English translation: http://iaaa.nl/rs/LeerdamE.html.

Efstathios Stamatatos. 2006. Ensemble-based author identification using character n-grams. In Proceedings of the 3rd International Workshop on Text-based Information Retrieval, pages 41–46. http://ceur-ws.org/Vol-205/paper8.pdf.

Efstathios Stamatatos. 2009. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556. http://dx.doi.org/10.1002/asi.21001.
