EPJ manuscript No. (will be inserted by the editor)

Adaptation of fictional and online conversations to communication media

Christian M. Alis and May T. Lim

National Institute of Physics, University of the Philippines Diliman, 1101 Quezon City, Philippines

arXiv:1301.1429v1 [physics.soc-ph] 8 Jan 2013

Received: date / Revised version: date

Abstract. Conversations allow the quick transfer of short bits of information, and it is reasonable to expect that changes in communication medium affect how we converse. Using conversations in works of fiction and in an online social networking platform, we show that the utterance length of conversations is slowly shortening with time but adapts more strongly to the constraints of the communication medium. This indicates that the introduction of any new medium of communication can affect the way natural language evolves.

PACS. 89.65.Ef Social organizations; anthropology – 89.20.-a Interdisciplinary applications of physics

1 Introduction

With an estimated vocabulary size of 20,000 to 40,000 base words [1,2,3], conversations quickly transfer short bits of information via two general means: the oral and the written form. Although the written vocabulary is often larger [4], the grammatically looser and more error-prone oral medium has the advantage of access to nonverbal cues like gestures and intonations [5] to aid communication. Aside from vocabulary size, word choices, unconsciously repeated words, and other idiosyncrasies [6] also affect the way we perceive conversations.

Conversation analysis typically looks into how turn-taking patterns in institutional settings depart from those observed in informal conversations [7], or into the psychological or sociological aspects [8] of social structure. In this work, the length distribution of a single speaking turn, or utterance, was derived to determine if the medium affects the way we express ideas. The datasets include a mix of real-world (online) and fictional (offline) conversations: online conversations in Twitter (twitter.com); conversations from 19th century novels and short stories; and subtitles from 20th century movies.

Humans typically converse orally, thus the analysis of conversations is usually performed by transcribing recorded audio conversations into text. When this is not possible, e.g., before the invention of recorded audio, one technique is to use written records of real and constructed conversations, as was done in studies on the emergence of complementary clauses (Paul persuaded John to kiss Mary) [9], the use of do in negative declaratives (I do not understand you) [10], and the increasing prevalence of the modals gonna, gotta and wanna [11]. Written records of spoken speech are also included in corpora like A Corpus of English Dialogues 1560–1760 [12] and The Corpus of Historical American English: 400 million words, 1810–2009 [13]. However, only conversations (fictional dialogues) in novels, short stories, and movies were analyzed in this paper because their utterances tend to be less narrative and are directed to another person, unlike in other genres like drama comedies or trial transcripts. Although it has been shown that styles vary across and even within authors [14], we assumed that conversations in their works are mostly independent of the author's style, i.e., a conversation in their works conveys how another person (a character), and not how the author, speaks. Furthermore, errors due to transcribing are practically eliminated when using books and movies.

Twitter, as a form of computer-mediated communication, is different from oral or written media [15]. While assumed to be happening in real time, the purely written nature of a Twitter-based conversation differentiates it from the transcribed oral communication in books and movies. In addition, Twitter conversations have an explicit length limit: an utterance can only be up to 140 characters long.

Imposing a length constraint at the outset would be expected to produce drastic changes. A case in point is SMS messages. At its peak, textspeak looked very different from standard spelling, primarily due to the effort it takes to spell out words through a numerical keypad. Tweets, however, were largely spared from this phenomenon and usually have correct spelling. Among the three media analyzed in this study, Twitter is the only one that is constrained. Conversations in books and movies are supposedly oral conversations that were written down in the form of a book or a subtitle, so their written form should have no effect on them.

We now argue that if conversations are independent of medium, then no significant difference should be observed among conversations in Twitter, books and movies. On the other hand, if differences in a medium are due to an explicit quirk in the medium, e.g., an utterance length limit, then conversations in Twitter must be significantly different from conversations in books and movies, but the latter two should not be significantly different from each other. Finally, if conversations are indeed dependent on medium, then conversations in Twitter, movies, and books must be significantly different from each other.

2 Orthographic sentence length and the Brown corpus

The study of sentence lengths in text dates back to the 1939 paper of Udny Yule [16], where it was used to establish authorship. More recently, sentence length has been used to classify text genre by itself [17] or in combination with other text properties [18]. Yule's 1939 paper did not provide the sentence length distribution, but several decades after its publication the distribution was described as lognormal [19,20,21], which was later shown by Sichel [22,23] to be flawed. More recently, Sigurd et al. [24] showed that sentence length distributions may be approximated by a gamma distribution.

In this work, we used the non-standard unit of number of characters (orthographic length), instead of the usual sentence length units of clauses or words, in measuring utterance lengths for ease of comparison with Twitter, which has a maximum utterance length in terms of characters. Although the distribution of sentence lengths in terms of words and of word lengths in terms of letters can be described by a gamma distribution [24], there is no mathematical guarantee that the distribution of sentence lengths in terms of letters would also follow the same distribution in the general case of different shape and scale parameters of the sentence length (in words) and word length (in letters) distributions. If it can be shown that the sentence length (in letters) distribution can be approximated by a member of the same distribution family as the sentence length (in words), then comparing sentence lengths using orthographic length is a valid approach.

The Brown corpus [25] consists of about one million words of edited English prose printed during 1961 in the United States [26]. To verify whether sentence lengths measured in characters may be approximated by a gamma distribution, the sentence length (in letters) distribution was simulated as follows. The word length (in letters) and sentence length (in words) distributions of the tagged Brown corpus were first constructed using the Natural Language Toolkit [27] Python module. In constructing the distributions, only words that contain at least one letter were considered. A gamma distribution given by

\[ \Pr(x) = \frac{x^{\alpha-1} e^{-x/s}}{s^{\alpha}\,\Gamma(\alpha)}, \tag{1} \]

where α and s are fitting parameters that describe the shape and ordinate scaling factor, respectively, was then fitted to each distribution using the maximum likelihood estimation [28] feature of the SciPy Python module [29]. For each trial, 100,000 sentences were generated following the fitted sentence length (in words) and word length (in letters) distributions. This process was repeated for a total of 100 trials, resulting in 100 histograms of sentence length in letters. The histograms were converted to a single probability distribution by using the median frequency for each sentence length.

Fig. 1. (a) Word length in letters and (b) sentence length in words distributions of the Brown corpus superimposed with the maximum likelihood estimate of Eq. (1) (solid line). (c) Simulated sentence length (solid dots) in letters distribution using the fitted word length (in letters) and sentence length (in words) distributions of the Brown corpus superimposed with the least-squares fit (solid line) and values within one standard deviation (shaded).

Both the word length (WL, in letters) [Fig. 1(a)] and sentence length (SL, in words) [Fig. 1(b)] distributions of the Brown corpus follow a gamma distribution (WL: α = 3.43, s = 1.39, r^2 = 0.948; SL: α = 2.09, s = 8.44, r^2 = 0.989). The simulated sentence length in letters distribution [Fig. 1(c)] also follows a gamma distribution (α = 1.98, s = 51.2) but has a much larger s than the sentence length in words, which is expected since letters are a smaller syntactic unit than words.

The sentence length distribution in letters thus belongs to the same family of distributions as when measured in words. Since utterance lengths are being compared empirically, the use of orthographic length as a unit of utterance length is therefore valid despite known idiosyncrasies [30] of the English language. Interestingly, Piantadosi et al. [31] also used orthographic length when they showed that word lengths are optimized for efficient communication, because it is easier to measure while remaining highly correlated with word length in terms of syllables [32].

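The simulation of Sec. 2 can be sketched in a few lines of Python. This is a minimal illustration and not the original analysis code: it takes the Brown corpus gamma parameters quoted in the text as given instead of refitting them with NLTK, runs a single trial of 20,000 sentences rather than 100 trials of 100,000, and rounds each gamma draw to an integer count of at least one, which is an assumption since the text does not say how draws were discretized. The helper names `draw_counts` and `simulate_sentence_letters` are hypothetical.

```python
import numpy as np
from scipy import stats

# Gamma parameters reported in the text for the Brown corpus.
WL_SHAPE, WL_SCALE = 3.43, 1.39   # word length, in letters
SL_SHAPE, SL_SCALE = 2.09, 8.44   # sentence length, in words


def draw_counts(rng, shape, scale, size):
    """Draw gamma variates and round them to integer counts of at least 1
    (the rounding rule is an assumption, not taken from the paper)."""
    return np.maximum(1, np.rint(rng.gamma(shape, scale, size))).astype(int)


def simulate_sentence_letters(n_sentences, rng):
    """Return the number of letters in each of n_sentences simulated sentences."""
    words_per_sentence = draw_counts(rng, SL_SHAPE, SL_SCALE, n_sentences)
    letters_per_word = draw_counts(
        rng, WL_SHAPE, WL_SCALE, words_per_sentence.sum()
    )
    # Label every word with the index of its sentence, then sum letters per label.
    sentence_id = np.repeat(np.arange(n_sentences), words_per_sentence)
    return np.bincount(
        sentence_id, weights=letters_per_word, minlength=n_sentences
    )


rng = np.random.default_rng(0)
letters = simulate_sentence_letters(20_000, rng)

# Maximum likelihood gamma fit with the location pinned at zero, as in Eq. (1).
alpha, _, s = stats.gamma.fit(letters, floc=0)
print(f"alpha = {alpha:.2f}, s = {s:.1f}")
```

The fitted scale should come out several times larger than the sentence length (in words) scale, in line with the qualitative statement in the text. The exact values will differ from those reported because the sketch fits a single trial by maximum likelihood, whereas Fig. 1(c) describes a least-squares fit to the median histogram over 100 trials.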
3 Datasets

Four datasets were used for our analysis: utterances in fictional works in Project Gutenberg (pg) (gutenberg.org), utterances in pg split into sentences (pgs), tweets from Twitter (twitter), and utterances in movie subtitles (subs) from opensubtitles.org.

pg was generated by extracting utterances, defined as text enclosed in double quotes, from the available works in Project Gutenberg of 50 authors whose selection was roughly based on availability (see Ref. [33] for the list of titles, and Ref. [34] for author selection and text parsing details). The resulting dataset consists of about 2.3 million utterances, with zero-length utterances (0.01% of the original dataset) removed. The author with the most utterances (George Manville Fenn) has 238,640 utterances while the author with the fewest (David Herbert Lawrence) has 1,170 utterances. The median number of utterances is 36,955 per author. When split into sentences, pg is converted to pgs, which has about 4.2 million utterances with a median of 69,311 utterances per author.

Conversations in twitter were identified by looking for replies, which are Twitter messages (or tweets) directed to specific users. We used the convention that replies begin with the @username of the receiver, e.g., @bob Hello! How are you?, to filter the tweets for our dataset (see Footnote 1). Though not in the original design, the use of replies emerged as the leading method of addressing a particular person in Twitter [35]. The presence of an @username anywhere in the tweet makes that tweet a mention [36]. Unlike mentions, which appear in the timeline of a user following the sender, a reply appears in said user's timeline only if he follows both sender and receiver of the reply message. Thus, conversations are most likely restricted to replies to avoid flooding the timeline of people not involved in the discussion. Though mentions may carry conversations, we still excluded them from the dataset, as they are more likely to be non-conversational tweets.

Footnote 1: The current Twitter API supports a method for explicitly classifying a tweet as a reply, but this was not yet widely available and followed when our data were gathered.

It is possible that a reply is not reciprocated, e.g., if it was meant to bring an item, such as a URL, to the attention of another user. This is still considered a conversation because it conveys a short bit of information directly targeted to a certain user. This is similar to someone telling another to "watch out!" or "be careful": a reply by the other person is not required.

Using the Twitter Streaming application programming interface (API) [37], five one-week samples of public tweets from September 2009 to July 2010 were selected. From the one-week samples, composed of around 16.2 million to 57.6 million tweets representing about 15% of public tweets [37], nonzero-length messages were extracted, which yielded about 52 million messages or utterances (see Ref. [34] for datasets and parsing details). For better comparison with pg and pgs, which have 50 subsets (authors) each, the weekly datasets were subdivided into ten groups of shuffled hourly data.

subs consists of about 14.7 million utterances from 15,809 movies provided by opensubtitles.org. The movie release years span from 1896 to 2010. See Ref. [34] for parsing details and Ref. [38] for the complete list of movies.

4 Utterance length distributions of datasets

Twitter conversations [Fig. 2(a)] have an asymmetric and bimodal utterance length distribution. The left peak (mode) at 16 characters belongs to what we take to be the natural distribution of message lengths, i.e., the distribution of an unrestricted conversation. Similar to the argument used by Sigurd et al. [24] in their study of word and sentence length distributions of English, Swedish and German texts, and by Cancho and Solé [39] in their work on the origin of Zipf's law, we posit that the length of an utterance in a conversation is also governed by a trade-off between packing as much information as possible in an utterance and expressing the utterance as quickly as possible: the first objective is biased towards increasing length (∼ x^{α−1}) while the other is biased towards decreasing it (∼ e^{−x}). Combining the two objectives, the following distribution is obtained: ∼ x^{α−1} e^{−x}.

Fig. 2. (a) Message length distribution of sampled tweets with the curve fit having the highest r^2 value (α = 1.37, solid line). Error bars are standard deviations from five one-week samples. (b) The α values (filled squares) of the fit from x = 0 to x_c using Eq. (2) and the corresponding r^2 (unfilled triangles).

To account for a strict length limit for Twitter messages, the natural utterance length distribution was estimated by fitting a more general equation using a modified Levenberg-Marquardt least squares algorithm [29] to the utterance length distribution from x = 0 to a cut-off length x_c ∈ [16, 140] [Fig. 2(b)],

\[ \Pr(x) = \frac{\tilde{x}^{\alpha-1} e^{-\tilde{x}}}{\Gamma(\alpha)}, \tag{2} \]

where \tilde{x} = (x − x_0)/s is the scaled utterance length x, while α, x_0 and s are fitting parameters that describe the shape, translation and ordinate scaling factor, respectively. This method of estimation assumes that the mixing parameter of the bimodal distribution is almost one in favor of the natural utterance length distribution. A bimodal distribution fitted using expectation maximization was not utilized because of the lack of an explicit model of the truncation distribution. Our goal is to estimate the median of the natural utterance length distribution, so a resulting non-normalized unimodal distribution is acceptable.

When α approaches one, Eq. (2) approaches an exponential distribution. The range of acceptable values α ∈ [1.1, 1.6] [r^2 ∈ (0.86, 0.93)] for the Twitter dataset corresponds to a 57-order-of-magnitude increase in the likelihood of finding an utterance length of x = x_c = 140 chars. compared to an exponentially decaying curve in the absence of a Twitter-imposed limit (see Ref. [34] for the distributions of the fitting parameters). However, another peak was found at 124 characters due to the 140-character limit, a limit that is absent in the other datasets, and is attributed to various tweet-shortening schemes. The absence of a length limit results in unimodal utterance length distributions for pg, pgs and subs [Fig. 3].

Fig. 3. Utterance length distributions of (a) different authors in pg, (b) different authors in pgs and (c) 50 randomly selected movies in subs. Distribution of utterance lengths over the entire (d) pg, (e) pgs and (f) subs datasets fitted with Eq. (2).

Conversations in movies (interquartile range IQR = difference between the 3rd and 1st quartiles = 21 chars.) are of more uniform length than those in books (pg IQR median = 88 chars., pgs IQR median = 50 chars.). The much smaller subs IQR median compared to that of twitter (IQR median = 46 chars.) or that of its best fit of Eq. (2) (IQR median = 50 chars.) suggests that conversations in movies are less dependent on author style, while the much larger IQR medians of pg and pgs point to a stronger dependence of these media on author style.

To minimize the effect of unequal numbers of utterances per author or movie, and of noise due to differences in spelling and punctuation, Eq. (2) was fitted to pg, pgs and subs by computing the normalized histogram of each author or movie and then using the average probability for each utterance length as the probability density function to be fitted using least squares. Based on the fit of Eq. (2) (α = 1.48, x_0 = 0.862, s = 34.4, r^2 = 0.984), the pgs utterance length distribution [Fig. 3(e)] seems to be a horizontally compressed version of the twitter best fit curve (α = 1.37, x_0 = 0.86, s = 36.4) because of a smaller s value. The pg utterance length distribution has a fatter tail [1 − F(140) = 0.0896; Fig. 3(d)] than that of the pgs utterance length distribution (1 − F(140) = 0.0427), and only its tail fits Eq. (2) quite well (α = 1.24, x_0 = 2.63, s = 48.6, r^2 = 0.970). In contrast, the entire subs median distribution [α = 2.71, x_0 = 0.87, s = 10.7, r^2 = 0.988; Fig. 3(f)] fits Eq. (2) and has almost no tail (1 − F(140) = 1.19 × 10^{−4}). Thus, all datasets share the same distribution family as the Brown sentence length in words distribution, further giving credence to the validity of the use of characters as a unit of utterance length.

The mean length of utterance (MLU) is used to evaluate the level of language development of a child [40,41]. However, the use of the mean as a measure of central tendency is invalid here because the utterance length distribution is very skewed to the right. The mode of a gamma distribution [Eq. (2)] is given by (α − 1)s + x_0, but it does not appear to be correlated with s [Fig. 4(a)]. In contrast, the median, though it has no closed-form equation for a gamma distribution, appears to be more correlated with s [Fig. 4(b)]: a larger median roughly implies a larger spread. The median, therefore, allows us to simultaneously describe both the location and scale of the utterance length distribution.

Fig. 4. Mode and median of the distribution fits. (a) Mode and (b) median of the fit of each distribution plotted against s.

For the rest of this paper, the median utterance length and the median of these medians (the median median utterance length) were used to describe each utterance length distribution. These measures are suitable for comparison between datasets because both are insensitive to outliers (robust) and do not assume a distribution (nonparametric). Any author dependence or deviation of the data from a gamma distribution would therefore not affect the results [34]. Tests for significant differences were performed using the Mann-Whitney U test [42] with continuity correction because the distributions being compared are discrete and skewed.

5 Utterance length and sample size

twitter, pgs and subs were subsampled (with replacement) such that the sample sizes match each author's sample size in pg. By taking the distribution of subsample medians (Fig. 5), which is analogous to taking the distribution of sample means from normally-distributed data, we found that the median median utterance length (analogous to the mean of sample means) of subs (25 chars.) is very different from that of twitter (38 chars.), pg (48 chars.) and pgs (41 chars.).

Fig. 5. Distribution of median utterance lengths (median median utterance length: dashed lines) for (a) pg, (b) subs, (c) pgs and (d) twitter. The median utterance length data in (d) was estimated from the natural utterance length distribution of each twitter subset.

Notably, the median median utterance length of subs of 25 chars., which is not related to the existing maximum subtitle line length of 32-34 characters (Ofcom regulation [43]), points to a fundamental difference in how the verbal medium is used in movies.

The median utterance length distributions of all datasets are significantly different from each other (see Ref. [34] for complete test results between each pair of datasets). Since the pgs median distribution is significantly different from the subs median distribution, conversational sentences in books are not the same as conversational sentences in movies, though we posit that conversations in movies are closer to actual transcribed speech. twitter utterance lengths are stochastically smaller than those of pg and pgs but differ significantly from those of subs, suggesting that Twitter is a less formal medium. We surmise that the smaller length is due to the more spontaneous and less formal tone of Twitter conversations compared to those in books.

To investigate the effect of sample size N on the median utterance length, each dataset was sampled (with replacement) into 50 groups each having N utterances. Similar to word frequency distributions that are dependent on N [44], the spread in, but not the location of, the medians distribution decreases as N increases (Fig. 6) for all datasets. At N = 10^5 utterances, the median value of subs collapsed to a single value of 25 characters. At N = 10^6 utterances, pg and pgs collapsed to different single median utterance length values of 48 and 41 characters, respectively, while twitter falls into two unique values of 38 and 39 characters.

Fig. 6. Distribution of median utterance length in subsampled twitter (black), pg (dark gray), pgs (light gray) and subs (unfilled).

The median utterance length distribution of subs is very different from that of the other datasets: it can be clearly distinguished from them even if the sample size is only N = 100 utterances (Fig. 6). The pg and pgs median utterance length distributions are already distinguishable from each other at N = 100 utterances, but both overlap with twitter. The pg, pgs and twitter median utterance length distributions cease to overlap only at N = 10^4 utterances, which gives the minimum sample size required for meaningful comparison across communication media as a function of time (see Ref. [34] for complete test results).

6 Utterance length through time

The median median utterance length in both pg [Fig. 7(a)] (slope = −0.266 chars./yr, r^2 = 0.903, p < 10^{−3} two-sided) and pgs [Fig. 7(b)] (slope = −0.189 chars./yr, r^2 = 0.814, p < 10^{−3} two-sided) decreases with time but is not correlated with size (pg Spearman ρ^2 < 10^{−3}; pgs Spearman ρ^2 = 0.00524).

On the other hand, the median utterance length of subs [Fig. 7(c)] remains almost constant (∼ 27 chars.) in time (slope = −1.897 × 10^{−3} chars./yr, r^2 = 0.121, p < 10^{−3} two-sided) except for a conspicuous rise and increased spread in the median utterance length at around 1920 that does not flatten out even if the window size is increased from 1 year to 5 years [Fig. 7(d)]. The bump is likely due to the availability of "talking pictures" and commercial television starting in the late 1920s. The silent movies prior to their release have a different "conversation signature" from those of "talkies".

The temporal behavior of twitter was not studied because twitter spans only a few weeks.

Fig. 7. Median utterance length distribution of (a) pg and (b) pgs with window size of 10 years, and subs with window size of (c) 1 year and (d) 5 years. Only books with at least 1,000 utterances were considered. Publication years were retrieved from the US Library of Congress. The window sizes were selected so that the plots do not change appreciably when the window size is varied slightly. First to third quartiles (shaded), pg median median utterance length (a-b, solid line), pgs median median utterance length (a-b, dashed line).

7 Conclusion

Though we do not usually notice the medium-dependence of conversations, we showed that conversations, as measured by orthographic utterance length, are slowly shortening in time within media but are drastically different across different media. These are fundamental differences that are effects not just of the milieu, but of the medium itself. Evolving technologies that lead to changes in communication media seemingly lead us to adapt our conversations, rather than such a technology suffering an early demise because it cannot adapt to our natural use of language. An extreme case in point is the short message service (SMS) or "texting." It was originally designed with a character limit of 160 such that most sentences would fit in a single text message [45], and with an "access a letter via numerical keypad" constraint; it nevertheless became a popular form of communication [46] with its own lingo [47]. Clearly, adaptation occurs with changing medium, and sometimes with unexpected side-effects.

We thank the administrator of opensubtitles.org for providing us the text version of their English-language movie subtitles. This work is supported by a grant from the UP Diliman Office of the Vice Chancellor for Research and Development and by an Amazon AWS Education grant.

References

1. R. Goulden, P. Nation, J. Read, Appl. Linguistics, 11, 341 (1990)
2. P. Nation, R. Waring, in Vocabulary: Description, Acquisition and Pedagogy, edited by N. Schmitt, M. McCarthy (Cambridge University Press, Cambridge, 1997)
3. C. Browne, G. Cihi, B. Culligan, "Measuring vocabulary size via online technology" (2007), http://www.lexxica.com [Retrieved 08-12-2012]
4. D. P. Hayes, M. G. Ahrens, J. of Child Lang., 15, 395 (1988)
5. S. Hill, N. Launder, Australian J. of Lang. and Lit., 33, 240 (2010)
6. W. Chafe, D. Tannen, Annual Rev. of Anthropology, 16, 383 (1987)
7. R. Wooffitt, Conversation analysis and discourse analysis: A comparative and critical introduction (Sage Publications Ltd, 2005)
8. W. Sack, J. Manage. Inf. Syst., 17, 73 (2000)
9. A. Warner, Complementation in Middle English and the Methodology of Historical Syntax: A Study of the Wyclifite Sermons (Taylor & Francis, 1982)
10. A. Warner, Language Variation and Change, 17, 257 (2005)
11. D. Lorenz, in ICAME 33: Corpora at the centre and crossroads of English linguistics, Leuven, 2012 (University of Leuven, 2012), p. 185
12. M. Kytö, T. Walker, Guide to A Corpus of English Dialogues, 1560-1760 (Uppsala Universitet, 2006)
13. M. Davies, The Corpus of Historical American English: 400 million words, 1810–2009 (2010), http://corpus.byu.edu/coha/ [Retrieved 23-07-2012]
14. J. A. Smith, C. Kelly, Computers and the Humanities, 36, 411 (2002)
15. D. Crystal, Language and the Internet, 2nd edn. (Cambridge Univ Press, 2006)
16. G. U. Yule, Biometrika, 30, 363 (1939)
17. E. Kelih, P. Grzybek, G. Antić, E. Stadlober, in From Data and Information Analysis to Knowledge Engineering, edited by M. Spiliopoulou, R. Kruse, C. Borgelt, A. Nürnberger, W. Gaul (Springer Berlin Heidelberg, 2006)
18. T. Copeck, K. Barker, S. Delisle, S. Szpakowicz, in TALN-2000: Actes de la 7e Conférence Annuelle sur le Traitement Automatique des Langues Naturelles, Lausanne, 2000
19. C. B. Williams, Biometrika, 31, 356 (1940)
20. C. B. Williams, Style and vocabulary: numerical studies (Griffin, 1970)
21. W. C. Wake, J. Royal Statistical Soc. A, 120, 331 (1957)
22. H. S. Sichel, J. Royal Statistical Soc. A, 137, 25 (1974)
23. P. Grzybek, in Contributions to the Science of Text and Language, edited by P. Grzybek (Springer Netherlands, Dordrecht, 2006)
24. B. Sigurd, M. Eeg-Olofsson, J. van de Weijer, Studia Linguistica, 58, 37 (2004)
25. H. Kučera, W. Francis, Computational analysis of present-day American English (Dartmouth Publishing Group, 1967)
26. W. Francis, H. Kučera, Brown Corpus Manual of Information (Brown University, 1979), http://icame.uib.no/brown/bcm.html [Retrieved 19-01-2012]
27. S. Bird, E. Loper, E. Klein, Natural Language Processing with Python (O'Reilly Media Inc., 2009)
28. L. Wasserman, All of Statistics: A Concise Course in Statistical Inference (Springer, 2004)
29. E. Jones, T. Oliphant, P. Peterson, et al., "SciPy: Open Source Scientific Tools for Python" (2001), http://www.scipy.org/ [Retrieved 19-04-2011]
30. "English spelling: You write potato, i write ghoughpteighbteau," The Economist (2008)
31. S. T. Piantadosi, H. Tily, E. Gibson, Proc. Natl. Acad. Sci. U. S. A., 108, 3526 (2011)
32. U. Strauss, P. Grzybek, G. Altmann, in Contributions to the Science of Text and Language, edited by P. Grzybek (Springer-Verlag, Berlin/Heidelberg, 2006)
33. C. M. Alis, M. T. Lim, "Supplementary material: pg authors list" (2012), http://www.nip.upd.edu.ph/ipl/data/conversations/pg-authorslist.csv
34. C. M. Alis, M. T. Lim, "Supplementary material: Adaptation of fictional and online conversations to communication media" (2012), http://www.nip.upd.edu.ph/ipl/data/conversations/epjb si.pdf [Retrieved 27-09-2012]
35. E. Williams, "How @replies work on twitter (and how they might)" (2008), http://blog.twitter.com/2008/05/how-replies-work-on-twitter-and-how.html [Retrieved 25-09-2012]
36. Twitter Help Center, "What are @replies and mentions?" (2012), https://support.twitter.com/articles/14023-what-are-replies-and-mentions [Retrieved 25-09-2012]
37. J. Kalucki, "Streaming API documentation" (2010), http://apiwiki.twitter.com/w/page/22554673/Streaming-API-Documentation?rev=1268351420 [Retrieved 15-04-2012]
38. C. M. Alis, M. T. Lim, "Supplementary material: subs movie list" (2012), http://www.nip.upd.edu.ph/ipl/data/conversations/subs-movielist.csv [Retrieved 27-09-2012]
39. R. F. i. Cancho, R. V. Solé, Proc. Natl. Acad. Sci. U. S. A., 100, 788 (2003)
40. T. Klee, M. D. Fitzgerald, J. of Child Lang., 12, 251 (1985)
41. C. A. Dollaghan, T. F. Campbell, J. L. Paradise, H. M. Feldman, J. E. Janosky, D. N. Pitcairn, M. Kurs-Lasky, J. Speech Lang. Hear. Res., 42, 1432 (1999)
42. H. B. Mann, D. R. Whitney, Ann. Math. Stat., 18, 50 (1947)
43. Independent Television Commission, "ITC Guidance on standards for subtitling" (1999), http://www.ofcom.org.uk/static/archive/itc/itc publications/codes guidance/index.asp.html [Retrieved 15-04-2012]
44. S. Bernhardsson, L. E. C. da Rocha, P. Minnhagen, New J. of Phys., 11, 123015 (2009)
45. M. Milian, "Why text messages are limited to 160 characters," Los Angeles Times, May 2009
46. "ictDATA.org: top SMS 2009" (2010), http://www.ictdata.org/2010/10/top-sms-2009.html [Retrieved 21-02-2012]
47. C. Thurlow, Discourse Analysis Online, 1(1) (2003), http://www.shu.ac.uk/daol/articles/v1/n1/a3/thurlow2002003-paper.html [Retrieved 21-02-2012]
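
As a supplementary illustration of the comparison procedure in Sec. 5 (subsample medians followed by a Mann-Whitney U test with continuity correction), the following Python sketch may be useful. It does not use the original datasets: the utterance lengths are synthetic draws from the shifted gamma fits quoted in Sec. 4 for pgs and subs, and the group count of 50 and group size of 1,000 are illustrative choices. The helper names `synthetic_utterances` and `subsample_medians` are hypothetical.

```python
import numpy as np
from scipy import stats

# Fitted Eq. (2) parameters quoted in Sec. 4: (alpha, x0, s).
FITS = {"pgs": (1.48, 0.862, 34.4), "subs": (2.71, 0.87, 10.7)}


def synthetic_utterances(name, size, rng):
    """Draw integer utterance lengths (chars.) from a dataset's shifted gamma fit.
    These are stand-ins for the real utterances, which are not reproduced here."""
    alpha, x0, s = FITS[name]
    return np.rint(x0 + rng.gamma(alpha, s, size))


def subsample_medians(lengths, n_groups, group_size, rng):
    """Median utterance length of n_groups subsamples drawn with replacement."""
    groups = rng.choice(lengths, size=(n_groups, group_size), replace=True)
    return np.median(groups, axis=1)


rng = np.random.default_rng(1)
medians = {
    name: subsample_medians(
        synthetic_utterances(name, 200_000, rng), 50, 1_000, rng
    )
    for name in FITS
}

# Two-sided Mann-Whitney U test with continuity correction on the medians.
u_stat, p_value = stats.mannwhitneyu(
    medians["pgs"], medians["subs"], use_continuity=True, alternative="two-sided"
)
print({k: float(np.median(v)) for k, v in medians.items()}, p_value)
```

Because the test is rank-based, it needs no assumption about the shape of the distribution of medians, which is the property the text relies on when comparing skewed, discrete utterance length data.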
