Natural Language Processing for Corpus Linguistics

About the Series
Corpus Linguistics has grown to become part of the mainstream of Linguistics and Applied Linguistics. This Elements series is designed to meet the needs of students and researchers who need to keep up with this changing field, including introductions to main topic areas as well as accounts of the latest ideas and developments.

Series Editor
Susan Hunston, University of Birmingham

Cover image: monsitj / iStock / Getty Images Plus

Downloaded from https://www.cambridge.org/core. IP address: 179.113.62.209, on 08 Mar 2022 at 17:22:37, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781009070447

Elements in Corpus Linguistics
edited by Susan Hunston
University of Birmingham

NATURAL LANGUAGE PROCESSING FOR CORPUS LINGUISTICS

Jonathan Dunn
University of Canterbury

University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
103 Penang Road, #05–06/07, Visioncrest Commercial, Singapore 238467

Cambridge University Press is part of the University of Cambridge.
It furthers the University’s mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781009074438
DOI: 10.1017/9781009070447

© Jonathan Dunn 2022

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2022

A catalogue record for this publication is available from the British Library.

ISBN 978-1-009-07443-8 Paperback
ISSN 2632-8097 (online)
ISSN 2632-8089 (print)

Additional resources for this publication at www.cambridge.org/dunnresources

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

Elements in Corpus Linguistics
DOI: 10.1017/9781009070447
First published online: March 2022

Jonathan Dunn
University of Canterbury
Author for correspondence: Jonathan Dunn, [email protected]

Abstract: Corpus analysis can be expanded and scaled up by incorporating computational methods from natural language processing. This Element shows how text classification and text similarity models can extend our ability to undertake corpus linguistics across very large corpora. These computational methods are becoming increasingly important as corpora grow too large for more traditional types of linguistic analysis. We draw on five case studies to show how and why to use computational methods, ranging from usage-based grammar to authorship analysis to using social media for corpus-based sociolinguistics. Each section is accompanied by an interactive code notebook that shows how to implement the analysis in Python. A stand-alone Python package is also available to help readers use these methods with their own data. Because large-scale analysis introduces new ethical problems, this Element pairs each new methodology with a discussion of potential ethical implications.

This Element also has a video abstract: www.cambridge.org/dunnabstract

Keywords: computational linguistics, natural language processing, corpus linguistics, text classification, text similarity, usage-based grammar, corpus-based sociolinguistics, computational stylistics, computational syntax

JEL classifications: A12, B34, C56, D78, E90

© Jonathan Dunn 2022
ISBNs: 9781009074438 (PB), 9781009070447 (OC)
ISSNs: 2632-8097 (online), 2632-8089 (print)

Contents

1 Computational Linguistic Analysis
2 Text Classification
3 Text Similarity
4 Validation and Visualization
5 Conclusions
References

Accessing the Code Notebooks

https://doi.org/10.24433/CO.3402613.v1
https://github.com/jonathandunn/text_analytics
https://github.com/jonathandunn/corpus_analysis

To run the notebooks through Code Ocean, you will need to click the command that says “Edit Your Copy” in the top right-hand corner, as shown in the first screenshot (not reproduced here).

The “Jupyter” command will now be available under the heading “Reproducible Run”, as shown in the second screenshot (not reproduced here).

This will start up the interactive notebook container. You can now find the notebooks within the “code” folder.

The following is a list of interactive notebooks together with the section of the Element which they accompany:

Lab 1.2. Accessing the Corpora
Lab 1.3. Visualizing Categories
Lab 1.4. Using Groupby to Explore Categories
Lab 1.5. Vectorizing Texts
Lab 2.1. Getting x and y Arrays for Dialects
Lab 2.2. Classifying Cities with TF-IDF and PMI
Lab 2.3. Classifying Authors with Function Word N-Grams
Lab 2.4. Using Positional Vectors for Parts of Speech
Lab 2.5. Classifying Hotels by Quality Using Sentiment Analysis
Lab 2.7. Classifying Cities Using MLPs
Lab 3.2. Register and Corpus Similarity
Lab 3.3. Finding Similar Documents
Lab 3.4. Finding Associated Words
Lab 3.5. Working with Word Embeddings
Lab 3.6. Clustering Word Embeddings
Lab 4.1. Baselines for Classifying Political Speeches
Lab 4.2. Ensuring Validity Using Cross-Validation
Lab 4.3. Unmasking Authorship
Lab 4.4. Comparing Word Embeddings
Lab 4.5. Making Maps for Linguistic Diversity

1 Computational Linguistic Analysis

1.1 Scaling Up Corpus Linguistics

Corpus linguistics has entered a golden age, driven by both the amount and the range of language that is now available for linguistic analysis. Corpus data is able to represent a population’s usage at scale, bypassing the limitations which made introspection so important in the 1950s. But this wide availability of language data requires that linguists have the methods available to analyze it. And while there has been a surge of advances in natural language processing and computational linguistics, these advances have become increasingly disconnected from corpus linguistics and linguistic theory. This Element brings natural language processing and corpus linguistics together, showing how computational models can be used to answer both categorization and comparison problems. These computational models are presented using five case studies that will be introduced in the next section, ranging from syntactic analysis to register analysis to corpus-based sociolinguistics.

The goal here is to show how to use these computational models, what linguistic questions they can answer, and why it is important to scale up corpus linguistics in this way. A linguist can use this Element to learn how to use natural language processing to answer linguistic questions they are already familiar with. And a computer scientist can use this Element to learn about the linguistic assumptions and limitations behind computational methods, matters that are too often disregarded within natural language processing itself.

A categorization problem is about assigning a predefined label to some piece of language. At the word level, this could involve asking whether a particular open-class word is a noun or a verb. At the sentence level, this could be asking what kind of construction a particular sentence represents. At the document level, this could be asking whether a particular speaker represents New Zealand English or Australian English. All of these questions can be answered using a text classifier. This is a type of supervised machine learning in which we as linguists define the categories that we are interested in.

A comparison problem is about measuring the relationship between two observations. At the word level, this could be asking whether two nouns like cat and dog belong to the same semantic domain. At the sentence level, this could be asking whether two tweets have a similar sentiment. At the document level, this could be asking whether two articles are examples of a similar style. These questions can be approached using a text similarity model. This is a type of unsupervised machine learning in which we as linguists only control the representations being used, not the set of labels used for annotation.

How well do computational models compare with human introspection? In some cases, models can reproduce human intuitions with a high degree of accuracy.
For example, text classifiers have been shown to make very good predictions about the part of speech of individual words when trained on small amounts of annotated data. In a case like this, a small amount of seed data, which is annotated by a linguist, supports the analysis of corpora too large to be annotated by a linguist. So a text classifier allows us to scale up introspection-based annotations.

In other cases, computational models can detect patterns in language that are not visible to human introspection. For example, research in both authorship analysis and dialect identification has shown that there are enough individual-specific and community-specific variants to enable accurate predictions of who produced a specific document. But, as linguists, our own introspections are not precise enough to identify these same patterns. In a case like this, computational linguistic analysis makes it possible to answer new questions about language.

Finally, there are cases where computational models completely miss something that is easily accessible to humans. For example, we will follow a case study on multilingualism online which shows that 90 percent of digital language data (from the web and social media) represents just twenty languages. Most languages in the world are low-resource languages from a computational perspective. As a result, many of the computational methods that we cover in this Element are difficult to apply to these languages. As linguists, however, we do not require millions or billions of words in a language before we can begin our analysis.

We need computational linguistic analysis for two reasons: for reproducibility and for scalability. First, every step in a computational pipeline is fully automated, which means that it can be reproduced and verified. For example, this Element follows five separate case studies that we will introduce in the next subsection. All the graphs and figures and experiments in the Element can be reproduced using the code notebooks that are linked within each section.¹ This is an example of how computational methods support reproducibility.

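The difference between a categorization problem and a comparison problem, introduced earlier in this section, can be made concrete with a small Python sketch. This is not code from the Element's notebooks or its Python package: the toy sentences, the labels, and every function name below are assumptions made up for illustration, and a simple nearest-centroid rule over word counts stands in for the classifiers and similarity models that later sections develop properly.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a text as a bag of lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train_centroids(labeled_texts):
    """Sum the word counts of the seed texts for each predefined label."""
    centroids = {}
    for text, label in labeled_texts:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids


def classify(text, centroids):
    """Categorization: return the label whose centroid is most similar."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical seed data standing in for a linguist's annotations.
seed = [
    ("the cat sat on the mat", "animals"),
    ("a dog chased the cat", "animals"),
    ("the bank raised interest rates", "finance"),
    ("investors sold shares at the bank", "finance"),
]
centroids = train_centroids(seed)

# Categorization (supervised): the set of labels is fixed in advance.
predicted = classify("the dog sat on the cat", centroids)

# Comparison (unsupervised): no labels, only a score relating two texts.
score = cosine(vectorize("the cat sat"), vectorize("the cat slept"))

print(predicted, round(score, 3))
```

In the first call the linguist's labels determine what answers are possible; in the second, only the representation (here, raw word counts) is chosen by the analyst, and the model returns a graded similarity rather than a category.
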
Second, the once-revolutionary Brown Corpus contained 1 million words (Francis & Kucera, 1967). But it is common now for corpora to range from 1 billion words, like the GeoWAC family of corpora (Dunn & Adams, 2020), up to 400 billion words, like the Corpus of Global Language Use (Dunn, 2020). These very large corpora are often drawn from digital sources like the web, social media, Wikipedia, and news articles. While these sources of language data have tremendous potential for testing linguistic hypotheses on a

¹ And at https://github.com/jonathandunn/corpus_analysis
