Who’s Afraid of George Kingsley Zipf?

Charles Yang
Department of Linguistics & Computer Science
University of Pennsylvania
[email protected]

June 2010

Abstract

We explore the implications of Zipf’s law for the understanding of linguistic productivity. Focusing on language acquisition, we show that the item/usage-based approach has not been supported by adequate statistical evidence. By contrast, the quantitative properties of a productive grammar can be precisely formulated, and are consistent with even very young children’s language. Moreover, drawing from research in computational linguistics, the statistical properties of natural language strongly suggest that the theory of grammar be composed of general principles with overarching range of applications rather than a collection of item and construction specific expressions.

1 Introduction

Einstein was a very late talker. “The soup is too hot”, as legend has it, were his first words, at the very ripe age of three. Apparently the boy genius hadn’t had anything interesting enough to say.

The credulity of such tales aside—similar stories with other famous subjects abound—they do contain a kernel of truth: a child doesn’t have to say something, anything, just because he can. And this poses a challenge for the study of child language when children’s linguistic production is often the only, and certainly the most accessible, data on hand. Language use is the composite of linguistic, cognitive and perceptual factors many of which, in the child’s case, are still in development and maturation. It is therefore difficult to draw inferences about the learner’s linguistic knowledge from his linguistic behavior. Indeed, the moral holds for linguistic study in general: an individual’s grammatical capacity may not be fully reflected in his or her speech. Since a goal of linguistic theory is to identify the possible and impossible structures of language, restricting oneself to naturalistic data is doubly limited: some expressions may not have been said but are nevertheless well formed, others—syntactic islands, for instance—will never be said for they are unsayable.

This much has been well known since Chomsky (1965) drew the competence/performance distinction.
The pioneering work on child language that soon followed, including those who did not follow the generative approach, also recognized the gap between what the child knows and what the child says (Bloom 1970, Bowerman 1973, Brown & Fraser 1963, Brown & Bellugi 1964, McNeil 1966, Schlesinger 1971, Slobin 1971). Two examples from that period of time suffice to illustrate the necessity to go beyond the surface. Shipley, Gleitman & Smith (1969) show that children in the so-called telegraphic stage of language development nevertheless understand fully formed English sentences better than telegraphic patterns that resemble their own speech. Roger Brown, in his landmark study (1973) and synthesis of other work available at the time, provides distributional and quantitative evidence against the Pivot Grammar hypothesis (Braine 1963), under which early child syntax supposedly consists of templates centering around pivot words.[1] Brown advocates the thesis, later dubbed the Continuity Hypothesis, that child language be interpreted in terms of adult-like grammatical devices, which has continued to feature prominently in language acquisition (Wexler & Culicover 1980, Pinker 1984, Crain 1991, Yang 2002).

This tradition has been challenged by the item or usage-based approach to language most clearly represented by Tomasello (1992, 2000a, 2000b, 2003), which reflects a current trend (Bybee 2001, Pierrehumbert 2001, Goldberg 2003, Culicover & Jackendoff 2005, Hay & Baayen 2005, etc.) that emphasizes the storage of specific linguistic forms and constructions at the expense of general combinatorial linguistic principles and overarching points of language variation (Chomsky 1965, 1981). Child language, especially in the early stages, is claimed to consist of specific item-based schemas, rather than a productive linguistic system as previously conceived. Consider, for instance, three case studies in Tomasello (2000a, p. 213-214) which have been cited as evidence for the item-based view at numerous places.

• The Verb Island Hypothesis (Tomasello 1992).
In a longitudinal study of early child language, it is noted that “of the 162 verbs and predicate terms used, almost half were used in one and only one construction type, and over two-thirds were used in either one or two construction types...”. There is “great unevenness in how different verbs, even those that were very close in meaning, were used—both in terms of the number and types of construction types used.” Hence, “the 2-year-old child’s syntactic competence is comprised totally of verb-specific constructions with open nominal slots”, rather than abstract and productive syntactic rules under which presumably a broader range of combinations is expected.

[1] A position that bears more than a passing resemblance to a strand of contemporary thinking to which we return momentarily.

• Limited morphological inflection. According to a study of child Italian (Pizutto & Caselli 1994), 47% of all verbs used by 3 young children (1;6 to 3;0) were used in 1 person-number agreement form, and an additional 40% were used with 2 or 3 forms, where six forms are possible (3 person × 2 number). Only 13% of all verbs appeared in 4 or more forms. Again, the low level of usage diversity is taken to show the limitedness of generalization characteristic of item-based learning.

• Unbalanced determiner usage. Citing Pine & Lieven (1997) and other similar studies, it is found that when children began to use the determiners a and the with nouns, “there was almost no overlap in the sets of nouns used with the two determiners, suggesting that the children at this age did not have any kind of abstract category of Determiners that included both of these lexical items”. This finding is held to contradict the earliest study (Valian 1986) which maintains that child determiner use is productive and accurate like adults by the age of 2;0.

So far as we can tell, however, this evidence in support of item-based learning has been presented, and accepted, on the basis of intuitive inspections rather than formal empirical tests.
For instance, among the numerous examples from child language, no statistical test was given in the major treatment (Tomasello 1992) where the Verb Island Hypothesis and related ideas about item-based learning are put forward. Specifically, no tests have been given to show that the observations above are statistically inconsistent with the expectation of a fully productive grammar, the position that item-based learning opposes. Nor, for that matter, are these observations shown to be consistent with item-based learning, which, as we shall see, has not been clearly enough articulated to facilitate quantitative evaluation. In this paper, we provide statistical analysis to fill these gaps. We demonstrate that children’s language use actually shows the opposite of the item-based view; the productivity of children’s grammar is in fact confirmed. More broadly, we aim to direct researchers to certain statistical properties of natural language and the challenges they pose for the theory of language and language learning. Our point of departure is a name that has tormented, and will continue to torment, every student of language: George Kingsley Zipf.

2 Zipfian Presence

2.1 Zipfian Words

Under the so-called Zipf’s law (Zipf 1949), the empirical distributions of words follow a curious pattern: relatively few words are used frequently—very frequently—while most words occur rarely, with many occurring only once in even large samples of texts. More precisely, the frequency of a word tends to be approximately inversely proportional to its rank in frequency. Let f be the frequency of the word with the rank of r in a set of N words, then:

    f = C/r    where C is some constant    (1)

In the Brown corpus (Kucera & Francis 1967), for instance, the word with rank 1 is “the”, which has the frequency of about 70,000, and the word with rank 2 is “of”, with the frequency of about 36,000: almost exactly as Zipf’s law entails (i.e., 70000 × 1 ≈ 36000 × 2). The Zipfian characterization of word frequency can be visualized by plotting the log of word frequency against the log of word rank. By taking the log on both sides of the equation above (log f = log C − log r), a perfect Zipfian fit would be a straight line with the slope -1.
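This slope test is easy to reproduce. The sketch below (our illustration, not from the paper) fits the log-log slope by ordinary least squares; fed idealized frequencies f = C/r, with C = 70000 as in the Brown corpus figures above, it recovers a slope of exactly -1.

```python
import math

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    freqs must be sorted in descending order, so that rank r = 1, 2, ...
    An exact Zipfian sample (f = C/r) yields a slope of -1.
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Idealized Zipfian frequencies with C = 70000, as in the Brown corpus example.
freqs = [70000 / r for r in range(1, 1001)]
print(round(loglog_slope(freqs), 6))  # → -1.0
```

On real corpus counts the slope will only approximate -1, which is the “close neighborhood of -1.0” reported in the vocabulary studies cited below.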
Indeed, Zipf’s law has been observed in vocabulary studies across languages and genres, and the log-log slope fit is consistently in the close neighborhood of -1.0 (Baroni 2008).

Figure 1. Zipfian distribution of words (top) and pseudowords (bottom) in the Brown corpus. The lower line is plotted by taking “words” to be any sequence of letters between e’s (Chomsky 1958). The two straight dotted lines are linear functions with the slope -1, which illustrate the goodness of the Zipfian fit.

There has been a good deal of controversy over the interpretation of Zipf’s law, which shows up not only in the context of words but also many other physical and social systems (Bak, Tang, Wiesenfeld 1987, Gabaix 1999, Axtell 2001, among many others). It is now clear that the observation of Zipfian distribution alone is of no inherent interest or significance, as certain random letter generating processes can produce outcomes that follow Zipf’s law (Mandelbrot 1954, Miller 1957, Li 1992, Niyogi & Berwick 1995). As noted in Chomsky (1958), if we redefine “words” as alphabets between any two occurrences of some letter, say, “e”, rather than space as in the case of written text, the resulting distribution may fit Zipf’s law even better. This is illustrated by the lower line in Figure 1, which follows the Zipfian straight line at least as well as real words.

It is often the case that we are not concerned with the actual frequencies of words but their probability of occurrence; Zipf’s law makes this estimation simple and accurate. Given (1), the probability p_r of the word n_r with the rank r among N words can be expressed as follows:

    p_r = (C/r) / (Σ_{i=1}^{N} C/i) = 1/(r·H_N)    where H_N = Σ_{i=1}^{N} 1/i is the Nth Harmonic Number    (2)

The application of Zipf’s law to words has been very well studied. Yet relatively little attention has been given to the combinatorics of linguistic units under a grammar and, more important, how one might draw inference about the grammar given the distribution of word combinatorics. We turn to these questions immediately.
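Equation (2) can be checked directly. A minimal Python sketch (our illustration, using exact rational arithmetic; the function name `zipf_probs` is ours) builds the Zipfian probabilities for N = 100 words and confirms that they sum to 1 and that rank 1 is twice as probable as rank 2:

```python
from fractions import Fraction

def zipf_probs(N):
    """Zipfian occurrence probabilities p_r = 1 / (r * H_N), per equation (2)."""
    H_N = sum(Fraction(1, i) for i in range(1, N + 1))  # Nth harmonic number
    return [1 / (r * H_N) for r in range(1, N + 1)]

probs = zipf_probs(100)
print(sum(probs))           # → 1 (the p_r form a proper probability distribution)
print(probs[0] / probs[1])  # → 2 (rank 1 is twice as likely as rank 2)
```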
2.2 Zipfian Combinatorics

The “long tail” of Zipf’s law, which is occupied by low frequency words, becomes even more pronounced when we consider combinatorial linguistic units. Take, for instance, n-grams, the simplest linguistic combination that consists of n consecutive words in a text.[2] Since there are a lot more bigrams and trigrams than words, there are consequently a lot more low frequency bigrams and trigrams in a linguistic sample, as Figure 2 illustrates from the Brown corpus (for related studies, see Teahan 1997, Ha et al. 2002):

Figure 2. The vast majority of n-grams are rare events. The x-axis denotes the frequency of the gram, and the y-axis denotes the cumulative % of the grams that appear at that frequency or lower.

For instance, there are about 43% of words that occur only once, about 58% of words that occur 1-2 times, 68% of words that occur 1-3 times, etc. The % of units that occur multiple times decreases rapidly, especially for bigrams and trigrams: approximately 91% of distinct trigram types in the Brown corpus occur only once, and 96% occur once or twice.

The range of linguistic forms is so vast that no sample is large enough to capture all of its varieties even when we make a certain number of abstractions. Figure 3 plots the rank and frequency distributions of syntactic rules of modern English from the Penn Treebank (Marcus et al. 1993). Since the corpus has been manually annotated with syntactic structures, it is straightforward to extract rules and tally their frequencies.[3] The most frequent rule is “PP → P NP”, followed by “S → NP VP”: again, the Zipf-like pattern can be seen by the close approximation by a straight line on the log-log scale.

[2] For example, given the sentence “the cat chases the mouse”, the bigrams (n=2) are “the cat”, “cat chases”, “chases the”, and “the mouse”, and the trigrams (n=3) are “the cat chases”, “cat chases the”, “chases the mouse”. When n=1, we are just dealing with words.
[3] Certain rules have been collapsed together as the Treebank frequently annotates rules involving distinct functional heads as separate rules.
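The n-gram extraction described in footnote 2 takes only a few lines of Python (the function name `ngrams` is ours):

```python
def ngrams(words, n):
    """All sequences of n consecutive words in a token list, per footnote 2."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

tokens = "the cat chases the mouse".split()
print(ngrams(tokens, 2))  # → ['the cat', 'cat chases', 'chases the', 'the mouse']
print(ngrams(tokens, 3))  # → ['the cat chases', 'cat chases the', 'chases the mouse']
```

Tallying these with a frequency counter over a corpus yields the cumulative type counts plotted in Figure 2.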
Figure 3. The frequency distribution of the syntactic rules in the Penn Treebank.

The long tail of linguistic combinations must be taken into account when we assess the structural properties of the grammar. Claims of item-based learning build on the premise that linguistic productivity entails diversity of usage: the “unevenness” in usage distribution is taken to be evidence against a systematic grammar. The underlying intuition, therefore, appears to be that linguistic combinations might follow something close to a uniform distribution. Take the notion of overlap in the case of determiner use in early child language (Pine & Lieven 1997). If the child has fully productive use of the syntactic category determiner, then one might expect her to use determiners with any noun for which they are appropriate. Since the determiners “the” and “a” have (virtually) identical syntactic distributions, a linguistically productive child that uses “a” with a noun is expected to automatically transfer the use of that noun to “the”. Quantitatively, determiner-noun overlap is defined as the percentage of nouns that appear with both determiners out of those that appear with either. The low overlap values in children’s determiner use (Pine & Lieven 1997, among others) are taken to support the item-based view of child language.

However, using a similar metric, Valian and colleagues (Valian et al. 2009) find that the overlap measures for young children and their mothers are not significantly different, and they are both very low. In fact, when applied to the Brown corpus (see section 3.2 for methods), we find that the “a/the” overlap for singular nouns is only 25.2%: almost three quarters of the nouns that could have appeared with both determiners only appeared with one exclusively. The overlap value of 25.2% is actually lower than those of some children reported in Pine & Lieven (1997). It would follow that the language of the Brown corpus, which draws from various genres of professional print materials, is less productive and more item-based than that of a toddler—which seems absurd.
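The overlap metric just defined is straightforward to compute. Below is a minimal sketch (our illustration; the toy determiner-noun sample is invented) that counts the nouns seen with both determiners among those seen with either:

```python
def overlap(pairs):
    """Determiner-noun overlap: the fraction of nouns that appear with both
    determiners, out of all nouns that appear with either (a toy version of
    the metric used by Pine & Lieven 1997)."""
    dets_by_noun = {}
    for det, noun in pairs:
        dets_by_noun.setdefault(noun, set()).add(det)
    both = sum(1 for dets in dets_by_noun.values() if len(dets) == 2)
    return both / len(dets_by_noun)

# Hypothetical sample: "cat" occurs with both determiners; "milk" and "bath"
# occur with only one each, so 1 of 3 nouns overlaps.
sample = [("the", "cat"), ("a", "cat"), ("the", "milk"), ("a", "bath"), ("a", "bath")]
print(round(overlap(sample), 3))  # → 0.333
```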
The reason for these seemingly paradoxical findings lies in the Zipfian distribution of syntactic categories and the generative capacity of natural language grammar. Consider a fully productive rule that combines a determiner and a singular noun, or “DP → D N”, where “D → a | the” and “N → cat | book | desk | ...”. We use this rule for its simplicity and for the readily available data for empirical tests, but one can easily substitute the rule for “VP → V DP”, “VP → V in Construction_x”, “V_inflection → V_stem + Person + Number + Tense”. All such cases can be analyzed with the methods provided here.

Suppose a linguistic sample contains S determiner-noun pairs, which consist of D and N unique determiners and nouns. (In the present case D = 2 for “a” and “the”.) The full productivity of the DP rule, by definition, means that the two categories combine independently. Two observations, one obvious and the other novel, can be made about the distributions of the two categories and their combinations.

First, nouns (and open class words in general) will follow Zipf’s law. For instance, the singular nouns that appear in the form of “DP → D N” in the Brown corpus show a log-log slope of -0.97. In the CHILDES (MacWhinney 2000) speech transcripts of six children (see section 3.2 for details), the average value of the log-log slope is -0.98. This means that in a linguistic sample, relatively few nouns occur often but many will occur only once—which of course cannot overlap with more than one determiner.

Second, while the combination of D and N is syntactically interchangeable, N’s tend to favor one of the two determiners, a consequence of pragmatics and indeed non-linguistic factors. For instance, we say “the bathroom” more often than “a bathroom” but “a bath” more often than “the bath”, even though all four DPs are perfectly grammatical. The reason for such asymmetries is not a matter of linguistic interest: “the bathroom” is more frequent than “a bathroom” only because bodily functions are a more constant theme of life than real estate matters.

We can place these combinatorial asymmetries in a quantitative context. As noted earlier, about 75% of distinct nouns in the Brown corpus occur with exclusively “the” or “a” but not both.
Even the remaining 25% which do occur with both tend to have favorites: only a further 25% (i.e. 12.5% of all nouns) are used with “a” and “the” equally frequently, and the remaining 75% are unbalanced. Overall, for nouns that appear with both determiners at least once (i.e. 25% of all nouns), the frequency ratio between the more over the less favored determiner is 2.86:1. (Of course, some nouns favor “the” while others favor “a”, as the “bathroom” and “bath” examples above illustrate.) These general patterns hold for child and adult speech data as well. In the six children’s transcripts (section 3.2), the average percentage of balanced nouns among those that appear with both “the” and “a” is 22.8%, and the more favored vs. less favored determiner has an average frequency ratio of 2.54:1. Even though these ratios deviate from the perfect 2:1 ratio under the strict version of Zipf’s law—the more favored is even more dominant over the less—they clearly point out the considerable asymmetry in category combination usage. As a result, even when a noun appears several times in a sample, there is still a significant chance that it has been paired with a single determiner in all instances.

Together, Zipfian distributions of atomic linguistic units (words; Figure 1) and their combinations (n-grams, Figure 2; phrases, Figure 3) ensure that the determiner-noun overlap must be relatively low unless the sample size S is very large. In section 4, we examine, and discover similar patterns in, the usage patterns of verbal syntax and morphology. For the moment, we develop a precise mathematical treatment and contrast it with the item-based learning approach in the context of language acquisition.

3 Quantifying Productivity

3.1 Theoretical analysis

Consider a sample (N, D, S), which consists of N unique nouns, D unique determiners, and S determiner-noun pairs. Here D = 2 for “the” and “a” though we consider the general case here. The nouns that have appeared with more than one (i.e. two) determiner will have an overlap value of 1; otherwise, they have the overlap value of 0. The overlap value for the entire sample will be the number of 1’s divided by N.
Our analysis calculates the expected value of the overlap for the sample (N, D, S) under the productive rule “DP → D N”; let it be O(N, D, S). This requires the calculation of the expected overlap value for each of the N nouns over all possible compositions of the sample. Consider the noun n_r with the rank r out of N. Following equation (2), it has the Zipfian probability p_r = 1/(r·H_N) of being drawn at any single trial in S. Let the expected overlap value of n_r be O(r, N, D, S). The overlap for the sample can be stated as:

    O(N, D, S) = (1/N) · Σ_{r=1}^{N} O(r, N, D, S)    (3)

Consider now the calculation of O(r, N, D, S). Since n_r has the overlap value of 1 if and only if it has been used with more than one determiner in the sample, we have:

    O(r, N, D, S) = 1 − Pr{n_r is not sampled during S trials}
                      − Σ_{i=1}^{D} Pr{n_r is sampled but with the ith determiner exclusively}
                  = 1 − (1 − p_r)^S − Σ_{i=1}^{D} [(d_i·p_r + 1 − p_r)^S − (1 − p_r)^S]

The last term above requires a brief comment. Under the hypothesis that the language learner has a productive rule “DP → D N”, the combination of determiner and noun is independent. Therefore, the probability of noun n_r combining with the ith determiner is the product of their probabilities, or d_i·p_r. The multinomial expression

    (p_1 + p_2 + ... + p_{r−1} + d_i·p_r + p_{r+1} + ... + p_N)^S

gives the probabilities of all the compositions of the sample, with n_r combining with the ith determiner 0, 1, 2, ... S times, which is simply (d_i·p_r + 1 − p_r)^S since (p_1 + p_2 + ... + p_{r−1} + p_r + p_{r+1} + ... + p_N) = 1. However, this value includes the probability of n_r combining with the ith determiner zero times—again (1 − p_r)^S—which must be subtracted.
Thus, the probability with which n_r combines with the ith determiner exclusively in the sample S is [(d_i·p_r + 1 − p_r)^S − (1 − p_r)^S]. Summing these values over all determiners and collecting terms, we have:

    O(r, N, D, S) = 1 + (D − 1)(1 − p_r)^S − Σ_{i=1}^{D} (d_i·p_r + 1 − p_r)^S    (4)

The formulations in (3)-(4) allow us to calculate the expected value of overlap using only the sample size S, the number of unique nouns N and the number of unique determiners D, under the assumption that nouns and determiners both follow Zipf’s law as discussed in section 2.[4] Figure 4 gives an illustration, with N = 100, D = 2 and S = 200.

[4] For the present case involving only two determiners “the” and “a”, d_1 = 2/3 and d_2 = 1/3. As noted in section 2.2, the empirical probabilities of the more vs. less frequent determiners deviate somewhat from the strict Zipfian ratio of 2:1; numerical results show that the 2:1 ratio is a very accurate surrogate for a wide range of actual ratios in the calculation of (3)-(4). This is because most of the average overlap value comes from the relatively few and highly frequent nouns, as Figure 4 makes clear.

Figure 4. Expected overlap values for nouns ordered by rank, for N = 100 nouns in a sample size of S = 200 with D = 2 determiners. Word frequencies are assumed to follow the Zipfian distribution. As can be seen, few nouns have high probabilities of occurring with both determiners, but most are (far) below chance. The average overlap is 21.1%.

Under Zipfian distribution of categories and their productive combinations, low overlap values are a mathematical necessity. As we shall see, the theoretical formulation here nearly perfectly matches the distributional patterns in child language, to which we turn presently.

3.2 Determiners and productivity

Methods. To study the determiner system in child language, we consider the data from six children: Adam, Eve, Sarah, Naomi, Nina, and Peter.
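Equations (3)-(4) are simple to evaluate numerically. The sketch below (our illustration; `expected_overlap` is a hypothetical helper name) reproduces the setting of Figure 4, with N = 100, D = 2, S = 200 and the 2:1 determiner ratio of footnote 4; the paper reports an average overlap of 21.1% for this configuration.

```python
def expected_overlap(N, D, S, d):
    """Expected determiner-noun overlap O(N, D, S) under equations (2)-(4).

    Nouns are Zipfian, p_r = 1 / (r * H_N); determiners combine independently
    with the noun, with relative probabilities given in the list d (summing to 1).
    """
    H_N = sum(1 / i for i in range(1, N + 1))  # Nth harmonic number
    total = 0.0
    for r in range(1, N + 1):
        p = 1 / (r * H_N)
        # Equation (4): 1 + (D-1)(1-p)^S - sum_i (d_i p + 1 - p)^S
        total += 1 + (D - 1) * (1 - p) ** S - sum((di * p + 1 - p) ** S for di in d)
    return total / N  # Equation (3): average over the N nouns

# Figure 4's illustration: N = 100, D = 2, S = 200, d_1 = 2/3, d_2 = 1/3.
print(expected_overlap(100, 2, 200, [2 / 3, 1 / 3]))
```

High-rank nouns contribute overlap values near 1 while the long tail contributes values near 0, which is why the average lands at roughly a fifth despite full productivity.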
These are all and only the children in the CHILDES database (MacWhinney 2000) with substantial longitudinal data that starts at the very beginning of syntactic development (i.e., one or two word stage) so that the item-based stage, if it exists, could be observed. For comparison, we also consider the overlap measure of the Brown corpus (Kucera & Francis 1967), for which productivity is not in doubt.

We first removed the extraneous annotations from the child text and then applied an open source implementation of a rule-based part-of-speech tagger (Brill 1995):[5] words are now associated with their part-of-speech (e.g., preposition, singular noun, past tense verb, etc.). For languages such as English, which has relatively salient cues for part-of-speech (e.g., rigid word order, low degree of morphological syncretism), such taggers can achieve high accuracy at over 97%. This already low error rate causes even less concern for our study, since the determiners “a” and “the” are not ambiguous and are always correctly tagged, which reliably contributes to the tagging of the words that follow them. The Brown Corpus is available with manually assigned part-of-speech tags so no computational tagging is necessary.

With tagged datasets, we extracted adjacent determiner-noun pairs for which D is either “a” or “the”, and N has been tagged as a singular noun. Words that are marked as unknown, largely unintelligible

[5] Available at http://gposttl.sourceforge.net/.
