Table Of Content

RESEARCHARTICLE Why are some languages confused for others? Investigating data from the Great Language Game HedvigSkirgård1☯,Sea´nG.Roberts2☯*,LarsYencken3☯ 1TheWellspringsofLinguisticDiversityLaureateproject,DepartmentofLinguistics,SchoolofCulture, HistoryandLanguage,CollegeofAsiaandthePacific,AustralianNationalUniversity,Acton,ACT,Australia, 2Language&CognitionDepartment,MaxPlanckInstituteforPsycholinguistics,Nijmegen,Gelderland, Netherlands,3Independentresearcher,Melbourne,Victoria,Australia a1111111111 a1111111111 ☯Theseauthorscontributedequallytothiswork. *[email protected] a1111111111 a1111111111 a1111111111 Abstract Inthispaperweexploretheresultsofalarge-scaleonlinegamecalled‘theGreatLanguage Game’,inwhichpeoplelistentoanaudiospeechsampleandmakeaforced-choiceguess OPENACCESS abouttheidentityofthelanguagefrom2ormorealternatives.Thedatainclude15million Citation:SkirgårdH,RobertsSG,YenckenL guessesfrom400audiorecordingsof78languages.Weinvestigatewhichlanguagesare (2017)Whyaresomelanguagesconfusedfor confusedforwhichinthegame,andifthiscorrelateswiththesimilaritiesthatlinguistsiden- others?InvestigatingdatafromtheGreat tifybetweenlanguages.Thisincludessharedlexicalitems,similarsoundinventoriesand LanguageGame.PLoSONE12(4):e0165934. https://doi.org/10.1371/journal.pone.0165934 establishedhistoricalrelationships.Ourfindingsare,asexpected,thatplayersaremore likelytoconfusetwolanguagesthatareobjectivelymoresimilar.Wealsoinvestigatefactors Editor:NielsO.Schiller,LeidenUniversity, NETHERLANDS thatmayaffectplayers’abilitytoaccuratelyselectthetargetlanguage,suchashowmany peoplespeakthelanguage,howoftenthelanguageismentionedinwrittenmaterialsand Received:December22,2015 theeconomicpowerofthetargetlanguagecommunity.Weseethatnon-linguisticfactors Accepted:October20,2016 affectplayers’abilitytoaccuratelyidentifythetarget.Forexample,languageswithwider Published:April5,2017 ‘globalreach’aremoreoftenidentifiedcorrectly.Thissuggeststhatbothlinguisticandcul- Copyright:©2017Skirgårdetal.Thisisanopen turalknowledgeinfluencetheperceptionandrecognitionoflanguagesandtheirsimilarity. accessarticledistributedunderthetermsofthe CreativeCommonsAttributionLicense,which permitsunrestricteduse,distribution,and reproductioninanymedium,providedtheoriginal authorandsourcearecredited. DataAvailabilityStatement:Allrelevantdataare withinthepaperanditsSupportingInformation Introduction files. Objectivestudiesofsimilaritiesanddifferencesoflanguagesallowresearcherstoreconstruct Funding:SGRissupportedbytheMaxPlanck patternsofhistoricalchangeinlinguisticstructures[1],orsocialchangessuchasmigrations societyandanERCAdvancedGrantNo.269484 [2,3].However,knowledgeoflinguisticvariationisalsoimportantintheday-to-daylifeof INTERACTtoStephenLevinson.Thefundershad non-linguists.Beingconfrontedwith,andoftenconfoundedby,adifferentaccentoraforeign noroleinstudydesign,datacollectionand analysis,decisiontopublish,orpreparationofthe languageisperhapsauniversalhumanexperience;weareofteninsituationswherewehear manuscript. snippetsofunfamiliarconversationinapublicspaceorontheradio,andwetrytoidentify whatitiswehear.Peoplearesensitivetodifferencesinpronunciationandvocabularyandare Competinginterests:Theauthorshavedeclared thatnocompetinginterestsexist. quicktolinkthesedifferencestotheirknowledgeofculturaldifferences.Ourattitudesandthe PLOSONE|https://doi.org/10.1371/journal.pone.0165934 April5,2017 1/35 Whyaresomelanguagesconfusedforothers? waywechangeourbehaviourwhentryingtocommunicatepastthesedifferencesformpartof thebasisforlanguagechange,butisveryhardtostudy. Insomecases,identifyingnon-groupmembersbylinguisticmeansmaybeamatteroflife ordeath,asintheBiblicalepisodeinvolvingthepronunciationofHebrews(h)ibboleth‘earof wheat’(Judges12:6).Morerecentexamplesincludethe20th-centurycivilwarinSriLanka whereallophonicdifferenceswereusedtodifferentiateTamilandSinhalesespeakers(see[4]). Indeed,linguisticdiversitycanbedrivenbytheneedtodifferentiatebetweenculturalgroups [5](seealsoThurston’sconceptof‘esoterogeny’[6]andLarsen’s‘nabo-opposisjon’[7]).In extremecases,abiasedattitudetowardsalanguagecanseverelyaffectperception,eventothe extentofreducingcomprehension[8,9]. Inspiteoftheimportanceofperceivinglinguisticdifferencesindailylife,therehavebeen fewstudiesexploringthefactorswhichinfluencepeople’sabilitytorecognisethelanguagesof theworldandtheirsimilarities.Thisgamepresentsuswithanewkindofdatathathasprevi- ouslynotbeenavailable,andoffersthechancetoinvestigatetheperceptionoflanguagesimi- larity.Inthispaper,weanalysetheresultsofanonlinegameplayedmillionsoftimesbya globalsampleofpeople,inwhichthegoalwastoaccuratelyidentifylanguagesbasedona 20-secondaudioclipofspeech.Basedontheplayers’behaviour,wecanextractinformation aboutwhichlanguagestheycaneasilyidentify,andwheretheygetconfused.Weusetheresults ofthisgametoinvestigatewhatfactorsmakealanguageeasytoidentifyaccurately,including non-linguisticfactorssuchaseconomicpowerandthequalityoftheaudiorecordings.We alsoassesswhetherlanguageswhichareoftenconfusedalsohavesomeobjectivelinguistic similarity,forexamplebeingcloselyrelated,geographicallyclose,orhavingsimilarsoundsys- temsorlexicon. Morespecifically,theresearchquestionsweaddressinthispaperare: • Q1:Whichlanguagesareconfusedforwhichothers? • Q2:Arethereanyasymmetriesofconfusion? • Q3:Whatfactorscanpredictwhetherplayersconfusetwolanguagesforeachother?Candi- datesinclude: – geographicalcloseness; – genealogy; – similarityofphonemeinventories; – lexicalsimilarity. • Q4:Whatfactorscanpredictplayers’accuracy?Candidatesinclude: – acousticqualityofthespeechsamples; – proportionofnon-nativespeakers(L2speakers); – totalnativespeaker(L1)speakerpopulation; – linguisticdiversityofthemaincountryinwhichthelanguageisspoken; – numberofcountriesthelanguageisspokenin; – “languagenametransparency”—howclearthelinkisbetweenthelanguagenameandthe nameofthemaincountry; – economicpowerofmaincountryasmeasuredbyGrossDomesticProduct(GDP); PLOSONE|https://doi.org/10.1371/journal.pone.0165934 April5,2017 2/35 Whyaresomelanguagesconfusedforothers? – Familiaritywiththelanguage,asmeasuredbythefrequencyofoccurrenceofthelan- guagenameinwrittenmaterialssuchastheEnglishandChinesecorporafromGoogle Books. • Q5:Isplayers’performancemoreaccuratelypredictedbylinguisticornon-linguistic factors? • Q6:Aresomephonologicalcuesmoreimportantthanothersinpredictingplayers’ accuracy? Identifyinglanguagesandrelationshipsbetweenthem Dividingacontinuumofrelatedlinguisticvarietiesintodiscretelanguagesandfamiliesisa difficulttask.AsEvans&Levinsonnote,“wearetheonlyspecieswithacommunicationsys- temwhichisfundamentallyvariableatalllevels[oflinguisticorganization]”[10].Themost commonmethodsofseparatingvarietiesintolanguagesarebasedontestsofmutualintelligi- bility,sharedliterarytraditionand/orsharedlexicons[11,12].However,speaker/signercom- munitiesmayhavetheirownperceptionsandlabellingofgroupidentity(cf[13]).Theeditors oftheEthnologuerecognisethat“alanguageismoreoftencomprisedofcontinuaoffeatures thatextendacrosstime,geography,andsocialspace.Thereisgrowingattentionbeinggivento therolesorfunctionsthatlanguagevarietiesplaywithinthelinguisticecologyofaregionora speechcommunity”[14].Languageusers’ownperceptionofwhataresalientcontrastsand importantcategoriescanoftendiffersignificantlyfromwhatlinguistsfocuson.Often,catego- rizationsbylanguageusersarepoliticallymotivated.Forexample,NorwegianandSwedishare considereddifferentlanguagesbytheirrespectivelanguagecommunities,despitetherebeing largegroupsofspeakersfromthetwocommunitieswhoareabletounderstandeachotherand alargesharedlexicon(cf.[15]). Thereareatleastthreetypesofrelationshipbetweenlanguages:temporal,spatialandsocial. Forexample,thegradualdifferencesleadingfromMiddleEnglishtoModernEnglishforma temporalcontinuumofvariationwithinthesamecommunity.Linguisticborrowingdueto contactcanresultinaspatialcontinuum,asfoundinthearealpatternsofEastAsia(e.g.Can- tonese,Mandarin,JapaneseandKoreaninoursample)orMainlandSoutheastAsia(Khmer, Burmese,LaoandThai).Often,languagesarerelatedbothintime(i.e.genealogy)andinspace (i.e.geographicalproximity),asinthecaseoftheclosely-relatedGermanicdialectsofBelgium, theNetherlands,Germany,Switzerland,LuxembourgandAustria.Languagescanalsovaryon asocialscale,thisisforexamplereflectedindifferentlanguagevarietieswithinthesamelan- guagedependingonsocio-economicstatus(cf.thebasilect-acrolectcontinuumincreole languages). Despitevariationalongthetemporal,spatial,andsocialdimensions,communication betweenspeakersofquitedifferentvarietiesisbothpossibleandproductive,suggestingthat wemustpossessaneffectivecapacityforperceptuallyhandlingvariation.Inthispaper,we takeararechancetoinvestigatenon-linguists’perceptionoflinguisticvariationacrossthe world.Weexcludedguessesmadebyplayersfromcountrieswherethetargetlanguageisan officialordefactoofficiallanguage,becauseitislessinterestingthattheyareabletoaccurately identifythetarget.Thismeansthatwearemostlikelygettingdataonhowoutsiders(e.g.non- speakersofthetarget)perceivelanguagesimilarity. Predictions Thereisnoresearch,asfarasweareaware,whichexplorestheperceptionofawiderangeof languagesbyoutsiders(individualswhoarenotfamiliarwiththetargetlanguages)andhow PLOSONE|https://doi.org/10.1371/journal.pone.0165934 April5,2017 3/35 Whyaresomelanguagesconfusedforothers? thiscomparestolinguists’objectivemeasurementsofsimilarity.Thereishoweverrelevant workinperceptualdialectology(i.e.dealingwithverycloselyrelatedvarieties[16]),suchas workonlisteneridentificationofregionalvarietiesofEnglishfromtheUnitedStates[17,18]. Listenersareabletoidentifyregionalvarietieswellabovechance,andtheperceptionofsimi- laritydiffersbetweenlistenersfromdifferentdialectbackgroundsandbetweenlistenerswith varyingknowledgeofrelevantsociolinguisticcues. Listenersusedifferencesinspeechvarietiesascuestosocialidentity(e.g.[19]),leadingto theformationofstereotypesaboutcertainspeechsoundsorwaysofspeaking(e.g.[20]).These attitudesmay,inturn,affecttheperceptionofspeechsounds(e.g.[8,21]). AnotherstudyshowedthatAmericanchildrencouldidentifyAmericanEnglishandIndian Englishvarietiesinaforcedchoicetask,butwerelessgoodatdistinguishingAmericanEnglish andBritishEnglish[22].However,thechildrendidlinkpicturesoffamiliarculturalitems(e.g. housesandclothing)tomorefamiliarvarieties.Thissuggeststhatperceptualandcultural knowledgeofvariationemergeatayoungage.Indeed,youngadultswithhigh-functioning autism(whohavegoodperceptualprocessing,butpoorsocialprocessing)canidentifydialect varieties,butarelessgoodatlinkingthemwithsocialstereotypes[23],suggestingapossible dissociationbetweenthetwoskills. Anexperimentalstudyoftheabilityoflistenerstodistinguishbetweentwoforeignlan- guages(Germanvs.Russian)hasbeencarriedout[24].Participantswereexposedtorecord- ingsofthesamespeakersproducingdifferentlanguages.Theyperformedabovechance,and theirsuccesswaspredictedbytheirlevelofexposuretothevariety,withevenoccasionalexpo- sureviamediaaidingperformance.Thissuggeststhatlistenersaresensitivetodifferences betweenlanguagesaswellasbetweendialects. InthispaperweanalysetheresultsoftheGreatLanguageGame,apopularonlinegamecre- atedbyoneoftheauthorsofthispaper,LarsYencken.Playerslistentoasampleofspeechand mustguesswhichlanguageitisfromasetofalternatives.Weexplorewhichfactorspredict whichlanguagesareguessedaccuratelyandwhatfactorspredictwhichlanguagesareconfused foreachother.Wetestedfourfactorsforconfusion(similarphonemeinventories,geographic proximity,genealogy,andsharedlexicon)andninefactorsforaccuracy(acousticqualityof audioclip,proportionofL2speakers,totalL1speakerpopulation,linguisticdiversityofmain country,numberofcountriesthelanguageisspokenin,languagenametransparency,eco- nomicpowerofmaincountryandfrequencyofoccurrenceofthelanguagenameinwritten materials).Thereareotherpossiblefactorsonecouldconsider,butthesearetheonesthatwe predictedwouldbeinteresting,andwereabletomeasure.Anexampleofafactorthatwould beinterestingbuthardtomeasureismutualintelligibility:linguistscommonlydividespeech varietiesintolanguagesbasedonmutualintelligibility(andsharedlexicon),butnomeasure- mentofmutualintelligibilityofallthelanguagesinoursampleexists. Themostobviouspredictionwewouldmakeisthatplayerswilldifferentiatelanguages basedonphonologicalproperties.Inotherwords,languageswhichsoundmoredifferentmay beeasiertodistinguishfromeachother.Particularphonologicalfeaturesmaybemoresalient thanothers(toneorretroflexconsonants,forexample).Lexicalitemsarealsoapotentialcue towhichspeakersmightpayattention:wepredictthatthemorelexicalitemsareshared betweentwolanguages,themoreoftenthelanguageswillbeconfusedforeachother.Since phonologicalfeaturesandlexicalitemsdiffusethroughborrowingandhistoricaldescent,play- ersmayalsofinditeasiertodistinguishbetweenlanguagesthatarefurtherapartinspace(spo- kenindifferentcountries)ortime(distantlyrelated,ornotatallrelated). Anotherobviousfactorthattheplayermightrelyonissimplyknowledgeofthetargetlan- guage.Thiseffectcanbelimitedtosomeextentbyexcludingresponsesfromcountrieswhere thelanguageisofficialordefacto-official(aswedobelow);butitcannotbeentirelyeliminated. PLOSONE|https://doi.org/10.1371/journal.pone.0165934 April5,2017 4/35 Whyaresomelanguagesconfusedforothers? Theprobabilityofexposuretothetargetlanguagemayberelatedtothenumberofpeoplewho speakit,weconsiderbothL1andL2speakerpopulations. Similarly,theplayermayrelyontheirculturalknowledge:aplayermightnotknowalan- guageverywell,butmighthavehearditbeingspoken(e.g.infilmsorothermedia)orknowa fewfactsaboutit.Aplayer’sexposureorculturalknowledgeofalanguagemaydependonthe ‘globalreach’ofthelanguage,whichcanbereflectedinthepopulationsize,thelevelofindus- trialisation(asmeasuredby,forexample,grossdomesticproduct,mentionsinwrittenmate- rial,howmanycountriesitisspokeninandproportionofL2-speakers).Forexample, MandarinhasmorespeakersthanSpanish,butSpanishisanofficiallanguageinmanymore countries,whichmayincreaseitsglobalreachdespitehavingfewerspeakers. Wealsoconsideredthelinguisticdiversityofthetargetlanguage’smaincountry,asaproxy forculturaldiversity.Countrieswithaunifiedcultural‘brand’mightbemoreeasilyrecognisa- bleandsalientinpeople’sculturalknowledgethanacountrywithdiverseculturesandlan- guages(thisshouldbynomeansbetakenasanendorsementofmonolingualismbythe authors,webelievethatdiversityandmultilingualismaremostimportantandshouldbe encouraged). Languagesdiscussedmorewidelymightbebetterknown.Weestimatethisbylookingat thefrequencyofthenamesofthelanguagesintheGoogleBooksN-gramcorpusofEnglish andChinesetextsfrom1800–2000[25]. Anotherfactoristhequalityoftherecordings.Wepredictthatclearer,higher-quality recordingswillfacilitatelanguagerecognitionandthereforeleadtomorecorrectguesses. Finally,playersmightusevariousnon-linguisticheuristics.Forexample,participantsmight ruleoutcandidatelanguagesthattheyknow,andthereforededucethecorrectanswer,though thisishardtotestgivenourdata.Anotherheuristicistochooselanguagesbasedongeo- graphicorculturalproximity.Forexample,aplayermightthinkthatalanguagesoundslike Russian,butthenseethatRussianisnotpresentedasacandidateanswer.However,itmaybe thatSlovakisanalternative,theplayerreasonsthatRussiaandSlovakiaareculturallycon- nected,areinthesamepartoftheworld,orformpartofthesamelanguagefamily,andsothey maychooseSlovakdespitenotknowingwhatSlovaksoundslike.Thiswouldpredictthatcon- fusionbetweenlanguagesisrelatedtothegeographicdistancebetweenthem,butalsothat playersmightconfuselanguageswhicharegeographicallyclose,butquitedifferentintermsof theirgenealogy,forexampleLatvian(Baltic,Indo-European)andEstonian(Finnic,Uralic). Thisheuristiccanbeappliedtosomelanguagesmoreeasilythanothers.Forexample,the geographiclocationof“ScottishGaelic”istransparentfromthename,whilethisisnotthecase for“Shona”,whichisspokeninZimbabwe.Thename“Kannada”maybeentirelymisleading forWesternplayerswhomightlinkitwith“Canada”,whenitisactuallyaDravidianlanguage spokeninIndia.Therefore,weconsiderthetransparencybetweenthelanguagename,andthe languagemaincountry.Thisisreferredtoas‘languagenametransparency’andcanalsobe linkedtothepreviouslydiscussedideaofa‘cultural/nationalbrand’. Thus,inrelationtoQ3above,wepredictthattwolanguagesaremorelikelytobeconfused iftheyare: • geographicallyclosetoeachother; • closelyrelatedhistorically; • similarintheirphonemeinventories; • similarintheirlexicon. InrelationtoQ4,wepredictthatalanguageismorelikelytobeguessedcorrectlyifitis: PLOSONE|https://doi.org/10.1371/journal.pone.0165934 April5,2017 5/35 Whyaresomelanguagesconfusedforothers? • representedinthegamewithspeechsamplesofhighquality; • spokenasasecondlanguagebyalargeproportionofthetotalnumberofspeakers; • spokenbyalargepopulationasanativelanguage(L1); • mainlyspokeninacountrywheremostpeoplespeakthesamelanguage(lowlinguistic diversity); • spokeninmanycountries; • transparently-named; • spokenmainlyinacountrywithgreateconomicpower; • oftenmentionedinChineseGoogleBooks; • oftenmentionedinEnglishGoogleBooks. Finally,sinceperceptionmightbelinkedtoculturalknowledgeorshapedbylinguistic experience,wepredictthatpatternsofconfusionwilldifferbetweenplayersfromdifferent countries. Caveats Thedatainthecurrentstudyprovidesuswithauniqueopportunity,butalsocarrieswithit certainproblems.Ontheonehand,thereisalmosttoomuchdata:therearemorequestionsto beaskedthanwecanpossiblycoverhere.Wethusrestrictourselvestoanalysingdatapertain- ingtolanguagesandcountries,andleavemorefine-grainedanalyses,suchasattheregionalor statelevel,forfutureresearch.Ontheotherhand,thereisnotenoughdata.Thedataonplayer identityislimited,weonlyknowthelocationoftheirIP-addresses.Furthermore,theselection ofparticipantsisbiased(thegamerequiresacomputerandtheInternettoplay,andwasdevel- opedanddistributedincertainsocialcircles,andrequiressomeknowledgeofEnglish).Lastly, thesampleoflanguagesinthegameisnotrepresentativeoftheworld’slanguages.Wetryto controlfortheseimbalances,butcanonlydrawtentativeconclusions.Wewilldiscusslaterin thepaperhowthiskindofgamecouldbechangedtobetterservelinguisticresearch,andpres- entanalternative—LingQuest. TheworkofCharlotteGooskensandcolleaguesonmutualintelligibilityofEuropeanlan- guagesdemonstratesthatspeakers’linguisticknowledgeismorecomplicatedthanitmight seem.Forexample,speakers’attitudestowardsthenativespeakersofalanguagemightinflu- encetheirperceptionofthatlanguage[8].Thisissomethingthatwewillbeawareofaswe interpretourresults,butthatweunfortunatelycannottestforinarigorousmanner. Materialsandmethods Thegame “TheGreatLanguageGame”(henceforthGLG;seehttps://greatlanguagegame.com/)wascre- atedbyLarsYenckenin2013.Theoriginalintentwastoincreasepublicawarenessofthelin- guisticdiversityofurbanareasofAustralia,theUnitedKingdomandtheUnitedStates.Inthis sectionwepresentthegame,thelanguagesinthegameandthefactorsthatwehaveinvesti- gatedinrelationtoplayers’abilitytoaccuratelyidentifylanguagesandtheirconfusion. TheGLGworksasfollows:theplayerlistenstoa20-secondaudioclipofnaturalspeech fromoneof78languages.Theymustthenidentifythecorrectlanguagefromasetofalterna- tives.ThealternativesarerepresentedwiththeircommonEnglishnames.Theplayercan PLOSONE|https://doi.org/10.1371/journal.pone.0165934 April5,2017 6/35 Whyaresomelanguagesconfusedforothers? Fig1.AscreenshotfromtheGreatLanguageGame. https://doi.org/10.1371/journal.pone.0165934.g001 makeachoiceatanytime,thereisnotimelimitorrewardforbeingspeedy.Theycanalso playtheclipmorethanonce.Wedonotknowhowplayersbehavedhere.Aftertheymakea choice,theyaretoldwhethertheywerecorrectorincorrect,andareshownthecorrectanswer. Inthefirstround,theplayerhastwoalternatives,thecorrectanswerandonedistractor.After everythreecorrectanswers,thenumberofdistractorsincreasesby1,uptoamaximumof10 optionsintotal.Whentheplayerhasmadethreeincorrectchoices,thegameends.Theyare thengivensomeinformationaboutthethreelanguageswhichtheyguessedincorrectly,asan encouragementtolearnmoreaboutthem.TheinstructionallanguageofthegameisEnglish, butthereisverylittletextthatneedstobereadinordertoplaythegame.Fig1isascreenshot ofthegameasitappearsinawebbrowser. Inthispaper,wemaketheassumptionthattheselectionoflanguagesastargetsinthegame israndom(seeS1Appendix),althoughwetakeintoaccountthenumberoftimesthelanguage wasofferedasatargetlanguage.Givenanaudioclip,theincorrectalternatives(distractors) providedareselectedrandomlywithuniformlikelihood.Thedistractorsandthecorrectalter- nativearedisplayedasoptionsinalphabeticalorder.Inpractice,thealternativesaplayer receivesmayvastlyalterthedifficultyofaquestion.Forexample,theplayermightbegivena setoflanguagesthatbelongtoverydifferentlanguagefamilies(e.g.Polish,UrduandTongan), mostlikelymakingthetaskeasier.Alternatively,theymightbegivenasetofveryclosely relatedlanguages(e.g.Swedish,IcelandicandNorwegian):amuchhardertask.Theyarenot awardedanypointsforselectingadistractorthatisinsomewaymoresimilartothetarget thantheotherdistractorsare.Forexample:ifthetargetisSamoanandthealternativesare Samoan,TonganandNorwegian,youwillbeequallywrongpickingTonganasyouwould havebeenpickingNorwegian,despiteTonganbeingcloselyrelatedtoSamoan. Fromhereonwards,wedescribeinmoredetailthegame’sinventoryoflanguagesandits playerbase.OurdiscussionisbasedontheGLG’spubliclyavailable2014-03-02confusion dataset(seehttp://lars.yencken.org/datasets/languagegame/andsupportinginformationS8 Data),andassociatedaudiofilesfromthattime(asmartphonecloneofGLGwascreatedin 2015byIsaacDrachman,called“Linguini”whichusesmanyofthesameaudioclipsasGLG, butcurrentlynodatahasbeenreleasedforLinguini,sononehasbeenconsideredinthis paper). PLOSONE|https://doi.org/10.1371/journal.pone.0165934 April5,2017 7/35 Whyaresomelanguagesconfusedforothers? Fig2.GeographicallocationsofthelanguagesincludedintheGreatLanguageGame.ThemapwascreatedinRusingthepackage maps[26].CoordinatescomefromGlottolog[12]. https://doi.org/10.1371/journal.pone.0165934.g002 Thelanguagesinthegame Atthetimeofwriting,theGLGfeatures78languages.Forfulldetailsonthelanguagesinthe game,theirgenealogyandtherateatwhichtheyareguessedaccuratelyinthegame,seesup- portinginformationS1Data.Fig2showsthegeographicallocationofthelanguages,withloca- tionstakenfromGlottolog[12]. Thesampleoflanguagesinthegameisnotbalancedaccordingtogeographyorlinguistic relatedness.Thisisevidentfromthemap:EuropeisoverrepresentedandtheAmericas completelyabsent.39ofthe78languagesarefromtheIndo-Europeanlanguagefamily.Thisis becausethesamplewasoriginallydesignedtoreflectthelinguisticdiversityofurbanareasof Australia,theUnitedKingdomandtheUnitedStates,forthepurposeofspreadingawareness aboutlinguisticdiversityinthesenationstothegeneralpopulation. Thesampleoflanguagesisalsodependentonaccesstospeechsamples.Thespeechsamples usedinthegamearemostlydrawnfromlargeradiobroadcasterswhichbroadcastshowsand newsinnon-majoritylanguagesoftheirrespectivecountries.Themajorityofthespeechsam- plescamefromAustralia’sSpecialBroadcastingService(SBS),VoiceOfAmerica(VOA),and BritishBroadcastingCorporation(BBC).SBS,VOAandBBCbroadcastradioandTVinlan- guagesspokenbysignificantpopulationsinAustralia,theUnitedStatesandtheUnitedKing- domrespectively.OthersmallersourcesofspeechsamplesforthegameareTheHindu, Phonemica(China),DeutscheWelle,TeluguOne,NHKJapan,ORFEins(Austria),SR(Swe- den),CBC(Canada),RFI(France),andthePacificandRegionalArchiveforDigitalSourcesin EndangeredCultures(PARADISEC).PARADISECisaarchivefordigitalconservationof endangeredlanguagesandcultures,withafocusonthePacificregion.Atthistime,theGLG doesnotrepresentthefullsetoflanguagesavailablefromthesebroadcasters(Englishand KreyòlareforexamplenotpresentdespitebeingcoveredbyVOA),norallthelanguagesof PARADISEC. This,incombinationwiththefactthattheaimofthegameistospreadawarenessabout linguisticdiversityincertainspecificareasinWesterncountries,resultsinunder- PLOSONE|https://doi.org/10.1371/journal.pone.0165934 April5,2017 8/35 Whyaresomelanguagesconfusedforothers? representationofindigenouslanguagesofotherareas,suchastheAmericas(0languages, “Americas”herestandsforbothSouthAmericaandNorthAmerica.Asasidenoteitisinter- estingtoobservethatdespiteVOAbeinglocatedintheUnitedStatestheydonotbroadcast newsinanyindigenouslanguageoftheAmericas,withtheexceptionofKreyòl.Kreyòlisa French-lexifiedcreole)andAfrica(8languages)andover-representationofothers,suchas Europe(33)andAsia(31). Audioclipscomefromrandomlychosen20secondsegmentsofbroadcasts,identifiedas beingfreefromexcessivenoise,musicorcode-switching(mixingofmultiplelanguages).Most clipsareofaninformalnature,andmayincludevariablenumbersofspeakers,laughter,back- channellingandoverlappingdialogue.Duetotheirdifferentsources,clipsalsovaryinaudio quality.Inthecaseoftelephoneinterviews,audioqualityalsovariesbetweenspeakerswithin thesameclip.Elicitedaudiorecordingscreatedinacontrolledlabenvironmentmighthave lessvariationintermsofacousticquality,typesofvoices,speedetc.Thismakethemmore standardisedandeasytocomparetoeachother,moresuitableforascientificcontrolledexper- iment.However,theclipsusedinthisgamehavetheadvantageofbeingmoresimilartoactual naturaldiscourse,hencemorefaithfultotheactualexperiencesofpeopleperceivinglanguages outintheworld. Forthepurposesofcomparingtheresultsofthedatafromthegametootherdatabaseswe havelinkedeachlanguageinthegametoitsappropriatecodeintheISO639-3setoflanguage codes.TheISO639-3isaninternationalstandardoflanguagecodesusedbymanylinguists andappliedinmanycross-linguisticdatabases.Inafewcasestherearenotclearone-to-one mappings;insucheventsthesocalledmacrolanguage-codesoftheISO639-2havebeenused instead(thisconcernsAlbanianwhichhasbeenmarkedas[sqi],Arabic[ara],Dinka[din], Kurdish[kur],Latvian[lav]andYiddish[yid]).Theissueoflanguagenames,languageclassifi- cationandISO-setsisverycomplicated;formoreonthistopicsee[27]and[28].Inourset someofthesecomplicationsaremadeparticularlyevidentbythelanguages“Bosnian”[bos], “Serbian”[srp]and“Croatian”[hrv]whichallareseenasdifferentlanguagesbytheEthnolo- gueandGlottolog,butareperceivedasonelanguagebysomeotherlinguistsandnationalcen- suses.Weacknowledgetheseproblems,solvingthemishoweverbeyondthescopeofthis study. Theissueoflanguagenamesandcodesalsobecomesaconcernwhenwecompareourdata withotherdatabases.Wecomparedtheconfusiondatafromthegamewiththesimilarities anddifferencesbetweenthelanguages’phonemeinventories.Dataonthephonemeinvento- riesofthelanguagesweretakenfromthePhoneticsInformationBaseandLexicon-database (PHOIBLE)[29],whichisitselfacompilationofinformationfromotherdatabases.PHOIBLE alsoemploysISO639-3,butunfortunatelynotalllanguagesoftheGLGsamplehadadirect correspondenceinPHOIBLE.Inthecaseofthemacro-languagesof639-2weusedthesub- varietyof639-3thatwaspresentinPHOIBLE.Inaveryrestrictedsetofcasesweusedaneigh- bouringlanguagevarietywhentherewasnotaclearmapping;thisconcernsonlyMalay([zlm] inGLG,whichwaslinkedto[zsm]inPHOIBLE),Saami[sme−>sma]andKhmer[khm−> kxm].Inotherwords,forthesethreelanguageswedidnotcomparetheexactsamevarieties acrossourGLGsampleandthesampleofPHOIBLE.Wedidhoweverhaveaccesstoavery closelyrelatedvarietyandwebelievethatthiswillhavelittle,ifany,effectontheresults. Phonemeinventoriesfor60ofthelanguagesincludedintheGLGwerefoundinthePHO- IBLEdatabase.PHOIBLEincludesdatafromtheSPA(StanfordPhonologyArchive),UPSID (UCLAPhonologicalSegmentInventoryDatabase)andothercollections.Formanyofour languages,therewasmorethanoneanalysisoftheinventoryavailableinPHOIBLE.Thedif- ferentdatabasesthatmakeupPHOIBLEandprovidedifferentanalysesofthesamelanguage differintheirdesign,renderingdirectcomparisonsovertheminappropriate.Languageswere PLOSONE|https://doi.org/10.1371/journal.pone.0165934 April5,2017 9/35 Whyaresomelanguagesconfusedforothers? onlycomparedifthereexistedinventoriesforeachofthelanguagesfromcamefromthesame source(UPSIDwasonlycomparedwithUPSID,SPAonlywithSPA).Intotaltherewere31 languagesforwhichwecouldmeasurethedistancebetweeneachpairinthismanner.For eachlanguagepair,thedistancebetweenthemwascalculatedas1minustheproportionof phonemesincommon.Tonesegmentswereexcludedinallcomparisonssincetheywerenot representedinUPSID. Theplayers TheonlyinformationabouttheplayerswhichthegameelicitsistheirIP-address,whichwe usetoinferwhatcountrytheyareplayingfrom(theIPaddressesweremadeanonymousin thepubliclyavailabledata).Wemaketheassumptionthattheplayeridentifieswiththiscoun- tryinsomemanner(thoughsomemaybevisiting,travelling,tunnellingtheirtrafficthrough othercountriesviaVPNorsimilartechnology,etc.).Thedataincludesresponsesfromanesti- mated767,000uniqueIPaddresses(GoogleAnalyticsestimates964,000uniqueuseridentities, meaningthatprobablymanydifferentpeopleplayonthesamecomputer).Playerscanplay multipletimes,butforsimplicity,weassumethatmostplayersarecasualplayerswhodonot playoftenenoughtolearnfromplayingonly. Furthermore,weassumethatmostplayershaveaknowledgeofEnglish,atleastsufficiently tofindthegameinthefirstplaceandunderstandtheinstructions.Theplayersarealsomost likelycomputer-literateandshareaninterestinlanguages,orareinterestedingamesofthis kindwhereonemeasuresknowledgeoftheworldand/orlinguisticsandcompetesagainst friends.Thegamehasbeenextensivelysharedoncertainblogs,forumsandsocialmedia, whereweexpecttheretobealargenumberoflanguageenthusiasts/nerds.Theplayersofthe GLGcomefromallovertheworld,butcertainareasareoverrepresented.Table1displays numberofguesses(trials)fromeachcontinent. Measuringconfusionbetweenlanguages Therawdataproducedbythegameisalistofplayerresponses,alongwithinformationonthe optionsavailabletotheplayerineachtrial,aswellasthecountryoforiginoftheIP-address. Weusethisdatatocalculatetheprobabilityofonelanguagebeingconfusedforanother,inthe followingway:givenatargetlanguageinthespeechsample,andanumberofalternativelan- guagenamestochoosefrom,aplayerselectsonelanguageastheirguess.Fromalargesample ofgames,weknowtheconditionalprobabilityofguessingaparticularlanguageG,giventhat theplayeractuallyheardaparticulartargetlanguageT—thatis,thechanceofaplayerconfus- ingTforG.ThisistheproportionoftimesaplayeractuallyguessedGwhenhearingT Table1.LocationsofguessesbycontinentbasedonIP-address.Notethatwhiletherearenoindigenous peopleofAntarctica,thereareinfactresidentsthere,varyingfrom1,100to4,400duringtheyear[30]. ContinentofIP-address Numberofguesses Europe 7,963,630 NorthAmerica 5,980,767 Asia 841,609 Paciﬁc 364,390 SouthAmerica 356,390 Africa 74,032 Antarctica 11 Total 15,580,829 https://doi.org/10.1371/journal.pone.0165934.t001 PLOSONE|https://doi.org/10.1371/journal.pone.0165934 April5,2017 10/35

Description:

Slavic and East European Language Research Center (SEELRC),. Duke University; 2003. 48. Beekes R. Comparative Indo-European linguistics: An Introduction. The Netherlands: John Benjamins. 1995;. 49. Gray RD, Atkinson QD. Language-tree divergence times support the Anatolian theory of Indo-

Why are some languages confused for others? Investigating data from the Great Language Game PDF

35 Pages·2017·3.6 MB·English

Checking for file health...

Save to my drive

Quick download

Download

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Why are some languages confused for others? Investigating data from the Great Language Game

Description:

See more

The list of books you might like

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.