Sentiment/Subjectivity Analysis Survey for Languages other than English

Mohammed Korayem*, Khalifeh Aljadda*, and David Crandall**
*CareerBuilder, GA
**School of Informatics and Computing, Indiana University, Bloomington, IN
[email protected]
mkorayem,[email protected]

August 26, 2016

Abstract

Subjective and sentiment analysis have gained considerable attention recently. Most of the resources and systems built so far are for English, yet the need for systems that handle other languages is increasing. This paper surveys the different approaches used to build subjectivity and sentiment analysis systems for languages other than English. Three types of approaches are distinguished: the first (and best performing) is to build language-specific systems; the second reuses or transfers sentiment resources from English to the target language; the third relies on language-independent methods. The paper also devotes a separate section to Arabic sentiment analysis.

1 Introduction

Nowadays, the Web has become a read-and-write platform where users are no longer mere consumers of information but producers of it as well. User-generated content written in natural language as unstructured free text has become an integral part of the Web, mainly because of the dramatic growth of social network sites, video sharing sites, news portals, online review sites, and online forums and blogs. Because of this proliferation of user-generated content, Web content mining is gaining considerable attention due to its importance for many businesses, governmental agencies, and institutions.

Sentiment analysis (also referred to as opinion mining) is the computational study of attitudes, views, and emotions found in texts. The texts could be any kind of document (e.g., comments, feedback, reviews, or blogs). Sentiment analysis can be viewed as a classification process that aims at determining whether a certain document was written to express a positive or a negative opinion about a certain topic, product, or person. This process regards each document as a basic information unit and has been referred to as "document-level sentiment classification," where the document is seen as an opinionated product. The analysis or classification of sentiment at the sentence level is referred to as "sentence-level sentiment classification" [31].

Sentiment analysis is attracting broad attention because of the potential of using the summarized opinions of a large population in industry as well as in other fields. For instance, having such an opinion summary available can enhance businesses, as business owners gain access to consumer opinions, while individuals benefit by being able to compare products. Sentiment analysis thus makes it possible to summarize people's opinions towards products as well as politicians.

Performing this type of analysis (either at the sentence level or the document level) has been done using two types of classifiers: rule-based classifiers [13,15,17,24,52] and machine learning classifiers [1,27,28,31,36,37,46]. Currently, most of these systems are built for English [27,31].

The current paper explores sentiment/subjectivity analysis systems created for languages other than English, with special attention given to Arabic. The paper aims at providing the reader with information about the methods used for building sentiment analysis systems.

After surveying the different ways used for building sentiment analysis systems for languages other than English, the paper concludes with a suggestion about the optimum method(s) to be followed.
The best method is employing tools built around language-specific features; its main drawback is the cost of building resources for each language. The second method is transferring sentiment knowledge from English into the target language. The final approach is to use language-independent methods.

The paper is divided into four parts. The first part covers language-independent methods. The second part surveys sentiment transfer methods created to transfer sentiment from English to other languages. The third part explores systems built specifically for languages other than English. The last part focuses on the methods used for Arabic.

2 Language-Independent Feature Selection/Extraction Methods

One way of performing sentiment analysis for languages other than English, or of building systems workable for multiple languages, is to extract and select features that do not depend on these languages. Different approaches have been followed to select and extract such features: (1) Entropy Weighted Genetic Algorithms, (2) feature subsumption, (3) local grammar based methods, (4) positional features, and (5) common seed word methods. Each feature selection/extraction approach is described separately below.

2.1 Entropy Weighted Genetic Algorithm

The Weighted Genetic Algorithm is an optimization technique that can be used for feature selection. The Entropy Weighted Genetic Algorithm (EWGA) combines Information Gain (IG) and genetic algorithms (GA) to select features. EWGA, proposed in [1], was used to select features for Arabic and English. IG is combined with each step of the GA process: it is used to select the initial set of features for the initial stage, and it is also applied during the cross-over and mutation stages. Abbasi et al. [1] presented a sentiment analysis system for Web forums in multiple languages based on EWGA. They used two types of features, stylistic features and lexical features. Semantic features were avoided because they are language dependent and need lexicon resources, while the limited size of their data prevented the use of link features. They evaluated their system on a benchmark testbed of movie reviews consisting of 1000 positive and 1000 negative reviews [29,30,32,48].

Importantly, their system, which is based on this feature selection method, outperforms the systems in [29,30,32,48]. Using this system, they achieved an accuracy of 91%, while the other systems achieved accuracies between 87-90% on the movie reviews dataset. They were also able to achieve 92% accuracy on Middle Eastern forums and 90% on US forums using the EWGA feature selection method.

2.2 Feature Subsumption for Sentiment Classification in Multiple Languages

Another method for extracting and selecting features is proposed by Zhai et al. [51]. The authors proposed the feature "subsumption" method to extract and select substring-group features, applied to Chinese, English, and Spanish. The system designed by Zhai et al. consists of four processes: (1) substring feature extraction, (2) term weighting, (3) feature selection, and (4) classification. For extracting substring-group features, they built a suffix tree, incorporating transductive learning by considering the unlabeled test documents when building the suffix tree. They applied four different weighting schemes (binary, three, tf, and tfidf-c), and "tfidf-c" outperformed all other approaches. The "tfidf-c" scheme is an extended form of the standard "tfidf" and is defined as follows:

tfidf\text{-}c(t_k, d_j) = \frac{tf(t_k, d_j) \times \log(N/df(t_k))}{\sqrt{\sum_{t_s \in d_j} \big(tf(t_s, d_j) \times \log(N/df(t_s))\big)^2}}    (1)

where t_k represents the term corresponding to the single feature, tf(t_k, d_j) is the term frequency of term t_k in document d_j, df(t_k) is the number of documents containing the term, and N is the total number of documents.
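To make Equation 1 concrete, the following is a minimal Python sketch of the cosine-normalized weighting; the toy documents, the `tfidf_c` helper name, and the use of raw token counts are illustrative assumptions, not the implementation of [51].

```python
import math
from collections import Counter

def tfidf_c(term, doc_tokens, df, num_docs):
    """Cosine-normalized tf-idf weight (Equation 1) of `term` in one document.

    doc_tokens: list of tokens in the document d_j
    df:         mapping from each term to its document frequency df(t)
    num_docs:   total number of documents N
    """
    tf = Counter(doc_tokens)

    def weight(t):
        # tf(t, d_j) * log(N / df(t)); terms with no document frequency get zero weight
        return tf[t] * math.log(num_docs / df[t]) if df.get(t) else 0.0

    norm = math.sqrt(sum(weight(t) ** 2 for t in tf))
    return weight(term) / norm if norm else 0.0

# Illustrative usage with a toy collection of three documents.
docs = [["good", "movie", "good"], ["bad", "plot"], ["good", "plot", "twist"]]
df = Counter(t for d in docs for t in set(d))
print(round(tfidf_c("good", docs[0], df, len(docs)), 3))
```

In practice the same weight would be computed for every substring-group feature selected from the suffix tree.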
Term presence usually outperforms term frequency [33,51]. Zhai et al. [51] applied the document frequency method as a feature selection technique by keeping the top N features with the highest document frequency scores. They tested the proposed system on three data sets: 1) an English data set of movie reviews, 2) a Chinese data set of hotel reviews, and 3) a Spanish data set of reviews on cars, hotels, and other products. The accuracy rates achieved were 94.0%, 84.3%, and 78.7% for Chinese, English, and Spanish respectively. This system compares favorably with the systems in [25,33] used for the English and Chinese data sets. However, it was outperformed by Abbasi et al. [1] on the English data set described in the previous section.

2.3 Local Grammar Methods

Local grammar is another method that can be used to extract sentiment features. It has been used to extract sentiment phrases in the financial domain [8,9].

Ahmed et al. [9] proposed this approach for the financial news domain. They identified interesting keywords by comparing the distribution of words in a financial news corpus with the distribution of the same words in a general language corpus. Using the context around these words, they built a local grammar to extract sentiment-bearing phrases. They applied their approach to Arabic, English, and Chinese. They evaluated the system manually and achieved accuracy rates between 60-75% for extracting the sentiment-bearing phrases. Importantly, the proposed system could be used to extract sentiment phrases in the financial domain for any language.

Agić et al. [8] used local grammar to extract sentiment phrases from financial articles. They demonstrated that there is a relation between the number of polarized phrases and the overall sentiment of an article. They built a "golden sentiment analysis data set" for the financial domain in Croatian, manually annotating the articles as positive or negative; some of the articles were also annotated at the phrase level.

Importantly, while Bollen et al. [12] showed that there is a correlation between collective mood states extracted from large-scale Twitter feeds and the value of the Dow Jones Industrial Average (DJIA) over time, Agić et al. demonstrate that there is a statistically significant correlation between the total market trend on the Zagreb Stock Exchange and the number of positively and negatively annotated articles within the same periods. The corpus used for this analysis was collected from two different resources: online newspapers specialized in finance and a large forum discussing the Zagreb Stock Exchange (CROBEX). For CROBEX, two long periods of time were chosen, one for positive articles between 2007-01-02 and 2007-05-21 and the other for negative ones published between 2008-01-07 and 2008-04-16. The financial news documents were selected randomly from the corpus for the same two periods and annotated manually.

2.4 Positional Feature Methods

Positional information about words and sentences has been used to build sentiment systems. Raychev and Nakov [34] proposed a language-independent method based on subjectivity and positional information. Specifically, they weighted unigram and bigram terms based on their position in documents. They incorporated subjectivity information by removing non-subjective sentences and moving the subjective sentences to the end of the documents according to the likelihood of sentence subjectivity. This was done by training a Naive Bayes classifier on a subjectivity data set and sorting the sentences by their subjectivity likelihood score. They evaluated their method on the standard movie reviews data set used in [29,30,32,48] and achieved an accuracy of 89.85% using unigrams, bigrams, the subjectivity filter, and subjectivity sorting.
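The subjectivity filtering and sorting step of [34] can be sketched as follows; the tiny training set, the scikit-learn models, and the 0.5 threshold are placeholder assumptions used only to illustrate the idea.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Placeholder subjectivity corpus: 1 = subjective sentence, 0 = objective sentence.
train_sents = ["the film was absolutely wonderful", "the movie runs 120 minutes",
               "i hated the ending", "it was released in 2004"]
train_labels = [1, 0, 1, 0]

vec = CountVectorizer(ngram_range=(1, 2))            # unigram and bigram features
X = vec.fit_transform(train_sents)
nb = MultinomialNB().fit(X, train_labels)

def reorder_document(sentences, min_subjectivity=0.5):
    """Drop clearly objective sentences and move subjective ones towards the end,
    sorted by their Naive Bayes subjectivity likelihood (ascending)."""
    scored = [(nb.predict_proba(vec.transform([s]))[0, 1], s) for s in sentences]
    kept = [(p, s) for p, s in scored if p >= min_subjectivity]
    return [s for p, s in sorted(kept)]              # most subjective sentences last

doc = ["it was released in 2004", "i hated the ending", "the film was absolutely wonderful"]
print(reorder_document(doc))
```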
2.5 Common Seed Word Methods

Using very few common words like "very," "bad," and "good" in English, Lin et al. [26] built a sentiment analysis system. The authors proposed a multilingual sentiment system using a few seed words which could be applied to any language, because it is language independent and does not depend on features of any particular language. First, they extracted opinion words based on the assumption that there is an adverb of degree in each language (e.g., "very" in English). They extracted words using heuristic information based on patterns like "word behind very" and removed stop words based on frequency. The next step after extracting opinion words is to cluster them into positive and negative clusters.

To cluster the words, they proposed a simple and effective method consisting of three steps: (1) labeling all samples and words based on the two seed words "good" and "bad", (2) computing an exclusive polarity for each opinion word using KL-divergence to disambiguate words that appear in both positive and negative examples, and (3) computing new labels for the samples based on the computed polarity of the words.

After creating lexicons of positive and negative words, they introduced semi-supervised learning to build the sentiment classifier. They evaluated the system using hotel review data sets for several languages (French, German, Spanish, and Dutch). Their system achieved accuracy rates of 80.37%, 79.13%, 80.05%, and 81.33% for French, German, Spanish, and Dutch respectively.

They compared their system to two baseline systems, sentiment lexicon based methods and machine translation based methods. While the translation-based baseline outperforms the lexicon-based one, the proposed system outperforms both baselines.

3 Sentiment Translation Methods

Transferring sentiment resources and techniques from well-studied languages to new ones is another way of building sentiment/subjectivity systems. Simply put, these methods use machine translation techniques to translate resources (corpora) into the new languages. This section surveys various sentiment/subjectivity methods based on machine translation, along with the techniques used to mitigate the problems caused by inaccurate machine translation. Other, graph-based methods used to translate sentiment are also presented.

3.1 Machine Translation

Machine translation (henceforth MT) has been used as a simple tool to create sentiment systems for multiple languages. In these systems, MT has been used [15,28,45] to translate corpora of different languages into English; following the translation, subjectivity/sentiment classifiers are built in English. The simplicity of using MT stems from the availability of its techniques and of English resources. MT is also used to generate resources and corpora for languages other than English: sentiment lexicons and corpora have been generated in this way for Bengali, Czech, French, German, Hungarian, Italian, Romanian, and Spanish [11,13,14,40,41].

3.1.1 Machine Translation SSA Systems

Owing to the simplicity and availability of MT, Kerstin [15] proposed a pipeline system based on SentiWordNet [19] and MT techniques for multiple languages. The proposed pipeline consists of the following steps:

• language classification, where the LingPipe language identifier is used;
• translation of the document from the identified language to English;
• text preparation by stemming; and
• classification of the sentiment.

Simplicity and variability characterize the different ways used to build the classifiers. For instance, three different ways were used to build the classifiers in [15]: machine learning classifiers, LingPipe, and rule-based classifiers.
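The four pipeline stages can be expressed as a small function; the component callables below are hypothetical stand-ins (the original system used the LingPipe identifier, an MT step, and SentiWordNet), so this is only a sketch of the control flow.

```python
from typing import Callable

def sentiment_pipeline(document: str,
                       identify_language: Callable[[str], str],
                       translate_to_english: Callable[[str, str], str],
                       stem: Callable[[str], list],
                       lexicon_score: Callable[[str], float]) -> str:
    """Sketch of an MT-based SSA pipeline:
    language classification -> translation -> stemming -> sentiment classification."""
    lang = identify_language(document)                  # e.g. a LingPipe-style identifier
    english = document if lang == "en" else translate_to_english(document, lang)
    stems = stem(english)                               # text preparation by stemming
    score = sum(lexicon_score(tok) for tok in stems)    # e.g. lexicon prior polarities
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# Toy stand-ins just to exercise the pipeline shape.
label = sentiment_pipeline("Ein wirklich guter Film",
                           identify_language=lambda t: "de",
                           translate_to_english=lambda t, l: "a really good movie",
                           stem=lambda t: t.lower().split(),
                           lexicon_score=lambda w: {"good": 1.0, "bad": -1.0}.get(w, 0.0))
print(label)   # -> "positive"
```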
A comparison of the three ways of building classifiers shows that the machine learning classifier provides the most accurate results (the SentiWordNet scores were 62% on the MPQA corpus [49] and 66% for German movie reviews). The system proposed in [15] is simple and could be applied to any language.

Similarly, MT techniques have also been used to build sentiment systems for Chinese [45]. Wan et al. [45] used automatic machine translation to translate Chinese documents into English, after which an English lexicon is used. Several methods to combine the results from both languages were suggested, including average, weighted average, min, max, and majority voting. The semantic orientation method was used to compute the score of the documents, with a window size used to handle negation.

The obtained results showed that using the translated reviews gives better results than the original Chinese ones. This stands in contrast to what might have been expected: the original language with the original lexicon should have given better results than the translated one. The ensemble methods also improve the obtained results.

Another use of MT is incorporating features from many languages to improve classifier accuracy. Banea et al. [10] integrated features from multiple languages to build a high-precision classifier using majority vote. A basic classifier trained on a single language was used as the basis for this high-precision classifier. The system was evaluated on the MPQA corpus. The integrated feature technique was used for six different languages (Arabic, French, English, German, Romanian, and Spanish). Two types of feature sets (monolingual and multilingual) were used: the feature vector for each sentence in the monolingual feature set consists of unigrams of that language, while the feature vector of the multilingual feature set consists of combinations of monolingual unigrams.

Importantly, the results show that English annotated data sets can be leveraged to build successful classifiers for other languages. The created classifiers have macro-accuracies between 71.30% and 73.89% for Arabic and English, with the English classifier outperforming those for the other languages. The results for non-English classifiers show that using the multilingual data set can improve the accuracy of the classifier for the source language as well as classifiers for the target languages. Specifically, the best results are obtained with a classifier trained over the combination of all six languages [10].

This suggests that using multi-language data sets can enrich the features and reduce ambiguity. In addition, the English classifier achieved the best accuracy rate among all monolingual classifiers. When investigating the combination of any two of the six languages, the German and Spanish classifier achieved the best results; performance increased when Romanian and Arabic were added, while adding English and French did not improve the results. Indeed, these results suggest that Spanish and German expanded the dimensionality covered by English, Arabic, and Romanian by adding high-quality features for the classification task. The authors also showed that a majority voting classifier combining all monolingual classifiers could be used as a high-precision classifier with an acceptable recall level.
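The majority-vote combination of monolingual classifiers can be sketched as below; representing classifiers as plain callables and abstaining when there is no strict majority are assumptions of this sketch, not details from [10].

```python
def majority_vote(sentence_versions, classifiers):
    """Combine monolingual subjectivity classifiers by majority vote.

    sentence_versions: dict mapping language -> (translated) sentence text
    classifiers:       dict mapping language -> callable returning "subjective" or "objective"
    Returns a label only when a strict majority of the available classifiers agrees,
    otherwise abstains, trading recall for precision.
    """
    votes = [classifiers[lang](text) for lang, text in sentence_versions.items()
             if lang in classifiers]
    if not votes:
        return None
    top = max(set(votes), key=votes.count)
    return top if votes.count(top) > len(votes) / 2 else None   # abstain on ties/splits
```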
3.1.2 Machine Translation as a Resource Generator

In addition to using MT as a technique for building sentiment/subjectivity systems, as explained above, it has been used to create resources and dictionaries for sentiment analysis in multiple languages. Mihalcea et al. [28] employ two different ways to generate resources for subjectivity in a new language by leveraging the tools and resources available for English. The first method is translating an existing English lexicon into the target language using a bilingual dictionary. The second method is a corpus-based approach in which an annotated corpus in the target language is built using a projection from the source language [28].

In the first method, the authors translate the target language lexicon using two bilingual dictionaries [49]. Some problems emerged with this approach. First, some words lost their subjectivity in the process; for example, when translating into Romanian, the word "memories" lost its subjectivity as it was translated into "the power of retaining information." Second, in some cases sense information was lacking for the individual entries in the lexicon and the bilingual dictionary. Third, some multi-word expressions were not translated accurately, which led to a loss of subjectivity for some of these expressions after translation.

Attempts to solve the first problem have been introduced. In [14], researchers overcame this obstacle by clustering words that share the same root and then checking the root itself against the English lexicon; if the root exists, the word is kept in the list to be translated. To overcome the second problem, heuristic approaches are used, such as the most-frequent-sense technique in [28] and the First Word (FW) heuristic [24]. For the third problem, a simple way of handling multi-word expressions is to translate them word by word [28] and to use the Web to validate the translation by checking its occurrence on the Web.

Evaluation of the method of translating the lexicon using bilingual dictionaries shows that the translated lexicon is less reliable than the English one. A rule-based classifier is used to evaluate the lexicon. This classifier uses a simple heuristic: it labels a sentence as subjective if it contains two or more strong subjective expressions, and as objective if it contains at most two weak subjective expressions and no strong subjective expressions at all; otherwise, the sentence is labeled as unknown. This type of classifier generally has high precision and low recall, so it can be used to collect sentences from an unlabeled corpus.
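The heuristic of the rule-based classifier translates almost directly into code; the clue sets below are placeholders standing in for the strong and weak subjectivity clues of the (translated) lexicon.

```python
# Placeholder clue sets standing in for the translated subjectivity lexicon.
STRONG_CLUES = {"wonderful", "horrible", "love", "hate"}
WEAK_CLUES = {"quite", "rather", "somewhat", "fairly"}

def rule_based_label(sentence):
    """Label a sentence following the heuristic described above:
    >= 2 strong subjective clues         -> subjective
    no strong clue and <= 2 weak clues   -> objective
    anything else                        -> unknown
    """
    tokens = sentence.lower().split()
    strong = sum(t in STRONG_CLUES for t in tokens)
    weak = sum(t in WEAK_CLUES for t in tokens)
    if strong >= 2:
        return "subjective"
    if strong == 0 and weak <= 2:
        return "objective"
    return "unknown"

print(rule_based_label("I love this wonderful film"))   # -> subjective
```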
Importantly, the rule-based classifier performs poorly on the objective task. One reason is that weak subjectivity clues lose their subjectivity during the translation process. In [28], a manual annotation study showed that a small fraction of the translated words keep their subjectivity after translation.

The second method is the corpus-based approach, where the annotated corpus in the target language is built using a projection from the source language and machine learning classifiers are then trained on the labeled data. The experimental results obtained with this method show that machine learning classifiers generally outperform the rule-based classifier.

To overcome the challenges met in cases where no bilingual dictionary or parallel corpora are available, Banea et al. [11] extend the work in [28] by employing multiple ways to perform automatic translation from English, essentially to generate resources in the new language from English resources. They designed three experiments to evaluate whether automatic translation is a good tool for generating new resources. The first and second experiments translate the training source into the target language: in the first experiment the training data is manually annotated, while in the second one OpinionFinder classifiers are used to annotate the corpus, with the annotation done at the sentence level.

The obtained results show that the automatically annotated corpus works better than the manually annotated corpus. This suggests that the clues used by human annotators might be lost during the translation process, while the clues used by classifiers are kept. In the third experiment, the target language is translated into the source language, the OpinionFinder tool is used to label the sentences, the sentences are projected back to the target language, and finally the classifier is trained. The authors evaluated the MT methods for Romanian and Spanish. The results show that manually or automatically labeled data is sufficient to build subjectivity analysis tools in the new language, and that the results are comparable to those obtained with manually translated corpora.

MT has also been used to generate resources in [40,41], where parallel corpora of sentiment toward entities are built for seven languages. Specifically, gold-standard sentiment data is built in English and then projected into the other languages (Czech, French, German, Hungarian, Italian, and Spanish). A general and simple sentiment scoring method is used, counting the number of subjectivity words within a window around a given entity [41]. The resources used were a sentiment dictionary available in 15 languages [40]. Negation is handled by adding 4 to the sentiment score of each negated word (the sentiment score of each word is between -5 and 5).

Importantly, this system is language independent because it depends only on the lexicons. The system employing the gold-standard data achieved accuracy rates from 66% (Italian) to 74% (English and Czech).

As in [11], MT is used to translate an English lexicon (a product of merging the SentiWordNet English lexicon [19] and the Subjectivity Word List [49]) into Bengali [14]. Das and Bandyopadhyay [14] used machine learning classifiers with many features such as part-of-speech tagging and chunking to divide each document into beginning, intermediate, and end; each sentence is then given a feature indicating which part it belongs to. They also used lexicon scores as features giving the subjectivity score of the word, along with stemming, frequency, the position of the subjectivity clue in the document, the title of the document, the first paragraph, and the last two sentences. The overall accuracy of the system is 76.08% precision (PR) and 83.33% recall (RR) on the MPQA data set [49], 79.90% (PR) and 86.55% (RR) on the IMDB corpus, 72.16% (PR) and 76.00% (RR) on the Bengali News corpus, and 74.6% (PR) and 80.4% (RR) on the Blog corpus.

To recap, this section reviewed different methods of using MT to build subjectivity and sentiment systems and highlighted the main issues that emerged. The next section reviews methods developed to improve MT-based SSA systems.

3.2 Improving Machine Translation-based Systems

Two methods have been proposed to improve MT-based SSA systems. This section describes the co-training [46] and structural correspondence learning [47] methods.

3.2.1 Co-training Method

In [46], Wan introduces the co-training algorithm to overcome the low performance of the MT methods used in [25]. The proposed co-training framework uses unlabeled data in the training process. Specifically, the co-training method works with two views for learning, the source language view and the target language view. Two separate classifiers are trained on labeled data, one for the source language and the other for the target language. The unlabeled data is then exploited by adding the most confident samples to the labeled set of each view when the two classifiers agree, after which the classifiers are retrained. The outcome is two different classifiers, and the sentiment prediction is based on the scores of both (e.g., the average of the two classifiers' scores).
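A compressed sketch of the co-training loop described above, assuming every unlabeled review is available in a source-language view and a machine-translated target-language view represented as feature vectors; the scikit-learn model, the confidence threshold, and the round limit are illustrative choices rather than the setup of [46].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X_src, X_tgt, y, U_src, U_tgt, rounds=10, threshold=0.9):
    """Two-view co-training: grow the labeled set with unlabeled reviews on which
    confident source- and target-view classifiers agree, then retrain both views."""
    L_src, L_tgt, labels = np.array(X_src), np.array(X_tgt), list(y)
    pool = list(range(len(U_src)))
    clf_src = LogisticRegression(max_iter=1000)
    clf_tgt = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf_src.fit(L_src, labels)
        clf_tgt.fit(L_tgt, labels)
        added = []
        for i in pool:
            p_s = clf_src.predict_proba([U_src[i]])[0]
            p_t = clf_tgt.predict_proba([U_tgt[i]])[0]
            y_s = clf_src.classes_[p_s.argmax()]
            y_t = clf_tgt.classes_[p_t.argmax()]
            # keep the sample only if both views are confident and agree on its label
            if y_s == y_t and p_s.max() >= threshold and p_t.max() >= threshold:
                L_src = np.vstack([L_src, U_src[i]])
                L_tgt = np.vstack([L_tgt, U_tgt[i]])
                labels.append(y_s)
                added.append(i)
        pool = [i for i in pool if i not in added]
        if not added:
            break
    return clf_src, clf_tgt
```

At prediction time the probability scores of the two returned classifiers can simply be averaged, matching the combination described in the text.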
The experimental results show that the co-training algorithm outperforms the inductive and transductive classifiers [46]. This framework was tested on sentiment classification of Chinese reviews. The features used are unigrams and bigrams, weighted by term frequency, which worked better than tf-idf in their empirical experiments.

3.2.2 The Structural Correspondence Learning Method

In [47], researchers try to overcome the noise introduced by the MT methods used in [28] by using structural correspondence learning (SCL) to find important features shared by the two languages. SCL was originally developed for domain adaptation, and the authors suggest that cross-lingual sentiment classification can be treated as a domain adaptation problem. To use SCL, the first step is to find the set of pivot features: features/words that behave in the same way in the source and target languages (e.g., "very good" and "perfect").

SCL works as follows. First, it generates a weight matrix based on the co-occurrence between pivot features and ordinary features. Second, singular value decomposition is used to select the top eigenvectors and create a mapping matrix from the original feature space to a lower-dimensional one. Third, the mapping matrix is used with the new features in the new language/domain to train the classifier. The authors kept only the pivot features in the translation process and then used the weight matrix from the source language, in addition to the newly translated pivot features, to train the classifier. For the selection of pivots, candidate words are first chosen according to their occurrence and then ranked according to their conditional probabilities computed on the labeled data.

Importantly, an evaluation of SCL on the same data set used in [46] shows that SCL outperforms co-training [46] in terms of F-measure (reported to be 85% in this case).

3.3 Graph Methods for Translating Sentiment

In addition to MT, graph methods have been used for translating and transferring sentiment from one language to another. Scheible et al. [38,39] use a graph-based approach to transfer sentiment from English to German. They built graphs containing two types of relations (coordination and adjective-noun modification), chosen specifically because they carry clues for sentiment. The graphs contain adjectives and nouns as nodes, with relations represented by edges. They built two graphs, one for English and the other for German. To compute the sentiment of the target language, the SimRank algorithm is used. SimRank is an iterative algorithm that computes the similarity between nodes in the two graphs, based on the assumption that two nodes are similar if their neighbors are similar. The similarity between two nodes a and b is given by Equation 2:

sim(a, b) = \frac{v}{|N(a)|\,|N(b)|} \sum_{i \in N(a),\, j \in N(b)} sim(i, j)    (2)

where N(a) is the neighborhood group of a, v is a weight that determines the effect of the distance of neighbors from a, and initially sim(a, a) = 1.
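Equation 2 can be evaluated iteratively. The sketch below runs SimRank-style updates over a single toy graph and treats v as a constant decay factor; the paper's setting, where similarities are computed between the nodes of an English graph and a German graph seeded through a bilingual lexicon, is simplified away here.

```python
from itertools import product

def simrank(neighbors, v=0.8, iterations=10):
    """Iterative SimRank over a graph given as {node: set of neighbor nodes}.
    sim(a, a) = 1; sim(a, b) follows Equation 2 with a constant weight v."""
    nodes = list(neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new_sim[(a, b)] = 1.0
            elif neighbors[a] and neighbors[b]:
                total = sum(sim[(i, j)] for i in neighbors[a] for j in neighbors[b])
                new_sim[(a, b)] = v * total / (len(neighbors[a]) * len(neighbors[b]))
            else:
                new_sim[(a, b)] = 0.0
        sim = new_sim
    return sim

# Toy adjective/noun graph: edges stand for coordination or adjective-noun modification.
graph = {"good": {"movie"}, "great": {"movie", "plot"},
         "movie": {"good", "great"}, "plot": {"great"}}
print(round(simrank(graph)[("good", "great")], 3))
```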

In this method, a bilingual lexicon is used to obtain the seeds between the two graphs. The experiments were done on the English and German versions of Wikipedia. The results show that this method works better than the Semantic Orientation with Point-
