Replication issues in syntax-based aspect extraction for opinion mining EdisonMarrese-Taylor, Yutaka Matsuo DepartmentofTechnologyManagementforInnovation TheUniversityofTokyo 7-3-1 Hongo,Bunkyo-ku,Tokyo113-8656,Japan emarrese,[email protected] Abstract Sentiment orientation can be obtained later based on the extracted terms using or adapting any of Reproducing experiments is an important 7 the generic approaches for sentiment classifica- 1 instrument to validate previous work and tion. Therefore, animportant amountoffocushas 0 build upon existing approaches. It has beenposedontheproblem ofaspectextraction. 2 been tackled numerous times in different n Researchers haveproposed several methods for areas of science. In this paper, we in- a aspectextractionsofar,andmanyauthorsconsider J troduce anempirical replicability study of that these approaches largely fall into three main 6 three well-known algorithms for syntac- categories. Ontheonehand,wefindsyntacticalor tic centric aspect-based opinion mining. ] linguistic methods, which are generally based on L We show that reproducing results contin- C ues to be a difficult endeavor, mainly due other basic NLP tasks, such as POS-tagging and s. tothelackofdetailsregardingpreprocess- parsing, plus somefixedrules orrankings mecha- c nisms. On the other hand, we find purely statisti- ing and parameter setting, as well as due [ cal approaches, which are mainly based on topic to the absence of available implementa- 1 modeling. Finally, we also find extensive work tions that clarify these details. We con- v onsupervisedlearningmethods,inwhichcasethe 5 sider these are important threats to valid- problem is approached as sequence labeling. Ex- 6 ityoftheresearchonthefield,specifically 5 perimentswithbothNeuralNetworksandGraphi- whencompared tootherproblems inNLP 1 calModelshavereportedfairlygoodresultssofar. 0 where public datasets and code availabil- 1. ity are critical validity components. We A review of the literature in syntactical ap- 0 conclude by encouraging code-based re- proaches showed us that most of the proposed 7 search, which we think has a key role ideasareinspiredbyordirectlybuiltontopofpre- 1 : in helping researchers to understand the vious methods. Papers generally include detailed v meaning of the state-of-the-art better and comparisonsoftheapproaches, butwefoundvery i X togenerate continuous advances. few publications accompanied by code releases r thatmakeiteasiertoeffectivelycompareandcon- a 1 Introduction trast methods. We believe the lack of code avail- ability is increasingly becoming a threat to valid- Aspect-based opinion mining is one of the main ityinthefieldbyaddinglayersofobscuritytonew frameworks for sentiment analysis. It aims to approaches, specially tothosethatarebuiltontop extract fine-grained opinion targets from opinion ofprevious ideas. texts and its importance resides in the fact that withoutknowingtheaspects,opinionanalysesare Giventhecurrentstateinthefield,inthispaper of limited use (Liu,2012). The concept origi- westudyreplicability issuesinaspect-based opin- nated morethan 10yearsagoasaspecific caseof ionmining. Wefocusonsyntacticmethods,which sentiment analysis and has gradually gained rele- tendtoshowalowerdegreeoftransparencydueto vanceasaconcreteandcompleteprobleminopin- the increasing level of model complexity and the ion mining. The key task in aspect-based senti- lackofcodeavailability. Inthatsense,inthiswork ment analysis is to extract the aspects or targets we want to encourage discussion on this topic by that have been commented in opinion documents. addressing somekeyquestions. 1. Are the explanations given in the papers Thereplicability issue has been tackled numer- generally sufficient to replicate the proposed ous times in different areas of science. Forexam- models? ple, Casadevall andFang(2010) explore the im- portance and limits of reproducibility in scientific 2. Do differences in preprocessing have a big manuscripts in the field of Microbiology. In the impactonperformance? fieldofMachineLearning,Drummond(2009)dis- cusses issues arising from the inability to repli- 3. Doparametersneedtobeheavilytunedinor- catetheexperimental results published inapaper. dertoachievethereportedperformance? Also, Raederetal.(2010) show that when com- paring the performance of different techniques Ourgoalthroughoutthispaperistostartexplor- some methodological choices can have a signifi- ing possible answers to these questions and pro- cantimpactontheconclusions ofanystudy. vide an environment for further discussions and Furthermore, we also find studies in Software improvements. Wewilltrytotackle thequestions Engineering. For example, Monperrus(2014) keeping in mind that reproducibility of an experi- aimedtocontribute tothefieldwithaclarification mental result is a fundamental assumption in sci- of the essential ideas behind automatic software ence. Aswewillseeinthenextsection,theinabil- repairandincludedanin-depthcriticalanalysisof ity to replicate the experimental results published Kimetal.(2013),anapproach that hadbeen pub- in a paper is an issue that has been considered in lishedtheyearbeforeinthesameconference. various other machine learning and computer sci- It is also possible to find work on replicabil- enceconferences. Therehavebeenseveraldiscus- ity in Natural Language Processing. Conferences sionsarisingfromthisissueandthereseemstobe such as CICLing have undertaken maximum ef- a widespread view that we need to do something fort —though so far rather fruitless— in order toaddressthisproblem. Wewouldliketojointhis to address the topic, giving a special prize every questtoo. year to the best replicable paper1. In addition, 2 Related Work the proceedings of the ACL conference have in- cluded words on this topic on several occasions. Aspect-based opinion mining aims to identify the For example, Kilgarriff(2007) introduces the is- aspects of entities being reviewed in an text and suesofdatacleaningandpre-processing,specially to determine the sentiment reviewers express for forthosecasesthatinvolvecrawlingand/ordown- each aspect. Aspects usually correspond to arbi- loadinglinguisticdata. Thepaperclaimsthateven trarytopicsconsideredimportantorrepresentative though expertise and tools are available for most ofthetextthatisbeinganalyzed. of these preprocessing steps, such as lemmatizers The aspect-based approach has become fairly and POS-taggers for many languages, in the mid- popular. Since its conception, arguably af- dle there is a logjam and questions always arise. ter HuandLiu(2004b), many unsupervised Theauthorsindicatethatitseemstobeundeniable approaches based on statistical and syn- that even though cleaning is a low-level, unglam- tax analysis such as Qiuetal.(2011) and orous task, it is yet crucial: the better it is done, Liuetal.(2012) have been developed. While the better the outcomes. All further layers of lin- here we specifically tackle these kind of models, guisticprocessingdependonthecleanlinessofthe other popular unsupervised techniques such as data. Mukherjee andLiu(2012)arebasedonLDA. On the other hand, Pedersen(2008) presents On the other hand, existing supervised ap- the sad tale of the Zigglebottom tagger, a fic- proaches in the field are mainly based on se- tional tagger with spectacular results. However, quence labeling. Since 2014 the SemEval the code is not available and a new implemen- workshop included a shared task on the topic tation does not yield the same results. In the (Pontikietal.,2014), which has also encour- end, the newly implemented Zigglebottom tag- aged the development of new supervised meth- ger is not used, because it does not lead to the ods. We find approaches based on CRFs promised results and all effort went to waste. such as Mitchelletal.(2013) and deep learn- Fokkensetal.(2013) go further and actually ex- ing (IrsoyandCardie,2014) (Liuetal.,2015a), (Zhangetal.,2015). 1http://cicling.org/why verify.htm perimentwithwhattheythinkmaybeareal-world clude parsers for word lists such as stopwords, caseoftheZigglebottom tagger, particularly, with opinion lexicons and also for more complex data the NER approach by Freireetal.(2012). The structures regarding aspect-based opinion min- reimplementation process involved choices about ing. In particular, we work with the well-known seemingly small details such as rounding to how Customer Review Dataset (HuandLiu,2004a; manydecimals,howtotokenizeorhowmuchdata HuandLiu,2004b) which became the de facto cleanup to perform. They also tried different pa- benchmark for evaluation in syntax-based aspect- rameter combinations for feature generation, but based opinion mining. This is also a very impor- the algorithm never yielded the exact same re- tantpartofourenvironment. sults. Particularly, their best run of their first re- We also include a simple module devoted to production attempt achieved nearly a 20% drop model evaluation, which makes the evaluation in F-measure on average. Other authors such as processtransparent. Weseeaspectextractionasan Dashtipour etal.(2016)have worked on thesame informationretrievalproblemandthustheevalua- issue but for the task of sentiment classification, tionisbasedonprecision, recallandF1-score,us- being unable toreplicate the results ofseveral pa- ing exact matching to define a correctly extracted pers. Our work is directly related to these since aspect. here we also attempt to re-implement other ap- On top of the framework we built our im- proaches. plementations of three different aspect extraction techniques, which we selected based on the ap- 3 Empirical ReplicationStudy proach they are based on, their novelty and their As a first step, we first devoted ourselves to cre- importance in the community. As we mentioned ating a friendly environment for experimentation. earlier, since we limit our study to syntactic ap- The goals of developing this framework were the proaches, here we explicitly omit algorithms that following. (a) To provide a public Python imple- are intensively based on Web sources —or other mentation ofnotablealgorithms foraspectextrac- private sources or datasets— and also approaches tion in aspect-based opinion mining that to date thatusetopicmodelsorsupervised learning mod- lack available implementations, (b) Toprovide an elsforsequencelabeling. Weselectedthreediffer- implementation that is easy to extend and thus to ent papers, HuandLiu(2004b), Qiuetal.(2011), allowresearchers tobuildnovelapproaches based Liuetal.(2012). In the subsections below, we on the routines we provide, and (c) To increase proceed to comment on the reasons for each transparency in the field by providing full details choiceandgivedetailsonourimplementations. about preprocessing steps, parameter setting and model evaluation. We are publicly releasing our 3.1 Frequency-BasedAlgorithm(FBA) codeinGitHub2,soitwillwelcomebugfixes,ex- We first consider the aspect extraction algorithm tensions andpeervalidation ofitscontents. by HuandLiu(2004b), which pioneered on the Our framework is an object-oriented package problem of aspect-based opinion mining. This thatiscenteredontherepresentation ofasentence work is still being considered as a baseline for as a property-rich object. Likewise, sentences comparison and contrast with new approaches by are composed of tokens, which represent words most of the work on syntactic approaches in the and other punctuation marks with their respective literature. Despitethis,thereseemstobenoavail- properties such as stems, POS-tags, IOB-tags for able implementation of this technique to the best chunks and dependency relation tags, among oth- of our knowledge. These were our main motiva- ers. We have also developed wrappers for some tionstoworkwiththistechnique. popularpackagesforNLP,concretelytheStanford The aspect extraction procedure is based on CoreNLP and Senna. This allows us to easily ex- frequent itemset mining, which given a database perimentwithdifferenttokenizers,stemmers,POS of transactions and a minimum support thresh- taggers, chunkers andparsers. old min extracts the set of all the itemsets ap- sup Our package also includes a module for cor- pearinginatleastmin transactions —anitem- sup pora management, which provides easy access to set is just an unordered set of items in a trans- the set of linguistic resources needed. We in- action. In this case, each transaction is built us- 2github.com/epochx/opminreplicability ing the nouns and words in the noun phrases of a sentence. Later, stopword removal, stemming rithm for frequent itemset mining (Borgelt,2012; andfuzzymatchingareappliedtothetransactions PudiandHaritsa,2002;PudiandHaritsa,2003). in order to reduce the term dimensionality and to Afteritemsetmining, twopruning steps areap- dealwithwordmisspellings. Authorsdonotmen- plied in order to get rid of the incorrect, uninter- tionwhichstemmingalgorithmtheyuse,sowere- esting and redundant features. We implemented sort both the well-known Porter stemmer and the thesepruningtechniquescloselyfollowingthede- Stanfordlemmatizer,whichcanberegardedasthe tailsgiveninthepaper. standard choices. Finally, extracted aspects areusedtoextract in- Regarding fuzzy matching, the approach uses frequent features that might also be important. In Jokinen andUkkonen(1991), but authors simply ordertodoso,theyusedtermsinalexiconaspiv- state that [... all the occurrences of “autofocus” ots to extract those nearby nouns that the terms arereplaced with“auto-focus”]. Thisdescription modify. Togeneratethelistofopinionwords,they was insufficient to give us a full notion of how extracted the nearby adjective that modifies each theprocess iscarriedout,specially sincearbitrary feature on each of the sentences in which it ap- word replacements can have an important impact pears,usingstemmingandfuzzymatchingtotake whenextracting aspectsbasedontheirfrequency. careofwordvariants andmisspellings. Thepaper states that “a nearby adjective refers to the adja- Similarly to deAmorimandZampieri(2013), cent adjective that modifies the noun/noun phrase whoproposedaclusteringmethodforspellcheck- that is a frequent feature”. However, it is not ing, here we use clustering with the Levenshtein clear how they really find these adjectives. In distance ratio as similarity metric to group terms. our implementation, we defined a distance win- We tried with different strategies of hierarchical dow from the aspect position and extract all ad- clustering and, based on our exploratory experi- jectives that appear within this window. The size ments, we decided to use complete linkage to ex- of this window became another parameter of the tract flat clusters so that the original observations model. Finally, to extract infrequent features, au- in each flat cluster have a maximum cophenetic thors checked those sentences that contain nofre- distance given by a parameter min . Finally, sim quentfeatures butoneormoreopinion wordsand we represent each stem as a fixed single stem in thenextractedthenearestnoun/noun phrase. its cluster, keeping an index back from each of Wetrytokeepparametersettingascloseaspos- theoriginalunstemmedwordstoitscluster, sowe sible to the values reported by the original paper, arelaterabletorecoverthetermsastheyappeared butforPOS-taggingweuseCoreNLPorSennain- originally. stead of NLProcessor 2000. To obtain flat noun Authors later proceeded to mine frequent oc- phrases, we use the Penn Treebank II output gen- curring phrases by running the association rule eratedbytheStanfordConstituencyParserandap- miner CBA (Liuetal.,1998), which is based on plythesamePerlscript3 usedtogenerate thedata the Apriori algorithm. The paper indicates that fortheCoNLL-2000SharedTask. the Apriori algorithm works in two steps, first finding all frequent itemsets to later generate as- 3.2 Dependency-BasedAlgorithm(DBA) sociation rules from the discovered itemsets, so Oursecond implemented model is Double Propa- authors state they only needed the first step and gation (Qiuetal.,2011), an approach that is fun- use the CBA library for this part. This seems damentally based on dependency relations be- reasonable since it is a known fact that it is tween words for both aspect/target and opinion very efficient to use frequent itemsets to gener- word extraction. This paper pioneered on the us- ate association rules (Agrawaletal.,1993). They age of dependency grammars to extract terms by limited itemsets to have a maximum of three iteratively using aset ofeight rules based on dep- words as they believed that a product feature relations and POS-tags. Basically, the process contained no more than that number of terms. starts withaset ofseed opinion words whose ori- For minimum support, they defined an itemset entation is already known. In general, this is a as frequent if it appeared in more than 1% of reasonable assumption since several opinion lex- the review sentences. In our case, since the icons already exist inthe literature. Theseeds are CBA library was never released, we resort to an open-source implementation of the Apriori algo- 3http://ilk.uvt.nl/team/sabine/homepage/software.html firstlyusedtoextractaspects,whicharedefinedas K adjectives before it. After obtaining the aspect nouns that are modified by the seeds. Aspects are phrases, another frequency-based pruning is con- laterusedtoextractmoreopinionwordsindicated ducted, removingaspects thatappearonlyoncein by adjectives, other aspects and so on. This iter- the dataset. Again, here we tried to set all the ative process that propagates the knowledge with parameters as reported by the authors. Based on the help of the rules ends when no more opinion preliminaryexperiments,wedecidedtoalsoelim- wordsoraspects areextracted. inatethosetermsthatwereextractedbyleveraging In the original paper, the set of depen- on aspects that were later pruned, since they may dencyrelationships givenbytheMINIPARparser introduce noise. (Lin,2003)isusedtodevelop thewordextraction rules. We actually could not find the binaries on- 3.3 Translation-based Algorithm(TBA) line since the official website4 is down; other bi- The work of Liuetal.(2012) is a novel applica- nariesfoundontheWebwerecorruptedandunus- tion of classic statistical translation models and able. This convinced usthat MINIPARcan be re- Graph Theory to extract opinion targets. Nov- garded as a rather outdated model, so we decided eltyandthegoodresultsobtainedbytheapproach to use the Stanford Parser Manningetal.(2014) wereourmainmotivationstoworkwiththispaper. instead, whichisamongstate-of-the-art modelsin For target extraction, the authors proposed a the field. Our choice is supported by the results technique based on statistical word alignment. of Liuetal.(2015b), who successfully work with Specifically, they used a constrained version of DoublePropagation based ontheStanford depen- the well-known IBM statistical alignment models dency parser. Since the Stanford dep-tags differ (Brownetal.,1993). The proposal is directly re- from the tagset used by MINIPAR, we use the lated to monolingual alignment, as proposed by equivalences definedintheaforementioned paper. Liuetal.(2009). For monolingual alignment, the After the extraction steps, the approach pro- parallel corpora fed to the model is simply two ceeds to apply a clause pruning phase. For each copies of the same corpus. At the same time, the clauseoneachsentence,ifithasmorethanoneas- condition that words cannot be aligned to them- pectandthesearenotconnectedbyaconjunction, selvesisadded. Liuetal.(2012)stilluseamono- only the most frequent one is kept. In the paper, lingual parallel corpus but set the constraint that authorssimplystatethatthey[“identifythebound- nounsandnounphrasescanonlybealignedtoad- ary of a clause using MINIPAR”] and do not ex- jectives or vice-versa, meanwhile the rest of the plainhowtodetermineiftheaspectsareconnected words can be aligned freely. As a result, authors bytheconjunction. Weidentifyclausesbyfinding areabletocapturenoun/noun phrase-adjective re- thesetofnon-overlapping parsesub-treeswithla- lations that have longer spans that direct depen- bel“S”.Todetermine iftheaspects areconnected dencyrelations inasentence. byanyexistingconjunctioninasentence,wesim- Since the IBM alignment models work at word ply check if the conjunction appears between the granularityandthenneedtoreceivetokenizedsen- aspectsinthesameclause. tences as input, here we assume that authors first Thenext step was toprune aspects that may be proceeded to group noun phrases in single to- namesofotherproductsornamesofproductdeal- kens. According to the paper, they resorted to the ers,whichmayappearduetocomparisons. Inthis C-value technique (Frantzietal.,2000) for multi- case, the procedure is based on pre-defined pat- wordtermextraction, whichwasoriginally devel- terns which are first matched in the text to later oped to detect technical terminology in medical check if nearby nouns had previously been ex- documents, but was also previously used in the tracted as aspects. These are removed. The def- domain of opinion mining by Zhuetal.(2009). inition of nearby noun is not given in the paper, The method firstly generates a list of all possi- so we add it as another parameter for the model, ble multi-word terms and later ranks them using againusingthenotionofdistance windows. statistical features from the corpus. Even though Finally, a rule is proposed to identify aspect in the original paper candidate multi-word terms phrases bycombining eachaspect withQconsec- areextractedusingfixedpatterns, authors decided utive nouns right before and after the aspect and togenerateallcandidates assimplen-grams(with 4https://webdocs.cs.ualberta.ca/˜lindek/mimniapxanr.=ht4m). We implemented and experimented with both fixed pattern and the n-gram versions 4 Preliminary Results for the C-value technique. We also added a sim- Aswehaveshown,theimplementationprocessin- pleheuristicthatworkswithoutranking, grouping volved choices about several details that were not sets of nouns and other related figures that appear clear or not mentioned on the papers. In our ex- onthesameparseNPsub-tree. periments we have found that even when trying After the most likely constrained alignments different parameter combinations we remain un- are obtained for each sentence, authors estimated able to yield the exact same results in the original noun/noun phrase-adjective pair associations as papers. Below wesummarize our best results and theharmonicmeanbetweenthemutualtranslation findingsforeachalgorithm. probabilities. Finally, they built a bipartite graph Original Ours withthewordsandestimatetheconfidenceofeach Corpus P R P R targetcandidatebeingarealtargetusingthemined ApexDVDPlayer 0.797 0.743 0.389 0.355 associationsandapplyingagraph-basedalgorithm CreativeMP3Player 0.818 0.692 0.293 0.319 NikonCamera 0.792 0.710 0.265 0.457 basedonrandomwalking. Thisisbasicallyaniter- NokiaPhone 0.761 0.718 0.328 0.489 ative algorithm that exploits themutual reinforce- CanonCamera 0.822 0.747 0.352 0.286 ment between terms as given by the associations. Average 0.8 0.72 0.325 0.381 Authorssettherelevance ofeachtargetastheini- Table1: Performancecomparison forFBA. tial value of confidence, defining relevance as the normalizedtf-idf scoresofthecandidates,wheretf Table 1 compares our implementation’s best isthefrequencyoneachterminthecorpusandidf results so far with the values reported by is the inverse document frequency obtained using HuandLiu(2004b). We remain unable to repli- theGoogleN-gramCorpus. cate the performance reported by the authors and see a big drop for both precision and recall in all In the paper, authors experimented withIBM1- thedatasets. Inourexperiments,wenotedthatthe 3 models and showed that fertility parameters in- most sensitive parameter was min for itemset troduced by the third model help to improve the sup mining. We also experimented omitting the prun- performance by a small margin. The estimation ing steps and observed that precision and recall of this model is rather complicated, in this case werenottoodifferentfromtheresultsweobtained specially since it also includes a constrained ver- withpruning. sion of the hill-climbing heuristic, so in our im- We also observed that several parameters con- plementation we only include our versions of the figurations conveyed the same final performance IBM1-2models. Regardingparameters,wesetthe for each corpus. Among the 1470 per-corpus pa- proportion of candidate importance λ = 0.3 and rameter configurations we tried, we found 18 op- themaximumseriespowerparameterk = 100,as timal settings for both the Apex DVD Player and given by the original paper. To compute the ini- Canon Camera corpora, 16 for Nikon Camera, 6 tial relevance of each candidate, authors use the forCreativeMP3Playerand3forNokiaPhone. Google Ngram corpus to obtain the idf of a term. Differences in preprocessing did not offer con- Duetothelack ofexplanations onwhattheycon- sistent differences in performance. For the Apex siderasadocument,weresortedtothetheEnglish DVD Player, Creative MP3 Player and Canon Wikipedia. To calculate the idf score of a term, Camera corpora we found that processing with we count the number of articles that contain the SennaConstParser + CoreNLPDepParser conveys queriedtermandcompareittothetotalnumberof the best results. For the Nikon Camera corpus, articles. Whenwefindnoarticlesforagiventerm, adding the PorterStemmer to the latter gave us wesimpleuseaminimumarticlecountof1. the best performance. For the case of the Nokia Finally, the authors stated that the targets with Phone corpus, the pipeline CoreNLPDepParser + higher confidence scores than a certain threshold CoNLL2000Chunkergaveusthebestresults. t are extracted as the opinion targets, but they do In the original paper, authors reported the per- not specify the value they use. We let our imple- formance of the model at different stages, show- mentation output the unfiltered list of candidates ing that average values of precision an recall for andtheirconfidencesandfindthebestvalueofthe theitemsetminingstageare0.68and0.56respec- threshold later. tively. Weweresurprisedtofindoutthatwecould not even replicate these results, specially consid- only experimented using the CoreNLPDepParser ering that only two parameters are at play at this +CoNLL2000Chunkerpipeline. level. As shown by the original paper, the final Corpus t∗ performanceachievedisactuallymainlyduetothe ApexDVDPlayer 160 output of the itemset mining phase. We believe CreativeMP3Player 200 this is the reason why we observed some param- NikonCamera 100 NokiaPhone 90 eters have minimum impact on the performance. CanonCamera 110 This means that no matter how good the pruning strategies are, results will not be as good as the Table3: Optimalvalueoftforeachcorpus. original if we remain unable to replicate the out- putoftheitemsetminingphase. Table 4 shows a comparison of the results we have obtained so far using our implementation of Original Ours TBAandthevalues provided byLiuetal.(2012). Corpus P R P R Oncemore,weremainunabletoreplicatetheper- ApexDVDPlayer 0.90 0.86 0.239 0.328 CreativeMP3Player 0.81 0.84 0.180 0.319 formancereportedbythepaper. NikonCamera 0.87 0.81 0.194 0.287 On our experiments we tried with all three NokiaPhone 0.92 0.86 0.287 0.359 grouping strategies to generate multi-word terms; CanonCamera 0.90 0.81 0.201 0.356 namely, our simple heuristic and C-value using Average 0.88 0.83 0.220 0.330 both n-grams and fixed patterns. We also tried Table2: Performancecomparison forDBA. adding a limit for the number of groups gener- atedbytheC-value technique andused stemming RegardingDBA,Table2summarizestheresults to improve frequency counts. The “ngram” tech- we obtained. Again, we see huge differences be- niqueturnedouttobethebestperformingoneach tween our results and the ones reported by the corpus, although the limit parameter varies from original paper. Moreover, in this case we observe casetocase. particularly low values for precision. A detailed Aswementionedearlier, toevaluatetheimpact review of the extracted aspects showed us that in of the t parameter whose value was not reported factmanyoftheextractedtermsdonotcorrespond by Liuetal.(2012), we let the model return the toaspectsbutrathertocommonnounsthatarenot unfiltered aspect candidates and evaluate the per- relatedtotheproduct. formancefort ∈ [10,20,...,t ]. Notethatt max max Duringexperimentation, wealsoaddedsupport might be different each time. Because of this, the for different matching strategies —for example, number of parameter configurations we tried for using word stems and including fuzzy matching each corpora is slightly different. Instead of re- as in FBA— and although we observed improve- porting each value, we rather report the average ments on the results, these were marginal. We ofper-corpus evaluated parameter settings, which used different opinion word seeds, firstly based was1006. AsTable3shows,wefoundthatrather only onthe words “good” and “bad” and later us- than a single cross-corpus optimal value, this pa- ing9same-sizesubsetsoftheopinionlexiconpro- rameter needs to be tuned per-corpus. In this vided by Liu5. In all cases, our best performing sense, when we experimented without setting a modelusesoneofthesesubsets. threshold weobtained amaximum recall of0.697 As in the previous case, different parameter —fortheNokiaPhone corpus— butatthecost of configurations led to the same performance for precision0.151. Whenwesett = 10,weobtained each corpus. In this case, among 240 parame- a maximum precision of 0.9 but at the cost of re- ter settings for each corpus, we found 12 optimal call being lower than five percent. These results configurations for the Apex DVD Player corpus meanthemodeldoesnotseemtogeneralize well. and 24foreach theother corpora. Regarding pre- Since in our implementation we do not use the processing, we could not use CoreNLP to trans- IBM3model,wewereawarewecouldseeadiffer- formtheconstituenttreesgivenbySennaintodep- ence in the performance. However, based on the trees. Constituency treesseemedtobemalformed results by the original paper, which showed that and raised grammar parsing errors, therefore we improvements of IBM3 over IBM2 are small — about5%—wethinkitisveryunlikelythisdiffer- 5http://www.cs.uic.edu/∼liub/FBS/opinion-lexicon- English.rar ence can explain the big drop in performance we haveobserved. they are generally not the core of the research de- scribed. However,codereleasesplayacriticalrole Original Ours Corpus inuncoveringthesedetailsandmakingresearchat P R P R ApexDVDPlayer 0.89 0.87 0.362 0.389 leastreplicable. CreativeMP3Player 0.81 0.85 0.400 0.327 Regarding pre-processing, in our experiments NikonCamera 0.84 0.85 0.380 0.404 sofar withboth Sennaand CoreNLP wesaw per- NokiaPhone 0.88 0.89 0.588 0.381 CanonCamera 0.87 0.85 0.400 0.341 formance differences that arehowevernot consis- Average 0.86 0.86 0.426 0.368 tent, which seems to indicate that there is no op- timal preprocessing pipeline for each algorithm. Table4: Performancecomparison forTBA. Ontheother hand, modelparameters donotseem to be correlated with pre-processing choices, al- 5 Discussionandfurther directions thoughwedidfindasinglecaseinwhichaspecial pre-processingstepleadtobetterresultsinasingle The ongoing empirical study we introduce in this corpus. paper has provided concrete cases to help us an- Though we could not replicate the results pub- swer the questions that motivate this paper. As lished in the original papers, we have shown that seen, wehave so far failed to reproduce the origi- parameter values as reported by these papers do nalresults inthethreestudied cases. Eventhough not necessarily yield the best results. Moreover, several reasons may be the cause for this failure, parameters that mayseem unimportant turned out we think further experimentation can allow us to tocauseimportantperformancedifferencesforus. determinethekeyelementsthatwouldexplainthe Mostparametersindeedhadtobeheavilytunedin differences. In fact, our preliminary experiments ordertoachievethebestperformance. have already helped usisolate specific parameters Finally, the poor results obtained by our im- foreachmodelthatseemtomorestronglyimprove plementations also leave us puzzled about how theperformance. Ourresultsshowthatparameters the evaluation is really performed on the origi- thatarecloselyrelatedtothecoreoftheextraction nal papers. Authors do not give much details on methods, such as min for FBA and the confi- sup this topic. For example, (HuandLiu,2004b) use dencethresholdtforTBAseeminfacttobeplay- stemming and simply eliminate some words from ingthesekeyroles. the text based on their fuzzy matching approach. We are planning to run controlled experiments This means their extracted terms are word stems in order to isolate as much as possible the effect only. However, we do not know if stemming is of each parameter or processing step and under- also applied to the gold standard to evaluate. We stand their interplay. This will enable us to tell manually examined theCustomer ReviewDataset where important implementation differences be- anddiscoveredthatthemanuallyextractedaspects tweenourversionandtheoriginalversionmaybe. do not seem to be stemmed. Moreover, we noted Given that we do not have access to the original several inconsistencies in the annotation. This is- codes, itisonlybymeansoftheseinferred differ- sueraisesmorequestions forourresearch. ences that we can gain real insights on where the keysforreplicability lay. 6 Conclusions We believe the results in this paper already prove that explanations given in the original pa- We have presented three replication cases in the pers were generally insufficient to correctly repli- domainofaspect-basedopinionminingandshown cate the models. The lack of resources —except thatrepeatingexperimentsinthefieldisacomplex for the evaluation datasets— caused us to navi- issue. The experiments we designed and carried gate in the dark as we could not reverse-engineer outhavehelped usanswerourresearch questions, many intermediate steps. Certain details of pre- also raising some new ones. These answers seem processing and parameter setting are only men- to indicate that explanations on pre-processing, tioned superficially ornotatallintheoriginal pa- models specifications and parameter setting are pers. However,manyoftheseseeminglysmallde- generally insufficient to successfully replicate pa- tails did make a big difference in our results. We persinthefield. understand there is often not enough space in the Our observations indicate that sharing data and manuscripts to capture all details, specially since software play key roles in allowing researchers to completely understand how methods work. Shar- [Frantzietal.2000] Katerina Frantzi, Sophia Anani- ing is key to facilitating reuse, even if the code is adou, and Hideki Mima. 2000. Automatic recog- nition of multi-word terms:. the C-value/NC-value imperfect and contains hacks and possibly bugs. method. InternationalJournalonDigitalLibraries, Having access to such a set-up allows other re- 3(2):115–130. searchers to validate research and to systemati- cally test the approach in order to learn its limi- [Freireetal.2012] Nuno Freire, Jose´ Borbinha, and Pa´vel Calado, 2012. The Semantic Web: Research tations and strengths, ultimately allowing to im- andApplications:9thExtendedSemanticWebCon- proveonit. ference,ESWC2012,Heraklion,Crete,Greece,May 27-31, 2012. Proceedings, chapter An Approach forNamedEntityRecognitionin PoorlyStructured References Data, pages 718–732. Springer Berlin Heidelberg, Berlin,Heidelberg. [Agrawaletal.1993] Rakesh Agrawal, Tomasz Imielin´ski, and Arun Swami. 1993. Mining [HuandLiu2004a] MinqingHu and Bing Liu. 2004a. association rules between sets of items in large Mining and Summarizing Customer Reviews. In databases. In Proceedings of the 1993 ACM SIG- Proceedings of the Tenth ACM SIGKDD Interna- MOD International Conference on Management of tional Conference on Knowledge Discovery and Data, SIGMOD ’93, pages 207–216, New York, DataMining,KDD’04,pages168–177,NewYork, NY,USA.ACM. NY,USA.ACM. [Borgelt2012] Christian Borgelt. 2012. Frequentitem [HuandLiu2004b] MinqingHu and Bing Liu. 2004b. set mining. Wiley Interdisciplinary Reviews: Data MiningOpinionFeaturesin CustomerReviews. In MiningandKnowledgeDiscovery,2(6):437–456. Proceedingsofthe19thNationalConferenceonAr- tifical Intelligence, AAAI’04, pages 755–760, San [Brownetal.1993] Peter F. Brown, Vincent J. Della Jose,California.AAAIPress. Pietra,StephenA.DellaPietra,andRobertL.Mer- cer. 1993. The mathematics of statistical machine [IrsoyandCardie2014] Ozan Irsoy and Claire Cardie. translation: Parameter estimation. Computational 2014. Opinion Mining with Deep Recurrent Neu- linguistics,19(2):263–311. ral Networks. In Proceedings of the 2014 Confer- ence on Empirical Methods in Natural Language [CasadevallandFang2010] A. Casadevall and F. C. Processing(EMNLP),pages720–728,Doha,Qatar, Fang. 2010. Reproducible Science. Infection and October.AssociationforComputationalLinguistics. Immunity,78(12):4972–4975,December. [JokinenandUkkonen1991] Petteri Jokinen and Esko [Dashtipouretal.2016] Kia Dashtipour, Soujanya Po- Ukkonen. 1991. Two algorithms for approxmate ria, Amir Hussain, Erik Cambria, Ahmad Y. A. stringmatchinginstatictexts. InInternationalSym- Hawalah, Alexander Gelbukh, and Qiang Zhou. posium on MathematicalFoundationsof Computer 2016. Multilingual Sentiment Analysis: State of Science,pages240–248.Springer. theArtandIndependentComparisonofTechniques. CognitiveComputation,8(4):757–771. [Kilgarriff2007] Adam Kilgarriff. 2007. Googleol- ogy is bad science. Computational linguistics, [deAmorimandZampieri2013] Renato Cordeiro 33(1):147–151. de Amorim and Marcos Zampieri. 2013. Effec- tive Spell Checking Methods Using Clustering [Kimetal.2013] DongsunKim,JaechangNam,Jaewoo Algorithms. In Proceedings of the International Song, and Sunghun Kim. 2013. Automatic patch Conference Recent Advances in Natural Language generationlearnedfromhuman-writtenpatches. In Processing RANLP 2013, pages 172–178, Hissar, Proceedings of the 2013 International Conference Bulgaria, September. INCOMA Ltd. Shoumen, on Software Engineering, pages 802–811. IEEE BULGARIA. Press. [Drummond2009] Chris Drummond. 2009. Replica- [Lin2003] Dekang Lin. 2003. Dependency-Based bility is not reproducibility: nor is it good science. EvaluationofMinipar. InAnneAbeill,editor,Tree- In Proceedings of the Evaluation Methods for Ma- banks, volume 20 of Text, Speech and Language chineLearningWorkshop,Montreal,Canada. Technology,pages317–329.SpringerNetherlands. [Fokkensetal.2013] AntskeFokkens,MariekevanErp, [Liuetal.1998] BingLiu,WynneHsu,andYimingMa. Marten Postma, Ted Pedersen, Piek Vossen, and 1998. Integratingclassificationandassociationrule Nuno Freire. 2013. Offspring from Reproduction mining. InProceedingsoftheFourthInternational Problems: WhatReplicationFailureTeachesUs. In ConferenceonKnowledgeDiscoveryandDataMin- Proceedings of the 51st Annual Meeting of the As- ing. sociationforComputationalLinguistics(Volume1: Long Papers), pages 1691–1701, Sofia, Bulgaria, [Liuetal.2009] Zhanyi Liu, Haifeng Wang, Hua Wu, August.AssociationforComputationalLinguistics. and Sheng Li. 2009. Collocation extraction using monolingualword alignment method. In Proceed- [Pedersen2008] Ted Pedersen. 2008. Empiricism is ingsofthe 2009Conferenceon EmpiricalMethods not a matter of faith. Computational Linguistics, inNaturalLanguageProcessing: Volume2-Volume 34(3):465–470. 2, pages 487–495. Association for Computational Linguistics. [Pontikietal.2014] Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion An- droutsopoulos, and Suresh Manandhar. 2014. [Liuetal.2012] Kang Liu, Liheng Xu, and Jun Zhao. SemEval-2014 Task 4: Aspect Based Sentiment 2012. OpinionTargetExtractionUsingWord-Based Analysis. In Proceedings of the 8th International TranslationModel. InProceedingsofthe2012Joint WorkshoponSemanticEvaluation(SemEval2014), Conference on Empirical Methods in Natural Lan- pages 27–35, Dublin, Ireland, August. Association guageProcessingandComputationalNaturalLan- forComputationalLinguisticsandDublinCityUni- guageLearning,pages1346–1356,JejuIsland,Ko- versity. rea,July.AssociationforComputationalLinguistics. [PudiandHaritsa2002] VikramPudiandJayantR.Har- [Liuetal.2015a] Pengfei Liu, Shafiq Joty, and Helen itsa. 2002. On the Efficiency of Association- Meng. 2015a. Fine-grained Opinion Mining with RuleMiningAlgorithms. InProceedingsofthe6th RecurrentNeuralNetworksandWordEmbeddings. Pacific-AsiaConferenceonAdvancesinKnowledge In Proceedings of the 2015 Conference on Empiri- DiscoveryandDataMining,PAKDD’02,pages80– calMethodsinNaturalLanguageProcessing,pages 91,London,UK,UK.Springer-Verlag. 1433–1443, Lisbon, Portugal, September. Associa- tionforComputationalLinguistics. [PudiandHaritsa2003] VikramPudiandJayantR.Har- itsa. 2003. ARMOR: Association Rule Mining [Liuetal.2015b] Qian Liu, Zhiqiang Gao, Bing Liu, basedonORacle. InProceedingsoftheIEEEICDM and Yuanlin Zhang. 2015b. Automated Rule Se- Workshop on Frequent Itemset Mining Implemen- lectionforAspectExtractioninOpinionMining. In tations (FIMI-03), volume 90, Melbourne, Florida, Proceedingsofthe24thInternationalConferenceon USA,November.CEURWorkshopProceedings. ArtificialIntelligence, IJCAI’15,pages1291–1297, BuenosAires,Argentina.AAAIPress. [Qiuetal.2011] Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. 2011. Opinion Word Expansion [Liu2012] Bing Liu. 2012. Sentiment Analysis and andTargetExtractionThroughDoublePropagation. OpinionMining. SynthesisLecturesonHumanLan- ComputationalLinguistics,37(1):9–27,March. guageTechnologies,5(1):1–167. [Raederetal.2010] T. Raeder, T. R. Hoens, and N. V. Chawla. 2010. ConsequencesofVariabilityinClas- [Manningetal.2014] Christopher D. Manning, Mihai sifier Performance Estimates. In 2010 IEEE Inter- Surdeanu, John Bauer, Jenny Finkel, Steven J. national Conference on Data Mining, pages 421– Bethard,andDavidMcClosky. 2014. TheStanford 430. CoreNLPNaturalLanguageProcessing Toolkit. In Association for Computational Linguistics (ACL) [Zhangetal.2015] Meishan Zhang, Yue Zhang, and SystemDemonstrations,pages55–60. Duy Tin Vo. 2015. Neural Networks for Open DomainTargetedSentiment. InProceedingsofthe [Mitchelletal.2013] Margaret Mitchell, Jacqui 2015 Conference on Empirical Methods in Natu- Aguilar,TheresaWilson,andBenjaminVanDurme. ral Language Processing, pages 612–621, Lisbon, 2013. Open Domain Targeted Sentiment. In Portugal,September.AssociationforComputational Proceedings of the 2013 Conference on Empirical Linguistics. Methods in Natural Language Processing, pages 1643–1654, Seattle, Washington, USA, October. [Zhuetal.2009] Jingbo Zhu, Huizhen Wang, Ben- AssociationforComputationalLinguistics. jaminK.Tsou,andMuhuaZhu. 2009. Multi-aspect opinion polling from textual reviews. In Proceed- [Monperrus2014] Martin Monperrus. 2014. A critical ingsofthe18thACMconferenceonInformationand review of automatic patch generation learned from knowledgemanagement,pages1799–1802.ACM. human-writtenpatches: essayontheproblemstate- ment and the evaluation of automatic software re- pair. InProceedingsofthe36thInternationalCon- ference on Software Engineering, pages 234–242. ACM. [MukherjeeandLiu2012] Arjun Mukherjee and Bing Liu. 2012. Aspect Extraction Through Semi- supervised Modeling. In Proceedings of the 50th Annual Meeting of the Association for Computa- tional Linguistics: Long Papers - Volume 1, ACL ’12,pages339–348,Stroudsburg,PA,USA.Associ- ationforComputationalLinguistics.