Exploration of Proximity Heuristics in Length Normalization

Pranav A
VIT University, Vellore, India
[email protected]

arXiv:1701.01417v1 [cs.IR] 5 Jan 2017

Abstract

Ranking functions used in information retrieval are primarily used in search engines, and they are often adopted for various language processing applications. However, the features used in the construction of a ranking function should be analyzed before applying it to a dataset. This paper gives guidelines on the construction of generalized ranking functions with application-dependent features. The paper then prescribes a specific case of a generalized function for a recommendation system, using feature engineering guidelines on the given dataset. The behavior of both the generalized and the specific functions is studied and implemented on unstructured textual data. The proximity-feature-based ranking function outperforms regular BM25 by 52%.

1 Introduction

Information retrieval deals with the task of retrieving, or in simple words obtaining, relevant and necessary information resources from a huge collection of documents. The information obtained is considered relevant according to a piece of information asked about, also known as a query (Zhai and Massung, 2016). Information retrieval has been fundamental for developing search engines, but there are many other interesting applications such as recommendation systems, spam filtering, plagiarism detection and so on (Manning et al., 2008). Ranked information retrieval uses ranking functions, which determine the decreasing order of relevance of the documents to the query.

Zhai and Massung (2016) lay down an in-depth analysis of the variety of ranking functions used in information retrieval. Common features used in information retrieval are term frequency, inverse document frequency and length normalization. TF-IDF (Term Frequency and Inverse Document Frequency) is a prevalent vector-space information retrieval technique: it balances the similarity of the query and the document, and penalizes common terms (Wu et al., 2008). Pivoted document length normalization with TF-IDF helps to reward shorter documents (Singhal et al., 1996). Okapi BM25 is a more complex version of pivoted length normalization and is the prevailing state-of-the-art retrieval method (Robertson and Walker, 1994). PL2 is a retrieval model based on the divergence from randomness of the query term frequency (Amati and Van Rijsbergen, 2002). Dirichlet prior based retrieval is based on language modeling, where the smoothing function is derived from the Dirichlet distribution (Zhai and Lafferty, 2004). MPtf2ln and MDtf2ln are improvements on the previous methods: more elaborate ranking functions which balance the extremities of normalization effects (Fang et al., 2011). Fang et al. (2011) also lay down certain guidelines to evaluate the behavior of ranking functions.

This paper analyses feature engineering concepts in information retrieval, especially length normalization. The objectives of this paper are:

• to give general guidelines on the structure and nature of a feature which can be included in a ranking function (Section 2);

• to study the existing work which uses information retrieval in recommendation systems (Section 3);

• to prepare an unstructured textual dataset for a penpal recommendation system (Section 4);

• to analyze the dataset and apply the general guidelines to construct the feature for the penpal recommendation system (Section 5);

• to test and compare the constructed function against other ranking functions (Section 6).

2 Guidelines for Feature Engineering in Ranking Functions

The main idea is to include a proximity score in conjunction with the ranking function of an information retrieval system. A proximity score is an application-dependent score, such as a page-rank metric, contextual similarity and so on.

[Figure 1: Rough Sketch of Feature Curve]
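Before engineering a new feature, it helps to make the baseline concrete. The following minimal sketch scores a toy corpus with Okapi BM25, including the pivoted length normalization term surveyed above; the toy documents and the values of k and b are illustrative, not from the paper.

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k=1.2, b=0.75):
    """Okapi BM25 score of one tokenized document for a query (minimal sketch)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)   # document frequency
        if df == 0:
            continue
        idf = math.log((N + 1) / df)             # simple IDF variant
        norm = 1 - b + b * len(doc) / avgdl      # pivoted length normalization
        score += idf * (k + 1) * tf[term] / (tf[term] + k * norm)
    return score

docs = [["penpal", "music", "books"],
        ["music", "movies"],
        ["books", "travel", "penpal", "music"]]
query = ["penpal", "music"]
ranked = sorted(docs, key=lambda d: bm25_score(query, d, docs), reverse=True)
```

Sorting documents by this score yields the ranked retrieval order discussed above; the length normalization term `norm` in the denominator is the component this paper later replaces.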
The function should have a well-engineered feature embedded in itself, and it should be able to generalize other results. The objectives of the ranking function are:

• The score should be higher if the constituents of the proximity score (real numbers or vectors) are similar.

• The proximity score should stay below certain cut-off scores. These cut-off scores, or limiting parameters, can be determined through machine learning.

• The function should behave similarly to, and be able to generalize the behavior and working of, the existing ranking functions in IR.

The modeled function is assumed to be bi-modal, as one has to consider the extremities of the cut-off scores. Let f(x, y) be the modeled function, where x and y are the constituent vectors. In other words, this function determines the similarity between x and y, and it is a function of d(x) and d(y), where d(·) is the distance metric.

Length normalization functions, especially in BM25 or the pivoted length function, are inversely proportional to the TF-IDF score. Thus the model makes f(x, y) inversely proportional to the TF-IDF score:

    TF-IDF(q, d) ∝ 1 / f(x, y)

2.1 Nature of the Function Curve

Since the function has to be a bi-modal distribution, it is modeled in the form of an asymmetrical inverted bell curve. Thus the trough of the curve should occur when the distances of x and y are similar. In other words,

    ∂f(x,y)/∂d(x) = 0  if d(x) ≈ d(y)    (1)

Similarly it can be shown that

    ∂f(x,y)/∂d(y) = 0  if d(x) ≈ d(y)

From here onwards, the arguments are made from the d(x) perspective, as d(y) admits similar arguments.

The function should decrease monotonically if d(x) < d(y). In other words,

    ∂f(x,y)/∂d(x) < 0  if d(x) < d(y)    (2)

The function should increase monotonically if d(x) > d(y). In other words,

    ∂f(x,y)/∂d(x) > 0  if d(x) > d(y)    (3)

2.2 Nature of the Limits of the Curve

The left extremity of the curve is bounded by the parameter b1:

    lim_{d(x)→0} f(x,y) = b1    (4)

Similarly, the right extremity of the curve is bounded by the parameter b2:

    lim_{d(x)→∞} f(x,y) = b2    (5)

The trough of the curve exists where ∂f(x,y)/∂d(x) = 0. The limit there is set to 1; it cannot be set to 0, because f(x,y) lies in the denominator of the scoring function:

    lim_{d(x)→d(y)} f(x,y) = 1    (6)

The specific case of these guidelines can now be applied to recommendation systems.

3 Information Retrieval in Recommendation Systems

Not many efforts have been carried out which use ranking functions in recommendation systems; most recommendation systems use latent Dirichlet allocation (LDA) or collaborative filtering. Adomavicius and Tuzhilin (2005) give a comprehensive study of the existing research work on recommendation systems.

Some recommendation systems have experimented with ranking functions; however, none of them tuned the parameters correctly. The systems which deploy IR functions have generally used extremely well-structured data such as tags (Wang and Yuan, 2009) or contextual variables (Kwon and Kim, 2010).

Tag-based systems: These systems concatenate the tags of a user into a single document. Relevant proximity is determined through the similarity between the query and the tags.
This idea has been used to develop music-recommendation systems (Cantador et al., 2010; Bellogín et al., 2010). The cosine similarity between the BM25 values of user-profile tags and item-profile tags produced the best results. This leads to the conclusion that BM25 outperforms other functions because it penalizes common tags and focuses on rare terms (Cantador et al., 2010); the effects of length normalization are not discussed. More evaluation metrics have been introduced to evaluate the ranking functions (Bellogín et al., 2010). The results concluded that collaborative filtering methods give more novel and diverse recommendations, while ranking functions have more coverage and better accuracy. TF-IDF outperforms BM25, but the parameters of BM25 had not been tuned.

Publication-recommendation systems: Some publication-recommendation systems have used ranking functions (Nishioka and Scherp, 2016). Concept Frequency Inverse Document Frequency (CF-IDF) and Hierarchical CF-IDF (HCF-IDF) have been utilized, which generally assign weights to certain pre-determined words in the documents. The results showed that CF-IDF combined with a sliding window produced the best results; surprisingly, it outperforms the popular LDA method. Another publication recommender system has used information retrieval elements (Totti et al., 2016). Here citation context (TF-IDF similarity between two papers), query similarity (TF-IDF similarity between the query and the article) and age decay (to penalize older articles) have been considered as parameters. This experiment also showed that it surpasses a system that uses page-rank-like metrics.

Recommendation systems that use unstructured data: Suchal and Návrat (2010) and Esparza et al. (2012) have used unstructured textual data in their recommendation system designs. Users' choices have been concatenated into a single query which is used to recommend articles (Suchal and Návrat, 2010). Movie recommendation has also been evaluated with ranking functions (Esparza et al., 2012), where the tags and reviews of movies serve as the TF-IDF component. Reviews tend to perform better than tags because the unstructured nature of reviews provides some noise and undiscovered information as opposed to structured data like tags; this results in high IDF values. Thus unstructured textual data is a suitable advantage. According to the results, BM25 underperformed against the TF-IDF algorithm; this happened because the parameters had not been tuned according to the dataset. Esparza et al. (2012) also demonstrated that ranking functions perform better than collaborative filtering algorithms.

The main conclusions from this literature survey are:

1. Ranking functions tend to perform better than well-known recommendation system algorithms like LDA and collaborative filtering.

2. Unstructured data provides a certain amount of noise which is helpful for ranking function parameters.

3. The effects of length normalization on recommendation systems have not been studied.

4 Penpal Recommendation System and Dataset Preparation

A penpal (online friends) recommendation system was set up from online users. 630 users were asked about their interests, likes and relationships. A few volunteers helped in assessing the matching of the penpals. Since every respondent was exclusively assigned one penpal, this gave 315 pairs as a result.

The responses of each user were concatenated into a single sample, or document. Hence this is a recommendation system situation over textual data for 630 users. The MeTA toolkit (Massung et al., 2016) was employed in the codebase of this project. Preprocessing of the data was done with stopword removal, tokenization and lemmatization. The dataset was divided into two parts:

• Training set: text data of 504 samples (252 pairs)

• Testing set: text data of 126 samples (63 pairs)

The average document length of the training set is 131 words. The observations showed that assessors paired up users whose responses contained fewer words and similar interests (Table 1). A similar process was carried out for users having lengthy responses and similar interests. This may be because a user who writes fewer words in the response form is less interested in having a penpal. Thus text similarity and length similarity play a major role in penpal recommendation.
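The dataset preparation steps above (concatenating each user's responses, preprocessing the text, and splitting the assessed pairs into training and testing sets) can be sketched roughly as follows. The paper itself uses the MeTA toolkit, so the tiny stopword list, the `preprocess` helper and the seeded random split here are simplified stand-ins, and lemmatization is omitted from the sketch.

```python
import random

STOPWORDS = {"the", "a", "an", "and", "is", "i", "to", "of"}  # tiny illustrative list

def preprocess(response):
    """Tokenize, lowercase and remove stopwords from one user's concatenated
    response. (The paper's pipeline also lemmatizes; that step is omitted here.)"""
    tokens = response.lower().split()
    return [t.strip(".,!?") for t in tokens if t not in STOPWORDS]

def train_test_split(pairs, train_pairs=252, seed=0):
    """Split the assessed penpal pairs into training and testing sets, as in Section 4."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    return shuffled[:train_pairs], shuffled[train_pairs:]

# 315 assessed pairs of users, as described above (user names are placeholders)
pairs = [(f"user{2 * i}", f"user{2 * i + 1}") for i in range(315)]
train, test = train_test_split(pairs)
# train holds 252 pairs (504 samples); test holds the remaining 63 pairs (126 samples)
```

Keeping the split at the pair level, rather than the sample level, ensures that both members of an assessed penpal pair land in the same partition.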
Description                                        | Number of Pairs
Total pairs in training dataset                    | 252
Pairs matched with both documents' length less     | 62
than the average document length                   |
Pairs matched with both documents' length greater  | 131
than the average document length                   |
Other matches                                      | 53

Table 1: Document length characteristics in the training dataset

5 Feature Engineering

Here, the function is modeled using the Richards curve, defined by

    g(x) = l + (u − l) / (A + e^{−B(x−M)})^{1/ν}    (7)

where l is the lower limit, u is the upper limit, and M, ν, A, B are free parameters.

The distance metrics are taken to be just real numbers; thus d(x) = x and d(y) = y, and the trough of the curve should occur when d(x) = d(y), i.e. x = y. The function can be partitioned into three parts:

• a monotonically decreasing Richards curve g(x) when x < y;

• the value of the heuristic function, h(x, y) = 1, when x = y;

• a monotonically increasing Richards curve g(x) when x > y.

The definition of the function now looks like:

    h(x,y) = { g(x1)  if x < y
             { 1      if x = y    (8)
             { g(x2)  if x > y

where x1 ∈ [0, y], x2 ∈ [y, ∞) and x1, x2 ⊂ x.

Applying equation (1), the constraint obtained is

    ∂h(x,y)/∂x |_{x=y} = 0

From equation (2), the constraint obtained is

    ∂g(x1)/∂x1 < 0

From equation (3), the constraint obtained is

    ∂g(x2)/∂x2 > 0

From limiting condition (4), the constraints obtained are

    lim_{x1→0} g(x1) = b1,    ∂g(x1)/∂x1 |_{x1=0} ≈ 0

From limiting condition (5), the constraints obtained are

    lim_{x2→∞} g(x2) = b2,    ∂g(x2)/∂x2 |_{x2=∞} ≈ 0

From limiting condition (6), the constraints obtained are

    ∂g(x1)/∂x1 |_{x1=y} ≈ 0,    ∂g(x2)/∂x2 |_{x2=y} ≈ 0,
    lim_{x1→y} g(x1) = lim_{x2→y} g(x2) = 1

After solving these constraints, and using the binomial approximation for small ν, the solution obtained is:

    h(x,y) = { 1 + (b1 − 1) / (1 + e^{B1(x − cy)})        if x < y
             { 1                                          if x = y    (9)
             { 1 + (b2 − 1) / (1 + e^{−B2(x − (1+c)y)})   if x > y

where B1 and B2 are growth parameters, which can typically be set to 1, and c is the trough curvature, with c ∈ (0, 1).

According to the requirements, length similarity between the query and document lengths should be rewarded. In that case, x = |d| and y = |q|; intuitively, this means the value of the scoring function will be higher if |d| ≈ |q|. Substituting these values, we get:

    h(|d|,|q|) = { 1 + (b1 − 1) / (1 + e^{B1(|d| − c|q|)})        if |d| < |q|
                 { 1                                              if |d| = |q|    (10)
                 { 1 + (b2 − 1) / (1 + e^{−B2(|d| − (1+c)|q|)})   if |d| > |q|

The nature of this feature, plotted against various parameters, is shown in Figure 2; it can be seen that it closely resembles the desired curve.

This feature can now be substituted into a ranking function in place of the length normalization function. For example, the original BM25 function is

    score(q,d) = Σ_{t∈d∩q} f(t,q) · log((M + 1)/df(t)) · (k + 1) f(t,d) / (f(t,d) + k(1 − b + b|d|/avgdl))

The normalization feature (1 − b + b|d|/avgdl) in the BM25 formula can be replaced with h(|d|,|q|) to get the desired ranking function:

    score(q,d) = Σ_{t∈d∩q} f(t,q) · log((M + 1)/df(t)) · (k + 1) f(t,d) / (f(t,d) + k · h(|d|,|q|))    (11)

6 Results and Conclusions

The relevance feedback on the dataset is limited: the human assessors provided only one relevant document per query. Thus only the MRR (Mean Reciprocal Rank) is used to assess the models,

    MRR = (1/s) Σ_{i=1}^{s} 1/rank_i    (12)

where s is the size of the dataset and rank_i is the rank position of the first relevant document for the i-th query.

The parameters of the baseline algorithms were tuned on the training dataset and tested on the test set using MRR.
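The piecewise length-similarity heuristic h(|d|,|q|) of equation (10), and its use in place of the BM25 normalization term of equation (11), can be sketched as follows; the default parameter values mirror Table 3 but are otherwise illustrative.

```python
import math

def h(d_len, q_len, b1=2.9, b2=3.7, B1=1.0, B2=1.0, c=0.5):
    """Length-similarity heuristic of equation (10): close to 1 when |d| ≈ |q|,
    approaching b1 for much shorter documents and b2 for much longer ones."""
    if d_len < q_len:
        return 1 + (b1 - 1) / (1 + math.exp(B1 * (d_len - c * q_len)))
    if d_len > q_len:
        return 1 + (b2 - 1) / (1 + math.exp(-B2 * (d_len - (1 + c) * q_len)))
    return 1.0

def term_weight(tf_td, d_len, q_len, k=2.8):
    """BM25 term-frequency component of equation (11), with h(|d|,|q|)
    substituted for the usual pivoted length normalization."""
    return (k + 1) * tf_td / (tf_td + k * h(d_len, q_len))
```

Because h(|d|,|q|) sits in the denominator, a document whose length matches the query length (h ≈ 1) receives a larger term weight than one whose length diverges (h → b1 or b2), which is exactly the reward for length similarity described above.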
[Figure 2: The nature of h(|d|,|q|) with B1 = 1, B2 = 1 and c = 0.5, plotted for b1 = 3, b2 = 4, |q| = 20 and for b1 = 4, b2 = 3, |q| = 15]

Model                                  | Training Set MRR | Testing Set MRR
Length Similarity Heuristic with BM25  | 0.34             | 0.29
BM25                                   | 0.24             | 0.19
Pivoted Length Normalization           | 0.22             | 0.18
MPtf2ln                                | 0.21             | 0.18
MDtf2ln                                | 0.19             | 0.16
PL2                                    | 0.17             | 0.15
Dirichlet Prior                        | 0.16             | 0.13

Table 2: Mean reciprocal rank (MRR) values on the dataset using various retrieval methods

The obtained MRR values are given in Table 2. The proximity feature outperforms regular BM25 by 52%. It can be observed that query-document length similarity, not document length normalization, has helped in this situation.

Parameter              | Value
k (BM25 parameter)     | 2.8
b1 (left bound)        | 2.9
b2 (right bound)       | 3.7
B1 (growth parameter)  | 1
B2 (growth parameter)  | 1
c (trough curvature)   | 0.5

Table 3: Parameters used in the similarity feature in BM25

The parameters used in the similarity feature (Table 3) with BM25 show that b1 and b2 are tuned to around 3 and 4. This means that the magnitude of the TF-IDF value is penalized around 3 or 4 times when the lengths are dissimilar.

Further work is to be carried out by using more evaluation tests, creating a better dataset, giving proofs for generalization, and applying the algorithm to more applications such as spelling correction.

References

[Adomavicius and Tuzhilin 2005] G. Adomavicius and A. Tuzhilin. 2005. Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, June.

[Amati and Van Rijsbergen 2002] Gianni Amati and Cornelis Joost Van Rijsbergen. 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20(4):357–389, October.

[Bellogín et al. 2010] Alejandro Bellogín, Iván Cantador, and Pablo Castells. 2010. A study of heterogeneity in recommendations for a social music service. In Proceedings of the 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems, HetRec '10, pages 1–8, New York, NY, USA. ACM.

[Cantador et al. 2010] Iván Cantador, Alejandro Bellogín, and David Vallet. 2010. Content-based recommendation in social tagging systems. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys '10, pages 237–240, New York, NY, USA. ACM.

[Esparza et al. 2012] Sandra Garcia Esparza, Michael P. O'Mahony, and Barry Smyth. 2012. Mining the real-time web: A novel approach to product recommendation. Knowledge-Based Systems, 29:3–11.

[Fang et al. 2011] Hui Fang, Tao Tao, and Chengxiang Zhai. 2011. Diagnostic evaluation of information retrieval models. ACM Trans. Inf. Syst., 29(2):7:1–7:42, April.

[Kwon and Kim 2010] Joonhee Kwon and Sungrim Kim. 2010. Friend recommendation method using physical and social context.

[Manning et al. 2008] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.

[Massung et al. 2016] Sean Massung, Chase Geigle, and ChengXiang Zhai. 2016. MeTA: A unified toolkit for text retrieval and analysis. ACL 2016, page 91.

[Nishioka and Scherp 2016] C. Nishioka and A. Scherp. 2016. Profiling vs. Time vs. Content: What does Matter for Top-k Publication Recommendation based on Twitter Profiles? An Extended Technical Report. ArXiv e-prints, March.

[Robertson and Walker 1994] S. E. Robertson and S. Walker. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '94, pages 232–241, New York, NY, USA. Springer-Verlag New York, Inc.

[Singhal et al. 1996] Amit Singhal, Chris Buckley, and Mandar Mitra. 1996. Pivoted document length normalization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96, pages 21–29, New York, NY, USA. ACM.

[Suchal and Návrat 2010] Ján Suchal and Pavol Návrat. 2010. Full Text Search Engine as Scalable k-Nearest Neighbor Recommendation System, pages 165–173. Springer Berlin Heidelberg, Berlin, Heidelberg.

[Totti et al. 2016] Luam C. Totti, Prasenjit Mitra, Mourad Ouzzani, and Mohammed J. Zaki. 2016. A query-oriented approach for relevance in citation networks. In Proceedings of the 25th International Conference Companion on World Wide Web, WWW '16 Companion, pages 401–406, Republic and Canton of Geneva, Switzerland. International World Wide Web Conferences Steering Committee.

[Wang and Yuan 2009] Xin Wang and Fang Yuan. 2009. Course recommendation by improving BM25 to identify students' different levels of interest in courses. In New Trends in Information and Service Science, NISS '09, International Conference on, pages 1372–1377. IEEE.

[Wu et al. 2008] Ho Chung Wu, Robert Wing Pong Luk, Kam Fai Wong, and Kui Lam Kwok. 2008. Interpreting TF-IDF term weights as making relevance decisions. ACM Trans. Inf. Syst., 26(3):13:1–13:37, June.

[Zhai and Lafferty 2004] Chengxiang Zhai and John Lafferty. 2004. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst., 22(2):179–214, April.

[Zhai and Massung 2016] ChengXiang Zhai and Sean Massung. 2016. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. Morgan & Claypool.