User Clustering in Online Advertising via Topic Models

Sahin Cem Geyik (Applied Science Division, Turn Inc., Redwood City, CA, USA; [email protected])
Ali Dasdan (Applied Science Division, Turn Inc., Redwood City, CA, USA; [email protected])
Kuang-Chih Lee* (Computational Advertising, Yahoo! Labs, Sunnyvale, CA, USA; [email protected])

arXiv:1501.06595v2 [cs.AI] 24 Feb 2015

* This work was completed while the author was with Turn Inc.

ABSTRACT

In the domain of online advertising, our aim is to serve the best ad to a user who visits a certain webpage, to maximize the chance of a desired action being performed by this user after seeing the ad. While it is possible to generate a different prediction model for each user to tell if he/she will act on a given ad, the prediction result typically will be quite unreliable with huge variance, since the desired actions are extremely sparse, and the set of users is huge (hundreds of millions) and extremely volatile, i.e., a lot of new users are introduced every day, or are no longer valid. In this paper we aim to improve the accuracy in finding users who will perform the desired action, by assigning each user to a cluster, where the number of clusters is much smaller than the number of users (in the order of hundreds). Each user will fall into the same cluster with another user if their event histories are similar. For this purpose, we modify the probabilistic latent semantic analysis (pLSA) model by assuming the independence of the user and the cluster id, given the history of events. This assumption helps us to identify the cluster of a new user without re-clustering all the users. We present the details of the algorithm we employed as well as the distributed implementation on Hadoop, and some initial results on the clusters that were generated by the algorithm.

Categories and Subject Descriptors: J.0 [Computer Applications]: General; I.5.3 [Computing Methodologies]: Pattern Recognition—Clustering

General Terms: Algorithms, Application

Keywords: Online advertising, User clustering, Topic models

1. INTRODUCTION

In online advertising, the goal is to find the best ad under constraints for a given user in an online context. The context varies based on the applicable ad format (e.g., banner or display ad, video ad) and the device (e.g., desktop, mobile) the user is using at the time of the ad impression (showing the ad to the user). The constraints are mainly imposed by advertisers on the target user and context parameters, e.g., users of a certain age range visiting webpages of certain categories. The expectation of an advertiser is one of: brand recognition, clicks, or actions. Actions are advertiser defined and can be one of inquiring about or purchasing a product, filling out a form, visiting a certain page, etc. [8].

A certain ad from an advertiser can be shown to a user in an online context only if the value for the ad impression opportunity is high enough to win in a real-time, competitive auction [2]. Advertisers, directly or indirectly through demand-side platforms (entities that work on behalf of advertisers to deal with real-time bidding ad exchanges), signal their value via bids. The bid for an ad impression is calculated as the action probability, given a user in a certain online context, multiplied by the cost-per-action goal an advertiser wants to meet or beat.

The action probability computation is about computing the likelihood of an action by a certain user on a certain ad on a certain webpage. Since it is impossible for every user to see every ad on every webpage, the likelihood is almost always unknown. As a result, we need to use campaign/website/user hierarchies and other techniques to reduce this extreme sparsity and compute this extremely rare event [8]. The idea can be explained best by an example: The likelihood of a user visiting a certain webpage can be unknown or zero, but we may have better confidence if we consider the likelihood of the same user visiting the top level domain that the webpage belongs to (e.g. the finance page of a news website vs. any page under the same news website). The example that applies to this paper is along the user dimension: The likelihood of a user acting on a certain ad on a certain webpage can be unknown, but we may approximate the likelihood if we consider the actions of similar users on the same or similar webpages or the top level domains they belong to.

User similarity can be deduced from user clustering. User clustering in turn can be done utilizing user attributes such as age, gender, income, and geographic attributes; however, most of the user attributes come from third-party data providers, who provide the data to advertisers at a cost (and only for a subset of users). As a result, we have to resort to using user attributes that do not come from these data providers.

In this paper, we propose to use the event history of users to cluster users into multiple classes and use the class id as a feature combined with other features, such as advertiser and publisher properties, to calculate the action probability at the time of ad serving. This clustering process, named user segmentation in the advertising domain, is receiving increasing attention. We present current efforts in the literature in § 2. Please see Figure 1 for a description of the problem we are attempting to solve. The event history is built by the event tracking process. When a user performs the desired action on the advertiser conversion page, a small piece of javascript code, called a beacon or tracking pixel, fires an event to the ad serving system to report that a desired action has been fulfilled. Since we use the beacon id to identify each event in our implementation, we will refer to events as beacons for the rest of the paper.

[Figure 1: Problem Description. Instead of asking "What is the action/click probability of each ad, given the user, publisher, and ad?", the question becomes "What is the probability of each ad ending in an action/click, given the cluster, publisher, and ad?"]

2. PREVIOUS WORK

User clustering (or user segmentation) for better performance in advertising systems is a subject that has gained increasing attention in the past years. An analysis of how user segmentation can help click-through rate (CTR), i.e. the percentage of the advertisements that are clicked by a user, has been presented in [17], which justifies this interest.

One of the most common methodologies for grouping similarly behaving users in the online advertising domain is the utilization of topic models. [16] introduces Probabilistic Latent Semantic User Segmentation (PLSUS) for advertising, where each user is represented as a bag of words according to his/her search queries. Their methodology applies pLSA, hence suffers from the same problems aforementioned (new users and cost of computation). Latent Dirichlet Allocation (LDA) is another type of topic model that is commonly used to cluster users in many applications. [5] gives an example where web users are clustered, assuming the users are the documents and the visited websites are the words. Furthermore, [14] gives an application of LDA for online advertising (but on a small, sampled set of users). The main problem with LDA is the computational complexity to train the model, and the difficulty in assigning a new user to a cluster (topic). While recent work [12] presents distributed implementations of Latent Dirichlet Allocation and the Hierarchical Dirichlet Process as fast ways to train topic models, the implementation overhead (cost of human resources for implementation, as well as scalability issues that affect applicability in a real advertising scenario) is not negligible.

Other than directly segmenting the users, there has also been work on estimating clicks or actions via the user properties.
The first case we would like to list, similar to our work, is given in [1], where the authors utilize user actions to classify users into two groups for each advertiser: converters (will purchase, or click etc.) and non-converters. They utilize support vector machines (SVM), and train this model using the past actions of users (similar to the beacons we mention in this paper). The problem with this approach is that it effectively classifies users into two groups, and such a grouping may not be valid for all campaigns (i.e. a converter for an ad for a certain product may not be a converter for another product). Our clustering approach aims at separating users into multiple groups, and calculating action/click probabilities for each group, over many campaigns.

Another approach to utilizing user information for advertising is given in [10], where the authors employ friendship data (due to the similarity of friends, i.e. homophily) as well as other information that arises in a large online social networking site. This kind of information however is often proprietary, and not available to most online advertisers. The paper in [7] aims at incorporating user demographic information into a text representation. This representation is later utilized for matching relevant ads to both ad and publisher textual properties to enhance advertising efficiency. While the authors show a CTR lift due to their methodology, it is not certain whether such matching is feasible or useful in online advertising systems where there is only limited demographic information for users as well as publishers (textual features such as title, keywords etc. are not always available, but are crucial for the methodology presented in the paper).

Finally, in the mobile advertising domain, [11] suggests the usage of mobile user patterns and therefore segmenting users based on the pattern similarity. The authors utilize categorization of location and activities as well as a constraint-based Bayesian Matrix Factorization model to deal with sparsity. The main problems with such methods in the advertising domain are the limited availability of detailed user activity data, as well as the sheer number of users in the advertising domain, which make it extremely hard for such methods to be applied.

The methodology we propose in this paper is inspired highly by probabilistic latent semantic analysis (pLSA) [6], which has been used in document clustering, scene classification [3], and user clustering for news recommendation [4]. However, there is a significant difference between pLSA and our approach. While in pLSA the user and the item (e.g., the recommended item for recommender systems, or the generated words for document clustering) are independent of each other given the cluster information, in our case the cluster id and the user id are independent of each other given a previous event of a user. Such a difference actually helps to determine the cluster which a new user belongs to at run-time just by looking at its beacon history. On the contrary, pLSA requires a huge amount of memory to store the cluster id for each existing user, and has to recompute the topic model from scratch for new users. Although the modification proposed for pLSA is pretty general and can be applied to multiple domains, we believe it is especially useful for the online advertisement domain (due to its dynamic nature, i.e. a constantly evolving set of users). This modification also significantly reduces the computation of the expectation maximization (EM) algorithm that is used to learn the clustering parameters.

The rest of the paper is as follows. We first give the relevant previous work for both user clustering, as well as its different applications, in § 2. Then, we introduce our methodology, as well as providing the differences compared to pLSA and our motivation for these differences, in § 3. We give the implementation details in § 4. The initial results of our clustering methodology are provided in § 5. Finally, we conclude the paper and propose future work in § 6.

3. METHODOLOGY

As aforementioned, our clustering algorithm is highly inspired by probabilistic Latent Semantic Analysis (pLSA) [6], with a difference in the independence assumption of user and cluster, given the history of the user. In this section, we first give an introduction to pLSA, and then give the differences of our model and our motivation for why we chose that path. We conclude this section with the algorithmic and mathematical details of the model we utilized.

3.1 Probabilistic Latent Semantic Analysis

Probabilistic latent semantic analysis [6] assumes the existence of a latent class variable (i.e. clusters, or topics), and that each document has a distribution over the set of topics. Furthermore, it assumes that given the topic information, the probability of a word being generated by a document is independent of that document's id:

    p(wordk | documentj, topici) = p(wordk | topici).   (1)

The model probabilities can be tied to observed counts. For instance, the conditional probability of a cluster given a document satisfies

    p(ci|dj) = Σ_wk p(ci,dj,wk) / (Σ_cm Σ_wk p(cm,dj,wk))
             = Σ_wk p(dj,wk) p(ci|dj,wk) / p(dj)
             = Σ_wk α p(dj,wk) p(ci|dj,wk) / (α p(dj))
             = Σ_wk n(dj,wk) p(ci|dj,wk) / n(dj),

where n(dj,wk) is the number of times dj and wk have been seen together, and n(dj) is the number of times we have seen a word from document dj in the training data. These counts arise when α is taken to be the total number of word-document co-occurrences, i.e. α = Σ_dn Σ_wm n(dn,wm): since n(dj) = Σ_wm n(dj,wm), we get α p(dj) = n(dj). Please note that the α value also causes α p(dj,wk) in the numerator to be n(dj,wk).

This makes pLSA a really powerful generative model. That is, if we want to generate a new word from a document d,
we need to first generate the topic, or cluster c (according to p(c|d)), and then pick a word w (according to p(w|c)). The mapping of this to a recommendation system is by first selecting the cluster for a user (document) and then generating the item (word, i.e. recommendation) from this cluster.

The parameters of the pLSA model are learned using the expectation maximization algorithm [6] as follows:

• Expectation Step: Update the probability of a cluster given the document and word:

    p(ci|dj,wk) = p(ci,dj,wk) / Σ_cm p(cm,dj,wk)
                = p(wk|ci) p(ci|dj) p(dj) / (Σ_cm p(wk|cm) p(cm|dj) p(dj))
                = p(wk|ci) p(ci|dj) / Σ_cm p(wk|cm) p(cm|dj).

As aforementioned, the pLSA model is a very appropriate model for recommendation systems, since it is capable of suggesting a new item (i.e. one that the user has not seen before) to a user, due to the assumption that given the cluster id, the user and the item sets are independent of each other. But it comes with a cost, the first being the necessity to store the cluster information for each user, and the second being the problem of determining the cluster for new users. In most recommendation systems, the user sets are quite stable (i.e. we do not have too many users entering or leaving the system), hence the cluster of a user rarely needs to be re-determined. Even if the item set changes frequently, it is a much easier task to update the item probabilities for each cluster (a single maximization step), as long as the user sets do not change too frequently, in which case we need to run the whole EM-step.

We will next give the difficulties that arise in the domain of online advertising, hence the need for a different approach.
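As a toy illustration of these EM updates (the expectation step above, together with the maximization step that re-estimates p(w|c) and p(c|d) from the expected counts), here is a minimal, self-contained sketch; the tiny count matrix and all variable names are hypothetical, not taken from the paper's system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts n(d_j, w_k): 4 documents x 5 words.
n = np.array([[3, 1, 0, 0, 1],
              [2, 2, 0, 1, 0],
              [0, 0, 4, 2, 0],
              [0, 1, 3, 3, 0]], dtype=float)
D, W, C = n.shape[0], n.shape[1], 2  # 2 latent clusters

# Randomly initialized parameters p(w|c) and p(c|d).
p_w_c = rng.random((C, W)); p_w_c /= p_w_c.sum(axis=1, keepdims=True)
p_c_d = rng.random((D, C)); p_c_d /= p_c_d.sum(axis=1, keepdims=True)

for _ in range(50):
    # Expectation: p(c|d,w) proportional to p(w|c) p(c|d); shape (D, W, C).
    post = p_c_d[:, None, :] * p_w_c.T[None, :, :]
    post /= post.sum(axis=2, keepdims=True)
    # Maximization: re-estimate from expected counts n(d,w) p(c|d,w).
    exp_cnt = n[:, :, None] * post
    p_w_c = exp_cnt.sum(axis=0).T                 # shape (C, W)
    p_w_c /= p_w_c.sum(axis=1, keepdims=True)
    p_c_d = exp_cnt.sum(axis=1)                   # shape (D, C)
    p_c_d /= p_c_d.sum(axis=1, keepdims=True)

# Each row of p_c_d is now a distribution over the latent clusters.
```

After convergence, each document's row of `p_c_d` plays the role of the stored per-user cluster distribution that the paper argues is too expensive to maintain for hundreds of millions of volatile users.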
• Maximization Step: Update the conditional probabilities of words given clusters, and clusters given documents:

    p(wk|ci) = Σ_dj p(ci,dj,wk) / (Σ_wm Σ_dj p(ci,dj,wm))
             = Σ_dj α p(dj,wk) p(ci|dj,wk) / (Σ_wm Σ_dj α p(dj,wm) p(ci|dj,wm))
             = Σ_dj n(dj,wk) p(ci|dj,wk) / (Σ_wm Σ_dj n(dj,wm) p(ci|dj,wm)),

where, as before, taking α to be the total number of word-document co-occurrences turns α p(dj,wk) into the count n(dj,wk); the update for p(ci|dj) reduces to counts analogously.

3.2 Challenges of Online Advertising

As we have explained in the previous subsection, pLSA works well in recommending new items, given that the users are stable within the system and that the system stores the cluster ids for all the users. In online advertising however, the users are much more volatile, since a web user's id is based on the cookies on his/her device. If these cookies are deleted, the user id no longer exists; hence its cluster id is no longer valid either. All of that causes many new users to arrive in (or leave) the ecosystem every day. What we need is a system that looks at the history of the user's events and assigns it to a cluster at ad-serving time (i.e. run-time).

[Figure 2(a): Plate notation for pLSA: User --p(c|u)--> Cluster --p(b|c)--> Beacon]
[Figure 2(b): Plate notation for the clustering method we utilized: User --p(b|u)--> Beacon --p(c|b)--> Cluster]

Another problem in the online advertising domain is the fact that we are not trying to recommend new items to users, at least not in the context of this paper. What we are trying to do is to be able to understand the action/click probability of a user, given that it belongs to a certain cluster (i.e. if a user belongs to a cluster due to his/her events, we want its cluster id, not what other events (beacons) it will perform, i.e. a generalization given its past events (beacons)). Hence we want a model which only looks at the event history of a user (not its user id, since it is too volatile) and gives the group (cluster) id (this also covers all the new users, as long as we know their beacon history), so that we can calculate the action probability of that group (cluster). This also gives us the biggest difference of our methodology compared to pLSA: Given the beacon history, the user is independent of the cluster id.
This difference, as well as others that arise from it, are given in the next subsection.

Figure 2: Difference between pLSA and the clustering we utilized. Please notice that while pLSA assumes that the beacons and users are independent given the cluster id, we assume that the users and clusters are independent given a beacon id from the user's history of events.

3.3 Differences of Our Methodology Compared to pLSA

The biggest difference of the clustering method we employ compared to pLSA is in the model. Keeping with the terms of online advertising, while pLSA assumes the independence of the beacons given the user id, as long as the cluster id is known, we assert that the cluster id is independent of the user id, given the beacon id. This independence assumption in the model helps us in determining the cluster of a new or existing user at run-time from its beacon history; therefore, we do not need to store any cluster id for any user. Since now we can determine the cluster of a user at run-time, an ad-serving system can calculate the action or click probability of this user given a specific ad on a specific website (publisher) as "What is the action/click probability of a user belonging to cluster i for this ad at this website?".

The beacon history of a user is the multiset of beacons it has performed, e.g. h(ui) = {bi,1, bi,2, ..., bi,n} for user i, if it has performed n actions/beacons. Not all of these n beacons have to be unique, hence we can come up with a probability distribution such as p(bj|ui) = n(bj,ui) / n(ui), where n(bj,ui) is the number of times user i has seen beacon j, and n(ui) is the total number of beacons that user i has seen. Therefore, at run-time, we calculate

    p(ci|uj) = Σ_bk p(ci,bk|uj) = Σ_bk p(ci|bk) p(bk|uj)   (2)

for each cluster i, and determine the cluster that gives the maximum conditional probability to be the cluster that user j belongs to.

Please see the difference of the models as plate notations in Figure 2. In the figure, it can be seen that the user notation could be directly removed from our model, and hence its relationship with pLSA becomes pretty weak (for good reason too, due to the nature of online advertising as
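Equation 2 turns run-time assignment into a small weighted sum over the user's beacon history; only the beacon-to-cluster table has to be stored. A minimal sketch (the table values, beacon names, and cluster names below are hypothetical):

```python
from collections import Counter

# Stored model output: p(cluster|beacon) for each beacon (hypothetical values).
p_c_given_b = {
    "hotel_reservation": {"travel": 0.9, "autos": 0.1},
    "car_rental_home":   {"travel": 0.4, "autos": 0.6},
    "car_config_tool":   {"travel": 0.1, "autos": 0.9},
}

def assign_cluster(beacon_history):
    """Eq. 2: p(c|u) = sum_b p(c|b) p(b|u), with p(b|u) = n(b,u) / n(u)."""
    counts = Counter(beacon_history)
    total = sum(counts.values())
    scores = {}
    for beacon, n_bu in counts.items():
        p_bu = n_bu / total
        for cluster, p_cb in p_c_given_b.get(beacon, {}).items():
            scores[cluster] = scores.get(cluster, 0.0) + p_cb * p_bu
    # Hard assignment: the cluster with maximum conditional probability.
    return max(scores, key=scores.get)

history = ["car_rental_home", "car_config_tool", "car_config_tool"]
print(assign_cluster(history))  # prints "autos" (0.8 vs. 0.2 for "travel")
```

Note that a brand-new user is handled exactly like an existing one: only its beacon history enters the computation, which is the point of the modified independence assumption.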
Since the beacon probabilities given users are known during run-time, all we have to store is the beacon-to-cluster mapping (i.e. p(ci|bk) for all beacon-cluster pairs). Please note that this requires much less storage than keeping cluster ids for each user (we have ∼500 million users, while we take ∼600 clusters, due to our previous experience with different numbers of clusters, and around 13,000 beacons, and not every beacon has an above-zero conditional probability for each cluster). It is also robust to the changing beacon history of each user, as well as to new users (whose beacons might have been fired elsewhere, or who were not present when model training was being performed).

Furthermore, our model is pretty general and can be applied to different application domains; however, it is especially necessary and useful in online advertising due to the constantly evolving set of users. Our proposed formulation naturally also brings some differences in the EM algorithm that needs to be run for training the model probabilities. As an obvious example of this, we had mentioned previously that the expectation step of pLSA was calculating p(ci|dj,wk), which is the conditional probability of a cluster i given document j and word k, and which can be interpreted in the online advertising domain as the conditional probability of cluster i given user j (uj) and beacon k (bk). But due to our model's independence assumption, p(ci|uj,bk) = p(ci|bk). The EM algorithm is trying to maximize

    Π_{∀uj} Π_{∀ci where p(ci|uj)>0} p(ci|uj)
        = Π_{∀uj} Π_{∀ci where p(ci|uj)>0} Σ_bk p(ci|bk) p(bk|uj),  per Eq. 2.

The details of the EM algorithm we employ, as well as the determination of the cluster for a user at run-time, are given in the next subsection.
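To make the training loop concrete before its detailed steps, here is a toy, in-memory sketch of the hard-clustering EM described in the next subsection; the real system runs these steps as distributed jobs, and the miniature data and all names here are hypothetical:

```python
import random
from collections import Counter, defaultdict

random.seed(7)

# Hypothetical users -> beacon histories.
users = {
    "u1": ["b_hotel", "b_hotel", "b_flight"],
    "u2": ["b_flight", "b_hotel"],
    "u3": ["b_car", "b_car", "b_tires"],
    "u4": ["b_tires", "b_car"],
}
C = 2
total = sum(len(h) for h in users.values())
p_b = Counter(b for h in users.values() for b in h)
for b in p_b:
    p_b[b] /= total                               # empirical p(b)

assign = {u: random.randrange(C) for u in users}  # random initial clusters

for _ in range(20):
    # Maximization: p(c) = sum_u p(u) p(c|u); p(b|c) from cluster members.
    p_c = [0.0] * C
    p_b_c = [defaultdict(float) for _ in range(C)]
    for u, hist in users.items():
        c = assign[u]
        p_c[c] += len(hist) / total               # p(u) = n(u) / sum_n n(un)
        for beacon, k in Counter(hist).items():
            p_b_c[c][beacon] += k
    for c in range(C):
        z = sum(p_b_c[c].values()) or 1.0
        for beacon in p_b_c[c]:
            p_b_c[c][beacon] /= z
    # Expectation (hard): p(c|u) = sum_b p(c) p(b|c) / p(b) * p(b|u).
    new_assign = {}
    for u, hist in users.items():
        cnt, n_u = Counter(hist), len(hist)
        scores = [sum(p_c[c] * p_b_c[c].get(b, 0.0) / p_b[b] * k / n_u
                      for b, k in cnt.items()) for c in range(C)]
        new_assign[u] = max(range(C), key=scores.__getitem__)
    if new_assign == assign:
        break                                     # converged
    assign = new_assign

# Training output: beacon -> cluster mapping p(c|b) = p(c) p(b|c) / p(b).
p_c_b = {b: [p_c[c] * p_b_c[c].get(b, 0.0) / p_b[b] for c in range(C)]
         for b in p_b}
```

The only artifact kept for serving is `p_c_b`, whose size depends on the number of beacons and clusters rather than on the number of users.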
3.4 Training the Model and Run-time Cluster Determination for Users

To be able to determine the cluster of a user at run-time, we need to have access to the beacon history of this user. The beacon history of a user is the set of beacons (events) that have been performed by this user in the previous x days (we took this value to be 60 days for our implementation). The training process is directed by the above cluster determination policy, and is as follows:

• Expectation Step: For each user in the training data, calculate the probability of this user belonging to a cluster i:

    p(ci|uj) = Σ_bk p(ci|bk) p(bk|uj),

which can further be written as:

    p(ci|uj) = Σ_bk [p(ci) p(bk|ci) / p(bk)] p(bk|uj),

where p(bk) is the probability of beacon k, and is calculated as p(bk) = Σ_un n(un,bk) / (Σ_un Σ_bm n(un,bm)), where n(un,bk) is the number of times the beacon k has been seen by user n in the training data.

• Maximization Step: In the maximization step, we calculate the values p(ci) and p(bk|ci):

    p(ci) = Σ_uj p(uj) p(ci|uj),

where p(ci|uj) is the output of the expectation step. Furthermore, the marginal probability for user ui is p(ui) = n(ui) / Σ_un n(un), where n(ui) is the number of beacons ui has fired (seen). We also calculate:

    p(bk|ci) = Σ_uj p(bk|uj) p(uj|ci)
             = Σ_uj p(bk|uj) p(ci|uj) p(uj) / p(ci),

where p(ci|uj) is the output of the expectation step, and p(ci) is calculated as part of the maximization step.

The above formulation means that in the expectation step, we assign users to clusters (according to the beacon-cluster mapping, and since we do not have this at the beginning, randomly); and in the maximization step we calculate the beacon-to-cluster mapping parameters, as well as the cluster probabilities. This goes on until convergence (i.e. the cluster probabilities as well as the beacon-to-cluster probabilities no longer change significantly). This scheme is detailed in Algorithm 1. One detail in the algorithm is the difference between hard clustering and soft clustering. In hard clustering, we assign the user to the most likely cluster ci, making p(ci|u) = 1 (similar to run-time cluster determination); while in soft clustering we assign the user to multiple clusters where Σ_ci p(ci|u) = 1. At the end of the EM process, we output p(ci|bk) for each beacon-cluster pair, to be used in run-time user-to-cluster determination.

Algorithm 1: EM Algorithm to Learn the Parameters of the User Clusters

    learnClusterParameters(userSet U, beaconSet B, numClusters):
      for each user uj ∈ U do
        assign uj to a random ci where i ∈ [1, numClusters]
        p(ci|uj) = 1
      end for
      for each cluster ci do
        pold(ci) = null
        calculate pnew(ci)
        for each bk do
          pold(bk|ci) = null
          calculate pnew(bk|ci)
        end for
      end for
      while {pold(ci) and pnew(ci) are significantly different for any ci} OR
            {pold(bk|ci) and pnew(bk|ci) are significantly different for any ci and bk} do
        for each user uj do
          if SoftClustering then
            Calculate p(ci|uj) for all ci
          else
            Choose ci for uj where p(ci|uj) is maximum
            p(ci|uj) = 1
          end if
        end for
        for each cluster ci do
          pold(ci) = pnew(ci)
          calculate pnew(ci)
          for each bk do
            pold(bk|ci) = pnew(bk|ci)
            calculate pnew(bk|ci)
          end for
        end for
      end while
      for each cluster ci do
        for each beacon bk do
          Calculate and return p(ci|bk)
        end for
      end for

4. IMPLEMENTATION DETAILS

We utilized Apache Pig [13] to implement the clustering algorithm outlined in this paper. Pig is a high-level language working on top of the Hadoop [15] framework, hence it makes our clustering task complete much quicker due to parallelization. As stated in the previous section, the clustering task starts with determining the cluster id for each line (i.e. the data for a single user) in the user information folder, which contains p(uj) as well as the set of unique beacons this user has encountered, with their conditional probabilities (p(bk|uj)). This user data is pre-calculated to compress the whole training data, as well as to reduce the time for a single EM step. The Pig script section for this cluster determination step is given in Figure 3(a). In the figure, userWeight is p(uj), user is a bag of p(bk|uj), and CLUSTERID is a user defined function (udf) which takes p(ci) and p(bk|ci) as well as the user (bag of p(bk|uj)) as inputs and produces a bag of p(ci|uj) (i.e. clusterIdsWeights). This structure means that we can calculate the EM step parameters by flattening (unfolding) either the user bag, or clusterIdsWeights.

Another problem is the calculation of normalized probabilities, where at the beginning we have a weight for each probability, but we need the normalized values for the same group.
For example, suppose in the below example, we gle user) in the user information folder, that contains p(uj) startwithavariableclusterBeaconWeight whichhasforeach aswellasallthesetofuniquebeaconsthisuserhasencoun- (ci,bk) pair the value αp(bk|ci) (where α is not known). To tered, with their conditional probabilities (p(bk|uj)). This calculatetheexactvaluep(bk|ci),weneedtofirstgroupac- user data is pre-calculated to compress the whole training cordingtoclusterid,thensumupp(bk|ci)asthenormalizing data,aswellastoreducethetimeforasingleEMstep. The factor. Later,weunfoldthegroupwiththissumappended, pigscriptsectionforthisclusterdeterminationstepisgiven and then divide by the normalizer to find the exact prob- in Figure 3(a). In the figure, userWeight is p(uj), user is ability. The example code piece that achieves this flow is a bag of p(bk|uj), CLUSTERID is a user defined function given in Figure 3(b), where the number after PARALLEL userClusterIds = FOREACH userData GENERATE 5. RESULTS userWeight, user, CLUSTERID(user) AS clusterIdsWeights; In this section we will present our evaluation of the pro- posed clustering algorithm. We will provide both inter- (a) Code for User-Cluster Determination pretability results (i.e. how the constructed clusters look a = GROUP clusterBeaconWeight BY clusterId like), as well as numerical results (i.e. how well advertis- PARALLEL 400; ing with the new clustering algorithm performs) in the rest of the section. Please note that we are giving a prelimi- -- we sum up the alpha * p(b|c) naryoverviewfortheperformanceofourproposedmethod, -- for each c to get normalized p(b|c) and the reasons for non-exhaustive comparisons with other b = FOREACH a GENERATE well-knownmethodsaretwo-folds. 
Firstoneistheinappro- FLATTEN($1) AS (clusterId,beaconId,weight), priateness of methods like pLSA and LDA for the dynamic SUM($1.weight) AS clusterTotalWeight; natureoftheusersinonlineadvertising,aswellasthescala- bilityissuesinLDA(forthemoreadvancedparalleltraining c = FOREACH b GENERATE clusterId, beaconId, methodsasmentionedin§2,westillhavesignificantimple- weight/clusterTotalWeight AS normalizedWeight; mentation overhead in terms of man-hours). Second reason (b)Code for Probability Normalization is the way we wanted to evaluate the proposed methodol- ogy. In our domain, we deploy new models into our actual Figure 3: Apache Pig code examples from our user advertisingsystemtochecktheirperformancesandanytest clustering implementation, which should be helpful means lost/gained revenue in such an industrial scenario. in understanding the parallel computation steps. In most cases, the online test is the only feasible way to evaluateanewmodel/method,sincethisistheonlywaywe candirectlymeasuremorerelevantmetrics(suchasrevenue, gives the number of reducers, and the keyword FLATTEN actions/clicksvs. costofadvertisingetc.) inonlineadvertis- is used to unfold thebag which is theset of p(bk|ci) for the ing. Thisiswhysomeof themorerecentwork [9]onoffline same cluster. evaluationisoftennotuseful: themetricsthatcanbeexam- The power of using Pig, hence the Hadoop framework is ined in such settings (suchas click-through rate, AUCetc.) due to assigning the cluster ids to many users in different often do not correlate with online metrics. Due to this, it machines, as well as the sum operations working on differ- is not often possible to test many models (unless we are ent cluster parameters at the same time. Please note that fairly confident of the performance) since it may mean lost thealgorithm(Algorithm1)weprovidedformodeltraining revenue; this is why we limited the testing of our proposed is embarrassingly parallel, i.e. 
at each point of training we can divide the dataset completely according to the current formulation we want to compute (whether by user id, beacon id, or cluster id). This also means that our parallel implementation is linearly scalable, i.e., it is n times faster than a single-machine approach at any step of training, where n is the number of machines (e.g., the number of reducers for reducer-side operations such as grouping and counting, or the number of mappers for mapper-side operations such as calculating p(ci|uj) in Algorithm 1). In Algorithm 1, the calculation of p(ci|uj) is one map-reduce job (mapper-intensive), and the calculation of both pnew(ci) and pnew(bk|ci) is another, separate map-reduce job (reducer-intensive); these two jobs run sequentially and repeatedly until convergence. All of this adds up to make it feasible to learn the model for the huge number of users and events we deal with in the online advertising domain.

In our model, we utilized a set of ~13,000 beacons captured over 3 months. This set was selected by first removing both tails from the whole set of beacons, i.e., removing the most frequent (because they are not separative enough) and the least frequent (because they are too uncommon), and then sampling among the remaining beacons. This set of beacons translated into ~500 million users. We have observed that the EM algorithm converges in about 25-30 cycles (for hard clustering). Each cycle takes about 15 minutes in our test cluster (for a setting of ~3,500 mappers, where the processing takes ~1.5 minutes per mapper, and ~400 reducers, where the processing takes ~4 minutes per reducer), hence a total of about 7 hours for the whole process.

We compare our model and algorithm to our previously employed algorithm based on a cosine similarity metric; in this section, we are therefore essentially evaluating how our model replacement effort turned out.

Let us start with our interpretability results, which are a way to manually check whether the clusters generated by our method make sense. Figure 4 gives a number of example clusters constructed by the proposed algorithm. Although we set our system to generate around 600 clusters (due to our previous experience with the domain, we found this to be a good number of partitions for the whole set of users), we give only eight of those clusters for presentational purposes. In the figure, we show only the top three beacons for each cluster; the number in parentheses after each beacon definition gives the probability of that beacon being generated by a user belonging to that cluster (e.g., p(bHotel2-ConversionAction | Cluster 1) = 0.12). Due to privacy reasons, we do not give the exact names of the beacons we used, but rather a categorization of the subject of the beacon and the type of event the beacon indicates. For example, Homepage Action means the visit of the homepage of an advertiser by a user, whereas a Conversion Action usually indicates a purchase. We furthermore give the number of users that fell under a specific cluster (during the clustering process) next to the cluster id in parentheses (e.g., during the clustering process, 661,083 users out of the ~500M users we processed fell under Cluster 5).

Cluster 1 (486,299 users):
  Hotel 1 - Reservation Action (0.35)
  Hotel 1 - Directory Search Action (0.13)
  Hotel 2 - Conversion Action (0.12)
Cluster 2 (805,853 users):
  Touristic Location 1 - Remarketing Action 1 (0.36)
  Touristic Location 1 - Remarketing Action 2 (0.35)
  Touristic Location 2 - Homepage Action (0.10)
Cluster 3 (871,820 users):
  Car Rental Comp. 1 - Retargeting Action (0.44)
  Car Rental Comp. 2 - Homepage Action (0.23)
  Car Rental Comp. 3 - Homepage Action (0.08)
Cluster 4 (353,682 users):
  Trousers Comp. 1 - Women's Homepage Action (0.19)
  Trousers Comp. 1 - Homepage Action (0.18)
  Trousers Comp. 2 - Homepage Action (0.16)
Cluster 5 (661,083 users):
  Women's Clothing Comp. 1 - Catalog Action (0.21)
  Women's Clothing Comp. 2 - Site Interaction (0.20)
  Clothing Comp. - Site Interaction (0.14)
Cluster 6 (1,261,022 users):
  Cruise Comp. 1 - Homepage Action (0.39)
  Cruise Website 1 - Homepage Action (0.18)
  Cruise Website 1 - Cruise Comp. 2 Action (0.10)
Cluster 7 (280,334 users):
  Political Campaign - Remarketing Action (0.36)
  Political Group - Homepage Action (0.16)
  Politics Magazine - Homepage Action (0.16)
Cluster 8 (357,532 users):
  Japanese Car Comp. - Homepage Action (0.25)
  German Car Comp. - Homepage Action (0.24)
  French Car Comp. - Homepage Action (0.16)

Figure 4: Example clusters for the online advertising domain. Cluster ids are followed by the number of users in our system that belong to the cluster. We list only the three most significant beacons for each cluster, and the numbers in parentheses give the values p(beacon|cluster).

The results, of which a subset is given in the figure, show that the clusters are mostly meaningful, i.e., the events (beacons) that fall under the same cluster are similar. Furthermore, it can be seen that they do not always belong to the same advertiser, but usually to the same subject. For example, Cluster 7 has events/beacons from different advertisers under it, one of them being a political campaign, one a political group, and one a political magazine. The clustering algorithm was able to gather these events under the same cluster since they all belong to the subject Politics (of course, this is the expected behavior, since the users that fall under this cluster are the ones interested in politics). Similar arguments can easily be made for the other clusters presented in the figure, and for most of the clusters that are currently being used in our system but not presented here.
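The per-cycle structure described above (a mapper-intensive job that computes p(ci|uj) to hard-assign users, followed by a reducer-intensive job that recounts pnew(ci) and pnew(bk|ci)) can be sketched in single-machine form as follows. This is an illustrative sketch only: the toy user/beacon data, the seeding of clusters from two users' beacon sets, and the add-one smoothing are assumptions of the example, not details of the production system.

```python
import math
from collections import defaultdict

# Toy user -> beacon-set data (illustrative; real beacon names are anonymized,
# and the production system uses ~13,000 beacons and ~500M users).
users = {
    "u1": {"hotel_reserve", "hotel_search"},
    "u2": {"hotel_reserve", "hotel_deal"},
    "u3": {"politics_home", "politics_news"},
    "u4": {"politics_news", "politics_donate"},
}
beacons = sorted({b for bs in users.values() for b in bs})
K = 2  # number of clusters (the production setting is ~600)

# Assumed initialization: seed each cluster's beacon distribution from one
# user's beacon set (the paper does not spell out its seeding scheme).
seeds = ["u1", "u3"]
p_c = [1.0 / K] * K
p_b_c = []
for c in range(K):
    w = {b: (2.0 if b in users[seeds[c]] else 1.0) for b in beacons}
    z = sum(w.values())
    p_b_c.append({b: w[b] / z for b in beacons})

assign = {}
for cycle in range(30):  # the paper reports convergence in ~25-30 cycles
    # "Mapper-side" job: hard-assign each user u to argmax_c p(c) * prod_b p(b|c),
    # computed in log space for numerical stability.
    new_assign = {
        u: max(range(K), key=lambda c: math.log(p_c[c])
               + sum(math.log(p_b_c[c][b]) for b in bs))
        for u, bs in users.items()
    }
    if new_assign == assign:
        break  # assignments stable: converged
    assign = new_assign
    # "Reducer-side" job: recount to get p_new(c) and p_new(b|c),
    # with add-one smoothing (an assumption of this sketch).
    n_c = defaultdict(int)
    n_bc = [defaultdict(int) for _ in range(K)]
    for u, c in assign.items():
        n_c[c] += 1
        for b in users[u]:
            n_bc[c][b] += 1
    p_c = [(n_c[c] + 1.0) / (len(users) + K) for c in range(K)]
    for c in range(K):
        tot = sum(n_bc[c].values())
        p_b_c[c] = {b: (n_bc[c][b] + 1.0) / (tot + len(beacons)) for b in beacons}

print(assign)  # the two hotel users share one cluster, the two politics users the other
```

In the distributed version, the assignment comprehension is the mapper-side job and the two counting loops are the reducer-side job; the loop over cycles corresponds to running the two jobs sequentially and repeatedly until convergence.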
Next, we give a comparison of the clustering algorithm against a previous system that was being used in our company's advertising framework, again for user segmentation. Our previous algorithm also utilizes user beacons, but employs a cosine similarity metric (taking the beacon set as a feature vector for each user) to determine which users are closer, or more similar, to each other, and hence should be in the same cluster for action rate and click rate calculation. In this algorithm, each beacon bi has a weight wi, i.e., wi = α|u(bi)|, where u(bi) is the set of users that have fired this beacon at least once. Furthermore, each user uj is represented by a vector in which the entries for the beacons seen by this user have the value 1 and all others 0 (i.e., v(uj) = [v1, ..., vn], where n is the total number of distinct beacons in the system, and vi = 1 if uj has seen beacon bi, and 0 otherwise). The algorithm starts with n centroids, where n is the total number of beacons available (at the beginning, each centroid cm is a vector v(cm) = [v1, ..., vn] in which only vm, representing beacon bm, is 1 and all other entries are 0), and applies k-means on this set of centroids (we take k = 600, similar to the new methodology). Each user uj is assigned to a centroid, and hence a cluster, as follows:

    Cluster(uj) = argmax_{ci ∈ C} ( v(ci) · v(uj) ) / ( ||v(ci)|| ||v(uj)|| )

where the norm and dot product calculations take into account the weights (wi) of the beacons (bi). Furthermore, the entries of the centroid vectors are updated during the k-means according to the vectors of the users assigned to them. In this algorithm, each k-means pass is followed by a merging stage in which significantly close centroids are merged, and this process continues until no centroid merges are possible. The disadvantage of this algorithm is two-fold. The first issue is the already high (and increasing) number of users in the advertising domain, hence the computational cost. The second is the assumption that any beacon seen by a user has the same weight for that user (i.e., we do not utilize p(beacon|user) values), which does not take into account the possibility that many occurrences of the same event (beacon) may indeed indicate higher interest in a specific type of product.

We have run an A/B test using two models, one utilizing the new clusters/methodology and one the old. These models have been run on the whole set of campaigns within Turn, where impression traffic was directed to the two models with equal priority. In the models, the users that belong to the same cluster have the same predicted action/click rates (which are utilized for calculating the bid value) for the same campaigns. These rates are again calculated from historical data of actions/clicks (separately for each campaign) among the users that belong to the same cluster. Therefore, the cluster id is used as a single feature, alongside the campaign id. We provide results in terms of the effective cost per action (eCPA) and effective cost per click (eCPC) metrics, which can be described as follows:

• Effective Cost per Action (eCPA): What is the average amount of money spent by an advertiser (on advertising) to receive one action (e.g., a purchase)? This metric can be calculated as:

    eCPA = Advertising Cost / # of Actions.

• Effective Cost per Click (eCPC): What is the average amount of money spent by an advertiser (on advertising) to receive one click (on its ad)? This metric can be calculated as:

    eCPC = Advertising Cost / # of Clicks.

The above metrics are representative of the quality of the clustering, since our bidding logic takes the cluster ids of users directly into consideration. If we are able to separate the users into meaningful segments (via clustering), then the action probability calculated for each segment is more accurate (and therefore our bid values are closer to the actual value of each impression). Due to this, the money spent on each impression (due to bidding) brings more actions/clicks, hence improving the eCPC and eCPA metrics.
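The baseline's assignment rule can be sketched as below. The toy data, the value of α, and the exact way the weights wi enter the norm and dot product are assumptions of this example; the paper states only that both calculations "take into account the weights."

```python
import math

# Toy data (illustrative; real beacon names are anonymized). alpha is an
# assumed constant, left unspecified in the paper.
alpha = 1.0
user_sets = {
    "u1": {"hotel_reserve", "hotel_search"},
    "u2": {"hotel_reserve", "car_rental"},
    "u3": {"politics_home", "politics_news"},
}
beacons = sorted({b for s in user_sets.values() for b in s})

# w_i = alpha * |u(b_i)|: a beacon's weight scales with how many users fired it.
w = {b: alpha * sum(1 for s in user_sets.values() if b in s) for b in beacons}

def to_vec(beacon_set):
    """Binary feature vector v(u): entry 1 for each beacon the user has seen."""
    return {b: (1.0 if b in beacon_set else 0.0) for b in beacons}

def weighted_cosine(x, y):
    """Cosine similarity with each coordinate i scaled by beacon weight w_i
    (one plausible reading of the weighted norm/dot product)."""
    dot = sum(w[b] * x[b] * y[b] for b in beacons)
    nx = math.sqrt(sum(w[b] * x[b] ** 2 for b in beacons))
    ny = math.sqrt(sum(w[b] * y[b] ** 2 for b in beacons))
    return dot / (nx * ny) if nx > 0 and ny > 0 else 0.0

# Initial centroids: one per beacon, the unit vector for that beacon.
centroids = {b: to_vec({b}) for b in beacons}

def cluster(user_id):
    """Cluster(u_j) = argmax over centroids c_i of cos_w(v(c_i), v(u_j))."""
    v_u = to_vec(user_sets[user_id])
    return max(centroids, key=lambda c: weighted_cosine(centroids[c], v_u))
```

With this data, both u1 and u2 land on the hotel_reserve centroid, since that beacon is shared and hence more heavily weighted. In the full algorithm, the centroid entries would then be updated from their assigned users' vectors, and the merging stage would collapse near-identical centroids between k-means passes.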
The results of our initial experiments, which span a 10-day period within July-August 2013, are given in Figures 5 and 6. While Figure 5 compares the day-by-day eCPA performance of the two algorithms, Figure 6 presents the same analysis for eCPC. Due to privacy issues, we are not presenting the exact values in the plots; instead, we have scaled the values by a constant factor. It can be seen that the proposed algorithm utilizing the topic model performs much better (i.e., lower eCPA or eCPC is better) than the algorithm utilizing the cosine similarity metric. Furthermore, we see that the eCPA has an increasing trend for both models toward the end of the experimentation period. This is due to the action attribution problem inherent in advertising systems, where an action is attributed to an ad shown to a user several days earlier; because of this, the later days still have some actions yet to be received.

Figure 5: Comparison of eCPA performance for the two clustering algorithms (cosine similarity vs. topic model) over 10 days. The lower eCPA achieved by the proposed methodology indicates better performance.

Figure 6: Comparison of eCPC performance for the two clustering algorithms (cosine similarity vs. topic model) over 10 days. The lower eCPC achieved by the proposed methodology indicates better performance.

6. CONCLUSIONS AND FUTURE WORK

In this paper, we described a methodology to cluster users in online advertising. We explained the difficulties, such as the non-stable set of users and the large amount of data we have to deal with, and how those factors shape the algorithm we employ and how we implement it. We gave a brief overview of the implementation in Apache Pig, on Hadoop, as well as some initial experimental results. It is important to mention that we are not claiming to have explored all possible algorithms/models for analysis in this work, but rather that we have developed a meaningful and efficient system that solves a real-world online advertising problem and improves performance. In summary, we believe that this work fills a significant void in the literature, since it deals directly with the large-scale problems that arise in the online advertising domain. Our future work includes the extension of our results, as well as an improved clustering algorithm where user properties (such as age, gender, location, etc.) are used alongside the item sets (i.e., beacon history) for better determination of the clusters.

7. REFERENCES

[1] M. Aly, A. Hatch, V. Josifovski, and V. K. Narayanan. Web-scale user modeling for targeting. In Proc. ACM WWW, pages 3-12, 2012.
[2] C. Borgs, J. Chayes, O. Etesami, N. Immorlica, K. Jain, and M. Mahdian. Dynamics of bid optimization in online advertisement auctions. In Proc. ACM WWW, pages 531-540, 2007.
[3] A. Bosch, A. Zisserman, and X. Munoz. Scene classification via pLSA. In Proc. European Conference on Computer Vision, 2006.
[4] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: Scalable online collaborative filtering. In Proc. ACM WWW, pages 271-280, 2007.
[5] H. Fujimoto, M. Etoh, A. Kinno, and Y. Akinaga. Web user profiling on proxy logs and its evaluation in personalization. In LNCS 6612, pages 107-118, 2011.
[6] T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Sys., 22(1):89-115, 2004.
[7] A. Joshi, A. Bagherjeiran, and A. Ratnaparkhi. User demographic and behavioral targeting for content match advertising. In Proc. ACM ADKDD, pages 53-60, 2011.
[8] K.-C. Lee, B. Orten, A. Dasdan, and W. Li. Estimating conversion rate in display advertising from past performance data. In Proc. ACM SIGKDD Conf. on Knowledge Discovery and Data Mining, pages 768-776, 2012.
[9] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proc. ACM WSDM, pages 297-306, 2011.
[10] K. Liu and L. Tang. Large-scale behavioral targeting with a social twist. In Proc. ACM CIKM, pages 1815-1824, 2011.
[11] H. Ma, H. Cao, Q. Yang, E. Chen, and J. Tian. A habit mining approach for discovering similar mobile users. In Proc. ACM WWW, pages 231-240, 2012.
[12] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed algorithms for topic models. J. Machine Learning Research, 10:1801-1828, 2009.
[13] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In Proc. ACM SIGMOD, pages 1099-1110, 2008.
[14] S. Tu and C. Lu. Topic-based user segmentation for online advertising with latent dirichlet allocation. In LNCS 6441, pages 259-269, 2010.
[15] T. White. Hadoop: The Definitive Guide. O'Reilly Media, Sebastopol, CA, 2012.
[16] X. Wu, J. Yan, N. Liu, S. Yan, Y. Chen, and Z. Chen. Probabilistic latent semantic user segmentation for behavioral targeted advertising. In Proc. ACM ADKDD, pages 10-17, 2009.
[17] J. Yan, N. Liu, G. Wang, W. Zhang, Y. Jiang, and Z. Chen. How much can behavioral targeting help online advertising? In Proc. ACM WWW, pages 261-270, 2009.
