Variational Bayesian Inference for Big Data Marketing Models¹

Asim Ansari   Yang Li   Jonathan Z. Zhang²

December 2014

¹This is a preliminary version. Please do not cite or circulate.
²Asim Ansari is the William T. Dillard Professor of Marketing at Columbia Business School, Yang Li is Assistant Professor of Marketing at Cheung Kong Graduate School of Business in Beijing, and Jonathan Z. Zhang is Assistant Professor of Marketing at the University of Washington in Seattle.

Abstract

Hierarchical Bayesian approaches play a central role in empirical marketing as they yield individual-level parameter estimates that can be used for targeting decisions. MCMC methods have been the methods of choice for estimating hierarchical Bayesian models as they are capable of providing accurate individual-level estimates. However, MCMC methods are computationally prohibitive and do not scale well when applied to the massive data sets that have become common in the current era of "Big Data". We introduce to the marketing literature a new class of Bayesian estimation techniques known as variational Bayesian (VB) inference. These methods tackle the scalability challenge via a deterministic optimization approach to approximate the posterior distribution and yield accurate estimates at a fraction of the computational cost associated with simulation-based MCMC methods. We exploit and extend recent developments in variational Bayesian inference and highlight how two VB estimation approaches – mean-field VB (which is analogous to Gibbs sampling) for conjugate models and fixed-form VB (which is analogous to Metropolis-Hastings) for non-conjugate models – can be effectively combined for estimating complex marketing models. We also show how recent advances in parallel computing and in stochastic optimization can be used to further enhance the speed of these VB methods. Using simulated as well as real data sets, we apply the VB approaches to several commonly used marketing models (e.g., mixed linear, logit, selection, and hierarchical ordinal logit models), and demonstrate that VB inference is widely applicable to marketing problems.

Keywords: Big Data, Variational Inference, Mean-field Variational Bayes, Fixed-form Variational Bayes, Parallelization, Stochastic Optimization, Adaptive Mini-batch, MovieLens Database.

1 Introduction

Recent advances in information technology, coupled with the rapid decrease in data storage costs, have resulted in explosive growth in the collection and availability of customer-level data. Consumer interactions governed by the Internet, e-commerce, social media, geographical positioning systems and information-sensing mobile devices leave a massive trail of data points, collected at a pace faster than ever before (Boyd and Crawford 2012). As the volume of business data worldwide, across all companies, doubles every 1.2 years (eMarketer 2013), data set sizes in empirical marketing research have also increased significantly over the years. For example, within the past decade or so, marketing papers using scanner data have seen data set sizes increase from the thousands (e.g., Villas-Boas and Winer 1999) to hundreds of thousands of observations (e.g., Gordon, Goldfarb and Li 2013).

These novel phenomena, collectively known as "Big Data", provide scholars with exciting opportunities to understand, explain, and predict the behavior of consumers in more detailed ways than before. Effective analyses of such large customer databases can yield insights on individual preferences and can help firms increase profits via better targeting, as in the cases of Harrah's Entertainment, Capital One, and Netflix (Davenport and Harris 2007). However, this data deluge also challenges firms' ability to process data in an effective and managerially timely manner.
Big Data is characterized by high volume, high velocity, and high variety that "require new forms of processing to enable enhanced decision making, insight discovery and process optimization" (Gartner 2012). In this research, of the "3Vs of Big Data", we tackle the issues of high data volume and velocity by introducing to the marketing literature a new and rapidly evolving Bayesian estimation framework, known as variational Bayesian (VB) inference. This framework is deterministic, scales well in the presence of a large number of observations, and allows marketers to expeditiously estimate individual-level response parameters for marketing actions when other statistical methods would take too long to yield results.

Understanding heterogeneity in consumer preferences is of paramount importance in many marketing activities. Strategically, an understanding of the distribution of consumers' responses to product attributes can guide product design decisions – an insight that would be lost if preferences are examined only at the mean (Allenby and Rossi 1999). Tactically, modeling individual-level responses to marketing actions allows firms to adjust the allocation of resources across regions, stores, and consumers (Rossi et al. 1996). In the past 20 years, hierarchical Bayesian methods have become popular in marketing because of their demonstrated utility in yielding highly accurate individual-level estimates by appropriately pooling information across consumers. On the estimation front, advances in Markov chain Monte Carlo (MCMC) methods have fueled this Bayesian revolution in marketing (Rossi, Allenby and McCulloch 2005).

The marketing literature has used a variety of MCMC methods for estimating hierarchical Bayesian models. For conjugate models, Gibbs sampling (Gelfand and Smith 1990; Gelfand et al. 1990) is the method of choice. In contrast, for non-conjugate models, the full conditional distributions do not have closed forms, and Metropolis-Hastings methods (Chib and Greenberg 1995) and their extensions, such as Hamiltonian MCMC and slice sampling (Neal 2003; Neal 2011), have been used across a wide spectrum of applications. Streams of research that leverage individual-level information have flourished, addressing a wide array of topics such as couponing (Rossi et al. 1996), recommendation systems (Ansari, Essegaier and Kohli 2000), digital marketing campaigns (Ansari and Mela 2003), B2B communication contacts (Venkatesan and Kumar 2004), pharmaceutical detailing and sampling (Dong et al. 2009; Montoya et al. 2010), targeted promotions (Zhang and Krishnamurthi 2004), price endogeneity (Li and Ansari 2014), and dynamic targeted pricing (Zhang et al. 2014).

Despite their unquestionable usefulness and prevalence in marketing, MCMC methods are hampered by the bottlenecks of speed and scalability, and are, therefore, difficult to apply in large data situations where data sets can often contain hundreds of thousands or even millions of individuals. The situation is further exacerbated by models with complex posteriors, such as those involving non-conjugacy, multiple decisions, and dynamics. It is ironic that while large data sets contain enough information content to support complex models without the risk of overfitting, estimating these models becomes intractable using sampling-based methods. MCMC methods suffer from poor mixing, especially in hierarchical models involving a large number of latent variables, and therefore require a large number of parameter draws for convergence. Mixing also becomes difficult in large data situations because of a concentrated posterior distribution. It is therefore not uncommon in such situations for an MCMC simulation to run for days or even weeks to ensure convergence and to obtain a reasonable number of effectively uncorrelated samples.
In short, in the face of big data, traditional MCMC methods do not scale well and converge too slowly to be useful for making timely managerial decisions, including recalibrating price sensitivities for customers, making product recommendations, or customizing advertising by firms such as Netflix, Amazon, Google, and those involved in mobile targeting.

To address the scalability issue, it is common to prune the data and only work with a subset of individuals or a subset of observations. This approach, however, discards information that is valuable for estimation and for targeting every individual, and might lead to poor estimates (Zanutto and Bradlow 2006). Other methods of data compression can also be used, for instance, discarding old data, the judicious choice of sufficient statistics, data aggregation (Musalem, Bradlow and Raju 2009) and data squashing (Madigan et al. 2002).

In this paper, we show that variational Bayesian approaches can be used as an efficient and scalable alternative to MCMC methods in such "Big Data" situations. In contrast to MCMC methods that simulate from the posterior, variational Bayesian methods use an optimization approach to construct an approximation to the posterior. They offer the benefit of significant speedups when compared to MCMC methods and can recover parameter estimates accurately. We derive variational algorithms for the conjugate (linear) and non-conjugate (logit and ordinal logit) hierarchical models commonly used in marketing, and illustrate how different variational methods that are counterparts to Gibbs and Metropolis-Hastings sampling methods can be used. We also show how recent advances in parallel computing and in stochastic approximation can be used to further enhance the speed of these VB methods. We apply the methods to simulated and real data sets of different sizes and demonstrate that VB methods can not only achieve the same level of accuracy as MCMC, but do so at speeds that can be up to thousands of times faster. We also make available a suite of computer programs that marketing academics and practitioners can readily use to conduct VB inference on different models.

The rest of the paper proceeds as follows. We begin in Section 2 by explaining the basic concept behind variational Bayesian inference. Then in Section 3, we introduce mean-field VB for conjugate models and fixed-form VB for non-conjugate models, and illustrate algorithms for linear-mixed, selection, and logit models. Section 4 shows how mean-field and fixed-form VB can be combined into a "hybrid" VB to efficiently estimate models with both conjugate and non-conjugate components, and offers an algorithm for one such case – a hierarchical ordinal logit model. In Section 5, we discuss how VB can be further enhanced via parallelization and stochastic optimization with adaptive mini-batch sizes. Sections 6 and 7 contain applications of these VB methods and extensions to the hierarchical ordinal logit model. Lastly, in Section 8, we summarize the benefits of VB in marketing and highlight potential avenues for future research. All other details are located in the Appendix. The code and data sets are available from the authors upon request.

2 Variational Bayesian Inference

Bayesian inference is based on summarizing the posterior distribution of all unknowns. In a generic Bayesian model with observed data y and unknown parameter vector θ, the posterior distribution p(θ|y) follows the Bayes rule,

\[
p(\theta \mid y) \;=\; \frac{p(y,\theta)}{p(y)} \;=\; \frac{p(y \mid \theta)\,p(\theta)}{p(y)}, \qquad (1)
\]

where p(θ) is the prior distribution, p(y|θ) is the likelihood, and p(y) is the normalizing constant or evidence. For almost all interesting marketing models, the normalizing constant cannot be computed in closed form. The posterior distribution, therefore, is not available analytically. This necessitates the use of approximations to summarize the posterior distribution. Simulation methods based on Markov chain Monte Carlo (MCMC) are the methods of choice in this regard, given their excellent approximation capabilities. However, MCMC methods can be very time consuming, especially in Big Data situations.
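To fix ideas about the role of the evidence p(y) in (1), the following sketch checks the Bayes rule numerically on a toy beta-binomial model, where the one-dimensional integral defining p(y) is easy to compute. The example, its hyperparameter values, and the variable names are purely illustrative and are not part of the models studied in this paper.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Toy conjugate model: theta ~ Beta(2, 2), y = k successes in n Bernoulli trials
a0, b0, n, k = 2.0, 2.0, 20, 14
prior = lambda t: stats.beta.pdf(t, a0, b0)
likelihood = lambda t: stats.binom.pmf(k, n, t)

# Evidence p(y) = integral of p(y | theta) p(theta) over theta
evidence, _ = quad(lambda t: likelihood(t) * prior(t), 0.0, 1.0)

# Posterior density at a test point via (1), against the known Beta(a0+k, b0+n-k) answer
t0 = 0.6
via_bayes_rule = likelihood(t0) * prior(t0) / evidence
via_conjugacy = stats.beta.pdf(t0, a0 + k, b0 + n - k)
print(np.allclose(via_bayes_rule, via_conjugacy))  # True
```

In this conjugate toy case the quadrature and the closed-form posterior agree. The difficulty that motivates both MCMC and variational methods is that, once θ contains many individual-level parameters, the integral defining p(y) can no longer be evaluated this way.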
In contrast to MCMC methods that approximate the posterior by simulating random draws, variational inference seeks to approximate the intractable posterior, p(θ|y), with a simpler distribution, q(θ|η), called the variational distribution (Bishop 2006; Ormerod and Wand 2010). The variational distribution belongs to a family of distributions that is indexed by a set of free parameters η. In variational Bayesian inference, one searches over the space of the free parameters to find the member of the variational family that is closest to the posterior of interest. In short, variational methods recast the model inference problem as an optimization problem, thus making it possible to obtain speed and scalability.

VB methods rely on the Kullback-Leibler (KL) divergence (Kullback and Leibler 1951) to measure the dissimilarity (distance) between two probability distributions.¹ The KL divergence between the approximating distribution q(θ) and the posterior p(θ|y) is defined as

\[
\mathrm{KL}[q(\theta)\,\|\,p(\theta \mid y)] = \int q(\theta)\log\frac{q(\theta)}{p(\theta \mid y)}\,d\theta \;\geq\; 0, \qquad (2)
\]

where the equality holds if and only if q(θ) = p(θ|y) almost everywhere. We can manipulate the KL divergence such that

\[
\begin{aligned}
\mathrm{KL}[q(\theta)\,\|\,p(\theta \mid y)] &= \mathbb{E}_q[\log q(\theta)] - \mathbb{E}_q[\log p(\theta \mid y)] \\
&= \mathbb{E}_q[\log q(\theta)] - \mathbb{E}_q[\log p(y,\theta)] + \log p(y). \qquad (3)
\end{aligned}
\]

In the above, the expectation E_q[·] is with respect to the variational distribution q(θ), and the last term, log p(y), is a constant. Minimizing the KL divergence is thus equivalent to maximizing the scalar quantity,

\[
\mathcal{L}(q) = \mathbb{E}_q[\log p(y,\theta)] - \mathbb{E}_q[\log q(\theta)], \qquad (4)
\]

which is usually referred to as the evidence lower bound (ELBO; Bishop 2006; Ormerod and Wand 2010).

Our goal here is to find an approximating variational distribution q(θ) that makes the KL divergence as close to zero as possible. However, because the posterior is unknown to begin with, we need to impose restrictions on the approximating variational distribution for inference to proceed. These restrictions serve to structure the approximating distribution such that its functional form can be either inferred or set. The machine learning literature has explored a number of different approaches for structuring these approximating distributions. In the current paper, we focus on two approaches that can be effectively applied to marketing models – mean-field approximations for conjugate models and fixed-form approximations for non-conjugate ones. These two approaches can also be used in tandem to yield a hybrid approximation for models that involve both conjugate and non-conjugate elements.

¹Note that the Euclidean distance between distributional parameters is often a poor measure of closeness between probability distributions (Hoffman et al. 2013). For instance, the two normal distributions N(0, 100²) and N(10, 100²) are almost indistinguishable, yet the Euclidean distance between their parameter vectors is 10. In contrast, the distributions N(0, 0.1²) and N(0.1, 0.1²) barely overlap, but this is not reflected in the Euclidean distance, which is only 0.1.
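The contrast described in the footnote can be verified in a few lines, using the standard closed-form expression for the KL divergence between univariate normal distributions. The sketch below is only an illustration of that formula; the function name is ours.

```python
import math

def kl_normal(mu1, s1, mu2, s2):
    """KL[N(mu1, s1^2) || N(mu2, s2^2)], in closed form."""
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# The two pairs of distributions from the footnote
print(kl_normal(0.0, 100.0, 10.0, 100.0))  # ~0.005: nearly indistinguishable
print(kl_normal(0.0, 0.1, 0.1, 0.1))       # 0.5: barely overlapping
```

The Euclidean distances between the parameter vectors (10 versus 0.1) and the KL divergences (roughly 0.005 versus 0.5) rank the two pairs in opposite order, which is precisely why variational methods optimize the latter.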
3 Mean-field and Fixed-form Methods

3.1 Mean-field Variational Bayes (MFVB)

The mean-field² approximation can be considered the deterministic counterpart to Gibbs sampling, and as in Gibbs sampling, it is applicable to conjugate or semi-conjugate models (Ormerod and Wand 2010; Grimmer 2010). In mean-field inference, the variational distribution q(θ) is restricted to a factorized product form, \prod_{i=1}^{D} q_i(\theta_i), over some partition {θ_1, ..., θ_D} of θ. Underlying this restriction is the assumption of independence across the different parameter blocks. Such an approximation is nonparametric in spirit: no parametric assumptions regarding the functional form of q_i(θ_i) are used. Rather, the optimal functional form, as well as the optimal parameters of the variational distribution, can be inferred given only the likelihood and the prior. Under the above assumption, setting ∂L(q)/∂q_i = 0 leads to the following optimal solution to the minimization of the Kullback-Leibler divergence (Murphy 2012),

\[
q_i^*(\theta_i) \;\propto\; \exp\big\{\mathbb{E}_{\theta_{-i}}[\log p(\theta_i \mid y, \theta_{-i})]\big\}, \qquad (5)
\]

where E_{θ_{-i}}[·] refers to the expectation with respect to the variational distribution of the other parameter blocks, θ_{-i}, and p(θ_i | y, θ_{-i}) is the posterior full conditional distribution for θ_i. When a conjugate prior is used, the optimal variational component, q_i^*(θ_i), is available in closed form and belongs to the same distributional family as the posterior full conditional. Also, because the posterior full conditional exists analytically, the functional form of the variational distribution is readily available.

Denote q_i(θ_i) = q_i(θ_i | η_i), with η_i the parameter of the ith variational distribution. Then finding the optimal variational density only requires optimization to obtain the variational parameters {η_i} for all i. As we see in the next section, this can be done using simple coordinate ascent optimization in which the different variational parameters are updated sequentially in an iterative and deterministic fashion, until convergence is achieved.

²The name "mean-field" originated in physics.

3.1.1 A Linear Regression Model

We now illustrate the mean-field approximation and its use in variational Bayesian inference in the most familiar of statistical settings – a linear regression model. Consider

\[
y_i = x_i'\beta + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, N, \qquad (6)
\]

with semi-conjugate priors, β ∼ N(μ_β, Σ_β) and σ² ∼ IG(a, b).

While each of the two priors is individually conjugate to the normal likelihood given the other parameter, the joint prior is not conjugate, and thus the resulting posterior distribution for the model is not tractable. However, one can use a mean-field approach and approximate the posterior using the factorization:

\[
q(\beta, \sigma^2) = q(\beta)\, q(\sigma^2). \qquad (7)
\]

According to (5), the optimal densities take the form

\[
q^*(\beta) \propto \exp\big\{\mathbb{E}_{q(\sigma^2)}[\log p(\beta \mid \sigma^2, X, Y)]\big\}, \qquad
q^*(\sigma^2) \propto \exp\big\{\mathbb{E}_{q(\beta)}[\log p(\sigma^2 \mid \beta, X, Y)]\big\}. \qquad (8)
\]

Given the semi-conjugate setup, the full conditional distributions, p(β | σ², X, Y) and p(σ² | β, X, Y), are fully known. Plugging in these full conditional distributions, and after some algebraic manipulation, one can deduce that q*(β) is a multivariate normal density N(μ_{q(β)}, Σ_{q(β)}), where

\[
\Sigma_{q(\beta)}^{-1} = \Sigma_\beta^{-1} + \mathbb{E}_{q(\sigma^2)}[\sigma^{-2}]\, X'X, \qquad
\mu_{q(\beta)} = \Sigma_{q(\beta)}\big(\Sigma_\beta^{-1}\mu_\beta + \mathbb{E}_{q(\sigma^2)}[\sigma^{-2}]\, X'Y\big), \qquad (9)
\]

and q*(σ²) is an inverse gamma density IG(a_{q(σ²)}, b_{q(σ²)}) with

\[
a_{q(\sigma^2)} = a + N/2, \qquad
b_{q(\sigma^2)} = b + \frac{1}{2}\sum_{i=1}^{N}\Big[\big(y_i - x_i'\mathbb{E}_{q(\beta)}[\beta]\big)^2 + \mathrm{Tr}\big(x_i x_i'\,\mathrm{Var}_{q(\beta)}[\beta]\big)\Big], \qquad (10)
\]

where Tr(·) denotes the trace of a matrix.

It is easy to discern from the above sets of equations that the variational parameters are interdependent. For instance, in (9), the precision of the variational approximation for β, Σ_{q(β)}^{-1}, depends on the mean of the variational distribution for σ^{-2}, E_{q(σ²)}[σ^{-2}]. We can therefore sequentially update the variational parameter values in (9) and (10) in an iterative fashion. This is similar to Gibbs sampling, where we iteratively draw from the full conditionals, except that in variational Bayes we update the variational parameters directly, instead of sampling the model parameters.

Algorithm 1. MFVB for the Linear Regression Model
1. Initialize Σ_{q(β)}, μ_{q(β)} and b_{q(σ²)}.
2. Iteratively update
   Σ_{q(β)}^{-1} ← Σ_β^{-1} + (a_{q(σ²)}/b_{q(σ²)}) X'X,
   μ_{q(β)} ← Σ_{q(β)} (Σ_β^{-1} μ_β + (a_{q(σ²)}/b_{q(σ²)}) X'Y),
   b_{q(σ²)} ← b + ½ Σ_{i=1}^{N} [(y_i − x_i'μ_{q(β)})² + Tr(x_i x_i' Σ_{q(β)})],
until the increase in the ELBO is negligible.

Algorithm 1 describes the resulting coordinate ascent schedule for obtaining the optimal values of the variational parameters for the linear regression model. In the algorithm, the ELBO is derived from (4) and given by

\[
\mathcal{L}(q) = -(\mu_{q(\beta)} - \mu_\beta)'\Sigma_\beta^{-1}(\mu_{q(\beta)} - \mu_\beta) - \mathrm{Tr}\big[\Sigma_\beta^{-1}\Sigma_{q(\beta)}\big] + \log\big|\Sigma_{q(\beta)}\big| - 2\,a_{q(\sigma^2)}\log b_{q(\sigma^2)}.
\]
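To make the schedule concrete, the following is a minimal NumPy sketch of Algorithm 1. The function signature, iteration cap, and convergence tolerance are our own illustrative choices rather than anything prescribed above; X is an N × p design matrix and y a length-N response vector.

```python
import numpy as np

def mfvb_linear_regression(X, y, mu_beta, Sigma_beta, a, b,
                           max_iter=500, tol=1e-8):
    """Mean-field VB for the linear regression model (Algorithm 1).

    Priors: beta ~ N(mu_beta, Sigma_beta), sigma^2 ~ IG(a, b).
    Returns the parameters of q(beta) = N(mu_q, Sigma_q) and
    q(sigma^2) = IG(a_q, b_q).
    """
    N = X.shape[0]
    XtX, Xty = X.T @ X, X.T @ y
    Sigma_beta_inv = np.linalg.inv(Sigma_beta)
    a_q = a + N / 2.0      # fixed across iterations, per (10)
    b_q = b                # initial value
    elbo_old = -np.inf
    for _ in range(max_iter):
        e_prec = a_q / b_q  # E_q[sigma^{-2}] under IG(a_q, b_q)
        # Update q(beta), per (9)
        Sigma_q = np.linalg.inv(Sigma_beta_inv + e_prec * XtX)
        mu_q = Sigma_q @ (Sigma_beta_inv @ mu_beta + e_prec * Xty)
        # Update q(sigma^2), per (10); note sum_i Tr[x_i x_i' S] = Tr[X'X S]
        resid = y - X @ mu_q
        b_q = b + 0.5 * (resid @ resid + np.trace(XtX @ Sigma_q))
        # ELBO as given in the text, used only to monitor convergence
        elbo = (-(mu_q - mu_beta) @ Sigma_beta_inv @ (mu_q - mu_beta)
                - np.trace(Sigma_beta_inv @ Sigma_q)
                + np.linalg.slogdet(Sigma_q)[1]
                - 2.0 * a_q * np.log(b_q))
        if abs(elbo - elbo_old) < tol:
            break
        elbo_old = elbo
    return mu_q, Sigma_q, a_q, b_q
```

Because each update is available in closed form, one pass over the two update blocks costs a single p × p matrix inversion and a handful of matrix-vector products, which is what makes the coordinate ascent both deterministic and fast.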
The preceding discussion serves as a tutorial for the basics of applying the mean-field approximation in a simple setting. However, this model is too simple to fully illustrate the effectiveness of variational Bayesian methods. We now show the use of VB in more complicated hierarchical settings. In the next section, we discuss how to use mean-field methods for a cross-nested mixed linear model, which has wide applicability in marketing research.

3.1.2 A Cross-Nested Mixed Linear Model

Marketing research environments are replete with panel data that require careful modeling of multiple sources of unobserved heterogeneity (Allenby and Rossi 1999). In many settings, data are available on multiple consumers and on many different products. For instance, data from recommender systems include ratings from different users on many different items. A proper accounting of the variation in such data sets requires the use of random effects for products as well as for customers (Ansari, Essegaier and Kohli 2000), resulting in a cross-nested structure.

Here we consider a linear model with cross-nested random coefficients (Rasbash and Browne 2008) to account for both individual as well as product heterogeneity. Specifically, we simulate panel data sets from the following model,

\[
y_{ij} = x_{ij}'\beta + z_j'\lambda_i + w_i'\gamma_j + \epsilon_{ij}, \qquad
\epsilon_{ij} \sim N(0, \sigma^2), \quad \lambda_i \sim N(0, \Lambda), \quad \gamma_j \sim N(0, \Gamma), \qquad (11)
\]

where y_{ij} represents the response of person i to item j, i = 1, ..., I. Each person is assumed to respond to an idiosyncratic set of J_i items, yielding an unbalanced data set with a total of \sum_{i=1}^{I} J_i = N observations. Such a model arises, for instance, in recommendation systems where users rate different items (products). The vector x_{ij} includes covariates that characterize the persons and the items, z_j consists of item-specific covariates, and w_i contains person-specific covariates, such as demographics. The vectors λ_i and γ_j capture the person-specific and item-specific random coefficients, respectively.
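As an illustration of the kind of unbalanced panel that model (11) generates, the sketch below simulates such a data set. All dimensions, covariance values, and item-set sizes are arbitrary choices made for exposition; they are not the simulation design used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and parameter values (not the paper's design)
I, n_items = 1000, 200          # persons and items
p_x, p_z, p_w = 3, 2, 2         # dims of x_ij, z_j and w_i
beta = rng.normal(size=p_x)     # fixed effects
Lambda = 0.5 * np.eye(p_z)      # covariance of person effects lambda_i
Gamma = 0.3 * np.eye(p_w)       # covariance of item effects gamma_j
sigma = 1.0

z = rng.normal(size=(n_items, p_z))   # item-specific covariates z_j
w = rng.normal(size=(I, p_w))         # person-specific covariates w_i
lam = rng.multivariate_normal(np.zeros(p_z), Lambda, size=I)
gam = rng.multivariate_normal(np.zeros(p_w), Gamma, size=n_items)

rows = []
for i in range(I):
    J_i = rng.integers(5, 30)   # idiosyncratic number of items per person
    for j in rng.choice(n_items, size=J_i, replace=False):
        x_ij = rng.normal(size=p_x)
        y_ij = (x_ij @ beta + z[j] @ lam[i] + w[i] @ gam[j]
                + rng.normal(scale=sigma))
        rows.append((i, j, y_ij))
# rows now holds (person, item, response) triples, sum_i J_i = N in total
```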