Variational Bayesian Inference for Big Data Marketing Models¹

Asim Ansari    Yang Li    Jonathan Z. Zhang²

December 2014

¹This is a preliminary version. Please do not cite or circulate.

²Asim Ansari is the William T. Dillard Professor of Marketing at Columbia Business School, Yang Li is Assistant Professor of Marketing at Cheung Kong Graduate School of Business in Beijing, and Jonathan Z. Zhang is Assistant Professor of Marketing at the University of Washington in Seattle.
Abstract

Hierarchical Bayesian approaches play a central role in empirical marketing as they yield individual-level parameter estimates that can be used for targeting decisions. MCMC methods have been the methods of choice for estimating hierarchical Bayesian models as they are capable of providing accurate individual-level estimates. However, MCMC methods are computationally prohibitive and do not scale well when applied to the massive data sets that have become common in the current era of "Big Data". We introduce to the marketing literature a new class of Bayesian estimation techniques known as variational Bayesian (VB) inference. These methods tackle the scalability challenge via a deterministic optimization approach to approximate the posterior distribution and yield accurate estimates at a fraction of the computational cost associated with simulation-based MCMC methods. We exploit and extend recent developments in variational Bayesian inference and highlight how two VB estimation approaches – mean-field VB (analogous to Gibbs sampling) for conjugate models and fixed-form VB (analogous to Metropolis-Hastings) for non-conjugate models – can be effectively combined for estimating complex marketing models. We also show how recent advances in parallel computing and in stochastic optimization can be used to further enhance the speed of these VB methods. Using simulated as well as real data sets, we apply the VB approaches to several commonly used marketing models (e.g., mixed linear, logit, selection, and hierarchical ordinal logit models), and demonstrate how VB inference is widely applicable for marketing problems.

Keywords: Big Data, Variational Inference, Mean-field Variational Bayes, Fixed-form Variational Bayes, Parallelization, Stochastic Optimization, Adaptive Mini-batch, MovieLens Database.
1 Introduction
The recent advances in information technology, coupled with the rapid decrease in data storage cost, have resulted in the explosive growth in the collection and availability of customer-level data. Consumer interactions governed by the Internet, e-commerce, social media, geographical positioning systems and information-sensing mobile devices result in a massive trail of data points, collected at a pace faster than ever before (Boyd and Crawford 2012). As the volume of business data worldwide, across all companies, doubles every 1.2 years (eMarketer 2013), data set sizes in empirical marketing research have also increased significantly over the years. For example, within the past decade or so, in marketing papers using scanner data, we have seen data set sizes increase from the thousands (e.g., Villas-Boas and Winer 1999) to hundreds of thousands of observations (e.g., Gordon, Goldfarb and Li 2013).
These novel phenomena, collectively known as "Big Data", provide scholars with exciting opportunities to understand, explain, and predict the behavior of consumers in more detailed ways than before. Effective analyses of such large customer databases can yield insights on individual preferences and can help firms increase profit via better targeting, as in the cases of Harrah's Entertainment, Capital One, and Netflix (Davenport and Harris 2007). However, this data deluge also challenges firms' ability to process data in an effective and managerially timely manner.
Big Data is characterized by high volume, high velocity, and high variety that "require new forms of processing to enable enhanced decision making, insight discovery and process optimization" (Gartner 2012). In this research, of the "3Vs of Big Data", we tackle the issues of high data volume and velocity by introducing to the marketing literature a new and rapidly evolving Bayesian estimation framework, known as variational Bayesian (VB) inference. This framework is deterministic, scales well in the presence of a large number of observations, and allows marketers to expeditiously estimate individual-level response parameters for marketing actions when other statistical methods would take too long to yield results.
Understanding heterogeneity in consumer preferences is of paramount importance in many marketing activities. Strategically, an understanding of the distribution of consumers' responses to product attributes would guide product design decisions – an insight that would be lost if preferences were examined only at the mean (Allenby and Rossi 1999). Tactically, modeling individual-level responses to marketing actions allows firms to adjust the allocation of resources across regions, stores, and consumers (Rossi et al. 1996). In the past 20 years, hierarchical Bayesian methods have become popular in marketing because of their demonstrated utility in yielding highly accurate individual-level estimates by appropriately pooling information across consumers. On the estimation front, advances in Markov chain Monte Carlo (MCMC) methods have fueled this Bayesian revolution in marketing (Rossi, Allenby and McCulloch 2005).
The marketing literature has used a variety of MCMC methods for estimating hierarchical Bayesian models. For conjugate models, Gibbs sampling (Gelfand and Smith 1990; Gelfand et al. 1990) is the method of choice. In contrast, for non-conjugate models, the full conditional distributions do not have closed form, and Metropolis-Hastings methods (Chib and Greenberg 1995) and their extensions, such as Hamiltonian Monte Carlo and slice sampling (Neal 2003; Neal 2011), have been used across a wide spectrum of applications. Streams of research that leverage individual-level information have flourished, addressing a wide array of topics such as couponing (Rossi et al. 1996), recommendation systems (Ansari, Essegaier and Kohli 2000), digital marketing campaigns (Ansari and Mela 2003), B2B communication contacts (Venkatesan and Kumar 2004), pharmaceutical detailing and sampling (Dong et al. 2009; Montoya et al. 2010), targeted promotions (Zhang and Krishnamurthi 2004), price endogeneity (Li and Ansari 2014), and dynamic targeted pricing (Zhang et al. 2014).
Despite their unquestionable usefulness and prevalence in marketing, MCMC methods are hampered by the bottlenecks of speed and scalability, and are therefore difficult to apply in large data situations where data sets can often contain hundreds of thousands or even millions of individuals. The situation is further exacerbated by models with complex posteriors, such as those involving non-conjugacy, multiple decisions, and dynamics. It is ironic that while large data sets contain enough information content to support complex models without the risk of overfitting, estimating these models becomes intractable using sampling-based methods. MCMC methods suffer from poor mixing, especially in hierarchical models involving a large number of latent variables, and therefore require a large number of parameter draws for convergence. Mixing also becomes difficult in large data situations because of a concentrated posterior distribution. It is therefore not uncommon in such situations for an MCMC simulation to run for days or even weeks to ensure convergence and to obtain a reasonable number of effectively uncorrelated samples. In short, in the face of big data, traditional MCMC methods do not scale well and converge too slowly to be useful for making timely managerial decisions, including recalibrating price sensitivities for customers, making product recommendations, or customizing advertising by firms such as Netflix, Amazon, Google, and those involved in mobile targeting.
To address the scalability issue, it is common to prune the data and only work with a subset of individuals or a subset of observations. This approach, however, discards information that is valuable for estimation and for targeting to every individual, and might lead to poor estimates (Zanutto and Bradlow 2006). Other methods of data compression can also be used, for instance, discarding old data, the judicious choice of sufficient statistics, data aggregation (Musalem, Bradlow and Raju 2009), and data squashing (Madigan et al. 2002).
In this paper, we show that variational Bayesian approaches can be used as an efficient and scalable alternative to MCMC methods in such "Big Data" situations. In contrast to MCMC methods, which simulate from the posterior, variational Bayesian methods use an optimization approach to construct an approximation to the posterior. They offer the benefit of significant speedups when compared to MCMC methods and can recover parameter estimates accurately. We derive variational algorithms for the conjugate (linear) and non-conjugate (logit and ordinal logit) hierarchical models commonly used in marketing, and illustrate how different variational methods that are counterparts to Gibbs and Metropolis-Hastings sampling methods can be used. We also show how recent advances in parallel computing and in stochastic approximation can be used to further enhance the speed of these VB methods. We apply the methods to simulated and real data sets of different sizes and demonstrate that VB methods can not only achieve the same level of accuracy as MCMC, but do so at speeds that can be up to thousands of times faster. We also make available a suite of computer programs that marketing academics and practitioners can readily use to conduct VB inference on different models.
The rest of the paper proceeds as follows. We begin in Section 2 by explaining the basic concept behind variational Bayesian inference. Then in Section 3, we introduce mean-field VB for conjugate models and fixed-form VB for non-conjugate models, and illustrate algorithms for linear mixed, selection, and logit models. Section 4 shows how mean-field and fixed-form methods can be combined into a "hybrid" VB to efficiently estimate models with both conjugate and non-conjugate components, and offers an algorithm for one such case – a hierarchical ordinal logit model. In Section 5, we discuss how VB can be further enhanced via parallelization and stochastic optimization with adaptive mini-batch sizes. Sections 6 and 7 contain applications of these VB methods and extensions to the hierarchical ordinal logit model. Lastly, in Section 8, we summarize the benefits of VB in marketing and highlight potential avenues for future research. All other details are located in the Appendix. The code and data sets are available from the authors upon request.
2 Variational Bayesian Inference
Bayesian inference is based on summarizing the posterior distribution of all unknowns. In a generic Bayesian model with observed data y and unknown parameter vector θ, the posterior distribution p(θ|y) follows from Bayes rule,

   p(θ|y) = p(y, θ)/p(y) = p(y|θ) p(θ)/p(y),   (1)

where p(θ) is the prior distribution, p(y|θ) is the likelihood, and p(y) is the normalizing constant or evidence. For almost all interesting marketing models, the normalizing constant cannot be computed in closed form. The posterior distribution, therefore, is not available analytically. This necessitates the use of approximations to summarize the posterior distribution. Simulation methods based on Markov chain Monte Carlo (MCMC) are the methods of choice in this regard, given their excellent approximation capabilities. However, MCMC methods can be very time consuming, especially in Big Data situations.
In contrast to MCMC methods that approximate the posterior by simulating random draws, variational inference seeks to approximate the intractable posterior, p(θ|y), with a simpler distribution, q(θ|η), called the variational distribution (Bishop 2006; Ormerod and Wand 2010). The variational distribution belongs to a family of distributions that is indexed by a set of free parameters η. In variational Bayesian inference, one searches over the space of the free parameters to find the member of the variational family that is closest to the posterior of interest. In short, variational methods recast the model inference problem as an optimization problem, thus making it possible to obtain speed and scalability.
VB methods rely on the Kullback-Leibler (KL) divergence (Kullback and Leibler 1951) to measure the dissimilarity (distance) between two probability distributions.¹ The KL divergence between the approximating distribution q(θ) and the posterior p(θ|y) is defined as

   KL[q(θ) ∥ p(θ|y)] = ∫ q(θ) log[q(θ)/p(θ|y)] dθ ≥ 0,   (2)

where the equality holds if and only if q(θ) = p(θ|y) almost everywhere.

¹Note that the Euclidean distance between distributional parameters is often a poor measure of closeness between probability distributions (Hoffman et al. 2013). For instance, the two normal distributions N(0, 100²) and N(10, 100²) are almost indistinguishable, yet the Euclidean distance between their parameter vectors is 10. In contrast, the distributions N(0, 0.1²) and N(0.1, 0.1²) barely overlap, but this is not reflected in their Euclidean distance of only 0.1.

We can manipulate the KL divergence such that
   KL[q(θ) ∥ p(θ|y)] = E_q[log q(θ)] − E_q[log p(θ|y)]   (3)
                     = E_q[log q(θ)] − E_q[log p(y, θ)] + log p(y).

In the above, the expectation E_q[·] is with respect to the variational distribution q(θ), and the last term, log p(y), is a constant. Minimizing the KL divergence is thus equivalent to maximizing the scalar quantity

   L(q) = E_q[log p(y, θ)] − E_q[log q(θ)],   (4)
which is usually referred to as the evidence lower bound (ELBO; Bishop 2006; Ormerod and Wand 2010).

Our goal here is to find an approximating variational distribution q(θ) that makes the KL divergence as close to zero as possible. However, because the posterior is unknown to begin with, we need to impose restrictions on the approximating variational distribution for inference to proceed. These restrictions serve to structure the approximating distribution such that its functional form can be either inferred or set. The machine learning literature has explored a number of different approaches for structuring these approximating distributions. In the current paper, we focus on two approaches that can be effectively applied to marketing models – mean-field approximations for conjugate models and fixed-form approximations for non-conjugate ones. These two approaches can also be used in tandem to yield a hybrid approximation for models that involve both conjugate and non-conjugate elements.
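Before turning to the specific approximation schemes, the point of footnote 1 is easy to verify numerically. The sketch below is our own illustration (function name and values are ours) using the well-known closed-form KL divergence between univariate normals; the KL values confirm that the first pair of distributions is nearly identical while the second pair is far apart, the reverse of what the Euclidean distances between their parameter vectors suggest.

```python
import numpy as np

def kl_normal(mu1, s1, mu2, s2):
    """Closed-form KL[N(mu1, s1^2) || N(mu2, s2^2)] for univariate normals."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Footnote 1's two pairs of normal distributions:
print(kl_normal(0, 100, 10, 100))   # 0.005 -- nearly identical, Euclidean distance 10
print(kl_normal(0, 0.1, 0.1, 0.1))  # 0.5   -- barely overlapping, Euclidean distance 0.1
```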
3 Mean-field and Fixed-form Methods
3.1 Mean-field Variational Bayes (MFVB)
The mean-field² approximation can be considered the deterministic counterpart to Gibbs sampling, and as in Gibbs sampling, it is applicable for conjugate or semi-conjugate models (Ormerod and Wand 2010; Grimmer 2010). In mean-field inference, the variational distribution q(θ) is restricted to a factorized product form, ∏_{i=1}^{D} q_i(θ_i), over some partition {θ_1, ..., θ_D} of θ. Underlying this restriction is the assumption of independence across the different parameter blocks. Such an approximation is nonparametric in spirit: no parametric assumptions regarding the functional form of q_i(θ_i) are made. Rather, the optimal functional form, as well as the optimal parameters of the variational distribution, can be inferred given only the likelihood and the prior. Under the above assumption, setting ∂L(q)/∂q_i = 0 leads to the following optimal solution to the minimization of the Kullback-Leibler divergence (Murphy 2012),

   q_i*(θ_i) ∝ exp{E_{θ_{−i}}[log p(θ_i | y, θ_{−i})]},   (5)
where E_{θ_{−i}}[·] refers to the expectation with respect to the variational distribution of the other parameter blocks, θ_{−i}, and p(θ_i | y, θ_{−i}) is the posterior full conditional distribution for θ_i. When a conjugate prior is used, the optimal variational component, q_i*(θ_i), is available in closed form and belongs to the same distributional family as the posterior full conditional. Also, because the posterior full conditional exists analytically, the functional form of the variational distribution is readily available.

²The name "mean-field" originated from physics.
Denote q_i(θ_i) = q_i(θ_i | η_i), where η_i is the parameter vector of the ith variational distribution. Then finding the optimal variational density only requires optimization to obtain the variational parameters {η_i} for all i. As we see in the next section, this can be done using simple coordinate ascent optimization, in which the different variational parameters are updated sequentially in an iterative and deterministic fashion until convergence is achieved, as sketched below.
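Schematically, this coordinate ascent scheme amounts to the following skeleton (a Python sketch under our own naming; the concrete updates for a specific model are derived in the next section). Each updates[i] solves (5) for one block given the current values of the others, and each full sweep can only increase the ELBO in (4).

```python
def cavi(eta, updates, elbo, tol=1e-8, max_iter=1000):
    """Generic coordinate ascent variational inference loop (a sketch).

    eta     : list of variational parameters, one entry per block theta_i
    updates : updates[i](eta) returns the optimal eta_i given the other
              blocks, i.e., the parameters of q_i* in equation (5)
    elbo    : evaluates the evidence lower bound in equation (4)
    """
    bound = elbo(eta)
    for _ in range(max_iter):
        for i in range(len(eta)):
            eta[i] = updates[i](eta)   # sequential, deterministic updates
        new_bound = elbo(eta)
        if new_bound - bound < tol:    # stop when the ELBO increase is negligible
            return eta
        bound = new_bound
    return eta
```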
3.1.1 A Linear Regression Model
We now illustrate the mean-field approximation and its use in variational Bayesian inference using the most familiar of statistical settings – a linear regression model. Consider

   y_i = x_i'β + ε_i,   ε_i ∼ N(0, σ²),   i = 1, ..., N,   (6)

with semi-conjugate priors,

   β ∼ N(µ_β, Σ_β)   and   σ² ∼ IG(a, b).

While each of the two priors is individually conjugate to the normal likelihood given the other parameter, the joint prior is not conjugate, and thus the resulting posterior distribution for the model is not tractable. However, one can use a mean-field approach and approximate the posterior using the factorization

   q(β, σ²) = q(β) q(σ²).   (7)
According to (5), the optimal densities take the form

   q*(β) ∝ exp{E_{q(σ²)}[log p(β | σ², X, Y)]},   (8)
   q*(σ²) ∝ exp{E_{q(β)}[log p(σ² | β, X, Y)]}.

Given the semi-conjugate setup, the full conditional distributions, p(β | σ², X, Y) and p(σ² | β, X, Y), are fully known. Plugging in these full conditional distributions, and after some algebraic manipulation, one can deduce that q*(β) is a multivariate normal density N(µ_{q(β)}, Σ_{q(β)}), where

   Σ_{q(β)}^{−1} = Σ_β^{−1} + E_{q(σ²)}[σ^{−2}] X'X,
   µ_{q(β)} = Σ_{q(β)} (Σ_β^{−1} µ_β + E_{q(σ²)}[σ^{−2}] X'Y),   (9)

and q*(σ²) is an inverse Gamma density IG(a_{q(σ²)}, b_{q(σ²)}) with

   a_{q(σ²)} = a + N/2,
   b_{q(σ²)} = b + (1/2) ∑_{i=1}^{N} [(y_i − x_i' E_{q(β)}[β])² + Tr(x_i x_i' Var_{q(β)}[β])],   (10)

where Tr[·] denotes the trace of a matrix.
It is easy to discern from the above sets of equations that the variational parameters are interdependent. For instance, in (9), the precision of the variational approximation for β, Σ_{q(β)}^{−1}, depends on the mean of the variational distribution for σ^{−2}, E_{q(σ²)}[σ^{−2}]. We can therefore sequentially update the variational parameter values in (9) and (10) in an iterative fashion. This is similar to Gibbs sampling, where we iteratively draw from the full conditionals, except that in variational Bayes, we update the variational parameters directly instead of sampling the model parameters.
Algorithm 1. MFVB for the Linear Regression Model

1. Initialize Σ_{q(β)}, µ_{q(β)} and b_{q(σ²)}.
2. Iteratively update

   Σ_{q(β)}^{−1} ← Σ_β^{−1} + (a_{q(σ²)}/b_{q(σ²)}) X'X
   µ_{q(β)} ← Σ_{q(β)} (Σ_β^{−1} µ_β + (a_{q(σ²)}/b_{q(σ²)}) X'Y)
   b_{q(σ²)} ← b + (1/2) ∑_{i=1}^{N} [(y_i − x_i' µ_{q(β)})² + Tr(x_i x_i' Σ_{q(β)})]

   until the increase in the ELBO is negligible.
Algorithm 1 describes the resulting coordinate ascent schedule for obtaining the optimal values of the variational parameters for the linear regression model. In the algorithm, the ELBO is derived from (4) and given by

   L(q) = −(µ_{q(β)} − µ_β)' Σ_β^{−1} (µ_{q(β)} − µ_β) − Tr[Σ_β^{−1} Σ_{q(β)}] + log|Σ_{q(β)}| − 2 a_{q(σ²)} log b_{q(σ²)}.
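To make the update schedule concrete, the following Python/numpy sketch carries out Algorithm 1 on simulated data. It is our own illustration, not the code released with the paper, and for brevity, convergence is monitored through the change in b_{q(σ²)} rather than through the full ELBO expression above.

```python
import numpy as np

def mfvb_linear_regression(X, y, mu_beta, Sigma_beta, a, b,
                           max_iter=500, tol=1e-10):
    """Coordinate ascent MFVB for the linear regression model (Algorithm 1).

    Priors: beta ~ N(mu_beta, Sigma_beta) and sigma^2 ~ IG(a, b). Returns the
    parameters of q(beta) = N(mu_q, Sigma_q) and q(sigma^2) = IG(a_q, b_q).
    """
    N = X.shape[0]
    Sigma_beta_inv = np.linalg.inv(Sigma_beta)
    # The updates touch the data only through these sufficient statistics
    XtX, Xty, yty = X.T @ X, X.T @ y, y @ y
    a_q = a + N / 2.0                 # fixed across iterations, see eq. (10)
    b_q = b + 1.0                     # any positive initialization works
    for _ in range(max_iter):
        b_q_old = b_q
        # Update q(beta) using E_q[sigma^-2] = a_q / b_q, eq. (9)
        Sigma_q = np.linalg.inv(Sigma_beta_inv + (a_q / b_q) * XtX)
        mu_q = Sigma_q @ (Sigma_beta_inv @ mu_beta + (a_q / b_q) * Xty)
        # Update q(sigma^2), eq. (10): the residual sum of squares expands to
        # y'y - 2 mu_q'X'Y + mu_q'X'X mu_q, and the trace term to Tr[X'X Sigma_q]
        sse = yty - 2.0 * (mu_q @ Xty) + mu_q @ XtX @ mu_q
        b_q = b + 0.5 * (sse + np.trace(XtX @ Sigma_q))
        if abs(b_q - b_q_old) < tol * abs(b_q_old):   # cheap convergence proxy
            break
    return mu_q, Sigma_q, a_q, b_q

# Usage on simulated data
rng = np.random.default_rng(0)
N, p = 1000, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.7, size=N)
mu_q, Sigma_q, a_q, b_q = mfvb_linear_regression(
    X, y, mu_beta=np.zeros(p), Sigma_beta=100 * np.eye(p), a=2.0, b=1.0)
print(mu_q)             # variational posterior mean of beta, close to beta_true
print(b_q / (a_q - 1))  # E_q[sigma^2], close to 0.7**2 = 0.49
```

Note that each iteration works off the precomputed statistics X'X, X'Y, and y'y, so its cost does not grow with N beyond the initial pass over the data – one reason this coordinate ascent scheme scales well.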
The preceding discussion serves as a tutorial for explaining the basics of applying the mean-field approximation in a simplified setting. However, this is too simple a model to fully illustrate the effectiveness of variational Bayesian methods. We now show the use of VB in more complicated hierarchical settings. In the next section, we discuss how to use mean-field methods for a cross-nested mixed linear model, which has wide applicability in marketing research.
3.1.2 A Cross-Nested Mixed Linear Model
Marketing research environments are replete with panel data, which require careful modeling of multiple sources of unobserved heterogeneity (Allenby and Rossi 1999). In many settings, data are available on multiple consumers and on many different products. For instance, data from recommender systems include ratings from different users on many different items. A proper accounting of the variation in such data sets requires the use of random effects for products as well as for customers (Ansari, Essegaier and Kohli 2000), resulting in a cross-nested structure.

Here we consider a linear model with cross-nested random coefficients (Rasbash and Browne 2008) to account for both individual as well as product heterogeneity. Specifically, we simulate panel data sets from the following model,

   y_ij = x_ij'β + z_j'λ_i + w_i'γ_j + ε_ij,
   ε_ij ∼ N(0, σ²),   λ_i ∼ N(0, Λ),   γ_j ∼ N(0, Γ),   (11)

where y_ij represents the response of person i to item j, i = 1, ..., I. Each person is assumed to respond to an idiosyncratic set of j ∈ J_i items. This yields an unbalanced data set with a total of ∑_{i=1}^{I} J_i = N observations. Such a model arises, for instance, in recommendation systems where users rate different items (products). The vector x_ij includes covariates that characterize the persons and the items, z_j consists of item-specific covariates, and w_i contains person-specific covariates, such as demographics. The vectors λ_i and γ_j are person-specific and item-specific random coefficients, respectively, capturing the two cross-nested sources of unobserved heterogeneity.
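To ground the notation, here is a minimal Python sketch (our own, with illustrative dimensions, parameter values, and covariate distributions) that simulates an unbalanced panel from (11), with each person rating an idiosyncratic subset of items.

```python
import numpy as np

rng = np.random.default_rng(1)
I, J = 200, 50            # numbers of persons and items (illustrative)
dx, dz, dw = 3, 2, 2      # dims of x_ij, z_j, w_i (illustrative)

beta = np.array([1.0, -0.5, 2.0])   # fixed effects
Lambda = 0.5 * np.eye(dz)           # Cov(lambda_i), person random effects
Gamma = 0.3 * np.eye(dw)            # Cov(gamma_j), item random effects
sigma = 0.8

lam = rng.multivariate_normal(np.zeros(dz), Lambda, size=I)  # lambda_i
gam = rng.multivariate_normal(np.zeros(dw), Gamma, size=J)   # gamma_j
Z = rng.normal(size=(J, dz))        # item-specific covariates z_j
W = rng.normal(size=(I, dw))        # person-specific covariates w_i

rows = []                           # (i, j, x_ij, y_ij) tuples
for i in range(I):
    # each person rates an idiosyncratic subset of items, giving an
    # unbalanced panel with N = sum_i J_i observations in total
    items = rng.choice(J, size=rng.integers(5, 15), replace=False)
    for j in items:
        x_ij = rng.normal(size=dx)
        y_ij = (x_ij @ beta + Z[j] @ lam[i] + W[i] @ gam[j]
                + rng.normal(scale=sigma))
        rows.append((i, j, x_ij, y_ij))
print(len(rows))   # N, the total number of observations
```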