A Collaborative Kalman Filter for Time-Evolving Dyadic Processes

San Gultekin and John Paisley
Department of Electrical Engineering, Columbia University
Email: {sg3108, jpaisley}@columbia.edu

Abstract—We present the collaborative Kalman filter (CKF), a dynamic model for collaborative filtering and related factorization models. Using the matrix factorization approach to collaborative filtering, the CKF accounts for time evolution by modeling each low-dimensional latent embedding as a multidimensional Brownian motion. Each observation is a random variable whose distribution is parameterized by the dot product of the relevant Brownian motions at that moment in time. This is naturally interpreted as a Kalman filter with multiple interacting state space vectors. We also present a method for learning a dynamically evolving drift parameter for each location by modeling it as a geometric Brownian motion. We handle posterior intractability via a mean-field variational approximation, which also preserves tractability for downstream calculations in a manner similar to the Kalman filter. We evaluate the model on several large datasets, providing quantitative evaluation on the 10 million MovieLens and 100 million Netflix datasets and qualitative evaluation on a set of 39 million stock returns divided across roughly 6,500 companies from the years 1962–2014.

I. INTRODUCTION

Collaborative filtering is a general and effective method for making pairwise predictions based on historical data. It is most frequently associated with recommendation systems, where the model learns user preferences by considering all previous user/object interactions. User rating systems are widespread online. For example, Netflix lets users rate movies on a five star scale, Amazon allows users to rate products, and YouTube allows users to "like" or "dislike" videos on its website. In cases such as these, a major objective is to recommend content to users in an intelligent way. Collaborative filtering addresses this problem by allowing the preferences of other users to inform the recommendations made for a user of interest [13].

Though typically used for predicting user responses, the main idea behind collaborative filtering can be broadened to include other situations where the outcome of an event is predicted using information about the outcomes of all other events in a dyadic setting. For example, in sporting events one might want to predict whether team A will beat team B by using the results of all other games up to that point. Or one might wish to estimate the probability that batter X will get a hit against pitcher Y at a specific moment in time using all previous at-bats for every batter/pitcher combination. In other problem domains, one might wish to predict stock prices or exchange rates in a way that takes into consideration the patterns of all other measurements up until that point.

An effective approach to collaborative filtering is via matrix factorization [7], [11]. These methods embed users and objects into a low-dimensional latent space and make predictions based upon the relevant dot products. Such methods have proved very powerful since they are related to the low-rank matrix completion problem, wherein it is shown that a sparsely sampled low-rank matrix can be exactly recovered in ideal cases, or very closely otherwise [2]. In the Bayesian setting, treating missing data is efficiently handled by the joint likelihood function, which allows for simple closed form updates to the latent embeddings [12].

One common assumption in matrix factorization approaches is that the model is "static," meaning that every user or object has a latent location that is fixed in time and the model learns this single point using all data in a "batch" setting. Therefore, these models assume, e.g., that a user's taste in movies is unchanging, or that a team's sequence of wins and losses is a stationary process. Though the performance of such models can be very good, this assumption is clearly untrue, and models that account for dynamic information are more appropriate in principle.

In this paper we present a dynamic collaborative filtering model for temporally evolving data, with applications including recommendation systems, currency exchange rate tracking, student/question tutoring systems and athletic competition prediction, among others. The model is inspired by the matrix factorization approach to collaborative filtering, where each object has a location in a latent space. However, rather than fixing these locations, we allow them to drift according to a multidimensional Brownian motion [3]. Since we are motivated by real-time prediction, we only process each observation in the data stream once; therefore our approach is ideally suited to large data sets. For inference we introduce a variational approximation [16], since the posterior distribution of the model variables given an event is not in closed form; this allows for closed form updates and preserves tractability for future calculations. The model thus learns a time-evolving probability distribution on latent user/object locations. From the perspective of a single location, the proposed model can be interpreted as a state-space model with dynamics similar to a Kalman filter [17], and so we call our approach the collaborative Kalman filter (CKF).

In addition, for problems such as stock modeling, the drift parameter of the latent Brownian motions cannot be considered a fixed value because of the changing volatility of the market. We therefore further develop the CKF by modeling the drift parameter of each latent location with a geometric Brownian motion. This approach requires approximations for tractable inference that nevertheless provide good and meaningful results, as shown on a large stock dataset.

We organize the paper as follows: In Section II we review the basic matrix factorization approach to collaborative filtering and the Kalman filter for time-evolving linear modeling. In Section III we present the collaborative Kalman filtering model. We derive a variational inference algorithm for this model in Section IV and show experimental results in Section V.
II. BACKGROUND

Before presenting the collaborative Kalman filter (CKF), we review the basic matrix factorization approach to collaborative filtering and the Kalman filter for dynamic state space modeling. These two approaches form the basis for our model, which we present in Section III.

A. Collaborative filtering with matrix factorization

Generally speaking, the matrix factorization approach to collaborative filtering models an incomplete M×N matrix Z as a function of the product of two matrices, a K×M matrix U and a K×N matrix W [7]. The columns of U (denoted u_i) represent the locations of each user in a latent space and the columns of W (denoted w_j) represent the locations of each object in the same latent space. In the probabilistic approach to collaborative filtering using matrix factorization, the observed values in Z have a distribution that depends on the dot products of U and W, z_ij ∼ p(u_i^T w_j). For example, when Z is binary, p(·) could be the logistic or probit link function, when m-ary it could be the ordered probit model, and when real-valued it could be a univariate Gaussian. After learning U and W based on the observed data, predictions of z_ij for the unseen pair (i,j) can be made using the distribution function p(u_i^T w_j).

To see how matrix factorization can be interpreted as a collaborative filtering technique, consider the case where z_ij ∼ N(u_i^T w_j, σ²). Using a maximum likelihood inference approach for U and W, the maximum likelihood update of u_i is a least squares vector using the relevant w_j and z_ij. If we let the set Ω_{u_i} contain the index values of objects that user i has rated, this update is

  u_i = ( Σ_{j∈Ω_{u_i}} w_j w_j^T )^{-1} ( Σ_{j∈Ω_{u_i}} z_ij w_j ).   (1)

Therefore, the design matrix for u_i uses the locations of all objects w_j rated by user i. The update for each w_j parallels this update for u_i using the relevant user locations, and so the locations of all other users indirectly influence the update for user i in this way.
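To make the alternating structure of Equation (1) concrete, the sketch below implements one round of the least squares update for a single user. This is not code from the paper; the function name and the small ridge term added for numerical stability are our own assumptions.

```python
import numpy as np

def update_user(W, z, rated_idx, ridge=1e-8):
    """Least squares update of one user vector u_i, Equation (1).

    W         : K x N matrix of object locations (columns w_j)
    z         : length-N vector of this user's observed ratings
    rated_idx : indices j in Omega_{u_i}, the objects this user rated
    """
    W_rated = W[:, rated_idx]                  # K x |Omega|
    A = W_rated @ W_rated.T                    # sum_j w_j w_j^T
    b = W_rated @ z[rated_idx]                 # sum_j z_ij w_j
    # ridge keeps A invertible when a user has rated few objects (our addition)
    return np.linalg.solve(A + ridge * np.eye(A.shape[0]), b)

# The object update mirrors this with the roles of U and W exchanged, so
# alternating the two updates gives the usual batch ALS-style scheme.
```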
A large number of collaborative filtering methods can be viewed as variations and developments of this basic model. A potential drawback of such approaches is that they do not use temporal information. A goal in Section III will be to show how this model can be easily extended to model the temporal evolution of u_i and w_j using Brownian motion. Since the resulting model has dynamics similar to the Kalman filter, we briefly review this method next.

B. The Kalman filter

As mentioned, a drawback of many collaborative filtering techniques, such as those that involve updates similar to Equation (1), is that no temporal dynamics are being modeled. When viewing the locations of a user u_i or object w_j as being in a state space, the Kalman filter naturally arises as the first place to look for developing a dynamic representation for matrix factorization models. In this section we provide a brief review of the Kalman filter [17], focusing on the equations and using parameterizations that are relevant to our CKF model.

The Kalman filter models a sequence of observed vectors y_n ∈ R^p as linear functions of a sequence of latent state vectors w_n ∈ R^d with additive noise. These state vectors evolve according to a first-order Markov process, where the current state equals the previous state plus additive noise. Assuming a Gaussian prior distribution on w_0, then for n = 1,...,N and zero-mean noise, this is written as follows,

  w_{n+1} | w_n ∼ N(w_n, αI),   (2)
  y_{n+1} | w_{n+1} ∼ N(A w_{n+1}, σ²I).   (3)

Inference for the state evolution of w proceeds according to the forward algorithm, which includes two steps: (i) marginalizing the previous state to obtain a prior distribution on the current state, and (ii) calculating the posterior of the current state given an observation.¹

¹ Because we are only interested in real-time prediction in this paper, we do not discuss the backward algorithm used for Kalman smoothing.

Specifically, let the distribution of the current state be w_n ∼ N(µ_n, Σ_n). The distribution of the next state w_{n+1} is calculated by marginalizing the current state,

  p(w_{n+1}) = ∫_{R^d} p(w_{n+1}|w_n) p(w_n) dw_n = N(w_{n+1} | µ_n, αI + Σ_n).   (4)

The drift parameter α controls the volatility of the state vectors and is a measure of drift in one unit of time. (In this paper, we present a method for learning a dynamically evolving value for α.)

After observing y_{n+1}, the posterior of w_{n+1} is

  p(w_{n+1} | y_{n+1}) ∝ p(y_{n+1}|w_{n+1}) p(w_{n+1}) = N(w_{n+1} | µ_{n+1}, Σ_{n+1}),   (5)

where, after defining B_n = αI + Σ_n,

  Σ_{n+1} = (A^T A/σ² + B_n^{-1})^{-1},   (6)
  µ_{n+1} = Σ_{n+1} (A^T y_{n+1}/σ² + B_n^{-1} µ_n).   (7)

Following Equation (4), we can use this posterior distribution on w_{n+1} to find the prior for the next state w_{n+2}. This also indicates how µ_n and Σ_n were obtained for step n, and so only an initial Gaussian distribution on w_0 is required to allow for analytical calculations to be made for all n.

The extension to continuous time requires a simple modification: for continuous-time Kalman filters, the variance of the drift from w_n to w_{n+1} depends on the time between these two events. Calling this time difference Δt_n, we can make the continuous-time extension by replacing αI with Δt_n αI. Therefore w is a Brownian motion [3].
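The forward recursion in Equations (4)–(7) is short enough to state directly in code. The sketch below is a generic continuous-time Kalman forward update under the paper's parameterization (isotropic drift αI and noise σ²I); the function and variable names are our own.

```python
import numpy as np

def kalman_forward_step(mu, Sigma, y, A, alpha, sigma2, dt=1.0):
    """One predict/update step of Equations (4)-(7).

    mu, Sigma : posterior of the previous state, N(mu, Sigma)
    y         : new observation (length p)
    A         : p x d design matrix
    alpha     : drift variance per unit time
    sigma2    : observation noise variance
    dt        : time elapsed since the previous state (continuous-time case)
    """
    d = mu.shape[0]
    # Predict (Equation (4)), with the continuous-time scaling dt * alpha * I
    B = dt * alpha * np.eye(d) + Sigma
    # Update (Equations (5)-(7))
    B_inv = np.linalg.inv(B)
    Sigma_new = np.linalg.inv(A.T @ A / sigma2 + B_inv)
    mu_new = Sigma_new @ (A.T @ y / sigma2 + B_inv @ mu)
    return mu_new, Sigma_new
```

Iterating this step over a stream of (y, Δt) pairs is exactly the forward filter; the CKF in Section III replaces the fixed design matrix A with the other side of the dyad.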
C. Related work

A similar model that addressed spatio-temporal information was presented in [8]. A fixed-drift version of the model presented here was originally presented at a machine learning workshop [10] and is most closely related to [15], [14]. Because the models share the same name, we discuss some important differences before presenting our version, and also to highlight our contributions:

• Our approach does not require Kalman smoothing. Therefore, we do not store the sequence of state vectors, which allows our model to scale to massive data sets. The model in [15], [14] is only designed to handle small-scale data problems.

• We derive a variational inference algorithm for learning time-evolving distributions on the latent state space vectors akin to the Kalman filter. We also use a geometric Brownian motion for learning a dynamic drift parameter. Our experiments will show that a fully Bayesian approach on the state space improves predictive ability, and that learning a dynamic drift gives significant information about the underlying structure of the data set.

• We derive a probit model for binary and ordinal observations such as movie ratings instead of directly modeling all data types with Gaussians. We will show that this approach gives an improvement in performance when compared with modeling, e.g., discrete star ratings as real-valued Gaussian random variables.

III. COLLABORATIVE KALMAN FILTERING

The motivation behind the collaborative Kalman filter (CKF) is to extend the matrix factorization model described in Section II-A to the dynamic setting. The continuous-time Kalman filtering framework discussed in Section II-B is a natural means for doing this. The CKF models each latent location of a user or object as moving in space according to a Brownian motion. In our case, these correspond to the sets of vectors {u_i} and {w_j}, which are now functions of time. At any given time t, the output for dyad (i,j) uses the dot product ⟨u_i[t], w_j[t]⟩ as a parameter to a distribution, as discussed in Section II-A.

In the following discussion of the model we focus on one event (e.g., rating, trial, etc.), which can then be easily generalized to account for every event in a way similar to the Kalman filter. We also present the model in the context of m-ary prediction using ordered probit regression [5]. For this problem, the real line R is partitioned into m regions, with each region denoting a class, such as a star rating. Let I_k = (l_k, r_k] be the partition for class k, where l_k < r_k, r_k = l_{k+1}, l_k = r_{k-1} and k = 1,...,m. The output for pair (i,j) at time t, denoted z_ij[t] ∈ {1,...,m}, is obtained by drawing from a Gaussian distribution with mean ⟨u_i[t], w_j[t]⟩ and some preset variance σ², and finding which partition it falls within. Therefore, the model assumes an order relation between the m classes, for example that a 4 star rating is closer to a 5 star rating than to a 1 star rating. (In the binary special case, this order relation no longer matters.) We give a more detailed description of the model below.

Likelihood model: Let u_i[t] ∈ R^d represent the latent state vector for user i at time t and similarly let w_j[t] ∈ R^d represent the state vector for object j. Using the partition I and the probit function, the probability that z_ij[t] = k is

  P(z_ij[t] = k | u_i, w_j) = ∫_{I_k} N(y | ⟨u_i[t], w_j[t]⟩, σ²) dy.   (8)

For inference we introduce the latent variable y_ij[t] and expand Equation (8) hierarchically,

  z_ij[t] | y_ij[t] = Σ_k k · I(y_ij[t] ∈ I_k),   (9)
  y_ij[t] | u_i, w_j ∼ N(⟨u_i[t], w_j[t]⟩, σ²).   (10)
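Equation (8) reduces to a difference of Gaussian CDFs evaluated at the partition edges, which is how the class probabilities are computed in practice. The sketch below assumes a particular (hypothetical) choice of partition edges for a 5-star scale; the paper only requires that the cells be ordered and contiguous.

```python
import numpy as np
from scipy.stats import norm

def rating_probabilities(u, w, edges, sigma):
    """P(z = k) for the ordered probit likelihood, Equation (8).

    u, w  : latent state vectors u_i[t], w_j[t]
    edges : partition boundaries l_1 < r_1 = l_2 < ... < r_m, given as an
            array of length m+1 (outer edges may be +/- inf)
    sigma : preset standard deviation of the probit model
    """
    mean = u @ w                               # <u_i[t], w_j[t]>
    cdf = norm.cdf((edges - mean) / sigma)     # Phi at every boundary
    return np.diff(cdf)                        # probability mass of each cell I_k

# Hypothetical 5-star partition with unit-width interior cells:
edges = np.array([-np.inf, 1.5, 2.5, 3.5, 4.5, np.inf])
```

Sampling z_ij[t] from these probabilities, or equivalently drawing y_ij[t] ∼ N(⟨u_i, w_j⟩, σ²) and reporting the cell it lands in, reproduces the hierarchical form in Equations (9)–(10).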
We observe that, unlike most collaborative filtering models, there is no assumption that the pair (i,j) only interacts once, which has practical advantages. For example, teams may play each other multiple times, and even in a user rating setting the user may change his or her rating at different points in time, which contains valuable dynamic information. In related problems such as stock modeling, updated values arrive at regular intervals of time.²

² As we will see, in this case y is observed and the likelihood is (10).

Prior model: At time t we are interested in posterior inference for the state vectors u_i[t] and w_j[t]. This requires a prior model, which as we have discussed we take to be a latent multidimensional Brownian motion. Let the duration of time since the last event be Δ_{u_i}[t] for u_i and Δ_{w_j}[t] for w_j. Also, as with the Kalman filter, assume that we have multivariate normal posterior distributions for u_i[t − Δ_{u_i}[t]] and w_j[t − Δ_{w_j}[t]], with t − Δ[t] being the time of the last observation for u_i and similarly for w_j. We write these distributions as

  u_i[t − Δ_{u_i}[t]] ∼ N(µ'_{u_i}[t − Δ_{u_i}[t]], Σ'_{u_i}[t − Δ_{u_i}[t]]),
  w_j[t − Δ_{w_j}[t]] ∼ N(µ'_{w_j}[t − Δ_{w_j}[t]], Σ'_{w_j}[t − Δ_{w_j}[t]]).   (11)

The posterior parameters µ' and Σ' are also dynamically evolving and indexed by time. As is evident, we are assuming that we have independent posteriors for these variables. Though this is not analytically true, our mean-field variational inference algorithm will make this approximation, and so we can proceed under this assumption.

By the continuous-time extension of Equation (4), we can marginalize u_i[t − Δ_{u_i}[t]] and w_j[t − Δ_{w_j}[t]] in the interval to obtain the prior distributions on u_i and w_j at time t,

  u_i[t] ∼ N(µ_{u_i}[t], Σ_{u_i}[t]),
  w_j[t] ∼ N(µ_{w_j}[t], Σ_{w_j}[t]),   (12)

where

  µ[t] = µ'[t − Δ[t]],   Σ[t] = Σ'[t − Δ[t]] + Δ[t] αI,

for the respective u_i and w_j. We note that we use an apostrophe to distinguish priors from posteriors.

We see that we can interpret the CKF as a Kalman filter with multiple interacting state space vectors rather than a known and fixed design matrix A as in Equation (3). We also recall from the Kalman filter that αI is the drift covariance in one unit of time, and so the transition from posterior to prior involves the addition of a small value to the diagonal of the posterior covariance matrix. This is what allows for dynamic modeling; when α = 0, this is simply an online Bayesian algorithm and the posteriors will converge to a delta function at the mean.

Hyperprior model: The drift parameter α controls how much the state vectors can move in one unit of time. When α becomes bigger, the state vectors can move greater distances to better fit the variable y, also making these vectors more forgetful of previous information. When α → 0 the model does not forget any previous information and the state vectors simply converge to a point, since the posterior distribution becomes more concentrated at the mean. Therefore, α is clearly an important parameter that can impact model performance.

We develop the CKF model by allowing α to dynamically change in time as well. This provides a measure of volatility that will be especially meaningful in models for stock prices, as we will show. For notational convenience, we define the hyperprior for a shared α, but observe that the extension to a u_i- and w_j-specific α is straightforward (which we discuss in more detail in Section IV).

We model α as a geometric Brownian motion by defining α[t] = e^{a[t]} and letting a[t] be a Brownian motion. Therefore, as with the state vectors, if t − Δ_a[t] is the last observed time for a[t], the distribution of a[t] is

  a[t] ∼ N(a[t − Δ_a[t]], c Δ_a[t]).   (13)

Again there is a drift parameter c that plays an important role and requires a good setting, but we observe that defining α to be an exponentiated Brownian motion has an important modeling purpose by allowing for volatility in time, and is not done simply to avoid parameter setting.

When α is a geometric Brownian motion, the transition from posterior to prior for u_i and w_j needs to be modified. In Equation (12) the constant value of α allowed Σ[t] to be calculated by adding a time-scaled αI to the previous posterior. In this case, the update is modified for, e.g., u by performing the integration implied in Equation (12),

  Σ_u[t] = Σ'_u[t − Δ_u[t]] + I ∫_{t−Δ_u[t]}^{t} e^{a[s]} ds.   (14)

We will derive a simple approximation of this stochastic integral for inference.
IV. VARIATIONAL INFERENCE FOR THE CKF

Having defined the model prior, we next turn to posterior inference. Since the CKF model is dynamic, we treat posterior inference in the same way by learning a time-evolving posterior distribution on the underlying vectors {u_i} and {w_j} (and the hidden data y_ij[t] when the observations are from a latent probit model). We also learn the Brownian motion controlling the drift of each latent state vector, a[t], using a point estimate. We therefore break the presentation of our inference algorithm into two parts: one dealing with u, w and y, and one dealing with a. We present an overview of the inference algorithm in Algorithm 1.

A. An approximation for the hyperprior model

Before discussing our inference algorithm, we first present an approximation to the stochastic integral in Equation (14) that will lead to efficient updates of the Brownian motion a[t]. To this end, we introduce the following approximation,

  ∫_{t−Δ[t]}^{t} e^{a[s]} ds ≈ e^{a[t]} Δ[t].   (15)

We recall from the description of the model that a[t] can be shared or vector-specific. That is, e.g., a[t] could be unique for each u_i, or shared among {u_i} (in which case an appropriate subscript notation for a could be added). From the perspective of a generic multidimensional Brownian motion, the generative structure for this approximation is

  u[t] ∼ N(µ'_u[t − Δ_u], Σ'_u[t − Δ_u] + e^{a[t]} Δ_u I),
  a[t] ∼ N(a[t − Δ_a], c Δ_a).   (16)

That is, with this approximation we first draw the log drift value at time t, a[t], according to its underlying Brownian motion. We then approximate the drift parameter as being constant over the unobserved interval. Clearly, as the intervals between observations become smaller, this approximation becomes better. For inference we will learn a point estimate of the Brownian motion a[t].
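Under the approximation in Equations (15)–(16), carrying a latent vector forward across a gap of length Δ amounts to a single isotropic inflation of its covariance, with the log drift itself following a Gaussian random walk. The short sketch below makes that transition explicit; it is an illustration under our own naming, not code from the paper.

```python
import numpy as np

def propagate_prior(mu_post, Sigma_post, a_prev, dt, rng=None, c=None):
    """Posterior-to-prior transition for one latent vector (Eqs. (12)-(16)).

    mu_post, Sigma_post : posterior at the last observation time, N(mu', Sigma')
    a_prev              : last value of the log drift Brownian motion a[t - dt]
    dt                  : time elapsed since that observation
    c                   : drift of the log-volatility process; if c and rng are
                          supplied, a[t] is sampled from Equation (13),
                          otherwise the point estimate a_prev is carried forward
    """
    a_t = a_prev
    if rng is not None and c is not None:
        a_t = rng.normal(a_prev, np.sqrt(c * dt))             # Equation (13)
    d = mu_post.shape[0]
    Sigma_prior = Sigma_post + np.exp(a_t) * dt * np.eye(d)   # Eqs. (14)-(15)
    return mu_post, Sigma_prior, a_t
```

In the inference algorithm the sampled draw is replaced by a point estimate of a[t], updated as described in Section IV-E.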
Algorithm 1 — CKF parameter updates at time t
1: To update the parameters of dyad (i,j) at time t:
2: for iteration = 1,...,T do
3:   Update q(y_ij[t]) as in Equation (24).
4:   Update q(u_i[t]) as in (26).
5:   Update q(w_j[t]) by modifying (26) as noted in the text.
6:   Update a_{u_i}[t] as in (27).
7:   Update a_{w_j}[t] by modifying (27) as noted in the text.
8: end for
9: Notes: The following changes can be made.
   • When y is observed, skip Step 3 and directly use y in Steps 4 and 5.
   • When modeling a shared a_u among all {u_i}, the update is changed as noted in the text, and similarly for w.

B. A variational approximation

Focusing on the latent state-space variables first, the posterior of an event at time t is

  p(u_i[t], w_j[t], y_ij[t] | z_ij[t], a_{u_i}, a_{w_j}) ∝ p(z_ij[t]|y_ij) p(y_ij[t]|u_i, w_j) p(u_i[t]|a_{u_i}) p(w_j[t]|a_{w_j}).   (17)

Unlike the Kalman filter, the full posterior of the CKF is not analytically tractable. Typically when this occurs with nonlinear Kalman filters, sampling methods such as sequential Monte Carlo are employed as an approximation [4]. This is not computationally feasible for our problem since we are learning the evolution of a large collection of latent state vectors. Instead we use the deterministic mean-field variational method for approximate posterior inference [16]. As we will show, this not only allows for tractable posterior calculations for the state vectors u_i, w_j and hidden data y, but also allows for a transition from prior to posterior similar to the Kalman filter that retains tractability for all downstream calculations.

We approximate the posterior of Equation (17) with the following factorized distribution,

  q(u_i[t], w_j[t], y_ij[t]) = q(u_i[t]) q(w_j[t]) q(y_ij[t]).   (18)

We derive the specific distributions in Section IV-C. This q distribution approximates the posterior of each variable as being independent of all other variables, which has a significant advantage in this dynamic setting. Having defined q, we seek to minimize the KL divergence between q and the true posterior at time t by equivalently maximizing the variational objective function [16],

  L_t = E_q[ln p(z_ij[t], u_i[t], w_j[t], y_ij[t])] − E_q[ln q].   (19)

Since this is a non-convex function, a local maximum of L_t is found by optimizing the parameters of each q as discussed in Section IV-D.

C. Variational q distribution for the prior model

As indicated in Equation (19), constructing the variational objective function for the event at time t involves taking the expectation of the log joint likelihood and adding the entropies of each q. This requires defining the distribution family for q(y_ij), q(u_i) and q(w_j). The optimal form of each q distribution can be found by exponentiating the expected log joint likelihood, holding out the q distribution of interest [16]: q_i ∝ exp{E_{q_{−i}}[ln p(·)]}.

Temporarily ignoring the time notation, at time t we can find q(y_ij[t]) by calculating

  q(y_ij) ∝ exp{ ln p(z_ij|y_ij) + E_q[ln p(y_ij|u_i, w_j)] }.   (20)

Since p(z_ij[t]|y_ij[t]) = I(y_ij[t] ∈ I_{z_ij[t]}), it follows that q(y_ij[t]) is defined on the interval I_{z_ij[t]}, which is the cell corresponding to class z_ij[t]. After taking the expectation, the optimal q distribution is found to be

  q(y_ij[t]) = TN_{I_{z_ij[t]}}( y_ij[t] | ⟨E_q u_i[t], E_q w_j[t]⟩, σ² ).   (21)

This is a truncated normal distribution on the support I_{z_ij[t]} with mean parameter ⟨E_q u_i[t], E_q w_j[t]⟩ (defined later) and variance σ².

We can follow the same procedure to find q(u_i[t]) for time t. Again ignoring the time component, we have

  q(u_i) ∝ exp{ E_q[ln p(y_ij|u_i, w_j)] + ln p(u_i) }.   (22)

The prior on u_i[t] is discussed below. Solving for this distribution shows that q(u_i[t]) is a multivariate Gaussian,

  q(u_i[t]) = N( u_i[t] | µ'_{u_i}[t], Σ'_{u_i}[t] ),   (23)

with mean and covariance parameters given below. Symmetry results in the same q distribution family for w_j[t].
D. A coordinate ascent algorithm for q(u), q(w) and q(y)

We next present a coordinate ascent algorithm for updating the three q distributions that appear at time t.

Coordinate update of q(y): The analytical update for the mean parameter m_ij[t] of q(y_ij[t]) is

  m_ij[t] = ⟨E_q u_i[t], E_q w_j[t]⟩.   (24)

The values of E_q u_i[t] and E_q w_j[t] are given below. The expectation of y_ij[t] depends on the interval in which it falls according to the truncated normal. Let l_{z_ij[t]} and r_{z_ij[t]} be the left and right boundaries of this interval. Then defining

  α_ij[t] = (l_{z_ij[t]} − m_ij[t]) / σ,   β_ij[t] = (r_{z_ij[t]} − m_ij[t]) / σ,

we have that

  E_q y_ij[t] = m_ij[t] + σ · ( φ(α_ij[t]) − φ(β_ij[t]) ) / ( Φ(β_ij[t]) − Φ(α_ij[t]) ),   (25)

where φ(·) is the pdf and Φ(·) the cdf of a standard normal.

Coordinate update of q(u): The updates for the mean µ'_{u_i}[t] and covariance Σ'_{u_i}[t] of q(u_i[t]) are

  Σ'_{u_i} = ( Σ_{u_i}^{-1} + (µ'_{w_j} µ'^T_{w_j} + Σ'_{w_j}) / σ² )^{-1},
  µ'_{u_i} = Σ'_{u_i} ( E_q[y_ij] µ'_{w_j} / σ² + Σ_{u_i}^{-1} µ_{u_i} ).   (26)

All parameters above are functions evaluated at t. We recall that the prior mean µ_{u_i}[t] = µ'_{u_i}[t − Δ_{u_i}[t]] and the prior covariance Σ_{u_i}[t] ≈ Σ'_{u_i}[t − Δ_{u_i}[t]] + e^{a_{u_i}[t]} Δ_{u_i}[t] I, which follows from the approximation discussed above.

Coordinate update of q(w): Because of symmetry, the updates for µ'_{w_j}[t] and Σ'_{w_j}[t] of q(w_j[t]) are similar to those for u_i, but with the indices u_i and w_j switched on all means and covariances. We therefore omit these updates.
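As a concrete illustration, one coordinate sweep of Equations (24)–(26) for a single dyad can be written as below. The function names and argument layout are our own assumptions; only the update formulas themselves come from the text.

```python
import numpy as np
from scipy.stats import norm

def expected_y(mu_u, mu_w, left, right, sigma):
    """E_q[y_ij] under the truncated normal q(y), Equations (24)-(25)."""
    m = mu_u @ mu_w
    a, b = (left - m) / sigma, (right - m) / sigma
    return m + sigma * (norm.pdf(a) - norm.pdf(b)) / (norm.cdf(b) - norm.cdf(a))

def update_q_u(mu_u_prior, Sigma_u_prior, mu_w, Sigma_w, Ey, sigma2):
    """q(u_i[t]) update of Equation (26); the q(w_j[t]) update is symmetric."""
    Sigma_u_prior_inv = np.linalg.inv(Sigma_u_prior)
    Sigma_new = np.linalg.inv(Sigma_u_prior_inv
                              + (np.outer(mu_w, mu_w) + Sigma_w) / sigma2)
    mu_new = Sigma_new @ (Ey * mu_w / sigma2 + Sigma_u_prior_inv @ mu_u_prior)
    return mu_new, Sigma_new
```

Alternating these two functions (and the symmetric q(w) update) for a few iterations is the inner loop of Algorithm 1; when y_ij[t] is observed directly, Ey is simply replaced by the observed value, as noted in Section IV-F.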
E. Inferring the geometric Brownian motion exp{a[t]}

We next derive an algorithm for inferring the drift process a[t] used in the geometric Brownian motion. We derive a point estimate of this value, assuming an individual a for each u_i and w_j. To update, e.g., a_{u_i}[t] we approximate the relevant terms in L using a second order Taylor expansion about the point a_{u_i}[t − Δ_{a_{u_i}}[t]], which is the last inferred value for this process. We recall that the form of this approximation for a general function f(x) is

  f(x) ≈ f(x_0) + (x − x_0) f'(x_0) + ½ (x − x_0)² f''(x_0),

with the solution at the point x = x_0 − f''(x_0)^{-1} f'(x_0). In our case, f = E_q[ln p(u_i[t], a_{u_i}[t])].

We can efficiently update a_{u_i}[t] using this Taylor expansion as follows: Let Σ'_{u_i}[t − Δ_{u_i}[t]] = QΛQ^T be the eigendecomposition of the posterior covariance of u_i at the previous observation. (This needs to be done only once for the updates at this time point.) Also, let

  v = (µ'_{u_i}[t] − µ_{u_i}[t])^T Q,   M = Q^T Σ'_{u_i}[t] Q.

Since this depends on the updated posterior mean and covariance of u_i, v and M should be recomputed after each update of q(u_i). Let λ_d be the dth eigenvalue of Σ'_{u_i}[t − Δ_{u_i}[t]] and define

  η_d = e^{a_{u_i}[t]} Δ_{a_{u_i}} / ( λ_d + e^{a_{u_i}[t]} Δ_{a_{u_i}} ).

Then using a second order Taylor expansion of the objective function at the previous point a_{u_i}[t − Δ_{a_{u_i}}] gives the following first and second derivatives with respect to a_{u_i},

  f' = − (a[t] − a[t − Δ_a]) / (c Δ_a) − ½ Σ_d η_d ( 1 − (v_d² + M_{d,d}) / (λ_d + e^{a[t]} Δ_a) ),
  f'' = − 1/(c Δ_a) − ½ Σ_d η_d (1 − η_d) + ½ Σ_d η_d (1 − 2η_d) (v_d² + M_{d,d}) / (λ_d + e^{a[t]} Δ_a).   (27)

We have removed the subscript dependency on u_i above. We then use the closed form equation from the Taylor expansion to update a_{u_i}[t]. The update for a_{w_j}[t] has the same form, but with the relevant variational parameters and eigendecomposition replacing those above.
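Reading Equation (27) as a single Newton-style step about the previous value, the update can be written as the sketch below. Because Equation (27) is reproduced here from a degraded source, treat this as an illustration of the structure of the update rather than a verified implementation.

```python
import numpy as np

def update_log_drift(a_prev, c, dt, lam, v, M_diag):
    """One second-order (Newton-style) update of a[t], Equation (27).

    a_prev : previous point estimate a[t - dt] (the expansion point)
    c, dt  : drift of the log-volatility process and elapsed time
    lam    : eigenvalues of the previous posterior covariance Sigma'[t - dt]
    v      : (mu'[t] - mu[t])^T Q, expressed in the eigenbasis Q
    M_diag : diagonal of Q^T Sigma'[t] Q
    """
    eadt = np.exp(a_prev) * dt
    eta = eadt / (lam + eadt)
    r = (v ** 2 + M_diag) / (lam + eadt)
    # the first term of f' vanishes at the expansion point a = a_prev
    f1 = -0.5 * np.sum(eta * (1.0 - r))
    f2 = (-1.0 / (c * dt)
          - 0.5 * np.sum(eta * (1.0 - eta))
          + 0.5 * np.sum(eta * (1.0 - 2.0 * eta) * r))
    return a_prev - f1 / f2
```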
F. Simplifications to the model and predictions of new data

When y_ij[t] is observed, only a slight modification needs to be made to the algorithm above. In this case q(y) is no longer necessary, and in the update for each q(u) and q(w) the term E_q y_ij[t] can be replaced with the observed value.

Predictions using the CKF require an approximation of an intractable integral. One approximation is to use Monte Carlo methods. Suppressing the time t, the expectation we are interested in taking is υ_ij := E_q[Φ(⟨u_i, w_j⟩/σ)]. We can approximate υ_ij using S samples from these distributions,

  υ̂_ij^(S) = (1/S) Σ_{s=1}^{S} Φ( ⟨u_i^(s), w_j^(s)⟩ / σ ),   (28)

where u_i^(s) ∼ iid q(u_i[t]) and w_j^(s) ∼ iid q(w_j[t]). This gives an unbiased estimate of the expectation. A faster approximation is to directly use the means from q for u_i[t] and w_j[t].
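The Monte Carlo estimate in Equation (28) is a few lines of code; the sketch below also shows the faster plug-in alternative mentioned above. Function names are our own.

```python
import numpy as np
from scipy.stats import norm

def predict_mc(mu_u, Sigma_u, mu_w, Sigma_w, sigma, S=100, rng=None):
    """Unbiased Monte Carlo estimate of E_q[Phi(<u,w>/sigma)], Equation (28)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.multivariate_normal(mu_u, Sigma_u, size=S)   # u^(s) ~ q(u)
    w = rng.multivariate_normal(mu_w, Sigma_w, size=S)   # w^(s) ~ q(w)
    return norm.cdf(np.sum(u * w, axis=1) / sigma).mean()

def predict_plugin(mu_u, mu_w, sigma):
    """Faster approximation: plug the q means in directly."""
    return norm.cdf(mu_u @ mu_w / sigma)
```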
With the first user, we don’t expect model and learning a latent y is important as well, which ij r a 0.90 0.84 5 st 3 0.83 0.89 0.82 ar 2 BTahtemManatBrixegins 0.88 0.81 st FindingNemo 4 0.87 0.80 1 PGFuirgolphutFnCdiclhutiobognDay 0.79 r a 0.86 0.78 st 0 3 AceVentura (a) Netflix (b) MovieLens −1 Figure 3. The RMSE as a function of number of ratings for a user and ar movie. The (m,n) entry contains the RMSE calculated over user/movie st Anchorman pairs where the user has rated at least 10(m−1) movies and the movie 2 −2 LizzieMcGuire hasbeenratedatleast10(n−1)times.Thevalueshowninthelower-right corneris200+ratingseach,whichweuseinTableI. −3 58 60 62 64 66 68 70 72 74 76 78 Month we observed by comparing PMF EM with the original PMF Figure4. Anexampleofauserdriftovertimeasseenthroughpredicted algorithm [11].4 ratingsforseveralmoviesfromtheNetflixdataset.They-axisisthelatent Calculating the RMSE requires a different approach be- variablespace,whichwepartitionaccordingtostarratingasindicated. tween the online and static models. To calculate the RMSE for the two dynamic models—CKF and the online model— we do not use the test set for the batch models, but instead observed no clear evolution in movie preference for a user. makepredictionsofeveryratinginthedatasetbeforeusing We consider to be intuitively reasonable since we argue the observed rating to update the model. This gives a more that few people fundamentally change their taste. However, realistic setting for measuring performance of the online being able to find those individual users or movies that models, especially for the CKF where we’re interested in do undergo a significant change can have an impact on predicting the user’s rating at that time. Therefore, we must learningthelatentvectorsforallusersandmoviessincethe choose which predictions to calculate the RMSE over, since model is collaborative, and so we argue that modeling time clearly it will be bad for the first several ratings for a evolution can be valuable even when the actual percentage particularuserormovie.Bylookingattheusersandmovies of dynamically changing behaviors is small. that appear in the testing sets of the batch models we found B. Stock Data that for Netflix each user in the test set had an average of 613 ratings in the training set and each movie in the test Wenextpresentaqualitativeevaluationofourmodelona set had 53,395 ratings in the training set. For MovieLens stockreturnsdataset.Forthisproblem,thevalueofy [t]is ij this was 447 per user and 7,157 per movie, meaning that in observed since it is treated as the continuous-valued stock both cases a substantial amount of data was used to learn price, so we can model it directly as discussed in section locations in training before making predictions in testing. IV-F. We set the latent dimension d = 5, but observed Therefore, to calculate our RMSE of the online models we similar results for higher values. We learned stock-specific disregardthepredictionifeithertheuserormoviehasunder drift Brownian motions a [t] and set c = 5 × 10−2, ui u 200 previous ratings. This arguably still puts the RMSE for which we found to be a good setting through parameter our model at a disadvantage, but we noticed that the value tuning since the learned a [t] were not too smooth, but ui didn’t change much with values larger than 200. We show still stable. We also observe that, for this problem, there is the RMSE as a function of this number in Figure 3. 
B. Stock Data

We next present a qualitative evaluation of our model on a stock returns data set. For this problem, the value of y_ij[t] is observed, since it is treated as the continuous-valued stock price, so we can model it directly as discussed in Section IV-F. We set the latent dimension d = 5, but observed similar results for higher values. We learned stock-specific drift Brownian motions a_{u_i}[t] and set c_u = 5 × 10^{-2}, which we found to be a good setting through parameter tuning, since the learned a_{u_i}[t] were not too smooth, but still stable. We also observe that, for this problem, there is only one state vector corresponding to w, which we refer to as a "state-of-the-world" (SOW) vector. Therefore, this problem falls more within the traditional Kalman filtering setting discussed in Section II-B. For the SOW vector, we set c_w = 0 to learn a converging value of a_w[t], which we note was a_w[t] → −11.7. For the noise standard deviation we set σ = 0.01, which enforces that the state space vectors track y_ij[t] closely. (We again note that j = 1 for this problem.) We plot the number of active stocks by year in Figure 5, where we see that as time goes by, the number of stocks actively being traded increases significantly.

Figure 5. The number of actively traded stocks as a function of time.

We first assess the tracking ability of our model. The ability to accurately track the stock prices indicates that the latent structure being learned is capturing meaningful information about the data set. We show these results as error histograms on log2 scale in Figure 6 for four companies (tracking was very accurate and could not be distinguished visually from the true signal) and mention that these results are representative of all tracking performances. We see from these plots that the prediction errors of the stock prices are small, on the order of 10^{-3}, and so we can conclude that our five dimensional state space representation for u_i and w is sufficient to capture all degrees of freedom in time across the 6,480 stocks.

Figure 6. A histogram of the tracking errors of the CKF for four stocks (Walmart, Pfizer, Coca-Cola and Microsoft) using the mean of each q distribution (in log2 scale). The tracking performance is very accurate using a 5 dimensional latent space (order 10^{-3}). These results are typical of all histograms.

We next look more closely at the Brownian motions a_{u_i}[t] learned by the model to capture the volatility of the stock prices. In Figure 7(a) we show the historical stock price for two oil companies, BP and Chevron. Below this, in Figure 7(b), we show the respective functions a_{u_i} learned for these stocks. We see that the volatility of both of these oil companies share similar shapes since they are closely linked in the market. For example, in the early '80s, late '90s and late 2000s, oil prices were particularly volatile. This is modeled by the increase of the geometric Brownian motion exp{a_{u_i}[t]}, which captures the fact that the state vectors are moving around significantly in the latent space during this time to rapidly adjust to stock prices.

Figure 7. (a) The historical stock price for two oil companies, BP and Chevron. (b) The log drift Brownian motion (a_{u_i}[t]) indicating the volatility of each stock. We see that though the stock prices are different, the volatility of both oil companies share the same shape since they are closely linked in the market.

We also consider the stock volatility across different market sectors. In Figure 8 we show the historical stock prices for five companies along the top row and their respective a_{u_i}[t] below them on the second row. Three of the companies are in the steel market, while the other two are from different markets (pharmaceutical and beverage). We again see that the learned Brownian motion captures a similar volatility for the steel companies. In each case, there is significant volatility associated with the 2008 financial crisis due to the decrease in new construction. This volatility is not found with the pharmaceutical or beverage companies. However, we do see a major spike in volatility for Coca-Cola around 1985, which was a result of their unsuccessful "New Coke" experiment. These observations are confirmed by the respective stock prices. However, we note that the level of volatility is not only associated with large changes in stock value. In the Pfizer example, we see that the volatility is consistently high for the first half of its stock life, and then decreases significantly for the second half, which is due to the significantly different levels of volatility before and after the year 2000.

Figure 8. (Top row) The historical stock prices for three steel companies (Posco, Schnitzer and US Steel), one pharmaceutical company (Pfizer) and one beverage company (Coca-Cola). (Bottom row) The corresponding log drift Brownian motions (a_{u_i}[t]) for the respective stocks from the top row. We see that the three steel companies shared high volatility during the period of the 2008 financial crisis, but companies in other areas such as the pharmaceutical and beverage industries were not similarly affected.
VI. CONCLUSION

We have presented a collaborative Kalman filter for dynamic collaborative filtering. The model uses the matrix factorization approach, which embeds users and objects in a latent space. However, unlike most related models in which these locations are fixed in time, we allowed them to drift according to a Brownian motion. This allows for dynamic evolution of, e.g., user preference. We built on this basic model by presenting a method for learning a time-evolving drift parameter for the state space vectors using a geometric Brownian motion.

Many other applications could potentially benefit from this framework. For example, we could predict the results of batter/pitcher events over time in baseball, or the winners of sporting competitions. Such a dynamic model would clearly be an advantage since player or team performance can experience streaks and slumps. Another potential use is in automated tutoring systems, where a student is asked a sequence of questions (e.g., math questions or language flashcards). In a collaborative filtering environment we would wish to estimate the probability that a student answers a question correctly. Clearly, as the student learns, these probabilities will evolve. Having estimates for such a probability would allow for intelligent tutoring, where questions are selected that are not too easy or too hard and are able to help with the learning process.
REFERENCES

[1] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems, 2013.
[2] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
[3] E. Çinlar. Probability and Stochastics. Springer, 2011.
[4] A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000.
[5] W. Greene. Econometric Analysis. Prentice Hall, 2011.
[6] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
[7] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
[8] Z. Lu, D. Agarwal, and I. S. Dhillon. A spatio-temporal approach to collaborative filtering. In ACM Conference on Recommender Systems, 2009.
[9] L. Mackey, D. Weiss, and M. I. Jordan. Mixed membership matrix factorization. In International Conference on Machine Learning, 2010.
[10] J. Paisley, S. Gerrish, and D. Blei. Dynamic modeling with the collaborative Kalman filter. In NYAS 5th Annual Machine Learning Symposium, 2010.
[11] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, 2007.
[12] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In International Conference on Machine Learning, 2008.
[13] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In International Conference on World Wide Web, 2001.
