Probabilistic Topic Model for Hybrid Recommender Systems


Probabilistic Topic Model for Hybrid Recommender Systems: A Stochastic Variational Bayesian Approach

Asim Ansari, Yang Li, Jonathan Z. Zhang¹

July 2016

¹ Asim Ansari is the William T. Dillard Professor of Marketing at Columbia Business School, Yang Li is Assistant Professor of Marketing at Cheung Kong Graduate School of Business in Beijing, and Jonathan Z. Zhang is Assistant Professor of Marketing at University of Washington in Seattle.

Abstract

Recommendation systems are becoming increasingly popular and useful in e-commerce and digital product contexts that involve heterogeneous consumers and a vast collection of products. This creates representational challenges as features that adequately describe the products are often not readily available. User-generated content in the form of product reviews or product tags can be leveraged to obtain rich representations for subsequent product recommendation and targeting. In this paper we develop a novel covariate-guided supervised topic model (CGSTM) that succinctly characterizes products in terms of latent topics and specifies consumer preferences through these semantic features. At the same time, recommendation contexts generate big data problems stemming from data volume, variety and veracity, such as our setting that includes massive textual and numerical data. To overcome the computational challenges due to the combination of big data and a complex model, we develop a novel stochastic variational Bayesian (SVB) framework to achieve fast, scalable and accurate estimation. We use our SVB approach to estimate our model on a dataset of 8.87 million movie ratings from 111,793 customers and 233,268 textual movie tags. The results show our model generates much better predictions than those from a benchmark model, and yields interesting insights about movie preferences.
We also illustrate how our model can be used for targeting recommendations to particular users, and how it can support personalized search by determining relevant products and by generating a personalized ranking of these relevant products.

Keywords: Hybrid Recommendation Models, Personalized Search, User-Generated Content, Probabilistic Topic Models, Big Data, Scalable Inference, Stochastic Variational Bayes.

1 Introduction

Over the past decade, e-commerce firms, online retailers, and digital content providers such as Amazon, Netflix and New York Times have become increasingly reliant on recommender systems to target products and digital content to users. Recommender systems are particularly useful in environments that are characterized by a large number of users who face a vast array of products to choose from. In such contexts, there is considerable heterogeneity in user preferences for product attributes and the large number of products implies that users are often unaware or uncertain of products that might appeal to them. Moreover, such environments are often constantly evolving as new users and new items are added on a regular basis. Firms therefore use recommender systems to offer personalized suggestions to users. This task hinges upon the recommender system’s ability to capture the heterogeneous preferences of current users based on their previous product ratings or product choices and to use such information to predict their liking of new items.

Recommendation systems need to overcome various modeling and computational challenges to successfully predict preferences and recommend products. Such systems often operate on a sparse database in which each consumer rates only a few items and each product is rated or chosen by only a few customers.
The paucity of data for most consumers implies that it is critical to borrow information from other consumers in order to predict the preferences of a given consumer, and therefore some type of shrinkage mechanism is needed to model customer heterogeneity. The large number of products within a database also poses challenges in representing these in terms of their underlying features. In many marketing situations, such feature representations are unavailable, or are at best partially available, as considerable domain expertise is needed for firms to manually supply detailed content descriptors for each product. Yet, a rich representation of products in terms of their attributes is crucial for properly modeling preference heterogeneity. Thus many systems rely on some sort of automatic feature extraction. In this research, we show how user-generated content that represents the “voice of the customers” can be leveraged to automatically extract features that are most predictive of preferences. Finally, recommender systems need to overcome various cold start problems associated with new users and new items.

Apart from the modeling challenges, typical recommendation contexts generate big data computational challenges stemming from data volume, variety and veracity. While personalization focuses on a given user or a given product, large data volume that results from a massive user base and a vast product mix is critical for recommendation success, as it facilitates borrowing of information and enriches representation of products. However, this also results in scalability challenges, particularly so if complex probabilistic representations are needed to fully capture the information content in the data. Moreover, the application of user-generated content such as online texts and tags implies a curse of dimensionality, which needs to be tackled via appropriate dimensionality reduction procedures.
Thus, scalable methods that are capable of estimating probabilistic models involving many latent variables on large datasets of variegated forms are needed.

In this paper, we develop a novel hybrid model-based recommendation framework that addresses the above representational challenges. In addition, we develop a novel stochastic variational Bayesian approach for estimation that overcomes the scalability challenges associated with large volume and high dimensionality. Specifically, we leverage crowd-sourced textual descriptions of products, such as user-generated tags, to construct probabilistic content descriptors of products. This alleviates the often onerous requirement for firm-provided product attributes, an issue that bedevils most content filtering recommendation systems. We construct a supervised probabilistic topic model that transforms the “voice of the customer” about products and automatically infers latent product features that not only summarize the semantic content within the tags/words, but also simultaneously predict product ratings. In our model, the topics capture the semantic structure of the textual representation of the products and allow automatic dimension reduction of the vast vocabulary underlying the textual descriptions. More importantly, consumer preferences are specified in terms of these latent topics in our model. As we jointly model both the textual data and the product ratings, these latent topics are inferred to be the most predictive of user preferences. In addition, we allow firm-specified covariates, when available, to guide the allocation of products to topics. This results in a recommendation system that leverages preference heterogeneity over rich user-generated content representations in a seamless manner.

Specifically, we develop a covariate-guided supervised topic model to relate the textual description of products to the user ratings.
Our model extends the supervised latent Dirichlet allocation model (SLDA; Blei and McAuliffe 2007) in several directions to capture the unique characteristics of the recommendation context. Recommendation datasets often have a cross-nested dependency structure as a given user rates multiple products and each product is rated by multiple users. In our model each product description (e.g., a document in the topic model) is associated with multiple product ratings given by many different users. This is distinct from typical supervised topic models in which each document is rated by a single user; such models are therefore more suitable for sentiment analysis of reviews, but are not rich enough to represent the preference heterogeneity that is crucial in making successful recommendations. We also account for preference heterogeneity over topics and explicitly take into account the cross-nested structure of the data. Finally, we use firm-specified product covariates to guide the allocation of latent topics to products. This allows us to tackle the cold-start problem associated with new products.

We apply our modeling framework to the context of personalized movie recommendations. We believe that the experiential nature of movie products is more amenable to automatic feature extraction from crowd-sourced customer opinions, especially because standard content descriptors such as movie genre are not rich enough to flexibly capture the numerous reasons why certain movies appeal to particular consumers. While we use crowd-sourced tags to generate textual representations for movies, such representations can also be obtained from other naturally occurring information sources on the Internet, such as Wikipedia, blogs, tweets, or product reviews.

Our application can also be considered as a quintessential example of big data marketing because it simultaneously incorporates multiple facets of the 4Vs framework that is used to characterize big data situations (Sudhir 2016).
For instance, our application uses a very large set of users and products, which results in a large volume of ratings. Also, the model deals with a variety of data, including unstructured texts and numbers, and we use natural language processing methods to replace the high-dimensional semantic content of the tags with a small set of latent topics. Moreover, our application showcases the challenges and opportunities of data veracity, in that data can be fused together from disparate sources, as the tags and ratings can be gathered from different sets of customers on various online platforms.

Given the computational demands of our big data setting, we develop a novel stochastic variational Bayesian (SVB) approach to achieve fast and scalable inference of the proposed model. Stochastic variational Bayesian methods differ from sampling-based MCMC approaches for summarizing the posterior distribution, and instead use optimization to approximate the posterior. Thus SVB methods generate estimation results in a fraction of the time needed for traditional MCMC methods. Our SVB algorithm contains a number of novel computational features, such as the use of stochastic natural gradient descent and adaptive mini-batch sizes to significantly enhance computational speed and estimation scalability.

In the context of our movie application, we show that our model generates much better predictions than a benchmark model that only uses manually specified genre covariates. This showcases the benefits that accrue from rich feature representations derived from user-generated content (UGC). We also use our model to uncover a number of interesting insights about the determinants of movie preferences and the semantic structure behind the movie tags. We show how the model output can be used for a number of tasks that are relevant in the functioning of a recommender system. We show how our model can be used to generate unconditional recommendations for a given user.
More interestingly, we show how our model can support different types of personalized search within the product recommendation context. For example, we show how the model can generate a personalized ranking of a set of movies that are most similar to a given movie specified by a user. We also show how our model can generate a set of personalized ranked results for a user query that specifies the user needs via a list of keywords.

In summary, our research has both methodological and managerial contributions. Methodologically, our model extends traditional supervised topic models by incorporating a number of features that are relevant for the recommendation context. We also develop a novel SVB algorithm for scalable inference of our model. Our SVB algorithm can be applied to other big data marketing contexts. For instance, it can be modified to accommodate other supervised mixed membership models or hierarchical models with both conjugate and non-conjugate components. On the managerial front, our model can be used not only for generating insights about the determinants of consumer preferences, but also for directly recommending products beyond the movie category. Given that segmentation, targeting and personalization are core marketing activities, our modeling and estimation approaches are immediately useful for marketing practitioners.

The rest of the paper proceeds as follows. After a literature review, we describe the different components of our data in Section 3. We then develop the modeling framework for hybrid recommendation systems in Section 4. After revealing the computational challenges, we introduce variational Bayesian methods in Section 5, and we elaborate on the stochastic natural gradient strategy to speed up the computation for massive data settings. In Section 6 we present the estimation results and the associated managerial insights.
Finally we conclude by discussing the limitations of the current model and estimation approaches, and highlight potential directions for further research.

2 Literature Review

Several research areas in marketing, statistics and machine learning are relevant for our work on personalized recommendation systems in big data settings. These include the literature on recommendation systems and the natural language processing literature on probabilistic topic models and mixed membership models. Moreover, the ongoing research on scalable Bayesian inference in statistics and computer science is relevant for handling the big data challenges in our application. We succinctly review these areas below.

A number of studies in the marketing and computer science literatures have developed algorithmic and statistical approaches for generating recommendations. Recommender systems differ on whether they are model-based or not. Prominent classes of model-based recommender systems include collaborative filtering, content filtering, and hybrid approaches that use a combination of collaborative and content filtering.

Collaborative filtering models (see Desrosiers and Karypis 2011 for a review) rely solely on user ratings or purchase incidence data and leverage the similarity in preferences across users or across items. These methods therefore do not leverage attribute information about products. In particular, user-based collaborative filtering identifies those users who are closest to a given user in terms of their preferences for products in the database. Similarly, item-based collaborative filtering identifies those products that are closest to a given product in terms of their appeal to customers. More recent incarnations of collaborative filtering use matrix factorizations (Koren and Bell 2011) of the user-item ratings matrix to uncover latent factors that represent user preferences or unobservable product features.
These matrix factorization approaches result in automatic summarization and dimension reduction of the ratings matrix. Despite their advantages, collaborative filtering methods suffer from the cold-start problem in that they cannot be used for new users or new items.

Content filtering systems (see Lops et al. 2011 for a review), in contrast, use information about the content of an item to capture the drivers of preferences. Content is broadly defined and often takes the shape of a set of product features that are used to model the variability in product ratings. Content-based systems can be useful in providing the underlying rationale for a recommendation, thereby increasing customer trust in the recommendations. Content-based methods have the additional advantage that they can be used to predict preferences for new items based on their constituent features. However, explicitly coding a set of features to sufficiently describe an item can become difficult, especially when dealing with a large number of products, as is typically the case in the online environment where products are added on a continual basis. Moreover, a complete description of a product could require many attributes, which can add considerably to the difficulty of data collection, especially if domain experts are needed to specify the relevant attribute values.

Hybrid recommender systems integrate collaborative and content filtering models to leverage the best features of both. Ansari, Essegaier and Kohli (2000) develop such a hybrid hierarchical Bayesian model to leverage the preference heterogeneity across consumers in making recommendations. In this model, the Bayesian shrinkage arising from the population distribution that characterizes preference heterogeneity automatically allows for model-based collaborative filtering. A number of marketing scholars have made advances in this area, including Ying et al. (2006), Bodapati (2008), Chung, Rust and Wedel (2009) and Chung and Rao (2012).
In this paper, we continue in this tradition, but focus explicitly on leveraging automatic content representation obtained via probabilistic topic models to predict preferences.

The natural language processing literature on probabilistic topic models for textual data (e.g., Blei, Ng and Jordan 2003; Blei and McAuliffe 2007) is also relevant for our recommendation model. Topic models are mixed membership models that automatically summarize the latent topics characterizing the semantic structure underlying a corpus of documents (Tirunillai and Tellis 2014). While traditional supervised topic models are suitable for aggregate semantic analysis of textual data such as reviews, these models often limit one document to be rated by a single author. Hence, they are not readily suitable for personalized recommendation systems where each product may receive multiple ratings from different users. Our CGSTM extends supervised topic models via a richer latent variable specification that captures the dependency structure of recommendation databases. In particular, it allows for multiple ratings from different users for each document (movie) and for individual differences across users in their preference structure over the topics.

Finally, the statistical and machine learning literature on scalable Bayesian inference is relevant given the big data setting of our application. Bayesian methods (Rossi, Allenby and McCulloch 2005) are particularly suited for recommendation problems, given the need for pooling information across users in the modeling of heterogeneity, and the need for generating individual-level estimates of consumer preferences. MCMC methods are popular in summarizing the posterior distribution of latent variables and parameters, but can be slow in big data contexts due to the need for tens of thousands of iterations required for convergence (Braun and McAuliffe 2010).
We therefore use variational Bayesian methods (Bishop 2006; Dzyabura and Hauser 2011; Ormerod and Wand 2010) which replace sampling with optimization, thus resulting in significant speed improvements. In particular, we leverage the state-of-the-art advances in stochastic variational methods (Hoffman et al. 2013; Tan 2015; Toulis and Airoldi 2016) to significantly improve the speed and scalability of inference. We now describe the data context to facilitate an easier understanding of our model.

3 Data Description

We use our model on the MovieLens data (Harper and Konstan 2015) for movie recommendations. Our analysis is based on a dataset that was made available by MovieLens on August 06, 2015. The data contains 21,622,187 ratings and 516,139 tag applications across 30,106 movies. There are 234,934 users in the data who provided ratings between January 09, 1995 and August 06, 2015. The data files contain 1) the movie ratings given by users on a 10-point scale ranging from 0.5 to 5 in 0.5 point increments, 2) textual tags applied to movies by the users, and 3) the title and genre information for each movie. In the data, users were free to come up with any tags that described the movies. Not all users in the dataset tagged the movies, so we aggregate all the tags that are applied to the same movie across users to construct a “bag of tags” description of the movie. Thus, in using the tags, we ignore the identity of the users who supplied the tag. The dataset also describes the movies using a set of 19 genres. Lastly, the dataset does not include any user demographics.

We randomly select 5,000 movies from the set of 10,722 movies that received tags from at least four users. We then use a number of preprocessing steps on the textual data associated with these movies to clean it for analysis. In particular, we convert the tags to lower-case to eliminate any redundancy in the tags that may arise from lower and upper case versions of the same tag.
We decide against tag stemming to facilitate easy understanding of the topics by readers. We choose not to tokenize multi-worded tags into space-separated words, as the tag as a whole is more meaningful than the individual words comprising the tags. To reduce vocabulary size to a manageable level, we also discard all tags that were applied only once in the data. In addition, as our data contains well-formed tags and not free-flowing reviews or conversations, there is no need to remove stop-words, as is typically done in textual preprocessing. These preprocessing steps result in a sample of 4,609 movies that were rated by 111,793 users. The total number of tag applications across all movies is 233,268 and the overall vocabulary size (i.e., the number of distinct tags) is 21,255. Compared to the 19 genres, this large vocabulary has the potential to be a lot more expressive about the movie characteristics perceived by the users. The final dataset contains 8,865,061 ratings across the users and movies.

[Figure 1: Proportion of Movie Genres]

Now we provide some summary statistics on the data. First, the proportions of the 19 movie genres in our sample are shown in Figure 1. We can see that Drama, Comedy, Action, Thriller and Romance are the topmost genres represented in the data, whereas Film-Noir is the least represented. Figure 2 shows a word cloud that reflects the most frequent tags applied to the movies. It is clear that many of these popular tags do not overlap with the 19 genres. The diversity of the tags seen …
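
As a rough illustration of the tag preprocessing described in this section, the Python sketch below lower-cases tags, keeps multi-word tags whole, restricts attention to movies tagged by at least four users, drops tags applied only once, and aggregates the remainder into a "bag of tags" per movie. It is only a sketch under stated assumptions: the function name build_bags_of_tags, the (user_id, movie_id, tag) row layout, and the parameters min_taggers, n_movies and seed are hypothetical and are not taken from the paper or from the MovieLens files.

```python
import random
from collections import Counter, defaultdict


def build_bags_of_tags(tag_rows, min_taggers=4, n_movies=None, seed=0):
    """Build a lower-cased bag of tags per movie.

    tag_rows: iterable of (user_id, movie_id, tag) tuples (assumed layout).
    """
    # Normalize case only; no stemming and no splitting of multi-word tags.
    rows = [(user, movie, tag.strip().lower()) for user, movie, tag in tag_rows]

    # Keep movies that were tagged by at least `min_taggers` distinct users.
    taggers = defaultdict(set)
    for user, movie, _ in rows:
        taggers[movie].add(user)
    eligible = sorted(m for m, users in taggers.items() if len(users) >= min_taggers)

    # Optionally draw a random subset of the eligible movies.
    if n_movies is not None and n_movies < len(eligible):
        eligible = random.Random(seed).sample(eligible, n_movies)
    selected = set(eligible)

    # Count tag applications over the selected movies; drop tags applied once.
    counts = Counter(tag for _, movie, tag in rows if movie in selected)
    bags = defaultdict(Counter)
    for _, movie, tag in rows:
        if movie in selected and counts[tag] > 1:
            bags[movie][tag] += 1  # the tagger's identity is ignored here
    return dict(bags)


# Toy input (invented for illustration): movie "m1" has four distinct taggers,
# movie "m2" has only two and is therefore excluded.
example_rows = [
    (1, "m1", "Dark Comedy"), (2, "m1", "dark comedy"),
    (3, "m1", "Cult Film"), (4, "m1", "quirky"),
    (1, "m2", "cult film"), (2, "m2", "Dark comedy"),
]
example_bags = build_bags_of_tags(example_rows)
```

In this toy input only "dark comedy" survives for "m1", because the other tags are applied once among the selected movies. Movies whose tags are all discarded drop out of the result, which is one plausible reason a sample can shrink after preprocessing.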
