Table Of Content

Joint Deep Modeling of Users and Items Using Reviews for Recommendation Lei Zheng Vahid Noroozi Philip S. Yu DepartmentofComputer DepartmentofComputer DepartmentofComputer Science Science Science UniversityofIllinois atChicago UniversityofIllinois atChicago UniversityofIllinois atChicago Chicago,U.S. Chicago,U.S. Chicago,U.S. [email protected] [email protected] [email protected] 7 1 0 2 ABSTRACT options to customers, it makes it harder for them to pro- n cessthelargeamountofinformationprovidedbycompanies. A large amount of information exists in reviews written by a Recommender systems help customers by presenting prod- users. This sourceof information hasbeen ignored bymost J ucts or services that are likely of interest to them based on ofthecurrentrecommendersystemswhileitcanpotentially 7 their preferences, needs, and past buyingbehaviors. Nowa- alleviatethesparsityproblemandimprovethequalityofrec- 1 days, many people use recommender systems in their daily ommendations. In this paper, we present a deep model to life such as online shopping, reading articles, and watching learnitempropertiesanduserbehaviorsjointlyfromreview ] movies. G text. Theproposedmodel,namedDeepCooperativeNeural Many of the prominent approaches employed in recom- Networks (DeepCoNN), consists of two parallel neural net- L mender systems [13] are based on Collaborative Filtering workscoupledinthelastlayers. Oneofthenetworksfocuses . (CF) techniques. The basic idea of these techniquesis that s onlearninguserbehaviorsexploitingreviewswrittenbythe c user, and the other one learns item properties from the re- peoplewhosharesimilarpreferencesinthepasttendtohave [ similar choices in the future. Many of the most successful views written for the item. A shared layer is introduced on 1 the top to couple these two networks together. The shared CF techniquesare based on matrix factorization [13]. They findcommon factors that can betheunderlyingreasons for v layer enables latent factors learned for users and items to the ratings given by users. For example, in a movie recom- 3 interact with each other in a manner similar to factoriza- mendersystem,thesefactors can begenre, actors, or direc- 8 tion machinetechniques. Experimentalresultsdemonstrate tor of movies that may affect the rating behavior of users. 7 that DeepCoNN significantly outperforms all baseline rec- Matrix factorization techniques not only find these hidden 4 ommender systemson a variety of datasets. factors, but also learn their importance for each user and 0 how each item satisfies each factor. . 1 CCSConcepts Although CF techniques have shown good performance 0 formanyapplications,thesparsityproblemisconsideredas Informationsystems Collaborativefiltering;Rec- 7 • → oneoftheirsignificantchallenges[13]. Thesparsityproblem ommender systems; Computing methodologies 1 • → arises when the number of items rated by users is insignifi- Neural networks; : v cant to the total numberof items. It happensin many real i applications. ItisnoteasyforCFtechniquestorecommend X Keywords items with few ratings or to give recommendations to the r RecommenderSystems,DeepLearning,ConvolutionalNeu- users with few ratings. a One of the approaches employed to address this lack of ral Networks,Rating Prediction data is using the information in review text [16, 17]. In many recommender systems, other than the numeric rat- 1. INTRODUCTION ings,userscanwritereviewsfortheproducts. Usersexplain Thevarietyandnumberofproductsandservicesprovided the reasons behind their ratings in text reviews. The re- by companies have increased dramatically during the last views contain information which can be used to alleviate decade. Companies produce a large number of products sparsityproblem. OneofthedrawbacksofmostcurrentCF to meet the needs of customers. Although this gives more techniquesisthattheymodelusersanditemsjustbasedon thenumericratings providedbyusers andignore theabun- dantinformation existedinthereviewtext. Recently,some Permissiontomakedigitalorhardcopiesofallorpartofthisworkforpersonalor classroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributed studies [17] [16] have shown that using review text can im- forprofitorcommercialadvantageandthatcopiesbearthisnoticeandthefullcita- prove the prediction accuracy of recommender systems, in tiononthefirstpage. Copyrightsforcomponentsofthisworkownedbyothersthan particular for the items and userswith few ratings [34]. ACMmustbehonored.Abstractingwithcreditispermitted.Tocopyotherwise,orre- publish,topostonserversortoredistributetolists,requirespriorspecificpermission In this paper, we propose a neural network (NN) based and/[email protected]. model, named Deep Cooperative Neural Networks (Deep- WSDM2017,February06-10,2017,Cambridge,UnitedKingdom CoNN), to model users and items jointly using review text (cid:13)c 2017ACM.ISBN978-1-4503-4675-7/17/02...$15.00 for rating prediction problems. The proposed model learns DOI:http://dx.doi.org/10.1145/3018661.3018665 hidden latent features for users and items jointly using two sented in Section 3 toanalyze DeepCoNN and demonstrate coupled neuralnetworkssuch that therating prediction ac- itseffectivenesscompared tothestate-of-the-art techniques curacy is maximized. One of the networks models user be- for recommendation systems. In Section 4, we give a short havior using the reviews written by the user, and the other reviewoftheworksrelatedtoourstudy. Finally,conclusions network models item properties using the written reviews are presented in Section 5. for the item. The learned latent features for user and item areusedtopredictthecorrespondingratinginalayerintro- 2. METHODOLOGY duced on the top of both networks. This interaction layer is motivated by matrix factorization techniques [13] to let Theproposedmodel,DeepCoNN,isdescribedindetailin latent factors of users and items interact with each other. this section. DeepCoNN models user behaviors and item Tothebestofourknowledge,DeepCoNNisthefirstdeep properties using reviews. It learns hidden latent factors model that represents both users and items in a joint man- for users and items by exploiting review text such that the ner using reviews. It makes the model scalable and also learned factors can estimate the ratings given by users. It suitableforonlinelearningscenarioswherethemodelneeds is done with a CNN based model consisting of two parallel to get updated continuously with new data. Another key neural networks, coupled to each other with a shared layer contribution is that DeepCoNN represents review text us- at the top. The networks are trained in a joint manner to ing pre-trained word-embedding technique [21, 20] to ex- predicttheratingswithminimumpredictionerror. Wefirst tract semantic information from the reviews. Recently,this describe notations used throughout this paper and formu- representation has shown excellent results in many Natu- latethedefinitionofourproblem. Then,thearchitectureof ralLanguageProcessing(NLP)tasks[7,4,21]. Moreover,a DeepCoNN and the objective function to get optimized is significantadvantageofDeepCoNNcomparedtomostother explained. Finally, we describe howto train thismodel. approaches [17, 16] which benefit from reviews is that it 2.1 Definition andNotation models users and items in a joint manner with respect to predictionaccuracy. Mostofthesimilaralgorithmsperform A set of training set consists of N tuples. Each tuple T themodelingindependentlyoftheratings. Therefore, there (u,i,rui,wui) denotes a review written by user u for item i isnoguaranteethatthelearned factorscan bebeneficialto with rating rui and text review of wui. The mathematical therating prediction. notations used in thispaperare summarized in Table 1. The experiments on real-world datasets including Yelp, Amazon [19], and Beer [18] show that DeepCoNN outper- Table 1: Notations formsallthecomparedbaselinesinpredictionaccuracy. Also, Symbols Definitions and Descriptions the proposed algorithm increases the performance for users du useroritemu’sreviewtextconsistingofn and items with fewer ratings more than the ones with a 1:n words higher number of ratings. It shows that DeepCoNN allevi- Vu wordvectors ofuseroritemu 1:n ates the sparsity problem byleveraging review text. wui areviewtextwrittenbyuseruforitemi Ourcontributionsandalso advantagesofDeepCoNNcan oj theoutputofjth neuronintheconvolutional be summarized as follows: layer ni thenumberofneuronsinthelayeri TheproposedDeepCooperativeNeuralNetworks(Deep- Kj thejth kernelintheconvolutional layer • CoNN) jointly model userbehaviorsand item proper- bj thebiasofjth convolutional kernel g thebiasofthefullyconnected layer ties using text reviews. The extra shared layer at the zj thejth featuremapintheconvolutional layer top of two neural networks connects the two parallel W theweightmatrixofthefullyconnected layer networks such that user and item representations can t thewindowsizeofconvolutional kernel interact with each other to predict ratings. To the c thedimensionofwordembedding bestofourknowledge,DeepCoNNisthefirstonethat xu theoutput ofNetu jointly models both user and item from reviews using yi theoutputofNeti λ thelearningrate neural networks. It represents review text as word-embeddings using • pre-trained deep models. The experimental results 2.2 Architecture demonstratethatthesemanticmeaningandsentimen- Thearchitectureoftheproposed modelforratingpredic- tal attitudes of reviews in this representation can in- tionisshowninFigure1. Themodelconsistsoftwoparallel creasetheaccuracyofratingprediction. Allcompeting neural networks coupled in the last layer, one network for techniques which are based on topic modeling [38, 3, users(Netu)andonenetworkforitems(Neti). Userreviews 8] use thetraditional bag of words techniques. and item reviews are given to Netu and Neti respectively as inputs,and corresponding rating is produced as theout- It does not only alleviate the problem of sparsity by • put. Inthefirstlayer,denotedaslook-uplayer,reviewtext leveraging reviews, but also improves the overall perfor users or items are represented as matrices of word em- formance of the system significantly. It outperforms beddingstocapturethesemanticinformation in thereview state-of-the-art techniques [17, 33, 25, 35] in terms of text. NextlayersarethecommonlayersusedinCNNbased predictionaccuracyonalloftheevaluateddatasetsin- models to discover multiple levels of features for users and cluding Yelp, 21 categories of Amazon, and Beer (see items, including convolution layer, max pooling layer, and Section 3). fully connected layer. Also, a top layer is added on the top The rest of the paper is organized as follows. In Section of the two networks to let the hidden latent factors of user 2, we describe DeepCoNN in detail. Experiments are pre- and item interact with each other. This layer calculates an User Review Text Item Review Text L L ook-u … … ook-U p . . . . . . . . . . . . . . . . p . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . … … *kj *kj Convolution …… …… Convolution M M ax-pooling … … ax-pooling Fully-co Fully Co n n n n ected xu … … yi ected Loss Function Figure 1: The architectureof theproposed model objective function that measures the rating prediction er- uses filterKj c×t on a window of words with size t. For rorusingthelatent factors producedbyNetu andNeti. In V1u:n,weperfor∈mℜaconvolutionoperationregardingeachker- the following subsections, since Netu and Neti only differ nel Kj in theconvolutional layer. iinntdheetiarili.npTuhtes,swameefopcruoscoenssililsuastprpaltiiendgftohreNpertoicewsisthfosrimNielatur zj =f(V1u:n∗Kj+bj) (2) layers. Heresymbol isconvolutionoperator,bj isabiastermand ∗ f is an activation function. In the proposed model, we use 2.3 WordRepresentation Rectified Linear Units (ReLUs) [22]. It is defined as Eq. A word embedding f :M n, where M representsthe 3. Deep convolutional neural networks with ReLUs train →ℜ dictionary of words, is a parameterized function mapping several times faster than their equivalents with tanh units words to n-dimensional distributed vectors. Recently, this [14]. approach hasboosted theperformance inmany NLPappli- f(x)=max 0,x (3) cations [12, 7]. DeepCoNN uses this representation tech- { } nique to exploit the semantics of reviews. In the look-up Followingtheworkof[7],wethenapplyEq. 4,amaxpool- layer, reviews are represented as a matrix of word embed- ingoperation, overthefeaturemap and takethemaximum dings to extract their semantic information. To achieve it, valueas thefeature correspondingto thisparticular kernel. all the reviews written by user u, denoted as user reviews, Themost importantfeatureofeachfeaturemap,whichhas aremergedintoasingledocumentdu1:n,consistingofnwords the highest value, has been captured. This pooling scheme intotal. Then,amatrixofwordvectors,denotedasV1u:n,is can naturally deal with thevaried length of thetext. After built for useru as follows: themaxpoolingoperation,convolutionalresultsarereduced to a fixedsize vector. Vu =φ(du) φ(du) φ(du) ... φ(du), (1) 1:n 1 ⊕ 2 ⊕ 3 ⊕ ⊕ n where du indicates the kth word of document du , look- oj =max{z1,z2,...,z(n−t+1)} (4) k 1:n up function φ(du) returns the corresponding c-dimensional We have described the process by which one feature is ex- k word vector for the word du, and is the concatenation tracted from one kernel. The model uses multiple filters to k ⊕ operator. Itshouldbeconsidered thattheorderofwordsis obtain various features and the output vector of theconvo- preserved in matrix Vu that is another advantage of this lutional layer is as Eq. 5. 1:n representation comparing to bag-of-words techniques. O={o1,o2,o3,...,on1}, (5) 2.4 CNN Layers wheren denotesthenumberofkernelin theconvolutional 1 Nextlayersincludingconvolutionlayer,maxpooling,and layer. fully connected layer follow the CNN model introduced in xu=f(W O+g) (6) [7]. Convolution layer consists of m neurons which produce × newfeaturesbyapplyingconvolutionoperatoronwordvec- The results from the max-pooling layer are passed to a torsVu ofuseru. Eachneuronj intheconvolutionallayer fully connected layer with weight matrix W. As shown in 1:n Eq. 6, the output of the fully connected layer xu n2×1 the internal structure of data and provide a mechanism for ∈ ℜ is considered as features for user u. Finally, the outputs of efficient useof words’ order in text modeling [11]. both user and item CNN xu and yi can beobtained. 2.7.2 OnlineLearning 2.5 The Shared Layer Scalabilityandhandlingdynamicpoolsofitemsandusers Althoughtheseoutputscanbeviewedasfeaturesofusers are considered as critical needs of many recommender sys- and items, they can be in different feature space and not tems. The time sensitivity of recommender systems poses comparable. Thus,tomapthemintothesamefeaturespace, a challenge in learning latent factors in an online fashion. we introduce a shared layer on the top to couple Netu and DeepCoNN is scalable to the size of the training data, and Neti. First, let us concatenate xu and yi into a single vec- alsoitcaneasilygettrainedandupdatedwithnewdatabe- tor ˆz = (xu,yi). To model all nested variable interactions causeitisbasedonNN.Updatinglatentfactorsofitemsor in ˆz, we introduce Factorization Machine (FM) [24] as the userscangetperformedindependentlyfromhistoricaldata. estimator of the corresponding rating. Therefore, given a Alltheapproacheswhichemploytopicmodelingtechniques batchofN trainingexamples ,wecanwritedownitscost do not benefitfrom theseadvantages tothis extent. T as Eq. 7. 3. EXPERIMENTS |zˆ| |zˆ| |zˆ| J =wˆ0+Xwîzî+X X vî,vˆj zîzˆj, (7) We have performed extensive experiments on a variety h i i=1 i=1j=i+1 of datasets to demonstrate the effectiveness of DeepCoNN compared to other state-of-the-art recommender systems. wherewˆ0istheglobalbias,wîmodelsthestrengthoftheith Wefirstpresentthedatasetsandtheevaluationmetricused variable in zând hvî,vˆji = P|fzˆ=|1vî,fvˆj,f. hvî,vˆji models in our experiments in Section 3.1. The baseline algorithms thesecond order interactions. selected for comparisons are explained in Section 3.2. Ex- 2.6 Network Training perimental settings are given in Section 3.3. Performance evaluation and some analysis of the model are discussed in Our network is trained by minimizing Eq. 7. We take sections 3.4 and 3.5 respectively. derivativesof J with respect to z, as shown in Eq. 8. 3.1 Datasets andEvaluationMetric |zˆ| ∂J ∂zî =wî+ X hvî,vˆjizˆj (8) daItnasoetusrteoxpevearilmuaetnetso,uwremhoadveel.selected the following three j=i+1 Thederivativesofotherparametersindifferentlayerscan Yelp: It is a large-scale dataset consisting of restau- • be computedby applyingdifferentiation chain rule. rantreviews,introducedinthe6throundofYelpChal- Givenasetoftrainingset consistingofN tuples,weop- lenge 1 in2015. Itcontainsmorethan1Mreviewsand T timize themodel through RMSprop[30] overshuffled mini- ratings. batches. RMSpropisanadaptiveversionofgradientdescent Amazon: AmazonReviewdataset[19]containsprod- which adaptively controls the step size with respect to the • uct reviews and metadata from Amazon website2. It absolute value of the gradient. It does it by scaling the includesmorethan143.7millionreviewsspanningfrom updatevalueofeachweightbyarunningaverageofitsgra- May 1996 to July 2014. It has 21 categories of items, dient norm. The updating rules for parameter set θ of the and as far as we know, thisis thelargest publicavail- networks are as thefollowing: able rating dataset with text reviews. ∂J 2 rt ←0.9(∂θ) +0.1rt−1 (9) • Beer: Itisabeerreviewdataset extractedfrom rate- beer.com. The data span a period of more than 10 λ ∂J years,includingalmost3millionreviewsuptoNovem- θ θ ( ) , (10) ber2011 [18]. ← − √rt+ǫ ∂θ where λ is the learning rate, ǫ is a small value added for As we can see in Table 2, all datasets contain more than half a million of reviews. However, in Yelp and Amazon, numericalstability. Additionally,topreventoverfitting,the customers provide less than six pair of reviews and ratings dropout[28]strategyhasalsobeenappliedtothefullycon- on average which shows these two datasets are extremely nected layers of thetwo networks. sparse. This sparsity can largely deteriorate the perfor- 2.7 Some AnalysisonDeepCoNN mance of recommender systems. Besides, in all datasets, each review consists of less than 150 words on average. 2.7.1 WordOrderPreservation Inourexperiments,weadoptthewell-knownMeanSquare Error(MSE)toevaluatetheperformanceofthealgorithms. Most of the recommender systems which use reviews in It is selected because most of the related works have used the modeling process employ topic modeling techniques to the same evaluation metric[17, 16, 1]. MSE can be defined model users or items [6]. Topic modeling techniques infer as follows: latent topicvariables usingthebag-of-words assumption,in which word order is ignored. However, in many text mod- 1 N eling applications, word order is crucial [32]. DeepCoNN is MSE = N X(rn−rˆn)2, (11) not based on topic modeling and uses word embeddings to n=1 createamatrixofwordvectorswheretheorderofwordsare 1https://www.yelp.com/dataset-challenge preserved. In this way, convolution operations make use of 2https://snap.stanford.edu/data/web-Amazon.html Table 2: The Statistics of the datasets Class #users #items #review #words #reviews per user #words perreview Yelp 366,715 60,785 1,569,264 198M 4.3 126.41 Amazon 6,643,669 2,441,053 34,686,770 4.053B 5.2 116.67 Beer 40,213 110,419 2,924,127 154M 72.7 52.67 where rn is the nth observed value, rˆn is the nth predicted 1.80 Yelp 1.85 Yelp valueand N is thetotal numberof observations. 1.75 1.80 3.2 Baselines 1.70 1.75 1.70 E1.65 E To validate the effectiveness of DeepCoNN, we have se- MS1.60 MS1.65 lected three categories of algorithms for evaluations: (i) 1.60 1.55 1.55 purely rating based models. We chose Matrix Factoriza- 1.50 1.50 tion(MF)andProbabilisticMatrixFactorization(PMF)to validatethatreviewinformation ishelpfulforrecommender 1.450 2n0umber4 0of laten6t0 factor8s0 100 1.450 n5u0mb1e0r0 of1 c5o0nv20o0lut2io5n0al3 k00ern3e5l0s400 systems, (ii) topicmodeling based models which usereview Figure 2: The impact of the number of latent factors and information. Most of the recommender systems which take convolutional kernels on the performance of DeepCoNN in reviewsintoconsiderationarebasedontopicmodelingtech- terms of MSE (Yelp Dataset). niques. To compare our model with topic modeling based recommender systems, we select three representative models: Latent Dirichlet Allocation (LDA) [5], Collaborative Topic Regression (CTR) [33] and Hidden Factor as Topic (HFT) [17], and (iii) deep recommender systems. In [35], 3.3 Experimental Settings authorshaveproposedastate-of-the-artdeeprecommender system named Collaborative Deep Learning (CDL). Note WedividedeachdatasetshowninTable2intothreesetsof thatallthebaselinesexceptMFandPMFhaveincorporated trainingset,validationset,andtestset. Weuse80%ofeach review information intotheirmodelstoimproveprediction. dataset as thetraining set, 10% is treated as the validation set to tune the hyper-parameters, and the rest is used as MF: Matrix Factorization [13] is the most popular the test set. All the hyper-parameters of the baselines and • CF-basedrecommendationmethod. Itonlyusesrating DeepCoNN are selected based on the performance on the matrix as input and estimates two low-rank matrices validation set. topredictratings. Inourimplementation,Alternating ForMFandPMF,weusedgridsearchtofindthebestval- LeastSquares(ALS)techniqueisadoptedtominimize uesforthenumberoflatentfactorsfrom 25,50,100,150,200 , its objective function. and regularization parameter from 0.00{1,0.01,0.1,1.0 . } { } For LDA, CTR and HFT, the number of topics K is se- PMF:ProbabilisticMatrixFactorizationisintroduced • lectedfrom 5,10,20,50,100 usingthevalidationset. We in [25]. It models latent factors of users and items by { } set K =10 for LDA and CTR. The CTR model solves the Gaussian distributions. one-class collaborative filtering problem [23] by using two different values for the precision parameter c of a Gaussian LDA:LatentDirichletAllocationisawell-knowntopic • distribution. Followingtheworkof[16],inourexperiments, modeling algorithm presented in [5]. In [17], it is pro- we set precision c as the same for all the observed ratings posedtoemployLDAtolearnatopicdistributionfrom for rating prediction. HFT-k (k = 10,50) are included to asetofreviewsforeachitem. Bytreatingthelearned show the impact of the number of latent factors for HFT. topicdistributionsaslatent featuresfor eachitem, la- By performing a grid search on the validation set, we set tent features for each user is estimated by optimizing rating prediction accuracy with gradient descent. hyper-parametersα=0.1, λu =0.02 and λv =10 for CTR and HFT. To optimize the performance of CDL, we per- CTR: Collaborative Topic Regression has been pro- formed a grid search on the hyper-parameters λu, λv, λn, • posed by [33]. It showed very good performance on λw and L. Similar with CTR, theconfidenceparameter cij recommending articles in a one-class collaborative fil- of CDL is set as thesame for all observed ratings. teringproblemwhereauseriseitherinterestedornot. We empirically studied the effects of two important parameters of DeepCoNN: the number of latent factors(xu | | HFT: Hidden Factor as Topic proposed in [17] em- and yi ) and the number of convolutional kernels: n1. In • | | ploys topic distributions to learn latent factors from Figure 2, we show the performance of DeepCoNN on the user or item reviews. The authors have shown that validation set of Yelp with varying xu and yi from 5 to | | | | itemspecifictopicdistributionsproducemoreaccurate 100andn1from10to400toinvestigateitssensitivity. Asit predictions than user specific ones. Thus, we report can be seen, it does not improvethe performance when the theresults of HFTlearning from item reviews. number of latent factors and number of kernels is greater than 50 and 100 respectively. Thus, we set xu = yi =50 CDL: Collaborative Deep Learning tightly couples a and n = 100. Other hyper-parameters: t,|c, |λ an|d|batch • 1 Bayesian formulation of the stacked denoising auto- sizearesetas3,300,0.002and100,respectively. Theseval- encodersandPMF.Themiddlelayerofauto-encoders ueswerechosenthroughagridsearchonthevalidationsets. serves as a bridgebetween auto-encodersand PMF. We used a pre-trained word embeddings which are trained Table 4: Comparing variants of the proposed model. asimplercouplingapproach? Toanswerthesequestions,we Best results are indicated in bold. compare theDeepCoNN with its fivevariants: DeepCoNN- User,DeepCoNN-Item,DeepCoNN-TFIDF,DeepCoNN-Ra Amazon ndom and DeepCoNN-DP. These five variants are summa- Model Yelp Music In- Beer rized in thefollowing: struments DeepCoNN-User 1.577 1.373 0.292 DeepCoNN-Item 1.578 1.372 0.296 • DeepCoNN-User: The Neti of DeepCoNN is substituted with a matrix. Each row of the matrix is the DeepCoNN-TFIDF 1.713 1.469 0.589 latent factors of one item. This matrix is randomly DeepCoNN-Random 1.799 1.517 0.627 initialized and optimized duringthetraining. DeepCoNN-DP 1.491 1.253 0.278 DeepCoNN 1.441 1.233 0.273 DeepCoNN-Item: SimilarwithDeepCoNN-User,the • Netu of DeepCoNN is replaced with a matrix. Each rowofthematrixisthelatentfactorsofoneuser. This on more than 100 billion words from Google News [21] 3. matrix is randomly initialized and optimized during OurmodelsareimplementedinTheano[29],awell-known thetraining. Pythonlibraryformachinelearninganddeeplearning. The NVIDIACUDADeepNeuralNetwork4(cuDNNv4)acceler- DeepCoNN-TFIDF: Instead of using word embed- atedourtrainingprocess. Allmodelsaretrainedandtested • ding, the TFIDF scheme is employed to represent re- on an NVIDIATesla K40 GPU. view text as inputto DeepCoNN. 3.4 Performance Evaluation DeepCoNN-Random: Ourbaselinemodelwhereall TheperformanceofDeepCoNNandthebaselines(seeSec- • wordrepresentationsarerandomlyinitializedasfixed- tion 3.2)arereported intermsof MSEin Tables3. Table 3 length vectors. showstheresultsonthethreedatasetsincludingtheperfor- manceaveraged onall21categories ofAmazon. Theexper- DeepCoNN-DP: The factorization machine in the iments are repeated 3 times, and the averages are reported • objectivefunctionissubstituedwithasimpledotprod- with the best performance shown in bold. The last column indicates the percentage of improvements gained by Deep- uct of xu and yi. CoNN compared to the best baseline in the corresponding The performance of DeepCoNN and its variants on Yelp, category. Beer and one category of the Amazon dataset: Music In- InTable3,allmodelsperformbetteronBeer datasetthan struments are given in Table 4. on Yelp andAmazon. It ismainly related tothesparsity of To demonstrate that the two deep CNNs can cooperate Yelp and Amazon. Although PMF performs better than with each other to learn effective latent factors from user MF on Yelp, Beer, and most categories of Amazon, both anditemreviews,DeepCoNN-UserandDeepCoNN-Itemare techniques do not show good performance compared to the trained with only one CNN with review text as input and ones which use reviews. It validates our hypothesis that the other CNN is substituted with a list of latent variables reviewtextprovidesadditionalinformation,andconsidering astheparameterstogetlearned. Inthismanner,latentfac- reviews in models can improverating prediction. tors of users or items are learned without considering their Although simply employing LDA to learn features from corresponding review text. As it can be seen in Table 4, item reviews can help the model to achieve improvements, whileDeepCoNN-UserandDeepCoNN-Itemachievesimilar LDA models reviews independent of ratings. Therefore, results,DeepCoNNdeliversthebestperformancebymodel- thereisnoguaranteethatthelearnedfeaturescanbebenefi- ingbothusersanditems. Itverifiesthatreviewtextisnec- cialtoratingprediction. Therefore,bymodelingratingsand essary for modeling latent factors of both users and items. reviewstogether,CTRandHFTattainadditionalimprove- Also, it shows that review text hasinformative information ments. Among those topic modeling based models (LDA, that can help to improve the performance of recommenda- CTR and HFT), both HFT-10 and HFT-50 perform better tion. in all three datasets. Furthermore, to validate the effectiveness of word repre- With the capability of extracting deep effective features sentation,we compareDeepCoNNwith DeepCoNN-TFIDF from item review text, as we can see in Table 3, CDL out- andDeepCoNN-Random. TheDeepCoNN-TFIDFandDee- performsalltopicmodelingbasedrecommendersystemsand pCoNN-Random are trained to show that word embedding advances the state-of-the-art. However, in benefiting from is helpful to capture semantic meaning existed in the re- jointmodelingcapacityandsemanticmeaningexistingfrom view text. While the performance of DeepCoNN-TFIDF is reviewtext,DeepCoNNbeatsthebestbaselineinYelp,Beer slightlybetterthanDeepCoNN-Random,theybothperform and Amazon and gains 8.3% improvementon average. considerablyweakerthanDeepCoNN.Itshowstheeffective- 3.5 Model Analysis ness of representing review text in semantic space for modeling thelatent factors of items or users. Arethetwoparallelnetworksreallycooperatetolearnef- At last, to investigate the efficiency of the shared layer, fectivefeaturesfromreviews? Doestheproposedmodelben- DeepCoNN-DPisintroducedthatcouplesthetwonetworks efitfromtheuseofwordembeddingtoexploitthesemantic with a simpler objective function. The comparison shows information in thereview text? How much does the shared thesuperiorityofthefactorizationmachinecoupling. Itcan layerhelpinimprovingthepredcitionaccuracycomparingto betheresultofnotonlymodelingthefirstorderinteractions 3https://code.google.com/archive/p/word2vec/ but also the second order interactions between xu and yi. Table 3: MSE Comparison with baselines. Best results are indicated in bold. Improvement Dataset MF PMF LDA CTR HFT-10 HFT-50 CDL DeepCoNN of DeepCoNN (%) Yelp 1.792 1.783 1.788 1.612 1.583 1.587 1.574 1.441 8.5% Amazon 1.471 1.460 1.459 1.418 1.378 1.383 1.372 1.268 7.6% Beer 0.612 0.527 0.306 0.305 0.303 0.302 0.299 0.273 8.7% Average on all datasets 1.292 1.256 1.184 1.112 1.088 1.09 1.081 0.994 8.3% Yelp Amazon (Music Instruments) Beer 0.20 0.21 0.14 User User 0.20 User MSE0.12 Item MSE0.15 Item MSE0.19 Item n n n 0.18 n i0.10 n i0.10 n i0.17 o o o cti cti cti0.16 edu0.08 edu0.05 edu0.15 r r r 0.06 0.14 0.00 0.13 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 number of training reviews number of training reviews number of training reviews Figure3: MSEimprovementachievedbyDeepCoNNcomparedtoMF.Forusersanditemswithdifferentnumberoftraining reviews, DeepCoNN gains different MSE reductions. 3.6 The Impact oftheNumber ofReviews information in online review text, and deep learning tech- The cold start problem [27] is prevalent in recommender niques employed for recommender systems. In this section, systems. Inparticular,whenanewuserjoinsoranewitem we give a short review of these two research areas and dis- is added to the system, their available ratings are limited. tinguish our work from theexisting approaches. It would not be easy for the system to learn preferences The first studies that used online review text in rating of such users just from their ratings. It has been shown prediction tasks were mostly focused on predicting ratings in some of the previous works that exploiting review text foranexistingreview[2,38],whileinourpaper,wepredict can help to alleviate this problem especially for users or theratingsfromthehistoryofreviewtextwrittenbyauser items with few ratings [17]. In this section, we conduct a to recommend desirable productsto that user. set of experiments to answer the following questions. Can One of the pioneer works that explored using reviews to DeepCoNN help to tackle the cold start problem? What is improve the rating prediction is presented in [10]. It found the impact of the number of reviews on the effectiveness of that reviews are usually related to different aspects, e.g., theproposed algorithm? price, service, positive or negative feelings, that can be ex- In Fig. 3, we have illustrated the reductions in MSE re- ploited for rating prediction. In [17], the authors proposed sultedfromDeepCoNNcomparedtoMFtechniqueonthree Hidden Factors as Topics (HFT) to employ topic model- datasets of Yelp, Beer, and a group of Amazon (Music In- ingtechniquestodiscoverlatentaspectsfromeitheritemor struments). By reduction in MSE, we mean the difference userreviews. Thismethodachievessignificantimprovement betweentheMSEofMFandtheMSEofDeepCoNN.Users compared to models which only use ratings or reviews. A and items are categorized based on the number of their re- similar approach is followed in [3] with the main difference views, and reductions are plotted for both users and items that it models user’s and items’ reviews simultaneously. In groups. Itcan beseen thatin all threedatasets, reductions [8],aprobabilisticmodelisproposedbasedoncollaborative arepositive,andDeepCoNNcanachieveRMSreductionon filtering and topic modeling. It uncovers aspects and senti- allgroupsofusersanditemswithfewnumberofratings. A mentsofusersanditems,butitdoesnotincorporateratings more important advantage of DeepCoNN is that higher re- duringmodelingreviews. RatingsMeetReviews(RMR)[16] ductions are gained for groups with fewer ratings. It shows also triestoharnesstheinformation ofbothratingsandre- thatDeepCoNNcanalleviatethesparsityproblemandhelp views. OnedifferencebetweenHFTandRMRisthatRMR on thecold start problem. applies topic modeling techniques on item review text and It can also be seen that there exists a relation between alignsthetopicswiththeratingdimensionstoimprovepre- the effectiveness of DeepCoNN and the number of ratings diction accuracy. for a user or item. For users or items with a lower number Overall, one limitation of the above studies is that their ofratings,DeepCoNNreductioninMSEishigher. Itshows textual similarity is solely based on lexical similarity. The thatreviewtextcanbevaluableinformationespeciallywhen vocabulary in English is very diverse, and two reviews can we havelimited information on theusers or items. be semantically similar even with low lexical overlapping. The semantic meaning is of particular importance and has been ignored in these works. Additionally, reviews are rep- 4. RELATED WORKS resented by using bag-of-words, and words’ order exists in There are two categories of studies related to our work: reviews has not been preserved. At last, the approaches techniquesthat model users and/or items byexploiting the which employ topic modeling techniquessuffer from a scal- gether by a shared common layer to model users and items abilityproblemandalsocannotdealwithnewcomingusers fromthereviews. Itmakestheuseranditemrepresentations and items. mapped into a common feature space. Similar to MF tech- Recently, several studies have been done to use neural niques, user and item latent factors can effectively interact network based models including deep learning techniques with each other to predict thecorresponding rating. for recommendation tasks. Severalworks[26,37,15]model Incomparisonwithstate-of-the-artbaselines,DeepCoNN usersand/oritemsfrom theratingmatrixusingneuralnet- achieved8.5%and7.6%improvementsondatasetsofYelp workslikedenoisingauto-encodersorRestrictedBoltzmann andBeer,respectively. OnAmazon,itoutperformedallthe Machines(RBM).Theyareconsideredascollaborativebased baselines and gained 8.7% improvement on average. Over- techniques because they just utilize the rating matrix and all, 8.3% improvement is attained by the proposed model ignore review text unlikeour approach. on all threedatasets. In [31] and [36], deep models of CNN and Deep Belief Additionally, in the experiments by limiting modeling to Network (DBN) are introduced to learn latent factors from just one of the users and items, we demonstrated that the musicdataformusicrecommendation. Inbothmodels, ini- two networkscould not only separately learn user anditem tially,theyfinduseranditemlatentfactorsusingmatrixfac- latentfactorsfromreviewtextbutalsocooperatewitheach torization techniques. Then, they train a deep model such other to boost the performance of rating prediction. Fur- thatitcanreconstructtheselatentfactorsfortheitemsfrom thermore,weshowedthatwordembeddingcouldbehelpful themusiccontent. Asimilarapproachisfollowed in[35]for tocapturesemanticmeaningofreviewtextbycomparingit movierecommendationbyusingageneralizedStackedAuto withavariantofDeepCoNNwhichusesrandomorTF-IDF Encoder (SAE) model. In all these works [31, 36, 35], an representations for reviews. item’slatentfactorsarelearnedfromitem’scontentandre- At last, we conducted experiments to investigate the im- view text is ignored. pactofthenumberofreviews. Experimentalresultsshowed In [9], a multi-view deep model is built to learn the user that for the users and items with few reviews or ratings, and item latent factors in a joint manner and map them DeepCoNNobtainsmorereductioninMSEthanMF.Espe- to a common space. The general architecture of the model cially, when only one review is available, DeepCoNN gains seemstohavesomesimilaritiestoourproposedmodel,butit the greatest MSE reduction. Thus, it validates that Deep- differsfromoursinsomeaspects. Theirmodelisacontent- CoNN can e¨ınˇA˘ectively alleviate thesparsity problem. based recommender system and does not use review text. Moreover,theiroutputsarecoupledwithacosinesimilarity 6. ACKNOWLEDGEMENTS objective function to produce latent factors with high sim- This work is supported in part by NSF through grants ilarity. In this way, user and item factors are not learned IIS-1526499, and CNS-1626432. We gratefully acknowledge explicitly in relation to therating information, and thereis the support of NVIDIA Corporation with the donation of no guarantee that the learned factors can help the recom- theTitan X GPU used for this research. mendation task. AlltheaboveNNbasedapproachesdifferfromDeepCoNN 7. REFERENCES because they ignore review text. To the best of our knowledge, the only work which has utilized deep learning tech- [1] A.Almahairi, K.Kastner, K.Cho, and A.Courville. niquestousereviewtexttoimproverecommendationispre- Learning distributedrepresentations from reviews for sented in [1]. Tousetheinformation exists in reviews, they collaborative filtering. InProceedings of the 9th ACM proposed a model consisting of a matrix factorization tech- Conference on Recommender Systems, pages 147–154. niqueand a RecurrentNeuralNetwork (RNN).The matrix ACM, 2015. factorization is responsible for learning thelatent factors of [2] S.Baccianella, A.Esuli, and F.Sebastiani. Multi-facet users and items, and the RNN models the likelihood of a rating of product reviews. In Advances in Information review using the item’s latent factors. The RNN model is Retrieval, pages 461–472. Springer,2009. combined with the MF simply via a trade-off term as some [3] Y.Bao, H.Fang, and J. Zhang. Topicmf: sortofaregularizationtermtotamethecurseofdataspar- Simultaneously exploitingratings and reviews for sity. Due to the matrix factorization technique, handling recommendation. In AAAI,pages 2–8. AAAIPress, newusersanditemsisnottrivialinthismodelunlikeDeep- 2014. CoNN that handles them easily. Their proposed algorithm [4] Y.Bengio, H.Schwenk,J.-S. Senécal, F.Morin, and does not model users and items explicitly in a joint man- J.-L. Gauvain.Neural probabilistic language models. nerfromtheirreviews,anditjustusesreviewstoregularize InInnovations in Machine Learning, pages 137–186. their model. In addition, since item text is represented by Springer,2006. using bag-of-words, semantic meaning existing in words has [5] D.M. Blei, A. Y.Ng, and M. I.Jordan. Latent not been explored. dirichlet allocation. the Journal of machine Learning research, 3:993–1022, 2003. 5. CONCLUSION [6] L. Chen,G. Chen, and F. Wang.Recommender Itisshown thatreviewswritten byuserscanrevealsome systemsbased on user reviews: thestate of theart. info on the customer buying and rating behavior, and also User Modeling and User-Adapted Interaction, reviews writtenforitemsmaycontaininfo ontheirfeatures 25(2):99–154, 2015. and properties. In this paper, we presented Deep Cooper- [7] R.Collobert, J. Weston,L. Bottou, M. Karlen, ative Neural Networks (DeepCoNN) which exploits the in- K.Kavukcuoglu,and P. Kuksa.Naturallanguage formation exists in the reviews for recommender systems. processing (almost) from scratch. The Journal of DeepCoNNconsistsoftwodeepneuralnetworkscoupledto- Machine Learning Research, 12:2493–2537, 2011. [8] Q.Diao, M. Qiu, C. Wu,A.J. Smola, J. Jiang, and Proceedings of the 27th International Conference on C. Wang. Jointly modeling aspects, ratings and Machine Learning (ICML-10), pages 807–814, 2010. sentimentsfor movierecommendation (JMARS).In [23] R.Pan, Y.Zhou, B. Cao, N.N.Liu, R. Lukose, KDD, pages 193–202. ACM, 2014. M. Scholz, and Q.Yang.One-class collaborative [9] A.M. Elkahky,Y.Song, and X.He. A multi-view filtering. InData Mining, 2008. ICDM’08. Eighth deep learning approach for cross domain user IEEE International Conference on, pages 502–511. modeling in recommendation systems. In Proceedings IEEE, 2008. of the 24th International Conference on World Wide [24] S.Rendle.Factorization machines with libfm. ACM Web, pages 278–288. International World WideWeb Transactions on Intelligent Systems and Technology Conferences SteeringCommittee, 2015. (TIST),3(3):57, 2012. [10] N.Jakob, S. H.Weber,M. C. Mu¨ller, and [25] R.Salakhutdinovand A.Mnih. Probabilistic matrix I.Gurevych.Beyond thestars: exploiting free-text factorization. In NIPS, pages 1257–1264. Curran user reviews to improvetheaccuracy of movie Associates, Inc.,2007. recommendations. In Proceedings of the 1st [26] R.Salakhutdinov,A.Mnih, and G. Hinton.Restricted international CIKM workshop on Topic-sentiment boltzmann machines for collaborative filtering. In analysis for mass opinion,pages 57–64. ACM, 2009. Proceedings of the 24th international conference on [11] R.Johnson and T. Zhang. Effectiveuse of word order Machine learning, pages 791–798. ACM, 2007. for text categorization with convolutional neural [27] A.I. Schein,A.Popescul, L. H. Ungar,and D. M. networks. InHLT-NAACL, pages 103–112. The Pennock.Methods and metrics for cold-start Association for Computational Linguistics, 2015. recommendations. In Proceedings of the 25th annual [12] Y.Kim. Convolutional neural networksfor sentence international ACM SIGIR conference on Research and classification. arXiv preprint arXiv:1408.5882, 2014. development in information retrieval, pages 253–260. [13] Y.Koren, R.Bell, and C. Volinsky.Matrix ACM, 2002. factorization techniquesfor recommender systems. [28] N.Srivastava,G. Hinton,A. Krizhevsky,I.Sutskever, Computer, (8):30–37, 2009. and R.Salakhutdinov.Dropout: A simple way to [14] A.Krizhevsky,I. Sutskever,and G. E. Hinton. preventneuralnetworks from overfitting.The Journal Imagenet classification with deep convolutional neural of Machine Learning Research, 15(1):1929–1958, 2014. networks. InAdvances in neural information [29] Theano DevelopmentTeam. Theano: A Python processing systems, pages 1097–1105, 2012. framework for fast computation of mathematical [15] S.Li, J. Kawale, and Y.Fu.Deep collaborative expressions. arXiv e-prints, abs/1605.02688, May filtering viamarginalized denoising auto-encoder. In 2016. Proceedings of the 24th ACM International on [30] T. Tieleman and G. Hinton.Lecture6.5-rmsprop: Conference on Information and Knowledge Dividethegradient by a runningaverage of its recent Management, pages 811–820. ACM, 2015. magnitude.COURSERA: Neural Networks for [16] G. Ling, M. R. Lyu,and I. King. Ratingsmeet Machine Learning, 4:2, 2012. reviews, acombined approach to recommend.In [31] A.Van den Oord,S. Dieleman, and B. Schrauwen. Proceedings of the 8th ACM Conference on Deep content-basedmusic recommendation. In Recommender systems, pages 105–112. ACM, 2014. Advances in Neural Information Processing Systems, [17] J. McAuley and J. Leskovec.Hiddenfactors and pages 2643–2651, 2013. hiddentopics: understandingrating dimensions with [32] H.M. Wallach. Topic modeling: beyond bag-of-words. review text.In Proceedings of the 7th ACM conference InProceedings of the 23rd international conference on on Recommender systems, pages 165–172. ACM, 2013. Machine learning, pages 977–984. ACM, 2006. [18] J. McAuley,J. Leskovec, and D. Jurafsky.Learning [33] C.Wangand D.M. Blei. Collaborative topicmodeling attitudesand attributesfrom multi-aspect reviews. In for recommending scientificarticles. In Proceedings of Data Mining (ICDM), 2012 IEEE 12th International the 17th ACM SIGKDD international conference on Conference on, pages 1020–1025. IEEE, 2012. Knowledge discovery and data mining,pages 448–456. [19] J. McAuley,R. Pandey,and J. Leskovec.Inferring ACM, 2011. networksof substitutableand complementary [34] H.Wang, Y.Lu, and C. Zhai. Latent aspect rating products.In Proceedings of the 21th ACM SIGKDD analysis on review text data: a rating regression International Conference on Knowledge Discovery and approach.In Proceedings of the 16th ACM SIGKDD Data Mining,pages 785–794. ACM, 2015. international conference on Knowledge discovery and [20] T. Mikolov, M. Karafiát, L.Burget, J. Cernocky`,and data mining,pages 783–792. ACM, 2010. S.Khudanpur.Recurrentneural network based [35] H.Wang, N.Wang, and D.-Y.Yeung.Collaborative language model. InINTERSPEECH,pages deep learning for recommender systems. In 1045–1048, 2010. Proceedings of the 21th ACM SIGKDD International [21] T. Mikolov, I. Sutskever,K. Chen,G. S. Corrado, and Conference on Knowledge Discovery and Data Mining, J. Dean. Distributedrepresentations of words and pages 1235–1244. ACM, 2015. phrases and theircompositionality. In Advances in [36] X.Wangand Y.Wang. Improvingcontent-basedand neural information processing systems, pages hybridmusicrecommendation using deep learning. In 3111–3119, 2013. Proceedings of the ACM International Conference on [22] V.Nair and G. E. Hinton.Rectified linear units Multimedia,pages 627–636. ACM, 2014. improverestricted boltzmann machines. In [37] Y.Wu,C. DuBois, A.X.Zheng, and M. Ester. Collaborative denoising auto-encoders for top-n recommendersystems. [38] Y.Wu and M. Ester. FLAME: A probabilistic model combiningaspect based opinion mining and collaborative filtering. In WSDM,pages 199–208. ACM, 2015.