Efficient Character-level Document Classification by Combining Convolution and Recurrent Layers

Yijun Xiao
Center for Data Science, New York University
[email protected]

Kyunghyun Cho
Courant Institute and Center for Data Science, New York University
[email protected]

arXiv:1602.00367v1 [cs.CL] 1 Feb 2016

Abstract

Document classification tasks were primarily tackled at word level. Recent research that works with character-level inputs shows several benefits over word-level approaches, such as natural incorporation of morphemes and better handling of rare words. We propose a neural network architecture that utilizes both convolution and recurrent layers to efficiently encode character inputs. We validate the proposed model on eight large-scale document classification tasks and compare with character-level convolution-only models. It achieves comparable performance with many fewer parameters.

1 Introduction

Document classification is a task in natural language processing where one needs to assign a single or multiple predefined categories to a sequence of text. A conventional approach to document classification generally consists of a feature extraction stage followed by a classification stage. For instance, it is usual to use a TF-IDF vector of a given document as an input feature to a subsequent classifier.

More recently, it has become more common to use a deep neural network, which jointly performs feature extraction and classification, for document classification (Kim, 2014; Mesnil et al., 2014; Socher et al., 2013; Carrier and Cho, 2014). In most cases, an input document is represented as a sequence of words, each of which is represented as a one-hot vector.[1] Each word in the sequence is projected into a continuous vector space by being multiplied with a weight matrix, forming a sequence of dense, real-valued vectors. This sequence is then fed into a deep neural network which processes the sequence in multiple layers, resulting in a prediction probability. This whole pipeline, or network, is tuned jointly to maximize the classification accuracy on a training set.

[1] A one-hot vector of the i-th word is a binary vector whose elements are all zeros, except for the i-th element, which is set to one.

One important aspect of these recent approaches based on deep learning is that they often work at the level of words. Despite its recent success, the word-level approach has a number of major shortcomings.

First, it is statistically inefficient, as each word token is considered separately and estimated with the same number of parameters, despite the fact that many words share a common root, prefix or suffix. This can be overcome by using an external mechanism to segment each word and infer its components (root, prefix, suffix), but this is not desirable, as such a mechanism is highly language-dependent and is tuned independently from the target objective of document classification.

Second, the word-level approach cannot handle out-of-vocabulary words. Any word that is not present, or is rare, in the training corpus is mapped to an unknown-word token. This is problematic because the model cannot handle typos easily, which happen frequently in informal documents such as postings from social network sites. It also makes it difficult to apply a trained model to a new domain, as there may be a large mismatch between the domain of the training corpus and the target domain.

Recently, a number of researchers have noticed that it is not at all necessary for a deep neural network to work at the word level. As long as the document is represented as a sequence of one-hot vectors, the model works without any change, regardless of whether each one-hot vector corresponds to a word, a sub-word unit or a character. Based on this intuition, Kim et al. (2015) and Ling et al. (2015) proposed to use a character sequence as an alternative to the word-level one-hot vector. A similar idea was applied to dependency parsing in (Ballesteros et al., 2015). The work in this direction most relevant to this paper is the character-level convolutional network for document classification by Zhang et al. (2015).
The character-level convolutional net in (Zhang et al., 2015) is composed of many layers of convolution and max-pooling, similarly to convolutional networks in computer vision (see, e.g., (Krizhevsky et al., 2012)). Each layer first extracts features from small, overlapping windows of the input sequence and pools over small, non-overlapping windows by taking the maximum activation in each window. This is applied recursively (with untied weights) many times. The final convolutional layer's activation is flattened to form a vector, which is then fed into a small number of fully-connected layers followed by the classification layer.

We notice that the use of a vanilla convolutional network for character-level document classification has one shortcoming. As the receptive field of each convolutional layer is often small (7 or 3 in (Zhang et al., 2015)), the network must have many layers in order to capture long-term dependencies in an input sentence. This is likely the reason why Zhang et al. (2015) used a very deep convolutional network with six convolutional layers followed by two fully-connected layers.

In order to overcome this inefficiency in modeling a character-level sequence, in this paper we propose a hybrid of convolutional and recurrent networks. This was motivated by recent successes in applying recurrent networks to natural languages (see, e.g., (Cho et al., 2014; Sundermeyer et al., 2015)) and by the fact that a recurrent network can efficiently capture long-term dependencies even with a single layer. The hybrid model processes an input sequence of characters with a number of convolutional layers followed by a single recurrent layer. Because the recurrent layer, consisting of either gated recurrent units (GRU; Cho et al., 2014) or long short-term memory units (LSTM; Hochreiter and Schmidhuber, 1997; Gers et al., 2000), can efficiently capture long-term dependencies, the proposed network needs only a very small number of convolutional layers.

We empirically validate the proposed model, to which we refer as a convolution-recurrent network, on the eight large-scale document classification tasks from (Zhang et al., 2015). We mainly compare the proposed model against the convolutional network in (Zhang et al., 2015) and show that it is indeed possible to use a much smaller model to achieve the same level of classification performance when a recurrent layer is put on top of the convolutional layers.

2 Basic Building Blocks: Neural Network Layers

In this section, we describe four basic layers in a neural network that will later be used to constitute a single network for classifying a document.

2.1 Embedding Layer

As mentioned earlier, each document is represented as a sequence of one-hot vectors. A one-hot vector of the i-th symbol in a vocabulary is a binary vector whose elements are all zeros except for the i-th element, which is set to one. Therefore, each document is a sequence of T one-hot vectors (x_1, x_2, ..., x_T).

An embedding layer projects each of the one-hot vectors into a d-dimensional continuous vector space R^d. This is done by simply multiplying the one-hot vector from the left with a weight matrix W ∈ R^{d×|V|}, where |V| is the number of unique symbols in the vocabulary:

    e_t = W x_t.

After the embedding layer, the input sequence of one-hot vectors becomes a sequence of dense, real-valued vectors (e_1, e_2, ..., e_T).
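The paper ships no reference implementation; the following is a minimal sketch of the embedding layer, assuming PyTorch. Multiplying a one-hot vector x_t from the left by W is equivalent to looking up one column of W, which is exactly what an embedding lookup does. The sizes |V| = 96 and d = 8 are the ones reported in Sec. 4.2; all variable names are illustrative.

```python
import torch
import torch.nn as nn

vocab_size, d = 96, 8                 # |V| = 96 characters, d = 8 (Sec. 4.2)
embed = nn.Embedding(vocab_size, d)   # its weight rows play the role of W's columns

# A document is stored as integer character ids rather than explicit one-hot
# vectors; shape (batch, T).
char_ids = torch.randint(0, vocab_size, (2, 100))
E = embed(char_ids)                   # (batch, T, d): the sequence (e_1, ..., e_T)
```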
2.2 Convolutional Layer

A convolutional layer consists of two stages. In the first stage, a set of d' filters of receptive field size r, F ∈ R^{d'×r}, is applied to the input sequence:

    f_t = φ(F [e_{t−(r/2)+1}; ...; e_t; ...; e_{t+(r/2)}]),

where φ is a nonlinear activation function such as tanh or a rectifier. This is done for every time step of the input sequence, resulting in a sequence F = (f_1, f_2, ..., f_T).

The resulting sequence F is max-pooled with size r':

    f'_t = max(f_{(t−1)×r'+1}, ..., f_{t×r'}),

where max is applied element-wise to the vectors, resulting in a sequence

    F' = (f'_1, f'_2, ..., f'_{T/r'}).

2.3 Recurrent Layer

A recurrent layer consists of a recursive function f which takes as input one input vector and the previous hidden state, and returns the new hidden state:

    h_t = f(x_t, h_{t−1}),

where x_t ∈ R^d is one time step from the input sequence (x_1, x_2, ..., x_T), and h_0 ∈ R^{d'} is often initialized as an all-zero vector.

Recursive Function  The most naive recursive function is implemented as

    h_t = tanh(W_x x_t + U_h h_{t−1}),

where W_x ∈ R^{d'×d} and U_h ∈ R^{d'×d'} are the weight matrices. This naive recursive function, however, is known to suffer from the problem of vanishing gradient (Bengio et al., 1994; Hochreiter et al., 2001).

More recently, it is common to use a more complicated function that learns to control the flow of information so as to prevent the vanishing gradient and allow the recurrent layer to more easily capture long-term dependencies. The long short-term memory (LSTM) unit from (Hochreiter and Schmidhuber, 1997; Gers et al., 2000) is a representative example. The LSTM unit consists of four sub-units (input, output and forget gates, and a candidate memory cell), which are computed by

    i_t = σ(W_i x_t + U_i h_{t−1}),
    o_t = σ(W_o x_t + U_o h_{t−1}),
    f_t = σ(W_f x_t + U_f h_{t−1}),
    c̃_t = tanh(W_c x_t + U_c h_{t−1}).

Based on these, the LSTM unit first computes the memory cell:

    c_t = i_t ⊙ c̃_t + f_t ⊙ c_{t−1},

and then computes the output, or activation:

    h_t = o_t ⊙ tanh(c_t).

The resulting sequence from the recurrent layer is then (h_1, h_2, ..., h_T), where T is the length of the input sequence to the layer.

Bidirectional Recurrent Layer  One property of the recurrent layer is that there is an imbalance in the amount of information seen by the hidden states at different time steps. The earlier hidden states observe only a few vectors from the lower layer, while the later ones are computed based on most of the lower-layer vectors. This can be easily alleviated by having a bidirectional recurrent layer, which is composed of two recurrent layers working in opposite directions. This layer returns two sequences of hidden states from the forward and reverse recurrent layers, respectively.
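As a concrete illustration of the two building blocks above, here is a short sketch, again assuming PyTorch rather than the authors' actual code: a one-dimensional convolution with ReLU and max-pooling plays the role of the convolutional layer of Sec. 2.2, and nn.LSTM with bidirectional=True provides the bidirectional recurrent layer of Sec. 2.3. The sizes follow Sec. 4.2; the tensor names are illustrative.

```python
import torch
import torch.nn as nn

d, d_prime, r, r_prime = 8, 128, 5, 2      # embedding size d, filters d', receptive field r, pooling r'

conv = nn.Sequential(
    nn.Conv1d(d, d_prime, kernel_size=r, padding=r // 2),  # first stage: filter bank F + nonlinearity
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=r_prime),                      # second stage: max-pooling with size r'
)
birnn = nn.LSTM(input_size=d_prime, hidden_size=d_prime,
                batch_first=True, bidirectional=True)

E = torch.randn(2, 100, d)                 # (batch, T, d) from the embedding layer
F_prime = conv(E.transpose(1, 2))          # (batch, d', T / r')
H, _ = birnn(F_prime.transpose(1, 2))      # (batch, T / r', 2 * d'): forward and reverse states
```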
2.4 Classification Layer

A classification layer is in essence a logistic regression classifier. Given a fixed-dimensional input from the lower layer, the classification layer affine-transforms it, followed by a softmax activation function (Bridle, 1990), to compute the predictive probabilities for all the categories. This is done by

    p(y = k | X) = exp(w_k^T x + b_k) / Σ_{k'=1}^{K} exp(w_{k'}^T x + b_{k'}),

where the w_k's and b_k's are the weight and bias vectors. We assume there are K categories.

It is worth noting that this classification layer takes as input a fixed-dimensional vector, while the recurrent or convolutional layer returns a variable-length sequence of vectors (with the length determined by the input sequence). This can be addressed by simply max-pooling the vectors over the time dimension (Kim, 2014) (for both convolutional and recurrent layers), taking the last hidden state (for recurrent layers), or taking the last hidden states of the forward and reverse recurrent networks (for bidirectional recurrent layers).

3 Character-Level Convolutional-Recurrent Network

In this section, we propose a hybrid of convolutional and recurrent networks for character-level document classification.

3.1 Motivation

One basic motivation for using the convolutional layer is that it learns to extract higher-level features that are invariant to local translation. By stacking multiple convolutional layers, the network can efficiently extract higher-level, abstract, (locally) translation-invariant features from the input sequence, in this case the document.

Despite this advantage, we noticed that it requires many layers of convolution to capture long-term dependencies, due to the locality of the convolution and pooling (see Sec. 2.2). This becomes more severe as the length of the input sequence grows, and in the case of character-level modeling it is usual for a document to be a sequence of hundreds or thousands of characters. Ultimately, this leads to the need for a very deep network with many convolutional layers.

Contrary to the convolutional layer, the recurrent layer from Sec. 2.3 is able to capture long-term dependencies even when there is only a single layer. This is especially true in the case of a bidirectional recurrent layer, because each hidden state is computed based on the whole input sequence. However, the recurrent layer is computationally more expensive. Its computational complexity grows linearly with respect to the length of the input sequence, and most of the computations need to be done sequentially. This is in contrast to the convolutional layer, for which computations can be done efficiently in parallel.

Based on these observations, we propose to combine the convolutional and recurrent layers into a single model so that the network can capture long-term dependencies in the document more efficiently for the task of classification.

Figure 1: Graphical illustration of (a) the convolutional network and (b) the proposed convolution-recurrent network for character-level document classification.

3.2 Model Description

The proposed model, to which we refer as a convolution-recurrent network (ConvRec), starts with a one-hot sequence input

    X = (x_1, x_2, ..., x_T).

This input sequence is turned into a sequence of dense, real-valued vectors

    E = (e_1, e_2, ..., e_T)

using the embedding layer from Sec. 2.1.

We apply multiple convolutional layers (Sec. 2.2) to E to get a shorter sequence of feature vectors:

    F = (f_1, f_2, ..., f_{T'}).

This feature-vector sequence is then fed into a bidirectional recurrent layer (Sec. 2.3), resulting in two sequences

    H_forward = (→h_1, →h_2, ..., →h_{T'}),
    H_reverse = (←h_1, ←h_2, ..., ←h_{T'}).

We take the last hidden states of both directions and concatenate them to form a fixed-dimensional vector:

    h = [→h_{T'}; ←h_1].

Finally, the fixed-dimensional vector h is fed into the classification layer to compute the predictive probabilities p(y = k | X) of all the categories k = 1, ..., K given the input sequence X.

See Fig. 1 (b) for a graphical illustration of the proposed model.
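Putting the pieces together, the following sketch shows one plausible ConvRec forward pass as just described, assuming PyTorch; the class name, the default receptive fields (5, 5, 3) and pooling sizes (2, 2, 2) mirror the three-convolutional-layer configuration described in Sec. 4.2, but the module is illustrative rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class ConvRec(nn.Module):
    """Illustrative convolution-recurrent network (Sec. 3.2), not the authors' code."""

    def __init__(self, vocab_size=96, d=8, d_prime=128, num_classes=4,
                 conv_layers=((5, 2), (5, 2), (3, 2))):  # (receptive field r, pooling r') per layer
        super().__init__()
        self.d_prime = d_prime
        self.embed = nn.Embedding(vocab_size, d)
        blocks, in_ch = [], d
        for r, r_prime in conv_layers:
            blocks += [nn.Conv1d(in_ch, d_prime, kernel_size=r, padding=r // 2),
                       nn.ReLU(),
                       nn.MaxPool1d(r_prime)]
            in_ch = d_prime
        self.convs = nn.Sequential(*blocks)
        self.birnn = nn.LSTM(d_prime, d_prime, batch_first=True, bidirectional=True)
        self.classify = nn.Linear(2 * d_prime, num_classes)

    def forward(self, char_ids):               # char_ids: (batch, T) integer character ids
        E = self.embed(char_ids)               # (batch, T, d)
        F = self.convs(E.transpose(1, 2))      # (batch, d', T')
        H, _ = self.birnn(F.transpose(1, 2))   # (batch, T', 2 * d')
        # h = [forward state at step T'; reverse state after reading the whole sequence]
        h = torch.cat([H[:, -1, :self.d_prime], H[:, 0, self.d_prime:]], dim=1)
        return self.classify(h)                # unnormalized class scores; softmax applied in the loss
```

In PyTorch the reverse direction's final state is stored at time index 0 of the output, which is why the concatenation reads H[:, 0, d':]; this corresponds to ←h_1 in the notation above.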
Dataset                   Classes  Task                           Training size  Test size
AG's news                 4        news categorization            120,000        7,600
Sogou news                5        news categorization            450,000        60,000
DBPedia                   14       ontology classification        560,000        70,000
Yelp review polarity      2        sentiment analysis             560,000        38,000
Yelp review full          5        sentiment analysis             650,000        50,000
Yahoo! Answers            10       question type classification   1,400,000      60,000
Amazon review polarity    2        sentiment analysis             3,600,000      400,000
Amazon review full        5        sentiment analysis             3,000,000      650,000

Table 1: Datasets summary.

3.3 Related Work

Convolutional network for document classification  The convolutional networks for document classification proposed earlier in (Kim, 2014; Zhang et al., 2015), and illustrated in Fig. 1 (a), are almost identical to the proposed model. The major difference is the lack of a recurrent layer in their models. Their models consist of the embedding layer and a number of convolutional layers, followed by the classification layer only.

Recurrent network for document classification  Carrier and Cho (2014) give a tutorial on using a recurrent neural network for sentiment analysis, which is one type of document classification. Unlike the convolution-recurrent network proposed in this paper, they do not use any convolutional layer in their model. Their model starts with the embedding layer followed by the recurrent layer. The hidden states from the recurrent layer are then averaged and fed into the classification layer.

Hybrid model: Conv-GRNN  Perhaps the most related work is the convolution-gated recurrent neural net (Conv-GRNN) from (Tang et al., 2015). They proposed a hierarchical processing of a document. In their model, either a convolutional network or a recurrent network is used to extract a feature vector from each sentence, and another (bidirectional) recurrent network is used to extract a feature vector of the document by reading the sequence of sentence vectors. This document vector is used by the classification layer.

The major difference between their approach and the proposed ConvRec is in the purpose of combining the convolutional and recurrent networks. In their model, the convolutional network is strictly constrained to model each sentence, and the recurrent network to model inter-sentence structure. On the other hand, the proposed ConvRec uses a recurrent layer in order to assist the convolutional layers in capturing long-term dependencies (across the whole document) more efficiently. These purposes are orthogonal to each other, and it is possible to plug in the proposed ConvRec as a sentence feature extraction module in the Conv-GRNN from (Tang et al., 2015). Similarly, it is possible to use the proposed ConvRec as a composition function for the sequence of sentence vectors to make computation more efficient, especially when the input document consists of many sentences.

Recursive Neural Networks  A recursive neural network has been applied to sentence classification earlier (see, e.g., (Socher et al., 2013)). In this approach, a composition function is defined and recursively applied at each node of the parse tree of an input sentence to eventually extract a feature vector of the sentence. This model family is heavily dependent on an external parser, unlike the ConvRec proposed here as well as the other related models described above. It is also not trivial to apply the recursive neural network to documents which consist of multiple sentences. We therefore do not consider this family of recursive neural networks directly related to the proposed model.
4 Experiment Settings

4.1 Task Description

We validate the proposed model on eight large-scale document classification tasks from (Zhang et al., 2015). The sizes of the datasets range from 200,000 to 4,000,000 documents. These tasks include sentiment analysis (Yelp reviews, Amazon reviews), ontology classification (DBPedia), question type classification (Yahoo! Answers), and news categorization (AG's news, Sogou news).

Data Sets  A summary of the statistics for each data set is listed in Table 1. There is an equal number of examples in each class for both the training and test sets. The DBPedia dataset, for example, has 40,000 training and 5,000 test examples per class. For more detailed information on the data set construction process, see (Zhang et al., 2015).

4.2 Model Settings

Referring to Sec. 2.1, the vocabulary V for our experiments consists of 96 characters, including all upper-case and lower-case letters, digits, common punctuation marks, and spaces. The character embedding size d is set to 8.

As described in Sec. 3.1, we believe that by adding recurrent layers one can effectively reduce the number of convolutional layers needed to capture long-term dependencies. Thus, for each data set, we consider models with two to five convolutional layers. Following the notation in Sec. 2.2, each layer has d' = 128 filters. For AG's news and Yahoo! Answers, we also experiment with larger models with 1,024 filters in the convolutional layers. The receptive field size r is either five or three depending on the depth. The max-pooling size r' is set to 2. Rectified linear units (ReLUs, (Glorot et al., 2011)) are used as activation functions in the convolutional layers. The recurrent layer (Sec. 2.3) is fixed to a single layer of bidirectional LSTM for all models. The hidden state dimension d' is set to 128. More detailed setups are described in Table 2.

            Embedding (Sec. 2.1)   Convolutional (Sec. 2.2)              Recurrent (Sec. 2.3)
Model       |V|    d               d'   r            r'           φ      d'
C2R1DD      96     8               D    5,3          2,2          ReLU   D
C3R1DD      96     8               D    5,5,3        2,2,2        ReLU   D
C4R1DD      96     8               D    5,5,3,3      2,2,2,2      ReLU   D
C5R1DD      96     8               D    5,5,3,3,3    2,2,2,1,2    ReLU   D

Table 2: Different architectures tested in this paper.

Dropout (Srivastava et al., 2014) is an effective way to regularize deep neural networks. We apply dropout after the last convolutional layer as well as after the recurrent layer. Without dropout, the inputs to the recurrent layer x_t are

    x_t = f'_t,

where f'_t is the t-th output from the last convolutional layer defined in Sec. 2.2. After adding dropout, we have

    r^i_t ∼ Bernoulli(p),
    x_t = r_t ⊙ f'_t,

where p is the dropout probability, which we set to 0.5, and r^i_t is the i-th component of the binary vector r_t ∈ R^{d'}.

4.3 Training and Validation

For each of the data sets, we randomly split the full training examples into training and validation sets. The validation size is the same as the corresponding test size and is balanced in each class.

The models are trained by minimizing the following regularized negative log-likelihood, or cross-entropy, loss, where the X's and y's are document character sequences and their corresponding observed class assignments in the training set D, and w is the collection of model weights. Weight decay is applied with λ = 5 × 10^{-4}:

    l = − Σ_{(X,y) ∈ D} log p(y | X) + (λ/2) ||w||².

We train our models using AdaDelta (Zeiler, 2012) with ρ = 0.95, ε = 10^{-5} and a batch size of 128. Examples are padded to the longest sequence in each batch, and masks are generated to help identify the padded region. The corresponding masks of the outputs from the convolutional layers can be computed analytically and are used by the recurrent layer to properly ignore padded inputs.
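A sketch of this training setup, assuming PyTorch and reusing the illustrative ConvRec module from Sec. 3.2, might look as follows; the hyperparameters (λ = 5 × 10^{-4}, ρ = 0.95, ε = 10^{-5}, batch size 128) are the ones quoted above, while the function and variable names are hypothetical.

```python
import torch
import torch.nn as nn

model = ConvRec(num_classes=4)            # illustrative module; dropout with p = 0.5 would be
                                          # inserted after the last conv layer and after the LSTM
criterion = nn.CrossEntropyLoss()         # negative log-likelihood of the softmax output
optimizer = torch.optim.Adadelta(model.parameters(),
                                 rho=0.95, eps=1e-5,
                                 weight_decay=5e-4)  # weight decay, lambda = 5e-4

def train_step(char_ids, labels):
    """One minibatch update (batch size 128 in the paper)."""
    optimizer.zero_grad()
    loss = criterion(model(char_ids), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```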
                                 Our Model                            (Zhang et al., 2015)
Dataset      #Ex.   #Cl.   Network      #Params  Error (%)       Network      #Params  Error (%)
AG           120k   4      C2R1D1024    20M      8.39 / 8.64     C6F2D1024    27M      - / 9.85
Sogou        450k   5      C3R1D128     0.4M     4.82 / 4.83     C6F2D1024*   27M      - / 4.88
DBPedia      560k   14     C2R1D128     0.3M     1.46 / 1.43     C6F2D1024    27M      - / 1.66
Yelp P.      560k   2      C2R1D128     0.3M     5.50 / 5.51     C6F2D1024    27M      - / 5.25
Yelp F.      650k   5      C2R1D128     0.3M     38.00 / 38.18   C6F2D1024    27M      - / 38.40
Yahoo A.     1.4M   10     C2R1D1024    20M      28.62 / 28.26   C6F2D1024*   27M      - / 29.55
Amazon P.    3.6M   2      C3R1D128     0.4M     5.64 / 5.87     C6F2D256*    2.7M     - / 5.50
Amazon F.    3.0M   5      C3R1D128     0.4M     40.30 / 40.77   C6F2D256*    2.7M     - / 40.53

Table 3: Results on character-level document classification. A model name of the form C#R#F#D# refers to a network with the given numbers of convolutional layers (C), recurrent layers (R), fully-connected layers (F), and feature dimension (D). * denotes a model which does not distinguish between lower-case and upper-case letters. We only considered the character-level models without thesaurus-based data augmentation. We report both the validation and test errors (shown as validation/test). In our case, the network architecture for each data set was selected based on the validation errors. The numbers of parameters are approximate.

The gradient of the cost function is computed with backpropagation through time (BPTT; Werbos, 1990). If the gradient has an L2 norm larger than 5, we rescale the gradient by a factor of 5 / ||g||_2, i.e.,

    g_c = g · min(1, 5 / ||g||_2),

where g = dl/dw and g_c is the clipped gradient.

An early-stopping strategy is employed to prevent overfitting. Before training, we set an initial patience value. At each epoch, we calculate and record the validation loss. If it is lower than the current lowest validation loss by 0.5%, we extend the patience by two. Training stops when the number of epochs is larger than the patience. We report the test error rate evaluated using the model with the lowest validation error.
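The gradient clipping and patience-based early stopping described above can be sketched as follows, again assuming PyTorch; the loop structure and helper names are assumptions rather than the authors' exact procedure, but the clipping threshold (5), the 0.5% improvement margin, and the patience extension of two follow the text.

```python
import torch

def fit(model, optimizer, criterion, train_batches, val_batches, patience=5):
    best_val, epoch = float("inf"), 0
    while epoch < patience:                         # stop once the epoch count exceeds patience
        for char_ids, labels in train_batches:
            optimizer.zero_grad()
            loss = criterion(model(char_ids), labels)
            loss.backward()
            # rescale the gradient so that its L2 norm is at most 5
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
            optimizer.step()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_batches)
        if val_loss < best_val * (1.0 - 0.005):     # improved by more than 0.5%
            patience += 2                           # extend patience by two epochs
        best_val = min(best_val, val_loss)
        epoch += 1
    return model
```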
5 Results and Analysis

Experimental results are listed in Table 3. We compare with the best character-level convolutional model without data augmentation from (Zhang et al., 2015) on each data set. Our model achieves comparable performance on all eight data sets with significantly fewer parameters. Specifically, it performs better on the AG's news, Sogou news, DBPedia, Yelp review full, and Yahoo! Answers datasets.

Figure 2: Relative test performance of the proposed model compared to the convolution-only model w.r.t. (a) the number of classes and (b) the size of the training set. Lower is better.

Number of classes  Fig. 2 (a) shows how the relative performance of our model changes with respect to the number of classes. It is worth noting that as the number of classes increases, our model achieves better results compared to convolution-only models. For example, our model has a much lower test error on DBPedia, which has 14 classes, but it scores worse on Yelp review polarity and Amazon review polarity, both of which have only two classes. Our conjecture is that more detailed and complete information needs to be preserved from the input text for the model to assign one of many classes to it. The convolution-only model likely loses detailed local features because it has more pooling layers. On the other hand, the proposed model, with fewer pooling layers, can better maintain the detailed information and hence performs better when such needs exist.

Number of training examples  Although the effect is less significant, Fig. 2 (b) shows that the proposed model generally works better than the convolution-only model when the data size is small. Considering the difference in the number of parameters, we suspect that because the proposed model is more compact, it is less prone to overfitting. It therefore generalizes better when the training size is limited.

Number of convolutional layers  An interesting observation from our experiments is that the model accuracy does not always increase with the number of convolutional layers. Performance peaks at two or three convolutional layers and decreases if we add more to the model. As more convolutional layers produce longer character n-grams, this indicates that there is an optimal level of local features to be fed into the recurrent layer. Also, as discussed above, more pooling layers likely lead to the loss of detailed information, which in turn affects the ability of the recurrent layer to capture long-term dependencies.

Number of filters  We experiment with larger models with 1,024 filters on the AG's news and Yahoo! Answers datasets. Although adding more filters in the convolutional layers does help with the model performance on these two datasets, the gains are limited compared to the increased number of parameters. The validation error improves from 8.75% to 8.39% for AG's news and from 29.48% to 28.62% for Yahoo! Answers, at the cost of a 70-fold increase in the number of model parameters.

Note that in our model we set the number of filters in the convolutional layers to be the same as the dimension of the hidden states in the recurrent layer. It is possible to use more filters in the convolutional layers while keeping the recurrent layer dimension the same, to potentially get better performance with less sacrifice in the number of parameters.

6 Conclusion

In this paper, we proposed a hybrid model that processes an input sequence of characters with a number of convolutional layers followed by a single recurrent layer. The proposed model is able to encode documents at the character level, capturing sub-word information.

We validated the proposed model on eight large-scale document classification tasks. The model achieved comparable results with many fewer convolutional layers compared to the convolution-only architecture. We further discussed several aspects that affect the model performance. The proposed model generally performs better when the number of classes is large, the training size is small, and the number of convolutional layers is set to two or three.

The proposed model is a general encoding architecture that is not limited to document classification tasks or natural language inputs. For example, (Chen et al., 2015; Visin et al., 2015) combined convolution and recurrent layers to tackle image segmentation tasks, and (Sainath et al., 2015) applied a similar model to speech recognition. It will be interesting to see future research on applying the architecture to other applications such as machine translation and music information retrieval. Using recurrent layers as substitutes for pooling layers, to potentially reduce the loss of detailed local information, is also a direction worth exploring.

Acknowledgments

This work was done as a part of the course DS-GA 1010-001 Independent Study in Data Science at the Center for Data Science, New York University.

References
[Ballesteros et al. 2015] Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. arXiv preprint arXiv:1508.00657.

[Bengio et al. 1994] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.

[Bridle 1990] John S. Bridle. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pages 227–236. Springer.

[Carrier and Cho 2014] Pierre Luc Carrier and Kyunghyun Cho. 2014. LSTM networks for sentiment analysis. Deep Learning Tutorials.

[Chen et al. 2015] Liang-Chieh Chen, Jonathan T. Barron, George Papandreou, Kevin Murphy, and Alan L. Yuille. 2015. Semantic image segmentation with task-specific edge detection using CNNs and a discriminatively trained domain transform. CoRR, abs/1511.03328.

[Cho et al. 2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP 2014).

[Gers et al. 2000] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. 2000. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471.

[Glorot et al. 2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-11), volume 15, pages 315–323. Journal of Machine Learning Research - Workshop and Conference Proceedings.

[Hochreiter and Schmidhuber 1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

[Hochreiter et al. 2001] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, volume 1. IEEE.

[Kim et al. 2015] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-aware neural language models. arXiv preprint arXiv:1508.06615.

[Kim 2014] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

[Krizhevsky et al. 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc.

[Ling et al. 2015] Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. arXiv preprint arXiv:1508.02096.

[Mesnil et al. 2014] Grégoire Mesnil, Marc'Aurelio Ranzato, Tomas Mikolov, and Yoshua Bengio. 2014. Ensemble of generative and discriminative techniques for sentiment analysis of movie reviews. arXiv preprint arXiv:1412.5335.

[Sainath et al. 2015] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak. 2015. Convolutional, long short-term memory, fully connected deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 4580–4584.

[Socher et al. 2013] Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.

[Srivastava et al. 2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.

[Sundermeyer et al. 2015] Martin Sundermeyer, Hermann Ney, and Ralf Schlüter. 2015. From feedforward to recurrent LSTM neural networks for language modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):517–529.
[Tang et al. 2015] Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1422–1432.

[Visin et al. 2015] Francesco Visin, Kyle Kastner, Aaron C. Courville, Yoshua Bengio, Matteo Matteucci, and Kyunghyun Cho. 2015. ReSeg: A recurrent neural network for object segmentation. CoRR, abs/1511.07053.

[Werbos 1990] P. Werbos. 1990. Backpropagation through time: what it does and how to do it. In Proceedings of the IEEE, volume 78, pages 1550–1560.

[Zeiler 2012] Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701.

[Zhang et al. 2015] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems (NIPS 2015), volume 28.
