Shortcut Sequence Tagging

Huijia Wu 1,3, Jiajun Zhang 1,3, and Chengqing Zong 1,2,3
1 National Laboratory of Pattern Recognition, Institute of Automation, CAS
2 CAS Center for Excellence in Brain Science and Intelligence Technology
3 University of Chinese Academy of Sciences
{huijia.wu,jjzhang,cqzong}@nlpr.ia.ac.cn

arXiv:1701.00576v1 [cs.CL] 3 Jan 2017

Abstract

Deep stacked RNNs are usually hard to train. Adding shortcut connections across different layers is a common way to ease the training of stacked networks. However, extra shortcuts make the recurrent step more complicated. To simplify the stacked architecture, we propose a framework called the shortcut block, which is a marriage of the gating mechanism and shortcuts, while discarding the self-connected part in the LSTM cell. We present extensive empirical experiments showing that this design makes training easy and improves generalization. We propose various shortcut block topologies and compositions to explore their effectiveness. Based on this architecture, we obtain a 6% relative improvement over the state of the art on the CCGbank supertagging dataset. We also obtain comparable results on the POS tagging task.

1 Introduction

In natural language processing, sequence tagging mainly refers to the tasks of assigning discrete labels to each token in a sequence. Typical examples include Part-of-Speech (POS) tagging and Combinatory Category Grammar (CCG) supertagging. A regular feature of sequence tagging is that the input tokens in a sequence cannot be assumed to be independent, since the same token in different contexts can be assigned different tags. Therefore, the classifier should have memories of the context to make a correct prediction.

Bidirectional LSTMs (Graves and Schmidhuber, 2005) have become dominant in sequence tagging problems due to their superior performance (Wang et al., 2015; Lample et al., 2016). The horizontal hierarchy of LSTMs with bidirectional processing can remember long-range dependencies without affecting the short-term storage. Although these models have a deep horizontal hierarchy (the depth refers to sequence length), the vertical hierarchy is often shallow, which may not be efficient at representing each token. Stacked LSTMs are deep in both dimensions, but become harder to train due to the feed-forward structure of the stacked layers.

Shortcut connections (shortcuts, or skip connections) enable unimpeded information flow by adding direct connections across different layers (Raiko et al., 2012; Graves, 2013; Hermans and Schrauwen, 2013). Recent works have shown the effectiveness of using shortcuts in deep stacked models (He et al., 2015; Srivastava et al., 2015; Wu et al., 2016b). These works share a common way of adding shortcuts as increments to the original network.

We focus on the refinement of shortcut stacked models to make training easy. In particular, for stacked LSTMs the shortcuts make the computation of LSTM blocks more complicated. We replace the self-connected parts in LSTM cells with the shortcuts to simplify the updates. Based on this construction, we introduce the shortcut block, which can be viewed as a marriage of the gating mechanism and the shortcuts, while discarding the self-connected units in LSTMs.

Our contribution is mainly in the exploration of shortcut blocks. We propose a framework of stacked LSTMs with shortcuts, using deterministic or stochastic gates to control the shortcut connections. We present extensive experiments on the Combinatory Category Grammar (CCG) supertagging task to compare various shortcut block topologies, gating functions, and combinations of the blocks. We also evaluate our model on the Part-of-Speech (POS) tagging task to test its generalization performance. Our model obtains state-of-the-art results on CCG supertagging and comparable results on POS tagging.

2 Recurrent Neural Networks for Sequence Tagging

Consider a recurrent neural network applied to sequence tagging: given a sequence x = (x_1, ..., x_T), the RNN computes the hidden states h = (h_1, ..., h_T) and the outputs y = (y_1, ..., y_T) by iterating the following equations:

  h_t = f(x_t, h_{t-1}; θ_h)   (1)
  y_t = g(h_t; θ_o)   (2)

where t ∈ {1, ..., T} represents the time step, x_t is the input at time t, and h_{t-1} and h_t are the previous and the current hidden state, respectively. f and g are the transition function and the output function, respectively. θ_h and θ_o are network parameters.

We use a negative log-likelihood cost to evaluate the performance, which can be written as:

  C = -(1/N) Σ_{n=1}^{N} log y_{t_n}   (3)

where t_n is the true target for sample n, and y_{t_n} is the t_n-th output of the softmax layer given the input x_n.

A stacked RNN is one type of deep RNN, in which hidden layers are stacked on top of each other, each feeding up to the layer above:

  h_t^l = f^l(h_{t-1}^l, h_t^{l-1})   (4)

where h_t^l is the t-th hidden state of the l-th layer.
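As a concrete illustration of Eqs. (1)-(3), the following minimal NumPy sketch runs a single-layer tanh RNN tagger forward and accumulates the negative log-likelihood of the gold tags. The choice of f as a tanh of an affine map, the dimensions, and all names are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rnn_tag_nll(x, targets, W_xh, W_hh, W_hy, b_h, b_y):
    """Forward pass of a simple RNN tagger (Eqs. 1-2) and the
    per-token negative log-likelihood of the gold tags (cf. Eq. 3)."""
    T, _ = x.shape
    h = np.zeros(W_hh.shape[0])           # h_0
    nll = 0.0
    for t in range(T):
        # h_t = f(x_t, h_{t-1}; theta_h), here f = tanh of an affine map
        h = np.tanh(x[t] @ W_xh + h @ W_hh + b_h)
        # y_t = g(h_t; theta_o), here g = softmax over the tag set
        logits = h @ W_hy + b_y
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        nll -= np.log(probs[targets[t]])  # accumulate -log y_{t_n}
    return nll / T

# Toy usage: 4 time steps, 10-dim inputs, 20 hidden units, 5 tags.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10))
targets = np.array([0, 3, 1, 4])
params = dict(
    W_xh=rng.normal(scale=0.1, size=(10, 20)),
    W_hh=rng.normal(scale=0.1, size=(20, 20)),
    W_hy=rng.normal(scale=0.1, size=(20, 5)),
    b_h=np.zeros(20), b_y=np.zeros(5),
)
print(rnn_tag_nll(x, targets, **params))
```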
3 Explorations of Shortcuts

Shortcuts (skip connections) are cross-layer connections: the output of layer l-1 is not only connected to layer l, but also to layers l+1, ..., L. In this section, we first introduce the traditional stacked LSTMs. Based on this architecture, we propose a shortcut block structure, which is the basic element of our stacked models.

3.1 Traditional Stacked LSTMs

Stacked LSTMs without skip connections can be defined as:

  [i; f; o; s] = [sigm; sigm; sigm; tanh] (W^l [h_t^{l-1}; h_{t-1}^l])   (5)

  c_t^l = f ⊙ c_{t-1}^l + i ⊙ s_t^l
  h_t^l = o ⊙ tanh(c_t^l)   (6)

During the forward pass, LSTMs need to calculate c_t^l and h_t^l, which are the cell's internal state and the cell output state, respectively. c_t^l is computed by adding two parts: one is the cell increment s_t^l, controlled by the input gate i_t^l; the other is the self-connected part c_{t-1}^l, controlled by the forget gate f_t^l. The cell output h_t^l is computed by multiplying the activated cell state by the output gate o_t^l, which learns when to access the memory cell and when to block it. "sigm" and "tanh" are the sigmoid and tanh activation functions, respectively. W^l ∈ R^{4n×2n} is the weight matrix that needs to be learned.

3.2 Shortcut Blocks

The hidden units in stacked LSTMs have two forms. One is the hidden units in the same layer, {h_t^l, t ∈ 1, ..., T}, which are connected through an LSTM. The other is the hidden units at the same time step, {h_t^l, l ∈ 1, ..., L}, which are connected through a feed-forward network. An LSTM can keep short-term memory for a long time, so error signals can easily be passed through {1, ..., T}. However, when the number of stacked layers is large, the feed-forward network suffers from the gradient vanishing/exploding problems, which make the gradients hard to pass through {1, ..., L}.

Shortcut connections can partly solve this problem by adding a direct link between layers. An intuitive explanation is that such a link lets the error signal jump across layers rather than passing through them one by one. This may lead to faster convergence and better generalization. To clarify notation, we introduce the shortcut block, which is composed of the different layers connected through shortcuts.

Our shortcut block is mainly based on Wu et al. (2016b), which introduces gated shortcuts connected to the cell outputs. The main difference is that we replace the self-connected parts with shortcuts to compute the internal state. The LSTM block (Hochreiter and Schmidhuber, 1997) composes memory cells sharing the same input and output gate. He et al. (2015) create a residual block which adds shortcut connections across different CNN layers. All these inspired us to build a shortcut block across different LSTM layers. Our shortcut block is defined as follows:

  [i; g; o; s] = [sigm; sigm; sigm; tanh] (W^l [h_t^{l-1}; h_{t-1}^l])   (7)

  m = i ⊙ s_t^l + g ⊙ h_t^{-l}
  h_t^l = o ⊙ tanh(m) + g ⊙ h_t^{-l}   (8)

where h_t^{-l} is the output from one of the previous layers 1, ..., l-1, and g is the gate which is used to access the skipped output h_t^{-l} or block it.
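A minimal NumPy sketch of one forward step of the shortcut block follows. It assumes, as in Eq. (7), that the gates i, g, o and the increment s are computed jointly from [h_t^{l-1}; h_{t-1}^l] with a single weight matrix (bias terms omitted); all function and variable names are our own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def shortcut_block_step(h_below, h_prev, h_skip, W):
    """One step of the shortcut block (Eqs. 7-8).

    h_below : h_t^{l-1}, output of the layer below at time t
    h_prev  : h_{t-1}^l, this layer's output at the previous time step
    h_skip  : h_t^{-l}, output of an earlier layer reached by the shortcut
    W       : joint weight matrix of shape (4n, 2n)
    """
    n = h_prev.shape[0]
    pre = W @ np.concatenate([h_below, h_prev])   # 4n pre-activations
    i = sigmoid(pre[:n])            # input gate
    g = sigmoid(pre[n:2 * n])       # shortcut gate
    o = sigmoid(pre[2 * n:3 * n])   # output gate
    s = np.tanh(pre[3 * n:])        # cell increment s_t^l
    m = i * s + g * h_skip          # Eq. (8), internal state
    h = o * np.tanh(m) + g * h_skip # Eq. (8), cell output
    return h

# Toy usage with n = 8 hidden units.
rng = np.random.default_rng(1)
n = 8
W = rng.normal(scale=0.1, size=(4 * n, 2 * n))
h = shortcut_block_step(rng.normal(size=n), np.zeros(n), np.zeros(n), W)
print(h.shape)  # (8,)
```

Note that, unlike the standard LSTM step of Eqs. (5)-(6), no cell state c_t^l has to be stored between time steps; the skipped output h_t^{-l} takes its place.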
Comparison with LSTMs. LSTMs introduce a memory cell with a fixed self-connection to create constant error flow. They compute the following increment to the self-connected cell at each time step:

  c_t = c_{t-1} + s_t   (9)

Here we omit the multiplicative gates to simplify the explanation. The self-connected cell c_t can keep the recurrent information for a long time, and s_t is the increment to the cell. In our shortcut block, by contrast, we only keep the increment part. In the sequence tagging problem, the input sequence and the output sequence match exactly: the input token x_i, i ∈ {1, ..., n}, of an input sequence of length n provides the most relevant information for predicting the corresponding label y_i, i ∈ {1, ..., n}, of the output sequence. We want to focus the information flow in the vertical direction through shortcuts, rather than in the horizontal direction through the self-connected units. Therefore, we only consider the increment to the cell and ignore the self-connected part. Our cell state becomes:

  m = h_t^{-l} + s_t   (10)

Why Only Increments? Obviously, we could keep the self-connected part together with the increments in the cell state. But the most important reason for this design is that it is much easier to compute: we do not need extra space to preserve the cell state. This makes deep stacked models much easier to train.

Discussion. The shortcut block can be seen as a generalization of several multiscale RNN architectures. It is up to the user to define the block topology. For example, when h_t^{-l} := h_t^{l-1}, this is similar to recurrent highway networks (Zilly et al., 2016) and highway LSTMs (Zhang et al., 2016). When h_t^{-l} := h_t^{l-2}, this becomes the traditional shortcuts for RNNs:

  m = i ⊙ s_t^l + g ⊙ h_t^{l-2}
  h_t^l = o ⊙ tanh(m) + g ⊙ h_t^{l-2}   (11)

3.3 Gates Design

Shortcut gates are used to make the skipped path deterministic (Srivastava et al., 2015) or stochastic (Huang et al., 2016). We explore several ways to compute the shortcut gates (denoted by g_t^l). The simplest case is to use g^l as a linear operator. In this case, g^l is a weight matrix, and the element-wise product g_t^l ⊙ h_t^{-l} in Eq. (7) becomes a matrix-vector multiplication:

  g_t^l ⊙ h_t^{-l} := W^l h_t^{-l}   (12)

We can also obtain g_t^l through a non-linear mapping, similar to the computation of gates in an LSTM:

  g_t^l = σ(W^l h_t^{l-1})   (13)

Here we use the output of layer l-1 to control the shortcut, e.g. h_t^{l-2}. Notice that this non-linear mapping is not unique; we just show the simplest case.

Furthermore, inspired by the dropout (Srivastava et al., 2014) strategy, we can sample from a Bernoulli random variable to obtain g_t^l. In this case, a deterministic gate is transformed into a stochastic gate:

  g_t^l ∼ Bernoulli(p)   (14)

where g_t^l is a vector of independent Bernoulli random variables, each of which has probability p of being 1. We can either fix p to a specific value or learn it with a non-linear mapping. For example, we can learn p by:

  p = σ(H^l h_t^{l-1})   (15)

At test time, h_t^{-l} is multiplied by p.

Discussion. The gates of LSTMs are essential for avoiding weight update conflicts, which are also invoked by the shortcuts. In experiments, we find that using deterministic gates is better than using stochastic gates. We recommend using the logistic gates to compute g_t^l.
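The three gate choices of Eqs. (12)-(14) can be sketched as follows; the matrix names, dimensions, and the fixed p = 0.5 are our own illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n = 8
h_below = rng.normal(size=n)   # h_t^{l-1}, output of the layer below
h_skip = rng.normal(size=n)    # h_t^{-l}, e.g. h_t^{l-2}

# (a) Linear operator (Eq. 12): the gated term becomes W^l h_t^{-l}.
W_lin = rng.normal(scale=0.1, size=(n, n))
gated_linear = W_lin @ h_skip

# (b) Deterministic non-linear gate (Eq. 13): g = sigmoid(W^l h_t^{l-1}).
W_gate = rng.normal(scale=0.1, size=(n, n))
g_det = sigmoid(W_gate @ h_below)
gated_det = g_det * h_skip

# (c) Stochastic gate (Eq. 14): g ~ Bernoulli(p) during training;
#     at test time the skipped output is scaled by p instead (Section 3.3).
p = 0.5
g_sto = rng.binomial(1, p, size=n)
gated_train = g_sto * h_skip
gated_test = p * h_skip

print(gated_linear.shape, gated_det.shape, gated_train.shape, gated_test.shape)
```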
3.4 Compositions of Shortcut Blocks

In the previous subsections, we introduced the shortcut blocks and the computation of the gating functions. To build deep stacked models we need to compose these blocks together. In this section, we discuss several kinds of compositions, as shown in Figure 1. The links with ⊙ represent the gated identity connections across layers. There are two types of connections in Figure 1: one is the direct connections between adjacent layers, the other is the gated connections across different layers.

In Figure 1, Type 1 connects the input of the first hidden layer h^1 to all the following layers L = 2, 3, 4, 5. Type 2 and Type 3 are composed of shortcut blocks with span 1 and 2, respectively. Type 4 and Type 5 are the nested shortcut blocks. We wish to find an optimal composition to pass information in deep stacked models.

[Figure 1: Compositions of shortcut blocks (Type 1 to Type 5 over layers l = 1, ..., 5). We call a shortcut block of span 1 when h_t^{-l} := h_t^{l-2}.]

4 Neural Architecture for Sequence Tagging

Sequence tagging can be formulated as P(t|w; θ), where w = [w_1, ..., w_T] denotes the T words in a sentence, and t = [t_1, ..., t_T] denotes the corresponding T tags. In this section we introduce a neural architecture for P(·), which includes an input layer, stacked hidden layers and an output layer. Since the stacked hidden layers have already been introduced in the previous section, we only describe the input and the output layer here.

4.1 Network Inputs

Network inputs are the representations of each token in a sequence. There are many kinds of token representations, such as a single word embedding, a local window approach, or a combination of word and character-level representations. Following Wu et al. (2016b), we use a local window approach together with a concatenation of word representations, character representations, and capitalization representations.

Formally, we can represent the distributed word feature f_{w_t} as a concatenation of these embeddings:

  f_{w_t} = [L_w(w_t); L_a(a_t); L_c(c_w)]   (16)

where w_t and a_t represent the current word and its capitalization, and c_w := [c_1, c_2, ..., c_{T_w}], where T_w is the length of the word and c_i, i ∈ {1, ..., T_w}, is the i-th character of that word. L_w(·) ∈ R^{|V_w|×n}, L_a(·) ∈ R^{|V_a|×m} and L_c(·) ∈ R^{|V_c|×r} are the look-up tables for the words, capitalization and characters, respectively. f_{w_t} ∈ R^{n+m+r} represents the distributed feature of w_t. A context window of size d surrounding the current word is used as input:

  x_t = [f_{w_{t-⌊d/2⌋}}; ...; f_{w_{t+⌊d/2⌋}}]   (17)

where x_t ∈ R^{(n+m+r)×d} is the concatenation of the context features. In the following we discuss w_t, a_t and c_w in detail.
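Before detailing the three embedding types, here is a small NumPy sketch of Eqs. (16) and (17). The vocabulary sizes, embedding dimensions, zero padding at sentence boundaries, and the averaging of character embeddings are illustrative simplifications (the paper concatenates the leftmost and rightmost character embeddings, as described below); all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sizes: n/m/r are the word/cap/char embedding dimensions.
n, m, r, d = 6, 2, 4, 3                 # d is the context window size
L_w = rng.normal(size=(100, n))         # word look-up table
L_a = rng.normal(size=(2, m))           # capitalization look-up table (0/1)
L_c = rng.normal(size=(30, r))          # character look-up table

def char_feature(char_ids):
    """Average of character embeddings; only a stand-in for the paper's
    concatenation of leftmost/rightmost character embeddings."""
    return L_c[char_ids].mean(axis=0)

def token_feature(word_id, cap_id, char_ids):
    # Eq. (16): f_{w_t} = [L_w(w_t); L_a(a_t); L_c(c_w)]
    return np.concatenate([L_w[word_id], L_a[cap_id], char_feature(char_ids)])

def window_input(features, t, d):
    # Eq. (17): concatenate the features of the d tokens centred on position t,
    # padding with zero vectors at the sentence boundaries.
    pad = np.zeros(n + m + r)
    ctx = []
    for k in range(t - d // 2, t + d // 2 + 1):
        ctx.append(features[k] if 0 <= k < len(features) else pad)
    return np.concatenate(ctx)

# Toy sentence of 4 tokens: (word id, capitalization id, character ids).
feats = [token_feature(w, c, ch) for w, c, ch in
         [(5, 1, [3, 4]), (17, 0, [1]), (42, 0, [2, 7, 9]), (8, 0, [0])]]
x_2 = window_input(feats, 2, d)
print(x_2.shape)   # ((n + m + r) * d,) = (36,)
```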
Word Representations. All words in the vocabulary share a common look-up table, which is initialized with random values or pre-trained embeddings. Each word in a sentence can be mapped to an embedding vector w_t, and the whole sentence is then represented by a matrix with column vectors [w_1, w_2, ..., w_T]. Following Wu et al. (2016a), we use a context window of size d surrounding a word w_t to get its context information. Moreover, we add logistic gates to each token in the context window. The word representation is computed as w_t = [r_{t-⌊d/2⌋} w_{t-⌊d/2⌋}; ...; r_{t+⌊d/2⌋} w_{t+⌊d/2⌋}], where r_t := [r_{t-⌊d/2⌋}, ..., r_{t+⌊d/2⌋}] ∈ R^d is a logistic gate that filters out unnecessary contexts, and w_{t-⌊d/2⌋}, ..., w_{t+⌊d/2⌋} are the word embeddings in the local window.

Capitalization Representations. We lowercase the words to decrease the size of the word vocabulary and reduce sparsity, but we use extra capitalization embeddings to store the capitalization features, which represent whether or not a word is capitalized.

Character Representations. We concatenate the character embeddings of a word to get the character-level representation. Concretely, consider a word w consisting of a sequence of characters [c_1, c_2, ..., c_{l_w}], where l_w is the length of the word and L(·) is the look-up table for characters. We concatenate the leftmost 5 character embeddings L(c_1), ..., L(c_5) with the rightmost 5 character embeddings L(c_{l_w-4}), ..., L(c_{l_w}) to get c_w. When a word has fewer than five characters, we pad the remaining positions with the same special symbol.

4.2 Network Outputs

For sequence tagging, we use a softmax activation function g(·) in the output layer:

  y_t = g(W^{hy} [→h_t; ←h_t])   (18)

where [→h_t; ←h_t] is the concatenation of the forward and backward hidden states, and y_t is a probability distribution over all possible tags: y_k(t) = exp(h_k) / Σ_{k'} exp(h_{k'}) is the k-th dimension of y_t, which corresponds to the k-th tag in the tag set. W^{hy} is the hidden-to-output weight matrix.
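A minimal sketch of the output layer of Eq. (18), assuming the forward and backward hidden states come from the top bidirectional layer; the dimensions (16 units per direction, 1286 output tags as in the CCG experiments of Section 5.1.2) are only for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def output_distribution(h_fwd, h_bwd, W_hy):
    """Eq. (18): tag distribution from the concatenated forward and
    backward hidden states of the top bidirectional layer."""
    return softmax(W_hy @ np.concatenate([h_fwd, h_bwd]))

# Toy usage: 16 hidden units per direction, 1286 output tags.
rng = np.random.default_rng(4)
W_hy = rng.normal(scale=0.1, size=(1286, 32))
y_t = output_distribution(rng.normal(size=16), rng.normal(size=16), W_hy)
print(y_t.shape, y_t.sum())   # (1286,) and a sum of approximately 1.0
```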
5 Experiments

5.1 Combinatory Category Grammar Supertagging

Combinatory Category Grammar (CCG) supertagging is a sequence tagging problem in natural language processing. The task is to assign supertags to each word in a sentence. In CCG the supertags stand for the lexical categories, which are composed of basic categories such as N, NP and PP, and complex categories, which are combinations of the basic categories based on a set of rules. For detailed explanations of CCG we refer to Steedman (2000) and Steedman and Baldridge (2011).

Another similar task is POS tagging, in which the tags are parts of speech. Technically, the two kinds of tags classify words in different ways: CCG tags implicate the semantics of words, while POS tags represent the syntax of words. Although these distinctions are important in linguistics, here we treat both as activations of neurons, using distributed representations to encode the tags. This high-level abstraction greatly improves generalization and heavily reduces the cost of model redesign.

5.1.1 Dataset and Pre-processing

Our experiments are performed on CCGBank (Hockenmaier and Steedman, 2007), which is a translation of the Penn Treebank (Marcus et al., 1993) into CCG with a coverage of 99.4%. We follow the standard splits, using sections 02-21 for training, section 00 for development and section 23 for testing. We use the full category set containing 1285 tags. All digits are mapped to the same digit '9', and all words are lowercased.

5.1.2 Network Configuration

Initialization. There are two types of weights in our experiments: recurrent and non-recurrent weights. For non-recurrent weights, we initialize the word embeddings with the pre-trained 100-dimensional GloVe vectors (Pennington et al., 2014). Other weights are initialized from the Gaussian distribution N(0, 1/√fan-in) scaled by a factor of 0.1, where fan-in is the number of units in the input layer. For recurrent weight matrices, following Saxe et al. (2013) we initialize with random orthogonal matrices obtained through SVD to avoid unstable gradients. All bias terms are initialized with zero vectors.

Hyperparameters. Our context window size is set to 3. The dimensions of the character embeddings and the capitalization embeddings are 5. The size of the input layer after concatenation is 465 ((word embedding 100 + capitalization embedding 5 + character embedding 50) × window size 3). The number of cells of the stacked bidirectional LSTM is also set to 465 for orthogonal initialization. All stacked hidden layers have the same number of cells. The output layer has 1286 neurons, which equals the number of tags in the training set plus a RARE symbol.

Training. We train the networks with the backpropagation algorithm, using stochastic gradient descent (SGD) with an initial learning rate of 0.02. The learning rate is then scaled by 0.5 when the following condition is satisfied:

  |e_p − e_c| / e_p <= 0.005 and lr >= 0.0005

where e_p is the error rate on the validation set at the previous epoch and e_c is the error rate at the current epoch. The rationale behind this rule is that when the growth of the performance slows down, we need a smaller learning rate to adjust the weights. We use online learning in our experiments, which means the parameters are updated on every training sequence, one at a time.
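The learning-rate rule above translates directly into a few lines of Python; the function name and the toy error sequence are ours.

```python
def update_learning_rate(lr, err_prev, err_curr):
    """Halve the learning rate when the relative change of the validation
    error drops to 0.5% or less and lr is still above the 0.0005 floor
    (the rule stated in Section 5.1.2, Training)."""
    if err_prev > 0 and abs(err_prev - err_curr) / err_prev <= 0.005 and lr >= 0.0005:
        return lr * 0.5
    return lr

# Toy usage: start at 0.02 and feed in a plateauing validation error.
lr = 0.02
errors = [0.10, 0.080, 0.0799, 0.0798]
for prev, curr in zip(errors, errors[1:]):
    lr = update_learning_rate(lr, prev, curr)
print(lr)   # decays once the relative change drops below 0.5%
```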
Regularization. Dropout (Srivastava et al., 2014) is the only regularizer in our model to avoid overfitting; other regularization methods such as weight decay and batch normalization did not work in our experiments. We add a binary dropout mask to the local context windows with a drop rate p of 0.25. We also apply dropout to the output of the first hidden layer and the last hidden layer, with a 0.5 drop rate. At test time, weights are scaled by a factor of 1−p.

5.1.3 Comparison with Other Systems

Table 1 shows the comparison with other models for supertagging. The comparison does not include any externally labeled data or POS tags. We evaluate the models composed of shortcut blocks with different depths. We report experiments trained on the training set and evaluated on the test set, selecting the model with the highest 1-best supertagging accuracy on the development set.

Model | Dev | Test
Clark and Curran (2007) | 91.5 | 92.0
MLP (Lewis et al. (2014)) | 91.3 | 91.6
Bi-LSTM (Lewis et al. (2016)) | 94.1 | 94.3
Elman-RNN (Xu et al. (2015)) | 93.1 | 93.0
Bi-RNN (Xu et al. (2016)) | 93.49 | 93.52
Bi-LSTM (Vaswani et al. (2016)) | 94.24 | 94.5
9-stacked Bi-LSTM (Wu et al. (2016b)) | 94.55 | 94.69
7-stacked shortcut block (Ours) | 94.74 | 94.95
9-stacked shortcut block (Ours) | 94.82 | 94.99
11-stacked shortcut block (Ours) | 94.66 | 94.86
13-stacked shortcut block (Ours) | 94.73 | 94.97

Table 1: 1-best supertagging accuracy on CCGbank.

Our 9-stacked model achieves state-of-the-art results (94.99 on the test set) compared with the other systems. Notice that 9 is the number of stacked Bi-LSTM layers; the total network contains 11 layers (9 + 1 input-to-hidden layer + 1 hidden-to-output layer). We find that the networks with stacked depth 7 or 9 achieve better performance than those with depth 11 or 13, but the difference is tiny. Our stacked models follow Eq. (11) and the gating functions follow Eq. (13).

5.1.4 Exploration of Shortcuts

To get a better understanding of the shortcut architecture proposed in Eq. (7), we experiment with its variants to compare the performance. Our analysis mainly focuses on three parts: the topology of shortcut blocks, the gating mechanism, and their compositions. The default number of stacked layers is 7. We also use the shared gates and the Type 2 architecture as our default configuration, as described in Eq. (11). The comparison is summarized as follows.

Shortcut Topologies. Table 2 shows the comparison of shortcut topologies. Here we design two kinds of models for comparison: in one, both c_t^l and h_t^l are connected through shortcuts (Case 1); the other uses m to replace c_t^l (Case 2), as defined in the shortcut block. We find that skip connections to both the internal states and the cell outputs with multiplicative gating achieve the highest accuracy on the test set (Case 1, 95.08%). But Case 2 achieves a better validation accuracy (94.77%). We prefer Case 2 since it generalizes well and is much easier to train.

Case | Variant | Dev | Test
h_t^l updated (Wu et al., 2016b) | with gate: h_t^l = h̃_t^l + g ⊙ h_t^{l-2} | 94.51 | 94.67
both c_t^l and h_t^l updated (Case 1) | no gate: c_t^l = c̃_t^l + h_t^{l-2}, h_t^l = h̃_t^l + h_t^{l-2} | 93.84 | 93.84
both c_t^l and h_t^l updated (Case 1) | with gate: c_t^l = c̃_t^l + g ⊙ h_t^{l-2}, h_t^l = h̃_t^l + g ⊙ h_t^{l-2} | 94.72 | 95.08
both c_t^l and h_t^l updated (Case 1) | highway gate: c_t^l = (1−g) ⊙ c̃_t^l + g ⊙ h_t^{l-2}, h_t^l = (1−g) ⊙ h̃_t^l + g ⊙ h_t^{l-2} | 94.49 | 94.62
both c_t^l and h_t^l updated (Case 1) | shortcuts for both c_t^l and h_t^l: c_t^l = c̃_t^l + g_c ⊙ c_t^{l-2}, h_t^l = h̃_t^l + g_h ⊙ h_t^{l-2} | 94.72 | 94.98
shortcut block (Case 2) | no gate in h_t^l: h_t^l = o ⊙ tanh(m) + h_t^{l-2} | 94.15 | 94.29
shortcut block (Case 2) | no gate in m: m = i ⊙ s_t^l + h_t^{l-2} | 94.77 | 94.97
shortcut block (Case 2) | share gate: h_t^l = o ⊙ tanh(m) + o ⊙ h_t^{l-2} | 94.68 | 94.83
shortcut block (Case 2) | no shortcut in internal state: h_t^l = o ⊙ tanh(i_t ⊙ s_t) + g ⊙ h_t^{l-2} | 93.83 | 94.01
shortcut block (Case 2) | no shortcut in cell output: h_t^l = o ⊙ tanh(m) | 93.58 | 93.82

Table 2: Comparison of shortcut topologies. We use h̃_t^l to denote the original cell output of the LSTM block, which equals o ⊙ tanh(c_t^l); similarly, c̃_t^l := i ⊙ s_t^l + f ⊙ c_{t-1}.

5.1.5 Comparison of Gating Functions

We experiment with several gating functions proposed in Section 3.3. Detailed discussions are given below.

Case | Variant | Dev | Test
scaled mapping | replace h_t^{l-2} with tanh(h_t^{l-2}) | 94.60 | 94.81
linear mapping | g_t^l ⊙ h_t^{-l} = w^l ⊙ h_t^{-l} | 92.07 | 92.15
non-linear mapping | g_t^l = σ(W^l h_t^{l-1}) | 94.79 | 94.91
non-linear mapping | g_t^l = σ(U^l h_{t-1}^l) | 94.21 | 94.56
non-linear mapping | g_t^l = σ(V^l h_t^{l-2}) | 94.60 | 94.78
stochastic sampling | g_t^l ∼ Bernoulli(p), p = 0.5 | 91.12 | 91.47
stochastic sampling | g_t^l ∼ Bernoulli(p), p = σ(H^l h_t^{l-1}) | 93.90 | 94.06

Table 3: Comparison of gating functions. The non-linear mapping g_t^l = σ(W^l h_t^{l-1}) is the preferred choice.

Identity Mapping. We apply the tanh function to the previous outputs to break the identity link. The result is 94.81% (Table 3), which is poorer than the identity function. We can infer that the identity function is more suitable than scaled functions such as sigmoid or tanh for transmitting information.

Exclusive Gating. We find that deterministic gates perform better than stochastic gates. Furthermore, the non-linear mapping g_t^l = σ(W^l h_t^{l-1}) achieves the best accuracy (94.79% dev, 94.91% test in Table 3), while other types such as linear or stochastic gates do not generalize well.

5.1.6 Comparison of Shortcut Block Compositions

We experiment with several kinds of compositions of shortcut blocks, as shown in Table 4. We find that shortcut blocks with span 1 (Type 2 and 5) perform better than the other spans (Type 1, 3 and 4). In the experiments, we use Type 2 as our default configuration since it is much easier to compute than Type 5.

Type | Dev | Test
1 | 94.22 | 94.38
2 | 94.79 | 94.94
3 | 94.53 | 94.80
4 | 94.55 | 94.70
5 | 94.76 | 94.95

Table 4: Comparison of shortcut block combinations. Dense compositions (Type 2 and 5) perform better than sparse ones.

5.1.7 Comparison of Hyper-parameters

As described in Section 4.1, we use a complex input encoding for our model. Concretely, we use a context window approach together with character-level information to get a better representation of the raw input. We compare the system with and without these approaches while keeping the hidden and the output parts unchanged.
Case | Variant | Dev | Test
window size | k = 0 | 93.96 | 94.06
window size | k = 5 | 94.27 | 94.81
window size | k = 7 | 94.52 | 94.71
character-level | character only | 92.17 | 93.0
character-level | l_w = 0 | 93.59 | 93.71
character-level | l_w = 3 | 94.21 | 94.41
character-level | l_w = 7 | 94.43 | 94.75

Table 5: Comparison of hyper-parameters.

Table 5 shows the effects of the hyper-parameters on the task. We find that the model does not perform well (94.06%) without local context windows. Although LSTMs can memorize recent inputs for a long time, it is still necessary to use a convolution-like operator to convolve the input tokens into a better representation. Character-level information also plays an important role in this task (a 13% relative improvement), but the performance would be heavily damaged if characters were used alone.

5.2 Part-of-Speech Tagging

Part-of-speech tagging is another sequence tagging task, in which POS tags are assigned to each word in a sentence. It is very similar to the supertagging task, so the two tasks can be solved with a unified architecture. For POS tagging, we use the same network configuration as for supertagging, except for the word vocabulary size and the tag set size. We conduct experiments on the Wall Street Journal portion of the Penn Treebank dataset, adopting the standard splits (sections 0-18 for training, sections 19-21 for validation and sections 22-24 for testing).

Model | Test
Søgaard (2011) | 97.5
Ling et al. (2015) | 97.36
Wang et al. (2015) | 97.78
Vaswani et al. (2016) | 97.4
Wu et al. (2016b) | 97.48
7-stacked mixed + non-linear gate | 97.48
9-stacked mixed + non-linear gate | 97.53
13-stacked mixed + non-linear gate | 97.51

Table 6: Accuracy of POS tagging on WSJ.

Although the POS tagging result presented in Table 6 is slightly below the state of the art, we neither do any hyper-parameter tuning nor change the network architecture; we simply use the configuration that obtained the best test accuracy on the supertagging task. This demonstrates the generalization of the model and avoids the heavy work of model re-design.

6 Related Work

Skip connections have been widely used for training deep neural networks. For recurrent neural networks, Schmidhuber (1992) and El Hihi and Bengio (1995) introduce deep RNNs by stacking hidden layers on top of each other. Raiko et al. (2012), Graves (2013), and Hermans and Schrauwen (2013) propose the use of skip connections in stacked RNNs. However, researchers have paid less attention to the analysis of various kinds of skip connections, which is our focus in this paper.

Recently, deep stacked networks have been widely used in applications. Srivastava et al. (2015) and He et al. (2015) mainly focus on feed-forward neural networks, using well-designed skip connections across different layers to let information pass more easily. The Grid LSTM proposed by Kalchbrenner et al. (2015) extends one-dimensional LSTMs to many dimensions, providing a more general framework for constructing deep LSTMs.

Yao et al. (2015) and Zhang et al. (2016) propose highway LSTMs by introducing gated direct connections between internal states in adjacent layers. Zilly et al. (2016) introduce recurrent highway networks (RHNs), which use a single recurrent layer to make an RNN deep in the vertical direction. These works do not use skip connections, and the hierarchical structure is reflected in the LSTM internal states or cell outputs. Wu et al. (2016b) propose a similar architecture for the shortcuts in stacked Bi-LSTMs. The difference is that we propose generalized shortcut block architectures as basic units for constructing deep stacked models, and we also discuss the compositions of these blocks.
There are also some works that use stochastic gates to transmit information. Zoneout (Krueger et al., 2016) provides a stochastic link between the previous hidden states and the current states, forcing the current states to maintain their previous values during the recurrent step. Chung et al. (2016) propose a stochastic boundary state to update the internal states and cell outputs. These stochastic connections are made between adjacent layers, while our shortcut constructions are mostly cross-layered. Also, the updating mechanisms of the LSTM blocks are different.

7 Conclusions

In this paper, we propose the shortcut block as a basic architecture for constructing deep stacked models. We compare several gating functions and find that the non-linear deterministic gate performs best. We also find that dense compositions perform better than sparse ones. These explorations can help us to train deep stacked Bi-LSTMs successfully. Based on this shortcut structure, we achieve state-of-the-art results on CCG supertagging and comparable results on POS tagging. Our explorations could easily be applied to other sequence processing problems that can be modeled with RNN architectures.

References

Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2016. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704.

Stephen Clark and James R. Curran. 2007. Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics, 33(4):493-552.

Salah El Hihi and Yoshua Bengio. 1995. Hierarchical recurrent neural networks for long-term dependencies. In NIPS, volume 400, page 409.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602-610.

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.

Michiel Hermans and Benjamin Schrauwen. 2013. Training and analysing deep recurrent neural networks. In Advances in Neural Information Processing Systems, pages 190-198.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. LSTM can solve hard long time lag problems. In Advances in Neural Information Processing Systems, pages 473-479.

Julia Hockenmaier and Mark Steedman. 2007. CCGbank: a corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3):355-396.

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. 2016. Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382.

Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. 2015. Grid long short-term memory. arXiv preprint arXiv:1507.01526.

David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron Courville, et al. 2016. Zoneout: Regularizing RNNs by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.

Mike Lewis and Mark Steedman. 2014. Improved CCG parsing with semi-supervised supertagging. Transactions of the Association for Computational Linguistics, 2:327-338.

Mike Lewis, Kenton Lee, and Luke Zettlemoyer. 2016. LSTM CCG parsing. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. arXiv preprint arXiv:1508.02096.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532-1543.

Tapani Raiko, Harri Valpola, and Yann LeCun. 2012. Deep learning made easier by linear transformations in perceptrons. In AISTATS, volume 22, pages 924-932.

Andrew M. Saxe, James L. McClelland, and Surya Ganguli. 2013. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.

Jürgen Schmidhuber. 1992. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242.

Anders Søgaard. 2011. Semisupervised condensed nearest neighbor for part-of-speech tagging. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 48-52. Association for Computational Linguistics.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway networks. arXiv preprint arXiv:1505.00387.

Mark Steedman and Jason Baldridge. 2011. Combinatory categorial grammar. In Non-Transformational Syntax: Formal and Explicit Models of Grammar. Wiley-Blackwell.

Mark Steedman. 2000. The Syntactic Process, volume 24. MIT Press.

Ashish Vaswani, Yonatan Bisk, Kenji Sagae, and Ryan Musa. 2016. Supertagging with LSTMs. In Proceedings of the Human Language Technology Conference of the NAACL.

Peilu Wang, Yao Qian, Frank K. Soong, Lei He, and Hai Zhao. 2015. Part-of-speech tagging with bidirectional long short-term memory recurrent neural network. arXiv preprint arXiv:1510.06168.

Huijia Wu, Jiajun Zhang, and Chengqing Zong. 2016a. A dynamic window neural network for CCG supertagging. arXiv preprint arXiv:1610.02749.

Huijia Wu, Jiajun Zhang, and Chengqing Zong. 2016b. An empirical exploration of skip connections for sequential tagging. arXiv preprint arXiv:1610.03167.

Wenduan Xu, Michael Auli, and Stephen Clark. 2015. CCG supertagging with a recurrent neural network. Volume 2: Short Papers, page 250.

Wenduan Xu, Michael Auli, and Stephen Clark. 2016. Expected F-measure training for shift-reduce parsing with recurrent neural networks. In Proceedings of NAACL-HLT, pages 210-220.

Kaisheng Yao, Trevor Cohn, Katerina Vylomova, Kevin Duh, and Chris Dyer. 2015. Depth-gated LSTM. Presented at the Jelinek Summer Workshop, August, volume 14, page 1.

Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yao, Sanjeev Khudanpur, and James Glass. 2016. Highway long short-term memory RNNs for distant speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5755-5759. IEEE.

Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. 2016. Recurrent highway networks. arXiv preprint arXiv:1607.03474.
