Exploiting Syntactic Structure for Language Modeling

Ciprian Chelba and Frederick Jelinek
Center for Language and Speech Processing, The Johns Hopkins University
Barton Hall 320, 3400 N. Charles St., Baltimore, MD-21218, USA
{chelba,jelinek}@jhu.edu

arXiv:cs/9811022v2 [cs.CL] 25 Jan 2000

Abstract

The paper presents a language model that develops syntactic structure and uses it to extract meaningful information from the word history, thus enabling the use of long distance dependencies. The model assigns probability to every joint sequence of words–binary-parse-structure with headword annotation and operates in a left-to-right manner — therefore usable for automatic speech recognition. The model, its probabilistic parameterization, and a set of experiments meant to evaluate its predictive power are presented; an improvement over standard trigram modeling is achieved.

[Figure 1: Partial parse of the word history "the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN 7_CD cents_NNS", with headword-annotated constituents contract_NP, loss_NP, cents_NP, of_PP, with_PP and ended_VP']

1 Introduction

The main goal of the present work is to develop a language model that uses syntactic structure to model long-distance dependencies. During the summer96 DoD Workshop a similar attempt was made by the dependency modeling group. The model we present is closely related to the one investigated in (Chelba et al., 1997), however different in a few important aspects:
• our model operates in a left-to-right manner, allowing the decoding of word lattices, as opposed to the one referred to previously, where only whole sentences could be processed, thus reducing its applicability to n-best list re-scoring; the syntactic structure is developed as a model component;
• our model is a factored version of the one in (Chelba et al., 1997), thus enabling the calculation of the joint probability of words and parse structure; this was not possible in the previous case due to the huge computational complexity of the model.
Our model develops syntactic structure incrementally while traversing the sentence from left to right. This is the main difference between our approach and other approaches to statistical natural language parsing. Our parsing strategy is similar to the incremental syntax ones proposed relatively recently in the linguistic community (Philips, 1996). The probabilistic model, its parameterization and a few experiments that are meant to evaluate its potential for speech recognition are presented.

2 The Basic Idea and Terminology

Consider predicting the word "after" in the sentence:
the contract ended with a loss of 7 cents after trading as low as 89 cents.
A 3-gram approach would predict "after" from (7, cents), whereas it is intuitively clear that the strongest predictor would be "ended", which is outside the reach of even 7-grams. Our assumption is that what enables humans to make a good prediction of "after" is the syntactic structure in the past. The linguistically correct partial parse of the word history when predicting "after" is shown in Figure 1. The word "ended" is called the headword of the constituent (ended (with (...))), and "ended" is an exposed headword when predicting "after" — the topmost headword in the largest constituent that contains it. The syntactic structure in the past filters out irrelevant words and points to the important ones, thus enabling the use of long distance information when predicting the next word.
Our model will attempt to build the syntactic structure incrementally while traversing the sentence left-to-right. The model will assign a probability P(W,T) to every sentence W with every possible POStag assignment, binary branching parse, non-terminal label and headword annotation for every constituent of T.
Let W be a sentence of length n words to which we have prepended <s> and appended </s>, so that w_0 = <s> and w_{n+1} = </s>. Let W_k be the word k-prefix w_0 ... w_k of the sentence and W_k T_k the word-parse k-prefix. To stress this point, a word-parse k-prefix contains — for a given parse — only those binary subtrees whose span is completely included in the word k-prefix, excluding w_0 = <s>. Single words along with their POStag can be regarded as root-only trees. Figure 2 shows a word-parse k-prefix; h_0 .. h_{-m} are the exposed heads, each head being a pair (headword, non-terminal label), or (word, POStag) in the case of a root-only tree.

[Figure 2: A word-parse k-prefix — exposed heads h_{-m} = (<s>, SB), ..., h_{-1}, h_0 = (h_0.word, h_0.tag) above the word sequence (<s>, SB) ... (w_p, t_p) (w_{p+1}, t_{p+1}) ... (w_k, t_k), followed by the yet unprocessed w_{k+1} ... </s>]

A complete parse — Figure 3 — is any binary parse of the (w_1, t_1) ... (w_n, t_n) (</s>, SE) sequence with the restriction that (</s>, TOP') is the only allowed head. Note that ((w_1, t_1) ... (w_n, t_n)) needn't be a constituent, but for the parses where it is, there is no restriction on which of its words is the headword or what the non-terminal label that accompanies the headword is.

[Figure 3: Complete parse — (</s>, TOP) at the root, spanning (<s>, SB) (w_1, t_1) ... (w_n, t_n) (</s>, SE), with (</s>, TOP') as the head of everything but the initial (<s>, SB)]

The model will operate by means of three modules:
• WORD-PREDICTOR predicts the next word w_{k+1} given the word-parse k-prefix and then passes control to the TAGGER;
• TAGGER predicts the POStag t_{k+1} of the next word given the word-parse k-prefix and the newly predicted word and then passes control to the PARSER;
• PARSER grows the already existing binary branching structure by repeatedly generating the transitions (unary, NTlabel), (adjoin-left, NTlabel) or (adjoin-right, NTlabel) until it passes control to the PREDICTOR by taking a null transition.
NTlabel is the non-terminal label assigned to the newly built constituent, and {left, right} specifies where the new headword is inherited from.

The operations performed by the PARSER are illustrated in Figures 4-6; they ensure that all possible binary branching parses, with all possible headword and non-terminal label assignments for the w_1 ... w_k word sequence, can be generated.

[Figure 4: Before an adjoin operation — exposed heads h_{-m} = (<s>, SB), ..., h_{-2}, h_{-1}, h_0 above their subtrees T_{-m}, ..., T_{-2}, T_{-1}, T_0]

[Figure 5: Result of adjoin-left under NTlabel — the two most recent exposed heads are replaced by h'_0 = (h_{-1}.word, NTlabel), with h'_{-1} = h_{-2}]

The following algorithm formalizes the above description of the sequential generation of a sentence with a complete parse.

    Transition t;                          // a PARSER transition
    predict (<s>, SB);
    do{
      // WORD-PREDICTOR and TAGGER
      predict (next_word, POStag);
      // PARSER
      do{
        if(h_{-1}.word != <s>){
          if(h_0.word == </s>)
            t = (adjoin-right, TOP');
          else{
            if(h_0.tag == NTlabel)
              t = [(adjoin-{left,right}, NTlabel), null];
            else
              t = [(unary, NTlabel), (adjoin-{left,right}, NTlabel), null];
          }
        }
        else{
          if(h_0.tag == NTlabel)
            t = null;
          else
            t = [(unary, NTlabel), null];
        }
      }while(t != null)                    // done PARSER
    }while(!(h_0.word == </s> && h_{-1}.word == <s>))
    t = (adjoin-right, TOP);               // adjoin <s>_SB; DONE

The unary transition is allowed only when the most recent exposed head is a leaf of the tree — a regular word along with its POStag — hence it can be taken at most once at a given position in the input word string. The second subtree in Figure 2 provides an example of a unary transition followed by a null transition.

It is easy to see that any given word sequence with a possible parse and headword annotation is generated by a unique sequence of model actions. This will prove very useful in initializing our model parameters from a treebank — see Section 3.5.
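To make the data structure concrete, the following is a minimal Python sketch (not from the paper) of a word-parse prefix and the PARSER transitions described above; the names Node, WordParsePrefix, shift, unary and adjoin are invented for illustration, and the null transition and probability scoring are omitted.

    # Illustrative sketch only: one possible representation of the exposed heads
    # and of the unary / adjoin-left / adjoin-right transitions.

    class Node:
        def __init__(self, headword, label, left=None, right=None):
            self.headword = headword   # headword of the constituent
            self.label = label         # POStag for a leaf, non-terminal label otherwise
            self.left = left
            self.right = right

    class WordParsePrefix:
        def __init__(self):
            # exposed heads, oldest first; heads[-1] is h_0, heads[-2] is h_{-1}
            self.heads = [Node("<s>", "SB")]

        def shift(self, word, postag):
            # WORD-PREDICTOR + TAGGER step: a new (word, POStag) root-only tree
            self.heads.append(Node(word, postag))

        def unary(self, ntlabel):
            # (unary, NTlabel): relabel the most recent exposed head h_0,
            # keeping its headword
            h0 = self.heads.pop()
            self.heads.append(Node(h0.headword, ntlabel, left=h0))

        def adjoin(self, direction, ntlabel):
            # (adjoin-left/right, NTlabel): merge h_{-1} and h_0 into one
            # constituent; the headword comes from h_{-1} (left) or h_0 (right)
            h0 = self.heads.pop()
            h_1 = self.heads.pop()
            headword = h_1.headword if direction == "left" else h0.headword
            self.heads.append(Node(headword, ntlabel, left=h_1, right=h0))

Under this representation, the generation algorithm above alternates one shift step (a WORD-PREDICTOR plus TAGGER prediction) with a sequence of unary/adjoin transitions terminated by a null transition.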
3 Probabilistic Model

The probability P(W,T) of a word sequence W and a complete parse T can be broken into:

P(W,T) = ∏_{k=1}^{n+1} [ P(w_k / W_{k-1}T_{k-1}) · P(t_k / W_{k-1}T_{k-1}, w_k) · ∏_{i=1}^{N_k} P(p_i^k / W_{k-1}T_{k-1}, w_k, t_k, p_1^k ... p_{i-1}^k) ]    (1)

where:
• W_{k-1}T_{k-1} is the word-parse (k-1)-prefix;
• w_k is the word predicted by the WORD-PREDICTOR;
• t_k is the tag assigned to w_k by the TAGGER;
• N_k - 1 is the number of operations the PARSER executes before passing control to the WORD-PREDICTOR (the N_k-th operation at position k is the null transition); N_k is a function of T;
• p_i^k denotes the i-th PARSER operation carried out at position k in the word string;
  p_1^k ∈ {(unary, NTlabel), (adjoin-left, NTlabel), (adjoin-right, NTlabel), null};
  p_i^k ∈ {(adjoin-left, NTlabel), (adjoin-right, NTlabel)}, 1 < i < N_k;
  p_i^k = null, i = N_k.

Our model is based on three probabilities:

P(w_k / W_{k-1}T_{k-1})    (2)
P(t_k / w_k, W_{k-1}T_{k-1})    (3)
P(p_i^k / w_k, t_k, W_{k-1}T_{k-1}, p_1^k ... p_{i-1}^k)    (4)

As can be seen, (w_k, t_k, W_{k-1}T_{k-1}, p_1^k ... p_{i-1}^k) is one of the N_k word-parse k-prefixes W_kT_k at position k in the sentence, i = 1, ..., N_k.

To ensure a proper probabilistic model (1) we have to make sure that (2), (3) and (4) are well defined conditional probabilities and that the model halts with probability one. Consequently, certain PARSER and WORD-PREDICTOR probabilities must be given specific values:
• P(null / W_kT_k) = 1, if h_{-1}.word = <s> and h_0 ≠ (</s>, TOP') — that is, before predicting </s> — ensures that (<s>, SB) is adjoined in the last step of the parsing process;
• P((adjoin-right, TOP) / W_kT_k) = 1, if h_0 = (</s>, TOP') and h_{-1}.word = <s>, and P((adjoin-right, TOP') / W_kT_k) = 1, if h_0 = (</s>, TOP') and h_{-1}.word ≠ <s>, ensure that the parse generated by our model is consistent with the definition of a complete parse;
• P((unary, NTlabel) / W_kT_k) = 0, if h_0.tag ≠ POStag, ensures correct treatment of unary productions;
• ∃ ε > 0, ∀ W_{k-1}T_{k-1}, P(w_k = </s> / W_{k-1}T_{k-1}) ≥ ε, ensures that the model halts with probability one.

[Figure 6: Result of adjoin-right under NTlabel — the two most recent exposed heads are replaced by h'_0 = (h_0.word, NTlabel), with h'_{-1} = h_{-2}]

The word-predictor model (2) predicts the next word based on the preceding two exposed heads, thus making the following equivalence classification:

P(w_k / W_{k-1}T_{k-1}) = P(w_k / h_0, h_{-1})

After experimenting with several equivalence classifications of the word-parse prefix for the tagger model, the conditioning part of model (3) was reduced to using the word to be tagged and the tags of the two most recent exposed heads:

P(t_k / w_k, W_{k-1}T_{k-1}) = P(t_k / w_k, h_0.tag, h_{-1}.tag)

Model (4) assigns probability to different parses of the word k-prefix by chaining the elementary operations described above. The workings of the parser module are similar to those of Spatter (Jelinek et al., 1994). The equivalence classification of the W_kT_k word-parse we used for the parser model (4) was the same as the one used in (Collins, 1996):

P(p_i^k / W_kT_k) = P(p_i^k / h_0, h_{-1})

It is worth noting that if the binary branching structure developed by the parser were always right-branching and we mapped the POStag and non-terminal label vocabularies to a single type, then our model would be equivalent to a trigram language model.
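Concretely, under these equivalence classifications each component conditions only on a few features of the two most recent exposed heads. The following Python sketch is illustrative only (the Head tuple and function names are not the paper's); the contexts returned match the elementary-event contexts later used in Section 3.4.

    from collections import namedtuple

    # An exposed head: (headword, tag), where the tag is a POStag for a
    # root-only tree and a non-terminal label otherwise.
    Head = namedtuple("Head", ["word", "tag"])

    def word_predictor_context(h0, h_1):
        # model (2): P(w_k / W_{k-1}T_{k-1}) = P(w_k / h_0, h_{-1})
        return (h0.tag, h0.word, h_1.tag, h_1.word)

    def tagger_context(word, h0, h_1):
        # model (3): P(t_k / w_k, W_{k-1}T_{k-1}) = P(t_k / w_k, h_0.tag, h_{-1}.tag)
        return (word, h0.tag, h_1.tag)

    def parser_context(h0, h_1):
        # model (4): P(p_i^k / W_k T_k) = P(p_i^k / h_0, h_{-1})
        return (h0.tag, h0.word, h_1.tag, h_1.word)

    # e.g. for the partial parse of Figure 1, when predicting "after":
    # h_0 = Head("ended", "VP'") and h_{-1} = Head("<s>", "SB")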
POStags,non-terminallabels and parseroperations To ensure a proper probabilistic model (1) we (y only). For simplicity, the modeling method we have to make sure that (2), (3) and (4) are well de- chose was deleted interpolation among relative fre- fined conditional probabilities and that the model quency estimates of different orders fn(·) using a recursive mixing scheme: (k) (k’) (k+1) P(y/x1,...,xn)= λ(x1,...,xn)·P(y/x1,...,xn−1)+ 0 parser op 0 parser op 0 parser op (1−λ(x1,...,xn))·fn(y/x1,...,xn), (5) k predict. k+1 predict. k+1 predict. f−1(y)=uniform(vocabulary(y)) (6) As can be seen, the context mixing scheme dis- p parser op p parser op p parser op cardsitemsinthecontextinright-to-leftorder. The k predict. k+1 predict. k+1 predict. λ coefficients are tied based on the range of the count C(x1,...,xn). The approach is a standard p+1 parser p+1 parser p+1 parser one which doesn’t require an extensive description k+1 predict. k predict. k+1 predict. giventheliteratureavailableonit(JelinekandMer- cer, 1980). 3.2 Search Strategy P_k parser P_k parser P_k parser Since the number of parses for a given word prefix k k predict. k+1 predict. k+1 predict. Wk growsexponentially with k, |{Tk}|∼O(2 ), the state space of our model is huge even for relatively P_k+1parser P_k+1parser short sentences so we had to use a search strategy thatprunesit. Ourchoicewasasynchronousmulti- k+1 predict. k+1 predict. stack search algorithm which is very similar to a word predictor beam search. and tagger Each stack contains hypotheses — partial parses null parser transitions —thathavebeenconstructedbythesamenumberof parser adjoin/unary transitions predictor and the same number of parser operations. The hypotheses in each stack are ranked according Figure 7: One search extension cycle tothe ln(P(W,T))score,highestontop. Thewidth of the search is controlled by two parameters: • the maximum stack depth — the maximum num- model with that resulting from the standard tri- berofhypothesesthestackcancontainatanygiven gram approach, we need to factor in the entropy of state; guessingthecorrectparseTk∗beforepredicting wk+1, •log-probabilitythreshold—thedifferencebetween based solely on the word prefix Wk. the log-probability score of the top-most hypothesis The probability assignment for the word at posi- and the bottom-most hypothesis at any given state tion k+1 in the input sentence is made using: ofthestackcannotbelargerthanagiventhreshold. Figure 7 showsschematicallythe operationsasso- P(wk+1/Wk)= ciated with the scanning of a new word wk+1. The PTk∈SkP(wk+1/WkTk)·ρ(Wk,Tk), (8) above pruning strategy proved to be insufficient so we chose to also discard all hypotheses whose score ρ(Wk,Tk)=P(WkTk)/ X P(WkTk) (9) ismorethanthe log-probabilitythresholdbelowthe Tk∈Sk score of the topmost hypothesis. This additional which ensures a proper probability overstringsW∗, pruning step is performed after all hypotheses in stage k′ have been extended with the null parser whereSk isthesetofallparsespresentinourstacks at the current stage k. transition and thus prepared for scanning a new Another possibility for evaluating the word level word. perplexity of our model is to approximate the prob- 3.3 Word Level Perplexity ability of a whole sentence: The conditional perplexity calculated by assigning N to a whole sentence the probability: P(W)=XP(W,T(k)) (10) n k=1 ∗ ∗ P(W/T )= YP(wk+1/WkTk), (7) where T(k) is one of the “N-best” — in the sense k=0 defined by our search — parses for W. 
3.2 Search Strategy

Since the number of parses for a given word prefix W_k grows exponentially with k, |{T_k}| ~ O(2^k), the state space of our model is huge even for relatively short sentences, so we had to use a search strategy that prunes it. Our choice was a synchronous multi-stack search algorithm which is very similar to a beam search.

Each stack contains hypotheses — partial parses — that have been constructed by the same number of predictor and the same number of parser operations. The hypotheses in each stack are ranked according to the ln(P(W,T)) score, highest on top. The width of the search is controlled by two parameters:
• the maximum stack depth — the maximum number of hypotheses the stack can contain at any given state;
• the log-probability threshold — the difference between the log-probability score of the top-most hypothesis and the bottom-most hypothesis at any given state of the stack cannot be larger than a given threshold.

[Figure 7: One search extension cycle — stacks indexed by (number of parser operations, number of predictions) at stages (k), (k') and (k+1); hypotheses are extended by word predictor and tagger steps, null parser transitions, and parser adjoin/unary transitions]

Figure 7 shows schematically the operations associated with the scanning of a new word w_{k+1}. The above pruning strategy proved to be insufficient, so we chose to also discard all hypotheses whose score is more than the log-probability threshold below the score of the topmost hypothesis. This additional pruning step is performed after all hypotheses in stage k' have been extended with the null parser transition and thus prepared for scanning a new word.

3.3 Word Level Perplexity

The conditional perplexity calculated by assigning to a whole sentence the probability:

P(W / T*) = ∏_{k=0}^{n} P(w_{k+1} / W_k T_k*)    (7)

where T* = argmax_T P(W,T), is not valid because it is not causal: when predicting w_{k+1} we use T*, which was determined by looking at the entire sentence. To be able to compare the perplexity of our model with that resulting from the standard trigram approach, we need to factor in the entropy of guessing the correct parse T_k* before predicting w_{k+1}, based solely on the word prefix W_k.

The probability assignment for the word at position k+1 in the input sentence is made using:

P(w_{k+1} / W_k) = Σ_{T_k ∈ S_k} P(w_{k+1} / W_k T_k) · ρ(W_k, T_k)    (8)

ρ(W_k, T_k) = P(W_k T_k) / Σ_{T_k ∈ S_k} P(W_k T_k)    (9)

which ensures a proper probability over strings W*, where S_k is the set of all parses present in our stacks at the current stage k.

Another possibility for evaluating the word level perplexity of our model is to approximate the probability of a whole sentence:

P(W) = Σ_{k=1}^{N} P(W, T^(k))    (10)

where T^(k) is one of the "N-best" parses for W, in the sense defined by our search. This is a deficient probability assignment, but it is useful for justifying the model parameter re-estimation.

The two estimates (8) and (10) are both consistent in the sense that if the sums are carried over all possible parses we get the correct value for the word level perplexity of our model.
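The left-to-right word probability (8)-(9) amounts to reweighting the word predictor by the renormalized prefix probabilities of the parses surviving in the stacks. A sketch, assuming each surviving hypothesis is available as a (prefix probability, h_0, h_{-1}) triple; this is an illustrative layout, not the paper's data structures.

    def next_word_prob(word, hypotheses, p_word):
        """Eq. (8)-(9): P(w_{k+1}/W_k) = sum over T_k in S_k of
        P(w_{k+1}/W_k T_k) * rho(W_k, T_k), where rho renormalizes the prefix
        probabilities over the surviving parses S_k.
        hypotheses: list of (prefix_prob, h0, h_1) triples.
        p_word: the word-predictor model (2), assumed given."""
        total = sum(prob for prob, _, _ in hypotheses)
        p = 0.0
        for prefix_prob, h0, h_1 in hypotheses:
            rho = prefix_prob / total            # eq. (9)
            p += rho * p_word(word, h0, h_1)     # eq. (8)
        return p

Since the paper ranks hypotheses by ln P(W,T), an implementation would typically keep log scores and compute these sums with a log-sum-exp; the sketch works directly with probabilities for clarity.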
3.4 Parameter Re-estimation

The major problem we face when trying to re-estimate the model parameters is the huge state space of the model and the fact that dynamic programming techniques similar to those used in HMM parameter re-estimation cannot be used with our model. Our solution is inspired by an HMM re-estimation technique that works on pruned — N-best — trellises (Byrne et al., 1998).

Let (W, T^(k)), k = 1...N, be the set of hypotheses that survived our pruning strategy until the end of the parsing process for sentence W. Each of them was produced by a sequence of model actions, chained together as described in Section 2; let us call the sequence of model actions that produced a given (W,T) the derivation(W,T).

Let an elementary event in the derivation(W,T) be (y_l^(m_l), x_l^(m_l)) where:
• l is the index of the current model action;
• m_l is the model component — WORD-PREDICTOR, TAGGER, PARSER — that takes action number l in the derivation(W,T);
• y_l^(m_l) is the action taken at position l in the derivation:
  if m_l = WORD-PREDICTOR, then y_l^(m_l) is a word;
  if m_l = TAGGER, then y_l^(m_l) is a POStag;
  if m_l = PARSER, then y_l^(m_l) is a parser action;
• x_l^(m_l) is the context in which the above action was taken:
  if m_l = WORD-PREDICTOR or PARSER, then x_l^(m_l) = (h_0.tag, h_0.word, h_{-1}.tag, h_{-1}.word);
  if m_l = TAGGER, then x_l^(m_l) = (word-to-tag, h_0.tag, h_{-1}.tag).

The probability associated with each model action is determined as described in Section 3.1, based on counts C^(m)(y^(m), x^(m)), one set for each model component.

Assuming that the deleted interpolation coefficients and the count ranges used for tying them stay fixed, these counts are the only parameters to be re-estimated in an eventual re-estimation procedure; indeed, once a set of counts C^(m)(y^(m), x^(m)) is specified for a given model m, we can easily calculate:
• the relative frequency estimates f_n^(m)(y^(m) / x_n^(m)) for all context orders n = 0 ... maximum-order(model(m));
• the count C^(m)(x_n^(m)) used for determining the λ(x_n^(m)) value to be used with the order-n context x_n^(m).
This is all we need for calculating the probability of an elementary event and then the probability of an entire derivation.

One training iteration of the re-estimation procedure we propose is described by the following algorithm:

    N-best parse development data;      // counts.E_i
    // prepare counts.E_(i+1)
    for each model component c{
      gather_counts development model_c;
    }

In the parsing stage we retain for each "N-best" hypothesis (W, T^(k)), k = 1...N, only the quantity

φ(W, T^(k)) = P(W, T^(k)) / Σ_{k=1}^{N} P(W, T^(k))

and its derivation(W, T^(k)). We then scan all the derivations in the "development set" and, for each occurrence of the elementary event (y^(m), x^(m)) in derivation(W, T^(k)), we accumulate the value φ(W, T^(k)) in the C^(m)(y^(m), x^(m)) counter to be used in the next iteration.

The intuition behind this procedure is that φ(W, T^(k)) is an approximation to the P(T^(k) / W) probability which places all its mass on the parses that survived the parsing process; the above procedure simply accumulates the expected values of the counts C^(m)(y^(m), x^(m)) under the φ(W, T^(k)) conditional distribution. As explained previously, the C^(m)(y^(m), x^(m)) counts are the parameters defining our model, making our procedure similar to a rigorous EM approach (Dempster et al., 1977).

A particular — and very interesting — case is that of events which had count zero but get a non-zero count in the next iteration, caused by the "N-best" nature of the re-estimation process. Consider a given sentence in our "development" set. The "N-best" derivations for this sentence are trajectories through the state space of our model. They will change from one iteration to the other due to the smoothing involved in the probability estimation and the change of the parameters — event counts — defining our model, thus allowing new events to appear and discarding others through purging low probability events from the stacks. The higher the number of trajectories per sentence, the more dynamic this change is expected to be.

The results we obtained are presented in the experiments section. All the perplexity evaluations were done using the left-to-right formula (8) (L2R-PPL), for which the perplexity on the "development set" is not guaranteed to decrease from one iteration to another. However, we believe that our re-estimation method should not increase the approximation to perplexity based on (10) (SUM-PPL) — again, on the "development set"; we rely on the consistency property outlined at the end of Section 3.3 to correlate the desired decrease in L2R-PPL with that in SUM-PPL. No claim can be made about the change in either L2R-PPL or SUM-PPL on test data.

3.5 Initial Parameters

Each model component — WORD-PREDICTOR, TAGGER, PARSER — is trained initially from a set of parsed sentences, after each parse tree (W,T) undergoes:
• headword percolation and binarization — see Section 4;
• decomposition into its derivation(W,T).
Then, separately for each model component m, we:
• gather joint counts C^(m)(y^(m), x^(m)) from the derivations that make up the "development data" using φ(W,T) = 1;
• estimate the deleted interpolation coefficients on joint counts gathered from the "check data" using the EM algorithm.
These are the initial parameters used with the re-estimation procedure described in the previous section.
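The count accumulation at the heart of one re-estimation iteration can be sketched as follows; the representation of a derivation as a list of (component, y, x) elementary events and the counts dictionaries are invented for illustration, and the stack-search parsing that produces the N-best list is assumed to have run already.

    from collections import defaultdict

    def accumulate_counts(nbest, counts):
        """One sentence's contribution to the next-iteration counts (Section 3.4).
        nbest: list of (joint_prob, derivation) pairs for the N-best parses of W,
        where a derivation is a list of elementary events (component, y, x).
        counts: dict component -> defaultdict mapping (y, x) -> float."""
        total = sum(prob for prob, _ in nbest)
        for prob, derivation in nbest:
            phi = prob / total                  # phi(W,T(k)) = P(W,T(k)) / sum_k' P(W,T(k'))
            for component, y, x in derivation:
                counts[component][(y, x)] += phi   # expected count under phi

    # counts holds one table per model component, e.g.:
    # counts = {m: defaultdict(float) for m in ("WORD-PREDICTOR", "TAGGER", "PARSER")}

For the initial (E0) parameters of Section 3.5, the same accumulation is run with a single treebank derivation per sentence and φ(W,T) = 1.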
4 Headword Percolation and Binarization

In order to get initial statistics for our model components we needed to binarize the UPenn Treebank (Marcus et al., 1995) parse trees and percolate headwords. The procedure we used was to first percolate headwords using a context-free (CF) rule-based approach and then binarize the parses, again using a rule-based approach.

The headword of a phrase is the word that best represents the phrase, all the other words in the phrase being modifiers of the headword. Statistically speaking, we were satisfied with the output of an enhanced version of the procedure described in (Collins, 1996) — also known under the name "Magerman & Black Headword Percolation Rules".

Once the position of the headword within a constituent — equivalent to a CF production of the type Z → Y_1 ... Y_n, where Z, Y_1, ..., Y_n are non-terminal labels or POStags (the latter only for the Y_i) — is identified to be k, we binarize the constituent as follows: depending on the identity of Z, a fixed rule is used to decide which of the two binarization schemes in Figure 8 to apply. The intermediate nodes created by the above binarization schemes receive the non-terminal label Z'.

[Figure 8: Binarization schemes A and B — a constituent Z → Y_1 ... Y_k ... Y_n is turned into a binary tree whose intermediate nodes are labeled Z']

5 Experiments

Due to the low speed of the parser — 200 wds/min for stack depth 10 and log-probability threshold 6.91 nats (1/1000) — we could carry out the re-estimation technique described in Section 3.4 on only 1 Mwds of training data. For convenience we chose to work on the UPenn Treebank corpus. The vocabulary sizes were:
• word vocabulary: 10k, open — all words outside the vocabulary are mapped to the <unk> token;
• POS tag vocabulary: 40, closed;
• non-terminal tag vocabulary: 52, closed;
• parser operation vocabulary: 107, closed.
The training data was split into a "development" set — 929,564 wds (sections 00-20) — and a "check" set — 73,760 wds (sections 21-22); the test set size was 82,430 wds (sections 23-24). The "check" set has been used for estimating the interpolation weights and tuning the search parameters; the "development" set has been used for gathering/estimating counts; the test set has been used strictly for evaluating model performance.

Table 1 shows the results of the re-estimation technique presented in Section 3.4. We achieved a reduction in test-data perplexity, bringing an improvement over a deleted interpolation trigram model whose perplexity was 167.14 on the same training-test data; the reduction is statistically significant according to a sign test.

Table 1: Parameter re-estimation results

iteration number | DEV set L2R-PPL | TEST set L2R-PPL
E0               | 24.70           | 167.47
E1               | 22.34           | 160.76
E2               | 21.69           | 158.97
E3               | 21.26           | 158.28
3-gram           | 21.20           | 167.14

Simple linear interpolation between our model and the trigram model:

Q(w_{k+1} / W_k) = λ · P(w_{k+1} / w_{k-1}, w_k) + (1 - λ) · P(w_{k+1} / W_k)

yielded a further improvement in PPL, as shown in Table 2. The interpolation weight was estimated on check data to be λ = 0.36.

Table 2: Interpolation with trigram results

iteration number | TEST set L2R-PPL | TEST set 3-gram interpolated PPL
E0               | 167.47           | 152.25
E3               | 158.28           | 148.90
3-gram           | 167.14           | 167.14

An overall relative reduction of 11% over the trigram model has been achieved.
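The interpolation reported in Table 2 is a fixed-weight mixture at the word level; a one-line Python sketch (the function name is invented; λ = 0.36 is the weight estimated on the check data):

    def interpolated_word_prob(p_trigram, p_syntactic, lam=0.36):
        """Q(w_{k+1}/W_k) = lam * P(w_{k+1}/w_{k-1}, w_k) + (1 - lam) * P(w_{k+1}/W_k),
        mixing the trigram prediction with the syntactic model's left-to-right
        prediction of eq. (8). Both probabilities are assumed already computed."""
        return lam * p_trigram + (1.0 - lam) * p_syntactic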
6 Conclusions and Future Directions

The large difference between the perplexity of our model calculated on the "development" set — used for model parameter estimation — and the "test" set — unseen data — shows that the initial point we chose for the parameter values has already captured a lot of information from the training data. The same problem is encountered in standard n-gram language modeling; however, our approach has more flexibility in dealing with it due to the possibility of re-estimating the model parameters.

We believe that the above experiments show the potential of our approach for improved language models. Our future plans include:
• experiment with other parameterizations than the two most recent exposed heads in the word predictor model and parser;
• estimate a separate word predictor for left-to-right language modeling; note that the corresponding model predictor was obtained via re-estimation aimed at increasing the probability of the "N-best" parses of the entire sentence;
• reduce the vocabulary of parser operations; in the extreme case — no non-terminal labels/POS tags, a word-only model — this will increase the speed of the parser, thus rendering it usable on larger amounts of training data and allowing the use of deeper stacks, resulting in more "N-best" derivations per sentence during re-estimation;
• relax — flatten — the initial statistics in the re-estimation of model parameters; this would allow the model parameters to converge to a different point that might yield a lower word-level perplexity;
• evaluate model performance on n-best sentences output by an automatic speech recognizer.

7 Acknowledgments

This research has been funded by the NSF IRI-19618874 grant (STIMULATE). The authors would like to thank Sanjeev Khudanpur for his insightful suggestions. Thanks also to Harry Printz, Eric Ristad, Andreas Stolcke, Dekai Wu and all the other members of the dependency modeling group at the summer96 DoD Workshop for useful comments on the model, programming support and an extremely creative environment. Thanks as well to Eric Brill, Sanjeev Khudanpur, David Yarowsky, Radu Florian, Lidia Mangu and Jun Wu for useful input during the meetings of the people working on our STIMULATE grant.

References

W. Byrne, A. Gunawardana, and S. Khudanpur. 1998. Information geometry and EM variants. Technical Report CLSP Research Note 17, Department of Electrical and Computer Engineering, The Johns Hopkins University, Baltimore, MD.

C. Chelba, D. Engle, F. Jelinek, V. Jimenez, S. Khudanpur, L. Mangu, H. Printz, E. S. Ristad, R. Rosenfeld, A. Stolcke, and D. Wu. 1997. Structure and performance of a dependency language model. In Proceedings of Eurospeech, volume 5, pages 2775-2778. Rhodes, Greece.

Michael John Collins. 1996. A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 184-191. Santa Cruz, CA.

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. In Journal of the Royal Statistical Society, volume 39 of B, pages 1-38.

Frederick Jelinek and Robert Mercer. 1980. Interpolated estimation of Markov source parameters from sparse data. In E. Gelsema and L. Kanal, editors, Pattern Recognition in Practice, pages 381-397.

F. Jelinek, J. Lafferty, D. M. Magerman, R. Mercer, A. Ratnaparkhi, and S. Roukos. 1994. Decision tree parsing using a hidden derivational model. In ARPA, editor, Proceedings of the Human Language Technology Workshop, pages 272-277.

M. Marcus, B. Santorini, and M. Marcinkiewicz. 1995. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330.

Colin Philips. 1996. Order and Structure. Ph.D. thesis, MIT. Distributed by MITWPL.
