Enumeration of Extractive Oracle Summaries

Tsutomu Hirao and Masaaki Nishino and Jun Suzuki and Masaaki Nagata
NTT Communication Science Laboratories, NTT Corporation
2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0237, Japan
{hirao.tsutomu,nishino.masaaki}@lab.ntt.co.jp
{suzuki.jun,nagata.masaaki}@lab.ntt.co.jp

arXiv:1701.01614v1 [cs.CL] 6 Jan 2017

Abstract

To analyze the limitations and the future directions of the extractive summarization paradigm, this paper proposes an Integer Linear Programming (ILP) formulation to obtain extractive oracle summaries in terms of ROUGE_n. We also propose an algorithm that enumerates all of the oracle summaries for a set of reference summaries to exploit F-measures that evaluate which system summaries contain how many sentences that are extracted as an oracle summary. Our experimental results obtained from Document Understanding Conference (DUC) corpora demonstrated the following: (1) room still exists to improve the performance of extractive summarization; (2) the F-measures derived from the enumerated oracle summaries have significantly stronger correlations with human judgment than those derived from single oracle summaries.

1 Introduction

Recently, compressive and abstractive summarization are attracting attention (e.g., Almeida and Martins (2013), Qian and Liu (2013), Yao et al. (2015), Banerjee et al. (2015), Bing et al. (2015)). However, extractive summarization remains a primary research topic because the linguistic quality of the resultant summaries is guaranteed, at least at the sentence level, which is a key requirement for practical use (e.g., Hong and Nenkova (2014), Hong et al. (2015), Yogatama et al. (2015), Parveen et al. (2015)).

The summarization research community is experiencing a paradigm shift from extractive to compressive or abstractive summarization. Currently our question is: "Is extractive summarization still useful research?" To answer it, the ultimate limitations of the extractive summarization paradigm must be comprehended; that is, we have to determine its upper bound and compare it with the performance of the state-of-the-art summarization methods. Since ROUGE_n is the de facto automatic evaluation method and is employed in many text summarization studies, an oracle summary is defined as a set of sentences that has a maximum ROUGE_n score. If the ROUGE_n score of an oracle summary outperforms that of a system that employs another summarization approach, the extractive summarization paradigm is worthwhile to leverage research resources.

As another benefit, identifying an oracle summary for a set of reference summaries allows us to utilize yet another evaluation measure. Since both oracle and extractive summaries are sets of sentences, it is easy to check whether a system summary contains sentences in the oracle summary. As a result, F-measures, which are available to evaluate a system summary, are useful for evaluating classification-based extractive summarization (Mani and Bloedorn, 1998; Osborne, 2002; Hirao et al., 2002). Since ROUGE_n evaluation does not identify which sentence is important, an F-measure conveys useful information in terms of "important sentence extraction." Thus, combining ROUGE_n and an F-measure allows us to scrutinize the failure analysis of systems.

Note that more than one oracle summary might exist for a set of reference summaries because ROUGE_n scores are based on the unweighted counting of n-grams. As a result, an F-measure might not be identical among multiple oracle summaries. Thus, we need to enumerate the oracle summaries for a set of reference summaries and compute the F-measures based on them.

In this paper, we first derive an Integer Linear Programming (ILP) problem to extract an oracle summary from a set of reference summaries and a source document(s). To the best of our knowledge, this is the first ILP formulation that extracts oracle summaries. Second, since it is difficult to enumerate oracle summaries for a set of reference summaries using ILP solvers, we propose an algorithm that efficiently enumerates all oracle summaries by exploiting the branch and bound technique. Our experimental results on the Document Understanding Conference (DUC) corpora showed the following:

1. Room still exists for the further improvement of extractive summarization, i.e., the ROUGE_n scores of the oracle summaries are significantly higher than those of the state-of-the-art summarization systems.

2. The F-measures derived from multiple oracle summaries obtain significantly stronger correlations with human judgment than those derived from single oracle summaries.

2 Definition of Extractive Oracle Summaries

We first briefly describe ROUGE_n. Given a set of reference summaries $\mathbf{R}$ and a system summary $S$, ROUGE_n is defined as follows:

$$
\mathrm{ROUGE}_n(\mathbf{R}, S) = \frac{\sum_{k=1}^{|\mathbf{R}|} \sum_{j=1}^{|U(\mathcal{R}_k)|} \min\{N(g_j^n, R_k),\, N(g_j^n, S)\}}{\sum_{k=1}^{|\mathbf{R}|} \sum_{j=1}^{|U(\mathcal{R}_k)|} N(g_j^n, R_k)} \tag{1}
$$

$\mathcal{R}_k$ denotes the multiset of n-grams that occur in the k-th reference summary $R_k$, and $\mathcal{S}$ denotes the multiset of n-grams that appear in the system-generated summary $S$ (a set of sentences). $N(g_j^n, R_k)$ and $N(g_j^n, S)$ return the number of occurrences of n-gram $g_j^n$ in the k-th reference and system summaries, respectively. Function $U(\cdot)$ transforms a multiset into a normal set. ROUGE_n takes values in the range of [0,1], and when the n-gram occurrences of the system summary agree with those of the reference summary, the value is 1.

In this paper, we focus on extractive summarization, employ ROUGE_n as an evaluation measure, and define the oracle summaries as follows:

$$
O = \operatorname*{arg\,max}_{S \subseteq D} \mathrm{ROUGE}_n(\mathbf{R}, S) \quad \text{s.t.} \quad \ell(S) \le L_{\max}. \tag{2}
$$

$D$ is the set of all the sentences contained in the input document(s), and $L_{\max}$ is the length limitation of the oracle summary. $\ell(S)$ indicates the number of words in the system summary. Eq. (2) is an NP-hard combinatorial optimization problem, and no polynomial-time algorithm is known that can attain an optimal solution.

3 Related Work

Lin and Hovy (2003) utilized a naive exhaustive search method to obtain oracle summaries in terms of ROUGE_n and exploited them to understand the limitations of extractive summarization systems. Ceylan et al. (2010) proposed another naive exhaustive search method to derive a probability density function from the ROUGE_n scores of oracle summaries for the domains to which source documents belong. The computational complexity of naive exhaustive methods is exponential in the size of the sentence set. Thus, it may be possible to apply them to single document summarization tasks involving a dozen sentences, but it is infeasible to apply them to multiple document summarization tasks that involve several hundred sentences.

To describe the difference between the ROUGE_n scores of oracle and system summaries in multiple document summarization tasks, Riedhammer et al. (2008) proposed an approximate algorithm with a genetic algorithm (GA) to find oracle summaries. Moen et al. (2014) utilized a greedy algorithm for the same purpose. Although GA or greedy algorithms are widely used to solve NP-hard combinatorial optimization problems, the solutions are not always optimal. Thus, the summary does not always have a maximum ROUGE_n score for the set of reference summaries. Both works called the summary found by their methods the oracle, but it differs from the definition in our paper.

Since summarization systems cannot reproduce human-made reference summaries in most cases, oracle summaries, which can be reproduced by summarization systems, have been used as training data to tune the parameters of summarization systems. For example, Kulesza and Tasker (2011) and Sipos et al. (2012) trained their summarizers with oracle summaries found by a greedy algorithm. Peyrard and Eckle-Kohler (2016) proposed a method to find a summary that approximates a ROUGE score based on the ROUGE scores of individual sentences and exploited the framework to train their summarizer.
As mentioned above, such summaries do not always agree with the oracle summaries defined in our paper. Thus, the quality of the training data is suspect. Moreover, since these studies fail to consider that a set of reference summaries has multiple oracle summaries, the score of the loss function defined between their oracle and system summaries is not appropriate in most cases.

As mentioned above, no known efficient algorithm can extract "exact" oracle summaries, as defined in Eq. (2), because only a naive exhaustive search is available. Thus, such approximate algorithms as a greedy algorithm are mainly employed to obtain them.

4 Oracle Summary Extraction as an Integer Linear Programming (ILP) Problem

To extract an oracle summary from document(s) and a given set of reference summaries, we start by deriving an Integer Linear Programming (ILP) problem. Since the denominator of Eq. (1) is constant for a given set of reference summaries, we can find an oracle summary by maximizing the numerator of Eq. (1). Thus, the ILP formulation is defined as follows:

$$
\begin{aligned}
\underset{z}{\text{maximize}} \quad & \sum_{k=1}^{|\mathbf{R}|} \sum_{j=1}^{|U(\mathcal{R}_k)|} z_{kj} && (3) \\
\text{s.t.} \quad & \sum_{i=1}^{|D|} \ell(s_i)\, x_i \le L_{\max} && (4) \\
& \forall j: \ \sum_{i=1}^{|D|} N(g_j^n, s_i)\, x_i \ge z_{kj} && (5) \\
& \forall j: \ N(g_j^n, R_k) \ge z_{kj} && (6) \\
& \forall i: \ x_i \in \{0, 1\} && (7) \\
& \forall j: \ z_{kj} \in \mathbb{Z}_{+}. && (8)
\end{aligned}
$$

Here, $z_{kj}$ is the count of the j-th n-gram of the k-th reference summary in the oracle summary, i.e., $z_{kj} = \min\{N(g_j^n, R_k), N(g_j^n, S)\}$. $\ell(\cdot)$ returns the number of words in the sentence, $x_i$ is a binary indicator, and $x_i = 1$ denotes that the i-th sentence $s_i$ is included in the oracle summary. $N(g_j^n, s_i)$ returns the number of occurrences of n-gram $g_j^n$ in the i-th sentence. Constraints (5) and (6) ensure that $z_{kj} = \min\{N(g_j^n, R_k), N(g_j^n, S)\}$.

5 Branch and Bound Technique for Enumerating Oracle Summaries

Since enumerating oracle summaries with an ILP solver is difficult, we extend the exhaustive search approach by introducing a search and prune technique to enumerate the oracle summaries. The search pruning decision is made by comparing the current upper bound of the ROUGE_n score with the maximum ROUGE_n score in the search history.

5.1 ROUGE_n Score for Two Distinct Sets of Sentences

The enumeration of oracle summaries can be regarded as a depth-first search on a tree whose nodes represent sentences. Fig. 1 shows an example of a search tree created in a naive exhaustive search. The nodes represent sentences and the path from the root node to an arbitrary node represents a summary. For example, the red path in Fig. 1 from the root node to node $s_2$ represents a summary consisting of sentences $s_1, s_2$. By utilizing the tree, we can enumerate oracle summaries by exploiting depth-first searches while excluding the summaries that violate length constraints. However, this naive exhaustive search approach is impractical for large data sets because the number of nodes inside the tree is $2^{|D|}$.

[Figure 1: Example of a search tree]

If we prune the unwarranted subtrees in each step of the depth-first search, we can make the search more efficient. The decision to search or prune is made by comparing the current upper bound of the ROUGE_n score with the maximum ROUGE_n score in the search history. For instance, in Fig. 1, we reach node $s_2$ by following this path: "Root → $s_1$ → $s_2$". If we estimate the maximum ROUGE_n score (upper bound) obtained by searching the descendants of $s_2$ (the subtree in the blue rectangle), we can decide whether the depth-first search should be continued. When the upper bound of the ROUGE_n score exceeds the current maximum ROUGE_n in the search history, we have to continue. When the upper bound is smaller than the current maximum ROUGE_n score, no summary that contains $s_1, s_2$ is optimal, so we can skip subsequent search activity on the subtree and proceed to check the next branch: "Root → $s_1$ → $s_3$".

To estimate the upper bound of the ROUGE_n score, we re-define it for two distinct sets of sentences, $V$ and $W$, i.e., $V \cap W = \emptyset$, as follows:

$$
\mathrm{ROUGE}_n(\mathbf{R}, V \cup W) = \mathrm{ROUGE}_n(\mathbf{R}, V) + \mathrm{ROUGE}'_n(\mathbf{R}, V, W). \tag{9}
$$

Here $\mathrm{ROUGE}'_n$ is defined as follows:

$$
\mathrm{ROUGE}'_n(\mathbf{R}, V, W) = \frac{\sum_{k=1}^{|\mathbf{R}|} \sum_{t_n \in U(\mathcal{R}_k)} \min\{N(t_n, \mathcal{R}_k \setminus \mathcal{V}),\, N(t_n, \mathcal{W})\}}{\sum_{k=1}^{|\mathbf{R}|} \sum_{t_n \in U(\mathcal{R}_k)} N(t_n, \mathcal{R}_k)}. \tag{10}
$$

$\mathcal{V}$ and $\mathcal{W}$ are the multisets of n-grams found in the sets of sentences $V$ and $W$, respectively.

Theorem 1. Eq. (9) is correct.

Proof. See Appendix A.

5.2 Upper Bound of ROUGE_n

Let $V$ be the set of sentences on the path from the current node to the root node in the search tree, and let $W$ be the set of sentences that are the descendants of the current node. In Fig. 1, $V = \{s_1, s_2\}$ and $W = \{s_3, s_4, s_5, s_6\}$. According to Theorem 1, the upper bound of the ROUGE_n score is defined as:

$$
\widehat{\mathrm{ROUGE}}_n(\mathbf{R}, V) = \mathrm{ROUGE}_n(\mathbf{R}, V) + \max_{\Omega \subseteq W} \{\mathrm{ROUGE}'_n(\mathbf{R}, V, \Omega) : \ell(\Omega) \le L_{\max} - \ell(V)\}. \tag{11}
$$

Since computing the second term on the right side of Eq. (11) is an NP-hard problem, we turn to the following relation by introducing the inequality $\mathrm{ROUGE}'_n(\mathbf{R}, V, \Omega) \le \sum_{\omega \in \Omega} \mathrm{ROUGE}'_n(\mathbf{R}, V, \{\omega\})$:

$$
\begin{aligned}
& \max_{\Omega \subseteq W} \{\mathrm{ROUGE}'_n(\mathbf{R}, V, \Omega) : \ell(\Omega) \le L_{\max} - \ell(V)\} \\
& \quad \le \max_{\mathbf{x}} \Big\{ \sum_{i=1}^{|W|} \mathrm{ROUGE}'_n(\mathbf{R}, V, \{w_i\})\, x_i : \sum_{i=1}^{|W|} \ell(\{w_i\})\, x_i \le L_{\max} - \ell(V) \Big\}.
\end{aligned} \tag{12}
$$

Here, $\mathbf{x} = (x_1, \ldots, x_{|W|})$ and $x_i \in \{0, 1\}$. The right side of Eq. (12) is a knapsack problem, i.e., a 0-1 ILP problem. Although we can obtain the optimal solution for it using dynamic programming or ILP solvers, we solve its linear programming relaxation by applying a greedy algorithm for greater computational efficiency. The solution output by the greedy algorithm is optimal for the relaxed problem. Since the optimal value of the relaxed problem is never smaller than that of the original problem, the relaxed problem solution can be utilized as the upper bound. Algorithm 1 shows the pseudocode that attains the upper bound of ROUGE_n. In the algorithm, $U$ indicates the upper bound score of ROUGE_n. We first set the initial score of upper bound $U$ to $\mathrm{ROUGE}_n(\mathbf{R}, V)$ (line 3). Then we compute the density of the $\mathrm{ROUGE}'_n$ scores ($\mathrm{ROUGE}'_n(\mathbf{R}, V, \{w\}) / \ell(w)$) for each sentence $w$ in $W$ and sort them in descending order (lines 4 to 7). When we have room to add $w$ to the summary, we update $U$ by adding $\mathrm{ROUGE}'_n(\mathbf{R}, V, \{w\})$ (line 10) and update length constraint $L_{\max}$ (line 11). When we do not have room to add $w$, we update $U$ by adding the score obtained by multiplying the density of $w$ by the remaining length, $L_{\max}$ (line 13), and exit the loop.

Algorithm 1 Algorithm to Find Upper Bound of ROUGE_n
 1: Function: $\widehat{\mathrm{ROUGE}}_n(\mathbf{R}, V)$
 2:   W ← descendant(last(V)), W′ ← ∅
 3:   U ← $\mathrm{ROUGE}_n(\mathbf{R}, V)$
 4:   for each w ∈ W do
 5:     append(W′, $\mathrm{ROUGE}'_n(\mathbf{R}, V, \{w\}) / \ell(w)$)
 6:   end for
 7:   sort(W′, 'descend')
 8:   for each w ∈ W′ do
 9:     if $L_{\max} - \ell(\{w\}) \ge 0$ then
10:       U ← U + $\mathrm{ROUGE}'_n(\mathbf{R}, V, \{w\})$
11:       $L_{\max}$ ← $L_{\max} - \ell(\{w\})$
12:     else
13:       U ← U + $\frac{\mathrm{ROUGE}'_n(\mathbf{R}, V, \{w\})}{\ell(\{w\})} \times L_{\max}$
14:       break the loop
15:     end if
16:   end for
17:   return U
18: end

5.3 Initial Score for Search

Since the branch and bound technique prunes the search by comparing the best solution found so far with the upper bounds, obtaining a good solution in the early stage is critical for raising search efficiency.

Since ROUGE_n is a monotone submodular function (Lin and Bilmes, 2011), we can obtain a good approximate solution by a greedy algorithm (Khuller et al., 1999). It is guaranteed that the score of the obtained approximate solution is larger than $\frac{1}{2}(1 - \frac{1}{e})\,\mathrm{OPT}$, where OPT is the score of the optimal solution. We employ the solution as the initial ROUGE_n score of the candidate oracle summary.

Algorithm 2 shows the greedy algorithm. In it, $S$ denotes a summary and $D$ denotes a set of sentences. The algorithm iteratively adds sentence $s^*$ that yields the largest gain in the ROUGE_n score to current summary $S$, provided the length of the summary does not violate length constraint $L_{\max}$ (line 4). After the while loop, the algorithm compares the ROUGE_n score of $S$ with the maximum ROUGE_n score of the single sentence and outputs the larger of the two scores (lines 11 to 13).

Algorithm 2 Greedy algorithm to obtain initial score
 1: Function: GREEDY($\mathbf{R}$, D, $L_{\max}$)
 2:   L ← 0, S ← ∅, E ← D
 3:   while E ≠ ∅ do
 4:     s* ← $\operatorname{arg\,max}_{s \in E} \left\{ \frac{\mathrm{ROUGE}_n(\mathbf{R}, S \cup \{s\}) - \mathrm{ROUGE}_n(\mathbf{R}, S)}{\ell(\{s\})} \right\}$
 5:     L ← L + $\ell(\{s^*\})$
 6:     if L ≤ $L_{\max}$ then
 7:       S ← S ∪ {s*}
 8:     end if
 9:     E ← E \ {s*}
10:   end while
11:   i* ← $\operatorname{arg\,max}_{i \in D,\ \ell(\{i\}) \le L_{\max}} \mathrm{ROUGE}_n(\mathbf{R}, \{i\})$
12:   S* ← $\operatorname{arg\,max}_{K \in \{\{i^*\}, S\}} \mathrm{ROUGE}_n(\mathbf{R}, K)$
13:   return $\mathrm{ROUGE}_n(\mathbf{R}, S^*)$
14: end

5.4 Enumeration of Oracle Summaries

By introducing threshold $\tau$ as the best ROUGE_n score in the search history, pruning decisions involve the following three conditions:

1. $\mathrm{ROUGE}_n(\mathbf{R}, V) \ge \tau$;
2. $\mathrm{ROUGE}_n(\mathbf{R}, V) < \tau$, $\widehat{\mathrm{ROUGE}}_n(\mathbf{R}, V) < \tau$;
3. $\mathrm{ROUGE}_n(\mathbf{R}, V) < \tau$, $\widehat{\mathrm{ROUGE}}_n(\mathbf{R}, V) \ge \tau$.

With case 1, we update the oracle summary as $V$ and continue the search. With case 2, because both $\mathrm{ROUGE}_n(\mathbf{R}, V)$ and $\widehat{\mathrm{ROUGE}}_n(\mathbf{R}, V)$ are smaller than $\tau$, the subtree whose root node is the current node (last visited node) is pruned from the search space, and we continue the depth-first search from the neighbor node. With case 3, we do not update the oracle summary as $V$ because $\mathrm{ROUGE}_n(\mathbf{R}, V)$ is less than $\tau$. However, we might obtain a better oracle summary by continuing the depth-first search because the upper bound of the ROUGE_n score is at least $\tau$. Thus, we continue to search the descendants of the current node.

Algorithm 3 shows the pseudocode that enumerates the oracle summaries. The algorithm reads a set of reference summaries $\mathbf{R}$, length limitation $L_{\max}$, and set of sentences $D$ (line 1) and initializes threshold $\tau$ as the ROUGE_n score obtained by the greedy algorithm (Algorithm 2). It also initializes $O_\tau$, which stores oracle summaries whose ROUGE_n scores are $\tau$, and priority queue $C$, which stores the history of the depth-first search (line 2). Next, the algorithm computes the ROUGE_n score for each sentence and stores the sentences in $S$ after sorting them in descending order. After that, we start a depth-first search by recursively calling procedure FINDORACLE. In the procedure, we extract the top sentence from priority queue $Q$ and append it to priority queue $V$ (lines 11 to 12). When the length of $V$ does not exceed $L_{\max}$, if $\mathrm{ROUGE}_n(\mathbf{R}, V)$ is at least threshold $\tau$ (case 1), we update $\tau$ as the score and append current $V$ to $O_\tau$. Then we continue the depth-first search by calling the procedure FINDORACLE (lines 15 to 17). If $\widehat{\mathrm{ROUGE}}_n(\mathbf{R}, V)$ is at least $\tau$ (case 3), we do not update $\tau$ and $O_\tau$ but reenter the depth-first search by calling the procedure again (lines 18 to 19). If neither case 1 nor case 3 is true, we delete the last visited sentence from $V$ and return to the top of the recurrence.

Algorithm 3 Branch and bound technique to enumerate oracle summaries
 1: Read $\mathbf{R}$, D, $L_{\max}$
 2: τ ← GREEDY($\mathbf{R}$, D, $L_{\max}$), $O_\tau$ ← ∅
 3: for each s ∈ D do
 4:   append(S, ⟨$\mathrm{ROUGE}_n(\mathbf{R}, \{s\})$, s⟩)
 5: end for
 6: sort(S, 'descend')
 7: call FINDORACLE(S, C)
 8: output $O_\tau$
 9: Procedure: FINDORACLE(Q, V)
10:   while Q ≠ ∅ do
11:     s ← shift(Q)
12:     append(V, s)
13:     if $L_{\max} - \ell(V) \ge 0$ then
14:       if $\mathrm{ROUGE}_n(\mathbf{R}, V) \ge \tau$ then
15:         τ ← $\mathrm{ROUGE}_n(\mathbf{R}, V)$
16:         append($O_\tau$, V)
17:         call FINDORACLE(Q, V)
18:       else if $\widehat{\mathrm{ROUGE}}_n(\mathbf{R}, V) \ge \tau$ then
19:         call FINDORACLE(Q, V)
20:       end if
21:     end if
22:     pop(V)
23:   end while
24: end

6 Experiments

6.1 Experimental Setting

We conducted experiments on the corpora developed for a multiple document summarization task in DUC 2001 to 2007. Table 1 shows the statistics of the data. In particular, the DUC-2005 to -2007 data sets not only have very large numbers of sentences and words but also a long target length (the reference summary length) of 250 words.

Year  Topics  Docs.  Sents.  Words    Refs.  Length
01    30      10     365     7706     89     100
02    59      10     238     4822     116    100
03    30      10     245     5711     120    100
04    50      10     218     4870     200    100
05    50      29.5   885     18273.5  300    250
06    50      25     732.5   15997.5  200    250
07    45      25     516     11427    180    250

Table 1: Statistics of data set

All the words in the documents were stemmed by Porter's stemmer (Porter, 1980). We computed ROUGE_1 scores, excluding stopwords, and computed ROUGE_2 scores, keeping them. Owczarzak et al. (2012) suggested using ROUGE_1 and keeping stopwords. However, as Takamura et al. argued (Takamura and Okumura, 2009), the summaries optimized with non-content words failed to consider the actual quality. Thus, we excluded stopwords for computing the ROUGE_1 scores.

We enumerated the following two types of oracle summaries: those for a set of references for a given topic and those for each reference in the set of references.

6.2 Results and Discussion

6.2.1 Impact of Oracle ROUGE_n scores

Table 2 shows the average ROUGE_{1,2} scores of the oracle summaries obtained from both a set of references and each reference in the set ("multi" and "single"), those of the best conventional system (Peer), and those obtained from summaries produced by a greedy algorithm (Algorithm 2).

                 01          02          03          04          05          06          07
                 R1    R2    R1    R2    R1    R2    R1    R2    R1    R2    R1    R2    R1    R2
Oracle (multi)   .400  .164  .452  .186  .434  .185  .427  .162  .445  .177  .491  .211  .506  .236
Oracle (single)  .500  .226  .515  .225  .525  .258  .519  .228  .574  .279  .607  .303  .622  .330
Greedy           .387  .161  .438  .184  .424  .182  .412  .157  .430  .173  .473  .206  .495  .234
Peer             .251  .080  .269  .080  .295  .094  .305  .092  .262  .073  .305  .095  .363  .117
ID               T     T     19    19    26    13    67    65    10    15    23    24    29    15

Table 2: ROUGE_{1,2} scores of oracle summaries, greedy summaries, and system summaries for each data set

Oracle (single) obtained better ROUGE_{1,2} scores than Oracle (multi). The results imply that it is easier to optimize a reference summary than a set of reference summaries. On the other hand, the ROUGE_{1,2} scores of these oracle summaries are significantly higher than those of the best systems. Relative to the oracle summaries, the best systems obtained ROUGE_1 scores from 60% to 70% in "multi" and from 50% to 60% in "single" as well as ROUGE_2 scores from 40% to 55% in "multi" and from 30% to 40% in "single".

Since the systems in Table 2 were developed over many years, we compared the ROUGE_n scores of the oracle summaries with those of the current state-of-the-art systems using the DUC-2004 corpus and obtained summaries generated by different systems from a public repository [1] (Hong et al., 2014). The repository includes summaries produced by the following seven state-of-the-art summarization systems: CLASSY04 (Conroy et al., 2004), CLASSY11 (Conroy et al., 2011), Submodular (Lin and Bilmes, 2012), DPP (Kulesza and Tasker, 2011), RegSum (Hong and Nenkova, 2014), OCCAMS V (Davie et al., 2012; Conroy et al., 2013), and ICSISumm (Gillick and Favre, 2009; Gillick et al., 2009). Table 3 shows the results.

[1] http://www.cis.upenn.edu/~nlp/corpora/sumrepo.html

System           ROUGE_1  ROUGE_2
Oracle (multi)   .427     .162
Oracle (single)  .519     .228
CLASSY04         .305     .0897
CLASSY11         .286     .0919
Submodular       .300     .0933
DPP              .309     .0960
RegSum           .331     .0974
OCCAMS V         .300     .0974
ICSISumm         .310     .0980

Table 3: ROUGE_{1,2} scores for state-of-the-art summarization systems on DUC-2004 corpus

Based on the results, RegSum (Hong and Nenkova, 2014) achieved the best ROUGE_1 = 0.331 result, while ICSISumm (Gillick and Favre, 2009; Gillick et al., 2009) (a compressive summarizer) achieved the best result with ROUGE_2 = 0.098. These systems outperformed the best systems (Peers 65 and 67 in Table 2), but the differences in the ROUGE_n scores between the systems and the oracle summaries are still large. More recently, Hong et al. (2015) demonstrated that their system combination approach achieved the current best ROUGE_2 score, 0.105, for the DUC-2004 corpus. However, a large difference remains between the ROUGE_2 score of the oracle and their summaries.

In short, the ROUGE_n scores of the oracle summaries are significantly higher than those of the current state-of-the-art summarization systems, both extractive and compressive. These results imply that further improvement of the performance of extractive summarization is possible.

On the other hand, the ROUGE_n scores of the oracle summaries are far from ROUGE_n = 1. We believe that the results are related to the summary's compression rate. The data set's compression rate was only 1 to 2%. Thus, under tight length constraints, extractive summarization basically fails to cover large numbers of n-grams in the reference summary. This reveals the limitation of the extractive summarization paradigm and suggests that we need another direction, compressive or abstractive summarization, to overcome the limitation.

6.2.2 ROUGE Scores of Summaries Obtained from Greedy Algorithm

Table 2 also shows the ROUGE_{1,2} scores of the summaries obtained from the greedy algorithm (greedy summaries). Although there are statistically significant differences between the ROUGE scores of the oracle summaries and greedy summaries, the greedy summaries achieved near optimal scores, i.e., their approximation ratios are close to 0.9. These results are surprising since the algorithm's theoretical lower bound is $\frac{1}{2}(1 - \frac{1}{e})\,\mathrm{OPT}$ $(\simeq 0.32\,\mathrm{OPT})$.

On the other hand, the results do not support that the differences between them are small at the sentence level. Table 4 shows the average Jaccard Index between the oracle summaries and the corresponding greedy summaries for the DUC-2004 corpus. The results demonstrate that the oracle summaries are much less similar to the greedy summaries at the sentence level. Thus, it might not be appropriate to use greedy summaries as training data for learning-based extractive summarization systems.

          single  multi
ROUGE_1   .451    .419
ROUGE_2   .536    .530

Table 4: Jaccard Index between both oracle and greedy summaries

6.2.3 Impact of Enumeration

Table 5 shows the median number of oracle summaries and the rates of the reference summaries that have multiple oracle summaries for each data set. Over 80% of the reference summaries and about 60% to 90% of the topics have multiple oracle summaries. Since the ROUGE_n scores are based on the unweighted counting of n-grams, when many sentences have similar meanings, i.e., many redundant sentences, the number of oracle summaries that have the same ROUGE_n scores increases. The source documents of multiple document summarization tasks are prone to have many such redundant sentences, and the number of oracle summaries is large.

      Median                                    Rate
      single              multi                 single              multi
      ROUGE_1  ROUGE_2    ROUGE_1  ROUGE_2      ROUGE_1  ROUGE_2    ROUGE_1  ROUGE_2
01    8        9          4        5            .854     .787       .833     .733
02    7.5      5.5        4        4            .897     .836       .814     .780
03    8        10.5       3.5      4            .833     .858       .800     .900
04    8        8          3.5      3            .865     .865       .780     .760
05    35       35.5       2        3            .916     .907       .580     .660
06    28       22         2.5      3            .877     .880       .700     .720
07    23       16         4        2            .910     .878       .733     .711

Table 5: Median number of oracle summaries and rates of reference summaries and topics with multiple oracle summaries for each data set

The oracle summaries offer significant benefit with respect to evaluating the extracted sentences. Since both the oracle and system summaries are sets of sentences, it is easy to check whether each sentence in the system summary is contained in one of the oracle summaries. Thus, we can exploit the F-measures, which are useful for evaluating classification-based extractive summarization (Mani and Bloedorn, 1998; Osborne, 2002; Hirao et al., 2002). Here, we have to consider that the oracle summaries, obtained from a reference summary or a set of reference summaries, are not identical at the sentence level (e.g., the average Jaccard Index between the oracle summaries for the DUC-2004 corpus is around 0.5). The F-measures vary with the oracle summaries that are used for such computation. For example, assume that we have system summary $S = \{s_1, s_2, s_3, s_4\}$ and oracle summaries $O_1 = \{s_1, s_2, s_5, s_6\}$ and $O_2 = \{s_1, s_2, s_3\}$. The precision for $O_1$ is 0.5, while that for $O_2$ is 0.75; the recall for $O_1$ is 0.5, while that for $O_2$ is 1; the F-measure for $O_1$ is 0.5, while that for $O_2$ is 0.86.

Thus, we employ the scores gained by averaging over all of the oracle summaries as evaluation measures. Precision, recall, and F-measure are defined as follows:

$$
P = \frac{1}{|O_{\mathrm{all}}|} \sum_{O \in O_{\mathrm{all}}} \frac{|O \cap S|}{|S|}, \qquad
R = \frac{1}{|O_{\mathrm{all}}|} \sum_{O \in O_{\mathrm{all}}} \frac{|O \cap S|}{|O|}, \qquad
\text{F-measure} = \frac{2PR}{P + R}.
$$

To demonstrate the F-measure's effectiveness, we investigated the correlation between an F-measure and human judgment based on the evaluation results obtained from the DUC-2004 corpus. The results include summaries generated by 17 systems, each of which has a mean coverage score assigned by a human subject. We computed the correlation coefficients between the average F-measure and the average mean coverage score for 50 topics. Table 6 shows Pearson's r and Spearman's ρ. In the table, "F-measure (R1)" and "F-measure (R2)" indicate the F-measures calculated using oracle summaries optimized to ROUGE_1 and ROUGE_2, respectively. "M" indicates the F-measure calculated using multiple oracle summaries, and "S" indicates F-measures calculated using randomly selected oracle summaries. "multi" indicates oracle summaries obtained from a set of references, and "single" indicates oracle summaries obtained from a reference summary in the set. For "S," we randomly selected a single oracle summary and calculated the F-measure 100 times and took the average value with the 95% confidence interval of the F-measures by bootstrap resampling.

The results demonstrate that the F-measures are strongly correlated with human judgment. Their values are comparable with those of ROUGE_{1,2}. In particular, F-measure (R1) (single-M) achieved the best Spearman's ρ result. When comparing "single" with "multi," Pearson's r of "multi" was slightly lower than that of "single," and Spearman's ρ of "multi" was almost the same as that of "single." "M" has significantly better performance than "S." These results imply that F-measures based on oracle summaries are a good evaluation measure and that oracle summaries have the potential to be an alternative to human-made reference summaries in terms of automatic evaluation. Moreover, the enumeration of the oracle summaries for a given reference summary or a set of reference summaries is essential for automatic evaluation.
Metric                      r          ρ
ROUGE_1                     .861       .760
ROUGE_2                     .907       .831
F-measure (R1) (single-M)   .857       .855
F-measure (R1) (single-S)   .815-.830  .811-.830
F-measure (R2) (single-M)   .904       .826
F-measure (R2) (single-S)   .855-.865  .740-.760
F-measure (R1) (multi-M)    .814       .841
F-measure (R1) (multi-S)    .794-.802  .803-.813
F-measure (R2) (multi-M)    .824       .846
F-measure (R2) (multi-S)    .806-.816  .797-.817

Table 6: Correlation coefficients between automatic evaluations and human judgments on DUC-2004 corpus

6.2.4 Search Efficiency

To demonstrate the efficiency of our search algorithm against the naive exhaustive search method, we compared the number of feasible solutions (sets of sentences that satisfy the length constraint) with the number of summaries that were checked in our search algorithm. The algorithm that counts the number of feasible solutions is shown in Appendix B.

      ROUGE_1                   ROUGE_2
      Naive        Proposed     Naive        Proposed
01    3.66×10^13   5.75×10^3    3.32×10^7    1.00×10^3
02    1.12×10^12   4.64×10^3    1.34×10^7    8.87×10^2
03    1.62×10^11   3.65×10^3    6.37×10^6    8.19×10^2
04    9.65×10^10   4.47×10^3    6.90×10^6    9.83×10^2
05    5.48×10^36   2.32×10^6    3.48×10^21   7.03×10^4
06    1.94×10^32   1.97×10^6    2.11×10^20   5.08×10^4
07    4.14×10^28   1.40×10^6    1.81×10^19   2.60×10^4

Table 7: Median number of summaries checked by each search method

Table 7 shows the median number of feasible solutions and checked summaries yielded by our method for each data set (in the case of "single"). The differences in the number of feasible solutions between ROUGE_1 and ROUGE_2 are very large because the input set ($|D|$) for ROUGE_1 is much larger than that for ROUGE_2. On the other hand, the differences between ROUGE_1 and ROUGE_2 in our method are of the order of 10 to 10^2. When comparing our method with naive exhaustive searches, its search space is significantly smaller. The differences are of the order of 10^7 to 10^30 with ROUGE_1 and 10^4 to 10^17 with ROUGE_2. These results demonstrate the efficiency of our branch and bound technique.

In addition, we show an example of the processing time for extracting one oracle summary and enumerating all of the oracle summaries for the reference summaries in the DUC-2004 corpus with a Linux machine (CPU: Intel® Xeon® X5675 (3.07GHz)) with 192 GB of RAM. We utilized CPLEX 12.1 to solve the ILP problem. Our algorithm was implemented in C++ and compiled with GCC version 4.4.7. The results show that we needed 0.026 and 0.021 sec. to extract one oracle summary per reference summary and 0.047 and 0.031 sec. to extract one oracle summary per set of reference summaries for ROUGE_1 and ROUGE_2, respectively. We needed 11.90 and 1.40 sec. to enumerate the oracle summaries per reference summary and 102.94 and 3.65 sec. per set of reference summaries for ROUGE_1 and ROUGE_2, respectively. The extraction of one oracle summary for a reference summary can be achieved with the ILP solver in practical time, and the enumeration of oracle summaries is also efficient. However, to enumerate oracle summaries, we needed several weeks for some topics in DUCs 2005 to 2007 since they hold a huge number of source sentences.

7 Conclusions

To analyze the limitations and the future direction of extractive summarization, this paper proposed (1) an Integer Linear Programming (ILP) formulation to obtain extractive oracle summaries in terms of ROUGE_n scores and (2) an algorithm that enumerates all oracle summaries to exploit F-measures that evaluate the sentences extracted by systems.

The evaluation results obtained from the corpora of DUCs 2001 to 2007 identified the following: (1) room still exists to improve the ROUGE_n scores of extractive summarization systems even though the ROUGE_n scores of the oracle summaries fell below the theoretical upper bound ROUGE_n = 1. (2) Over 80% of the reference summaries and from 60% to 90% of the sets of reference summaries have multiple oracle summaries, and the F-measures computed by utilizing the enumerated oracle summaries showed stronger correlation with human judgment than those computed from single oracle summaries.

Appendix A.

Proof. We can rewrite the right side of equation (9) as follows:

$$
\mathrm{ROUGE}_n(\mathbf{R}, V) + \mathrm{ROUGE}'_n(\mathbf{R}, V, W) = \frac{\sum_{k=1}^{|\mathbf{R}|} \sum_{t_n \in U(\mathcal{R}_k)} f(t_n, \mathcal{R}_k, \mathcal{V}, \mathcal{W})}{\sum_{k=1}^{|\mathbf{R}|} \sum_{t_n \in U(\mathcal{R}_k)} N(t_n, \mathcal{R}_k)}. \tag{13}
$$

Here, $f(t_n, \mathcal{R}_k, \mathcal{V}, \mathcal{W})$ is defined as follows:

$$
f(t_n, \mathcal{R}_k, \mathcal{V}, \mathcal{W}) = \min\{N(t_n, \mathcal{R}_k),\, N(t_n, \mathcal{V})\} + \min\{N(t_n, \mathcal{R}_k \setminus \mathcal{V}),\, N(t_n, \mathcal{W})\}. \tag{14}
$$

$N(t_n, \mathcal{R}_k \setminus \mathcal{V})$ is the number of times $t_n$ occurs in the multiset $\mathcal{R}_k \setminus \mathcal{V}$. Equation (14) is rewritten as

$$
f(t_n, \mathcal{R}_k, \mathcal{V}, \mathcal{W}) = \min\{N(t_n, \mathcal{R}_k),\, N(t_n, \mathcal{V})\} + \min\{\max\{N(t_n, \mathcal{R}_k) - N(t_n, \mathcal{V}),\, 0\},\, N(t_n, \mathcal{W})\}. \tag{15}
$$

The solutions of equation (15) are obtained by considering the following three conditions:

1. If $N(t_n, \mathcal{R}_k) - N(t_n, \mathcal{V}) > 0$ and $N(t_n, \mathcal{R}_k) - N(t_n, \mathcal{V}) > N(t_n, \mathcal{W})$, then $f(t_n, \mathcal{R}_k, \mathcal{V}, \mathcal{W}) = N(t_n, \mathcal{V}) + N(t_n, \mathcal{W})$.

2. If $N(t_n, \mathcal{R}_k) - N(t_n, \mathcal{V}) > 0$ and $N(t_n, \mathcal{R}_k) - N(t_n, \mathcal{V}) \le N(t_n, \mathcal{W})$, then $f(t_n, \mathcal{R}_k, \mathcal{V}, \mathcal{W}) = N(t_n, \mathcal{R}_k)$.

3. If $N(t_n, \mathcal{R}_k) - N(t_n, \mathcal{V}) \le 0$, then $f(t_n, \mathcal{R}_k, \mathcal{V}, \mathcal{W}) = N(t_n, \mathcal{R}_k)$.

From the above relations,

$$
f(t_n, \mathcal{R}_k, \mathcal{V}, \mathcal{W}) = \min\{N(t_n, \mathcal{R}_k),\, N(t_n, \mathcal{V}) + N(t_n, \mathcal{W})\}. \tag{16}
$$

Thus,

$$
\mathrm{ROUGE}_n(\mathbf{R}, V \cup W) = \frac{\sum_{k=1}^{|\mathbf{R}|} \sum_{t_n \in U(\mathcal{R}_k)} \min\{N(t_n, \mathcal{R}_k),\, N(t_n, \mathcal{V}) + N(t_n, \mathcal{W})\}}{\sum_{k=1}^{|\mathbf{R}|} \sum_{t_n \in U(\mathcal{R}_k)} N(t_n, \mathcal{R}_k)}. \tag{17}
$$

Appendix B.

We propose an algorithm to compute the number of feasible solutions under the length constraint by extending the dynamic programming based approach for the subset sum problem (Cormen et al., 2009). We define $C[i][j]$ ($0 \le i \le |D|$, $0 \le j \le L_{\max}$), which stores the number of subsets of $\{s_1, \ldots, s_i\}$ whose total length is exactly $j$, as follows:

• Initialization (matching line 2 of Algorithm 4)

$$
C[0][0] = 1, \qquad C[0][j] = 0 \quad (1 \le j \le L_{\max}) \tag{18}
$$

• Recurrence ($1 \le i \le |D|$)

$$
C[i][j] =
\begin{cases}
C[i-1][j] + C[i-1][j - \ell(s_i)] & \text{if } j - \ell(s_i) \ge 0 \\
C[i-1][j] & \text{otherwise}
\end{cases} \tag{19}
$$

Algorithm 4 is a dynamic program that fills out the $(|D|+1) \times (L_{\max}+1)$ table. After the table is filled, the cells on the $(|D|+1)$-th row store the numbers of feasible solutions for each total length. In the algorithm, first, we pick up the sentences that contain an n-gram that appears in the reference summary at least once and recursively count the number of feasible solutions. Then, the sum of the cells on the last row whose index $j$ runs from 1 to $L_{\max}$ indicates the number of feasible solutions. The order of the algorithm is $O(n L_{\max})$.

Algorithm 4 Dynamic Programming Algorithm to Count the Number of the Feasible Summaries
 1: Function: GETNUMFS(D, $L_{\max}$)
 2:   C[0][0] ← 1, C[0][j] ← 0, 1 ≤ j ≤ $L_{\max}$
 3:   for i = 1 to |D| do
 4:     for j = 0 to $L_{\max}$ do
 5:       if $j - \ell(s_i) \ge 0$ then
 6:         C[i][j] ← C[i−1][j] + C[i−1][$j - \ell(s_i)$]
 7:       else
 8:         C[i][j] ← C[i−1][j]
 9:       end if
10:     end for
11:   end for
12:   return $\sum_{j=1}^{L_{\max}} C[|D|][j]$
13: end

References

[Almeida and Martins 2013] Miguel B. Almeida and André F. T. Martins. 2013. Fast and robust compressive summarization with dual decomposition and
