Computing Abelian regularities on RLE strings

Shiho Sugimoto¹, Naoki Noda², Shunsuke Inenaga¹, Hideo Bannai¹, and Masayuki Takeda¹

¹ Department of Informatics, Kyushu University, Japan
{shiho.sugimoto, inenaga, bannai, takeda}@inf.kyushu-u.ac.jp
² Department of Physics, Kyushu University, Japan

arXiv:1701.02836v1 [cs.DS] 11 Jan 2017

Abstract. Two strings $x$ and $y$ are said to be Abelian equivalent if $x$ is a permutation of $y$, or vice versa. If a string $z$ satisfies $z = xy$ with $x$ and $y$ being Abelian equivalent, then $z$ is said to be an Abelian square. If a string $w$ can be factorized into a sequence $v_1, \ldots, v_s$ of strings such that $v_1, \ldots, v_{s-1}$ are all Abelian equivalent and $v_s$ is a substring of a permutation of $v_1$, then $w$ is said to have a regular Abelian period $(p, t)$ where $p = |v_1|$ and $t = |v_s|$. If a substring $w_1[i..i+\ell-1]$ of a string $w_1$ and a substring $w_2[j..j+\ell-1]$ of another string $w_2$ are Abelian equivalent, then the substrings are said to be a common Abelian factor of $w_1$ and $w_2$, and if the length $\ell$ is the maximum of such, then the substrings are said to be a longest common Abelian factor of $w_1$ and $w_2$. We propose efficient algorithms which compute these Abelian regularities using the run length encoding (RLE) of strings. For a given string $w$ of length $n$ whose RLE is of size $m$, we propose algorithms which compute all Abelian squares occurring in $w$ in $O(mn)$ time, and all regular Abelian periods of $w$ in $O(mn)$ time. For two given strings $w_1$ and $w_2$ of total length $n$ and of total RLE size $m$, we propose an algorithm which computes all longest common Abelian factors in $O(m^2 n)$ time.

1 Introduction

Two strings $s_1$ and $s_2$ are said to be Abelian equivalent if $s_1$ is a permutation of $s_2$, or vice versa. For instance, strings ababaac and caaabba are Abelian equivalent. Since the seminal paper by Erdős [6] published in 1961, the study of Abelian equivalence on strings has attracted much attention, both in word combinatorics and string algorithmics.
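
As a small illustration of the two basic notions above, the following Python sketch tests Abelian equivalence and the Abelian-square property by comparing character counts. The function names `abelian_equivalent` and `is_abelian_square` are hypothetical and chosen only for this illustration; they do not come from the paper.

```python
from collections import Counter


def abelian_equivalent(x: str, y: str) -> bool:
    """Return True iff x is a permutation of y (equal character counts)."""
    return len(x) == len(y) and Counter(x) == Counter(y)


def is_abelian_square(z: str) -> bool:
    """Return True iff z = xy with |x| = |y| > 0 and x, y Abelian equivalent."""
    half, rem = divmod(len(z), 2)
    return rem == 0 and half > 0 and abelian_equivalent(z[:half], z[half:])


if __name__ == "__main__":
    print(abelian_equivalent("ababaac", "caaabba"))  # True (the example above)
    print(is_abelian_square("abba"))                 # True: "ab" and "ba"
```

This direct check takes time linear in the lengths of the strings per query; the algorithms in the paper avoid recomputing counts from scratch for every window.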

1.1 Our problems and previous results

In this paper, we are interested in the following algorithmic problems related to Abelian regularities of strings.

1. Compute Abelian squares in a given string.
2. Compute regular Abelian periods of a given string.
3. Compute longest common Abelian factors of two given strings.

Cummings and Smyth [5] proposed an $O(n^2)$-time algorithm to solve Problem 1, where $n$ is the length of the given string. Crochemore et al. [4] proposed an alternative $O(n^2)$-time solution to the same problem. Recently, Kociumaka et al. [12] showed how to compute all Abelian squares in $O(s + \frac{n^2}{\log^2 n})$ time, where $s$ is the number of outputs.

Related to Problem 2, various kinds of Abelian periods of strings have been considered. An integer $p$ is said to be a full Abelian period of a string $w$ iff there is a decomposition $u_1, \ldots, u_z$ of $w$ such that $|u_i| = p$ for all $1 \le i \le z$ and $u_1, \ldots, u_z$ are all Abelian equivalent. A pair $(p, t)$ of integers is said to be a regular Abelian period (or simply an Abelian period) of a string $w$ iff there is a decomposition $v_1, \ldots, v_s$ of $w$ such that $p$ is a full Abelian period of $v_1 \cdots v_{s-1}$, $|v_i| = p$ for all $1 \le i \le s-1$, $|v_s| = t$, and $v_s$ is a permutation of a substring of $v_1$ (and hence $t \le p$). A triple $(h, p, t)$ of integers is said to be a weak Abelian period of a string $w$ iff there is a decomposition $y_1, \ldots, y_r$ of $w$ such that $(p, t)$ is an Abelian period of $y_2 \cdots y_r$, $|y_1| = h$, $|y_i| = p$ for all $2 \le i \le r-1$, $|y_r| = t$, and $y_1$ is a permutation of a substring of $y_2$ (and hence $h \le p$).

The study on Abelian periodicity of strings was initiated by Constantinescu and Ilie [3]. Fici et al. [8,9] gave an $O(n \log\log n)$-time algorithm to compute all full Abelian periods. Later, Kociumaka et al. [11] showed an optimal $O(n)$-time algorithm to compute all full Abelian periods. Fici et al. [8,9] also showed an $O(n^2)$-time algorithm to compute all regular Abelian periods for a given string of length $n$. Kociumaka et al.
[11] also developed an algorithm which finds all regular Abelian periods in $O(n(\log\log n + \log\sigma))$ time, where $\sigma$ is the alphabet size. Fici et al. [7] proposed an algorithm which computes all weak Abelian periods in $O(\sigma n^2)$ time, and later Crochemore et al. [4] proposed an improved $O(n^2)$-time algorithm to compute all weak Abelian periods. Kociumaka et al. [12] showed how to compute all shortest weak Abelian periods in $O(n^2/\sqrt{\log n})$ time.

Consider two strings $w_1$ and $w_2$. A pair $(s_1, s_2)$ of a substring $s_1$ of $w_1$ and a substring $s_2$ of $w_2$ is said to be a common Abelian factor of $w_1$ and $w_2$ iff $s_1$ and $s_2$ are Abelian equivalent. Alatabbi et al. [1] proposed an $O(\sigma n^2)$-time and $O(\sigma n)$-space algorithm to solve Problem 3 of computing all longest common Abelian factors (LCAFs) of two given strings of total length $n$. Later, Grabowski [10] showed an algorithm which finds all LCAFs in $O(\sigma n^2)$ time with $O(n)$ space. He also presented an $O((\frac{\sigma}{k} + \log\sigma)\, n^2 \log n)$-time $O(kn)$-space algorithm for a parameter $k \le \frac{\sigma}{\log\sigma}$.

1.2 Our contribution

In this paper, we show that we can accelerate computation of Abelian regularities of strings via run length encoding (RLE) of strings. Namely, if $m$ is the size of the RLE of a given string $w$ of length $n$, we show that:

(1) All Abelian squares in $w$ can be computed in $O(mn)$ time.
(2) All regular Abelian periods of $w$ can be computed in $O(mn)$ time.

Since $m \le n$ always holds, our solution (1) is at least as efficient as the $O(n^2)$-time solutions by Cummings and Smyth [5] and by Crochemore et al. [4], and can be much faster when the input string $w$ is highly compressible by RLE. Amir et al. [2] proposed an $O(\sigma(m^2 + n))$-time algorithm to compute all Abelian squares using RLEs. Our $O(mn)$-time solution is faster than theirs when $\frac{\sigma m^2}{m - \sigma} = \omega(n)$, since $mn < \sigma(m^2 + n)$ is equivalent to $n < \frac{\sigma m^2}{m - \sigma}$ for $m > \sigma$. Our solution (2) is more efficient than the $O(n(\log\log n + \log\sigma))$-time solution by Kociumaka et al. [11] when $\log\log n + \log\sigma = \omega(m)$.

Also, if $m$ is the total size of the RLEs of two given strings $w_1$ and $w_2$ of total length $n$, we show that:

(3) All longest common Abelian factors of $w_1$ and $w_2$ can be computed in $O(m^2 n)$ time.

Our solution (3) is more efficient than the $O(\sigma n^2)$-time solution by Grabowski [10] when $\sigma n = \omega(m^2)$.

All proofs omitted due to lack of space can be found in the Appendix.

2 Preliminaries

Let $\Sigma = \{c_1, \ldots, c_\sigma\}$ be an ordered alphabet of size $\sigma$. An element of $\Sigma^*$ is called a string. For any string $w$, $|w|$ denotes the length of $w$. The empty string is denoted by $\varepsilon$. Let $\Sigma^+ = \Sigma^* - \{\varepsilon\}$. For any $1 \le i \le |w|$, $w[i]$ denotes the $i$-th symbol of $w$. For a string $w = xyz$, strings $x$, $y$, and $z$ are called a prefix, substring, and suffix of $w$, respectively. The substring of $w$ that begins at position $i$ and ends at position $j$ is denoted by $w[i..j]$ for $1 \le i \le j \le |w|$. For convenience, let $w[i..j] = \varepsilon$ for $i > j$.

For any string $w \in \Sigma^*$, its Parikh vector $\mathcal{P}_w$ is an array of length $\sigma$ such that for any $1 \le i \le |\Sigma|$, $\mathcal{P}_w[i]$ is the number of occurrences of the character $c_i \in \Sigma$ in $w$. For example, for string $w = abaab$ over alphabet $\Sigma = \{a, b\}$, $\mathcal{P}_w = \langle 3, 2 \rangle$. We say that strings $x$ and $y$ are Abelian equivalent if $\mathcal{P}_x = \mathcal{P}_y$. Note that $\mathcal{P}_x = \mathcal{P}_y$ iff $x$ and $y$ are permutations of each other. When $x$ is a substring of a permutation of $y$, then we write $\mathcal{P}_x \subseteq \mathcal{P}_y$. For any Parikh vectors $P$ and $Q$, let $\mathit{diff}(P, Q) = |\{i \mid P[i] \ne Q[i],\ 1 \le i \le \sigma\}|$.

A non-empty string $w$ of length $2k$ is called an Abelian square if it is a concatenation of two Abelian equivalent strings of length $k$ each, namely, $\mathcal{P}_{w[1..k]} = \mathcal{P}_{w[k+1..2k]}$. A string $w$ is said to have a regular Abelian period $(p, t)$ if $w$ can be factorized into a sequence $v_1, \ldots, v_s$ of substrings such that $p = |v_1| = \cdots = |v_{s-1}|$, $|v_s| = t$, $\mathcal{P}_{v_i} = \mathcal{P}_{v_1}$ for all $2 \le i < s$, and $\mathcal{P}_{v_s} \subseteq \mathcal{P}_{v_1}$. For any strings $w_1, w_2 \in \Sigma^*$, if a substring $w_1[i..i+\ell-1]$ of $w_1$ and a substring $w_2[j..j+\ell-1]$ of $w_2$ are Abelian equivalent, then the pair of substrings are said to be a common Abelian factor of $w_1$ and $w_2$.
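
The Parikh-vector notions defined above can be made concrete with a short Python sketch. The helper names `parikh`, `diff`, and `contained` are assumptions made for this illustration (the paper only defines the mathematical objects), and the alphabet is passed explicitly as an ordered string.

```python
def parikh(w: str, alphabet: str) -> list:
    """Parikh vector of w: entry i counts occurrences of alphabet[i] in w."""
    return [w.count(c) for c in alphabet]


def diff(p: list, q: list) -> int:
    """Number of indices at which two Parikh vectors disagree."""
    return sum(1 for a, b in zip(p, q) if a != b)


def contained(p: list, q: list) -> bool:
    """P_x is contained in P_y iff every entry of p is at most that of q,
    i.e. x is a substring of some permutation of y."""
    return all(a <= b for a, b in zip(p, q))


if __name__ == "__main__":
    print(parikh("abaab", "ab"))                              # [3, 2]
    print(diff(parikh("aab", "ab"), parikh("ba", "ab")))      # 1
    print(contained(parikh("ba", "ab"), parikh("aab", "ab"))) # True
```

With these helpers, two strings are Abelian equivalent exactly when `diff` of their Parikh vectors is 0.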
When the length $\ell$ of a common Abelian factor is the maximum possible, the pair of substrings is said to be a longest common Abelian factor of $w_1$ and $w_2$.

The run length encoding (RLE) of string $w$ of length $n$, denoted $\mathit{RLE}(w)$, is a compact representation of $w$ which encodes each maximal character run $w[i..i+p-1]$ by $a^p$, if (1) $w[j] = a$ for all $i \le j \le i+p-1$, (2) $w[i-1] \ne w[i]$ or $i = 1$, and (3) $w[i+p-1] \ne w[i+p]$ or $i+p-1 = n$. E.g., $\mathit{RLE}(aabbbbcccaaa\$) = a^2 b^4 c^3 a^3 \$^1$. The size of $\mathit{RLE}(w) = a_1^{p_1} \cdots a_m^{p_m}$ is the number $m$ of maximal character runs in $w$, and each $a_i^{p_i}$ is called an RLE factor of $\mathit{RLE}(w)$. Notice that $m \le n$ always holds. Also, since at most $m$ distinct characters can appear in $w$, in what follows we will assume that $\sigma \le m$. Even if the underlying alphabet is large, we can sort the characters appearing in $w$ in $O(m \log m)$ time and use this ordering in Parikh vectors. Since all of our algorithms will require at least $O(mn)$ time, this $O(m \log m)$-time preprocessing is negligible.

For any $1 \le i \le j \le m$, let $\mathit{RLE}(w)[i..j] = a_i^{p_i} \cdots a_j^{p_j}$. For convenience, let $\mathit{RLE}(w)[i..j] = \varepsilon$ for $i > j$. For $\mathit{RLE}(w) = a_1^{p_1} \cdots a_m^{p_m}$, let $\mathit{RLE\_Bound}(w) = \{1 + \sum_{i=1}^{k} p_i \mid 1 \le k < m\} \cup \{1, n\}$. For any $1 \le i \le n$, let $\mathit{succ}(i) = \min\{j \in \mathit{RLE\_Bound}(w) \mid j > i\}$. Namely, $\mathit{succ}(i)$ is the smallest position in $w$ that is greater than $i$ and is either the beginning position of an RLE factor in $w$ or the last position $n$ in $w$.

3 Computing regular Abelian periods using RLEs

In this section, we propose an algorithm which computes all regular Abelian periods of a given string.

Theorem 1. Given a string $w$ of length $n$ over an alphabet of size $\sigma$, we can compute all regular Abelian periods of $w$ in $O(mn)$ time and $O(n)$ working space, where $m$ is the size of $\mathit{RLE}(w)$.

Proof. Our algorithm is very simple. Let $n$ be the length of the input string $w$ and $m$ be the size of $\mathit{RLE}(w)$. We use a single window for each length $d = 1, \ldots, \lfloor n/2 \rfloor$.

For an arbitrarily fixed $d$, consider a decomposition $v_1, \ldots, v_s$ of $w$ such that $v_i = w[(i-1)d+1..id]$ for $1 \le i \le \lfloor n/d \rfloor$ and $v_s = w[n - (n \bmod d) + 1..n]$.
Each $v_i$ is called a block, and each block of length $d$ is called a complete block. There are two cases to consider.

(a) Suppose $w$ is a unary string (i.e., $\mathit{RLE}(w) = a^n$ for some $a \in \Sigma$). In this case, $(d, (n \bmod d))$ is a regular Abelian period of $w$ for any $d$. Also, note that this is the only case where $(d, (n \bmod d))$ can be a regular Abelian period of any string of length $n$ with $\mathit{RLE}(v_i) = a^d$ for some complete block $v_i$. Clearly, it takes a total of $O(n)$ time and $O(1)$ space in this case.

(b) If $w$ contains at least two distinct characters, then observe that any complete block $v_i$ is fully contained in a single RLE factor iff $\mathit{succ}(1 + \sum_{k=1}^{i-1} |v_k|) = \mathit{succ}(\sum_{k=1}^{i} |v_k|)$. Let $S$ be an array of length $n$ such that $S[j] = \mathit{succ}(j)$ for each $1 \le j \le n$. We precompute this array $S$ in $O(n)$ time by a simple left-to-right scan over $w$. Using the precomputed array $S$, we can check in $O(m)$ time if there exists a complete block $v_i$ satisfying $\mathit{succ}(1 + \sum_{k=1}^{i-1} |v_k|) = \mathit{succ}(\sum_{k=1}^{i} |v_k|)$; we process each complete block $v_i$ in increasing order of $i$ (from left to right), and stop as soon as we find the first complete block $v_i$ with $\mathit{succ}(1 + \sum_{k=1}^{i-1} |v_k|) = \mathit{succ}(\sum_{k=1}^{i} |v_k|)$. If there exists such a complete block, then we can immediately determine that $(d, (n \bmod d))$ is not a regular Abelian period (recall also Case (a) above).

Fig. 1. $(3, 2)$ is a regular Abelian period of string $w = aabbaaababaaaabba$ since $\mathcal{P}_{w[1..3]} = \mathcal{P}_{w[4..6]} = \mathcal{P}_{w[7..9]} = \mathcal{P}_{w[10..12]} = \mathcal{P}_{w[13..15]} \supset \mathcal{P}_{w[16..17]}$.

Now, assume that every complete block $v_i$ spans at least two RLE factors. For each $v_i$, let $m_i \ge 2$ be the number of RLE factors of $\mathit{RLE}(w)$ that $v_i$ spans (or alternatively, $m_i$ is the size of $\mathit{RLE}(v_i)$). We can compute $\mathcal{P}_{v_i}$ in $O(m_i)$ time from $\mathit{RLE}(v_i)$, by using the exponents of the elements of $\mathit{RLE}(v_i)$.
Also, we can compare $\mathcal{P}_{v_i}$ and $\mathcal{P}_{v_{i-1}}$ in $O(m_i)$ time, since there can be at most $m_i$ distinct characters in $v_i$ and hence it is enough to check the $m_i$ entries of the Parikh vectors. Since there are $\lfloor n/d \rfloor$ complete blocks and each complete block spans more than one RLE factor, we have $\lfloor n/d \rfloor \le \frac{1}{2} \sum_{i=1}^{s-1} m_i$. Moreover, since each RLE factor is counted by a unique $m_i$ or by a unique pair of $m_{i-1}$ and $m_i$ for some $i$, we have $\sum_{i=1}^{s} m_i \le 2m$. Overall, it takes $O(\sigma + \frac{n}{d} + \sum_{i=1}^{s} m_i) = O(m)$ time to determine whether or not $(d, (n \bmod d))$ is a regular Abelian period of $w$. Consequently, it takes a total of $O(mn)$ time to compute all regular Abelian periods of $w$ for all $d$'s in this case. Since we use the array $S$ of length $n$ and we maintain two Parikh vectors of the two adjacent blocks $v_{i-1}$ and $v_i$ for each $i$, the space requirement is $O(\sigma + n) = O(n)$. □

For example, let $w = aabbaaababaaaabba$ and $d = 3$. See also Fig. 1 for illustration. We have $\mathit{RLE}(w) = a^2 b^2 a^3 b^1 a^1 b^1 a^4 b^2 a^1$. Then, we compute $\mathcal{P}_{v_1} = \langle 2, 1 \rangle$ from $\mathit{RLE}(v_1) = a^2 b^1$, $\mathcal{P}_{v_2} = \langle 2, 1 \rangle$ from $\mathit{RLE}(v_2) = b^1 a^2$, $\mathcal{P}_{v_3} = \langle 2, 1 \rangle$ from $\mathit{RLE}(v_3) = a^1 b^1 a^1$, $\mathcal{P}_{v_4} = \langle 2, 1 \rangle$ from $\mathit{RLE}(v_4) = b^1 a^2$, $\mathcal{P}_{v_5} = \langle 2, 1 \rangle$ from $\mathit{RLE}(v_5) = a^2 b^1$, and $\mathcal{P}_{v_6} = \langle 1, 1 \rangle$ from $\mathit{RLE}(v_6) = b^1 a^1$. Since $\mathcal{P}_{v_i} = \mathcal{P}_{v_1}$ for $1 \le i \le 5$ and $\mathcal{P}_{v_6} \subset \mathcal{P}_{v_1}$, $(3, 2)$ is a regular Abelian period of the string $w$.

4 Computing Abelian squares using RLEs

In this section, we describe our algorithm to compute all Abelian squares occurring in a given string $w$ of length $n$. Our algorithm is based on the algorithm of Cummings and Smyth [5] which computes all Abelian squares in $w$ in $O(n^2)$ time. We will improve the running time to $O(mn)$, where $m$ is the size of $\mathit{RLE}(w)$.

4.1 Cummings and Smyth's $O(n^2)$-time algorithm

Here we recall Cummings and Smyth's $O(n^2)$-time algorithm [5]. The algorithm is fairly straightforward: to compute Abelian squares in a given string $w$, their algorithm aligns two adjacent sliding windows of length $d$ each, for every $1 \le d \le \lfloor n/2 \rfloor$.

Consider an arbitrary fixed $d$. For each position $1 \le i \le n - 2d + 1$ in $w$, let $L_i$ and $R_i$ denote the left and right windows aligned at position $i$.
Namely, $L_i = w[i..i+d-1]$ and $R_i = w[i+d..i+2d-1]$. At the beginning, the algorithm computes $\mathcal{P}_{L_1}$ and $\mathcal{P}_{R_1}$ for position 1 in $w$. It takes $O(d)$ time to compute these Parikh vectors and $O(\sigma)$ time to compute $\mathit{diff}(\mathcal{P}_{L_1}, \mathcal{P}_{R_1})$. Assume $\mathcal{P}_{L_i}$, $\mathcal{P}_{R_i}$, and $\mathit{diff}(\mathcal{P}_{L_i}, \mathcal{P}_{R_i})$ have been computed for position $i \ge 1$, and $\mathcal{P}_{L_{i+1}}$, $\mathcal{P}_{R_{i+1}}$, and $\mathit{diff}(\mathcal{P}_{L_{i+1}}, \mathcal{P}_{R_{i+1}})$ are to be computed for the next position $i+1$. A key observation is that given $\mathcal{P}_{L_i}$, the vector $\mathcal{P}_{L_{i+1}}$ for the left window $L_{i+1}$ at the next position $i+1$ can be easily computed in $O(1)$ time, since at most two entries of the Parikh vector can change. The same applies to $\mathcal{P}_{R_i}$ and $\mathcal{P}_{R_{i+1}}$. Also, given $\mathit{diff}(\mathcal{P}_{L_i}, \mathcal{P}_{R_i})$ for the two adjacent windows $L_i$ and $R_i$ for position $i$, it takes $O(1)$ time to determine whether or not $\mathit{diff}(\mathcal{P}_{L_{i+1}}, \mathcal{P}_{R_{i+1}}) = 0$ for the two adjacent windows $L_{i+1}$ and $R_{i+1}$ for the next position $i+1$. Hence, for each $d$, it takes $O(n)$ time to find all Abelian squares of length $2d$, and thus it takes a total of $O(n^2)$ time for all $1 \le d \le \lfloor n/2 \rfloor$.

4.2 Our $O(mn)$-time algorithm

We propose an algorithm which computes all Abelian squares in a given string $w$ of length $n$ in $O(mn)$ time, where $m$ is the size of $\mathit{RLE}(w)$.

Our algorithm will output consecutive Abelian squares $w[i..i+2d-1], w[i+1..i+2d], \ldots, w[j..j+2d-1]$ of length $2d$ each as a triple $\langle i, j, d \rangle$. A single Abelian square $w[i..i+2d-1]$ of length $2d$ will be represented by $\langle i, i, d \rangle$.

For any position $i$ in $w$, let $\mathit{beg}(L_i)$ and $\mathit{end}(L_i)$ respectively denote the beginning and ending positions of the left window $L_i$, and let $\mathit{beg}(R_i)$ and $\mathit{end}(R_i)$ respectively denote the beginning and ending positions of the right window $R_i$. Namely, $\mathit{beg}(L_i) = i$, $\mathit{end}(L_i) = i+d-1$, $\mathit{beg}(R_i) = i+d$, and $\mathit{end}(R_i) = i+2d-1$. Cummings and Smyth's algorithm described above increases each of $\mathit{beg}(L_i)$, $\mathit{end}(L_i)$, $\mathit{beg}(R_i)$, and $\mathit{end}(R_i)$ one by one, and tests all positions $i = 1, \ldots, n-2d+1$ in $w$. Hence their algorithm takes $O(n)$ time for each window size $d$.
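
The one-position-at-a-time sliding of two adjacent windows recalled above can be sketched in Python as follows. This is an illustrative reading of the description in Section 4.1, not the authors' code: the function name `abelian_squares_naive` is hypothetical, and instead of two separate Parikh vectors the sketch keeps their entry-wise difference in a dictionary (an implementation choice made here), so that the counter `nonzero` plays the role of $\mathit{diff}(\mathcal{P}_{L_i}, \mathcal{P}_{R_i})$.

```python
def abelian_squares_naive(w: str) -> list:
    """Return all (i, d), 1-indexed, such that w[i..i+2d-1] is an Abelian square.

    For each half-length d, two adjacent windows of length d slide one
    position at a time; each slide updates O(1) counters.
    """
    n = len(w)
    result = []
    for d in range(1, n // 2 + 1):
        # delta[c] = (#c in left window) - (#c in right window)
        delta = {}
        for c in w[:d]:
            delta[c] = delta.get(c, 0) + 1
        for c in w[d:2 * d]:
            delta[c] = delta.get(c, 0) - 1
        # Number of characters whose counts differ, i.e. diff(P_L, P_R).
        nonzero = sum(1 for v in delta.values() if v != 0)

        i = 0  # 0-indexed start of the left window
        while True:
            if nonzero == 0:
                result.append((i + 1, d))
            if i + 2 * d >= n:
                break
            # w[i] leaves the left window, w[i+d] moves from the right window
            # to the left one, and w[i+2d] enters the right window.
            for c, by in ((w[i], -1), (w[i + d], 2), (w[i + 2 * d], -1)):
                old = delta.get(c, 0)
                delta[c] = old + by
                nonzero += (delta[c] != 0) - (old != 0)
            i += 1
    return result


if __name__ == "__main__":
    print(abelian_squares_naive("aabb"))  # [(1, 1), (3, 1)]
```

Each window length costs $O(n)$ time here, for $O(n^2)$ overall; the RLE-based algorithm of this section replaces the one-step slides by jumps between break points.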

In what follows, we show that it is indeed enough to check only $O(m)$ positions in $w$ for each window size $d$. The outline of our algorithm is as follows. As in Cummings and Smyth's algorithm, we use two adjacent windows of size $d$, and slide the windows. However, unlike Cummings and Smyth's algorithm where the windows are shifted by one position, in our algorithm the windows can be shifted by more than one position. The positions that are not skipped and are explicitly examined will be characterized by the RLE of $w$, and the equivalence of the Parikh vectors of the two adjacent windows for the skipped positions can easily be checked by simple arithmetic.

Now we describe our algorithm in detail. First, we compute $\mathit{RLE}(w)$ and let $m$ be its size. Consider an arbitrarily fixed window length $d \ge 1$.

Initial step for position 1. Initially, we compute $\mathcal{P}_{L_1}$ and $\mathcal{P}_{R_1}$ for position 1. We can compute these Parikh vectors in $O(m)$ time and $O(\sigma)$ space using the same method as in the algorithm of Theorem 1 in Section 3.

Steps for positions larger than 1. For each position $i \ge 1$ in a given string $w$, let $D_1^i = \mathit{succ}(\mathit{beg}(L_i)) - i$, $D_2^i = \mathit{succ}(\mathit{end}(L_i)) - i$, and $D_3^i = \mathit{succ}(\mathit{end}(R_i)) - i$. The break point for each position $i$, denoted $\mathit{bp}(i)$, is defined by $i + \min\{D_1^i, D_2^i, D_3^i\}$. Assume the left window is aligned at position $i$ in $w$. Then, we leap to the break point $\mathit{bp}(i)$ directly from $i$. In other words, the two windows $L_i$ and $R_i$ are directly shifted to $L_{\mathit{bp}(i)}$ and $R_{\mathit{bp}(i)}$, respectively.

It depends on the value of $\mathit{diff}(\mathcal{P}_{L_i}, \mathcal{P}_{R_i})$ whether there can be an Abelian square between positions $i$ and $\mathit{bp}(i)$. Note that $\mathit{diff}(\mathcal{P}_{L_i}, \mathcal{P}_{R_i}) \ne 1$. Below, we characterize the other cases in detail.

Lemma 1. Assume $\mathit{diff}(\mathcal{P}_{L_i}, \mathcal{P}_{R_i}) = 0$. Then, for any $i < j \le \mathit{bp}(i)$, $j$ is the beginning position of an Abelian square of length $2d$ iff $w[\mathit{beg}(L_i)] = w[\mathit{beg}(R_i)] = w[\mathit{end}(R_i)+1]$.

Lemma 2. Assume $\mathit{diff}(\mathcal{P}_{L_i}, \mathcal{P}_{R_i}) = 2$.
Let $c_p$ be the unique character which occurs more on the left window $L_i$ than on the right window $R_i$, and $c_q$ be the unique character which occurs more on the right window $R_i$ than on the left window $L_i$. Let $x = \mathcal{P}_{L_i}[p] - \mathcal{P}_{R_i}[p] = \mathcal{P}_{R_i}[q] - \mathcal{P}_{L_i}[q] > 0$, and assume $x \le \min\{D_1^i, D_2^i, D_3^i\}$. Then, $i + x$ is the beginning position of an Abelian square of length $2d$ iff $w[\mathit{beg}(L_i)] = c_p$ and $w[\mathit{beg}(R_i)] = c_q = w[\mathit{end}(R_i)+1]$. Also, this is the only Abelian square of length $2d$ beginning at positions between $i$ and $\mathit{bp}(i)$.

Lemma 3. Assume $\mathit{diff}(\mathcal{P}_{L_i}, \mathcal{P}_{R_i}) = 2$. Let $c_p$ be the unique character which occurs more on the left window $L_i$ than on the right window $R_i$, and $c_q$ be the unique character which occurs more on the right window $R_i$ than on the left window $L_i$. Let $x = \mathcal{P}_{L_i}[p] - \mathcal{P}_{R_i}[p] = \mathcal{P}_{R_i}[q] - \mathcal{P}_{L_i}[q] > 0$, and assume $\frac{x}{2} \le \min\{D_1^i, D_2^i, D_3^i\}$. Then, $i + \frac{x}{2}$ is the beginning position of an Abelian square of length $2d$ iff $w[\mathit{beg}(L_i)] = c_p = w[\mathit{end}(R_i)+1]$ and $w[\mathit{beg}(R_i)] = c_q$. Also, this is the only Abelian square of length $2d$ beginning at positions between $i$ and $\mathit{bp}(i)$.

Lemma 4. Assume $\mathit{diff}(\mathcal{P}_{L_i}, \mathcal{P}_{R_i}) = 3$. Let $c_p = w[\mathit{beg}(L_i)]$, $c_{p'} = w[\mathit{end}(R_i)+1]$, and $c_q = w[\mathit{beg}(R_i)]$. Then, $i + x$ with $i < i + x \le \mathit{bp}(i)$ is the beginning position of an Abelian square of length $2d$ iff $0 < x = \mathcal{P}_{L_i}[p] - \mathcal{P}_{R_i}[p] = \mathcal{P}_{L_i}[p'] - \mathcal{P}_{R_i}[p'] = \frac{\mathcal{P}_{R_i}[q] - \mathcal{P}_{L_i}[q]}{2} \le \min\{D_1^i, D_2^i, D_3^i\}$. Also, this is the only Abelian square of length $2d$ beginning at positions between $i$ and $\mathit{bp}(i)$.

Lemma 5. Assume $\mathit{diff}(\mathcal{P}_{L_i}, \mathcal{P}_{R_i}) \ge 4$. Then, there exists no Abelian square of length $2d$ beginning at any position $j$ with $i < j \le \mathit{bp}(i)$.

Main result. We are ready to show the main result of this section.

Theorem 2. Given a string $w$ of length $n$ over an alphabet of size $\sigma$, we can compute all Abelian squares in $w$ in $O(mn)$ time and $O(n)$ working space, where $m$ is the size of $\mathit{RLE}(w)$.

Proof. Consider an arbitrarily fixed window length $d$. As was explained, it takes $O(m)$ time to compute $\mathcal{P}_{L_1}$, $\mathcal{P}_{R_1}$, and $\mathit{diff}(\mathcal{P}_{L_1}, \mathcal{P}_{R_1})$ for the initial position 1.
Suppose that the two windows are aligned at some position $i \ge 1$. Then, our algorithm computes Abelian squares starting at positions between $i$ and $\mathit{bp}(i)$ using one of Lemma 1, Lemma 2, Lemma 3, Lemma 4, and Lemma 5, depending on the value of $\mathit{diff}(\mathcal{P}_{L_i}, \mathcal{P}_{R_i})$. In each case, all Abelian squares of length $2d$ starting at positions between $i$ and $\mathit{bp}(i)$ can be computed in $O(1)$ time by simple arithmetic. Then, the left and right windows $L_i$ and $R_i$ are shifted to $L_{\mathit{bp}(i)}$ and $R_{\mathit{bp}(i)}$, respectively. Using the array $S$ as in Theorem 1, we can compute $\mathit{bp}(i)$ in $O(1)$ time for a given position $i$ in $w$.

Let us analyze the number of times the windows are shifted for each $d$. Since $\mathit{bp}(i) = i + \min\{D_1^i, D_2^i, D_3^i\}$, for each position $p$ there can be at most three distinct positions $i, j, k$ such that $p = \mathit{bp}(i) = \mathit{bp}(j) = \mathit{bp}(k)$. Thus, for each $d$ we shift the two adjacent windows at most $3m$ times.

Overall, our algorithm runs in $O(mn)$ time for all window lengths $d = 1, \ldots, \lfloor n/2 \rfloor$. The space requirement is $O(n)$ since we need to maintain the Parikh vectors of the two sliding windows and the array $S$. □

An example of how our algorithm computes all Abelian squares using RLEs can be found in Appendix B.1.

5 Computing longest common Abelian factors using RLEs

In this section, we introduce our RLE-based algorithm which computes longest common Abelian factors of two given strings $w_1$ and $w_2$. Formally, we solve the following problem. Let $n = \min\{|w_1|, |w_2|\}$. Given two strings $w_1$ and $w_2$, compute the length $l = \max\{d \mid \mathcal{P}_{w_1[i..i+d-1]} = \mathcal{P}_{w_2[k..k+d-1]},\ 1 \le d \le n\}$ of the longest common Abelian factor(s) of $w_1$ and $w_2$, together with a pair $(i, k)$ of positions on $w_1$ and $w_2$ such that $\mathcal{P}_{w_1[i..i+l-1]} = \mathcal{P}_{w_2[k..k+l-1]}$.

Our algorithm uses an idea from Alatabbi et al.'s algorithm [1].
For each window size $d$, their algorithm computes the Parikh vectors of all substrings of $w_1$ and $w_2$ of length $d$ in $O(\sigma n)$ time, using two windows of length $d$ each. Then they sort the Parikh vectors in $O(\sigma n)$ time, and output the largest $d$ for which common Parikh vectors exist for $w_1$ and $w_2$, together with the lists of respective occurrences of longest common Abelian factors. The total time requirement is clearly $O(\sigma n^2)$.

Our algorithm is different from Alatabbi et al.'s algorithm in that (1) we use RLEs of strings $w_1$ and $w_2$ and (2) we avoid sorting the Parikh vectors. As in the previous sections, for a given window length $d$ ($1 \le d \le n$), we shift two windows of length $d$ over both of $\mathit{RLE}(w_1)$ and $\mathit{RLE}(w_2)$, and stop when we reach a break point of $\mathit{RLE}(w_1)$ or $\mathit{RLE}(w_2)$. We then check if there is a common Abelian factor in the ranges of $w_1$ and $w_2$ we are looking at.

Since we use a single window for each of the input strings $w_1$ and $w_2$, we need to modify the definition of the break points. Let $U_i$ and $V_k$ be the sliding windows for $w_1$ and $w_2$ that are aligned at position $i$ of $w_1$ and at position $k$ of $w_2$, respectively. For each position $i \ge 1$ in $w_1$, let $\mathit{bp}_1(i) = i + \min\{D_1^i, D_2^i\}$, where $D_1^i = \mathit{succ}(\mathit{beg}(U_i)) - i$ and $D_2^i = \mathit{succ}(\mathit{end}(U_i)) - i$. For each position $k \ge 1$ in $w_2$, $\mathit{bp}_2(k)$ is defined analogously.

Fig. 2. Conceptual drawing of $c_{p_l}$, $c_{p_r}$, $c_{q_r}$, and $c_{q_l}$.

Consider an arbitrarily fixed window length $d$. Assume that we have just shifted the window on $w_1$ from position $i$ (i.e., $U_i = w_1[i..i+d-1]$) to the break point $\mathit{bp}_1(i)$ (i.e., $U_{\mathit{bp}_1(i)} = w_1[\mathit{bp}_1(i)..\mathit{bp}_1(i)+d-1]$). Let $c_{p_l} = w_1[i]$ and $c_{p_r} = w_1[i+d]$ (see also Fig. 2).

For characters $c_{p_l}$ and $c_{p_r}$, we consider the minimum and maximum numbers of occurrences of these characters during the slide from position $i$ to $\mathit{bp}_1(i)$. Let $\min(p_l) = \mathcal{P}_{w_1[\mathit{bp}_1(i)..\mathit{bp}_1(i)+d-1]}[p_l]$, $\max(p_l) = \mathcal{P}_{w_1[i..i+d-1]}[p_l]$, $\min(p_r) = \mathcal{P}_{w_1[i..i+d-1]}[p_r]$ and $\max(p_r) = \mathcal{P}_{w_1[\mathit{bp}_1(i)..\mathit{bp}_1(i)+d-1]}[p_r]$.
We will use these values to determine if there is a common Abelian factor of length $d$ for $w_1$ and $w_2$.

Also, assume that we have just shifted the window on $w_2$ from position $k$ (i.e., $V_k = w_2[k..k+d-1]$) to the break point $\mathit{bp}_2(k)$ (i.e., $V_{\mathit{bp}_2(k)} = w_2[\mathit{bp}_2(k)..\mathit{bp}_2(k)+d-1]$). Let $c_{q_l} = w_2[k]$ and $c_{q_r} = w_2[k+d]$ (see also Fig. 2). For characters $c_{q_l}$ and $c_{q_r}$, we also consider the minimum and maximum numbers of occurrences of these characters during the slide from position $k$ to $\mathit{bp}_2(k)$. Let $\min(q_l) = \mathcal{P}_{w_2[\mathit{bp}_2(k)..\mathit{bp}_2(k)+d-1]}[q_l]$, $\max(q_l) = \mathcal{P}_{w_2[k..k+d-1]}[q_l]$, $\min(q_r) = \mathcal{P}_{w_2[k..k+d-1]}[q_r]$ and $\max(q_r) = \mathcal{P}_{w_2[\mathit{bp}_2(k)..\mathit{bp}_2(k)+d-1]}[q_r]$.

Let $m$ be the total size of $\mathit{RLE}(w_1)$ and $\mathit{RLE}(w_2)$, and $l$ be the length of longest common Abelian factors of $w_1$ and $w_2$. Our algorithm computes an $O(m^2)$-size representation of every pair $(i, k)$ of positions for which $(w_1[i..i+l-1], w_2[k..k+l-1])$ is a longest common Abelian factor of $w_1$ and $w_2$.

In the lemmas which follow, we assume that $\mathcal{P}_{w_1[i..i+d-1]}[v] = \mathcal{P}_{w_2[k..k+d-1]}[v]$ for any $v \in \{1, \ldots, \sigma\} \setminus \{p_l, p_r, q_l, q_r\}$. This is because, if this condition is not satisfied, then there cannot be a common Abelian factor of length $d$ for positions between $i$ and $\mathit{bp}_1(i)$ in $w_1$ and positions between $k$ and $\mathit{bp}_2(k)$ in $w_2$.

Lemma 6. Assume $c_{p_l} = c_{p_r}$ and $c_{q_l} = c_{q_r}$. Then, for any pair of positions $i \le i' \le \mathit{bp}_1(i)$ and $k \le k' \le \mathit{bp}_2(k)$, $(w_1[i'..i'+d-1], w_2[k'..k'+d-1])$ is a common Abelian factor of length $d$ iff $\mathcal{P}_{w_1[i..i+d-1]} = \mathcal{P}_{w_2[k..k+d-1]}$.

Proof. Since $c_{p_l} = c_{p_r}$ and $c_{q_l} = c_{q_r}$, the Parikh vectors of the sliding windows do not change during the slides from $i$ to $\mathit{bp}_1(i)$ and from $k$ to $\mathit{bp}_2(k)$. Thus the lemma holds. □

Lemma 7. Assume $c_{p_l} = c_{q_l} \ne c_{p_r} = c_{q_r}$. There is a common Abelian factor $(w_1[i+x..i+x+d-1], w_2[k+y..k+y+d-1])$ of length $d$ iff $0 \le x \le \mathit{bp}_1(i) - i$, $0 \le y \le \mathit{bp}_2(k) - k$ and $x - y = \max(p_l) - \max(q_l) = \min(q_r) - \min(p_r)$.

Lemma 8. Assume $c_{p_r} \ne c_{p_l} = c_{q_l} \ne c_{q_r}$ and $c_{p_r} \ne c_{q_r}$.
There is a common Abelian factor $(w_1[i+x..i+x+d-1], w_2[k+y..k+y+d-1])$ of length $d$ iff $x = \mathcal{P}_{w_2[k..k+d-1]}[p_r] - \min(p_r) \ge 0$, $y = \mathcal{P}_{w_1[i..i+d-1]}[q_r] - \min(q_r) \ge 0$ and $\mathcal{P}_{w_1[i..i+d-1]}[p_l] - x = \mathcal{P}_{w_2[k..k+d-1]}[q_l] - y$.

Lemma 9. Assume $c_{p_l} \ne c_{p_r} = c_{q_r} \ne c_{q_l}$ and $c_{p_l} \ne c_{q_l}$. There is a common Abelian factor $(w_1[i+x..i+x+d-1], w_2[k+y..k+y+d-1])$ of length $d$ iff $x = \max(p_l) - \mathcal{P}_{w_2[k..k+d-1]}[p_l] \ge 0$, $y = \max(q_l) - \mathcal{P}_{w_1[i..i+d-1]}[q_l] \ge 0$ and $\mathcal{P}_{w_1[i..i+d-1]}[p_r] + x = \mathcal{P}_{w_2[k..k+d-1]}[q_r] + y$.

Lemma 10. Assume $c_{p_l} = c_{q_r} \ne c_{p_r} = c_{q_l}$. There is a common Abelian factor $(w_1[i+x..i+x+d-1], w_2[k+y..k+y+d-1])$ of length $d$ iff $x + y = \max(p_l) - \min(q_r) = \max(q_l) - \min(p_r)$, $0 \le x \le \mathit{bp}_1(i) - i$ and $0 \le y \le \mathit{bp}_2(k) - k$.

Lemma 11. Assume $c_{p_l}$, $c_{p_r}$, $c_{q_l}$ and $c_{q_r}$ are mutually distinct. There is a common Abelian factor $(w_1[i+x..i+x+d-1], w_2[k+y..k+y+d-1])$ of length $d$ iff $0 \le x = \max(p_l) - \mathcal{P}_{w_2[k..k+d-1]}[p_l] = \mathcal{P}_{w_2[k..k+d-1]}[p_r] - \min(p_r) \le \mathit{bp}_1(i) - i$ and $0 \le y = \max(q_l) - \mathcal{P}_{w_1[i..i+d-1]}[q_l] = \mathcal{P}_{w_1[i..i+d-1]}[q_r] - \min(q_r) \le \mathit{bp}_2(k) - k$.

Lemma 12. Assume $c_{q_l} \ne c_{p_l} = c_{p_r} \ne c_{q_r}$ and $c_{q_l} \ne c_{q_r}$. There is a common Abelian factor $(w_1[i+x..i+x+d-1], w_2[k+y..k+y+d-1])$ of length $d$ iff $0 \le x \le \mathit{bp}_1(i) - i$, $0 \le y = \max(q_l) - \mathcal{P}_{w_1[i..i+d-1]}[q_l] = \mathcal{P}_{w_1[i..i+d-1]}[q_r] - \min(q_r) \le \mathit{bp}_2(k) - k$ and $\mathcal{P}_{w_1[i..i+d-1]}[p_l] = \mathcal{P}_{w_2[k..k+d-1]}[p_l]$.

Lemma 13. Assume $c_{p_l} \ne c_{q_l} = c_{q_r} \ne c_{p_r}$ and $c_{p_l} \ne c_{p_r}$. There is a common Abelian factor $(w_1[i+x..i+x+d-1], w_2[k+y..k+y+d-1])$ of length $d$ iff $0 \le y \le \mathit{bp}_2(k) - k$ and $x = \max(p_l) - \mathcal{P}_{w_2[k..k+d-1]}[p_l] = \mathcal{P}_{w_2[k..k+d-1]}[p_r] - \min(p_r) \ge 0$.

Lemma 14. Assume $c_{p_r} \ne c_{p_l} = c_{q_r} \ne c_{q_l}$ and $c_{p_r} \ne c_{q_l}$. There is a common Abelian factor $(w_1[i+x..i+x+d-1], w_2[k+y..k+y+d-1])$ of length $d$ iff $0 \le x = \mathcal{P}_{w_2[k..k+d-1]}[p_r] - \min(p_r) \le \mathit{bp}_1(i) - i$, $0 \le y = \max(q_l) - \mathcal{P}_{w_1[i..i+d-1]}[q_l] \le \mathit{bp}_2(k) - k$ and $\mathcal{P}_{w_1[i..i+d-1]}[p_l] - x = \mathcal{P}_{w_2[k..k+d-1]}[q_r] + y$.

Lemma 15. Assume $c_{p_l} \ne c_{q_l} = c_{p_r} \ne c_{q_r}$ and $c_{p_l} \ne c_{q_r}$.
There is a common Abelian factor $(w_1[i+x..i+x+d-1], w_2[k+y..k+y+d-1])$ of length $d$ iff $0 \le x = \max(p_l) - \mathcal{P}_{w_2[k..k+d-1]}[p_l] \le \mathit{bp}_1(i) - i$, $0 \le y = \mathcal{P}_{w_1[i..i+d-1]}[q_r] - \min(q_r) \le \mathit{bp}_2(k) - k$ and $\mathcal{P}_{w_1[i..i+d-1]}[p_r] + x = \mathcal{P}_{w_2[k..k+d-1]}[q_l] - y$.

Theorem 3. Given two strings $w_1$ and $w_2$, we can compute an $O(m^2)$-size representation of all longest common Abelian factors of $w_1$ and $w_2$ in $O(m^2 n)$ time with $O(\sigma)$ working space, where $m$ and $n$ are the total size of the RLEs and the total length of $w_1$ and $w_2$, respectively.
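
To make the problem statement of this section concrete, the following Python sketch computes longest common Abelian factors by brute force. It is explicitly not the RLE-based algorithm of Theorem 3 (it uses neither RLEs nor break points, and it is much slower); it is only a naive reference, under the assumption that one wants something simple to compare outputs against. The function name `longest_common_abelian_factors` is hypothetical.

```python
from collections import Counter


def longest_common_abelian_factors(w1: str, w2: str):
    """Return (l, pairs): l is the length of a longest common Abelian factor,
    and pairs lists all 1-indexed (i, k) such that w1[i..i+l-1] and
    w2[k..k+l-1] are Abelian equivalent. Returns (0, []) if none exists.
    """
    for d in range(min(len(w1), len(w2)), 0, -1):
        # Group the start positions in w1 by the Parikh vector of the window,
        # represented here as a hashable set of (character, count) pairs.
        seen = {}
        for i in range(len(w1) - d + 1):
            key = frozenset(Counter(w1[i:i + d]).items())
            seen.setdefault(key, []).append(i + 1)
        pairs = []
        for k in range(len(w2) - d + 1):
            key = frozenset(Counter(w2[k:k + d]).items())
            pairs.extend((i, k + 1) for i in seen.get(key, []))
        if pairs:
            return d, pairs
    return 0, []


if __name__ == "__main__":
    print(longest_common_abelian_factors("abcd", "xbay"))  # (2, [(1, 2)])
```

Trying window lengths from longest to shortest lets the sketch stop at the first length with a match, which is by definition the longest.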
