An in-place truncated Fourier transform and applications to polynomial multiplication

David Harvey                                      Daniel S. Roche
Courant Institute of Mathematical Sciences        Cheriton School of Computer Science
New York University                               University of Waterloo
New York, New York, U.S.A.                        Waterloo, Ontario, Canada
[email protected]                               [email protected]
www.cims.nyu.edu/~harvey/                         www.cs.uwaterloo.ca/~droche/

ABSTRACT

The truncated Fourier transform (TFT) was introduced by van der Hoeven in 2004 as a means of smoothing the "jumps" in running time of the ordinary FFT algorithm that occur at power-of-two input sizes. However, the TFT still introduces these jumps in memory usage. We describe in-place variants of the forward and inverse TFT algorithms, achieving time complexity O(n log n) with only O(1) auxiliary space. As an application, we extend the second author's results on space-restricted FFT-based polynomial multiplication to polynomials of arbitrary degree.

Categories and Subject Descriptors

F.2.1 [Analysis of Algorithms and Problem Complexity]: Numerical Algorithms and Problems—Computations on polynomials; G.4 [Mathematical Software]: Algorithm design and analysis, Efficiency; I.1.2 [Symbolic and Algebraic Manipulation]: Algorithms—Algebraic algorithms, Analysis of algorithms

General Terms

Algorithms, Performance, Theory

Keywords

Truncated Fourier transform, fast Fourier transform, polynomial multiplication, in-place algorithms

1. INTRODUCTION

1.1 Background

The discrete Fourier transform (DFT) is a linear map that evaluates a given polynomial at powers of a root of unity. Cooley and Tukey [3] were the first to develop an efficient method to compute this transform on a digital computer, known as the Fast Fourier Transform (FFT). This algorithm has since become one of the most important and useful tools in computer science, especially in the area of signal processing.

The FFT algorithm is also important in computer algebra, most notably in asymptotically fast methods for integer and polynomial multiplication. The first integer multiplication algorithm to run in softly linear time relies on the FFT [10], as do the recent theoretical improvement [4] and the best result for polynomial multiplication over arbitrary algebras [2]. Moreover, numerous other operations on polynomials, including division, evaluation/interpolation, and GCD computation, have been reduced to multiplication, so more efficient multiplication methods have an indirect effect on many areas in computer algebra [5, §8–11].

The simplest FFT to implement, and the fastest in practice, is the radix-2 Cooley–Tukey FFT. Because the radix-2 FFT requires the size to be a power of two, the simplest solution for all other sizes is to pad the input polynomials with zeros, resulting in large unwanted "jumps" in the complexity at powers of two.

1.2 The truncated Fourier transform

It has been known for some time that if only a subset of the DFT output is needed, then the FFT can be truncated or "pruned" to reduce the complexity, essentially by disregarding those parts of the computation tree not contributing to the desired outputs [8, 11]. More recently, van der Hoeven took the crucial step of showing how to invert this process, describing a truncated Fourier transform (TFT) and an inverse truncated Fourier transform (ITFT), and showing that this leads to a polynomial multiplication algorithm whose running time varies relatively smoothly in the input size [12, 13].

Specifically, given an input vector of length n ≤ 2^k, the TFT computes the first n coefficients of the ordinary Fourier transform of length 2^k, and the ITFT computes the inverse of this map.
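For concreteness, the map computed by the TFT can be described by the following naive Python sketch. It is a quadratic-time correctness oracle only, not the fast algorithm discussed in this paper, and it assumes the bit-reversed root ordering ω_s = ω_{[k]}^{rev_k s} fixed below in Section 2.2; the names rev and tft_reference are illustrative.

    # Naive reference model of the TFT map: quadratic time, for testing only.
    # Assumes the bit-reversed root ordering omega_s = omega_[k]**rev_k(s)
    # introduced in Section 2.2; `w` is a primitive 2**k-th root of unity.

    def rev(s, k):
        """Length-k bit reversal of s."""
        r = 0
        for _ in range(k):
            r = (r << 1) | (s & 1)
            s >>= 1
        return r

    def tft_reference(coeffs, w, k):
        """Return F(omega_s) for s = 0, ..., n-1, where n = len(coeffs) <= 2**k."""
        assert len(coeffs) <= 1 << k
        out = []
        for s in range(len(coeffs)):
            x = w ** rev(s, k)
            out.append(sum(c * x ** i for i, c in enumerate(coeffs)))
        return out

Over the complex numbers one may take w = exp(2πi / 2^k); the algorithms of Sections 3 and 4 compute this map and its inverse in O(n log n) operations using only O(1) auxiliary space.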
The running time of these algorithms smoothly interpolates the O(n log n) complexity of the standard radix-2 Cooley–Tukey FFT algorithm. One can therefore deduce an asymptotically fast polynomial multiplication algorithm that avoids the characteristic "jumps" in running time exhibited by traditional FFT-based polynomial multiplication algorithms when the output degree crosses a power-of-two boundary. This observation has been confirmed with practical implementations [13, 7, 6], with the most marked improvements in the multivariate case.

One drawback of van der Hoeven's algorithms is that while their time complexity varies smoothly with n, their space complexity does not. Both the TFT and ITFT operate in a buffer of length 2^⌈lg n⌉; that is, for inputs of length n, they require auxiliary storage of 2^⌈lg n⌉ − n + O(1) cells to store intermediate results, which can be Ω(n) in the worst case.

1.3 Summary of results

The main results of this paper are TFT and ITFT algorithms that require only O(1) auxiliary space, while respecting the O(n log n) time bound.

The new algorithms have their origin in a cache-friendly variant of the TFT and ITFT given by the first author [6], which builds on Bailey's cache-friendly adaptation of the ordinary FFT [1]. If the transform takes place in a buffer of length L = 2^ℓ, these algorithms decompose the transform into L_1 = 2^{ℓ_1} row transforms of length L_2 = 2^{ℓ_2} and L_2 column transforms of length L_1, where ℓ_1 + ℓ_2 = ℓ. Van der Hoeven's algorithms correspond to the case L_1 = 2 and L_2 = L/2. To achieve optimal locality, [6] suggests taking L_i ≈ √L (ℓ_i ≈ ℓ/2). In fact, in this case one already obtains TFT and ITFT algorithms needing only O(√n) auxiliary space. At the other extreme we may take L_1 = L/2 and L_2 = 2, obtaining TFT and ITFT algorithms that use only O(1) space at each recursion level, or O(log n) auxiliary space altogether. In signal processing language, these may be regarded as decimation-in-time variants of van der Hoeven's decimation-in-frequency algorithms.

Due to data dependencies in the O(log n)-space algorithms sketched above, the space usage cannot be reduced further by simply reordering the arithmetic operations. In this paper, we show that with a little extra work, increasing the implied constant in the O(n log n) running time bound, it is possible to reduce the auxiliary space to only O(1). To make the O(1) space bound totally explicit, we present our TFT and ITFT algorithms (Algorithms 1 and 2) in an iterative fashion, with no recursion. Since we do not have space to store all the necessary roots of unity, we explicitly include steps to compute them on the fly; this is non-trivial because the decimation-in-time approach requires indexing the roots in bit-reversed order.

As an application, we generalize the second author's space-restricted polynomial multiplication algorithm [9]. Consider a model in which the input polynomials are considered read-only, but the output buffer may be read from and written to multiple times. The second author showed that in such a model, it is possible to multiply polynomials of degree n = 2^k − 1 in time O(n log n) using only O(1) auxiliary space. Using the new in-place ITFT, we generalize this result to polynomials of arbitrary degree.

2. PRELIMINARIES

2.1 Computational model

We work over a ring R containing 2^k-th roots of unity for all (suitably large) k, and in which 2 is not a zero-divisor.

Our memory model is similar to that used in the study of in-place algorithms for sorting and geometric problems, combined with the well-studied notion of algebraic complexity. Specifically, we allow two primitive types in memory: ring elements and pointers. A ring element is any single element of R, and the input to any algorithm will consist of n such elements stored in an array. A pointer can hold a single integer a ∈ Z in the range −cn ≤ a ≤ cn for some fixed constant c ∈ N. (In our algorithms, we could take c = 2.)

We say an algorithm is in-place if it overwrites its input buffer with the output. In this case, any element in this single input/output array may be read from or written to in constant time. Our in-place truncated Fourier transform algorithms (Algorithms 1 and 2) fall under this model.

An out-of-place algorithm uses separate memory locations for input and output. Here, any element from the input array may be read from in constant time (but not overwritten), and any element in the output array may be read from or written to in constant time as well. This will be the situation in our multiplication algorithm (Algorithm 3).

The algorithms also need to store some number of pointers and ring elements not in the input or output arrays, which we define to be the auxiliary storage used by the algorithm. All the algorithms we present will use only O(1) auxiliary storage space.

This model should correspond well with practice, at least when the computations are performed in main memory and the ring R is finite.

2.2 DFT notation

We denote by ω_{[k]} a primitive 2^k-th root of unity, and we assume that these are chosen compatibly, so that ω_{[k+1]}^2 = ω_{[k]} for all k ≥ 0. Define a sequence of roots ω_0, ω_1, ... by

    ω_s = ω_{[k]}^{rev_k s},

where k ≥ ⌈lg(s + 1)⌉ and rev_k s denotes the length-k bit-reversal of s. Thus we have

    ω_0 = ω_{[0]} (= 1),    ω_2 = ω_{[2]},      ω_4 = ω_{[3]},      ω_6 = ω_{[3]}^3,
    ω_1 = ω_{[1]} (= −1),   ω_3 = ω_{[2]}^3,    ω_5 = ω_{[3]}^5,    ω_7 = ω_{[3]}^7,

and so on. Note that

    ω_{2s+1} = −ω_{2s}   and   ω_{2s}^2 = ω_{2s+1}^2 = ω_s.

If F ∈ R[x] is a polynomial with deg F < n, we write F_s for the coefficient of x^s in F, and we define the Fourier transform F̂ by

    F̂_s = F(ω_s).

In Algorithms 1 and 2 below, we decompose F as

    F(x) = G(x^2) + x·H(x^2),

where deg G < ⌈n/2⌉ and deg H < ⌊n/2⌋. Using the properties of ω_s mentioned above, we obtain the "butterfly" relations

    F̂_{2s}   = Ĝ_s + ω_{2s}·Ĥ_s,
    F̂_{2s+1} = Ĝ_s − ω_{2s}·Ĥ_s.                                        (1)

Both the TFT and ITFT algorithms require, at each recursive level, iterating through a set of index-root pairs such as {(i, ω_i) : 0 ≤ i < n}. A traditional, time-efficient approach would be to precompute all powers of ω_{[k]}, store them in reverted-binary order, and then pass through this array with a single pointer. However, this is impossible under the restriction that no auxiliary storage space be used. Instead, we will compute the roots on the fly by iterating through the powers of ω_{[k]} in order, and through the indices i in bit-reversed order. Observe that incrementing an integer counter through rev_k 0, rev_k 1, rev_k 2, ... can be done in exactly the same way as incrementing through 0, 1, 2, ..., which is possible in-place and in amortized constant time.
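A minimal Python sketch of this reversed-binary increment follows; it is an illustration only, and the function name revbin_increment is not part of Algorithms 1 and 2.

    def revbin_increment(x, k):
        """Given x = rev_k(i), return rev_k(i + 1), assuming i + 1 < 2**k.
        This mirrors an ordinary binary increment with the bit order reversed:
        the carry propagates from the most significant bit downwards, so a full
        sweep over rev_k(0), rev_k(1), ... costs amortized O(1) per step and
        needs no storage beyond the counter itself."""
        bit = 1 << (k - 1)
        while x & bit:
            x ^= bit        # clear this bit and carry ...
            bit >>= 1       # ... into the next lower one
        return x | bit

    # Example with k = 3: starting from 0 and applying revbin_increment
    # repeatedly yields 0, 4, 2, 6, 1, 5, 3, 7, i.e. rev_3(0), ..., rev_3(7).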
      (0,0) = {0,1,2,3,4,5}
      (0,1) = {0,2,4}                  (1,1) = {1,3,5}
      (0,2) = {0,4}    (2,2) = {2}     (1,2) = {1,5}    (3,2) = {3}
      (0,3) = {0}  (4,3) = {4}         (1,3) = {1}  (5,3) = {5}

                    Figure 1: TFT tree for n = 6

3. SPACE-RESTRICTED TFT

In this section we describe an in-place TFT algorithm that uses only O(1) auxiliary space (Algorithm 1). The routine operates on a buffer X_0, ..., X_{n−1} containing elements of R. It takes as input a root of unity of sufficiently high order and the coefficients F_0, ..., F_{n−1} of a polynomial F ∈ R[x], and overwrites these with F̂_0, ..., F̂_{n−1}.

Algorithm 1: InplaceTFT([X_0, ..., X_{n−1}])
  Input:  X_i = F_i for 0 ≤ i < n, where F ∈ R[x], deg F < n
  Output: X_i = F̂_i for 0 ≤ i < n
   1  S ← LeftmostLeaf(0, 0)
   2  prev ← null
   3  while true do
   4      m ← len(S)
   5      if IsLeaf(S) or prev = Odd(S) then
   6          for (i, θ) ∈ {(j, ω_{2j}) : 0 ≤ j < ⌊m/2⌋} do
   7              (S_{2i}, S_{2i+1}) ← (S_{2i} + θ·S_{2i+1}, S_{2i} − θ·S_{2i+1})
   8          if S = (0, 0) then halt
   9          prev ← S
  10          S ← Parent(S)
  11      else if prev = Even(S) then
  12          if len(S) ≡ 1 (mod 2) then
  13              v ← Σ_{i=0}^{(m−3)/2} S_{2i+1}·(ω_{(m−1)/2})^i
  14              S_{m−1} ← S_{m−1} + v·ω_{m−1}
  15          prev ← S
  16          S ← LeftmostLeaf(Odd(S))

The pattern of the algorithm is recursive, but we avoid recursion by explicitly moving through the recursion tree, avoiding unnecessary space usage. An example tree for n = 6 is shown in Figure 1. The node S = (q, r) represents the subarray with offset q and stride 2^r; the i-th element in this subarray is S_i = X_{q+i·2^r}, and the length of the subarray is given by

    len(S) = ⌈(n − q) / 2^r⌉.

The root is (0, 0), corresponding to the entire input array of length n. Each subarray of length 1 corresponds to a leaf node, and we define the predicate IsLeaf(S) to be true iff len(S) = 1. Each non-leaf node splits into even and odd child nodes. To facilitate the path through the tree, we define

    Even(q, r) = (q, r + 1),    Odd(q, r) = (q + 2^r, r + 1)

if (q, r) is not a leaf,

    Parent(q, r) = (q, r − 1)             if q < 2^{r−1},
                   (q − 2^{r−1}, r − 1)   if q ≥ 2^{r−1},

if (q, r) is not the root, and for any node we define

    LeftmostLeaf(S) = S                        if IsLeaf(S),
                      LeftmostLeaf(Even(S))    otherwise.
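These bookkeeping functions translate directly into code. The following Python sketch is an illustration only (the parameter n is passed explicitly because len depends on it); with n = 6 it reproduces the tree of Figure 1.

    # Sketch of the tree bookkeeping for Algorithms 1 and 2. A node S = (q, r)
    # stands for the subarray X[q], X[q + 2**r], X[q + 2*2**r], ... of length
    # ceil((n - q) / 2**r). Everything here uses O(1) auxiliary space.

    def length(S, n):
        q, r = S
        return (n - q + (1 << r) - 1) >> r      # ceil((n - q) / 2**r)

    def is_leaf(S, n):
        return length(S, n) == 1

    def even(S):
        q, r = S
        return (q, r + 1)

    def odd(S):
        q, r = S
        return (q + (1 << r), r + 1)

    def parent(S):
        q, r = S
        half = 1 << (r - 1)
        return (q, r - 1) if q < half else (q - half, r - 1)

    def leftmost_leaf(S, n):
        while not is_leaf(S, n):
            S = even(S)
        return S

    # With n = 6: leftmost_leaf((0, 0), 6) == (0, 3), and the children of the
    # root are even((0, 0)) == (0, 1) and odd((0, 0)) == (1, 1), as in Figure 1.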
We begin with the following lemma.

Lemma 3.1. Let N be a node with len(N) = ℓ, and let

    A(x) = Σ_{0 ≤ i < ℓ} A_i·x^i ∈ R[x].

If S = LeftmostLeaf(N) and N_i = A_i for 0 ≤ i < ℓ before some iteration of line 3 in Algorithm 1, then after a finite number of steps, we will have S = N and N_i = Â_i for 0 ≤ i < ℓ, before the execution of line 8. No other array entries in X are affected.

Proof. The proof is by induction on ℓ. If ℓ = 1, then IsLeaf(N) is true and Â_0 = A_0, so we are done. So assume ℓ > 1 and that the lemma holds for all shorter lengths.

Decompose A as A(x) = G(x^2) + x·H(x^2). Since S = LeftmostLeaf(Even(N)) as well, the induction hypothesis guarantees that the even-indexed elements of N, corresponding to the coefficients of G, will be transformed into Ĝ, and we will have S = Even(N) before line 8. The following lines set prev = Even(N) and S = N, so that lines 12–16 are executed on the next iteration.

If ℓ is odd, then (ℓ − 1)/2 ≥ len(Odd(N)), so Ĥ_{(ℓ−1)/2} will not be computed in the odd subtree, and we will not be able to apply (1) to compute Â_{ℓ−1} = Ĝ_{(ℓ−1)/2} + ω_{ℓ−1}·Ĥ_{(ℓ−1)/2}. This is why, in this case, we explicitly compute

    v = H(ω_{(ℓ−1)/2}) = Ĥ_{(ℓ−1)/2}

on line 13, and then compute Â_{ℓ−1} directly on line 14, before descending into the odd subtree.

Another application of the induction hypothesis guarantees that we will return to line 8 with S = Odd(N) after computing N_{2i+1} = Ĥ_i for 0 ≤ i < ⌊ℓ/2⌋. The following lines set prev = Odd(N) and S = N, and we arrive at line 6 on the next iteration. The for loop thus properly applies the butterfly relations (1) to compute Â_i for 0 ≤ i < 2⌊ℓ/2⌋, which completes the proof.

Now we are ready for the main result of this section.

Proposition 3.2. Algorithm 1 correctly computes F̂_i for 0 ≤ i < n. It performs O(n log n) ring and pointer operations, and uses O(1) auxiliary space.

Proof. The correctness follows immediately from Lemma 3.1, since we start with S = LeftmostLeaf(0, 0), which is the first leaf of the whole tree. The space bound is immediate since each variable has constant size.

To verify the time bound, notice that the while loop visits each leaf node once and each non-leaf node twice (once with prev = Even(S) and once with prev = Odd(S)). Since always q < 2^r < 2n, there are O(n) iterations through the while loop, each of which has cost O(len(S) + log n). This gives the total cost of O(n log n).
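For intuition, the recursion that Lemma 3.1 verifies, and that Algorithm 1 unrolls, can be summarized by the following Python sketch. This is a recursive reformulation for illustration only: it performs the same sequence of buffer updates, but uses O(log n) stack space rather than the O(1) auxiliary space of Algorithm 1, and it takes the roots ω_s through an assumed callback root(s).

    import cmath

    def rev(s, k):
        """Length-k bit reversal of s."""
        out = 0
        for _ in range(k):
            out = (out << 1) | (s & 1)
            s >>= 1
        return out

    def node_tft(X, q, r, n, root):
        """Transform the node (q, r): overwrite S_i = X[q + i*2**r] with A-hat_i,
        where A is the polynomial whose coefficients the subarray currently holds
        and root(s) returns omega_s as defined in Section 2.2."""
        step = 1 << r
        ell = (n - q + step - 1) >> r              # len(S)
        if ell == 1:
            return
        node_tft(X, q, r + 1, n, root)             # even subtree now holds G-hat
        if ell % 2 == 1:                           # odd length: fix up the last
            h = (ell - 1) // 2                     # output, as on lines 13-14
            v = sum(X[q + (2*i + 1)*step] * root(h)**i for i in range(h))
            X[q + (ell - 1)*step] += v * root(ell - 1)
        node_tft(X, q + step, r + 1, n, root)      # odd subtree now holds H-hat
        for j in range(ell // 2):                  # butterfly relations (1)
            a, b = X[q + 2*j*step], X[q + (2*j + 1)*step]
            t = root(2*j)
            X[q + 2*j*step], X[q + (2*j + 1)*step] = a + t*b, a - t*b

    # Example over C, taking omega_[K] = exp(2*pi*i / 2**K) for some 2**K >= n:
    def complex_root(K):
        return lambda s: cmath.exp(2j * cmath.pi * rev(s, K) / (1 << K))

    X = [1, 2, 3]
    node_tft(X, 0, 0, len(X), complex_root(2))     # X becomes [6, 2, -2+2j] (approx.)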
4. SPACE-RESTRICTED ITFT

Next we describe an in-place inverse TFT algorithm that uses O(1) auxiliary space (Algorithm 2). It takes as input F̂_0, ..., F̂_{n−1} for some polynomial F ∈ R[x], deg F < n, and overwrites the buffer with F_0, ..., F_{n−1}.

Algorithm 2: InplaceITFT([X_0, ..., X_{n−1}])
  Input:  X_i = F̂_i for 0 ≤ i < n, where F ∈ R[x], deg F < n
  Output: X_i = F_i for 0 ≤ i < n
   1  S ← (0, 0)
   2  while S ≠ LeftmostLeaf(0, 0) do
   3      if IsLeaf(S) then
   4          S ← Parent(RightmostParent(S))
   5          m ← len(S)
   6          if len(S) ≡ 1 (mod 2) then
   7              v ← Σ_{i=0}^{(m−3)/2} S_{2i+1}·(ω_{(m−1)/2})^i
   8              S_{m−1} ← S_{m−1} − v·ω_{m−1}
   9          S ← Even(S)
  10      else
  11          m ← len(S)
  12          for (i, θ) ∈ {(j, ω_{2j}^{−1}) : 0 ≤ j < ⌊m/2⌋} do
  13              (S_{2i}, S_{2i+1}) ← ((S_{2i} + S_{2i+1})/2, θ·(S_{2i} − S_{2i+1})/2)
  14          S ← Odd(S)

The path of the algorithm is exactly the reverse of Algorithm 1, and we use the same notation as before to move through the tree. We only require one additional function:

    RightmostParent(S) = S                              if S = Odd(Parent(S)),
                         RightmostParent(Parent(S))     otherwise.

If

    LeftmostLeaf(Odd(N_1)) = N_2,

then

    Parent(RightmostParent(N_2)) = N_1,

so RightmostParent computes the inverse of the assignment on line 16 in Algorithm 1.

We leave it to the reader to confirm that the structure of the recursion is identical to that of Algorithm 1, but in reverse, from which the following analogues of Lemma 3.1 and Proposition 3.2 follow immediately:

Lemma 4.1. Let N be a node with len(N) = ℓ, and

    A(x) = Σ_{0 ≤ i < ℓ} A_i·x^i ∈ R[x].

If S = N and N_i = Â_i for 0 ≤ i < ℓ before some iteration of line 2 in Algorithm 2, then after a finite number of steps, we will have S = LeftmostLeaf(N) and N_i = A_i for 0 ≤ i < ℓ before some iteration of line 2. No other array entries in X are affected.

Proposition 4.2. Algorithm 2 correctly computes F_i for 0 ≤ i < n. It performs O(n log n) ring and pointer operations, and uses O(1) auxiliary space.

The fact that our InplaceTFT and InplaceITFT algorithms are essentially reverses of each other is an interesting feature not shared by the original formulations in [12].

5. POLYNOMIAL MULTIPLICATION

We now describe the multiplication algorithm alluded to in the introduction. The strategy is similar to that of [9], with a slightly more complicated "folding" step. The input consists of two polynomials A, B ∈ R[x] with deg A < m and deg B < n. The routine is supplied an output buffer X of length r = n + m − 1 in which to write the product C = AB. The subroutine FFT has the same interface as InplaceTFT, but is only called for power-of-two length inputs.

Algorithm 3: Space-restricted product
  Input:  A, B ∈ R[x], deg A < m, deg B < n
  Output: X_s = C_s for 0 ≤ s < n + m − 1, where C = AB
   1  r ← n + m − 1
   2  q ← 0
   3  while q < r − 1 do
   4      ℓ ← ⌊lg(r − q)⌋ − 1
   5      L ← 2^ℓ
   6      [X_q, X_{q+1}, ..., X_{q+2L−1}] ← [0, 0, ..., 0]
   7      for 0 ≤ i < m do
   8          X_{q+(i mod L)} ← X_{q+(i mod L)} + ω_q^i·A_i
   9      FFT([X_q, X_{q+1}, ..., X_{q+L−1}])
  10      for 0 ≤ i < n do
  11          X_{q+L+(i mod L)} ← X_{q+L+(i mod L)} + ω_q^i·B_i
  12      FFT([X_{q+L}, X_{q+L+1}, ..., X_{q+2L−1}])
  13      for 0 ≤ i < L do
  14          X_{q+i} ← X_{q+i}·X_{q+L+i}
  15      q ← q + L
  16  X_{r−1} ← A(ω_{r−1})·B(ω_{r−1})
  17  InplaceITFT([X_0, ..., X_{r−1}])
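Lines 6–8 of Algorithm 3 are the "folding" step: they accumulate the coefficients of A(ω_q·x) mod (x^L − 1) into the block X_q, ..., X_{q+L−1}. A minimal Python sketch of this step follows (an illustration only; wq stands for the root ω_q of Section 2.2, and A is given by its coefficient list).

    def fold(X, q, L, A, wq):
        """Write the coefficients of A(wq * x) mod (x**L - 1) into X[q : q + L],
        as in lines 6-8 of Algorithm 3. Only O(1) auxiliary storage is used:
        a single running power of wq is carried between iterations."""
        for j in range(L):
            X[q + j] = 0
        power = 1                         # wq**i
        for i, a in enumerate(A):         # i runs over 0, ..., deg A
            X[q + (i % L)] += power * a
            power *= wq

Applying the length-L in-place FFT to this block then yields the values Â_{q+i} for 0 ≤ i < L, as argued in the proof of Proposition 5.1 below.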
Proposition 5.1. Algorithm 3 correctly computes the product C = AB, in time O((m + n) log(m + n)) and using O(1) auxiliary space.

Proof. The main loop terminates since q is strictly increasing. Let N be the number of iterations, and let q_0 < q_1 < ··· < q_{N−1} and L_0 ≥ L_1 ≥ ··· ≥ L_{N−1} be the values of q and L on each iteration. By construction, the intervals [q_i, q_i + L_i) form a partition of [0, r − 1), and L_i is the largest power of two such that q_i + 2L_i ≤ r. Therefore each L_i can appear at most twice (i.e. if L_i = L_{i−1} then L_{i+1} < L_i), N ≤ 2 lg r, and we have L_i | q_i for each i.

At each iteration, lines 7–8 compute the coefficients of the polynomial A(ω_q·x) mod (x^L − 1), placing the result in [X_q, ..., X_{q+L−1}]. Line 9 then computes X_{q+i} = A(ω_q·ω_i) for 0 ≤ i < L. Since L | q we have ω_q·ω_i = ω_{q+i}, and so we have actually computed X_{q+i} = Â_{q+i} for 0 ≤ i < L. The next two lines similarly compute X_{q+L+i} = B̂_{q+i} for 0 ≤ i < L. (The point of the condition q + 2L ≤ r is to ensure that both of these transforms fit into the output buffer.) Lines 13–14 then compute X_{q+i} = Â_{q+i}·B̂_{q+i} = Ĉ_{q+i} for 0 ≤ i < L.

After line 16 we finally have X_s = Ĉ_s for all 0 ≤ s < r. (The last product was handled separately since the output buffer does not have room for the two Fourier coefficients.) Line 17 then recovers C_0, ..., C_{r−1}.

We now analyze the time and space complexity. The loops on lines 6, 7, 10 and 13 contribute O(r) operations per iteration, or O(r log r) in total, since N = O(log r). The FFT calls contribute O(L_i log L_i) per iteration, for a total of O(Σ_i L_i log L_i) = O((Σ_i L_i) log r) = O(r log r). Line 16 contributes O(r), and line 17 contributes O(r log r) by Proposition 4.2. The space requirements are also immediate by Proposition 4.2, since the main loop requires only O(1) space.
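The partition used in the proof can be made concrete with the following Python sketch (an illustration only), which lists the (q, L) pairs chosen by the main loop for a given output length r.

    def schedule(r):
        """Return the (q, L) pairs visited by the main loop of Algorithm 3 for
        output length r = n + m - 1. Note floor(lg(x)) == x.bit_length() - 1."""
        q, pairs = 0, []
        while q < r - 1:
            L = 1 << ((r - q).bit_length() - 2)   # L = 2**(floor(lg(r - q)) - 1)
            pairs.append((q, L))
            q += L
        return pairs

    # For example, schedule(9) == [(0, 4), (4, 2), (6, 1), (7, 1)]: the intervals
    # [q, q + L) partition [0, 8), each length appears at most twice, and the
    # last coefficient X_8 is handled separately on line 16.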
6. CONCLUSION

We have demonstrated that forward and inverse radix-2 truncated Fourier transforms can be computed in-place using O(n log n) time and O(1) auxiliary storage. As a result, polynomials with degrees less than n can be multiplied out-of-place within the same time and space bounds. These results apply to any size n, whenever the underlying ring admits division by 2 and a primitive root of unity of order 2^⌈lg n⌉.

Numerous questions remain open in this direction. First, our in-place TFT and ITFT algorithms avoid using auxiliary space at the cost of some extra arithmetic. So although the asymptotic complexity is still O(n log n), the implied constant will be greater than for the usual TFT or FFT algorithms. It would be interesting to know whether this extra cost is unavoidable. In any case, the implied constant would need to be reduced as much as possible for the in-place TFT/ITFT to compete with the running time of the original algorithms.

We also have not yet demonstrated an in-place multidimensional TFT or ITFT algorithm. In one dimension, the ordinary TFT can hope to gain at most a factor of two over the FFT, but a d-dimensional TFT can be faster than the corresponding FFT by a factor of 2^d, as demonstrated in [7]. An in-place variant along the lines of the algorithms presented in this paper could save a factor of 2^d in both time and memory, with practical consequences for multivariate polynomial arithmetic.

Finally, noticing that our multiplication algorithm, despite using only O(1) auxiliary storage, is still an out-of-place algorithm, we restate an open question of [9]: Is it possible, under any time restrictions, to perform multiplication in-place and using only O(1) auxiliary storage? The answer seems to be no, but a proof is as yet elusive.

7. REFERENCES

[1] David H. Bailey. FFTs in external or hierarchical memory. Journal of Supercomputing, 4:23–35, 1990.
[2] David G. Cantor and Erich Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Inform., 28(7):693–701, 1991.
[3] James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comp., 19:297–301, 1965.
[4] Martin Fürer. Faster integer multiplication. In STOC '07: Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, pages 57–66, New York, NY, USA, 2007. ACM Press.
[5] Joachim von zur Gathen and Jürgen Gerhard. Modern Computer Algebra. Cambridge University Press, Cambridge, second edition, 2003.
[6] David Harvey. A cache-friendly truncated FFT. Theoret. Comput. Sci., 410(27–29):2649–2658, 2009.
[7] Xin Li, Marc Moreno Maza, and Éric Schost. Fast arithmetic for triangular sets: from theory to practice. J. Symbolic Comput., 44(7):891–907, 2009.
[8] J. Markel. FFT pruning. IEEE Transactions on Audio and Electroacoustics, 19(4):305–311, Dec. 1971.
[9] Daniel S. Roche. Space- and time-efficient polynomial multiplication. In ISSAC '09: Proceedings of the 2009 International Symposium on Symbolic and Algebraic Computation, pages 295–302, New York, NY, USA, 2009. ACM.
[10] A. Schönhage and V. Strassen. Schnelle Multiplikation grosser Zahlen. Computing (Arch. Elektron. Rechnen), 7:281–292, 1971.
[11] H. V. Sorensen and C. S. Burrus. Efficient computation of the DFT with only a subset of input or output points. IEEE Transactions on Signal Processing, 41(3):1184–1200, Mar. 1993.
[12] Joris van der Hoeven. The truncated Fourier transform and applications. In ISSAC 2004, pages 290–296. ACM, New York, 2004.
[13] Joris van der Hoeven. Notes on the truncated Fourier transform. Unpublished, available from http://www.math.u-psud.fr/~vdhoeven/, 2005.