An Agglomeration Law for Sorting Networks and its Application in Functional Programming

Lukas Immanuel Schiller
Philipps-Universität Marburg
Fachbereich Mathematik und Informatik

In this paper we will present a general agglomeration law for sorting networks. Agglomeration is a common technique when designing parallel programmes to control the granularity of the computation, thereby finding a better fit between the algorithm and the machine on which the algorithm runs. Usually this is done by grouping smaller tasks and computing them en bloc within one parallel process. In the case of sorting networks this could be done by computing bigger parts of the network with one process. The agglomeration law in this paper pursues a different strategy: the input data is grouped and the algorithm is generalised to work on the agglomerated input while the original structure of the algorithm remains. This will result in a new access opportunity to sorting networks well-suited for efficient parallelization on modern multicore computers, computer networks or GPGPU programming. Additionally this enables us to use sorting networks as (parallel or distributed) merging stages for arbitrary sorting algorithms, thereby creating new hybrid sorting algorithms with ease. The expressiveness of functional programming languages helps us to apply this law to systematically constructed sorting networks, leading to efficient and easily adaptable sorting algorithms. An application example is given, using the Eden programming language to show the effectiveness of the law. The implementation is compared with different parallel sorting algorithms by runtime behaviour.

1 Introduction

With the increased presence of parallel hardware the demand for parallel algorithms increases accordingly. Naturally this demand includes sorting algorithms as one of the most interesting tasks of computer science. A particularly interesting class of sorting algorithms for parallelization is the class of oblivious algorithms. We will call a parallel algorithm oblivious “iff its communication structure and its communication scheme are the same for all inputs the same size” [15].
Sorting networks are the most important representative of the class of oblivious algorithms. They have been an interesting field of research since their introduction by Batcher [1] in 1968 and are experiencing a renaissance in GPGPU programming [18]. They are based on comparison elements, mapping their inputs $(a_1, a_2) \mapsto (a'_1, a'_2)$ with $a'_1 = \min(a_1, a_2)$ and $a'_2 = \max(a_1, a_2)$ and therefore $a'_1 \le a'_2$. A simple graphical representation is shown in Figure 1. The arrowhead in the box indicates where the minimum is output.

Figure 1: Comparison element (ascending), with inputs $a_1$, $a_2$ and outputs $a'_1 = \min(a_1, a_2)$, $a'_2 = \max(a_1, a_2)$.

S. Schwarz and J. Voigtländer (Eds.): 29th and 30th Workshops on (Constraint) Logic Programming and 24th International Workshop on Functional and (Constraint) Logic Programming (WLP'15/'16/WFLP'16). EPTCS 234, 2017, pp. 165-179, doi:10.4204/EPTCS.234.12. © Lukas Schiller. This work is licensed under the Creative Commons Attribution-Noncommercial-Share Alike License.

A simple functional description of sorting networks results in a repeated application of this comparison element function with fixed indices for every step. For a sequence $(a_1, \dots, a_n)$ of length $n$ the specific steps are fixed:

$$(a_1, \dots, a_n) \mapsto \dots \mapsto (\tilde{a}_1, \dots, a_i, \dots, a_j, \dots, \tilde{a}_n) \mapsto (\tilde{a}_1, \dots, a'_i, \dots, a'_j, \dots, \tilde{a}_n) \mapsto \dots \mapsto (a'_1, \dots, a'_n)$$

with $i \ne j$. In a specific step $a_i$ and $a_j$ are sorted with a comparison element, ultimately resulting in the sorted sequence $(a'_1, \dots, a'_n)$.

Figure 2 shows a simple sorting network for lists of length 4. For every permutation of the input $(a_1, \dots, a_4)$ the output $(a'_1, \dots, a'_4)$ is sorted: the comparisons are independent of the data. Notice the obvious inherent parallelism in the first two steps of the sorting network. The restriction to a fixed structure of comparisons results in an easily predictable behaviour and easily detectable parallelism.

Figure 2: Simple sorting network with comparison elements.
Source: [11]

Some well-known sorting algorithms, for example Bubble Sort [11], can be described as sorting networks. Especially in the case of recursively constructed sorting networks (e.g. Batcher's Bitonic Sort or Batcher's Odd-Even-Mergesort), with their inherent functional structure, an obviously correct description of the algorithm is easily possible in a functional programming language such as Haskell [16].

In practice straightforward implementations of these algorithms often struggle with too fine a granularity of computation and therefore do not scale well. Agglomerating parts of the algorithm is a common step in dealing with this problem when designing parallel programmes (compare Foster's PCAM method [7]). With recursive algorithms for example it is a common technique to agglomerate branches of the recursive tree by parallelising only until a specific depth of recursion. With a coarser granularity the computation to communication ratio improves. A common agglomeration for sorting networks is to place blocks or rows of comparison elements in one parallel process.

In this paper we discuss a different approach. We will agglomerate the input data and alter the comparison element to work on blocks of data. This approach is not based upon the structure of a specific sorting network and can therefore be applied to any sorting network. At the same time we will see that the limited nature of sorting networks is necessary for this modification to be correct. The application of this transformation will open a different access to sorting networks, allowing easy combination with other sorting algorithms. Working on data structures instead of single elements leads to a suitable implementation for modern multi-core computers, GPGPU concepts or computer networks. We will obtain an adequate granularity of computation and the width of the sorting network can correspond with the number of processor units. A second layer of traditional agglomeration (e.g. blocks or rows of comparison elements) is independently possible.
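As a point of reference for the elementary sorting networks described in this introduction, the following is a minimal sketch in plain Haskell of a network given as a fixed, data-independent list of index pairs. The names compPair, Network, network4, step and runNetwork are hypothetical and introduced only for this illustration. The five-comparator network used here is one standard network for four inputs and is an assumption; it is not read off Figure 2.

```haskell
import Data.List (foldl', permutations, sort)

-- | Elementary comparison element on pairs: the minimum is output first.
compPair :: Ord a => (a, a) -> (a, a)
compPair (x, y) = if x <= y then (x, y) else (y, x)

-- | A sorting network as a fixed list of index pairs (i, j) with i < j.
-- The list does not depend on the data, only on the input size.
type Network = [(Int, Int)]

-- | One standard five-comparator network for four inputs (0-based indices).
-- Assumption: chosen for illustration only.
network4 :: Network
network4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

-- | Apply the comparison element c to positions i and j of the list.
step :: ((a, a) -> (a, a)) -> [a] -> (Int, Int) -> [a]
step c xs (i, j) = [pick k x | (k, x) <- zip [0 ..] xs]
  where
    (lo, hi) = c (xs !! i, xs !! j)
    pick k x
      | k == i    = lo
      | k == j    = hi
      | otherwise = x

-- | Run all steps of the network in their fixed order.
runNetwork :: ((a, a) -> (a, a)) -> Network -> [a] -> [a]
runNetwork c net xs = foldl' (step c) xs net

main :: IO ()
main = do
  print (runNetwork compPair network4 [3, 1, 4, 2 :: Int])  -- [1,2,3,4]
  -- The network sorts every permutation of four distinct inputs.
  print (all (\p -> runNetwork compPair network4 p == sort p)
             (permutations [1 .. 4 :: Int]))                -- True
```

Note that runNetwork only ever touches the data through the comparison element it is given, so the element type is left completely open. This is the property the agglomeration approach relies on when the comparison element is later exchanged for one that works on blocks.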
In Section 2 we will discuss which demands are necessary for altered comparison elements to preserve an algorithm's functionality and correctness. In Section 3 an example is given showing situations in which the application of this agglomeration is beneficial, and tests with different approaches are evaluated. Section 4 discusses related work and Section 5 concludes.

2 Agglomeration Law for Sorting Networks

In general, sorting networks work on sequences of elements $A = (a_1, \dots, a_n)$. Our improvement will work with a partition of a given sequence. In the following, we will use Haskell notation and lists instead of sequences to improve readability, even though a more general type would be possible.

Theorem 1 (Agglomeration Law for Sorting Networks). Let $A = [a_1, \dots, a_n]$ be a sequence where a total order "$\le$" is defined on the elements, c :: Ord a => (a,a) -> (a,a) a comparison element as described before and

    sN :: ((b,b) -> (b,b)) -> [b] -> [b]

a correct sorting network, meaning $sN\ c\ A = A'$ with $A' = [a'_1, \dots, a'_n]$ and $a'_1 \le \dots \le a'_n$, where $a'_1, \dots, a'_n$ is a permutation of $a_1, \dots, a_n$ and the only operation used by the sorting network is a repeated application of the comparison element with a fixed, data independent structure for a given input size. Then there exists a comparison element c' :: Ord a => ([a],[a]) -> ([a],[a]) with which a sequence of sequences $A = [A_1, \dots, A_n]$ with $A_i = [a_{i1}, \dots, a_{in_i}]$ can be sorted with the same sorting network. This means that $sN\ c'\ A = A'$ with $A' = [A'_1, \dots, A'_n]$ and $A'_1 \preceq \dots \preceq A'_n$, where concat $A'$ is a permutation of concat $A$ and $A \preceq B$ means that for two sequences $A = [a_1, \dots, a_p]$ and $B = [b_1, \dots, b_q]$ every element of $A$ is less than or equal to every element of $B$:

$$A \preceq B \Leftrightarrow \forall a \in A, \forall b \in B : a \le b$$

With blocks of data the concatenation of the elements of $A'$ needs to be a permutation of the concatenation of $A$; the elements themselves ($A'_1, \dots, A'_n$) do not need to be a permutation of $A_1, \dots, A_n$.
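The block order relation and the postcondition of Theorem 1 can be written down directly. The following sketch is only an illustration; the names precedes, blocksOrdered and satisfiesLaw are hypothetical, and the treatment of empty blocks (they are ordered with respect to every block) is an assumption that follows from reading the quantifiers in the definition literally.

```haskell
import Data.List (sort)

-- | precedes as bs holds iff every element of as is <= every element of bs.
-- Assumption: an empty block is ordered with respect to every other block.
precedes :: Ord a => [a] -> [a] -> Bool
precedes as bs = null as || null bs || maximum as <= minimum bs

-- | All non-empty blocks are ordered from left to right.
-- For non-empty blocks it suffices to compare neighbours.
blocksOrdered :: Ord a => [[a]] -> Bool
blocksOrdered bss = and (zipWith precedes ne (drop 1 ne))
  where ne = filter (not . null) bss

-- | Postcondition of Theorem 1: the output blocks are ordered and their
-- concatenation is a permutation of the concatenation of the input blocks.
satisfiesLaw :: Ord a => [[a]] -> [[a]] -> Bool
satisfiesLaw inp out =
  blocksOrdered out && sort (concat out) == sort (concat inp)

main :: IO ()
main = do
  print (precedes [1, 2, 3, 3] [4, 4, 5, 6 :: Int])  -- True
  -- Overlapping blocks are not comparable in either direction.
  print (precedes [1, 2, 3, 4] [3, 4, 5, 6 :: Int])  -- False
  print (precedes [3, 4, 5, 6] [1, 2, 3, 4 :: Int])  -- False
  print (satisfiesLaw [[1, 2, 3, 4], [3, 4, 5, 6]]
                      [[1, 2, 3, 3], [4, 4, 5, 6 :: Int]])  -- True
```

The second and third results show that the relation on blocks is only partial: two overlapping blocks are ordered in neither direction, which is why a plain swap is not enough for a block comparison element.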
Note that the order relation for blocks of data "$\preceq$" defines only a partial order whereas the elements inside the blocks are totally ordered. We therefore need to specialise the comparison element to deal with the case of overlapping or encasing blocks and still fulfill all properties necessary for the sorting network to work correctly (cf. Figure 3).

Figure 3: Cases for comparison elements for blocks of data: blocks can be ordered (with order relation $\preceq$), overlapping or encasing, where overlapping and encasing means that the blocks are not in an order relation between one another (meaning neither $\preceq$ nor $\succeq$ holds).
(a) ordered blocks: $\max A_j \le \min A_i$ or $\max A_i \le \min A_j$
(b) overlapping blocks: $\max A_j \le \max A_i$ but $\max A_j \nleq \min A_i$ and $\min A_j \le \min A_i$, or vice versa
(c) encasing blocks: $\max A_j \le \max A_i$ and $\min A_i \le \min A_j$, or vice versa

If for example the input lists overlap (e.g. $c'([1,2,3,4],[3,4,5,6])$) a simple swap would not fulfill the requirements. We would prefer a result such as $([1,2,3,3],[4,4,5,6])$ and therefore $A'$ can not be a permutation of $A$, but we expect that every element $a_{ij}$ from $A_1, \dots, A_n$ is in $A'_1, \dots, A'_n$. In the next step we will investigate which conditions a comparison element for blocks of data must fulfill.

2.1 Comparison element for partially ordered blocks of totally ordered elements

If we want to alter the comparison element while preserving the functionality and correctness of the sorting network we must understand which information is generated and preserved within a traditional comparison element.
We will therefore investigate the capabilities and limits of comparison elements for totally ordered sequences: Let $a_1, a_2, a_1^1, a_1^2, a_2^1, a_2^2$ be elements where information about the following relations has already been gathered by the sorting network:

$$a_1^1 \le a_1 \le a_1^2 \quad \text{and} \quad a_2^1 \le a_2 \le a_2^2 \tag{0}$$

If we do sort $a_1$ and $a_2$ with a comparison element $(a_1, a_2) \mapsto (a'_1, a'_2)$ we receive new relations (e.g. $a_1^1 \le a_1 \Rightarrow a_1^1 \le a'_2$). We will distinguish between direct relations and conditional relations. In this case direct relations refer to all relations resulting directly from the relations which exist and are known before the application of the comparison element and which involve $a_1$, $a_2$, $a'_1$ or $a'_2$. We expect the comparison element to be side-effect free and therefore we expect every relation between elements not touched by the comparison element to be unaffected by its application. Here the direct relations resulting from (0) are:

$$a'_1 \le a'_2 \tag{1}$$
$$a'_1 \le a_i^2, \quad i \in \{1,2\} \tag{2}$$
$$a_i^1 \le a'_2, \quad i \in \{1,2\} \tag{3}$$

If we have additional information, we get additional direct relations. For $\{i,j\} = \{1,2\}$:

$$a_i^1 \le a_j \Rightarrow a_i^1 \le a'_1 \tag{4}$$
$$a_j \le a_i^2 \Rightarrow a'_2 \le a_i^2 \tag{5}$$
$$a_i \le a_j \Rightarrow a_i^1 \le a'_1 \wedge a'_2 \le a_j^2 \tag{6}$$

We call these relations direct relations only if the left side is already known.

Definition 1 (Valid comparison elements for blocks of data). Let $A_1, A_2$ be sequences with a total order "$\le$" defined on the elements and c' :: Ord a => ([a],[a]) -> ([a],[a]) a block comparison element with $c'(A_1, A_2) = (A'_1, A'_2)$ and $A'_1 \preceq A'_2$.
$c'$ is called valid, iff all elements from $A_1$ and $A_2$ which are less than $lb = \max(\min(A_1), \min(A_2))$ are in $A'_1$, all elements which are greater than $ub = \min(\max(A_1), \max(A_2))$ are in $A'_2$, and all elements between these limits can be either in $A'_1$ or in $A'_2$ as long as every element in $A'_1$ is smaller than or equal to every element in $A'_2$ (cf. Figure 4).

The figure marks three sections: u (above ub), m (between lb and ub) and l (below lb).

Figure 4: Sections of the comparison element for blocks of data.
Elements from l must be in the lesser result ($A'_1$), elements from u must be in the greater result ($A'_2$) and elements from m can be in either result as long as $A'_1 \preceq A'_2$ holds.

Lemma 1. Valid block comparison elements maintain the direct relations fulfilled by the elementary comparison elements.

Proof. We show the validity of the above relations (1) to (6) for blocks of data:
1. $A'_1 \preceq A'_2$ is included in the definition.
2. $\max A'_1 \le ub \le \max A_i \le \min A_i^2 \Rightarrow A'_1 \preceq A_i^2, \quad i \in \{1,2\}$
3. $\max A_i^1 \le \min A_i \le lb \le \min A'_2 \Rightarrow A_i^1 \preceq A'_2, \quad i \in \{1,2\}$
4. $A_i^1 \preceq A_1 \wedge A_i^1 \preceq A_2 \Rightarrow A_i^1 \preceq [\min(\min A_1, \min A_2)] \preceq A'_1 \Rightarrow A_i^1 \preceq A'_1, \quad i \in \{1,2\}$
5. $A_1 \preceq A_i^2 \wedge A_2 \preceq A_i^2 \Rightarrow A'_2 \preceq [\max(\max A_1, \max A_2)] \preceq A_i^2 \Rightarrow A'_2 \preceq A_i^2, \quad i \in \{1,2\}$
6. $A_i^1 \preceq A_i \preceq A_j \Rightarrow A_i^1 \preceq A_i \wedge A_i^1 \preceq A_j \Rightarrow \max A_i^1 \le \min(\min A_i, \min A_j) \Rightarrow A_i^1 \preceq A'_1$
   $A_i \preceq A_j \preceq A_j^2 \Rightarrow A_i \preceq A_j^2 \wedge A_j \preceq A_j^2 \Rightarrow \max(\max A_i, \max A_j) \le \min A_j^2 \Rightarrow A'_2 \preceq A_j^2$

The proof shows that these limits are not only sufficient but necessary to guarantee the direct relations on which sorting networks are essentially based. Counterexamples where a different limit selection leads to the failure of the sorting network can easily be found.

All other producible information concerns conditional relations which depend on a yet unknown condition, resulting in a disjunction or a conditional with unknown antecedent. For example

$$(a_1 \le a_2 \vee a_2 \le a_1) \wedge a_1^1 \le a_1 \le a_1^2 \Rightarrow a_1^1 \le a'_1 \vee a'_2 \le a_1^2$$

For ordered or overlapping blocks we can easily verify that all these relations can be preserved, as every input element has a direct descendant, analogous to the original comparison element. In this case a direct descendant $A'$ of a block $A$ is bounded by the extrema of the parental block, meaning that $\min A \le \min A'$ and $\max A' \le \max A$. $A'$ can but need not contain elements from $A$ as well as elements which are not in $A$ due to the fact that the property is defined through boundaries not elements.
Therefore, when applying the comparison element, the boundaries of each block can at the most approach each other, leaving all relations preserved. An example is given in Figure 5.

Figure 5: Split of overlapping blocks. In this case the minimal (maximal) element of $A_2$ is smaller than the minimal (maximal) element of $A_1$. Thereby $A_2$ "shrinks" from above, meaning that the maximum element of $A'_1$ is smaller than $\max A_2$. This does not yet disclose any information about the number of elements in $A'_1$. $A_1$ "shrinks" from below. We can see $A'_1$ as the descendant of $A_2$ and $A'_2$ as the descendant of $A_1$. All relations are preserved.

With encased blocks (cf. Figure 3c) it is not necessarily possible to find a descendant for every element. If, for example, we have $A_1^1 \preceq A_1 \preceq A_1^2$ and $A_2^1 \preceq A_2 \preceq A_2^2$ there might be no output element $A'_i$ with $A_1^1 \preceq A'_i \preceq A_1^2$ (cf. Figure 6).

Figure 6: Split of encased blocks. There are no direct descendants. $A'_1 \preceq A_1^2$ and $A_1^1 \preceq A'_2$ but neither $A'_1$ nor $A'_2$ is between $A_1^1$ and $A_1^2$.

Unfortunately, a consequence of this is that this technique of merging and splitting blocks can not necessarily be transferred to a more general sorting algorithm. In particular this does not work with pivot based sorting algorithms. However it does with sorting networks because the comparison element does not compare one fixed element with another element but rather returns two sorted elements for which we do not know which input element is mapped to which output element. The information $A_1^1 \prec A_1$ is reduced to $A_1^1 \preceq A'_2$ plus some conditional information. Some of this conditional information can no longer be guaranteed to hold but can not be used in a sorting network at all because of the limited operations of sorting networks.
The relations of concern are

$$a_1 \le a_2 \vee a_2 \le a_1 \Rightarrow a_i^1 \le a'_1 \vee a'_2 \le a_i^2, \quad i \in \{1,2\} \tag{7}$$

Sorting networks as described above can not produce the additional information needed for this conditional information to become useful.

Lemma 2. Information about the conditional relations (7) that can not be preserved by the altered comparison element $c'$ can not be used by a sorting network.

Sketch of Proof. The condition of a conditional relation is unknown by definition; otherwise it would be a direct relation. Therefore the implication can not be used to gather additional information. The remaining disjunction can result in useful information in a non-trivial way only if one side of the disjunction is known to be false (modus tollendo ponens) or if both sides of the disjunction are equal. It is not possible to equalise an output element of the comparison element with another element and therefore it is not possible to test whether $a_i \le a_j$ or not. In particular the information that $a_i \nleq a_j$ can not be produced for any $i$ and $j$. We can not test whether one side of such a disjunction is false or if both sides are equal and therefore can not use conditional relations. Non-trivial, productive information from these disjunctions can only be used in non-oblivious algorithms.

Lemma 1 and Lemma 2 imply that a comparison element $c'$ as demanded in Theorem 1 exists with the given limitations from Lemma 1. Therefore all usable information is preserved and this technique of merging and splitting two blocks in a comparison element can be used with every sorting network.

If the elements inside the blocks are sorted, we can define a linear time comparison element that splits the two blocks into blocks as equal in size as possible. An implementation of such a comparison element can be found in Section 3, Listing 5. Balancing the blocks is advantageous in many cases because it limits the maximal block size to the size of the largest block in the initial sequence.
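To make the idea of a balanced, linear time block comparison element concrete, the following is a minimal sketch in plain Haskell. It is not the paper's Listing 5; the names merge and blockComp are hypothetical. It assumes that both input blocks are sorted in ascending order, and it is one possible reading of Definition 1: the cut position is moved away from the middle only when the limits lb and ub force it. The handling of an empty block (no limits apply) is an additional assumption.

```haskell
-- | Merge two ascending lists in linear time.
merge :: Ord a => [a] -> [a] -> [a]
merge [] ys = ys
merge xs [] = xs
merge (x : xs) (y : ys)
  | x <= y    = x : merge xs (y : ys)
  | otherwise = y : merge (x : xs) ys

-- | Block comparison element for two ascending blocks (ascending variant).
-- The merged list is cut as close to the middle as Definition 1 allows:
-- everything below lb stays in the first result, everything above ub
-- stays in the second result.
blockComp :: Ord a => ([a], [a]) -> ([a], [a])
blockComp (as, bs) = splitAt cut merged
  where
    merged = merge as bs
    total  = length merged
    half   = (total + 1) `div` 2
    (mustLow, mustHigh) = case (as, bs) of
      (a0 : _, b0 : _) ->
        let lb = max a0 b0               -- max of the two minima
            ub = min (last as) (last bs) -- min of the two maxima
        in ( length (takeWhile (< lb) merged)    -- elements forced low
           , length (dropWhile (<= ub) merged) ) -- elements forced high
      _ -> (0, 0)  -- assumption: no limits if one block is empty
    -- Clamp the balanced cut into the admissible range.
    cut = max mustLow (min half (total - mustHigh))

main :: IO ()
main = do
  print (blockComp ([1, 2, 3, 4], [3, 4, 5, 6 :: Int]))
  -- ([1,2,3,3],[4,4,5,6])
  print (blockComp ([1, 2, 3, 4, 5, 6], [10 :: Int]))
  -- ([1,2,3,4,5,6],[10])
```

The second call shows why the clamp is there: with blocks of different sizes a cut exactly in the middle would move 5 and 6, which lie below lb = 10, into the greater result. With blocks of equal size the clamp never triggers and the cut is always in the middle.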
Such a bound on the block size is beneficial especially in the situation of limited memory for different parts of the parallelised algorithm, for example if the parallelization is done with a computer cluster. By preserving the inner sorting of the blocks, the resulting sequence of the sorting network can be easily combined to a completely sorted sequence by concatenation. Every suitable sorting algorithm can be used for the initial sorting inside the blocks. Consequently the sorting network can be used as a skeleton to parallelise arbitrary sorting algorithms and work as the merging stage of the newly combined (parallel) algorithm, a concept that will prove its worth in the following example.

3 Application of the Agglomeration Law on the Bitonic Sorter

We will now apply the agglomeration law to Batcher's Bitonic Sorting Network. It is a recursively constructed sorting network that works in two steps. In the first step an unsorted sequence (of length $2^l$ with $l \in \mathbb{N}$) is transformed into a bitonic sequence. A bitonic sequence is the juxtaposition of an ascending and a descending monotonic sequence or the cyclic rotation of the first case (Figure 7).

Figure 7: Examples of bitonic sequences.

The bitonic sequence is thereafter sorted by a Bitonic Merger. We will call the function implementing this Bitonic Merger bMerge and the function transforming an unsorted sequence into a bitonic sequence prodBList. The Bitonic Sorter works with the nested divide-and-conquer scheme of the sorting-by-merging idea. This means that the repeated generation of shorter sorted lists is done by Bitonic Sorters of smaller size. A Bitonic Sorter for eight input elements is depicted in Figure 8.

Figure 8: Bitonic Sorter of order 8. The function prodBList is represented by a red dashed rectangle, the function bMerge by a blue dotted one. Bitonic sequences are represented by shaded rectangles.
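The definition of a bitonic sequence used in this section can be checked mechanically. The sketch below uses one common characterisation, which is an assumption of this illustration rather than part of the paper: read cyclically, the direction of a bitonic sequence changes at most twice. The name isBitonic is hypothetical, and the example lists are chosen for illustration; they are not taken from Figure 7.

```haskell
-- | A list is bitonic iff, read cyclically and ignoring equal neighbours,
-- its direction (ascending or descending) changes at most twice.
isBitonic :: Ord a => [a] -> Bool
isBitonic xs = changes <= 2
  where
    -- Each element paired with its cyclic successor.
    cyclicNext = drop 1 xs ++ take 1 xs
    -- Directions between neighbours; equal neighbours carry no direction.
    dirs = filter (/= EQ) (zipWith compare xs cyclicNext)
    -- Number of positions where the direction flips, again cyclically.
    changes =
      length (filter id (zipWith (/=) dirs (drop 1 dirs ++ take 1 dirs)))

main :: IO ()
main = do
  print (isBitonic [1, 3, 5, 7, 8, 6, 4, 2 :: Int])  -- True: up, then down
  print (isBitonic [5, 7, 8, 6, 4, 2, 1, 3 :: Int])  -- True: a rotation
  print (isBitonic [1, 3, 2, 4 :: Int])              -- False
```

Under this characterisation a fully ascending or fully descending list also counts as bitonic, with one of the two monotonic parts being empty.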
The basic component of the sorting network, the original comparison element, can be defined as:

Listing 1: Original comparison element

    data Direction = Up | Down deriving Show

    compElem :: Ord a => Direction -> [a] -> [a]
    compElem Up   [x,y] = if x <= y then [x,y] else [y,x]
    compElem Down xs    = reverse $ compElem Up xs

We will use a two-element-list variant instead of pairs for reasons of code elegance. We define the actual algorithm using the Eden [14, 13] programming language which extends Haskell by the concept of parallel processes with an implicit communication as well as a Remote Data [5] concept. We can instantiate a process that is defined by a given function with ($#):

    ($#) :: (Trans b, Trans a)
         => (a -> b)  -- Process function
         -> a -> b    -- Process input and output

The class Trans consists of transmissible values. The expression f $# expr with some function f :: a -> b will create a (remote) child process. The expression expr will be evaluated (concurrently by a new thread) in the parent process and the result val will be sent to the child process. The child process will evaluate f $ val (cf. Figure 9).

Figure 9: The scheme for process instantiation: the parent process evaluates expr to val and creates the child process, which receives val and returns the result of (f $ val). Source: [13]

Hereafter we will essentially use Eden's parMapAt, a parallel variant of map with explicit placement of processes on processor elements (PEs), also called (logical) machines, which are numbered from 1 to the number of processor elements.

    parMapAt :: (Trans a, Trans b)
             => [Int]       -- ^ places for instantiation
             -> (a -> b)    -- ^ worker function
             -> [a] -> [b]  -- ^ task list and ^ result list

The explicit placement is determined by the first argument, a list of PE numbers specifying the places where the processes will be deployed.
Additionally we will use the constants noPe and selfPe provided by Eden to calculate the correct placements:

    noPe   :: Int  -- Number of (logical) machines in the system
    selfPe :: Int  -- Local machine number (ranges from 1 to noPe)

For our implementation we will place each comparison element of the same row on the same PE. In Listing 2 a parallel definition of the algorithm is given.

Listing 2: Parallel bSort

    36 bSort :: Trans a
    37   => (Direction -> [a] -> [a])  -- ^ specialized comparison element
    38   -> Direction                  -- ^ sorting direction
    39   -> [a] -> [a]                 -- ^ input and ^ output
    40 bSort _ _ [ ] = [ ]
    41 bSort _ _ [x] = [x]
    42 bSort sCompElem d xss = (bMerge sCompElem d) . prodBList $ xss where
    43   prodBList = unSplit . pMap bSort' . zip [Up, Down] . splitHalf
    44   bSort' = uncurry (bSort sCompElem)
    45   pMap = parMapAt [selfPe, selfPe+hcc]
    46   hcc = (length xss) `div` 4 {- half comparator count -}

The bSort function takes three arguments: an oriented comparison element, a Direction denoting whether the result should be sorted in an ascending or descending order and an input list. The main part of the algorithm is a composition of the prodBList and the bMerge function (cf. Line 42 in Listing 2). The prodBList function splits the input list and sorts both parts with the Bitonic Sorter, one half ascending and one half descending (cf. Line 43). It uses two helper functions splitHalf and unSplit. With the help of Eden's splitIntoN, which splits the input list blockwise into as many parts as the first parameter determines, we define:

    splitHalf :: [a] -> [[a]]
    splitHalf = splitIntoN 2

Both resulting lists are of the same size because the width of the Bitonic Sorter and therefore its input list's length are powers of two (not to be confused with the size of the blocks which can be of arbitrary size). The needed inverse function, unSplit, can be defined as:

    unSplit :: [[a]] -> [a]
    unSplit = concat

The correct placement by line is calculated depending on the width of the sorting network (cf. Line 46).
Two elements are needed for every comparison element, therefore hcc (the half comparator count, a quarter of the input length) is half the number of comparison elements in one row of the sorting network in the current recursion step. The bMerge function has the same type signature as the bSort function but the input list must be a bitonic list for the function to work correctly:

Listing 3: Parallel bMerge

    49 bMerge :: Trans a
    50   => (Direction -> [a] -> [a])  -- ^ specialized comparison element
    51   -> Direction                  -- ^ sorting direction
    52   -> [a] -> [a]                 -- ^ input and ^ output
    53 bMerge sCompElem d xss@[x,y] = sCompElem d xss
    54 bMerge sCompElem d xss = unSplit . pMap (bMerge sCompElem d) . bSplit $ xss where
    55   bSplit = splitHalf . shuffle . pMap' (sCompElem d) . perfectShuffle
    56   pMap = parMapAt [selfPe, selfPe+hcc]
    57   hcc = (length xss) `div` 4 {- half comparator count -}
    58   pMap' = parMapAt [selfPe..]

The main part of the bMerge function is the function bSplit which splits a bitonic sequence into two bitonic sequences with an order between each other. This function uses a communication structure referred to as a perfect shuffle (footnote 1) by Stone [21]. With this communication scheme the elements $i$ and $i + \frac{p}{2}$ are compared, where $p$ is the length of the sequence, resulting in a split depicted in Figure 10.

Footnote 1: This structure can be found in various algorithms, e.g. in the Fast Fourier transform or in matrix transpositions.

Figure 10: Concept of splitting a bitonic sequence.
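The effect of comparing the elements $i$ and $i + \frac{p}{2}$ can be illustrated sequentially on plain lists. The following sketch is a simplification and not the Eden code of Listing 3: it works on single elements with min and max instead of a parametrised comparison element, it only sorts in ascending order, and it involves no process placement. The names bitonicSplit and bitonicMerge are hypothetical.

```haskell
-- | Split a bitonic list of even length p by comparing the elements at
-- positions i and i + p/2. The first result collects the minima, the
-- second the maxima. For bitonic input both results are bitonic again
-- and every element of the first is <= every element of the second.
bitonicSplit :: Ord a => [a] -> ([a], [a])
bitonicSplit xs = (zipWith min lo hi, zipWith max lo hi)
  where
    (lo, hi) = splitAt (length xs `div` 2) xs

-- | Sequential bitonic merger (ascending only): split, then merge both
-- halves recursively. Assumes a bitonic input whose length is a power of two.
bitonicMerge :: Ord a => [a] -> [a]
bitonicMerge xs
  | length xs < 2 = xs
  | otherwise     = bitonicMerge small ++ bitonicMerge large
  where
    (small, large) = bitonicSplit xs

main :: IO ()
main = do
  print (bitonicSplit [1, 3, 5, 7, 8, 6, 4, 2 :: Int])
  -- ([1,3,4,2],[8,6,5,7])
  print (bitonicMerge [1, 3, 5, 7, 8, 6, 4, 2 :: Int])
  -- [1,2,3,4,5,6,7,8]
```

In the example the largest element of the first half (4) is not greater than the smallest element of the second half (5), so the two halves can be merged independently, which is what makes the recursive calls in bMerge suitable for parallel evaluation.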
In Eden this perfect shuffle is easily defined with the help of the offered auxiliary functions:

    -- Round robin distribution and inverse function called shuffle
    unshuffle :: Int -> [a] -> [[a]]
    shuffle   :: [[a]] -> [a]

The first parameter of unshuffle specifies the number of sublists in which the list is split, e.g.:

    unshuffle 3 [1..10] = [[1,4,7,10],[2,5,8],[3,6,9]]
    shuffle [[1,4,7,10],[2,5,8],[3,6,9]] = [1..10]

The perfect shuffle is then defined as:

    perfectShuffle :: [a] -> [[a]]
    perfectShuffle xs = unshuffle halfSize xs where
      halfSize = (length xs) `div` 2

A direct communication between consecutive comparison elements can be realised with Eden's Remote Data concept in which a smaller handle is transmitted instead of the actual data. The data itself is fetched directly when needed from the PE where the handle was created. The more intermediate steps are involved, the more effective the benefits of this concept become. This can be done by the operations:

    type RD a  -- remote data
    -- converts local data into corresponding remote data and vice versa
    release    :: Trans a => a -> RD a
    fetch      :: Trans a => RD a -> a
    releaseAll :: Trans a => [a] -> [RD a]  -- list variants
    fetchAll   :: Trans a => [RD a] -> [a]

In Figure 11 the communication scheme of a Remote Data connection is pictured.

Figure 11: Remote Data scheme. Source: [13]. The processes computing the results of f and g are placed on two different PEs. Without RD, the result of f is transferred via the parental process. With RD a handle is generated on PE1 and transferred via PE0 to PE2, the actual result is transferred directly.
(a) Indirect connection: (g $# (f $# inp))
(b) Direct connection: (g . fetch) $# ((release . f) $# inp)