Astronomy&Astrophysicsmanuscriptno.aa18497-11 (cid:13)c ESO2012 January11,2012 A morphological algorithm for improving radio-frequency interference detection A.R.Offringa1,J.J.vandeGronde2,andJ.B.T.M.Roerdink2 1 UniversityofGroningen,KapteynAstronomicalInstitute,POBox800,9700AVGroningen,TheNetherlands. e-mail:[email protected] 2 UniversityofGroningen,JohannBernoulliInstituteforMathematicsandComputerScience,P.O.Box407,9700AKGroningen, TheNetherlands. e-mail:[email protected]@rug.nl Received22November2011/Accepted11January2012 2 ABSTRACT 1 0 Atechniqueisdescribedthatisusedtoimprovethedetectionofradio-frequencyinterferenceinastronomicalradioobservatories.It 2 isappliedonatwo-dimensionalinterferencemaskafterregulardetectioninthetime-frequencydomainwithexistingtechniques.The n scale-invariantrank(SIR)operatorisdefined,whichisaone-dimensionalmathematicalmorphologytechniquethatcanbeusedto a findadjacentintervalsinthetimeorfrequencydomainthatarelikelytobeaffectedbyRFI.Thetechniquemightalsobeapplicablein J otherareasinwhichmorphologicalscale-invariantbehaviourisdesired,suchassourcedetection.Anewalgorithmisdescribed,that 8 isshowntoperformquitewell,haslineartimecomplexityandisfastenoughtobeappliedinmodernhighresolutionobservatories. 1 ItisusedinthedefaultpipelineoftheLOFARobservatory. Keywords.Instrumentation:interferometers-Methods:dataanalysis-Techniques:interferometric ] M I . 1. Introduction pipeline.ExamplesofthesearetheRFImitigationpipelineused h for the Effelsberg Bonn HI Survey (Flo¨er et al. 2010) and the p Modern telescopes in radio astronomy, such as the Expanded AOFlaggerpipeline(Offringaetal.2010a). - o Very Large Array (EVLA) and the Low-Frequency Array r (LOFAR), are sensitive devices that observe the sky with enor- st mous depth and detail. The observed bandwidth of telescopes 1.1. RFIdetection a hasdramaticallyincreasedoverthelastdecades,andoftenover- [ lapswithpartsoftheradiospectrumthathavenotbeenreserved RFI detection often involves thresholding based on amplitude 2 for radio astronomy. Simultaneously, the radio spectrum is be- (Winkeletal.2006;Offringaetal.2010b),althoughalsohigher v coming more crowded because of technological advancement. order statistics such as the kurtosis have been used (Gary et al. 4 Therefore, radio observations are affected by man-made radio 2010).Thelatterrequiresstoringboththemeanpowersandthe 6 transmitters,whichcanbeseveralordersofmagnitudestronger squaredpowers,therebydoublingthedatarate,andhenceisnot 3 than the weak celestial signals of interest. This kind of inter- always usable. Most interfering sources radiate either in a con- 3 ference, which seriously disturbs radio observations, is called stant small frequency range, or produce a broadband peak in a 1. radio-frequencyinterference(RFI). shorttimerange.Examplesofsuchinterferencesarerespectively 0 Numerous techniques have been suggested to perform the airtrafficcommunicationandlightning.Consequently,aninter- 2 challenging task of RFI mitigation. They include using spatial fering source tends to affect multiple neighbouring samples in 1 information to null directions, provided in interferometers or the time-frequency domain. These samples form straight lines, : multi-feedsystems(Ellingson&Hampson2002;Leshemetal. parallel to the time and frequency axes. An example is given v i 2000; Smolders & Hampson 2002; Kocz et al. 2010); remov- in Fig. 1(a), which shows data from the Westerbork Synthesis X ingtheRFIbyusingreferenceantennas(Barnbaum&Bradley Radio Telescope (WSRT). This line-shaped behaviour of RFI r 1998);andblankingoutunlikelyhighvaluesathightimereso- can be used to improve the accuracy of detection algorithms. a lutions(Baanetal.2004;Leshemetal.2000;Niamsuwanetal. An algorithm that uses this information is the SumThreshold 2005; Weber et al. 1997). Despite the numerous possible tech- method,whichshowsaveryhighdetectionaccuracycompared niques, almost any observation needs to be post-processed due to other methods (Offringa et al. 2010b). This method is used to RFI effects. The most used technique for such a final pro- in one of the steps in the AOFlagger pipeline (Offringa et al. cessing step consists of detecting the RFI in time, frequency 2010a).Animportantconsiderationforsuccesfulapplicationof and antenna space, and ignoring the contaminated data in fur- automatedfeaturedetectionalgorithmssuchasthese,isthatthe therdataprocessing.Thisstepisoftenreferredtoas“dataflag- signalofinterestshouldnotcontainsignificantline-shapedfea- ging”. Historically, this step was performed by the astronomer. tures,asisthecasewithspectrallineobservations.Also,meth- However,becauseofthemajorincreaseinresolutionandband- ods that assume straight, one-dimensional features in the time- width of observatories, leading to observations of tens of ter- frequencydomain,mightnotworkwellinsituationswherethe abytes,thisisnolongerfeasible.Thetendencyisthereforetoim- featuresarecurved.Thiscanoccurwhenboththefrequencyand plement automated RFI flagging pipelines in the observatory’s thetimeresolutionsarehighenoughtoresolvefrequencyvaria- 1 A.R.Offringaetal.:AmorphologicalalgorithmforimprovingRFIdetection 0.02 0.02 117.8 117.8 0.018 0.018 117.6 117.6 117.4 0.016 117.4 0.016 117.2 0.014 117.2 0.014 117 117 Hz)116.8 0.012 Hz)116.8 0.012 M M uency (116.6 y (Jy) 0.01 uency (116.6 y (Jy) 0.01 Freq111166..42 Visibilit00..000068 Freq111166..42 Visibilit00..000068 116 116 0.004 0.004 115.8 115.8 115.6 0.002 115.6 0.002 8:15 8:20 8:25 0 8:15 8:20 8:25 0 Time Time (a) (b) Fig.1: Typical spectral line RFI received in a short period of WSRT data around 117 MHz. It is likely that such RFI sources transmitcontinuouslywithinasmallbandwidth.Panel(a)showstheoriginalobservation,whilepanel(b)showswhattheAOFlagger withdefaultsettingswouldflagwithoutmorphology-basedflagging.Detectionisquiteaccurate,butsomeofthedetectedlinesin panel (b) are not continuous. It is likely that those RFI sources were active in the gaps as well. Morphology-based detection will helpinsuchcases.TheplotshowsStokesIamplitudesofthecross-correlationofantennasRT0×RT1,whichisa144mEast-West baseline.Asinglepixelis10seconds×10kHzofdata. tion in sources, for example when sources are Doppler shifted invariance is a desirable property of RFI detection algorithms. or vary intrinsically, such as with certain radar signals. With Thispaperprovides: LOFAR,weseeveryfewsuchsources. – Adetaileddescriptionofarecently-introducedmorphologi- Typically, the received power of interfering sources varies overtimeandfrequency.Thishappensbecauseofseveraleffects, caltechniqueforRFIdetection; – Analysisofthetechniqueandacomparisonwithanordinary such as intrinsic variation of the source; changing ionosphere; andbecauseofinstrumentaleffects.Atypicalexampleofthelat- dilation, using simulations and real data from two different radio-observatories; ter,whichispresentinalmosteveryobservation,isthechangeof – Anovelfastalgorithmwithlineartimecomplexitytoimple- thetelescope’sgaintowardsaterrestrialsourceasthetelescope mentthetechnique. tracksafieldinthesky.Liketimevariation,frequencyvariation can be caused intrinsically by the source. The instrument also adds frequency-dependent gain, for example due to imperfect 1.2. Outline band-pass filters. Even though a source might be continuously receivedbythetelescope,thresholdingdetectionmethodsmight We discuss a morphological technique that can be used to im- fail to detect the interferer over its full range due to the varia- prove RFI detection. The method flags additional samples that tioninreceivedpower.Figure1(b)showsanexamplewherethis arelikelytobecontaminatedwithRFI,basedonthemorphology is likely the case. Increasing the sensitivity of the thresholding of the flag mask output of a thresholding stage in the pipeline. methodmighthelpsomewhat,butwillalsocauseanincreaseof In sect. 2 we describe the technique and show a fast algorithm falsepositives.Whilesomefalselydetectedsamplesaretolera- toimplementit.Wepresentsomeresultsofthemethodonsim- ble,theyshouldbekeptminimalinordertoavoiddatabiasand ulated data and real data in Sect. 3. Finally, we summarize and insufficientresolution. discusstheresultsinSect.4. Using mathematical morphology for RFI detection is not a newidea;adilationisoftenusedduringRFIprocessingtoflag 2. Thescale-invariantrankoperator areasnearhighvaluesinthetime-frequencydomain.Anexam- pleofthiscanbefoundinWinkeletal.(2006),wherewindows RFI features such as in Fig. 2(a) are common in radio obser- of5timesteps×5frequencychannelsarounddetectedsamples vations, and can occur at different scales. However, a morpho- areflagged.However,standardmorphologicaltechniquesarenot logical dilation is not scale invariant, and will thus necessar- scale invariant. An operator is called scale invariant if scaling ilyworkbetterforsomeRFIfeaturesthanothers.Toovercome its input results in the same scaling of its output. An ordinary thisproblem,wewilldescribeandanalyseamorphologicalrank dilation will cause sharp RFI features to create a high amount operator that is scale invariant1. Scale invariance is a desirable of false positives, while flagging smooth RFI features requires propertyofRFIdetectionalgorithms,because(a)itimpliesthe averylargedilationkernel.Anotherscale-dependenttechnique usedforRFIdetectionistoconsiderthestatisticsoftimesteps 1 Themathematicalpropertiesofthistechniquewillbeanalysedin and frequency channels. In this paper, we will show that scale moredetailinvandeGrondeetal.,2012,inpreparation. 2 A.R.Offringaetal.:AmorphologicalalgorithmforimprovingRFIdetection method can be applied on data with different resolutions with- Accepted sample Flagged sample Hopoerirzaotnotral Voepretriacatolr Both outchangingparametersand(b)thetimeandfrequencyscaleof RFI itself can be arbitrary, so any method to detect RFI should work equally well for RFI at different scales. In practice, RFI seemstobehaveinamoreorlessscale-invariantmanneratthe resolution of LOFAR, as for example can be seen in Fig. 1, so we should also use a scale-invariant method to detect it. This (a)Input (b)Union scale-invariant behaviour of RFI breaks down at high time and frequencyresolutions,atwhichmanyfeaturesbecomediagonal inthetime-frequencyplane. TheproposedtechniquewasfirstmentionedinOffringaetal. (2010a), as it is part of the AOFlagger, which is the default LOFARRFIdetectionpipeline.Inthatarticletheoperationwas referred to as a dilation, however, it does not strictly adhere to (c)Horizontal (d)Verticalfirst all the properties of a morphological dilation. For example, we first willseethattheoperatorρisnotdistributiveovertheunionset operator:ρ(X∪Y) (cid:44) ρ(X)∪ρ(Y)forsomeX andY.Becausea Fig.3: Example outputs of the SIR operator in which the one- rankoperatorflagspointsforwhichthenumberofflaggedpoints dimensional output has been combined in three different ways. in a neighbourhood exceeds a threshold (Goutsias & Heijmans 2000, §3.4, Soille 2002), we will refer to the operator ρ as the Panel(a)istheinput,panel(b)showstheresultofperforminga unionontheoutputsofbothdirections,andinpanels(c)and(d), scale-invariantrank(SIR)operator. theSIRoperatorwasfirstappliedin,respectively,thehorizontal Inthispaper,wewilldescribethemethodin-depthandanal- yseitseffectiveness.InOffringaetal.(2010a),itwasmentioned andverticaldirection.Parameterηwas0.5inthisexample. thatthefullalgorithmhasatimecomplexityofO(N2),N being theinputsizeoftheSIRoperator,butbymakingthealgorithm lessaccurate,animplementationofO(N×logN)wasmentioned showsanexampleofasimulatedGaussianbroadbandRFIfea- to be possible. Here, we will introduce a faster algorithm with ture,andtheinputandoutputoftheSIR-operator. linear time complexity, which is also an exact implementation Theone-dimensionaloutputscanberemappedtotheoriginal oftheSIRoperator. two-dimensional domain in various ways. A simple and useful wayistoperformalogicalunionofΘ(cid:48) =ρ(Θ)andΦ(cid:48) =ρ(Φ ), t t ν ν 2.1. Description theflagsonrespectivelythetimeandfrequencyoutputs: Consider F, a set of positions in the time-frequency domain, wsuhcehnt(hta,tν)a∈saFm.pAlessautmtiemFeitsatnhderferesquulteonfcyasνtahtaisstibceaelndeflteacgtgioedn F(cid:48) =(cid:91)Θ(cid:48)t∪(cid:91)Φ(cid:48)ν. (4) t ν algorithm,suchastheSumThresholdalgorithm.Wewillapply theSIRoperatorintimeandfrequencydirectionsseparately,and An alternative is to initially apply the SIR operator only in define the sets Θ and Φ to contain the flags of a slice in time t ν one direction, i.e., on the sets that correspond with either the andfrequencydirection: timeorfrequencydirection,andsubsequentlyapplyingtheSIR Θ ≡{(s,ν)∈ F | s=t}, (1) operatorontheoutputsofthefirstintheotherdirection.Thelat- t ter is more aggressive than the former. The result also depends Φ ≡{(t,µ)∈ F |µ=ν}. (2) ν on which direction is processed first. The difference is demon- A single one-dimensional set Θ or Φ is the input for the SIR strated in Fig. 3, and an example of how that would work out t ν operator. The operator considers a sample to be contaminated onactualdataisgiveninFig.4.Optionally,theoperatorcanbe withRFIwhenthesampleisinasubsequenceofmostlyflagged appliedinfrequencyandtimedirectionswithdifferentη,ifone samples. To be more precise, it will flag a subsequence when suspectsthatRFIactsdifferentlyineitherdirection. morethan(1−η)Nofitssamplesareflagged,withNthenumber ofsamplesinthesubsequenceandηaconstant,0≤η≤1.Using 2.2. Properties¶meters ρtodenotetheoperator,theoutputρ(X)canbeformallydefined as Consider the case in Equation (3) when a subsequence of ar- (cid:91)(cid:26) bitrary length is flagged. Since the fraction of flagged samples ρ(X) ≡ [Y1,Y2)| (3) withinthesubsequenceisexplicitlyusedtodefineitsoutput,the (cid:27) operator is scale invariant. Formally, an operator ρ is scale in- #(X∩[Y1,Y2))≥(1−η)(Y2−Y1) , variant if and only if ρ(λX) = λρ(X), i.e., scaling the input X withλfollowedbyρisequaltoscalingtheoutputρ(X)withλ. with[Y ,Y )ahalf-openintervalofΘ orΦ ,andthehashsym- We will now give a formal proof of the scale invariance of the 1 2 t ν bol#denotingthecount-operatorthatreturnsthenumberofel- SIRoperator. ements in the set. In words, Equation 3 defines ρ(X) to consist of all the samples that are in an interval [Y1,Y2), in which the Proof. With ρ the SIR operator, we will scale the input X with ratio of samples in the input X is greater or equal than (1−η). factorλ.Ifλ = 0wetriviallyhavethatρ(λX) = λρ(X).Also,if Parameter η represents the aggressiveness of the method: with ρ(λX) = λρ(X) for λ > 0, it is not difficult to see that we also η = 0,noadditionalsamplesareflaggedandρ(X) = X.Onthe have ρ(−λX) = −λρ(X), as mirroring the input will mirror the otherhand,η = 1impliesallsampleswillbeflagged.Figure2 output.Therefore,assumewithoutlossofgeneralitythatλ > 0. 3 A.R.Offringaetal.:AmorphologicalalgorithmforimprovingRFIdetection 1000 1 1000 10 1000 10 1000 10 900 900 8 900 8 900 8 800 800 6 800 6 800 6 700 700 4 700 4 700 4 600 600 600 600 500 0.1 500 500 500 2 2 2 400 400 400 400 300 300 300 300 200 200 1 200 1 200 1 0.8 0.8 0.8 100 100 100 100 0.6 0.6 0.6 0 80 100 120 0.01 0 80 100 120 0 80 100 120 0 80 100 120 (a) (b) (c) (d) Fig.2:SimulationofatypicalbroadbandRFIfeaturewithGaussianfrequencyprofileasusedintheROCanalysis.Panel(a):isolated RFIfeature;panel(b):whennoiseisadded,apartofthefeaturebecomesundetectable;panel(c):flaggedwiththeSumThreshold method;panel(d):withSIRoperatorapplied,parameterη=0.2. 0.01 0.01 z)H32.12 Hz)32.12 M 0.008 M 0.008 uency (323.21.11 y (Jy)0.006 uency (323.21.11 y (Jy)0.006 Freq32.09 Visibilit0.004 Freq32.09 Visibilit0.004 32.08 32.08 0.002 0.002 6:10 6:15 6:20 0 6:10 6:15 6:20 0 Time Time (a)Originaldata (b)Intersection 0.01 0.01 z)H32.12 Hz)32.12 M 0.008 M 0.008 uency (323.21.11 y (Jy)0.006 uency (323.21.11 y (Jy)0.006 Freq32.09 Visibilit0.004 Freq32.09 Visibilit0.004 32.08 32.08 0.002 0.002 6:10 6:15 6:20 0 6:10 6:15 6:20 0 Time Time (c)Union (d)Timefirst 0.01 0.01 z)H32.12 Hz)32.12 M 0.008 M 0.008 uency (323.21.11 y (Jy)0.006 uency (323.21.11 y (Jy)0.006 Freq32.09 Visibilit0.004 Freq32.09 Visibilit0.004 32.08 32.08 0.002 0.002 6:10 6:15 6:20 0 6:10 6:15 6:20 0 Time Time (e)Frequencyfirst (f)Unionof(d)and(e) Fig.4:ExampleoftheSIRoperatorappliedonaLOFARobservation,displayingfivedifferentmethodstomaketheSIRoperator two-dimensional.ThevisibilitiesshownarefrombaselineCS003×CS007ofaLOFARlow-band-antenna(LBA)observationwith 3s × 0.8 kHz resolution. This observation part was selected as an example because it has a two-dimensional RFI structure. Such RFI is less common, hence this is not a typical case. With the exception of the intersection, there is no difference between the differentmethodsonthethinlinesbelow32.1MHz.Applyingtheoperatorsequentially(panelsd,eandf)ismoreaggressivefor thetwo-dimensionalstructures,asitwillflagsamplesthathavediagonalneighboursthatareflagged.Intersectingthetwomethods (panelb)willonlyflagconcavesamples.Pinkispre-flaggedbytheSumThresholdmethod,yellowisaddedbytheSIRoperator. Avalueofη=0.2wasusedinthisexample. 4 A.R.Offringaetal.:AmorphologicalalgorithmforimprovingRFIdetection Now,substitutingXwithλXinEquation(3)resultsin Listing1:Lineartimecomplexityalgorithmforthescale-invariant (cid:91) rankoperator ρ(λX) = {[Y1,Y2)| function ScaleInvariantRankOperator #(λX∩[Y1,Y2))≥(1−η)(Y2−Y1)}. Input: ByusingZ =Y /λandZ =Y /λ,thiscanberewrittento N : Size 1 1 2 2 Ω : Input array of size N (cid:91) (Ω[i] = 1 =⇒ i is flagged, ρ(λX) = {[λZ1,λZ2)| Ω[i] = 0 otherwise) #(λX∩[λZ1,λZ2))≥(1−η)(λZ2−λZ1)}. η : Aggressiveness parameter Output: If we assume continuous positions, both the left side and the Ω(cid:48) : Output flag array of size N rightsideofthecomparisoncanbescaledby1/λ: 1:begin (cid:91) // Initialize Ψ ρ(λX) = {[λZ1,λZ2)| 2: for x=0...N−1 do Ψ[x] ← η - 1 + Ω[x] #(X∩[Z1,Z2))≥(1−η)(Z2−Z1)}, // Construct an array M such that: and by using [λZ ,λZ ) = λ[Z ,Z ) and the definition in // M(x) = (cid:80) j∈{0...x−1}: Ψ[j] 1 2 1 2 Equation(3),thisisequivalenttoρ(λX)=λρ(X). 3: M[0]←0 4: for x=0...N−1 do M[x+1] ← M[x] + Ψ[x] Because the time and frequency dimensions are obviously dis- crete and finite when applied on radio observations, in practice // Construct array P such that: the scale invariance is limited by the resolution and size of the // M[P[x]] = min M[j]: 0≤ j≤x data. 5: P[0]←0 6: for x=1...N−1 do The aggressiveness of the SIR operator can be controlled with the η parameter, which can be chosen differently for the 7: P[x]←P[x−1] 8: if M[P[x]] > M[x] then P[x]←x time and frequency directions. Because the method is scale in- 9: end for variant,thechoiceofηcanbemadeindependentofthetimeand frequency resolutions of the input. The default η parameter in // Construct array Q such that: the LOFAR pipeline is currently η = 0.2 and is equal in both // M[Q[x]] = max M[j]: x< j<N directions. This value has been determined by tweaking of the 10: Q[N−1]←N parameter and data inspection, e.g. by looking at the resulting 11: for x=N−2...0 do time-frequency diagram and projections of the data variances. 12: Q[x]←Q[x+1] Theresultswerecheckedformanyobservations.Highervalues 13: if M[Q[x]] < M[x+1] then Q[x]←x+1 14: end for seemtoremovetoomuchdatawithoutmuchbenefit,whilesome RFI is left undetected with lower values. The value works well // Flag sample x if M[Q[x]] − M[P[x]] ≥ 0 forvarioustelescopesandondifferenttimeandfrequencyreso- 15: for x=0...N−1 do lutions.WewillevaluatethissettinginSection3.2. 16: if M[Q[x]]-M[P[x]]≥0 then Since most telescopes observe two linear or circular polar- 17: Ω(cid:48)[x]←1 izations, RFI detection can consider each (correlated) polariza- 18: else tion individually, and the operator can be applied on each pro- 19: Ω(cid:48)[x]←0 duced mask independently. However, the flag masks are often 20: end if keptequalbetweenthepolarizations,becausecalibrationmight 21: end for becomeunstablewhen,foraparticularsample,partofthepolar- 22: return Ω(cid:48); izationsaremissing.Moreover,ifoneofthepolarizationfeedsof 23:end thetelescopehasbeenaffectedbyRFI,itislikelythattheothers alsohavebeenaffected.Forthesereasons,theapproachtakenin Listing1showsadirectalgorithmtosolvetheSIRoperator theLOFARpipelineistousetheSumThresholdmethodonall problem. fourcross-correlatedpolarizations(XX,XY,YXandYY)indi- vidually,thenflaganysampleforwhichatleastonepolarization Proof. Using the definition of Ω(x) and Ω(cid:48)(x), such that 1 in- hasbeenflagged,andfinallyapplytheSIRoperatoronceonthe dicates that x is flagged and 0 that it is not, we can rewrite combinedmask. Equation(3)as 2.3. Thealgorithm Ω(cid:48)(x)=1 ifY(cid:80)2∃−Y1Ω1 ≤(y)x,≥Y(21>−xη,)s(uYch−thYat) (5) AEquasttiroanigh(3tf)oirswtaordtestimepaclehmpeonstsaitbiolencoonftiguthoeus soupbesreaqtourencien. 0 otyh=eYr1wise. 2 1 In this case, if N is the number of samples in the sequence Θ t or Φ , O(N2) sums of subsequences have to be tested. Since Inline2,thearrayΨ(y)isinitializedsuchthatΨ(y) = ηincase ν the sums of all subsets can also be constructed in quadratic yisflagged,andΨ(y) = η−1otherwise.Equation(5)cannow time complexity, the total time complexity of a straightforward berewrittentothefollowingtest: implementation is O(N2). We will now show an algorithm that sissoulmsvoeamslgethwoerhitaphtrmosb.imleimlarwtioththlienemaraxtiimmeumcomcopnlteixgiutoy.usThsuebaslegqourietnhcme Ω(cid:48)(x)=10 iofth∃eYr1w≤isex., ∃Y2 > x:Yy=(cid:80)2−Y11Ψ(y)≥0 (6) 5 A.R.Offringaetal.:AmorphologicalalgorithmforimprovingRFIdetection Line3-4initializeM(x)for0≤ x≤ N to 1 (cid:88)x−1 M(x)= Ψ(j), j=0 0.1 sothatEquation(6)canberewrittenas e (s) 0.01 m Ti Ω(cid:48)(x)=1 if∃YM1(Y≤ )x,−∃MY2(Y>)x≥:0 (7) 0.001 0 otherwis2e. 1 0.0001 n log n algorithm n2 algorithm BecauseweareonlyinterestedinΩ(cid:48)(x)intherange0≤ x< N, linear algorithm 1e-05 wecanlimitthesearchforY andY to0 ≤ Y ≤ x < Y ≤ N. 100 1000 10000 100000 1e+06 1e+07 1 2 1 2 ThereexistsY andY inthisrangesuchthatM(Y )−M(Y )≥0, N (size of input) 1 2 2 1 ifandonlyif Fig.5:Computationtimeversusinputsizewiththedifferental- max M(y)− min M(y)≥0. gorithms and fixed η = 0.2. The average over 1000 runs was y:x<y≤N y:0≤y≤x takenforeachdifferentconfiguration. Lines5-14makesurethatPandQareinitializedfor0≤ x< N, suchthat 3. Analysis&results P(x) = argmin M(y), y∈0...x 3.1. Performance Q(x) = argmaxM(y). Figure 5 displays the performance of the C++ implementa- y∈x+1...N tions and compares the linear algorithm with the approximate O(NlogN)algorithmandthefullquadraticalgorithm.Themea- Finally,thisallowsEquation(7)toberewrittenas surementshavebeenperformedonaregulardesktopwitha3.07 (cid:40) GHz Intel Core i7 CPU, using only one of its cores. The time 1 ifM(Q(x))−M(P(x))≥0 Ω(cid:48)(x)= (8) complexities of the three algorithms for increasing N behave 0 otherwise, asexpected.Thelinearalgorithmisfasterinallcases,evenfor smallN.TheO(NlogN)timecomplexityalgorithmismorethan whichisperformedandreturnedinlines15-23. oneorderofmagnitudesloweratbothsmallandlargeN.Thelin- earalgorithmhasbeenexecutedwithdifferentvaluesforηand The algorithm is Θ(N), and performs 3N additions or sub- theresultsareshowninFig.5.Exceptforsomeslightvariations tractions and 3N − 2 comparisons on floating point numbers. —especiallyforη = 0—thealgorithm’sspeedisindependent ThealgorithmusesthetemporaryarraysΨ,M,PandQ,eachof ofη. sizeN,withtheexceptionofMwhichisofsizeN+1.ArrayΨ In the LOFAR pipeline, it takes 3.8 seconds to process a canbeoptimizedawayandtheinputΩcanbereusedforoutput single sub-band for a single baseline, assuming 100,000 time byassigningdirectlytoitinlines17and19.Thetotalamountof steps and 256 channels (which is common). Of these 3.8 sec- temporarystoragerequiredisthusaboutN floatingpointvalues onds, only 49 milliseconds (1.3%) are spent applying the SIR and 2N index values, thus O(N). When the function is applied operator. In common applications, an observation contains on on a two-dimensional image, as in the case of RFI detection, the order of a 1,000 baselines and 250 sub-bands. The pipeline thetemporarystorageisnegligible,asthenumberofprocessed is heavily parallelized by concurrently flagging baselines over slicesisusuallymuchlargerthanoneortwo.Ifηisexpressedas multiplecoresandsub-bandsovermultipleclusternodes.Inthis aratiooftwointegervalues,itispossibletoscaleallvaluesand case, the pipeline’s performance is dominated by disk access, onlyuseintegermath. andtherelativecontributionoftheSIRoperatorisevensmaller. The algorithm has been implemented in C++ and takes around40linesofcode2. 3.2. Accuracyanalysis Because the problem is somewhat similar to the maximum contiguous subsequence sum (Bentley 1984) and the all max- The performance of the SIR operator was tested by using re- imal contiguous subsequence sum problems, it might be pos- ceiver operating characteristics (ROC) analysis. To do so, a sible to parallelize the algorithm by similar means, e.g. as in ground truth needs to be available, which can only be accu- (Alves et al. 2005). Moreover, parallel algorithms exist for the rately acquired in a simulated environment. As discussed pre- prefixsum/min/maxcalculations.Forthespecificapplicationof viously, a very large fraction of RFI is line-like. The samples RFI detection for LOFAR, the pipeline has already been max- onsuchalinearenotuniformduetointrinsiceffectsorinstru- imally parallelized by flagging different baselines and/or sub- mental gain variations. Therefore, we have used simulations of bandsconcurrently.Unlikeparallelizingonthealgorithmlevel, four line-shaped RFI features as displayed in Fig. 6: (a) a sin- thisrequiresnocommunicationbetweenthedifferentprocesses. gle Gaussian that reaches its 3σ point at both borders and is 1 in the centre; (b) three periods of a sinusoidal function which 2 The implementation is part of the AOFlagger and can be down- is scaled between zero and one; (c) the Gaussian feature, but loadedfromhttp://www.astro.rug.nl/rfi-software. slantedby1/50fraction;and(d)aburst-likesignalinwhichthe 6 A.R.Offringaetal.:AmorphologicalalgorithmforimprovingRFIdetection 1000 10 1000 10 1000 10 1000 10 900 8 900 8 900 8 900 8 800 6 800 6 800 6 800 6 700 4 700 4 700 4 700 4 600 600 600 600 500 500 500 500 2 2 2 2 400 400 400 400 300 300 300 300 200 1 200 1 200 1 200 1 0.8 0.8 0.8 0.8 100 100 100 100 0.6 0.6 0.6 0.6 0 0 0 0 80 100 120 80 100 120 80 100 120 140 80 100 120 (a) (b) (c) (d) Fig.6: Features used for the accuracy analysis. Panel (a): Feature with Gaussian slope; panel (b): Sinusoidal feature; panel (c): SlantedfeaturewithGaussianslope;panel(d):BurstfeaturewithsamplesdrawnfromaRayleighdistribution. amplitude levels are drawn from a Rayleigh distribution with changes the scaling of the ROC curves, but the relative differ- mode σ = 0.6. All features are three samples wide. Complex encebetweenthetwomethodsremainsthesame. Gaussiandistributednoisewithσ = 1wasaddedtotheimage, GiventhevarioustypesofRFI,Fig.7showsthat(I)theSIR such that the amplitudes are Rayleigh distributed. The created operator is extremely accurate on straight features, by detect- two dimensional image of size 180 × 1024 was subsequently ing all previously undetected samples with only a very slight flagged by the SumThreshold method with settings as in the increaseinfalsepositives;(II)theSIRoperatorissuperiortothe LOFARAOFlaggerpipeline,andthecreatedflagmaskwasused dilationinalltestedsituations;and(III)asettingofη∼0.2–0.4 asinput. seemstobeagoodcompromisebetweenFPandTPratios. Toestimatetheperformances,thetrueandfalsepositivesra- These tests have been performed by applying the operators tios(TPandFPratiosrespectively)werecalculatedafterdetec- inonedimension.Whenappliedintwodimensionsbyusingthe tion.Wecreatedafuzzygroundtruthmaskinwhichavalueof outputofthefirstdimensionasinputforthesecond,thecompar- onecorrespondswithmaximalRFI,zerocorrespondswithsam- isonbetweenthedilationandtheSIRoperatorwilldivergeeven plesnotcontaminatedbyRFI,andvaluesinbetweencorrespond more,becausethefalsepositivescreatedbythefirstdimension withlowerlevelsofRFIcontamination.Fig.2(a)showsforex- will be multiplied by the repeated application towards the sec- ample the ground-truth mask of the Gaussian feature. Given a ond dimension. An example of a two-dimensional application sample with ground truth value β, if the corresponding sample is shown in Fig. 8. Certain RFI sources create more complex wasflaggedbythemethod,itwouldbecountedwithratioβas shapes in the time-frequency domain, and contaminate larger atruepositiveand1−βasafalsepositive.ThetotalTPandFP non-line like areas. These RFI sources cause higher values in ratios are the sum of all the TP and FP values, divided by the theoutputoftheGaussiansmoothing,whichiscommonlypart totalsumofpositivesandnegativesinthetestset,respectively. oftheearlierRFIdetectionstage,andconsequentlysomeofthe The SIR operator and a standard morphological dilation lower RFI levels of the RFI feature are not flagged. We have have been applied in the direction of the feature, i.e., verti- seenthattheSIRoperatorwillworkverywellonsuchfeatures, cal/frequencydirection.Thetrueandfalsepositiveswerevaried because it fills the feature and slightly extends the flags in all bychangingtheparameterηorthedilationsizeforrespectively directions. the SIR operator and the dilation. Different runs gave slightly Itshouldbenotedthatoneoftheassumptionsmadeforthe differentresultsbecauseoftheintroducedGaussiannoise,hence SIR operator to improve detection accuracy, is that parts of the thesimulationwasrepeated100timesandtheresultswereaver- RFI features are not detectable by amplitude thresholding. In aged. practice, however, a small subset of received RFI sources does Figure 7 shows the average results. The shadowed areas in contributetoanobservationwithsufficientstrengthtodetectthe panels(b)and(c)showthestandarddeviationoverthe100runs. entirefeaturewithamplitudethresholding.Suchtransmittersare InthecaseoftheGaussianRFIfeature,theSumThresholdpre- theworst-casesituationfortheSIRoperator,astheoperatorwill detectionremovesonaverage91.3%RFIpower,whilesimulta- enlarge the flag mask relative to its length, but any samples it neouslyfalselyflaggingaratioof0.38%.Hence,ifthemethods flags extra are false positives. Note that it is not useful to per- do not flag any additional samples, they have a TP/FP ratio of formROCanalysisofsuchasituation,asthetruepositiveswill 91.3%/0.38%,andthisisthereforethestartofbothROCcurves be constant. The number of false positives can easily be calcu- forthisRFIfeature.Withη=0.48,theSIRoperatorflagsallthe lated,andscaleslinearlywithηandthedurationofthetransmit- RFIfeatureswith100%TPwithaFPratioof1.36%,withtheex- ter.Forexample,whenapplyingtheSIRoperatorwithη = 0.2 ceptionoftheslantedfeature.TheSumThresholdpre-detection, onastrongRFItransmitterthatoccupiestenminutesofdatain dilationandSIR-operatorworklesswellontheslantedfeature, one channel, the operator will falsely flag two minutes of the andfailtodetectitwith100%evenatveryhighsensitivity.The channelbeforeandafterit.Anexampleofabandthatcontains dilationoperatorneedsasizeof32.8%oftheheightoftheimage intermittenttransmittersistheairtrafficcommunicationbandof todetecttheverticalRFIfeatures.Sinceitwilldilateanyfalsely 118–137MHz. Nevertheless, while some of these transmitters detectedinputsampleequally,itsFPratiois46%withthisset- areindeed strong,e.g.,whenthey flythroughabeam sidelobe, ting. Changing the signal-to-noise ratio (SNR) of the features there are also many transmitters at this frequency that are too 7 A.R.Offringaetal.:AmorphologicalalgorithmforimprovingRFIdetection 1 Colour legend 0.95 (d) (b) ROC operator: o Test set (a): Gaussian feature ati (a) Test set (b): Sinusoidal feature res 0.9 Test set (c): Slanted feature v siti Test set (d): Burst-like feature o p e Dilation: u 0.85 Tr Test set (a): Gaussian feature Test set (b): Sinusoidal feature Test set (c): Slanted feature 0.8 (c) Test set (d): Burst-like feature 0.75 0 0.02 0.04 0.06 0.08 0.1 False positives ratio (a)ROCcurves.Solidlines:rankoperator;dashedlines:dilation. 0.1 (d) 0.35 1 (d) 1 (b) (a) (b) 0.3 0.08 (a) 0.95 0.95 itives ratioTrue pos 0 .08.59 (c) (c) 00..0046 False positives ratio True positives ratio 0 .08.59 (b)(d(()cc))(a) 000...12255 False positives ratio 0.1 0.8 (d) (b) 0.02 0.8 0.05 (a) 0.75 0 0.75 0 0 0.2 0.4 0.6 0.8 1 0 5 10 15 20 25 Eta Dilation size (% of image height) (b)InfluenceofηontheSIRoperator.Solidlines:truepositives(left (c)Influenceofkernelsizeonthedilation.Solidlines:truepositives axis);dashedlines:falsepositives(rightaxis). (leftaxis);dashedlines:falsepositives(rightaxis). Fig.7:AnalysisofthereceiveroperatingcharacteristicsoftheSIRoperatorandastandarddilationonsimulateddata.Marks(a)–(d) correspond with the features shown in Fig. 6, respectively a Gaussian broadband feature, a sinusoidal feature, a slightly slanted Gaussianfeatureandaburst-likefeature.Theshadowedareasshow1σlevelsover100runs. weakly received to be detected all of the time. Consequently, the sporadic transmitters such as the one at 140 MHz are seen someofthemareonlypartiallydetectedwithamplitudethresh- veryoften. olding.ThisiswhyweexpectbetterresultsusingtheSIRoper- A final remark on the ROC analysis performed here is that atoreveninthesebands,comparedtousingadilation. the given absolute true/false positive ratios are not an accurate representationofactualRFIdetection,becauseourtwomodels are very simplistic and based on the assumption that RFI be- Figure9showsaWSRTexamplethatcontainsmanydiffer- havesinawelldefinedmanner.Establishingabsolutetrue/false ent RFI kinds. The initial SumThreshold method detects the positive ratios would require a detailed statistical model of the RFI quite accurately, but it leaves some parts of the last 1.5 behaviourofRFI.Arealisticestimateforthenumberofsamples hours unflagged. This is solved by the SIR operator, although, occupied by RFI with LOFAR is in the order of a few percent becauseof thesudden startof theRFI, itfalsely flagsabout 20 (Offringaetal.2011). minutesofdatabeforethestartoftheRFI.ThestrongRFIpro- duced by the sporadic transmitter around 140 MHz is flagged by the SumThreshold method, but in this case it is likely that 4. Conclusionsanddiscussion thesechannelshavebeenoccupiedallofthetime.Therefore,the SIR operator gives the desired result by increasing the flags in From panel 7(a) it is clear that the SIR operator is much more thosechannels.Allinall,thebaselinemightbesomewhatover- suitabletodetectthetestedkindsofRFIthananordinarymor- flagged.Nevertheless,itdoesallowfurtherdatareductionwith- phological dilation. A value of η = 0.2 was determined by out manual intervention, and without thresholding part of the tweaking and validating the results to be a reasonable setting noise.Moreover,thiscaseisexceptional,andthesuddenstartof for the LOFAR RFI pipeline, and has been used in the default verystrongbroadbandRFIis(fortunately)seldomlyseen,while LOFAR pipeline for over a year. Panel 7(b) shows that this 8 A.R.Offringaetal.:AmorphologicalalgorithmforimprovingRFIdetection 0.02 0.02 0.02 117.8 117.8 117.8 0.018 0.018 0.018 117.6 117.6 117.6 117.4 0.016 117.4 0.016 117.4 0.016 117.2 0.014 117.2 0.014 117.2 0.014 117 117 117 MHz)116.8 0.012 MHz)116.8 0.012 MHz)116.8 0.012 y (Frequenc111111666...462 Visibility (Jy)00..000.000681 Frequency (111111666...462 Visibility (Jy)00..000.000681 Frequency (111111666...462 Visibility (Jy)00..000.000681 116 116 116 0.004 0.004 0.004 115.8 115.8 115.8 115.6 0.002 115.6 0.002 115.6 0.002 8:15 8:20 8:25 0 8:15 8:20 8:25 0 8:15 8:20 8:25 0 Time Time Time (a)HorizontalSIRoperator (b)VerticalSIRoperator (c)SIRoperatorinbothdirections 0.02 0.02 0.02 117.8 117.8 117.8 0.018 0.018 0.018 117.6 117.6 117.6 117.4 0.016 117.4 0.016 117.4 0.016 117.2 0.014 117.2 0.014 117.2 0.014 117 117 117 MHz)116.8 0.012 MHz)116.8 0.012 MHz)116.8 0.012 y (Frequenc111111666...462 Visibility (Jy)00..000.000681 Frequency (111111666...462 Visibility (Jy)00..000.000681 Frequency (111111666...462 Visibility (Jy)00..000.000681 116 116 116 0.004 0.004 0.004 115.8 115.8 115.8 115.6 0.002 115.6 0.002 115.6 0.002 8:15 8:20 8:25 0 8:15 8:20 8:25 0 8:15 8:20 8:25 0 Time Time Time (d)Horizontaldilation (e)Verticaldilation (f)Dilationwithasquarekernel Fig.8: Gray-scale plots showing examples of the effectiveness of two morphological techniques on the data from Fig. 1. The pink samples have been set by the SumThreshold algorithm and the yellow samples have additionally been detected with the morphological techniques. Panels a–c show results of the SIR operator with η = 0.2 in the time direction and/or η = 0.3 in frequency direction. Panels d–f show an ordinary dilation with a horizontal kernel of five pixels and/or a vertical kernel of three pixels. agrees approximately with the simulations: at η = 0.2, the ver- white noise, while morphologically extending a flag mask tical features have almost been completely detected by the SIR doesnot.Forthesereasons,itispreferabletousemorphol- operator(Gaussian:98.9%,a7.6%increase,sinusoidal:99.9%, ogytofindthefinalfewRFIsamples,comparedtolowering 5.8% increase, burst: 100%, 6.7% increase) with a minor in- thethreshold. crease in the false positive ratio (Gaussian: 0.69%, an 0.31% – Themethodisextremelyfastandsimple,anditsprocessing increase,sinusoidal:0.95%,0.42%increase,burst:1.3%,0.08% timeisalmostnegligibleinafullRFIpipeline. increase).Theslantedfeatureisnotasaccuratelydetected(86%, – Wehaveseensituationsinwhicheventhelowratiooffalse 6.1%increase),butthemethoddoesenhancethedetection.Itis negativesthatareleakedthroughanamplitude-basedRFIde- hard to give a similar optimal value for the dilation operation, tectionpipelinecancausecalibrationtofail.Empirically,we since the false positives scale linearly with the size of the dila- have seen an improvement of the calibratability of LOFAR tionkernel.Therefore,itdependsonwhatisanacceptableloss observationsbyusingthemorphologicalmethod. intermsofthefalsepositives. Section 3.2 describes that strong intermittent (on ∼minute In the case of simulated Gaussian broadband features, scale) RFI transmitters are probably the worst case for the SIR only 8.7% of the RFI power was not detected by the operator, as in these cases the application of the operator with SumThreshold method. For the sinusoidal and burst features, η = 0.2 could in theory yield 40% false positives. However, theSumThresholdmethodperformsevenbetter.Therefore,the because of LOFAR’s high resolution, in combination with the totalbenefitoftheSIRoperatormightseemsmall.However,we SumThreshold’s unprecedented detection accuracy, the total thinkthattherearestrongreasonstousethemethod: percentage of flagged data in the case of LOFAR is only a few – The added false positives are almost negligible, and the percent. This implies that even if a large ratio of these were chances of biasing your data are much smaller compared strongintermittenttransmitters–whichisunlikely–thebenefits to using amplitude thresholding exclusively. For example, of not having to manually consider data quality in cases where thresholding biases the final distribution of uncorrelated thetechniquedoeshelp,probablyoutweighthe∼1%addedfalse 9 A.R.Offringaetal.:AmorphologicalalgorithmforimprovingRFIdetection 0.1 0.1 140.4 140.4 140.3 140.3 140.2 140.2 140.1 140.1 140 140 Hz)139.9 Hz)139.9 y (MFrequenc111333999...687 Visibility (Jy) 0.01 Frequency (M111333999...687 Visibility (Jy) 0.01 139.5 139.5 139.4 139.4 139.3 139.3 19:00 20:00 21:00 22:00 23:00 0:00 1:00 0.001 19:00 20:00 21:00 22:00 23:00 0:00 1:00 0.001 Time Time (a)Originaldata (b)FlaggedwithSumThreshold(pink)followedbytheSIR operator(yellow) Fig.9:ExampleofaninterestingbutuncommonWSRTcase:partofabaselineofanobservationat140MHzthatsufferedunusually strong broadband RFI during the last 1.5 hrs. It also contains many different kinds of transmitters that mostly occupy constant channels.Theverticalstripesarefringesofcelestialsources,hencecontaintheinformationofinterest.Theimageshownis2000× 250samplesinsize. positives.Ifitturnsoutthatsomebandsdohavemostlystrong cases, the scale-invariant operation to extend binary masks as transienttransmitters,theηparametercouldbecomeafunction presentedheremightbegenerallyuseful. of frequency. At the moment, application of the SIR operator Sofar,wehaveconsideredcombinationsofone-dimensional seemstobehelpfulatanyfrequency. application of the SIR operator in order to use it for our two- dimensionalapplicationinthetime-frequencydomain.Forthis Inthispaperwehaveassumedscale-invariantbehaviourfor application, but also for more generic applications, it might be RFI. In reality this might not be entirely accurate, so instead interestingtoconsideratruetwo-dimensionalversionoftheSIR of using a threshold that grows linearly with the scale, as in operator. While the one-dimensional operator selects all sub- our definition of the SIR operator, it might be better to have a sequences (lines) with a ratio ≥ η of flagged values, a two- thresholdthatdependsonthescaleinanon-linearfashion.Also, dimensionaloperatorwouldselectallrectanglesthathavearatio whenlookingattheproblemfromastatisticalpointofview,RFI ≥ ηofflaggedvalues.Itishoweverlikelythatsuchanoperator mightnotbeequallylikelytooccuronallscales.Forexample, can not be implemented with a linear time complexity, which RFI might be less likely to occur on a scale of days than on a makesitlessattractiveforthelargedatarateofLOFAR. scaleofseconds.Whenitdoesoccuronlargescales,itisdoubt- Wehaveshownthatevenslightlyslantedfeaturesareharder ful that we actually need to extend the detected intervals in a todetectaccurately.Fortunately,inthecaseofLOFAR,suchfea- scale-invariantmanner,becausethesignalwouldlikelyalready tures are very rare. If the features to be detected have a known be detected at a smaller scales and gaps would likely be filled. orientation that is not parallel to one of the axes, it might be Suchconsiderationswouldsuggestthatitmightbebettertohave an option to apply the operator in the direction of the features. athresholdthatgrowslessthanlinearlyforlargescales.Better While a trivial implementation can apply the operator along RFIstatisticsandRFImodellingmightprovidetherequiredin- fixed lines, some work might be necessary to maintain transla- formationforassessingsuchconsiderations. tioninvariance(Soille&Talbot2001). SeveraloptionsareavailabletoapplytheSIRoperatorona two-dimensional input. As shown in Fig. 4, the intersection of Acknowledgements. WethankGerdeBruynforprovidingthedataandforvar- the results in both directions does not extend line RFI, thus is iousdiscussions.WealsothanktheLOFARcollaborationforprovidingthedata onwhichtheAOFlagger,includingtheSIRoperator,wasdevelopedandtested. not useful in this context. A union does extend such RFI, but does not extend the flags diagonally. Processing the directions sequentiallymightthereforebebeneficialforRFIthathasstruc- References tureinbothfrequencyandtime,asthiskindofRFIdoeslikely alsoslightlycontributeinthediagonaldirection.Thedifference Alves,C.E.R.,Ca`ceres,E.N.,&Song,S.W.2005,ABSP/CGMalgorithm forfindingallmaximalcontiguoussubsequencesofasequenceofnumbers. betweenprocessingtimefirst,frequencyfirstortakingtheunion Technicalreport,UniversidadedeSa˜oPaulo ofboth,issmall.Takingtheunionovercomesthesomewhatar- Baan,W.A.,Fridman,P.A.,&Millenaar,R.P.2004,AJ,128,933 bitrary decision of which direction to process first. In the case Barnbaum,C.&Bradley,R.F.1998,AJ,115,2598 of LOFAR, we decided to only perform filtering time first, be- Bentley,J.1984,Commun.ACM,27,865 Ellingson,S.W.&Hampson,G.A.2002,IEEETrans.onAnt.&Prop.,50,25 causetakingtheunionofbothtimeandfrequencyfirstismore Flo¨er,L.,Winkel,B.,&Kerp,J.2010,inProc.ofRFI2010 expensive. Gary,D.E.,Liu,Z.,&Nita,G.M.2010,inProc.ofRFI2010 Goutsias,J.&Heijmans,H.J.A.M.2000,Fundam.Inf.,41,1 Morphologycanbeusedinseveralimageprocessingtasks, Kocz,J.,Briggs,F.H.,&Reynolds,J.2010,AJ,140,2086 for example in feature detection. Often, generic morphological Leshem,A.,vanderVeen,A.-J.,&Boonstra,A.-J.2000,ApJS,131,355 operations need to be applied on different resolutions. In such Niamsuwan,N.,Johnson,J.T.,&Ellingson,S.W.2005,RS,40 10