Table Of ContentAssessing The Accuracy Of Radio Astronomy Source Finding
Algorithms
Stefan WesterlundA,B, Christopher HarrisA, and Tobias WestmeierA
AICRAR/University of Western Australia, M468 35 Stirling Highway, Crawley, WA 6009
BEmail: [email protected]
Abstract: Thisworkpresentsamethodfordeterminingtheaccuracyofasourcefinderalgorithmfor
spectrallineradioastronomydataandtheSourceFinderAccuracyEvaluator(SFAE),aprogramthat
implementsthismethod. Theaccuracyofasourcefinderisdefinedintermsofitscompleteness,relia-
bility,andaccuracyoftheparameterisationofthesourcesthatwerefound. Thesevaluesarecalculated
2 byexecutingthesourcefinderonanimagewithaknownsourcecatalogue,thencomparingtheoutput
1 ofthesourcefindertotheknowncatalogue. TheintendedusesofSFAEincludedeterminingthemost
0 accurate source finders for use in a survey, determining the types of radio sources a particular source
2 finderiscapableofaccuratelylocating,andidentifyingoptimumparametersandareasofimprovement
for these algorithms. This paper demonstrates a sample of accuracy information that can be obtained
n
through this method, using a simulated ASKAP data cube and the Duchamp source finder.
a
J
Keywords: methods: data analysis
8
1
1 Introduction
loguesweremergedandexaminedmanuallytoconfirm
]
M the detected objects (Meyer et al. 2004).
Acriticalaspectofastronomyisthequantitativeanal-
Radio telescope spectral line data is fed into a
I ysisofobjectssuchasstarsandgalaxies. Astronomers
. source finder program as a data structure called a
h searchthroughdatacollectedfromtelescopestodeter- data cube. A data cube can be thought of as a three-
p mine the location and properties of these objects. A
dimensional grid, where the three dimensions usually
- common technique is to use a computer program to
o are right ascension, declination and either frequency
search astronomical data, followed by manual inspec-
r or velocity. Each cell contains the flux for the area in
t tion to confirm sources of electromagnetic radiation. theskyandthespectralrangerepresentedbythatcell.
s However,usingpeopletovisuallysearchtelescopedata
a Newtelescopes,suchasASKAP,willproduceordersof
is time-consuming and expensive.
[ magnitude more data than previous installations. For
Telescope installations that will be operational in
example, the HIPASS survey used the existing Parkes
1 the near future, such as the Australian Square Kilo-
telescope to produce a 111 MB data cube every ten
v metreArrayPathfinder(ASKAP)ortheSquareKilo-
hours (Barnes et al. 2001), for a data output rate of
0 metreArray(SKA),willproduceordersofmagnitude
24.7Kbps. Bycomparison,ASKAPiscapableofpro-
9 moredatathanprevioustelescopes(DeBoeretal.2009).
ducing a data cube of as much as 4 TB in size every
6 Partially or completely manual searching techniques
eight hours, depending on the configuration (DeBoer
3 will not scale to handle such a large volume of infor-
et al. 2009), for a data output rate of 1.14 Gbps, an
1. mation. Therefore a purely automated approach is increaseofoverfourordersofmagnitudecomparedto
0 needed. As part of designing an algorithm that accu- HIPASS.
2 rately and efficiently searches large amounts of spec-
ASKAP’s increased data rate will require faster
1 tral line telescope data, it is necessary to determine
sourcefindingprogramsinordertobeabletoprocess
: the accuracy of that algorithm. This paper describes
v thedatafastenoughtokeepup. Inordertoachievethe
themethodusedbytheSourceFinderAccuracyEval-
i required data processing rate, source finders will need
X uator, a program which measures the accuracy of a
to exploit the computational power of modern super-
source finding program.
r computers. Previous work (Westerlund 2010) shows
a
thatinordertodothis,anefficientparallelimplemen-
2 Background tation of the source finding algorithm must be made.
Additionally, such data sets will contain many more
A source finder is a program that searches radio as- sources than previous surveys, such that manual con-
tronomydata,intheformofadatacube,andreturns firmation of all the sources will require an extremely
acatalogueofthesourcesitfinds,withtheparameters largeamountofwork. Therefore,inorderforthedata
of each source, such as the objects’ position and flux. to be processed in a reasonable amount of time and
Surveys such as HIPASS use a combination of auto- effort from observers, the accuracy of the automated
matedsourcefindingprogramsandmanualinspection source finder must be sufficiently high that manual
to provide a source list. In the case of HIPASS, two confirmation is unnecessary.
source finding programs, MultiFind and TopHat were Asourcefinder’saccuracyhasthreeaspects: com-
run on the telescope data and their candidate cata- pleteness,reliabilityandparametercorrectness. Com-
1
2 Publications of the Astronomical Society of Australia
pleteness and reliability describe the accuracy with
which the source finder has found objects in the data
set, as noted by Zwaan et al. (2004). Completeness is
Determine
theportionofobjectsinthedatacubethatarefound
Potential
by the source finder program. If n is the number of
d Matches
sourcespresentinthedatacubeandn isthenumber
r
of real sources that have been located by the source
finder, then the completeness, C, can be calculated
using the following expression:
Determine
n
C = r (1) Optimum
nd Matching
Reliability is the portion of the detections from the
source finder that exist in the data cube. The relia-
bility, R, canbecalculatedin asimilarmanner to the
completeness. If n is the number of all the objects
t Analyse
thatthesourcefinderhaslocated,bothtrueandfalse
Matches
positives, then the reliability can be calculated using
the expression:
n
R= r (2)
n
t
It is often more useful to describe these values as
Figure 1: Process flowchart. This diagram shows the
a function of the parameters of the sources, in order
steps in determining the accuracy of a source finder
to analyse how the source finder performs for differ-
for a particular data cube. The sources are first com-
ent types of sources. In particular, the completeness
pared to determine potential matches, then the po-
and reliability of the sources are often calculated as a
tential matches are sorted into final matches, and fi-
function of peak flux, integrated flux, signal to noise
nallythesefinalmatchesareanalysedtodeterminethe
ratio(SNR),orspectralwidth. Thecompletenessand
source finder’s accuracy.
reliabilitycanbecalculatedbybinningthesourcesby
theparameterinquestion,orbycalculatingthecumu-
lative completeness and accuracy. The binning func-
tionmusthaveenoughbinstoshowthedistributionof
The accuracy of a source finder is defined relative to
sources,butnotsomanyastounder-samplethebins.
the reference catalogue. That is, the reference cata-
Usingasuitablebinningfunction,thecompletenessof
logue is assumed to contain exactly the list of objects
bin i, C , and the reliability of bin i, R can be calcu-
i i that are in the corresponding data cube. For exam-
lated using the number of real sources that have been
ple, the reference catalogue may be a list of sources
found by the source finder in bin i, n , the number
r,i used to create a simulation, a list of artificial sources
of sources in the data set that are in the ith bin, n
d,i addedtoarealdatacube,oralistofsourcesfoundby
and the total number of sources found by the source
a previous examination of the data cube. The follow-
finderthatareinbini,n ,accordingtothefollowing
t,i ing section will describe the method used by SFAE to
equations:
cross-match a reference catalogue to a test catalogue.
n
C = r,i (3)
i n
d,i
n
R = r,i (4)
i n
t,i 3 Method
Parameter correctness describes how accurate the
source finder is in calculating the parameters of the
sources it finds. These parameters include not only ThissectiondescribesthemethodbywhichSFAEmea-
the position and size of the source, in the spatial and sures the accuracy of a source finder. SFAE requires
spectral dimensions, but also properties such as peak severalstepstoanalysetheaccuracyofasourcefinder,
flux. The accuracy for the parameterisation may be asshowninFigure1. Thefirststepistocomparethe
described in terms of the difference between the real positionandfrequencyofeachreferenceobjecttothose
value of a source, and the corresponding value mea- of each test object, and the differences between these
sured by the source finder. are used to place the objects into potential matches.
These accuracy measures are calculated by exe- Then, if there are any sets of potential matches that
cuting the source finder on data cubes for which the are not one-to-one, these sets of objects and their dis-
catalogue of objects is already known, and compar- tances are used to determine an optimum matching
ing this catalogue to the catalogue produced by the between reference objects and test objects. Finally,
source finder. For the purposes of this paper, the the matched sources may be analysed to determine
known source catalogue shall be called the reference theaccuracyofthetestcatalogue,asafunctionofthe
catalogue and the catalogue produced by the source propertiesofthesources. Eachofthesestepswillnow
finder shall be called the test catalogue, respectively. be described in greater detail.
www.publish.csiro.au/journals/pasa 3
3.1 Step One: Potential Matches
Thefirststepofcomparingreferenceobjectstotestob-
jectsistodeterminewhichpairscouldpossiblyreferto
thesameobjectinthedatacube. SFAEappliesafunc-
tion that determines whether a given reference source
and test source are close enough to be considered po-
tentially the same object. Each pair of reference and A
test objects that has been accepted by the potential B
matchfunctionaredeemedapotential match. Ifaref-
erence source and a test source have only each other 1 2
as potential matches, they can be marked as a final
match. If a source from either catalogue has no po-
tentialmatches,itisconsideredanunmatchedsource.
If one or both of them have more than one poten-
tial match, then a set of objects is created containing
the pair of objects. The potential matches of the ob-
Figure 2: Confused Sets. It is possible for objects to
jects inside the set are then added to the set, until all
be in more than one potential match. Suppose that
the potential matches of all the objects in the set are
objects A and B are reference sources, and objects 1
themselvesinthatset. Suchasetofobjectsiscalleda
and 2 are test sources. The lines show the thresholds
confusedset,anexampleofwhichisshowninFigure2.
used to determine potential matches. These objects
The objects in each confused set are sorted into their
would be placed in the same a confused set, and a
final matches in the next stage of the algorithm.
more sophisticated algorithm is needed to determine
The function used by SFAE applies three thresh-
which reference object matches which test object.
oldstothepropertiesofareference-testobjectpairto
determine if they are a potential match of each other.
The first two thresholds check if the spatial distance
betweenthetwoobjectsislessthanorequaltothesize difficultforSFAEtodeterminewhethersuchatestob-
ofthetelescopebeam,alongthebeam’smajorandmi- jectgenuinelymatchesallthecorrespondingreference
noraxesrespectively. Thethirdcomparisonisapplied sources,oriftherearereferencesourcesthathavenot
to the frequency of the entries, where the two sources been found. SFAE uses the reference catalogue as the
are considered close enough if the difference between truth, so it will respond to this situation by matching
thefrequencyofthetwoentriesislessthanorequalto oneofthereferenceobjectstothetestobject,andleav-
the full width at half maximum (FWHM) of the cat- ing the other reference objects unmatched. Leaving
alogue source. Ideally this function should be based the extra sources unmatched conveys the information
on the error of the test detections relative to the ref- that the source finder has incorrectly merged the two
erence catalogue being considered, but the size of the objects. It is possible to account and correct for this
error may not be known. If such information is avail- byexaminingthemergedtestsourceanditspotential
able about the error between the two catalogues, an matches, and using their parameters to manually de-
alternativepotentialmatchfunctionthatincorporates cidewhetherornotthereferencesourcesareallpartof
this information may be used instead. the test source. The same applies, in reverse, to split
objects.
3.2 Step Two: Optimum Matching The optimum matching is found by employing a
morefine-grainedtechnique,theHungarianAlgorithm
The second step of cross-matching the reference and (Kuhn1955). TheHungarianAlgorithmconsidersthe
testcataloguesistomakefinalpairsfromtheconfused entries as a weighted bipartite graph, where each ver-
sets formed in the previous step. This step enforces tex in the graph corresponds to either a reference ob-
a matching scheme where each object is either in a ject or a test object. The matching produced by this
single match, or no matches at all. Additionally, for algorithm is one that has the maximum number of
any reference object a that is matched to test object matches, and among all the possible ways this match-
b, b is also matched to object a. Final matches are ing can be achieved, it picks the one with the least
only allowed between reference-test object pairs that total distance between the paired objects. Edges ex-
have been previously marked as potential matches, to ist between any and all pairs of nodes that have been
ensure that SFAE only makes valid matches. identified as potential matches in step two.
Twoparticularsubsetsofconfusedsetsthatareof The weights between the nodes are dimensionless
interest are the cases of merged or split objects. Re- values that define how different the reference and test
spectively, this is where the source finder has taken objects are to each other. If ∆a and ∆b are the dis-
two sources and reported them as a single object and tances between the two objects along the major and
where the source finder has found a single source and minoraxesofthebeamrespectively,w andw arethe
a b
reported it as multiple objects, resulting in a reduced widthsofthebeamalongthemajorandminoraxesof
completeness. In the case of a merged object, this the beam, ∆f is the difference in frequency position
may appear to SFAE as a confused set with two or between the two objects, and w is the frequency res-
f
more reference objects and a single test object. It is olution of the telescope or simulation used to create
4 Publications of the Astronomical Society of Australia
the data cube, then the distance weighting function ters. This information can be used to identify what
between two objects x and y is defined as δ : types of sources are missed by the source finder. The
x,y
differences between true and false detections can be
(cid:115)
(cid:16)∆a(cid:17)2 (cid:16)∆b(cid:17)2 (cid:18)∆f(cid:19)2 used to suggest selection functions that could be ap-
δx,y = w + w + w (5) plied to the catalogue produced by the source finder,
a b f removing suspected false detections in order to im-
prove its reliability.
If the frequency resolution is unknown, then it may
This section has shown how SFAE program deter-
be possible to approximate w with the value of the
f
mines the accuracy of a source finder. If there are
channel width.
N objectsinthereferencecataloguelistforthesource
Becauseonlypotentialmatchesarebeingcompared
cube,thenreadinginthedatahasatimecomplexityof
by the distance metric function, the frequency dif-
O(N). Comparingthereferenceobjectstothetestob-
ference between different sources is small and there-
jectsinSteponehasatimecomplexityofO(N2). The
fore the difference in frequencies is a good approxi-
optimum matching algorithm in Step two has a time
mation for the difference in distance between the ob-
complexity of O(N3) (Edmonds & Karp 1972). The
jects. It would also be possible to include other pa-
time complexity of Step three varies with the type of
rametersinthedistancefunction,suchaspeakfluxor
data analysis done. Calculating the completeness and
frequency width, which would cause the algorithm to
reliabilityasafunctionofsourceparameterhasatime
favour matching objects with similar parameters.
complexity of O(NlogN) and the parameter compar-
However, the decision was made to not use other
ison analyses each have a time complexity of O(N).
parameters in this equation because doing so would
Therefore, this program has an overall time complex-
causeerrorsintheparameterisationofthetestobjects
ity of O(N3). The information produced by SFAE is
toaffectthematchingofthereferenceandtestobjects.
demonstrated in the following section.
Additionally,incorporatingotherparameterswouldre-
quire an appropriate weighting scheme to ensure that
alltheparametersarefairlyandaccuratelyconsidered
4 Results
in the distance between the two objects. The weight-
ingfunctionwouldneedtoaccountforthesystematic
and random errors for each parameter used to calcu- The SFAE program was tested, both to ensure that
late the distance between two objects. The distance it correctly measures the accuracy of a source finder
function is limited to just the position and frequency and to demonstrate the information that it can give.
because together these three parameters can uniquely The test data set is based on a FITS format ASKAP
identify a source and because they are the core func- simulation data cube (Whiting 2010). This simulated
tionofasourcefinder: toidentifytheobjectsinadata data cube covers an area 1.42×1.42 degrees with an
cube and report their positions. angularresolutionof30arcsec,apixelsizeof10arcsec
and a frequency range of 1327.39 to 1422.0175 MHz.
The frequency resolution and channel width are both
3.3 Step Three: Analyse Matches
92.5 kHz. This results in a data cube composed of
Now that it is known what reference sources corre- 512×512×1,024 voxels. This data cube is intended
spond to what test sources, it is possible to analyse foruseintestingsourcefinders,soithasartificiallyre-
thesematchesinordertocalculatetheaccuracyofthe ducednoise,inordertoincreasethenumberofsources
source finder being tested. The completeness values visibletothesourcefinder. Thesourcecatalogueused
canbecalculatedusingEquations3and4,settingn as the input for the simulation is from the SKADS
r,i
to the number of matched sources that are in the ith S3-SAX simulation (Obreschkow et al. 2009). The
bin. Forconsistency,theparametersusedtodetermine sources used are mostly point sources but there are
whatbinaparticularsourceisinshouldcomefromthe someextendedsources. Thetestdatacubeissupplied
referencecataloguewhencalculatingcompletenessand as a dirty image, so prior to running the source finder
thetestcataloguewhencalculatingthereliability. The the image was cleaned using miriad (Sault & Killeen
value of n is the number of objects in the reference 2008), using the full PSF data cube.
d,i
catalogue, both matched and unmatched, that are in Thiscleanedimagestillhasanumberofsidelobes,
theithbinandn isthenumberoftestobjects,both fromtwosources. Thefirstsourceofsidelobesisfrom
t,i
matched and unmatched, that are in bin i. The accu- errors in the cleaning process, as the PSF cube pro-
racy of the parameterisation of the source finder can videddoesnotexactlymatchthePSFsidelobespresent
bemeasuredbytakingeachpairofmatchedreference- in the dirty image, due to what is suspected to be an
test sources, and comparing the parameter of the test errorinthepipelinecreatingthissimulateddata. The
source to the value of the reference source. second is that there are a number of sidelobes from
Inadditiontocalculatingthecompletenessandre- sources that are outside the data cube, and therefore
liability of a source finder, these matches may also be cannot be removed by the clean task. The sidelobes
used to determine the difference between the sources cause a number of erroneous detections, as the source
thatwereandwerenotfoundbythesourcefinder,and finder tested finds peaks in the sidelobe patterns. In
thetrueandfalsedetectionsofthesourcefinder. The the general case, data sets may contain errors from
differences for one or more parameters can be shown theirconstruction,suchastelescopenoise,pipelineer-
by plotting the distribution of the matched and un- rors, simulation errors, and so on. Source finders will
matched test sources, as a function of these parame- need to be able to operate on data sets with such er-
www.publish.csiro.au/journals/pasa 5
rors. When using SFAE to determine the accuracy of
a source finder, the data cube should be given to the
source finder in the same manner as the source finder
wouldreceivedatacubesinaproductionenvironment. 1 40
ThesourcefinderbeingtestedisDuchamp(Whit- 35
ing 2008), version 1.1.13. However, the information 0.8 30 Bin
given here should be considered a demonstration of n
STFhAeEtesrtatchaetraltohgauneawatessotbotfatinheedabcycuerxaeccyutoifnDgDucuhcahmapm.p pleteness 0.6 2205 Objectsi
opnartahmeectleerasnteodcdaaltcaulcautbeei.tsDluiscthoafmdpetuescetsioansn,uimnbaedrdoi-f Com 0.4 15 berof
10 m
tiontothedatacube. Theparametersspecifiedtocre- 0.2 Nu
5
ate the detection list used for this test use the default
parameters,withtheexceptionofusinga`trouswavelet 0 0
0.1 1 10 100 1000
reconstruction in three dimensions. Duchamp’s de-
CataloguePeakSNR
fault parameter set uses a three sigma threshold for
Completeness
the final objects.
NumberofObjectsinBin
The reference catalogue list was obtained by run-
ningDuchamponacorrespondingmodelimageofthe Figure 3: Completeness by Peak Flux. This graph
dirty cube. This model data cube contains only the shows the completeness of the test source finder as a
sources present in the test data cube, with zero noise. function of the peak SNR of an object. The abscissa
In obtaining the reference catalogue, the default pa- is the peak SNR range of that bin, with a logarith-
rameters for Duchamp were used except for using a mic scale. The histogram shows the completeness of
direct flux threshold of 1µJy. It can be reasonably the source finder, for that bin. The error bars show a
assumed thatthe source finder derives the correct pa- one standard deviation error, as determined by boot-
rametersfromthemodelcube,anditishowthesource strap resampling. The dotted line shows the number
finder parameterises the noisy image that is of inter- of objects in each bin. This graph uses the reference
est. Whether or not a source finder does, in fact, cor- object’s value for the peak flux of a source.
rectly parameterise a model image can be determined
by using several different source finders to search and
parameterise the model image, then comparing their
results to see if they calculate the same values.
The reference catalogue has 235 entries and the
test catalogue has 384 entries. Of these, there are 190
matchedpairs,usingthesevalueswithequations1and
2 results in an overall completeness of 70.6% and a 1 140
reliability of 49.9%. The one-to-one matching did not
120
exclude any sources from being matched. 0.8 Bin
Thecompletenessofthesourcefinderasafunction 100 in
oufsinthgeEpqeuaaktioSnNR3. oTfhae SsoNuRrceofisa sshouowrcne iisncFalicguulraete3d, ability 0.6 80 Objects
bydividingthepeakfluxofasource,inunitsofJyper Reli 0.4 60 of
beam,bythermsnoise,ascalculatedfromthecleaned 40 mber
data cube using the miriad routine histo. Likewise, 0.2 Nu
20
thereliabilityofthetestsourcefinderasafunctionof
peak SNR is shown in Figure 4, using Equation 4. 0 0
1 10 100 1000
Theerrorbarsweredeterminedusingbootstrapre- DetectedPeakSNR
sampling. Foreachbin,anobjectisrandomlyselected
Reliability
from the set of objects in that bin and it is recorded NumberofObjectsinBin
whether or not it has a matching test source. This
procedure is repeated a N times, with N being the Figure 4: Reliability by Peak Flux. This graph shows
number of objects in the bin considered. Using Equa- thereliabilityofthetestsourcefinderasafunctionof
tion 3, the completeness is then calculated from the the peak flux of an object. The abscissa is the peak
objects that had been selected. A total of 1000 com- flux range of that bin, with a logarithmic scale. The
pletenessvaluesarecalculatedperbin,andthe15.9% histogramshowsthereliabilityofthesourcefinder,for
and84.1%percentilecompletenessvalueswereusedas thatbin. Theerrorbarsshowaonestandarddeviation
the lower and upper standard deviation, respectively. error, as determined by bootstrap resampling. The
Note that if the sources in a particular bin are either dotted line shows the number of objects in each bin.
allmatchedorallunmatched,thenthatbinwillalways This graph uses the test object’s value for the peak
have the same completeness (either 100% or 0%), no flux.
matterwhichobjectsarerandomlyselected,andthere-
fore that bin will not have any error bars plotted.
The distribution of the reference and test sources
6 Publications of the Astronomical Society of Australia
1.2
d
overe 1
Rec
x 0.8
u
Fl
ated 0.6
gr
Inte 0.4
of
n
Portio 0.2
450
0
400 0.01 0.1 1 10
350 CatalogueIntegratedFlux(Jykms−1)
−1) 300
kms 250 Figure 6: Integrated Flux Parameter Comparison.
(
M 200 Thisgraphshowshowaccuratelythesourcefinderde-
H
W 150 termines the integrated flux of sources. The abscissa
F
100 isthereferencecatalogueintegratedfluxofamatched
50 source. The ordinate is the portion of the integrated
flux of the source found by the source finder.
0
0.001 0.01 0.1 1 10
IntegratedFlux(Jykms−1)
MatchedSources as a function of their integrated flux and line width
UnmatchedSources(FalseNegatives)
is shown in Figure 5. The matched sources are those
(a) ReferenceSourcesbyIntegratedFluxandLineWidth that are both present in the reference catalogue and
havebeenfoundbythesourcefinder. Theunmatched
400 reference sources are those that have not been found
by the source finder, and the unmatched test sources
350
arefalsedetections. Theabscissaistheintegratedflux
300
ofasource,andtheordinateistheFWHMsizeofthe
−1ms) 250 sources, using the reference catalogue parameters.
(k 200 The matches produced by SFAE can be analysed
M
WH 150 to determine how accurately the source finder param-
F eterises the sources it finds. One such analysis can
100
be seen in Figure 6. This compares how well the test
50 sourcefindermeasurestheintegratedfluxofasource.
0 The abscissa is the integrated flux of a source in Jy
0.0001 0.001 0.01 0.1 1 10 kms-1, as listed in the reference catalogue. The ordi-
IntegratedFlux(Jykms−1)
nateistheportionoftheintegratedfluxofthesource
MatchedSources that was detected by the source finder.
UnmatchedSources(FalsePositives)
TheaccuracyofSFAEitselfwasmeasuredbyman-
(b) TestSourcesbyIntegratedFluxandLineWidth uallymatchingentriesfromthereferencecatalogueto
theentriesinthetestcataloguelist. Thiswasdoneby
Figure 5: These graphs show the distribution of plottingthepositionsofthereferenceandtestobjects.
matched and unmatched sources as a function of Theneachreferenceobject’spositionwasobservedand
their integrated flux and line width. The abscissa is it was recorded either which test object was closest
the integrated flux of the source and the ordinate is to the reference object in question, or that the refer-
the FWHM of the source. The first plot shows the enceobjecthadnonearbytestobject. Thelimitsused
completeness of the source finder using the reference were the size of the major axis of the beam in spatial
sources. The second plot shows the reliability of the distance, and the resolution of the data cube, in fre-
source finder, using the test parameters. quency. Uponcomparingthismanualmatchingtothe
one produced by the SFAE algorithm, SFAE matched
each of the 235 objects in the reference catalogue to
the same test object as the manual matching.
5 Discussion
TheresultsoftheSFAEprogramshowavarietyofin-
formationabouttheperformanceoftheexamplesource
finder Duchamp and the parameter set used. The
overall completeness and reliability figures show how
www.publish.csiro.au/journals/pasa 7
successfulthetestsourcefinderwasoverallincorrectly 2009). With ASKAP’s instantaneous field of view of
locating the sources in the cube. More useful are the approximately30 deg2 therewillbearound1200sep-
completeness and reliability values plotted as a func- arate fields for the entire survey (Warren 2011) and
tion of different parameters. Figure 3 shows the por- therefore approximately 500 sources per data cube.
tion of sources found as a function of the peak flux. From the recorded running time of 2.138s for 2345
Thisplotcanshowforwhichpeakfluxvaluethesource objectsandthetimecomplexityabovethissuggestsa
finder can be considered complete, before the rate at runningtimeontheorderof18sforasingleASKAP-
which the source finder detects objects drops off. The scaledatacube. Therefore,nofurtherattemptstoim-
reliability of the source finder as function of peak flux prove the execution speed are necessary for the scale
isshowninFigure4. Thisanalysisshowsatwhatpeak of data currently being used.
flux value the source finder can be considered to find
primarily genuine detections, rather than false ones.
Together,theinformationinthesegraphscanbeused, 6 Summary
for example, to determine the flux limit of the survey
the source finder is used for. TheSourceFinderAccuracyEvaluatorprovidesanau-
SFAE’s ability to analyse a source finder’s com- tomated,deterministicmethodtodeterminetheaccu-
pletenessandreliabilityisfurtherdemonstratedinFig- racy of a source finder. The program does this by
ure5. Theseplotshowshowreferenceandtestsources takingtheresultsofthesourcefinderinquestionfrom
are distributed as a function of both their integrated a data cube with a known source list, and comparing
flux and their FWHM. Figure 5(a) shows the differ- the results against a known source catalogue. Using
encebetweenthesourcesthathaveandhavenotbeen the two catalogues of sources, and header information
found by the source finder. This information can be fromthedatacubes,SFAEcreatesalistofwhichrefer-
usedtoidentifypotentialareasofimprovementtothe ence object-test object pairs refer to the same source,
searching algorithm and for what types of sources the if any.
producedcataloguewillbeincomplete. Thedifferences ThelistofmatchesproducedbySFAEmaythenbe
betweentrueandfalsedetectionscanbeseeninFigure analysedtodeterminethecompletenessandreliability
5(b). Totheextentthatthetestdatacuberepresents of the source finder. The completeness and reliability
realobservationdata,thesedifferencescanbeusedto figures can be calculated for both the catalogues as
determine a selection function to apply to the source a whole, and as a function of the parameters of the
finder’s output, in order to improve its reliability. sources in the catalogues. Breaking this information
Theobjectsandtheirmatchescanbeusedtocom- down by the properties of the sources allows SFAE
pare how well the source finder determines the pa- to provide a more detailed account of the accuracy of
rameters of the sources it finds. Figure 6 shows what the source finder, characterising the types of sources
thesourcefindercalculatedaseachobject’sintegrated thesourcefindercanaccuratelyfind,andthetypesof
flux,comparedtowhatwaslistedinthatobject’srefer- sources it either misses or erroneously locates.
encecatalogueentry. Thisfigureshowsthatthesource Finally, the accuracy of the source finder’s para-
findercalculatedtheintegratedfluxofalmostallofits meterisation algorithms is determined by comparing
sources as lower than the actual integrated flux, and the value of the selected parameter from each source,
that the reduction in the listed flux increases as the asreportedbythesourcefinder,againstthesameval-
reference value for the integrated flux decreases. The ues as recorded in the reference catalogue list. There-
errorshownsuggeststhatthesourcefindershouldim- fore, SFAE provides a variety of information useful to
prove its existing parameterisation, use a new algo- determining the accuracy of source finders, both in
rithm, or use this error information to devise a statis- terms of the sources they find, or don’t find, in addi-
ticalcorrectiontoapply. Theerrorsintheparameter- tiontohowaccuratelytheycharacterisethesesources.
isationshouldalsobekeptinmindwhenassessingthe
abovemeasuresofaccuracyasafunctionoftheparam- 6.1 Future Work
eters reported by the source finder. Similar analyses
canberunonotherparameterstodeterminethechar- SFAE could be modified to run its analysis over mul-
acteristicoftheerrorinthesourcefinder’sanalysisof tiple data cubes and their corresponding source lists,
that parameter. tocollatesourcefinderaccuracydatafromalargetest
The overall time complexity for SFAE is O(N3), data set. SFAE could also use a more sophisticated
where N is the number of objects in the data cube approach to dealing with instances where the source
catalogue. Running SFAE for the 235 objects in the finder splits a reference catalogue source into two or
ASKAP simulated data cube resulted in a running more separate test objects, or when the source finder
time of less than three seconds, on a single quad-core mergesmultipleobjectsintoasingletestsource. Inthe
CPU system. As the running time for Duchamp on future, SFAE and its results could be used to provide
the same system for the test data cube is 90 minutes, feedback for computer learning-based source finding
over three orders of magnitude longer, the running algorithms, and used to train them.
time of SFAE can be considered negligible in compar- Asaspecialcase,itmaybepossibletoextendthis
ison. On the scale of ASKAP, the Widefield ASKAP work for use in cross-matching catalogues, with cer-
Legacy L-band Blind All-sky surveY (WALLABY) is tain alterations. This application is beyond the scope
expectedtofindapproximately500000galaxiesacross of this work, and the method may not provide com-
anestimatedareaof3πsr(Koribalski&Staveley-Smith plete matching, as matching between sources in dif-
8 Publications of the Astronomical Society of Australia
ferent frequencies may not have a one-to-one relation Whiting, M. 2010, available from: http:
to each other. For example, a single detection in one //www.atnf.csiro.au/people/Matthew.Whiting/
bandmaymatchmultipleobjectinanotherband. The ASKAPsimulations, Last Visited 15 December 2010
changesnecessarywouldinvolveusingdifferentthresh-
old and distance functions. These would need to use Zwaan,M.A.,etal.2004,@@styleMNRAS,350,1210
distance, red shift or velocity in place of frequency.
Thethresholdsappliedtodeterminepotentialmatches
and the weightings applied to different parameters in
the distance function would need to be set depending
ontherelativeerrorsparticulartothecataloguesbeing
matched.
Acknowledgement
The authors thank Matthew Whiting for writing the
DuchampsourcefinderandcreatingtheASKAPsim-
ulationdatausedinthiswork. Theauthorsalsothank
AttilaPoppingforhisfeedbackinpreparingthiswork.
style=
References
Barnes, D. G., et al. 2001, @@styleMNRAS, 322, 486
DeBoer, D., et al. 2009, Proceedings of the IEEE, 97,
1507
Edmonds, J., & Karp, R. M. 1972, J. ACM, 19, 248
Koribalski, B. S., & Staveley-Smith, L. 2009, Pro-
posal for WALLABY: Widefield ASKAP L-band
Legacy All-sky Blind surveY, Available from
http://www.atnf.csiro.au/research/WALLABY/
proposal.html, Last Visited 9 September 2011
Kuhn, H. 1955, Naval research logistics quarterly, 2,
83
Meyer, M., et al. 2004, @@styleMNRAS, 350, 1195
Obreschkow, D., Klo¨ckner, H.-R., Heywood, I., Lev-
rier, F., & Rawlings, S. 2009, The Astrophysical
Journal, 703, 1890
Sault, B., & Killeen, N. 2008, Miriad Multichannel
Image Reconstruction, Image Analysis and Display
Users Guide, Available from http://www.atnf.
csiro.au/computing/software/miriad/,LastVis-
ited 15 December 2010
Warren,B.E.2011,SkyTessellationPatternsforField
Placement for the All-Sky HI Survey WALLABY,
Tech. rep.
Westerlund, S. 2010, iVEC Internship Report, Avail-
able from http://www.icrar.org/__data/assets/
pdf_file/0006/1750866/stefan_westerlund_
ivec_report.pdf, Last Visited 21 July 2011
Whiting, M. 2008, Galaxies in the Local Volume, ed.
B. Koribalski & H. Jerjen, Astrophysics and Space
Science Reviews (Springer), 343–344