Draftversion January17,2017 PreprinttypesetusingLATEXstyleemulateapjv.5/2/11 FROM BLACKBIRDS TO BLACK HOLES: INVESTIGATING CAPTURE-RECAPTURE METHODS FOR TIME DOMAIN ASTRONOMY Silas G. T. Laycock 1 LowellCenterforSpaceScienceandTechnology, 600SuffolkSt,Lowell,MA,01854 Draft version January 17, 2017 ABSTRACT 7 In time domain astronomy,recurrenttransients presenta specialproblem: how to infer total popu- 1 0 lations from limited observations. Monitoring observationsmay give a biassedview of the underlying 2 populationduetolimitationsonobservingtime,visibilityandinstrumentalsensitivity. Asimilarprob- lemexistsinthelifesciences,whereanimalpopulations(suchasmigratorybirds)ordiseaseprevalence, n mustbeestimatedfromsparseandincompletedata. TheclassofmethodstermedCapture-Recapture a isusedtoreconstructpopulationestimatesfromtime-seriesrecordsofencounterswiththestudypop- J ulation. This paper investigates the performance of Capture-Recapture methods in astronomy via a 3 seriesofnumericalsimulations. TheBlackbirdscodesimulatesmonitoringofpopulationsoftransients, 1 in this case accretingbinary stars(neutronstar orblack hole accretingfroma stellar companion)un- der a range of observing strategies. We first generate realistic light-curves for populations of binaries ] E with contrasting orbital period distributions. These models are then randomly sampled at observing cadences typical of existing and planned monitoring surveys. The classical capture-recapture meth- H ods, Lincoln-Peterson, Schnabel estimators, related techniques, and newer methods implemented in . h the Rcapture packageare compared. A generalexponentialmodel basedon the radioactivedecay law p isintroduced,anddemonstratedtorecover(at95%confidence)the underlyingpopulationabundance - and duty cycle, in a fraction of the observing visits (10-50%) required to discover all the sources in o the simulation. Capture-Recaptureis a promising addition to the toolbox of time domain astronomy, r t and methods implemented in R by the biostats community canbe readily called from within Python. s a [ 1. INTRODUCTION of variability that result in the classification of “tran- 1 Thanks to the decade-plus durations of high en- sient” are characterized by large amplitude (factor of v ergy missions such as Chandra, XMM-Newton, Inte- 10 or more), and prolonged periods of quiescence (days- 1 gral, andRXTE, long-termX-raymonitoringcampaigns years) between high luminosity states. Accreting Bina- 0 ries(seee.g. Lewin & van der Klis2006)andFlareStars have been conducted for a number of galaxies and 8 (Favata & Micela 2003) form the two largest classes of star clusters. In the Milky Way, most of this mon- 3 recurrent transients in the X-ray regime. High mass X- itoring effort has been focussed on the Galactic Cen- 0 ray binaries (HMXBs) show large-amplitude outbursts ter (e.g. Markwardt & Swank 2010, Muno et al. 2004, . (duration: days-weeks)atintervalsrelatedto the orbital 1 Hong et al. 2011, Wijnands et al. 2006, Kuulkers et al. 0 2007); Galactic Plane (e.g. Grindlay et al. 2005, period (months). The strength of these outbursts is fre- 7 Motch et al.2010);andtheOrionNebula(Getman et al. quentlymodulatedbyslower(years)changesinthemass 1 2005). Beyond the galaxy, extended monitoring cam- outflow rate of the companion star. The exact interplay v: paigns have been performed for the Small Magellanic between these timescales is tied to complex physical in- teractions between the two stars which are beyond the i Cloud (e.g. Galache et al. 2008, Eger & Haberl 2008, X Coe et al. 2010) and the local group galaxies M31 scope of this paper (see e.g. Reig 2011) for a review). Suffice to say that HMXB activity can be modulated r (Williams et al. 2006), M33 (Williams et al. 2008), and a M81(Sell et al. 2011). Such surveys turn up large num- in a quasi-periodic way, or completely suppressed. Low mass X-ray binaries (LMXBs) and Cataclysmic Vari- bers of transientsourcesbelonging to astrophysicallyin- ables (CVs) tend to remain in a persistent low luminos- teresting classes, such as flaring stellar coronae, and ac- ity or quiescent state, punctuated by rare outbursts of cretingbinariescontainingblackholes,neutronstarsand weeks-monthsduration. These events: bursts, X-ray no- whitedwarfs. Thepresentworkismotivatedbyanatural vae, dwarf novae, classical novae, are not periodic, but questionposedbythisprofusionofnewtransients: “How are thought to recur on certain characteristic timescales to count a population of recurrent transient sources?” (Mukai et al. 2008). Active galactic nuclei (AGN) can Due to their underlying physics, high energy sources also be considered transients with very long timescales tend to exhibit very large changes in luminosity on vir- (decades-millennia). Atopticalwavelengthsazooofir- tuallyalltimescales. Althoughnosetdefinitionof“tran- regular variables, pulsating stars, eclipsing binaries etc. sient” exists, the term is universally applied to objects might appear transient under certain observing condi- that appear in only a subset of repeat observations (of tions (low cadence, low sensitivity). Supernovae and the same sky region), and are absent in others, de- GammaRayburstsdonotcount,beingstrictlyone-time spite sufficient sensitivity to detect them. The types events. 1Department of Physics and Applied Physics, Olney Science Findingthetotalpopulationofanyoftheaboveclasses Center,UniversityofMassachusetts Lowell,MA,01854, USA of transients in a cluster or galaxy, is thus an unsolved 2 Laycock problem, current cases of interest include the SMC, in as the Lincoln-Petersen index, is given by Equation 1, which ∼100 HMXBs are currently known, and the over- whereN isthetotalpopulationestimatederivedfrom LP abundance of X-ray transients in the Galactic Center N (individual encountered on occasion 1), N (individ- 1 2 (Muno et al. 2005). uals encountered on occasion 2) and N (individuals 1,2 The astronomical transient-counting problem remains encountered on both occasions). The modified form due relatively unexplored, despite rigorous mathematical toChapman(1951)avoidsdivergenceinthecaseofsmall analysis of long-durationsurveys dating back to at least numbers, and is normally distributed (provided the as- Herschel (1783). Existing long-duration datasets on sumptionsofsampleindependenceandcaptureprobabil- the whole sky include the Harvard photographic plate ity are met). The modified Lincoln-Petersen index, and collection (1880s-1980s) currently undergoing digitiza- its 1σ uncertainly are given by Equations 2 & 3. tion and data-mining by the DASCH2 collaboration (Grindlay, Tang, Los, & Servillat 2012; Laycock et al. N1×N2 N ≃ (1) 2010); current facilties generating large amounts of LP N 1,2 suchobservationsincludeASAS/ASSAS-SN (Pojmanski 2002; Holoien et al. 2017), PalomarTransientFactory 3, (N1+1)×(N2+1) N = −1 (2) MASTER (Lipunov et al. 2010), and the upcoming LP (N +1) 1,2 Large Synoptic Survey Telescope (LSST)4. The problem of deducing population parameters from (N +1)(N +1)(N −N )(N −N ) repeated isolated observations is not unique to astron- σ2 = 1 2 1 1,2 2 1,2 (3) omy, and in fact an established approach to similar LP (N1,2+1)(N1,2+1)(N1,2+2) problems has been long-used in field-biology: Capture- Whenmorethantwosamplesareavailable,Equation2 Recapture analysis. In recent years, sophisticated vari- can be applied to adjacent pairs, and a running average ants of this method have been developed in the fields of constructed. This approach does not make maximum ecology,epidemiologyandbiostatistics. Typicalapplica- useoftheavailableinformation,althoughitdoestendto tionsinvolve: smallanimallive-trapping(hencetheterm converge. Provided the conditions of closed population capture-and-recapture); estimation of fish populations and equal capture probability apply across individuals fromcatchdataSchnabel(1938);migratorybirdsurvival and observations it will give an unbiassed estimate. rates (e.g. Blackbirds. Miller, Aradis & Landucci 2003) from tagging studies; and disease prevalence inferred by 2.2. Schnabel Estimator comparingpatientIDsinthemedicalrecordsofmultiple clinics. The shared feature of all these situations is an More efficient use of monitoring data is made by the unknown population of identifiable individuals, who are Schnabel Estimator (Schnabel 1938) which is essentially encounteredorobservedonmultiplerandomlyspacedoc- a cumulative weighted average of the ratio of new indi- casions. Each encounter will yield a ratio of known(old) viduals to previously encountered individuals. tounknown(new)individuals. Modelingofthisratioand itstimeevolutionisthebasisofcapture-recapturemeth- PNi×Nc(i−1) Ns = (4) ods. An introduction to many of these methods can be i ( Nr )+1 P i found in Amstrup, S. C., et al. (2005). The aim of this paper is to explore the application Equation4 gives the running populationestimate Nsi ofcapture-recapturemethodstoastronomicaltransients. in terms of the number of individuals (N ) encountered i Section 2 outlines the statistical methods, then in sec- on occasion i, the cumulative number of individuals en- tion 3 a numerical simulation code (Blackbirds) for pop- countered (Nc i−1)) prior to occasion i, and the num- ( ulations of astronomical transients is described. The re- berofindividualsre-encountered(Nr )onoccasioni. In i sults of applying capture-recapture methods to recover programmingthis,weavoidfindingNrexplicitlysinceit the input model parameters, for a wide range of source canbeshownthatNri =Ni +Nc(i−1) -Nci. Aformula properties and observing cadences, are presented in sec- fortheuncertaintyinNsisgivenbyZar(1996),however tion 4. for small Nr one should use the poisson distribution in- 2. REVIEWOFCAPTURE-RECAPTUREMETHODOLOGY stead. In astronomy we are also accustomed to deriving ANDALGORITHMS confidence limits by Monte Carlo simulation. 2.1. Lincoln-Petersen Index 2.3. Derivation of a Generalized Exponential If a fixed populationof individuals (which canbe any- Capture-Recapture Function thing) is observed on multiple occasions, at each epoch While many ad hoc refinements to the above meth- the sample tends to contain a mixture of individuals ods can be found in the literature, the birth of modern who have been previously encountered, and some who techniques lies in the realization that encounter history are new. In the simplest case, where all individuals should be modeled using a probability function. Under have the same probability of being encountered at any theconditionof“equalcaptureprobability”,(allindivid- time,thetotalpopulationcanbededucedfromtheratio uals have the same probability of being in an observable of “new” to “old” individuals encountered at two suc- state at any giveninstant; providedthe observationsare cessive epochs. This mathematical relationship known widely spaced and uncorrelated) it follows that the rate 2 http://dasch.rc.fas.harvard.edu of encounters with new (previously undetected) individ- 3 http://www.ptf.caltech.edu ualswillfollowanexponentialcurvesincethe numberof 4 https://www.lsst.org undetected individuals falls with successive observation; Capture-Recapture 3 the situation being closely analogous to radioactive de- whichisfunctionallyequivalenttoEquation7forsmall cay, with the encounter probability replacing the decay values of p. The exponential approximation converges constant. moreslowlyresultinginaninitialunderestimationwhich Assume a star cluster contains N transients, all hav- diminisheswithk. Forp=0.1,themaximumdiscrepancy 0 ing the same probability p of being “On” during any is less than 5%. observation. For each transient, p can be identified as the star’s duty cycle, since oversufficiently long time in- 2.4. The Problem of Heterogeneous Capture Probability tervals, the ratio of time spent in outburst, to time in Equation 8 still does not make use of all the informa- quiescence is the probability of the star being in out- tionavailable,sinceitignoresthenumberofrepeatdetec- burstatsomearbitrarytime. Thisresultappliestoboth tions of each source; making no distinction for example periodic and aperiodic transients, and is independent of between recurrent, and one-time transients. Data can the actual timescales involved. be used to furnish the form of such heterogeneity in the Ifthispopulationisobservedmanytimes,thenateach duty cycle p, with important astrophysical implications. observation k (where k=1,2,3,4......) we see a mixture Ifdistinctsub-populations exist,eachcharacterizedbya of ‘new’ and ‘old’ stars. At any such observation, we unique p then one canfit a multi-componentmodellike canexpecttoseen =N pstars(withlargeuncertainty). j k 0 Equation 9 based on replication of Equation 8 but if p Subsequentobservationsincreasethecumulativenumber j isafunctionofsomeotherphysicalvariable(age,orbital (Nc ) of stars encountered, thus reducing the number k period, mass, magnetic field, etc) then a more careful of remaining un-encountered stars (Nu ). Following the k analytic prescription will be needed. radioactive decay law as a model leads to Equation 5. Nuk =N0e−pk (5) {Nck}j =XN0,j(cid:2)1−(1−pj)k(cid:3) (9) j The cumulative number of stars encountered by the time one reaches the kth observation will be: The frequency of stellar flares turns out to be such a case. In a review paper Ambartsumian & Mirzoyan Nc =N −Nu (6) (1975) summarizes their long-termstudy of the Pleiades k 0 k starcluster(originallypublishedinRussianinaseriesof Thus, 5 papers: Ambartsumyan et al. (1970) – Mirzoian et al. Nck =N0(1−e−pk) (7) (1977)). They were able to show that the frequency of stellar flares (f) in young stars declines with age, Hence a plot of cumulative number of transients Nc k although repeat photographs of the cluster are on a vs observation number k should be fit by Equation 7 to timescale much too short to show any individual star derive values for N and p. This method is applicable to 0 evolving. The total population of flare stars could be any number of observations (k > 3). Some advantages reproduced from the observed incidence of flares, only, over the LP and Schnabel estimators include better use if the rate of flaring varied between stars and correlated of information, uncertainty estimates generated by the with an independent measure of stellar age. We repro- fittingprocess(e.g. leastsquares),andextensibility. Sci- duce their estimator here as Equation 10 where n is entifically the greatest gain for astronomical purposes is f the number of flare stars for which f flares have been direct estimation of the duty cycle. observed, ν is the flare frequency and t is the total ef- WenotethatinadoptingEquation7wehaveimplicitly fective observing time. Their quantity νt is analogous treatedkasacontinuousvariablebecauseEquation5fol- to the quantity pk in our notation, and is obtained di- lows from a differential equation. An alternative deriva- rectly from the data via a Lincoln-Petersen like identity tion that does not include this feature is as follows. νt = 2n /n using the numbers of stars (n , n ) with Independently of the value of k, the total num- 2 1 1 2 one and two observed flares. In Equation 12 there are j ber of stars is given by N = N + N sothe 0 uk ck sub-populations each with its own flaring frequency. expected number of identified stars can be written as n = pN = pN +pN . Now, before the k-th survey k 0 uk ck (νt)f starts, the expected number of unidentified stars is n =N e−νt (10) f 0 equal to N . When the k-th survey is completed, the f! uk−1 expected number of stars which are identified for the first time is given by pNuk−1. nf =XN0,je−νjt(νjt)f (11) f! Hence, N =N −pN or j uk uk−1 uk−1 A most useful analytic result obtained by N =N (1−p) uk uk−1 Ambartsumian & Mirzoyan (1975) is the following inequality describing the limits of N in the case of Thisisa“differenceequation”that,withthecondition 0 heterogenous ν. Although their parameterization of the N =N , has solution given by u0 0 problem differs slightly, it follows from the same axioms as the Capture-Recapture methods, and therefore N =N (1−p)k uk 0 Equation 12 should be widely applicable. From this last equation, one obtains Equation 8, n2 n2 1 ≤N ≤ 1 (12) Nck =N0(cid:2)1−(1−p)k(cid:3) (8) 2n2 0 n2 4 Laycock 2.5. Biostatistical Implementations: Rcapture TABLE 1 Software for Capture-Recapture studies used in the Pulsar Population Models. bio-statistical field is freely available, for example Mark Model Type P1 P2 T 5, M-Surge (Choquet et al. 2004), CARE (chao et al. s s days 2001) and RCAPTURE (Baillargeon & Rivest 2007). A G 150 50 132 These newer packages use capture histories from mul- B U 1 1000 <311 C G 0 200 16 tiple epochs to iteratively build probabilistic models to D G 200 100 150 describe the population and biases. The models are E G 200 50 150 thenusedtoobtainstatisticalboundsonthepopulation, F G 0 100 16 andtostudythe capture-probabilitydistributionsthem- selves, which can represent survival rates and other de- Pulse period distribution Type is either Gaussian with P¯s=P1, mographic parameters. The newer algorithms make use σ=P2,orUniformwithPs<P2. AllmodelsaretruncatedatPs>1. Thetypical recurrencetimeT isthecorrespondingmost-common of landmark advances in statistical computing and are orbitalperiod,derivedfromEquation13. computationally intensive. Mark dominates animalpop- ulationstudiesandincorporatesthewidestrangeofalgo- werechosentoapproximatelyreproducethepulseperiod rithms, but most are inappropriate for astronomy. Sim- distribution observed by the RXTE monitoring project ilarly M-Surge is aimed at estimating survival and mi- Coe et al.(2010). Theotherchoiceswereselectedtopro- grationrates in open populations. CARE is widely used vide populations with contrasting recurrence timescales. in epidemiology. Rcaptureis gearedtowardclosedpopu- Choice of distribution depends on the application, and lations, including those with high degrees of heterogene- can be specified to explore the observable consequences ity. It incorporates methods from biology, medicine and of various underlying population models. census analysis in a generally applicable framework,and Next, an orbital period is assigned to each X-ray bi- consequentlywasselectedforuse inthis study. Rcapture nary using a fit to the Corbet diagram ((Corbet 1984)) is conveniently distributed as a module for the widely usingup-todatevaluesfor9GalacticBe/Xpulsars. The used open source statistical programming platform “R” formula used is given in Equation 13. Due to the loga- (R Core Team 2014). rithmic relationship between P and P , the or- pulse orbit In the following sections we test Capture-Recapture bitalperioddistributionisaskewedgaussianforaninput methods from the basics up to the latest statistical gaussian P distribution, and a monotonically rising pulse approaches, against a suite of simulations designed to function of P for a flat P distribution, as illus- orbit pulse mimic astronomical monitoring campaigns. tratedinFigure2(lowerpanel). AsteepP distribu- pulse 3. SIMULATINGAPOPULATIONOFTRANSIENTX-RAY tion rising toward small values, modeled by a truncated BINARIES:THEBLACKBIRDSCODE. gaussianwith its mean value at or below the cut-off also produces a bell-curve in P . In order to provide trial datasets for analysis with orbit capture-recapturemethods,the Blackbirds codewascre- ated to simulate populations of recurrent transients. log (P )=1.1329+0.4532×log (P ) (13) 10 orbit 10 pulse By using simulated data, the fidelity of the capture- recapture techniques be quantitatively assessed. The Epoch of periastron for each individual system is ran- simulation enables one to model a wide range of pos- domly assigned within a 1000 day range preceding the sible source populations, each characterized by different beginning ofthe simulation. Outburstsofa givensource recurrence timescales, and to control for the effects of occur at periastron, simulated by a gaussian profile cen- sampling (observing cadence). Blackbirds is written in tered at phase=0, with width (σ=0.1) selected to mimic Fortran 90,andisdesignedtogenerateasetofrepresen- the folded orbital profiles of Galache et al. (2008). The tative X-ray binary light-curves, ‘observed’ via a simu- peak flux of each outburst is assigned to be a baseline latedsampling structure. The type ofsourcewas chosen value (1.0 in the simplest case) modified by a random for its relevance to the author’s observational work, and deviate in the range 0-1.0. The RXTE lightcurves in becausethemethodsdevelopedherewillbeapplicableto LO5/G08shownorelationshipbetweenpulseperiodand existingmonitoringcampaigns. Othertypesofrecurrent X-ray luminosity, hence the use of a simple uniform dis- transient can also be simulated by the code. tribution for peak outburst flux. The imposition of this We could specify the distribution of recurrent time- flux scaling factor is a key part of the simulation, as it scales directly, or via the distribution of another control injectsarealisticaperiodiccomponentintothelongterm parameterwhichisinprinciplebetterconstrainedbyob- lightcurve of each source. servation. InthecaseofHMXBpulsars,thepulseperiod Sampling intervals are assigned as a series of uniform- distributionisknownsomewhatindependently oftheor- randomlydistributedintervalswithaspecifiedminimum bitalperiod-distribution(sinceP isobtainedineven and maximum separation, or or gaussian-random with pulse asingledetection). First,pulseperiodsaredrawnatran- specified mean spacing and spread. This approach fa- dom from either: a uniform distribution; a gaussiandis- cilitates exploring the effects of varying the observing tribution characterizedby a mean periodP¯ and a width cadence. The cadences used in this work are: 7-14 days, σ;oragaussiantruncatedatitspeaktoproduceanexpo- 15-30d,60-90d,90-120d. Theshortestintervalistocom- nential. ThesemodelparametersaregiveninTable1and ply with the RXTE monitoring project’s “weekly” ob- the resulting distributions are plotted in Figure 2, (up- serving cadence; in practice observations are unevenly per panel). The gaussian parameters P1=150s, σ=100 spaced within 7-14 day intervals. Finally the model XRBs are sampled at the observing 5 http://warnercnr.colostate.edu/ gwhite/mark/mark.htm cadence (whether simulated or an actual series of tab- Capture-Recapture 5 0 1. 8 x 0. u e Fl 0.6 v elati 0.4 R 2 0. 0 0. 4000 4100 4200 4300 Time (days) 0 1. 8 x 0. u e Fl 0.6 v elati 0.4 R 2 0. 0 0. 4000 4100 4200 4300 Time (days) Fig.1.— Simulated Pulsar lightcurves. Lower panel shows 5 pulsars with a range of orbital periods, undergoing periodic oubtbursts characterizedbyagaussianmodelwithσ=0.1inphase,andrandomlyassignedpeakflux. Theupperpanelshowsasimilarmodelpopulation sampledatthenominalobservingcadence oftheRXTEmonitoringproject: randomintervalsintherange7-14days. in biology. 400 Further refinements include the ability to specify dif- 300 GGaauussssiiaann:: MMeeaann==115s,0 Ss,i gSmigam=a1=0500ss ferent baseline flux normalizations to each source, and N 200 Uniform: Max=1000s toparameterizetheaperiodiccomponent. Thesechanges 100 willbeusedtocreatemodelpopulationstobeconfronted 0 withactualobservationaldata. Inparticular: tosimulate 200 400 600 800 differential sensitivity, the fact that in real observations Pulse Period (s) pulsarslie atdifferentdistancesfromthe fieldcenter;To model the effects of period-flux correlation; and to ex- plore the nature of long-term variability. A module to 300 produce aperiodic source populations is also in develop- N 200 ment. 100 The simulation is not specific to X-ray pulsars (other 0 than in the choice of pulse-period as the independent 0 50 100 150 200 250 300 variable) , and can readily produce model datasets for Orbital Period (d) anypopulationofperiodicsources. Intypicaluse,popu- Fig.2.—ModelPeriodDistributions. Theupper panelshows 3 lations of 50-100sources are generatedand analyzed (as example Ppulse distributions, while the lower panel shows the re- described below), with 1000 such iterations being used sultingPorbitdistributionsassumingthetwoarecoupledbyEqua- tion13,asinferredfromtheCorbetdiagram. to accumulate statistical bounds. 4. APPLYINGCAPTURE-RECAPTUREMETHODSTOTHE ulated times), modulo an imposed detection threshold, XRBSIMULATION to produce simulated lightcurves for analysis. An exam- Thepurposeofexploringcapture-recapturetechniques ple run, using 5 pulsars is shown in Figure 1. Taking a in this paper is to discover if robust constraints can be ‘normal’ (or type I) outburst to peak at ∼1037erg s−1, placedonthetotalnumberoftransientorvariableastro- and the sensitivity limit for RXTE to be ∼1036erg s−1, nomicalobjects. Forexamplepulsarsinthe SMC,based a suitable simulation threshold is in the range 0.1 to 0.5 on the RXTE weekly monitoring observations. The mo- in the relative flux scale (0-1). tivation for this example is that after 10 years RXTE The output of a single run of the simulation is a cap- continued to discover new pulsars, and the rate of dis- ture history. This is anarraycontaining one column per covery must at some level be tied to the size of the un- source, and one row per observation, with each cell hav- derlying population. We begin with straightforwardim- ing a value of “1” if the source is detected, and “0” if plementationoftheLincoln-Petersenindex(Equation2), not detected. This format follows the standard practice explore its performance on our pulsar simulation under 6 Laycock 50 N 010203040 er 10 years80100 4000 4100 4200 4300 aft e Time (days) mat60 esti n1 n2 n12 lpe n o ati40 300 opul Gaussian Pp: 150, 50 N 100200 an LP p20 UGnaiufosrsmia nP pP:p 1: ,01,0 10000 0 Me Gaussian Pp: 200, 100 4000 5000 6000 7000 0 Time (days) 7 15 30 60 120 240 Fig.3.— Capture histories for source population model A, un- Sampling Interval (Days) dertwocontrastingobservingcadences. Ateachobservationtime, the code keeps track of the number of detected sources (n2), the Fig.4.—EffectofSamplingIntervalandPeriodDistributionon number detected in the previous observation (n1), and number LPpopulationestimatoraveragedover10yearsofmonitoringdata. of sources detected in both the current and previous observation FourcontrastingPpulse modelsarecompared,asinTable1. Each (n12). The Lincoln-Peterson estimator (Equation 2) and its run- bar in the histogram is the average of 10 independent simulation ning mean (solid line) are also plotted. Observation times are runs. drawn from a gaussian-random distribution of intervals. Mean sampleinterval P(top), and>>P(bottom). oneisable(inprinciple)tocaptureanyindividualinthe population. Thisimplementationdoesamuchbetterjob, a range of sampling strategies and population distribu- asshowninFigure5,andishighlyapplicableinthecase tions. WethenmoveontotheSchnableestimator,which that one can construct only 2 such groups; to compare makes more complete use of information, and is formu- two observing seasons for example. lated to incorporate multiple samples. Finally we exam- Thismostbasiccapture-recapturemethodology,inad- ine sophisticated modern capture-recapture algorithms ditiontosub-optimaltreatmentofmultiplesamples,does developed for use in wildlife ecology and epidemiology. not exploit the unique identification of individuals pos- Researchersinthesefieldshavedevelopedrefinementsto sible in the case of most astronomical objects. This is thebasicconceptthatextendittolargenumbersofvisits understandable,assmallanimals,birdsandfish(histori- (observations), individuals with unequal capture proba- callythepopulationsstudied)ofthesamespeciesalllook bilities and correlations between samples. The software alike. package Rcapture Baillargeon& Rivest (2007) provides an implementation of many of these techniques, and 4.2. Cumulative Number Count versus the Schnabel provides example analyses of a wide variety of datasets Estimator drawn from biology and epidemiology. The poisson re- The total number ofpulsars or other transientsources gressionapproachfollowedinRcaptureyieldsrobustcon- detected in an ongoing monitoring campaign naturally fidence intervals in addition to abundance estimates. rises with time and number of observations. The actual rate of new discoveries is determined by a complex in- 4.1. Implementation of the Lincoln-Petersen Estimator terplay of observing cadence and population. Use of ex- We beginby computingthe modifiedLincoln-Petersen pensivetelescopetimecouldbeoptimizedbymonitoring index(Equation2)bytreatingoursimulatedlightcurves strategiesthatareabletodeterminethetotalpopulation in a pairwise manner. For each consecutive pair of ob- in the minimum amount of time. servations,wecalculateN ,andalsokeepacumulative For example if the first observation in a series detects LP runningaverage. Inthiswaywefindoutwhetherthe es- pulsars“A,B,andC”,theseconddetectspulsars“A,C, timatorcanconvergeonastable value,andhowrapidly. D”, and the third “B, D, E”: then the cumulative total TheresultofsuchanapproachisshowninFigure3,fora is 4 pulsars. An extremely interesting questionto ask,is population of 1000 pulsars under two alternative period how many observations are needed before we have seen distributions,andarangeofdifferentsamplingcadences. every pulsar at least once. Going further, can Capture- It is evident from Figure 3 that such a simple imple- Recapture methods accelerate the process? This has an mentation does a poor job in terms of consistency. Fur- obvious implication for the design of monitoring pro- thermore the result gets worse (not better) as the sam- grams if we can predict the optimum observing cadence pling density increases. The reason is that pairing con- and duration of the campaign: If the scientific goal is to secutive samples violates the requirement that samples find N, it may be feasible to obtain a robust estimate be independent, because the flux values are correlated without actually detecting all N objects. on timescales comparable to the cadence. Simple appli- AMonteCarloapproachcanprovideconfidencelimits cationofEquation2isthereforeusefulonlywhenasmall on the detection rate for a variety of target model pop- number of widely spaced observations are available. ulations. At each iteration, the code generates a fresh Animprovementistomergeconsecutivesamplesinor- pulsar population (N objects) and sampling pattern S. der to meet the requirements of independence and equal It then accumulates the total number of unique pulsars capture probability. This is done by grouping observa- detected as a function of time, and records the observa- tions and then treating each group as an independent tion number S at which the count exceeded i% of N . i sample. Eachgroupmustspanatimeintervaloverwhich After 1000 iterations, the results are sorted to produce Capture-Recapture 7 0 500 1000 1500 Time (d2a0y0s0) 2500 3000 3500 basically track the cumulative count. This follows from 100 the fact that closely spaced consecutive samples have a N 80 high degree of correlation. As for NLP the observations would need to be grouped into super-samples, such that 60 the requirement of equal capture-probability is restored. 2 4 6 8 10 Steadilyincreasingtheobservationspacingcausesthecu- Group mulative count to lag as successively more outbursts go undetected. This is the regime where Equation 4 be- 150 comes useful, as demonstrated in Figure 6. A set of N 100 example sampling runs for model A shows that NSch 50 converges rapidly on the underlying value of N, cutting 0 the elapsed time by up to a factor of 10. A full Monte 10 20 30 40 Carlo simulation was then used to compute the num- Group ber ofobservationsrequiredto reach50%,75%and95% of the underlying population in 95% of trials, for each model/sampling combination. 300 N 100200 4.3. Exponential Model, and recovery of Capture Probability or Duty Cycle 0 0 10 20 30 40 50 60 70 As described in section 2 a simple equal-probability Group model for capture-recapture can resemble conceptu- ally the radioactive decay law. The number of un- encountered units in the population falls exponentially 150 withrepeatedobservations,sothemodelfittingamounts N tosimultaneouslyfindingN andp. Thedualoutcomesof 50 thetechniquearepopulationabundance(N )andtheen- 0 0 0 20 40 60 80 counterprobabilityp,whichiscloselyrelatedtotheduty Group cycle. This claim is demonstrated in Figure 7a, which shows the cumulative number of transients (model A: 0 500 1000 1500 Time (d2a0y0s0) 2500 3000 3500 duty cycle=0.1) under different observing cadences. In- terestingly the three cadences (15-30d, 30-60d, 60-120d) N 100 all conform to the predicted exponential curve when 50 cumulative source count is plotted against observation 0 number. Equation 7 is plotted illustrating various val- 0 20 40 60 80 100 120 uesofp=0.1,0.05,0.01),andthedata-pointsclearlytrack Group Fig. 5.—Grouped Lincoln-Petersen Estimator applied to simu- the p=0.1 curve. So, the capture history is shown to be lateddecade-longmonitoringdatafor100pulsars. Themodelpop- independent of time, provided the observations are un- ulation has a gaussianPpulse distributionwithP¯=200s, σP=50s, corelated. and is sampled at 7-14 day intervals, with thresh=0.5. The five Results of fitting Equation 7 to a simulated capture panelsshowresultsofgroupingtheobservations. Fromtopdown: 10groupsof36,40groupsof9,72groupsof5,120groupsof3. The history are shown in Figure 7b. This was done in R MLPEvalueand1σ uncertainty areplottedbysolidpoints,num- using the non-linear least squares function nls to simul- berofpulsarsdetectedinthesample(N2)byopencircles,number taneously fit for N and p. Confidence intervals on the of pulsars in the previous sample (N1) by open squares, and the fittedparametersw0eregeneratedbythefunctionconfint. number of pulsars in both samples (N12) by open triangles. The cumulativerunningmeanisshownasablackline. Thecapturehistorywasfittedforsuccessivelyincreasing numberofobservations,andtheresultingN andp(and 0 their 1σ uncertainties) plotted againstobservationnum- the probabilitydistributionof S , fromwhichconfidence i ber. Fitting for a curve with 2 unknowns requires > 3 intervals may be drawn. For example: “For population points, and in practice the code needs >5, to reliably model X, sampling cadence S will have detected 75% converge and evaluate confidence intervals. Figure 7b of the population after Z observations, with 95% confi- shows that the exponential model converges on the cor- dence”. rect values (N =100, p=0.1)with a 10% errorafter ∼10 TheSchnabelestimateN (seeEquation4isderived 0 Sch observations. fromthecumulativerunningaverageofnewtopreviously encountered individuals. It is therefore expected to per- 4.4. Performance of Advanced Capture-Recapture form better than N , and to provide an accelerated LP Algorithms prediction of the cumulative number count. We ran a large grid of simulations for N to test The most significant development in Capture- Sch a range of observing strategies and population models. Recapture methods is their extension beyond the re- The sampling patterns were 10 year-long stretches of strictions of equal capture probability and closed pop- 7-14, 15-30, 30-60, 60-120 day intervals. The popula- ulations. Capture probability is the likelihood that a tion models are given in Table 1. Each combination of given individual will be encountered in any arbitrary sampling and model was run 1000times, andthe results sampleofthepopulation. Populationsareopenorclosed summarized in Figure 6. When the sampling density is based on whether individuals can enter and leave. In high compared to the average pulsar duty cycle, N the life sciences, heterogeneity in capture probability Sch 8 Laycock 0 0 0 0 2 5 0 0 2 0 10 00 1 Estimate 50 bservations 50 Schnabel umber of O 1020 N 0 Sampling Interval: 5 2 7−14d Percentile: Model: 15−30d 95% black=A 30−60d 2 75% red=B 60−120 50% bglrueee=nC=D oclpoesne dp opionitnst s= =C Sumchunlaatbiveel 10 1 20 50 100 200 500 1000 2000 7−14 15−30 30−60 60−120 Time after First Observation (days) Sampling Interval Fig. 6.— Schnabel Estimator compared with cumulative number of pulsars detected as a function of time, for a variety of sampling cadences andpopulationmodels. Left: Schnabel estimate(solid)andcumulativenumbercount(dashed)foragaussianinputpulseperiod distribution with N=100, P¯=150s, σ=50s. Right: The number of observations required to recover different target fractions (50%, 75%, 95%)ofthepopulation, with95%confidence for4contrasting populationmodels. TheSchnabel estimates generallyconverge onthetrue populationvalueinlessthanhalfthetimetakentodiscoverallofthesources. 0 0 1 1 0 p=0.1 15 0 8 d p=0.05 e ct e Sources Det 60 N 100 y Cycle ve Dut ati umulul 40 p=0.01 50 C 0 2 0 0 0 20 40 60 80 100 0 2 4 6 8 10 Observation Number Observation Number Fig. 7.— The exponential model for capture recapture. Left: Plot shows cumulative total of transients vs observation number, for transientpopulation modelA (duty-cycle=10%) under differentsamplingcadences (15-30 days, 30-60days, 60-120days). Thecurves are Equation 7 forN0=100, andvarious values of p. Right: Fits to the data converge rapidlyon the correct model values of N0=100, p=0.1 forarepresentativesetofsimulatedobservations. (Fittingcodedoesnotalwaysconvergeforfewerthan5points.) arises from both individual behavior and experimental relations. Inthecaseofnovaeforexampletheoccurrence design, and populations feature births, deaths and mi- of a thermonuclear runaway exhausts the fuel supply, so gration. In astronomy we can comfortably ignore mi- the probability of recurrence drops to zero for a time. gration, and birth/death mostly occurs on timescales In preceding sections, the classical Lincoln-Petersen much longer than the typical monitoring program, the and Schnabel formulae were shown to work well for as- longestofwhichexceedthehumanlifetimeonlybysmall tronomical source populations with fixed p. The caveat factors, so astronomers need only consider methods for beingthatsamplingcadenceisselectedappropriately,or closedpopulations. Interms ofcaptureprobabilityhow- observations grouped prior to analysis. We now move ever,sourcescanshowabroaddistributionofdutycycle, on to consider closed populations conforming to the ex- and not all observations are equally sensitive. Transient ponential or log-linear model , and its extension to het- sources are frequently recurrent, or even periodic (e.g. erogeneouscapture probabilities (Darroch1993). In this X-ray binaries) or subject to other strong temporal cor- context, variation in duty-cycle among individual tran- Capture-Recapture 9 In the cases where the Schnabel method worked well, forexamplemodelA(Table1),with15-30dayorgreater 0 sampling intervals, Rcapture showed no need to invoke 0 2 heterogeneity or temporal variation. We deduce this fromthefactthatM(0,t,h,th)allconvergedonthecor- rectresultwithin5-10observations,asshowninFigure8. 0 10 In addition the uncertainty estimates become very con- straining after only 5 observations. Shortening the sam- ure) pling interval into the regime where strong correlation apt is present between observations (e.g. Model A and 7-15 Rc 50 N ( day intervals) causes bias in Rcapture’s estimates. As in the Schnabel case the results track the cumulative dis- coveryrate, with very little difference between M . Sampling Interval 0,h,t,th The type of serial correlation present in the simulated 30−60d (Mo) 20 15−30d (Mo) lightcurves could mimic the behavioral response model 7−15d (Mo) M . A star detected in one observation is likely to be 7−15d (Mb) b seen in the following observation(s) if the sampling in- Cumulative Count terval is sufficiently short. This was tested for models 0 1 A and C (a population dominated by short recurrence periods),andindeedM abundanceestimatesdoamuch 50 100 200 500 1000 b better job in the dense sampling regime (in the sparse Time (days) sampling regime M did not differ strongly from M ). b 0 Fig.8.— Rcapturepopulationabundanceestimates. Thepopu- Next, variation in source behavior was introduced by lationmodelisA(SeeTable1. Resultsforthreedifferentsampling population simulations having a broader distribution of cadencesareplotted. Pointswitherror-barsindicatetheM0fitting resultswith1σuncertainties. Dashedhistogramsshowthecumula- recurrence periods. Model D (Table 1) has double the tivenumberofobjectsdetectedineachseries. ResultforMbfitting width compared to model A, and Model B is flat in isplottedfortheover-sampledseries,showingsomecompensation P , leading to a monotonically rising distribution forthecorrelationbias. pulse of P (i.e. recurrence time). Rcapture exhibited orbit only small (i.e. smaller than the errors) differences in sients. The goals are to see if robust population es- abundance estimators for M and the uniform capture- h timates are possible, and whether advances in dealing probability cM , for models B, D comparedto model A. 0 with heterogeneity result in reduced bias and faster (in Thisresultdemonstratesthatcapture-recaptureprovides terms of observing time) convergence. One of the prin- robust population size estimates in the face of types of cipal software packages for capture-recapture analysis population diversity expected in astronomy. is Rcapture Baillargeon & Rivest (2007) which handles Efficientuseofdensely(over)sampledmonitoringdata both open and closed populations. Rcapture implements is achieved by grouping or merging the observations classes of models in which the capture probabilities of (as was pointed out in Section 4.1). In Rcapture large individuals are allowed to vary during the experiment. datasets can be merged in this way using the function Within a closed population, the probabilities can vary periodhist. withtime(t),heterogeneitybetweenindividuals(h),and changes in behavior caused by response to capture (b) 5. CONCLUSIONSANDFURTHERAPPLICATIONS (otis (1978)). Case b initially seems irrelevantin astron- SimpleformulaeforCapture-Recaptureprovidemean- omy,howeverthe‘behavioralresponse’modelisactually ingfulresultsprovidedtheobservingstrategyismatched anattempt to correctfor correlationsbetween neighbor- to the characteristic recurrence timescale of the target ing observations. This is analogous to the case of tran- population. Modern methods provide robust estimates sient sourceoutbursts that lastfor longerthen the spac- for all populations and observing cadences investigated ingbetweenobservations. Wetestedthe Rcapturemodel in this study. Our generalized exponential population classes M , M , M , M (heterogeneity and temporal model is simple to implement, performs better than the t h b th variation) and M (no variation). The models are fully classical methods, and is a bridge to the more sophisti- 0 describedbyBaillargeon& Rivest(2007)andwerefitby cated methods developed in the life-sciences. poissonregressiontothesimulatedcapturehistorydata. The modified Lincoln-Petersen index is useful when Analysis with Rcapture proceeds by fitting the plau- only 2-3 observations are available. It can be extended sible models, selecting the one that best represents the to longer series by constructing a running mean, or by data,andusing itto estimate abundanceandconfidence merging groups of contiguous observations. The latter interval. strategyisvaluableinthecasesofseasonableobservabil- Capture histories were generated for the usual set of ity,andobservationswith verylowdetectionrates. This models (Table 1 and observing strategies, and analyzed class of estimators were found to be strongly biassed by with the closedp and closedp.0 algorithms. closedp was samplingfrequency andtend to convergeonanunderes- used to perform poisson regression fitting of heteroge- timate of the population. neous and time-dependent models. Due to the com- TheSchnabelestimatorisclosetoidealforsimpleanal- putational load, closedp is currently limited to analyz- ysis and modeling. It converges on the true population ing ≤20 observations at a time. For longer datasets rapidly, enabling the population size to be estimated in we used closedp.0 which fits the heterogeneous capture- a fraction of the time required to discover all individ- probability models in non time-dependent form. uals. This property is extremely useful for predicting 10 Laycock the performance of proposed monitoring strategies, and visible sources. In addition to the example of flare trackingprogressinongoingstudies. Figure6showsthat stars (Ambartsumian & Mirzoyan 1975), the frequency the Schnabel estimator can provide a dramatic accelera- of dust-shell ejections in R Corona Borealis stars and tion in population estimation, potentially reducing both horizontal branch giants, and outbursts of the Be phe- on-sourcetelescopetimeandmonitoringdurationby50- nomenon are all good candidates. 90%. Confidence interval estimation is readily amenable We studied periodic transient sources in this paper, to monte-carlo methods. preciselybecause they appear(atfirstglance)todiverge Capture-Recapture computational algorithms devel- strongly from the first axiom of the Capture-Recapture oped for the life sciences provide a powerful set of gen- paradigm: equal capture probability. Individuals whose eralized poisson regression models to fit a wide variety activity patterns are primarily periodic do not have a of monitoring data. The package Rcapture implements purely random chance of being seen in any arbitrary the most applicable modern methods in R making it sample. Their activity is correlated. Nevertheless in easily accessible to users unfamiliar with the ecology Section 4 we found that for assemblies of randomly dis- and medical settings where most of the methods were tributedperiodicsources,Capture-Recaptureestimators developed. The log-linear modeling framework imple- remain valid if sampling cadence is appropriate. Ape- mented in Rcapture has three advantages: (1) Extends riodic transients on the other hand are expected to re- the capture-recapture paradigm beyond its classical do- cur at random, and to exhibit almost perfectly uniform mainofvalidity. (2)Providesrobustconfidenceintervals (with time) capture probabilities. We say almost, be- (3)Useofgeneralizedmodelsavoidstime-consuminggen- cause outburst duration, brightness (i.e. heterogeneity eration of problem-specific models. We did not investi- in peak luminosity and/or distance), and typical recur- gate the robustdesign methodology introduced in Rcap- rence interval all affect detection probability. Black hole ture, which combines periods of study during which the X-ray novae (accreting binaries containing a black hole populationis treatedas closed,separatedby unobserved and a low mass star) have a very poorly known space periods, during which it is open. Qualitatively such density due to their rarity. Several concerted efforts to an approach could well apply to astronomical datasets, determinethespace-densityanddutycyclesofblack-hole where seasonal observability is a factor, or where short- transients are underway, e.g. ChaMPlane (in the Milky term activity is superimposed on slower trends. For ex- Way)andareamotivatingfactorfortheapproachinves- ample periodic X-ray binary outbursts fueled by a cir- tigated here. cumstellar disk that grows and dissipates on a scale of Capture-Recapture methods have the potential for years. We note that Romine et al. (2016) have recently wide application in time domain astronomy, and ef- used Rcapture to study the association of ultra hard X- forts to apply the methods explored here are under- raysourceswithprotostars,anapplicationthatdoesnot way. AnumberofmonitoringstudiesofX-raytransients involve the time domain at all. in external galaxies are currently planned, in progress, Computational methods were found to outperform or recently completed (see introduction). At optical simple estimators as soon as the number of observations wavelengths large numbers of recurrent transients are exceeds3-5. Forallastronomicalsourcepopulationssim- seen by time-domain surveys such as OGLE (Magel- ulated in this study Rcapture converged on the correct lanic Clouds) and Udalski (2003), MACHO (the Galac- abundance (estimate and errors within 10% of the true tic Bulge) Alcock et al. (1997). New all-sky surveys value)within 8-15observations. Depending onlyslightly include PanStars panstars (2000X), Palomar Transient onthesamplingcadenceandmodel,thisgenerallytrans- Factory Law et al. (2010) and LSST Ivezic et al. (2011). lated to a point at which about half the population had The photographicplate digitizationprojectDASCH will been encountered. The principal objection to applying soonaccess∼100yearsofwellsampledphotometricdata techniquesfromotherdisciplines (inthis casebiology)is for the entire celestial sphere. New statistical tools are that methods may be used outside their domain of ap- needed,totacklequestionsthatcouldnotbeaskedprior plicability, leading to unreliable results. Such concerns to this explosion of transient sources. werethoroughlyinvestigatedinthispaperusingrealistic numerical simulations of astronomical monitoring cam- paigns. The ideal astronomical setting for Capture-Recapture ACKNOWLEDGEMENTS analysis is a fixed population observed on multiple oc- SL thanksthe anonymousrefereeforconstructivesug- casions, separated by sufficiently long intervals for dif- gestionsonpythoncompatibility,andforbringingtoour ferent subsets to be active. In addition to estimating attention the work of V. A. Ambartsumian. This work population abundance, the approach can provide fre- wassupportedinpartbyNASAAstrophysicsDataAnal- quency or duty cycle for rare states in continuously ysis Programgrant NNX14-AF77G. REFERENCES Alcocketal.,1997,ApJ,486,697 Chao,A.,Yip,P.,Lee,S-M.andChu,W-T.,2001,Journalof Ambartsumian,V.A.,&Mirzoyan,L.V.1975,VariableStars Statistical PlanningandInference92,213-232 andStellarEvolution,67,1 ChoquetR.;RebouletA.-M.;Pradel,R.,GiminezO.,Lebreton Ambartsumyan,V.A.,Mirzoyan,L.V.,Parsamyan,E.S., J.-D.,2004,vol.27(1), 207-215 Chavushyan, O.S.,&Erastova,L.K.1970,Astrofizika,6,7 Cleveland,W.S.,1981,TheAmericanStatistician,35,54 Amstrup,S.C.,etal(editors)2005, Handbookof Chapman,1951 Capture-Recapture Analysis,PrincetonUniversityPress. Chen,W.,Shrader,C.,&Livio,1997,ApJ,491,312 Baillargeon,S.,&Rivest,L.,2007,JournalofStatistical Coe,M.J.,etal.,2010,MNRAS,406,2533 Software,vol.19,issue5 Corbet,R.H.D.,etal.,1984,A&A,141,91