Event Discovery in Time Series

Dan Preston∗‡  Pavlos Protopapas∗†  Carla Brodley‡

∗ Initiative in Innovative Computing, Harvard University
† Harvard-Smithsonian Center for Astrophysics
‡ Department of Computer Science, Tufts University

arXiv:0901.3329v1 [astro-ph.IM] 21 Jan 2009

Abstract

The discovery of events in time series can have important implications, such as identifying microlensing events in astronomical surveys, or changes in a patient's electrocardiogram. Current methods for identifying events require a sliding window of a fixed size, which is not ideal for all applications and could overlook important events. In this work, we develop probability models for calculating the significance of an arbitrary-sized sliding window and use these probabilities to find areas of significance. Because a brute force search of all sliding windows and all window sizes would be computationally intractable, we introduce a method for quickly approximating the results. We apply our method to over 100,000 astronomical time series from the MACHO survey, in which 56 different sections of the sky are considered, each with one or more known events. Our method was able to recover 100% of these events in the top 1% of the results, essentially pruning 99% of the data. Interestingly, our method was able to identify events that do not pass traditional event discovery procedures.

1 Introduction

Event discovery in time series data is the focus of many modern temporal data mining methods [4, 11, 13]. An event is characterized by an interval of measurements that differs significantly from some underlying baseline. The difficulty in identifying these events is that one must distinguish between events that could have occurred by chance and events that are statistically significant. In this paper, we present a method that is noise independent and determines the significance of an interval.

We focus our tests in astronomy, for which discoveries could be found by identifying events in time series data. For example, microlensing events occur when an object passes in front of a light source and acts as a gravitational lens, thus increasing the magnitude of light being observed. Figure 1 is an example of such an event. Each light curve (time series data recorded by astronomical observations) has different noise characteristics, making it difficult to form a general noise model for all time series in the set.^1 Furthermore, there are millions of observed light curves in modern astronomical surveys, many with uninteresting fluctuations or trends, thus any effective method must be able to distinguish between these variations and statistically significant events such as microlensing, flares and others. Examples of other applications for which event detection is useful include searching for events in stock market data [5, 28], examining CPU usage to identify anomalies, and finding irregularities in medical data [25, 26].

Figure 1: Example: A microlensing event in the MACHO data set, with ID 104.20121.1692. The X-axis represents time in seconds; the Y-axis represents magnitude.^2

^1 The reasons for varying noise characteristics in astronomical observations are plentiful: weather (clouds, wind, etc.) creates trends, the temperature can change the CCD characteristics, different stresses on the mechanical structure depending on the direction of the telescope can stress the optics differently and consequently make a difference in how the light is bent, and other possible effects of the environment.

^2 The magnitude of an observation is related to the logarithm of the intensity of the light being observed for each star.

Searching for events in time series requires first searching for overdensities, intervals of a time series with a large deviation from the baseline [18], and then determining the statistical significance of the region. A naïve approach would be to explore all possible intervals in order to identify the region with the largest deviation from the baseline. More efficient methods for identifying these regions have been explored in [19] and [18].
After determining regions of the time series that deviate from the baseline, one must determine if these regions are statistically significant, which requires modeling the noise of the time series. Modern methods create such models by performing randomization testing, in which a time series is randomly reshuffled several times and the original interval is compared to the most anomalous interval found in each shuffle [18]. The downfall of performing this randomization testing is that it requires significant computation, and needs to be performed for each new time series. This technique is intractable for domains in which we have thousands or even millions of time series, such as light curves in astronomy.

Our approach falls under the broader category of scan statistics [9, 17], which claims that by considering sliding windows, or intervals of a time series, one can determine statistical significance if the underlying noise can be understood. To remove the need to model the noise for each time series, we begin by converting the time series to rank-space, a discrete series of N points representing the rank of each value in a real-valued time series, where 1 is the lowest value and N is the highest. This creates a uniform distribution of points across the Y-axis, independent of the underlying noise. This allows a probability model to be formed that does not require a model of the noise in the time series (this probability model is described in Section 3.3). Because the model is applicable to any time series of length N that has been converted to rank-space, it has the added benefit that it allows an analyst to compare the significance of events across different time series.

Given the probability model described, each interval has a p-value, which represents the likelihood of the region occurring by chance. The method described in this paper considers all possible intervals of a time series. In other words, a variable window size is used in the analysis, and a p-value can be assigned to each window size. To make this search tractable, a well-known optimization technique [21] is applied that approximates the solutions by finding the minimum p-value over all possible intervals.

In applying our method, we ran it on the MACHO data set [1]. Out of 110,568 time series, our method was able to recover all known microlensing and blue star events [2, 12]. Furthermore, many of the events found in the top results would not have passed traditional tests, such as a χ^2 test. Results of the MACHO analysis are explained further in Section 5.1. In addition, Section 5.2 details the analysis of 2000 synthetic time series, in which all events were found with zero false positives.

Finally, we compare the proposed approach to HOT SAX, an anomaly detection algorithm [13]. In doing so, we outline the essential differences between an anomaly detection algorithm (i.e., HOT SAX) and an event detection algorithm.

In the following sections, we begin by describing the related work in scan statistics and anomaly detection in Section 2. We then describe our method for forming a probability model to assess the statistical significance of an interval in Section 3, and how one can identify events based on these probability models. In Section 4, we describe how we can apply these methods to large data sets by approximating results. Finally, Section 5 reports our results from an application to an existing astronomical survey, MACHO, and a study of synthetic time series.

2 Related Work

Perhaps the work most closely related to event detection in time series is scan statistics, whose goal is to determine the probabilities and statistics of spatial subsets. Scan statistics are applied in most cases to temporal data, but in some cases are generalized to any number of d dimensions. The goal of scan statistics is to discover statistically significant densities of points that deviate from some underlying baseline (see [9] for a detailed review).
Scan statistics have been used to refute decisions made based on false identification of seemingly significant events. Examples of such events include the grounding of all F-14s by the U.S. Navy after a cluster of crashes, and a judge fining a Tennessee State Penitentiary institution after an unusual number of deaths occurred at the facility [9]. Scan statistics were used in both cases to show that the cluster in question was likely to have occurred by chance, and thus was not statistically significant.

To see how they are applied to time series data, consider a time series with Z points, each independently drawn from the uniform distribution of time intervals on [0,1) [9]. The scan statistic S_w is the largest number of points to be found in any interval of length w; thus S_w is the maximum cluster with window size w [17]. The determination of the distribution of the statistic P(k|Z,w) is fundamental to this analysis, where P(k|Z,w) represents the probability of observing k points within a window of size w. Much work has been done to find approximate results for this statistic, such as the use of binomial probabilities [27] and Poisson distributions [8]. Generally, these approximations work well for small values of w and k, with varying levels of complexity and accuracy. It is important to note that in many applications, a rough approximation is sufficient to find significance. Using this statistic, one can determine if an interval of the time series is not consistent with the noise, and thus is statistically significant. In this work, we describe a new statistic for assessing the statistical significance of any arbitrary window by forming a new probability distribution, motivated by work in scan statistics, that scales to large values of w. Our method has an analytic procedure for finding the exact probability distribution. Finally, by forming the distribution for all window sizes, one can analyze significance for any arbitrary window size.
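To make the classical statistic concrete, the sketch below computes S_w by brute force for Z points on [0,1). This is only an illustration of the definition above (function and variable names are ours); it does not implement the binomial or Poisson approximations cited.

    import numpy as np

    def scan_statistic(points, w):
        """Classical scan statistic S_w: the largest number of points
        falling in any interval of length w on [0, 1). A maximizing
        window can always be anchored at one of the points."""
        pts = np.sort(points)
        lo = np.searchsorted(pts, pts, side="left")        # index of each point
        hi = np.searchsorted(pts, pts + w, side="right")   # points within [p, p + w]
        return int((hi - lo).max())

    rng = np.random.default_rng(0)
    print(scan_statistic(rng.random(200), w=0.05))  # S_w for Z = 200 uniform points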
Existing methods in scan statistics currently do not address the problem of large data sets (e.g., millions or billions of time series). In many applications, one requires an event detection algorithm that can be run quickly on many time series. In addition, because much of the data for these applications has noise that is difficult to understand and model, it is critical to develop a method that is independent of the noise characteristics. The current methods in scan statistics are generally based on particular noise models, such as Poisson distributions [8] and binomial probabilities [27]. Both of these issues arise in astronomy, and thus our goal in this paper is to address both of them. Due to the intractability of current scan statistics methods on data of this scale, we cannot compare against them directly. Instead, we are compelled to compare to anomaly detection, due to its speed, its efficiency, and the fact that it does not require first modeling the noise.

Anomaly detection in time series is closely related to the problem of searching for events in time series. In [13], time series discords are used to determine the interval most likely to be an anomaly. Keogh et al. present a method for determining these discords for a fixed-size sliding window. At its core, the method considers all possible intervals of the time series for a given size (a parameter provided by the domain expert), and determines the subset with the largest distance from its nearest non-self match (a distinction made to avoid trivial matches). Requiring the analyst to provide only one parameter is a significant improvement over past methods, which require several unintuitive parameters [14]. Defining a window size in advance is realistic when the expert has knowledge about the size of the event, or when the time series is periodic and the event is likely to occur during one period. As an example, this is the case when analyzing an electrocardiogram time series [13]. On the other hand, when an expert does not know the characteristics of the event, or when the event could occur at different intervals, it is preferable to examine any arbitrary window size. Such is the case in astronomy, where one may be searching for any novel discovery represented by possible significant changes in light patterns (e.g., microlensing), each of which may have a different time interval.

It is important to understand the distinction between anomaly detection and event detection. When detecting anomalies, such as in [13], the anomalies are discords from the time series (or set of time series). This is fundamentally different from the goal of this paper, which is to find subintervals that are statistically significant with respect to the underlying noise. In other words, anomaly detection finds those subintervals that differ most from the rest of the ensemble, whereas event detection searches for those that significantly differ from the noise.

Methods of identifying events in astronomical data exist in the literature. Traditionally, for fast identification of variables, a χ^2 test against a straight line has been used, but this overlooks events of short duration. Other methods involve fitting event shapes to the light curves, generally using a least-squares fit [10]. This can be thought of as creating synthetic light curves with the characteristics of known events, and testing these hypotheses. The fitting of events is an accurate method for event discovery, yet it is slow and quite expensive. What makes our method particularly valuable is that it can find events that escape the χ^2 test and is less expensive than event-fitting tests.

While scan statistics build a sound framework for determining the significance of events, there are significant challenges that must be overcome in order to deal with large amounts of high-dimensional data. Our work addresses these issues by first performing a transformation of the data to a uniform representation and second, developing a probability distribution for assessing the significance of intervals of a time series. Furthermore, although anomaly detection is related to event detection, it is fundamentally different and not particularly well suited for our application. Our empirical results in Section 5.1 make the distinction between anomaly and event detection clear.
3 Assessing the Statistical Significance of an Interval

The following work differs from the work in scan statistics in three key ways. First, we do not consider the temporal distribution of individual events, but instead bin the events in equidistant time intervals. In other words, a time series with Z points distributed on [0,1), as defined for scan statistics in Section 2, is transformed into a time series by grouping the points into N bins, with each bin b representing a constant interval of time. Each bin becomes a value v_b in the time series, where v_b is the number of points in bin b. Second, we are not concerned in particular with the largest subset S_w, but with any interval with statistical significance. Third, we consider any arbitrary window size w, instead of using a fixed value. Forming a distribution for all window sizes will allow an analyst to identify the exact interval in question. More importantly, this step introduces a method for the comparison of p-values for any arbitrary window size. In this section, we present a method for simplifying our problem by converting a time series into rank-space, and then two methods for creating a probability distribution of possible window sums for assessing the significance of an interval.

3.1 Formal Definition. We begin by defining a time series T = [v_1, v_2, ..., v_t, ..., v_N] of length N, where v_t is the value of the time series T at time t. It is assumed that the length of each time interval [t, t+1] is equal for all 1 ≤ t < N. An interval S ⊆ T is defined by a starting point s and a window size w. Formally, S = [v_s, v_{s+1}, ..., v_{s+w−1}] where 1 ≤ s ≤ s+w−1 ≤ N.

3.2 Rank-Space Transformation. In order to remove the need to model the noise for each individual time series, one can consider a time series T_R that has been transformed into rank-space. To this end, we rank each point in a time series T from 1 to the number of total points N, where the lowest value is assigned a rank of 1 and the largest is assigned a rank of N. This allows us to consider a time series with a uniform distribution of noise, no matter what the noise distribution was before the transformation.
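As a concrete sketch of this transformation (our own minimal implementation, assuming distinct values, since the excerpt does not specify a tie-breaking rule):

    import numpy as np

    def to_rank_space(values):
        """Map each value to its rank 1..N (1 = lowest, N = highest).
        Assumes distinct values; ties would need an explicit rule."""
        values = np.asarray(values)
        ranks = np.empty(len(values), dtype=int)
        ranks[np.argsort(values)] = np.arange(1, len(values) + 1)
        return ranks

    # Whatever the original noise distribution, the output is a permutation of 1..N.
    rng = np.random.default_rng(1)
    print(to_rank_space(rng.lognormal(size=8)))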
By examining the sums of the values inside any arbitrary sliding window of T_R, we can determine its p-value (the probability that the sum occurred by chance). How to calculate the p-value is the focus of Sections 3.3 and 3.4. Consider a starting point s and a window size w, and let Q(s,w) be the sum of the ranked values inside that window, where 1 ≤ s ≤ N and 1 ≤ s+w−1 ≤ N. Thus, we define the sum Q(s,w) = r_s + r_{s+1} + ··· + r_{s+w−1}, where r_s, ..., r_{s+w−1} ∈ T_R. Our method is based on the assumption that the sums in the outer tails of the probability distribution of all possible sums are significant, as they deviate drastically from the underlying baseline.

Note that with rank-space one loses the depth of the time series. In other words, the amount by which an event deviates from the baseline is no longer distinguishable after the transformation, due to the uniform spacing of the points. For example, a time series with a significant event that deviates from the baseline with a signal-to-noise ratio of 5 is no different in rank-space than one with an event with a signal-to-noise ratio of 20. This is beneficial in many cases in which only a few points deviate by large amounts from the baseline, but are not necessarily significant events.

A distinct probability distribution exists for each pair (w,N), because each sum depends only on the number of values being added together (w) and their bounds, 1 ≤ r_i ≤ N. Therefore, in order to assess the statistical significance of a particular sum Q(s,w) of T_R, we must know the distribution for (w,N). We first describe an analytic method to find this distribution and then, due to its computational complexity, a Monte Carlo method for approximating the same distribution.

3.3 Determining the Probability Distribution: Analytic Method. We wish to find the probability that any sum φ, obtained by performing the rank-space transformation described above, would appear in uniformly random noise with a window size w and a maximum value N. Thus, our goal is to produce an exact probability distribution function for a given φ, w, and N.

To obtain our probability curve, we can count the number of ways any φ can be formed with distinct values from 1,...,N inclusive, and divide by C(N,w) (all combinations of w distinct values, which represents all ways to form φ with w values).

To find the number of ways a single sum φ can be formed, we consider the combinatorial problem of finding the number of partitions of φ with w distinct parts, each part between 1 and N, inclusive. In other words, we wish to count all possible solutions to v_1 + v_2 + ··· + v_w = φ where 0 < v_1 < v_2 < ... < v_w ≤ N.

In order to solve this, we transform the problem to match a well-known problem involving a specific application of the q-binomial coefficients [3]. This requires that we slightly modify the problem: subtract 1 from the smallest part, 2 from the second, etc., to get 0 ≤ v_1 − 1 ≤ v_2 − 2 ≤ ... ≤ v_w − w ≤ N − w. We have subtracted a total of 1 + 2 + ... + w = w(w+1)/2, so we are now looking for the number of partitions of φ − w(w+1)/2 with at most w parts, and with largest part at most N − w.

Following the notation of [3], we define p(n,m,k) to be the number of partitions of k into at most m parts, each part at most n. This value is the coefficient of q^k in the generating function

(3.1)  G(n,m;q) = Σ_{k≥0} p(n,m,k) q^k

where

(3.2)  G(n,m;q) = [(1 − q^{n+m})(1 − q^{n+m−1}) ··· (1 − q^{m+1})] / [(1 − q^n)(1 − q^{n−1}) ··· (1 − q)]

A basic identity of the coefficients provides the recurrence on page 34 of [3]:

(3.3)  p(n,m,k) = p(n−1, m, k−m) + p(n, m−1, k)

where the base cases follow from the identity

(3.4)  p(n,0,k) = p(0,m,k) = 1 if k = 0, and 0 otherwise

(with p(n,m,k) = 0 whenever k < 0). Applying this to our problem, we are looking for distinct partitions of k = φ − w(w+1)/2 with at most m = w parts and largest part at most n = N − w. Thus, p(N−w, w, φ−w(w+1)/2) is the number of ways to create the sum φ with w unique parts, each value from 1 to N. The probability of obtaining sum φ is therefore

(3.5)  P(φ; w, N) = p(N−w, w, φ−w(w+1)/2) / C(N,w)

Thus, to build a distribution for (w,N), we find P(φ; w, N) for all φ.
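The recurrence (3.3)-(3.5) transcribes directly into memoized code. The following sketch (names ours) computes the exact distribution for small (w, N); as Section 3.4 discusses, the memo table becomes prohibitively large for big windows:

    from functools import lru_cache
    from math import comb

    @lru_cache(maxsize=None)
    def p(n, m, k):
        """Partitions of k into at most m parts, each part at most n,
        via recurrence (3.3) with the base cases of (3.4)."""
        if k < 0:
            return 0
        if n == 0 or m == 0:
            return 1 if k == 0 else 0
        return p(n - 1, m, k - m) + p(n, m - 1, k)

    def prob_sum(phi, w, N):
        """P(phi; w, N) of equation (3.5): probability that w distinct
        ranks drawn from 1..N sum to phi."""
        return p(N - w, w, phi - w * (w + 1) // 2) / comb(N, w)

    # Sanity check: probabilities over all attainable sums add to 1.
    w, N = 3, 10
    print(sum(prob_sum(phi, w, N)
              for phi in range(w * (w + 1) // 2, w * (2 * N - w + 1) // 2 + 1)))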
3.4 Determining the Probability Distribution: Monte Carlo Method. Although the analytic results are preferable and provide exact probabilities for all possible sums, the analytic method is deeply recursive and prohibitively expensive. Memoization can be used to improve performance, by creating a hash for pruning branches of the recursion tree, but the hash still remains too large in memory to be practical in all cases. In order to perform the memoization method for creating the analytic distribution, one must keep a three-dimensional structure in memory to track the recurrence, of size w_max × N × (N w_max − w_max(w_max−1)/2), where w_max is the largest window size we wish to search for. For example, for w_max = 50 and N = 500, we require 2.2GB of memory, which may be too large for some systems.

Thus, an alternative and less memory-demanding method is to perform random sampling of all possible sums. To perform the Monte Carlo method, we repeatedly choose w unique random numbers from 1 to N, sum them, and then count the frequency with which each sum appears. Dividing each of the frequency counts by the total number of samples gives an approximate probability distribution curve.
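A minimal sketch of this sampling procedure (implementation details ours):

    import numpy as np
    from collections import Counter

    def monte_carlo_distribution(w, N, samples=1_000_000, seed=0):
        """Approximate P(phi; w, N) by summing w distinct ranks drawn
        uniformly from 1..N, `samples` times (Section 3.4)."""
        rng = np.random.default_rng(seed)
        counts = Counter()
        for _ in range(samples):
            # draw w values without replacement from 0..N-1, shift to ranks 1..N
            phi = int(rng.choice(N, size=w, replace=False).sum()) + w
            counts[phi] += 1
        return {phi: c / samples for phi, c in sorted(counts.items())}

    dist = monte_carlo_distribution(w=3, N=50, samples=100_000)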
Because the tails of the distribution represent the windows with the lowest p-values, those counts must be determined as accurately as possible. We seek statistically significant events, thus we define a threshold α for which we would like to consider all events with p-values lower than α. Our goal in the accuracy of the probability distribution is to keep the p-values around the threshold α as accurate as possible. In other words, the accuracy of the p-value associated with the event is not imperative; what matters is ensuring that the p-values of events do not fluctuate around the threshold α. In a Monte Carlo run, we consider n_φ to be the frequency of each sum φ. The expected accuracy of the frequency of φ is given by

(3.6)  ε ~ 1/√(n_φ)

where ε is the error [22]. In order to ensure accuracy ε at a threshold α, we must obtain a minimum of λ samples for φ, given by

(3.7)  λ ~ n_φ/α ~ 1/(α ε^2)

because we know that n_φ = 1/ε^2 from Equation 3.6.

Figure 2 shows the probability distributions for different values of (w,N) calculated using the analytic and the Monte Carlo approaches. It is important to note that ε and α are statistically motivated and must be defined. In this example, our confidence is α = 10^−4 with an accuracy of 10% (ε = .1). Thus, we must perform 1,000,000 random samples. The error in the tails of the distribution seen in Figure 2 is less than 10%, and thus consistent with theory.

Figure 2: Probability distribution for different values of the window size w = 3, 10, 20, and total number of points N = 50; Monte Carlo probability distributions using 1,000,000 samples.

Not all time series are of equal length N, and thus we cannot use the same probability distribution for every time series. Although one could create a probability distribution for every possible N, this is not practical. In order to solve this problem, one can split the time series into equal parts of length N_s, where N_s ≤ N and N_s ≫ w_max, where w_max is the largest possible window size. Each subset of length N_s is then analyzed as a single time series. To ensure that the algorithm does not miss any possible events where any two intervals meet, one should overlap the intervals by w_max. More formally, given an interval for a fixed N, [v_i, v_{i+1}, ..., v_{i+N−1}], the subsequent search interval should be [v_{i+N−w_max}, ..., v_{i+2N−w_max−1}]. Furthermore, this method may be used to analyze time series of large lengths, where it is too computationally complex to compute a full probability distribution for the length of the time series. In order to ensure a correct modeling of the noise, a large N_s must be chosen; as N_s grows large, the probability distributions become increasingly similar, and thus do not change the significance of the p-value.

These distributions need only be calculated once. As a benefit of having a single model based on (w,N), one can store the distributions to be recalled for later analyses. Because all time series are first converted to rank-space and thus have a uniform distribution of noise, we can use the same noise distribution for all time series of the same length.
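One way to realize the overlapping split described above is sketched below (the indexing and the handling of a short final remainder are our assumptions):

    def overlapping_segments(series, Ns, w_max):
        """Split a long series into length-Ns segments whose starts advance
        by Ns - w_max, so adjacent segments overlap by w_max and no window
        of size <= w_max straddling a boundary is missed (Section 3.4)."""
        step = Ns - w_max
        segments, start = [], 0
        while start + Ns <= len(series):
            segments.append(series[start:start + Ns])
            start += step
        if start < len(series):            # remainder: re-anchor a final
            segments.append(series[-Ns:])  # full-length segment at the end
        return segments

    parts = overlapping_segments(list(range(3000)), Ns=1000, w_max=50)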
4 Computationally Tractable Search

Given a probability distribution for each possible window size w, 1 ≤ w ≤ N, and for a given N, we wish to find areas of significance. A naïve approach to identifying events would be to apply a brute force technique: compare each sum Q(s,w), for all s and w, to the probability distribution defined by the method in Section 3.3 or Section 3.4 for w and N. The search examines the p-value associated with the sum Q(s,w), and ranks the regions by p-value, lowest to highest. The complexity of this brute force search is O(N^2). For very large data sets, this is unacceptable and a quicker approximation is needed. In Section 4.1, an analysis of the probability surface plot gathered from analyzing all possible (s,w) pairs is shown to be quite revealing, and motivates our use of well-known optimization methods for approximating the results. In addition, it is important to note that we refer to event regions rather than specific (s,w) pairs, because similar pairs of (s,w) may represent the same event. The algorithm outlined in Section 4.2 presents a method for combining similar (or overlapping) windows and returning representative (s,w) values for each of the event intervals.

Figure 3: Synthetic time series with two events.

4.1 Examining the Probability Surface Plot. A two-dimensional, s × w, surface plot of the probabilities shows clearly where significant events occur in this space. For the time series shown in Figure 3, Figure 4 depicts the probabilities for each pair (s,w), where darker regions correspond to lower p-values. After examination, one notices three regions of low-probability points. By locating the minimum of each low-probability region, one can pinpoint the exact time and duration of the event in the time series, by identifying the starting point s and window size w. In order to find these regions algorithmically and efficiently, the problem can be reduced to one of minimization (a special case of the class of optimization algorithms).

Figure 4: Contour plot of probabilities for all (s,w) pairs from the analysis of Figure 3.

There are several methods for performing the necessary minimization [15, 20, 22]. The success of the method detailed in this paper does not depend significantly on the particular minimization technique. Powell's method [21] was chosen due to its efficiency and its lack of need to compute derivatives. Because we search for multiple events, our goal is to find all minima that are significant; Section 4.2 addresses this problem.
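For reference, the O(N^2) brute-force baseline described at the opening of this section can be written compactly with prefix sums. In this sketch (names ours), `pvalue(q, w, N)` stands for a tail probability derived from either distribution of Section 3; exactly how the two tails are combined is a modeling choice the excerpt does not pin down, so the demo uses a rough normal approximation as a stand-in:

    import numpy as np
    from math import erfc, sqrt

    def brute_force_search(ranks, pvalue, w_min=2, w_max=None):
        """Rank every interval (s, w) of a rank-space series by p-value,
        lowest first. Prefix sums make each Q(s, w) an O(1) lookup."""
        ranks = np.asarray(ranks)
        N = len(ranks)
        w_max = w_max or N
        prefix = np.concatenate([[0], np.cumsum(ranks)])
        results = []
        for w in range(w_min, w_max + 1):
            for s in range(1, N - w + 2):                   # 1-based start, as in the text
                q = int(prefix[s + w - 1] - prefix[s - 1])  # Q(s, w) = r_s + ... + r_{s+w-1}
                results.append((pvalue(q, w, N), s, w))
        return sorted(results)

    def approx_pvalue(q, w, N):
        """Two-sided normal approximation to the sum of w ranks drawn
        without replacement from 1..N (illustrative stand-in only)."""
        mu = w * (N + 1) / 2
        var = w * (N + 1) * (N - w) / 12
        return erfc(abs(q - mu) / sqrt(2 * var))

    top = brute_force_search(np.random.default_rng(4).permutation(50) + 1, approx_pvalue)[:5]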
4.2 Random Restarts and Combining Results. Powell's method begins by selecting a search vector (a vector of points on the surface plot), performing a linear minimization along the vector, and then selecting a new search vector from this point. These steps are repeated until a satisfactory minimum is found. Random restarts are introduced to alleviate dependence on the original search vector. This helps avoid finding spurious solutions, which often appear when using approximation methods in the presence of noise. For each random restart, a pseudo-random number generator is used to seed a different vector on the surface plot. When the window size of the desired event to be found is small (i.e., when w is small, the event would appear as a quick spike in the time series), any minimization method is likely to find spurious solutions. These cases benefit from more random restarts to ensure that the significant minima are found. The analyst must define the number of random restarts required (suggested values for this parameter are discussed in Section 5.2).

Because many random restarts will identify similar instances of the same solution (see Figure 5), we introduce an agglomerative clustering method to find the best-suited final (s,w) pairs for an event region. Consider two results, (s_1,w_1) and (s_2,w_2). The goal is to determine if they represent the same solution. Thus, we consider the overlap of the two results, and if that overlap is within a threshold θ, the two pairs are considered to be the same result and are placed in the same cluster C_i. Thus, two results overlap if

(4.8)  θ ≤ (R_total − (R_begin + R_end))/w_1   and   θ ≤ (R_total − (R_begin + R_end))/w_2

where

(4.9)  R_total = max(s_1+w_1, s_2+w_2) − min(s_1, s_2),   R_begin = |s_1 − s_2|,   R_end = |(s_1+w_1) − (s_2+w_2)|

To build the clusters, we iterate through each result and compare the current result (s',w') with each element of each cluster C_i. If no cluster exists in which every element overlaps with the current result, we create a new cluster C_k containing only (s',w'). Finally, in each cluster, the element corresponding to the sum with the lowest probability is used as the representative p-value for that cluster. In experiments, we found that the results are fairly insensitive to the chosen value for θ (the default value used in our experiments was θ = .75).

Figure 5: Contour plot of the p-values of the analysis of Figure 3; each X represents a minimization using Powell's method.

Note that random restarts are beneficial when dealing with time series that could have multiple events. In these situations, the random restarts allow the minimization to find multiple intervals with low p-values, each of which has the potential to be an event. In the following section, we explore how to distinguish between likely events and spurious solutions when dealing with the final (s,w) pairs determined from the clustering algorithm.
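A sketch of the overlap test (4.8)-(4.9) and the greedy agglomeration it drives (our own minimal rendering; the iteration order is unspecified in the text):

    def overlap_ok(r1, r2, theta=0.75):
        """Test (4.8)-(4.9): do results r1 = (s1, w1) and r2 = (s2, w2)
        describe the same event region?"""
        (s1, w1), (s2, w2) = r1, r2
        r_total = max(s1 + w1, s2 + w2) - min(s1, s2)
        r_begin = abs(s1 - s2)
        r_end = abs((s1 + w1) - (s2 + w2))
        shared = r_total - (r_begin + r_end)
        return shared / w1 >= theta and shared / w2 >= theta

    def cluster_results(results, theta=0.75):
        """Greedy agglomeration: add each (s, w) to the first cluster whose
        every member overlaps it; otherwise start a new cluster."""
        clusters = []
        for r in results:
            for c in clusters:
                if all(overlap_ok(r, member, theta) for member in c):
                    c.append(r)
                    break
            else:
                clusters.append([r])
        return clusters

    print(cluster_results([(70, 13), (68, 15), (120, 30)]))  # two clusters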
4.3 Identifying Events. After we have found all potentially significant events, we rank order them by significance for further examination. There are two aspects to consider when analyzing the event regions. First, one should consider all intervals that are minima in the surface plot of p-values for a time series. Each of these minima is included in the ranking of events for all time series being considered. By identifying all minima, we are able to find multiple areas of interest in a single time series. Such is the case in Figure 3, which is an example of a time series with multiple events with differing p-values.

Second, in some cases, where a trend exists or the signal-to-noise ratio is low, simply ranking by the p-value of each true minimum may result in overlooking some events. For example, Figure 6 is a time series with an upward trend. As one can see from the surface plot in Figure 7, the most significant window resides at the end of the trend, but there is clearly another minimum with significance at s = 70 and w = 13. Thus, it is also important to consider the frequency with which the event regions were identified during the random restarts. This will identify the intervals that are clear minima on the probability surface plot, and thus could be significant to the time series. The frequency of each event region can be considered by the analyst in addition to its p-value.

Figure 6: Synthetic time series with trend.

Figure 7: Contour plot of the p-values of the analysis of Figure 6.

Although our method will identify significant events on a trend like the one in Figure 6, the event regions associated with the end of the trend will still appear as significant. Thus, in the case of a data set with many trends, it may be preferable to first filter the time series by detrending them. The most basic method would be to subtract the best-fit line. One can also consider detrending algorithms for specific fields, such as detrending for astronomy [16, 24] or detrending for medical applications [6].

5 Results

Two major analyses were performed with the method described in this paper. First, we analyze a subset of the MACHO survey, chosen because there are known events in the data, and also because it represents a large enough size to demonstrate the method's efficiency. Our results show the accuracy of the method in our astronomical application. Second, we create synthetic time series to find empirical results for one of the method's parameters and, in addition, to show further evidence of the accuracy of the method.

5.1 MACHO. The MACHO project spanned several years and observed millions of stars in order to detect a few rare events. It is a well-known and well-explored survey [1]. Our goal is to find at least two known types of events in this data: microlensing events [2] and evidence of blue stars [12]. Other event types exist, but these two examples are well researched and cited, and thus serve the purpose of verifying our results. Microlensing events are characterized by a single significant deviation from the baseline (see Figure 1 for an example). Blue stars are characterized by a similar shape, and also have a similar duration to a microlensing event (see Figure 8 for an example) [12].

Figure 8: Blue star in the MACHO survey (MACHO ID: 2.5628.4423). Event was found with p-value = 5.008 × 10^−142.

Our analysis was performed on 110,568 time series from 56 tiles (from the red band), each tile representing a portion of the sky. In total, there were 28 known microlensing events and 5 known blue stars in the subset. Each time series had a length between 1000 and 1812, with an average length of 1386. Before performing our analysis, some basic detrending was performed by subtracting a least-squares linear fit from each time series. All time series were analyzed with the same probability distribution for N = 1000, where any larger time series would use the overlapping method described in Section 3.4.
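The basic detrending step mentioned above, subtracting a least-squares linear fit, takes only a few lines; a sketch:

    import numpy as np

    def detrend_linear(t, y):
        """Subtract the least-squares best-fit line, the basic detrending
        applied to each light curve before analysis (Section 5.1)."""
        slope, intercept = np.polyfit(t, y, deg=1)
        return y - (slope * t + intercept)

    t = np.arange(200.0)
    y = 0.01 * t + np.random.default_rng(2).normal(0, 1, t.size)  # trend + noise
    flat = detrend_linear(t, y)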
A ranking of the set of possible events (i.e., intervals of time series) was done by considering the event with the lowest p-value for each interval. Table 1 summarizes the analysis of the MACHO data set. There was no difference between the results obtained from the analytic and Monte Carlo methods. From the results, we observe that all known events were found in the top 1.1% of the ranks. In order to show that this is a desirable result, one must examine the events found to be more significant than the lowest-ranked known event (rank 1217). We examined all 1217 significant light curves and found that each either contains a significant event, or is periodic/pseudo-periodic and thus appears as a significant deviation from the baseline. Examples of such unidentified and pseudo-periodic events are shown in Figures 9 and 10.

Table 1: Results of MACHO Analysis, analytic.

  Event Type     Median Rank   p-value of Median   Rank of Last   p-value of Last
  Microlensing   159           6.577 × 10^−50      1217           1.102 × 10^−21
  Blue Star      114           4.296 × 10^−61      324            8.872 × 10^−42

Figure 9: MACHO ID 80.6468.77. Periodic (or pseudo-periodic) event not identified as microlensing or as a blue star. p-value is 1.688 × 10^−104, with a rank of 55.

Figure 10: MACHO ID 118.18278.261. Event not identified as microlensing or as a blue star. p-value is 8.167 × 10^−112, with a rank of 27.

In Section 2, we discussed past work for the discovery of events in astronomical data, such as performing a χ^2 test. For example, the microlensing example in Figure 11 has χ^2 = 1.13, but has a p-value significance of 4.273 × 10^−26 and a rank of 689 in the above analysis. Moreover, it is important to note that 371 new events with χ^2 < 3 were discovered in these top ranks, such as the example in Figure 12. We are currently examining these events to determine the nature of those phenomena. Each event must be examined carefully, by comparing colors and, desirably, spectral information. A follow-up paper to appear in an astronomical journal will address those cases [23].

Figure 11: MACHO ID 119.20610.5200. Microlensing event with p-value of 4.273 × 10^−26, but with χ^2 = 1.13.

Figure 12: MACHO ID 113.18809.1561. Unidentified event with p-value of 3.184 × 10^−17, a rank of 1199, and χ^2 = 0.82.

In addition, we compared our results to HOT SAX to make the distinction between event detection and anomaly detection. The analysis was run on the same data set of 110,568 detrended time series, using the brute force method of HOT SAX. The distance of the discord was calculated using the Euclidean distance, and the results were then ranked from highest to lowest Euclidean distance (the largest discord between nearest neighbors). In the top 1217 results, only 3 known microlensing events were discovered, with many other time series of little significance appearing as false positives. The results are summarized in Table 2. It is clear that finding the significance of subintervals is not the goal of HOT SAX, which is more attuned to periodic series with less noise, where it performs quite well [13]. As discussed in Section 2, the distinction between anomaly detection and event detection is key, in that they have different goals.

Table 2: Results of MACHO Analysis using HOT SAX.

  Event Type     Median Rank   Lowest Rank
  Microlensing   34,233        91,779
  Blue Star      21,691        52,866

5.2 Synthetic Time Series. In order to examine how the number of random restarts required by our method affects the results, we performed an analysis on synthetic time series. The data set consists of 2000 synthetic time series: 323 are time series with single events of varying sizes, and 50 have two events. The remaining 1627 time series are Gaussian noise, with a standard deviation of 5.0 and a mean of 0.0.

For those time series with events, the events were created using the following function f(t):

(5.10)  f(t) = h e^{−(t−S)^2 / (2θ^2)} + ε

where ε is the error (or noise), h is the height of the event (Y-axis), S is the time t at the highest point of the event (X-axis), and θ modifies the length of the event. Our events ranged over θ = 1.5, ..., 10.5 by increments of 0.5 and h = 20, ..., 100 by increments of 5, and ε was a random variable drawn from a Gaussian distribution with a standard deviation σ = 5.0 and a mean µ = 0. Thus, the signal-to-noise ratio of these time series ranged from 2 to 20.
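A sketch generating one such synthetic series from equation (5.10) as reconstructed above (the Gaussian-bump reading of the garbled formula is ours; the default parameters are one point from the grids just described):

    import numpy as np

    def synthetic_event_series(n=200, h=50.0, S=100.0, theta=5.0, sigma=5.0, seed=3):
        """One synthetic light curve per equation (5.10): a Gaussian-shaped
        event of height h at time S, width set by theta, plus N(0, sigma) noise."""
        rng = np.random.default_rng(seed)
        t = np.arange(n, dtype=float)
        return h * np.exp(-(t - S) ** 2 / (2 * theta ** 2)) + rng.normal(0.0, sigma, n)

    series = synthetic_event_series()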
We ran our method on all 2000 synthetic time series, and the events were ranked by their p-value. All 423 events (323 from single-event time series, and 2 events for each of the 50 time series with two events) were in the top 423 ranks, without a single false positive in those ranks. All events were found with a p-value of 6.018 × 10^−9 or less, whereas the most significant intervals in the time series without events all had p-values higher than 2.891 × 10^−7.

HOT SAX performed almost as well on this data. In the top ranks, only 7 false positives were reported. It is interesting to note that the lower-ranked events were those that occurred in time series with two events. This is due to the fact that the algorithm will consider the two events as nearest neighbors, and subsequently each will be scored lower. As stated, this comparison of events is ideal for situations in which a time series has periodic events and one is attempting to discover anomalies within those events, not to discover the events themselves. In addition, a second experiment was conducted with HOT SAX by adding a single outlier point to each of the time series (at a random position between 1 and N), with a value of −5h. Adding such a value is consistent with many domains, such as astronomy, where outliers in time series are quite common. After conducting this experiment, the HOT SAX results were no longer useful, with no correlation in the top ranks with time series that contained events. Our method was able to reproduce nearly identical results to the original synthetic time series experiment.

The final aspect of this experiment was to consider different values for the number of random restarts needed by our method. The above experiment was performed with 30 random restarts; in addition, the analysis was done with 5, 10, 20, 30, and 50 random restarts. Table 3 summarizes the results of this experiment. It is important to note that significant events in this experiment are considered to be those ranked above a p-value threshold of 10^−8, a p-value that no pure-noise time series reaches. The column One-Event TS represents the number of events found (one per time series) in the set of 323 time series with only one event. The next column, Two-Event TS, represents the number of total events found from the set of 50 time series with two events (thus, we hope to find 100 events). It is clear that increasing the number of random restarts also increases the possibility of finding a second significant event, but that one begins to experience diminishing returns after about 30 random restarts.
