A New Approach to Time Domain Classification of Broadband Noise in Gravitational Wave Data S. Mukherjee, P. Rizwan, and R. Biswas Center for Gravitational Wave Astronomy, Dept. of Physics and Astronomy, The University of Texas at Brownsville, 80 Fort Brown, Brownsville, TX 78520, U.S.A.∗ Broadband noise in gravitational wave (GW) detectors, also known as triggers, can often be a deterranttotheefficiencywithwhichastrophysicalsearchpipelinesdetectsources. Itisimportantto understandtheirinstrumentalorenvironmentaloriginsothattheycouldbeeliminatedoraccounted forinthedata. Sincethenumberoftriggersislarge,dataminingapproachessuchasclusteringand classification are useful tools for this task. Classification of triggers based on a handful of discrete properties has been done in the past. A rich information content is available in the waveform or ‘shape’ of the triggers that has had a rather restricted exploration so far. This paper presents 2 a new way to classify triggers deriving information from both trigger waveforms as well as their 1 discrete physical properties using a sequential combination of the Longest Common Sub-Sequence 0 (LCSS)andLCSScoupledwithFastTimeSeriesEvaluation(FTSE)forwaveformclassificationand 2 the multidimensional hierarchical classification (MHC) analysis for the grouping based on physical properties. Ageneralizedk-meansalgorithmisusedwiththeLCSS(andLCSS+FTSE)forclustering n thetriggers using a validity measure to determine the correct numberof clusters in absence of any a J priorknowledge. Theresultshavebeendemonstratedbysimulationsandbyapplicationtoasegment of real LIGO data from the sixth science run. 0 2 PACSnumbers: 95.85.Sz,04.80.Nn, 07.05.Kf,02.50.Tt,02.60.Pn ] c q I. INTRODUCTION the GW channels (referred to as DARM ERR channel) - in the LIGO detectors [12]. These consist of looking at r g triggers in time frequency planes [13], exploring loudest Gravitational-Wave (GW) detectors viz., LIGO [1], [ triggersseeninburstpipelines [14,15],studying low fre- Virgo [2] have been online in data-recording mode since quency seismic disturbances [16, 17], looking at specific 1 2000. LIGO has concluded its sixth science run (S6) in types of triggerse.g. from photodiodes [7] and exploring v 2010. Data have been archived from not only the GW 0 structurespresentinthetriggerpopulationindimensions channels,butalsofromseveralhundredsofauxiliaryand 0 higherthantheusualthreedimensionalcartesiansystem environmentalchannels. Therearetwomainbroadcate- 4 usingmultidimensionalhierarchicalclassification(MHC) 4 gories in which data analysis has been organized: astro- methods [12,18,19]. Thesemethods arecomplementary . physical searches [3–6] and detector characterization [7]. 1 and cast light on different aspects of the triggers seen in These two tasks are not entirely independent. Detector 0 the data. characterization research products prepare the ground 2 1 workforunderstandingtheunderlyingnoiseandfeedthe GWdata arearchivedinthe formatofdiscretelysam- : astrophysical searches with information [8] that symbol- pled time series from the main GW channel as well as v i ize which data segments are relevant for GW search. fromseveralhundredsofinstrumentalandenvironmental X DatafromtheGWdetectorshavebothbroadbandand channels that are recorded specifically to monitor func- r narrowband noise. The narrowband noise (aka lines) is tioningofdifferentinstrumentalsubsystemsandenviron- a extensivelystudiedandseveralmethods[9,10]havebeen mental activities that affect the GW channel data. The implemented to efficiently remove them from the data. triggersarriveatahighrateinallchannels. Thisrequires However, reduction of the broadband noise (aka trig- data mining methods to keep up with near realtime in- gers) is a difficult problem that has not been explored formation and to process the enormous volume of data to the fullest yet. The sensitivity of the detectors has for information extraction. Classification is the most ef- improved steadily over the years [11]. With each step fective way of addressing this problem. The methods of towards a more sensitive instrument, many new sources classifyinglargedatasetsinmultidimensionalparameter of noise have also been unearthed. Occurence of trig- space bring an immediate reduction in the dimensional- gers in the data is a function of the operating condi- ity of the problem under the assumption that existing tions. Thus it is important to track down the sources classes show some common collective properties. In the to make the output data as high quality as possible and contextoftriggersseenintheGWdata,wewouldliketo to reduce probability of false alarms. A very large effort explorehowmanydifferentclassesoftriggersarepresent has been undertaken to analyze the noise transients in and how to characterize these classes in terms of their origin. Development of a knowledge base in understand- ing the properties of the triggers thus seen in GW data contributes towards development of realistic noise mod- ∗Correspondingauthor:[email protected] els thatare essentialinproper assessmentofpeformance 2 of astrophysicalsearch pipelines. pipeline produces trigger classes with similar waveforms There are several analysis pipelines that operate on- andamplitude,centralfrequency,Q-factorandsnrrange. line on LIGO data to look for burst-like signals or trig- Eachofthese classesoftriggersisshownto be relatedto gers e.g. kleine welle (KW) [15], omega (OP) [29] and agroupofauxiliaryandenvironmentalchannelsindicat- waveburst(WB) [20]. The KW pipeline works on multi- ing the most probable sources of their origin. This com- ple channels - the GW channel and several hundreds of bined classification pipeline performs better than meth- auxiliary and environmental channels. The threshold is odsbasedeitheronly ondiscrete triggerpropertiesoron kept such that the pipeline picks up triggers of all types trigger waveforms alone. Thus it can be turned into an atasteadyrate. Unsuperviseddataminingmethodslike effective, low latency trigger identification tool for GW theMHC analysis[12,18]hasbeendevelopedandLIGO detector characterization. science data have been analyzedin recentpast. The aim This paper is an illustrationofthe method and its ad- of these studies has been to classify the population of vantages. Results fromthe sampleS6datachosenovera triggers seen in the GW, environmental and auxiliary two day period are used to show how the method could channelsintostatisticallysignificantdistinctgroupswith beappliedinsciencerunstoextractinformationcomple- uniform characteristics. These studies have been mostly mentary to and in conjunction with the existing meth- based on a handful of discrete properties of the triggers ods with low latency. The final outcome of this analysis viz. duration, centralfrequency andsignal-to-noiseratio shows existence of several statistically significant classes (snr). However,an important aspect of the triggers, viz. of triggers with distinct waveforms and physical proper- the ‘shape’ factor has largely been overlooked. Shape ties coming from the GW channel in the test data set of the triggers, or the waveform, often contains rich in- from S6. Post-classification analysis explores the cou- formaton. Temporal classification [22–25] methods e.g. plingsofthesetriggerclassestodifferentsetsofauxiliary S-meansandConstraintValidation[21]havebeendevel- and environmental systems. Application of LCSS and oped in the recent past and studied on simulated GW LCSS+FTSE (which yields classes based on waveforms) data. However, they have not been tested on data from alone would give 19 distinct classes, while application of real GW detectors. only the MHC analysis would have given 3 statistically significant classes. Thus, the proposed analysis clearly Another successful and often applied method is based has advantage in using a bigger parameter space leading ondistances calculatedusing the LongestCommonSub- to a finer classification structure that helps in the iden- sequence(LCSS)[26]. Thetechniquehasbeenstudiedin tification of triggers by maximum utilization of its in- GW data for the first time in a recent publication [19]. formation content. Since each of the subgroups contains Preliminaryresultsdemonstratedhaveshownproduction triggers with very characteristic properties related to a oftriggerclusterswithsimilarshapes. Thecurrentpaper specific set of channels, the method proves useful in the exploresfast,accurateandefficientmethodsforunsuper- classification of triggers seen in GW data and in helping visedclassificationoftriggerwaveformsfurther. Ananal- withtrackingdownthe sourcesororiginsof the triggers. ysis pipeline has been constructed based on LCSS and As a direct application to detector characterization, we also LCSS in conjunction with Fast Time Series Evalua- can classify the triggers seen in GW science data into tion (FTSE) [27]. The latter is done to explore possibil- differentgroupswithcharacteristicproperties,relatedto ities of increasing computational speed when very large specificinstrumentalandenvironmentalsources. Wecan datasets of triggers are involved. This is the first ex- thusstudythetrendofvariouskindsoftriggersandgain ploration of combined LCSS and FTSE methods in the insight into how some of the channels may be reponsible context of GW data analysis. A second stage involving in production of specific types of triggers. theMHCmethods[12,18]iscarriedouttochecktheho- mogeneity ofthe clusters in the parameterspace oftheir The paper is organized as follows. Sec. II describes physical properties. This results in further segmenta- the trigger generation process and Sec. III describes the tion of the triggers to appropriate them to their sources FTSE and LCSS algorithms. We explain the pipeline - instrumental or environmental. Some of the specific for generation of trigger clusters in Sec. IV. Sec. V then questions we explore in the paper are (i) Are LCSS and presents results from numerical simulations and applica- LCSS+FTSEsuitablemethodsforfastandaccuratetrig- tions to S6. Conclusions and directions to future work ger classification? (ii) How are the resulting classes of are presented in Sec. VI. triggerscharacterized? (iii) What are the computational costs involved? (iv) How robust are these methods for GW trigger classification? (v) How can the analysisgive II. TECHNIQUES FOR DETECTING relevant information for tracking down sources of non- TRIGGERS IN LIGO DATA GW triggers? It is shown in this study that application of the LCSS Techniques for detection of burst-like triggers in the (or LCSS+FTSE) followed by MHC is a very useful and GW instrumentaloutput data streamare describedcon- productive way to classify triggers based on their wave- ciselyinanearlierpaper [15]. Ingeneral,suchtechniques formsandcharacteristicphysicalproperties. Ashasbeen project the data onto a basis that spans the parameter found in the study with S6 data, the end result of the space of the burst-like signals. A measurement becomes 3 optimalwhenthereisanexactmatchbetweenamember thatappearsinbothsequences. Alongest common sube- in the contructed basis and a burst. The snr ̺ in this quence (LCSS) is a common subsequence of maximal case is defined as length. For example, suppose ̺2 = ||h||2 , (1) s1 =uuuvvwtwuwttuttvwttvtuwuu (3) S (f) h and where ||h||2 is the total energy content of the signal and S (f) is the one-sided power spectral density. In cases s2 =vuvvvvtuuwwtuvvtttwwttv, (4) h where a close match between the signal and the member of the basis is not achieved, the bursts are characterized an LCSS (denoted by LCSS(s1,s2)) is given by uvvtuwtuvtttw. The algorithm operates on the prin- by a quality factor Q defined as ciple of enumeration all subsequences of s1, followed by f checking if they are subsequences of s2 as well. c Q= (2) Let us now look at the theory of LCSS in the context σ f of the trigger waveforms in the present study. Formally, where f is the central frequency and σ2 is the band- this amounts to comparing two input trigger time se- c f ries sequences X(1...m) and Y(1...n), where m and width. The time-frequency plane is thus titled by the n denote the length of the sequences X and Y respec- Q values where the signals are represented by localized tively. The length of LCSS of X and Y (or written as pixels with the same Q. LCSS(X,Y)) will be denoted by ζ. The recurrence rela- tion [47] leading to the length of the LCSS for each pair III. ALGORITHMS FOR TRIGGER [X(1...i),Y(1...j)] is given as follows: CLASSIFICATION IN A MULTIDIMENSIONAL SPACE α(i,j)=0 (5) There are two main issues that one must address if X or Y is empty sequence, i.e. if i=0 or j =0; while developing an efficient feature-based classification method (i) the data may have temporal gaps, i.e. there α(i,j)=1+α(i−1,j−1) (6) may be same pattern occurring at different time epochs and(ii)the classificationpreferablyshouldbe automatic if X(i)=Y(j); (to keep up with the online feedback systems) and thus unsupervised classification methods that are robust to α(i,j)=max{α(i−1,j),α(i,j−1)} (7) noise need to be explored. if X(i)6=Y(j), where α(i,j)=ζ([X(1...i),Y(1...j)]). (8) A. Longest Common Sub Sequence Thealgorithmisshowngraphicallyinfigure1. Asingle Let us first take a look at why LCSS algorithm is ef- α value is localized in the sense that it depends only on ficient and how it fits into GW data analysis. In any the three neighboring values. After the table has been unsupervised classification method, the first step is to filled, the length of the subsequence is found in calculate the distance between points in the parameter space. Eventhough Euclideandistance is the mostcom- α(m,n)=ζ([X(1...m),Y(1...n)]). (9) monly used method, it is not suitable for addressing the The commonsubsequenceis foundby backtrackingfrom first difficulty mentioned above. Thus, two triggers that α(m,n) by following at each step, the pointers that are havethe samewaveformmayshowa highEuclideandis- set during the calculation of the values. When a match tance if they are not occurring simultaneously. This will is found, the LCSS is upgraded. In this way, one can beconsideredaredundantclusterfromthephysicalpoint traverse a path through the LCSS table (as shown in 1) of view. LCSS algorithmis able to compute a matchbe- until a length of zero is reached. In this case, ζ(X,Y) tween two time series by calculating metrics for triggers is =4 and the LCSS corresponding to the path shown is that do not necessarily occur at the same time without uvuu. having to rearrange the sample sequences to coincide. The algorithm works as follows. A pair (i,j) defines a As an example, let us consider a sequence of charac- match if ters x ,x ,x ,x ,x ,x ,x . A subsequence is defined m n p p q n q as a set of characters that appear in an order from the X(i)=Y(j). (10) leftto the right,butnotnecessarilyconsecutively. Thus, [xm,xn,xp], [xm,xp,xp,xq], [xp], [xm,xn,xp,xp,xq,xn] The set of all matches is given by are subsequences, but [x ,x ,x ] is not a subsequence. p p n A common subequenceof two sequences is a subsequence η ={(i,j)|X(i)=Y(j),1≤i≤m,1≤j ≤n}. (11) 4 means is dictated by the fact that this allows formation of homogeneous clusters that are insensitive to outliers. K-means [37, 38] uses two parameters to start with the number of clusters K and the set of elements D = [t1,t2,...,tn]. The algorithm works as follows. Let M = [m1,m2,...,mk] be the set of centroids as- signed randomly and the size of the set M equals K. Each item t is placed into the cluster which has the i nearest mean. M is populated with a new value of mean for each cluster K . The cluster mean of K = i i [ti1,ti2,...,til]isusually(butnotnecessarily)calculated as: l 1 mi = Xtij l j=1 FIG. 1: The figure shows graphically how the LCSS algo- where l is the number of items and ti is the item placed rithm works. The two-dimensional array with the two se- in each cluster Ki. quences X and Y is shown along the two axes. Initially the The last three steps are repeated until a defined conver- cells in the array have uniform entries of zeros. In the next gence condition is met such as no further change in the step, we look for an element to element match. Whenever a membership of the clusters. matchisfound,thecellentryinincrementedbyone. These- quenceofarrowsshowapossibleLCSSpath. TheLCSSpath One of the short falls of k-means is that the algorithm in this case indicates that there is a match of four elements needs a predefined value of K or the number of clusters, inthesequenceviz.,uvuu[31,47]. More detailsaregivenin which is exactly what the proposed clustering technique III. aims to know. In case of unknown data under test, the number of significant groups in the classification struc- Each match belongs to a class ture is unknown and hence it cannot be supplied to K- means. Ω ={([i,j]|[i,j]∈η;α(i,j)=k},1≤k ≤ζ. (12) Several validity measures have been developed to de- k termine the value of K in k-means. Using the ratio of A match belonging to the Ωk is called a k-match. In the intra cluster distances δintra to inter cluster distances figure 1, the marked entries define the class Ωk. Since δinter, a simple validity measure Vmin is used to find out each match belongs to only one class, these classes par- the optimum number of clusters [30]. The time series tition all matched of η. representationofthetriggerswithinacluster(apointin Some k-matches are more useful algorithmically (e.g. the multivariate parameter space) must be as similar as square marks in the figure 1) than the others (e.g. circle possibleandsimilarpointsbelongingtodifferentclusters marksinthefigure1). Thiscanbeprovenasfollows. Let must be as different as possible (in the same multivari- us consider matches [i,j] and [i′,j′] of Ω for i = i′ and ate parameter space) to ensure compactness of clusters. k j ≤ j′ or i ≤ i′ and j = j′. Every element of Ωk+1 that Therefore,the intra cluster distance shouldbe minimum follow[i′,j′]shouldalsofollow[i,j]intheLCSS.Thus,it andtheinterclusterdistanceshouldbemaximum. Since, is sufficient to consider only the dominant matches [i,j]. Let ϕ be the set of all dominant matches. The regions δ k V = intra (13) in the figure where α(i,j) values are equal (shown by min δ inter vertical and horizontal lines) are called LCSS contours. the value of K which makes the validity measure min- Each k-match lies immediately below the kth contour. imum, is the ideal one. In practice it is expected that Thesecontoursaredefinedbyanorderingproperty. Ifϕ k thevaluesofKfoundfromtheabovemethodwouldvary = [i1,j1],[i2,j2],...,[il,jl,], the matches can be renum- slightly on different runs. This happens because, in the beredsuchthat i1 <i2 <···<il andj1 >j2 >···>jl. k-means algorithm,the centroids are assignedrandomly. The strategyfor locating the dominantmatches is based Therefore, the validity measure was computed one hun- on advancing from contour to contour. dredtimesonthesamesetsofdataandthemostfrequent value of K is chosen. The role of LCSS in this clustering scheme is to gen- B. K-means erate an adjacency matrix containing information about the pairwise distances between trigger time series. Once Once the distances are calculated according to equa- thematrixiscreated,itcannowbe suppliedtoK-means tion(5),ageneralizedk-means[29]algorithmisemployed which treats it as an ordinary input and operate on it to form the clusters. The reason as to the choice of k- based on the algorithm. 5 C. Attempt to Improve Computational Speed: Fast Time Series Evaluation Ifweneedtoimplementtheclassificationalgorithmas anearreal-timetoolfortriggeridentification,weneedto make the process computationally fast and efficient. To address this question, we also investigated methods that couldenhancethecomputationalspeedoftheLCSSalgo- rithm. We thus investigatethe FastTime Series Evalua- tionalgorithm(FTSE) [27]inanattempt tomakeLCSS faster. LCSS in conjunction with FTSE claims to be faster than LCSS algorithm alone because unlike LCSS calculations,FTSEdoesnotuseatwodimensionalarray to compare matches between two time series. Values in one time series are entered into a grid. Then a point in the other time series probes into its respective grid cell FIG.2: ThefigureshowsgraphicallyhowtheLCSS+FTSE (ofthesamegrid)tocheckifpointsofthefirsttimeseries algorithmworks. Thetoprowrepresentsthecellsofthegrid. reside there. The construction of the grid ensures that According to the figure, indices of the matching elements of if points are found residing in that grid cell, they must a series X are put into corresponding cells of the grid. The match the probing point within a defined threshold. To elements of the second series Y are compared with the grid put it simply, in LCSS computed with FTSE, the com- cells formed from X to check if points of the first time series parisonbetween two time series occurs only between the reside there. The construction of the grid ensures that if intersecting portions and that is the underlying reason points are found residing in that grid cell, they must match whyLCSScomputedwithFTSE isexpectedto be faster the probing point within a defined threshold. In this case, than LCSS - where the comparison occurs throughout the element B of Y is found to match the second grid cell the entire length of both time series. and thus the value inside the cell enters the intersection list. Likewise,theotherelementsofY arealsoprobedinasimilar TheaveragecostofcalculatingLCSSisO(p×q)where mannerandthetotalmatchingpointsbetweenthesequences pandq arethelengthsoftwosequencesbeingcompared. are recorded in the intersection list [27]. In this example, TheaveragecostofFTSEcomputingLCSSisO(M′+Lq) thecontributingindices2,3,4,6givesthesubsequence,viz., where M′ is the number of matches, L is the longest in- B,C,B,A. tersectionbetween the two sequences andq is the length of the probing sequence [3]. The process of computing LCSS with FTSE is elaborately shown in figure 2 . The tion, frequency, snr and statistical significance. top row represents the cells of the grid. According to The algorithm starts with calculation of a Euclidean the figure, matching elements of a series X are put into distance between vectors X and X defined by, m n the same cells of the grid. The elements of the second series Y compared with the grid cells formed from X to d2 =Σ (X −X )2 (14) mn i m n i check if points of the first time series reside there. The construction of the grid ensures that if points are found The datamatrix hasdimensionp×q , whereeachofp residing in that grid cell, they must match the probing triggers is described by (1×q) vectors X1,X2,...,Xp. pointwithinadefinedthreshold. Inthiscase,theelement Calculation of distance is followed by computation of BofY isfoundtomatchthesecondgridcellandthusthe suitable measureofproximitybetweentwo groupsofob- lengthofthematchingsubsequenceisaugmentedbyone. jects. These are called ‘linkage’ criteria. We adopt the Likewise, the other elements are also probed in a similar criterionof‘complete’linkagewhichmeasuresthelargest manner and the total matching length between the se- distancebetweenobjectsinthetwoclusters. IfN isthe m quences are recorded. Once all the matching lengths of number of objects in class m and N is the number of n thetriggersignalsarecalculated,thesignalsareclustered objects in class n, and X is the jth object in class m, mj using the method described above. a complete linkage is defined as follows. d(m,n)=max(∆(X ,X )), (15) mj nk D. Multidimensional Hierarchical Classification algorithm with j ranging between 1,...,N and k ranging be- m tween 1,...,N . ∆ denotes the distance. This stage of n TheMHCpipelineimplementsahierarchicalalgorithm the algorithm results in a hierarchy from p clusters with [14-19]thatappliesavarianceminimizationcriterionand one object to one cluster with p objects. The choice of groups together burst triggers detected by the KW [11] significantclustersisgivenbycomputationofthecorrela- pipeline based on their similarity in the higher dimen- tioncoefficient,‘r′ [40]. r2 isrelatedtothefractionofthe sional space spanned by properties like the trigger dura- total variance accounted for partitioning into s clusters. 6 r2 is defined as follows. If W =ΣNi |X −X¯|2, (16) i j=1 j W r2 =1−Σs i (17) i=1Σp |X −X |2 l=1 l total N denotes the number of members in the ith class, X¯ i denotes the unweighted mean of the population in the ith class and X denotes the mean of the entire pop- total ulation [39, 41]. The statistical significance of the classificationscheme canbeverifiedbyusingthemethodofMultivariateAnal- ysisofVariance(MANOVA)[39]. Assumingthattheun- derlying distribution is a multinormal mixture [41, 42], the model is given below. For an n dimensional data set with m clusters eachwith p members, the ith triggerin k the jth cluster gives an n dimensional vector X =µ+τ +ǫ . (18) ij j ij µisthepopulationmeanofthetotalpopulation,τ isthe j offsetofthejth clustermeanfromµandǫ isthescatter ij of the points around the eman value. The hypothesis to FIG. 3: Five different types of simulated trigger wave- forms were used in the classification pipeline with LCSS and be tested is as follows. LCSS+FTSE. These waveforms are (from top to bottom) H0:τ1 =τ2 =τ3 =...τm =0. (19) mixturesinusoids,pulsetrains,noisegeneratedwithalowor- der ARMA model ([28]), a simple triangular sawtooth wave Let the sum of squares be written as and a chirp. The simulated set consisted of 760 waveforms with these shapes but varying in amplitude, frequency, rel- SS =Σm Σpk (X −X)(X −X)T (20) ative location on the time axis etc. The figure represents k=1 j=1 k k typical examples of each type. No noise was added to these and the cross products are written as trigger waveforms. The x-axis denotes time samples and the y-axisdenotes amplitude. CP =Σm Σpk (X −X )(X −X )T, (21) k=1 j=1 kj k kj k The test statistics are as follows: (i) Wilks lambda [43]: det(CP) ∗ λ = (22) det(SS+CP) A. Simulated triggers without additive noise (ii)Pillais trace [44]: V =trace[SS×(SS+CP)−1] (23) Wefirstgenerateadatasetof760simulatedwaveforms (iii) Hotelling Lawley’s trace (or Mahalanobis D2 statis- with variable parameters, each 1024 samples long. The waveforms are in the shapes of Mixture sinusoids, Pulse tic) [45] trains,noise generatedwith a low orderAuto Regressive U =trace(CP−1×SS). (24) Moving Average (ARMA [28]) model, Triangular saw- tooth and chirps with varying parameters, i.e. varying All the test statistics follow the non-central F distribu- amplitude, frequency, width, location on the time axis tion [46]. etc. These waveform models are selected to generate waveforms of diverse nature and shapes. The motiva- tion behind including ARMA-model based waveforms in IV. ANALYSIS PIPELINE the simulation is because it is a general scheme that can modelvariousdifferenttypesofwaveforms(uptosecond The analysis is carried out in three main stages - (a) moment),includingthetypeofoutputsweseeinourGW using simulated triggers without additive noise; (b) us- detectors. The amplitude is normalized. The waveform ingsimulatedtriggerswithadditiveGaussianwhitenoise databaseisintroducedtothepipelineastheprimeinput. and (c) using a segment of LIGO S6 data chosen over a The waveforms are shown schematically in figure 3. The two day period. number of clusters is determined by k-means. 7 icant groups of triggers in this test population based on the shape of the waveforms of the triggers. Triggers in theOmegacatalogarealsodescribedintermsoffourdis- creteproperties. Thesearecentralfrequency,amplitude, snr and the quality factor (or Q-factor). Once the classification of triggers take place, we ad- dress the important detector characterization question: (i) How to characterize these triggers and (ii) What are thepossiblesourcesofthesetriggers,i.e. howdotheyre- latetotheauxiliaryandenvironmentalchanneltriggers? Classcharacterizationandtriggeridentificationstepsare as follows. (i) We take each trigger from a given sub-class that re- sults from the main classification structure of the GW triggers and take an Omega scan [48] around the peak time of the GW trigger. Omega scans produce time- frequency plots of all auxiliary and environmental chan- nel data that coincides with the trigger peak time. (ii) A histogram is constructed to see the distribution of the occurrence of triggers in the auxiliary and environ- mental channels for all the time windows corresponding totriggersinaparticularsub-class. Thehighestfrequen- cies are recorded. FIG. 4: This figure represents simulated trigger waveforms (iii) The existing data quality flags in the LSC detector with added Gaussian white noise used in the classification pipeline with LCSS and LCSS+FTSE. These waveforms are characterizationliterature[8]arealsonotedforcompari- (from top to bottom) mixture sinusoids, pulse trains, noise son. This alsoshowsif this method points to newer aux- generated with a low order ARMA model, a simple triangu- iliary and environmental sources other than the existing lar sawtooth wave and a chirp. The figure represents typical data quality flags. examples of each type. The x-axis denotes time samples and Data Conditioning: The time series segments corre- the y-axis denotes amplitude. The standard deviation of the sponding to the triggers are first subjected to condition- noise added to the trigger waveforms shown in the current ing. The following sequence of operations is executed. figure is σ=0.25. (i) Selection of triggers with snr greater than 12 from KW database; (ii)ExtractionofrawGWchanneldata,centeredaround B. Simulated triggers with additive Gaussian noise the trigger (extracted time series noted by, say, q′.) i (iii) Whitening q′ [32] and dynamically removing[9, 10] i Inthenextstepofthisstudy,wegenerateadatasetof the narrowband noise present in q′. The resulting time i 760 simulated waveforms of five different types of shape series is denoted by cq′. i with variableparameters,each1024samples long,as de- (iv)Filteringcq′withabandwidthofδf aroundf ,where i c scribedintheprevioussection. Theamplitudeisnormal- f is the central frequency of the trigger as noted in the c ized. Each waveform data stream is mixed with Gaus- KWcatalog. Theresultingtimeseriesisdenotedbyfcq′. sian white noise. The output thus is a noisy waveform, i (v) Re-sampling fcq′ to represent the appropriate band- as shown in figure 4. The snr of the triggers are kept i width. in a range of 2 and 20. This is fed to the classification As shown in the figure 5, the trigger database is used analysis pipeline. As before, the number of clusters is tofirstselecttriggerswithsnr>12. Thetimestampson determind by generalized k-means. the triggers are used to extract the corresponding time domain raw data. The raw data are conditioned elab- orately to reduce noise that is mixed with the triggers C. LIGO sixth science run trigger database i.e. theyarewhitenedandthennarrowbandnoiseissub- tracted from the data. The conditioned time series are Havinggainedinsightwiththesimulations,wenowap- resampled and appropriately bandpassed to record the plytheanalysispipelinetoclassifytriggersseeninLIGO waveform. Figure 6 shows the different types of wave- S6data. We haveusedtriggersfoundinthe Omegatrig- forms that have been found in the test data. These are ger catalog [15, 29, 49] during S6. Three hundred and fed first into the LCSS pipeline. The end product of the forty Omega triggers from the GW channel seen in the analysis is a set of individual uniform groups of trigers test data set (with snr > 12) are selected and subjected withsimilarshapeparameters. Thenumberofclustersis totheanalysispipeline. Theaimofthisexerciseistosee determind by generalized k-means. The individual clus- ifthisschemeofclassificationcanfindstatisticallysignif- ters thus generated are further subjected to the MHC 8 FIG. 5: The figure shows the full analysis pipeline as ap- plied to a sample of S6 triggers in this study. The trigger database is used to first select triggers with snr > 12. The time stamps on the triggers are used to extract the corre- sponding time domain information or waveforms. The time domaindataisconditionedelaboratelytoreducenoisethatis mixed with the triggers i.e. first whitened and then narrow- bandnoiseissubtractedfromthedata. Theconditionedtime series are resampled and appropriately bandpassed to record FIG. 7: The figure represents results of the classifica- thewaveform. Thesearefedfirst intotheLCSSpipelineand tion structure. K-means is run on the simulated waveform thenindependentlyintotheLCSSandLCSS-FTSEcombined database 1000 times and the numberof classes deemed most pipeline. The individual clusters thus generated are further significant (by the validity measure) is recorded. The his- subjected to the MHC pipeline for finer classification struc- togram shows that in most of the cases, the data shows ex- tures and post-processing. istence of five distinct classes, which is the true number of classes. The x-axis denotes the numberof significant classes. The y-axisshows thecorresponding numberof trials. fication structure. K-means (as described in section III) hasbeenrunonthedatabase1000timesandthenumber of classes deemed most significant (by the validity mea- sure V ) is recorded. The histogram shows a peak at min 5 significant classes which was the true number of clus- ters in the data. Both LCSS and LCSS+FTSE showed identicalclassificationstructure. Figure8showsthecom- putational speeds for LCSS and LCSS+FTSE. Contrary to hypothesis, LCSS+FTSE is found to be much more computationallyintensiveasthesamplesizegrowslarger than 120. The reason for this divergence is is explained in VI. The next set of studies are conducted with the same FIG.6: Thefigurerepresentssamplesofvarioustypeswave- forms that were seen in the test data from LIGO’s sixth sci- set of waveforms, but now mixed with Gaussian white encerun. Thesetriggerwaveforms wereobtainedafterappli- noise with varying standard deviation σ. The values of cationofthedataconditioningpartofthepipelinetotheraw σ varies from 0.1 to 0.8 in steps of 0.1. Figure 4 shows data. The x-axis represents time in arbitrary units and the thetypicaltriggerwaveformsofvarioustypeswithadded y-axisrepresents amplitude also in arbitrary units. noise. Classification is carried out by k-means as in the previous case. The best value of the number of clusters is determined by the validity measure V . It is found min pipeline for finer classification structures that can be re- thattheclassificationstructurestartstodeterioratewith latedtotheirmostprobableinstrumentalorenvironmen- increasing σ, i.e. decreasing signal to noise ratio (snr). tal origins. Forσ ≥0.3,thehistogrampeaksatfiveclusters,butthe overall shape of the histogram is broad, indicating that three or six clusters are equally probable (9). An over- V. ANALYSIS RESULTS whelming majority of three clusters for cases with high noise (σ ≥ 0.3) is observed. The reason for the deterio- The results of the analysis are demonstrated in this ration is easily explained. With increasing noise, many section. Figure 3 shows the different simulated trigger of the noise dominated waveforms, e.g. the mixture si- waveformsthataresubjectedtotheanalysiswithoutany nusoidal waveforms, the pulse train waveforms and the additive noise. Figure 7 shows the results of the classi- ARMA based waveforms look similar and are thus clas- 9 20 18 16 14 ur)12 Ho LCSS + e (10 FTSE m Ti 8 6 4 2 LCSS 0 0 100 200 300 400 500 600 700 800 Number of triggers FIG. 9: The figure represents results of the classification structure for the case of simulated triggers when the noise FIG. 8: The figure presents a comparison of computational level is high (i.e. triggers with low snr). K-means was run speeds of the analysis pipeline using LCSS and combined on thedatabase 500 timesand thenumberof classes deemed FTSEwithLCSS.Whilethetwomethodsseemtohavecom- most significant (by the validity measure) was recorded. For parablecomputationalspeeds(i.e. notsignificantlydifferent) σ=0.3, the histogram peaks at five clusters, but the over- up to n < 120, the LCSS+FTSE algorithm is found to be all shape of the histogram is broad, indicating that three or more computationally intensive beyond that point. The x- six clusters are equally probable. The results show an over- axisrepresentsthenumberoftriggersclassifiedandthey-axis whelming majority of threeclusters for cases with high noise represents computation time in hours. (σ ≥ 0.3). More details are given in section V. The x-axis denotes the number of significant classes. The y-axis shows thecorresponding numberof trials. sified as one group. As stated earlier in this section, both LCSS and LCSS+FTSE showed identical classification structure, butLCSS+FTSEisfoundtobemuchmorecomputation- ally intensive. Thus, we found that there is no real ad- vantage at this stage to continue to apply LCSS+FTSE to further analysis. We thus continue the application to S6 sample data using only LCSS algorithm. The analysis yielded 19 significant clusters of triggers. As mentioned above, figure 6 shows typical waveform fromeachclass. The shapeofthesewaveformsformsthe basis of classification into distinct clusters by the LCSS pipeline. Figure 10 shows how the four discrete proper- tiesofthe omegatriggers(snr,amplitude,frequencyand Q-value) vary between the different clusters. While the amplitude shows least variation, the other three prop- erties have wide error bars indicating that the classes based on similar waveforms are heterogeneous. We fur- ther investigated if the clusters thus produced have any significant sub-clusters present in them. Figure 11 shows details of the properties of one of the 19 classes of triggersfound in the test data (class # 10). FIG. 10: The figure shows variation of snr, amplitude, Q- The top panel in this figure shows a typical example valueandfrequencythatdescribetriggersinthestudy. While of trigger waveforms that is classified belonging to this the amplitude shows least variation, the other three proper- group. The panels below the top panel show how some tieshavewideerror barsindicating that theclasses based on ofthechiefattributesviz.,centralfrequency,thesnr,the similar waveforms are heterogeneous. The x-axis represents Q-value and the amplitude of the triggers belonging to time in arbitrary units. The y-axis represents the respective this class are distributed. A very similar picture arises unitsof each property displayed. for another trigger class in the study 12 (class #14). As it appears from the distributions of the properties 10 The classificationis based on the four dimensional space spannedbyamplitude,centralfrequency,Q-valueandsnr of the triggers. Statistical significance of the classifica- tionstructurethusfoundhasbeenvalidatedatp<10−6 level. This sequentialapplicationofthe LCSSandthe MHC pipelines has proven very useful as far as distunguishing different types of triggers from various sources is con- cerned. The LCSS pipeline separates out the triggers with similarity of waveform, thus taking a first cut at fragmenting the huge parameter space into a finite num- ber of partially uniform (in terms of waveform) groups. The MHC then further explores the finer physical prop- erty based groups present in these broader classes. Thus,thefinaloutcomeoftheanalysisontheexample data set in this study shows existence of 19 statistically significantclassesoftriggerswithdistinctwaveformsand further classification based on physical properties, com- FIG. 11: The top panel in this figureshows a typical exam- ing from GW channel and different sets of auxiliary and ple of trigger waveforms that is classified belonging to class environmental systems leads to uncovering more groups #10. The panels below the top panel show how some of the orclustersoftriggers. Thecombinationofthetwometh- chief attributes viz., central frequency, the snr, the Q-value ods yield trigger clusters that would not appear by any and the amplitude of the triggers belonging to this class are one of the component algorithms - for example, applica- distributed. The x-axis represents each of the properties ex- tion of LCSS (which yields classes based on waveforms) pressed in arbitrary units. The y-axisrepresents numbers. alone would give 19 distinct classes, while application of only the MHC analysis would have given at most 3 statistically significant classes (being constrained by the dimensional of the parameter space.) Since each of the subgroupscontaintriggerswithverycharacteristicprop- ertiesandcanbe relatedtoaspecificsetofchannels,the method proves useful in classification of triggers seen in GW data and in helping with tracking down the sources or origins of the triggers. As a direct application to detector characterization, we canclassify the triggersseeninGW science datainto differentgroupswithcharacteristicproperties,relatedto specific groups of channels. We can thus study the pat- tern trend of various kinds of triggers and gain insight into how some of the channels may be reponsible in pro- duction of specific types of triggers. A. Post classification analysis FIG. 12: The top panel in this figureshows a typical exam- ple of trigger waveforms that is classified belonging to class In the following examples, we will use the trigger #14. The panels below the top panel show how some of the classes10and14andsub-groupsoftriggersfoundtherein chief attributes viz., central frequency, the snr, the Q-value for illustration of the post classification analysis. and the amplitude of the triggers belonging to this class are distributed. The x-axis represents each of the properties ex- pressed in arbitrary units. The y-axisrepresents numbers. 1. Auxiliary and Environmental channel connection like amplitude, central frequency, Q-value and snr for Once the clusters of triggers are determined and the each group, the groups are fairly diverse within itself, class members are assigned, the next question we ask is eventhoughthey seemto indicate similartypes ofwave- what are the possible couplings of these trigger clusters forms. Thispromptsafurtherlookintothegroupsthem- to the different sub-systems of the detector, i.e. what selves. Here, we have implemented the MHC method, are the possible sources of these triggers? We tackle this previously developed and described in detail in [12, 18]. problembyusingtheOmegascans[15]. Thisinformation