Draftversion January3,2017 PreprinttypesetusingLATEXstyleAASTeX6v.1.0 DEEP-HITS: ROTATION INVARIANT CONVOLUTIONAL NEURAL NETWORK FOR TRANSIENT DETECTION Guillermo Cabrera-Vives1,2,3,5, Ignacio Reyes4,1,5, Francisco Fo¨rster2,1, Pablo A. Est´evez4,1 and Juan-Carlos Maureira2 Email: [email protected] 7 1 MillenniumInstitute ofAstrophysics,Chile 1 2 0 CenterforMathematical Modeling,UniversidaddeChile,Chile 2 3AURAObservatoryinChile 4 n DepartmentofElectricalEngineering,UniversidaddeChile,Chile a 5Theseauthorscontributed equallytothiswork J 2 ABSTRACT We introduce Deep-HiTS, a rotation invariant convolutional neural network (CNN) model for clas- ] M sifying images of transients candidates into artifacts or real sources for the High cadence Transient I Survey (HiTS). CNNs have the advantage of learning the features automatically from the data while . h achieving high performance. We compare our CNN model against a feature engineering approach p using random forests (RF). We show that our CNN significantly outperforms the RF model reducing - the error by almost half. Furthermore, for a fixed number of approximately 2,000 allowed false tran- o r sient candidates per night we are able to reduce the miss-classified real transients by approximately t s 1/5. To the best of our knowledge, this is the first time CNNs have been used to detect astronomi- a cal transient events. Our approach will be very useful when processing images from next generation [ instruments such as the Large Synoptic Survey Telescope (LSST). We have made all our code and 1 data available to the community for the sake of allowing further developments and comparisons at v https://github.com/guille-c/Deep-HiTSa. 8 5 Keywords: methods: data analysis — methods: statistical — techniques: image processing — super- 4 novae: general — surveys 0 0 . 1 1. INTRODUCTION butalsothetime domainopeningnewopportunitiesfor 0 understanding our universe. As in many other fields, large scale survey tele- 7 The High cadence Transient Survey (HiTS, 1 scopes are already generating more data than hu- F¨orster et al. 2016), is aimed at detecting and : mans are able to process, posing the need for new v following up optical transients with characteristic data-processing tools. Some examples are the Sloan i X timescales from hours to days. The primary goal of Digital Sky Survey (SDSS, York et al. 2000), the r Panoramic Survey Telescope and Rapid Response Sys- HiTSistodetectsupernovae(SNe)duringtheirearliest a hours ofexplosion. HiTSuses the Dark EnergyCamera tem (Pan-STARRS, Hodapp et al. 2004), the All- (DECam, Flauguer et al. 2015) mounted at the prime Sky Automated Survey for Supernovae (ASAS-SN, focus of the Victor M. Blanco 4 m telescope on Cerro Shappee et al. 2014), and the Dark Energy Survey TololonearLaSerena,Chile. TheHiTSpipelinedetects (DES,Dark Energy Survey Collaboration 2016)among transients using difference images: a template image is others. Furthermore, in the near future we expect subtracted from the science image taken at the time of to have the Large Synoptic Survey Telescope (LSST, observation and point sources are located within this LSST Science Collaboration 2009) operative, which difference image. A custom made pipeline does the will scan the whole southern sky every couple of days astrometry, PSF matching, candidate extraction, and producing terabytes of data per night. These instru- candidate classificationinto realor faketransients. Our ments allow us to explore not only the space domain, previous approach uses random forests (RF, Breiman 2001) over a set of handmade features for candidate a Deep-HiTS is licensed under the terms of the GNU General classification. Similar approaches have been used in PublicLicensev3.0. the past by different groups including Romano et al. 2 (2006), Bloom et al. (2012), Brink et al. (2013), tuations of the background, inaccurate astrometry, and and Goldstein et al. (2015). The feature engineering bad CCD columns. This produced 802,087 candidates approachusuallyrequiresanimportantamountofwork which we labeled as negatives1. Notice these negatives done by the scientist in order to create representative were produced by running the pipeline over real data, features for the problem at hand. hence no simulation process is used for obtaining them. An alternative approach is to learn the features from Positives were generated by selecting stamps of real the data itself. Convolutional neural networks (CNNs, PSF-like sources and placing them at a different loca- Fukushima 1980) are a type of artificial neural net- tion at the same epoch andin the same CCD they were work (ANN, McCulloch & Pitts 1943) that have re- observed. By doing this we simulated positives using cently gained interest in the machine learning commu- real point like sources, hence reproducing how tran- nity. CNNs have achieved remarkable results in im- sients would look like. Point sources are obtained by age processing challenges, outperforming previous ap- first selecting sextractor (Bertin & Arnouts 1996) iso- proaches (e.g. Krizhevsky et al. 2012; Razavian et al. lated sources which are present in both frames with a 2014; Szegedy et al. 2014). Though CNNs have been consistent position and flux, and whose: size is within applied to a variety of image processing problems, they 2 median absolute deviations from the median size of have only recently caught the attention of the Astron- the sourcesofthe image,totalflux is positive, andmin- omycommunitymainlythankstotheGalaxyZooChal- imum pixel value including sky emission is greater than lenge. Dieleman et al. (2015) won the contest using a zero. We also filtered out those sources that show a rotation invariant CNN approach. This approach was sextractor flux inconsistent with its optimal photome- latter extended to data from the Hubble Space Tele- try flux (Naylor 1998), which we found to be a good scope in Huertas-Company et al. (2015). discriminator between stars and galaxies. Positives are In this paper we introduce Deep-HiTS, a framework then added scaledto mimic the SNR distributionof the for transient detection based on rotational invariant candidates found in the image (fitted power law distri- deep convolutional neural networks, and present the bution). Imageswiththesesimulatedtransientsarerun advantages and improved results obtained using this throughthefirststepsoftheHiTSpipeline(astrometry, framework. We start by explaining the HiTS differ- PSFmatchingandcandidateextraction)inordertoget ence image data in Section 2. In Section 3 we explain stamps of positives. As explained above, non-simulated the basics of ANN and CNN, as well as describing our detected candidates are taken as negatives. In order to model architecture. We then describe our experiments keep a balanced data-set, we used the same number of and show how the proposed CNN model significantly positivesasnegatives. Figure1showsexamplesofnega- outperforms the feature engineering + RF approach in tiveandpositivecandidatesandtheirrespectivestamps. Section4. Wefinallysummarisethisworkandconclude in Section 5. 3. CONVOLUTIONAL NEURAL NETWORK 2. DATA MODEL TheHiTSpipelinefindscandidatesbyusingthediffer- 3.1. Artificial neural networks ence of a science and a template image and it produces CNNsareatypeofartificialneuralnetwork(ANN,see image stamps of 21×21 pixels for each candidate. In Zhang 2000, and references therein). The basic unit of this paper we use data from the 2013 run, in which we anANNis calledneuron. Eachneuronreceivesavector observed 40 fields in the u band every approximately 2 as input andperforms a dot product operationbetween hoursduring 4 consecutivenights. Forthe classification that vector and a set of parameters called weights. The model we use four stamps: template image, science im- resulting value plus a bias goes into a non-linear activa- age, difference image, and signal-to-noise ratio (SNR) tion function, usually a sigmoid. Neurons are grouped difference image (the difference image divided by the in layers,eachone withits ownweights. Inamultilayer estimated local noise). perceptron, layers are stacked one after the other. The For the purpose of training the classification model, output of layer n, x , depends on the output of layer n we created a data-set of real non-transients (negatives n−1, xn−1, and is given by hereafter)andsimulatedtransients(positiveshereafter). Negatives were produced by running the first steps of theHiTSpipelineincludingtheastrometry,PSFmatch- xn =f(Wn·xn−1+bn), (1) ing,andcandidateextractionoverobservedimages. We evaluate the proposed model on the next step, which is 1Weareawarethisdata-setcontainssomepointliketransients candidate classification into non-transient or transient. notpresentinthereferenceimage,butweconservativelyestimate Negativesareartifacts causedmostlyby statisticalfluc- theywillbelessthana0.2%. 3 NEGATIVES POSITIVES template science difference SNR image template science difference SNR image 5 2 5 9 5. 5. = = R R N N S S template science difference SNR image template science difference SNR image SNR = 9.27 SNR = 12.73 template science difference SNR image template science difference SNR image 2 5 0 0 3. 0. 4 5 = = R R N N S S Figure 1. Examples of artifacts (negatives) and simulated transients (positives) at different signal-to-noise ratios. From top to bottom artifacts are caused by statistical fluctuationsof thebackground,inaccurate astrometry, and bad CCD column. where Wn is the weight matrix, bn is the bias, and f stack of K arraysxk,n−1 (k =1,...,K), the outputs as is the activation function. These layers are called fully- L arrays x (l = 1,...,L) and the filters as matrices l,n connected layers. W . The output of layer n is then obtained as k,l,n ANNs are usually trained using an algorithm called K error backpropagation, which adjust the weights of the xn,l =f Wk,l,n∗xk,n−1+bl,n , (3) network in an iterative way proportional to the gradi- ! k=1 X ent descent of the error with respect to the weights. In where ∗ is the convolution operator, and b are the l,n this work neural networks are trained using stochastic biases of layer n. In order to add some flexibility on gradient descent (SGD, LeCun et al. 1998), where the the output size of a layer,zeros can be appended in the gradient of the error with respect to the weights is esti- input layer’s borders. This is called zero-padding. matedateachiterationusingasmallpartofthedata-set Pooling layers return a subsampled version of the in- called mini-batch. That means that for every iteration put data. In the pooling layer the data is divided into the gradient is estimated as the average instantaneous small windows and a single representative value is cho- gradients for a small groupof samples (usually between sen for each window, e.g. the maximum (max-pooling) 10and500),whichprovidesagoodtrade-offintermsof or the average (mean-pooling). A common design for gradient estimation stability and computing time. CNNs consist on a mix of convolutional and pooling The learning rule for SGD is given by layers at a first stage followed by dense fully-connected θ =θ −η∇ L, (2) layers. t+1 t θ Though sigmoids are usually chosen as activation where θ represents a parameter of the ANN (such as functions, they are not the only choice. Rectified lin- weights or biases), L is the loss function (e.g. mean ear units (ReLU, Nair & Hinton 2010) are activation squared error) and η is the learning rate. The learn- functions that have gained popularity in the last years ing rate controls the size of the steps that SGD makes because they can achieve fast training and good perfor- in each iteration. Reducing the learning rate during mance. TheoutputofaReLUisthemaximumbetween training allows having a good compromise between ex- the input and zero ploration at the beginning and exploitation (local fine tuning) in the final part of the process. f(x)=max(0,x). (4) LeakyReLUsarevariantswherethenegativeinputsare 3.2. Convolutional neural networks notsettozerobutinsteadtheyaremultipliedbyasmall Convolutional neural networks (CNNs) usually con- number (e.g. 0.01) to preserve gradient propagation, tain two types of layers: convolutional and pooling lay- ers. Convolutional layers perform a convolution opera- x if x>0, f(x)= (5) tion between the input of the layer and a set of weights 0.01x otherwise. called kernel or filter. We extend Equation 1 by con- sidering the input of the n convolutional layer to be a In order to reduce overfitting, a usual technique is 4 dropout (Srivastava et al. 2014),whichconsistsinturn- We used the candidates data-set described in Sec- ingoffrandomneuronsduringtrainingtimewithaprob- tion 2 to assess the performance of the proposed CNN abilityp,usuallychosenas0.5. Thegoalofdropoutisto model and compare it againstour previous random for- prevent coadaptation of the outputs, hence neurons do est model. We split the data into 1,220,000 candi- not depend on other neurons being present in the net- datesfortraining,100,000candidatesforvalidation,and work. In order to preserve the scale of the total input, 100,000 for testing. The CNN model is trained using the remaining neurons need to be rescaledby 1/(1−p). SGD with mini-batches of 50 examples from the train- Dropout is activated only during training time. When ing set and 0.5 dropout. Learning rate was reduced evaluating models, all neurons become active. to half every 100,000 iterations, starting with an ini- tial value of 0.04. The validation data-set was used 3.3. Deep-HiTS architecture for measuring the performance of the model at train- Preliminaryresultswerepresentedinaconferencepa- ing time anddecidewhento stoptraining. We assumed per by Cabrera et al. (2016) where a very basic CNN the model converged when after feeding 100,000 can- model was proposed achieving a performance compara- didates the zero-one loss (fraction of misclassifications) ble to the RF model. Here we extend these results by doesnotgolowerthana99%ofthe previousloss. Once using a much deeper architecture and introducing rota- the model is trained, we use the test set (data unseen tional invariance as well as leaky ReLUs and dropout, by the model) to calculate performance metrics. Figure significantly outperforming the RF model. Following 3 shows the zero-one loss of the model as a function of Dieleman et al. (2015)weintroducerotationinvariance the number of iterations for the training and validation by applying the convolutional filters to various rotated sets, and for the test set after training. Due to insuffi- versions of the image, thus exploiting rotational sym- cient memory to store the whole training set, the train- metry present in transients images. Figure 2 shows the ingsetcurveiscalculatedoverthelast10,000candidates proposed rotation invariant convolutional neural net- used for training at each iteration, while the validation work architecture. The template, science, difference, curve shows the loss for the whole 100,000 candidates andSNRdifferencestampsarepresentedtothenetwork in the validation set. We programmed our code using asastackedarrayofsize 21×21×4. The convolutional Theano(Theano Development Team 2016)andtrained architecture composed by the convolutional and pool- ourmodelonateslaK20graphicsprocessorunit(GPU). ing layers yields 2,304 features. Rotation invariance is The model converged after 455,000 iterations and took incorporatedby presenting every stamp to the convolu- roughly 37 hours to train achieving an error of 0.525% tional architecture rotated four times, hence producing overthevalidationdata-setand0.531%overthetestset 2,304×4features. These9,216featuresarefedtothree for this particular experiment. fully connected layers that return the probabilities of We compare the performance of the proposed CNN being an artefact and a real transient. model against our previous RF model (implemented After trying more than 50 different architectures and using scikit-learn Pedregosa et al. 2011). As stated training strategies2 the best performance was obtained above, we consider transients to be positives (P) and by the architecture shown in Figure 2. Our convolu- non-transients to be negatives (N). We use the follow- tionalarchitectureiscomposedoftwoconvolutionallay- ing metrics to assess the performance of our models: erswithfiltersofsize4×4and3×3respectively,a2×2 TP +TN max-poolinglayer,three3×3convolutionallayersanda accuracy= , (6) TP +FN +TN +FP 2×2max-poolinglayer. Zero-paddingwasusedinorder TP tosetthesizesofeachlayertotheonesshowninFigure precision= , (7) TP +FP 2(24×24forthe firsttwoand12×12forthe lastfour). TP All convolutional layers use ReLUs. The concatenated recall= , (8) TP +FN outputofallrotationsarefedtothreestackedfullycon- precision·recall 2TP nected layers: two ReLU layers of 64 units each, and a f1-score=2 = , precision+recall 2TP +FP +FN final logistic regressionlayer that outputs two probabil- (9) ities: fake and real transient. where TP stands for true positives (number of posi- 4. EXPERIMENTS tives correctly classified), TN stands for true negatives (numberofnegativescorrectlyclassified),FP standsfor false positives (number of negatives incorrectly classi- 2 Mostimportantimprovementswereachievedbyaddingmore fied aspositives by the model), andFN stands forfalse layers, number of parameters per layer, different activation func- negatives (number of positives incorrectly classified as tions, different training strategies (learning rates, dropout), and addingrotationinvariance. negatives by the model). In order to make a fair com- 5 Figure 2. Architecture of the proposed rotational invariant CNN model. Blue boxes represent convolutional layers and red boxes represent max-pooling layers. The stacked images of each candidate are presented to the network rotated four times in ordertoexploitrotationalsymmetry. Thecalculatedfeaturemapsofeachrotationareflattenedandstackedcreatinga2,304×4 vectorthat is fed to a fully connected network whose outputsare the probabilities of being an artefact and a real transient. parison, we trained and tested the RF model with the 0.05 same number of candidates (1,220,000 for training and Training set Validation set 100,000 for testing). For both CNN and RF models, 0.04 Test set six train-validation-testsplits were doneusing stratified shuffle split (i.e. maintaining positives and negatives 0.03 balancedforallsets). Table3showsthemeanaccuracy, or err precision,recall,andf1-scoreobtainedwith the RF and 0.02 CNN models and their respective standard deviations. All metrics show that the CNN model outperforms the 0.01 RF model. To assess the statisticalsignificance of these results, weperformeda Welch’s hypothesistest (Welch 0.00 0 100000 200000 300000 400000 1947) obtaining a probability of less than a 10−7 that iteration these results were obtained by chance. Figure 3. Learning curve (evolution of the error during the training process). The training set error is calculated over the last 10,000 candidates that were used for training. The testset erroriscalculated at theendoftraining. Validation error does not raise at the end of training, indicating that themodel is not overfitted. 6 Table 1. Comparison of RF and CNN models. CNN shows consistent better results than RF in terms of accuracy, precision, recall and f1-score. Results are statistically significant based on the calculated Welch’s T-test p-values, being all of them lower than 10−7. Accuracy results show that theerror of CNN model is almost half the error obtained with the RFmodel. model accuracy precision recall f1-score RF 98.96±0.03% 98.74±0.06% 99.18±0.02% 98.96±0.03% CNN 99.45±0.03% 99.30±0.07% 99.59±0.07% 99.45±0.03% Welch’s T-test p-value 4.7×10−11 2.8×10−08 9.7×10−08 4.5×10−11 Figure 4 shows the detection error tradeoff (DET) using a CNN model trained over 3×104. curves of the RF and CNN models. The DET curve In order to test our framework over real data, we ap- plots the false negativerate (FNR) versusthe false pos- plied the RF and CNN models over SNe found in the itive rate (FPR) as we vary the probability threshold 2014 and 2015 campaigns. For our 2013 campaign we used to determine positives and negatives, where observed in the u band filter, while in the 2014 and 2015 campaigns we observed in the g band filter. We FP FN used stamps from SNe at all epochs with a SNR over5. FPR= , FNR= . (10) N P DuringobservationtimeHiTSfound47SNeinthe2014 campaign and 110 in the 2015 campaign, giving a total Figure 4a shows the overallDET curvesfor the RF and of 1166 candidates (each SN has candidates at different CNN models, while Figure 4b shows the DET curves epochs). Using the user-defined FPR of 10−2 described for different SNR ranges. A better model would obtain above, the RF model correctly classified 866 of these smaller FPR and FNR, hence moving the curve to the positive candidates (FNR = 0.257), while Deep-HiTS bottom left of the plot. It can be seen again that the correctly classified 921 of them (FNR = 0.210). Notice proposedCNN model achievesbetter performance than that though we trained our models on u band images, ourpreviousRFmodel. TheDETcurveisalsousefulfor we are still able to classify candidates in the g band fil- evaluatingthetrade-offbetweenfalsepositivesandfalse ter. Figure 6 shows the number of positive candidates negatives. Inpractice,whenobservinginrealtime,most incorrectlyclassifiedfortheSNepresentinthe2014and candidates will be negatives. All negatives on our data- 2015 campaigns for the RF and CNN models in terms set are real candidates taken from the HiTS pipeline: of the SNR. It can be seen that for a SNR below 8 the 802,087 obtained during 4 consecutive nights, i.e. ap- CNN model outperforms the RF model. These are the proximately 200,000 negatives per night. The number mostimportantcandidateswhendetectingtransientsin ofestimatedrealtransientsis considerablysmallerthan real time, as we hope to detect them at an early stage. that. As an example, consider as user-defined opera- ForaSNRover8the RFmodelbetterrecoversthetrue tion point getting 2,000 false positives per night (FPR positives, which suggests the use of a hybrid model be- ∼ 10−2). Figure 4 tells us that by using the proposed tween RF and CNN. We will expand on this idea in CNN we will get a FNR ∼2×10−3 which is much bet- future work. ter in comparisonto a FNR ∼10−2 when using the RF As we tested our model on real data from a different model. This means 5 times less real transients would HiTS campaign, and also using a band filter different be missed by using the CNN model. On Figure 4b we from the one used for training, we did not achieve the canseethatthisimprovementismostlyachievedbythe low FNR shown in Figure 4. However our CNN model proposed CNN being able to correctly classify low SNR isabletorecovermoretruepositivesfromimagesinthe sources. g-band filter than the RF-based model given the user- InFigure5weexploretheperformanceoftheRFand defined operation point described above. We currently CNNmodelsintermsofthesizeofthetrainingset. For do not have a robust way of measuring the false pos- this purpose we saved 100,000candidates for validation itives automatically in the 2014 and 2015 data-sets as and trained each model with training sets of different the new observationalsetup yields a significantly larger sizes. It can be seen that the CNN model outperforms population of unknown asteroids which would need to the RF model independently of the size of the training be removed from the set of negatives. Further research set used. Furthermore, by using a RF model trained is needed in order to assess the best way of transferring over 106 candidates we obtain similar performance as 7 100 100 10-1 10-1 10-2 10-2 R R N N F F 10-3 10-3 RF, SNR < 7, N = 16917 CNN, SNR < 7, N = 16917 RF, 7 SNR <15, N = 55772 10-4 10-4 ≤ CNN, 7 SNR <15, N = 55772 RF ≤ RF, SNR 15, N = 27311 CNN CNN, SNR≥ 15, N = 27311 10-5 10-5 ≥ 10-5 10-4 10-3 10-2 10-1 10-5 10-4 10-3 10-2 10-1 FPR FPR (a) (b) Figure 4. Detectionerrortradeoff(DET)curvecomparingaRandomForestmodelagainsttheproposedCNN.DETcurveplots FalseNegativeRate(FNR)versusFalsePositiveRate(FPR).Acurveclosertothelowerbottomindicatesbetterperformance. (a)OverallDETcurves. (b)DETcurvesforsubsetsof candidatesbinnedaccording totheirsignal-to-noise ratio (SNR),where N indicates the number of candidates per bin. The proposed CNN model achieves a better performance than the RF model for a FPR between 10−4 and 10−1, where the HiTS pipeline operates. The biggest improvement occurs for the lowest SNR candidates. 0.07 SNe in the 2014 and 2015 campaigns 180 Random Forest 0.06 CNN 160 CNN false negatives RF false negatives 0.05 140 all positives 120 0.04 Error0.03 100 80 0.02 60 0.01 40 20 0.00 102 103 104 105 106 0 Training set size 0 10 20 30 40 50 SNR Figure 5. Error overa validation data-set of 100,000 candi- Figure 6. ComparisonofRFandCNNmodelsoverrealSNe datesintermsofthesizeofthetrainingsetfortheRandom from the2014 and 2015 campaigns. Forest and Convolutional Neural Network models. advantage of learning suitable features for the problem our CNN model between different data-sets in terms of automatically from the data. The proposed classifica- recovering positives and negatives. tion model was able to improve the overall accuracy of the HiTSpipeline from98.96±0.03%to99.45±0.03%. 5. CONCLUSIONS Inpractice,byusingDeep-HiTSwecanreducethenum- ber of missed transients to approximately 1/5 when ac- We introduced Deep-HiTS, a rotation invariant deep cepting around 2,000 transients per night. The use convolutionalneuralnetwork (CNN) for detecting tran- of deep learning models over next generation surveys, sients in survey images. Deep-HiTS discriminates be- such as the LSST may have a great impact on de- tween real and fake transients in images from the High tecting and classifying the unknown unknowns of our cadenceTransientSurvey(HiTS),asurveythataimsto universe. Deep-HiTS is open source and available at findtransientswithcharacteristictimescalesfromhours GitHub (https://github.com/guille-c/Deep-HiTS), to days using the Dark Energy Camera (DECam). We andtheversionusedinthispaperisarchivedonZenodo compared the proposed CNN model against our previ- (https://doi.org/10.5281/zenodo.190760). ous approach based on a random forest (RF) classifier trained over features designed by hand. Deep-HiTS not only outperforms the RF approach, but it also has the We gratefully acknowledge financial support from 8 CONICYT-ChilethroughitsFONDECYTpostdoctoral Mitchell Institute for Fundamental PhysicsandAstron- grant number 3160747; CONICYT-PCHA through its omyatTexasA&MUniversity,FinanciadoradeEstudos national M.Sc. scholarship 2016 number 22162464; eProjetos,FundaoCarlosChagasFilhodeAmparo,Fi- CONICYT through the Fondecyt Initiation into Re- nanciadora de Estudos e Projetos, Fundao Carlos Cha- search project No. 11130228;CONICYT-Chile through gas Filho de Amparo Pesquisa do Estado do Rio de grant Fondecyt 1140816; CONICYT-Chile and NSF Janeiro, Conselho Nacional de Desenvolvimento Cient- through the Programme of International Cooperation ficoeTecnolgicoandtheMinistriodaCincia,Tecnologia project DPI201400090; CONICYT through the infras- e Inovao,the Deutsche Forschungsgemeinschaftand the tructure Quimal project No. 140003; Basal Project Collaborating Institutions in the Dark Energy Survey. PFB–03; the Ministry of Economy, Development, and The Collaborating Institutions are Argonne National Tourism’s Millennium Science Initiative through grant Laboratory, the University of California at Santa Cruz, IC120009, awarded to The Millennium Institute of As- the University of Cambridge, Centro de Investigaciones trophysics(MAS). Powered@NLHPC:this researchwas Enrgeticas,MedioambientalesyTecnolgicasMadrid,the partially supported by the supercomputing infrastruc- University of Chicago, University College London, the ture ofthe NLHPC (ECM-02). We acknowledgethe us- DES-Brazil Consortium, the University of Edinburgh, ageoftheBelkaGPUcluster(FondequipEQM140101). the Eidgenssische Technische Hochschule (ETH) Zrich, This project used data obtained with the Dark En- Fermi National Accelerator Laboratory, the University ergy Camera (DECam), which was constructed by the ofIllinoisatUrbana-Champaign,theInstitutdeCincies Dark Energy Survey (DES) collaboration. Funding for de l’Espai (IEEC/CSIC), the Institut de Fsica d’Altes the DES Projects has been provided by the U.S. De- Energies, Lawrence Berkeley National Laboratory, the partment of Energy, the U.S. National Science Founda- Ludwig-Maximilians Universitt Mnchen and the asso- tion, the Ministry of Science and Education of Spain, ciated Excellence Cluster Universe, the University of the Science and Technology Facilities Council of the Michigan, the National Optical Astronomy Observa- United Kingdom, the Higher Education Funding Coun- tory,the UniversityofNottingham,theOhioStateUni- cil for England, the National Center for Supercomput- versity, the University of Pennsylvania, the University ing Applications at the University ofIllinois atUrbana- of Portsmouth,SLAC NationalAcceleratorLaboratory, Champaign,theKavliInstituteofCosmologicalPhysics StanfordUniversity,theUniversityofSussex,andTexas atthe UniversityofChicago,Center forCosmologyand A&M University. All plots were created using mat- Astro-ParticlePhysicsatthe OhioState University,the plotlib (Hunter 2007). REFERENCES Bertin,E.,&Arnouts,S.1996,A&AS,117,393 LeCun,Y.,Bottou, L.,Orr,G.,Mller,K.1998, inNeural Bloom,J.S.,Richards,J.W.,Nugent,P.E.etal.2012,PASP, Networks:TricksoftheTrade,ed:Montavon, G.,Orr,G.,, 124,921,1175 Mller,K.(Springer),9 Breiman,L.2001,MachineLearning,45,1,532 LSSTScienceCollaboration,Abell,P.A.,Allison,J.,etal.2009, Brink,H.,Richards,J.W.,Poznanski,D.2013,MNRAS,432,2, arXiv:0912.0201 1047 Krizhevsky,A.,Sutskever,I.&Hinton,G.E.2012,in Cabrera-Vives,G.,Reyes,I.,Fo¨rster,F.etal.2016, in Proceedings oftheNeuralInformationProcessingSystems Proceedings ofthe2016International JointConferenceon Conference,Imagenet classificationwithdeepconvolutional NeuralNetworks(IJCNN2016), SupernovaeDetection by neuralnetworks(LakeTahoe,NV,USA),1097 UsingConvolutional NeuralNetworks(Vancouver, Canada), McCulloch,W.&Pitts,W.1943,TheBulletinofMathematical 251 Biophysics,5,4,115 DarkEnergySurveyCollaboration,Abbott, T.,Abdalla,F.B., NairV.,HintonG.E.,2010,inProceedingsofthe27th etal.2016,MNRAS,460,1270 International ConferenceonMachineLearning(ICML-10), Dieleman,S.,Willett,K.W.,&Dambre,J.2015,MNRAS,450, Rectifiedlinearunitsimproverestrictedboltzmannmachines 1441 (Haifa,Israel),807 Flaugher,B.,Diehl,H.T.,Honscheid,K.,etal.2015,AJ,150, Naylor,T.1998,MNRAS,296,339 150 Pedregosa, F.,Varoquaux,G.,Gramfort,etal.2011,Journalof Fo¨rster,F.,Maureira,J.C.,SanMart´ın,J.etal.2016,ApJ,832, MachineLearningResearch,12,2825 155 Razavian, A.S.,Azizpour,H.,Sullivan,J.etal.2014,in Fukushima,K.1980, BiologicalCybernetics,36,4,193 Proceedings oftheIEEEConferenceonComputerVisionand Goldstein,D.A.,DA´ndrea,C.B.,Fischer,J.A.etal.2015, AJ, PatternRecognitionWorkshops(CVPRW), CNNfeatures 150,82 off-the-shelf:anastoundingbaselineforrecognition Hodapp,K.W.,Kaiser,N.,Aussel,H.,etal.2004, (Columbus,OH,USA),512 AstronomischeNachrichten,325,636 R.Romano,C.R.Aragon,andC.Ding,2006,inProceedings of Huertas-Company,M.,Gravet, R.,Cabrera-Vives,G.etal.2015, the5thInternational ConferenceonMachineLearningand ApJS,221,1,8 Applications,Supernovarecognitionusingsupportvector Hunter,J.D.2007,ComputingInScience&Engineering,9,3,90 machines(Orlando,FL,USA),77 9 ShappeeB.J.etal.,2014,ApJ,788,48 Welch, B.L.1947,Biometrika,34,1/2,28 SrivastavaN.,HintonG.,KrizhevskyA.,Sutskever I., D.G.YorkandSDSSCollaboration,2000, AJ,120,15791587 Salakhutdinov R.,2014,JournalofMachineLearning Zhang, G.P.2000, Systems,Man,andCybernetics,PartC: Research,15,1929 ApplicationsandReviews,IEEETransactionson,30,4, Szegedy, C.Liu,W.,Jia,Y.etal.2014,arXiv::1409.4842 451462 TheanoDevelopmentTeam,2016,arXiv:1605.02688