Under review as a conference paper at ICLR 2017

THE INCREDIBLE SHRINKING NEURAL NETWORK: NEW PERSPECTIVES ON LEARNING REPRESENTATIONS THROUGH THE LENS OF PRUNING

Nikolas Wolfe, Aditya Sharma & Bhiksha Raj
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA
{nwolfe, bhiksha}@cs.cmu.edu, [email protected]

Lukas Drude
Universität Paderborn
[email protected]

ABSTRACT

How much can pruning algorithms teach us about the fundamentals of learning representations in neural networks? A lot, it turns out. Neural network model compression has become a topic of great interest in recent years, and many different techniques have been proposed to address this problem. In general, this is motivated by the idea that smaller models typically lead to better generalization. At the same time, the decision of what to prune and when to prune necessarily forces us to confront our assumptions about how neural networks actually learn to represent patterns in data. In this work we set out to test several long-held hypotheses about neural network learning representations and numerical approaches to pruning. To accomplish this we first reviewed the historical literature and derived a novel algorithm to prune whole neurons (as opposed to the traditional method of pruning weights) from optimally trained networks using a second-order Taylor method. We then set about testing the performance of our algorithm and analyzing the quality of the decisions it made. As a baseline for comparison we used a first-order Taylor method based on the Skeletonization algorithm and an exhaustive brute-force serial pruning algorithm. Our proposed algorithm worked well compared to a first-order method, but not nearly as well as the brute-force method. Our error analysis led us to question the validity of many widely-held assumptions behind pruning algorithms in general and the trade-offs we often make in the interest of reducing computational complexity.
We discovered that there is a straightforward way, however expensive, to serially prune 40-70% of the neurons in a trained network with minimal effect on the learning representation and without any re-training.

1 INTRODUCTION

In this work we propose and evaluate a novel algorithm for pruning whole neurons from a trained neural network without any re-training and examine its performance compared to two simpler methods. We then analyze the kinds of errors made by our algorithm and use this as a stepping-off point to launch an investigation into the fundamental nature of learning representations in neural networks. Our results corroborate an insightful though largely forgotten observation by Mozer & Smolensky (1989a) concerning the nature of neural network learning. This observation is best summarized in a quotation from Segee & Carter (1991) on the notion of fault tolerance in multilayer perceptron networks:

Contrary to the belief widely held, multilayer networks are not inherently fault tolerant. In fact, the loss of a single weight is frequently sufficient to completely disrupt a learned function approximation. Furthermore, having a large number of weights does not seem to improve fault tolerance. [Emphasis added]

Essentially, Mozer & Smolensky (1989b) observed that during training neural networks do not distribute the learning representation evenly or equitably across hidden units. What actually happens is that a few elite neurons learn an approximation of the input-output function, and the remaining units must learn a complex interdependence function which cancels out their respective influence on the network output. Furthermore, assuming enough units exist to learn the function in question, increasing the number of parameters does not increase the richness or robustness of the learned approximation, but rather simply increases the likelihood of overfitting and the number of noisy parameters to be canceled during training. This is evinced by the fact that in many cases, multiple neurons can be removed from a network with no re-training and with negligible impact on the quality of the output approximation.
In other words, there are few bipartisan units in a trained network. A unit is typically either part of the (possibly overfit) input-output function approximation, or it is part of an elaborate noise cancellation task force. Assuming this is the case, most of the compute time spent training a neural network is likely occupied by this arguably wasteful procedure of silencing superfluous parameters, and pruning can be viewed as a necessary procedure to "trim the fat."

We observed copious evidence of this phenomenon in our experiments, and this is the motivation behind our decision to evaluate the pruning algorithms in this study on the simple criterion of their ability to trim neurons without any re-training. If we were to employ re-training as part of our evaluation criteria, we would arguably not be evaluating the quality of our algorithm's pruning decisions per se, but rather the ability of back-propagation-trained networks to recover from faults caused by non-ideal pruning decisions, as suggested by the conclusions of Segee & Carter (1991) and Mozer & Smolensky (1989a). Moreover, as Fahlman & Lebiere (1989) discuss, due to the "herd effect" and "moving target" phenomena in back-propagation learning, the remaining units in a network will simply shift course to account for whatever error signal is re-introduced as a result of a bad pruning decision or network fault. So long as there are enough critical parameters to learn the function in question, a network can typically recover from faults with additional training. This limits the conclusions we can draw about the quality of our pruning criteria when we employ re-training.

In terms of removing units without re-training, what we discovered is that predicting the behavior of a network when a unit is to be pruned is very difficult, and most of the approximation techniques put forth in existing pruning algorithms do not fare well at all when compared to a brute-force search. To begin our discussion of how we arrived at our algorithm and set up our experiments, we review the existing literature.
2 LITERATURE REVIEW

Pruning algorithms, as comprehensively surveyed by Reed (1993), are a useful set of heuristics designed to identify and remove elements from a neural network which are either redundant or do not significantly contribute to the output of the network. This is motivated by the observed tendency of neural networks to overfit to the idiosyncrasies of their training data given too many trainable parameters or too few input patterns from which to generalize, as stated by Chauvin (1990).

Network architecture design and hyperparameter selection are inherently difficult tasks typically approached using a few well-known rules of thumb, e.g. various weight initialization procedures, choosing the width and number of layers, different activation functions, learning rates, momentum, etc. Some of this "black art" appears unavoidable. For problems which cannot be solved using linear threshold units alone, Baum & Haussler (1989) demonstrate that there is no way to precisely determine the appropriate size of a neural network a priori given any random set of training instances. Using too few neurons seems to inhibit learning, and so in practice it is common to over-parameterize networks initially using a large number of hidden units and weights, and then prune or compress them afterwards if necessary. Of course, as the old saying goes, there's more than one way to skin a neural network.

2.1 NON-PRUNING-BASED GENERALIZATION & COMPRESSION TECHNIQUES

The generalization behavior of neural networks has been well studied, and apart from pruning algorithms many heuristics have been used to avoid overfitting, such as dropout (Srivastava et al. (2014)), maxout (Goodfellow et al. (2013)), and cascade correlation (Fahlman & Lebiere (1989)), among others. Of course, while cascade correlation specifically tries to construct minimal networks, many techniques to improve network generalization do not explicitly attempt to reduce the total number of parameters or the memory footprint of a trained network per se.
Model compression often has benefits with respect to generalization performance and the portability of neural networks to operate in memory-constrained or embedded environments. Without explicitly removing parameters from the network, weight quantization allows for a reduction in the number of bytes used to represent each weight parameter, as investigated by Balzer et al. (1991), Dundar & Rose (1994), and Hoehfeld & Fahlman (1992).

A recently proposed method for compressing recurrent neural networks (Prabhavalkar et al. (2016)) uses the singular values of a trained weight matrix as basis vectors from which to derive a compressed hidden layer. Øland & Raj (2015) successfully implemented network compression through weight quantization with an encoding step, while others such as Han et al. (2016) have tried to expand on this by adding weight pruning as a step preceding quantization and encoding.

In summary, we can say that there are many different ways to improve network generalization by altering the training procedure, the objective error function, or by using compressed representations of the network parameters. But these are not, strictly speaking, examples of techniques to reduce the number of parameters in a network. For this we must employ some form of pruning criterion.

2.2 PRUNING TECHNIQUES

If we wanted to continually shrink a neural network down to minimum size, the most straightforward brute-force way to do it is to individually switch each element off and measure the increase in total error on the training set. We then pick the element which has the least impact on the total error, and remove it. Rinse and repeat. This is extremely computationally expensive for any reasonably large neural network and training set. Alternatively, we might accomplish this using any number of much faster off-the-shelf pruning algorithms, such as Skeletonization (Mozer & Smolensky (1989a)), Optimal Brain Damage (LeCun et al. (1989)), or later variants such as Optimal Brain Surgeon (Hassibi & Stork (1993)).
In fact, we borrow much of our inspiration from these algorithms, with one major variation: instead of pruning individual weights, we prune entire neurons, thereby eliminating all of their incoming and outgoing weight parameters in one go, resulting in more memory saved, faster.

The algorithm developed for this paper is targeted at reducing the total number of neurons in a trained network, which is one way of reducing its computational memory footprint. This is often a desirable criterion to minimize in the case of resource-constrained or embedded devices, and it also allows us to probe the limitations of pruning down to the very last essential network elements. In terms of generalization as well, we can measure the error of the network on the test set as each element is sequentially removed from the network. With an oracle pruning algorithm, what we expect to observe is that the output of the network remains stable as the first few superfluous neurons are removed, and as we start to bite into the more crucial members of the function approximation, the error should start to rise dramatically. In this paper, the brute-force approach described at the beginning of this section serves as a proxy for an oracle pruning algorithm.

One reason to choose to rank and prune individual neurons as opposed to weights is that there are far fewer elements to consider. Furthermore, the removal of a single weight from a large network is a drop in the bucket in terms of reducing a network's core memory footprint. If we want to reduce the size of a network as efficiently as possible, we argue that pruning neurons instead of weights is more efficient computationally as well as practically in terms of quickly reaching a hypothetical target reduction in memory consumption. This approach also offers downstream applications a realistic expectation of the minimal increase in error resulting from the removal of a specified percentage of neurons. Such trade-offs are unavoidable, but performance impacts can be limited if a principled approach is used to find the best candidate neurons for removal.
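To make the memory argument concrete, the helper below (our illustration, not from the paper) counts how many parameters disappear when one hidden neuron of a fully connected layer is pruned versus a single weight; the inclusion of a per-neuron bias is our assumption, since the paper does not specify one:

```python
def params_removed_by_neuron(n_in, n_out, has_bias=True):
    """Pruning one hidden neuron in a fully connected network deletes all of
    its incoming weights (n_in), all of its outgoing weights (n_out), and
    optionally its bias, whereas pruning one weight deletes one parameter."""
    return n_in + n_out + (1 if has_bias else 0)

# e.g. one hidden neuron between a 400-unit input (20x20 image) and a
# 10-unit output layer accounts for 411 parameters, versus 1 per weight
print(params_removed_by_neuron(400, 10))
```

This is the sense in which neuron pruning reaches a target memory reduction far faster than weight pruning: each removal decision eliminates hundreds of parameters at once.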
It is well known that too many free parameters in a neural network can lead to overfitting. Regardless of the number of weights used in a given network, as Segee & Carter (1991) assert, the representation of a learned function approximation is almost never evenly distributed over the hidden units, and thus the removal of any single hidden unit at random can actually result in a network fault. Mozer & Smolensky (1989b) argue that only a subset of the hidden units in a neural network actually latch onto the invariant or generalizing properties of the training inputs, and the rest learn to either mutually cancel each other's influence or begin overfitting to the noise in the data. We leverage this idea in the current work to rank all neurons in pre-trained networks based on their effective contributions to the overall performance. We then remove the unnecessary neurons to reduce the network's footprint. Through our experiments we not only concretely validate the theory put forth by Mozer & Smolensky (1989b) but also successfully build on it to prune networks to 40 to 60% of their original size without any major loss in performance.

3 PRUNING NEURONS TO SHRINK NEURAL NETWORKS

As discussed in Section 1, our aim is to leverage the highly non-uniform distribution of the learning representation in pre-trained neural networks to eliminate redundant neurons, without focusing on individual weight parameters. Taking this approach enables us to remove all the weights (incoming and outgoing) associated with a non-contributing neuron at once. We would like to note here that in an ideal scenario, based on the neuron interdependency theory put forward by Mozer & Smolensky (1989a), one would evaluate all possible combinations of neurons to remove (one at a time, two at a time, three at a time and so forth) to find the optimal subset of neurons to keep. This is computationally unacceptable, and so we focus only on removing one neuron at a time and explore more "greedy" algorithms to do this in a more efficient manner.
The general approach taken here to prune an optimally trained neural network is to create a ranked list of all the neurons in the network based on one of the three proposed ranking criteria: a brute-force approximation, a linear approximation, and a quadratic approximation of each neuron's impact on the output of the network. We then test the effects of removing neurons on the accuracy and error of the network. All the algorithms and methods presented here are easily parallelizable as well.

One last thing to note before moving forward is that the methods discussed in this section involve some non-trivial derivations which are beyond the scope of this paper. We are more focused on analyzing the implications of these methods for our understanding of neural network learning representations. However, a complete step-by-step derivation and proof of all the results presented is provided in the Supplementary Material as an Appendix.

3.1 BRUTE FORCE REMOVAL APPROACH

This is perhaps the most naive yet the most accurate method for pruning the network. It is also the slowest, and hence possibly unusable on large-scale neural networks with thousands of neurons. This method explicitly evaluates each neuron in the network: the idea is to manually check the effect of every single neuron on the output. This is done by running a forward propagation on the validation set K times (where K is the total number of neurons in the network), turning off exactly one neuron each time (keeping all other neurons active) and noting the change in error. Turning a neuron off can be achieved by simply setting its output to 0, which has the same effect as turning off all the outgoing weights from that neuron. This change in error is then used to generate the ranked list.

3.2 TAYLOR SERIES REPRESENTATION OF ERROR

Let us denote the total error of the optimally trained neural network on any given validation dataset by E. E can be seen as a function of O, where O is the output of any general neuron in the network.
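The brute-force procedure can be sketched in a few lines. The toy NumPy network below is our own illustration (the paper does not give an implementation): a one-hidden-layer sigmoid network with a squared-error measure, where a mask vector clamps exactly one neuron's output to zero per forward pass:

```python
import numpy as np

def forward(X, W1, W2, mask):
    """Forward pass of a one-hidden-layer sigmoid network; `mask` zeroes out
    pruned hidden neurons (turning a neuron off = forcing its output to 0,
    which nullifies all of its outgoing weights at once)."""
    H = 1.0 / (1.0 + np.exp(-X @ W1))   # hidden sigmoid activations
    H = H * mask                         # switch off masked neurons
    return H @ W2                        # linear output layer

def brute_force_ranking(X, Y, W1, W2):
    """Rank hidden neurons by the increase in total squared error caused by
    switching each one off individually, all other neurons left active."""
    K = W1.shape[1]
    base = np.sum((forward(X, W1, W2, np.ones(K)) - Y) ** 2)
    deltas = []
    for k in range(K):                   # one forward pass per neuron
        mask = np.ones(K)
        mask[k] = 0.0                    # turn off exactly one neuron
        err = np.sum((forward(X, W1, W2, mask) - Y) ** 2)
        deltas.append(err - base)
    # least-damaging neurons first: these are the pruning candidates
    return np.argsort(deltas), np.array(deltas)
```

A neuron whose outgoing weights were trained to (near) zero produces a delta of (near) zero and is ranked first for removal, which is exactly the behavior the ranked list is meant to capture.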
This error can be approximated at a particular neuron's output (say $O_k$) by using the second-order Taylor series as

$$\hat{E}(O) \approx E(O_k) + (O - O_k)\cdot\left.\frac{\partial E}{\partial O}\right|_{O_k} + 0.5\cdot(O - O_k)^2\cdot\left.\frac{\partial^2 E}{\partial O^2}\right|_{O_k}, \quad (1)$$

When a neuron is pruned, its output $O$ becomes 0. Replacing $O$ by $O_k$ in equation 1 shows us that the error is approximated perfectly by equation 1 at $O_k$. So:

$$\Delta E_k = \hat{E}(0) - \hat{E}(O_k) = -O_k\cdot\left.\frac{\partial E}{\partial O}\right|_{O_k} + 0.5\cdot O_k^2\cdot\left.\frac{\partial^2 E}{\partial O^2}\right|_{O_k}, \quad (2)$$

where $\Delta E_k$ is the change in the total error of the network when exactly one neuron ($k$) is turned off. Most of the terms in this equation are fairly easy to compute, as we already have $O_k$ from the activations of the hidden units and we already compute $\left.\frac{\partial E}{\partial O}\right|_{O_k}$ for each training instance during backpropagation. The $\left.\frac{\partial^2 E}{\partial O^2}\right|_{O_k}$ terms are a little more difficult to compute. This is derived in the appendix and summarized in the sections below.

3.2.1 LINEAR APPROXIMATION APPROACH

We can use equation 2 to get the linear approximation of the change in error due to the $k$th neuron being turned off, and represent it as $\Delta E_k^1$ as follows:

$$\Delta E_k^1 = -O_k\cdot\left.\frac{\partial E}{\partial O}\right|_{O_k} \quad (3)$$

The derivative term above is the first-order gradient, which represents the change in error with respect to the output of a given neuron. This term can be collected during back-propagation. As we shall see further in this section, linear approximations are not reliable indicators of the change in error, but they provide us with an interesting basis for comparison with the other methods discussed in this paper.

3.2.2 QUADRATIC APPROXIMATION APPROACH

As above, we can use equation 2 to get the quadratic approximation of the change in error due to the $k$th neuron being turned off, and represent it as $\Delta E_k^2$ as follows:

$$\Delta E_k^2 = -O_k\cdot\left.\frac{\partial E}{\partial O}\right|_{O_k} + 0.5\cdot O_k^2\cdot\left.\frac{\partial^2 E}{\partial O^2}\right|_{O_k} \quad (4)$$

The additional second-order gradient term appearing above represents the quadratic change in error with respect to the output of a given neuron.
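Once the gradients are in hand, equations 3 and 4 are trivial to evaluate. A minimal sketch (the function and variable names are ours) also shows the intuition for preferring the quadratic estimate: for an error that is exactly quadratic in a neuron's output, the second-order estimate recovers the true change in error exactly, while the linear one does not:

```python
def delta_e_linear(o_k, grad):
    """Equation 3: first-order Taylor estimate of the change in error when a
    neuron with output o_k is forced to zero; grad = dE/dO at O = o_k."""
    return -o_k * grad

def delta_e_quadratic(o_k, grad, grad2):
    """Equation 4: adds the second-order term; grad2 = d^2E/dO^2 at O = o_k."""
    return -o_k * grad + 0.5 * o_k ** 2 * grad2
```

For example, with E(O) = (O - 2)^2 and O_k = 1, the true change E(0) - E(1) = 3; the linear estimate (with dE/dO = -2 at O_k) gives 2, while the quadratic estimate (with d^2E/dO^2 = 2) gives exactly 3.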
This term can be generated by performing back-propagation using second-order derivatives. Collecting these quadratic gradients involves some non-trivial mathematics, the entire step-by-step derivation procedure of which is provided in the Supplementary Material as an Appendix.

3.3 PROPOSED PRUNING ALGORITHM

Figure 1 shows a hypothetical error function plotted against the output of a given neuron. Note that this figure is for illustration purposes only. The error function is minimized at a particular value of the neuron output, as can be seen in the figure. The process of training a neural network is essentially the process of finding these minimizing output values for all the neurons in the network. Pruning this particular neuron (which translates to getting a zero output from it) will result in a change in the total overall error. This change in error is represented by the distance between the original minimum error (shown by the dashed line) and the top red arrow. This neuron is clearly a bad candidate for removal, since removing it will result in a huge error increase.

The straight red line in the figure represents the first-order approximation of the error using the Taylor series as described before, while the parabola represents a second-order approximation. It can be clearly seen that the second-order approximation is a much better estimate of the change in error.

One thing to note here is that in some cases some thresholding is required when approximating the error using the second-order Taylor series expansion. These cases might arise when the parabolic approximation undergoes a steep slope change. To take such cases into account, mean and median thresholding were employed, where any change above a certain threshold was assigned the mean or median value respectively.

Figure 1: The intuition behind 1st & 2nd order neuron pruning decisions

Two pruning algorithms are proposed here.
They differ in the way the neurons are ranked, but both of them use ΔE_k, the approximation of the change in error, as the basis for the ranking. ΔE_k can be calculated using the brute-force method or one of the two Taylor series approximations discussed previously.

The first step in both algorithms is to decide on a stopping criterion. This can vary depending on the application, but some intuitive stopping criteria are: the maximum number of neurons to remove, the percentage scaling needed, the maximum allowable accuracy drop, etc.

3.3.1 ALGORITHM I: SINGLE OVERALL RANKING

The complete algorithm is shown in Algorithm 1. The idea here is to generate a single ranked list based on the values of ΔE_k. This involves a single pass of second-order back-propagation (without weight updates) to collect the gradients for each neuron. The neurons from this ranked list (those with the lowest values of ΔE_k) are then pruned according to the chosen stopping criterion. We note here that this algorithm is intentionally naive and is used for comparison only.

Data: optimally trained network, training set
Result: a pruned network
initialize and define stopping criterion;
perform forward propagation over the training set;
perform second-order back-propagation without updating weights and collect linear and quadratic gradients;
rank the remaining neurons based on ΔE_k;
while stopping criterion is not met do
    remove the last-ranked neuron;
end
Algorithm 1: Single Overall Ranking

3.3.2 ALGORITHM II: ITERATIVE RE-RANKING

In this greedy variation of the algorithm (Algorithm 2), after each neuron removal the remaining network undergoes a single forward and backward pass of second-order back-propagation (without weight updates) and the ranked list is formed again. Hence, each removal involves a new pass through the network. This method is computationally more expensive, but it takes into account the dependencies the neurons might have on one another, which would lead to a change in error contribution every time a dependent neuron is removed.
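The two algorithms differ only in when ΔE_k is recomputed. This can be sketched compactly in Python (our own sketch; `rank_fn` is a hypothetical callable that returns the estimated ΔE_k for each currently remaining neuron, standing in for the brute-force or Taylor ranking passes):

```python
def prune(rank_fn, n_neurons, n_remove, rerank):
    """Remove n_remove neurons, always taking the one with the smallest
    estimated change in error. rerank=False reuses the initial ranking
    (Algorithm 1); rerank=True recomputes the ranking over the surviving
    neurons after every removal (Algorithm 2)."""
    alive = list(range(n_neurons))
    deltas = list(rank_fn(alive))      # one ranking pass up front
    removed = []
    for _ in range(n_remove):
        if rerank:                     # Algorithm 2: fresh pass per removal
            deltas = list(rank_fn(alive))
        i = min(range(len(alive)), key=lambda j: deltas[j])
        removed.append(alive.pop(i))
        deltas.pop(i)
    return removed, alive
```

In the re-ranking variant, `rank_fn` sees the network with all previously removed neurons switched off, which is what accounts for cancelling interdependencies between neurons; a real stopping criterion (accuracy drop, target size) would replace the fixed `n_remove`.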
Data: optimally trained network, training set
Result: a pruned network
initialize and define stopping criterion;
while stopping criterion is not met do
    perform forward propagation over the training set;
    perform second-order back-propagation without updating weights and collect linear and quadratic gradients;
    rank the remaining neurons based on ΔE_k;
    remove the worst neuron based on the ranking;
end
Algorithm 2: Iterative Re-Ranking

4 EXPERIMENTAL RESULTS

4.1 EXAMPLE REGRESSION PROBLEM

This problem serves as a quick example to demonstrate many of the phenomena described in previous sections. We trained two networks to learn the cosine function, with one input and one output. This is a task which requires no more than 11 sigmoid neurons to solve entirely, and in this case we don't care about overfitting because the cosine function has a precise definition. Furthermore, the cosine function is a good toy example because it is a smooth continuous function and, as demonstrated by Nielsen (2015), if we were to tinker directly with the weight and bias parameters of the network, we could allocate individual units within the network to be responsible for constrained ranges of inputs, similar to a basis spline function with many control points. This would distribute the learned function approximation evenly across all hidden units, and thus we would have presented the network with a problem in which it could productively use as many hidden units as we give it. In this case, a pruning algorithm would observe a fairly consistent increase in error after the removal of each successive unit. In practice, however, regardless of the number of experimental trials, this is not what happens. The network will always use 10-11 hidden units and leave the rest to cancel each other's influence.
Figure 2: Degradation in squared error after pruning a two-layer network trained to compute the cosine function (Left network: 2 layers, 10 neurons each, 1 output, logistic sigmoid activation, starting test accuracy: 0.9999993; Right network: 2 layers, 50 neurons each, 1 output, logistic sigmoid activation, starting test accuracy: 0.9999996)

Figure 2 shows two graphs. Both graphs demonstrate the use of the iterative re-ranking algorithm and the comparative performance of the brute-force pruning method (in blue), the first-order method (in green), and the second-order method (in red). The graph on the left shows the performance of these algorithms starting from a network with two layers of 10 neurons (20 total), and the graph on the right shows a network with two layers of 50 neurons (100 total).

In the left graph, we see that the brute-force method shows a graceful degradation, and the error only begins to rise sharply after 50% of the total neurons have been removed. The error is basically constant up to that point. In the first- and second-order methods, we see evidence of poor decision making in the sense that both made mistakes early on, which disrupted the output function approximation. The first-order method made a large error early on, though we see that after a few more neurons were removed this error was corrected somewhat (though it only got worse from there). This is direct evidence of the lack of fault tolerance in a trained neural network. This phenomenon is even more starkly demonstrated in the second-order method. After making a few poor neuron removal decisions in a row, the error signal rose sharply, and then went back to zero after the 6th neuron was removed.
This is due to the fact that the neurons it chose to remove were trained to cancel each other's influence within a localized part of the network. After the entire group was eliminated, the approximation returned to normal. This can only happen if the output function approximation is not evenly distributed over the hidden units in a trained network.

This phenomenon is even more starkly demonstrated in the graph on the right. Here we see the first-order method got "lucky" in the beginning and made decent decisions up to about the 40th removed neuron. The second-order method had a small error in the beginning which it recovered from gracefully, and it proceeded to pass the 50-neuron point before finally beginning to unravel. The brute-force method, in sharp contrast, shows little to no increase in error at all until 90% of the neurons in the network have been obliterated. Clearly the first- and second-order methods have some value in that they do not make completely arbitrary choices, but the brute-force method is far better at this task.

This also demonstrates the sharp dualism in neuron roles within a trained network. These networks were trained to near-perfect precision, and each pruning method was applied without any re-training of any kind. Clearly, in the case of the brute-force or oracle method, up to 90% of the network can be completely extirpated before the output approximation even begins to show any signs of degradation. This would be impossible if the learning representation were evenly or equitably distributed. Note, for example, that the degradation point in both cases is approximately the same. This example is not a real-world application of course, but it brings into very clear focus the kind of phenomena we will discuss in the following sections.

4.2 RESULTS ON MNIST DATASET

For all the results presented in this section, the MNIST database of handwritten digits by LeCun & Cortes (2010) was used.
It is worth noting that due to the time taken by the brute-force algorithm, we used a 5,000-image subset of the MNIST database in which we normalized the pixel values between 0 and 1.0 and compressed the images to 20x20 pixels rather than 28x28, so the starting test accuracy reported here appears higher than those reported by LeCun et al. We do not believe that this affects the interpretation of the presented results, because the basic learning problem does not change with a larger dataset or input dimension.

4.3 PRUNING A 1-LAYER NETWORK

The network architecture in this case consisted of 1 layer, 100 neurons, 10 outputs, logistic sigmoid activations, and a starting test accuracy of 0.998.

4.3.1 SINGLE OVERALL RANKING ALGORITHM

We first present the results for a single-layer neural network in Figure 3, using the Single Overall Ranking algorithm (Algorithm 1) as proposed in Section 3. (We again note that this algorithm is intentionally naive and is used for comparison only; its performance should be expected to be poor.) After training, each neuron is assigned its permanent ranking based on the three criteria discussed previously: a brute-force "ground truth" ranking, and two approximations of this ranking using first- and second-order Taylor estimations of the change in network output error resulting from the removal of each neuron.
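The paper does not say which resampling method was used for the 28x28 → 20x20 reduction, so the sketch below makes do with nearest-neighbour index sampling as an assumed stand-in; only the [0, 1] scaling and the output size are taken from the text:

```python
import numpy as np

def preprocess(images_uint8, out_size=20):
    """Scale uint8 pixel values to [0, 1] and shrink square images to
    out_size x out_size via nearest-neighbour index sampling (the actual
    resampling method used by the authors is unspecified)."""
    imgs = images_uint8.astype(np.float64) / 255.0
    n, h, w = imgs.shape
    rows = (np.arange(out_size) * h / out_size).astype(int)
    cols = (np.arange(out_size) * w / out_size).astype(int)
    return imgs[:, rows][:, :, cols]   # (n, out_size, out_size)
```

This reduces the input dimension from 784 to 400, which is what makes the K forward passes of the brute-force ranking tractable on a 5,000-image subset.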
An interesting observation here is that with only a single layer, no criterion for ranking the neurons in the network (brute force or the two Taylor series variants) using Algorithm 1 emerges superior, indicating that the 1st- and 2nd-order Taylor series methods are actually reasonable approximations of the brute-force method under certain conditions.

Figure 3: Degradation in squared error (left) and classification accuracy (right) after pruning a single-layer network using the Single Overall Ranking algorithm (Network: 1 layer, 100 neurons, 10 outputs, logistic sigmoid activation, starting test accuracy: 0.998)

Of course, this method is still quite bad in terms of the rate of degradation of the classification accuracy, and in practice we would likely follow Algorithm 2, which takes into account Mozer & Smolensky (1989a)'s observations stated in the Related Work section. The purpose of the present investigation, however, is to demonstrate how much of a trained network can be theoretically removed without altering the network's learned parameters in any way.
4.3.2 ITERATIVE RE-RANKING ALGORITHM

Figure 4: Degradation in squared error (left) and classification accuracy (right) after pruning a single-layer network using the iterative re-ranking algorithm (Network: 1 layer, 100 neurons, 10 outputs, logistic sigmoid activation, starting test accuracy: 0.998)

In Figure 4 we present our results using Algorithm 2 (the iterative re-ranking algorithm), in which all remaining neurons are re-ranked after each successive neuron is switched off. We compute the same brute-force rankings and Taylor series approximations of the error deltas over the remaining active neurons in the network after each pruning decision. This is intended to account for the effects of cancelling interactions between neurons.

There are two key observations here. Using the brute-force ranking criterion, almost 60% of the neurons in the network can be pruned away without any major loss in performance. The other noteworthy observation is that the 2nd-order Taylor series approximation of the error performs consistently better than its 1st-order version in most situations, though Figure 21 is a poignant counter-example.

4.3.3 VISUALIZATION OF ERROR SURFACE & PRUNING DECISIONS

As explained in Section 3, these graphs are a visualization of the error surface of the network output with respect to the neurons chosen for removal using each of the 3 ranking criteria, represented in intervals of 10 neurons. In each graph, the error surface of the network output is displayed in log space (left) and in real space (right) with respect to each candidate neuron chosen for removal.
We create these plots during the pruning exercise by picking a neuron to switch off, and then multiplying its output by a scalar gain value α which is adjusted from 0.0 to 10.0 with a step size of 0.001. When the value of α is 1.0, this represents the unperturbed neuron output learned during training. Between 0.0 and 1.0, we are graphing the literal effect of turning the neuron off (α = 0), and when α > 1.0 we are simulating a boosting of the neuron's influence in the network, i.e. inflating the value of its outgoing weight parameters.

We graph the effect of boosting the neuron's output to demonstrate that for certain neurons in the network, even doubling, tripling, or quadrupling the scalar output of the neuron has no effect on the overall error of the network, indicating the remarkable degree to which the network has learned to ignore the value of certain parameters. In other cases, we can get a sense of the sensitivity of the network's output to the value of a given neuron when the curve rises steeply after the red 1.0 line. This indicates that the learned values of the parameters emanating from a given neuron are relatively important, and this is why we should ideally see sharper upticks in the curves for the later-removed neurons in the network, that is, when the neurons crucial to the learning representation start to be picked off. Some very interesting observations can be made in each of these graphs.

Remember that lower is better in terms of the height of the curve, and minimal (or negative) horizontal change between the vertical red line at 1.0 (neuron on, α = 1.0) and 0.0 (neuron off, α = 0.0) is indicative of a good candidate neuron to prune, i.e. there will be minimal effect on the network output when the neuron is removed.
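The sweep itself is straightforward to reproduce. The sketch below is ours (it assumes a one-hidden-layer sigmoid network and a squared-error measure, as elsewhere in our illustrations): it scales one neuron's activation by α before the output layer and records the resulting error for each gain value:

```python
import numpy as np

def gain_sweep_errors(X, Y, W1, W2, k, alphas):
    """Total squared error of the network as neuron k's output is scaled by
    a gain alpha: alpha=0 turns the neuron off, alpha=1 is the trained
    network, and alpha>1 inflates the neuron's outgoing influence."""
    errs = []
    for alpha in alphas:
        H = 1.0 / (1.0 + np.exp(-X @ W1))   # hidden sigmoid activations
        H[:, k] = alpha * H[:, k]            # perturb one neuron's gain
        errs.append(float(np.sum((H @ W2 - Y) ** 2)))
    return errs
```

A perfectly flat curve over the whole sweep means the network has learned to ignore that neuron entirely; a curve that climbs steeply on either side of α = 1.0 marks a neuron the output approximation genuinely depends on.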
4.3.4 VISUALIZATION OF BRUTE FORCE PRUNING DECISIONS

Figure 5: Error surface of the network output in log space (left) and real space (right) with respect to each candidate neuron chosen for removal using the brute-force criterion (Network: 1 layer, 100 neurons, 10 outputs, logistic sigmoid activation, starting test accuracy: 0.998)

In Figure 5, we notice how low to the floor and flat most of the curves are. It is not until the 90th removed neuron that we see a higher curve with a more convex shape (clearly a more sensitive, influential piece of the network).

4.3.5 VISUALIZATION OF 1ST ORDER APPROXIMATION PRUNING DECISIONS

It can be seen in Figure 6 that most choices seem to have flat or negatively sloped curves, indicating that the first-order approximation seems to be fairly good, but examining the brute-force choices shows they could be better.

4.3.6 VISUALIZATION OF 2ND ORDER APPROXIMATION PRUNING DECISIONS

The choices of the method in Figure 7 look similar to the brute-force method's choices, though clearly not as good (they are more spread out). Notice the difference in convexity between the 2nd- and 1st-order method