Converting Cascade-Correlation Neural Nets into Probabilistic Generative Models

Ardavan S. Nobandegani (1,3) and Thomas R. Shultz (2,3)
{[email protected], [email protected]}
1 Department of Electrical and Computer Engineering, McGill University
2 School of Computer Science, McGill University
3 Department of Psychology, McGill University

arXiv:1701.05004v1 [q-bio.NC] 18 Jan 2017

Abstract

Humans are not only adept in recognizing what class an input instance belongs to (i.e., the classification task), but perhaps more remarkably, they can imagine (i.e., generate) plausible instances of a desired class with ease, when prompted. Inspired by this, we propose a framework which allows transforming Cascade-Correlation Neural Networks (CCNNs) into probabilistic generative models, thereby enabling CCNNs to generate samples from a category of interest. CCNNs are a well-known class of deterministic, discriminative NNs, which autonomously construct their topology, and have been successful in giving accounts for a variety of psychological phenomena. Our proposed framework is based on a Markov Chain Monte Carlo (MCMC) method, called the Metropolis-adjusted Langevin algorithm, which capitalizes on the gradient information of the target distribution to direct its explorations towards regions of high probability, thereby achieving good mixing properties. Through extensive simulations, we demonstrate the efficacy of our proposed framework.

1 Introduction

A green-striped elephant! No one has probably seen such a thing—no surprise. But what is a surprise is our ability to imagine one with almost no trouble. Humans are not only adept in recognizing what class an input instance belongs to (i.e., the classification task), but more remarkably, they can imagine (i.e., generate) plausible instances of a desired class with ease, when prompted. In fact, humans can generate instances of a desired class, say, elephant, that they have never encountered before, like a green-striped elephant.[1] In this sense, humans' generative capacity goes beyond merely retrieving from memory.

[1] In counterfactual terms: Had a human seen a green-striped elephant, s/he would have yet recognized it as an elephant. Geoffrey Hinton once told a similar story about a pink elephant!
In computational terms, the notion of generating examples from a desired class can be formalized in terms of sampling from some underlying probability distribution, and has been extensively studied in machine learning under the rubric of probabilistic generative models.

Cascade-Correlation Neural Networks (CCNNs) (Fahlman & Lebiere, 1989) are a well-known class of discriminative (as opposed to generative) models that have been successful in simulating a variety of phenomena in the developmental literature, e.g., infant learning of word-stress patterns in artificial languages (Shultz & Bale, 2006), syllable boundaries (Shultz & Bale, 2006), and visual concepts (Shultz, 2006), and have also been successful in capturing important developmental regularities in a variety of tasks, e.g., the balance-scale task (Shultz, Mareschal, & Schmidt, 1994; Shultz & Takane, 2007), transitivity (Shultz & Vogel, 2004), conservation (Shultz, 1998), and seriation (Mareschal & Shultz, 1999). Moreover, CCNNs exhibit several similarities with known brain functions: distributed representation, self-organization of network topology, layered hierarchical topologies, both cascaded and direct pathways, an S-shaped activation function, activation modulation via integration of neural inputs, long-term potentiation, growth at the newer end of the network via synaptogenesis or neurogenesis, pruning, and weight freezing (Westermann, Sirois, Shultz, & Mareschal, 2006). Nonetheless, by virtue of being deterministic and discriminative, CCNNs have so far lacked the capacity to probabilistically generate examples from a category of interest. This ability can be used, e.g., to diagnose what the network knows at various points during training, particularly when dealing with high-dimensional input spaces.

In this work, we propose a framework which allows transforming CCNNs into probabilistic generative models, thereby enabling CCNNs to generate samples from a category. Our proposed framework is based on a Markov Chain Monte Carlo (MCMC) method, called the Metropolis-Adjusted Langevin (MAL) algorithm, which employs the gradient of the target distribution to guide its explorations towards regions of high probability, thereby significantly reducing the undesirable random walk often observed at the beginning of an MCMC run (a.k.a. the burn-in period). MCMC methods are a family of algorithms for sampling from a desired probability distribution, and have been successful in simulating important aspects of a wide range of cognitive phenomena, e.g., the temporal dynamics of multistable perception (Gershman, Vul, & Tenenbaum, 2012; Moreno-Bote, Knill, & Pouget, 2011), developmental changes in cognition (Bonawitz, Denison, Griffiths, & Gopnik, 2014), category learning (Sanborn, Griffiths, & Navarro, 2010), causal reasoning in children (Bonawitz, Denison, Gopnik, & Griffiths, 2014), and giving accounts for many cognitive biases (Dasgupta, Schulz, & Gershman, 2016).

Furthermore, work in theoretical neuroscience has shed light on possible mechanisms according to which MCMC methods could be realized in generic cortical circuits (Buesing, Bill, Nessler, & Maass, 2011; Moreno-Bote et al., 2011; Pecevski, Buesing, & Maass, 2011; Gershman & Beck, 2016). In particular, Moreno-Bote et al. (2011) showed how an attractor neural network implementing MAL could account for multistable perception of drifting gratings, and Savin and Deneve (2014) showed how a network of leaky integrate-and-fire neurons could implement MAL in a biologically realistic manner.

2 Cascade-Correlation Neural Networks

CCNNs are a special class of deterministic artificial neural networks, which construct their topology in an autonomous fashion—an appealing property for simulating developmental phenomena (Westermann et al., 2006) and other cases where networks need to be constructed. CCNN training starts with a two-layer network (i.e., the input and the output layer) with no hidden units, and proceeds by recruiting hidden units one at a time, as needed. Each new hidden unit is trained to be maximally correlated with the residual error in the network built so far, and is recruited into a hidden layer of its own, giving rise to a deep network with as many hidden layers as the number of recruited hidden units. CCNNs use sum-of-squared error as the objective function, and typically use symmetric sigmoidal activation functions with range −0.5 to +0.5 for hidden and output units.[2] Some variants of CCNNs have been proposed, e.g., Sibling-Descendant Cascade-Correlation (SDCC) (Baluja & Fahlman, 1994) and Knowledge-Based Cascade-Correlation (KBCC) (Shultz & Rivest, 2001). Although in this work we specifically focus on standard CCNNs, our proposed framework can handle SDCC and KBCC as well.

[2] Fahlman and Lebiere (1989) also suggest linear, Gaussian, and asymmetric sigmoidal (with range 0 to +1) activation functions as alternatives. Our proposed framework can be straightforwardly adapted to handle all such activation functions.

3 The Metropolis-Adjusted Langevin Algorithm

MAL (Roberts & Tweedie, 1996) is a special type of MCMC method, which employs the gradient of the target distribution to guide its explorations towards regions of high probability, thereby reducing the burn-in period. More specifically, MAL combines the two concepts of Langevin dynamics (a random walk guided by the gradient of the target distribution) and the Metropolis-Hastings algorithm (an accept/reject mechanism for generating a sequence of samples whose distribution asymptotically converges to the target distribution).

We denote random variables with small bold-faced letters, random vectors by capital bold-faced letters, and their corresponding realizations by non-bold-faced letters. The MAL algorithm is outlined in Algorithm 1, wherein π(X) denotes the target probability distribution, τ is a positive real-valued parameter specifying the time-step used in the Euler-Maruyama approximation of the underlying Langevin dynamics, N denotes the number of samples generated by the MAL algorithm, q denotes the proposal distribution (a.k.a. transition kernel), N(µ, Σ) denotes the multivariate normal distribution with mean vector µ and covariance matrix Σ, and, finally, I denotes the identity matrix.

Algorithm 1: The Metropolis-Adjusted Langevin algorithm
Input: Target distribution π(X), parameter τ ∈ R+, number of samples N.
Output: Samples X(0), ..., X(N−1).
1:  Pick X(0) arbitrarily.
2:  for i = 0 : N−1
3:    Sample u ∼ Uniform[0,1]
4:    Sample X* ∼ q(X* | X(i)) = N(X(i) + τ∇log π(X(i)), 2τI)
5:    if u < min{1, [π(X*) q(X(i) | X*)] / [π(X(i)) q(X* | X(i))]} then
6:      X(i+1) ← X*
7:    else
8:      X(i+1) ← X(i)
9:    end if
10: end for
11: return X(0), ..., X(N−1)

The sequence of samples generated by the MAL algorithm, X(0), X(1), ..., is guaranteed to converge in distribution to π(X) (Robert & Casella, 2013). It is worth noting that work in theoretical neuroscience has shown that MAL, outlined in Algorithm 1, could be implemented in a neurally plausible manner (Savin & Deneve, 2014; Moreno-Bote et al., 2011).
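Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the names mal_sampler, log_pi, and grad_log_pi are ours, and the caller supplies the log of the (possibly unnormalized) target density together with its gradient.

```python
import numpy as np

def mal_sampler(log_pi, grad_log_pi, x0, tau, n_samples, rng=None):
    """Minimal sketch of Algorithm 1 (Metropolis-Adjusted Langevin).

    log_pi(x)      -- log of the target density pi(X), up to an additive constant
    grad_log_pi(x) -- gradient of log_pi at x
    x0             -- arbitrary starting point X(0), a 1-D numpy array
    tau            -- Langevin step-size parameter (tau > 0)
    n_samples      -- number of samples N to return
    Returns the chain X(0), ..., X(N-1) and the fraction of accepted proposals.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    samples, accepted = [x.copy()], 0

    def log_q(x_to, x_from):
        # log N(x_to; x_from + tau*grad_log_pi(x_from), 2*tau*I), dropping constants
        mean = x_from + tau * grad_log_pi(x_from)
        return -np.sum((x_to - mean) ** 2) / (4.0 * tau)

    for _ in range(n_samples - 1):
        # Line 4: Langevin proposal centered at a gradient step from the current state
        proposal = (x + tau * grad_log_pi(x)
                    + np.sqrt(2.0 * tau) * rng.standard_normal(x.size))
        # Line 5: Metropolis-Hastings accept/reject; any constant factor of pi cancels here
        log_ratio = (log_pi(proposal) + log_q(x, proposal)
                     - log_pi(x) - log_q(proposal, x))
        if np.log(rng.uniform()) < min(0.0, log_ratio):
            x, accepted = proposal, accepted + 1
        samples.append(x.copy())
    return np.array(samples), accepted / max(n_samples - 1, 1)
```

Because Line 5 involves only ratios of π, the normalizing constant of the target cancels; this is exactly the property the framework of Section 4 relies on.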
In the following section, we propose a target distribution π(X) allowing CCNNs to generate samples from a category of interest.

4 The Proposed Framework

In what follows, we propose a framework which transforms CCNNs into probabilistic generative models, thereby enabling them to generate samples from a category of interest. The proposed framework is based on the MAL algorithm given in Sec. 3. Let f(X; W*) denote the input-output mapping learned by a CCNN at the end of the training phase, and W* denote the set of weights for a CCNN after training.[3] Upon termination of training, presented with input X, a CCNN outputs f(X; W*). Note that, in case a CCNN possesses multiple output units, the mapping f(X; W*) will be a vector rather than a scalar. To convert a CCNN into a probabilistic generative model, we propose to use the MAL algorithm with its target distribution π(X) set as follows:

\tilde{\pi}(X) \triangleq p(X \mid Y = L_j) = \frac{1}{Z}\,\exp\!\big(-\beta\,\|L_j - f(X; W^*)\|_2^2\big),   (1)

where ||·||_2 denotes the l_2-norm, β ∈ R+ is a damping factor, Z is the normalizing constant, and L_j is a vector whose element corresponding to the desired class (i.e., its jth element) is +0.5 and whose remaining elements are −0.5. The intuition behind Eq. (1) can be articulated as follows: for an input instance X belonging to the desired class j,[4] the output of the network f(X; W*) is expected to be close to L_j in the l_2-norm sense; in this light, Eq. (1) adjusts the likelihood of an input instance X to be inversely proportional to the exponent of the said l_2 distance.

[3] Formally, f(·; W*): ∏_{i=1}^{n} D_i → ∏_{j=1}^{m} R_j, where D_i and R_j denote the sets of values that input unit i and output unit j can take on, respectively.

[4] In counterfactual terms, this is equivalent to saying: Had input instance X been presented to the network, it would have classified X in class j.
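To connect Eq. (1) to the sampler sketched above, the unnormalized log-target can be wrapped around any trained network as follows. Here f_net is a stand-in for the learned CCNN mapping f(·; W*) (any callable returning the output activations will do), and make_log_target is a hypothetical helper name, not part of the paper.

```python
import numpy as np

def make_log_target(f_net, L_j, beta):
    """Return log pi~(X) = -beta * ||L_j - f_net(X)||_2^2, i.e., Eq. (1) with the
    constant -log Z dropped (MAL never needs it). L_j holds +0.5 at the desired
    class and -0.5 elsewhere; beta is the damping factor."""
    L_j = np.atleast_1d(np.asarray(L_j, dtype=float))

    def log_pi(x):
        err = L_j - np.atleast_1d(f_net(x))
        return -beta * float(np.dot(err, err))
    return log_pi
```

Handing log_pi (together with a gradient for it; see the finite-difference sketch below) to mal_sampler yields inputs that the trained network maps close to L_j, i.e., inputs it would classify as the desired class j.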
output units. This permits visualization of the input-output 2 Forthereaderfamiliarwithprobabilisticgraphicalmodels, space, which lies in R3. Note that our proposed framework the expression in Eq. (1) looks similar to the expression for can handle arbitrary number of input and output units; this thejointprobabilitydistributionofMarkovrandomfieldsand restrictionissolelyforeaseofvisualization. probabilisticenergy-basedmodels,e.g.,RestrictedBoltzman 5.1 Continuous-XORProblem MachinesandDeepBoltzmanMachines. However,thereisa crucial distinction: The normalizing constant Z, the compu- Inthissubsection,weshowhowourproposedframeworkal- tation of which is intractable in general, renders learning in lowsaCCNN,trainedonthecontinuous-XORclassification those models computationally intractable.5 The appropriate task (see Fig. 1), to generate examples from a category of way to interpret Eq. (1) is to see it as a Gibbs distribution interest. The output unit has a symmetric sigmoidal activa- for a non-probabilistic energy-based model whose energy is tion function with range −0.5 and +0.5. The training set definedasthesquareofthepredictionerror(LeCun,Chopra, consistsof100samplesintheunit-square[0,1]2,pairedwith Hadsell,Ranzato,&Huang,2006). Section1.3of(LeCunet theircorrespondinglabels. Morespecifically,thetrainingset al., 2006) discusses the topic of Gibbs distribution for non- is comprised of all the ordered-pairs starting from (0.1,0.1) probabilisticenergy-basedmodelsinthecontextofdiscrim- andgoingupto(1,1)withequalstepsofsize0.1,pairedwith initivelearning,computationallymodeledby p(Y|X)(i.e.,to theircorrespondinglabels(i.e.,+0.5forpositivesamplesand predictaclassgivenaninput),andraisesthesameissuethat −0.5fornegativesamples);seeFig. 1(top-left). Aftertrain- wehighlightedaboveregardingtheintractabilityofcomput- ing, a CCNN with 6 hidden layers is obtained whose input- ing the normalizing constant Z in general. In sharp contrast output mapping, f(x ,x ;W∗), is depicted in Fig. 1(top- 1 2 to (LeCun et al., 2006), our framework is proposed for the right).7 purpose of generating examples from a desired class, as ev- idenced by Eq. (1) being defined in terms of p(X|Y). Also 0.5 crucially,theintractabilityofcomputingZraisesnoissuefor 0.4 ourproposedframeworkduetoanintriguingpropertyofthe 0.19 000...345 00..23 MZnAeLedalngootrbitehmcoamcpcourteddinagttaollw.6hichthenormalizingconstant x20000....5678 ∗fx,xW(;)12−−−00000.....321012 −00.01.1 tsieonDnt,iuaqell,ytroeinqLvuioinrleevse4sthtohefecAcoolmgmoppruiutthatatmitoion1n,ooMff∇A∇lLfo’(gsXπ˜p((ir)Xo;pW(io))∗s,a)lw(ndhoiistctehritbheuas--t 0000....012340 0.1 0.2 0.3 0.4 x0.15 0.6 0.7 0.8 0.9 1 −−00..154 0.8 0.6x20.4 0.2 0 0 0.2 0.x41 0.6 0.8 1 −−−−0000....5432 thegradientisoperatingonX(i),andW∗ ismerelytreatedas 1 0.5 asetoffixedparameters).Themulti-layerstructureofCCNN 0.9 0.4 ensuresthat∇f(X(i);W∗)canbeefficientlycomputedusing 0.8 0.3 0.7 0.2 Backpropagation.Alternatively,insettingswhereCCNNsre- 0.6 0.1 cruitasmallnumberofinputunits(hence,thecardinalityof X(i) is small), ∇f(X(i);W∗) can be obtained by introducing x20.5 0 0.4 −0.1 negligible perturbation to a component of input signal X(i), 0.3 −0.2 dividingtheresultingchangeinnetwork’soutputsbythein- 0.2 −0.3 troducedperturbation,andrepeatingthisprocessforallcom- 0.1 −0.4 ponentsofinputsignalX(i). 
5 Simulations

In this section we demonstrate the efficacy of our proposed framework through simulations. We particularly focus on learning tasks which can be accomplished by two input units and one output unit. This permits visualization of the input-output space, which lies in R^3. Note that our proposed framework can handle an arbitrary number of input and output units; this restriction is solely for ease of visualization.

5.1 Continuous-XOR Problem

In this subsection, we show how our proposed framework allows a CCNN, trained on the continuous-XOR classification task (see Fig. 1), to generate examples from a category of interest. The output unit has a symmetric sigmoidal activation function with range −0.5 to +0.5. The training set consists of 100 samples in the unit square [0,1]^2, paired with their corresponding labels. More specifically, the training set is comprised of all the ordered pairs starting from (0.1, 0.1) and going up to (1, 1) with equal steps of size 0.1, paired with their corresponding labels (i.e., +0.5 for positive samples and −0.5 for negative samples); see Fig. 1 (top-left). After training, a CCNN with 6 hidden layers is obtained whose input-output mapping, f(x1, x2; W*), is depicted in Fig. 1 (top-right).[7]

[7] Due to the inherent randomness in CCNN construction, training could lead to networks with different structures. However, since in this work we are solely concerned with generating examples using CCNNs rather than with how well CCNNs could learn a given discriminative task, we arbitrarily pick a learned network. Note that our proposed framework can handle CCNNs with arbitrary structures; in that light, the choice of network is without loss of generality.

Figure 1: A CCNN trained on the continuous-XOR classification task. Top-left: Training patterns. All the patterns in the gray quadrants are negative examples with label −0.5, and all the patterns in the white quadrants are positive examples with label +0.5. Red dotted lines depict the boundaries. Top-right: The input-output mapping, f(x1, x2; W*), learned by a CCNN, along with a colorbar. Bottom: The top-down view of the surface depicted in top-right, along with a colorbar.
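For concreteness, the 10 × 10 training grid just described can be constructed as follows. The labeling rule used here (a point is positive exactly when one of its coordinates exceeds 0.5) is the usual continuous-XOR convention and is our assumption; the text above specifies the labels only through the gray and white quadrants of Fig. 1.

```python
import numpy as np

# Ordered pairs (0.1, 0.1), (0.1, 0.2), ..., (1.0, 1.0): 100 training points.
grid = np.round(np.arange(0.1, 1.01, 0.1), 1)
X_train = np.array([(x1, x2) for x1 in grid for x2 in grid])

# Assumed continuous-XOR labels: +0.5 when exactly one coordinate lies above 0.5,
# -0.5 otherwise (points exactly on the 0.5 boundary fall into the "not above" case).
y_train = np.where((X_train[:, 0] > 0.5) ^ (X_train[:, 1] > 0.5), 0.5, -0.5)
```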
Fig. 2 shows the efficacy of our proposed framework in enabling CCNNs to generate samples from a category of interest, under various choices of the MAL parameter τ (see Algorithm 1) and the damping factor β (see Eq. (1)); generated samples are depicted by red dots. For the results shown in Fig. 2, the category of interest is the category of positive examples, i.e., the category of input patterns which, upon being presented to the (learned) network, would be classified as positive by the network. Because τ controls the size of the jump between consecutive proposals made by MAL, the following behavior is expected: for small τ (Fig. 2(a)), consecutive proposals are very close to one another, leading to a slow exploration of the input domain. As τ increases, bigger jumps are made by MAL (Fig. 2(b)).[8] The parameter β controls how severely deviations from the desired class label (here, +0.5) should be penalized. The larger the parameter β, the more severely such deviations are penalized and the less likely it becomes for MAL to make moves toward such regions of input space. Acceptance Rate (AR), defined as the number of accepted moves divided by the total number of suggested moves, is also presented for the results shown in Fig. 2. Fig. 2(c) shows that for τ = 5×10^−3 and β = 10, our proposed framework demonstrates a desirable performance: virtually all of the generated samples fall within the desired input regions (i.e., the regions associated with hot colors, signaling the closeness of the network's output to +0.5 in those regions; see Fig. 1 (bottom)), and the desired regions are adequately explored (i.e., all hot-colored input regions are visited and almost evenly explored).

[8] Yet, too large a τ is not good either, leading to a sparse and coarse-grained exploration of the input space. Some measures have been proposed in computational statistics for properly choosing τ; cf. (Roberts & Rosenthal, 1998).

Figure 2: Generating examples for the positive category, under various choices of the MAL parameter τ and the damping factor β. A contour plot of the learned mapping, f(x1, x2; W*), along with its corresponding colorbar, is shown in each sub-figure. Generated samples are depicted by red dots. N denotes the total number of samples generated by MAL, and AR denotes the corresponding acceptance rate. (a) β = 1, τ = 5×10^−5, N = 2000, AR = 99.55%: τ = 5×10^−5 leads to a very slow exploration of the input space. (b) β = 1, τ = 5×10^−3, N = 2000, AR = 75.25%: τ = 5×10^−3 leads to an adequate exploration of the input space; however, β = 1 does not penalize undesirable input regions severely enough. (c) β = 10, τ = 5×10^−3, N = 2000, AR = 57.85%: a desirable performance is achieved.

Results shown in Fig. 2 depict all of the first N = 2000 samples generated by MAL, without excluding the so-called burn-in period. In that light, the result shown in Fig. 2(c) nicely demonstrates how MAL—by directing its proposals along the gradient and therefore making moves toward regions of high likelihood—can alleviate the need for discarding a (potentially large) number of samples generated at the beginning of an MCMC run which are assumed to be unrepresentative of the equilibrium state, a.k.a. the burn-in period. Fig. 3 shows the performance of our framework in enabling the learned CCNN to generate from the category of negative examples, with τ = 5×10^−3 and β = 10.

Figure 3: Generating examples for the negative category, with τ = 5×10^−3 and β = 10. Generated samples are shown by blue dots. The total number of samples generated is N = 2000, with AR = 65.13%.
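As a usage illustration, a run in the spirit of Fig. 2(c) and Fig. 3 can be assembled from the earlier sketches (mal_sampler, make_log_target, finite_difference_grad). The network below is only a smooth XOR-like stand-in for the learned f(·; W*); in an actual experiment it would be replaced by the trained CCNN's forward pass.

```python
import numpy as np

# Stand-in for the trained mapping f(.; W*): a smooth XOR-like surface with outputs
# in roughly [-0.5, +0.5]. Replace with the trained CCNN's forward pass in practice.
def f_net(x):
    s = lambda z: 1.0 / (1.0 + np.exp(-z)) - 0.5            # symmetric sigmoid
    return np.array([2.0 * s(8.0 * (x[0] - 0.5)) * s(8.0 * (0.5 - x[1]))])

beta, tau, n_samples = 10.0, 5e-3, 2000                      # the Fig. 2(c) / Fig. 3 setting
log_pi = make_log_target(f_net, L_j=[0.5], beta=beta)        # positive category: L_j = +0.5
grad_log_pi = lambda x: finite_difference_grad(log_pi, x)

samples, acceptance_rate = mal_sampler(log_pi, grad_log_pi,
                                       x0=np.array([0.5, 0.5]),
                                       tau=tau, n_samples=n_samples)
print(f"acceptance rate: {100 * acceptance_rate:.2f}%")      # cf. the AR values above
```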
5.2 Two-Spirals Problem

Next, we show how our proposed framework allows a CCNN, trained on the famously difficult two-spirals classification task (Fig. 4), to generate examples from a category of interest. The output unit has a symmetric sigmoidal activation function with range −0.5 to +0.5. The training set consists of 194 samples (97 samples per spiral) in the square [−6.5, 6.5]^2, paired with their corresponding labels (+0.5 and −0.5 for positive and negative samples, respectively). The training pattern is shown in Fig. 4 (top-left); cf. Chalup and Wiklendt (2007) for details. After training, a CCNN with 14 hidden layers is obtained whose input-output mapping, f(x1, x2; W*), is depicted in Fig. 4 (top-right).

Figure 4: A CCNN trained on the two-spirals classification task. Top-left: Training patterns. Positive patterns (associated with label +0.5) are shown by hollow circles, and negative patterns (associated with label −0.5) by black circles. The positive spiral is depicted by a dashed line, and the negative spiral by a dotted line. Top-right: The input-output mapping, f(x1, x2; W*), learned by a CCNN, along with a colorbar. Bottom: The top-down view of the surface depicted in top-right, along with a colorbar.

Fig. 5 (top) and Fig. 5 (bottom) show the efficacy of our proposed framework in enabling CCNNs to generate samples from the positive and negative categories, respectively. Patterns of behavior similar to those observed in Sec. 5.1 when increasing or decreasing β and τ are observed here as well, but due to lack of space such results are omitted. Note that the results shown in Fig. 5 depict all of the first N = 15000 samples generated by MAL, without excluding the burn-in period. In that light, the results shown in Fig. 5 (top) and Fig. 5 (bottom) demonstrate once again the efficacy of MAL in alleviating the need for discarding a (potentially large) number of samples generated at the beginning of an MCMC run.

Figure 5: Generating examples for the positive and negative categories, with β = 20 and τ = 0.7. A contour plot of the learned mapping, f(x1, x2; W*), along with its corresponding colorbar, is shown in each sub-figure. N denotes the total number of samples generated by MAL, and AR denotes the corresponding acceptance rate. Top: Generating examples for the positive category, with N = 15000 and AR = 40.69%. Generated samples are depicted by red dots. Bottom: Generating examples for the negative category, with N = 15000 and AR = 40.28%. Generated samples are depicted by blue dots.

Interestingly, our proposed framework also allows CCNNs to generate samples subject to some forms of constraints. For example, Fig. 6 demonstrates how our proposed framework enables a CCNN, trained on the continuous-XOR classification task (see Sec. 5.1), to generate examples from the positive category under the following constraint: generated samples must lie on the curve x2 = 0.25 sin(8πx1) + 0.5. To generate samples from the positive category while satisfying the said constraint, MAL adopts our proposed target distribution given in Eq. (1), and treats x1 as an independent and x2 as a dependent variable.

Figure 6: Generating examples for the positive category, under the constraint x2 = 0.25 sin(8πx1) + 0.5 (dash-dotted curve), with β = 10, τ = 5×10^−3, N = 5000, and AR = 39.82%. A contour plot of the learned mapping, f(x1, x2; W*), along with its corresponding colorbar, is depicted. Generated samples are shown by red dots (due to high density, individual samples may not be easily visible).
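A minimal sketch of this constrained variant, under the same assumptions as the earlier sketches (f_net standing in for the trained network, and the helper names being ours): MAL is run over x1 alone, and x2 is computed from the constraint curve before the network is evaluated.

```python
import numpy as np

def constrained_log_target(f_net, L_j, beta):
    """log pi~ over the free variable x1 only; x2 is tied to the constraint curve
    x2 = 0.25*sin(8*pi*x1) + 0.5 of Fig. 6 before f_net is evaluated."""
    base = make_log_target(f_net, L_j, beta)

    def log_pi(x):
        x1 = float(np.asarray(x).ravel()[0])
        x2 = 0.25 * np.sin(8.0 * np.pi * x1) + 0.5
        return base(np.array([x1, x2]))
    return log_pi

log_pi_c = constrained_log_target(f_net, L_j=[0.5], beta=10.0)   # positive category
grad_log_pi_c = lambda x: finite_difference_grad(log_pi_c, x)
x1_chain, ar = mal_sampler(log_pi_c, grad_log_pi_c,
                           x0=np.array([0.5]), tau=5e-3, n_samples=5000)
# Recover the generated (x1, x2) pairs by mapping the sampled x1 values onto the curve.
samples = np.column_stack([x1_chain[:, 0],
                           0.25 * np.sin(8.0 * np.pi * x1_chain[:, 0]) + 0.5])
```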
6 General Discussion

Although we discussed our proposed framework in the context of CCNNs, it can be straightforwardly extended to handle other kinds of artificial neural networks, e.g., multi-layer perceptrons and convolutional neural networks. Furthermore, our proposed framework, together with recent work in theoretical neuroscience showing possible neurally plausible implementations of MAL (Savin & Deneve, 2014; Moreno-Bote et al., 2011), suggests an intriguing modular hypothesis according to which generation could result from two separate modules interacting with each other (in our case, a CCNN and a neural network implementing MAL). This hypothesis yields the following prediction: there should be some brain impairments which lead to a marked decline in a subject's performance in generative tasks (i.e., tasks involving imagery, or imaginative tasks in general) but leave the subject's learning abilities (nearly) intact. Studies on the learning and imaginative abilities of hippocampal amnesic patients already provide some supporting evidence for this idea (Hassabis, Kumaran, Vann, & Maguire, 2007; Spiers, Maguire, & Burgess, 2001; Brooks & Baddeley, 1976).

According to Line 4 of Algorithm 1, to generate the ith sample, MAL requires access to fine-tuned Gaussian noise with mean X(i) + τ∇log π(X(i)) for its proposal distribution q. Recently, Savin and Deneve (2014) showed how a network of leaky integrate-and-fire neurons could implement MAL in a neurally plausible manner. However, as Gershman and Beck (2016) point out, Savin and Deneve leave unanswered what the source of that fine-tuned Gaussian noise could be. Our proposed framework may provide an explanation, not for the source of the Gaussian noise, but for its fine-tuned mean value. According to our modular account, the main component of the mean value, which is ∇log π(X(i)), may come from another module (in our case, a CCNN) which has learned some input-output mapping f(X; W*), based on which the target distribution π(X(i)) is defined (see Eq. (1)).

The idea of sample generation under constraints could be an interesting line of future work. Humans clearly have the capacity to engage in imaginative tasks under a variety of constraints, e.g., when given incomplete sentences or fragments of a picture, people can generate possible completions; cf. (Sanborn & Chater, 2016). Also, our proposed framework can be used to let a CCNN generate samples from a category of interest at any stage during CCNN construction. In that light, our proposed framework, along with a neurally plausible implementation of MAL, gives rise to a self-organized generative model: a generative model possessing the self-constructive property of CCNNs. Such self-organized generative models could provide a wealth of developmental hypotheses as to how the imaginative capacities of children change over development, and models with quantitative predictions to compare against. We see our work as a step towards such models.

Acknowledgment

This work is funded by an operating grant to TRS from the Natural Sciences and Engineering Research Council of Canada.
References

Baluja, S., & Fahlman, S. E. (1994). Reducing network depth in the cascade-correlation learning architecture. Technical Report CMU-CS-94-209, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

Bonawitz, E., Denison, S., Gopnik, A., & Griffiths, T. L. (2014). Win-stay, lose-sample: A simple sequential algorithm for approximating Bayesian inference. Cognitive Psychology, 74, 35–65.

Bonawitz, E., Denison, S., Griffiths, T. L., & Gopnik, A. (2014). Probabilistic models, learning algorithms, and response variability: Sampling in cognitive development. Trends in Cognitive Sciences, 18(10), 497–500.

Brooks, D., & Baddeley, A. (1976). What can amnesic patients learn? Neuropsychologia, 14(1), 111–122.

Buesing, L., Bill, J., Nessler, B., & Maass, W. (2011). Neural dynamics as sampling: A model for stochastic computation in recurrent networks of spiking neurons. PLoS Computational Biology, 7(11), e1002211.

Chalup, S. K., & Wiklendt, L. (2007). Variations of the two-spiral task. Connection Science, 19(2), 183–199.

Dasgupta, I., Schulz, E., & Gershman, S. J. (2016). Where do hypotheses come from? Center for Brains, Minds and Machines (CBMM) Memo No. 056.

Fahlman, S. E., & Lebiere, C. (1989). The cascade-correlation learning architecture.

Gershman, S. J., & Beck, J. M. (2016). Complex probabilistic inference: From cognition to neural computation. In A. Moustafa (Ed.), Computational Models of Brain and Behavior. Hoboken, NJ: Wiley-Blackwell.

Gershman, S. J., Vul, E., & Tenenbaum, J. B. (2012). Multistability and perceptual inference. Neural Computation, 24(1), 1–24.

Hassabis, D., Kumaran, D., Vann, S. D., & Maguire, E. A. (2007). Patients with hippocampal amnesia cannot imagine new experiences. Proceedings of the National Academy of Sciences, 104(5), 1726–1731.

LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., & Huang, F. (2006). A tutorial on energy-based learning. Predicting Structured Data, 1.

Mareschal, D., & Shultz, T. R. (1999). Development of children's seriation: A connectionist approach. Connection Science, 11(2), 149–186.

Moreno-Bote, R., Knill, D. C., & Pouget, A. (2011). Bayesian sampling in visual perception. Proceedings of the National Academy of Sciences, 108(30), 12491–12496.

Pecevski, D., Buesing, L., & Maass, W. (2011). Probabilistic inference in general graphical models through sampling in stochastic networks of spiking neurons. PLoS Computational Biology, 7(12), e1002294.

Robert, C., & Casella, G. (2013). Monte Carlo Statistical Methods. Springer Science & Business Media.

Roberts, G. O., & Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1), 255–268.

Roberts, G. O., & Tweedie, R. L. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 341–363.

Sanborn, A. N., & Chater, N. (2016). Bayesian brains without probabilities. Trends in Cognitive Sciences, 20(12), 883–893.

Sanborn, A. N., Griffiths, T. L., & Navarro, D. J. (2010). Rational approximations to rational models: Alternative algorithms for category learning. Psychological Review, 117(4), 1144.

Savin, C., & Deneve, S. (2014). Spatio-temporal representations of uncertainty in spiking neural networks. Advances in Neural Information Processing Systems, 2024–2032.

Shultz, T. R. (1998). A computational analysis of conservation. Developmental Science, 1(1), 103–126.

Shultz, T. R. (2006). Constructive learning in the modeling of psychological development. Processes of Change in Brain and Cognitive Development: Attention and Performance, 21, 61–86.

Shultz, T. R., & Bale, A. C. (2006). Neural networks discover a near-identity relation to distinguish simple syntactic forms. Minds and Machines, 16(2), 107–139.

Shultz, T. R., Mareschal, D., & Schmidt, W. C. (1994). Modeling cognitive development on balance scale phenomena. Machine Learning, 16(1–2), 57–86.

Shultz, T. R., & Rivest, F. (2001). Knowledge-based cascade-correlation: Using knowledge to speed learning. Connection Science, 13(1), 43–72.

Shultz, T. R., & Takane, Y. (2007). Rule following and rule use in the balance-scale task. Cognition, 103(3), 460–472.

Shultz, T. R., & Vogel, A. (2004). A connectionist model of the development of transitivity. In Proceedings of the 26th Annual Conference of the Cognitive Science Society (pp. 1243–1248).

Spiers, H. J., Maguire, E. A., & Burgess, N. (2001). Hippocampal amnesia. Neurocase, 7(5), 357–382.

Westermann, G., Sirois, S., Shultz, T. R., & Mareschal, D. (2006). Modeling developmental cognitive neuroscience. Trends in Cognitive Sciences, 10(5), 227–232.