JournalofMachineLearningResearch11(2010)1771-1798 Submitted5/09;Revised1/10;Published5/10 Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing RyoYoshida [email protected] DepartmentofStatisticalModeling InstituteofStatisticalMathematics Tachikawa,Tokyo190-8562,Japan MikeWest [email protected] DepartmentofStatisticalScience DukeUniversity Durham,NC27708-0251,USA Editor:MichaelJordan Abstract We describe a class of sparse latent factor models, called graphical factor models (GFMs), and relevant sparse learning algorithms for posterior mode estimation. Linear, Gaussian GFMs have sparse,orthogonalfactorloadingsmatrices,that,inadditiontosparsityoftheimpliedcovariance matrices, also induce conditional independence structures via zeros in the implied precision ma- trices. Wedescribethemodelsandtheiruseforrobustestimationofsparselatentfactorstructure and data/signal reconstruction. We develop computational algorithms for model exploration and posteriormodesearch,addressingthehardcombinatorialoptimizationinvolvedinthesearchover a huge space of potential sparse configurations. A mean-field variational technique coupled with annealingisdevelopedtosuccessivelygenerate“artificial”posteriordistributionsthat,atthelimit- ingtemperatureintheannealingschedule,definerequiredposteriormodesintheGFMparameter space. Several detailed empirical studies and comparisons to related approaches are discussed, includinganalysesofhandwrittendigitimageandcancergeneexpressiondata. Keywords: annealing,graphicalfactormodels,variationalmean-fieldmethod,MAPestimation, sparsefactoranalysis,geneexpressionprofiling 1. Introduction Bayesiansparsemodellinginmultivariateanalysisisofincreasinginterestinapplicationsasdiverse aslifescience,economicsandinformationscience,andisdrivinganeedforeffectivecomputational methodsforlearningmodelstructure,thatis,sparseconfigurations. Paralleldevelopmentsofsparse latent factor models (e.g., West, 2003; Griffiths and Ghahramani, 2006; Lucas et al., 2006; Wang et al., 2007; Archambeau and Bach, 2009; Carvalho et al., 2008; Guan and Dy, 2009; Rai and Daume´,2009)andinherentlysparselystructuredgraphicalmodels(e.g.,Jordan,1999,2004;Dobra et al., 2004; Jones et al., 2005; Carvalho and West, 2007) have explored Bayesian computations using a range of stochastic and deterministic search methods. With a view to scaling to higher di- mensions and identification of regions of interest in model structure space, efficient and effective computation remains a challenge. We describe a previously undeveloped class of sparse graphi- cal factor models (GFMs)—a subclass of linear, Gaussian latent factor models with sparse factor loadings that also induce sparse conditional independencies. In this context, we develop a compu- (cid:13)c2010RyoYoshidaandMikeWest. YOSHIDAANDWEST tational technique for posterior mode evaluation using a hybrid of variational mean-field method (Attias,1999;WainwrightandJordan,2008)andannealing-basedoptimization. Asapreviouslyunexploredclassofsparse(linear,Gaussian)factormodels,theintrinsicgraph- ical structure of the GFM arises from use of an orthogonal factor loadings matrix and appropriate scalingofitscolumns,togetherwiththeusualdiagonalcovariancematrixforlatentfactors(withno lossofgenerality). Weshowthatthisgenerallyinduceszeroelementsintheprecisionmatrixofthe GFM,aswellasthecovariancematrix. Particularly, thezeroentriesinthecovariancematrixhave correspondingzerosintheprecisionmatrix. Wealsoshowthatcovariancematricesoffittedvalues (i.e., “data reconstructions”) from such a model have the same sparse structure, and demonstrate aspects of robustness of the model in inferring variable-latent factor relationships in the presence of outliers. These properties are not shared in general by sparse factor models that lack the graph- ical structure on variables, nor of course by non-sparse approaches. These intrinsic properties of the GFM, along with relationships with earlier studies on sparse factor analyses, are discussed in Section2. Ourvariationalmean-fieldannealingalgorithm(VMA2)addressesthecombinatorialoptimiza- tioninvolvedinaimingtocomputeapproximateposteriormodesforGFMparametersinthecontext ofthehugespaceofzero/non-zeropotentialpatternsinfactorloadings. Usingaprescribedschedule of decreasing temperatures, VMA2 successively generates tempered “artificial” posteriors that, at thelimitingzerotemperature,yieldposteriormodesforbothGFMparametersandthe0/1loadings indicators. Defined via an artificial, dynamic regularization on the posterior entropy of configured sparsestructures,VMA2isdevelopedinSection3. Section 4 provides additional algorithmic details, including prior modelling for evaluating de- greeofsparseness,andastochasticvariantofVMA2forhigher-dimensionalproblemsisdescribed inSection5. PerformanceandcomparisonsonartificialdataappearinSection6. Section7summa- rizesextensive,detailedempiricalcomparisonswithrelatedapproachesinanalysesofhand-written digitimagesandcancergeneexpressiondata. Section8concludeswithbriefadditionalcomments. A range of detailed supplementary materials, extended discussion on the gene expression studies andRcode,isaccessiblefromhttp://daweb.ism.ac.jp/˜yoshidar/anneals/. 2. SparseGraphicalFactorModels We describe the GFM with some intrinsic graphical properties, followed by connections to previ- ouslydevelopedclassesofsparselatentfactoranalyses. 2.1 GFMForm Observed sample vectors x ∈ Rp in p dimensional feature space are each linearly related to in- i dependent, unobserved Gaussian latent factor vectors l ∈Rk with additional Gaussian noise. We i are interested in sparse variable-factor relationships so that the bipartite mapping l →x is sparse, with the underlying p×k matrix of coefficients—the factor loadings matrix—having a number of zero elements; the p×k binary matrix Z defines this configured sparsity pattern. We use a sparse, orthogonal loading matrix and diagonal covariance matrices for both latent factors and residuals; themodelismathematicallyidentifiedintheusualsenseinfactoranalysis(Anderson,2003). 1772 GRAPHICALFACTORMODELSANDVARIATIONALMEAN-FIELDANNEALING WithZ asthe p×k binarymatrixwithelementsz suchthatvariablegisrelatedtofactor j if gj andonlyifz =1,theGFMis gj x =Y 1/2F l +n with l ∼N(l |0,D ) and n ∼N(n |0,Y ) i Z i i i i i i where: (a) the factor loading matrix Y 1/2F has F ≡ F ◦Z with ◦ representing element-wise Z Z product; (b) F is orthogonal, that is, F ′ F =I ; (c) the factors have diagonal covariance matrix Z Z Z k D =diag(d ,...,d ); and (d) the idiosyncratic Gaussian noise (or residual) n is independent of l 1 k i i and has covariance matrix Y =diag(y ,...,y ). The implied covariance matrix of the sampling 1 p model,S ,andthecorrespondingprecisionmatrix,S −1,are S =Y 1/2{I+F FD ′ }Y 1/2 and S −1=Y −1/2{I−F TF ′ }Y −1/2 (1) Z Z Z Z where T =diag(t ,...,t ) with t =d /(1+d ) (j =1:k). In general, sparse loading matrices 1 k j j j induce some zero elements in the covariance matrix whether or not they are orthogonal, but not in the implied precision matrix. In the GFM here, however, a sparse factor model also induces off- diagonalzerosinS −1.Zerosintheprecisionmatrixdefinesaconditionalindependenceorgraphical model, hence the GFM terminology. In (1), the pattern of sparsity (location of zero entries) in the covariance and precision matrices are the same. The set of variables associated with one specific factor forms a clique in the induced graphical model, with sets of variables that have non-zero loadings on any two factors lying in the separating subgraph between the corresponding cliques. Hence, we have a natural and appealing framework in which sparse factor models and graphical modelsarereconciledandconsistent. 2.2 SomeModelAttributes Ingeneral,anon-orthogonalfactormodelwiththesparseloadingmatrixW—asparseextensionof probabilisticPCA(Bishop,1999,2006)—hastheform x =Wl +n withl ∼N(0,I)andn ∼N(0,Y ). i i i i i TheGFMariseswhenasingularvaluedecompositionisappliedtothescaled-factorloadingmatrix Y −1/2W =F D 1/2R with a k×k orthogonal matrix R being removed. This non-orthogonal model Z definesaBayesoptimalreconstructionofthedataviathefittedvalues(orextractedsignal) xˆ(x):=WE[l |x]=WW′(WW′+Y )−1x. i i i i Then,asymptotically, 1 (cid:229) n xˆ(x)xˆ(x)′−→p Cov[xˆ(x)]=WW′(WW′+Y )−1WW′ i i i n i=1 andthisisgenerallyanon-sparsematrix(nozeroentries)eventhoughW issparse. Thisisanincon- sistencyinthesensethatdatareconstructionsshouldbeexpectedtosharethedominantpatternsof covariancesparsityevidentintheoriginalcovariancematrixCov[x]=WW′+Y .IntheGFM,how- i ever, Cov[xˆ(x)]=Y 1/2F GF ′ Y 1/2 where G is diagonal with entries d 2/(1+d ). In such cases, i Z Z j j Cov[xˆ(x)]issparseandsharesthesame0elementsasCov[x]. i i Another feature of the GFM is related to a robust property acquired by the implied graphical structure. Consider an example of 4 variables x′ = (x ,x ,x ,x ) and 2 factors l ′ = (l ,l ) i i1 i2 i3 i4 i i1 i2 1773 YOSHIDAANDWEST Directed graph of sparse factor models Conditional independence graph induced from the GFM x x x x x 1 2 3 4 1 x 3 x 2 x 4 λ λ 1 2 Figure1: GraphicalmodelstructureofanexampleGFM. with two cliques in the conditional independence graph; {x ,x ,x } ← l and {x ,x ,x } ← i1 i2 i3 i1 i2 i3 i4 l (see Figure 1). The graph defines the decomposition of the joint density p(x ,x ,x ,x ) = i2 i1 i2 i3 i4 p(x |x ,x )p(x ,x |x )p(x ) or p(x ,x ,x ,x ) = p(x |x ,x )p(x ,x |x )p(x ). This im- i1 i2 i3 i2 i3 i4 i4 i1 i2 i3 i4 i4 i2 i3 i2 i3 i1 i1 plies that presence of one or more outliers in the isolated feature variable, that is, x or x , asso- i1 i4 ciated with a single factor clique, has no effect on the variables, x or x , once the intermediate i4 i1 variables x and x are given. Then, the parameters involved in p(x ) or p(x ), for instance, the i2 i3 i1 i4 loadingcomponentsandthenoisevariancescorrespondingtotheisolatedvariable,canbeestimated independentlyoftheimpactofoutliersinx orx . ThenumericalexperimentshowninSection7.1 i4 i1 highlightsthisrobustnesspropertyintermsofdatacompression/restorationtasks,withcomparison toothersparsefactormodels. 2.3 Likelihood,PriorsandPosterior Denote by Q the full set of parameters Q ={F ,D ,Y }. Our computations aim to explore model structures Z and corresponding posterior modes of parameters Q under the posterior p(Z,Q |X) usingspecifiedpriorsandbasedonthenobservationsformingthecolumnsofthe p×ndatamatrix X. 2.3.1 LIKELIHOOD FUNCTION Thelikelihoodfunctionis p(X|Z,Q )(cid:181) |Y |−n/2|I−T|n/2etr(−SY −1/2+Y −1/2SY −1/2F TF ′ /2) (2) Z Z where etr(A)=exp(trace(A)) for any square matrix A, and S is the sample sum-of-square matrix S =XX′ with elements s . In (2), the factor loadings appear only in the last term and form the gh importantstatistic k trace(Y −1/2SY −1/2F TF ′ )= (cid:229) t f ′ Y −1/2SY −1/2f Z Z j zj zj j=1 wheref iscolumn jofF ,orf =f ◦z wheref iscolumn jofF andz iscolumn j ofZ. zj Z zj j j j j 1774 GRAPHICALFACTORMODELSANDVARIATIONALMEAN-FIELDANNEALING 2.3.2 PRIORS ONQ ANDZ Priors over non-zero factor loadings may reflect substantive a priori knowledge if available, and willthenbeinherentlycontextspecific. Forexampleshere,however,weuseuniformpriors p(Q |Z) for exposition. Note that, on the critical factor loadings elements F , this involves a uniform on the hypersphere defined by the orthogonality constraint that is then simply conditioned (by setting impliedelementsofF tozero)aswemoveacrosscandidatemodelsZ. Concerning the sparse structure Z, we adopt independent priors on the binary variates z with gj logit(Pr(z = 1|z )) = −z /2 where logit(p) = log(p/(1−p)) and the parameters z are as- gj gj gj gj signed hyperpriors and included in the overall parameter set in later. Beta priors are obvious al- ternatives to this; the logit leads to a minor algorithmic simplification, but otherwise the choice is arbitrary. Using beta priors can be expected to lead to modest differences, if any of practical relevance, in many cases, and users are free to explore variants. The critical point is that includ- ingBayesianinferenceonthese p×ksparsity-determiningquantitiesleadsto“self-organization”as theirposteriordistributionsconcentrateonlargerorsmallervalues. ExamplesinSection6highlight this. 2.4 MAPEstimationfor(Q ,Z)inGFMs Conditional on the p×k matrix of sparsity control hyperparameters z whose elements are the z , gj itfollowsthatposteriormodes(Z,Q )maximize p k p 2logp(Z,Q |X,z ) = 2logp(Q |Z)−(cid:229) (cid:229) z z −(cid:229) (nlogy +s y −1) gj gj g gg g g=1j=1 g=1 k +(cid:229) (nlog(1−t )+t f ′ Y −1/2SY −1/2f ). (3) j j zj zj j=1 The first two terms in (3) arise from the specified priors for Q and Z, respectively. The quadratic forminthelasttermisf ′ Y −1/2SY −1/2f =f ′S(z ,Y )f foreach j,wherethekey p×pmatrices zj zj j j j S(z ,Y )haveelements(S(z ,Y )) givenby j j gh (S(z ,Y )) =z z s (y y )−1/2, forg,h=1: p. (4) j gh gj hj gh g h The(relative)signal-to-noiseratiost =d /(1+d )controltherolesplayedbythelasttermin(3). j j j Optimizing (3) over Q and Z involves many discrete variables and the consequent combina- torial computational challenge. Greedy hill-climbing approaches will get stuck at improper local solutions,oftenandquickly. TheVMA2methodinSection3addressesthis. 2.5 LinkstoPreviousSparseFactorModellingandLearning In the MAP estimation defined by (3), there are evident connections with traditional sparse princi- palcomponentanalyses(sparsePCA;Jolliffeetal.,2003,Zouetal.,2006andd’Aspremontetal., 2007). If Y =I and D =I, the latter likelihood component in (3) is the pooled-variance of pro- jections, that is, (cid:229) k f ′S(z ,I)f , constructed by the k sparse loading vectors. This is the central j=1 j j j statisticoptimizedinmanysparsePCAs. DifferencesamongexistingsparsePCAsariseintheway they regulate degrees of sparseness and whether or not orthogonality is imposed on the loading vectors. 1775 YOSHIDAANDWEST The direct sparse PCA of d’Aspremont et al. (2007) imposes an upper-bound d > 0 on the cardinality of z (the number of non-zero elements), with a resulting semidefinite programming of j computationalcomplexityO(p4 log(p)).Theapplicabilityofthatapproachisthereforelimitedto problemswith prathersmall. Suchcardinalityconstraintscanberegardedassuggestiveofstructure p forthepriordistributiononz inourmodel. The SCoTLASS algorithm of Jolliffe et al. (2003) uses ℓ -regularization on loading vectors, 1 later extended to SPCA using elastic nets by Zou et al. (2006). Recently, Mairal et al. (2009) presented a ℓ -based dictionary learning for sparse coding in which the method aims to explore 1 sparsityonfactor-samplemappingratherthanthatonfactor-variablerelations. SettingLaplace-like prior distributions on scale loadings is a counterpart of ℓ -based penalization (Jolliffe et al., 2003; 1 Zouetal.,2006). However,ourmodel-basedperspectiveaimsforamoreprobabilisticanalysis,with advantages in probabilistic assessment of appropriate dimension of the latent factor space as well as flexibility in the determination of effective degrees of sparseness via the additional parameters z . Other than the preceding studies, ℓ -regularizations have widely been employed to make sparse 1 latent factor analyses. Archambeau and Bach (2009) developed a general class of sparse latent factor analyses involving sparse probabilistic PCA (Guan and Dy, 2009) and a sparse variant of probabilisticcanonicalcorrelationanalysis. AkeyideaofArchambeauandBach(2009)istoplace theautomaticrelevancedetermination(ARD)priorofMackay(1995)oneachloadingcomponent, andtoapplyavariationalmean-fieldlearningmethod. KeyadvancesinBayesiansparsefactoranalysisbuildonnon-parametricBayesianmodellingin Griffiths and Ghahramani (2006) and Rai and Daume´ (2009), and developments in Carvalho et al. (2008)stemmingfromtheoriginalsparseBayesianmodelsinWest(2003). Carvalhoetaldevelop MCMCandstochasticsearchmethodsforposteriorexploration. MCMCposteriorsamplingcanbe effectivebutishugelychallengedasthedimensionsofdataandfactorvariablesincrease. Ourfocus hereisMAPevaluationwithaviewtoscalingtoincreasinglylargedimensions,andweleaveopen theopportunitiesforfutureworkonMCMCmethodsinGFMs. Mostimportantly,asremarkedinSection2.2,theGFMdiffersfromsomeoftheforgoingmod- els in the conditional independence graphical structures induced. This characteristic contributes to preserving sparse structure in the data compression/reconstruction process and also to the outlier robustnessissue. We leavefurther comparative discussionto Section7.1, where we evaluatesome oftheforegoingmethodsrelativetothesparseGFManalysisinanimageprocessingstudy. 3. VariationalMean-FieldAnnealingforMAPSearch FindingMAPestimatesoftheaugmentedposteriordistribution(3)involvesmanydiscretevariables z . Then,commonlyappliedsearchmethodssuchasgreedyhill-climbingalgorithmoftengetstuck gj inimproperlocalsolutions. Here,wepresentageneralframeworkofVMA2enablingustoescape localmodetrapsbyexploitingannealing. 3.1 BasicPrinciple Relativeto(3),considertheclassofextendedobjectivefunctions G (Q ,w )= (cid:229) w (Z)logp(X,Z,Q |z )−T (cid:229) w (Z)logw (Z) (5) T Z∈Z Z∈Z 1776 GRAPHICALFACTORMODELSANDVARIATIONALMEAN-FIELDANNEALING where w (Z)—the sparsity configuration probability—represents any distribution over Z ∈Z that may depend on (X,Q ,z ), and where T ≥0. This modifies the original criterion (3) by taking the expectationof p(X,Z,Q |z )withrespecttow (Z)—theexpectedcompletedatalog-likelihoodinthe contextofEMalgorithm—andbytheinclusionofShannon’sentropyofw (Z)withthetemperature multiplierT. Now,view(5)asacriteriontomaximizeover(Q ,w )jointlyforanygivenT.Thefollowingisa keyresult: Proposition1 ForanygivenparametersQ andtemperatureT,(5)ismaximizedwithrespecttow at w (Z)(cid:181) p(Z|X,Q ,z )1/T. (6) T Proof SeetheAppendix. ForanygivenQ ,alargeT leadstow (Z)beingratherdiffuseoversparseconfigurationsZ sothat T iterative optimization—alternating between Q and w —will tend to move more easily and freely around the high-dimensional space Z. This suggests annealing beginning with the temperature T largeandsuccessivelyreducingtowardszero. Wenotethat: • As T →0, w (Z) converges to a distribution degenerate at the conditional mode Zˆ(Q ,z ) of T p(Z|X,Q ,z ),sothat • joint maximization of G (Q ,w ) would approach the global maximum of the exact posterior T p(Q ,Z|X,z )asT →0. The notion of the annealing operation is to realize a gradual move of successively-generated solu- tions for Q and w (Z), and to escape local mode traps by exploiting annealing. Note that, for any T given tempered posterior (6), the expectation in the first term of (5) is virtually impossible to be taken due to the combinatorial explosion. In what follows, we introduce VMA2 as a mean-field technique coupled with the annealing-based optimization to overcome this central computational difficulty. 3.2 VMA2basedonFactorized,TemperedPosteriors To define and implement a specific algorithm, we constrain the otherwise arbitrary “artificial con- figuration probabilities” w , and do so using a construction that induces analytic tractability. We specifythesimplest,factorizedform p k p k w (Z)= (cid:213) (cid:213) w (z ):= (cid:213) (cid:213) w zgj(1−w )1−zgj gj gj gj g=1j=1 g=1j=1 in the same way as conventional Variational Bayes (VB) procedures do. In this GFM context, the resulting optimization is eased using this independence relaxation as it gives rise to tractability in computingtheconditionalexpectationinthefirsttermof(5). If T = 1, and given the factorized w , the objective function G exactly agrees with the free 1 energy,whichboundstheposteriormarginalas log (cid:229) p(X,Q ,Z|z )≥G (Q ,w ). 1 Z∈Z 1777 YOSHIDAANDWEST Thelower-boundG isthecriterionthattheconventionalVBmethodsaimtomaximize(Wainwright 1 and Jordan, 2008). This indicates that any solutions corresponding to the VB inference can be obtained by stopping the cooling schedule at T =1 in our method. Similar ideas have, of course, been applied in deterministic annealing EM and annealed VB algorithms (e.g., Ueda and Nakano, 1998). Thesemethodsexploitannealingschemestoescapefromlocaltrapsduringcoordinate-basis updatesinaimingtodefinevariationalapproximationsofposteriors. Evenwiththisrelaxation,maximizationoverw (Z)cannotbedoneforallelementsofZsimulta- neouslyandsoisapproachedsequentially—sequencingthrougheachw inturnwhileconditioning gj theothers. ForanygivenT thisyieldstheoptimizingvaluegivenby w (T)(cid:181) exp 1 (cid:229) (cid:213) (cid:213) w (z )logp(z =1|X,Z ,Q ,z ) (7) gj T hl gj C\{g,j} n ZC\{g,j}h6=gl6=j o whereC denotesthecollectionofallindices(g,j)forthe pfeaturesandkfactorvariables,C\{g,j} is the set of the paired indices (h,l) such that (h,l)6=(g,j), and Z stands for the set of z s C\{g,j} hl otherthanz . gj Starting with w ≃1/2 at an initial large value of T, (7) gradually concentrates to the point gj massasT decaystozeroslowly: p(z =1,X,Z ,Q ,z ) 1, if (cid:229) (cid:213) (cid:213) w (z )log gj C\{g,j} >0, zˆgj :=lTi↓m0w gj(T)=( ZC\{g,j}h6=gl6=j hl p(zgj =0,X,ZC\{g,j},Q ,z ) 0, otherwise. Itremainstruethat,atthelimitingzerotemperature,theglobalmaximumofG (Q ,w )istheset T of p×kpointmassesattheglobalposteriormodeof p(Q ,Z|X,z ). Thisisseentriviallyasfollows: (i)AsT →0,andwiththenon-factorizedw in(5),wehavelimitingvalue sup logp(X,Q ,Z|z )=sup G (Q ,w ) (8) 0 Z w with the point mass w (Z) = d (Z) at the location of the global maximum (Zˆ) = zˆ . Further, Zˆ gj gj (ii) any point mass d (Z) is representable by a fully factorized p×k point masses as d (Z) = Zˆ Zˆ (cid:213) d (z ). g,j zˆgj gj It is stressed that the coordinate-basis updates (7) cannot, of course, guarantee convergence to the global optimum even with prescribed annealing. Nevertheless, VMA2 represents a substantial advanceinitsabilitytomovemorefreelyandescapelocalmodetraps. Wealsonotethegenerality oftheidea,beyondfactormodelsandalsopotentiallyusingpenaltyfunctionsotherthanentropy. 4. SparseLearninginGraphicalFactorModels We first provide a specific form of VMA2 for the GFM, and then address the issue of evaluating relevantdegreesofsparseness. 4.1 MAPAlgorithm Computations alternate between conditional maximization steps for w and Q while reducing the temperature T. At each step, the value of the objective function (5) is kept to refine until conver- gencewherethetemperaturereachestozero. Specifically: 1778 GRAPHICALFACTORMODELSANDVARIATIONALMEAN-FIELDANNEALING 1: SetacoolingscheduleT ={T ,...,T }oflengthd whereT =0; 1 d d 2: Setz ; 3: InitializeQ ; 4: Initializew (Z); 5: i←0; 6: while({theloopisnotconverged}∧{i≤d}) 7: i←i+1; 8: Computeconfigurationprobabilitiesw (T); gj i 9: Optimize with respect to each column f (j = 1 : k) of F in turn under full- j conditioning; 10: OptimizewithrespecttoD underfull-conditioning; 11: OptimizewithrespecttoY underfull-conditioning; 12: Optimizewithrespecttoz underfull-conditioning; 13: end while Wenowsummarizekeycomponentsintheiterative,annealedcomputationdefinedabove. 4.2 SparseConfigurationProbabilities First consider maximization with respect to each sparse configuration probability w conditional gj on all others. We note that the first term in (5) involves the expectation over Z with respect to the probabilitiesw ,denotedbyEw [·].Accordingly,forthekeytermsS(z ,Y )wehave j w , ifg=h, Ew [S(z ,Y )]=W ◦(Y −1/2SY −1/2)with(W ) = gj (9) j j j gh w w , otherwise. ( gj hj IntroducethenotationY −1/2SY −1/2=(s (Y ),...,s (Y ))torepresentthe pcolumnsofthescaled- 1 p samplesum-of-squarematrixhere,anddefinethe p−vector w˜ =(w ,...,w ,1,w ,...,w )′. gj 1j g−1,j g+1,j pj Then, the partial derivative of (5) with respect to w conditional on Q and the other configuration gj probabilitiesleadsto logit(w (T))=H (z )/T where H (z ):=t f (f ◦w˜ )′s (Y )−z . gj gj gj gj gj j gj j gj g gj This directly yields the conditional maximizer for w in terms of the tempered negative energy gj H (z )/T. As the temperature T is reduced towards zero, the resulting estimate tends towards 0 gj gj or1accordingtothesignofH (z ). gj gj 4.3 ConditionalOptimizationoverF The terms in (5) that involve F are simply the expectation of the quadratic forms in the last term of (3), with the term for each column f involving the key matrices S(z ,Y ) defined in (4), for j j each j=1:k.AteachstepthroughtheoveralloptimizationalgorithminSection4.1,wesequence through these columns of the loadings matrix in turn conditioning on the previously optimized 1779 YOSHIDAANDWEST valuesofallothercolumns. InthecontextoftheoveralliterativeMAPalgorithm,thisyieldsglobal optimizationoverF asT →0. Conditional optimization then reduces to the following: for each j =1:k, sequence through eachcolumnf inturnandateachstep j maxfimize f ′jEw [S(zj,Y )]f j j subjectto f ′f =1 and f ′ f =0form6= j,m=1:k. (10) j j m j Theoptimizationconditionsonthemostrecentlyupdatedvaluesofallothercolumnsm6= jateach step,andisperformedasonesweepastheline9inthealgorithmofSection4.1. Columnordercan be chosen randomly or systematically each time while still maintaining convergence. In this step, westressthattheoriginalorthogonalityconditionismodifiedtoF ′ F =I →F TF =I in(10). It Z Z remains the case that iteratively refined estimates obtained from (10) satisfy the original condition atthelimitingzerotemperature, yieldingsparsityforEw [S(zj,Y )],asdetailedinthemathematical derivationsinsupplementarymaterial. Thespecificcomputationsrequiredfortheconditionaloptimizationin(10)areasfollows(with supporting details in the Appendix). Note that the central matrices Ew [S(zj,Y )] required here are triviallyavailablefromEquation(9). 1: Compute the p×(k−1) matrix F ={f } by simply deleting column j (−j) m m6=j fromF ; 2: Computethe p×pprojectionmatrixN =I −F F ′ ; j p (−j) (−j) 3: Compute the eigenvector j corresponding to the most dominant eigenvalue of j NjEw [S(zj,Y )]Nj; 4: Computetherequiredoptimalvectorf =N j /kN j k. j j j j j Thisproceduresolves(10)byoptimizingoveraneigenvectoralreadyconstrainedbytheorthogonal- ity conditions. Here N spans the null space of the current k−1 columns of F , so j (−j) NjEw [S(zj,Y )]Nj defines the projection of Ew [S(zj,Y )] onto the orthogonal space and eigenvec- torsj lieinthenullspace. Itremainstoensurethatthecomputedvaluef isofunitlength,which j j involves the normalization in the final step in part 4. Selecting the eigenvector with maximum eigenvalueensurestheconditionalmaximizationin(10). 4.4 ConditionalOptimizationoverD The variances d of the latent factors appear in Equations (3) and (5) in the sum over j =1:k of j terms −nlog(1+d j)+d j(1+d j)−1f ′jEw [S(zj,Y )]f j. Thisisunimodalind withmaximizingvalue j dˆj =max{0,n−1f ′jEw [S(zj,Y )]f j−1}, (11) and so the update at the line 10 of the MAP algorithm of Section 4.1 computes these values in parallelforeachfactor j=1:k.Notethatthismaygeneratezerovalues,indicatingtheremovalof thecorrespondingfactorsfromthemodel, andsoinducinganintrinsicabilitytoprunethenumber 1780
Description: