Article, book and MOOC summaries by Vincent Zoonekynd (2013-10)

Applying separative non-negative matrix factorization to extra-financial data
P. Fogel et al. (2022)

Nonnegative matrix factorization (NMF) is asymmetric: it pays more attention to high values than to low values. Instead, decompose the data as X = X_0 + X^+ - X^-, with X^+ ⩾ 0 and X^- ⩾ 0, for some baseline X_0 (e.g., the column-wise medians), and use a nonnegative tensor factorization for the order-3 tensor [X^+ | X^-]. Python implementation in nmtf; application to ESG scores.

Alternatives to deep neural networks in finance
A.V. Antonov and V.V. Piterbarg (2022)

To approximate functions f : R^k → R, try

    \tilde f(x) = \sum_n \alpha_n \, s(x - z_n).

For the Fourier transform, the z_n's are on a grid and α_n = \hat f(z_n) – but we could use quasi-random points for the z_n's, estimate the α_n's with a linear regression, and fine-tune the positions of the z_n's (stochastic sampling). For image rendering, s is sinc or its Lanczos generalization

    s(x) = \frac{\sin(\pi x)}{\pi x}
    \quad\text{or}\quad
    s(x) = \frac{\sin(\pi x)}{\pi x} \cdot \frac{\sin(\pi x / a)}{\pi x / a} \, 1_{x \in [-a, a]}.

Generalized stochastic sampling uses

    \tilde f(x) = \sum_n \alpha_n \prod_d \phi\bigl( (x_d - z_{nd}) \beta_d \bigr)

for a fixed activation function φ (e.g., sinc), quasi-random z_n's, and
– β_d = λ_d · η, with λ_d^2 = E[(∂_d f(x))^2] and η optimized;
– or β_d optimized;
– or β_{d,n} optimized.
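In its simplest form, the stochastic-sampling idea above is a linear regression on shifted sinc features. A minimal numpy sketch, assuming a toy target function, plain uniform draws for the centers and a fixed domain (all illustrative choices, not from the paper):

    import numpy as np

    # Fit f_tilde(x) = sum_n alpha_n * sinc(x - z_n) by ordinary least squares;
    # np.sinc(x) = sin(pi x) / (pi x), the s used above.
    rng = np.random.default_rng(0)
    f = lambda x: np.exp(-x**2) * np.sin(3 * x)        # toy target function

    z = np.sort(rng.uniform(-3, 3, 50))                # quasi-random centers z_n
    x_fit = np.linspace(-3, 3, 400)                    # sample points

    Phi = np.sinc(x_fit[:, None] - z[None, :])         # design matrix s(x - z_n)
    alpha, *_ = np.linalg.lstsq(Phi, f(x_fit), rcond=None)

    x_new = np.linspace(-3, 3, 1000)
    f_tilde = np.sinc(x_new[:, None] - z[None, :]) @ alpha
    print("max abs error:", np.abs(f_tilde - f(x_new)).max())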
The tensor train decomposition

    A_{i_1, \dots, i_D} = G^1_{i_1 k_1} G^2_{k_1 i_2 k_2} \cdots G^{D-1}_{k_{D-2} i_{D-1} k_{D-1}} G^D_{k_{D-1} i_D}

i.e.,

    A(i_1, \dots, i_D) = g_1(i_1)\, g_2(i_2) \cdots g_{D-1}(i_{D-1})\, g_D(i_D)

(a row vector, then matrices, then a column vector) can be made functional (fTT),

    \tilde f(x) = g_1(x_1)\, g_2(x_2) \cdots g_{D-1}(x_{D-1})\, g_D(x_D)

(row vector, matrices, column vector), where the g_d's are learned (with alternating least squares, from basis functions – note that this depends on the order of the x_d's). Classical approaches include:
– Fourier decomposition;
– Chebychev decomposition (up to dimension 7: 7^7 < 10^6);
– natural neighbour interpolation (in dimension 2 or 3) or other linear interpolations (barycenters).

Deep differentiable reinforcement learning and optimal trading
T. Jaisson (2022)

Differentiable reinforcement learning (RL), for a single asset (n = 1), on simulated data with a 2-scale alpha

    \alpha_t = \alpha_t^{\text{slow}} + \alpha_t^{\text{fast}}
    \alpha_t^{\text{slow}} = \rho^{\text{slow}} \alpha_{t-1}^{\text{slow}} + \text{noise}
    \alpha_t^{\text{fast}} = \rho^{\text{fast}} \alpha_{t-1}^{\text{fast}} + \text{noise}
    \rho^{\text{fast}} = 0, \quad \rho^{\text{slow}} > 0.

Distributionally robust end-to-end portfolio construction
G. Costa and G.N. Iyengar (2022)

In the portfolio optimization

    \text{Minimize}_w \ \text{Risk}(w) - \gamma\, w' \hat y
    \text{such that } w \geq 0, \ w'1 = 1,

use, as risk, the worst-case deviation risk measure

    \text{Risk}(w) = \max_{p \geq 0,\ p'1 = 1,\ D_\phi(p, \text{Unif}) \leq \delta} f_\varepsilon(w, p),
    \qquad
    f_\varepsilon(w, p) = \min_c \sum_j p_j R(w' \varepsilon_j - c),

where D_φ is a φ-divergence, ε are the past prediction errors, and R is an even function. The minimax problem can be reformulated with convex duality, and solved end-to-end (to maximize the Sharpe ratio), for 20 assets, 15 years, weekly returns, 17 predictors (code available).

Supervised portfolios
G. Chevalier et al. (2022)

Instead of forecasting returns (or uniformized returns, residual returns, volatility-adjusted returns, etc.), forecast the weights of the optimal portfolio (using realized returns as alpha). This adjusts for factor exposures, volatility, and investor preferences (utility, constraints). Example with 25 assets and 14 features.

Cryptocurrency bubble detection: a new stock market dataset, financial task and hyperbolic models
R. Sawhney et al.

Identify bubbles in cryptocurrencies (9 exchanges × 50 currencies, 5 years of daily OHLC prices) with the psy model, and train a hyperbolic GRU on both text (tweets with currency tickers, e.g., $DOGE) and prices, to predict the probability that a bubble starts or ends at time t. In the Poincaré ball:

    g_x = \Bigl( \frac{2}{1 - \|x\|^2} \Bigr)^2 g^{\text{Euclidean}}

    x \oplus y = \frac{(1 + 2\langle x, y\rangle + \|y\|^2)\, x + (1 - \|x\|^2)\, y}{1 + 2\langle x, y\rangle + \|x\|^2 \|y\|^2}

    \exp_x(v) = x \oplus \Bigl( \tanh\Bigl( \frac{\|v\|}{1 - \|x\|^2} \Bigr) \frac{v}{\|v\|} \Bigr)

    \log_x(y) = (1 - \|x\|^2)\, \tanh^{-1}\bigl( \|{-x} \oplus y\| \bigr)\, \frac{-x \oplus y}{\|{-x} \oplus y\|}

    W \otimes x = \exp_0\bigl( W \log_0(x) \bigr)
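A direct numpy transcription of the Poincaré-ball formulas above (Möbius addition, exponential and logarithmic maps), as a minimal sketch for single vectors; a production implementation (e.g., geoopt) would add numerical clipping near the boundary of the ball:

    import numpy as np

    def mobius_add(x, y):
        # Moebius addition x (+) y, as in the formula above
        xy, nx2, ny2 = x @ y, x @ x, y @ y
        num = (1 + 2 * xy + ny2) * x + (1 - nx2) * y
        return num / (1 + 2 * xy + nx2 * ny2)

    def exp_map(x, v):
        nv = np.linalg.norm(v)
        return mobius_add(x, np.tanh(nv / (1 - x @ x)) * v / nv)

    def log_map(x, y):
        w = mobius_add(-x, y)
        nw = np.linalg.norm(w)
        return (1 - x @ x) * np.arctanh(nw) * w / nw

    x = np.array([0.1, 0.2])
    v = np.array([0.3, -0.1])
    y = exp_map(x, v)
    print(np.allclose(log_map(x, y), v))   # exp and log are inverse maps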
Decomposing cross-sectional volatility
J. Menchero and A. Morozov (2010)

Given a risk model

    X_i = \sum_k \beta_{ik} F_k + u_i,

the cross-sectional volatility can be decomposed as

    \sigma(X_\bullet) = \sum_k F_k\, \sigma(\beta_{\bullet k})\, \rho(\beta_{\bullet k}, X_\bullet) + \sigma(u_\bullet)\, \rho(u_\bullet, X_\bullet).

AdaptSPEC: adaptive spectral estimation for nonstationary time series
O. Rosen et al. (2012)

Model (oscillating) non-stationary time series by recursively splitting them and estimating their spectrum on each segment, with reversible jump Markov chain Monte Carlo (RJMCMC).

Testing for multiple bubbles: historical episodes of exuberance and collapse in the S&P 500
P.C.B. Phillips et al. (2015)

To detect a (single) price bubble, try the following test statistics:

    \text{PWY:}\quad \text{SADF}(r_0) = \sup_{r_0 \leq r_2 \leq 1} \text{ADF}_{[0, r_2]}
    \text{PSY:}\quad \text{GSADF}(r_0) = \sup_{r_0 \leq r_2 \leq 1,\ 0 \leq r_1 \leq r_2 - r_0} \text{ADF}_{[r_1, r_2]}.

They can be adapted to the detection of multiple bubbles.

Statistical inference of lead-lag at various timescales between asynchronous time series from p-values of transfer entropy
C. Bongiorno and D. Challet (2022)

Asymptotic distribution of the test statistic for

    H_0: \text{TE}(B \to A) = \text{TE}(C \to B)
    H_1: \text{TE}(B \to A) > \text{TE}(C \to B)

where the transfer entropy

    \text{TE}(B \to A) = H(A^+ \mid A) - H(A^+ \mid A, B)

(A^+ denotes the future of A) is a generalization of Granger causality, often approximated by discretizing the data. (If you are patient, you can also compute bootstrap p-values.)
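A minimal sketch of the discretization-based estimate of TE(B → A) = H(A⁺|A) − H(A⁺|A,B) mentioned above; the quantile binning, the single-step lag and the toy data are illustrative assumptions, not taken from the paper:

    import numpy as np

    def entropy(*cols):
        # empirical joint entropy (in nats) of discretised columns
        _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log(p)).sum()

    def transfer_entropy(a, b, bins=4):
        # TE(B->A) = H(A+,A) - H(A) - H(A+,A,B) + H(A,B)
        a_d = np.digitize(a, np.quantile(a, np.linspace(0, 1, bins + 1)[1:-1]))
        b_d = np.digitize(b, np.quantile(b, np.linspace(0, 1, bins + 1)[1:-1]))
        a_next, a_now, b_now = a_d[1:], a_d[:-1], b_d[:-1]
        return (entropy(a_next, a_now) - entropy(a_now)
                - entropy(a_next, a_now, b_now) + entropy(a_now, b_now))

    rng = np.random.default_rng(1)
    b = rng.normal(size=5000)
    a = np.roll(b, 1) + 0.5 * rng.normal(size=5000)   # A lags B => TE(B->A) large
    print(transfer_entropy(a, b), transfer_entropy(b, a))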
Deep multiple instance learning for forecasting stock trends using financial news
Y. Deng and S.M. Yiu

In multiple instance learning, training instances are arranged in bags, and only bags are labeled, not instances. Forecast the sign of the next day's returns from news:
– word embedding, pretrained with GloVe;
– BiLSTM;
– instance-level classifiers;
– aggregation;
– final classifier.

Advances in domain independent linear text segmentation
F.Y.Y. Choi

To segment a text:
– Split it into sentences;
– Remove stop words, stem the words, count the words;
– Compute the matrix of similarities between sentences;
– Replace each value with its rank in a local region;
– Compute s_ij, the sum of the values of sim over ⟦i,j⟧ × ⟦i,j⟧ (start along the diagonal and move outwards);
– Recursively split ⟦1,N⟧, maximizing the density D = (Σ_k s_k) / (Σ_k a_k), where a_k is the area of the k-th segment ⟦i,j⟧ × ⟦i,j⟧;
– To decide when to stop, compute the differences δD_n = D_n − D_{n−1}, smooth them, compute their mean μ and standard deviation σ; stop when δD > μ + 1.2σ.

The RIFT (representation of inter-related time series) model and its applications
A. Sokolov et al. (2022)

Compute a neural representation of financial time series by applying
– a TCN on stock returns;
– a transformer on industry and market returns;
– an industry embedding
in a Siamese network to forecast future correlations. [Instead, forecast reversion or divergence of pairs, or cointegration, or mutual information.]

Self-attention between datapoints: going beyond individual input-output pairs in deep learning
J. Kossen et al. (2021)

Do not use attention only between attributes, but also between datapoints (samples – this assumes you put the whole dataset in each minibatch or, at least, a representative part of it). Try on tabular data.

Random dot product graph models for social networks
S.J. Young and E.R. Scheinerman

The dot product random graph model generates graphs by sampling a vector X_u (from some probability distribution) for each node and adding an edge u−v with probability ⟨X_u, X_v⟩. For directed graphs, use two distributions: p(u−v) = ⟨X_u, Y_v⟩. Other random graph models include:
– the configuration model (prescribed degree distribution);
– preferential attachment (Barabási-Albert);
– the copying model.

Duplication models for biological networks
F. Chung et al. (2003)

The degree distribution of biological graphs follows a power law with exponent β ∈ (1,2), in contrast with non-biological networks, for which β ∈ (2,4). This can be modeled by a duplication process:
– start with a graph G_0;
– pick a node at random;
– duplicate it, keeping each edge with probability p;
– iterate.
Asymptotically, the power law exponent β satisfies p(β−1) = 1 − p^{β−1}; in particular, if p ∈ (1/2, 1), then β < 2.
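A short simulation of the duplication process described above, as a hedged sketch; the seed graph, p and the number of steps are illustrative choices:

    import random
    import networkx as nx

    def duplication_graph(p=0.6, steps=2000, seed=0):
        random.seed(seed)
        G = nx.complete_graph(5)            # seed graph G_0
        for _ in range(steps):
            u = random.choice(list(G.nodes))
            new = G.number_of_nodes()
            G.add_node(new)
            for v in list(G.neighbors(u)):  # copy u's edges with probability p
                if random.random() < p:
                    G.add_edge(new, v)
        return G

    G = duplication_graph()
    degrees = [d for _, d in G.degree()]
    print("nodes:", G.number_of_nodes(), "mean degree:", sum(degrees) / len(degrees))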
Decision transformer: reinforcement learning via sequence modeling
L. Chen et al.

To learn, offline, from suboptimal trajectories, train a GPT model to predict the next token in a sequence of return-to-go, state, action. For instance, one can find shortest paths on a graph by training on random walks.

    t = embed_t(t)  # positional embedding
    s = embed_s(s)
    a = embed_a(a)
    R = embed_R(R)
    return transformer(s + t, a + t, R + t).action

CrossViT: cross-attention multi-scale vision transformer for image classification
C.F. Chen et al.

Process both small and large image patches with a ViT; to combine them, use cross-attention, i.e., attention between the cls token for one patch size and the image patch tokens for the other size – the cls tokens can be seen as "inducing points" (or tokens): the computations only require linear time.

Modernizing PHCpack through phcpy
J. Verschelde (2014)

PHCpack is an (old, file- and menu-based) package to solve systems of polynomial equations (over C^n) f(x) = 0, using a system with known solutions g(x) = 0, by keeping track of the solutions of h_t(x) = 0, where

    h_t(x) = \gamma (1 - t)\, g(x) + t\, f(x), \qquad t \in (0, 1)

and γ ∈ C is random.

Efficient and modular implicit differentiation
M. Blondel et al. (2022)

Implicit differentiation (computing the gradient of the solution of an optimization problem) in JAX (for optimization layers, or bilevel problems: hyperparameter optimization, meta-learning): separately specify the optimality conditions and implement the optimization.

Multivariate backtests and copulas for risk evaluation
B. David and G. Zumbach (2022)

Fit a bivariate Student copula to LM-GARCH innovations (to remove heteroskedasticity), in-sample; then apply the Rosenblatt transform to out-of-sample data

    \text{Student copula} \longrightarrow \text{Uniform copula}
    (u_1, u_2) \longmapsto \bigl( u_1,\ P[U_2 \leq u_2 \mid U_1 = u_1] \bigr)

and test if the result is indeed uniform.

Evaluating robustness of neural networks with mixed integer programming
V. Tjeng et al. (2019)

Neural networks can be verified (their robustness to adversarial examples can be measured) efficiently, up to 100,000 ReLUs, with mixed integer programming (MIPVerify.jl).

Polynomial voting rules
W. Tang and D.D. Yao (2022)

In a proof-of-stake (PoS) system, agents have a voting power proportional to
– their stake;
– the square root of their stake;
– their randomly fluctuating stake: each bidder receives a new stake with probability proportional to some power of its current stake.

On finding the community with maximum persistence probability
A. Avellone et al.

The persistence probability of a subgraph C ⊆ V of a graph G = (V, E) is

    \alpha = \frac{\sum_{i, j \in C} a_{ij}}{\sum_{i \in C,\ j \in V} a_{ij}}.

For small graphs, it can be computed with a mixed integer program (MIP); for larger graphs, heuristics are available.
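The persistence probability above is a one-liner on a weighted adjacency matrix. A hedged numpy sketch, with a toy two-block graph as an illustrative choice:

    import numpy as np

    def persistence_probability(A, C):
        # alpha = (edges staying inside C) / (all edges leaving nodes of C)
        C = np.asarray(C)
        inside = A[np.ix_(C, C)].sum()
        out_of = A[C, :].sum()
        return inside / out_of

    # two dense blocks, weakly connected to each other
    A = np.zeros((6, 6))
    A[:3, :3] = 1; A[3:, 3:] = 1
    A[2, 3] = A[3, 2] = 1
    np.fill_diagonal(A, 0)
    print(persistence_probability(A, [0, 1, 2]))   # close to 1 for a good community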
Three-species Lotka-Volterra model with respect to Caputo and Caputo-Fabrizio fractional operators
M. Khalighi et al. (2021)

The fractional derivative is not unique; popular definitions include (Caputo, Caputo-Fabrizio)

    D^\alpha f(t) = \frac{1}{\Gamma(1 - \alpha)} \int_0^t \frac{f'(\tau)\, d\tau}{(t - \tau)^\alpha}

    D^\alpha f(t) = \frac{1}{1 - \alpha} \int_0^t \exp\Bigl( -\frac{\alpha (t - \tau)}{1 - \alpha} \Bigr) f'(\tau)\, d\tau.

Quantifying the impact of ecological memory on the dynamics of interacting communities
M. Khalighi et al. (2021)

To add memory to the Lotka-Volterra model, replace d/dt with the fractional derivative D^μ, where μ ∈ (0,1) measures the memory:

    D^\mu g(t) = \frac{1}{\Gamma(1 - \mu)} \int_{t_0}^t \frac{g'(\tau)\, d\tau}{(t - \tau)^\mu}.

Compromise-free Bayesian neural networks
K. Javid et al.

The Bayesian evidence (aka marginal likelihood) is the average of the likelihood function over the parameter space, weighted by the prior distribution. It is often a proxy for the out-of-sample performance.

Conic optimization via operator splitting and homogeneous self-dual embedding
B. O'Donoghue et al. (2016)

The pair of primal and dual problems

    Find x, s                     Find y, r
    To minimize c'x               To maximize -b'y
    Such that Ax + s = b          Such that -A'y + r = c
              x ∈ R^n                       r = 0
              s ∈ K                         y ∈ K*

can be converted into a feasibility problem

    \begin{pmatrix} r \\ s \end{pmatrix} =
    \begin{pmatrix} 0 & A' \\ -A & 0 \end{pmatrix}
    \begin{pmatrix} x \\ y \end{pmatrix} +
    \begin{pmatrix} c \\ b \end{pmatrix}

(the rows correspond to the dual and primal constraints) or, after introducing scaling factors τ, κ ⩾ 0 to detect primal or dual infeasibility (homogeneous self-dual embedding),

    \begin{pmatrix} r \\ s \\ \kappa \end{pmatrix} =
    \begin{pmatrix} 0 & A' & c \\ -A & 0 & b \\ -c' & -b' & 0 \end{pmatrix}
    \begin{pmatrix} x \\ y \\ \tau \end{pmatrix}

(the last row is the duality gap). This problem is self-dual:

    Find u, v
    Such that v = Qu
              (u, v) ∈ C × C*.

It can be solved with the ADMM algorithm:

    Find u, v, ũ, ṽ
    To minimize I_{C×C*}(u, v) + I_{Qu=v}(ũ, ṽ)
    Such that (u, v) = (ũ, ṽ).

Randomized Nyström preconditioning
Z. Frangella et al.

To solve (A + μI)x = b, with A positive semi-definite and μ ⩾ 0, consider a randomized Nyström approximation

    \Omega \sim N(0, 1)^{n \times \ell}
    \hat A = (A\Omega)(\Omega' A \Omega)^\dagger (A\Omega)',

compute its eigendecomposition \hat A = U \Lambda U' and use the preconditioner

    P = \frac{U(\Lambda + \mu I)U'}{\lambda_\ell + \mu} + (I - UU'),

i.e., replace (A + μI) with P^{-1}(A + μI) (linear systems in P are easy to solve).
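A hedged numpy sketch of the randomized Nyström preconditioner above, building P explicitly to compare condition numbers; the dimensions, μ and the test matrix are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)
    n, ell, mu = 300, 40, 1e-2

    # a PSD test matrix with fast spectral decay
    Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    A = Q @ np.diag(1.0 / (1 + np.arange(n)) ** 2) @ Q.T

    Omega = rng.normal(size=(n, ell))
    Y = A @ Omega
    A_hat = Y @ np.linalg.pinv(Omega.T @ Y) @ Y.T      # Nystroem approximation

    lam, U = np.linalg.eigh(A_hat)
    lam, U = lam[-ell:], U[:, -ell:]                   # top-ell eigenpairs
    P = U @ np.diag(lam + mu) @ U.T / (lam[0] + mu) + (np.eye(n) - U @ U.T)

    M = A + mu * np.eye(n)
    print("cond(A + mu I)        :", np.linalg.cond(M))
    print("cond(P^-1 (A + mu I)) :", np.linalg.cond(np.linalg.solve(P, M)))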
Binarsity: a penalization for one-hot encoded features in linear supervised learning
M.Z. Alaya et al. (2019)

Discretize (binarize) continuous variables into b bins and fit a linear model with a fused lasso (total variation) penalty, to make the transformation locally constant, and a constraint forcing the weights to sum to zero for each predictor. The total variation proximal operator can be computed efficiently, and applied separately from the constraint.

An agent-based model with realistic financial time series: a method for agent-based models validation
L.G. de Faria

Long list of stylized facts. For log-returns:
– fat tails, which disappear at lower frequencies;
– heavy tails (tail index), even after correction for volatility (GARCH);
– equity premium: E[r] > E[r_rf];
– excess volatility: σ(returns) > σ(fundamentals);
– leverage: Cor(r_past, σ_future) < 0;
– no autocorrelation;
– long memory: Cor(|r_t|, |r_s|);
– power law of returns ("inverse cubic law") and of volatility;
– volatility clustering: Cor(σ_t, σ_{t+1});
– Cor(σ, volume) > 0.
For volume:
– power law;
– long memory.
For inter-trade durations:
– clustering;
– long memory;
– over-dispersion.
For transaction sizes:
– power law.
For spreads:
– Cor(spread, σ) > 0;
– Cor(spread, volume) < 0.

SpeqNets: sparsity-aware permutation-equivariant graph networks
C. Morris et al. (2022)

The k-dimensional Weisfeiler-Lehman algorithm (k-WL) generalizes the WL algorithm (1-WL) by considering all k-tuples of nodes (there are exponentially many). The (k,s)-LWL algorithm considers a subset of all k-tuples, viz. those whose induced graph has at most s components.

Spectre: spectral conditioning helps to overcome the expressivity limits of one-shot graph generators
K. Martinkus et al. (2022)

Generate graphs, with a GAN, by conditioning on the first eigenvalues and eigenvectors of the Laplacian matrix (which encode global properties):
– first, generate eigenvalues;
– then, generate the eigenvectors, conditioned on the eigenvalues, starting with a bank of orthogonal matrices, and multiplying them, on both sides, by orthogonal matrices (exponentials of skew-symmetric matrices, computed by a PointNetST); note that not all eigenvalues and eigenvectors come from valid Laplacians;
– use a PPGN (provably powerful graph network) to refine the approximate Laplacian, L = U diag(Λ) U', and convert it to an adjacency matrix;
– use a discriminator for each step; the last one ensures the adjacency matrix is consistent with the eigenvalues and eigenvectors.

Improving graph neural network expressivity via subgraph isomorphism counting
G. Bouritsas et al.

Graph neural nets (GNN) are blind to structural properties, such as triangles and larger cycles. Add node and edge features counting small subgraphs containing a given node or edge.

Weisfeiler and Leman go sparse: towards scalable higher-order graph embeddings
C. Morris et al. (2020)

Generalize the Weisfeiler-Lehman (WL) graph isomorphism algorithm by considering k-tuples of nodes, two tuples being neighbours if they differ by only one node, and if those nodes are neighbours.

A theoretical comparison of graph neural network extensions
P.A. Papp and R. Wattenhofer (2022)

To increase the expressiveness of GNNs:
– add node features, e.g., the number of cycles of length k (or some other motif) containing that node;
– add, as node features, the isomorphism class of the k-hop neighbourhoods;
– drop k nodes at random, and consider the resulting collection of graphs;
– mark k nodes at random, and consider the resulting collection of graphs.

Local augmentation for graph neural networks
S. Liu et al. (2022)

To augment data, find similar nodes and look at their neighbourhoods.

Provably powerful graph networks
H. Maron et al. (2019)

Adding a matrix multiplication layer to GNNs increases their expressiveness to 3-WL:

    X \in R^{h \times n \times n}
    M_1 = \text{MLP}_1(X) \in R^{h \times n \times n}
    M_2 = \text{MLP}_2(X) \in R^{h \times n \times n}
    M^i = M_1^i M_2^i \in R^{n \times n} \quad \text{(channel-wise matrix product)}
    Y = \text{MLP}_3(X \| M) \in R^{h \times n \times n}

repeat n times, with skip-connections.
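A hedged numpy sketch of the matrix-multiplication block just described; the "MLPs" are reduced to single random channel-mixing matrices (1×1 convolutions over the h channels) followed by a ReLU, and all shapes and sizes are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)
    h, n = 8, 10
    X = rng.normal(size=(h, n, n))                    # input tensor

    def channel_mlp(X, h_out, rng):
        # linear mixing of the channel dimension + ReLU, standing in for an MLP
        W = rng.normal(size=(h_out, X.shape[0])) / np.sqrt(X.shape[0])
        return np.maximum(np.einsum('oh,hij->oij', W, X), 0)

    M1 = channel_mlp(X, h, rng)
    M2 = channel_mlp(X, h, rng)
    M = np.einsum('cij,cjk->cik', M1, M2)             # channel-wise matrix product
    Y = channel_mlp(np.concatenate([X, M], axis=0), h, rng)
    Y = Y + X                                         # skip-connection
    print(Y.shape)                                    # (h, n, n)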
Janossy pooling: learning deep permutation-invariant functions for variable-size inputs
R.L. Murphy et al. (2019)

Arbitrary permutation-invariant functions can be defined as averages (of permutation-sensitive functions f) over all orderings:

    \text{Mean}_{\sigma \in S_n} f(x_\sigma).

They can be approximated using:
– canonical orderings;
– functions f using only their first k arguments, f(x_1, …, x_n) = f(x_1, …, x_k), k ≪ n, i.e., accounting for order-k interactions at most;
– random orderings (and stochastic optimization).
The permutation-sensitive function f could be an LSTM.

Structure-aware transformer for graph representation learning
D. Chen et al. (2022)

Transformers augment the data with a positional embedding. For graphs, node distances are not enough: also add structural information.

Interpretable and generalizable graph learning via stochastic attention mechanism
S. Miao et al. (2022)

Add noise inside the network (in the attention mechanism); this also helps identify task-relevant subgraphs.

Mention memory: incorporating textual knowledge into transformers through entity mention attention
M. de Jong (2022)

Combine a language model and a knowledge base (entity mention embeddings) with an attention mechanism.

G-Mixup: graph data augmentation for graph classification
X. Han et al.

To apply Mixup to graphs, estimate a graphon for each class, and sample from a convex combination of them. To estimate a graphon from a set of graphs, first align the graphs by sorting their nodes by degree, estimate a step-function graphon for each graph, and average them – this assumes the marginal ∫ W(x,y) dx = W(y) is very different from a constant function.

GenLabel: Mixup relabeling using generative models
J.Y. Sohn et al. (2022)

Mixup can suffer from manifold intrusion: mixing two classes may intrude into the manifold of another one. Additionally, the linear labeling is suboptimal for softmax regression. Replacing

    y_{\text{mix}} = \lambda e_1 + (1 - \lambda) e_2

with

    y_{\text{gen}} \propto \sum_i \hat p_i(x_{\text{mix}})\, e_i,

where \hat p = p(x \mid y) is a generative model, addresses both issues.

Logical rule induction and theory learning using neural theorem proving
A. Campero et al.

Use a dense representation of facts as (V, S, O, belief) and rules as

    (V_0 S_0 O_0 \leftarrow V_1 S_1 O_1,\ V_2 S_2 O_2) \quad \text{(conclusion} \leftarrow \text{premises)},

define the belief of a conclusion as

    \langle v, V_0\rangle \langle s, S_0\rangle \langle o, O_0\rangle \,
    \langle v_1, V_1\rangle \langle s_1, S_1\rangle \langle o_1, O_1\rangle \,
    \langle v_2, V_2\rangle \langle s_2, S_2\rangle \langle o_2, O_2\rangle,

and use forward chaining to derive new facts.

Learning hierarchy-aware knowledge graph embeddings for link prediction
Z. Zhang et al. (2020)

Use polar coordinates to find an embedding of entities (head, tail) and relations accounting for hierarchy:

    \text{score} = -\| h_m \odot r_m - t_m \|_2 - \lambda \| \sin(h_p + r_p - t_p) \|_1.

There are many other score functions: TransE, RotatE, ComplEx, etc.

Large-scale representation learning on graphs via bootstrapping
S. Thakoor et al.

For self-supervised training on graph data, consider two augmentations g_1, g_2 of a graph, and train two encoders, e_1 such that e_1(g_1) ≈ e_2(g_2), and set e_2 = EMA(e_1).

Translating embeddings for modeling multi-relational data
A. Bordes et al.

TransE embeds entities as vectors and relations as translations (as in the semantic interpretation of word2vec: king − man + woman).

RotatE: knowledge graph embedding by relational rotation in complex space
Z. Sun et al.
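A hedged sketch of the two score functions just mentioned: TransE scores a triple (h, r, t) by -||h + r - t||, and RotatE (whose entry above is title-only) represents each relation as an element-wise rotation in complex space, scoring by -||h ∘ r - t|| with r of unit modulus. Dimensions and embeddings are random illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 16

    def transe_score(h, r, t):
        return -np.linalg.norm(h + r - t)

    def rotate_score(h, r_phase, t):
        r = np.exp(1j * r_phase)                 # unit-modulus complex rotation
        return -np.linalg.norm(h * r - t)

    h = rng.normal(size=d)
    r = rng.normal(size=d)
    print(transe_score(h, r, h + r))             # a "true" triple scores 0

    hc = rng.normal(size=d) + 1j * rng.normal(size=d)
    phase = rng.uniform(0, 2 * np.pi, size=d)
    print(rotate_score(hc, phase, hc * np.exp(1j * phase)))   # also 0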
3D Infomax improves GNNs for molecular property prediction
H. Stärk et al. (2022)

[Figure: plentiful 2D molecular data and 3D data are both encoded into latent representations; the mutual information between the two representations is maximized, and the 2D representation is used to predict molecular properties.]

Inductive representation learning on temporal graphs
D. Xu et al. (2020)

Dynamic graphs can be seen as graphs with another type of edge, for the time evolution; temporal graph attention (TGAT) layers use the attention mechanism in the time dimension (in the space dimension, use a GCN, GAT, etc.).

Temporal graph networks for deep learning on dynamic graphs
E. Rossi et al.

To process dynamic graphs (in discrete or continuous time, i.e., sequences of graphs, or time-stamped lists of graph events), replace the node (or edge) features with time series (LSTM, attention, etc.).

E(n) equivariant graph neural networks
V.G. Satorras et al. (2021)

Replace the GNN layer

    m_{ij} = \phi_e(h_i^\ell, h_j^\ell, a_{ij})
    m_i = \sum_{j \in N(i)} m_{ij}
    h_i^{\ell+1} = \phi_h(h_i^\ell, m_i)

with

    m_{ij} = \phi_e(h_i^\ell, h_j^\ell, \|x_i^\ell - x_j^\ell\|^2, a_{ij})
    x_i^{\ell+1} = x_i^\ell + \text{Mean}_{j \neq i} (x_i^\ell - x_j^\ell)\, \phi_x(m_{ij})
    m_i = \text{unchanged}
    h_i^{\ell+1} = \text{unchanged}.

Vector neurons: a general framework for SO(3)-equivariant networks
C. Deng et al.

Replace 1-dimensional (scalar) neurons with 3-dimensional ones: linear transformations

    f_W : R^{c \times 3} \to R^{c' \times 3}, \quad V \mapsto WV

are naturally equivariant, f_W(V)R = (WV)R = W(VR) = f_W(VR), but the non-linearities have to be changed, e.g., to V ↦ V', where

    V = (v_1, \dots, v_c) \in R^{c \times 3}
    W_c, U_c \in R^{1 \times C} \quad \text{for each } c \in \llbracket 1, C \rrbracket
    q_c = W_c V, \quad k_c = U_c V \in R^{1 \times 3}
    v'_c = q_c \ \text{if } \langle q_c, k_c \rangle \geq 0,
    \quad
    v'_c = q_c - \Bigl\langle q_c, \frac{k_c}{\|k_c\|} \Bigr\rangle \frac{k_c}{\|k_c\|} \ \text{otherwise.}

(The normalization layers are easy to make equivariant.)

Cycle representation learning for inductive relation prediction
Z. Yan et al. (2022)

Given a graph (V, E), its incidence matrix ∂ ∈ F_2^{|V|×|E|} defines a map ∂ : F_2^{|E|} → F_2^{|V|}; its kernel is the set of cycles, Z. To get a basis of Z, pick a node p and build its shortest path tree T_p: the edges not in T_p define cycles, which form a basis of Z, with β_1 = |E| − |V| + 1 elements. (To have shorter cycles, repeat for several points, e.g., the cluster centers from spectral clustering.) The cycle incidence matrix of this basis is C_{T_p} ∈ F_2^{|E|×β_1}. Use a bidirectional LSTM to compute cycle features. Build a new graph, with cycles as nodes, and edge weights equal to the number of edges the cycles have in common. To test if an edge should be in the graph, add it to the graph, and use a GNN on the cycle graph to compute its "confidence".

On the equivalence between temporal and static equivariant graph representations
J. Gao and B. Ribeiro (2022)

Instead of time-and-graph representations (compute node embeddings for each graph G_t in the sequence, and then model the evolution of those embeddings), try time-then-graph: compute representations of the time series of node features (e.g., with an LSTM), and then embed them.

GNNRank: learning global rankings from pairwise comparisons via directed graph neural networks
Y. He et al. (2022)

Given a set of pairwise comparisons between n elements as the (weighted) adjacency matrix A of a directed graph, serial rank computes the binary comparison matrix C_{ij} = sign(A_{ij} − A_{ji}), then the similarity matrix S = ½(n 11′ + CC′), and finally its Fiedler vector (eigenvector for the second largest eigenvalue – the first one is 1), whose coordinates define the desired order. GNNRank replaces the computation of the similarity matrix A ↦ S with a GNN, and the computation of the Fiedler vector with a few proximal gradient steps for the constrained optimization problem

    Find x
    To minimize x'Sx
    Such that \|x\|_2 = 1
              x'1 = 1.
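A hedged sketch of the classical serial-rank step described above: build the comparison matrix C, the similarity matrix S = (n 11' + C C')/2, and order items by a Fiedler vector. Here the Fiedler vector is taken from the graph Laplacian of S (the usual serialRank convention); the toy comparisons, generated from a known ground-truth order, are an illustrative choice:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 8
    truth = rng.permutation(n)                       # hidden ranking
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and truth[i] > truth[j] and rng.random() < 0.8:
                A[i, j] = 1                          # observed "i beats j"

    C = np.sign(A - A.T)
    S = (n * np.ones((n, n)) + C @ C.T) / 2
    L = np.diag(S.sum(axis=1)) - S                   # graph Laplacian of S
    eigval, eigvec = np.linalg.eigh(L)
    fiedler = eigvec[:, 1]                           # second-smallest eigenvalue
    print("recovered order (possibly reversed):", np.argsort(fiedler))
    print("true order                         :", np.argsort(truth))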
GNNExplainer: generating explanations for graph neural networks
R. Ying et al. (2019)

To explain the output of a GNN, for a given input, look for a small subgraph, and a subset of features, such that altering those features on that subgraph significantly changes the output – the mutual information H(Y) − H(Y|S) measures that.

Parametrized explainer for graph neural network
D. Luo et al.

GNNExplainer gives an explanation (a subgraph) for the output of a single instance – but this has to be done anew for each sample. PGExplainer trains a GNN to output those explanations.

PGMExplainer: probabilistic graphical model explanations for graph neural networks
M.N. Vu and M.T. Thai (2020)

Explain the (node prediction) output of a GNN, locally, with a simple Bayesian network (built on a small subgraph (motif) containing the target node).

G2CN: graph Gaussian convolution networks with concentrated graph filters
M. Li et al.

The action of a GNN on node features can be written as x ↦ g(L)x, where L is the normalized Laplacian. Look at the maximum response, the center, and the bandwidth:

    R = \max_{\lambda \in [0,2]} g(\lambda)
    b = \operatorname{Argmax}_{\lambda \in [0,2]} g(\lambda)
    \text{bw} = \int_0^2 1\bigl( g(\lambda) \geq R / \sqrt{2} \bigr)\, d\lambda.

Try g(λ) = e^{−T(λ−b)²}.

The CLRS algorithmic reasoning benchmark
P. Veličković et al. (2022)

Benchmarks (30 algorithms from CLRS: sorting, searching, dynamic programming, graphs, strings, geometry) to test whether neural networks can learn to reproduce them.

Diffusion-LM improves controllable text generation
X.L. Li et al.

Diffusion models can also generate text with prescribed semantic content (or length, or sentence structure, etc.):

    X_T \to X_{T-1} \to \cdots \to X_0 \to \text{text}

(from noise to word vectors to text).

Molecular representation learning via heterogeneous motif graph neural networks
Z. Yu and H. Gao (2022)

Molecules and motifs (C–C, C–OH, etc.) form a bipartite graph, similar to the document-word bipartite graph in NLP: TF-IDF features can complement graph neural nets.

Topology-aware network pruning using multi-stage graph embedding and reinforcement learning
S. Yu et al. (2022)

Use reinforcement learning (RL) to find a good pruning strategy, with a reward function defined from accuracy and pruning ratio, after encoding the neural network (graph embedding) with a multi-stage GNN (to account for its hierarchical structure).

Photorealistic text-to-image diffusion models with deep language understanding
C. Saharia et al. (2022)

Contrary to Dall-E, Imagen uses a very large language model (T5-XXL), trained on text only, and frozen; it fine-tunes the result with two efficient U-Net steps. It uses dynamic thresholding to avoid fully saturated pixels. Drawbench is a benchmark for text-to-image generation (a list of prompts, challenging for various reasons).

GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models
A. Nichol et al.

Diffusion models are trained by progressively adding noise to an image x_0

    x_0 \sim \text{Data}
    x_t \mid x_{t-1} \sim N\bigl( \sqrt{\alpha_t}\, x_{t-1},\ (1 - \alpha_t) I \bigr)

and trying to undo that process

    x_T \sim N(0, I)
    x_{t-1} \mid x_t \sim N\bigl( \mu_\theta(x_t),\ \Sigma_\theta(x_t) \bigr)

(with Σ_θ diagonal). If x_t = x_0 + ε, the loss tries to recover ε:

    \text{Loss} = E_{t \sim \text{Unif}\llbracket 1, T\rrbracket,\ x_0 \sim \text{Data},\ \varepsilon \sim N(0,1)} \| \varepsilon - \varepsilon_\theta(x_t, t) \|^2.

Guided diffusion increases the likelihood of a given class y

    \hat\mu_\theta(x_t \mid y) = \mu_\theta(x_t \mid y) + s \cdot \Sigma_\theta(x_t \mid y)\, \nabla_{x_t} \log p_\phi(y \mid x_t).

Classifier-free guidance also moves the model away from a "null label" ∅ (e.g., an empty prompt)

    \hat\varepsilon(x_t \mid y) = \varepsilon_\theta(x_t \mid \varnothing) + s \bigl( \varepsilon_\theta(x_t \mid y) - \varepsilon_\theta(x_t \mid \varnothing) \bigr).

CLIP guidance uses a joint representation of images (f) and text (g), encouraging large dot products f(x)·g(c) for matching pairs (contrastive cross-entropy), instead of a classifier:

    \hat\mu_\theta(x_t \mid y) = \mu_\theta(x_t \mid y) + s \cdot \Sigma_\theta(x_t \mid y)\, \nabla_{x_t} f(x_t) \cdot g(c).

Denoising diffusion probabilistic models
J. Ho et al.

Initial paper on (unconditional) diffusion models for image generation:

    \mu_\theta = \frac{1}{\sqrt{\alpha}} \Bigl( x - \frac{\beta}{\sqrt{1 - \bar\alpha}}\, \varepsilon_\theta \Bigr).
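A hedged sketch of the diffusion training objective quoted above: sample a time step, noise the data with the standard closed form x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps (which follows from x_t | x_{t-1} ~ N(sqrt(alpha_t) x_{t-1}, (1 - alpha_t) I)), and penalize the error of a noise predictor. The predictor is a placeholder returning zeros; the schedule and shapes are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)
    T = 1000
    beta = np.linspace(1e-4, 0.02, T)                 # noise schedule
    alpha_bar = np.cumprod(1.0 - beta)

    def eps_theta(x_t, t):
        return np.zeros_like(x_t)                     # placeholder "network"

    def diffusion_loss(x0, batch_t):
        eps = rng.normal(size=(len(batch_t),) + x0.shape)
        x_t = (np.sqrt(alpha_bar[batch_t])[:, None] * x0
               + np.sqrt(1 - alpha_bar[batch_t])[:, None] * eps)
        return np.mean(np.sum((eps - eps_theta(x_t, batch_t)) ** 2, axis=-1))

    x0 = rng.normal(size=8)                           # one "image" as a flat vector
    t = rng.integers(0, T, size=32)                   # uniform time steps (as indices)
    print(diffusion_loss(x0, t))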
Iff isstronglyconvexandstrongly R(x)=(I+(cid:11)@f)−1(x)=Argminf(u)+ ku(cid:0)xk2: 2(cid:11) smppth with parameters m and L, gradient descent u For the normal cone operator N (x)=@I (x), x =x (cid:0)(cid:11)rf(x ) (cid:11)2(0;2L) C C k+1 k k R=(I +(cid:11)N )−1 =Π projection converges to a minimizer of f. If F is strongly mono- C C toneandLipschitzwithparametersmandL,(I(cid:0)(cid:11)F) C =2ΠC (cid:0)I (cid:0)QC overprojection: ArticleandbooksummariesbyVincentZoonekynd 10/898
