
Closure, connectivity and degrees PDF

33 Pages·2007·0.47 MB·English


Closure, connectivity and degrees: New specifications for exponential random graph (p*) models for directed social networks

Garry Robins, Pip Pattison, Peng Wang
Department of Psychology, University of Melbourne
17 August 2006

Note: This research was assisted by grants from the Australian Research Council. An early version of this paper was presented at the International Social Networks Sunbelt Conference in Vancouver in April 2006. We thank Antonietta Pane and Yu Zhao for permission to use their data in this paper.

Abstract

The new higher order specifications for exponential random graph models introduced by Snijders, Pattison, Robins & Handcock (2006) exhibit dramatic improvements in model fit compared with the commonly used Markov random graphs. Snijders et al briefly presented versions of these new specifications for directed graphs, in particular a directed alternating k-triangle parameter, based on closure of multiple two-paths. In this paper, we present a number of additional higher order parameters for directed graphs. Most importantly, we propose three new triadic-based parameters to represent different versions of triadic closure: cyclic effects; transitivity based on shared choices of partners; and transitivity based on shared popularity. We also introduce corresponding parameters for multiple connectivity effects. We propose some fifty graph features to be investigated in goodness of fit diagnostics for these new parameters. As empirical illustrations, we develop models for two sets of organizational network data, to show that the new parameters help with an optimal representation of the data. The first example is a trust network within a training group, and the second a “work difficulty” network within a government instrumentality. In the first example we show that our additional parameters are necessary to obtain an acceptable model for the data.
The second example is novel in fitting a statistical model, and inferring structural processes, for a negative tie network. Using this second example, we show how the incorporation of additional effects – the number of sources and sinks in the network, and the correlation between the in- and out-degree distributions – can improve representation of the degree distribution. The final model acceptably replicates the negative tie network in terms of: statistics related to twenty different graph configurations; the in- and out-degree distribution, including their correlation; seven different graph clustering coefficients; the triad census; and the geodesic distribution. Model interpretation emphasizes the importance of some nodes receiving high numbers of negative ties.

Exponential random graph models are the most effective statistical approach for modeling a single network observation. This class of models was introduced by Frank and Strauss (1986) with their Markov random graph models, which were elaborated and popularized as p* models in the 1990s (Pattison & Wasserman, 1999; Robins, Pattison & Wasserman, 1999; Wasserman & Pattison, 1996 – see Wasserman & Robins, 2005, for a review). Markov random graphs are based on the dependence assumption that two possible network edges are conditionally independent unless they share a node. This seemingly simple assumption in fact results in a highly complex parameter space, one that is problematic in modeling most observed social networks (see Handcock, 2002; Snijders, 2002; Snijders, Pattison, Robins & Handcock, 2006; Robins, Snijders, Wang, Handcock & Pattison, 2006). Using a more complex partial dependence assumption, Snijders et al (2006) introduced new specifications for exponential random graph models, intended to circumvent some of the problems of Markov random graphs (see also Hunter & Handcock, 2006).
In particular, these new specifications include higher order parameterization of triangulation effects, as well as effects for degree-based processes and multiple connectivity. Inclusion of these higher order effects has resulted in much improved model performance, in terms both of obtaining convergent parameter estimates and of improving goodness of fit of the models (Goodreau, 2006; Robins et al, 2006). Based on his experience of using the new models with large network data sets, Goodreau (2006) concluded that the new specifications represented a major advance in the field of statistical network analysis.

To date, work on the new models (Goodreau, 2006; Hunter, 2006; Hunter & Handcock, 2006; Robins et al, 2006; Snijders et al, 2006) has concentrated on non-directed graphs. For directed networks, Snijders et al (2006) proposed specifications that were counterparts of the parameters in the non-directed models. This paper reviews, but also generalizes, these specifications for directed graphs. Most importantly, we present three new parameters related to triadic network closure (in addition to the one parameter originally proposed by Snijders et al, 2006). We show that models with the additional closure parameters can improve goodness of fit and that model interpretation is subtly different for the different parameters. We also present counterpart parameters to represent multiple connectivity. Using these new parameters, we show by empirical example how building a model with the addition of various effects may improve representation not only of the patterns of closure and connectivity in the data, but also of the degree distribution and other features of the graph.

The article is structured as follows. We begin by presenting the general form of the exponential random graph model and review the more familiar non-directed versions of the new specifications. We review the general diagnostic method to examine goodness of fit.
After presenting the directed specifications of Snijders et al (2006), we introduce the three new additional network closure parameters, and counterparts representing multiple connectivity, and discuss interpretation. We propose a range of directed graph features for diagnostic examination of goodness of fit. We present two empirical examples. In the first, a network of positive trust ties, we show that the Snijders et al (2006) specifications for directed graphs do not lead to a stable model, but incorporation of the new parameters successfully overcomes this problem. Our other empirical illustration is the fitting of a statistical model to a negative tie network, in this case a “work difficulty” network. We show that the new parameters are necessary to represent closure in this network, and that the incorporation of further effects into the model successfully replicates the degree distribution as well. We conclude with general comments about fitting these models to directed networks and discuss further work.

Exponential random graph models

We use standard notation and terminology (Robins et al, 2006). For each pair i and j of a set N of n actors, $Y_{ij}$ is a network tie variable with $Y_{ij} = 1$ if there is a network tie from i to j, and $Y_{ij} = 0$ otherwise. The observed value of $Y_{ij}$ is $y_{ij}$, with $Y$ the matrix of all variables and $y$ the matrix of observed ties, the network. $Y$ may be directed or non-directed. A configuration is a small possible subgraph for which there is a parameter in the model. In broad terms, an exponential random graph model implies that the network is built up from combinations of these small configurations. The statistical basis of the model permits inferences about which configurations are important, allowing for the other effects in the model.
Configurations may be interpreted as outcomes of structural processes in the network, so that the model assists judgments about those structural processes that are sufficient to explain how the network came to be. The dependence assumption delimits the possible configurations in the model. For instance, the Markov dependence assumption (reviewed below) implies that the only configurations in the model relate to edges, stars of various sizes (footnote 1), and triangles (Frank & Strauss, 1986). The general form of the class of (homogeneous) exponential random graph models is then as follows:

$$\Pr(Y = y) = \frac{1}{\kappa} \exp\Big\{ \sum_{A} \eta_A z_A(y) \Big\} \qquad (1)$$

where:
(i) the summation is over all configuration types A; different sets of configuration types represent different models;
(ii) $\eta_A$ is the parameter corresponding to configuration of type A;
(iii) $z_A(y)$ is the network statistic corresponding to configuration A;
(iv) $\kappa$ is a normalizing quantity to ensure that (1) is a proper probability distribution.

Footnote 1: A k-star refers to k edges incident on the one node. For directed networks, Markov graph models typically include parameters for 2-in-stars (two arcs directed to the one node), 2-out-stars (two arcs emanating from the one node) and mixed 2-stars or 2-paths (an arc directed to a node from which another arc emanates), and possibly as well higher order versions, such as 3-in-stars and 3-out-stars.

As Robins et al (2006) noted, the model represents a probability distribution of graphs on a fixed node set, where the probability of observing a graph is dependent on the presence of the various configurations expressed by the model. One can interpret the structure of a typical graph in this distribution as the result of a cumulation of these particular configurations. With suitable constraints on the number of parameters, it is possible to estimate parameters for a given observed network. The parameters then provide information about the presence of structural effects observed in the network data.
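To make the general form in (1) concrete, the following minimal Python sketch enumerates every non-directed graph on a very small node set and normalizes the exponential weights. It is an illustration only, not the estimation method used in the paper: the choice of configurations (edges and 2-stars), the function names, and the parameter values are all assumptions made for this example, and brute-force enumeration is feasible only for a handful of nodes.

```python
from itertools import combinations, product
from math import exp

def graph_statistics(edges, n):
    """Return (edge count, 2-star count) for a non-directed graph on n nodes."""
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    # A node of degree d is the centre of d*(d-1)/2 two-stars.
    two_stars = sum(d * (d - 1) // 2 for d in degree)
    return len(edges), two_stars

def ergm_distribution(n, eta_edge, eta_two_star):
    """Enumerate all graphs on n nodes; return {edge tuple: probability}."""
    dyads = list(combinations(range(n), 2))
    weights = {}
    for present in product([0, 1], repeat=len(dyads)):
        edges = tuple(d for d, y in zip(dyads, present) if y)
        z_edge, z_star = graph_statistics(edges, n)
        # exp{ sum_A eta_A * z_A(y) } for the two configuration types used here
        weights[edges] = exp(eta_edge * z_edge + eta_two_star * z_star)
    kappa = sum(weights.values())  # normalizing quantity in (1)
    return {g: w / kappa for g, w in weights.items()}

# Hypothetical parameter values, chosen only for illustration.
dist = ergm_distribution(4, eta_edge=-1.0, eta_two_star=0.2)
print(len(dist), sum(dist.values()))
```

Because the probability depends on a graph only through its statistics, graphs with the same edge and 2-star counts receive the same probability, which is the homogeneity property referred to above.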
A large positive parameter indicates that more of that configuration is observed in the network than expected by chance, given the presence of other effects in the data.

The new specifications for non-directed networks

The Markov dependence assumption (Frank & Strauss, 1986) states that possible edges in a graph are independent (conditional on the rest of the observed graph) unless they share a node. This assumption leads to the class of Markov random graph models, which have been the most commonly used form of exponential random graph models until very recently. Markov random graph models, however, tend to produce degenerate or non-convergent parameter estimates when triangulation is high or there are high degree nodes in the degree distribution (“hubs”). Model degeneracy here occurs when parameter values imply that only one or two graphs have substantial non-zero probabilities (Handcock, 2002; Snijders, 2002; Snijders et al, 2006; Robins et al, 2006). As these are often rather empty or full graphs, typically not the observed network, such parameter estimates cannot reproduce the network of interest and so the model is quite inadequate.

In simple conceptual terms, the problem with Markov random graph models is a supposition that degrees are rather evenly spread across nodes, and that triangles are evenly dispersed throughout the network. Whenever there are outliers in the degree distribution, and when triangles tend to “clump” into denser regions of the network (quite common in much observed data), Markov graph models are unlikely to produce parameter estimates that can cope with the data. These problems are greater for networks with more nodes.

In a very promising attempt to overcome these problems, Snijders et al (2006) proposed three new parameters for non-directed networks: alternating k-stars, alternating k-triangles, and alternating 2-paths.
It would be far from correct to claim universal success, but there is now overwhelming experience that these new specifications do a dramatically better job than Markov random graph models in producing stable, coherent parameter estimates and in representing much network data (Goodreau, 2006; Snijders et al, 2006; Robins et al, 2006). We now briefly review this new parameterization for non-directed graphs. Readers who want further details may refer to Snijders et al (2006).

The alternating k-star parameter is intended to assist with modeling the degree distribution. It is based on the Markov star parameters but imposes a constraint between each star parameter of size k and size k+1, such that, for all k ≥ 2, $\sigma_{k+1} = -\sigma_k / \lambda$ for some λ greater than 1, where $\sigma_k$ is the Markov parameter for a k-star (footnote 2). In this article, we typically set λ = 2, for which a number of authors have had good experience (Goodreau, 2006; Robins et al, 2006; Snijders et al, 2006). It is also possible to estimate an optimal value of λ (Hunter, 2006; Hunter & Handcock, 2006). With this constraint applied across all k-stars, and writing $S_k$ for the number of k-stars in the graph, the resulting statistic is:

$$s = \sum_{k=2}^{n-1} (-1)^k \frac{S_k}{\lambda^{k-2}} \qquad (1)$$

and the associated alternating k-star parameter takes into account all star effects simultaneously.

Footnote 2: It is worth briefly noting that alternating k-star parameters are consistent with the Markov dependence assumption, whereas the other new parameters introduced by Snijders et al (2006) and later in this paper require a more complex dependence assumption, using the partial dependence approach proposed by Pattison and Robins (2002).

Snijders et al (2006) and Robins et al (2006) provided interpretations of this parameter: a positive estimate is evidence that the network contains a skewed degree distribution with some higher degree nodes, whereas a negative parameter suggests that high degree nodes are improbable, with a smaller variance between the degrees. Moreover, a positive parameter suggests a preference for connections between a larger number of low degree nodes and a smaller number of higher degree nodes, akin to a core-periphery structure, but with a core of limited size (determined by the value of λ). These interpretations are conditional on other effects of the model (e.g. over and above transitivity effects if some form of triangulation is included among the parameters).

This single parameter is more effective and efficient than the traditional Markov star parameters. For reasons of practicality and parsimony, typically only a few star parameters are included in standard Markov random graph models (maybe only 2- and 3-star effects). When high degree nodes are present, a small number of lower-order star parameters have difficulty capturing the degree distribution, so that star parameter estimates tend to be large and positive in order to express the presence of hubs (which, for instance, contain many 2- or 3-stars). But unless there are countervailing effects in the model, large positive lower order star parameters imply complete graphs (which indeed maximize the number of stars). The alternating k-star parameter limits these problems by weighting the effect of higher order stars (by a multiple of the inverse of λ), with alternating signs that produce appropriate countervailing effects.

Snijders et al (2006) described an equivalent version, the geometrically weighted degree distribution parameter, which is further elaborated by Hunter (2006) and Hunter and Handcock (2006). For this article we shall keep to the alternating k-star version.

It is important to note that when λ = 1, the parameter in effect models the number of isolates in the network (although, because of the way it is calculated, it does in fact represent the proportion of non-isolates).
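The alternating k-star statistic can be computed directly from a degree sequence, since a node of degree d is the centre of C(d, k) k-stars. The sketch below is a minimal illustration under that assumption; the function names and the example degree sequences are hypothetical and do not come from the paper. As a side check (our own calculation from the binomial theorem, not a result stated in the text), with λ = 1 the statistic reduces to twice the number of edges minus the number of non-isolated nodes, which is one way to see the link to isolates described above.

```python
from math import comb

def star_counts(degrees):
    """S_k for k = 2..n-1: the number of k-stars, summing C(d_i, k) over nodes."""
    n = len(degrees)
    return {k: sum(comb(d, k) for d in degrees) for k in range(2, n)}

def alternating_k_star(degrees, lam=2.0):
    """Alternating k-star statistic: sum over k of (-1)^k * S_k / lam^(k-2)."""
    return sum((-1) ** k * s_k / lam ** (k - 2)
               for k, s_k in star_counts(degrees).items())

# Hypothetical example: one hub tied to four other nodes.
hub = [4, 1, 1, 1, 1]
print(star_counts(hub))
print(alternating_k_star(hub, lam=2.0))

# Hypothetical example with an isolate: a path on three nodes plus one isolated node.
with_isolate = [2, 1, 1, 0]
edges = sum(with_isolate) // 2
non_isolates = sum(1 for d in with_isolate if d > 0)
print(alternating_k_star(with_isolate, lam=1.0), 2 * edges - non_isolates)
```

The final line compares the statistic with 2 × edges − non-isolates for the special case λ = 1.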
When λ = 1, a negative parameter value implies that for two graphs in the distribution with the same number of edges, the graph with a higher number of isolates will be more probable (conditional on the other effects in the model). In substantive terms, a large negative parameter estimate with λ = 1 suggests a network where there is a specific tendency for at least some people to be “left out” and isolated from the others. Snijders et al (2006) suggested it may be useful to include in the one model two alternating k-star parameters, one with λ = 1 and one with λ = 2 (or some other value greater than 1). We utilize this insight below in one of our examples.

The alternating k-triangle parameter is intended to assist with modeling transitivity and clustering in the network. Snijders et al (2006) introduced the concept of a k-triangle configuration, which occurs when two nodes i and j, connected by an edge (the base of the k-triangle), are also connected to k other nodes, described as shared partners. The sides of the k-triangle are the various edges from i or j to the shared partner nodes. A k-triangle is represented in Figure 1.

Figure 1 about here.

Let the number of k-triangles in the graph be $T_k$. Then the alternating k-triangle parameter has the associated statistic:

$$t = \sum_{k=2}^{n-1} (-1)^k \frac{T_k}{\lambda^{k-2}} \qquad (2)$$

Snijders et al (2006) provided interpretations of this parameter. A positive estimate is evidence not only for triangulation in the network but also for a tendency for triangles to occur together in “clumps” (again a core-periphery structure might be one outcome). There is an equivalent form of the parameter, the edgewise shared partner parameter (Hunter, 2006).
This equivalent form emphasizes that, just as the alternating k-star parameter controls for features of the degree distribution, the alternating k-triangle parameter controls for features of the shared partner distribution among connected nodes, that is, the edgewise shared partner distribution (see Hunter, 2006, and Hunter & Handcock, 2006, for further details). The strength of the closure of the triangle (i.e. the formation of the base) increases with the number of shared partners, but due to the form of (2) the increase is not linear and, beyond a certain number (depending on λ), further shared partners do not much affect the chances of closure. It is worth noting that a positive alternating k-triangle parameter and a negative alternating k-star parameter together suggest a segmented network of multiple (but small) dense regions connected by low density paths (Robins et al, 2006).

The alternating k-two-path parameter (or alternating 2-path parameter) is intended to assist with modeling multiple connectivity in the network. A k-two-path is a precursor to the transitivity construct represented by a k-triangle: the configuration is identical to a k-triangle except that there is no requirement for an edge at the base of the k-triangle. A k-two-path is represented in Figure 2. It represents k independent two-paths between two nodes (whether connected or not).

Figure 2 about here.

Let the number of k-two-paths in the graph be $U_k$. Then the alternating 2-path parameter has the associated statistic:

$$u = \sum_{k=2}^{n-1} (-1)^k \frac{U_k}{\lambda^{k-2}} \qquad (3)$$

When this parameter is in the model, it assists interpretation of the alternating k-triangle parameter. If the k-triangle parameter is positive in the presence of the k-two-path parameter, then there is evidence that transitivity in this network tends to occur because of the completion of the bases of k-triangles, rather than completion of the sides. In other words, multiple connectivity in the form of independent 2-paths between nodes tends to lead to closure.
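Statistics (2) and (3) can be sketched from the number of shared partners of each pair of nodes. The minimal Python illustration below assumes that a k-two-path is identified by an end pair (i, j) together with k of their shared partners, and that it becomes a k-triangle when the base edge between i and j is present. Under this convention a four-cycle is counted once for each of its two diagonal pairs; the source papers may normalize the k = 2 case differently. The function names and the example graph are hypothetical.

```python
from itertools import combinations
from math import comb

def shared_partners(adj, i, j):
    """Number of nodes h (other than i and j) tied to both i and j."""
    return sum(1 for h in range(len(adj))
               if h != i and h != j and adj[i][h] and adj[j][h])

def alternating_statistics(adj, lam=2.0):
    """Return (t, u): alternating k-triangle and k-two-path statistics, k = 2..n-1."""
    n = len(adj)
    t = u = 0.0
    for i, j in combinations(range(n), 2):
        p = shared_partners(adj, i, j)
        for k in range(2, n):
            term = (-1) ** k * comb(p, k) / lam ** (k - 2)
            u += term              # k-two-paths between i and j
            if adj[i][j]:
                t += term          # base edge present, so these are k-triangles
    return t, u

# Hypothetical example: nodes 0 and 1 are tied and share partners 2 and 3.
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
print(alternating_statistics(adj, lam=2.0))
```

In this example the pair (0, 1) contributes to both statistics, whereas the unconnected pair (2, 3), which also has two shared partners, contributes only to the alternating 2-path statistic.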
The alternating 2-path parameter also has an equivalent form, the dyadwise shared partner parameter (Hunter, 2006).

Estimation and Goodness of fit diagnostics for nondirected graphs

As discussed by various authors (e.g., Hunter & Handcock, 2006; Snijders, 2002; Wasserman & Robins, 2005), reliable parameter estimation for these models requires Markov chain Monte Carlo maximum likelihood techniques. Robins et al (2006) review three programs (Siena, statnet and pnet) that are currently available for estimation.

If non-degenerate parameter estimates are successfully obtained, it is possible to simulate from the model to produce the distribution of graphs implied by the estimates. This distribution of graphs will be consistent with the observed graph in the following sense: for each effect in the model, the mean statistic in the distribution will be close to the statistic in the observed graph. But it is also helpful to examine features from the simulated graphs that may not be directly modeled, for instance, the full degree distribution or geodesic distribution. If for a certain graph feature, the observed graph is consistent with the simulated graphs in the sense described above, then we can say that the model reproduces that feature of the data well. For instance, if graphs in the distribution typically do not have a large clustering coefficient, whereas the observed data do, we infer that the model does not do a good job of replicating clustering. For a more precise examination, we might collect the clustering coefficients from graphs in a sample from the simulated graph distribution. If the observed clustering coefficient is substantially far from the mean clustering coefficient of the simulated graphs (which can be measured by a simple t statistic), we know that our model does not do a good job with clustering. This approach constitutes a diagnostic goodness of fit examination.
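The comparison of an observed feature with its simulated distribution can be sketched as follows. The text does not spell out the formula for the "simple t statistic"; this sketch assumes it is the observed value minus the mean of the simulated values, divided by their sample standard deviation. The function name and the simulated clustering coefficients are hypothetical values invented for illustration, not results from the paper.

```python
from statistics import mean, stdev

def gof_t_statistic(observed, simulated):
    """(observed - mean of simulated values) / sample standard deviation (assumed form)."""
    return (observed - mean(simulated)) / stdev(simulated)

# Hypothetical clustering coefficients from graphs simulated under a fitted model.
simulated_clustering = [0.10, 0.12, 0.08, 0.11, 0.09]
observed_clustering = 0.30

t = gof_t_statistic(observed_clustering, simulated_clustering)
print(round(t, 2))
```

A value far from zero, as in this made-up example, would indicate that the simulated graphs do not reproduce the observed level of clustering.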
We are able to assess whether a particular feature in the observed data is implied by the model. We can see which features the model describes well and which features it fails to reproduce. In a bid to achieve an optimal model, we may then add further effects to the model to improve fit on those features that are poorly replicated. Goodreau (2006) provides a compelling empirical illustration of how this approach to goodness of fit enables judgments between different models.

The question arises as to which graph features to include in such a goodness of fit examination. In assessing models with the new specifications for non-directed graphs, in addition to model statistics, Goodreau (2006) also examined graphically the degree distribution, the shared partner distribution and the geodesic distribution. Robins et al (2006) used t-statistics for graph counts of all Markov configurations, the skew and standard deviation of the degree distribution, and the local and global clustering coefficients (footnote 3), as well as examining quartiles of the geodesic distribution, with poor fit typically indicated by a large t-statistic.

The new specifications for directed networks

For directed networks, Snijders et al (2006) proposed four new parameters that were counterparts of the non-directed case. The alternating k-instar and alternating k-outstar parameters are identical to the non-directed version, except that they relate specifically to the Markov configurations of in- and out-stars, respectively. These two parameters control for features of the in- and out-degree distribution respectively, and their interpretations are analogous to the non-directed case, except that, as is standard for Markov directed star parameters, the indegree parameter relates to popularity effects and the outdegree parameter to expansiveness (or activity) effects.
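Since the alternating k-instar and k-outstar statistics are described as identical to the non-directed version apart from using in- and out-stars, they can be sketched by applying the alternating k-star formula to the in-degree and out-degree sequences separately. The code below is a minimal illustration under that reading; the function names, the adjacency-matrix convention (adj[i][j] = 1 for an arc from i to j), and the example network are assumptions made for this example.

```python
from math import comb

def alternating_star(degrees, lam=2.0):
    """Sum over k = 2..n-1 of (-1)^k * S_k / lam^(k-2), with S_k = sum_i C(d_i, k)."""
    n = len(degrees)
    return sum((-1) ** k * sum(comb(d, k) for d in degrees) / lam ** (k - 2)
               for k in range(2, n))

def directed_alternating_stars(adj, lam=2.0):
    """Return (alternating k-instar, alternating k-outstar) for a directed graph."""
    n = len(adj)
    out_degrees = [sum(adj[i]) for i in range(n)]
    in_degrees = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return alternating_star(in_degrees, lam), alternating_star(out_degrees, lam)

# Hypothetical example: node 0 is chosen by nodes 1, 2 and 3 and chooses nobody.
adj = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
print(directed_alternating_stars(adj, lam=2.0))
```

In this example all in-stars are centred on one popular node, so only the in-star statistic is non-zero; no node sends more than one arc, so there are no out-stars of size two or more.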
Snijders et al (2006) introduced directed versions analogous to the alternating k-triangle (AKT) and alternating 2-path (A2P) parameters, by defining the sides of a directed k-triangle to be a directed two-path (not a semi-path), and then producing statistics analogously to equations (2) and (3). The configurations are depicted in Figure 3. We label these new parameters AKT-T and A2P-T, respectively, where the addition of the letter ‘T’ to the acronyms indicates “transitivity”, in the sense that closure here can be interpreted along the lines of “the friend of my friend is my friend.”

Footnote 3: The global clustering coefficient is simply thrice the number of triangles divided by the number of two-stars. The local clustering coefficient is the mean across all nodes of the density of the egocentric network of each node.

Figure 3 about here.

Additional network closure and connectivity parameters

In this paper, we make a simple generalization of the directed AKT and A2P parameters of Snijders et al (2006) by defining new k-triangle and k-2-path configurations that have semi-paths as their sides. We also define a k-triangle configuration to represent cyclicity. These AKT and A2P configurations represent different versions of network closure and of multiple connectivity, respectively.

The additional versions of k-triangle configurations are presented in Figure 4, labeled by the associated alternating parameters that have statistics analogous to the non-directed case, equation (2). For AKT-U and AKT-D, the ‘U’ and ‘D’ suffixes follow the labeling practice of the standard directed graph triad census as “up” and “down” (Holland & Leinhardt, 1970). The AKT-C parameter is labeled with a ‘C’ suffix to indicate cyclicity.

Figure 4 about here.

For consistency with these new closure parameters, we also propose two new lower order k-2-path configurations to express multiple connectivity, with alternating parameters A2P-U and A2P-D, as presented in Figure 5.
Together with the parameters of Snijders et al (2006), we are then proposing, in total, four parameters that express different forms of network closure (AKT-T, AKT-U, AKT-D, AKT-C) and three parameters that express different forms of multiple connectivity (A2P-T, A2P-U, A2P-D). There are only three parameters for multiple connectivity because the A2P-T parameter is lower order to both the AKT-T and AKT-C parameters.

Figure 5 about here.

Interpretation

The four directed alternating k-triangle parameters express different versions of triadic closure:
