ebook img

Opinion Maximization in Social Networks PDF

0.44 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Opinion Maximization in Social Networks

Opinion maximization in social networks Aristides Gionis∗ Evimaria Terzi† Panayiotis Tsaparas‡ Abstract terfere with the opinion-formation process by influenc- Theprocessofopinionformationthroughsynthesisand ing the opinions of appropriately selected individuals, contrast of different viewpoints has been the subject such that the overall favorable opinion about the spe- of many studies in economics and social sciences. To- cific information item is strengthened. 3 day, this process manifests itself also in online social The idea of leveraging social influence for market- 1 networks and social media. The key characteristic of ingcampaignshasbeenstudiedextensivelyindatamin- 0 successful promotion campaigns is that they take into ing. Introduced by the work of Domingos and Richard- 2 considerationsuchopinion-formationdynamicsinorder son [20], and Kempe et al. [14] the problem of influence n to create a overall favorable opinion about a specific in- maximization asks to identify the most influential indi- a formationitem,suchasaperson,aproduct,oranidea. viduals, whose adoption of a product or an action will J Inthispaper,weadoptawell-establishedmodelfor spread maximally in the social network. This line of 0 3 social-opinion dynamics and formalize the campaign- workemploysprobabilisticpropagationmodels,suchas design problem as the problem of identifying a set the independent-cascade or the linear-threshold model, ] of target individuals whose positive opinion about an whichspecifyhowactionsspreadamongtheindividuals I S information item will maximize the overall positive of the social network. At a high level, such propagation . opinion for the item in the social network. We call models distinguish the individuals to either active or s c this problem Campaign. We study the complexity inactive, and assume that active individuals influence [ of the Campaign problem, and design algorithms for their neighbors to become active according to certain 1 solving it. Our experiments on real data demonstrate probabilistic rules, so that activity spreads in the net- v the efficiency and practical utility of our algorithms. work in a cascading manner. 5 Inthispaper,weareinterestedintheprocessofhow 5 1 Introduction individuals form opinions, rather than how they adopt 4 products or actions. We find that models such as the 7 Individuals who participate in social networks form independent cascade and the linear threshold are not . their opinions through synthesis and contrast of dif- 1 appropriate for modeling the process of forming opin- 0 ferent viewpoints they encounter in their social circles. ions in social networks. First, opinions cannot be ac- 3 Suchprocessesmanifestthemselvesmorestronglyinon- curately modeled by binary states, such as being either 1 line social networks and social media where opinions : active or inactive, but they can take continuous values v and ideas propagate faster through virtual connections in an interval. Second, and perhaps more importantly, i between social-network individuals. Opinion dynam- X the formation of opinions does not resemble a discrete ics have been of considerable interest to marketing and r cascading process; it better resembles a social game in a opinion-formationagencies,whichareinterestedinrais- which individuals constantly influence each other until ing public awareness on important issues (e.g., health, convergence to an equilibrium. social justice), or increasing the popularity of person or Accordingly,weadopttheopinion-formationmodel an item (e.g., a presidential candidate, or a product). of Friedkin and Johnsen [11], which assumes that each Today, social-networking platforms and social me- node i of a social network G = (V,E) has an internal diatakeupasignificantamountofanypromotioncam- opinions andanexpressedopinionz . Whiletheinter- paignbudget. Itisnotuncommontoseeactivistgroups, i i nalopinionsofindividualsarefixedandnotamenableto political parties, or corporations launching campaigns external influences, the expressed opinions can change primarily via Facebook or Twitter. Such campaigns in- due to the influence from their neighbors. More specif- ically, people change their expressed opinion z so that i ∗Aalto University, Finland. This work was done while the they minimize their social cost, which is defined as the authorwaswithYahoo!Research. disagreement between their expressed opinion and the †Boston University, USA. Supported by NSF awards: CNS- expressed opinion of their social neighbors, and the di- 1017529,III-1218437andgiftsfromMicrosoftandGoogle. ‡UniversityofIoannina,Greece. vergence between the expressed opinion and their true internalbelief. Itcanbeshownthattheprocessofcom- sociology and computer science. puting expressed opinion values by repeated averaging In his original work in 1974, DeGrout [6] was the leads to a Nash equilibrium of the game where the util- first to define a model where individuals organized in ities are the node social costs [3]. a social network G = (V,E) have a starting opinion Armed with this opinion-formation model, we in- (represented by a real value) which they update by troduce and study a problem similar to the influence- repeatedlyadoptingtheaverageopinionoftheirfriends. maximizationproblem,introducedbyKempeetal.[14]. This model and its variants, has been subject of recent We consider a social network of individuals, each hold- studies in social sciences and economics [2, 7, 11, 12, ing an opinion value about an information item. We 13]. Most of these studies focus on identifying the view an opinion as a numeric value between zero (neg- conditionsunderwhichsuchrepeated-averagingmodels ative) and one (positive). We then ask to identify k converge to a consensus. In our work, we also adopt a individuals, so that once their expressed opinions be- repeated averaging model. However, our focus is not comes 1, the opinions of the rest of the nodes — at the characterization or the reachability of consensus. an equilibrium state — are, on average, as positive as In fact, we do not assume consensus is ever reached. possible. We call this problem Campaign. Rather, we assume that the individuals participatingin Note that there is a fundamental qualitative differ- the network reach a stable point, where everyone has ence between our framework and existing work on in- crystallized a personal opinion. As has been recently fluence maximization: In contrast to other models used notedbysocialscientistDavideKrackhardt[16],studies ininfluence-maximizationliterature,ourworkviewsthe of such non-consensus states are much more realistic nodesasrationalindividualswhowishtooptimizetheir since consensus states are rarely reached. own objectives. Thus, we assume that the opinions of Key to our paper is the work by Bindel et al. [3]. individuals get formed through best-response dynamics In fact, we adopt the same opinion-dynamics model of a social game, which in turn is inspired by classi- as in [3], where individuals selfishly form opinions calopinion-dynamicsmodelsstudiedextensivelyineco- to minimize their personal cost. However, Bindel et nomics [2, 6, 7, 12, 13]. al. focus on quantifying the social cost of this lack Furthermore, from a technical point of view, the of central coordination between individuals, i.e., the maximization problem that results from our framework price of anarchy, and they consider a network-design requires a completely different toolbox of techniques problem with the objective of reducing the social cost than the standard influence-maximization problem. To atequilibrium. Ourworkontheotherhand, focuseson this end, we exploit an interesting connection between designingapromotioncampaignsothat,atequilibrium, our problem and absorbing random walks in order to the overall positive inclination towards a particular establishourtechnicalresults,includingthecomplexity opinion is maximized. and the approximability of the problem. Interestingly, Recently, there has been a lot of work in the as with the influence-maximization problem, we show computer-science literature on identifying a set of in- that the objective function of the Campaign problem dividuals to advertise an item (e.g., a product or an issubmodular,andthus,itcanbeapproximatedwithin idea) so that the spread of the item in the network is factor (1− 1) using a greedy algorithm. maximized. Different assumptions about how informa- e In addition, motivated by the properties of the tion items propagate in the network has led to a rich greedy algorithm, we propose scalable heuristics that literature of targeted-advertisement methods (e.g., see can handle large datasets. Our experiments show that [5, 9, 14, 20]). Although at a high level our work has in practice the heuristics have performance comparable the same goal as all of these methods, there are also to the greedy algorithm, and that are scalable to very important differences, as we have already discussed in large graphs. Finally, we discuss two natural variants the introduction. of the Campaign problem. Our discussion reveals a surprising property of the opinion formation process on 3 Problem definition undirected graphs: the average opinion of the network 3.1 Preliminaries. We consider a social graph G = dependsonlyontheinternalopinionsoftheindividuals (V,E) with n nodes and m edges. The nodes of the and not on the network structure. graph represent people and the edges represent social affinity between them. We refer to the members of 2 Related Work the social graph by letters such as i and j, and we Tothebestofourknowledge,wearethefirsttoformally write (i,j) to denote the edges of the graph. With defineandstudytheCampaignproblem. However,our each edge (i,j) we associate a weight w ≥ 0, which ij work is related to a lot of existing work in economics, expresses the strength of the social affinity or influence from person i to person j. We write N(i) to denote Equation (3.1). That is, the stationary vector of opin- the social neighborhood of person i, that is, N(i)={j | ions z is such that no node i has an incentive to change (i,j) ∈ E}. Unless explicitly mentioned, we do not their opinion to improve their cost c . i make an assumption whether the graph G is directed or undirected; most of our results and our algorithms 3.2 Problem definition. The goal of a promotion carryoverforbothtypesofgraphs. Thedirected-graph campaign is to improve the overall opinion about a model is more natural as in many real-world situations product, person, or idea in a social network. Given an the influence w from person i to person j is not equal opinion vector z, we define the overall opinion g(z) as ij to w . ji Following the framework of Bindel et al. [3] we as- (cid:88)n g(z)= z , sume that person i has a persistent internal opinion s , i i whichremainsunchangedfromexternalinfluences. Per- i=1 son i has also an expressed opinion z , which depends i which is also proportional to the average expressed on their internal opinion s , as well as on the expressed i opinion oftheindividualsinG. Thegoalofacampaign opinions of their social neighborhood N(i). The under- is to maximize g(z). Following the paradigm of Kempe lying assumption is that individuals form opinions that et al. [14], we assume that such a campaign relies on combine their internal predisposition with the opinions selecting a set of target nodes T, which are going to be of those in their social circle. convinced to change their expressed opinions to 1. For Wemodeltheinternalandexternalopinionss and i therestofthediscussion, wewilluseg(z |T)todenote z as real values in the interval [0,1]. The convention i the overall opinion in the network, when vector z is the is that 0 denotes a negative opinion, and 1 a positive Nash-equilibrium vector obtained under the constraint opinion. Thevaluesin-betweencapturedifferentshades that the expressed opinions of all nodes in T are fixed of positive and negative opinions. Given a set of to 1. Given this notation, we can define the Campaign expressed opinion values for all the people in the social problem as follows. graph, represented by an opinion vector z = (z : i ∈ i V), and the vector of internal opinions s=(s :i∈V), i Problem 1. (Campaign) Given a graph G = (V,E) we consider that the personal cost for individual i is and an integer k, identify a set T of k nodes such that (cid:88) fixing the expressed opinions of the nodes in T to 1, (3.1) c (z)=(s −z )2+ w (z −z )2. i i i ij i j maximizes the overall opinion g(z |T). j∈N(i) This cost models the fact that the expressed opinion zi Weemphasizethatfixingzi =1foralli∈T means of an individual i is a “compromise” between their own thatEquation(3.2)isonlyappliedforthezj’ssuchthat internalbeliefsi andtheopinionsoftheirneighbors. As j ∈/ T, while for the nodes i∈T the values zi remain 1. theindividualiformsanopinionz ,theirinternalopin- The definition of the Campaign problem reflects i ion s and the opinions z of their neighbors j ∈ N(i) our belief of what constitutes a feasible and effec- i j mayhavedifferentimportance. Therelativeimportance tive campaign strategy. Expressed opinions are more of those opinions is captured by the weights w . amenable to change, and have stronger effect on the ij Now assume that, as a result of social influence overall opinion in the social network. Thus, it is rea- and conflict resolution, every individual i is selfishly sonable for a campaign to target these opinions. We minimizingtheirowncostc (z). Iftheinternalopinions note that other campaign strategies are also possible, i are persistent and cannot change (an assumption that resulting in different problem definitions. For example, we carry throughout), minimizing the cost c implies one can define the problem where the campaign aims i changing the expressed opinion z to the weighted at changing the fundamental beliefs of people by alter- i averageamongtheinternalopinionsi andtheexpressed ing their internal opinions si. It is also conceivable to opinions of the neighbors of i. In other words, ask whether it is possible to improve the overall opin- ion g(z) by introducing a number of new edges in the (cid:80) si+ j∈N(i)wijzj social graph, e.g., via a link-suggestion application. We (3.2) z = . i 1+(cid:80) w discuss both of these variants at the end of the paper. j∈N(i) ij It turns out that from the algorithmic point of view, Infact,itcanbeshownthatifeverypersoniiteratively boththeseproblemsarerelativelysimple. Forinstance, updates their expressed opinion using Equation (3.2), we can show that for undirected graphs, surprisingly, it then the iterations converge to a unique Nash Equilib- is not possible to improve the overall opinion g(z) by rium for the game with utilities of players expressed by introducing new edges in the graph. 3.3 Background. We now show the connection be- Assume that each absorbing node j ∈ B is associ- tweencomputingtheNash-equilibriumopinionvectorz ated with a fixed value b . If a random walk starting j and a random walk on a graph with absorbing nodes. fromtransientnodei∈U getsabsorbedinanabsorbing This connection is essential in the analysis of the Cam- node j ∈B, then we assign to node i the value b . The j paign problem. probability of the random walk starting from node i to Absorbing random walks: Let H = (X,R) be a be absorbed in j is QUB(i,j). Therefore, the expected (cid:80) graph with a set of N nodes X, and a set of edges value of i is fi = j∈BQUB(i,j)bj. If fU is the vector R. The graph is also associated with the following withtheexpectedvaluesforalli∈U,andfB keepsthe three N ×N matrices: (i) the weight matrix W with values {bj} for all j ∈B, then we have that entries W(i,j) denoting the weight of the edges; (ii) (3.3) f =Q f . the degree matrix D, which is a diagonal matrix such U UB B that D(i,i) = (cid:80)N W(i,j); (iii) the transition matrix j=1 A fundamental observation, which highlights the P = D−1W, which is a row-stochastic matrix; P(i,j) connection between our work and random walks with expressestheprobabilityofmovingfromnodeitonode absorbing states, is that the expected value f of node i j in a random walk on the graph H. i ∈ U can be computed by repeatedly averaging the In such a random walk on the graph H, we say values of the neighbors of i in the graph H. Therefore, that a node b ∈ X is an absorbing node, if the random thecomputationoftheNash-Equilibriumopinionvector walk can only transition into that node, but not out z canbedoneusingEquation(3.3)onanappropriately of it (and thus, the random walk is absorbed in node constructedgraphH. WediscusstheconstructionofH b). Let B ⊆ X denote the set of all absorbing nodes below. Moredetailsonabsorbingwalkscanbefoundin of the random walk. The set of the remaining nodes the excellent monograph of Doyle and Snell [8]. U =X\B arenon-absorbing,ortransient nodes. Given The augmented graph. We will now show how the this partition of the states in X, the transition matrix theory of absorbing random walks described above can of this random walk can be written as follows: be leveraged for solving the Campaign problem. This (cid:18) (cid:19) P P connection is achieved by performing a random walk P = UB UU . I O with absorbing states on an augmented graph H = (X,R), whose construction we describe below. In the above equation, I is an (N −|U|)×(N −|U|) GivenasocialnetworkG=(V,E)whereeveryedge identitymatrixandO amatrixwithallitsentriesequal (i,j)∈Eisassociatedwithweightw ,weconstructthe ij to 0; PUU is the |U| × |U| sub-matrix of P with the augmented graph H =(X,R) of G as follows: transition probabilities between transient states; and P isthe|U|×|B|sub-matrixofP withthetransition (i) thesetofverticesX ofH isdefinedasX =V ∪V, UB probabilities from transient to absorbing states. where V is a set of n new nodes such that for each An important quantity of an absorbing random node i∈V there is a copy σ(i)∈V; walk is the expected number of visits to a transient (ii) the set of edges R of H includes all the edges E of state j when starting from a transient state i before G,plusanewsetofedgesbetweeneachnodei∈V being absorbed. The probability of transitioning from and its copy σ(i) ∈ V. That is, R = E ∪E, and i to j in exactly (cid:96) steps is the (i,j)-entry of the matrix E ={(i,σ(i))|i∈V}; (P )(cid:96). Therefore, the probability that a random walk UU starting from state i ends in j without being absorbed (iii) theweightsofallthenewedges(i,σ(i))∈Rareset is given by the (i,j) entry of the |U|×|U| matrix to 1, i.e., W(i,σ(i)) = 1. For i,j ∈ V, the weight of the edge (i,j) ∈ R is equal to the weight of the ∞ F =(cid:88)(P )(cid:96) =(1−P )−1, corresponding edge in G, i.e., W(i,j)=wij. UU UU (cid:96)=0 Our main observation is that we can compute the opinion vector z that corresponds to the Nash equi- which is known as the fundamental matrix of the librium defined by Equation (3.1) by performing an absorbing random walk. Finally, the matrix absorbingrandomwalkonthegraphH. Inthisrandom Q =F P walk, we set B = V and U = V, that is, we make UB UB all copy nodes in V to be absorbing. We also set is an |U|×|B| matrix, with Q (i,j) being the proba- f = s, that is, we assign value s to each absorbing UB B i bilitythatarandomwalkwhichstartsattransientstate node σ(i). The Nash-equilibrium opinion-vector z can i ends up being absorbed at state j ∈B. be computed using Equation (3.3), that is, z = Q s. UB (cid:80) The opinion z = Q (i,j)s is the expected the less competition there is for x (i.e., the smaller the i j∈B UB j internal opinion value at the node of absorption for a size of B), the more mass of the random walk will be random walk that starts from node i ∈ V. Given the absorbed in x, and the larger the increase of g(z | T). vector z we can compute the overall opinion g(z). Hence g(z |T) is submodular. TheCampaignproblemcanbenaturallydefinedin this setting. Selecting a set of nodes T is equivalent to 5 Algorithms addingthenodesinT intothesetofabsorbingnodesB, 5.1 Estimating the Nash-equilibrium vector z. andassigningthemvalue1. Thatis,wehaveB =V ∪T A central component in all the algorithms presented in and U = V \T. For the vector fB, we have fσ(i) = si thissectionistheestimationofthetheopinionfunction for all σ(i) ∈ V, and fj = 1 for all j ∈ T. We use g(z |T). In Section 3.3, we have already discussed that Equation (3.3) to compute vector z and using this z, this can be done by evaluating Equation (3.3). For ap- we can then compute the overall opinion g(z | T). propriatelydefinedsetsU andB, thisrequirescomput- Hence, the Campaign problem becomes the problem ing the matrix Q = (1−P )−1P . Hence, this UB UU UB of selecting a set of k nodes T ⊆ V to make absorbing calculation involves a matrix inversion, which is very with value 1, such that g(z |T) is maximized. inefficient. The reason is that despite the fact that the social graph is typically sparse, matrix inversion does 4 Problem complexity not preserve sparseness. Thus, it may be too expensive In this section, we establish the complexity of the to even store the matrix QUB. Campaign problem by showing that it is an NP-hard Instead, we resort to the power-iteration method problem. We also discuss properties of the objective implied by Equation (3.2): at each iteration we update function g(z | T), which give rise to a constant-factor the opinion zi of a node i ∈ V by averaging the approximation algorithm for the Campaign problem. opinions of its neighbors j ∈ N(i) and its own internal opinion s . During the iterations we do not update the i Theorem 4.1. Problem Campaign is NP-hard. values of opinions that are fixed. This power-iteration method is known to converge to the equilibrium vector The proof of the theorem appears in the Ap- z, and it is highly scalable, since it only involves pendix A. The proof relies on a reduction from multiplication of a sparse matrix with a vector. For the Vertex Cover on Regular Graphs problem a graph with n nodes, m edges, and thus, average (VCRG) [10]. degree d = 2m, the algorithm requires O(nd) = O(m) Since the Campaign problem is NP-hard, we are n operationsperiterations. Therefore,theoverallrunning content with algorithms that approximate the optimal timeisO(mI),whereI isthetotalnumberofiterations. solution in polynomial time. Fortunately, we can show In our experiments we found the the method converges thatthefunctiong(z |T)ismonotoneandsubmodular, in around 50-100 iterations, depending on the dataset. and thus a simple greedy heuristic yields a constant- factor approximation to the optimal solution. 5.2 Algorithms for the Campaign problem. Our algorithms for the Campaign problem, include a Theorem 4.2. The function g(z |T) is monotone and constant-factorapproximationalgorithmaswellassev- submodular. eral efficient and effective heuristics. Proof. We only give here a proof sketch. A detailed The Greedy algorithm. It is known that the greedy proof is given in Appendix B. Recall that g(z | T) = algorithm is a (1 − 1)-approximation algorithm for (cid:80) z . In the absorbing random walk interpretation maximizing a submoduelar function h : Y → R subject i∈V i of the opinion formation process, we have shown that to cardinality constraints, i.e., finding a set A⊆Y that the expressed opinion of node i is the expected opinion maximizes h(A) such that |A| ≤ k [19]. Consequently, valueatthepointofabsorbtionforarandomwalkthat the Greedy algorithm constructs the set T by adding (cid:80) starts from node i. That is, z = P(b | i)f , one node in each iteration. In the t-th iteration the i b∈B b where B is the set of absorbing nodes, P(b | i) is the algorithm extends the set T(t−1) by adding the node i probability of the random walk starting from node i that maximizes g(z |T(t)) when setting z =1. i to be absorbed at node b, and f the opinion value at The computational bottleneck of Greedy is due to b node b. When we add a node x to T, and hence to the fact that we need to compute the Nash-equilibrium the set B, some of the probability mass of the random opinion vector z that results from setting z =1 for all t walk will be absorbed at x. Since x has the maximum t ∈ T ∪ {j}, and we need to do such a computation possibleopinionvalue,f =1,iffollowsthatz canonly for all candidate nodes j ∈ V \ T. Overall, for a x i increase, and thus g(z | T) is monotone. Furthermore, solution T of size |T| = k Greedy needs to perform Karate club degree 51015 llllllllllllllllllllllllllllllll degree to free neighbors 051015 llllllllllllllllllllllllllllllll internal opinion 0.00.20.40.60.8 llllllllllllllllllllllllllllllll 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 k k k Figure 1: Measures of graph nodes plotted in order selected by the Greedy algorithm. O(nk) computations of finding the optimal vector z. already selected by Greedy. Finally, in the third panel As we saw, each of these computations is performed ofFigure1weseetheinternalopinions ofnodesinthe i by a power-iteration in time O(mI), yielding an overall order selected by the Greedy. We see that the Greedy running time O(nmkI). Such a running time is super- tends to select first nodes with low internal opinion. quadratic and therefore the algorithm is not scalable to There are a few exceptions of nodes with high internal very large datasets. opinion s selected at the initial steps of greedy. Such i One way to speedup the algorithm is by storing, nodes are nodes with high degree, connected to many for each node that it is not yet selected, its marginal nodes with small values of s . i improvement on the score, at the last time it was com- Armed with intuition from this analysis we now puted. This speedup, which is commonly used in op- proceed to describe our heuristics. timization problems with submodular functions [17], is The Degree algorithm. This algorithm simply sorts not adequate to make the greedy algorithm applicable the nodes of G = (V,E) in decreasing order of their in for large data, at least for the version of the algorithm degree and forms the set of target nodes T by picking described here. The reason is that in the very first iter- the top-k nodes of the ranking. The running time of ationthereisnopruningandthereforeweneedtomake Degree is O(nlogn), i.e., the time required for sorting. O(n) power-iteration computations, yielding again a quadraticalgorithm. Toovercome thesescalability lim- The FreeDegree algorithm. This algorithm is a itations, we present a number of scalable heuristics. “greedy” variant of Degree; FreeDegree forms the set T iteratively by choosing at every iteration the node 5.3 Designing the heuristics. To characterize the with the highest free degree. The free degree of a node nodes selected by Greedy we execute the algorithm on is the sum of the weights of the edges that are incident small datasets, and we compute a number of measures to it and are are connected to nodes not already in T. for each node selected by Greedy. In particular, for WhenthesetT consistsofk nodes,therunningtimeof each node we compute measures such as its degree, the FreeDegree is O(kn). average degree of its neighbors, the maximum degree of The RWR algorithm. As we saw in Figure 1, a its neighbors, the value of its internal opinion si, the good choice for nodes to be added in the solution average value of si over its neighbors, and so on. Three are not only the nodes of high degree but also the of the features with the most clear signal are shown nodes of small value of internal opinion s . The RWR i in Figure 1 for the karate club dataset (described in algorithm combines both of these features: selecting detail in Section 6). We obtain similar behavior on all nodes with high degree and with small s . This is done i the datasets we tried. byperformingarandomwalkwithrestart(RWR),where In Figure 1 we plot measures of nodes in the order the probability of restarting at a node i is proportional selected by the Greedy. A good measure would be to r = s − s , where s = max s , and i max i max i∈V i one that is monotonic with respect to this order. In orderingthenodesaccordingtotheresultingstationary the first panel, we show the degree of a node in the distribution. The intuition is that a random walk selectionorderof Greedy,andweseethatGreedytends favors high-degree nodes, and using the specific restart to select first high degree nodes. As shown in the probabilities favors nodes with low value of s . i secondpanel,thisdependenceisevenmoreclearforthe For the restart probability, we use the parameter free degree, i.e., the number of neighbors that are not α = 0.15, which has been established as a standard parameter of the PageRank algorithm [4]. Making 6.2 Bibliographic datasets: A “data mining” one RWR computation can be achieved by the power- campaign. We also evaluate our algorithms on two iteration method, which similarly to computing the large social networks, derived from bibliographic data. optimal vector z, has running time O(mI). Therefore, The first dataset, bibsonomy, is extracted from theoverallrunningtimeofthealgorithmforselectinga bibsonomy [1], a social-bookmarking and publication- set T of size k is O(mkI). sharing system. From the available data, we extract a social graph of 45329 nodes representing authors and The Min-S and Min-Z algorithms. The Min-S 149895 edges representing co-authorship relations. For algorithm simply selects the k nodes with the smallest each author we also keep the set of tags that have been value s . This heuristic is motivated by the observation i used for the papers of that author. that the Greedy algorithm tends to select nodes with The second dataset, dblp, is also a co-authorship small value s . For completeness, we also experiment i graph among computer scientists extracted from the with the Min-Z algorithm, which greedily selects and dblp site.2 The dataset is a large graph containing add in the solution set T the node that at the current 635585 nodes and 1423716 edges. Again, for each iteration has the smallest value of expressed opinion z . i author we keep the set of terms they have been used in the titles of the papers they have co-authored. 6 Experimental evaluation Inthisexperiment,wegeneratetheinternalopinion The objective of our experiments is to compare the vectors by identifying keywords related to data mining proposed heuristics against Greedy, the algorithm with (e.g., we picked data, mining, social, networks, graph, the approximation guarantee, and demonstrate their clustering, learning, and community). For each author scalability. i ∈ V we then set s to be the fraction of the above i keywords present in his set of terms. This setting 6.1 Small networks. Weexperimentwithanumber corresponds to a hypothetical scenario of designing a of publicly available small networks.1 We evaluate our campaign to promote the “data mining” topic among algorithms by reporting the value of the objective func- all computer-science researchers. tion g(z) as a function of the solution set size |T| = k. The results of the heuristics for the two datasets Our results for three small networks are shown in Fig- are shown in Figure 3. For the bibsonomy dataset ure 2. The datasets shown in the figure are the follow- (left), there is a clear distinction between the three ing: (i) karate club: a social network of friendships heuristics;RWRclearlyperformsbetterthanbothDegree between34membersofakarateclubataUSuniversity and FreeDegree. This superior performance of RWR in the 1970s [21]; (ii) les miserables: co-appearance is expected as this algorithm takes into account both network of characters in the novel Les Miserables [15]; the degrees and the values of the internal opinions s . and(iii)dolphins: anundirectedsocialnetworkoffre- i Also the better performance of FreeDegree compared quentassociationsbetween62dolphins[18]. Forthisset with the performance of Degree is consistent with the of experiments we set the internal opinions s to be a i results obtained for smaller networks. On the other uniformly-sampled value in [0,1]. hand, on the dblp dataset (right part of Figure 3) the We see that the difference between all the algo- behavior of the three heuristics is more surprisingly, as rithmsisrelativelysmallbuttheirrelativeperformance all three perform almost identical. Finally, for both is consistent. The Greedy algorithm achieves the best datasets, the difference of the three best heuristics results,whilethethreeheuristics,Degree,FreeDegree, Degree, FreeDegree, and RWR with the other two and RWR come close together. Min-S and Min-Z have heuristics, Min-S and Min-Z is more pronounced. In the poorest performance, even though for larger val- fact, the performance of Min-S and Min-Z is very poor. ues of k they improve and slightly outperform some of We investigate the difference on the relative perfor- the heuristics. Between the two, Min-S performs best, manceofthebestthreeheuristics,Degree,FreeDegree, outperforming Min-Z, especially for small values of k. and RWR, on the two datasets by plotting the degrees of Both of those trends are expected: the three heuris- the nodes versus their s value. This is shown in Fig- ticsDegree,FreeDegree,andRWR,arebettermotivated i ure 4. Recall that our intuition for selecting nodes that than Min-S, which in turn, is better motivated than istochoosenodesthathavelargedegreeandsmallvalue Min-Z. of s . Figure 4 demonstrates that in the dblp dataset, We obtain similar results for other small networks, i large degree correlates well with small s values, while although we do not provide the plots for lack of space. i this is not the case for the bibsonomy dataset. There- 1www-personal.umich.edu/˜mejn/netdata/ 2www.informatik.uni-trier.de/˜ley/db/ Karate club Les miserables Dolphins g(z)02530 l ll ll ll lll llgfrreellee.ddlleygrlleell g(z)06070 llllllllllllllllllllllllllllgfrllreelleell.ddlleyllgllrllellell g(z)405060 ll ll ll ll ll ll ll lll ll llgfllrrelleeell.ddlleyllgrllellell 2 RWR 5 RWR RWR l l 5 l degree 0 l degree 30 l degree 1 l min.s 4 l min.s l min.s l min.z min.z min.z 10 ll 30 ll 20 ll 0 5 10 15 20 25 30 0 10 20 30 40 50 60 70 0 10 20 30 40 50 k k k Figure 2: Performance of the algorithms on small networks. Bibsonomy −− data mining DBLP −− data mining l z) 0010000 ll RfdmmreeWiinnge..Rr.szdeeegrleel l l l l l l l l l l l z) 5000060000 ll RfdmmreeWiinnge..Rr.szdeeegreel l l l l l l l l l l g( 60 l l g( 000 l l 0 l 4 l 02000 ll l l l l l l l l l l l l l l l l 30000 ll ll l l l l l l l l l l l l l l l 0 100 200 300 400 0 100 200 300 400 k k Figure 3: Performance of the algorithms on the bibsonomy and dblp networks for the topic “data mining”. 7 Problem variants The Campaign problem we studied in this paper fo- cusesoncampaignsthataimtoaltertheexpressedopin- ions of individuals. However, other campaign strategies arealsopossible. Forexampleonecouldaimataltering the internal opinions of individuals such that the over- all opinion is improved as much as possible. Formally, the goal would be to select a set of nodes S, which are going to be convinced to change their internal opinions s to 1, such that the resulting overall opinion g(z | S) i Figure 4: Scatter plot of degrees vs. internal opinion is maximized. We call this problem the i-Campaign values s in the two datasets, bibsonomy and dblp. problem. The difference between the Campaign and i- i Campaignproblemsisthatintheformerweareasking to fix the expressed opinions z for k individuals, while i in the latter we are asking to fix the internal opinions s . Even though the difference is seemingly small, the i problems are computationally very different. fore,fordblpallthreeheuristicspickhigh-degreenodes, Algorithmically, theproblemofselectingS individ- which makes their performance almost identical. uals to change their internal opinions, so that we max- References imize g(z | S) is much simpler. In fact, we can show that for an undirected social graph G = (V,E), where [1] Benchmark folksonomy data from bibsonomy. Techni- each node i ∈ V has internal opinion si and expressed cal report, Knowledge and Data Engineering Group, opinion zi, the following invariant holds, independently University of Kassel, 2007. of the structure of the graph (the set of edges E): [2] D. Acemoglu and A. Ozdaglar. Opinion dynamics and learning in social networks. Dynamic Games, 2011. (cid:88) (cid:88) (7.4) g(z)= zi = si. [3] D. Bindel, J. M. Kleinberg, and S. Oren. How bad is i∈V i∈V forming your own opinion? In FOCS, 2011. [4] S. Brin and L. Page. The anatomy of a large-scale Consequently,thegoalofmaximizingg(z)bymodifying hypertextual web search engine. In WWW, 1998. k values s can be simply achieved by selecting the k i [5] N. Chen. On the approximability of influence in social smallest values si and setting them to 1. We note that networks. In SODA, 2008. the above observation does not hold once the expressed [6] M. H. DeGrout. Reaching consensus. Journal of opinions of some individuals are fixed, as is the case in Maerican Statistical Association, 1974. theCampaignproblem. Theproofoftheinvariant,and [7] P. M. DeMarzo, D. Vayanos, and J. Zweibel. Persua- its implications are discussed in the Appendix C. sionbias,socialinfluence,andunidimensionalopinions. The graph invariant has obvious implications for Quarterly Journal of Economics, 2003. [8] P. Doyle and J. Snell. Random walks and electric the other variant of the campaign problem, where we networks. MathematicalAssociationofAmerica,1984. seek to maximize g(z) by adding or removing edges [9] E. Even-Dar and A. Shapira. A note on maximizing to the graph. From Equation (7.4) it follows that the spread of influence in social networks. In WINE, for undirected social graphs this problem variant is 2007. meaningless; the overall opinion g(z) does not depend [10] U. Feige. Vertex cover is hardest to approximate on on the structure of the graph. This observation has regular graphs. Technical report MCS03-15 of the important implications for opinion formation on social Weizmann Institute, 2003. networks. Itshowsthatalthoughthenetworkstructure [11] N. E. Friedkin and E. Johnsen. Social influence and has an effect on the individual opinions of network opinions. Journal of Mathematical Sociology, 1990. participants,itdoesnotaffecttheaverageopinioninthe [12] B. Golub and M. O. Jackson. Naive learning in social network. For the campaign problem, this says that you networks: Convergence, influence and the wisdom of the crowds. Amer. Econ. J.: Microeconomics, 2010. cannot create more goodwill by altering the network. [13] M.O.Jackson. SocialandEconomicNetworks. Prince- For the study of social dynamics, this implies that the ton University Press, 2008. collective wisdom of the crowd remains unaffected by [14] D.Kempe,J.M.Kleinberg,andE´.Tardos. Maximizing the social connections between individuals. the spread of influence through a social network. In KDD, 2003. 8 Conclusions [15] D.E.Knuth. TheStanfordGraphBase: APlatformfor We considered a setting where opinions of individuals Combinatorial Computing. Addison-Wesley, 1993. in a social network evolve through processes of social [16] D.Krackhardt. Aplungeintonetworks. Science,2009. [17] J.Leskovec,A.Krause,C.Guestrin,C.Faloutsos,J.M. dynamics, reaching a Nash equilibrium. Adopting a VanBriesen, and N. S. Glance. Cost-effective outbreak standard social and economic model of such dynamics detection in networks. In KDD, 2007. we addressed the following natural question: given [18] D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, a social network of individuals who have their own E.Slooten,andS.M.Dawson. Behavioral Ecology and internal opinions about an information item, which are Sociobiology, 54:396–405, 2003. the individuals that need to be convinced to adopt [19] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis a positive opinion so that in the equilibrium state, of the approximations for maximizing submodular set the network (as a whole) has the maximum positive functions. Mathematical Programming, 1978. opinion about the item? We studied the computational [20] M. Richardson and P. Domingos. Mining knowledge- complexityofthisproblemandproposedalgorithmsfor sharing sites for viral marketing. In KDD, 2002. solving them exactly or approximately. Our theoretical [21] W.W.Zachary. Aninformationflowmodelforconflict and fission in small groups. Journal of Anthropological analysisandthealgorithmdesignreliedonaconnection Research, 33:452–473, 2003. between opinion dynamics and random walks with absorbing states. Our experimental evaluation on real datasets demonstrated the efficacy of our algorithms and the effect of the structural characteristics of the underlying social networks on their performance. A Proof of Theorem 4.1 in node σ is strictly greater than zero. Therefore, the (cid:96) Proof. We prove the theorem by reducing an instance probability of being absorbed in some node not in T is oftheVertex Cover on Regular Graphsproblem strictly greater than d+11, and thus zj < d+d1. It follows (VCRG)[10]toaninstanceofthedecisionversionofthe that g(z |T)<(n−k) d . d+1 Campaign problem. We remind that a graph is called regular if all its nodes have the same degree. B Proof of Lemma 4.2 Given a regular graph GVC = (VVC,EVC) and an Proof. Recallthatg(z |T)=(cid:80)i∈V zi,wherethevalues integerK theVCRGproblemaskswhetherthereexists of vector z are computed using Equation (3.3). As we a set of nodes Y ⊆ VVC such that |Y| ≤ K and Y is have already described, we can view the computation a vertex cover (i.e., for every (i,j) ∈ EVC it is i ∈ Y of the vector z as performing a random walk with or j ∈Y). absorbing nodes on the augmented graph H = (V ∪ An instance of the decision version of the Cam- V,E∪E). AssumethatB isthesetofabsorbingnodes, paign problem consists of a social graph G = (V,E), and that each b ∈ B is associated with value f . Let b internal opinions s, an integer k and a number θ. The P (b | i) be the probability that a random walk that B solution to the decision version is “yes” iff there exists starts from i gets absorbed at node b, when the set of a set T ⊆V such that |T|≤k and g(z |T)≥θ. absorbing nodes is B. The expressed opinion of node Given an instance of the VCRG problem, we will i∈/ B is construct an instance of the decision version of Cam- z = (cid:88)P (b|i)f . i B b paign by setting G = (V,E) to be equal to G = VC b∈B (V ,E ), s = 0 for every i ∈ V, k = K and VC VC i Since this value depends on the set B, we will write θ =(n−k) d . Then,weshowthatT ⊆V isasolution d+1 z(i | B) to denote the value of z when the set of to the Campaign problem with value g(z | T) ≥ θ if i absorbing nodes is B. and only if T is a vertex cover of the input instance of Initially, the set of absorbing nodes is B = V and VCRG. f =s for all b∈B. When we select a subset of nodes Inordertoshowthis, weusetheabsorbingrandom b b T ⊆V suchthattheirexpressedopinionsarefixedto1, walk interpretation of the Campaign problem. Recall wehavethatB =V∪T,f =s forallb∈V andf =1 that by the definition of the Campaign problem, every b b b for all b∈T. Since the set of nodes V is always part of node i ∈ T becomes an absorbing node with value the set B, and the parameter that we are interested in z =1. For every other node j ∈/ T we compute a value i for this proof is the target set of nodes T, we will use z , which is the expected value at the absorption point j z(i | T) to denote z(i | B) where B = V ∪T. We thus ofarandomwalkthatstartsfromj. Sinces =0forall i have i ∈ V and z = 1 for all i ∈ T, the value z represents (cid:88) i j g(z |T)= z(i|T). the probability that the random walk starting from j i∈V will be absorbed in some node in T. NotethatthesummationisoverallnodesinV including Now suppose that T is a vertex cover for G . VC the nodes in T. If i ∈ T, then P (i | i) = 1, and Then, for every non-absorbing vertex j ∈ V \ T, for B P (j | i) = 0 for all j (cid:54)= i. Therefore, z(i | T) = 1 for each one of the d edges (j,i)∈E incident on j, it must B all i∈T. be that i ∈ T; otherwise edge (j,i) is not covered, and We now make the following key observation. Let j T is not a vertex cover. Therefore, in the augmented be a non-absorbing node in V \T. We have that graph H, node j is connected to d nodes in T, and to nodeσj,allofthemabsorbing. Arandomwalkstarting (cid:88)(cid:0) (cid:1) z(i|T)= P (b|i)+P (j |i)P (b|j) f . from j will be absorbed and converge in a single step. B∪{j} B∪{j} B b The probability of it being absorbed in a node in T is b∈B zj = d+d1. There are n−k nodes in V \T, therefore, Theequationabovefollowsfromtheobservationthatwe g(z |T)=(n−k) d . can express the probability of a random walk starting d+1 If T is not a vertex cover for GVC, then there is an from i to be absorbed in some node b ∈ B as the sum edge (j,(cid:96))∈E, such that j,(cid:96)(cid:54)∈T. As we noted before, of two terms: (i) the probability P (b | i) that the B∪{j} zj is the probability of being absorbed in some node in random walk is absorbed in b, while avoiding passing T. The transition probability of the edge (j,σj) is d+11, through j (thus we add j in the absorbing set); (ii) therefore, node j has probability at least 1 of being the probability P (j | i) that the random walk d+1 B∪{j} absorbed in σ . Since there is a path from j to σ with is absorbed in j while avoiding the nodes in B (that j (cid:96) non-zeroprobability,theprobabilityofjbeingabsorbed is, the probability of all paths that go from i to j of arbitrarylength,withoutpassingthroughjorB),times

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.