Sampling Online Social Networks via Heterogeneous Statistics

Xin Wang†, Richard T. B. Ma∗, Yinlong Xu†, Zhipeng Li†
† School of Computer Science and Technology, University of Science and Technology of China
∗ School of Computing, National University of Singapore
{yixinxa, lizhip}@mail.ustc.edu.cn, [email protected], [email protected]
arXiv:1501.02905v2 [cs.SI] 18 Dec 2015

Abstract: Most sampling techniques for online social networks (OSNs) are based on a particular sampling method on a single graph, which is referred to as a statistic. However, various realizing methods on different graphs could possibly be used in the same OSN, and they may lead to different sampling efficiencies, i.e., asymptotic variances. To utilize multiple statistics for accurate measurements, we formulate a mixture sampling problem, through which we construct a mixture unbiased estimator which minimizes the asymptotic variance. Given fixed sampling budgets for different statistics, we derive the optimal weights to combine the individual estimators; given a fixed total budget, we show that a greedy allocation towards the most efficient statistic is optimal. In practice, the sampling efficiencies of statistics can be quite different for various targets and are unknown before sampling. To solve this problem, we design a two-stage framework which adaptively spends a partial budget to test different statistics and allocates the remaining budget to the inferred best statistic. We show that our two-stage framework is a generalization of 1) randomly choosing a statistic and 2) evenly allocating the total budget among all available statistics, and our adaptive algorithm achieves higher efficiency than these benchmark strategies in theory and experiment.

I. INTRODUCTION

With the ever increasing popularity of online social networks (OSNs) in recent years, many studies have focused on the analysis of OSNs, such as estimating various properties of the users and their relationships. OSNs are usually measured via graph sampling techniques, because they are typically too large to be completely visited and OSN service providers rarely make their complete network dataset publicly visible. To guarantee the estimation accuracy, many unbiased graph sampling methods have been designed, such as the simple random walk with re-weighting (RWRW) [1, 2], the frontier sampling (FS) [3] and the random walk with uniform restarts (RWuR) [4]. However, OSNs often consist of multiple social graphs which can be sampled by different unbiased graph sampling methods. For example, in the YouTube social network, users are allowed to declare friendship with each other and create interest groups for others to join in. This creates two graphs whose edge sets correspond to 1) the mutual friendship and 2) the sharing of membership of some interest group among the users, respectively. For a given measurement target, sampling via different graphs usually has different efficiencies, which also vary as the measurement target changes. Furthermore, various graph sampling methods can be applied to the same social graph, e.g., the FS and the RWuR are both realizable in the friendship graph of LiveJournal. However, they might induce different sampling efficiencies, which are often unknown a priori. Although one can use multiple unbiased statistics, generated by different methods on different graphs, to form a heterogeneous statistic, it is unclear how one could 1) optimally allocate the sampling budgets among different statistics and 2) optimally combine them.

As we focus on unbiased estimators, we use the asymptotic variance [5] to measure the efficiency of a statistic (or its estimator). We formulate a mixture sampling problem that tries to minimize the asymptotic variance of a linearly mixed estimator, constrained by sampling budgets. Given allocated budgets for different statistics, we prove that the optimal weights of individual estimators are inversely proportional to their asymptotic variances; under a fixed total budget, we rank the allocation decisions and find that a greedy allocation is optimal, i.e., allocating more budget to the statistic with smaller asymptotic variance is always better.

However, the asymptotic variances of the statistics are usually unknown before sampling. To address this challenge, we design a two-stage framework with a pilot and a regular sampling stage. In the pilot sampling stage, we allocate part of the sampling budget to all the statistics and infer the most efficient statistic by estimating the asymptotic variance of each statistic. In the regular sampling stage, we allocate the remaining budget to the inferred most efficient statistic. Our framework is a generalization of two benchmark strategies: 1) spending all budget on a randomly chosen statistic and 2) allocating the budget among all available statistics evenly. We show that our two-stage strategies achieve higher sampling efficiency than the two benchmark strategies. Furthermore, to allocate an optimal sub-budget for the pilot sampling stage, we design an online algorithm to dynamically estimate an upper-bound of the optimal fraction during the pilot sampling.

Because the inference of the most efficient statistic is made by estimating the asymptotic variances in the pilot sampling stage, our framework is adaptive to different measurement targets. Our framework does not restrict how the estimators of asymptotic variances should be constructed, as long as they are asymptotically unbiased. To illustrate, we provide a detailed implementation and evaluate the performance of our framework in the Douban social network. The experimental results show that our technique uses only 18%-57% of the sampling budget needed by the benchmark strategies for achieving the same estimation accuracy for a range of measurement targets.

Our main contributions are as follows.
• We formulate and solve a mixture sampling problem which constructs an optimal estimator of a heterogeneous statistic to improve sampling efficiency. In particular, we derive the optimal weights of the individual estimators in the mixture estimator (Theorem 1) and the optimal allocation decisions among the statistics (Theorem 2).
• We design a two-stage framework and an adaptive algorithm (Algorithm 1) for the pilot sampling, a practical solution for the mixture sampling problem when the efficiencies of the statistics are unknown before sampling.
• We show that the two-stage strategies are asymptotically optimal (Theorem 4) and achieve higher efficiency than two benchmark strategies (Corollary 1).
• As a case study, we provide a detailed implementation of our framework and evaluate its performance in the Douban social network.

The remainder of this paper is organized as follows. Section II introduces the concepts and characteristics of unbiased graph sampling methods. Section III defines the mixture sampling problem and presents its optimal solution. With unknown efficiencies of the statistics before sampling, we design the two-stage framework and its adaptive algorithm in Section IV. Section V implements the framework and evaluates its performance in the Douban social network. Section VI reviews related work and Section VII concludes.

II. UNBIASED GRAPH SAMPLING

We denote an undirected graph in an online social network as $G = (\mathcal{V}, \mathcal{E})$ with a set of nodes $\mathcal{V} = \{1, \dots, V\}$ to represent users and a set of edges $\mathcal{E}$ to represent the relationships among the users. We denote $f$ as a property and $f_v$ as its value for user $v$. Our measurement target is to estimate the mean value of property $f$ over all users in $\mathcal{V}$, i.e., $\bar{f} \triangleq \big(\sum_{v \in \mathcal{V}} f_v\big)/V$.

We consider a graph sampling method that traverses the nodes of the graph via a random walk, which generates a discrete-time stochastic process $\{X_t\}_{t \in \mathbb{N}}$ with the state space $\mathcal{V}$, i.e., $X_t \in \mathcal{V}$ for all $t \in \mathbb{N}$. We define the random variable $\hat{f}(m)$ as an estimator on the sample path $\{X_t : t = 1, \dots, m\}$ of $m$ samples. An estimator $\hat{f}(\cdot)$ is unbiased iff $E[\hat{f}(m)] = \bar{f}$ for all $m \in \mathbb{N}$ and is asymptotically unbiased if

$$\hat{f}(m) \xrightarrow{a.s.} \bar{f} \quad \text{as } m \to \infty,$$

where $\xrightarrow{a.s.}$ denotes convergence almost surely. If the process $\{X_t\}_{t \in \mathbb{N}}$ is ergodic, by the central limit theorem (CLT),

$$\sqrt{m}\,\big[\hat{f}(m) - \bar{f}\big] \xrightarrow{d} N\big(0, \sigma^2(f)\big) \quad \text{as } m \to \infty, \tag{1}$$

where $\xrightarrow{d}$ denotes convergence in distribution and $N(0, \sigma^2(f))$ denotes a normal distribution with mean 0 and variance $\sigma^2(f)$, which is defined by

$$\sigma^2(f) \triangleq \lim_{m \to \infty} m \operatorname{Var}\big(\hat{f}(m)\big). \tag{2}$$

By (1), we can infer that $\hat{f}(m) \xrightarrow{a.s.} \bar{f}$ as $m \to \infty$, i.e., $\hat{f}(m)$ is an asymptotically unbiased estimator of $\bar{f}$. It also shows that the distribution of $\sqrt{m}\,\hat{f}(m)$ is asymptotically normal with variance $\sigma^2(f)$, which approximately determines how many samples are required to achieve a certain level of accuracy for the estimator $\hat{f}(m)$. Thus, we use the asymptotic variance $\sigma^2(f)$ to measure the efficiency of an asymptotically unbiased graph sampling method (or its estimator) in this paper.

In the next two sections, we formulate and solve a mixture sampling problem, based on which we design a two-stage framework to sample via multiple statistics. The estimators of these statistics can be based on very different asymptotically unbiased sampling methods on different graphs.

III. MIXTURE SAMPLING PROBLEM

We consider an objective of measuring the mean value of property $f$ over the users, i.e., $\bar{f}$ defined earlier. We refer to an asymptotically unbiased sampling method on a social graph as a statistic, and assume there are $K$ types of statistics that can be applied in the OSN. For any statistic $k$, we denote the random variable $\hat{f}_k(m_k)$ as the value of its estimator given $m_k$ samples and $\sigma_k^2(f)$ as its asymptotic variance. We simplify the notation $\sigma_k^2(f)$ as $\sigma_k^2$ when we focus on a single property $f$. Because each estimator $\hat{f}_k(m_k)$ is asymptotically unbiased, we use the asymptotic variance $\sigma_k^2$ as a metric for comparing the efficiencies of these statistics. If the asymptotic variance $\sigma_i^2$ is smaller than $\sigma_j^2$, we say statistic $i$ is more efficient than statistic $j$ for estimating $\bar{f}$. Furthermore, we denote $k^*$ as the most efficient statistic, i.e., $\sigma_{k^*}^2 = \min\{\sigma_k^2 : k = 1, \dots, K\}$.

A. Mixture Sampling Problem

Suppose we have a total sampling budget¹ of $M$ samples and $K$ types of candidate statistics. We consider the mixture sampling problem of how to allocate the sampling budget among different statistics and how to construct an unbiased estimator $\hat{f}$ for $\bar{f}$ so as to minimize its asymptotic variance.

¹ We assume that one unit of the budget is the cost of visiting a node.

We denote $\mathbf{a} = (a_1, \dots, a_K)$ as a budget allocation decision, where each $a_k \ge 0$ defines the fraction of the total budget allocated to statistic $k$. We define $\mathcal{K}_{\mathbf{a}} \triangleq \{k : a_k > 0\}$ to be the set of active statistics. Thus, each active statistic $k$ has a budget $m_k = a_k M$ and an estimator $\hat{f}_k(m_k)$. Because the sum of the budgets allocated to the statistics cannot exceed the total budget, we define the constraint set of the allocation decisions as $\mathcal{A} \triangleq \{\mathbf{a} \mid \sum_{k=1}^{K} a_k \le 1;\ a_k \ge 0\ \forall k = 1, \dots, K\}$.

Given a vector $\hat{\mathbf{f}} = (\hat{f}_1, \dots, \hat{f}_K)$ of estimators, we consider a mixed estimator $\hat{f}(\mathbf{w})$ which linearly combines the individual estimators by a weight vector $\mathbf{w} = (w_1, \dots, w_K)$, defined as

$$\hat{f}(\mathbf{w}) \triangleq \sum_{k=1}^{K} w_k \hat{f}_k. \tag{3}$$

Each weight $w_k$ is used to determine the relative importance of the individual estimator $\hat{f}_k$. Under a total budget $M$ and an allocation decision $\mathbf{a}$, we define the mixture estimator with weights $\mathbf{w}$ as

$$\hat{f}(\mathbf{a}, M, \mathbf{w}) \triangleq \sum_{k \in \mathcal{K}_{\mathbf{a}}} w_k \cdot \hat{f}_k(m_k) = \sum_{k \in \mathcal{K}_{\mathbf{a}}} w_k \cdot \hat{f}_k(a_k M). \tag{4}$$

We define the asymptotic variance of the above estimator as

$$\varsigma(\mathbf{a}, \mathbf{w}) \triangleq \lim_{M \to \infty} M \cdot \operatorname{Var}\big(\hat{f}(\mathbf{a}, M, \mathbf{w})\big). \tag{5}$$
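The definitions of the allocation constraint set, the active set, the per-statistic budgets and the mixture estimator in Equation (4) can be illustrated with a minimal Python sketch. The function names (`is_feasible`, `active_set`, `budgets`, `mixture_estimate`) and the numbers in the usage lines are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import List, Sequence


def is_feasible(allocation: Sequence[float], tol: float = 1e-12) -> bool:
    """Check membership in the constraint set A: a_k >= 0 and sum_k a_k <= 1."""
    return all(a >= 0 for a in allocation) and sum(allocation) <= 1 + tol


def active_set(allocation: Sequence[float]) -> List[int]:
    """Indices k with a_k > 0 (the set of active statistics K_a)."""
    return [k for k, a in enumerate(allocation) if a > 0]


def budgets(allocation: Sequence[float], total_budget: float) -> List[float]:
    """Per-statistic budgets m_k = a_k * M."""
    return [a * total_budget for a in allocation]


def mixture_estimate(estimates: Sequence[float],
                     weights: Sequence[float],
                     allocation: Sequence[float]) -> float:
    """Mixture estimator of Eq. (4): weighted sum over the active statistics only.

    estimates[k] is assumed to be the value of the k-th individual estimator
    computed from its own m_k = a_k * M samples.
    """
    if not is_feasible(allocation):
        raise ValueError("allocation must satisfy a_k >= 0 and sum(a) <= 1")
    return sum(weights[k] * estimates[k] for k in active_set(allocation))


# Hypothetical example: three statistics, the third one receives no budget.
allocation = [0.5, 0.5, 0.0]
estimates = [10.0, 12.0, 99.0]   # the third value is ignored (inactive statistic)
weights = [0.25, 0.75, 0.0]
mixed = mixture_estimate(estimates, weights, allocation)
```

In this sketch an inactive statistic never contributes to the result, whatever weight it carries, which mirrors the sum over $k \in \mathcal{K}_{\mathbf{a}}$ in Equation (4).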
If each fˆ is asymptotically unbiased, we hope that the thesecondoptimizationproblem(7)couldbestatedasfinding k constructed mixture estimator fˆ(a,M,w) would still be the optimal allocation a∗ that solves: asymptotically unbiased. We denote the set Wa to be the Minimize ς(a,w∗(a)) subject to a∈A. (11) domain of weights under the budget allocation a such that Intuitively, an optimal solution should allocate more budgets for every w ∈Wa, fˆ(a,M,w) is asymptotically unbiased. to the more efficient statistic. The next result shows that a Our design goal is to construct the optimal unbiased esti- greedy strategy that allocates all budgets to the statistic with mator fˆ(a,M,w) whose asymptotic variance ς(a,w) could the smallest asymptotic variance is actually optimal. be minimized. We formulate two related mixture sampling Theorem 2. Assume that the conditions of Theorem 1 hold. problems as follows. In the first problem, we consider a given allocation decision a and we denote ςa(w) (cid:44) ς(a,w). The Denote {σ(2k)}Kk=1 as the relabeled set of asymptotic variance objective is to find the optimal weights w∗ that solve: of {σk2}Kk=1 with an ascending order. For any allocation decisions a and a˜ satisfying (cid:80)i a ≥ (cid:80)i a˜ for Minimize ςa(w) subject to w ∈Wa. (6) k=1 (k) k=1 (k) i=1,2,··· ,K, we have In the second problem, the objective is to find the optimal allocation decision a∗ and the corresponding optimal weights ς(a,w∗(a))≤ς(a˜,w∗(a˜)). w∗(a∗) that solve: In particular, the optimal allocation a∗, which solves the Minimize ς(a,w) optimization problem in Equation (7), satisfies a∗k =1{k=k∗} (7) with the minimum asymptotic variance subject to a∈A and w ∈W . a ς(a∗,w∗(a∗))=σ2 . The first problem can be regarded as a sub-problem of the k∗ second one, where the allocated decision is predetermined. Theorem 2 states that an allocation decision a is more efficient, i.e., it induces a smaller ς(a,w∗(a)), if it allocates B. 
Optimal Weights and Allocation Decisions more budgets to more efficient statistics. In particular, if we greedily allocate all budgets to the most efficient statistic k∗, Inthissubsection,wesolvetheoptimalweightstoconstruct the asymptotic variance ς(a,w∗(a)) will be minimized. anestimatorandtheoptimalbudgetallocationformaximizing the efficiency of an estimator. Under a fixed budget allocation Theorem 1 and 2 show that the optimal solutions are decision a, intuitively, a larger weight wk should be given to closely related to the asymptotic variances σk2 of the indi- an estimator fˆ if statistic k is more efficient, i.e., its asymp- vidual statistics, and the directions for decreasing ς(a,w) k totic variance σ2 is smaller. The following result provides an are allocating as much budget to statistic k∗ as possible and k affirmative answer to the intuition. weighting the individual estimators inversely proportional to theirasymptoticvariances.However,theasymptoticvariances Theorem 1. Assume all the pure estimators fˆk are inde- σ2 are usually unknown before sampling. To address this pendent of each other. The mixture estimator fˆ(a,M,w) is chkallenge,weproposeatwo-stageframework,whereweinfer asymptotically unbiased for f¯ if and only if the domain of the best statistic k∗ in the first stage before allocating all the weights under an allocation decision a satisfies remaining budget greedily in the second stage. (cid:40) (cid:41) (cid:88) Wa = w| wk =1 . (8) IV. ADAPTIVETWO-STAGEFRAMEWORK k∈Ka Inthissection,wefirstexplain thebasicconceptsofatwo- Itsasymptoticvariancecanbe characterizedbyafunctionof stageframeworkandthenshowtheframeworkachieveshigher the allocation a and the weight vector w, defined as (cid:88) w2 sampling efficiency than two benchmark strategies, finally we ς(a,w)= k ·σ2. 
(9) propose an adaptive algorithm to determine an upper-bound a k k k∈Ka of the optimal budget fraction which is allocated to the first The optimal solution w∗ of the optimization problem in stage. Equation (6) satisfies w∗ = ak/ (cid:88) ai, ∀k ∈K , (10) A. TwoBenchmarkStrategiesandATwo-StageGeneralization k σ2 σ2 a k i∈Ka i Withoutknowingtheasymptoticvariancesσk2oftheindivid- and the corresponding minimum asymptotic variance is ualstatistics,westartwithtwonaivestrategiesasbenchmarks. (cid:34) (cid:35)−1 ς (w∗)= (cid:88) ak . The first strategy spends all budget M on a randomly chosen a σ2 statistic k; the second strategy evenly divides the budget M k∈Ka k amongK statisticstoconstructthemixtureestimator.Wecall Theorem1showsthattoguaranteethemixtureestimatorto these two benchmark strategies as the Random Statistics (or be asymptotically unbiased, the sum of weights of the active RND) and Average Statistics (or AVG), respectively. statistics must be one. It also tells that when the allocation Based on the two benchmark strategies, we consider a two- decision a is fixed, the optimal weight w∗ of each estimator stagegeneralization,whichspendsapartialbudgettoestimate k fˆ(m ) is proportional to a and inversely proportional to the best statistic k∗ in a pilot sampling stage and allocates the k k k its asymptotic variance σ2. Based on Theorem 1, we denote remaining budget to an estimated best statistic kˆ∗ in a regular k w∗(a) to be the optimal solution of (6) defined in (10) and sampling stage. We assume that a fraction c ∈ [0,1] of the total budget M is allocated for pilot sampling and name the Given any strategy c∈[0,1], we can define the (unknown) cM samples as the pilot budget. We evenly allocate the pilot optimalallocationdecisionasa∗(c)=(a∗(c),...,a∗ (c))as 1 K budget among all K statistics, and therefore, each statistic a∗(c)(cid:44)c/K+(cid:0)1−c(cid:1)·1 , ∀k =1,...,K. 
k {k=k∗} k is allocated a budget of mk = cM/K samples in this Under this optimal allocation a∗(c), by Theorem 1, the stage. We use these pilot samples to make an asymptotically corresponding optimal weight vector becomes w∗(a∗(c)). uthnebieasstiemdaetsetdimvaatleueofbyeaσˆckh2(amsykm).pMtootisctvliakreialyn,cteheσk2s,taatinsdticdewfiinthe iInntuthiteivpeillyo,twsahmenpliangbusdtaggeet,cmMk i=s ucMsed/Ktofeosrtiamnaytesteaaticshticσkk2 thesmallestestimatedasymptoticvariancetendstobethemost andthebeststatistick∗ ismorelikelytoinduceasmalleresti- efficient statistic k∗ for estimating f¯. We call this statistic mated asymptotic variance σˆ2(m ) than other statistics. Con- the inferred most efficient statistic and denote it as kˆ∗(cM), sequently,theresultingallocaktionka(c)andweightswˆ∗(c)are parameterizedbythepilotsamplingbudgetcM.Intheregular more likely to be equal to the optimal a∗(c) and w∗(a∗(c)), samplingstage,weallocatealltheremainingsamplingbudget respectively.Weconsiderthetwo-stagestrategycasafunction (1−c)M to the inferred most efficient statistic kˆ∗, and fully of the total budget M, denoted as c(M), and simplify the use the total budget M to construct a mixture estimator. notation a∗(c(M)) as a∗(M). The next theorem shows that Under the above two-stage framework, we denote a(c) as when the pilot budget fraction c is higher than the order of the effective allocation decision, defined by M−1, the two-stage strategy c(M) is asymptotically optimal. a (c)(cid:44)c/K+(1−c)·1 , (12) k {k=kˆ∗(cM)} Theorem4. Assumeeachestimatedasymptoticvarianceσˆ2(·) through which we can define the effective budget for each is asymptotically unbiased for σ2 (k =1,··· ,K), i.e., k statistic k as mk(c)(cid:44)ak(c)M naturally. After both sampling σˆ2(m )−a−.s→. σ2 aks m →+∞. stages,weconstructamixtureestimatorbyusinganestimated k k k k Ifc(M)∈ω(cid:0)M−1(cid:1),i.e.,forallδ >0,thereexistsapositive optimal weight vector wˆ∗(c). 
We use the estimated value number M(cid:48) such that c(M)≥δM−1 for all M >M(cid:48), σˆ2(m ) to approximate σ2, and define wˆ∗(c) by substituting σk2 wikth σˆ2(m ) in the opktimal weight of Equation (10) as kˆ∗ −a−.s→. k∗ , a(c(M))−a−.s→. a∗(M) and k k k wˆ∗(c)(cid:44) ak(c) / (cid:88) ai(c) , ∀k ∈K . (13) wˆ∗(cid:0)c(M)(cid:1)−a−.s→. w∗(cid:0)a∗(M)(cid:1) as M →+∞. k σˆ2(m (c)) σˆ2(m (c)) a k k i∈Ka i i Theorem4showsthatasthetotalbudgetM grows,toguar- Consequently, the corresponding mixture estimator and its antee an (asymptotic) optimal two-stage strategy, the fraction asymptotic variance can be written as fˆ(a(c),M,wˆ∗(c)) and cforthepilotbudgetdoesnotneedtobelarge.Thecondition ς(a(c),wˆ∗(c)), respectively. c(M)∈ω(M−1)ensuresthatthepilotbudgetc(M)M grows The two-stage framework actually uses the AVG and RND with M unboundedly as M goes to infinity, although c itself strategiesinitspilotandregularsamplingstages,respectively. could approaches zero, such that the estimated asymptotic In particular, the estimated statistic kˆ∗ plays the role of a variance σˆ2(m ) will converge to σ2. Consequently, the two- k k k randomstatisticintheRNDstrategy.Also,theframeworkcan stagestrategyc(M)willidentifythemostefficientstatistick∗ be seen as a generalization of the two benchmark strategies, viathepilotsamplingandsettheoptimalallocationa∗(cid:0)c(M)(cid:1) because the Average and Random Statistics are equivalent to and optimal weight w∗(cid:0)a∗(c(M))(cid:1) for the mixture estimator. a two-stage strategy of c=1 and c=0, respectively. When we simply give the same weight for each sample point, for any allocation a, the corresponding weight vector Theorem 3. The asymptotic variances of the Random becomesw =a,whichareproportionaltotheirsamplesizes. Statistics and Average Statistics are ς(a(0),wˆ∗(0)) and To distinguish the benefit of choosing an optimal allocation ς(a(1),a(1)), respectively. 
They satisfy a∗ and an optimal weight w∗, we consider an intermediate E(cid:2)ς(a(0),wˆ∗(0))(cid:3)=ς(a(1),a(1))= 1 (cid:88)K σ2. mixtureestimatorfˆ(cid:0)a,M,a(cid:1),whichgetsaffectedonlybythe K k allocation decision a and has an asymptotic variance ς(cid:0)a,a(cid:1). k=1 Theorem 3 states that the expected asymptotic variance of Corollary1. UndertheconditionsofTheorem4,foranypilot the Random Statistics and the asymptotic variance of Average fraction c(M)∈ω(M−1), as M →+∞, we have Statistics both equal the average of the asymptotic variances ς(cid:16)a(cid:0)c(M)(cid:1),wˆ∗(cid:0)c(M))(cid:17)−a−.s→. ς(cid:16)a∗(cid:0)M(cid:1),w∗(cid:0)a∗(M)(cid:1)(cid:17), of all individual statistics. and the asymptotic limit of ς satisfies K B. Asymptotic Performance of Two-Stage Strategies ς(a∗(M),w∗(a∗(M)))≤ς(a∗(M),a∗(M))≤ 1 (cid:88)σ2 K k Our two-stage framework does not restrict how the asymp- k=1 toticvariancesσk2 areestimatedinthepilotsamplingstage.We and ς(cid:0)a∗(M),w∗(a∗(M))(cid:1)≤ Kσk2∗ . will show that as long as σˆ2(·) is an asymptotically unbiased K+(1−K)c(M) k estimator for σ2, the two-stage strategies will outperform the k twobenchmarkstrategies.Thedetaileddesignoftheestimator As a consequence of Theorem 4, Corollary 1 shows that as σˆ2(·) may depend on the sampling method of statistic k, and M grows, the asymptotic variance ς induced by the strategy k we will give an example of implementation in a later section. c(M)convergestoanoptimalvalueς(a∗(M),w∗(a∗(M))). Thefirstinequalityimpliesthat1)usingtheestimatedoptimal Algorithm 1 Adaptive Two-Stage Sampling (M,∆M) weight wˆ∗(c(M)) is more efficient than the equal weight 1: c←∆M/M; w = a, and 2) using w = a is again more efficient than 2: spend ∆M budget for pilot sampling; the two benchmark strategies, whose (expected) asymptotic 3: while c<cˆ∗(cM) do variances equal 1 (cid:80)K σ2 as shown in Theorem 3. 
The c←c+∆ /M; K k=1 k M second inequality provides an upper-bound for the optimal ς, spend ∆ more budget for pilot sampling; M which can be derived from an estimator fˆkˆ∗(cid:0)akˆ∗(c(M))M(cid:1) end while which only uses the samples of the inferred best statistic kˆ∗ 4: choose the estimated best statistic kˆ∗; and throws out the samples of other statistics collected in the 5: spendtheremainingbudget(1−c)M forregularsampling; pilot sampling stage. C. Optimal Fraction for Pilot Budget Algorithm 1 performs the pilot sampling in an adaptive Our design of any two-stage strategy c(M) ∈ ω(M−1) is manner.Ittakestwoinputparameters:thetotalbudgetM and abudgetspendingstepsize∆ ∈(0,M).Wedenotecˆ∗(·)asa asymptotically optimal. However, a more practical problem is M functionwhereeachcˆ∗(m)providesanestimatedupper-bound that, given a finite budget M, how to choose an optimal frac- tion c∗(M) for the pilot budget that maximizes the efficiency oftheoptimalfractionc∗(M),whenmnumberofsamplesare for the mixture estimator fˆ, i.e., c∗(M) solves: used.Instep3,weincreasethepilotbudgetby∆M ifthespent Minimize Var(cid:0)fˆ(a(c),M,wˆ∗(c))(cid:1), fractioncissmallerthanthederivedupper-boundcˆ∗(cM)for (14) c∗(M),untilcexceedstheupper-boundcˆ∗(cM).Basedonthe subject to c∈[0,1]. cM samples generated in the pilot sampling stage, we choose On the one hand, when allocating more budget for the pilot theestimatedbeststatistic kˆ∗ andspendtheremaining budget sampling,eachσˆ2(·)couldprovideamoreaccurateestimation k (1−c)M for regular sampling as usual. In general, given any for the asymptotic variance σk2 and the best statistic k∗ would m pilot samples, the function cˆ∗(·) uses them to estimate an have a higher chance to be picked out in the regular sampling optimal fraction c∗(m(cid:48)) for some m(cid:48) <m. Because c∗(·) has stage. 
On the other hand, increasing the pilot budget means a decreasing trend in general, we could use this estimation thatmorebudgetwillbeallocatedtosomeinefficientstatistics of c∗(m(cid:48)) as an upper-bound for c∗(M) so as to determine at the pilot sampling stage. One needs to balance the above whether the pilot sampling stage should end. As the sampling contradictory conditions so as to obtain an optimal fraction budget cM increases, the estimation cˆ∗(cM) should decrease c∗(M). In practice, it is hard to obtain the exact value of and approach c∗(M), because it estimates some c∗(m(cid:48)) and the optimal fraction for the pilot budget c∗(M), because it m(cid:48) increases. Notice that our algorithm does not restrict how depends on the unknown values of asymptotic variances σk2. the upper-bound estimation cˆ∗(·) should be implemented, and However, we will provide a heuristic algorithm to estimate we will provide an example of implementation which we use c∗(M)effectively,whichisbasedonthefollowingtheoretical in our evaluation in a later section. Finally, although a large result on the monotonicity of c∗(M). stepsize approaches c∗(M) faster, to avoid overestimating the pilot budget, a small value of ∆ should be used in practice. Theorem 5. Assume the rate of convergence of the esti- M mated asymptotic variance σˆ2(m) for each σ2 is Θ(m−ηk), k k V. EVALUATIONINDOUBANSOCIALNETWORK i.e., supx∈R+|Gσˆ2(m)(x) − Gσ2(x)| = Θ(m−ηk), where G (x)=P(cid:0)σˆk2(m)≤x(cid:1) ankd G (x)=1 are the In this section, we apply the adaptive two-stage framework σˆ2(m) k σ2 {x≥σ2} cumkulativedistributionfunctionofσˆ2(km)andσ2,reskpectively, to the Douban social network, a popular Chinese web site k k providing user comment and recommendation services for and the order η > 0. Let η=min{η : k = 1,···,K}. 
The k k optimal fraction satisfies lim c∗(M) = 0 with the rate of books,musicandmovies.Wefirstintroducemultiplestatistics M→+∞ which can be realized in Douban and then provide a detailed convergence Θ(M−1+η+11). implementation of our framework to measure the statistics, finally we evaluate the performance of the framework. Theorem5showsthattheoptimalfractionc∗(M)decreases to zero asymptotically with the rate Θ(M−1+η+11) when M A. Multiple Statistics grows. Intuitively, as the total budget M increases, to guaran- SimilartoTwitterandSinamicroblog,usersinDoubancan teethesameaccuracyforestimatingk∗,weonlyneedtokeep follow each other, and therefore, Douban can be seen as a the pilot budget cM constant and thus the fraction c becomes followship graph2 in which the edges capture the following smaller. Both Theorem 4 and 5 imply that when M becomes relationship. Douban also allows users create interest groups larger,theoptimalfractionc∗(M)shoulddecrease.Therefore, for others to join in. We consider two users who have a we assume that c∗(M) follows a decreasing trend as M common group share a membership and Douban can also be increases (In Section V, our evaluations in the Douban social seenasamembershipgraph.Thesetwodifferentsocialgraphs, networkalsosupportthisconjecturewell),basedonwhichwe propose an adaptive algorithm to dynamically determine the 2Here, we serve the followship graph as an undirected graph and one optimal fraction c∗(M) for the pilot sampling. followingrelationshipcorrespondstoanundirectededge. together with two random walk based sampling methods, the 1) Estimating the asymptotic variances: Both the pilot RWuRandFSintroducednext,providefourdifferentavailable sampling and the adaptive Algorithm 1 need to estimate the statistics. unknown asymptotic variances σ2. 
1) The random walk with uniform restarts (RWuR): The RWuR [4] is a hybrid sampling method that mixes random walk crawling and uniform node sampling. It generates a sample set $\{X_t\}_{t\in\mathbb{N}}$ as follows. At each step $t$, assume the current node is $X_t = i$. With probability $\alpha/(d_i+\alpha)$, it jumps to a node $j$ chosen uniformly from the graph and makes the transition $X_{t+1} = j$. With probability $d_i/(d_i+\alpha)$, it chooses a neighbor $k$ of $i$ uniformly, i.e., $X_{t+1} = k$. The parameter $\alpha$ ($\geq 0$) controls the probabilities of random walk and jump. In particular, when $\alpha = 0$, the RWuR is the simple random walk, and when $\alpha = +\infty$, the RWuR becomes uniform node sampling. The sample set $\{X_t\}_{t\in\mathbb{N}}$ is biased towards high-degree nodes. To correct the bias, the RWuR uses the Hansen-Hurwitz estimator [6, 7] to re-weight the samples, i.e., the weight of the sample $X_t$ is inversely proportional to $d_{X_t}+\alpha$, and the unbiased estimator for $\bar f$ is

$$\hat f(m) = \sum_{t=1}^{m} \frac{f_{X_t}}{d_{X_t}+\alpha} \Bigg/ \sum_{t=1}^{m} \frac{1}{d_{X_t}+\alpha}. \tag{15}$$

2) The frontier sampling (FS): The FS [3] is a distributed sampling method that runs $s$ ($\in \mathbb{N}$) random walkers on a graph. Initially, it obtains $s$ nodes uniformly as the start nodes of the $s$ random walkers. At each step, it first randomly selects the $r$-th walker with probability $d_{v_r}/\sum_{i=1}^{s} d_{v_i}$, where $v_i$ is the current node of the $i$-th walker. Then the $r$-th walker uniformly chooses a neighbor of $v_r$ as the next sample and moves to it. Similar to the RWuR, the bias of the FS towards high-degree nodes can be corrected by the Hansen-Hurwitz estimator, and the unbiased estimator for $\bar f$ is

$$\hat f(m) = \sum_{t=1}^{m} \frac{f_{X_t}}{d_{X_t}} \Bigg/ \sum_{t=1}^{m} \frac{1}{d_{X_t}}. \tag{16}$$

The RWuR and FS samplers are less likely to get trapped in loosely connected components of a graph, by jumping randomly and by running multiple walkers, respectively. Thus, both of them usually perform better than the simple random walk with re-weighting [1, 2], but we do not know which one achieves higher sampling efficiency in an unknown graph. Besides, it is unclear how the efficiencies of the two methods vary on the followship and membership graphs. Therefore, we choose the four statistics, the RWuR and FS on the followship and membership graphs, to demonstrate our two-stage framework.

B. Implementation of Two-Stage Framework

Our adaptive two-stage strategy does not restrict how the estimators of the asymptotic variances $\hat\sigma_k^2(\cdot)$ and the upper-bound estimation of the optimal fraction $\hat c^*(\cdot)$ are constructed, as long as they are asymptotically unbiased. Next, we provide an example of detailed implementations of $\hat\sigma_k^2(\cdot)$ and $\hat c^*(\cdot)$ for measuring in Douban.

Assume the sample set used to estimate $\sigma_k^2$ is collected by $q$ ($\geq 2$) samplers whose budgets are all $l$. We denote the estimated value for $\bar f$ based on the $j$-th sampler as $\hat f_k^{(j)}(l)$ for $j = 1,\cdots,q$, which serves as a sample of the estimator $\hat f_k(l)$. Then the sample variance of $\{\sqrt{l}\,\hat f_k^{(j)}(l) : j = 1,\cdots,q\}$ is defined by

$$S_k^2(q,l) = \frac{l}{q-1} \sum_{j=1}^{q} \left[ \hat f_k^{(j)}(l) - \frac{1}{q}\sum_{i=1}^{q} \hat f_k^{(i)}(l) \right]^2. \tag{17}$$

It describes how far the values $\sqrt{l}\,\hat f_k^{(j)}(l)$ ($j = 1,2,\cdots,q$) are spread out. From the definition of asymptotic variance in Equation (2), $S_k^2(q,l)$ is an asymptotically unbiased estimate of $\sigma_k^2$, i.e.,

$$S_k^2(q,l) \xrightarrow{a.s.} \lim_{l\to\infty} l\,\mathrm{Var}\big(\hat f_k(l)\big) = \sigma_k^2 \quad \text{as } q,l \to \infty. \tag{18}$$

Thus, we can use $S_k^2(q,l)$ to estimate the asymptotic variance $\sigma_k^2$, i.e., $\hat\sigma_k^2(ql) = S_k^2(q,l)$. Also, because all unbiased graph sampling methods share the definition of the asymptotic variance in Equation (2), this implementation is applicable to any one of them.

2) Estimating upper-bound of optimal fraction $c^*(M)$: Given any sub-budget $m < M$, we provide an implementation of the upper-bound estimation function $\hat c^*(m)$ as follows. We use the budget $m$ to collect $B$ sample sets whose sizes are all $m' \triangleq \frac{m}{BK}$ for each statistic, and by using these $m$ samples, we can estimate the optimal fraction $c^*(m')$ when the total given budget is $m'$. We denote the $b$-th sample set of the $k$-th statistic as $S_k^{(b)}$. We can try different two-stage strategies with fraction $c \in (0,1)$ on the $b$-th group of sample sets $\{S_k^{(b)} : k = 1,\cdots,K\}$. Specifically, like a normal two-stage strategy of fixed $c$, we obtain $\frac{cm'}{K}$ samples from each set $S_k^{(b)}$ as the pilot sampling, use them to estimate the best statistic $k^*$, and then use the remaining $(1-c)m'$ samples of the inferred best statistic to generate a realization of the estimator $\hat f(a(c), m', \hat w^*(c))$. Finally, we calculate the sample variance of the $B$ realizations obtained from the $B$ groups of sample sets. Based on (14), we choose the fraction $c$ that minimizes the sample variance as an estimate of $c^*(m')$, which serves as an upper-bound for the optimal fraction $c^*(M)$.

When the number of realizations $B$ increases, the estimate $\hat c^*(m')$ of $c^*(m')$ becomes more accurate. However, the budget $m' = \frac{m}{BK}$ decreases under a fixed budget $m$, and therefore, using $c^*(m')$ as an upper-bound for the optimal fraction $c^*(M)$ could be loose. As a result, we recommend a moderate value for the parameter $B$.

C. Measurement Setup

The publicly available information for every Douban user includes the user-id, location, list of followers, list of users he/she follows, and the interest groups he/she has joined. We consider two measurement targets, i.e., the average number of followers of users and the average number of interest groups of users. To measure these targets, we develop crawlers to sample via the four statistics ($K = 4$), i.e., the RWuR and FS methods on the followship and membership graphs.
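The re-weighted estimators in Equations (15) and (16) and the variance estimate in Equation (17) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' crawler code: the function names `hansen_hurwitz` and `asymptotic_variance_estimate` and their argument layout are assumptions made for this example, and the samples are taken as already collected.

```python
def hansen_hurwitz(values, degrees, alpha=0.0):
    """Re-weighted estimate of the mean of f, following Equations (15)-(16).

    values[t]  : f evaluated at the t-th sampled node
    degrees[t] : degree of the t-th sampled node
    alpha      : restart parameter; alpha > 0 gives the RWuR weights
                 1/(d + alpha), alpha = 0 gives the FS weights 1/d
    (Function and argument names are hypothetical, chosen for this sketch.)
    """
    weights = [1.0 / (d + alpha) for d in degrees]
    weighted_sum = sum(w * f for w, f in zip(weights, values))
    return weighted_sum / sum(weights)


def asymptotic_variance_estimate(estimates, budget):
    """S_k^2(q, l) of Equation (17).

    estimates : the q per-sampler estimates of the mean of f for one statistic
    budget    : the budget l spent by each of the q samplers
    """
    q = len(estimates)
    if q < 2:
        raise ValueError("Equation (17) needs at least q = 2 samplers")
    centre = sum(estimates) / q
    return budget / (q - 1) * sum((e - centre) ** 2 for e in estimates)
```

For instance, `asymptotic_variance_estimate` would be called once per statistic with the $q$ per-sampler outputs of `hansen_hurwitz`, and the statistic with the smallest result would be the inferred best statistic.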
We ignore the users who do not have any followship or membership, as these isolated users cannot be visited via crawling.

We set the total sampling budget $M = 4\cdot 10^4$, which represents about 0.05% of the total number of Douban users³. For the statistics based on the FS method, we set the number of random walkers $s = 50$. For the statistics based on the RWuR method, we moderately set the parameter that controls the probabilities of random walk and jump to $\alpha = 0.1$. Besides, we also account for the cost of uniformly choosing a start node and of jumping to an arbitrary node in the FS or RWuR method. This cost is about 14 units of budget in the Douban network, i.e., an average of 14 randomly generated user-ids must be queried to obtain a valid one in the user-id space. In the two-stage framework, we set the number of realizations $B = 10$ and the budget spending step size $\Delta_M = 2\%M = 800$ for Algorithm 1. To estimate the asymptotic variances, we use $q = 5$ samplers.

³ As of Nov. 15, 2013, the Douban service provider reported about 79.2 million users.

We also implement the benchmark strategies, i.e., the Random Statistics and Average Statistics, and the four single-statistic strategies for comparison. To measure the estimation accuracy of the different sampling strategies, we use the Normalized Root Mean Square Error (NRMSE) [2–4], $\sqrt{E(\hat f - \bar f)^2}/\bar f$, where $\bar f$ is the true value of the measurement target and $\hat f$ is the estimated one. Because the "ground truth" $\bar f$ is not published by Douban, we calculate the NRMSE by taking as $\bar f$ the grand average of the $\hat f$ values over all samples collected via all full-length crawlers and statistics. All experiment results presented in the following are the average of 25 independent simulations, and our crawls were performed from Nov. 5th to 11th of 2013.

D. Evaluation Results

1) Performance of the Adaptive Two-Stage Strategies (ATS): Figure 1 shows that the efficiencies of different statistics may vary for different measurement targets. For example, we observe that when we measure the average number of followers of users, the RWuR method on the membership graph (RW-m) leads to higher estimation accuracy than the FS method on the followship graph (FS-f), as shown in subfigure 1(a); however, when the target is the average number of interest groups of users, the conclusion is reversed, as shown in subfigure 1(b). Thus, the efficiencies of the statistics vary as the measurement target changes, and choosing a bad statistic, e.g., the RW-f strategy for estimating the average number of followers, may lead to an inaccurate estimation. Without knowing the efficiencies of the individual statistics, Figure 1 shows that our adaptive two-stage strategy (ATS) always outperforms both benchmark strategies (AVG and RND) regardless of the measurement target. Furthermore, our strategy (ATS) is only slightly inferior to the true best statistic (RW-m for estimating the average number of users' followers or FS-f for estimating the average number of users' groups), which could be used only if the asymptotic variances of all statistics were known. Figure 1 also demonstrates that our framework adapts well to the different measurement targets in the two subfigures.

Fig. 1. NRMSE of the adaptive two-stage strategy (ATS), Average Statistics (AVG), Random Statistics (RND), and individual statistics including the RWuR and FS methods on the followship graph (RW-f and FS-f) and on the membership graph (RW-m and FS-m), when we vary the total sampling budget $M$ (x-axis: total sampling budget $M$, $\times 10^4$; y-axis: NRMSE). (a) Number of followers: measuring the average number of followers of users. (b) Number of groups: measuring the average number of interest groups of users.

Figure 2 shows the budget saving of our two-stage strategy (ATS) compared with the benchmarks (AVG and RND) and the single-statistic strategies, where they can attain the given NRMSE target. For example, when measuring the average number of followers of users in subfigure 2(a), ATS saves about 49% of the budget compared with the AVG strategy to obtain NRMSE = 0.015. In subfigure 2(b), where the target is the average number of groups of users, ATS saves about 75% of the budget compared with the RND strategy to obtain NRMSE = 0.025. In general, we observe that our ATS strategy requires only 18% to 57% of the budget needed by the benchmark strategies to achieve the same NRMSE. We also observe that when measuring the average number of followers (resp. groups) of users, the best statistic RW-m (resp. FS-f) uses the smallest amount of budget, which is consistent with the observations from Figure 1 and the result of Theorem 2.

Fig. 2. The ratio of needed budget between the two-stage strategy and others, including the Average Statistics (AVG), Random Statistics (RND), and individual statistics including the RWuR and FS methods on the followship graph (RW-f and FS-f) and on the membership graph (RW-m and FS-m), so as to attain the same NRMSE (x-axis: NRMSE; y-axis: ratio of total budget). (a) Number of followers: measuring the average number of followers of users. (b) Number of groups: measuring the average number of interest groups of users.

2) Benefit of optimal allocation decision and weights: The two-stage framework tries to improve estimation efficiency by choosing the budget allocation decision and setting estimated optimal weights for the mixture estimator. Figure 3 compares the NRMSE of our two-stage strategy when the weights are set to be equal (TS-aa) or optimally adjusted (TS-aw) with that of the AVG benchmark strategy, as the fraction $c$ of the pilot budget varies along the x-axis. We observe that the two-stage strategy with optimal weights always outperforms that with equal weights, which in turn outperforms the AVG benchmark strategy. Notice that under equal weights, $c = 0$ and $c = 1$ correspond to the RND and AVG strategies, respectively, which have the same performance, as shown in Theorem 3. In general, when $c$ increases from 0 to 1, the benefit of the two-stage strategy first increases and then decreases. This is the combined result of two competing factors: 1) increasing the pilot budget helps select the more efficient statistic at the regular sampling stage, and 2) at the same time, more budget is allocated to the inefficient statistics at the pilot sampling stage. We also observe that the benefit of using optimal weights is larger when the pilot fraction is larger. The reason is that with a larger pilot budget, more samples are spent on the inefficient statistics, and therefore optimal weights are more needed to discount those statistics.

Fig. 3. With total sampling budget $M = 4\cdot 10^4$, NRMSE of the two-stage framework with the estimated optimal weights, i.e., $\hat f(a, M, \hat w^*)$ (TS-aw), the two-stage framework with the same weight for each sample point, i.e., $\hat f(a, M, a)$ (TS-aa), and Average Statistics (AVG), when we vary the fraction of pilot budget $c$ (x-axis: fraction of pilot budget $c$; y-axis: NRMSE). (a) Number of followers: measuring the average number of followers of users. (b) Number of groups: measuring the average number of interest groups of users.

3) Effectiveness of the adaptive Algorithm 1: We implemented Algorithm 1 for estimating the optimal pilot fraction. Figure 4 shows that the estimated optimal pilot fraction $\hat c^*(t\Delta_M)$ for $c^*(M)$ (solid line) has a decreasing trend as the number of iterations $t$ increases. This is consistent with our result that the optimal pilot fraction $c^*(M)$ decreases as the budget $M$ grows. The consumed fraction of the pilot budget $c$ (dashed line) increases linearly (at a rate of $\Delta_M$) with the number of iterations. When the consumed pilot fraction $c$ exceeds the estimated upper-bound of the optimal fraction, the iteration in Algorithm 1 stops. Subfigure 4(a) (resp. 4(b)) shows that when measuring the average number of users' followers (resp. groups), the estimated optimal pilot fraction of 40% (resp. 22%) is close to the real value of 32% (resp. 16%). These results show that Algorithm 1 is effective for setting a near-optimal pilot fraction in practical two-stage sampling.

Fig. 4. With total sampling budget $M = 4\cdot 10^4$, the estimated upper bound of the optimal fraction $\hat c^*(cM)$ and the spent fraction of budget $c$ as the iteration increases in Algorithm 1 (x-axis: iteration). (a) Number of followers: measuring the average number of followers of users. (b) Number of groups: measuring the average number of interest groups of users.

4) Observations of different statistics: Finally, we provide some insights into the different statistics. Subfigure 1(a) indicates that the RWuR and FS methods perform better on the membership graph than on the followship graph when we measure the average number of followers of users. However, when the target is the average number of groups of users, the conclusion is reversed, as shown in subfigure 1(b). The reason may be that the followship (resp. membership) graph has a strong cluster feature [8] that makes the samples highly correlated in the number of the users' followers (resp. groups). This strong correlation leads to poor estimation accuracy. We also observe that, for the followship graph, the FS method achieves higher efficiency than the RWuR, while the RWuR has a smaller estimation error than the FS for the membership graph. This is because the RWuR sampler frequently restarts at an arbitrary node on a less connected graph (e.g., the followship graph), which consumes a large budget and decreases the estimation accuracy. On the other hand, the RWuR is close to a single random walker on a well connected graph (e.g., the membership graph). Compared with the FS with multiple random walkers, it saves the cost of obtaining multiple uniform start nodes and of converging to the walkers' steady state.

VI. RELATED WORK

Graph Sampling Techniques. As OSN service providers rarely make the sampling frame of an entire network publicly available, the most widely used graph sampling techniques in OSNs are crawling methods. Early graph crawling methods are based on Breadth-First Search (BFS), Depth-First Search (DFS) and Snowball Sampling (SBS) [9]. In particular, BFS has been frequently used to explore large networks, such as YouTube and Facebook [6]. However, these methods introduce a large bias towards high-degree nodes, which is difficult to correct in general graphs [10–13].

Recently, the most popular graph crawling approach is random walk-based sampling, including the simple random walk with re-weighting (RWRW) [1, 2] and the Metropolis-Hastings random walk (MHRW) [14]. RWRW is considered a special case of Respondent-Driven Sampling (RDS) [1] if only one neighbor is chosen in each iteration and revisiting nodes is allowed. It is also biased towards high-degree nodes, but the bias can be corrected by the Hansen-Hurwitz estimator shown in [6, 7]. RWRW has been used to sample not only OSNs [7, 12], but also P2P networks and the Web [15, 16]. MHRW is based on the Metropolis-Hastings (MH) algorithm and provides unbiased samples directly [2, 14]. Some studies [1, 2] have shown that RWRW estimates are more accurate than MHRW estimates.

Improvement of sampling efficiency. Researchers have proposed several methods to improve the sampling efficiency of random walk-based sampling, including the FS [3] and RWuR [4] methods, which we apply as showcases in this work. Besides, Kurant et al. [17] presented a weighted random walk method to perform stratified sampling with an a priori estimate of network information. Lee et al. [18] proposed a non-backtracking random walk, which forbids the sampler from backtracking to the previously visited node, and they theoretically guaranteed that the technique achieves higher efficiency than a simple random walk. Our work concentrates on how to combine the existing statistics (sampling methods) efficiently and thus is complementary to their approaches.

It is worth mentioning that Gjoka et al. [19] designed a multi-graph sampling technique for social networks that have multiple relation graphs. Their technique improves the convergence rate of the sampler by walking along a union graph of all relations, but it does not distinguish the efficiencies of walking on different relation graphs. In this paper, we propose the two-stage framework to select an inferred most efficient one from multiple graphs to improve sampling efficiency further.

VII. CONCLUSIONS

In this paper, we consider the problem of using multiple statistics to efficiently sample online social networks. Given a fixed sampling budget, we design budget allocation decisions and combine them to construct an optimal estimator. In particular, we formulate a mixture sampling problem which constructs the optimal mixture estimator, and derive the optimal weights and a condition for ranking budget allocation decisions for the optimal estimator. Because the asymptotic variances of the individual statistics are unknown in practice, we propose an adaptive two-stage framework, which spends a partial budget to test all different statistics in the pilot sampling stage and allocates the remaining budget to the inferred best statistic in the regular sampling stage. To optimally set the sub-budget for the pilot sampling stage, we design an adaptive algorithm to dynamically decide an upper-bound of the optimal pilot budget and test whether the pilot sampling should end. We implement the adaptive two-stage framework and evaluate its performance in the Douban network. We demonstrate, in theory and experiment, that our two-stage framework achieves higher sampling efficiency than two benchmark strategies.

REFERENCES

[1] A. H. Rasti, M. Torkjazi, R. Rejaie, N. Duffield, W. Willinger, and D. Stutzbach, "Respondent-driven sampling for characterizing unstructured overlays," Proceedings of IEEE INFOCOM, pp. 2701–2705, 2009.
[2] M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou, "Practical recommendations on crawling online social networks," IEEE Journal on Selected Areas in Communications, vol. 29, no. 9, pp. 1872–1892, 2011.
[3] B. Ribeiro and D. Towsley, "Estimating and sampling graphs with multidimensional random walks," Proceedings of the 10th ACM SIGCOMM conference on Internet measurement, pp. 390–403, 2010.
[4] K. Avrachenkov, B. Ribeiro, and D. Towsley, "Improving random walk estimation accuracy with uniform restarts," Algorithms and Models for the Web-Graph, pp. 98–109, 2010.
[5] A. Mira, "Ordering and improving the performance of Monte Carlo Markov chains," Statistical Science, pp. 340–350, 2001.
[6] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, "Measurement and analysis of online social networks," Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, pp. 29–42, 2007.
[7] A. Mohaisen, A. Yun, and Y. Kim, "Measuring the mixing time of social graphs," Proceedings of the 10th ACM SIGCOMM conference on Internet measurement, pp. 383–389, 2010.
[8] S. E. Schaeffer, "Graph clustering," Computer Science Review, vol. 1, no. 1, pp. 27–64, 2007.
[9] D. D. Heckathorn, "Respondent-driven sampling: a new approach to the study of hidden populations," Social Problems, pp. 174–199, 1997.
[10] D. Achlioptas, A. Clauset, D. Kempe, and C. Moore, "On the bias of traceroute sampling: or, power-law degree distributions in regular graphs," Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pp. 694–703, 2005.
[11] L. Becchetti, C. Castillo, D. Donato, A. Fazzone, and I. Rome, "A comparison of sampling techniques for web graph characterization," Proceedings of the Workshop on Link Analysis, 2006.
[12] M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou, "Walking in Facebook: A case study of unbiased sampling of OSNs," Proceedings of IEEE INFOCOM, pp. 1–9, 2010.
[13] M. Kurant, A. Markopoulou, and P. Thiran, "On the bias of BFS (Breadth First Search)," 22nd International Teletraffic Congress, pp. 1–8, 2010.
[14] W. K. Hastings, "Monte Carlo sampling methods using Markov chains and their applications," Biometrika, vol. 57, no. 1, pp. 97–109, 1970.
[15] M. R. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork, "On near-uniform URL sampling," Computer Networks, vol. 33, no. 1, pp. 295–308, 2000.
[16] A. H. Rasti, M. Torkjazi, R. Rejaie, and D. Stutzbach, "Evaluating sampling techniques for large dynamic graphs," Univ. Oregon, Tech. Rep. CIS-TR-08, vol. 1, 2008.
[17] M. Kurant, M. Gjoka, C. T. Butts, and A. Markopoulou, "Walking on a Graph with a Magnifying Glass: Stratified Sampling via Weighted Random Walks," Proceedings of ACM SIGMETRICS, 2011.
[18] C.-H. Lee, X. Xu, and D. Y. Eun, "Beyond random walk and metropolis-hastings samplers: why you should not backtrack for unbiased graph sampling," ACM SIGMETRICS Performance Evaluation Review, vol. 40, no. 1, pp. 319–330, 2012.
[19] M. Gjoka, C. T. Butts, M. Kurant, and A. Markopoulou, "Multigraph sampling of online social networks," IEEE Journal on Selected Areas in Communications, vol. 29, no. 9, pp. 1893–1905, 2011.
[20] E. C. Titchmarsh, The theory of functions. London, 1939, vol. 80.

APPENDIX

Proof of Theorem 1: From Equation (4), we have

$$\lim_{M\to\infty} \hat f(a,M,w) = \lim_{M\to\infty} \sum_{k\in\mathcal{K}_a} w_k \cdot \hat f_k(a_k M) = \sum_{k\in\mathcal{K}_a} w_k \cdot \lim_{M\to\infty} \hat f_k(a_k M) \xrightarrow{a.s.} \sum_{k\in\mathcal{K}_a} w_k \bar f,$$

implying that the mixture estimator $\hat f(a,M,w)$ is asymptotically unbiased for $\bar f$ if and only if $\sum_{k\in\mathcal{K}_a} w_k = 1$, which establishes Equation (8). Then from Equation (5), observe that

$$\varsigma(a,w) = \lim_{M\to\infty} M\cdot\mathrm{Var}\big(\hat f(a,M,w)\big) = \lim_{M\to\infty} M \sum_{k\in\mathcal{K}_a} w_k^2\cdot\mathrm{Var}\big(\hat f_k(a_k M)\big) = \sum_{k\in\mathcal{K}_a} \frac{w_k^2}{a_k} \lim_{M\to\infty} a_k M\cdot\mathrm{Var}\big(\hat f_k(a_k M)\big) = \sum_{k\in\mathcal{K}_a} \frac{w_k^2}{a_k}\cdot\sigma_k^2.$$

By the Cauchy-Schwarz inequality,

$$\left[\sum_{k\in\mathcal{K}_a} w_k^2\cdot\frac{\sigma_k^2}{a_k}\right]\cdot\left[\sum_{k\in\mathcal{K}_a} \frac{a_k}{\sigma_k^2}\right] \geq \left[\sum_{k\in\mathcal{K}_a} w_k\right]^2 = 1,$$

where equality holds if and only if $w_k = \frac{a_k/\sigma_k^2}{\sum_{i\in\mathcal{K}_a} a_i/\sigma_i^2}$. Thus, given an allocation decision $a$, for any weight vector $w \in \mathcal{W}_a$,

$$\varsigma_a(w) = \sum_{k\in\mathcal{K}_a} w_k^2\cdot\frac{\sigma_k^2}{a_k} \geq \left[\sum_{k\in\mathcal{K}_a} \frac{a_k}{\sigma_k^2}\right]^{-1} = \varsigma_a(w^*)$$

holds, i.e., $w^*$ solves the optimization problem in Equation (6).

Proof of Theorem 2: If the allocation decisions $a$ and $a'$ satisfy $\sum_{k=1}^{i} a_{(k)} \geq \sum_{k=1}^{i} a'_{(k)}$ ($i = 1,\cdots,K$), then

$$\sum_{k=1}^{K} \frac{a_{(k)} - a'_{(k)}}{\sigma_{(k)}^2} \geq \frac{a_{(1)} + a_{(2)} - a'_{(1)} - a'_{(2)}}{\sigma_{(2)}^2} + \sum_{k=3}^{K} \frac{a_{(k)} - a'_{(k)}}{\sigma_{(k)}^2} \geq \cdots \geq \frac{\sum_{k=1}^{K} \big[a_{(k)} - a'_{(k)}\big]}{\sigma_{(K)}^2} = 0$$

holds. Based on Theorem 1, we have

$$\varsigma(a, w^*(a)) = \left[\sum_{k\in\mathcal{K}_a} \frac{a_{(k)}}{\sigma_{(k)}^2}\right]^{-1} \leq \left[\sum_{k\in\mathcal{K}_{a'}} \frac{a'_{(k)}}{\sigma_{(k)}^2}\right]^{-1} = \varsigma(a', w^*(a')).$$

In particular, for any $a$, the allocation $a^*$ satisfies

$$\varsigma(a, w^*(a)) \geq \varsigma(a^*, w^*(a^*)) = \sigma_{k^*}^2.$$

Proof of Theorem 3: When $c = 1$, $a_k(1) = 1/K$ ($k = 1,\cdots,K$). Then we have $\varsigma(a(1), a(1)) = \frac{1}{K}\sum_{k=1}^{K} \sigma_k^2$ from Equation (9). When $c = 0$, the inferred most efficient statistic is chosen uniformly at random, i.e., $P(\hat k^*(cM) = k) = 1/K$ ($k = 1,2,\cdots,K$). Then $P(a_k(0) = 1) = 1/K$ and $E\big(\varsigma(a(0), \hat w^*(0))\big) = \sum_{k=1}^{K} P(a_k(0) = 1)\cdot\sigma_k^2 = \frac{1}{K}\sum_{k=1}^{K} \sigma_k^2$.

Proof of Theorem 4: As each estimated asymptotic variance $\hat\sigma_k^2(\cdot)$ is an asymptotically unbiased estimator of $\sigma_k^2$ ($k = 1,\cdots,K$), observe that

$$\lim_{M\to\infty} P(\hat k^* = k^*) = \lim_{M\to\infty} P\Big(\hat\sigma_{k^*}^2\big(\tfrac{c(M)M}{K}\big) \leq \hat\sigma_j^2\big(\tfrac{c(M)M}{K}\big)\ \ \forall j = 1,\cdots,K\Big) = \lim_{M\to\infty} P\big(\sigma_{k^*}^2 \leq \sigma_j^2\ \ \forall j = 1,\cdots,K\big) = 1$$

holds if $c(M) \in \omega(M^{-1})$. Thus, we have, as $M \to \infty$,

$$a_k(c(M)) \xrightarrow{a.s.} \frac{c(M)}{K} + (1 - c(M))\cdot\mathbf{1}_{\{k = k^*\}} = a_k^*(M)$$

for all $k = 1,\cdots,K$. Consequently, $\hat k^* \xrightarrow{a.s.} k^*$, $a(c(M)) \xrightarrow{a.s.} a^*(M)$ and $\hat w(c(M)) \xrightarrow{a.s.} w^*(a^*(M))$ as $M \to +\infty$.

Proof of Corollary 1: From Theorem 1, for any $c(M) \in \omega(M^{-1})$, we have $\varsigma(a^*(M), w^*(a^*(M))) \leq \varsigma(a^*(M), a^*(M))$ as $M \to \infty$, where

$$\varsigma(a^*(M), w^*(a^*(M))) = \left[\sum_{k=1}^{K} \frac{a_k^*(M)}{\sigma_k^2}\right]^{-1} \leq \left[\frac{a_{k^*}^*(M)}{\sigma_{k^*}^2}\right]^{-1} = \frac{K\sigma_{k^*}^2}{K + (1-K)c(M)},$$

$$\varsigma(a^*(M), a^*(M)) = (1 - c(M))\,\sigma_{k^*}^2 + \frac{c(M)}{K}\sum_{k=1}^{K} \sigma_k^2 = \frac{1}{K}\sum_{k=1}^{K} \sigma_k^2 - \frac{1 - c(M)}{K}\sum_{k=1}^{K} \big(\sigma_k^2 - \sigma_{k^*}^2\big) \leq \frac{1}{K}\sum_{k=1}^{K} \sigma_k^2.$$

Proof of Theorem 5: Because $E\big[\varsigma(a(c(M)), \hat w^*(c(M)))\big] = \lim_{M\to+\infty} M\cdot\mathrm{Var}\big(\hat f(a(c), M, \hat w^*(c))\big)$ from Equation (5), the fraction $c^*(M)$ minimizes $E\big[\varsigma(a(c(M)), \hat w^*(c(M)))\big]$. When the convergence rate of the estimated asymptotic variance $\hat\sigma_k^2(m)$ to $\sigma_k^2$ is $\Theta(m^{-\eta})$ ($k = 1,\cdots,K$) and $c(M) \in \omega(M^{-1})$, we have $E\big[\varsigma(a(c(M)), \hat w^*(c(M)))\big] \to E\big[\varsigma(a^*(M), w^*(a^*(M)))\big]$ as $M \to +\infty$ with the convergence rate $\Theta\big((c(M)M)^{-\eta}\big)$, from Theorem 4 and the Bounded Convergence Theorem [20]. Further, as $c^*(M)$ minimizes $E\big[\varsigma(a(c(M)), \hat w^*(c(M)))\big]$, it satisfies the first-order condition

$$\lim_{M\to+\infty} \frac{dE\big[\varsigma(a(c(M)), \hat w^*(c(M)))\big]}{dc(M)}\bigg|_{c = c^*(M)} = 0,$$

and therefore

$$\Theta\left(\frac{d(cM)^{-\eta}}{dc}\bigg|_{c = c^*(M)}\right) = \Theta\big(-\eta M^{-\eta} c^*(M)^{-\eta-1}\big) = \Theta\left(\frac{dE\big[\varsigma(a^*(M), w^*(a^*(M)))\big]}{dc}\bigg|_{c = c^*(M)}\right) = \Theta(1),$$

from which we can derive that $c^*(M) = \Theta\big(M^{-\frac{\eta}{\eta+1}}\big) = \Theta\big(M^{-1+\frac{1}{\eta+1}}\big)$ and $\lim_{M\to+\infty} c^*(M) = 0$.
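As a numerical illustration of the results proved for Theorems 1 and 2, the following Python sketch evaluates the asymptotic variance $\varsigma(a,w) = \sum_k w_k^2\sigma_k^2/a_k$ of a mixture estimator and the optimal weights $w_k^* \propto a_k/\sigma_k^2$. The function names and the example variances are assumptions chosen for illustration; they are not values measured in the paper.

```python
def mixture_variance(allocation, weights, sigma2):
    """Asymptotic variance of the mixture estimator for allocation a and weights w.

    Sums w_k^2 * sigma_k^2 / a_k over the statistics that receive budget
    (a_k > 0); statistics with a_k = 0 are assumed to carry zero weight.
    """
    return sum(w * w * s / a
               for a, w, s in zip(allocation, weights, sigma2) if a > 0)


def optimal_weights(allocation, sigma2):
    """Weights proportional to a_k / sigma_k^2, normalised to sum to one."""
    scores = [a / s for a, s in zip(allocation, sigma2)]
    total = sum(scores)
    return [x / total for x in scores]


# Hypothetical asymptotic variances of two statistics (illustrative only).
sigma2 = [1.0, 4.0]

# Even split of the budget, weights equal to the allocation (w = a).
even = [0.5, 0.5]
equal_weight_variance = mixture_variance(even, even, sigma2)

# Same split with optimal weights: never worse than the equal-weight case.
best_weight_variance = mixture_variance(even, optimal_weights(even, sigma2), sigma2)

# Greedy allocation to the most efficient statistic, as in Theorem 2.
greedy = [1.0, 0.0]
greedy_variance = mixture_variance(greedy, optimal_weights(greedy, sigma2), sigma2)
```

Under these assumed variances the three quantities are ordered as the proofs predict: the greedy allocation gives the smallest value (equal to the smallest $\sigma_k^2$), the even split with optimal weights comes next, and the even split with $w = a$ gives the average of the $\sigma_k^2$.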
