
Estimation of Vertex Degrees in a Sampled Network

Apratim Ganguly∗
Department of Mathematics and Statistics, Boston University, Boston, MA 02212
[email protected]

Eric Kolaczyk
Department of Mathematics and Statistics, Boston University, Boston, MA 02212
[email protected]

arXiv:1701.07203v1 [stat.AP] 25 Jan 2017

Abstract

The need to produce accurate estimates of vertex degree in a large network, based on observation of a subnetwork, arises in a number of practical settings. We study a formalized version of this problem, wherein the goal is, given a randomly sampled subnetwork from a large parent network, to estimate the actual degree of the sampled nodes. Depending on the sampling scheme, trivial method of moments estimators (MMEs) can be used. However, the MME is not expected, in general, to use all relevant network information. In this study, we propose a handful of novel estimators derived from a risk-theoretic perspective, which make more sophisticated use of the information in the sampled network. Theoretical assessment of the new estimators characterizes under what conditions they can offer improvement over the MME, while numerical comparisons show that when such improvement obtains, it can be substantial. Illustration is provided on a human trafficking network.

1 Introduction

Frequently it is the case in the study of real-world complex networks that we observe essentially a sample from a larger network. There are many reasons why sampling in networks is often unavoidable and, in some cases, even desirable. Sampling, for example, has long been a necessary part of studying Internet topology [3]. Similarly, its role has been long recognized in the context of biological networks, e.g., protein-protein interaction [7, 11, 13], gene regulation [17] and metabolic networks [7]. Finally, in recent years, there has been intense interest in the use of sampling for monitoring online social media networks. See [19], for example, for a representative list of articles in this latter domain.

Given a sample from a network, a fundamental statistical question is how the sampled network statistics can be used to make inferences about the parameters of the underlying global network. Parameters of interest in the literature include (but are by no means limited to) degree distribution, density, diameter, clustering coefficient, and number of connected components. For seminal work in this direction, see [4, 5].

In this paper, we propose potential solutions to an estimation problem that appears to have received significantly less attention in the literature to date: the estimation of the degrees of individual sampled nodes. Degree is one of the most fundamental of network metrics, and is a basic notion of node-centrality. Deriving a good estimate of the node degree, in turn, can be helpful in estimating other global parameters, as many such parameters can be viewed as functions that include degree as an argument. While a number of methods are available to estimate the full degree distribution under network sampling (e.g., [16, 19]), little work appears to have been done on estimating the individual node degrees. Our work addresses this gap. Formally, our interest lies in estimation of the degree of a vertex, provided that vertex is selected in a sample of the underlying graph.

∗Currently at Natera Inc.

There are many sampling designs for graphs. See [9, Ch 5] for a review of the classical literature, and [1] for a recent survey. Canonical examples include ego-centric sampling [6], snowball sampling, induced/incident subgraph sampling, link-tracing and random walk based methods [10, 14]. Under certain sampling designs where one observes the true degree of the sampled node (e.g., ego-centric and one-wave snowball sampling), degree estimation is unnecessary. In this paper, we focus on induced subgraph sampling, which is structurally representative of a number of other sampling strategies [19]. Formally, in induced subgraph sampling, a set of nodes is selected according to independent Bernoulli(p) trials at each node. Then, the subgraph induced by the selected nodes, i.e., the graph generated by selecting edges between selected nodes, is observed.
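
As a minimal sketch of the induced subgraph sampling design just described, the following Python snippet keeps each node with independent probability p and then retains only the edges between kept nodes. The adjacency representation (a dict mapping each node to a set of neighbors), the function name `induced_subgraph_sample`, and the toy graph are illustrative assumptions and not part of the paper.

```python
import random


def induced_subgraph_sample(adj, p, rng):
    """Return the induced subgraph on a Bernoulli(p) node sample.

    adj : dict mapping node -> set of neighbors (undirected graph)
    p   : sampling proportion
    rng : a random.Random instance (passed in for reproducibility)
    """
    # Independent Bernoulli(p) trial at each node.
    kept = {v for v in adj if rng.random() < p}
    # Keep only edges whose two endpoints were both selected.
    return {v: adj[v] & kept for v in kept}


# Hypothetical toy parent graph: node 0 is a hub, plus one extra edge 1-2.
parent = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}}
sample = induced_subgraph_sample(parent, 0.5, random.Random(1))
for v in sorted(sample):
    # Observed degree in the sample versus true degree in the parent graph.
    print(v, len(sample[v]), len(parent[v]))
```

The observed degree of a sampled node can never exceed its true degree, which is why some form of scale-up is needed.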

This method of sampling shares stochastic properties with incident subgraph sampling (wherein the role of nodes and edges is reversed) and with certain types of random walk sampling [14].

The problem of estimating degrees of sampled nodes has been given a formal statistical treatment in [18], for the specific case of traceroute sampling, as a special case of the so-called species problem [2]. To the best of our knowledge, a similarly formal treatment has not been applied more generally for other, more canonical sampling strategies. However, a similar problem would be estimating personal network size for a group of people in a survey. Some prior works in this direction [8, 12] consider estimators obtained by scaling up the observed degree in the sampled network, in the spirit of what we term a method of moments estimator below. But no specific graph sampling designs are discussed in these studies. We focus on formulating the problem using the induced subgraph sampling design and exploit network information beyond sampled degree to propose estimators that are better than naive scale-up estimators. Key to our formulation is a risk-theoretic framework used to derive our estimators of the node degrees, through minimizing frequentist or Bayes risks. This contribution is accompanied by a comparative analysis of our proposed estimators and naive scale-up estimators, both theoretical and empirical, in several network regimes.

We note that when sampling is coupled with false positive and false negative edges, e.g., in certain biological networks, our methods are not immediately applicable. Sampling designs that result in the selection of a fraction of edges from the underlying global network (induced and incident subgraph sampling, random walks etc.) are our primary objects of study. We use induced subgraph sampling as a rudimentary but representative model for this class and aim to simultaneously estimate the true degrees of all the observed nodes with a precision better than that obtained by trivial scale-up estimators with no network information used.

2 Degree Estimation Methods

Let us denote by $G^0 = (V^0, E^0)$ a true underlying network, where $V^0 = \{1, \ldots, N\}$. This network is assumed static and, without loss of generality, undirected.
The true degree vector is $d^0 = (d_1^0, \ldots, d_N^0)^T$. The sampled network is denoted by $G^* = (V^*, E^*)$ where, again without loss of generality, we assume that $V^* = \{1, \ldots, n\}$. Write the sampled degree vector as $d^* = (d_1^*, \ldots, d_n^*)$. Throughout the paper, we assume that we have an induced subgraph sample, with (known) sampling proportion $p$.

It is easy to see from the sampling scheme that $d_i^* \sim B(d_i^0, p)$. Therefore, the method of moments estimator (MME) for $d_i^0$ is $\hat{d}_i^{MME} = d_i^*/p$. Thus, $\hat{d}_{MME} = (\hat{d}_1^{MME}, \ldots, \hat{d}_n^{MME})^T$ is a natural scale-up estimator of the degree sequence of the sampled nodes. In this section, we propose a class of estimators that minimize the unweighted $\ell_2$-risk of the sampled degree vector and discuss their theoretical properties. We aim to demonstrate, under several conditions, that the risk minimizers are superior to the regular scale-up estimators, the former taking into account the inherent relationships inside the network.

We note that although a maximum likelihood approach to estimation is perhaps intuitively appealing, a closed form derivation of the MLE in this setting is prohibitive. Another option is to look at marginal likelihoods. But the MLEs based on univariate marginal likelihoods are essentially equivalent to the MME for this sampling scheme. We will frequently use the first and second moments of the sampled degree vector in our estimation methods. The following lemma will be useful.

Lemma 2.1. Under induced subgraph sampling, the mean and covariance matrix of the observed degree vector are

$$E(d^*) = p\,d^0 \tag{1}$$
$$\mathrm{Var}(d^*) = p(1-p)\,D^0 \tag{2}$$

where the diagonals of $D^0$ are $d_1^0, \ldots, d_n^0$ and the $(i,j)$-th off-diagonal is denoted by $d_{ij}^0$, which denotes the number of common neighbors of node $i$ and node $j$ in the network $G^0$.

2.1 Frequentist Risk Minimization

Adopting the standard definition of (unweighted) frequentist $\ell_2$ risk of an estimator $\hat\theta$ of a parameter $\theta^0$, i.e., $R(\hat\theta, \theta^0) = E\|\hat\theta - \theta^0\|^2$, the frequentist risks are calculated for a general class of estimators. We also define $R_{\mathcal{A}}(\hat\theta, \theta^0) := E\left(\|\hat\theta - \theta^0\|^2\, \mathbf{1}(G^* \in \mathcal{A})\right)$, a restricted risk function assuming the sampled graph $G^*$ is restricted to some class $\mathcal{A}$.
Our proposed candidates are the elements in the class of linear functions of the observed degree vector that minimize the risk or the restricted risk w.r.t. some class. It is expected that the optimal estimator will be a function of the parameter and hence another (naive) estimator will need to be plugged in. Our final estimate will then be a plug-in risk minimizer.

2.1.1 Univariate Risk Minimization

Here we estimate the node degrees individually, assuming that the estimate for the $i$th node is of the form $\hat{d}_i = c_i d_i^*$, where $c_i$ is a scalar and $d_i^*$ is the observed degree in the sample. Since $d_i^* \sim B(d_i^0, p)$, where $d_i^0$ is the true degree of the $i$th node,

$$R(\hat{d}_i, d_i^0) = \mathrm{Bias}^2(c_i d_i^*) + \mathrm{Var}(c_i d_i^*) = (c_i p d_i^0 - d_i^0)^2 + p(1-p)\,c_i^2 d_i^0 .$$

Differentiating w.r.t. $c_i$ and equating to 0, we get the optimal $c_i^* = \dfrac{d_i^0}{p d_i^0 + 1 - p}$. Plugging in the MME of $d_i^0$, we get the plug-in univariate risk minimizer

$$\hat{d}_{i,u,P} = \frac{d_i^{*2}}{p\,(d_i^* + 1 - p)} .$$

Taylor expanding the above formula (during Taylor expansions of functions of $d_i^*$, we will assume that $d_i^*$ is concentrated around its mean, so that the Taylor expanded approximation is close) and taking expectation, we see that

$$E\left(\hat{d}_{i,u,P}\right) = E\left[\frac{d_i^{*2}}{p\,(d_i^* + 1 - p)}\right] = \frac{1}{p} E\left[d_i^* \left(1 + \frac{1-p}{d_i^*}\right)^{-1}\right] \approx \frac{1}{p} E\left[d_i^* \left(1 - \frac{1-p}{d_i^*}\right)\right] = d_i^0 - \frac{1-p}{p} .$$

The above calculation suggests that an adjustment needs to be made to $\hat{d}_{i,u,P}$ by bias-correction, so that its risk becomes comparable to that of $\hat{d}_i^{MME}$. In fact, we will show in Proposition 3.1 that our bias-corrected plug-in estimator has a lower risk than the MME when the true degree is bigger than a lower bound, which can be expressed as a closed form function of the sampling proportion. Ultimately, our proposed univariate risk minimizer is given by

$$\hat{d}_{i,u} = \frac{d_i^{*2}}{p\,(d_i^* + 1 - p)} + \frac{1-p}{p} \tag{3}$$

2.1.2 Multivariate Risk Minimization

We extend the idea presented in the previous section to the multivariate case, in order to minimize the overall $\ell_2$ sum over all sampled nodes. The rationale for this extension is to exploit the covariance structure we derived in Lemma 2.1 in estimating the degree vector.
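
Returning briefly to the univariate case, the scale-up estimator and the bias-corrected univariate risk minimizer (3) are simple enough to compare numerically. The sketch below is illustrative only: the function names and the helper `exact_risk`, which evaluates the $\ell_2$ risk exactly by summing over the Binomial($d_i^0$, p) distribution of the observed degree, are assumptions introduced here, and 0 < p < 1 is assumed throughout.

```python
import math


def mme(d_obs, p):
    """Method of moments (scale-up) estimate d*/p."""
    return d_obs / p


def univariate_risk_minimizer(d_obs, p):
    """Bias-corrected plug-in estimate, equation (3). Assumes 0 < p < 1."""
    return d_obs ** 2 / (p * (d_obs + 1 - p)) + (1 - p) / p


def exact_risk(estimator, d_true, p):
    """E[(estimator(d*, p) - d_true)^2] with d* ~ Binomial(d_true, p)."""
    return sum(
        math.comb(d_true, k) * p ** k * (1 - p) ** (d_true - k)
        * (estimator(k, p) - d_true) ** 2
        for k in range(d_true + 1)
    )


# Hypothetical setting: true degree 10, sampling proportion 0.5.
for name, est in [("MME", mme), ("URM", univariate_risk_minimizer)]:
    print(name, exact_risk(est, 10, 0.5))
```

Because the risk is computed by exact enumeration rather than simulation, the comparison does not depend on a random seed; other values of the true degree and p can be substituted to explore where each estimator does better.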

Accordingly, we consider all estimates of the form $\hat{d} = A d^*$, where $A$ is an $n \times n$ matrix. Using Lemma 2.1, we get the $\ell_2$ risk

$$R(\hat{d}, d^0) = (pA - I)\, d^0 d^{0T} (pA - I)^T + p(1-p)\, A D^0 A^T$$
$$= A\left(p^2 d^0 d^{0T} + p(1-p) D^0\right) A^T - p\left(d^0 d^{0T} A^T + A\, d^0 d^{0T}\right) + \text{constant}.$$

The multivariate risk minimizer is defined as

$$A^* = \arg\min_A \sum_{i=1}^n E\left(\hat{d}_i - d_i^0\right)^2 = \arg\min_A \mathrm{tr}\left(R(\hat{d}, d^0)\right).$$

Differentiating the objective function w.r.t. $A$ and equating it to 0, we get

$$A^* = p\, d^0 d^{0T} \left(p^2 d^0 d^{0T} + p(1-p) D^0\right)^{-1}.$$

Plugging in the MME of $d^0$ and $D^0$, we get the plug-in multivariate risk minimizer

$$\hat{d}_m = \frac{1}{p}\, d^* d^{*T} \left(d^* d^{*T} + D^*\right)^{-1} d^* , \tag{4}$$

where $d_{ij}^*$ denotes the number of common neighbors of node $i$ and node $j$ in the sample, and $D^*$ is given by a matrix whose diagonals are $d_i^*$ and whose off-diagonals are $d_{ij}^*$, $i, j \in \{1, \ldots, n\}$, $i \neq j$.

2.2 Bayes Risk Minimization

In this section, we propose a Bayesian solution to our estimation problem, by putting a prior on the degree distribution. The principal motivation behind this approach is the desire to incorporate additional information on global network structure, where the natural candidate in this context is the degree distribution. In case such a subjective prior is not available, an estimate of the degree distribution may be used. We propose and analyze estimators based on both known (subjective) and estimated degree distributions below.

First, let us assume that we know the degree distribution $\pi(\cdot)$ of the underlying network. Under the assumption that the true degree of node $i$ follows $\pi(\cdot)$, and under induced subgraph sampling of $G$, the conditional distribution of $d_i^* \mid d_i$ is $B(d_i, p)$. Then it can be easily shown that the Bayes estimator under squared error loss is

$$\hat{d}_i^B = \frac{\sum_{d_i \ge d_i^*} d_i \binom{d_i}{d_i^*} (1-p)^{d_i} \pi(d_i)}{\sum_{d_i \ge d_i^*} \binom{d_i}{d_i^*} (1-p)^{d_i} \pi(d_i)} . \tag{5}$$

If the true degree distribution is not known, then it needs to be estimated, for example using techniques described in, or similar to, [19]. Let $\hat\pi(\cdot)$ be a "reasonable" estimator for $\pi(\cdot)$. Then an empirical Bayes estimator is given by

$$\hat{d}_i^{EB} = \frac{\sum_{d_i \ge d_i^*} d_i \binom{d_i}{d_i^*} (1-p)^{d_i} \hat\pi(d_i)}{\sum_{d_i \ge d_i^*} \binom{d_i}{d_i^*} (1-p)^{d_i} \hat\pi(d_i)} . \tag{6}$$

Generally speaking, if $\xi(d_i^*; d_i)$ denotes the distribution of $d_i^*$ given $d_i$, then this empirical Bayes estimate can be expressed as

$$\hat{d}_i^{EB} = \frac{\sum_{d_i \ge d_i^*} d_i\, \xi(d_i^*; d_i)\, \hat\pi(d_i)}{\sum_{d_i \ge d_i^*} \xi(d_i^*; d_i)\, \hat\pi(d_i)} .$$

These estimators take the form of a weighted mean, as expected for Bayes estimates under quadratic loss. The weights are functionals of both sampling design and the degree distribution. For the latter estimator, only the estimated degree distribution comes into play, and thus the proposed empirical Bayes estimator incorporates the sampling and sampled network information.

3 Risk Analysis

In this section, we present results on the relative performance of our proposed estimators from a risk-theoretic perspective, and we discuss several conditions under which one outperforms the other. All these estimates will be benchmarked against the regular scale-up estimate $\hat{d}_{MME}$. Proofs may be found in the supplementary materials.

3.1 Risk of Frequentist Estimates

In the first part of our risk analysis, we look at the $\ell_2$ frequentist risk of our proposed univariate and multivariate estimators. Our main results in this section will compare the risk incurred by our proposed estimators to the scale-up estimator and discuss conditions under which our proposed estimators perform better.

Proposition 3.1. Assuming $d_i^0 > \frac{1-p}{p}$, we have $R\left(\hat{d}_{i,u}, d_i^0\right) < R\left(\hat{d}_i^{MME}, d_i^0\right)$.

In other words, the univariate risk minimizer $\hat{d}_{i,u}$ will outperform the MME when the true degree $d_i^0$ is sufficiently large.

Proposition 3.2. Let us denote the class of all sampled graphs of size $n$ (where $d_i^* \ge 1$ for all $i$, i.e., there is no isolated node) as $\mathcal{G}_n^*$. Also assume that there exists an $0 < \alpha_0 \le 1$ such that

$$\mathcal{G}_{1,n}^* = \left\{ G \in \mathcal{G}_n^* : \text{normalized eigenvectors } v_1, v_2, \ldots, v_n \text{ of } \left(d^* d^{*T} + D^*\right) \text{ satisfy } \mathbf{1}^T v_i \ge \sqrt{n}\,\alpha_0 \ \forall i \right\}$$

$$\mathcal{G}_{2,n}^* = \left\{ G \in \mathcal{G}_n^* : \frac{n^3 \alpha_0^2}{|E(G)| \left( \frac{2|E(G)|}{n-1} + n \right)} \ge 1 - \frac{(1-p)\,\lambda_{\min}(D)}{\|d^0\|^2} \right\}$$

are nonempty. Then we have $R_{\mathcal{G}_{1\cap 2,n}^*}\left(\hat{d}_m, d^0\right) \le R_{\mathcal{G}_{1\cap 2,n}^*}\left(\hat{d}_{MME}, d^0\right)$ over sampled graphs belonging to $\mathcal{G}_{1\cap 2,n}^* = \mathcal{G}_{1,n}^* \cap \mathcal{G}_{2,n}^*$.
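
The plug-in multivariate risk minimizer (4), which Proposition 3.2 compares with the MME, can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions rather than the authors' implementation: the sampled graph is assumed to be given as a dict of neighbor sets, the names `solve_linear` and `multivariate_risk_minimizer` are hypothetical, and the matrix $d^* d^{*T} + D^*$ is assumed to be invertible (the sketch raises an error otherwise).

```python
def solve_linear(matrix, rhs, tol=1e-12):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(x) for x in matrix[i]] + [float(rhs[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < tol:
            raise ValueError("d* d*^T + D* is singular; estimator (4) is undefined")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def multivariate_risk_minimizer(adj, p):
    """Equation (4) for a sampled graph adj (dict: node -> set of neighbors)."""
    nodes = sorted(adj)
    d = [len(adj[v]) for v in nodes]
    n = len(nodes)
    # D*: observed degrees on the diagonal, common-neighbor counts elsewhere.
    D = [[d[i] if i == j else len(adj[nodes[i]] & adj[nodes[j]])
          for j in range(n)] for i in range(n)]
    M = [[d[i] * d[j] + D[i][j] for j in range(n)] for i in range(n)]
    x = solve_linear(M, d)                        # (d* d*^T + D*)^{-1} d*
    scale = sum(di * xi for di, xi in zip(d, x))  # scalar d*^T x
    return {v: di * scale / p for v, di in zip(nodes, d)}


# Hypothetical sampled graph: a triangle observed with p = 0.5.
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(multivariate_risk_minimizer(triangle, 0.5))
```

For the triangle with p = 0.5 the sketch returns 3 for every node, compared with an MME of 4, which illustrates the shrinkage of the scale-up estimate. Raising an error on a singular matrix is a design choice made here for clarity; how such cases should be handled in practice is not specified in the text.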

Scrutiny of the conditions in Proposition 3.2, along with the definition of the set $\mathcal{G}_{1\cap 2,n}^*$, reveals a general characterization of the graphs where the proposed multivariate estimator performs better. It is to be noticed that $\hat{d}_m$ shrinks $\hat{d}_{MME}$ by some factor. The term on the right side of the inequality in the definition of $\mathcal{G}_{2,n}^*$ provides a lower bound on the shrinkage factor and the term on the left decreases as the cardinality of $E(G)$ increases, i.e., the graph becomes less sparse. Hence, the proposed estimator can be expected to work better than the standard scale-up estimator under the assumption of sparsity of the sampled graph. This will also be demonstrated in the simulation section.

The eigenvector condition imposes a geometric constraint on the sample degree-degree matrix $D^*$. What it essentially means is that the angle between the eigenvectors of $\left(d^* d^{*T} + D^*\right)$ and $\mathbf{1}$ should be smaller than $\arccos(\alpha_0)$. Or, in other words, by selecting an $\alpha_0$ sufficiently small but positive, our class of sampled graphs is restricted to those where the associated matrix $\left(d^* d^{*T} + D^*\right)$ has eigenvectors at least $\arcsin(\alpha_0)$ angle away from any orthogonal direction to $\mathbf{1}$. Thus, our estimator performs better for sparse graphs satisfying a mild geometric condition.

3.2 Risk of Bayes Estimate

The performance of the Bayes estimators is evaluated here under several conditions and network paradigms. Note that these estimators are compared to the regular scale-up estimator with respect to their frequentist risk functions. We start with our estimator in its most general form and state conditions on the prior degree distribution that will ensure lower risk. From that, we assess its risk when the prior degree distribution is replaced with an appropriate estimate. We also explicitly derive the Bayes estimator for the Erdős-Rényi class of random graphs and state conditions under which the Bayes estimator yields lower risk than the scale-up estimator.

Proposition 3.3. Let $d_i^0$ be the true degree of sample node $i$, and $d_i^*$, the observed degree.
Denote by $\mathcal{G}_B^*$ the class of sampled graphs where the following two conditions hold:

$$E\left[\sum_{d_i \ge d_i^*} \pi^2(d_i)\right] \le \frac{p(1-p)}{(N - 1 - d_i^0)^2}\, d_i^0 \quad \text{when } d_i^0 \le \frac{N-1}{2} ; \text{ and} \tag{7}$$

$$\frac{\sum_{d_i \ge d_i^*} p(d_i^*, d_i)\, \pi(d_i)}{\sum_{d_i \ge d_i^*} p(d_i^*, d_i)} \ge p , \tag{8}$$

where $p(d_i^*, d_i) = \binom{d_i}{d_i^*} (1-p)^{d_i}$. Then $R_{\mathcal{G}_B^*}\left(\hat{d}_i^B, d_i^0\right) \le R_{\mathcal{G}_B^*}\left(\hat{d}_i^{MME}, d_i^0\right)$ under induced subgraph sampling.

The conditions (7) and (8) essentially constrain the tail behavior of the prior degree distribution. The first condition ensures that the tail decays at a rate such that it is not too "thick" and the second condition ensures that it is not too "thin". As $d_i^0$ becomes bigger, the RHS in condition (7) becomes smaller and that is reminiscent of the sparsity property of the underlying graph, meaning that not a lot of nodes can have very high degree, an observation consistent with sparse graphs. On the other hand, the LHS in condition (8) can be interpreted as the mean of the tail probabilities weighted by the posterior distribution. This has to be bounded away from zero in order for the Bayes estimate to have lower risk than the MME.

In real problems, where the true degree distribution is unknown, one either has to choose $\pi$ subjectively or use the data to come up with a reasonable estimate. Estimating $\pi$ for a general case is beyond the scope of this paper and will not be discussed here. For our analysis, we will just assume that we have an estimate of the degree distribution at our disposal (e.g., [19]), denoted by $\hat\pi$. Using $\hat\pi$ will give us our proposed empirical Bayes estimate $\hat{d}_i^{EB}$, the behavior of which can be described as follows.

Proposition 3.4.
Let $\hat\pi(\cdot)$ be an estimate of $\pi(\cdot)$ such that $\|\hat\pi - \pi\|_\infty < \epsilon$. Then under assumption (8), with $\pi$ replaced by $\hat\pi$, we have

$$\left| \sum_{d_i \ge d_i^*} \binom{d_i}{d_i^*} (1-p)^{d_i} \hat\pi(d_i) - \sum_{d_i \ge d_i^*} \binom{d_i}{d_i^*} (1-p)^{d_i} \pi(d_i) \right| < \frac{\epsilon (1-p)^{d_i^*}}{p^{d_i^*+1}} \tag{9}$$

$$\left| \sum_{d_i \ge d_i^*} d_i \binom{d_i}{d_i^*} (1-p)^{d_i} \hat\pi(d_i) - \sum_{d_i \ge d_i^*} d_i \binom{d_i}{d_i^*} (1-p)^{d_i} \pi(d_i) \right| < \frac{\epsilon (1-p)^{d_i^*} (d_i^* + 1 - p)}{p^{d_i^*+2}} \tag{10}$$

Thus, it follows that

$$\left| \frac{\hat{d}_i^{EB} - \hat{d}_i^B}{\hat{d}_i^B} \right| < \frac{\epsilon (1-p)^{d_i^*}}{p^{d_i^*+1} \sum_{d_i \ge d_i^*} d_i \binom{d_i}{d_i^*} (1-p)^{d_i} \pi(d_i)} + \frac{\epsilon (1-p)^{d_i^*} (d_i^* + 1 - p)}{p^{d_i^*+2} \sum_{d_i \ge d_i^*} \binom{d_i}{d_i^*} (1-p)^{d_i} \pi(d_i)} \tag{11}$$

It is easily seen that with the assumption (8), the upper bound in (11) can be simplified to

$$\left| \frac{\hat{d}_i^{EB} - \hat{d}_i^B}{\hat{d}_i^B} \right| < \frac{\epsilon (1-p)^{d_i^*}}{d_i^*\, p^{d_i^*+2} \sum_{d_i \ge d_i^*} \binom{d_i}{d_i^*} (1-p)^{d_i}} + \frac{\epsilon (1-p)^{d_i^*} (d_i^* + 1 - p)}{p^{d_i^*+3} \sum_{d_i \ge d_i^*} \binom{d_i}{d_i^*} (1-p)^{d_i}} .$$

Assuming a large network, the sum in the denominator can be approximated by $\dfrac{(1-p)^{d_i^*}}{p^{d_i^*+1}}$. Then the upper bound is

$$\frac{\epsilon}{d_i^*\, p} + \frac{\epsilon (d_i^* + 1 - p)}{p^2} = \frac{\epsilon}{p} \left( \frac{1}{d_i^*} + \frac{d_i^* + 1 - p}{p} \right) .$$

From the above discussion, it is evident that if $\epsilon = o(p^2/n)$, $\hat{d}_i^{EB} \approx \hat{d}_i^B$ for all $i$ and hence their risk functions will also be close. Thus, using Proposition 3.3, it is expected that $R_{\mathcal{G}_B^*}\left(\hat{d}_i^{EB}, d_i^0\right) \lesssim R_{\mathcal{G}_B^*}\left(\hat{d}_i^{MME}, d_i^0\right)$.

3.2.1 Illustration: Erdős-Rényi Graphs

It is well known that the asymptotic degrees in Erdős-Rényi graph models follow a Poisson distribution, under standard conditions. In this section, we study the effects of using a Poisson prior degree distribution for large Erdős-Rényi graphs. The goal is to demonstrate the efficacy of the Bayesian approach compared to scale-up estimates as in the last section.
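
Before specializing to a particular prior, it may help to see that the Bayes estimator (5), and its empirical counterpart (6), is just a ratio of two weighted sums that can be evaluated numerically for any prior p.m.f. The sketch below is illustrative: the function names, the truncation point `d_max` for the infinite sums, and the example parameter values are assumptions introduced here. With a Poisson prior, the numerical value can be checked against the closed form $d_i^* + \lambda(1-p)$ given for the Poisson case in this section.

```python
import math


def bayes_degree_estimate(d_obs, p, prior_pmf, d_max):
    """Evaluate equation (5), truncating the sums at d_max (an assumption).

    prior_pmf : function d -> prior probability that the true degree is d
    """
    if d_obs > d_max:
        raise ValueError("d_max must be at least the observed degree")
    num = den = 0.0
    for d in range(d_obs, d_max + 1):
        # Weight proportional to the binomial likelihood times the prior.
        w = math.comb(d, d_obs) * (1 - p) ** d * prior_pmf(d)
        num += d * w
        den += w
    return num / den


def poisson_pmf(lam):
    """Poisson(lam) p.m.f., computed on the log scale for stability."""
    return lambda d: math.exp(-lam + d * math.log(lam) - math.lgamma(d + 1))


# Hypothetical values: observed degree 2, p = 0.3, Poisson prior with mean 5.
estimate = bayes_degree_estimate(2, 0.3, poisson_pmf(5.0), d_max=100)
print(estimate, 2 + 5.0 * (1 - 0.3))
```

Passing an estimated p.m.f. in place of `prior_pmf` gives the empirical Bayes estimate (6) with no other change to the code.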
However, studying specific models like Erdős-Rényi will give us more insight about the performance of the proposed Bayes estimate. In this scenario, the prior $\pi(\cdot)$ is given by

$$\pi(d_i) = e^{-\lambda} \frac{\lambda^{d_i}}{d_i!} ,$$

where $\lambda$ is the prior mean. For a large Erdős-Rényi graph with number of nodes $N$ and edge probability $p_e$, $\lambda \approx N p_e$. We denote, by $P(k, \mu)$, the shifted Poisson distribution on $\{k, k+1, \ldots\}$, whose p.m.f. is given by

$$f(x) = e^{-\mu} \frac{\mu^{x-k}}{(x-k)!}\, \mathbf{1}_{\{k, k+1, \ldots\}}(x) .$$

It is easy to check that with a Poisson($\lambda$) prior on $d_i$, the posterior distribution is $P(d_i^*, \lambda(1-p))$. Hence the Bayes estimate with respect to the quadratic loss function is

$$\hat{d}_i^B = d_i^* + \lambda(1-p) .$$

Proposition 3.5. Assuming

$$\lambda + \frac{1+p}{p} \left( \frac{1}{2} - \sqrt{\frac{\lambda p}{1+p} + 1} \right) \le d_i^0 \le \lambda + \frac{1+p}{p} \left( \frac{1}{2} + \sqrt{\frac{\lambda p}{1+p} + 1} \right) ,$$

the quadratic risk of the Bayes estimator using a Poisson($\lambda$) prior is smaller than that of the MME.

The above result shows that if the sampled node is such that its true degree belongs to a neighborhood around the mean of the underlying degree distribution, then the Bayes estimator is uniformly better than the MME. In case the underlying mean is unknown, it can easily be estimated from the sample (e.g., for known $N$, $\hat\lambda = N \hat{p}_e = N |E(G^*)| / \binom{n}{2}$). If $\hat\lambda$ is a consistent estimator of $\lambda$ in the sense that $\hat\lambda \xrightarrow{P} \lambda$ when $N \to \infty$, $n \to \infty$ and $n/N \to p$, then the empirical Bayes estimator

$$\hat{d}_i^{EB} = d_i^* + \hat\lambda(1-p)$$

will converge in probability to the Bayes estimator in the sense that $\left| \hat{d}_i^{EB} - \hat{d}_i^B \right| \xrightarrow{P} 0$. Hence, the result of Proposition 3.5 is expected to hold. This will also be demonstrated in the simulations.

4 Simulations

For our simulation study, we look at two different regimes of network: Erdős-Rényi random graphs and heavy tailed degree distributions.

4.1 Erdős-Rényi Network

We compare four methods of estimation: the regular MME, the univariate risk minimizer, the multivariate risk minimizer and the Bayes estimate. As priors in Bayes estimation, we use both exponentially decaying (Poisson) and polynomially decaying degree distributions. Table 1 records the
Table1recordsthe Euclidean distance between the true and estimated degree vectors across some combinations of graphsizeN,edgestrengthp andsamplingproportionp. Theerrorsareaveragedover50different e samplesfromeachgivengraphG. Fromtheoutput,itisclearthattheBayesestimatorswithtrue λ and estimated λ outperform other estimators by a very wide margin in terms of (cid:96) risk. Also, 2 ourtheoreticalpredictioninthediscussionfollowingProposition3.2wasthatthemultivariaterisk minimizer(MRM)worksbetterthantheMMEforsparsegraphs. Thisisexperimentallyverified in this simulation, since we see that the relative risk of MRM compared to MME decreases as thesparsityoftheunderlyinggraphincreases,i.e.,asp decreases. Themethodwithlowesttotal e quadraticlossisshowninredforeachcondition. 4.2 ScaleFreeNetwork Wecomparedfourmethodsofestimationinsimulatedscalefreenetworkswhichfollowapower lawdegreedistribution. AspriorsinBayesestimation,wecomparedthetruepolynomialpriorand quadraticprior. Wecomputedthel distancesacrosssomecombinationsofsparsity(denotedbys, 2 givenbytheratiooftotaledgestoallpossibleedges),samplingproportionpandheavinessofthetail ofthedegreedistribution,controledbym. TheresultsareshowninTable2. TheBayesestimatorsor themultivariateriskminimizersworkbetterthantheotherestimators. Oneimportantthingtoobserve hereisthatforthemostsparsegraph,theBayesestimatorwithtruepriorworksthebestandass increases,multivariateriskminimizersworkbetterthantherest,butthereishardlyanyimprovement overMME.Again,themethodwithlowesttotalquadraticlossisshowninredforeachcondition. 5 HumanTraffickingNetwork InFebruary2015, theDefenseAdvancedResearchProjectsAgency(DARPA),anagencyofthe U.S.DepartmentofDefense,announcedtheMemexprograminresponsetotheuseoftheInternetin 7 pe,p↓,N→ N=1000 s,p↓,N→ N=1000 MME URM MRM Bayes MME URM MRM Bayes Pois.(λ) Pois.(λˆ) Poly. 
TruePrior Quad.Prior pppeee===000...123,,,ppp===000...111 244919262...202922 244919051...018458 244818938...720685 11923016...038326 11924589...940592 244919262...406824 ss=ss===00.215.2%%%%,,,,pppp====0000...111.1,,,,mmmm====2222.5 249435228...614.10380 238235584...732.26968 248433907...792.28377 238133292....22271936 238233212....22071976 pe=0.4,p=0.1 588.18 587.84 586.40 152.94 168.99 588.02 s=1%,p=0.1,m=2.5 92.91 82.89 91.50 81.93 78.72 pe=0.1,p=0.2 284.08 283.67 282.76 119.87 122.73 284.24 s=5%,p=0.1,m=2.5 210.04 214.70 208.22 231.68 219.55 pe=0.2,p=0.2 389.15 389.07 386.87 164.30 166.84 389.55 s=0.2%,p=0.1,m=3 41.52 28.75 39.36 21.71 22.61 pe=0.3,p=0.2 485.09 485.07 481.82 187.43 190.55 485.63 s=1%,p=0.1,m=3 89.40 79.98 88.07 83.39 75.46 pe=0.4,p=0.2 527.37 527.28 527.68 205.47 210.42 527.07 s=5%,p=0.1,m=3 209.97 213.30 208.25 242.90 217.87 Table 1: Erdös-Rényi Simulation Results: λ is Table2: ScaleFreeSimulationResults thetruemeanusingknownp . λîstheestimated e meanusinganestimatepˆ ofp . e e humantrafficking,especiallychatforums,advertisementsandjobservicessections. DARPA-funded researchdeterminedthetraffickingindustryspent$250Mtopostmorethan60Madvertisementsover atwo-yeartimeframe[15]. Indexingandcross-referencingtheadswiththesamecontactnumber, similaraddressorzipcodeshelpidentifyandtracktheillegaltraffickingactivities. Thisleadstoa massivebackgroundnetworkstructurewhereeachnoderepresentsanadvertisementandanedge betweentwonodesarecreatediftheysharecertainfeatures. Itisnotunreasonabletoexpectthat,in surveillanceofnetworkslikethis,samplingmaywellarise,eitherbychoiceorbycircumstance. We mimicthissituationbypretendingthatthisunderlyingnetworkgeneratedbytheMemexprogram is unknown to us and sampling it using induced subgraph sampling. The nodes associated with traffickingactivitiesareflaggedinthedata. Thereare31,248nodes,ofwhich12,387areflaggedand thereare10,200,838edges. Ourgoalwastoestimatethetruedegreesofflaggednodesthatwesawin oursample. 
We compared the $\ell_2$ distance of regular scale-up estimators, and our proposed univariate, multivariate and Bayes estimators. For the Bayes estimator, a number of polynomial priors were taken into consideration with varying degree of decay, denoted by $\alpha$. The results are shown in Table 3. Almost everything works better than the naive scale-up estimator in terms of total $\ell_2$ loss, although the relative improvement is more modest than in simulation.

Table 3: Sampling from Human Trafficking Network

p          MME        URM       MRM       Bayes α=−0.1   Bayes α=−0.5   Bayes α=−1
p=0.005    3451.364   3436.64   3447.24   3687.26        3541.94        3450.97
p=0.01     3427.55    3397.71   3427.88   3451.86        3412.12        3428.59
p=0.02     4462.937   4448.33   4461.64   4492.83        4450.71        4462.31

6 Discussion & Future Research

In this paper, we addressed the problem of estimation of true degrees of sampled nodes from an unknown graph. We proposed a class of estimators from a risk-theory perspective where the goal was to minimize the overall $\ell_2$ risk of the degree estimates for the sampled nodes. We considered estimators that minimize both frequentist and Bayes risk functions and compared the frequentist $\ell_2$ risks of our proposed estimators to the naive scale-up estimator. The basic objective of proposing these estimators was to exploit the additional network information inherent in the sampled graph, beyond the observed degrees. Our theoretical analyses, simulation studies and real data show clear evidence of superior performance of our estimators compared to the MME, especially when the graph is sparse and the sampling ratio is low, mimicking the real-world examples.

There are a number of ways our current work could be extended. Firstly, a theoretical analysis of the Bayes estimators under priors for random graph models beyond Erdős-Rényi is desirable, although likely more involved. Secondly, although induced subgraph sampling serves as a representative structural model for a certain class of adaptive sampling designs, the specific details of the sufficiency conditions discussed in this paper can be expected to vary slightly with the other sampling designs (e.g., incident subgraph or random walk designs).
Finally, the success of the Bayesian method appears to rely heavily upon appropriate choice of prior distribution, as observed in our theoretical analysis and computational experiments. It would be of interest to explore the performance of the empirical Bayes estimate in conjunction with the nonparametric method of degree distribution estimation proposed in [19]. More generally, the method in [19] can in principle be extended to estimate individual vertex degrees. But the computational challenge of implementation and the corresponding risk analysis can be expected to be nontrivial.

Estimation of Vertex Degrees in a Sampled Network: Supplementary-A: Proofs

1 Proofs

1.1 Proof of Lemma 2.1

Proof. Let $S$ be the set of sampled nodes and let $Ne_i$ denote the set of neighbors of node $i$ in $G^0$. See that $d_i^* = \sum_{k \in Ne_i} I(k \in S)$. Hence, $d_i^* \sim B\left(d_i^0, p\right)$. Moreover,

$$E\left(d_i^* d_j^*\right) = E\left[ \left( \sum_{k \in Ne_i} I(k \in S) \right) \left( \sum_{l \in Ne_j} I(l \in S) \right) \right]$$
$$= E\left[ \sum_{k \in Ne_i \cap Ne_j} I(k \in S) + \sum_{k \in Ne_i,\ l \in Ne_j,\ k \neq l} I(k \in S)\, I(l \in S) \right]$$
$$= d_{ij}^0\, p + \left( d_i^0 d_j^0 - d_{ij}^0 \right) p^2 .$$

Note that $d_{ij}^0$ is the cardinality of the first set of nodes (by its definition) and $(d_i^0 d_j^0 - d_{ij}^0)$ is that of the second. The probability that a node is selected in induced subgraph sampling is $p$ and, since each node is selected independently, the joint probability that two nodes are selected is $p^2$. Hence,

$$\mathrm{Cov}\left(d_i^*, d_j^*\right) = d_{ij}^0\, p(1-p) .$$

1.2 Proof of Proposition 3.1

Taking Taylor expansion up to 2nd order, we get

$$E\left(\hat{d}_{i,u}\right) = \frac{1-p}{p} + \frac{1}{p} E\left[ d_i^* \left( 1 + \frac{1-p}{d_i^*} \right)^{-1} \right] \approx \frac{1-p}{p} + \frac{1}{p} E\left[ d_i^* - (1-p) + \frac{(1-p)^2}{d_i^*} \right] \approx d_i^0 + \frac{(1-p)^2}{p^2 d_i^0} .$$

We only consider Taylor expansion up to 2nd order because the expectation of higher order terms can be neglected assuming $d_i^0$ is sufficiently large. Hence, we get

$$\mathrm{Bias}\left(\hat{d}_{i,u}, d_i^0\right) = \frac{(1-p)^2}{p^2 d_i^0} .$$

Similarly, we approximate the variance by Taylor expansion and get

$$\mathrm{Var}\left(\hat{d}_{i,u}\right) \approx \frac{1}{p^2} \mathrm{Var}\left( d_i^* + \frac{(1-p)^2}{d_i^*} \right) = \frac{1}{p^2} \left[ \mathrm{Var}(d_i^*) + (1-p)^4\, \mathrm{Var}\left( \frac{1}{d_i^*} \right) + 2(1-p)^2\, \mathrm{Cov}\left( d_i^*, \frac{1}{d_i^*} \right) \right]$$