Published as a conference paper at ICLR 2018

CERTIFYING SOME DISTRIBUTIONAL ROBUSTNESS WITH PRINCIPLED ADVERSARIAL TRAINING

Aman Sinha*,1, Hongseok Namkoong*,2, John Duchi1,3
Departments of 1Electrical Engineering, 2Management Science & Engineering, 3Statistics
Stanford University, Stanford, CA 94305
{amans, hnamk, jduchi}@stanford.edu
*Equal contribution

ABSTRACT

Neural networks are vulnerable to adversarial examples, and researchers have proposed many heuristic attack and defense mechanisms. We address this problem through the principled lens of distributionally robust optimization, which guarantees performance under adversarial input perturbations. By considering a Lagrangian penalty formulation of perturbing the underlying data distribution in a Wasserstein ball, we provide a training procedure that augments model parameter updates with worst-case perturbations of training data. For smooth losses, our procedure provably achieves moderate levels of robustness with little computational or statistical cost relative to empirical risk minimization. Furthermore, our statistical guarantees allow us to efficiently certify robustness for the population loss. For imperceptible perturbations, our method matches or outperforms heuristic approaches.

1 INTRODUCTION

Consider the classical supervised learning problem, in which we minimize an expected loss E_{P0}[ℓ(θ; Z)] over a parameter θ ∈ Θ, where Z ∼ P0, P0 is a distribution on a space Z, and ℓ is a loss function. In many systems, robustness to changes in the data-generating distribution P0 is desirable, whether they be from covariate shifts, changes in the underlying domain (Ben-David et al., 2010), or adversarial attacks (Goodfellow et al., 2015; Kurakin et al., 2016). As deep networks become prevalent in modern performance-critical systems (perception for self-driving cars, automated detection of tumors), model failure is increasingly costly; in these situations, it is irresponsible to deploy models whose robustness and failure modes we do not understand or cannot certify.
Recent work shows that neural networks are vulnerable to adversarial examples; seemingly imperceptible perturbations to data can lead to misbehavior of the model, such as misclassification of the output (Goodfellow et al., 2015; Nguyen et al., 2015; Kurakin et al., 2016; Moosavi-Dezfooli et al., 2016). Consequently, researchers have proposed adversarial attack and defense mechanisms (Papernot et al., 2016a;b;c; Rozsa et al., 2016; Carlini & Wagner, 2017; He et al., 2017; Madry et al., 2017; Tramèr et al., 2017). These works provide an initial foundation for adversarial training, but it is challenging to rigorously identify the classes of attacks against which they can defend (or if they exist). Alternative approaches that provide formal verification of deep networks (Huang et al., 2017; Katz et al., 2017a;b) are NP-hard in general; they require prohibitive computational expense even on small networks. Recently, researchers have proposed convex relaxations of the NP-hard verification problem with some success (Kolter & Wong, 2017; Raghunathan et al., 2018), though they may be difficult to scale to large networks. In this context, our work is situated between these agendas: we develop efficient procedures with rigorous guarantees for small to moderate amounts of robustness.

We take the perspective of distributionally robust optimization and provide an adversarial training procedure with provable guarantees on its computational and statistical performance. We postulate a class P of distributions around the data-generating distribution P0 and consider the problem

  minimize_{θ∈Θ} sup_{P∈P} E_P[ℓ(θ; Z)].   (1)

The choice of P influences robustness guarantees and computability; we develop robustness sets P with computationally efficient relaxations that apply even when the loss ℓ is non-convex. We provide an adversarial training procedure that, for smooth ℓ, enjoys convergence guarantees similar to non-robust approaches while certifying performance even for the worst-case population loss sup_{P∈P} E_P[ℓ(θ; Z)]. On a simple implementation in TensorFlow, our method takes 5–10× as long as stochastic gradient methods for empirical risk minimization (ERM), matching runtimes for other adversarial training procedures (Goodfellow et al., 2015; Kurakin et al., 2016; Madry et al., 2017). We show that our procedure, which learns to protect against adversarial perturbations in the training dataset, generalizes, allowing us to train a model that prevents attacks on the test dataset.

We briefly overview our approach. Let c : Z × Z → R+ ∪ {∞}, where c(z, z0) is the "cost" for an adversary to perturb z0 to z (we typically use c(z, z0) = ‖z − z0‖_p^2 with p ≥ 1). We consider the robustness region P = {P : W_c(P, P0) ≤ ρ}, a ρ-neighborhood of the distribution P0 under the Wasserstein metric W_c(·, ·) (see Section 2 for a formal definition). For deep networks and other complex models, this formulation of problem (1) is intractable with arbitrary ρ. Instead, we consider its Lagrangian relaxation for a fixed penalty parameter γ ≥ 0, resulting in the reformulation

  minimize_{θ∈Θ} F(θ) := sup_P {E_P[ℓ(θ; Z)] − γ W_c(P, P0)} = E_{P0}[φ_γ(θ; Z)],   (2a)
  where φ_γ(θ; z0) := sup_{z∈Z} {ℓ(θ; z) − γ c(z, z0)}.   (2b)

(See Proposition 1 for a rigorous statement of these equalities.) Here, we have replaced the usual loss ℓ(θ; Z) by the robust surrogate φ_γ(θ; Z); this surrogate (2b) allows adversarial perturbations of the data z, modulated by the penalty γ. We typically solve the penalty problem (2) with P0 replaced by the empirical distribution P̂_n, as P0 is unknown (we refer to this as the penalty problem below).

The key feature of the penalty problem (2) is that moderate levels of robustness, in particular defense against imperceptible adversarial perturbations, are achievable at essentially no computational or statistical cost for smooth losses ℓ. Specifically, for large enough penalty γ (by duality, small enough robustness ρ), the function z ↦ ℓ(θ; z) − γ c(z, z0) in the robust surrogate (2b) is strongly concave and hence easy to optimize if ℓ(θ, z) is smooth in z.
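To make the surrogate (2b) concrete, the following is a minimal numpy sketch of computing φ_γ by inner gradient ascent. The loss ℓ(θ; z) = sin(θᵀz) and all numeric values are illustrative assumptions, not from the paper; the cost is c(z, z0) = ½‖z − z0‖², and for γ larger than the smoothness constant of ℓ in z the inner problem is strongly concave, so plain gradient ascent started at z0 suffices.

```python
import numpy as np

def robust_surrogate(loss, loss_grad_z, theta, z0, gamma, steps=50, lr=0.1):
    """Approximate phi_gamma(theta; z0) = sup_z { loss(theta, z) - gamma * c(z, z0) }
    with c(z, z0) = 0.5 * ||z - z0||^2, via gradient ascent started at z0."""
    z = z0.copy()
    for _ in range(steps):
        # ascent direction for z -> loss(theta, z) - gamma * c(z, z0)
        g = loss_grad_z(theta, z) - gamma * (z - z0)
        z = z + lr * g
    return loss(theta, z) - 0.5 * gamma * np.sum((z - z0) ** 2), z

# Toy smooth loss (an assumption for illustration): l(theta; z) = sin(theta . z),
# whose gradient in z is cos(theta . z) * theta and whose smoothness constant
# in z is at most ||theta||^2.
loss = lambda theta, z: np.sin(theta @ z)
loss_grad_z = lambda theta, z: np.cos(theta @ z) * theta

theta = np.array([1.0, -0.5])
z0 = np.array([0.3, 0.2])
phi, z_star = robust_surrogate(loss, loss_grad_z, theta, z0, gamma=5.0)
# Since z0 is feasible with zero cost and the ascent is monotone for this
# step size, the surrogate dominates the nominal loss: phi >= l(theta; z0).
```

Here γ = 5 exceeds the loss's smoothness bound ‖θ‖² = 1.25, so the penalized objective is strongly concave as in the paper's key observation.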
Consequently, stochastic gradient methods applied to problem (2) have similar convergence guarantees as for non-robust methods (ERM). In Section 3, we provide a certificate of robustness for any ρ; we give an efficiently computable data-dependent upper bound on the worst-case loss sup_{P : W_c(P,P0) ≤ ρ} E_P[ℓ(θ; Z)]. That is, the worst-case performance of the output of our principled adversarial training procedure is guaranteed to be no worse than this certificate. Our bound is tight when ρ = ρ̂_n, the achieved robustness for the empirical objective. These results suggest advantages of networks with smooth activations rather than ReLUs. We experimentally verify our results in Section 4 and show that we match or achieve state-of-the-art performance on a variety of adversarial attacks.

Robust optimization and adversarial training  The standard robust-optimization approach minimizes losses of the form sup_{u∈U} ℓ(θ; z + u) for some uncertainty set U (Ratliff et al., 2006; Ben-Tal et al., 2009; Xu et al., 2009). Unfortunately, this approach is intractable except for specially structured losses, such as the composition of a linear and a simple convex function (Ben-Tal et al., 2009; Xu et al., 2009; 2012). Nevertheless, this robust approach underlies recent advances in adversarial training (Szegedy et al., 2013; Goodfellow et al., 2015; Papernot et al., 2016b; Carlini & Wagner, 2017; Madry et al., 2017), which heuristically perturb data during a stochastic optimization procedure.

One such heuristic uses a locally linearized loss function (proposed with p = ∞ as the "fast gradient sign method" (Goodfellow et al., 2015)):

  Δ_{x_i}(θ) := argmax_{‖η‖_p ≤ ε} {∇_x ℓ(θ; (x_i, y_i))ᵀ η}  and perturb  x_i → x_i + Δ_{x_i}(θ).   (3)

One form of adversarial training trains on the losses ℓ(θ; (x_i + Δ_{x_i}(θ), y_i)) (Goodfellow et al., 2015; Kurakin et al., 2016), while others perform iterated variants (Papernot et al., 2016b; Carlini & Wagner, 2017; Madry et al., 2017; Tramèr et al., 2017). Madry et al. (2017) observe that these procedures attempt to optimize the objective E_{P0}[sup_{‖u‖_p ≤ ε} ℓ(θ; Z + u)], a constrained version of the penalty problem (2). This notion of robustness is typically intractable: the inner supremum is generally non-concave in u, so it is unclear whether model-fitting with these techniques converges, and there are possibly worst-case perturbations these techniques do not find. Indeed, it is NP-hard to find worst-case perturbations when deep networks use ReLU activations, suggesting difficulties for fast and iterated heuristics (see Lemma 2 in Appendix B). Smoothness, which can be obtained in standard deep architectures with exponential linear units (ELUs) (Clevert et al., 2015), allows us to find Lagrangian worst-case perturbations with low computational cost.

Distributionally robust optimization  To situate the current work, we review some of the substantial body of work on robustness and learning. The choice of P in the robust objective (1) affects both the richness of the uncertainty set we wish to consider as well as the tractability of the resulting optimization problem. Previous approaches to distributional robustness have considered finite-dimensional parametrizations for P, such as constraint sets for moments, support, or directional deviations (Chen et al., 2007; Delage & Ye, 2010; Goh & Sim, 2010), as well as non-parametric distances for probability measures such as f-divergences (Ben-Tal et al., 2013; Bertsimas et al., 2013; Lam & Zhou, 2015; Miyato et al., 2015; Duchi et al., 2016; Namkoong & Duchi, 2016) and Wasserstein distances (Esfahani & Kuhn, 2015; Shafieezadeh-Abadeh et al., 2015; Blanchet et al., 2016). In contrast to f-divergences (e.g. χ²- or Kullback-Leibler divergences), which are effective when the support of the distribution P0 is fixed, a Wasserstein ball around P0 includes distributions Q with different support and allows (in a sense) robustness to unseen data.

Many authors have studied tractable classes of uncertainty sets P and losses ℓ. For example, Ben-Tal et al. (2013) and Namkoong & Duchi (2017) use convex optimization approaches for f-divergence balls. For worst-case regions P formed by Wasserstein balls, Esfahani & Kuhn (2015), Shafieezadeh-Abadeh et al. (2015), and Blanchet et al. (2016) show how to convert the saddle-point problem (1) to a regularized ERM problem, but this is possible only for a limited class of convex losses ℓ and costs c. In this work, we treat a much larger class of losses and costs and provide direct solution methods for a Lagrangian relaxation of the saddle-point problem (1). One natural application area is domain adaptation (Lee & Raginsky, 2017); concurrently with this work, Lee & Raginsky provide guarantees similar to ours for the empirical minimizer of the robust saddle-point problem (1) and give specialized bounds for domain adaptation problems. In contrast, our approach is to use the distributionally robust approach to both defend against imperceptible adversarial perturbations and develop efficient optimization procedures.

2 PROPOSED APPROACH

Our approach is based on the following simple insight: assume that the function z ↦ ℓ(θ; z) is smooth, meaning there is some L for which ∇_z ℓ(θ; ·) is L-Lipschitz. Then for any c : Z × Z → R+ ∪ {∞} that is 1-strongly convex in its first argument, a Taylor expansion yields

  ℓ(θ; z′) − γ c(z′, z0) ≤ ℓ(θ; z) − γ c(z, z0) + ⟨∇_z (ℓ(θ; z) − γ c(z, z0)), z′ − z⟩ + ((L − γ)/2) ‖z − z′‖²₂.   (4)

For γ ≥ L this is the first-order condition for (γ − L)-strong concavity of z ↦ ℓ(θ; z) − γ c(z, z0). Thus, whenever the loss is smooth enough in z and the penalty γ is large enough (corresponding to less robustness), computing the surrogate (2b) is a strongly-concave optimization problem.

We leverage the insight (4) to show that, as long as we do not require too much robustness, this strong concavity approach (4) provides a computationally efficient and principled approach for robust optimization problems (1).
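The concavity prediction of (4) can be checked numerically. Below is an illustrative sketch (the toy loss ℓ(θ; z) = sin(θᵀz) and all constants are assumptions, not from the paper): the z-Hessian of the penalized objective should have all eigenvalues at most −(γ − L) once γ ≥ L, where here L ≤ ‖θ‖².

```python
import numpy as np

def hessian_fd(f, z, eps=1e-4):
    """Finite-difference Hessian of a scalar function f at the point z."""
    d = z.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (f(z + e_i + e_j) - f(z + e_i - e_j)
                       - f(z - e_i + e_j) + f(z - e_i - e_j)) / (4 * eps ** 2)
    return H

theta = np.array([1.0, -0.5])   # smoothness of sin(theta.z) in z is <= ||theta||^2 = 1.25
z0 = np.array([0.3, 0.2])
gamma = 2.0                     # gamma >= L, so (4) predicts strong concavity

# Penalized objective z -> l(theta; z) - gamma * c(z, z0), c = 0.5 * ||z - z0||^2
penalized = lambda z: np.sin(theta @ z) - 0.5 * gamma * np.sum((z - z0) ** 2)

H = hessian_fd(penalized, np.array([0.1, -0.2]))
max_eig = np.linalg.eigvalsh(H).max()
# Insight (4) predicts max_eig <= -(gamma - L) < 0: the objective is concave here.
```

This is exactly the kind of per-datapoint Hessian check the paper later mentions for approximately verifying concavity of the penalized loss.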
Our starting point is a duality result for the minimax problem (1) and its Lagrangian relaxation for Wasserstein-based uncertainty sets, which makes the connections between distributional robustness and the "lazy" surrogate (2b) clear. We then show (Section 2.1) how stochastic gradient descent methods can efficiently find minimizers (in the convex case) or approximate stationary points (when ℓ is non-convex) for our relaxed robust problems.

Wasserstein robustness and duality  Wasserstein distances define a notion of closeness between distributions. Let Z ⊂ R^m, and let (Z, A, P0) be a probability space. Let the transportation cost c : Z × Z → [0, ∞) be nonnegative, lower semi-continuous, and satisfy c(z, z) = 0. For example, for a differentiable convex h : Z → R, the Bregman divergence c(z, z0) = h(z) − h(z0) − ⟨∇h(z0), z − z0⟩ satisfies these conditions. For probability measures P and Q supported on Z, let Π(P, Q) denote their couplings, meaning measures M on Z² with M(A, Z) = P(A) and M(Z, A) = Q(A). The Wasserstein distance between P and Q is

  W_c(P, Q) := inf_{M ∈ Π(P,Q)} E_M[c(Z, Z′)].

For ρ ≥ 0 and distribution P0, we let P = {P : W_c(P, P0) ≤ ρ}, considering the Wasserstein form of the robust problem (1) and its Lagrangian relaxation (2) with γ ≥ 0. The following duality result (Blanchet & Murthy, 2016) gives the equality (2) for the relaxation and an analogous result for the problem (1). We give an alternative proof in Appendix C.1 for convex, continuous cost functions.

Proposition 1. Let ℓ : Θ × Z → R and c : Z × Z → R+ be continuous. Let φ_γ(θ; z0) = sup_{z∈Z} {ℓ(θ; z) − γ c(z, z0)} be the robust surrogate (2b). For any distribution Q and any ρ > 0,

  sup_{P : W_c(P,Q) ≤ ρ} E_P[ℓ(θ; Z)] = inf_{γ ≥ 0} {γρ + E_Q[φ_γ(θ; Z)]},   (5)

and for any γ ≥ 0, we have

  sup_P {E_P[ℓ(θ; Z)] − γ W_c(P, Q)} = E_Q[φ_γ(θ; Z)].   (6)

Leveraging the insight (4), we give up the requirement of a prescribed amount ρ of robustness (solving the worst-case problem (1) for P = {P : W_c(P, P0) ≤ ρ}) and focus instead on the Lagrangian penalty problem (2) and its empirical counterpart

  minimize_{θ∈Θ} { F_n(θ) := sup_P {E_P[ℓ(θ; Z)] − γ W_c(P, P̂_n)} = E_{P̂_n}[φ_γ(θ; Z)] }.   (7)

2.1 OPTIMIZING THE ROBUST LOSS BY STOCHASTIC GRADIENT DESCENT

We now develop stochastic gradient-type methods for the relaxed robust problem (7), making clear the computational benefits of relaxing the strict robustness requirements of formulation (5). We begin with the assumptions we require, which roughly quantify the amount of robustness we can provide.

Assumption A. The function c : Z × Z → R+ is continuous. For each z0 ∈ Z, c(·, z0) is 1-strongly convex with respect to the norm ‖·‖.

To guarantee that the robust surrogate (2b) is tractably computable, we also require a few smoothness assumptions. Let ‖·‖_* be the dual norm to ‖·‖; we abuse notation by using the same norm ‖·‖ on Θ and Z, though the specific norm is clear from context.

Assumption B. The loss ℓ : Θ × Z → R satisfies the Lipschitzian smoothness conditions

  ‖∇_θ ℓ(θ; z) − ∇_θ ℓ(θ′; z)‖_* ≤ L_θθ ‖θ − θ′‖,   ‖∇_z ℓ(θ; z) − ∇_z ℓ(θ; z′)‖_* ≤ L_zz ‖z − z′‖,
  ‖∇_θ ℓ(θ; z) − ∇_θ ℓ(θ; z′)‖_* ≤ L_θz ‖z − z′‖,   ‖∇_z ℓ(θ; z) − ∇_z ℓ(θ′; z)‖_* ≤ L_zθ ‖θ − θ′‖.

These properties guarantee both (i) the well-behavedness of the robust surrogate φ_γ and (ii) its efficient computability. Making point (i) precise, Lemma 1 shows that if γ is large enough and Assumption B holds, the surrogate φ_γ is still smooth. Throughout, we assume Θ ⊆ R^d.

Lemma 1. Let f : Θ × Z → R be differentiable and λ-strongly concave in z with respect to the norm ‖·‖, and define f̄(θ) = sup_{z∈Z} f(θ, z). Let g_θ(θ, z) = ∇_θ f(θ, z) and g_z(θ, z) = ∇_z f(θ, z), and assume g_θ and g_z satisfy Assumption B with ℓ(θ; z) replaced by f(θ, z). Then f̄ is differentiable, and letting z*(θ) = argmax_{z∈Z} f(θ, z), we have ∇f̄(θ) = g_θ(θ, z*(θ)). Moreover,

  ‖z*(θ₁) − z*(θ₂)‖ ≤ (L_zθ / λ) ‖θ₁ − θ₂‖  and  ‖∇f̄(θ) − ∇f̄(θ′)‖_* ≤ (L_θθ + L_θz L_zθ / λ) ‖θ − θ′‖.

See Section C.2 for the proof. Fix z0 ∈ Z and focus on the ℓ2-norm case, where c(z, z0) satisfies Assumption A with ‖·‖₂. Noting that f(θ, z) := ℓ(θ; z) − γ c(z, z0) is (γ − L_zz)-strongly concave from the insight (4) (with L := L_zz), let us apply Lemma 1. Under Assumptions A and B, φ_γ(·; z0) then has L = L_θθ + L_θz L_zθ / [γ − L_zz]_+ -Lipschitz gradients, and

  ∇_θ φ_γ(θ; z0) = ∇_θ ℓ(θ; z*(z0, θ))  where  z*(z0, θ) = argmax_{z∈Z} {ℓ(θ; z) − γ c(z, z0)}.

Algorithm 1  Distributionally robust optimization with adversarial training
INPUT: Sampling distribution P0, constraint sets Θ and Z, step-size sequence {α_t > 0}, t = 0, ..., T − 1
for t = 0, ..., T − 1 do
  Sample z^t ∼ P0 and find an ε-approximate maximizer ẑ^t of ℓ(θ^t; z) − γ c(z, z^t)
  θ^{t+1} ← Proj_Θ(θ^t − α_t ∇_θ ℓ(θ^t; ẑ^t))

This motivates Algorithm 1, a stochastic-gradient approach for the penalty problem (7). The benefits of Lagrangian relaxation become clear here: for ℓ(θ; z) smooth in z and γ large enough, gradient ascent on ℓ(θ^t; z) − γ c(z, z^t) in z converges linearly, and we can compute the (approximate) ẑ^t efficiently (we initialize our inner gradient ascent iterations with the sampled natural example z^t).

Convergence properties of Algorithm 1 depend on the loss ℓ. When ℓ is convex in θ and γ is large enough that z ↦ ℓ(θ; z) − γ c(z, z0) is concave for all (θ, z0) ∈ Θ × Z, we have a stochastic monotone variational inequality, which is efficiently solvable (Juditsky et al., 2011; Chen et al., 2014) with convergence rate 1/√T.
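Algorithm 1 can be sketched in a few lines of numpy. The following toy instance is an illustrative assumption, not the paper's experimental setup: a scalar linear model with squared-error loss, the covariate-shift cost in x only (labels fixed, as in the cost (8) below), c(x, x0) = ½(x − x0)², and arbitrary step sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_theta(theta, x, y):
    # d/dtheta of 0.5 * (theta*x - y)^2
    return (theta * x - y) * x

def grad_x(theta, x, y):
    # d/dx of 0.5 * (theta*x - y)^2
    return (theta * x - y) * theta

def wrm_sgd(data, gamma=10.0, alpha=0.05, inner_steps=15, inner_lr=0.05):
    """Sketch of Algorithm 1: for each sample, run inner gradient ascent on
    l(theta; (x, y)) - gamma * c(x, x0) in the feature x (label fixed), then
    take a stochastic gradient step in theta at the perturbed point."""
    theta = 0.0
    for x0, y in data:
        x = x0                              # initialize at the natural example
        for _ in range(inner_steps):        # inner ascent; c = 0.5 * (x - x0)^2
            x += inner_lr * (grad_x(theta, x, y) - gamma * (x - x0))
        theta -= alpha * grad_theta(theta, x, y)
    return theta

# Toy data from y = 2x plus small noise (a hypothetical setup)
xs = rng.normal(size=200)
data = [(x, 2.0 * x + 0.1 * rng.normal()) for x in xs]
theta_wrm = wrm_sgd(data)
# theta_wrm should approach the data-generating slope (about 2); with gamma
# large, the adversarial perturbations are small and the bias is minor.
```

Here γ = 10 exceeds the loss's curvature in x (θ² ≤ 4 near the solution), so the inner problem is strongly concave and the inner ascent converges quickly, mirroring the paper's linear-convergence claim.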
When the loss ℓ is nonconvex in θ, the following theorem guarantees convergence to a stationary point of problem (7) at the same rate when γ ≥ L_zz. Recall that F(θ) = E_{P0}[φ_γ(θ; Z)] is the robust surrogate objective for the Lagrangian relaxation (2).

Theorem 2 (Convergence of Nonconvex SGD). Let Assumptions A and B hold with the ℓ2-norm and let Θ = R^d. Let Δ_F ≥ F(θ⁰) − inf_θ F(θ). Assume E[‖∇F(θ) − ∇_θ φ_γ(θ, Z)‖²₂] ≤ σ² and take constant step sizes α = √(Δ_F / (L_φ T σ²)), where L_φ := L_θθ + L_θz L_zθ / (γ − L_zz). For T ≥ L_φ Δ_F / σ², Algorithm 1 satisfies

  (1/T) Σ_{t=0}^{T−1} E[‖∇F(θ^t)‖²₂] − (4 L²_θz / (γ − L_zz)) ε ≤ 4σ √(L_φ Δ_F / T).

See Section C.3 for the proof. We make a few remarks. First, the condition E[‖∇F(θ) − ∇_θ φ_γ(θ, Z)‖²₂] ≤ σ² holds (to within a constant factor) whenever ‖∇_θ ℓ(θ, z)‖₂ ≤ σ for all θ, z. Theorem 2 shows that the stochastic gradient method achieves the rates of convergence on the penalty problem (7) achievable in standard smooth non-convex optimization (Ghadimi & Lan, 2013). The accuracy parameter ε has a fixed effect on optimization accuracy, independent of T: approximate maximization has limited effects.

Key to the convergence guarantee of Theorem 2 is that the loss ℓ is smooth in z: the inner supremum (2b) is NP-hard to compute for non-smooth deep networks (see Lemma 2 in Section B for a proof of this for ReLUs). The smoothness of ℓ is essential so that a penalized version ℓ(θ, z) − γ c(z, z0) is concave in z (which can be approximately verified by computing Hessians ∇²_zz ℓ(θ, z) for each training datapoint), allowing computation and our coming certificates of optimality. Replacing ReLUs with sigmoids or ELUs (Clevert et al., 2015) allows us to apply Theorem 2, making distributionally robust optimization tractable for deep learning.

In supervised-learning scenarios, we are often interested in adversarial perturbations only to feature vectors (and not labels).
Letting Z = (X, Y), where X denotes the feature vector (covariates) and Y the label, this is equivalent to defining the Wasserstein cost function c : Z × Z → R+ ∪ {∞} by

  c(z, z′) := c_x(x, x′) + ∞ · 1{y ≠ y′},   (8)

where c_x : X × X → R+ is the transportation cost for the feature vector X. All of our results suitably generalize to this setting with minor modifications to the robust surrogate (2b) and the above assumptions (see Section D). Similarly, our distributionally robust framework (2) is general enough to consider adversarial perturbations to only an arbitrary subset of coordinates in Z. For example, it may be appropriate in certain applications to hedge against adversarial perturbations to a small fixed region of an image (Brown et al., 2017). By suitably modifying the cost function c(z, z′) to take value ∞ outside this small region, our general formulation covers such variants.

3 CERTIFICATE OF ROBUSTNESS AND GENERALIZATION

From the results in the previous section, Algorithm 1 provably learns to protect against adversarial perturbations of the form (7) on the training dataset. Now we show that such procedures generalize, allowing us to prevent attacks on the test set. Our subsequent results hold uniformly over the space of parameters θ ∈ Θ, including θ_WRM, the output of the stochastic gradient descent procedure in Section 2.1. Our first main result, presented in Section 3.1, gives a data-dependent upper bound on the population worst-case objective sup_{P : W_c(P,P0) ≤ ρ} E_P[ℓ(θ; Z)] for any arbitrary level of robustness ρ; this bound is optimal for ρ = ρ̂_n, the level of robustness achieved for the empirical distribution by solving (7). Our bound is efficiently computable and hence certifies a level of robustness for the worst-case population objective. Second, we show in Section 3.2 that adversarial perturbations on the training set (in a sense) generalize: solving the empirical penalty problem (7) guarantees a similar level of robustness as directly solving its population counterpart (2).
3.1 CERTIFICATE OF ROBUSTNESS

Our main result in this section is a data-dependent upper bound for the worst-case population objective: sup_{P : W_c(P,P0) ≤ ρ} E_P[ℓ(θ; Z)] ≤ γρ + E_{P̂_n}[φ_γ(θ; Z)] + O(1/√n) for all θ ∈ Θ, with high probability. To make this rigorous, fix γ > 0 and consider the worst-case perturbation, typically called the transportation map or Monge map (Villani, 2009),

  T_γ(θ; z0) := argmax_{z∈Z} {ℓ(θ; z) − γ c(z, z0)}.   (9)

Under our assumptions, T_γ is easily computable when γ ≥ L_zz. Letting δ_z denote the point mass at z, Proposition 1 shows that the empirical maximizers of the Lagrangian formulation (6) are attained by

  P*_n(θ) := argmax_P {E_P[ℓ(θ; Z)] − γ W_c(P, P̂_n)} = (1/n) Σ_{i=1}^n δ_{T_γ(θ, Z_i)}  and   (10)
  ρ̂_n(θ) := W_c(P*_n(θ), P̂_n) = E_{P̂_n}[c(T_γ(θ; Z), Z)].

Our results imply, in particular, that the empirical worst-case loss E_{P*_n(θ)}[ℓ(θ; Z)] gives a certificate of robustness to (population) Wasserstein perturbations up to level ρ̂_n. E_{P*_n(θ)}[ℓ(θ; Z)] is efficiently computable via (10), providing a data-dependent guarantee for the worst-case population loss.

Our bound relies on the usual covering numbers for the model class {ℓ(θ; ·) : θ ∈ Θ} as the notion of complexity (e.g. van der Vaart & Wellner, 1996), so, despite the infinite-dimensional problem (7), we retain the same uniform convergence guarantees typical of empirical risk minimization. Recall that for a set V, a collection v₁, ..., v_N is an ε-cover of V in norm ‖·‖ if for each v ∈ V there exists v_i such that ‖v − v_i‖ ≤ ε. The covering number of V with respect to ‖·‖ is

  N(V, ε, ‖·‖) := inf{N ∈ N | there is an ε-cover of V with respect to ‖·‖}.

For F := {ℓ(θ, ·) : θ ∈ Θ} equipped with the L^∞(Z) norm ‖f‖_{L^∞(Z)} := sup_{z∈Z} |f(z)|, we state our results in terms of ‖·‖_{L^∞(Z)}-covering numbers of F. To ease notation, we let

  ε_n(t) := γ b₁ √(M_ℓ/n) ∫₀¹ √(log N(F, M_ℓ ε, ‖·‖_{L^∞(Z)})) dε + b₂ M_ℓ √(t/n),

where b₁, b₂ are numerical constants.

We are now ready to state the main result of this section. We first show from the duality result (6) that we can provide an upper bound for the worst-case population performance for any level of robustness ρ. For ρ = ρ̂_n(θ) and θ = θ_WRM, this certificate is (in a sense) tight, as we see below.

Theorem 3. Assume |ℓ(θ; z)| ≤ M_ℓ for all θ ∈ Θ and z ∈ Z. Then, for a fixed t > 0 and numerical constants b₁, b₂ > 0, with probability at least 1 − e^{−t}, simultaneously for all θ ∈ Θ, ρ ≥ 0, γ ≥ 0,

  sup_{P : W_c(P,P0) ≤ ρ} E_P[ℓ(θ; Z)] ≤ γρ + E_{P̂_n}[φ_γ(θ; Z)] + ε_n(t).   (11)

In particular, if ρ = ρ̂_n(θ), then with probability at least 1 − e^{−t}, for all θ ∈ Θ,

  sup_{P : W_c(P,P0) ≤ ρ̂_n(θ)} E_P[ℓ(θ; Z)] ≤ γ ρ̂_n(θ) + E_{P̂_n}[φ_γ(θ; Z)] + ε_n(t)
   = sup_{P : W_c(P,P̂_n) ≤ ρ̂_n(θ)} E_P[ℓ(θ; Z)] + ε_n(t).   (12)

See Section C.4 for its proof. We now give a concrete variant of Theorem 3 for Lipschitz functions. When Θ is finite-dimensional (Θ ⊂ R^d), Theorem 3 provides a robustness guarantee scaling linearly with d despite the infinite-dimensional Wasserstein penalty. Assuming there exist θ₀ ∈ Θ and M_{θ₀} < ∞ such that |ℓ(θ₀; z)| ≤ M_{θ₀} for all z ∈ Z, we have the following corollary (see proof in Section C.5).

Corollary 1. Let ℓ(·; z) be L-Lipschitz with respect to some norm ‖·‖ for all z ∈ Z. Assume that Θ ⊂ R^d satisfies diam(Θ) = sup_{θ,θ′∈Θ} ‖θ − θ′‖ < ∞. Then the bounds (11) and (12) hold with

  ε_n(t) = b₁ (L diam(Θ) + M_{θ₀}) √(d/n) + b₂ (L diam(Θ) + M_{θ₀}) √(t/n)   (13)

for some numerical constants b₁, b₂ > 0.

A key consequence of the bound (11) is that γρ + E_{P̂_n}[φ_γ(θ; Z)] certifies robustness for the worst-case population objective for any ρ and θ.
For a given θ, this certificate is tightest at the achieved level of robustness ρ̂_n(θ), as noted in the refined bound (12), which follows from the duality result

  E_{P̂_n}[φ_γ(θ; Z)] + γ ρ̂_n(θ) = sup_{P : W_c(P,P̂_n) ≤ ρ̂_n(θ)} E_P[ℓ(θ; Z)] = E_{P*_n(θ)}[ℓ(θ; Z)],   (14)

where the first term is the surrogate loss and the second accounts for the achieved robustness. (See Section C.4 for a proof of these equalities.) We expect θ_WRM, the output of Algorithm 1, to be close to the minimizer of the surrogate loss E_{P̂_n}[φ_γ(θ; Z)] and therefore to have the best guarantees. Most importantly, the certificate (14) is easy to compute via expression (10): as noted in Section 2.1, the mappings T(θ, Z_i) are efficiently computable for large enough γ, and ρ̂_n = E_{P̂_n}[c(T(θ, Z), Z)].

The bounds (11)–(13) may be too large, because of their dependence on covering numbers and dimension, for practical use in security-critical applications. With that said, the strong duality result, Proposition 1, still applies to any distribution. In particular, given a collection of test examples Z_i^test, we may interrogate possible losses under perturbations of the test examples by noting that, if P̂_test denotes the empirical distribution on the test set (say, with putative assigned labels), then

  (1/n_test) Σ_i sup_{z : c(z, Z_i^test) ≤ ρ} ℓ(θ; z) ≤ sup_{P : W_c(P,P̂_test) ≤ ρ} E_P[ℓ(θ; Z)] ≤ γρ + E_{P̂_test}[φ_γ(θ; Z)]   (15)

for all γ, ρ ≥ 0. Whenever γ is large enough (so that this is tight for small ρ), we may efficiently compute the Monge map (9) and the test loss (15) to guarantee bounds on the sensitivity of a parameter θ to a particular sample and predicted labeling based on the sample.

3.2 GENERALIZATION OF ADVERSARIAL EXAMPLES

We can also show that the level of robustness on the training set generalizes. Our starting point is Lemma 1, which shows that T_γ(·; z) is smooth under Assumptions A and B:

  ‖T_γ(θ₁; z) − T_γ(θ₂; z)‖ ≤ (L_zθ / [γ − L_zz]_+) ‖θ₁ − θ₂‖   (16)

for all θ₁, θ₂, where we recall that L_zz is the Lipschitz constant of ∇_z ℓ(θ; z). Leveraging this smoothness, we show that ρ̂_n(θ) = E_{P̂_n}[c(T_γ(θ; Z), Z)], the level of robustness achieved for the empirical problem, concentrates uniformly around its population counterpart.

Theorem 4. Let Z ⊂ {z ∈ R^m : ‖z‖ ≤ M_z}, so that ‖Z‖ ≤ M_z almost surely, and assume either (i) that c(·, ·) is L_c-Lipschitz over Z with respect to the norm ‖·‖ in each argument, or (ii) that ℓ(θ, z) ∈ [0, M_ℓ] and z ↦ ℓ(θ, z) is γ L_c-Lipschitz for all θ ∈ Θ. If Assumptions A and B hold, then with probability at least 1 − e^{−t},

  sup_{θ∈Θ} |E_{P̂_n}[c(T_γ(θ; Z), Z)] − E_{P0}[c(T_γ(θ; Z), Z)]| ≤ 4B √( (1/n) (t + log N(Θ, t [γ − L_zz]_+ / (4 L_c L_zθ), ‖·‖)) ),   (17)

where B = L_c M_z under assumption (i) and B = M_ℓ / γ under assumption (ii).

See Section C.6 for the proof. For Θ ⊂ R^d, we have log N(Θ, ε, ‖·‖) ≤ d log(1 + diam(Θ)/ε), so that the bound (17) gives the usual √(d/n) generalization rate for the distance between adversarial perturbations and natural examples. Another consequence of Theorem 4 is that ρ̂_n(θ_WRM) in the certificate (12) is positive as long as the loss ℓ is not completely invariant to the data. To see this, note from the optimality conditions for T_γ(θ; Z) that E_{P0}[c(T_γ(θ; Z), Z)] = 0 iff ∇_z ℓ(θ; z) = 0 almost surely, and hence for large enough n we have ρ̂_n(θ) > 0 by the bound (17).

4 EXPERIMENTS

Our technique for distributionally robust optimization with adversarial training extends beyond supervised learning. To that end, we present empirical evaluations on supervised and reinforcement learning tasks, where we compare performance with empirical risk minimization (ERM) and, where appropriate, models trained with the fast-gradient method (3) (FGM) (Goodfellow et al., 2015), its iterated variant (IFGM) (Kurakin et al., 2016), and the projected-gradient method (PGM) (Madry et al., 2017).
PGMaugmentsstochasticgradientstepsfortheparameterθ withprojectedgradient ascentoverx(cid:55)→(cid:96)(θ;x,y),iterating(fordatapointx ,y ) i i ∆xt+1(θ):=argmax{∇ (cid:96)(θ;xt,y )Tη} and xt+1 :=Π (cid:8)xt+α ∆xt(θ)(cid:9) (18) i (cid:107)η(cid:107) ≤(cid:15) x i i i B(cid:15),p(xti) i t i p fort=1,...,T ,whereΠdenotesprojectionontoB (x ):={x:(cid:107)x−x (cid:107) ≤(cid:15)}. adv (cid:15),p i i p Theadversarialtrainingliterature(e.g.Goodfellowetal.(2015))usuallyconsiders(cid:107)·(cid:107) -normat- ∞ tacks, which allow imperceptible perturbations to all input features. In most scenarios, however, itisreasonabletodefendagainstweakeradversariesthatinsteadperturbinfluentialfeaturesmore. Weconsiderthissettingandtrainagainst(cid:107)·(cid:107) -normattacks. Namely,weusethesquaredEuclidean 2 costforthefeaturevectorsc (x,x(cid:48)) := (cid:107)x−x(cid:48)(cid:107)2 anddefinetheoverallcostasthecovariate-shift x 2 adversary (8) for WRM (Algorithm 1), and we use p = 2 for FGM, IFGM, PGM training in all experiments;westilltestagainstadversarialperturbationswithrespecttothenormsp = 2,∞. We useT =15iterationsforalliterativemethods(IFGM,PGM,andWRM)intrainingandattacks. adv InSection4.1,wevisualizedifferencesbetweenourapproachandad-hocmethodstoillustratethe benefitsofcertifiedrobustness.InSection4.2weconsiderasupervisedlearningproblemforMNIST whereweadversariallyperturbthetestdata. Finally,weconsiderareinforcementlearningproblem inSection4.3,wheretheMarkovdecisionprocessusedfortrainingdiffersfromthatfortesting. WRMenjoysthetheoreticalguaranteesofSections2and3forlargeγ,butforsmallγ(largeadver- sarialbudgets),WRMbecomesaheuristiclikeothermethods.InAppendixA.4,wecompareWRM withothermethodsonattackswithlargeadversarialbudgets. InAppendixA.5,wefurthercompare WRM—whichistrainedtodefendagainst(cid:107)·(cid:107) -adversaries—withotherheuristicstrainedtodefend 2 against(cid:107)·(cid:107) -adversaries. 
WRMmatchesoroutperformsotherheuristicsagainstimperceptibleat- ∞ tacks,whileitunderperformsforattackswithlargeadversarialbudgets. 4.1 VISUALIZINGTHEBENEFITSOFCERTIFIEDROBUSTNESS iid For our first experiment, we generate synthetic data Z = (X,Y) ∼ P by X ∼ N(0 ,I ) with √ 0 i 2 2 labelsY = sign((cid:107)x(cid:107) − 2),whereX ∈ R2 andI istheidentitymatrixinR2. Furthermore,to i 2 2 √ √ create a wide margin separating the classes, we remove data with (cid:107)X(cid:107) ∈ ( 2/1.3,1.3 2). We 2 train a small neural network with 2 hidden layers of size 4 and 2 and either all ReLU or all ELU activationsbetweenlayers,comparingourapproach(WRM)withERMandthe2-normFGM.For ourapproachweuseγ =2,andtomakefaircomparisonswithFGMweuse (cid:15)2 =ρ(cid:98)n(θWRM)=Wc(Pn∗(θWRM),P(cid:98)n)=EP(cid:98)n[c(T(θWRM,Z),Z)], (19) forthefast-gradientperturbationmagnitude(cid:15),whereθ istheoutputofAlgorithm1.1 WRM Figure 1 illustrates the classification boundaries for the three training procedures over the ReLU- activated (Figure 1(a)) and ELU-activated (Figure 1(b)) models. Since 70% of the data are of the 1ForELUactivationswithscaleparameter1,γ =2makesproblem(2b)stronglyconcaveoverthetraining √ data.ReLU’shavenoguarantees,butweuse15gradientstepswithstepsize1/ tforbothactivations. 8 PublishedasaconferencepaperatICLR2018 (a) ReLUmodel (b) ELUmodel Figure1. Experimentalresultsonsyntheticdata. Trainingdataareshowninblueandred. Classifica- tionboundariesareshowninyellow,purple,andgreenforERM,FGM,andWRMrespectively. The boundariesareshownwiththetrainingdataaswellasseparatelywiththetrueclassboundaries. 2 1.2 1 1.5 0.8 1 0.6 0.4 0.5 0.2 0 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0 0.5 1 1.5 2 2.5 3 3.5 (a) Syntheticdata (b) MNIST Figure2. Empiricalcomparisonbetweencertificateofrobustness(11)(blue)andtestworst-caseper- formance(red)forexperimentswith(a)syntheticdataand(b)MNIST.Weomitthecertificate’serror term(cid:15) (t).Theverticalbarindicatestheachievedlevelofrobustnessonthetrainingsetρ (θ ). 
Since 70% of the data are of the blue class (‖X‖_2 ≤ √2/1.3), distributional robustness favors pushing the classification boundary outwards; intuitively, adversarial examples are most likely to come from pushing blue points outwards across the boundary. ERM and FGM suffer from sensitivities to various regions of the data, as evidenced by the lack of symmetry in their classification boundaries. For both activations, WRM pushes the classification boundaries further outwards than ERM or FGM. However, WRM with ReLU's still suffers from sensitivities (e.g. radial asymmetry in the classification surface) due to the lack of robustness guarantees. WRM with ELU's provides a certified level of robustness, yielding an axisymmetric classification boundary that hedges against adversarial perturbations in all directions.

Recall that our certificate of robustness on the worst-case performance given in Theorem 3 applies for any level of robustness ρ. In Figure 2(a), we plot our certificate (11) against the out-of-sample (test) worst-case performance sup_{P : W_c(P, P_0) ≤ ρ} E_P[ℓ(θ; Z)] for WRM with ELU's. Since the worst-case loss is hard to evaluate directly, we solve its Lagrangian relaxation (6) for different values of γ_adv. For each γ_adv, we consider the distance to adversarial examples in the test dataset

$$\hat{\rho}_{\mathrm{test}}(\theta) := \mathbb{E}_{\hat{P}_{\mathrm{test}}}[c(T_{\gamma_{\mathrm{adv}}}(\theta, Z), Z)], \tag{20}$$

where P̂_test is the test distribution, c(z, z′) := ‖x − x′‖_2^2 + ∞·1{y ≠ y′} as before, and T_{γ_adv}(θ, Z) = argmax_z {ℓ(θ; z) − γ_adv c(z, Z)} is the adversarial perturbation of Z (Monge map) for the model θ. The worst-case losses on the test dataset are then given by

$$\mathbb{E}_{\hat{P}_{\mathrm{test}}}[\phi_{\gamma_{\mathrm{adv}}}(\theta_{\mathrm{WRM}}; Z)] + \gamma_{\mathrm{adv}}\,\hat{\rho}_{\mathrm{test}}(\theta_{\mathrm{WRM}}) = \sup_{P : W_c(P, P_{\mathrm{test}}) \le \hat{\rho}_{\mathrm{test}}(\theta_{\mathrm{WRM}})} \mathbb{E}_P[\ell(\theta_{\mathrm{WRM}}; Z)].$$

As anticipated, our certificate is almost tight near the achieved level of robustness ρ̂_n(θ_WRM) for WRM (10) and provides a performance guarantee even for other values of ρ.
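The Monge map T_{γ_adv}(θ, Z) in (20) has no closed form in general; for a smooth loss it can be approximated by gradient ascent on z ↦ ℓ(θ; z) − γ c(z, z₀). The sketch below is our own illustrative code under that assumption, with the loss gradient supplied as an oracle.

```python
import numpy as np

def monge_map(x0, grad_loss, gamma, alpha=0.5, T=100):
    """Approximate argmax_x { loss(x) - gamma * ||x - x0||^2 }
    by gradient ascent with decaying step sizes."""
    x = x0.copy()
    for t in range(T):
        g = grad_loss(x) - 2.0 * gamma * (x - x0)  # gradient of the penalized objective
        x = x + alpha / np.sqrt(t + 1) * g
    return x

def rho_hat(X, grad_loss, gamma):
    """Empirical distance to adversarial examples, Eq. (20), with the
    squared Euclidean cost on the features."""
    return np.mean([np.sum((monge_map(x, grad_loss, gamma) - x) ** 2) for x in X])
```

For a loss that is linear in x (gradient a), the penalized objective is strongly concave and the maximizer is exactly x₀ + a/(2γ), which makes the approximation easy to sanity-check.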
Figure 3. PGM attacks on the MNIST dataset. (a) Test error vs. ε_adv for the ‖·‖_2 attack; (b) test error vs. ε_adv for the ‖·‖_∞ attack. (a) and (b) show test misclassification error vs. the adversarial perturbation level ε_adv for the PGM attack with respect to the Euclidean and ∞ norms respectively. The vertical bar in (a) indicates the perturbation level used for training the PGM, FGM, and IFGM models as well as the estimated radius √(ρ̂_n(θ_WRM)). For MNIST, C_2 = 9.21 and C_∞ = 1.00.

4.2 LEARNING A MORE ROBUST CLASSIFIER

We now consider a standard benchmark: training a neural network classifier on the MNIST dataset. The network consists of 8×8, 6×6, and 5×5 convolutional filter layers with ELU activations, followed by a fully connected layer and softmax output. We train WRM with γ = 0.04 E_{P̂_n}[‖X‖_2], and for the other methods we choose ε as the level of robustness achieved by WRM (19).² In the figures, we scale the budgets 1/γ_adv and ε_adv for the adversary with C_p := E_{P̂_n}[‖X‖_p].³

First, in Figure 2(b) we again illustrate the validity of our certificate of robustness (11) for the worst-case test performance at arbitrary levels of robustness ρ. We see that our certificate provides a performance guarantee for out-of-sample worst-case performance.

We now compare adversarial training techniques. All methods achieve at least 99% test-set accuracy, implying there is little test-time penalty for the robustness levels (ε and γ) used for training. It is thus important to distinguish the methods' abilities to combat attacks. We test performance of the five methods (ERM, FGM, IFGM, PGM, WRM) under PGM attacks (18) with respect to the 2- and ∞-norms. In Figures 3(a) and (b), all adversarial methods outperform ERM, and WRM offers more robustness even with respect to these PGM attacks. Training with the Euclidean cost still provides robustness to ∞-norm fast gradient attacks. We provide further evidence in Appendix A.1.
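The scaling constants C_p above are simply empirical mean norms of the inputs; a minimal sketch (function name ours) is:

```python
import numpy as np

def budget_scales(X):
    """C_p := E_{P_n}||X||_p over a data matrix X (rows are examples),
    used to put the adversarial budgets eps_adv and 1/gamma_adv on a
    common, data-dependent scale."""
    c2 = np.mean(np.linalg.norm(X, axis=1))      # C_2: mean Euclidean norm
    cinf = np.mean(np.max(np.abs(X), axis=1))    # C_inf: mean max-abs coordinate
    return c2, cinf
```

For MNIST images with pixels in [0, 1], this yields C_2 ≈ 9.21 and C_∞ = 1.00, as quoted in the figure captions.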
Next we study stability of the loss surface with respect to perturbations to inputs. We note that small values of ρ̂_test(θ), the distance to adversarial examples (20), correspond to small magnitudes of ∇_z ℓ(θ; z) in a neighborhood of the nominal input, which ensures stability of the model. Figure 4(a) shows that ρ̂_test differs by orders of magnitude between the training methods (models θ = θ_ERM, θ_FGM, θ_IFGM, θ_PGM, θ_WRM); the trend is nearly uniform over all γ_adv, with θ_WRM being the most stable. Thus, we see that our adversarial-training method defends against gradient-exploiting attacks by reducing the magnitudes of gradients near the nominal input.

In Figure 4(b) we provide a qualitative picture by adversarially perturbing a single test datapoint until the model misclassifies it. Specifically, we again consider WRM attacks and we decrease γ_adv until each model misclassifies the input. The original label is 8, whereas on the adversarial examples IFGM predicts 2, PGM predicts 0, and the other models predict 3. WRM's "misclassifications" appear consistently reasonable to the human eye (see Appendix A.2 for examples of other digits); WRM defends against gradient-based exploits by learning a representation that makes gradients point towards inputs of other classes. Together, Figures 4(a) and (b) depict our method's defense mechanisms to gradient-based attacks: creating a more stable loss surface by reducing the magnitude of gradients and improving their interpretability.

² For this γ, φ_γ(θ_WRM; z) is strongly concave for 98% of the training data.
³ For the standard MNIST dataset, C_2 := E_{P̂_n}‖X‖_2 = 9.21 and C_∞ := E_{P̂_n}‖X‖_∞ = 1.00.
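The stability notion discussed above, small ‖∇_z ℓ(θ; z)‖ near nominal inputs, can also be probed numerically without access to a model's backward pass. The following finite-difference sketch (our own, model-agnostic) estimates the input-gradient norm for any scalar loss.

```python
import numpy as np

def input_grad_norm(loss, x, h=1e-5):
    """Central-difference estimate of ||grad_x loss(x)||_2; small values
    near the nominal input indicate a locally stable loss surface."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e.flat[i] = h
        g.flat[i] = (loss(x + e) - loss(x - e)) / (2 * h)
    return np.linalg.norm(g)
```

Comparing this quantity across models at the same test points gives a crude analogue of the stability comparison in Figure 4(a).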
