A More General Robust Loss Function

Jonathan T. Barron

Abstract

We present a two-parameter loss function which can be viewed as a generalization of many popular loss functions used in robust statistics: the Cauchy/Lorentzian, Geman-McClure, Welsch, and generalized Charbonnier loss functions (and by transitivity the L2, L1, L1-L2, and pseudo-Huber/Charbonnier loss functions). We describe and visualize this loss, and document several of its useful properties.

Many problems in statistics [8] and optimization [6] require robustness — that a model be insensitive to outliers. This idea is often used in parameter estimation tasks, where a non-robust loss function such as the L2 norm is replaced with some more robust alternative in the face of non-Gaussian noise. Practitioners, especially in the image processing and computer vision literature, have developed a large collection of different robust loss functions with different parametrizations and properties (some of which are summarized well in [2, 13]). These loss functions are often used within gradient-descent or second-order methods, or as part of M-estimation or some more specialized optimization approach. Unless the optimization strategy is co-designed with the loss being minimized, these losses are often "plug and play": only a loss and its gradient are necessary to integrate a new loss function into an existing system. When designing new models or experimenting with different design choices, practitioners often swap in different loss functions to see how they behave. In this paper we present a single loss function that is a superset of many of these common loss functions. A single continuous-valued parameter in our loss function can be set such that our loss is exactly equal to several traditional loss functions, but can also be tuned arbitrarily to model a wider family of loss functions. As a result, this loss may be useful to practitioners wishing to easily and continuously explore a wide variety of robust loss functions.

1. Derivation

We will derive our loss function from the "generalized Charbonnier" loss function [12], which has recently become popular in some flow and depth estimation tasks that require robustness [4, 10]. The generalized Charbonnier loss builds upon the Charbonnier loss function [3], which is generally defined as:

    f(x, c) = \sqrt{x^2 + c^2}    (1)

This loss is sometimes written in a reparameterized form:

    f(x, c) = c \sqrt{(x/c)^2 + 1}    (2)

This form of the loss is sometimes referred to as "L1-L2" loss (as it behaves like quadratic loss near the origin and like absolute loss far from the origin) or a pseudo-Huber loss (due to its resemblance to the classic Huber loss function [7]). The generalized Charbonnier loss function takes the Charbonnier loss and, instead of applying a square root, raises the loss to an arbitrary power parameter α:

    g(x, \alpha, c) = \left( (x/c)^2 + 1 \right)^{\alpha/2}    (3)

Here we use a slightly different parametrization from [12] and use α/2 as the exponent instead of just α. This makes the generalized Charbonnier somewhat easier to reason about with respect to standard loss functions: g(x, 2, c) resembles L2 loss, g(x, 1, c) resembles L1 loss, etc. We also omit the c scale factor in Equation 2, which gives us scale-invariance with respect to c:

    g(2x, \alpha, 2c) = g(x, \alpha, c)    (4)

This allows us to view the c "padding" variable as a "scale" parameter, similar to other common robust loss functions.
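For readers who find code easier to scan than equations, Equation 3 and the scale-invariance of Equation 4 can be sketched in a few lines of NumPy. This sketch and its names are our own illustration, not part of the original paper:

    import numpy as np

    def gen_charbonnier(x, alpha, c):
        """Generalized Charbonnier of Equation 3: ((x/c)^2 + 1)^(alpha/2)."""
        return ((x / c) ** 2 + 1.0) ** (alpha / 2.0)

    # Scale-invariance with respect to c (Equation 4): scaling x and c
    # by the same factor leaves the loss unchanged.
    x = np.linspace(-5.0, 5.0, 101)
    assert np.allclose(gen_charbonnier(2.0 * x, 1.5, 2.0 * 0.7),
                       gen_charbonnier(x, 1.5, 0.7))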
But this formulation of the generalized Charbonnier still has several unintuitive properties: the loss is non-zero when x = 0 (assuming a non-zero value of c), and the curvature of the quadratic "bowl" near x = 0 varies as a function of c and α. We therefore construct a shifted and scaled version of Equation 3 that does not have these properties:

    \frac{g(x, \alpha, c) - g(0, \alpha, c)}{c^2 g''(0, \alpha, c)} = \frac{1}{\alpha} \left( \left( (x/c)^2 + 1 \right)^{\alpha/2} - 1 \right)    (5)

This loss has the unfortunate side-effect of flattening out to 0 for large negative values of α, which we address by modifying the 1/α scale factor while preserving scale-invariance:

    h(x, \alpha, c) = \frac{z(\alpha)}{\alpha} \left( \left( \frac{(x/c)^2}{z(\alpha)} + 1 \right)^{\alpha/2} - 1 \right)    (6)

    z(\alpha) = \max(1, 2 - \alpha)    (7)

This loss resembles a normalized, centered, and scale-invariant version of the loss shown in Equation 2.

As is generally well-known, L2 loss is a special case of the reparameterized Charbonnier, and this remains true in our parametrization:

    h(x, 2, c) = \frac{1}{2} (x/c)^2    (8)

Though L1 loss is a special case of the generalized Charbonnier loss in its traditional form, true L1 loss is not actually expressible in our normalized form in Equation 6 due to the division by c. But if we assume that x is much greater than c, we see that our normalized form approaches L1 loss:

    h(x, 1, c) \approx \frac{|x|}{c} - 1 \quad \text{if } x \gg c    (9)

Unlike the generalized Charbonnier loss, our normalized loss can be shown to be a generalization of two other common loss functions. Though h(x, 0, c) is undefined due to a division by zero, we can take the limit of h(x, α, c) as α approaches zero:

    \lim_{\alpha \to 0} h(x, \alpha, c) = \log\left( \frac{1}{2} (x/c)^2 + 1 \right)    (10)

Perhaps surprisingly, this yields the Cauchy (aka Lorentzian) loss function [1]. Cauchy loss is therefore a special case of our normalized loss, or equivalently, our normalized loss is not just a generalization of the Charbonnier, but is also a generalization of Cauchy loss.

Though this does not appear to be common practice, the power parameter α in a generalized Charbonnier can be set to negative values, and this is also true for our normalized variant. By setting α = −2, our loss is equivalent to Geman-McClure loss [5]:

    h(x, -2, c) = \frac{2 (x/c)^2}{(x/c)^2 + 4}    (11)

And in the limit as α approaches negative infinity, our loss becomes Welsch [9] (aka Leclerc [11]) loss:

    \lim_{\alpha \to -\infty} h(x, \alpha, c) = 1 - \exp\left( -\frac{1}{2} (x/c)^2 \right)    (12)

The Welsch and Geman-McClure losses are therefore special cases of our loss, or equivalently, our loss function can be viewed as a generalization of the Welsch and Geman-McClure loss functions.

2. Loss Function

With this analysis in place, we can present our final loss function. Our loss is simply our normalized variant of the generalized Charbonnier loss, where we have introduced special cases to cover the otherwise-undefined limits as α approaches 0 and −∞, as without these special cases the loss is technically undefined for these values.

    \rho(x, \alpha, c) =
    \begin{cases}
      \log\left( \frac{1}{2} (x/c)^2 + 1 \right) & \text{if } \alpha = 0 \\
      1 - \exp\left( -\frac{1}{2} (x/c)^2 \right) & \text{if } \alpha = -\infty \\
      \frac{z(\alpha)}{\alpha} \left( \left( \frac{(x/c)^2}{z(\alpha)} + 1 \right)^{\alpha/2} - 1 \right) & \text{otherwise}
    \end{cases}    (13)

    z(\alpha) = \max(1, 2 - \alpha)    (14)

As we have shown, this loss function is a superset of the generalized Charbonnier loss function (and therefore the Charbonnier/L1-L2/pseudo-Huber, quadratic, and absolute loss functions by transitivity) and is also a superset of the Cauchy/Lorentzian, Geman-McClure, and Welsch/Leclerc loss functions.

As a reference, we provide the derivative of ρ(x, α, c) with respect to x, for use in gradient-based optimization:

    \frac{d\rho}{dx}(x, \alpha, c) =
    \begin{cases}
      \frac{2x}{x^2 + 2c^2} & \text{if } \alpha = 0 \\
      \frac{x}{c^2} \exp\left( -\frac{1}{2} (x/c)^2 \right) & \text{if } \alpha = -\infty \\
      \frac{x}{c^2} \left( \frac{(x/c)^2}{z(\alpha)} + 1 \right)^{(\alpha/2 - 1)} & \text{otherwise}
    \end{cases}    (15)

This is also known as the influence function ψ(x, α, c) viewed through the framework of M-estimation.

As a reference for M-estimation, we also provide the weight function w(x, α, c) to be used during IRLS that corresponds to our loss:

    w(x, \alpha, c) = \frac{1}{x} \frac{d\rho}{dx}(x, \alpha, c) =
    \begin{cases}
      \frac{2}{x^2 + 2c^2} & \text{if } \alpha = 0 \\
      \frac{1}{c^2} \exp\left( -\frac{1}{2} (x/c)^2 \right) & \text{if } \alpha = -\infty \\
      \frac{1}{c^2} \left( \frac{(x/c)^2}{z(\alpha)} + 1 \right)^{(\alpha/2 - 1)} & \text{otherwise}
    \end{cases}    (16)
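Equations 13–16 translate directly into code. Below is a minimal NumPy sketch of our own (scalar α, with the α = 0 and α = −∞ special cases handled by simple branching), including spot checks against the special cases derived in Section 1:

    import numpy as np

    def z(alpha):
        """z(alpha) = max(1, 2 - alpha), Equation 14."""
        return max(1.0, 2.0 - alpha)

    def rho(x, alpha, c):
        """Loss of Equation 13, with special cases at alpha = 0 and alpha = -inf."""
        x = np.asarray(x, dtype=float)
        if alpha == 0:
            return np.log(0.5 * (x / c) ** 2 + 1.0)
        if alpha == -np.inf:
            return 1.0 - np.exp(-0.5 * (x / c) ** 2)
        za = z(alpha)
        return (za / alpha) * (((x / c) ** 2 / za + 1.0) ** (alpha / 2.0) - 1.0)

    def drho_dx(x, alpha, c):
        """Derivative / influence function of Equation 15."""
        x = np.asarray(x, dtype=float)
        if alpha == 0:
            return 2.0 * x / (x ** 2 + 2.0 * c ** 2)
        if alpha == -np.inf:
            return (x / c ** 2) * np.exp(-0.5 * (x / c) ** 2)
        return (x / c ** 2) * ((x / c) ** 2 / z(alpha) + 1.0) ** (alpha / 2.0 - 1.0)

    def weight(x, alpha, c):
        """IRLS weight (1/x) * drho/dx of Equation 16."""
        x = np.asarray(x, dtype=float)
        if alpha == 0:
            return 2.0 / (x ** 2 + 2.0 * c ** 2)
        if alpha == -np.inf:
            return np.exp(-0.5 * (x / c) ** 2) / c ** 2
        return ((x / c) ** 2 / z(alpha) + 1.0) ** (alpha / 2.0 - 1.0) / c ** 2

    # Spot checks against the special cases of Section 1.
    x = np.linspace(-4.0, 4.0, 81)
    assert np.allclose(rho(x, 2.0, 1.0), 0.5 * x ** 2)                    # matches Eq. 8 (L2)
    assert np.allclose(rho(x, -2.0, 1.0), 2.0 * x ** 2 / (x ** 2 + 4.0))  # matches Eq. 11 (Geman-McClure)
    assert np.allclose(rho(x, -np.inf, 1.0), 1.0 - np.exp(-0.5 * x ** 2)) # matches Eq. 12 (Welsch limit)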
Let us now enumerate some properties of our loss function. Our loss is valid for all real values of α and for all real, positive values of c. The loss increases monotonically with the magnitude of x. The loss is scale-invariant with respect to c and x:

    \rho(2x, \alpha, 2c) = \rho(x, \alpha, c)    (17)

At the origin the loss is zero and the IRLS weight (when α ≤ 2) is 1:

    \forall_{\alpha, c} \quad \rho(0, \alpha, c) = 0    (18)

    \forall_{\alpha, c} \quad \frac{1}{x} \frac{d\rho}{dx}(0, \alpha, c) = 1    (19)

The roots of the second derivative of ρ(x, α, c) are:

    x = \pm c \sqrt{\frac{\alpha - 2}{\alpha - 1}}    (20)

This tells us at what value of x the loss begins to re-descend. This point has a magnitude of c when α = −∞, and that magnitude increases as α increases. The root is undefined when α ≥ 1, from which we can infer that the loss is re-descending iff α < 1.

For all values of x, α, and c > 0 the loss increases monotonically with α:

    \frac{d\rho}{d\alpha}(x, \alpha, c) \geq 0    (21)

This means that this loss can be annealed with respect to α in a graduated non-convexity framework. This property is due to our choice of z(α), as many other arbitrary clamping functions of α do not guarantee monotonicity.

For all values of α, when |x| is small with respect to c the loss is well-approximated by a quadratic:

    \rho(x, \alpha, c) \approx \frac{1}{2} (x/c)^2 \quad \text{if } |x| < c    (22)

This approximation holds for ρ and its first and second derivatives. Because the second derivative of the loss is maximized at x = 0, this quadratic approximation tells us that the second derivative is bounded from above:

    \frac{d^2\rho}{dx^2}(x, \alpha, c) \leq \frac{1}{c^2}    (23)

This property is useful when deriving approximate Jacobi preconditioners for optimization problems that minimize this loss.

When α is negative the loss approaches a constant as |x| approaches infinity, which lets us provide an upper bound on the loss:

    \forall_{x, c} \quad \rho(x, \alpha, c) \leq \frac{\alpha - 2}{\alpha} \quad \text{if } \alpha < 0    (24)

Additionally, when α is less than or equal to 1 we can provide an upper bound on the gradient of the loss:

    \forall_{x, c} \quad \frac{d\rho}{dx}(x, \alpha, c) \leq \frac{1}{c} \left( \frac{\alpha - 2}{\alpha - 1} \right)^{(\alpha - 1)/2} \quad \text{if } \alpha \leq 1    (25)
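The properties above are easy to confirm numerically. The following sketch is again our own illustration; it restates only the general branch of Equation 13 so the snippet stands alone, and none of the tested α values hit the special cases. It checks the scale-invariance of Equation 17, the monotonicity in α of Equation 21, the quadratic behaviour near the origin of Equation 22, and the upper bound of Equation 24:

    import numpy as np

    def rho(x, alpha, c):
        """General branch of Equation 13 (alpha not equal to 0 or -inf)."""
        za = max(1.0, 2.0 - alpha)
        return (za / alpha) * (((x / c) ** 2 / za + 1.0) ** (alpha / 2.0) - 1.0)

    x = np.linspace(-10.0, 10.0, 2001)

    # Scale-invariance (Equation 17): rho(2x, alpha, 2c) == rho(x, alpha, c).
    assert np.allclose(rho(2.0 * x, -1.3, 2.0 * 0.5), rho(x, -1.3, 0.5))

    # Monotonicity in alpha (Equation 21): the loss never decreases as alpha grows.
    alphas = [-8.0, -4.0, -2.0, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
    losses = np.stack([rho(x, a, 1.0) for a in alphas])
    assert np.all(np.diff(losses, axis=0) >= -1e-12)

    # Quadratic behaviour near the origin (Equation 22), regardless of alpha.
    x_small = np.linspace(-0.01, 0.01, 101)
    assert np.allclose(rho(x_small, -4.0, 1.0), 0.5 * x_small ** 2, atol=1e-6)

    # Upper bound for negative alpha (Equation 24): rho <= (alpha - 2) / alpha.
    assert np.all(rho(x, -3.0, 0.1) <= (-3.0 - 2.0) / -3.0 + 1e-12)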
A visualization of our loss and its derivative/influence and weight functions for different values of α can be seen in Figures 1, 2, 3, and 4. Because c only controls the scale of the loss on the x-axis, we do not vary c in our visualizations and instead annotate the x-axis of our plots in units of c.

3. Conclusion

We have presented a two-parameter loss function that generalizes many existing one-parameter robust loss functions: the Cauchy/Lorentzian, Geman-McClure, Welsch, generalized Charbonnier, pseudo-Huber, L2, and L1 loss functions. We have presented the loss, gradient, and M-estimator weight as a reference, in addition to enumerating several convenient properties of the loss. By reducing a large discrete family of single-parameter loss functions into a single loss function with two continuous parameters, our loss function enables the convenient and continuous exploration of different robust loss functions, and an intuitive way to compare the relative effects of the members of these loss functions.

References

[1] M. J. Black and P. Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. CVIU, 1996.
[2] M. J. Black and A. Rangarajan. The outlier process: Unifying line processes and robust statistics. CVPR, 1994.
[3] P. Charbonnier, L. Blanc-Feraud, G. Aubert, and M. Barlaud. Two deterministic half-quadratic regularization algorithms for computed imaging. ICIP, 1994.
[4] Q. Chen and V. Koltun. Fast MRF optimization with application to depth reconstruction. CVPR, 2014.
[5] S. Geman and D. E. McClure. Statistical methods for tomographic image reconstruction. Bulletin de l'Institut international de statistique, 1987.
[6] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2015.
[7] P. J. Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, 1964.
[8] P. J. Huber. Robust Statistics. Wiley, 1981.
[9] J. E. Dennis Jr. and R. E. Welsch. Techniques for nonlinear least squares and robust regression. Communications in Statistics - Simulation and Computation, 1978.
[10] P. Krähenbühl and V. Koltun. Efficient nonlocal regularization for optical flow. ECCV, 2012.
[11] Y. G. Leclerc. Constructing simple stable descriptions for image partitioning. IJCV, 1989.
[12] D. Sun, S. Roth, and M. J. Black. Secrets of optical flow estimation and their principles. CVPR, 2010.
[13] Z. Zhang. Parameter estimation techniques: A tutorial with application to conic fitting, 1995.

Figure 1: Our loss for different values of the power parameter α. When α = 2 we have L2 loss, when α = 1 we closely approximate L1 loss, when α = 0 we have Cauchy/Lorentzian loss, when α = −2 we have Geman-McClure loss, and as α approaches negative infinity we have Welsch loss.

Figure 2: Our loss's gradient for different values of the power parameter α.

Figure 3: Our loss's IRLS weight for different values of the power parameter α.

Figure 4: Our loss's IRLS weight for different values of the power parameter α, rendered using a logarithmic y-axis.
