A More General Robust Loss Function

Jonathan T. Barron
[email protected]

Abstract

We present a two-parameter loss function which can be viewed as a generalization of many popular loss functions used in robust statistics: the Cauchy/Lorentzian, Geman-McClure, Welsch, and generalized Charbonnier loss functions (and by transitivity the L2, L1, L1-L2, and pseudo-Huber/Charbonnier loss functions). We describe and visualize this loss, and document several of its useful properties.

Many problems in statistics [8] and optimization [6] require robustness — that a model be insensitive to outliers. This idea is often used in parameter estimation tasks, where a non-robust loss function such as the L2 norm is replaced with some more robust alternative in the face of non-Gaussian noise. Practitioners, especially in the image processing and computer vision literature, have developed a large collection of different robust loss functions with different parametrizations and properties (some of which are summarized well in [2, 13]). These loss functions are often used within gradient-descent or second-order methods, or as part of M-estimation or some more specialized optimization approach. Unless the optimization strategy is co-designed with the loss being minimized, these losses are often "plug and play": only a loss and its gradient are necessary to integrate a new loss function into an existing system. When designing new models or experimenting with different design choices, practitioners often swap in different loss functions to see how they behave. In this paper we present a single loss function that is a superset of many of these common loss functions. A single continuous-valued parameter in our loss function can be set such that our loss is exactly equal to several traditional loss functions, but can also be tuned arbitrarily to model a wider family of loss functions. As a result, this loss may be useful to practitioners wishing to easily and continuously explore a wide variety of robust loss functions.

1. Derivation

We will derive our loss function from the "generalized Charbonnier" loss function [12], which has recently become popular in some flow and depth estimation tasks that require robustness [4, 10]. The generalized Charbonnier loss builds upon the Charbonnier loss function [3], which is generally defined as:

    f(x, c) = \sqrt{x^2 + c^2}    (1)

This loss is sometimes written in a reparametrized form:

    f(x, c) = c \sqrt{(x/c)^2 + 1}    (2)

This form of the loss is sometimes referred to as "L1-L2" loss (as it behaves like quadratic loss near the origin and like absolute loss far from the origin) or a pseudo-Huber loss (due to its resemblance to the classic Huber loss function [7]). The generalized Charbonnier loss function takes the Charbonnier loss and, instead of applying a square root, raises the loss to an arbitrary power parameter α:

    g(x, \alpha, c) = \left((x/c)^2 + 1\right)^{\alpha/2}    (3)

Here we use a slightly different parametrization from [12] and use α/2 as the exponent instead of just α. This makes the generalized Charbonnier somewhat easier to reason about with respect to standard loss functions: g(x, 2, c) resembles L2 loss, g(x, 1, c) resembles L1 loss, etc. We also omit the c scale factor of Equation 2, which gives us scale-invariance with respect to c:

    g(2x, \alpha, 2c) = g(x, \alpha, c)    (4)

This allows us to view the c "padding" variable as a "scale" parameter, similar to other common robust loss functions.
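As a concrete aside (a minimal numerical sketch of our own, not code accompanying the paper), Equation 3 and the scale-invariance of Equation 4 can be checked in a few lines of Python; the function name gen_charbonnier is ours:

    import numpy as np

    def gen_charbonnier(x, alpha, c):
        # Generalized Charbonnier loss in our parametrization (Equation 3).
        return ((x / c) ** 2 + 1) ** (alpha / 2)

    # Scale-invariance (Equation 4): scaling x and c by the same factor
    # leaves the loss unchanged.
    x = np.linspace(-10, 10, 101)
    assert np.allclose(gen_charbonnier(2 * x, -2.0, 2 * 1.5),
                       gen_charbonnier(x, -2.0, 1.5))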
But this formulation of the generalized Charbonnier still has several unintuitive properties: the loss is non-zero when x = 0 (assuming a non-zero value of c), and the curvature of the quadratic "bowl" near x = 0 varies as a function of c and α. We therefore construct a shifted and scaled version of Equation 3 that does not have these properties:

    \frac{g(x, \alpha, c) - g(0, \alpha, c)}{c^2 g''(0, \alpha, c)} = \frac{1}{\alpha}\left(\left((x/c)^2 + 1\right)^{\alpha/2} - 1\right)    (5)

This loss has the unfortunate side-effect of flattening out to 0 for large negative values of α, which we address by modifying the 1/α scale factor while preserving scale-invariance:

    h(x, \alpha, c) = \frac{z(\alpha)}{\alpha}\left(\left(\frac{(x/c)^2}{z(\alpha)} + 1\right)^{\alpha/2} - 1\right)    (6)

    z(\alpha) = \max(1, 2 - \alpha)    (7)

This loss resembles a normalized, centered, and scale-invariant version of the loss shown in Equation 2.

As is generally well-known, L2 loss is a special case of the generalized Charbonnier, and this remains true in our reparametrization:

    h(x, 2, c) = \frac{1}{2}(x/c)^2    (8)

Though L1 loss is a special case of the generalized Charbonnier loss in its traditional form, true L1 loss is not actually expressible in our normalized form in Equation 6 due to the division by c. But if we assume that x is much greater than c, we see that our normalized form approaches L1 loss:

    h(x, 1, c) \approx \frac{|x|}{c} - 1 \quad \text{if } |x| \gg c    (9)

Unlike the generalized Charbonnier loss, our normalized loss can be shown to be a generalization of two other common loss functions. Though h(x, 0, c) is undefined due to a division by zero, we can take the limit of h(x, α, c) as α approaches zero:

    \lim_{\alpha \to 0} h(x, \alpha, c) = \log\left(\frac{1}{2}(x/c)^2 + 1\right)    (10)

Perhaps surprisingly, this yields the Cauchy (aka Lorentzian) loss function [1]. Cauchy loss is therefore a special case of our normalized loss, or equivalently, our normalized loss is not just a generalization of the Charbonnier, but is also a generalization of Cauchy loss.

Though this does not appear to be common practice, the power parameter α in a generalized Charbonnier can be set to negative values, and this is also true for our normalized variant. By setting α = -2, our loss is equivalent to Geman-McClure loss [5]:

    h(x, -2, c) = \frac{2(x/c)^2}{(x/c)^2 + 4}    (11)

And in the limit as α approaches negative infinity, our loss becomes Welsch [9] (aka Leclerc [11]) loss:

    \lim_{\alpha \to -\infty} h(x, \alpha, c) = 1 - \exp\left(-\frac{1}{2}(x/c)^2\right)    (12)

The Welsch and Geman-McClure losses are therefore special cases of our loss, or equivalently, our loss function can be viewed as a generalization of the Welsch and Geman-McClure loss functions.
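These special cases are straightforward to verify numerically. The following Python sketch (our own, not code from the paper; the names z and h mirror the notation above) checks Equations 8, 10, 11, and 12 against the normalized loss of Equations 6 and 7, probing the two limits with extreme finite values of α:

    import numpy as np

    def z(alpha):
        # Equation 7.
        return max(1.0, 2.0 - alpha)

    def h(x, alpha, c):
        # Normalized loss of Equation 6 (alpha nonzero and finite).
        return (z(alpha) / alpha) * (((x / c) ** 2 / z(alpha) + 1) ** (alpha / 2) - 1)

    x, c = np.linspace(-4, 4, 81), 1.0
    # alpha = 2: L2 loss (Equation 8).
    assert np.allclose(h(x, 2.0, c), 0.5 * (x / c) ** 2)
    # alpha -> 0: Cauchy/Lorentzian loss (Equation 10), probed at a tiny alpha.
    assert np.allclose(h(x, 1e-6, c), np.log(0.5 * (x / c) ** 2 + 1), atol=1e-6)
    # alpha = -2: Geman-McClure loss (Equation 11).
    assert np.allclose(h(x, -2.0, c), 2 * (x / c) ** 2 / ((x / c) ** 2 + 4))
    # alpha -> -inf: Welsch/Leclerc loss (Equation 12), probed at a very negative alpha.
    assert np.allclose(h(x, -1e8, c), 1 - np.exp(-0.5 * (x / c) ** 2), atol=1e-6)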
2. Loss Function

With this analysis in place, we can present our final loss function. Our loss is simply our normalized variant of the generalized Charbonnier loss, where we have introduced special cases to cover the otherwise-undefined limits as α approaches 0 and -∞, as without these special cases the loss is technically undefined for these values.

    \rho(x, \alpha, c) =
    \begin{cases}
      \log\left(\frac{1}{2}(x/c)^2 + 1\right) & \text{if } \alpha = 0 \\
      1 - \exp\left(-\frac{1}{2}(x/c)^2\right) & \text{if } \alpha = -\infty \\
      \frac{z(\alpha)}{\alpha}\left(\left(\frac{(x/c)^2}{z(\alpha)} + 1\right)^{\alpha/2} - 1\right) & \text{otherwise}
    \end{cases}    (13)

    z(\alpha) = \max(1, 2 - \alpha)    (14)

As we have shown, this loss function is a superset of the generalized Charbonnier loss function (and therefore the Charbonnier/L1-L2/pseudo-Huber, quadratic, and absolute loss functions by transitivity) and is also a superset of the Cauchy/Lorentzian, Geman-McClure, and Welsch/Leclerc loss functions.

As a reference, we provide the derivative of ρ(x, α, c) with respect to x, for use in gradient-based optimization:

    \frac{d\rho}{dx}(x, \alpha, c) =
    \begin{cases}
      \frac{2x}{x^2 + 2c^2} & \text{if } \alpha = 0 \\
      \frac{x}{c^2}\exp\left(-\frac{1}{2}(x/c)^2\right) & \text{if } \alpha = -\infty \\
      \frac{x}{c^2}\left(\frac{(x/c)^2}{z(\alpha)} + 1\right)^{\alpha/2 - 1} & \text{otherwise}
    \end{cases}    (15)

This is also known as the influence function ψ(x, α, c) when viewed through the framework of M-estimation.

As a reference for M-estimation, we also provide the weight function w(x, α, c) to be used during IRLS that corresponds to our loss:

    \frac{1}{x}\frac{d\rho}{dx}(x, \alpha, c) =
    \begin{cases}
      \frac{2}{x^2 + 2c^2} & \text{if } \alpha = 0 \\
      \frac{1}{c^2}\exp\left(-\frac{1}{2}(x/c)^2\right) & \text{if } \alpha = -\infty \\
      \frac{1}{c^2}\left(\frac{(x/c)^2}{z(\alpha)} + 1\right)^{\alpha/2 - 1} & \text{otherwise}
    \end{cases}    (16)
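For concreteness, here is one possible NumPy transcription of Equations 13 through 16 (an unofficial sketch of ours, not released reference code; the names rho, psi, and weight are our own), together with a finite-difference sanity check that the stated derivative matches the loss:

    import numpy as np

    def z(alpha):
        # Equation 14.
        return max(1.0, 2.0 - alpha)

    def rho(x, alpha, c):
        # The loss (Equation 13), including the alpha = 0 and alpha = -inf cases.
        if alpha == 0:
            return np.log(0.5 * (x / c) ** 2 + 1)
        if alpha == -np.inf:
            return 1 - np.exp(-0.5 * (x / c) ** 2)
        return (z(alpha) / alpha) * (((x / c) ** 2 / z(alpha) + 1) ** (alpha / 2) - 1)

    def psi(x, alpha, c):
        # The derivative / influence function (Equation 15).
        if alpha == 0:
            return 2 * x / (x ** 2 + 2 * c ** 2)
        if alpha == -np.inf:
            return (x / c ** 2) * np.exp(-0.5 * (x / c) ** 2)
        return (x / c ** 2) * ((x / c) ** 2 / z(alpha) + 1) ** (alpha / 2 - 1)

    def weight(x, alpha, c):
        # The IRLS weight (Equation 16): psi(x, ...) / x, defined at x = 0.
        if alpha == 0:
            return 2 / (x ** 2 + 2 * c ** 2)
        if alpha == -np.inf:
            return np.exp(-0.5 * (x / c) ** 2) / c ** 2
        return ((x / c) ** 2 / z(alpha) + 1) ** (alpha / 2 - 1) / c ** 2

    # Sanity check: psi should match a central finite difference of rho.
    x = np.linspace(-4, 4, 41)
    for a in [2.0, 1.0, 0.0, -2.0, -np.inf]:
        fd = (rho(x + 1e-5, a, 1.0) - rho(x - 1e-5, a, 1.0)) / 2e-5
        assert np.allclose(psi(x, a, 1.0), fd, atol=1e-4)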
Let us now enumerate some properties of our loss function. Our loss is valid for all real values of α and for all real, positive values of c. The loss increases monotonically with the magnitude of x. The loss is scale-invariant with respect to c and x:

    \rho(2x, \alpha, 2c) = \rho(x, \alpha, c)    (17)

At the origin the loss is zero and the IRLS weight (when α ≤ 2) is 1:

    \forall_{\alpha, c} \;\; \rho(0, \alpha, c) = 0    (18)

    \forall_{\alpha, c} \;\; \frac{1}{x}\frac{d\rho}{dx}(0, \alpha, c) = 1    (19)

The roots of the second derivative of ρ(x, α, c) are:

    x = \pm c \sqrt{\frac{\alpha - 2}{\alpha - 1}}    (20)

This tells us at what value of x the loss begins to re-descend. This point has a magnitude of c when α = -∞, and that magnitude increases as α increases. The root is undefined when α ≥ 1, from which we can infer that the loss is re-descending iff α < 1.

For all values of x, α, and c > 0 the loss increases monotonically with α:

    \frac{d\rho}{d\alpha}(x, \alpha, c) \ge 0    (21)

This means that this loss can be annealed with respect to α in a graduated non-convexity framework. This property is due to our choice of z(α), as many other arbitrary clamping functions of α do not guarantee monotonicity.

For all values of α, when |x| is small with respect to c the loss is well-approximated by a quadratic:

    \rho(x, \alpha, c) \approx \frac{1}{2}(x/c)^2 \quad \text{if } |x| < c    (22)

This approximation holds for ρ and its first and second derivatives. Because the second derivative of the loss is maximized at x = 0, this quadratic approximation tells us that the second derivative is bounded from above:

    \frac{d^2\rho}{dx^2}(x, \alpha, c) \le \frac{1}{c^2}    (23)

This property is useful when deriving approximate Jacobi preconditioners for optimization problems that minimize this loss.

When α is negative the loss approaches a constant as |x| approaches infinity, which lets us provide an upper bound on the loss:

    \forall_{x, c} \;\; \rho(x, \alpha, c) \le \frac{\alpha - 2}{\alpha} \quad \text{if } \alpha < 0    (24)

Additionally, when α is less than or equal to 1 we can provide an upper bound on the gradient of the loss:

    \forall_{x, c} \;\; \frac{d\rho}{dx}(x, \alpha, c) \le \frac{1}{c}\left(\frac{\alpha - 2}{\alpha - 1}\right)^{(\alpha - 1)/2} \quad \text{if } \alpha \le 1    (25)

A visualization of our loss and its derivative/influence and weight functions for different values of α can be seen in Figures 1, 2, 3, and 4. Because c only controls the scale of the loss on the x-axis, we do not vary c in our visualizations and instead annotate the x-axis of our plots in units of c.

Figure 1: Our loss for different values of the power parameter α. When α = 2 we have L2 loss, when α = 1 we closely approximate L1 loss, when α = 0 we have Cauchy/Lorentzian loss, when α = -2 we have Geman-McClure loss, and as α approaches negative infinity we have Welsch loss.

Figure 2: Our loss's gradient for different values of the power parameter α.

Figure 3: Our loss's IRLS weight for different values of the power parameter α.

Figure 4: Our loss's IRLS weight for different values of the power parameter α, rendered using a logarithmic y-axis.

3. Conclusion

We have presented a two-parameter loss function that generalizes many existing one-parameter robust loss functions: the Cauchy/Lorentzian, Geman-McClure, Welsch, generalized Charbonnier, pseudo-Huber, L2, and L1 loss functions. We have presented the loss, gradient, and M-estimator weight as a reference, in addition to enumerating several convenient properties of the loss. By reducing a large discrete family of single-parameter loss functions into a single loss function with two continuous parameters, our loss function enables the convenient and continuous exploration of different robust loss functions, and an intuitive way to compare the relative effects of the members of this family of loss functions.

References

[1] M. J. Black and P. Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. CVIU, 1996.
[2] M. J. Black and A. Rangarajan. The outlier process: Unifying line processes and robust statistics. CVPR, 1994.
[3] P. Charbonnier, L. Blanc-Feraud, G. Aubert, and M. Barlaud. Two deterministic half-quadratic regularization algorithms for computed imaging. ICIP, 1994.
[4] Q. Chen and V. Koltun. Fast MRF optimization with application to depth reconstruction. CVPR, 2014.
[5] S. Geman and D. E. McClure. Statistical methods for tomographic image reconstruction. Bulletin de l'Institut international de statistique, 1987.
[6] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2015.
[7] P. J. Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, 1964.
[8] P. J. Huber. Robust Statistics. Wiley, 1981.
[9] J. E. Dennis Jr. and R. E. Welsch. Techniques for nonlinear least squares and robust regression. Communications in Statistics - Simulation and Computation, 1978.
[10] P. Krähenbühl and V. Koltun. Efficient nonlocal regularization for optical flow. ECCV, 2012.
[11] Y. G. Leclerc. Constructing simple stable descriptions for image partitioning. IJCV, 1989.
[12] D. Sun, S. Roth, and M. J. Black. Secrets of optical flow estimation and their principles. CVPR, 2010.
[13] Z. Zhang. Parameter estimation techniques: A tutorial with application to conic fitting, 1995.