Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2012-022 October 7, 2012 Patch complexity, finite pixel correlations and optimal denoising Anat Levin, Boaz Nadler, Fredo Durand, and William T. Freeman massachusetts institute of technology, cambridge, ma 02139 usa — www.csail.mit.edu Patch Complexity, Finite Pixel Correlations and Optimal Denoising AnatLevin1 BoazNadler1 FredoDurand2 WilliamT.Freeman2 1WeizmannInstitute 2MITCSAIL Abstract. Imagerestorationtasksareill-posedproblems,typicallysolvedwith priors. Sincetheoptimalprior istheexact unknown densityof natural images, actualpriorsareonlyapproximateandtypicallyrestrictedtosmallpatches.This raisesseveralquestions:Howmuchmaywehopetoimprovecurrentrestoration resultswithfuturesophisticatedalgorithms?Andmorefundamentally,evenwith perfectknowledgeofnaturalimagestatistics,whatistheinherentambiguityof theproblem?Inaddition,sincemostcurrentmethodsarelimitedtofinitesupport patchesorkernels,whatistherelationbetweenthepatchcomplexityofnatural images,patchsize,andrestorationerrors?Focusingonimagedenoising,wemake severalcontributions.First,inlightofcomputationalconstraints,westudythere- lationbetweendenoisinggainandsamplesizerequirementsinanonparametric approach. Wepresent alaw of diminishing return, namely that withincreasing patchsize,rarepatchesnotonlyrequireamuchlargerdataset,butalsogainlittle fromit.Thisresultsuggestsnoveladaptivevariable-sizedpatchschemesforde- noising.Second,westudyabsolutedenoisinglimits,regardlessofthealgorithm used,andtheconvergeratetothemasafunctionofpatchsize.Scaleinvariance ofnaturalimagesplaysakeyrolehereandimpliesbothastrictlypositivelower boundondenoisingandapowerlawconvergence.Extrapolatingthisparametric law gives a ballpark estimate of the best achievable denoising, suggesting that someimprovement,althoughmodest,isstillpossible. 1 Introduction Characterizingthepropertiesofnaturalimagesiscriticalforcomputerandhumanvi- sion [18,13,20,16,7,23]. In particular, low level vision tasks such as denoising, su- perresolution,deblurringandcompletion,arefundamentallyill-posedsinceaninfinite numberofimagesxcanexplainanobserveddegradedimagey.Imagepriorsarecrucial inreducingthisambiguity,asevenapproximateknowledgeoftheprobabilityp(x) of naturalimagescanruleoutunlikelysolutions. Thisraisesseveralfundamentalquestions.First,atthemostbasiclevel,whatisthein- herentambiguityoflowlevelimagerestorationproblems?i.e.,cantheybesolvedwith zero error given perfect knowledge of the density p(x)? More practically, how much canwehopetoimprovecurrentrestorationresultswithfutureadvancesinalgorithms andimagepriors? Clearly,moreaccuratepriorsimproverestorationresults.However,while mostimage priors(parametric,non-parametric,learning-based)[2,14,20,16,23]aswellasstudies onimagestatistics[13,7]arerestrictedtolocalimagepatchesorkernels,littleisknown abouttheirdependenceonpatchsize.Henceanotherquestionofpracticalimportanceis thefollowing:Whatisthepotentialrestorationgainfromanincreaseinpatchsize?and, how is it related to the ”patchcomplexity”of naturalimages, namely their geometry, densityandinternalcorrelations. Inthispaperwestudythesequestionsinthecontextofthesimplestrestorationtask:im- agedenoising[18,20,6,11,9,15,10,23].Webuildonpriorattemptstostudythelim- itsofnaturalimagedenoising[17,3,8].Inparticular,onthenon-parametricapproach of[14],whichestimatedtheoptimalerrorfortheclassofpatchbasedalgorithmsthat denoiseeachpixelusingonlyafinite supportofnoisypixelsaroundit. Amajorlimi- tationof[14],isthatcomputationalconstraintsrestrictedittorelativelysmallpatches. Thus, [14] was unable to predict the best achievable denoising of algorithmsthat are allowedto utilize the entireimage support.In otherwords,an absolutePSNR bound, independentofpatchsizerestrictions,isstillunknown. Wemakeseveraltheoreticalcontributionswithpracticalimplications,towardsanswer- ing these questions. First we considernon-parametricdenoisingwith a finite external database and finite patch size. We study the dependence of denoising error on patch size.Ourmainresultisalawofdiminishingreturn:whenthewindowsizeisincreased, thedifficultyoffindingenoughtrainingdataforaninputnoisypatchdirectlycorrelates withdiminishingreturnsindenoisingperformance.Thatis, notonlyisiteasierto in- creasewindowsizeforsmoothpatches,theyalsobenefitmorefromsuchanincrease. In contrast, textured regions require a significantly larger sample size to increase the patchsize,whilegainingverylittlefromsuchanincrease.Fromapracticalviewpoint, thisanalysissuggestsanadaptivestrategywhereeachpixelisdenoisedwithavariable windowsizethatdependsonitslocalpatchcomplexity. Next, we put computational issues aside, and study the fundamental limit of denois- ing,withaninfinite windowsize andaperfectlyknownp(x) (i.e.,an infinitetraining database).Underasimplifiedimageformationmodelwestudythefollowingquestion: Whatistheabsolutelowerboundondenoisingerror,andhowfastdowe convergeto it, as a functionofwindowsize. We show thatthe scale invarianceof naturalimages plays a key role and yields a power law convergencecurve. Remarkably, despite the model’ssimplicity,its predictionsagreewell with empiricalobservations.Extrapolat- ing this parametric law providesa ballpark predictionon the best possible denoising, suggestingthatcurrentalgorithmsmaystillbeimprovedbyabout0.5−1dB. 2 Optimal MeanSquare Error Denoising In image denoising, given a noisy version y = x+n of a clean image x, corrupted by additive noise n, the aim is to estimate a cleaner version xˆ. The common quality measureofdenoisingalgorithmsistheirmeansquarederror,averagedoverallpossible cleanandnoisyx,ypairs,wherexissampledfromthedensityp(x)ofnaturalimages Z Z MSE=E[kxˆ−xk2]= p(x) p(y|x)kx−xˆk2dydx (1) 2 It is known, e.g. [14], that for a single pixel of interest xc the estimator minimizing Eq.(1)istheconditionalmean: Z p(y|x) xˆc =µ(y)=E[xc|y]= p(y) p(x)xcdx. (2) InsertingEq.(2)intoEq.(1)yieldsthattheminimummeansquarederror(MMSE)per pixelistheconditionalvariance Z Z MMSE=Ey[V[xc|y]]= p(y) p(x|y)(xc−µ(y))2dxdy. (3) TheMMSEmeasurestheinherentambiguityofthedenoisingproblemandthestatistics ofnaturalimages,asanynaturalimagexwithinthenoiselevelofymayhavegenerated y.SinceEq.(2)dependsontheexactunknowndensityp(x)ofnaturalimages(withfull imagesupport),itisunfortunatelynotpossibletocompute.Nonetheless,bydefinitionit isthetheoreticallyoptimaldenoisingalgorithm,andinparticularoutperformsallother algorithms,eventhosethatdetecttheclassofapictureandthenuseclass-specificpriors [3],orthosewhichleverageinternalpatchrepetition[6,22].Thatsaid,suchapproaches canyieldsignificantpracticalbenefitswhenusingafinitedata. Finally,notethatthedensityp(x)playsadualrole.AccordingtoEq.(1),itisneededfor evaluatinganydenoisingalgorithm,sincetheMSEistheaverageovernaturalimages. Additionally,itdeterminestheoptimalestimatorµ(y)inEq.(2). Finitesupport: First,weconsideralgorithmsthatonlyuseinformationinawindowof dnoisypixelsaroundthe pixeltobe denoised.Whenneeded,we denotebyxw ,yw d d therestrictionofthecleanandnoisyimagestoad-pixelwindowandbyxc,ycthepixel of interest, usually the centralone with c = 1. As in Eq. (3),the optimalMMSEd of anydenoisingalgorithmrestrictedtoadpixelssupportisalsotheconditionalvariance, butcomputedoverthespaceofnaturalpatchesofsizedratherthanonfull-sizeimages. Bydefinition,theoptimaldenoisingerrormayonlydecreasewithwindowsized,since the best algorithm seeing d + 1 pixels can ignore the last pixel and provide the an- swer of the d pixels algorithm. This raises two critical questions: how does MMSEd decreasewith d, andwhatis MMSE∞, namelythe bestachievabledenoisingerrorof anyalgorithm(notnecessarilypatchbased)? Non-Parametric approach with a finite training set: The challenge in evaluating MMSEd is that the density p(x) of natural images is unknown. To bypass it, a non- parametricstudyofMMSEdforsmallvaluesofdwasmadein[14],byapproximating Eq.(2)withadiscretesumoveralargedatasetofcleand-dimensionalpatches{xi}Ni=1. P 1 p(y |x )x µˆd(y)= N1 Pi p(wyd |ix,wd )i,c (4) N i wd i,wd where,foriidzero-meanGaussiannoisenwithvarianceσ2, p(ywd|xwd)= (2πσ12)d/2e−kxwd2−σy2wdk2. (5) 3 Aninterestingconclusionof[14]wasthatforsmallpatchesorhighnoiselevels,exist- ingdenoisingalgorithmsareclosetotheoptimalMMSEd. For Eq. (4) to be an accurate estimate of µd(y), the givendataset must contain many cleanpatchesatdistance(dσ2)1/2fromyw ,whichistheexpecteddistanceoftheorig- d inalpatch,E[kxw −yw k2] = dσ2. As a result, non-parametricdenoisingrequiresa d d largertrainingsetatlow noiselevelsσ wherethedistancedσ2 issmaller,oratlarger patchsizesdwherecleanpatchsamplesarespreadfurtherapart.Thiscurseofdimen- sionalityrestricted[14]tosmallvaluesofd. Incontrast,inthispaperweareinterestedinthebestachievabledenoisingofanyalgo- rithm,withoutrestrictionsonsupportsize, namelyMMSE∞. We thusgeneralize[14] bystudyinghowMMSEd decreasesasafunctionofd,andasaresultprovideanovel predictionofMMSE∞(seeSection4). Note that MMSE∞ corresponds to an infinite database of all clean images, which in particular also includes the original image x. However, this does not imply that MMSE∞ = 0,sincethisdatabasealso includesmanyslightvariantsofx, withsmall spatial shifts or illumination changes. Any of these variants may have generated the noisyimagey,makingitisimpossibletoidentifythecorrectonewithzeroerror. 3 Patch Size, Complexity andPSNR Gain Increasingthewindowsizeprovidesamoreaccuratepriorasitconsiderstheinforma- tion ofdistantpixelsonthe pixelofinterest. However,in a non-parametricapproach, thisrequiresamuchlargertrainingsetanditisunclearhowsubstantialthePSNRgain might be. This section shows that this tradeoff depends on “patch complexity”, and presentsa law ofdiminishingreturn:patchesthat requirea large increase in database size also benefit little from a larger window. This gain is governed by the statistical dependencyofperipheralpixelsandthecentralone:weaklycorrelatedpixelsprovide littleinformationwhileleadingtoamuchlargerspreadinpatchspace,andthusrequire asignificantlylargertrainingdata. 3.1 Empiricalstudy TounderstandthedependenceofPSNRonwindowsize,wepresentanempiricalstudy withM = 104 cleanandnoisypairs{(xj,yj)}Mj=1 andN = 108 samplestakenfrom the LabelMe dataset, as in [14]. We compute the non-parametric mean (Eq. (4)) at varying window sizes d. For each noisy patch we determine the largest d at which estimationisstillreliablebycomparingtheresultswithdifferentcleansubsets1. 1WedividetheN cleansamplesinto10groups,computethenon-parametricestimatorµˆd(yj) oneachgroupseparately,andcheckifthevarianceofthese10estimatorsismuchsmallerthan σ2.Forsmalld,samplesaredenseenoughandalltheseestimatorsprovideconsistentresults. Forlarged,sampledensityisinsufficient,andeachestimatorgivesaverydifferentresult. 4 35 G7 30 G11 PSNR23500 10 20 30 GGGGGG11223447150040 PSNR1225050 50 100 GGGGGG4611139914911397750 Number of pixels d Number of pixels d 1Dsignals,σ=10 2Dsignals,σ=50 Fig.1:ForpatchgroupsGℓ ofvaryingcomplexity,wepresentPSNRvs.numberofpixelsdin windowwd,whered = 1,...,ℓ.Highercurvescorrespondtosmoothregions,whichflattenat largerpatchdimensions.Texturedregionscorrespondtolowercurveswhichnotonlyrunoutof samplessooner,butalsotheircurvesflattenearlier. We divide the M test patches into groups Gℓ based on the largest window size ℓ at whichtheestimateisstillreliable.Foreachgroup,Fig.1displaystheempiricalPSNR averagedoverthegroup’spatchesasafunctionofwindowsized,ford=1,...,ℓ(that is,uptothemaximalwindowsized=ℓatwhichestimationisreliable),where: (cid:18) (cid:19) X PSNR(Gℓ|wd)=−10log10 |G1| (xj,c−µˆd(yj))2 ℓ j∈G ℓ Wefurthercomputeforeachgroupitsmeangradientmagnitude,k∇yw k,andobserve ℓ thatgroupswithsmallersupportsizeℓ,whichrunmorequicklyoutoftrainingdata,in- cludemostlypatcheswithlargegradients(texture).ThesepatchescorrespondtoPSNR curvesthatarelowerandalsoflattenearlier(Fig.1).Incontrast,smootherpatchesare ingroupsthatrunoutofexampleslater(higherℓ)andalsogainmorefromanincrease inpatchwidth:thehighercurvesinFig.1flattenlater.ThedatainFig.1demonstrates animportantprinciple:Whenanincreaseinpatchwidthrequiresmanymore training samples,theperformancegainduetotheseadditionalsamplesisrelativelysmall. Tounderstandtherelationbetweenpatchcomplexity,denoisinggain,andrequirednum- berofsamples,weshowthatthestatisticaldependencybetweenadjacentpixelsisbro- ken whenlargegradientsare observed.We sample rowsof 3 consecutivepixelsfrom cleanxandnoisyy naturalimages(Fig.2(a)),discretizetheminto100intensitybins, and estimate the conditional probability p(x1,x3|y1,y2) by counting occurrences in each bin. When the gradient |y2 −y1| is high with respect to the noise level, x1,x3 are approximatelyindependent,p(x1 = i,x3 = j||y1 −y2| ≫ σ) ≈ p1(i)p3(j), see Fig. 2(d,f).Incontrast,smallgradientsdon’tbreakthe dependency,andwe observea much more elongated structure, see Fig. 2(b,c,e). For reference, Fig. 2(g) shows the unconditionaljointdistributionp(x1,x3), withoutseeinganyy.Its diagonalstructure impliesthatwhilethepixels(x1,x3)arebydefaultdependent,thedependencyisbro- ken in the presence of a strong edge between them. From a practical perspective, if |y1−y2|≫σ,addingthepixely3doesnotcontributemuchtotheestimationofx1.If thegradient|y1−y2|issmallthereisstilldependencybetweenx3andx1,soaddingy3 doesfurtherreducethereconstructionerror. Asimpleexplanationforthisphenomenon 5 x x 2 3 y 2 y 1 x 1 (a) x x x 1 1 1 y 1 y 1 y 1 y x y x y x 2 3 2 3 2 3 |y −y |=10 |y −y |=40 |y −y |=80 1 2 1 2 1 2 (b)σ=10 (c)σ=10 (d)σ=10 x x x 1 1 1 y 1 y 1 y x y x x 2 3 2 3 3 |y1−y2|=10 |y1−y4|=40 unconstrained (e)σ=5 (f)σ=5 (g) Fig. 2: (a) A clean and noisy 1D signal. (b-g) Joint distribution tables. (b-f) p(x1,x3|y1,y2) attwonoiselevels.(g)p(x1,x3),beforeanyobservation.Whileneighboringpixelsaredepen- dentindefault,thedependencyisbrokenwhentheobservedgradientishighwithrespecttothe noise(d,f). istothinkofadjacentobjectsinanimage.Asobjectscanhaveindependentcolors,the colorofoneobjecttellsusnothingaboutitsneighborontheothersideoftheedge. 6 4 4 2 2 p(x|y,y) 3 3 1.5 p(x|y)=p(x|y,y) 1.5 1 1 2 (y,y) (y,y) 1 1 1 1 2 x22 1 2 x22 1 2 p 1 p 1 p(x1|y1) 1 1 0.5 0.5 0 0 x12 4 0 0 1 x21 3 4 00 1 x21 3 4 00 1 x21 3 4 (a)Independent (b)FullyDependent (c)Independent (d)FullyDependent Samplesfromjointdistributions Marginaldistributions Fig.3:Atoyexampleof2Dsampledensities. 3.2 TheoreticalAnalysis Motivatedby Fig. 1 and Fig. 2, we study the implicationsof partialstatistical depen- dencebetweenpixels,bothontheperformancegainexpectedbyincreasingthewindow size,andontherequirementsonsamplesize. 2D Gaussiancase: To gain intuition,we first considera trivial scenariowhere patch size is increased from 1 to 2 pixels and distributions are Gaussians. In Fig. 3(a), x1 andx2 areindependent,whileinFig.3(b)theyarefullydependentandx1 =x2.Both caseshavethesamemarginaldistributionp(x1)withequaldenoisingperformancefor a 1-pixelwindow. We draw N = 100 samples from p(x1,x2) and see how many of themfallwithinaradiusσaroundanoisyobservation(y1,y2).Intheuncorrelatedcase (Fig. 3(a)),the samplesare spreadin the 2D plane andthereforeonlya smallportion of them fallnear(y1,y2). In the secondcase, since the samplesare concentratedin a significantly smaller region (a 1-D line), there are many more samples near (y1,y2). Hence, in the fullycorrelatedcase a nonparametricestimator requiresa significantly smallerdatasettohaveasufficientnumberofcleansamplesinthevicinityofy. To study the accuracy of restoration, Fig. 3(c,d) shows the marginal distributions p(x1|y1,y2). When x1,x2 are independent, increasing window size to take y2 into account provides no information about x1, and p(x1|y1) = p(x1|y1,y2). Worse, de- noising performance decreases when the window size is increased because we now have fewer trainingpatchesinside the relevantneighborhood.In contrast, in the fully correlatedcase,addingy2 providesvaluableinformationaboutx1,andthevarianceof p(x1|y1,y2)ishalfofthevariancegiveny1alone.Thisillustrateshowhighcorrelation betweenpixelsyieldsasignificantdecreaseinerrorwithoutrequiringalargeincrease insamplesize.Conversely,weakcorrelationgivesonlylimitedgainwhilerequiringa largeincreaseintrainingdata. Generalderivation: Weextendour2Danalysistoddimensions.Thefollowingclaim, provedin theappendix,providestheleadingerrortermofthenon-parametricestimator µˆd(y)ofEq.(4)asafunctionoftrainingsetsizeN andwindowsized.Itissimilarto resultsinthestatisticsliteratureontheMSEoftheNadaraya-Watsonestimator. 7 Claim. Asymptotically,asN →∞,theexpectednon-parametricMSEwithawindow ofsizedpixelsis (cid:0) (cid:1) EN[MSEd(y)]=MMSEd(y)+ N1Vd(y)+o N1 (6) V[x |y ]|Φ | Vd ≈ 1σw2dd d , (7) withV[x1|yw ]theconditionalvarianceofthecentralpixelx1givenawindowwdfrom d y,and|Φd|isthedeterminantofthelocald×dcovariancematrixofp(y), ˛ ˛ ˛ ∂2logp(y )˛ |Φd|−1 =˛˛− ∂2y wd ˛˛. (8) wd TheexpectederroristhesumofthefundamentallimitMMSEd(y)andavarianceterm that accounts for the finite number of samples N in the dataset. As in Monte-Carlo sampling,itdecreasesas N1.Whenwindowsize increases,MMSEd(y)decreases,but the variance Vd(y) might increase. The tension between these two terms determines whetherforaconstanttrainingsizeN increasingwindowsizeisbeneficial. ThevarianceVd isproportionaltothevolumeofp(yw ),asmeasuredbythedetermi- d nant|Φd|ofthelocalcovariancematrix.Whenthevolumeofthedistributionislarger, theN samplesarespreadoverawiderareaandtherearefewercleanpatchesneareach noisypatchy.ThisispreciselythedifferencebetweenFig.3(a)andFig.3(b). For the error to be close to the optimal MMSEd, the term Vd/N in Eq. (6) must be small. Eq. (7) shows that Vd dependson the volume |Φd| and we expect this term to growwithdimensiond,thusrequiringmanymoresamplesN.Bothourempiricaldata andour2Danalysisshowthattherequiredincreaseinsamplesizeisafunctionofthe statisticaldependenciesofthecentralpixelwiththeaddedone. TounderstandtherequiredincreaseintrainingsizeN whenwindowsizeisincreased byonepixelfromd−1tod,weanalyzetheratioofvariancesVd/Vd−1.Letgd(y)be thegaininperformance(foraninfinitedataset),whichaccordingtoEq.(3)isgivenby: gd(y)= MMMMSSEEd−d(1y(y)) = VV[x[x1|1y|1y,1.,....y.dy−d]1] (9) We also denote by gd∗(y) the ideal gain if xd and x1 were perfectly correlated, i.e. r =cor(x1,xd|y1,...,yd−1)=1.ThefollowingclaimshowsthatwhenMMSEd(y) ismostimproved,samplingisnothardersincethevolumeandvarianceVddonotgrow. Forsimplicity,weprovetheclaimintheGaussiancase. Claim. Let p(y) be Gaussian. When increasing the patch size from d − 1 to d, the varianceratioandtheperformancegainoftheestimatorsarerelatedby: V g∗ d = d ≥1. (10) V g d−1 d 8 Thatis,theratioofvariancesequalstheratioofoptimaldenoisinggaintotheachievable gain.Whenx1,xd areperfectlycorrelated,gd =gd∗,wegetVd/Vd−1 =1,andalarger windowgivesimprovedrestorationresultswithoutincreasingtherequireddatasetsize. In contrast, if xd,x1 are weakly correlated, increasing window size requires a bigger datasettokeepVd/N small,andyetthePSNRgainissmall. Proof. LetC bethe2×2covarianceofx1,xd giveny1,...,yd−1(beforeseeingyd) „ « c c C =Cov(x1,xd|y1,...yd−1)= c1 c12 (11) 12 2 √ andletr =c12/ c1c2bethecorrelationbetweenx1,xd. Assuming that the distribution is locally Gaussian, upon observing yd, the marginal varianceofx1decreasesfromc1tothefollowingexpression(seeEq.2.73in[5]), „ « c2 c2 /c c (1−r2)+σ2 V[x1|y1,...,yd]=c1− c +12σ =c1 1− c12+σ12 =c1 2 c +σ2 . (12) 2 2 2 Hencethecontributiontoperformancegainoftheadditionalpixelyd is V[x |y ,...y ] c +σ2 gd = V[x1 |1y ,...dy−]1 = c (1−2 r2)+σ2. (13) 1 1 d 2 Whenr =1,thelargestpossiblegainfromydisgd∗ =(c2+σ2)/σ2.Theratioofbest possiblegaintoachievedgainis g∗ c (1−r2)+σ2 d = 2 . (14) g σ2 d Next, let us compute the ratio Vd/Vd−1. For Gaussian distributions, accord- ing to Eq. 2.82 in [5], the conditional variance of yd given y1,...,yd−1 is independent of the specific observed values. Further, since p(y1,...,yd) = p(y1,...,yd−1)p(yd|y1,...yd−1),weobtainthat |Φd|=V(yd|y1,...yd−1)|Φd−1| (15) Thisimpliesthat V V(y |y ,...y ) V[x |y ,...y ] d = d 1 d−1 1 1 d (16) V σ2 V[x |y ,...y ] d−1 1 1 d−1 Next, since yd = xd + nd with nd ∼ N(0,σ2) independent of y1,...,yd−1, then V(yd|y1,...yd−1)=c2+σ2.Thus, V c +σ2c (1−r2)+σ2 g∗ d = 2 2 = d. V σ2 c +σ2 g d−1 2 d To understand the growth of Vd, consider two extreme cases, similar to Fig. 3. First, consider a signal whose pixels are all independentwith variance γ. In this case r = 9
Description: