Revisiting Deep Image Smoothing and Intrinsic Image Decomposition

Qingnan Fan¹   David Wipf²   Gang Hua²   Baoquan Chen¹
¹Shandong University   ²Microsoft Research Asia
[email protected], {davidwip,ganghua}@microsoft.com, [email protected]

Abstract

We propose an image smoothing approximation and intrinsic image decomposition method based on a modified convolutional neural network architecture with large receptive fields applied directly to the original color image. When training a deep model for these purposes, however, it is quite difficult to generate edge-preserving images without undesirable color differences or boundary artifacts. To overcome these obstacles, we supervise intermediate outputs from different functional components during training. For example, we apply both image gradient supervision and a channel-wise rescaling layer that computes a minimum mean-squared error color correction. Additionally, to enhance piece-wise constant effects for image smoothing, we append a domain transform filter with a predicted refined edge map. The resulting deep model, which can be trained end-to-end, directly learns edge-preserving smooth images and intrinsic image decompositions without any special design, specialized auxiliary training data, or input scaling/size requirements. Moreover, our method shows much better numerical and visual results on both tasks and runs in comparable test time to existing deep methods.

1. Introduction

Image smoothing is a fundamental building block of many computer vision and graphics applications including texture removal, image enhancement, edge extraction, image abstraction, and beyond. The underlying goal of image smoothing, or edge-preserving filtering, is to extract sparse salient structures like perceptually important edges and contours, while minimizing color differences in some input image of interest. A vast array of practical filtering algorithms have been proposed for this task using traditional image processing principles and energy functions that operate across both local and global image regions [5,12,13,18,26,27,32,34,35,36]. However, often an expensive, impractical optimization problem must be solved per instance to produce the desired edge-aware filtered images [5,18].

Recently, deep learning has demonstrated a remarkable ability to replace expensive optimization procedures with cheap computational modules that can be trained offline when given access to representative data, and later deployed at a fraction of the computational budget [15]. For example, given a sufficient training corpus obtained by passing input images through some preferred smoothing filter, we can attempt to learn a deep model that efficiently mimics the filter outputs [33]. Of course, considerable care must be taken in designing such a network to reduce training data requirements and to ensure that the trained model is indeed capable of fast, robust inference on new images; naive, black-box approaches are destined to fail. Perhaps even more importantly, a well-designed network for image filtering can also serve as the foundation for more complex, higher-level vision tasks, without the need for cumbersome integration with an iterative, energy-based filtering approach.

In this regard, one notable application of edge-preserving image smoothing is the computation of intrinsic image decompositions [5], which are useful for material recognition, image-based resurfacing, and relighting, among other things. The intrinsic image model assumes an ideal diffuse environment, in which an input image I is the pixel-wise product of an albedo image A and a shading image S, i.e.,

    I ≈ A ⊙ S.   (1)

The albedo or reflectance image indicates how object surface materials reflect light, while shading accounts for illumination effects due to geometry, shadows, and interreflections. Obviously, estimating such a decomposition is a fundamentally ill-posed problem, as there exist infinitely many feasible decompositions. Fortunately though, prior information, often instantiated via specially tailored image smoothing filters or energy terms, allows us to constrain the space of feasible solutions [2,4,6,9,28,37]. For example, the albedo image will usually be nearly piece-wise constant, with a finite number of levels reflecting a discrete set of materials and boundaries common to natural scenes. In contrast, the shading image is often assumed to be grayscale, and is more likely to contain smooth gradations quantified by small gradients except at locations with cast shadows or abrupt changes in scene geometry [22].

Naturally, given access to ground truth intrinsic image decompositions, deep networks provide a data-driven candidate for solving this ill-posed inverse problem with fewer heuristic or hand-crafted assumptions. In this paper, we propose a flexible deep neural network architecture that is simultaneously capable of handling wide-ranging, generic image smoothing tasks, where the primary goal is an efficient implementation of existing edge-aware filtering effects, as well as intrinsic image decompositions, where now an augmented objective supports improved accuracy on a higher-level vision task. Our fully trainable, end-to-end network relies on three important ingredients:
Large Flexible Receptive Fields: Edge-aware filtering requires contextual information across a relatively wide image region, especially in the context of an end-to-end system. Additionally, accurate intrinsic image decompositions must account for the fact that distant pixels with the same reflectance on a large surface could have significant color differences due to the accumulation of smoothly-varying soft cast shadows [5]. Hence large receptive fields are essential for both tasks, and we choose to actualize them with maximal flexibility via a fully convolutional subnetwork with many layers.

Color Correction and Gradient Supervision: For image smoothing, the input and output color images should be highly correlated, but unfortunately the color may be attenuated during many forward convolutions in a deep network such as ours. To address this problem in the context of super-resolution, [19] adds a skip connection directly from input to output images, which is not effective for image smoothing in our experience. Instead, we include a color correction layer which scales each channel of the image to minimize MSE color differences and naturally allows gradient backpropagation for training purposes. We also observe that supervising gradients of predicted images using an auxiliary objective helps preserve salient edges; however, this is notably different from direct operation in the gradient domain akin to most existing methods.

Domain Transform: To further enhance piece-wise constant effects favored by sparsity-based filtering methods [32,5], we also leverage a domain transform [14,8]. Practically speaking, in our system this can be viewed as a series of final, parameterless network layers that implicitly filter color information across the image in a warped domain. This warping is determined by edge maps derived from a preliminary smoothed image, which emerges from the lower convolutional and color correction layers described previously. Although the domain transform layers themselves do not actually increase the overall network capacity per se, they nonetheless alter the space of candidate filters and influence which image regions should be smoothed or not.

Overall, we propose an end-to-end system that can efficiently approximate a variety of computationally intensive image smoothing filters as well as infer related intrinsic image decompositions. For each of these tasks, our method demonstrates state-of-the-art performance both visually and numerically. We differentiate our design choices from existing methods in the next section.

2. Relationship with Existing Models

Image Smoothing: Recently, there have been multiple attempts [33,25] to imitate the behavior of a large, representative family of image smoothing filters using a single, data-driven approximation framework capable of accelerated inference speeds. For example, [33] proposed a 3-layer network for first predicting prominent gradients, followed by a separate post-processing step involving filter-specific tuning parameters to reconstruct the smoothed image estimate. A potential limitation, however, is that gradient or salient structure prediction errors can easily propagate to surrounding areas in the pixel domain during the reconstruction step, reducing visual quality. Moreover, practical image smoothing filters favor varying degrees of edge preservation, spanning the gamut from blurry to piece-wise constant images. Purely gradient-based processing tends to struggle more generating images on the blurry end of this scale, where continuous color transitions may be required. For these reasons, our approach operates entirely in the original color image domain, a gradient supervision term notwithstanding.

A second approach from [25] involves a hybrid network composed of several convolutional layers followed by a set of recurrent network layers. In brief, the CNN portion of the model takes the original image as input and computes a set of weights that are presumably reflective of salient image structures. The recurrent network layers then take these weights, as well as the original image, and act as a form of data-driven guided filter to produce the desired smoothing effect. To a certain degree, our approach can be viewed as the antithesis of this model. Whereas we explicitly parameterize an image model and only later refine it using gradient supervision and a parameterless domain transform modulated via a small subnetwork, the method from [25] fully parameterizes the weights of a recurrent guided filter array, which acts much like a domain transform via the analysis from [8], but then applies these filters only to multi-scale versions of the original raw images. A downside of the latter strategy is that it requires downsampling the original images to multiple scales, and at least in its present form, will not work with image sizes that are not multiples of 16 (the provided code only works with special image sizes). Additionally, any mismatch between the computed weights and legitimate salient edge structure will necessarily propagate to wider regions in the predicted output image, since the recurrent filters have no mechanism to compensate.

Intrinsic Image Decomposition: To move beyond the traditional filtering approaches mentioned in Section 1 to data-driven deep models capable of producing scene-level intrinsic image decompositions, we require images with known ground truth. The MPI-Sintel intrinsic images benchmark [7] provides an important step in this direction, and several existing deep network pipelines have been built upon it. First, [31] learns a two-scale convolutional network to directly predict both albedo and shading images. However, the specific architecture, which closely resembles that from [10] developed for predicting depth and surface normals, involves intermediate feature maps at 1/32 scale such that significant detail information may be compromised. A second, more recent method from [20] trains a joint model to learn depth maps and intrinsic images with activations coupled across iterations. Unlike our approach, the resulting pipeline operates in the gradient domain and requires separate post-processing steps, as well as additional guidance from labeled depth images. Additionally, [24] leverages a generative adversarial network (GAN) to learn intrinsic images from a special resynthesized dataset, but does not provide models or results from existing MPI-Sintel benchmarks for comparison. Finally, [29] trains an encoder-decoder CNN to learn albedo, shading and specular images; however, this approach does not apply to scene-level images as our approach does.

Figure 1. Basic Network Structure. For smoothing/filtering tasks, an input image is fed into an inference network (Deep CNN + channel-wise rescaling layer for color correction), followed by a domain transform modulated using an edge refinement step. For intrinsic image decompositions, this pipeline is modified slightly via separate albedo and shading paths as described in the text.
3. Proposed Model

Our proposed model consists of three main parts: an inference net, an edge refinement step, and a domain transform, as illustrated in Figure 1. We will now detail each component in turn.

3.1. Inference Net

Unlike previous work that primarily attempts to explicitly predict sparse image gradients [33,20] or related weight maps [25], the primary parameterized workhorse of our model is an inference net designed to directly predict filtered images in an end-to-end manner. This network contains 20 convolutional neural network (CNN) layers followed by a color correction layer and two gradient computation layers. These are designed as follows:

CNN Layers: Of the 20 CNN layers, the middle 18 are all of the equivalent size 64×64×3×3, meaning that for each of 64 output feature maps, 3×3 spatial regions of all 64 input feature maps are filtered. In contrast, the first layer takes the input image and represents it as 64 feature maps, while the last layer transforms the multi-channel representation back to the original image space (3 color channels). In both cases the kernel sizes are also 3×3. Additionally, each convolutional layer is followed by batch normalization (BN) and ReLU layers, except for the last one. Finally, to accelerate the training process, we retrofit the middle 16 convolutions with 8 modified residual blocks (see the supplementary file for details of these modifications and the network structure).

As observed in [33], a fully convolutional network trained directly in the original image domain may tend to generate very blurry images and unwanted details, while gradient domain supervision is more sensitive to the changes in sharp edges that comply with human perception of contrast more than absolute color values. But we argue that this limitation of the former is largely due to insufficient receptive field size and flexibility. After all, image smoothing and intrinsic image decomposition techniques determine the color values of each pixel in the context of a large surrounding area. Especially for albedo images, small color differences between nearby pixels can accumulate, producing significant differences between distant pixels with complex connecting pathways even when they share the same reflectance. This renders the estimation problem quite difficult if we are restricted to either small or insufficiently parameterized receptive fields. Therefore, instead of using 3 convolutional layers of large kernel size (16×16, 1×1 and 8×8) as in [33], we adopt 20 convolutional layers with 3×3 kernels, which allows the learned regression function to more accurately account for distant contextual information [30].

Color Correction Layer: Although image smoothing attempts to remove small gradients, it should still generally preserve a high degree of correlation between input and output colors. With so many convolutional layers, however, such correlations may eventually weaken without some additional source of long-term memory that reflects the preservation of the original color schema. For this purpose, we include an additional color correction layer to scale each channel independently as follows. Let P_{1c} denote the output of the 20-layer CNN module associated with color channel c, while I_c is the corresponding input image. We then compute

    α_c = argmin_{α_c} ||I_c − α_c P_{1c}||_2^2 = (P_{1c} • I_c) / (P_{1c} • P_{1c})   (2)

via a dedicated scaling layer (where • indicates a dot product) and output the updated image prediction P_{2c} = α_c P_{1c}. As a small caveat, we also threshold values of α_c outside of the range [0.7, 1.3] during training, when the predicted images may still be radically different from I_c. Note that network gradients are easily backpropagated through such a layer. Also, for testing, we discard any such threshold. A related idea has been incorporated into a classical, optimization-based intrinsic image decomposition [16], but this represents a decidedly different context from our CNN model. In any event, a channel-wise color correction layer is a simple but very effective addition to our pipeline, and we find that it reduces negative color difference effects by a wide margin.

Gradient Supervision: Most image smoothing algorithms seek to more-or-less achieve a piece-wise constant result. Certainly the albedo component of an intrinsic image decomposition problem is well-approximated by such an effect. But smoothing a large region into a single color is a challenging task for convolution, since the kernel weights are shared over the whole feature map, but local color values can vary dramatically across regions. To help mitigate this problem, we supervise the x and y gradients ∇_x P_2, ∇_y P_2 from the color-normalized images P_2 (across all color channels) using a standard auxiliary MSE loss function.

3.2. Edge Refinement and Domain Transform

The images generated by the inference net described above already demonstrate excellent numerical performance. But we observe that some modest color changes still exist in regions that piece-wise constant filters like [5,32,34] are able to smooth away. Inspired by [14,8], we address this lingering issue via a domain transform guided by refined edge estimates. The domain transform is an edge-preserving processing step that admits efficient implementation via separable 1-D recursive filtering layers applied across rows and columns of an image. Two inputs are required: the raw input signal to be filtered, which corresponds with our predicted image P_2 from the previous layer, and a domain transform density map that differentiates which regions to filter and which to leave unaltered.

This mapping is produced by a small edge refinement subnetwork in our pipeline. The core idea originates from [14], where the density map is simply the original input image gradient. Later, [8] experimented with more sophisticated predicted edge maps applied in conjunction with image segmentation scores. However, their pipeline addresses a completely different task, and the embedded, application-specific edge map requires the concatenation of features culled from intermediate CNN layers. In contrast, since we already generate reasonable edge-preserving images from our base inference net, we do not require this degree of sophistication. Hence our proposed edge refinement subnetwork operates as follows. (Further technical details about how to implement domain transforms within a neural network architecture, including gradient backpropagation steps, can be found in [8].)

The edge refinement module consists of two stages. The first computes, for every point in the input P_2, the averaged absolute differences between each of its four neighboring pixels, which is then further averaged across all color channels to produce a single value. The resulting map is next refined using a very shallow network containing 3 convolutional layers. The feature maps for the first two convolution layers have 64 channels. The kernel size for the 1st and 3rd layers is 15×15, zero padded with 7 pixels on each side to maintain spatial dimensions, while the kernel size for the 2nd layer is 1×1, which computes a weighted average of processed pixel vectors for the smoothing operation. The first two convolution layers are followed by ReLU units, while the third is not.

3.3. Modifications for Intrinsic Images

Finally, we conclude this section by describing special accommodations for intrinsic image decompositions. Here the inference net is split into two branches after the middle convolutional layer in order to predict albedo and shading images simultaneously. We also observe that (1) can be leveraged as a useful prior for intrinsic image decomposition, enforcing that the estimated decomposition must recombine to approximate the original image I (to the extent that the diffusive, Lambertian model is accurate). Therefore, we include an additional term that penalizes deviations of the estimated A ⊙ S from I, which helps to balance the error between albedo and shading images.

Additionally, intrinsic image decompositions often require potentially larger receptive fields, since accumulated soft shadows sharing the same albedo could cover a wide region. Consequently, we extend the depth to 42 convolutional layers and downsample intermediate feature maps by 1/2, which saves half the training time. And because albedo and shading images need not have a very close color similarity to the input image, we can remove the color normalization layer when training an intrinsic image decomposition network. For the detailed structure of this modified network, please refer to the supplementary material.
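To make the non-learned pieces of this section concrete, the following is a minimal NumPy sketch (not the paper's Torch implementation) of (i) the channel-wise color correction of Eq. (2) with its [0.7, 1.3] training-time threshold, (ii) the first, parameterless stage of the edge refinement module, and (iii) a single left-to-right recursive pass of the domain transform filter in the style of [14]. The function names, the epsilon guard, and the sigma values are illustrative assumptions; the learned CNN stages are omitted.

```python
import numpy as np

def color_correct(P1, I, lo=0.7, hi=1.3, train=True):
    """Channel-wise rescaling of Eq. (2): alpha_c = <P1_c, I_c> / <P1_c, P1_c>.
    P1 and I are H x W x 3 float arrays; alpha is clipped to [lo, hi] only
    during training, mirroring the caveat in Section 3.1."""
    P2 = np.empty_like(P1)
    for c in range(P1.shape[-1]):
        alpha = np.sum(P1[..., c] * I[..., c]) / (np.sum(P1[..., c] ** 2) + 1e-12)
        if train:
            alpha = np.clip(alpha, lo, hi)
        P2[..., c] = alpha * P1[..., c]
    return P2

def density_map(P2):
    """Stage 1 of the edge refinement module: for each pixel, average the
    absolute differences to its four neighbors, then average over channels."""
    H, W = P2.shape[:2]
    pad = np.pad(P2, ((1, 1), (1, 1), (0, 0)), mode="edge")
    diffs = sum(np.abs(P2 - pad[1 + dy:1 + dy + H, 1 + dx:1 + dx + W])
                for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]) / 4.0
    return diffs.mean(axis=-1)  # H x W single-channel map

def dt_filter_rows(P2, E, sigma_s=8.0, sigma_r=0.4):
    """One left-to-right pass of a recursive domain transform in the style
    of [14]: y[i] = (1 - a**d) * x[i] + a**d * y[i-1], where the warped
    distance d grows with the edge density E so filtering stops at strong
    edges. A full transform alternates such passes over rows and columns
    in both directions; sigma_s and sigma_r here are illustrative values."""
    a = np.exp(-np.sqrt(2.0) / sigma_s)
    d = 1.0 + (sigma_s / sigma_r) * E   # warped-domain neighbor distances
    w = a ** d                          # per-pixel recursive feedback weight
    out = P2.copy()
    for x in range(1, P2.shape[1]):
        out[:, x] = (1 - w[:, x, None]) * P2[:, x] + w[:, x, None] * out[:, x - 1]
    return out
```

On a perfectly flat region the density map is zero and the recursive pass leaves the image unchanged, while near a strong edge the feedback weight a**d collapses toward zero, so color information does not propagate across the edge.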
4. Training Details

We begin with the basic image smoothing network depicted in Figure 1. The embedded inference net requires supervision on 4 outputs: the predicted image P_1 directly from the CNN, the color-normalized image P_2, and the computed gradients in the x and y directions, i.e., ∇_x P_2 and ∇_y P_2.
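The four-output supervision just described (written out as the objective of Eq. 3) can be sketched as follows. The weights are the empirical values reported later in this section; the helper names are illustrative, and simple finite differences stand in for the network's gradient computation layers.

```python
import numpy as np

def grad_x(P):  # forward differences as a stand-in for the gradient layers
    return P[:, 1:] - P[:, :-1]

def grad_y(P):
    return P[1:, :] - P[:-1, :]

def mse(a, b):
    return np.mean((a - b) ** 2)

def inference_net_loss(P1, P2, P1_gt, P2_gt, w1=0.2, w2=0.1, w3=0.35):
    """Mirrors Eq. (3): MSE on P1, on the color-normalized P2, and on the
    x/y gradients of P2, with the paper's empirical weights as defaults."""
    return (w1 * mse(P1, P1_gt)
            + w2 * mse(P2, P2_gt)
            + w3 * (mse(grad_x(P2), grad_x(P2_gt))
                    + mse(grad_y(P2), grad_y(P2_gt))))
```

A perfect prediction gives zero loss; an error confined to P_1 is weighted only by w1, which is how the intermediate outputs receive independent supervision.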
With the * symbol denoting ground truth, the inference net objective terms become

    l_1(θ) = w_1 ||P_1 − P_1*||_2^2 + w_2 ||P_2 − P_2*||_2^2
           + w_3 (||∇_x P_2 − ∇_x P_2*||_2^2 + ||∇_y P_2 − ∇_y P_2*||_2^2)   (3)

where θ denotes the parameter set of the network layers, and ℓ2 norm penalties are chosen since they favor high peak signal-to-noise ratios (PSNR), a common evaluation criterion for image estimation problems. After training the inference net in isolation, we then learn an independent network for edge refinement, whose output is denoted E, with corresponding loss function ||E − E*||_2^2. Finally, we jointly train the whole network using

    L(θ) = l_1(θ) + w_3 ||E − E*||_2^2 + w_4 ||P_3 − P_3*||_2^2,   (4)

where P_3 is the output of the domain transform, and noting that, as a gradient proxy, we retain the same weighting parameter w_3 for E as used in Eq. 3 for coordinate-wise gradients.

For the intrinsic image network, the albedo and shading branches share the same loss function as introduced above, referred to as L_a(θ) and L_s(θ), except for the removal of supervision on the color-normalized image P_2 and with analogous gradient supervision now applied to P_1. We also include supervision that reflects the constraint from Eq. 1, giving the final cost

    L(θ) = L_a(θ) + L_s(θ) + w_5 ||A_1 ⊙ S_1 − A* ⊙ S*||_2^2,   (5)

where A_1 and S_1 are the estimates obtained from the inference net analogous to P_1.

We implement all networks within the Torch framework using mini-batch gradient descent, with batch size fixed at 16. The energy function weights w_1 through w_5 are empirically determined as 0.2, 0.1, 0.35, 0.2, and 0.2/255 respectively. Note that these weights are used for all smoothing filters and all intrinsic image decompositions, and hence are quite robust; manual tuning in an application-specific manner is not required. Convolutional layers are initialized using the methods from [17]. To train the inference net, we use ADAM [21] to accelerate convergence. The learning rate is initially set to 0.01 and then reduced to 0.001 to generate further improvement. For edge refinement, we use simple stochastic gradient descent [23] with the learning rate set to 0.01. And for fine-tuning the whole network together, we again use ADAM with a small learning rate of 1e-5.

5. Experimental Results

This section showcases the numerical and visual advantages of our approach. Further experiments and thorough ablation studies are deferred to the supplementary file.

5.1. Image Smoothing

Dataset: As a low-level vision task, many corpora of natural images are suitable for training and evaluating smoothing algorithms. Here we use the around 17,000 natural images from the public PASCAL VOC dataset [11] as input, and the corresponding images filtered by traditional smoothing algorithms as the labels/supervision. These images were collected from the Flickr photo-sharing website and span a wide range of viewing conditions, the same data source used by [33], with which we compare. We randomly crop images at a size of 224×224 pixels from a single image to generate about 17,000 patches. To evaluate our performance, we randomly pick 100 images to form our VOC test split. Additional test results using the BSDS500 dataset [1] are similar (see supplementary).

Comparisons: We approximate 8 different image smoothing filters, including the bilateral filter (BLF) [27], iterative bilateral filter (IBLF) [13], L0 smoothing (L0) [32], rolling guidance filter (RGF) [35], RTV texture smoothing (RTV) [34], weighted least squares smoothing (WLS) [12], weighted median filter (WMF) [36] and L1 smoothing (L1) [5]. Compared to 7 public models released by [33], we show much better results in Table 1.

           PSNR               SSIM
        Liu [25]  Ours    Liu [25]  Ours
L0       32.26    33.97    0.958    0.980
RGF      38.64    38.15    0.986    0.987
WLS      38.29    37.29    0.983    0.986
WMF      33.29    36.68    0.951    0.982
Ave.     35.64    36.52    0.966    0.984

Table 3. Comparison with the recent work of Liu et al. [25] on the image smoothing task. Experiments are conducted using their four released trained filters.

              BLF    IBLF   L0     RGF    RTV    WLS    WMF    L1     Ave.
PSNR  Xu[33]  35.02  32.97  31.66  32.49  35.68  33.92  29.62   --    32.62
      Ours    38.34  35.07  33.56  38.62  36.02  38.03  35.75  32.43  36.48
SSIM  Xu[33]  0.976  0.962  0.966  0.950  0.974  0.963  0.960   --    0.964
      Ours    0.987  0.980  0.981  0.987  0.983  0.985  0.980  0.950  0.983

Table 1. Quantitative Comparison on Image Smoothing Task. We report PSNR and SSIM metrics (larger is better) on 8 different smoothing filters compared with Xu et al. [33]. The average value is computed over the first 7 filters. Our results outperform our competitor by a large margin.

Figure 2. Qualitative Comparison on Image Smoothing Task. Top: comparison with Liu et al. [25] at the 256×256 image size. Bottom: comparison with Xu et al. [33]. All results are trained on the L0 filter. Zoom in to see the differences.
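For reference, the PSNR figures reported in Tables 1 and 3 follow the standard definition sketched below; the 255 peak value assumes 8-bit images, which is an assumption here rather than a detail stated in the paper.

```python
import numpy as np

def psnr(pred, gt, peak=255.0):
    """Peak signal-to-noise ratio in dB; larger is better."""
    err = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return np.inf if err == 0 else 10.0 * np.log10(peak ** 2 / err)
```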
The average value is computed among the front 7 filters. Our results outperform our competitorbyalargemargin. comparisonswiththerecentworkfrom[25].However,their serve that both [33] and [25] have some trouble reproduc- releasedcodeatthetimeofthiswriting,whichisbuiltupon ingconstantregions,and[33]hasdifficultywitherroneous a multiscale decomposition, cannot run on arbitrary image gradientestimatespropagatingfromimageboundaries(see size,sowetestontheirdefaultimagesize256 256. Even horseimage). Finally,weevaluatethetestingtimewithtra- × thoughwedonotapplyanyspecialtuningforthissize, or ditionaledge-awarefiltersandlearneddeepnetworksallon trainonmultiscaleimagesasinTable3,westillachievebet- thesamecomputerwithpubliccodesandmodels. Ournet- ter results on average in their setting. And our consistent- work achieves favourable speeds and is much faster than ly higher SSIM values indicate that our learned filters are mosttraditionalimagesmoothingmethods(seeTable2). quiteeffectiveingeneratingperceptuallygoodimages. Ad- 5.2.IntrinsicImageDecomposition ditionally,wenotethatWMFcangeneraterelativelyblurry images compared to other filters with the parameterization Datasets: We follow two recent state-of-the-art deep usedintheseexperimentstakenfrom[33],andyetournet- learningbasedmethods[31,20]andevaluateouralgorith- work handles this case quite well in both tables relative to m on the MPI-Sintel dataset [7] that facilates scene-level other approaches. This is likely because previous method- quantitative comparisons. It consists of 890 images from s more-or-less directly rely on sparse structures and may 18sceneswith50frameseach(exceptforonethatcontains have trouble reproducing the softer transitions required by 40images). Duetolimitedimagesinthisdataset, weran- this WMF filter. Visual comparisons with the WMF filter domlycrop10differentpatchesofsize300 300fromone × areinthesupplementarymaterial. image to generate 8900 patches. 
Like [31], we use two- WenextdisplaysomevisualcomparisonsinFigure2us- fold cross validation to obtain all 890 test results with two ingtheL filter,whichproduceseffectsmoreonthepiece- trainedmodels. Weevaluateourresultsonbothascenes- 0 wiseconstantendofthesmoothingspectrum. Hereweob- plit,wherehalfthescenesareusedfortrainingandtheother methods BLF IBLF RGF L WMF RTV WLS L Xu[33] Liu[25] Ours 0 1 QVGA(320 240) 0.03 0.11 0.22 0.17 0.62 0.41 0.70 32.18 0.23 0.07 0.06 × VGA(640 480) 0.12 0.40 0.73 0.66 2.18 1.80 3.34 212.07 0.76 0.14 0.20 × 720p(1280 720) 0.34 0.97 1.87 2.43 4.98 5.74 13.26 904.36 2.16 0.33 0.51 × Table2.RunningTimeComparison. Timingcomparisonsagainstdifferenttraditionalfiltersanddeeplearning-basedcounterpartsfrom Xuetal.[33]andLiuetal.[25]atvariousresolutions. MSE LMSE DSSIM albedo shading average albedo shading average albedo shading average Retinex[16] 0.0606 0.0727 0.0667 0.0366 0.0419 0.0393 0.2270 0.2400 0.2335 Barronetal. [3] 0.0420 0.0436 0.0428 0.0298 0.0264 0.0281 0.2100 0.2060 0.2080 imagesplit Chenetal. [9] 0.0307 0.0277 0.0292 0.0185 0.0190 0.0188 0.1960 0.1650 0.1805 MSCR[31] 0.0100 0.0092 0.0096 0.0083 0.0085 0.0084 0.2014 0.1505 0.1760 Ours 0.0067 0.0060 0.0063 0.0041 0.0042 0.0041 0.1050 0.0783 0.0916 MSCR[31] 0.0190 0.0213 0.0201 0.0129 0.0141 0.0135 0.2056 0.1596 0.1826 scenesplit Ours 0.0181 0.0175 0.0178 0.0122 0.0118 0.0120 0.1674 0.1382 0.1528 Table4.QuantitativeComparisononmainMPI-SintelBenchmark. Weevaluateourresultsusingbothsceneandimagesplitsacross threestandarderrorratesofintrinsicimagesonthemainMPI-Sinteldataset. islesssusceptibletoinflatedresultsfromoverfitting,since put images in the same sequence has a big overlap with each n I otherwhileimagesindifferentscenesdonot. While investigating the MPI-Sintel dataset online, we noticed that there are actually two sources for the input R C S andalbedoimages. 
While investigating the MPI-Sintel dataset online, we noticed that there are actually two sources for the input and albedo images. The first one is obtainable by emailing the authors of [7] directly, which we call the main MPI-Sintel dataset, given that it appears to be used by most previous methods [9,31,24]. But there is also a second, lesser-known set that can be partially downloaded from their official webpage (which also requires emailing to obtain the full ground truth). We refer to this version as the auxiliary MPI-Sintel dataset, which is used by [20]. Defective images in these data stemming from rendering issues are removed for evaluation purposes, following previous methods [31].

Finally, to test performance on real images where scene-level ground truth is unavailable, we also use the 220 images in the MIT intrinsic dataset [16], as in [31]. This data contains only 20 different objects, each of which has 11 images. To compare with previous methods, we train our model using 10 objects via the split from [3], and evaluate the results on images from the remaining objects.

Comparison: To quantitatively evaluate performance, we use three criteria: mean-squared error (MSE), local mean-squared error (LMSE), and the dissimilarity version of the structural similarity index (DSSIM), as proposed in [9]. First, in Tables 4 and 5, our model shows much better results on the MPI-Sintel data. Note that the very recent work JCNF [20] benefits from training on additional depth map data, and as a gradient domain method needs to execute a separate post-processing step to reconstruct images. MSCR [31] also benefits from additional training data from the MIT intrinsic dataset.

                 MSE                      LMSE                     DSSIM
                 albedo shading average   albedo shading average   albedo shading average
image split
  JCNF [20]      0.0070 0.0090 0.0080     0.0060 0.0070 0.0065     0.0920 0.1010 0.0970
  Ours           0.0043 0.0057 0.0050     0.0032 0.0042 0.0037     0.1012 0.0827 0.0919

Table 5. Quantitative Comparison on the auxiliary MPI-Sintel Benchmark. Similar results to Table 4, except using the auxiliary data. Note that JCNF [20] is only trained and tested on the image split of the MPI-Sintel dataset, hence our exclusion of the scene split here.

We show 2 groups of qualitative results trained on the more difficult scene split in Figure 4. Note that we are able to predict much sharper and relatively high-quality results even though we have only around 500 training images (and no outside training information, as with other methods), and training and testing data do not have any overlap. For more visual results on the image split or scene split, please refer to the supplementary material.

Figure 4. Qualitative Comparison on the main MPI-Sintel Benchmark. The visual results are evaluated on the model trained on the more difficult scene split.

Next, Table 6 presents relative performance using MIT intrinsic data. Here we observe that our approach is significantly better than the other deep networks [29,31], even though [31] utilizes additional MPI-Sintel data and [29] incorporates additional rendered, large-scale single-object intrinsic images to assist in training.

                     MSE                      LMSE
                     albedo shading average   total
  Zhou et al. [38]   0.0252 0.0229 0.0240     0.0319
  Barron et al. [3]  0.0064 0.0098 0.0081     0.0125
  Shi et al. [29]    0.0216 0.0135 0.0175     0.0271
  MSCR [31]          0.0207 0.0124 0.0165     0.0239
  Ours               0.0127 0.0085 0.0106     0.0200

Table 6. MIT Intrinsic Data Results. Performance of various methods on Barron et al.'s test set [3]. LMSE is computed using an error metric specifically designed for this data [16]. Note also that Barron et al.'s approach [3] relies on specialized priors and masked objects particular to this dataset. This approach does not work well on scene-level data (see Table 4).

Figure 3. Qualitative Comparison on MIT Intrinsic Benchmark. Compared with Shi et al. [29] and MSCR [31] on Barron et al.'s test split of the MIT intrinsic dataset, our algorithm achieves the best/sharpest results.
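Because Eq. (1) only fixes the albedo and shading up to a multiplicative scalar, MSE for intrinsic images is commonly computed after an optimal least-squares rescaling of the prediction. The sketch below shows that convention with a single global scalar; whether a given benchmark applies the scaling per image, per channel, or per local window (as in LMSE) is a protocol detail we leave as an assumption here.

```python
import numpy as np

def si_mse(pred, gt):
    """Scale-invariant MSE: rescale pred by the least-squares-optimal
    scalar alpha before comparing, reflecting the global scale ambiguity
    of the intrinsic image model I = A * S."""
    alpha = np.sum(pred * gt) / (np.sum(pred ** 2) + 1e-12)
    return np.mean((alpha * pred - gt) ** 2)
```

Under this convention, a prediction that is a perfect decomposition up to brightness (e.g., twice the true albedo) incurs essentially zero error.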
Note that [3] uses a number of specialized priors appropriate for this simplified object-level data, while end-to-end CNN approaches like ours and [31] have less advantage here due to limited training data (110 images). Moreover, [3] is not competitive on other more complex, scene-level data types, as shown in Table 4. In Figure 3, our predicted images are also much sharper and more accurate than those of the other deep learning methods. Finally, we should note that [38] trains a model to predict relative reflectance orderings between image patches and integrates the learned prior within an existing optimization framework for real-world intrinsic image decomposition. We tested their trained model only on the MIT intrinsic data due to its slow running time, and it was not comparable to the other methods.

6. Conclusion

This paper proposes a deep network for image smoothing and intrinsic image decomposition tasks. Design features include a deep CNN with large receptive fields operating in the original color domain, a color normalization step, gradient supervision, and a domain transform for refining final image estimates. The resulting pipeline produces edge-preserving smooth images or intrinsic decompositions without any preprocessing, postprocessing, application-specific tuning parameters, or special requirements on input image size. We have also numerically demonstrated state-of-the-art performance on important representative benchmarks.

References

[1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. PAMI, 33(5):898–916, 2011.
[2] J. T. Barron and J. Malik. Intrinsic scene properties from a single rgb-d image. CVPR, 2013.
[3] J. T. Barron and J. Malik. Shape, illumination, and reflectance from shading. PAMI, 37(8):1670–1687, 2015.
[4] S. Bell, K. Bala, and N. Snavely. Intrinsic images in the wild. ACM Trans. on Graphics (SIGGRAPH), 33(4), 2014.
[5] S. Bi, X. Han, and Y. Yu. An L1 image transform for edge-preserving smoothing and scene-level intrinsic decomposition. ACM Transactions on Graphics, 34(4):78, 2015.
[6] N. Bonneel, K. Sunkavalli, J. Tompkin, D. Sun, S. Paris, and H. Pfister. Interactive intrinsic video editing. ACM Transactions on Graphics (TOG), 33(6):197, 2014.
[7] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In ECCV, Part IV, LNCS 7577, pages 611–625. Springer-Verlag, Oct. 2012.
[8] L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille. Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform. In CVPR, 2016.
[9] Q. Chen and V. Koltun. A simple model for intrinsic image decomposition with depth cues. In ICCV, 2013.
[10] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, pages 2650–2658, 2015.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, June 2010.
[12] Z. Farbman, R. Fattal, D. Lischinski, and R. Szeliski. Edge-preserving decompositions for multi-scale tone and detail manipulation. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH 2008), 27(3), Aug. 2008.
[13] R. Fattal, M. Agrawala, and S. Rusinkiewicz. Multiscale shape and detail enhancement from multi-light image collections. ACM Transactions on Graphics (Proc. SIGGRAPH), 26(3), Aug. 2007.
[14] E. S. Gastal and M. M. Oliveira. Domain transform for edge-aware image and video processing. In ACM Transactions on Graphics (TOG), volume 30, page 69. ACM, 2011.
[15] K. Gregor and Y. LeCun. Learning fast approximations of sparse coding. In ICML, pages 399–406, 2010.
[16] R. Grosse, M. K. Johnson, E. H. Adelson, and W. T. Freeman. Ground truth dataset and baseline evaluations for intrinsic image algorithms. In ICCV, pages 2335–2342. IEEE, 2009.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, pages 1026–1034, 2015.
[18] L. Karacan, E. Erdem, and A. Erdem. Structure-preserving image smoothing via region covariances. ACM Trans. Graph., 32(6):176:1–176:11, 2013.
[19] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.
[20] S. Kim, K. Park, K. Sohn, and S. Lin. Unified depth prediction and intrinsic image decomposition from a single image via joint convolutional neural fields. In ECCV, 2016.
[21] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[22] E. H. Land and J. J. McCann. Lightness and retinex theory. J. Opt. Soc. Am., pages 1–11, 1971.
[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[24] L. Lettry, K. Vanhoey, and L. Van Gool. Darn: a deep adversarial residual network for intrinsic image decomposition. arXiv preprint arXiv:1612.07899, 2016.
[25] S. Liu, J. Pan, and M.-H. Yang. Learning recursive filters for low-level vision via a hybrid neural network. In ECCV, 2016.
[26] M. Aubry, S. Paris, S. W. Hasinoff, J. Kautz, and F. Durand. Fast local laplacian filters: Theory and applications. ACM Transactions on Graphics, 2014.
[27] S. Paris and F. Durand. A fast approximation of the bilateral filter using a signal processing approach. In ECCV, pages 568–580, 2006.
[28] L. Shen and C. Yeo. Intrinsic images decomposition using a local and global sparse representation of reflectance. In CVPR, pages 697–704. IEEE, 2011.
[29] J. Shi, Y. Dong, H. Su, and S. X. Yu. Learning non-lambertian object intrinsics across shapenet categories. CVPR, 2017.
[30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[31] T. Narihira, M. Maire, and S. X. Yu. Direct intrinsics: Learning albedo-shading decomposition by convolutional regression. In ICCV, 2015.
[32] L. Xu, C. Lu, Y. Xu, and J. Jia. Image smoothing via L0 gradient minimization. In SIGGRAPH Asia, 2011.
[33] L. Xu, J. S. Ren, Q. Yan, R. Liao, and J. Jia. Deep edge-aware filters. In ICML, pages 1669–1678, 2015.
[34] L. Xu, Q. Yan, Y. Xia, and J. Jia. Structure extraction from texture via natural variation measure. ACM Transactions on Graphics (SIGGRAPH Asia), 2012.
[35] Q. Zhang, X. Shen, L. Xu, and J. Jia. Rolling guidance filter. In ECCV, pages 815–830, 2014.
[36] Q. Zhang, L. Xu, and J. Jia. 100+ times faster weighted median filter. In CVPR, 2014.
[37] Q. Zhao, P. Tan, Q. Dai, L. Shen, E. Wu, and S. Lin. A closed-form solution to retinex with nonlocal texture constraints. PAMI, 34(7):1437–1444, 2012.
[38] T. Zhou, P. Krahenbuhl, and A. A. Efros. Learning data-driven reflectance priors for intrinsic image decomposition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3469–3477, 2015.

Supplementary Material: Revisiting Deep Image Smoothing and Intrinsic Image Decomposition

1 Outline

This document contains several technical details and experiments that could not be included in the main paper because of space considerations.

• Section 2: Network structure details, including deep CNN modules and variants for handling intrinsic images.
• Section 3: Description of the modified residual block used in our deep CNN module.
• Section 4: Details of the forward and backward propagation process used in the edge refinement network.
• Section 5: Self-comparison studies, including both qualitative and quantitative comparison of various network components.
• Section 6: Tests of the generalization ability of our 8 learned deep edge-aware filters.
• Section 7: More analysis of other deep image smoothing networks.
• Section 8: More visual results of different learned edge-aware filters.
• Section 9: Additional details regarding intrinsic image decomposition data and error metrics.
• Section 10: More visual comparisons with state-of-the-art methods on the intrinsic image decomposition task.
2 Network Structures

In Figure 1 we illustrate the modified network pipeline for handling intrinsic image decompositions, as well as details of the deep CNN modules used by our inference nets (for both image smoothing and intrinsic image decompositions). Decomposing an image into albedo and shading layers requires two outputs from the network, and a much larger receptive field. Our network is therefore split into two separate branches after the deep CNN module. As shown in the left column of Figure 1, the two split outputs are forwarded through independent edge refinement and domain transform layers to produce albedo and shading images.

Figure 1 also shows the details of the inference net deep CNN modules for both intrinsic image decomposition (middle column) and image smoothing (right column). As observed in [3], distant pixels with the same reflectance on a large surface in intrinsic images can have significant color differences due to the accumulation of smoothly-varying soft cast shadows, which implies that a much larger receptive field may be needed than for image smoothing. Therefore, we extend the deep CNN module to 42 convolution layers, and downsample the intermediate feature maps by enlarging the stride of the corresponding convolution layers. Note that downsampling not only increases the receptive field, but also reduces the computational burden caused by the extended convolution layers. This is a very effective and efficient training strategy.

For more details of the differences between the two networks, please refer to Sections 3.3 and 4 in the main paper.
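To see why enlarging the stride of a few layers grows the receptive field so quickly, consider the standard receptive-field recurrence for a convolution stack. The layer counts and kernel sizes below are illustrative assumptions, not the paper's exact 42-layer configuration:

```python
def receptive_field(layers):
    """Effective receptive field of a stack of conv layers.

    `layers` is a list of (kernel_size, stride) pairs. The jump (spacing
    of adjacent output samples, measured in input pixels) multiplies by
    each stride, and each layer widens the receptive field by
    (kernel_size - 1) * jump.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Ten 3x3 layers, all stride 1: receptive field of 21 pixels.
plain = receptive_field([(3, 1)] * 10)

# Same depth, but the first two layers use stride 2: receptive field of 71
# pixels, while the later layers also run on 16x fewer spatial positions.
strided = receptive_field([(3, 2), (3, 2)] + [(3, 1)] * 8)
```

This illustrates the trade-off noted above: early striding enlarges the receptive field multiplicatively while simultaneously shrinking the feature maps the deeper layers must process.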
