End-to-End Photo-Sketch Generation via Fully
Convolutional Representation Learning
Liliang Zhang1, Liang Lin1,*, Xian Wu1, Shengyong Ding1, Lei Zhang2
1Sun Yat-sen University, Guangzhou 510006, China
2Department of Computing, The Hong Kong Polytechnic University.
zhangll.level0@gmail.com, linliang@ieee.org
*Corresponding author is Liang Lin. This work was supported by the Hi-Tech Research and Development Program of China (no. 2013AA013801), the Guangdong Natural Science Foundation (no. S2013050014548), and the Hong Kong Scholar program.

ABSTRACT

Sketch-based face recognition is an interesting task in vision and multimedia research, yet it is quite challenging due to the great difference between face photos and sketches. In this paper, we propose a novel approach to photo-sketch generation, aiming to automatically transform face photos into detail-preserving personal sketches. Unlike the traditional models that synthesize sketches from a dictionary of exemplars, we develop a fully convolutional network to learn the end-to-end photo-sketch mapping. Our approach takes whole face photos as inputs and directly generates the corresponding sketch images with efficient inference and learning, in which the architecture is stacked by only convolutional kernels of very small sizes. To capture the person's identity well during the photo-sketch transformation, we define our optimization objective in the form of a joint generative-discriminative minimization. In particular, a discriminative regularization term is incorporated into the photo-sketch generation, enhancing the discriminability of the generated person sketches against other individuals. Extensive experiments on several standard benchmarks suggest that our approach outperforms other state-of-the-art methods in both photo-sketch generation and face sketch verification.

Categories and Subject Descriptors

I.2.10 [ARTIFICIAL INTELLIGENCE]: Vision and Scene Understanding

Keywords

Sketch-photo generation; face verification; neural nets

1. INTRODUCTION

Sketch is an important artistic drawing style, and may be the simplest form, since it is composed only of lines. An interesting application is searching image databases using free-hand sketch queries [19, 16].

However, drawing a vivid sketch portrait is time consuming even for a skilled artist. Automatic face sketch generation has therefore been studied for a long time, and it has many useful applications in digital entertainment [12].

Another important application of face sketches is to assist law enforcement. Suppose that we need to automatically retrieve a photo from a database for a query image, which helps the police narrow down the suspects quickly. Unfortunately, a photo of the suspect is unavailable in most cases. To deal with such a problem, the best available substitute is an artist's drawing based on the recollection of an eyewitness.

In this setting, we focus only on sketches without exaggeration, so that the sketch can realistically reflect the real person. Figure 1 shows two samples of photo-sketch pairs from the CUHK Face Sketch Database [24].

[Figure 1: Samples of photo-sketch pairs: (a) photos; (b) sketches drawn by the artist.]
Figure 1 indicates the great difference between photos and sketches. Thus, photo based face verification methods cannot be directly applied to this problem. The key to sketch based face verification is to reduce the modality difference between photos and sketches.

One intuitive idea is to recover the photo image from a sketch. However, this is an ill-posed problem, because a sketch may lose much information during the drawing procedure. An alternative way is to generate a pseudo-sketch from a photo, which has been discussed in [22, 13, 24].

The original generation scheme tries to find a transformation from photos to sketches. Intuitively, it should be a complex nonlinear mapping. Thus, former works like [22, 13, 24, 26, 27, 23, 20] simplify the generation problem to a synthesis form. The underlying assumption is that, if two photo images (or patches) are similar, their corresponding sketch images (or patches) should also be similar. Thus, if we find a way to use other photo images (or patches) to synthesize a photo image (or patch), then we can use the corresponding sketch images (or patches) to synthesize its pseudo-sketch.
However, this simplification has some limitations, especially in scalability. The time cost of synthesis based methods grows linearly with the amount of training data, because they need to use the samples in the training set to synthesize a new one.

On the contrary, our method tries to directly solve the generation problem with an end-to-end model called a fully convolutional network (FCN), which can be regarded as a special kind of convolutional neural network (CNN). More precisely, an FCN is stacked by only convolutional layers. It can solve complex nonlinear problems while producing pixel-wise outputs, which is very suitable for the photo-sketch generation problem. Please refer to Figure 2 for an intuitive idea.

The main contributions of this work are three-fold. First, an end-to-end photo-sketch generation model is studied with a novel architecture of fully convolutional networks, which is original to the best of our knowledge. Second, we present a joint generative-discriminative formulation driving the network optimization, such that the generated face sketches can characterize the person's details as well as the discriminability against other individuals. Third, our system demonstrates superior performance compared with other state-of-the-art approaches on several benchmarks.

The rest of the paper is organized as follows: Section 2 discusses related work in photo-sketch synthesis and convolutional neural networks. In Section 3, we describe our model and its implementation. In Section 4, extensive experiments suggest that our approach outperforms other state-of-the-art methods on several benchmarks. In Section 5, we draw the conclusions of our work and list some ideas we may adopt in the future.

2. RELATED WORK

2.1 Photo-Sketch Synthesis and Recognition

In the past ten years, there have been several works on the topic of photo-sketch synthesis and recognition.

Tang and Wang [22] were the first to address the problem with a significant amount of data. They proposed a synthesis method based on eigen transformation (ET). It was under the assumption that the photo-sketch mapping can be approximated as linear, yet this assumption may be too strong, especially when the hair region is included. It then used a reconstruction error based distance for recognition.

Liu et al. [13] proposed a nonlinear face sketch synthesis method called locally linear embedding (LLE), which can be considered an improved version of [22]. Instead of modelling the whole image, it applied ET on local patches with overlapping. It adopted a kernel-based nonlinear LDA discriminative classifier for sketch recognition.

Another local-based strategy, proposed by Xiao et al. [26], was based on the Embedded Hidden Markov Model (E-HMM). They transformed the sketches to pseudo-photos and applied the eigenface algorithm for recognition.

In [24], Wang and Tang proposed multiscale Markov Random Fields (MRF) for face photo-sketch synthesis, which can be applied both to photo-to-sketch synthesis and to sketch-to-photo synthesis. They evaluated a series of classifiers on the pseudo-images; Random Sample LDA (RS-LDA) performed best among them. Zhang et al. [28] then improved the MRF framework by adding shape priors and descriptors robust to lighting variations.

Most of the methods above may lose some vital details, which influences the visual quality and face recognition performance. Thus, Zhang et al. [27] added a refinement step on top of existing approaches: they applied a support vector regression (SVR) based model to synthesize the high-frequency information. Similarly, Gao et al. [6] proposed a new method called SNS-SRE with two steps, i.e. sparse neighbor selection (SNS) to get an initial estimation, and sparse-representation-based enhancement (SRE) for further improvement.

2.2 Pixel-Wise Predictions via CNNs

Convolutional neural networks (CNNs) are widely used in computer vision. The typical structure contains a series of convolutional and pooling layers as feature extractors, and several fully connected layers as classifiers to give the prediction. It has achieved great success in large scale object classification, localization and detection [10, 8, 17, 4, 21, 7].

One important application of CNNs is to produce dense or even pixel-wise predictions. Sermanet et al. [17] developed a CNN-based framework called OverFeat, which integrated recognition, localization and detection. The key component was a pooling layer with offsets, which can imitate the sliding window technique and produce dense outputs in the final layer. A similar idea called spatial pyramid pooling (SPP) was proposed by He et al. [8], which also modified the last pooling layer. Wang et al. [25] proposed a joint architecture for generic object extraction, and Luo et al. [15] applied a deep decompositional network to pedestrian parsing. Both of them shared the idea of adopting a fully connected layer on top of the network to produce a dense prediction. Moreover, the patch-by-patch scanning technique on the original image has been studied by Ciresan et al. [1] in neuronal membrane segmentation and by Farabet et al. [5] for scene labeling.

Recently, Dong et al. [3] applied a CNN to image super-resolution, and it can produce a pixel-wise output. They discussed its relationship to sparse-coding-based methods, and concluded that three convolutional layers can simulate the representation-mapping-reconstruction procedure in sparse coding.

Our work was mostly inspired by [3]. We conducted our early experiments with their network and then further improved it. We call our model a fully convolutional network (FCN; a similar idea was also proposed in [14]), for it only contains convolutional layers and the corresponding activation functions, without any other layers like pooling, fully connected, or local response normalization (LRN) layers. We will discuss it further in Section 3.2.

3. PHOTO-SKETCH GENERATION

3.1 Formulation

In our model, we use two constraints for sketch generation. The first one is that the generated sketches should be as close to the ones drawn by artists as possible. This encourages the deep network to learn the sketching skills from the artists. The second one is that the generated sketches should be able to facilitate law enforcement. That is, given a sketch drawn by the artist, it should be possible to identify the subject in the photo database.
These two constraints are introduced as a generative loss and a discriminative regularizer in our objective function, which is similar to [2].

Suppose there are N subjects in the training set, with each subject containing one photo P_i and one sketch S_i. We use f(W, P_i) to denote the sketch generated by a fully convolutional network parameterized by W. We define our loss function as follows, where L_{gen} and L_{discrim} represent the generative loss and the discriminative regularizer respectively, and \alpha controls the weight of the discriminative regularizer in the overall objective:

L(P, S, W) = L_{gen}(P, S, W) + \alpha L_{discrim}(P, S, W)    (1)

For the generative loss, we use a straightforward function, defined as the pixel-wise difference between the ground-truth sketch and the generated one:

L_{gen}(P, S, W) = \frac{1}{N} \sum_{i=1}^{N} (S_i - f(W, P_i))^2    (2)

For the discriminative regularizer, we encourage the drawn sketch of one particular person to be different from the generated sketch of any other person, as defined by the following function:

L_{discrim}(P, S, W) = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N} \log\left(1 + e^{-(S_i - f(W, P_j))^2 / \lambda}\right)    (3)

\lambda is a parameter used to avoid numeric overflow.
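To make the objective concrete, below is a minimal numpy sketch of Equations (1)-(3) on a toy batch. The array shapes, the random stand-ins for the sketches and network outputs, and the hyperparameter defaults are illustrative assumptions; the actual training runs through Caffe, not this code.

```python
import numpy as np

def joint_loss(S, F, alpha=1e4, lam=1e9):
    """Joint generative-discriminative loss of Equations (1)-(3).

    S : (N, H, W) ground-truth sketches.
    F : (N, H, W) generated sketches f(W, P_i), one per subject.
    """
    N = S.shape[0]
    # Equation (2): squared pixel-wise difference, one term per subject.
    L_gen = np.sum((S - F) ** 2) / N

    # Equation (3): penalize the drawn sketch S_i being close to the
    # generated sketch of a *different* subject j.
    L_discrim = 0.0
    for i in range(N):
        for j in range(N):
            if i != j:
                d2 = np.sum((S[i] - F[j]) ** 2)
                L_discrim += np.log1p(np.exp(-d2 / lam))
    L_discrim /= N * (N - 1)

    # Equation (1): weighted combination.
    return L_gen + alpha * L_discrim

# Toy check with random data standing in for real sketches.
rng = np.random.default_rng(0)
S = rng.random((4, 188, 143))
F = rng.random((4, 188, 143))
print(joint_loss(S, F))
```

Note how \lambda rescales the squared distance inside the exponent; this is what keeps the exponential from underflowing when the pixel-wise distance over a whole image is large.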
Parameter Optimization. Using this model, the learning process is to minimize the loss function L(P, S, W). This is achieved by the standard network propagation algorithm in the batch training manner. More precisely, in each iteration we randomly select a set of photo-sketch pairs and construct the generative loss terms and discriminative regularizer terms respectively. In order to derive the gradient with respect to the network parameters W, the key step is to calculate the partial derivative of the loss function with respect to the output (generated sketch) of each photo. As one photo may be involved in several cost terms, we need to go through each term to accumulate the derivative with respect to the output. Algorithm 1 gives the details of the learning process.

Algorithm 1 Parameter Optimization
Require: Training photos {P_i} and sketches {S_i}
Ensure: Network parameters W
 1: while t < T do
 2:   t <- t + 1
 3:   Randomly select a subset of photos and sketches {P'_i}, {S'_i} from the training set
 4:   for all P'_i do
 5:     Do forward propagation to get f(W, P'_i)
 6:   end for
 7:   ΔW <- 0
 8:   for all P'_i do
 9:     Calculate the partial derivative of the loss with respect to the output: ∂L/∂f(W, P'_i)
10:    Run backward propagation to obtain the gradient with respect to the network parameters: ΔW_i
11:    Accumulate the gradient: ΔW <- ΔW + ΔW_i
12:  end for
13:  W_t <- W_{t-1} - λ_t ΔW
14: end while
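The accumulate-then-update structure of Algorithm 1 can be sketched as follows. The one-parameter toy model standing in for the FCN, the generative-only gradient, and the constant learning rate are all assumptions made to keep the example self-contained; the real optimization is performed by Caffe's solver.

```python
import numpy as np

rng = np.random.default_rng(3)
photos = rng.random((8, 16, 16))      # toy stand-ins for {P_i}
sketches = rng.random((8, 16, 16))    # toy stand-ins for {S_i}
W = np.zeros(())                      # toy scalar "network parameter"

def forward(W, P):                    # stand-in for f(W, P_i)
    return W * P

def grad_wrt_W(W, P, dL_df):          # backward pass of the toy model
    return np.sum(dL_df * P)

lr, T, batch = 1e-3, 100, 4
for t in range(T):
    idx = rng.choice(len(photos), size=batch, replace=False)
    dW = 0.0
    for i in idx:
        F = forward(W, photos[i])
        # Generative term only, for brevity: dL/df of Equation (2).
        dL_df = 2.0 * (F - sketches[i]) / batch
        dW += grad_wrt_W(W, photos[i], dL_df)   # accumulate (line 11)
    W = W - lr * dW                             # update (line 13)
print(float(W))
```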
3.2 Fully Convolutional Representation

In this subsection, we discuss the properties of convolutional layers and how they can be cascaded into a pixel-wise fully convolutional representation of an input image.

Composition of Convolutions. A typical convolutional layer with K kernels and activation function f can be formulated as

y_{ij}^{k} = f((W_k * x)_{ij} + b_k)    (4)

where x denotes the input feature maps, k \in \{1, 2, \ldots, K\} denotes the index of a filter, W_k and b_k are the k-th filter's weights and bias, and y_{ij}^{k} denotes the element at coordinate (i, j) of the k-th output feature map.

Equation (4) indicates that the convolutional operation preserves the spatial relationship. What's more, composition of convolutions does not change this property, i.e. we can use a stack of convolutional layers to represent a complex nonlinear mapping that can be applied to an input of arbitrary size and produce a corresponding spatial output.
Relationship to the Patch-Wise Representation. In Equation (4), a pixel on the output feature map is influenced only by a patch on the input feature map, i.e. its receptive field. This property is also maintained under composition of convolutions. Thus, a fully convolutional representation of the whole image can be regarded as a batch of patch-wise representations of patches of the image. Nevertheless, the fully convolutional representation is much more efficient, since the patches overlap significantly (we set the stride to 1 in all layers).

Border Effect Trade-Off. Convolutional operations shrink the size of the feature map. For example, if we apply a (3×3) convolution operator to a (3×3) input, the output size shrinks to (1×1). As more convolutional layers are stacked up, more shrinkage accumulates in the outputs. A possible solution is to add proper padding before the convolution operation, but this brings on border effects, as reported in [3]. Thus, as a trade-off, we do not use padding in the convolutional layers at either training or testing time. Each (155×200) photo image is shrunk to (143×188) in the representation (see Figure 2).
3.3 Implementation

Pre-processing. As described in [24], all the photos and sketches are translated, rotated, and scaled in the pre-processing step such that the two eye centers of all the face images lie at fixed positions. This simple geometric normalization brings the same face components of different images into rough alignment.

Another pre-processing step is inspired by the published results of [24].* The transformation mentioned above produces a (200×250) image, but it may have some black regions in the border areas. We crop the (155×200) center part of the image in order to exclude this negative influence.

*http://www.ee.cuhk.edu.hk/~xgwang/sketch_multiscale.html
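The center crop described above amounts to the following small helper; the symmetric placement of the crop window is our assumption, as the paper does not give the exact offsets.

```python
import numpy as np

def center_crop(img, out_h, out_w):
    """Crop the central (out_h, out_w) window of an (H, W, ...) image."""
    h, w = img.shape[:2]
    top, left = (h - out_h) // 2, (w - out_w) // 2
    return img[top:top + out_h, left:left + out_w]

aligned = np.zeros((250, 200, 3))  # the aligned photo, 200 wide x 250 tall
print(center_crop(aligned, 200, 155).shape)  # -> (200, 155, 3)
```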
[Figure 2: The overview of our model. It takes a full-size photo image as input and directly generates a full-size pseudo-sketch as output. The middle part is the architecture of our fully convolutional network. It contains six convolutional layers (kernel sizes 5×5, 5×5, 1×1, 1×1, 3×3, 3×3 with 128, 128, 64, 64, 1 and 1 filters respectively), with rectified linear units as activation functions (omitted in the figure). The 5-channel (RGB+XY) (155×200) input is mapped to a single-channel (143×188) output.]
Since we choose to avoid border effects, the sketch images need to be cropped to (143×188) to fit the network output dimension.

Spatial Patch-Wise Learning with Overlapping. From the previous works on photo-sketch synthesis [13, 24], patch-wise learning with overlapping is very important for handling the non-linearity between photos and sketches. Intuitively, patches at different positions are diverse from one another (e.g. eye patches, nose patches and mouth patches). Therefore, learning different patch representations at different spatial positions is a very straightforward idea.

We handle it by adding additional XY channels to the input, i.e. the input image data contain five channels: the three RGB channels of the photo image, and two channels holding the corresponding coordinates (i, j). This tiny modification significantly improves the result, as discussed in Section 4.3.
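A minimal sketch of assembling the five-channel input is given below. Treating the XY channels as per-pixel coordinate maps follows the text; normalizing them to [0, 1] is our assumption, since the paper does not specify the coordinate scaling.

```python
import numpy as np

def add_xy_channels(rgb):
    """Stack two coordinate channels onto an (H, W, 3) RGB photo,
    giving the (H, W, 5) input described above."""
    H, W, _ = rgb.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Assumption: normalize coordinates to [0, 1].
    ys = ys / (H - 1)
    xs = xs / (W - 1)
    return np.dstack([rgb, xs, ys]).astype(np.float32)

photo = np.zeros((200, 155, 3), dtype=np.float32)
print(add_xy_channels(photo).shape)  # -> (200, 155, 5)
```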
Network Architecture. We apply the network proposed in [3] in our early experiments, then modify its architecture for further enhancement. We borrow the idea of [18], which had great success in ILSVRC-2014: their main contribution is to deepen the CNN under computational constraints via very small (3×3) convolutional filters in all layers.

Similarly, we modify the 3-layer network of [3] into a 6-layer network, which is shown in Figure 2. We replace its (9×9), (1×1) and (5×5) convolutional layers with two (5×5), two (1×1) and two (3×3) convolutional layers respectively. Moreover, we double the filter amounts in every layer to improve the network's capacity.

We call it the medium network due to its size. We have two similar architectures, called the small network and the large network, which will be further discussed in Section 4.3.
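As a quick sanity check, the kernel sizes of the six layers, read off Figure 2, reproduce exactly the (155×200) to (143×188) shrinkage stated in Section 3.2, assuming unpadded stride-1 convolutions throughout:

```python
# Kernel sizes of the six convolutional layers of the medium
# network, as read off Figure 2.
kernels = [5, 5, 1, 1, 3, 3]

def output_size(h, w, kernels):
    # Each unpadded k x k stride-1 convolution trims k - 1 pixels
    # from each axis.
    for k in kernels:
        h, w = h - (k - 1), w - (k - 1)
    return h, w

print(output_size(200, 155, kernels))  # -> (188, 143)
```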
Training Details. Our results are based on our medium network (see Figure 2). As mentioned above, we first pre-process the photo images into (155×200) and the sketch images into (143×188). The optimization objective has been discussed in Section 3.1. We set λ to 10^9 and α to 10^4.

We use Caffe [9] for the implementation. The filter weights are initialized by drawing randomly from a Gaussian distribution with zero mean and standard deviation 0.01, and the biases are initialized to zero. We set the learning rate to 10^{-11}; training then takes several hours to converge on an NVIDIA Tesla K40 GPU.

4. EXPERIMENTS

We evaluate our model on the CUHK student dataset [24]. It includes 188 faces: 88 faces are selected for training and the remaining 100 for testing. For each face, there is a sketch drawn by the artist and a photo taken in a frontal pose, under normal lighting conditions, and with a neutral expression.

We mainly conduct three groups of experiments, presented in the next three subsections. Section 4.1 discusses our photo-sketch generation results and compares them with the synthesis based method of [24]. Section 4.2 shows that our generated pseudo-sketches significantly reduce the modality difference between photos and sketches, so that they can be adopted in a sketch-based face verification system. Section 4.3 presents an empirical study of our model: several factors, such as network depth and filter numbers, are discussed on the evaluation protocol both qualitatively and quantitatively.

4.1 Face-Sketch Generation

In Figure 3, we show some examples of our generated results and compare them to [24]. For a fair comparison, we list all 14 pseudo-sketches that can be found on their website.

Figure 3 indicates that our results contain more vital details. For example, the man in column (e) of the first row has two little locks of hair on his forehead, but this detail is missing in the pseudo-sketch of [24] (column (g)). On the contrary, our generated result (column (f)) retains this detail, which may be very important for distinguishing this man from others. Another example is the woman in column (a) of the second row. She has a distinctive feature compared to other persons in the database: two big eyes. Our method also captures this detail in the pseudo-sketch (column (b)).

This suggests that synthesis based methods may fail in some cases. The main reason is that they rely on the assumption that a synthesized sketch image (or patch) can be reconstructed from the sketch images (or patches) in the training set. However, this assumption may be too strong, for some persons do have their own facial distinctions.

Our generative method overcomes this difficulty. We try to directly learn the transformation from photo space to sketch space. Even though it is a complex nonlinear mapping, the FCN has the capacity to handle this challenge.

Compared to the traditional synthesis methods, our generative approach has two advantages.
[Figure 3: Compared to the synthesis based method [24], our generative approach retains more details from the original photos, e.g. the two locks of hair on the forehead of the man in column (e) of row 1, and the two big eyes of the woman in column (a) of row 2. (a)&(e) photos; (b)&(f) our generative pseudo-sketches; (c)&(g) synthesized pseudo-sketches of [24]; (d)&(h) sketches drawn by the artist.]
Firstly, it retains more detail information from the photos. Secondly, our inference time is independent of the amount of training data: the time cost of synthesis based methods grows linearly with the data amount, while the runtime of our approach is only affected by the size of the input photo.

4.2 Sketch-Based Face Verification

In this subsection, we show that our generated pseudo-sketches significantly reduce the modality difference between photos and sketches. We follow the same testing procedure as [22, 13, 24], which can be summarized in two steps: (a) convert the photos in the testing set into corresponding pseudo-sketches; (b) define a feature or transformation to measure the distance between the query sketch and the pseudo-sketches.

In our implementation, we use our model to generate the pseudo-sketches in step (a), and use our generative loss (see Equation (2)) as the distance measurement. Since the sizes of our pseudo-sketches are (188×143), we also crop the query sketch to (188×143).

Following the same protocol described in [22], we compare our approach with previous methods on the cumulative match score (CMS) in Table 1. CMS measures the percentage of queries for which "the correct answer is in the top n matches", where n is called the rank.
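To pin down the protocol, here is a small sketch of computing CMS from a query-to-gallery distance matrix. It assumes the gallery is indexed so that the correct pseudo-sketch for query q sits at index q, and uses random distances as a placeholder for the Equation (2) distances.

```python
import numpy as np

def cumulative_match_score(dist, ranks=(1, 3, 5, 10)):
    """dist[q, g]: distance from query sketch q to pseudo-sketch g;
    the correct gallery item for query q is assumed to be index q.
    Returns the percentage of queries whose match is in the top n."""
    order = np.argsort(dist, axis=1)  # best match first
    # Position of the correct identity in each query's ranking.
    pos = np.argmax(order == np.arange(dist.shape[0])[:, None], axis=1)
    return {n: 100.0 * np.mean(pos < n) for n in ranks}

rng = np.random.default_rng(1)
print(cumulative_match_score(rng.random((100, 100))))
```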
As shown in Table 1, we form a baseline experiment which converts the photo into gray scale as a crude "pseudo-sketch", to clarify the modality difference between photo space and sketch space. In this baseline, the rank-1 accuracy equals only 41%, which is far from satisfying. An early method like ET [22] got 71% for the top one match. Later trials [24], [28], [27] greatly improved the top-1 verification accuracy to 96%, 99% and even 100%.

From the last row of Table 1, our approach also gets a 100% correct answer on its first guess. It is worth noticing that our approach is quite different from the former methods, for it applies generation instead of synthesis.

            Rank 1   Rank 3   Rank 5   Rank 10
Baseline      41       56       59       70
ET [22]       71       81       88       96
MRF [24]      96        -        -      100
MRF+ [28]     99        -        -      100
SVR [27]     100        -        -        -
Ours         100      100      100      100

Table 1: Cumulative Match Score (CMS) comparison on the full training set (88 photo-sketch pairs). Our method achieves 100% accuracy on the first guess.

Moreover, our generative method is very robust and has excellent generalization ability. Figure 4 suggests that our method can use only a portion of the training data to reach 100% rank-1 accuracy. For example, we need only 5 training samples to reach 95% accuracy, and 27 training samples are enough to get a 100% score for the top one candidate.

We also verify our optimization objective in Figure 4. It suggests that the model trained with the discriminative regularizer consistently outperforms the model trained without it.

[Figure 4: Rank-1 CMS (%) versus the number of training samples. The model trained with the discriminative regularizer consistently outperforms the model trained without it. Both models reach 100% accuracy on the rank-1 candidate while using less than half of the training samples, i.e. 27 samples out of 88 photo-sketch pairs.]
4.3 Empirical Study

In this subsection, further analyses are conducted on the factors which affect our photo-sketch generation performance.

Different Network Architectures. Our first concern is the depth of the network and the number of filters. We used the network proposed in [3] (we call it the SR network) in our early experiments. However, the result is not quite satisfying, for this network seems too small to learn the complex nonlinear mapping from photos to sketches. Thus, we improve the network by: (a) adding more layers to the network; and (b) adding more filters to each layer.

We conduct our experiments on three different network architectures. Due to their scales, we call them the small, medium and large networks respectively. The medium network architecture can be found in Figure 2; for its differences from the SR network, please refer to Section 3.3. Moreover, the only difference among our small, medium and large networks is their filter numbers: the medium network has twice as many filters as the small one, and the large network doubles the filter numbers of the medium network.

For comparison, we define a measurement called MPRL, which is short for multiscale pixel-wise reconstruction loss. It evaluates the pixel-wise accuracy at different scales. To explain MPRL, we first introduce the pixel-wise reconstruction loss (PRL) on one photo-sketch pair.

For a photo-sketch pair, we denote the sketch as the ground truth (GT) and the generated pseudo-sketch as the prediction (P). Both of their sizes are (W×H). Thus, PRL can be formulated as

PRL = \frac{1}{W \times H} \sqrt{\sum_{x=1}^{W} \sum_{y=1}^{H} (GT(x,y) - P(x,y))^2}    (5)

MPRL is the multiscale version of PRL. In practice, for each photo-sketch pair, we rescale both the GT and the P to three scales {0.5, 1, 2} and evaluate the PRL at each scale to form the MPRL. Then, for the whole test set, we evaluate all the pairs and average their MPRL as the final measurement.
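For reference, a compact sketch of PRL (Equation (5)) and its multiscale variant; the nearest-neighbor rescaling is our assumption, as the paper does not name the interpolation used, and the random images are placeholders.

```python
import numpy as np

def rescale_nn(img, s):
    """Nearest-neighbor rescale by factor s (an assumption; the paper
    does not specify the interpolation used for MPRL)."""
    H, W = img.shape
    h, w = max(1, int(H * s)), max(1, int(W * s))
    ys = np.arange(h) * H // h
    xs = np.arange(w) * W // w
    return img[ys[:, None], xs[None, :]]

def prl(gt, p):
    """Pixel-wise reconstruction loss of Equation (5)."""
    H, W = gt.shape
    return np.sqrt(np.sum((gt - p) ** 2)) / (W * H)

def mprl(gt, p, scales=(0.5, 1, 2)):
    return [prl(rescale_nn(gt, s), rescale_nn(p, s)) for s in scales]

rng = np.random.default_rng(2)
gt, p = rng.random((188, 143)), rng.random((188, 143))
print(mprl(gt, p))
```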
Figure 5 summarizes the results both qualitatively and quantitatively. The MPRL of the SR network is {34.5, 36.8, 36.2}. Our small network reduces it to {32.6, 35.0, 34.4}. Our medium and large networks then achieve better performance, at {32.1, 34.6, 34.0} and {30.0, 32.5, 31.9} respectively.

The pseudo-sketch examples generated by each network are shown in the bottom part of Figure 5. The results of the SR network capture the overall structures, but they are far from satisfying in the details. Our small network's generations are considerably better and have clearer contours. Moreover, the pseudo-sketches generated by the medium network are more vivid than the small one's. The results of the large network are as good as the medium one's, but the large network is much more time consuming.

[Figure 5: Comparison of the MPRL measurement (at scales 0.5, 1 and 2) and generated pseudo-sketches across different network architectures. The model achieves better performance as the network is made deeper or given more filters. (a-d) pseudo-sketches generated by the network in [3], our small network, our medium network, and our large network.]

Table 2 shows that our small network is almost as fast as the SR network. The medium network costs about two times as much as the small one, and the large network's runtime is about three times that of the medium one. To balance effectiveness and efficiency, we prefer our medium network as the final choice.

        SR Net   Small Net   Medium Net   Large Net
Time    8.5 ms    8.6 ms      17.1 ms      50.8 ms

Table 2: Runtime for a single (155×200) image on an NVIDIA Tesla K40 GPU.

Difference Between Training With and Without XY Channels. Now we consider another factor that influences the model's performance. As mentioned above, we add XY channels to the RGB color channels. We assume these additional spatial cues could help the network distinguish different spatial patches and learn different representations for them.

Similarly, we use the MPRL for the quantitative measurement and pseudo-sketch comparison for the qualitative analysis. In Figure 6, the left part shows results generated by our medium network trained without XY channels, and the right part shows results generated by our medium network trained normally. Figure 6 suggests that the latter outperforms the former both in numbers and in perception.

[Figure 6: Comparison of the MPRL measurement (at scales 0.5, 1 and 2) and generated pseudo-sketches for the network trained without and with XY channels. Adding XY channels makes the model more robust. (a) pseudo-sketches generated by the network trained without XY channels; (b) pseudo-sketches generated by the network trained with XY channels.]

5. CONCLUSION AND FUTURE WORK

In this paper, we propose an end-to-end fully convolutional network to directly model the complex nonlinear mapping between face photos and sketches. From the experiments, we find that the fully convolutional network is a powerful tool that can handle this difficult problem while providing pixel-wise predictions both effectively and efficiently.

However, this solution still has some limitations. Firstly, the synthesized pseudo-sketches of [24] have sharper edges and clearer contours than our generated ones. Secondly, the database involved in our experiments only contains Asian face images, which may limit the generalization ability of our model to other racial groups.
In future work, we will further improve our loss function and try various databases in our experiments, and we may explore the relation between our work and those involving non-photorealistic rendering [11].

Acknowledgements. We gratefully acknowledge NVIDIA for GPU donations.
References

[1] D. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In Advances in Neural Information Processing Systems, pages 2843-2851, 2012.
[2] S. Ding, L. Lin, G. Wang, and H. Chao. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition, 2015. doi: 10.1016/j.patcog.2015.04.005.
[3] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In Computer Vision - ECCV 2014, pages 184-199. Springer, 2014.
[4] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. arXiv preprint arXiv:1312.2249, 2013.
[5] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915-1929, 2013.
[6] X. Gao, N. Wang, D. Tao, and X. Li. Face sketch-photo synthesis and retrieval using sparse representation. IEEE Transactions on Circuits and Systems for Video Technology, 22(8):1213-1226, 2012.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint arXiv:1311.2524, 2013.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In Computer Vision - ECCV 2014, pages 346-361. Springer, 2014.
[9] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[11] L. Lin, K. Zeng, Y. Wang, Y.-Q. Xu, and S.-C. Zhu. Video stylization: Painterly rendering and optimization with content extraction. IEEE Transactions on Circuits and Systems for Video Technology, 23(4):577-590, 2013.
[12] J. Liu, Y. Chen, and W. Gao. Mapping learning in eigenspace for harmonious caricature generation. In Proceedings of the 14th Annual ACM International Conference on Multimedia, pages 683-686. ACM, 2006.
[13] Q. Liu, X. Tang, H. Jin, H. Lu, and S. Ma. A nonlinear approach for face sketch synthesis and recognition. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pages 1005-1010. IEEE, 2005.
[14] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR (to appear), Nov. 2015.
[15] P. Luo, X. Wang, and X. Tang. Pedestrian parsing via deep decompositional network. In IEEE International Conference on Computer Vision (ICCV), pages 2648-2655. IEEE, 2013.
[16] S. Ren, C. Jin, C. Sun, and Y. Zhang. Sketch-based image retrieval via adaptive weighting. In Proceedings of the International Conference on Multimedia Retrieval, page 427. ACM, 2014.
[17] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In International Conference on Learning Representations (ICLR 2014). CBLS, April 2014.
[18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[19] J. R. Smith and S.-F. Chang. VisualSEEk: a fully automated content-based image query system. In Proceedings of the Fourth ACM International Conference on Multimedia, pages 87-98. ACM, 1997.
[20] Y. Song, L. Bao, Q. Yang, and M.-H. Yang. Real-time exemplar-based face sketch synthesis. In Proceedings of the European Conference on Computer Vision, pages 800-813, 2014.
[21] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In Advances in Neural Information Processing Systems, pages 2553-2561, 2013.
[22] X. Tang and X. Wang. Face sketch recognition. IEEE Transactions on Circuits and Systems for Video Technology, 14(1):50-57, 2004.
[23] S. Wang, L. Zhang, Y. Liang, and Q. Pan. Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012.
[24] X. Wang and X. Tang. Face photo-sketch synthesis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):1955-1967, 2009.
[25] X. Wang, L. Zhang, L. Lin, Z. Liang, and W. Zuo. Deep joint task learning for generic object extraction. In Advances in Neural Information Processing Systems, pages 523-531, 2014.
[26] B. Xiao, X. Gao, D. Tao, and X. Li. A new approach for face recognition by sketches in photos. Signal Processing, 89(8):1576-1588, 2009.
[27] J. Zhang, N. Wang, X. Gao, D. Tao, and X. Li. Face sketch-photo synthesis based on support vector regression. In 18th IEEE International Conference on Image Processing (ICIP), pages 1125-1128, Sept. 2011. doi: 10.1109/ICIP.2011.6115625.
[28] W. Zhang, X. Wang, and X. Tang. Lighting and pose robust face sketch synthesis. In Computer Vision - ECCV 2010, pages 420-433. Springer, 2010.