End-to-End Photo-Sketch Generation via Fully
Convolutional Representation Learning
Liliang Zhang1, Liang Lin1,*, Xian Wu1, Shengyong Ding1, Lei Zhang2
1Sun Yat-sen University, Guangzhou 510006, China
2Department of Computing, The Hong Kong Polytechnic University.
zhangll.level0@gmail.com, linliang@ieee.org
*Corresponding author is Liang Lin. This work was supported by the Hi-Tech Research and Development Program of China (no. 2013AA013801), the Guangdong Natural Science Foundation (no. S2013050014548), and the Hong Kong Scholar program.

ABSTRACT

Sketch-based face recognition is an interesting task in vision and multimedia research, yet it is quite challenging due to the great difference between face photos and sketches. In this paper, we propose a novel approach to photo-sketch generation, aiming to automatically transform face photos into detail-preserving personal sketches. Unlike the traditional models that synthesize sketches from a dictionary of exemplars, we develop a fully convolutional network to learn the end-to-end photo-sketch mapping. Our approach takes whole face photos as inputs and directly generates the corresponding sketch images with efficient inference and learning, in which the architecture is stacked by only convolutional kernels of very small sizes. To capture the person's identity well during the photo-sketch transformation, we define our optimization objective in the form of a joint generative-discriminative minimization. In particular, a discriminative regularization term is incorporated into the photo-sketch generation, enhancing the discriminability of the generated person sketches against other individuals. Extensive experiments on several standard benchmarks suggest that our approach outperforms other state-of-the-art methods in both photo-sketch generation and face sketch verification.

Categories and Subject Descriptors

I.2.10 [ARTIFICIAL INTELLIGENCE]: Vision and Scene Understanding

Keywords

Sketch-photo generation; face verification; neural nets

1. INTRODUCTION

Sketch is an important artistic drawing style, and may be the simplest form, since it is composed only of lines. An interesting application is searching image databases using free-hand sketch queries [19, 16].

However, drawing a vivid sketch portrait is time consuming even for a skilled artist. Automatic face sketch generation has therefore been studied for a long time, and it has many useful applications in digital entertainment [12].

Another important application of face sketches is to assist law enforcement. Suppose that we need to automatically retrieve a photo from a database for a query image, which helps the police narrow down the suspects quickly. Unfortunately, a photo of the suspect is unavailable in most cases. To deal with such a problem, the best available substitute is an artist's drawing based on the recollection of an eyewitness.

In this setting, we focus only on sketches without exaggeration, so that the sketch can realistically reflect the real person. Figure 1 shows two samples of photo-sketch pairs from the CUHK Face Sketch Database [24].

[Figure 1: Samples of photo-sketch pairs: (a) photos; (b) sketches drawn by the artist.]
Figure 1 indicates the great difference between photos and sketches. Thus, photo based face verification methods cannot be directly applied to this problem. The key to sketch based face verification is to reduce the modality difference between photos and sketches.

One intuitive idea is to recover the photo image from a sketch. However, this is an ill-posed problem, because a sketch may lose much information during the drawing procedure. An alternative way is to generate a pseudo-sketch from a photo, which has been discussed in [22, 13, 24].

The original generation scheme tries to find a transformation from photos to sketches. Intuitively, it should be a complex nonlinear mapping. Thus, former works like [22, 13, 24, 26, 27, 23, 20] simplify the generation problem to a synthesis form. The underlying assumption is that, if two photo images (or patches) are similar, their corresponding sketch images (or patches) should also be similar. Thus, if we find a way to use other photo images (or patches) to synthesize a photo image (or patch), then we can use the corresponding sketch images (or patches) to synthesize its pseudo-sketch.
However, this simplification has some limitations, especially in scalability. The time cost of synthesis based methods grows linearly with the amount of training data, because they need to use the samples in the training set to synthesize a new one.

On the contrary, our method tries to directly solve the generation problem with an end-to-end model called a fully convolutional network (FCN), which can be regarded as a special kind of convolutional neural network (CNN). More precisely, an FCN is stacked by only convolutional layers. It can solve complex nonlinear problems while producing pixel-wise outputs, which is very suitable for the photo-sketch generation problem. Please refer to Figure 2 for an intuitive idea.

The main contributions of this work are three-fold. First, an end-to-end photo-sketch generation model is studied with a novel architecture of fully convolutional networks, which is original to the best of our knowledge. Second, we present a joint generative-discriminative formulation driving the network optimization, such that the generated face sketches can characterize the person's details as well as the discriminability against other individuals. Third, our system demonstrates superior performance compared with other state-of-the-art approaches on several benchmarks.

The rest of the paper is organized as follows: Section 2 discusses related work in photo-sketch synthesis and convolutional neural networks. In Section 3, we describe our model and its implementation. In Section 4, extensive experiments suggest that our approach outperforms other state-of-the-art methods on several benchmarks. In Section 5, we draw the conclusions of our work and list some ideas we may adopt in the future.

2. RELATED WORK

2.1 Photo-Sketch Synthesis and Recognition

In the past ten years, there have been several works on the topic of photo-sketch synthesis and recognition.

Tang and Wang [22] were the first to address the problem with a significant amount of data. They proposed a synthesis method based on eigen transformation (ET). It was under the assumption that the photo-sketch mapping can be approximated as linear, yet this assumption may be too strong, especially when the hair region is included. It then used a reconstruction error based distance for recognition.

Liu et al. [13] proposed a nonlinear face sketch synthesis method called locally linear embedding (LLE), which can be considered an improved version of [22]. Instead of modelling the whole image, it applied ET on local patches with overlapping. It adopted a kernel-based nonlinear LDA discriminative classifier for sketch recognition.

Another local-based strategy, proposed by Xiao et al. [26], was based on the Embedded Hidden Markov Model (E-HMM). They transformed the sketches to pseudo-photos and applied the eigenface algorithm for recognition.

In [24], Wang and Tang proposed multiscale Markov Random Fields (MRF) for face photo-sketch synthesis, which can be applied both to photo-to-sketch synthesis and to sketch-to-photo synthesis. They evaluated a series of classifiers on the pseudo-images; Random Sample LDA (RS-LDA) performed best among them. Zhang et al. [28] then improved the MRF framework by adding shape priors and descriptors robust to lighting variations.

Most of the methods above may lose some vital details, which influences the visual quality and face recognition performance. Thus, Zhang et al. [27] added a refinement step on top of existing approaches: they applied a support vector regression (SVR) based model to synthesize the high-frequency information. Similarly, Gao et al. [6] proposed a new method called SNS-SRE with two steps, i.e. sparse neighbor selection (SNS) to get an initial estimation, and sparse-representation-based enhancement (SRE) for further improvement.

2.2 Pixel-Wise Predictions via CNNs

Convolutional neural networks (CNNs) are widely used in computer vision. The typical structure contains a series of convolutional and pooling layers as feature extractors, and several fully connected layers as classifiers to give the prediction. It has achieved great success in large scale object classification, localization and detection [10, 8, 17, 4, 21, 7].

One important application of CNNs is to produce dense or even pixel-wise predictions. Sermanet et al. [17] developed a CNN-based framework called OverFeat, which integrated recognition, localization and detection. The key component was a pooling layer with offsets, which can imitate the sliding window technique and produce dense outputs in the final layer. A similar idea called spatial pyramid pooling (SPP) was proposed by He et al. [8], which also modified the last pooling layer. Wang et al. [25] proposed a joint architecture for generic object extraction, and Luo et al. [15] applied a deep decompositional network to pedestrian parsing. Both of them shared the idea of adopting a fully connected layer on top of the network to produce a dense prediction. Moreover, the patch-by-patch scanning technique on the original image has been studied by Ciresan et al. [1] in neuronal membrane segmentation and by Farabet et al. [5] for scene labeling.

Recently, Dong et al. [3] applied a CNN to image super-resolution, and it can produce a pixel-wise output. They discussed its relationship to sparse-coding-based methods, and concluded that three convolutional layers can simulate the representation-mapping-reconstruction procedure in sparse coding.

Our work was mostly inspired by [3]. We conducted our early experiments with their network and then further improved it. We call our model a fully convolutional network (FCN; a similar idea was also proposed in [14]), for it only contains convolutional layers and the corresponding activation functions, without any other layers like pooling, fully connected, or local response normalization (LRN) layers. We will discuss it further in Section 3.2.

3. PHOTO-SKETCH GENERATION

3.1 Formulation

In our model, we use two constraints for sketch generation. The first one is that the generated sketches should be as close to the ones drawn by artists as possible. This encourages the deep network to learn the sketching skills from the artists. The second one is that the generated sketches should be able to facilitate law enforcement. That is, given a sketch drawn by the artist, it should be possible to identify the subject in the photo database.
These two constraints are introduced as a generative loss and a discriminative regularizer in our objective function, which is similar to [2].

Suppose there are N subjects in the training set, with each subject containing one photo P_i and one sketch S_i. We use f(W, P_i) to denote the sketch generated by a fully convolutional network parameterized by W. We define our loss function as follows, where L_{gen} and L_{discrim} represent the generative loss and the discriminative regularizer respectively, and \alpha controls the weight of the discriminative regularizer in the overall objective:

L(P, S, W) = L_{gen}(P, S, W) + \alpha L_{discrim}(P, S, W)    (1)

For the generative loss, we use a straightforward function, defined as the pixel-wise difference between the ground-truth sketch and the generated one:

L_{gen}(P, S, W) = \frac{1}{N} \sum_{i=1}^{N} (S_i - f(W, P_i))^2    (2)

For the discriminative regularizer, we encourage the drawn sketch of one particular person to be different from the generated sketch of any other person, as defined by the following function:

L_{discrim}(P, S, W) = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N} \log\left(1 + e^{-(S_i - f(W, P_j))^2 / \lambda}\right)    (3)

\lambda is a parameter used to avoid numeric overflow.
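To make the objective concrete, below is a minimal numpy sketch of Equations (1)-(3) on a toy batch. The array shapes, the random stand-ins for the sketches and network outputs, and the hyperparameter defaults are illustrative assumptions; the actual training runs through Caffe, not this code.

```python
import numpy as np

def joint_loss(S, F, alpha=1e4, lam=1e9):
    """Joint generative-discriminative loss of Equations (1)-(3).

    S : (N, H, W) ground-truth sketches.
    F : (N, H, W) generated sketches f(W, P_i), one per subject.
    """
    N = S.shape[0]
    # Equation (2): squared pixel-wise difference, one term per subject.
    L_gen = np.sum((S - F) ** 2) / N

    # Equation (3): penalize the drawn sketch S_i being close to the
    # generated sketch of a *different* subject j.
    L_discrim = 0.0
    for i in range(N):
        for j in range(N):
            if i != j:
                d2 = np.sum((S[i] - F[j]) ** 2)
                L_discrim += np.log1p(np.exp(-d2 / lam))
    L_discrim /= N * (N - 1)

    # Equation (1): weighted combination.
    return L_gen + alpha * L_discrim

# Toy check with random data standing in for real sketches.
rng = np.random.default_rng(0)
S = rng.random((4, 188, 143))
F = rng.random((4, 188, 143))
print(joint_loss(S, F))
```

Note how \lambda rescales the squared distance inside the exponent; this is what keeps the exponential from underflowing when the pixel-wise distance over a whole image is large.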
Parameter Optimization. Using this model, the learning process is to minimize the loss function L(P, S, W). This is achieved by the standard network propagation algorithm in the batch training manner. More precisely, in each iteration we randomly select a set of photo-sketch pairs and construct the generative loss terms and discriminative regularizer terms respectively. In order to derive the gradient with respect to the network parameters W, the key step is to calculate the partial derivative of the loss function with respect to the output (generated sketch) of each photo. As one photo may be involved in several cost terms, we need to go through each term to accumulate the derivative with respect to the output. Algorithm 1 gives the details of the learning process.

Algorithm 1 Parameter Optimization
Require: Training photos {P_i} and sketches {S_i}
Ensure: Network parameters W
 1: while t < T do
 2:   t <- t + 1
 3:   Randomly select a subset of photos and sketches {P'_i}, {S'_i} from the training set
 4:   for all P'_i do
 5:     Do forward propagation to get f(W, P'_i)
 6:   end for
 7:   ΔW <- 0
 8:   for all P'_i do
 9:     Calculate the partial derivative of the loss with respect to the output: ∂L/∂f(W, P'_i)
10:    Run backward propagation to obtain the gradient with respect to the network parameters: ΔW_i
11:    Accumulate the gradient: ΔW <- ΔW + ΔW_i
12:  end for
13:  W_t <- W_{t-1} - λ_t ΔW
14: end while
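The accumulate-then-update structure of Algorithm 1 can be sketched as follows. The one-parameter toy model standing in for the FCN, the generative-only gradient, and the constant learning rate are all assumptions made to keep the example self-contained; the real optimization is performed by Caffe's solver.

```python
import numpy as np

rng = np.random.default_rng(3)
photos = rng.random((8, 16, 16))      # toy stand-ins for {P_i}
sketches = rng.random((8, 16, 16))    # toy stand-ins for {S_i}
W = np.zeros(())                      # toy scalar "network parameter"

def forward(W, P):                    # stand-in for f(W, P_i)
    return W * P

def grad_wrt_W(W, P, dL_df):          # backward pass of the toy model
    return np.sum(dL_df * P)

lr, T, batch = 1e-3, 100, 4
for t in range(T):
    idx = rng.choice(len(photos), size=batch, replace=False)
    dW = 0.0
    for i in idx:
        F = forward(W, photos[i])
        # Generative term only, for brevity: dL/df of Equation (2).
        dL_df = 2.0 * (F - sketches[i]) / batch
        dW += grad_wrt_W(W, photos[i], dL_df)   # accumulate (line 11)
    W = W - lr * dW                             # update (line 13)
print(float(W))
```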
3.2 Fully Convolutional Representation

In this subsection, we discuss the properties of convolutional layers and how they can be cascaded into a pixel-wise fully convolutional representation of an input image.

Composition of Convolutions. A typical convolutional layer with K kernels and activation function f can be formulated as

y_{ij}^{k} = f((W_k * x)_{ij} + b_k)    (4)

where x denotes the input feature maps, k \in \{1, 2, \ldots, K\} denotes the index of a filter, W_k and b_k are the k-th filter's weights and bias, and y_{ij}^{k} denotes the element at coordinate (i, j) of the k-th output feature map.

Equation (4) indicates that the convolutional operation preserves the spatial relationship. What's more, composition of convolutions does not change this property, i.e. we can use a stack of convolutional layers to represent a complex nonlinear mapping that can be applied to an input of arbitrary size and produce a corresponding spatial output.
Relationship to the Patch-Wise Representation. In Equation (4), a pixel on the output feature map is influenced only by a patch on the input feature map, i.e. its receptive field. This property is also maintained under composition of convolutions. Thus, a fully convolutional representation of the whole image can be regarded as a batch of patch-wise representations of patches of the image. Nevertheless, the fully convolutional representation is much more efficient, since the patches overlap significantly (we set the stride to 1 in all layers).

Border Effect Trade-Off. Convolutional operations shrink the size of the feature map. For example, if we apply a (3×3) convolution operator to a (3×3) input, the output size shrinks to (1×1). As more convolutional layers are stacked up, more shrinkage accumulates in the outputs. A possible solution is to add proper padding before the convolution operation, but this brings on border effects, as reported in [3]. Thus, as a trade-off, we do not use padding in the convolutional layers at either training or testing time. Each (155×200) photo image is shrunk to (143×188) in the representation (see Figure 2).
3.3 Implementation

Pre-processing. As described in [24], all the photos and sketches are translated, rotated, and scaled in the pre-processing step such that the two eye centers of all the face images lie at fixed positions. This simple geometric normalization brings the same face components of different images into rough alignment.

Another pre-processing step is inspired by the published results of [24].* The transformation mentioned above produces a (200×250) image, but it may have some black regions in the border areas. We crop the (155×200) center part of the image in order to exclude this negative influence.

*http://www.ee.cuhk.edu.hk/~xgwang/sketch_multiscale.html
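The center crop described above amounts to the following small helper; the symmetric placement of the crop window is our assumption, as the paper does not give the exact offsets.

```python
import numpy as np

def center_crop(img, out_h, out_w):
    """Crop the central (out_h, out_w) window of an (H, W, ...) image."""
    h, w = img.shape[:2]
    top, left = (h - out_h) // 2, (w - out_w) // 2
    return img[top:top + out_h, left:left + out_w]

aligned = np.zeros((250, 200, 3))  # the aligned photo, 200 wide x 250 tall
print(center_crop(aligned, 200, 155).shape)  # -> (200, 155, 3)
```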
[Figure 2: The overview of our model. It takes a full-size photo image as input and directly generates a full-size pseudo-sketch as output. The middle part is the architecture of our fully convolutional network. It contains six convolutional layers (kernel sizes 5×5, 5×5, 1×1, 1×1, 3×3, 3×3 with 128, 128, 64, 64, 1 and 1 filters respectively), with rectified linear units as activation functions (omitted in the figure). The 5-channel (RGB+XY) (155×200) input is mapped to a single-channel (143×188) output.]
Since we choose to avoid border effects, the sketch images need to be cropped to (143×188) to fit the network output dimension.

Spatial Patch-Wise Learning with Overlapping. From the previous works on photo-sketch synthesis [13, 24], patch-wise learning with overlapping is very important for handling the non-linearity between photos and sketches. Intuitively, patches at different positions are diverse from one another (e.g. eye patches, nose patches and mouth patches). Therefore, learning different patch representations at different spatial positions is a very straightforward idea.

We handle it by adding additional XY channels to the input, i.e. the input image data contain five channels: the three RGB channels of the photo image, and two channels holding the corresponding coordinates (i, j). This tiny modification significantly improves the result, as discussed in Section 4.3.
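A minimal sketch of assembling the five-channel input is given below. Treating the XY channels as per-pixel coordinate maps follows the text; normalizing them to [0, 1] is our assumption, since the paper does not specify the coordinate scaling.

```python
import numpy as np

def add_xy_channels(rgb):
    """Stack two coordinate channels onto an (H, W, 3) RGB photo,
    giving the (H, W, 5) input described above."""
    H, W, _ = rgb.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Assumption: normalize coordinates to [0, 1].
    ys = ys / (H - 1)
    xs = xs / (W - 1)
    return np.dstack([rgb, xs, ys]).astype(np.float32)

photo = np.zeros((200, 155, 3), dtype=np.float32)
print(add_xy_channels(photo).shape)  # -> (200, 155, 5)
```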
Network Architecture. We apply the network proposed in [3] in our early experiments, then modify its architecture for further enhancement. We borrow the idea of [18], which had great success in ILSVRC-2014: their main contribution is to deepen the CNN under computational constraints via very small (3×3) convolutional filters in all layers.

Similarly, we modify the 3-layer network of [3] into a 6-layer network, which is shown in Figure 2. We replace its (9×9), (1×1) and (5×5) convolutional layers with two (5×5), two (1×1) and two (3×3) convolutional layers respectively. Moreover, we double the filter amounts in every layer to improve the network's capacity.

We call it the medium network due to its size. We have two similar architectures, called the small network and the large network, which will be further discussed in Section 4.3.
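As a quick sanity check, the kernel sizes of the six layers, read off Figure 2, reproduce exactly the (155×200) to (143×188) shrinkage stated in Section 3.2, assuming unpadded stride-1 convolutions throughout:

```python
# Kernel sizes of the six convolutional layers of the medium
# network, as read off Figure 2.
kernels = [5, 5, 1, 1, 3, 3]

def output_size(h, w, kernels):
    # Each unpadded k x k stride-1 convolution trims k - 1 pixels
    # from each axis.
    for k in kernels:
        h, w = h - (k - 1), w - (k - 1)
    return h, w

print(output_size(200, 155, kernels))  # -> (188, 143)
```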
Training Details. Our results are based on our medium network (see Figure 2). As mentioned above, we first pre-process the photo images into (155×200) and the sketch images into (143×188). The optimization objective has been discussed in Section 3.1. We set λ to 10^9 and α to 10^4.

We use Caffe [9] for the implementation. The filter weights are initialized by drawing randomly from a Gaussian distribution with zero mean and standard deviation 0.01, and the biases are initialized to zero. We set the learning rate to 10^{-11}; training then takes several hours to converge on an NVIDIA Tesla K40 GPU.

4. EXPERIMENTS

We evaluate our model on the CUHK student dataset [24]. It includes 188 faces: 88 faces are selected for training and the remaining 100 for testing. For each face, there is a sketch drawn by the artist and a photo taken in a frontal pose, under normal lighting conditions, and with a neutral expression.

We mainly conduct three groups of experiments, presented in the next three subsections. Section 4.1 discusses our photo-sketch generation results and compares them with the synthesis based method of [24]. Section 4.2 shows that our generated pseudo-sketches significantly reduce the modality difference between photos and sketches, so that they can be adopted in a sketch-based face verification system. Section 4.3 presents an empirical study of our model: several factors, such as network depth and filter numbers, are discussed on the evaluation protocol both qualitatively and quantitatively.

4.1 Face-Sketch Generation

In Figure 3, we show some examples of our generated results and compare them to [24]. For a fair comparison, we list all 14 pseudo-sketches that can be found on their website.

Figure 3 indicates that our results contain more vital details. For example, the man in column (e) of the first row has two little locks of hair on his forehead, but this detail is missing in the pseudo-sketch of [24] (column (g)). On the contrary, our generated result (column (f)) retains this detail, which may be very important for distinguishing this man from others. Another example is the woman in column (a) of the second row. She has a distinctive feature compared to other persons in the database: two big eyes. Our method also captures this detail in the pseudo-sketch (column (b)).

This suggests that synthesis based methods may fail in some cases. The main reason is that they rely on the assumption that a synthesized sketch image (or patch) can be reconstructed from the sketch images (or patches) in the training set. However, this assumption may be too strong, for some persons do have their own facial distinctions.

Our generative method overcomes this difficulty. We try to directly learn the transformation from photo space to sketch space. Even though it is a complex nonlinear mapping, the FCN has the capacity to handle this challenge.

Compared to the traditional synthesis methods, our generative approach has two advantages.
[Figure 3: Compared to the synthesis based method [24], our generative approach retains more details from the original photos, e.g. the two locks of hair on the forehead of the man in column (e) of row 1, and the two big eyes of the woman in column (a) of row 2. (a)&(e) photos; (b)&(f) our generative pseudo-sketches; (c)&(g) synthesized pseudo-sketches of [24]; (d)&(h) sketches drawn by the artist.]
Firstly, it retains more detail information from the photos. Secondly, our inference time is independent of the amount of training data: the time cost of synthesis based methods grows linearly with the data amount, while the runtime of our approach is only affected by the size of the input photo.

4.2 Sketch-Based Face Verification

In this subsection, we show that our generated pseudo-sketches significantly reduce the modality difference between photos and sketches. We follow the same testing procedure as [22, 13, 24], which can be summarized in two steps: (a) convert the photos in the testing set into corresponding pseudo-sketches; (b) define a feature or transformation to measure the distance between the query sketch and the pseudo-sketches.

In our implementation, we use our model to generate the pseudo-sketches in step (a), and use our generative loss (see Equation (2)) as the distance measurement. Since the sizes of our pseudo-sketches are (188×143), we also crop the query sketch to (188×143).

Following the same protocol described in [22], we compare our approach with previous methods on the cumulative match score (CMS) in Table 1. CMS measures the percentage of queries for which "the correct answer is in the top n matches", where n is called the rank.
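To pin down the protocol, here is a small sketch of computing CMS from a query-to-gallery distance matrix. It assumes the gallery is indexed so that the correct pseudo-sketch for query q sits at index q, and uses random distances as a placeholder for the Equation (2) distances.

```python
import numpy as np

def cumulative_match_score(dist, ranks=(1, 3, 5, 10)):
    """dist[q, g]: distance from query sketch q to pseudo-sketch g;
    the correct gallery item for query q is assumed to be index q.
    Returns the percentage of queries whose match is in the top n."""
    order = np.argsort(dist, axis=1)  # best match first
    # Position of the correct identity in each query's ranking.
    pos = np.argmax(order == np.arange(dist.shape[0])[:, None], axis=1)
    return {n: 100.0 * np.mean(pos < n) for n in ranks}

rng = np.random.default_rng(1)
print(cumulative_match_score(rng.random((100, 100))))
```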
As shown in Table 1, we form a baseline experiment which converts the photo into gray scale as a crude "pseudo-sketch", to clarify the modality difference between photo space and sketch space. In this baseline, the rank-1 accuracy equals only 41%, which is far from satisfying. An early method like ET [22] got 71% for the top one match. Later trials [24], [28], [27] greatly improved the top-1 verification accuracy to 96%, 99% and even 100%.

From the last row of Table 1, our approach also gets a 100% correct answer on its first guess. It is worth noticing that our approach is quite different from the former methods, for it applies generation instead of synthesis.

            Rank 1   Rank 3   Rank 5   Rank 10
Baseline      41       56       59       70
ET [22]       71       81       88       96
MRF [24]      96        -        -      100
MRF+ [28]     99        -        -      100
SVR [27]     100        -        -        -
Ours         100      100      100      100

Table 1: Cumulative Match Score (CMS) comparison on the full training set (88 photo-sketch pairs). Our method achieves 100% accuracy on the first guess.

Moreover, our generative method is very robust and has excellent generalization ability. Figure 4 suggests that our method can use only a portion of the training data to reach 100% rank-1 accuracy. For example, we need only 5 training samples to reach 95% accuracy, and 27 training samples are enough to get a 100% score for the top one candidate.

We also verify our optimization objective in Figure 4. It suggests that the model trained with the discriminative regularizer consistently outperforms the model trained without it.

[Figure 4: Rank-1 CMS (%) versus the number of training samples. The model trained with the discriminative regularizer consistently outperforms the model trained without it. Both models reach 100% accuracy on the rank-1 candidate while using less than half of the training samples, i.e. 27 samples out of 88 photo-sketch pairs.]
4.3 Empirical Study

In this subsection, further analyses are conducted on the factors which affect our photo-sketch generation performance.

Different Network Architectures. Our first concern is the depth of the network and the number of filters. We used the network proposed in [3] (we call it the SR network) in our early experiments. However, the result is not quite satisfying, for this network seems too small to learn the complex nonlinear mapping from photos to sketches. Thus, we improve the network by: (a) adding more layers to the network; and (b) adding more filters to each layer.

We conduct our experiments on three different network architectures. Due to their scales, we call them the small, medium and large networks respectively. The medium network architecture can be found in Figure 2; for its differences from the SR network, please refer to Section 3.3. Moreover, the only difference among our small, medium and large networks is their filter numbers: the medium network has twice as many filters as the small one, and the large network doubles the filter numbers of the medium network.

For comparison, we define a measurement called MPRL, which is short for multiscale pixel-wise reconstruction loss. It evaluates the pixel-wise accuracy at different scales. To explain MPRL, we first introduce the pixel-wise reconstruction loss (PRL) on one photo-sketch pair.

For a photo-sketch pair, we denote the sketch as the ground truth (GT) and the generated pseudo-sketch as the prediction (P). Both of their sizes are (W×H). Thus, PRL can be formulated as

PRL = \frac{1}{W \times H} \sqrt{\sum_{x=1}^{W} \sum_{y=1}^{H} (GT(x,y) - P(x,y))^2}    (5)

MPRL is the multiscale version of PRL. In practice, for each photo-sketch pair, we rescale both the GT and the P to three scales {0.5, 1, 2} and evaluate the PRL at each scale to form the MPRL. Then, for the whole test set, we evaluate all the pairs and average their MPRL as the final measurement.
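For reference, a compact sketch of PRL (Equation (5)) and its multiscale variant; the nearest-neighbor rescaling is our assumption, as the paper does not name the interpolation used, and the random images are placeholders.

```python
import numpy as np

def rescale_nn(img, s):
    """Nearest-neighbor rescale by factor s (an assumption; the paper
    does not specify the interpolation used for MPRL)."""
    H, W = img.shape
    h, w = max(1, int(H * s)), max(1, int(W * s))
    ys = np.arange(h) * H // h
    xs = np.arange(w) * W // w
    return img[ys[:, None], xs[None, :]]

def prl(gt, p):
    """Pixel-wise reconstruction loss of Equation (5)."""
    H, W = gt.shape
    return np.sqrt(np.sum((gt - p) ** 2)) / (W * H)

def mprl(gt, p, scales=(0.5, 1, 2)):
    return [prl(rescale_nn(gt, s), rescale_nn(p, s)) for s in scales]

rng = np.random.default_rng(2)
gt, p = rng.random((188, 143)), rng.random((188, 143))
print(mprl(gt, p))
```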
Figure 5 summarizes the results both qualitatively and quantitatively. The MPRL of the SR network is {34.5, 36.8, 36.2}. Our small network reduces it to {32.6, 35.0, 34.4}. Our medium and large networks then achieve better performance, at {32.1, 34.6, 34.0} and {30.0, 32.5, 31.9} respectively.

The pseudo-sketch examples generated by each network are shown in the bottom part of Figure 5. The results of the SR network capture the overall structures, but they are far from satisfying in the details. Our small network's generations are considerably better and have clearer contours. Moreover, the pseudo-sketches generated by the medium network are more vivid than the small one's. The results of the large network are as good as the medium one's, but the large network is much more time consuming.

[Figure 5: Comparison of the MPRL measurement (at scales 0.5, 1 and 2) and generated pseudo-sketches across different network architectures. The model achieves better performance as the network is made deeper or given more filters. (a-d) pseudo-sketches generated by the network in [3], our small network, our medium network, and our large network.]

Table 2 shows that our small network is almost as fast as the SR network. The medium network costs about two times as much as the small one, and the large network's runtime is about three times that of the medium one. To balance effectiveness and efficiency, we prefer our medium network as the final choice.

        SR Net   Small Net   Medium Net   Large Net
Time    8.5 ms    8.6 ms      17.1 ms      50.8 ms

Table 2: Runtime for a single (155×200) image on an NVIDIA Tesla K40 GPU.

Difference Between Training With and Without XY Channels. Now we consider another factor that influences the model's performance. As mentioned above, we add XY channels to the RGB color channels. We assume these additional spatial cues could help the network distinguish different spatial patches and learn different representations for them.

Similarly, we use the MPRL for the quantitative measurement and pseudo-sketch comparison for the qualitative analysis. In Figure 6, the left part shows results generated by our medium network trained without XY channels, and the right part shows results generated by our medium network trained normally. Figure 6 suggests that the latter outperforms the former both in numbers and in perception.

[Figure 6: Comparison of the MPRL measurement (at scales 0.5, 1 and 2) and generated pseudo-sketches for the network trained without and with XY channels. Adding XY channels makes the model more robust. (a) pseudo-sketches generated by the network trained without XY channels; (b) pseudo-sketches generated by the network trained with XY channels.]

5. CONCLUSION AND FUTURE WORK

In this paper, we propose an end-to-end fully convolutional network to directly model the complex nonlinear mapping between face photos and sketches. From the experiments, we find that the fully convolutional network is a powerful tool that can handle this difficult problem while providing pixel-wise predictions both effectively and efficiently.

However, this solution still has some limitations. Firstly, the synthesized pseudo-sketches of [24] have sharper edges and clearer contours than our generated ones. Secondly, the database involved in our experiments only contains Asian face images, which may limit the generalization ability of our model to other racial groups.
In future work, we will further improve our loss function and try various databases in our experiments, and we may explore the relation between our work and those involving non-photorealistic rendering [11].

Acknowledgements. We gratefully acknowledge NVIDIA for GPU donations.
References

[1] D. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In Advances in Neural Information Processing Systems, pages 2843-2851, 2012.
[2] S. Ding, L. Lin, G. Wang, and H. Chao. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition, 2015. doi: 10.1016/j.patcog.2015.04.005.
[3] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In Computer Vision - ECCV 2014, pages 184-199. Springer, 2014.
[4] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. arXiv preprint arXiv:1312.2249, 2013.
[5] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915-1929, 2013.
[6] X. Gao, N. Wang, D. Tao, and X. Li. Face sketch-photo synthesis and retrieval using sparse representation. IEEE Transactions on Circuits and Systems for Video Technology, 22(8):1213-1226, 2012.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint arXiv:1311.2524, 2013.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In Computer Vision - ECCV 2014, pages 346-361. Springer, 2014.
[9] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[11] L. Lin, K. Zeng, Y. Wang, Y.-Q. Xu, and S.-C. Zhu. Video stylization: Painterly rendering and optimization with content extraction. IEEE Transactions on Circuits and Systems for Video Technology, 23(4):577-590, 2013.
[12] J. Liu, Y. Chen, and W. Gao. Mapping learning in eigenspace for harmonious caricature generation. In Proceedings of the 14th Annual ACM International Conference on Multimedia, pages 683-686. ACM, 2006.
[13] Q. Liu, X. Tang, H. Jin, H. Lu, and S. Ma. A nonlinear approach for face sketch synthesis and recognition. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pages 1005-1010. IEEE, 2005.
[14] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR (to appear), Nov. 2015.
[15] P. Luo, X. Wang, and X. Tang. Pedestrian parsing via deep decompositional network. In IEEE International Conference on Computer Vision (ICCV), pages 2648-2655. IEEE, 2013.
[16] S. Ren, C. Jin, C. Sun, and Y. Zhang. Sketch-based image retrieval via adaptive weighting. In Proceedings of the International Conference on Multimedia Retrieval, page 427. ACM, 2014.
[17] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In International Conference on Learning Representations (ICLR 2014). CBLS, April 2014.
[18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[19] J. R. Smith and S.-F. Chang. VisualSEEk: a fully automated content-based image query system. In Proceedings of the Fourth ACM International Conference on Multimedia, pages 87-98. ACM, 1997.
[20] Y. Song, L. Bao, Q. Yang, and M.-H. Yang. Real-time exemplar-based face sketch synthesis. In Proceedings of the European Conference on Computer Vision, pages 800-813, 2014.
[21] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In Advances in Neural Information Processing Systems, pages 2553-2561, 2013.
[22] X. Tang and X. Wang. Face sketch recognition. IEEE Transactions on Circuits and Systems for Video Technology, 14(1):50-57, 2004.
[23] S. Wang, L. Zhang, Y. Liang, and Q. Pan. Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012.
[24] X. Wang and X. Tang. Face photo-sketch synthesis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):1955-1967, 2009.
[25] X. Wang, L. Zhang, L. Lin, Z. Liang, and W. Zuo. Deep joint task learning for generic object extraction. In Advances in Neural Information Processing Systems, pages 523-531, 2014.
[26] B. Xiao, X. Gao, D. Tao, and X. Li. A new approach for face recognition by sketches in photos. Signal Processing, 89(8):1576-1588, 2009.
[27] J. Zhang, N. Wang, X. Gao, D. Tao, and X. Li. Face sketch-photo synthesis based on support vector regression. In 18th IEEE International Conference on Image Processing (ICIP), pages 1125-1128, Sept. 2011. doi: 10.1109/ICIP.2011.6115625.
[28] W. Zhang, X. Wang, and X. Tang. Lighting and pose robust face sketch synthesis. In Computer Vision - ECCV 2010, pages 420-433. Springer, 2010.