Subspace Clustering Based Tag Sharing for Inductive Tag Matrix Refinement with Complex Errors

Yuqing Hou†, Zhouchen Lin†∗, Jin-ge Yao§
†Key Lab. of Machine Perception (MOE), School of EECS, Peking University, P.R. China
§Institute of Computer Science and Technology, Peking University, P.R. China
{yqh, zlin, yaojinge}@pku.edu.cn
∗Corresponding author.
ABSTRACT

Annotating images with tags is useful for indexing and retrieving images. However, many available annotation data include missing or inaccurate annotations. In this paper, we propose an image annotation framework which sequentially performs tag completion and refinement. We utilize the subspace property of data via sparse subspace clustering for tag completion. Then we propose a novel matrix completion model for tag refinement, integrating visual correlation, semantic correlation and the novelly studied property of complex errors. The proposed method outperforms the state-of-the-art approaches on multiple benchmark datasets even when they contain certain levels of annotation noise.

Keywords

image annotation; subspace clustering; matrix completion; complex errors
1. INTRODUCTION

It is useful to annotate images with textual tags for the purpose of image indexing and retrieval. To annotate proper tags, one needs to bridge the gap between the low level visual features of an image and the corresponding high level semantic information [16]. Since manual annotation is labor intensive, automatic annotation has attracted much attention, and many machine learning based approaches have been developed.

Currently many image annotation data have been collected from crowdsourcing services [21, 12], providing large amounts of data for training while being noisy due to annotation errors. Annotation errors are usually complex and mainly come in two forms: missing tags and inaccurate tags. Most image annotation approaches solely focus on one of the two, either trying to impute the missing tags (tag completion/tag assignment) [11] or correcting inaccurate tags (tag refinement) [22, 8, 16]. Other existing methods fail to model the complex errors properly. They either treat them in the same way [24], ignoring the complex property of the errors, or rigidly assign fixed weights to different kinds of errors [11], having no adaptability when working on different datasets with different levels of annotation errors.

In this paper, we propose a framework called Subspace clustering and Matrix completion with Complex errors (SMC). Since current tag refinement methods suffer from the extreme sparsity problem [20], SMC performs tag completion and refinement sequentially. During tag completion, SMC tries to introduce many additional proper tags to images by exploring the subspace property of the image collection. We then adapt the inductive matrix completion model [13] to perform the subsequent tag refinement procedure, utilizing side information such as the correlation between visual features and their corresponding tags (visual correlation), the correlation between the semantic information of tags (semantic correlation) and the complex errors.

The main contributions of this paper include:

• We perform tag completion and tag refinement sequentially, showing that tag refinement benefits from tag completion.

• We formulate tag completion in a subspace clustering framework to tackle the extreme sparsity problem.

• We novelly adapt the inductive matrix completion model for tag refinement, taking visual correlation, semantic correlation and our novelly studied complex errors property into consideration.
2. THE PROPOSED FRAMEWORK

2.1 Overview

We denote the observed tag matrix as O ∈ {0,1}^(Ni×Nt), where each row corresponds to one image, each column corresponds to one textual tag, and Ni and Nt denote the number of images and tags, respectively. Oij takes value 1 only if image i is annotated with tag j, and 0 otherwise.

We aim to modify the values in the matrix O by matrix completion methods to perform image annotation. After the matrix completion procedure, if the value of Oij changes from nonzero (zero) to zero (nonzero), we say that the algorithm removes (adds) tag j from (to) image i. Methods based on matrix completion are robust and efficient since they only operate on the tag matrix, avoiding error propagation from image segmentation.
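To make the row/column convention and the add/remove reading concrete, below is a minimal sketch assuming NumPy arrays; the 0.5 threshold used to binarize a real-valued refined matrix is our own illustrative choice, not something prescribed above.

```python
import numpy as np

# Rows are images, columns are tags.  The 0.5 threshold for binarizing a
# real-valued refined matrix is an illustrative assumption.
def tag_changes(O, O_refined, threshold=0.5):
    """List (image, tag) index pairs whose annotation was added or removed."""
    before = O > 0
    after = O_refined > threshold
    added = np.argwhere(~before & after)    # 0 -> nonzero: tag j added to image i
    removed = np.argwhere(before & ~after)  # nonzero -> 0: tag j removed from image i
    return added, removed

O = np.array([[1, 0, 0],
              [0, 1, 0]])
O_hat = np.array([[0.9, 0.7, 0.1],
                  [0.2, 0.1, 0.0]])
added, removed = tag_changes(O, O_hat)
print(added.tolist(), removed.tolist())  # [[0, 1]] [[1, 1]]
```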
However, in many cases O is so sparse that some columns have at most one known entry and some rows have no known entries at all, making existing methods not applicable [20]. In order to overcome such extreme sparsity, we first perform tag completion to make O denser, creating a better condition for the following tag refinement procedure. More specifically, we perform subspace clustering over the images and share tags within each subspace.
For tag refinement, existing methods usually depend heavily on the accuracy of image segmentation and visual feature extraction [1]. However, image segmentation and feature extraction procedures always contain a lot of noise, which severely affects the subsequent annotation procedure. Meanwhile, recent matrix completion based methods [10, 8, 9] stand out due to their robustness and efficiency, since these algorithms avoid the image segmentation procedure.

Our proposed framework is called Subspace clustering and Matrix completion with Complex errors (SMC), because it utilizes the subspace property of image collections (Section 2.2) and addresses the complex errors and side information in an inductive matrix completion model for tag refinement (Section 2.3).

2.2 Subspace Clustering for Tag Completion

2.2.1 Subspace Clustering

It is reasonable to assume that images belonging to different categories are approximately sampled from a mixture of several low-dimensional subspaces. The membership of the data points to the subspaces is unobserved, leading to the challenging problem of subspace clustering. Here the goal is to cluster the data into k clusters, with each cluster corresponding to a subspace.

One of the state-of-the-art methods is the sparse subspace clustering (SSC) model [5]. The idea behind SSC is to express a data point as a linear (or affine) combination of neighboring data points, where the neighbors can be any other points in the data set. While every point is a combination of all other data points, SSC seeks the sparsest representation among all the candidates by minimizing the number of nonzero coefficients [6].

We denote the set of images, represented as visual feature vectors, as V = [v1, v2, ..., vn], assuming that they are drawn from a union of k subspaces. Each column of V can be represented by a linear combination of the bases in a "dictionary". SSC uses the matrix V itself as the dictionary while explicitly considering noise:

    min_{Z,E} ‖Z‖_1 + µ‖E‖_F^2,    (1)
    s.t. V = VZ⊤ + E, diag(Z) = 0, Z1 = 1,    (2)

where Z⊤ = [z1, z2, ..., zn] is the coefficient matrix, with each zi being the representation of vi, and E is the error matrix. This problem can be solved efficiently using modern sparse optimization algorithms, such as linearized alternating direction methods [17]. Given a sparse representation for each data point, we can define the affinity matrix as A = |Z| + |Z⊤|. Subspaces are then obtained by applying spectral clustering to the Laplacian matrix of A [5].
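The following sketch strings the pieces of Section 2.2.1 together in Python under two simplifying assumptions: the exact program (1)-(2) is replaced by an independent Lasso per data point (so the affine constraint Z1 = 1 and the linearized ADM solver of [17] are not used), and scikit-learn's spectral clustering stands in for the final clustering step on A = |Z| + |Z⊤|.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def ssc_clusters(V, k, alpha=0.01):
    """Sparse-subspace-clustering sketch: V holds one image feature vector per column."""
    n = V.shape[1]
    Z = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]            # enforce diag(Z) = 0
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        lasso.fit(V[:, idx], V[:, i])                    # v_i ≈ V_{-i} z_i with sparse z_i
        Z[i, idx] = lasso.coef_
    A = np.abs(Z) + np.abs(Z.T)                          # affinity matrix A = |Z| + |Z^T|
    labels = SpectralClustering(n_clusters=k,
                                affinity='precomputed').fit_predict(A)
    return labels
```

Here alpha controls the sparsity of each representation and k (the number of subspaces) would be chosen per dataset.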
2.2.2 Tag Sharing

We improve the search based neighbor voting algorithm proposed in [18] to share tags within each cluster separately. We rank all the tags for a cluster, taking tag frequency, tag co-occurrence and local frequency into consideration. The elements of the tag matrix after tag sharing are no longer binary but take values in [0,1], representing the confidence level of each image-tag pair.
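As a rough illustration of tag sharing, the sketch below scores tags by their within-cluster frequency only (the full ranking in the text also uses tag co-occurrence and local frequency) and writes the scores back as confidences in [0,1]; labels is assumed to come from the subspace clustering step above, and top_m is an illustrative cutoff.

```python
import numpy as np

def share_tags(O, labels, top_m=10):
    """Tag-sharing sketch: within each cluster, score tags by their local
    frequency and propagate the top-scoring ones as confidences in [0, 1]."""
    O_shared = O.astype(float).copy()
    for c in np.unique(labels):
        rows = np.where(labels == c)[0]
        freq = O[rows].sum(axis=0) / len(rows)   # per-tag frequency inside the cluster
        top = np.argsort(freq)[::-1][:top_m]     # highest-ranked tags in the cluster
        for i in rows:
            for j in top:
                if freq[j] > 0:
                    O_shared[i, j] = max(O_shared[i, j], freq[j])
    return O_shared
```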
2.3 Matrix Completion for Tag Refinement

The tag completion procedure makes the tag matrix much denser and thus avoids the extreme sparsity problem, so we can then refine the tag matrix. In our framework we novelly adapt the inductive matrix completion (IMC) model [13] for tag refinement, due to its scalability and its capability of incorporating various kinds of side information.

2.3.1 Inductive Matrix Completion

Let vi ∈ R^fi denote the fi-dimensional feature vector of image i and tj ∈ R^ft denote the ft-dimensional feature vector of tag j. Let V ∈ R^(Ni×fi) denote the feature matrix of the Ni images, where the i-th row is the image feature vector vi⊤, and let T ∈ R^(Nt×ft) denote the feature matrix of the Nt tags, where the j-th row is the tag feature vector tj⊤.

For image annotation, we assume that the tag matrix can be approximated by applying the visual feature vectors and tag feature vectors associated with its row and column entries onto an underlying low-rank matrix M, i.e. O ≈ VMT⊤, where M = PQ⊤ [13] with P ∈ R^(fi×r) and Q ∈ R^(ft×r) of rank r ≪ Ni, Nt. The goal is to solve the following problem:

    min_{P,Q} loss(O, VPQ⊤T⊤) + λ1 rank(PQ⊤).    (3)

A common choice for the loss function is the squared loss. The low-rank constraint on PQ⊤ makes (3) NP-hard. A standard relaxation is to use the trace norm, i.e. the sum of singular values. Minimizing the trace norm of M = PQ⊤ is equivalent to minimizing (1/2)(‖P‖_F^2 + ‖Q‖_F^2) [13]. The relaxed optimization problem we use in this work is therefore:

    min_{P,Q} ‖O − VPQ⊤T⊤‖_F^2 + (λ1/2)(‖P‖_F^2 + ‖Q‖_F^2).    (4)
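For concreteness, the relaxed objective (4) can be written out directly; the function below is a straightforward transcription using the dimensions defined above.

```python
import numpy as np

def imc_objective(O, V, T, P, Q, lam1):
    """Relaxed inductive-matrix-completion objective (4):
       ||O - V P Q^T T^T||_F^2 + (lam1 / 2) * (||P||_F^2 + ||Q||_F^2).
       Shapes: V is Ni x f_i, T is Nt x f_t, P is f_i x r, Q is f_t x r."""
    O_hat = V @ P @ Q.T @ T.T                  # Ni x Nt reconstruction
    fit = np.linalg.norm(O - O_hat, 'fro') ** 2
    reg = 0.5 * lam1 * (np.linalg.norm(P, 'fro') ** 2 + np.linalg.norm(Q, 'fro') ** 2)
    return fit + reg
```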
2.3.2 Visual Correlation

We want to obtain the refined tag matrix Ô = VPQ⊤T⊤ from the original tag matrix O. Here we denote the i-th row of Ô as Ôi, corresponding to the refined tag vector of image i. We can thus measure the correlation between image i and image j in two ways: 1) the similarity between the image features vi and vj, and 2) the similarity between the refined tag vectors Ôi and Ôj. Since visually similar images often belong to similar themes and thus are annotated with similar tags, these two kinds of similarities should be correlated.

Such visual correlation can be enforced by solving the following optimization problem:

    min_{P,Q} Σ_{i=1}^{Ni} Σ_{j=1}^{Ni} ‖Ôi − Ôj‖^2 g_ij,    (5)

where ‖Ôi − Ôj‖^2 measures the distance between the tag vectors Ôi and Ôj and g_ij measures the similarity between the visual features vi and vj. In this work we adopt the cosine similarity, i.e. g_ij = cos(vi, vj). The formulation forces the tag vectors of visually similar images to be similar as well, and vice versa.

The formulation can be rewritten (up to a constant factor) as

    min_{P,Q} Tr(Ô⊤ Lv Ô) = min_{P,Q} Tr(TQP⊤V⊤ Lv VPQ⊤T⊤),    (6)

where Lv = diag(G1) − G is the graph Laplacian [3] of the similarity matrix G = (g_ij).
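A small sketch of the quantities in (5)-(6): the cosine-similarity graph G over image features, its unnormalized Laplacian Lv = diag(G1) − G, and the resulting trace penalty (which equals the double sum in (5) up to a constant factor). cosine_similarity is scikit-learn's pairwise helper.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def graph_laplacian(X):
    """Cosine-similarity graph over the rows of X and its unnormalized
    Laplacian L = diag(G 1) - G, as used for L_v (rows = image features v_i)."""
    G = cosine_similarity(X)                   # G_ij = cos(x_i, x_j)
    return np.diag(G.sum(axis=1)) - G

def visual_correlation_penalty(O_hat, Lv):
    """Tr(O_hat^T L_v O_hat), the trace form (6) of the pairwise penalty (5)."""
    return np.trace(O_hat.T @ Lv @ O_hat)
```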
2.3.3 Semantic Correlation

Similarly, we can also enforce semantic correlation between tags. Since each column of the matrix Ô represents the feature of a tag, we can measure the correlation between two tags using the similarity between their corresponding column vectors of Ô. Meanwhile, the semantic similarity between two tags can be measured using word vectors. These two kinds of similarities should be correlated as well.

We can enforce the semantic correlation by solving the following optimization problem, in a similar form as (6):

    min_{P,Q} Tr(Ô Ls Ô⊤) = min_{P,Q} Tr(VPQ⊤T⊤ Ls TQP⊤V⊤),    (7)

where Ls = diag(H1) − H is the graph Laplacian of the similarity matrix H = (h_ij), with each element h_ij = cos(ti, tj).
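The semantic term (7) is the same construction applied on the tag side, with tag feature vectors (e.g. the word vectors introduced in Section 2.3.4) in place of the image features; a self-contained sketch:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def semantic_penalty(O_hat, T_feat):
    """Tr(O_hat L_s O_hat^T) from (7): L_s is the Laplacian of the cosine-similarity
    graph H over tag feature vectors (rows of T_feat), acting on the columns of O_hat."""
    H = cosine_similarity(T_feat)              # h_ij = cos(t_i, t_j)
    Ls = np.diag(H.sum(axis=1)) - H            # L_s = diag(H 1) - H
    return np.trace(O_hat @ Ls @ O_hat.T)
```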
2.3.4 Feature Vectors

We utilize DeCAF6 [4] to extract a 4,096-dimensional visual feature vector for each image, which carries high level information. Meanwhile, we adopt pre-trained word embedding vectors (word2vec) [19] to construct a 300-dimensional feature vector for each tag, trying to capture semantic information.
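A possible way to build the 300-dimensional tag features with gensim is sketched below; the model path is a placeholder, and the zero-vector fallback for out-of-vocabulary tags is our own choice. The 4,096-dimensional DeCAF6 image features would be extracted separately with the pretrained network of [4] (not shown here).

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path: any 300-dimensional pre-trained word2vec binary works here.
w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

def tag_features(tags, dim=300):
    """One 300-d word2vec vector per tag; unseen tags fall back to a zero vector."""
    T_feat = np.zeros((len(tags), dim))
    for j, tag in enumerate(tags):
        if tag in w2v:
            T_feat[j] = w2v[tag]
    return T_feat
```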
2.3.5 Complex Errors

As we have mentioned, annotation errors come in two forms: missing tags and inaccurate tags. Since human beings are relatively reasonable, the user-provided tags are reasonably accurate to a certain level [24]. Users might miss one or several proper tags among the few related tags, but are less likely to add one or several inaccurate tags from the massive set of unrelated tags [12]. In other words, if an image is not originally annotated with a tag, it is more likely that the two really have no relation at all. Thus the errors are mainly composed of inaccurate tags rather than missing tags, and we should pay more attention to denoising the inaccurate tags than to completing the missing ones.

To model the complex structure of the errors, we improve the matrix completion model by putting less weight on the unannotated positions:

    min_{P,Q} ‖O − VPQ⊤T⊤‖_F^2 − µ‖U_Ω(O − VPQ⊤T⊤)‖_F^2,    (8)

where Ω represents the positions where the images are originally not annotated, U_Ω is the projection operator onto Ω, and µ acts as a weighting parameter which changes adaptively across datasets according to their noise levels.

Existing methods never model these two kinds of errors separately. They simply model the errors as Laplacian noise [24] or Gaussian noise [22]. To our knowledge, our model is the first to model the missing errors and the inaccurate errors separately. The model can further adapt to different datasets according to their noise levels.
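The effect of the −µ‖U_Ω(·)‖_F^2 term is simply to down-weight residuals at originally unannotated positions from 1 to (1 − µ); a direct transcription, where the mask Ω is read off the original binary annotations:

```python
import numpy as np

def weighted_data_term(O, O_hat, O_binary, mu):
    """Data-fitting part of (8): residuals at originally unannotated
    positions (Omega, taken from the original binary annotations) are
    down-weighted from 1 to (1 - mu)."""
    R2 = (O - O_hat) ** 2
    omega = (O_binary == 0)                    # originally unannotated positions
    return R2.sum() - mu * R2[omega].sum()     # = R2 on Omega^c + (1 - mu) * R2 on Omega
```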
the benefit from the tag completion procedure. We tune
whereΩrepresentsthepositionswheretheimagesareorig-
the parameters on the validation set of the two datasets
inally not annotated. U isaprojection operatorandµacts
separately for every method in comparison. Note that the
as a weighting parameter which changes adaptively in dif-
weightingparameterµwetunechangesfrom 0.4(Labelme)
ferent datasets according to theirnoise levels.
to0.7(MIRFlickr-25K),confirmingthatasthedatabecome
Existing methods never model these two kinds of errors
more and more noisy, we should pay more attention to the
separately. TheysimplymodeltheerrorsasLaplaciannoise
noisy tags and less on missing tags.
[24] or Gaussian noise [22]. To our knowledge, our model
We measure all the methods in terms of average preci-
is the first to model the missing errors and inaccurate er-
sion@N (AP@N) and average recall@N (AR@N). In the
rors separately. The model can further adapt to different
top N completed tags, precision@N isto measuretheratio
datasets according totheir noise levels.
ofcorrecttagsandrecall@N istomeasuretheratioofmiss-
3. FINALMODEL
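Below is a compact sketch of the alternating scheme for the final objective. For brevity we replace LEML's conjugate-gradient subproblem solves with plain gradient steps on P and Q, so this illustrates the objective and its gradients rather than the solver actually used; all hyperparameter values are illustrative.

```python
import numpy as np

def smc_refine(O, V, Tf, unannotated, Lv, Ls, r=40, mu=0.5,
               lam1=0.1, lam2=0.01, lr=1e-4, iters=200, seed=0):
    """Alternating updates for the final model (sketch).
    O: Ni x Nt completed tag matrix, V: Ni x f_i image features,
    Tf: Nt x f_t tag features, unannotated: boolean Ni x Nt mask of Omega,
    Lv: Ni x Ni and Ls: Nt x Nt graph Laplacians."""
    rng = np.random.default_rng(seed)
    P = 0.01 * rng.standard_normal((V.shape[1], r))
    Q = 0.01 * rng.standard_normal((Tf.shape[1], r))
    W = 1.0 - mu * unannotated                 # per-entry weights: 1 or (1 - mu)

    def grad_O_hat():
        """Gradient of the weighted fit plus correlation terms w.r.t. O_hat."""
        O_hat = V @ P @ Q.T @ Tf.T
        return (-2 * W * (O - O_hat)
                + 2 * lam2 * (Lv @ O_hat)      # visual-correlation term
                + 2 * lam2 * (O_hat @ Ls))     # semantic-correlation term

    for _ in range(iters):
        P -= lr * (V.T @ grad_O_hat() @ Tf @ Q + lam1 * P)    # update P with Q fixed
        Q -= lr * (Tf.T @ grad_O_hat().T @ V @ P + lam1 * Q)  # update Q with P fixed
    return V @ P @ Q.T @ Tf.T                  # refined tag matrix O_hat
```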
4. EXPERIMENTAL EVALUATION

4.1 Datasets and Experimental Setup

We evaluate our proposed SMC framework on two benchmark datasets: Labelme [21] and MIRFlickr-25K [12]. Table 1 shows the detailed statistics. These two datasets, especially MIRFlickr-25K, are rather noisy, with a number of the tags being misspelled or meaningless. Hence, a preprocessing procedure is performed: we match each tag with entries in the Wikipedia thesaurus and only retain the tags in accordance with Wikipedia.

Table 1: Statistics of the 2 Datasets

Statistics                     Labelme     MIRFlickr-25K
No. of images                  2,900       25,000
Vocabulary size                495         1,386
Tags per image (mean/max)      10.5/48     12.7/76
Images per tag (mean/max)      67.1/379    416.5/76,890

We compare our method with the state-of-the-art methods, including matrix completion based models (i.e. LRES [24], TCMR [8], RKML [9]), search-based models (i.e. JEC [18], TagProp [11], and TagRelevance [15]), mixture models (i.e. CMRM [14] and MBRM [7]) and a co-regularized learning model (FastTag [2]). The tag refinement procedure by itself, denoted as SMC_IMC, is also compared to verify the benefit from the tag completion procedure. We tune the parameters on the validation sets of the two datasets separately for every method in comparison. Note that the tuned weighting parameter µ changes from 0.4 (Labelme) to 0.7 (MIRFlickr-25K), confirming that as the data become more and more noisy, we should pay more attention to the noisy tags and less to the missing tags.

We measure all the methods in terms of average precision@N (AP@N) and average recall@N (AR@N). Within the top N completed tags, precision@N measures the ratio of correct tags and recall@N measures the ratio of missing ground-truth tags that are recovered, both averaged over all test images.
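One common reading of these metrics, assuming the refined matrix is used to rank tags per image (ties and images with empty ground truth are handled in an ad hoc way here):

```python
import numpy as np

def ap_ar_at_n(O_hat, ground_truth, N=5):
    """Average precision@N and recall@N over test images.
    O_hat: refined score matrix (Ni x Nt); ground_truth: binary Ni x Nt."""
    precisions, recalls = [], []
    for scores, gt in zip(O_hat, ground_truth):
        top = np.argsort(scores)[::-1][:N]        # indices of the top-N tags
        hits = gt[top].sum()
        precisions.append(hits / N)
        recalls.append(hits / max(gt.sum(), 1))   # fraction of ground-truth tags recovered
    return np.mean(precisions), np.mean(recalls)
```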
4.2 Evaluation and Observation

Table 2 and Table 3 show the performance comparisons on the two datasets, respectively.

Table 2: Performance Comparison on Labelme

               N=2           N=5           N=10
Method         AP    AR      AP    AR      AP    AR
SMC            0.51  0.36    0.46  0.50    0.35  0.62
SMC_IMC        0.47  0.34    0.40  0.48    0.31  0.59
LRES [24]      0.42  0.32    0.35  0.45    0.27  0.56
TCMR [8]       0.44  0.32    0.37  0.45    0.29  0.55
RKML [9]       0.21  0.14    0.19  0.20    0.14  0.22
JEC [18]       0.33  0.29    0.27  0.38    0.20  0.48
TagProp [11]   0.39  0.31    0.33  0.45    0.25  0.56
TagRel [15]    0.43  0.32    0.34  0.45    0.27  0.55
CMRM [14]      0.20  0.14    0.18  0.19    0.12  0.22
MBRM [7]       0.23  0.14    0.18  0.20    0.12  0.27
FastTag [2]    0.43  0.34    0.37  0.44    0.28  0.57

Table 3: Performance Comparison on MIRFlickr-25K

               N=2           N=5           N=10
Method         AP    AR      AP    AR      AP    AR
SMC            0.53  0.39    0.40  0.47    0.33  0.61
SMC_IMC        0.45  0.34    0.36  0.43    0.30  0.52
LRES [24]      0.43  0.35    0.32  0.40    0.26  0.45
TCMR [8]       0.45  0.35    0.35  0.41    0.28  0.48
RKML [9]       0.21  0.15    0.13  0.23    0.13  0.22
JEC [18]       0.33  0.30    0.25  0.34    0.19  0.35
TagProp [11]   0.39  0.35    0.28  0.37    0.20  0.41
TagRel [15]    0.42  0.34    0.30  0.37    0.20  0.40
CMRM [14]      0.20  0.15    0.13  0.18    0.11  0.20
MBRM [7]       0.22  0.16    0.13  0.18    0.10  0.22
FastTag [2]    0.43  0.35    0.30  0.41    0.27  0.42

We can observe that: 1) Generally, methods achieve better performance on Labelme, since the tags in MIRFlickr-25K are noisier. 2) Methods based on matrix completion, such as SMC, LRES and TCMR, usually achieve the best performances. 3) Our SMC framework shows an increasing advantage over LRES as the data become more and more noisy, justifying our assumption about and modeling of the noise. 4) SMC outperforms all the other algorithms in nearly all cases. 5) The performance comparison between SMC and SMC_IMC demonstrates the remarkable benefit of tag completion for tag refinement. 6) The performance on MIRFlickr-25K in some sense provides evidence for the robustness of SMC.
5. CONCLUSION

In this work we present an effective framework for image annotation which performs tag completion and tag refinement sequentially. Our method first clusters images using sparse subspace clustering and shares tags using a neighbor voting algorithm, and then refines tags by adapting inductive matrix completion while novelly utilizing visual and semantic information. Experiments show the effectiveness of our framework and suggest that tag refinement can benefit a lot from performing tag completion first.

Acknowledgments

Zhouchen Lin is supported by the National Basic Research Program of China (973 Program) (grant no. 2015CB352502), the National Natural Science Foundation (NSF) of China (grant nos. 61272341 and 61231002), and the Microsoft Research Asia Collaborative Research Program.
6. REFERENCES

[1] G. Carneiro, A. Chan, P. Moreno, and N. Vasconcelos. Supervised learning of semantic classes for image annotation and retrieval. TPAMI, 2007.
[2] M. Chen, A. Zheng, and K. Weinberger. Fast image tagging. In ICML, 2013.
[3] F. Chung. Spectral graph theory. 1997.
[4] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
[5] E. Elhamifar and R. Vidal. Sparse subspace clustering. In CVPR, 2009.
[6] E. Elhamifar and R. Vidal. Clustering disjoint subspaces via sparse representation. In ICASSP, 2010.
[7] S. Feng, R. Manmatha, and V. Lavrenko. Multiple Bernoulli relevance models for image and video annotation. In CVPR, 2004.
[8] Z. Feng, S. Feng, R. Jin, and A. Jain. Image tag completion by noisy matrix recovery. In ECCV, 2014.
[9] Z. Feng, R. Jin, and A. Jain. Large-scale image annotation by efficient and robust kernel metric learning. In ICCV, 2013.
[10] A. Goldberg, B. Recht, J. Xu, R. Nowak, and X. Zhu. Transduction with matrix completion: Three birds with one stone. In NIPS, 2010.
[11] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In ICCV, 2009.
[12] M. Huiskes and M. Lew. The MIR Flickr retrieval evaluation. In ICMIR, 2008.
[13] P. Jain and I. Dhillon. Provable inductive matrix completion. arXiv preprint arXiv:1306.0626, 2013.
[14] J. Jeon, V. Lavrenko, and R. Manmatha. Automatic image annotation and retrieval using cross-media relevance models. In SIGIR, 2003.
[15] X. Li, C. Snoek, and M. Worring. Learning social tag relevance by neighbor voting. TMM, 2009.
[16] X. Li, T. Uricchio, L. Ballan, M. Bertini, C. G. Snoek, and A. Del Bimbo. Socializing the semantic gap: A comparative survey on image tag assignment, refinement and retrieval. arXiv preprint arXiv:1503.08248, 2015.
[17] Z. Lin, R. Liu, and Z. Su. Linearized alternating direction method with adaptive penalty for low-rank representation. In NIPS, 2011.
[18] A. Makadia, V. Pavlovic, and S. Kumar. A new baseline for image annotation. In ECCV, 2008.
[19] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[20] N. Natarajan and I. Dhillon. Inductive matrix completion for predicting gene–disease associations. Bioinformatics, 2014.
[21] B. Russell, A. Torralba, K. Murphy, and W. Freeman. LabelMe: a database and web-based tool for image annotation. IJCV, 2008.
[22] L. Wu, R. Jin, and A. Jain. Tag completion for image retrieval. TPAMI, 2013.
[23] H. Yu, P. Jain, P. Kar, and I. Dhillon. Large-scale multi-label learning with missing labels. In ICML, 2014.
[24] G. Zhu, S. Yan, and Y. Ma. Image tag refinement towards low-rank, content-tag prior and error sparsity. In ACM MM, 2010.