Noname manuscript No.
(will be inserted by the editor)

Supervised multiview learning based on simultaneous learning of multiview intact and single view classifier

Qingjun Wang · Haiyan Lv · Jun Yue · Eugene Mitchell

Received: date / Accepted: date

arXiv:1601.02098v1 [cs.CV] 9 Jan 2016

Abstract The multiview learning problem refers to the problem of learning a classifier from multiple-view data. In such a data set, each data point is represented by multiple different views. In this paper, we propose a novel method for this problem. This method is based on two assumptions. The first assumption is that each data point has an intact feature vector, and each view is obtained by a linear transformation from the intact vector. The second assumption is that the intact vectors are discriminative, and in the intact space, we have a linear classifier to separate the positive class from the negative class. We define an intact vector for each data point, and a view-conditional transformation matrix for each view, and propose to reconstruct the multiple view feature vectors by the product of the corresponding intact vectors and transformation matrices. Moreover, we also propose a linear classifier in the intact space, and learn it jointly with the intact vectors. The learning problem is modeled by a minimization problem, and the objective function is composed of a Cauchy error estimator-based view-conditional reconstruction term over all data points and views, and a classification error term measured by hinge loss over all the intact vectors of all the data points. Some regularization terms are also imposed on different variables in the objective function. The minimization problem is solved by an iterative algorithm using an alternate optimization strategy and a gradient descent algorithm. The proposed algorithm shows its advantage in comparison with other multiview learning algorithms on benchmark data sets.

Qingjun Wang, Jun Yue
School of Information and Electrical Engineering, Ludong University, Yantai 264025, China
E-mail: qjwang386@hotmail.com

Haiyan Lv
Naval Aeronautical and Astronautical University, Yantai 264025, China

Eugene Mitchell
Department of Computer Science, Ryerson University, Toronto, ON M5B 2K3, Canada
E-mail: emitchell328496@outlook.com

Keywords Multiview learning · Supervised learning · Intact space · Hinge loss

1 Introduction

1.1 Background

Multiview learning has been an important topic in the machine learning community [51,42,53,31,54,6,27,28,40,47]. In traditional machine learning problems, we usually assume that a data point has a single feature vector to represent its input information. For example, in an image recognition problem, we can extract a visual feature vector from an image using a texture descriptor [36,34,33,52,32,15,49,20]. In this scenario, the texture is one view of the image. However, there could be more than one view of an image. Besides the texture view, we can also extract feature vectors from other views, including shape and color. Another example is the problem of classification of scientific articles, where we may extract a feature vector from the main text of the article [30,17,37,25,22,5,12,24]. However, the main text is just one view of the article, and we can also have features from other views, such as the abstract, the reference list, etc. Multiview learning argues that we should learn from more than one view to represent the data and construct a predictor. The motivation for multiview learning is that a single-view data representation is usually incomplete, and different views can provide complementary information for the learning problem.
In the problem of multiview learning, the input of a data point is not just one single feature vector of one single view, but multiple feature vectors representing different views. The target of multiview learning is to learn a predictor that takes multiple view feature vectors to predict one single output of a data point. The problem of multiview learning can be classified into two types, supervised multiview learning and unsupervised multiview learning.

– Supervised multiview learning refers to the problem of learning from a data set where both the multiview input and the output are available for each data point [26,21,16]. In this problem, the output is usually a class label or a continuous response. In this case, the learning problem is to build a predictive model from the training data set to predict the output of an input data point, with the help of the input-output pairs of the training set.
– Unsupervised multiview learning refers to the problem of clustering a set of data points, where the multiview inputs of each data point are given [13,41,57]. In this problem, the outputs of the data points are not available.

In this paper, we investigate the problem of supervised multiview learning, and propose a novel algorithm to solve it. The proposed method is based on the assumption that different views of a data point are generated from one single intact feature vector, and that the view generation is performed by a linear transformation. We try to recover the intact feature vector for each data point from its multiview feature vectors, with the guidance of its corresponding output, i.e., its binary class label.

1.2 Relevant works

There are some existing multiview learning methods. We summarize the state-of-the-art ones as follows.

– Zhang et al. [56] proposed to use a local learning (LL) method for the multiview learning problem, and designed a local predictive model for each data point based on the multiview inputs. The local predictive model is learned on the nearest neighbors of a data point.
– Sindhwani et al. [39] proposed to use a co-training algorithm (CT) for multiview learning problems to improve the classification performance of each view. This method is based on multiview regularization, and on the agreement and smoothness over both labeled and unlabeled data points.
– Quadrianto [38] proposed a multiview learning algorithm to solve the problem of view disagreement (VD), i.e., different views of one single data point do not belong to the same class. This method uses a conditional entropy criterion to find the disagreement among different views, and removes the data points with view disagreement from the training set.
– Zhai [55] proposed a multiview metric learning method with global consistency and local smoothness (GL) for the multiview learning problem with a partially labeled data set. This method simultaneously considers both the global consistency and the local smoothness, by assuming that the different views have a shared latent feature space, and imposing global consistency and local structure on the learning procedure.
– Chen et al. [3] proposed a statistical subspace multiview representation method (SS), by leveraging both multiview dependencies and supervision information. This method is based on a multiview latent subspace Markov network, and assumes that the multiple views and the class labels are conditionally independent. The algorithm is based on the maximization of the data likelihood, and the minimization of the classification error.

1.3 Contributions

In this paper, we propose a novel supervised multiview learning method. This method is based on the assumption of a single discriminative intact vector underlying the different views of the input. Under this assumption, although there are different views of one single data point, one single intact feature vector exists for the data point. This intact feature vector is assumed to be discriminative, i.e., it can represent the class information of each data point.
Moreover, the feature vector of each view of a data point can be obtained from the intact vector by applying a linear view-conditional transformation to the intact feature vector. In this way, if we learn the discriminative intact feature vector for each training data point, we can learn a classifier in the intact space with the help of the class labels of the training data points. To this end, we propose a novel method to learn the hidden intact feature vectors, the view-conditional transformation matrices, and the classifier in the intact space simultaneously.

We define an intact feature vector for each data point, and a transformation matrix for each view. The feature vector of one view of each data point can be reconstructed as the product of its corresponding transformation matrix and intact feature vector. The reconstruction error for each view of each data point is measured by the Cauchy error estimator [18,14]. To learn the optimal intact feature vectors and view-conditional transformation matrices, we propose to minimize the Cauchy errors. Moreover, due to the assumption that the intact feature vectors are discriminative, we also argue that we can design a classifier in the intact space, and that this classifier can minimize the classification error. Thus we also propose to learn a linear classifier in the intact space, and use the hinge loss to measure the classification error of the training set in the intact space [4,1]. To learn the optimal classifier parameter and the intact feature vectors, we also propose to minimize the hinge loss with regard to both the classifier parameter and the intact feature vectors.

To model the problem, we propose a joint optimization problem for learning the intact vectors, the view-conditional transformation matrices, and the classifier parameter vector. The objective function of this problem is composed of two error terms and three regularization terms. The first error term is the view reconstruction term measured by the Cauchy estimator over all the data points and views.
The second error term is the classification error over all the intact feature vectors of all training data points, measured by hinge losses. The three regularization terms are all squared $\ell_2$ norm terms over the intact feature vectors, the view-conditional matrices, and the classifier parameter vector. The purpose of imposing the squared $\ell_2$ norm on these variables is to reduce the complexity of the learned outputs. To minimize the proposed objective function, we adopt an alternate optimization strategy, i.e., when the objective function is minimized with regard to one variable, the other variables are fixed. The optimization with regard to each variable is conducted by using the gradient descent algorithm.

The contributions of this paper are of three parts:

1. We propose a novel supervised multiview learning framework by simultaneous learning of intact feature vectors, view-conditional transformation matrices, and the classifier parameter vector.
2. We build a novel optimization problem for this learning problem, by considering both the view reconstruction problem and the classifier learning problem.
3. We develop an iterative algorithm to solve this optimization problem based on the alternate optimization strategy and the gradient descent algorithm.

1.4 Paper organization

This paper is organized as follows: In section 2, the proposed method for supervised multiview learning is introduced. In this section, we first model this problem as a minimization problem of an objective function, and then solve it with an iterative algorithm. In section 3, the proposed iterative algorithm is evaluated. We first give an analysis of its sensitivity to parameters, then compare it to some state-of-the-art algorithms, and finally test the running time performance of the proposed algorithm. In section 4, we give the conclusion of this paper.

2 Methods

In this section, we introduce the proposed supervised multiview learning method.
2.1 Problem modeling

We assume we are dealing with a supervised binary classification problem with multiview data. A training data set of $n$ data points is given, $X = \{\theta_1, \cdots, \theta_n\}$. $\theta_i = (x_i^1, \cdots, x_i^m, y_i)$ is the $i$-th data point. The information of each data point is composed of the feature vectors of $m$ views and a binary class label $y_i$. $x_i^j \in \mathbb{R}^{d_j}$ is the $d_j$-dimensional feature vector of the $j$-th view of the $i$-th data point, and $y_i \in \{+1, -1\}$ is the binary class label of the $i$-th data point. The problem of supervised multiview learning is to learn a predictive model from the training set, which can predict a binary class label from the multiview input of a test data point. We assume there is an intact vector $z_i \in \mathbb{R}^d$ for the $i$-th data point, and its $j$-th view $x_i^j$ can be reconstructed by a linear transformation,

$$x_i^j \leftarrow W_j z_i, \tag{1}$$

where $W_j \in \mathbb{R}^{d_j \times d}$ is the view-conditional linear transformation matrix for the $j$-th view. Please note that the view-conditional transformation matrix for the same view is shared by all the data points. By learning both $W_j$ and $z_i$, we can recover the hidden intact vector for the $i$-th data point, $z_i$, and use it for the classification problem. To this end, we propose to minimize the reconstruction error. The reconstruction error is measured by the Cauchy error estimator, $E(x_i^j, W_j z_i)$,

$$E(x_i^j, W_j z_i) = \log\left(1 + \frac{\left\|x_i^j - W_j z_i\right\|_2^2}{c^2}\right), \tag{2}$$

where $c$ is the constant scale parameter of the estimator. This error estimator has been shown to be robust to outliers, and it also provides an offset. We propose to minimize this error estimator over all data points and all views with regard to both $z_i, i = 1, \cdots, n$, and $W_j, j = 1, \cdots, m$,

$$\min_{z_i|_{i=1}^n,\, W_j|_{j=1}^m} \; \sum_{i=1}^n \sum_{j=1}^m E(x_i^j, W_j z_i) = \sum_{i=1}^n \sum_{j=1}^m \log\left(1 + \frac{\left\|x_i^j - W_j z_i\right\|_2^2}{c^2}\right). \tag{3}$$
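The reconstruction term in (2) and (3) can be illustrated with a minimal NumPy sketch. The function names `cauchy_error` and `reconstruction_objective`, the toy dimensions, the random data, and the choice $c = 1$ are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def cauchy_error(x, W, z, c=1.0):
    """Cauchy error of Eq. (2): log(1 + ||x - W z||_2^2 / c^2)."""
    residual = x - W @ z
    return np.log1p(residual @ residual / c**2)


def reconstruction_objective(views, Ws, Z, c=1.0):
    """Sum of Cauchy errors over all data points and views, as in Eq. (3).

    views[j] has shape (n, d_j), Ws[j] has shape (d_j, d), Z has shape (n, d).
    """
    total = 0.0
    for X_j, W_j in zip(views, Ws):
        residuals = X_j - Z @ W_j.T               # row i is x_i^j - W_j z_i
        sq_norms = np.sum(residuals**2, axis=1)
        total += np.sum(np.log1p(sq_norms / c**2))
    return total


# Hypothetical toy data: n = 4 points, intact dimension d = 3,
# two views of dimensions d_1 = 5 and d_2 = 2.
rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 3))
Ws = [rng.normal(size=(5, 3)), rng.normal(size=(2, 3))]
views = [Z @ W.T for W in Ws]                     # noise-free views

clean = reconstruction_objective(views, Ws, Z)

# Corrupt a single entry of the first view with a gross outlier.
views[0][0, 0] += 100.0
corrupted = reconstruction_objective(views, Ws, Z)

print(clean, corrupted)
```

In this sketch a single residual of size 100 adds roughly $\log(1 + 10^4)$ to the objective, whereas a squared-error term would add $10^4$. This logarithmic growth is the sense in which the Cauchy estimator limits the influence of outlying views.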
Moreover, we also assume that the intact feature vectors of the data points are discriminative and represent the class information, so that the intact feature vectors can minimize a classification loss function over the data set. We propose to learn the intact feature vector of the $i$-th data point by jointly learning a linear classifier to predict its class label, $y_i$. The classifier is designed as a linear function,

$$y_i \leftarrow \omega^\top z_i. \tag{4}$$

The usage of a linear function as the classifier is motivated by the work of Fan and Tang [8]. Fan and Tang [8] proposed to use a linear classifier to maximize the area under the ROC curve (AUC) for the problems of imbalanced learning and cost-sensitive learning. They found that a linear classifier used to maximize AUC searches for an optimal solution in a very constrained space, and enhanced the maximum AUC linear classifier by extending its search in the solution space and improving the way the structure of the classifier is used. Since the linear classifier has been shown to be effective in optimizing AUC by Fan and Tang [8], it inspires us to use it to learn an effective classifier in the intact vector space. The classification error can be measured by the hinge loss function,

$$L(y_i, \omega^\top z_i) = \max(0, 1 - y_i \omega^\top z_i). \tag{5}$$

The optimization of this loss function yields a large-margin classifier. To learn the optimal classifier and the discriminative intact feature vectors, we propose to minimize the classification loss measured by the hinge loss function over all the training data points,

$$\min_{z_i|_{i=1}^n,\, \omega} \left\{ \sum_{i=1}^n L(y_i, \omega^\top z_i) = \sum_{i=1}^n \max(0, 1 - y_i \omega^\top z_i) \right\}. \tag{6}$$

Moreover, to prevent the over-fitting of the variables, we propose to minimize the squared $\ell_2$ norms of the variables to regularize the learning of $z_i$, $W_j$, and $\omega$,

$$\min_{z_i|_{i=1}^n,\, W_j|_{j=1}^m,\, \omega} \left\{ R(z_i|_{i=1}^n, W_j|_{j=1}^m, \omega) = \sum_{i=1}^n \|z_i\|_2^2 + \sum_{j=1}^m \|W_j\|_2^2 + \|\omega\|_2^2 \right\}. \tag{7}$$

Our overall learning problem is obtained by considering both the problem of view-conditional reconstruction in (3) and the problem of classifier learning in the intact space in (6), together with the regularization in (7),

$$\begin{aligned}
\min_{z_i|_{i=1}^n,\, W_j|_{j=1}^m,\, \omega} \; & \sum_{i=1}^n \sum_{j=1}^m E(x_i^j, W_j z_i) + \alpha \sum_{i=1}^n L(y_i, \omega^\top z_i) + \gamma R(z_i|_{i=1}^n, W_j|_{j=1}^m, \omega) \\
= \; & \sum_{i=1}^n \sum_{j=1}^m \log\left(1 + \frac{\left\|x_i^j - W_j z_i\right\|_2^2}{c^2}\right) \\
& + \alpha \sum_{i=1}^n \max(0, 1 - y_i \omega^\top z_i) \\
& + \gamma \left( \sum_{i=1}^n \|z_i\|_2^2 + \sum_{j=1}^m \|W_j\|_2^2 + \|\omega\|_2^2 \right),
\end{aligned} \tag{8}$$

where $\alpha$ is a tradeoff parameter to balance the view-conditional reconstruction terms and the classification error terms, and $\gamma$ is a tradeoff parameter to balance the view-conditional reconstruction terms and the regularization terms. By optimizing this problem, we can learn intact feature vectors which can represent the multiview inputs of the data points and are also discriminative.

2.2 Optimization

To solve the optimization problem in (8), we propose to use the alternate optimization strategy. The optimization is conducted in an iterative algorithm. When one variable is considered, the others are fixed. After one variable is updated, it is kept fixed while the other variables are updated. In the following subsections, we discuss how to update each variable.

2.2.1 Updating $z_i$

When we want to update $z_i$, we only consider this single variable, while fixing all other variables. Thus we have the following optimization problem,

$$\min_{z_i} \left\{ \sum_{j=1}^m \log\left(1 + \frac{\left\|x_i^j - W_j z_i\right\|_2^2}{c^2}\right) + \alpha \max(0, 1 - y_i \omega^\top z_i) + \gamma \|z_i\|_2^2 \right\}. \tag{9}$$

The second term, $\max(0, 1 - y_i \omega^\top z_i)$, is not differentiable at $1 - y_i \omega^\top z_i = 0$, and it is hard to optimize it directly by gradient descent. Thus we rewrite it as follows,

$$\max(0, 1 - y_i \omega^\top z_i) = \begin{cases} 1 - y_i \omega^\top z_i, & \text{if } 1 - y_i \omega^\top z_i > 0, \\ 0, & \text{otherwise.} \end{cases} \tag{10}$$
We define an indicator variable, $\beta_i$, to indicate which of the above cases is true,

$$\beta_i = \begin{cases} 1, & \text{if } 1 - y_i \omega^\top z_i > 0, \\ 0, & \text{otherwise,} \end{cases} \tag{11}$$

and rewrite (10) as follows,

$$\max(0, 1 - y_i \omega^\top z_i) = \beta_i \left(1 - y_i \omega^\top z_i\right). \tag{12}$$

Please note that $\beta_i$ is also a function of $z_i$. However, we first update it by using the $z_i$ solved in the previous iteration, and then fix it to update $z_i$ in the current iteration. In this way, (9) is rewritten as

$$\min_{z_i} \left\{ \sum_{j=1}^m \log\left(1 + \frac{\left\|x_i^j - W_j z_i\right\|_2^2}{c^2}\right) + \alpha \beta_i \left(1 - y_i \omega^\top z_i\right) + \gamma \|z_i\|_2^2 = g(z_i) \right\}, \tag{13}$$

where $g(z_i)$ is the objective of this optimization problem. To minimize $g(z_i)$, we use the gradient descent algorithm. This algorithm updates $z_i$ by moving it against the direction of the gradient of $g(z_i)$,

$$z_i \leftarrow z_i - \mu \nabla g(z_i), \tag{14}$$

where $\mu$ is the descent step, and $\nabla g(z_i)$ is the gradient function of $g(z_i)$. We set $\nabla g(z_i)$ as the partial derivative of $g(z_i)$ with regard to $z_i$,

$$\begin{aligned}
\nabla g(z_i) = \frac{\partial g(z_i)}{\partial z_i} &= -\sum_{j=1}^m \frac{\dfrac{2 W_j^\top (x_i^j - W_j z_i)}{c^2}}{1 + \dfrac{\left\|x_i^j - W_j z_i\right\|_2^2}{c^2}} - \alpha \beta_i y_i \omega + 2\gamma z_i \\
&= -\sum_{j=1}^m \frac{2 W_j^\top (x_i^j - W_j z_i)}{c^2 + \left\|x_i^j - W_j z_i\right\|_2^2} - \alpha \beta_i y_i \omega + 2\gamma z_i.
\end{aligned} \tag{15}$$

By substituting (15) into (14), we have the final updating rule of $z_i$,

$$z_i \leftarrow z_i - \mu \left[ -\sum_{j=1}^m \frac{2 W_j^\top (x_i^j - W_j z_i)}{c^2 + \left\|x_i^j - W_j z_i\right\|_2^2} - \alpha \beta_i y_i \omega + 2\gamma z_i \right]. \tag{16}$$

2.2.2 Updating $W_j$

When we want to optimize $W_j$, we fix all other variables and only consider $W_j$ itself. The optimization problem is changed to the following,

$$\min_{W_j} \left\{ \sum_{i=1}^n \log\left(1 + \frac{\left\|x_i^j - W_j z_i\right\|_2^2}{c^2}\right) + \gamma \|W_j\|_2^2 = f(W_j) \right\}, \tag{17}$$

where $f(W_j)$ is the objective function of this problem.
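As an illustration of the $z_i$ update described in Section 2.2.1, the following NumPy sketch performs gradient steps on $g(z_i)$ from (13) for a single data point. The gradient is obtained by differentiating (13) directly, so the Cauchy term enters with a negative sign and the regularizer contributes $2\gamma z_i$. The function names `grad_g`, `update_z`, and `g`, the toy data, the fixed step size, and $c = 1$ are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def grad_g(z, xs, Ws, y, omega, alpha, gamma, c=1.0):
    """Gradient of g(z_i) in Eq. (13), with beta_i fixed as in Eq. (11)."""
    # beta_i is evaluated at the current z and then treated as a constant.
    beta = 1.0 if 1.0 - y * (omega @ z) > 0 else 0.0
    grad = -alpha * beta * y * omega + 2.0 * gamma * z
    for x_j, W_j in zip(xs, Ws):
        r = x_j - W_j @ z                          # reconstruction residual
        grad -= 2.0 * W_j.T @ r / (c**2 + r @ r)   # derivative of the Cauchy term
    return grad


def update_z(z, xs, Ws, y, omega, alpha, gamma, mu, c=1.0):
    """One gradient-descent step, z <- z - mu * grad g(z), as in Eq. (14)."""
    return z - mu * grad_g(z, xs, Ws, y, omega, alpha, gamma, c)


def g(z, xs, Ws, y, omega, alpha, gamma, c=1.0):
    """Per-point objective of Eq. (9), used here only to monitor progress."""
    recon = sum(np.log1p(np.sum((x_j - W_j @ z) ** 2) / c**2)
                for x_j, W_j in zip(xs, Ws))
    hinge = max(0.0, 1.0 - y * (omega @ z))
    return recon + alpha * hinge + gamma * (z @ z)


# Hypothetical toy problem: d = 3, two views with d_1 = 4 and d_2 = 2.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 3))]
z_true = rng.normal(size=3)
xs = [W @ z_true for W in Ws]
omega, y = rng.normal(size=3), 1.0

z = np.zeros(3)
before = g(z, xs, Ws, y, omega, alpha=1.0, gamma=0.01)
for _ in range(500):
    z = update_z(z, xs, Ws, y, omega, alpha=1.0, gamma=0.01, mu=0.01)
after = g(z, xs, Ws, y, omega, alpha=1.0, gamma=0.01)

print(before, after)
```

Recomputing $\beta_i$ at the start of every step mirrors the strategy of evaluating the indicator at the previous iterate and then holding it fixed during the update. A constant step is used here only to keep the sketch short.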
To solve this problem, j we also update W by using the gradient descent algorithm, j W ←W −µ∇f(W ), (18) j j j where ∇f(W ) is the gradient function of f(W ), j j ∂f(W ) n 2(xji−Wjzi)z⊤i ∇f(W )= j = c2 +γW j ∂Wj i=1 1+ kxji−Wjzik22 j X c2 (cid:18) (cid:19) (19) n 2(xj −W z )z⊤ = i j i i +γW . j 2 Xi=1 c2+ xji −Wjzi 2 (cid:18) (cid:13) (cid:13) (cid:19) Substituting (19) to (18), we have(cid:13)the final up(cid:13)dating rule of W , (cid:13) (cid:13) j n 2(xj −W z )z⊤ Wj ←Wj −µ i j i i +γWj. (20) 2 Xi=1 c2+ xji −Wjzi   (cid:18) (cid:13) (cid:13)2(cid:19)  (cid:13) (cid:13) 2.2.3 Updating ω (cid:13) (cid:13) When we want to update ω to minimize the objective function of (21), we fix the other variables, and only consider ω. Thus the problem in (21) is transferred to n min α β 1−y ω⊤z +γkωk2 =h(ω) . (21) ω i i i 2 ( ) i=1 X (cid:0) (cid:1) Please note that β is actually a function of ω. However, similar the strategy i to solve z , we also update it according to ω solved in previous iteration, and i fix it to update ω in current iteration. When β ,i = 1,··· ,n are fixed, we i update ω to minimize h(ω) by using the gradient descent algorithm, ω ←ω−µ∇h(ω), (22) 10 QingjunWangetal. where ∇h(ω) is the gradient function of h(ω), and it is defined as follows, ∂h(ω) n ∇h(ω)= ∂ω =−α βiyizi+γω. (23) i=1 X By substituting it to (24), we have the final updating rule for ω, n ω ←ω−µ −α β y z +γω . (24) i i i ! i=1 X 2.3 Iterative algorithm Afterwehavetheupdatingrulesofallthevariables,wecandesignaniterative algorithmforthelearningproblem.ThisiterativealgorithmhasoneouterFOR loop, and two inner FOR loops. The outer FOR loop is corresponding to the main iterations. The two inner FOR loops are corresponding to the updating of n intact feature vectors of n data points, and the updating of m view- conditional transformation matrices. The algorithm is given in Algorithm 1. TheiterationnumberT isdeterminedbycross-validationinourexperiments. – Algorithm 1. 
Iterative algorithm for multiview intact and single-view classifier learning (MISC). – Input: Training data set, (x1,··· ,xm,y ),··· ,(x1,··· ,xm,y ). 1 1 1 n n n – Input: Tradeoff parameters, α and γ. – Input: Maximum iteration number, T. – Initialization: z0,i=1,··· ,n, W0,j =1,··· ,m and ω0. i j – For t=1,··· ,T – Update descent step, µt ← 1 t – For i=1,··· ,n Update βt as follows, i βt = 1,if 1−yiωt−1⊤zti−1 >0 (25) i 0, otherwise. (cid:26) Update zt by fixing Wt−1,j =1,··· ,m, βt−1 and ωt−1, i j i m 2Wt−1⊤(xj −Wt−1zt−1) zti ←zti−1−µt j i j i 2 −αβityiωt−1+γzti−1. j=1 c2+ xj −Wt−1zt−1 X i j i   (cid:18) (cid:13) (cid:13)2(cid:19) (26) (cid:13) (cid:13) – End of For (cid:13) (cid:13)

