Visualizing Large-scale and High-dimensional Data

Jian Tang (Microsoft Research Asia, [email protected]), Jingzhou Liu (Peking University, [email protected]), Ming Zhang (Peking University, [email protected]), Qiaozhu Mei (University of Michigan, [email protected])

This work was done while the second author was an intern at Microsoft Research Asia. WWW 2016, April 11-15, 2016, Montréal, Québec, Canada. http://dx.doi.org/10.1145/2872427.2883041 (arXiv:1602.00370v2 [cs.LG])

ABSTRACT

We study the problem of visualizing large-scale and high-dimensional data in a low-dimensional (typically 2D or 3D) space. Much success has been reported recently by techniques that first compute a similarity structure of the data points and then project them into a low-dimensional space with the structure preserved. These two steps suffer from considerable computational costs, preventing state-of-the-art methods such as the t-SNE from scaling to large-scale and high-dimensional data (e.g., millions of data points and hundreds of dimensions). We propose the LargeVis, a technique that first constructs an accurately approximated K-nearest neighbor graph from the data and then lays out the graph in the low-dimensional space. Compared to t-SNE, LargeVis significantly reduces the computational cost of the graph construction step and employs a principled probabilistic model for the visualization step, the objective of which can be effectively optimized through asynchronous stochastic gradient descent with a linear time complexity. The whole procedure thus easily scales to millions of high-dimensional data points. Experimental results on real-world data sets demonstrate that the LargeVis outperforms the state-of-the-art methods in both efficiency and effectiveness. The hyper-parameters of LargeVis are also much more stable over different data sets.

General Terms: Algorithms, Experimentation

Keywords: Visualization, big data, high-dimensional data

1. INTRODUCTION

We now live in the era of big data. Understanding and mining large-scale data sets have created big opportunities for business providers, data scientists, governments, educators, and healthcare practitioners. Many computational infrastructures, algorithms, and tools are being constructed for the users to manage, explore, and analyze their data sets. Information visualization has been playing a critical role in this pipeline, as it facilitates the description, exploration, and sense-making from both the original data and the analysis results [16]. Classical visualization techniques have proved effective for small or intermediate-size data; they however face a big challenge when applied to big data. For example, visualizations such as scatter plots, heat maps, and network diagrams all require laying out data points in a 2D or 3D space, which becomes computationally intractable when there are too many data points and when the data have many dimensions. Indeed, while there exist numerous network diagrams with thousands of nodes, a visualization of millions of nodes is rare, even though such a visualization would easily reveal node centrality and community structures. In general, the problem is concerned with finding an extremely low-dimensional (e.g., 2D or 3D) representation of large-scale and high-dimensional data, which has attracted a lot of attention recently in both the data mining community [25, 2, 27] and the infovis community [11, 14, 17]. Compared to the high-dimensional representations, the 2D or 3D layouts not only demonstrate the intrinsic structure of the data intuitively but can also be used as the basis to build many advanced and interactive visualizations.

Projecting high-dimensional data into spaces with fewer dimensions is a core problem of machine learning and data mining.
The essential idea is to preserve the intrinsic structure of the high-dimensional data, i.e., to keep similar data points close and dissimilar data points far apart, in the low-dimensional space. In the literature, many dimensionality reduction techniques have been proposed, including both linear mapping methods (e.g., Principal Component Analysis [15], multidimensional scaling [25]) and non-linear mapping methods (e.g., Isomap [24], Locally Linear Embedding [20], Laplacian Eigenmaps [2]). As most high-dimensional data usually lie on or near a low-dimensional non-linear manifold, the performance of linear mapping approaches is usually not satisfactory [27]. Non-linear methods such as Laplacian Eigenmaps, although empirically effective on small, laboratory data sets, generally do not perform well on high-dimensional, real data, as they are typically not able to preserve both the local and the global structures of the high-dimensional data. Maaten and Hinton proposed the t-SNE technique [27], which captures both the local and the global structures. Maaten further proposed an acceleration technique [26] for the t-SNE by first constructing a K-nearest neighbor (KNN) graph of the data points and then projecting the graph into low-dimensional spaces with tree-based algorithms.

Figure 1: A typical pipeline of data visualization by first constructing a K-nearest neighbor graph and then projecting the graph into a low-dimensional space.

T-SNE and its variants, which represent a family of methods that first construct a similarity structure from the data and then project the structure into a 2D/3D space (see Figure 1), have been widely adopted recently due to their ability to handle real-world data and the good quality of the visualizations.

Despite their successes, when applied to data with millions of points and hundreds of dimensions, the t-SNE techniques are still far from satisfactory. The reasons are three-fold: (1) the construction of the K-nearest neighbor graph is a computational bottleneck for dealing with large-scale and high-dimensional data.
T-SNE constructs the graph using the technique of vantage-point trees [28], the performance of which significantly deteriorates when the dimensionality of the data grows high; (2) the efficiency of the graph visualization step significantly deteriorates when the size of the data becomes large; (3) the parameters of the t-SNE are very sensitive to the data set. To generate a good visualization, one has to search for the optimal parameters exhaustively, which is very time consuming on large data sets. It remains an open challenge for the community to create high-quality visualizations that scale to both the size and the dimensionality of the data.

We report significant progress in this direction through the LargeVis, a new visualization technique that computes the layout of large-scale and high-dimensional data. The LargeVis employs a very efficient algorithm to construct an approximate K-nearest neighbor graph at a high accuracy, which builds on top of, but significantly improves, a state-of-the-art approach to KNN graph construction, the random projection trees [7]. We then propose a principled probabilistic approach to visualizing the K-nearest neighbor graph, which models both the observed links and the unobserved (i.e., negative) links in the graph. The model preserves the structure of the graph in the low-dimensional space, keeping similar data points close and dissimilar data points far away from each other. The corresponding objective function can be optimized through asynchronous stochastic gradient descent, which scales linearly to the data size N. Compared to the one used by the t-SNE, the optimization process of LargeVis is much more efficient and also more effective. Besides, the parameters of the LargeVis are much more stable across different data sets.

We conduct extensive experiments on real-world, large-scale and high-dimensional data sets, including text (words and documents), images, and networks. Experimental results show that our proposed algorithm for constructing the approximate K-nearest neighbor graph significantly outperforms the vantage-point tree algorithm used in the t-SNE and other state-of-the-art methods. LargeVis generates graph visualizations comparable to the t-SNE on small data sets and more intuitive visualizations on large data sets; it is much more efficient when the data becomes large; and its parameters are not sensitive to different data sets. On a set of three million data points with one hundred dimensions, LargeVis is up to thirty times faster at graph construction and seven times faster at graph visualization. LargeVis only takes a couple of hours to visualize millions of data points with hundreds of dimensions on a single machine.

To summarize, we make the following contributions:

• We propose a new visualization technique which is able to compute the layout of millions of data points with hundreds of dimensions efficiently.

• We propose a very efficient algorithm to construct an approximate K-nearest neighbor graph from large-scale, high-dimensional data.

• We propose a principled probabilistic model for graph visualization. The objective function of the model can be effectively optimized through asynchronous stochastic gradient descent with a time complexity of O(N).

• We conduct experiments on real-world, very large data sets and compare the performance of LargeVis and t-SNE, both quantitatively and visually.

2. RELATED WORK

To the best of our knowledge, very few visualization techniques can efficiently lay out millions of high-dimensional data points meaningfully in a 2D space. Instead, most visualizations of large data sets have to first lay out a summary or a coarse aggregation of the data and then refine a subset of the data (a region of the visualization) if the user zooms in [5]. Admittedly, there are other design factors besides computational capability; for example, the aggregated data may be more intuitive and more robust to noise. However, with a layout of the entire data set as basis, the effectiveness of these aggregated/approximate visualizations can only improve. Many visualization tools are designed to lay out geographical data, sensor data, and network data; these tools typically cannot handle high-dimensional data.

Many recent successes of visualizing high-dimensional data come from the machine learning community. Methods like the t-SNE first compute a K-nearest-neighbor graph and then visualize this graph in a 2D/3D space. Our work follows this direction and makes significant progress.
2.1 K-nearest Neighbor Graph Construction

Constructing K-nearest neighbor (KNN) graphs from high-dimensional data is critical to many applications such as similarity search, collaborative filtering, manifold learning, and network analysis. As the exact computation of a KNN graph has a complexity of O(N²d) (with N being the number of data points and d being the number of dimensions), which is too costly, existing approaches use roughly three categories of techniques: space-partitioning trees [3, 10, 21, 7], locality sensitive hashing techniques [8, 6, 12], and neighbor exploring techniques [9]. The space-partitioning methods divide the entire space into different regions and organize the regions into tree structures, e.g., k-d trees [3, 10], vp-trees [28], cover trees [4], and random projection trees [7]. Once the trees are constructed, the nearest neighbors of each data point can be found through traversing the trees. The locality sensitive hashing techniques [8] deploy multiple hashing functions to map the data points into different buckets, and data points in the same buckets are likely to be similar to each other. The neighbor exploring techniques, such as NN-Descent [9], are built on top of the intuition that "my neighbors' neighbors are likely to be my neighbors." Starting from an initial nearest-neighbor graph, the algorithm iteratively refines the graph by exploring the neighbors of neighbors defined according to the current graph.

The above approaches work efficiently on different types of data sets. The k-d trees, vp-trees, or cover trees have been proved to be very efficient on data with a small number of dimensions. However, their performance significantly deteriorates when the dimensionality of the data becomes large (e.g., hundreds).
The NN-Descent approach is also usually efficient for data sets with a small number of dimensions [9]. A comparison of these techniques can be found at https://github.com/erikbern/ann-benchmarks. The random projection trees have demonstrated state-of-the-art performance in constructing very accurate K-nearest neighbor graphs from high-dimensional data. However, the high accuracy comes at the expense of efficiency, as achieving a higher accuracy requires many more trees to be created. Our proposed technique is built upon random projection trees but significantly improves them using the idea of neighbor exploring: the accuracy of the KNN graph quickly improves to almost 100% without investing in many trees.

2.2 Graph Visualization

The problem of graph visualization is related to dimensionality reduction, which includes two major types of approaches: linear transformations and non-linear transformations. When projecting the data to extremely low-dimensional spaces (e.g., 2D), the linear methods such as Principal Component Analysis [15] and multidimensional scaling [25] usually do not work as effectively as the non-linear methods, as most high-dimensional data usually lie on or near low-dimensional non-linear manifolds. Non-linear methods such as Isomap [24], locally linear embedding (LLE) [20], and Laplacian Eigenmaps [2] are very effective on laboratory data sets but do not perform really well on real-world high-dimensional data. Maaten and Hinton proposed the t-SNE [27], which is very effective on real-world data. None of these methods scales to millions of data points. Maaten improved the efficiency of t-SNE through two tree-based algorithms [26], which scale better to large graphs. The optimization of the t-SNE requires full batch gradient descent learning, the time complexity of which w.r.t. the data size N is O(N log N). LargeVis can be naturally optimized through asynchronous stochastic gradient descent, with a complexity of O(N). Besides, the parameters of t-SNE are very sensitive on different data sets while the parameters of LargeVis remain very stable.

There are many algorithms developed in the information visualization community to compute the layout of the nodes in a network; they can also be used to visualize the KNN graph. The majority of these network layout methods use either the above-mentioned dimensionality reduction techniques or force-directed simulations. Force-directed layouts generate better visualizations, but their high computational complexity (ranging from O(N³) to O(N log² N), with N being the number of nodes [22]) has prevented them from being applied to millions of nodes. The classical Fruchterman-Reingold algorithm [11] and the original ForceAtlas algorithm provided in Gephi [1] have a complexity of O(N²). An improved version of ForceAtlas called ForceAtlas2 [14] and the newly developed OpenOrd algorithm [17] reduce the time complexity to O(N log N). These two algorithms have been used to visualize one million data points (http://sebastien.pro/gephi-esnam.pdf), but the complexity prevents them from scaling up further.

The LargeVis is also related to our previous work on network/graph embedding, the LINE model [23]. LINE and other related methods (e.g., Skipgram [18]) are not designed for visualization purposes. Using them directly to learn 2/3-dimensional representations of data may yield ineffective visualization results. However, these methods can be used as a preprocessor of the data for the visualization (e.g., use LINE or Skipgram to learn 100-dimensional representations of the data and then use LargeVis to visualize them).

3. LARGEVIS

In this section, we introduce the new visualization technique LargeVis. Formally, given a large-scale and high-dimensional data set X = {x_i ∈ R^d}_{i=1,2,...,N}, our goal is to represent each data point x_i with a low-dimensional vector y_i ∈ R^s, where s is typically 2 or 3. The basic idea of visualizing high-dimensional data is to preserve the intrinsic structure of the data in the low-dimensional space. Existing approaches usually first compute the similarities of all pairs {x_i, x_j} and then preserve the similarities in the low-dimensional transformation. As computing the pairwise similarities is too expensive (i.e., O(N²d)), recent approaches like the t-SNE construct a K-nearest neighbor graph instead and then project the graph into the 2D space. LargeVis follows this procedure, but uses a very efficient algorithm for K-nearest neighbor graph construction and a principled probabilistic model for graph visualization. Next, we introduce the two components respectively.

3.1 Efficient KNN Graph Construction

A K-nearest neighbor graph requires a metric of distance. We use the Euclidean distance ||x_i − x_j|| in the high-dimensional space, the same as the one used by t-SNE. Given a set of high-dimensional data points {x_i}_{i=1,...,N}, in which x_i ∈ R^d, constructing the exact KNN graph takes O(N²d) time — too costly. Various indexing techniques have been proposed to approximate the KNN graph (see Section 2).

Among these techniques, the random projection trees have been proved to be very efficient for nearest-neighbor search in high-dimensional data. The algorithm starts by partitioning the entire space and building up a tree. Specifically, for every non-leaf node of the tree, the algorithm selects a random hyperplane to split the subspace corresponding to that node into two, which become the children of that node. The hyperplane is selected by randomly sampling two points from the current subspace and then taking the hyperplane equally distant to the two points. This process continues until the number of points in the subspace reaches a threshold. Once a random projection tree is constructed, every data point can traverse the tree to find a corresponding leaf node; the points in the subspace of that leaf node are treated as the candidate nearest neighbors of the input data point. In practice multiple trees can be built to improve the accuracy of the nearest neighbors. Once the nearest neighbors of all the data points are found, the K-nearest neighbor graph is built.
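To make the splitting rule concrete, below is a minimal sketch of how a single random projection tree can be built and queried for neighbor candidates. It is an illustration rather than the authors' implementation; the function names, the dictionary-based node layout, and the leaf-size threshold are assumptions.

```python
import numpy as np

def build_rp_tree(X, indices, leaf_size=32, rng=None):
    # Recursively split `indices` with hyperplanes equidistant to two random points.
    rng = rng if rng is not None else np.random.default_rng()
    if len(indices) <= leaf_size:
        return {"leaf": indices}
    a, b = X[rng.choice(indices, size=2, replace=False)]
    normal = b - a                       # normal of the perpendicular-bisector hyperplane
    offset = normal @ (a + b) / 2.0      # the hyperplane passes through the midpoint of a and b
    right = X[indices] @ normal > offset
    left_idx, right_idx = indices[~right], indices[right]
    if len(left_idx) == 0 or len(right_idx) == 0:   # degenerate split: stop here
        return {"leaf": indices}
    return {"normal": normal, "offset": offset,
            "left": build_rp_tree(X, left_idx, leaf_size, rng),
            "right": build_rp_tree(X, right_idx, leaf_size, rng)}

def leaf_candidates(tree, x):
    # Route a query point to its leaf; the points in that leaf are neighbor candidates.
    while "leaf" not in tree:
        tree = tree["right"] if x @ tree["normal"] > tree["offset"] else tree["left"]
    return tree["leaf"]
```

Building several such trees and keeping, for each point, the K closest points among the union of its leaf candidates yields the initial approximate KNN graph that the neighbor-exploring step below then refines.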
However, constructing a very accurate KNN graph requires many trees to be built, which significantly hurts the efficiency. This dilemma has been a bottleneck of applying random projection trees to visualization. In this paper we propose a new solution: instead of building a large number of trees to obtain a highly accurate KNN graph, we use neighbor exploring techniques to improve the accuracy of a less accurate graph. The basic idea is that "a neighbor of my neighbor is also likely to be my neighbor" [9]. Specifically, we build a few random projection trees to construct an approximate K-nearest neighbor graph, the accuracy of which may not be high. Then, for each node of the graph, we search the neighbors of its neighbors, which are also likely to be candidates of its nearest neighbors. We may repeat this for multiple iterations to improve the accuracy of the graph. In practice, we find that only a few iterations are sufficient to improve the accuracy of the KNN graph to almost 100%.

Algorithm 1: Graph Construction
  Data: data points {x_i}_{i=1,...,N}, number of trees N_T, number of neighbors K, number of iterations Iter.
  Result: approximate K-nearest neighbor graph G.
  1. Build N_T random projection trees on {x_i}_{i=1,...,N}.
  2. Search nearest neighbors:
     for each node i in parallel do
         search the random projection trees for i's K nearest neighbors and store them in knn(i);
     end
  3. Neighbor exploring:
     while T < Iter do
         set old_knn() = knn() and clear knn();
         for each node i in parallel do
             create max-heap H_i;
             for j in old_knn(i) do
                 for l in old_knn(j) do
                     compute dist(i, l) = ||x_i − x_l||;
                     push l with dist(i, l) into H_i;
                     pop if H_i has more than K nodes;
                 end
             end
             put the nodes in H_i into knn(i);
         end
         T++;
     end
     for each node i and each j in knn(i) do
         add edge (i, j) into graph G;
     end
  4. Calculate the weights of the edges according to Eqn. (1) and (2).
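The neighbor-exploring loop in step 3 of Algorithm 1 is simple to express sequentially. The sketch below is a simplified, single-threaded illustration (plain sorted lists instead of per-node max-heaps, and the parallel loops dropped); `knn` is assumed to map each node id to its current list of candidate neighbor ids.

```python
import numpy as np

def neighbor_explore(X, knn, K, iters=1):
    # Repeatedly add "neighbors of neighbors" as candidates and keep the K closest.
    for _ in range(iters):
        old = {i: list(neighbors) for i, neighbors in knn.items()}
        for i in old:
            candidates = set(old[i])
            for j in old[i]:
                candidates.update(old[j])        # my neighbor's neighbors
            candidates.discard(i)
            knn[i] = sorted(candidates,
                            key=lambda l: np.linalg.norm(X[i] - X[l]))[:K]
    return knn
```

As reported in Section 4, one to three such iterations are typically enough to bring the graph accuracy close to 100%.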
For the weights of the edges in the K-nearest neighbor graph, we use the same approach as t-SNE. The conditional probability from data point x_i to x_j is first calculated as

p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{(i,k) \in E} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}, \qquad p_{i|i} = 0,    (1)

where the parameter σ_i is chosen by setting the perplexity of the conditional distribution p_{·|i} equal to a perplexity u. The graph is then symmetrized by setting the weight between x_i and x_j to

w_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}.    (2)

The complete procedure is summarized in Algorithm 1.
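The only non-obvious step in Eqns. (1)-(2) is choosing σ_i. As in t-SNE, this is a one-dimensional search per node: β_i = 1/(2σ_i²) is adjusted until the entropy of p_{·|i} equals log u for the target perplexity u. A minimal sketch for a single node's neighbor list; the bisection bounds, iteration cap, and tolerance are assumptions.

```python
import numpy as np

def conditional_weights(sq_dists, perplexity=50.0, tol=1e-5, max_iter=50):
    # sq_dists: squared Euclidean distances from node i to its K neighbors.
    # Returns p_{j|i} with sigma_i calibrated so that perplexity(p_{.|i}) ~= `perplexity`.
    target_entropy = np.log(perplexity)
    beta, beta_min, beta_max = 1.0, 0.0, np.inf        # beta = 1 / (2 * sigma_i^2)
    p = np.exp(-sq_dists * beta)
    for _ in range(max_iter):
        p = np.exp(-sq_dists * beta)
        p /= p.sum()
        entropy = -np.sum(p * np.log(p + 1e-12))
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:                   # too flat: sharpen (increase beta)
            beta_min = beta
            beta = beta * 2.0 if np.isinf(beta_max) else (beta + beta_max) / 2.0
        else:                                          # too peaked: flatten (decrease beta)
            beta_max = beta
            beta = (beta + beta_min) / 2.0
    return p
```

The symmetric edge weight of Eqn. (2) then follows directly as w_ij = (p_{j|i} + p_{i|j}) / (2N).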
3.2 A Probabilistic Model for Graph Visualization

Once the KNN graph is constructed, to visualize the data we just need to project the nodes of the graph into a 2D/3D space. We introduce a principled probabilistic model for this purpose. The idea is to preserve the similarities of the vertices in the low-dimensional space; in other words, we want to keep similar vertices close to each other and dissimilar vertices far apart in the low-dimensional space. Given a pair of vertices (v_i, v_j), we first define the probability of observing a binary edge e_ij = 1 between v_i and v_j as

P(e_{ij} = 1) = f(\|y_i - y_j\|),    (3)

where y_i is the embedding of vertex v_i in the low-dimensional space and f(·) is a probabilistic function of the distance between y_i and y_j, i.e., ||y_i − y_j||. When y_i is close to y_j in the low-dimensional space (i.e., ||y_i − y_j|| is small), there is a large probability of observing a binary edge between the two vertices. Many probabilistic functions can be used, such as f(x) = 1/(1 + ax²) or f(x) = 1/(1 + exp(x²)); we compare different probabilistic functions in Section 4.

Eqn. (3) only defines the probability of observing a binary edge between a pair of vertices. To further extend it to general weighted edges, we define the likelihood of observing a weighted edge e_ij = w_ij as

P(e_{ij} = w_{ij}) = P(e_{ij} = 1)^{w_{ij}}.    (4)

With the above definition, given a weighted graph G = (V, E), the likelihood of the graph can be calculated as

O = \prod_{(i,j) \in E} p(e_{ij} = 1)^{w_{ij}} \prod_{(i,j) \in \bar{E}} (1 - p(e_{ij} = 1))^{\gamma} \propto \sum_{(i,j) \in E} w_{ij} \log p(e_{ij} = 1) + \sum_{(i,j) \in \bar{E}} \gamma \log(1 - p(e_{ij} = 1)),    (5)

in which Ē is the set of vertex pairs that are not observed and γ is a unified weight assigned to the negative edges. The first part of Eqn. (5) models the likelihood of the observed edges; by maximizing this part, similar data points stay close together in the low-dimensional space. The second part models the likelihood of all the vertex pairs without edges, i.e., negative edges; by maximizing this part, dissimilar data points will be far away from each other. By maximizing the objective (5), both goals can be achieved.

Optimization. Directly maximizing Eqn. (5) is computationally expensive, as the number of negative edges is quadratic in the number of nodes. Inspired by the negative sampling techniques [18], instead of using all the negative edges, we randomly sample some negative edges for model optimization. For each vertex i, we randomly sample some vertices j according to a noise distribution P_n(j) and treat (i, j) as negative edges. We use the noise distribution of [18]: P_n(j) ∝ d_j^{0.75}, in which d_j is the degree of vertex j. Letting M be the number of negative samples for each positive edge, the objective function can be redefined as

O = \sum_{(i,j) \in E} w_{ij} \Big( \log p(e_{ij} = 1) + \sum_{k=1}^{M} \mathbb{E}_{j_k \sim P_n(j)} \, \gamma \log(1 - p(e_{i j_k} = 1)) \Big).    (6)

A straightforward approach to optimize Eqn. (6) is stochastic gradient descent, which is however problematic. When sampling an edge (i, j) for a model update, the weight of the edge w_ij is multiplied into the gradient. When the values of the weights diverge (e.g., ranging from 1 to thousands), the norms of the gradients also diverge, in which case it is very difficult to choose an appropriate learning rate. We adopt the approach of edge sampling proposed in our previous paper [23]: we randomly sample edges with probability proportional to their weights and then treat the sampled edges as binary edges. With this edge sampling technique, the objective function remains the same and the learning process is not affected by the variance of the edge weights.

To further accelerate the training process, we adopt asynchronous stochastic gradient descent, which is very efficient and effective on sparse graphs [19]. The reason is that when different threads sample different edges for model updating, as the graph is very sparse, the vertices of the sampled edges in different threads seldom overlap, i.e., the embeddings of the vertices or the model parameters usually do not conflict across different threads.

For the time complexity of the optimization, each stochastic gradient step takes O(sM), where M is the number of negative samples and s is the dimension of the low-dimensional space (e.g., 2 or 3). In practice, the number of stochastic gradient steps is typically proportional to the number of vertices N. Therefore, the overall time complexity is O(sMN), which is linear in the number of nodes N.
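Putting edge sampling, negative sampling, and the gradient of Eqn. (6) together, one stochastic update looks roughly as follows. This is a single-threaded sketch with f(x) = 1/(1 + x²); the gradient clipping bound and all variable names are illustrative assumptions, and the released tool runs many such updates asynchronously across threads.

```python
import numpy as np

def train_step(Y, edges, edge_prob, neg_prob, rng, lr=1.0, M=5, gamma=7.0):
    # One stochastic step of Eqn. (6) with f(x) = 1 / (1 + x^2).
    # edges: array of (i, j) pairs; edge_prob: sampling probabilities ~ w_ij;
    # neg_prob: noise distribution P_n over vertices, ~ degree^0.75.
    i, j = edges[rng.choice(len(edges), p=edge_prob)]   # edge sampling -> binary edge
    diff = Y[i] - Y[j]
    grad = -2.0 * diff / (1.0 + diff @ diff)            # d/dy_i log f(||y_i - y_j||)
    Y[i] += lr * grad                                   # pull the endpoints together
    Y[j] -= lr * grad
    for k in rng.choice(len(Y), size=M, p=neg_prob):    # M negative samples
        if k == i or k == j:
            continue
        diff = Y[i] - Y[k]
        d2 = max(float(diff @ diff), 1e-8)
        grad = 2.0 * gamma * diff / (d2 * (1.0 + d2))   # d/dy_i gamma * log(1 - f)
        grad = np.clip(grad, -5.0, 5.0)                 # keep the repulsive step bounded
        Y[i] += lr * grad                               # push the pair apart
        Y[k] -= lr * grad
    return Y
```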
4. EXPERIMENTS

We evaluate the efficiency and effectiveness of the LargeVis both quantitatively and qualitatively. In particular, we separately evaluate the performance of the proposed algorithms for constructing the KNN graph and for visualizing the graph.

4.1 Data Sets

We select multiple large-scale and high-dimensional data sets of various types, including text (words and documents), images, and networks:

• 20NG: the widely used text mining data set 20newsgroups (http://qwone.com/~jason/20Newsgroups/). We treat each article as a data point.

• MNIST: the handwritten digits data set (http://yann.lecun.com/exdb/mnist/). Each image is treated as a data point.

• WikiWord: the vocabulary of the English Wikipedia articles (https://en.wikipedia.org/wiki/Wikipedia:Database_download); words with frequency less than 15 are removed. Each word is a data point.

• WikiDoc: the entire set of English Wikipedia articles (articles containing less than 1,000 words are removed). Each article is a data point. We label the articles with the top 1,000 Wikipedia categories and label all the other articles with a special category named "others."

• CSAuthor: the co-authorship network in the computer science domain, collected from Microsoft Academic Search. Each author is a data point.

• DBLPPaper: the heterogeneous network of authors, papers, and conferences in the DBLP data (http://dblp.uni-trier.de/xml/). Each paper is a data point.

• LiveJournal: the LiveJournal social network (https://snap.stanford.edu/data/). Every node is labeled with the communities it belongs to, if it is one of the 5,000 most popular communities, or with a special category named "others."

Note that although the original data sets all come with varying numbers of dimensions (e.g., the size of the vocabulary for text documents), for comparison purposes we represent them with a fixed number of dimensions (e.g., 100) before applying any visualization techniques. This step is not required for LargeVis in practice, but learning an intermediate representation of the data can improve (e.g., smooth) the similarity structure of the original data. There are quite a few efficient embedding learning techniques (such as Skipgram [18] and LINE [23]) whose computational cost is not a burden for the visualization. Specifically, the representations of nodes in the network data are learned through LINE; the representations of words are learned through LINE using a simple co-occurrence network; and the representations of documents are simply taken as the averaged vectors of the words in the documents. The vector representation of the image data is already provided by the source, so we do not learn a new embedding for it.

Table 1: Statistics of the data sets.

  Data set      #data       #dimension   #categories
  20NG          18,846      100          20
  MNIST         70,000      784          10
  WikiWord      836,756     100          -
  WikiDoc       2,837,395   100          1,000
  CSAuthor      1,854,295   100          -
  DBLPPaper     1,345,560   100          -
  LiveJournal   3,997,963   100          5,000

We summarize the statistics of the above data sets in Table 1. Next, we report the results of KNN graph construction and graph visualization respectively. All the following results are obtained on a machine with 512GB of memory and 32 cores at 2.13GHz. When multiple threads are used, the number of threads is always 32. For visualization purposes, in all the experiments we learn a 2D layout of the data.

4.2 Results on KNN Graph Construction

We compare the performance of the following algorithms for KNN graph construction:

• Random projection trees [7]. We use the implementation in Annoy (https://github.com/spotify/annoy).

• Vantage-point trees [28]. This is the approach used by the t-SNE.

• NN-Descent [9]. This is a representative neighbor exploring technique.

• LargeVis. Our proposed technique, which improves random projection trees with neighbor exploring.

Figure 2: Running time vs. accuracy of KNN graph construction on WikiWord, WikiDoc, LiveJournal, and CSAuthor. The lower right corner indicates optimal performance. LargeVis outperforms the vantage-point tree and other state-of-the-art methods.

Fig. 2 compares the performance of the different algorithms for KNN graph construction. The number of neighbors for each data point is set to 150. For each algorithm, we try different values of its parameters, resulting in a curve of running time over accuracy (i.e., the percentage of data points that are truly K-nearest neighbors of a node). Some results of the vantage-point trees could not be shown as the values are too large. For LargeVis, only one iteration of neighbor exploring is conducted. Overall, the proposed graph construction algorithm consistently achieves the best performance (the shortest running time at the highest accuracy) on all four data sets, and the vantage-point trees perform the worst.
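The accuracy reported in Fig. 2 (the fraction of returned neighbors that are true K-nearest neighbors) only needs the exact neighbor lists for a sample of nodes; a small sketch, where the exact lists for the sampled nodes are assumed to be precomputed by brute force:

```python
import numpy as np

def knn_graph_accuracy(approx_knn, exact_knn):
    # approx_knn / exact_knn: dicts mapping a sampled node id to its K neighbor ids.
    recalls = [len(set(approx_knn[i]) & set(exact_knn[i])) / len(exact_knn[i])
               for i in exact_knn]
    return float(np.mean(recalls))
```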
On the WikiDoc data set, which contains around 3 million data points, our algorithm takes only 25 minutes to achieve 95% accuracy, while vantage-point trees take 16 hours, which is 37 times slower. Compared to the original random projection trees, the efficiency gain is also salient. On some data sets, e.g., LiveJournal and CSAuthor, it is very costly to construct a KNN graph at a 90% accuracy through random projection trees alone. However, with the neighbor exploring techniques, the accuracy of the graph significantly improves to near perfection.

How many iterations of neighbor exploring are required for LargeVis to achieve a good accuracy? Fig. 3 presents the accuracy of the KNN graph w.r.t. the number of iterations of neighbor exploring. We initialize KNN graphs at different levels of accuracy, constructed with different numbers of random projection trees. Neighbor exploring is very effective. On WikiDoc, the accuracy of the approximate KNN graph improves from 0.4 to almost 1 with only one iteration of neighbor exploring. On LiveJournal, at most three iterations are needed to achieve a very high accuracy, even when starting from a very inaccurate KNN graph. Similar results are also observed on the other data sets.

Figure 3: Accuracy of the KNN graph w.r.t. the number of iterations of neighbor exploring in LargeVis, on WikiDoc and LiveJournal. Curves correspond to initializing the KNN graph at different levels of accuracy.

Our proposed algorithm for KNN graph construction is very efficient, easily scaling to millions of data points with hundreds of dimensions. This solves the computational bottleneck of many data visualization techniques. Next, we compare algorithms that visualize the KNN graphs. All visualization algorithms use the same KNN graphs constructed by LargeVis as input, setting the perplexity to 50 and the number of neighbors for each data point to 150.

4.3 Graph Visualization

We compare the following graph visualization algorithms:

• Symmetric SNE [13]. The approach of symmetric stochastic neighbor embedding. To scale it up to large graphs, the Barnes-Hut algorithm [26] is used for acceleration.

• t-SNE [26]. The state-of-the-art approach for visualizing high-dimensional data, also accelerated through the Barnes-Hut algorithm.

• LINE [23]. A large-scale network/graph embedding method. Although not designed for visualization purposes, we directly learn a 2-dimensional embedding. First-order proximity [23] is used.

• LargeVis. Our proposed technique for graph visualization, introduced in Section 3.2.

Model Parameters and Settings. For the model parameters in SNE and t-SNE, we set θ = 0.5 and the number of iterations to 1,000, as suggested by [26]. For the learning rate of t-SNE, we find the performance is very sensitive to its value and the optimal values on different data sets vary significantly; we report the results with the default learning rate 200 and with the optimal values, respectively. For both LINE and LargeVis, the size of the mini-batches is set to 1; the learning rate is set to ρ_t = ρ_0 (1 − t/T), where T is the total number of edge samples or mini-batches. Different initial learning rates are used by LINE and LargeVis: ρ_0 = 0.025 in LINE and ρ_0 = 1 in LargeVis. The number of negative samples is set to 5 and γ is set to 7 by default. The number of samples or mini-batches T can be set proportional to the number of nodes; in practice, a reasonable value of T for 1 million nodes is 10K million.
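The linear learning-rate decay ρ_t = ρ_0 (1 − t/T) is straightforward to wrap around the stochastic update sketched in Section 3.2; `train_step` below refers to that earlier illustrative function, not to the released implementation.

```python
def train_layout(Y, edges, edge_prob, neg_prob, rng, T, rho0=1.0, M=5, gamma=7.0):
    # Run T edge samples with the linearly decaying learning rate used by LINE/LargeVis.
    for t in range(T):
        lr = rho0 * (1.0 - t / T)   # rho_t = rho_0 * (1 - t / T)
        train_step(Y, edges, edge_prob, neg_prob, rng, lr=lr, M=M, gamma=gamma)
    return Y
```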
The LINE and LargeVis can be naturally parallelized through asynchronous stochastic gradient descent. We also parallelize Symmetric SNE and t-SNE by assigning different nodes to different threads in each full batch gradient descent step.

Evaluation. The evaluation of data visualization is usually subjective. Here we borrow the approach adopted by the t-SNE to evaluate the visualizations quantitatively [26]: we use a KNN classifier to classify the data points based on their low-dimensional representations. The intuition behind this evaluation methodology is that a good visualization should preserve the structure of the original data as much as possible and hence yield a high classification accuracy with the low-dimensional representations.
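This protocol is easy to reproduce with an off-the-shelf classifier; a sketch using scikit-learn, where the train/test split ratio is an assumption (the paper varies the number of neighbors, as in Fig. 5):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_layout_accuracy(Y2d, labels, n_neighbors=10, test_size=0.2, seed=0):
    # Classification accuracy of a KNN classifier on the 2D layout: higher accuracy
    # suggests the layout preserves more of the original class structure.
    Y_train, Y_test, l_train, l_test = train_test_split(
        Y2d, labels, test_size=test_size, random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(Y_train, l_train)
    return clf.score(Y_test, l_test)
```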
4.3.1 Comparing Different Probabilistic Functions

We first compare different probabilistic functions in Eq. (3), which define the probability of observing a binary edge between a pair of vertices based on the distance of their low-dimensional representations. We compare the functions f(x) = 1/(1 + ax²), with various values of a, and f(x) = 1/(1 + exp(x²)). Fig. 4 presents the results on the WikiDoc and LiveJournal data sets. Among all the probabilistic functions, f(x) = 1/(1 + x²) achieves the best result. This probability function specifies a long-tailed distribution and can therefore also address the "crowding problem," according to [27]. In the following experiments, we always use f(x) = 1/(1 + x²).

Figure 4: Comparing different probabilistic functions (WikiDoc and LiveJournal).

4.3.2 Results on Different Data Sets

We compare the efficiency and effectiveness of the different visualization algorithms. Fig. 5 compares the classification accuracy of the K-nearest neighbor classifier using the low-dimensional representations as features. For the KNN classifier, different numbers of neighbors are tried. For t-SNE, both the results with the default learning rate 200 and those with the optimal learning rate tuned through exhaustive search are reported. On the small data sets 20NG and MNIST, which contain less than 100,000 data points, the default learning rate of t-SNE yields optimal performance, which is comparable to LargeVis. On the large data sets WikiDoc and LiveJournal, which contain millions of data points, the LargeVis is more effective than or comparable to the t-SNE with optimal learning rates, significantly outperforming t-SNE with the default learning rate. However, empirically tuning the learning rate of t-SNE requires repeated training, which is very time consuming on the large data sets. The optimal learning rates of t-SNE on different data sets vary significantly: on the small data sets 20NG and MNIST, the optimal learning rate is around 200, while on the large data sets WikiDoc and LiveJournal, the optimal values become as large as 3,000. Compared to t-SNE, the performance of LargeVis is very stable w.r.t. the learning rate, and its default value can be applied to data sets of various sizes. We also notice that the performance of the LINE is very poor, showing that an embedding learning method is not appropriate for data visualization as is.

Figure 5: Performance of classifying data points according to their 2D representations using the K-nearest neighbor classifier, on 20NG, MNIST, WikiDoc, and LiveJournal. Overall, the LargeVis is more effective than or comparable to t-SNE with optimal learning rates, significantly outperforming t-SNE with the recommended learning rate (1,000) on large data sets. The optimal learning rates of t-SNE vary significantly on different data sets, ranging from around 200 (on 20NG and MNIST) to 2,500 (on WikiDoc) and 3,000 (on LiveJournal), which are very expensive to search on large data sets. Even with the optimal parameters, t-SNE is inferior to LargeVis, which simply uses default parameters on all data sets.

Table 2: Comparison of running time (hours) in graph visualization between the t-SNE and LargeVis.

  Algorithm      20NG   MNIST   WikiWord   WikiDoc   LiveJournal   CSAuthor   DBLPPaper
  t-SNE          0.12   0.41    9.82       45.01     70.35         28.33      18.73
  LargeVis       0.14   0.23    2.01       5.60      9.26          4.24       3.19
  Speedup rate   0      0.7     3.9        7         6.6           5.7        4.9

Table 2 compares the running time of t-SNE and LargeVis for graph visualization. On the small data sets 20NG and MNIST, the two algorithms perform comparably to each other. However, on the large data sets, the LargeVis is much more efficient than the t-SNE. In particular, on the largest data set LiveJournal, which contains 4 million data points, the LargeVis is 6.6 times faster than the t-SNE.

4.3.3 Performance w.r.t. Data Size

We further compare the performance of the LargeVis and t-SNE w.r.t. the size of the data, in terms of both effectiveness and efficiency. Fig. 6 presents the results on the WikiDoc and LiveJournal data sets, obtained by sampling different percentages of the data. In Fig. 6(a) and 6(b), we can see that as the size of the data set increases, the performance of the LargeVis increases while the performance of t-SNE decreases. By exhaustively tuning the learning rates, the performance of t-SNE becomes comparable to LargeVis; however, this process is very time consuming, especially on large-scale data sets. Fig. 6(c) and 6(d) show that the LargeVis becomes more and more efficient than t-SNE as the size of the data grows. This is because the time complexity of graph visualization in t-SNE is O(N log N) while that of LargeVis is O(N).

Figure 6: Accuracy and running time of the LargeVis and t-SNE w.r.t. the size of the data (accuracy and time on WikiDoc and LiveJournal). LargeVis is much more efficient than t-SNE as the size of the data grows.

4.3.4 Parameter Sensitivity

Finally, we investigate the sensitivity of the parameters in the LargeVis, including the number of negative samples (M) and the number of training samples (T). Fig. 7(a) shows the results w.r.t. the number of negative samples: once the number of negative samples becomes large enough (e.g., 5), the performance becomes very stable. For each data point, instead of using all the negative edges, we just need to sample a few negative edges according to a noise distribution. An interesting future direction is to design a more effective noise distribution for negative edge sampling. Fig. 7(b) presents the results w.r.t. the number of training samples: when the number of samples becomes large enough, the performance becomes very stable.

Figure 7: Performance of LargeVis w.r.t. the number of negative samples and the number of training samples on WikiDoc. Performance is not sensitive to the two parameters.
4.4 Visualization Examples

Finally, we show several visualization examples so that we can intuitively evaluate the quality of LargeVis visualizations and compare the performance of t-SNE and LargeVis. Fig. 8 and Fig. 9 present the visualizations. Different colors correspond to different categories (20NG) or to clusters computed with K-means based on the high-dimensional representations (WikiWord, WikiDoc, CSAuthor and LiveJournal); 200 clusters are used for all four data sets. We can see that on the smallest data set, 20NG, the visualizations generated by the t-SNE and LargeVis are both meaningful and comparable to each other. On the large data sets such as WikiDoc and LiveJournal, which contain at least 2.8 million data points, the visualizations generated by the LargeVis look much more intuitive than the ones by t-SNE.

Figure 8: Visualizations of 20NG, WikiDoc, and LiveJournal by t-SNE and LargeVis. Different colors correspond to different categories (20NG) or clusters learned by K-means according to high-dimensional representations.

Figure 9: Visualizations of WikiWord and CSAuthor by LargeVis. Colors correspond to clusters learned by K-means according to high-dimensional representations.

Fig. 10 shows a region of the visualization of the DBLP papers generated by LargeVis. Each color corresponds to a computer science conference. The visualization is very intuitive. The papers of "WWW (Companion Volume)" lie next to the papers published at WWW, corresponding to its workshop and poster papers. The closest conference to WWW is ICWSM, right to the north. This "Web" cluster is close to SIGIR and ECIR on the west (the information retrieval community), with three digital library conferences close by. KDD papers locate to the east of WWW, and the database conferences ICDE, SIGMOD, EDBT and VLDB are clustered to the south of KDD. It is interesting to see that the papers published at CIKM are split into three different parts: one between SIGIR and WWW, and two between KDD and ICDE. This clearly reflects the three different tracks of the CIKM conference: information retrieval, knowledge management, and databases.

Figure 10: Visualizing the papers in DBLP by LargeVis. Each color corresponds to a conference.

5. CONCLUSION

This paper presented a visualization technique called the LargeVis, which lays out large-scale and high-dimensional data in a low-dimensional (2D or 3D) space. LargeVis easily scales up to millions of data points with hundreds of dimensions. It first constructs a K-nearest neighbor graph of the data points and then projects the graph into the low-dimensional space. We proposed a very efficient algorithm for constructing the approximate K-nearest neighbor graph and a principled probabilistic model for graph visualization, the objective of which can be optimized effectively and efficiently.
Experiments on real-world data sets show that the LargeVis significantly outperforms the t-SNE in both the graph construction and the graph visualization steps, in terms of efficiency, effectiveness, and the quality of the visualizations. In the future, we plan to use the low-dimensional layouts generated by the LargeVis as the basis for more advanced visualizations and to generate many intuitive and meaningful visualizations of high-dimensional data. Another interesting direction is to handle data that dynamically changes over time.

Acknowledgments

The co-author Ming Zhang is supported by the National Natural Science Foundation of China (NSFC Grant No. 61472006 and 61272343); Qiaozhu Mei is supported by the National Science Foundation under grant numbers IIS-1054199 and CCF-1048168.
