Information Distance: New Developments

Paul M.B. Vitányi
CWI, Amsterdam, The Netherlands
(Invited Lecture)

Affiliation: National Research Center for Mathematics and Computer Science in the Netherlands (CWI). Address: CWI, Science Park 123, 1098 XG Amsterdam, The Netherlands. Email: [email protected]

Abstract

In pattern recognition, learning, and data mining one obtains information from information-carrying objects. This involves an objective definition of the information in a single object, the information to go from one object to another object in a pair of objects, the information to go from one object to any other object in a multiple of objects, and the shared information between objects. This is called "information distance." We survey a selection of new developments in information distance.

I. The Case n = 2

The clustering we use is hierarchical clustering in dendrograms based on a new fast heuristic for the quartet method [5]. If we consider n objects, then we find n(n−1)/2 pairwise distances. These distances are between natural data. We let the data decide for themselves, and construct a hierarchical clustering of the n objects concerned. For details see the cited reference. The method takes the n×n distance matrix as input, and yields a dendrogram with the n objects as leaves (so the dendrogram contains n external nodes or leaves and n−2 internal nodes). We assume n ≥ 4. The method is available as an open-source software tool [2].

Our aim is to capture, in a single similarity metric, every effective distance: effective versions of Hamming distance, Euclidean distance, edit distances, alignment distance, Lempel-Ziv distance, and so on. This metric should be so general that it works in every domain: music, text, literature, programs, genomes, executables, natural language determination, equally and simultaneously. It would be able to simultaneously detect all similarities between pieces that other effective distances can detect separately.

Such a "universal" metric was co-developed by us as a normalized version of the "information metric" of [1], [9]. There it was shown that the information metric minorizes up to a constant all effective distances satisfying a mild density requirement (excluding for example distances that are 1 for every pair x, y such that x ≠ y). This justifies the notion that the information distance is universal.

We may be interested in what happens in terms of properties or features of the pair of objects analyzed, say x and y. It can be shown that the information distance captures every property of which the Kolmogorov complexity is logarithmic in the length of min{|x|, |y|}. If those lengths go to infinity, then the logarithms of those lengths go to infinity too. In this case the information distance captures every property.

This information distance (actually a metric up to minor additive terms) is normalized so that the resulting distances are in [0,1] and can be shown to retain the metric property [8]. The result is the "normalized information distance" (actually a metric up to negligible terms). All this is in terms of Kolmogorov complexity [9].

It articulates the intuition that two objects are deemed close if we can significantly "compress" one given the information in the other, that is, two pieces are more similar if we can more succinctly describe one given the other. The normalized information distance discovers all effective similarities in the sense that if two objects are close according to some effective similarity, then they are also close according to the normalized information distance.

Put differently, the normalized information distance represents similarity according to the dominating shared feature between the two objects being compared. In comparisons of more than two objects, different pairs may have different dominating features. For every two objects, this normalized information metric distance zooms in on the dominant similarity between those two objects out of a wide class of admissible similarity features. Since the normalized information distance also satisfies the metric (in)equalities, and takes values in [0,1], it may be called "the" similarity metric.

Unfortunately, the universality of the normalized information distance comes at the price of noncomputability. Recently we have shown that the normalized information distance is not even semicomputable (this is weaker than computable) and there is no semicomputable function at a computable distance of it [13].

Since the Kolmogorov complexity of a string or file is the length of the ultimate compressed version of that file, we can use real data compression programs to approximate the Kolmogorov complexity. Therefore, to apply this ideal precise mathematical theory in real life, we have to replace the use of the noncomputable Kolmogorov complexity by an approximation using a standard real-world compressor. Starting from the normalized information distance, if Z is a compressor and we use Z(x) to denote the length of the compressed version of a string x, then we arrive at the Normalized Compression Distance:

    NCD(x,y) = \frac{Z(xy) - \min(Z(x), Z(y))}{\max(Z(x), Z(y))},    (1)

where for convenience we have replaced the pair (x,y) in the formula by the concatenation xy, and we ignore logarithmic terms in the numerator and denominator, see [8], [3]. In [3] we propose axioms to capture the real-world setting, and show that (1) approximates optimality. Actually, the NCD is a family of compression functions parameterized by the given data compressor Z.
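In practice Z can be any off-the-shelf compressor. As a minimal sketch (not the paper's own implementation, whose experiments use compressors such as gzip, bzip2, PPMZ, and Gencompress), the following Python fragment computes (1) with the standard-library zlib and bz2 compressors standing in for Z:

    # Minimal sketch of the Normalized Compression Distance (1).
    # Assumption: zlib/bz2 as stand-ins for the compressor Z.
    import bz2
    import zlib

    def compressed_length(data: bytes, compressor=zlib) -> int:
        """Z(x): length in bytes of the compressed version of data."""
        return len(compressor.compress(data))

    def ncd(x: bytes, y: bytes, compressor=zlib) -> float:
        """NCD(x, y) per equation (1), using concatenation for the pair."""
        zx = compressed_length(x, compressor)
        zy = compressed_length(y, compressor)
        zxy = compressed_length(x + y, compressor)
        return (zxy - min(zx, zy)) / max(zx, zy)

    if __name__ == "__main__":
        a = b"the quick brown fox jumps over the lazy dog " * 50
        b = b"the quick brown fox jumps over the lazy cat " * 50
        c = bytes(range(256)) * 10
        print(ncd(a, b))        # similar inputs: NCD well below 1
        print(ncd(a, c))        # unrelated inputs: NCD closer to 1
        print(ncd(a, a, bz2))   # NCD(x, x) near 0 when the compressor copes

Swapping in a stronger compressor only requires passing a different compress function, which is exactly the sense in which the NCD is a family of distances parameterized by Z.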
A. Web-based Similarity

To make computers more intelligent one would like to represent meaning in computer-digestible form. Long-term and labor-intensive efforts like the Cyc project [7] and the WordNet project [11] try to establish semantic relations between common objects, or, more precisely, names for those objects. The idea is to create a semantic web of such vast proportions that rudimentary intelligence and knowledge about the real world spontaneously emerges. This comes at the great cost of designing structures capable of manipulating knowledge, and entering high-quality contents in these structures by knowledgeable human experts. While the efforts are long-running and large scale, the overall information entered is minute compared to what is available on the Internet.

The rise of the Internet has enticed millions of users to type in trillions of characters to create billions of web pages of on average low-quality contents. The sheer mass of the information available about almost every conceivable topic makes it likely that extremes will cancel and the majority or average is meaningful in a low-quality approximate sense. Below, we give a general method to tap the amorphous low-grade knowledge available for free on the Internet, typed in by local users aiming at personal gratification of diverse objectives, and yet globally achieving what is effectively the largest semantic electronic database in the world. Moreover, this database is available for all by using any search engine that can return aggregate page-count estimates, like Google, for a large range of search queries.

While the previous NCD method that compares the objects themselves using (1) is particularly suited to obtain knowledge about the similarity of objects themselves, irrespective of common beliefs about such similarities, we now develop a method that uses only the name of an object and obtains knowledge about the similarity of objects by tapping available information generated by multitudes of web users. The new method is useful to extract knowledge from a given corpus of knowledge, in this case the pages on the Internet accessed by a search engine returning aggregate page counts, but not to obtain true facts that are not common knowledge in that database. For example, common viewpoints on the creation myths in different religions may be extracted by the web-based method, but contentious questions of fact concerning the phylogeny of species can be better approached by using the genomes of these species, rather than by opinion. This approach was proposed by [4]. We skip the theory.

In contrast to strings x, where the complexity Z(x) represents the length of the compressed version of x using compressor Z, for a search term x (just the name for an object rather than the object itself) the code length G(x) represents the shortest expected prefix-code word length of the event x (the number of pages of the Internet returned by a given search engine). The associated normalized web distance (NWD) is defined just as (1), with the search engine in the role of compressor yielding code lengths G(x), G(y) for the singleton search terms x, y being compared and a code length G(x,y) for the doubleton pair (x,y), by

    NWD(x,y) = \frac{G(x,y) - \min(G(x), G(y))}{\max(G(x), G(y))}.    (2)

This NWD uses the background knowledge on the web, as viewed by the search engine, as conditional information. The same formula as (2) can be written in terms of frequencies of the number of pages returned on a search query as

    NWD(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log N - \min\{\log f(x), \log f(y)\}},    (3)

and if f(x), f(y) > 0 and f(x,y) = 0 then NWD(x,y) = ∞. It is easy to see that
1) NWD(x,y) is undefined for f(x) = f(y) = 0;
2) NWD(x,y) = ∞ for f(x,y) = 0 and either or both f(x) > 0 and f(y) > 0; and
3) NWD(x,y) ≥ 0 otherwise.

The number N is related to the number of pages M indexed by the search engine we use. Our experimental results suggest that every reasonable (greater than any f(x)) value can be used for the normalizing factor N, and our results seem in general insensitive to this choice. In our software, this parameter N can be adjusted as appropriate, and we often use M for N. In [4] we analyze the mathematical properties of the NWD, and prove the universality of the search engine distribution. We show that the NWD is not a metric, in contrast to the NCD. The generic example showing the nonmetricity of semantics (and therefore the NWD) is that a man is close to a centaur, and a centaur is close to a horse, but a man is very different from a horse.
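To illustrate how (3) could be evaluated, the sketch below assumes a hypothetical page_count(term) helper returning the aggregate page count a search engine reports for a query; real engines expose such counts through their own, varying interfaces, so this shows the formula rather than any particular service:

    # Sketch of the Normalized Web Distance (3) from aggregate page counts.
    # `page_count` is a hypothetical stand-in for a search-engine query.
    import math

    def page_count(term: str) -> int:
        """Hypothetical: number of pages the search engine returns for `term`."""
        raise NotImplementedError("plug in a real search-engine query here")

    def nwd(x: str, y: str, n_pages: float) -> float:
        """NWD per (3); n_pages is the normalizing factor N (often the index size M)."""
        fx, fy = page_count(x), page_count(y)
        fxy = page_count(f"{x} {y}")          # pages containing both terms
        if fx == 0 and fy == 0:
            raise ValueError("NWD undefined when f(x) = f(y) = 0")   # case 1)
        if fxy == 0:
            return math.inf                                          # case 2)
        num = max(math.log(fx), math.log(fy)) - math.log(fxy)
        den = math.log(n_pages) - min(math.log(fx), math.log(fy))
        return num / den

This reading assumes G(x) behaves like log N − log f(x), the code length associated with the page-count frequencies, which is how (2) turns into (3).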
B. Question-Answer System

A typical procedure for finding an answer on the Internet consists in entering some terms regarding the question into a Web search engine and then browsing the search results in search of the answer. This is particularly inconvenient when one uses a mobile device with a slow Internet connection and a small display. Question-answer (QA) systems attempt to solve this problem. They allow the user to enter a question in natural language and generate an answer by searching the Web autonomously. The QA system QUANTA [15] uses variants of the NCD and the NWD to identify the correct answer to a question out of several candidate answers. QUANTA is remarkable in that it uses neither the NCD nor the NWD introduced so far, but a variation that is nevertheless based on the same theoretical principles. This variation is tuned to the particular needs of a QA system. Without going into too much detail, it uses the maximal overlap of the program p going from file x to file y and the program q going from file y to file x. The system QUANTA is 1.5 times better (according to generally used measures) than its competition.

II. n > 2

In many applications we are interested in shared information between many objects instead of just a pair of objects. For example, in customer reviews of gadgets, in blogs about public happenings, in newspaper articles about the same occurrence, we are interested in the most comprehensive one or the most specialized one. Thus, we want to extend the information distance measure from pairs to multiples. This approach was introduced in [10], while most of the theory is developed in [14].

Let X denote a finite list of m finite binary strings defined by X = (x_1, ..., x_m), the constituting strings ordered length-increasing lexicographically. We use lists and not sets, since if X is a set we cannot simply express the distance from a string to itself or between strings that are all equal. Let U be the reference universal Turing machine. Given the string x_i we define the information distance to any string in X by E_max(X) = \min\{|p| : U(x_i, p, j) = x_j \text{ for all } x_i, x_j \in X\}. It is shown in [10], Theorem 2, that

    E_max(X) = \max_{x : x \in X} K(X|x),    (4)

up to a logarithmic additive term. Define E_min(X) = \min_{x : x \in X} K(X|x). Theorem 3 in [10] states that for every list X = (x_1, ..., x_m) we have

    E_min(X) \le E_max(X) \le \sum_{i : 1 \le i \le m} \min_{x_i, x_k \in X,\, k \ne i} E_max(x_i, x_k),    (5)

up to a logarithmic additive term. This is not a corollary of (4) as stated in [10], but both inequalities follow from the definitions. The lefthand side is interpreted as the program length of the "most comprehensive object that contains the most information about all the others [all elements of X]," and the righthand side is interpreted as the program length of the "most specialized object that is similar to all the others."

Information distance for multiples, that is, finite lists, appears both practically and theoretically promising. The results below appear in [14]. In all cases the results imply the corresponding ones for the pairwise information distance defined as follows. The information distance in [1] between strings x_1 and x_2 is E_max(x_1, x_2) = \max\{K(x_1|x_2), K(x_2|x_1)\}. In [14], E_max(X) = \max_{x : x \in X} K(X|x). These two definitions coincide for |X| = 2 since K(x,y|x) = K(y|x) up to an additive constant term. The reference investigates the maximal overlap of information, which for |X| = 2 specializes to Theorem 3.4 in [1]. A corollary in [14] shows (4), and another corollary shows that the lefthand side of (5) can indeed be taken to correspond to a single program embodying the "most comprehensive object that contains the most information about all the others," as stated but not argued or proved in [10]. The reference proves metricity and universality, which for |X| = 2 specialize to Theorem 4.2 in [1]; additivity; minimum overlap of information, which for |X| = 2 specializes to Theorem 8.3.7 in [12]; and the nonmetricity of the normalized information distance for lists of more than two elements and the failure of certain proposals of a normalizing factor (to achieve a normalized version). In contrast, for lists of two elements we can normalize the information distance as in Lemma V.4 and Theorem V.7 of [8]. The definitions are of necessity new, as are the proof ideas. Remarkably, the new notation and proofs for the general case are simpler than the mentioned existing proofs for the particular case of pairwise information distance.
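As a rough illustration only: replacing K by a real compressor Z, as in the step from the normalized information distance to (1), and approximating K(X|x) by Z(X) − Z(x), with Z(X) the compressed length of the concatenated list, gives crude computable analogues of (4) and (5). The sketch below rests on that assumption and is not the construction of [10] or [14]:

    # Crude compression-based illustration of E_max/E_min for a list X, eqs. (4)/(5).
    # Assumption (not from the paper): K(X|x) is approximated by Z(X) - Z(x),
    # with Z the zlib-compressed length; logarithmic terms are ignored.
    import zlib

    def Z(data: bytes) -> int:
        """Compressed length, standing in for Kolmogorov complexity."""
        return len(zlib.compress(data))

    def e_max_pair(x: bytes, y: bytes) -> int:
        """Analogue of E_max(x, y): Z(xy) - min(Z(x), Z(y)), as in the NCD numerator."""
        return Z(x + y) - min(Z(x), Z(y))

    def e_max(X: list[bytes]) -> int:
        """Analogue of (4): max_x K(X|x) ~ Z(X) - min_x Z(x)."""
        return Z(b"".join(X)) - min(Z(x) for x in X)

    def e_min(X: list[bytes]) -> int:
        """Analogue of E_min(X): Z(X) - max_x Z(x)."""
        return Z(b"".join(X)) - max(Z(x) for x in X)

    def rhs_of_5(X: list[bytes]) -> int:
        """Right-hand side of (5): sum over i of min over k != i of E_max(x_i, x_k)."""
        return sum(
            min(e_max_pair(xi, xk) for k, xk in enumerate(X) if k != i)
            for i, xi in enumerate(X)
        )

    if __name__ == "__main__":
        X = [b"the quick brown fox " * 40,
             b"the quick brown cat " * 40,
             b"lorem ipsum dolor sit " * 40]
        # (5) predicts e_min <= e_max <= rhs, up to logarithmic/approximation error.
        print(e_min(X), e_max(X), rhs_of_5(X))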
III. Conclusion

By now applications abound. See the many references to the papers [8], [3], [4] in Google Scholar.

The method turns out to be more or less robust under change of the underlying compressor types: statistical (PPMZ), Lempel-Ziv based dictionary (gzip), block based (bzip2), or special purpose (Gencompress). Obviously the window size matters, as well as how good the compressor is. For example, PPMZ gives for mtDNA of the investigated species diagonal elements (NCD(x,x)) between 0.002 and 0.006. The compressor bzip2 does considerably worse, and gzip gives something in between 0.5 and 1 on the diagonal elements. Nonetheless, for texts like books gzip does fine in our experiments; the window size is sufficient and we do not use the diagonal elements. But for genomics gzip is no good.
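A simple way to run this kind of sanity check on one's own data is to compute the diagonal elements NCD(x, x) directly; a minimal sketch, mirroring the earlier NCD fragment and assuming zlib and bz2 as the compressors under comparison:

    # Sketch: use NCD(x, x) as a compressor-quality check for a given kind of data.
    # Values near 0 suggest the compressor copes with the file size; values near 1
    # suggest it does not (e.g. a limited window on long genome files).
    import bz2
    import zlib

    def ncd(x: bytes, y: bytes, compress) -> float:
        zx, zy, zxy = (len(compress(d)) for d in (x, y, x + y))
        return (zxy - min(zx, zy)) / max(zx, zy)

    def diagonal_check(samples: list[bytes]) -> None:
        for name, compress in (("zlib", zlib.compress), ("bz2", bz2.compress)):
            diag = [ncd(x, x, compress) for x in samples]
            print(name, min(diag), max(diag))

    # Example usage on a few files of the data you intend to cluster:
    # diagonal_check([open(p, "rb").read() for p in ("a.fa", "b.fa")])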
References

[1] C.H. Bennett, P. Gács, M. Li, P.M.B. Vitányi, W. Zurek, Information distance, IEEE Trans. Information Theory, 44:4(1998), 1407–1423.
[2] R.L. Cilibrasi, The CompLearn Toolkit, 2003–, www.complearn.org
[3] R.L. Cilibrasi, P.M.B. Vitányi, Clustering by compression, IEEE Trans. Information Theory, 51:4(2005), 1523–1545.
[4] R.L. Cilibrasi, P.M.B. Vitányi, The Google similarity distance, IEEE Trans. Knowledge and Data Engineering, 19:3(2007), 370–383.
[5] R.L. Cilibrasi, P.M.B. Vitányi, A fast quartet tree heuristic for hierarchical clustering, Pattern Recognition, 44(2011), 662–677.
[6] A.N. Kolmogorov, Three approaches to the quantitative definition of information, Problems Inform. Transmission, 1:1(1965), 1–7.
[7] D.B. Lenat, Cyc: A large-scale investment in knowledge infrastructure, Comm. ACM, 38:11(1995), 33–38.
[8] M. Li, X. Chen, X. Li, B. Ma, P.M.B. Vitányi, The similarity metric, IEEE Trans. Information Theory, 50:12(2004), 3250–3264.
[9] M. Li, P.M.B. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, 3rd Ed., Springer-Verlag, New York, 2008.
[10] M. Li, C. Long, B. Ma, X. Zhu, Information shared by many objects, Proc. 17th ACM Conf. Inform. Knowl. Management, 2008, 1213–1220.
[11] G.A. Miller et al., WordNet: A Lexical Database for the English Language, Cognitive Science Lab, Princeton University.
[12] An.A. Muchnik, Conditional complexity and codes, Theor. Comput. Sci., 271(2002), 97–109.
[13] S.A. Terwijn, L. Torenvliet, P.M.B. Vitányi, Nonapproximability of the normalized information distance, J. Comput. System Sciences, 77:4(2011), 738–742.
[14] P.M.B. Vitányi, Information distance in multiples, IEEE Trans. Inform. Theory, 57:4(2011), 2451–2456.
[15] X. Zhang, Y. Hao, X.-Y. Zhu, M. Li, New information distance measure and its application in question answering system, J. Comput. Sci. Techn., 23:4(2008), 557–572.
