Table Of Content

US 20100274753A1 (19) United States (12) Patent Application Publication (10) Pub. No.: US 2010/0274753 A1 Liberty et al. (43) Pub. Date: Oct. 28, 2010 (54) METHODS FOR FILTERING DATA AND 863 is a continuation-in-part of application No. FILLING IN MISSING DATA USING 11/165,633, ?led on Jun. 23, 2005, noW abandoned. NONLINEAR INFERENCE (60) Provisional application No. 60/779,958, ?led on Mar. 7, 2006, provisional application No. 60/610,841, ?led (76) Inventors: Edo Liberty, New Haven, CT (U S); on Sep. 17, 2004, provisional application No. 60/ 697, Steven Zucker, Harnden, CT (US); 069, ?led on Jul. 5, 2005, provisional application No. Yosi Keller, Rehovol (IL); Mauro 60/582,242, ?led on Jun. 23, 2004. M. Maggioni, Durham, NC (US); Ronald R. Coifman, North Haven, Publication Classi?cation CT (US); Frank GeshWind, (51) Int. Cl. Madison, CT (US) G06N 5/02 (2006.01) Correspondence Address: (52) U.S. Cl. ........................................................ .. 706/50 FULBRIGHT & JAWORSKI, LLP (57) ABSTRACT 666 FIFTH AVE The present invention is directed to a method for inferring/ NEW YORK, NY 10103-3198 (US) estimating missing values in a data matrix d(q, r) having a plurality of roWs and columns comprises the steps of: orga (21) App1.No.: 12/784,155 niZing the columns of the data matrix d(q, r) into a?inity folders of columns With similar data pro?le, organizing the (22) Filed: May 20, 2010 roWs of the data matrix d(q, r) into a?inity folders of roWs With similar data pro?le, forming a graph Q of augmented Related U.S. Application Data roWs and a graph R of augmented columns by similarity or (63) Continuation of application No. 11/715,863, ?led on correlation of common entries; and expanding the data matrix Mar. 7, 2007, noW abandoned, Which is a continuation d(q, r) in terms of an orthogonal basis of a graph Q><R to in-part ofapplication No. 11/230,949, ?led on Sep. 19, infer/ estimate the missing values in said data matrix d(q, r).on 2005, noW abandoned, said application No. 1 1/715, the diffusion geometry coordinates. 10 Patent Application Publication Oct. 28, 2010 Sheet 1 0f 3 US 2010/0274753 A1 Figure.- 1 Figure 2 Patent Application Publication Oct. 28, 2010 Sheet 2 of3 US 2010/0274753 A1 Figure 3 mad $22111 . Q i 2 2" df a0) / ? ‘990 npui 3.33 (2231:: , UL, ’ (xdwefeafniatu ihn ?1n)i,t ay)n dim I {Zjixka?g m. o ‘5 _ To (X,§I}=T(x,y}if m LOCQZGLQP?) ‘351$ x: y > a’ g] 8, T1 (xyh?sthenvise )4} =€sheinéex setof¥f> mm‘ ‘ 1520 __> __ ..p x¢=o Tl “1331-; restricted w I’? and wzittanasamatrix , on R. ‘ » men- ' .v ,Pé : i=i + i _ - __ mm 5 5*! “ ’ 1W“ ..v \ hog Sizem ) < K m i = gm? I Output x;, Pg, and ‘R, for ail relevant i Patent Application Publication Oct. 28, 2010 Sheet 3 0f3 US 2010/0274753 A1 Figure 4 The Warm Wide Web PFA I PF? Search Request Harmer Worm Wide Web Document and Resuits Dispiayer Acquisition Engine PFG PFE PFK PFB PM * PF2 Qocument Comparison Document Comparisan Search Engine indexer PA /PFC Document and Comparison informatian Database US 2010/0274753 A1 Oct. 28, 2010 METHODS FOR FILTERING DATA AND [0006] The term “data mining” as used herein broadly FILLING IN MISSING DATA USING refers to the methods of data organization and subset and NONLINEAR INFERENCE feature extraction. Furthermore, the kinds of data described or used in data mining are referred to as (sets of) “digital RELATED APPLICATION documents.” Note that this phrase is used for conceptual illustration only, can refer to any type of data, and is not meant [0001] This application claims priority bene?t under Title to imply that the data in question are necessarily formally 35 U.S.C. §1 19(e) of US. Provisional PatentApplication No. documents, nor that the data in question are necessarily digi 60/779,958, ?led Mar. 7, 2006, Which is incorporated by tal data. The “digital documents” in the traditional sense of reference in its entirety. Also, this application is continuation the phrase are certainly interesting examples of the kinds of in-part ofU.S. application Ser. No. 11/230,949, ?led Sep. 19, data that are addressed herein. 2005, Which claims priority bene?t under Title 35 U.S.C. §119(e) of provisional patent application No. 60/610,841 OBJECTS AND SUMMARY OF THE ?led Sep. 17, 2004 and provisional patent application No. INVENTION 60/697,069 ?led Jul. 5, 2005, each Which is incorporated by reference in its entirety. Also, this application is a continua [0007] The present system and method described are herein tion-in-part of US. patent application Ser. No. 11/165,633 applicable at least in the case in Which, as is typical, the given ?led Jun. 23, 2005, Which claims priority bene?t under Title data to be analyZed can be thought of as a collection of data 35 U.S.C. §119(e) of provisional patent application no. objects, and for Which there is some at least rudimentary 60/582,242 ?led Jun. 23, 2004, each Which is incorporated by notion of What it means for tWo data objects to be similar, reference in its entirety. close to each other, or nearby. [0008] The present invention relates to methods for orga BACKGROUND OF THE INVENTION niZation of data, and extraction of information, subsets and other features of data, and to techniques for ef?cient compu [0002] The present invention relates generally to data tation With said organiZed data and features. More speci? denoising, robust empirical functional regression, interpola cally, the present invention relates to mathematically moti tion and extrapolation, and more speci?cally in some aspects vated techniques for ef?ciently empirically discovering to ?lling in missing data using nonlinear inference. Common useful metric structures in high-dimensional data, and for the challenges encountered in information processing and computationally ef?cient exploitation of such structures. knowledge extraction tasks involve corrupt data, either noisy [0009] It is an object of the present invention to automati or With missing entries. Some embodiments of the present cally augment search queries, modeling the intended context invention make e?icient use of the netWork of inferences and of a given search query by using prior knoWledge about the similarities betWeen the data points to create robust nonlinear user of the search and/or the context of the search. As in the estimators for missing entries. example above, the search term “gates” could be reWritten for [0003] Also, the present invention relates generally to data a CMOS technologist as “logic gates OR CMOS gates”, base searching, data organization, information extraction, While it could be rewritten as “Bill Gates” for an operating and data features extraction. More particularly, the present system softWare business pundit, and “iron gates” for a invention relates to personaliZed search of databases includ Wrought-iron specialist. For users With multiple interests, ing intranets and the Internet, and to mathematically moti several forms could be used. vated techniques for ef?ciently empirically discovering use [0010] It is an object of the present invention to augment a ful metric structures in high-dimensional data, and for the ?rst search query With extra search terms and Boolean logic, computationally ef?cient exploitation of such structures. The based on the ?rst query as Well as some additional knoWledge methods disclosed relate as Well to improvement of informa about the intention of the user including but not limited to user tion retrieval processes generally, by providing methods of preferences, interests, prior search choices, bookmarks, augmenting these processes With additional information that emails, ?les, Web sites and blogs read or frequented by the re?nes the scope of the information to be retrieved. user, etc. This augmentation can then be used to construct a [0004] Search terms have different meanings in different second search query; the augmented query. contexts. Prior art search engines, such as Google, typically [0011] It is an object of the present invention to use statis use a single method of interpretation and scoring of search tical aspects of one or more relevant corpora of documents, in results. Thus, in Google for example, the most popular mean part, to de?ne the interests of a user or class of users. For ing of a particular search term Will end up being prioritiZed example, to apply the present invention to the augmentation over alternate, less popular, meanings. HoWever, often the of search queries to speci?cally search for results relevant for user really intends to search for the alternate meaning(s). For baseball enthusiasts, a corpus of documents may be used that example, the search query term “gates” may mean “logic consists of baseball neWs articles, baseball encyclopedia gates”, “Bill Gates”, “Wrought-iron gates”, etc. In each case, entries, baseball Website content & blogs, and the like. the addition of extra keyWords could serve to disambiguate [0012] It is an object of the present invention to use statis the search query. HoWever, often a user does not realiZe that tical aspects of the interaction betWeen a ?rst search query these extra terms are needed, or otherWise does not Wish to put and the one or more relevant corpora of documents, to de?ne in the time or effort perfecting the search query. one or more second search queries. For example, suppose that [0005] Consequently there is a need for a personaliZed in a baseball speci?c corpus, those documents that contain the search engine technology capable of augmenting a ?rst query Word “positions” are much more likely than average to search query, based on some additional knoWledge about the also contain the associated terms “?rst base”, “second base”, intention of the user. More generally, there is a need for “third base”, “shortstop”, “out?eld”, “pitcher,”, “catcher”, information retrieval technology that factors in additional etc. Then an embodiment of the present invention can, for knoWledge to return improved results. example, given as input the query Word, produce a second US 2010/0274753 A1 Oct. 28, 2010 search query that is made from the query Word, With the set of directions in the high dimensional space that are char addition of the associated terms, and some Boolean connec acteristic of the contrast betWeen the subset of points, and the tors. For example, “positions” can become: “positions AND full set of points corresponding to the Whole corpus. The (‘?rst base’ OR ‘second base’ OR ‘third base’ OR ‘shortstop’ embodiment then selects those other points that are also OR ‘out?eld’ OR ‘pitcher’ OR ‘catcher’)”. Within or near the region, or are also disposed along the [0013] In this regard, an embodiment of the present inven directions in the high dimensional space, and the music ?les tion comprises a search query rewriting system Which takes as (or, e.g., a list of pointers or indexes thereto) corresponding to input a ?rst query. The ?rst query is used to run a ?rst search the data points are returned as the results of the improved on a ?rst corpus of documents, returning a ?rst subset of “query by example”. It should be noted that in order to carry documents in response to the ?rst search. Word frequency out the steps described, one needs only a statistical charac statistics are computed for the ?rst subset of documents. ter‘ization of the large set of points to be searched, as Well as These statistics are compared With the corresponding Word set of points given as examples. Hence it Will be readily seen frequency statistics for the corpus as a Whole, or for the by one skilled in the art that it is not necessary to characterize language as a Whole. Resultant Words are identi?ed for Which every music ?le individually, in order to use the disclosed the difference betWeen the Word’s frequency in the ?rst sub set method to improve information retrieval processes. of documents, as compared With the corresponding Whole corpus or Whole-language frequencies, is largest (e. g. above a [0016] The fr_matr_bin-type embodiments relate in part to given threshold, or, say, the 5 largest). A second query is methods for ?nding objects that have similarity or a?inity to formed consisting of the ?rst query, Boolean connectors, and some other target objects or search query results. In accor the resultant words. (eg <?rst query> AND Wordl OR dance With an embodiment of the present invention, diffusion geometries also relate in part to methods for ?nding similarity Word2 OR . . . OR Word5). A second search is then run on a or a?inity betWeen objects. In this regard, elements disclosed second one or more corpora of documents, for example on the Internet. The second search is a search for documents that herein relating to the use of fr_matr_bin-type embodiments match the second query. The results of the second search are on the one hand, and on the other hand elements disclosed returned to the user. herein relating to the use of diffusion geometry, can be inter changed. [0014] One of skill in the art Will readily see that While the present invention is disclosed in terms of search query reWr‘it [0017] In accordance With an embodiment of the present ing, the techniques disclosed relate more generally to the invention (see FIG. 1), corpora (5) and (9) of data is used to improvement of information retrieval processes. To this end, add meaning to the query. Hence, it is only necessary that in some aspects it is obj ect of the present invention to improve corpora (5) and (9) be a “rich enough” statistical sample of the information retrieval processes generally, by providing meth full set of documents (i.e., music ?les). It is appreciated that ods of augmenting the processes With additional information this “rich enough” statistical sample can be accomplished in that re?nes the scope of the information to be retrieved. Gen a number of Ways standard in the art. For example, the statis erally these statistical information about one or more corpora tical sample can be obtained iteratively by trying a small of data elements, and the interaction betWeen a ?rst data subset, collecting and storing the results of a number of typi retrieval speci?cation and the one or more relevant corpora of cal/popular queries, and then adding more documents at ran dom and performing the same typical/popular queries. If the data elements, is used to de?ne one or more second data retrieval speci?cations. The second data retrieval speci?ca results are roughly the same, then stop adding more docu tions are used to retrieve information of a more relevant ments. HoWever, if the results are not roughly the same, then scope, from a second one or more corpora of data elements. add more documents at random until the process stabilizes, We sometimes refer broadly to the class of embodiments i.e., results are roughly the same. Alternatively, one can per described in this paragraph as fr_matr_bin-type. This name form some other measure of statistical completeness/change comes from the name of a particular set of algorithms Within in adding a feW more documents, or any other method for the broad class, but the term “fr_matr_bin-type” is meant to statistical completeness or signi?cance. refer to this general class of embodiments just described. [0018] In accordance With an exemplary embodiment of [0015] In this regard, an embodiment of the present inven the present invention, for example for music ?les, the present tion comprises a search by example system. For illustration, invention characterizes the music ?les With “extra features” to compute music a?inity (or generally, music “meaning”) or We Will consider such a system Working on a set of datapoints in a high-dimensional space. More speci?cally, We Will use as obtain a “rich enough” statistical sample (i.e., in the corpora an example the problem of music similarity “search by (5) and (9)). The corpus (13) of music ?les necessary to example”. In such embodiment, a search engine is disposed to perform information retrieval needs to be a full set of all search through a corpus of digital music ?les. For each ?le, available documents (i.e., music ?les), but the present inven the system has pre-computed a set of numerical coordinates tion, at least in certain embodiments, does not need to char that characterize various standard aspects of the ?le. In this acterize these music ?les With “extra features” as With the corpora (5) and (9). Way the embodiment can treat the corpus of data as a set of points in a high dimensional space. Such characteristic [0019] In another aspect, the present systems and methods numerical coordinates are knoWn to those of skill in the art, described relate herein are applicable to diffusion geometry and include, but are not limited to, timberal Fourier, MERL and document analysis, processing and information extrac and cep stral coe?icients, Hidden Markov Model parameters, tion. These methods and systems described herein are appli dynamic range vs. time parameters, etc. In an exemplary cable at least in the case in Which, as is typical, the given data query by example interface, a user speci?es a feW music ?les to be analyzed can be thought of as a collection of data from the corpus of digital music ?les. The embodiment then objects, and for Which there is some at least rudimentary characterizes the coordinates of the subset of points associ notion of What it means for tWo data objects to be similar, ated With the speci?ed feW music ?les, and selects a region or close to each other, or nearby. US 2010/0274753 A1 Oct. 28, 2010 [0020] In an embodiment, the present invention relates to these mathematical objects. For example, a vector could be the fact that certain notions of similarity or nearness of data represented, but is not limited to being represented, as an objects (including but not limited to conventional Euclidean ordered n-tuple of ?oating point numbers, stored in a com metrics or similarity measures such as correlation, and many puter. A function could be represented, but is not limited to be others described beloW) are not a priori very useful inference represented, as a sequence of samples of the function, or tools for sorting high dimensional data. In one aspect of the coe?icients of the function in some given basis, or as sym present invention, We provide techniques for remapping digi bolic expressions given by algebraic, trigonometric, transcen tal documents, so that the ordinary Euclidean metric becomes dental and other standard or Well de?ned function expres more useful for these purposes. Hence, data mining and infor sions. mation extraction from digital documents can be consider ably enhanced by using the techniques described herein. The [0026] In the present invention it is convenient to think of a techniques relate to augmenting given similarity or nearness digital document as an ordered list of numbers (coordinates) representing parametric attributes of the document. Note that concepts or measures With empirically derived diffusion geometries, as further de?ned and described herein. this representation is used as an illustrative and not a limiting [0021] An aspect of the present invention relates to the fact concept, and one skilled in the art Will readily understand hoW that, Without the present invention, it is not practical to com the examples described above, and many others, can be pute oruse diffusion distances on high dimensional data. This brought in to such a form, or treated in other forms of repre is because standard computations of the diffusion metric sentation, by techniques that are substantially equivalent to require d*n2 or even d*n3 number of computations, Where d is those describe herein. the dimension of the data, and n the number of data points. [0027] Such digital documents, eg images and text docu This Would be expected because there are O(n2) pairs of ments having many attributes, typically have dimensions points, so one might believe that it is necessary to perform at exceeding 100. In accordance With an embodiment of the least n2 operations to compute all pairWise distances. HoW present invention, the use of given metrics (i.e., notions of ever, the present invention, as disclosed, includes a method similarity, etc.) in digital document analysis is restricted only for computing a dataset, often in linear time O(n) or O(n to the case of very strong similarity betWeen documents, a log(n)), from Which approximations to these distances, to similarity for Which inference is self evident and robust. Such Within any desired precision, can be computed in ?xed time. similarity relations are then extended to documents that are [0022] The present invention provides a natural data driven not directly and obviously related by analyZing all possible self-induced multiscale organiZation of data in Which differ chains of links or similarities connecting them. This is ent time/ scale parameters correspond to different representa achieved through the use of diffusions processes (processes tions of the data structure at different levels of granularity, that are analogous to heat-?oW in a mathematical sense that While preserving microscopic similarity relations. Will be described herein), and this leads to a very simple and [0023] Examples of digital documents in this broad sense, robust quantity that can be measured as an ordinary Euclidean could be, but are not limited to, an almost unlimited variety of distance in a loW dimensional embedding of the data. The possibilities such as sets of object-oriented data objects on a term embedding as used herein refers to a “diffusion map” computer, sets of Web pages on the World Wide Web, sets of and the distance thereby de?ned as a “diffusion metric.” document ?les on a computer, sets of vectors in a vector [0028] In yet another aspect, the present invention relates in space, sets of points in a metric space, sets of digital or analog part to in?uencing the position or presence on a search result signals or functions, sets of ?nancial histories of various list generated by a computer netWork search engine and for kinds (eg stock prices over time), sets of readouts from a in?uencing a position or presence or placement Within an scienti?c instrument, sets of images, sets of videos, sets of advertising section of document or rendering of a document audio clips or streams, one or more graphs (i.e. collections of or meta-document on a computer netWork. In part, systems nodes and links), consumer data, relational databases, to and methods are disclosed for enabling information providers name just a feW. using a computer netWork such as the Internet to in?uence a [0024] In each of these cases, there are various useful con position for a search listing Within a search result list gener cepts of said similarity, closeness, and nearness. These ated by a computer netWork search engine and for in?uencing include, but are not limited to, examples given in the present a position or presence or placement of a listing Within a disclosure, and many others knoWn to those skilled in the art, document or rendering of a document or meta-document on a including but not limited to cases in Which the content of the computer netWork. The term listing as used herein refers to data objects is similar in some way (eg for vectors, being any digital document content that a provider Wishes to have close With respect to the norm distance) and/or if data objects listed, rendered, displayed, or otherWise delivered using a are stored in a proximal Way in a computer memory, or disk, computer netWork, by one practicing the present invention. etc, and/or if typical user-interaction With the objects is simi Such a listing can be, but is not limited to banner advertise lar in some way (eg tends to occur at similar time, or With ments, text advertisements, video clips and other media, and similar frequency), and/or if, during an interactive process, a can be as simple as a link to another Web page or Web site. The user or operator of the present invention indicates that the term advertising opportunity herein refers to any instance objects in question are similar, or assigns a quantitative mea Where there is an opportunity to position a search listing, or sure of similarity, etc. In the case of nodes in a graph, or in the position, place or present a listing Within an advertising or case of tWo Web pages on the Internet, the objects can be other section Within a document or rendering of a document thought of as similar for reasons including, but not limited to, or meta-document on a computer netWork. The term adver cases in Which there is a link from one to the other. tising as used herein refers to any act of listing, rendering, [0025] Note that, in practical terms, although mathematical displaying, or otherWise delivering a listing or other content objects, such as vectors or functions, are discussed herein, the using a computer netWork, in exchange for compensation or present invention relates to real-World representations of other value. US 2010/0274753 A1 Oct. 28, 2010 [0029] More generally, in this aspect, the present invention [0037] The diffusion geometric techniques and other tech relates to the strategic matching of online content for optimi niques disclosed herein provide a neW and novel means of Zation of collaborative opportunities for one Web page or Web displaying advertisements that are related to content and for site to display content related to another Web page orWeb site. Which preferential positioning of the advertisements dis Examples of such use include, but are not limited to: played can be determined by relevance to the context, as Well as in?uenced by a bidding process or other economic consid [0030] l. the addition of links to a Web site, designed to erations. Algorithms for preferential positioning of advertise increase intra-site click through rate; ments, etc, are disclosed herein. [0031] 2. the addition of links betWeen a strategic set of [0038] An aspect of the present invention relates to the Web sites, designed to increase inter-site click through application of the above algorithm and related ones, to the rates; and problem of automatically designing or augmenting the links [0032] 3. the provision of services designed to pair up Within a single company’s Web site. Web companies often product and service listings With advertising opportuni Wish to increase the amount of tra?ic on their Web sites, and ties the amount of time and volume of data vieWed by customers [0033] In accordance With an embodiment of the present of their sites. Offering links from pages on the site to related invention, the system and method provides a database having pages on the site provides a proactive replacement for an accounts for the listing providers. Each account contains con outside search engine. Users Will be able to ?nd What they tact and billing information for a listing provider. In addition, need (e. g. if they enter a site from the result of a search each account contains at least one search listing having at engine), and then ?nd related information, and thus be moti least tWo components: 1. at least one digital document vated to “explore” the site. This is true for sites in general, and describing the product, service or other listing to be posi also speci?cally When the site in question is one that contains tioned, placed, or presented; and 2. a bid amount, Which is catalog-like or other listings of products and services. In a preferably a money amount, for a listing. The listing provider store, customers often begin shopping by looking at one prod may add, delete, or modify a search listing after logging into uct but end up buying another product. By having tight links his or her account via an authentication process. The present betWeen related products, online sites can achieve this same invention includes methods for determining the eligibility of “emotional buying” phenomenon. any listing for any given advertising opportunity. During an [0039] An aspect of the present invention relates to the advertising opportunity, the selection of or positioning of a application of the above algorithm and related ones, to the listing is in?uenced by a continuous online competitive bid problem of automatically designing or augmenting the links ding process. The bidding process occurs Whenever an adver betWeen tWo or more companies’ Web sites. Web companies tising opportunity arises. The system and method of the often Wish to increase the amount of traf?c that they receive present invention then compares all bid amounts for those from or provide to af?liated sites. The present invention pro listings eligible for the advertising opportunity in question, vides a method to design or augment the links betWeen these and generates a rank value for all eligible listings. The rank sites, thereby linking related content, and organically increas value generated by the bidding process determines Where the ing this traf?c. One skilled in the art Will see hoW to do this, netWork information providers listing Will appear in the con and hoW it results in economic bene?t to the parties in ques text determined by the advertising opportunity. A higher bid tion, each in a Way analogous to the case described in the by a netWork information provider Will result in a higher rank previous paragraph. value and a more advantageous placement. [0040] In accordance With an embodiment of the present [0034] There are current systems that, for example, display invention, a method and system retrieves information in advertisements Within a paid section of a Web page, Wherein response to an information retrieval request comprises the choice of advertisements displayed relates to keyWord extracting additional information from a ?rst corpus of data matching and other similar techniques, and the preferential elements based on the request. The request is modi?ed based positioning of the advertisements displayed is determined by on the additional information to re?ne the scope of informa a bidding process. For example, Google, Inc. practices this tion to be retrieved from a second corpus of data elements. technique (see “Google AdSense” at: <http://WWW.google. The information is retrieved from the second corpus of data com/ads/>). elements based on the modi?ed request. [0035] There are current systems that, for example, display [0041] In accordance With an embodiment of the present advertisements Within a section of a search engine query invention, a method of in?uencing traf?c betWeen predeter result page, Wherein the choice of advertisements displayed mined Web pages comprises the steps of: determining diffu relates to keyWord matching and other similar techniques, sion geometry coordinates of a set of Web pages, the set of and the preferential positioning of the advertisements dis Web pages comprising at least one of the predetermined Web played is determined by a bidding process. For example, pages; and determining links betWeen the Web pages based on Google, practices this technique (see “Google Ad Words” at: the diffusion geometry coordinates. <http://WWW.google.comiads/>). [0042] In accordance With an embodiment of the present [0036] In these current systems, advertisements are placed invention, a computer readable medium comprises code for by a method that uses keyWords, but keyWords can be retrieving information in response to an information retrieval ambiguous. For example, the keyWord “nails” might bring up request, the code comprising instructions for: extracting addi advertisements for hardWare stores in these prior art systems, tional information from a ?rst corpus of data elements based even When searched from a Website about Women’s beauty, on the request; modifying the request based on the additional Where results about nail polish, etc, are more appropriate as information to re?ne the scope of information to be retrieved top advertisements. Hence there is a need for methods and from a second corpus of data elements; and retrieving infor systems as disclosed herein, Which, in part, are able to resolve mation from the second corpus of data elements based on the such ambiguities. modi?ed request. US 2010/0274753 A1 Oct. 28, 2010 [0043] In accordance With an embodiment of the present [0051] In accordance With an exemplary embodiment of invention, a computer readable medium comprises code for the present invention, the data matrix d(q, r) comprises initial in?uencing traf?c betWeen predetermined Web pages, the customer preference data and the inventive method for infer code comprising instructions for: determining diffusion ring/ estimating missing values in a data matrix d(q, r) further geometry coordinates of a set of Web pages, the set of Web comprises the step of predicting additional customer prefer pages comprising at least one of the predetermined Web ences from the data matrix d(q, r). pages; and determining links betWeen the Web pages based on [0052] In accordance With an exemplary embodiment of the diffusion geometry coordinates. the present invention, the data matrix d(q, r) comprises mea [0044] In accordance With an embodiment of the present sured values of an empirical function f(q, r) and the invention invention, a system for retrieving information in response to method for inferring/estimating missing values in a data an information retrieval request comprises: an extracting matrix d(q, r) further comprises the step of nonlinear regres module for extracting additional information from a ?rst cor sion modeling of the empirical function f(q, r). pus of data elements based on the request; a processing mod [0053] In accordance With an exemplary embodiment of ule for modifying the request based on the additional infor the present invention, the data matrix d(q, r) is a questionnaire mation to re?ne the scope of information to be retrieved from d(q, r) and the inventive method further comprises the steps of a second corpus of data elements; and a retrieving module for determining Whether a response (qo, r0) to the questionnaire retrieving information from the second corpus of data ele d(q, r) is an anomalous response. ments based on the modi?ed request. [0054] In accordance With an exemplary embodiment of [0045] In accordance With an embodiment of the present the present invention, the inventive method further comprises invention, a system for in?uencing tra?ic betWeen predeter the steps of generating a dataset d1 (q, r) comprising responses mined Web pages comprises a processing module for deter to the questionnaire d(q, r), omitting the response (qo, r0) from mining diffusion geometry coordinates of a set of Web pages, the dataset d1 (q, r), reconstructing the missing response (qo, the set of Web pages comprising at least one of the predeter r0) from the dataset d1(q, r) to provide a reconstructed value, mined Web pages; and determining links betWeen the Web comparing the reconstructed value to the response (qo, r0), pages based and determining the response (qO, r0) to be anomalous When a [0046] In accordance With an exemplary embodiment of distance betWeen the reconstructed value and the response the present invention, a method for inferring/ estimating miss (qo, r0) is larger than a pre-determined threshold. ing values in a data matrix d(q, r) having a plurality of roWs [0055] In accordance With an exemplary embodiment of and columns comprises the steps of: organiZing the columns the present invention, the data matrix d(q, r) comprises data of the data matrix d(q, r) into a?inity folders of columns With relevant to fraud or deception and the inventive method fur similar data pro?le, organiZing the roWs of the data matrix ther comprises the step of detecting fraud or deception from d(q, r) into af?nity folders of roWs With similar data pro?le, said data matrix d(q, r). forming a graph Q of augmented roWs and a graph R of [0056] In accordance With an exemplary embodiment of augmented columns by similarity or correlation of common the present invention, a computer readable medium com entries; and expanding the data matrix d(q, r) in terms of an prises code for inferring/estimating missing values in a data orthogonal basis of a graph Q><R to infer/estimate the missing matrix d(q, r) having a plurality of roWs and columns. The values in said data matrix d(q, r).on the diffusion geometry code comprises instructions for organiZing the columns of coordinates. said data matrix d(q, r) into af?nity folders of columns With similar data pro?le, organiZing the roWs of said data matrix [0047] In accordance With an exemplary embodiment of the present invention, the data matrix d(q, r) comprises ques d(q, r) into af?nity folders of roWs With similar data pro?le, tionnaire data and the inventive method for inferring/estimat forming a graph Q of augmented roWs and a graph R of ing missing values in a data matrix d(q, r) additionally com augmented columns by similarity or correlation of common entries; and expanding the data matrix d(q, r) in terms of an prises the step of ?lling in an unknown response to a orthogonal basis of a graph Q><R to infer/estimate the missing questionnaire to infer/estimate missing values in the data matrix d(q, r). values in the data matrix d(q, r). [0057] Various other objects, advantages and features of the [0048] In accordance With an exemplary embodiment of present invention Will become readily apparent from the the present invention, the inventive method for inferring/ ensuing detailed description, and the novel features Will be estimating missing values in a data matrix d(q, r) additionally particularly pointed out in the appended claims. comprises the step of expanding the data matrix d(q, r) in terms of a tensor product of Wavelet bases for graphs Q and R. BRIEF DESCRIPTION OF THE DRAWINGS [0049] In accordance With an exemplary embodiment of the present invention, the inventive method for inferring/ [0058] For a more complete understanding of the present estimating missing values in a data matrix d(q, r) additionally invention, reference is noW made to the folloWing descrip comprises the steps of, for each tensor Wavelet in basis, com tions taken in conjunction With the accompanying draWing, in puting a Wavelet coe?icient by averaging on the support of the Which: tensor Wavelet and retaining the coe?icient in the expansion [0059] FIG. 1 shoWs a block diagram of a contextualiZed only if validated by a randomiZed average. search engine in accordance With an embodiment of the [0050] In accordance With an exemplary embodiment of present invention; the present invention, the inventive method for inferring/ [0060] FIG. 2 shoWs a schematic representation of an imag estimating missing values in a data matrix d(q, r) additionally ined forest, With trees and shrubs, presumed to burn at differ comprises the steps of constructing diffusion Wavelets and ent rates; taking supports of the resulting diffusion Wavelets at a ?xed [0061] FIG. 3 shoWs an exemplary ?oW chart for comput scale on said columns of said graph R, for at least one of the ing multiscale diffusion geometry in accordance With an organiZing step. embodiment of the present invention; and US 2010/0274753 A1 Oct. 28, 2010 [0062] FIG. 4 illustrates a Public Find Similar Document [0077] Note that in certain embodiments the corpora (9) Internet Utility in accordance With an embodiment of the can be taken to be the same as (5). In such case, it is seen that present invention. the algorithm of the present invention acts to ?nd those Words [0063] The discussion associated With the ?gure illustrates Which are much more likely to occur in documents that meet an embodiment of the present invention in the context of the ?rst search query criteria, Within the subject(s) of interest analysis of the spread of ?re in the forest, and illustrates a use to the user of the search, as compared With the generic occur of the embodiment in the analysis of diffusion in a netWork. rence of the Words Within the subject(s) of interest to the user of the search. In other variants of the algorithm, (9) and (10) DETAILED DESCRIPTION OF THE are omitted, f1:0, and (7) d:f0 (6). EMBODIMENTS [0078] The corpora (13) can be, in certain embodiments, the entire Internet, or the set of documents indexed by a public [0064] As shoWn in FIG. 1, there is illustrated a How chart or private search engine. Since, in certain embodiments, the describing an exemplary method in accordance With an algorithm of the present invention takes a ?rst search query, embodiment of the present invention (fr_matr_bin( ): and produces a second search query, each suitable for full text [0065] Step 110: A user (1) enters a ?rst search query (2) search, these queries can be passed to search engines via into a search query user interface (3). techniques standard in the art, including but not limited to [0066] Step 120: The query (2) is sent to a ?rst search HTTP requests and/or netWork interfaces such as SOAP. The engine (4). results returned by these search engines can be displayed as is [0067] Step 130: The ?rst search engine (4) performs a standard in the art, including but not limited to display in a search on a ?rst one or more corpora of documents (5) broWser by rendering results encoded With HTML, XML, using the query (2). Java, JavaScript, Python, Peri, PHP, etc. [0068] Step 140: Mean Word frequencies f0 (6) are com [0079] In certain embodiments, at least on of the searches puted on the set of documents returned by the ?rst search described can be performed by matrix techniques. More spe engine (4). ci?cally, suppose that one has a set of N documents, With a [0069] Step 150: Mean Word frequencies f1 (10) are vocabulary or reduced vocabulary of M Words. One can then computed for a second one or more corpora of docu form the N><M matrix W, so that W(i,j )?he number of times ments (9). (It is appreciated that this step can be done that Word number j occurs in document number i. once at initialization.) [0080] In certain embodiments, provisions are made to [0070] Step 160: The difference d (7) f0-f1:is calcu ignore stop Words. Stop Words are Words that are commonly lated. used, such as “the,” “an,” or “and”, that are often deliberately [0071] Step 170: The set of Words (8) is identi?ed corre ignored by search applications When responding to a query. sponding to those top K Words for Which d (7) is greatest Often stop Words are the most common Words in the lan (for some ?xed parameter K), or e. g., to those Words for guage. In some embodiments, sets of stop Words are aug Which d is greater than some thresholdt (for some ?xed mented by adding additional words (eg Common Words) parameter t). that are speci?c to the corpora used. [0072] Step 180: A neW search query (11) is de?ned by [0081] In certain embodiments, provisions are made to cor combining the ?rst query (2) and the set of Words (8). For rect spelling errors. This can be done, for example, by using example if the ?rst query (2) is “nail”, and the set of SOUNDEX scores to identify Words that are misspelled but Words (8) is {“polish”, “beauty”, “manicure”}, then the are most likely meant to be other given Words. One can also neW search query (11) could be “nail AND (polish OR employ other techniques, such as a list of commonly mis beauty OR manicure)”. Other algorithms for this com spelled Words, phrases and queries. In the present context, bination are disclosed herein. statistics and other information, including but not limited to [0073] Step 190: The neW query is sent to a second information from the corpora and/or the search logs, can be search engine (12) disposed to search a third one or more used to identify misspellings and likely suggested replace corpora of documents (13). ments for input queries. Spelling errors in the corpora can also [0074] Step 200: The results returned by the second be ?agged and automatically, semi-automatically, partially search engine (12) are displayed on a search result user assisted or manually corrected. interface (14). [0082] In accordance With embodiments of the present [0075] In certain embodiments, the corpora (9) represent invention, certain Word frequency coe?icients, or differences the language as a Whole. For example, if the target searches betWeen Word frequencies, are set to Zero When they are are conducted in English, then corpora (9) can be a random beloW a given threshold. In this Way, “noise” is removed from sample of documents in the English language. The corpora the process. For example, in the case Where documents are (5) are used to de?ne the subject(s) of interest to the user of the being tested for the presence of a set of Words or phrases as in search. For example, if the subject of interest is Maj or League the search in step 130 of FIG. 1, one can take only those Baseball, then the documents in question can be a Web-craW documents that contain the phrase more than a certain number of WWW.mlb.com, as Well as neWs articles, encyclopedia of times. This number can be ?xed, or it can be some fraction articles, etc, on the subject of baseball. of the average number, Where the average is taken, for [0076] In this Way, it is seen that the algorithm of the present example, over the set of documents for Which the value is at invention, in certain embodiments, acts to ?nd those Words least 1. A corresponding type of threshold can also be applied Which are much more likely to occur in documents that meet in one or more of steps, for example to steps 170, 180 or 190. the ?rst search query criteria, Within the subject(s) of interest [0083] In certain embodiments, searches are implemented to the user of the search, as compared With the generic occur in part using sparse matrix representations. For example, rence of the Words Within the target search language as a given the matrix W(i,j) as described herein, for a ?rst one or Whole. more corpora, and an initial search query based on the pres

Description:

technique (see “Google AdSense” at: ). embodiments, the ab solute requirement of appearing together is relaxed to

Methods for filtering data and filling in missing data using nonlinear inference PDF

28 Pages·2013·3.66 MB·English

Checking for file health...

Download

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Download Methods for filtering data and filling in missing data using nonlinear inference PDF Free - Full Version

by Unknow| 2013| 28 pages| 3.66| English

Download Methods for filtering data and filling in missing data using nonlinear inference by in PDF format completely FREE. No registration required, no payment needed. Get instant access to this valuable resource on PDFdrive.to!

Free Download PDF

About Methods for filtering data and filling in missing data using nonlinear inference

technique (see “Google AdSense” at: ). embodiments, the ab solute requirement of appearing together is relaxed to

Detailed Information

Author:	Unknown
Publication Year:	2013
Pages:	28
Language:	English
File Size:	3.66
Format:	PDF
Price:	FREE

Download Free PDF

Safe & Secure Download - No registration required

Why Choose PDFdrive for Your Free Methods for filtering data and filling in missing data using nonlinear inference Download?

100% Free: No hidden fees or subscriptions required for one book every day.
No Registration: Immediate access is available without creating accounts for one book every day.
Safe and Secure: Clean downloads without malware or viruses
Multiple Formats: PDF, MOBI, Mpub,... optimized for all devices
Educational Resource: Supporting knowledge sharing and learning

Frequently Asked Questions

Is it really free to download Methods for filtering data and filling in missing data using nonlinear inference PDF?

Yes, on https://PDFdrive.to you can download Methods for filtering data and filling in missing data using nonlinear inference by completely free. We don't require any payment, subscription, or registration to access this PDF file. For 3 books every day.

How can I read Methods for filtering data and filling in missing data using nonlinear inference on my mobile device?

After downloading Methods for filtering data and filling in missing data using nonlinear inference PDF, you can open it with any PDF reader app on your phone or tablet. We recommend using Adobe Acrobat Reader, Apple Books, or Google Play Books for the best reading experience.

Is this the full version of Methods for filtering data and filling in missing data using nonlinear inference?

Yes, this is the complete PDF version of Methods for filtering data and filling in missing data using nonlinear inference by Unknow. You will be able to read the entire content as in the printed version without missing any pages.

Is it legal to download Methods for filtering data and filling in missing data using nonlinear inference PDF for free?

https://PDFdrive.to provides links to free educational resources available online. We do not store any files on our servers. Please be aware of copyright laws in your country before downloading.

The materials shared are intended for research, educational, and personal use in accordance with fair use principles.