Table Of ContentUS 20100274753A1
(19) United States
(12) Patent Application Publication (10) Pub. No.: US 2010/0274753 A1
Liberty et al. (43) Pub. Date: Oct. 28, 2010
(54) METHODS FOR FILTERING DATA AND 863 is a continuation-in-part of application No.
FILLING IN MISSING DATA USING 11/165,633, ?led on Jun. 23, 2005, noW abandoned.
NONLINEAR INFERENCE
(60) Provisional application No. 60/779,958, ?led on Mar.
7, 2006, provisional application No. 60/610,841, ?led
(76) Inventors: Edo Liberty, New Haven, CT (U S);
on Sep. 17, 2004, provisional application No. 60/ 697,
Steven Zucker, Harnden, CT (US);
069, ?led on Jul. 5, 2005, provisional application No.
Yosi Keller, Rehovol (IL); Mauro
60/582,242, ?led on Jun. 23, 2004.
M. Maggioni, Durham, NC (US);
Ronald R. Coifman, North Haven, Publication Classi?cation
CT (US); Frank GeshWind,
(51) Int. Cl.
Madison, CT (US)
G06N 5/02 (2006.01)
Correspondence Address: (52) U.S. Cl. ........................................................ .. 706/50
FULBRIGHT & JAWORSKI, LLP (57) ABSTRACT
666 FIFTH AVE
The present invention is directed to a method for inferring/
NEW YORK, NY 10103-3198 (US)
estimating missing values in a data matrix d(q, r) having a
plurality of roWs and columns comprises the steps of: orga
(21) App1.No.: 12/784,155
niZing the columns of the data matrix d(q, r) into a?inity
folders of columns With similar data pro?le, organizing the
(22) Filed: May 20, 2010
roWs of the data matrix d(q, r) into a?inity folders of roWs
With similar data pro?le, forming a graph Q of augmented
Related U.S. Application Data
roWs and a graph R of augmented columns by similarity or
(63) Continuation of application No. 11/715,863, ?led on correlation of common entries; and expanding the data matrix
Mar. 7, 2007, noW abandoned, Which is a continuation d(q, r) in terms of an orthogonal basis of a graph Q><R to
in-part ofapplication No. 11/230,949, ?led on Sep. 19, infer/ estimate the missing values in said data matrix d(q, r).on
2005, noW abandoned, said application No. 1 1/715, the diffusion geometry coordinates.
10
Patent Application Publication Oct. 28, 2010 Sheet 1 0f 3 US 2010/0274753 A1
Figure.- 1
Figure 2
Patent Application Publication Oct. 28, 2010 Sheet 2 of3 US 2010/0274753 A1
Figure 3
mad
$22111 .
Q
i 2 2" df a0) / ?
‘990 npui 3.33 (2231:: , UL, ’
(xdwefeafniatu ihn ?1n)i,t ay)n dim I {Zjixka?g m. o
‘5
_ To (X,§I}=T(x,y}if m LOCQZGLQP?)
‘351$ x: y > a’ g]
8, T1 (xyh?sthenvise
)4} =€sheinéex setof¥f>
mm‘ ‘
1520 __> __ ..p x¢=o Tl “1331-;
restricted w I’? and
wzittanasamatrix ,
on R. ‘ »
men- '
.v ,Pé :
i=i + i _ -
__ mm
5 5*! “ ’
1W“ ..v
\ hog
Sizem ) < K
m i = gm?
I Output x;, Pg, and ‘R,
for ail relevant i
Patent Application Publication Oct. 28, 2010 Sheet 3 0f3 US 2010/0274753 A1
Figure 4
The Warm Wide Web
PFA
I PF?
Search Request Harmer Worm Wide Web Document
and Resuits Dispiayer Acquisition Engine
PFG PFE PFK PFB
PM * PF2
Qocument Comparison Document Comparisan
Search Engine indexer
PA /PFC
Document and Comparison
informatian Database
US 2010/0274753 A1 Oct. 28, 2010
METHODS FOR FILTERING DATA AND [0006] The term “data mining” as used herein broadly
FILLING IN MISSING DATA USING refers to the methods of data organization and subset and
NONLINEAR INFERENCE feature extraction. Furthermore, the kinds of data described
or used in data mining are referred to as (sets of) “digital
RELATED APPLICATION documents.” Note that this phrase is used for conceptual
illustration only, can refer to any type of data, and is not meant
[0001] This application claims priority bene?t under Title to imply that the data in question are necessarily formally
35 U.S.C. §1 19(e) of US. Provisional PatentApplication No.
documents, nor that the data in question are necessarily digi
60/779,958, ?led Mar. 7, 2006, Which is incorporated by
tal data. The “digital documents” in the traditional sense of
reference in its entirety. Also, this application is continuation
the phrase are certainly interesting examples of the kinds of
in-part ofU.S. application Ser. No. 11/230,949, ?led Sep. 19,
data that are addressed herein.
2005, Which claims priority bene?t under Title 35 U.S.C.
§119(e) of provisional patent application No. 60/610,841
OBJECTS AND SUMMARY OF THE
?led Sep. 17, 2004 and provisional patent application No.
INVENTION
60/697,069 ?led Jul. 5, 2005, each Which is incorporated by
reference in its entirety. Also, this application is a continua [0007] The present system and method described are herein
tion-in-part of US. patent application Ser. No. 11/165,633 applicable at least in the case in Which, as is typical, the given
?led Jun. 23, 2005, Which claims priority bene?t under Title data to be analyZed can be thought of as a collection of data
35 U.S.C. §119(e) of provisional patent application no. objects, and for Which there is some at least rudimentary
60/582,242 ?led Jun. 23, 2004, each Which is incorporated by notion of What it means for tWo data objects to be similar,
reference in its entirety. close to each other, or nearby.
[0008] The present invention relates to methods for orga
BACKGROUND OF THE INVENTION niZation of data, and extraction of information, subsets and
other features of data, and to techniques for ef?cient compu
[0002] The present invention relates generally to data tation With said organiZed data and features. More speci?
denoising, robust empirical functional regression, interpola cally, the present invention relates to mathematically moti
tion and extrapolation, and more speci?cally in some aspects vated techniques for ef?ciently empirically discovering
to ?lling in missing data using nonlinear inference. Common useful metric structures in high-dimensional data, and for the
challenges encountered in information processing and computationally ef?cient exploitation of such structures.
knowledge extraction tasks involve corrupt data, either noisy [0009] It is an object of the present invention to automati
or With missing entries. Some embodiments of the present cally augment search queries, modeling the intended context
invention make e?icient use of the netWork of inferences and of a given search query by using prior knoWledge about the
similarities betWeen the data points to create robust nonlinear user of the search and/or the context of the search. As in the
estimators for missing entries. example above, the search term “gates” could be reWritten for
[0003] Also, the present invention relates generally to data a CMOS technologist as “logic gates OR CMOS gates”,
base searching, data organization, information extraction, While it could be rewritten as “Bill Gates” for an operating
and data features extraction. More particularly, the present system softWare business pundit, and “iron gates” for a
invention relates to personaliZed search of databases includ Wrought-iron specialist. For users With multiple interests,
ing intranets and the Internet, and to mathematically moti several forms could be used.
vated techniques for ef?ciently empirically discovering use [0010] It is an object of the present invention to augment a
ful metric structures in high-dimensional data, and for the ?rst search query With extra search terms and Boolean logic,
computationally ef?cient exploitation of such structures. The based on the ?rst query as Well as some additional knoWledge
methods disclosed relate as Well to improvement of informa about the intention of the user including but not limited to user
tion retrieval processes generally, by providing methods of preferences, interests, prior search choices, bookmarks,
augmenting these processes With additional information that emails, ?les, Web sites and blogs read or frequented by the
re?nes the scope of the information to be retrieved. user, etc. This augmentation can then be used to construct a
[0004] Search terms have different meanings in different second search query; the augmented query.
contexts. Prior art search engines, such as Google, typically [0011] It is an object of the present invention to use statis
use a single method of interpretation and scoring of search tical aspects of one or more relevant corpora of documents, in
results. Thus, in Google for example, the most popular mean part, to de?ne the interests of a user or class of users. For
ing of a particular search term Will end up being prioritiZed example, to apply the present invention to the augmentation
over alternate, less popular, meanings. HoWever, often the of search queries to speci?cally search for results relevant for
user really intends to search for the alternate meaning(s). For baseball enthusiasts, a corpus of documents may be used that
example, the search query term “gates” may mean “logic consists of baseball neWs articles, baseball encyclopedia
gates”, “Bill Gates”, “Wrought-iron gates”, etc. In each case, entries, baseball Website content & blogs, and the like.
the addition of extra keyWords could serve to disambiguate [0012] It is an object of the present invention to use statis
the search query. HoWever, often a user does not realiZe that tical aspects of the interaction betWeen a ?rst search query
these extra terms are needed, or otherWise does not Wish to put and the one or more relevant corpora of documents, to de?ne
in the time or effort perfecting the search query. one or more second search queries. For example, suppose that
[0005] Consequently there is a need for a personaliZed in a baseball speci?c corpus, those documents that contain the
search engine technology capable of augmenting a ?rst query Word “positions” are much more likely than average to
search query, based on some additional knoWledge about the also contain the associated terms “?rst base”, “second base”,
intention of the user. More generally, there is a need for “third base”, “shortstop”, “out?eld”, “pitcher,”, “catcher”,
information retrieval technology that factors in additional etc. Then an embodiment of the present invention can, for
knoWledge to return improved results. example, given as input the query Word, produce a second
US 2010/0274753 A1 Oct. 28, 2010
search query that is made from the query Word, With the set of directions in the high dimensional space that are char
addition of the associated terms, and some Boolean connec acteristic of the contrast betWeen the subset of points, and the
tors. For example, “positions” can become: “positions AND full set of points corresponding to the Whole corpus. The
(‘?rst base’ OR ‘second base’ OR ‘third base’ OR ‘shortstop’ embodiment then selects those other points that are also
OR ‘out?eld’ OR ‘pitcher’ OR ‘catcher’)”. Within or near the region, or are also disposed along the
[0013] In this regard, an embodiment of the present inven directions in the high dimensional space, and the music ?les
tion comprises a search query rewriting system Which takes as (or, e.g., a list of pointers or indexes thereto) corresponding to
input a ?rst query. The ?rst query is used to run a ?rst search the data points are returned as the results of the improved
on a ?rst corpus of documents, returning a ?rst subset of “query by example”. It should be noted that in order to carry
documents in response to the ?rst search. Word frequency out the steps described, one needs only a statistical charac
statistics are computed for the ?rst subset of documents.
ter‘ization of the large set of points to be searched, as Well as
These statistics are compared With the corresponding Word
set of points given as examples. Hence it Will be readily seen
frequency statistics for the corpus as a Whole, or for the
by one skilled in the art that it is not necessary to characterize
language as a Whole. Resultant Words are identi?ed for Which
every music ?le individually, in order to use the disclosed
the difference betWeen the Word’s frequency in the ?rst sub set
method to improve information retrieval processes.
of documents, as compared With the corresponding Whole
corpus or Whole-language frequencies, is largest (e. g. above a [0016] The fr_matr_bin-type embodiments relate in part to
given threshold, or, say, the 5 largest). A second query is methods for ?nding objects that have similarity or a?inity to
formed consisting of the ?rst query, Boolean connectors, and some other target objects or search query results. In accor
the resultant words. (eg <?rst query> AND Wordl OR dance With an embodiment of the present invention, diffusion
geometries also relate in part to methods for ?nding similarity
Word2 OR . . . OR Word5). A second search is then run on a
or a?inity betWeen objects. In this regard, elements disclosed
second one or more corpora of documents, for example on the
Internet. The second search is a search for documents that herein relating to the use of fr_matr_bin-type embodiments
match the second query. The results of the second search are on the one hand, and on the other hand elements disclosed
returned to the user. herein relating to the use of diffusion geometry, can be inter
changed.
[0014] One of skill in the art Will readily see that While the
present invention is disclosed in terms of search query reWr‘it [0017] In accordance With an embodiment of the present
ing, the techniques disclosed relate more generally to the invention (see FIG. 1), corpora (5) and (9) of data is used to
improvement of information retrieval processes. To this end, add meaning to the query. Hence, it is only necessary that
in some aspects it is obj ect of the present invention to improve corpora (5) and (9) be a “rich enough” statistical sample of the
information retrieval processes generally, by providing meth full set of documents (i.e., music ?les). It is appreciated that
ods of augmenting the processes With additional information this “rich enough” statistical sample can be accomplished in
that re?nes the scope of the information to be retrieved. Gen a number of Ways standard in the art. For example, the statis
erally these statistical information about one or more corpora tical sample can be obtained iteratively by trying a small
of data elements, and the interaction betWeen a ?rst data subset, collecting and storing the results of a number of typi
retrieval speci?cation and the one or more relevant corpora of cal/popular queries, and then adding more documents at ran
dom and performing the same typical/popular queries. If the
data elements, is used to de?ne one or more second data
retrieval speci?cations. The second data retrieval speci?ca results are roughly the same, then stop adding more docu
tions are used to retrieve information of a more relevant ments. HoWever, if the results are not roughly the same, then
scope, from a second one or more corpora of data elements. add more documents at random until the process stabilizes,
We sometimes refer broadly to the class of embodiments i.e., results are roughly the same. Alternatively, one can per
described in this paragraph as fr_matr_bin-type. This name form some other measure of statistical completeness/change
comes from the name of a particular set of algorithms Within in adding a feW more documents, or any other method for
the broad class, but the term “fr_matr_bin-type” is meant to statistical completeness or signi?cance.
refer to this general class of embodiments just described. [0018] In accordance With an exemplary embodiment of
[0015] In this regard, an embodiment of the present inven the present invention, for example for music ?les, the present
tion comprises a search by example system. For illustration, invention characterizes the music ?les With “extra features”
to compute music a?inity (or generally, music “meaning”) or
We Will consider such a system Working on a set of datapoints
in a high-dimensional space. More speci?cally, We Will use as obtain a “rich enough” statistical sample (i.e., in the corpora
an example the problem of music similarity “search by (5) and (9)). The corpus (13) of music ?les necessary to
example”. In such embodiment, a search engine is disposed to perform information retrieval needs to be a full set of all
search through a corpus of digital music ?les. For each ?le, available documents (i.e., music ?les), but the present inven
the system has pre-computed a set of numerical coordinates tion, at least in certain embodiments, does not need to char
that characterize various standard aspects of the ?le. In this acterize these music ?les With “extra features” as With the
corpora (5) and (9).
Way the embodiment can treat the corpus of data as a set of
points in a high dimensional space. Such characteristic [0019] In another aspect, the present systems and methods
numerical coordinates are knoWn to those of skill in the art, described relate herein are applicable to diffusion geometry
and include, but are not limited to, timberal Fourier, MERL and document analysis, processing and information extrac
and cep stral coe?icients, Hidden Markov Model parameters, tion. These methods and systems described herein are appli
dynamic range vs. time parameters, etc. In an exemplary cable at least in the case in Which, as is typical, the given data
query by example interface, a user speci?es a feW music ?les to be analyzed can be thought of as a collection of data
from the corpus of digital music ?les. The embodiment then objects, and for Which there is some at least rudimentary
characterizes the coordinates of the subset of points associ notion of What it means for tWo data objects to be similar,
ated With the speci?ed feW music ?les, and selects a region or close to each other, or nearby.
US 2010/0274753 A1 Oct. 28, 2010
[0020] In an embodiment, the present invention relates to these mathematical objects. For example, a vector could be
the fact that certain notions of similarity or nearness of data represented, but is not limited to being represented, as an
objects (including but not limited to conventional Euclidean ordered n-tuple of ?oating point numbers, stored in a com
metrics or similarity measures such as correlation, and many puter. A function could be represented, but is not limited to be
others described beloW) are not a priori very useful inference represented, as a sequence of samples of the function, or
tools for sorting high dimensional data. In one aspect of the
coe?icients of the function in some given basis, or as sym
present invention, We provide techniques for remapping digi bolic expressions given by algebraic, trigonometric, transcen
tal documents, so that the ordinary Euclidean metric becomes
dental and other standard or Well de?ned function expres
more useful for these purposes. Hence, data mining and infor
sions.
mation extraction from digital documents can be consider
ably enhanced by using the techniques described herein. The [0026] In the present invention it is convenient to think of a
techniques relate to augmenting given similarity or nearness digital document as an ordered list of numbers (coordinates)
representing parametric attributes of the document. Note that
concepts or measures With empirically derived diffusion
geometries, as further de?ned and described herein. this representation is used as an illustrative and not a limiting
[0021] An aspect of the present invention relates to the fact concept, and one skilled in the art Will readily understand hoW
that, Without the present invention, it is not practical to com the examples described above, and many others, can be
pute oruse diffusion distances on high dimensional data. This brought in to such a form, or treated in other forms of repre
is because standard computations of the diffusion metric sentation, by techniques that are substantially equivalent to
require d*n2 or even d*n3 number of computations, Where d is those describe herein.
the dimension of the data, and n the number of data points. [0027] Such digital documents, eg images and text docu
This Would be expected because there are O(n2) pairs of ments having many attributes, typically have dimensions
points, so one might believe that it is necessary to perform at exceeding 100. In accordance With an embodiment of the
least n2 operations to compute all pairWise distances. HoW
present invention, the use of given metrics (i.e., notions of
ever, the present invention, as disclosed, includes a method similarity, etc.) in digital document analysis is restricted only
for computing a dataset, often in linear time O(n) or O(n
to the case of very strong similarity betWeen documents, a
log(n)), from Which approximations to these distances, to
similarity for Which inference is self evident and robust. Such
Within any desired precision, can be computed in ?xed time.
similarity relations are then extended to documents that are
[0022] The present invention provides a natural data driven not directly and obviously related by analyZing all possible
self-induced multiscale organiZation of data in Which differ chains of links or similarities connecting them. This is
ent time/ scale parameters correspond to different representa achieved through the use of diffusions processes (processes
tions of the data structure at different levels of granularity, that are analogous to heat-?oW in a mathematical sense that
While preserving microscopic similarity relations.
Will be described herein), and this leads to a very simple and
[0023] Examples of digital documents in this broad sense, robust quantity that can be measured as an ordinary Euclidean
could be, but are not limited to, an almost unlimited variety of distance in a loW dimensional embedding of the data. The
possibilities such as sets of object-oriented data objects on a term embedding as used herein refers to a “diffusion map”
computer, sets of Web pages on the World Wide Web, sets of and the distance thereby de?ned as a “diffusion metric.”
document ?les on a computer, sets of vectors in a vector [0028] In yet another aspect, the present invention relates in
space, sets of points in a metric space, sets of digital or analog
part to in?uencing the position or presence on a search result
signals or functions, sets of ?nancial histories of various list generated by a computer netWork search engine and for
kinds (eg stock prices over time), sets of readouts from a
in?uencing a position or presence or placement Within an
scienti?c instrument, sets of images, sets of videos, sets of advertising section of document or rendering of a document
audio clips or streams, one or more graphs (i.e. collections of or meta-document on a computer netWork. In part, systems
nodes and links), consumer data, relational databases, to and methods are disclosed for enabling information providers
name just a feW. using a computer netWork such as the Internet to in?uence a
[0024] In each of these cases, there are various useful con position for a search listing Within a search result list gener
cepts of said similarity, closeness, and nearness. These ated by a computer netWork search engine and for in?uencing
include, but are not limited to, examples given in the present a position or presence or placement of a listing Within a
disclosure, and many others knoWn to those skilled in the art, document or rendering of a document or meta-document on a
including but not limited to cases in Which the content of the computer netWork. The term listing as used herein refers to
data objects is similar in some way (eg for vectors, being any digital document content that a provider Wishes to have
close With respect to the norm distance) and/or if data objects listed, rendered, displayed, or otherWise delivered using a
are stored in a proximal Way in a computer memory, or disk, computer netWork, by one practicing the present invention.
etc, and/or if typical user-interaction With the objects is simi Such a listing can be, but is not limited to banner advertise
lar in some way (eg tends to occur at similar time, or With ments, text advertisements, video clips and other media, and
similar frequency), and/or if, during an interactive process, a can be as simple as a link to another Web page or Web site. The
user or operator of the present invention indicates that the term advertising opportunity herein refers to any instance
objects in question are similar, or assigns a quantitative mea Where there is an opportunity to position a search listing, or
sure of similarity, etc. In the case of nodes in a graph, or in the position, place or present a listing Within an advertising or
case of tWo Web pages on the Internet, the objects can be other section Within a document or rendering of a document
thought of as similar for reasons including, but not limited to, or meta-document on a computer netWork. The term adver
cases in Which there is a link from one to the other. tising as used herein refers to any act of listing, rendering,
[0025] Note that, in practical terms, although mathematical displaying, or otherWise delivering a listing or other content
objects, such as vectors or functions, are discussed herein, the using a computer netWork, in exchange for compensation or
present invention relates to real-World representations of other value.
US 2010/0274753 A1 Oct. 28, 2010
[0029] More generally, in this aspect, the present invention [0037] The diffusion geometric techniques and other tech
relates to the strategic matching of online content for optimi niques disclosed herein provide a neW and novel means of
Zation of collaborative opportunities for one Web page or Web displaying advertisements that are related to content and for
site to display content related to another Web page orWeb site. Which preferential positioning of the advertisements dis
Examples of such use include, but are not limited to: played can be determined by relevance to the context, as Well
as in?uenced by a bidding process or other economic consid
[0030] l. the addition of links to a Web site, designed to
erations. Algorithms for preferential positioning of advertise
increase intra-site click through rate;
ments, etc, are disclosed herein.
[0031] 2. the addition of links betWeen a strategic set of
[0038] An aspect of the present invention relates to the
Web sites, designed to increase inter-site click through
application of the above algorithm and related ones, to the
rates; and
problem of automatically designing or augmenting the links
[0032] 3. the provision of services designed to pair up Within a single company’s Web site. Web companies often
product and service listings With advertising opportuni
Wish to increase the amount of tra?ic on their Web sites, and
ties the amount of time and volume of data vieWed by customers
[0033] In accordance With an embodiment of the present of their sites. Offering links from pages on the site to related
invention, the system and method provides a database having pages on the site provides a proactive replacement for an
accounts for the listing providers. Each account contains con outside search engine. Users Will be able to ?nd What they
tact and billing information for a listing provider. In addition, need (e. g. if they enter a site from the result of a search
each account contains at least one search listing having at engine), and then ?nd related information, and thus be moti
least tWo components: 1. at least one digital document vated to “explore” the site. This is true for sites in general, and
describing the product, service or other listing to be posi also speci?cally When the site in question is one that contains
tioned, placed, or presented; and 2. a bid amount, Which is catalog-like or other listings of products and services. In a
preferably a money amount, for a listing. The listing provider store, customers often begin shopping by looking at one prod
may add, delete, or modify a search listing after logging into uct but end up buying another product. By having tight links
his or her account via an authentication process. The present betWeen related products, online sites can achieve this same
invention includes methods for determining the eligibility of “emotional buying” phenomenon.
any listing for any given advertising opportunity. During an [0039] An aspect of the present invention relates to the
advertising opportunity, the selection of or positioning of a application of the above algorithm and related ones, to the
listing is in?uenced by a continuous online competitive bid problem of automatically designing or augmenting the links
ding process. The bidding process occurs Whenever an adver betWeen tWo or more companies’ Web sites. Web companies
tising opportunity arises. The system and method of the often Wish to increase the amount of traf?c that they receive
present invention then compares all bid amounts for those from or provide to af?liated sites. The present invention pro
listings eligible for the advertising opportunity in question,
vides a method to design or augment the links betWeen these
and generates a rank value for all eligible listings. The rank sites, thereby linking related content, and organically increas
value generated by the bidding process determines Where the ing this traf?c. One skilled in the art Will see hoW to do this,
netWork information providers listing Will appear in the con and hoW it results in economic bene?t to the parties in ques
text determined by the advertising opportunity. A higher bid
tion, each in a Way analogous to the case described in the
by a netWork information provider Will result in a higher rank previous paragraph.
value and a more advantageous placement.
[0040] In accordance With an embodiment of the present
[0034] There are current systems that, for example, display invention, a method and system retrieves information in
advertisements Within a paid section of a Web page, Wherein response to an information retrieval request comprises
the choice of advertisements displayed relates to keyWord extracting additional information from a ?rst corpus of data
matching and other similar techniques, and the preferential elements based on the request. The request is modi?ed based
positioning of the advertisements displayed is determined by on the additional information to re?ne the scope of informa
a bidding process. For example, Google, Inc. practices this tion to be retrieved from a second corpus of data elements.
technique (see “Google AdSense” at: <http://WWW.google. The information is retrieved from the second corpus of data
com/ads/>). elements based on the modi?ed request.
[0035] There are current systems that, for example, display [0041] In accordance With an embodiment of the present
advertisements Within a section of a search engine query invention, a method of in?uencing traf?c betWeen predeter
result page, Wherein the choice of advertisements displayed mined Web pages comprises the steps of: determining diffu
relates to keyWord matching and other similar techniques, sion geometry coordinates of a set of Web pages, the set of
and the preferential positioning of the advertisements dis Web pages comprising at least one of the predetermined Web
played is determined by a bidding process. For example, pages; and determining links betWeen the Web pages based on
Google, practices this technique (see “Google Ad Words” at: the diffusion geometry coordinates.
<http://WWW.google.comiads/>).
[0042] In accordance With an embodiment of the present
[0036] In these current systems, advertisements are placed invention, a computer readable medium comprises code for
by a method that uses keyWords, but keyWords can be retrieving information in response to an information retrieval
ambiguous. For example, the keyWord “nails” might bring up request, the code comprising instructions for: extracting addi
advertisements for hardWare stores in these prior art systems, tional information from a ?rst corpus of data elements based
even When searched from a Website about Women’s beauty, on the request; modifying the request based on the additional
Where results about nail polish, etc, are more appropriate as information to re?ne the scope of information to be retrieved
top advertisements. Hence there is a need for methods and from a second corpus of data elements; and retrieving infor
systems as disclosed herein, Which, in part, are able to resolve mation from the second corpus of data elements based on the
such ambiguities. modi?ed request.
US 2010/0274753 A1 Oct. 28, 2010
[0043] In accordance With an embodiment of the present [0051] In accordance With an exemplary embodiment of
invention, a computer readable medium comprises code for the present invention, the data matrix d(q, r) comprises initial
in?uencing traf?c betWeen predetermined Web pages, the customer preference data and the inventive method for infer
code comprising instructions for: determining diffusion ring/ estimating missing values in a data matrix d(q, r) further
geometry coordinates of a set of Web pages, the set of Web comprises the step of predicting additional customer prefer
pages comprising at least one of the predetermined Web ences from the data matrix d(q, r).
pages; and determining links betWeen the Web pages based on [0052] In accordance With an exemplary embodiment of
the diffusion geometry coordinates. the present invention, the data matrix d(q, r) comprises mea
[0044] In accordance With an embodiment of the present sured values of an empirical function f(q, r) and the invention
invention, a system for retrieving information in response to method for inferring/estimating missing values in a data
an information retrieval request comprises: an extracting matrix d(q, r) further comprises the step of nonlinear regres
module for extracting additional information from a ?rst cor sion modeling of the empirical function f(q, r).
pus of data elements based on the request; a processing mod [0053] In accordance With an exemplary embodiment of
ule for modifying the request based on the additional infor the present invention, the data matrix d(q, r) is a questionnaire
mation to re?ne the scope of information to be retrieved from d(q, r) and the inventive method further comprises the steps of
a second corpus of data elements; and a retrieving module for determining Whether a response (qo, r0) to the questionnaire
retrieving information from the second corpus of data ele d(q, r) is an anomalous response.
ments based on the modi?ed request. [0054] In accordance With an exemplary embodiment of
[0045] In accordance With an embodiment of the present the present invention, the inventive method further comprises
invention, a system for in?uencing tra?ic betWeen predeter the steps of generating a dataset d1 (q, r) comprising responses
mined Web pages comprises a processing module for deter to the questionnaire d(q, r), omitting the response (qo, r0) from
mining diffusion geometry coordinates of a set of Web pages, the dataset d1 (q, r), reconstructing the missing response (qo,
the set of Web pages comprising at least one of the predeter r0) from the dataset d1(q, r) to provide a reconstructed value,
mined Web pages; and determining links betWeen the Web comparing the reconstructed value to the response (qo, r0),
pages based and determining the response (qO, r0) to be anomalous When a
[0046] In accordance With an exemplary embodiment of distance betWeen the reconstructed value and the response
the present invention, a method for inferring/ estimating miss (qo, r0) is larger than a pre-determined threshold.
ing values in a data matrix d(q, r) having a plurality of roWs [0055] In accordance With an exemplary embodiment of
and columns comprises the steps of: organiZing the columns the present invention, the data matrix d(q, r) comprises data
of the data matrix d(q, r) into a?inity folders of columns With relevant to fraud or deception and the inventive method fur
similar data pro?le, organiZing the roWs of the data matrix ther comprises the step of detecting fraud or deception from
d(q, r) into af?nity folders of roWs With similar data pro?le, said data matrix d(q, r).
forming a graph Q of augmented roWs and a graph R of [0056] In accordance With an exemplary embodiment of
augmented columns by similarity or correlation of common the present invention, a computer readable medium com
entries; and expanding the data matrix d(q, r) in terms of an prises code for inferring/estimating missing values in a data
orthogonal basis of a graph Q><R to infer/estimate the missing matrix d(q, r) having a plurality of roWs and columns. The
values in said data matrix d(q, r).on the diffusion geometry code comprises instructions for organiZing the columns of
coordinates. said data matrix d(q, r) into af?nity folders of columns With
similar data pro?le, organiZing the roWs of said data matrix
[0047] In accordance With an exemplary embodiment of
the present invention, the data matrix d(q, r) comprises ques d(q, r) into af?nity folders of roWs With similar data pro?le,
tionnaire data and the inventive method for inferring/estimat forming a graph Q of augmented roWs and a graph R of
ing missing values in a data matrix d(q, r) additionally com augmented columns by similarity or correlation of common
entries; and expanding the data matrix d(q, r) in terms of an
prises the step of ?lling in an unknown response to a
orthogonal basis of a graph Q><R to infer/estimate the missing
questionnaire to infer/estimate missing values in the data
matrix d(q, r). values in the data matrix d(q, r).
[0057] Various other objects, advantages and features of the
[0048] In accordance With an exemplary embodiment of
present invention Will become readily apparent from the
the present invention, the inventive method for inferring/
ensuing detailed description, and the novel features Will be
estimating missing values in a data matrix d(q, r) additionally
particularly pointed out in the appended claims.
comprises the step of expanding the data matrix d(q, r) in
terms of a tensor product of Wavelet bases for graphs Q and R.
BRIEF DESCRIPTION OF THE DRAWINGS
[0049] In accordance With an exemplary embodiment of
the present invention, the inventive method for inferring/ [0058] For a more complete understanding of the present
estimating missing values in a data matrix d(q, r) additionally invention, reference is noW made to the folloWing descrip
comprises the steps of, for each tensor Wavelet in basis, com tions taken in conjunction With the accompanying draWing, in
puting a Wavelet coe?icient by averaging on the support of the Which:
tensor Wavelet and retaining the coe?icient in the expansion [0059] FIG. 1 shoWs a block diagram of a contextualiZed
only if validated by a randomiZed average. search engine in accordance With an embodiment of the
[0050] In accordance With an exemplary embodiment of present invention;
the present invention, the inventive method for inferring/ [0060] FIG. 2 shoWs a schematic representation of an imag
estimating missing values in a data matrix d(q, r) additionally ined forest, With trees and shrubs, presumed to burn at differ
comprises the steps of constructing diffusion Wavelets and ent rates;
taking supports of the resulting diffusion Wavelets at a ?xed [0061] FIG. 3 shoWs an exemplary ?oW chart for comput
scale on said columns of said graph R, for at least one of the ing multiscale diffusion geometry in accordance With an
organiZing step. embodiment of the present invention; and
US 2010/0274753 A1 Oct. 28, 2010
[0062] FIG. 4 illustrates a Public Find Similar Document [0077] Note that in certain embodiments the corpora (9)
Internet Utility in accordance With an embodiment of the can be taken to be the same as (5). In such case, it is seen that
present invention. the algorithm of the present invention acts to ?nd those Words
[0063] The discussion associated With the ?gure illustrates Which are much more likely to occur in documents that meet
an embodiment of the present invention in the context of the ?rst search query criteria, Within the subject(s) of interest
analysis of the spread of ?re in the forest, and illustrates a use to the user of the search, as compared With the generic occur
of the embodiment in the analysis of diffusion in a netWork. rence of the Words Within the subject(s) of interest to the user
of the search. In other variants of the algorithm, (9) and (10)
DETAILED DESCRIPTION OF THE are omitted, f1:0, and (7) d:f0 (6).
EMBODIMENTS [0078] The corpora (13) can be, in certain embodiments,
the entire Internet, or the set of documents indexed by a public
[0064] As shoWn in FIG. 1, there is illustrated a How chart
or private search engine. Since, in certain embodiments, the
describing an exemplary method in accordance With an
algorithm of the present invention takes a ?rst search query,
embodiment of the present invention (fr_matr_bin( ):
and produces a second search query, each suitable for full text
[0065] Step 110: A user (1) enters a ?rst search query (2) search, these queries can be passed to search engines via
into a search query user interface (3). techniques standard in the art, including but not limited to
[0066] Step 120: The query (2) is sent to a ?rst search HTTP requests and/or netWork interfaces such as SOAP. The
engine (4).
results returned by these search engines can be displayed as is
[0067] Step 130: The ?rst search engine (4) performs a standard in the art, including but not limited to display in a
search on a ?rst one or more corpora of documents (5) broWser by rendering results encoded With HTML, XML,
using the query (2). Java, JavaScript, Python, Peri, PHP, etc.
[0068] Step 140: Mean Word frequencies f0 (6) are com [0079] In certain embodiments, at least on of the searches
puted on the set of documents returned by the ?rst search described can be performed by matrix techniques. More spe
engine (4).
ci?cally, suppose that one has a set of N documents, With a
[0069] Step 150: Mean Word frequencies f1 (10) are vocabulary or reduced vocabulary of M Words. One can then
computed for a second one or more corpora of docu form the N><M matrix W, so that W(i,j )?he number of times
ments (9). (It is appreciated that this step can be done that Word number j occurs in document number i.
once at initialization.) [0080] In certain embodiments, provisions are made to
[0070] Step 160: The difference d (7) f0-f1:is calcu ignore stop Words. Stop Words are Words that are commonly
lated.
used, such as “the,” “an,” or “and”, that are often deliberately
[0071] Step 170: The set of Words (8) is identi?ed corre ignored by search applications When responding to a query.
sponding to those top K Words for Which d (7) is greatest Often stop Words are the most common Words in the lan
(for some ?xed parameter K), or e. g., to those Words for guage. In some embodiments, sets of stop Words are aug
Which d is greater than some thresholdt (for some ?xed mented by adding additional words (eg Common Words)
parameter t).
that are speci?c to the corpora used.
[0072] Step 180: A neW search query (11) is de?ned by [0081] In certain embodiments, provisions are made to cor
combining the ?rst query (2) and the set of Words (8). For rect spelling errors. This can be done, for example, by using
example if the ?rst query (2) is “nail”, and the set of SOUNDEX scores to identify Words that are misspelled but
Words (8) is {“polish”, “beauty”, “manicure”}, then the are most likely meant to be other given Words. One can also
neW search query (11) could be “nail AND (polish OR employ other techniques, such as a list of commonly mis
beauty OR manicure)”. Other algorithms for this com spelled Words, phrases and queries. In the present context,
bination are disclosed herein. statistics and other information, including but not limited to
[0073] Step 190: The neW query is sent to a second information from the corpora and/or the search logs, can be
search engine (12) disposed to search a third one or more used to identify misspellings and likely suggested replace
corpora of documents (13). ments for input queries. Spelling errors in the corpora can also
[0074] Step 200: The results returned by the second be ?agged and automatically, semi-automatically, partially
search engine (12) are displayed on a search result user assisted or manually corrected.
interface (14). [0082] In accordance With embodiments of the present
[0075] In certain embodiments, the corpora (9) represent invention, certain Word frequency coe?icients, or differences
the language as a Whole. For example, if the target searches betWeen Word frequencies, are set to Zero When they are
are conducted in English, then corpora (9) can be a random beloW a given threshold. In this Way, “noise” is removed from
sample of documents in the English language. The corpora the process. For example, in the case Where documents are
(5) are used to de?ne the subject(s) of interest to the user of the being tested for the presence of a set of Words or phrases as in
search. For example, if the subject of interest is Maj or League the search in step 130 of FIG. 1, one can take only those
Baseball, then the documents in question can be a Web-craW documents that contain the phrase more than a certain number
of WWW.mlb.com, as Well as neWs articles, encyclopedia of times. This number can be ?xed, or it can be some fraction
articles, etc, on the subject of baseball. of the average number, Where the average is taken, for
[0076] In this Way, it is seen that the algorithm of the present example, over the set of documents for Which the value is at
invention, in certain embodiments, acts to ?nd those Words least 1. A corresponding type of threshold can also be applied
Which are much more likely to occur in documents that meet in one or more of steps, for example to steps 170, 180 or 190.
the ?rst search query criteria, Within the subject(s) of interest [0083] In certain embodiments, searches are implemented
to the user of the search, as compared With the generic occur in part using sparse matrix representations. For example,
rence of the Words Within the target search language as a given the matrix W(i,j) as described herein, for a ?rst one or
Whole. more corpora, and an initial search query based on the pres
Description:technique (see “Google AdSense” at: ). embodiments, the ab solute requirement of appearing together is relaxed to