Table Of ContentSynthesizing Mapping Relationships Using Table Corpus
∗
Yue Wang Yeye He
UniversityofMassachusettsAmherst MicrosoftResearch
Amherst,MA,USA Redmond,WA,USA
yuewang@cs.umass.edu yeyehe@microsoft.com
ABSTRACT
Country Code Ticker Company
United States USA MSFT Microso+ Corp
Mappingrelationships,suchas(country,country-code)or
Canada CAN ORCL Oracle
(company, stock-ticker), are versatile data assets for an South Korea KOR INTC Intel
array of applications in data cleaning and data integration Japan JPN GE General Electric
likeauto-correctionandauto-join. However,todaythereare China CHN UPS United Parcel Services
… … … …
nogoodrepositoriesofmappingtablesthatcanenablethese
intelligent applications. (a)ISOcountrycodes (b)Stocktickers
Given a corpus of tables such as web tables or spread- State Abbrev. Airport Name IATA
sheet tables, we observe that values of these mappings of- Alabama AL Los Angeles Interna.onal Airport LAX
ten exist in pairs of columns in same tables. Motivated by Alaska AK San Francisco Interna.onal Airport SFO
Arizona AZ Tokyo Interna.onal Airport HND
theirbroadapplicability,westudytheproblemofsynthesiz-
Arkansas AR London Heathrow Airport LHR
ing mapping relationships using a large table corpus. Our California CA Beijing Capital Interna.onal Airport PEK
synthesis process leverages compatibility of tables based on … … … …
co-occurrence statistics, as well as constraints such as func- (c)Stateabbreviations (d)AirportIATAcodes
tionaldependency. Experimentresultsusingwebtablesand
enterprisespreadsheetssuggestthattheproposedapproach Table1: Exampleone-to-onemappingtables: (a)Countries
can produce high quality mapping relationships. toISOcodes,(b)Companynamestostock-tickers,(c)State
names to abbreviations, (d) Airports to IATA-codes.
1. INTRODUCTION
Model Make City State
Mapping tables, sometimes also referred to as bridge ta-
F-150 Ford Chicago Illinois
bles [27], are two-column tables where each distinct value Mustang Ford San Francisco California
in the left column maps to a unique value in the right col- Accord Honda Los Angeles California
umn (or functional dependencies hold). Table 1 gives a few Camry Toyota Houston Texas
Charger Dodge Sea9le Washington
example mapping tables with one-to-one mapping relation-
… … … …
ships. Table2showsadditionalexampleswithmany-to-one
(a)Carmakeandmodel (b)Cityandstate
mappings.
Mapping tables like these are important data assets for Table 2: Example many-to-one mapping tables: (a) Car
a variety of applications such as data integration and data makes and models, (b) Cities and states
cleaning. We briefly discuss three scenarios here.
Auto-correction. Real-worldtablesareoftendirty,where ID Employee Residence State City State
inconsistent values may be present in same columns. Ta- 2910 Bren, Steven California San Francisco California
1923 Morris, Peggy Washington Sea/le Washington
ble3showssuchanexample. Thelastcolumnaboutstate 1928 Raynal, David Oregon Los Angeles California
are mixed with both full state names and state abbrevia- 2491 Crispin, Neal CA Houston Texas
tions. An intelligent data quality agent, equipped with the 4850 Wells, William WA Denver Colorado
… … … … …
mappingtableinTable1c,caneasilydetectandalertusers
about such inconsistency, by discovering that values in the Table 3: Auto-correction: Table 4: Auto-fill: automat-
correct inconsistent values ically populate values based
∗
Work done at Microsoft Research. (highlighted) using Table 1c. on mappings from Table 2b.
Permissiontomakedigitalorhardcopiesofallorpartofthisworkforpersonalor Ticker Market Cap Company Total '89 -'13 Dem Rep
classroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributed GE 255.88B General Electric $59,456,031 41% 58%
forprofitorcommercialadvantageandthatcopiesbearthisnoticeandthefullcita- WMT 212.13B Walmart $47,497,295 52% 44%
tiononthefirstpage. Copyrightsforcomponentsofthisworkownedbyothersthan MSFT 380.15B Oracle $34,216,308 35% 64%
ACMmustbehonored.Abstractingwithcreditispermitted.Tocopyotherwise,orre- ORCL 255.88B MicrosoG Corp. $33,910,357 48% 50%
publish,topostonserversortoredistributetolists,requirespriorspecificpermission UPS 94.27B AT&T Inc. $33,752,009 47% 51%
and/orafee.Requestpermissionsfrompermissions@acm.org. … … … … … …
SIGMOD’17,May14-19,2017,Chicago,IL,USA
Table 5: Auto-join: joining related tuples based on map-
(cid:13)c 2017ACM.ISBN978-1-4503-4197-4/17/05...$15.00 pings from Table 1b
DOI:http://dx.doi.org/10.1145/3035918.3064010
leftandrightcolumnofTable1caremixedinoneuserdata ing expensive online reasoning over large table corpora for
column. Furthermore, it can automatically suggest correc- specificapplicationslikeauto-join[24]andauto-fill[38],one
tionsbasedonthemappingrelationship(e.g.,correcting CA could index synthesized mapping tables using hash-based
to California). techniques (e.g., bloom filters) for efficiently lookup based
Auto-fill. In this example scenario in Table 4, a user on value containment. Such logic is both simple to imple-
has a list of city names. She wants to add a column of ment and easy to scale.
state names corresponding to the cities. By just entering a Lastly,mappingtablesareversatiledataassetswithmany
few example values (e.g., California for San Francisco), applications. By solving this common underlying problem
the system automatically discovers the intent by matching and producing mapping tables as something that can be
existing value pairs with those in Table 2b, and can thus easily plugged into other applications, it brings benefits to
suggest to automatically fill remaining values in the right a broad class of applications as opposed to requiring sepa-
column (grayed out in Table 4). ratereasoninglogictobedevelopedfordifferentapplications
Auto-join. Indataintegrationandad-hocdataanalysis, (e.g., [24] for auto-join and [38] for auto-fill).
users often need to “join” two tables together, whose key Why synthesize tables. Given table corpora such as
columns may have different representations. In Table 5 for HTML tables from web or spreadsheets from enterprises,
example, an analyst needs to join the left table that has fragments of useful mapping relationships exist. For exam-
stocks by their market capitalization, with the right table ple, the country and country-ISO3-code columns in Ta-
that lists companies by their political contributions, to an- ble1aareoftenadjacentcolumnsinsametablesontheweb.
alyze potential correlations. However a direct join is not As such, an alternative class of approaches is to “search”
possible since the subject column of the left table is stock tables based on input values and then ask users to select
tickers,whiletherighttableusescompanynames. Asystem relevantones(e.g.,GoogleWebTables[1],MicrosoftPower
equippedwithmappingtableswouldmakethejoinpossible Query [2], and DataXFormer [4]). However, because de-
by using Table 1b as an intermediate bridge that performs sired values pairs often span across multiple tables, users
a three-way join to connect these two user tables, without frequently need to search, inspect and understand table re-
asking users to provide an explicit mappings. sults, before manually piecing them together from multiple
Synthesize mapping tables with human curation. tables. Our experience suggests that this process is often
In this work, we develop methods to automatically syn- too cumbersome for end users.
thesize mapping relationships from existing table corpora, Mappings synthesized from multiple tables, on the other
where the goal is to generate as many high-quality map- hand, take away the complexity and make it easy for end
pings as possible. Because algorithms are bound to make users. More specifically, synthesized mappings have the fol-
mistakes,additionalhumanverificationandcurationcanbe lowing benefits.
usedtoensureveryhighprecision(Section4.3). Theresult- •Completeness. Inmanycasesonetableonlycoversasmall
ingmappingscanthenbeutilizedtoenabletheapplications fractionofmappingsinthesamerelationship. Forexample,
discussed above in a unified manner. whilethereexistthousandsofairports,awebtablelikeTa-
Why pre-compute mappings. While there are sepa- ble 1d often lists only a small fraction of popular airports.
rate solutions for auto-join and auto-fill problems (e.g., [24, Stitching together tables in the same relationship provides
38]), our approach has a few important advantages. better coverage and is clearly desirable.
First, synthesized mappings are amenable to human in- •Synonymousmentions. Eachindividualtablefromatable
spection and curation, which is critical to ensure very high corpus typically only has one mention for the same entity.
quality. In attempting to commercialize technologies simi- For example, Table 1a has South Korea and KOR. In reality
lar to [24, 38] in enterprise spreadsheet software like Excel, differenttablesusedifferentbutsynonymousnames. Table6
the main feedback we received is the trustworthiness of re- showsrealresultssynthesizedfrommanywebtables,which
sults produced by black-box algorithms. Algorithms with has different synonyms of South Korea. Similarly the right
even 99% correctness is still unacceptable in the context of part has many synonyms for Congo. Note that a specific
enterprisespreadsheets,becauseanyerrorintroducedbyal- synonymof South Koreamaynotnecessarilyco-occurwith
gorithmswouldbedifficultforuserstodetect,butishighly another synonym of Congo in the same web table, and the
embarrassing and damaging in enterprise settings. probability of co-occurrence in conjunction with synonyms
Ananalogywewouldliketodrawistheknowledge-bases of additional countries is even lower. However, any com-
used in search engines such as Google and Microsoft Bing. bination of these synonyms may actually be used in user
Similar to our problem, the quality required for knowledge- tables that may require auto-join or auto-fill. Using single
bases is also very high, so commercial knowledge bases are tables as mappings would not provide sufficient coverage in
created in offline processes that combine algorithmic au- these cases. On the other hand, if all these synonyms are
tomation with human curation. Mapping tables can be synthesized together as one table like in Table 6, then any
viewed as the counterpart of knowledge bases in the rela- combinationofthesesynonymscanstillbecoveredwithout
tionalworld,whereasimilarcurationprocessmaybeneeded requiring users to perform manual synthesis from multiple
becauseofthequalityrequirement. Andlikesearchengines tables.
that have millions of users, spreadsheet software can reach •Spuriousmappings. Certainmappingsthatappeartohold
millionsofdataanalysts,suchthatthecostofcuratingmap- locallyinsingletablesmaynotbemeaningful. Forexample,
pings can be amortized over a large user base to make the a random table listing departure-airport and arrival-
effort worthwhile. airport may happen to have values observe functional de-
Second, synthesized mapping relationships can be mate- pendency at the instance level. However, at a conceptual
rialized as tables, which are easy to index and efficient to levelthisisnotausefulmapping. Suchaspuriousmapping,
scale to large problems. For example, instead of perform- when indexed from single tables, can trigger false-positive
Step 1:
Country Code Country Code Input:
Korea (Republic) KOR Congo (Democra6c Rep.) COD Table Candidate
Korea (South) KOR Congo (Demographic Republic of) COD Corpus Extraction
KOREA REPUBLIC OF KOR Congo, Democra6c Republic of the COD
Korea, Republic of KOR CONGO, DEMOCRATIC REPUBLIC OF (WAS ZAIRE) COD
Korea, Republic of (South Korea) KOR Congo, Democra6c Republic of the (Congo & Kinshasa) COD
Korea, South KOR Congo, The Democra6c Republic of COD
Republic of Korea KOR CONGO, THE DRC COD Step 2:
South Korea KOR Democra6c Republic of Congo COD Table
… … … … Step 3: Synthesis
Conflict
Table6: Examplesfromasynthesizedmappingrelationship Resolution
(country, country-ISO3-code) using real web tables. The
lefttableshowsexamplesofsynonymsforthecountrySouth
Korea, all of which map to the same code KOR. The right
table shows similar examples for Congo. Figure 1: Solution overview with three main steps: (1) Ex-
tract candidate two-column-tables; (2) Synthesize related
tables; (3) Resolve conflicts in the same relationship.
results for applications like auto-correct and auto-join. A
holisticanalysisofglobalrelationshipsarenecessarytoiden-
tify true mappings from spurious ones. listedinTable1(allofwhicharecommonmappings),while
Existingapproaches: Giventhattablesynthesisisneeded Freebasemissestwo(stocksandairports). Furthermore,for
toassisthumancuration,welookatexistingtechniquesthat mappings that do exist in KB, they typically do not have
can be used here. synonyms like the ones in Table 6. Lastly, KB have limited
Union tables. Ling and Halevy et al. studied the prob- coverage beyond the public web domain, such as mapping
lemofstitchingtogetherwebtablesinthesameweb-domain (cost-center-name → cost-center-code) that is specific
(where tables are more homogeneous) based on meta data to enterprises domains.
such as column names [30]. While the technique is not de- Contribution. Observing that mapping relationships
signedtosynthesizerelationshipsfromalargeheterogeneous are well-represented in tables, we propose to automatically
corpus, it is the only work we are aware of that performs synthesize mapping relationships using table corpora. We
table synthesis from corpora. We will show that adapting formalize this as an optimization problem that maximizes
this to a large corpus of heterogeneous tables will fail, be- positive compatibility between tables while respecting con-
cause column names are often undescriptive [15] that leads strains of negative compatibility imposed by functional de-
to over-grouping and low-quality mappings. For example, pendencies. We show a trichotomy of complexity for the
in Table 1a, the column name for countries are often just resulting optimization problem, and develop an efficient al-
name,andthecolumnnameforcountry-codesmaybecode. gorithm that can scale to large table corpus (e.g., 100M ta-
As a result, grouping by column names tends to lump this bles). Ourevaluationusingrealtablecorporasuggeststhat
tablewithothername-to-codemappings. Ourapproachrea- theproposedapproachcansynthesizehighqualitymapping
sonsaboutcompatibilityoftablesbasedonvalues,whichare tables.
more reliable in telling the true relationships.
Schema matching. There is a long and fruitful line of re-
2. SOLUTIONOVERVIEW
searchonschemamatchingthatsuggestspossiblemappings
between table columns [31]. However, schema matching is Inthissection,wefirstintroducenotionslikemappingre-
typically used in database contexts for a small number of lationships and table corpora necessary for discussions. We
schemas,andproducespair-wisematchesforhumanusersto then give a high-level overview of our synthesis solution.
evaluate. In our problem we are given hundreds of millions
2.1 Preliminaries
ofschemasasinput,forwhichpairwisehumanverificationis
infeasible, and aggregation of pairwise decisions to a group Mapping relationships. The goal of this work is to
level is necessary for human curation. Furthermore, since discover mapping relationships. Specifically, we focus on
weareonlyinterestedinmappingrelationships,whicharea binary mappings involving two attributes.
specifictypeoftablesthatalwaysobservefunctionaldepen-
dencies, we can deriving additional negative incompatibility Definition 1. Let R be a conceptual relation with two
induced by FDs that is not explored by schema matching. attributes X,Y. The relationship is a mapping relationship,
For example, there are multiple country-to-code relation- denoted by M(X,Y) or X → Y, if for all x ∈ X, x func-
shipssuchas(country→ISO3-country-code),(country→ tionally determines one and precisely one value y∈Y.
FIFA-country-code),(country→IOC-country-code),etc,
allofwhichsharesubstantialvalueoverlapaswellassimilar Examples of mapping relationships include (country →
columnnames. Schemamatchingtechniqueswouldidentify country-code)and(company→stock-ticker)asshownin
them as matches and merge them incorrectly, whereas we Table 1 and Table 2. There is a mapping relationship be-
wouldpreventthesynthesisbecauseoftheFD-basedincom- tween attributes country and country-code, for instance,
patibility. Considering both positive and negative signals is sincevalueinoneattributeisuniquelyassociatedwithpre-
critical for high-quality synthesis at a large scale. cisely one value in the other attribute.
Knowledge base. Knowledge bases (KB) such as Free- Notethatthisiscloselyrelatedtofunctionaldependency
base[7]andYAGO[34]haveimportantentity-relationships (FD),traditionallydefinedoveronephysicaltable. Wemake
thatcanbeviewedassynthesized(semi-automatically)from the distinction to define mappings as conceptual relation-
differentsources. However,manymappingsaremissingfrom ships that can be represented in multiple tables, but may
KB.Forinstance,YAGOhasnoneoftheexamplemappings neverbefullyembodiedinonephysicaltable(e.g.,thesyn-
thesized mapping shown in Table 6 with both South Korea Home Team Away Team Date Stadium Loca3on
and Korea (South) would not occur in one table). Chicago Bears Greenbay Packers 10-12 Soldier Field Chicago, IL 60605
Detroit Lions Minnesota Vikings 10-12 Ford Field Detroit, MI
Existing FD discovery work mainly focuses on efficiency
Detroit Lions Greenbay Packers 10-19 Ford Field Detroit, MI
(e.g., [3]), because it is intended for interactive data explo-
Minnesota Vikings Chicago Bears 10-19 US Bank Stadium Minneapolis
ration on ad-hoc data sets. However, in our problem the Greenbay Packers Minnesota Vikings 10-26 Lambeau Field 1265 Lombardi Ave
key challenge is to produce high-quality synthesis of tables … … … … …
to assist human curation, where efficiency is not as impor- Table 7: An example input table. Candidate two-column
tant because the corpus is given a priori and synthesis can tables can be extracted using both PMI and FD filtering.
be run as offline jobs.
For cases where both X → Y and Y → X are map-
ping relationships, we call such bi-directional relationships shipandarecompatiblewitheachother. Thereasonthisis
1:1 mappings (examples are in Table 1). If the mapping necessary is because many web tables and spreadsheets are
relationship only holds in one direction, then it is an N:1 for human consumption [30], and as a result contain only a
mapping (Table 2). subset of instances for the ease of browsing. Furthermore,
Itisworthnotingthatinpractice,becauseofnameambi- one table in most cases mentions an entity by one name;
guity,functionalrelationshipinsomemappingsmayappear synthesis helps to improve coverage of synonyms that are
to only hold approximately. For example, city → state important for many applications.
is conceptually a mapping relationship. However, when en-
Step 3: Conflict Resolution. Because results from table
tities are represented as strings, the functional relationship
synthesis piece together many tables, some of which are
may not completely hold. For example, in the same table
bound to have erroneous values inconsistent with others,
there may be a city called Portland in the state of Oregon,
namely two pairs of values in the same mapping with the
andanothercityPortlandinthestateof Maine,thusgiving
sameleft-hand-sidevaluebutdifferentright-hand-side(thus
the appearance of violating FD. To take such name ambi-
violating the definition of mappings). These can often hap-
guity into account, we consider relationships whose surface
pen due to quality issues or extraction errors. We apply a
forms are approximate mapping relationships.
post-processingsteptoresolveconflictsinsynthesizedmap-
Definition 2. Let R be a conceptual relation with two ping relationships to produce our final results.
attributes X,Y. The relationship is a θ-approximate map-
pingrelationship,denotedbyMθ(X,Y)orX →θ Y,ifthere 3. CANDIDATETABLEEXTRACTION
exists a subset R ⊂ R with |R| ≥ θ|R|, in which all x ∈ X
functionally determines one and precisely one value y∈Y. In this section we briefly describe the preprocessing of
tables. Recall that in this work we focus on synthesizing
Weconsiderapproximatemappingswithθover95%. Here- binary mapping relationships. We start with two-column
after we will simply use mapping relationship to refer to its tables extracted from an existing table corpus. Given a ta-
θ-approximate version when the context is clear. ble T = {C ,C ,...C } with n columns, we can extract bi-
1 2 n
Table corpora. The only input toour problem is acor- narytableswithpairsofcolumns{(C ,C )|i,j ∈[n],i(cid:54)=j},
i j
pus of tables. for a total of 2(cid:0)n(cid:1) such column pairs. For example, in Fig-
2
Definition 3. A table corpus T = {T} is a set of rela- ure7,wecanconceptuallyextractallpairsofcolumnssuch
as(Home Team,Away Team),(Home Team,Date),(Home Team,
tional tables T, each of which consists of a set of columns,
Stadium), (Home Team, Location), etc.
or written as T ={C ,C ,...}.
1 2 Because not all these pairs are meaningful mappings, we
Todayrelationaltablesareabundantandareveryrichin filter out candidates with a coherence-based filtering and a
nature. In this study, we use a corpus of 100M tables ex- local FD based filtering.
tractedfromtheWeb,andacorpusof500Ktablesextracted
3.1 ColumnFilteringbyPMI
from spreadsheet files crawled from the intranet of a large
enterprise. When given a large table corpus (especially web tables),
some tables are inevitably of low quality. Quality issues
2.2 SolutionOverview
can arise because (1) columns may be mis-aligned due to
Ourapproachhasthreemainsteps,asshowninFigure1. extractionerrors(especiallyforcomplicatedtableslikepivot
Step 1: Candidate Extraction. This step starts by ex- table and composite columns); or (2) some table columns
just have incoherent values.
haustively extracting pairs of columns from all tables in
In both of these cases, the resulting table column will
the corpus as candidates for synthesis. For each table T =
{C ,C ,...,C } with n columns, we can extract 2(cid:0)n(cid:1) such appear to be“incoherent”when looking at all values in this
1 2 n 2 column. For example, the last column Location in Table 7
ordered pairs. However, many column pairs are not good
havemixedandincoherentvalues. Wewouldliketoexclude
candidate for mapping relationships because (1) for some
such columns from consideration for mapping synthesis.
columnpairifthelocalrelationshipisalreadynotfunctional,
Therefore we measure the coherence of a table column
then it is unlikely to participate in true mappings; and (2)
based on semantic coherence between pairs of values. We
some table columns are of low quality and are not coherent
apply a data-driven approach to define coherence based on
enough(e.g.,withmixedconcepts). Toaddresstheseissues,
co-occurrence statistics in a corpus. Let s(u,v) be the co-
we use FD constraints as well as value-based co-occurrence
herence between two values u and v. Define C(u)={C|u∈
statistics to prune away low-quality candidate tables.
C,C ∈ T,T ∈ T} as the columns in the table corpus T
Step2: TableSynthesis. In this step, we judiciously syn- containing value u, and define C(v) similarly. Clearly, if
thesize two-column tables that describe the same relation- C(u)∩C(v)isalargeset,itmeansuandv areco-occurring
frequently (e.g., u = USA and v = Canada). Then they in-
tuitively are highly related and thus should have a high se-
mantic coherence score.
WeusePoint-wiseMutualInformation(PMI)[14]toquan-
tify the strength of co-occurrence as a proxy for coherence.
p(u,v)
PMI(u,v)=log (1)
p(u)p(v)
Wherep(u)andp(v)aretheprobabilitiesofseeinguand
v fromatotalofN columnsinatablecorpusT,definedas
p(u)= |C(u)|, p(v)= |C(v)| and p(u,v)= |C(u)∩C(v)|.
N N N
Wedefinecoherenceoftwovalues,denotedbys(u,v),asa Figure2: MappingsfromWikipedia1forcountrynamesand
normalizedversionofPMIcalledNormalizedPMI(NPMI), threetypesofcountrycodes: IOC,FIFA,andISO.Thethree
which has a range of [−1,1]: have identical codes for many countries, but also different
ones for many others (in red circles).
PMI(u,v)
s(u,v)=NPMI(u,v)=
−logp(u,v)
Country IOC Country IOC Country ISO
Afghanistan AFG Afghanistan AFG Afghanistan AFG
Usings(u,v),thecoherencescoreofacolumnC ={v ,v ,...},
1 2 Albania ALB Albania ALB Albania ALB
denotedasS(C),issimplytheaverageofallpair-wisescores. Algeria ALG Algeria ALG Algeria DZA
American Samoa ASA American Samoa (US) ASA American Samoa ASM
(cid:80)
s(v ,v ) South Korea KOR Korea, Republic of (South) KOR South Korea KOR
S(C)= vi,vj∈C,i<j i j (2) US Virgin Islands ISV United States Virgin Islands ISV US Virgin Islands VIR
(cid:0)|C|(cid:1)
2 (a)B1: IOC-(1) (b)B2: IOC-(2) (c)B3: ISO
We can then filter out a column C if its coherence S(C)
Table 8: Example two-column binary tables for synthesis:
is lower than a threshold.
(a)CountriesandIOCcodes,(b)CountriesandIOCcodes,
wheresomecountriesusealternativesynonymscomparedto
Example 4. Table7isanexampletablewithfivecolumns.
thefirsttable,(c)CountriesandISOcodes,wherethecode
Column coherence computed using NPMI in Equation (2)
for some country can be different from the first two tables.
would reveal that the first four columns all have high coher-
ence scores, because values in these columns co-occur often
in the table corpus. 4. TABLESYNTHESIS
The last column Location, however, has low coherence,
Using candidate two-column tables produced from the
because values in this column are mixed and do not co-occur
previousstep,wearenowreadytosynthesizerelationships.
often enough in other columns. We will remove this column
Recall that synthesis provides better coverage for instances
when generating column pairs.
(e.g., synonyms) as discussed in the introduction.
3.2 Column-PairFilteringbyFD
4.1 CompatibilityofCandidateTables
After removing individual columns with low coherence
Inordertodecidewhatcandidatetablesshouldbestitched
scores, we use the resulting table T = {C ,C ,...,C } to
1 2 n togetherandwhatshouldnot,weneedtoreasonaboutcom-
generate binary tables with ordered column pairs B(T) =
patibility between tables.
{(C ,C )| i,j ∈ [n],i (cid:54)= j} as candidate tables. However,
i j
most of these two-column tables do not express meaning-
PositiveEvidenceforCompatibility.
fulmappingrelationships,suchas(Home Team,Away Team),
Let B = {(l ,r )} and B(cid:48) = {(l(cid:48),r(cid:48))} be two binary rela-
and (Home Team, Date) in Table 7. i i i i
tionships produced by the previous step, each with sets of
Since our goal is to produce mapping relationships, we
(left, right) value pairs. If these two relations share many
applylocalFDcheckingtopruneawaycolumnpairsunlikely
common value pairs, or |B∩B(cid:48)| is large, they are likely in
tobemappings. AsdiscussedinDefinition3weaccountfor
the same relationship and compatible for synthesis.
name ambiguity (like (Portland → Oregon) and (Portland
Let w+(B,B(cid:48)) be the positive compatibility between B
→ Maine)) by allowing approximate FD that holds for 95%
and B(cid:48). We would like to use set-based similarity to quan-
of values.
tify compatibility based on the overlap |B∩B(cid:48)|. However,
Example 5. Continue with Example 4, we have pruned common metrics like Jaccard Similarity, defined as ||BB∪∩BB(cid:48)(cid:48)||,
would not work because if one small relation is fully con-
awaythelastcolumnLocationfromTable7basedoncoher-
ence scores. Four columns remain, for a total of 2(cid:0)4(cid:1)=12 tained by another (B ⊃ B(cid:48), |B| (cid:29) |B(cid:48)|), the compatibility
2 should intuitively be high, but the Jaccard Similarity score
orderedcolumnpairs. Only2outofthe12columnpairssat-
would actually be low.
isfyFD,namely,(Home Team,Stadium)and(Stadium,Home
Containment metrics would mitigate this issue, but Jac-
Team).
cardContainmentisasymmetric–wewantittobesymmet-
ricbecauseboththecompatibilityofB,B(cid:48) andthecompat-
We note that around 78% candidates can be filtered out
withthesemethods. Theprocedureusedinthisstepcanbe 1
https://en.wikipedia.org/wiki/ComparisonofIOC, FIFA, and
found in Appendix A. ISO3166countrycodes
ibilityofB(cid:48),B areessentiallythesamething(w+(B,B(cid:48))= should change based on the length of values. For example,
w+(B(cid:48),B)). GiventheseweuseasymmetricvariantofJac- forshortvaluessuchas“USA”or“RSA”(forSouthAfrica),
card Containment called Maximum-of-Containment [8] for anyabsolutedistancethreshold≥1wouldincorrectlymatch
w+(B,B(cid:48)): the two. Fractional threshold on the other hand would re-
quire an exact match for short strings like these. We fur-
|B∩B(cid:48)| |B∩B(cid:48)|
w+(B,B(cid:48))=max{ , } (3) therrestrictthethresholdtobewithinsomefixedthreshold
|B| |B(cid:48)| k = 10 to safeguard false positives. Combining, we use
ed
θ (v ,v )=min{(cid:98)|v |·f (cid:99),(cid:98)|v |·f (cid:99),k }.
Example 6. Table 8 shows three two-column candidate ed 1 2 1 ed 2 ed ed
tables, B , B and B , respectively. The first two are for
1 2 3 Example 7. We continue with Example 6 in Table 8.
the IOC code, while the last is for a different ISO code. All
When using approximate matching for positive compatibil-
of these three are valid mappings but are for two different
ity, w+(B ,B ) will now be updated to max{4,4} = 0.67.
country-code standards, as explained in Figure 2. 1 2 6 6
This is because in addition to the first three matching rows
Using Equation (3), we can compute the positive compat-
between B and B , now the fourth row“American Samoa”
ibility between each pair of tables. For example, we have 1 2
w+(B ,B )=max{3,3}=0.5, because |B ∩B |=3 (the and “American Samoa (US)” will also be considered as a
1 2 6 6 1 2 match, as the Edit Distance between the two values is 2
first three rows), suggesting that the two tables share a sig-
(ignoring punctuations), which is no greater than θ =
nificant fraction of mappings and are likely to be compatible ed
min{(cid:98)13·0.2(cid:99),(cid:98)15·0.2(cid:99),10}=2.
for synthesis.
Efficiency. Although conceptually compatibility scores Efficiency. There are hundreds of millions of table pairs
can be computed for all pairs of candidates, in reality most for which we need to compute compatibility. Let m and n
tables share no common values, and will have a score of 0. be the numbers of values in a pair of tables. For each pair
ApracticalissuehereisthatgivenN totalcandidatetables, we need to make O(nm) approximate string comparisons,
we need to perform O(N2) expensive containment compu- eachofwhichisinturnO(|v1||v2|)whenusingconventional
tations. With millions of tables, this quadratic step is too dynamic programming on the full matrix. This is too ex-
expensive even for large Map-Reduce clusters. pensive even for production Map-Reduce clusters.
In reality we observe that the scores for most pairs of Ourobservationisthattherequirededitdistancethresh-
tablesarezerosincetheysharenooverlappingvaluesatall. old θed is small in most cases. So using ideas similar to
For example, Table 1a is about countries and Table 1b is the Ukkonen’s algorithm [35], we only compute DP on the
about stock tickers. They have no overlaps in value-pairs, narrow band in the diagonal direction of the matrix, which
so both positive and negative weights are 0. Computing makes it O(θed·min{|v1|,|v2|}). Since θed is small it makes
scores for these non-overlapping sets is clearly wasteful. this step feasible. Pseudo-code of this step can be found in
To address this problem, we use inverted-index-like re- Appendix B.
grouping in a Map-Reduce round to map all tables sharing Synonyms. In some cases, synonyms of entity names
at least some common value-pairs to the same partition, so may be available, e.g., using existing synonym feeds such
thatcompatibilityiscomputedonlyforpairsoftableswithin as [10]. If we know, for instance,“US Virgin Islands”and
each partition. Specifically, we evaluate w+(B,B(cid:48)) only if “United States Virgin Islands”are synonyms from external
B and B(cid:48) share more than θoverlap value pairs (both left sources,wecanboostpositivecompatibilitybetweenB1and
andrightvalues),andsimilarlyweevaluatew−(B,B(cid:48))only B2 in Table 8 accordingly. We omit discussions on possible
if B and B(cid:48) share more than θ left-hand-side values. lookup-based matching in the interest of space.
overlap
Inpractice, thenumberofnon-zeroweightededgesismuch
smaller than N2. This optimization makes it possible to NegativeEvidenceforIncompatibility.
scalethepair-wisecomputationsteptohundredsofmillions Positiveevidencealoneisoftennotsufficienttofullycapture
of tables. compatibility betweentables, as tables of different relation-
Approximate String Matching. In real tables, val- shipsmaysometimeshavesubstantialoverlap. Forexample,
ues from different tables often have slight variations, such it canbe computedthatthe positive compatibility between
as “Korea, Republic of” & “Korea Republic”, or “Ameri- B inTable8aandB Table8cismax{3,3}=0.5(thefirst,
1 3 6 6
can Samoa”&“American Samoa (US)”. In practice, there secondandfifthrowsmatch). Giventhehighscore,thetwo
are other extraneous information in table cells, such as the will likely merge incorrectly (note that one is for IOC code
footnote mark“[1]”in the fourth row in Figure 2. These whiletheotherisforISO).Thisissueexistsingeneralwhen
artificially reduce positive compatibility and in some cases one of the columns is short and ambiguous (e.g. codes),
increasenegativecompatibilitybetweentables,whichisun- or when one of the tables has mixed values from different
desirable. mappings (e.g., both city to state and city to country).
Toaccountforsuchminorsyntacticvariations,weuseap- We observe that in these cases the two tables actually
proximatestringmatchingbetweencellvalues. Specifically, also contain conflicting value pairs, such as the third and
we measure the Edit Distance, denoted as d (v ,v ), be- fourth row in the example above where the two tables have
ed 1 2
tween a pair of values v and v . We treat v and v as a the same left-hand-side value, but different right-hand-side
1 2 1 2
match if d (v ,v ) is smaller than a threshold θ . Here values. Thisviolatesthedefinitionofmappingrelationship,
ed 1 2 ed
we use a fractional threshold defined as θ = min{(cid:98)|v |· andisaclearindicationthatthetwotablesarenotcompat-
ed 1
f (cid:99),(cid:98)|v |·f (cid:99)},whichisdynamicallydeterminedbasedon ible, despite their positive scores.
ed 2 ed
thelengthofstring|v |,|v |,andafixedfractionalvaluef We thus introduce a negative incompatibility between ta-
1 2 ed
(e.g., 0.2). We choose to use a fractional distance instead bles. GiventwotablesB andB(cid:48),definetheirconflict set as
of an absolute distance, because the desired edit distance F(B,B(cid:48)) = {l|(l,r) ∈ B,(l,r(cid:48)) ∈ B(cid:48),r (cid:54)= r(cid:48)}, or the set of
values that share the same left-hand-side but not the right- ISO 0.7
hand-side. For example, between B in Table 8a and B -0.5 3 3,4,5
Table 8c, (Algeria, ALG) and (Algeri1a, DZA) is a conflict.3 1 0.5 -0. 7 4 ISO IOC
-0.7
To model the (symmetric) incompatibility between two -0.33 0.8 0.6
tables B and B(cid:48), we define a negative incompatibility score 0.67 0.33 1.5
w−(B,B(cid:48))similartopositivecompatibilityinEquation(3): 2 0.67 5 IOC 1,2
(a) (d)
|F(B,B(cid:48))| |F(B,B(cid:48))|
w−(B,B(cid:48))=−max{ , } (4) 1.3
|B| |B(cid:48)| -0.5 3,5 -0.7 3,4,5
Example 8. WecontinuewithExample7inTable8. As 1 0.5 -0.7 4 1 0.5
idTsiassbculuebsss8teaadntaeinaardlliaeBnr3d, wtihniellTplaeobasldietitv8oecincioscmomrpraaecxtit{bim63l,iet36yrg}ebse=tbwe0ete.w5ne,eBnw1htiwcinoh 0.67 -02 .33 1.0 (b) 0.67 -02 .33 1.0 (c)
different relationships (IOC and ISO).
Usingnegativeincompatibility,wecancomputew−(B1,B3) Figure 3: Graph representation of candidate tables. Solid
as−max{3,3}=−0.5,sincethethird,forthandsixthrows vertices on the left represent tables for ISO codes; hollow
6 6
conflict between the two tables, and both tables have 6 rows. vertices on the right represent tables for IOC codes. Fur-
This suggests that B1 and B3 have substantial conflicts, in- thermore, solid edges indicate positive compatibility, while
dicating that a merge will be inappropriate. dashed edges indicate negative incompatibility. Edges with
In comparison, for B1 in Table 8a and B2 in Table 8b, weight of 0 are omitted on the graph.
which talk about the same relationship of IOC, their conflict
set is empty and w−(B ,B )=0, indicating that we do not
1 2
havenegativeevidencetosuggestthattheyareincompatible. Wewanttomaximizethesumofthisscoreacrossallparti-
tions, or (cid:80) w+(P). This is our optimization objective.
4.2 ProblemFormulationforSynthesis P∈P
On the other hand, we do not want to put incompati-
WeuseagraphG=(B,E)tomodelcandidatetablesand ble tables with non-trivial w− scores, such as B and B in
1 3
theirrelationships,whereB istheunionofallbinarytables Example 9, in the same partition. Since we disallow this
produced in the preprocessing step in Section 3. In G each to happen, we treat edges with negative scores w− below a
vertexrepresentsatableB ∈B. Furthermore,foreachpairs thresholdτ ashard-constraints. Notethatanegativethresh-
ofverticesB,B(cid:48) ∈B,weusecompatibilityscoresw+(B,B(cid:48)) old τ (e.g., −0.2) is used in place of 0 because we do not
and incompatibility scores w−(B,B(cid:48)) as the positive and over-penalize tables with slight inconsistency due to minor
negative edge weights of the graph. quality and extraction issues. We ignore the rest with in-
Example 9. Given the tables B , B and B in Table 8, significant negative scores by essentially forcing them to 0.
1 2 3 Let w−(P) be the sum of substantial negative weights in P
we can represent them and their compatibility relationships
defined below.
as a graph as in Figure 3(a).
As discussed in Example 7, the positive compatibility be- w−(P)= (cid:88) w−(B ,B )
i j
tween w+(B ,B )=0.67, which is shown as solid edge with
positive weig1ht 2in this graph. Similarly we have negative Bi,Bj∈P,w−(Bi,Bj)<τ
edge weights like w−(B ,B )=−0.5 as discussed in Exam- Weusethisasaconstraintofourformulation–wewantno
1 3
ple 8. This graph omits edges with a weight of 0, such as edgesinthesamepartitiontohavesubstantialconflicts,or,
w−(B ,B ). w−(P)=0,∀P ∈P.
1 2
Putting these together, we formulate table synthesis as
Since we need to synthesize compatible tables into larger
follows.
mapping relationships, in the context of graph G we need
to group compatible vertices/tables together. This natu- Problem 10 (Table Synthesis).
rwahlleyrecoearrcehspPon⊆dsBtoreparepsaernttistioansiunbgsePto=ft{aPb1le,sPt2h,a..t.}caonf bBe, max (cid:88) w+(P) (5)
i
synthesized into one relationship. Since different partitions P∈P
correspondtodistinctrelationships,thepartitioningshould s.t. (cid:88) w−(P)=0 (6)
bedisjoint(P ∩P =∅,i(cid:54)=j), andtheyshouldcollectively
i j P∈P
(cid:83)
cover B, or P =B. (cid:92)
P∈P P P =∅, ∀P (cid:54)=P (7)
Intuitively, therearemanywaystopartitionB disjointly, i j i j
butwewanttofindagoodpartitioningthathasthefollow- (cid:91)
P =B (8)
ing desirable properties: (1) compatible tables are grouped
P∈P
together as much as possible to improve coverage of indi-
vidual mapping relationships; and (2) incompatible tables Byplacingcompatibletablesinthesamepartition,wescore
should not be placed in the same partition. more in the objective function in Equation (5), but at the
We translate these intuitive requirements into an opti- sametimeEquation(6)guaranteesthatnoconflictingnega-
mization problem. First, we want each partition P to have tiveedgecanbeinthesamepartition. Equation(7)and(8)
as many compatible tables as possible. Let w+(P) be the are used to ensure that P is a proper disjoint partitioning.
sum of positive compatibility in a partition P:
Example 11. WerevisittheexampleinFigure3(a). Us-
w+(P)= (cid:88) w+(B ,B ) ingtheformulationabove,itcanbeverifiedthatthebestpar-
i j
titioningis{{B ,B },{B ,B ,B }},whichgroupstwoISO
Bi,Bj∈P,i<j 1 2 3 4 5
sameleft-hand-sidevalue,buthavedifferentright-hand-side
values. This could be due to quality issues in the original
input tables, such as the example in Figure 4 that has in-
correctchemicalsymbolsfortwooftherows(thesymbolof
TelluriumshouldbeTeandTelluriumshouldbeI).Qual-
ityissueslikethisareactuallycommoninlargecorpus,and
manifestthemselvesasinconsistentmappingsinsynthesized
results. Since the majority of tables in the partition should
agreewiththeground-truthmapping,weresolveconflictsby
Figure 4: A real table with errors that can cause conflicts. removing the least number of low-quality tables, such that
the resulting partition has no conflicts.
Let P be a partition with candidate tables {B ,B ,...},
1 2
tables and three IOC tables into separate partitions. This
each of which is a set of value pairs B = {(l,r)}. Recall
i
partitioning has a total score of 2.77 based on Equation (5), that in Section 4.1 we define a conflict set F(B,B(cid:48)) to be
without violating constraints in Equation (6) by not placing {l|(l,r) ∈ B,(l,r(cid:48)) ∈ B(cid:48),r (cid:54)= r(cid:48)}. We can again leverage
negative edges in the same partition. synonyms and do not treat (l,r),(l,r(cid:48)) as conflicts if (r,r(cid:48))
It is worth noting that existing techniques like schema
are known to be synonyms.
matching [31] only consider positive similarity (because FD
Now we want to find out the largest subset P ⊆P such
T
do not generally hold in tables), and as a result merge all 5
thatnotwotablesinP conflictwitheachother,whichcan
T
tables in this example, producing results of low quality.
be formulated as follows.
Theorem 12. The problem Table-Synthesis is NP-hard.
Problem 14 (Conflict Resolution).
We prove this using a reduction from graph multi-cut. (cid:12) (cid:12)
(cid:12) (cid:12)
There also exists a trichotomy of complexity depending on (cid:12) (cid:91) (cid:12)
the number of negative edges in the graph. A proof of the max (cid:12)(cid:12) Bi(cid:12)(cid:12)
hardness can be found in Appendix C. (cid:12)Bi∈PT (cid:12)
We can show that the loss-minimization version of this s.t. F(Bi,Bj)=∅, ∀Bi,Bj ∈PT (9)
problemcanbesolvedusingLP-relaxationandrandomized
The objective is to include as many value pairs as possible,
rounding [5], to produce O(logN) approximation. Details
under the constraint that no pairs of tables in the selected
of this LP-based solution can be found in a full version of
subset P can have conflict.
this paper. While this LP based solution is practical for T
This problem is NP-hard (reduction from Independent
problemsofmoderatesizes,wearedealingwithgraphswith
Set). So we iteratively find and remove a value pair that
millionsofverticesandaquadraticnumberofvariables,for
conflicts with the most other value pairs. Pseudo-code of
which existing LP-solvers cannot currently handle.
this procedure can be found in Appendix G.
Asaresult,weuseanefficientheuristictoperformgreedy
Wenotethattherearemanyexistingmethodsforconflict
synthesis. Specifically,weinitiallytreateachvertexasapar-
resolution[28]thatcanconceptuallybeappliedtothepost-
tition. Wetheniterativelymergeapairofpartitions(P ,P )
1 2
thatarethemostcompatibletogetanewpartitionP(cid:48),and processingstep,anditisinterestingtoexploretheirapplica-
bility. Because we donotconsider this post-processingstep
updatetheremainingpositive/negativeedges.Thealgorithm
tobeourkeycontribution,andweincludethisstepherefor
terminateswhennopartitionscanbemerged. Pseudo-code
completeness,wedonotperformanexhaustivecomparison.
of this procedure can be found in Appendix D.
Efficiency. While the procedure above appears straight- 4.3 SynthesizedMappingsforCuration
forward for graphs that fit in a single machine, scaling to
While the synthesized mappings produced by our algo-
largegraphsonMap-Reduceisnotstraightforward. Weuse
rithm are generally of high quality, for many applications a
adivide-and-conquerapproachtofirstproducecomponents
very high precision is required. For example, for commer-
thatareconnectednon-triviallybypositiveedgesonthefull
cialspreadsheetsoftwarelikeExcel,anyerrorintroducedby
graph, and then look at each subgraph individually. More
black-box algorithms can be hard to detect by users, but
discussions on this step can be found in Appendix E.
hasdamagingconsequencesandthusunacceptable. Insuch
Example 13. Figure 3 shows how Algorithm 3 works on settings, our approach of pre-computing all candidate map-
a small graph. The algorithm first merges {B } and {B } pings from table corpora allows humans to inspect and cu-
3 5
to get Figure 3b because Edge ({B },{B }) has the great- rate these mappings to ensure very high accuracy. High-
3 5
est weight. The weight of Edge ({B },{B ,B }) changes as quality mappings produced by automatic algorithms can
2 3 5
w+({B },{B ,B })⇐w+({B },{B })+w+({B },{B }). greatly reduce the effort required by human curators.
2 3 5 2 3 2 5
TheweightofEdge({B },{B ,B })alsochangessimilarly. It is interesting to note that synthesized results we pro-
4 3 5
Thealgorithmthenmerges{B ,B }and{B }togetFig- ducehaveanaturalnotionofimportance/popularity. Specif-
3 5 4
ure3candfinallycombines{B }and{B }togetFigure3d. ically,foreachsynthesizedmapping,wehavestatisticssuch
1 2
The algorithm stops because of the negative weight between as the number of web domains whose tables contributed to
{B ,B } and {B ,B ,B }. this mapping, and how many raw tables are synthesized in
1 2 3 4 5
the same cluster, etc. Such statistics are very well corre-
ConflictResolution. Weobservethatsynthesizedrelations lated to the importance of the mapping, because the more
often have conflicts that require post-processing. Specifi- itoccursinthetablecorpus,themorelikelyitisfrequently
cally, when we union all tables in the same partition to- used and important. This property makes results produced
gether, there will be a small fraction of rows that share the by our approach amenable to human curation – instead of
listofcountriesandcapitals listofpokemonsandcategories truth. Finally we combine these high-quality web tables
listofcarmodelsandmakes listofaminoacidsandsymbols with instances in Freebase and YAGO if they have cover-
age. Notethattheresultingmappingrelationshipshaverich
Figure 5: Example queries with“list of A and B” synonymsforthesameentity(e.g.,asshowninTable6),as
well as more comprehensive coverage for instances. Con-
FIPS5-2 ISO3166-1Alpha-3 structing such a benchmark set, and ensuring its correct-
FIPS10-4 ISO3166-1Numeric ness/completeness is a time-consuming process. We intend
IANACountryCode ITU-RCountryCode to publish this benchmark set online to facilitate future re-
IATAAirportCode ITU-TCountryCallingCode search in this area.
ICAOAirportCode MARCCountryCode
Enterpriseismoredifficulttobenchmarkbecauseofthe
IOCCountryCode NUTS(EU)
difficulty in ensuring completeness of instances in certain
ISO3166-1Alpha-2 SGCCodes(Canada)
mappings–thegroundtruthmaybeinmasterdatabasesfor
Figure 6: Geocoding Systems whichwehavenoaccess(e.g.,employeeandlogin-alias).
Nevertheless,webuilt30besteffortbenchmarkcases. Recall
resultsonthesetestsshouldbeinterpretedasrelative-recall
lookingatafullcorpuswithmillionsoftables,onejustneeds given the difficulty to ensure completeness.
to look at synthesized results popular enough.
Metrics. We use the standard precision, recall and f-score
5. EXPERIMENTS to measure the performance. Let B∗ = {(l∗,r∗)} be a
ground truth mapping, and B = {(l,r)} be a synthesized
5.1 ExperimentalSetup relationship for which we want to evaluate its quality. The
precisionofB isdefinedas |B∩B∗|,therecallis |B∩B∗|,and
|B| |B∗|
TableCorpus. Weusetwotablecorporaforourevaluation. the f-score is 2pprerceicsiisoionn+·rreeccaallll.
Thefirsttablecorpus,henceforthdenotedasWeb,hasover
100 million tables crawled and extracted from the public Methodscompared. We compare the following methods.
web. These tables cover diverse domains of interests.
• UnionDomain. LingandHalevyetal.[30]proposetounion
Thesecondtablecorpus,denotedasEnterprise,hasabout
togethertableswithinthesamewebsitedomain,iftheircol-
500K tables extracted from spreadsheets files crawled from
umn names are identical but row values are disjoint. We
the intranet of a large IT company.
applythistechniquebyessentiallygroupingtablesbasedon
column names and domain names. We evaluate the result-
ComputingEnvironment. Weimplementedalgorithmsde-
inguniontablesagainsteachbenchmarkcasebypickingthe
scribedinthispaperasMap-Reduceprograms. Weranour
union table with the highest F-score.
jobsinalargeMap-Reducecluster,alongsidewithotherpro-
• UnionWeb. Noticingthatonlyunion-ingtablesinthesame
ductionjobs. OurinputforWebhasabout223Mtwo-column
domain may be restrictive and missing instances for large
tables with a size of over 200GB.
relationships,weextendthepreviousapproachtoalsomerge
tables with the same column names across the web, and
Benchmarks. We have built a benchmark dataset to eval- evaluateallbenchmarkcaseslikeabove. Thisisavariantof
uateourframeworkonWebtablecorpus2. Thisbenchmark UnionDomain.
datasetcontains80desirablemappingrelationshipsthatwe • Synthesis. This is our approach that synthesizes map-
manually curated. These relationships are collected from ping relationships as described in Section 4.
two sources. • SynthesisPos. ThisisthesameasSynthesisexceptthat
• Geocoding: We observe that geography is a common do- it does not use the negative signals induced by FDs. This
mainwithrichmappingrelationshipsthatareoftenusedin helps us to understand the usefulness of negative signals.
auto-join and auto-correction scenarios. Examples here in- • WiseIntegrator [22, 23]. Thisisarepresentativemethod
clude geographical and administrative coding such as coun- in a notable branch in schema matching that collectively
trycode,statecode,etc. Sowetake14casesfromaWikipedia matches schemas extracted from Web forms. It measures
list of geocoding systems3. We omit codes that are impos- the similarity between candidates using linguistic analysis
sible to enumerate such as military grid reference system, of attribute names and value types, etc., and performs a
andonesnotcompletelylistedonWikipediasuchasHASC greedy clustering to group similar attributes.
code. Figure 6 lists all cases we take. • SchemaCC. In this method, we mimic pair-wise schema
• QueryLog: Wesamplequeriesofthepattern“listofAand matchers that use the same positive/negative similarity as
B”inBingquerylogsthatsearchformappingrelationships. our approach. Because match decisions are pair-wise, we
Figure 5 shows a few examples with true mappings. aggregate these to a group-level based on transitivity (e.g.,
For Web, after selecting mapping relationships, we curate iftableAmatchesBandBmatchesC,thenAalsomatches
instancesforeachrelationship,bycombiningdatacollected C). This is implemented as connected components on very
from web tables as well as knowledge bases. Specifically, largegraphs,whereedgesarethresholdbasedonaweighted
we find a group of tables for each relationship, and then combinationofpositive/negativescores. Wetesteddifferent
manually select high-quality ones to merge into the ground thresholds in the range of [0,1] and report the best result.
2Mappings in the web benchmark is available at • SchemaPosCC. This is the same as SchemaCC but without
https://www.microsoft.com/en-us/research/publication/ negativesignalsinducedbyFDs,sincetheyarenotexplored
synthesizing-mapping-relationships-using-table-corpus/ intheschemamatchingliterature. Weagaintestthresholds
3https://en.wikipedia.org/wiki/Geocoding in [0,1] and report the best number.
1.0
Incomparison,usingonlyrawtablesfromWikiTablewith
0.8 no synthesis has high precision but low recall, because not
e0.6 only are certain instances and synonyms missing (these ta-
Scor0.4 bles tend to be short for human consumption), many rela-
tionshipsarealsomissingaltogetherfromWikiTable. Sothe
0.2 approach of manually going over high-quality WikiTable to
curate mapping relationships is unlikely to be sufficient.
0.0
SynthesisWikiTableWebTUanbiloenDomaiUnnionSWyentbhesisPoCsorrelSactihoenmaPosCSCchemWiasCeCIntegratorFreebase YAGO iTTabhlee,WebbuTtabcolnesiadpeprrsotaacbhleussensotralwimtitaebdletsositmheilaWr itkoipWeidki-a
domain and thus has substantially better recall. While the
Algorithms AvgFscore AvgPrecision AvgRecall
precision of WebTable and Synthesis are comparable, the
Figure 7: Average f-score, precision and recall comparison. recall of Synthesis is substantially higher (0.88 vs. 0.32).
Despite this, we want to note that the setup of this com-
parison of is very favorable for WebTable – we select the
• Correlation [12]. Inthismethod,weagainmimicpair- best table among the hundreds of millions of raw tables in
wiseschemamatcherswiththesamepositive/negativescores WebTable, whereas in Synthesis we only use relations syn-
as Synthesis. Instead of using connected components for thesizedfromover8websitedomainsthatisthreeordersof
aggregation as in SchemaCC above, here we instead use the magnitude less (Section 4.3). Because it is not possible for
correlation clustering that handles graphs with both posi- humantogoovermillionsoftablestopickusefulmappings
tive or negative weights. We implement the state-of-the- inpractice,WebTableonlyprovidesanupper-boundofwhat
art correlation clusterin on map-reduce [12], which requires can be achieved and not really a realistic solution.
O(log|V|·∆+)iterationsandtakesalongtimetoconverge UnionDomainandUnionWebsynthesizetablesbasedonta-
(|V| is the number of vertices of the graph and ∆+ is the ble column names and domain names. The recall of these
maximumdegreeofallvertices). Wetimeoutafter20hours two approaches is considerably better than WikiTable and
and evaluate the results at that point. WebTable, showing the benefit of performing table synthe-
• WikiTable. Wikipedia has many high-quality tables cov- sis. However, this group of approaches merge tables only
ering various domains, many of which have mapping rela- basedoncolumnnames,whichareknowntobeuninforma-
tionships. Tounderstandthequalityofusingrawtablesin- tiveandundescriptiveinmanycases. Weobservethatwhen
stead of performing synthesis, we also evaluate each bench- applied to the whole web, this often leads to over-grouping
mark case by finding best pair of columns in a Wikipedia andunder-grouping. Theoverallf-scoresoftheseapproaches
table that has the highest in F-score. arethebestamongallexistingmethods,butstilllagbehind
• WebTable. This method is very similar to the previous Synthesis, which uses values that are more indicative of
WikiTable, but use all tables in the Web corpus instead of table compatibility.
just Wikipedia ones. SynthesisPos uses the same algorithm as Synthesis but
• Freebase. Freebase [7] is a well-known knowledge base does not consider the negative incompatibility induced by
that has been widely used. We obtained its RDF dump4 FDs. It is interesting to observe that result quality suffers
and extract relationships by grouping RDF triples by their substantially, which underlines the importance of the nega-
predicates. Wetreatthesubject→objectasonecandidate tive signals.
relationship,andtheobject→subjectasanothercandidate. SchemaCC performs substantially worse than Synthesis.
• YAGO. YAGO [34] is another public knowledge base that Recall that it uses the same positive/negative signals, but
is extensively used. We process a YAGO data dump simi- aggregate pair-wise match decisions using connected com-
lartoFreebase,bygroupingYAGORDFtriplesusingtheir ponents. This simple aggregation tends to over-group and
predicates to form subject-object and object-subject rela- under-group different tables, producing undesirable table
tionships. clusters.
Note that in all these cases, we score each benchmark SchemaPosCCignoresthenegativesignalsusedinSchemaCC,
case by picking the relationship in each data set that has sinceFD-inducednegativesignalsarenotexploredinschema
the best f-score. This is favorable to all the methods – a matching. Unsurprisingly,resultqualitydropsevenfurther.
human who wishes to pick the best relationship to be used CorrelationissimilartoSchemaCCthatalsomimicsschema
asmappings,andwhocouldaffordtoinspectallthesetables, matcherswithsamesignals,butaggregateusingcorrelation
would effectively pick the same tables. clustering. Overall, its f-score is better than SchemaCC, but
is still worse than Synthesis. We think there are two main
5.2 QualityComparison reasons why it does not work well. First, at the conceptual
Figure 7 shows the average f-score, precision and recall level, the objective of correlation clustering is the sum of
across all 80 benchmark cases in the Web benchmark for all positive and negative edges. Because the number of table
methods compared. Synthesis scores the best in average pairsthatwouldbeindifferentclustersfarexceedstheones
recall(0.88)andf-score(0.90),whileWikiTablehasthebest that should be in the same clusters, making negative edges
average precision (0.98) 5. dominate the objective function. However, in our problem,
we should actually only care about whether tables in the
4https://developers.google.com/freebase/ same clusters correspond to the identical mapping, which
5Since WikiTable methods miss many relationships, we ex- are the intra-cluster positive edges that are more precisely
clude cases whose precision is close to 0 from the average- modeled in our objective function. Second, a shortcoming
precision computation. This makes the average precision of the parallel-pivot algorithm [12] is that it only looks at
favorable to WikiTable. The same is applied to other table
asmallneighborhoodforclusters(i.e. one-hopneighborsof
and knowledge based methods.
Description:Company. MSFT. Microsoft Corp. ORCL. Oracle. INTC. Intel. GE. General Electric. UPS. United Parcel Services … … (b) Stock tickers. State. Abbrev. Alabama. AL . requiring users to perform manual synthesis from multiple tables. typically used in database contexts for a small number of schemas