A DNF Blocking Scheme Learner for Heterogeneous Datasets

ABSTRACT

Entity Resolution concerns identifying co-referent entity pairs across datasets. A typical workflow comprises two steps. In the first step, a blocking method uses a one-many function called a blocking scheme to map entities to blocks. In the second step, entities sharing a block are paired and compared. Current DNF blocking scheme learners (DNF-BSLs) apply only to structurally homogeneous tables. We present an unsupervised algorithmic pipeline for learning DNF blocking schemes on RDF graph datasets, as well as structurally heterogeneous tables. Previous DNF-BSLs are admitted as special cases. We evaluate the pipeline on six real-world dataset pairs. Unsupervised results are shown to be competitive with supervised and semi-supervised baselines. To the best of our knowledge, this is the first unsupervised DNF-BSL that admits RDF graphs and structurally heterogeneous tables as inputs.

Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning; H.2.8 [Database Management]: Database applications—Data Mining

General Terms: Algorithms, Experimentation, Performance

Keywords: Heterogeneity, Entity Resolution, Unsupervised Learning

1. INTRODUCTION

Entity Resolution (ER) is the identification of co-referent entities across datasets. Different communities refer to it as instance matching, record linkage and the merge-purge problem [12, 13]. Scalability indicates a two-step solution [12]. The first step, blocking, mitigates brute-force pairwise comparisons on all entities by clustering entities into blocks and then comparing pairs of entities only within blocks [9]. Blocking results in the selection of a small subset of pairs, called the candidate set, which is input to a second step to determine co-reference using a sophisticated similarity function. We exclusively address blocking in this work.

Blocking methods use a blocking scheme to assign entities to blocks. Recently, Disjunctive Normal Form (DNF) Blocking Scheme Learners (BSLs) were proposed to learn DNF blocking schemes using supervised [3, 21], semi-supervised [6] or unsupervised machine learning techniques [16]. DNF-BSLs operate in an expressive hypothesis space and have shown excellent empirical performance [3]. Despite their advantages, current DNF-BSLs assume that input datasets are tabular and have the same schemas. The latter assumption is often denoted as structural homogeneity [12].

The assumption of tabular structural homogeneity restricts application of DNF-BSLs to other data models. Recent growth of graph datasets and Linked Open Data (LOD) [5] motivates the development of a DNF-BSL for the RDF (Resource Description Framework) data model. These graphs are often published by independent sources and are heterogeneous [27].

In this paper, we present a generic algorithmic pipeline for learning DNF blocking schemes on pairs of RDF datasets. The formalism of DNF blocking schemes relies on the existence of a schema. RDF datasets on LOD may not have accompanying schemas [5]. Instead of relying on possibly unavailable metadata, we build a dynamic schema using the properties in the RDF dataset. The RDF dataset can then be logically represented as a property table, which may be populated at run-time. Previously, property tables were defined as physical data structures used in the implementation of triplestores [28]. Using a logical property table representation admits application of a DNF-BSL to RDF datasets.

As a special case, the pipeline also admits structurally heterogeneous tabular inputs. That is, the pipeline can be applied to tabular datasets with different schemas. Thus, previous DNF-BSLs become special cases of the pipeline, since they learn schemes on tables with the same schemas. As a second special case, the pipeline accommodates RDF-tabular heterogeneity, with one input RDF and the other tabular. RDF-tabular heterogeneity applies when linking datasets between LOD and the relational Deep Web [15, 5].
Much existing ER research bypasses the issue of structural heterogeneity by assuming that schema matching is physically undertaken prior to executing an ER algorithm [12]. Schema matching itself is an active, challenging research problem [1, 14]. Instead of relying on the unrealistic assumption of perfect schema reconciliation, the pipeline admits possibly noisy schema mappings from an existing schema matcher, and directly uses the mappings to learn a DNF blocking scheme. If a matcher is unavailable, the pipeline allows for a fall-back option.

We show an unsupervised instantiation of the generic pipeline by strategically employing an existing instance-based schema matcher called Dumas [4]. In addition to schema mappings, Dumas also generates noisy duplicates, which we recycle and pipe to a new DNF-BSL. The new DNF-BSL uses only two parameters, which may be robustly tuned for good empirical performance across multiple domains. The BSL is also shown to have a strong theoretical guarantee.

We evaluate the full unsupervised pipeline on six real-world dataset pairs against two extended baselines, one of which is supervised [3, 16]. Our system's performance is found to be competitive, despite training samples not being manually provided. Given that unsupervised methods exist for the second ER step in both the Semantic Web and relational communities [13, 12], this first unsupervised heterogeneous DNF-BSL enables, in principle, fully unsupervised ER in both communities.

Section 2 describes related work in the area, followed by preliminaries in Section 3. Section 4 describes the generic pipeline, followed by an unsupervised instantiation in Section 5. Section 6 describes the experiments, with results discussed in Section 7. The paper concludes in Section 8.
2. RELATED WORK

ER was comprehensively surveyed by Elmagarmid et al. [12], with a generic approach represented by Swoosh [2]. Separately, blocking has witnessed much specific research, with Christen surveying blocking methods [9]. Bilenko et al. [3], and Michelson and Knoblock [21], independently proposed supervised DNF-BSLs in 2006. Since then, a semi-supervised adaptation of the BSL proposed by Bilenko et al. has been published [3, 6], as well as an unsupervised system [16]. The four systems assume structural homogeneity. We discuss their core principles in Section 4.3.

Heterogeneous blocking may be performed without learning a DNF scheme. One example is Locality Sensitive Hashing (LSH), employed by the Harra system, for instance [17]. LSH is promising but applies only to specific distance measures, like Jaccard and cosine. The Typifier system by Ma et al. is another recent example that relies on type inferencing and was designed for Web data published without semantic types [20]. In contrast, DNF-BSLs can be applied generally, with multiple studies showing strong empirical performance [21, 3, 6, 16]. Finally, although related, clustering is technically treated separately from blocking in the literature [9].

In the Semantic Web, ER is simultaneously known as instance matching and link discovery, and has been surveyed by Ferraram et al. [13]. Existing works restrict inputs to RDF. Also, most techniques in the Semantic Web do not learn schemes, but instead present graph-based blocking methods, a good example being Silk [27].

Finally, the framework in this paper also relies on schema mapping. Schema mapping is an active research area, with a good survey provided by Bellahsene et al. [1]. Gal notes that it is a difficult problem [14]. Schema matchers may return 1:1 or n:m mappings (or even 1:n and n:1). An instance-based schema matcher relies on data instances to perform schema matching [1]. A good example is Dumas [4], which relies on an inexpensive duplicates generator to perform unsupervised schema matching. We describe Dumas in Section 5.

The property table representation used in this paper is a physically implemented data structure in the Jena triplestore API (https://jena.apache.org/) [28]. In this paper, it is used as a logical data structure. We note that the concept of logically representing one data model as another has precedent. In particular, the literature abounds with proposed methods on how to integrate relational databases (RDB) with the Semantic Web. Sahoo et al. extensively surveyed this topic, called RDB2RDF [24]. A use-case is the Ultrawrap architecture, which utilizes RDB2RDF to enable real-time Ontology-Based Data Access, or OBDA [25]. We effectively tackle the inverse problem by translating an RDF graph to a logical property table. This is the first application to devise such an inverse translation for heterogeneous ER.

3. PRELIMINARIES

We present definitions and examples to place the remainder of the work in context. Consider a pair of datasets R1 and R2. Each dataset individually conforms to either the RDF or the tabular data model. An RDF dataset may be visualized as a directed graph or, equivalently, as a set of triples. A triple is a 3-tuple of the form (subject, property, object), where property is alternatively called predicate; we uniformly use property. A tabular dataset conforms to a tabular schema, which is the table name followed by a set of fields. The dataset instance is a set of tuples, with each tuple comprising field values.

Example 1. Dataset 1 (in Figure 1) is an RDF dataset visualized as a directed graph G = (V, E), and can be equivalently represented as a set of |E| triples. For example, (Mickey Beats, hasWife, Joan Beats) would be one such triple in the triples representation. Datasets 2 and 3 are tabular dataset examples, with the former having schema EmergencyContact(Name, Contact, Relation). The first tuple of Dataset 2 has field values Mickey Beats, Joan Beats and Spouse respectively. The keyword null is reserved.

(Figure 1: Three example datasets exhibiting various kinds of heterogeneity)

According to the RDF specification (http://www.w3.org/RDF/), subjects and properties must necessarily be Uniform Resource Identifiers (URIs), while an object node may either be a URI or a literal. URI elements in RDF files typically have associated names (or labels), obtained through dereference. For ease of exposition, we henceforth refer to every URI element in an RDF file by its associated name. Note also that, in the most general case, RDF datasets do not have to conform to any schema. This is why they are commonly visualized as semi-structured datasets, and not as tables. In Section 4.1, we show how to dynamically build a property schema and logically represent RDF as a tabular data structure.

An entity is defined as a semantically distinct subject node in an RDF dataset, or as a (semantically distinct) tuple in a tabular dataset. The entity referring to Mickey Beats is shown in red in all datasets in Figure 1. Entity Resolution is the process of resolving semantically equivalent (but possibly syntactically different) entities.
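To make the triples representation of Example 1 concrete, the following minimal Java sketch models a triple as a 3-tuple and Dataset 1 as a set of such triples. The class and record names are our own illustrative assumptions, not part of any system described in this paper.

```java
import java.util.Set;

// A triple is a 3-tuple (subject, property, object); an RDF dataset is a
// set of triples, and an entity is a semantically distinct subject node.
public final class TriplesSketch {

    record Triple(String subject, String property, String object) {}

    public static void main(String[] args) {
        // One of the |E| triples equivalent to the Dataset 1 graph.
        Set<Triple> dataset1 = Set.of(
                new Triple("Mickey Beats", "hasWife", "Joan Beats"));
        dataset1.forEach(t -> System.out.println(
                t.subject() + " --[" + t.property() + "]--> " + t.object()));
    }
}
```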
As earlier described, ER is traditionally conducted on tables with the same schemas, or on one dataset (commonly just called deduplication in that case [12]). An example would be identifying that the two highlighted tuples in Dataset 3 are duplicates.

In the Semantic Web, ER is operationalized by connecting two equivalent entities with an owl:sameAs (http://www.w3.org/TR/owl-ref/) property edge. For example, the two nodes referring to Mickey Beats in Dataset 1 should be connected using an owl:sameAs edge. Easy operationalizing of ER (and more generally, link specification [27]) explains in part the recent interest in ER in the Semantic Web [13]. In the relational setting, ER is traditionally operationalized through joins or mediated schemas. It is less evident how to operationalize ER across RDF-tabular inputs, such as linking Datasets 1 and 2. We return to this issue in Section 4.1.

To introduce the current notion of DNF blocking schemes, tabular structural homogeneity is assumed for the remainder of this section. In later sections, we generalize the concepts. The most basic elements of a blocking scheme are indexing functions h_i(x_t) [3]. An indexing function accepts a field value from a tuple as input and returns a set Y that contains 0 or more blocking key values (BKVs). A BKV identifies a block in which the tuple is placed. Intuitively, one may think of a block as a hash bucket, except that blocking is one-many while hashing is typically many-one [9]. For example, if Y contains multiple BKVs, a tuple is placed in multiple blocks.

Definition 1. An indexing function h_i : Dom(h_i) → U* takes as input a field value x_t from some tuple t and returns a set Y that contains 0 or more Blocking Key Values (BKVs) from the set of all possible BKVs U*.

The domain Dom(h_i) is usually just the string datatype. The range is a set of BKVs that the tuple is assigned to. Each BKV is represented by a string identifier.

Example 2. An example of an indexing function is Tokens. When applied to the Last Name field value of the fourth tuple in Dataset 3, the output set Y is {W., Beats, Jr.}.
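A minimal Java sketch of Definition 1 follows, assuming (as this sketch does) that the Tokens function of Example 2 simply splits a field value on whitespace:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// A sketch of an indexing function (Definition 1): h_i : Dom(h_i) -> U*,
// mapping one field value to a set of string-valued BKVs.
public final class TokensIndexingFunction {

    static Set<String> tokens(String fieldValue) {
        return new LinkedHashSet<>(Arrays.asList(fieldValue.split("\\s+")));
    }

    public static void main(String[] args) {
        // Example 2: the Last Name value of the fourth tuple in Dataset 3.
        System.out.println(tokens("W. Beats Jr.")); // [W., Beats, Jr.]
    }
}
```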
This leads to the notion of a general blocking predicate (GBP). Intuitively, a GBP p_i(x_t1, x_t2) takes as input field values from two tuples, t1 and t2, and uses the i-th indexing function to obtain BKV sets Y1 and Y2 for the two arguments. The predicate is satisfied iff Y1 and Y2 share elements, or equivalently, iff t1 and t2 have a block in common.

Definition 2. A general blocking predicate p_i : Dom(h_i) × Dom(h_i) → {True, False} takes as input field values x_t1 and x_t2 from two tuples, t1 and t2, and returns True if h_i(x_t1) ∩ h_i(x_t2) ≠ ∅, and returns False otherwise.

Each GBP is always associated with an indexing function.

Example 3. Consider the GBP ContainsCommonToken, associated with the previously introduced Tokens. Suppose it was input the Last Name field values from the first and fourth tuples in Dataset 3. Since these field values have a token (Beats) in common, the GBP returns True.

A specific blocking predicate (SBP) explicitly pairs a GBP to a specific field.

Definition 3. A specific blocking predicate is a pair (p_i, f) where p_i is a general blocking predicate and f is a field. A specific blocking predicate takes two tuples t1 and t2 as arguments and applies p_i to the appropriate field values f1 and f2 from both tuples. A tuple pair is said to be covered if the specific blocking predicate returns True for that pair.

Previous DNF research assumed that all available GBPs can be applied to all fields of the relation [16, 3, 6, 21]. Hence, given a relation R with m fields in its schema (structural homogeneity implies exactly one input schema, even if there are multiple relational instances) and s GBPs, the number of SBPs is exactly ms. Finally, a DNF blocking scheme is defined as:

Definition 4. A DNF blocking scheme f_P is a positive propositional formula constructed in Disjunctive Normal Form (DNF; a disjunction of terms, where each term is a conjunction of literals), using a given set H of SBPs as the set of atoms. Additionally, if each term is constrained to comprise at most one atom, the blocking scheme is referred to as disjunctive.

SBPs cannot be negated, since the DNF scheme is a positive formula. A tuple pair is said to be covered if the blocking scheme returns True for that pair. Intuitively, this means that the two constituent tuples share a block. In practice, both duplicate and non-duplicate tuple pairs can end up getting covered, since blocking is just a pre-processing step.

Example 4. Consider the disjunctive scheme (ContainsCommonToken, Last Name) ∨ (SameFirstDigit, Zip), applied on Dataset 3. While the two tuples referring to Mickey Beats would share a block (with the BKV Beats), the non-duplicate tuples referring to Susan and Samuel would also share a block (with the BKV 6). Note also that the first and fourth tuples share more than one block, since they also have BKV 7 in common.

Given a blocking scheme, a blocking method would need to map tuples to blocks efficiently. Per Definition 4, a blocking scheme takes a tuple pair as input. In practice, linear-time hash-based techniques are usually applied.

Example 5. To efficiently apply the blocking scheme in the previous example on each individual tuple, tokens from the field value corresponding to the field Last Name are extracted, along with the first character from the field value of the Zip field, to obtain the tuple's set of BKVs. For example, applied to the first tuple of Dataset 3, the BKV set {Beats, 7} is extracted. An index is maintained, with the BKVs as keys and tuple pointers as values. With n tuples, traditional blocking computes the blocks in time O(n) [9].
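The hash-based procedure of Example 5 can be sketched as follows. The tuple layout and sample values are illustrative assumptions of this sketch, not the actual Dataset 3:

```java
import java.util.*;

// A sketch of Example 5: applying the disjunctive scheme
// (ContainsCommonToken, Last Name) ∨ (SameFirstDigit, Zip) via a
// hash-based index from BKVs to tuple ids, in O(n) over the tuples.
public final class HashBlocking {

    public static void main(String[] args) {
        // Illustrative tuples: {last name, zip}; ids are row numbers.
        String[][] tuples = { {"Beats", "78705"}, {"Mae", "60615"}, {"Beats Jr.", "78705"} };

        Map<String, List<Integer>> blocks = new HashMap<>();
        for (int id = 0; id < tuples.length; id++) {
            Set<String> bkvs = new HashSet<>();
            // Tokens indexing function applied to Last Name.
            bkvs.addAll(Arrays.asList(tuples[id][0].split("\\s+")));
            // First character of the Zip field value.
            bkvs.add(tuples[id][1].substring(0, 1));
            for (String bkv : bkvs)
                blocks.computeIfAbsent(bkv, b -> new ArrayList<>()).add(id);
        }
        // Tuples sharing any block are paired into the candidate set Γ.
        System.out.println(blocks); // e.g. {Beats=[0, 2], 7=[0, 2], 6=[1], ...}
    }
}
```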
Let the set of generated blocks be Π. Π contains sets of the form B_v, where B_v is the block referred to by the BKV v. The candidate set of pairs Γ is given below:

    Γ = ⋃_{B_v ∈ Π} { (r, s) | r, s ∈ B_v, r ∈ R1, s ∈ R2 }    (1)

Γ is precisely the set input to the second step of ER, which classifies each pair as a duplicate, non-duplicate or probable duplicate [8]. Blocking should produce a small Γ, but with high coverage and density of duplicates. Metrics quantifying these properties are defined in Section 6.

Finally, schema mapping is utilized in the paper. The formal definition of a mapping is quite technical; the survey by Bellahsene et al. provides a full treatment [1]. In this paper, an intuitive understanding of the mapping as a pair of field-sets suffices. For example, ({Name}, {First Name, Last Name}) is a 1:n mapping between Datasets 2 and 3. More generally, mappings may be of cardinality n:m. The simplest case is a 1:1 mapping, with singleton components.

4. THE GENERIC PIPELINE

Figure 2(a) shows a schematic of the generic pipeline. Two heterogeneous datasets are initially provided as input, with either dataset being RDF or tabular. If a dataset is RDF, we logically represent it as a property table. We describe the details and advantages of this tabular data structure in Section 4.1. The key point to note is that the schema matching module takes two tables as input, regardless of the data model, and outputs a set of schema mappings Q. An extended DNF-BSL accepts Q, and also a training set of duplicates and non-duplicates, as input, and learns an extended DNF blocking scheme. The DNF-BSL and the blocking scheme need to be extended because tables are now structurally heterogeneous, and the Section 3 formalism does not natively apply. In Section 4.2, the formalism is extended to admit structural heterogeneity. Figure 2(b) shows an unsupervised instantiation of the generic pipeline. We describe the details in Section 5.

(Figure 2: (a) shows the generic pipeline for learning a DNF blocking scheme on heterogeneous datasets, with (b) showing an unsupervised instantiation)

4.1 Property Table Representation

Despite their formal description as sets of triples, RDF files are often physically stored in triplestores as sparse property tables [28]. In this paper, we adapt this table instead as a logical tabular representation of RDF. To enable the logical construction on an RDF dataset R, define the property schema {subject} ∪ {p | ∃(s, p, o) ∈ R}. In essence, we flatten the graph by assigning each distinct property (or edge label) a corresponding field in this schema, along with an extra subject field. Every distinct subject in the triples-set has exactly one corresponding tuple in the table.

For example, Figure 3 is the property table representation of Dataset 1 in Figure 1. If a subject does not have a corresponding object value for a given property, the reserved keyword null is entered. If a subject has multiple object values for a property, the values are concatenated using a reserved delimiter (; in Figure 3). Technically, field values now have set semantics, with null representing the empty set. Also, the original set of triples can be losslessly reconstructed from the property table (and vice versa), making it an information-preserving logical representation. Although intuitively straightforward, we detail these lossless conversions in the Appendix.

(Figure 3: Property Table representation of Dataset 1 in Figure 1)

The physical property table was proposed to eliminate expensive property-property self-joins that occur frequently in SPARQL (http://www.w3.org/TR/rdf-sparql-query/) queries. For ER, the logical data structure is useful because it allows for a dynamic schema that is resolvable at run-time. Triplestores like Jena allow updating and querying of RDF graphs, despite the underlying tabular representation [28]. If the RDF dataset is already stored in such a triplestore, it would not have to be moved prior to ER. This gives the property table a systems-level advantage. More importantly, having a schema for an RDF dataset means that we can invoke a suitable schema matcher in the generalized pipeline. As we subsequently show, the extended DNF-BSL in the pipeline requires the datasets to have (possibly different) schemas; even non-DNF blocking typically assumes this [20]. Finally, representing RDF as a table allows us to address RDF-tabular heterogeneity by reducing it to tabular structural heterogeneity. Traditional ER operations become well-defined for RDF-tabular inputs.

One key advantage of the property schema is that it does not rely on RDF Schema (RDFS) metadata. In practice, this allows us to represent any file on Linked Open Data tabularly, regardless of whether metadata is available.

Even in the simple example of Figure 3, we note that the property table is not without its challenges. It is usually sparse, and it may end up being broad for RDF datasets with many properties. Furthermore, properties are named using different conventions (for example, the prefix Has occurs in all the properties in Figure 3) and could be opaque, depending on the dataset publisher. We show empirically (Section 6) that the instantiated pipeline (Section 5) can handle these difficulties.
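A minimal sketch of the Section 4.1 construction, assuming triples are already in memory. The delimiter and null conventions mirror Figure 3; the sample data and names are illustrative:

```java
import java.util.*;

// A sketch of flattening a set of triples into a logical property table over
// the dynamic property schema {subject} ∪ {p | (s,p,o) ∈ R}.
public final class PropertyTableBuilder {

    record Triple(String s, String p, String o) {}

    public static void main(String[] args) {
        List<Triple> rdf = List.of(
                new Triple("Mickey Beats", "hasWife", "Joan Beats"),
                new Triple("Mickey Beats", "hasChild", "Rob Beats"),
                new Triple("Mickey Beats", "hasChild", "Jim Beats"));

        // Dynamic property schema: one field per distinct property.
        SortedSet<String> schema = new TreeSet<>();
        rdf.forEach(t -> schema.add(t.p()));

        // One row per distinct subject; multi-valued fields are ';'-joined,
        // and a missing value becomes the reserved keyword null.
        Map<String, Map<String, StringJoiner>> table = new LinkedHashMap<>();
        for (Triple t : rdf)
            table.computeIfAbsent(t.s(), k -> new HashMap<>())
                 .computeIfAbsent(t.p(), k -> new StringJoiner(";"))
                 .add(t.o());

        for (var row : table.entrySet()) {
            System.out.print(row.getKey()); // the extra subject field
            for (String field : schema) {
                StringJoiner v = row.getValue().get(field);
                System.out.print(" | " + (v == null ? "null" : v.toString()));
            }
            System.out.println();
        }
    }
}
```

Reversing this loop recovers the original triples, which is the information-preserving property noted above.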
4.2 Extending the Formalism

The formalism in Section 3 assumed structural homogeneity. We extend it to accommodate structural heterogeneity. As input, consider two datasets R1 and R2. If either dataset is in RDF, we assume that it is in property table form. Consider the original definition of SBPs in Definition 3. SBPs are associated with a single GBP and a single field, making them amenable only to a single schema. To account for heterogeneity, we extend SBPs by replacing the field input with a mapping. Denote as A1 and A2 the respective sets of fields of datasets R1 and R2. We define simple extended SBPs below:

Definition 5. A simple extended specific blocking predicate is a pair (p_i, m) where p_i is a general blocking predicate and m = ({f1}, {f2}) is a mapping from a single field f1 ∈ A1 to a single field f2 ∈ A2. The predicate takes two tuples t1 and t2 as arguments and applies p_i to the field values corresponding to f1 and f2 in the two tuples respectively.

The correspondence in Definition 5 is denoted as simple since it uses a mapping of the simplest cardinality (1:1). Definition 3 can be reformulated as a special case of Definition 5, with A1 = A2 and f1 = f2.

The SBP semantics are not evident if the mapping cardinality is n:m, 1:n or n:1, that is, between two arbitrary field-subsets F1 ⊆ A1 and F2 ⊆ A2. If we interpret the two sets as representing |F1||F2| 1:1 mappings, we obtain a set of simple extended SBPs {(p_i, ({f1}, {f2})) | f1 ∈ F1, f2 ∈ F2}.

The interpretation above is motivated by the requirement that an SBP should always return a boolean value. We approach the problem by using the n:m mappings to construct the set of simple extended SBPs, as shown above. We then use disjunction as a combine operator on all elements of the constructed set to yield a single boolean result. We can then define complex extended SBPs.

Definition 6. A complex extended specific blocking predicate is a pair (p_i, M) where p_i is a general blocking predicate and M is a mapping from a set F1 ⊆ A1 to a set F2 ⊆ A2. The predicate takes two tuples t1 and t2 as arguments and applies on them |F1||F2| simple extended SBPs (p_i, m), where m ranges over all 1:1 mappings in the set {({f1}, {f2}) | f1 ∈ F1, f2 ∈ F2}. The predicate returns the disjunction of these |F1||F2| values as the final output.

Finally, (simple or complex) extended DNF schemes can be defined in a similar vein as Definition 4, using (simple or complex) extended SBPs as atoms. One key advantage of using disjunction as the combine operator in Definition 6 is that, assuming simple extended SBPs as atoms for both simple and complex extended DNF schemes, the scheme remains a positive boolean formula, by virtue of distributivity of conjunction and disjunction.

Example 6. Consider the mapping between sets {Name} in Dataset 2 and {First Name, Last Name} in Dataset 3, in Figure 1. Let the input GBP be ContainsCommonToken. The complex extended SBP corresponding to these inputs would be ContainsCommonToken({Name}, {First Name}) ∨ ContainsCommonToken({Name}, {Last Name}). This complex extended SBP would yield the same result as the simple extended SBP ContainsCommonToken({Name}, {Name}) if a new field called Name is derived from the merging of the First Name and Last Name fields in Dataset 3.

An operator like conjunction is theoretically possible but may prove restrictive when learning practical schemes. The disjunction operator makes complex extended SBPs more expressive than simple extended SBPs, but requires more computation (Section 4.3). Evaluating alternate combine operators is left for future work.
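The following sketch illustrates Definition 6 and Example 6. Modeling tuples as field-to-value maps is an assumption of this sketch:

```java
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// A sketch of a complex extended SBP (p_i, M): apply one GBP to every
// 1:1 field pair induced by the mapping and return the disjunction.
public final class ComplexExtendedSbp {

    static boolean containsCommonToken(String v1, String v2) {
        for (String a : v1.split("\\s+"))
            for (String b : v2.split("\\s+"))
                if (a.equals(b)) return true;
        return false;
    }

    // Disjunction over the |F1||F2| simple extended SBPs.
    static boolean apply(BiPredicate<String, String> gbp,
                         List<String> f1s, List<String> f2s,
                         Map<String, String> t1, Map<String, String> t2) {
        for (String f1 : f1s)
            for (String f2 : f2s)
                if (gbp.test(t1.get(f1), t2.get(f2))) return true;
        return false;
    }

    public static void main(String[] args) {
        // Example 6: ({Name}, {First Name, Last Name}) with ContainsCommonToken.
        Map<String, String> t1 = Map.of("Name", "Mickey Beats");
        Map<String, String> t2 = Map.of("First Name", "Mickey", "Last Name", "Beats Jr.");
        System.out.println(apply(ComplexExtendedSbp::containsCommonToken,
                List.of("Name"), List.of("First Name", "Last Name"), t1, t2)); // true
    }
}
```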
4.3 Extending Existing DNF-BSLs

Existing DNF-BSLs [3, 21, 16, 6] rely on similar high-level principles, which is to devise an approximation algorithm for solving the NP-hard optimization problem first formalized by Bilenko et al. [3]. The approximation algorithms are different in that they require different parameters and levels of supervision. These are detailed below and summarized in Table 1. These BSLs were originally designed only for structurally homogeneous tables, with a single field-set A. We describe their underlying core principles before describing extensions in keeping with the formalism in Section 4.2.

Table 1: DNF-BSL Systems

ID  System                       Parameters                                     Supervision
1   Michelson and Knoblock [21]  ε, k                                           Supervised
2   Bilenko et al. [3]           ε, η, k                                        Supervised
3   Cao et al. [6]               s, ε, τ, α, k                                  Semi-supervised
4   Kejriwal and Miranker [16]   Generator: c, ut, lt, d, nd; Learner: ε, η, k  Unsupervised
5   Algorithm 1 herein           κ, k                                           Unsupervised

Assume a set of GBPs G. The core of all approximation algorithms would first construct a search space of SBPs H by forming the cross product of G and A. The goal of the algorithm is to choose a subset H' ⊆ H such that the optimization condition laid out by Bilenko et al. is satisfied, at least approximately [3] (we describe the condition here intuitively, and formally state it in the Appendix). The condition assumes that training sets of duplicates D and non-duplicates N are provided. Intuitively, the condition states that the disjunctive blocking scheme formed by taking the disjunction of SBPs in H' covers (see Definition 4) at least ε|D| duplicates, while covering the minimum possible number of non-duplicates [3]. Note that ε is a parameter common to all four systems in Table 1 (ε was designated as min_thresh in the original System 1 paper [21], and σ in the System 3 paper [6]).

In order to learn a DNF scheme (as opposed to just a disjunctive one), a beam search parameter k (also common to all four systems) is required. This parameter is used to supplement the original set H with terms, to obtain a new set H_c. This combinatorial process is demonstrated in Figure 4: H originally consists of the SBPs a, b and c. These SBPs cover some tuple pairs (TPs). Suppose k = 2. A term of size 2 is formed by checking if any TP is covered by (at least) two SBPs. For example, TP-3 is covered by SBPs a and b, and hence is also covered by the term a∧b. For k > 2, terms from size 2 to size k are recursively added to H; the final supplemented set is denoted as H_c. Note that for |H| predicates, building H_c takes O(|H|^k) time per TP. Given the exponential dependence on k and diminishing returns, previous results capped k at 2 [16, 3]. If k = 1, H_c = H.

(Figure 4: An example showing how H_c is formed)
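The supplementation step illustrated in Figure 4 can be sketched for k = 2 as below. SBP names and coverage sets follow the Figure 4 illustration and are assumptions of this sketch:

```java
import java.util.*;

// A sketch of supplementing H to H_c for k = 2: a term a∧b joins H_c
// whenever some tuple pair is covered by both SBPs a and b.
public final class SupplementHc {

    public static void main(String[] args) {
        // SBP name -> ids of the tuple pairs it covers (as in Figure 4).
        Map<String, Set<Integer>> coverage = Map.of(
                "a", Set.of(1, 3), "b", Set.of(2, 3, 4), "c", Set.of(5));

        Set<String> hc = new TreeSet<>(coverage.keySet());
        List<String> sbps = new ArrayList<>(coverage.keySet());
        Collections.sort(sbps);
        for (int i = 0; i < sbps.size(); i++)
            for (int j = i + 1; j < sbps.size(); j++) {
                Set<Integer> common = new HashSet<>(coverage.get(sbps.get(i)));
                common.retainAll(coverage.get(sbps.get(j)));
                if (!common.isEmpty())                       // TP-3 is covered by a and b,
                    hc.add(sbps.get(i) + "∧" + sbps.get(j)); // so a∧b joins H_c
            }
        System.out.println(hc); // [a, a∧b, b, c]
    }
}
```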
The set H' ⊆ H_c that is now chosen by the approximation scheme would potentially consist of terms and SBPs, with their disjunction yielding a k-DNF scheme (a k-DNF formula has at most k literals in each term).

While System 1 only requires ε and k as its parameters, Systems 2 and 4 prune their search spaces by removing all SBPs and terms from H_c that cover more than η|N| non-duplicates. Note that this step heuristically improves both quality and run-time, since H_c then contains only a few, high-quality elements. It comes at the risk of failure, since if the search space is pruned excessively (by high η), it may become impossible to cover at least ε|D| duplicates. Systems 3 and 4 require less supervision but significantly more parameter tuning, given that they rely on many more parameters.

Systems 1-3 rely on different Set Covering (SC) variants [10], surveyed briefly in the Appendix. All three systems can be extended by constructing a search space of complex extended SBPs using G and the mappings set Q, instead of G and a field set A. The underlying SC approximations operate in this abstract search space to choose the final set H'. An extended DNF scheme is formed by a disjunction of (extended) elements in H'.

Modifying System 4 is problematic because the system is unsupervised and runs in two phases. The first phase (denoted as generator) generates a noisy training set, and the second phase (denoted learner) performs feature-selection on the noisy set to output a DNF blocking scheme. The feature-selection based learner is similar to Systems 1-3 and can be extended. Unfortunately, the generator explicitly assumes homogeneity, and cannot be directly extended to generate training examples for heterogeneous datasets. This implies that the DNF-BSL component of the proposed generic pipeline in Figure 2(a) cannot be instantiated with an existing unsupervised system.

In the event that a schema matcher (and thus, Q) is unavailable in Figure 2(a), we present a fall-back option. Specifically, we build a set Q of all 1:1 mappings, |Q| = |A1||A2|. Recall that A_i is the field set of dataset i. We denote the constructed set H of SBPs as simple exhaustive. Note that for the set of all mappings (≈ 2^|A1| 2^|A2|), the constructed set H (denoted complex exhaustive) is not computationally feasible for non-trivial cases. Even the simple exhaustive case is only a fall-back option, since a true set of 1:1 mappings Q would be much smaller than this set (at most min(|A1|, |A2|)).

5. AN UNSUPERVISED INSTANTIATION

A key question to ask is whether the generic pipeline can be instantiated in an unsupervised fashion. As we showed earlier, existing DNF-BSLs that can be extended require some form of supervision. An unsupervised heterogeneous DNF-BSL is important because, in principle, it enables a fully unsupervised ER workflow in both the relational and Semantic Web communities. As the surveys by Elmagarmid et al. and Ferraram et al. note, unsupervised techniques for the second ER step do exist already [12, 13]. A second motivation is the observation that existing unsupervised and semi-supervised homogeneous DNF-BSLs (Systems 3-4) require considerable parameter tuning. Parameter tuning is increasingly being cited as an important algorithmic issue, in applications ranging from schema matching [18] to generic machine learning [11]. Variety in Big Data implies that algorithm design cannot discount parameter tuning.

We propose an unsupervised instantiation with a new DNF-BSL that requires only two parameters. In Table 1, only the supervised System 1 requires two parameters. The schematic of the unsupervised instantiation (of the generic pipeline in Figure 2(a)) is shown in Figure 2(b). We use the existing schema matcher, Dumas, in the instantiated pipeline [4]. Dumas outputs 1:1 field mappings by first using a duplicates generator to locate tuple pairs with high cosine similarity. In the second step, Dumas uses Soft TF-IDF to build a similarity matrix from each generated duplicate. If n duplicates are input to the second step, n similarity matrices are built and then averaged into a single similarity matrix. The assignment problem is then solved by invoking the Hungarian Algorithm on this matrix [19]. This results in exactly min(|A1|, |A2|) 1:1 field mappings (the set Q) being output.

In addition to using Q, we recycle the noisy duplicates of Dumas and pipe them into Algorithm 1. Note that Dumas does not generate non-duplicates. We address this issue in a novel way, by permuting the generated duplicates set D. Suppose that D contains n tuple pairs {(r1, s1), ..., (rn, sn)}, with each r, s respectively from datasets R1, R2. By randomly permuting the pairs in D, we heuristically obtain non-duplicate pairs of the form (r_i, s_j), i ≠ j. Note that (at most) n! distinct permutations are possible. For balanced supervision, we set |N| = |D|, with N the permutation-generated set.

Empirically, the permutation is expected to yield a precise N because of observed duplicates sparsity in ER datasets [9, 16]. This sparsity is also a key tenet underlying the blocking procedure itself. If the datasets were dense in duplicates, blocking would not yield any savings.
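The permutation heuristic can be sketched as follows. The retry loop, which reshuffles until no original pairing survives, is an implementation choice of this sketch (it assumes |D| ≥ 2):

```java
import java.util.*;

// A sketch of deriving pseudo-negatives N from the Dumas duplicates D by
// permuting second components, pairing r_i with s_j for i ≠ j. Duplicates
// sparsity makes such pairs precise non-duplicates with high probability.
public final class PermutationNegatives {

    record Pair(String r, String s) {}

    static List<Pair> permute(List<Pair> d, Random rng) {
        List<String> seconds = new ArrayList<>(d.stream().map(Pair::s).toList());
        List<Pair> n = new ArrayList<>();
        do {
            Collections.shuffle(seconds, rng);
            n.clear();
            for (int i = 0; i < d.size(); i++)
                n.add(new Pair(d.get(i).r(), seconds.get(i)));
        } while (hasFixedPoint(d, n)); // retry until no pair (r_i, s_i) survives
        return n;                      // |N| = |D|
    }

    static boolean hasFixedPoint(List<Pair> d, List<Pair> n) {
        for (int i = 0; i < d.size(); i++)
            if (d.get(i).s().equals(n.get(i).s())) return true;
        return false;
    }

    public static void main(String[] args) {
        List<Pair> d = List.of(new Pair("r1", "s1"), new Pair("r2", "s2"), new Pair("r3", "s3"));
        System.out.println(permute(d, new Random(42)));
    }
}
```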
Algorithm 1 shows the pseudocode of the extended DNF BSL. Inputs to the algorithm are the piped Dumas outputs, D and Q. To learn a blocking scheme from these inputs, two parameters k and κ need to be specified. Similar to (extended) Systems 1-3 in Table 1, G, Q and k are used to construct the search space H_c. Note that G is considered the algorithm's feature space, and is not a dataset-dependent input (or parameter). Designing an expressive G has computational and qualitative effects, as we empirically demonstrate. We describe the GBPs included in G in Section 6.

Algorithm 1: Learn Extended k-DNF Blocking Scheme
    Input: Set D of duplicate tuple pairs, Set Q of mappings
    Parameters: Beam search parameter k, SC-threshold κ
    Output: Extended DNF Blocking Scheme B
    Method:
    // Step 0: Construct sets N and H_c
    Permute pairs in D to obtain N, |N| = |D|
    Construct set H of simple extended SBPs using set G of GBPs and Q
    Supplement set H to get set H_c using k
    // Step 1: Build multimaps M'_D and M'_N
    Construct M_D = <X, H_X>, where X is a tuple pair in D and H_X ⊆ H_c contains the elements in H_c covering X
    Repeat the previous step to build M_N for tuple pairs in N
    Reverse M_D and M_N to respectively get M'_D and M'_N
    // Step 2: Run approximation algorithm
    for all X ∈ keyset(M'_D) do
        Score X by using the formula |M'_D(X)|/|D| − |M'_N(X)|/|N|
        Remove X if score(X) < κ
    end for
    Perform W-SC on keys in M'_D using Chvatal's heuristic; weights are negative scores
    // Step 3: Construct and output DNF blocking scheme
    B := disjunction of chosen keys
    Output B

Step 0 in Algorithm 1 is the permutation step just described, to generate the non-duplicates set N. G and Q are then used to construct the set H of simple extended SBPs (Definition 5; simple, since Dumas only outputs 1:1 mappings), with |H| = |G||Q|. H is supplemented (using parameter k) to yield H_c, as earlier described in Section 4.3.

Step 1 constructs multimaps (multimap keys reference multiple values, or a value set) on which Set Covering (SC) is eventually run. As a first logical step, multimaps M_D and M_N are constructed. Each tuple pair (TP) in D is a key in M_D, with the SBPs and terms in H_c covering that TP comprising the value set. M_D is then reversed to yield M'_D. M'_N is built analogously. Figure 5 demonstrates the procedure, assuming D contains TPs 1-5, covered as shown in Figure 4. The time complexity of building (both) M'_D and M'_N is O(|H|^k (|D| + |N|)).

(Figure 5: Step 1 of Algorithm 1, assuming the information in Figure 4)
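Step 1 can be sketched as below, with coverage data following the Figure 5 illustration; the actual coverage testing of SBPs and terms is stubbed out with sample values:

```java
import java.util.*;

// A sketch of Step 1: build M_D (tuple pair -> the elements of H_c covering
// it), then reverse it to M'_D (element -> the tuple pairs it covers).
public final class MultimapStep {

    public static void main(String[] args) {
        // M_D: tuple-pair id -> covering SBPs/terms (as in Figure 5).
        Map<Integer, Set<String>> mD = Map.of(
                1, Set.of("a"), 2, Set.of("b"), 3, Set.of("a", "b", "a∧b"),
                4, Set.of("b"), 5, Set.of("c"));

        // Reverse to M'_D: SBP/term -> covered tuple pairs.
        Map<String, Set<Integer>> mDPrime = new TreeMap<>();
        for (var e : mD.entrySet())
            for (String element : e.getValue())
                mDPrime.computeIfAbsent(element, k -> new TreeSet<>()).add(e.getKey());

        // |M'_D(X)| / |D| is the duplicate-coverage fraction used in Step 2.
        mDPrime.forEach((x, pairs) ->
                System.out.printf("%s covers %s -> %.1f%n", x, pairs, pairs.size() / 5.0));
    }
}
```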
In Step 2, each key is first scored by calculating the difference between the fractions of covered duplicates and non-duplicates. A threshold parameter, κ, is used to remove the SBPs and terms that have low scores. Intuitively, κ tries to balance the conflicting needs of the previously described parameters, ε and η, and reduce tuning effort. The range of κ is [-1, 1]. An advantage of the parameter is that it has an intuitive interpretation. A value close to 1.0 would indicate that the user is confident about low noise-levels in inputs D and Q, since high κ implies the existence of elements in H_c that cover many positives and few negatives. Since many keys in M'_D are removed by high κ, this also leads to computational savings. However, setting κ too high (perhaps because of misguided user confidence) could potentially lead to excessive purging of M'_D, and subsequent algorithm failure. Experimentally, we show that κ is easily tunable and even high values of κ are robust to noisy inputs.

Weighted Set Covering (W-SC) is then performed using Chvatal's algorithm [10] (included in the SC survey in the Appendix), with each key in M'_D acting as a set, and the tuple pairs covered by all keys as the elements of the universe set U. For example, assuming all SBPs and terms in the keyset of M'_D in Figure 5 have scores above κ, U = {1, 2, 3, 4, 5}. Note that only M'_D is pruned (using κ), and also, W-SC is performed only on M'_D. M'_N only aids in the score calculation (and the subsequent pruning process) and may be safely purged from memory before W-SC commences.

W-SC needs to find a subset of the M'_D keyset that covers all of U with minimum total weight. For this reason, the weight of each set is the negative of its calculated score. Given that the sets chosen by W-SC actually represent SBPs or terms, their disjunction is the k-DNF blocking scheme.

Under plausible complexity assumptions (P ⊂ NP), Chvatal's algorithm is essentially the best-known polynomial-time approximation for W-SC [23]. For example, Bilenko et al. used Peleg's approximation to Red-Blue SC [22, 7], which is known to have worse bounds [7]. The proposed DNF-BSL has the strongest theoretical approximation guarantee of all systems in Table 1.
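A minimal sketch of Chvatal's greedy heuristic as used in Step 2 follows. Note that the classical ratio rule assumes nonnegative weights; since the paper only states that weights are negated scores, this sketch shifts each score s ∈ [-1, 1] to the weight 1 − s ∈ [0, 2], an assumption made purely to keep the greedy rule well-defined:

```java
import java.util.*;

// A sketch of greedy Weighted Set Covering: repeatedly pick the set (an SBP
// or term in the pruned keyset of M'_D) minimizing weight / |newly covered|.
public final class ChvatalWsc {

    public static void main(String[] args) {
        // Pruned keyset of M'_D: SBP/term -> covered duplicate pairs.
        Map<String, Set<Integer>> sets = Map.of(
                "a", Set.of(1, 3), "b", Set.of(2, 3, 4), "c", Set.of(5));
        // Illustrative Step 2 scores; weight(X) = 1 - score(X) >= 0.
        Map<String, Double> score = Map.of("a", 0.6, "b", 0.9, "c", 0.4);

        Set<Integer> uncovered = new HashSet<>(); // the universe U
        sets.values().forEach(uncovered::addAll);

        List<String> chosen = new ArrayList<>();
        while (!uncovered.isEmpty()) {
            String best = null;
            double bestRatio = Double.POSITIVE_INFINITY;
            for (var e : sets.entrySet()) {
                Set<Integer> gain = new HashSet<>(e.getValue());
                gain.retainAll(uncovered);
                if (gain.isEmpty()) continue;
                double ratio = (1.0 - score.get(e.getKey())) / gain.size();
                if (ratio < bestRatio) { bestRatio = ratio; best = e.getKey(); }
            }
            chosen.add(best);
            uncovered.removeAll(sets.get(best));
        }
        // The disjunction of the chosen keys is the learned k-DNF scheme.
        System.out.println(chosen); // [b, a, c]
    }
}
```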
6. EXPERIMENTS

6.1 Metrics

The quality and efficiency of blocking are evaluated by the special metrics Reduction Ratio (RR), Pairs Completeness (PC) and Pairs Quality (PQ) [9]. Traditional metrics like precision and recall do not apply to blocking, since it is a pre-processing step and its output is not the final ER output. To define RR, which intuitively measures efficiency, consider the full set of pairs Ω that would be generated in the absence of blocking. Specifically, Ω is the set {(r, s) | r ∈ R1, s ∈ R2}. RR is then given by the formula 1 − |Γ|/|Ω|. Given the sparsity of duplicates in real-world datasets [9], RR should ideally be close to, but not precisely, 1.0 (unless Γ = ∅).

Denote the set of all duplicate pairs, or the ground-truth, as Ω_m. Similarly, consider the set of duplicates included in the candidate set, Γ_m = Γ ∩ Ω_m. The metric PC quantifies the effectiveness of the blocking scheme by measuring the coverage of Γ_m with respect to Ω_m. Specifically, it is given by the formula |Γ_m|/|Ω_m|. Low PC implies that recall on overall ER will be low, since many duplicates did not share a block to begin with.

PC and RR together express an efficiency-effectiveness tradeoff. The metric PQ is sometimes used to measure how dense the blocks are in duplicates, and is given by |Γ_m|/|Γ|. PQ has not been reported in recent BSL literature [3, 16]. One reason is that PQ can be equivalently expressed as c·PC/(1−RR), where c is |Ω_m|/|Ω|. When comparing two BSLs, PQ can therefore be expressed wholly in terms of PC and RR. We do not consider PQ further in this paper.
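The RR and PC computations are mechanical; a minimal sketch follows, in which encoding a pair as an "r|s" string is an assumption of the sketch:

```java
import java.util.Set;

// A sketch of the Section 6.1 metrics: Γ is the candidate set, Ω_m the
// ground truth, and |Ω| = |R1||R2| the pair count without blocking.
public final class BlockingMetrics {

    static double reductionRatio(long candidateSize, long r1Size, long r2Size) {
        return 1.0 - (double) candidateSize / ((double) r1Size * r2Size);
    }

    static double pairsCompleteness(Set<String> candidates, Set<String> groundTruth) {
        long covered = groundTruth.stream().filter(candidates::contains).count();
        return (double) covered / groundTruth.size(); // |Γ_m| / |Ω_m|
    }

    public static void main(String[] args) {
        Set<String> gamma = Set.of("r1|s1", "r1|s2", "r2|s2");
        Set<String> omegaM = Set.of("r1|s1", "r2|s2", "r3|s3");
        System.out.printf("RR = %.4f, PC = %.2f%n",
                reductionRatio(gamma.size(), 100, 200),
                pairsCompleteness(gamma, omegaM)); // RR = 0.9999, PC = 0.67
    }
}
```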
6.2 Datasets

The heterogeneous test suite of six dataset pairs (and nine individual datasets) is summarized in Table 2. The suite spans four domains and the three kinds of heterogeneity discussed in the paper. All datasets are from real-world sources. We did not curate these files in any way, except for serializing RDF datasets as property tables (instead of triples-sets). The serializing was found to be near-instantaneous (< 1 second) in all cases, with negligible run-time compared to the rest of the pipeline.

Dataset Pairs (DPs) 1 and 2 are the RDF benchmarks in the 2010 instance-matching track (http://oaei.ontologymatching.org/2013/#instance) of OAEI (the Ontology Alignment Evaluation Initiative), an annual Semantic Web initiative. Note that an earlier tabular version of DP 1 is also popular in the homogeneous ER literature [9]. DPs 3, 5 and 6 describe video game information. DP 6 has already been used as a test case in a previous schema matching work [26]. vgchartz is a tabular dataset taken from a reputable charting website (vgchartz.com). DBpedia contains 48,132 triples extracted from DBpedia (dbpedia.org), and has four fields (the three properties genre, platform and manufacturer, plus subject) and 16,755 tuples in property table form. Finally, IBM contains user-contributed data extracted from the Many Eyes page (www-958.ibm.com/software/data/cognos/manyeyes/datasets), maintained by IBM Research.

DP 4 describes US libraries. Libraries 1 was from a Point of Interest website (http://www.poi-factory.com/poifiles), and Libraries 2 was taken from a US government listing of libraries.

DPs 2 and 4 contain n:m ground-truth schema mappings, while the others only contain 1:1 ground-truth mappings.

Table 2: Details of dataset pairs. The notation, where applicable, is (first dataset) / or × (second dataset).

ID  Dataset Pairs                Fields  Total Entity Pairs             Duplicate Pairs  Data Model
1   Restaurant 1 / Restaurant 2  8/8     339 × 2256 = 764,784           100              RDF/RDF
2   Persons 1 / Persons 2        15/14   2000 × 1000 = 2 million        500              RDF/RDF
3   IBM / vgchartz               12/11   1904 × 20,000 ≈ 38 million     3933             Tabular/Tabular
4   Libraries 1 / Libraries 2    5/10    17,636 × 26,583 ≈ 469 million  16,789           Tabular/Tabular
5   IBM / DBpedia                12/4    1904 × 16,755 ≈ 32 million     748              Tabular/RDF
6   vgchartz / DBpedia           11/4    20,000 × 16,755 ≈ 335 million  10,000           Tabular/RDF

6.3 Baselines

Table 1 listed existing DNF-BSLs, with Section 4.3 describing how the extended versions fit into the pipeline. We extend two state-of-the-art systems as baselines:

Supervised Baseline: System 2 was chosen as the supervised baseline [3], and favored over System 1 [21] because of better reported empirical performance on a common benchmark employed by both efforts. Tuning the parameter η leads to better blocking schemes, making System 2 a state-of-the-art supervised baseline.

Semi-supervised Baseline: We adapt System 4 as a semi-supervised baseline by feeding it the same noisy duplicates generated by Dumas as are fed to the proposed learner, as well as a manually labeled negative training set. The learner of System 4 uses feature selection and was shown to be empirically competitive with supervised baselines [16]. In contrast, System 3 did not evaluate its results against System 2. Also, the learner of System 4 requires three parameters, versus the five of System 3. Finally, the three System 4 parameters are comparable to the corresponding System 2 parameters, and evaluations in the original work showed that the learner is robust to minor parameter variations [16]. For these reasons, the feature-selection learner of System 4 was extended to serve as a semi-supervised baseline.

6.4 Methodology

In this section, we describe the experimental methodology. Experimental results will be presented in Section 7.

6.4.1 Dumas

Given that the test suite is larger and more heterogeneous in this paper than in the original Dumas work [4], we perform two preliminary experiments to evaluate Dumas. Dumas was briefly described in Section 5.

Preliminary Experiment 1: We evaluate the performance of Dumas's duplicates generator. The generator uses TF-IDF to retrieve the highest-scoring duplicates in order. Suppose we retrieve t duplicates (as an ordered list). Denote, for any k ≤ t, the number of true positives in the sub-list [1...k] as d(k). Define Precision@k as d(k)/k and Recall@k as d(k)/|Ω_m|, where Ω_m is the ground-truth set of all duplicates. We plot Precision@k against Recall@k for all DPs to demonstrate the precision-recall tradeoff. To obtain a full set of data points, we set t at 50,000 for all experiments, and calculate Precision@k and Recall@k for k ∈ {1...t}.

Preliminary Experiment 2: Although t was set to a high value in the previous experiment, Dumas requires only a few top pairs (typically 10-50) for the schema matching step [4]. To compute the similarity matrix from t pairs, Dumas uses Soft TF-IDF, which requires an optional threshold θ. For a given DP, denote Q' as the ground-truth set of schema mappings, Q as the set of Dumas mappings, and Q_m ⊆ Q as the set of correct Dumas mappings. Define precision (on this task) as |Q_m|/|Q| and recall as |Q_m|/|Q'|. In this experiment, we set t and θ to default values of 50 and 0.5 respectively and report the results. We also vary these parameters and describe performance differences.
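The Precision@k and Recall@k curves of Preliminary Experiment 1 can be computed in a single pass over the ranked list; the sample pairs below are illustrative assumptions:

```java
import java.util.List;
import java.util.Set;

// A sketch of Precision@k = d(k)/k and Recall@k = d(k)/|Ω_m|, where d(k)
// counts true positives among the top k retrieved pairs.
public final class PrecisionRecallAtK {

    public static void main(String[] args) {
        List<String> ranked = List.of("r1|s1", "r9|s4", "r2|s2", "r7|s8");
        Set<String> omegaM = Set.of("r1|s1", "r2|s2", "r3|s3"); // ground truth

        int d = 0; // d(k): running count of true positives
        for (int k = 1; k <= ranked.size(); k++) {
            if (omegaM.contains(ranked.get(k - 1))) d++;
            System.out.printf("k=%d P@k=%.2f R@k=%.2f%n",
                    k, (double) d / k, (double) d / omegaM.size());
        }
    }
}
```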
Additionally,weprovidenlabelednon-duplicates(asN)to the semi-supervised baseline. In subsequent experiments, the dependence of the proposed system on n is tested. 6.4.3 ExtendedDNFBSL We describe the evaluation of the extended DNF BSLs. SetGofGBPs: Bilenkoetal. firstproposedasetGthat Figure 6: Dumas duplicates-generation results, for hassincebeenadoptedinfutureworks[3,6,16]. Thisorig- k∈{1,2...50000} inal set contained generic token-based functions, numerical functionsandcharacter-basedfunctions[3]. Norecentwork attemptedtosupplementGwithmoreexpressiveGBPs. In formancevariationsoftheproposedsystemwiththenumber particular,phonetic GBPssuchasSoundex,Metaphone and ofprovidedduplicates,n. Inindustrialsettingswithanun- NYSIIS were not added to G, despite proven performance known ground-truth, n would have to be estimated. An benefits [8]. We make an empirical contribution by supple- important question is if we can rely on getting good results menting G with all nine phonetic features implemented in with constant n, despite DP heterogeneity. an open-source package29. For fairness, the same G is al- Statistical Significance: We conduct experiments in waysinputtoalllearnersintheexperimentsbelow. Results tenruns,and(whererelevant)reportstatisticalsignificance on Experiment 2 will indirectly demonstrate the benefits of levels using the paired sample Student’s t-distribution. On supplementing G. A more detailed description of the origi- blockingmetrics,wereportifresultsarenotsignificant(NS), nal (and supplemented) G is provided in the Appendix. weaklysignificant(WS),significant(SS)orhighlysignificant Experiment1: Toevaluatetheproposedlearner against (HS),basedonwhetherthep-valuefallswithinbrackets[1.0, the baselines, we input the same Q (found in Preliminary 0.1), (0.05,0.1], (0.01, 0.05] and [0.0, 0.01] respectively. As Experiment 2) to all three systems. The systems learn the for the choice of samples, we always individually paired PC DNF scheme by choosing a subset H(cid:48) ⊆ H of SBPs and and RR of the proposed system against the baseline that c supplemented terms, with H constructed from G, Q and k achieved a better average on the metric. c (Section 4.3). We learn only disjunctive schemes by setting Implementation: All programs were implemented in k to 1 in this experiment. Java on a 32-bit Ubuntu virtual machine with 3385 MB of Experiment2: WerepeatExperiment1butwithk=2. RAM and a 2.40 GHz Intel 4700MQ i7 processor. We mentioned in Section 4.3 that complexity is exponential in k for all systems in Table 1. Because of this exponential 7. RESULTSANDDISCUSSION dependence, it was not possible to run the experiment for k =2 on all DPs. We note which DPs presented problems, 7.1 Dumas andwhy. Notethat,ifG,Qandthetrainingsetsarefixed, increasing k seems to be the only feasible way of improving PreliminaryExperiment1: Figure6showstheresults blocking quality. However, G is more expressive in this pa- ofDumas’sduplicates-generator. ExceptforDP3,precision per. Intuitively,weexpectthedifferenceacrossExperiments on all cases seems inadequate even at low recall levels. Al- 1 and 2 to be narrower than in previous work. though recall of 100% is eventually attained on most DPs, Experiment 3: In a third set of experiments, we evalu- thepriceis(near)0%precision. Closerinspectionofthere- ate how blocking performance varies with Q. To the base- sultsshowedthatmanyfalsepositivesgotrankedatthetop. line methods, we input the set of (possibly n:m) ground- We discuss the implications of these noisy results shortly. 
truth mappings Q(cid:48) while the Dumas output Q is retained Preliminary Experiment 2: Table3showstheschema fortheproposedlearner. Thegoalistoevaluateifextended mapping results retrieved from Dumas with parameters set DNF-BSLsaresensitive,orifnoisy1:1matcherslikeDumas at t = 50 and θ = 0.5. These default values were found to suffice for the end goal. maximize performance inall cases, inagreementwithsimi- Experiment 4: We report on run-times and show per- lar values in the original work [4]. We also varied t from 10 to10,000andθfrom0to0.9. Performanceslightlydeclined 29org.apache.commons.codec.language (by at most 5%) for some DPs when t < 50, but remained cost. With k = 2, only DPs 1 and 5 were found computa- Table 3: Best Results of Dumas Schema-Matcher tionally feasible. On the other DPs, the program either ran out of RAM (DPs 4,6), or did not terminate after a long31 Dataset Pair Precision Recall time (DPs 2,3). The former was observed because of high 1 100% 87.5% n and the latter because of the large number of fields (see 2 92.86% 86.67% Table2). Settingkbeyond2wascomputationallyinfeasible 3 91% 100% evenforDPs1and5. Furthermore, resultsonDPs1and5 4 75% 33.33% showed no statistical difference compared to Experiment 1, 5 100% 100% eventhoughrun-timeswentupbyanapproximatefactorof 6 100% 100% 16 (for both DPs). Experiment 3: We providedthe ground-truthsetQ(cid:48) to baseline methods (and with k again set to 1), while retain- otherwise constant across parameter sweeps. This confirms ing Q for the proposed method. Again, we did not observe thatDumasisquiterobusttotandθ. Onedisadvantage of anystatisticallysignificantdifferenceinPCorRRforeither Dumasisthatitisa1:1matcher. Thisexplainsthelowerre- baseline method. We believe this is because the cases for callnumbersonDPs2and4,whichcontainn:mmappings. which Q(cid:48) would most likely have proved useful (DPs 2 and InExperiment3,wetestifthisisproblematicbyproviding 4,whichcontainn:mmappingsthatDumascannotoutput) theground-truthsetQ(cid:48) tobaselines,andcomparingresults. already perform well with the Dumas-output Q. Discussion: Animportantpointtonotefromthe(mostly) Experiment 4: Theoretically, run-time of Algorithm 1 good results in Table 3 is that generator accuracy is not al- wasshowntobeO(|H|k(|D|+|N|));therun-timesofother ways predictive of schema mapping accuracy. This robust- systems in Table 1 are similar. Empirically, this has never ness to noise is always an important criteria in pipelines. If been demonstrated. For k = 1, we plot the run-time data noiseaccumulates ateachstep,thefinalresultswillbequal- collectedfromExperiment1runsonDPs1-6(Figure7(a)). itativelyweak,possiblymeaningless. SinceDumas’smatch- Thetrendisfairlylinear,withthesupervisedsystemslower ing component is able to compensate for generator noise, it for smaller inputs, but not larger inputs. The dependence isagoodempiricalcandidateforunsupervised1:1matching on|Q|showswhyaschemamatcherisnecessary,sinceinits on typical ER cases. Preliminary Experiment 1 also shows absence, the simple exhaustive set is input (Section 4.3). that on real-world ER problems, a simple measure like TF- As further validation of the theoretical run-time, Figure IDF is not appropriate (by itself) for solving ER. The gen- 7(b) shows the linear dependence of the proposed system erator noise provides an interesting test for the extended on |D|. Again, the trend is linear, but the slope32 depends DNF-BSLs. 
Since both the mappings-set Q and top n gen- on the schema heterogeneities of the individual DPs. For eratedduplicates(thesetD)outputbyDumasarepipedto example,DPs2and3,thelargestdatasetsintermsoffields the learner, there are two potential sources of noise. (Table 2), do not scale as well as the others. Finally,giventhattheproposedsystem(insubsequentex- Figure7(c)showsanimportantrobustnessresult. Weplot periments)permutes theDumas-outputD (Algorithm1)to the PC-RR f-score33 of the proposed system against |D|. generateN,wetestedtheaccuracyofthepermutationpro- Thefirstobservationisthat,evenforsmall|D|,performance cedureinafollow-upexperiment. With|D|rangingfrom50 isalreadyhighexceptonDP2,whichshowsasteepincrease to 10,000 in increments of 50, we generated N of equal size when |D| ≈ 100. On all cases, maximum performance is andcalculatednon-duplicatesaccuracy30 ofN overtenran- achieved at about |D| ≈ 700 and the f-score curves flatten domtrialspervalue. Inallcases,accuracywas100%,show- subsequently. Thisisqualitativelysimilartotherobustness ing that the permutation heuristic is actually quite strong. of Dumas to small numbers of positive samples. Figure7(c)alsoshowsthattheproposedmethodisrobust 7.2 ExtendedDNFBSL tooverestimatesof|D|. DP1,forexample,onlyhas100true Experiment1: Table4showsBSLresultsonallsixDPs. positives, but it continues to perform well at much higher Thehighoverallperformanceexplainstherecentpopularity |D| (albeit with a slight dip at |D|≈700) . ofDNF-BSLs. UsingtheextendedDNFhypothesisspacefor Discussion: We earlier stated, when discussing the pre- blocking schemes allows the learner to compensate for the liminary experimental results, that an extended DNF-BSL two sources of noise earlier discussed. Overall, considering canonlyintegratewellintothepipelineifitisrobusttonoise statistically significant results, the supervised method typi- fromprevioussteps. Previousresearchhasnotedtheoverall callyachievesbetterRR,butPCis(mostly)equallyhighfor robustness of DNF-BSLs. This led to the recent emergence allmethods,withtheproposedmethodperformingthebest of a homogeneous unsupervised system [16], adapted here onDP4(thelargestDP)andthesupervisedbaselineonDP as a semi-supervised baseline. Experiment 1 results showed 2, with high significance. We believe the former result was thatthisrobustnessalsocarriesovertoextended DNF-BSLs. obtainedbecausetheproposedmethodhasthestrongestap- Highoverallperformanceshowsthatthepipelinecanaccom- proximation bounds out of all three systems, and that this modate heterogeneity, a key goal of this paper. effect would be most apparent on large DPs. Importantly, Experiment 2 results demonstrate the advantage of hav- low standard deviation (often 0) is frequently observed for ing an expressive G, which is evidently more viable than all methods. The DNF-BSLs prove to be quite determinis- increasing k. On DPs 1 and 5 (that the systems succeeded tic,whichcanbeimportantwhenreplicatingresultsinboth research and industrial settings. Experiment 2: Next, we evaluated if k = 2 enhances 31Within a factor of 20 of the average time taken by the system for the k=1 experiment (for that DP). BSL performance and justifies the exponentially increased 32Theslopeisthehiddenconstantintheasympoticnotation. 301−|N ∩Ω |/|N| 332.RR.PC m RR+PC
