Scalable, Trie-based Approximate Entity Extraction for Real-Time Financial Transaction Screening

Emrah Budur
Garanti Technology, 34212, Istanbul, Turkey
emrahbu [at] garanti.com.tr

Abstract—Financial institutions have to screen their transactions to ensure that they are not affiliated with terrorism entities. Developing appropriate solutions that detect such affiliations precisely while avoiding any kind of interruption to the large volume of legitimate transactions is essential. In this paper, we present the building blocks of a scalable solution that may help financial institutions build their own software to extract terrorism entities out of both structured and unstructured financial messages, in real time and with an approximate similarity matching approach.

I. INTRODUCTION

On September 11, 2001, one of the deadliest terrorist attacks in the US took place. After this event, the US government strengthened its regulation against terrorism financing. The most important responsibility was given to financial institutions, worldwide. Financial institutions have to screen their transactions against terrorism entities and block any transaction affiliated with these entities. These entities are published by governmental institutions including the Office of Foreign Assets Control (OFAC) [2]; the lists can be collected either freely from the web or on a licence basis from private institutions including Thomson Reuters [1].

Let's give a concrete example of the kind of responsibility that financial institutions have. Figure 1 is a typical SWIFT message of the sort used for international money transfers.

```
{1:F01MIDLGB22AXXX0548034693}
{2:I103BKTRUS33XBRDN3}
{3:{108:MT103}}
{4:
:20:8861198-0706
:23B:CRED
:32A:000612USD5443,99
:33B:USD5443,99
:50K:MIYESE INTERNATIONAL LIMITED
:52A:BCITITMM500
:53A:BCITUS33
:54A:IRVTUS3N
:57A:BNPAFRPPGRE
:59:/20041010050500001M02606
AHMET EMRE
:70:/RFB/OYA/INVOICE SENT TAMERLAAN
TZARNAEV, FATIME ST. PLAZA DE HALIT
28934 MOSTOLES (MADRID)
:71A:SHA
-}
```

Fig. 1: Typical SWIFT message

The problem with this message is that the highlighted name, TAMERLAAN TZARNAEV, is on a watch list of suspected international terrorists, namely TIDE [3], which is maintained by well-known intelligence institutions such as the CIA, FBI and NSA. This kind of affiliation with terrorism entities in financial messages should be detected and the message should be immediately blocked. Financial institutions will be on the hook for complicity in aiding and abetting a terrorist action if they fail to detect this kind of affiliation. The US Customs Authorities, however, failed to identify the same entity, TAMERLAAN TZARNAEV, in a similar context, which led up to the Boston Bombing afterwards [4]. Hence, achieving precise detection of such critical entities is a challenging issue.

As exemplified above, the detection of entities in unstructured text poses the challenges listed below.

• Name variations: Terrorism entities deliberately change their names a lot. Hence, the list of names of terrorism entities is getting bigger and bigger. The algorithm is supposed to cover all variations of the names of entities, so the challenge of detecting names is getting tougher.

• Fault tolerant match: Even if the exact name of the entity is known to the application, it is possible that the name is misspelled in the transaction. So the algorithm is supposed to tolerate misspellings, which presents an additional challenge.

• Mining unstructured text: The names are not given in a clean structured field. Instead, they appear in an unstructured field, i.e. explanation text, and the field may or may not include a name; it may also include irrelevant data such as an address, which makes it harder to extract the relevant information precisely. (A sketch of pulling such fields out of a message is given at the end of this section.)

• Noisy words: Some terms occur frequently both in illegal entities and in legitimate entities, e.g. LIMITED. The algorithm needs to take this situation into account and avoid blocking legitimate entities due to a common term.

• Minimizing false positives: Blocking a legitimate transaction is a false alert which threatens the profitability of the financial institution. For example, the algorithm is supposed to match the text Hosein with the name Hussein while avoiding blocking the text Saydam due to its close textual proximity to the name Saddam. Hence, the algorithm is supposed to avoid false alerts as much as possible.

• Avoiding false negatives: Any entities which are involved in the query either exactly or approximately should be extracted uncompromisingly.

• Ensuring low latency: The algorithm needs to complete checking a transaction in subseconds in order not to interrupt business workflows.

The algorithm presented in this paper is designed and implemented to meet all of these constraints.
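To make the mining task concrete, here is a hedged sketch of pulling the free-text fields out of an MT103 message such as the one in Fig. 1. The regular expression and the chosen tags (:50K:, :59:, :70:) are simplifying assumptions for illustration, not a full SWIFT parser and not necessarily how a production screening system preprocesses messages.

```python
import re

def unstructured_fields(mt103: str, tags=("50K", "59", "70")) -> list[str]:
    """Return the (possibly multi-line) payloads of the given MT103 tags."""
    # Each field starts with :TAG: at the beginning of a line and runs until
    # the next field or the end-of-block marker "-}".
    fields = re.findall(r"^:(\w+):(.*?)(?=^:|^-\}|\Z)", mt103, re.M | re.S)
    return [body.strip() for tag, body in fields if tag in tags]

message = (
    ":20:8861198-0706\n"
    ":50K:MIYESE INTERNATIONAL LIMITED\n"
    ":70:/RFB/OYA/INVOICE SENT TAMERLAAN\n"
    "TZARNAEV, FATIME ST. PLAZA DE HALIT\n"
    "28934 MOSTOLES (MADRID)\n"
    ":71A:SHA\n"
    "-}\n"
)
print(unstructured_fields(message))
# ['MIYESE INTERNATIONAL LIMITED',
#  '/RFB/OYA/INVOICE SENT TAMERLAAN\nTZARNAEV, ...\n28934 MOSTOLES (MADRID)']
```

The extracted payloads are the kind of free-text queries that the entity extraction pipeline described in the rest of the paper would receive.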
II. RELATED WORK

Approximate entity extraction has drawn much attention from researchers [5]–[8]. It was reported to be useful for extracting approximate product names from product review articles [6], [8], author names and paper titles from publication records [6], [7], and gene and protein lexicons from publication records [5]. One of the few relevant studies in the financial industry was presented by Xu et al., who extract corporate entities out of free format financial contracts [9]. In this paper, we will present an entity extraction framework and analyze its efficiency by extracting terrorism entities out of unstructured financial messages.

Various methods have been presented by researchers to accommodate several problems of the task. One of the problems is the fact that the number of all possible typing errors potentially embedded in the query text grows exponentially with respect to the number of allowed typing errors. As a common approach, all of the possible typing errors in the query are fabricated and probed against the index. However, most of the fabricated query terms do not occur at all in the target dictionary, so probing inexistent terms makes the algorithms inefficient and unscalable. A number of researchers proposed to limit the number of allowed errors to 1 edit in order to minimize the number of probings [10], [11]. However, this approach fails to identify k-edit error matches where k > 1, for example the match of the query term Hosein with the record term Hussain where k = 3. Although over 80% of the errors are estimated to be 1-edit errors [10], maximizing extraction recall is essential when detecting terrorism entities in the financial industry. Hence, we propose a scalable algorithm that covers k-edit errors for reasonably large k while probing no inexistent term at all.

Another common problem is to find a suitable indexing scheme for the target dictionary dataset that helps address the fault tolerant matching, scalability and low latency constraints. A common approach is to exploit q-gram indexing to address the fault tolerant search constraint [7], [8], [10], [12], [13]. However, there is a problem with this approach, which was emphasized also by Wang et al. [5]. The length of the q-gram is bounded by the smallest term in the target dictionary dataset. This leads to short q-grams, which result in long posting lists. This situation makes the solution unscalable and slow, since merging long posting lists is a time and CPU intensive operation. The problem was confirmed also by Zobel et al. and Xiao et al. [13], [14]. Our algorithm minimizes this problem by indexing terms as a whole instead of as q-grams, while ensuring that the scalability, low latency and fault tolerant search constraints are addressed.

Many researchers proposed solutions based on common dissimilarity measures such as Edit Distance, Hamming Distance and Jaccard Distance [5], [7], [8], [14]. On the other hand, a few researchers proposed probabilistic hash-based solutions such as locality sensitive hashing (LSH) [15]. However, the LSH approach violates the avoiding false negatives constraint, due to the fact that LSH may miss some true positive results [6]. Therefore, we decided to proceed with a non-LSH approach, namely an edit distance similarity approach, to avoid probabilistic false negative matches. We leave other similarity metrics such as Hamming Distance and Jaccard Distance as future work.

Edit distance can be implemented in various forms. Following Ukkonen, many researchers adopted a q-gram based method [16]. In the q-gram based method, the matching q-grams of two strings are considered to reflect the similarity of these strings. However, this approach suffers from the long posting list problem described above.
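To see the posting list blow-up concretely, the toy sketch below (our own illustration, not code from the cited papers) indexes a small dictionary by q-grams. Because q is bounded by the shortest term, every term must be decomposed into very short grams, and a single common gram accumulates many terms in its posting list; at dictionary scale these lists grow long and merging them dominates query cost.

```python
def qgrams(term: str, q: int) -> set[str]:
    """All contiguous substrings of length q of a term."""
    return {term[i:i + q] for i in range(len(term) - q + 1)}

dictionary = ["ali", "ahmet", "halit", "salih", "salah"]
q = min(len(t) for t in dictionary)  # q is forced down to 3 by "ali"

postings: dict[str, list[str]] = {}
for term in dictionary:
    for gram in qgrams(term, q):
        postings.setdefault(gram, []).append(term)

print(postings["ali"])  # ['ali', 'halit', 'salih'] -- one short gram, many terms
```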
An alternative form of edit distance implementation is to employ a trie data structure, as explained by Arslan and Egecioglu [17]. The trie-based approach eliminates the long posting list problem of the q-gram based implementation. Hence, we used a trie data structure for indexing.

III. PROBLEM DEFINITION

A. Notation

Let Σ denote the alphabet. Let D denote the set of documents in the target dictionary. Let q denote a tokenizable free text query. Let d ∈ D denote a tokenizable document. Let t_q and t_d denote tokens in q and d respectively. Let ||t_q|| and ||t_d|| denote the lengths of the tokens t_q and t_d respectively.

Let the function DocFreq(t_d) return the document frequency of the record token t_d in the target dictionary set D. Let the function TF(t_d) return the term frequency and the function IDF(t_d) the inverse document frequency of the record token t_d. Let the function I(t_d) = TF(t_d) × IDF(t_d) give the information of the record token t_d.

Let the function Edit(t_q, t_d) give the edit distance and Editw(t_q, t_d) the weighted edit distance between the tokens t_q and t_d.

Let τ_l refer to the predefined edit distance threshold that is applied to a token of length l.

Let the function

    Sim(t_q, t_d) = 1 − Edit(t_q, t_d) / max(||t_q||, ||t_d||)        (1)

give the edit similarity, while

    Simw(t_q, t_d) = 1 − Editw(t_q, t_d) / max(||t_q||, ||t_d||)      (2)

gives the weighted edit similarity between the tokens t_q and t_d.

Let a match

    m_d = (t_q, t_d, e)                                               (3)

refer to a matched token pair (t_q, t_d) in the query q where Edit(t_q, t_d) = e < ε.
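A minimal Python sketch of Eq. (1) with the plain (unweighted) edit distance is shown below; the names are illustrative rather than taken from the paper's implementation, and the weighted variant of Eq. (2) differs only in charging non-unit costs per edit operation.

```python
def edit(tq: str, td: str) -> int:
    """Classic dynamic-programming (Levenshtein) edit distance."""
    m, n = len(tq), len(td)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if tq[i - 1] == td[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution / match
        prev = curr
    return prev[n]

def sim(tq: str, td: str) -> float:
    """Eq. (1): normalized edit similarity in [0, 1]."""
    return 1.0 - edit(tq, td) / max(len(tq), len(td))

print(sim("hosein", "hussein"))  # 1 - 2/7 ≈ 0.714
```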
Given a query q and a document d, the set of all possible matches m_d = (t_q, t_d, e) is denoted as a match set M_d where m_d ∈ M_d. The function Support(M_d) returns the number of documents that contain all record tokens t_d in M_d.

Given a document d and a corresponding match set M_d, let

    0 ≤ R(M_d, d) ≤ 100                                               (4)

be a ranking function for the document d ∈ D against the match set M_d whose output reflects how approximately the document d is involved in the query q. Note that R(M_d, d) = 100 refers to an exact match while R(M_d, d) = 0 refers to no match.

Let σ refer to the predefined percentage score threshold that is used to filter out the acceptable candidate sets.

B. Problem Formulation

Given a query q, we aim to extract the top k documents d from q such that R(M_d, d) ≥ σ.

IV. SYSTEM ARCHITECTURE

The main building blocks of the application we present in this paper are illustrated in Figure 2 as a system architecture diagram. The details of the modules illustrated in the diagram are explained in the subsequent sections.

Fig. 2: System architecture diagram

The lifecycle of the application can be reviewed in two phases, namely offline and online. In the offline phase, the target dictionary of sanctioned entities is indexed and loaded into main memory for later access. In the online phase, the queries are processed and searched against the in-memory index, and the candidate matches are returned as a result in near real time.

A. Indexing

In this step, the list of target dictionary entities is indexed and loaded into main memory for later access. Although use of the trie data structure was discouraged by Zobel et al. [13] due to its space complexity, our application showed that the advantages gained in terms of time complexity outweigh the disadvantages coming from space complexity. Hence, we implemented the trie data structure suggested by Arslan and Egecioglu [17].

1) Trie Index: The trie data structure was originally proposed by Briandais [18]. Later on, it was named by Fredkin [19], the name reflecting the word retrieval. Figure 3 (positioned at the very end of the paper) is a sample trie index generated from the sample records given in Table I. Each node refers to a substring of a token in the reference dataset. A red node indicates the end of a token. The numbers in the boxed nodes hanging off the red nodes are the posting lists of the tokens. The list of record tokens was obtained by tokenizing the asciified lower case form of each record name.

TABLE I: Sample names

    UserId  Name
    1       CANBERK BERKIN OZDEMIR
    2       AHMET EMRE BUDUR
    3       OYA CIMEN BUDUR
    4       EMRAH BUDUR
    5       HUSSAIN BERK BUDAK
    6       HUSSEIN OZDEN CAN
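A minimal sketch of this indexing step is given below, assuming a plain dictionary-of-children trie whose end-of-token nodes carry posting lists of record ids; the class layout is our illustration, not the paper's implementation. A node with a non-empty posting list plays the role of the red end-of-token nodes in Fig. 3.

```python
import unicodedata

def asciify(text: str) -> str:
    """Strip diacritics so that, e.g., Turkish letters fold to plain ascii."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()

class TrieNode:
    __slots__ = ("children", "postings")
    def __init__(self):
        self.children = {}   # letter -> TrieNode
        self.postings = []   # record ids; non-empty marks end of token

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, token: str, record_id: int) -> None:
        node = self.root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.postings.append(record_id)

def build_index(records: dict[int, str]) -> Trie:
    """Tokenize the asciified lower case record names and index every token."""
    trie = Trie()
    for record_id, name in records.items():
        for token in asciify(name).lower().split():
            trie.insert(token, record_id)
    return trie

# Two of the sample records from Table I
index = build_index({2: "AHMET EMRE BUDUR", 4: "EMRAH BUDUR"})
```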
B. Query Processing

In this step, we preprocess the raw input query in two main steps. First, we tokenize the asciified lower case form of the query and obtain a list of query tokens. Then, we expand the list of query tokens with sequentially combined windows of the existing terms. For example, given the query "nether lands company", in which a whitespace character was introduced due to a misspelling or a possible line feed, we expand the query to "nether lands company netherlands landscompany netherlandscompany". This action prevents false negative matches for queries that have extra whitespace injected into query terms. The generated nonsense terms, on the other hand, are eliminated by having no match in the searching phase. We decided to proceed with a window of up to 4 terms as a result of our empirical analysis.
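A minimal sketch of this expansion, assuming the paper's window of up to 4 terms, looks as follows (the function name is illustrative):

```python
def expand_with_windows(tokens: list[str], max_window: int = 4) -> list[str]:
    """Append the concatenation of every run of 2..max_window tokens."""
    expanded = list(tokens)
    for size in range(2, max_window + 1):
        for i in range(len(tokens) - size + 1):
            expanded.append("".join(tokens[i:i + size]))
    return expanded

print(expand_with_windows(["nether", "lands", "company"]))
# ['nether', 'lands', 'company', 'netherlands', 'landscompany',
#  'netherlandscompany']
```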
C. Searching

In this step, we traverse the trie index for each query token t_q while collecting the candidate match set M. One of the crucial parts of our application is applying weighted edit distance while traversing the trie index, which is an expanded form of the function DFT-LOOK-UP_ed suggested by Arslan and Egecioglu [17]. Below are some highlights of the improvements introduced in the expanded form of the function.

The original algorithm suggested by Arslan and Egecioglu was a function that returns the minimum edit distance when the query token is compared against any of the record tokens [17]. We have improved the algorithm to collect the posting lists of the candidate record tokens that are within a given edit distance of the query token.

In addition, we incorporated confusion matrices, namely IUC, DUC and SUC, into the edit distance calculation steps. In this way, we were able to distinguish unlikely edit errors from likely edit errors. For example, we wanted to eliminate the false positive match of the query term "Saydam" with the record term "Saddam" while collecting the candidate match of the query term "Hossein" with the record term "Hussein". Hence, by means of weighted edit distance we were able to minimize the false positives.

The resulting algorithm is named GT FreeText and presented in Algorithm 1. Let's first review some additional notation used in Algorithm 1.

1) Notations for Algorithm 1: Let t_qj refer to the j'th letter of the token t_q.

Let v denote a vertex in a tree and v_0 denote the root of the tree. Let T_c(v_p) be a function that enumerates the children of the parent vertex v_p where v_c ∈ T_c(v_p). Let T_p(v) be a function that returns the parent of the vertex v. Let Letter(v) be a function that returns the letter that corresponds to the vertex v. Let P(v) be a function that enumerates the posting list that corresponds to the vertex v.

Let the polymorphic functions Token(v_eow) and Token(m_d) return the record term t_d that corresponds to the end-of-word vertex v_eow and to the match m_d. Let the polymorphic functions Tokens(d) and Tokens(M_d) enumerate all of the record terms in the document d and the match set M_d, while ||Tokens(d)|| and ||Tokens(M_d)|| denote the number of record terms in the document d and the match set M_d respectively.

Let m_(d,i) refer to the i'th match in the match set M_d. Let t_(d,i) refer to the i'th record term in the document d.

Let EOW(v) be a function that returns a boolean value representing whether the vertex v is an end-of-word vertex or not.

Let IUC(a, b) denote the weighted insertion unit cost of the letter a after the letter b where 0 ≤ IUC(a, b) ≤ 1. Let DUC(a, b) denote the weighted deletion unit cost of the letter a after the letter b where 0 ≤ DUC(a, b) ≤ 1. Let SUC(a, b) denote the weighted substitution unit cost of the letter a for the letter b where 0 ≤ SUC(a, b) ≤ 1. IUC(a, b) = DUC(a, b) = SUC(a, b) = 1 where either of the letters a or b is a non-ascii character, including NULL.

Algorithm 1 GT FreeText

```
procedure GT_FREETEXT(q)
    M ← init                            ▷ Initialize candidate set dictionary
    for all tq ∈ q do
        TRAVERSE_TRIE1(tq, M)
    end for
    return M
end procedure

procedure TRAVERSE_TRIE1(tq, M)
    m ← len(tq)                         ▷ Length of query term
    ε ← τ_m
    l ← 1                               ▷ Init level to 1
    for all vc where vc ∈ Tc(v0) do
        TRAVERSE_TRIE2(vc, M, ε, l, tq)
    end for
end procedure

procedure TRAVERSE_TRIE2(vp, M, ε, l, tq)
    vpp ← Tp(vp)
    chda ← Letter(vpp)                  ▷ Previous letter of td
    chdb ← Letter(vp)                   ▷ Current letter of td
    chqa ← NULL                         ▷ Previous letter of tq
    m ← len(tq)                         ▷ Length of query term
    for j = 0 to m do
        chqb ← tqj                      ▷ Current letter of tq
        IC ← D[v, l, j−1] + IUC(chda, chdb)
        DC ← D[v, l−1, j] + DUC(chqa, chqb)
        SC ← D[v, l−1, j−1] + SUC(chqb, chdb)
        D[v, l, j] ← min{IC, DC, SC}
        chqa ← chqb
    end for
    e ← D[v, l, m]
    if EOW(vp) = true AND e ≤ ε then
        for d ∈ P(vp) do                ▷ Collect posting list
            td ← Token(vp)
            md ← (tq, td, e)
            Append md to Md
        end for
    end if
    min_distance ← min{D[v, l, j] | 0 ≤ j ≤ m}
    if min_distance ≤ ε then
        l ← l + 1
        ε ← τ_max(m, l)
        for all vc where vc ∈ Tc(vp) do
            TRAVERSE_TRIE2(vc, M, ε, l, tq)
        end for
    end if
end procedure
```
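A condensed Python sketch of this traversal is shown below. It reuses the Trie from the indexing sketch, carries one row of the edit distance table per visited node, prunes a branch as soon as the row minimum exceeds the threshold (so no inexistent term is ever probed), and collects posting lists at end-of-token nodes. The three unit-cost functions stand in for the IUC/DUC/SUC confusion matrices; here they are uniform placeholders, whereas the paper's weighted version charges lower costs for likely errors (e.g. o→u) than unlikely ones. For simplicity the sketch also keeps a fixed threshold rather than the level-dependent τ_max(m, l) update of Algorithm 1.

```python
def iuc(prev_d, cur_d): return 1.0   # insertion unit cost (uniform placeholder)
def duc(prev_q, cur_q): return 1.0   # deletion unit cost (uniform placeholder)
def suc(cur_q, cur_d): return 0.0 if cur_q == cur_d else 1.0  # substitution

def search(trie, tq: str, tau: float) -> list:
    """Collect (query token, record token, distance, record id) tuples for
    every indexed token within weighted edit distance tau of tq."""
    matches = []
    root_row = list(range(len(tq) + 1))  # distances against the empty prefix

    def visit(node, prev_letter, letter, parent_row, prefix):
        # Extend the edit-distance table by one row for this trie letter.
        row = [parent_row[0] + iuc(prev_letter, letter)]
        for j in range(1, len(tq) + 1):
            prev_q = tq[j - 2] if j > 1 else None
            row.append(min(
                parent_row[j] + iuc(prev_letter, letter),    # extra record letter
                row[j - 1] + duc(prev_q, tq[j - 1]),         # extra query letter
                parent_row[j - 1] + suc(tq[j - 1], letter),  # substitute / match
            ))
        token = prefix + letter
        if node.postings and row[-1] <= tau:     # end-of-token node within tau
            for record_id in node.postings:
                matches.append((tq, token, row[-1], record_id))
        if min(row) <= tau:                      # prune hopeless subtrees
            for child_letter, child in node.children.items():
                visit(child, letter, child_letter, row, token)

    for letter, child in trie.root.children.items():
        visit(child, None, letter, root_row, "")
    return matches

watchlist = build_index({5: "HUSSAIN BERK BUDAK", 6: "HUSSEIN OZDEN CAN"})
print(search(watchlist, "hosein", tau=2.0))
# [('hosein', 'hussein', 2.0, 6)] -- 'hussain' is 3 edits away; its branch is pruned
```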
D. Filtering

The aim of the filtering step is to improve the quality and performance of the ranking step. We believe that the best way to rank is not to rank. In other words, we will have a better ranking step if we filter out the irrelevant matches that do not deserve to be ranked. In this section, we introduce some tips for the filtering step.

TABLE II: Document frequencies of sample noisy words

    TERMS           DOC FREQUENCY
    CORPORATION     7000+
    BANK            6000+
    INTERNATIONAL   5000+
    SECURITIES      3000+
    GLOBAL          2000+

TABLE III: Support of frequent item sets

    TERMS                         SUPPORT
    CORPORATION + INTERNATIONAL   2000+
    GLOBAL + CORPORATION          100+
    CORPORATION + SECURITIES      40+

TABLE IV: Filtering out noisy word matches

    Query       INNOCENTA CORPORATION
    Result set
    1           BADDY CORPORATION
    2           WANTED INTL CORPORATION
    3           SANCTIONED CORPORATION LTD
    4           BOMBER CORPORATION
    5           NARCOTIC CORPORATION

TABLE V: Filtering out frequent itemsets

    Query       INNOCENTA GLOBAL CORPORATION
    Result set
    1           BADDY GLOBAL CORPORATION
    2           WANTED INTL GLOBAL CORPORATION
    3           SANCTIONED GLOBAL CORPORATION LTD
    4           BOMBER GLOBAL CORPORATION
    5           NARCOTIC GLOBAL CORPORATION

TABLE VI: Unique term sets

    RECORDS                         SUPPORT
    GLOBAL CORPORATION SECURITIES   1
    BANK INTERNATIONAL              1
    INTERNATIONAL CORPORATION BANK  1

For the first tip, let's take the example given in Table IV. In this example the query token CORPORATION matched many records that contain this token. Considering that there are possibly even more records that contain this token, as shown in Table II, presenting these matches as candidate matches would fill up the top k slots with nonsense matches. As a result, it would prevent true positive matches from taking a significant slot in the top k result set. Since our aim is to improve the quality of the matches in the top k slots, we filter out those results that consist of a single match m_d = (t_q, t_d, e) where DocFreq(t_d) > k.

As an example for the second tip, we analyzed another frequent token, "GLOBAL", which co-occurs with the term "CORPORATION" as shown in Table V. Since the number of documents that contain both of the terms GLOBAL and CORPORATION is still more than 100, as also shown in Table III, presenting these matches would fill up the top k slots for k > 100 unless we eliminate them. In order to eliminate them, we keep track of the number of records that correspond to the candidate sets M_d and filter out those match sets M_d where Support(M_d) ≥ k.

As the final tip, we want to mention the unique records whose individual terms are all frequent terms, as shown in Table VI. If we have a match set M_d where Support(M_d) ≤ k, we let it take a slot in the top k candidate match set.
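A minimal sketch of these three tips as a single predicate over a candidate match set might look as follows; all names are illustrative and the thresholds follow the tips above.

```python
def keep_candidate(match_set: list, support: int,
                   doc_freq: dict[str, int], k: int) -> bool:
    """Apply the filtering tips to one candidate. match_set holds (tq, td, e)
    tuples, support is Support(Md), doc_freq maps record tokens to DocFreq(td)."""
    # Tip 1: a candidate backed only by a single noisy token is dropped when
    # that token occurs in more than k dictionary records.
    if len(match_set) == 1:
        _, td, _ = match_set[0]
        if doc_freq.get(td, 0) > k:
            return False
    # Tip 2: frequent item sets (match sets shared by >= k records) are dropped.
    # Tip 3 is the flip side: a unique term set survives even when each of its
    # terms is individually frequent, because its support is low.
    return support < k

doc_freq = {"corporation": 7000, "global": 2000}
single = [("corporaton", "corporation", 1.0)]
print(keep_candidate(single, support=7000, doc_freq=doc_freq, k=10))  # False
pair = [("global", "global", 0.0), ("corporaton", "corporation", 1.0)]
print(keep_candidate(pair, support=100, doc_freq=doc_freq, k=10))     # False
unique = [("bank", "bank", 0.0), ("international", "international", 0.0)]
print(keep_candidate(unique, support=1, doc_freq=doc_freq, k=10))     # True
```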
E. Ranking

In this step, we need to calculate a score for each record in the match sets M that come out of the filtering step. Contrary to the ordinary scoring schemes commonly adopted by mainstream search engines, the resulting scores of this step must reflect the percentage similarity of the record name compared to the matched query terms. In other words, a score of 100 will refer to an exact match while a score of 0 will be given to no match at all. After calculating the percentage scores, we sort the matches in descending order of their percentage scores and select the top k candidate results. Below is a step by step formulation of this process.

Given m_d = (t_q, t_d, e), let the function

    MI(m_d) = Sim(t_q, t_d) × I(t_d)                                  (5)

give the mutual information of the match m_d, while

    MIw(m_d) = Simw(t_q, t_d) × I(t_d)                                (6)

returns the weighted mutual information of the match m_d.

Then the total mutual information TMI_d of the match set M_d for the document d is defined as follows:

    TMI_d(M_d) = Σ_{i=1}^{||M_d||} MI(m_(d,i))                        (7)

On the other hand, the total weighted mutual information TMIw_d of the match set M_d for the document d is defined as follows:

    TMIw_d(M_d) = Σ_{i=1}^{||M_d||} MIw(m_(d,i))                      (8)

The total information in the document d is defined as follows:

    TID_d = Σ_{i=1}^{||Tokens(d)||} I(t_(d,i))                        (9)

So the percentage score of the document d that corresponds to a particular match set M_d is defined as follows:

    Score(d, M_d) = 100 × TMI(M_d) / TID(d)                           (10)

And the weighted percentage score of the document d that corresponds to a particular match set M_d is defined as follows:

    Scorew(d, M_d) = 100 × TMIw(M_d) / TID(d)                         (11)

Finally, in the ranking step each match set M_d ∈ M is scored by means of the function given in Eq. 11 and ranked in descending order of this score. As a result, the top k candidate result set is obtained.
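A minimal sketch of the unweighted scoring path (Eqs. (5), (7), (9), (10)) is given below; the I(t_d) values and names are illustrative assumptions, sim() is the Eq. (1) sketch from the notation section, and the weighted variant of Eq. (11) would simply substitute Simw and MIw.

```python
def score(match_set: list, doc_tokens: list[str],
          info: dict[str, float]) -> float:
    """Percentage score of Eq. (10) for one document and its match set;
    info[td] stands for I(td) = TF(td) * IDF(td)."""
    tmi = sum(sim(tq, td) * info[td] for tq, td, _ in match_set)  # Eqs. (5), (7)
    tid = sum(info[td] for td in doc_tokens)                      # Eq. (9)
    return 100.0 * tmi / tid                                      # Eq. (10)

info = {"hussein": 9.2, "ozden": 7.5, "can": 3.1}   # assumed I(td) values
match_set = [("hosein", "hussein", 2.0)]
print(round(score(match_set, ["hussein", "ozden", "can"], info), 1))  # 33.2
```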
F. Scaling

In this part of our application, we aimed to revise the architectural design so that our application can be safely scaled out while increasing the number of records it can search from.

We split our target reference records into n segments of r records and created a separate trie data structure for each segment. As a result, we obtained a forest of trie data structures, each covering r records. At query time, we search the query q in each trie data structure and obtain the top k candidate results from each trie. At the end, we merge the n candidate result sets, sort all results by their calculated scores and return the top k records from the resulting merged and aggregated result set. Figure 4 shows the resulting architectural diagram. Each trie index can be served by a separate process, and the processes can run either all in the same machine or on distributed machines.

Fig. 4: Scaling out architecture of the application

In this way, we are able to scale out the application while increasing the number of records in the target dictionary and preserving the capability of addressing all of the constraints of the problem.
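A sequential sketch of this scatter-gather scheme is shown below; in a deployment each segment's search would run in its own process or machine, but the merge logic is the same. The helper names are illustrative, and build_index is reused from the indexing sketch.

```python
import heapq

def build_forest(records: dict[int, str], segment_size: int) -> list:
    """Split the reference records into segments of r records and build one
    trie per segment."""
    items = list(records.items())
    return [build_index(dict(items[i:i + segment_size]))
            for i in range(0, len(items), segment_size)]

def global_top_k(forest: list, segment_top_k, k: int) -> list:
    """Scatter the query across the trie forest and gather a global top-k.
    segment_top_k(trie, k) is assumed to return one segment's k best
    (score, record_id) pairs, e.g. by composing the search and score
    sketches above."""
    partials = []
    for trie in forest:
        partials.extend(segment_top_k(trie, k))  # top-k per segment
    return heapq.nlargest(k, partials)           # merge, sort, keep global top-k
```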
V. EXPERIMENTS

We carried out a series of experiments on a labeled dataset to figure out the performance of the algorithm in terms of response time, indexing time and response quality. Below are the details of each type of analysis along with the details of the dataset.

A. Dataset

We collected two main types of datasets, namely query datasets and reference datasets.

1) Query Datasets: We collected three different datasets from a leading bank in Turkey, namely structured individual queries (Q_in), structured corporate queries (Q_co) and unstructured free text queries (Q_mix). The details of the datasets are described below.

Structured Individual-Typed Queries (Q_in): The structured individual dataset consists of full names of individuals. The total number of queries in this dataset is 2,110,502. These queries are searched against a list of individual names of size 2,038,234. The dataset is not labeled and it was used just to test the response time performance of the applications on individual-typed queries.

Structured Corporate-Typed Queries (Q_co): The structured corporate dataset consists of the legal names of corporates. The total number of queries in this dataset is 488,803. These queries are searched against a list of reference corporate names of size 207,468. The query dataset is not labeled and it was used just to test the response time characteristics of the applications on corporate-typed queries.

Mixed Unstructured Free-Text Queries (Q_mix): The mixed unstructured free-text query dataset consists of a fraction of randomly selected international money transfer queries received in a one-month time frame. The total number of queries in the dataset is 406,928. However, we discovered that many queries are redundant, so we aggregated the dataset. As a result, the number of distinct queries in the dataset turned out to be 85,572. All of the distinct queries are labeled either as a true positive match with a certain record name or as a true negative match. The number of queries flagged as a true positive match with at least one record text is 8,409. On the other hand, a total of 12,272 record names have been flagged as a true positive match with a certain query.

Table VII shows a sample snapshot of the labeled dataset. Note that the query "435021 BANK KBC" is flagged as a true positive match with two different record texts. Note also that the query text "INVOICE RECEIPT" is flagged as a true negative match since it has no corresponding matching record text.

TABLE VII: Sample dataset

    QType  QueryText                    RecordText
    in     MARIA CELTIQ                 MARIAN OYA CELTIK
    in     AHMET EMRE MIYESE            AHMET MIYESE
    co     CITIBANK                     CITYBANK
    co     HERMANN NIMCOM GMBH          HERMANN
    co     DALGA DURAN MAKINA A.S.      DURAN MAKIN
    mix    MUHAMMED SALIH AHMET EMRE    Muhammad SALAH
    mix    ATC ENTERPRISES LTD KAYSERI  ATC LTD
    mix    435021 BANK KBC              KWANGSON BANKING CO.
    mix    435021 BANK KBC              KBC FINANCIAL INC
    mix    ODESSA                       ODESSA AIR
    mix    INVOICE RECEIPT              (no matching record)

2) Reference Datasets: We used three different reference datasets to search against.

Individual Entities (R_in): This dataset consists of 2,038,234 individual names.

Corporate Entities (R_co): This dataset consists of 207,468 corporate legal names.

Small Mixed Entities (R_mix): This dataset consists of 43,019 entities comprising both individual and corporate legal names.

B. Applications

We benchmarked 2 different applications in the experimentation phase. Below are brief details of these applications.

1) GT FreeText: The application framework presented in this paper is named GT FreeText throughout the experimental analysis.

2) LSH: We benchmarked our application against a locality sensitive hashing framework provided by Informatica, namely Name3. The configuration of the application was done by a local representative of the application vendor. The line of business application that we benchmarked stores its hash indexes in a relational database rather than in memory. This application is named LSH throughout the experimental analysis.

C. Evaluation

We evaluated two main characteristics of the applications, namely their temporal characteristics and the quality of their result sets. The former is the temporal analysis in which we measured the indexing time and response time of each application. The latter measures the quality of the result sets of each application compared to the human evaluations.

D. Temporal Analysis

Indexing times and response times of the applications are analyzed in this section.

1) Indexing Time Analysis: Each of the three reference datasets was indexed using GT FreeText and LSH. The resulting indexing time for each application is shown in Fig. 5. It can be observed that the indexing time complexity of GT FreeText is sublinear while LSH presents exponential time complexity. Considering that LSH stores its hash indexes in a relational database, we might expect better indexing times for LSH if it stored the indexes in memory. Nevertheless, we can safely conclude that GT FreeText can index millions of records in a few minutes.

Fig. 5: Indexing Time

2) Response Time Analysis: Response times of the applications GT FreeText and LSH were analyzed across the three query datasets, namely structured individual-typed queries, structured corporate-typed queries and unstructured mixed-typed queries. The resulting response times are presented in Figures 6, 7 and 8 respectively. The scattered data points of each plot show that the response times for each type of query are far lower in GT FreeText than in LSH.

A closer look at the box plot of individual-typed structured queries shows that the response times of the majority of the queries remain comparable for GT FreeText and LSH. However, the scatter plot reveals that many queries take more than 2 seconds in LSH, which is unacceptable in a real-time financial transaction processing context. The scatter plot gives a clear picture to conclude that the response time of individual-typed queries in the GT FreeText application remains under 1 second.

Fig. 6: Response time of searching Q_in from R_in ((a) scattered data points, (b) closer look)

The box plot of corporate-typed queries shows that the latency of GT FreeText for the majority of the corporate-typed queries is higher than that of LSH. However, the scatter plot reveals that GT FreeText returns its response within 2 seconds whereas LSH fails to respond within 2 seconds for many queries.

Fig. 7: Response time of searching Q_co from R_co ((a) scattered data points, (b) closer look)

LSH turned in its worst response time performance on mixed-typed unstructured free-text queries, which is projected on both the box plot and the scattered data points in Figures 8a and 8b respectively. It can easily be observed that GT FreeText preserved its compelling response time performance for mixed-typed unstructured queries as well.

Fig. 8: Response time of searching Q_mix from R_mix ((a) scattered data points, (b) closer look)

E. Quality Analysis

A controlled experiment was conducted on the unstructured free-text query dataset in which true positive matches were labeled. The dataset was queried across the applications GT FreeText and LSH. Figures 9 and 10 show the evaluation metrics along with the match counts calculated for the result set of each application. Below is a comparative analysis of each type of metric.

Fig. 9: Match Counts

Fig. 10: Evaluation metrics

1) Precision: The ratio of the relevant documents retrieved to the total number of retrieved documents:

    precision = (# of relevant documents retrieved) / (# of retrieved documents)        (12)

The true positive match count of GT FreeText is greater than that of LSH. On the other hand, LSH returned far more false positive results. Hence the precision of GT FreeText turned out to be greater than that of LSH, as seen in Figure 10.

2) Recall: The ratio of the relevant documents retrieved to the total number of relevant documents:

    recall = (# of relevant documents retrieved) / (# of all relevant documents)        (13)

The higher true positive match count and lower false negative error count of GT FreeText made its recall value greater than that of LSH, as shown in Figure 10.

3) F-Measure: The weighted harmonic mean of precision and recall. It can be formulated as

    F_β = (1 + β²) × (precision × recall) / (β² × precision + recall)                   (14)

where β adjusts the importance of precision and recall relative to each other. We used F_5 for evaluation since recall is far more important than precision here and since extraction recall is essential to address the avoiding false negatives constraint. Since the cost of a false negative match is prohibitively higher than the cost of a false positive match in the context of terrorism financing, we penalized false negative matches 5 times more than false positive matches by using the F_β measure with β = 5. We decided on the value β = 5 as a result of our empirical analysis, in which we observed that the function F_β does not reflect any significant change when β > 5.

As a result, since GT FreeText outperformed LSH in terms of both higher true positive matches and lower false negative error, the value of F_5 for GT FreeText turned out to be greater than that of LSH, as given in Figure 10.
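For reference, the evaluation metric computes directly from raw match counts such as those in Fig. 9; a minimal sketch with illustrative counts:

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 5.0) -> float:
    """Eqs. (12)-(14) from raw match counts, with the paper's beta = 5."""
    precision = tp / (tp + fp)                                        # Eq. (12)
    recall = tp / (tp + fn)                                           # Eq. (13)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)  # Eq. (14)

# Illustrative counts only, not the measured results of Fig. 9.
print(round(f_beta(tp=8000, fp=500, fn=409), 3))  # 0.951
```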
rithmforSimilarityJoinsWithEditDistanceConstraints.Proceedings [2] SpeciallyDesignatedNationalsList(SDN),2016. oftheVLDBEndowment,1(1):933–944,2008. [3] TerroristIdentitiesDatamartEnvironment,2016. [15] AristidesGionis,PiotrIndyk,andRajeevMotwani. SimilaritySearch [4] Eric Schmitt and Michael S. Schmitt. Tamerlan Tsarnaev, Bomb inHighDimensionsviaHashing. VLDB’99Proceedingsofthe25th Suspect,WasonWatchLists,2016. International Conference on Very Large Data Bases, 99(1):518–529, [5] Wei Wang,Chuan Xiao,Xuemin Lin, and ChengqiZhang. Efficient 1999. approximateentityextractionwitheditdistanceconstraints. Proceed- [16] Esko Ukkonen. Approximate string-matching with q-grams and ings of the 35th SIGMOD international conference on Management maximal matches. Theoretical Computer Science, 92(1):191–211, ofdata-SIGMOD’09,pages759–770,2009. 1992. [6] Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti, and Dong [17] Abdullah N. Arslan and Omer Egecioglu. Dictionary Look-Up Xin. Anefficientfilterforapproximatemembershipchecking. Proc. Within Small Edit Distance. International Journal of Foundations ACM SIGMOD Int. Conf. on Management of Data, pages 805–818, ofComputerScience,15(1):57–71,2004. 2008. [18] Rene De La Briandais. File Searching Using Variable Length Keys. [7] Dong Deng, Guoliang Li, and Jianhua Feng. An efficient trie- PapersPresentedatthetheMarch3-5,1959,WesternJointComputer basedmethodforapproximateentityextractionwithedit-distancecon- Conference,1:295–298,1959. straints.Proceedings-InternationalConferenceonDataEngineering, [19] Edward Fredkin. Trie memory. Communications of the ACM, pages762–773,2012. 3(9):490–499,1960. [8] DongDeng,GuoliangLi,JianhuaFeng,YiDuan,andZhiguoGong.A unifiedframeworkforapproximatedictionary-basedentityextraction. VLDBJournal,24(1):143–167,2014. [9] ZhengXu,DouglasBurdick,andLouiqaRaschid. ExploitingListsof NamesforNamedEntityIdentificationofFinancialInstitutionsfrom UnstructuredDocuments. 0(0):17,2016. [10] Aleksander Cisłak and Szymon Grabowski. A practical index for approximate dictionary matching with few mismatches. pages 1–9, 2015. [11] Gerth Stlting Brodal and Leszek Gasieniec. Approximate dictionary queries. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioin- formatics),1075:65–74,1996. Root H O B C E U Y Z U E A M S A D D R N R S E U A K B A E A E M N R K I E H I I I N R N N R K 5 6 2 1 6 2-3-4 5 5 1 6 1 4 2 Fig. 3: Sample trie for sample names in Table I
