Benford-Newcomb Subsequences for Fraud Detection

Aaron Carl Smith

January 28, 2013

arXiv:1301.6086v1 [math.ST] 22 Jan 2013

Abstract

Benford’s law is frequently used to evaluate the likelihood that data are misrepresentative. Typically, statistical tests measure this likelihood. Another method of employing Benford’s law is to compare the frequency of leading digits to the probabilities of leading digits over a subset of the natural numbers. This paper proposes using the probabilities of leading digits from uniformly distributed natural numbers to establish interval criteria for when to look more closely into the possibility of misrepresentative data.

Contents

1 Introduction
2 Benford-Newcomb Subsequences

1 Introduction

Benford’s law gives a probability distribution for the frequency of the leading digit of natural numbers. Simon Newcomb described the rule for decimal representation of natural numbers in 1881 [3], and Frank Benford generalized Newcomb’s observations to any base in 1938 [1]. In 1995, Theodore Hill used the mantissa σ-algebra to further extend the leading-digit law to real numbers. The mantissa σ-algebra consists of sets of numbers with the same coefficient in scientific notation after truncation [2].

Definition 1.1 (Benford’s Law). In base $b$, the probability that the leading digit of a real number is $k$ is given by

\[ P(k) = \log_b\left(1 + \frac{1}{k}\right), \quad k \in \{1, 2, 3, \ldots, b-1\}. \tag{1.1} \]

In decimal representation (base 10), the probabilities of each of the leading digits are given by

\[ P(k) = \log_{10}\left(1 + \frac{1}{k}\right), \quad k \in \{1, 2, 3, 4, 5, 6, 7, 8, 9\}, \tag{1.2} \]

which approximately gives:

k      1      2      3      4      5      6      7      8      9
P(k)   0.301  0.176  0.125  0.097  0.079  0.067  0.058  0.051  0.046

The law goes further to say that the probability distribution of digits after the leading digit converges to uniform as the digit’s position moves to the right [1, 2]. Benford’s law does not apply to several types of numeric data, such as identification numbers.
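As an illustration of Definition 1.1, the following minimal Python sketch evaluates the leading-digit probabilities for an arbitrary base. The function names `benford_probability` and `benford_table` are hypothetical and do not come from the paper; the paper's own figures were produced with R.

```python
import math


def benford_probability(k: int, base: int = 10) -> float:
    """Return P(k) = log_b(1 + 1/k) for a leading digit k in {1, ..., base - 1}."""
    if not 1 <= k <= base - 1:
        raise ValueError("k must be a leading digit between 1 and base - 1")
    return math.log(1 + 1 / k, base)


def benford_table(base: int = 10) -> dict:
    """Map every possible leading digit in the given base to its probability."""
    return {k: benford_probability(k, base) for k in range(1, base)}


if __name__ == "__main__":
    # Print the base-10 probabilities rounded to three decimals.
    for digit, prob in benford_table().items():
        print(digit, round(prob, 3))
```

Rounded to three decimals, the base-10 values agree with the probabilities listed under (1.2), and the probabilities sum to 1 in any base because the product of the terms (1 + 1/k) telescopes to b.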
2 Benford-Newcomb Subsequences

Consider the map $f_b$ that sends natural numbers to their leading digits,

\[ f_b : \mathbb{N} \to \{1, 2, 3, \ldots, b-1\}, \quad x \mapsto \left\lfloor \frac{x}{b^{\lfloor \log_b x \rfloor}} \right\rfloor. \tag{2.1} \]

Let $\mu_N$ be the uniform probability measure on $\mathbb{N}$ where $\mu_N(k) = \frac{1}{N}$ for all $k \in \{1, 2, 3, \ldots, N\}$. Let’s use $\mu_N$ to construct a probability measure of leading digits,

\[ P_{bN}(k) = \mu_N(\{x \in \mathbb{N} \mid f_b(x) = k\}). \tag{2.2} \]

For a fixed base $b$ and fixed leading digit $k$, consider the sequences $(P_{bN}(k))_{N=1}^{\infty}$; in general these sequences do not converge. The purpose of this paper is to propose using intervals of the form

\[ \left[ \liminf_{N \to \infty} P_{bN}(k), \ \limsup_{N \to \infty} P_{bN}(k) \right] \tag{2.3} \]

to identify possibly fraudulent data. If a data set’s frequency of leading digits, in base $b$ representation, is not contained in these intervals, then look further into the possibility of tampered data. For $N > b$, the local minima of $P_{bN}(k)$ with respect to $N$ are of the form

\[ P_{bN}(k) = \frac{1 + b + b^2 + \ldots + b^{\alpha-1}}{k b^{\alpha} - 1}, \quad N = k b^{\alpha} - 1, \tag{2.4} \]

and the local maxima are of the form

\[ P_{bN}(k) = \frac{1 + b + b^2 + \ldots + b^{\alpha}}{(k+1) b^{\alpha} - 1}, \quad N = (k+1) b^{\alpha} - 1. \tag{2.5} \]

As $\alpha \to \infty$, (2.4) tends to $\frac{1}{k(b-1)}$ and (2.5) tends to $\frac{b}{(k+1)(b-1)}$. Thus if the frequencies of a data set’s leading digits are not within

\[ \left[ \frac{1}{k(b-1)}, \ \frac{b}{(k+1)(b-1)} \right], \tag{2.6} \]

further inquiry is called for. The advantage of the interval method is that one may use it to quickly screen data.

[Figure: two panels, titled "Base 10 CDFs" and "Benford's Law and Intervals for Base 10". Axis labels: leading digits, N, frequencies, probabilities. Legend: liminf limsup interval; Benford's Law.]

The lines show how the cdfs change with N. The figures were constructed with R [4].

References

[1] F. Benford, The law of anomalous numbers, Proceedings of the American Philosophical Society (1938), 551–572.

[2] Theodore P. Hill, A statistical derivation of the significant-digit law, Statist. Sci. 10 (1995), no. 4, 354–363. MR 1421567 (98a:60021)

[3] Simon Newcomb, Note on the Frequency of Use of the Different Digits in Natural Numbers, Amer. J. Math. 4 (1881), no. 1-4, 39–40. MR 1505286

[4] R Core Team, R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, 2012, ISBN 3-900051-07-0.
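As a supplementary illustration of Section 2, the following Python sketch computes the leading-digit map (2.1), the measure (2.2) by brute force, and the screening interval (2.6). It assumes the data are positive integers. All function names (`leading_digit`, `p_bN`, `interval`, `flag_digits`) are hypothetical and are not taken from the paper.

```python
from fractions import Fraction


def leading_digit(x: int, base: int = 10) -> int:
    """Leading digit of a positive integer x in the given base (the map f_b)."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    while x >= base:
        x //= base  # drop the last digit until only the leading one remains
    return x


def p_bN(k: int, N: int, base: int = 10) -> Fraction:
    """P_{bN}(k): share of the integers 1..N whose leading digit is k."""
    count = sum(1 for x in range(1, N + 1) if leading_digit(x, base) == k)
    return Fraction(count, N)


def interval(k: int, base: int = 10):
    """Endpoints of interval (2.6): 1/(k(b-1)) and b/((k+1)(b-1))."""
    lower = Fraction(1, k * (base - 1))
    upper = Fraction(base, (k + 1) * (base - 1))
    return lower, upper


def flag_digits(values, base: int = 10):
    """Return the leading digits whose observed frequency lies outside (2.6)."""
    digits = [leading_digit(v, base) for v in values]
    n = len(digits)
    flagged = []
    for k in range(1, base):
        freq = Fraction(digits.count(k), n)
        lower, upper = interval(k, base)
        if not lower <= freq <= upper:
            flagged.append(k)
    return flagged
```

For example, `p_bN(1, 9)` and `p_bN(1, 19)` reproduce (2.4) and (2.5) with k = 1 and α = 1, and `flag_digits(range(1, 1000))` flags nothing because every digit then has frequency 1/9, which lies inside each interval. Exact fractions are used here only to avoid floating-point ties at the interval endpoints; a flagged digit is a prompt for closer inspection, not evidence of fraud by itself.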
