Too good to be true: when overwhelming evidence fails to convince

Lachlan J. Gunn,¹ François Chapeau-Blondeau,² Mark D. McDonnell,³ Bruce R. Davis,¹ Andrew Allison,¹ and Derek Abbott¹

¹School of Electrical and Electronic Engineering, The University of Adelaide 5005, Adelaide, Australia
²Laboratoire Angevin de Recherche en Ingénierie des Systèmes (LARIS), University of Angers, 62 avenue Notre Dame du Lac, 49000 Angers, France
³School of Information Technology and Mathematical Sciences, University of South Australia, Mawson Lakes SA 5095, Australia

arXiv:1601.00900v1 [stat.AP] 5 Jan 2016

Is it possible for a large sequence of measurements or observations, which support a hypothesis, to counterintuitively decrease our confidence? Can unanimous support be too good to be true? The assumption of independence is often made in good faith; however, rarely is consideration given to whether a systemic failure has occurred. Taking this into account can cause certainty in a hypothesis to decrease as the evidence for it becomes apparently stronger. We perform a probabilistic Bayesian analysis of this effect with examples based on (i) archaeological evidence, (ii) weighing of legal evidence, and (iii) cryptographic primality testing. We find that even with surprisingly low systemic failure rates, high confidence is very difficult to achieve, and in particular we find that certain analyses of cryptographically-important numerical tests are highly optimistic, underestimating their false-negative rate by as much as a factor of 2^80.

INTRODUCTION

In a number of branches of science, it is now well known that deleterious effects can conspire to produce a benefit or desired positive outcome. A key example where this manifests is in the field of stochastic resonance [1-3], where a small amount of random noise can surprisingly improve system performance, provided some aspect of the system is nonlinear. Another celebrated example is that of Parrondo's Paradox, where individually losing strategies combine to provide a winning outcome [4, 5].

Loosely speaking, a small amount of 'bad' can produce a 'good' outcome. But is the converse possible? Can too much 'good' produce a 'bad' outcome? In other words, can we have too much of a good thing?

The answer is affirmative: when improvements are made that result in a worse overall outcome, the situation is known as Verschlimmbesserung [6] or disimprovement. Whilst this converse paradigm is less well known in the literature, a key example is the Braess Paradox, where an attempt to improve traffic flow by adding bypass routes can counterintuitively result in worse traffic congestion [7-9]. Another example is the truel, in which three gunmen fight to the death; it turns out that under certain conditions the weakest gunman surprisingly reduces his chances of survival by firing a shot at either of his opponents [10]. These phenomena can be broadly considered to fall under the class of anti-Parrondo effects [11, 12], where the inclusion of winning strategies fails.

In this paper, for the first time, we perform a Bayesian mathematical analysis to explore the question of multiple confirmatory measurements or observations, showing when they can—surprisingly—disimprove confidence in the final outcome. We choose the striking example that increasing confirmatory identifications in a police line-up or identity parade can, under certain conditions, reduce our confidence that a perpetrator has been correctly identified.

Imagine that as a court case drags on, witness after witness is called. Let us suppose thirteen witnesses have testified to having seen the defendant commit the crime. Witnesses may be notoriously unreliable, but the sheer magnitude of the testimony is apparently overwhelming. Anyone can make a misidentification, but intuition tells us that, with each additional witness in agreement, the chance of them all being incorrect will approach zero. Thus one might naïvely believe that the weight of as many as thirteen unanimous confirmations leaves us beyond reasonable doubt.

However, this is not necessarily the case, and more confirmations can surprisingly disimprove our confidence that the defendant has been correctly identified as the perpetrator. This type of possibility was recognised intuitively in ancient times. Under ancient Jewish law [13], one could not be unanimously convicted of a capital crime—it was held that the absence of even one dissenting opinion among the judges indicated that there must remain some form of undiscovered exculpatory evidence.

Such approaches are greatly at odds with standard practice in engineering, where measurements are often taken to be independent. When this is so, each new measurement tends to lend support to the outcome with which it most concords. An important question, then, is to distinguish between the two types of decision problem: those where additional measurements truly lend support, and those for which increasingly consistent evidence either fails to add confidence or actively reduces it. Otherwise, it is only later, when the results come under scrutiny, that unexpectedly good results are questioned; Mendel's plant-breeding experiments provide a good example of this [14, 15], his results matching their predicted values sufficiently well that their authenticity has been mired in controversy since the early 20th century.

The key ingredient is the presence of a hidden failure state that changes the measurement response. This change may be a priori quite rare—in the applications that we shall discuss, it ranges from 10^-1 to 10^-19—but when several observations are aggregated, the a posteriori probability of the failure state can increase substantially, and even come to dominate the a posteriori estimate of the measurement response. We shall show that including error rates changes the information-fusion rule in a measurement-dependent way. Simple linear superposition no longer holds, resulting in non-monotonicity that leads to these counterintuitive effects.

This paper is constructed as follows. First, we introduce an example of a hypothetical archaeological find: a clay pot from the Roman era. We consider multiple confirmatory measurements that decide whether the pot was made in Britain or Italy. Via a Bayesian analysis, we then show that, due to failure states, our confidence in the pot's origin does not improve for large numbers of confirmatory measurements. We begin with this example of the pot due to its simplicity, and because it captures the essential features of the problem in a clear manner.

Second, we build on this initial analysis and extend it to the problem of the police identity parade, showing that our confidence that a perpetrator has been identified surprisingly declines as the number of unanimous witnesses becomes large. We use this mathematical framework to revisit a specific point of ancient Jewish law—we show that it does indeed have a sound basis, even though it grossly challenges our naïve expectation.

Third, we finish with a final example to show that our analysis has broader implications and can be applied to electronic systems of interest to engineers. We chose the example of a cryptographic system, and show that a surprisingly small bit error rate can result in a larger-than-expected reduction in security.

Our analyses, ranging from cryptography to criminology, provide examples of how rare failure modes can have a counterintuitive effect on the achievable level of confidence.

A HYPOTHETICAL ROMAN POT

Let us begin with a simple scenario: the identification of the origin of a clay pot that has been dug from British soil. Its design identifies it as being from the Roman era, and all that remains is to determine whether it was made in Roman-occupied Britain or whether it was brought from Italy by travelling merchants. Suppose that we are fortunate and that a test is available to distinguish between the clay from the two regions; clay from one area—let us suppose that it is Britain—contains a trace element which can be detected by laboratory tests with an error rate p_e = 0.3. This is clearly excessive, and so we run the test several times. After k tests have been made on the pot, the number of errors will be binomially distributed, E ~ Bin(k, p_e). If the two origins, Britain and Italy, are a priori equally likely, then the most probable origin is the one suggested by the greatest number of samples.

Now imagine that several manufacturers of pottery deliberately introduced large quantities of this element during their production process, and that it will therefore be detected with 90% probability in their pots, which make up p_c = 1% of those found; of these, half are of British origin. We call p_c the contamination rate. This is the hidden failure state to which we alluded in the introduction. Then, after the pot tests positive several times, we will become increasingly certain that it was manufactured in Britain. However, as more and more test results are returned from the laboratory, all positive, it will become more and more likely that the pot was manufactured with this unusual process, eventually causing the probability of British origin, given the evidence, to fall to 50%. This is the essential paradox of the system with hidden failure states—overwhelming evidence can itself be evidence of uncertainty, and thus be less convincing than more ambiguous data.

Formal model

Let us now proceed to formalise the problem above. Suppose we have two hypotheses, H_0 and H_1, and a series of measurements X = (X_1, X_2, ..., X_n). We define a variable F ∈ N that determines the underlying measurement distribution, p_{X|F,H_i}(x). We may then use Bayes' law to find

    P[H_i | X] = P[X | H_i] P[H_i] / P[X],   (1)

which can be expanded by conditioning with respect to F, yielding

    P[H_i | X] = ( Σ_f P[X | H_i, F=f] P[H_i, F=f] ) / ( Σ_{f,H_k} P[X | H_k, F=f] P[H_k, F=f] ).   (2)

In our examples there are a number of simplifying conditions—there are only two hypotheses and two measurement distributions—reducing Eqn. 2 to

    P[H_i | X] = [ 1 + ( Σ_{f=0}^{1} P[X | H_{1-i}, F=f] P[H_{1-i}, F=f] ) / ( Σ_{f=0}^{1} P[X | H_i, F=f] P[H_i, F=f] ) ]^{-1}.   (3)

Computation of these a posteriori probabilities thus requires knowledge of two distributions: the measurement distributions P[X | H_k, F], and the state probabilities P[H_i, F]. Having tabulated these, we may substitute them into Eqn. 3, yielding the a posteriori probability for each hypothesis. In this paper the measurement distributions P[X | H_i, F=f] are all binomial; however, this is not the case in general.

Analysis of the pot origin distribution

In the case of the pot, the hypotheses and measurement distributions—the origin and contamination, respectively—are shown in Table I.

TABLE I. The model parameters for the case of the pot, for use in Eqn. 3 with a contamination rate p_c = 10^-2. The a priori distribution of the origin is identically 50% for both Britain and Italy, whether or not the pot's manufacturing process has contaminated the results. As a result, the two columns of P[F, H_i] are identical. The columns of the measurement distribution, shown right, differ from one another, thereby giving the test discriminatory power. When the pot has been contaminated, the probability of a positive result is identical for both samples, rendering the test ineffective. We see therefore that even small contamination rates can have a large effect on the global behaviour of the testing methodology.

    P[F, H_i]                      Italy (H_0)          Britain (H_1)
    Contaminated (F = 0)           ½ p_c = 0.005        ½ p_c = 0.005
    Uncontaminated (F = 1)         ½ (1-p_c) = 0.495    ½ (1-p_c) = 0.495

    P[Positive result | F, H_i]    Italy (H_0)          Britain (H_1)
    Contaminated (F = 0)           0.9                  0.9
    Uncontaminated (F = 1)         p_e = 0.3            1-p_e = 0.7

Each measurement is Bernoulli-distributed, and the number of positive results is therefore described by a binomial distribution, with the probability mass function

    P[X = x] = (n choose x) p^x (1-p)^{n-x}

after n trials, the probability p being taken from the measurement-distribution section of Table I.

Substituting these probability masses into Eqn. 2, we see in Figure 1 that as more and more tests return positive results, we become increasingly certain of the pot's British heritage, but an unreasonably large number of positive results will indicate contamination and so yield a reduced level of certainty.

FIG. 1. Probability that the pot is of British origin given a number of tests, all coming back positive, for a variety of contamination rates p_c and a 30% error rate. In the case of the pot above, with p_c = 10^-2, we see a peak at n = 5, after which the level of certainty falls back to 0.5 as it becomes more likely that the pot originates at a contaminating factory. When p_c = 0, this is the standard Bayesian analysis where failure states are not considered.

It is worth taking a moment, however, to briefly discuss the effects of weakening certain conditions; in particular, we consider two cases: that where the rate of contamination depends upon the origin of the pot, and that where the results after contamination are also origin-dependent.

Where the rate of contamination depends upon the origin, evidence of contamination provides some small evidence of where the pot came from. Thus if 80% of contaminated pots are of British origin, then Figure 1 will eventually converge to 0.8 rather than 0.5.

If the probability of a positive test is dependent upon the origin even when contaminated, then the behaviour of the test protocol changes qualitatively.
Supposing that in the contaminated case the test is substantially less accurate, the probability of British origin will at first drop towards 0.5 due to the increased likelihood of contamination; eventually, however, it will rise again towards 1.0 as sufficient data becomes available to make use of the less probative experiment that we now know to be taking place.

THE RELIABILITY OF IDENTITY PARADES

We initially described the scenario of a court case, in which witness after witness testifies to having seen the defendant commit the crime of which he is accused. But in-court identifications are considered unreliable, and in reality, if identity is in dispute, then the identification is made early in the investigation under controlled conditions [16]. At some point, whether before or after being charged, the suspect has most likely been shown to each witness amongst a number of others, known as fillers, who are not under suspicion. Each witness is asked to identify the true perpetrator, if present, amongst the group.

This process, known as an identity parade or line-up, is an experiment intended to determine whether the suspect is in fact the same person as the perpetrator. It may be performed only once, or repeated many times with many witnesses. As human memory is inherently uncertain, the process will include random error; if the experiment is not properly carried out then there may also be systematic error, and this is the problem that concerns us in this paper.

Having seen how a unanimity of evidence can create uncertainty in the case of the unidentified pot, we now apply the same analysis to the case of an identity parade. If the perpetrator is not present—that is to say, if the suspect is innocent—then in an unbiased parade the witness should be unable to choose the suspect with a probability greater than chance. Ideally, they would decline to make a selection; however, this does not always occur in practice [16, 17], and forms part of the random error of the procedure. If the parade is biased—whether intentionally or unintentionally—for example because (i) the suspect is somehow conspicuous [18], (ii) the staff running the parade direct the witness towards him, (iii) by chance he happens to resemble the perpetrator more closely than the fillers, or (iv) the witness holds a bias, perhaps because they have previously seen the suspect [16]—then an innocent suspect may be selected with a probability greater than chance. This is the hidden failure state that underlies this example; we assume in our analysis that this is completely binary: either the parade is completely unbiased, or it is highly biased against the suspect.

In recent decades, a number of experiments [17, 19] have been carried out in order to establish the reliability of this process. Test subjects are shown the commission of a simulated crime, whether in person or on video, and asked to locate the perpetrator amongst a number of people. In some cases the perpetrator will be present, and in others not. The former allows estimation of the false-negative rate of the process—the rate at which the witness fails to identify the perpetrator when present—and the latter the false-positive rate—the rate at which an innocent suspect will be mistakenly identified. Let us denote by p_fn the false-negative rate; this is equal to the proportion of subjects who failed to correctly identify the perpetrator when he was present, and was found in [17] to be 48%.

Estimating the false-positive rate is complicated by the fact that only one suspect is present in the line-up: when the suspect is innocent, an eyewitness who incorrectly identifies a filler as being the perpetrator has correctly rejected the innocent suspect as being the perpetrator, despite their error. For the purposes of our analysis, we assume that the witness selects at random in this case, and therefore divide the 80% perpetrator-absent selection rate of [17] by the number of participants L = 6, yielding a false-positive rate of p_fp = 0.133.

Let us now suppose that there is a small probability p_c that the line-up is conducted incorrectly—for example, volunteers have been chosen who fail to adequately match the description of the perpetrator—leading to identification of the suspect 90% of the time, irrespective of his guilt. For the sake of analysis we assume that if this occurs, it will occur for all witnesses, though in practice the police might perform the procedure correctly for some witnesses and not others. The probability of the suspect being identified in each of the cases is shown in Table II.

TABLE II. The model parameters for the hypothetical identity parade. In a similar fashion to the first example, we assume a priori a 50% probability of guilt. In this case, the measurement distributions are substantially asymmetric with respect to innocence and guilt, unlike Table I.

    P[F, H_i]                      Innocent (H_0)       Guilty (H_1)
    Biased parade (F = 0)          ½ p_c = 0.005        ½ p_c = 0.005
    Unbiased parade (F = 1)        ½ (1-p_c) = 0.495    ½ (1-p_c) = 0.495

    P[Identification | F, H_i]     Innocent (H_0)       Guilty (H_1)
    Biased parade (F = 0)          0.9                  0.9
    Unbiased parade (F = 1)        p_fp = 0.13          1-p_fn = 0.52

If we assume a 50% prior probability of guilt, and independent witnesses, the problem is now identical to that of identifying the pot. The probability of guilt, given the unanimous parade results, is shown in Figure 2 as a function of the number of unanimous witnesses.
FIG. 2. Probability of guilt given varying numbers of unanimous line-up identifications, assuming a 50% prior probability of guilt and identification accuracies given by [17]. Of note is that, for the case plotted here where the witnesses are unanimous, with a failure rate p_c = 0.01 it is impossible to reach 95% certainty in the guilt of the suspect, no matter how many witnesses have been found.

We see that after a certain number of unanimously positive identifications the probability of guilt diminishes. Even with only one in ten thousand line-ups exhibiting this bias towards the suspect, the peak probability of guilt is reached with only five unanimous witnesses, completely counter to intuition—in fact, with this rate of failure, ten identifications in agreement provide less evidence of guilt than three. We see also that even with a 50% prior probability of guilt, a 1% failure rate renders it impossible to achieve 95% certainty if the witnesses are unanimous.

This tendency to be biased towards a particular member of the line-up when an error occurs had been noted [16, paragraph 4.31] prior to the more rigorous research stimulated by the advent of DNA testing, leading us to suspect that our sub-1% contamination rates are probably overly optimistic.

ANCIENT JUDICIAL PROCEDURE

The acknowledgement of this type of phenomenon is not entirely new; indeed, the adage "too good to be true" dates to the sixteenth century [20, good, P5.b]. Moreover, its influence on judicial procedure was visible in Jewish law even in the classical era; until the Romans ultimately removed the right of the Sanhedrin to confer death sentences, a defendant unanimously condemned by the judges would be acquitted [13, Sanhedrin 18b], the Talmud stating "If the Sanhedrin unanimously find guilty, he is acquitted. Why? — Because we have learned by tradition that sentence must be postponed till the morrow in hope of finding new points in favour of the defence".

The value of this rule becomes apparent when we consider that the Sanhedrin was composed, for ordinary capital offenses, of 23 members [13, Sanhedrin 2a]. In our line-up model, this many unanimous witnesses would indicate a probability of guilt scarcely better than chance, suggesting that the inclusion of this rule should have a substantial effect.

We show the model parameters for the Sanhedrin decision in Table III, which we use to compute the probability of guilt in Figure 3 for various numbers of judges condemning the defendant. We see that the probability of guilt falls as the judges approach unanimity; however, excluding unanimous decisions substantially reduces the probability of false conviction.

TABLE III. The model parameters for the Sanhedrin trial. Again, we assume an a priori 50% probability of guilt. However, the measurement distributions are the results of [21, model (2)] for juries; in contrast to the case of the identity parade, the false-negative rate is far lower. Despite the trial being conducted by judges, we choose to use the jury results, as the judges' tendency towards conviction is not reflected in the highly risk-averse rabbinic legal tradition.

    P[F, H_i]                        Innocent (H_0)       Guilty (H_1)
    Biased proceedings (F = 0)       ½ p_c = 0.005        ½ p_c = 0.005
    Unbiased proceedings (F = 1)     ½ (1-p_c) = 0.495    ½ (1-p_c) = 0.495

    P[Condemnation | F, H_i]         Innocent (H_0)       Guilty (H_1)
    Biased proceedings (F = 0)       0.95                 0.95
    Unbiased proceedings (F = 1)     0.14                 0.75

FIG. 3. Probability of guilt as a function of the number of judges in agreement out of 23—the number used by the Sanhedrin for most capital crimes—for various contamination rates p_c. We assume as before that half of defendants are guilty, and use the estimated false-positive and false-negative rates of juries from [21, model (2)], 0.14 and 0.25 respectively. We arbitrarily assume that a 'contaminated' trial will result in a positive vote 95% of the time. The panel of judges numbers 23, with conviction requiring a majority of two and at least one dissenting opinion [13, Sanhedrin]; the majority of two means that the agreement of at least 13 judges is required in order to cast a sentence of death, to a maximum of 22 votes in order to satisfy the requirement of a dissenting opinion. These necessary conditions for a conviction by the Sanhedrin are shown as the pink region in the graph.

It is worth stressing that the exact shapes of the curves in Figure 3 are unlikely to be entirely correct; communication between the judges will prevent their verdicts from being entirely independent, and false-positive and false-negative rates will be very much dependent upon the evidentiary standard required to bring charges, the strength of the contamination when it does occur, and the accepted burden of proof of the day. However, it is nonetheless of qualitative interest that, with reasonable parameters, this ancient law can be shown to have a sound statistical basis.

THE RELIABILITY OF CRYPTOGRAPHIC SYSTEMS

We now consider a different example, drawn from cryptography. An important operation in many protocols is the generation and verification of prime numbers; the security of some protocols depends upon the primality of a number that may be chosen by an adversary. In this case, one may test whether it is prime, whether by brute force or by using another test such as the Rabin-Miller [22, p. 176] test. As the latter is probabilistic, we repeat it until we have achieved the desired level of security—in [22], a probability of 2^-128 of accepting a composite as prime is considered acceptable. However, a naïve implementation cannot achieve this level of security, as we will demonstrate.

The reason is that, despite it being proven that each iteration of the Rabin-Miller test will reject a composite number with probability at least 0.75, a real computer may fail at any time. The chance of this occurring is small; however, it turns out that the probability of a stray cosmic ray flipping a bit in the machine code, causing the test to accept composite numbers, is substantially greater than 2^-128.

Code changes caused by memory errors

Data provided by Google [23] suggests that a given memory module has approximately an 8% probability of suffering an error in any given year, independent of capacity. Assuming a 4 GB module, this results in approximately a λ = 10^-19 probability that any given bit will be flipped in any given second. We will make the assumption that, in the machine code for the primality-testing routine, there exists at least one bit that, if flipped, will cause all composite numbers—or some class of composite numbers known to the adversary—to be accepted as prime. As an example of how this could happen, consider the function shown in Figure 4, which implements a brute-force factoring test.

    #include <math.h> /* sqrt */

    /* Returns 1 if a factor is found (composite), 0 otherwise (prime). */
    int trialdivision(long to_test)
    {
        long i;
        long threshold;

        if(to_test % 2 == 0)
        {
            return 1;
        }

        threshold = (long)sqrt(to_test);
        for(i = 3; i <= threshold; i += 2)
        {
            if(to_test % i == 0)
            {
                return 1;
            }
        }

        return 0;
    }

FIG. 4. A function that tests for primality by attempting to factorise its input by brute force.

Assuming that the input is odd, the function will reach one of two return statements, returning zero or one. The C compiler GCC compiles these two return statements to

    45 0053 B8010000   movl $1, %eax
    45 00
    46 0058 EB14       jmp  .L3

and

    56 0069 B8000000   movl $0, %eax
    56 00

respectively. That is to say, it stores the return value as an immediate into the EAX register and then jumps to the cleanup section of the function, labelled .L3. The store instructions on lines 45 and 56 have machine-code values B801000000 and B800000000 for return values of one and zero respectively. These differ by only one bit, and therefore can be transformed into one another by a single bit-error. If the first instruction is turned into the second, this will cause the function to return zero for any odd input, thus always indicating that the input is prime.

TABLE IV. Model parameters for the Rabin-Miller test on random 2000-bit numbers, where p_f denotes the probability that the fault has occurred. However, we have no choice but to assume the lower bound on the composite-number rejection rate, and so this model is inappropriate. Furthermore, in an adversarial setting the attacker may intentionally choose a difficult-to-detect composite number, rendering the prior distribution optimistic.

    P[F, H_i]                    Prime (H_0)      Composite (H_1)
    Fault (F = 0)                ≈ 0.001 p_f      ≈ 0.999 p_f
    No fault (F = 1)             ≈ 0.001          ≈ 0.999

    P[Acceptance | F, H_i]       Prime (H_0)      Composite (H_1)
    Fault (F = 0)                1.0              1.0
    No fault (F = 1)             1.0              0.25

FIG. 5. The acceptance rate as a function of time in memory and the number of Rabin-Miller iterations under the single-error fault model described in this paper. An acceptance rate of 2^-128 is normally chosen; however, without error correction this cannot be achieved.
The false-acceptance rate after k iterations is given by p_fa[k] = 4^-k (1 - p_f) + p_f, where p_f is the probability that a fault has occurred that causes a false acceptance 100% of the time; we estimate p_f to be equal to 10^-19 T, where T is the length of time in seconds that the code has been in memory.

The effect of memory errors on confidence

At cryptographically interesting sizes—on the order of 2^2000—roughly one in a thousand numbers is prime [22, p. 173]. We might calculate the model parameters as before—for interest's sake, we have done so in Table IV—and calculate the confidence in a number's primality after a given number of tests. However, this is not particularly useful, for two reasons: first, the rejection probability of 75% is a lower bound, and for randomly chosen numbers is a substantial underestimate; second, we do not always choose numbers at random, but rather may need to test those provided by an adversary. In this case, we must assume that they have tried to deceive us by providing a composite number, and would instead like to know the probability that they will be successful. The Bayesian estimator in this case would provide only a tautology of the type: 'given the data and the fact that the number is composite, the number is composite'.

Let us suppose that the machine containing the code is rebooted every month, and that the Rabin-Miller code remains in memory for the duration of this period; then, neglecting other potential errors that could affect the test, at the time of the reboot the probability that the bit has flipped is p_f = 2.6×10^-13; this event we denote A_F. Let k be the number of iterations performed; the probability of accepting a composite number is at most 4^-k, and we assume that the adversary has chosen a composite number such that this is the true probability of acceptance. We denote by A_R the event that the number is accepted by the correctly-operating algorithm.

When hardware errors are taken into account, the probability of accepting a composite number is no longer 4^-k, but

    p_fa = P[A_F ∪ A_R]                       (4)
         = P[A_F] + P[A_R] - P[A_F, A_R].     (5)

Since A_F and A_R are independent,

    p_fa = P[A_F] + P[A_R] - P[A_F] P[A_R]    (6)
         = 4^-k (1 - p_f) + p_f               (7)
         ≥ p_f.                               (8)

No matter how many iterations k of the algorithm are performed, this is substantially greater than the 2^-128 security level that is predicted by probabilistic analysis of the algorithm alone, thus demonstrating that algorithmic analyses that do not take into account the reliability of the underlying hardware can be highly optimistic. The false-acceptance rate as a function of the number of test iterations and time in memory is shown in Figure 5.

A real cryptographic system will include many such checks in order to make sure that an attacker has not chosen weak values for various parameters, and a failure of any of these may result in the system being broken, so our calculations are somewhat optimistic.

Memory equipped with an error-correcting code (ECC) will substantially reduce the risk of this type of fault, and for regularly-accessed regions of code—those touched multiple times per second—will approach the 2^-128 level. A single parity bit, as used in at least some CPU level-one instruction caches [24], requires two bit-flips to induce an error. Suppose the parity is checked every R seconds; then the probability of an undetected bit-flip in any given second is

    λ' = (λR)² / R = λ²R.   (9)

For code that is accessed even moderately often, this will come much closer to 2^-128. For example, if R = 100 ms then this results in a false-acceptance rate of 2^-108 after one month, much closer to the 2^-128 level of security promised by analysis of the algorithm alone. The stronger error-correction codes used by the higher-level caches and main memory will detect virtually all such errors—with two-bit detection capability, the rate of undetected bit-flips will be at most

    λ' = λ³R²,   (10)

and even with a check rate of only once per 100 ms, the rate of memory errors is essentially zero, increasing the false-acceptance rate by a factor of only 10^-14 above the 2^-128 level that would be achieved in a perfect operating environment.

DISCUSSION

This phenomenon is interesting in that it is commonly known and applied heuristically, and trivial examples such as the estimation of coin bias [25, section 2.1] have been well analysed—see the appendix for a brief discussion—but these rare failure states are rarely, if ever, considered when a statistical approach to decision-making is applied to an entire system. Real systems that attempt to counter failure modes producing consistent data tend to focus upon the detection of particular failures rather than the mere fact of consistency. Sometimes there is little choice—a casino that consistently ejected gamblers on a winning streak would soon find itself without a clientele—however, we have demonstrated that in many cases the level of consistency needed to cast doubt on the validity of the data is surprisingly low.

Applied to the ancient judicial panel, our analysis indicates that the requirement of a dissenting opinion would have provided a substantial increase in the probability of guilt required to secure a conviction.

Applied to cryptographic systems, we see that even the minuscule probability that one particular bit in the system's machine code will be flipped due to a memory error over the course of a month, rendering the system insecure, is approximately 2^80 times larger than the risk predicted by algorithmic analysis. This demonstrates the importance of strong error correction in modern cryptographic systems that strive for a failure rate on the order of 2^-128, a level of certainty that appears to be otherwise unachievable without active mitigation of the effect.

The use of naturally-occurring memory errors for DNS hijacking [26] has previously been demonstrated, as has the ability of a user to disturb protected addresses by writing to adjacent cells [27]; however, little consideration has been given to the possibility that this type of fault might occur simply by chance, implying that security analyses which assume reliable hardware are substantially flawed when applied to consumer systems lacking error-corrected memory.

We have considered only a relatively simple case, in which there are only two levels of contamination. However, in practical situations we might expect any of a wide range of failure modes varying continuously. We have described a simple case in the appendix, where a coin may be biased—or not—towards either heads or tails with any strength; were one to apply this to the case of an identity parade, for example, one would find a probability that the suspect is indeed the perpetrator, as before, but taking into account that there may well be slight biases that nudge the witnesses towards or away from the suspect, not merely catastrophic ones. The result is heavily dependent upon the distribution of the bias, and, lacking sufficient data to produce such a model, we have chosen to eschew the complexity of the continuous approach and focus on a simple two-level model. An example of this approach is shown in [28, p. 117].

A related concept to that which we have discussed is the Duhem-Quine hypothesis [28, p. 6]; this is the idea that an experiment inherently tests hypotheses as
a group—not merely the phenomenon that we wish to Ifthisisso,thenwemustreconsidertheuseofthresh- examine, but also the correct function of the experimen- olding as a decision mechanism when there is the poten- talapparatusforexample,andthatthatonlythedesired tialforsuchfailuremodestoexist,particularlywhenthe independent variables are being changed. Our thesis is a consequencesofanincorrectdecisionarelarge. Whenthe related one, namely that in practical systems the failure decision rule takes the form of a probability threshold, it of these auxiliary hypotheses, though unlikely, result in is necessary to deduce an upper threshold as well, such a significant reduction in confidence when it occurs, an aswasshowninFigure3,inordertoavoidcapturingthe effect which has traditionally been ignored. region indicative of a systemic failure. That this phenomenon was accounted for in ancient Jewish legal practice indicates a surprising level of in- CONCLUSION tuitive statistical sophistication in this ancient law code; thoughpredatingbymillenniathestatisticaltoolsneeded We have analysed the behaviour of systems that are to perform a rigorous analysis, our simple model of the subjecttosystematicfailure,anddemonstratedthatwith 9 relatively low failure rates, large sample sizes are not re- They use Bayes’ law in its proportional form, quired in order that unanimous results start to become P[Q|{data}]∝P[{data}|Q]P[Q], (11) indicative of systematic failure. We have investigated theeffectofthisphenomenonuponidentityparades,and where Q is the probability that a coin-toss will yield shownthatevenwithonlya1%rateoffailure,confidence heads. VariouspriordistributionsQ[H]canbechosen, a begins to decrease after only three unanimous identifica- matter that we will discuss momentarily. tions, failing to reach even 95%. 
As the coin tosses are independent, the data can We have also applied our analysis of the phenomenon be boiled down to a binomial random variable X ∼ to cryptographic systems, investigating the effect by Bin(p,n), where n is the number of coin tosses made. which confidence in the security of a parameter fails to Substituting the binomial probability mass function into increase with further testing due to potential failures of Eqn. 11, they find that the underlying hardware. Even with a minuscule failure rate of 10−13 per month, this effect dominates the anal- P[Q|X]∝QX(1−Q)n−XP[Q]. (12) ysis and is thus a significant determining factor in the overall level of security, increasing the probability that a As the number of samples n increases, this becomes in- maliciously-chosenparameterwillbeacceptedbyafactor creasinglypeakedaroundthevalueQ=X/n,this‘peak- of more than 280. ing’ effect limited by the shape of P[Q]. As the number Hidden failure states such as these reduce confidence of samples increases, the QX(1−Q)n−X part of the ex- farmorethanintuitionleadsonetobelieve,andmustbe pression eventually comes to dominate the shape of the more carefully considered than is the case today if the posteriordistributionP[Q|X],andwehavenochoicebut lofty targets that we set for ourselves are to be achieved to believe that the coin genuinely does have a bias close in practice. to X/n. Intheexamplespreviouslydiscussed,wehaveassumed that bias is very unlikely; in the coin example, this cor- COMPETING INTERESTS responds to a prior distribution P[Q] that is strongly clusteredaroundQ=0.5; inthiscase, averylargenum- ber of samples will be necessary in order to conclusively We have no competing interests. reject the hypothesis that the coin is unbiased or nearly so. 
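This slow rejection of the unbiased hypothesis is easy to reproduce numerically. The short Python sketch below is our own illustration, not taken from [25]: it discretises Q onto a grid of bias levels and places 99% of the prior mass on Q = 0.5 (both the grid and the 99%/1% split are assumed figures chosen for demonstration), then applies Eqn. 12 to a coin that comes up heads 70% of the time.

```python
# Discretised bias levels Q = P[heads]: 0.01, 0.02, ..., 0.99.
qs = [i / 100 for i in range(1, 100)]
i0 = 49  # index of the unbiased coin, Q = 0.5

# Prior strongly clustered around Q = 0.5: 99% of the mass on the fair
# coin, the remaining 1% spread evenly over the other bias levels.
# (The 99%/1% split is an illustrative assumption, not a figure from the text.)
prior = [0.01 / 98] * 99
prior[i0] = 0.99

def posterior(x, n):
    """P[Q | x heads in n tosses] ~ Q^x (1-Q)^(n-x) P[Q], as in Eqn. 12."""
    post = [q**x * (1 - q)**(n - x) * p for q, p in zip(qs, prior)]
    total = sum(post)
    return [v / total for v in post]

# A coin that comes up heads 70% of the time: print the posterior
# probability that it is nevertheless fair as samples accumulate.
for n, x in ((10, 7), (100, 70), (1000, 700)):
    print(n, posterior(x, n)[i0])
```

With ten tosses the spike at Q = 0.5 still holds almost all of the posterior mass despite the 70% heads frequency; only as the sample grows into the hundreds does the likelihood term overwhelm the prior and the fair-coin hypothesis collapse, matching the behaviour described above.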
However, eventually this will occur, and the posterior distribution will change; when this occurs, the system has visibly failed—a casino using the coin will decide that they are not in fact playing the game that they had planned, and must cease before their loss becomes catastrophic. This is much like the case of the Sanhedrin—if too many judges agree, the system has failed, and should not be considered reliable.

AUTHOR CONTRIBUTIONS

LJG drafted the manuscript. LJG and DA devised the concept. LJG, FC-B, MDM, BRD, AA, and DA carried out analyses and checking. All authors contributed to proofing the manuscript. All authors gave final approval for publication.

FUNDING

Lachlan J. Gunn is a Visiting Scholar at the University of Angers, France, supported by an Endeavour Research Fellowship from the Australian Government. Mark D. McDonnell is supported by an Australian Research Fellowship (DP1093425) and Derek Abbott is supported by a Future Fellowship (FT120100351), both from the Australian Research Council (ARC).

∗ [email protected]
† [email protected]
‡ [email protected]
§ [email protected]
¶ [email protected]
∗∗ [email protected]
†† School of Electrical and Electronic Engineering, The University of Adelaide 5005, Adelaide, Australia

[1] R. Benzi, A. Sutera, and A. Vulpiani, "The mechanism of stochastic resonance," Journal of Physics A: Mathematical and General, vol. 14, no. 11, pp. L453–L457, 1981.
[2] M. D. McDonnell, N. G. Stocks, C. E. M. Pearce, and D. Abbott, Stochastic Resonance: From Suprathreshold Stochastic Resonance to Stochastic Signal Quantization. Cambridge University Press, 2008.
[3] M. D. McDonnell and D. Abbott, "What is stochastic resonance? Definitions, misconceptions, debates, and its relevance to biology," PLoS Computational Biology, 2009, art. e1000348.
[4] G. P. Harmer and D. Abbott, "Game theory: losing strategies can win by Parrondo's paradox," Nature, vol. 402, p. 864, 1999.
[5] D. Abbott, "Asymmetry and disorder: a decade of Parrondo's paradox," Fluctuation and Noise Letters, vol. 9, no. 1, pp. 129–156, 2010.
[6] M. N. Mead, "Columbia program digs deeper into arsenic dilemma," Environ. Health Perspect., vol. 113, no. 6, pp. A374–A377, 2005.
[7] D. Braess, "Über ein Paradoxon aus der Verkehrsplanung," Unternehmensforschung, vol. 12, pp. 258–268, 1969.
[8] H. Kameda, E. Altman, T. Kozawa, and Y. Hosokawa, "Braess-like paradoxes in distributed computer systems," IEEE Transactions on Automatic Control, vol. 45, no. 9, pp. 1687–1691, 2000.
[9] Y. A. Korilis, A. A. Lazar, and A. Orda, "Avoiding the Braess paradox in noncooperative networks," Journal of Applied Probability, vol. 36, no. 1, pp. 211–222, 1999.
[10] A. P. Flitney and D. Abbott, "Quantum two- and three-person duels," J. Opt. B: Quantum Semiclass. Opt., vol. 6, no. 8, pp. S860–S866, 2004.
[11] G. P. Harmer, D. Abbott, P. G. Taylor, and J. M. R. Parrondo, "Brownian ratchets and Parrondo's games," Chaos, vol. 11, no. 3, pp. 705–714, 2001.
[12] S. N. Ethier and J. Lee, "Parrondo's paradox via redistribution of wealth," Electron. J. Probab., vol. 17, 2012, art. 20.
[13] I. Epstein, Ed., The Babylonian Talmud. Soncino Press, London, 1961.
[14] R. A. Fisher, "Has Mendel's work been rediscovered?" Annals of Science, vol. 1, no. 2, pp. 115–137, 1936.
[15] A. Franklin, Ending the Mendel-Fisher Controversy. University of Pittsburgh Press, 2008.
[16] P. Devlin, C. Freeman, J. Hutchinson, and P. Knights, Report to the Secretary of State for the Home Department of the Departmental Committee on Evidence of Identification in Criminal Cases. HMSO, London, 1976.
[17] R. A. Foster, T. M. Libkuman, J. W. Schooler, and E. F. Loftus, "Consequentiality and eyewitness person identification," Applied Cognitive Psychology, vol. 8, no. 2, pp. 107–121, 1994.
[18] M. S. Wogalter and D. B. Marwitz, "Suggestiveness in photospread lineups: similarity induces distinctiveness," Applied Cognitive Psychology, vol. 6, no. 5, pp. 443–452, 1992.
[19] R. S. Malpass and P. G. Devine, "Eyewitness identification: lineup instructions and the absence of the offender," Journal of Applied Psychology, vol. 66, no. 4, pp. 482–489, 1981.
[20] Oxford English Dictionary. Oxford University Press, http://oed.com/, accessed 2015-10-06.
[21] B. D. Spencer, "Estimating the accuracy of jury verdicts," Journal of Empirical Legal Studies, vol. 4, no. 2, pp. 305–329, 2007.
[22] N. Ferguson, B. Schneier, and T. Kohno, Cryptography Engineering: Design Principles and Practical Applications. Wiley, Indianapolis, USA, 2010.
[23] B. Schroeder, E. Pinheiro, and W.-D. Weber, "DRAM errors in the wild: a large-scale field study," in Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '09, Seattle, WA, USA, 2009, pp. 193–204.
[24] Advanced Micro Devices, "AMD Opteron processor product datasheet," http://support.amd.com/TechDocs/23932.pdf, 2007, accessed 2015-10-12.
[25] D. S. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial. Oxford University Press, 2006.
[26] A. Dinaberg, "Bitsquatting: DNS hijacking without exploitation," in Proceedings of BlackHat Security, Las Vegas, USA, 2011.
[27] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and O. Mutlu, "Flipping bits in memory without accessing them: an experimental study of DRAM disturbance errors," in Proc. IEEE 41st Annual International Symposium on Computer Architecture, Minneapolis, MN, USA, 2014, pp. 361–372.
[28] L. Bovens and S. Hartmann, Bayesian Epistemology. Oxford, 2004.
