What probability is involved in "Beyond Reasonable Doubt" in criminal trials?

Tony Gardner-Medwin, University College London

This web page began in March 2003. Some key ideas (and some new ones) are best approached by reading a 2005 article, "What probability should a jury address?", in the Royal Statistical Society's journal Significance 2:9-12 (2005), made available here by permission (click here or on the title). Other articles in the same issue of Significance dealing with Stats and the Law are available via links from the publisher. The first part of this website also provides a brief and hopefully clear introduction, but the later notes are rather technical in some respects.

A more recent and wide-ranging article of mine about uncertainty, "Reasonable Doubt: Uncertainties in Education, Science and Law" (2011), is published in the British Academy volume "Evidence, Inference and Enquiry", edited by P. Dawid, W. Twining and M. Vasilaki (2011). A pdf of the final ms is available here. For a reprint, please email me (a.gardner-medwin@ucl.ac.uk).

There is an opportunity to discuss the issues raised on a forum here.

Juries are obviously concerned with weighing evidence and judging probabilities (albeit not necessarily in a very quantitative way), given the uncertainties in a case. This kind of decision is the domain of Bayesian decision theory, and in at least some situations it is clearly appropriate to apply rigorous principles of such theory to help arrive at conclusions and clarify the issues, especially where they can be formulated quantitatively.

Some statisticians (see e.g. Refs.1,2 below) have indicated that the task in such situations is to estimate the probability that the defendant is guilty, given the evidence presented (denoted P(G|E) ). Only if this probability is greater than some high (albeit debatable) figure, perhaps 99%, should a jury convict. The legal presumption of innocence (and the corresponding requirement that guilt must be proved 'beyond reasonable doubt') would then be reflected in this high threshold, which ensures that defendants will be acquitted unless the jury considers that the probability that they are innocent is extremely low.  I am not concerned with a debate about the level at which this threshold should be set, but about whether the probabilities of guilt or innocence are indeed the ones that should be addressed. Neither of the cited authors presents a case for asserting this, and indeed in correspondence Dawid suggests that it is simply  'natural and obvious'.

I want to argue that this is wrong.  I certainly grant that it does seem at first sight natural, but I do not think it is either what is best expressed by the legal phrase 'beyond reasonable doubt' or what is socially and politically desirable, for reasons I shall discuss.  I believe it is the probability that such evidence could arise without guilt that is critical. This, like the probability of guilt, is a Bayesian probability: it is the degree of belief that the facts could have arisen consistent with innocence (i.e. broadly, that the defence account is plausible). If this proposition is capable of reasonable belief, above some low threshold probability, then the jury should acquit - even if it may appear much more likely that the evidence arose through guilt. The difference may at first sight seem abstruse, but it has profound consequences for the issues that should be taken into account in a trial, for the perceived fairness of the system, and in some cases for the outcome.

To see the difference, consider a situation that jury members must commonly encounter. They may believe that the defendant is probably guilty, yet find the defence case both plausible and completely consistent with the evidence, so they must acquit. A simple instance would be a caricature related to the Sally Clark case, in which a death was either due to SIDS (Sudden Infant Death Syndrome) or murder. Suppose there is a city in which 1 in 100 female children are suffocated by a parent preferring a male child, while 1 in 10,000 die from SIDS.  On the evidence of a mysterious female cot death alone, a jury would conclude there was a 99% probability that this case is murder, yet SIDS is entirely plausible (some SIDS cases are certainly expected to occur in a large population), and is wholly consistent with the evidence. If you ask the question 'Is doubt (i.e. the hypothesis that SIDS has occurred) reasonable?'  the answer is surely yes. Such cases are bound to occur and, when they do, the evidence (an unexplained death) is what results. Acquittal is surely indicated. The high incidence of infanticide in the population (which is what leads to the conclusion that murder is much more likely than SIDS) should have no bearing on the judgement about conviction or acquittal and should not (and probably would not) be admissible evidence in court.
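The arithmetic behind the 99% figure can be checked with a minimal sketch. It assumes (as the caricature implies, though it is not stated formally) that murder and SIDS are the only two explanations of an unexplained cot death and that each explains the death equally well; the variable names are illustrative only.

```python
# Incidence figures taken from the caricature city in the text (not real data).
p_murder = 1 / 100      # female infants suffocated by a parent
p_sids = 1 / 10_000     # infants dying from SIDS

# Assumption: both hypotheses make an unexplained death equally likely,
# so the posterior is driven entirely by the relative incidence.
p_murder_given_death = p_murder / (p_murder + p_sids)
p_sids_given_death = 1 - p_murder_given_death

print(f"P(murder | unexplained death) = {p_murder_given_death:.4f}")
print(f"P(SIDS   | unexplained death) = {p_sids_given_death:.4f}")
```

The posterior for murder is 100/101, about 99%, yet nothing in the calculation refers to this particular defendant: it rests wholly on how common the crime is.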

One can argue in such a situation (as Dawid implicitly does) that the decision should not be made in the way described, but according to the principle of maximising expected utility. On this basis one would take into account the benefits to society of convicting murderers and the costs of acquitting them, as well as the benefits and costs of acquitting and convicting innocent people.  To make an optimal decision with such an objective, one does need to arrive at the Bayesian posterior probability P(G|E). The relative costs and benefits could then justify conviction in a case like that described, because the benefits of locking up murderers might be considered to outweigh the costs of convicting innocent people. Maximising expected utility is the basis of many rational decisions (including the majority of medical decisions based on uncertain evidence). But it violates what I understand to be the principle of presumption of innocence. It is not what I understand by proof of guilt 'beyond reasonable doubt'; and it requires that evidence be introduced in court (for example, about the incidence of a crime) that would not, I believe, be considered admissible in the UK. It would be a totalitarian solution. Conviction, given equivalent weight of evidence against you, would be more likely if your accused crime was common and/or serious. Conviction could occur despite a defence case that is totally coherent and consistent with the known facts. White persons charged with crime in areas where crime is mostly carried out by blacks would be more likely to be acquitted than black persons ***, despite identical evidence against each.  Faith in the processes of law would clearly evaporate.
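As a hedged illustration of the expected-utility rule described above, the sketch below scores correct verdicts as zero and assigns hypothetical costs to the two kinds of error. The costs (99 and 1) and the function names are assumptions made for illustration; they are chosen only so that the resulting threshold matches the 99% figure mentioned earlier.

```python
def conviction_threshold(cost_false_conviction, cost_false_acquittal):
    """P(G|E) above which convicting has the lower expected loss.

    Assumes correct verdicts cost nothing; both costs are hypothetical.
    """
    return cost_false_conviction / (cost_false_conviction + cost_false_acquittal)


def convict_by_expected_utility(p_guilt, cost_false_conviction, cost_false_acquittal):
    """Return True if convicting minimises expected loss at this P(G|E)."""
    expected_loss_convict = (1 - p_guilt) * cost_false_conviction
    expected_loss_acquit = p_guilt * cost_false_acquittal
    return expected_loss_convict < expected_loss_acquit


threshold = conviction_threshold(99, 1)                    # 0.99
# Caricature city from the text: P(G|E) = 100/101, just above the threshold.
city_verdict = convict_by_expected_utility(100 / 101, 99, 1)
# A slightly lower posterior falls below the threshold.
weaker_verdict = convict_by_expected_utility(0.98, 99, 1)
print(threshold, city_verdict, weaker_verdict)
```

Under these assumed costs the rule convicts in the caricature city on incidence alone, which is the outcome objected to above.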

All these problems disappear (except for the cost to society of letting some guilty defendants go free for lack of evidence) if instead of trying to gauge the probability that a defendant is guilty, we gauge the probability that the evidence could have occurred, given the assumption that the defendant is innocent. Fortunately this is, I think, what courts in the UK try instinctively to do. Of course they sometimes conspicuously fail (as in the notorious Sally Clark case** and many others) through inability to understand statistical issues fully, or the lack of the kind of expert statistical testimony advocated by the Royal Statistical Society (Ref. 2) in the wake of the Sally Clark case. This case was exceptional in that there were many failings in court: timely correction of any one of these could probably have led to an early acquittal. However, its reliance in the main on very simple evidence (2 unexplained cot deaths) highlights the inappropriateness of resolving uncertainty by attempting to estimate the probability that a defendant is guilty. To do this would require (as pointed out in Refs.1,2) consideration of evidence about the incidence of infanticide in the population, so as to compare the likelihoods that the known facts were an example of murder or misfortune. Conviction on the basis that 'lots of people do indeed commit such a crime' is not what we want from a legal system, and the idea of taking such evidence into account is acceptable neither ethically nor (under English law) legally. Interestingly, this case makes possible a simple comparison with a hypothetical medical scenario in which, say, a near fatal cot crisis might be due to condition A, which is treatable through a risky operation, or condition B, which is untreatable. It would be entirely appropriate to base a decision whether to operate on evidence that included relative incidence, racial and genetic factors, etc., as well as direct evidence about the patient. There is no equivalent of a 'presumption of innocence' that applies to good health, to particular diseases, or to the right to be treated or untreated. These decisions are sensibly taken on the basis of relative probabilities and the relative risks, costs and benefits (the latter, preferably as perceived by the patient) of the various outcomes.

I have included a more formal discussion of the probabilistic strategies in a brief appendix*****. Though it is not particularly technical, those unfamiliar with probability formalisms may prefer to skip it. My thesis is that juries should consider the probability that, given all the facts and uncertainties, the defendant's case could be true. This is different from the probability that the defendant is innocent, satisfactorily analysed by Dawid. Both criteria for a decision can be equally clear, though each may be hard to quantify. My concern is not with the mathematics but with the type of question asked, and the potentially severe consequences if this is wrong. Many court cases are of course complex, so that a jury must rely on intuition to judge issues that are interdependent and essentially unquantifiable. But this makes it all the more important, I suspect, to clarify what question is being asked, and the appropriate logic.

A.R. Gardner-Medwin, Physiology, UCL, London WC1E 6BT   24/3/03, and subsequently revised
E-mail a.gardner-medwin@ucl.ac.uk

1. A.P. Dawid  (2001) Bayes theorem and weighing evidence by juries.
2. Royal Statistical Society   (2002) Letter from the President to the Lord Chancellor regarding the use of statistical evidence in court cases. (see also http://www.rss.org.uk/statsandlaw )

* 'Weight of evidence' corresponds here to what a jury person might describe as the 'strength of the case against the defendant', and in technical terms, in at least simple cases, to the sum of the log likelihood ratios given guilt and innocence, for independently relevant facts - including the absence of any observations that might have been expected on one or other hypothesis.
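A minimal sketch of this definition follows, assuming independent items of evidence. The likelihood values are invented for illustration, and the choice of base-10 logarithms is an assumption (the text does not specify a base).

```python
import math


def weight_of_evidence(likelihood_pairs):
    """Sum of log10 likelihood ratios over independent items of evidence.

    Each pair is (P(item | guilty), P(item | innocent)); values are hypothetical.
    """
    return sum(math.log10(pg / pi) for pg, pi in likelihood_pairs)


# Hypothetical items: one favours guilt, one is neutral, one favours innocence.
items = [(0.9, 0.1), (0.5, 0.5), (0.2, 0.4)]
W = weight_of_evidence(items)
print(f"W = {W:.3f}")
```

An item that is equally likely under both hypotheses contributes log(1) = 0, so it adds nothing to the weight of evidence in either direction.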

**  Sally Clark Case. In that case (see http://www.sallyclark.org.uk/ ) the notoriously small claimed incidence of double SIDS in the population - 1 in 73 million - was probably a huge underestimate, principally because it treated SIDS incidents as necessarily independent (as pointed out by the Royal Statistical Society, in a Press Release (23/10/01) and letter (23/1/02) to the Lord Chancellor, and now seemingly agreed by everyone). But in my view even a correct figure would have been quite irrelevant because the defendant came to court for the (more or less) sole reason that she had lost 2 children through either murder or SIDS. Only two hypotheses were tenable about the defendant: she was a murderer or had experienced double SIDS. On each of these two hypotheses the likelihood of the principal evidence arising (2 unexplained deaths) was unity, not some small number referred to the general population. The defendant was not selected at random from the population and then found to have lost 2 babies (which is the situation in which the disputed probability for the incidence of double SIDS would apply); she was selected because she had lost 2 babies in an unexplained way. Given this, the critical statistic ( the probability that the evidence in court would be at least as incriminating as presented, if the defendant was in fact innocent) was essentially unity. The case should have been thrown out because the circumstances that brought her to court would have brought to court any unfortunate victim of double SIDS.  Only if double SIDS was considered wholly untenable as a hypothesis (i.e. bound never to occur in the population) would there have been any case for conviction on the basis of evidence that consisted almost entirely of the 2 deaths alone.  The letter from the Royal Statistical Society referred to above understates this problem rather severely when it says: 'The fact that two deaths by SIDS is quite unlikely is, taken alone, of little value.'  
The justification it goes on to offer for this statement is the same as Dawid's, based on an assumption that the task is to establish the probability of guilt: 'Two deaths by murder may well be even more unlikely. What matters is the relative likelihood of the deaths under each explanation, not just how unlikely they are under one explanation'. The Lord Chancellor could be excused if he dismissed this argument, since it seeks to introduce evidence (about the incidence of a crime) that would not be admissible in court. Unfortunately he probably failed to see that a different argument leads to a much stronger conclusion wholly consistent with the law of evidence.

***  Selective use of evidence. It has been suggested to me that rejecting evidence about relative crime rates in different ethnic groups, while accepting evidence linked more directly to the crime, is really a moral choice I am introducing - not justified on purely statistical grounds. I don't think so. Suppose evidence A is that the defendant belongs to a crime-prone ethnic group, and evidence B is that when apprehended he had blood on his shirt.  Both are more likely to be true if the defendant is guilty (G) than if he is not guilty (~G) and both contribute to the probability P(G|E) that in the light of all the known facts, the defendant is guilty.  However, I argue that only B and not A contributes to proper assessment of the case, given the presumption of innocence.  Evidence A may affect the prior probability that the defendant will turn out to be guilty, judged even before presentation of evidence in the case****, and as a result it may affect the posterior probability P(G|E) after the case; but I argue that it should not (on purely statistical grounds) contribute to the critical issue, which is the weight of evidence favouring G or ~G in the specific case and relating to the specific defendant. One can see this by attempting to introduce the ethnic evidence into the case. One can apply the 'How come?' test to the evidence: the likelihoods that a jury is assessing about a piece of evidence are essentially, in an adversarial legal system, the probabilities that explanations offered by the defence (answers to the question 'How come this evidence arose, given the hypothesis ~G?') and by the prosecution (the same, given the hypothesis G) could be true.  In the case of evidence A these answers are identical: 'The defendant  belongs to this ethnic group because he was born into it.'  In the case of B the answers will be different, and their likelihoods may be different, thus contributing to the weight of evidence in the case.
    The handling of such issues can be trickier in cases that are in a sense intermediate. Suppose evidence C is that the defendant belongs to a group sworn to kill the person who was the victim in a case. My argument says that logically this evidence is irrelevant, since the explanation of  'How come you are a member of this group?' is unlikely to be very different from defence and prosecution, with therefore the same likelihood, and it therefore does not weigh in favour of G or ~G in relation to the specific case.  Yet it is surely hard to argue that it should be inadmissible evidence or that it should not sway a jury. The defence case is presumably along the lines 'Yes, the defendant would have taken an opportunity to kill the victim, but in fact he was in this case simply a bystander'. My position is that the jury should indeed consider simply the likelihood, on the evidence, that such a sworn killer was in this instance a bystander. Part of this assessment might be the answer to the question 'How come, given you were sworn to kill this person, you did not - according to your case - take part?'.  The defendant must be acquitted if the answers to such questions, in the light of the facts, do not seem too improbable. The way in which I would feel prepared, as a jury person, to be swayed more directly by the evidence C per se is actually not in relation to any weight of evidence favouring guilt, but in relation to the criterion I would set for 'reasonable doubt'.  Given C (though not ethnic evidence such as A) I might be more inclined to convict someone on the basis of a case that was otherwise weak, on the grounds that the undesirability of a false conviction (negative utility) was somewhat reduced by evidence C.  28/3/03

**** Relevance of the process of selection for prosecution. Strictly speaking evidence A might not even do this in the expected manner, depending on how the defendant was selected for trial.  If the police apprehended him because he was the only person in the vicinity of a minority crime-prone ethnic group, his ethnicity might actually render him a priori less likely to be guilty than someone apprehended from the majority culture, for whom there would presumably be some more solid basis for selection. 28/3/03

***** A more formal approach. How is one to resolve uncertainty and establish conviction beyond reasonable doubt if it is not a question of judging the probability of guilt? The failure of this, the simplest mathematical interpretation of the task in hand, combined with the obvious difficulty of quantifying uncertainties in all but the rarest of court cases, could encourage the view that probability theory has little or no place in the courtroom. But the mathematics allows one to distinguish alternative approaches that are crisply different when the facts are simple, and that still form the basis for distinct intuitive approaches in more difficult cases. A first step is to dissect the various stages of a Bayesian argument to arrive at a probability of guilt in the light of evidence:

  1. Establishment of a 'prior' probability of guilt before taking the evidence into account
  2. Gathering and selection of evidence to be taken into account
  3. Assessing the weight of evidence (for or against guilt)
  4. Adjusting the prior probability according to this weight, to arrive at a 'posterior' probability.
It is the first 2 operations that cause the main trouble. Prior probabilities are (reflecting the literal meaning of the word) close to prejudice - especially if based on subjective generalisations about race, class, gender, dress, or police behaviour, etc. Even objective statistics (e.g. about previous conviction rates in the same court or incidence of the particular crime) are of dubious relevance and may be considered unacceptable bases for priors, inadmissible as explicit evidence at stage (2). If you can't set a prior, you cannot arrive at a posterior probability in Bayesian theory. A common way out of this dilemma in applying Bayes' theorem from a starting point of what is essentially ignorance is to say that since there are 2 possibilities (G and not-G) and no basis for preferring either one, the rational thing is to assign P=0.5 to each. But this is a very dubious argument (since there is no defined stochastic model for what determines G or not-G), scarcely if at all more justifiable than one based on prejudice. However, it is at least tantamount to saying: 'Let's forget stages (1) and (4) and decide the case simply on the weight of the evidence (3)', which seems more in line with legal practice. This leaves the problem of selecting such evidence (2). This is of course a legal and ethical minefield, with a few problems highlighted above***. Selection of evidence that relates only directly to the facts at issue does help to ensure that a defendant's right to be presumed innocent is challenged for him/her as an individual, and in relation to a specific event, rather than on the basis of generalisations that might offer a better overall success rate for decisions on a purely statistical basis. But selection of evidence can be dangerous, omitting for example highly pertinent issues of how meticulously different types of evidence were sought, and by what means the defendant was selected for trial.
It would be better in principle to admit any sort of evidence and to adopt a strategy for handling the evidence (stage (3) above - the only one now left as admissible!) that is not affected by the kinds of evidence that seem unnecessary or inappropriate.

The process of assessing the weight of evidence (stage 3) in the conventional Bayesian analysis is one of arriving at an overall likelihood ratio - the ratio Pg/Pi of probabilities that the evidence, taken as a whole, would be observed if the defendant is either guilty (Pg) or innocent (Pi). The posterior probability can then be calculated (stage 4) by multiplying the prior odds ratio from stage 1 (r = p/(1-p) where p is the prior probability of guilt) by this likelihood ratio, to obtain the posterior odds ratio (r' = r Pg/Pi). This ratio can then be straightforwardly converted to the posterior probability p' = r'/(1+r'). This set of operations implements Bayes' theorem. However, as discussed above, a preferred strategy would cut out stages 1 and 4 altogether and treat Pg and Pi differently. Defining the weight of evidence* as W=log(Pg/Pi) - essentially a measure of the strength of the case against the defendant - we can ask the question 'What is the probability that a person who is innocent, possessing the established characteristics of the defendant, and selected for trial in the manner of the defendant, would encounter a case against him/her that is as strong or stronger than W?'. This is a question about the distribution of W, as a random variable. Its assessment, like that of any other probability in court, can be difficult - requiring the weighing of uncertainties and explanations about potentially unreliable testimony. It is essentially equivalent to the question 'What probability (or degree of belief) can one assign to the narrative provided by the defence (and challenged by the prosecution), as an explanation of the evidence without implying guilt?' This question successfully distinguishes between legally admissible and inadmissible forms of evidence that could each influence the wrong type of probability (the probability of guilt).
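The conventional calculation (stages 1, 3 and 4) can be written out as a short sketch using the same symbols: p, Pg, Pi, r, r' and p'. The function name and the numerical inputs are illustrative assumptions, not figures from any real case.

```python
def posterior_probability(p, pg, pi):
    """Odds-form Bayes update: prior p of guilt (0 <= p < 1), likelihoods Pg and Pi."""
    r = p / (1 - p)               # prior odds, r = p/(1-p)
    r_post = r * pg / pi          # posterior odds, r' = r Pg/Pi
    return r_post / (1 + r_post)  # posterior probability, p' = r'/(1+r')


# With a prior of 0.5 the prior odds are 1, so the posterior odds equal Pg/Pi.
p_even_prior = posterior_probability(0.5, 0.9, 0.1)

# Caricature city from the main text: prior 100/101 from incidence alone, and
# evidence equally likely under both hypotheses (Pg/Pi = 1) leaves it unchanged.
p_city = posterior_probability(100 / 101, 1.0, 1.0)

print(f"{p_even_prior:.3f} {p_city:.4f}")
```

The second call illustrates why the prior matters so much in this scheme: when the evidence has no weight either way, the posterior probability of guilt is simply the prior.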
If a defendant has a general characteristic G (ethnic, genetic, income, appearance, mode of speech, prior conviction, etc.) that is more common amongst guilty than innocent persons, or if s/he is the subject of a piece of seemingly directly relevant evidence D (seen holding a smoking gun, running away from the scene, with matching DNA, etc.) the question is the same: how do you account, if you are innocent, for the fact that you are (or were) foreign, feeble-minded, poor, scruffy, ill-spoken, previously in prison, reported to be seen with a smoking gun, running away from the scene, with matching DNA, etc.? Expressed this way, the narratives in response to characteristics G obviously do not bear on the probability that the case against this particular defendant would be as strong as it actually is: they are fixed things that would be the same in any case, whether the defendant was innocent or guilty. On the other hand, the narratives in response to testimony of type D are the normal stuff of criminal trials - the jury must assess whether it is plausible, as perhaps claimed, that the witness is lying, that the gun was a plant, that the defendant really was trying to catch a bus, or that the police really did bring to court with no further evidence the first person from a database of 10 million found to match (with a false identification rate of, say, 1 in 100,000) the DNA at the crime scene, etc.
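The DNA example can be made concrete with a rough sketch. It uses the two figures quoted above and assumes, purely for illustration, that each comparison against the database is independent; the variable names are hypothetical.

```python
database_size = 10_000_000       # people in the database (figure from the text)
false_match_rate = 1 / 100_000   # false identification rate (figure from the text)

# Expected number of innocent people in the database who would match by chance.
expected_false_matches = database_size * false_match_rate

# Probability that a trawl of the whole database yields at least one false match,
# assuming independent comparisons (an assumption, not stated in the text).
p_at_least_one_false_match = 1 - (1 - false_match_rate) ** database_size

print(f"Expected false matches: {expected_false_matches:.0f}")
print(f"P(at least one false match): {p_at_least_one_false_match:.6f}")
```

On these figures roughly a hundred innocent people would be expected to match, so an innocent defendant selected in this way would face exactly the evidence presented. The same match obtained by a different route of selection would carry very different weight.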

This approach helps in principle to do away with the need for careful selection of evidence presented to a jury (2, in the initial scheme above).  In principle, a jury making what I think is the correct type of decision should always be able to make a better decision by being better informed with more facts of any type.  For this reason, one could argue that in principle they should even be allowed to know the criminal history of the defendant (contrary to UK practice) because it can be relevant to interpretation of the way the defendant  has reacted to circumstances. More important, they should know the full narrative of how the defendant came to be selected for trial, since the manner in which something is selected can be highly relevant to the probabilistic inferences that can be drawn from its characteristics. (For example, consider the DNA match referred to in the last paragraph: it would be dramatically more significant if the match were found as a result of testing the victim's next door neighbour. This example is extreme, but I guess there have been many miscarriages of justice because juries have been kept in ignorance of police procedure, even procedure carried out in good faith and consistent with good protocols.)  Despite these arguments in principle that favour maximum availability of information, one must recognise the difficulty for a jury of keeping clear the nature of a decision to be made. Rules of evidence that suppress information (perhaps particularly criminal history) may be justified on the grounds that the jury might be inappropriately swayed to make a decision on the wrong basis - on the basis of the probability of guilt rather than the plausibility of innocence.