More prejudicial than probative?
The fuller version of the January magazine article on violence risk assessment with actuarial risk assessment tools
The past decade has witnessed the irresistible rise of violence risk assessment. Calls for risk assessment reports are more frequent than ever before: from minor cases of domestic breaches of the peace, through all indictable cases of sexual offending, to the ultimate in risk assessments – in terms of time, cost and detail – those prepared in relation to orders of lifelong restriction (s 210B of the Criminal Procedure (Scotland) Act 1995).
The increasing pervasiveness of violence risk assessments raises several questions: are they credible? How much weight should the decision maker place on such assessments? Indeed, what challenges should defence agents pose? In this brief paper I want to focus specifically on the use of actuarial risk scales. These scales include the Risk Matrix 2000, Stable and Acute 2007, the LSI-R and the Static-99 – names with which courts are becoming increasingly familiar.
The use of these actuarial procedures is perhaps surprising, given that both Government reports and professional best practice guidelines support the use of different approaches – approaches based on structured professional judgment, not those based upon actuarial methods (1-4). However, the use of actuarial methods may be less surprising when it is considered that they are quick to use and require relatively little training. They are attractive to organisations under pressure to respond to the burgeoning demand for risk assessments.
The actuarial paradigm is apparently straightforward. A group of offenders, usually prisoners, is assessed ─ often in terms of characteristics that are easy to measure, e.g. age, marital status, history of offending, type of victims etc; they are followed up and new criminal convictions are identified from criminal records. Statistical methods are applied to link the assessed characteristics to the observed probability of reconviction. This information about group relationships is used to make a prognostication about a new individual: guidance is given to decision-makers about his likelihood of reoffending.
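The paradigm can be illustrated with a toy sketch. The data, the band names and the function name below are hypothetical and chosen only for illustration; they are not taken from any published scale.

```python
from collections import defaultdict

# Hypothetical follow-up data: (score band, reconvicted during follow-up?)
development_sample = [
    ("low", False), ("low", False), ("low", False), ("low", True),
    ("high", True), ("high", False), ("high", True), ("high", False),
]

def reconviction_rates(sample):
    """Observed proportion reconvicted within each score band."""
    counts = defaultdict(lambda: [0, 0])  # band -> [reconvicted, total]
    for band, reconvicted in sample:
        counts[band][0] += int(reconvicted)
        counts[band][1] += 1
    return {band: r / n for band, (r, n) in counts.items()}

rates = reconviction_rates(development_sample)

# The actuarial step: a new individual is simply assigned the rate
# observed for the group whose score band he shares.
new_individual_band = "high"
group_rate = rates[new_individual_band]
print(f"Group rate attributed to the new individual: {group_rate:.0%}")
```

The final step is the one at issue in this paper: the figure attached to the new individual is a property of the group, not a measurement of him.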
This paper is underpinned by an increasing disquiet about the use of these assessment techniques and their potentially misleading effects on judicial decisions. The use of actuarial tests is now so pervasive that their validity appears to have been accepted; they are not subject to sufficient challenge. A further concern is that poor practice based on actuarial scales will devalue the currency of properly conducted assessments of violence risk using the structured professional judgment approach, the approach recommended by both the Cosgrove and MacLean reports (2, 5). As will be seen, it is my opinion that the application of actuarial tests to make decisions in the individual case is more prejudicial than probative.
A disquieting case
My disquiet about actuarial approaches was confirmed by a referral from Sheriff Kenneth Mitchell of Glasgow. A social worker opined that the individual accused of a sexual offence was “high risk”; a psychologist opined he was “low risk”. I suspect the sheriff was bemused: he contacted me. The social worker had applied the RM2000 correctly and in line with the manual; he implied that “high risk” equated to a likelihood of sexual reoffending of 26% over five years and 36% over 15 years: figures that would raise concern. How valid were these figures?
There are two stages in calculating the level of risk using the RM2000. On the first step, three risk factors are considered: the individual’s age at commencement of risk, his number of court appearances for sexual crimes and offences, and the number of court appearances for other types of crimes and offences. The accused scored zero on the latter two items, but he obtained a positive score merely because he was between the ages of 18 and 24. On the basis of his age the accused was assessed as being of “medium risk” for sexual reoffending.
On the second step “aggravating factors” are considered. These include convictions for a contact offence against a male, convictions for a sex offence against a stranger, any convictions for a non-contact sex offence, and finally, whether the accused is single or has never lived with an adult partner for at least two years. Two of these aggravating factors applied; in terms of the RM2000 this is sufficient to increase the risk factor by one category, i.e., a “medium risk” is moved to a “high risk”.
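The second step, as described here, can be sketched in a few lines. This is a simplification built only from the description in this article, not a reproduction of the RM2000 manual: the factor names are hypothetical labels, the step-one category is taken as given, and the manual's full scoring rules are not represented.

```python
# Risk bands named in this article, in ascending order.
CATEGORIES = ["low", "medium", "high", "very high"]

# Hypothetical labels for the four aggravating factors listed above.
AGGRAVATING_FACTORS = {
    "contact_offence_against_male",
    "offence_against_stranger",
    "non_contact_sex_offence",
    "single_or_never_cohabited_two_years",
}

def apply_step_two(step_one_category, factors_present):
    """Raise the step-one category by one band if two or more factors apply.

    Assumption: only the rule stated in the article is modelled.
    """
    factors = set(factors_present)
    unknown = factors - AGGRAVATING_FACTORS
    if unknown:
        raise ValueError(f"Unknown factors: {sorted(unknown)}")
    index = CATEGORIES.index(step_one_category)
    if len(factors) >= 2:
        index = min(index + 1, len(CATEGORIES) - 1)
    return CATEGORIES[index]

# The disquieting case: "medium" on age alone, then two aggravating factors.
accused_category = apply_step_two(
    "medium",
    {"offence_against_stranger", "single_or_never_cohabited_two_years"},
)
print(accused_category)
```

Laid out this way, it is plain how little information drives the final label: one age band and two yes/no items.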
Two points are of relevance. First, the court heard, and accepted, that while the accused had not met the victim before, she had pursued him by text message and phone message for four or five weeks prior to the offence; nonetheless, she was deemed a stranger victim. She consented to sexual intercourse and as she was only 12 years and 11 months at the time, he committed an offence. Secondly, the accused, a 19-year-old student, had not married or cohabited for two years or more, i.e. he was deemed to have difficulty forming intimate relationships. This seems tenuous ─ it is not usual for 19-year-old students to have cohabited for two years or more; indeed, the contrary is the case.
This evaluation appeared to me to confirm Mencken’s observation “There is a simple solution to every human problem – neat, plausible and wrong.” The conclusion that this accused person posed a “high risk” was based on three pieces of information: he was between 18 and 24 years of age, he had not cohabited for two years or more, and while he and his victim had communicated regularly over four weeks or so, they had not met. In my experience people are more complex than that – as are the risks that they pose.
Actuarial approaches to violence risk: a false analogy
Actuarial methods are compelling because they appear to be scientific, they are based on data, they are based on statistical analyses, and their product is a number. Unfortunately, this appearance of science is very misleading. There are at least three lines of argument that challenge the utility of these devices for making prognostications about an accused individual: logical, statistical and empirical.
The (il)logic of actuarial approaches to violence risk
From the logical perspective the reasoning inherent in the actuarial approach commits the fallacy of division (6). This fallacy rests on drawing a conclusion about an individual member of a group based on the collective properties of that group. For example, it is obviously fallacious to argue that if, in general, intelligent people earn more than less intelligent people then Jules, with an IQ of 120, will earn more than Jim with an IQ of 100. Equally, it is fallacious to argue that since people who score highly on an actuarial risk scale generally reoffend more than people who do not score highly, Bill in the “high risk” group will reoffend more often – or more quickly – than Brian in the “low risk” group.
A common defence of the actuarial approach is founded upon a related fallacy. “If it is alright for life insurance companies, it should be alright for psychology.” Indeed, a sheriff has made such a remark during one of my lectures on risk assessment. The analogue is false. The actuary makes a profit by predicting the proportion of insured lives that will end in a particular time period: the actuary has no interest in predicting the deaths of particular individuals. By way of contrast, the decision maker in court is only interested in the accused in front of them, not the properties of any statistical group from which they may be derived. This has been long recognised; Sherlock Holmes knew this! “While the individual man is an insoluble puzzle, in the aggregate he becomes a mathematical certainty. You can, for example, never foretell what any one man will do, but you can say with precision what an average number will be up to. Individuals vary, but percentages remain constant.” (7)
The illusion of certainty: statistical aspects of actuarial approaches to violence risk
Numerical statements – e.g., there is a 36% likelihood that this individual will reoffend sexually in the next 15 years ─ are powerful. Numbers stick in the mind. It is difficult for the decision maker to disregard them and alter their evaluation based even on detailed, credible and contradictory information. This is the anchoring bias ─ a well-established cognitive bias that influences all human judgment. Judges in court are not immune (8, 9). This problem is compounded by the tendency to predict rather than forecast, to provide a single value of the likelihood that someone will offend, without any indication of confidence that should be placed on that single value; an indication such as the range of possible values which that likelihood may take. Is the range narrow or wide? Deterministic predictions create the illusion of certainty in the judge’s mind and may lead to sub-optimal action: a lenient sentence when more control is required, or equally, a disproportionate sentence when such is not required.
It is possible to use statistical methods to quantify the degree of (un)certainty that is associated with any estimate, and this includes predictions. Typically, the precision of any estimate (e.g. mean rate of reoffending of a group) can be measured by the width of a confidence interval; a confidence interval gives an estimated range of values which is likely to include the unknown “true” value being estimated. Unfortunately, the manuals for actuarial scales generally do not provide the information necessary to determine uncertainty.
If we return to the disquieting case, the accused was said to be in the “high” group: he was estimated to have a 26% probability of reconviction within five years – the 95% confidence interval can be conservatively estimated to be between 19% and 34% (see (10) for a description of a method). Perhaps this does not seem too uncertain; however, this is not the relevant interval for judging confidence concerning the estimate of risk posed by an individual. The confidence interval is concerned with the average for a group, but the decision maker is only interested in the individual in front of them. To assess the confidence about the probability of reconviction for a new individual, an individual not in the development sample, requires another calculation – the calculation of a prediction interval; this interval is essentially a form of confidence interval which applies to the prediction.
For the disquieting case the prediction interval was conservatively estimated as lying between 2% and 88% (see (11) for a technical discussion of the distinction between confidence and prediction intervals). None of the manuals for the actuarial scales provide this information; indeed, many actuarialists do not appear to appreciate the relevance of this consideration. (12)
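The contrast between the two intervals can be sketched with the Wilson score interval for a proportion. This is offered as an assumption about the kind of calculation involved, not as a restatement of the method in (10) or (11). The group size of 135 is hypothetical, and treating the individual as a "group" of size one is likewise an assumption made for illustration.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion p_hat observed in n cases."""
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)
    ) / denom
    return centre - half_width, centre + half_width

RATE = 0.26  # five-year reconviction rate quoted for the "high" group

# Interval for the group average (hypothetical group size).
group_lo, group_hi = wilson_interval(RATE, n=135)

# Interval when the same formula is applied to a single new individual
# (assumption: the individual is treated as a sample of size one).
indiv_lo, indiv_hi = wilson_interval(RATE, n=1)

print(f"group:      {group_lo:.1%} to {group_hi:.1%}")
print(f"individual: {indiv_lo:.1%} to {indiv_hi:.1%}")
```

Under these assumptions the interval for the single individual spans almost the whole range from 0% to 100%, of the same order as the 2% to 88% quoted above, while the group interval stays comparatively narrow.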
The problem of making predictions for individuals using statistical models is now recognised in other disciplines. It is not merely a function of the complexity of assessing the psychological characteristics of individuals. For example, in relation to medical risks Rose (1992) (13) indicated: “Unfortunately the ability to estimate the average risk for a group, which may be good, is not matched by any corresponding ability to predict which individuals are going to fall ill soon” (p 48). Individual cases demonstrate this. Stephen Hawking wryly observed: “Thirty years ago I was diagnosed with motor neurone disease, and given two and a half years to live. I have always wondered how they could be so precise about the half.” (It would be interesting to know whether those making decisions concerning the release of Mr Al-Megrahi appreciated this uncertainty.)
This view that we cannot predict for individuals has been regarded as controversial in the field of violence risk. (12) Perhaps a thought experiment using a non-psychological example may clarify the point. If I tell you the height of the next man to enter the court, how accurately can you predict his weight? This example has several advantages for the purpose of illustrating the problem of predicting in the individual case. First, the precision of the measurement of height and weight should be substantially greater than for the measurement of either risk factors for violence or violent reoffending. Secondly, the prediction is immediate and not degraded by the passage of time, i.e. not five, 10 or 15 years as is the case with some actuarial scales. Thirdly, the correlation between height and weight is stronger than that between violence risk factors and violent behaviour; this should make prediction easier. We have shown elsewhere that for Scottish men whose height is 1.7 m the best estimate is 78 kg; however, the prediction interval – the range within which 95% of men will lie – is between 61 kg and 95 kg. (11) Thus predicting the weight of the next individual into the court based on knowledge of his height is a hit or miss activity. Therefore, how can high precision be expected in predictions about complex and changing risk potential over many years to come?
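A short calculation shows why collecting more data does not rescue the individual prediction. It assumes weights at a given height are roughly normally distributed, and it backs out a spread from the 61 kg to 95 kg interval quoted above; that implied standard deviation and the sample sizes are assumptions made for illustration, not figures from (11).

```python
import math

Z95 = 1.96  # normal multiplier for a 95% interval

# Spread implied by the quoted interval, assuming normality.
IMPLIED_SD_KG = (95.0 - 61.0) / (2 * Z95)

def mean_half_width(sd, n, z=Z95):
    """Half-width of the confidence interval for the group mean."""
    return z * sd / math.sqrt(n)

def prediction_half_width(sd, n, z=Z95):
    """Half-width of the prediction interval for one new individual."""
    return z * sd * math.sqrt(1 + 1 / n)

for n in (100, 1_000, 10_000):  # hypothetical sample sizes
    ci = mean_half_width(IMPLIED_SD_KG, n)
    pi = prediction_half_width(IMPLIED_SD_KG, n)
    print(f"n={n:>6}: mean ±{ci:.2f} kg, next individual ±{pi:.2f} kg")
```

As the sample grows, the uncertainty about the group average shrinks towards zero, but the uncertainty about the next man through the door stays at about 17 kg either side of the estimate.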
In other areas of life it has been long recognised that forecasts should entail an estimate of the degree of certitude that the forecaster holds about their prognostication (14, 15). From a scientific and professional perspective it is more honest to communicate the degree of (un)certainty with which the expert holds their opinion; this assists the decision maker to make rational decisions about the management of any risk. Relevant information about uncertainty is not made available for any of the actuarial scales in common use.
Actuarial risk assessments as screening tools
Within Scotland and beyond, actuarial instruments are becoming institutionalised. Under multi-agency public protection arrangements (MAPPA), police officers and social workers, for example, are being trained in the use of the RM2000 (MAPPA, 2009; http://220.127.116.11/Publications/2009/10/23131902/6). A growing trend to scepticism amongst certain practitioners may have led to a shift in position: “we only use the actuarial as a screen”. This sounds amiable, tolerant, and evenhanded – unfortunately there is no compelling empirical evidence to support such a use. Perhaps alarmingly, despite the clear limitations of actuarial approaches, the Risk Management Authority argues for the use of the RM2000 as a screening tool for the Scottish population of sexual offenders to identify those who require further – and state-of-the-art – risk assessment. “The RMA continues to work with the Scottish Government in supporting and developing an integrated multidisciplinary approach to risk assessment in which the RM2000 plays a useful role as a screening instrument…” (Risk Management Authority, 2007; http://www.rmascotland.gov.uk/ViewFile.aspx?id=363).
There are two problems with this position. First, in practice this rarely happens; the social worker and the police officer do not have the time – and they probably do not have the training – to provide the systematic risk assessment required if the offender is caught in the screen. The decision maker in court is provided with the results of the actuarial scale without any consideration of certitude or risk formulation.
Secondly, and perhaps more critically, what is the scientific credibility of this position: has it been demonstrated that these instruments are effective screens? Indeed, the contrary is the case. Screens are used in medicine in asymptomatic individuals to identify the risk of future disease. It is not generally appreciated that, to be effective as a screen, risk factors – or sets of risk factors – must be very strongly associated with the disorder being screened for. (16) The strength of this association can be evaluated using an odds ratio: for example, the risk of developing the disease for those with the highest 20% of scores on a risk factor compared with those with the lowest 20% of scores. Wald et al (1999) indicated that even an odds ratio of 200 will only yield a detection rate (the proportion of affected individuals with a positive result on the screen) of 56% for a false-positive rate of 5%. (16) Grubin (2008) published research that has been used by the Risk Management Authority to support the use of the Risk Matrix 2000. (17) It is not possible to calculate the exactly equivalent figure for the RM2000 from published data; however, the closest that can be achieved is the calculation for the top 25% and bottom 25%: this gives an odds ratio of 14.6, several orders of magnitude below that which is required for an efficient screen.
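The link between an odds ratio and a detection rate can be sketched under a simple model. The model is an assumption made for illustration: scores are taken to be normally distributed with equal spread in those who go on to be affected and those who do not, and the top and bottom fractions are defined on the unaffected distribution. The function names are hypothetical, and the sketch is not presented as the calculation used in (16).

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def odds_ratio(separation, tail=0.20):
    """Odds ratio, top vs bottom `tail` fraction of scores, when the two
    groups' means differ by `separation` standard deviations."""
    cut = STD_NORMAL.inv_cdf(1 - tail)
    p_top_affected = 1 - STD_NORMAL.cdf(cut - separation)
    p_bottom_affected = STD_NORMAL.cdf(-cut - separation)
    # The unaffected fractions (tail / tail) cancel out of the ratio.
    return p_top_affected / p_bottom_affected

def separation_for(target_odds_ratio, tail=0.20):
    """Bisection search for the separation giving the target odds ratio."""
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if odds_ratio(mid, tail) < target_odds_ratio:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def detection_rate(target_odds_ratio, tail=0.20, false_positive_rate=0.05):
    """Share of affected individuals above the cut-off that flags the
    stated share of unaffected individuals."""
    separation = separation_for(target_odds_ratio, tail)
    threshold = STD_NORMAL.inv_cdf(1 - false_positive_rate)
    return 1 - STD_NORMAL.cdf(threshold - separation)

dr_200 = detection_rate(200)                 # top vs bottom 20%
dr_rm2000 = detection_rate(14.6, tail=0.25)  # top vs bottom 25%
print(f"odds ratio 200:  detection rate {dr_200:.0%}")
print(f"odds ratio 14.6: detection rate {dr_rm2000:.0%}")
```

Under these assumptions an odds ratio of 200 gives a detection rate close to the 56% cited above, and an odds ratio of 14.6 between the top and bottom quarters gives a markedly lower one.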
To evaluate the effectiveness of a screening tool it is necessary to compare the relationships between the distributions of the risk factors, e.g. RM2000 scores for those who reoffend and those who do not reoffend. To the best of my knowledge this has not been done. Regrettably a request for access to the data derived from publicly funded research, in order to carry out these and other relevant analyses, has been declined. It is perhaps noteworthy that of the four offenders in Grubin’s (2008) study who received life sentences for their new convictions, one was in the “low risk” category; three were in the “medium risk” category; none were in the “high” or “very high” risk categories. At the very least, to be effective, a screen should identify all, or nearly all, cases, i.e. the screen should have a low false negative rate, and in particular, it should identify serious cases such as those who receive life sentences.
Challenges to decisions guided by evidence based on actuarial scales
Actuarial scales have been the subject of consideration in a number of appeal cases. It is perhaps surprising ─ and somewhat concerning ─ that the scientific basis of the conclusions based on actuarial scales including the RM2000, STATIC-99 and LSI-R has not been subject to scrutiny and challenge. The results of these tests are accepted at face value. From a public policy perspective it should be noted that the application of these instruments can ─ and does ─ lead to errors in both directions: individuals who are assessed by more comprehensive procedures to be “low risk” may be deemed to be “high risk”; or “high risk” cases may be deemed to be “low risk”. As noted above, all four prisoners who received a life sentence in the Scottish study failed to be identified as “high” or “very high” risk. The public is poorly served by errors in either direction.
A number of Scottish appeal cases illustrate both the influence and lack of critical appreciation directed at these procedures. In the appeal case of HMA v Thomas Russell Currie  HCJAC 67 a ground of appeal was that “The learned trial judge erred in failing to obtain a full risk assessment.” In their decision (at ) their Lordships concluded: “The Risk Matrix 2000 assessment tool is regularly and widely used for the purposes of assessing the risk presented by an offender to the public… In our view she [the trial judge] was entitled to proceed upon the basis of the outcome of the risk assessment carried out using Risk Matrix 2000.” Would their Lordships come to this view if they appreciated the lack of certitude associated with opinions based on the RM2000?
In the case of Neil Duncan Robertson v HMA (Appeal no XC1020/03, 17 February 2004) it was accepted by their Lordships that use of the RM2000 provided a valid opinion that the convicted person was high risk. The application of another actuarial instrument ─ the Static-99 ─ was part of the evidence used to argue controversially that James Taylor, an individual convicted, amongst other things, of raping a baby girl was “low risk”: HMA v JT (Appeal no XC1062/03, 24 September 2004). The Static-99 was used in another case to argue for “high risk”: Jason Alexander Jordan v HMA  HCJAC 24.
One exception to this acceptance of opinions based on actuarial scales that I am aware of is the case of Lord Watson: the appeal court accepted that a report I prepared "casts doubt on the validity of the risk assessment". There were a number of difficulties in the use of an actuarial instrument (the LSI-R) in addition to those alluded to above. For example, the procedure used ─ the LSI-R ─ was developed on Canadian prisoners with an average age of 26.89, in a sample of general offenders with no reference to fireraising. Lord Watson was not Canadian; he had not been to prison before; he was convicted of fireraising; he was 56 years of age when the assessment was carried out (statistically it was very unlikely that there would have been anyone of his age in the development sample).
This case raised a general point: even if the actuarial approach were to be considered appropriate, it is axiomatic that any individual being assessed should be similar to those with whom they are being compared. In statistical language they should be drawn from the same population. Such inappropriate comparisons are common. In recent cases I have seen the RM2000 being used with first offenders even although the procedure was developed using data from prisoners (data from the Cosgrove report suggests that fewer than 50% of those convicted of a sexual offence receive custodial sentences); first offenders are likely to be different from recidivists. I have seen the actuarial scales used to assess internet offenders even although the internet was of limited availability when the development studies were carried out.
Actuarial risk assessments and expert testimony
Should evidence based on actuarial scales be the basis for expert testimony? Lord Wheatley has recently provided a clear and detailed restatement of the role, responsibility and privileges of the expert witness (Brian Wilson and Iain Murray v HMA  HCJAC 58). In brief, the evidence must contribute to the proper resolution of the dispute and provide relevant information from an area of knowledge or experience that a judge or jury would not generally have access to. Critically, Lord Wheatley noted, “the witness must demonstrate a sufficiently authoritative understanding of the theory and practice of the subject” (at ).
As argued above, the scientific basis for actuarial scales, as applied to individuals, may be more illusory than real. In the United States, in relation to scientific evidence, the theories and procedures on which the expert testimony is based should be accepted within the appropriate scientific community (e.g. Frye v United States, 1923), theory and procedures should be testable, have been subjected to peer review, and error rates should be established (Daubert v Merrell Dow Pharmaceuticals, 1993). If criteria such as these were to be applied it is difficult to see how actuarial procedures would be deemed to be admissible given that the uncertainty of individual predictions is large, unknown, or indeed perhaps unknowable.
Given the complexity of the issues discussed above, are the usual witnesses required to provide evidence on risk – criminal justice social workers – in a position, by dint of their training or experience, to provide “a sufficiently authoritative understanding of the theory and practice of the subject”? I suspect not.
In conclusion, I would urge decision makers and others to be cautious in the weight they place on opinions derived from actuarial risk assessments. From a scientific – rather than a legal – perspective it appears to me that the application of these tests is more prejudicial than probative. As Niels Bohr remarked, “Prediction is difficult, particularly about the future.” I would be interested in the answers to two questions. Are defence agents who do not challenge assessments based on these tests failing their clients? Are organisations that require their employees to use these flawed procedures at corporate risk?
David J Cooke is Professor of Forensic Clinical Psychology at Glasgow Caledonian University and the University of Bergen, and was a member of the MacLean Committee
(1) Department of Health. Best practice in managing risk: Principles and evidence for best practice in the assessment and management of risk to self and others in mental health services. 2007. London, Department of Health.
(2) Lord MacLean. A report of the committee on serious violent and sexual offenders. 2000. Edinburgh, Scottish Executive.
(3) Risk Management Authority: Standards and Guidelines for Risk Assessment. 2006. Paisley, Risk Management Authority.
(4) Royal College of Psychiatrists. Rethinking risk to others in mental health services. 2008. London, Royal College of Psychiatrists.
(5) Lady Cosgrove. Reducing the Risk: Improving the response to sex offending. The Report of the Expert Panel on Sex Offending. 2001. Edinburgh, Scottish Government.
(6) Rorer, L. Personality assessment: A conceptual survey. In: Pervin, L A (ed), Handbook of personality: Theory and research. New York: Guilford; 1990; 693-720.
(7) Doyle, A C. The sign of the four. 1994 ed. Oxford: World's Classics, 1890.
(8) Englich, B, Mussweiler, T. Sentencing under uncertainty: Anchoring effects in the courtroom. Journal of Applied Social Psychology 2001; 31:1535-1551.
(9) Englich, B, Soder, K. Moody experts - How mood and expertise influence judgmental anchoring. Judgement and Decision Making 2009; 4:41-50.
(10) Hart, S D, Michie, C, Cooke, D J. The precision of actuarial risk assessment instruments: Evaluating the "Margins of Error" of group versus individual predictions of violence. British Journal of Psychiatry 2007; 170:60-65.
(11) Cooke, D J, Michie, C. Limitations of diagnostic precision and predictive utility in the individual case: A challenge for forensic practice. Law and Human Behavior. In press.
(12) Craig, L, Beech, A R. Best practice in conducting actuarial risk assessments with adult sexual offenders. Journal of Sexual Aggression 2009; 15:193-211.
(13) Rose, G. The strategy of preventive medicine. Oxford: Oxford Medical Publications, 1992.
(14) Krzysztofowicz, R. The case for probabilistic forecasting in hydrology. Journal of Hydrology 2001; 249:2-9.
(15) Cooke, W E. Forecasts and verifications in Western Australia. Monthly Weather Review 1906; 34:23-24.
(16) Wald, N J, Hackshaw, A K, Frost, C D. When can a risk factor be used as a worthwhile screening test? British Medical Journal 1999; 319:1562-1565.
(17) Grubin, D. Validation of Risk Matrix 2000 for Use in Scotland. Report Prepared for the Risk Management Authority. 2008. Paisley, Risk Management Authority.