By Alessio Farhadi and Adam Lund
Matt has ordered 2 million Covid-19 antibody test kits at a cost of $7 per unit. Upon delivery he learns the test kits are only 85% accurate. How much does Matt need to spend on average per person to determine with 99% accuracy the presence of Covid-19 antibodies? Discuss how this may change with the prevalence of Covid-19 within the population.
Initially, we set this problem as a thought-provoking interview brainteaser. It became apparent that the solution was far from simple. Whilst it has become commonplace to approach many statistical problems through Machine Learning techniques, we show that a rigorous analytical approach is often the best starting point, yielding more complete solutions.
A good rule of thumb is to take 3 consecutive tests from the same batch of test kits. If one test result is inconsistent with the rest, then take a 4th and stop. When the result is 3-1, go with the majority; if 2-2, throw the test kits away.
Comments and feedback are welcome.
Please note, the information contained in this article is solely for illustrative purposes and MUST NOT be interpreted as professional medical advice. Please consult your physician if you are unwell or suffering from Covid-19 type symptoms.
I live in the Bayesian world for the purposes of this problem. This is a fancy way of saying I assume my probabilities for this problem are given – known unknowns in the language of Donald Rumsfeld. The result of each Covid-19 antibody test is binary, meaning it may only be in one of two states: positive or negative (real antibody test kit results fall within a range). Whilst 85% certainty may be sufficient for the purposes of mass-population statistics, for the individual concerned this may present an unacceptable level of certainty, especially when false Covid-19 positives may lead to dire consequences and super-spreaders. For example, a carer at an elderly care home incorrectly being told they have antibodies present.
To solve this problem, I need to figure out how many binary (positive/negative) tests are required to get my error rate below 1%. Let’s start with the simplest case and approach.
Case (i): Consecutive test results consistent
We assume $p$ is the probability a Covid-19 sufferer's test result is correct ($p = 0.85$) and $q = 1 - p$ ($0.15$) the probability it is incorrect. Each test is drawn from an independent and identical distribution (i.i.d.) of the test kits available. We start with the simplest case, where $n$ consecutive Covid-19 test results are consistent and correct. The minimum number of tests required to be inaccurate with less than 1% probability can be derived from the expression

$$q^n < 0.01.$$
With $q = 0.15$, we find 3 consistent consecutive test results ($21 cost) would be sufficient to attain 99% accuracy. However, this is a special case, and also the minimum number of tests required to achieve Matt's original desired accuracy.
Given the 15% probability that a test result is incorrect, we are highly likely to require more than 3 tests in order to reach our desired 99% accuracy threshold. In fact, the probability of having at least one incorrect test result within our first 3 tests is $1 - p^3 \approx 0.39$ (39%). To complicate matters further, in practical terms we have a high degree of uncertainty around the true accuracy of antibody test kits given the product's infancy.
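The numbers above are easy to reproduce. A minimal sketch (the function name is ours; the 1% threshold and $q = 0.15$ are the article's figures):

```python
# Case (i): how many consecutive, consistent test results are needed
# before the probability that they are all wrong drops below 1%?

def min_consistent_tests(q: float, target_error: float = 0.01) -> int:
    """Smallest n with q**n < target_error, for per-test error rate q."""
    n = 1
    while q ** n >= target_error:
        n += 1
    return n

print(min_consistent_tests(0.15))   # 3 tests: 0.15**3 = 0.003375 < 0.01
print(1 - 0.85 ** 3)                # ~0.386: chance of an inconsistent result in 3 tests
```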
Case (ii): One correct test result
Let's consider the case where we take 4 tests and just 1 result is correct. Once again, we restrict the discussion to known positive/negative cases. There are 3 ways (combinations) this may happen: the single correct result falls on the first, second, or third testing event. The probability $P$ of this occurring is

$$P = 3\,p\,q^3.$$
For a test kit with 85% accuracy ($p = 0.85$) the probability is $P \approx 0.9$%. The cost is $28.
Case (iii): Two correct test results
Now the case where 5 tests are taken and just 2 of the results are accurate. This can occur in 6 combinations. Thus,

$$P = 6\,p^2\,q^3,$$
which gives an approximately 1.5% probability of this occurring, at $35 cost. Unfortunately, that leaves us just below our 99% accuracy threshold. We require an additional accurate test to get us over the 99% hurdle, at a cost of $42. A subsequent incorrect test would start to send us down the rabbit hole, and the total cost starts to spiral. Matt should really consider the trade-off between the accepted accuracy threshold and the average cost per person. For now, we will set our primary objective to be accuracy over cost.
After a little more consideration we can envision (see below) how the implied accuracy and costs evolve as more tests are taken. We stop testing once we are 99% confident we have correctly diagnosed the presence or absence of Covid-19 antibodies in a patient.
More generally, if we take $n$ tests with $k$ correct results, we continue testing until

$$(\text{number of possible paths}) \times p^k\, q^{n-k} < 0.01.$$
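As a sketch, this path-counting condition can be checked against Cases (ii) and (iii). The count $\binom{n-1}{k}$ – treating the final test as one of the incorrect ones – is our inference from the "3 ways" and "6 combinations" counts in the text:

```python
from math import comb

p, q = 0.85, 0.15  # per-test accuracy and error rate from the article

def error_path_probability(n: int, k: int) -> float:
    """Probability of a testing path with k correct results out of n tests,
    counting comb(n - 1, k) orderings (assumed: the n-th test is incorrect)."""
    return comb(n - 1, k) * p ** k * q ** (n - k)

print(error_path_probability(4, 1))   # case (ii): 3 * p * q**3, about 0.9%
print(error_path_probability(5, 2))   # case (iii): 6 * p**2 * q**3, about 1.5%
```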
This reminds me a lot of the optimal stopping problem traders face when considering the early exercise of American-style options. With one notable difference: the penalty for early exercise of an American option decreases as the option's time value erodes, whereas for testing, the cost increases with each incremental test taken.
As a policy, I would set a minimum of 3 tests, and a maximum of 4 should just one result be inconsistent, at an expected cost of

C = 0.614 × $21 + 0.386 × $28 = $23.70
Much like Matt, I find myself out of my depth to do anything more statistically rigorous. Fortunately, I have a smart friend (Adam) with a PhD in Statistics who can help. Let’s ask the experts.
Before we begin, we must clarify some terminology and what we mean by test accuracy. The sensitivity, $p_{se}$, represents the true positive rate, and the specificity, $p_{sp}$, the true negative rate.
These are important factors doctors and statisticians need in order to determine the accuracy of a test kit, expressed as

$$\text{accuracy} = \gamma\, p_{se} + (1 - \gamma)\, p_{sp}, \qquad (1)$$

where $\gamma$ is the prevalence, i.e. the fraction of positives in the population.
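As a sketch of how prevalence enters the headline accuracy (the function name is ours; the relation is the standard prevalence-weighted average of sensitivity and specificity):

```python
def overall_accuracy(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Overall accuracy = prevalence * sensitivity + (1 - prevalence) * specificity."""
    return prevalence * sensitivity + (1 - prevalence) * specificity

# With equal sensitivity and specificity, accuracy is prevalence-free:
print(overall_accuracy(0.85, 0.85, 0.05))   # 0.85 for any prevalence
# Otherwise the prevalence matters:
print(overall_accuracy(0.90, 0.80, 0.05))   # 0.805 at 5% prevalence
```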
If a test kit has 85% accuracy, the statistical errors become significant when Covid-19 test kits are used to estimate the prevalence, particularly if the true prevalence is below 15%. This would require larger-scale testing than the typical 1000–2000 sample sizes in most studies.
We will assume the test kits are drawn i.i.d., such that if a known Covid-19 positive (or negative) person takes $n$ consecutive tests, on average $np$ tests will be correct, and $n(1-p)$ incorrect.
Let us assume Jon is an individual who would like to know whether or not he has suffered from Covid-19. And, if so, with what degree of certainty he can know it.
Let us assume we have a time series of consecutive Covid-19 test results $X_1, X_2, \ldots, X_n$ for an individual such that for all $i$,
$P(X_i = 0) = p$ when Covid-19 negative, or
$P(X_i = 1) = p$ when Covid-19 positive.
How do we combine test results to derive a joint probability distribution in such a way that allows us to control the confidence level (accuracy)?
This is a classical hypothesis testing problem that the Neyman-Pearson Lemma seeks to address. We let $H_0$ denote the null hypothesis that Jon is Covid-19 negative. By contrast, $H_1$ is the alternative hypothesis: Jon is Covid-19 positive. Given observations $X_1, \ldots, X_n$, we need to find the hypothesis (model) which best fits our observations. In our case, we compare hypothesis $H_0$ with $H_1$.
Since each $X_i$ is either 0 or 1 (a Bernoulli trial), and the $X_i$ are i.i.d., it follows that their joint distribution must follow a binomial distribution of order $n$. We use the pre-determined accuracy $p$ to define our null hypothesis. This enables us to reduce the problem to one where we wish to find the best-fitting parameter between two binomial distribution functions. This situation is termed a simple hypothesis.
We know the test result is correct with probability $p$. This means, given a patient is negative for Covid-19 antibodies (within our null hypothesis), we require $P(X_i = 0) = p$ and $P(X_i = 1) = 1 - p$. Similarly, given a patient is positive (within the alternative hypothesis), we must have $P(X_i = 1) = p$. If we let $k = \sum_{i=1}^{n} X_i$ denote the total number of positive test results, and $n - k$ the number of negatives, we obtain the following expressions for the probability function. Under $H_0$,

$$L_0 = p^{n-k}\,(1-p)^{k},$$
and under the alternative $H_1$, the likelihood is

$$L_1 = p^{k}\,(1-p)^{n-k}.$$
So we see that we have a particularly simple version of a simple hypothesis setup, with parameters $p$ and $1 - p$.
Now classical statistical testing theory, that is the Neyman-Pearson Lemma, tells us that we should use the likelihood ratio test (LRT) quantity

$$\Lambda_n = \frac{L_1}{L_0} = \left(\frac{p}{1-p}\right)^{2k - n}$$
when deciding which of the two hypotheses we should accept. The overall test or decision rule based on the observations is then

$$\delta(X) = \begin{cases} 1 & \text{if } \Lambda_n > c \\ 0 & \text{otherwise,} \end{cases} \qquad (2)$$

where $\delta = 1$ means we reject $H_0$ and accept $H_1$, and $\delta = 0$ that we accept $H_0$. In particular, the critical value $c$ and the number of trials $n$ are determined such that the decision rule has statistical significance $\alpha$ (probability of a false positive), i.e.

$$P_{H_0}(\Lambda_n > c) \le \alpha.$$
This decision rule is optimal in the following sense: it is the most powerful test (the one with the lowest probability of making a type II error) at significance level $\alpha$, meaning the test that gives us the highest probability of rejecting a false null hypothesis.
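A minimal sketch of the LRT quantity: with $k$ positives in $n$ tests, the ratio of the two binomial likelihoods collapses to $(p/(1-p))^{2k-n}$ (the helper name is ours):

```python
def likelihood_ratio(k: int, n: int, p: float) -> float:
    """LRT quantity Lambda_n = L1 / L0 for k positive results out of n tests,
    each correct with probability p; reduces to (p/(1-p))**(2k - n)."""
    return (p / (1 - p)) ** (2 * k - n)

# Lambda_n > 1 exactly when the positive results are in the majority:
print(likelihood_ratio(3, 3, 0.85))   # ~182: strong evidence for H1
print(likelihood_ratio(1, 4, 0.85))   # ~0.03: evidence for H0
print(likelihood_ratio(2, 4, 0.85))   # 1.0: a tie carries no evidence
```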
Using (2) gives us a testing procedure but still leaves the question of how to obtain the desired accuracy. To answer that, we need to know the distribution of our test quantity or, if that's not easy to figure out or is a non-standard distribution, its asymptotic distribution. In our setup, however, if we consider an equivalent decision rule based on the logarithm of the likelihood ratio, $\log \Lambda_n$, we can obtain analytical expressions for the error probabilities. In this connection let us first define the quantity

$$\lambda(p) = \log\left(\frac{p}{1-p}\right).$$
As a function of the accuracy $p$, $\lambda$ quantifies the amount of information embedded in the test variable $X_i$ – its entropy, if you will – and in turn indicates whether we can learn anything from repeating the test. As such, we can think of $\lambda$ as the learning rate of our overall test procedure. More specifically, maximal entropy is reached for $p = 1/2$, the uniform binary distribution, yielding $\lambda(1/2) = 0$, reflecting that repeating the test won't let us learn anything. On the other hand, for $p = 1$ (minimal entropy), $\lambda(1) = \infty$, reflecting that the learning procedure gives us the truth every time, i.e. is an oracle procedure. Note that $|\lambda|$ is symmetric around $p = 1/2$, reflecting that a binary test that is wrong more than half the time is just as good as a binary test that is right more than half the time.
Now, instead of using the rule (2), we will compare the test quantity $\log \Lambda_n = \lambda(p)(2k - n)$ to a threshold $c$, so that our decision variable is now given as

$$\delta(X) = \begin{cases} 1 & \text{if } \lambda(p)(2k - n) > c \\ 0 & \text{otherwise.} \end{cases} \qquad (3)$$
Clearly (3) is equivalent to (2) and makes intuitive sense. It shows directly that the threshold controlling the significance level (i.e. the accuracy of the test), for any fixed $p$, is in fact a function of the number of trials $n$. It is convenient to choose $c = 0$ to obtain the majority rule: after $n$ trials, if the majority of answers are Covid-19 positive we will reject the null and accept that Jon is positive; in case of a tie we will let chance decide; and otherwise we will accept that he is negative.
Now what remains is to find $n$ such that the test procedure based on (3) has accuracy $1 - \alpha$. This is easy, since $k$ is binomially distributed under each hypothesis. This means the probabilities of error, $\alpha$ and $\beta$, are easy to compute as a function of $n$ under each hypothesis.
In particular, the error probability under $H_0$ is

$$\alpha = \sum_{k > n/2} \binom{n}{k} (1-p)^{k}\, p^{n-k} + \frac{1}{2}\binom{n}{n/2}\big(p(1-p)\big)^{n/2} = \sum_{j < n/2} \binom{n}{j}\, p^{j} (1-p)^{n-j} + \frac{1}{2}\binom{n}{n/2}\big(p(1-p)\big)^{n/2}, \qquad (4)$$

where the tie terms appear only when $n$ is even.
The last expression is obtained by changing the summation index to $j = n - k$ and using the fact that $\binom{n}{k} = \binom{n}{n-k}$.
Under $H_1$ we get that the error probability is

$$\beta = \sum_{k < n/2} \binom{n}{k}\, p^{k} (1-p)^{n-k} + \frac{1}{2}\binom{n}{n/2}\big(p(1-p)\big)^{n/2}. \qquad (5)$$
As noted initially, to obtain overall accuracy for any prevalence level we must have equal sensitivity and specificity ($p_{se} = p_{sp}$) according to (1). By (4) and (5) this happens if and only if $\alpha = \beta$, implying we have to let $c = 0$ in our test procedure. Now, for fixed $p$, to finally determine $n$ we plot (5) as a function of $n$ in Figure 1.
Figure 1: Probability of error as a function of the number of trials.
Note that the first time we obtain the desired accuracy is after an odd number of trials for each $p$, i.e. it is always suboptimal to test an even number of times. This is actually good information to have, since some places indeed recommend that you test twice if you want higher confidence in the result. According to Figure 1, for the decision rule (3), this does not make any sense.
Finally, to solve Matt's original problem, for $p = 0.85$ and $\alpha = 0.01$, he needs to test Jon $n = 9$ times for a total cost of $63. Doing that, he actually gets around a 99.4% accuracy; however, he can save $14 if he is willing to accept a slightly lower accuracy of around 98.8% from 7 tests.
With 85% accuracy test kits, Matt can achieve 99% accuracy in Covid-19 diagnosis with 9 consecutive tests at a cost of $63 per patient.
We now have an answer to the question: Jon would need to get tested 9 times to obtain (more than) 99% accuracy with a Covid-19 test that is 85% accurate, using the LRT-based decision rule. To answer this question we did a power analysis to obtain the minimum number of tests we would need for our likelihood ratio test to attain a certain power.
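The power analysis can be reproduced in a few lines. The sketch below evaluates the majority rule of decision rule (3), with ties broken by a fair coin, and searches for the smallest $n$ with error below 1% (the helper name is ours):

```python
from math import comb

def majority_error(n: int, p: float) -> float:
    """Probability the majority decision is wrong after n tests,
    each independently correct with probability p; ties broken by a fair coin."""
    q = 1 - p
    err = sum(comb(n, k) * p ** k * q ** (n - k) for k in range((n - 1) // 2 + 1))
    if n % 2 == 0:  # tie at k = n/2: the coin flip is wrong half the time
        err += 0.5 * comb(n, n // 2) * (p * q) ** (n // 2)
    return err

n = 1
while majority_error(n, 0.85) > 0.01:
    n += 1
print(n)                                   # 9 tests needed for 99% accuracy
print(round(majority_error(9, 0.85), 4))   # 0.0056, i.e. ~99.4% accuracy
print(round(majority_error(7, 0.85), 4))   # 0.0121, i.e. ~98.8% accuracy
```

It also confirms the odd/even observation: an even number of tests gives exactly the same error as the preceding odd number.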
However, we have not proved that this is the optimal test strategy in terms of cost, i.e. that the likelihood ratio test is in fact the most efficient test. Efficiency is defined by the sample size, i.e. the $n$ above, needed to obtain the required power. So, to fully answer the original question and obtain the minimal cost for Matt, we need to test with higher efficiency.
During World War II, several groups of researchers were concerned with statistical efficiency, for example when testing batches of military equipment so as to minimize the associated costs. In particular, the Statistical Research Group (SRG) at Columbia University was working extensively on this problem. Milton Friedman, an influential economist, and the statistician W. Allen Wallis conjectured a sequential testing strategy where the number of tests required depends on the observed outcomes. In practical terms, this sequential testing approach yields higher time and cost efficiency compared to the preset sample size of the LRT, at an insignificant statistical cost (same error rates $\alpha$ and $\beta$).
Optimal stopping problems feature prominently in mathematical finance – as previously mentioned, in the early exercise of an American option. European options, unlike American-style ones, have a fixed exercise/maturity date, which permits analytical pricing using the Black-Scholes equation. Pricing an American option is both mathematically and computationally taxing due to the inherent optimal stopping problem of early exercise.
As Alessio touched upon above, the Covid-19 testing optimal stopping problem differs from American options in that each test kit carries an incremental cost, whereas the American option's time value decays as it approaches maturity. In Jon's case, for instance, using 85% accurate tests, 5 consistent test results in a row remove the need for an additional 4 tests – saving $28, time and resources. This demonstrates that, given an appropriate stopping rule, it is possible to obtain the accuracy we desire at a lower cost. However, as with an American option, setting an optimal stopping rule is non-trivial.
After hearing about Friedman and Wallis's conjecture, Abraham Wald proposed a mathematical framework, sequential probability ratio testing (SPRT), whereby a stopping rule is derived for such trial problems. Later, together with Jacob Wolfowitz, he demonstrated that his proposed stopping rule is also optimal in the simple hypothesis setting: it is the most efficient test for a given accuracy, see [2].
To introduce the SPRT procedure and stopping rule we do as in Wald's original paper [1] and define the following two quantities

$$a = \log\left(\frac{1-\beta}{\alpha}\right) \quad \text{and} \quad b = \log\left(\frac{\beta}{1-\alpha}\right), \qquad (6)$$
where we note that with $\alpha = \beta$ we have $b = -a$. Next consider the random variable $Z_i$ defined by

$$Z_i = \lambda(p)\,(2X_i - 1), \qquad (7)$$
with $X_i$ Bernoulli i.i.d. Then the $Z_i$ are i.i.d. with $Z_i \in \{-\lambda(p), \lambda(p)\}$, and it follows that we can write $\log \Lambda_n$ from (3) as

$$\log \Lambda_n = \sum_{i=1}^{n} Z_i. \qquad (8)$$
Now the sequential probability ratio procedure can be formalized by the algorithm below:
Given the single test accuracy $p$ and desired error rates $\alpha$ and $\beta$, compute the constants $a$ and $b$ and define the stopping rule: at each iteration $n$, if $\log \Lambda_n \ge a$ we stop and accept $H_1$; if $\log \Lambda_n \le b$ we stop and accept $H_0$; and otherwise we continue to test.
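Algorithm (1) is short enough to sketch directly. The simulation below (our code, not the authors') tests a truly Covid-19 positive patient with 85%-accurate kits and $\alpha = \beta = 0.01$:

```python
import math
import random

def sprt(results, p, alpha, beta):
    """Wald's SPRT: consume 0/1 test results until a boundary is crossed.
    Returns the verdict and the number of tests used."""
    a = math.log((1 - beta) / alpha)   # accept H1 (positive) at or above a
    b = math.log(beta / (1 - alpha))   # accept H0 (negative) at or below b
    step = math.log(p / (1 - p))       # the learning rate lambda(p)
    llr, used = 0.0, 0
    for x in results:
        llr += step if x else -step
        used += 1
        if llr >= a:
            return "positive", used
        if llr <= b:
            return "negative", used
    return "undecided", used

random.seed(7)
p, trials = 0.85, 100_000
total, correct = 0, 0
for _ in range(trials):
    stream = (random.random() < p for _ in range(1000))  # Jon is truly positive
    verdict, used = sprt(stream, p, alpha=0.01, beta=0.01)
    total += used
    correct += verdict == "positive"

print(total / trials)     # around 4.2 tests on average
print(correct / trials)   # better than 99% of verdicts are correct
```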
In Machine Learning (ML) terminology, we can think of the SPRT as an online (sequential) learning procedure and, correspondingly, the classical Neyman-Pearson LRT procedure as batch learning. The SPRT algorithm is optimal in the sense that, on average, it will result in significantly fewer tests to statistically accept a hypothesis compared to other methods. It turns out that using such an online learning (in-sample) approach often yields greater efficiency through a reduced number of trials.
To compute the expected number of tests Jon would have to take using the SPRT procedure, we begin by defining our optimal stop as

$$T = \min\{\, n \ge 1 : \log \Lambda_n \ge a \ \text{ or } \ \log \Lambda_n \le b \,\}, \qquad (9)$$

where $T$ is a random variable representing the number of iterations Algorithm (1) will do before it stops, i.e. the number of tests Jon must take to obtain the desired Covid-19 test accuracy. As shown in [2], the SPRT minimizes $E[T]$, the expected time to accepting either hypothesis, for any desired accuracy level.
Next, we can explicitly compute the average number of tests required to achieve our desired ($1 - \alpha$) level of accuracy. To do so, let $Z$ represent a random variable with a probability distribution identical to that of the i.i.d. variables $Z_i$ in (7). From (8), using Wald's identity it follows that

$$E[T] = \frac{E[\log \Lambda_T]}{E[Z]}.$$
For the denominator, under $H_1$,

$$E[Z] = \lambda(p)\,E[2X_i - 1] = \lambda(p)\,(2p - 1). \qquad (10)$$
To calculate the numerator, we observe that at the stop either $\log \Lambda_T \ge a$ or $\log \Lambda_T \le b$, such that Adam's Law (Tower property) implies

$$E[\log \Lambda_T] = E[\log \Lambda_T \mid \log \Lambda_T \ge a]\,P(\log \Lambda_T \ge a) + E[\log \Lambda_T \mid \log \Lambda_T \le b]\,P(\log \Lambda_T \le b). \qquad (11)$$
Since $\beta$ is the probability of accepting $H_0$ when $H_1$ is true (under $H_1$) and $1 - \beta$ is the probability of accepting $H_1$ when $H_1$ is true (under $H_1$), we obtain

$$E[\log \Lambda_T] \approx a\,(1 - \beta) + b\,\beta. \qquad (12)$$
Note, $\log \Lambda_T$ takes values in the set

$$\{\, j\,\lambda(p) : j \in \mathbb{Z} \,\}; \qquad (13)$$

with $p$ known, it is straightforward to determine the smallest integer $m$ such that $m\,\lambda(p) \ge a$.
As $b = -a$ when $\alpha = \beta$, it follows from our definition of $T$ in (9), and the lattice of values in (13), that

$$\log \Lambda_T \in \{-m\,\lambda(p),\ m\,\lambda(p)\}. \qquad (14)$$
Combining (11), (12) and (14) we obtain

$$E[\log \Lambda_T] = m\,\lambda(p)\,(1 - \beta) - m\,\lambda(p)\,\beta = m\,\lambda(p)\,(1 - 2\beta).$$
Finally, using (10) we can write (9) as

$$E[T] = \frac{m\,(1 - 2\beta)}{2p - 1}. \qquad (15)$$
In the special case where $\alpha = \beta$, such that we have an overall test accuracy of $1 - \alpha$, we get from (6) that $b = -a$ and $a = \log\frac{1-\alpha}{\alpha}$.
Using (13), it follows that $m = \lceil a / \lambda(p) \rceil$, which gives rise to the condition

$$m\,\lambda(p) \ge \log\frac{1-\alpha}{\alpha} > (m - 1)\,\lambda(p).$$
The expression (15) represents the average number of tests required, as a function of the individual test kit accuracy $p$, in order to achieve our desired level of accuracy $1 - \alpha$. Moreover, it provides the minimum expected number of tests required to have an error rate of at most $\alpha$, i.e. accuracy of at least $1 - \alpha$.
In (15), when $p \to 1/2$, $E[T] \to \infty$, reflecting that one cannot learn much from test kits which are no more accurate than a series of coin flips. Also, for $p \to 1$ we get $\lambda(p) \to \infty$, implying by (13) that $m \to 1$ and then $E[T] \to 1$, as expected. Figure (2) shows the function in (15) for $\alpha = 0.01$.
Figure 2: Expected number of Covid-19 tests required for 99% target accuracy.
In Table (1) we compare, for a range of test accuracies $p$, the number of trials needed using the batch learning approach above (the LRT) with the expected number of trials needed under the SPRT approach.
We see from Table (1) that, by following the SPRT approach, we have at least halved the average number of test kits required to attain 99% confidence in our Covid-19 diagnosis, as compared with the LRT. In practical terms, $29.40 vs $63, saving Matt a total of $33.60 per patient.
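The 4.2-test figure can be cross-checked against (15). The sketch below assumes, as above, symmetric error rates and models the log-likelihood ratio as a ±1 random walk absorbed at ±m net correct results (function name ours). Note the article's $29.40 comes from rounding to 4.2 tests before multiplying by the $7 unit cost:

```python
import math

def expected_tests(p, alpha):
    """Expected number of SPRT tests for symmetric error rates (alpha = beta),
    per equation (15): E[T] = m * (1 - 2*beta) / (2p - 1)."""
    q = 1 - p
    lam = math.log(p / q)                 # learning rate lambda(p)
    a = math.log((1 - alpha) / alpha)     # acceptance boundary
    m = math.ceil(a / lam)                # net correct results needed to stop
    r = q / p
    beta = r ** m / (1 + r ** m)          # realised error rate (gambler's ruin)
    return m * (1 - 2 * beta) / (2 * p - 1)

et = expected_tests(0.85, 0.01)
print(round(et, 1))       # 4.2 tests on average
print(round(7 * et, 2))   # ~$29.67 expected cost, vs $63 for the fixed-n LRT
```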
With an optimal stopping approach, Matt can still achieve 99% accuracy in Covid-19 diagnosis using less than half as many test kits (4.2 on average), at an expected cost of $29.40 per patient.
References
[1] Abraham Wald. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2):117–186, 1945.
[2] Abraham Wald and Jacob Wolfowitz. Optimum character of the sequential probability ratio test. The Annals of Mathematical Statistics, 19(3):326–339, 1948.