Appendix A: Stratified Sampling to Assess Accuracy | RDP 2025-06: An AI-powered Tool for Central Bank Business Liaisons: Quantitative Indicators and On-demand Insights from Firms

Nicholas Gray, Finn Lattimore, Kate McLoughlin and Callan Windsor

August 2025

Most paragraphs in the liaison corpus are not about wages (or labour costs). Figure A1 shows the distribution of the LM's predicted probabilities for the wages topic. If we set the LM's classification threshold to 0.9, it would classify only 7 per cent of paragraphs as being about wages. A random sample of 600 paragraphs from this population would therefore be expected to contain only about 42 such paragraphs (600 × 0.07). This makes it difficult to validate the performance of our method using metrics that rely on observing more than a few wages paragraphs.
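The arithmetic behind that expected count can be checked with a short sketch. It assumes each sampled paragraph independently has a 7 per cent chance of scoring above the threshold (a binomial approximation to sampling from a large corpus); the function name is illustrative only.

```python
import math


def expected_hits(sample_size: int, share: float) -> tuple[float, float]:
    """Mean and standard deviation of a binomial count of high-score paragraphs."""
    mean = sample_size * share
    std = math.sqrt(sample_size * share * (1 - share))
    return mean, std


# 600 paragraphs, 7 per cent of which score above the 0.9 threshold.
mean, std = expected_hits(600, 0.07)
print(f"expected high-score paragraphs: {mean:.0f} (standard deviation {std:.1f})")
```

Under this approximation the count also varies from sample to sample, so any single random sample may contain noticeably fewer wages paragraphs than the average.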

Figure A1: Distribution of LM Topic Scores - Bar graph showing the distribution of LM topic scores across deciles. The bottom decile has the highest concentration of scores, with a sharp drop through the middle deciles and a small increase in concentration in the top decile. The top decile is coloured differently from the lower ones to indicate that its paragraphs have a high likelihood of being about wages. The graph highlights the rarity of high-confidence wage-related paragraphs.

Share of paragraphs about ‘wages’ by probability decile

The solution is to up-sample paragraphs that are likely to be about wages. We use the LM's distribution of predicted probabilities for wages to guide how we sample the population. First, we stratify the population based on the deciles of the distribution. Then we draw an equal number of paragraphs at random from each decile (or stratum). This method is known as stratified sampling.
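A minimal sketch of this stratified draw is given below. It is an illustration rather than the implementation used in the paper: the scores are synthetic, the function names are hypothetical, and the strata are taken to be ten equal-width probability bins, with bin boundaries treated as half-open for simplicity.

```python
import random
from collections import defaultdict


def stratum_of(score: float) -> int:
    """Map a score in [0, 1] to a bin 0..9 of width 0.1 (1.0 goes in the top bin)."""
    return min(int(score * 10), 9)


def stratified_sample(scores, per_stratum, rng):
    """Return {stratum: [sampled indices]} with up to per_stratum draws each."""
    members = defaultdict(list)
    for idx, score in enumerate(scores):
        members[stratum_of(score)].append(idx)
    return {
        k: rng.sample(idx_list, min(per_stratum, len(idx_list)))
        for k, idx_list in sorted(members.items())
    }


rng = random.Random(0)
# Hypothetical skewed scores standing in for the LM's wage probabilities.
scores = [rng.random() ** 4 for _ in range(10_000)]
sample = stratified_sample(scores, per_stratum=60, rng=rng)
for k, idxs in sample.items():
    print(f"stratum {k}: {len(idxs)} sampled")
```

Sampling without replacement within each stratum means a sparse stratum contributes fewer than the requested number of paragraphs instead of repeating any of them.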

To account for the imbalanced sampling of each stratum, we calculate an importance weight based on the sample's representativeness of the entire population. For example, a paragraph sampled from the [0,0.1] stratum (which contains about 70 per cent of paragraphs) would be 10 times as representative as a paragraph sampled from the (0.9,1] stratum (which contains only 7 per cent of paragraphs). This is known as importance sampling and is often used to reduce the variance when estimating expected values of distributions. The importance sampling weighting formula for each of our strata is given by:

w_{i} = \frac{N_{i}}{N}

where N_i is the number of instances in decile i in the (estimated) true population, and N is the total number of instances in the true population. The weighted F1 score is the sum of the F1 scores for each decile, weighted by their respective weights:

F1_{\text{weighted}} = \sum_{i = 1}^{10} w_{i} \cdot F1_{i}
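The two formulas above can be sketched in a few lines. The stratum sizes and per-stratum F1 scores below are hypothetical: only the bottom (about 70 per cent) and top (7 per cent) population shares are taken from the text, and the function names are illustrative.

```python
def importance_weights(stratum_sizes):
    """w_i = N_i / N for each stratum."""
    total = sum(stratum_sizes)
    return [n / total for n in stratum_sizes]


def weighted_f1(f1_by_stratum, weights):
    """Sum of per-stratum F1 scores, each multiplied by its weight."""
    return sum(w * f1 for w, f1 in zip(weights, f1_by_stratum))


# Hypothetical stratum sizes for a population of 400,000: only the bottom
# (about 70 per cent) and top (7 per cent) shares are taken from the text.
sizes = [280_000, 40_000, 16_000, 10_000, 8_000, 6_000, 4_000, 4_000, 4_000, 28_000]
weights = importance_weights(sizes)

# Hypothetical per-stratum F1 scores, for illustration only.
f1_scores = [0.99, 0.98, 0.97, 0.95, 0.93, 0.90, 0.88, 0.85, 0.87, 0.93]
print(weights[0] / weights[-1])  # ratio of bottom to top stratum weight
print(weighted_f1(f1_scores, weights))
```

Because the weights sum to one, the weighted score is a population-share-weighted average of the per-stratum scores, so the heavily populated bottom stratum dominates it even though every stratum contributes the same number of sampled paragraphs.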

We can simulate a classification task on an imbalanced distribution to test whether up-sampling improves the assessment of model performance. For the simulation, we have a population of 400,000 observations split into 10 strata based on the distribution in Figure A1. Observations sampled from each stratum are assigned a predicted and a true label based on the rates of predicted and true probabilities shown in Figure A2. The predicted probability is based on the LM's output, while the true probability is an estimate of the true positive rate based on spot checks of the real liaison data. For example, predictions below 0.6 in the real data are very rarely about wages, while most paragraphs observed discussing wages have a score above 0.9.

Figure A2: Simulation Probability Distributions - Two-panel line graph outlining simulated probability distributions. The left panel shows simulated predicted probabilities across deciles, with a linear increase in percentage values. The right panel shows simulated true probabilities over the same intervals, with a nonlinear distribution that rises sharply from around the 8th decile.

Share of simulated draws by probability decile

In this simulation exercise, as a baseline, we randomly sample 600 observations from the whole population and calculate our performance metrics. Then we apply stratified sampling by taking 60 random observations from each of the 10 strata. This ensures we up-sample observations that are more likely to have a positive true label. We account for the up-sampling by including the importance weighting of each stratum when calculating the performance metrics. We repeat this 1,000 times to calculate the mean and variance of each performance metric using both sampling methods.
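The exercise can be sketched roughly as follows for recall. This is an illustration under several assumptions, not the authors' code: the middle-stratum shares and all per-stratum rates are hypothetical stand-ins for Figures A1 and A2, predicted and true labels are drawn independently within a stratum, the population is treated as large enough to sample with replacement, and the stratified estimate is formed from importance-weighted confusion counts. It is not expected to reproduce the values in Table A1.

```python
import random
import statistics

# Population shares per stratum: only the bottom (about 70%) and top (7%)
# shares come from the text; the rest are hypothetical fillers summing to 1.
SHARES = [0.70, 0.10, 0.04, 0.025, 0.02, 0.015, 0.01, 0.01, 0.01, 0.07]
# Hypothetical per-stratum rates standing in for Figure A2.
PRED_RATE = [0.05 + 0.10 * i for i in range(10)]  # linear in the decile
TRUE_RATE = [0.0, 0.0, 0.0, 0.0, 0.01, 0.02, 0.05, 0.20, 0.60, 0.95]


def draw(stratum, rng):
    """Draw (predicted, true) labels independently at the stratum's rates."""
    return rng.random() < PRED_RATE[stratum], rng.random() < TRUE_RATE[stratum]


def recall(observations):
    """Weighted recall from (weight, predicted, true) triples; 0.0 if no positives."""
    tp = sum(w for w, p, t in observations if p and t)
    fn = sum(w for w, p, t in observations if t and not p)
    return tp / (tp + fn) if tp + fn > 0 else 0.0


def random_sample_recall(n, rng):
    """Recall from n observations drawn in proportion to the population shares."""
    strata = rng.choices(range(10), weights=SHARES, k=n)
    return recall([(1.0, *draw(s, rng)) for s in strata])


def stratified_recall(per_stratum, rng):
    """Recall from an equal draw per stratum, reweighted to population shares."""
    obs = []
    for s in range(10):
        w = SHARES[s] / per_stratum  # importance weight per sampled observation
        obs.extend((w, *draw(s, rng)) for _ in range(per_stratum))
    return recall(obs)


def simulate(reps, seed=0):
    """Mean and variance of recall across repeated samples, for both methods."""
    rng = random.Random(seed)
    rand = [random_sample_recall(600, rng) for _ in range(reps)]
    strat = [stratified_recall(60, rng) for _ in range(reps)]
    return {
        "random": (statistics.mean(rand), statistics.variance(rand)),
        "stratified": (statistics.mean(strat), statistics.variance(strat)),
    }


results = simulate(reps=200)  # the paper repeats 1,000 times
for method, (mean, var) in results.items():
    print(f"{method:>10}: mean recall {mean:.3f}, variance {var:.5f}")
```

Precision and F1 could be added in the same way by accumulating weighted false positives alongside the weighted true positives and false negatives.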

We find that a sample of 600 observations gives, on average, a sufficiently accurate estimate of the true performance metrics under both methods. However, the variance of each metric is lower under stratified sampling than under random sampling, especially for recall, where it is less than half as large (0.014 compared with 0.034; see Table A1). This demonstrates that our stratified sampling method gives more precise estimates given the large class imbalance in our data. This same method is used to up-sample wages paragraphs when validating our liaison wages measures.

Table A1: Simulation Performance Metrics
Calculated over 1,000 samples
Metric	Full sample	Random sampling (mean)	Random sampling (variance)	Stratified sampling (mean)	Stratified sampling (variance)
Recall	0.950	0.952	0.034	0.951	0.014
Precision	0.908	0.907	0.045	0.909	0.036
F1	0.928	0.928	0.030	0.929	0.021
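The size of the variance reduction for each metric can be read off Table A1 as a simple ratio. The short sketch below uses only the variances reported in the table; the variable names are illustrative.

```python
# Variances from Table A1: (random sampling, stratified sampling).
table_a1_variance = {
    "Recall": (0.034, 0.014),
    "Precision": (0.045, 0.036),
    "F1": (0.030, 0.021),
}

# Ratio of random-sampling variance to stratified-sampling variance.
ratios = {
    metric: random_var / stratified_var
    for metric, (random_var, stratified_var) in table_a1_variance.items()
}
for metric, ratio in ratios.items():
    print(f"{metric}: random-sampling variance is {ratio:.2f} times the stratified variance")
```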