When the ruler is made of the thing it measures: Multi-model evidence on AI occupational exposure scores


There is a circularity at the centre of the empirical AI-and-labour literature. To estimate how artificial intelligence is reshaping work, researchers ask an AI to score each occupation according to how much of its task content the AI could perform. The instrument and the phenomenon are the same kind of object. The procedure is now standard. Goldman Sachs’ widely cited estimate that AI could expose 300 million jobs to automation rests on exposure scores of exactly this kind (Briggs and Kodnani 2023). So do the IMF’s cross-country exposure analysis (Cazzaniga et al. 2024), the ILO’s Generative AI and Jobs Index (Gmyrek et al. 2025), the Yale Budget Lab’s projections (Gimbel et al. 2025), and PwC’s 2025 Global AI Jobs Barometer covering one billion job advertisements. The numbers anchored by these scores now sit in central bank communications, finance ministry briefings, board presentations, and even state-level K-12 guidance documents in the US (Arizona AI Alliance 2025).

In our paper (Yin et al. 2026), my co-authors and I ask a question the standard procedure leaves implicit. Suppose the same scoring exercise is run with a different AI. Does the result change?

The answer is yes, and by far more than the field has acknowledged. The flagship statistic from Eloundou et al. (2024) – the share of US occupations with more than half of their tasks at high direct exposure – ranges from 2.7% under Google’s Gemini 2.5 to 51.5% under Anthropic’s Claude 4.5 on identical task data. That is a nineteen-fold spread. GPT-4 (the model used in the original Eloundou study) places the figure at 3.8%; ChatGPT-5 places it at 20.3%. Same instructions, same O*NET task descriptions, same data pipeline. Only the AI rater varied.
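The flagship statistic is simple to define: for each occupation, count the share of its O*NET tasks a given model rates as high direct exposure, then report the fraction of occupations where that share exceeds one half. A minimal sketch, with hypothetical toy ratings standing in for the actual model outputs:

```python
# Sketch of the flagship statistic from Eloundou et al. (2024), computed per
# model rater. The data below are illustrative, not the paper's actual ratings.

def share_high_exposure(ratings):
    """ratings: {occupation: [bool per task]}, True = task rated high direct
    exposure. Returns the share of occupations with >50% of tasks high."""
    high = sum(
        1 for tasks in ratings.values()
        if sum(tasks) / len(tasks) > 0.5
    )
    return high / len(ratings)

# Two hypothetical model raters scoring identical task lists.
rater_a = {"occ1": [True, True, False],
           "occ2": [True, True, True],
           "occ3": [False, False, True]}
rater_b = {"occ1": [False, True, False],
           "occ2": [True, False, False],
           "occ3": [False, False, False]}

print(share_high_exposure(rater_a))  # 2 of 3 occupations -> ~0.667
print(share_high_exposure(rater_b))  # 0 of 3 occupations -> 0.0
```

Because the task lists and the threshold are held fixed, any gap between the two printed numbers is attributable to the rater alone, which is exactly the comparison the paper runs at scale.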

The pattern is sharper at the level of executive perception. Take management occupations, the kind of category any chief financial officer or workforce planner cares about. Under Claude, more than 80% of management occupations are classified as high-exposure. Under Gemini, fewer than one in five are. Under GPT-4, roughly one in four are. A workforce strategy memo that reads management roles as deeply at risk and one that reads them as moderately exposed are downstream of the same paper, the same data, and the same procedure. They differ only in which AI did the scoring.

These dispersion findings have a structural explanation, and the explanation is what makes them difficult to wave away. Each frontier model is its own instrument. Its exposure judgements are the joint product of an underlying labour market reality and the model’s training corpus, calibration choices, and reinforcement signals. None of those choices is wrong in any obvious sense, and none of them is small. The differences are also systematic rather than random: each model carries a consistent directional tilt across occupations, so the bias does not cancel with larger samples.

There is, in addition, a feedback channel that the standard measurement-error framework does not accommodate. The tasks where AI capability is advancing fastest are also the tasks generating the most training data for newer models. As the underlying technology evolves, the instrument that measures the technology’s effect on work evolves with it. The measurement and the phenomenon are not independent. This violates the mean-zero error assumption underlying classical sensitivity analyses, and it is the analytical reason the bias is irreducible from a single-model exercise.

A natural reaction is to ask whether downstream conclusions are also unstable. They are. When each model’s scores are plugged into the standard difference-in-differences employment specification, the point estimate of the AI-exposure effect flips sign across raters: positive under three of the four annotators, negative under the fourth. None of the four estimates reaches conventional statistical significance, but the qualitative conclusion a reader would draw differs across the four.
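To see where the rater enters the regression, a stylised version of such a specification (my notation, not necessarily the paper’s exact model) can be written as:

```latex
Y_{ot} = \alpha_o + \gamma_t
       + \beta \,\bigl(\mathrm{Exposure}^{(m)}_{o} \times \mathrm{Post}_t\bigr)
       + \varepsilon_{ot}
```

where $Y_{ot}$ is employment in occupation $o$ at time $t$, $\alpha_o$ and $\gamma_t$ are occupation and time fixed effects, $\mathrm{Post}_t$ flags the post-rollout period, and $\mathrm{Exposure}^{(m)}_{o}$ is the score produced by model rater $m$. The superscript is the whole point: $\hat\beta$ inherits it, so swapping the rater $m$ can change not just the magnitude but the sign of the estimated effect.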

The identity of the occupations most affected also shifts sharply across raters. Two analyses with different AI raters would arrive at different conclusions about both the magnitude and the direction of the labour market effect, and target different occupational categories on the same evidence.

For applied research and policy, the implication is straightforward in design and consequential in practice. Any analysis that conditions on AI-generated exposure scores, whether it appears in a peer-reviewed journal, an international institution’s working paper, or a corporate strategy memo, should report results from at least two or three different frontier AI models. Where conclusions converge across raters, the finding is likely robust (though other sources of error of course remain). Where they diverge, the divergence itself is informative: it tells us that the inference depends on a feature of the model rather than a feature of the labour market. The compute cost of doing this is roughly the same as running an analysis on three different cloud providers. The price of not doing it is a body of empirical work in which a portion of the cross-paper variation reflects which AI happened to be available when the analysis was run.
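The convergence check described above is mechanical once scores from several raters are in hand. A minimal sketch, with hypothetical model names, scores, and a hypothetical 0.5 classification threshold:

```python
# Sketch of a multi-rater robustness check: flag, per occupation, whether
# all model raters agree on the high/low-exposure classification.
# Model names, scores, and the threshold below are all illustrative.

def classify_agreement(scores_by_model, threshold=0.5):
    """scores_by_model: {model: {occupation: exposure score in [0, 1]}}.
    Returns {occupation: True if every rater gives the same label}."""
    occupations = next(iter(scores_by_model.values())).keys()
    agreement = {}
    for occ in occupations:
        labels = {scores[occ] > threshold for scores in scores_by_model.values()}
        agreement[occ] = len(labels) == 1  # single label -> raters converge
    return agreement

scores = {
    "model_a": {"manager": 0.8, "welder": 0.2},
    "model_b": {"manager": 0.3, "welder": 0.1},
    "model_c": {"manager": 0.9, "welder": 0.15},
}

print(classify_agreement(scores))
# {'manager': False, 'welder': True}
```

Here every rater calls welders low-exposure, so that classification is robust; the raters split on managers, so any inference about management roles is rater-dependent and should be reported as such.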

The point also generalises beyond AI and labour research. Wherever AI is asked to produce a number that anchors a consequential decision – in pricing, eligibility scoring, hiring, lending, programme planning – the model is not a neutral observer. It is an instrument whose calibration shifts as it is updated. Treating multi-model checks as a standard practice now, in this early period when the empirical literature is still consolidating, is far less costly than discovering the instability later.

The deeper philosophical point that economists may find worth holding onto is this: we have a long tradition of asking whether the rulers we use are well calibrated for the things we measure. The classical literature on measurement error treats the ruler as fixed. When the ruler is itself made of the thing it measures, when the technology that scores AI exposure is an instance of the technology whose effect we are studying, fixedness is an assumption, not a property. Our paper is a first quantitative reading of how much that assumption matters in the present application. Across four current frontier AI raters and on identical task data, it matters by a factor of nineteen.

References

Arizona AI Alliance (2025), “Arizona’s official genAI guidance for K-12” (May 2025 update).

Briggs, J, and D Kodnani (2023), “The potentially large effects of artificial intelligence on economic growth”, Goldman Sachs Global Investment Research, March 2023.

Cazzaniga, M, F Jaumotte, L Li, G Melina, A J Panton, C Pizzinelli, E Rockall, and M M Tavares (2024), “Gen-AI: Artificial intelligence and the future of work”, IMF Staff Discussion Note SDN/2024/001.

Eloundou, T, S Manning, P Mishkin, and D Rock (2024), “GPTs are GPTs: Labour market impact potential of LLMs”, Science 384(6702): 1306–08.

Gimbel, M, M Kinder, A Kendall, and M Lee (2025), “Evaluating the impact of AI on the labor market: Current state of affairs”, Yale Budget Lab.

Gmyrek, P, J Berg, K Kamiński, F Konopczyński, A Ładna, B Nafradi, K Rosłaniec, and M Troszyński (2025), “Generative AI and jobs: A refined global index of occupational exposure”, International Labour Organization.

PwC (2025), The fearless future: 2025 global AI jobs barometer.

Yin, M, H Vu, and C Persico (2026), “How (un)stable are LLM occupational exposure scores? Evidence from multi-model replication”, NBER Working Paper 35110.


