Writing code versus shipping code: Productivity effects across generations of AI coding tools

In 1987, Robert Solow famously quipped that “you can see the computer age everywhere but in the productivity statistics” (Solow 1987). Four decades later, the same concern animates the policy debate over generative AI. Experimental studies find that AI tools raise worker performance on specific tasks by 15–50% in customer support (Brynjolfsson et al. 2025), professional writing (Noy and Zhang 2023) and especially software development (Peng et al. 2023, Cui et al. 2026). Forecasts of AI’s aggregate impact nonetheless vary widely, from percentage points of additional annual productivity growth to barely measurable effects (Acemoglu 2025, Jones 2026; see Filippucci et al. 2024 for an overview of this debate on Vox). Early firm-level evidence likewise suggests only modest effects so far (Aldasoro et al. 2026).

A central reason for this divergence is a question that has been hard to test directly: do productivity gains on individual tasks translate into more final output? According to the ‘weak links’ or bottleneck hypothesis, they need not. If production consists of complementary stages and AI accelerates only some of them, final output remains limited by the weakest stage – the one humans still perform. This is the ‘O-ring’ logic of Kremer (1993), applied to AI by Aghion et al. (2019) and Jones (2026). In a recent paper (Demirer et al. 2026), we provide one of the first direct empirical tests of this hypothesis in software development, one of the earliest and most prominent domains of AI adoption.

Three generations of AI coding tools

Software development is an ideal setting for two reasons. First, it has already experienced three distinct generations of widely adopted AI tools: autocomplete (suggesting code as the developer types, available since 2021), sync agents (writing and editing code alongside the developer in real time, such as Claude Code), and async agents (working autonomously on an assigned task without developer oversight, such as GitHub’s Coding Agent or OpenAI’s Codex). Second, software production has a well-defined hierarchy of stages. Lines of code are bundled into commits, commits into pull requests (bundles of changes submitted for review and integration), pull requests into projects, and projects into shipped releases, so productivity can be measured at each stage of the chain.

We combine the public GitHub histories of more than 100,000 developers with internal usage records from Microsoft, which owns GitHub. For some tools, we observe adoption directly in subscription data; for others, we identify it from publicly visible traces of usage on GitHub (Claude Code, for instance, leaves recognizable footprints in commit histories). To estimate productivity effects, we use a matched event-study design: each adopter is compared to a control developer with near-identical activity exactly one year earlier, which avoids comparing adopters with ‘non-adopters’ who may in fact be quietly using AI themselves. Placebo tests with non-AI tools, flat pre-trends, and agreement between our autocomplete estimate and the field-experimental estimate of Cui et al. (2026) for the same tool in the same period all support a causal interpretation (we report these checks in detail in our paper).
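The matching logic can be illustrated with a minimal sketch. It is not the paper's estimator: the `Window` structure, the squared-distance criterion, the three-month windows, and all numbers are hypothetical choices made for illustration. Each adopter is paired with the candidate whose pre-period activity, measured one year earlier, is closest. The effect is then read off as the adopter's growth relative to the control's.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Window:
    """Monthly commit counts around a (pseudo-)adoption date. Hypothetical structure."""
    developer: str
    pre: list[int]   # months before the event
    post: list[int]  # months after the event


def distance(a: Window, b: Window) -> float:
    # Squared distance between pre-period activity paths (an assumed criterion).
    return sum((x - y) ** 2 for x, y in zip(a.pre, b.pre))


def match_control(adopter: Window, pool: list[Window]) -> Window:
    # Nearest neighbour on pre-period activity. The pool is assumed to hold
    # windows dated exactly one year before the adopter's event.
    return min(pool, key=lambda c: distance(adopter, c))


def relative_effect(adopter: Window, control: Window) -> float:
    # Growth of the adopter relative to growth of the matched control.
    adopter_growth = mean(adopter.post) / mean(adopter.pre)
    control_growth = mean(control.post) / mean(control.pre)
    return adopter_growth / control_growth - 1


# Made-up data: one adopter and two candidate controls from a year earlier.
adopter = Window("dev_a", pre=[10, 10, 10], post=[14, 14, 14])
pool = [
    Window("dev_b", pre=[10, 11, 9], post=[10, 10, 10]),
    Window("dev_c", pre=[30, 28, 31], post=[33, 30, 29]),
]
control = match_control(adopter, pool)
effect = relative_effect(adopter, control)
print(control.developer, round(effect, 2))
```

Drawing controls from an earlier calendar window, rather than from contemporaneous non-adopters, is what keeps undisclosed AI use out of the comparison group.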

Task-level gains grow with each generation

Each generation of tools delivers larger productivity gains than the last. Measured by commits, a common measure of coding activity, adopting autocomplete raises a developer’s output by roughly 40%. Adding sync agents takes the cumulative effect to roughly 140%, and adding async agents to roughly 180%. The gains are larger for less active developers but remain substantial across the entire activity distribution, and they grow over time in step with major model releases.

Writing code versus shipping code

These gains attenuate sharply at higher levels of the production hierarchy. Figure 1 summarises our estimates: combining all three generations of tools, output at the commit level roughly triples, and raw code volume rises by far more, but the same developers work on only about 50% more projects and ship only about 30% more releases. For sync agents alone, a more than sevenfold increase in lines of code becomes a 65% increase in pull requests and only a 20% increase in releases.
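One way to make the attenuation concrete is to convert the reported percentage gains into multipliers and express each stage's log gain as a share of the commit-level log gain. The sketch below uses the cumulative figures quoted in the text. The "pass-through" ratio is an illustrative summary statistic assumed here, not a quantity reported in the paper.

```python
import math

# Cumulative effects of adopting all three tool generations, as quoted in the text.
gains = {"commits": 1.80, "projects": 0.50, "releases": 0.30}

# A gain of +180% corresponds to a multiplier of 2.8, i.e. "roughly triples".
multipliers = {stage: 1 + g for stage, g in gains.items()}

# Assumed summary measure: each stage's log gain relative to the commit-level log gain.
pass_through = {
    stage: math.log(m) / math.log(multipliers["commits"])
    for stage, m in multipliers.items()
}

for stage in gains:
    print(f"{stage:9s} x{multipliers[stage]:.2f}  pass-through {pass_through[stage]:.2f}")
```

Under this measure, roughly a quarter of the commit-level log gain survives to the release stage.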

Figure 1 Productivity effects of AI coding tools across the production hierarchy

Notes: Matched event-study estimates of the cumulative effect of adopting AI coding tools on each layer of the software production hierarchy. Because more capable tools are adopted alongside earlier generations, the figure shows cumulative effects of adopting all tools up to and including a given generation.
Source: Demirer et al. (2026).

This attenuation is what a ‘weak links’ view of production predicts. In our model of software production, each stage’s output is combined with human effort at the next stage: AI-written code still needs to be reviewed, integrated, tested, and released by humans, whose expertise and judgment intermediate between AI output and the final product (Acemoglu et al. 2023). When stages are strong complements, even unbounded automation of one stage yields only bounded gains in final output. Calibrating the model to our estimates points to strong complementarity between AI output and downstream human effort. The binding constraint on software output is shifting from writing code to the stages humans still perform: reviewing, integrating, testing, and releasing it.
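The bounded-gains result can be seen in a small numerical sketch. It assumes a two-input CES aggregator of upstream code and downstream human effort with a negative substitution parameter, which is the case of strong complements. The functional form, the `share` and `rho` values, and the function names are illustrative assumptions, not the calibrated model from the paper.

```python
def ces_output(code: float, human: float, share: float = 0.5, rho: float = -1.0) -> float:
    # CES aggregator: (share * code^rho + (1 - share) * human^rho)^(1 / rho).
    # rho < 0 means the two inputs are complements.
    return (share * code**rho + (1 - share) * human**rho) ** (1 / rho)


def output_ceiling(human: float, share: float = 0.5, rho: float = -1.0) -> float:
    # Limit of ces_output as code grows without bound (valid for rho < 0):
    # the code term vanishes and only the human-effort term remains.
    return (1 - share) ** (1 / rho) * human


baseline = ces_output(code=1.0, human=1.0)
for code in (1.0, 2.8, 10.0, 1e9):
    gain = ces_output(code, 1.0) / baseline - 1
    print(f"code x{code:g}: final output gain {gain:.1%}")
print(f"ceiling on final output: {output_ceiling(1.0):.2f}x baseline")
```

With these assumed parameters, final output can at most double however much code is produced, because human effort at the next stage is held fixed.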

More apps, but no more usage

We also examine whether these gains appear in aggregate data. On GitHub they do: the number of completed code changes (merged pull requests) and newly created projects (repositories) both accelerate from early 2025, coinciding with the growth of agent-authored commits. By early 2026, more than 5% of public commits are attributable to a single tool, Claude Code, a lower bound on AI’s true share since most AI use leaves no visible trace.

GitHub activity is still a developer-side measure, however; it does not tell us whether new software reaches consumers or how much it is used. To address this, we assemble monthly panels for four major application marketplaces (the Apple App Store, Google Play, the Chrome Web Store, and SourceForge), observing each app’s entry date and early usage. New app releases have risen, though unevenly across marketplaces. Monthly new iOS apps roughly double between early 2025 and April 2026 (Figure 2), new Chrome extensions accelerate as well, and Google Play breaks from a years-long decline, while SourceForge shows little change. A moderate and uneven increase is in line with our developer-level estimates: releases sit at the top of the production hierarchy, where the effects of AI should be smallest.

Figure 2 New iOS apps and their usage

Notes: Monthly counts of newly released iOS applications (left) and total usage accumulated by each monthly cohort of new apps in its first three months as proxied by total ratings (right). Shaded region marks the agentic-coding era (February 2025 onward). The paper reports analogous series for Google Play, the Chrome Web Store, and SourceForge; the pattern is cleanest for iOS.
Source: Demirer et al. (2026).

Despite this expansion in supply, total engagement with each monthly cohort of new apps in its first three months is flat or declining on every marketplace, and the share of new apps that fail to reach even a modest audience has risen. The flat aggregate rules out market-expansion effects, and the rising share of low-usage apps offers little support for a ‘long-tail’ channel in which many niche apps each find a small audience, a channel that Reimers and Waldfogel (2026) do find for AI-assisted books on Amazon.

This pattern admits two interpretations, and our data currently cannot distinguish between them. The marginal AI-era apps may simply be of lower quality: the entry cost fell, so less promising projects now clear the publication bar. Alternatively, there may be one further bottleneck beyond production, since consumer attention and discovery are scarce and need not expand with the number of entrants. Under either interpretation, the expansion in supply has not yet translated into measurably more software consumption.

Looking ahead

Our results have a direct implication for the AI productivity debate: when production stages are complementary, task-level gains cannot simply be extrapolated to aggregate output. In software, AI’s most advanced application domain, a 180% gain in coding activity becomes a 30% gain in shipped releases and, so far, no detectable gain in usage. Growth projections that apply task-level estimates without adjusting for downstream bottlenecks will therefore overstate AI’s near-term impact.

The attenuated gains we estimate are nevertheless economically meaningful. A 50% increase in projects and a 30% increase in shipped releases would be a remarkable effect for any workplace technology, even if these gains are not yet visible in usage. And they may still grow. Tool providers are actively working to shift the bottlenecks we document, for instance by promoting AI tools for code review, and capable agentic tools have been widely available only since 2025. Our estimated effects rise with each major model release, so the impact on final output may increase as the technology matures and diffuses.

The same logic suggests which margins to monitor going forward. AI’s ultimate effect on software output will depend on whether these efforts succeed in easing the downstream stages: producing code that requires less human review, automating integration and testing, and improving discovery and adoption on the consumer side. The history of general-purpose technologies suggests that bottlenecks do eventually shift (Brynjolfsson et al. 2021). Until they do, Solow’s quip applies to generative AI with renewed force: one can see the AI age everywhere in the code, but only partly in the output statistics.

References

Acemoglu, D (2025), “The Simple Macroeconomics of AI”, Economic Policy 40(121): 13–58.

Acemoglu, D, D Autor and S Johnson (2023), “How AI can become pro-worker”, VoxEU.org, 4 October.

Aghion, P, B F Jones and C I Jones (2019), “Artificial Intelligence and Economic Growth”, in The Economics of Artificial Intelligence: An Agenda, University of Chicago Press.

Aldasoro, I, L Gambacorta, R Pál, D Revoltella, C Weiss and M Wolski (2026), “How AI is affecting productivity and jobs in Europe”, VoxEU.org, 17 February.

Brynjolfsson, E, D Li and L Raymond (2025), “Generative AI at Work”, Quarterly Journal of Economics 140(2): 889–942.

Brynjolfsson, E, D Rock and C Syverson (2021), “The Productivity J-Curve: How Intangibles Complement General Purpose Technologies”, American Economic Journal: Macroeconomics 13(1): 333–372.

Cui, Z, M Demirer, S Jaffe, L Musolff, S Peng and T Salz (2026), “The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers”, Management Science, forthcoming.

Demirer, M, L Musolff and L Yang (2026), “Writing Code vs. Shipping Code: Productivity Effects Across Generations of AI Coding Tools”, NBER Working Paper 35275.

Filippucci, F, P Gal and M Schief (2024), “Miracle or myth: Assessing the macroeconomic productivity gains from artificial intelligence”, VoxEU.org, 8 December.

Jones, C I (2026), “A.I. and Our Economic Future”, NBER Working Paper 34779.

Kremer, M (1993), “The O-Ring Theory of Economic Development”, Quarterly Journal of Economics 108(3): 551–575.

Noy, S and W Zhang (2023), “Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence”, Science 381(6654): 187–192.

Peng, S, E Kalliamvakou, P Cihon and M Demirer (2023), “The Impact of AI on Developer Productivity: Evidence from GitHub Copilot”, arXiv preprint arXiv:2302.06590.

Reimers, I and J Waldfogel (2026), “AI and the Quantity and Quality of Creative Products: Have LLMs Boosted Creation of Valuable Books?”, NBER Working Paper 34777.

Solow, R (1987), “We’d Better Watch Out”, New York Times Book Review, 12 July.
