Last November, the House Oversight Committee had just released 20,000 pages of documents from the estate of Jeffrey Epstein, and Luke Igel and some friends were clicking around, trying to follow the threads of conversation through garbled email threads and a PDF viewer that was, frankly, “gross.” In the coming months, the Department of Justice would release its own batches of files, more than three million of them — again, all PDFs.
This was a problem. While the Department of Justice had run optical character recognition over the documents, the results were poor, Igel said, rendering the files more or less unsearchable.
“There was no interface the government put out that allowed you to actually see any sort of summary of things like flights, things like calendar events, things like text messages. There was no real index. You just had to get lucky and hope that the document ID that you were looking at contains what you’re looking for,” said Igel, cofounder of the AI video editing startup Kino. What if, Igel thought, they built a Gmail clone to view and search all this correspondence in a more intuitive way?
To do this, they would need to extract the information contained in PDFs, which is far less straightforward than it might sound. Despite rapid progress in AI’s ability to build complex software and solve advanced physics problems, the ubiquitous format of PDF remains something of a grand challenge. Edwin Chen, the CEO of the data company Surge, includes it among AI’s “unsexy failures” limiting real-world usefulness. Last year, he found that even state-of-the-art models asked to extract information from a PDF will instead summarize it, confuse footnotes with body text, or outright hallucinate contents. In a half-joking timeline of AI development, the researcher Pierre-Carl Langlais placed “PDF parsing is solved!” shortly before AGI.
First, Igel’s friend, the “tech jester” Riley Walz, used his remaining credits on Google’s Gemini. It only worked reliably for some of the cleanest scans, and would be prohibitively expensive to run on millions of documents anyway, so Igel reached out to his former MIT classmate Adit Abraham, who happened to work in the office above his, where he ran a PDF-parsing AI company called Reducto.
Reducto, one of several companies trying to solve PDFs, was able to extract information from email threads with cryptic decoding errors, heavily redacted call logs, and low-quality scans of handwritten flight manifests. After the data was exported in a usable format, Igel and Walz went on a building spree, creating essentially a full Epstein-themed app ecosystem: Jmail, an unsettling, searchable prototype of Epstein’s inbox; Jflights, an interactive globe crisscrossed with flight paths, each one clickable to view underlying PDFs of flight data, passenger manifests, and scanned email invitations; Jamazon, to search Epstein’s Amazon purchases; and Jikipedia, to search businesses and people who turn up in the files, citing, naturally, more PDFs.
“That’s where the magic of extracting information out of PDFs became real for me,” Igel said. “It’s going to completely change the way a lot of jobs happen.”
PDFs are notoriously difficult for machines to parse, in part, because they were never meant to be read by them. The format was developed by Adobe in the early 1990s as a way to reproduce documents while preserving their precise visual appearance, first when printing them on paper, then later when depicting them on a screen. Where formats like HTML represent text in logical order, PDF consists of character codes, coordinates, and other instructions for painting an image of a page.
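The painting-instruction model is easiest to see in the raw format. Below is a hand-written fragment of a PDF content stream (simplified and hypothetical, not a complete file), paired with a naive extractor that pulls out strings in stream order. Nothing in the stream says which string a human would read first; the file only records where each one is painted.

```python
import re

# A sketch of a PDF content stream: text is painted at explicit coordinates.
# ("%" begins a comment in PDF syntax.)
content_stream = b"""
BT
/F1 12 Tf              % select font F1 at 12 points
1 0 0 1 300 700 Tm     % set text position to x=300, y=700
(world) Tj             % paint the string "world"
1 0 0 1 72 700 Tm      % set text position to x=72, y=700
(Hello) Tj             % paint the string "Hello"
ET
"""

# A naive extractor just collects strings in the order the generating
# software happened to emit them, which need not be reading order.
strings = [s.decode() for s in re.findall(rb"\((.*?)\)\s*Tj", content_stream)]
print(strings)  # ['world', 'Hello'] -- stream order, not reading order
```

Real extractors have to recover reading order from the coordinates themselves, which is exactly where things start to go wrong.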
Optical character recognition (OCR) can turn those pictures of words back into text computers can use, but a naive pass over a PDF where text is displayed in multiple columns — as many academic papers are — will plow ahead left to right and create an unintelligible jumble. Modern OCR tools are designed to detect and correct for these sorts of formatting variations, but tables, images, diagrams, captions, footnotes, and headers all present further obstacles. If you give an AI assistant like ChatGPT a PDF, it will cycle through a variety of these tools, sometimes failing, sometimes passing the PDF to a large vision model to perform OCR, sometimes hallucinating, and generally taking a very long time and using a lot of computing power for uneven results.
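The multi-column failure mode is easy to reproduce. A minimal sketch, using made-up OCR word boxes for a two-column page: sorting purely top to bottom interleaves the columns, while splitting at the column gutter first recovers the intended reading order.

```python
# Hypothetical OCR word boxes: (x, y, word), with y increasing down the page.
# Two columns, each meant to be read top to bottom.
words = [
    (72, 100, "PDF"),    (300, 100, "parsing"),
    (72, 120, "is"),     (300, 120, "still"),
    (72, 140, "hard."),  (300, 140, "unsolved."),
]

# Naive reading order: sweep top to bottom, left to right across the page.
by_position = sorted(words, key=lambda b: (b[1], b[0]))
naive = " ".join(w for _, _, w in by_position)
print(naive)  # "PDF parsing is still hard. unsolved." -- columns interleaved

# Column-aware order: split at the gutter (x >= 200 here), then read each column.
left = " ".join(w for x, _, w in by_position if x < 200)
right = " ".join(w for x, _, w in by_position if x >= 200)
print(left + " " + right)  # "PDF is hard. parsing still unsolved."
```

Real layout analysis has to find that gutter automatically, and the same page may also contain tables, captions, and footnotes that break the column assumption.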
“The key issue is that they cannot recognize editorial structure,” said Langlais. “It’s all fine while it’s relatively simple text, but then you’ve got all these tables, you’ve got forms. A PDF is part of some kind of textual culture with norms that it needs to understand.”
A further problem that arises from and compounds PDF’s inherent difficulty is that models rarely train on them. This has begun to change, partly because AI developers are increasingly desperate for high-quality data, and PDFs contain a disproportionate amount of it. Government reports, textbooks, academic papers — all PDFs. “PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models,” wrote researchers at the Allen Institute for AI last year in a paper announcing a new specialized PDF-reading model.

“The lore has it that the very first PDF ever was an IRS 1040,” said Duff Johnson, CEO of the PDF Association, the industry organization that helps develop the PDF global standard, ISO 32000-2:2020, itself a PDF nearly a thousand pages long. In 1994, the IRS wanted a way to share forms that were absolutely consistent without printing and mailing every possible document, so it mailed CDs full of PDFs instead. From there, PDF spread with email to become a fundamental component of digital work. Book publishers sending manuscripts to the printer, patent applicants submitting diagrams of new devices, anyone who needed to share a document that would look the same to whomever received it turned to PDF.
“There’s no other technology solving the problem the PDF solves,” said Johnson. Websites are temporary, appearing differently depending on the browser and mediated by CSS. Links rot. Word docs change depending on your machine and can be edited and overwritten. A PDF is the same no matter who opens it, when, or how.
“That’s what engineering companies need. That’s what lawyers need. That’s what governments need. That’s what anybody who’s doing anything in the world, who has records to maintain, they need that,” Johnson said. “Earlier today I opened up a PDF from 1995. I didn’t worry about it. I just opened it. It worked fine. It worked perfectly. I would expect no less.” (It was a PDF about PDFs.)
There has been a shift over the last year or so toward specialized PDF-parsing models, said Luca Soldaini, a researcher at the Allen Institute for AI who worked on its PDF model, olmOCR. The team trained a vision language model — like a large language model, but with pixels instead of word tokens — on about 100,000 PDFs: public domain books, academic papers, brochures, and documents from the Library of Congress with human-written transcriptions. The model was further fine-tuned on specific problem areas, like parsing tables without mixing up the rows and columns.
“If text is large on the page, the model will learn to say, ‘Oh, that’s probably a header,’” said Soldaini. The model was the most popular one the institute released last year, Soldaini said, rivaling the institute’s generalist models. A PDF reading AI doesn’t capture the spotlight like those models, Soldaini said, but people are actually using it.
A few months later, researchers at Hugging Face, the company that runs a popular open-source AI platform, had just published a 5 billion-document dataset for training multilingual models and were thinking about what to do next. They had already processed the whole of Common Crawl, the enormous archive of mostly HTML text scraped from the web that forms the foundation of many large language models. Like many AI researchers, Hugging Face’s Hynek Kydlíček recalled, they were wondering whether they had run out of easily available data.
“We thought, let’s look at the Common Crawl and, like, maybe there is more stuff we just haven’t seen,” said Kydlíček. Indeed, there was: roughly 1.3 billion PDFs. “That’s how we figured out that PDFs could be actually a super big and super high-quality source we can still train on,” Kydlíček said. “But the format of PDFs is, like, super super hard to extract text from.”
Kydlíček and his collaborators rigged up a system that separated PDFs into easy to parse — mostly text — and difficult to parse, full of images and charts. The hard PDFs were sent to a version of olmOCR that had been modified by Reducto, called RolmOCR. After they stripped out the PDFs of horse racing results that made up an inexplicably large quantity of the corpus, the team declared they had “liberated three trillion of the finest tokens,” now available for model training.
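That triage step can be sketched with a simple heuristic. The thresholds and page fields below are illustrative assumptions, not Hugging Face's actual pipeline: pages that already carry a usable text layer take the cheap path, while image-heavy pages go to a vision model.

```python
def route_pdf(pages):
    """Split pages into cheap-to-parse and hard-to-parse buckets.

    pages: list of dicts like {"text": str, "image_area_ratio": float},
    where "text" is the embedded text layer (if any) and "image_area_ratio"
    is the fraction of the page covered by images. Both thresholds below
    are made-up illustrative values.
    """
    easy, hard = [], []
    for page in pages:
        # Plenty of embedded text and few images: trust the text layer.
        if len(page["text"]) > 200 and page["image_area_ratio"] < 0.2:
            easy.append(page)
        else:
            hard.append(page)  # send to an OCR/vision model instead
    return easy, hard

easy, hard = route_pdf([
    {"text": "A" * 500, "image_area_ratio": 0.05},  # born-digital report page
    {"text": "", "image_area_ratio": 0.9},          # scanned chart
])
print(len(easy), len(hard))  # 1 1
```

The payoff of the split is cost: the expensive vision model only ever sees the pages that actually need it.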
Yet parsing PDFs well enough for model training is one thing. Extracting their contents with the degree of accuracy demanded by lawyers and engineers is another. When the Hugging Face team ran their first tests, they found their model would invent text where there wasn’t any, filling blank pages with nonsense and describing images and art. They trained it to correct these errors, but it’s impossible to anticipate every formatting oddity or off-kilter scan.
“It’s solved in like 98 percent of cases, and like in many areas you always have this problem of getting these last 2 percent,” Kydlíček said. “I would say OCR is one of the best economic use cases for visual language models, so there are a lot of eyes on it right now, a lot of people throwing a lot of resources onto this. So I’m very certain that we will improve fairly fast, but because all these language models are probabilistic, there is just no way to guarantee it will be correct.”
One of the teams doing the best work, Kydlíček said, is Reducto, the company Igel is using to parse the Epstein files. Abraham cofounded the company as a service that managed customers’ long-term histories with language models, similar to the “memory” feature that is now standard in chatbots. Abraham kept getting requests to manage people’s files as well, which naturally were in the form of PDFs. He found working with them to be “shockingly hard.”
“One of our core intuitions was all these documents were made for humans like you and I to interpret, and there’s a lot of visual information here that we take for granted, like that every gap between two paragraphs is me telling you, ‘Hey, this is a new idea.’ Every indentation is me telling you, ‘Hey, this is a sub idea of the parent idea.’ The question was like, how do you encode all of that context?”
Much of the team had a background in self-driving vehicles, where computer vision models “segment” data into entities like car, pedestrian, dumpster. They took a similar approach to PDFs, using a model to first divide the page into headers, tables, footnotes, and so on, before passing them to other specialized models for parsing. When they posted about their approach in early 2024, the response was immediate.
“This wasn’t supposed to be a pivot,” said Abraham. Other developers reached out to say that their progress had been stymied by PDFs. “It kind of spiraled from there.”
Reducto now uses a growing assortment of small, specialized models taking multiple passes to parse a PDF. When the segmenting model detects a table, it goes to a table-parsing model. If a chart is detected, different elements get sent to different models: one trained to extract axes, another to read legends, and so on. A vision language model then takes a pass over the output to correct errors. Using this approach, Reducto is able to turn charts into spreadsheets with a high degree of accuracy, something Abraham says the company’s financial clients have long requested and that stymies far larger frontier models.
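The routing pattern Abraham describes can be sketched as a dispatch table. The region types and parser functions here are placeholders, not Reducto's actual models: a segmenter labels each region of a page, and each label maps to a specialist parser.

```python
# Placeholder specialist parsers, one per region type.
def parse_table(region):
    return {"kind": "table", "rows": region["data"]}

def parse_chart(region):
    return {"kind": "chart", "axes": region["data"]}

def parse_text(region):
    return {"kind": "text", "body": region["data"]}

# The dispatch table: segment label -> the model trained for that label.
SPECIALISTS = {"table": parse_table, "chart": parse_chart, "text": parse_text}

def parse_page(regions):
    # Each segmented region goes to its specialist; unknown region
    # types fall back to the generic text parser.
    return [SPECIALISTS.get(r["type"], parse_text)(r) for r in regions]

page = [
    {"type": "text", "data": "Quarterly results..."},
    {"type": "table", "data": [["Q1", 10], ["Q2", 12]]},
]
kinds = [out["kind"] for out in parse_page(page)]
print(kinds)  # ['text', 'table']
```

The design choice mirrors the self-driving analogy in the text: rather than one model that must handle everything, each narrow model only has to be good at its own slice of the page.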
Still, like self-driving cars, PDFs have a long tail of unusual challenges.
“There’s a big difference between getting a car to stay in a lane versus getting a car to handle whatever would show up on the street, and we see with PDFs a similar thing. I’ve seen the most insane documents you could imagine,” said Abraham. PDF files that contain other PDFs, legal documents with passages sometimes underlined and sometimes crossed out, faxes of medical forms that doctors have scrawled over and drawn lines connecting ideas on different edges of the page. “I don’t think PDFs are a fully solved problem. I wish that were the case. We’re close, but there’s still plenty to do.”
There will be no shortage of PDFs to parse. The format does not appear to be going anywhere. Why would it, asked Johnson of the PDF Association, with some incredulity at the very thought. Companies once tried to unseat PDF, Johnson said, but their products are “now a footnote in history,” while PDFs continue to proliferate.
“Look at the Google Trends for PDF,” Johnson said. It shows a steadily rising curve (with dips in August) year after year. “No other technology looks like that. More and more people over time are including PDF in their searches, because that tends to be where the high-quality content is.
“What’s going to happen is that all the world’s systems will instead understand and use PDF better and better,” Johnson said. “The AI companies didn’t focus on PDF, because PDF is very hard, until they realized that, well, it turns out a lot of the really high-quality stuff is in fact in PDF, and so now we have to deal with it.”