How We Digitized 300,000 Pages of Classical Chinese with AI Vision Agents

April 3, 2026

On woodblock prints, parallel subagents, and the point where OCR stops correcting and starts fabricating. Also: why downsampling saves you 75% and nobody tells you.

The problem with old books

Here is a thing nobody warns you about when working with classical Chinese texts: the sourcing is the hard problem. Not the translation. Not the NLP. The sourcing.

We set out to build a structured digital corpus of 23 classical texts — from the 130-chapter Records of the Grand Historian (史記) to a 4,096-verse oracular poem collection (焦氏易林) to a 2,783-page cosmological treatise (皇極經世書). Everything from military strategy to metaphysics. The corpus would feed three products: a Warring States simulation game, a Book of Changes reference site, and a divination app.

The translation pipeline was the part we'd planned for. The part where you can't get the raw text off the internet without tripping a CAPTCHA on your eleventh request — that was the part we hadn't.1

The easy wins and the first walls

The first texts were straightforward. Military classics like Sunzi's Art of War (孫子兵法), Wuzi (吳子), and Liutao (六韜) had clean transcriptions on zh.wikisource.org. Fetch the HTML, parse it into markdown with citation metadata, move on.

Then the walls appeared.

ctext.org — the largest database of pre-modern Chinese texts in existence, over five billion characters — is the gold standard for transcribed text. Carefully collated against specific historical editions, with Siku Quanshu classification metadata right on the page. But it aggressively blocks automated access. After about 10-20 requests, CAPTCHAs lock you out. The API works but has its own rate limits. zh.wikisource.org started returning 403s to our fetching tools entirely — we had to fall back to curl with browser-like User-Agent headers.
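The User-Agent fallback is easy to reproduce. A minimal sketch using Python's standard library (the header string, backoff timings, and retry count here are illustrative, not the exact values we used):

```python
import time
import urllib.error
import urllib.request

# Library-default agents ("Python-urllib/3.x") drew 403s; a
# browser-like User-Agent string did not. Values are illustrative.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "zh-TW,zh;q=0.9",
}

def build_request(url: str) -> urllib.request.Request:
    """Attach browser-like headers so the request isn't rejected outright."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)

def fetch(url: str, retries: int = 3, backoff: float = 5.0) -> str:
    """Fetch a page politely: few retries, long pauses between attempts."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(build_request(url), timeout=30) as resp:
                return resp.read().decode("utf-8")
        except urllib.error.HTTPError as err:
            if err.code in (403, 429) and attempt < retries - 1:
                time.sleep(backoff * (attempt + 1))  # back off before retrying
            else:
                raise
```

Slow, deliberate fetching with long pauses is also just good manners toward sites maintaining freely accessible corpora on their own server budgets.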

We quickly developed a three-tier sourcing hierarchy:

  1. ctext.org — most reliable text, but rate-limited
  2. arteducation.com.tw (中華古詩文古書籍網) — a Taiwanese classical text site with chapter-per-page HTML, no rate limits, decent transcription quality
  3. PDF scans via AI vision — the nuclear option, for when no digital transcription existed at all

For the Strategies of the Warring States (戰國策, 33 juan) and Records of the Grand Historian (史記, 130 chapters), we mixed sources: ctext.org for what we could get before the CAPTCHAs kicked in, arteducation.com.tw for bulk extraction, manual verification against known editions. Not glamorous. Effective.

When OCR fails: the woodblock print problem

Not every text has a digital transcription. The Huangji Jingshi Shu (皇極經世書) — Shao Yong's 11th-century cosmological opus mapping a 129,600-year cosmic cycle — exists primarily as woodblock-printed pages in the Qing dynasty's Complete Library of the Four Treasuries (欽定四庫全書). Zhejiang University Library had digitized these as 600 DPI PDF scans through the CADAL project, a partnership between Chinese academic libraries and the Internet Archive that has digitized over 500,000 texts. The scans were available on Archive.org.

Our first instinct was standard OCR. Split multi-page PDFs into individual page images, run them through OCR models.

The results were catastrophic. We graded the output 3 out of 10.

The Zhouyi Zhushu (周易注疏) was the worst case. It uses a multi-layer commentary layout — the original classic in large characters, Wang Bi's commentary in medium characters, Kong Yingda's subcommentary in small double-column characters, and Lu Deming's phonetic glosses interleaved in even smaller characters. Four layers of text, four different font sizes, all on the same page. OCR was essentially useless:

  • Massive text duplication: The OCR re-read overlapping page regions, producing the same passage 2–4 times with slight variations
  • Layer confusion: Content from the 經 (classic), 注 (commentary), 疏 (subcommentary), and 音義 (phonetic glosses) bled across markers randomly
  • Garbled character sequences: Many passages were nonsensical strings that didn't form valid Classical Chinese

This is consistent with recent benchmarks. The AncientDoc benchmark — the first systematic evaluation of vision-language models on Chinese ancient documents — found that multi-layer commentary layouts remain among the hardest cases for both traditional OCR and modern VLMs, precisely because the spatial relationships between text layers carry semantic meaning that pixel-level pattern matching can't decode.2

But the real lesson came when we tried to fix the bad OCR.

The fabrication threshold

When we asked an LLM to "clean up" the OCR output, it silently reconstructed the text from its training data rather than faithfully correcting the scan. The "cleaned" output represented a generic composite text, not the specific Siku Quanshu edition we were trying to digitize.

We caught this because the subcommentary was suspiciously short. The actual Kong Yingda text — the 疏 layer, the most degraded in our OCR — is far longer than what was produced. The model had generated a plausible-looking abridgment. From its training data. Without telling us.

This is worth stating as a hard rule: when OCR quality drops below roughly 5/10, "cleanup" becomes "reconstruction" — which is fabrication, not editing. There's a threshold below which the model has so little signal from the source image that it's effectively writing from memory. The output looks scholarly. It isn't. It represents no specific edition. And if you're working with texts that have a 2,000-year editorial history, "which edition?" is not a pedantic question.3

The vision agent pipeline

The breakthrough was realizing we didn't need traditional OCR at all.

Modern vision-language models read images natively, with contextual understanding that lets them handle complex layouts, variant characters, and annotation conventions that defeat statistical pattern matching. As one recent survey puts it, multimodal LLMs represent "a fundamental rethinking where models understand documents as unified visual-linguistic entities rather than treating text extraction as a computer vision task with linguistic post-processing."

The pipeline:

1. Source. Download CADAL scans from Archive.org. These follow the ID pattern 06xxxxxx.cn — Zhejiang University Library's digitization of the Siku Quanshu manuscript copies, typically at 600 DPI.

2. Split. Break multi-page PDFs into individual page images using pdfseparate from poppler-utils.
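Steps 1–2 are plain poppler plumbing. A sketch wrapped in Python: `pdfseparate` yields single-page PDFs, and rasterizing those for the VLM is the job of its sibling `pdftoppm`. The file-naming scheme here is an assumption of this sketch, not anything the tools mandate:

```python
import subprocess
from pathlib import Path

def separate_cmd(pdf: Path, out_dir: Path) -> list[str]:
    # %03d is pdfseparate's printf-style page-number placeholder
    return ["pdfseparate", str(pdf), str(out_dir / "page-%03d.pdf")]

def render_cmd(page_pdf: Path, out_stem: Path, dpi: int = 300) -> list[str]:
    # pdftoppm writes <out_stem>-N.png at the requested resolution
    return ["pdftoppm", "-png", "-r", str(dpi), str(page_pdf), str(out_stem)]

def split_and_render(pdf: Path, out_dir: Path, dpi: int = 300) -> None:
    """Split a multi-page scan into pages, then rasterize each page."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(separate_cmd(pdf, out_dir), check=True)
    for page in sorted(out_dir.glob("page-*.pdf")):
        subprocess.run(render_cmd(page, page.with_suffix(""), dpi), check=True)
```

Both tools ship with poppler-utils on every major package manager.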

3. Downsample. This was the key economic insight. VLMs process images by tiling them into 512×512 pixel chunks, each consuming roughly 170 tokens. A 600 DPI page produces 4x more tiles than 300 DPI — meaning 4x the cost. But higher resolution doesn't improve transcription quality once characters exceed about 30×30 pixels.

For standard woodblock body text, 300 DPI was sufficient. We kept 400-600 DPI only for pages with tiny interlinear annotations. Estimated savings: ~75% token reduction with no quality loss for standard text. Nobody tells you this. You find out when your first batch costs four times what you budgeted.4
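The tile arithmetic is worth making concrete before a batch run. A back-of-envelope estimator, assuming simple edge-aligned 512×512 tiling at ~170 tokens per tile; real providers resize and tile by their own rules, so treat the output as an order-of-magnitude guide:

```python
import math

TILE = 512             # assumed VLM image tile edge, in pixels
TOKENS_PER_TILE = 170  # approximate token cost per tile

def page_tokens(width_in: float, height_in: float, dpi: int) -> int:
    """Estimate the VLM token cost of one scanned page at a given DPI.

    Uses ceiling division per axis, so partial tiles at the page edge
    are billed as full tiles.
    """
    cols = math.ceil(width_in * dpi / TILE)
    rows = math.ceil(height_in * dpi / TILE)
    return cols * rows * TOKENS_PER_TILE
```

For a 6×9-inch page this puts 600 DPI at roughly 3–4x the cost of 300 DPI, which is the gap the downsampling pass recovers.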

4. Chunk into 15-page batches. Each batch became a task for one AI vision agent. Fifteen pages was the sweet spot: enough context for the agent to understand text flow and catch page-boundary artifacts, small enough for maximum parallelism.
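The batching itself is a one-liner; the judgment is in the size parameter, not the code. A sketch:

```python
def chunk_pages(pages: list[str], size: int = 15) -> list[list[str]]:
    """Group page images into fixed-size batches, one batch per vision agent.

    The final batch holds whatever remains: 677 pages at size 15 gives
    45 full batches plus a 2-page tail, 46 batches total.
    """
    return [pages[i:i + size] for i in range(0, len(pages), size)]
```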

5. Dispatch parallel subagents.

This is where things got interesting.

46 agents, 677 pages, 15 minutes

The Jiaoshi Yilin (焦氏易林, "Forest of Changes") was the proving ground. It's a Han dynasty collection of 4,096 four-line oracular poems — one for each possible hexagram-to-hexagram transformation in the I Ching. Attributed to Jiao Yanshou around 40 BC, though scholars have debated the authorship for centuries — some credit Cui Zhuan, who served as governor during Wang Mang's reign. The Siku Quanshu edition spans 677 pages across 4 volumes.

No modern digital transcription existed for the Siku edition specifically.

We split 677 pages into 46 chunks of 15 pages each and dispatched 46 parallel Claude vision agents simultaneously. Each agent read its 15 pages of woodblock-print PDF, transcribed the classical Chinese, inserted [p.NNN] page markers, identified hexagram transformation boundaries, and preserved small-character annotations in parentheses.

The entire extraction completed in roughly 15 minutes wall-clock time. 46 chunks, ~21,000 lines, 404KB of transcribed text.

This isn't incrementally better than sequential processing. It's a different kind of workflow entirely. MIT Technology Review's 2024 AI Systems report found that parallel AI processing architectures achieve 75% faster time-to-insight compared to sequential approaches. In our case, the improvement was closer to 46x — one agent per chunk, all running simultaneously, wall-clock time determined by the slowest agent rather than the sum of all agents.5
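The dispatch pattern is ordinary concurrent fan-out. A sketch using a thread pool, with `transcribe` standing in for the call out to a vision agent (any function that takes a chunk of page images and returns its transcription):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

def dispatch(chunks: Sequence[list], transcribe: Callable,
             max_workers: int = 46) -> list:
    """Fan one transcription task per chunk out to a worker pool.

    pool.map preserves input order, so results reassemble positionally
    even though chunks finish at different times.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transcribe, chunks))
```

With real agent calls, the workers spend nearly all their time blocked on network I/O, which is exactly the workload a thread pool handles well; wall-clock time collapses to roughly the slowest chunk.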

Edition comparison as scholarship

We then cross-referenced the vision extraction against ctext.org's version — a different, modern collated edition — and documented the discrepancies:

  • ctext-sourced entries matched 99.7% against a known reference — confirming our tooling worked
  • SKQS vision-extracted entries matched only 6% against ctext — but this wasn't an error rate

It reflected genuine edition differences between the 18th-century Siku Quanshu manuscript and ctext's modern collated text. We found variant characters (脣→唇, 犬→狗), genuine textual variants between editions (宜種秦稷 vs 宜稼黍稷), and yes, some vision misreads that needed correction.

But the ability to distinguish "our edition says this, theirs says that" was precisely the point of edition-faithful extraction. When your pipeline conflates editions, you get a text that's easy to read and impossible to cite. When it preserves them, you get a text that's harder to read and actually useful to scholars.
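One way to triage the discrepancies (a sketch, not our exact tooling): bucket aligned verse pairs into exact matches and variants, then rank variants by similarity so single-character substitutions like 脣→唇 surface before wholesale divergence or misreads.

```python
from difflib import SequenceMatcher

def compare_editions(pairs: list[tuple[str, str]]):
    """Split aligned verse pairs into exact matches and ranked variants.

    Variants are sorted most-similar-first, so likely single-character
    variants come before genuine textual divergence between editions.
    """
    exact, variants = [], []
    for ours, theirs in pairs:
        if ours == theirs:
            exact.append(ours)
        else:
            ratio = SequenceMatcher(None, ours, theirs).ratio()
            variants.append((ratio, ours, theirs))
    variants.sort(reverse=True)
    return exact, variants
```

The ranking matters more than the matching: a 0.9-similarity pair is probably a variant character, a 0.3-similarity pair is probably a misread or a different verse entirely, and the two need different human attention.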

This is the difference between digitization and transcription. Transcription gives you a text. Digitization gives you this text, from this edition, with this provenance chain.

2,783 pages of cosmology

Emboldened by the Yilin extraction, we tackled the Huangji Jingshi Shu — Shao Yong's cosmological masterwork. 14 volumes plus an interpretive manual, 2,783 pages total.

Same pipeline: split, chunk, dispatch parallel vision agents, reassemble. The extraction ran from December 2025 through January 2026.

But we didn't extract everything.

The Huangji Jingshi has a structure that resists uniform treatment. Volumes 1–5 are chronological tables and cosmological frameworks — hundreds of pages of stem-branch grids mapping temporal cycles. The vision agents transcribed these page by page, though the output is a hybrid: real character-by-character data entries mixed with English commentary the agents added unprompted.

Volumes 6–12 cover everything else — Warring States chronology, sound-and-music correspondence tables with ●○ notation grids, pitch pipe systems, astronomical correlations, philosophical synthesis. For these, we extracted a complete Daozang edition from 1445 and Wang Zhi's 1782 commentary edition, but the clean Siku base text for Volumes 6–12 remains unprocessed. The PDFs are sitting there. We just haven't run the pipeline on them yet.6

What we did extract — across three editions and an interpretive manual — produced 271 markdown chunk files:

  • The complete temporal hierarchy (the 129,600-year Yuan cycle, nested as 12×30×12×30) from the Siku base text
  • Music theory and pitch pipe systems from Wang Zhi's commentary and the Daozang edition
  • An observational framework (觀物) using sensory triads as a divination interface, from Zhang Xingcheng's interpretive manual
  • Governance philosophy integrating moral causation with territorial calculations

The three-edition approach turned out to be more useful than extracting one edition completely. The Daozang edition (1445) gives the raw systematic tables. Wang Zhi's commentary (1782) gives the philosophical principles. Zhang Xingcheng's Suoyin gives the operational formulas — how to actually use the system. Peter Bol at Harvard has written extensively about Shao Yong's method of "observing things" — but scholars studying the numerical details of the Huangji Jingshi have had to work from physical reproductions. Now they don't, at least for the parts we've processed.

From raw text to structured JSON

Raw classical Chinese is useful for scholars. Our products needed machine-readable structured data. We built a translation pipeline producing JSON with full provenance:

{
  "slug": "05-qin-san",
  "titleZh": "秦策三",
  "titleEn": "Stratagems of Qin, Part Three",
  "source": {
    "text": "《戰國策》",
    "edition": "中華古詩文古書籍網 transcription",
    "sourceUrl": "https://www.arteducation.com.tw/guwen/bookv_4383.html",
    "commentaryBase": "鮑彪 (Bao Biao) Song dynasty commentary"
  }
}

Every file traces back to a specific historical edition through a specific digital source at a specific URL. Mandatory citation blocks in every text file:

> **Citation**
> Source: ctext.org transcription
> Base text (底本): 《武英殿二十四史》本《史記》
> Original: 《欽定四庫全書》史部·正史類
> Source URL: https://ctext.org/shiji/wu-di-ben-ji/zh
> Downloaded: 2026-03-06
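"Mandatory" is only mandatory if something checks it. A minimal validator sketch; the required field labels are inferred from the citation block above, so adjust `REQUIRED` to match your own blocks:

```python
# Field labels assumed from the citation block format shown above.
REQUIRED = ("Source:", "Base text", "Source URL:", "Downloaded:")

def has_citation_block(markdown: str) -> bool:
    """True if the file's blockquote lines contain every mandatory field."""
    quoted = [ln for ln in markdown.splitlines() if ln.lstrip().startswith(">")]
    block = "\n".join(quoted)
    return all(field in block for field in REQUIRED)
```

Run over the corpus in CI, a check like this turns provenance from a convention into an invariant: a text file without its citation block fails the build instead of quietly shipping.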

By March 2026: 677 structured JSON files covering 14 Warring States-era primary texts, 147 JSON files across 9 divination/esoteric texts, and raw extractions across all 23 classical texts — though "across" is doing some work in that sentence, since the Huangji Jingshi is still missing its Siku base text for Volumes 6–12.

The numbers

| Metric | Count |
|---|---|
| Classical texts digitized | 23 |
| Total pages processed (scans) | ~5,000+ |
| Structured JSON translation files | 824 |
| Jiaoshi Yilin verses extracted | 4,096 |
| Shiji chapters translated | 130 |
| Parallel vision agents (max single run) | 46 |
| ADR documents written | 7 |
| Project duration | ~3.5 months |

What we learned

Source traceability is non-negotiable. The Siku Quanshu says 脣, the modern collation says 唇. Both are valid. But if you don't track provenance, you don't know which you have. This is true for classical Chinese and it's true for any corpus where the source material has an editorial history longer than your attention span.

Traditional OCR is the wrong tool for classical woodblock prints. Multi-layer commentary layouts, variant character forms, and interlinear annotations defeat statistical pattern matching. VLMs succeed because they bring contextual understanding — they know what Classical Chinese should look like. The CHAT transcription model trained on 1.7 million lines of Chinese historical documents can recognize over 16,000 characters at 99% accuracy — but that's for regular-script prints, not multi-layer commentary with four font sizes on one page.

Below a certain quality threshold, LLM "cleanup" is fabrication. This is the finding I keep coming back to. It's not a quality gradient. There's a cliff. Above it, the model corrects from the source. Below it, the model writes from memory. The output is fluent, plausible, and wrong. And you won't catch it unless you know what the source should look like — which, if you're digitizing a text specifically because no digital version exists, you don't.

Parallel subagents are transformative for extraction at scale. 677 pages in 15 minutes with 46 parallel agents. The 15-page chunk size was crucial: large enough for context, small enough for parallelism. This is not a workflow optimization. It's a category change in what's feasible for a solo developer with a laptop.

Downsampling saves 75% and nobody will tell you. 600 DPI scans look better to humans but produce 4x more tiles for VLMs with no transcription quality improvement. 300 DPI is the sweet spot for standard woodblock body text. We learned this the expensive way.

Edition comparison is scholarship, not QA. When our Siku Quanshu extraction disagreed with ctext.org's text, our first instinct was "we have errors." Sometimes we did. But often the disagreement was the data — genuine textual variants between editions separated by centuries of editorial tradition. The instinct to treat disagreement as error is the instinct to flatten history into consensus. Resist it.7


The corpus now powers a live Warring States diplomacy game where AI agents quote Sunzi and Hanfeizi in strategic deliberation, a reference site where readers can explore I Ching commentaries in bilingual format, and an iOS app that surfaces Jiaoshi Yilin oracular poems for divination.

The pipeline is documented and reproducible. The raw scans and extraction manifests are in the repository. And the texts — some of them untouched by digital transcription until now — are finally searchable, citable, and usable.

Not bad for 3.5 months of work with a laptop and a fleet of AI agents reading 250-year-old woodblock prints.


The extraction pipeline, all 7 ADRs, and the raw markdown for all 23 texts live in a private repository. The structured JSON translations are distributed across the product repositories that consume them. I haven't open-sourced the corpus — partly because the provenance chains are complex enough that releasing without documentation would do more harm than good, and partly because I spent 3.5 months building it and haven't decided yet whether that makes me a steward or a hoarder.

This post was written with Claude. I set the editorial direction, verified claims against the actual repo, and corrected the parts where the AI confidently described things that didn't happen. The AI did the drafting, the DFW-adjacent footnote spirals, and the web research for citations. The 23 classical texts were written by people who have been dead for between 900 and 2,400 years, which is either reassuring or unsettling depending on how you feel about the permanence of your own work.

Footnotes

  1. ctext.org is careful about this for good reason. Donald Sturgeon's paper on large-scale OCR for pre-modern Chinese texts discusses the challenges of maintaining a freely accessible corpus while preventing abuse. The irony is that the site's anti-scraping measures push researchers toward lower-quality sources — which is worse for scholarship but presumably better for server costs.

  2. The CHAT transcription model — trained on 1.7 million lines spanning the 10th to 20th century — achieves 99% accuracy on regular-script prints. But that's the easy case. Multi-layer commentary with interleaved phonetic glosses in four different font sizes is a fundamentally different problem. The model that reads the large characters fine can't parse the spatial relationship between a 注 annotation and the 經 line it's commenting on. Layout is semantics, and OCR doesn't do semantics.

  3. Christopher Gait's The Forest of Changes (2016) — the first complete English translation — is based on a different edition than the Siku Quanshu version we extracted. His notes frequently reference textual variants across editions. The point is not that one edition is "right." The point is that edition-faithful extraction preserves the variants so scholars can do their work.

  4. The math: a typical Siku Quanshu page at 600 DPI is roughly 3600×5400 pixels. At 512×512 tiles, that's ~72 tiles × 170 tokens = ~12,240 tokens per page. At 300 DPI (1800×2700), it's ~18 tiles × 170 tokens = ~3,060 tokens per page. Four times the cost for, in our testing, no measurable improvement in character recognition accuracy for standard body text. The savings compound fast when you're processing 5,000 pages.

  5. There's a theoretical literature on this — Amdahl's law, etc. — but the practical reality is simpler than the theory. Each 15-page chunk is fully independent. No shared state. No coordination overhead. The parallelism is embarrassingly parallel in the computer science sense, which is the best kind of parallel in the getting-things-done sense.

  6. This is the honest gap in the project. Volumes 6–12 of the Siku base text span Warring States chronology, sound-music correspondence tables, pitch pipe systems, astronomical correlations, and philosophical synthesis — the second half of the work. We have them via the 1445 Daozang edition (which predates Siku by 337 years and represents an independent textual transmission) and Wang Zhi's 1782 commentary edition (which embeds the base text as lemmas with interpretation). But the clean Siku primary text — the one we can cite as "this is exactly what the 18th-century copyists wrote" — is still sitting as unprocessed PDFs. The pipeline exists. The PDFs exist. The work just hasn't been done. Completeness is a gradient, not a binary, and sometimes you publish at 70% because the alternative is publishing at never.

  7. There's a broader methodological point here that applies well beyond classical Chinese. Any time you have a "ground truth" dataset and your extraction disagrees with it, the disagreement is ambiguous between "extraction error" and "genuine difference in source material." The default assumption — that the ground truth is right and your extraction is wrong — is only valid when you're extracting from the same source. When you're extracting from a different edition, the default assumption is actively misleading. This seems obvious stated plainly. It is not obvious at 2 AM when your validation script reports a 94% error rate and you're wondering whether your entire pipeline is broken.