ZigguratCodex: From PDF Chaos to a Structured Knowledge Platform
ZigguratCodex is a long-term project I’m building to solve a problem I kept running into while studying ancient Mesopotamia: the best scholarship sits in scattered PDFs that are often poorly indexed, difficult to search, and disconnected from one another.
The goal is technical: build a repeatable pipeline that converts a PDF archive into an auditable knowledge system.
Architecture overview
The project is split into three layers:
1) Ingestion layer
A controlled intake process that standardizes files and metadata. This sounds boring, but it prevents future chaos. It includes:
• filename normalization
• consistent metadata fields (author, year, title, edition)
• duplicate detection
• source integrity checks
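As a rough illustration of the intake steps above (not the project's actual code), the sketch below builds a predictable filename from the metadata fields and uses a SHA-256 digest for both duplicate detection and integrity checks. The names SourceMetadata, slugify, normalized_filename, content_digest, and find_duplicates are hypothetical, and the filename pattern is an assumption.

```python
import hashlib
import re
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceMetadata:
    """Hypothetical metadata record using the fields listed above."""
    author: str
    year: int
    title: str
    edition: str = ""


def slugify(text: str) -> str:
    """Lowercase ASCII slug: strip accents, collapse other characters to '-'."""
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def normalized_filename(meta: SourceMetadata) -> str:
    """Build a name such as 'example-author_2001_a-sample-title.pdf'."""
    parts = [slugify(meta.author), str(meta.year), slugify(meta.title)]
    if meta.edition:
        parts.append(slugify(meta.edition))
    return "_".join(parts) + ".pdf"


def content_digest(data: bytes) -> str:
    """SHA-256 of the file bytes: a duplicate key and an integrity fingerprint."""
    return hashlib.sha256(data).hexdigest()


def find_duplicates(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Group file names by digest; keep only groups with more than one file."""
    groups: dict[str, list[str]] = {}
    for name, data in files.items():
        groups.setdefault(content_digest(data), []).append(name)
    return {digest: names for digest, names in groups.items() if len(names) > 1}
```

A digest only catches byte-identical copies. Two different scans of the same book would need a fuzzier comparison, for example on the normalized metadata.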
2) Processing layer
PDFs are heterogeneous. The pipeline branches:
A. Text-native PDFs
• extract text directly
• normalize encoding and typography
• remove repeated headers/footers and page noise
B. Scanned PDFs
• render pages to images
• OCR pass (English-first, later multilingual)
• post-processing to reduce OCR errors
• segmenting into structured blocks
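The branching and the header/footer cleanup can be sketched with two small heuristics. This assumes page text has already been extracted into a list of strings, one per page, by whatever PDF library the pipeline uses. The function names and the thresholds (50 characters per page, 60% of pages) are illustrative guesses, not values from the project.

```python
import re
from collections import Counter


def needs_ocr(page_texts: list[str], min_chars_per_page: int = 50) -> bool:
    """Heuristic branch: treat a PDF as scanned if extraction yields little text."""
    if not page_texts:
        return True
    average = sum(len(text.strip()) for text in page_texts) / len(page_texts)
    return average < min_chars_per_page


def _line_key(line: str) -> str:
    """Normalize a line so that 'Page 12' and 'Page 13' compare equal."""
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_repeated_lines(page_texts: list[str], min_fraction: float = 0.6) -> list[str]:
    """Drop first/last lines that recur on at least min_fraction of pages.

    Blank lines are discarded as a side effect.
    """
    counts = Counter()
    for text in page_texts:
        lines = [ln for ln in text.splitlines() if ln.strip()]
        if lines:
            # A set avoids double counting on single-line pages.
            counts.update({_line_key(lines[0]), _line_key(lines[-1])})

    # Require at least two occurrences so a single page never defines "noise".
    threshold = max(2, int(min_fraction * len(page_texts)))
    noise = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for text in page_texts:
        lines = [ln for ln in text.splitlines() if ln.strip()]
        if lines and _line_key(lines[0]) in noise:
            lines = lines[1:]
        if lines and _line_key(lines[-1]) in noise:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned
```

Because digits are collapsed before comparison, running page numbers are caught along with fixed headers. The trade-off is that a body line consisting only of a number at the edge of a page would also be removed, so the output is worth spot-checking.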
3) Knowledge layer
Once text is usable, the platform builds structured objects:
• Source (the original book/PDF)
• Passage (a clean segment with a page reference)
• Entity (person/place/deity/term)
• Topic article (editorial synthesis with citations)
• Relations (links between entities and topics)
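One possible shape for these objects, sketched as plain dataclasses. All field names are assumptions made for illustration; the list above only fixes the object types and the fact that a passage carries a page reference.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Source:
    source_id: str
    title: str
    author: str
    year: int


@dataclass(frozen=True)
class Passage:
    passage_id: str
    source_id: str  # points back to the Source it was extracted from
    page: int
    text: str


@dataclass(frozen=True)
class Entity:
    entity_id: str
    name: str
    kind: str  # "person", "place", "deity", or "term"


@dataclass(frozen=True)
class Relation:
    subject_id: str
    predicate: str
    object_id: str
    evidence: tuple[str, ...] = ()  # passage_ids supporting the link


@dataclass
class TopicArticle:
    topic_id: str
    title: str
    body: str
    citations: list[str] = field(default_factory=list)  # passage_ids


def format_citation(passage: Passage, source: Source) -> str:
    """Render a human-readable reference for a passage."""
    if passage.source_id != source.source_id:
        raise ValueError("passage does not belong to this source")
    return f"{source.author} ({source.year}), {source.title}, p. {passage.page}"
```

Keeping Passage as the only object that holds a page number means articles and relations cite passages rather than sources directly, so every reference resolves to a specific page.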
This structure enables three key features:
1) Auditable content
Every claim can be traced back to a specific source and page reference. That matters when the project is built from historical scholarship.
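To make "auditable" concrete, here is a minimal sketch of a check that flags claims which cannot be traced to a known passage. It assumes claims are stored with explicit passage ids; Claim, PassageRef, and audit_claims are hypothetical names, not part of the project.

```python
from typing import NamedTuple


class PassageRef(NamedTuple):
    source_id: str
    page: int


class Claim(NamedTuple):
    text: str
    passage_ids: tuple[str, ...]


def audit_claims(
    claims: list[Claim], passages: dict[str, PassageRef]
) -> list[tuple[str, str]]:
    """Return (claim_text, problem) pairs for claims that cannot be traced."""
    problems = []
    for claim in claims:
        if not claim.passage_ids:
            problems.append((claim.text, "no citation"))
            continue
        missing = [pid for pid in claim.passage_ids if pid not in passages]
        if missing:
            problems.append((claim.text, "unknown passage: " + ", ".join(missing)))
    return problems
```

An empty result means every claim points at least at one registered passage, and therefore at a source and page. It does not verify that the passage actually supports the claim; that part remains editorial work.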
2) Real navigation
Instead of “scrolling a PDF”, users can move through the system by:
• entity index
• thematic clusters
• geography and timeline views
• cross-references and citations
3) Search that scales
Keyword search is step one. The system is designed for layered retrieval:
• full-text search
• entity search
• citation-based discovery
• relationship-based exploration
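The layering can be illustrated with a toy in-memory index that answers three of these query types over the same set of passages. This is a sketch under simplifying assumptions (whitespace-level tokenization, boolean AND matching, no ranking). PassageIndex and its methods are hypothetical, and citation-based discovery is left out for brevity.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; no stemming or stop words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class PassageIndex:
    """Toy index; a real deployment would likely sit on a search engine."""

    def __init__(self) -> None:
        self._postings = defaultdict(set)   # token -> passage_ids
        self._mentions = defaultdict(set)   # entity_id -> passage_ids
        self._neighbors = defaultdict(set)  # entity_id -> related entity_ids

    def add_passage(self, passage_id: str, text: str, entity_ids=()) -> None:
        for token in tokenize(text):
            self._postings[token].add(passage_id)
        for entity_id in entity_ids:
            self._mentions[entity_id].add(passage_id)

    def relate(self, a: str, b: str) -> None:
        """Record an undirected relation between two entities."""
        self._neighbors[a].add(b)
        self._neighbors[b].add(a)

    def full_text(self, query: str) -> set[str]:
        """Passages containing every query token (boolean AND)."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result

    def by_entity(self, entity_id: str) -> set[str]:
        """Passages tagged with the given entity."""
        return set(self._mentions.get(entity_id, set()))

    def by_related(self, entity_id: str) -> set[str]:
        """Passages mentioning any entity directly related to entity_id."""
        result = set()
        for neighbor in self._neighbors.get(entity_id, set()):
            result |= self._mentions.get(neighbor, set())
        return result
```

For example, after indexing two passages tagged with hypothetical entity ids "gilgamesh" and "uruk" and relating the two, by_related("gilgamesh") returns every passage that mentions "uruk", including ones that never mention Gilgamesh by name.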
Current status
The archive foundation is assembled and standardized. The processing and knowledge layers are under active development.
I’ll share updates as the pipeline and platform become stable.