Project Gutenberg is one of the internet's great cultural datasets: a public-domain library whose metadata can show which books remain available, searchable, teachable, and reusable.
This report asks what the public-domain canon looks like as data. The answer is not merely literature. It is language, copyright, digitization, education, and remix culture.
DATASET CONTEXT
Project Gutenberg provides machine-readable metadata in RDF/XML, MARC, and CSV formats. Its official pages ask automated users to work from these metadata feeds rather than crawl the website.
A scaled version of this report should ingest the weekly CSV or RDF catalog, normalize the subject, language, and author fields, and join the result to adaptation, school-syllabus, and Wikidata signals.
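As a minimal sketch of that ingest-and-normalize step, the Python below parses a few rows in what is assumed to be the catalog CSV layout. The column names (Text#, Type, Language, Authors, Subjects), the semicolon separator for multi-valued fields, and the trailing life dates on author names are assumptions about the catalog format and should be checked against the real file. The sample rows are hand-written placeholders, not catalog data, and the helper names are hypothetical.

```python
import csv
import io
import re

# Hand-written placeholder rows in the ASSUMED catalog layout; not real catalog data.
SAMPLE_CSV = '''Text#,Type,Issued,Title,Language,Authors,Subjects
1,Text,2000-01-01,Example Novel,en,"Doe, Jane, 1800-1850","Love stories; England -- Fiction"
2,Text,2001-02-03,Exemple de roman,fr; en,"Roe, Jean, 1820-1890; Doe, Jane, 1800-1850",Adventure stories
'''

# Matches a trailing ", 1800-1850" style life-date suffix on an author string.
LIFE_DATES = re.compile(r",\s*\d{3,4}\??\s*-\s*(\d{3,4}\??)?\s*$")


def split_multi(value):
    """Split a semicolon-separated field into trimmed, non-empty parts."""
    return [part.strip() for part in value.split(";") if part.strip()]


def normalize_author(raw):
    """Drop the life-date suffix so one person maps to one name string."""
    return LIFE_DATES.sub("", raw).strip()


def normalize_row(row):
    """Turn one raw CSV row into a dict with list-valued, cleaned fields."""
    return {
        "id": int(row["Text#"]),
        "title": row["Title"].strip(),
        "languages": [code.lower() for code in split_multi(row["Language"])],
        "authors": [normalize_author(a) for a in split_multi(row["Authors"])],
        "subjects": split_multi(row["Subjects"]),
    }


def load_catalog(handle):
    """Read rows from an open CSV handle, keeping only rows typed as Text."""
    return [normalize_row(r) for r in csv.DictReader(handle) if r["Type"] == "Text"]


books = load_catalog(io.StringIO(SAMPLE_CSV))
for book in books:
    print(book["id"], book["languages"], book["authors"])
```

In a production pass, the StringIO handle would be replaced by the downloaded catalog file opened with newline="" and an explicit encoding. The join to adaptation or Wikidata signals would then key on the normalized author and title fields.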
Reader path: if you are new to the topic, treat each chart as a guided tour of one question: who leads, how concentrated the field is, what changes over time, and where the outliers sit. If you already know the domain, use the same charts as a challenge: check whether the metric is the right proxy, whether the source omits an important population, and whether the headline survives the limitations section.
CHART 1 - LANGUAGE GRAVITY
Project Gutenberg is not just a library. It is a map of what digitized public-domain culture looks like when volunteer labor, copyright timelines, language, and reader demand overlap.
The first signature is language gravity. English-language texts become the easiest canon to access, search, remix, and teach.
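One way to put a number on language gravity, sketched here under assumptions, is to compute each language's share of catalog entries plus a single concentration score. The input is assumed to be one list of language codes per book. The Herfindahl-Hirschman index used below is an illustrative choice, not the method behind the chart, and the sample list is invented for the example.

```python
from collections import Counter


def language_shares(language_lists):
    """Return each language's share of all language tags, largest first.

    A multilingual book counts once per language, so these are shares of
    language tags rather than shares of books.
    """
    counts = Counter(code for langs in language_lists for code in set(langs))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in counts.most_common()}


def concentration(shares):
    """Herfindahl-Hirschman index: sum of squared shares, between 0 and 1."""
    return sum(share * share for share in shares.values())


# Invented sample: one list of language codes per book.
sample = [["en"], ["en"], ["en"], ["fr"], ["de", "en"]]
shares = language_shares(sample)
print(shares)
print(round(concentration(shares), 3))
```

A score near 1 means one language dominates the catalog. A score near 1/k means k languages hold roughly equal shares.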
CHART 2 - ERA MACHINE
Copyright law turns time into a cultural filter. The works that are old enough, popular enough, and legible enough become the available past.
That makes Gutenberg a powerful mid-scale (meso) dataset: not every book ever written, but the books most ready to be reactivated.
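The era filter can be approximated from author life dates, sketched here under assumptions. The code assumes author strings end in a "birth-death" year pair and uses the death year as a rough proxy for era. Original publication year would be a better signal, but it is assumed not to be available as a catalog column. The function names and the 50-year bucket width are illustrative choices.

```python
import re
from collections import Counter

# Matches a trailing "1775-1817" style year pair; open-ended or BCE dates are skipped.
LIFE_DATES = re.compile(r"(\d{3,4})\??\s*-\s*(\d{3,4})\??\s*$")


def death_year(author):
    """Return the death year parsed from an author string, or None."""
    match = LIFE_DATES.search(author)
    return int(match.group(2)) if match else None


def era_bucket(year, width=50):
    """Label the fixed-width bucket that contains the given year."""
    start = year // width * width
    return f"{start}-{start + width - 1}"


def era_histogram(authors, width=50):
    """Count authors per era bucket, ignoring strings without life dates."""
    years = (death_year(a) for a in authors)
    return Counter(era_bucket(y, width) for y in years if y is not None)


authors = [
    "Austen, Jane, 1775-1817",
    "Shelley, Mary Wollstonecraft, 1797-1851",
    "Stoker, Bram, 1847-1912",
    "Anonymous",
]
histogram = era_histogram(authors)
for bucket in sorted(histogram):
    print(bucket, histogram[bucket])
```

Entries without parseable dates are dropped rather than guessed, so the histogram undercounts anonymous and ancient works. That omission belongs in the limitations of any chart built this way.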
CHART 3 - AUTHOR MEMORY
Authors become infrastructure when their works are everywhere: in classrooms, editions, audiobooks, adaptations, quote databases, and training corpora.
That is why public domain matters to AI culture too. Availability shapes what machines and people can easily learn from.
CHART 4 - ADAPTATION POWER
Some subjects move cleanly across medium. Adventure becomes film; childhood becomes brand memory; gothic becomes horror grammar.
The public domain is therefore not dead literature. It is reusable cultural material.
CHART 5 - CANON AND REMIX
Pride and Prejudice is both classroom canon and adaptation engine. Sherlock Holmes, Dracula, and Frankenstein are more like cultural APIs: endlessly callable characters and structures.
That is the Artometrics claim: the most powerful books are not only read. They are reused.
CONCLUSION
The public-domain canon is not neutral. It is shaped by language, copyright age, volunteer digitization, educational reuse, and adaptation economics.
For Artometrics, Gutenberg is a bridge dataset: it connects literature, AI training culture, education, film adaptation, and historical memory.
REFERENCES
Project Gutenberg. Offline Catalogs and machine-readable metadata documentation.
Project Gutenberg feeds: RDF/XML and CSV catalog files.
gutenbergtools/catalog_tools. Notes on Project Gutenberg catalog CSV and RDF updates.
Library of Congress subject logic and Wikidata work/author metadata for future joins.
EDITOR'S NOTE
Chart values are editorial indices derived from Project Gutenberg source fields and public canon/adaptation signals, not direct counts from the catalog. A production pass should ingest the official catalog directly.
