PROJECT GUTENBERG: The Artometrics of the Public-Domain Canon

Project Gutenberg is one of the internet's great cultural datasets: a public-domain library whose metadata can show which books remain available, searchable, teachable, and reusable.

This report asks what the public-domain canon looks like as data. The answer is not merely literature. It is language, copyright, digitization, education, and remix culture.

FAST FACTS

75K+: Approximate scale of Project Gutenberg ebooks in public-facing summaries

Weekly: CSV catalog update cadence noted in Gutenberg tooling docs

Daily: RDF catalog update cadence noted by Gutenberg

8: Languages compared in the first chart

10: Authors used as anchors

5: Charts in this report

DATASET CONTEXT

Project Gutenberg provides machine-readable metadata in RDF/XML, MARC, and CSV formats. Official pages recommend using the metadata feeds rather than crawling the website.

A scaled version of this report should ingest the weekly CSV catalog or the daily RDF catalog, normalize the subject, language, and author fields, and join the result to adaptation, school-syllabus, and Wikidata signals.
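As a rough illustration of the ingest-and-normalize step, the sketch below parses a few catalog-style rows with Python's standard csv module. Everything specific in it is an assumption for illustration: the column names (Text#, Title, Language, Authors, Subjects), the semicolon separator for multi-valued fields, and the sample rows, which are invented placeholders rather than real catalog entries. Check the header of the official CSV file before relying on any of these names.

```python
import csv
import io

# Placeholder rows, NOT real catalog data. Column names and the ";" separator
# for multi-valued fields are assumptions about the catalog CSV layout.
SAMPLE_CSV = """Text#,Type,Issued,Title,Language,Authors,Subjects
1,Text,2000-01-01,Example Novel,en,"Doe, Jane, 1800-1850",Love stories; England -- Fiction
2,Text,2000-01-02,Exemple de roman,fr,"Roe, Jean, 1810-1870",Adventure stories
3,Text,2000-01-03,Bilingual Reader,en; de,"Doe, Jane, 1800-1850; Muster, Max, 1820-1880",Readers
"""


def split_multi(value, sep=";"):
    """Split a multi-valued field and drop empty or whitespace-only parts."""
    return [part.strip() for part in value.split(sep) if part.strip()]


def normalize_row(row):
    """Turn one raw CSV row into a record with list-valued fields."""
    return {
        "id": int(row["Text#"]),
        "title": row["Title"].strip(),
        "languages": [code.lower() for code in split_multi(row["Language"])],
        "authors": split_multi(row["Authors"]),
        "subjects": split_multi(row["Subjects"]),
    }


def load_catalog(text):
    """Parse catalog CSV text into a list of normalized records."""
    return [normalize_row(row) for row in csv.DictReader(io.StringIO(text))]


for record in load_catalog(SAMPLE_CSV):
    print(record["id"], record["languages"], record["authors"])
```

Keeping languages, authors, and subjects as lists makes later joins simpler, because each value can be matched against an external table (for example, a Wikidata author lookup) without re-parsing strings.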

Reader path: if you are new to the topic, treat each chart as a guided tour of one question: who leads, how concentrated the field is, what changes over time, and where the outliers sit. If you already know the domain, use the same charts as a challenge: check whether the metric is the right proxy, whether the source omits an important population, and whether the headline survives the limitations section.

CHART 1 - LANGUAGE GRAVITY

English dominates the accessible public-domain shelf

Project Gutenberg is not just a library. It is a map of what digitized public-domain culture looks like when volunteer labor, copyright timelines, language, and reader demand overlap.

The first signature is language gravity. English-language texts become the easiest canon to access, search, remix, and teach.
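One way to put a number on language gravity is to compute each language's share of titles and a single concentration score. The sketch below is illustrative only: the language codes are invented sample data, and the Herfindahl index is a measure chosen here for illustration, not a metric this report states it uses.

```python
from collections import Counter


def language_shares(codes):
    """Return each language's share of titles, largest share first."""
    counts = Counter(codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.most_common()}


def herfindahl(shares):
    """Sum of squared shares: near 0 when spread out, 1.0 for a single language."""
    return sum(share * share for share in shares.values())


# Invented sample, one language code per title. Not real catalog counts.
SAMPLE_CODES = ["en", "en", "en", "en", "fr", "de"]

shares = language_shares(SAMPLE_CODES)
for lang, share in shares.items():
    print(f"{lang}: {share:.1%}")
print(f"concentration: {herfindahl(shares):.2f}")
```

A score close to 1.0 would mean the shelf is effectively monolingual, and lower values mean titles are spread across more languages.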

CHART 2 - ERA MACHINE

The 19th century becomes the public-domain literary core

Copyright law turns time into a cultural filter. The works that are old enough, popular enough, and legible enough become the available past.

That makes Gutenberg a powerful meso-scale dataset: it does not hold every book ever written, but it does hold the books most ready to be reactivated.
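A hedged sketch of how an era chart could be approximated from catalog metadata: assuming author strings carry life dates in a form like "Surname, Given, 1800-1850", the death year (or the birth year when no death year is given) can stand in for a work's era. This is a proxy chosen for illustration, not the method behind the chart, and the author strings below are invented placeholders.

```python
import re
from collections import Counter

# Assumed pattern: birth year, a hyphen, and an optional death year.
YEARS = re.compile(r"(\d{3,4})\??-(\d{3,4})?")


def proxy_year(author):
    """Return the death year if present, else the birth year, else None."""
    match = YEARS.search(author)
    if match is None:
        return None
    birth, death = match.group(1), match.group(2)
    return int(death or birth)


def century(year):
    """Map a year to its century number, e.g. 1817 -> 19 and 1900 -> 19."""
    return (year - 1) // 100 + 1


def century_counts(authors):
    """Count authors per proxy century; undated authors are tallied under None."""
    counts = Counter()
    for author in authors:
        year = proxy_year(author)
        counts[century(year) if year is not None else None] += 1
    return counts


# Invented placeholder authors, not real catalog entries.
SAMPLE_AUTHORS = [
    "Doe, Jane, 1775-1817",
    "Roe, Richard, 1850-",
    "Muster, Max, 1564-1616",
    "Anonymous",
]

print(century_counts(SAMPLE_AUTHORS))
```

Author life dates place a work only roughly, so a production pass would want real publication dates from a joined source before drawing conclusions about any single century.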

CHART 3 - AUTHOR MEMORY

Digital availability and cultural memory reinforce each other

Authors become infrastructure when their works are everywhere: in classrooms, editions, audiobooks, adaptations, quote databases, and training corpora.

That is why public domain matters to AI culture too. Availability shapes what machines and people can easily learn from.

CHART 4 - ADAPTATION POWER

Adventure, childhood, and gothic subjects adapt especially well

Some subjects move cleanly across media. Adventure becomes film; childhood becomes brand memory; gothic becomes horror grammar.

The public-domain market is therefore not dead literature. It is reusable cultural material.

CHART 5 - CANON AND REMIX

Some books become curriculum while others become remix engines

Pride and Prejudice is both classroom canon and adaptation engine. Sherlock Holmes, Dracula, and Frankenstein are more like cultural APIs: endlessly callable characters and structures.

That is the Artometrics claim: the most powerful books are not only read. They are reused.

CONCLUSION

The public-domain canon is not neutral. It is shaped by language, copyright age, volunteer digitization, educational reuse, and adaptation economics.

For Artometrics, Gutenberg is a bridge dataset: it connects literature, AI training culture, education, film adaptation, and historical memory.

REFERENCES

Project Gutenberg. Offline Catalogs and machine-readable metadata documentation.

Project Gutenberg feeds: RDF/XML and CSV catalog files.

gutenbergtools/catalog_tools. Notes on Project Gutenberg catalog CSV and RDF updates.

Library of Congress subject logic and Wikidata work/author metadata for future joins.

EDITOR'S NOTE

Chart values are editorial indices designed from Project Gutenberg source fields and public canon/adaptation signals. A production pass should ingest the official catalog directly.