About Census Monkey Typewriter

The Project

CMT is a continuous feed of LLM-generated, hypothesis-driven data explorations of American society using Census Bureau data. The project is an experiment in generating plausible and farcical analysis blog posts in the style of late-2010s blogs and data journalism sites. Everything from hypothesis generation to analysis planning to code execution, iteration, and testing is designed to function without human involvement. Everything runs through Claude Code using a mixture of Opus, Sonnet, and Gemini Pro, relying on a series of fine-tuned prompts and metacontextual documentation that grows with each analysis through a combination of agentic self-reflection and human feedback.

For more details, see the original blog post, which describes the entire process in narrative form. Feel free to explore the full repo and prompt context directly on GitHub. You can also see it as an experimental framework for automated social science research, if you allow yourself to extrapolate a bit into the near future.

Methodology at a high level

Each analysis follows a structured workflow:

  1. Hypothesis Generation: Research questions targeting demographic, economic, or geographic patterns, produced by an LLM from a hypothesis-generating prompt
  2. Data Collection: Analyses are constrained to publicly available U.S. Census Bureau data accessible via API, and the agents are trained on the tidycensus R package documentation.
  3. Statistical Analysis: Analyses are expected to apply appropriate statistical methods (regression, clustering, time series, spatial analysis), but the results can be hit-or-miss. The agents will sometimes attempt to address causality and identification, but are not always effective in doing so.
  4. Validation: Robustness checks, sensitivity analysis, and simulated peer review protocols
  5. Documentation: Complete R Markdown reports with reproducible code, which are then fed into this page
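
To make steps 2 and 3 concrete, here is a minimal sketch of what a single data pull and model fit might look like with tidycensus. It is illustrative only and is not taken from the project's generated analyses: the function names `fetch_county_acs` and `fit_income_model`, the choice of variables (median household income and median age), the 2022 5-year ACS, and the linear model are all assumptions made for the example.

```r
# Illustrative sketch of workflow steps 2 and 3.
# The variables, year, and model are assumptions, not the project's own code.

# Step 2: pull county-level ACS estimates through the Census API.
# Requires the tidycensus package, network access, and a Census API key
# (set once with tidycensus::census_api_key()).
fetch_county_acs <- function(year = 2022) {
  tidycensus::get_acs(
    geography = "county",
    variables = c(
      median_income = "B19013_001",  # median household income
      median_age    = "B01002_001"   # median age
    ),
    year = year,
    survey = "acs5",
    output = "wide"
  )
}

# Step 3: fit a simple linear model on the wide-format estimates.
# In wide output, tidycensus appends "E" to estimate columns and "M" to
# margin-of-error columns, hence median_incomeE and median_ageE.
fit_income_model <- function(acs_wide) {
  lm(median_incomeE ~ median_ageE, data = acs_wide)
}

# Usage (hypothetical; needs network access and an API key):
# counties <- fetch_county_acs()
# summary(fit_income_model(counties))
```

Keeping the data pull and the model fit in separate functions means the analysis step can be checked against any data frame with the same column names, without calling the API.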

There are three categories of analyses: Serious: allegedly policy-relevant analyses with potential practical implications; Exploratory: novel cross-disciplinary methodological approaches or data discovery projects; and Whimsical: lower-stakes, nonsensical, and curious explorations.

Technical Infrastructure

  • Data Source: U.S. Census Bureau American Community Survey
  • Analysis Environment: R with tidyverse, tidycensus, and specialized packages
  • Documentation: R Markdown → Hugo publication pipeline
  • Reproducibility: All code and data processing steps documented

© Dmitry Shkolnik 2025

Powered by Hugo & adapted from Kiss.