Build Log #011: 98% Disk Full With 131 Million Rows to Feed

We needed to backfill 4.9 million addresses into a 131-million-row parcel database. The server had other plans — specifically, 2% free disk space. A story about Docker hoarding, Postgres backup policies written by optimists, and the spatial join that saved the project.

Build logs are our honest engineering journal. Not the polished case study — the actual "why is the server on fire" reality of building data systems at scale.

The Problem We Were Trying to Solve

We've been building a nationwide parcel database — over 131 million property records across all 50 US states, ingested from hundreds of county and state GIS endpoints. The geometry data is solid. Boundaries, acreage, parcel IDs — all there.

But here's the thing about parcel data: a polygon without an address is basically a shape on a map. You can't search for it. You can't geocode to it. You can't match it to anything useful. And about 14% of our parcels — roughly 18 million records — had no address at all.

The plan was simple: take the OpenAddresses dataset (a free, open-source collection of address points from official sources worldwide), load it into staging tables, and spatially join addresses to parcels that didn't have one. If an address point falls inside a parcel polygon, that's your address.

Simple in theory. In practice, we were about to learn that our server had been quietly dying for weeks.

The 98% Moment

We SSH'd in to kick off the backfill and ran df -h out of habit. The output made our stomachs drop:

/dev/sda1       410G   402G   7.8G  98% /

98% full. Less than eight gigabytes free on a server holding a 131-million-row PostGIS database that we were about to write millions more rows into.

A spatial join across millions of address points and millions of parcels needs temp space. Postgres uses disk for sort operations, hash joins, and WAL (write-ahead log) segments. If the disk fills to 100% during a write operation, Postgres doesn't gracefully degrade — it panics. Transactions abort. WAL corruption is possible. In the worst case, you're restoring from backup. And our backup situation, as we were about to discover, was part of the problem.
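A pre-flight free-space check would have caught this before we ever kicked off the job. A minimal sketch in shell — the 50GB threshold is our own guess at a safe margin, not anything Postgres mandates:

```shell
#!/usr/bin/env sh
# Abort a large write job unless at least MIN_FREE_GB gigabytes are
# available on the filesystem holding the Postgres data directory.
MIN_FREE_GB=50

# df -Pk prints POSIX-format output in 1KB blocks; column 4 is "Available".
avail_kb=$(df -Pk / | awk 'NR==2 {print $4}')
avail_gb=$((avail_kb / 1024 / 1024))

if [ "$avail_gb" -lt "$MIN_FREE_GB" ]; then
  echo "Only ${avail_gb}GB free; refusing to start the backfill." >&2
  exit 1
fi
echo "Disk check passed: ${avail_gb}GB free."
```

Two lines of arithmetic, and the job refuses to start instead of taking Postgres down with it.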

Where Did 400GB Go?

The server has a 410GB disk. Our Postgres data directory was about 180GB — reasonable for 131 million parcels with PostGIS geometries. So where was the other 220GB?

Three culprits:

1. Postgres backups with the retention policy of a digital hoarder.

We had a backup script running nightly via cron. Sensible! Less sensible: a 30-day retention window when each backup is a pg_dump of a 180GB database that compresses to roughly 25GB. Six backups were already sitting in /var/backups/postgres/, consuming about 150GB. We kept the two most recent and deleted the rest.

# Before: RETENTION_DAYS=30
# After:  RETENTION_DAYS=3
# Freed: ~100GB

The embarrassing part? We wrote that backup script. The 30-day retention was a copy-paste default from a tutorial. Nobody did the math on how big the dumps would actually be at scale.
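The fix itself is one find invocation at the end of the nightly script. A sketch — the backup path and file pattern are stand-ins for whatever your script actually writes:

```shell
#!/usr/bin/env sh
# Nightly cleanup: delete compressed dumps older than RETENTION_DAYS.
# BACKUP_DIR and the *.dump.gz pattern are hypothetical stand-ins.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/postgres}"
RETENTION_DAYS=3

# -mtime +N matches files last modified more than N*24h ago;
# -delete removes them in place (GNU find).
find "$BACKUP_DIR" -name '*.dump.gz' -type f -mtime +"$RETENTION_DAYS" -delete
```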

2. Docker's build cache: the silent disk vampire.

We run the application stack in Docker Compose — API, workers, Redis, Nginx, monitoring. Every docker compose up --build creates new image layers. Every failed build leaves dangling images. Over months of active development, Docker had accumulated over 40GB of build cache and orphaned images.

docker system prune -a --volumes=false -f
docker buildx prune -f
# Freed: ~41GB

The --volumes=false is belt-and-suspenders — docker system prune leaves volumes alone by default, but we were not taking any chances with our named Postgres volume. That would have been a different build log entirely. A much shorter, much sadder one.

3. Stale temp files and log rotation that wasn't rotating.

Another 30GB scattered across /tmp, old ingestion staging files, and application logs that had been configured for rotation but never actually rotated because the logrotate cron was pointing at the wrong path.
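Once the cron pointed at the right place, the logrotate side is a handful of config lines. A sketch, with a hypothetical app name and retention limits:

```
# /etc/logrotate.d/parcel-api  (hypothetical path and app name)
/var/log/parcel-api/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
}
```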

Total freed: ~175GB. Server went from 98% to 53%. We could breathe again.

The Backfill: Spatial Joins at Scale

With disk space no longer an existential threat, we ran the address backfill. The approach:

  1. Load OpenAddresses data for each state into a staging table (filtered — no point loading California addresses if California parcels already have addresses)
  2. Create a spatial index on the staging table
  3. Run an ST_Contains join: for each parcel with a null address, find address points that fall within the parcel polygon
  4. Update the parcel record with the matched address
  5. Drop the staging table, move to the next state
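In SQL, steps 2 through 5 look roughly like this — the table and column names (oa_staging_ca, parcels, situs_address) are illustrative stand-ins, not our actual schema:

```sql
-- Step 2: spatial index on the freshly loaded staging table.
CREATE INDEX oa_staging_ca_geom_idx ON oa_staging_ca USING GIST (geom);
ANALYZE oa_staging_ca;

-- Steps 3-4: fill in missing addresses where a point lands inside a polygon.
-- If several points fall in one parcel, this picks one arbitrarily.
UPDATE parcels p
SET    situs_address = a.full_address
FROM   oa_staging_ca a
WHERE  p.state = 'CA'
  AND  p.situs_address IS NULL
  AND  ST_Contains(p.geom, a.geom);

-- Step 5: done with this state; reclaim the space immediately.
DROP TABLE oa_staging_ca;
```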

The key optimization — learned the hard way in earlier backfill attempts — was state-filtered staging tables. Our first attempt loaded the entire OpenAddresses dataset (92 million points) into a single staging table. The spatial join took hours per state because Postgres was scanning address points in Maine while trying to match parcels in California.

With per-state staging tables (typically 2-8 million rows each), each join completed in minutes. The difference between a mega-table and a focused one: hours vs. minutes per state. Indexing 6 million rows is fast. Indexing 92 million and then filtering is not.

The Numbers

The backfill filled in 4.9 million parcel addresses across multiple states. The highlights:

  • California: 71.2% → 95.8% coverage (3.7 million parcels filled — the single biggest win)
  • Connecticut: 80.1% → 94.1%
  • Pennsylvania: 80.3% → 95.8%
  • Oklahoma, Alabama, New Mexico, Alaska — all saw meaningful jumps

Overall weighted address coverage went from approximately 86% to 93.2% across 131.8 million parcels. That's the difference between "most parcels are searchable" and "nearly all parcels are searchable." For an API product, that gap is everything.

What's Still Broken

We're not going to pretend this is solved. Thirty-one states are still under 95% coverage. The stubborn ones:

  • Iowa: 76.2% — OpenAddresses data is thin here. Rural state, townships instead of municipalities, address formats are inconsistent.
  • Alaska: 56.1% — Vast tracts of land with no traditional addresses. Many parcels are identified by legal descriptions, not street addresses. This may never hit 90%.
  • Louisiana: 77.9% — Parish system, unique address conventions, limited open data.

For these, OpenAddresses is exhausted. The next move is statewide ArcGIS parcel layers that include address fields we can spatially join back, or direct partnerships with state GIS offices. Different playbook, slower progress.

Lessons

Do the disk math before you write the backup script. A 30-day retention policy sounds responsible until each backup is 25GB. 25GB × 30 days = 750GB. Our disk is 410GB. This is arithmetic, not engineering. We should have caught it.

Docker builds leak disk like a slow faucet. If you're doing active development with frequent rebuilds, schedule a monthly docker system prune or set up Docker's built-in garbage collection. We added it to our maintenance cron and moved on.

Scope your staging tables. A spatial join against a 92-million-row table is a fundamentally different operation than the same join against a 6-million-row table, even with proper indexes. Filter early, filter aggressively. Your query planner will thank you.

93% isn't 100%, but it's a product. Perfectionism kills data products. We could spend months chasing the last 7% or we could ship what we have and improve incrementally. The API is live. The data is useful. The remaining gaps are documented and prioritized. That's good enough to start selling.
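For the Docker cleanup above, the maintenance cron is one line. A sketch for /etc/cron.d, with the schedule and log path as placeholders:

```
# /etc/cron.d/docker-maintenance (hypothetical): prune unused images and
# build cache at 03:00 on the 1st of each month. Named volumes are not
# touched unless --volumes is explicitly passed.
0 3 1 * * root /usr/bin/docker system prune -af >> /var/log/docker-prune.log 2>&1
```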

This is Build Log #011. We publish these as we build — the real engineering stories behind production data systems. If you're building something that needs parcel data, spatial analysis, or just want to commiserate about disk space management, get in touch.