Build log — where we document the actual engineering work, including the parts where we stare at a progress bar for 15 hours before realizing the whole approach was wrong.
The Problem: Address Gaps
We have a parcel database. About 92 million rows across 47 US states. Each parcel is a polygon with (ideally) an address, owner, acreage, assessed value — the works. The problem is that "ideally" is doing a lot of heavy lifting in that sentence.
Our overall searchable rate — the percentage of parcels that actually have a usable street address — was sitting at 85.8%. That sounds okay until you realize it means 13 million parcels are invisible to anyone searching by address. Some states were brutal: Kansas at 30%, North Dakota at 29%, Iowa at 64%.
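As a quick sanity check on those numbers (figures from the paragraph above, Python used only for arithmetic):

```python
# Back-of-envelope check: how many parcels are invisible to address search?
TOTAL_PARCELS = 92_000_000   # parcel rows in the database
SEARCHABLE_RATE = 0.858      # parcels with a usable street address

invisible = round(TOTAL_PARCELS * (1 - SEARCHABLE_RATE))
print(f"{invisible:,} parcels have no usable address")  # ~13.1 million
```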
The fix seemed straightforward: OpenAddresses maintains a massive open dataset of geocoded addresses. Download their data, load it into a staging table, then spatially join it against our parcels. If an OpenAddresses point falls inside a parcel polygon, adopt that address.
Simple geometry. ST_Contains(parcel.geom, oa_point.geom). What could go wrong?
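For intuition about what that call actually costs, here's a toy Python stand-in for the test ST_Contains performs per candidate point: a textbook ray-casting point-in-polygon check. (PostGIS does this in optimized C against real parcel geometries; this is just the shape of the work.)

```python
def point_in_polygon(pt, poly):
    """Ray-casting test: poly is a list of (x, y) vertices, pt an (x, y) tuple."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does this edge cross the horizontal ray extending right from pt?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

parcel = [(0, 0), (4, 0), (4, 3), (0, 3)]   # a toy rectangular parcel
print(point_in_polygon((2, 1), parcel))      # point inside  -> True
print(point_in_polygon((9, 9), parcel))      # point outside -> False
```

Cheap for one point. The trouble starts with how many points you run it against.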
Version 1: The Naive Approach
We downloaded OpenAddresses data for the southern US region — about 30 million address points across a dozen states. Loaded them all into a single staging table called oa_addresses_south. Built a spatial index on it. Wrote a script that would iterate through each state's parcels in batches of 500, running the spatial join to find matching addresses.
Started it on a Saturday night and went to bed feeling productive.
Woke up Sunday morning. North Carolina was still on its first query. Fifteen hours in, and it hadn't finished a single state.
We checked the query plan. The issue was immediately obvious — in the way that things are obvious once you already know the answer and feel stupid for not seeing it eight hours ago.
The Haystack Problem
The staging table had 30 million rows. Every southern state's addresses, all in one table. When we ran a spatial join for North Carolina parcels, PostgreSQL had to search that entire staging table for points that fell inside each NC polygon.
Yes, we had a spatial index. But PostGIS spatial indexes (GiST) work by building bounding box trees. When your staging table contains addresses from Florida to West Virginia to Texas, the bounding boxes overlap massively. The index could narrow it down from 30 million to maybe 20 million, but that's still 20 million point-in-polygon tests per batch.
We were asking PostgreSQL to find a needle in a haystack, but the haystack was the entire southern United States and the needle was in Raleigh.
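A toy model makes the index's weakness concrete. A GiST index groups rows into pages and must visit every page whose bounding box overlaps the query area. If points from all over the South land on the same pages (roughly what you get from load order), almost every page overlaps any query box. The numbers and geometry below are illustrative, not our real data:

```python
import random

random.seed(42)

def make_points(n):
    # Rough lon/lat spread of the southern US (illustrative only)
    return [(random.uniform(-106, -75), random.uniform(25, 39)) for _ in range(n)]

def page_bbox(page):
    xs = [p[0] for p in page]
    ys = [p[1] for p in page]
    return min(xs), min(ys), max(xs), max(ys)

def pages_hit(points, query_box, page_size=100):
    """Count index pages whose bounding box overlaps the query box."""
    qx1, qy1, qx2, qy2 = query_box
    hits = 0
    for i in range(0, len(points), page_size):
        x1, y1, x2, y2 = page_bbox(points[i:i + page_size])
        if x1 <= qx2 and x2 >= qx1 and y1 <= qy2 and y2 >= qy1:
            hits += 1
    return hits

points = make_points(20_000)                 # 200 pages of 100 points
nc_box = (-84.3, 33.8, -75.4, 36.6)          # rough North Carolina bounding box

mixed = pages_hit(points, nc_box)            # load order: pages span the whole South
clustered = pages_hit(sorted(points), nc_box)  # sorted by longitude: tight pages
print(f"pages visited, mixed load order:    {mixed}/200")
print(f"pages visited, spatially clustered: {clustered}/200")
```

In the mixed case essentially every page survives the bounding-box filter, so the index saves almost nothing; clustered pages let most of the table be skipped outright.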
Version 2: Smaller Batches (Still Wrong)
First instinct: reduce batch sizes. Maybe 500 parcels at a time was too aggressive. We tried 100. Then 50.
This is the engineering equivalent of driving faster to make up for being lost. The fundamental problem wasn't batch size — it was that every single query had to scan 30 million staging rows. Smaller batches just meant we'd be doing that scan more times.
We killed it after an hour. Same symptoms, just more of them.
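The arithmetic behind that failure fits in a few lines. With a toy cost model (illustrative numbers, not measurements), each batch query pays a large fixed cost to scan the staging candidates, so shrinking batches only multiplies how often that cost is paid:

```python
# Toy cost model: per-query cost = fixed haystack scan + per-parcel work.
SCAN_COST = 20_000_000    # staging rows the index can't rule out, per query
PER_PARCEL = 1_000        # point-in-polygon work per parcel in the batch
NC_PARCELS = 3_000_000    # parcels to process for one state

def total_work(batch_size):
    batches = NC_PARCELS // batch_size
    return batches * (SCAN_COST + batch_size * PER_PARCEL)

for size in (500, 100, 50):
    print(f"batch size {size:>3}: {total_work(size):,} work units")
```

Halving the batch size roughly doubles the number of haystack scans. The fixed cost is the thing to attack, not the batch size.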
Version 3: Filter the Haystack First
The fix, when it finally clicked, was so simple it hurt.
Don't search the whole table. Build a smaller table first.
Before processing North Carolina parcels, extract just North Carolina's addresses from the staging table into a temporary table:
CREATE TEMP TABLE oa_state_addresses AS
SELECT hash, street, city, postcode, geom
FROM oa_addresses_south
WHERE state = 'NC';
CREATE INDEX ON oa_state_addresses USING GIST (geom);
Now the spatial join runs against maybe 3 million rows instead of 30 million. The GiST index is tight — all bounding boxes are within North Carolina. Point-in-polygon tests hit the right area immediately.
North Carolina: 12.8 minutes. Not 15 hours. Twelve minutes and forty-eight seconds.
We sat there watching the progress bar actually move and felt a very specific kind of shame.
The Full Run
We rewrote the backfill script as v3. For each state:
- Extract that state's addresses into a temp table
- Build a fresh spatial index on the temp table
- Run the spatial join in batches of 500
- Drop the temp table
- Move to the next state
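The loop above looks roughly like this sketch. Table and column names are assumptions reconstructed from the post, not the real schema, and execute() is a stand-in that records statements so the sketch runs without a database (batching of the UPDATE is elided; real code should also use parameterized SQL rather than f-strings):

```python
statements = []

def execute(sql):
    # Stand-in for a real DB cursor: just record the normalized statement.
    statements.append(" ".join(sql.split()))

def backfill_state(state):
    # 1. Filter the haystack: one state's addresses, freshly indexed.
    execute(f"""
        CREATE TEMP TABLE oa_state_addresses AS
        SELECT hash, street, city, postcode, geom
        FROM oa_addresses_south
        WHERE state = '{state}'
    """)
    execute("CREATE INDEX ON oa_state_addresses USING GIST (geom)")
    # 2. Spatial join against the small, tightly indexed table
    #    (hypothetical parcels schema; batching elided).
    execute(f"""
        UPDATE parcels p
        SET street = oa.street, city = oa.city, postcode = oa.postcode
        FROM oa_state_addresses oa
        WHERE p.state = '{state}'
          AND p.street IS NULL
          AND ST_Contains(p.geom, oa.geom)
    """)
    # 3. Drop the temp table before moving to the next state.
    execute("DROP TABLE oa_state_addresses")

backfill_state("NC")
for sql in statements:
    print(sql[:60])
```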
Four regions. All states. Two hours total.
The results:
- 1.89 million parcels got new addresses
- Overall searchable rate: 85.8% → 87.8%
- Ohio: +500,000 addresses
- North Carolina: +502,000 addresses
- Arizona: +262,000 addresses
- Kansas: 30% → 95% searchable (the biggest single-state improvement we've ever had)
Kansas going from 30% to 95% in a single run was genuinely satisfying. That's a state that went from "basically useless" to "better coverage than most commercial datasets" in two hours.
What's Still Broken
OpenAddresses doesn't cover everything. Some states remain stubbornly low:
- Wisconsin: 71% (OA data exists but doesn't fill the gaps)
- Iowa: 64%
- Mississippi: 70%
- North Dakota: 29% (statewide source has geometry only — no addresses at all)
- Montana: 59%
These need different data sources entirely — county assessor feeds, state GIS portals, maybe even geocoding against the USPS database. That's a different build log.
The Lesson
This is one of those optimizations that's so obvious it's almost not worth writing about. Filter a join's inputs down to the rows you actually need before running it. Don't make the database search the whole world when you only need one state.
But here's the thing: we know this. We've been building spatial data pipelines for months. We've optimized PostGIS queries before. We literally wrote build logs about query performance. And we still launched the naive version first, watched it run overnight, and only figured it out after 15 wasted hours.
The gap between knowing a principle and applying it at 2am when you're excited about shipping a feature is real. We got so focused on the result (fill address gaps!) that we skipped the five-minute step of asking "wait, how big is the staging table and does this query actually make sense?"
Rule of thumb for spatial joins: if your staging table is more than 10x bigger than the subset you're actually joining against, create a filtered temp table first. The time spent creating the temp table and its index is almost always less than the time wasted on a bad query plan. Always cut the haystack down before you search it.
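That rule of thumb is simple enough to write down as a predicate (the 10x threshold is our heuristic, not a universal constant):

```python
def should_prefilter(staging_rows: int, subset_rows: int, ratio: float = 10.0) -> bool:
    """True if the staging table dwarfs the subset you're joining against."""
    return staging_rows > ratio * subset_rows

print(should_prefilter(30_000_000, 1_000_000))  # 30x bigger -> prefilter
print(should_prefilter(5_000_000, 3_000_000))   # similar sizes -> join directly
```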
Fifteen hours to under thirteen minutes. Roughly a 70x improvement. From a fix that took ten minutes to write.
That's the job.
Build Log #010. We publish these as we go — the real engineering, not the polished version. If you're dealing with geospatial data at scale and want to talk shop, reach out.