Build Log #009: The 295x Speedup That Was Just a Missing Index

55 million rows. No spatial index. Every reverse geocode was a full sequential scan. The fix was one SQL statement, and the real lesson is about what you forget to check when you're busy being clever.

This is an entry in our build log — an honest engineering journal of building AI-powered tools for real estate and land analysis. Not the polished version. The actual story.

Previously on "We Made It Faster"

Two posts ago, we wrote about taking our analysis pipeline from 58 seconds down to 15. We parallelized 22 API calls. Killed a retry storm. Added timing instrumentation. Felt pretty good about ourselves.

Then a potential client tried to analyze a parcel in Iowa and the request timed out at 128 seconds.

So much for feeling good.

The Symptom: Iowa Is Broken

The report came in from a real user — an agricultural land consultant looking at parcels in Crawford County, Iowa. They'd search an address, and the spinner would just... spin. Two minutes. Timeout. Nothing.

Meanwhile, Utah parcels returned in 12 seconds. North Carolina in 11. Kansas, which has almost no data, came back in 14 seconds with a "sorry, not much here" shrug.

Iowa? Dead silence, then a 504.

The frustrating part: we'd just shipped phase timing instrumentation. We had phase_latlon_to_features, phase_reverse_lookup, phase_feature_store, the whole nine yards. So we knew where time was being spent.

We just didn't want to believe the answer.

The Diagnosis: 55 Seconds Doing Nothing Useful

The phase timers told the story immediately:

phase_reverse_lookup: 55.2s
phase_feature_store:   2.1s
phase_pre_ml:          0.8s
phase_ml_predict:      1.1s

55 seconds. Just on the reverse geocode — the step where we take latitude/longitude coordinates and find the nearest parcel in our database. A PostGIS ST_DWithin query against the core_parcels table.

Now, core_parcels is our main table. It holds every parcel we've ingested across every state. At this point, that's 55 million rows with full geometry data.

One query. 55 million rows. 55 seconds.

Want to guess why?

The Fix: One Line of SQL

CREATE INDEX idx_core_parcels_geom ON core_parcels USING GIST (geom);

That's it. That's the fix.

There was no GiST spatial index on the geometry column. Every ST_DWithin query was doing a full sequential scan across 55 million polygons, computing distance for every single one, to find the nearest parcel to a given point.

With the index: under 0.2 seconds.

Without the index: 55 seconds.

That's a 295x speedup from a single CREATE INDEX statement.
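If you want to check this on your own data, the query shape matters: both ST_DWithin and the KNN ordering operator (<->) can use a GiST index, and prepending EXPLAIN ANALYZE to the query shows whether the planner picks the index or falls back to a sequential scan. Here's a minimal Python sketch of what a lookup like this can look like. The core_parcels table, geom column, and index name come from this post; parcel_id, the SRID, and the rest of the query shape are illustrative assumptions, not our exact production code.

# Minimal sketch of a nearest-parcel lookup, NOT our production code.
# core_parcels, geom, and idx_core_parcels_geom come from the post; the
# parcel_id column, SRID, radius, and overall query shape are assumptions.
NEAREST_PARCEL_SQL = """
    SELECT parcel_id
    FROM core_parcels
    WHERE ST_DWithin(
        geom,
        ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326),
        %(radius_deg)s
    )
    ORDER BY geom <-> ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326)
    LIMIT 1;
"""

def nearest_parcel(conn, lat, lon, radius_deg=0.02):
    """Return the id of the closest parcel within radius_deg, or None.

    conn is any DB-API connection (e.g. from psycopg2.connect).
    """
    with conn.cursor() as cur:
        # Prepend EXPLAIN ANALYZE to this query to see whether the planner
        # chooses "Index Scan using idx_core_parcels_geom" or a Seq Scan.
        cur.execute(NEAREST_PARCEL_SQL,
                    {"lat": lat, "lon": lon, "radius_deg": radius_deg})
        row = cur.fetchone()
        return row[0] if row else None

The ORDER BY ... LIMIT 1 pattern is what turns "find everything within the radius" into "find the single nearest parcel," and with the GiST index in place, both the filter and the ordering can be answered without touching all 55 million rows.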

I wish I could tell you this took days of profiling and clever query optimization. I wish I could write a heroic story about rewriting the query planner or implementing a custom R-tree. But no. We just forgot to index the column. On a table with 55 million rows. That gets queried on every single analysis request.

How Did We Miss This?

Here's the annoying part: it worked fine for months. When core_parcels had 5 million rows, the sequential scan took about 3 seconds. Not great, but buried under the 30+ seconds of external API calls. Invisible in the noise.

We kept ingesting data. 10 million rows. 20 million. 35 million. The sequential scan got linearly slower — because that's what sequential scans do — but we were simultaneously optimizing other things. Parallelizing API calls. Killing retry storms. The overall analysis time kept going down even as this one query got worse.

By the time we hit 55 million rows, the reverse geocode alone was taking longer than the entire pipeline used to take. But because we'd been optimizing other bottlenecks, we hadn't re-profiled this specific step since it was "fine."

Classic boiled frog. The data grew gradually, the pain grew gradually, and we never noticed because we were looking elsewhere.

The Stacking Effect: Three Fixes, Not One

The index was the big win, but we shipped two more fixes in the same session:

Smart fast-fail for reverse geocode. Previously, if our local database didn't have a parcel match, we'd fall through to external geocoding endpoints — Census Bureau, ArcGIS, the whole chain. Each one with its own timeout. For states where we had good coverage but an inexact address match, this added 30-55 seconds of wasted external calls.

The fix: if the local DB has parcels within 0.02 degrees (~2km) but none match the address exactly, stop there. Don't ask the Census Bureau. They don't know either. If there are zero parcels within 0.5 degrees, skip immediately — we don't have data for this area, and no amount of external calls will conjure it.
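Here's a minimal sketch of that decision tree, assuming the local lookup returns candidate parcels with a distance in degrees and an exact-match flag, and that the external geocoders are plain callables tried in order. The thresholds are the ones above; the names and shapes are illustrative, not our actual code.

# Sketch of the fast-fail rule under the assumptions stated above.
from dataclasses import dataclass

NEARBY_DEG = 0.02    # ~2 km: we have local coverage, trust the local answer
ANY_DATA_DEG = 0.5   # beyond this we simply have no data for the area

@dataclass
class Candidate:
    parcel_id: str
    distance_deg: float
    exact_address_match: bool

def resolve_parcel(point, candidates, external_geocoders):
    if not any(c.distance_deg <= ANY_DATA_DEG for c in candidates):
        # Nothing anywhere near the point: external geocoders can't conjure
        # parcels we never ingested, so fail immediately.
        return None

    exact = next((c for c in candidates if c.exact_address_match), None)
    if exact is not None:
        return exact

    if any(c.distance_deg <= NEARBY_DEG for c in candidates):
        # Good local coverage but an inexact address: stop here instead of
        # spending 30-55s asking Census / ArcGIS, who don't know either.
        return min(candidates, key=lambda c: c.distance_deg)

    # Sparse coverage near the point: fall through to the external chain.
    for geocode in external_geocoders:
        result = geocode(point)
        if result is not None:
            return result
    return None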

Irrigation feasibility prefetch. The WaterResourcesService.resolve_irrigation_feasibility call was running sequentially after everything else, adding 12.7 seconds. It has zero dependencies on the other data. Moved it into the parallel prefetch pool where it belonged.
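A minimal sketch of that change, assuming the prefetch pool is built on asyncio.gather. The method name comes from this post; the other fetchers and the pool shape are stand-ins.

# Sketch only: stand-in fetchers, real structure assumed to be asyncio-based.
import asyncio

async def fetch_soil(lat, lon): ...          # existing prefetch task (stand-in)
async def fetch_flood_zone(lat, lon): ...    # existing prefetch task (stand-in)
async def resolve_irrigation_feasibility(lat, lon): ...  # the 12.7s call

async def prefetch_all(lat, lon):
    # The irrigation call depends only on the coordinates, so it belongs in
    # the same gather() as everything else instead of running afterwards.
    return await asyncio.gather(
        fetch_soil(lat, lon),
        fetch_flood_zone(lat, lon),
        resolve_irrigation_feasibility(lat, lon),
        return_exceptions=True,  # one slow or failed source shouldn't sink the rest
    )

Whether return_exceptions=True is the right trade-off depends on how downstream code handles partial data; the point here is simply that independent I/O runs in one concurrent batch.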

Combined results:

  • Iowa: 128s (timeout) → 11.1s
  • Kansas: 45.8s → 13.9s
  • Utah: 72-94s → ~12s

Iowa went from literally unusable to faster than most web pages load.

The Docker Gotcha That Almost Cost Us an Hour

Quick aside on a bonus lesson we paid for this week: after writing the fix, we ran docker compose restart api and tested. Still slow. What?

Turns out our Docker setup bakes the Python source code into the image at build time. docker compose restart just restarts the existing container with the existing image. The source files on the host had changed, but the running container was still using the old code.

The correct command: docker compose up -d --build api

This seems obvious in retrospect. It is obvious. But at 1 AM, staring at timing logs that refuse to change, "maybe Docker isn't picking up the new code" isn't your first thought. You're looking at query plans and connection pools and EXPLAIN ANALYZE output. Not container image caching.

We've added this to our deployment runbook in all-caps.

The Accuracy Emergency

Speed wasn't the only crisis this week. Our third real user searched for "130 N Academy St, Mooresville, NC" — Iredell County, near Lake Norman. The system returned a parcel on W Moore Ave in the wrong county entirely.

Three real users. Three accuracy issues. That's a 100% failure rate on first impressions.

We shipped a three-layer defense:

  1. Approximate tagging: If the nearest parcel polygon doesn't actually contain the search point, we tag it as Local/PostGIS(approx) instead of claiming it's exact.
  2. Similarity gate: If Census geocodes the address but the nearest parcel's address similarity score is below 0.25, we return the coordinates but don't attach the wrong parcel. Source: Census/Geocoder(no-parcel).
  3. Frontend warning: When the source contains "approx" or "no-parcel," users see a persistent banner: "Exact parcel boundary not found." No silent wrong answers. All three layers are sketched below.
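Concretely, here's a minimal sketch of how the gates can fit together, assuming an address similarity score and a point-in-polygon check are computed upstream. The 0.25 threshold, the source labels, and the banner behavior come from the list above; the function names, the exact-match label, and the result shape are illustrative.

# Sketch of the three layers, not the production code.
SIMILARITY_FLOOR = 0.25

def attach_parcel(point, nearest_parcel, address_similarity, polygon_contains_point):
    # Layer 2: the geocoder found coordinates, but the closest parcel's address
    # is too different to trust. Return the point without attaching a parcel.
    if nearest_parcel is None or address_similarity < SIMILARITY_FLOOR:
        return {"point": point, "parcel": None, "source": "Census/Geocoder(no-parcel)"}

    # Layer 1: the nearest polygon doesn't actually contain the search point,
    # so label the match as approximate rather than exact.
    if not polygon_contains_point:
        return {"point": point, "parcel": nearest_parcel, "source": "Local/PostGIS(approx)"}

    return {"point": point, "parcel": nearest_parcel, "source": "Local/PostGIS"}

def needs_warning(source):
    # Layer 3: the frontend shows "Exact parcel boundary not found" when true.
    return "approx" in source or "no-parcel" in source

Every path returns something, but only the path where the polygon actually contains the point claims to know the parcel.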

The root cause for that specific search was a coverage gap — we only had 25 of North Carolina's 100 counties loaded. Iredell wasn't one of them. We launched a full NC ingestion and now have all 100 counties, 5.9 million parcels.

But the broader lesson is this: wrong data is worse than no data. A blank result says "we don't know." A confident wrong answer says "we know, and here's your parcel" — and then the user makes decisions based on garbage. The three-layer defense exists specifically to convert wrong answers into honest "I'm not sure" answers.

What We Shipped This Week (The Full List)

  • 295x reverse geocode speedup (spatial index)
  • Smart fast-fail on external geocode endpoints
  • Irrigation feasibility parallelized
  • Three-layer accuracy defense (approx tags, similarity gate, frontend warnings)
  • North Carolina: 25 → 100 counties (5.9M parcels)
  • Connecticut: 1.14M sale transaction records ingested
  • Iowa: address coverage 6.4% → 62%, assessed values backfilled
  • Full comparable sales search feature (/app/comps)
  • Total database: ~80M parcels across 34 states

The Real Lesson

We've now written three consecutive build logs about making the same endpoint faster. Each time, the biggest win was the most embarrassing fix. A retry storm against a dead endpoint. Sequential API calls that could have been parallel. A missing index on the primary query table.

None of these required sophisticated algorithms. None required distributed computing or horizontal scaling or any of the things you'd reach for in a system design interview. They required looking. Profiling. Questioning assumptions. Asking "wait, does this table even have an index?"

The 295x speedup took 30 seconds to write. Finding out we needed it took embarrassingly longer.

Lesson: When your database doubles in size, reprofile everything. What was "fine at 5 million rows" is potentially catastrophic at 55 million. Especially spatial queries. Especially without an index.

This is Build Log #009. We publish these as we build — the real engineering stories behind production AI and data systems. If you're scaling geospatial data, fighting accuracy issues, or just want to commiserate about indexes you forgot to create, let's talk.