Build Log is our honest engineering journal. Not the polished case study. The real stuff — including the parts where we look dumb.
The Context
I'm building LandPlanner.ai — a platform that does instant site feasibility analysis for land developers. You punch in coordinates, and we pull data from a dozen federal APIs — FEMA flood zones, USGS seismic hazard, NOAA climate normals, NREL solar irradiance, EPA brownfields — run it through an ML pipeline, and give you a feasibility assessment that used to cost $5K-$15K from a planning engineer.
One of those data sources is SSURGO — the USDA's Soil Survey Geographic Database. It tells you things like drainage class, soil composition, permeability, and whether land is classified as prime farmland. If you're evaluating a parcel for development, soil data isn't optional. It determines foundation requirements, stormwater engineering, septic feasibility, and whether you'll get pushback from the county about building on agricultural land.
SSURGO was "working." The pipeline ran. The predictions came back. No errors in the logs.
It was also completely fake.
The Discovery
I was doing a data source audit — one of those boring but necessary exercises where you go through each API connector and verify it's actually hitting the real endpoint. We have about 20 data sources. Most were fine. FEMA, USGS, NOAA — all making real HTTP calls, returning real data.
Then I looked at the .env file:
SSURGO_API_ENABLED=0
SSURGO_FIXTURE_ENABLED=1
Cool. Cool cool cool.
The fixture file was a 10-row CSV covering a tiny latitude/longitude range somewhere in central Utah. Ten rows. For a platform that's supposed to analyze parcels across 17 US states.
Any coordinates outside that microscopic bounding box — which is to say, essentially all coordinates — got no soil data at all. The featurizer handled this gracefully by filling missing numeric features with 0.0. No error. No warning. Just... zero permeability, zero drainage, zero everything. Silently fed into the ML model like that was a real place where soil doesn't exist.
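To make the failure mode concrete, here's a minimal sketch of what a zero-filling featurizer looks like. The function and feature names here are hypothetical, not our actual featurizer code:

```python
# Hypothetical sketch (not our real featurizer): missing soil data
# silently becomes 0.0 for every feature.
SOIL_FEATURES = ["permeability", "drainage_index", "hydrologic_group_code"]

def featurize_soil(ssurgo_record):
    # ssurgo_record is None when the lookup missed, e.g. coordinates
    # outside the fixture's bounding box.
    record = ssurgo_record or {}
    return {name: float(record.get(name, 0.0)) for name in SOIL_FEATURES}

# A parcel far outside the fixture's Utah bounding box:
print(featurize_soil(None))
# -> {'permeability': 0.0, 'drainage_index': 0.0, 'hydrologic_group_code': 0.0}
```

Nothing raises, nothing warns. The zeros look exactly like data.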
Why It Was Like That
The fixture system wasn't malicious or lazy. It was pragmatic. Early in development, before the API integrations were stable, I'd built fixture files for each data source so I could develop the pipeline without waiting for HTTP calls. A 10-row CSV that returns instantly beats a 40-second API timeout when you're iterating on feature engineering.
The problem is I never turned it off.
The SSURGO API connector was written, tested, and working. But somewhere in the process of getting everything else running — Stripe billing, authentication, the frontend, the Celery task queue — switching SSURGO_FIXTURE_ENABLED from 1 to 0 fell off the list. There was no deployment checklist. No integration test that verified data source provenance. The fixture was supposed to be scaffolding. It became load-bearing.
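One cheap guard we could have had from day one: a startup check that refuses to boot in production with any fixture flag still on. A sketch, assuming an APP_ENV variable and the *_FIXTURE_ENABLED naming convention from our .env; the helper name is hypothetical:

```python
import os

def assert_no_fixtures_in_prod():
    # Fail fast at startup if any *_FIXTURE_ENABLED flag is still on
    # in production. NASS is exempt: its fixture mode is deliberate.
    if os.environ.get("APP_ENV") != "production":
        return
    allowed = {"NASS_FIXTURE_ENABLED"}
    offenders = [
        name for name, value in os.environ.items()
        if name.endswith("_FIXTURE_ENABLED")
        and value == "1"
        and name not in allowed
    ]
    if offenders:
        raise RuntimeError(f"Fixture flags enabled in production: {offenders}")
```

Call it once at app startup. Scaffolding that can't survive a deploy can't become load-bearing.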
The Fix (That Wasn't)
Simple, right? Flip the flags. SSURGO_API_ENABLED=1, SSURGO_FIXTURE_ENABLED=0. Restart the backend. Done.
I flipped the flags, ran a test analysis, and got a stack trace.
The SSURGO connector was querying the USDA's Soil Data Access (SDA) API — a free SOAP/REST service at sdmdataaccess.nrcs.usda.gov. The SQL query was joining two tables: muaggatt (map unit aggregated attributes) and the spatial geometry table. One of the columns we needed was farmlndcl — the farmland classification.
The query referenced mu_agg.farmlndcl, pulling it from the muaggatt table.
Except farmlndcl isn't on muaggatt. It's on the mapunit table.
The SSURGO database schema has dozens of tables with overlapping naming conventions. muaggatt has aggregated attributes. mapunit has per-unit attributes. farmlndcl sounds like it should be aggregated. It's not. It's a per-unit classification that lives on mapunit.
So the "working" API connector had never actually worked. It was written, it compiled, the SQL syntax was valid — but it referenced a column on the wrong table. The fixture had been covering for a connector that would have failed on every single real request.
The Actual Fix
Two changes in ssurgo_api.py:
- Changed mu_agg.farmlndcl to mu.farmlndcl
- Removed the muaggatt JOIN (we weren't using any other columns from it)
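For the curious, here's a simplified sketch of the corrected query, not our actual ssurgo_api.py. The mu.farmlndcl column and the dropped muaggatt join are straight from the fix above; the endpoint path, payload format, and the point-intersection helper reflect my reading of SDA's Tabular service, so verify against the current docs before copying:

```python
import json
import urllib.request

SDA_URL = "https://sdmdataaccess.nrcs.usda.gov/Tabular/post.rest"

def build_query(lat, lon):
    # farmlndcl lives on mapunit (aliased mu), NOT on muaggatt --
    # the original bug selected mu_agg.farmlndcl from the wrong table.
    wkt = f"point({lon} {lat})"
    return (
        "SELECT mu.mukey, mu.muname, mu.farmlndcl "
        "FROM mapunit mu "
        "WHERE mu.mukey IN ("
        "SELECT * FROM SDA_Get_Mukey_from_intersection_with_WktWgs84("
        f"'{wkt}'))"
    )

def query_sda(lat, lon):
    # POST the SQL to SDA's tabular endpoint; response is JSON rows.
    payload = json.dumps({"query": build_query(lat, lon), "format": "JSON"})
    req = urllib.request.Request(
        SDA_URL,
        data=payload.encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```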
Tested with coordinates in Iowa (42.0, -93.5). Response came back in about 2 seconds:
{
  "drainage_class": "Poorly drained",
  "hydrologic_group": "C/D",
  "texture": "Clay loam",
  "series_name": "Harps",
  "permeability": 0.399,
  "farmland_class": "All areas are prime farmland",
  "source": "api:ssurgo"
}
Real data. From the actual USDA. For the first time ever in production.
The Audit Continues
Finding SSURGO prompted a full data source inventory. Here's what we found:
Actually live and working: FEMA flood zones, USGS seismic/groundwater, NOAA climate, NREL solar, EPA brownfields, BLS construction costs, Census demographics, Walk Score, AirNow air quality, USFS wildfire risk, NCES schools, IPAC endangered species, traffic data.
Intentionally using fixtures: NASS (agricultural statistics) — this one is deliberately on fixture mode because the API consistently times out at ~40 seconds, which blocks the entire analysis endpoint. This is a known tradeoff and it's the right call until we add async timeout handling.
Disabled entirely: CropScape (redundant with NASS), Regrid parcel boundaries (paid API, no key yet), First Street climate risk (paid API, no key yet).
SSURGO was the only source that was accidentally on fixtures. But one is enough.
The Uncomfortable Question
Our production ML model — the one currently serving predictions to users — was trained on ~945,000 samples. Every single one of those samples went through the featurizer with SSURGO data set to zero. The model has literally never seen real soil data.
The model's R² is 0.603 and MAPE is 4.8%, which sounds reasonable until you realize it achieved those numbers while being completely blind to soil conditions. Either soil data doesn't matter for our prediction target (possible but unlikely for land feasibility), or the model learned to compensate by over-weighting other correlated features (more likely and more dangerous).
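One quick sanity check before retraining: any feature column that never varies across the training set is dead weight, and the zero-filled SSURGO columns are exactly that. A small illustrative sketch with toy data and hypothetical feature names:

```python
def dead_features(rows, feature_names):
    # A feature whose value never varies across training samples
    # carries no signal: the model cannot have learned from it.
    dead = []
    for i, name in enumerate(feature_names):
        if len({row[i] for row in rows}) <= 1:
            dead.append(name)
    return dead

# Toy training matrix: two real features plus a SSURGO column that
# was silently zero-filled for every sample.
rows = [
    (0.8, 12.0, 0.0),
    (0.3,  9.5, 0.0),
    (0.6, 11.2, 0.0),
]
print(dead_features(rows, ["slope", "rainfall", "permeability"]))
# -> ['permeability']
```

Running this over the real 945K-sample matrix would have flagged every SSURGO feature on day one.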
We need to retrain. With real data this time.
What We're Doing About It
Beyond the immediate fix, we're building three things:
- Data provenance tracking. Every analysis now logs whether each data source returned real API data, fixture data, or nothing. If a source silently degrades to fixtures, we'll know.
- Feature completeness scoring. Instead of silently filling missing values with 0.0, we track how many features were actually populated vs. defaulted. A prediction based on 30/45 real features is qualitatively different from one based on 45/45, and the user should know.
- Pre-deployment data source checklist. Before any deployment touches production, verify each data source is hitting its real endpoint. Boring. Essential.
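A minimal sketch of what the provenance and completeness tracking could look like; names and structure are illustrative, not our production code:

```python
from dataclasses import dataclass, field

@dataclass
class Provenance:
    # Per-source record of where each value came from:
    # "api", "fixture", or "missing".
    sources: dict = field(default_factory=dict)

    def record(self, source_name, kind):
        self.sources[source_name] = kind

    def completeness(self):
        # Fraction of sources that returned real API data.
        if not self.sources:
            return 0.0
        real = sum(1 for kind in self.sources.values() if kind == "api")
        return real / len(self.sources)

prov = Provenance()
prov.record("ssurgo", "api")
prov.record("fema", "api")
prov.record("nass", "fixture")
print(round(prov.completeness(), 2))  # -> 0.67
```

Attach one of these to every analysis, log it, and surface the score to the user.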
The lesson: Fixture data is a development tool, not a deployment strategy. If your test data can silently replace real data in production without anything breaking, your pipeline doesn't actually know the difference — and that's the real bug.
This is Build Log #005. Previous entries: #004 (security audit), #003 (local embeddings), #002 (fake confidence scores), #001 (RAG bugs). We publish these because building AI systems for real businesses is messier than the tutorials suggest, and we think the honest version is more useful. Get in touch if you want to talk about building something.