Build Log is our engineering journal. Previous entries: #001 (RAG bugs), #002 (fake confidence scores), #003 (local embeddings).
Why We Audited Ourselves
We run a production AI platform — LandPlanner.ai — that handles real user data, processes payments through Stripe, and integrates with 15+ external APIs. It's Docker-based, runs on a hardened Ubuntu server, and we built it over months of rapid iteration.
"Rapid iteration" is a polite way of saying "we shipped fast and worried about security later."
Later arrived. We committed to a full audit before any marketing push. The rule was simple: find everything, fix everything, no excuses. Here's what we found.
The Findings (Grouped by Severity)
Critical: Things That Could Cause Real Damage
1. Let's Encrypt private key committed to git.
Not in the current tree — in the git history. Someone had committed the TLS certificate's private key months ago, then removed it in a later commit. But git never forgets. Anyone with repo access could extract the private key from history and intercept all HTTPS traffic to the domain.
Fix: git filter-repo to permanently purge the file from all history. Force push. Revoke and regenerate the certificate. Add *.pem and *.key to .gitignore. This is a "stop everything and fix it now" finding.
2. API keys and secrets in git history.
Same pattern. API keys for Stripe, Anthropic, and other services had been committed to config files at various points during development. They were moved to .env files later, but the old commits still contained live credentials.
Fix: Purge history with git filter-repo. Rotate every compromised key. Add a pre-commit hook that scans for common secret patterns (API keys, passwords, tokens). Set up a GitHub security scanning workflow.
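To give a feel for what that pre-commit hook does, here is a minimal sketch of the scanning logic. The patterns and function names are illustrative, not our exact hook; an off-the-shelf scanner such as gitleaks or detect-secrets covers many more credential formats.

```python
import re

# Illustrative patterns only -- a production hook (or a dedicated tool like
# gitleaks / detect-secrets) matches many more credential formats.
SECRET_PATTERNS = [
    re.compile(r"sk_live_[0-9a-zA-Z]{10,}"),                # Stripe live key shape
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),  # PEM private key header
    re.compile(r"(?i)(api_key|password|secret|token)\s*[:=]\s*['\"][^'\"]{8,}"),
]

def scan(text: str) -> list[str]:
    """Return secret-looking matches. A pre-commit hook runs this over the
    staged files and exits non-zero (blocking the commit) on any hit."""
    hits: list[str] = []
    for pattern in SECRET_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits
```

Wired up via `.git/hooks/pre-commit` or the pre-commit framework, this catches the "oops, pasted a key into a config file" commit before it ever reaches history.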
3. Docker port bindings bypassing the firewall.
This is the one that keeps data center veterans up at night. We had ufw configured correctly — ports 22, 80, and 443 only. Perfect firewall rules. Except Docker writes its own iptables rules when it publishes a port, and those rules are evaluated before ufw's, so ufw never sees the traffic.
Our docker-compose.yml had services binding to 0.0.0.0:8002, 0.0.0.0:5432, 0.0.0.0:3000, and 0.0.0.0:9090. PostgreSQL, the API server, Grafana, and Prometheus — all directly accessible from the internet, firewall be damned.
Fix: Remove all unnecessary port bindings from docker-compose.yml. Services that only need inter-container communication use Docker's internal network — no port binding at all. PostgreSQL bound to 127.0.0.1:5433:5432 (localhost only, non-standard port). This is not optional — if you use Docker and ufw, you probably have this bug right now.
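For illustration, the shape of the change in docker-compose.yml looks like this. The service and network names here are hypothetical, not our actual file:

```yaml
services:
  db:
    image: postgres:16
    # Before: "0.0.0.0:5432:5432" -- reachable from the internet despite ufw.
    # After: reachable only from the host itself, on a non-standard port.
    ports:
      - "127.0.0.1:5433:5432"
  worker:
    build: .
    # No ports: section at all -- other containers reach this service over
    # the internal Docker network; nothing is published to the host.
    networks:
      - backend
networks:
  backend:
    internal: true  # optional: also blocks outbound internet from this network
```

The rule of thumb: a `ports:` entry is a statement that something *outside* Docker needs to reach the service. If nothing outside Docker does, delete the entry.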
4. Pickle deserialization in cache layer.
Our caching module used Python's pickle for serialization. If you're not familiar: pickle.loads() on untrusted data is a remote code execution vector. An attacker who can write to the cache can execute arbitrary Python code on the server.
Fix: Replace all pickle serialization with JSON. Slightly more restrictive (can't serialize arbitrary Python objects), but RCE-free. If you see pickle in a production codebase, treat it as a bug.
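A sketch of the swap, assuming cache values are plain data. The function names are illustrative:

```python
import json
from typing import Any

def dumps(value: Any) -> bytes:
    """Serialize a cache value. Unlike pickle, json.dumps can only emit
    plain data (dicts, lists, strings, numbers, bools, None), so a
    tampered cache entry cannot smuggle in executable code."""
    return json.dumps(value).encode("utf-8")

def loads(raw: bytes) -> Any:
    """Deserialize a cache value. json.loads never constructs arbitrary
    objects or calls arbitrary callables -- worst case is a ValueError."""
    return json.loads(raw.decode("utf-8"))
```

The trade-off mentioned above shows up immediately in code: datetimes, sets, and custom classes now need explicit encoding. That restriction is exactly what makes the format safe.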
High: Things That Should Be Fixed Immediately
5. Redis with no authentication. Anyone on the Docker network could connect to Redis. Added password auth, updated all connection strings.
6. SSH password authentication enabled. Key-based auth was set up, but password auth wasn't explicitly disabled. Switched to ed25519 key-only, disabled PasswordAuthentication, disabled ChallengeResponseAuthentication.
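For reference, the relevant sshd_config lines look like this. Restart sshd afterwards, and keep an existing session open while you verify a new key-based login actually works:

```
# /etc/ssh/sshd_config
PubkeyAuthentication yes
PasswordAuthentication no
# On OpenSSH 8.7+ this option is named KbdInteractiveAuthentication;
# the old ChallengeResponseAuthentication spelling is a deprecated alias.
ChallengeResponseAuthentication no
```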
7. CORS open to all origins. Access-Control-Allow-Origin: * on the API. Locked down to the production domain only.
8. .env file permissions too broad. World-readable. Contains Stripe live keys, database credentials, API keys. chmod 600 — owner read/write only.
9. Grafana using default admin password. admin/admin. On a service that was accidentally internet-accessible (see finding #3). Changed immediately.
Medium: Things That Should Be Fixed Soon
10. No dependency vulnerability scanning. No automated process to check for known CVEs in Python packages. Added a GitHub Actions workflow that runs pip-audit and safety on every push.
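A minimal version of that workflow. The file path, action versions, and exact commands shown here are illustrative:

```yaml
# .github/workflows/dependency-audit.yml (illustrative)
name: dependency-audit
on: [push, pull_request]
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pip-audit safety
      # Fail the build on any known CVE in pinned dependencies.
      - run: pip-audit -r requirements.txt
      - run: safety check -r requirements.txt
```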
11. No CSRF protection on admin endpoints. Admin panel had session cookies but no CSRF tokens. Added CSRF middleware.
12. Deprecated Docker Compose syntax. version: "3.9" is deprecated and generates warnings. Removed (Compose V2 doesn't need it).
13-16. Various configuration hardening items. Secure cookie flags, rate limiting on auth endpoints, request size limits, log sanitization (stripping credentials from error logs).
Low: Good Practices We Were Missing
17-20. Documentation, monitoring, and process items. No security runbook. No incident response plan. No automated backup verification. No API key rotation schedule.
The Two Findings That Matter Most
If you take nothing else from this post, remember findings #3 and #4.
Docker + ufw = false sense of security. This combination is running on thousands of production servers right now, and the operators believe their firewall is protecting them. It's not. Docker's iptables manipulation bypasses ufw entirely. If your Docker Compose file publishes a port with a bare mapping like "8080:8080", that port is accessible from the internet regardless of your ufw rules. The fix is either don't publish ports you don't need, or bind them to 127.0.0.1 explicitly.
Pickle in production is a CVE waiting to happen. Every security-conscious Python developer knows this, but it still shows up in production codebases constantly — especially in caching layers, ML model serialization, and session stores. The rule is absolute: never unpickle data that could have been modified by an untrusted party. Use JSON, MessagePack, or Protocol Buffers instead.
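To see why the rule is absolute: the pickle byte stream can name any importable callable, and pickle.loads will call it during deserialization. A deliberately harmless demonstration (a real attacker would name os.system instead of eval):

```python
import pickle

class CacheEntry:
    """Looks like a plain cache object, but pickle.loads will import and
    CALL whatever __reduce__ names. Here it's the harmless builtins.eval;
    an attacker would name os.system or subprocess.call instead."""
    def __reduce__(self):
        return (eval, ("6 * 7",))

blob = pickle.dumps(CacheEntry())   # what an attacker writes into the cache
result = pickle.loads(blob)         # runs eval("6 * 7") during load
print(result)                       # 42 -- and no CacheEntry in sight
```

Note that nothing here exploits a bug in pickle. Calling arbitrary callables on load is pickle working as designed, which is why the only fix is to not feed it untrusted bytes at all.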
What This Has to Do With AI
Here's the thing: none of these findings were AI-specific. They were infrastructure security basics — secrets management, network hardening, serialization safety, access controls.
But AI systems make all of these worse because:
- AI platforms handle more sensitive data. Your RAG system indexes internal documents, customer data, proprietary knowledge. A database breach on an AI platform leaks your client's entire institutional knowledge.
- AI systems have more API keys. LLM provider, embedding service, vector database, monitoring: each one is a credential to manage and protect.
- AI developers move fast. ML engineers and AI developers are optimizing for model performance, not infrastructure security. The Docker Compose file is a means to an end, not a security boundary.
- AI platforms often run in "research mode" longer than they should. What started as a Jupyter notebook experiment becomes a FastAPI service becomes a production system — without ever getting a security review.
This is exactly why AI compliance and governance is one of our core services. The AI model can be perfect, but if the infrastructure it runs on has 20 security holes, your client's data isn't safe.
The Audit Process
For anyone who wants to audit their own AI platform, here's the process we followed:
- Git history scan. Search for committed secrets: `git log --all -p | grep -i "api_key\|password\|secret\|token\|private"`. If you find anything, `git filter-repo` + rotate credentials.
- Network scan. From outside the server: `nmap -sT server_ip`. Every open port you don't expect is a finding.
- Docker port audit. `docker compose config | grep -A2 ports`. Every binding to `0.0.0.0` is suspect.
- Dependency scan. `pip-audit` and `safety check` on your requirements.
- Serialization audit. `grep -r "pickle" --include="*.py"`. Replace with JSON.
- File permissions. `find . -name ".env" -o -name "*.pem" -o -name "*.key" | xargs ls -la`. Nothing should be world-readable.
- Auth review. Test every endpoint without credentials. Test with expired credentials. Test with another user's credentials.
- CORS check. `curl -H "Origin: https://evil.com" -I https://your-api/endpoint`. If it reflects the origin, that's a finding.
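A couple of these checks are easy to script so they run on every deploy. A sketch that automates the file-permission and serialization checks (the thresholds and file patterns are illustrative):

```python
import stat
from pathlib import Path

SENSITIVE_SUFFIXES = {".pem", ".key"}
SENSITIVE_NAMES = {".env"}

def readable_by_others(path: Path) -> bool:
    """True if group or other can read the file (looser than chmod 600)."""
    mode = path.stat().st_mode
    return bool(mode & (stat.S_IRGRP | stat.S_IROTH))

def audit(root: str) -> list[str]:
    """Flag loose permissions on sensitive files and any pickle usage."""
    findings = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        sensitive = path.name in SENSITIVE_NAMES or path.suffix in SENSITIVE_SUFFIXES
        if sensitive and readable_by_others(path):
            findings.append(f"loose permissions: {path}")
        if path.suffix == ".py" and "pickle" in path.read_text(errors="ignore"):
            findings.append(f"pickle usage: {path}")
    return findings
```

Run it from the repo root and treat any non-empty result as a failed check; the network and auth items above still need a human.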
This isn't exhaustive, but it catches the most common and most dangerous issues. For our platform, it found 20 — every one of which was fixable in a day.
The Result
Twenty findings. All remediated. Git history clean. Network hardened. Credentials rotated. Serialization safe. Monitoring in place.
The platform is more secure now than 99% of AI systems in production. Not because we're security geniuses — because we actually looked. Most teams never audit. They assume Docker + cloud = secure. It doesn't.
If you're running an AI system in production and you haven't done a security audit, you have at least five of these twenty findings right now. Probably more. The question is whether you find them before someone else does.
Build Log #004. We offer AI compliance and security audits because we've been through it ourselves. If you want your AI system audited by engineers who know what to look for, let's talk.