Build Log is our engineering journal. #001 was about three bugs in our RAG demo. #002 was about fake confidence scores. This one's about vendor dependency — and how fast you can kill it.
The Plan
We were building the embedding pipeline for our RAG demo. The plan was straightforward: use Voyage AI's embedding API. Voyage makes some of the best embedding models available — voyage-large-2 produces 1024-dimensional vectors with excellent retrieval accuracy. Industry darling. Well-documented. Should be simple.
We had an Anthropic API key already (we use Claude for generation). Voyage AI is closely associated with Anthropic — their embeddings are recommended in Anthropic's own documentation. Reasonable assumption: our Anthropic key would work with Voyage's API.
The 401
```
HTTP 401 Unauthorized
{"detail": "Invalid API key"}
```
Turns out, reasonable assumptions and API authentication are not friends. Voyage AI requires its own separate API key. Different account. Different billing. Different signup process.
Not a big deal in isolation. Sign up, get a key, move on. But this happened during a live build session, and it triggered a more important question: do we actually want another API dependency?
The Math That Changed Our Mind
We stopped and ran the numbers:
- Voyage AI cost: ~$0.0001 per 1,000 tokens. For 60 chunks of ~200 tokens each, that's roughly $0.0012 to embed the full corpus. Per query, another fraction of a cent. Cheap.
- Voyage AI dependencies: API key management, network call per embed, rate limits, downtime risk, vendor pricing changes, another account to manage, another credential to secure.
- Local model cost: $0.00. Forever. No API key. No network call. No rate limit. No vendor.
- Local model quality: `all-MiniLM-L6-v2` — 22 million parameters, 384-dimensional embeddings. Not as powerful as Voyage's 1024-dim model, but for a knowledge base with 60 chunks? Dramatically overpowered for the job.
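The cost line above is simple arithmetic, and worth sanity-checking. The figures below are the estimates from the list, not measured usage:

```python
# Sanity-check the Voyage cost estimate quoted above.
# All numbers are the article's estimates, not measured values.
chunks = 60
tokens_per_chunk = 200
price_per_1k_tokens = 0.0001  # USD, approximate Voyage pricing cited above

total_tokens = chunks * tokens_per_chunk              # 12,000 tokens
corpus_cost = total_tokens / 1000 * price_per_1k_tokens

print(f"${corpus_cost:.4f}")  # → $0.0012
```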
Here's the thing nobody talks about in embedding model comparisons: for small to medium knowledge bases (under 500K chunks), the difference between a good local model and a premium API model is negligible in retrieval quality. The bottleneck is never the embedding model. It's the chunking strategy, the prompt engineering, and the retrieval pipeline design.
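To make "chunking strategy" concrete, here's a minimal sketch of a fixed-size chunker with overlap. The window sizes are illustrative, not our demo's actual settings:

```python
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks of ~`size` words, with `overlap`
    words of shared context between consecutive chunks.
    Illustrative sketch only; sizes here are hypothetical."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + size >= len(words):
            break
    return chunks

# A 500-word toy document yields 3 overlapping chunks:
# words 0-199, 160-359, 320-499.
doc = " ".join(f"word{i}" for i in range(500))
pieces = chunk_text(doc)
print(len(pieces))  # → 3
```

The overlap is the part that matters for retrieval quality: it keeps a sentence that straddles a chunk boundary intact in at least one chunk.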
We were about to add a vendor dependency and a recurring cost for a quality improvement our users would never notice.
The Pivot (20 Minutes)
The switch took exactly this long:
- Minute 0-5: Add `sentence-transformers` to requirements.txt. The model downloads automatically on first load (~80MB).
- Minute 5-10: Replace the Voyage API call with three lines of Python:

```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode(text).tolist()
```

- Minute 10-15: Change `EMBEDDING_DIM` from 1024 to 384 in the config. Update the pgvector column definition. Drop and recreate the chunks table.
- Minute 15-20: Re-ingest all documents. Test queries. Everything works.
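The config-and-schema step boils down to one dimension change in the DDL. A sketch, with a hypothetical chunks table layout (your column names will differ):

```sql
-- Hypothetical schema; the vector dimension change is the only point here.
DROP TABLE IF EXISTS chunks;

CREATE TABLE chunks (
    id        SERIAL PRIMARY KEY,
    document  TEXT NOT NULL,
    content   TEXT NOT NULL,
    embedding vector(384)   -- was vector(1024) for the Voyage model
);
```

pgvector fixes the dimension at column-creation time, which is why the table has to be dropped and re-ingested rather than migrated in place.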
No API key. No network dependency. No cost per query. The model runs on CPU (no GPU needed) and embeds a query in ~50ms. It loads once when the container starts and stays in memory.
We cached the model download in the Docker layer so container restarts don't re-download it. Total image size increase: 80MB. Acceptable.
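That layer caching is just a build-time download. A minimal Dockerfile sketch, with an illustrative base image (not necessarily what we run):

```dockerfile
FROM python:3.11-slim

RUN pip install --no-cache-dir sentence-transformers

# Download the ~80MB model at build time so it lives in an image layer;
# container restarts reuse the layer instead of re-downloading.
RUN python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
```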
Performance Comparison
We tested both approaches on our 9-document, 60-chunk corpus with 20 test queries:
- Voyage AI (1024-dim): Average query latency ~200ms (including network round-trip). Retrieval accuracy on test queries: 18/20 correct top-3 chunks.
- Local MiniLM (384-dim): Average query latency ~50ms (no network). Retrieval accuracy on test queries: 17/20 correct top-3 chunks.
One fewer correct retrieval out of twenty. In exchange: 4x faster queries, zero cost, zero dependency, works offline, works behind firewalls, works in air-gapped environments.
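"Correct top-3 chunks" is easy to score once you have embeddings on both sides. A self-contained sketch with toy 3-dimensional vectors (real embeddings are 384-dim; the data here is made up):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy data: 5 chunks; the query vector points closest to chunk 2.
chunks = [[1, 0, 0], [0, 1, 0], [0.9, 0.1, 0], [0, 0, 1], [0.5, 0.5, 0]]
query = [0.85, 0.15, 0]

hits = top_k(query, chunks, k=3)
expected_chunk = 2
is_correct = expected_chunk in hits  # one "correct top-3" retrieval
print(hits, is_correct)  # → [2, 0, 4] True
```

A query counts as correct when the expected chunk appears anywhere in the top-3; run that over 20 labeled queries and you get scores like the 18/20 and 17/20 above.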
For a client in a regulated industry who can't send data to external APIs? Local embeddings aren't just nice-to-have. They're mandatory.
When to Use API Embeddings Anyway
Local models aren't always the answer. Use an API embedding service when:
- Your corpus is highly specialized and the local model genuinely doesn't understand the domain terminology (rare, but possible in niche scientific/medical fields)
- You need multilingual embeddings and the local model doesn't support your languages well
- You're embedding millions of documents and need GPU-accelerated throughput that your server can't provide locally
- Your client specifically requires a particular embedding provider for compliance or contractual reasons
For everything else — and that's 90% of business RAG deployments — a local model is the right call.
What We Learned
Every API dependency is a decision, not a default. Before adding any external service to your pipeline, ask: "What happens when this service is down? What happens when they raise prices? What happens when the client's firewall blocks it? Can I do this locally?" If the local option is 90% as good, it's 100% the right choice.
The 401 error was a gift. It forced us to question an assumption we'd made lazily — that premium API embeddings were necessary. They weren't. Our demo runs on a $16/month server with zero API costs for embeddings, zero external dependencies for the vector pipeline, and query latencies that are 4x faster than the "premium" option.
Sometimes the vendor lock-in you avoid is worth more than the marginal quality you sacrifice.
Build Log #003. The demo that prompted this story is live — try it yourself. All local embeddings, all the time. If you want a RAG system that doesn't depend on external APIs for embeddings, we build those.