Embedding Pipeline Design for AI Applications
Embeddings are the quiet workhorse of modern AI: they turn text into vectors so you can retrieve, cluster, and rank content at scale. Designing the pipeline around them is less about picking a trendy model and more about chunk boundaries, update semantics, cost curves, and where state actually lives. This post walks through the decisions we most often see teams get wrong, then ties them to MongoDB Atlas for durable chunk and vector storage and Redis Cloud for caching and job orchestration.
Chunking: the decision that compounds
Retrieval quality often hinges on chunking more than on embedding dimension. A chunk that is too large dilutes the signal; one that is too small loses context. Production systems usually blend heuristics with measured recall on a labeled set of queries.
Four strategies teams actually ship
- Fixed-size windows (e.g. 256–512 tokens with overlap) are predictable, cheap to implement, and easy to version. They work well for homogeneous docs but can split tables, code, and legal clauses mid-thought.
- Sentence-based splitting respects linguistic boundaries and improves readability of retrieved passages. It struggles when single sentences are long or when meaning spans multiple sentences.
- Semantic chunking clusters adjacent sentences or paragraphs by embedding similarity, merging until a coherence score drops. It yields higher-quality RAG answers on long-form content but adds latency and complexity during ingest.
- Recursive splitting (e.g. headings, then paragraphs, then sentences) preserves document structure for wikis, policies, and APIs. It pairs naturally with metadata filters (section, product, env) at query time.
A pragmatic pattern is recursive or heading-aware chunking for structured sources, fixed or sentence windows for chat logs and tickets, and semantic chunking only where offline evaluation shows a clear lift. Store chunk boundaries, source offsets, and a stable chunk_id so you can re-embed or delete without orphaning vectors.
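As a concrete baseline, here is a minimal fixed-window chunker in Python. Whitespace tokens stand in for model tokens (production code would use the embedding model's own tokenizer), and the `chunk_id` scheme — position plus content hash — is one illustrative way to get the stable, re-embed-safe ids described above:

```python
import hashlib

def chunk_fixed(text, window=256, overlap=32):
    """Split text into fixed-size token windows with overlap.

    Whitespace tokens approximate model tokens here; swap in the
    embedding model's tokenizer (e.g. tiktoken) to keep sizes honest.
    """
    tokens = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, max(len(tokens), 1), step):
        piece = " ".join(tokens[start:start + window])
        if not piece:
            break
        # Stable id from position + content: re-chunking unchanged text
        # reproduces the same ids, so vectors can be upserted or deleted
        # without orphaning entries in the vector index.
        chunk_id = hashlib.sha256(f"{start}:{piece}".encode()).hexdigest()[:16]
        chunks.append({"chunk_id": chunk_id, "start_token": start, "text": piece})
    return chunks
```

Persisting `start_token` alongside the text gives you the source offsets needed for deletes and re-embeds later.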
Choosing an embedding model
Model choice is a triangle of quality, cost per token, and operational fit (hosted API vs self-hosted GPU). For English-heavy RAG, OpenAI text-embedding-3-small and text-embedding-3-large are common defaults; Cohere embed-v3 family competes on multilingual and classification-adjacent tasks; BGE and E5 variants are strong open-source baselines when you control the inference stack.
| Model | Default dimensions | Indicative price (USD per 1M tokens) | Typical hosted API latency (p50-ish, single request) |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 (shortenable to 512) | ~$0.02 | ~50–150 ms |
| OpenAI text-embedding-3-large | 3072 (shortenable) | ~$0.13 | ~80–250 ms |
| Cohere embed-english-v3 (representative) | 1024 | ~$0.10 (check current Cohere list) | ~60–200 ms |
| BGE-large-en-v1.5 (self-hosted) | 1024 | Infra + GPU only | ~1–30 ms per small batch on GPU (hardware-dependent) |
| intfloat/e5-large-v2 (self-hosted) | 1024 | Infra + GPU only | ~1–30 ms per small batch on GPU (hardware-dependent) |
Prices and latencies move with provider rate cards and region; treat the table as order-of-magnitude guidance, not a quote. Always run your own benchmark on a slice of real queries: nDCG, MRR, or simple human-judged relevance beats leaderboard scores.
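MRR in particular is cheap to wire into such a benchmark. A minimal sketch, assuming one labeled relevant chunk per query:

```python
def mean_reciprocal_rank(ranked_ids_per_query, relevant_id_per_query):
    """MRR over a labeled query set: reciprocal rank of the first
    relevant hit per query, 0 when it never appears in the ranking."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id == relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_ids_per_query)
```

Run it per candidate model over the same queries and the comparison is one number, not a leaderboard argument.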
Dimension tradeoffs and Matryoshka-style shortening
Higher dimensions can capture finer semantic distinctions but increase index size, RAM, and distance-compute cost. OpenAI’s third-generation embeddings support shortening dimensions without retraining by using the leading components of the vector, which lets you start wide and compress after A/B testing. A move from 3072 to 1536 dimensions can roughly halve vector storage for the same document corpus, at some cost to recall that you should measure rather than assume.
Treat embedding dimension as a capacity knob: turn it down only when your offline eval says you can afford to.
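OpenAI's `dimensions` parameter applies the shortening server-side; the same operation done client-side is just truncate-then-renormalize. A sketch (the `shorten` helper is illustrative, not a provider API):

```python
import math

def shorten(embedding, dims):
    """Keep the leading components of a Matryoshka-style embedding and
    re-normalize to unit length so cosine distances stay meaningful."""
    v = list(embedding[:dims])
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v
```

This lets you store full-width vectors once and experiment with narrower indexes without re-calling the API.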
Incremental re-embedding when documents change
Full re-embeds on every edit do not scale. Hash the normalized text of each chunk and re-embed only when the hash changes. For large documents, maintain a content_version on the parent and bump child chunk versions only for affected spans. Tombstone deleted chunks in the vector index so readers never see stale hits during propagation lag.
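A sketch of the hash-and-compare step, with `stored_hashes` standing in for a lookup against the chunk metadata store:

```python
import hashlib

def normalize(text):
    # Collapse whitespace and case so cosmetic edits do not trigger re-embeds.
    return " ".join(text.split()).lower()

def plan_reembed(chunks, stored_hashes):
    """Return (chunk_id, new_hash) pairs for chunks whose normalized
    content changed since the last embed.

    `stored_hashes` maps chunk_id -> content_hash as persisted alongside
    each chunk; unknown ids are treated as new and always embedded.
    """
    todo = []
    for chunk in chunks:
        h = hashlib.sha256(normalize(chunk["text"]).encode()).hexdigest()
        if stored_hashes.get(chunk["chunk_id"]) != h:
            todo.append((chunk["chunk_id"], h))
    return todo
```

Write the new hash back only after the vector lands in the index, so a crashed worker re-tries rather than silently skipping.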
MongoDB Atlas: chunks, vectors, and change-driven pipelines
Store each chunk as a document with fields such as source_id, chunk_index, text, embedding, embedding_model, embedding_version, content_hash, and updated_at. Atlas Vector Search indexes the embedding field while you keep transactional metadata alongside it, which simplifies audits and GDPR-style deletes compared to splitting blobs across a pure vector DB and a separate store.
Use Change Streams on the chunks collection to enqueue re-embedding work when inserts or relevant updates occur. That gives you an at-least-once trigger path into your worker tier without polling. Manage index definitions explicitly: when you change dimension or distance metric, create a new index, backfill, cut traffic over, then retire the old index to avoid serving mixed geometry.
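The consumer side of that trigger path reduces to mapping change events to jobs. A sketch of that mapping, assuming the chunk schema above (`text`, `content_hash` fields) and the event shape that pymongo's `collection.watch()` yields:

```python
def reembed_job_from_event(event):
    """Map a Change Streams event on the chunks collection to a
    re-embed job, or None when the change does not affect embeddings."""
    op = event.get("operationType")
    if op == "insert":
        return {"chunk_id": str(event["fullDocument"]["_id"]), "reason": "insert"}
    if op == "update":
        changed = event["updateDescription"]["updatedFields"]
        # Only text-bearing updates need a new vector; metadata-only
        # updates (e.g. updated_at) are ignored.
        if "text" in changed or "content_hash" in changed:
            return {"chunk_id": str(event["documentKey"]["_id"]), "reason": "update"}
    return None  # deletes are handled by tombstoning, not re-embedding
```

The surrounding loop (`for event in collection.watch(): ...`) pushes each non-None job onto the Redis queue described below; resume tokens make the at-least-once guarantee survive worker restarts.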
Redis Cloud: cache, hot sets, and batch queues
Redis Cloud shines as a low-latency layer in front of expensive embedding APIs. Cache keys can be content_hash plus model id plus dimension, with TTL aligned to how often sources change. For RAG hot paths, cache not only embeddings but also serialized top-k chunk payloads to shave milliseconds off repeat queries.
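A sketch of that cache contract, with a plain dict standing in for the Redis client (in production, swap the item assignment for `SET` with a TTL, and `embed_fn` for the provider call):

```python
import hashlib
import json

def cached_embedding(cache, embed_fn, text,
                     model="text-embedding-3-small", dims=1536):
    """Look up an embedding by (content_hash, model, dims) before paying
    for an API call; identical normalized text always hits one entry."""
    content_hash = hashlib.sha256(" ".join(text.split()).encode()).hexdigest()
    key = f"emb:{model}:{dims}:{content_hash}"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    vector = embed_fn(text)
    cache[key] = json.dumps(vector)  # Redis: SET with TTL matched to source churn
    return vector
```

Keying on model and dimension as well as content is what makes cache entries safe to keep across a model upgrade: old and new vectors never collide.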
Use Redis lists or streams as a durable-enough queue for batch embedding jobs: Change Streams consumers act as producers, pushing chunk ids; workers batch texts (e.g. 64–256 per API call where the provider allows) to maximize throughput and minimize per-request overhead. Keep separate queues for backfill versus real-time paths so a historical re-index does not starve fresh edits.
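The worker loop then reduces to draining ids into API-sized batches. A sketch with `queue_pop` standing in for `LPOP` on a Redis list (or a Streams consumer-group read):

```python
def batches(queue_pop, batch_size=128):
    """Yield batches of up to `batch_size` items until the queue is
    empty; `queue_pop` returns the next item or None when drained."""
    while True:
        batch = []
        while len(batch) < batch_size:
            item = queue_pop()
            if item is None:
                break
            batch.append(item)
        if not batch:
            return
        yield batch
```

Each yielded batch becomes one embedding API call; a partially filled final batch still ships, so fresh edits are never held hostage to batch-size thresholds.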
Batch versus real-time embedding
- Batch/offline embedding maximizes tokens per dollar and simplifies rate-limit handling; latency to searchability is minutes to hours, acceptable for docs and knowledge bases.
- Near-real-time embedding (seconds) fits chat support and collaborative editors; pair smaller batches with Redis-backed deduplication so bursts do not stampede the API.
- Hybrid paths backfill nightly while applying a fast lane for new or edited chunks—Change Streams plus priority queues implement this cleanly.
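The fast-lane policy in the hybrid path can be as small as a priority pop, sketched here with callables standing in for per-queue `LPOP`s (on Redis 7+, `BLMPOP` across both keys, real-time key first, gives the same priority semantics in one blocking call):

```python
def next_job(pop_realtime, pop_backfill):
    """Serve the real-time queue first; fall back to backfill only when
    no fresh edits are waiting, so a re-index never delays live content."""
    job = pop_realtime()
    return job if job is not None else pop_backfill()
```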
Embedding versioning and index compatibility
Never mix vectors from different models or different training revisions in one search index. Persist embedding_model and embedding_version on every chunk and include them in cache keys. When you upgrade, run dual-write or shadow indexes: query both, compare metrics, then flip traffic. Skipping this step is how teams get irreproducible relevance regressions.
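One cheap signal during a shadow run is overlap@k between the two indexes. A sketch, with `query_old` and `query_new` standing in for searches against the serving and shadow indexes:

```python
def shadow_overlap(query_old, query_new, queries, k=10):
    """Average overlap@k between the serving index and a shadow index
    built with the new embedding_model/embedding_version. Low overlap
    across a model upgrade is expected; the point is to watch it
    alongside relevance metrics before flipping traffic."""
    total = 0.0
    for q in queries:
        old = set(query_old(q)[:k])
        new = set(query_new(q)[:k])
        total += len(old & new) / k
    return total / len(queries)
```

Pair this with the labeled-query relevance metrics from the benchmarking section; overlap tells you how different the new index is, relevance tells you whether different is better.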
Cost at scale: a back-of-the-envelope
Suppose you ingest 500 million tokens per month across your corpus. At published small-model pricing near two cents per million tokens, API cost is on the order of ten dollars for a full pass—before chunk overlap, retries, and re-embeds. Large-model pricing at roughly thirteen cents per million tokens pushes the same naive full pass toward sixty-five dollars. The dominant cost in many deployments becomes not the initial embed but repeated re-embeds, oversized chunks (extra tokens), and uncached duplicate content.
| Monthly token volume (full re-embed) | text-embedding-3-small (~$0.02 / 1M) | text-embedding-3-large (~$0.13 / 1M) |
|---|---|---|
| 50M tokens | ~$1.00 | ~$6.50 |
| 200M tokens | ~$4.00 | ~$26.00 |
| 1B tokens | ~$20.00 | ~$130.00 |
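The table reduces to one line of arithmetic, handy for plugging in your own volumes and current rate-card prices:

```python
def embed_cost_usd(tokens, price_per_million_usd):
    """Naive full-pass API cost; excludes chunk overlap, retries,
    and re-embeds, which usually dominate in practice."""
    return tokens / 1_000_000 * price_per_million_usd
```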
Add vector index storage (about four bytes per float32 dimension per vector, plus HNSW graph overhead in Atlas), Redis memory for caches and queues, and GPU amortization for self-hosted models. The winning architecture usually combines content hashing, incremental updates, batching, and Redis-backed deduplication so you are not paying to embed the same paragraph fifty times.
Putting it together
A production embedding pipeline picks chunking that matches your content, freezes a model and dimension behind versioned metadata, uses Atlas for authoritative chunk plus vector storage with Vector Search and Change Streams, and layers Redis Cloud for deduplicated API results, hot chunk caches, and fair queuing between backfill and real-time work. Instrument embed latency, queue depth, tokens per successful index, and retrieval metrics per model version—those dashboards are how you justify the next upgrade without guessing.