What Is Vector Search? How Embeddings Power Modern AI Retrieval

FinTekCafe

18 May 2026

Vector search is the per-query operation that turns a piece of text, an image, or a structured prompt into a point in high-dimensional space and finds the closest other points already stored in an index. It is the underlying machinery of every "chat with your documents" feature, every retrieval-augmented generation (RAG) pipeline, and most of the "similar to this" recommendations shipped by modern AI products. The technology is not new (Facebook's FAISS library is from 2017) but its 2023 to 2026 adoption inside the enterprise has produced a vendor market, a procurement pattern, and a set of misconceptions worth correcting.

This article explains what vector search actually is, how the underlying nearest-neighbor algorithms work, where vector search beats keyword search and where it loses, the 2026 vendor landscape, the hybrid-search pattern most production systems converge on, and a decision framework for picking the right starting architecture.

What Vector Search Is, in Plain English

Vector search is similarity search over numerical representations of content. An embedding model takes some input (a sentence, a paragraph, a code snippet, an image) and produces a fixed-length vector of floating-point numbers, often 384, 768, 1024, or 1536 dimensions in modern models. Two inputs that mean similar things produce vectors that are mathematically close (measured by cosine similarity, dot product, or Euclidean distance). At query time, the query is embedded with the same model and the index returns the K nearest stored vectors. Those K results are the "search results."
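As a minimal sketch of the query step described above, the snippet below scores a query vector against every stored vector with cosine similarity and returns the top K. The three-dimensional vectors and document IDs are hand-made placeholders, not output from any real embedding model, and the exhaustive scan stands in for the index a production system would use.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, index, k):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy "embeddings"; real models emit hundreds of dimensions.
index = {
    "doc-billing": [0.9, 0.1, 0.0],
    "doc-passwords": [0.1, 0.9, 0.1],
    "doc-storage": [0.3, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedded user query

results = top_k(query, index, k=2)
print([doc_id for doc_id, _ in results])  # ['doc-billing', 'doc-storage']
```

The same model must produce both the stored vectors and the query vector; the scoring function is only meaningful inside one embedding space.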

The reason vector search matters is that it captures semantic similarity rather than lexical overlap. A user query phrased as "cancel a subscription" can match a stored document titled "ending your paid plan" with zero shared keywords, because the embedding model has learned that those phrases occupy similar regions of vector space. Traditional BM25 keyword search would miss this match unless the document or the query contained the right synonyms.

The cost of that capability is real. Vector search adds an embedding step at write time, an embedding step at query time, a much larger index footprint than an inverted index, and a class of failure modes (relevance drift, embedding-model versioning) that keyword search does not have.

How Embeddings Turn Content Into Points in Space

Embeddings come from a neural network trained to compress meaning into geometry. Models such as OpenAI's text-embedding-3-large, Cohere's embed-multilingual-v3, Voyage AI's voyage-3, and the open-source BGE and E5 families rely largely on contrastive learning: the model is shown pairs of related and unrelated content and pushed to place related content close together and unrelated content far apart. After billions of training pairs the model has internalized a geometry of meaning that is good enough for production retrieval.

A few facts that matter operationally:

Dimensions are not free. A 1536-dimensional embedding takes 6 kilobytes per vector at float32, or 1.5 kilobytes at int8 quantization. A 100 million vector corpus at 1536 dimensions is on the order of 600 GB raw, before any index structures. Modern vendors compress aggressively (scalar and product quantization can reach 4 to 8 bits per dimension with limited recall loss), but the cost line items are real.

Different embedding models produce incompatible vector spaces. Cosine similarity between an OpenAI vector and a Cohere vector is meaningless. Switching embedding models means re-embedding the entire corpus.

Embedding models have a context window. Most production embedding models take 512 to 8192 tokens. Documents longer than the window must be chunked, and the chunking strategy meaningfully affects retrieval quality.

Embeddings drift. A 2023 embedding model trained before a regulatory term existed will not encode that term well. Domain-specific fine-tuned embeddings (financial, legal, medical) routinely beat general-purpose ones on in-domain queries by 10 to 30 percent on retrieval benchmarks.
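The storage figures quoted above for 1536-dimensional vectors follow from simple arithmetic. The helper below is a sketch that reproduces them; the function name is illustrative, and it counts raw vector bytes only, with no index structures, replication, or metadata.

```python
def raw_vector_bytes(num_vectors, dimensions, bits_per_dimension):
    """Raw storage for the vectors alone, excluding any index overhead."""
    return num_vectors * dimensions * bits_per_dimension // 8

float32_vector = raw_vector_bytes(1, 1536, 32)         # 6144 bytes, about 6 KB
int8_vector = raw_vector_bytes(1, 1536, 8)             # 1536 bytes, about 1.5 KB
corpus_float32 = raw_vector_bytes(100_000_000, 1536, 32)
corpus_4bit = raw_vector_bytes(100_000_000, 1536, 4)

print(corpus_float32 / 1e9)  # 614.4 (GB, decimal)
print(corpus_4bit / 1e9)     # 76.8 (GB, decimal)
```

Running the same calculation for a candidate model's dimension count before committing to it makes the storage line item visible early.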

How Nearest-Neighbor Search Actually Works

A brute-force nearest-neighbor search compares the query vector to every stored vector. For 10,000 vectors this is fine. For 10 million it is not. Modern vector databases use approximate nearest neighbor (ANN) algorithms that trade a small amount of recall for one to three orders of magnitude better latency.

Three algorithm families dominate the 2026 landscape:

HNSW (Hierarchical Navigable Small World)

HNSW builds a multi-layer graph where each node is a vector and edges connect each vector to a small number of its nearest neighbors. Queries traverse the graph from the top layer down, following edges that get progressively closer to the query. HNSW typically delivers 95 to 99 percent recall at latencies from under a millisecond to a few milliseconds, depending on index size and tuning, for corpora up to roughly 100 million vectors per shard.

HNSW is the default or a first-class index option in Pinecone, Weaviate, Qdrant, Milvus, and pgvector (which added it in the 0.5.0 release in 2023). The trade-off is memory: the graph structure itself adds 20 to 40 percent on top of the raw vector storage, and HNSW is hard to update efficiently for write-heavy workloads.
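The top-down traversal can be sketched as a greedy walk repeated on each layer. This is a deliberately simplified illustration, not full HNSW: the graph is hand-built rather than constructed by the real insertion procedure, and the search keeps a single candidate instead of the candidate list that production implementations maintain. All names and data are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_search(query, vectors, neighbors, entry_point):
    """Move to whichever neighbor is closer to the query until none is."""
    current = entry_point
    current_dist = distance(query, vectors[current])
    improved = True
    while improved:
        improved = False
        for candidate in neighbors.get(current, []):
            d = distance(query, vectors[candidate])
            if d < current_dist:
                current, current_dist = candidate, d
                improved = True
    return current, current_dist

def layered_search(query, vectors, layers, entry_point):
    """Run the greedy walk on each layer, sparse top layer first."""
    current = entry_point
    dist = distance(query, vectors[current])
    for neighbors in layers:
        current, dist = greedy_search(query, vectors, neighbors, current)
    return current, dist

# Six points on a line; the top layer holds only nodes 0, 3, and 5.
vectors = {i: (float(i), 0.0) for i in range(6)}
top_layer = {0: [3], 3: [0, 5], 5: [3]}
bottom_layer = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

nearest, dist = layered_search((4.2, 0.0), vectors, [top_layer, bottom_layer], 0)
print(nearest)  # 4
```

The sparse top layer lets the walk cover most of the distance in two hops; the dense bottom layer then refines the answer locally.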

IVF (Inverted File Index)

IVF partitions the vector space into clusters (typically a few thousand) using k-means and stores each vector in its assigned cluster. Queries find the closest clusters first, then search within them. IVF is more memory-efficient than HNSW and updates more cheaply, at the cost of lower recall for the same speed budget.

IVF is the workhorse for very large corpora (multi-billion-vector deployments) and is the basis for FAISS configurations like IVF-PQ (inverted file with product quantization).
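A minimal sketch of the IVF idea follows. It assumes the centroids have already been computed (the k-means step is skipped), and the parameter name nprobe, borrowed from common FAISS usage, controls how many clusters are searched. The data is a hand-made toy chosen so that probing one cluster misses the true nearest neighbor.

```python
import math
from collections import defaultdict

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign every vector ID to the list of its nearest centroid."""
    lists = defaultdict(list)
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda c: distance(vec, centroids[c]))
        lists[nearest].append(vec_id)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe_order = sorted(range(len(centroids)), key=lambda c: distance(query, centroids[c]))
    candidates = [vec_id for c in probe_order[:nprobe] for vec_id in lists.get(c, [])]
    candidates.sort(key=lambda vec_id: distance(query, vectors[vec_id]))
    return candidates[:k]

# Assumed, precomputed centroids and a four-vector toy corpus.
centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = {"a": (1.0, 0.0), "b": (4.9, 0.0), "c": (5.2, 0.0), "d": (9.0, 0.0)}
lists = build_ivf(vectors, centroids)

query = (5.04, 0.0)  # true nearest neighbor is "b", which sits in the other cluster
print(ivf_search(query, vectors, centroids, lists, k=1, nprobe=1))  # ['c']
print(ivf_search(query, vectors, centroids, lists, k=1, nprobe=2))  # ['b']
```

Raising nprobe recovers the missed neighbor at the price of scanning more vectors, which is the recall-for-speed trade described above.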

ScaNN and DiskANN

Google's ScaNN and Microsoft's DiskANN target the workloads HNSW cannot serve cost-effectively: billion-vector corpora that must run on disk rather than in memory. DiskANN in particular delivers competitive recall at latencies in the 10 to 50 millisecond range while keeping the index on SSD, which makes the per-vector cost roughly an order of magnitude lower than in-memory HNSW. Turbopuffer's serverless vector search is built on this pattern.

The Recall-Latency-Cost Triangle

Every ANN deployment picks two of: high recall, low latency, low cost. A typical operational target is 95 percent recall at 50 milliseconds at moderate cost; chasing 99 percent recall pushes latency or cost up by 2x to 5x. Treating any of the three as a free variable produces deployments that work in the demo and fail under load.
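Recall in this sense is measured against an exact (brute-force) search over the same corpus. A hedged sketch of that measurement is below; the function names and the document IDs are illustrative, and a real harness would run it over a sample of production queries.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k results that the ANN index also returned."""
    exact_top = set(exact_ids[:k])
    found = exact_top & set(approx_ids[:k])
    return len(found) / len(exact_top)

def mean_recall_at_k(approx_runs, exact_runs, k):
    """Average recall@k over a batch of queries."""
    scores = [recall_at_k(a, e, k) for a, e in zip(approx_runs, exact_runs)]
    return sum(scores) / len(scores)

# Hypothetical results for one query: exact search versus an ANN index.
exact = ["d7", "d2", "d9", "d4", "d1"]
approx = ["d7", "d9", "d2", "d5", "d1"]

print(recall_at_k(approx, exact, k=5))  # 0.8
```

Tracking this number alongside P95 latency and monthly cost for each index configuration keeps all three corners of the triangle in view.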

Where Vector Search Beats Keyword Search, and Where It Loses

Vector search beats keyword search when the user query expresses intent rather than exact terms. "Documents about regulatory compliance burdens after the 2024 reform" is a query that vector search can answer even if no document uses those exact words. "Show me posts that critique microservices" is answerable because critique-of-microservices content occupies a coherent region of embedding space even when the author never used the word "critique."

Keyword search beats vector search when the query contains identifiers, codes, or precise terms. A query of "MT103" should match documents containing "MT103," not documents about "international wire formats." A query for a CVE number, a SKU, a customer ID, or an error code is a keyword problem, and pure vector search handles it poorly. Embeddings tend to over-generalize identifiers, treating "MT103" and "MT202" as nearly identical even though they are different SWIFT message types.

Keyword search also wins on legal and compliance retrieval where the user expects exact-phrase matching, on internal-search workloads with strong title-and-tag metadata, and on very small corpora where the embedding step adds cost without benefit.

Workload	Best fit	Why
Open-ended Q&A over documents	Vector or hybrid	Intent-based queries; vague phrasing
Code search by intent	Vector	"Function that retries failed payments"
Code search by symbol name	Keyword	"PaymentProcessor.retry"
Product catalog ("blue running shoes")	Vector	Captures color, type, use
Product catalog by SKU	Keyword	Exact identifier match
Regulatory text retrieval	Hybrid, keyword-first	Exact-phrase matching is required
Image similarity	Vector	Embeddings are the only option
Log search by error code	Keyword	Codes degrade in embedding space
Long-tail FAQ matching	Hybrid	Combines synonym tolerance and term precision

Hybrid Search: The Pattern Most Production Systems Converge On

Hybrid search runs both vector search and BM25 keyword search in parallel and fuses the two ranked lists, usually with reciprocal rank fusion (RRF) or a learned reranker on top. The hybrid pattern delivers measurably better quality than either approach alone on most enterprise workloads. Benchmarks published by Elastic, Weaviate, OpenSearch, and Microsoft Research over 2023 to 2025 consistently show hybrid retrieval beating pure-vector retrieval by 5 to 20 percent on standard relevance metrics (NDCG@10, MRR), with the gain biggest on queries that contain identifiers or proper nouns alongside intent words.
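The two relevance metrics named above can be computed with a few lines each. The sketch below uses binary relevance labels (a document is either relevant or not), which is an assumption; graded labels would need a gain term in the NDCG sum. Function names and sample data are illustrative.

```python
import math

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant result per query (0 if none)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG@k: discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Two hypothetical queries with labeled relevant documents.
runs = [["a", "b", "c"], ["x", "y", "z"]]
labels = [{"b"}, {"x"}]

print(mean_reciprocal_rank(runs, labels))  # 0.75
print(ndcg_at_k(runs[1], labels[1]))       # 1.0
```

Running the same labeled queries through a vector-only configuration and a hybrid one, then comparing these two numbers, is how a gain like the one cited above would be checked on your own data.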

Hybrid search also reduces operational risk. A pure-vector deployment is one bad embedding-model upgrade away from a quality regression. A hybrid deployment degrades gracefully because the BM25 component continues to work even if the vector component drifts.

The production pattern is converging on:

Index both the raw text (BM25) and the embedding (vector ANN).
Retrieve top K from each (typically K of 50 to 200).
Fuse the two lists using RRF or a small learned model.
Optionally rerank the top 10 to 20 with a cross-encoder reranker like bge-reranker-v2-m3, Cohere Rerank, or Voyage rerank-2.
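The fusion step above can be sketched in a few lines. Reciprocal rank fusion scores each document by summing 1 / (k + rank) over every list it appears in; the constant k = 60 is the value commonly used in the literature and is an assumption here, as are the document IDs.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs; higher fused score ranks first."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical top-3 lists from the keyword and vector retrievers.
bm25_hits = ["mt103-spec", "wire-formats", "swift-faq"]
vector_hits = ["wire-formats", "intl-payments", "mt103-spec"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['wire-formats', 'mt103-spec', 'intl-payments', 'swift-faq']
```

RRF uses only ranks, not raw scores, so BM25 scores and cosine similarities never need to be normalized onto a common scale. Documents that appear in both lists rise to the top, and the fused list is what a reranker would then rescore.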

The reranker step is where most of the quality gains in 2025 to 2026 production deployments are coming from. Cross-encoder rerankers see the query and each candidate document together and score them in context, which captures relevance signals that bi-encoder embeddings cannot.

The 2026 Vector Search Vendor Landscape

The vendor market sorted itself out across 2024 and 2025. By 2026 the practical choices are:

Vendor / Stack	Strength	Where it fits
pgvector (Postgres extension)	Runs in your existing database, transactional, free	Up to roughly 10 to 50 million vectors, where Postgres is already the system of record
Pinecone	Managed, serverless, mature operations	Mid-market and enterprise, low ops burden
Weaviate	Open-source, strong hybrid search, schema support	Teams that want SQL-like control with hybrid built in
Qdrant	Open-source, payload filtering, performant	Self-host or managed, strong on filtered queries
Milvus / Zilliz	Multi-billion vector scale, GPU acceleration	Workloads beyond 100 million vectors
OpenSearch / Elastic k-NN	Hybrid search in the same engine as logs and BM25	Teams already on OpenSearch or Elastic
Turbopuffer	Serverless DiskANN, very low cost per vector	Cost-sensitive at scale, willing to accept higher P99
Vespa	Mature ranking, hybrid, real-time updates	Search-as-a-product, high-stakes ranking
Couchbase / MongoDB Atlas Vector	Bundled with existing document database	Already on the platform, simpler operational story

The most common 2026 mistake is over-engineering the procurement. Most enterprise pilots have fewer than 10 million vectors, which means pgvector on managed Postgres (RDS, Aurora, Neon, Supabase) is the right starting answer, and the engineering capacity that would have gone into operating a separate vector database goes into building the retrieval and reranking pipeline that actually drives quality.

The case for a dedicated vector database in 2026 is any of the following: you have crossed roughly 50 to 100 million vectors per index, your sustained query throughput exceeds what I/O-bound Postgres can serve (a few hundred queries per second), you need multi-region replication for read latency, or your team has the operational budget to run a second specialized data store.

A Decision Framework: Vector DB vs Hybrid Postgres

The questions below, taken in order, capture the choice for most enterprise AI teams in 2026:

Is the corpus under 10 million vectors and stable? Start with pgvector on managed Postgres. Add BM25 via Postgres full-text search and combine with RRF.
Is the corpus 10 to 100 million vectors with moderate write rate? pgvector with HNSW indexes is still viable up to roughly 50 million vectors per table; beyond that, evaluate Qdrant, Weaviate, or Pinecone.
Is the corpus over 100 million vectors, or do you need GPU-accelerated search? Milvus, Vespa, or Turbopuffer become the right answer.
Do you need strong filtering on metadata at query time (tenant ID, date ranges, ACLs)? Qdrant's payload filtering and Pinecone's metadata filtering are stronger than pgvector here, though pgvector with proper indexes can handle a meaningful subset.
Do you need search and analytics in the same engine? OpenSearch or Elastic, accepting the operational complexity.
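The questions above can be encoded as a small function, which makes the thresholds explicit and easy to argue about. This is a sketch only: the order in which the special requirements override the size-based answer is an assumption of this example, not something the framework prescribes, and the function and parameter names are hypothetical.

```python
PGVECTOR_START_LIMIT = 10_000_000   # "under 10 million vectors"
PGVECTOR_HNSW_LIMIT = 50_000_000    # "roughly 50 million vectors per table"
LARGE_SCALE_LIMIT = 100_000_000     # "over 100 million vectors"

def recommend_starting_stack(num_vectors, needs_gpu=False,
                             needs_metadata_filtering=False,
                             needs_search_and_analytics=False):
    """Map corpus size and requirements to a starting point from the list above."""
    # Assumed precedence: special requirements are checked before corpus size.
    if needs_search_and_analytics:
        return "OpenSearch or Elastic"
    if num_vectors > LARGE_SCALE_LIMIT or needs_gpu:
        return "Milvus, Vespa, or Turbopuffer"
    if needs_metadata_filtering:
        return "Qdrant or Pinecone"
    if num_vectors < PGVECTOR_START_LIMIT:
        return "pgvector on managed Postgres with full-text search and RRF"
    if num_vectors <= PGVECTOR_HNSW_LIMIT:
        return "pgvector with HNSW indexes"
    return "Qdrant, Weaviate, or Pinecone"

print(recommend_starting_stack(2_000_000))
# pgvector on managed Postgres with full-text search and RRF
```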

The most common failure mode in 2026 is teams skipping the first question and procuring a specialized vector database for a 2 million vector pilot, then spending six months building the operational capability that pgvector would have given them for free.

The Cost-Per-Million-Vectors Reality

The table below gives approximate 2026 cost ranges, normalized to one million vectors at 1536 dimensions with HNSW or equivalent, including replication and modest query load. Source: vendor public pricing pages, May 2026, with regional and reserved-instance discounts not applied.

Stack	Monthly storage cost per million vectors	Notes
pgvector on managed Postgres	USD 5 to USD 20	Storage cheap; compute is the constraint at scale
Pinecone serverless	USD 20 to USD 80	Pay-per-use, low ops
Qdrant Cloud	USD 25 to USD 60	Strong filtering, similar to Pinecone
Weaviate Cloud Services	USD 30 to USD 90	Hybrid built in
Milvus / Zilliz Cloud	USD 30 to USD 100	Scales much further
Turbopuffer	USD 1 to USD 10	DiskANN, higher P99
OpenSearch Service (k-NN)	USD 40 to USD 120	Bundled with logs and BM25

Query cost is a separate axis (usually billed per million queries or per query unit) and tends to dominate the bill for read-heavy workloads. Cost is not the most important variable below 10 million vectors. Above that scale, the choice meaningfully affects monthly run-rate.

Common Failure Modes

Chunking strategy ignored. The single largest determinant of retrieval quality in production RAG systems is the chunking strategy, not the embedding model or the database. Fixed-size 512-token chunks are a starting point; semantic chunking (sentence boundaries, structural boundaries) consistently outperforms them.

No reranker. A surprising share of 2025 to 2026 production pilots ship without a reranker. Adding one is the highest-leverage quality intervention available.

Pure-vector for everything. Identifiers, codes, and proper nouns belong in keyword search. Pure-vector systems regress on these queries.

Embedding-model upgrade with no re-embed budget. New embedding models ship every six to twelve months. Teams that have not budgeted re-embedding cost for the corpus discover late that they are locked into a 2024 model with worse recall than 2026 alternatives.

No evaluation harness. Teams that ship vector search without a labeled evaluation set cannot tell whether a configuration change made retrieval better or worse. The lack of an eval harness is the strongest predictor of project regression over the following twelve months.
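To make the first failure mode in this list concrete, the sketch below contrasts fixed-size chunking with a simple sentence-boundary variant. Whitespace-separated words stand in for model tokens, which is an assumption (real pipelines count tokens with the embedding model's tokenizer), and the regular expression is a naive sentence splitter, not a production one.

```python
import re

def fixed_size_chunks(text, max_tokens):
    """Cut every max_tokens words, regardless of sentence boundaries."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]

def sentence_chunks(text, max_tokens):
    """Pack whole sentences into chunks of at most max_tokens words.

    A single sentence longer than max_tokens becomes its own oversized chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "Refunds take five days. Contact support to cancel. Plans renew monthly."

print(fixed_size_chunks(text, 5))
# ['Refunds take five days. Contact', 'support to cancel. Plans renew', 'monthly.']
print(sentence_chunks(text, 5))
# ['Refunds take five days.', 'Contact support to cancel.', 'Plans renew monthly.']
```

The fixed-size version splits "Contact support to cancel" across two chunks, so neither chunk embeds the full instruction; the sentence-aware version keeps each statement intact.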

Frequently Asked Questions

Is vector search the same as semantic search?

In practice yes, with one nuance. Semantic search is the user-facing capability of retrieving content by intent rather than exact term match. Vector search is one implementation of that capability and the dominant implementation in 2026. Some semantic search systems use other approaches (knowledge graphs, structured rule sets, query rewriting with LLMs), but when a vendor says "semantic search" in 2026 they almost always mean vector search underneath.

Do you still need keyword search if you have vector search?

Yes, for production-grade retrieval. Pure-vector systems regress on queries that contain identifiers, codes, proper nouns, or exact phrases. The dominant production pattern is hybrid: run both vector and BM25 in parallel and fuse the results. This is the hybrid retrieval architecture we covered in our vector database guide, and it is what most enterprise RAG systems converge on.

How big does the corpus need to be before a dedicated vector database is worth it?

The practical threshold in 2026 is roughly 10 to 50 million vectors. Below that, pgvector on managed Postgres is the right starting answer for most teams because it removes an operational burden without meaningfully compromising quality. Above 50 to 100 million vectors per index, the recall, latency, and cost characteristics of a purpose-built vector database become hard to replicate in Postgres.

What is the difference between HNSW and IVF?

HNSW (Hierarchical Navigable Small World) is a graph-based index that delivers high recall at very low latency, at the cost of higher memory usage and slower updates. IVF (Inverted File Index) partitions the vector space into clusters and is more memory-efficient and update-friendly, at the cost of lower recall for the same speed. HNSW dominates the in-memory enterprise deployments below roughly 100 million vectors per shard; IVF and its variants dominate at multi-billion-vector scale, and disk-resident indexes such as DiskANN (itself graph-based rather than an IVF variant) serve corpora too large to hold in memory.

How often do embedding models need to be updated?

Major vendors release new embedding models every six to twelve months, and each new version is typically 5 to 15 percent better on standard retrieval benchmarks. Most production teams update on a 12 to 18 month cycle, batching the re-embedding cost. Re-embedding a 50 million document corpus with a modern model costs in the low thousands of dollars at 2026 API pricing and takes a few days end-to-end; the constraint is the operational migration, not the cost.

Key Takeaways

Vector search is similarity search over numerical embeddings. An embedding model converts text or images into points in high-dimensional space, and the index returns the closest stored points to a query vector. The capability that matters operationally is matching by intent rather than by exact term.
HNSW is the default ANN algorithm for in-memory deployments under 100 million vectors; IVF and DiskANN dominate at larger scale and in disk-resident configurations. Every deployment trades among recall, latency, and cost; treating any of the three as free produces deployments that work in the demo and fail under load.
Hybrid search beats pure-vector search on most production workloads. Combine BM25 and vector retrieval, fuse with reciprocal rank fusion, and add a cross-encoder reranker on the top 10 to 20 candidates. This pattern is the highest-leverage quality intervention available in 2026.
pgvector on managed Postgres is the right starting point for most 2026 enterprise pilots. A dedicated vector database becomes the right answer at roughly 10 to 50 million vectors and beyond, or when filtering, multi-region, or GPU-accelerated workloads enter the picture.
The most common failure modes are operational, not algorithmic. Bad chunking strategy, no reranker, no evaluation harness, and no budget for the next embedding-model upgrade collectively account for the majority of vector-search regressions in production.

Vector Databases: When You Actually Need One: sister article focused on the procurement decision
What Are AI Evals? An Executive Guide: why every retrieval system needs an evaluation harness
What Is a Data Lakehouse? A Decision-Maker's Guide: where vector indexes sit in the modern data stack
The Real Cost of AI Infrastructure: the budget framing for compute, storage, and inference
Prompt Engineering for Business: A Practical Guide: the prompting layer that consumes retrieval output