RAG data quality at scale: deduplication, semantic chunking, and hybrid retrieval that actually improves answers


RAG quality is mostly a data problem. Below is a production-oriented pipeline that consistently improves answer accuracy and reduces hallucinations: (1) deduplicate aggressively, (2) split on meaning not characters, (3) use hybrid retrieval (lexical + dense) with fusion, (4) rerank and diversify at query-time, and (5) evaluate with objective metrics before shipping. Evidence and defaults are included throughout. (NeurIPS Datasets Benchmarks Proceedings, G. V. Cormack, Weaviate)

TL;DR (defaults that work)

  • Near-duplicate clustering with MinHash LSH over 3–5-gram shingles (Jaccard threshold ≈0.8); keep one canonical doc per cluster.
  • Semantic chunking on sentence boundaries with embedding-similarity merging; ~300–500 tokens per chunk, ~10–20% overlap.
  • Hybrid retrieval: BM25 + dense vectors, fused with RRF (k=60).
  • Query-time: cross-encoder rerank (top ~40 → ~12), then MMR diversify to ~6 chunks.
  • Postgres: pgvector with HNSW (IVFFlat if build time or memory is the constraint).
  • Gate releases with Ragas faithfulness, answer relevancy, and context precision.

Deduplication that scales

Duplicate and near-duplicate documents distort embeddings and retrieval - especially from web or ticket archives. Use fuzzy dedup (n-gram shingling + MinHash/SimHash) to cluster near-dupes, then keep a canonical record per cluster. This approach is standard in large-scale NLP pipelines and has been shown to reduce memorisation and improve downstream quality. (arXiv)

What to implement

  • Normalise (lowercase, Unicode NFKC), strip boilerplate/nav - a minimal sketch follows this list.
  • Shingle into 3–5-gram tokens; compute MinHash signatures; LSH into buckets; cluster by Jaccard similarity; optionally validate with SimHash Hamming distance. (arXiv)
  • Keep one canonical item per cluster (freshest; richest metadata).
  • Persist duplicate_cluster_id so you can suppress duplicates at query-time.
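For the normalisation step, a minimal pass might look like the following (the boilerplate patterns are illustrative; real pipelines usually strip nav and cookie markup during HTML extraction instead):

```python
import re
import unicodedata

def normalise(text: str) -> str:
    # Unicode NFKC folds compatibility characters; lowercase before shingling
    text = unicodedata.normalize("NFKC", text).lower()
    # Illustrative boilerplate stripping - adapt the patterns to your corpus
    text = re.sub(r"(cookie (policy|banner)|accept all cookies|skip to content)", " ", text)
    # Collapse whitespace
    return re.sub(r"\s+", " ", text).strip()
```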

Why this matters: Google's dedup work on C4 and related corpora shows that removing near-duplicates reduces verbatim copying and improves efficiency - evidence that duplicates harm model behaviour and evaluation. The same logic holds for RAG stores. (ACL Anthology)

```python
# Pseudocode: MinHash LSH for near-duplicate clusters
from datasketch import MinHash, MinHashLSH

def shingles(text, n=5):
    tokens = text.split()
    for i in range(len(tokens) - n + 1):
        yield ' '.join(tokens[i:i + n])

def signature(text):
    m = MinHash(num_perm=64)
    for s in shingles(text):
        m.update(s.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=64)  # tune threshold
for doc in docs:
    m = signature(doc.content)
    lsh.insert(doc.id, m)

# Query clusters from lsh to assign duplicate_cluster_id
```

Semantic chunking (not just character splits)

Chunking drives retrieval hit-rate. Prefer semantic chunkers that split by sentence boundaries and merge sentences that remain semantically cohesive in embedding space. This keeps concepts intact and reduces "answer spread" across chunks. LangChain's SemanticChunker and LlamaIndex's semantic splitters follow this pattern. Use small overlap only when needed. (python.langchain.com, docs.cohere.com)

```javascript
// Pseudocode: semantic chunking with fallback
const sentences = toSentences(cleanText);
const groups = groupBySemanticSimilarity(sentences, { window: 3, sim: 0.75 });
const chunks = mergeAdjacent(groups, { maxTokens: 450, overlapTokens: 40 });
```

Rule of thumb: start around 300–500 tokens per chunk with ~10–20% overlap for text-heavy domains; adjust empirically using retrieval precision and answer-faithfulness metrics. (docs.cohere.com)
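
As one concrete option, LangChain's SemanticChunker can be dropped in roughly like this (a sketch; the import paths and embedding model are assumptions that vary by version):

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings  # any LangChain embeddings class works

# Split on sentence boundaries, then break where embedding similarity drops sharply
splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # also: "standard_deviation", "interquartile"
)
chunks = splitter.create_documents([clean_text])  # clean_text from the ingestion step
```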


Hybrid retrieval that actually finds things

Lexical and dense retrieval fail in different ways. Hybrid retrieval combines BM25/keyword with dense vectors and fuses results - commonly via Reciprocal Rank Fusion (RRF) or score blending. BM25 remains a robust baseline across domains (BEIR), and RRF is a simple, well-studied fusion method that often outperforms either signal alone. (NeurIPS Datasets Benchmarks Proceedings, G. V. Cormack)

Implementation options:

  • Parallel BM25 + vector search, then RRF to combine ranked lists. Many vector DBs and frameworks document this pattern. (Weaviate)
  • In LlamaIndex, QueryFusionRetriever improves hybrid ranking over naive score mixing; a sketch follows the RRF pseudocode below. (LlamaIndex)
```python
# Pseudocode: RRF on two ranked lists
from collections import defaultdict

def rrf(rank, k=60):
    return 1.0 / (k + rank)

def fuse(bm25_ranks, dense_ranks, k=60):
    scores = defaultdict(float)
    for i, doc in enumerate(bm25_ranks):
        scores[doc] += rrf(i + 1, k)
    for i, doc in enumerate(dense_ranks):
        scores[doc] += rrf(i + 1, k)
    return sorted(scores, key=scores.get, reverse=True)
```
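
If you'd rather not hand-roll fusion, the QueryFusionRetriever mentioned above wraps the same pattern; a sketch, assuming BM25 and vector retrievers are already built (import paths and parameters vary by LlamaIndex version):

```python
from llama_index.core.retrievers import QueryFusionRetriever

# bm25_retriever and vector_retriever are assumed to exist (e.g. BM25Retriever, VectorIndexRetriever)
fusion_retriever = QueryFusionRetriever(
    [bm25_retriever, vector_retriever],
    mode="reciprocal_rerank",   # RRF over the two ranked lists
    similarity_top_k=40,
    num_queries=1,              # 1 = no LLM-based query rewriting
)
nodes = fusion_retriever.retrieve("how do I rotate API keys?")
```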

Postgres-first note: if you're staying in Postgres, use pgvector with HNSW for the best speed-recall trade-off (at higher memory and build time), or IVFFlat for faster builds and lower memory; both are well documented. (GitHub, cloudberry.apache.org, Severalnines)
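
A minimal sketch of creating the two index types from Python (table and column names are hypothetical; the tuning values shown are illustrative):

```python
# Sketch: pgvector index creation via psycopg 3; "chunks"/"embedding" are hypothetical names
import psycopg

with psycopg.connect("postgresql://localhost/ragdb") as conn, conn.cursor() as cur:
    # HNSW: better speed-recall trade-off, at higher memory and build time
    cur.execute("""
        CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
        ON chunks USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64);
    """)
    # IVFFlat alternative: faster build, lower memory; tune `lists` to your row count
    # cur.execute("""
    #     CREATE INDEX IF NOT EXISTS chunks_embedding_ivfflat
    #     ON chunks USING ivfflat (embedding vector_cosine_ops)
    #     WITH (lists = 100);
    # """)
    conn.commit()
```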


Rerank and de-duplicate at query-time

Even good hybrid retrieval returns redundant or borderline contexts.

  • Cross-encoder rerankers (e.g., Cohere Rerank or MS MARCO cross-encoders) score each candidate chunk with the query and push the truly relevant ones to the top. This reliably increases precision@k. (docs.cohere.com, Sentence Transformers)
  • MMR (Maximal Marginal Relevance) adds diversity by penalising near-duplicates among the selected chunks, which reduces wasted context budget. (Elastic)
```python
# After hybrid retrieval -> top_k=40
candidates = hybrid(query)[:40]
reranked = cross_encoder_rerank(query, candidates)[:12]        # precision
final_ctx = mmr_diversify(query, reranked, k=6, lambda_=0.5)   # diversity
```
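
The cross_encoder_rerank and mmr_diversify calls above are placeholders; a minimal sketch of both, assuming sentence-transformers and numpy, chunks as plain strings, and an `embed` callable that returns unit-normalised vectors (all assumptions):

```python
import numpy as np
from sentence_transformers import CrossEncoder

_ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # any MS MARCO-style cross-encoder

def cross_encoder_rerank(query, chunks):
    # Score each (query, chunk) pair with the cross-encoder and return chunks best-first
    scores = _ce.predict([(query, c) for c in chunks])
    return [c for _, c in sorted(zip(scores, chunks), key=lambda p: -p[0])]

def mmr_diversify(query, chunks, k=6, lambda_=0.5, embed=None):
    # Greedy MMR: trade relevance to the query against similarity to already-selected chunks.
    # `embed` is an assumed callable returning unit-normalised numpy vectors.
    q = embed(query)
    vecs = [embed(c) for c in chunks]
    selected, pool = [], list(range(len(chunks)))

    def mmr_score(i):
        relevance = float(np.dot(q, vecs[i]))
        redundancy = max((float(np.dot(vecs[i], vecs[j])) for j in selected), default=0.0)
        return lambda_ * relevance - (1 - lambda_) * redundancy

    while pool and len(selected) < k:
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return [chunks[i] for i in selected]
```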

Evaluate before (and after) shipping

Adopt a small, fixed eval set of realistic questions. Measure:

  • Retrieval: precision/recall@k on gold passages; or reference-free retrieval quality.
  • Answer: faithfulness (groundedness), answer relevancy, and context precision - Ragas provides these out of the box and integrates cleanly with tracing. Use it to gate releases. (docs.ragas.io, Langfuse)
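
A minimal Ragas gate, assuming an eval set already collected as question/answer/contexts/ground_truth records (the loader is hypothetical and API details vary across Ragas versions):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Hypothetical loader: each row has question, answer, contexts (list[str]) and ground_truth
eval_rows = load_eval_set()
dataset = Dataset.from_list(eval_rows)

result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
scores = result.to_pandas()

# Gate the release on mean faithfulness (the threshold is illustrative)
if scores["faithfulness"].mean() < 0.85:
    raise SystemExit("Faithfulness regression - blocking deploy")
```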

Reference implementation (end-to-end)

```mermaid
flowchart LR
  A[Raw docs] --> B["Clean & normalise"]
  B --> C[MinHash/SimHash near-dup clustering]
  C --> D[Semantic chunking + light overlap]
  D --> E[BM25 index]
  D --> F["Vector index (pgvector HNSW)"]
  E & F --> G[RRF fusion]
  G --> H[Cross-encoder rerank]
  H --> I[MMR diversify]
  I --> J[Context to LLM]
  I --> K["Eval (Ragas) & tracing"]
```

Operational checklist

Ingestion

  • Canonicalise URLs/IDs; store duplicate_cluster_id.
  • Blocklists for boilerplate (nav, cookie banners).

Chunking

  • Use sentence boundaries + semantic merge; verify average tokens/chunk and overlap. (python.langchain.com)

Retrieval

  • Run BM25 and vector search in parallel; fuse the ranked lists with RRF (k≈60).

Rerank

  • Cross-encoder rerank the fused top ~40 down to ~12; MMR-diversify to ~6 chunks for the context window.

Store

  • pgvector HNSW for the speed-recall sweet spot (IVFFlat if build time or memory is the constraint); suppress rows sharing a duplicate_cluster_id at query time.

Evaluate

  • Automate Ragas on a nightly sample; block deploys if faithfulness drops >X% or cost/answer rises >Y%. (docs.ragas.io)

Notes & sources

  • BM25 remains a strong baseline; hybrid and late-interaction models often win when fused well (BEIR). (NeurIPS Datasets Benchmarks Proceedings)
  • RRF is a simple, robust fusion algorithm with strong evidence across IR literature and is widely used in hybrid search systems. (G. V. Cormack)
  • Semantic chunking via sentence grouping + embedding-space similarity is supported in mainstream tooling. (python.langchain.com)
  • Postgres pgvector: HNSW vs IVFFlat trade-offs are documented by maintainers and major guides. (GitHub, Severalnines)
  • Rerankers & MMR (diversity) improve precision and reduce redundancy in the final context window. (docs.cohere.com, Sentence Transformers, Elastic)
