Creative Codes

Services · AI & Machine Learning

RAG Pipeline Development

Retrieval-augmented generation built for production accuracy. Your documents, queryable with cited answers. Hybrid retrieval, cross-encoder reranking, and incremental sync.

Need the data layer first? Web scraping services →

RAG query pipeline

01 · Query

User submits a natural language question to the API endpoint.

02 · Embed

Query is converted to a dense vector using the same embedding model used at ingest.

03 · Retrieve

Hybrid search: dense vector similarity + BM25 keyword matching over the indexed corpus.

04 · Rerank

Cross-encoder reranking scores retrieved chunks by relevance to the specific query.

05 · Generate

LLM generates an answer grounded in the top-k chunks. Cites source documents inline.
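One way to get inline citations out of step 05 is to number each retrieved chunk in the prompt and ask the model to reference those numbers. The sketch below is illustrative only: `Chunk`, `build_cited_prompt`, and the prompt wording are assumptions, not the production prompt.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical container for a retrieved chunk and its source document."""
    text: str
    source: str


def build_cited_prompt(question: str, chunks: list[Chunk]) -> str:
    """Number each chunk so the model can cite it inline as [1], [2], ..."""
    context = "\n\n".join(
        f"[{i}] (source: {c.source})\n{c.text}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered context below. "
        "Cite the supporting chunk after each claim, e.g. [2]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


if __name__ == "__main__":
    demo = [
        Chunk("Refunds are processed within 5 business days.", "policy.pdf"),
        Chunk("Contact support to request a refund.", "faq.md"),
    ]
    print(build_cited_prompt("How long do refunds take?", demo))
```

Because the numbers map back to known sources, the API can translate each `[n]` in the answer into a document link.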

Architecture decisions

Framework, vector store, chunking.

Frameworks

LangChain

Agent orchestration, chain composition, document loaders

LlamaIndex

Index abstractions, query engines, node parsers

Vector Stores

Qdrant

Self-hosted, high-performance, large-scale collections

Pinecone

Managed, serverless, no ops overhead

ChromaDB

Development and small-scale deployments

Chunking Strategy

Recursive splitter

Respects paragraph and sentence boundaries

Semantic chunking

Splits on semantic shifts, not token count

Document-aware

Preserves section hierarchy from PDFs/DOCX

Stack

Python · LangChain · LlamaIndex · OpenAI · Claude · Qdrant · Pinecone · FastAPI · PostgreSQL · Docker · AWS

How the code looks.

rag_pipeline.py
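from uuid import uuid4

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Assumed to be defined elsewhere in the application: client (a QdrantClient),
# COLLECTION, extract_text, bm25_index, cross_encoder, dedupe_and_merge,
# llm, PROMPT, and Answer.

# Shared by ingest and query so both sides use the same embedding model
embedder = OpenAIEmbeddings(model="text-embedding-3-large")

async def ingest_document(doc_path: str, metadata: dict) -> int:
    """Chunk, embed, and index a document into Qdrant."""

    # Load and chunk with overlap for cross-boundary context
    raw_text = extract_text(doc_path)        # PDF, DOCX, HTML, plain text
    splitter  = RecursiveCharacterTextSplitter(
        chunk_size=600, chunk_overlap=120,   # measured in characters by default
        separators=["\n\n", "\n", ". ", " "],
    )
    chunks = splitter.split_text(raw_text)

    # Embed all chunks in a single batched API call
    embeddings = await embedder.aembed_documents(chunks)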

    # Upsert into Qdrant with source metadata for citation
    points = [
        PointStruct(
            id=str(uuid4()),
            vector=embedding,
            payload={
                "text": chunk,
                "source": doc_path,
                "page": metadata.get("page"),
                "section": metadata.get("section"),
                "doc_id": metadata["doc_id"],
            },
        )
        for chunk, embedding in zip(chunks, embeddings)
    ]
    client.upsert(collection_name=COLLECTION, points=points)
    return len(chunks)


async def query(question: str, top_k: int = 8) -> Answer:
    """Hybrid search: dense + BM25, then rerank, then generate."""

    # Dense retrieval
    q_embed  = await embedder.aembed_query(question)
    dense    = client.search(COLLECTION, q_embed, limit=top_k * 2)

    # BM25 keyword retrieval (catches exact-match queries)
    keyword  = bm25_index.search(question, top_n=top_k * 2)

    # Merge and rerank with cross-encoder
    combined = dedupe_and_merge(dense, keyword)
    reranked = cross_encoder.rerank(question, combined)[:top_k]

    # Generate with source citations
    context  = "\n\n".join(c.text for c in reranked)
    answer   = await llm.acomplete(PROMPT.format(context=context, question=question))
    return Answer(text=answer, sources=[c.source for c in reranked])

RecursiveCharacterTextSplitter(..., chunk_overlap=120)

A 120-character overlap (the splitter counts characters by default, not tokens) keeps context at chunk boundaries from being lost. A sentence that spans two chunks is still retrievable from either side.
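To make the boundary effect concrete, here is a simplified fixed-window splitter. It is not LangChain's recursive algorithm, and `chunk_with_overlap` is a hypothetical helper. It only shows how consecutive chunks share their edge characters.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size windows where each window repeats the
    last `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


if __name__ == "__main__":
    for chunk in chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2):
        print(chunk)
```

With `chunk_size=4` and `overlap=2`, every chunk starts with the last two characters of the one before it, so text near a boundary appears in both neighbors.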

dedupe_and_merge(dense, keyword)

Hybrid retrieval runs dense and keyword search independently, then merges with reciprocal rank fusion. Dense finds semantically similar chunks; keyword finds exact technical terms.
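Reciprocal rank fusion can be sketched in a few lines. This is an illustrative version, not the actual `dedupe_and_merge` implementation: the function name, the use of plain string IDs, and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) per chunk, so a chunk that ranks
    well in both dense and keyword results rises to the top. Duplicates
    are merged because scores accumulate per ID.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]      # ranked by vector similarity
    keyword_hits = ["c", "a", "d"]    # ranked by BM25
    print(reciprocal_rank_fusion([dense_hits, keyword_hits]))
```

RRF uses only rank positions, so the dense similarity scores and BM25 scores never need to be normalized against each other.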

cross_encoder.rerank(question, combined)[:top_k]

Cross-encoders score each query-chunk pair jointly, so the model reads the query and the chunk together. That costs more than bi-encoder similarity, but it is typically more accurate on ambiguous queries. Only the top-k reranked chunks go to the LLM.
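The rerank step reduces to "score every (query, chunk) pair, sort, truncate". The sketch below keeps the scorer pluggable. `rerank` and `toy_overlap_scorer` are hypothetical names, and the word-overlap scorer is only a stand-in so the example runs without downloading a model. In practice the scorer would be a real cross-encoder, for example the `predict` method of a `sentence_transformers.CrossEncoder`.

```python
from typing import Callable, Sequence

# A scorer takes (query, chunk) pairs and returns one relevance score per pair.
PairScorer = Callable[[Sequence[tuple[str, str]]], Sequence[float]]


def rerank(query: str, chunks: Sequence[str], score_pairs: PairScorer,
           top_k: int) -> list[str]:
    """Score each (query, chunk) pair jointly and keep the top_k chunks."""
    pairs = [(query, chunk) for chunk in chunks]
    scores = score_pairs(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda cs: cs[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def toy_overlap_scorer(pairs: Sequence[tuple[str, str]]) -> list[float]:
    """Stand-in scorer: fraction of query words that appear in the chunk."""
    scores = []
    for query, chunk in pairs:
        q_words = set(query.lower().split())
        c_words = set(chunk.lower().split())
        scores.append(len(q_words & c_words) / max(len(q_words), 1))
    return scores


if __name__ == "__main__":
    candidates = ["billing overview", "how to reset your api key", "api rate limits"]
    print(rerank("reset api key", candidates, toy_overlap_scorer, top_k=2))
```

Because the expensive scorer only sees the merged candidate set, its cost scales with the number of retrieved chunks rather than the size of the corpus.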

Case study

Enterprise RAG Knowledge System

10,000+ documents from SharePoint, Google Drive, and file servers. 94% answer accuracy. Sub-2-second query latency. Built for a financial services firm with access control and source citations on every response.

Read the case study →

Need production-grade RAG?

Tell us about your document corpus and query patterns. We'll scope it.

Book a call or send us a message