3. Embedding Chunks and Storing Them in a Vector DB

Purpose:

Convert each chunk of text into an embedding: a fixed-length numerical vector that encodes its semantic meaning.
A vector database stores these embeddings (which can represent text, images, or audio) and searches them by semantic similarity rather than by exact matches.
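
To make "semantic similarity" concrete, here is a minimal sketch of what a similarity search does conceptually. It is not how ChromaDB is implemented. The 3-dimensional vectors, the toy_store dictionary, and the search helper are all hypothetical and invented for illustration; real embeddings come from a model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
toy_store = {
    "cats are small pets": [0.9, 0.1, 0.0],
    "dogs are loyal pets": [0.8, 0.3, 0.1],
    "stock prices fell today": [0.0, 0.2, 0.9],
}

def search(query_vector, store, top_k=2):
    """Rank stored texts by similarity to the query, highest first."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in store.items()
    ]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]

# Stands in for the embedding of a pet-related question (made-up values).
query = [0.9, 0.15, 0.0]
results = search(query, toy_store)
print(results)
```

The two pet sentences rank above the finance sentence even though the query shares no exact words with them, because only the direction of the vectors is compared. A vector database performs the same kind of ranking, typically with an index so that it does not have to scan every stored vector.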

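This step loads each chunk file, embeds the chunk texts, and stores each text together with its embedding in ChromaDB: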

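from pathlib import Path
import json

from sentence_transformers import SentenceTransformer
from tqdm import tqdm  # progress bar for the file loop
import chromadb

chunks_dir = Path("chunks")
model = SentenceTransformer("all-MiniLM-L6-v2")

# chromadb.Client() keeps data in memory by default, so the collection
# lives only as long as this process.
chroma_client = chromadb.Client()
# get_or_create_collection avoids an error if "rag_chunks" already exists.
collection = chroma_client.get_or_create_collection("rag_chunks")

for file in tqdm(sorted(chunks_dir.glob("*_chunks.json"))):
    with open(file, "r", encoding="utf-8") as f:
        chunks = json.load(f)

    if not chunks:
        continue  # nothing to embed in an empty file

    texts = [chunk["text"] for chunk in chunks]
    embeddings = model.encode(texts).tolist()

    # Add the whole file in one call instead of one call per chunk.
    # IDs combine the file stem and the chunk index, e.g. "doc_chunks_0".
    collection.add(
        documents=texts,
        embeddings=embeddings,
        ids=[f"{file.stem}_{i}" for i in range(len(texts))],
    )

print("All chunks embedded and stored in ChromaDB.")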

4. Querying and Retrieving Relevant Context