3. Embedding Chunks and Storing in Vector DB
Purpose:
Convert each chunk of text into a numerical representation (embedding) that encodes semantic meaning.
A vector database stores embeddings, which are numerical representations of data such as text, images, or audio, and searches them by semantic similarity rather than by exact match.
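To make "semantic similarity" concrete, here is a minimal sketch of a similarity search using only the standard library. The three-dimensional vectors, the `store` dictionary, and the `most_similar` helper are made up for illustration. Real embeddings have hundreds of dimensions, and a vector database may use a different distance metric than the cosine similarity shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings": text mapped to a hand-written vector.
store = {
    "cats are small pets": [0.9, 0.1, 0.0],
    "kittens are young cats": [0.8, 0.2, 0.1],
    "stock prices fell today": [0.0, 0.1, 0.9],
}


def most_similar(query_vec, store):
    """Return the stored text whose vector has the highest cosine similarity."""
    return max(store, key=lambda text: cosine_similarity(query_vec, store[text]))


# Made-up vector standing in for the embedding of a cat-related query.
query = [1.0, 0.0, 0.0]
best = most_similar(query, store)
print(best)
```

The lookup compares vectors, not words, so a query can match a stored text even when the two share no exact terms.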
This step:
- Loads chunked text
- Converts text to vectors using a sentence-transformer model (all-MiniLM-L6-v2)
- Stores both text and its vector in ChromaDB
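from pathlib import Path
import json

import chromadb
from sentence_transformers import SentenceTransformer
from tqdm import tqdm  # progress bar for the file loop

chunks_dir = Path("chunks")
model = SentenceTransformer("all-MiniLM-L6-v2")

# chromadb.Client() keeps the data in memory for the current process.
chroma_client = chromadb.Client()
# get_or_create_collection avoids an error if "rag_chunks" already exists.
collection = chroma_client.get_or_create_collection("rag_chunks")

for file in tqdm(list(chunks_dir.glob("*_chunks.json"))):
    with open(file, "r", encoding="utf-8") as f:
        chunks = json.load(f)
    if not chunks:
        continue  # nothing to embed in this file

    texts = [chunk["text"] for chunk in chunks]
    # Encode all chunks of the file in one call, then convert to plain lists.
    embeddings = model.encode(texts).tolist()

    # Store each text with its vector under a unique ID: "<file stem>_<index>".
    collection.add(
        documents=texts,
        embeddings=embeddings,
        ids=[f"{file.stem}_{i}" for i in range(len(texts))],
    )

print("All chunks embedded and stored in ChromaDB.")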
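For files with many chunks, it may help to send the data to the collection in fixed-size batches. The sketch below only prepares the batches and uses no external libraries. `build_batches` and its `batch_size` parameter are hypothetical names introduced here, and the toy texts and vectors stand in for one file's chunks and their embeddings.

```python
def build_batches(stem, texts, embeddings, batch_size=100):
    """Yield dicts of keyword arguments, each covering up to batch_size chunks."""
    if len(texts) != len(embeddings):
        raise ValueError("texts and embeddings must have the same length")
    for start in range(0, len(texts), batch_size):
        end = min(start + batch_size, len(texts))
        yield {
            "documents": texts[start:end],
            "embeddings": embeddings[start:end],
            # Same ID scheme as the loop above: "<file stem>_<chunk index>".
            "ids": [f"{stem}_{i}" for i in range(start, end)],
        }


# Toy data standing in for one file's chunks and their vectors.
texts = ["a", "b", "c", "d", "e"]
embeddings = [[float(i), 0.0] for i in range(len(texts))]

batches = list(build_batches("doc1_chunks", texts, embeddings, batch_size=2))
for batch in batches:
    # With a real collection this would be: collection.add(**batch)
    print(batch["ids"])
```

The index in each ID is counted across the whole file rather than within a batch, so the IDs stay unique regardless of the batch size chosen.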