2. Text Chunking

Purpose Recap

Break each .txt file from the previous extraction step into smaller chunks of roughly 300–500 tokens, with an optional overlap of about 10–20% between consecutive chunks so context isn't lost at chunk boundaries.


Install Required Packages

Install LangChain, which provides the text splitter used below (newer releases also ship the splitters separately as langchain-text-splitters; adjust the import if you use that package):

pip install langchain

Folder Setup

Create the output folder (optional here, since the script below also creates it via mkdir(parents=True, exist_ok=True)):

mkdir -p chunks

Python Script: chunk_texts.py

from pathlib import Path
from langchain.text_splitter import RecursiveCharacterTextSplitter
import json

# Paths
input_dir = Path("extracted_text")
output_dir = Path("chunks")
output_dir.mkdir(parents=True, exist_ok=True)

# Configure the text splitter.
# Note: size and overlap are measured in characters by default, so
# chunk_size=1000 is roughly 250 tokens of English text (see the
# token-based variant after the script for exact token counts).
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # max characters per chunk; adjust for your LLM's context size
    chunk_overlap=150,  # ~15% overlap to preserve context across boundaries
)

for text_file in input_dir.glob("*.txt"):
    raw_text = text_file.read_text(encoding="utf-8")
    docs = splitter.create_documents([raw_text])

    # Save chunks as a list of dictionaries
    chunk_output = [
        {"chunk_id": i, "text": doc.page_content}
        for i, doc in enumerate(docs)
    ]

    output_path = output_dir / (text_file.stem + "_chunks.json")
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(chunk_output, f, indent=2, ensure_ascii=False)  # keep non-ASCII text readable

    print(f"✅ Chunked {text_file.name} → {len(docs)} chunks saved to {output_path.name}")

This script:

- reads every .txt file in extracted_text/
- splits each file into overlapping chunks with the configured splitter
- writes each document's chunks to a matching *_chunks.json file in chunks/

Run it with: python chunk_texts.py
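If you want chunk sizes measured in real tokens instead of characters, LangChain's splitters can delegate length counting to tiktoken. A minimal sketch, assuming tiktoken is installed (pip install tiktoken) and that the cl100k_base encoding matches your target model; it is a drop-in replacement for the splitter configured above, and the rest of the script stays the same:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Token-based splitter: chunk_size and chunk_overlap are now counted in
# tokens of the chosen encoding rather than characters.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # assumption: pick the encoding for your model
    chunk_size=400,               # ~400 tokens, inside the 300-500 target range
    chunk_overlap=60,             # ~15% overlap
)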

Output Format Example

Each *_chunks.json file will look like this:

[
  {"chunk_id": 0, "text": "Employees must complete annual training..."},
  {"chunk_id": 1, "text": "Additionally, system admins should..."},
  ...
]
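Before moving on to embeddings, it is worth loading one chunk file back as a sanity check. A minimal sketch (policy_chunks.json is a hypothetical filename; substitute one of your own outputs):

import json
from pathlib import Path

# Load a chunk file and preview its contents (filename is hypothetical).
chunks = json.loads(Path("chunks/policy_chunks.json").read_text(encoding="utf-8"))

print(f"Loaded {len(chunks)} chunks")
print(chunks[0]["text"][:200])  # first 200 characters of the first chunk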

3. Embedding Chunks and Storing in Vector DB