2. Text Chunking
Purpose Recap
Break each .txt file (from Step 2) into smaller chunks of roughly 300–500 tokens (words), with an optional overlap of about 10–20% so context is preserved across chunk boundaries.
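The idea above can be sketched without any library: a minimal pure-Python, word-based chunker with a configurable overlap. The function name and parameters here are illustrative only; the actual script below uses LangChain's splitter instead.

```python
def chunk_words(text, chunk_size=400, overlap=60):
    """Split text into chunks of ~chunk_size words, repeating the last
    `overlap` words of each chunk at the start of the next one.
    Assumes chunk_size > overlap."""
    words = text.split()
    step = chunk_size - overlap  # advance by this many words per chunk
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the text
    return chunks

# 1000 words, 400-word chunks, 60-word (15%) overlap -> 3 chunks
sample = " ".join(f"w{i}" for i in range(1000))
print(len(chunk_words(sample)))  # → 3
```

With a 60-word overlap, the last 60 words of one chunk reappear as the first 60 words of the next, which is what keeps sentences that straddle a boundary retrievable from either side.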
Install Required Packages
Install langchain (newer LangChain releases move the text splitters into the companion langchain-text-splitters package, which installing langchain normally pulls in):
pip install langchain
Folder Setup
mkdir -p chunks
- extracted_text/ → contains the full text files you already parsed
- chunks/ → will store .json files, each containing text chunks
Python Script: chunk_texts.py
from pathlib import Path
try:
    from langchain.text_splitter import RecursiveCharacterTextSplitter
except ImportError:  # newer LangChain versions
    from langchain_text_splitters import RecursiveCharacterTextSplitter
import json
# Paths
input_dir = Path("extracted_text")
output_dir = Path("chunks")
output_dir.mkdir(parents=True, exist_ok=True)
# Configure the text splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Size in characters, not tokens (~1000 chars ≈ 150–250 words); adjust for your LLM's context size
    chunk_overlap=150,  # 15% overlap to preserve context across chunk boundaries
)
for text_file in input_dir.glob("*.txt"):
    raw_text = text_file.read_text(encoding="utf-8")
    docs = splitter.create_documents([raw_text])
    # Save chunks as a list of dictionaries
    chunk_output = [
        {"chunk_id": i, "text": doc.page_content}
        for i, doc in enumerate(docs)
    ]
    output_path = output_dir / (text_file.stem + "_chunks.json")
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(chunk_output, f, indent=2)
    print(f"✅ Chunked {text_file.name} → {len(docs)} chunks saved to {output_path.name}")
This script:
- Loads each .txt file from extracted_text/
- Breaks the content into overlapping chunks (1000 characters, with a 150-character overlap)
- Saves the chunks of each file to a structured JSON file in chunks/
Output Format Example
Each .json file will look like:
[
  {"chunk_id": 0, "text": "Employees must complete annual training..."},
  {"chunk_id": 1, "text": "Additionally, system admins should..."},
  ...
]
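Before feeding these files into the next step, it can be worth sanity-checking that each one matches this format. A minimal validator using only the standard library (the helper name and demo data are illustrative):

```python
import json
import tempfile
from pathlib import Path

def validate_chunks(path):
    """Check that a *_chunks.json file has sequential chunk_ids
    starting at 0 and no empty chunk text; return the chunk count."""
    chunks = json.loads(Path(path).read_text(encoding="utf-8"))
    assert [c["chunk_id"] for c in chunks] == list(range(len(chunks))), "non-sequential chunk_ids"
    assert all(c["text"].strip() for c in chunks), "empty chunk text"
    return len(chunks)

# Demo on a small in-memory example (illustrative data, not a real output file)
with tempfile.NamedTemporaryFile("w", suffix="_chunks.json", delete=False) as f:
    json.dump([{"chunk_id": 0, "text": "Employees must complete annual training..."},
               {"chunk_id": 1, "text": "Additionally, system admins should..."}], f)
print(validate_chunks(f.name))  # → 2
```

Running this over every file in chunks/ catches truncated or malformed JSON early, which is much cheaper than discovering it during embedding.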