2. Text Chunking
Purpose Recap
Break each .txt file (from Step 2) into smaller chunks of ~300–500 tokens (words), with optional overlap (~10–20%).
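The splitter configured below measures chunk size in characters, while the target above is in tokens, so a rough conversion helps pick the numbers. The 4-characters-per-token ratio used here is a common heuristic for English prose, not an exact tokenizer count:

```python
# Rough heuristic: English prose averages ~4 characters per token.
CHARS_PER_TOKEN = 4

def chunk_size_chars(target_tokens: int, overlap_fraction: float = 0.15):
    """Translate a token budget into character-based splitter settings."""
    size = target_tokens * CHARS_PER_TOKEN
    return size, int(size * overlap_fraction)

# A 250-token target with 15% overlap matches the script's 1000/150 settings:
print(chunk_size_chars(250))  # → (1000, 150)
```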
Install Required Packages
Install LangChain:

```bash
pip install langchain
```

Note: in recent LangChain releases the text splitters have moved to a separate `langchain-text-splitters` package (imported as `langchain_text_splitters`); the import below works on older versions that bundle them.
Folder Setup
```bash
mkdir -p chunks
```

- extracted_text/ → contains the full text files you already parsed
- chunks/ → will store .json files, each containing text chunks
Python Script: chunk_texts.py
```python
from pathlib import Path
import json

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Paths
input_dir = Path("extracted_text")
output_dir = Path("chunks")
output_dir.mkdir(parents=True, exist_ok=True)

# Configure the text splitter (sizes are in characters, not tokens)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # ~250 tokens; adjust based on LLM context size
    chunk_overlap=150,  # 15% overlap to preserve context across chunks
)

for text_file in input_dir.glob("*.txt"):
    raw_text = text_file.read_text(encoding="utf-8")
    docs = splitter.create_documents([raw_text])

    # Save chunks as a list of dictionaries
    chunk_output = [
        {"chunk_id": i, "text": doc.page_content}
        for i, doc in enumerate(docs)
    ]

    output_path = output_dir / (text_file.stem + "_chunks.json")
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(chunk_output, f, indent=2)

    print(f"✅ Chunked {text_file.name} → {len(docs)} chunks saved to {output_path.name}")
```
This script:
- Loads text files
- Breaks content into overlapping chunks (1000 characters with a 150-character overlap)
- Saves chunked text to a structured JSON file
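To sanity-check that adjacent chunks really share material, you can compare each chunk's tail with the next chunk's head. The `shared_overlap` helper below is a hypothetical utility for this tutorial, not a LangChain API:

```python
def shared_overlap(prev_chunk: str, next_chunk: str) -> str:
    """Longest suffix of prev_chunk that is also a prefix of next_chunk."""
    for k in range(min(len(prev_chunk), len(next_chunk)), 0, -1):
        if prev_chunk[-k:] == next_chunk[:k]:
            return prev_chunk[-k:]
    return ""

# Adjacent chunks produced with overlap share a boundary region:
print(shared_overlap("...complete annual training", "annual training is required..."))
# prints "annual training"
```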
Output Format Example
Each .json file will look like:
```json
[
  {"chunk_id": 0, "text": "Employees must complete annual training..."},
  {"chunk_id": 1, "text": "Additionally, system admins should..."},
  ...
]
```
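Downstream steps (embedding, retrieval) will need these files back in memory in chunk order. A minimal loader, assuming the `*_chunks.json` naming produced by the script above:

```python
import json
from pathlib import Path

def load_chunks(chunks_dir: str = "chunks") -> dict[str, list[str]]:
    """Map each source document's stem to its chunk texts, ordered by chunk_id."""
    result: dict[str, list[str]] = {}
    for path in Path(chunks_dir).glob("*_chunks.json"):
        records = json.loads(path.read_text(encoding="utf-8"))
        records.sort(key=lambda r: r["chunk_id"])
        result[path.stem.removesuffix("_chunks")] = [r["text"] for r in records]
    return result
```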