1. Text Extraction
Step 1 – PDF Input Setup
1.1 Create Project Directory
Create a folder structure to organize data and pipeline steps:
```bash
mkdir -p ~/rag-pipeline/{pdfs,ocr_images,extracted_text,chunks,embeddings,models}
cd ~/rag-pipeline
```
This creates:
| Folder | Purpose |
|---|---|
| `pdfs/` | Raw input PDFs (text- or image-based) |
| `ocr_images/` | Temporary folder for image-based OCR |
| `extracted_text/` | Optional storage for raw extracted text |
| `chunks/` | Chunked outputs |
| `embeddings/` | Vector files for each chunk |
| `models/` | Local LLM models (GGUF files, etc.) |
1.2 Install Required Tools
Create and activate a Python virtual environment, then install the basics:
```bash
python3 -m venv ragenv
source ragenv/bin/activate
pip install langchain pdfplumber pytesseract pdf2image Pillow PyPDF2
brew install tesseract poppler  # macOS dependencies
```
- `tesseract`: the OCR engine
- `poppler`: required by `pdf2image` to convert PDF pages to images (macOS)
- `langchain`, `pdfplumber`: for loading and handling PDFs
- `PyPDF2`: used by the check script in 1.4 to probe PDFs for an extractable text layer
1.3 📥 Place PDFs in Folder
Copy all documents into the `pdfs/` directory:
```bash
cp /path/to/your-documents/*.pdf ~/rag-pipeline/pdfs/
```
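If you want to double-check that the copy landed where the pipeline expects it, a short optional check (run from `~/rag-pipeline`; the file name is illustrative) looks like this:

```python
# list_pdfs.py -- optional check that the input folder is populated
from pathlib import Path

pdfs = sorted(Path("pdfs").glob("*.pdf"))
print(f"Found {len(pdfs)} PDF(s) in pdfs/:")
for p in pdfs:
    print(" -", p.name)
```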
1.4 ✅ Check Which PDFs Are Image-Based
Create a script that scans your PDFs and identifies which ones are text-based vs image-based:
```python
# check_pdf_type.py
from pathlib import Path

from PyPDF2 import PdfReader

pdf_dir = Path("pdfs")

for pdf_file in pdf_dir.glob("*.pdf"):
    reader = PdfReader(str(pdf_file))
    text = ""
    for page in reader.pages:
        # extract_text() returns None for pages without a text layer
        text += page.extract_text() or ""
    # Heuristic: almost no extractable text across the file means a scanned PDF
    if len(text.strip()) < 100:
        print(f"[IMAGE-BASED] {pdf_file.name}")
    else:
        print(f"[TEXT-BASED] {pdf_file.name}")
```
Run it:
```bash
python check_pdf_type.py
```
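Later steps need to know which files should be routed through OCR and which can be read directly, so it can help to persist the classification instead of only printing it. Below is a minimal sketch of that variant; the output file `pdf_types.json` and the 100-character threshold are choices for illustration, not requirements:

```python
# classify_and_save.py -- variant of check_pdf_type.py that records the result
import json
from pathlib import Path

from PyPDF2 import PdfReader

pdf_dir = Path("pdfs")
results = {}

for pdf_file in pdf_dir.glob("*.pdf"):
    reader = PdfReader(str(pdf_file))
    text = "".join(page.extract_text() or "" for page in reader.pages)
    # Same heuristic as above: almost no extractable text means a scanned/image PDF
    results[pdf_file.name] = "image" if len(text.strip()) < 100 else "text"

Path("pdf_types.json").write_text(json.dumps(results, indent=2))
print(f"Wrote classification for {len(results)} PDF(s) to pdf_types.json")
```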