1. Text Extraction
Step 1 – PDF Input Setup
1.1 Create Project Directory
Create a folder structure to organize data and pipeline steps:
```bash
mkdir -p ~/rag-pipeline/{pdfs,ocr_images,extracted_text,chunks,embeddings,models}
cd ~/rag-pipeline
```
This creates:
| Folder | Purpose |
|---|---|
| `pdfs/` | Raw input PDFs (text- or image-based) |
| `ocr_images/` | Temporary folder for image-based OCR |
| `extracted_text/` | Optional storage for raw extracted text |
| `chunks/` | Chunked outputs |
| `embeddings/` | Vector files for each chunk |
| `models/` | Local LLM models (GGUF files, etc.) |
1.2 Install Required Tools
Create and activate a Python virtual environment, then install the basics:
```bash
python3 -m venv ragenv
source ragenv/bin/activate
pip install langchain pdfplumber pytesseract pdf2image Pillow PyPDF2
brew install tesseract poppler  # macOS dependencies
```
- `tesseract`: the OCR engine
- `poppler`: required by `pdf2image` to convert PDF pages to images (macOS)
- `langchain`, `pdfplumber`: for loading and handling PDFs
- `PyPDF2`: used by the check script in 1.4 to probe PDFs for an extractable text layer
1.3 📥 Place PDFs in Folder
Copy all documents into the `pdfs/` directory:
```bash
cp /path/to/your-documents/*.pdf ~/rag-pipeline/pdfs/
```
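If you want to double-check that the copy landed where the pipeline expects it, a short optional check (run from `~/rag-pipeline`; the file name is illustrative) looks like this:

```python
# list_pdfs.py -- optional check that the input folder is populated
from pathlib import Path

pdfs = sorted(Path("pdfs").glob("*.pdf"))
print(f"Found {len(pdfs)} PDF(s) in pdfs/:")
for p in pdfs:
    print(" -", p.name)
```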
1.4 ✅ Check Which PDFs Are Image-Based
Create a script that scans your PDFs and identifies which ones are text-based vs image-based:
```python
# check_pdf_type.py
from pathlib import Path

from PyPDF2 import PdfReader

pdf_dir = Path("pdfs")

for pdf_file in pdf_dir.glob("*.pdf"):
    reader = PdfReader(str(pdf_file))
    text = ""
    for page in reader.pages:
        # extract_text() returns None for pages without a text layer
        text += page.extract_text() or ""
    # Heuristic: almost no extractable text across the file means a scanned PDF
    if len(text.strip()) < 100:
        print(f"[IMAGE-BASED] {pdf_file.name}")
    else:
        print(f"[TEXT-BASED] {pdf_file.name}")
```
Run it:
```bash
python check_pdf_type.py
```
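Later steps need to know which files should be routed through OCR and which can be read directly, so it can help to persist the classification instead of only printing it. Below is a minimal sketch of that variant; the output file `pdf_types.json` and the 100-character threshold are choices for illustration, not requirements:

```python
# classify_and_save.py -- variant of check_pdf_type.py that records the result
import json
from pathlib import Path

from PyPDF2 import PdfReader

pdf_dir = Path("pdfs")
results = {}

for pdf_file in pdf_dir.glob("*.pdf"):
    reader = PdfReader(str(pdf_file))
    text = "".join(page.extract_text() or "" for page in reader.pages)
    # Same heuristic as above: almost no extractable text means a scanned/image PDF
    results[pdf_file.name] = "image" if len(text.strip()) < 100 else "text"

Path("pdf_types.json").write_text(json.dumps(results, indent=2))
print(f"Wrote classification for {len(results)} PDF(s) to pdf_types.json")
```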