1. Text Extraction
1. Step 1 – PDF Input Setup
1.1 Create Project Directory
Create a folder structure to organize data and pipeline steps:
mkdir -p ~/rag-pipeline/{pdfs,ocr_images,extracted_text,chunks,embeddings,models}
cd ~/rag-pipeline
This creates:
| Folder | Purpose |
|---|---|
| pdfs/ | Raw input PDFs (text or image-based) |
| ocr_images/ | Temporary folder for image-based OCR |
| extracted_text/ | Optional storage for raw extracted text |
| chunks/ | Chunked outputs |
| embeddings/ | Vector files for each chunk |
| models/ | Local LLM models (GGUF files etc.) |
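On systems where the shell does not support brace expansion, the same layout can be created from Python. This is a minimal sketch, not part of the original pipeline: the function name `create_layout` is hypothetical, and the folder names are copied from the `mkdir` command above.

```python
from pathlib import Path
from typing import List

# Folder names copied from the mkdir command in this step.
FOLDERS = ["pdfs", "ocr_images", "extracted_text", "chunks", "embeddings", "models"]


def create_layout(root: Path) -> List[Path]:
    """Create the pipeline folders under `root` and return their paths."""
    created = []
    for name in FOLDERS:
        folder = root / name
        # parents=True mirrors `mkdir -p`; exist_ok=True makes reruns harmless.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `create_layout(Path.home() / "rag-pipeline")` should produce the same folders as the shell command, and running it a second time leaves existing folders untouched.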
1.2 Install Required Tools
Create and activate a Python virtual environment, then install the basics:
python3 -m venv ragenv
source ragenv/bin/activate
pip install langchain pdfplumber PyPDF2 pytesseract pdf2image Pillow
brew install tesseract poppler  # Mac dependencies
- tesseract: OCR engine
- poppler: required for pdf2image to convert PDFs to images (Mac)
- langchain, pdfplumber: for loading and handling PDFs
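Before moving on, it can help to confirm that the tools are actually reachable from the active environment. The sketch below is an assumption-laden convenience check, not part of the original guide: `check_setup` is a hypothetical helper, `pdftoppm` is used as a stand-in for poppler because it is one of the executables that poppler installs, and `PIL` is the import name of Pillow.

```python
import importlib.util
import shutil

# Executables expected on PATH: tesseract (OCR) and pdftoppm (installed by poppler).
BINARIES = ["tesseract", "pdftoppm"]
# Import names of the Python packages; Pillow is imported as PIL.
# PyPDF2 is included because the PDF type check script in this step imports it.
MODULES = ["langchain", "pdfplumber", "pytesseract", "pdf2image", "PIL", "PyPDF2"]


def check_setup() -> dict:
    """Map each binary or module name to True if it can be found."""
    status = {}
    for name in BINARIES:
        status[name] = shutil.which(name) is not None
    for name in MODULES:
        # find_spec returns None for a top-level module that is not installed.
        status[name] = importlib.util.find_spec(name) is not None
    return status


for name, ok in check_setup().items():
    print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

Anything reported as MISSING points at either the `pip install` or the `brew install` line needing another look.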
1.3 📥 Place PDFs in Folder
Copy all documents into the pdfs/ directory:
cp /path/to/your-documents/*.pdf ~/rag-pipeline/pdfs/
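After copying, a quick check can catch files that carry a `.pdf` extension but are not PDFs (for example, a renamed download page). The following is a rough sketch under one assumption: that a valid PDF begins with the bytes `%PDF-` at offset 0, which holds for typical files but can miss PDFs with leading junk bytes. Both function names are hypothetical.

```python
from pathlib import Path
from typing import List


def looks_like_pdf(path: Path) -> bool:
    """Return True if the file starts with the PDF header marker."""
    with path.open("rb") as handle:
        return handle.read(5) == b"%PDF-"


def find_suspect_files(pdf_dir: Path) -> List[Path]:
    """List *.pdf files in pdf_dir whose first bytes are not a PDF header."""
    return sorted(p for p in pdf_dir.glob("*.pdf") if not looks_like_pdf(p))
```

For instance, `find_suspect_files(Path("pdfs"))` returns the files worth inspecting by hand before running the rest of the pipeline.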
1.4 ✅ Check Which PDFs Are Image-Based
Create a script that scans your PDFs and flags each one as text-based or image-based. A PDF that yields fewer than 100 characters of extractable text is treated as image-based:
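# check_pdf_type.py
# Classify each PDF in pdfs/ as text-based or image-based.
from pathlib import Path

from PyPDF2 import PdfReader  # pip install PyPDF2

MIN_CHARS = 100  # below this many extracted characters, treat the PDF as image-based

pdf_dir = Path("pdfs")
for pdf_file in sorted(pdf_dir.glob("*.pdf")):
    reader = PdfReader(str(pdf_file))
    # extract_text() can return None or "" for pages without a text layer
    text = "".join(page.extract_text() or "" for page in reader.pages)
    if len(text.strip()) < MIN_CHARS:
        print(f"[IMAGE-BASED] {pdf_file.name}")
    else:
        print(f"[TEXT-BASED]  {pdf_file.name}")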
Run it:
python check_pdf_type.py
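The script above judges a whole document by its total character count, so a PDF that mixes typed pages with scanned pages is reported as text-based. One possible refinement, sketched here as an assumption rather than as part of the original script, is to classify page by page. The helper names and the per-page threshold of 20 characters are hypothetical choices and may need tuning for your documents.

```python
from typing import List


def classify_pages(page_texts: List[str], min_chars: int = 20) -> List[bool]:
    """Return True for each page with at least min_chars of extracted text."""
    return [len(text.strip()) >= min_chars for text in page_texts]


def summarize(page_texts: List[str], min_chars: int = 20) -> str:
    """Label a document as TEXT-BASED, IMAGE-BASED, or MIXED."""
    flags = classify_pages(page_texts, min_chars)
    if not any(flags):  # also covers a document with no pages
        return "IMAGE-BASED"
    if all(flags):
        return "TEXT-BASED"
    return "MIXED"


def page_texts_from_pdf(path: str) -> List[str]:
    """Extract one string per page using PyPDF2, as in check_pdf_type.py."""
    # Imported lazily so the helpers above work without PyPDF2 installed.
    from PyPDF2 import PdfReader

    return [page.extract_text() or "" for page in PdfReader(path).pages]
```

With a hypothetical file name, `summarize(page_texts_from_pdf("pdfs/report.pdf"))` returns one of the three labels. A MIXED result suggests sending only the pages without text through OCR.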