1. Dockerfile

Dockerfile Breakdown: Headless RAG Pipeline with OCR and Cron

This Dockerfile builds an image for a scheduled document-processing pipeline using Python 3.10, Tesseract for OCR, and cron for scheduling. The container runs uploader.py nightly, processing documents from the ./docs folder.


1. Base Image

FROM python:3.10-slim

2. Environment Configuration

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    TZ=America/New_York

3. Working Directory

WORKDIR /app

Once the container is running, you can inspect the working directory with:

docker exec -it <container_id_or_name> sh
cd /app

4. Install System Dependencies

RUN apt-get update && apt-get install -y \
    build-essential \
    poppler-utils \
    tesseract-ocr \
    ocrmypdf \
    cron \
    curl \
    tzdata \
 && rm -rf /var/lib/apt/lists/*
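
After the image is built, it can be worth confirming that the packages above actually put their binaries on PATH. The following is a minimal sketch, assuming a POSIX shell inside the container. missing_tools is a hypothetical helper name, and the tool list in the usage comment assumes the binaries these Debian packages normally provide (pdftotext comes from poppler-utils).

```shell
# Print each named command that is not on PATH.
# Returns 0 only when every command was found.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# Example, run inside the container:
# missing_tools tesseract ocrmypdf pdftotext cron curl
```

A non-zero return status makes the helper usable as a smoke test in a build script.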

5. Install Python Dependencies

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

6. Copy Application Source

COPY . .

7. Configure Crontab

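COPY crontab.txt /etc/cron.d/uploader-cron
RUN chmod 0644 /etc/cron.d/uploader-cron && \
    mkdir -p /app/logs

Files in /etc/cron.d use the system crontab format, which includes a user field, and cron reads them directly. The file should not also be loaded with the crontab command: that installs it as a user crontab, where the user field would be parsed as part of the command. The mkdir step ensures the log directory exists, since the job's output redirect fails without it.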

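Example of crontab.txt (runs daily at 02:00; the sixth field is the user to run as). The interpreter is given by absolute path because cron jobs run with a minimal PATH that does not include /usr/local/bin, where the python image installs Python:

0 2 * * * root /usr/local/bin/python /app/uploader.py >> /app/logs/cron.log 2>&1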
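
A malformed crontab.txt tends to fail silently, so a rough check before building can help. The sketch below assumes a POSIX shell with awk and tail available. check_cron_d is a hypothetical helper, not part of cron. It only verifies that the file ends with a newline and that each job line has enough fields for a schedule, a user, and a command. It cannot tell whether the sixth field is really a user name.

```shell
# Rough pre-build check for a cron.d-style file such as crontab.txt.
check_cron_d() {
    file=$1

    # The last line needs a terminating newline. Command substitution strips
    # a trailing newline, so the result is empty only when one is present.
    if [ -n "$(tail -c 1 "$file")" ]; then
        echo "$file: missing newline at end of file" >&2
        return 1
    fi

    # Job lines need five schedule fields, a user, and a command (7+ fields),
    # or an @keyword, a user, and a command (3+ fields).
    awk '
        /^[ \t]*(#|$)/ { next }                           # comments and blank lines
        /^[ \t]*[A-Za-z_][A-Za-z0-9_]*[ \t]*=/ { next }   # VAR=value settings
        {
            need = ($1 ~ /^@/) ? 3 : 7
            if (NF < need) {
                printf "line %d: too few fields\n", NR
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$file"
}

# Example:
# check_cron_d crontab.txt && echo "crontab.txt looks plausible"
```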

8. Start Cron in Foreground

CMD ["cron", "-f"]
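
To try the image, a build-and-run sequence along these lines may be enough. This is a sketch under stated assumptions: the image and container name rag-uploader is hypothetical, and the two bind mounts assume that uploader.py reads from /app/docs and that the cron log should remain visible on the host. Neither mount is required by the Dockerfile itself, since COPY . . already places the build-time contents of ./docs in the image.

```shell
# DOCKER can be overridden (for example with podman); IMAGE is a hypothetical name.
DOCKER=${DOCKER:-docker}
IMAGE=${IMAGE:-rag-uploader}

build_and_run() {
    # Build from the directory containing the Dockerfile, then start the
    # container detached so that cron keeps running in the background.
    "$DOCKER" build -t "$IMAGE" . &&
    "$DOCKER" run -d --name "$IMAGE" \
        -v "$PWD/docs:/app/docs" \
        -v "$PWD/logs:/app/logs" \
        "$IMAGE"
}

# Example:
# build_and_run
# docker logs rag-uploader        # output from cron itself
# tail -f logs/cron.log           # output from uploader.py
```

Mounting ./docs means new documents are picked up by the next nightly run without rebuilding the image.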