1. Dockerfile
Dockerfile Breakdown: Headless RAG Pipeline with OCR and Cron
This Dockerfile builds an image for a scheduled document-processing pipeline: Python 3.10 runs the code, Tesseract provides OCR, and cron triggers uploader.py nightly to process documents from a ./docs folder.
1. Base Image
FROM python:3.10-slim
- Uses a minimal, official Python 3.10 image.
- Reduces image size and attack surface.
2. Environment Configuration
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
TZ=America/New_York
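- PYTHONDONTWRITEBYTECODE=1: prevents Python from writing .pyc files.
- PYTHONUNBUFFERED=1: disables stdout/stderr buffering so output reaches the logs in real time.
- TZ=America/New_York: sets the container's timezone so that scheduled tasks run on US Eastern time (this relies on the tzdata package).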
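As a rough illustration of how these settings could be verified at runtime, the sketch below compares the environment against the three values set by the ENV instruction. The helper name check_runtime_env is hypothetical and is not part of uploader.py.

```python
import os
import sys

# Expected values, copied from the ENV instruction above.
EXPECTED_ENV = {
    "PYTHONDONTWRITEBYTECODE": "1",
    "PYTHONUNBUFFERED": "1",
    "TZ": "America/New_York",
}


def check_runtime_env(env=None):
    """Return {name: actual_value} for every variable that differs from EXPECTED_ENV.

    `env` defaults to os.environ; pass a plain dict to check other settings.
    """
    env = os.environ if env is None else env
    return {
        name: env.get(name)
        for name, expected in EXPECTED_ENV.items()
        if env.get(name) != expected
    }


if __name__ == "__main__":
    for name, actual in check_runtime_env().items():
        print(f"{name}: expected {EXPECTED_ENV[name]!r}, got {actual!r}")
    # sys.dont_write_bytecode is True when PYTHONDONTWRITEBYTECODE was set
    # to a non-empty string before the interpreter started.
    print("bytecode writing disabled:", sys.dont_write_bytecode)
```

Run inside the container, this should print no mismatches if the ENV instruction took effect.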
3. Working Directory
WORKDIR /app
- All subsequent instructions (COPY, RUN, CMD) run relative to /app inside the container.
You can inspect it in a running container with:
docker exec -it <container_id_or_name> sh
cd /app
4. Install System Dependencies
RUN apt-get update && apt-get install -y \
build-essential \
poppler-utils \
tesseract-ocr \
ocrmypdf \
cron \
curl \
tzdata \
&& rm -rf /var/lib/apt/lists/*
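- build-essential: provides the compiler toolchain needed to build Python packages that ship C extensions without prebuilt wheels.
- poppler-utils: provides command-line tools such as pdftotext for extracting content from PDF files.
- tesseract-ocr and ocrmypdf: support OCR on image-based or scanned PDFs.
- cron: runs scheduled jobs.
- curl: useful for network-based operations.
- tzdata: provides the timezone database that the TZ setting depends on.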
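One way to confirm that the packages above are actually available in the built image is to look for their executables on PATH. The following is a minimal sketch under that assumption; the mapping and the function name missing_binaries are illustrative and not part of the original project.

```python
import shutil

# Debian package -> one executable it is expected to provide.
# pdftotext is only one of several tools shipped by poppler-utils.
EXPECTED_BINARIES = {
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
    "ocrmypdf": "ocrmypdf",
    "cron": "cron",  # installed under /usr/sbin, which is on root's PATH
    "curl": "curl",
}


def missing_binaries(expected=EXPECTED_BINARIES, which=shutil.which):
    """Return the package names whose executable is not found on PATH."""
    return sorted(pkg for pkg, exe in expected.items() if which(exe) is None)


if __name__ == "__main__":
    missing = missing_binaries()
    print("missing:", ", ".join(missing) if missing else "none")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default shutil.which is enough.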
5. Install Python Dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
- Installs Python packages listed in requirements.txt.
- --no-cache-dir disables pip's download cache, so cached package files are not stored in the image layer.
6. Copy Application Source
COPY . .
- Copies the build context (all application code, minus anything excluded by .dockerignore) into the container under /app.
7. Configure Crontab
COPY crontab.txt /etc/cron.d/uploader-cron
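RUN chmod 0644 /etc/cron.d/uploader-cron
- Adds a cron schedule (typically for uploader.py) from crontab.txt.
- cron reads files in /etc/cron.d automatically, which registers the job as a system cron entry. Entries in that directory carry a user field (root in the example), so the file is not also loaded with the crontab command: per-user crontabs have no user field and would treat "root" as the command to run.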
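Example of crontab.txt. The interpreter is given by absolute path because cron runs jobs with a minimal PATH (typically /usr/bin:/bin) that does not include /usr/local/bin, where the official Python image installs python. The /app/logs directory must already exist for the redirect to succeed:
0 2 * * * root /usr/local/bin/python /app/uploader.py >> /app/logs/cron.log 2>&1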
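To make the layout of a /etc/cron.d entry concrete, the sketch below splits a line in the same format as the crontab.txt example into five schedule fields, a user, and a command. The parser is a simplified illustration (the names CronDEntry and parse_cron_d_line are hypothetical) and handles only plain entries.

```python
from typing import NamedTuple


class CronDEntry(NamedTuple):
    minute: str
    hour: str
    day_of_month: str
    month: str
    day_of_week: str
    user: str
    command: str


def parse_cron_d_line(line: str) -> CronDEntry:
    """Split a /etc/cron.d-style line into its seven parts.

    Only plain entries are handled: no comments, environment
    assignments, or @daily-style shortcuts.
    """
    # Split on whitespace at most 6 times so the command keeps its spaces.
    parts = line.strip().split(None, 6)
    if len(parts) != 7:
        raise ValueError(
            f"expected 5 schedule fields, a user, and a command: {line!r}"
        )
    return CronDEntry(*parts)


# Sample line; the absolute interpreter path is an assumption about the image.
entry = parse_cron_d_line(
    "0 2 * * * root /usr/local/bin/python /app/uploader.py >> /app/logs/cron.log 2>&1"
)
print(entry.minute, entry.hour, entry.user)
print(entry.command)
```

Here minute 0 and hour 2 with wildcards in the remaining schedule fields mean the job runs at 02:00 every day, as the root user.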
8. Start Cron in Foreground
CMD ["cron", "-f"]
- Keeps the container alive by running cron in the foreground.