1. Dockerfile
Dockerfile Breakdown: Headless RAG Pipeline with OCR and Cron
This Dockerfile builds a headless document-processing pipeline on Python 3.10, with Tesseract for OCR and cron for scheduling. It is designed to run uploader.py nightly, processing documents from a ./docs folder.
1. Base Image
FROM python:3.10-slim
- Uses a minimal, official Python 3.10 image.
- Reduces image size and attack surface.
 
2. Environment Configuration
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    TZ=America/New_York
- PYTHONDONTWRITEBYTECODE=1: prevents Python from writing .pyc files.
- PYTHONUNBUFFERED=1: disables output buffering, so logs appear in real time.
- TZ=America/New_York: sets the container's timezone, so scheduled tasks run on US Eastern time.
 
3. Working Directory
WORKDIR /app
- All subsequent instructions (COPY, RUN, CMD) run relative to /app inside the container.
 
You can inspect it with:
docker exec -it <container_id_or_name> sh
cd /app
4. Install System Dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    poppler-utils \
    tesseract-ocr \
    ocrmypdf \
    cron \
    curl \
    tzdata \
 && rm -rf /var/lib/apt/lists/*
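- build-essential: compiler toolchain required to build Python packages that include C extensions.
- poppler-utils: command-line tools (such as pdftotext) for extracting content from PDF files.
- tesseract-ocr and ocrmypdf: support OCR on image-based or scanned PDFs.
- cron: runs scheduled jobs.
- curl: useful for network operations such as HTTP requests.
- tzdata: provides the timezone database that the TZ setting relies on.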
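One way to confirm that these packages landed in the image is to check that their executables are on PATH. The sketch below is a hypothetical helper (the name missing_tools is not part of the original project). The binary names in the usage comment are the ones these Debian packages are expected to install, for example pdftotext from poppler-utils and gcc from build-essential.

```shell
#!/bin/sh
# Hypothetical helper: print every named command that is NOT on PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v exits non-zero when the name cannot be resolved.
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Example, run inside the container (e.g. through docker exec):
#   missing_tools tesseract ocrmypdf pdftotext cron curl gcc
# No output means every tool was found.
```

Printing only the missing names keeps the check easy to use in a build step: an empty result means the dependency layer is complete.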
 
5. Install Python Dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
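- Installs the Python packages listed in requirements.txt.
- --no-cache-dir disables pip's download cache, which keeps the image layer smaller.
- Copying requirements.txt before the rest of the source lets Docker reuse this layer when only application code changes.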
 
6. Copy Application Source
COPY . .
- Copies all application code into the container under /app.
 
7. Configure Crontab
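COPY crontab.txt /etc/cron.d/uploader-cron
RUN chmod 0644 /etc/cron.d/uploader-cron
- Adds a cron schedule (typically for uploader.py) from crontab.txt.
- Files in /etc/cron.d are read by cron as system crontab fragments, so no separate crontab command is needed. Running crontab on this file would additionally install it as root's user crontab, where entries have no user field: the word root would be parsed as the start of the command and that duplicate job would fail.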
 
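Example of crontab.txt (the sixth field is the user the job runs as):
0 2 * * * root /usr/local/bin/python /app/uploader.py >> /app/logs/cron.log 2>&1
The absolute interpreter path is used because cron runs jobs with a minimal PATH that may not include /usr/local/bin, where the official Python images install python. The /app/logs directory must exist before the first run.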
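Because /etc/cron.d entries carry a user field that ordinary user crontabs do not, it can help to check which word cron will read as the user before building the image. The following is a minimal sketch; cron_d_user is a hypothetical helper name, and it only splits the line the way the cron.d format is laid out (five schedule fields, a user, then the command).

```shell
#!/bin/sh
# Hypothetical helper: print the field that cron would treat as the user
# in an /etc/cron.d entry. Fails if no command follows the user field.
cron_d_user() {
    printf '%s\n' "$1" | {
        # read does not glob-expand, so the literal * fields are safe.
        read -r minute hour dom month dow user command
        [ -n "$command" ] && printf '%s\n' "$user"
    }
}

# Example:
#   cron_d_user '0 2 * * * root /usr/local/bin/python /app/uploader.py'
# A line written in user-crontab style (without a user field) reports its
# first command word instead, which is a sign the entry is in the wrong format.
# The reported name can then be checked with: id -u "$name"
```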
8. Start Cron in Foreground
CMD ["cron", "-f"]
- The -f flag keeps cron in the foreground as the container's main process, which keeps the container alive.
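Cron starts jobs with a minimal environment, so variables set with ENV or docker run -e are not automatically visible to uploader.py. A common workaround is a small entrypoint script that saves the environment and creates the log directory before handing control to cron. The sketch below assumes such a script; the function name prepare_cron_runtime, the file name container.env, and the idea of replacing CMD with an entrypoint script are illustrative and not part of the original Dockerfile.

```shell
#!/bin/sh
# Hypothetical entrypoint helper; names and paths are illustrative.
prepare_cron_runtime() {
    app_dir=${1:-/app}

    # The cron job appends to logs/cron.log, so the directory must exist.
    mkdir -p "$app_dir/logs"

    # Save the current exported variables in a form this shell can read
    # back, so a cron job can load them before starting Python.
    export -p > "$app_dir/container.env"

    # The file may contain secrets; restrict it to the owner.
    chmod 600 "$app_dir/container.env"
}

# In an entrypoint script the function would be followed by:
#   prepare_cron_runtime /app
#   exec cron -f
```

With this in place, the cron entry could load the saved variables first, for example `. /app/container.env; /usr/local/bin/python /app/uploader.py`. Using exec makes cron replace the shell, so it still receives container stop signals directly.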