0. Retrieval-Augmented Generation

1. Simple System Flow

BLUF: The finalized scripts for the venv setup are:
1. main_streamlit.py
2. requirements.txt
3. start_tmux.sh

flowchart TD
    A[📥 Input: Raw PDFs in /pdfs] --> B{"Check: Is PDF text-based? (pdffonts)"}
    
    B -- Yes --> C[🧪 Try extracting text with PyPDF2]
    C --> D{Is the extracted text sufficient?}
    
    D -- Yes --> E[✅ Save extracted text to /extracted_text]
    D -- No --> F[🔁 Run OCRmyPDF on original PDF]
    
    B -- No --> F
    F --> G[📄 Output OCR'd PDF to /ocr_output_pdfs]
    G --> H[🧠 Extract text from OCR'd PDF]
    H --> E
    
    E --> I{More PDFs to process?}
    I -- Yes --> B
    I -- No --> J[🏁 Done: All PDFs processed and text extracted]
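
The two decision points in the flowchart (is the PDF text-based, and is the extracted text sufficient) can be sketched as small pure functions. This is a minimal sketch, not the logic in main_streamlit.py: it assumes that `pdffonts` prints a header line and a dashed separator followed by one row per font, and the function names and the 200-character threshold are hypothetical, since these notes do not define what "sufficient" means.

```python
def has_embedded_fonts(pdffonts_output: str) -> bool:
    """Return True if the pdffonts report lists at least one font row.

    Assumes the report is a header line, a dashed separator line,
    and then one row per font; no rows suggests a scanned PDF.
    """
    lines = pdffonts_output.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("----"):
            # Any non-blank line after the separator is a font row.
            return any(row.strip() for row in lines[i + 1:])
    return False


def text_is_sufficient(text: str, min_chars: int = 200) -> bool:
    """Count non-whitespace characters; the threshold is an assumption."""
    return len("".join(text.split())) >= min_chars


def choose_route(pdffonts_output: str, extracted_text: str,
                 min_chars: int = 200) -> str:
    """Mirror the flowchart: save the text directly, or fall back to OCR."""
    if has_embedded_fonts(pdffonts_output) and text_is_sufficient(
        extracted_text, min_chars
    ):
        return "save_text"
    return "run_ocr"
```

In a real run, the report could come from `subprocess.run(["pdffonts", path], capture_output=True, text=True).stdout`, and the "run_ocr" route would invoke OCRmyPDF on the original file before extracting text again.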

1. PDF Input

Purpose: Supply raw, unstructured data in a format accessible to humans.


2. Text Extraction

Purpose: Convert human-readable documents into machine-readable text.
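
As an illustration of this step, the sketch below joins per-page text into one string. It assumes only that each page object exposes an `extract_text()` method that may return `None` or an empty string, which is how PyPDF2 page objects behave; the helper name `join_page_text` is hypothetical.

```python
def join_page_text(pages) -> str:
    """Concatenate the text of every page, skipping pages with no text."""
    parts = []
    for page in pages:
        # extract_text() can return None or "" for image-only pages.
        text = (page.extract_text() or "").strip()
        if text:
            parts.append(text)
    # A blank line between pages keeps page boundaries visible.
    return "\n\n".join(parts)
```

With PyPDF2 installed, `join_page_text(PdfReader(path).pages)` would produce the text that is then saved to /extracted_text.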


3. Text Chunking

Purpose: Break large documents into logical, overlapping pieces for LLM input.
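
A minimal sketch of overlapping chunking follows. It splits on character counts for simplicity; splitting on sentences or tokens is a common alternative. The function name and the default sizes are assumptions, not values taken from the project.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

For example, `chunk_text("abcdefghij", chunk_size=4, overlap=1)` returns `["abcd", "defg", "ghij"]`: each chunk repeats the last character of the previous one, so a sentence cut at a boundary still appears whole in one chunk when the overlap is large enough.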


4. Embedding

Purpose: Convert text chunks into vectors representing semantic meaning.
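
To make the idea of "text in, vector out" concrete, here is a toy stand-in for an embedding model. It is a hashed bag-of-words, which captures word overlap but not real semantic meaning; a production pipeline would call a trained embedding model instead. The name `toy_embed` and the dimension of 64 are assumptions for illustration only.

```python
import hashlib
import math
import re


def toy_embed(text: str, dim: int = 64) -> list:
    """Map text to a unit-length vector by hashing each word into a bucket."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # md5 gives a stable hash across runs (the built-in hash() is salted).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0

    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return vec  # empty input stays the zero vector
    return [v / norm for v in vec]
```

Normalizing to unit length means the dot product of two vectors equals their cosine similarity, which simplifies the retrieval step.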


5. Vector Database

Purpose: Store and index high-dimensional vectors efficiently.
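
The sketch below shows the minimum a vector store has to do: keep each vector alongside its chunk text under a stable ID and reject vectors of the wrong dimension. It is an in-memory illustration with hypothetical names (`Record`, `InMemoryVectorStore`); a real vector database adds persistence and an approximate nearest-neighbor index, which this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Record:
    chunk_id: str
    vector: List[float]
    text: str


class InMemoryVectorStore:
    """Dictionary-backed store keyed by chunk ID (illustrative only)."""

    def __init__(self, dim: int):
        self.dim = dim
        self._records: Dict[str, Record] = {}

    def upsert(self, chunk_id: str, vector: List[float], text: str) -> None:
        """Insert a record, or overwrite the existing one with the same ID."""
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self._records[chunk_id] = Record(chunk_id, list(vector), text)

    def get(self, chunk_id: str) -> Record:
        return self._records[chunk_id]

    def all_records(self) -> List[Record]:
        return list(self._records.values())

    def __len__(self) -> int:
        return len(self._records)
```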


6. Similarity Search (Retriever)

Purpose: Match a query to the most relevant content chunks.
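
Matching a query to chunks usually means ranking stored vectors by similarity to the query vector. The sketch below uses cosine similarity with a brute-force scan, which is fine for small collections; the function names and the `(text, vector)` pair layout are assumptions.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k: int = 3):
    """Return the k highest-scoring (score, text) pairs.

    `indexed` is an iterable of (text, vector) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

For instance, with chunks `("a", [1, 0])`, `("b", [0, 1])`, and `("c", [1, 1])`, the query `[1, 0]` ranks "a" first and "c" second, because "c" points partly in the same direction while "b" is orthogonal.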


7. Prompt Construction

Purpose: Assemble retrieved content + user question into a prompt the LLM can reason over.
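
One way to assemble the prompt is to number the retrieved chunks, place them in a context section, and append the question. The wording of the instruction line, the function name `build_prompt`, and the `max_chunks` limit are assumptions; the template in the finalized script may differ.

```python
def build_prompt(question: str, chunks, max_chunks: int = 4) -> str:
    """Combine retrieved chunks and the user question into a single prompt string."""
    if not question.strip():
        raise ValueError("question must not be empty")

    # Number each chunk so the model (and the reader) can refer back to it.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}"
        for i, chunk in enumerate(chunks[:max_chunks], start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )
```

Capping the number of chunks keeps the prompt within the model's context window, and telling the model to admit when the context lacks the answer reduces unsupported answers.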


8. LLM Response

Purpose: Generate human-readable, natural-language answers.
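
These notes do not name the model or client library, so the sketch below stays model-agnostic: it accepts any callable that maps a prompt string to a response string and tidies the result. The names `generate_answer` and `fallback` are hypothetical.

```python
from typing import Callable


def generate_answer(
    prompt: str,
    llm: Callable[[str], str],
    fallback: str = "No answer was generated.",
) -> str:
    """Call the supplied model function and return a cleaned answer."""
    raw = llm(prompt)
    answer = (raw or "").strip()
    # Return a fixed message rather than an empty string.
    return answer if answer else fallback
```

Passing the model in as a function keeps the pipeline testable: a stub such as `lambda prompt: "stub answer"` can stand in for the real model during development.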


Next Process

1. Text Extraction