0. Retrieval-Augmented Generation
1. Simple System Flow
BLUF: The finalized scripts (run inside a venv) are here:
1. main_streamlit.py
2. requirements.txt
3. start_tmux.sh
flowchart TD
    A[📥 Input: Raw PDFs in /pdfs] --> B{"Check: Is PDF text-based? (pdffonts)"}
    
    B -- Yes --> C[🧪 Try extracting text with pdffonts + PyPDF2]
    C --> D{Is the extracted text sufficient?}
    
    D -- Yes --> E[✅ Save extracted text to /extracted_text]
    D -- No --> F[🔁 Run OCRmyPDF on original PDF]
    
    B -- No --> F
    F --> G[📄 Output OCR'd PDF to /ocr_output_pdfs]
    G --> H[🧠 Extract text from OCR'd PDF]
    H --> E
    
    E --> I{More PDFs to process?}
    I -- Yes --> B
    I -- No --> J[🏁 Done: All PDFs processed and text extracted]

1. PDF Input
Purpose: Supply raw, unstructured data in a format accessible to humans.
- Why? Most real-world knowledge is locked in PDFs: policies, reports, academic work, manuals.
- System View: Static file input; varies wildly in structure and size.
- Failure Mode: PDFs can be encrypted, malformed, or image-only (which complicates downstream processing).
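The image-only case matters most in practice. A minimal sketch of the text-vs-image check from the flowchart above, assuming poppler's pdffonts is installed and the pdfs/ folder from the flowchart:

```python
import subprocess
from pathlib import Path

def is_text_based(pdf_path: Path) -> bool:
    """Heuristic: if pdffonts lists at least one font, the PDF has a text layer."""
    result = subprocess.run(
        ["pdffonts", str(pdf_path)],
        capture_output=True, text=True, check=True,
    )
    # pdffonts prints a two-line header; anything after it is a font entry.
    return len(result.stdout.strip().splitlines()) > 2

for pdf in sorted(Path("pdfs").glob("*.pdf")):
    label = "text-based" if is_text_based(pdf) else "image-only (needs OCR)"
    print(f"{pdf.name}: {label}")
```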
 
2. Text Extraction
Purpose: Convert human-readable documents into machine-readable text.
- Why? LLMs can't “see” what's inside PDFs; they need tokenized strings.
- Tools (see the sketch below):
  - For text PDFs: pdfplumber, PyMuPDF, LangChain document loaders.
  - For image PDFs: pdf2image → pytesseract (OCR).
- Failure Mode: OCR might misread poor-quality images; wrong encoding (e.g., ligatures) might distort content.
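A minimal sketch of the extract-then-OCR fallback from the flowchart, assuming PyPDF2 and the ocrmypdf CLI are installed; the 200-character threshold and directory handling are illustrative:

```python
import subprocess
from pathlib import Path
from PyPDF2 import PdfReader

MIN_CHARS = 200  # rough threshold for "extracted text is sufficient"

def extract_text(pdf_path: Path) -> str:
    """Concatenate the text layer of every page with PyPDF2."""
    reader = PdfReader(str(pdf_path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def extract_or_ocr(pdf_path: Path, ocr_dir: Path) -> str:
    """Use the PDF's own text layer if it looks usable, otherwise OCR it first."""
    text = extract_text(pdf_path)
    if len(text.strip()) >= MIN_CHARS:
        return text
    ocr_dir.mkdir(parents=True, exist_ok=True)
    ocr_pdf = ocr_dir / pdf_path.name
    subprocess.run(["ocrmypdf", "--skip-text", str(pdf_path), str(ocr_pdf)], check=True)
    return extract_text(ocr_pdf)
```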
 
3. Text Chunking
Purpose: Break large documents into logical, overlapping pieces for LLM input.
- Why? LLMs have limited context windows (e.g., 4K–8K tokens). Full documents won’t fit.
- Chunking logic (see the sketch below):
  - Maintain semantic structure (avoid mid-sentence breaks).
  - Add overlap (e.g., 10–20%) to preserve context across chunks.
- Failure Mode: Bad chunking = broken context = reduced retrieval quality.
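A minimal sketch of overlapping chunking in plain Python; sizes are in characters and the defaults are illustrative (token-based splitters such as LangChain's follow the same idea):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into overlapping chunks, preferring to break at sentence ends."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Back up to the last sentence boundary in the second half of the window.
            boundary = text.rfind(". ", start + chunk_size // 2, end)
            if boundary != -1:
                end = boundary + 1
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = end - overlap  # the overlap preserves context across chunk borders
    return chunks
```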
 
4. Embedding
Purpose: Convert text chunks into vectors representing semantic meaning.
- Why? Enables searching by meaning, not exact keywords.
- Example:
  - “Employees must attend training.” ≈ “Annual training is mandatory.”
  - Both will generate similar embeddings, enabling powerful recall.
- Failure Mode: Wrong embedding model = bad semantic matches.
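A minimal sketch with sentence-transformers that reproduces the example pair above (the model name is just one common choice, not a requirement of this pipeline):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model works

sentences = ["Employees must attend training.", "Annual training is mandatory."]
vecs = model.encode(sentences, normalize_embeddings=True)

# With normalized vectors, the dot product is the cosine similarity.
print(float(np.dot(vecs[0], vecs[1])))  # relatively high for semantically similar sentences
```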
 
5. Vector Database
Purpose: Store and index high-dimensional vectors efficiently.
- Why? With 5,000+ documents, you need sub-second search based on vector proximity (cosine similarity, etc.).
- Tools: FAISS, Chroma, Qdrant, Weaviate (see the FAISS sketch below).
- Failure Mode: Large vector sets may grow out of RAM; poorly indexed DBs will be slow.
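A minimal FAISS sketch, assuming the `model` and `chunks` from the earlier sketches; a flat in-memory index is fine at this scale, while much larger corpora would need an IVF/HNSW index or an on-disk store:

```python
import faiss
import numpy as np

# One embedding per chunk, shape (num_chunks, dim), float32 as FAISS expects.
vecs = np.asarray(model.encode(chunks, normalize_embeddings=True), dtype="float32")

index = faiss.IndexFlatIP(vecs.shape[1])  # inner product == cosine on normalized vectors
index.add(vecs)
print(index.ntotal, "chunks indexed")
```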
 
6. Similarity Search (Retriever)
Purpose: Match a query to the most relevant content chunks.
- Why? LLMs are smart, but only within the context you give them.
- Process: Query → embedding → search → top-k relevant chunks.
- Failure Mode: Bad retrieval = hallucination. Garbage in, garbage out.
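The same flow as code, continuing from the FAISS sketch above (the `index`, `model`, and `chunks` names are carried over from those sketches):

```python
def retrieve(query: str, k: int = 5) -> list[str]:
    """Query -> embedding -> vector search -> top-k chunk texts."""
    q = np.asarray(model.encode([query], normalize_embeddings=True), dtype="float32")
    scores, ids = index.search(q, k)
    return [chunks[i] for i in ids[0]]

top_chunks = retrieve("What should employees do after a breach?")
```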
 
7. Prompt Construction
Purpose: Assemble retrieved content + user question into a prompt the LLM can reason over.
- Why? LLMs are stateless and context-blind. You must feed them relevant information at runtime.
- Example Prompt (see the sketch below):
  Context: [Chunk 1] [Chunk 2]
  Question: What should employees do after a breach?
  Answer:
- Failure Mode: Unstructured prompts = poor reasoning = inconsistent output.
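A minimal sketch that assembles the retrieved chunks and the question into the prompt shape shown above, reusing `top_chunks` from the retriever sketch:

```python
def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble retrieved chunks and the user question into a single grounded prompt."""
    context = "\n\n".join(f"[Chunk {i + 1}]\n{c}" for i, c in enumerate(context_chunks))
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt("What should employees do after a breach?", top_chunks)
```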
 
8. LLM Response
Purpose: Generate human-readable, natural-language answers.
- Why? This is the final output: the "consultant's answer" based on retrieved knowledge + language modeling.
- Tools: llama.cpp, vllm, text-generation-webui, Ollama, etc. (see the sketch below).
- Failure Mode: If the context is weak or too generic, the LLM will fabricate or be vague.
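A minimal sketch against Ollama's local HTTP API, assuming an Ollama server is running with a pulled model (the model name here is only an example) and reusing the `prompt` from the previous sketch:

```python
import requests

response = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    json={"model": "llama3", "prompt": prompt, "stream": False},
    timeout=120,
)
response.raise_for_status()
print(response.json()["response"])  # the generated answer text
```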
 
Next Process