0. Retrieval-Augmented Generation
1. Simple System Flow
BLUF: The finalized scripts for the venv are here:
1. main_streamlit.py
2. requirements.txt
3. start_tmux.sh
```mermaid
flowchart TD
    A[📥 Input: Raw PDFs in /pdfs] --> B{"Check: Is PDF text-based? (pdffonts)"}
    B -- Yes --> C[🧪 Try extracting text with pdffonts + PyPDF2]
    C --> D{Is the extracted text sufficient?}
    D -- Yes --> E[✅ Save extracted text to /extracted_text]
    D -- No --> F[🔁 Run OCRmyPDF on original PDF]
    B -- No --> F
    F --> G[📄 Output OCR'd PDF to /ocr_output_pdfs]
    G --> H[🧠 Extract text from OCR'd PDF]
    H --> E
    E --> I{More PDFs to process?}
    I -- Yes --> B
    I -- No --> J[🏁 Done: All PDFs processed and text extracted]
```
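The same flow as a minimal Python sketch (not the finalized main_streamlit.py): it assumes pdffonts (poppler-utils) and ocrmypdf are on the PATH, and the directory names and the MIN_CHARS "sufficient text" threshold are illustrative choices.

```python
import subprocess
from pathlib import Path

from PyPDF2 import PdfReader

PDF_DIR = Path("pdfs")
TEXT_DIR = Path("extracted_text")
OCR_DIR = Path("ocr_output_pdfs")
MIN_CHARS = 200  # heuristic for "is the extracted text sufficient?"

def has_text_layer(pdf: Path) -> bool:
    # pdffonts prints a two-line header; image-only PDFs list no fonts below it
    out = subprocess.run(["pdffonts", str(pdf)], capture_output=True, text=True)
    return len(out.stdout.strip().splitlines()) > 2

def extract_text(pdf: Path) -> str:
    reader = PdfReader(str(pdf))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

TEXT_DIR.mkdir(exist_ok=True)
OCR_DIR.mkdir(exist_ok=True)

for pdf in PDF_DIR.glob("*.pdf"):
    text = extract_text(pdf) if has_text_layer(pdf) else ""
    if len(text) < MIN_CHARS:
        # fall back to OCR, then re-extract from the OCR'd copy
        ocr_pdf = OCR_DIR / pdf.name
        subprocess.run(["ocrmypdf", "--skip-text", str(pdf), str(ocr_pdf)], check=True)
        text = extract_text(ocr_pdf)
    (TEXT_DIR / f"{pdf.stem}.txt").write_text(text)
```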
1. PDF Input
- Purpose: Supply raw, unstructured data in a format accessible to humans.
- Why? Most real-world knowledge is locked in PDFs — policies, reports, academic work, manuals.
- System View: Static file input; varies wildly in structure and size.
- Failure Mode: PDFs can be encrypted, malformed, or image-only (which complicates downstream processing); see the triage sketch below.
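A short triage sketch for those failure modes using PyPDF2; the function name, labels, and folder are illustrative and not part of the finalized scripts.

```python
from pathlib import Path

from PyPDF2 import PdfReader
from PyPDF2.errors import PdfReadError

def triage(pdf: Path) -> str:
    """Flag encrypted, malformed, or image-only PDFs before they enter the pipeline."""
    try:
        reader = PdfReader(str(pdf))
    except PdfReadError:
        return "malformed"
    if reader.is_encrypted:
        return "encrypted"
    # Sample the first few pages: no extractable text usually means a scanned PDF
    sample = "".join(page.extract_text() or "" for page in list(reader.pages)[:3])
    return "text" if sample.strip() else "image-only"

for pdf in Path("pdfs").glob("*.pdf"):
    print(pdf.name, triage(pdf))
```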
2. Text Extraction
- Purpose: Convert human-readable documents into machine-readable text.
- Why? LLMs can't “see” what's inside PDFs — they need tokenized strings.
- Tools (see the extraction sketch below):
  - For text PDFs: pdfplumber, PyMuPDF, langchain loaders.
  - For image PDFs: pdf2image → pytesseract (OCR).
- Failure Mode: OCR might misread poor-quality images; wrong encoding (e.g., ligatures) might distort content.
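A minimal sketch of the two extraction paths named above; it assumes poppler (for pdf2image) and tesseract are installed for the image branch.

```python
import pdfplumber
import pytesseract
from pdf2image import convert_from_path

def extract_text_pdf(path: str) -> str:
    # Text-based PDFs: read the embedded text layer directly
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)

def extract_image_pdf(path: str) -> str:
    # Image-only PDFs: rasterize each page, then OCR it
    pages = convert_from_path(path, dpi=300)
    return "\n".join(pytesseract.image_to_string(img) for img in pages)
```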
3. Text Chunking
- Purpose: Break large documents into logical, overlapping pieces for LLM input.
- Why? LLMs have limited context windows (e.g., 4K–8K tokens). Full documents won’t fit.
- Chunking logic (see the sketch below):
  - Maintain semantic structure (avoid mid-sentence breaks).
  - Add overlap (e.g., 10–20%) to preserve context across chunks.
- Failure Mode: Bad chunking = broken context = reduced retrieval quality.
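A hand-rolled chunker as a sketch of that logic; the 1,000-character size and 150-character (~15%) overlap are illustrative, and LangChain's RecursiveCharacterTextSplitter covers the same idea off the shelf.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into overlapping chunks, preferring sentence boundaries."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # back up to the last sentence end so we don't break mid-sentence
            cut = text.rfind(". ", start, end)
            if cut > start:
                end = cut + 1
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # overlap preserves context across chunks
    return chunks
```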
4. Embedding
- Purpose: Convert text chunks into vectors representing semantic meaning.
- Why? Enables searching by meaning, not exact keywords.
- Example (see the sketch below):
  - “Employees must attend training.” ≈ “Annual training is mandatory.”
  - Both will generate similar embeddings, enabling powerful recall.
- Failure Mode: Wrong embedding model = bad semantic matches.
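A sketch of that example with sentence-transformers; the model name is one common default, not necessarily what the finalized script uses.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
vecs = model.encode(
    ["Employees must attend training.", "Annual training is mandatory."],
    normalize_embeddings=True,
)

# Near-identical meaning, different wording: cosine similarity comes out high
print(util.cos_sim(vecs[0], vecs[1]))
```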
5. Vector Database
- Purpose: Store and index high-dimensional vectors efficiently.
- Why? With 5,000+ documents, you need sub-second search based on vector proximity (cosine similarity, etc.).
- Tools: FAISS, Chroma, Qdrant, Weaviate (see the FAISS sketch below).
- Failure Mode: Large vector sets may grow out of RAM; poorly indexed DBs will be slow.
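A FAISS sketch under the assumptions above: exact (flat) inner-product search over normalized vectors is equivalent to cosine similarity and is plenty fast at the 5,000-document scale; the 384-dimension width matches all-MiniLM-L6-v2.

```python
import faiss
import numpy as np

dim = 384  # width of all-MiniLM-L6-v2 embeddings (assumption)
embeddings = np.random.rand(5000, dim).astype("float32")  # stand-in for real chunk vectors

faiss.normalize_L2(embeddings)   # normalized vectors: inner product == cosine similarity
index = faiss.IndexFlatIP(dim)   # exact search; swap for an IVF/HNSW index at larger scale
index.add(embeddings)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 nearest chunks
```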
6. Similarity Search (Retriever)
- Purpose: Match a query to the most relevant content chunks.
- Why? LLMs are smart — but only within the context you give them.
- Process: Query → embedding → search → top-k relevant chunks (see the sketch below).
- Failure Mode: Bad retrieval = hallucination. Garbage in, garbage out.
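A retriever sketch that ties the two previous steps together; `model`, `index`, and `chunks` are assumed to be the embedding model, FAISS index, and chunk list from the sketches above.

```python
import faiss

def retrieve(query: str, model, index, chunks: list[str], k: int = 5) -> list[str]:
    # Embed the query with the same model used for the chunks
    q = model.encode([query], normalize_embeddings=True).astype("float32")
    faiss.normalize_L2(q)                      # no-op if already normalized
    scores, ids = index.search(q, k)
    return [chunks[i] for i in ids[0] if i != -1]  # -1 means fewer than k hits
```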
7. Prompt Construction
- Purpose: Assemble retrieved content + user question into a prompt the LLM can reason over.
- Why? LLMs are stateless and context-blind. You must feed them relevant information at runtime.
- Example Prompt:
  Context: [Chunk 1] [Chunk 2]
  Question: What should employees do after a breach?
  Answer:
- Failure Mode: Unstructured prompts = poor reasoning = inconsistent output.
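A sketch of the assembly step; the template wording is illustrative, not the finalized script's prompt.

```python
def build_prompt(chunks: list[str], question: str) -> str:
    # Clearly separated Context / Question / Answer sections keep the model grounded
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Combined with the retriever above, `build_prompt(retrieve(q, ...), q)` yields the final prompt string.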
8. LLM Response
- Purpose: Generate human-readable, natural-language answers.
- Why? This is the final output — the "consultant's answer" based on retrieved knowledge + language modeling.
- Tools: llama.cpp, vllm, text-generation-webui, Ollama, etc. (see the Ollama sketch below).
- Failure Mode: If context is weak or too generic, LLM will fabricate or be vague.
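A final generation sketch against a local Ollama server (one of the tools listed); the endpoint is Ollama's default and the model name is an assumption.

```python
import requests

def generate(prompt: str, model_name: str = "llama3") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",  # Ollama's default local endpoint
        json={"model": model_name, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# question = "What should employees do after a breach?"
# print(generate(build_prompt(retrieve(question, model, index, chunks), question)))
```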
Next Process