4. doc_processor.py

doc_processor.py was formerly named pdf_utils.py.

doc_processor.py Breakdown

Handles PDF preprocessing for a RAG pipeline using Docling, ChromaDB, and optional OCR fallback.


Active Functionality

build_vectordb_from_pdf(pdf_path: Path) -> Chroma

loader = DoclingLoader(file_path=[str(pdf_path)], export_type=ExportType.DOC_CHUNKS)
docs = loader.load()
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectordb = Chroma.from_documents(docs, embeddings)
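
Component                  Description
DoclingLoader              Parses the PDF and returns document chunks (ExportType.DOC_CHUNKS)
OllamaEmbeddings           Embeds each chunk with the nomic-embed-text model served locally by Ollama
Chroma.from_documents()    Embeds the chunks and stores them in a Chroma vector store; as called here the store is in-memory, because no persist_directory is passed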
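
A minimal sketch of how the excerpt above might be wrapped into a complete function. The source does not show its imports, so the package names (langchain-docling, langchain-ollama, langchain-chroma) are assumptions. The persist_dir parameter and the file-existence check are hypothetical additions for illustration; passing a directory is what would make the store survive the process.

```python
from pathlib import Path
from typing import Optional


def build_vectordb_from_pdf(pdf_path: Path, persist_dir: Optional[Path] = None):
    """Parse a PDF with Docling, embed the chunks, and return a Chroma store."""
    # Hypothetical guard (not in the original): fail early on a bad path.
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No such PDF: {pdf_path}")

    # Assumed import locations. They are imported lazily so this module
    # still loads when the heavy dependencies are not installed.
    from langchain_chroma import Chroma
    from langchain_docling import DoclingLoader
    from langchain_docling.loader import ExportType
    from langchain_ollama import OllamaEmbeddings

    loader = DoclingLoader(file_path=[str(pdf_path)], export_type=ExportType.DOC_CHUNKS)
    docs = loader.load()
    embeddings = OllamaEmbeddings(model="nomic-embed-text")

    # persist_dir is a hypothetical parameter: without it the store is in-memory.
    kwargs = {"persist_directory": str(persist_dir)} if persist_dir else {}
    return Chroma.from_documents(docs, embeddings, **kwargs)
```

Running this for real also requires a local Ollama server with the nomic-embed-text model pulled.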

ocr_pdf(input_path: Path, output_path: Path) -> bool

subprocess.run(["ocrmypdf", str(input_path), str(output_path)], check=True)

Purpose      Makes image-based PDFs searchable using ocrmypdf
Input        Scanned or otherwise non-text PDF
Output       OCR’d PDF with an embedded text layer
Tool used    ocrmypdf, invoked as a subprocess
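
The signature returns a bool, while check=True makes subprocess.run raise on failure. One way the two could fit together is sketched below; the exception handling is an assumption about the function body, which the source does not show.

```python
import subprocess
from pathlib import Path


def ocr_pdf(input_path: Path, output_path: Path) -> bool:
    """Run ocrmypdf on input_path; return True on success, False otherwise."""
    try:
        subprocess.run(
            ["ocrmypdf", str(input_path), str(output_path)],
            check=True,           # a non-zero exit raises CalledProcessError
            capture_output=True,  # keep ocrmypdf's output off the console
        )
    except FileNotFoundError:
        # The ocrmypdf executable is not on PATH.
        return False
    except subprocess.CalledProcessError:
        # ocrmypdf ran but reported a failure.
        return False
    return True
```

Returning False instead of raising lets the caller decide whether to continue with the original, un-OCR'd file.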

Deprecated / Commented-Out Sections

extract_text_from_pdf()


split_text_into_chunks()


Summary Table

Function                     Role                                Status
build_vectordb_from_pdf()    Chunk + embed + store in ChromaDB   ✅ Active
ocr_pdf()                    OCR fallback for scanned PDFs       ✅ Active
extract_text_from_pdf()      Text extraction + fallback OCR      💤 Legacy
split_text_into_chunks()     Text chunking via LangChain         💤 Legacy
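
The source does not show how the two active functions are combined, so the following is only a sketch of one possible OCR-fallback flow. The helper name choose_pdf_for_indexing, the has_text check, and the "_ocr" filename suffix are all hypothetical; the ocr argument is expected to behave like ocr_pdf().

```python
from pathlib import Path
from typing import Callable


def choose_pdf_for_indexing(
    pdf_path: Path,
    has_text: Callable[[Path], bool],
    ocr: Callable[[Path, Path], bool],
) -> Path:
    """Return the path to index: the original, or an OCR'd copy if needed."""
    if has_text(pdf_path):
        return pdf_path
    # Hypothetical naming scheme for the OCR'd copy, next to the original.
    ocr_path = pdf_path.with_name(pdf_path.stem + "_ocr.pdf")
    # If OCR fails, fall back to the original so the caller can still try it.
    return ocr_path if ocr(pdf_path, ocr_path) else pdf_path
```

The returned path would then be handed to build_vectordb_from_pdf(). Passing the text check and the OCR step in as callables keeps the decision logic testable without ocrmypdf installed.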