4. doc_processor.py
doc_processor.py (formerly pdf_utils.py)
pdf_utils.py Breakdown
Handles PDF preprocessing for a RAG pipeline using Docling, ChromaDB, and optional OCR fallback.
Active Functionality
build_vectordb_from_pdf(pdf_path: Path) -> Chroma
loader = DoclingLoader(file_path=[str(pdf_path)], export_type=ExportType.DOC_CHUNKS)
docs = loader.load()
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectordb = Chroma.from_documents(docs, embeddings)
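| Component | Description |
|---|---|
| DoclingLoader | Parses the PDF and returns semantic document chunks |
| OllamaEmbeddings | Embeds the chunks with the nomic-embed-text model served locally by Ollama |
| Chroma.from_documents() | Stores the embedded chunks in a Chroma vector store (in memory as called above; persisted to disk only if persist_directory is passed) |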
ocr_pdf(input_path: Path, output_path: Path) -> bool
subprocess.run(["ocrmypdf", str(input_path), str(output_path)], check=True)
| Purpose | Makes image-based PDFs searchable using ocrmypdf | 
|---|---|
| Input | Scanned or non-text PDFs | 
| Output | OCR’d PDF with embedded text | 
| Tool Used | ocrmypdf via system call | 
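The signature above promises a bool, while `check=True` makes `subprocess.run` raise on failure. A minimal sketch of one way to reconcile the two follows. It assumes that a missing binary and a non-zero exit should both map to `False`; the `tool` parameter is hypothetical and is only there so the executable can be swapped out.

```python
import shutil
import subprocess
from pathlib import Path


def ocr_pdf(input_path: Path, output_path: Path, tool: str = "ocrmypdf") -> bool:
    """Run OCR on input_path and write a searchable PDF to output_path.

    Returns True on success, False if the tool is missing or exits non-zero.
    `tool` is a hypothetical parameter, not part of the original signature.
    """
    # Bail out early when the executable is not on PATH.
    if shutil.which(tool) is None:
        return False
    try:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run([tool, str(input_path), str(output_path)], check=True)
    except (subprocess.CalledProcessError, OSError):
        return False
    return True
```

Returning `False` instead of raising lets a caller decide whether to skip the file or continue with the original, non-OCR'd PDF.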
Deprecated / Commented-Out Sections
extract_text_from_pdf()
- Uses PyPDF2 to extract text
- Falls back to OCR if no text is found
- Replaced by DoclingLoader
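The fallback logic described for this legacy function can be sketched roughly as below. This is not the original implementation: `extract` and `ocr` are hypothetical callables standing in for the PyPDF2 extraction step and `ocr_pdf()`, and the `_ocr.pdf` suffix for the intermediate file is an assumption.

```python
from pathlib import Path
from typing import Callable


def extract_text_with_fallback(
    pdf_path: Path,
    extract: Callable[[Path], str],
    ocr: Callable[[Path, Path], bool],
) -> str:
    """Return the PDF's text, running OCR only if no text layer is found."""
    text = extract(pdf_path)
    if text.strip():
        return text  # a text layer exists, so OCR is not needed

    # No usable text: write an OCR'd copy next to the original and retry.
    ocr_path = pdf_path.with_name(pdf_path.stem + "_ocr.pdf")
    if not ocr(pdf_path, ocr_path):
        return ""  # OCR failed, so no text could be recovered
    return extract(ocr_path)
```

Passing the two steps in as callables keeps the decision logic independent of PyPDF2 and ocrmypdf, so it can be exercised with simple stubs.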
split_text_into_chunks()
- Uses RecursiveCharacterTextSplitter to split raw text
- Replaced by Docling's built-in chunking
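To illustrate what character-based chunking with overlap does, here is a deliberately simplified stand-in. It is not the RecursiveCharacterTextSplitter algorithm, which recursively splits on a list of separators. It is a plain fixed-window splitter, and the `chunk_size` and `overlap` defaults are hypothetical.

```python
def split_text_into_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(split_text_into_chunks("abcdefghij", chunk_size=4, overlap=1))
# → ['abcd', 'defg', 'ghij']
```

A splitter like this cuts through words and sentences wherever the window ends, which is one reason structure-aware chunking such as Docling's is the more natural fit for parsed PDFs.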
 
Summary Table
| Function | Role | Status |
|---|---|---|
| build_vectordb_from_pdf() | Chunk + embed + store in ChromaDB | ✅ Active |
| ocr_pdf() | OCR fallback for scanned PDFs | ✅ Active |
| extract_text_from_pdf() | Text extraction + fallback OCR | 💤 Legacy |
| split_text_into_chunks() | Text chunking via LangChain | 💤 Legacy |