4. doc_processor.py
doc_processor.py (formerly pdf_utils.py)
Breakdown
Handles PDF preprocessing for a RAG pipeline using Docling for parsing and chunking, ChromaDB for vector storage, and an optional OCR fallback via ocrmypdf.
Active Functionality
build_vectordb_from_pdf(pdf_path: Path) -> Chroma
loader = DoclingLoader(file_path=[str(pdf_path)], export_type=ExportType.DOC_CHUNKS)
docs = loader.load()
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectordb = Chroma.from_documents(docs, embeddings)

| Component | Description |
|---|---|
| DoclingLoader | Parses the PDF and returns semantic document chunks |
| OllamaEmbeddings | Embeds chunks with a local embedding model (nomic-embed-text) via Ollama |
| Chroma.from_documents() | Stores embedded chunks in a persistent vector database |
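
Once built, the returned Chroma store can be queried directly for the chunks most relevant to a question. A minimal usage sketch, assuming the function is importable from doc_processor and that Ollama is serving the nomic-embed-text model; the file name, query, and k value are illustrative:

```python
from pathlib import Path

from doc_processor import build_vectordb_from_pdf  # illustrative import path

# Build the vector store from a single PDF, then run a similarity search against it.
vectordb = build_vectordb_from_pdf(Path("manual.pdf"))
hits = vectordb.similarity_search("How is the device reset?", k=4)
for doc in hits:
    print(doc.metadata, doc.page_content[:200])
```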
ocr_pdf(input_path: Path, output_path: Path) -> bool
subprocess.run(["ocrmypdf", str(input_path), str(output_path)], check=True)

| Purpose | Makes image-based PDFs searchable using ocrmypdf |
|---|---|
| Input | Scanned or non-text PDFs |
| Output | OCR’d PDF with embedded text |
| Tool Used | ocrmypdf via system call |
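
The documented bool return value implies the subprocess call is wrapped in error handling. A minimal sketch of one way the full function could look, assuming ocrmypdf is installed and on the PATH; the actual implementation may differ:

```python
import subprocess
from pathlib import Path


def ocr_pdf(input_path: Path, output_path: Path) -> bool:
    """Run ocrmypdf on a scanned PDF; return True on success, False on failure."""
    try:
        subprocess.run(["ocrmypdf", str(input_path), str(output_path)], check=True)
        return True
    except (subprocess.CalledProcessError, FileNotFoundError):
        # ocrmypdf exited with an error or is not installed.
        return False
```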
Deprecated / Commented-Out Sections
extract_text_from_pdf()
- Uses PyPDF2 to extract text
- Falls back to OCR if no text is found
- Replaced by DoclingLoader
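
For reference, a minimal sketch of what this legacy path typically looks like, assuming PyPDF2 ≥ 3.0 and the ocr_pdf() helper above; the temporary OCR output path is illustrative and may not match the original commented-out code:

```python
from pathlib import Path

from PyPDF2 import PdfReader


def extract_text_from_pdf(pdf_path: Path) -> str:
    """Legacy extraction: read embedded text, falling back to OCR for scanned PDFs."""
    reader = PdfReader(str(pdf_path))
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    if text.strip():
        return text
    # No embedded text found: OCR the file, then re-read it (illustrative fallback).
    ocr_path = pdf_path.with_suffix(".ocr.pdf")
    if ocr_pdf(pdf_path, ocr_path):
        ocr_reader = PdfReader(str(ocr_path))
        text = "\n".join(page.extract_text() or "" for page in ocr_reader.pages)
    return text
```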
split_text_into_chunks()
- Uses RecursiveCharacterTextSplitter to split raw text
- Replaced by Docling's built-in chunking
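
For comparison, the legacy splitter roughly corresponded to LangChain's recursive character splitter; a sketch assuming the langchain_text_splitters package, with illustrative chunk sizes rather than the values from the original code:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter


def split_text_into_chunks(text: str) -> list[str]:
    """Legacy chunking: split raw text into overlapping character-based chunks."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    return splitter.split_text(text)
```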
Summary Table

| Function | Role | Status |
|---|---|---|
| build_vectordb_from_pdf() | Chunk + embed + store in ChromaDB | ✅ Active |
| ocr_pdf() | OCR fallback for scanned PDFs | ✅ Active |
| extract_text_from_pdf() | Text extraction + fallback OCR | 💤 Legacy |
| split_text_into_chunks() | Text chunking via LangChain | 💤 Legacy |