4. doc_processor.py
doc_processor.py (formerly pdf_utils.py)
Breakdown
Handles PDF preprocessing for a RAG pipeline using Docling for parsing and chunking, ChromaDB for vector storage, and an optional OCR fallback via ocrmypdf.
Active Functionality
build_vectordb_from_pdf(pdf_path: Path) -> Chroma
loader = DoclingLoader(file_path=[str(pdf_path)], export_type=ExportType.DOC_CHUNKS)
docs = loader.load()
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectordb = Chroma.from_documents(docs, embeddings)

| Component | Description |
|---|---|
| DoclingLoader | Parses the PDF and returns semantic document chunks |
| OllamaEmbeddings | Embeds chunks with a local embedding model (nomic-embed-text) via Ollama |
| Chroma.from_documents() | Stores embedded chunks in a persistent vector database |
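
Once built, the returned Chroma store can be queried directly for the chunks most relevant to a question. A minimal usage sketch, assuming the function is importable from doc_processor and that Ollama is serving the nomic-embed-text model; the file name, query, and k value are illustrative:

```python
from pathlib import Path

from doc_processor import build_vectordb_from_pdf  # illustrative import path

# Build the vector store from a single PDF, then run a similarity search against it.
vectordb = build_vectordb_from_pdf(Path("manual.pdf"))
hits = vectordb.similarity_search("How is the device reset?", k=4)
for doc in hits:
    print(doc.metadata, doc.page_content[:200])
```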
ocr_pdf(input_path: Path, output_path: Path) -> bool
subprocess.run(["ocrmypdf", str(input_path), str(output_path)], check=True)

| Purpose | Makes image-based PDFs searchable using ocrmypdf |
|---|---|
| Input | Scanned or non-text PDFs |
| Output | OCR’d PDF with embedded text |
| Tool Used | ocrmypdf via system call |
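
The documented bool return value implies the subprocess call is wrapped in error handling. A minimal sketch of one way the full function could look, assuming ocrmypdf is installed and on the PATH; the actual implementation may differ:

```python
import subprocess
from pathlib import Path


def ocr_pdf(input_path: Path, output_path: Path) -> bool:
    """Run ocrmypdf on a scanned PDF; return True on success, False on failure."""
    try:
        subprocess.run(["ocrmypdf", str(input_path), str(output_path)], check=True)
        return True
    except (subprocess.CalledProcessError, FileNotFoundError):
        # ocrmypdf exited with an error or is not installed.
        return False
```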
Deprecated / Commented-Out Sections
extract_text_from_pdf()
- Uses PyPDF2 to extract text
- Falls back to OCR if no text is found
- Replaced by DoclingLoader
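
For reference, a minimal sketch of what this legacy path typically looks like, assuming PyPDF2 ≥ 3.0 and the ocr_pdf() helper above; the temporary OCR output path is illustrative and may not match the original commented-out code:

```python
from pathlib import Path

from PyPDF2 import PdfReader


def extract_text_from_pdf(pdf_path: Path) -> str:
    """Legacy extraction: read embedded text, falling back to OCR for scanned PDFs."""
    reader = PdfReader(str(pdf_path))
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    if text.strip():
        return text
    # No embedded text found: OCR the file, then re-read it (illustrative fallback).
    ocr_path = pdf_path.with_suffix(".ocr.pdf")
    if ocr_pdf(pdf_path, ocr_path):
        ocr_reader = PdfReader(str(ocr_path))
        text = "\n".join(page.extract_text() or "" for page in ocr_reader.pages)
    return text
```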
split_text_into_chunks()
- Uses RecursiveCharacterTextSplitter to split raw text
- Replaced by Docling's built-in chunking
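
For comparison, the legacy splitter roughly corresponded to LangChain's recursive character splitter; a sketch assuming the langchain_text_splitters package, with illustrative chunk sizes rather than the values from the original code:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter


def split_text_into_chunks(text: str) -> list[str]:
    """Legacy chunking: split raw text into overlapping character-based chunks."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    return splitter.split_text(text)
```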
Summary Table

| Function | Role | Status |
|---|---|---|
| build_vectordb_from_pdf() | Chunk + embed + store in ChromaDB | ✅ Active |
| ocr_pdf() | OCR fallback for scanned PDFs | ✅ Active |
| extract_text_from_pdf() | Text extraction + fallback OCR | 💤 Legacy |
| split_text_into_chunks() | Text chunking via LangChain | 💤 Legacy |