5a. vectorizer.py

vectorizer.py → formerly embed_utils.py

Breakdown

Utility module for embedding text chunks and performing semantic search using sentence-transformers. Useful for lightweight embedding workflows and non-LangChain RAG pipelines.

1. get_model()

def get_model(model_name: str = "all-MiniLM-L6-v2") -> SentenceTransformer:

Component	Description
Purpose	Loads and caches the model globally to avoid reloading
Default Model	all-MiniLM-L6-v2 (compact, fast, general-purpose)
Optimization	Uses global _model object to implement a singleton pattern

Ideal for reuse across multiple embedding calls, since the model is loaded once rather than on every call.
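The caching behavior described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the `_model_name` variable, the `loader` parameter, and the `_load_sentence_transformer` helper are hypothetical additions so that the caching logic can be exercised without downloading a model.

```python
from typing import Any, Callable, Optional

# Module-level cache (the "singleton"); _model_name is a hypothetical addition
# so that a request for a different model triggers a reload.
_model: Optional[Any] = None
_model_name: Optional[str] = None


def _load_sentence_transformer(model_name: str) -> Any:
    # Hypothetical helper: imported lazily so importing this module stays cheap.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name)


def get_model(
    model_name: str = "all-MiniLM-L6-v2",
    loader: Callable[[str], Any] = _load_sentence_transformer,
) -> Any:
    """Return the cached model, loading it only on first use or on a name change."""
    global _model, _model_name
    if _model is None or _model_name != model_name:
        _model = loader(model_name)
        _model_name = model_name
    return _model
```

Repeated calls with the same `model_name` return the same object, so the cost of loading is paid once per process.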

2. embed_chunks()

def embed_chunks(chunks: List[str], model_name: str = "all-MiniLM-L6-v2") -> np.ndarray:

Component	Description
Purpose	Converts a list of text chunks into vector embeddings
Input	List[str] — text chunks
Output	np.ndarray — dense vector embeddings
Model Used	Cached SentenceTransformer instance

Useful for standalone scripts, batch embedding, or interactive querying.
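As a rough sketch of the embedding step: the variant below takes an already-loaded model object instead of `model_name`, which is an assumption made here to keep the example self-contained. It relies only on the model exposing `encode(...)`, as `SentenceTransformer` does.

```python
from typing import Any, List

import numpy as np


def embed_chunks(chunks: List[str], model: Any) -> np.ndarray:
    """Encode text chunks into a 2-D array with one row per chunk.

    `model` is assumed to be a SentenceTransformer-like object; in the real
    module it would presumably come from get_model(model_name).
    """
    embeddings = model.encode(chunks, convert_to_numpy=True)
    # np.asarray is a no-op for ndarrays and guards against list-like returns.
    return np.asarray(embeddings)
```

With a real model this might be called as `embed_chunks(chunks, get_model())`; the number of columns depends on the model chosen.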

3. search_similar_chunks()

def search_similar_chunks(
    query: str,
    chunk_embeddings: np.ndarray,
    chunks: List[str],
    top_k: int = 3,
    return_scores: bool = False,
    model_name: str = "all-MiniLM-L6-v2"
) -> Union[List[str], List[Tuple[str, float]]]:

Component	Description
Purpose	Finds the top-k most semantically similar chunks to a query
Method	Cosine similarity via a NumPy dot product (the two agree when the vectors are L2-normalized)
Return Value	List of matching chunks or (chunk, score) pairs
Fallback	Returns empty list if similarity fails

A lightweight vector-search alternative for pipelines that do not use Chroma.
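The ranking step can be sketched as below. This is an illustration under assumptions, not the module's implementation: the function name `top_k_by_cosine` is hypothetical, it takes a precomputed query vector rather than a query string, and it normalizes explicitly so that the dot product really is a cosine similarity. The empty-list fallback on mismatched inputs mirrors the documented behavior only loosely.

```python
from typing import List, Tuple, Union

import numpy as np


def top_k_by_cosine(
    query_vec: np.ndarray,
    chunk_embeddings: np.ndarray,
    chunks: List[str],
    top_k: int = 3,
    return_scores: bool = False,
) -> Union[List[str], List[Tuple[str, float]]]:
    """Rank chunks by cosine similarity to an already-embedded query."""
    # Fallback: nothing to search, or embeddings and chunks are out of sync.
    if len(chunks) == 0 or chunk_embeddings.shape[0] != len(chunks):
        return []

    eps = 1e-12  # avoids division by zero for all-zero vectors
    q = query_vec / (np.linalg.norm(query_vec) + eps)
    m = chunk_embeddings / (
        np.linalg.norm(chunk_embeddings, axis=1, keepdims=True) + eps
    )

    scores = m @ q                       # one cosine score per chunk
    order = np.argsort(-scores)[:top_k]  # indices of the highest scores first

    if return_scores:
        return [(chunks[i], float(scores[i])) for i in order]
    return [chunks[i] for i in order]
```

A full sort is fine for a few thousand chunks; for much larger collections a partial sort or a dedicated vector store would likely be a better fit.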

4. save_embeddings() and load_embeddings()

def save_embeddings(path: str, chunks: List[str], embeddings: np.ndarray):
def load_embeddings(path: str) -> Tuple[List[str], np.ndarray]:

Item	Description
save_embeddings()	Saves chunk texts and embeddings as .npz
load_embeddings()	Reloads previously saved chunks and embeddings
Format	Uses numpy.savez and numpy.load
Use Case	For avoiding recomputation in pipelines or cold starts
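A minimal sketch of the save/load round trip with `numpy.savez` and `numpy.load` follows. The array keys `"chunks"` and `"embeddings"` are assumptions chosen for this example; the actual module may use different names.

```python
from typing import List, Tuple

import numpy as np


def save_embeddings(path: str, chunks: List[str], embeddings: np.ndarray) -> None:
    """Store chunk texts and their vectors together in one .npz archive."""
    # Saving the texts as a fixed-width unicode array avoids pickled objects.
    # Note: np.savez appends ".npz" if `path` does not already end with it.
    np.savez(path, chunks=np.array(chunks, dtype=str), embeddings=embeddings)


def load_embeddings(path: str) -> Tuple[List[str], np.ndarray]:
    """Reload what save_embeddings() wrote; `path` should include ".npz"."""
    with np.load(path) as data:
        return data["chunks"].tolist(), data["embeddings"]
```

Keeping texts and vectors in the same file means their row order cannot drift apart between runs.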

Summary Table

Function	Role
get_model()	Load and cache embedding model
embed_chunks()	Convert chunks to dense vectors
search_similar_chunks()	Retrieve top-k relevant chunks
save_embeddings()	Save chunks and vectors to disk
load_embeddings()	Reload from saved vector file