5a. vectorizer.py

vectorizer.py → formerly embed_utils.py

Breakdown

Utility module for embedding text chunks and running semantic search over them with the sentence-transformers library. It suits lightweight embedding workflows and RAG pipelines that do not use LangChain.


1. get_model()

def get_model(model_name: str = "all-MiniLM-L6-v2") -> SentenceTransformer:
Component      Description
Purpose        Loads and caches the model globally to avoid reloading
Default Model  all-MiniLM-L6-v2 (compact, fast, general-purpose)
Optimization   Uses a global _model object to implement a singleton pattern
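
The caching behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the `loader` parameter, the `_model_name` variable, and the `_load_sentence_transformer` helper are hypothetical additions that make the sketch testable without downloading a model, and reloading when the requested name changes is an assumption the source does not confirm.

```python
from typing import Any, Callable, Optional

_model: Optional[Any] = None        # cached model instance (module-level singleton)
_model_name: Optional[str] = None   # name the cached instance was loaded with (assumed)


def _load_sentence_transformer(model_name: str) -> Any:
    # Imported lazily so that importing this module stays cheap.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name)


def get_model(model_name: str = "all-MiniLM-L6-v2",
              loader: Callable[[str], Any] = _load_sentence_transformer) -> Any:
    """Return the cached model, loading it on first use or when the name changes."""
    global _model, _model_name
    if _model is None or _model_name != model_name:
        _model = loader(model_name)
        _model_name = model_name
    return _model
```

Repeated calls with the same name return the same object, so the cost of loading the model is paid once per process.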

2. embed_chunks()

def embed_chunks(chunks: List[str], model_name: str = "all-MiniLM-L6-v2") -> np.ndarray:
Component   Description
Purpose     Converts a list of text chunks into vector embeddings
Input       List[str] (text chunks)
Output      np.ndarray (dense vector embeddings)
Model Used  Cached SentenceTransformer instance
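
As an illustration of the embedding step, here is a minimal sketch. It assumes the model object is passed in directly rather than looked up by `model_name`, which differs from the signature above and is done only to keep the example self-contained; the empty-input handling is also an assumption. Any object with a sentence-transformers style `encode()` method will work.

```python
from typing import Any, List

import numpy as np


def embed_chunks(chunks: List[str], model: Any) -> np.ndarray:
    """Encode text chunks into a 2-D array with one row per chunk."""
    if not chunks:
        # Assumed behaviour: return an empty array instead of calling the model.
        return np.empty((0, 0), dtype=np.float32)
    # SentenceTransformer.encode accepts a list of strings and can return NumPy arrays.
    return np.asarray(model.encode(chunks, convert_to_numpy=True))
```

With the real library, `model` would be something like `SentenceTransformer("all-MiniLM-L6-v2")`, typically obtained through the cached `get_model()` call.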

3. search_similar_chunks()

def search_similar_chunks(
    query: str,
    chunk_embeddings: np.ndarray,
    chunks: List[str],
    top_k: int = 3,
    return_scores: bool = False,
    model_name: str = "all-MiniLM-L6-v2"
) -> Union[List[str], List[Tuple[str, float]]]:
Component     Description
Purpose       Finds the top-k chunks most semantically similar to a query
Method        Cosine similarity, computed with a NumPy dot product
Return Value  List of matching chunks, or (chunk, score) pairs when return_scores is True
Fallback      Returns an empty list if the similarity computation fails
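
The ranking step can be sketched as below. This is an assumption-laden illustration rather than the module's implementation: `rank_by_cosine` is a hypothetical helper that takes an already embedded query vector (the real function embeds the query string itself), and it normalizes explicitly because a plain dot product equals cosine similarity only for unit-length vectors.

```python
from typing import List, Tuple, Union

import numpy as np


def rank_by_cosine(
    query_embedding: np.ndarray,
    chunk_embeddings: np.ndarray,
    chunks: List[str],
    top_k: int = 3,
    return_scores: bool = False,
) -> Union[List[str], List[Tuple[str, float]]]:
    """Return the top_k chunks ranked by cosine similarity to the query vector."""
    matrix = np.asarray(chunk_embeddings, dtype=np.float64)
    if len(chunks) == 0 or matrix.size == 0:
        return []
    query = np.asarray(query_embedding, dtype=np.float64)
    # Cosine similarity = dot product divided by the product of the vector norms.
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Zero-length vectors get a score of 0 instead of triggering a division by zero.
    scores = np.divide(matrix @ query, denom, out=np.zeros_like(denom), where=denom > 0)
    best = np.argsort(-scores)[:top_k]  # indices sorted by descending score
    if return_scores:
        return [(chunks[i], float(scores[i])) for i in best]
    return [chunks[i] for i in best]
```

If the embeddings are normalized when they are created, the division can be dropped and a single matrix-vector product is enough.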

4. save_embeddings() and load_embeddings()

def save_embeddings(path: str, chunks: List[str], embeddings: np.ndarray)
def load_embeddings(path: str) -> Tuple[List[str], np.ndarray]
Function           Description
save_embeddings()  Saves chunk texts and embeddings as .npz
load_embeddings()  Reloads previously saved chunks and embeddings
Format             Uses numpy.savez and numpy.load
Use Case           Avoids recomputation in pipelines and on cold starts
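
A round trip through numpy.savez and numpy.load might look like the sketch below. The archive keys `chunks` and `embeddings` are assumptions, as is storing the texts as a fixed-width string array (which avoids needing pickle on load). Note that numpy.savez appends `.npz` to the path if it is missing, so the sketch assumes the caller passes a path that already ends in `.npz`.

```python
import os
import tempfile
from typing import List, Tuple

import numpy as np


def save_embeddings(path: str, chunks: List[str], embeddings: np.ndarray) -> None:
    """Write chunk texts and their vectors into a single .npz archive."""
    np.savez(path, chunks=np.array(chunks, dtype=str), embeddings=embeddings)


def load_embeddings(path: str) -> Tuple[List[str], np.ndarray]:
    """Read back the chunk texts and vectors written by save_embeddings."""
    with np.load(path) as data:
        return data["chunks"].tolist(), data["embeddings"]


# Usage: save to a temporary file, then reload without recomputing anything.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "index.npz")
    save_embeddings(demo_path, ["alpha", "beta"], np.eye(2, dtype=np.float32))
    loaded_chunks, loaded_vectors = load_embeddings(demo_path)
```

Keeping texts and vectors in one file means the two cannot drift out of sync between runs.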

Summary Table

Function                 Role
get_model()              Load and cache embedding model
embed_chunks()           Convert chunks to dense vectors
search_similar_chunks()  Retrieve top-k relevant chunks
save_embeddings()        Save chunks and vectors to disk
load_embeddings()        Reload from saved vector file