vectorizer.py (formerly embed_utils.py)
Breakdown
Utility module for embedding text chunks and performing semantic search using sentence-transformers. Useful for lightweight embedding workflows and non-LangChain RAG pipelines.
1. get_model()
def get_model(model_name: str = "all-MiniLM-L6-v2") -> SentenceTransformer:
| Component | Description |
| --- | --- |
| Purpose | Loads and caches the model globally to avoid reloading |
| Default Model | all-MiniLM-L6-v2 (compact, fast, general-purpose) |
| Optimization | Uses global _model object to implement a singleton pattern |
- Ideal for reuse across multiple embedding calls, since the model is loaded only once per process.
2. embed_chunks()
def embed_chunks(chunks: List[str], model_name: str = "all-MiniLM-L6-v2") -> np.ndarray:
| Component | Description |
| --- | --- |
| Purpose | Converts a list of text chunks into vector embeddings |
| Input | List[str] (text chunks) |
| Output | np.ndarray (dense vector embeddings) |
| Model Used | Cached SentenceTransformer instance |
- Useful for standalone scripts, batch embedding, or interactive querying.
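As a rough sketch of the embedding step, the function below passes the chunks to a model's `encode` method and returns one row per chunk. The names `embed_with_model` and `FakeModel` are hypothetical; the stand-in model exists only so the example runs offline, and in practice the model would be the cached `SentenceTransformer` instance (for all-MiniLM-L6-v2, each row has 384 dimensions).

```python
from typing import List

import numpy as np


def embed_with_model(chunks: List[str], model) -> np.ndarray:
    """Encode chunks into a 2-D float32 array, one row per chunk."""
    if not chunks:
        # Assumption: an empty input yields an empty array rather than an error.
        return np.empty((0, 0), dtype=np.float32)
    vectors = model.encode(chunks, convert_to_numpy=True)
    return np.asarray(vectors, dtype=np.float32)


class FakeModel:
    """Stand-in for SentenceTransformer so the sketch runs without a download."""

    def encode(self, sentences, convert_to_numpy=True):
        # Toy "embedding": [character count, word count] per sentence.
        return np.array(
            [[len(s), len(s.split())] for s in sentences], dtype=np.float32
        )


embeddings = embed_with_model(["hello world", "semantic search"], FakeModel())
print(embeddings.shape)  # (2, 2)
```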
3. search_similar_chunks()
def search_similar_chunks(
    query: str,
    chunk_embeddings: np.ndarray,
    chunks: List[str],
    top_k: int = 3,
    return_scores: bool = False,
    model_name: str = "all-MiniLM-L6-v2"
) -> Union[List[str], List[Tuple[str, float]]]:
| Component | Description |
| --- | --- |
| Purpose | Finds the top-k most semantically similar chunks to a query |
| Method | Cosine similarity via numpy dot product |
| Return Value | List of matching chunks or (chunk, score) pairs |
| Fallback | Returns empty list if similarity fails |
- Lightweight vector search alternative if Chroma is not used.
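The ranking step can be illustrated with a minimal sketch, assuming the query has already been embedded. The helper name `top_k_by_cosine` is hypothetical, and normalizing both sides before the dot product is one way to obtain cosine similarity; the module itself may do this differently.

```python
from typing import List, Tuple

import numpy as np


def top_k_by_cosine(
    query_vec: np.ndarray,
    chunk_embeddings: np.ndarray,
    chunks: List[str],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (chunk, score) pairs, highest cosine similarity first."""
    if chunk_embeddings.size == 0 or not chunks:
        return []  # mirrors the empty-list fallback described above
    # Scale to unit length so the dot product equals cosine similarity.
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    m = chunk_embeddings / (
        np.linalg.norm(chunk_embeddings, axis=1, keepdims=True) + 1e-12
    )
    scores = m @ q
    # Indices sorted by descending score, truncated to top_k.
    order = np.argsort(scores)[::-1][:top_k]
    return [(chunks[i], float(scores[i])) for i in order]


# Toy 2-D vectors standing in for real embeddings.
demo_chunks = ["a", "b", "c"]
demo_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
results = top_k_by_cosine(np.array([1.0, 0.0]), demo_embeddings, demo_chunks, top_k=2)
print([chunk for chunk, _ in results])  # ['a', 'c']
```

A brute-force scan like this is linear in the number of chunks, which is usually acceptable for small collections and is the trade-off for not running a vector store.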
4. save_embeddings() and load_embeddings()
def save_embeddings(path: str, chunks: List[str], embeddings: np.ndarray) -> None:
def load_embeddings(path: str) -> Tuple[List[str], np.ndarray]:
| Function | Description |
| --- | --- |
| save_embeddings() | Saves chunk texts and embeddings as .npz |
| load_embeddings() | Reloads previously saved chunks and embeddings |
| Format | Uses numpy.savez and numpy.load |
| Use Case | For avoiding recomputation in pipelines or cold starts |
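A round trip through `numpy.savez` and `numpy.load` might look like the sketch below. The archive keys `chunks` and `embeddings` are assumptions, as is storing the texts as a NumPy string array so that loading does not require pickling.

```python
import os
import tempfile
from typing import List, Tuple

import numpy as np


def save_embeddings(path: str, chunks: List[str], embeddings: np.ndarray) -> None:
    """Write chunk texts and their vectors into a single .npz archive."""
    # Note: np.savez appends ".npz" if the path does not already end with it.
    np.savez(path, chunks=np.array(chunks, dtype=str), embeddings=embeddings)


def load_embeddings(path: str) -> Tuple[List[str], np.ndarray]:
    """Read back the chunk texts and vectors written by save_embeddings."""
    with np.load(path) as data:
        return data["chunks"].tolist(), data["embeddings"]


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    archive_path = os.path.join(tmp_dir, "index.npz")
    save_embeddings(archive_path, ["first chunk", "second chunk"], np.eye(2))
    loaded_chunks, loaded_embeddings = load_embeddings(archive_path)

print(loaded_chunks)  # ['first chunk', 'second chunk']
```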
Summary Table
| Function | Role |
| --- | --- |
| get_model() | Load and cache embedding model |
| embed_chunks() | Convert chunks to dense vectors |
| search_similar_chunks() | Retrieve top-k relevant chunks |
| save_embeddings() | Save chunks and vectors to disk |
| load_embeddings() | Reload from saved vector file |