ragoon.similarity_search#
Classes
SimilaritySearch: A class dedicated to encoding text data, quantizing embeddings, and managing indices for efficient similarity search.
- class ragoon.similarity_search.SimilaritySearch(model_name: str, device: str = 'cuda', ndim: int = 1024, metric: str = 'ip', dtype: str = 'i8')[source]#
Bases: object
A class dedicated to encoding text data, quantizing embeddings, and managing indices for efficient similarity search.
- model_name#
Name or identifier of the embedding model.
- Type:
str
- device#
Computation device (‘cpu’ or ‘cuda’).
- Type:
str
- ndim#
Dimension of the embeddings.
- Type:
int
- metric#
Metric used for the index (‘ip’ for inner product, etc.).
- Type:
str
- dtype#
Data type for the index (‘i8’ for int8, etc.).
- Type:
str
- quantize_embeddings(embeddings, quantization_type)[source]#
Quantizes the embeddings for efficient storage and search.
- create_usearch_index(int8_embeddings, index_path)[source]#
Creates and saves a USEARCH integer index.
- load_usearch_index_view(index_path)[source]#
Loads a USEARCH index as a view for memory-efficient operations.
- search(query, top_k=10, rescore_multiplier=4)[source]#
Performs a search operation against the indexed embeddings.
Examples
>>> instance = SimilaritySearch(
...     model_name="louisbrulenaudet/tsdae-lemone-mbert-base",
...     device="cuda",
...     ndim=768,
...     metric="ip",
...     dtype="i8"
... )
>>> embeddings = instance.encode(corpus=dataset["output"])
>>> ubinary_embeddings = instance.quantize_embeddings(
...     embeddings=embeddings,
...     quantization_type="ubinary"
... )
>>> int8_embeddings = instance.quantize_embeddings(
...     embeddings=embeddings,
...     quantization_type="int8"
... )
>>> instance.create_usearch_index(
...     int8_embeddings=int8_embeddings,
...     index_path="./usearch_int8.index"
... )
>>> instance.create_faiss_index(
...     ubinary_embeddings=ubinary_embeddings,
...     index_path="./faiss_ubinary.index"
... )
>>> top_k_scores, top_k_indices = instance.search(
...     query="Sont considérées comme ayant leur domicile fiscal en France au sens de l'article 4 A",
...     top_k=10,
...     rescore_multiplier=4
... )
- __init__(model_name: str, device: str = 'cuda', ndim: int = 1024, metric: str = 'ip', dtype: str = 'i8')[source]#
Initializes the SimilaritySearch instance with the specified model, device, and index configurations.
- Parameters:
model_name (str) – The name or identifier of the SentenceTransformer model to use for embedding.
device (str, optional) – The computation device to use (‘cpu’ or ‘cuda’). Default is ‘cuda’.
ndim (int, optional) – The dimensionality of the embeddings. Default is 1024.
metric (str, optional) – The metric used for the index (‘ip’ for inner product). Default is ‘ip’.
dtype (str, optional) – The data type for the USEARCH index (‘i8’ for 8-bit integer). Default is ‘i8’.
- encode(corpus: list, normalize_embeddings: bool = True) ndarray[source]#
Encodes the given corpus into full-precision embeddings.
- Parameters:
corpus (list) – A list of sentences to be encoded.
normalize_embeddings (bool, optional) – Whether to normalize the returned vectors to unit length. With normalized vectors, the faster dot product (util.dot_score) can be used in place of cosine similarity. Default is True.
- Returns:
The full-precision embeddings of the corpus.
- Return type:
np.ndarray
Notes
This method normalizes the embeddings when normalize_embeddings is True and displays a progress bar during encoding.
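The benefit of unit-length embeddings can be shown with a minimal NumPy sketch (toy vectors standing in for encoder output): once rows are L2-normalized, the dot product of two rows equals their cosine similarity.

```python
import numpy as np

# Toy "embeddings": 3 vectors of dimension 4 (stand-ins for encoder output).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 4))

# L2-normalize each row, as encode(..., normalize_embeddings=True) would.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normalized = embeddings / norms

# After normalization, the dot product of two rows equals their cosine similarity.
dot = normalized[0] @ normalized[1]
cos = (embeddings[0] @ embeddings[1]) / (
    np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
)
assert np.isclose(dot, cos)
```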
- quantize_embeddings(embeddings: ndarray, quantization_type: str) ndarray | bytearray[source]#
Quantizes the given embeddings based on the specified quantization type (‘ubinary’ or ‘int8’).
- Parameters:
embeddings (np.ndarray) – The full-precision embeddings to be quantized.
quantization_type (str) – The type of quantization (‘ubinary’ for unsigned binary, ‘int8’ for 8-bit integers).
- Returns:
The quantized embeddings.
- Return type:
Union[np.ndarray, bytearray]
- Raises:
ValueError – If an unsupported quantization type is provided.
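The two quantization types can be sketched in plain NumPy. This is an illustration of one common scheme, not the library's exact calibration: "ubinary" keeps only the sign of each dimension and packs 8 bits per byte, while "int8" linearly rescales each dimension into [-128, 127] using per-dimension min/max statistics over the corpus.

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy full-precision embeddings: 8 vectors, 16 dimensions.
embeddings = rng.normal(size=(8, 16)).astype(np.float32)

# "ubinary": threshold at zero, then pack 8 booleans into each byte.
ubinary = np.packbits((embeddings > 0).astype(np.uint8), axis=1)

# "int8": per-dimension min/max calibration, then linear rescale to [-128, 127]
# (one common int8 scheme; the library's exact calibration may differ).
lo = embeddings.min(axis=0)
hi = embeddings.max(axis=0)
scale = (hi - lo) / 255.0
int8 = np.clip(np.round((embeddings - lo) / scale - 128), -128, 127).astype(np.int8)

print(ubinary.shape)  # 16 dims packed into 2 bytes per vector
print(int8.dtype)
```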
- create_faiss_index(ubinary_embeddings: bytearray, index_path: str = None, save: bool = False) None[source]#
Creates and saves a FAISS binary index from ubinary embeddings.
- Parameters:
ubinary_embeddings (bytearray) – The ubinary-quantized embeddings.
index_path (str, optional) – The file path to save the FAISS binary index. Default is None.
save (bool, optional) – Whether to save the index to index_path. Default is False.
Notes
The dimensionality of the index is specified during the class initialization (default is 1024).
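To make the storage relationship concrete: a binary index over ndim-bit vectors stores ndim / 8 bytes per entry, and distances between packed codes reduce to XOR plus a popcount (the Hamming distance a FAISS binary index computes internally). A small sketch:

```python
import numpy as np

ndim = 1024  # dimensionality set at class initialization
rng = np.random.default_rng(1)
bits = rng.integers(0, 2, size=(2, ndim), dtype=np.uint8)
codes = np.packbits(bits, axis=1)

# ndim bits pack into ndim / 8 bytes per vector.
assert codes.shape == (2, ndim // 8)

# Hamming distance on packed codes: XOR the bytes, then count set bits.
hamming = np.unpackbits(codes[0] ^ codes[1]).sum()
assert hamming == np.sum(bits[0] != bits[1])
```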
- create_usearch_index(int8_embeddings: ndarray, index_path: str = None, save: bool = False) None[source]#
Creates and saves a USEARCH integer index from int8 embeddings.
- Parameters:
int8_embeddings (np.ndarray) – The int8-quantized embeddings.
index_path (str, optional) – The file path to save the USEARCH integer index. Default is None.
save (bool, optional) – Whether to save the index to index_path. Default is False.
- Return type:
None
Notes
The dimensionality and metric of the index are specified during class initialization.
- load_usearch_index_view(index_path: str) any[source]#
Loads a USEARCH index as a view for memory-efficient operations.
- Parameters:
index_path (str) – The file path to the USEARCH index to be loaded as a view.
- Returns:
A view of the USEARCH index for memory-efficient similarity search operations.
- Return type:
object
Notes
A view maps the index from disk rather than copying it fully into RAM, which keeps memory usage low for large indices. The exact behavior depends on the USEARCH version in use.
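The memory-mapping idea typically behind such views can be illustrated with numpy.memmap (a stand-in here, not the USEARCH API): the file's bytes are mapped into the address space and pages are read on demand instead of being loaded up front.

```python
import os
import tempfile
import numpy as np

# Persist toy int8 "index contents" to disk.
data = np.arange(32, dtype=np.int8)
path = os.path.join(tempfile.mkdtemp(), "toy.index")
data.tofile(path)

# A view maps the file into memory instead of copying it into RAM;
# reads are served from the OS page cache on demand.
view = np.memmap(path, dtype=np.int8, mode="r")
print(view.shape)
```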
- load_faiss_index(index_path: str) None[source]#
Loads a FAISS binary index from a specified file path.
This method loads a binary index created by FAISS into the class attribute binary_index, ready for performing similarity searches.
- Parameters:
index_path (str) – The file path to the saved FAISS binary index.
- Return type:
None
Notes
The loaded index is stored in the binary_index attribute of the class. Ensure that the index at index_path is compatible with the configurations (e.g., dimensions) used for this class instance.
- search(query: str, top_k: int = 10, rescore_multiplier: int = 4) Tuple[List[float], List[int]][source]#
Performs a search operation against the indexed embeddings.
- Parameters:
query (str) – The query sentence/string to be searched.
top_k (int, optional) – The number of top results to return. Default is 10.
rescore_multiplier (int, optional) – The multiplier applied to top_k to enlarge the initial retrieval set before re-scoring. Higher values can increase precision at the cost of speed. Default is 4.
- Returns:
A tuple containing the scores and the indices of the top k results.
- Return type:
Tuple[List[float], List[int]]
Notes
This method assumes that binary_index and int8_index are already loaded or created.
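The two-stage retrieve-then-rescore flow can be sketched in NumPy under simplifying assumptions (random unit vectors in place of real embeddings, sign-based binary codes, float dot products standing in for the int8 index): a cheap Hamming-distance pass fetches top_k * rescore_multiplier candidates, and a higher-precision pass re-ranks them.

```python
import numpy as np

rng = np.random.default_rng(7)
n, ndim = 100, 64
corpus = rng.normal(size=(n, ndim)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
# A query that is a slightly perturbed copy of corpus item 3.
query = corpus[3] + 0.05 * rng.normal(size=ndim).astype(np.float32)

# Stage 1: coarse retrieval with binary codes and Hamming distance,
# fetching top_k * rescore_multiplier candidates (the binary index's role).
top_k, rescore_multiplier = 10, 4
corpus_bits = np.packbits(corpus > 0, axis=1)
query_bits = np.packbits(query > 0)
hamming = np.unpackbits(corpus_bits ^ query_bits, axis=1).sum(axis=1)
candidates = np.argsort(hamming)[: top_k * rescore_multiplier]

# Stage 2: rescore the candidates with higher-precision dot products
# (the int8 index plays this role in the class) and keep the top_k.
scores = corpus[candidates] @ query
order = np.argsort(-scores)[:top_k]
top_k_indices = candidates[order]
top_k_scores = scores[order]
```

Increasing rescore_multiplier widens the candidate pool from stage 1, so stage 2 is less likely to miss a true nearest neighbor that the coarse binary pass ranked poorly.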