ragoon.embeddings#
Classes
EmbeddingsDataLoader – A class to load and process datasets to add embeddings using specified models.
EmbeddingsVisualizer – A class for Embedding Exploration Lab, visualizing high-dimensional embeddings in 3D space.
- class ragoon.embeddings.EmbeddingsDataLoader(token: str, model_configs: List[Dict[str, str]], dataset_name: str | None = None, dataset: Dataset | DatasetDict | None = None, batch_size: int | None = 8, convert_to_tensor: bool | None = False, device: str | None = 'cuda')[source]#
Bases:
object
A class to load and process datasets to add embeddings using specified models.
This class handles loading a dataset from Hugging Face, processing it to add embeddings using specified models, and provides methods to save and upload the processed dataset.
- dataset_name#
The name of the dataset to load from Hugging Face.
- Type:
str
- token#
The token for accessing Hugging Face API.
- Type:
str
- model_configs#
The list of dictionaries with model configurations to use for generating embeddings.
- Type:
list of dict
- batch_size#
The number of samples to process in each batch.
- Type:
int
- dataset#
The loaded and processed dataset.
- Type:
datasets.DatasetDict
- convert_to_tensor#
Whether the output should be one large tensor. Default is False.
- Type:
bool, optional
- cuda_available#
Whether CUDA is available for GPU acceleration.
- Type:
bool
- device#
The device to use for embedding processing when torch.cuda.is_available() is not reliable, e.g. when running on a Zero GPU Hugging Face Space. Default is None.
- Type:
str, optional
- models#
A dictionary to store loaded models and their configurations.
- Type:
dict
- __init__(token, model_configs, dataset_name=None, dataset=None, batch_size=8, convert_to_tensor=False, device='cuda')[source]#
Initialize the EmbeddingsDataLoader with the specified parameters.
- encode(texts, model, query_prefix=None, passage_prefix=None)[source]#
Create embeddings for a list of texts using a loaded model with optional prefixes.
- embed(batch, model, model_name, column='text', query_prefix=None, passage_prefix=None)#
Add embedding columns to the dataset for each model.
- batch_encode(text)#
Embed a single text using all loaded models and return the results as a JSON string.
- process(splits=None, column='text', preload_models=False)#
Process specified splits of the dataset and add embeddings for each model.
- __init__(token: str, model_configs: List[Dict[str, str]], dataset_name: str | None = None, dataset: Dataset | DatasetDict | None = None, batch_size: int | None = 8, convert_to_tensor: bool | None = False, device: str | None = 'cuda')[source]#
Initialize the EmbeddingsDataLoader with the specified parameters.
- Parameters:
token (str) – The token for accessing Hugging Face API.
model_configs (list of dict) – The list of dictionaries with model configurations to use for generating embeddings.
dataset_name (str, optional) – The name of the dataset to load from Hugging Face. Default is None.
dataset (Dataset or DatasetDict, optional) – The dataset to process. Default is None.
batch_size (int, optional) – The number of samples to process in each batch. Default is 8.
convert_to_tensor (bool, optional) – Whether the output should be one large tensor. Default is False.
device (str, optional) – The device used for embedding processing if torch.cuda.is_available() is not reliable. Useful when using the Zero GPU on Hugging Face Space. Default is ‘cuda’.
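The model_configs parameter is a list of dictionaries, one per model. A sketch of its likely shape, inferred from the load_models example on this page; any key beyond "model" (such as the prefix keys) is an assumption mirroring the query/passage prefixes that encode() accepts:

```python
# A plausible model_configs payload; keys other than "model" are
# assumptions inferred from the prefixes mentioned by encode().
model_configs = [
    {"model": "sentence-transformers/all-MiniLM-L6-v2"},
    {
        "model": "intfloat/e5-base-v2",
        "query_prefix": "query: ",      # assumed key name
        "passage_prefix": "passage: ",  # assumed key name
    },
]
```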
- load_dataset()[source]#
Load the dataset from Hugging Face.
- Raises:
Exception – If the dataset fails to load from Hugging Face.
- load_model(model_name: str) InferenceClient | SentenceTransformer[source]#
Load the specified model.
- Parameters:
model_name (str) – The name of the model to load.
- Returns:
model – The loaded model, either a SentenceTransformer or an InferenceClient.
- Return type:
Union[InferenceClient, SentenceTransformer]
- Raises:
Exception – If the model fails to load.
- load_models() Dict[str, Dict[str, InferenceClient | SentenceTransformer | str | None]][source]#
Load all specified models.
This method loads all models specified in the model_configs and returns them in a dictionary format.
- Returns:
models – A dictionary where each key is a model name and each value is a dictionary containing the model and any prefixes.
- Return type:
dict
Examples
>>> loader = EmbeddingsDataLoader(token="your_token", model_configs=[{"model": "bert-base-uncased"}])
>>> models = loader.load_models()
- delete_model(model: InferenceClient | SentenceTransformer)[source]#
Delete the specified model and clear GPU cache.
- Parameters:
model (Union[InferenceClient, SentenceTransformer]) – The model to delete.
- Return type:
None
- encode(texts: List[str]) ndarray | dict[source]#
Create embeddings for a list of texts using a loaded model with optional prefixes, and optionally embed them into a batch.
- Parameters:
texts (list of str or dict) – The list of texts to encode or a batch of data from the dataset.
- Returns:
The embeddings for the texts, or the batch with added embedding columns.
- Return type:
np.ndarray or dict
- Raises:
Exception – If encoding or embedding fails.
- batch_encode(text: str) str[source]#
Embed a single text using all loaded models and return the results as a JSON string.
- Parameters:
text (str) – The text to embed.
- Returns:
The JSON string containing the embeddings from all models.
- Return type:
str
- Raises:
Exception – If embedding fails.
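The JSON string produced by batch_encode can be sketched as follows. This is a minimal stand-in, assuming the payload maps model names to embedding vectors; the toy callables below replace the real SentenceTransformer / InferenceClient models, and NumPy arrays are converted to lists because ndarrays are not JSON-serializable:

```python
import json
import numpy as np

def pack_embeddings(text: str, models: dict) -> str:
    """Embed `text` with each model and serialize the results to JSON.

    `models` maps a model name to a callable returning a NumPy vector;
    in the real loader these would be loaded embedding models rather
    than plain functions.
    """
    payload = {
        name: encode(text).tolist()  # lists, since ndarrays are not JSON-serializable
        for name, encode in models.items()
    }
    return json.dumps(payload)

# Stand-in "models" producing fixed-size vectors for illustration.
models = {
    "model-a": lambda t: np.zeros(4),
    "model-b": lambda t: np.ones(4),
}
result = pack_embeddings("hello", models)
```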
- process(splits: List[str] | None = None, column: str | None = 'text', preload_models: bool | None = False)[source]#
Process specified splits of the dataset and add embeddings for each model.
- Parameters:
splits (list of str, optional) – The list of splits to process. Default is None.
column (str, optional) – The name of the column containing the text to encode. Default is “text”.
preload_models (bool, optional) – Whether to load all models specified in the model_configs. Default is False.
- Return type:
None
- Raises:
Exception – If processing fails.
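Processing a split amounts to walking the text column in batches of batch_size and attaching one embedding column per model. A minimal sketch of that batching pattern, with stand-in encoders in place of the real models and an assumed "&lt;model&gt;_embeddings" column-naming convention:

```python
from typing import Callable, Dict, List

def process_split(
    rows: List[Dict],
    encoders: Dict[str, Callable[[List[str]], List[List[float]]]],
    column: str = "text",
    batch_size: int = 8,
) -> List[Dict]:
    """Add one '<model>_embeddings' column per encoder, batch by batch."""
    out = [dict(row) for row in rows]  # shallow copies; originals untouched
    for start in range(0, len(out), batch_size):
        batch = out[start:start + batch_size]
        texts = [row[column] for row in batch]
        for name, encode in encoders.items():
            vectors = encode(texts)  # one vector per text in the batch
            for row, vec in zip(batch, vectors):
                row[f"{name}_embeddings"] = vec
    return out

# Toy encoder: "embeds" each text as its character count.
rows = [{"text": f"doc {i}"} for i in range(5)]
encoders = {"toy": lambda texts: [[float(len(t))] for t in texts]}
processed = process_split(rows, encoders, batch_size=2)
```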
- get_dataset() Dataset | DatasetDict[source]#
Return the processed dataset.
- Returns:
dataset – The processed dataset.
- Return type:
Union[Dataset, DatasetDict]
- save_dataset(output_dir: str)[source]#
Save the processed dataset to disk.
- Parameters:
output_dir (str) – The directory to save the dataset.
- Raises:
Exception – If saving fails.
- upload_dataset(repo_id: str, token: str | None = None, private: bool | None = False)[source]#
Upload the processed dataset to the Hugging Face Hub.
- Parameters:
repo_id (str) – The repository ID to upload the dataset.
token (str, optional) – An optional authentication token for the Hugging Face Hub. If no token is passed, will default to the token saved locally when logging in with huggingface-cli login. Will raise an error if no token is passed and the user is not logged in.
private (bool, optional) – Whether the dataset repository should be set to private or not. Only affects repository creation: a repository that already exists will not be affected by that parameter.
- Raises:
Exception – If uploading fails.
- class ragoon.embeddings.EmbeddingsVisualizer(index_path: str, dataset_path: str)[source]#
Bases:
object
A class for Embedding Exploration Lab, visualizing high-dimensional embeddings in 3D space.
This class provides functionality to load embeddings from a FAISS index, reduce their dimensionality using PCA and/or t-SNE, and visualize them in an interactive 3D plot.
- Parameters:
index_path (str) – Path to the FAISS index file.
dataset_path (str) – Path to the dataset containing labels.
- index_path#
Path to the FAISS index file.
- Type:
str
- dataset_path#
Path to the dataset containing labels.
- Type:
str
- index#
Loaded FAISS index.
- Type:
faiss.Index or None
- dataset#
Loaded dataset containing labels.
- Type:
datasets.Dataset or None
- vectors#
Extracted vectors from the FAISS index.
- Type:
np.ndarray or None
- reduced_vectors#
Dimensionality-reduced vectors.
- Type:
np.ndarray or None
- labels#
Labels from the dataset.
- Type:
list of str or None
- load_dataset() 'EmbeddingsVisualizer'[source]#
Load the dataset containing labels from the specified file path.
- reduce_dimensionality(method: str = 'umap', pca_components: int = 50, final_components: int = 3, random_state: int = 42) 'EmbeddingsVisualizer'#
Reduce dimensionality of the extracted vectors with dynamic progress tracking.
- plot_3d() None#
Generate a 3D scatter plot of the reduced vectors with labels.
Examples
>>> visualizer = EmbeddingsVisualizer(index_path="path/to/index", dataset_path="path/to/dataset")
>>> visualizer.visualize(
...     method="pca",
...     save_html=True,
...     html_file_name="embedding_visualization.html"
... )
- load_index()[source]#
Load the FAISS index from the specified file path.
- Returns:
self – The instance itself, allowing for method chaining.
- Return type:
EmbeddingsVisualizer
- load_dataset(column: str = 'document')[source]#
Load the dataset containing labels from the specified file path.
- Parameters:
column (str, optional) – The column of the split corresponding to the embeddings stored in the index. Default is ‘document’.
- Returns:
self – The instance itself, allowing for method chaining.
- Return type:
EmbeddingsVisualizer
- extract_vectors()[source]#
Extract all vectors from the loaded FAISS index.
This method should be called after load_index().
- Returns:
self – The instance itself, allowing for method chaining.
- Return type:
EmbeddingsVisualizer
- Raises:
ValueError – If the index has not been loaded yet.
RuntimeError – If there’s an issue with vector extraction.
- reduce_dimensionality(method: str = 'umap', pca_components: int = 50, final_components: int = 3, random_state: int = 42)[source]#
Reduce dimensionality of the extracted vectors with dynamic progress tracking.
- Parameters:
method ({'pca', 'umap', 'pca_umap'}, optional) – The method to use for dimensionality reduction, by default 'umap'.
pca : Principal Component Analysis (PCA) is a linear dimensionality reduction technique commonly used to reduce the dimensionality of high-dimensional data. It identifies the directions (principal components) in which the data varies the most and projects the data onto these components, resulting in a lower-dimensional representation.
umap : Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimensionality reduction technique particularly well-suited for visualizing high-dimensional data in lower-dimensional space. It preserves both local and global structure by constructing a low-dimensional representation that captures the underlying manifold of the data.
pca_umap : A two-step technique: PCA is first applied to reduce the dimensionality of the data, then UMAP further reduces it and captures the non-linear structure. This combination can be effective in preserving both global and local structure of the data.
pca_components (int, optional) – Number of components for PCA (used in ‘pca’ and ‘pca_umap’), by default 50.
final_components (int, optional) – Final number of components (3 for 3D visualization), by default 3.
random_state (int, optional) – Random state for reproducibility, by default 42.
- Returns:
self – The instance itself, allowing for method chaining.
- Return type:
EmbeddingsVisualizer
- Raises:
ValueError – If vectors have not been extracted yet or if an invalid method is specified.
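The PCA step described above can be sketched with plain NumPy. This is an illustration of the projection only, not the library's implementation (which presumably delegates to scikit-learn and umap-learn): center the vectors, take the SVD, and project onto the top principal directions:

```python
import numpy as np

def pca_reduce(vectors: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project `vectors` onto their top principal components via SVD."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(42)
vectors = rng.normal(size=(100, 50))   # e.g. 100 embeddings of dimension 50
reduced = pca_reduce(vectors, n_components=3)
```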
- create_plot(title: str = '3D Visualization of Embeddings', point_size: int = 3) Figure[source]#
Generate a 3D scatter plot of the reduced vectors with labels.
- Parameters:
title (str, optional) – The title of the plot (default is ‘3D Visualization of Embeddings’).
point_size (int, optional) – The size of the markers in the scatter plot (default is 3).
- Returns:
The generated 3D scatter plot.
- Return type:
go.Figure
- Raises:
ValueError – If vectors have not been reduced yet.
Notes
This method requires the plotly library to be installed.
Examples
>>> visualizer = EmbeddingsVisualizer(index_path="path/to/index", dataset_path="path/to/dataset")
>>> plot = visualizer.create_plot(title='My Embeddings', point_size=5)
>>> plot.show()
- visualize(column: str, method: str = 'tsne', pca_components: int = 50, final_components: int = 3, random_state: int = 42, title: str = '3D Visualization of Embeddings', point_size: int = 3, save_html: bool = False, html_file_name: str = 'embedding_visualization.html')[source]#
Full pipeline: load index, extract vectors, reduce dimensionality, and visualize.
- Parameters:
column (str) – The column of the split corresponding to the embeddings stored in the index.
method (str, optional) – The dimensionality reduction method to use. Default is ‘tsne’.
pca_components (int, optional) – The number of components to keep when using PCA for dimensionality reduction. Default is 50.
final_components (int, optional) – The number of final components to visualize. Default is 3.
random_state (int, optional) – The random state for reproducibility. Default is 42.
title (str, optional) – The title of the visualization plot. Default is ‘3D Visualization of Embeddings’.
point_size (int, optional) – The size of the points in the visualization plot. Default is 3.
save_html (bool, optional) – Whether to save the visualization as an HTML file. Default is False.
html_file_name (str, optional) – The name of the HTML file to save. Default is ‘embedding_visualization.html’.
- Return type:
None
Examples
>>> visualizer = EmbeddingsVisualizer(index_path="path/to/index", dataset_path="path/to/dataset")
>>> visualizer.visualize(column="document", method='tsne', pca_components=50, final_components=3, random_state=42)