ragoon.embeddings


Classes

EmbeddingsDataLoader(token, model_configs[, ...])

A class to load and process datasets to add embeddings using specified models.

EmbeddingsVisualizer(index_path, dataset_path)

A class for the Embedding Exploration Lab, visualizing high-dimensional embeddings in 3D space.

class ragoon.embeddings.EmbeddingsDataLoader(token: str, model_configs: List[Dict[str, str]], dataset_name: str | None = None, dataset: Dataset | DatasetDict | None = None, batch_size: int | None = 8, convert_to_tensor: bool | None = False, device: str | None = 'cuda')[source]#

Bases: object

A class to load and process datasets to add embeddings using specified models.

This class handles loading a dataset from Hugging Face, processing it to add embeddings using specified models, and provides methods to save and upload the processed dataset.
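Examples

A hypothetical end-to-end flow (the dataset and model names below are placeholders) might look like:

>>> loader = EmbeddingsDataLoader(
...     token="your_token",
...     dataset_name="your/dataset",
...     model_configs=[{"model": "sentence-transformers/all-MiniLM-L6-v2"}]
... )
>>> loader.load_dataset()
>>> loader.process(splits=["train"], column="text")
>>> dataset = loader.get_dataset()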

dataset_name#

The name of the dataset to load from Hugging Face.

Type:

str

token#

The token for accessing Hugging Face API.

Type:

str

model_configs#

The list of dictionaries with model configurations to use for generating embeddings.

Type:

list of dict

batch_size#

The number of samples to process in each batch.

Type:

int

dataset#

The loaded and processed dataset.

Type:

datasets.DatasetDict

convert_to_tensor#

Whether the output should be one large tensor. Default is False.

Type:

bool, optional

cuda_available#

Whether CUDA is available for GPU acceleration.

Type:

bool

device#

The device used for embedding processing when torch.cuda.is_available() is not reliable (useful with the Zero GPU on Hugging Face Spaces). Default is ‘cuda’.

Type:

str, optional

models#

A dictionary to store loaded models and their configurations.

Type:

dict

__init__(token, model_configs, dataset_name=None, dataset=None, batch_size=8, convert_to_tensor=False, device='cuda')[source]#

Initialize the EmbeddingsDataLoader with the specified parameters.

load_dataset()[source]#

Load the dataset from Hugging Face.

load_model(model_name)[source]#

Load the specified model.

load_models()[source]#

Load all specified models.

delete_model(model)[source]#

Delete the specified model and clear GPU cache.

encode(texts)[source]#

Create embeddings for a list of texts using a loaded model with optional prefixes.

batch_encode(text)[source]#

Embed a single text using all loaded models and return the results as a JSON string.

process(splits=None, column='text', preload_models=False)[source]#

Process specified splits of the dataset and add embeddings for each model.

get_dataset()[source]#

Return the processed dataset.

save_dataset(output_dir)[source]#

Save the processed dataset to disk.

upload_dataset(repo_id)[source]#

Upload the processed dataset to the Hugging Face Hub.

__init__(token: str, model_configs: List[Dict[str, str]], dataset_name: str | None = None, dataset: Dataset | DatasetDict | None = None, batch_size: int | None = 8, convert_to_tensor: bool | None = False, device: str | None = 'cuda')[source]#

Initialize the EmbeddingsDataLoader with the specified parameters.

Parameters:
  • token (str) – The token for accessing Hugging Face API.

  • model_configs (list of dict) – The list of dictionaries with model configurations to use for generating embeddings.

  • dataset_name (str, optional) – The name of the dataset to load from Hugging Face. Default is None.

  • dataset (Dataset or DatasetDict, optional) – The dataset to process. Default is None.

  • batch_size (int, optional) – The number of samples to process in each batch. Default is 8.

  • convert_to_tensor (bool, optional) – Whether the output should be one large tensor. Default is False.

  • device (str, optional) – The device used for embedding processing if torch.cuda.is_available() is not reliable. Useful when using the Zero GPU on Hugging Face Space. Default is ‘cuda’.
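The shape of model_configs is easiest to see with a concrete sketch. A minimal example, assuming each entry is a dict keyed by "model" with optional "query_prefix" / "passage_prefix" keys (an assumption based on the prefixes mentioned in load_models() and encode()):

```python
# Hypothetical model_configs entries; the exact keys ("model",
# "query_prefix", "passage_prefix") are an assumption based on the
# prefixes mentioned in load_models() and encode().
model_configs = [
    {"model": "sentence-transformers/all-MiniLM-L6-v2"},
    {
        "model": "intfloat/e5-base-v2",
        "query_prefix": "query: ",      # prepended to query texts
        "passage_prefix": "passage: ",  # prepended to passage texts
    },
]
# Every entry must at least name a model.
assert all("model" in cfg for cfg in model_configs)
```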

load_dataset()[source]#

Load the dataset from Hugging Face.

Raises:

Exception – If the dataset fails to load from Hugging Face.

load_model(model_name: str) InferenceClient | SentenceTransformer[source]#

Load the specified model.

Parameters:

model_name (str) – The name of the model to load.

Returns:

model – The loaded model, either a SentenceTransformer or an InferenceClient.

Return type:

Union[InferenceClient, SentenceTransformer]

Raises:

Exception – If the model fails to load.

load_models() Dict[str, Dict[str, InferenceClient | SentenceTransformer | str | None]][source]#

Load all specified models.

This method loads all models specified in the model_configs and returns them in a dictionary format.

Returns:

models – A dictionary where each key is a model name and each value is a dictionary containing the model and any prefixes.

Return type:

dict

Examples

>>> loader = EmbeddingsDataLoader(token="your_token", model_configs=[{"model": "bert-base-uncased"}])
>>> models = loader.load_models()
delete_model(model: InferenceClient | SentenceTransformer)[source]#

Delete the specified model and clear GPU cache.

Parameters:

model (Union[InferenceClient, SentenceTransformer]) – The model to delete.

Return type:

None

encode(texts: List[str]) ndarray | dict[source]#

Create embeddings for a list of texts using a loaded model with optional prefixes, and optionally embed them into a batch.

Parameters:

texts (list of str or dict) – The list of texts to encode or a batch of data from the dataset.

Returns:

The embeddings for the texts, or the batch with added embedding columns.

Return type:

np.ndarray or dict

Raises:

Exception – If encoding or embedding fails.

batch_encode(text: str) str[source]#

Embed a single text using all loaded models and return the results as a JSON string.

Parameters:

text (str) – The text to embed.

Returns:

The JSON string containing the embeddings from all models.

Return type:

str

Raises:

Exception – If embedding fails.
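The returned JSON string can be decoded with the standard library. The exact layout (model name mapped to an embedding vector) is an assumption; the literal below is an illustrative stand-in, not real output:

```python
import json

# Hypothetical shape of the JSON string returned by batch_encode();
# the layout (model name -> embedding vector) is an assumption.
result = '{"intfloat/e5-base-v2": [0.12, -0.03, 0.44]}'
embeddings = json.loads(result)
vector = embeddings["intfloat/e5-base-v2"]
assert isinstance(vector, list) and len(vector) == 3
```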

process(splits: List[str] | None = None, column: str | None = 'text', preload_models: bool | None = False)[source]#

Process specified splits of the dataset and add embeddings for each model.

Parameters:
  • splits (list of str, optional) – The list of splits to process. Default is None.

  • column (str, optional) – The name of the column containing the text to encode. Default is “text”.

  • preload_models (bool, optional) – Whether to load all models specified in the model_configs. Default is False.

Return type:

None

Raises:

Exception – If processing fails.

get_dataset() Dataset | DatasetDict[source]#

Return the processed dataset.

Returns:

dataset – The processed dataset.

Return type:

Union[Dataset, DatasetDict]

save_dataset(output_dir: str)[source]#

Save the processed dataset to disk.

Parameters:

output_dir (str) – The directory to save the dataset.

Raises:

Exception – If saving fails.

upload_dataset(repo_id: str, token: str | None = None, private: bool | None = False)[source]#

Upload the processed dataset to the Hugging Face Hub.

Parameters:
  • repo_id (str) – The repository ID to upload the dataset.

  • token (str, optional) – An optional authentication token for the Hugging Face Hub. If no token is passed, it defaults to the token saved locally when logging in with huggingface-cli login. An error is raised if no token is passed and the user is not logged in.

  • private (bool, optional) – Whether the dataset repository should be set to private or not. Only affects repository creation: a repository that already exists will not be affected by that parameter.

Raises:

Exception – If uploading fails.

class ragoon.embeddings.EmbeddingsVisualizer(index_path: str, dataset_path: str)[source]#

Bases: object

A class for the Embedding Exploration Lab, visualizing high-dimensional embeddings in 3D space.

This class provides functionality to load embeddings from a FAISS index, reduce their dimensionality using PCA and/or UMAP, and visualize them in an interactive 3D plot.

Parameters:
  • index_path (str) – Path to the FAISS index file.

  • dataset_path (str) – Path to the dataset containing labels.

index_path#

Path to the FAISS index file.

Type:

str

dataset_path#

Path to the dataset containing labels.

Type:

str

index#

Loaded FAISS index.

Type:

faiss.Index or None

dataset#

Loaded dataset containing labels.

Type:

datasets.Dataset or None

vectors#

Extracted vectors from the FAISS index.

Type:

np.ndarray or None

reduced_vectors#

Dimensionality-reduced vectors.

Type:

np.ndarray or None

labels#

Labels from the dataset.

Type:

list of str or None

load_index() EmbeddingsVisualizer[source]#

Load the FAISS index from the specified file path.

load_dataset(column='document') EmbeddingsVisualizer[source]#

Load the dataset containing labels from the specified file path.

extract_vectors() EmbeddingsVisualizer[source]#

Extract all vectors from the loaded FAISS index.

reduce_dimensionality(method='umap', pca_components=50, final_components=3, random_state=42) EmbeddingsVisualizer[source]#

Reduce dimensionality of the extracted vectors with dynamic progress tracking.

create_plot(title='3D Visualization of Embeddings', point_size=3) Figure[source]#

Generate a 3D scatter plot of the reduced vectors with labels.

visualize(column, method='tsne', pca_components=50, final_components=3, random_state=42, title='3D Visualization of Embeddings', point_size=3, save_html=False, html_file_name='embedding_visualization.html')[source]#

Full pipeline: load index, extract vectors, reduce dimensionality, and visualize.

Examples

>>> visualizer = EmbeddingsVisualizer(index_path="path/to/index", dataset_path="path/to/dataset")
>>> visualizer.visualize(
...    column="document",
...    method="pca",
...    save_html=True,
...    html_file_name="embedding_visualization.html"
... )
__init__(index_path: str, dataset_path: str)[source]#
load_index()[source]#

Load the FAISS index from the specified file path.

Returns:

self – The instance itself, allowing for method chaining.

Return type:

EmbeddingsVisualizer

load_dataset(column: str = 'document')[source]#

Load the dataset containing labels from the specified file path.

Parameters:

column (str, optional) – The column of the split corresponding to the embeddings stored in the index. Default is ‘document’.

Returns:

self – The instance itself, allowing for method chaining.

Return type:

EmbeddingsVisualizer

extract_vectors()[source]#

Extract all vectors from the loaded FAISS index.

This method should be called after load_index().

Returns:

self – The instance itself, allowing for method chaining.

Return type:

EmbeddingsVisualizer

Raises:
  • ValueError – If the index has not been loaded yet.

  • RuntimeError – If there’s an issue with vector extraction.

reduce_dimensionality(method: str = 'umap', pca_components: int = 50, final_components: int = 3, random_state: int = 42)[source]#

Reduce dimensionality of the extracted vectors with dynamic progress tracking.

Parameters:
  • method ({'pca', 'umap', 'pca_umap'}, optional) –

    The method to use for dimensionality reduction, by default ‘umap’.

    • pca : Principal Component Analysis (PCA) is a linear dimensionality reduction technique that is commonly used to reduce the dimensionality of high-dimensional data. It identifies the directions (principal components) in which the data varies the most and projects the data onto these components, resulting in a lower-dimensional representation.

    • umap : Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimensionality reduction technique that is particularly well-suited for visualizing high-dimensional data in lower-dimensional space. It preserves both local and global structure of the data by constructing a low-dimensional representation that captures the underlying manifold structure of the data.

    • pca_umap : PCA followed by UMAP is a two-step dimensionality reduction technique. First, PCA is applied to reduce the dimensionality of the data; then, UMAP is applied to further reduce it and capture the non-linear structure. This combination can be effective in preserving both global and local structure of the data.

  • pca_components (int, optional) – Number of components for PCA (used in ‘pca’ and ‘pca_umap’), by default 50.

  • final_components (int, optional) – Final number of components (3 for 3D visualization), by default 3.

  • random_state (int, optional) – Random state for reproducibility, by default 42.

Returns:

self – The instance itself, allowing for method chaining.

Return type:

EmbeddingsVisualizer

Raises:

ValueError – If vectors have not been extracted yet or if an invalid method is specified.
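For intuition, the PCA stage used by 'pca' and 'pca_umap' can be sketched in plain NumPy. This is a conceptual illustration under assumed shapes, not ragoon's actual implementation:

```python
import numpy as np

# Minimal PCA sketch: project centered vectors onto their top
# principal components, obtained via SVD.
rng = np.random.default_rng(42)
vectors = rng.normal(size=(100, 64))   # e.g. 100 embeddings of dimension 64
centered = vectors - vectors.mean(axis=0)
# Right singular vectors are the principal axes; keep 3 for a 3D plot.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:3].T
assert reduced.shape == (100, 3)
```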

create_plot(title: str = '3D Visualization of Embeddings', point_size: int = 3) Figure[source]#

Generate a 3D scatter plot of the reduced vectors with labels.

Parameters:
  • title (str, optional) – The title of the plot (default is ‘3D Visualization of Embeddings’).

  • point_size (int, optional) – The size of the markers in the scatter plot (default is 3).

Returns:

The generated 3D scatter plot.

Return type:

go.Figure

Raises:

ValueError – If vectors have not been reduced yet.

Notes

This method requires the plotly library to be installed.

Examples

>>> visualizer = EmbeddingsVisualizer(index_path="path/to/index", dataset_path="path/to/dataset")
>>> plot = visualizer.create_plot(title='My Embeddings', point_size=5)
>>> plot.show()
visualize(column: str, method: str = 'tsne', pca_components: int = 50, final_components: int = 3, random_state: int = 42, title: str = '3D Visualization of Embeddings', point_size: int = 3, save_html: bool = False, html_file_name: str = 'embedding_visualization.html')[source]#

Full pipeline: load index, extract vectors, reduce dimensionality, and visualize.

Parameters:
  • column (str) – The column of the split corresponding to the embeddings stored in the index.

  • method (str, optional) – The dimensionality reduction method to use; see reduce_dimensionality() for the supported values. Default is ‘tsne’.

  • pca_components (int, optional) – The number of components to keep when using PCA for dimensionality reduction. Default is 50.

  • final_components (int, optional) – The number of final components to visualize. Default is 3.

  • random_state (int, optional) – The random state for reproducibility. Default is 42.

  • title (str, optional) – The title of the visualization plot. Default is ‘3D Visualization of Embeddings’.

  • point_size (int, optional) – The size of the points in the visualization plot. Default is 3.

  • save_html (bool, optional) – Whether to save the visualization as an HTML file. Default is False.

  • html_file_name (str, optional) – The name of the HTML file to save. Default is ‘embedding_visualization.html’.

Return type:

None


Examples

>>> visualizer = EmbeddingsVisualizer(index_path="path/to/index", dataset_path="path/to/dataset")
>>> visualizer.visualize(column="document", method='umap', pca_components=50, final_components=3, random_state=42)