ragoon package#

Submodules#

ragoon.chunks module#

class ragoon.chunks.ChunkMetadata(uuid: str, chunk_uuid: str, chunk_number: str)[source]#

Bases: object

Metadata for a text chunk within a dataset.

uuid#

The UUID of the original text.

Type:

str

chunk_uuid#

The UUID of the chunked text.

Type:

str

chunk_number#

The identifier of the chunk indicating its order and total number of chunks.

Type:

str

uuid: str#
chunk_uuid: str#
chunk_number: str#
__init__(uuid: str, chunk_uuid: str, chunk_number: str) → None#
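A minimal standalone sketch of how this metadata ties chunks back to their source text. The `ChunkMetadata` dataclass here mirrors the fields above; the "i/n" format for chunk_number is an illustrative assumption, not a format confirmed by the source.

```python
import uuid as uuid_lib
from dataclasses import dataclass

@dataclass
class ChunkMetadata:
    # Mirrors ragoon.chunks.ChunkMetadata: all three fields are strings.
    uuid: str          # UUID of the original text
    chunk_uuid: str    # UUID of this particular chunk
    chunk_number: str  # order / total, e.g. "1/2" (assumed format)

source_id = str(uuid_lib.uuid4())
chunks = ["First part.", "Second part."]
metadata = [
    ChunkMetadata(
        uuid=source_id,
        chunk_uuid=str(uuid_lib.uuid4()),
        chunk_number=f"{i + 1}/{len(chunks)}",
    )
    for i in range(len(chunks))
]
# All chunks share the source UUID, so they can be grouped back together.
```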
class ragoon.chunks.DatasetChunker(dataset: Dataset | DatasetDict, max_tokens: int, overlap_percentage: float, column: str, model_name: str = 'bert-base-uncased', uuid_column: str | None = None, separators: List[str] = ['.', '\n'], space_after_splitters: List[str] | None = None)[source]#

Bases: object

A class to chunk text data within a dataset for processing with embedding models.

This class splits large texts into smaller chunks based on a specified maximum token limit, while maintaining an overlap between chunks to preserve context.

Parameters:
  • dataset (Union[datasets.Dataset, datasets.DatasetDict]) – The dataset to be chunked. It can be either a Dataset or a DatasetDict.

  • max_tokens (int) – The maximum number of tokens allowed in each chunk.

  • overlap_percentage (float) – The percentage of tokens to overlap between consecutive chunks.

  • column (str) – The name of the column containing the text to be chunked.

  • model_name (str, optional) – The name of the tokenizer model to use. Default is 'bert-base-uncased'.

  • uuid_column (Optional[str], optional) – The name of the column containing UUIDs for the texts. If not provided, new UUIDs will be generated.

  • separators (List[str], optional) – List of separators used to split the text.

  • space_after_splitters (Optional[List[str]], optional) – List of separators that require a space after splitting. Default is None.

Examples

>>> from datasets import load_dataset
>>> dataset = load_dataset("louisbrulenaudet/dac6-instruct")
>>> chunker = DatasetChunker(
...     dataset['train'],
...     max_tokens=512,
...     overlap_percentage=0.5,
...     column="document",
...     model_name="intfloat/multilingual-e5-large",
...     separators=["\n", ".", "!", "?"]
... )
>>> dataset_chunked = chunker.chunk_dataset()
>>> dataset_chunked.to_list()[:3]
[{'text': 'This is a chunked text.'}, {'text': 'This is another chunked text.'}, ...]
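The overlap mechanic can be illustrated with a standalone sketch. This is a simplified token-window model (assuming overlap_percentage < 1), not the library's actual implementation, which additionally respects separator boundaries:

```python
def chunk_tokens(tokens, max_tokens, overlap_percentage):
    # Sliding window: each chunk repeats the last `overlap` tokens of the
    # previous chunk, preserving context across chunk boundaries.
    overlap = int(max_tokens * overlap_percentage)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break
    return chunks

tokens = list(range(10))
chunks = chunk_tokens(tokens, max_tokens=4, overlap_percentage=0.5)
# chunks == [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Each chunk's first two tokens repeat the previous chunk's last two, which is the context-preservation behaviour described above.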

__init__(dataset: Dataset | DatasetDict, max_tokens: int, overlap_percentage: float, column: str, model_name: str = 'bert-base-uncased', uuid_column: str | None = None, separators: List[str] = ['.', '\n'], space_after_splitters: List[str] | None = None) → None[source]#
split_text(text: str) → List[str][source]#

Splits a text into segments based on the specified separators.

Parameters:

text (str) – The text to be split.

Returns:

A list of text segments.

Return type:

List[str]

Examples

>>> chunker = DatasetChunker(dataset, 512, 0.1, 'text')
>>> chunker.split_text("This is a sentence. This is another one.")
['This is a sentence', '.', ' This is another one', '.']
create_chunks(text: str) → List[str][source]#

Creates text chunks from a given text based on the maximum tokens limit.

Parameters:

text (str) – The text to be chunked.

Returns:

A list of text chunks.

Return type:

List[str]

Raises:

ValueError – If the text cannot be chunked properly.

Examples

>>> chunker = DatasetChunker(dataset, 512, 0.1, 'text')
>>> text = "This is a very long text that needs to be chunked."
>>> chunks = chunker.create_chunks(text)
>>> len(chunks)
2
finalize_chunk(chunk_text: str, is_last: bool) → str[source]#

Finalizes the chunk text by adjusting leading/trailing separators.

Parameters:
  • chunk_text (str) – The chunk text to be finalized.

  • is_last (bool) – Indicates whether this is the last chunk.

Returns:

The finalized chunk text.

Return type:

str

Examples

>>> chunker = DatasetChunker(dataset, 512, 0.1, 'text')
>>> chunk = " This is a chunk."
>>> chunker.finalize_chunk(chunk, is_last=True)
'This is a chunk.'
chunk_dataset() → Dataset | DatasetDict[source]#

Chunks the entire dataset into smaller segments.

Returns:

The chunked dataset, with each entry split into smaller chunks.

Return type:

Union[Dataset, DatasetDict]

Examples

>>> chunker = DatasetChunker(dataset, 512, 0.1, 'text')
>>> chunked_dataset = chunker.chunk_dataset()
>>> len(chunked_dataset)
1000

ragoon.embeddings module#

class ragoon.embeddings.EmbeddingsDataLoader(token: str, model_configs: List[Dict[str, str]], dataset_name: str | None = None, dataset: Dataset | DatasetDict | None = None, batch_size: int | None = 8, convert_to_tensor: bool | None = False, device: str | None = 'cuda')[source]#

Bases: object

A class to load and process datasets to add embeddings using specified models.

This class handles loading a dataset from Hugging Face, processing it to add embeddings using specified models, and provides methods to save and upload the processed dataset.

dataset_name#

The name of the dataset to load from Hugging Face.

Type:

str

token#

The token for accessing Hugging Face API.

Type:

str

model_configs#

The list of dictionaries with model configurations to use for generating embeddings.

Type:

list of dict

batch_size#

The number of samples to process in each batch.

Type:

int

dataset#

The loaded and processed dataset.

Type:

datasets.DatasetDict

convert_to_tensor#

Whether the output should be one large tensor. Default is False.

Type:

bool, optional

cuda_available#

Whether CUDA is available for GPU acceleration.

Type:

bool

device#

The device used for embedding computation when torch.cuda.is_available() is not reliable, for example when using the Zero GPU on a Hugging Face Space. Default is 'cuda'.

Type:

str, optional

models#

A dictionary to store loaded models and their configurations.

Type:

dict

__init__(token, model_configs, dataset_name=None, dataset=None, batch_size=8, convert_to_tensor=False, device='cuda')[source]#

Initializes the EmbeddingsDataLoader with the specified parameters.

load_dataset()[source]#

Load the dataset from Hugging Face.

load_model(model_name)[source]#

Load the specified model.

load_models()[source]#

Load all specified models.

encode(texts, model, query_prefix=None, passage_prefix=None)[source]#

Create embeddings for a list of texts using a loaded model with optional prefixes.

embed(batch, model, model_name, column='text', query_prefix=None, passage_prefix=None)#

Add embeddings columns to the dataset for each model.

batch_encode(text)

Embed a single text using all loaded models and return the results as a JSON string.

process(splits=None, column='text', preload_models=False)

Process specified splits of the dataset and add embeddings for each model.

get_dataset()[source]#

Return the processed dataset.

save_dataset(output_dir)[source]#

Save the processed dataset to disk.

upload_dataset(repo_id)[source]#

Upload the processed dataset to the Hugging Face Hub.

__init__(token: str, model_configs: List[Dict[str, str]], dataset_name: str | None = None, dataset: Dataset | DatasetDict | None = None, batch_size: int | None = 8, convert_to_tensor: bool | None = False, device: str | None = 'cuda')[source]#

Initialize the EmbeddingsDataLoader with the specified parameters.

Parameters:
  • token (str) – The token for accessing Hugging Face API.

  • model_configs (list of dict) – The list of dictionaries with model configurations to use for generating embeddings.

  • dataset_name (str, optional) – The name of the dataset to load from Hugging Face. Default is None.

  • dataset (Dataset or DatasetDict, optional) – The dataset to process. Default is None.

  • batch_size (int, optional) – The number of samples to process in each batch. Default is 8.

  • convert_to_tensor (bool, optional) – Whether the output should be one large tensor. Default is False.

  • device (str, optional) – The device used for embedding computation when torch.cuda.is_available() is not reliable. Useful when using the Zero GPU on a Hugging Face Space. Default is 'cuda'.
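A hypothetical model_configs value might look like the following. The "query_prefix" and "passage_prefix" keys are an assumption inferred from the encode() and load_models() documentation below, not a confirmed schema:

```python
# Hypothetical configuration (key names inferred, not confirmed by the source).
model_configs = [
    {
        "model": "intfloat/multilingual-e5-large",
        "query_prefix": "query: ",     # E5-style models expect text prefixes
        "passage_prefix": "passage: ",
    },
    {
        "model": "BAAI/bge-m3",        # models without prefixes omit the keys
    },
]
```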

load_dataset()[source]#

Load the dataset from Hugging Face.

Raises:

Exception – If the dataset fails to load from Hugging Face.

load_model(model_name: str) → InferenceClient | SentenceTransformer[source]#

Load the specified model.

Parameters:

model_name (str) – The name of the model to load.

Returns:

model – The loaded model, either a SentenceTransformer or an InferenceClient.

Return type:

Union[InferenceClient, SentenceTransformer]

Raises:

Exception – If the model fails to load.

load_models() → Dict[str, Dict[str, InferenceClient | SentenceTransformer | str | None]][source]#

Load all specified models.

This method loads all models specified in the model_configs and returns them in a dictionary format.

Returns:

models – A dictionary where each key is a model name and each value is a dictionary containing the model and any prefixes.

Return type:

dict

Examples

>>> loader = EmbeddingsDataLoader(token="your_token", model_configs=[{"model": "bert-base-uncased"}])
>>> models = loader.load_models()
delete_model(model: InferenceClient | SentenceTransformer)[source]#

Delete the specified model and clear GPU cache.

Parameters:

model (Union[InferenceClient, SentenceTransformer]) – The model to delete.

Return type:

None

encode(texts: List[str]) → ndarray | dict[source]#

Create embeddings for a list of texts using a loaded model with optional prefixes, and optionally embed them into a batch.

Parameters:

texts (list of str or dict) – The list of texts to encode or a batch of data from the dataset.

Returns:

The embeddings for the texts, or the batch with added embedding columns.

Return type:

np.ndarray or dict

Raises:

Exception – If encoding or embedding fails.
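The prefixing behaviour can be sketched independently of any model. This is a simplified stand-in: the real method would pass the prefixed texts on to the loaded embedding model.

```python
def with_prefix(texts, prefix=None):
    # encode() optionally prepends a prefix to each text before embedding,
    # as required by E5-style models ("query: " / "passage: ").
    if prefix is None:
        return list(texts)
    return [prefix + text for text in texts]

queries = with_prefix(["what is RAG?"], prefix="query: ")
# queries == ["query: what is RAG?"]
```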

batch_encode(text: str) → str[source]#

Embed a single text using all loaded models and return the results as a JSON string.

Parameters:

text (str) – The text to embed.

Returns:

The JSON string containing the embeddings from all models.

Return type:

str

Raises:

Exception – If embedding fails.
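The JSON round-trip can be sketched with toy callables standing in for loaded models (a hedged sketch; the real method uses the models loaded via load_models()):

```python
import json

def batch_encode_sketch(text, models):
    # Stand-in for batch_encode: each "model" here is just a callable
    # returning a vector; the result is one embedding per model,
    # serialized as a single JSON string.
    embeddings = {name: model(text) for name, model in models.items()}
    return json.dumps(embeddings)

models = {"toy-model": lambda t: [float(len(t))]}
payload = batch_encode_sketch("hello", models)
# json.loads(payload) == {"toy-model": [5.0]}
```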

process(splits: List[str] | None = None, column: str | None = 'text', preload_models: bool | None = False)[source]#

Process specified splits of the dataset and add embeddings for each model.

Parameters:
  • splits (list of str, optional) – The list of splits to process. Default is None.

  • column (str, optional) – The name of the column containing the text to encode. Default is 'text'.

  • preload_models (bool, optional) – Whether to load all models specified in the model_configs. Default is False.

Return type:

None

Raises:

Exception – If processing fails.

get_dataset() → Dataset | DatasetDict[source]#

Return the processed dataset.

Returns:

dataset – The processed dataset.

Return type:

Union[Dataset, DatasetDict]

save_dataset(output_dir: str)[source]#

Save the processed dataset to disk.

Parameters:

output_dir (str) – The directory to save the dataset.

Raises:

Exception – If saving fails.

upload_dataset(repo_id: str, token: str | None = None, private: bool | None = False)[source]#

Upload the processed dataset to the Hugging Face Hub.

Parameters:
  • repo_id (str) – The repository ID to upload the dataset.

  • token (str, optional) – An optional authentication token for the Hugging Face Hub. If no token is passed, will default to the token saved locally when logging in with huggingface-cli login. Will raise an error if no token is passed and the user is not logged-in.

  • private (bool, optional) – Whether the dataset repository should be set to private or not. Only affects repository creation: a repository that already exists will not be affected by that parameter.

Raises:

Exception – If uploading fails.

class ragoon.embeddings.EmbeddingsVisualizer(index_path: str, dataset_path: str)[source]#

Bases: object

A class for embedding exploration, visualizing high-dimensional embeddings in 3D space.

This class provides functionality to load embeddings from a FAISS index, reduce their dimensionality using PCA and/or t-SNE, and visualize them in an interactive 3D plot.

Parameters:
  • index_path (str) – Path to the FAISS index file.

  • dataset_path (str) – Path to the dataset containing labels.

index_path#

Path to the FAISS index file.

Type:

str

dataset_path#

Path to the dataset containing labels.

Type:

str

index#

Loaded FAISS index.

Type:

faiss.Index or None

dataset#

Loaded dataset containing labels.

Type:

datasets.Dataset or None

vectors#

Extracted vectors from the FAISS index.

Type:

np.ndarray or None

reduced_vectors#

Dimensionality-reduced vectors.

Type:

np.ndarray or None

labels#

Labels from the dataset.

Type:

list of str or None

load_index() → EmbeddingsVisualizer[source]#

Load the FAISS index from the specified file path.

load_dataset() → EmbeddingsVisualizer[source]#

Load the dataset containing labels from the specified file path.

extract_vectors() → EmbeddingsVisualizer[source]#

Extract all vectors from the loaded FAISS index.

reduce_dimensionality(method: str = 'umap', pca_components: int = 50, final_components: int = 3, random_state: int = 42) → EmbeddingsVisualizer[source]#

Reduce dimensionality of the extracted vectors with dynamic progress tracking.

plot_3d() → None#

Generate a 3D scatter plot of the reduced vectors with labels.

Examples

>>> visualizer = EmbeddingsVisualizer(index_path="path/to/index", dataset_path="path/to/dataset")
>>> visualizer.visualize(
...    column="document",
...    method="pca",
...    save_html=True,
...    html_file_name="embedding_visualization.html"
... )
__init__(index_path: str, dataset_path: str)[source]#
load_index()[source]#

Load the FAISS index from the specified file path.

Returns:

self – The instance itself, allowing for method chaining.

Return type:

EmbeddingsVisualizer

load_dataset(column: str = 'document')[source]#

Load the dataset containing labels from the specified file path.

Parameters:

column (str, optional) – The column of the split corresponding to the embeddings stored in the index. Default is 'document'.

Returns:

self – The instance itself, allowing for method chaining.

Return type:

EmbeddingsVisualizer

extract_vectors()[source]#

Extract all vectors from the loaded FAISS index.

This method should be called after load_index().

Returns:

self – The instance itself, allowing for method chaining.

Return type:

EmbeddingsVisualizer

Raises:
  • ValueError – If the index has not been loaded yet.

  • RuntimeError – If there’s an issue with vector extraction.

reduce_dimensionality(method: str = 'umap', pca_components: int = 50, final_components: int = 3, random_state: int = 42)[source]#

Reduce dimensionality of the extracted vectors with dynamic progress tracking.

Parameters:
  • method ({'pca', 'umap', 'pca_umap'}, optional) –

    The method to use for dimensionality reduction, by default ‘umap’.

    • pca : Principal Component Analysis (PCA) is a linear dimensionality reduction technique

    that is commonly used to reduce the dimensionality of high-dimensional data. It identifies the directions (principal components) in which the data varies the most and projects the data onto these components, resulting in a lower-dimensional representation.

    • umap : Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimensionality

    reduction technique that is particularly well-suited for visualizing high-dimensional data in lower-dimensional space. It preserves both local and global structure of the data by constructing a low-dimensional representation that captures the underlying manifold structure of the data.

    • pca_umap : PCA followed by UMAP is a two-step dimensionality reduction technique.

    First, PCA is applied to reduce the dimensionality of the data. Then, UMAP is applied to further reduce the dimensionality and capture the non-linear structure of the data. This combination can be effective in preserving both global and local structure of the data.

  • pca_components (int, optional) – Number of components for PCA (used in 'pca' and 'pca_umap'), by default 50.

  • final_components (int, optional) – Final number of components (3 for 3D visualization), by default 3.

  • random_state (int, optional) – Random state for reproducibility, by default 42.

Returns:

self – The instance itself, allowing for method chaining.

Return type:

EmbeddingsVisualizer

Raises:

ValueError – If vectors have not been extracted yet or if an invalid method is specified.
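The 'pca' step can be sketched with plain NumPy (a minimal SVD-based PCA; the class itself presumably relies on dedicated libraries such as scikit-learn and umap-learn):

```python
import numpy as np

def pca_reduce(vectors, n_components):
    # Center the data, then project onto the top right-singular vectors,
    # i.e. the directions of maximal variance.
    centered = vectors - vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(42)
vectors = rng.normal(size=(100, 768))  # e.g. 100 embeddings of dimension 768
reduced = pca_reduce(vectors, n_components=3)
# reduced.shape == (100, 3), ready for a 3D scatter plot
```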

create_plot(title: str = '3D Visualization of Embeddings', point_size: int = 3) → Figure[source]#

Generate a 3D scatter plot of the reduced vectors with labels.

Parameters:
  • title (str, optional) – The title of the plot. Default is '3D Visualization of Embeddings'.

  • point_size (int, optional) – The size of the markers in the scatter plot. Default is 3.

Returns:

The generated 3D scatter plot.

Return type:

go.Figure

Raises:

ValueError – If vectors have not been reduced yet.

Notes

This method requires the plotly library to be installed.

Examples

>>> visualizer = EmbeddingsVisualizer(index_path="path/to/index", dataset_path="path/to/dataset")
>>> plot = visualizer.create_plot(title='My Embeddings', point_size=5)
>>> plot.show()
visualize(column: str, method: str = 'tsne', pca_components: int = 50, final_components: int = 3, random_state: int = 42, title: str = '3D Visualization of Embeddings', point_size: int = 3, save_html: bool = False, html_file_name: str = 'embedding_visualization.html')[source]#

Full pipeline: load index, extract vectors, reduce dimensionality, and visualize.

Parameters:
  • column (str) – The column of the split corresponding to the embeddings stored in the index.

  • method (str, optional) – The dimensionality reduction method to use. Default is 'tsne'.

  • pca_components (int, optional) – The number of components to keep when using PCA for dimensionality reduction. Default is 50.

  • final_components (int, optional) – The number of final components to visualize. Default is 3.

  • random_state (int, optional) – The random state for reproducibility. Default is 42.

  • title (str, optional) – The title of the visualization plot. Default is '3D Visualization of Embeddings'.

  • point_size (int, optional) – The size of the points in the visualization plot. Default is 3.

  • save_html (bool, optional) – Whether to save the visualization as an HTML file. Default is False.

  • html_file_name (str, optional) – The name of the HTML file to save. Default is 'embedding_visualization.html'.

Return type:

None


Examples

>>> visualizer = EmbeddingsVisualizer(index_path="path/to/index", dataset_path="path/to/dataset")
>>> visualizer.visualize(column='document', method='tsne', pca_components=50, final_components=3, random_state=42)

ragoon.web_rag module#

class ragoon.web_rag.WebRAG(google_api_key: str, google_cx: str, completion_client, user_agent: str | None = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36')[source]#

Bases: object

__init__(google_api_key: str, google_cx: str, completion_client, user_agent: str | None = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36') → None[source]#

WebRAG class.

This class facilitates retrieval-based querying and completion using various APIs.

Parameters:
  • google_api_key (str) – The API key for Google services.

  • google_cx (str) – The custom search engine ID for Google Custom Search.

  • completion_client (object) – The API client for the completion service (e.g., an OpenAI or Groq client).

  • user_agent (str, optional) – The user agent string to be used in web requests. Default is a Chrome user agent.

web_scraper#

An instance of the WebScraper class for web scraping.

Type:

WebScraper

retriever#

An instance of the Retriever class for data retrieval.

Type:

Retriever

An instance of the GoogleSearch class for Google searches.

Type:

GoogleSearch

Examples

>>> # Initialize WebRAG instance
>>> ragoon = WebRAG(
...     google_api_key="your_google_api_key",
...     google_cx="your_google_cx",
...     completion_client=Groq(api_key="your_groq_api_key")
... )

>>> # Search and get results
>>> query = "I want to do a left join in python polars"
>>> results = ragoon.search(
...     query=query,
...     completion_model="Llama3-70b-8192",
...     max_tokens=512,
...     temperature=1,
... )
search(query: str, completion_model: str, system_prompt: str | None = "\n        Given the user's input query, generate a concise and relevant Google search\n        query that directly addresses the main intent of the user's question. The search query must\n        be specifically tailored to retrieve results that can significantly enhance the context for a\n        subsequent dialogue with an LLM. This approach will facilitate few-shot learning by providing\n        rich, specific, and contextually relevant information. Please ensure that the response is\n        well-formed and format it as a JSON object with a key named 'search_query'. This\n        structured approach will help in assimilating the fetched results into an enhanced conversational\n        model, contributing to a more nuanced and informed interaction.\n        ", *args, **kargs)[source]#

Search for information and perform completion.

This method searches for information related to the given query and performs completion using the specified model. Additional parameters can be passed to the completion method.

Parameters:
  • query (str) – The search query.

  • completion_model (str) – The name or identifier of the completion model to be used.

  • *args – Additional positional arguments to be passed to the completion method.

  • **kwargs – Additional keyword arguments to be passed to the completion method.

Returns:

completion_data – A dictionary containing the generated completion data.

Return type:

dict
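The default system prompt above instructs the model to return a JSON object with a single 'search_query' key; parsing such a response is straightforward (the response string below is illustrative, not an actual model output):

```python
import json

# Illustrative LLM response following the prompt's required JSON format.
response = '{"search_query": "left join in python polars"}'
search_query = json.loads(response)["search_query"]
# search_query == "left join in python polars"
```

The extracted search_query is what would then be sent to Google Custom Search to fetch context for the completion step.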

Module contents#