# Evaluation Metrics

This page gives an overview of the evaluation metrics available for measuring the end-to-end performance of
a RAG application, mainly by measuring the distance between the ground truth answer and the predicted answer.

These metrics are calculated as part of the `04_evaluation.py` script based on the `actual`, `expected` and `context` fields of
the `.jsonl` output file (referred to as "calculation base"), generated by the `03_querying.py` script. See the [script inputs and outputs
guide](/docs/script-inputs-outputs.md#03_queryingpy) for more information.

You can choose which metrics should be calculated in your experiment by updating the `metric_types` field in the
`search_config.json` configuration file.

## Configuration Example

```json
"metric_types": [
    "lcsstr",
    "lcsseq",
    "cosine",
    "jaro_winkler",
    "hamming",
    "jaccard",
    "levenshtein",
    "fuzzy_score",
    "rouge1_precision",
    "rouge1_recall",
    "rouge1_fmeasure",
    "rouge2_precision",
    "rouge2_recall",
    "rouge2_fmeasure",
    "rougeL_precision",
    "rougeL_recall",
    "rougeL_fmeasure",
    "bert_all_MiniLM_L6_v2",
    "bert_base_nli_mean_tokens",
    "bert_large_nli_mean_tokens",
    "bert_large_nli_stsb_mean_tokens",
    "bert_distilbert_base_nli_stsb_mean_tokens",
    "bert_paraphrase_multilingual_MiniLM_L12_v2",
    "llm_answer_relevance",
    "llm_context_precision",
    "llm_context_recall"
]
```

## Algorithm-based Metrics

The following metrics are calculated by using different string similarity algorithms mostly backed by the [TextDistance
Python package](https://pypi.org/project/textdistance/).

### Longest common substring

| Configuration Key | Calculation Base     | Possible Values    |
| ----------------- | -------------------- | ------------------ |
| `lcsstr`          | `actual`, `expected` | Percentage (0-100) |

Calculates the longest common substring (LCS) similarity score between two strings.
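As an illustration, the score can be sketched in plain Python. This is not the script's implementation: the function name `lcsstr_score` is hypothetical, and normalizing by the length of the longer string is an assumption made here to obtain a 0-100 value.

```python
def lcsstr_score(actual: str, expected: str) -> int:
    """Longest common substring length as a percentage of the longer input.

    Assumption: the score is normalized by the longer string's length.
    """
    longest = max(len(actual), len(expected))
    if longest == 0:
        return 100  # two empty strings are treated as identical
    best = 0
    # prev[j] holds the length of the common run ending at the previous
    # character of `actual` and at expected[j - 1].
    prev = [0] * (len(expected) + 1)
    for ch_a in actual:
        curr = [0] * (len(expected) + 1)
        for j, ch_e in enumerate(expected, start=1):
            if ch_a == ch_e:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best * 100 // longest


print(lcsstr_score("abcdef", "zzcdezz"))
```

Here the longest shared run is `cde` (3 characters) and the longer string has 7 characters, so the sketch yields 42.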

### Longest common subsequence

| Configuration Key | Calculation Base     | Possible Values    |
| ----------------- | -------------------- | ------------------ |
| `lcsseq`          | `actual`, `expected` | Percentage (0-100) |

Computes the longest common subsequence (LCS) similarity score between two input strings.
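Unlike a substring, a subsequence does not need to be contiguous. The following minimal sketch shows the idea with the classic dynamic-programming approach. The function name `lcsseq_score` and the normalization by the longer string's length are assumptions, not taken from the script.

```python
def lcsseq_score(actual: str, expected: str) -> int:
    """Longest common subsequence length as a percentage of the longer input.

    Assumption: the score is normalized by the longer string's length.
    """
    longest = max(len(actual), len(expected))
    if longest == 0:
        return 100  # two empty strings are treated as identical
    # prev[j] is the LCS length of the processed prefix of `actual`
    # and the first j characters of `expected`.
    prev = [0] * (len(expected) + 1)
    for ch_a in actual:
        curr = [0]
        for j, ch_e in enumerate(expected, start=1):
            if ch_a == ch_e:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] * 100 // longest


print(lcsseq_score("AGGTAB", "GXTXAYB"))
```

The two example strings share the subsequence `GTAB` (4 characters) and the longer string has 7 characters, giving 57.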

### Cosine similarity (Ochiai coefficient)

| Configuration Key | Calculation Base     | Possible Values    |
| ----------------- | -------------------- | ------------------ |
| `cosine`          | `actual`, `expected` | Percentage (0-100) |

This coefficient is calculated as the intersection of the term-frequency vectors of the generated answer (actual) and the ground-truth answer (expected) divided by the geometric mean of the sizes of these vectors.
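The formula above can be sketched as follows. The function name `cosine_score` is hypothetical, and splitting on whitespace to obtain terms is an assumption; the tokenization used by the script may differ (for example, it may work on characters).

```python
from collections import Counter
from math import sqrt


def cosine_score(actual: str, expected: str) -> int:
    """Ochiai coefficient of two term-frequency multisets, as a percentage.

    Assumption: terms are lowercased, whitespace-separated tokens.
    """
    a = Counter(actual.lower().split())
    e = Counter(expected.lower().split())
    if not a or not e:
        return 100 if a == e else 0
    # Multiset intersection: each term counts min(count in a, count in e).
    overlap = sum((a & e).values())
    # Geometric mean of the two multiset sizes.
    denominator = sqrt(sum(a.values()) * sum(e.values()))
    return int(overlap / denominator * 100)


print(cosine_score("the cat sat", "the cat ran"))
```

Two of the three terms overlap in this example, so the coefficient is 2 / sqrt(3 * 3), which is truncated to 66.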

### Jaro-Winkler distance

| Configuration Key | Calculation Base     | Possible Values    |
| ----------------- | -------------------- | ------------------ |
| `jaro_winkler`    | `actual`, `expected` | Percentage (0-100) |

The Jaro-Winkler similarity score is a measure of similarity between two strings. It builds on the Jaro similarity,
which is derived from the number of matching characters (identical characters found within a limited distance of each
other) and the number of transpositions among them, and adds a bonus for strings that share a common prefix.
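A minimal sketch of the standard Jaro-Winkler definition is given here for reference. The function names are hypothetical, and the prefix weight of 0.1 with a maximum prefix length of 4 are the commonly used defaults, assumed here rather than taken from the script.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    similarity = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return similarity + prefix * prefix_weight * (1 - similarity)


print(int(jaro_winkler("MARTHA", "MARHTA") * 100))
```

For `MARTHA` and `MARHTA`, all six characters match with one transposition, giving a Jaro similarity of about 0.944. The shared prefix `MAR` raises it to about 0.961, that is 96 when truncated to a percentage.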

### Hamming distance

| Configuration Key | Calculation Base     | Possible Values    |
| ----------------- | -------------------- | ------------------ |
| `hamming`         | `actual`, `expected` | Percentage (0-100) |

The Hamming distance is a measure of the difference between two strings. It is calculated as the number of
positions at which the corresponding characters of the two strings differ.
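The following sketch shows one way to turn the distance into a 0-100 similarity. The function name `hamming_score` is hypothetical. Counting the extra characters of the longer string as mismatches and normalizing by the longer length are assumptions made for this illustration.

```python
from itertools import zip_longest


def hamming_score(actual: str, expected: str) -> int:
    """Share of positions with equal characters, as a percentage.

    Assumption: when lengths differ, every surplus position of the longer
    string counts as a mismatch.
    """
    longest = max(len(actual), len(expected))
    if longest == 0:
        return 100  # two empty strings are treated as identical
    # zip_longest pads the shorter string with None, which never equals
    # a character, so padded positions are counted as mismatches.
    mismatches = sum(a != e for a, e in zip_longest(actual, expected))
    return (longest - mismatches) * 100 // longest


print(hamming_score("karolin", "kathrin"))
```

The two example strings differ in 3 of 7 positions, so the sketch yields 57.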

### Jaccard similarity

| Configuration Key | Calculation Base     | Possible Values    |
| ----------------- | -------------------- | ------------------ |
| `jaccard`         | `actual`, `expected` | Percentage (0-100) |

The Jaccard similarity is calculated as the number of elements in the intersection of the two sets divided by the number
of elements in the union of the two sets.
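A minimal sketch of this ratio follows. The function name `jaccard_score` is hypothetical, and building the sets from lowercased, whitespace-separated tokens is an assumption; the script may build them differently (for example, from characters).

```python
def jaccard_score(actual: str, expected: str) -> int:
    """Size of the intersection over size of the union, as a percentage.

    Assumption: the sets contain lowercased, whitespace-separated tokens.
    """
    a = set(actual.lower().split())
    e = set(expected.lower().split())
    union = a | e
    if not union:
        return 100  # two empty inputs are treated as identical
    return len(a & e) * 100 // len(union)


print(jaccard_score("the cat sat", "the cat ran"))
```

The example shares 2 tokens out of 4 distinct tokens overall, giving 50.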

### Levenshtein distance

| Configuration Key | Calculation Base     | Possible Values    |
| ----------------- | -------------------- | ------------------ |
| `levenshtein`     | `actual`, `expected` | Percentage (0-100) |

The Levenshtein distance is a measure of the difference between two strings. It is calculated as the
minimum number of single-character insertions, deletions, or substitutions required to transform one string into the other.
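The edit distance can be computed with a row-by-row dynamic program, as in the sketch below. The function names are hypothetical, and converting the distance to a percentage by normalizing with the longer string's length is an assumption, not the script's confirmed behavior.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    # prev[j] is the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_score(actual: str, expected: str) -> int:
    """Assumed normalization: 100 * (1 - distance / longer length)."""
    longest = max(len(actual), len(expected))
    if longest == 0:
        return 100
    return (longest - levenshtein_distance(actual, expected)) * 100 // longest


print(levenshtein_distance("kitten", "sitting"), levenshtein_score("kitten", "sitting"))
```

Turning `kitten` into `sitting` takes 3 edits, and with a longer length of 7 the assumed normalization yields 57.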

### RapidFuzz similarity

| Configuration Key | Calculation Base     | Possible Values    |
| ----------------- | -------------------- | ------------------ |
| `fuzzy_score`     | `actual`, `expected` | Percentage (0-100) |

This metric is backed by the [RapidFuzz Python package](https://github.com/rapidfuzz/RapidFuzz).
It calculates the fuzzy score between two documents using the Levenshtein distance.

### Rouge retrieval metrics (Token based)

**Rouge**, short for Recall-Oriented Understudy for Gisting Evaluation, is typically used in summarization evaluation tasks to compare human-produced references with system-generated summaries. The core idea is to check for sufficient overlap of common words or phrases between the reference and the prediction. String metrics look at character-level differences, whereas Rouge compares token-level matches. We use the [`rouge-score`](https://pypi.org/project/rouge-score/) package to compute these measures. The following metrics are captured.

| Configuration Key                            | Calculation Base             | Possible Values       |
| -------------------------------------------- | ---------------------------- | --------------------- |
| `rouge{1 \| 2 \| L}_{precision \| recall \| fmeasure}` | `ground_truth`, `prediction` | Percentage (0 - 100)  |


- **rouge1_precision**: The ROUGE-1 precision score is the number of overlapping unigrams between the predicted and ground_truth strings divided by the number of unigrams in the predicted string.
- **rouge1_recall**: The ROUGE-1 recall score is the number of overlapping unigrams between the predicted and ground_truth strings divided by the number of unigrams in the ground_truth string.
- **rouge1_fmeasure**: This is the harmonic mean of the ROUGE-1 precision and recall scores.
- **rouge2_precision**: The ROUGE-2 precision score is the number of overlapping bigrams between the predicted and ground_truth strings divided by the number of bigrams in the predicted string.
- **rouge2_recall**: The ROUGE-2 recall score is the number of overlapping bigrams between the predicted and ground_truth strings divided by the number of bigrams in the ground_truth string.
- **rouge2_fmeasure**: This is the harmonic mean of the ROUGE-2 precision and recall scores.
- **rougeL_precision**: The ROUGE-L precision score is the length of overlapping longest common subsequence between the predicted and ground_truth strings divided by the number of unigrams in the predicted string.
- **rougeL_recall**: The ROUGE-L recall score is the length of overlapping longest common subsequence between the predicted and ground_truth strings divided by the number of unigrams in the ground truth string.
- **rougeL_fmeasure**: This is the harmonic mean of the ROUGE-L precision and recall scores.
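To make the relationship between precision, recall and F-measure concrete, here is a small sketch of ROUGE-N following the standard definitions (precision relative to the prediction, recall relative to the ground truth). It is not the `rouge-score` package: the helper names are hypothetical, and the simple lowercase, whitespace tokenization is an assumption, so the package may produce different values on real text.

```python
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Multiset of the n-grams found in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction: str, ground_truth: str, n: int = 1) -> dict[str, float]:
    """ROUGE-N precision, recall and F-measure as fractions in [0, 1].

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    pred = ngram_counts(prediction.lower().split(), n)
    ref = ngram_counts(ground_truth.lower().split(), n)
    # Clipped overlap: an n-gram counts at most as often as it occurs in both.
    overlap = sum((pred & ref).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


print(rouge_n("the cat", "the cat sat down", n=1))
```

In this example every predicted unigram appears in the ground truth (precision 1.0), but only half of the ground-truth unigrams are covered (recall 0.5), so the F-measure is about 0.67. Multiplying by 100 gives the percentage form used in the table above.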

## BERT-based semantic similarity

The following set of metrics calculates the semantic similarity between two strings, expressed as a percentage, based on
embeddings created by different BERT models. These metrics are backed by the [sentence-transformers Python
package](https://pypi.org/project/sentence-transformers/).

| Calculation Base     | Possible Values    |
| -------------------- | ------------------ |
| `actual`, `expected` | Percentage (0-100) |

| Configuration Key                            | BERT Model                                   |
| -------------------------------------------- | -------------------------------------------- |
| `bert_all_MiniLM_L6_v2`                      | MiniLM L6 v2 model                           |
| `bert_base_nli_mean_tokens`                  | Base model, mean tokens                      |
| `bert_large_nli_mean_tokens`                 | Large model, mean tokens                     |
| `bert_large_nli_stsb_mean_tokens`            | Large model, STS-B, mean tokens              |
| `bert_distilbert_base_nli_stsb_mean_tokens`  | DistilBERT base model, STS-B, mean tokens    |
| `bert_paraphrase_multilingual_MiniLM_L12_v2` | Multilingual paraphrase model, MiniLM L12 v2 |
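Embedding-based similarity is commonly computed as the cosine of the angle between the two embedding vectors. The sketch below assumes that approach and uses tiny made-up vectors in place of real model output. The function name `embedding_similarity`, the clamping of negative values to 0, and the rounding are all assumptions for illustration, not the script's confirmed behavior.

```python
from math import sqrt
from typing import Sequence


def embedding_similarity(vec_a: Sequence[float], vec_b: Sequence[float]) -> int:
    """Cosine similarity of two embedding vectors as a 0-100 percentage.

    Assumptions: negative cosine values are clamped to 0 and the result
    is rounded to the nearest integer.
    """
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = sqrt(sum(a * a for a in vec_a)) * sqrt(sum(b * b for b in vec_b))
    if norm == 0:
        return 0  # an all-zero vector carries no direction to compare
    return round(max(dot / norm, 0.0) * 100)


# Hypothetical 2-dimensional "embeddings"; real models produce
# vectors with hundreds of dimensions.
print(embedding_similarity([1.0, 1.0], [1.0, 0.0]))
```

The example vectors are 45 degrees apart, so the cosine is about 0.707 and the sketch reports 71. In practice the vectors would come from encoding the `actual` and `expected` strings with the selected model.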

## LLM-based Metrics

The following metrics are calculated based on LLM reasoning. These metrics require the OpenAI endpoint to be configured
(see [Environment Variables](./environment-variables.md)).

These metrics also require the `chat_model_name` property to be set in the `search_config.json` configuration file. See
[Description of configuration elements](../README.md#description-of-configuration-elements) for details.

### LLM Answer relevance

| Configuration Key  | Calculation Base     | Possible Values                   |
| ------------------ | -------------------- | --------------------------------- |
| `llm_answer_relevance` | `actual`, `expected` | From 0 to 1 with 1 being the best |

Scores the relevancy of the answer according to the given question. Answers with incomplete, redundant or unnecessary
information are penalized.

### LLM Context precision

| Configuration Key   | Calculation Base    | Possible Values                                                   |
| ------------------- | ------------------- | ----------------------------------------------------------------- |
| `llm_context_precision` | `question`, `retrieved_contexts` | Percentage (0-100) |

Measures the proportion of retrieved contexts that are relevant to the question. It evaluates whether the context retrieved by the RAG solution is useful for answering the question.

### LLM Context recall

| Configuration Key   | Calculation Base    | Possible Values                                                   |
| ------------------- | ------------------- | ----------------------------------------------------------------- |
| `llm_context_recall` | `question`, `expected`, `retrieved_contexts` | Percentage (0-100) |

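Estimates context recall from the true positives (TP) and false negatives (FN) obtained by comparing the annotated answer (ground truth) with the retrieved contexts: a ground truth sentence that can be attributed to the retrieved contexts counts as a TP, and one that cannot counts as an FN. In an ideal scenario, all sentences in the ground truth answer should be attributable to the retrieved context.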
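Once each ground truth sentence has been judged, the final arithmetic is the usual recall formula TP / (TP + FN). The sketch below covers only that last step. The LLM judgment itself is omitted, and the function name `context_recall` and the list of boolean labels are hypothetical stand-ins for its output.

```python
def context_recall(attributed: list[bool]) -> int:
    """Recall as a percentage from per-sentence attribution labels.

    Each entry is True when a ground truth sentence could be attributed
    to the retrieved contexts (TP) and False when it could not (FN).
    """
    if not attributed:
        return 0  # assumption: no sentences to judge yields a score of 0
    true_positives = sum(attributed)
    false_negatives = len(attributed) - true_positives
    return true_positives * 100 // (true_positives + false_negatives)


# Hypothetical labels for a four-sentence ground truth answer.
print(context_recall([True, True, False, True]))
```

With three of four sentences attributable to the retrieved contexts, the score is 75.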
