# FlowerTune LLM Evaluation

This directory provides various evaluation metrics to assess the quality of your fine-tuned LLMs.
If you are participating in the [LLM Leaderboard](https://flower.ai/benchmarks/llm-leaderboard), evaluating your fine-tuned LLM is the final step before your submission is added to the [LLM Leaderboard](https://flower.ai/benchmarks/llm-leaderboard#how-to-participate). The evaluation scores generated here are displayed as the definitive values on the LLM Leaderboard.

## How to run

Navigate to the directory corresponding to your selected challenge ([`general NLP`](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm/evaluation/general-nlp), [`finance`](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm/evaluation/finance), [`medical`](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm/evaluation/medical), or [`code`](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm/evaluation/code)) and follow the instructions there to execute the evaluation.

> [!NOTE]  
> If you wish to participate in the LLM Leaderboard, you must not modify the evaluation code and should use the exact command provided in the respective directory to run the evaluation.


## Baseline results

For each challenge, the default template generated by `flwr new` (see the [Project Creation Instructions](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm#create-a-new-project)) produces the results below, which serve as the lower bound on the LLM Leaderboard.

### General NLP

|         | STEM  | Social Sciences | Humanities |  Avg  |
|:-------:|:-----:|:---------------:|:----------:|:-----:|
| Acc (%) | 12.37 |      13.49      |   12.60    | 12.82 |

### Finance

|         |  FPB  | FIQA  | TFNS  |  Avg  |
|:-------:|:-----:|:-----:|:-----:|:-----:|
| Acc (%) | 44.55 | 62.50 | 28.77 | 45.27 |

### Medical

|         | PubMedQA | MedMCQA | MedQA |  Avg  |
|:-------:|:--------:|:-------:|:-----:|:-----:|
| Acc (%) |  59.00   |  23.69  | 27.10 | 36.60 |

### Code

|            | MBPP  | HumanEval | MultiPL-E (JS) | MultiPL-E (C++) |  Avg  |
|:----------:|:-----:|:---------:|:--------------:|:---------------:|:-----:|
| Pass@1 (%) | 31.60 |   23.78   |     28.57      |      25.47      | 27.36 |

> [!NOTE]  
> In the LLM Leaderboard, we rank the submissions based on the **average** value derived from different evaluation datasets for each challenge.
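
As a rough illustration of how the `Avg` column relates to the per-dataset scores, the sketch below recomputes it from the baseline tables above. It assumes the average is an unweighted arithmetic mean rounded half-up to two decimals, which matches the listed values but is not confirmed by the leaderboard rules; `BASELINES` and `average_score` are hypothetical names used only for this example.

```python
from decimal import Decimal, ROUND_HALF_UP

# Baseline per-dataset scores (in %) copied from the tables above.
# Kept as strings so Decimal parses them exactly.
BASELINES = {
    "general-nlp": ["12.37", "13.49", "12.60"],
    "finance": ["44.55", "62.50", "28.77"],
    "medical": ["59.00", "23.69", "27.10"],
    "code": ["31.60", "23.78", "28.57", "25.47"],
}


def average_score(scores):
    """Unweighted mean of per-dataset scores, rounded to two decimals.

    Assumption: plain arithmetic mean with half-up rounding.
    """
    values = [Decimal(s) for s in scores]
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    for challenge, scores in BASELINES.items():
        print(f"{challenge}: {average_score(scores)}")
```

To compare your own run against a baseline, you could pass your per-dataset scores to `average_score` and check whether the result exceeds the corresponding `Avg` value in the table.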


## Submit to the FlowerTune LLM Leaderboard

If your LLM outperforms the baseline results listed above in any challenge,
we encourage you to submit your code and model to the FlowerTune LLM Leaderboard (see the [How-to-participate Instructions](https://flower.ai/benchmarks/llm-leaderboard#how-to-participate)).
