# Awesome-Efficient-LLM
A curated list for **Efficient Large Language Models**

## Full List
  - [Network Pruning / Sparsity](pruning.md)
  - [Knowledge Distillation](knowledge_distillation.md)
  - [Quantization](quantization.md)
  - [Inference Acceleration](inference_acceleration.md)
  - [Efficient MOE](efficient_moe.md)
  - [Efficient Architecture of LLM](efficient_architecture_llm.md)
  - [KV Cache Compression](kv_cache_compression.md)
  - [Text Compression](text_compression.md)
  - [Low-Rank Decomposition](low_rank_decomposition.md)
  - [Hardware / System / Serving](hardware.md)
  - [Tuning](tuning.md)
  - [Efficient Training](efficient_training.md)
  - [Survey or Benchmark](survey.md)

### Please check out all the papers by selecting the sub-area you're interested in. On this main page, only papers released in the past 90 days are shown.

#### 🚀 Updates
* May 29, 2024: We've had this awesome list for a year now :smiling_face_with_three_hearts:! 
* Sep 6, 2023: Added a new subdirectory [project/](project/) to organize efficient LLM projects.
* July 11, 2023: Created a new subdirectory [efficient_plm/](efficient_plm/) to house papers that are applicable to PLMs.

#### 💮 Contributing

If you'd like to include your paper, or need to update any details such as conference information or code URLs, please feel free to submit a pull request. You can generate the required markdown format for each paper by filling in the information in `generate_item.py` and running `python generate_item.py`. We warmly appreciate your contributions to this list. Alternatively, you can email me the links to your paper and code, and I will add your paper to the list at my earliest convenience.
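For orientation, here is a minimal sketch of the row layout used by the tables on this page. It is not the contents of `generate_item.py`; the function `format_row` and all of its parameter names are hypothetical, chosen only to illustrate how the three columns (Title & Authors, Introduction, Links) and the trailing date comment fit together.

```python
from typing import Optional


def format_row(title: str, authors: str, paper_url: str,
               figure_url: Optional[str] = None,
               github_repo: Optional[str] = None,
               conference: Optional[str] = None,
               date_tag: str = "") -> str:
    """Build one markdown table row (hypothetical helper, not the repo's script)."""
    github_url = f"https://github.com/{github_repo}" if github_repo else None

    # Optional badges: GitHub stars and publication venue.
    badges = ""
    if github_repo:
        badges += ("[![Star](https://img.shields.io/github/stars/"
                   f"{github_repo}.svg?style=social&label=Star)]({github_url})")
    if conference:
        # Spaces in the badge text are URL-encoded as %20.
        label = conference.replace(" ", "%20")
        badges += f"[![Publish](https://img.shields.io/badge/Conference-{label}-blue)]()"

    # Column 1: badges (if any), linked title, authors.
    first = f"[{title}]({paper_url}) <br> {authors} "
    if badges:
        first = f"{badges}<br>{first}"

    # Column 2: overview figure, or a blank cell when none is given.
    image = f'<img width="1002" alt="image" src="{figure_url}"> ' if figure_url else " "

    # Column 3: code link (if any) followed by the paper link.
    links = f"[Paper]({paper_url})"
    if github_url:
        links = f"[Github]({github_url}) <br> {links}"

    # The trailing markdown comment records when the entry was added.
    return f"|{first}|{image}|{links}|[//]: #{date_tag}"


print(format_row(
    title="Scaling Law for Post-training after Model Pruning",
    authors="Xiaodong Chen, Yuxuan Hu, Jing Zhang, Xiaokang Zhang, Cuiping Li, Hong Chen",
    paper_url="https://arxiv.org/abs/2411.10272",
    date_tag="11/24",
))
```

The printed line can be pasted under the matching topic table. Please still prefer the output of `generate_item.py` when submitting a pull request, since the sketch above may differ from it in detail.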

#### :star: Recommended Papers

For each topic, we have curated a list of recommended papers that have garnered a lot of GitHub stars or citations.


## Papers from Sep 30, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))

### Quick Link 
  - [Network Pruning / Sparsity](#network-pruning--sparsity)
  - [Knowledge Distillation](#knowledge-distillation)
  - [Quantization](#quantization)
  - [Inference Acceleration](#inference-acceleration)
  - [Efficient MOE](#efficient_moe)
  - [Efficient Architecture of LLM](#efficient-architecture-of-llm)
  - [KV Cache Compression](#kv-cache-compression)
  - [Text Compression](#text-compression)
  - [Low-Rank Decomposition](#low-rank-decomposition)
  - [Hardware / System / Serving](#hardwaresystemserving)
  - [Tuning](#tuning)
  - [Survey or Benchmark](#survey-or-benchmark)

### Network Pruning / Sparsity
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
| [![Star](https://img.shields.io/github/stars/IST-DASLab/sparsegpt.svg?style=social&label=Star)](https://github.com/IST-DASLab/sparsegpt) [![Publish](https://img.shields.io/badge/Conference-ICML'23-blue)]() [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]() <br> :star: [SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot](https://arxiv.org/abs/2301.00774) <br> Elias Frantar, Dan Alistarh| <img width="522" alt="image" src="figures/sparsegpt.png"> |[Github](https://github.com/IST-DASLab/sparsegpt) <br> [Paper](https://arxiv.org/abs/2301.00774) | [//]: #Recommend
| [![Star](https://img.shields.io/github/stars/horseee/LLM-Pruner.svg?style=social&label=Star)](https://github.com/horseee/LLM-Pruner) [![Publish](https://img.shields.io/badge/Conference-NeurIPS'23-blue)]() [![Type](https://img.shields.io/badge/Structural-C2A4A6)]() <br> :star: [LLM-Pruner: On the Structural Pruning of Large Language Models](https://arxiv.org/abs/2305.11627) <br> Xinyin Ma, Gongfan Fang, Xinchao Wang | <img width="561" alt="image" src="figures/llm_pruner.png">| [Github](https://github.com/horseee/LLM-Pruner) <br> [Paper](https://arxiv.org/abs/2305.11627)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/locuslab/wanda.svg?style=social&label=Star)](https://github.com/locuslab/wanda) [![Publish](https://img.shields.io/badge/Conference-ICLR'24-blue)]() [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]()  <br> :star: [A Simple and Effective Pruning Approach for Large Language Models](https://arxiv.org/abs/2306.11695) <br> Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter |<img width="1002" alt="image" src="https://user-images.githubusercontent.com/20168304/245999360-f951de47-269d-491d-826a-8e6d85627849.png"> |[Github](https://github.com/locuslab/wanda) <br> [Paper](https://arxiv.org/abs/2306.11695)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/princeton-nlp/LLM-Shearing.svg?style=social&label=Star)](https://github.com/princeton-nlp/LLM-Shearing) [![Publish](https://img.shields.io/badge/Conference-ICLR'24-blue)]() [![Type](https://img.shields.io/badge/Structural-C2A4A6)]() <br> :star: [Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning](https://arxiv.org/abs/2310.06694) <br> Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen |<img width="1002" alt="image" src="figures/LLM-shearing.png"> |[Github](https://github.com/princeton-nlp/LLM-Shearing) <br> [Paper](https://arxiv.org/abs/2310.06694)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/NVlabs/MaskLLM.svg?style=social&label=Star)](https://github.com/NVlabs/MaskLLM) [![Publish](https://img.shields.io/badge/Conference-NeurIPS'24-blue)]() [![Type](https://img.shields.io/badge/Semi_Structured-C2A4A6)]() <br> :star: [MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models](https://arxiv.org/abs/2409.17481) <br> Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff Pool, Jan Kautz, Pavlo Molchanov, Xinchao Wang |<img width="302" alt="image" src="https://github.com/NVlabs/MaskLLM/raw/main/assets/animation-LQ.gif"> |[Github](https://github.com/NVlabs/MaskLLM) <br> [Paper](https://arxiv.org/abs/2409.17481)|[//]: #Recommend
|[HashAttention: Semantic Sparsity for Faster Inference](https://arxiv.org/abs/2412.14468) <br> Aditya Desai, Shuo Yang, Alejandro Cuadron, Ana Klimovic, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica |<img width="1002" alt="image" src="https://arxiv.org/html/2412.14468v1/extracted/6081011/images/sparseatt.png"> |[Paper](https://arxiv.org/abs/2412.14468)|[//]: #12/30
|[Adaptive Pruning for Large Language Models with Structural Importance Awareness](https://arxiv.org/abs/2412.15127) <br> Haotian Zheng, Jinke Ren, Yushan Sun, Ruichen Zhang, Wenbo Zhang, Zhen Li, Dusit Niyato, Shuguang Cui, Yatong Han |<img width="1002" alt="image" src="https://arxiv.org/html/2412.15127v1/x2.png"> |[Paper](https://arxiv.org/abs/2412.15127)|[//]: #12/30
|[SlimGPT: Layer-wise Structured Pruning for Large Language Models](https://arxiv.org/abs/2412.18110) <br> Gui Ling, Ziyang Wang, Yuliang Yan, Qingwen Liu |<img width="302" alt="image" src="https://arxiv.org/html/2412.18110v1/x1.png"> |[Paper](https://arxiv.org/abs/2412.18110)|[//]: #12/30
|[Less is More: Towards Green Code Large Language Models via Unified Structural Pruning](https://arxiv.org/abs/2412.15921) <br> Guang Yang, Yu Zhou, Xiangyu Zhang, Wei Cheng, Ke Liu, Xiang Chen, Terry Yue Zhuo, Taolue Chen |<img width="1002" alt="image" src="figures/Flab-Pruner.png"> |[Paper](https://arxiv.org/abs/2412.15921)|[//]: #12/30
|[Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking](https://arxiv.org/abs/2412.01380) <br> Marco Federici, Davide Belli, Mart van Baalen, Amir Jalalirad, Andrii Skliar, Bence Major, Markus Nagel, Paul Whatmough |<img width="1002" alt="image" src="https://arxiv.org/html/2412.01380v1/x1.png"> |[Paper](https://arxiv.org/abs/2412.01380)|[//]: #12/09
|[Puzzle: Distillation-Based NAS for Inference-Optimized LLMs](https://arxiv.org/abs/2411.19146) <br> Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammad Dabbah et al. |<img width="1002" alt="image" src="https://arxiv.org/html/2411.19146v2/x1.png"> |[Paper](https://arxiv.org/abs/2411.19146)|[//]: #12/09
|[![Star](https://img.shields.io/github/stars/yaolu-zjut/Navigation-LLM-layer-pruning.svg?style=social&label=Star)](https://github.com/yaolu-zjut/Navigation-LLM-layer-pruning)<br>[Reassessing Layer Pruning in LLMs: New Insights and Methods](https://arxiv.org/abs/2411.15558) <br> Yao Lu, Hao Cheng, Yujie Fang, Zeyu Wang, Jiaheng Wei, Dongwei Xu, Qi Xuan, Xiaoniu Yang, Zhaowei Zhu |<img width="1002" alt="image" src="https://github.com/yaolu-zjut/Navigation-LLM-layer-pruning/raw/main/framework.JPG"> |[Github](https://github.com/yaolu-zjut/Navigation-LLM-layer-pruning) <br> [Paper](https://arxiv.org/abs/2411.15558)|[//]: #12/03
|[Layer Importance and Hallucination Analysis in Large Language Models via Enhanced Activation Variance-Sparsity](https://arxiv.org/abs/2411.10069) <br> Zichen Song, Sitan Huang, Yuxin Wu, Zhongfeng Kang |<img width="1002" alt="image" src="https://arxiv.org/html/2411.10069v1/x1.png"> |[Paper](https://arxiv.org/abs/2411.10069)|[//]: #11/24
|[![Star](https://img.shields.io/github/stars/GATECH-EIC/AmoebaLLM.svg?style=social&label=Star)](https://github.com/GATECH-EIC/AmoebaLLM)[![Publish](https://img.shields.io/badge/Conference-NeurIPS'24-blue)]()<br>[AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment](https://arxiv.org/abs/2411.10606) <br> Yonggan Fu, Zhongzhi Yu, Junwei Li, Jiayi Qian, Yongan Zhang, Xiangchi Yuan, Dachuan Shi, Roman Yakunin, Yingyan Celine Lin |<img width="1002" alt="image" src="https://arxiv.org/html/2411.10606v1/x2.png"> |[Github](https://github.com/GATECH-EIC/AmoebaLLM) <br> [Paper](https://arxiv.org/abs/2411.10606)|[//]: #11/24
|[Scaling Law for Post-training after Model Pruning](https://arxiv.org/abs/2411.10272) <br> Xiaodong Chen, Yuxuan Hu, Jing Zhang, Xiaokang Zhang, Cuiping Li, Hong Chen | |[Paper](https://arxiv.org/abs/2411.10272)|[//]: #11/24
|[![Star](https://img.shields.io/github/stars/hexuandeng/DRPruning.svg?style=social&label=Star)](https://github.com/hexuandeng/DRPruning)<br>[DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization](https://arxiv.org/abs/2411.14055) <br> Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Min Zhang, Zhaopeng Tu |<img width="1002" alt="image" src="https://github.com/hexuandeng/DRPruning/raw/main/pic/main.png"> |[Github](https://github.com/hexuandeng/DRPruning) <br> [Paper](https://arxiv.org/abs/2411.14055)|[//]: #11/24
|[![Star](https://img.shields.io/github/stars/thunlp/SparsingLaw.svg?style=social&label=Star)](https://github.com/thunlp/SparsingLaw)<br>[Sparsing Law: Towards Large Language Models with Greater Activation Sparsity](https://arxiv.org/abs/2411.02335) <br> Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Zhiyuan Liu, Maosong Sun |<img width="1002" alt="image" src="https://github.com/thunlp/SparsingLaw/raw/master/figs/sample.jpg"> |[Github](https://github.com/thunlp/SparsingLaw) <br> [Paper](https://arxiv.org/abs/2411.02335)|[//]: #11/18
|[AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis](https://arxiv.org/abs/2411.02117) <br> Zichen Song, Yuxin Wu, Sitan Huang, Zhongfeng Kang |<img width="1002" alt="image" src="https://arxiv.org/html/2411.02117v1/x1.png"> |[Paper](https://arxiv.org/abs/2411.02117)|[//]: #11/18
|[Tailored-LLaMA: Optimizing Few-Shot Learning in Pruned LLaMA Models with Task-Specific Prompts](https://arxiv.org/abs/2410.19185) <br> Danyal Aftab, Steven Davy |<img width="1002" alt="image" src="https://arxiv.org/html/2410.19185v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.19185)|[//]: #11/18
|[![Star](https://img.shields.io/github/stars/AboveParadise/LLMCBench.svg?style=social&label=Star)](https://github.com/AboveParadise/LLMCBench)<br>[LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment](https://arxiv.org/abs/2410.21352) <br> Ge Yang, Changyi He, Jinyang Guo, Jianyu Wu, Yifu Ding, Aishan Liu, Haotong Qin, Pengliang Ji, Xianglong Liu |<img width="1002" alt="image" src="https://github.com/AboveParadise/LLMCBench/raw/main/figs/f1.png"> |[Github](https://github.com/AboveParadise/LLMCBench) <br> [Paper](https://arxiv.org/abs/2410.21352)|[//]: #11/17
|[Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs](https://arxiv.org/abs/2410.16135) <br> Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen |<img width="1002" alt="image" src="https://arxiv.org/html/2410.16135v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.16135)|[//]: #10/30
|[![Star](https://img.shields.io/github/stars/IST-DASLab/EvoPress.svg?style=social&label=Star)](https://github.com/IST-DASLab/EvoPress)<br>[EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search](https://arxiv.org/abs/2410.14649) <br> Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, Dan Alistarh |<img width="1002" alt="image" src="figures/evopress.png"> |[Github](https://github.com/IST-DASLab/EvoPress) <br> [Paper](https://arxiv.org/abs/2410.14649)|[//]: #10/30
|[FedSpaLLM: Federated Pruning of Large Language Models](https://arxiv.org/abs/2410.14852) <br> Guangji Bai, Yijiang Li, Zilinghan Li, Liang Zhao, Kibaek Kim |<img width="1002" alt="image" src="https://arxiv.org/html/2410.14852v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.14852)|[//]: #10/30
|[![Star](https://img.shields.io/github/stars/piuzha/APT.svg?style=social&label=Star)](https://github.com/piuzha/APT)<br>[Pruning Foundation Models for High Accuracy without Retraining](https://arxiv.org/abs/2410.15567) <br> Pu Zhao, Fei Sun, Xuan Shen, Pinrui Yu, Zhenglun Kong, Yanzhi Wang, Xue Lin | |[Github](https://github.com/piuzha/APT) <br> [Paper](https://arxiv.org/abs/2410.15567)|[//]: #10/30
|[Self-calibration for Language Model Quantization and Pruning](https://arxiv.org/abs/2410.17170) <br> Miles Williams, George Chrysostomou, Nikolaos Aletras |<img width="1002" alt="image" src="https://arxiv.org/html/2410.17170v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.17170)|[//]: #10/29
|[Beware of Calibration Data for Pruning Large Language Models](https://arxiv.org/abs/2410.17711) <br> Yixin Ji, Yang Xiang, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang | |[Paper](https://arxiv.org/abs/2410.17711)|[//]: #10/29
|[![Star](https://img.shields.io/github/stars/haiquanlu/AlphaPruning.svg?style=social&label=Star)](https://github.com/haiquanlu/AlphaPruning)[![Publish](https://img.shields.io/badge/Conference-NeurIPS'24-blue)]()<br>[AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models](https://arxiv.org/abs/2410.10912) <br> Haiquan Lu, Yefan Zhou, Shiwei Liu, Zhangyang Wang, Michael W. Mahoney, Yaoqing Yang |<img width="1002" alt="image" src="https://arxiv.org/html/2410.10912v1/x1.png"> |[Github](https://github.com/haiquanlu/AlphaPruning) <br> [Paper](https://arxiv.org/abs/2410.10912)|[//]: #10/21
|[Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix](https://arxiv.org/abs/2410.11261) <br> Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Yufa Zhou |<img width="1002" alt="image" src="https://arxiv.org/html/2410.11261v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.11261)|[//]: #10/21
|[![Publish](https://img.shields.io/badge/Conference-NeurIPS'24-blue)]()<br>[DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models](https://arxiv.org/abs/2410.11988) <br> Shangqian Gao, Chi-Heng Lin, Ting Hua, Tang Zheng, Yilin Shen, Hongxia Jin, Yen-Chang Hsu |<img width="1002" alt="image" src="https://arxiv.org/html/2410.11988v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.11988)|[//]: #10/21
|[![Publish](https://img.shields.io/badge/Conference-NeurIPS'24%20Workshop-blue)]()<br>[Self-Data Distillation for Recovering Quality in Pruned Large Language Models](https://arxiv.org/abs/2410.09982) <br> Vithursan Thangarasa, Ganesh Venkatesh, Nish Sinnadurai, Sean Lie |<img width="1002" alt="image" src="https://arxiv.org/html/2410.09982v2/x1.png"> |[Paper](https://arxiv.org/abs/2410.09982)|[//]: #10/21
|[LLM-Rank: A Graph Theoretical Approach to Pruning Large Language Models](https://arxiv.org/abs/2410.13299) <br> David Hoffmann, Kailash Budhathoki, Matthaeus Kleindessner |<img width="1002" alt="image" src="https://arxiv.org/html/2410.13299v1/extracted/5931028/img/llm_to_mlp.png"> |[Paper](https://arxiv.org/abs/2410.13299)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/abx393/llm-pruning-calibration-data.svg?style=social&label=Star)](https://github.com/abx393/llm-pruning-calibration-data)[![Publish](https://img.shields.io/badge/Conference-EMNLP'24-blue)]()<br>[Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning](https://arxiv.org/abs/2410.07461) <br> Abhinav Bandari, Lu Yin, Cheng-Yu Hsieh, Ajay Kumar Jaiswal, Tianlong Chen, Li Shen, Ranjay Krishna, Shiwei Liu |<img width="1002" alt="image" src="https://arxiv.org/html/2410.07461v1/x1.png"> |[Github](https://github.com/abx393/llm-pruning-calibration-data) <br> [Paper](https://arxiv.org/abs/2410.07461)|[//]: #10/13
|[Mitigating Copy Bias in In-Context Learning through Neuron Pruning](https://arxiv.org/abs/2410.01288) <br> Ameen Ali, Lior Wolf, Ivan Titov |<img width="1002" alt="image" src="figures/copy_icl.png"> |[Paper](https://arxiv.org/abs/2410.01288)|[//]: #10/04
|[![Star](https://img.shields.io/github/stars/IntelLabs/Hardware-Aware-Automated-Machine-Learning.svg?style=social&label=Star)](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SQFT)[![Publish](https://img.shields.io/badge/Conference-EMNLP'24%20Findings-blue)]() [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]() [![Type](https://img.shields.io/badge/w/Quantization-39B0A9)]() <br>[SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models](https://arxiv.org/abs/2410.03750) <br> Juan Pablo Munoz, Jinjie Yuan, Nilesh Jain |<img width="1002" alt="image" src="figures/SQFT.png"> |[Github](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SQFT) <br> [Paper](https://arxiv.org/abs/2410.03750)|[//]: #10/01




### Knowledge Distillation
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|:star: [Knowledge Distillation of Large Language Models](https://arxiv.org/abs/2306.08543) <br> Yuxian Gu, Li Dong, Furu Wei, Minlie Huang |<img width="1002" alt="image" src="https://github.com/microsoft/LMOps/raw/main/minillm/figures/method.png"> |[Github](https://github.com/microsoft/LMOps/tree/main/minillm) <br> [Paper](https://arxiv.org/abs/2306.08543)| [//]: #Recommend
|[![Publish](https://img.shields.io/badge/Conference-COLING'25-blue)]()<br>[Self-Evolution Knowledge Distillation for LLM-based Machine Translation](https://arxiv.org/abs/2412.15303) <br> Yuncheng Song, Liang Ding, Changtong Zan, Shujian Huang |<img width="1002" alt="image" src="https://arxiv.org/html/2412.15303v1/extracted/6081708/model_two.png"> |[Paper](https://arxiv.org/abs/2412.15303)|[//]: #12/30
|[Large Language Models Compression via Low-Rank Feature Distillation](https://arxiv.org/abs/2412.16719) <br> Yaya Sy, Christophe Cerisara, Irina Illina |<img width="302" alt="image" src="https://arxiv.org/html/2412.16719v1/x1.png"> |[Paper](https://arxiv.org/abs/2412.16719)|[//]: #12/30
|[![Star](https://img.shields.io/github/stars/HITSZ-HLT/FSA-Distillation.svg?style=social&label=Star)](https://github.com/HITSZ-HLT/FSA-Distillation)<br>[Distilling Fine-grained Sentiment Understanding from Large Language Models](https://arxiv.org/abs/2412.18552) <br> Yice Zhang, Guangyu Xie, Hongling Xu, Kaiheng Hou, Jianzhu Bao, Qianlong Wang, Shiwei Chen, Ruifeng Xu |<img width="302" alt="image" src="https://arxiv.org/html/2412.18552v1/x1.png"> |[Github](https://github.com/HITSZ-HLT/FSA-Distillation) <br> [Paper](https://arxiv.org/abs/2412.18552)|[//]: #12/30
|[![Star](https://img.shields.io/github/stars/alonso130r/knowledge-distillation.svg?style=social&label=Star)](https://github.com/alonso130r/knowledge-distillation)<br>[Enhancing Knowledge Distillation for LLMs with Response-Priming Prompting](https://arxiv.org/abs/2412.17846) <br> Vijay Goyal, Mustafa Khan, Aprameya Tirupati, Harveer Saini, Michael Lam, Kevin Zhu |<img width="1002" alt="image" src="https://arxiv.org/html/2412.17846v1/extracted/6080471/prompt-example.png"> |[Github](https://github.com/alonso130r/knowledge-distillation) <br> [Paper](https://arxiv.org/abs/2412.17846)|[//]: #12/30
|[Improving Mathematical Reasoning Capabilities of Small Language Models via Feedback-Driven Distillation](https://arxiv.org/abs/2411.14698) <br> Xunyu Zhu, Jian Li, Can Ma, Weiping Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2411.14698v1/x1.png"> |[Paper](https://arxiv.org/abs/2411.14698)|[//]: #12/03
|[![Star](https://img.shields.io/github/stars/kaistai/generative-context-distillation.svg?style=social&label=Star)](https://github.com/kaistai/generative-context-distillation)<br>[Generative Context Distillation](https://arxiv.org/abs/2411.15927) <br> Haebin Shin, Lei Ji, Yeyun Gong, Sungdong Kim, Eunbi Choi, Minjoon Seo |<img width="1002" alt="image" src="figures/GCD.png"> |[Github](https://github.com/kaistai/generative-context-distillation) <br> [Paper](https://arxiv.org/abs/2411.15927)|[//]: #12/02
|[SWITCH: Studying with Teacher for Knowledge Distillation of Large Language Models](https://arxiv.org/abs/2410.19503) <br> Jahyun Koo, Yerin Hwang, Yongil Kim, Taegwan Kang, Hyunkyung Bae, Kyomin Jung |<img width="1002" alt="image" src="figures/switch.png"> |[Paper](https://arxiv.org/abs/2410.19503)|[//]: #11/17
|[![Star](https://img.shields.io/github/stars/jdeschena/sdtt.svg?style=social&label=Star)](https://github.com/jdeschena/sdtt)<br>[Beyond Autoregression: Fast LLMs via Self-Distillation Through Time](https://arxiv.org/abs/2410.21035) <br> Justin Deschenaux, Caglar Gulcehre |<img width="1002" alt="image" src="https://arxiv.org/html/2410.21035v1/x3.png"> |[Github](https://github.com/jdeschena/sdtt) <br> [Paper](https://arxiv.org/abs/2410.21035)|[//]: #11/17
|[Pre-training Distillation for Large Language Models: A Design Space Exploration](https://arxiv.org/abs/2410.16215) <br> Hao Peng, Xin Lv, Yushi Bai, Zijun Yao, Jiajie Zhang, Lei Hou, Juanzi Li | |[Paper](https://arxiv.org/abs/2410.16215)|[//]: #10/30
|[![Star](https://img.shields.io/github/stars/thu-coai/MiniPLM.svg?style=social&label=Star)](https://github.com/thu-coai/MiniPLM)<br>[MiniPLM: Knowledge Distillation for Pre-Training Language Models](https://arxiv.org/abs/2410.17215) <br> Yuxian Gu, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang |<img width="1002" alt="image" src="https://github.com/thu-coai/MiniPLM/raw/main/figures/method.png"> |[Github](https://github.com/thu-coai/MiniPLM) <br> [Paper](https://arxiv.org/abs/2410.17215)|[//]: #10/29
|[Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling](https://arxiv.org/abs/2410.11325) <br> Wenda Xu, Rujun Han, Zifeng Wang, Long T. Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister |<img width="1002" alt="image" src="https://arxiv.org/html/2410.11325v1/x2.png"> |[Paper](https://arxiv.org/abs/2410.11325)|[//]: #10/21
|[Evolutionary Contrastive Distillation for Language Model Alignment](https://arxiv.org/abs/2410.07513) <br> Julian Katz-Samuels, Zheng Li, Hyokun Yun, Priyanka Nigam, Yi Xu, Vaclav Petricek, Bing Yin, Trishul Chilimbi |<img width="1002" alt="image" src="https://arxiv.org/html/2410.07513v1/extracted/5913898/figures/main_alg_v3.png"> |[Paper](https://arxiv.org/abs/2410.07513)|[//]: #10/13



### Quantization
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|[![Star](https://img.shields.io/github/stars/IST-DASLab/gptq.svg?style=social&label=Star)](https://github.com/IST-DASLab/gptq)[![Publish](https://img.shields.io/badge/Conference-ICLR'23-blue)]()<br> :star: [GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323) <br> Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh |<img width="202" alt="image" src="figures/GPTQ.png"> |[Github](https://github.com/IST-DASLab/gptq) <br> [Paper](https://arxiv.org/abs/2210.17323)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/mit-han-lab/smoothquant.svg?style=social&label=Star)](https://github.com/mit-han-lab/smoothquant)[![Publish](https://img.shields.io/badge/Conference-ICML'23-blue)]() <br> :star: [SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](https://arxiv.org/abs/2211.10438) <br> Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han |<img width="1002" alt="image" src="https://github.com/mit-han-lab/smoothquant/raw/main/figures/intuition.png"> |[Github](https://github.com/mit-han-lab/smoothquant) <br> [Paper](https://arxiv.org/abs/2211.10438)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/mit-han-lab/llm-awq.svg?style=social&label=Star)](https://github.com/mit-han-lab/llm-awq) <br> :star: [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/abs/2306.00978) <br> Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Song Han |<img width="1002" alt="image" src="https://github.com/mit-han-lab/llm-awq/raw/main/figures/overview.png"> |[Github](https://github.com/mit-han-lab/llm-awq) <br> [Paper](https://arxiv.org/abs/2306.00978)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/OpenGVLab/OmniQuant.svg?style=social&label=Star)](https://github.com/OpenGVLab/OmniQuant)[![Publish](https://img.shields.io/badge/Conference-ICLR'24-blue)]()<br> :star: [OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models](https://arxiv.org/abs/2308.13137) <br> Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo |<img width="1002" alt="image" src="figures/omniquant.png"> |[Github](https://github.com/OpenGVLab/OmniQuant) <br> [Paper](https://arxiv.org/abs/2308.13137)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/utkarsh-dmx/project-resq.svg?style=social&label=Star)](https://github.com/utkarsh-dmx/project-resq)<br>[ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals](https://arxiv.org/abs/2412.14363) <br> Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang |<img width="1002" alt="image" src="figures/ResQ.png"> |[Github](https://github.com/utkarsh-dmx/project-resq) <br> [Paper](https://arxiv.org/abs/2412.14363)|[//]: #12/30
|[MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design](https://arxiv.org/abs/2412.14590) <br> Zhen Zheng, Xiaonan Song, Chuanjie Liu |<img width="1002" alt="image" src="https://arxiv.org/html/2412.14590v1/x1.png"> |[Paper](https://arxiv.org/abs/2412.14590)|[//]: #12/30
|[GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference](https://arxiv.org/abs/2412.17560) <br> Chao Zeng, Songwei Liu, Shu Yang, Fangmin Chen, Xing Mei, Lean Fu |<img width="1002" alt="image" src="https://arxiv.org/html/2412.17560v1/extracted/6090667/GQS_block.png"> |[Paper](https://arxiv.org/abs/2412.17560)|[//]: #12/30
|[LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment](https://arxiv.org/abs/2412.18135) <br> Binrui Zeng, Bin Ji, Xiaodong Liu, Jie Yu, Shasha Li, Jun Ma, Xiaopeng Li, Shangwen Wang, Xinran Hong |<img width="1002" alt="image" src="https://arxiv.org/html/2412.18135v1/x1.png"> |[Paper](https://arxiv.org/abs/2412.18135)|[//]: #12/30
|[SKIM: Any-bit Quantization Pushing The Limits of Post-Training Quantization](https://arxiv.org/abs/2412.04180) <br> Runsheng Bai, Qiang Liu, Bo Liu |<img width="1002" alt="image" src="https://arxiv.org/html/2412.04180v1/x2.png"> |[Paper](https://arxiv.org/abs/2412.04180)|[//]: #12/09
|[CPTQuant -- A Novel Mixed Precision Post-Training Quantization Techniques for Large Language Models](https://arxiv.org/abs/2412.03599) <br> Amitash Nanda, Sree Bhargavi Balija, Debashis Sahoo |<img width="1002" alt="image" src="https://arxiv.org/html/2412.03599v1/x3.png"> |[Paper](https://arxiv.org/abs/2412.03599)|[//]: #12/09
|[![Publish](https://img.shields.io/badge/Conference-HPCA'25-blue)]()<br>[Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format](https://arxiv.org/abs/2411.15982) <br> Chao Fang, Man Shi, Robin Geens, Arne Symons, Zhongfeng Wang, Marian Verhelst |<img width="1002" alt="image" src="https://arxiv.org/html/2411.15982v1/x1.png"> |[Paper](https://arxiv.org/abs/2411.15982)|[//]: #12/03
|[MixPE: Quantization and Hardware Co-design for Efficient LLM Inference](https://arxiv.org/abs/2411.16158) <br> Yu Zhang, Mingzi Wang, Lancheng Zou, Wulong Liu, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu |<img width="1002" alt="image" src="https://arxiv.org/html/2411.16158v1/x5.png"> |[Paper](https://arxiv.org/abs/2411.16158)|[//]: #12/03
|[![Star](https://img.shields.io/github/stars/abdelfattah-lab/BitMoD-HPCA-25.svg?style=social&label=Star)](https://github.com/abdelfattah-lab/BitMoD-HPCA-25)[![Publish](https://img.shields.io/badge/Conference-HPCA'25-blue)]()<br>[BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration](https://arxiv.org/abs/2411.11745) <br> Yuzong Chen, Ahmed F. AbouElhamayed, Xilai Dai, Yang Wang, Marta Andronic, George A. Constantinides, Mohamed S. Abdelfattah |<img width="1002" alt="image" src="https://arxiv.org/html/2411.11745v1/x5.png"> |[Github](https://github.com/abdelfattah-lab/BitMoD-HPCA-25) <br> [Paper](https://arxiv.org/abs/2411.11745)|[//]: #11/24
|[AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference](https://arxiv.org/abs/2411.09909) <br> Janghwan Lee, Jiwoong Park, Jinseok Kim, Yongjik Kim, Jungju Oh, Jinwook Oh, Jungwook Choi |<img width="1002" alt="image" src="figures/AMXFP4.png"> |[Paper](https://arxiv.org/abs/2411.09909)|[//]: #11/24
|[Bi-Mamba: Towards Accurate 1-Bit State Space Models](https://arxiv.org/abs/2411.11843) <br> Shengkun Tang, Liqun Ma, Haonan Li, Mingjie Sun, Zhiqiang Shen |<img width="1002" alt="image" src="https://arxiv.org/html/2411.11843v1/x2.png"> |[Paper](https://arxiv.org/abs/2411.11843)|[//]: #11/24
|["Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization](https://arxiv.org/abs/2411.02355) <br> Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh | |[Paper](https://arxiv.org/abs/2411.02355)|[//]: #11/18
|[GWQ: Gradient-Aware Weight Quantization for Large Language Models](https://arxiv.org/abs/2411.00850) <br> Yihua Shao, Siyu Liang, Xiaolin Lin, Zijian Ling, Zixian Zhu et al. |<img width="1002" alt="image" src="https://arxiv.org/html/2411.00850v1/x2.png"> |[Paper](https://arxiv.org/abs/2411.00850)|[//]: #11/18
|[A Comprehensive Study on Quantization Techniques for Large Language Models](https://arxiv.org/abs/2411.02530) <br> Jiedong Lang, Zhehao Guo, Shuyu Huang | |[Paper](https://arxiv.org/abs/2411.02530)|[//]: #11/18
|[BitNet a4.8: 4-bit Activations for 1-bit LLMs](https://arxiv.org/abs/2411.04965) <br> Hongyu Wang, Shuming Ma, Furu Wei |<img width="1002" alt="image" src="https://arxiv.org/html/2411.04965v1/x1.png"> |[Paper](https://arxiv.org/abs/2411.04965)|[//]: #11/18
|[![Star](https://img.shields.io/github/stars/Intelligent-Computing-Lab-Yale/TesseraQ.svg?style=social&label=Star)](https://github.com/Intelligent-Computing-Lab-Yale/TesseraQ)<br>[TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction](https://arxiv.org/abs/2410.19103) <br> Yuhang Li, Priyadarshini Panda |<img width="1002" alt="image" src="https://github.com/Intelligent-Computing-Lab-Yale/TesseraQ/raw/main/imgs/tesseraq.png"> |[Github](https://github.com/Intelligent-Computing-Lab-Yale/TesseraQ) <br> [Paper](https://arxiv.org/abs/2410.19103)|[//]: #11/17
|[![Star](https://img.shields.io/github/stars/xinghaow99/BitStack.svg?style=social&label=Star)](https://github.com/xinghaow99/BitStack)<br>[BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments](https://arxiv.org/abs/2410.23918) <br> Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu |<img width="1002" alt="image" src="https://github.com/xinghaow99/BitStack/raw/main/assets/bitstack.png"> |[Github](https://github.com/xinghaow99/BitStack) <br> [Paper](https://arxiv.org/abs/2410.23918)|[//]: #11/17
|[The Impact of Inference Acceleration Strategies on Bias of LLMs](https://arxiv.org/abs/2410.22118) <br> Elisabeth Kirsten, Ivan Habernal, Vedant Nanda, Muhammad Bilal Zafar | |[Paper](https://arxiv.org/abs/2410.22118)|[//]: #11/17
|[Understanding the difficulty of low-precision post-training quantization of large language models](https://arxiv.org/abs/2410.14570) <br> Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2410.14570v1/extracted/5935973/figures/fig1.png"> |[Paper](https://arxiv.org/abs/2410.14570)|[//]: #10/30
|[![Star](https://img.shields.io/github/stars/microsoft/BitNet.svg?style=social&label=Star)](https://github.com/microsoft/BitNet)<br>[1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs](https://arxiv.org/abs/2410.16144) <br> Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, Furu Wei |<img width="1002" alt="image" src="https://arxiv.org/html/2410.16144v2/x1.png"> |[Github](https://github.com/microsoft/BitNet) <br> [Paper](https://arxiv.org/abs/2410.16144)|[//]: #10/30
|[QuAILoRA: Quantization-Aware Initialization for LoRA](https://arxiv.org/abs/2410.14713) <br> Neal Lawton, Aishwarya Padmakumar, Judith Gaspers, Jack FitzGerald, Anoop Kumar, Greg Ver Steeg, Aram Galstyan | |[Paper](https://arxiv.org/abs/2410.14713)|[//]: #10/30
|[Evaluating Quantized Large Language Models for Code Generation on Low-Resource Language Benchmarks](https://arxiv.org/abs/2410.14766) <br> Enkhbold Nyamsuren | |[Paper](https://arxiv.org/abs/2410.14766)|[//]: #10/30
| [![Star](https://img.shields.io/github/stars/SqueezeAILab/SqueezeLLM.svg?style=social&label=Star)](https://github.com/SqueezeAILab/SqueezeLLM) <br> :star: [SqueezeLLM: Dense-and-Sparse Quantization](https://arxiv.org/pdf/2306.07629.pdf) <br>Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer | <img width="1102" alt="image" src="figures/SqueezeLLM.png"> |[Github](https://github.com/SqueezeAILab/SqueezeLLM) <br> [Paper](https://arxiv.org/pdf/2306.07629.pdf)| [//]: #Recommend
|[Pyramid Vector Quantization for LLMs](https://arxiv.org/abs/2410.16926) <br> Tycho F. A. van der Ouderaa, Maximilian L. Croci, Agrin Hilmkil, James Hensman |<img width="1002" alt="image" src="https://arxiv.org/html/2410.16926v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.16926)|[//]: #10/29
|[SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators](https://arxiv.org/abs/2410.10714) <br> Rasoul Shafipour, David Harrison, Maxwell Horton, Jeffrey Marker, Houman Bedayat, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, Saman Naderiparizi |<img width="1002" alt="image" src="https://arxiv.org/html/2410.10714v2/x1.png"> |[Paper](https://arxiv.org/abs/2410.10714)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/ruikangliu/FlatQuant.svg?style=social&label=Star)](https://github.com/ruikangliu/FlatQuant)<br>[FlatQuant: Flatness Matters for LLM Quantization](https://arxiv.org/abs/2410.09426) <br> Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, Jun Yao |<img width="1002" alt="image" src="https://arxiv.org/html/2410.09426v1/x11.png"> |[Github](https://github.com/ruikangliu/FlatQuant) <br> [Paper](https://arxiv.org/abs/2410.09426)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/Mohammad-Mozaffari/slim.svg?style=social&label=Star)](https://github.com/Mohammad-Mozaffari/slim)<br>[SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs](https://arxiv.org/abs/2410.09615) <br> Mohammad Mozaffari, Maryam Mehri Dehnavi |<img width="1002" alt="image" src="https://arxiv.org/html/2410.09615v1/x1.png"> |[Github](https://github.com/Mohammad-Mozaffari/slim) <br> [Paper](https://arxiv.org/abs/2410.09615)|[//]: #10/21
|[Scaling laws for post-training quantized large language models](https://arxiv.org/abs/2410.12119) <br> Zifei Xu, Alexander Lan, Wanzin Yazar, Tristan Webb, Sayeh Sharify, Xin Wang |<img width="202" alt="image" src="https://arxiv.org/html/2410.12119v1/extracted/5929616/figures/fig_12.png"> |[Paper](https://arxiv.org/abs/2410.12119)|[//]: #10/21
|[Continuous Approximations for Improving Quantization Aware Training of LLMs](https://arxiv.org/abs/2410.10849) <br> He Li, Jianhang Hong, Yuanzhuo Wu, Snehal Adbol, Zonglin Li | |[Paper](https://arxiv.org/abs/2410.10849)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/LuoYingSong/DAQ.svg?style=social&label=Star)](https://github.com/LuoYingSong/DAQ)<br>[DAQ: Density-Aware Post-Training Weight-Only Quantization For LLMs](https://arxiv.org/abs/2410.12187) <br> Yingsong Luo, Ling Chen |<img width="1002" alt="image" src="https://arxiv.org/html/2410.12187v2/x1.png"> |[Github](https://github.com/LuoYingSong/DAQ) <br> [Paper](https://arxiv.org/abs/2410.12187)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/enyac-group/Quamba.svg?style=social&label=Star)](https://github.com/enyac-group/Quamba)<br>[Quamba: A Post-Training Quantization Recipe for Selective State Space Models](https://arxiv.org/abs/2410.13229) <br> Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Diana Marculescu |<img width="1002" alt="image" src="https://arxiv.org/html/2410.13229v1/extracted/5933363/figures/outliers.png"> |[Github](https://github.com/enyac-group/Quamba) <br> [Paper](https://arxiv.org/abs/2410.13229)|[//]: #10/21
|[AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations](https://arxiv.org/abs/2410.13212) <br> Qian Tao, Wenyuan Yu, Jingren Zhou |<img width="1002" alt="image" src="https://arxiv.org/html/2410.13212v1/extracted/5933292/figures/kvmix.png"> |[Paper](https://arxiv.org/abs/2410.13212)|[//]: #10/21
|[Channel-Wise Mixed-Precision Quantization for Large Language Models](https://arxiv.org/abs/2410.13056) <br> Zihan Chen, Bike Xie, Jundong Li, Cong Shen |<img width="1002" alt="image" src="https://arxiv.org/html/2410.13056v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.13056)|[//]: #10/21
|[Progressive Mixed-Precision Decoding for Efficient LLM Inference](https://arxiv.org/abs/2410.13461) <br> Hao Mark Chen, Fuwen Tan, Alexandros Kouris, Royson Lee, Hongxiang Fan, Stylianos I. Venieris |<img width="1002" alt="image" src="https://arxiv.org/html/2410.13461v1/x4.png"> |[Paper](https://arxiv.org/abs/2410.13461)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/Anonymous1252022/EXAQ.svg?style=social&label=Star)](https://github.com/Anonymous1252022/EXAQ)<br>[EXAQ: Exponent Aware Quantization For LLMs Acceleration](https://arxiv.org/abs/2410.03185) <br> Moran Shkolnik, Maxim Fishman, Brian Chmiel, Hilla Ben-Yaacov, Ron Banner, Kfir Yehuda Levy |<img width="1002" alt="image" src="figures/EXAQ.png"> |[Github](https://github.com/Anonymous1252022/EXAQ) <br> [Paper](https://arxiv.org/abs/2410.03185)|[//]: #10/14
|[![Star](https://img.shields.io/github/stars/ChenMnZ/PrefixQuant.svg?style=social&label=Star)](https://github.com/ChenMnZ/PrefixQuant)<br>[PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs](https://arxiv.org/abs/2410.05265) <br> Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo |<img width="1002" alt="image" src="https://arxiv.org/html/2410.05265v1/x1.png"> |[Github](https://github.com/ChenMnZ/PrefixQuant) <br> [Paper](https://arxiv.org/abs/2410.05265)|[//]: #10/14
|[![Star](https://img.shields.io/github/stars/vahe1994/AQLM.svg?style=social&label=Star)](https://github.com/vahe1994/AQLM)<br> :star: [Extreme Compression of Large Language Models via Additive Quantization](https://arxiv.org/abs/2401.06118) <br> Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh |<img width="1002" alt="image" src="figures/MCQ.png"> |[Github](https://github.com/vahe1994/AQLM) <br> [Paper](https://arxiv.org/abs/2401.06118)| [//]: #Recommend
|[Scaling Laws for Mixed quantization in Large Language Models](https://arxiv.org/abs/2410.06722) <br> Zeyu Cao, Cheng Zhang, Pedro Gimenes, Jianqiao Lu, Jianyi Cheng, Yiren Zhao |<img width="1002" alt="image" src="figures/LLM-MPQ.png"> |[Paper](https://arxiv.org/abs/2410.06722)|[//]: #10/14
|[PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms](https://arxiv.org/abs/2410.05315) <br> Yilong Li, Jingyu Liu, Hao Zhang, M Badri Narayanan, Utkarsh Sharma, Shuai Zhang, Pan Hu, Yijing Zeng, Jayaram Raghuram, Suman Banerjee |<img width="1002" alt="image" src="figures/PalmBench.png"> |[Paper](https://arxiv.org/abs/2410.05315)|[//]: #10/14
|[CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression](https://arxiv.org/abs/2410.07505) <br> Wenyuan Liu, Xindian Ma, Peng Zhang, Yan Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2410.07505v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.07505)|[//]: #10/13
|[SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration](https://arxiv.org/abs/2410.02367) <br> Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, Jianfei Chen |<img width="1002" alt="image" src="https://arxiv.org/html/2410.02367v1/x5.png"> |[Paper](https://arxiv.org/abs/2410.02367)|[//]: #10/04
|[Addition is All You Need for Energy-efficient Language Models](https://arxiv.org/abs/2410.00907) <br> Hongyin Luo, Wei Sun |<img width="1002" alt="image" src="https://arxiv.org/html/2410.00907v1/x2.png"> |[Paper](https://arxiv.org/abs/2410.00907)|[//]: #10/02



### Inference Acceleration
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|[![Star](https://img.shields.io/github/stars/FMInference/DejaVu.svg?style=social&label=Star)](https://github.com/FMInference/DejaVu)[![Publish](https://img.shields.io/badge/Conference-ICML'23%20Oral-blue)]()<br> :star: [Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time](https://openreview.net/forum?id=wIPIhHd00i) <br> Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen |<img width="202" alt="image" src="figures/DajeVu.png"> |[Github](https://github.com/FMInference/DejaVu) <br> [Paper](https://openreview.net/forum?id=wIPIhHd00i)| [//]: #Recommend
| [![Star](https://img.shields.io/github/stars/flexflow/FlexFlow.svg?style=social&label=Star)](https://github.com/flexflow/FlexFlow/tree/inference) <br> :star: [SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification](https://arxiv.org/abs/2305.09781) <br> Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia| <img width="600" alt="image" src="https://github.com/flexflow/FlexFlow/blob/inference/img/overview.png">| [Github](https://github.com/flexflow/FlexFlow/tree/inference) <br> [Paper](https://arxiv.org/abs/2305.09781) | [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/mit-han-lab/streaming-llm.svg?style=social&label=Star)](https://github.com/mit-han-lab/streaming-llm)<br> :star: [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453) <br> Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis |<img width="1002" alt="image" src="https://github.com/mit-han-lab/streaming-llm/blob/main/figures/schemes.png"> |[Github](https://github.com/mit-han-lab/streaming-llm) <br> [Paper](https://arxiv.org/abs/2309.17453)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/SafeAILab/EAGLE.svg?style=social&label=Star)](https://github.com/SafeAILab/EAGLE)<br>:star: [EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation](https://sites.google.com/view/eagle-llm) <br> Yuhui Li, Chao Zhang, Hongyang Zhang |<img width="302" alt="image" src="https://github.com/SafeAILab/EAGLE/blob/main/figs/fig1.png"> |[Github](https://github.com/SafeAILab/EAGLE) <br> [Blog](https://sites.google.com/view/eagle-llm)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/FasterDecoding/Medusa.svg?style=social&label=Star)](https://github.com/FasterDecoding/Medusa)<br> :star: [Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads](https://arxiv.org/abs/2401.10774) <br> Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao |<img width="1002" alt="image" src="https://arxiv.org/html/2401.10774v1/x1.png"> |[Github](https://github.com/FasterDecoding/Medusa) <br> [Paper](https://arxiv.org/abs/2401.10774)| [//]: #Recommend
|[Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration](https://arxiv.org/abs/2412.00061) <br> Zhuofan Wen, Shangtong Gui, Yang Feng |<img width="302" alt="image" src="https://arxiv.org/html/2412.00061v1/x1.png"> |[Paper](https://arxiv.org/abs/2412.00061)|[//]: #12/09
|[PLD+: Accelerating LLM inference by leveraging Language Model Artifacts](https://arxiv.org/abs/2412.01447) <br> Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena |<img width="1002" alt="image" src="https://arxiv.org/html/2412.01447v1/x1.png"> |[Paper](https://arxiv.org/abs/2412.01447)|[//]: #12/09
|[![Publish](https://img.shields.io/badge/Conference-NeurIPS'24%20ENLSP-blue)]()<br>[FastDraft: How to Train Your Draft](https://arxiv.org/abs/2411.11055) <br> Ofir Zafrir, Igor Margulis, Dorin Shteyman, Guy Boudoukh | |[Paper](https://arxiv.org/abs/2411.11055)|[//]: #11/24
|[![Star](https://img.shields.io/github/stars/David-Li0406/SMoA.svg?style=social&label=Star)](https://github.com/David-Li0406/SMoA)<br>[SMoA: Improving Multi-agent Large Language Models with Sparse Mixture-of-Agents](https://arxiv.org/abs/2411.03284) <br> Dawei Li, Zhen Tan, Peijia Qian, Yifan Li, Kumar Satvik Chaudhary, Lijie Hu, Jiayi Shen |<img width="1002" alt="image" src="figures/SMoA.png"> |[Github](https://github.com/David-Li0406/SMoA) <br> [Paper](https://arxiv.org/abs/2411.03284)|[//]: #11/18
|[The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation](https://arxiv.org/abs/2411.03786) <br> Lawrence Stewart, Matthew Trager, Sujan Kumar Gonugondla, Stefano Soatto | |[Paper](https://arxiv.org/abs/2411.03786)|[//]: #11/18
|[Accelerated AI Inference via Dynamic Execution Methods](https://arxiv.org/abs/2411.00853) <br> Haim Barad, Jascha Achterberg, Tien Pei Chou, Jean Yu | |[Paper](https://arxiv.org/abs/2411.00853)|[//]: #11/18
|[SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference](https://arxiv.org/abs/2411.04975) <br> Gabriele Oliaro, Zhihao Jia, Daniel Campos, Aurick Qiao |<img width="1002" alt="image" src="https://arxiv.org/html/2411.04975v1/x1.png"> |[Paper](https://arxiv.org/abs/2411.04975)|[//]: #11/18
|[Dynamic Strategy Planning for Efficient Question Answering with Large Language Models](https://arxiv.org/abs/2410.23511) <br> Tanmay Parekh, Pradyot Prakash, Alexander Radovic, Akshay Shekher, Denis Savenkov |<img width="1002" alt="image" src="https://arxiv.org/html/2410.23511v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.23511)|[//]: #11/17
|[![Star](https://img.shields.io/github/stars/Infini-AI-Lab/MagicPIG.svg?style=social&label=Star)](https://github.com/Infini-AI-Lab/MagicPIG)<br>[MagicPIG: LSH Sampling for Efficient LLM Generation](https://arxiv.org/abs/2410.16179) <br> Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen |<img width="1002" alt="image" src="https://arxiv.org/html/2410.16179v2/x15.png"> |[Github](https://github.com/Infini-AI-Lab/MagicPIG) <br> [Paper](https://arxiv.org/abs/2410.16179)|[//]: #10/30
|[Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition](https://arxiv.org/abs/2410.17765) <br> Artem Basharin, Andrei Chertkov, Ivan Oseledets |<img width="1002" alt="image" src="figures/canonical_tensor_decomposition.png"> |[Paper](https://arxiv.org/abs/2410.17765)|[//]: #10/29
|[Efficient Inference for Augmented Large Language Models](https://arxiv.org/abs/2410.18248) <br> Rana Shahout, Cong Liang, Shiji Xin, Qianru Lao, Yong Cui, Minlan Yu, Michael Mitzenmacher |<img width="1002" alt="image" src="https://arxiv.org/html/2410.18248v1/extracted/5949546/figures/illustrations/api_example_png.png"> |[Paper](https://arxiv.org/abs/2410.18248)|[//]: #10/29
|[![Star](https://img.shields.io/github/stars/MatteoNulli/Vocabulary_pruning.svg?style=social&label=Star)](https://github.com/MatteoNulli/Vocabulary_pruning)<br>[Dynamic Vocabulary Pruning in Early-Exit LLMs](https://arxiv.org/abs/2410.18952) <br> Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec |<img width="1002" alt="image" src="https://github.com/MatteoNulli/Vocabulary_pruning/raw/main/src/images/final_nips.svg"> |[Github](https://github.com/MatteoNulli/Vocabulary_pruning) <br> [Paper](https://arxiv.org/abs/2410.18952)|[//]: #10/29
|[![Star](https://img.shields.io/github/stars/wangqinsi1/CoreInfer.svg?style=social&label=Star)](https://github.com/wangqinsi1/CoreInfer)<br>[CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation](https://arxiv.org/abs/2410.18311) <br> Qinsi Wang, Saeed Vahidian, Hancheng Ye, Jianyang Gu, Jianyi Zhang, Yiran Chen |<img width="1002" alt="image" src="https://wangqinsi1.github.io/coreinfer_page/static/images/overview.png"> |[Github](https://github.com/wangqinsi1/CoreInfer) <br> [Paper](https://arxiv.org/abs/2410.18311)|[//]: #10/29
|[![Star](https://img.shields.io/github/stars/mit-han-lab/duo-attention.svg?style=social&label=Star)](https://github.com/mit-han-lab/duo-attention)<br>[DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads](https://arxiv.org/abs/2410.10819) <br> Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han |<img width="1002" alt="image" src="https://github.com/mit-han-lab/duo-attention/raw/main/figures/method1.jpg"> |[Github](https://github.com/mit-han-lab/duo-attention) <br> [Paper](https://arxiv.org/abs/2410.10819)|[//]: #10/21
|[DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure](https://arxiv.org/abs/2410.11744) <br> Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, Tianhao Wu, Lei Zou |<img width="1002" alt="image" src="https://arxiv.org/html/2410.11744v1/extracted/5913908/figures/tree_bold.png"> |[Paper](https://arxiv.org/abs/2410.11744)|[//]: #10/21
|[QSpec: Speculative Decoding with Complementary Quantization Schemes](https://arxiv.org/abs/2410.11305) <br> Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu |<img width="1002" alt="image" src="https://arxiv.org/html/2410.11305v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.11305)|[//]: #10/21
|[TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention](https://arxiv.org/abs/2410.05076) <br> Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia |<img width="1002" alt="image" src="https://arxiv.org/html/2410.05076v1/x2.png"> |[Paper](https://arxiv.org/abs/2410.05076)|[//]: #10/14
|[ParallelSpec: Parallel Drafter for Efficient Speculative Decoding](https://arxiv.org/abs/2410.05589) <br> Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu |<img width="1002" alt="image" src="https://arxiv.org/html/2410.05589v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.05589)|[//]: #10/14
|[![Star](https://img.shields.io/github/stars/hemingkx/SWIFT.svg?style=social&label=Star)](https://github.com/hemingkx/SWIFT)<br>[SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration](https://arxiv.org/abs/2410.06916) <br> Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li |<img width="1002" alt="image" src="https://github.com/hemingkx/SWIFT/raw/main/assets/swift.png"> |[Github](https://github.com/hemingkx/SWIFT) <br> [Paper](https://arxiv.org/abs/2410.06916)|[//]: #10/14
|[![Star](https://img.shields.io/github/stars/MooreThreads/TurboRAG.svg?style=social&label=Star)](https://github.com/MooreThreads/TurboRAG)<br>[TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text](https://arxiv.org/abs/2410.07590) <br> Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, Yaohua Tang |<img width="1002" alt="image" src="https://github.com/MooreThreads/TurboRAG/raw/main/assets/image/TurboRAG.png"> |[Github](https://github.com/MooreThreads/TurboRAG) <br> [Paper](https://arxiv.org/abs/2410.07590)|[//]: #10/13
|[A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts](https://arxiv.org/abs/2410.01485) <br> Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng |<img width="1002" alt="image" src="https://arxiv.org/html/2410.01485v1/extracted/5895696/figures/model_architecture.png"> |[Paper](https://arxiv.org/abs/2410.01485)|[//]: #10/04


### Efficient MOE
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|[![Star](https://img.shields.io/github/stars/dvmazur/mixtral-offloading.svg?style=social&label=Star)](https://github.com/dvmazur/mixtral-offloading)<br>:star: [Fast Inference of Mixture-of-Experts Language Models with Offloading](https://arxiv.org/abs/2312.17238) <br> Artyom Eliseev, Denis Mazur |<img width="1002" alt="image" src="figures/mixtral_offloading.png"> |[Github](https://github.com/dvmazur/mixtral-offloading) <br> [Paper](https://arxiv.org/abs/2312.17238)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/duterscmy/CD-MoE.svg?style=social&label=Star)](https://github.com/duterscmy/CD-MoE)<br>[Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning](https://arxiv.org/abs/2412.00069) <br> Mingyu Cao, Gen Li, Jie Ji, Jiaqi Zhang, Xiaolong Ma, Shiwei Liu, Lu Yin |<img width="1002" alt="image" src="https://arxiv.org/html/2412.00069v1/x2.png"> |[Github](https://github.com/duterscmy/CD-MoE) <br> [Paper](https://arxiv.org/abs/2412.00069)|[//]: #12/09
|[Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference](https://arxiv.org/abs/2412.00099) <br> Andrii Skliar, Ties van Rozendaal, Romain Lepert, Todor Boinovski, Mart van Baalen, Markus Nagel, Paul Whatmough, Babak Ehteshami Bejnordi |<img width="1002" alt="image" src="https://arxiv.org/html/2412.00099v1/x1.png"> |[Paper](https://arxiv.org/abs/2412.00099)|[//]: #12/09
|[![Star](https://img.shields.io/github/stars/EnflameTechnology/DeepSpeed.svg?style=social&label=Star)](https://github.com/EnflameTechnology/DeepSpeed)<br>[MoNTA: Accelerating Mixture-of-Experts Training with Network-Traffc-Aware Parallel Optimization](https://arxiv.org/abs/2411.00662) <br> Jingming Guo, Yan Liu, Yu Meng, Zhiwei Tao, Banglan Liu, Gang Chen, Xiang Li |<img width="1002" alt="image" src="https://arxiv.org/html/2411.00662v1/x1.png"> |[Github](https://github.com/EnflameTechnology/DeepSpeed) <br> [Paper](https://arxiv.org/abs/2411.00662)|[//]: #11/18
|[![Star](https://img.shields.io/github/stars/xiaochengsky/MoEI-2.svg?style=social&label=Star)](https://github.com/xiaochengsky/MoEI-2)<br>[MoE-I2: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition](https://arxiv.org/abs/2411.01016) <br> Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Yuanlin Duan, Wenqi Jia, Miao Yin, Yu Cheng, Bo Yuan |<img width="1002" alt="image" src="https://arxiv.org/html/2411.01016v1/x1.png"> |[Github](https://github.com/xiaochengsky/MoEI-2) <br> [Paper](https://arxiv.org/abs/2411.01016)|[//]: #11/18
|[HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference](https://arxiv.org/abs/2411.01433) <br> Peng Tang, Jiacheng Liu, Xiaofeng Hou, Yifei Pu, Jing Wang, Pheng-Ann Heng, Chao Li, Minyi Guo |<img width="1002" alt="image" src="https://arxiv.org/html/2411.01433v2/extracted/5980843/figures/overview5.png"> |[Paper](https://arxiv.org/abs/2411.01433)|[//]: #11/18
|[ProMoE: Fast MoE-based LLM Serving using Proactive Caching](https://arxiv.org/abs/2410.22134) <br> Xiaoniu Song, Zihang Zhong, Rong Chen |<img width="1002" alt="image" src="https://arxiv.org/html/2410.22134v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.22134)|[//]: #11/17
|[ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference](https://arxiv.org/abs/2410.17954) <br> Xin He, Shunkang Zhang, Yuxin Wang, Haiyan Yin, Zihao Zeng, Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Ivor Tsang, Ong Yew Soon |<img width="202" alt="image" src="https://arxiv.org/html/2410.17954v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.17954)|[//]: #10/29
|[EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference](https://arxiv.org/abs/2410.12247) <br> Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai | |[Paper](https://arxiv.org/abs/2410.12247)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/Aaronhuang-778/MC-MoE.svg?style=social&label=Star)](https://github.com/Aaronhuang-778/MC-MoE)<br>[MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More](https://arxiv.org/abs/2410.06270) <br> Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi |<img width="1002" alt="image" src="https://github.com/Aaronhuang-778/MC-MoE/raw/main/imgs/WX20241009-191322@2x.png"> |[Github](https://github.com/Aaronhuang-778/MC-MoE) <br> [Paper](https://arxiv.org/abs/2410.06270)|[//]: #10/14



### Efficient Architecture of LLM
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|[![Star](https://img.shields.io/github/stars/mbzuai-oryx/MobiLlama.svg?style=social&label=Star)](https://github.com/mbzuai-oryx/MobiLlama)<br>:star: [MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT](https://arxiv.org/abs/2402.16840) <br> Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan |<img width="402" alt="image" src="https://github.com/mbzuai-oryx/MobiLlama/raw/main/images/mobillama_generation.gif"> |[Github](https://github.com/mbzuai-oryx/MobiLlama) <br> [Paper](https://arxiv.org/abs/2402.16840) <br>[Model](https://huggingface.co/MBZUAI/MobiLlama-05B) | [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/XuezheMax/megalodon.svg?style=social&label=Star)](https://github.com/XuezheMax/megalodon)<br>:star: [Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length](https://arxiv.org/abs/2404.08801) <br> Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou |<img width="1002" alt="image" src="figures/megalodon.png"> |[Github](https://github.com/XuezheMax/megalodon) <br> [Paper](https://arxiv.org/abs/2404.08801)| [//]: #Recommend
|[Taipan: Efficient and Expressive State Space Language Models with Selective Attention](https://arxiv.org/abs/2410.18572) <br> Chien Van Nguyen, Huy Huu Nguyen, Thang M. Pham, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Ryan A. Rossi, Trung Bui, Viet Dac Lai, Franck Dernoncourt, Thien Huu Nguyen |<img width="1002" alt="image" src="https://arxiv.org/html/2410.18572v1/x2.png"> |[Paper](https://arxiv.org/abs/2410.18572)|[//]: #10/29
|[![Star](https://img.shields.io/github/stars/microsoft/SeerAttention.svg?style=social&label=Star)](https://github.com/microsoft/SeerAttention)<br>[SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs](https://arxiv.org/abs/2410.13276) <br> Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang |<img width="202" alt="image" src="https://arxiv.org/html/2410.13276v1/x4.png"> |[Github](https://github.com/microsoft/SeerAttention) <br> [Paper](https://arxiv.org/abs/2410.13276)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/TUDa-HWAI/Basis_Sharing.svg?style=social&label=Star)](https://github.com/TUDa-HWAI/Basis_Sharing)<br>[Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression](https://arxiv.org/abs/2410.03765) <br> Jingcun Wang, Yu-Guang Chen, Ing-Chao Lin, Bing Li, Grace Li Zhang |<img width="1002" alt="image" src="https://arxiv.org/html/2410.03765v1/x1.png"> |[Github](https://github.com/TUDa-HWAI/Basis_Sharing) <br> [Paper](https://arxiv.org/abs/2410.03765)|[//]: #10/14
|[Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions](https://arxiv.org/abs/2410.06577) <br> Zhihao He, Hang Yu, Zi Gong, Shizhan Liu, Jianguo Li, Weiyao Lin |<img width="1002" alt="image" src="https://arxiv.org/html/2410.06577v1/x3.png"> |[Paper](https://arxiv.org/abs/2410.06577)|[//]: #10/14


### KV Cache Compression
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|:star: [Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs](https://arxiv.org/abs/2310.01801) <br> Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao |<img width="1002" alt="image" src="figures/FastGen.png"> |[Paper](https://arxiv.org/abs/2310.01801)| [//]: #Recommend
|[ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression](https://arxiv.org/abs/2412.03213) <br> Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, Minyi Guo |<img width="1002" alt="image" src="https://arxiv.org/html/2412.03213v1/x1.png"> |[Paper](https://arxiv.org/abs/2412.03213)|[//]: #12/09
|[Unifying KV Cache Compression for Large Language Models with LeanKV](https://arxiv.org/abs/2412.03131) <br> Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen |<img width="1002" alt="image" src="https://arxiv.org/html/2412.03131v1/x2.png"> |[Paper](https://arxiv.org/abs/2412.03131)|[//]: #12/09
|[Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity](https://arxiv.org/abs/2412.02252) <br> Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu |<img width="1002" alt="image" src="https://arxiv.org/html/2412.02252v1/extracted/6041612/figs/intro.png"> |[Paper](https://arxiv.org/abs/2412.02252)|[//]: #12/09
|[MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache](https://arxiv.org/abs/2411.18077) <br> Akshat Sharma, Hangliang Ding, Jianping Li, Neel Dani, Minjia Zhang |<img width="1002" alt="image" src="https://arxiv.org/html/2411.18077v2/x1.png"> |[Paper](https://arxiv.org/abs/2411.18077)|[//]: #12/07
|[TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection](https://arxiv.org/abs/2411.02886) <br> Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Kun Fu, Zheng Wang, Hui Xiong |<img width="1002" alt="image" src="https://arxiv.org/html/2411.02886v1/x1.png"> |[Paper](https://arxiv.org/abs/2411.02886)|[//]: #11/18
|[![Star](https://img.shields.io/github/stars/FYYFU/HeadKV.svg?style=social&label=Star)](https://github.com/FYYFU/HeadKV)<br>[Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning](https://arxiv.org/abs/2410.19258) <br> Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao |<img width="1002" alt="image" src="https://github.com/FYYFU/HeadKV/raw/main/main.png"> |[Github](https://github.com/FYYFU/HeadKV) <br> [Paper](https://arxiv.org/abs/2410.19258)|[//]: #11/17
|[![Star](https://img.shields.io/github/stars/JunqiZhao888/buzz-llm.svg?style=social&label=Star)](https://github.com/JunqiZhao888/buzz-llm)<br>[BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference](https://arxiv.org/abs/2410.23079) <br> Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He |<img width="1002" alt="image" src="https://arxiv.org/html/2410.23079v1/x1.png"> |[Github](https://github.com/JunqiZhao888/buzz-llm) <br> [Paper](https://arxiv.org/abs/2410.23079)|[//]: #11/17
|[![Star](https://img.shields.io/github/stars/whyNLP/LCKV.svg?style=social&label=Star)](https://github.com/whyNLP/LCKV)<br>[A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference](https://arxiv.org/abs/2410.14442) <br> You Wu, Haoyi Wu, Kewei Tu |<img width="202" alt="image" src="figures/cross-layer-kv.png"> |[Github](https://github.com/whyNLP/LCKV) <br> [Paper](https://arxiv.org/abs/2410.14442)|[//]: #10/30
|[Lossless KV Cache Compression to 2%](https://arxiv.org/abs/2410.15252) <br> Zhen Yang, J.N.Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang |<img width="1002" alt="image" src="https://arxiv.org/html/2410.15252v1/extracted/5937225/images/CLLA_Overview.png"> |[Paper](https://arxiv.org/abs/2410.15252)|[//]: #10/30
|[MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection](https://arxiv.org/abs/2410.14731) <br> Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, Tianqi Hou, Xiaofeng Gao, Hao Zhang, Zhijie Deng |<img width="1002" alt="image" src="https://arxiv.org/html/2410.14731v1/x2.png"> |[Paper](https://arxiv.org/abs/2410.14731)|[//]: #10/30
|[![Star](https://img.shields.io/github/stars/iankur/vqllm.svg?style=social&label=Star)](https://github.com/iankur/vqllm)<br>[Residual vector quantization for KV cache compression in large language model](https://arxiv.org/abs/2410.15704) <br> Ankur Kumar | |[Github](https://github.com/iankur/vqllm) <br> [Paper](https://arxiv.org/abs/2410.15704)|[//]: #10/30
|[![Star](https://img.shields.io/github/stars/yangyifei729/KVSharer.svg?style=social&label=Star)](https://github.com/yangyifei729/KVSharer)<br>[KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing](https://arxiv.org/abs/2410.18517) <br> Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen |<img width="1002" alt="image" src="https://github.com/yangyifei729/KVSharer/raw/main/img/main_fig.jpg"> |[Github](https://github.com/yangyifei729/KVSharer) <br> [Paper](https://arxiv.org/abs/2410.18517)|[//]: #10/29
|[LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy](https://arxiv.org/abs/2410.03111) <br> Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen |<img width="1002" alt="image" src="figures/LoRC.png"> |[Paper](https://arxiv.org/abs/2410.03111)|[//]: #10/14
|[SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation](https://arxiv.org/abs/2410.03960) <br> Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He |<img width="1002" alt="image" src="https://arxiv.org/html/2410.03960v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.03960)|[//]: #10/14
|[![Publish](https://img.shields.io/badge/Conference-ICML'24-blue)]()<br>[Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference](https://arxiv.org/abs/2403.09636) <br> Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti |<img width="1002" alt="image" src="figures/DMC.png"> |[Paper](https://arxiv.org/abs/2403.09636)|[//]: #10/02
|[KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head](https://arxiv.org/abs/2410.00161) <br> Isaac Rehg |<img width="1002" alt="image" src="https://arxiv.org/html/2410.00161v1/x5.png"> |[Paper](https://arxiv.org/abs/2410.00161)|[//]: #10/02
|[![Star](https://img.shields.io/github/stars/FFY0/AdaKV.svg?style=social&label=Star)](https://github.com/FFY0/AdaKV)<br>[Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference](https://arxiv.org/abs/2407.11550) <br> Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou |<img width="1002" alt="image" src="figures/adakv.png"> |[Github](https://github.com/FFY0/AdaKV) <br> [Paper](https://arxiv.org/abs/2407.11550)|[//]: #10/13


### Text Compression
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|[![Star](https://img.shields.io/github/stars/microsoft/LLMLingua.svg?style=social&label=Star)](https://github.com/microsoft/LLMLingua)[![Publish](https://img.shields.io/badge/Conference-EMNLP'23-blue)]()<br>:star: [LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models](https://arxiv.org/abs/2310.05736) <br> Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu |<img width="1002" alt="image" src="https://github.com/microsoft/LLMLingua/raw/main/images/LLMLingua_framework.png"> |[Github](https://github.com/microsoft/LLMLingua) <br> [Paper](https://arxiv.org/abs/2310.05736)| [//]: #Recommend
|[![Star](https://img.shields.io/github/stars/alipay/L3TC-leveraging-rwkv-for-learned-lossless-low-complexity-text-compression.svg?style=social&label=Star)](https://github.com/alipay/L3TC-leveraging-rwkv-for-learned-lossless-low-complexity-text-compression)<br>[L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text Compression](https://arxiv.org/abs/2412.16642) <br> Junxuan Zhang, Zhengxue Cheng, Yan Zhao, Shihao Wang, Dajiang Zhou, Guo Lu, Li Song |<img width="1002" alt="image" src="https://arxiv.org/html/2412.16642v2/x2.png"> |[Github](https://github.com/alipay/L3TC-leveraging-rwkv-for-learned-lossless-low-complexity-text-compression) <br> [Paper](https://arxiv.org/abs/2412.16642)|[//]: #12/30
|[![Star](https://img.shields.io/github/stars/NL2G/promptoptme.svg?style=social&label=Star)](https://github.com/NL2G/promptoptme)<br>[PromptOptMe: Error-Aware Prompt Compression for LLM-based MT Evaluation Metrics](https://arxiv.org/abs/2412.16120) <br> Daniil Larionov, Steffen Eger |<img width="1002" alt="image" src="https://arxiv.org/html/2412.16120v1/x1.png"> |[Github](https://github.com/NL2G/promptoptme) <br> [Paper](https://arxiv.org/abs/2412.16120)|[//]: #12/30
|[![Star](https://img.shields.io/github/stars/microsoft/LLMLingua.svg?style=social&label=Star)](https://github.com/microsoft/LLMLingua)<br>:star: [LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression](https://arxiv.org/abs/2310.06839) <br> Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu |<img width="1002" alt="image" src="figures/longllmlingua.png"> |[Github](https://github.com/microsoft/LLMLingua) <br> [Paper](https://arxiv.org/abs/2310.06839)| [//]: #Recommend
|[A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression](https://arxiv.org/abs/2412.17483) <br> Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Xinting Huang, Dong Yu, Zhicheng Dou |<img width="1002" alt="image" src="https://arxiv.org/html/2412.17483v1/x1.png"> |[Paper](https://arxiv.org/abs/2412.17483)|[//]: #12/30
|[JPPO: Joint Power and Prompt Optimization for Accelerated Large Language Model Services](https://arxiv.org/abs/2411.18010) <br> Feiran You, Hongyang Du, Kaibin Huang, Abbas Jamalipour |<img width="1002" alt="image" src="https://arxiv.org/html/2411.18010v1/x1.png"> |[Paper](https://arxiv.org/abs/2411.18010)|[//]: #12/07
|[![Star](https://img.shields.io/github/stars/kaistai/generative-context-distillation.svg?style=social&label=Star)](https://github.com/kaistai/generative-context-distillation)<br>[Generative Context Distillation](https://arxiv.org/abs/2411.15927) <br> Haebin Shin, Lei Ji, Yeyun Gong, Sungdong Kim, Eunbi Choi, Minjoon Seo |<img width="1002" alt="image" src="figures/GCD.png"> |[Github](https://github.com/kaistai/generative-context-distillation) <br> [Paper](https://arxiv.org/abs/2411.15927)|[//]: #12/02
|[![Star](https://img.shields.io/github/stars/noelkelias/multitok.svg?style=social&label=Star)](https://github.com/noelkelias/multitok)<br>[MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression](https://arxiv.org/abs/2410.21548) <br> Noel Elias, Homa Esfahanizadeh, Kaan Kale, Sriram Vishwanath, Muriel Medard |<img width="1002" alt="image" src="https://arxiv.org/html/2410.21548v1/extracted/5960495/Figures/MultiTok.png"> |[Github](https://github.com/noelkelias/multitok) <br> [Paper](https://arxiv.org/abs/2410.21548)|[//]: #11/17
|[![Publish](https://img.shields.io/badge/Conference-EMNLP'24%20Findings-blue)]()<br>[Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability](https://arxiv.org/abs/2410.11786) <br> Tsz Ting Chung, Leyang Cui, Lemao Liu, Xinting Huang, Shuming Shi, Dit-Yan Yeung |<img width="202" alt="image" src="https://arxiv.org/html/2410.11786v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.11786)|[//]: #10/21
|[![Publish](https://img.shields.io/badge/Conference-EMNLP'24%20Findings-blue)]()<br>[From Reading to Compressing: Exploring the Multi-document Reader for Prompt Compression](https://arxiv.org/abs/2410.04139) <br> Eunseong Choi, Sunkyung Lee, Minjin Choi, June Park, Jongwuk Lee |<img width="1002" alt="image" src="https://arxiv.org/html/2410.04139v1/extracted/5902409/Figures/fig_R2C_framework_2col_v4.png"> |[Paper](https://arxiv.org/abs/2410.04139)|[//]: #10/14
|[Perception Compressor: A training-free prompt compression method in long context scenarios](https://arxiv.org/abs/2409.19272) <br> Jiwei Tang, Jin Xu, Tingwei Lu, Hai Lin, Yiming Zhao, Hai-Tao Zheng |<img width="1002" alt="image" src="https://arxiv.org/html/2409.19272v1/x1.png"> |[Paper](https://arxiv.org/abs/2409.19272)|[//]: #10/02

### Low-Rank Decomposition
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|[![Star](https://img.shields.io/github/stars/selfsupervised-ai/Natural-GaLore.svg?style=social&label=Star)](https://github.com/selfsupervised-ai/Natural-GaLore)<br>[Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning](https://arxiv.org/abs/2410.16029) <br> Arijit Das | |[Github](https://github.com/selfsupervised-ai/Natural-GaLore) <br> [Paper](https://arxiv.org/abs/2410.16029)|[//]: #10/30
|[CompAct: Compressed Activations for Memory-Efficient LLM Training](https://arxiv.org/abs/2410.15352) <br> Yara Shamshoum, Nitzan Hodos, Yuval Sieradzki, Assaf Schuster |<img width="202" alt="image" src="https://arxiv.org/html/2410.15352v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.15352)|[//]: #10/30
|[![Publish](https://img.shields.io/badge/Conference-NeurIPS'24-blue)]()<br>[ESPACE: Dimensionality Reduction of Activations for Model Compression](https://arxiv.org/abs/2410.05437) <br> Charbel Sakr, Brucek Khailany |<img width="1002" alt="image" src="figures/ESPACE.png"> |[Paper](https://arxiv.org/abs/2410.05437)|[//]: #10/14


### Hardware/System/Serving
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|[KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management](https://arxiv.org/abs/2412.18169) <br> Rongxin Cheng, Yifan Peng, Yuxin Lai, Xingda Wei, Rong Chen, Haibo Chen |<img width="1002" alt="image" src="https://arxiv.org/html/2412.18169v2/x3.png"> |[Paper](https://arxiv.org/abs/2412.18169)|[//]: #12/30
|[FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving](https://arxiv.org/abs/2411.18424) <br> Ao Shen, Zhiyao Li, Mingyu Gao |<img width="1002" alt="image" src="https://arxiv.org/html/2411.18424v1/x5.png"> |[Paper](https://arxiv.org/abs/2411.18424)|[//]: #12/07
|[CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration](https://arxiv.org/abs/2411.02829) <br> Hongpeng Jin, Yanzhao Wu |<img width="1002" alt="image" src="https://arxiv.org/html/2411.02829v1/extracted/5978301/images/method_overview_sm.png"> |[Paper](https://arxiv.org/abs/2411.02829)|[//]: #11/18
|[Ripple: Accelerating LLM Inference on Smartphones with Correlation-Aware Neuron Management](https://arxiv.org/abs/2410.19274) <br> Tuowei Wang, Ruwen Fan, Minxing Huang, Zixu Hao, Kun Li, Ting Cao, Youyou Lu, Yaoxue Zhang, Ju Ren |<img width="302" alt="image" src="https://arxiv.org/html/2410.19274v2/x7.png"> |[Paper](https://arxiv.org/abs/2410.19274)|[//]: #11/17
|[![Publish](https://img.shields.io/badge/Conference-ICCAD'24-blue)]()<br>[ALISE: Accelerating Large Language Model Serving with Speculative Scheduling](https://arxiv.org/abs/2410.23537) <br> Youpeng Zhao, Jun Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2410.23537v1/extracted/5967257/imgs/b1.png"> |[Paper](https://arxiv.org/abs/2410.23537)|[//]: #11/17
|[EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models](https://arxiv.org/abs/2410.15332) <br> Junhao Hu, Wenrui Huang, Haoyi Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie |<img width="202" alt="image" src="https://arxiv.org/html/2410.15332v1/x3.png"> |[Paper](https://arxiv.org/abs/2410.15332)|[//]: #10/30
|[![Publish](https://img.shields.io/badge/Conference-NeurIPS'24-blue)]()<br>[SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training](https://arxiv.org/abs/2410.15526) <br> Jinda Jia, Cong Xie, Hanlin Lu, Daoce Wang, Hao Feng, Chengming Zhang, Baixi Sun, Haibin Lin, Zhi Zhang, Xin Liu, Dingwen Tao |<img width="1002" alt="image" src="https://arxiv.org/html/2410.15526v1/x2.png"> |[Paper](https://arxiv.org/abs/2410.15526)|[//]: #10/30
|[FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs](https://arxiv.org/abs/2410.16663) <br> Haoran Lin, Xianzhi Yu, Kang Zhao, Lu Hou, Zongyuan Zhan et al. |<img width="1002" alt="image" src="https://arxiv.org/html/2410.16663v1/x2.png"> |[Paper](https://arxiv.org/abs/2410.16663)|[//]: #10/29
|[POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference](https://arxiv.org/abs/2410.18038) <br> Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, Ashish Panwar |<img width="1002" alt="image" src="https://arxiv.org/html/2410.18038v1/x5.png"> |[Paper](https://arxiv.org/abs/2410.18038)|[//]: #10/29
|[![Star](https://img.shields.io/github/stars/Lizonghang/TPI-LLM.svg?style=social&label=Star)](https://github.com/Lizonghang/TPI-LLM)<br>[TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices](https://arxiv.org/abs/2410.00531) <br> Zonghang Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu |<img width="1002" alt="image" src="https://arxiv.org/html/2410.00531v1/x4.png"> |[Github](https://github.com/Lizonghang/TPI-LLM) <br> [Paper](https://arxiv.org/abs/2410.00531)|[//]: #10/02



### Tuning
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|[HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization](https://arxiv.org/abs/2411.10696) <br> Huaqin Zhao, Jiaxi Li, Yi Pan, Shizhe Liang, Xiaofeng Yang, Wei Liu, Xiang Li, Fei Dou, Tianming Liu, Jin Lu |<img width="1002" alt="image" src="https://arxiv.org/html/2411.10696v1/x1.png"> |[Paper](https://arxiv.org/abs/2411.10696)|[//]: #11/24
|[![Star](https://img.shields.io/github/stars/LCS2-IIITD/MonteCLoRA.svg?style=social&label=Star)](https://github.com/LCS2-IIITD/MonteCLoRA)<br>[Robust and Efficient Fine-tuning of LLMs with Bayesian Reparameterization of Low-Rank Adaptation](https://arxiv.org/abs/2411.04358) <br> Ayan Sengupta, Vaibhav Seth, Arinjay Pathak, Natraj Raman, Sriram Gopalakrishnan, Tanmoy Chakraborty |<img width="1002" alt="image" src="https://arxiv.org/html/2411.04358v2/x3.png"> |[Github](https://github.com/LCS2-IIITD/MonteCLoRA) <br> [Paper](https://arxiv.org/abs/2411.04358)|[//]: #11/18
|[![Publish](https://img.shields.io/badge/Conference-EMNLP'24%20Findings-blue)]()<br>[MiLoRA: Efficient Mixture of Low-Rank Adaptation for Large Language Models Fine-tuning](https://arxiv.org/abs/2410.18035) <br> Jingfan Zhang, Yi Zhao, Dan Chen, Xing Tian, Huanran Zheng, Wei Zhu |<img width="1002" alt="image" src="https://arxiv.org/html/2410.18035v1/extracted/5949512/em_lora_framework.png"> |[Paper](https://arxiv.org/abs/2410.18035)|[//]: #10/29
|[![Star](https://img.shields.io/github/stars/Kowsher/RoCoFT.svg?style=social&label=Star)](https://github.com/Kowsher/RoCoFT)<br>[RoCoFT: Efficient Finetuning of Large Language Models with Row-Column Updates](https://arxiv.org/abs/2410.10075) <br> Md Kowsher, Tara Esmaeilbeig, Chun-Nam Yu, Mojtaba Soltanalian, Niloofar Yousefi |<img width="1002" alt="image" src="https://github.com/Kowsher/RoCoFT/raw/main/figures/rocoft.png"> |[Github](https://github.com/Kowsher/RoCoFT) <br> [Paper](https://arxiv.org/abs/2410.10075)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/Kaiseem/IST.svg?style=social&label=Star)](https://github.com/Kaiseem/IST)[![Publish](https://img.shields.io/badge/Conference-EMNLP'24-blue)]()<br>[Layer-wise Importance Matters: Less Memory for Better Performance in Parameter-efficient Fine-tuning of Large Language Models](https://arxiv.org/abs/2410.11772) <br> Kai Yao, Penlei Gao, Lichun Li, Yuan Zhao, Xiaofeng Wang, Wei Wang, Jianke Zhu |<img width="1002" alt="image" src="https://arxiv.org/html/2410.11772v1/x3.png"> |[Github](https://github.com/Kaiseem/IST) <br> [Paper](https://arxiv.org/abs/2410.11772)|[//]: #10/21
|[![Publish](https://img.shields.io/badge/Conference-Nature%20Scientific%20Reports-blue)]()<br>[Parameter-Efficient Fine-Tuning of Large Language Models using Semantic Knowledge Tuning](https://arxiv.org/abs/2410.08598) <br> Nusrat Jahan Prottasha, Asif Mahmud, Md. Shohanur Islam Sobuj, Prakash Bhat, Md Kowsher, Niloofar Yousefi, Ozlem Ozmen Garibay |<img width="1002" alt="image" src="https://arxiv.org/html/2410.08598v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.08598)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/xvyaward/qeft.svg?style=social&label=Star)](https://github.com/xvyaward/qeft)[![Publish](https://img.shields.io/badge/Conference-EMNLP'24%20Findings-blue)]()<br>[QEFT: Quantization for Efficient Fine-Tuning of LLMs](https://arxiv.org/abs/2410.08661) <br> Changhun Lee, Jun-gyu Jin, Younghyun Cho, Eunhyeok Park |<img width="1002" alt="image" src="https://arxiv.org/html/2410.08661v1/x2.png"> |[Github](https://github.com/xvyaward/qeft) <br> [Paper](https://arxiv.org/abs/2410.08661)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/Aofei-Chang/BIPEFT.svg?style=social&label=Star)](https://github.com/Aofei-Chang/BIPEFT)[![Publish](https://img.shields.io/badge/Conference-EMNLP'24%20Findings-blue)]()<br>[BIPEFT: Budget-Guided Iterative Search for Parameter Efficient Fine-Tuning of Large Pretrained Language Models](https://arxiv.org/abs/2410.09079) <br> Aofei Chang, Jiaqi Wang, Han Liu, Parminder Bhatia, Cao Xiao, Ting Wang, Fenglong Ma |<img width="1002" alt="image" src="https://arxiv.org/html/2410.09079v1/x1.png"> |[Github](https://github.com/Aofei-Chang/BIPEFT) <br> [Paper](https://arxiv.org/abs/2410.09079)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/sayankotor/sparse_grads.svg?style=social&label=Star)](https://github.com/sayankotor/sparse_grads)<br>[SparseGrad: A Selective Method for Efficient Fine-tuning of MLP Layers](https://arxiv.org/abs/2410.07383) <br> Viktoriia Chekalina, Anna Rudenko, Gleb Mezentsev, Alexander Mikhalev, Alexander Panchenko, Ivan Oseledets |<img width="1002" alt="image" src="https://arxiv.org/html/2410.07383v1/x1.png"> |[Github](https://github.com/sayankotor/sparse_grads) <br> [Paper](https://arxiv.org/abs/2410.07383)|[//]: #10/13
|[SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching](https://arxiv.org/abs/2410.06364) <br> Tianyi Zhang, Junda Su, Oscar Wu, Zhaozhuo Xu, Anshumali Shrivastava |<img width="1002" alt="image" src="https://arxiv.org/html/2410.06364v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.06364)|[//]: #10/13



### Efficient Training
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|[![Star](https://img.shields.io/github/stars/neiterman21/LDB.svg?style=social&label=Star)](https://github.com/neiterman21/LDB)<br>[LayerDropBack: A Universally Applicable Approach for Accelerating Training of Deep Networks](https://arxiv.org/abs/2412.18027) <br> Evgeny Hershkovitch Neiterman, Gil Ben-Artzi |<img width="1002" alt="image" src="https://arxiv.org/html/2412.18027v1/x1.png"> |[Github](https://github.com/neiterman21/LDB) <br> [Paper](https://arxiv.org/abs/2412.18027)|[//]: #12/30
|[AutoMixQ: Self-Adjusting Quantization for High Performance Memory-Efficient Fine-Tuning](https://arxiv.org/abs/2411.13814) <br> Changhai Zhou, Shiyang Zhang, Yuhua Zhou, Zekai Liu, Shichao Weng |<img width="1002" alt="image" src="figures/AutoMixQ.png"> |[Paper](https://arxiv.org/abs/2411.13814)|[//]: #11/24
|[![Star](https://img.shields.io/github/stars/TsinghuaC3I/LPA.svg?style=social&label=Star)](https://github.com/TsinghuaC3I/LPA)[![Publish](https://img.shields.io/badge/Conference-EMNLP'24-blue)]()<br>[Scalable Efficient Training of Large Language Models with Low-dimensional Projected Attention](https://arxiv.org/abs/2411.02063) <br> Xingtai Lv, Ning Ding, Kaiyan Zhang, Ermo Hua, Ganqu Cui, Bowen Zhou |<img width="1002" alt="image" src="https://arxiv.org/html/2411.02063v1/x1.png"> |[Github](https://github.com/TsinghuaC3I/LPA) <br> [Paper](https://arxiv.org/abs/2411.02063)|[//]: #11/18
|[Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs](https://arxiv.org/abs/2410.19694) <br> Yifei Zhang, Hao Zhu, Aiwei Liu, Han Yu, Piotr Koniusz, Irwin King |<img width="1002" alt="image" src="https://arxiv.org/html/2410.19694v1/x3.png"> |[Paper](https://arxiv.org/abs/2410.19694)|[//]: #11/18
|[![Star](https://img.shields.io/github/stars/NVlabs/COAT.svg?style=social&label=Star)](https://github.com/NVlabs/COAT)<br>[COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training](https://arxiv.org/abs/2410.19313) <br> Haocheng Xi, Han Cai, Ligeng Zhu, Yao Lu, Kurt Keutzer, Jianfei Chen, Song Han |<img width="1002" alt="image" src="https://github.com/NVlabs/COAT/raw/main/docs/figs/FP8PrecisionFlow.png"> |[Github](https://github.com/NVlabs/COAT) <br> [Paper](https://arxiv.org/abs/2410.19313)|[//]: #11/17
|[![Star](https://img.shields.io/github/stars/wuhouming/BitPipe.svg?style=social&label=Star)](https://github.com/wuhouming/BitPipe)<br>[BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training](https://arxiv.org/abs/2410.19367) <br> Houming Wu, Ling Chen, Wenjie Yu |<img width="1002" alt="image" src="https://github.com/wuhouming/BitPipe/raw/main/docs/BitPipe_images/BitPipe-v.svg"> |[Github](https://github.com/wuhouming/BitPipe) <br> [Paper](https://arxiv.org/abs/2410.19367)|[//]: #11/17



### Survey (or Benchmark)
| Title & Authors | Introduction | Links |
|:--|  :----: | :---:|
|[Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding](https://arxiv.org/abs/2411.13157) <br> Hyun Ryu, Eric Kim |<img width="1002" alt="image" src="https://arxiv.org/html/2411.13157v1/extracted/6012092/figure2.png"> |[Paper](https://arxiv.org/abs/2411.13157)|[//]: #11/24
|[![Star](https://img.shields.io/github/stars/argonne-lcf/LLM-Inference-Bench.svg?style=social&label=Star)](https://github.com/argonne-lcf/LLM-Inference-Bench)<br>[LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators](https://arxiv.org/abs/2411.00136) <br> Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus et al. | |[Github](https://github.com/argonne-lcf/LLM-Inference-Bench) <br> [Paper](https://arxiv.org/abs/2411.00136)|[//]: #11/18
|[![Star](https://img.shields.io/github/stars/ZongqianLi/Prompt-Compression-Survey.svg?style=social&label=Star)](https://github.com/ZongqianLi/Prompt-Compression-Survey)<br>[Prompt Compression for Large Language Models: A Survey](https://arxiv.org/abs/2410.12388) <br> Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier |<img width="1002" alt="image" src="https://arxiv.org/html/2410.12388v2/extracted/5933385/Figures/tree_overview.png"> |[Github](https://github.com/ZongqianLi/Prompt-Compression-Survey) <br> [Paper](https://arxiv.org/abs/2410.12388)|[//]: #10/21
|[Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective](https://arxiv.org/abs/2410.04466) <br> Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai |<img width="1002" alt="image" src="https://arxiv.org/html/2410.04466v1/x4.png"> |[Paper](https://arxiv.org/abs/2410.04466)|[//]: #10/14






