## Knowledge Distillation

| Title & Authors | Introduction | Links |
|:----|  :----: | :---:|
|[![Star](https://img.shields.io/github/stars/FranxYao/FlanT5-CoT-Specialization.svg?style=social&label=Star)](https://github.com/FranxYao/FlanT5-CoT-Specialization)[![Publish](https://img.shields.io/badge/Conference-ICML'23-blue)]()<br>[Specializing Smaller Language Models towards Multi-Step Reasoning](https://arxiv.org/abs/2301.12726) <br> Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, Tushar Khot |<img width="1002" alt="image" src="figures/ModelSpecialization.png"> |[Github](https://github.com/FranxYao/FlanT5-CoT-Specialization) <br> [Paper](https://arxiv.org/abs/2301.12726)|
|[![Star](https://img.shields.io/github/stars/siyuyuan/coscript.svg?style=social&label=Star)](https://github.com/siyuyuan/coscript)[![Publish](https://img.shields.io/badge/Conference-ACL'23%20Outstanding-blue)]()<br>[Distilling Script Knowledge from Large Language Models for Constrained Language Planning](https://arxiv.org/abs/2305.05252) <br> Siyu Yuan, Jiangjie Chen, Ziquan Fu, Xuyang Ge, Soham Shah, Charles Robert Jankowski, Yanghua Xiao, Deqing Yang |<img width="302" alt="image" src="figures/CoScript.png"> |[Github](https://github.com/siyuyuan/coscript) <br> [Paper](https://arxiv.org/abs/2305.05252)|
|[![Publish](https://img.shields.io/badge/Conference-ACL'23%20Outstanding-blue)]()<br>[SCOTT: Self-Consistent Chain-of-Thought Distillation](https://arxiv.org/abs/2305.01879) <br> Peifeng Wang, Zhengyang Wang, Zheng Li, Yifan Gao, Bing Yin, Xiang Ren |<img width="1002" alt="image" src="figures/SCOTT.png"> |[Paper](https://arxiv.org/abs/2305.01879)|
|[![Star](https://img.shields.io/github/stars/eric11eca/disco.svg?style=social&label=Star)](https://github.com/eric11eca/disco)[![Publish](https://img.shields.io/badge/Conference-ACL'23-blue)]()<br>[DISCO: Distilling Counterfactuals with Large Language Models](https://arxiv.org/abs/2212.10534) <br> Zeming Chen, Qiyue Gao, Antoine Bosselut, Ashish Sabharwal, Kyle Richardson |<img width="1002" alt="image" src="figures/disco.png"> |[Github](https://github.com/eric11eca/disco) <br> [Paper](https://arxiv.org/abs/2212.10534)|
|[![Star](https://img.shields.io/github/stars/allenai/i2d2.svg?style=social&label=Star)](https://github.com/allenai/i2d2)[![Publish](https://img.shields.io/badge/Conference-ACL'23-blue)]()<br>[I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation](https://arxiv.org/abs/2212.09246) <br> Chandra Bhagavatula, Jena D. Hwang, Doug Downey, Ronan Le Bras, Ximing Lu, Lianhui Qin, Keisuke Sakaguchi, Swabha Swayamdipta, Peter West, Yejin Choi |<img width="1002" alt="image" src="https://i2d2.allen.ai/i2d2-fig1.png"> |[Github](https://github.com/allenai/i2d2) <br> [Paper](https://arxiv.org/abs/2212.09246) <br> [Project](https://i2d2.allen.ai/) |
|[![Star](https://img.shields.io/github/stars/allenai/cot_distillation.svg?style=social&label=Star)](https://github.com/allenai/cot_distillation)[![Publish](https://img.shields.io/badge/Conference-ACL'23-blue)]()<br>[Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step](https://arxiv.org/abs/2306.14050) <br> Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, Yejin Choi |<img width="202" alt="image" src="figures/SCoTD.png"> |[Github](https://github.com/allenai/cot_distillation) <br> [Paper](https://arxiv.org/abs/2306.14050)|
|[![Star](https://img.shields.io/github/stars/swarnaHub/ExplanationIntervention.svg?style=social&label=Star)](https://github.com/swarnaHub/ExplanationIntervention) [![Publish](https://img.shields.io/badge/Conference-NeurIPS'23-blue)]() <br>[Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind](https://arxiv.org/abs/2306.09299) <br> Swarnadeep Saha, Peter Hase, Mohit Bansal |<img width="302" alt="image" src="https://github.com/swarnaHub/ExplanationIntervention/blob/main/assets/main_fig.png"> |[Github](https://github.com/swarnaHub/ExplanationIntervention) <br> [Paper](https://arxiv.org/abs/2306.09299)|
|[![Publish](https://img.shields.io/badge/Conference-EMNLP'23-blue)]()<br>[Dialogue Chain-of-Thought Distillation for Commonsense-aware Conversational Agents](https://arxiv.org/abs/2310.09343) <br> Hyungjoo Chae, Yongho Song, Kai Tzu-iunn Ong, Taeyoon Kwon, Minjin Kim, Youngjae Yu, Dongha Lee, Dongyeop Kang, Jinyoung Yeo |<img width="1002" alt="image" src="figures/Doctor.png"> |[Paper](https://arxiv.org/abs/2310.09343)|
|[![Star](https://img.shields.io/github/stars/ServiceNow/PromptMix-EMNLP-2023.svg?style=social&label=Star)](https://github.com/ServiceNow/PromptMix-EMNLP-2023)[![Publish](https://img.shields.io/badge/Conference-EMNLP'23-blue)]()<br>[PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation](https://arxiv.org/abs/2310.14192) <br> Gaurav Sahu, Olga Vechtomova, Dzmitry Bahdanau, Issam H. Laradji |<img width="1002" alt="image" src="figures/PromptMix.png"> |[Github](https://github.com/ServiceNow/PromptMix-EMNLP-2023) <br> [Paper](https://arxiv.org/abs/2310.14192)|
|[![Star](https://img.shields.io/github/stars/Yiwei98/TDG.svg?style=social&label=Star)](https://github.com/Yiwei98/TDG)[![Publish](https://img.shields.io/badge/Conference-AAAI'24-blue)]()<br>[Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data](https://arxiv.org/abs/2312.12832) <br> Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Bin Sun, Xinglin Wang, Heda Wang, Kan Li |<img width="1002" alt="image" src="https://github.com/Yiwei98/TDG/blob/main/img.png"> |[Github](https://github.com/Yiwei98/TDG) <br> [Paper](https://arxiv.org/abs/2312.12832)|
|[![Star](https://img.shields.io/github/stars/Raibows/Learn-to-Reason.svg?style=social&label=Star)](https://github.com/Raibows/Learn-to-Reason)[![Publish](https://img.shields.io/badge/Conference-EMNLP'23-blue)]()<br>[Democratizing Reasoning Ability: Tailored Learning from Large Language Model](https://aclanthology.org/2023.emnlp-main.120.pdf) <br> Zhaoyang Wang, Shaohan Huang, Yuxuan Liu, Jiahai Wang, Minghui Song, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang |<img width="1002" alt="image" src="figures/learn-to-reason.png"> |[Github](https://github.com/Raibows/Learn-to-Reason) <br> [Paper](https://aclanthology.org/2023.emnlp-main.120.pdf)|
|[![Star](https://img.shields.io/github/stars/aitsc/GLMKD.svg?style=social&label=Star)](https://github.com/aitsc/GLMKD) [![Publish](https://img.shields.io/badge/Conference-ACL'23%20Industry%20Track-blue)]() <br>[GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model](https://arxiv.org/abs/2306.06629) <br> Shicheng Tan, Weng Lam Tam, Yuanchun Wang, Wenwen Gong, Yang Yang, Hongyin Tang, Keqing He, Jiahao Liu, Jingang Wang, Shu Zhao, Peng Zhang, Jie Tang |<img width="1002" alt="image" src="figures/GKD.png"> |[Github](https://github.com/aitsc/GLMKD) <br> [Paper](https://arxiv.org/abs/2306.06629)|
|[![Star](https://img.shields.io/github/stars/google-research/distilling-step-by-step.svg?style=social&label=Star)](https://github.com/google-research/distilling-step-by-step) [![Publish](https://img.shields.io/badge/Conference-ACL'23%20Findings-blue)]() <br> [Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes](https://arxiv.org/abs/2305.02301)    <br> Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister | <img width="2000" alt="image" src="figures/Distill_step_by_step.png">| [Github](https://github.com/google-research/distilling-step-by-step) <br> [Paper](https://arxiv.org/abs/2305.02301) |
|[![Publish](https://img.shields.io/badge/Conference-EMNLP'23%20Findings-blue)]()<br>[Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression](https://arxiv.org/abs/2310.15594) <br> Jiduan Liu, Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai, Dongyan Zhao, Ran Lucien Wang, Rui Yan |<img width="1002" alt="image" src="figures/RetriKT.png"> |[Paper](https://arxiv.org/abs/2310.15594)|
|[![Star](https://img.shields.io/github/stars/stoyian/OCaTS.svg?style=social&label=Star)](https://github.com/stoyian/OCaTS)[![Publish](https://img.shields.io/badge/Conference-EMNLP'23%20Findings-blue)]()<br>[Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models](https://arxiv.org/abs/2310.13395) <br> Ilias Stogiannidis, Stavros Vassos, Prodromos Malakasiotis, Ion Androutsopoulos |<img width="252" alt="image" src="figures/OCaTS.png"> |[Github](https://github.com/stoyian/OCaTS) <br> [Paper](https://arxiv.org/abs/2310.13395)|
|[![Publish](https://img.shields.io/badge/Conference-NAACL'24%20Industry%20Track-blue)]()<br>[Efficiently Distilling LLMs for Edge Applications](https://arxiv.org/abs/2404.01353) <br> Achintya Kundu, Fabian Lim, Aaron Chew, Laura Wynter, Penny Chong, Rhui Dih Lee |<img width="1002" alt="image" src="figures/MLFS.png"> |[Paper](https://arxiv.org/abs/2404.01353)|
| [![Star](https://img.shields.io/github/stars/mbzuai-nlp/LaMini-LM.svg?style=social&label=Star)](https://github.com/mbzuai-nlp/LaMini-LM) <br> [LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions](https://arxiv.org/abs/2304.14402) <br>Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, Alham Fikri Aji | <img width="1002" alt="image" src="https://github.com/mbzuai-nlp/LaMini-LM/blob/main/images/lamini-pipeline.drawio.png"> | [Github](https://github.com/mbzuai-nlp/LaMini-LM) <br> [Paper](https://arxiv.org/abs/2304.14402) |
|[Knowledge Distillation of Large Language Models](https://arxiv.org/abs/2306.08543) <br> Yuxian Gu, Li Dong, Furu Wei, Minlie Huang |<img width="1002" alt="image" src="https://github.com/microsoft/LMOps/blob/main/minillm/figures/method.png"> |[Github](https://github.com/microsoft/LMOps/tree/main/minillm) <br> [Paper](https://arxiv.org/abs/2306.08543)|
|[Teaching Small Language Models to Reason](https://arxiv.org/abs/2212.08410) <br> Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, Aliaksei Severyn |<img width="202" alt="image" src="figures/Teach_Small_LM_COT.png"> |[Paper](https://arxiv.org/abs/2212.08410)|
| [![Star](https://img.shields.io/github/stars/ananyahjha93/llm-distill.svg?style=social&label=Star)](https://github.com/ananyahjha93/llm-distill) <br> [Large Language Model Distillation Doesn't Need a Teacher](https://arxiv.org/abs/2305.14864) <br> Ananya Harsh Jha, Dirk Groeneveld, Emma Strubell, Iz Beltagy | <img width="2000" alt="image" src="figures/TeacherFreeLLM.png"> | [Github](https://github.com/ananyahjha93/llm-distill) <br> [Paper](https://arxiv.org/abs/2305.14864) |
| [The False Promise of Imitating Proprietary LLMs](https://arxiv.org/abs/2305.15717) <br> Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song | <img width="400" alt="image" src="figures/FalsePromise.png"> | [Paper](https://arxiv.org/abs/2305.15717) |
|[![Star](https://img.shields.io/github/stars/jaehunjung1/impossible-distillation.svg?style=social&label=Star)](https://github.com/jaehunjung1/impossible-distillation) <br>[Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing](https://arxiv.org/abs/2305.16635) <br> Jaehun Jung, Peter West, Liwei Jiang, Faeze Brahman, Ximing Lu, Jillian Fisher, Taylor Sorensen, Yejin Choi |<img width="1002" alt="image" src="figures/impossible_distillation.png"> |[Github](https://github.com/jaehunjung1/impossible-distillation) <br> [Paper](https://arxiv.org/abs/2305.16635) |
|[PaD: Program-aided Distillation Specializes Large Models in Reasoning](https://arxiv.org/abs/2305.13888) <br> Xuekai Zhu, Biqing Qi, Kaiyan Zhang, Xingwei Long, Bowen Zhou |<img width="402" alt="image" src="figures/PaD.png"> |[Paper](https://arxiv.org/abs/2305.13888)|
|[RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment](https://arxiv.org/abs/2307.12950) <br> Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng, Yuandong Tian |<img width="302" alt="image" src="figures/RLCD.png"> |[Paper](https://arxiv.org/abs/2307.12950)|
|[Sci-CoT: Leveraging Large Language Models for Enhanced Knowledge Distillation in Small Models for Scientific QA](https://arxiv.org/abs/2308.04679) <br> Yuhan Ma, Haiqi Jiang, Chenyou Fan |<img width="302" alt="image" src="figures/Sci-COT.png"> |[Paper](https://arxiv.org/abs/2308.04679)|
|[![Star](https://img.shields.io/github/stars/universal-ner/universal-ner.svg?style=social&label=Star)](https://github.com/universal-ner/universal-ner)<br>[UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition](https://arxiv.org/abs/2308.03279) <br> Wenxuan Zhou, Sheng Zhang, Yu Gu, Muhao Chen, Hoifung Poon |<img width="302" alt="image" src="figures/UniversalNER.png"> |[Github](https://github.com/universal-ner/universal-ner) <br> [Paper](https://arxiv.org/abs/2308.03279) <br> [Project](https://universal-ner.github.io) |
|[![Star](https://img.shields.io/github/stars/timinar/BabyLlama.svg?style=social&label=Star)](https://github.com/timinar/BabyLlama)<br>[Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty](https://arxiv.org/abs/2308.02019) <br> Inar Timiryasov, Jean-Loup Tastet |<img width="302" alt="image" src="figures/BabyLLaMA.png"> |[Github](https://github.com/timinar/BabyLlama) <br> [Paper](https://arxiv.org/abs/2308.02019) <br> [Model](https://huggingface.co/timinar/baby-llama-58m) |
|[DistillSpec: Improving Speculative Decoding via Knowledge Distillation](https://arxiv.org/abs/2310.08461) <br> Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal |<img width="1002" alt="image" src="figures/DistillSpec.png"> |[Paper](https://arxiv.org/abs/2310.08461)|
|[![Star](https://img.shields.io/github/stars/huggingface/alignment-handbook.svg?style=social&label=Star)](https://github.com/huggingface/alignment-handbook)<br>[Zephyr: Direct Distillation of LM Alignment](https://arxiv.org/abs/2310.16944) <br> Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, Thomas Wolf |<img width="1002" alt="image" src="figures/zephyr.png"> |[Github](https://github.com/huggingface/alignment-handbook) <br> [Paper](https://arxiv.org/abs/2310.16944)|
|[![Star](https://img.shields.io/github/stars/GeneZC/MiniMA.svg?style=social&label=Star)](https://github.com/GeneZC/MiniMA)<br>[Towards the Law of Capacity Gap in Distilling Language Models](https://arxiv.org/abs/2311.07052) <br> Chen Zhang, Dawei Song, Zheyu Ye, Yan Gao |<img width="1002" alt="image" src="figures/MiniMA.png"> |[Github](https://github.com/GeneZC/MiniMA) <br> [Paper](https://arxiv.org/abs/2311.07052)|
|[Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models](https://arxiv.org/abs/2311.08213) <br> Xinwei Li, Li Lin, Shuai Wang, Chen Qian |<img width="1002" alt="image" src="figures/CoMD.png"> |[Paper](https://arxiv.org/abs/2311.08213)|
|[Mixed Distillation Helps Smaller Language Model Better Reasoning](https://arxiv.org/abs/2312.10730) <br> Li Chenglin, Chen Qianglong, Wang Caiyu, Zhang Yin |<img width="1002" alt="image" src="figures/MixDistill.png"> |[Paper](https://arxiv.org/abs/2312.10730)|
|[Distilling Event Sequence Knowledge From Large Language Models](https://arxiv.org/abs/2401.07237) <br> Somin Wadhwa, Oktie Hassanzadeh, Debarun Bhattacharjya, Ken Barker, Jian Ni |<img width="1002" alt="image" src="figures/distill_event.png"> |[Paper](https://arxiv.org/abs/2401.07237)|
|[Knowledge Distillation for Closed-Source Language Models](https://arxiv.org/abs/2401.07013) <br> Hongzhan Chen, Xiaojun Quan, Hehong Chen, Ming Yan, Ji Zhang |<img width="1002" alt="image" src="figures/kd_close_source.png"> |[Paper](https://arxiv.org/abs/2401.07013)|
|[Improving Small Language Models' Mathematical Reasoning via Equation-of-Thought Distillation](https://arxiv.org/abs/2401.11864) <br> Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang |<img width="1002" alt="image" src="figures/EoTD.png"> |[Paper](https://arxiv.org/abs/2401.11864)|
|[Scavenging Hyena: Distilling Transformers into Long Convolution Models](https://arxiv.org/abs/2401.17574) <br> Tokiniaina Raharison Ralambomihanta, Shahrad Mohammadzadeh, Mohammad Sami Nur Islam, Wassim Jabbour, Laurence Liang |<img width="1002" alt="image" src="https://arxiv.org/html/2401.17574v1/extracted/5379324/figs/Knowledge-Transfer-HD.png"> |[Paper](https://arxiv.org/abs/2401.17574)|
|[![Star](https://img.shields.io/github/stars/jongwooko/distillm.svg?style=social&label=Star)](https://github.com/jongwooko/distillm)<br>[DistiLLM: Towards Streamlined Distillation for Large Language Models](https://arxiv.org/abs/2402.03898) <br> Jongwoo Ko, Sungnyun Kim, Tianyi Chen, Se-Young Yun |<img width="1002" alt="image" src="https://arxiv.org/html/2402.03898v1/x4.png"> |[Github](https://github.com/jongwooko/distillm) <br> [Paper](https://arxiv.org/abs/2402.03898)|
|[Large Language Model Meets Graph Neural Network in Knowledge Distillation](https://arxiv.org/abs/2402.05894) <br> Shengxiang Hu, Guobing Zou, Song Yang, Bofeng Zhang, Yixin Chen |<img width="1002" alt="image" src="figures/LinguGKD.png"> |[Paper](https://arxiv.org/abs/2402.05894)|
|[![Star](https://img.shields.io/github/stars/dong-river/LLM_unlearning.svg?style=social&label=Star)](https://github.com/dong-river/LLM_unlearning)<br>[Unmemorization in Large Language Models via Self-Distillation and Deliberate Imagination](https://arxiv.org/abs/2402.10052) <br> Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, Ivan Vulić |<img width="1002" alt="image" src="https://arxiv.org/html/2402.10052v1/x1.png"> |[Github](https://github.com/dong-river/LLM_unlearning) <br> [Paper](https://arxiv.org/abs/2402.10052)|
|[![Star](https://img.shields.io/github/stars/Nicolas-BZRD/llm-recipes.svg?style=social&label=Star)](https://github.com/Nicolas-BZRD/llm-recipes)<br>[Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs](https://arxiv.org/abs/2402.12030) <br> Nicolas Boizard, Kevin El-Haddad, Céline Hudelot, Pierre Colombo |<img width="1002" alt="image" src="figures/CrossTokenizer.png"> |[Github](https://github.com/Nicolas-BZRD/llm-recipes) <br> [Github](https://github.com/Nicolas-BZRD/llm-distillation) <br> [Paper](https://arxiv.org/abs/2402.12030) <br> [Model](https://huggingface.co/collections/Nicolas-BZRD/llms-distillation-65cfa07f1e4ed7404502a9eb)|
|[Revisiting Knowledge Distillation for Autoregressive Language Models](https://arxiv.org/abs/2402.11890) <br> Qihuang Zhong, Liang Ding, Li Shen, Juhua Liu, Bo Du, Dacheng Tao |<img width="1002" alt="image" src="figures/ATKD.png"> |[Paper](https://arxiv.org/abs/2402.11890)|
|[PromptKD: Distilling Student-Friendly Knowledge for Generative Language Models via Prompt Tuning](https://arxiv.org/abs/2402.12842) <br> Gyeongman Kim, Doohyuk Jang, Eunho Yang |<img width="1002" alt="image" src="figures/PromptKD.png"> |[Paper](https://arxiv.org/abs/2402.12842)|
|[Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning](https://arxiv.org/abs/2402.13669) <br> Zhaorui Yang, Qian Liu, Tianyu Pang, Han Wang, Haozhe Feng, Minfeng Zhu, Wei Chen |<img width="1002" alt="image" src="figures/SDFT.png"> |[Paper](https://arxiv.org/abs/2402.13669)|
|[Wisdom of Committee: Distilling from Foundation Model to Specialized Application Model](https://arxiv.org/abs/2402.14035) <br> Zichang Liu, Qingyun Liu, Yuening Li, Liang Liu, Anshumali Shrivastava, Shuchao Bi, Lichan Hong, Ed H. Chi, Zhe Zhao |<img width="1002" alt="image" src="https://arxiv.org/html/2402.14035v1/x1.png"> |[Paper](https://arxiv.org/abs/2402.14035)|
|[Divide-or-Conquer? Which Part Should You Distill Your LLM?](https://arxiv.org/abs/2402.15000) <br> Zhuofeng Wu, He Bai, Aonan Zhang, Jiatao Gu, VG Vinod Vydiswaran, Navdeep Jaitly, Yizhe Zhang |<img width="202" alt="image" src="https://arxiv.org/html/2402.15000v1/x1.png"> |[Paper](https://arxiv.org/abs/2402.15000)|
|[![Star](https://img.shields.io/github/stars/pphuc25/distil-cd.svg?style=social&label=Star)](https://github.com/pphuc25/distil-cd)<br>[Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation](https://arxiv.org/abs/2402.14874) <br> Phuc Phan, Hieu Tran, Long Phan |<img width="1002" alt="image" src="https://github.com/pphuc25/distil-cd/blob/main/assets/figure1-method.jpg"> |[Github](https://github.com/pphuc25/distil-cd) <br> [Paper](https://arxiv.org/abs/2402.14874)|
|[Leveraging Zero-Shot Prompting for Efficient Language Model Distillation](https://arxiv.org/abs/2403.15886) <br> Lukas Vöge, Vincent Gurgul, Stefan Lessmann |<img width="1002" alt="image" src="https://arxiv.org/html/2403.15886v1/extracted/5490966/step_by_step.png"> |[Paper](https://arxiv.org/abs/2403.15886)|
|[![Star](https://img.shields.io/github/stars/KomeijiForce/MetaIE.svg?style=social&label=Star)](https://github.com/KomeijiForce/MetaIE)<br>[MetaIE: Distilling a Meta Model from LLM for All Kinds of Information Extraction Tasks](https://arxiv.org/abs/2404.00457) <br> Letian Peng, Zilong Wang, Feng Yao, Zihan Wang, Jingbo Shang |<img width="1002" alt="image" src="https://arxiv.org/html/2404.00457v1/x1.png"> |[Github](https://github.com/KomeijiForce/MetaIE) <br> [Paper](https://arxiv.org/abs/2404.00457) <br> [Model](https://huggingface.co/KomeijiForce/roberta-large-metaie)|
|[Gecko: Versatile Text Embeddings Distilled from Large Language Models](https://arxiv.org/abs/2403.20327) <br> Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer et al. |<img width="1002" alt="image" src="https://arxiv.org/html/2403.20327v1/x1.png"> |[Paper](https://arxiv.org/abs/2403.20327)|
|[Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models](https://arxiv.org/abs/2404.02657) <br> Taiqiang Wu, Chaofan Tao, Jiahao Wang, Zhe Zhao, Ngai Wong |<img width="1002" alt="image" src="figures/rethink-AKL.png"> |[Paper](https://arxiv.org/abs/2404.02657) <br> [Blog (English)](https://zhuanlan.zhihu.com/p/690804722)<br> [Blog (Chinese)](https://zhuanlan.zhihu.com/p/690748958)|
|[Post-Semantic-Thinking: A Robust Strategy to Distill Reasoning Capacity from Large Language Models](https://arxiv.org/abs/2404.09170) <br> Xiaoshu Chen, Sihang Zhou, Ke Liang, Xinwang Liu |<img width="1002" alt="image" src="https://arxiv.org/html/2404.09170v2/x1.png"> |[Paper](https://arxiv.org/abs/2404.09170)|
|[Compressing Long Context for Enhancing RAG with AMR-based Concept Distillation](https://arxiv.org/abs/2405.03085) <br> Kaize Shi, Xueyao Sun, Qing Li, Guandong Xu |<img width="1002" alt="image" src="figures/concept_RAG.png"> |[Paper](https://arxiv.org/abs/2405.03085)|
|[Distilling Instruction-following Abilities of Large Language Models with Task-aware Curriculum Planning](https://arxiv.org/abs/2405.13448) <br> Yuanhao Yue, Chengyu Wang, Jun Huang, Peng Wang |<img width="1002" alt="image" src="figures/TAPIR.png"> |[Paper](https://arxiv.org/abs/2405.13448)|
|[![Star](https://img.shields.io/github/stars/WangXFng/RDRec.svg?style=social&label=Star)](https://github.com/WangXFng/RDRec)[![Publish](https://img.shields.io/badge/Conference-ACL'24-blue)]()<br>[RDRec: Rationale Distillation for LLM-based Recommendation](https://arxiv.org/abs/2405.10587) <br> Xinfeng Wang, Jin Cui, Yoshimi Suzuki, Fumiyo Fukumoto |<img width="1002" alt="image" src="figures/RDRec.png"> |[Github](https://github.com/WangXFng/RDRec) <br> [Paper](https://arxiv.org/abs/2405.10587)|
|[LLM and GNN are Complementary: Distilling LLM for Multimodal Graph Learning](https://arxiv.org/abs/2406.01032) <br> Junjie Xu, Zongyu Wu, Minhua Lin, Xiang Zhang, Suhang Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2406.01032v1/x1.png"> |[Paper](https://arxiv.org/abs/2406.01032)|[//]: #06/05
|[![Star](https://img.shields.io/github/stars/jiachenwestlake/MMKD.svg?style=social&label=Star)](https://github.com/jiachenwestlake/MMKD)<br>[Adversarial Moment-Matching Distillation of Large Language Models](https://arxiv.org/abs/2406.02959) <br> Chen Jia |<img width="1002" alt="image" src="https://arxiv.org/html/2406.02959v1/x1.png"> |[Github](https://github.com/jiachenwestlake/MMKD) <br> [Paper](https://arxiv.org/abs/2406.02959)|[//]: #06/11
|[BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation](https://arxiv.org/abs/2406.13555) <br> Minchong Li, Feng Zhou, Xiaohui Song |<img width="1002" alt="image" src="https://arxiv.org/html/2406.13555v1/extracted/5678562/images/bild.jpg"> |[Paper](https://arxiv.org/abs/2406.13555)|[//]: #07/05
|[Multi-Granularity Semantic Revision for Large Language Model Distillation](https://arxiv.org/abs/2407.10068) <br> Xiaoyu Liu, Yun Zhang, Wei Li, Simiao Li, Xudong Huang, Hanting Chen, Yehui Tang, Jie Hu, Zhiwei Xiong, Yunhe Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2407.10068v1/x1.png"> |[Paper](https://arxiv.org/abs/2407.10068)|[//]: #07/16
|[Don't Throw Away Data: Better Sequence Knowledge Distillation](https://arxiv.org/abs/2407.10456) <br> Jun Wang, Eleftheria Briakou, Hamid Dadkhahi, Rishabh Agarwal, Colin Cherry, Trevor Cohn | |[Paper](https://arxiv.org/abs/2407.10456)|[//]: #07/16
|[Key-Point-Driven Mathematical Reasoning Distillation of Large Language Model](https://arxiv.org/abs/2407.10167) <br> Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2407.10167v1/x2.png"> |[Paper](https://arxiv.org/abs/2407.10167)|[//]: #07/16
|[DDK: Distilling Domain Knowledge for Efficient Large Language Models](https://arxiv.org/abs/2407.16154) <br> Jiaheng Liu, Chenchen Zhang, Jinyang Guo, Yuanxing Zhang, Haoran Que, Ken Deng, Zhiqi Bai, Jie Liu, Ge Zhang, Jiakai Wang, Yanan Wu, Congnan Liu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng |<img width="1002" alt="image" src="https://arxiv.org/html/2407.16154v1/x2.png"> |[Paper](https://arxiv.org/abs/2407.16154)|[//]: #07/24
|[Enhancing Data-Limited Graph Neural Networks by Actively Distilling Knowledge from Large Language Models](https://arxiv.org/abs/2407.13989) <br> Quan Li, Tianxiang Zhao, Lingwei Chen, Junjie Xu, Suhang Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2407.13989v1/x2.png"> |[Paper](https://arxiv.org/abs/2407.13989)|[//]: #07/24
|[BOND: Aligning LLMs with Best-of-N Distillation](https://arxiv.org/abs/2407.14622) <br> Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Nino Vieillard et al. |<img width="1002" alt="image" src="figures/BOND.png"> |[Paper](https://arxiv.org/abs/2407.14622)|[//]: #07/29
|[LaDiMo: Layer-wise Distillation Inspired MoEfier](https://arxiv.org/abs/2408.04278) <br> Sungyoon Kim, Youngjun Kim, Kihyo Moon, Minsung Jang |<img width="1002" alt="image" src="https://arxiv.org/html/2408.04278v1/extracted/5780689/figures/moefier.png"> |[Paper](https://arxiv.org/abs/2408.04278)|[//]: #08/13
|[Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models](https://arxiv.org/abs/2408.10189) <br> Aviv Bick, Kevin Y. Li, Eric P. Xing, J. Zico Kolter, Albert Gu |<img width="1002" alt="image" src="https://arxiv.org/html/2408.10189v1/x1.png"> |[Paper](https://arxiv.org/abs/2408.10189)|[//]: #08/20
|[Concept Distillation from Strong to Weak Models via Hypotheses-to-Theories Prompting](https://arxiv.org/abs/2408.09365) <br> Emmanuel Aboah Boateng, Cassiano O. Becker, Nabiha Asghar, Kabir Walia, Ashwin Srinivasan, Ehi Nosakhare, Victor Dibia, Soundar Srinivasan |<img width="1002" alt="image" src="https://arxiv.org/html/2408.09365v1/x2.png"> |[Paper](https://arxiv.org/abs/2408.09365)|[//]: #08/20
|[Interactive DualChecker for Mitigating Hallucinations in Distilling Large Language Models](https://arxiv.org/abs/2408.12326) <br> Meiyun Wang, Masahiro Suzuki, Hiroki Sakaji, Kiyoshi Izumi |<img width="1002" alt="image" src="https://arxiv.org/html/2408.12326v1/extracted/5806761/figs/intro.jpg"> |[Paper](https://arxiv.org/abs/2408.12326)|[//]: #08/27
|[FIRST: Teach A Reliable Large Language Model Through Efficient Trustworthy Distillation](https://arxiv.org/abs/2408.12168) <br> KaShun Shum, Minrui Xu, Jianshu Zhang, Zixin Chen, Shizhe Diao, Hanze Dong, Jipeng Zhang, Muhammad Omer Raza |<img width="1002" alt="image" src="https://arxiv.org/html/2408.12168v1/extracted/5806746/Figures/trustworthy.png"> |[Paper](https://arxiv.org/abs/2408.12168)|[//]: #08/27
|[![Star](https://img.shields.io/github/stars/jxiw/MambaInLlama.svg?style=social&label=Star)](https://github.com/jxiw/MambaInLlama)<br>[The Mamba in the Llama: Distilling and Accelerating Hybrid Models](https://arxiv.org/abs/2408.15237) <br> Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, Tri Dao |<img width="1002" alt="image" src="https://arxiv.org/html/2408.15237v1/x1.png"> |[Github](https://github.com/jxiw/MambaInLlama) <br> [Paper](https://arxiv.org/abs/2408.15237)|[//]: #09/02
|[Efficient Knowledge Distillation: Empowering Small Language Models with Teacher Model Insights](https://arxiv.org/abs/2409.12586) <br> Mohamad Ballout, Ulf Krumnack, Gunther Heidemann, Kai-Uwe Kühnberger |<img width="1002" alt="image" src="https://arxiv.org/html/2409.12586v1/x2.png"> |[Paper](https://arxiv.org/abs/2409.12586)|[//]: #09/21
|[Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models](https://arxiv.org/abs/2409.12512) <br> Jun Rao, Xuebo Liu, Zepeng Lin, Liang Ding, Jing Li, Dacheng Tao |<img width="1002" alt="image" src="https://arxiv.org/html/2409.12512v1/x1.png"> |[Paper](https://arxiv.org/abs/2409.12512)|[//]: #09/21
|[![Star](https://img.shields.io/github/stars/MANGA-UOFA/Prompt-LLMR.svg?style=social&label=Star)](https://github.com/MANGA-UOFA/Prompt-LLMR)[![Publish](https://img.shields.io/badge/Conference-LREC--COLING'24-blue)]()<br>[LLMR: Knowledge Distillation with a Large Language Model-Induced Reward](https://arxiv.org/abs/2409.12500) <br> Dongheng Li, Yongchang Hao, Lili Mou |<img width="1002" alt="image" src="https://github.com/MANGA-UOFA/Prompt-LLMR/blob/main/LLMR-main/assets/model.png"> |[Github](https://github.com/MANGA-UOFA/Prompt-LLMR) <br> [Paper](https://arxiv.org/abs/2409.12500)|[//]: #09/21
|[EchoAtt: Attend, Copy, then Adjust for More Efficient Large Language Models](https://arxiv.org/abs/2409.14595) <br> Hossein Rajabzadeh, Aref Jafari, Aman Sharma, Benyamin Jami, Hyock Ju Kwon, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh |<img width="1002" alt="image" src="https://arxiv.org/html/2409.14595v1/extracted/5869635/Figs/shared_attention_diagram.png"> |[Paper](https://arxiv.org/abs/2409.14595)|[//]: #09/27
|[BabyLlama-2: Ensemble-Distilled Models Consistently Outperform Teachers With Limited Data](https://arxiv.org/abs/2409.17312) <br> Jean-Loup Tastet, Inar Timiryasov | |[Paper](https://arxiv.org/abs/2409.17312)|[//]: #09/27
|[Evolutionary Contrastive Distillation for Language Model Alignment](https://arxiv.org/abs/2410.07513) <br> Julian Katz-Samuels, Zheng Li, Hyokun Yun, Priyanka Nigam, Yi Xu, Vaclav Petricek, Bing Yin, Trishul Chilimbi |<img width="1002" alt="image" src="https://arxiv.org/html/2410.07513v1/extracted/5913898/figures/main_alg_v3.png"> |[Paper](https://arxiv.org/abs/2410.07513)|[//]: #10/13
|[Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling](https://arxiv.org/abs/2410.11325) <br> Wenda Xu, Rujun Han, Zifeng Wang, Long T. Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister |<img width="1002" alt="image" src="https://arxiv.org/html/2410.11325v1/x2.png"> |[Paper](https://arxiv.org/abs/2410.11325)|[//]: #10/21
|[![Star](https://img.shields.io/github/stars/thu-coai/MiniPLM.svg?style=social&label=Star)](https://github.com/thu-coai/MiniPLM)<br>[MiniPLM: Knowledge Distillation for Pre-Training Language Models](https://arxiv.org/abs/2410.17215) <br> Yuxian Gu, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang |<img width="1002" alt="image" src="https://github.com/thu-coai/MiniPLM/raw/main/figures/method.png"> |[Github](https://github.com/thu-coai/MiniPLM) <br> [Paper](https://arxiv.org/abs/2410.17215)|[//]: #10/29
|[Pre-training Distillation for Large Language Models: A Design Space Exploration](https://arxiv.org/abs/2410.16215) <br> Hao Peng, Xin Lv, Yushi Bai, Zijun Yao, Jiajie Zhang, Lei Hou, Juanzi Li | |[Paper](https://arxiv.org/abs/2410.16215)|[//]: #10/30
|[![Star](https://img.shields.io/github/stars/jdeschena/sdtt.svg?style=social&label=Star)](https://github.com/jdeschena/sdtt)<br>[Beyond Autoregression: Fast LLMs via Self-Distillation Through Time](https://arxiv.org/abs/2410.21035) <br> Justin Deschenaux, Caglar Gulcehre |<img width="1002" alt="image" src="https://arxiv.org/html/2410.21035v1/x3.png"> |[Github](https://github.com/jdeschena/sdtt) <br> [Paper](https://arxiv.org/abs/2410.21035)|[//]: #11/17
|[SWITCH: Studying with Teacher for Knowledge Distillation of Large Language Models](https://arxiv.org/abs/2410.19503) <br> Jahyun Koo, Yerin Hwang, Yongil Kim, Taegwan Kang, Hyunkyung Bae, Kyomin Jung |<img width="1002" alt="image" src="figures/switch.png"> |[Paper](https://arxiv.org/abs/2410.19503)|[//]: #11/17
|[![Star](https://img.shields.io/github/stars/kaistai/generative-context-distillation.svg?style=social&label=Star)](https://github.com/kaistai/generative-context-distillation)<br>[Generative Context Distillation](https://arxiv.org/abs/2411.15927) <br> Haebin Shin, Lei Ji, Yeyun Gong, Sungdong Kim, Eunbi Choi, Minjoon Seo |<img width="1002" alt="image" src="figures/GCD.png"> |[Github](https://github.com/kaistai/generative-context-distillation) <br> [Paper](https://arxiv.org/abs/2411.15927)|[//]: #12/02
|[Improving Mathematical Reasoning Capabilities of Small Language Models via Feedback-Driven Distillation](https://arxiv.org/abs/2411.14698) <br> Xunyu Zhu, Jian Li, Can Ma, Weiping Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2411.14698v1/x1.png"> |[Paper](https://arxiv.org/abs/2411.14698)|[//]: #12/03
|[![Star](https://img.shields.io/github/stars/HITSZ-HLT/FSA-Distillation.svg?style=social&label=Star)](https://github.com/HITSZ-HLT/FSA-Distillation)<br>[Distilling Fine-grained Sentiment Understanding from Large Language Models](https://arxiv.org/abs/2412.18552) <br> Yice Zhang, Guangyu Xie, Hongling Xu, Kaiheng Hou, Jianzhu Bao, Qianlong Wang, Shiwei Chen, Ruifeng Xu |<img width="302" alt="image" src="https://arxiv.org/html/2412.18552v1/x1.png"> |[Github](https://github.com/HITSZ-HLT/FSA-Distillation) <br> [Paper](https://arxiv.org/abs/2412.18552)|[//]: #12/30
|[![Star](https://img.shields.io/github/stars/alonso130r/knowledge-distillation.svg?style=social&label=Star)](https://github.com/alonso130r/knowledge-distillation)<br>[Enhancing Knowledge Distillation for LLMs with Response-Priming Prompting](https://arxiv.org/abs/2412.17846) <br> Vijay Goyal, Mustafa Khan, Aprameya Tirupati, Harveer Saini, Michael Lam, Kevin Zhu |<img width="1002" alt="image" src="https://arxiv.org/html/2412.17846v1/extracted/6080471/prompt-example.png"> |[Github](https://github.com/alonso130r/knowledge-distillation) <br> [Paper](https://arxiv.org/abs/2412.17846)|[//]: #12/30
|[Large Language Models Compression via Low-Rank Feature Distillation](https://arxiv.org/abs/2412.16719) <br> Yaya Sy, Christophe Cerisara, Irina Illina |<img width="302" alt="image" src="https://arxiv.org/html/2412.16719v1/x1.png"> |[Paper](https://arxiv.org/abs/2412.16719)|[//]: #12/30
|[![Publish](https://img.shields.io/badge/Conference-COLING'25-blue)]()<br>[Self-Evolution Knowledge Distillation for LLM-based Machine Translation](https://arxiv.org/abs/2412.15303) <br> Yuncheng Song, Liang Ding, Changtong Zan, Shujian Huang |<img width="1002" alt="image" src="https://arxiv.org/html/2412.15303v1/extracted/6081708/model_two.png"> |[Paper](https://arxiv.org/abs/2412.15303)|[//]: #12/30