# Awesome Audio Gallery

- [Awesome Audio Gallery](#awesome-audio-gallery)
  - [Detection](#detection)
  - [Courses and Tutorials](#courses-and-tutorials)
    - [DLHLP 2020 Spring](#dlhlp-2020-spring)
    - [Large Audio Model](#large-audio-model)
    - [Other](#other)
  - [Speech Translation](#speech-translation)
  - [Toolkits](#toolkits)
  - [Dataset](#dataset)
    - [TTS](#tts)
  - [Audio Techs](#audio-techs)
  - [Audio Visual](#audio-visual)

## Detection
- **SONAR: A Synthetic AI-Audio Detection Framework and Benchmark**, `arXiv, 2410.04324`, [arxiv](http://arxiv.org/abs/2410.04324v3), [pdf](http://arxiv.org/pdf/2410.04324v3.pdf)

	 *Xiang Li, Pin-Yu Chen, Wenqi Wei* · ([SONAR](https://github.com/Jessegator/SONAR) - Jessegator) ![Star](https://img.shields.io/github/stars/Jessegator/SONAR.svg?style=social&label=Star)
- **FakeSound: Deepfake General Audio Detection**, `arXiv, 2406.08052`, [arxiv](http://arxiv.org/abs/2406.08052v1), [pdf](http://arxiv.org/pdf/2406.08052v1.pdf)

	 *Zeyu Xie, Baihan Li, Xuenan Xu, Zheng Liang, Kai Yu, Mengyue Wu* · ([fakesounddata.github](https://fakesounddata.github.io/))
- **Retrieval-Augmented Audio Deepfake Detection**, `arXiv, 2404.13892`, [arxiv](http://arxiv.org/abs/2404.13892v1), [pdf](http://arxiv.org/pdf/2404.13892v1.pdf)

	 *Zuheng Kang, Yayun He, Botao Zhao, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang*
- **Cross-Domain Audio Deepfake Detection: Dataset and Analysis**, `arXiv, 2404.04904`, [arxiv](http://arxiv.org/abs/2404.04904v1), [pdf](http://arxiv.org/pdf/2404.04904v1.pdf)

	 *Yuang Li, Min Zhang, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Hao Yang*
- **Detecting Multimedia Generated by Large AI Models: A Survey**, `arXiv, 2402.00045`, [arxiv](http://arxiv.org/abs/2402.00045v1), [pdf](http://arxiv.org/pdf/2402.00045v1.pdf)

	 *Li Lin, Neeraj Gupta, Yue Zhang, Hainan Ren, Chun-Hao Liu, Feng Ding, Xin Wang, Xin Li, Luisa Verdoliva, Shu Hu* · ([Detect-LAIM-generated-Multimedia-Survey](https://github.com/Purdue-M2/Detect-LAIM-generated-Multimedia-Survey) - Purdue-M2) ![Star](https://img.shields.io/github/stars/Purdue-M2/Detect-LAIM-generated-Multimedia-Survey.svg?style=social&label=Star)
- **Proactive Detection of Voice Cloning with Localized Watermarking**, `arXiv, 2401.17264`, [arxiv](http://arxiv.org/abs/2401.17264v1), [pdf](http://arxiv.org/pdf/2401.17264v1.pdf)

	 *Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, Hady Elsahar* · ([audioseal](https://github.com/facebookresearch/audioseal) - facebookresearch) ![Star](https://img.shields.io/github/stars/facebookresearch/audioseal.svg?style=social&label=Star)

## Courses and Tutorials

### DLHLP 2020 Spring
- [DLHLP 2020 Spring](https://speech.ee.ntu.edu.tw/~hylee/dlhlp/2020-spring.php)
- [[DLHLP 2020] Hung-yi Lee's Spring 2020 course: speech recognition, speech synthesis, speech separation - bilibili](https://www.bilibili.com/video/BV1hZ4y1w7j1/?spm_id_from=333.788.top_right_bar_window_custom_collection.content.click&vd_source=1453a06a1e0b377f5c40946333b4423a)
- [[DLHLP 2020] Vocoder (taught by TA 許博竣) - bilibili](https://www.bilibili.com/video/BV1hZ4y1w7j1?p=11&vd_source=1453a06a1e0b377f5c40946333b4423a)
- TTS Intro: [[DLHLP 2020] Speech Synthesis (1/2) - Tacotron - YouTube](https://www.youtube.com/watch?v=DMxKeHW8KdM&ab_channel=Hung-yiLee)

### Large Audio Model
- [[Machine Learning 2023] Speech Foundation Models (taught by TA 張凱為) (1/2) - YouTube](https://www.youtube.com/watch?v=m7Be7ppR6q0&ab_channel=Hung-yiLee)
- [[Machine Learning 2023] Speech Foundation Models (taught by TA 張凱為) (2/2) - YouTube](https://www.youtube.com/watch?v=HTAq-CPrU5s&ab_channel=Hung-yiLee)
- [https://speech.ee.ntu.edu.tw/\~hylee/ml/ml2023-course-data/張凱爲-x-機器學習-x-語音基石模型.pdf](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2023-course-data/%E5%BC%B5%E5%87%B1%E7%88%B2-x-%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-x-%E8%AA%9E%E9%9F%B3%E5%9F%BA%E7%9F%B3%E6%A8%A1%E5%9E%8B.pdf)

### Other
- [scaling law for multimodality](https://twitter.com/xutan_tx/status/1783154647903113453)
- [**awesome-speech-recognition-speech-synthesis-papers**](https://github.com/zzw922cn/awesome-speech-recognition-speech-synthesis-papers) - zzw922cn ![Star](https://img.shields.io/github/stars/zzw922cn/awesome-speech-recognition-speech-synthesis-papers.svg?style=social&label=Star)
- [**Awesome-Singing-Voice-Synthesis-and-Singing-Voice-Conversion**](https://github.com/guan-yuan/Awesome-Singing-Voice-Synthesis-and-Singing-Voice-Conversion) - guan-yuan ![Star](https://img.shields.io/github/stars/guan-yuan/Awesome-Singing-Voice-Synthesis-and-Singing-Voice-Conversion.svg?style=social&label=Star)
- [**survey**](https://github.com/tts-tutorial/survey) - tts-tutorial ![Star](https://img.shields.io/github/stars/tts-tutorial/survey.svg?style=social&label=Star)

	 *A Survey on Neural Speech Synthesis*
- [**interspeech2022**](https://github.com/tts-tutorial/interspeech2022) - tts-tutorial ![Star](https://img.shields.io/github/stars/tts-tutorial/interspeech2022.svg?style=social&label=Star)
- [INTERSPEECH\_Tutorial\_TTS.pdf](https://tts-tutorial.github.io/interspeech2022/INTERSPEECH_Tutorial_TTS.pdf)
- [INTERSPEECH\_Tutorial\_VC.pdf](https://tts-tutorial.github.io/interspeech2022/INTERSPEECH_Tutorial_VC.pdf)
- [https://www.microsoft.com/en-us/research/uploads/prod/2022/12/Generative-Models-for-TTS.pdf](https://www.microsoft.com/en-us/research/uploads/prod/2022/12/Generative-Models-for-TTS.pdf)
- [**AudioGPT**](https://github.com/AIGC-Audio/AudioGPT) - AIGC-Audio ![Star](https://img.shields.io/github/stars/AIGC-Audio/AudioGPT.svg?style=social&label=Star)

## Speech Translation
- **FASST: Fast LLM-based Simultaneous Speech Translation**, `arXiv, 2408.09430`, [arxiv](http://arxiv.org/abs/2408.09430v1), [pdf](http://arxiv.org/pdf/2408.09430v1.pdf)

	 *Siqi Ouyang, Xi Xu, Chinmay Dandekar, Lei Li*
- **StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning**, `arXiv, 2406.03049`, [arxiv](http://arxiv.org/abs/2406.03049v1), [pdf](http://arxiv.org/pdf/2406.03049v1.pdf)

	 *Shaolei Zhang, Qingkai Fang, Shoutao Guo, Zhengrui Ma, Min Zhang, Yang Feng* · ([streamspeech](https://github.com/ictnlp/streamspeech) - ictnlp) ![Star](https://img.shields.io/github/stars/ictnlp/streamspeech.svg?style=social&label=Star)
- [**gentranslate**](https://github.com/yuchen005/gentranslate) - yuchen005 ![Star](https://img.shields.io/github/stars/yuchen005/gentranslate.svg?style=social&label=Star)

	 *Code for paper "GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators"*
- **PolyVoice: Language Models for Speech to Speech Translation**, `arXiv, 2306.02982`, [arxiv](http://arxiv.org/abs/2306.02982v2), [pdf](http://arxiv.org/pdf/2306.02982v2.pdf)

	 *Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng* · ([speechtranslation.github](https://speechtranslation.github.io/polyvoice/))
- [**fairseq**](https://github.com/facebookresearch/fairseq/tree/ust/examples/hokkien) - facebookresearch ![Star](https://img.shields.io/github/stars/facebookresearch/fairseq.svg?style=social&label=Star)

## Toolkits
- [**awesome-denoiser**](https://github.com/xinliu9451/awesome-denoiser) - xinliu9451 ![Star](https://img.shields.io/github/stars/xinliu9451/awesome-denoiser.svg?style=social&label=Star)

	 *This is a repository that collects common audio noise reduction models, using Gradio to demonstrate the use of each model, which is very friendly for beginners.*
- [**LiveKit-AI-Voice-Assitant**](https://github.com/techwithtim/LiveKit-AI-Voice-Assitant) - techwithtim ![Star](https://img.shields.io/github/stars/techwithtim/LiveKit-AI-Voice-Assitant.svg?style=social&label=Star)
- **Super Monotonic Alignment Search**, `arXiv, 2409.07704`, [arxiv](http://arxiv.org/abs/2409.07704v1), [pdf](http://arxiv.org/pdf/2409.07704v1.pdf)

	 *Junhyeok Lee, Hyeongju Kim* · ([super-monotonic-align](https://github.com/supertone-inc/super-monotonic-align) - supertone-inc) ![Star](https://img.shields.io/github/stars/supertone-inc/super-monotonic-align.svg?style=social&label=Star)
- [LiveKit · GitHub](https://github.com/livekit)
- [**speechbrain**](https://github.com/speechbrain/speechbrain/) - speechbrain ![Star](https://img.shields.io/github/stars/speechbrain/speechbrain.svg?style=social&label=Star)

	 *A PyTorch-based Speech Toolkit* · ([huggingface](https://huggingface.co/speechbrain/sepformer-wham16k-enhancement))
- [**demucs**](https://github.com/facebookresearch/demucs) - facebookresearch ![Star](https://img.shields.io/github/stars/facebookresearch/demucs.svg?style=social&label=Star)

	 *Code for the paper Hybrid Spectrogram and Waveform Source Separation*
- [**Resemblyzer**](https://github.com/resemble-ai/Resemblyzer) - resemble-ai ![Star](https://img.shields.io/github/stars/resemble-ai/Resemblyzer.svg?style=social&label=Star)

	 *A python package to analyze and compare voices with deep learning*
- [**FRCRN**](https://github.com/alibabasglab/FRCRN) - alibabasglab ![Star](https://img.shields.io/github/stars/alibabasglab/FRCRN.svg?style=social&label=Star)
- [**seewav**](https://github.com/adefossez/seewav) - adefossez ![Star](https://img.shields.io/github/stars/adefossez/seewav.svg?style=social&label=Star)

	 *Audio waveform visualisation, converts any audio to a nice video* · ([huggingface](https://huggingface.co/spaces/mrfakename/seewav-gui))
- [**NeMo-text-processing**](https://github.com/NVIDIA/NeMo-text-processing?tab=readme-ov-file) - NVIDIA ![Star](https://img.shields.io/github/stars/NVIDIA/NeMo-text-processing.svg?style=social&label=Star)

	 *NeMo text processing for ASR and TTS*
- [**OpenPhonemizer**](https://github.com/NeuralVox/OpenPhonemizer) - NeuralVox ![Star](https://img.shields.io/github/stars/NeuralVox/OpenPhonemizer.svg?style=social&label=Star)

	 *Permissively licensed, open sourced, local IPA Phonemizer (G2P) powered by deep learning.*
- [**fullstop-deep-punctuation-prediction**](https://github.com/oliverguhr/fullstop-deep-punctuation-prediction) - oliverguhr ![Star](https://img.shields.io/github/stars/oliverguhr/fullstop-deep-punctuation-prediction.svg?style=social&label=Star)

	 *A model that predicts the punctuation of English, Italian, French and German texts.* · ([huggingface](https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large))
- [**nendo**](https://github.com/okio-ai/nendo) - okio-ai ![Star](https://img.shields.io/github/stars/okio-ai/nendo.svg?style=social&label=Star)

	 *The Nendo AI Audio Tool Suite*
- [**deepfilternet**](https://github.com/rikorose/deepfilternet) - rikorose ![Star](https://img.shields.io/github/stars/rikorose/deepfilternet.svg?style=social&label=Star)

	 *Noise suppression using deep filtering*
- [**CharsiuG2P**](https://github.com/lingjzhu/CharsiuG2P) - lingjzhu ![Star](https://img.shields.io/github/stars/lingjzhu/CharsiuG2P.svg?style=social&label=Star)

	 *Multilingual G2P in 100 languages*
- [(Inverse) Text Normalization — NVIDIA NeMo](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/text_normalization/intro.html)
- [**audio-preprocess**](https://github.com/fishaudio/audio-preprocess) - fishaudio ![Star](https://img.shields.io/github/stars/fishaudio/audio-preprocess.svg?style=social&label=Star)

	 *Preprocess Audio for training*
- [**pyloudnorm**](https://github.com/csteinmetz1/pyloudnorm) - csteinmetz1 ![Star](https://img.shields.io/github/stars/csteinmetz1/pyloudnorm.svg?style=social&label=Star)

	 *Flexible audio loudness meter in Python with implementation of ITU-R BS.1770-4 loudness algorithm*
- [**ultimatevocalremovergui**](https://github.com/Anjok07/ultimatevocalremovergui) - Anjok07 ![Star](https://img.shields.io/github/stars/Anjok07/ultimatevocalremovergui.svg?style=social&label=Star)

	 *GUI for a Vocal Remover that uses Deep Neural Networks.*
- **Amphion: An Open-Source Audio, Music and Speech Generation Toolkit**, `arXiv, 2312.09911`, [arxiv](http://arxiv.org/abs/2312.09911v1), [pdf](http://arxiv.org/pdf/2312.09911v1.pdf)

	 *Xueyao Zhang, Liumeng Xue, Yuancheng Wang, Yicheng Gu, Xi Chen, Zihao Fang, Haopeng Chen, Lexiao Zou, Chaoren Wang, Jun Han* · ([huggingface](https://huggingface.co/amphion)) · ([Amphion](https://github.com/open-mmlab/Amphion/tree/main) - open-mmlab) ![Star](https://img.shields.io/github/stars/open-mmlab/Amphion.svg?style=social&label=Star)
- [**resemble-enhance**](https://github.com/resemble-ai/resemble-enhance) - resemble-ai ![Star](https://img.shields.io/github/stars/resemble-ai/resemble-enhance.svg?style=social&label=Star)

	 *AI powered speech denoising and enhancement*
- [**awesome-python**](https://github.com/vinta/awesome-python#audio) - vinta ![Star](https://img.shields.io/github/stars/vinta/awesome-python.svg?style=social&label=Star)

	 *A curated list of awesome Python frameworks, libraries, software and resources*
- [**pyannote-audio**](https://github.com/pyannote/pyannote-audio) - pyannote ![Star](https://img.shields.io/github/stars/pyannote/pyannote-audio.svg?style=social&label=Star)

	 *Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding*
- [**audiocraft**](https://github.com/facebookresearch/audiocraft) - facebookresearch ![Star](https://img.shields.io/github/stars/facebookresearch/audiocraft.svg?style=social&label=Star)

	 *Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.*
- [**audio-slicer**](https://github.com/openvpi/audio-slicer) - openvpi ![Star](https://img.shields.io/github/stars/openvpi/audio-slicer.svg?style=social&label=Star)

	 *Python script that slices audio with silence detection*
- [**autocut**](https://github.com/mli/autocut/blob/main/autocut/transcribe.py) - mli ![Star](https://img.shields.io/github/stars/mli/autocut.svg?style=social&label=Star)
- [**phonemizer**](https://github.com/bootphon/phonemizer) - bootphon ![Star](https://img.shields.io/github/stars/bootphon/phonemizer.svg?style=social&label=Star)

	 *Simple text to phones converter for multiple languages*
- [**g2pM**](https://github.com/kakaobrain/g2pM) - kakaobrain ![Star](https://img.shields.io/github/stars/kakaobrain/g2pM.svg?style=social&label=Star)

	 *A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset*
- [**g2pC**](https://github.com/Kyubyong/g2pC) - Kyubyong ![Star](https://img.shields.io/github/stars/Kyubyong/g2pC.svg?style=social&label=Star)

	 *g2pC: A Context-aware Grapheme-to-Phoneme Conversion module for Chinese*
- [**AcademiCodec**](https://github.com/yangdongchao/AcademiCodec) - yangdongchao ![Star](https://img.shields.io/github/stars/yangdongchao/AcademiCodec.svg?style=social&label=Star)

	 *AcademiCodec: An Open Source Audio Codec Model for Academic Research*
- [**Python-Wrapper-for-World-Vocoder**](https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder) - JeremyCCHsu ![Star](https://img.shields.io/github/stars/JeremyCCHsu/Python-Wrapper-for-World-Vocoder.svg?style=social&label=Star)

	 *A Python wrapper for the high-quality vocoder "World"*
- [**g2p-kd**](https://github.com/sigmeta/g2p-kd) - sigmeta ![Star](https://img.shields.io/github/stars/sigmeta/g2p-kd.svg?style=social&label=Star)

	 *Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion*
- [**ffmpeg-normalize**](https://github.com/slhck/ffmpeg-normalize) - slhck ![Star](https://img.shields.io/github/stars/slhck/ffmpeg-normalize.svg?style=social&label=Star)

	 *Audio Normalization for Python/ffmpeg*
- [**Montreal-Forced-Aligner**](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) - MontrealCorpusTools ![Star](https://img.shields.io/github/stars/MontrealCorpusTools/Montreal-Forced-Aligner.svg?style=social&label=Star)

	 *Command line utility for forced alignment using Kaldi*
- [**WeTextProcessing**](https://github.com/wenet-e2e/WeTextProcessing) - wenet-e2e ![Star](https://img.shields.io/github/stars/wenet-e2e/WeTextProcessing.svg?style=social&label=Star)

	 *Text Normalization & Inverse Text Normalization*
- [**python-pinyin**](https://github.com/mozillazg/python-pinyin) - mozillazg ![Star](https://img.shields.io/github/stars/mozillazg/python-pinyin.svg?style=social&label=Star)

	 *Convert Chinese characters to pinyin (pypinyin)*
- [**pypinyin-g2pW**](https://github.com/mozillazg/pypinyin-g2pW) - mozillazg ![Star](https://img.shields.io/github/stars/mozillazg/pypinyin-g2pW.svg?style=social&label=Star)

	 *Improves the accuracy of pypinyin using g2pW*
- **MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra**, `arXiv, 2305.13686`, [arxiv](http://arxiv.org/abs/2305.13686v1), [pdf](http://arxiv.org/pdf/2305.13686v1.pdf)

	 *Ye-Xin Lu, Yang Ai, Zhen-Hua Ling* · ([MP-SENet](https://github.com/yxlu-0102/MP-SENet) - yxlu-0102) ![Star](https://img.shields.io/github/stars/yxlu-0102/MP-SENet.svg?style=social&label=Star)
## Dataset
- **MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages**, `arXiv, 2410.01036`, [arxiv](http://arxiv.org/abs/2410.01036v1), [pdf](http://arxiv.org/pdf/2410.01036v1.pdf)

	 *Marco Gaido, Sara Papi, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri* · ([mosel](https://github.com/hlt-mt/mosel) - hlt-mt) ![Star](https://img.shields.io/github/stars/hlt-mt/mosel.svg?style=social&label=Star) · ([huggingface](https://huggingface.co/datasets/FBK-MT/mosel))
- **SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description**, `arXiv, 2408.13608`, [arxiv](http://arxiv.org/abs/2408.13608v1), [pdf](http://arxiv.org/pdf/2408.13608v1.pdf), citation: [**1**](https://scholar.google.com/scholar?cites=2464243547952350028&as_sdt=2005&sciodt=0,5&hl=en&oe=ASCII)

	 *Zeyu Jin, Jia Jia, Qixin Wang, Kehan Li, Shuoyi Zhou, Songtao Zhou, Xiaoyu Qin, Zhiyong Wu* · ([SpeechCraft](https://github.com/thuhcsi/SpeechCraft) - thuhcsi) ![Star](https://img.shields.io/github/stars/thuhcsi/SpeechCraft.svg?style=social&label=Star)
- **Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition**, `arXiv, 2408.09215`, [arxiv](http://arxiv.org/abs/2408.09215v1), [pdf](http://arxiv.org/pdf/2408.09215v1.pdf)

	 *Samuele Cornell, Jordan Darefsky, Zhiyao Duan, Shinji Watanabe*
- **LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech**, `arXiv, 2407.04280`, [arxiv](http://arxiv.org/abs/2407.04280v1), [pdf](http://arxiv.org/pdf/2407.04280v1.pdf)

	 *Haechan Kim, Junho Myung, Seoyoung Kim, Sungpah Lee, Dongyeop Kang, Juho Kim*
- **AudioTime: A Temporally-aligned Audio-text Benchmark Dataset**, `arXiv, 2407.02857`, [arxiv](http://arxiv.org/abs/2407.02857v1), [pdf](http://arxiv.org/pdf/2407.02857v1.pdf)

	 *Zeyu Xie, Xuenan Xu, Zhizheng Wu, Mengyue Wu* · ([zeyuxie29.github](https://zeyuxie29.github.io/AudioTime/)) · ([AudioTime](https://github.com/zeyuxie29/AudioTime) - zeyuxie29) ![Star](https://img.shields.io/github/stars/zeyuxie29/AudioTime.svg?style=social&label=Star)
- [**mms_ulab_v2**](https://huggingface.co/datasets/espnet/mms_ulab_v2) - espnet 🤗
- [**links_to_pocasts_lecture_and_shows_for_tts**](https://huggingface.co/datasets/laion/links_to_pocasts_lecture_and_shows_for_tts) - laion 🤗
- **Audio Dialogues: Dialogues dataset for audio and music understanding**, `arXiv, 2404.07616`, [arxiv](http://arxiv.org/abs/2404.07616v1), [pdf](http://arxiv.org/pdf/2404.07616v1.pdf)

	 *Arushi Goel, Zhifeng Kong, Rafael Valle, Bryan Catanzaro* · ([audiodialogues.github](https://audiodialogues.github.io/))
- [**yodas**](https://huggingface.co/datasets/espnet/yodas) - espnet 🤗
- **An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation**, `arXiv, 2402.16380`, [arxiv](http://arxiv.org/abs/2402.16380v1), [pdf](http://arxiv.org/pdf/2402.16380v1.pdf)

	 *Ahmet Gunduz, Kamer Ali Yuksel, Kareem Darwish, Golara Javadi, Fabio Minazzi, Nicola Sobieski, Sebastien Bratieres*
- [**common_voice_17_0**](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) - mozilla-foundation 🤗
- [**common_voice_16_0**](https://huggingface.co/datasets/mozilla-foundation/common_voice_16_0) - mozilla-foundation 🤗
- [**GigaSpeech**](https://github.com/SpeechColab/GigaSpeech) - SpeechColab ![Star](https://img.shields.io/github/stars/SpeechColab/GigaSpeech.svg?style=social&label=Star)

	 *Large, modern dataset for speech recognition*
- [Chinese and English data collection](https://yqli.tech/page/data.html)
- [**voice_datasets**](https://github.com/jim-schwoebel/voice_datasets) - jim-schwoebel ![Star](https://img.shields.io/github/stars/jim-schwoebel/voice_datasets.svg?style=social&label=Star)

	 *🔊 A comprehensive list of open-source datasets for voice and sound computing (95+ datasets).*
- **MUSAN: A Music, Speech, and Noise Corpus**, `arXiv, 1510.08484`, [arxiv](http://arxiv.org/abs/1510.08484v1), [pdf](http://arxiv.org/pdf/1510.08484v1.pdf), citation: [**1439**](https://scholar.google.com/scholar?cites=6755292572666912523&as_sdt=2005&sciodt=0,5&hl=en&oe=ASCII)

	 *David Snyder, Guoguo Chen, Daniel Povey*

### TTS
- [**libritts_r**](https://huggingface.co/datasets/mythicinfinity/libritts_r) - mythicinfinity 🤗
- **Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation**, `arXiv, 2407.05361`, [arxiv](http://arxiv.org/abs/2407.05361v2), [pdf](http://arxiv.org/pdf/2407.05361v2.pdf)

	 *Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi* · ([huggingface](https://huggingface.co/datasets/amphion/Emilia)) · ([Amphion](https://github.com/open-mmlab/Amphion/blob/main/preprocessors/Emilia/README.md) - open-mmlab) ![Star](https://img.shields.io/github/stars/open-mmlab/Amphion.svg?style=social&label=Star)
- **GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech**, `arXiv, 2406.14875`, [arxiv](http://arxiv.org/abs/2406.14875v1), [pdf](http://arxiv.org/pdf/2406.14875v1.pdf)

	 *Wenbin Wang, Yang Song, Sanjay Jha* · ([huggingface](https://huggingface.co/datasets/MushanW/GLOBE)) · ([globecorpus.github](https://globecorpus.github.io/))
- **Meta Learning Text-to-Speech Synthesis in over 7000 Languages**, `arXiv, 2406.06403`, [arxiv](http://arxiv.org/abs/2406.06403v1), [pdf](http://arxiv.org/pdf/2406.06403v1.pdf)

	 *Florian Lux, Sarina Meyer, Lyonel Behringer, Frank Zalkow, Phat Do, Matt Coler, Emanuël A. P. Habets, Ngoc Thang Vu* · ([huggingface](https://huggingface.co/datasets/Flux9665/BibleMMS))
- **YODAS: Youtube-Oriented Dataset for Audio and Speech**, `2023 ieee automatic speech recognition and understanding workshop …, 2023`, [arxiv](http://arxiv.org/abs/2406.00899v1), [pdf](http://arxiv.org/pdf/2406.00899v1.pdf), citation: [**2**](https://scholar.google.com/scholar?cites=8882551202403401748&as_sdt=2005&sciodt=0,5&hl=en&oe=ASCII)

	 *Xinjian Li, Shinnosuke Takamichi, Takaaki Saeki, William Chen, Sayaka Shiota, Shinji Watanabe* · ([huggingface](https://huggingface.co/datasets/espnet/yodas)) · ([huggingface](https://huggingface.co/datasets/espnet/yodas2))
- **LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning**, `arXiv, 2406.07969`, [arxiv](http://arxiv.org/abs/2406.07969v1), [pdf](http://arxiv.org/pdf/2406.07969v1.pdf)

	 *Masaya Kawamura, Ryuichi Yamamoto, Yuma Shirahata, Takuya Hasumi, Kentaro Tachibana* · ([masayakawamura.github](https://masayakawamura.github.io/libritts-p/)) · ([LibriTTS-P](https://github.com/line/LibriTTS-P) - line) ![Star](https://img.shields.io/github/stars/line/LibriTTS-P.svg?style=social&label=Star)
- **WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark**, `arXiv, 2406.05763`, [arxiv](http://arxiv.org/abs/2406.05763v2), [pdf](http://arxiv.org/pdf/2406.05763v2.pdf)

	 *Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie* · ([huggingface](https://huggingface.co/datasets/Wenetspeech4TTS/WenetSpeech4TTS))
- [**snac_llm_parler_tts**](https://huggingface.co/datasets/blanchon/snac_llm_parler_tts) - blanchon 🤗
- **EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis**, `arXiv, 2308.05725`, [arxiv](http://arxiv.org/abs/2308.05725v1), [pdf](http://arxiv.org/pdf/2308.05725v1.pdf)

	 *Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid* · ([huggingface](https://huggingface.co/datasets/ylacombe/expresso)) · ([textlesslib](https://github.com/facebookresearch/textlesslib/tree/main/examples/expresso/dataset) - facebookresearch) ![Star](https://img.shields.io/github/stars/facebookresearch/textlesslib.svg?style=social&label=Star)
- **StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations**, `arXiv, 2404.14946`, [arxiv](http://arxiv.org/abs/2404.14946v1), [pdf](http://arxiv.org/pdf/2404.14946v1.pdf)

	 *Sen Liu, Yiwei Guo, Xie Chen, Kai Yu* · ([goarsenal.github](https://goarsenal.github.io/StoryTTS/))
- [**mls_eng_10k**](https://huggingface.co/datasets/parler-tts/mls_eng_10k) - parler-tts 🤗
- **LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus**, `arXiv, 2305.18802`, [arxiv](http://arxiv.org/abs/2305.18802v1), [pdf](http://arxiv.org/pdf/2305.18802v1.pdf)

	 *Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang, Wei Han, Ankur Bapna* · ([openslr](http://www.openslr.org/141/)) · ([google.github](https://google.github.io/df-conformer/librittsr/))
- **Libri-Light: A Benchmark for ASR with Limited or No Supervision**, `icassp 2020-2020 ieee international conference on acoustics …, 2020`, [arxiv](http://arxiv.org/abs/1912.07875v1), [pdf](http://arxiv.org/pdf/1912.07875v1.pdf), citation: [**483**](https://scholar.google.com/scholar?cites=596221056952134740&as_sdt=2005&sciodt=0,5&hl=en&oe=ASCII)

	 *Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen* · ([libri-light](https://github.com/facebookresearch/libri-light/tree/main) - facebookresearch) ![Star](https://img.shields.io/github/stars/facebookresearch/libri-light.svg?style=social&label=Star)
- **MLS: A Large-Scale Multilingual Dataset for Speech Research**, `arXiv, 2012.03411`, [arxiv](http://arxiv.org/abs/2012.03411v2), [pdf](http://arxiv.org/pdf/2012.03411v2.pdf)

	 *Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert* · ([openslr](http://www.openslr.org/94/))
	- The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.
	- The audio files are segmented into 10-20 second segments.
- **LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech**, `arXiv, 1904.02882`, [arxiv](http://arxiv.org/abs/1904.02882v1), [pdf](http://arxiv.org/pdf/1904.02882v1.pdf)

	 *Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, Yonghui Wu*
- **Hi-Fi Multi-Speaker English TTS Dataset**, `arXiv, 2104.01497`, [arxiv](http://arxiv.org/abs/2104.01497v3), [pdf](http://arxiv.org/pdf/2104.01497v3.pdf)

	 *Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, Yang Zhang* · ([openslr](https://www.openslr.org/109/))

## Audio Techs
- [**voicerestore**](https://github.com/skirdey/voicerestore) - skirdey ![Star](https://img.shields.io/github/stars/skirdey/voicerestore.svg?style=social&label=Star)

	 *VoiceRestore: Flow-Matching Transformers for Universal Speech Restoration* · ([huggingface](https://huggingface.co/spaces/jadechoghari/VoiceRestore))
- **Apollo: Band-sequence Modeling for High-Quality Audio Restoration**, `arXiv, 2409.08514`, [arxiv](http://arxiv.org/abs/2409.08514v1), [pdf](http://arxiv.org/pdf/2409.08514v1.pdf)

	 *Kai Li, Yi Luo* · ([cslikai](https://cslikai.cn/Apollo/))
- **The VoxCeleb Speaker Recognition Challenge: A Retrospective**, `arXiv, 2408.14886`, [arxiv](http://arxiv.org/abs/2408.14886v1), [pdf](http://arxiv.org/pdf/2408.14886v1.pdf)

	 *Jaesung Huh, Joon Son Chung, Arsha Nagrani, Andrew Brown, Jee-weon Jung, Daniel Garcia-Romero, Andrew Zisserman* · ([mm.kaist.ac](https://mm.kaist.ac.kr/datasets/voxceleb/voxsrc/workshop.html))
- **Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training**, `arXiv, 2408.07919`, [arxiv](http://arxiv.org/abs/2408.07919v1), [pdf](http://arxiv.org/pdf/2408.07919v1.pdf)

	 *Yiming Li, Zhifang Guo, Xiangdong Wang, Hong Liu*
- **Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis**, `arXiv, 2407.09732`, [arxiv](http://arxiv.org/abs/2407.09732v1), [pdf](http://arxiv.org/pdf/2407.09732v1.pdf)

	 *Xilin Jiang, Yinghao Aaron Li, Adrian Nicolas Florea, Cong Han, Nima Mesgarani* · ([Mamba-TasNet](https://github.com/xi-j/Mamba-TasNet) - xi-j) ![Star](https://img.shields.io/github/stars/xi-j/Mamba-TasNet.svg?style=social&label=Star)
- **EfficientSpeech: An On-Device Text to Speech Model**, `arXiv, 2305.13905`, [arxiv](http://arxiv.org/abs/2305.13905v1), [pdf](http://arxiv.org/pdf/2305.13905v1.pdf)

	 *Rowel Atienza*
- **Visual-Aware Text-to-Speech**, `arXiv, 2306.12020`, [arxiv](http://arxiv.org/abs/2306.12020v1), [pdf](http://arxiv.org/pdf/2306.12020v1.pdf)

	 *Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, Tao Mei*
- **Proactive Detection of Voice Cloning with Localized Watermarking**, `arXiv, 2401.17264`, [arxiv](http://arxiv.org/abs/2401.17264v1), [pdf](http://arxiv.org/pdf/2401.17264v1.pdf)

	 *Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, Hady Elsahar*
- **FADI-AEC: Fast Score Based Diffusion Model Guided by Far-end Signal for Acoustic Echo Cancellation**, `arXiv, 2401.04283`, [arxiv](http://arxiv.org/abs/2401.04283v1), [pdf](http://arxiv.org/pdf/2401.04283v1.pdf)

	 *Yang Liu, Li Wan, Yun Li, Yiteng Huang, Ming Sun, James Luan, Yangyang Shi, Xin Lei*
- [**AudioSep**](https://github.com/Audio-AGI/AudioSep) - Audio-AGI ![Star](https://img.shields.io/github/stars/Audio-AGI/AudioSep.svg?style=social&label=Star)

	 *Official implementation of "Separate Anything You Describe"*
- **AudioSR: Versatile Audio Super-resolution at Scale**, `arXiv, 2309.07314`, [arxiv](http://arxiv.org/abs/2309.07314v1), [pdf](http://arxiv.org/pdf/2309.07314v1.pdf)

	 *Haohe Liu, Ke Chen, Qiao Tian, Wenwu Wang, Mark D. Plumbley*

## Audio Visual
- **Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes**, `arXiv, 2407.10957`, [arxiv](http://arxiv.org/abs/2407.10957v1), [pdf](http://arxiv.org/pdf/2407.10957v1.pdf)

	 *Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, Di Hu* · ([gewu-lab.github](https://gewu-lab.github.io/Ref-AVS))
- **video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models**, `arXiv, 2406.15704`, [arxiv](http://arxiv.org/abs/2406.15704v1), [pdf](http://arxiv.org/pdf/2406.15704v1.pdf), citation: [**-1**](None)

	 *Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang* · ([SALMONN](https://github.com/bytedance/SALMONN) - bytedance) ![Star](https://img.shields.io/github/stars/bytedance/SALMONN.svg?style=social&label=Star)
- **Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding
  of Sound and Language**, `arXiv, 2406.05629`, [arxiv](http://arxiv.org/abs/2406.05629v1), [pdf](http://arxiv.org/pdf/2406.05629v1.pdf), citation: [**-1**](None)

	 *Mark Hamilton, Andrew Zisserman, John R. Hershey, William T. Freeman*

- **AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary
  Alignment for Temporal Referential Dialogue**, `arXiv, 2403.16276`, [arxiv](http://arxiv.org/abs/2403.16276v1), [pdf](http://arxiv.org/pdf/2403.16276v1.pdf), citation: [**-1**](None)

	 *Yunlong Tang, Daiki Shimada, Jing Bi, Chenliang Xu*
- **M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual
  Academic Lecture Dataset**, `arXiv, 2403.14168`, [arxiv](http://arxiv.org/abs/2403.14168v1), [pdf](http://arxiv.org/pdf/2403.14168v1.pdf), citation: [**-1**](None)

	 *Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang*
- **Text-to-Audio Generation Synchronized with Videos**, `arXiv, 2403.07938`, [arxiv](http://arxiv.org/abs/2403.07938v1), [pdf](http://arxiv.org/pdf/2403.07938v1.pdf), citation: [**-1**](None)

	 *Shentong Mo, Jing Shi, Yapeng Tian*
- **Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion
  Latent Aligners**, `arXiv, 2402.17723`, [arxiv](http://arxiv.org/abs/2402.17723v1), [pdf](http://arxiv.org/pdf/2402.17723v1.pdf), citation: [**-1**](None)

	 *Yazhou Xing, Yingqing He, Zeyue Tian, Xintao Wang, Qifeng Chen* · ([yzxing87.github](https://yzxing87.github.io/Seeing-and-Hearing/))
- [**vsp-llm**](https://github.com/sally-sh/vsp-llm) - sally-sh ![Star](https://img.shields.io/github/stars/sally-sh/vsp-llm.svg?style=social&label=Star)
- **RTFS-Net: Recurrent time-frequency modelling for efficient audio-visual
  speech separation**, `arXiv, 2309.17189`, [arxiv](http://arxiv.org/abs/2309.17189v3), [pdf](http://arxiv.org/pdf/2309.17189v3.pdf), citation: [**-1**](None)

	 *Samuel Pegg, Kai Li, Xiaolin Hu* · ([jiqizhixin](https://www.jiqizhixin.com/articles/2024-03-06)) · ([cslikai](https://cslikai.cn/RTFS-Net/AV-Model-Demo.html)) · ([RTFS-Net](https://github.com/spkgyk/RTFS-Net) - spkgyk) ![Star](https://img.shields.io/github/stars/spkgyk/RTFS-Net.svg?style=social&label=Star)

## Event Detection
- **SpeechEE: A Novel Benchmark for Speech Event Extraction**, `arXiv, 2408.09462`, [arxiv](http://arxiv.org/abs/2408.09462v2), [pdf](http://arxiv.org/pdf/2408.09462v2.pdf), citation: [**-1**](None)

	 *Bin Wang, Meishan Zhang, Hao Fei, Yu Zhao, Bobo Li, Shengqiong Wu, Wei Ji, Min Zhang* · ([speechee.github](https://speechee.github.io/))

## Emotion Recognition
- **Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition**, `arXiv, 2403.19224`, [arxiv](http://arxiv.org/abs/2403.19224v1), [pdf](http://arxiv.org/pdf/2403.19224v1.pdf), citation: [**-1**](None)

	 *Siyuan Shen, Yu Gao, Feng Liu, Hanyang Wang, Aimin Zhou* · ([ENT](https://github.com/ECNU-Cross-Innovation-Lab/ENT) - ECNU-Cross-Innovation-Lab) ![Star](https://img.shields.io/github/stars/ECNU-Cross-Innovation-Lab/ENT.svg?style=social&label=Star)

## Speech Separation
- **SoloAudio: Target Sound Extraction with Language-oriented Audio
  Diffusion Transformer**, `arXiv, 2409.08425`, [arxiv](http://arxiv.org/abs/2409.08425v1), [pdf](http://arxiv.org/pdf/2409.08425v1.pdf), citation: [**-1**](None)

	 *Helin Wang, Jiarui Hai, Yen-Ju Lu, Karan Thakkar, Mounya Elhilali, Najim Dehak* · ([SoloAudio](https://github.com/WangHelin1997/SoloAudio) - WangHelin1997) ![Star](https://img.shields.io/github/stars/WangHelin1997/SoloAudio.svg?style=social&label=Star) · ([wanghelin1997.github](https://wanghelin1997.github.io/SoloAudio-Demo/))
- **Facing the Music: Tackling Singing Voice Separation in Cinematic Audio
  Source Separation**, `arXiv, 2408.03588`, [arxiv](http://arxiv.org/abs/2408.03588v1), [pdf](http://arxiv.org/pdf/2408.03588v1.pdf), citation: [**-1**](None)

	 *Karn N. Watcharasupat, Chih-Wei Wu, Iroro Orife* · ([source-separation-landing](https://github.com/kwatcharasupat/source-separation-landing) - kwatcharasupat) ![Star](https://img.shields.io/github/stars/kwatcharasupat/source-separation-landing.svg?style=social&label=Star)
- **SPMamba: State-space model is all you need in speech separation**, `arXiv, 2404.02063`, [arxiv](http://arxiv.org/abs/2404.02063v1), [pdf](http://arxiv.org/pdf/2404.02063v1.pdf), citation: [**-1**](None)

	 *Kai Li, Guo Chen* · ([SPMamba](https://github.com/JusperLee/SPMamba) - JusperLee) ![Star](https://img.shields.io/github/stars/JusperLee/SPMamba.svg?style=social&label=Star)

## Products
- [PlayAI](https://play.ai/)
- [Meet Hume’s Empathic Voice Interface (EVI), the first conversational AI with emotional intelligence.](https://demo.hume.ai/)

	 · ([twitter](https://twitter.com/hume_ai/status/1773017055974789176)) · ([mp.weixin.qq](https://mp.weixin.qq.com/s?__biz=MzI3MTA0MTk1MA==&mid=2652463844&idx=2&sn=fbaea2bd4a0ea1f270ffbae67c406e92))


## Extra Reference
- [**AudioLLM**](https://github.com/AudioLLMs/AudioLLM) - AudioLLMs ![Star](https://img.shields.io/github/stars/AudioLLMs/AudioLLM.svg?style=social&label=Star)
- [**speech-trident**](https://github.com/ga642381/speech-trident) - ga642381 ![Star](https://img.shields.io/github/stars/ga642381/speech-trident.svg?style=social&label=Star)

	 *Awesome speech/audio LLMs, representation learning, and codec models*
- [**Large-Audio-Models**](https://github.com/liusongxiang/Large-Audio-Models) - liusongxiang ![Star](https://img.shields.io/github/stars/liusongxiang/Large-Audio-Models.svg?style=social&label=Star)

	 *Keep track of big models in audio domain, including speech, singing, music etc.*
- [**Awesome-Speech-Generation**](https://github.com/kuan2jiu99/Awesome-Speech-Generation) - kuan2jiu99 ![Star](https://img.shields.io/github/stars/kuan2jiu99/Awesome-Speech-Generation.svg?style=social&label=Star)

	 *Survey on speech generation work.*
- [**Speech-Prompts-Adapters**](https://github.com/ga642381/Speech-Prompts-Adapters) - ga642381 ![Star](https://img.shields.io/github/stars/ga642381/Speech-Prompts-Adapters.svg?style=social&label=Star)

	 *This repository surveys papers on prompting and adapters for speech processing.*