<!--# Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities-->
## [aka.ms/GeneralAI](https://aka.ms/GeneralAI)
# Hiring
We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on Foundation Models (aka large-scale pre-trained models) and General AI, NLP, MT, Speech, Document AI and Multimodal AI, please send your resume to <a href="mailto:fuwei@microsoft.com" class="x-hidden-focus">fuwei@microsoft.com</a>.

# Foundation Architecture
### TorchScale - A Library of Foundation Architectures ([repo](https://github.com/microsoft/torchscale))

Fundamental research to develop new architectures for foundation models and AI, focusing on modeling generality and capability, as well as training stability and efficiency.

> Stability - [**DeepNet**](https://github.com/microsoft/unilm/tree/master/deepnet): scaling Transformers to 1,000 Layers and beyond

> Generality - [**Foundation Transformers (Magneto)**](https://arxiv.org/abs/2210.06423): towards true general-purpose modeling across tasks and modalities (including language, vision, speech, and multimodal)

> Capability - A [**Length-Extrapolatable**](https://arxiv.org/abs/2212.10554) Transformer

> Efficiency & Transferability - [**X-MoE**](https://github.com/microsoft/unilm/tree/master/xmoe): scalable & finetunable sparse Mixture-of-Experts (MoE)

### The Revolution of Model Architecture

> [**BitNet**](https://arxiv.org/abs/2310.11453): 1-bit Transformers for Large Language Models

> [**RetNet**](https://arxiv.org/abs/2307.08621): Retentive Network: A Successor to Transformer for Large Language Models

> [**LongNet**](https://arxiv.org/abs/2307.02486): Scaling Transformers to 1,000,000,000 Tokens

# Foundation Models

### The Evolution of (M)LLM (Multimodal LLM)

> [**Kosmos-2.5**](https://github.com/microsoft/unilm/tree/master/kosmos-2.5): **A Multimodal Literate Model**

> [**Kosmos-2**](https://github.com/microsoft/unilm/tree/master/kosmos-2): **Grounding Multimodal Large Language Models to the World**

> [**Kosmos-1**](https://arxiv.org/abs/2302.14045): **A Multimodal Large Language Model (MLLM)**

> [**MetaLM**](https://github.com/microsoft/unilm/tree/master/metalm): **Language Models are General-Purpose Interfaces**

**The Big Convergence** - Large-scale self-supervised pre-training across ```tasks``` (predictive and generative), ```languages``` (100+ languages), and ```modalities``` (language, image, audio, layout/format + language, vision + language, audio + language, etc.)

<!--## Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities-->

### Language & Multilingual
> [**UniLM**](https://github.com/microsoft/unilm/tree/master/unilm): unified pre-training for language understanding and generation

> [**InfoXLM/XLM-E**](https://github.com/microsoft/unilm/tree/master/infoxlm): multilingual/cross-lingual pre-trained models for 100+ languages

> [**DeltaLM/mT6**](https://github.com/microsoft/unilm/tree/master/deltalm): encoder-decoder pre-training for language generation and translation for 100+ languages

> [**MiniLM**](https://github.com/microsoft/unilm/tree/master/minilm): small and fast pre-trained models for language understanding and generation

> [**AdaLM**](https://github.com/microsoft/unilm/tree/master/adalm): domain, language, and task adaptation of pre-trained models

> [**EdgeLM**](https://github.com/microsoft/unilm/tree/master/edgelm)(```NEW```): small pre-trained models on edge/client devices

> [**SimLM**](https://github.com/microsoft/unilm/tree/master/simlm) (```NEW```): large-scale pre-training for similarity matching

> [**E5**](https://github.com/microsoft/unilm/tree/master/e5) (```NEW```): text embeddings

> [**MiniLLM**](https://arxiv.org/abs/2306.08543) (```NEW```): Knowledge Distillation of Large Language Models

### Vision
> [**BEiT**](https://github.com/microsoft/unilm/tree/master/beit)/[**BEiT-2**](https://github.com/microsoft/unilm/tree/master/beit2): generative self-supervised pre-training for vision / BERT Pre-Training of Image Transformers

> [**DiT**](https://github.com/microsoft/unilm/tree/master/dit): self-supervised pre-training for Document Image Transformers

> [**TextDiffuser**](https://github.com/microsoft/unilm/tree/master/textdiffuser)/[**TextDiffuser-2**](https://github.com/microsoft/unilm/tree/master/textdiffuser-2) (```NEW```): Diffusion Models as Text Painters

### Speech
> [**WavLM**](https://github.com/microsoft/unilm/tree/master/wavlm): speech pre-training for full stack tasks

> [**VALL-E**](https://github.com/microsoft/unilm/tree/master/valle): a neural codec language model for TTS

### Multimodal (X + Language)
> [**LayoutLM**](https://github.com/microsoft/unilm/tree/master/layoutlm)/[**LayoutLMv2**](https://github.com/microsoft/unilm/tree/master/layoutlmv2)/[**LayoutLMv3**](https://github.com/microsoft/unilm/tree/master/layoutlmv3): multimodal (text + layout/format + image) **Document Foundation Model** for [Document AI](https://www.microsoft.com/en-us/research/project/document-ai/) (e.g. scanned documents, PDF, etc.)

> [**LayoutXLM**](https://github.com/microsoft/unilm/tree/master/layoutxlm): multimodal (text + layout/format + image) **Document Foundation Model** for multilingual Document AI

> [**MarkupLM**](https://github.com/microsoft/unilm/tree/master/markuplm): markup language model pre-training for visually-rich document understanding

> [**XDoc**](https://github.com/microsoft/unilm/tree/master/xdoc): unified pre-training for cross-format document understanding

> [**UniSpeech**](https://arxiv.org/abs/2101.07597): unified pre-training for self-supervised learning and supervised learning for ASR

> [**UniSpeech-SAT**](https://arxiv.org/pdf/2110.05752.pdf): universal speech representation learning with speaker-aware pre-training

> [**SpeechT5**](https://arxiv.org/abs/2110.07205): encoder-decoder pre-training for spoken language processing

> [**SpeechLM**](https://arxiv.org/abs/2209.15329): Enhanced Speech Pre-Training with Unpaired Textual Data

> [**VLMo**](https://github.com/microsoft/unilm/tree/master/vlmo): Unified vision-language pre-training 

> [**VL-BEiT**](https://github.com/microsoft/unilm/tree/master/vl-beit) (```NEW```): Generative Vision-Language Pre-training - evolution of **BEiT** to multimodal

> [**BEiT-3**](https://github.com/microsoft/unilm/tree/master/beit3) (```NEW```): a general-purpose multimodal foundation model, and a major milestone of **The Big Convergence** of Large-scale Pre-training Across Tasks, Languages, and Modalities.
### Toolkits
> [**s2s-ft**](https://github.com/microsoft/unilm/tree/master/s2s-ft): sequence-to-sequence fine-tuning toolkit

> [**Aggressive Decoding**](https://arxiv.org/pdf/2205.10350.pdf) (```NEW```): lossless and efficient sequence-to-sequence decoding algorithm

### Applications
> [**TrOCR**](https://github.com/microsoft/unilm/tree/master/trocr): transformer-based OCR w/ pre-trained models
 
> [**LayoutReader**](https://github.com/microsoft/unilm/tree/master/layoutreader): pre-training of text and layout for reading order detection

> [**XLM-T**](https://github.com/microsoft/unilm/tree/master/xlmt): multilingual NMT w/ pretrained cross-lingual encoders

## Links
### LLMOps ([repo](https://github.com/microsoft/lmops))
General technology for enabling AI capabilities w/ LLMs and MLLMs.

### RedStone ([repo](https://github.com/microsoft/redstone))
Curating General, Code, Math, and QA Data for Large Language Models.

## News
- December, 2024: [**RedStone**](https://github.com/microsoft/redstone) was released!
- December, 2023: [**LongNet**](https://github.com/microsoft/unilm/tree/master/longnet) and [**LongViT**](https://github.com/microsoft/unilm/tree/master/longvit) released
- [Model Release] Dec, 2023: [**TextDiffuser-2**](https://github.com/microsoft/unilm/tree/master/textdiffuser-2) models, code and [demo](https://huggingface.co/spaces/JingyeChen22/TextDiffuser-2). 
- Sep, 2023: [**Kosmos-2.5**](https://arxiv.org/abs/2309.11419) - a multimodal literate model for machine reading of text-intensive images.
- [Model Release] May, 2023: [**TextDiffuser**](https://github.com/microsoft/unilm/tree/master/textdiffuser) models and code.
- [Model Release] March, 2023: [**BEiT-3**](https://github.com/microsoft/unilm/tree/master/beit3) pretrained models and code.
- March, 2023: [**Kosmos-1**](https://arxiv.org/abs/2302.14045) - a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot).
- January, 2023: [**VALL-E**](https://arxiv.org/abs/2301.02111) a language modeling approach for text to speech synthesis (TTS), which achieves state-of-the-art zero-shot TTS performance. See https://aka.ms/valle for demos of our work.
- [Model Release] January, 2023: [**E5**](https://github.com/microsoft/unilm/tree/master/e5) - Text Embeddings by Weakly-Supervised Contrastive Pre-training.
- November, 2022: [**TorchScale 0.1.1**](https://github.com/microsoft/torchscale) was released!
- November, 2022: [**TrOCR**](https://arxiv.org/abs/2109.10282) was accepted by AAAI 2023.
- [Model Release] November, 2022: [**XDoc**](https://github.com/microsoft/unilm/tree/master/xdoc) **BASE** models for cross-format document understanding.
- [Model Release] September, 2022: [**TrOCR**](https://github.com/microsoft/unilm/tree/master/trocr) **BASE** and **LARGE** models for Scene Text Recognition (STR).
- [Model Release] September, 2022: [**BEiT v2**](https://github.com/microsoft/unilm/tree/master/beit2) code and pretrained models.
- August, 2022: [**BEiT-3**](https://arxiv.org/abs/2208.10442) - a general-purpose multimodal foundation model, which achieves state-of-the-art transfer performance on both vision and vision-language tasks
- July, 2022: [**SimLM**](https://github.com/microsoft/unilm/tree/master/simlm) - Large-scale self-supervised pre-training for similarity matching
- June, 2022: [**DiT**](https://arxiv.org/abs/2203.02378) and [**LayoutLMv3**](https://arxiv.org/abs/2204.08387) were accepted by ACM Multimedia 2022.
- June, 2022: [**MetaLM**](https://github.com/microsoft/unilm/tree/master/metalm) - Language models are general-purpose interfaces to foundation models (language/multilingual, vision, speech, and multimodal)
- June, 2022: [**VL-BEiT**](https://github.com/microsoft/unilm/tree/master/vl-beit) - bidirectional multimodal Transformer learned from scratch with one unified pretraining task, one shared backbone, and one-stage training, supporting both vision and vision-language tasks.
- [Model Release] June, 2022: [**LayoutLMv3 Chinese**](https://github.com/microsoft/unilm/tree/master/layoutlmv3#form-understanding-on-xfund) - Chinese version of LayoutLMv3
- [Code Release] May, 2022: [**Aggressive Decoding**](https://github.com/microsoft/unilm/tree/master/decoding) - Lossless Speedup for Seq2seq Generation
- April, 2022: **Transformers at Scale** = [DeepNet](https://arxiv.org/abs/2203.00555) + [X-MoE](https://arxiv.org/abs/2204.09179)
- [Model Release] April, 2022: [**LayoutLMv3**](https://github.com/microsoft/unilm/tree/master/layoutlmv3) - Pre-training for Document AI with Unified Text and Image Masking
- [Model Release] March, 2022: [**EdgeFormer**](https://github.com/microsoft/unilm/tree/master/edgelm) - Parameter-efficient Transformer for On-device Seq2seq Generation
- [Model Release] March, 2022: [**DiT**](https://github.com/microsoft/unilm/tree/master/dit) - Self-supervised Document Image Transformer. Demos: [Document Layout Analysis](https://huggingface.co/spaces/nielsr/dit-document-layout-analysis), [Document Image Classification](https://huggingface.co/spaces/microsoft/document-image-transformer)
- January, 2022: [**BEiT**](https://openreview.net/forum?id=p-BhZSz59o4) was accepted by **ICLR 2022 as Oral presentation** (54 out of 3391).
- [Model Release] December 16th, 2021: [**TrOCR**](https://github.com/microsoft/unilm/tree/master/trocr) **small** models for handwritten and printed texts, with 3x inference speedup.
- November 24th, 2021: [**VLMo**](https://github.com/microsoft/unilm/tree/master/vlmo) as the new SOTA on the [VQA Challenge](https://eval.ai/web/challenges/challenge-page/830/leaderboard/2278)
- November, 2021: [Multilingual translation at scale: 10000 language pairs and beyond](https://www.microsoft.com/en-us/translator/blog/2021/11/22/multilingual-translation-at-scale-10000-language-pairs-and-beyond/)
- [Model Release] November, 2021: [**MarkupLM**](https://github.com/microsoft/unilm/tree/master/markuplm) - Pre-training for text and markup language (e.g. HTML/XML)
- [Model Release] November, 2021: [**VLMo**](https://github.com/microsoft/unilm/tree/master/vlmo) - Unified vision-language pre-training w/ [**BEiT**](https://github.com/microsoft/unilm/tree/master/beit)
- October, 2021: [**WavLM**](https://github.com/microsoft/unilm/tree/master/wavlm) Large achieves state-of-the-art performance on the [SUPERB](https://superbbenchmark.org/leaderboard) benchmark
- [Model Release] October, 2021: [**WavLM**](https://github.com/microsoft/unilm/tree/master/wavlm) - Large-scale self-supervised pre-trained models for speech. 
- [Model Release] October 2021: [**TrOCR**](https://huggingface.co/transformers/master/model_doc/trocr.html) is on [HuggingFace](https://github.com/huggingface/transformers)
- September 28th, 2021: T-ULRv5 (aka <a href="https://arxiv.org/abs/2106.16138" target="_blank">XLM-E</a>/<a href="https://arxiv.org/abs/2007.07834" target="_blank">InfoXLM</a>) as the SOTA on the <a href="https://sites.research.google/xtreme" target="_blank">XTREME</a> leaderboard. // <a href="https://www.microsoft.com/en-us/research/blog/microsoft-turing-universal-language-representation-model-t-ulrv5-tops-xtreme-leaderboard-and-trains-100x-faster/" target="_blank">Blog</a>
- [Model Release] September, 2021: [**LayoutLM-cased**](https://huggingface.co/microsoft/layoutlm-base-cased) are on [HuggingFace](https://github.com/huggingface/transformers)
- [Model Release] September, 2021: [**TrOCR**](https://github.com/microsoft/unilm/tree/master/trocr) - Transformer-based OCR w/ pre-trained [**BEiT**](https://github.com/microsoft/unilm/tree/master/beit) and RoBERTa models.
- August 2021: [**LayoutLMv2**](https://huggingface.co/transformers/master/model_doc/layoutlmv2.html) and [**LayoutXLM**](https://huggingface.co/transformers/master/model_doc/layoutxlm.html) are on [HuggingFace](https://github.com/huggingface/transformers)
- [Model Release] August, 2021: [**LayoutReader**](https://github.com/microsoft/unilm/tree/master/layoutreader) - Built with LayoutLM to improve general reading order detection.
- [Model Release] August, 2021: [**DeltaLM**](https://github.com/microsoft/unilm/tree/master/deltalm) - Encoder-decoder pre-training for language generation and translation.
- August 2021: [**BEiT**](https://huggingface.co/transformers/master/model_doc/beit.html) is on [HuggingFace](https://github.com/huggingface/transformers)
- [Model Release] July, 2021: [**BEiT**](https://github.com/microsoft/unilm/tree/master/beit) - Towards BERT moment for CV
- [Model Release] June, 2021: [**LayoutLMv2**](https://github.com/microsoft/unilm/tree/master/layoutlmv2), [**LayoutXLM**](https://github.com/microsoft/unilm/tree/master/layoutxlm), [**MiniLMv2**](https://github.com/microsoft/unilm/tree/master/minilm), and [**AdaLM**](https://github.com/microsoft/unilm/tree/master/adalm).
- May, 2021: [LayoutLMv2](https://github.com/microsoft/unilm/tree/master/layoutlmv2), InfoXLMv2, MiniLMv2, UniLMv3, and AdaLM were accepted by ACL 2021.
- April, 2021: [LayoutXLM](https://github.com/microsoft/unilm/tree/master/layoutxlm) is coming by extending the LayoutLM into multilingual support! A multilingual form understanding benchmark [XFUND](https://github.com/doc-analysis/XFUND) is also introduced, which includes forms with human labeled key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese).
- March, 2021: [InfoXLM](https://github.com/microsoft/unilm/tree/master/infoxlm) was accepted by NAACL 2021.
- December 29th, 2020: [LayoutLMv2](https://arxiv.org/abs/2012.14740) is coming with the new SOTA on a wide variety of document AI tasks, including [DocVQA](https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=1) and [SROIE](https://rrc.cvc.uab.es/?ch=13&com=evaluation&task=3) leaderboard.
- October 8th, 2020: T-ULRv2 (aka [InfoXLM](https://arxiv.org/abs/2007.07834)) as the SOTA on the [XTREME](https://sites.research.google/xtreme) leaderboard. // [Blog](https://www.microsoft.com/en-us/research/blog/microsoft-turing-universal-language-representation-model-t-ulrv2-tops-xtreme-leaderboard/)
- September, 2020: [MiniLM](https://github.com/microsoft/unilm/tree/master/minilm) was accepted by NeurIPS 2020.
- July 16, 2020: [**InfoXLM** (Multilingual UniLM)](https://github.com/microsoft/unilm/tree/master/infoxlm) [arXiv](https://arxiv.org/pdf/2007.07834.pdf)
- June, 2020: [UniLMv2](https://github.com/microsoft/unilm/tree/master/unilm) was accepted by ICML 2020; [LayoutLM](https://github.com/microsoft/unilm/tree/master/layoutlm) was accepted by KDD 2020.
- April 5, 2020: [**Multilingual MiniLM**](https://github.com/microsoft/unilm/tree/master/minilm) released!
- September, 2019: [UniLMv1](https://github.com/microsoft/unilm/tree/master/unilm-v1) was accepted by NeurIPS 2019.

<!--
## Release

**\*\*\*\*\* ```New October, 2022```: [XDoc](https://github.com/microsoft/unilm/tree/master/xdoc) release \*\*\*\*\***

- [x] [**XDoc**](https://github.com/microsoft/unilm/tree/master/xdoc) (October 7, 2022): XDoc, a unified pre-trained model which deals with different document formats in a single model. For parameter efficiency, we share backbone parameters for different formats such as the word embedding layer and the Transformer layers. Meanwhile, we introduce adaptive layers with lightweight parameters to enhance the distinction across different formats. Experimental results have demonstrated that with only 36.7% parameters, XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models, which is cost effective for real-world deployment. "[XDoc: Unified Pre-training for Cross-Format Document Understanding](https://arxiv.org/abs/2210.02849) ```EMNLP 2022```"

**\*\*\*\*\* ```New May, 2022```: [Aggressive Decoding](https://github.com/microsoft/unilm/tree/master/decoding) release \*\*\*\*\***

- [x] [**Aggressive Decoding**](https://github.com/microsoft/unilm/tree/master/decoding) (May 20, 2022): Aggressive Decoding, a novel decoding paradigm for lossless speedup for seq2seq generation. Unlike the previous efforts (e.g., non-autoregressive decoding) speeding up seq2seq generation at the cost of quality loss, Aggressive Decoding aims to yield the identical (or better) generation compared with autoregressive decoding but in a significant speedup: For the seq2seq tasks characterized by highly similar inputs and outputs (e.g., Grammatical Error Correction and Text Simplification), the Input-guided Aggressive Decoding can introduce a 7x-9x speedup for the popular 6-layer Transformer on GPU with the identical results as greedy decoding; For other general seq2seq tasks (e.g., Machine Translation and Abstractive Summarization), the Generalized Aggressive Decoding can have a 3x-5x speedup with the identical or even better quality. "[Lossless Acceleration for Seq2seq Generation with Aggressive Decoding](https://arxiv.org/pdf/2205.10350.pdf)"

**\*\*\*\*\* ```New April, 2022```: [LayoutLMv3](https://github.com/microsoft/unilm/tree/master/layoutlmv3) release \*\*\*\*\***

- [x] [**LayoutLM 3.0**](https://github.com/microsoft/unilm/tree/master/layoutlmv3) (April 19, 2022): LayoutLMv3, a multimodal pre-trained Transformer for Document AI with unified text and image masking. Additionally, it is also pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. "[LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) ```ACM MM 2022```"

**\*\*\*\*\* ```March, 2022```: [EdgeFormer](https://github.com/microsoft/unilm/tree/master/edgelm) release \*\*\*\*\***

- [x] [**EdgeFormer**](https://github.com/microsoft/unilm/tree/master/edgelm) (March 18, 2022): EdgeFormer, the first publicly available pretrained parameter-efficient Transformer for on-device seq2seq generation. EdgeFormer has only 11 million parameters, taking up less than 15MB disk size after int8 quantization and compression, which can process a sentence of the length of 20-30 tokens with acceptable latency on two middle-to-high end CPU cores and less than 50MB memory footprint. The pretrained EdgeFormer can be fine-tuned to English seq2seq tasks and achieve promising results -- significantly better than the strong paramter-efficient Transformer baseline (pretrained Universal Transformer) and full-parameterized Transformer-base model without pretraining, which we believe can largely facilitate on-device seq2seq generation in practice. "[EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation](https://arxiv.org/abs/2202.07959)"


**\*\*\*\*\* ```March, 2022```: [DiT](https://github.com/microsoft/unilm/tree/master/dit) release \*\*\*\*\***

- [x] [**DiT**](https://github.com/microsoft/unilm/tree/master/dit) (March 4, 2022): DiT, a self-supervised pre-trained Document Image Transformer model using large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterparts ever exist due to the lack of human labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, table detection as well as text detection for OCR. Experiment results have illustrated that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g. document image classification (91.11 → 92.69), document layout analysis (91.0 → 94.9), table detection (94.23 → 96.55) and text detection for OCR (93.07 → 94.29). "[DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) ```ACM MM 2022```"


**\*\*\*\*\* ```October, 2021```: [WavLM](https://github.com/microsoft/unilm/tree/master/wavlm) release \*\*\*\*\***

- [x] [**WavLM**](https://github.com/microsoft/unilm/tree/master/wavlm) (October 27, 2021):  WavLM, a new pre-trained speech model, to solve full-stack downstream speech tasks. 
WavLM integrates the gated relative position embedding structure and the utterance mixing method, to model both spoken content and speaker identity preservation. WavLM is trained on  94k hours of public audio data, which is larger than other released checkpoints for English Speech modeling. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks. "[WavLM: Large-Scale Self-Supervised  Pre-training   for Full Stack Speech Processing](https://arxiv.org/pdf/2110.13900.pdf)"

**\*\*\*\*\* ```October, 2021```: [MarkupLM](https://github.com/microsoft/unilm/tree/master/markuplm) release \*\*\*\*\***

- [x] [**MarkupLM**](https://github.com/microsoft/unilm/tree/master/markuplm) (October 19, 2021): MarkupLM, a simple yet effective pre-training approach for text and markup language. With the Transformer architecture, MarkupLM integrates different input embeddings including text embeddings, position embeddings, and XPath embeddings. Furthermore, we also propose new pre-training objectives that are specially designed for understanding the markup language. We evaluate the pre-trained MarkupLM model on the WebSRC and SWDE datasets. Experiments show that MarkupLM significantly outperforms several SOTA baselines in these tasks. "[MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) ```ACL 2022```"

**\*\*\*\*\* ```September, 2021```: [TrOCR](https://github.com/microsoft/unilm/tree/master/trocr) release \*\*\*\*\***

- [x] [**TrOCR**](https://github.com/microsoft/unilm/tree/master/trocr) (September 22, 2021): Transformer-based OCR with pre-trained models, which leverages the Transformer architecture for both image understanding and bpe-level text generation. The TrOCR model is simple but effective (convolution free), and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. "[TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) ```AAAI 2023```"

**\*\*\*\*\* ```August, 2021```: [LayoutReader](https://github.com/microsoft/unilm/tree/master/layoutreader) release \*\*\*\*\***

- [x] [**LayoutReader**](https://github.com/microsoft/unilm/tree/master/layoutreader) (August 26, 2021): pre-training of text and layout for reading order detection. The pre-trained LayoutReader significantly improves both open-source and commercial OCR engines in ordering text lines. Meanwhile, we also created a reading order benchmark dataset [ReadingBank](https://github.com/doc-analysis/ReadingBank) to further empower the research in this area. "[LayoutReader: Pre-training of Text and Layout for Reading Order Detection](https://arxiv.org/abs/2108.11591) ```EMNLP 2021```"

**\*\*\*\*\* ```August, 2021```: [DeltaLM](https://github.com/microsoft/unilm/tree/master/deltalm) release \*\*\*\*\***

- [x] [**DeltaLM**](https://github.com/microsoft/unilm/tree/master/deltalm) (August, 2021): encoder-decoder pre-training for language generation and translation. DeltaLM **ranks first** on the [WMT21 multilingual translation task](http://www.statmt.org/wmt21/large-scale-multilingual-translation-task.html). The task requires a model to translate between 102 languages. "[DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders.](https://arxiv.org/abs/2106.13736)"

**\*\*\*\*\* ```July, 2021```: [BEiT](https://github.com/microsoft/unilm/tree/master/beit) release \*\*\*\*\***

- [x] [**BEiT**](https://github.com/microsoft/unilm/tree/master/beit) (June 15, 2021): BERT Pre-Training of Image Transformers. BEiT-large achieves **[state-of-the-art results on ADE20K](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k) (a big jump to 57.0 mIoU) for semantic segmentation**. BEiT-large achieves **state-of-the-art ImageNet top-1 accuracy (88.6%) under the setting without extra data other than ImageNet-22k**. "[BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254)"



**\*\*\*\*\* ```June, 2021```: [LayoutXLM](https://github.com/microsoft/unilm/tree/master/layoutxlm) | [AdaLM](https://github.com/microsoft/unilm/tree/master/adalm) | [MiniLMv2](https://github.com/microsoft/unilm/tree/master/minilm) release \*\*\*\*\***

- [x] [**LayoutXLM**](https://github.com/microsoft/unilm/tree/master/layoutxlm) (April 17, 2021): multimodal pre-training for multilingual visually-rich document understanding. The pre-trained LayoutXLM model has significantly outperformed the existing SOTA cross-lingual pre-trained models on the FUNSD and multilingual [XFUND](https://github.com/doc-analysis/XFUND) dataset including 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese). "[LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836)"
- [x] [**AdaLM**](https://github.com/microsoft/unilm/tree/master/adalm) (June 2021): a simple yet effective approach for domain adaptation of pre-trained models. Biomedical specific pre-trained models are released. "[Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains](#) ```ACL 2021```"
- [x] [**MiniLMv2**](https://github.com/microsoft/unilm/tree/master/minilm) (December, 2020): a simple yet effective task-agnostic knowledge distillation method, namely multi-head self-attention relation distillation, for compressing large pre-trained Transformers into small and fast pre-trained models. MiniLMv2 significantly outperforms MiniLMv1. Both English and multilingual MiniLM models are released. "[MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers](https://arxiv.org/abs/2012.15828) ```ACL 2021```"

**\*\*\*\*\* ```May, 2021```: [LayoutLMv2](https://github.com/microsoft/unilm/tree/master/layoutlmv2) | [LayoutXLM](https://github.com/microsoft/unilm/tree/master/layoutxlm) release \*\*\*\*\***

- [x] [**LayoutLM 2.0**](https://github.com/microsoft/unilm/tree/master/layoutlmv2) (December 29, 2020): multimodal pre-training for visually-rich document understanding by leveraging text, layout and image information in a single framework. It is coming with new SOTA on a wide range of document understanding tasks, including FUNSD (0.7895 -> 0.8420), CORD (0.9493 -> 0.9601), SROIE (0.9524 -> 0.9781), Kleister-NDA (0.834 -> 0.852), RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672). "[LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) ```ACL 2021```"

**\*\*\*\*\* ```February, 2020```: [UniLM v2](https://github.com/microsoft/unilm/tree/master/unilm) | [MiniLM v1](https://github.com/microsoft/unilm/tree/master/minilm) | [LayoutLM v1](https://github.com/microsoft/unilm/tree/master/layoutlm) | [s2s-ft v1](https://github.com/microsoft/unilm/tree/master/s2s-ft) release \*\*\*\*\***

- [x] [**LayoutLM 1.0**](https://github.com/microsoft/unilm/tree/master/layoutlm) (February 18, 2020): pre-trained models for document (image) understanding (e.g. receipts, forms, etc.) . It achieves new SOTA results in several downstream tasks, including form understanding (the FUNSD dataset from 70.72 to 79.27), receipt understanding (the [ICDAR 2019 SROIE leaderboard](https://rrc.cvc.uab.es/?ch=13&com=evaluation&task=3) from 94.02 to 95.24) and document image classification (the RVL-CDIP dataset from 93.07 to 94.42). "[LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) ```KDD 2020```"
- [x] [**s2s-ft 1.0**](https://github.com/microsoft/unilm/tree/master/s2s-ft) (February 26, 2020): A PyTorch package used to fine-tune pre-trained Transformers for sequence-to-sequence language generation. "[s2s-ft: Fine-Tuning Pre-Trained Transformers for Sequence-to-Sequence Learning](#)"
- [x] [**MiniLM 1.0**](https://github.com/microsoft/unilm/tree/master/minilm) (February 26, 2020): deep self-attention distillation is all you need (for task-agnostic knowledge distillation of pre-trained Transformers). MiniLM (12-layer, 384-hidden) achieves 2.7x speedup and comparable results over BERT-base (12-layer, 768-hidden) on NLU tasks as well as strong results on NLG tasks. The even smaller MiniLM (6-layer, 384-hidden) obtains 5.3x speedup and produces very competitive results. "[MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://arxiv.org/abs/2002.10957) ```NeurIPS 2020```"
- [x] [**UniLM 2.0**](https://github.com/microsoft/unilm/tree/master/unilm) (February 28, 2020): **unified pre-training** of bi-directional LM (via autoencoding) and sequence-to-sequence LM (via partially autoregressive) w/ **Pseudo-Masked Language Model** for language understanding and generation. UniLM v2 achieves new SOTA in a wide range of natural language understanding and generation tasks. "[UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training](https://arxiv.org/abs/2002.12804) ```ICML 2020```"



**\*\*\*\*\* October 1st, 2019: UniLM v1 release \*\*\*\*\***

- [x] [**UniLM v1**](https://github.com/microsoft/unilm/tree/master/unilm-v1) (September 30, 2019): the code and pre-trained models for the ```NeurIPS 2019``` paper entitled "[Unified Language Model Pre-training for Natural Language Understanding and Generation](https://arxiv.org/abs/1905.03197)". UniLM (v1) achieves the **new SOTA results** in **NLG** (especially **sequence-to-sequence generation**) tasks, including abstractive summarization (the Gigaword and CNN/DM datasets), question generation (the SQuAD QG dataset), etc. 

-->

## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on the [transformers](https://github.com/huggingface/transformers) project.

[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)

### Contact Information

For help or issues using the pre-trained models, please submit a GitHub issue.

For other communications, please contact [Furu Wei](https://thegenerality.com) (`fuwei@microsoft.com`).
