# Lemmatization Lists

These lists are large-coverage, machine-readable sets of lemma/token pairs in several languages. I have collected them (legally) from various sources, mostly as part of my work on the Global Glossary project. I use them for query expansion in full-text search: if a user searches for the lemma walk, the query is expanded to also match the tokens walking, walked, etc.
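The query expansion described above can be sketched in Python roughly as follows. This is only an illustration, not the code used in the Global Glossary project: the names `PAIRS`, `build_index` and `expand_query` are made up for the example, and the handful of pairs is hard-coded rather than taken from one of the lists.

```python
from collections import defaultdict

# Illustrative (lemma, token) pairs; real data would come from one of the lists.
PAIRS = [
    ("walk", "walking"),
    ("walk", "walked"),
    ("walk", "walks"),
    ("go", "went"),
]

def build_index(pairs):
    """Map each lemma to the set of tokens listed for it."""
    index = defaultdict(set)
    for lemma, token in pairs:
        index[lemma].add(token)
    return index

def expand_query(term, index):
    """Return the search term plus any tokens listed under it as a lemma."""
    return sorted({term} | index.get(term, set()))

index = build_index(PAIRS)
print(expand_query("walk", index))  # ['walk', 'walked', 'walking', 'walks']
```

A term that does not appear as a lemma is returned unchanged, so the expansion never narrows a search.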

The lists are plain text files (zipped). Each line contains one lemma/token pair in this order: lemma, tab character, token. The files are encoded in UTF-8 with Windows-style (CRLF) line breaks.

- Asturian (ast) (108,792 pairs)
- Bulgarian (bg) (30,323 pairs)
- Catalan (ca) (591,534 pairs)
- Czech (cs) (36,400 pairs)
- English (en) (41,760 pairs)
- Estonian (et) (80,536 pairs)
- French (fr) (224,002 pairs)
- Galician (gl) (392,856 pairs)
- German (de) (358,473 pairs)
- Hungarian (hu) (39,898 pairs)
- Irish (ga) (415,502 pairs)
- Italian (it) (341,074 pairs)
- Manx Gaelic (gv) (67,177 pairs)
- Persian/Farsi (fa) (6,273 pairs)
- Polish (pl) (3,296,232 pairs)
- Portuguese (pt) (850,264 pairs)
- Romanian (ro) (314,810 pairs)
- Russian (ru) (537,810 pairs)
- Scottish Gaelic (gd) (51,624 pairs)
- Slovak (sk) (858,414 pairs)
- Slovene (sl) (99,063 pairs)
- Spanish (es) (497,560 pairs)
- Swedish (sv) (675,137 pairs)
- Ukrainian (uk) (193,703 pairs)
- Welsh (cy) (359,224 pairs)
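
Reading one of these files could look something like the sketch below. It assumes that each zip archive contains a single text file, and it decodes with `utf-8-sig` so that a leading byte-order mark, if there is one, is ignored; neither point is guaranteed by the description above. The function name `read_pairs` and the file name in the comment are hypothetical.

```python
import io
import zipfile

def read_pairs(zip_path):
    """Yield (lemma, token) tuples from the first file inside a zipped list."""
    with zipfile.ZipFile(zip_path) as archive:
        member = archive.namelist()[0]  # assumption: one text file per archive
        with archive.open(member) as raw:
            # utf-8-sig decodes plain UTF-8 and also skips a BOM if present (assumption).
            text = io.TextIOWrapper(raw, encoding="utf-8-sig")
            for line in text:
                line = line.rstrip("\r\n")  # drop the Windows-style line break
                if "\t" not in line:
                    continue  # skip blank or malformed lines
                lemma, _, token = line.partition("\t")
                yield lemma, token

# Example call (hypothetical file name):
# pairs = list(read_pairs("lemmatization-en.zip"))
```

Because `read_pairs` is a generator, even the largest list (Polish, with over three million pairs) can be processed line by line without loading the whole file into memory first.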

## Licence

- Available under the [Open Database License](http://opendatacommons.org/licenses/odbl/summary/)

## Sources

- [Various Hunspell dictionaries](http://extensions.services.openoffice.org/en/dictionaries) from the OpenOffice.org website
- [Deutsches Morphologie-Lexikon](http://www.danielnaber.de/morphologie/) by Daniel Naber
- [Lexique](http://www.lexique.org/) by Boris New and Christophe Pallier
- [e_lemma.txt](http://www.lexically.net/downloads/BNC_wordlists/e_lemma.txt) by Yasumasa Someya
- [Multext East](http://nl.ijs.si/ME/) (only those morphological lexicons that are under a free licence are used)
- Morphological dictionaries from [FreeLing](http://nlp.lsi.upc.edu/freeling/index.php)
- [SALDO](http://spraakbanken.gu.se/eng/saldo) morphological lexicon
- [Irish National Morphology Database](http://www.teanglann.ie/en/gram/_download)
- Various lists by [Kevin Scannell](https://cadhan.com/)
- [OpenRussian.org](https://en.openrussian.org/)
