A fast, efficient natural language processor for Icelandic
Greynir is a Python 3 (>= 3.7) package, published by Miðeind ehf., for working with Icelandic natural language text. Greynir can parse text into sentence trees, find lemmas, inflect noun phrases, assign part-of-speech tags and much more.
Greynir's sentence trees can inter alia be used to extract information from text, for instance about people, titles, entities, facts, actions and opinions.
Full documentation for Greynir is available here.
Greynir is the engine of Greynir.is, a natural-language front end for a database of over 10 million sentences parsed from Icelandic news articles, and Embla, a voice-driven virtual assistant app for smart devices such as iOS and Android phones.
Greynir includes a hand-written context-free grammar for the Icelandic language, consisting of over 7,000 lines of grammatical productions in extended Backus-Naur format. Its fast C++ parser core is able to cope with long and ambiguous sentences, using an Earley-type parser as enhanced by Scott and Johnstone.
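To make the Earley approach concrete, here is a minimal pure-Python recognizer. It is only a sketch of the textbook algorithm, not Greynir's C++ implementation: it does not include the Scott and Johnstone enhancements, it builds no parse tree, and it assumes a grammar without empty productions. The toy grammar and the names `earley_recognize` and `TOY_GRAMMAR` are made up for illustration.

```python
from typing import Dict, List, Tuple

# A grammar maps each nonterminal to its alternative right-hand sides.
Grammar = Dict[str, List[Tuple[str, ...]]]
# An Earley item: (left-hand side, right-hand side, dot position, origin column).
Item = Tuple[str, Tuple[str, ...], int, int]


def earley_recognize(grammar: Grammar, start: str, tokens: List[str]) -> bool:
    """Return True if `tokens` can be derived from `start`.

    Assumes that no production has an empty right-hand side.
    """
    chart: List[List[Item]] = [[] for _ in range(len(tokens) + 1)]

    def add(col: int, item: Item) -> None:
        # Deduplication keeps left-recursive grammars from looping forever.
        if item not in chart[col]:
            chart[col].append(item)

    for rhs in grammar[start]:
        add(0, (start, rhs, 0, 0))

    for i in range(len(tokens) + 1):
        j = 0
        while j < len(chart[i]):  # chart[i] may grow while it is processed
            lhs, rhs, dot, origin = chart[i][j]
            j += 1
            if dot < len(rhs):
                symbol = rhs[dot]
                if symbol in grammar:
                    # Predict: expand the nonterminal after the dot.
                    for production in grammar[symbol]:
                        add(i, (symbol, production, 0, i))
                elif i < len(tokens) and tokens[i] == symbol:
                    # Scan: the terminal after the dot matches the next token.
                    add(i + 1, (lhs, rhs, dot + 1, origin))
            else:
                # Complete: advance every item that was waiting for `lhs`.
                for p_lhs, p_rhs, p_dot, p_origin in list(chart[origin]):
                    if p_dot < len(p_rhs) and p_rhs[p_dot] == lhs:
                        add(i, (p_lhs, p_rhs, p_dot + 1, p_origin))

    return any(
        lhs == start and dot == len(rhs) and origin == 0
        for lhs, rhs, dot, origin in chart[len(tokens)]
    )


# Toy grammar (illustrative only, unrelated to Greynir's real grammar).
TOY_GRAMMAR: Grammar = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP")],
    "NP": [("Ása",), ("sól",)],
    "V": [("sá",)],
}

print(earley_recognize(TOY_GRAMMAR, "S", ["Ása", "sá", "sól"]))
```

Because every chart column is deduplicated, the same loop also terminates on ambiguous, left-recursive grammars such as `E -> E + E | n`.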
Greynir employs the Tokenizer package, by the same authors, to tokenize text, and uses BinPackage as its database of Icelandic vocabulary and morphology.
````python
from reynir import NounPhrase as Nl

karfa = Nl("þrír lúxus-miðar á Star Wars og tveir brimsaltir pokar af poppi")

# The format specifiers select the case: þf = accusative, þgf = dative
print(f"Þú keyptir {karfa:þf}.")
print(f"Hér er kvittunin þín fyrir {karfa:þgf}.")
````
The program outputs the following text, correctly inflected:
````text
Þú keyptir þrjá lúxus-miða á Star Wars og tvo brimsalta poka af poppi.
Hér er kvittunin þín fyrir þremur lúxus-miðum á Star Wars og tveimur brimsöltum pokum af poppi.
````
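The case codes after the colon work because Python passes an f-string format specifier to the object's `__format__` method. The sketch below shows that mechanism with a toy class that only looks up pre-written forms. `FixedPhrase` is a hypothetical stand-in, not part of Greynir, and it performs no actual inflection. The three phrases are copied from the example above.

```python
class FixedPhrase:
    """Toy stand-in that returns pre-written case forms (no real inflection)."""

    def __init__(self, forms):
        # Maps a case code ("nf", "þf", "þgf") to the phrase in that case.
        self._forms = dict(forms)

    def __format__(self, spec):
        # A plain "{x}" arrives here with an empty spec; use the nominative.
        key = spec or "nf"
        try:
            return self._forms[key]
        except KeyError:
            raise ValueError(f"Unknown format specifier: {spec!r}") from None


pokar = FixedPhrase(
    {
        "nf": "tveir brimsaltir pokar af poppi",
        "þf": "tvo brimsalta poka af poppi",
        "þgf": "tveimur brimsöltum pokum af poppi",
    }
)

line = f"Þú keyptir {pokar:þf}."
print(line)
```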
````python
>>> from reynir import Greynir
>>> g = Greynir()
>>> sent = g.parse_single("Ása sá sól.")
>>> print(sent.tree.view)
P                               # Root
+-S-MAIN                        # Main sentence
  +-IP                          # Inflected phrase
    +-NP-SUBJ                   # Noun phrase, subject
      +-no_et_nf_kvk: 'Ása'     # Noun, singular, nominative, feminine
    +-VP                        # Verb phrase containing arguments
      +-VP                      # Verb phrase containing verb
        +-so_1_þf_et_p3: 'sá'   # Verb, 1 accusative arg, singular, 3rd p
      +-NP-OBJ                  # Noun phrase, object
        +-no_et_þf_kvk: 'sól'   # Noun, singular, accusative, feminine
+-'.'                           # Punctuation
>>> sent.tree.nouns
['Ása', 'sól']
>>> sent.tree.verbs
['sjá']
>>> sent.tree.flat
'P S-MAIN IP NP-SUBJ no_et_nf_kvk /NP-SUBJ VP so_1_þf_et_p3 NP-OBJ no_et_þf_kvk /NP-OBJ /VP /IP /S-MAIN p /P'
>>> # The subject noun phrase (S.IP.NP also works)
>>> sent.tree.S.IP.NP_SUBJ.lemmas
['Ása']
>>> # The verb phrase
>>> sent.tree.S.IP.VP.lemmas
['sjá', 'sól']
>>> # The object within the verb phrase (S.IP.VP.NP also works)
>>> sent.tree.S.IP.VP.NP_OBJ.lemmas
['sól']
````
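The `flat` string shown above can be turned back into a nested structure with a small stack-based reader. This is a sketch based only on what that one sample shows: it assumes that nonterminals start with an uppercase letter and are closed by a matching `/NAME` token, and that everything else is a terminal leaf. The function name `parse_flat` is made up for this example.

```python
def parse_flat(flat):
    """Convert a flat tree string into nested (name, children) tuples."""
    root = ("ROOT", [])
    stack = [root]
    for token in flat.split():
        if token.startswith("/"):
            # Closing marker: it must match the innermost open nonterminal.
            if len(stack) == 1 or stack[-1][0] != token[1:]:
                raise ValueError(f"Unexpected closing token: {token}")
            stack.pop()
        elif token[0].isupper():
            # Opening nonterminal: attach it and descend into it.
            node = (token, [])
            stack[-1][1].append(node)
            stack.append(node)
        else:
            # Terminal: a leaf of the innermost open nonterminal.
            stack[-1][1].append(token)
    if len(stack) != 1:
        raise ValueError(f"Unclosed nonterminal: {stack[-1][0]}")
    return root[1]


FLAT = (
    "P S-MAIN IP NP-SUBJ no_et_nf_kvk /NP-SUBJ VP so_1_þf_et_p3 "
    "NP-OBJ no_et_þf_kvk /NP-OBJ /VP /IP /S-MAIN p /P"
)
tree = parse_flat(FLAT)
print(tree)
```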
This package runs on CPython 3.7 or newer, and on PyPy 3.7 or newer.
To find out which version of Python you have, enter:
````sh
python --version
````
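The same check can be made from inside Python, which also reveals whether the interpreter is CPython or PyPy. This is a minimal sketch. The helper name `python_is_supported` is an assumption for illustration, and the minimum version is taken from the requirement stated above.

```python
import platform
import sys

MIN_VERSION = (3, 7)  # minimum version stated in this README


def python_is_supported(version_info=sys.version_info):
    """Return True if the given (major, minor, ...) version is new enough."""
    return tuple(version_info[:2]) >= MIN_VERSION


print(platform.python_implementation(), platform.python_version())
print("Supported:", python_is_supported())
```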
If a binary wheel package isn't available on PyPI for your system, you may need to have the `python3-dev` package (or its Windows equivalent) installed on your system to set up Greynir successfully. This is because a source distribution install requires a C++ compiler and linker:

````sh
sudo apt-get install python3-dev
````

Depending on your system, you may also need to install `libffi-dev`:

````sh
sudo apt-get install libffi-dev
````
To install this package, assuming Python 3 is your default Python:
````sh
pip install reynir
````
If you have git installed and want to be able to edit the source, do the following:

````sh
git clone https://github.com/mideind/GreynirPackage
cd GreynirPackage
pip install -e .
````

The package source code is in `GreynirPackage/src/reynir`.
To run the built-in tests, install pytest, `cd` to your `GreynirPackage` subdirectory (and optionally activate your virtualenv), then run:
````sh
python -m pytest
````
A parsing test pipeline for different parsing schemas, including the Greynir schema, has been developed. It is available here.
Please consult Greynir's documentation for detailed installation instructions, a quickstart guide, and reference information, as well as important information about copyright and licensing.
Greynir is Copyright © 2022 by Miðeind ehf.
The original author of this software is Vilhjálmur Þorsteinsson.
This software is licensed under the MIT License:
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
If you would like to use this software in ways that are incompatible with the standard MIT license, contact Miðeind ehf. to negotiate custom arrangements.
GreynirPackage indirectly embeds the Database of Icelandic Morphology (Beygingarlýsing íslensks nútímamáls), abbreviated BÍN. GreynirPackage does not claim any endorsement by the BÍN authors or copyright holders.
The BÍN source data are publicly available under the CC BY-SA 4.0 license, as further detailed here in English and here in Icelandic.
In accordance with the BÍN license terms, credit is hereby given as follows:
Beygingarlýsing íslensks nútímamáls. Stofnun Árna Magnússonar í íslenskum fræðum. Höfundur og ritstjóri Kristín Bjarnadóttir.
```python
from reynir import NounPhrase

np_1 = NounPhrase('ýmsir menn, þar á meðal þessi')
print(f'Ég er í slagtogi með {np_1:þgf}.')

np_2 = NounPhrase('ýmsir menn, til dæmis þessi')
print(f'Ég er í slagtogi með {np_2:þgf}.')

np_3 = NounPhrase('ýmsir menn, t.d. þessi')
print(f'Ég er í slagtogi með {np_3:þgf}.')

np_4 = NounPhrase('ýmsir menn, þ.á m. þessi')
print(f'Ég er í slagtogi með {np_4:þgf}.')
```
Is there a way to get verb form variants in the same way you can get case variants for nouns?
Something like
```python
BIN_Db.lookup_past_participle("sækja")
```
The use case is for a results highlighter. Lemmas are indexed, but I would like to highlight the original forms based on search string lemmas. For this I need to potentially highlight derived word forms. I’m basically writing a `get_all_meaning_wordforms` function that returns a set of strings that should be highlighted.
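To illustrate the highlighting half of this use case, here is a small sketch that wraps every occurrence of a given set of word forms in markers. The set of forms is written by hand here; in the real scenario it would come from whatever lookup ends up producing the word forms. The function name `highlight` and the `<mark>` markers are assumptions for illustration.

```python
import re


def highlight(text, forms, before="<mark>", after="</mark>"):
    """Wrap each whole-word occurrence of any form in `forms` with markers."""
    if not forms:
        return text
    # Try longer forms first so that a short form never shadows a longer one.
    ordered = sorted(forms, key=len, reverse=True)
    alternation = "|".join(re.escape(form) for form in ordered)
    # \b is Unicode-aware for str patterns, so Icelandic letters count as word characters.
    pattern = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{before}{m.group(0)}{after}", text)


# Hand-written sample forms of the verb "sækja" (illustrative, not exhaustive).
forms = {"sækja", "sótti", "sótt"}
result = highlight("Hún sótti bókina og ætlar að sækja fleiri.", forms)
print(result)
```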
I ran the 100 most common first names in Iceland through `greynir.parse`. No female names are interpreted as verbs but there are a few male ones. See this gist for the code: https://gist.github.com/jokull/2c1048bbc845feb46c717ac7c77e0cc5
If there is a way to augment the grammar file for specific project contexts, that should be documented.
Greynir makes it easy to lemmatize text. If the parser fails I can fall back to the bintokenizer and get multiple lemmas for all meanings. This makes for a great search index even if there are some extra lemmas there when the parser fails.
Perhaps Greynir should provide a function out of the box to do this, as it will be a common use case? I can share my code if anyone wants to see it.
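The fallback logic described above could look roughly like the sketch below. It is not Greynir code: the parser and the tokenizer lookup are passed in as plain callables, and the two stub functions with their tiny hand-made data exist only to make the example runnable. All names here are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Set


def lemmas_for_index(
    text: str,
    parse_lemmas: Callable[[str], Optional[List[str]]],
    all_lemmas: Callable[[str], Iterable[str]],
) -> Set[str]:
    """Return lemmas for a search index.

    `parse_lemmas` returns the disambiguated lemmas, or None if parsing failed.
    `all_lemmas` returns every candidate lemma for every token.
    """
    parsed = parse_lemmas(text)
    if parsed is not None:
        return set(parsed)
    # Parsing failed: accept some extra lemmas rather than lose the text.
    return set(all_lemmas(text))


# --- Stubs with made-up data, standing in for the real parser and tokenizer ---
def stub_parse(text: str) -> Optional[List[str]]:
    return ["Ása", "sjá", "sól"] if text == "Ása sá sól." else None


CANDIDATES = {"Ása": {"Ása"}, "sá": {"sá", "sjá"}, "sól": {"sól"}}


def stub_all_lemmas(text: str) -> Iterable[str]:
    for word in text.replace(".", " ").split():
        yield from CANDIDATES.get(word, {word})


parsed_case = lemmas_for_index("Ása sá sól.", stub_parse, stub_all_lemmas)
fallback_case = lemmas_for_index("sól sá Ása", stub_parse, stub_all_lemmas)
print(parsed_case, fallback_case)
```

In the fallback case the ambiguous token contributes both of its candidate lemmas, which is the "extra lemmas" trade-off mentioned above.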
I would love it if it were easier to reach other variants (case and number) when you have a token meaning or terminal instance. Something like `token.get_singular` and `token.get_accusative`.
Full Changelog: https://github.com/mideind/GreynirPackage/compare/3.5.2...3.5.3