.. image:: https://github.com/mideind/Tokenizer/workflows/Python%20package/badge.svg
   :target: https://github.com/mideind/Tokenizer
Tokenization is a necessary first step in many natural language processing tasks, such as word counting, parsing, spell checking, corpus generation, and statistical analysis of text.
Tokenizer is a compact pure-Python (>= 3.7) executable program and module for tokenizing Icelandic text. It converts input text to streams of tokens, where each token is a separate word, punctuation sign, number/amount, date, e-mail, URL/URI, etc. It also segments the token stream into sentences, considering corner cases such as abbreviations and dates in the middle of sentences.
The package contains a dictionary of common Icelandic abbreviations,
in the file ``src/tokenizer/Abbrev.conf``.
Tokenizer is an independent spinoff from the `Greynir project <https://greynir.is>`_
(GitHub repository `here <https://github.com/mideind/Greynir>`_), by the same authors.
The `Greynir natural language parser <https://github.com/mideind/GreynirPackage>`_
for Icelandic uses Tokenizer on its input.

Note that Tokenizer is licensed under the MIT license while Greynir is licensed under GPLv3.
Tokenizer can do both deep and shallow tokenization.
Shallow tokenization simply returns each sentence as a string (or as a line of text in an output file), where the individual tokens are separated by spaces.
Deep tokenization returns token objects that have been annotated with the token type and further information extracted from the token, for example a (year, month, day) tuple in the case of date tokens.
In shallow tokenization, tokens are in most cases kept intact, although
consecutive white space is always coalesced. The input strings
``"800 MW"``, ``"21. janúar"`` and ``"800 7000"`` thus become
two tokens each, output with a single space between them.

In deep tokenization, the same strings are represented by single token objects,
of type ``TOK.MEASUREMENT``, ``TOK.DATEREL`` and ``TOK.TELNO``, respectively.

The text associated with a single token object may contain spaces,
although consecutive whitespace is always coalesced into a single space ``" "``.
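As a quick side-by-side illustration, here is a minimal sketch (assuming the
package has been installed as described below) of both modes applied to one
of the strings above:

.. code-block:: python

   import tokenizer

   s = "Síminn er 800 7000."

   # Shallow tokenization: one space-separated string per sentence
   for sentence in tokenizer.split_into_sentences(s):
       print(sentence)  # -> Síminn er 800 7000 .

   # Deep tokenization: annotated token objects
   for tok in tokenizer.tokenize(s):
       # "800 7000" should appear as a single TOK.TELNO token
       print(tok.kind, tok.txt, tok.val)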
By default, the command line tool performs shallow tokenization. If you
want deep tokenization with the command line tool, use the ``--json`` or
``--csv`` switches.

From Python code, call ``split_into_sentences()`` for shallow tokenization,
or ``tokenize()`` for deep tokenization. These functions are documented with
examples below.
To install:
.. code-block:: console

   $ pip install tokenizer
After installation, the tokenizer can be invoked directly from the command line:
.. code-block:: console

   $ tokenize input.txt output.txt
Input and output files are in UTF-8 encoding. If the files are not
given explicitly, ``stdin`` and ``stdout`` are used for input and output,
respectively.
Empty lines in the input are treated as hard sentence boundaries.
By default, the output consists of one sentence per line, where each
line ends with a single newline character (ASCII LF, ``chr(10)``, ``"\n"``).
Within each line, tokens are separated by spaces.
The following (mutually exclusive) options can be specified on the command line:
+-------------------+---------------------------------------------------+
| | ``--csv``       | Deep tokenization. Output token objects in CSV    |
|                   | format, one per line. Sentences are separated by  |
|                   | lines containing ``0,"",""``                      |
+-------------------+---------------------------------------------------+
| | ``--json``      | Deep tokenization. Output token objects in JSON   |
|                   | format, one per line.                             |
+-------------------+---------------------------------------------------+
Other options can be specified on the command line:
+-----------------------------------+---------------------------------------------------+
| | ``-n``                          | Normalize punctuation, causing e.g. quotes to be  |
| | ``--normalize``                 | output in Icelandic form and hyphens to be        |
|                                   | regularized. This option is only applicable to    |
|                                   | shallow tokenization.                             |
+-----------------------------------+---------------------------------------------------+
| | ``-s``                          | Input contains strictly one sentence per line,    |
| | ``--one_sent_per_line``         | i.e. every newline is a sentence boundary.        |
+-----------------------------------+---------------------------------------------------+
| | ``-o``                          | Output original token text, i.e. bypass shallow   |
| | ``--original``                  | tokenization. This effectively runs the tokenizer |
|                                   | as a sentence splitter only.                      |
+-----------------------------------+---------------------------------------------------+
| | ``-m``                          | Degree signs in tokens denoting temperature are   |
| | ``--convert_measurements``      | normalized (200° C -> 200 °C)                     |
+-----------------------------------+---------------------------------------------------+
| | ``-p``                          | Numbers are combined into one token with the      |
| | ``--coalesce_percent``          | following token if it denotes a percentage word   |
|                                   | form (prósent, prósentustig, hundraðshlutar)      |
+-----------------------------------+---------------------------------------------------+
| | ``-g``                          | Do not replace composite glyphs using Unicode     |
| | ``--keep_composite_glyphs``     | COMBINING codes with their accented/umlaut        |
|                                   | counterparts                                      |
+-----------------------------------+---------------------------------------------------+
| | ``-e``                          | HTML escape codes are replaced by their meaning,  |
| | ``--replace_html_escapes``      | such as ``&aacute;`` -> á                         |
+-----------------------------------+---------------------------------------------------+
| | ``-c``                          | English-style decimal points and thousands        |
| | ``--convert_numbers``           | separators in numbers are changed to Icelandic    |
|                                   | style                                             |
+-----------------------------------+---------------------------------------------------+
| | ``-k N``                        | Kludgy ordinal handling defined by N:             |
| | ``--handle_kludgy_ordinals N``  | 0: return the original mixed word form;           |
|                                   | 1: return kludgy ordinals as pure word forms;     |
|                                   | 2: return kludgy ordinals as pure numbers         |
+-----------------------------------+---------------------------------------------------+
Type ``tokenize -h`` or ``tokenize --help`` to get a short help message.
.. code-block:: console

   $ echo "3.janúar sl. keypti ég 64kWst rafbíl. Hann kostaði € 30.000." | tokenize
   3. janúar sl. keypti ég 64kWst rafbíl .
   Hann kostaði €30.000 .
   $ echo "3.janúar sl. keypti ég 64kWst rafbíl. Hann kostaði € 30.000." | tokenize --csv
   19,"3. janúar","0|1|3"
   6,"sl.","síðastliðinn"
   6,"keypti",""
   6,"ég",""
   22,"64kWst","J|230400000.0"
   6,"rafbíl",""
   1,".","."
   0,"",""
   6,"Hann",""
   6,"kostaði",""
   13,"€30.000","30000|EUR"
   1,".","."
   0,"",""
   $ echo "3.janúar sl. keypti ég 64kWst rafbíl. Hann kostaði € 30.000." | tokenize --json
   {"k":"BEGIN SENT"}
   {"k":"DATEREL","t":"3. janúar","v":[0,1,3]}
   {"k":"WORD","t":"sl.","v":["síðastliðinn"]}
   {"k":"WORD","t":"keypti"}
   {"k":"WORD","t":"ég"}
   {"k":"MEASUREMENT","t":"64kWst","v":["J",230400000.0]}
   {"k":"WORD","t":"rafbíl"}
   {"k":"PUNCTUATION","t":".","v":"."}
   {"k":"END SENT"}
   {"k":"BEGIN SENT"}
   {"k":"WORD","t":"Hann"}
   {"k":"WORD","t":"kostaði"}
   {"k":"AMOUNT","t":"€30.000","v":[30000,"EUR"]}
   {"k":"PUNCTUATION","t":".","v":"."}
   {"k":"END SENT"}
An example of shallow tokenization from Python code goes something like this:
.. code-block:: python

   from tokenizer import split_into_sentences

   # A string to be tokenized, containing two sentences
   s = "3.janúar sl. keypti ég 64kWst rafbíl. Hann kostaði € 30.000."

   # Obtain a generator of sentence strings
   g = split_into_sentences(s)

   # Loop through the sentences
   for sentence in g:

       # Obtain the individual token strings
       tokens = sentence.split()

       # Print the tokens, separated by '|' characters
       print("|".join(tokens))
The program outputs::

   3.|janúar|sl.|keypti|ég|64kWst|rafbíl|.
   Hann|kostaði|€30.000|.
To do deep tokenization from within Python code:
.. code-block:: python

   from tokenizer import tokenize, TOK

   text = ("Málinu var vísað til stjórnskipunar- og eftirlitsnefndar "
           "skv. 3. gr. XVII. kafla laga nr. 10/2007 þann 3. janúar 2010.")

   for token in tokenize(text):
       print("{0}: '{1}' {2}".format(
           TOK.descr[token.kind],
           token.txt or "-",
           token.val or ""))
Output::

   BEGIN SENT: '-' (0, None)
   WORD: 'Málinu'
   WORD: 'var'
   WORD: 'vísað'
   WORD: 'til'
   WORD: 'stjórnskipunar- og eftirlitsnefndar'
   WORD: 'skv.' [('samkvæmt', 0, 'fs', 'skst', 'skv.', '-')]
   ORDINAL: '3.' 3
   WORD: 'gr.' [('grein', 0, 'kvk', 'skst', 'gr.', '-')]
   ORDINAL: 'XVII.' 17
   WORD: 'kafla'
   WORD: 'laga'
   WORD: 'nr.' [('númer', 0, 'hk', 'skst', 'nr.', '-')]
   NUMBER: '10' (10, None, None)
   PUNCTUATION: '/' (4, '/')
   YEAR: '2007' 2007
   WORD: 'þann'
   DATEABS: '3. janúar 2010' (2010, 1, 3)
   PUNCTUATION: '.' (3, '.')
   END SENT: '-'
Note the following:

- Sentences begin with a ``TOK.S_BEGIN`` token and end with a ``TOK.S_END`` token.
- Abbreviations are recognized, and their expansions are available in the
  ``token.val`` field.
- Ordinal numbers, whether decimal or Roman, are recognized, and their values
  are available in the ``token.val`` field.
- Dates, absolute and relative, are recognized, and (year, month, day) tuples
  are available in the ``token.val`` field.
- Numbers and amounts are recognized, and their values are available in the
  ``token.val`` field.
- Further examples of how Tokenizer processes text can be found in the
  `test module <https://github.com/mideind/Tokenizer/blob/master/test/test_tokenizer.py>`_
  in the project's `GitHub repository <https://github.com/mideind/Tokenizer>`_.
The ``tokenize()`` function
---------------------------

To deep-tokenize a text string, call ``tokenizer.tokenize(text, **options)``.

The ``text`` parameter can be a string, or an iterable that yields strings
(such as a text file object).

The function returns a Python generator of token objects.
Each token object is an instance of the class ``Tok``, with three
main fields: ``(kind, txt, val)`` (further documented below).

The ``tokenizer.tokenize()`` function is typically called in a ``for`` loop:
.. code-block:: python

   import tokenizer

   for token in tokenizer.tokenize(mystring):
       kind, txt, val = token
       if kind == tokenizer.TOK.WORD:
           # Do something with word tokens
           pass
       else:
           # Do something else
           pass
Alternatively, create a token list from the returned generator::

   token_list = list(tokenizer.tokenize(mystring))
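For instance, a token stream can be summarized by kind. This small sketch
(using the ``TOK.descr`` mapping documented below) counts the token kinds
occurring in a string:

.. code-block:: python

   import tokenizer
   from collections import Counter

   # Count how often each token kind occurs
   counts = Counter(
       tokenizer.TOK.descr[t.kind]
       for t in tokenizer.tokenize("Ég keypti 3 bækur 15. mars.")
   )
   print(counts.most_common())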
The ``split_into_sentences()`` function
---------------------------------------

To shallow-tokenize a text string, call
``tokenizer.split_into_sentences(text_or_gen, **options)``.

The ``text_or_gen`` parameter can be a string, or an iterable that yields
strings (such as a text file object).
This function returns a Python generator of strings, yielding a string for each sentence in the input. Within a sentence, the tokens are separated by spaces.
You can pass the option ``normalize=True`` to the function if you want
the normalized form of punctuation tokens. Normalization outputs
Icelandic single and double quotes („these“) instead of English-style
ones ("these"), converts three-dot ellipsis ... to the single-character
ellipsis …, and casts en-dashes – and em-dashes — to regular hyphens.
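For example (a sketch; the expected output follows the normalization rules
just described and the ``normalized_text_from_tokens()`` example below):

.. code-block:: python

   import tokenizer

   s = 'Hann sagði: "Þú ert ágæt!".'
   for sentence in tokenizer.split_into_sentences(s, normalize=True):
       print(sentence)
   # Expected output, with Icelandic quotes substituted:
   # Hann sagði : „ Þú ert ágæt ! “ .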
The ``tokenizer.split_into_sentences()`` function is typically called
in a ``for`` loop:
.. code-block:: python

   import tokenizer

   with open("example.txt", "r", encoding="utf-8") as f:
       # You can pass a file object directly to split_into_sentences()
       for sentence in tokenizer.split_into_sentences(f):
           # sentence is a string of space-separated tokens
           tokens = sentence.split()
           # Now, tokens is a list of strings, one for each token
           for t in tokens:
               # Do something with the token t
               pass
The ``correct_spaces()`` function
---------------------------------

The ``tokenizer.correct_spaces(text)`` function returns a string after
splitting it up and re-joining it with correct whitespace around
punctuation tokens. Example::

   >>> import tokenizer
   >>> tokenizer.correct_spaces(
   ...     "Frétt \n dagsins:Jón\t ,Friðgeir og Páll ! 100 / 2 = 50"
   ... )
   'Frétt dagsins: Jón, Friðgeir og Páll! 100/2 = 50'
The ``detokenize()`` function
-----------------------------

The ``tokenizer.detokenize(tokens, normalize=False)`` function
takes an iterable of token objects and returns a corresponding, correctly
spaced text string, composed from the tokens' text. If the
``normalize`` parameter is set to ``True``,
the function uses the normalized form of any punctuation tokens, such
as proper Icelandic single and double quotes instead of English-type
quotes. Example::

   >>> import tokenizer
   >>> toklist = list(tokenizer.tokenize("Hann sagði: „Þú ert ágæt!“."))
   >>> tokenizer.detokenize(toklist, normalize=True)
   'Hann sagði: „Þú ert ágæt!“.'
The ``normalized_text()`` function
----------------------------------

The ``tokenizer.normalized_text(token)`` function
returns the normalized text for a token. This means that the original
token text is returned except for certain punctuation tokens, where a
normalized form is returned instead. Specifically, English-type quotes
are converted to Icelandic ones, and en- and em-dashes are converted
to regular hyphens.
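A minimal sketch of its use, printing each token's original and normalized
text:

.. code-block:: python

   import tokenizer

   for t in tokenizer.tokenize('Hann sagði: "Nei".'):
       if t.txt:  # skip the S_BEGIN/S_END marker tokens
           print(t.txt, "->", tokenizer.normalized_text(t))
   # The English-style quotes should be reported as „ and “;
   # all other tokens are returned unchanged.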
The ``text_from_tokens()`` function
-----------------------------------

The ``tokenizer.text_from_tokens(tokens)`` function
returns a concatenation of the text contents of the given token list,
with spaces between tokens. Example::

   >>> import tokenizer
   >>> toklist = list(tokenizer.tokenize("Hann sagði: \"Þú ert ágæt!\"."))
   >>> tokenizer.text_from_tokens(toklist)
   'Hann sagði : " Þú ert ágæt ! " .'
The ``normalized_text_from_tokens()`` function
----------------------------------------------

The ``tokenizer.normalized_text_from_tokens(tokens)`` function
returns a concatenation of the normalized text contents of the given
token list, with spaces between tokens. Example (note the double quotes)::

   >>> import tokenizer
   >>> toklist = list(tokenizer.tokenize("Hann sagði: \"Þú ert ágæt!\"."))
   >>> tokenizer.normalized_text_from_tokens(toklist)
   'Hann sagði : „ Þú ert ágæt ! “ .'
Options
-------

You can optionally pass one or more of the following options as
keyword parameters to the ``tokenize()`` and ``split_into_sentences()``
functions:
* ``convert_numbers=[bool]``

  Setting this option to ``True`` causes the tokenizer to convert numbers
  and amounts with English-style decimal points (``.``) and thousands
  separators (``,``) to Icelandic format, where the decimal separator
  is a comma (``,``) and the thousands separator is a period (``.``).
  ``$1,234.56`` is thus converted to a token whose text is ``$1.234,56``.

  The default value for the ``convert_numbers`` option is ``False``.
  Note that in versions of Tokenizer prior to 1.4, ``convert_numbers``
  was ``True``. A sketch of the conversion is shown below.
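  .. code-block:: python

     import tokenizer

     # $1,234.56 should come back as a single token with Icelandic formatting
     for t in tokenizer.tokenize("Það kostaði $1,234.56.", convert_numbers=True):
         if t.kind == tokenizer.TOK.AMOUNT:
             print(t.txt)  # -> $1.234,56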
* ``convert_measurements=[bool]``

  Setting this option to ``True`` causes the tokenizer to convert
  degrees Kelvin, Celsius and Fahrenheit to a regularized form, i.e.
  ``200° C`` becomes ``200 °C``.

  The default value for the ``convert_measurements`` option is ``False``.
  A corresponding sketch:
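  .. code-block:: python

     import tokenizer

     for t in tokenizer.tokenize("Hitinn mældist 200° C.", convert_measurements=True):
         if t.kind == tokenizer.TOK.MEASUREMENT:
             print(t.txt)  # -> 200 °C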
* ``replace_composite_glyphs=[bool]``

  Setting this option to ``False`` disables the automatic replacement
  of composite Unicode glyphs with their corresponding Icelandic characters.
  By default, the tokenizer combines vowels with the Unicode
  COMBINING ACUTE ACCENT and COMBINING DIAERESIS glyphs to form single
  character code points, such as 'á' and 'ö'.

  The default value for the ``replace_composite_glyphs`` option is ``True``.
* ``replace_html_escapes=[bool]``

  Setting this option to ``True`` causes the tokenizer to replace common
  HTML escaped character codes, such as ``&aacute;``, with the character
  being escaped, such as ``á``. Note that ``&shy;`` (soft hyphen) is
  replaced by an empty string, and ``&nbsp;`` is replaced by a normal
  space. The ligature characters ﬁ and ﬂ are replaced by ``fi`` and
  ``fl``, respectively.

  The default value for the ``replace_html_escapes`` option is ``False``.
  A sketch using the ``&aacute;`` escape mentioned above:
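  .. code-block:: python

     import tokenizer

     s = "Hann s&aacute; hana"
     words = [t.txt for t in tokenizer.tokenize(s, replace_html_escapes=True) if t.txt]
     print(words)  # -> ['Hann', 'sá', 'hana']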
* ``handle_kludgy_ordinals=[value]``

  This option controls the way Tokenizer handles 'kludgy' ordinals, such as
  *1sti*, *4ðu*, or *2ja*. By default, such ordinals are returned unmodified
  ('passed through') as word tokens (``TOK.WORD``).
  However, this can be modified as follows:

  * ``tokenizer.KLUDGY_ORDINALS_MODIFY``: Kludgy ordinals are corrected
    to become 'proper' word tokens, i.e. *1sti* becomes *fyrsti* and
    *2ja* becomes *tveggja*.

  * ``tokenizer.KLUDGY_ORDINALS_TRANSLATE``: Kludgy ordinals that represent
    proper ordinal numbers are translated to ordinal tokens (``TOK.ORDINAL``),
    with their original text and their ordinal value. *1sti* thus
    becomes a ``TOK.ORDINAL`` token with a value of 1, and *3ja* becomes
    a ``TOK.ORDINAL`` with a value of 3.

  * ``tokenizer.KLUDGY_ORDINALS_PASS_THROUGH`` is the default value of
    the option. It causes kludgy ordinals to be returned unmodified as
    word tokens.

  Note that versions of Tokenizer prior to 1.4 behaved as if
  ``handle_kludgy_ordinals`` were set to
  ``tokenizer.KLUDGY_ORDINALS_TRANSLATE``.
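The three settings can be compared directly; a sketch based on the examples
above:

.. code-block:: python

   import tokenizer

   s = "Hann varð 2ja ára"
   for mode in (
       tokenizer.KLUDGY_ORDINALS_PASS_THROUGH,
       tokenizer.KLUDGY_ORDINALS_MODIFY,
       tokenizer.KLUDGY_ORDINALS_TRANSLATE,
   ):
       toks = tokenizer.tokenize(s, handle_kludgy_ordinals=mode)
       print([(tokenizer.TOK.descr[t.kind], t.txt) for t in toks if t.txt])
   # PASS_THROUGH keeps '2ja' as a WORD, MODIFY should rewrite it to 'tveggja',
   # and TRANSLATE should yield an ORDINAL token with the value 2.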
The token object
----------------

Each token is an instance of the class ``Tok`` that has three main properties:
``kind``, ``txt`` and ``val``.

The ``kind`` property
---------------------

The ``kind`` property contains one of the following integer constants,
defined within the ``TOK`` class:
+---------------+---------+---------------------+---------------------------+
| Constant      | Value   | Explanation         | Examples                  |
+===============+=========+=====================+===========================+
| PUNCTUATION   | 1       | Punctuation         | . ! ; % &                 |
+---------------+---------+---------------------+---------------------------+
| TIME          | 2       | Time (h, m, s)      | | 11:35:40                |
|               |         |                     | | kl. 7:05                |
|               |         |                     | | klukkan 23:35           |
+---------------+---------+---------------------+---------------------------+
| DATE *        | 3       | Date (y, m, d)      | [Unused, see DATEABS and  |
|               |         |                     | DATEREL]                  |
+---------------+---------+---------------------+---------------------------+
| YEAR          | 4       | Year                | | árið 874 e.Kr.          |
|               |         |                     | | 1965                    |
|               |         |                     | | 44 f.Kr.                |
+---------------+---------+---------------------+---------------------------+
| NUMBER        | 5       | Number              | | 100                     |
|               |         |                     | | 1.965                   |
|               |         |                     | | 1.965,34                |
|               |         |                     | | 1,965.34                |
|               |         |                     | | 2⅞                      |
+---------------+---------+---------------------+---------------------------+
| WORD          | 6       | Word                | | kattaeftirlit           |
|               |         |                     | | hunda- og kattaeftirlit |
+---------------+---------+---------------------+---------------------------+
| TELNO         | 7       | Telephone number    | | 5254764                 |
|               |         |                     | | 699-4244                |
|               |         |                     | | 410 4000                |
+---------------+---------+---------------------+---------------------------+
| PERCENT       | 8       | Percentage          | 78%                       |
+---------------+---------+---------------------+---------------------------+
| URL           | 9       | URL                 | | https://greynir.is      |
|               |         |                     | | http://tiny.cc/28695y   |
+---------------+---------+---------------------+---------------------------+
| ORDINAL       | 10      | Ordinal number      | | 30.                     |
|               |         |                     | | XVIII.                  |
+---------------+---------+---------------------+---------------------------+
| TIMESTAMP *   | 11      | Timestamp           | [Unused, see              |
|               |         |                     | TIMESTAMPABS and          |
|               |         |                     | TIMESTAMPREL]             |
+---------------+---------+---------------------+---------------------------+
| CURRENCY *    | 12      | Currency name       | [Unused]                  |
+---------------+---------+---------------------+---------------------------+
| AMOUNT        | 13      | Amount              | | €2.345,67               |
|               |         |                     | | 750 þús.kr.             |
|               |         |                     | | 2,7 mrð. USD            |
|               |         |                     | | kr. 9.900               |
|               |         |                     | | EUR 200                 |
+---------------+---------+---------------------+---------------------------+
| PERSON *      | 14      | Person name         | [Unused]                  |
+---------------+---------+---------------------+---------------------------+
| EMAIL         | 15      | E-mail              | fake@news.is              |
+---------------+---------+---------------------+---------------------------+
| ENTITY *      | 16      | Named entity        | [Unused]                  |
+---------------+---------+---------------------+---------------------------+
| UNKNOWN       | 17      | Unknown token       |                           |
+---------------+---------+---------------------+---------------------------+
| DATEABS       | 18      | Absolute date       | | 30. desember 1965       |
|               |         |                     | | 30/12/1965              |
|               |         |                     | | 1965-12-30              |
|               |         |                     | | 1965/12/30              |
+---------------+---------+---------------------+---------------------------+
| DATEREL       | 19      | Relative date       | | 15. mars                |
|               |         |                     | | 15/3                    |
|               |         |                     | | 15.3.                   |
|               |         |                     | | mars 1911               |
+---------------+---------+---------------------+---------------------------+
| TIMESTAMPABS  | 20      | Absolute timestamp  | | 30. desember 1965 11:34 |
|               |         |                     | | 1965-12-30 kl. 13:00    |
+---------------+---------+---------------------+---------------------------+
| TIMESTAMPREL  | 21      | Relative timestamp  | | 30. desember kl. 13:00  |
+---------------+---------+---------------------+---------------------------+
| MEASUREMENT   | 22      | Value with a        | | 690 MW                  |
|               |         | measurement unit    | | 1.010 hPa               |
|               |         |                     | | 220 m²                  |
|               |         |                     | | 80° C                   |
+---------------+---------+---------------------+---------------------------+
| NUMWLETTER    | 23      | Number followed by  | | 14a                     |
|               |         | a single letter     | | 7B                      |
+---------------+---------+---------------------+---------------------------+
| DOMAIN        | 24      | Domain name         | | greynir.is              |
|               |         |                     | | Reddit.com              |
|               |         |                     | | www.wikipedia.org       |
+---------------+---------+---------------------+---------------------------+
| HASHTAG       | 25      | Hashtag             | | #MeToo                  |
|               |         |                     | | #12stig                 |
+---------------+---------+---------------------+---------------------------+
| MOLECULE      | 26      | Molecular formula   | | H2SO4                   |
|               |         |                     | | CO2                     |
+---------------+---------+---------------------+---------------------------+
| SSN           | 27      | Social security     | | 591213-1480             |
|               |         | number (kennitala)  |                           |
+---------------+---------+---------------------+---------------------------+
| USERNAME      | 28      | Twitter user handle | | @username_123           |
+---------------+---------+---------------------+---------------------------+
| SERIALNUMBER  | 29      | Serial number       | | 394-5388                |
|               |         |                     | | 12-345-6789             |
+---------------+---------+---------------------+---------------------------+
| COMPANY *     | 30      | Company name        | [Unused]                  |
+---------------+---------+---------------------+---------------------------+
| S_BEGIN       | 11001   | Start of sentence   |                           |
+---------------+---------+---------------------+---------------------------+
| S_END         | 11002   | End of sentence     |                           |
+---------------+---------+---------------------+---------------------------+
(*) The token types marked with an asterisk are reserved for the Greynir package
and not currently returned by the tokenizer.

To obtain a descriptive text for a token kind, use
``TOK.descr[token.kind]`` (see example above).
The ``txt`` property
--------------------

The ``txt`` property contains the original source text for the token,
with the following exceptions:

* All contiguous whitespace (spaces, tabs, newlines) is coalesced
  into single spaces (``" "``) within the ``txt`` string. A date
  token that is parsed from a source text of ``"29. \n janúar"``
  thus has a ``txt`` of ``"29. janúar"``.

* Tokenizer automatically merges Unicode ``COMBINING ACUTE ACCENT``
  (code point 769) and ``COMBINING DIAERESIS`` (code point 776)
  with vowels to form single code points for the Icelandic letters
  á, é, í, ó, ú, ý and ö, in both lower and upper case. (This behavior
  can be disabled; see the ``replace_composite_glyphs`` option described
  above.)

* If the appropriate options are specified (see above), kludgy ordinals
  (*3ja*) are converted to proper ones (*þriðja*), and English-style
  thousands and decimal separators are converted to Icelandic ones
  (*10,345.67* becomes *10.345,67*).

* If the ``replace_html_escapes`` option is set, Tokenizer replaces
  HTML-style escapes (``&aacute;``) with the characters
  being escaped (``á``).
The ``val`` property
--------------------

The ``val`` property contains auxiliary information, corresponding to
the token kind, as follows:

- For ``TOK.PUNCTUATION``, the ``val`` field contains a tuple with
  two items: ``(whitespace, normalform)``. The first item (``token.val[0]``)
  specifies the whitespace normally found around the symbol in question,
  as an integer::

      TP_LEFT = 1    # Whitespace to the left
      TP_CENTER = 2  # Whitespace to the left and right
      TP_RIGHT = 3   # Whitespace to the right
      TP_NONE = 4    # No whitespace

  The second item (``token.val[1]``) contains a normalized representation of the
  punctuation. For instance, various forms of single and double
  quotes are represented as Icelandic ones (i.e. „these“ or ‚these‘) in
  normalized form, and an ellipsis ("...") is represented as the single
  character "…".

- For ``TOK.TIME``, the ``val`` field contains an
  ``(hour, minute, second)`` tuple.

- For ``TOK.DATEABS``, the ``val`` field contains a
  ``(year, month, day)`` tuple (all 1-based).

- For ``TOK.DATEREL``, the ``val`` field contains a
  ``(year, month, day)`` tuple (all 1-based),
  except that at least one of the tuple fields is missing and set to 0.
  Example: *3. júní* becomes ``TOK.DATEREL`` with the fields ``(0, 6, 3)``,
  as the year is missing.

- For ``TOK.YEAR``, the ``val`` field contains the year as an integer.
  A negative number indicates that the year is BCE (*fyrir Krist*),
  specified with the suffix *f.Kr.* (e.g. *árið 33 f.Kr.*).

- For ``TOK.NUMBER``, the ``val`` field contains a tuple
  ``(number, None, None)``.
  (The two empty fields are included for compatibility with Greynir.)

- For ``TOK.WORD``, the ``val`` field contains the full expansion
  of an abbreviation, as a list containing a single tuple, or ``None``
  if the word is not abbreviated.

- For ``TOK.PERCENT``, the ``val`` field contains a tuple
  of ``(percentage, None, None)``.

- For ``TOK.ORDINAL``, the ``val`` field contains the ordinal value
  as an integer. The original ordinal may be a decimal number
  or a Roman numeral.

- For ``TOK.TIMESTAMP``, the ``val`` field contains
  a ``(year, month, day, hour, minute, second)`` tuple.

- For ``TOK.AMOUNT``, the ``val`` field contains
  an ``(amount, currency, None, None)`` tuple. The amount is a float, and
  the currency is an ISO currency code, e.g. *USD* for dollars ($ sign),
  *EUR* for euros (€ sign) or *ISK* for Icelandic króna
  (*kr.* abbreviation). (The two empty fields are included for
  compatibility with Greynir.)

- For ``TOK.MEASUREMENT``, the ``val`` field contains a ``(unit, value)``
  tuple, where ``unit`` is a base SI unit (such as ``g``, ``m``,
  ``m²``, ``s``, ``W``, ``Hz``, or ``K`` for temperature in Kelvin).

- For ``TOK.TELNO``, the ``val`` field contains a tuple ``(number, cc)``,
  where the first item is the phone number
  in a normalized ``NNN-NNNN`` format, i.e. always including a hyphen,
  and the second item is the country code, optionally prefixed by ``+``.
  The country code defaults to ``354`` (Iceland).
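Putting these conventions to use, here is a small sketch that extracts
date values from a text:

.. code-block:: python

   import tokenizer

   text = "Fundurinn var 3. janúar 2010 og aftur 15. mars."
   for t in tokenizer.tokenize(text):
       if t.kind == tokenizer.TOK.DATEABS:
           year, month, day = t.val
           print("absolute date:", year, month, day)  # -> 2010 1 3
       elif t.kind == tokenizer.TOK.DATEREL:
           year, month, day = t.val  # missing fields are 0
           print("relative date:", year, month, day)  # -> 0 3 15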
Abbreviations
-------------

Abbreviations recognized by Tokenizer are defined in the ``Abbrev.conf``
file, found in the ``src/tokenizer/`` directory. This is a text file with
abbreviations, their definitions and explanatory comments.

When an abbreviation is encountered, it is recognized as a word token
(i.e. having its ``kind`` field equal to ``TOK.WORD``).
Its expansion(s) are included in the token's
``val`` field as a list containing tuples of the format
``(ordmynd, utg, ordfl, fl, stofn, beyging)``.

An example is *o.s.frv.*, which results in a ``val`` field equal to
``[('og svo framvegis', 0, 'ao', 'frasi', 'o.s.frv.', '-')]``.

The tuple format is designed to be compatible with the Database of
Icelandic Morphology (DIM), *Beygingarlýsing íslensks nútímamáls*,
i.e. the so-called *Sigrúnarsnið*.
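A sketch that surfaces abbreviation expansions from the ``val`` field:

.. code-block:: python

   import tokenizer

   for t in tokenizer.tokenize("Þetta gildir um hunda, ketti o.s.frv. hér."):
       if t.kind == tokenizer.TOK.WORD and t.val:
           # Each meaning is an (ordmynd, utg, ordfl, fl, stofn, beyging) tuple
           for meaning in t.val:
               print(t.txt, "->", meaning[0])  # -> o.s.frv. -> og svo framvegis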
Development installation
------------------------

To install Tokenizer in development mode, where you can easily
modify the source files (assuming you have ``git`` available):

.. code-block:: console

   $ git clone https://github.com/mideind/Tokenizer
   $ cd Tokenizer
   $ # [ Activate your virtualenv here, if you have one ]
   $ pip install -e .
Test suite
----------

Tokenizer comes with a large test suite.
The file ``test/test_tokenizer.py`` contains built-in tests that
run under ``pytest``.

To run the built-in tests, install `pytest <https://docs.pytest.org/en/latest/>`_,
``cd`` to your ``Tokenizer`` subdirectory (and optionally
activate your virtualenv), then run:

.. code-block:: console

   $ python -m pytest
The file ``test/toktest_large.txt`` contains a test set of 13,075 lines.
The lines test sentence detection, token detection and token classification.
For analysis, ``test/toktest_large_gold_perfect.txt`` contains
the expected output of a perfect shallow tokenization, and
``test/toktest_large_gold_acceptable.txt`` contains the current output of the
shallow tokenization.
The file ``test/Overview.txt`` (only in Icelandic) contains a description
of the test set, including line numbers for each part in both
``test/toktest_large.txt`` and ``test/toktest_large_gold_acceptable.txt``,
and a tag describing what is being tested in each part.

It also contains a description of a perfect shallow tokenization for each part,
the acceptable tokenization, and the current behaviour. As such, the description
is an analysis of which edge cases the tokenizer can handle and which it cannot.
To run the tokenizer on the large test set, type the following on the
command line:

.. code-block:: console

   $ tokenize test/toktest_large.txt test/toktest_large_out.txt
To compare the output to the acceptable behaviour:

.. code-block:: console

   $ diff test/toktest_large_out.txt test/toktest_large_gold_acceptable.txt > diff.txt
The file ``test/toktest_normal.txt`` contains running text from recent
news articles, containing no edge cases. The gold standard for that file
can be found in the file ``test/toktest_normal_gold_expected.txt``.
Changelog
---------

* ``META_BEGIN`` token added.

* Numbers and amounts that consist of word tokens only are now returned as
  the original ``TOK.WORD`` s ('sex' and 'hundruð'), not as single coalesced
  ``TOK.NUMBER``/``TOK.AMOUNT``/etc. tokens.

* Paragraph markers changed to ``[[`` and ``]]`` (removing spaces).

* Added the ``-o`` switch to the ``tokenize`` command to return original
  token text, enabling the tokenizer to run as a sentence splitter only.

* The token kind ``NUMBER WITH LETTER`` is now ``NUMWLETTER``.

* Added the ``replace_html_escapes`` option to the ``tokenize()`` function.

* Fixed ``correct_spaces()`` to handle compounds such as
  *Atvinnu-, nýsköpunar- og ferðamálaráðuneytið* and
  *bensínstöðvar, -dælur og -tankar*.

* Added the ``TOK.COMPANY`` token type; fixed a few abbreviations;
  renamed the parameter ``text`` to ``text_or_gen`` in functions that accept
  a string or a string iterator.

* Improved handling of ``TOK.TIME``.

* Fixed a bug in ``detokenize()`` where abbreviations, domains
  and e-mails containing periods were wrongly split.

* Spelled-out day ordinals are no longer included in ``TOK.DATEREL``
  tokens. Thus, *þriðji júní* is now a ``TOK.WORD``
  followed by a ``TOK.DATEREL``. *3. júní* continues to be parsed as
  a single ``TOK.DATEREL``.

* The order of abbreviation meanings within the ``token.val`` field
  was made deterministic; fixed a bug in measurement unit handling.

* Added the ``normalize`` option to the ``split_into_sentences()`` and
  ``detokenize()`` functions; removed the ``convert_telno`` option;
  splitting of coalesced tokens made more robust;
  added the ``TOK.SSN``, ``TOK.MOLECULE``, ``TOK.USERNAME`` and
  ``TOK.SERIALNUMBER`` token kinds; abbreviations can now have multiple
  meanings.

* Added the ``**options`` parameter to the ``tokenize()`` function,
  giving control over the handling of numbers, telephone numbers,
  and 'kludgy' ordinals.

* Added the ``TOK.DOMAIN`` and ``TOK.HASHTAG`` token types;
  improved handling of the capitalized month name *Ágúst*, which is
  now recognized when following an ordinal number; improved recognition
  of telephone numbers; added abbreviations.

* Added handling of the ``±`` sign; added several abbreviations.

* Added liter (``l`` and ``ltr``) measurement units.

* Added the ``mark_paragraphs()`` function.

* Abbreviations defined in ``Abbrev.conf`` are now
  returned with their meaning in a tuple in ``token.val``;
  handling of 'mbl.is' fixed.

* Fixed ``DATEREL``, ``MEASUREMENT`` and ``NUMWLETTER``
  token handling; added 'kWst' and 'MWst' measurement units; blackened.

* Added the ``TOK.NUMWLETTER`` token type.

* Added automatic merging of Unicode ``COMBINING ACUTE ACCENT`` and
  ``COMBINING DIAERESIS`` code points with vowels.

* Added the ``TOK.DATEABS``, ``TOK.TIMESTAMPABS`` and
  ``TOK.MEASUREMENT`` token types.