Python version of Sudachi, a Japanese tokenizer.

WorksApplications, updated 🕥 2022-10-07 07:38:45

SudachiPy


SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.

Warning

This repository is for the 0.5.* versions of SudachiPy. Versions 0.6 and above are developed as part of Sudachi.rs.

TL;DR

```bash
$ pip install sudachipy sudachidict_core

$ echo "高輪ゲートウェイ駅" | sudachipy
高輪ゲートウェイ駅	名詞,固有名詞,一般,*,*,*	高輪ゲートウェイ駅
EOS

$ echo "高輪ゲートウェイ駅" | sudachipy -m A
高輪	名詞,固有名詞,地名,一般,*,*	高輪
ゲートウェイ	名詞,普通名詞,一般,*,*,*	ゲートウェー
駅	名詞,普通名詞,一般,*,*,*	駅
EOS

$ echo "空缶空罐空きカン" | sudachipy -a
空缶	名詞,普通名詞,一般,*,*,*	空き缶	空缶	アキカン	0
空罐	名詞,普通名詞,一般,*,*,*	空き缶	空罐	アキカン	0
空きカン	名詞,普通名詞,一般,*,*,*	空き缶	空きカン	アキカン	0
EOS
```

Setup

You need SudachiPy and a dictionary.

Step 1. Install SudachiPy

    $ pip install sudachipy

Step 2. Get a Dictionary

You can get a dictionary as a Python package. It may take a while to download the dictionary file (around 70MB for the core edition).

    $ pip install sudachidict_core

Alternatively, you can choose other dictionary editions. See the Dictionary Edition section for details.
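Before tokenizing, it can help to confirm that both pieces are importable. The following is a minimal sketch and not part of SudachiPy: `missing_packages` is a hypothetical helper that relies only on the standard library and on the package names given above.

```python
import importlib.util


def missing_packages(names=("sudachipy", "sudachidict_core")):
    """Return the names in `names` that cannot be imported in this environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Not installed yet; try: pip install " + " ".join(missing))
    else:
        print("SudachiPy and the core dictionary are importable.")
```

Pass other names, such as `sudachidict_small` or `sudachidict_full`, to check a different dictionary edition.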

Usage: As a command

SudachiPy provides a CLI command, sudachipy.

    $ echo "外国人参政権" | sudachipy
    外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権
    EOS
    $ echo "外国人参政権" | sudachipy -m A
    外国	名詞,普通名詞,一般,*,*,*	外国
    人	接尾辞,名詞的,一般,*,*,*	人
    参政	名詞,普通名詞,一般,*,*,*	参政
    権	接尾辞,名詞的,一般,*,*,*	権
    EOS

```bash
$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-s string]
                          [-a] [-d] [-v]
                          [file [file ...]]

Tokenize Text

positional arguments:
  file           text written in utf-8

optional arguments:
  -h, --help     show this help message and exit
  -r file        the setting file in JSON format
  -m {A,B,C}     the mode of splitting
  -o file        the output file
  -s string      sudachidict type
  -a             print all of the fields
  -d             print the debug information
  -v, --version  print sudachipy version
```
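If you want to drive the command from a script, the options in the help text map directly onto an argument list. The sketch below is one possible wrapper and not part of SudachiPy: `build_tokenize_command` and `tokenize_with_cli` are hypothetical names, and only the flags listed above (`-r`, `-m`, `-s`, `-a`) are used.

```python
import subprocess


def build_tokenize_command(mode=None, all_fields=False, dict_type=None, settings=None):
    """Assemble an argument list for `sudachipy tokenize` from the documented flags."""
    command = ["sudachipy", "tokenize"]
    if settings is not None:
        command += ["-r", settings]  # the setting file in JSON format
    if mode is not None:
        if mode not in ("A", "B", "C"):
            raise ValueError("mode must be 'A', 'B', or 'C'")
        command += ["-m", mode]  # the mode of splitting
    if dict_type is not None:
        command += ["-s", dict_type]  # sudachidict type
    if all_fields:
        command.append("-a")  # print all of the fields
    return command


def tokenize_with_cli(text, **options):
    """Pipe `text` to the CLI and return its standard output as a string.

    Requires the sudachipy command and a dictionary to be installed.
    """
    completed = subprocess.run(
        build_tokenize_command(**options),
        input=text,
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return completed.stdout
```

For example, `tokenize_with_cli("外国人参政権", mode="A")` corresponds to the second shell command shown earlier in this section.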

Output

Columns are tab separated.

  • Surface
  • Part-of-Speech Tags (comma separated)
  • Normalized Form

When you add the -a option, it additionally outputs

  • Dictionary Form
  • Reading Form
  • Dictionary ID
    • 0 for the system dictionary
    • 1 and above for the user dictionaries
    • -1\t(OOV) if a word is Out-of-Vocabulary (not in the dictionary)

    $ echo "外国人参政権" | sudachipy -a
    外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権	外国人参政権	ガイコクジンサンセイケン	0
    EOS

    $ echo "阿quei" | sudachipy -a
    阿	名詞,普通名詞,一般,*,*,*	阿	阿		-1	(OOV)
    quei	名詞,普通名詞,一般,*,*,*	quei	quei		-1	(OOV)
    EOS
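Because the columns are tab separated and each sentence ends with an EOS line, the output is straightforward to parse. The following is a minimal sketch, not part of SudachiPy: the `Token` record and both function names are hypothetical, and it assumes that the reading form column may be empty for OOV words.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    surface: str
    part_of_speech: List[str]
    normalized_form: str
    dictionary_form: Optional[str] = None  # only present with -a
    reading_form: Optional[str] = None     # only present with -a
    dictionary_id: Optional[int] = None    # only present with -a
    is_oov: bool = False


def parse_line(line: str) -> Token:
    """Parse one tab-separated output line into a Token."""
    fields = line.split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least 3 tab-separated columns: %r" % line)
    token = Token(fields[0], fields[1].split(","), fields[2])
    if len(fields) >= 6:  # extra columns printed by the -a option
        token = token._replace(
            dictionary_form=fields[3],
            reading_form=fields[4],
            dictionary_id=int(fields[5]),
            is_oov=len(fields) > 6 and fields[6] == "(OOV)",
        )
    return token


def parse_output(output: str) -> List[List[Token]]:
    """Group parsed tokens into sentences, using EOS lines as terminators."""
    sentences, current = [], []
    for line in output.splitlines():
        if line == "EOS":
            sentences.append(current)
            current = []
        elif line:
            current.append(parse_line(line))
    if current:  # tolerate output that lacks a final EOS
        sentences.append(current)
    return sentences
```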

Usage: As a Python package

Here is an example:

```python
from sudachipy import tokenizer
from sudachipy import dictionary

tokenizer_obj = dictionary.Dictionary().create()
```

```python
# Multi-granular Tokenization

mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家公務員']

mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務員']

mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務', '員']
```

```python
# Morpheme information

m = tokenizer_obj.tokenize("食べ", mode)[0]

m.surface()  # => '食べ'
m.dictionary_form()  # => '食べる'
m.reading_form()  # => 'タベ'
m.part_of_speech()  # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']
```

```python
# Normalization

tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
# => '付属'

tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()
# => 'サマー'

tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'
```

(With 20200330 core dictionary. The results may change when you use other versions)
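To compare the three granularities side by side, the calls above can be wrapped in a small helper. This is a sketch rather than part of SudachiPy: `split_all_modes` and `demo` are hypothetical names, and the helper only relies on the `tokenize(text, mode)` and `surface()` calls shown above.

```python
def split_all_modes(tokenizer_obj, text, modes):
    """Tokenize `text` once per split mode and collect the surface strings.

    `modes` maps a label of your choice to a split-mode value.
    """
    return {
        label: [m.surface() for m in tokenizer_obj.tokenize(text, mode)]
        for label, mode in modes.items()
    }


def demo():
    """Print the A/B/C splits of a sample word.

    Requires sudachipy and a dictionary package to be installed.
    """
    from sudachipy import dictionary, tokenizer

    tokenizer_obj = dictionary.Dictionary().create()
    split_mode = tokenizer.Tokenizer.SplitMode
    modes = {"A": split_mode.A, "B": split_mode.B, "C": split_mode.C}
    for label, surfaces in split_all_modes(tokenizer_obj, "国家公務員", modes).items():
        print(label, surfaces)
```

Call `demo()` in an environment where a dictionary is installed; the exact splits depend on the dictionary version.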

Dictionary Edition

WARNING: The sudachipy link command is no longer available in SudachiPy v0.5.2 and later.

There are three editions of Sudachi Dictionary: small, core, and full. See WorksApplications/SudachiDict for details.

SudachiPy uses sudachidict_core by default.

Dictionaries are installed as Python packages sudachidict_small, sudachidict_core, and sudachidict_full.

The dictionary files are not included in the packages themselves; they are downloaded upon installation.

Dictionary option: command line

You can specify the dictionary with the tokenize option -s.

    $ pip install sudachidict_small
    $ echo "外国人参政権" | sudachipy -s small

    $ pip install sudachidict_full
    $ echo "外国人参政権" | sudachipy -s full

Dictionary option: Python package

You can specify the dictionary with the Dictionary() arguments config_path or dict_type.

    class Dictionary(config_path=None, resource_dir=None, dict_type=None)

  1. config_path
    • You can specify the file path to the setting file with config_path (See Dictionary in The Setting File for the detail).
    • If the dictionary file is specified in the setting file as systemDict, SudachiPy will use the dictionary.
  2. dict_type
    • You can also specify the dictionary type with dict_type.
    • The available arguments are small, core, or full.
    • If different dictionaries are specified with config_path and dict_type, the dictionary specified by dict_type overrides the one defined in the setting file.

```python
from sudachipy import tokenizer
from sudachipy import dictionary

# default: sudachidict_core
tokenizer_obj = dictionary.Dictionary().create()

# The dictionary given by the systemDict key in the config file (/path/to/sudachi.json) will be used
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json").create()

# The dictionary specified by dict_type will be set.
tokenizer_obj = dictionary.Dictionary(dict_type="core").create()  # sudachidict_core (same as default)
tokenizer_obj = dictionary.Dictionary(dict_type="small").create()  # sudachidict_small
tokenizer_obj = dictionary.Dictionary(dict_type="full").create()  # sudachidict_full

# The dictionary specified by dict_type overrides those defined in the config path.
# In the following code, sudachidict_full will be used regardless of a dictionary defined in the config file.
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json", dict_type="full").create()
```
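The precedence rules above can be summarized as a small decision function. This is only an illustrative model of the documented behavior, not SudachiPy's actual implementation; `effective_dictionary` is a hypothetical name, and the `ValueError` for unknown types is this sketch's own choice.

```python
def effective_dictionary(config_system_dict=None, dict_type=None):
    """Describe which system dictionary the documented rules would select.

    config_system_dict: the systemDict value from the setting file, if any.
    dict_type: "small", "core", or "full", if given.
    """
    if dict_type is not None:
        if dict_type not in ("small", "core", "full"):
            raise ValueError("dict_type must be 'small', 'core', or 'full'")
        # dict_type overrides whatever the setting file specifies.
        return "sudachidict_" + dict_type
    if config_system_dict is not None:
        return config_system_dict
    # Neither argument given: the default edition.
    return "sudachidict_core"
```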

Dictionary in The Setting File

Alternatively, if the dictionary file is specified in the setting file, sudachi.json, SudachiPy will use that file.

{ "systemDict" : "relative/path/to/system.dic", ... }

The default setting file is sudachipy/resources/sudachi.json. You can specify your sudachi.json with the -r option.

    $ sudachipy -r path/to/sudachi.json
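To check which file a setting file points at, you can resolve its systemDict value yourself. The sketch below is not part of SudachiPy: `resolve_system_dict` is a hypothetical helper, and it assumes the relative path is interpreted against the directory that contains sudachi.json, as described for customized dictionaries later in this README.

```python
import json
import os


def resolve_system_dict(settings_path):
    """Return the absolute path named by systemDict in a sudachi.json file.

    Returns None when the file has no systemDict key.
    """
    with open(settings_path, encoding="utf-8") as f:
        settings = json.load(f)
    relative = settings.get("systemDict")
    if relative is None:
        return None
    # Assumption: paths are relative to the directory holding the setting file.
    base_dir = os.path.dirname(os.path.abspath(settings_path))
    return os.path.normpath(os.path.join(base_dir, relative))
```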

User Dictionary

To use a user dictionary, user.dic, place sudachi.json anywhere you like and add a userDict value with the relative path from sudachi.json to your user.dic.

{ "userDict" : ["relative/path/to/user.dic"], ... }

Then specify your sudachi.json with the -r option.

    $ sudachipy -r path/to/sudachi.json
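Editing the setting file can also be scripted. The following is a minimal sketch, not part of SudachiPy: `add_user_dict` is a hypothetical helper, and it assumes `settings_path` points at an existing copy of sudachi.json that you are free to modify.

```python
import json
import os


def add_user_dict(settings_path, user_dic_path):
    """Add user_dic_path to the userDict list of an existing sudachi.json.

    The path is stored relative to the directory of the setting file.
    Returns the updated settings as a dict.
    """
    with open(settings_path, encoding="utf-8") as f:
        settings = json.load(f)
    base_dir = os.path.dirname(os.path.abspath(settings_path))
    relative = os.path.relpath(os.path.abspath(user_dic_path), base_dir)
    relative = relative.replace(os.sep, "/")  # forward slashes, as in the README examples
    user_dicts = settings.setdefault("userDict", [])
    if relative not in user_dicts:  # avoid registering the same file twice
        user_dicts.append(relative)
    with open(settings_path, "w", encoding="utf-8") as f:
        json.dump(settings, f, ensure_ascii=False, indent=2)
    return settings
```

After running it, pass the same sudachi.json to the command with the -r option.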

You can build a user dictionary with the subcommand ubuild.

WARNING: ubuild in v0.3.* contains a bug.

```bash
$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]

Build User Dictionary

positional arguments:
  file        source files with CSV format (one or more)

optional arguments:
  -h, --help  show this help message and exit
  -d string   description comment to be embedded on dictionary
  -o file     output file (default: user.dic)
  -s file     system dictionary path (default: system core dictionary path)
```

For the dictionary file format, please refer to this document (written in Japanese; an English version is not available yet).
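As a rough illustration of what a CSV source row looks like, the sketch below builds one 18-column row. It is not part of SudachiPy: the helper names are hypothetical, the column meanings in the comments are assumptions based on one reading of the Japanese format document, and the connection IDs and cost in the usage note are placeholders. The trailing `*` columns mirror the rows quoted in the "Problem with user defined dictionary" issue below. Check the format document before relying on any of this.

```python
import csv
import io

# Assumed column layout (18 columns):
#   0 surface used for lookup, 1 left connection ID, 2 right connection ID, 3 cost,
#   4 surface used for display, 5-10 part of speech (six levels), 11 reading form,
#   12 normalized form, 13 dictionary form ID, 14 split type,
#   15 A-unit split, 16 B-unit split, 17 unused
POS_LEVELS = 6


def user_dict_row(surface, left_id, right_id, cost, pos, reading, normalized=None):
    """Build one source row, filling the trailing columns with '*'."""
    if len(pos) != POS_LEVELS:
        raise ValueError("pos must have exactly six levels; use '*' for unused ones")
    return [
        surface, str(left_id), str(right_id), str(cost), surface,
        *pos,
        reading,
        normalized if normalized is not None else surface,
        "*", "*", "*", "*", "*",
    ]


def rows_to_csv(rows):
    """Serialize rows to CSV text with one entry per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

For example, `rows_to_csv([user_dict_row("徳島駅前", 1, 1, 5000, ["名詞", "固有名詞", "一般", "*", "*", "*"], "トクシマエキマエ")])` produces a single line that can be saved to a file and passed to ubuild as the positional file argument.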

Customized System Dictionary

```bash
$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]

Build Sudachi Dictionary

positional arguments:
  file        source files with CSV format (one of more)

optional arguments:
  -h, --help  show this help message and exit
  -o file     output file (default: system.dic)
  -d string   description comment to be embedded on dictionary

required named arguments:
  -m file     connection matrix file with MeCab's matrix.def format
```

To use your customized system.dic, place sudachi.json anywhere you like and overwrite the systemDict value with the relative path from sudachi.json to your system.dic.

{ "systemDict" : "relative/path/to/system.dic", ... }

Then specify your sudachi.json with the -r option.

    $ sudachipy -r path/to/sudachi.json

For Developers

Cython Build

    $ python setup.py build_ext --inplace

Code Format

Run scripts/format.sh to check if your code is formatted correctly.

You need the packages flake8, flake8-import-order, and flake8-builtins (see requirements.txt).

Test

Run scripts/test.sh to run the tests.

Contact

Sudachi and SudachiPy are developed by WAP Tokushima Laboratory of AI and NLP.

Open an issue, or come to our Slack workspace for questions and discussion.

https://sudachi-dev.slack.com/ (Get invitation here)

Enjoy tokenization!

Issues

Cannot install SudachiDict-core 20200722

opened on 2022-02-12 11:48:42 by wubowen416

Hi, thank you for providing this repo.

I tried to install SudachiDict-core 20200722 from https://pypi.org/project/SudachiDict-core/20200722/, but the pip tells me that

```
(gd) [email protected]:~/repo/gesture-diffusion$ python -m pip install SudachiDict-core==20200722
Collecting SudachiDict-core==20200722
  Using cached SudachiDict-core-20200722.tar.gz (8.8 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [21 lines of output]
      Downloading the Sudachi dictionary (It may take a while) ...
      Traceback (most recent call last):
        File "", line 2, in
        File "", line 34, in
        File "/tmp/pip-install-vt2se1yi/sudachidict-core_8a9a33469456470f914729aca871a592/setup.py", line 44, in
          _, _msg = urlretrieve(ZIP_URL, ZIP_NAME)
        File "/home/wu/anaconda3/envs/gd/lib/python3.8/urllib/request.py", line 247, in urlretrieve
          with contextlib.closing(urlopen(url, data)) as fp:
        File "/home/wu/anaconda3/envs/gd/lib/python3.8/urllib/request.py", line 222, in urlopen
          return opener.open(url, data, timeout)
        File "/home/wu/anaconda3/envs/gd/lib/python3.8/urllib/request.py", line 531, in open
          response = meth(req, response)
        File "/home/wu/anaconda3/envs/gd/lib/python3.8/urllib/request.py", line 640, in http_response
          response = self.parent.error(
        File "/home/wu/anaconda3/envs/gd/lib/python3.8/urllib/request.py", line 569, in error
          return self._call_chain(*args)
        File "/home/wu/anaconda3/envs/gd/lib/python3.8/urllib/request.py", line 502, in _call_chain
          result = func(*args)
        File "/home/wu/anaconda3/envs/gd/lib/python3.8/urllib/request.py", line 649, in http_error_default
          raise HTTPError(req.full_url, code, msg, hdrs, fp)
      urllib.error.HTTPError: HTTP Error 401: Unauthorized
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
```

Could you help me to find out? Thanks in advance.

Upgrading sortedcontainers

opened on 2022-01-21 23:51:24 by radurevutchi

Can we please loosen the sortedcontainers requirement. I have the PR here https://github.com/WorksApplications/SudachiPy/pull/170

sudachipy/lattice.pyx:35:12: Assignment to const attribute 'connect_costs'

opened on 2021-09-03 16:30:14 by tien-le

Hi, After building the latest codes from git (branch "develop"), I have the following issue:

```
$ python setup.py build_ext --inplace

running build_ext
skipping 'sudachipy/latticenode.c' Cython extension (up-to-date)
cythoning sudachipy/lattice.pyx to sudachipy/lattice.c

Error compiling Cython file:

...
cdef LatticeNode bos_node = LatticeNode()
bos_params = grammar.get_bos_parameter()
bos_node.set_parameter(bos_params[0], bos_params[1], bos_params[2])
bos_node.is_connected_to_bos = True
self.end_lists.append([bos_node])
self.connect_costs = self.grammar._matrix_view
^

sudachipy/lattice.pyx:35:12: Assignment to const attribute 'connect_costs'

Error compiling Cython file:

...
pyx_result = Lattice.__new(__pyx_type)
if __pyx_state is not None:
    __pyx_unpickle_Lattice__set_state( __pyx_result, __pyx_state)
return __pyx_result
cdef __pyx_unpickle_Lattice__set_state(Lattice __pyx_result, tuple __pyx_state):
    __pyx_result.capacity = __pyx_state[0]; __pyx_result.connect_costs = __pyx_state[1]; __pyx_result.end_lists = __pyx_state[2]; __pyx_result.eos_node = __pyx_state[3]; __pyx_result.eos_params = __pyx_state[4]; __pyx_result.grammar = __pyx_state[5]; __pyx_result.size = __pyx_state[6]
^

(tree fragment):10:56: Assignment to const attribute 'connect_costs'
building 'sudachipy.lattice' extension
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/home/user/python/include/python3.6m -c sudachipy/lattice.c -o build/temp.linux-x86_64-3.6/sudachipy/lattice.o
sudachipy/lattice.c:1:2: error: #error Do not use this file, it is the result of a failed Cython compilation.
 #error Do not use this file, it is the result of a failed Cython compilation.
  ^~~~~
error: command 'gcc' failed with exit status 1
```

Can you show me how to tackle the above issue, please? Thanks in advance for your help.

Training Sudachi on new data

opened on 2021-06-27 23:40:54 by ardakdemir

If I understand correctly, Sudachi is a lattice-based tokenizer and uses the occurrence probabilities and left-right probabilities (costs) for finding the best token sequence.

We would like to know whether we could customize these cost values. I imagine that in a niche domain like biomedicine with many unknown bacteria/disease names, we need domain-specific values to have the best tokenizer.

ImportError: DLL load failed

opened on 2020-10-05 08:59:31 by lash-1997

Excuse me. `from sudachipy import tokenizer` fails. I also found that none of the .pyd files in sudachipy can be imported. System: Windows 10, Python: 3.7.6, SudachiPy: 0.4.9

Problem with user defined dictionary

opened on 2020-09-17 10:33:36 by JSB97

I am making use of sudachipy via ginza, and am trying to annotate the following sentences.

プロ野球の中日で選手、監督を務め、1月4日に70歳で死去した星野仙一氏をしのび、3日、名古屋市東区のナゴヤドームで行われた中日―楽天のオープン戦は追悼試合として開催された。 明治大の後輩、島内宏明外野手は「改めてすごい人だったんだなと思った」と話した。

And in my dictionary I have the following lines, which match 明治 and 楽天 in the above. There are no other lines in the dictionary that match any substrings in the sentence.

    楽天,1288,1288,100,楽天_4755-2018,名詞,固有名詞,組織,上場会社,*,*,RAKUTEN,楽天,*,*,*,*,*
    明治,1288,1288,100,明治_2261-2009,名詞,固有名詞,組織,上場会社,*,*,MEIJI,明治,*,*,*,*,*

When I try to run annotations with this configuration, I get the error below:

```
...

  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/language.py", line 441, in __call__
    doc = self.make_doc(text)
  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 281, in make_doc
    return self.tokenizer(text)
  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 144, in __call__
    dtokens = self._get_dtokens(sudachipy_tokens)
  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 182, in _get_dtokens
    ) for idx, token in enumerate(sudachipy_tokens) if len(token.surface()) > 0
  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 182, in
    ) for idx, token in enumerate(sudachipy_tokens) if len(token.surface()) > 0
  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/sudachipy/morpheme.py", line 36, in part_of_speech
    return self.list.grammar.get_part_of_speech_string(wi.pos_id)
  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/sudachipy/dictionarylib/grammar.py", line 55, in get_part_of_speech_string
    return self.pos_list[pos_id]
IndexError: list index out of range
```

Could someone advise me as to what is causing this error please?

I am quite certain the sentence with 明治 is causing the issue, because if I remove the second sentence, the annotation works fine. It therefore seems that 楽天 is being picked up by SudachiPy with the dictionary, but 明治 is not.

Why is this?

Releases

v0.5.4 2021-09-25 14:31:06

Fixed a bug related to user-defined parts of speech

  • When multiple user dictionaries with user-defined parts of speech are used, the user-defined POS IDs of the second and subsequent user dictionaries become invalid (IndexError: list index out of range)

v0.5.3 2021-09-10 04:50:41

Fixed the following bugs

  • Words containing digits cannot be properly registered in split information
  • Slow to build user dictionary
  • Some katakana words are analyzed as OOV

v0.5.2 2021-03-26 08:43:13

Do not use symbolic links to specify dictionary types.

  • Added option -s to specify dictionary type
  • Added argument to Dictionary class to specify dictionary type
  • Removed the option to create a link

    $ pip install sudachidict_full
    $ echo "外国人参政権" | sudachipy -s full

v0.5.1 2021-01-04 02:38:17

Fix command line option related issue

  • https://github.com/WorksApplications/SudachiPy/pull/151 Fix -a option (print all of the fields) (Error reported in https://github.com/WorksApplications/SudachiPy/issues/150)

v0.5.0 2020-12-18 09:38:13

Support for new dictionary format with synonym group IDs

v0.4.9 2020-06-19 06:20:58

Fix a Cythonization related issue

  • 134 Fix Morphemelist split (Error reported in #133)
