SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.
This repository is for the 0.5.* versions of SudachiPy; versions 0.6 and above are developed as Sudachi.rs.
```bash
$ pip install sudachipy sudachidict_core

$ echo "高輪ゲートウェイ駅" | sudachipy
高輪ゲートウェイ駅 名詞,固有名詞,一般,*,*,* 高輪ゲートウェイ駅
EOS

$ echo "高輪ゲートウェイ駅" | sudachipy -m A
高輪 名詞,固有名詞,地名,一般,*,* 高輪
ゲートウェイ 名詞,普通名詞,一般,*,*,* ゲートウェー
駅 名詞,普通名詞,一般,*,*,* 駅
EOS

$ echo "空缶空罐空きカン" | sudachipy -a
空缶 名詞,普通名詞,一般,*,*,* 空き缶 空缶 アキカン 0
空罐 名詞,普通名詞,一般,*,*,* 空き缶 空罐 アキカン 0
空きカン 名詞,普通名詞,一般,*,*,* 空き缶 空きカン アキカン 0
EOS
```
You need SudachiPy and a dictionary.
```bash
$ pip install sudachipy
```
You can get a dictionary as a Python package. It may take a while to download the dictionary file (around 70MB for the `core` edition).
```bash
$ pip install sudachidict_core
```
Alternatively, you can choose another dictionary edition. See this section for details.
There is a CLI command, `sudachipy`.
```bash
$ echo "外国人参政権" | sudachipy
外国人参政権 名詞,普通名詞,一般,*,*,* 外国人参政権
EOS
$ echo "外国人参政権" | sudachipy -m A
外国 名詞,普通名詞,一般,*,*,* 外国
人 接尾辞,名詞的,一般,*,*,* 人
参政 名詞,普通名詞,一般,*,*,* 参政
権 接尾辞,名詞的,一般,*,*,* 権
EOS
```
```bash
$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-s string] [-a] [-d] [-v] [file [file ...]]

Tokenize Text

positional arguments:
  file           text written in utf-8

optional arguments:
  -h, --help     show this help message and exit
  -r file        the setting file in JSON format
  -m {A,B,C}     the mode of splitting
  -o file        the output file
  -s string      sudachidict type
  -a             print all of the fields
  -d             print the debug information
  -v, --version  print sudachipy version
```
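The options above can also be assembled from Python before calling the CLI. The following is a minimal sketch, not part of SudachiPy: `build_tokenize_argv` and `run_tokenize` are hypothetical helper names, and only flags listed in the help text (`-m`, `-a`, `-s`, `-r`) are used. Running `run_tokenize` assumes the `sudachipy` command is installed and on `PATH`.

```python
import subprocess
from typing import List, Optional


def build_tokenize_argv(mode: Optional[str] = None,
                        all_fields: bool = False,
                        dict_type: Optional[str] = None,
                        settings: Optional[str] = None) -> List[str]:
    """Build an argument list for `sudachipy tokenize` (hypothetical helper)."""
    argv = ["sudachipy", "tokenize"]
    if mode is not None:
        if mode not in ("A", "B", "C"):
            raise ValueError("mode must be one of A, B, C")
        argv += ["-m", mode]        # the mode of splitting
    if all_fields:
        argv.append("-a")           # print all of the fields
    if dict_type is not None:
        argv += ["-s", dict_type]   # sudachidict type
    if settings is not None:
        argv += ["-r", settings]    # the setting file in JSON format
    return argv


def run_tokenize(text: str, **options) -> str:
    """Pipe `text` to the CLI and return its standard output.

    Assumes SudachiPy and a dictionary are installed.
    """
    result = subprocess.run(build_tokenize_argv(**options), input=text,
                            capture_output=True, encoding="utf-8", check=True)
    return result.stdout
```

For example, `run_tokenize("外国人参政権", mode="A", all_fields=True)` would run the same command as `echo "外国人参政権" | sudachipy tokenize -m A -a`.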
Columns are tab separated.
When you add the `-a` option, it additionally outputs more fields, including a dictionary ID:
- `0` for the system dictionary
- `1` and above for the user dictionaries
- `-1\t(OOV)` if a word is Out-of-Vocabulary (not in the dictionary)
```bash
$ echo "外国人参政権" | sudachipy -a
外国人参政権 名詞,普通名詞,一般,*,*,* 外国人参政権 外国人参政権 ガイコクジンサンセイケン 0
EOS
```
```bash
$ echo "阿quei" | sudachipy -a
阿 名詞,普通名詞,一般,*,*,* 阿 阿 -1 (OOV)
quei 名詞,普通名詞,一般,*,*,* quei quei -1 (OOV)
EOS
```
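The tab-separated output described above is straightforward to parse. The sketch below is one possible way to do it; `TokenLine` and `parse_line` are hypothetical names, and the order of the extra `-a` fields (dictionary form, reading form, dictionary ID) is inferred from the example output rather than taken from a specification. Any columns after the dictionary ID are ignored.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TokenLine:
    surface: str
    part_of_speech: List[str]
    normalized_form: str
    # The fields below are only filled when the line came from `sudachipy -a`.
    dictionary_form: Optional[str] = None
    reading_form: Optional[str] = None
    dictionary_id: Optional[int] = None
    is_oov: bool = False


def parse_line(line: str) -> Optional[TokenLine]:
    """Parse one output line; return None for the EOS marker."""
    line = line.rstrip("\n")
    if line == "EOS":
        return None
    cols = line.split("\t")
    if len(cols) < 3:
        raise ValueError("expected at least 3 tab-separated columns")
    token = TokenLine(cols[0], cols[1].split(","), cols[2])
    if len(cols) >= 6:
        # Assumed column order for -a: dictionary form, reading form, dictionary ID.
        token.dictionary_form = cols[3]
        token.reading_form = cols[4]
        token.dictionary_id = int(cols[5])
        token.is_oov = token.dictionary_id == -1  # -1 marks Out-of-Vocabulary
    return token
```

A whole run can then be parsed with `[parse_line(l) for l in output.splitlines()]`, where `None` entries mark sentence boundaries.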
Here is an example:
```python
from sudachipy import tokenizer
from sudachipy import dictionary

tokenizer_obj = dictionary.Dictionary().create()
```
```python
# Multi-granular tokenization: C is the longest unit, A the shortest
mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家公務員']

mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務員']

mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務', '員']
```
```python
m = tokenizer_obj.tokenize("食べ", mode)[0]

m.surface() # => '食べ'
m.dictionary_form() # => '食べる'
m.reading_form() # => 'タベ'
m.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']
```
```python
tokenizer_obj.tokenize("附属", mode)[0].normalized_form()  # => '付属'
tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()  # => 'サマー'
tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()  # => 'シミュレーション'
```
(With the `20200330` `core` dictionary. The results may change when you use other versions.)
WARNING: `sudachipy link` is no longer available in SudachiPy v0.5.2 and later.
There are three editions of Sudachi Dictionary, namely, `small`, `core`, and `full`. See WorksApplications/SudachiDict for details.
SudachiPy uses `sudachidict_core` by default.
Dictionaries are installed as Python packages `sudachidict_small`, `sudachidict_core`, and `sudachidict_full`.
The dictionary files are not in the packages themselves, but are downloaded upon installation.
You can specify the dictionary with the tokenize option `-s`.
```bash
$ pip install sudachidict_small
$ echo "外国人参政権" | sudachipy -s small
```
```bash
$ pip install sudachidict_full
$ echo "外国人参政権" | sudachipy -s full
```
You can specify the dictionary with the `Dictionary()` arguments `config_path` or `dict_type`.
```python
class Dictionary(config_path=None, resource_dir=None, dict_type=None)
```
- `config_path`
    - You can specify the file path to the setting file with `config_path` (See Dictionary in The Setting File for details).
    - If the dictionary file is specified in the setting file as `systemDict`, SudachiPy will use the dictionary.
- `dict_type`
    - You can also specify the dictionary type with `dict_type`.
    - The available arguments are `small`, `core`, or `full`.
    - If different dictionaries are specified with `config_path` and `dict_type`, the dictionary given by `dict_type` overrides the one defined in the config file.
```python
from sudachipy import tokenizer
from sudachipy import dictionary

# default: sudachidict_core
tokenizer_obj = dictionary.Dictionary().create()

# The dictionary given by the `systemDict` key in the config file (/path/to/sudachi.json) will be used
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json").create()

# The dictionary specified by `dict_type` will be set.
tokenizer_obj = dictionary.Dictionary(dict_type="core").create()  # sudachidict_core (same as default)
tokenizer_obj = dictionary.Dictionary(dict_type="small").create()  # sudachidict_small
tokenizer_obj = dictionary.Dictionary(dict_type="full").create()  # sudachidict_full

# The dictionary specified by `dict_type` overrides those defined in the config path.
# In the following code, `sudachidict_full` will be used regardless of a dictionary defined in the config file.
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json", dict_type="full").create()
```
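The precedence rules above can be summarized in a few lines. This is an illustrative sketch only, not SudachiPy's actual implementation: `resolve_dictionary` is a hypothetical function that takes the already-parsed settings as a plain `dict` and reports which dictionary would be chosen.

```python
from typing import Optional, Tuple

DICT_TYPES = ("small", "core", "full")


def resolve_dictionary(settings: dict,
                       dict_type: Optional[str] = None) -> Tuple[str, str]:
    """Return ("package", name) or ("file", path) for the dictionary to use.

    Hypothetical helper that mirrors the documented precedence:
    dict_type first, then systemDict from the settings, then the core default.
    """
    if dict_type is not None:
        if dict_type not in DICT_TYPES:
            raise ValueError("dict_type must be small, core, or full")
        return ("package", "sudachidict_" + dict_type)
    if settings.get("systemDict"):
        return ("file", settings["systemDict"])
    return ("package", "sudachidict_core")
```

With `settings = {"systemDict": "relative/path/to/system.dic"}`, calling `resolve_dictionary(settings)` selects the file, while `resolve_dictionary(settings, "full")` selects the `sudachidict_full` package.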
Alternatively, if the dictionary file is specified in the setting file, `sudachi.json`, SudachiPy will use that file.
```js
{
    "systemDict" : "relative/path/to/system.dic",
    ...
}
```
The default setting file is sudachipy/resources/sudachi.json. You can specify your `sudachi.json` with the `-r` option.
```bash
$ sudachipy -r path/to/sudachi.json
```
To use a user dictionary, `user.dic`, place `sudachi.json` anywhere you like, and add a `userDict` value with the relative path from `sudachi.json` to your `user.dic`.
```js
{
    "userDict" : ["relative/path/to/user.dic"],
    ...
}
```
Then specify your `sudachi.json` with the `-r` option.
```bash
$ sudachipy -r path/to/sudachi.json
```
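Writing the `userDict` entry by hand is easy to get wrong because the path must be relative to `sudachi.json`. The sketch below computes that path and adds it to an existing settings file while keeping all other keys. `with_user_dict` and `update_settings_file` are hypothetical helpers written for this example, not SudachiPy functions.

```python
import json
import os


def with_user_dict(settings: dict, settings_path: str, user_dic_path: str) -> dict:
    """Return a copy of `settings` whose userDict list includes `user_dic_path`,
    expressed relative to the directory that holds the settings file."""
    base_dir = os.path.dirname(os.path.abspath(settings_path))
    relative = os.path.relpath(os.path.abspath(user_dic_path), base_dir)
    relative = relative.replace(os.sep, "/")  # keep forward slashes in the JSON
    updated = dict(settings)
    user_dicts = list(updated.get("userDict", []))
    if relative not in user_dicts:
        user_dicts.append(relative)
    updated["userDict"] = user_dicts
    return updated


def update_settings_file(settings_path: str, user_dic_path: str) -> None:
    """Rewrite the settings file in place, leaving other keys untouched."""
    with open(settings_path, encoding="utf-8") as f:
        settings = json.load(f)
    updated = with_user_dict(settings, settings_path, user_dic_path)
    with open(settings_path, "w", encoding="utf-8") as f:
        json.dump(updated, f, ensure_ascii=False, indent=4)
```

After `update_settings_file("path/to/sudachi.json", "path/to/user.dic")`, the file can be passed to the CLI with the `-r` option as usual.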
You can build a user dictionary with the subcommand `ubuild`.
WARNING: `ubuild` in v0.3.* contains a bug.
```bash
$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]

Build User Dictionary

positional arguments:
  file        source files with CSV format (one or more)

optional arguments:
  -h, --help  show this help message and exit
  -d string   description comment to be embedded on dictionary
  -o file     output file (default: user.dic)
  -s file     system dictionary path (default: system core dictionary path)
```
For the dictionary file format, please refer to this document (written in Japanese; an English version is not available yet).
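As a rough aid until an English description is available, the sketch below checks that each line of a source CSV has the expected number of columns before it is passed to `ubuild`. The 18-column layout and the column names in `COLUMNS` are assumptions based on our reading of the Sudachi user dictionary format, not an authoritative definition; `read_entries` is a hypothetical helper.

```python
import csv
import io
from typing import Dict, List

# Assumed layout of one user dictionary row (labels chosen for this sketch).
COLUMNS = [
    "surface", "left_id", "right_id", "cost", "headword",
    "pos1", "pos2", "pos3", "pos4", "conjugation_type", "conjugation_form",
    "reading", "normalized_form", "dictionary_form_id",
    "split_type", "split_a", "split_b", "unused",
]


def read_entries(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text into dicts, raising ValueError on a wrong column count."""
    entries = []
    for line_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != len(COLUMNS):
            raise ValueError(
                f"line {line_no}: expected {len(COLUMNS)} columns, got {len(row)}")
        entries.append(dict(zip(COLUMNS, row)))
    return entries


sample = "楽天,1288,1288,100,楽天_4755-2018,名詞,固有名詞,組織,上場会社,*,*,RAKUTEN,楽天,*,*,*,*,*\n"
entries = read_entries(sample)
```

This only validates the shape of each row; it does not check connection IDs or part-of-speech values against the system dictionary.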
```bash
$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]

Build Sudachi Dictionary

positional arguments:
  file        source files with CSV format (one of more)

optional arguments:
  -h, --help  show this help message and exit
  -o file     output file (default: system.dic)
  -d string   description comment to be embedded on dictionary

required named arguments:
  -m file     connection matrix file with MeCab's matrix.def format
```
To use your customized `system.dic`, place `sudachi.json` anywhere you like, and overwrite the `systemDict` value with the relative path from `sudachi.json` to your `system.dic`.
```js
{
    "systemDict" : "relative/path/to/system.dic",
    ...
}
```
Then specify your `sudachi.json` with the `-r` option.
```bash
$ sudachipy -r path/to/sudachi.json
```
```sh
$ python setup.py build_ext --inplace
```
Run `scripts/format.sh` to check if your code is formatted correctly.
You need the packages `flake8`, `flake8-import-order`, and `flake8-builtins` (See `requirements.txt`).
Run `scripts/test.sh` to run the tests.
Sudachi and SudachiPy are developed by WAP Tokushima Laboratory of AI and NLP.
Open an issue, or come to our Slack workspace for questions and discussion.
https://sudachi-dev.slack.com/ (Get invitation here)
Enjoy tokenization!
Hi, thank you for providing this repo.
I tried to install SudachiDict-core 20200722 from https://pypi.org/project/SudachiDict-core/20200722/, but pip reports the following error:
```
(gd) [email protected]:~/repo/gesture-diffusion$ python -m pip install SudachiDict-core==20200722
Collecting SudachiDict-core==20200722
  Using cached SudachiDict-core-20200722.tar.gz (8.8 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [21 lines of output]
      Downloading the Sudachi dictionary (It may take a while) ...
      Traceback (most recent call last):
        File "

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
```
Could you help me find out what is going wrong? Thanks in advance.
Can we please loosen the `sortedcontainers` requirement? I have a PR here:
https://github.com/WorksApplications/SudachiPy/pull/170
Hi, after building the latest code from git (branch "develop"), I have the following issue:
```
$ python setup.py build_ext --inplace

running build_ext
skipping 'sudachipy/latticenode.c' Cython extension (up-to-date)
cythoning sudachipy/lattice.pyx to sudachipy/lattice.c

...
        cdef LatticeNode bos_node = LatticeNode()
        bos_params = grammar.get_bos_parameter()
        bos_node.set_parameter(bos_params[0], bos_params[1], bos_params[2])
        bos_node.is_connected_to_bos = True
        self.end_lists.append([bos_node])
        self.connect_costs = self.grammar._matrix_view
           ^

sudachipy/lattice.pyx:35:12: Assignment to const attribute 'connect_costs'

...
    __pyx_result = Lattice.__new__(__pyx_type)
    if __pyx_state is not None:
        __pyx_unpickle_Lattice__set_state(

(tree fragment):10:56: Assignment to const attribute 'connect_costs'
building 'sudachipy.lattice' extension
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/home/user/python/include/python3.6m -c sudachipy/lattice.c -o build/temp.linux-x86_64-3.6/sudachipy/lattice.o
sudachipy/lattice.c:1:2: error: #error Do not use this file, it is the result of a failed Cython compilation.
 #error Do not use this file, it is the result of a failed Cython compilation.
  ^~~~~
error: command 'gcc' failed with exit status 1
```
Can you show me how to tackle the above issue, please? Thanks in advance for your help.
If I understand correctly, Sudachi is a lattice-based tokenizer and uses the occurrence probabilities and left-right probabilities (costs) for finding the best token sequence.
We would like to know whether we could customize these cost values. I imagine that in a niche domain like biomedicine with many unknown bacteria/disease names, we need domain-specific values to have the best tokenizer.
Sorry to bother you. `from sudachipy import tokenizer` fails, and I also found that none of the `.pyd` files in sudachipy can be imported. System: Windows 10, Python: 3.7.6, SudachiPy: 0.4.9
I am making use of sudachipy via ginza, and am trying to annotate the following sentences.
プロ野球の中日で選手、監督を務め、1月4日に70歳で死去した星野仙一氏をしのび、3日、名古屋市東区のナゴヤドームで行われた中日―楽天のオープン戦は追悼試合として開催された。
明治大の後輩、島内宏明外野手は「改めてすごい人だったんだなと思った」と話した。
And in my dictionary I have the following lines, which match `明治` and `楽天` in the above.
There are no other lines in the dictionary that match any substrings in the sentence.
楽天,1288,1288,100,楽天_4755-2018,名詞,固有名詞,組織,上場会社,*,*,RAKUTEN,楽天,*,*,*,*,*
明治,1288,1288,100,明治_2261-2009,名詞,固有名詞,組織,上場会社,*,*,MEIJI,明治,*,*,*,*,*
When I try to run annotations with this configuration, I get the error below:
```
...
  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/language.py", line 441, in __call__
    doc = self.make_doc(text)
  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 281, in make_doc
    return self.tokenizer(text)
  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 144, in __call__
    dtokens = self._get_dtokens(sudachipy_tokens)
  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 182, in _get_dtokens
    ) for idx, token in enumerate(sudachipy_tokens) if len(token.surface()) > 0
  File "/Users/jb/.pyenv/versions/3.6.1/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 182, in
```
Could someone advise me as to what is causing this error please?
I am quite certain the sentence with `明治` is causing the issue, because if I remove the second sentence, the annotation works fine. It therefore seems that `楽天` is being picked up by SudachiPy with the dictionary, but `明治` is not.
Why is this?
Fixed a bug related to user-defined parts of speech (`IndexError: list index out of range`)
Fixed the following bugs
Do not use symbolic links to specify dictionary types.
$ pip install sudachidict_full
$ echo "外国人参政権" | sudachipy -s full
Fix command line option related issue
`-a` option (`print all of the fields`) (Error reported in https://github.com/WorksApplications/SudachiPy/issues/150)
Support for new dictionary format with synonym group IDs
Fix a Cythonization related issue