Rejecting complicated operations for incorporating a word lexicon into Chinese NER.

v-mipeng, updated 🕥 2022-01-22 07:27:59


This is the implementation of our arXiv paper "Simplify the Usage of Lexicon in Chinese NER", which rejects complicated operations for incorporating a word lexicon into Chinese NER. We show that incorporating a lexicon into Chinese NER can be quite simple and, at the same time, effective.

Source code description


Requirements: Python 3.6, PyTorch 0.4.1

Input format:

CoNLL format, with each character and its label separated by a whitespace on its own line. The "BMES" tag scheme is preferred.

别 O
错 O
过 O
邻 O
近 O
的 O
湿 O
地 O

Pretrained embeddings:

The pretrained embeddings (word embedding, character embedding, and bichar embedding) are the same as those used in Lattice LSTM.

Run the code:

  1. Download the character embeddings and word embeddings from Lattice LSTM and put them in the data folder.
  2. Download the four datasets into data/MSRANER, data/OntoNotesNER, data/ResumeNER, and data/WeiboNER, respectively.
  3. Train on the four datasets:

  • To train on OntoNotes:

python --train data/OntoNotesNER/train.char.bmes --dev data/OntoNotesNER/dev.char.bmes --test data/OntoNotesNER/test.char.bmes --modelname OntoNotes --savedset data/OntoNotes.dset

  • To train on Resume:

python --train data/ResumeNER/train.char.bmes --dev data/ResumeNER/dev.char.bmes --test data/ResumeNER/test.char.bmes --modelname Resume --savedset data/Resume.dset --hidden_dim 200

  • To train on Weibo:

python --train data/WeiboNER/train.all.bmes --dev data/WeiboNER/dev.all.bmes --test data/WeiboNER/test.all.bmes --modelname Weibo --savedset data/Weibo.dset --lr=0.005 --hidden_dim 200

  • To train on MSRA:

python --train data/MSRANER/train.char.bmes --dev data/MSRANER/dev.char.bmes --test data/MSRANER/test.char.bmes --modelname MSRA --savedset data/MSRA.dset

  4. To train/test on your own data: modify the command with your own file paths and run it.
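To label your own raw text in the BMES scheme expected above, a hypothetical helper (not part of this repository) can convert entity span annotations into per-character tags:

```python
# Convert entity span annotations into per-character BMES tags.
# Hypothetical helper for preparing your own data, not repository code.

def spans_to_bmes(text, spans):
    """spans: list of (start, end_exclusive, entity_type) tuples."""
    tags = ["O"] * len(text)
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = "S-" + etype          # single-character entity
        else:
            tags[start] = "B-" + etype          # begin
            for i in range(start + 1, end - 1):
                tags[i] = "M-" + etype          # middle
            tags[end - 1] = "E-" + etype        # end
    return tags
```

Each character/tag pair is then written on its own line, with a blank line between sentences, matching the input format described earlier.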



opened on 2023-03-05 09:52:32 by y573082640

😊 Hello, I have some questions about the final concatenated vector: in Chinese NER, word_embs corresponds to the embedding of a single character, and the S set in gaz also contains single-character embeddings. Is there any special meaning in concatenating two character embeddings from different sources? Could one of them be removed, or the two sources be unified?


opened on 2023-02-14 05:16:54 by zhanghanweii

Where is the path to this file? Lattice LSTM only provides character embeddings and word embeddings.


opened on 2023-01-17 09:56:27 by summer-la

Why is the count for the Single set set directly to 1 instead of using the word frequency? (line 103) Why is the empty BMES set handled this way, and in gazs[idx][label].append(0), what does the 0 mean? (lines 120-124)



opened on 2022-12-24 13:13:10 by Maxystart

In the code, self.hidden2tag = nn.Linear(self.hidden_dim, data.label_alphabet_size+2). Why add 2? After adding 2, doesn't the output size differ from the actual number of tags?

How do I resolve OSError: Model name 'bert-base-chinese' was not found in tokenizers model name list(...)?

opened on 2022-12-22 02:59:48 by summer-la


opened on 2022-09-11 13:41:21 by studymryang
Minlong Peng

I am now working at Cognitive Computing Lab, Baidu as an NLP researcher.
