Rejects complicated operations for incorporating a word lexicon into Chinese NER.

v-mipeng, updated 2022-01-22 07:27:59

LexiconAugmentedNER

This is the implementation of our arXiv paper "Simplify the Usage of Lexicon in Chinese NER", which rejects complicated operations for incorporating a word lexicon into Chinese NER. We show that incorporating a lexicon into Chinese NER can be quite simple and, at the same time, effective.
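As a rough sketch of the idea (illustrative only, not the repository's actual code): each character collects the lexicon words that match it in the B/M/E/S positions, and the four sets are pooled into a fixed-size feature. Here, lexicon, freq, and embed are assumed inputs: a word set, a word-frequency table, and a word-embedding lookup.

# Illustrative sketch of the soft-lexicon feature (not the repository's exact code).
# Assumed inputs: `lexicon` is a set of words, `freq` maps word -> positive frequency,
# `embed` maps word -> a numpy vector of size `dim`.
import numpy as np

def soft_lexicon_features(sentence, lexicon, freq, embed, dim):
    """Build a frequency-weighted B/M/E/S word-set feature for each character."""
    n = len(sentence)
    sets = [{"B": [], "M": [], "E": [], "S": []} for _ in range(n)]
    # Assign every matched lexicon word to the B/M/E/S sets of its characters.
    for i in range(n):
        for j in range(i, n):
            word = sentence[i:j + 1]
            if word not in lexicon:
                continue
            if i == j:
                sets[i]["S"].append(word)
            else:
                sets[i]["B"].append(word)
                sets[j]["E"].append(word)
                for k in range(i + 1, j):
                    sets[k]["M"].append(word)
    # Pool each set by a frequency-weighted average, then concatenate the four.
    features = []
    for char_sets in sets:
        parts = []
        for tag in ("B", "M", "E", "S"):
            words = char_sets[tag]
            if words:
                total = sum(freq[w] for w in words)
                vec = sum(freq[w] * embed[w] for w in words) / total
            else:
                vec = np.zeros(dim)  # empty set falls back to a zero vector
            parts.append(vec)
        features.append(np.concatenate(parts))  # shape: (4 * dim,) per character
    return features

The resulting per-character feature is then concatenated with the character representation before the sequence-encoding layer.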

Source code description

Requirements:

Python 3.6
PyTorch 0.4.1

Input format:

CoNLL format, with each character and its label separated by whitespace on a line. The "BMES" tag scheme is preferred.

别 O
错 O
过 O
邻 O
近 O
大 B-LOC
鹏 M-LOC
湾 E-LOC
的 O
湿 O
地 O
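For reference, a minimal reader for this format might look like the following (a sketch; the repository has its own data loaders):

def read_bmes(path):
    """Yield (chars, labels) sentence pairs from a whitespace-separated file."""
    chars, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line marks a sentence boundary
                if chars:
                    yield chars, labels
                chars, labels = [], []
                continue
            char, label = line.split()
            chars.append(char)
            labels.append(label)
    if chars:  # flush the last sentence if the file lacks a trailing blank line
        yield chars, labels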

Pretrained embeddings:

The pretrained embeddings (word embeddings, character embeddings, and bichar embeddings) are the same as those used in Lattice LSTM.
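Assuming the usual plain-text format for these files (one token per line, followed by its vector values), a minimal loader could look like this sketch:

import numpy as np

def load_pretrained(path):
    """Return a dict mapping token -> vector from a text-format .vec file."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 2:  # skip an optional "count dim" header line
                continue
            table[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return table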

Run the code:

  1. Download the character embeddings and word embeddings from Lattice LSTM and put them in the data folder.
  2. Download the four datasets and place them in data/MSRANER, data/OntoNotesNER, data/ResumeNER, and data/WeiboNER, respectively.
  3. To train on the four datasets:

  • To train on OntoNotes:

python main.py --train data/OntoNotesNER/train.char.bmes --dev data/OntoNotesNER/dev.char.bmes --test data/OntoNotesNER/test.char.bmes --modelname OntoNotes --savedset data/OntoNotes.dset

  • To train on Resume:

python main.py --train data/ResumeNER/train.char.bmes --dev data/ResumeNER/dev.char.bmes --test data/ResumeNER/test.char.bmes --modelname Resume --savedset data/Resume.dset --hidden_dim 200

  • To train on Weibo:

python main.py --train data/WeiboNER/train.all.bmes --dev data/WeiboNER/dev.all.bmes --test data/WeiboNER/test.all.bmes --modelname Weibo --savedset data/Weibo.dset --lr=0.005 --hidden_dim 200

  • To train on MSRA:

python main.py --train data/MSRANER/train.char.bmes --dev data/MSRANER/dev.char.bmes --test data/MSRANER/test.char.bmes --modelname MSRA --savedset data/MSRA.dset

  4. To train/test on your own data: modify the command with your own file paths and run (see the example below).
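For example, with placeholder paths (all file names below are hypothetical):

python main.py --train path/to/your_train.char.bmes --dev path/to/your_dev.char.bmes --test path/to/your_test.char.bmes --modelname YourModel --savedset data/YourModel.dset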

Issues

Question about the setting of word_embs and the S set in gaz

opened on 2023-03-05 09:52:32 by y573082640

😊 Hello, author. I have some questions about the final concatenated vector: in Chinese NER, word_embs corresponds to single-character vectors, and the S set in gaz also contains single-character vectors. Is there any special meaning in concatenating two character vectors (from different sources)? Could one of them be dropped, or the two be drawn from a unified embedding source?

About the bigram feature file

opened on 2023-02-14 05:16:54 by zhanghanweii

Where can the file gigaword_chn.all.a2b.bi.ite50.vec be found? Lattice LSTM only provides character embeddings and word embeddings.

Some questions about how gaz and gaz_count work in the code

opened on 2023-01-17 09:56:27 by summer-la

Why is the count for Single set directly to 1 instead of using the word frequency? (functions.py, line 103) Why are empty BMES sets handled this way, and in gazs[idx][label].append(0), what does the 0 mean? (functions.py, lines 120-124)


Question about hidden2tag in the code

opened on 2022-12-24 13:13:10 by Maxystart

In the code, self.hidden2tag = nn.Linear(self.hidden_dim, data.label_alphabet_size+2). Why does 2 need to be added? After adding 2, doesn't the size differ from the actual number of tags?

How to resolve: OSError: Model name 'bert-base-chinese' was not found in tokenizers model name list (...)

opened on 2022-12-22 02:59:48 by summer-la

Can a training set with BIO annotation be used for training?

opened on 2022-09-11 13:41:21 by studymryang

Minlong Peng

I am now working at the Cognitive Computing Lab, Baidu, as an NLP researcher.
