195 lines
6.3 KiB
Plaintext
195 lines
6.3 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# dh_msra 说明\n",
|
||
"0. **下载地址:** [Github](https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/dh_msra/dh_msra.zip)\n",
|
||
"1. **数据概览:** 5 万多条中文命名实体识别标注数据([IOB2](https://dl.acm.org/citation.cfm?id=977059) 格式,符合 [CoNLL 2002](https://www.clips.uantwerpen.be/conll2002/ner/) 和 [CRF++](https://taku910.github.io/crfpp/#format) 标准)\n",
|
||
"2. **推荐实验:** 中文命名实体识别\n",
|
||
"2. **数据来源:** 不详\n",
|
||
"3. **原数据集:** [zh-NER-TF](https://github.com/Determined22/zh-NER-TF),网上搜集,具体作者、来源不详,可能是来自于 MSRA 的语料\n",
|
||
"4. **加工处理:**\n",
|
||
" 1. 将原来 2 个文件 (train 和 test) 整合到 1 个文件中"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"import codecs\n",
|
||
"import random\n",
|
||
"\n",
|
||
"import numpy as np"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"path = 'dh_msra_文件夹_所在_路径'"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# 1. dh_msra.txt"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 加载数据"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def load_iob2(file_path):\n",
|
||
" '''加载 IOB2 格式的数据'''\n",
|
||
" token_seqs = []\n",
|
||
" label_seqs = []\n",
|
||
" tokens = []\n",
|
||
" labels = []\n",
|
||
" with codecs.open(file_path) as f:\n",
|
||
" for index, line in enumerate(f):\n",
|
||
" items = line.strip().split()\n",
|
||
" if len(items) == 2:\n",
|
||
" token, label = items\n",
|
||
" tokens.append(token)\n",
|
||
" labels.append(label)\n",
|
||
" elif len(items) == 0:\n",
|
||
" if tokens:\n",
|
||
" token_seqs.append(tokens)\n",
|
||
" label_seqs.append(labels)\n",
|
||
" tokens = []\n",
|
||
" labels = []\n",
|
||
" else:\n",
|
||
" print('格式错误。行号:{} 内容:{}'.format(index, line))\n",
|
||
" continue\n",
|
||
" \n",
|
||
" if tokens: # 如果文件末尾没有空行,手动将最后一条数据加入序列的列表中\n",
|
||
" token_seqs.append(tokens)\n",
|
||
" label_seqs.append(labels) \n",
|
||
" \n",
|
||
" return np.array(token_seqs), np.array(label_seqs)\n",
|
||
"\n",
|
||
"\n",
|
||
"def show_iob2(token_seqs, label_seqs, num=5, shuffle=True):\n",
|
||
" '''显示 IOB2 格式数据'''\n",
|
||
" if shuffle:\n",
|
||
" length = len(token_seqs)\n",
|
||
" indexes = [random.randrange(0, length) for i in range(num)] \n",
|
||
" zip_seqs = zip(token_seqs[indexes], label_seqs[indexes])\n",
|
||
" else:\n",
|
||
" zip_seqs = zip(token_seqs[0:num], label_seqs[0:num])\n",
|
||
" \n",
|
||
" for tokens, labels in zip_seqs:\n",
|
||
" for token, label in zip(tokens, labels):\n",
|
||
" print('{}/{} '.format(token, label), end='')\n",
|
||
" print('\\n')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"55289 55289\n",
|
||
"\n",
|
||
"目/O 前/O “/O 继/B-PER 生/I-PER ”/O 共/O 产/O 仔/O 5/O 胎/O ,/O 产/O 下/O 小/O 老/O 虎/O 1/O 8/O 只/O ,/O 堪/O 称/O 虎/O 妈/O 妈/O 中/O 的/O 英/O 雄/O 。/O \n",
|
||
"\n",
|
||
"历/O 史/O 的/O 内/O 涵/O 是/O 很/O 丰/O 富/O 的/O ,/O 经/O 典/O 作/O 家/O 的/O 论/O 断/O 固/O 然/O 有/O 其/O 权/O 威/O 性/O 和/O 合/O 理/O 性/O ,/O 但/O 历/O 史/O 学/O 家/O 显/O 然/O 不/O 能/O 局/O 限/O 于/O 此/O 。/O \n",
|
||
"\n",
|
||
"5/O 月/O 3/O 0/O 日/O 在/O 中/B-LOC 国/I-LOC 革/I-LOC 命/I-LOC 军/I-LOC 事/I-LOC 博/I-LOC 物/I-LOC 馆/I-LOC 开/O 幕/O 的/O 全/O 国/O 禁/O 毒/O 展/O 览/O ,/O 在/O 社/O 会/O 上/O 引/O 起/O 了/O 强/O 烈/O 的/O 反/O 响/O 。/O \n",
|
||
"\n",
|
||
"另/O 外/O ,/O 还/O 有/O 一/O 个/O 惊/O 人/O 的/O 发/O 现/O :/O 有/O 的/O 发/O 展/O 中/O 国/O 家/O 人/O 均/O 国/O 民/O 资/O 源/O 非/O 常/O 丰/O 富/O ,/O 但/O 发/O 展/O 不/O 起/O 来/O 的/O 原/O 因/O 在/O 于/O 教/O 育/O 水/O 平/O 太/O 低/O 、/O 对/O 技/O 术/O 的/O 理/O 解/O 和/O 把/O 握/O 太/O 低/O 、/O 管/O 理/O 水/O 平/O 太/O 低/O 等/O 等/O ,/O 一/O 句/O 话/O ,/O 智/O 力/O 资/O 本/O 太/O 贫/O 乏/O 。/O \n",
|
||
"\n",
|
||
"这/O 还/O 要/O 看/O 进/O 一/O 步/O 深/O 入/O 调/O 查/O 的/O 结/O 果/O 。/O \n",
|
||
"\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"token_seqs, label_seqs = load_iob2(path+'dh_msra.txt')\n",
|
||
"\n",
|
||
"print(len(token_seqs), len(label_seqs))\n",
|
||
"print() \n",
|
||
"show_iob2(token_seqs, label_seqs)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 标签说明\n",
|
||
"\n",
|
||
"| 标签 | 说明 |\n",
|
||
"| ---- | ---- |\n",
|
||
"| LOC | 地点 (LOCATION) |\n",
|
||
"| ORG | 机构 (ORGANIZATION) |\n",
|
||
"| PER | 人物 (PERSON) |"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"{'B-LOC', 'B-ORG', 'B-PER', 'I-LOC', 'I-ORG', 'I-PER', 'O'}"
|
||
]
|
||
},
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"set([label for labels in label_seqs for label in labels])"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.6.5"
|
||
},
|
||
"widgets": {
|
||
"state": {},
|
||
"version": "1.1.2"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 2
|
||
}
|