{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# dh_msra overview\n",
"0. **Download:** [Github](https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/dh_msra/dh_msra.zip)\n",
"1. **Data overview:** more than 50,000 annotated sequences for Chinese named entity recognition ([IOB2](https://dl.acm.org/citation.cfm?id=977059) format, compatible with the [CoNLL 2002](https://www.clips.uantwerpen.be/conll2002/ner/) and [CRF++](https://taku910.github.io/crfpp/#format) conventions)\n",
"2. **Recommended experiment:** Chinese named entity recognition\n",
"3. **Data source:** unknown\n",
"4. **Original dataset:** [zh-NER-TF](https://github.com/Determined22/zh-NER-TF), collected from the web; the exact author and origin are unknown, and it may come from the MSRA corpus\n",
"5. **Processing:**\n",
"    1. Merged the original 2 files (train and test) into a single file"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import codecs\n",
"import random\n",
"\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"path = 'path/to/dh_msra/'  # placeholder: directory containing dh_msra.txt, with a trailing slash"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. dh_msra.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def load_iob2(file_path):\n",
"    '''Load IOB2 data: one token/label pair per line, sequences separated by blank lines.'''\n",
"    token_seqs = []\n",
"    label_seqs = []\n",
"    tokens = []\n",
"    labels = []\n",
"    # Read as UTF-8 explicitly (assumed file encoding) instead of the platform default\n",
"    with codecs.open(file_path, encoding='utf-8') as f:\n",
"        for index, line in enumerate(f, start=1):\n",
"            items = line.strip().split()\n",
"            if len(items) == 2:\n",
"                token, label = items\n",
"                tokens.append(token)\n",
"                labels.append(label)\n",
"            elif len(items) == 0:\n",
"                # A blank line ends the current sequence\n",
"                if tokens:\n",
"                    token_seqs.append(tokens)\n",
"                    label_seqs.append(labels)\n",
"                    tokens = []\n",
"                    labels = []\n",
"            else:\n",
"                print('Format error. Line: {} Content: {}'.format(index, line))\n",
"\n",
"    if tokens:  # No trailing blank line: append the last sequence manually\n",
"        token_seqs.append(tokens)\n",
"        label_seqs.append(labels)\n",
"\n",
"    # Sequences differ in length, so store them as object arrays\n",
"    return np.array(token_seqs, dtype=object), np.array(label_seqs, dtype=object)\n",
"\n",
"\n",
"def show_iob2(token_seqs, label_seqs, num=5, shuffle=True):\n",
"    '''Print IOB2 data as token/label pairs, one sequence per paragraph.'''\n",
"    if shuffle:\n",
"        # Draw num random indexes (with replacement)\n",
"        length = len(token_seqs)\n",
"        indexes = [random.randrange(0, length) for _ in range(num)]\n",
"        zip_seqs = zip(token_seqs[indexes], label_seqs[indexes])\n",
"    else:\n",
"        zip_seqs = zip(token_seqs[0:num], label_seqs[0:num])\n",
"\n",
"    for tokens, labels in zip_seqs:\n",
"        for token, label in zip(tokens, labels):\n",
"            print('{}/{} '.format(token, label), end='')\n",
"        print('\\n')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"55289 55289\n",
"\n",
"目/O 前/O “/O 继/B-PER 生/I-PER ”/O 共/O 产/O 仔/O 5/O 胎/O /O 产/O 下/O 小/O 老/O 虎/O 1/O 8/O 只/O /O 堪/O 称/O 虎/O 妈/O 妈/O 中/O 的/O 英/O 雄/O 。/O \n",
"\n",
"历/O 史/O 的/O 内/O 涵/O 是/O 很/O 丰/O 富/O 的/O /O 经/O 典/O 作/O 家/O 的/O 论/O 断/O 固/O 然/O 有/O 其/O 权/O 威/O 性/O 和/O 合/O 理/O 性/O /O 但/O 历/O 史/O 学/O 家/O 显/O 然/O 不/O 能/O 局/O 限/O 于/O 此/O 。/O \n",
"\n",
"5/O 月/O 3/O 0/O 日/O 在/O 中/B-LOC 国/I-LOC 革/I-LOC 命/I-LOC 军/I-LOC 事/I-LOC 博/I-LOC 物/I-LOC 馆/I-LOC 开/O 幕/O 的/O 全/O 国/O 禁/O 毒/O 展/O 览/O /O 在/O 社/O 会/O 上/O 引/O 起/O 了/O 强/O 烈/O 的/O 反/O 响/O 。/O \n",
"\n",
"另/O 外/O /O 还/O 有/O 一/O 个/O 惊/O 人/O 的/O 发/O 现/O /O 有/O 的/O 发/O 展/O 中/O 国/O 家/O 人/O 均/O 国/O 民/O 资/O 源/O 非/O 常/O 丰/O 富/O /O 但/O 发/O 展/O 不/O 起/O 来/O 的/O 原/O 因/O 在/O 于/O 教/O 育/O 水/O 平/O 太/O 低/O 、/O 对/O 技/O 术/O 的/O 理/O 解/O 和/O 把/O 握/O 太/O 低/O 、/O 管/O 理/O 水/O 平/O 太/O 低/O 等/O 等/O /O 一/O 句/O 话/O /O 智/O 力/O 资/O 本/O 太/O 贫/O 乏/O 。/O \n",
"\n",
"这/O 还/O 要/O 看/O 进/O 一/O 步/O 深/O 入/O 调/O 查/O 的/O 结/O 果/O 。/O \n",
"\n"
]
}
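],
"source": [
"import os\n",
"\n",
"# os.path.join works whether or not path ends with a separator\n",
"token_seqs, label_seqs = load_iob2(os.path.join(path, 'dh_msra.txt'))\n",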
"\n",
"print(len(token_seqs), len(label_seqs))\n",
"print() \n",
"show_iob2(token_seqs, label_seqs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Label descriptions\n",
"\n",
"| Label | Description |\n",
"| ---- | ---- |\n",
"| LOC | Location |\n",
"| ORG | Organization |\n",
"| PER | Person |"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'B-LOC', 'B-ORG', 'B-PER', 'I-LOC', 'I-ORG', 'I-PER', 'O'}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"{label for labels in label_seqs for label in labels}"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}