{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# waimai_10k Description\n",
    "0. **Download:** [Github](https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/waimai_10k/waimai_10k.csv)\n",
    "1. **Data overview:** User reviews collected from a food-delivery platform: 4,000 positive and about 8,000 negative\n",
    "2. **Suggested experiments:** Sentiment / opinion / review polarity analysis\n",
    "3. **Data source:** A food-delivery platform\n",
    "4. **Original dataset:** [Chinese short-text sentiment analysis corpus: food-delivery reviews](https://download.csdn.net/download/cstkl/10236683), collected from the web; the original author and source are unknown\n",
    "5. **Processing:**\n",
    "    1. Merged the original 2 files into 1 file\n",
    "    2. Removed duplicates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = 'path_to_the_waimai_10k_folder'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. waimai_10k.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of reviews (total): 11987\n",
      "Number of reviews (positive): 4000\n",
      "Number of reviews (negative): 7987\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "# os.path.join works whether or not `path` ends with a separator\n",
    "pd_all = pd.read_csv(os.path.join(path, 'waimai_10k.csv'))\n",
    "\n",
    "print('Number of reviews (total): %d' % pd_all.shape[0])\n",
    "print('Number of reviews (positive): %d' % pd_all[pd_all.label==1].shape[0])\n",
    "print('Number of reviews (negative): %d' % pd_all[pd_all.label==0].shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| label | 1 = positive review, 0 = negative review |\n",
    "| review | Review text |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       " .dataframe tbody tr th:only-of-type {\n",
       " vertical-align: middle;\n",
       " }\n",
       "\n",
       " .dataframe tbody tr th {\n",
       " vertical-align: top;\n",
       " }\n",
       "\n",
       " .dataframe thead th {\n",
       " text-align: right;\n",
       " }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       " <thead>\n",
       " <tr style=\"text-align: right;\">\n",
       " <th></th>\n",
       " <th>label</th>\n",
       " <th>review</th>\n",
       " </tr>\n",
       " </thead>\n",
       " <tbody>\n",
       " <tr>\n",
       " <th>25</th>\n",
       " <td>1</td>\n",
       " <td>送餐特别快,态度也好,辛苦啦</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>6632</th>\n",
       " <td>0</td>\n",
       " <td>点了热带雨林披萨+饮料,和BBQ鸡肉披萨+饮料,送来的是两个奥尔良披萨+两个银耳冰粥,冰凉冰...</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>8849</th>\n",
       " <td>0</td>\n",
       " <td>难吃!!!油死了,味道烂</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>11114</th>\n",
       " <td>0</td>\n",
       " <td>今天菜太咸,连着定了3天吃,一天比一天难吃。</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>11661</th>\n",
       " <td>0</td>\n",
       " <td>送的太慢了,菜都凉了。</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>9571</th>\n",
       " <td>0</td>\n",
       " <td>没有满减!</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>10614</th>\n",
       " <td>0</td>\n",
       " <td>差评!定的时间是12点一刻,结果刚11点就送来了!果断退单。送餐前不看时间吗?</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>7585</th>\n",
       " <td>0</td>\n",
       " <td>羊肉串太咸,还有些不新鲜。鸡心和鸡胗烤的太老</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>6919</th>\n",
       " <td>0</td>\n",
       " <td>快递员挺好,速度挺快</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>3192</th>\n",
       " <td>1</td>\n",
       " <td>小炒肉卷饼好辣~</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>10224</th>\n",
       " <td>0</td>\n",
       " <td>送来的时候都凉了,味道一般,鲜果西米露就两口的量,鲜果就是一块西瓜一个西瓜籽</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>7295</th>\n",
       " <td>0</td>\n",
       " <td>没放糖,没放奶油,好难喝</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>275</th>\n",
       " <td>1</td>\n",
       " <td>他家的奶茶超级好喝。。。</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>8378</th>\n",
       " <td>0</td>\n",
       " <td>黑椒牛柳饭送成大排饭</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>5879</th>\n",
       " <td>0</td>\n",
       " <td>一个半小时,可以</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>7523</th>\n",
       " <td>0</td>\n",
       " <td>订单满减后应该是24,送过来要收我原价39?你搞笑呐,还少听加多宝!我管你什么美食送的还是你...</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>6590</th>\n",
       " <td>0</td>\n",
       " <td>真心也忒慢了,其他都还成</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>1703</th>\n",
       " <td>1</td>\n",
       " <td>非常划算,很好</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>5345</th>\n",
       " <td>0</td>\n",
       " <td>首选是得吐槽一下这家的速度,一个半小时起,然后卷饼包装很不错,酱香鸡肉的比较赞,飘香肘子一般...</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>1674</th>\n",
       " <td>1</td>\n",
       " <td>离我们远点55分钟送到的,可以理解,饼和粥都不错</td>\n",
       " </tr>\n",
       " </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       " label review\n",
       "25 1 送餐特别快,态度也好,辛苦啦\n",
       "6632 0 点了热带雨林披萨+饮料,和BBQ鸡肉披萨+饮料,送来的是两个奥尔良披萨+两个银耳冰粥,冰凉冰...\n",
       "8849 0 难吃!!!油死了,味道烂\n",
       "11114 0 今天菜太咸,连着定了3天吃,一天比一天难吃。\n",
       "11661 0 送的太慢了,菜都凉了。\n",
       "9571 0 没有满减!\n",
       "10614 0 差评!定的时间是12点一刻,结果刚11点就送来了!果断退单。送餐前不看时间吗?\n",
       "7585 0 羊肉串太咸,还有些不新鲜。鸡心和鸡胗烤的太老\n",
       "6919 0 快递员挺好,速度挺快\n",
       "3192 1 小炒肉卷饼好辣~\n",
       "10224 0 送来的时候都凉了,味道一般,鲜果西米露就两口的量,鲜果就是一块西瓜一个西瓜籽\n",
       "7295 0 没放糖,没放奶油,好难喝\n",
       "275 1 他家的奶茶超级好喝。。。\n",
       "8378 0 黑椒牛柳饭送成大排饭\n",
       "5879 0 一个半小时,可以\n",
       "7523 0 订单满减后应该是24,送过来要收我原价39?你搞笑呐,还少听加多宝!我管你什么美食送的还是你...\n",
       "6590 0 真心也忒慢了,其他都还成\n",
       "1703 1 非常划算,很好\n",
       "5345 0 首选是得吐槽一下这家的速度,一个半小时起,然后卷饼包装很不错,酱香鸡肉的比较赞,飘香肘子一般...\n",
       "1674 1 离我们远点55分钟送到的,可以理解,饼和粥都不错"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd_all.sample(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Build a balanced corpus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd_positive = pd_all[pd_all.label==1]\n",
    "pd_negative = pd_all[pd_all.label==0]\n",
    "\n",
    "def get_balance_corpus(corpus_size, corpus_pos, corpus_neg):\n",
    "    # Draw corpus_size // 2 reviews from each class; sample with replacement\n",
    "    # only when a class has fewer rows than the requested sample size\n",
    "    sample_size = corpus_size // 2\n",
    "    pd_corpus_balance = pd.concat([corpus_pos.sample(sample_size, replace=corpus_pos.shape[0]<sample_size),\n",
    "                                   corpus_neg.sample(sample_size, replace=corpus_neg.shape[0]<sample_size)])\n",
    "\n",
    "    print('Number of reviews (total): %d' % pd_corpus_balance.shape[0])\n",
    "    print('Number of reviews (positive): %d' % pd_corpus_balance[pd_corpus_balance.label==1].shape[0])\n",
    "    print('Number of reviews (negative): %d' % pd_corpus_balance[pd_corpus_balance.label==0].shape[0])\n",
    "\n",
    "    return pd_corpus_balance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of reviews (total): 4000\n",
      "Number of reviews (positive): 2000\n",
      "Number of reviews (negative): 2000\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       " .dataframe tbody tr th:only-of-type {\n",
       " vertical-align: middle;\n",
       " }\n",
       "\n",
       " .dataframe tbody tr th {\n",
       " vertical-align: top;\n",
       " }\n",
       "\n",
       " .dataframe thead th {\n",
       " text-align: right;\n",
       " }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       " <thead>\n",
       " <tr style=\"text-align: right;\">\n",
       " <th></th>\n",
       " <th>label</th>\n",
       " <th>review</th>\n",
       " </tr>\n",
       " </thead>\n",
       " <tbody>\n",
       " <tr>\n",
       " <th>10436</th>\n",
       " <td>0</td>\n",
       " <td>难吃~石锅拌饭居然没酱~而且刚好晚了29分钟</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>10468</th>\n",
       " <td>0</td>\n",
       " <td>等了很久,没关系,毕竟还在约定时间内,可是最让我忍不了的是真的很一般,个人口味吧,反正不和我...</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>1643</th>\n",
       " <td>1</td>\n",
       " <td>嗯,纸袋比较高大上</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>8723</th>\n",
       " <td>0</td>\n",
       " <td>海参怎么是生的,没法吃,郁闷</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>2431</th>\n",
       " <td>1</td>\n",
       " <td>送餐很快,送餐人员很热情!~</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>5121</th>\n",
       " <td>0</td>\n",
       " <td>不如以前好吃,肘子都有味儿了!哎!</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>10565</th>\n",
       " <td>0</td>\n",
       " <td>东西有些小贵。</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>2413</th>\n",
       " <td>1</td>\n",
       " <td>虽然时间长了些但是很准时。下次记得给些番茄酱就更好了。,一个人吃足够了。好好吃</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>11937</th>\n",
       " <td>0</td>\n",
       " <td>11点以前就定的餐,做了1小时48分钟,呵呵,我只想说:拜拜!!!</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>1024</th>\n",
       " <td>1</td>\n",
       " <td>很好吃,面皮特别有嚼劲儿,酱料也很好吃</td>\n",
       " </tr>\n",
       " </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       " label review\n",
       "10436 0 难吃~石锅拌饭居然没酱~而且刚好晚了29分钟\n",
       "10468 0 等了很久,没关系,毕竟还在约定时间内,可是最让我忍不了的是真的很一般,个人口味吧,反正不和我...\n",
       "1643 1 嗯,纸袋比较高大上\n",
       "8723 0 海参怎么是生的,没法吃,郁闷\n",
       "2431 1 送餐很快,送餐人员很热情!~\n",
       "5121 0 不如以前好吃,肘子都有味儿了!哎!\n",
       "10565 0 东西有些小贵。\n",
       "2413 1 虽然时间长了些但是很准时。下次记得给些番茄酱就更好了。,一个人吃足够了。好好吃\n",
       "11937 0 11点以前就定的餐,做了1小时48分钟,呵呵,我只想说:拜拜!!!\n",
       "1024 1 很好吃,面皮特别有嚼劲儿,酱料也很好吃"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "waimai_10k_ba_4000 = get_balance_corpus(4000, pd_positive, pd_negative)\n",
    "\n",
    "waimai_10k_ba_4000.sample(10)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}