281 lines
11 KiB
Plaintext
281 lines
11 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# weibo_senti_100k 说明\n",
|
|
"0. **下载地址:** [百度网盘](https://pan.baidu.com/s/1DoQbki3YwqkuwQUOj64R_g)\n",
|
|
"1. **数据概览:** 10 万多条,带情感标注 新浪微博,正负向评论约各 5 万条\n",
|
|
"2. **推荐实验:** 情感/观点/评论 倾向性分析\n",
|
|
"2. **数据来源:** [新浪微博](https://weibo.com/)\n",
|
|
"3. **原数据集:** [新浪微博,情感分析标记语料共12万条](https://download.csdn.net/download/weixin_38442818/10214750),网上搜集,具体作者、来源不详\n",
|
|
"4. **加工处理:**\n",
|
|
" 1. 将原来的 2 份文档,整合成 1 份 csv 文件\n",
|
|
" 2. 编码统一为 UTF-8\n",
|
|
" 3. 去重"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 13,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import pandas as pd"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"path = 'weibo_senti_100k_文件夹_所在_路径'"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# 1. weibo_senti_100k.csv"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 加载数据"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 15,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"评论数目(总体):119988\n",
|
|
"评论数目(正向):59993\n",
|
|
"评论数目(负向):59995\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"pd_all = pd.read_csv(path + 'weibo_senti_100k.csv')\n",
|
|
"\n",
|
|
"print('评论数目(总体):%d' % pd_all.shape[0])\n",
|
|
"print('评论数目(正向):%d' % pd_all[pd_all.label==1].shape[0])\n",
|
|
"print('评论数目(负向):%d' % pd_all[pd_all.label==0].shape[0])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 字段说明\n",
|
|
"\n",
|
|
"| 字段 | 说明 |\n",
|
|
"| ---- | ---- |\n",
|
|
"| label | 1 表示正向评论,0 表示负向评论 |\n",
|
|
"| review | 微博内容 |"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 16,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>label</th>\n",
|
|
" <th>review</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>62050</th>\n",
|
|
" <td>0</td>\n",
|
|
" <td>太过分了@Rexzhenghao //@Janie_Zhang:招行最近负面新闻越来越多呀...</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>68263</th>\n",
|
|
" <td>0</td>\n",
|
|
" <td>希望你?得好?我本"?肥血?史"[晕][哈哈]@Pete三姑父</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>81472</th>\n",
|
|
" <td>0</td>\n",
|
|
" <td>有点想参加????[偷?]想安排下时间再决定[抓狂]//@黑晶晶crystal: @细腿大羽...</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>42021</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>[给力]感谢所有支持雯婕的芝麻![爱你]</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>7777</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>2013最后一天,在新加坡开心度过,向所有的朋友们问声:新年快乐!2014年,我们会更好[调...</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>100399</th>\n",
|
|
" <td>0</td>\n",
|
|
" <td>大中午出门办事找错路,曝晒中。要多杯具有多杯具。[泪][泪][汗]</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>82398</th>\n",
|
|
" <td>0</td>\n",
|
|
" <td>马航还会否认吗?到底在隐瞒啥呢?[抓狂]//@头条新闻: 转发微博</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>106423</th>\n",
|
|
" <td>0</td>\n",
|
|
" <td>克罗地亚球迷很爱放烟火!球又没进,就硝烟四起。[晕]</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>24798</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>[抱抱]福芦 TangRoulou 吉祥书 8.8折优惠 >>> http://t.cn/z...</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>6598</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>回复@钱旭明QXM:[嘻嘻][嘻嘻] //@钱旭明QXM:杨大哥[good][good][g...</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>53920</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>人家这脸长的!!!!!![哈哈]</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>15587</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>这个价不算高,和一天内训相比相差无几。。[哈哈]//@博通传媒v: 6个月!一个月工资1万,...</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>101237</th>\n",
|
|
" <td>0</td>\n",
|
|
" <td>终于收工啦,脚丫子快冻掉了[泪][泪][泪]</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>82449</th>\n",
|
|
" <td>0</td>\n",
|
|
" <td>我决定从今天开始我想吃什么就去吃什么,一个人吃也无所谓,重点是不要因为别人的意见委屈了自己[...</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>32537</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>飘雪的北京 需要双份早餐.......//@美食天下: [哈哈]//@王淼Margay: 屁...</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>10630</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>[耶],这个太赞了,生活大爆炸第六季马上要出啦[鼓掌] //@-郑瑜-:这个不错 //@经典...</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>85130</th>\n",
|
|
" <td>0</td>\n",
|
|
" <td>刚追完#倾世皇妃#,#千山暮雪#又紧随其后,网速和更新速度都太不给力,尽管我看过原著,还是焦...</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>105956</th>\n",
|
|
" <td>0</td>\n",
|
|
" <td>晚上看金二胖?察前?,推出的火炮基座?糟了,可以PK了[泪] //@艾米粒er: //@wi...</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>72391</th>\n",
|
|
" <td>0</td>\n",
|
|
" <td>必须把中国足球的伟大,用我的职业演说出来 //@袁腾飞:[泪]</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>10761</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>[鼓掌] //@宁波香格里拉大酒店: 小编来答疑,周五晚惊艳全场的树根蛋糕到底有多长?蛋糕全...</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" label review\n",
|
|
"62050 0 太过分了@Rexzhenghao //@Janie_Zhang:招行最近负面新闻越来越多呀...\n",
|
|
"68263 0 希望你?得好?我本"?肥血?史"[晕][哈哈]@Pete三姑父\n",
|
|
"81472 0 有点想参加????[偷?]想安排下时间再决定[抓狂]//@黑晶晶crystal: @细腿大羽...\n",
|
|
"42021 1 [给力]感谢所有支持雯婕的芝麻![爱你]\n",
|
|
"7777 1 2013最后一天,在新加坡开心度过,向所有的朋友们问声:新年快乐!2014年,我们会更好[调...\n",
|
|
"100399 0 大中午出门办事找错路,曝晒中。要多杯具有多杯具。[泪][泪][汗]\n",
|
|
"82398 0 马航还会否认吗?到底在隐瞒啥呢?[抓狂]//@头条新闻: 转发微博\n",
|
|
"106423 0 克罗地亚球迷很爱放烟火!球又没进,就硝烟四起。[晕]\n",
|
|
"24798 1 [抱抱]福芦 TangRoulou 吉祥书 8.8折优惠 >>> http://t.cn/z...\n",
|
|
"6598 1 回复@钱旭明QXM:[嘻嘻][嘻嘻] //@钱旭明QXM:杨大哥[good][good][g...\n",
|
|
"53920 1 人家这脸长的!!!!!![哈哈]\n",
|
|
"15587 1 这个价不算高,和一天内训相比相差无几。。[哈哈]//@博通传媒v: 6个月!一个月工资1万,...\n",
|
|
"101237 0 终于收工啦,脚丫子快冻掉了[泪][泪][泪]\n",
|
|
"82449 0 我决定从今天开始我想吃什么就去吃什么,一个人吃也无所谓,重点是不要因为别人的意见委屈了自己[...\n",
|
|
"32537 1 飘雪的北京 需要双份早餐.......//@美食天下: [哈哈]//@王淼Margay: 屁...\n",
|
|
"10630 1 [耶],这个太赞了,生活大爆炸第六季马上要出啦[鼓掌] //@-郑瑜-:这个不错 //@经典...\n",
|
|
"85130 0 刚追完#倾世皇妃#,#千山暮雪#又紧随其后,网速和更新速度都太不给力,尽管我看过原著,还是焦...\n",
|
|
"105956 0 晚上看金二胖?察前?,推出的火炮基座?糟了,可以PK了[泪] //@艾米粒er: //@wi...\n",
|
|
"72391 0 必须把中国足球的伟大,用我的职业演说出来 //@袁腾飞:[泪]\n",
|
|
"10761 1 [鼓掌] //@宁波香格里拉大酒店: 小编来答疑,周五晚惊艳全场的树根蛋糕到底有多长?蛋糕全..."
|
|
]
|
|
},
|
|
"execution_count": 16,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"pd_all.sample(20)"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.4"
|
|
},
|
|
"widgets": {
|
|
"state": {},
|
|
"version": "1.1.2"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|