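{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# About weibo_senti_100k\n", "0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1DoQbki3YwqkuwQUOj64R_g)\n", "1. **Data overview:** More than 100,000 sentiment-labeled Sina Weibo posts (119,988 rows in this file), with about 60,000 positive and about 60,000 negative\n", "2. **Recommended experiments:** Sentiment / opinion / review polarity analysis\n", "3. **Data source:** [Sina Weibo](https://weibo.com/)\n", "4. **Original dataset:** [Sina Weibo sentiment-analysis labeled corpus, 120,000 entries in total](https://download.csdn.net/download/weixin_38442818/10214750), collected from the web; the specific author and origin are unknown\n", "5. **Processing:**\n", "    1. Merged the original 2 documents into 1 csv file\n", "    2. Unified the encoding to UTF-8\n", "    3. Removed duplicates" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Placeholder: folder that contains weibo_senti_100k.csv.\n", "# Keep the trailing slash, because the file name is appended by string concatenation below.\n", "path = 'path_to_weibo_senti_100k_folder/'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. weibo_senti_100k.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the data" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of reviews (total): 119988\n", "Number of reviews (positive): 59993\n", "Number of reviews (negative): 59995\n" ] } ], "source": [ "# Read the merged, UTF-8 encoded csv file\n", "pd_all = pd.read_csv(path + 'weibo_senti_100k.csv')\n", "\n", "# Row counts overall and per label (1 = positive, 0 = negative)\n", "print('Number of reviews (total): %d' % pd_all.shape[0])\n", "print('Number of reviews (positive): %d' % pd_all[pd_all.label==1].shape[0])\n", "print('Number of reviews (negative): %d' % pd_all[pd_all.label==0].shape[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Field descriptions\n", "\n", "| Field | Description |\n", "| ---- | ---- |\n", "| label | 1 = positive review, 0 = negative review |\n", "| review | Weibo post content |" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "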
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>label</th>\n",
"      <th>review</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n", "      <th>62050</th>\n", "      <td>0</td>\n", "      <td>太过分了@Rexzhenghao //@Janie_Zhang:招行最近负面新闻越来越多呀...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>68263</th>\n", "      <td>0</td>\n", "      <td>希望你?得好?我本\"?肥血?史\"[晕][哈哈]@Pete三姑父</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>81472</th>\n", "      <td>0</td>\n", "      <td>有点想参加????[偷?]想安排下时间再决定[抓狂]//@黑晶晶crystal: @细腿大羽...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>42021</th>\n", "      <td>1</td>\n", "      <td>[给力]感谢所有支持雯婕的芝麻![爱你]</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>7777</th>\n", "      <td>1</td>\n", "      <td>2013最后一天,在新加坡开心度过,向所有的朋友们问声:新年快乐!2014年,我们会更好[调...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>100399</th>\n", "      <td>0</td>\n", "      <td>大中午出门办事找错路,曝晒中。要多杯具有多杯具。[泪][泪][汗]</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>82398</th>\n", "      <td>0</td>\n", "      <td>马航还会否认吗?到底在隐瞒啥呢?[抓狂]//@头条新闻: 转发微博</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>106423</th>\n", "      <td>0</td>\n", "      <td>克罗地亚球迷很爱放烟火!球又没进,就硝烟四起。[晕]</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>24798</th>\n", "      <td>1</td>\n", "      <td>[抱抱]福芦 TangRoulou 吉祥书 8.8折优惠 &gt;&gt;&gt; http://t.cn/z...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>6598</th>\n", "      <td>1</td>\n", "      <td>回复@钱旭明QXM:[嘻嘻][嘻嘻] //@钱旭明QXM:杨大哥[good][good][g...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>53920</th>\n", "      <td>1</td>\n", "      <td>人家这脸长的!!!!!![哈哈]</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>15587</th>\n", "      <td>1</td>\n", "      <td>这个价不算高,和一天内训相比相差无几。。[哈哈]//@博通传媒v: 6个月!一个月工资1万,...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>101237</th>\n", "      <td>0</td>\n", "      <td>终于收工啦,脚丫子快冻掉了[泪][泪][泪]</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>82449</th>\n", "      <td>0</td>\n", "      <td>我决定从今天开始我想吃什么就去吃什么,一个人吃也无所谓,重点是不要因为别人的意见委屈了自己[...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>32537</th>\n", "      <td>1</td>\n", "      <td>飘雪的北京 需要双份早餐.......//@美食天下: [哈哈]//@王淼Margay: 屁...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>10630</th>\n", "      <td>1</td>\n", "      <td>[耶],这个太赞了,生活大爆炸第六季马上要出啦[鼓掌] //@-郑瑜-:这个不错 //@经典...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>85130</th>\n", "      <td>0</td>\n", "      <td>刚追完#倾世皇妃#,#千山暮雪#又紧随其后,网速和更新速度都太不给力,尽管我看过原著,还是焦...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>105956</th>\n", "      <td>0</td>\n", "      <td>晚上看金二胖?察前?,推出的火炮基座?糟了,可以PK了[泪] //@艾米粒er: //@wi...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>72391</th>\n", "      <td>0</td>\n", "      <td>必须把中国足球的伟大,用我的职业演说出来 //@袁腾飞:[泪]</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>10761</th>\n", "      <td>1</td>\n", "      <td>[鼓掌] //@宁波香格里拉大酒店: 小编来答疑,周五晚惊艳全场的树根蛋糕到底有多长?蛋糕全...</td>\n", "    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " label review\n", "62050 0 太过分了@Rexzhenghao //@Janie_Zhang:招行最近负面新闻越来越多呀...\n", "68263 0 希望你?得好?我本"?肥血?史"[晕][哈哈]@Pete三姑父\n", "81472 0 有点想参加????[偷?]想安排下时间再决定[抓狂]//@黑晶晶crystal: @细腿大羽...\n", "42021 1 [给力]感谢所有支持雯婕的芝麻![爱你]\n", "7777 1 2013最后一天,在新加坡开心度过,向所有的朋友们问声:新年快乐!2014年,我们会更好[调...\n", "100399 0 大中午出门办事找错路,曝晒中。要多杯具有多杯具。[泪][泪][汗]\n", "82398 0 马航还会否认吗?到底在隐瞒啥呢?[抓狂]//@头条新闻: 转发微博\n", "106423 0 克罗地亚球迷很爱放烟火!球又没进,就硝烟四起。[晕]\n", "24798 1 [抱抱]福芦 TangRoulou 吉祥书 8.8折优惠 >>> http://t.cn/z...\n", "6598 1 回复@钱旭明QXM:[嘻嘻][嘻嘻] //@钱旭明QXM:杨大哥[good][good][g...\n", "53920 1 人家这脸长的!!!!!![哈哈]\n", "15587 1 这个价不算高,和一天内训相比相差无几。。[哈哈]//@博通传媒v: 6个月!一个月工资1万,...\n", "101237 0 终于收工啦,脚丫子快冻掉了[泪][泪][泪]\n", "82449 0 我决定从今天开始我想吃什么就去吃什么,一个人吃也无所谓,重点是不要因为别人的意见委屈了自己[...\n", "32537 1 飘雪的北京 需要双份早餐.......//@美食天下: [哈哈]//@王淼Margay: 屁...\n", "10630 1 [耶],这个太赞了,生活大爆炸第六季马上要出啦[鼓掌] //@-郑瑜-:这个不错 //@经典...\n", "85130 0 刚追完#倾世皇妃#,#千山暮雪#又紧随其后,网速和更新速度都太不给力,尽管我看过原著,还是焦...\n", "105956 0 晚上看金二胖?察前?,推出的火炮基座?糟了,可以PK了[泪] //@艾米粒er: //@wi...\n", "72391 0 必须把中国足球的伟大,用我的职业演说出来 //@袁腾飞:[泪]\n", "10761 1 [鼓掌] //@宁波香格里拉大酒店: 小编来答疑,周五晚惊艳全场的树根蛋糕到底有多长?蛋糕全..." ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd_all.sample(20)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 2 }