{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"# dmsc_v2 Overview\n",
"0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1c0yn3TlkzHYTdEBz3T5arA)\n",
"1. **Data overview:** 28 movies, more than 700,000 users (738,701), and more than 2 million ratings/comments (2,125,056)\n",
"2. **Suggested experiments:** recommender systems; sentiment/opinion/review polarity analysis\n",
"3. **Data source:** [Douban Movie](https://movie.douban.com/)\n",
"4. **Original dataset:** [Douban Movie Short Comments Dataset V2](https://www.kaggle.com/utmhikari/doubanmovieshortcomments)\n",
"5. **Processing:**\n",
"    1. Deduplicated and reorganized into a [MovieLens](https://grouplens.org/datasets/movielens/)-compatible format\n",
"    2. Anonymized to protect user privacy"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import pandas as pd"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
"path = 'path/to/dmsc_folder/'  # directory containing movies.csv and ratings.csv; keep the trailing slash"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# 1. movies.csv"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"## Load the data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
"Number of movies: 28\n"
]
}
],
"source": [
"movies = pd.read_csv(path + 'movies.csv')\n",
"\n",
"print('Number of movies: %d' % movies.shape[0])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| movieId | Movie ID (numbered consecutively from 0) |\n",
"| title | English title |\n",
"| title_cn | Chinese title |"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style>\n",
|
|
" .dataframe thead tr:only-child th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: left;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>movieId</th>\n",
|
|
" <th>title</th>\n",
|
|
" <th>title_cn</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>0</td>\n",
|
|
" <td>Avengers Age of Ultron</td>\n",
|
|
" <td>复仇者联盟2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Big Fish and Begonia</td>\n",
|
|
" <td>大鱼海棠</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>2</td>\n",
|
|
" <td>Captain America Civil War</td>\n",
|
|
" <td>美国队长3</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Chinese Zodiac</td>\n",
|
|
" <td>十二生肖</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>4</td>\n",
|
|
" <td>Chronicles of the Ghostly Tribe</td>\n",
|
|
" <td>九层妖塔</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>5</th>\n",
|
|
" <td>5</td>\n",
|
|
" <td>CUG King of Heroes</td>\n",
|
|
" <td>大圣归来</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>6</th>\n",
|
|
" <td>6</td>\n",
|
|
" <td>Forever Young</td>\n",
|
|
" <td>栀子花开</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>7</th>\n",
|
|
" <td>7</td>\n",
|
|
" <td>Goodbye Mr. Loser</td>\n",
|
|
" <td>夏洛特烦恼</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>8</th>\n",
|
|
" <td>8</td>\n",
|
|
" <td>Iron Man</td>\n",
|
|
" <td>钢铁侠1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>9</th>\n",
|
|
" <td>9</td>\n",
|
|
" <td>Journey to the West Conquering the Demons</td>\n",
|
|
" <td>西游降魔篇</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>10</th>\n",
|
|
" <td>10</td>\n",
|
|
" <td>Journey to the West The Demons Strike Back</td>\n",
|
|
" <td>西游伏妖篇</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>11</th>\n",
|
|
" <td>11</td>\n",
|
|
" <td>La La Land</td>\n",
|
|
" <td>爱乐之城</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>12</th>\n",
|
|
" <td>12</td>\n",
|
|
" <td>Lost In Thailand</td>\n",
|
|
" <td>泰囧</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>13</th>\n",
|
|
" <td>13</td>\n",
|
|
" <td>My Sunshine</td>\n",
|
|
" <td>何以笙箫默</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>14</th>\n",
|
|
" <td>14</td>\n",
|
|
" <td>Operation Mekong</td>\n",
|
|
" <td>湄公河行动</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>15</th>\n",
|
|
" <td>15</td>\n",
|
|
" <td>Soulmate</td>\n",
|
|
" <td>七月与安生</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>16</th>\n",
|
|
" <td>16</td>\n",
|
|
" <td>The Avengers</td>\n",
|
|
" <td>复仇者联盟</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>17</th>\n",
|
|
" <td>17</td>\n",
|
|
" <td>The Continent</td>\n",
|
|
" <td>后会无期</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>18</th>\n",
|
|
" <td>18</td>\n",
|
|
" <td>The Ghouls</td>\n",
|
|
" <td>寻龙诀</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>19</th>\n",
|
|
" <td>19</td>\n",
|
|
" <td>The Great Wall</td>\n",
|
|
" <td>长城</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>20</th>\n",
|
|
" <td>20</td>\n",
|
|
" <td>The Left Ear</td>\n",
|
|
" <td>左耳</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>21</th>\n",
|
|
" <td>21</td>\n",
|
|
" <td>The Mermaid</td>\n",
|
|
" <td>美人鱼</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>22</th>\n",
|
|
" <td>22</td>\n",
|
|
" <td>Tiny Times 1.0</td>\n",
|
|
" <td>小时代1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>23</th>\n",
|
|
" <td>23</td>\n",
|
|
" <td>Tiny Times 3.0</td>\n",
|
|
" <td>小时代3</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>24</th>\n",
|
|
" <td>24</td>\n",
|
|
" <td>Train to Busan</td>\n",
|
|
" <td>釜山行</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>25</th>\n",
|
|
" <td>25</td>\n",
|
|
" <td>Transformers Age of Extinction</td>\n",
|
|
" <td>变形金刚4</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>26</th>\n",
|
|
" <td>26</td>\n",
|
|
" <td>Your Name</td>\n",
|
|
" <td>你的名字</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>27</th>\n",
|
|
" <td>27</td>\n",
|
|
" <td>Zootopia</td>\n",
|
|
" <td>疯狂动物城</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" movieId title title_cn\n",
|
|
"0 0 Avengers Age of Ultron 复仇者联盟2\n",
|
|
"1 1 Big Fish and Begonia 大鱼海棠\n",
|
|
"2 2 Captain America Civil War 美国队长3\n",
|
|
"3 3 Chinese Zodiac 十二生肖\n",
|
|
"4 4 Chronicles of the Ghostly Tribe 九层妖塔\n",
|
|
"5 5 CUG King of Heroes 大圣归来\n",
|
|
"6 6 Forever Young 栀子花开\n",
|
|
"7 7 Goodbye Mr. Loser 夏洛特烦恼\n",
|
|
"8 8 Iron Man 钢铁侠1\n",
|
|
"9 9 Journey to the West Conquering the Demons 西游降魔篇\n",
|
|
"10 10 Journey to the West The Demons Strike Back 西游伏妖篇\n",
|
|
"11 11 La La Land 爱乐之城\n",
|
|
"12 12 Lost In Thailand 泰囧\n",
|
|
"13 13 My Sunshine 何以笙箫默\n",
|
|
"14 14 Operation Mekong 湄公河行动\n",
|
|
"15 15 Soulmate 七月与安生\n",
|
|
"16 16 The Avengers 复仇者联盟\n",
|
|
"17 17 The Continent 后会无期\n",
|
|
"18 18 The Ghouls 寻龙诀\n",
|
|
"19 19 The Great Wall 长城\n",
|
|
"20 20 The Left Ear 左耳\n",
|
|
"21 21 The Mermaid 美人鱼\n",
|
|
"22 22 Tiny Times 1.0 小时代1\n",
|
|
"23 23 Tiny Times 3.0 小时代3\n",
|
|
"24 24 Train to Busan 釜山行\n",
|
|
"25 25 Transformers Age of Extinction 变形金刚4\n",
|
|
"26 26 Your Name 你的名字\n",
|
|
"27 27 Zootopia 疯狂动物城"
|
|
]
|
|
},
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"movies"
|
|
]
|
|
},
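{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sanity check, added as a sketch rather than as part of the original notebook: assuming `movies` was loaded as above, the cell below verifies the claim in the field table that `movieId` is numbered consecutively from 0. The name `expected_ids` is illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: movieId should be exactly 0, 1, ..., len(movies) - 1\n",
"expected_ids = list(range(len(movies)))\n",
"assert sorted(movies.movieId.tolist()) == expected_ids, 'movieId is not contiguous from 0'"
]
},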
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# 2. ratings.csv"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"## Load the data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
"Number of users: 738701\n",
"Number of ratings: 2125056\n"
]
}
],
"source": [
"ratings = pd.read_csv(path + 'ratings.csv')\n",
"\n",
"print('Number of users: %d' % ratings.userId.unique().shape[0])\n",
"print('Number of ratings: %d' % ratings.shape[0])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"## Field descriptions"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"| Field | Description |\n",
"| ---- | ---- |\n",
"| userId | User ID (numbered consecutively from 0) |\n",
"| movieId | The `movieId` from movies.csv |\n",
"| rating | Rating, an integer in [1, 5] |\n",
"| timestamp | Time of the rating, as a Unix timestamp in seconds |\n",
"| comment | Comment text |\n",
"| like | Number of likes the comment received |"
|
|
]
|
|
},
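{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fields above can be combined with `movies.csv` in a few lines. The cell below is a hedged sketch, not part of the original notebook: it assumes `movies` and `ratings` are loaded as above and that `timestamp` holds Unix time in seconds, which the sampled values are consistent with. The names `ratings_with_titles` and `rated_at` are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: attach movie titles to each rating via the shared movieId key\n",
"ratings_with_titles = ratings.merge(movies, on='movieId', how='left')\n",
"\n",
"# Assumption: timestamp is Unix time in seconds\n",
"ratings_with_titles['rated_at'] = pd.to_datetime(ratings_with_titles['timestamp'], unit='s')\n",
"\n",
"ratings_with_titles[['userId', 'title', 'rating', 'rated_at']].head()"
]
},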
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"metadata": {
|
|
"scrolled": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style>\n",
|
|
" .dataframe thead tr:only-child th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: left;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>userId</th>\n",
|
|
" <th>movieId</th>\n",
|
|
" <th>rating</th>\n",
|
|
" <th>timestamp</th>\n",
|
|
" <th>comment</th>\n",
|
|
" <th>like</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>1763779</th>\n",
|
|
" <td>130888</td>\n",
|
|
" <td>24</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1474560000</td>\n",
|
|
" <td>原著的剧本不是这样的,而是最后只有那个自私鬼活了下来。孕妇中枪,小孩中枪的时候哭出了声音,...</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1608147</th>\n",
|
|
" <td>23695</td>\n",
|
|
" <td>22</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>1377360000</td>\n",
|
|
" <td>郭敬明真的要为中国产生如此大规模的青少年脑残群体负一定责任 = =</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1735498</th>\n",
|
|
" <td>323858</td>\n",
|
|
" <td>24</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>1473696000</td>\n",
|
|
" <td>三分不能再多。其中一分给壮汉大叔,帅过男主。</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1631095</th>\n",
|
|
" <td>218188</td>\n",
|
|
" <td>22</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>1372953600</td>\n",
|
|
" <td>柯震东露点 给三星 后面的彩蛋很欢乐</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1193163</th>\n",
|
|
" <td>155900</td>\n",
|
|
" <td>17</td>\n",
|
|
" <td>4</td>\n",
|
|
" <td>1406390400</td>\n",
|
|
" <td>给四星不是因为电影有那么好,文艺腔调有,公路片元素够,但好看程度其实低于预期,但是因为是韩...</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1874658</th>\n",
|
|
" <td>8534</td>\n",
|
|
" <td>26</td>\n",
|
|
" <td>4</td>\n",
|
|
" <td>1480780800</td>\n",
|
|
" <td>身体互换和改变未来都是老梗了,算是半新不旧的瓶装了个旧酒吧,不过倒是不错,意外的好看,伏笔...</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>645671</th>\n",
|
|
" <td>312247</td>\n",
|
|
" <td>9</td>\n",
|
|
" <td>4</td>\n",
|
|
" <td>1476979200</td>\n",
|
|
" <td>念念不忘,必有回响…</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1681543</th>\n",
|
|
" <td>284941</td>\n",
|
|
" <td>23</td>\n",
|
|
" <td>4</td>\n",
|
|
" <td>1409673600</td>\n",
|
|
" <td>看到她们在雪地的那段,居然很感动</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1042238</th>\n",
|
|
" <td>100689</td>\n",
|
|
" <td>15</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1474214400</td>\n",
|
|
" <td>以前看安妮宝贝时期....最喜欢的小说之一</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1672379</th>\n",
|
|
" <td>139726</td>\n",
|
|
" <td>23</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>1406736000</td>\n",
|
|
" <td>郭小四不是标榜自己时尚品味吗?四个女主一个镜头换一身皮草哪来的品味啊??(客观的说,叙事增...</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1823549</th>\n",
|
|
" <td>447412</td>\n",
|
|
" <td>25</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>1405958400</td>\n",
|
|
" <td>擎天柱胸前蓝色的部分装着生命所需的能量和他的记忆。这让我更加坚信一些东西,只是然后的然后我...</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1112590</th>\n",
|
|
" <td>495975</td>\n",
|
|
" <td>16</td>\n",
|
|
" <td>4</td>\n",
|
|
" <td>1336838400</td>\n",
|
|
" <td>浩克抖包袱……</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>210239</th>\n",
|
|
" <td>123095</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>4</td>\n",
|
|
" <td>1390320000</td>\n",
|
|
" <td>轻松愉快,打斗设置还不错</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2093623</th>\n",
|
|
" <td>232598</td>\n",
|
|
" <td>27</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1474560000</td>\n",
|
|
" <td>比之前大热的冰雪奇缘好太多,一部全家人都可以坐在一起看的电影。</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>583777</th>\n",
|
|
" <td>322422</td>\n",
|
|
" <td>8</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1301500800</td>\n",
|
|
" <td>的确比蜘蛛侠超人什么什么的好看</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1914937</th>\n",
|
|
" <td>75819</td>\n",
|
|
" <td>26</td>\n",
|
|
" <td>4</td>\n",
|
|
" <td>1473955200</td>\n",
|
|
" <td>真的棒。但是我自己还是不那么喜欢。</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1211561</th>\n",
|
|
" <td>514748</td>\n",
|
|
" <td>17</td>\n",
|
|
" <td>4</td>\n",
|
|
" <td>1407427200</td>\n",
|
|
" <td>。</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1965672</th>\n",
|
|
" <td>704638</td>\n",
|
|
" <td>26</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1480953600</td>\n",
|
|
" <td>陪朋友去看的,本身我是拒绝这类小清新的电影的,而且在刚开始的时候说实话没怎么看懂,不过看到...</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1935211</th>\n",
|
|
" <td>259717</td>\n",
|
|
" <td>26</td>\n",
|
|
" <td>4</td>\n",
|
|
" <td>1480694400</td>\n",
|
|
" <td>时间与空间错乱里的爱情 温暖又幽默</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>839108</th>\n",
|
|
" <td>426801</td>\n",
|
|
" <td>11</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1486742400</td>\n",
|
|
" <td>Here is to the ones who dream.</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" userId movieId rating timestamp \\\n",
|
|
"1763779 130888 24 5 1474560000 \n",
|
|
"1608147 23695 22 2 1377360000 \n",
|
|
"1735498 323858 24 3 1473696000 \n",
|
|
"1631095 218188 22 3 1372953600 \n",
|
|
"1193163 155900 17 4 1406390400 \n",
|
|
"1874658 8534 26 4 1480780800 \n",
|
|
"645671 312247 9 4 1476979200 \n",
|
|
"1681543 284941 23 4 1409673600 \n",
|
|
"1042238 100689 15 5 1474214400 \n",
|
|
"1672379 139726 23 2 1406736000 \n",
|
|
"1823549 447412 25 2 1405958400 \n",
|
|
"1112590 495975 16 4 1336838400 \n",
|
|
"210239 123095 3 4 1390320000 \n",
|
|
"2093623 232598 27 5 1474560000 \n",
|
|
"583777 322422 8 5 1301500800 \n",
|
|
"1914937 75819 26 4 1473955200 \n",
|
|
"1211561 514748 17 4 1407427200 \n",
|
|
"1965672 704638 26 5 1480953600 \n",
|
|
"1935211 259717 26 4 1480694400 \n",
|
|
"839108 426801 11 5 1486742400 \n",
|
|
"\n",
|
|
" comment like \n",
|
|
"1763779 原著的剧本不是这样的,而是最后只有那个自私鬼活了下来。孕妇中枪,小孩中枪的时候哭出了声音,... 1 \n",
|
|
"1608147 郭敬明真的要为中国产生如此大规模的青少年脑残群体负一定责任 = = 0 \n",
|
|
"1735498 三分不能再多。其中一分给壮汉大叔,帅过男主。 0 \n",
|
|
"1631095 柯震东露点 给三星 后面的彩蛋很欢乐 0 \n",
|
|
"1193163 给四星不是因为电影有那么好,文艺腔调有,公路片元素够,但好看程度其实低于预期,但是因为是韩... 0 \n",
|
|
"1874658 身体互换和改变未来都是老梗了,算是半新不旧的瓶装了个旧酒吧,不过倒是不错,意外的好看,伏笔... 1 \n",
|
|
"645671 念念不忘,必有回响… 0 \n",
|
|
"1681543 看到她们在雪地的那段,居然很感动 0 \n",
|
|
"1042238 以前看安妮宝贝时期....最喜欢的小说之一 0 \n",
|
|
"1672379 郭小四不是标榜自己时尚品味吗?四个女主一个镜头换一身皮草哪来的品味啊??(客观的说,叙事增... 0 \n",
|
|
"1823549 擎天柱胸前蓝色的部分装着生命所需的能量和他的记忆。这让我更加坚信一些东西,只是然后的然后我... 0 \n",
|
|
"1112590 浩克抖包袱…… 0 \n",
|
|
"210239 轻松愉快,打斗设置还不错 0 \n",
|
|
"2093623 比之前大热的冰雪奇缘好太多,一部全家人都可以坐在一起看的电影。 0 \n",
|
|
"583777 的确比蜘蛛侠超人什么什么的好看 0 \n",
|
|
"1914937 真的棒。但是我自己还是不那么喜欢。 0 \n",
|
|
"1211561 。 0 \n",
|
|
"1965672 陪朋友去看的,本身我是拒绝这类小清新的电影的,而且在刚开始的时候说实话没怎么看懂,不过看到... 0 \n",
|
|
"1935211 时间与空间错乱里的爱情 温暖又幽默 0 \n",
|
|
"839108 Here is to the ones who dream. 0 "
|
|
]
|
|
},
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"ratings.sample(20)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"# 3. Use for sentiment/opinion/review polarity analysis"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"## Select comments with clear polarity (1-star and 5-star ratings)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"正向(5星)数目:638106\n",
|
|
"负向(1星)数目:190927\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style>\n",
|
|
" .dataframe thead tr:only-child th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: left;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>userId</th>\n",
|
|
" <th>movieId</th>\n",
|
|
" <th>rating</th>\n",
|
|
" <th>timestamp</th>\n",
|
|
" <th>comment</th>\n",
|
|
" <th>like</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>405540</th>\n",
|
|
" <td>251302</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1436976000</td>\n",
|
|
" <td>路人转自来水!大圣帅气!我要生猴子~~~^-^</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>159308</th>\n",
|
|
" <td>18639</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1462636800</td>\n",
|
|
" <td>冬兵从醒了以后就应该要求被冻起来,美队这个人烂的真要命。心疼tony。</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1329674</th>\n",
|
|
" <td>127217</td>\n",
|
|
" <td>18</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1451059200</td>\n",
|
|
" <td>超级棒!远远超出预期 免费水军来了哈哈哈哈</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1945766</th>\n",
|
|
" <td>75720</td>\n",
|
|
" <td>26</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1476460800</td>\n",
|
|
" <td>为爱而动</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1706244</th>\n",
|
|
" <td>29721</td>\n",
|
|
" <td>23</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1406131200</td>\n",
|
|
" <td>看小时代3的时候真是太壮观了整个场子那个乱啊打电话的聊天的中途上厕所的没办法大家提不起兴趣...</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1271715</th>\n",
|
|
" <td>546029</td>\n",
|
|
" <td>17</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1406217600</td>\n",
|
|
" <td>可以给零分么</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>394698</th>\n",
|
|
" <td>243184</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1437926400</td>\n",
|
|
" <td>一直听网友说好,今天去电影院看了下。真的不错,是中国动漫的一个值得一看的作品。太多的喜羊羊...</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>324077</th>\n",
|
|
" <td>208900</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1437062400</td>\n",
|
|
" <td>先吐槽一下自己的泪点,太低了。小和尚太像弟弟小时候的样子了。整部电影是良心之作,国产地影这...</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1004222</th>\n",
|
|
" <td>186241</td>\n",
|
|
" <td>14</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1475942400</td>\n",
|
|
" <td>主旋律片的杰出代表,节奏顺畅快速。看得人热血沸腾!</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>198523</th>\n",
|
|
" <td>5774</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1462723200</td>\n",
|
|
" <td>迄今看过最精彩的漫威电影 其实整个剧情核心是复仇 但是这个复仇点真心满怪的 队长还是一如既...</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2014461</th>\n",
|
|
" <td>25511</td>\n",
|
|
" <td>27</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1457280000</td>\n",
|
|
" <td>try everything!动物界的乌托邦 nick真的好苏好腹黑啊啊(原谅我带入了小说</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2101031</th>\n",
|
|
" <td>727978</td>\n",
|
|
" <td>27</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1462550400</td>\n",
|
|
" <td>讲真很棒!</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1614137</th>\n",
|
|
" <td>64084</td>\n",
|
|
" <td>22</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1374768000</td>\n",
|
|
" <td>最后雪中的姐妹情。</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1980114</th>\n",
|
|
" <td>248321</td>\n",
|
|
" <td>26</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1480867200</td>\n",
|
|
" <td>时空的跨越,绝对不能忘记的,你的名字。</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1829632</th>\n",
|
|
" <td>18891</td>\n",
|
|
" <td>25</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1403884800</td>\n",
|
|
" <td>请记住一个特效片是不需要完美剧情的。在电影院看的就是特效,没有其他。给特效满分。顶端水平。...</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>276335</th>\n",
|
|
" <td>186281</td>\n",
|
|
" <td>4</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1443715200</td>\n",
|
|
" <td>不知道在演什么鬼</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2090682</th>\n",
|
|
" <td>214830</td>\n",
|
|
" <td>27</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1457193600</td>\n",
|
|
" <td>这狐狸怎么那么苏!!!反差萌的梗简直炉火纯青</td>\n",
|
|
" <td>4</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2108227</th>\n",
|
|
" <td>731117</td>\n",
|
|
" <td>27</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1458403200</td>\n",
|
|
" <td>树懒梗可爱到爆。乌托邦社会的构建反讽了乌托邦社会设想,号称没有偏见的世界里,本身就是由偏见...</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>864728</th>\n",
|
|
" <td>10418</td>\n",
|
|
" <td>12</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1355673600</td>\n",
|
|
" <td>啥也不说了,从头笑到尾,差点没乐死我,最后又赚了些感动</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>422130</th>\n",
|
|
" <td>263856</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1436544000</td>\n",
|
|
" <td>很感动很用心</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" userId movieId rating timestamp \\\n",
|
|
"405540 251302 5 5 1436976000 \n",
|
|
"159308 18639 2 5 1462636800 \n",
|
|
"1329674 127217 18 5 1451059200 \n",
|
|
"1945766 75720 26 5 1476460800 \n",
|
|
"1706244 29721 23 1 1406131200 \n",
|
|
"1271715 546029 17 1 1406217600 \n",
|
|
"394698 243184 5 5 1437926400 \n",
|
|
"324077 208900 5 5 1437062400 \n",
|
|
"1004222 186241 14 5 1475942400 \n",
|
|
"198523 5774 2 5 1462723200 \n",
|
|
"2014461 25511 27 5 1457280000 \n",
|
|
"2101031 727978 27 5 1462550400 \n",
|
|
"1614137 64084 22 1 1374768000 \n",
|
|
"1980114 248321 26 5 1480867200 \n",
|
|
"1829632 18891 25 5 1403884800 \n",
|
|
"276335 186281 4 1 1443715200 \n",
|
|
"2090682 214830 27 5 1457193600 \n",
|
|
"2108227 731117 27 5 1458403200 \n",
|
|
"864728 10418 12 5 1355673600 \n",
|
|
"422130 263856 5 5 1436544000 \n",
|
|
"\n",
|
|
" comment like \n",
|
|
"405540 路人转自来水!大圣帅气!我要生猴子~~~^-^ 0 \n",
|
|
"159308 冬兵从醒了以后就应该要求被冻起来,美队这个人烂的真要命。心疼tony。 0 \n",
|
|
"1329674 超级棒!远远超出预期 免费水军来了哈哈哈哈 0 \n",
|
|
"1945766 为爱而动 0 \n",
|
|
"1706244 看小时代3的时候真是太壮观了整个场子那个乱啊打电话的聊天的中途上厕所的没办法大家提不起兴趣... 0 \n",
|
|
"1271715 可以给零分么 0 \n",
|
|
"394698 一直听网友说好,今天去电影院看了下。真的不错,是中国动漫的一个值得一看的作品。太多的喜羊羊... 0 \n",
|
|
"324077 先吐槽一下自己的泪点,太低了。小和尚太像弟弟小时候的样子了。整部电影是良心之作,国产地影这... 0 \n",
|
|
"1004222 主旋律片的杰出代表,节奏顺畅快速。看得人热血沸腾! 0 \n",
|
|
"198523 迄今看过最精彩的漫威电影 其实整个剧情核心是复仇 但是这个复仇点真心满怪的 队长还是一如既... 0 \n",
|
|
"2014461 try everything!动物界的乌托邦 nick真的好苏好腹黑啊啊(原谅我带入了小说 0 \n",
|
|
"2101031 讲真很棒! 0 \n",
|
|
"1614137 最后雪中的姐妹情。 0 \n",
|
|
"1980114 时空的跨越,绝对不能忘记的,你的名字。 0 \n",
|
|
"1829632 请记住一个特效片是不需要完美剧情的。在电影院看的就是特效,没有其他。给特效满分。顶端水平。... 0 \n",
|
|
"276335 不知道在演什么鬼 1 \n",
|
|
"2090682 这狐狸怎么那么苏!!!反差萌的梗简直炉火纯青 4 \n",
|
|
"2108227 树懒梗可爱到爆。乌托邦社会的构建反讽了乌托邦社会设想,号称没有偏见的世界里,本身就是由偏见... 0 \n",
|
|
"864728 啥也不说了,从头笑到尾,差点没乐死我,最后又赚了些感动 0 \n",
|
|
"422130 很感动很用心 0 "
|
|
]
|
|
},
|
|
"execution_count": 8,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
"# Keep only ratings with clear polarity: 1 star (negative) and 5 stars (positive)\n",
"ratings_with_opinions = ratings[ratings.rating.isin([1, 5])]\n",
"\n",
"# Printed labels: '正向(5星)数目' = number of positive (5-star) ratings,\n",
"# '负向(1星)数目' = number of negative (1-star) ratings\n",
"print('正向(5星)数目:%d' % (ratings_with_opinions.rating == 5).sum())\n",
"print('负向(1星)数目:%d' % (ratings_with_opinions.rating == 1).sum())\n",
"\n",
"ratings_with_opinions.sample(20)"
|
|
]
|
|
}
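,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The counts above show roughly 3.3 five-star ratings for every one-star rating. The cell below is a minimal sketch of one way to prepare this subset for a binary sentiment classifier; it is an illustration, not the dataset author's method. It assumes `ratings_with_opinions` from the previous cell, and the names `labeled`, `label`, `n_min` and `balanced` are illustrative. Downsampling is only one option; class weighting is a common alternative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: binary label, 1 = positive (5 stars), 0 = negative (1 star)\n",
"labeled = ratings_with_opinions.assign(label=(ratings_with_opinions.rating == 5).astype(int))\n",
"\n",
"# Downsample every class to the size of the smallest one\n",
"n_min = labeled.label.value_counts().min()\n",
"balanced = pd.concat([labeled[labeled.label == k].sample(n_min, random_state=0) for k in (0, 1)])\n",
"\n",
"balanced.label.value_counts()"
]
}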
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "keras"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.2"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|