{
"cells": [
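{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ez_douban Overview\n",
"0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1DkN1LmdSMzm_jCBKhbPbig)\n",
"1. **Data overview:** more than 50,000 movies (over 30,000 with a title, over 20,000 without), about 28,000 users, and about 2.8 million ratings\n",
"2. **Suggested experiments:** recommender systems\n",
"3. **Data source:** [Douban Movie](https://movie.douban.com/)\n",
"4. **Original datasets:** [Douban-1 and Douban-2](https://sites.google.com/site/erhengzhong/datasets), collected by Dr. Erheng Zhong for papers published at KDD'12, TKDD'14, and SDM'12\n",
"5. **Processing:**\n",
"    1. Removed the unused status field and invalid ratings from Douban-1, then reorganized the data into a format compatible with [MovieLens](https://grouplens.org/datasets/movielens/)\n",
"    2. Extracted movie and link information from Douban-2 and joined it with the rating data from Douban-1\n",
"    3. Anonymized the data to protect user privacy"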
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Placeholder: directory containing the ez_douban files, ending with a path separator\n",
"path = 'path/to/ez_douban/'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. movies.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Movies (with title): 33258\n",
"Movies (without title): 24166\n",
"Movies (total): 57424\n"
]
}
],
"source": [
"movies = pd.read_csv(path + 'movies.csv')\n",
"\n",
"print('Movies (with title): %d' % movies[~pd.isnull(movies.title)].shape[0])\n",
"print('Movies (without title): %d' % movies[pd.isnull(movies.title)].shape[0])\n",
"print('Movies (total): %d' % movies.shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| movieId | Movie ID (consecutive integers starting from 0) |\n",
"| title | Movie title |"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>movieId</th>\n",
" <th>title</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>41807</th>\n",
" <td>41807</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16521</th>\n",
" <td>16521</td>\n",
" <td>五女拜寿</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10689</th>\n",
" <td>10689</td>\n",
" <td>La pelote de laine</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21653</th>\n",
" <td>21653</td>\n",
" <td>Ma mha 4 khaa khrap</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36630</th>\n",
" <td>36630</td>\n",
" <td>the sky the earth and the rain</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31734</th>\n",
" <td>31734</td>\n",
" <td>Viva María!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31530</th>\n",
" <td>31530</td>\n",
" <td>远路</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22553</th>\n",
" <td>22553</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32346</th>\n",
" <td>32346</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29429</th>\n",
" <td>29429</td>\n",
" <td>The Crazies</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34912</th>\n",
" <td>34912</td>\n",
" <td>Stestí</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10350</th>\n",
" <td>10350</td>\n",
" <td>羊のうた</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31487</th>\n",
" <td>31487</td>\n",
" <td>一触即发</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50688</th>\n",
" <td>50688</td>\n",
" <td>还君明珠</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40769</th>\n",
" <td>40769</td>\n",
" <td>Red Riding Hood</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32748</th>\n",
" <td>32748</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17204</th>\n",
" <td>17204</td>\n",
" <td>작은아씨들</td>\n",
" </tr>\n",
" <tr>\n",
" <th>55870</th>\n",
" <td>55870</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>42879</th>\n",
" <td>42879</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26432</th>\n",
" <td>26432</td>\n",
" <td>后门</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" movieId title\n",
"41807 41807 NaN\n",
"16521 16521 五女拜寿\n",
"10689 10689 La pelote de laine\n",
"21653 21653 Ma mha 4 khaa khrap\n",
"36630 36630 the sky the earth and the rain\n",
"31734 31734 Viva María!\n",
"31530 31530 远路\n",
"22553 22553 NaN\n",
"32346 32346 NaN\n",
"29429 29429 The Crazies\n",
"34912 34912 Stestí\n",
"10350 10350 羊のうた\n",
"31487 31487 一触即发\n",
"50688 50688 还君明珠\n",
"40769 40769 Red Riding Hood\n",
"32748 32748 NaN\n",
"17204 17204 작은아씨들\n",
"55870 55870 NaN\n",
"42879 42879 NaN\n",
"26432 26432 后门"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies.sample(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. ratings.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Users: 28718\n",
"Ratings: 2828585\n"
]
}
],
"source": [
"ratings = pd.read_csv(path + 'ratings.csv')\n",
"\n",
"print('Users: %d' % ratings.userId.unique().shape[0])\n",
"print('Ratings: %d' % ratings.shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| Field | Description |\n",
"| ---- | ---- |\n",
"| userId | User ID (consecutive integers starting from 0) |\n",
"| movieId | Same as movieId in movies.csv |\n",
"| rating | Rating, an integer in [1, 5] |\n",
"| timestamp | Timestamp of the rating |"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>userId</th>\n",
" <th>movieId</th>\n",
" <th>rating</th>\n",
" <th>timestamp</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1234569</th>\n",
" <td>4825</td>\n",
" <td>14852</td>\n",
" <td>5</td>\n",
" <td>1263084471</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1817521</th>\n",
" <td>7121</td>\n",
" <td>140</td>\n",
" <td>4</td>\n",
" <td>1259054160</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2417373</th>\n",
" <td>9449</td>\n",
" <td>116</td>\n",
" <td>3</td>\n",
" <td>1255344370</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1234106</th>\n",
" <td>4822</td>\n",
" <td>685</td>\n",
" <td>5</td>\n",
" <td>1124800342</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2044878</th>\n",
" <td>7996</td>\n",
" <td>22343</td>\n",
" <td>4</td>\n",
" <td>1254639194</td>\n",
" </tr>\n",
" <tr>\n",
" <th>239277</th>\n",
" <td>947</td>\n",
" <td>5730</td>\n",
" <td>5</td>\n",
" <td>1253992436</td>\n",
" </tr>\n",
" <tr>\n",
" <th>305034</th>\n",
" <td>1178</td>\n",
" <td>9839</td>\n",
" <td>5</td>\n",
" <td>1304648204</td>\n",
" </tr>\n",
" <tr>\n",
" <th>121193</th>\n",
" <td>527</td>\n",
" <td>1512</td>\n",
" <td>4</td>\n",
" <td>1125694603</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2563603</th>\n",
" <td>10758</td>\n",
" <td>738</td>\n",
" <td>4</td>\n",
" <td>1301927887</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2034193</th>\n",
" <td>7949</td>\n",
" <td>1671</td>\n",
" <td>5</td>\n",
" <td>1276176595</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1373543</th>\n",
" <td>5369</td>\n",
" <td>893</td>\n",
" <td>3</td>\n",
" <td>1299972980</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1798131</th>\n",
" <td>7027</td>\n",
" <td>4530</td>\n",
" <td>3</td>\n",
" <td>1178099769</td>\n",
" </tr>\n",
" <tr>\n",
" <th>572517</th>\n",
" <td>2243</td>\n",
" <td>9773</td>\n",
" <td>3</td>\n",
" <td>1187275220</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2160230</th>\n",
" <td>8470</td>\n",
" <td>12</td>\n",
" <td>3</td>\n",
" <td>1306330169</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1672554</th>\n",
" <td>6554</td>\n",
" <td>5637</td>\n",
" <td>3</td>\n",
" <td>1168168788</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1504944</th>\n",
" <td>5920</td>\n",
" <td>6659</td>\n",
" <td>3</td>\n",
" <td>1254041654</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2657986</th>\n",
" <td>17116</td>\n",
" <td>738</td>\n",
" <td>4</td>\n",
" <td>1238829652</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2123663</th>\n",
" <td>8319</td>\n",
" <td>1242</td>\n",
" <td>4</td>\n",
" <td>1225941971</td>\n",
" </tr>\n",
" <tr>\n",
" <th>561109</th>\n",
" <td>2206</td>\n",
" <td>4209</td>\n",
" <td>3</td>\n",
" <td>1307884947</td>\n",
" </tr>\n",
" <tr>\n",
" <th>208970</th>\n",
" <td>887</td>\n",
" <td>4723</td>\n",
" <td>3</td>\n",
" <td>1306314265</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" userId movieId rating timestamp\n",
"1234569 4825 14852 5 1263084471\n",
"1817521 7121 140 4 1259054160\n",
"2417373 9449 116 3 1255344370\n",
"1234106 4822 685 5 1124800342\n",
"2044878 7996 22343 4 1254639194\n",
"239277 947 5730 5 1253992436\n",
"305034 1178 9839 5 1304648204\n",
"121193 527 1512 4 1125694603\n",
"2563603 10758 738 4 1301927887\n",
"2034193 7949 1671 5 1276176595\n",
"1373543 5369 893 3 1299972980\n",
"1798131 7027 4530 3 1178099769\n",
"572517 2243 9773 3 1187275220\n",
"2160230 8470 12 3 1306330169\n",
"1672554 6554 5637 3 1168168788\n",
"1504944 5920 6659 3 1254041654\n",
"2657986 17116 738 4 1238829652\n",
"2123663 8319 1242 4 1225941971\n",
"561109 2206 4209 3 1307884947\n",
"208970 887 4723 3 1306314265"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ratings.sample(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. links.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"links = pd.read_csv(path + 'links.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| Field | Description |\n",
"| ---- | ---- |\n",
"| movieId | Same as movieId in movies.csv and ratings.csv |\n",
"| imdbId | Movie ID on IMDb |\n",
"| doubanId | Movie ID on Douban |"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>movieId</th>\n",
" <th>imdbId</th>\n",
" <th>doubanId</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>50304</th>\n",
" <td>50304</td>\n",
" <td>NaN</td>\n",
" <td>3712319</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46231</th>\n",
" <td>46231</td>\n",
" <td>NaN</td>\n",
" <td>3035298</td>\n",
" </tr>\n",
" <tr>\n",
" <th>56597</th>\n",
" <td>56597</td>\n",
" <td>NaN</td>\n",
" <td>2980174</td>\n",
" </tr>\n",
" <tr>\n",
" <th>54191</th>\n",
" <td>54191</td>\n",
" <td>86992.0</td>\n",
" <td>1294617</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3418</th>\n",
" <td>3418</td>\n",
" <td>87406.0</td>\n",
" <td>1533608</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6586</th>\n",
" <td>6586</td>\n",
" <td>NaN</td>\n",
" <td>6383567</td>\n",
" </tr>\n",
" <tr>\n",
" <th>52685</th>\n",
" <td>52685</td>\n",
" <td>376706.0</td>\n",
" <td>1770079</td>\n",
" </tr>\n",
" <tr>\n",
" <th>53372</th>\n",
" <td>53372</td>\n",
" <td>218839.0</td>\n",
" <td>1295836</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27540</th>\n",
" <td>27540</td>\n",
" <td>NaN</td>\n",
" <td>2371674</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34467</th>\n",
" <td>34467</td>\n",
" <td>NaN</td>\n",
" <td>4868728</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2301</th>\n",
" <td>2301</td>\n",
" <td>NaN</td>\n",
" <td>3732699</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16687</th>\n",
" <td>16687</td>\n",
" <td>NaN</td>\n",
" <td>4840386</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36301</th>\n",
" <td>36301</td>\n",
" <td>364457.0</td>\n",
" <td>1764523</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44922</th>\n",
" <td>44922</td>\n",
" <td>452640.0</td>\n",
" <td>1920065</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27815</th>\n",
" <td>27815</td>\n",
" <td>114687.0</td>\n",
" <td>1773480</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25370</th>\n",
" <td>25370</td>\n",
" <td>NaN</td>\n",
" <td>4192036</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36070</th>\n",
" <td>36070</td>\n",
" <td>NaN</td>\n",
" <td>4848096</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40954</th>\n",
" <td>40954</td>\n",
" <td>115906.0</td>\n",
" <td>1302469</td>\n",
" </tr>\n",
" <tr>\n",
" <th>38395</th>\n",
" <td>38395</td>\n",
" <td>436784.0</td>\n",
" <td>1857858</td>\n",
" </tr>\n",
" <tr>\n",
" <th>49680</th>\n",
" <td>49680</td>\n",
" <td>NaN</td>\n",
" <td>4168480</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" movieId imdbId doubanId\n",
"50304 50304 NaN 3712319\n",
"46231 46231 NaN 3035298\n",
"56597 56597 NaN 2980174\n",
"54191 54191 86992.0 1294617\n",
"3418 3418 87406.0 1533608\n",
"6586 6586 NaN 6383567\n",
"52685 52685 376706.0 1770079\n",
"53372 53372 218839.0 1295836\n",
"27540 27540 NaN 2371674\n",
"34467 34467 NaN 4868728\n",
"2301 2301 NaN 3732699\n",
"16687 16687 NaN 4840386\n",
"36301 36301 364457.0 1764523\n",
"44922 44922 452640.0 1920065\n",
"27815 27815 114687.0 1773480\n",
"25370 25370 NaN 4192036\n",
"36070 36070 NaN 4848096\n",
"40954 40954 115906.0 1302469\n",
"38395 38395 436784.0 1857858\n",
"49680 49680 NaN 4168480"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"links.sample(20)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "keras"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}