Complete part of the system crawler.

戒酒的李白
2025-08-20 21:54:31 +08:00
parent 995ec11144
commit 047bbf8c26
536 changed files with 20 additions and 115899 deletions
+1 -1
@@ -288,7 +288,7 @@ tmp/
 # ==== Configuration and secrets ====
 # Sensitive configuration files
-config.py
+# config.py
 config.ini
 secrets.json
 .secrets
-42
@@ -1,42 +0,0 @@
# ChineseNlpCorpus
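Collects, curates, and publishes corpora and datasets for Chinese natural language processing, to advance Chinese NLP together with like-minded contributors.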
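## Sentiment / Opinion / Review Polarity Analysis
| Dataset | Overview | Download |
| ----- | -------- | ------- |
| ChnSentiCorp_htl_all | More than 7,000 hotel reviews: over 5,000 positive and over 2,000 negative | [View](./datasets/ChnSentiCorp_htl_all/intro.ipynb) |
| waimai_10k | User reviews collected from a food-delivery platform: 4,000 positive and about 8,000 negative | [View](./datasets/waimai_10k/intro.ipynb) |
| online_shopping_10_cats | 10 categories, more than 60,000 reviews in total, about 30,000 each positive and negative,<br /> covering books, tablets, phones, fruit, shampoo, water heaters, Mengniu, clothing, computers, and hotels | [View](./datasets/online_shopping_10_cats/intro.ipynb) |
| weibo_senti_100k | More than 100,000 sentiment-labeled Sina Weibo posts, about 50,000 each positive and negative | [View](./datasets/weibo_senti_100k/intro.ipynb) |
| simplifyweibo_4_moods | More than 360,000 sentiment-labeled Sina Weibo posts covering 4 moods:<br /> about 200,000 joy, and about 50,000 each for anger, disgust, and low mood | [View](./datasets/simplifyweibo_4_moods/intro.ipynb) |
| dmsc_v2 | 28 movies, more than 700,000 users, more than 2 million ratings/reviews | [View](./datasets/dmsc_v2/intro.ipynb) |
| yf_dianping | 240,000 restaurants, 540,000 users, 4.4 million reviews/ratings | [View](./datasets/yf_dianping/intro.ipynb) |
| yf_amazon | 520,000 products, more than 1,100 categories, 1.42 million users, 7.2 million reviews/ratings | [View](./datasets/yf_amazon/intro.ipynb) |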
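
A minimal sketch of how a sentiment file from the table above might be inspected, assuming a CSV with a `label` column (1 = positive, 0 = negative) and a `review` column, which is the layout the ChnSentiCorp_htl_all intro notebook describes. The sample rows and the `count_labels` helper are hypothetical and only illustrate that layout; the notebooks themselves use pandas, while this sketch stays within the standard library.

```python
import csv
import io
from collections import Counter

# Illustrative rows only (not taken from the dataset). The real file is
# assumed to share the same two columns: label, review.
SAMPLE_CSV = """label,review
1,房间很干净,服务也不错
0,设施陈旧,隔音很差
1,位置方便,下次还会入住
"""


def count_labels(csv_text):
    """Return a Counter mapping each integer label to its number of reviews."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(int(row["label"]) for row in reader)


counts = count_labels(SAMPLE_CSV)
print("total reviews:", sum(counts.values()))
print("positive reviews:", counts[1])
print("negative reviews:", counts[0])
```

To check a real file, pass the text of the downloaded CSV to `count_labels` in place of `SAMPLE_CSV`.
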
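## Chinese Named Entity Recognition
| Dataset | Overview | Download |
| ----- | -------- | ------- |
| dh_msra | More than 50,000 annotated Chinese named entity recognition samples (locations, organizations, persons) | [View](./datasets/dh_msra/intro.ipynb) |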
## Recommender Systems
| Dataset | Overview | Download |
| ----- | -------- | ------- |
| ez_douban | More than 50,000 movies (over 30,000 with titles, over 20,000 without), 28,000 users, 2.8 million ratings | [View](./datasets/ez_douban/intro.ipynb) |
| dmsc_v2 | 28 movies, more than 700,000 users, more than 2 million ratings/reviews | [View](./datasets/dmsc_v2/intro.ipynb) |
| yf_dianping | 240,000 restaurants, 540,000 users, 4.4 million reviews/ratings | [View](./datasets/yf_dianping/intro.ipynb) |
| yf_amazon | 520,000 products, more than 1,100 categories, 1.42 million users, 7.2 million reviews/ratings | [View](./datasets/yf_amazon/intro.ipynb) |
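## FAQ Question Answering Systems
| Dataset | Overview | Download |
| ----- | -------- | ------- |
| 保险知道 | More than 8,000 insurance Q&A records, including user questions, community answers, and best answers | [View](./datasets/baoxianzhidao/intro.ipynb) |
| 安徽电信知道 | 156,000 telecom Q&A records, including user questions, community answers, and best answers | [View](./datasets/anhuidianxinzhidao/intro.ipynb) |
| 金融知道 | 770,000 finance Q&A records, including user questions, community answers, and best answers | [View](./datasets/financezhidao/intro.ipynb) |
| 法律知道 | 36,000 legal Q&A records, including user questions, community answers, and best answers | [View](./datasets/lawzhidao/intro.ipynb) |
| 联通知道 | 203,000 China Unicom Q&A records, including user questions, community answers, and best answers | [View](./datasets/liantongzhidao/intro.ipynb) |
| 农行知道 | 40,000 Agricultural Bank of China Q&A records, including user questions, community answers, and best answers | [View](./datasets/nonghangzhidao/intro.ipynb) |
| 保险知道 | 588,000 insurance Q&A records, including user questions, community answers, and best answers | [View](./datasets/baoxianzhidao/intro.ipynb) |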
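
A rough sketch of turning one of the Q&A files above into a simple FAQ lookup, assuming the four columns the intro notebooks document (`title`, `question`, `reply`, `is_best`) and that `is_best` is stored as 0 or 1. The sample rows and the `best_answers` helper are hypothetical; this is one possible way to use the data, not the repository's own pipeline.

```python
import csv
import io

# Illustrative rows only (not taken from any dataset); `question` may be empty.
SAMPLE_CSV = """title,question,reply,is_best
How do I check my balance?,,Use the carrier app.,1
How do I check my balance?,,Not sure.,0
Can I insure a used car?,I just bought it.,Yes; transfer the policy first.,1
"""


def best_answers(csv_text):
    """Map each question title to the first reply flagged as the best answer."""
    reader = csv.DictReader(io.StringIO(csv_text))
    faq = {}
    for row in reader:
        if row["is_best"] == "1":
            # Keep the first best answer seen for a title; ignore later ones.
            faq.setdefault(row["title"], row["reply"])
    return faq


faq = best_answers(SAMPLE_CSV)
for title, reply in faq.items():
    print(title, "->", reply)
```

Keeping only rows with `is_best == 1` gives one answer per title; the remaining replies could serve as negative examples when training a ranking model.
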
File diff suppressed because one or more lines are too long
@@ -1,668 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ChnSentiCorp_htl_all Overview\n",
"0. **Download:** [GitHub](https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/ChnSentiCorp_htl_all/ChnSentiCorp_htl_all.csv)\n",
"1. **Overview:** More than 7,000 hotel reviews: over 5,000 positive and over 2,000 negative\n",
"2. **Recommended experiments:** sentiment / opinion / review polarity analysis\n",
"3. **Data source:** [Ctrip](http://www.ctrip.com/)\n",
"4. **Original dataset:** ChnSentiCorp_htl, a dataset compiled by [Tan Songbo (谭松波)](http://people.ucas.ac.cn/~0012244)\n",
"5. **Processing:**\n",
" 1. Merged the original 10,000 separate files into a single file\n",
" 2. Changed the label of negative reviews from -1 to 0\n",
" 3. Removed duplicates"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"path = 'path/to/ChnSentiCorp_htl_all/'  # placeholder: folder that contains the CSV (keep the trailing slash)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. ChnSentiCorp_htl_all.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"评论数目(总体):7766\n",
"评论数目(正向):5322\n",
"评论数目(负向):2444\n"
]
}
],
"source": [
"pd_all = pd.read_csv(path + 'ChnSentiCorp_htl_all.csv')\n",
"\n",
"print('评论数目(总体):%d' % pd_all.shape[0])\n",
"print('评论数目(正向):%d' % pd_all[pd_all.label==1].shape[0])\n",
"print('评论数目(负向):%d' % pd_all[pd_all.label==0].shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field description\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| label | 1 = positive review, 0 = negative review |\n",
"| review | Review text |"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>review</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5612</th>\n",
" <td>0</td>\n",
" <td>房间小得无法想象,建议个子大的不要选择,一般的睡觉脚也伸不直.房间不超过10平方,彩电是14...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7321</th>\n",
" <td>0</td>\n",
" <td>我们一家人带孩子去过“五.一”,在协程网上挑了半天才选中的酒店,但看来还是错了。1.酒店除了...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3870</th>\n",
" <td>1</td>\n",
" <td>周六到西山去采橘子,路过这家酒店的时候就觉得应该不错的,采好橘子回来天也晚了,就临时决定住在...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4057</th>\n",
" <td>1</td>\n",
" <td>交通很便利,到渔人码头和港澳码头都在步行的范围之内.CHECKIN和CHECKOUT的速度都...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1452</th>\n",
" <td>1</td>\n",
" <td>很不错的一个酒店,床很大,很舒服.酒店员工的服务态度很亲切.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4805</th>\n",
" <td>1</td>\n",
" <td>酒店环境和服务都还不错,地理位置也不错,尤其是酒店北面的川北凉粉确实好吃,不过就是隔音效果不...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6868</th>\n",
" <td>0</td>\n",
" <td>旧楼改建的酒店,期望不要太高。酒店经理的态度很好,会帮助解决问题。有一位前台小姐的态度实在是...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1345</th>\n",
" <td>1</td>\n",
" <td>经常去海口出差,但从没住过该酒店.看外表感觉一般吧其实酒店里面还真不错,房间是新装修的(我住...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2026</th>\n",
" <td>1</td>\n",
" <td>算是海口市比较好的酒店了。处于市中心,购物方便。服务态度好。保险柜出问题了叫人来开,打个电话...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2805</th>\n",
" <td>1</td>\n",
" <td>感受的是热情的服务!从入门开始,一直很愉快!房间硬件只是准2星的吧,卫生间淋浴头在马桶上方,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2915</th>\n",
" <td>1</td>\n",
" <td>房间很整洁,尤其是床上的哪个靠枕是我以前所住过宾馆没有的,红色的很喜庆。虽然是在当地比较繁华...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1803</th>\n",
" <td>1</td>\n",
" <td>准确的说,酒店的环境很漂亮,房间设施也还行,可以算4星标准。但是,卫生间下水道的气味实在是让...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4729</th>\n",
" <td>1</td>\n",
" <td>价格越来越高了,周遍不方便,去哪里都需要打车.不过装修风格很时尚舒适.服务态度不错.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1913</th>\n",
" <td>1</td>\n",
" <td>地理位置不错。但好像人气不太旺。不过下次也会考虑住这的。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7159</th>\n",
" <td>0</td>\n",
" <td>设施老化,紧靠马路噪音太大。晚上楼上卫生间的水流声和空调噪音非常大,无法入眠,跟总台反映后,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1119</th>\n",
" <td>1</td>\n",
" <td>11月份住了一次。1.服务方面还不错,门童挺积极。2.感觉房间略有陈旧。3.早餐品种还算丰富...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2170</th>\n",
" <td>1</td>\n",
" <td>总的来说,酒店还不错。比较安静,地理位置比较好,服务也不错,包括入住和结账。不太好的地方,7...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2793</th>\n",
" <td>1</td>\n",
" <td>我喜欢那里,性价比很高地.去太原90%都住在那里的.服务员的服务很不错</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5895</th>\n",
" <td>0</td>\n",
" <td>非常糟糕!1。我们通过其商务中心包了一辆车游西湖,该车拉我们去不正规景点买茶叶(我们买了),...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4089</th>\n",
" <td>1</td>\n",
" <td>我是7月9号晚10点多的时候入住的,房间很新,据说是跟格林豪泰是同一公司的,可能是是新开业的...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label review\n",
"5612 0 房间小得无法想象,建议个子大的不要选择,一般的睡觉脚也伸不直.房间不超过10平方,彩电是14...\n",
"7321 0 我们一家人带孩子去过“五.一”,在协程网上挑了半天才选中的酒店,但看来还是错了。1.酒店除了...\n",
"3870 1 周六到西山去采橘子,路过这家酒店的时候就觉得应该不错的,采好橘子回来天也晚了,就临时决定住在...\n",
"4057 1 交通很便利,到渔人码头和港澳码头都在步行的范围之内.CHECKIN和CHECKOUT的速度都...\n",
"1452 1 很不错的一个酒店,床很大,很舒服.酒店员工的服务态度很亲切.\n",
"4805 1 酒店环境和服务都还不错,地理位置也不错,尤其是酒店北面的川北凉粉确实好吃,不过就是隔音效果不...\n",
"6868 0 旧楼改建的酒店,期望不要太高。酒店经理的态度很好,会帮助解决问题。有一位前台小姐的态度实在是...\n",
"1345 1 经常去海口出差,但从没住过该酒店.看外表感觉一般吧其实酒店里面还真不错,房间是新装修的(我住...\n",
"2026 1 算是海口市比较好的酒店了。处于市中心,购物方便。服务态度好。保险柜出问题了叫人来开,打个电话...\n",
"2805 1 感受的是热情的服务!从入门开始,一直很愉快!房间硬件只是准2星的吧,卫生间淋浴头在马桶上方,...\n",
"2915 1 房间很整洁,尤其是床上的哪个靠枕是我以前所住过宾馆没有的,红色的很喜庆。虽然是在当地比较繁华...\n",
"1803 1 准确的说,酒店的环境很漂亮,房间设施也还行,可以算4星标准。但是,卫生间下水道的气味实在是让...\n",
"4729 1 价格越来越高了,周遍不方便,去哪里都需要打车.不过装修风格很时尚舒适.服务态度不错.\n",
"1913 1 地理位置不错。但好像人气不太旺。不过下次也会考虑住这的。\n",
"7159 0 设施老化,紧靠马路噪音太大。晚上楼上卫生间的水流声和空调噪音非常大,无法入眠,跟总台反映后,...\n",
"1119 1 11月份住了一次。1.服务方面还不错,门童挺积极。2.感觉房间略有陈旧。3.早餐品种还算丰富...\n",
"2170 1 总的来说,酒店还不错。比较安静,地理位置比较好,服务也不错,包括入住和结账。不太好的地方,7...\n",
"2793 1 我喜欢那里,性价比很高地.去太原90%都住在那里的.服务员的服务很不错\n",
"5895 0 非常糟糕!1。我们通过其商务中心包了一辆车游西湖,该车拉我们去不正规景点买茶叶(我们买了),...\n",
"4089 1 我是7月9号晚10点多的时候入住的,房间很新,据说是跟格林豪泰是同一公司的,可能是是新开业的..."
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_all.sample(20)"
]
},
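{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Building balanced corpora\n",
"\n",
"- The original dataset also includes 3 balanced corpora: ChnSentiCorp_htl_ba_2000, ChnSentiCorp_htl_ba_4000, ChnSentiCorp_htl_ba_6000\n",
"- Similar balanced corpora are easy to build by random sampling\n",
"- When a class has fewer reviews than the requested half (for example, 3,000 negatives out of 2,444), the function below samples that class with replacement, so the result contains repeated reviews"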
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"pd_positive = pd_all[pd_all.label==1]\n",
"pd_negative = pd_all[pd_all.label==0]\n",
"\n",
"def get_balance_corpus(corpus_size, corpus_pos, corpus_neg):\n",
" sample_size = corpus_size // 2\n",
" pd_corpus_balance = pd.concat([corpus_pos.sample(sample_size, replace=corpus_pos.shape[0]<sample_size), \\\n",
" corpus_neg.sample(sample_size, replace=corpus_neg.shape[0]<sample_size)])\n",
" \n",
" print('评论数目(总体):%d' % pd_corpus_balance.shape[0])\n",
" print('评论数目(正向):%d' % pd_corpus_balance[pd_corpus_balance.label==1].shape[0])\n",
" print('评论数目(负向):%d' % pd_corpus_balance[pd_corpus_balance.label==0].shape[0]) \n",
" \n",
" return pd_corpus_balance"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"评论数目(总体):2000\n",
"评论数目(正向):1000\n",
"评论数目(负向):1000\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>review</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5536</th>\n",
" <td>0</td>\n",
" <td>建议携程不要和这家酒店合作,名曰三星,要我看准星级都勉强!首先不在市区里面(去涵江区打车还要...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4086</th>\n",
" <td>1</td>\n",
" <td>感觉比老街口客栈舒适,很中规中矩的3星级,推荐大家住主楼的豪华间,设施比较好,前台和大堂的服...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6112</th>\n",
" <td>0</td>\n",
" <td>是我遇到的最差的4星酒店,进门没人管,进去要我和大堂打招呼,退房也很慢,不会再去住了</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4440</th>\n",
" <td>1</td>\n",
" <td>房间的设施不错,由于武夷山市是个小地方,酒店离景区有一定距离,如果没有自己开车就不太方便,但...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2706</th>\n",
" <td>1</td>\n",
" <td>首次入住该酒店,环境雅致,服务非常不错,很多笑脸,感觉热情,早餐可以接受,有送餐服务以后去徐...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1770</th>\n",
" <td>1</td>\n",
" <td>不错!就是洗澡的地方小点~~下回去还住这家~~</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4306</th>\n",
" <td>1</td>\n",
" <td>环境位置很好,房间情况尚可,早餐一般般,价格偏高了一些.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2161</th>\n",
" <td>1</td>\n",
" <td>位置优越,出行方便。就是房间较小,床位较小,房间装修较旧,其他方面都不错。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7667</th>\n",
" <td>0</td>\n",
" <td>酒店周围环境差,内部也很旧,卫生不好,很脏,总之没什么好的,下次决不住这。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4419</th>\n",
" <td>1</td>\n",
" <td>我7月24号入住瑞豪酒店,开始有些不顺利,但是那里的管理还是非常好的,有位姓赵的经理发现问题...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label review\n",
"5536 0 建议携程不要和这家酒店合作,名曰三星,要我看准星级都勉强!首先不在市区里面(去涵江区打车还要...\n",
"4086 1 感觉比老街口客栈舒适,很中规中矩的3星级,推荐大家住主楼的豪华间,设施比较好,前台和大堂的服...\n",
"6112 0 是我遇到的最差的4星酒店,进门没人管,进去要我和大堂打招呼,退房也很慢,不会再去住了\n",
"4440 1 房间的设施不错,由于武夷山市是个小地方,酒店离景区有一定距离,如果没有自己开车就不太方便,但...\n",
"2706 1 首次入住该酒店,环境雅致,服务非常不错,很多笑脸,感觉热情,早餐可以接受,有送餐服务以后去徐...\n",
"1770 1 不错!就是洗澡的地方小点~~下回去还住这家~~\n",
"4306 1 环境位置很好,房间情况尚可,早餐一般般,价格偏高了一些.\n",
"2161 1 位置优越,出行方便。就是房间较小,床位较小,房间装修较旧,其他方面都不错。\n",
"7667 0 酒店周围环境差,内部也很旧,卫生不好,很脏,总之没什么好的,下次决不住这。\n",
"4419 1 我7月24号入住瑞豪酒店,开始有些不顺利,但是那里的管理还是非常好的,有位姓赵的经理发现问题..."
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ChnSentiCorp_htl_ba_2000 = get_balance_corpus(2000, pd_positive, pd_negative)\n",
"\n",
"ChnSentiCorp_htl_ba_2000.sample(10)"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"评论数目(总体):4000\n",
"评论数目(正向):2000\n",
"评论数目(负向):2000\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>review</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3605</th>\n",
" <td>1</td>\n",
" <td>酒店就在海水浴场旁边,出门到接触到海水两分钟,如果要和海水亲近的朋友,极力推荐。这样游泳换衣...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7260</th>\n",
" <td>0</td>\n",
" <td>TheWorsehotelinChengdurightnow,checkoutat12.30...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5762</th>\n",
" <td>0</td>\n",
" <td>房间还算可以,不过前台服务人员的态度,受不了,我晚上11点多到酒店CHEKIN第二天退房的时...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5790</th>\n",
" <td>0</td>\n",
" <td>酒店设施陈旧,浴缸排水不畅,入住无房,一间16:00,一间2200,早餐差</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4504</th>\n",
" <td>1</td>\n",
" <td>虽是公寓式酒店,但其房间整洁程度、全方位的服务都给我留下了很好的印象。丝丝不完善之处在于很多...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5246</th>\n",
" <td>1</td>\n",
" <td>很好的酒店,很喜欢,房间很干净很漂亮,从房间的窗口看出去,超美的,在市中心区域,出行也非常的...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>624</th>\n",
" <td>1</td>\n",
" <td>在临沂,这个酒店算是比较有档次的了,给外国客人的服务也比较合格。可惜电视内容比较单调,国外的...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1382</th>\n",
" <td>1</td>\n",
" <td>4年前住过,我和德国同事都觉得很不错。今年我又选了豪门,还是觉得很好。自助餐品种丰富,房间宽...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3723</th>\n",
" <td>1</td>\n",
" <td>价格不高,比较实惠,服务也不错,离闹市区不远.交通也比较方便.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3328</th>\n",
" <td>1</td>\n",
" <td>房间:建筑风格比较独特。木屋矗立在随潮汐涨落的水中,围廊象迷宫一样。看着自己的小屋,却没有直...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label review\n",
"3605 1 酒店就在海水浴场旁边,出门到接触到海水两分钟,如果要和海水亲近的朋友,极力推荐。这样游泳换衣...\n",
"7260 0 TheWorsehotelinChengdurightnow,checkoutat12.30...\n",
"5762 0 房间还算可以,不过前台服务人员的态度,受不了,我晚上11点多到酒店CHEKIN第二天退房的时...\n",
"5790 0 酒店设施陈旧,浴缸排水不畅,入住无房,一间16:00,一间22:00,早餐差\n",
"4504 1 虽是公寓式酒店,但其房间整洁程度、全方位的服务都给我留下了很好的印象。丝丝不完善之处在于很多...\n",
"5246 1 很好的酒店,很喜欢,房间很干净很漂亮,从房间的窗口看出去,超美的,在市中心区域,出行也非常的...\n",
"624 1 在临沂,这个酒店算是比较有档次的了,给外国客人的服务也比较合格。可惜电视内容比较单调,国外的...\n",
"1382 1 4年前住过,我和德国同事都觉得很不错。今年我又选了豪门,还是觉得很好。自助餐品种丰富,房间宽...\n",
"3723 1 价格不高,比较实惠,服务也不错,离闹市区不远.交通也比较方便.\n",
"3328 1 房间:建筑风格比较独特。木屋矗立在随潮汐涨落的水中,围廊象迷宫一样。看着自己的小屋,却没有直..."
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ChnSentiCorp_htl_ba_4000 = get_balance_corpus(4000, pd_positive, pd_negative)\n",
"\n",
"ChnSentiCorp_htl_ba_4000.sample(10)"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"评论数目(总体):6000\n",
"评论数目(正向):3000\n",
"评论数目(负向):3000\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>review</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4817</th>\n",
" <td>1</td>\n",
" <td>入住的是260元的迷你标准间。感觉比想象的要好很多,房间如果住一个人很合适的,洗手间很大,很...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7021</th>\n",
" <td>0</td>\n",
" <td>7点到了酒店前台打电话问了楼层说房间可以入住,上楼竟然房间的垃圾成堆根本就没有打扫,下楼要求...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6484</th>\n",
" <td>0</td>\n",
" <td>又要对他进行点评了,呜呜。。。说什么好呢</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6715</th>\n",
" <td>0</td>\n",
" <td>看了前面介绍的推荐去入住的,结果很失望,酒店的淋浴居然没有维护设施,洗个澡弄得整个洗手间都淋...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6775</th>\n",
" <td>0</td>\n",
" <td>酒店的设施太差了,估计连1星级都没有,房间空调都不开的,简直就是一塌糊涂。建议大家不要去预订该酒店</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7575</th>\n",
" <td>0</td>\n",
" <td>真的差得没话说,但说起来又有一堆。住进去的时候发现没有浴巾,第二天却一直打电话说我们拿了那两...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1615</th>\n",
" <td>1</td>\n",
" <td>酒店非常好,距离高速出口很近,服务也很到位,值得推荐的酒店,到泰山应该是最好的酒店了.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6466</th>\n",
" <td>0</td>\n",
" <td>携城预定员极力推荐这家酒店,相信她才入住了这家,结果到了酒店才发现,连一星级都不如,前台的小...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1392</th>\n",
" <td>1</td>\n",
" <td>酒店很大,服务太差,A楼房间也老,下次再也不住了。环境很好,打高尔夫的或许可以忍忍吧。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4408</th>\n",
" <td>1</td>\n",
" <td>房间很大,大的让我去其他宾馆都感觉性价比不高!服务也不错,值得一住!!</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label review\n",
"4817 1 入住的是260元的迷你标准间。感觉比想象的要好很多,房间如果住一个人很合适的,洗手间很大,很...\n",
"7021 0 7点到了酒店前台打电话问了楼层说房间可以入住,上楼竟然房间的垃圾成堆根本就没有打扫,下楼要求...\n",
"6484 0 又要对他进行点评了,呜呜。。。说什么好呢\n",
"6715 0 看了前面介绍的推荐去入住的,结果很失望,酒店的淋浴居然没有维护设施,洗个澡弄得整个洗手间都淋...\n",
"6775 0 酒店的设施太差了,估计连1星级都没有,房间空调都不开的,简直就是一塌糊涂。建议大家不要去预订该酒店\n",
"7575 0 真的差得没话说,但说起来又有一堆。住进去的时候发现没有浴巾,第二天却一直打电话说我们拿了那两...\n",
"1615 1 酒店非常好,距离高速出口很近,服务也很到位,值得推荐的酒店,到泰山应该是最好的酒店了.\n",
"6466 0 携城预定员极力推荐这家酒店,相信她才入住了这家,结果到了酒店才发现,连一星级都不如,前台的小...\n",
"1392 1 酒店很大,服务太差,A楼房间也老,下次再也不住了。环境很好,打高尔夫的或许可以忍忍吧。\n",
"4408 1 房间很大,大的让我去其他宾馆都感觉性价比不高!服务也不错,值得一住!!"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ChnSentiCorp_htl_ba_6000 = get_balance_corpus(6000, pd_positive, pd_negative)\n",
"\n",
"ChnSentiCorp_htl_ba_6000.sample(10)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,333 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# anhuidianxinzhidao Overview\n",
"0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1nrg5SRU3Xy1VN85dd85-vg)\n",
"1. **Overview:** 156,000 telecom Q&A records\n",
"2. **Recommended experiments:** FAQ question answering systems\n",
"3. **Data source:** Baidu Zhidao\n",
"4. **Processing:**\n",
" 1. Dropped the id, url, qid, reply_t, and user fields\n",
" 2. Anonymized the question and reply fields"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"path = 'path/to/anhuidianxinzhidao/'  # placeholder: folder that contains the CSV (keep the trailing slash)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. anhuidianxinzhidao_filter.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"pd_all = pd.read_csv(path + 'anhuidianxinzhidao_filter.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field description\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| title | Question title |\n",
"| question | Question body (may be empty) |\n",
"| reply | Reply text |\n",
"| is_best | Whether the reply is the best answer |"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>title</th>\n",
" <th>question</th>\n",
" <th>reply</th>\n",
" <th>is_best</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>129754</th>\n",
" <td>红米no##4x</td>\n",
" <td>NaN</td>\n",
" <td>可以,</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15843</th>\n",
" <td>为什么不能同时用两个电信卡</td>\n",
" <td>NaN</td>\n",
" <td>您好不可以的,目前推出的手机都是不能同时支持两张电信手机卡的,即使是全网通手机也只能在其中的...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23985</th>\n",
" <td>电信181、177、133哪个号段好?</td>\n",
" <td>NaN</td>\n",
" <td>133的</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>72065</th>\n",
" <td>华*荣耀7x和魅蓝note6哪个好</td>\n",
" <td>NaN</td>\n",
" <td>荣耀畅玩7X很不错,性价比很高,以下是手机的配置:1、外观方面:荣耀畅玩7X采用5.93英寸...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11843</th>\n",
" <td>p8青春版电信版多少钱</td>\n",
" <td>NaN</td>\n",
" <td>您好,这款手机价格参考如下</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3280</th>\n",
" <td>华为di####00叫什么</td>\n",
" <td>华为di####00叫什么</td>\n",
" <td>DI####00是华为畅享6S全网通版。华为畅享6S性价比高,是一款很不错的手机。电信新出流...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>143200</th>\n",
" <td>电信版酷派9190L双卡双通可以用移动网络吗</td>\n",
" <td>NaN</td>\n",
" <td>您好电信版双卡双待手机只能使用电信手机卡上网,卡槽2的移动或联通手机卡只能支持2G网络,一般...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>120692</th>\n",
" <td>苹果微信载图怎么载图</td>\n",
" <td>苹果微信载图怎么载图</td>\n",
" <td>您说的应该是截图吧。您可以直接通过苹果手机截图组合按键进行截图操作。直接同时安装电源键和ho...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>109786</th>\n",
" <td>天翼网关的wifi被我关了又没有邦定客户端怎么办想再连wifi该怎么办</td>\n",
" <td>NaN</td>\n",
" <td>您好电信光纤猫的无线网络一般需要破解才能使用的,但破解可能会到帐宽带不稳定或不能正常上网,建...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29030</th>\n",
" <td>v*v*x21是不是全网通</td>\n",
" <td>v*v*x21是不是全网通</td>\n",
" <td>vi###21系列是有vi###21A全网通版本与vi###21移动全网通版本的;此两款机型...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>72603</th>\n",
" <td>电信网上营业厅手机卡办理步骤</td>\n",
" <td>NaN</td>\n",
" <td>中*电信目前是支持网上办理手机号的,下面分享下网上营业厅办理号卡的步骤:1、首先打开浏览器,...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103229</th>\n",
" <td>花呗可以充话费吗</td>\n",
" <td>NaN</td>\n",
" <td>您好,是可以的,目前花呗进行充值话费,每个月只能使用花呗一次,最高不超过500元,如果您已经...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>91507</th>\n",
" <td>荣耀8好还是三星noT4好</td>\n",
" <td>NaN</td>\n",
" <td>如果我选择三星,华为去论坛发个意见都很尴尬。</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>143504</th>\n",
" <td>ios10.2.1能降级吗ios10.2.1怎么降级</td>\n",
" <td>NaN</td>\n",
" <td>IOS设备一旦升级IOS系统就无法降级了,因为:1、IOS采用推荐升级、强制保持最新的升级策...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21999</th>\n",
" <td>电信校园网宽带超一分钟多少钱</td>\n",
" <td>NaN</td>\n",
" <td>由于各地业务情况不同,建议用户通过当地的电信网是营业厅或者手机营业厅了解,也可以直接到附近的...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7644</th>\n",
" <td>有没有人办过开发区的电信卡</td>\n",
" <td>NaN</td>\n",
" <td>您好目前使用电信手机卡的用户非常多,电信手机卡资费更优惠、网络更稳定、网速更快,请放心办理使...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>76835</th>\n",
" <td>请问67###18这个电话号码是哪里的</td>\n",
" <td>NaN</td>\n",
" <td>查吧</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>76752</th>\n",
" <td>电信,铁通,移动,广电。那个网速好呢?</td>\n",
" <td>NaN</td>\n",
" <td>办理宽带推荐您办理电信宽带使用。由于中*电信的服务器、网络架设等较完善,且每年都在不断完善和...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>94290</th>\n",
" <td>三星s8+好用不</td>\n",
" <td>NaN</td>\n",
" <td>S8+的主要特征:1.全视曲面屏:超窄边框、沉浸感视效、双曲面侧屏的显示屏,为您带来更纯粹的...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>79345</th>\n",
" <td>一加手机5玩王者会卡吗?</td>\n",
" <td>NaN</td>\n",
" <td>不会卡,我也推荐你买一加5,它运行内存有8G,玩游戏的时候就能感受到性能有多好,手机不卡,丢...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" title question \\\n",
"129754 红米no##4x NaN \n",
"15843 为什么不能同时用两个电信卡 NaN \n",
"23985 电信181、177、133哪个号段好? NaN \n",
"72065 华*荣耀7x和魅蓝note6哪个好 NaN \n",
"11843 p8青春版电信版多少钱 NaN \n",
"3280 华为di####00叫什么 华为di####00叫什么 \n",
"143200 电信版酷派9190L双卡双通可以用移动网络吗 NaN \n",
"120692 苹果微信载图怎么载图 苹果微信载图怎么载图 \n",
"109786 天翼网关的wifi被我关了又没有邦定客户端怎么办想再连wifi该怎么办 NaN \n",
"29030 v*v*x21是不是全网通 v*v*x21是不是全网通 \n",
"72603 电信网上营业厅手机卡办理步骤 NaN \n",
"103229 花呗可以充话费吗 NaN \n",
"91507 荣耀8好还是三星noT4好 NaN \n",
"143504 ios10.2.1能降级吗ios10.2.1怎么降级 NaN \n",
"21999 电信校园网宽带超一分钟多少钱 NaN \n",
"7644 有没有人办过开发区的电信卡 NaN \n",
"76835 请问67###18这个电话号码是哪里的 NaN \n",
"76752 电信,铁通,移动,广电。那个网速好呢? NaN \n",
"94290 三星s8+好用不 NaN \n",
"79345 一加手机5玩王者会卡吗? NaN \n",
"\n",
" reply is_best \n",
"129754 可以, 0 \n",
"15843 您好不可以的,目前推出的手机都是不能同时支持两张电信手机卡的,即使是全网通手机也只能在其中的... 1 \n",
"23985 133的 0 \n",
"72065 荣耀畅玩7X很不错,性价比很高,以下是手机的配置:1、外观方面:荣耀畅玩7X采用5.93英寸... 1 \n",
"11843 您好,这款手机价格参考如下 1 \n",
"3280 DI####00是华为畅享6S全网通版。华为畅享6S性价比高,是一款很不错的手机。电信新出流... 1 \n",
"143200 您好电信版双卡双待手机只能使用电信手机卡上网,卡槽2的移动或联通手机卡只能支持2G网络,一般... 1 \n",
"120692 您说的应该是截图吧。您可以直接通过苹果手机截图组合按键进行截图操作。直接同时安装电源键和ho... 1 \n",
"109786 您好电信光纤猫的无线网络一般需要破解才能使用的,但破解可能会到帐宽带不稳定或不能正常上网,建... 1 \n",
"29030 vi###21系列是有vi###21A全网通版本与vi###21移动全网通版本的;此两款机型... 0 \n",
"72603 中*电信目前是支持网上办理手机号的,下面分享下网上营业厅办理号卡的步骤:1、首先打开浏览器,... 1 \n",
"103229 您好,是可以的,目前花呗进行充值话费,每个月只能使用花呗一次,最高不超过500元,如果您已经... 0 \n",
"91507 如果我选择三星,华为去论坛发个意见都很尴尬。 0 \n",
"143504 IOS设备一旦升级IOS系统就无法降级了,因为:1、IOS采用推荐升级、强制保持最新的升级策... 1 \n",
"21999 由于各地业务情况不同,建议用户通过当地的电信网是营业厅或者手机营业厅了解,也可以直接到附近的... 1 \n",
"7644 您好目前使用电信手机卡的用户非常多,电信手机卡资费更优惠、网络更稳定、网速更快,请放心办理使... 1 \n",
"76835 查吧 0 \n",
"76752 办理宽带推荐您办理电信宽带使用。由于中*电信的服务器、网络架设等较完善,且每年都在不断完善和... 1 \n",
"94290 S8+的主要特征:1.全视曲面屏:超窄边框、沉浸感视效、双曲面侧屏的显示屏,为您带来更纯粹的... 1 \n",
"79345 不会卡,我也推荐你买一加5,它运行内存有8G,玩游戏的时候就能感受到性能有多好,手机不卡,丢... 1 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_all.sample(n=20)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,355 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# baoxianzhidao_filter Overview\n",
"0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1cgYeIrJHAgb8D33H09Zc5w)\n",
"1. **Overview:** More than 8,000 insurance-industry Q&A records\n",
"2. **Recommended experiments:** FAQ question answering systems\n",
"3. **Data source:** Baidu Zhidao\n",
"4. **Processing:**\n",
" 1. Dropped the id, url, qid, reply_t, and user fields\n",
" 2. Anonymized the question and reply fields"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"path = 'path/to/baoxianzhidao/'  # placeholder: folder that contains the CSV (keep the trailing slash)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. baoxianzhidao_filter.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"pd_all = pd.read_csv(path + 'baoxianzhidao_filter.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field description\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| title | Question title |\n",
"| question | Question body (may be empty) |\n",
"| reply | Reply text |\n",
"| is_best | Whether the reply is the best answer shown on the page |"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>title</th>\n",
" <th>question</th>\n",
" <th>reply</th>\n",
" <th>is_best</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>6733</th>\n",
" <td>五险两金和五险一金有什么区别</td>\n",
" <td>单位招聘,独立待遇中有一项是五险两金。有些单位是五险一金,还有些五险两金。然而我刚毕业小白,...</td>\n",
" <td>五险一金是指:医疗保险,生育保险,工伤保险,失业保险和养老保险,还有住房公积金。五险两金指的...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7580</th>\n",
" <td>户口不在本地如何办医疗保险</td>\n",
" <td>户口不在本地如何办医疗保险</td>\n",
" <td>户口不在本地可以办理医保,通常都是以单位名义进行办理。医疗保险分两种办理方式,一种是单位办理...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6310</th>\n",
" <td>酒精含量百分之二十八保险公司理赔吗?</td>\n",
" <td>NaN</td>\n",
" <td>不会赔</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5843</th>\n",
" <td>我买的二手车,车险都没过户,怎么交保险</td>\n",
" <td>NaN</td>\n",
" <td>要看保险合同了,有的是指定被保险人的,如果你出了险,保险公司是不理赔的。建议尽快去过户,或者...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2790</th>\n",
" <td>保险买交强险后可加其他险种吗</td>\n",
" <td>NaN</td>\n",
" <td>可以的。车险种类包括:1.交强险,交强险[全称机动车交通事故责任强制保险]是我国首个由国家法...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4301</th>\n",
" <td>农村九级伤残赔偿标准我父亲.因矿采煤塌陷至伤残九级应赔多少钱</td>\n",
" <td>农村九级伤残赔偿标准我父亲.因矿采煤塌陷至伤残九级应赔多少钱</td>\n",
" <td>发生九级伤残的赔偿标准主要包括医疗费用、一次性补偿金等等,具体包括这些:医疗费:以医院发票金...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4685</th>\n",
" <td>领着失业金还可以交失业险吗</td>\n",
" <td>NaN</td>\n",
" <td>可以。领取失业金只是说明目前是离职状态,但仍可以居民形式参加保险,但缴纳的只能是医疗保险和养...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7350</th>\n",
" <td>车辆上牌照必须在当地上保险吗</td>\n",
" <td>车辆上牌照必须在当地上保险吗</td>\n",
" <td>不是必须在当地买保险,也可以异地投保,现在很多保险公司开发了异地买汽车保险的购买渠道。但是保...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1611</th>\n",
" <td>泰康人寿保险官网产品多不多,能直接在网上买吗</td>\n",
" <td>NaN</td>\n",
" <td>你想买哪方面保险呢,主要是看给你的服务,国寿现在新*市一款你可以考虑下</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5127</th>\n",
" <td>车出事故对方全责第三者受伤对方保险应怎样理赔?</td>\n",
" <td>NaN</td>\n",
" <td>对方的交强险和第三者责任险可以对第三者的伤害进行赔偿。第三者责任险是保险车辆因意外事故致使第...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4743</th>\n",
" <td>我主责对方次责,对方摩托车无保险怎么赔付?</td>\n",
" <td>我汽车全险有不计免赔,对方摩托车什么都没有。他的车辆损失和医药费是不是由我保险公司出?那我的...</td>\n",
" <td>对方无保险需要自费赔付损失。一般在机动车与机动车之间发生交通事故,由保险公司在机动车第三者责...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5729</th>\n",
" <td>网上买健康保险不用检查身体吗</td>\n",
" <td>我想在慧择网买一款险种,保大病的,但是有个疑问就是,如果不用确认我身体健康就能入保的话,这样...</td>\n",
" <td>通常普通的健康保险是不需要体检的,不过如果年龄、保额超过保险公司规定的限度,就一定需要体检。...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3564</th>\n",
" <td>招商信诺儿童险如何投保?</td>\n",
" <td>NaN</td>\n",
" <td>儿童保险是指用于解决其成长过程中所需要的教育、创业、婚嫁等费用,以及应付孩子可能面临的疾病、...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>824</th>\n",
" <td>医疗保险请问单位交的医疗保险到底有啥用–手机爱问</td>\n",
" <td>NaN</td>\n",
" <td>直接到当地社保处办理就可以了</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4856</th>\n",
" <td>以前办理过养老金,在交要身份证吗</td>\n",
" <td>NaN</td>\n",
" <td>第二次办理养老保险需要的资料1.本地人才市场《劳动保障事物代理委托协议书》2.身份正原件及复...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2054</th>\n",
" <td>江*车可以在防城港买保险吗?</td>\n",
" <td>江*车可以在防城港买保险吗?</td>\n",
" <td>理论上说是可行的。具体要看各地的政策和监管要求是如何运行,不同的城市对异地投保的情况的规定是...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1415</th>\n",
" <td>中英人寿户外保险好吗?有什么好处</td>\n",
" <td>NaN</td>\n",
" <td>建议直接拨打人寿客服电话咨询</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5225</th>\n",
" <td>机动车保险到期多少日内免于处罚</td>\n",
" <td>NaN</td>\n",
" <td>机动车保险到期就等于无保险,机动车交通事故责任强制保险条例第三十九条:机动车所有人、管理人未...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5596</th>\n",
" <td>上学放学途中发生意外,学校购买的意外保险,可以理赔吗</td>\n",
" <td>NaN</td>\n",
" <td>那要看你们学校买的意外保险的条款中有没有限定只负责理赔在校园中发生的意外伤害,如果没有这样的...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7390</th>\n",
" <td>办建筑工人意外险需要交什么证件</td>\n",
" <td>NaN</td>\n",
" <td>需要提供工人的身份证号需要提供建筑公司的组织机构代码证团体意外险投保书填写及盖章一、企业施工...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" title \\\n",
"6733 五险两金和五险一金有什么区别 \n",
"7580 户口不在本地如何办医疗保险 \n",
"6310 酒精含量百分之二十八保险公司理赔吗? \n",
"5843 我买的二手车,车险都没过户,怎么交保险 \n",
"2790 保险买交强险后可加其他险种吗 \n",
"4301 农村九级伤残赔偿标准我父亲.因矿采煤塌陷至伤残九级应赔多少钱 \n",
"4685 领着失业金还可以交失业险吗 \n",
"7350 车辆上牌照必须在当地上保险吗 \n",
"1611 泰康人寿保险官网产品多不多,能直接在网上买吗 \n",
"5127 车出事故对方全责第三者受伤对方保险应怎样理赔? \n",
"4743 我主责对方次责,对方摩托车无保险怎么赔付? \n",
"5729 网上买健康保险不用检查身体吗 \n",
"3564 招商信诺儿童险如何投保? \n",
"824 医疗保险请问单位交的医疗保险到底有啥用–手机爱问 \n",
"4856 以前办理过养老金,在交要身份证吗 \n",
"2054 江*车可以在防城港买保险吗? \n",
"1415 中英人寿户外保险好吗?有什么好处 \n",
"5225 机动车保险到期多少日内免于处罚 \n",
"5596 上学放学途中发生意外,学校购买的意外保险,可以理赔吗 \n",
"7390 办建筑工人意外险需要交什么证件 \n",
"\n",
" question \\\n",
"6733 单位招聘,独立待遇中有一项是五险两金。有些单位是五险一金,还有些五险两金。然而我刚毕业小白,... \n",
"7580 户口不在本地如何办医疗保险 \n",
"6310 NaN \n",
"5843 NaN \n",
"2790 NaN \n",
"4301 农村九级伤残赔偿标准我父亲.因矿采煤塌陷至伤残九级应赔多少钱 \n",
"4685 NaN \n",
"7350 车辆上牌照必须在当地上保险吗 \n",
"1611 NaN \n",
"5127 NaN \n",
"4743 我汽车全险有不计免赔,对方摩托车什么都没有。他的车辆损失和医药费是不是由我保险公司出?那我的... \n",
"5729 我想在慧择网买一款险种,保大病的,但是有个疑问就是,如果不用确认我身体健康就能入保的话,这样... \n",
"3564 NaN \n",
"824 NaN \n",
"4856 NaN \n",
"2054 江*车可以在防城港买保险吗? \n",
"1415 NaN \n",
"5225 NaN \n",
"5596 NaN \n",
"7390 NaN \n",
"\n",
" reply is_best \n",
"6733 五险一金是指:医疗保险,生育保险,工伤保险,失业保险和养老保险,还有住房公积金。五险两金指的... 0 \n",
"7580 户口不在本地可以办理医保,通常都是以单位名义进行办理。医疗保险分两种办理方式,一种是单位办理... 1 \n",
"6310 不会赔 0 \n",
"5843 要看保险合同了,有的是指定被保险人的,如果你出了险,保险公司是不理赔的。建议尽快去过户,或者... 0 \n",
"2790 可以的。车险种类包括:1.交强险,交强险[全称机动车交通事故责任强制保险]是我国首个由国家法... 1 \n",
"4301 发生九级伤残的赔偿标准主要包括医疗费用、一次性补偿金等等,具体包括这些:医疗费:以医院发票金... 1 \n",
"4685 可以。领取失业金只是说明目前是离职状态,但仍可以居民形式参加保险,但缴纳的只能是医疗保险和养... 1 \n",
"7350 不是必须在当地买保险,也可以异地投保,现在很多保险公司开发了异地买汽车保险的购买渠道。但是保... 0 \n",
"1611 你想买哪方面保险呢,主要是看给你的服务,国寿现在新*市一款你可以考虑下 0 \n",
"5127 对方的交强险和第三者责任险可以对第三者的伤害进行赔偿。第三者责任险是保险车辆因意外事故致使第... 1 \n",
"4743 对方无保险需要自费赔付损失。一般在机动车与机动车之间发生交通事故,由保险公司在机动车第三者责... 1 \n",
"5729 通常普通的健康保险是不需要体检的,不过如果年龄、保额超过保险公司规定的限度,就一定需要体检。... 1 \n",
"3564 儿童保险是指用于解决其成长过程中所需要的教育、创业、婚嫁等费用,以及应付孩子可能面临的疾病、... 1 \n",
"824 直接到当地社保处办理就可以了 0 \n",
"4856 第二次办理养老保险需要的资料1.本地人才市场《劳动保障事物代理委托协议书》2.身份正原件及复... 1 \n",
"2054 理论上说是可行的。具体要看各地的政策和监管要求是如何运行,不同的城市对异地投保的情况的规定是... 1 \n",
"1415 建议直接拨打人寿客服电话咨询 0 \n",
"5225 机动车保险到期就等于无保险,机动车交通事故责任强制保险条例第三十九条:机动车所有人、管理人未... 1 \n",
"5596 那要看你们学校买的意外保险的条款中有没有限定只负责理赔在校园中发生的意外伤害,如果没有这样的... 0 \n",
"7390 需要提供工人的身份证号需要提供建筑公司的组织机构代码证团体意外险投保书填写及盖章一、企业施工... 0 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_all.sample(n=20)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,194 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# About dh_msra\n",
"0. **Download:** [Github](https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/dh_msra/dh_msra.zip)\n",
"1. **Overview:** more than 50,000 annotated samples for Chinese named entity recognition, in [IOB2](https://dl.acm.org/citation.cfm?id=977059) format, following the [CoNLL 2002](https://www.clips.uantwerpen.be/conll2002/ner/) and [CRF++](https://taku910.github.io/crfpp/#format) conventions\n",
"2. **Suggested experiments:** Chinese named entity recognition\n",
"3. **Data source:** unknown\n",
"4. **Original dataset:** [zh-NER-TF](https://github.com/Determined22/zh-NER-TF); collected from the web, the exact author and source are unknown, possibly a corpus from MSRA\n",
"5. **Processing:**\n",
"    1. Merged the original 2 files (train and test) into 1 file"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import codecs\n",
"import random\n",
"\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"path = 'path_to_dh_msra_folder/'  # placeholder; must end with a path separator"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. dh_msra.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def load_iob2(file_path):\n",
"    '''Load data in IOB2 format: one token/label pair per line, blank line between sequences.'''\n",
"    token_seqs = []\n",
"    label_seqs = []\n",
"    tokens = []\n",
"    labels = []\n",
"    # Assumption: the file is UTF-8; set the encoding explicitly instead of relying on the platform default\n",
"    with codecs.open(file_path, encoding='utf-8') as f:\n",
"        for index, line in enumerate(f):\n",
"            items = line.strip().split()\n",
"            if len(items) == 2:\n",
"                token, label = items\n",
"                tokens.append(token)\n",
"                labels.append(label)\n",
"            elif len(items) == 0:\n",
"                if tokens:\n",
"                    token_seqs.append(tokens)\n",
"                    label_seqs.append(labels)\n",
"                    tokens = []\n",
"                    labels = []\n",
"            else:\n",
"                print('Malformed line. Line number: {} Content: {}'.format(index, line))\n",
"                continue\n",
"\n",
"    if tokens:  # No trailing blank line in the file: append the last sequence manually\n",
"        token_seqs.append(tokens)\n",
"        label_seqs.append(labels)\n",
"\n",
"    # Sequences differ in length, so store them as object arrays (recent NumPy rejects ragged input otherwise)\n",
"    return np.array(token_seqs, dtype=object), np.array(label_seqs, dtype=object)\n",
"\n",
"\n",
"def show_iob2(token_seqs, label_seqs, num=5, shuffle=True):\n",
"    '''Print IOB2 data as token/label pairs, one sequence per paragraph.'''\n",
"    if shuffle:\n",
"        length = len(token_seqs)\n",
"        # Sample num random indexes (with replacement)\n",
"        indexes = [random.randrange(0, length) for i in range(num)]\n",
"        zip_seqs = zip(token_seqs[indexes], label_seqs[indexes])\n",
"    else:\n",
"        zip_seqs = zip(token_seqs[0:num], label_seqs[0:num])\n",
"\n",
"    for tokens, labels in zip_seqs:\n",
"        for token, label in zip(tokens, labels):\n",
"            print('{}/{} '.format(token, label), end='')\n",
"        print('\\n')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"55289 55289\n",
"\n",
"目/O 前/O “/O 继/B-PER 生/I-PER ”/O 共/O 产/O 仔/O 5/O 胎/O /O 产/O 下/O 小/O 老/O 虎/O 1/O 8/O 只/O /O 堪/O 称/O 虎/O 妈/O 妈/O 中/O 的/O 英/O 雄/O 。/O \n",
"\n",
"历/O 史/O 的/O 内/O 涵/O 是/O 很/O 丰/O 富/O 的/O /O 经/O 典/O 作/O 家/O 的/O 论/O 断/O 固/O 然/O 有/O 其/O 权/O 威/O 性/O 和/O 合/O 理/O 性/O /O 但/O 历/O 史/O 学/O 家/O 显/O 然/O 不/O 能/O 局/O 限/O 于/O 此/O 。/O \n",
"\n",
"5/O 月/O 3/O 0/O 日/O 在/O 中/B-LOC 国/I-LOC 革/I-LOC 命/I-LOC 军/I-LOC 事/I-LOC 博/I-LOC 物/I-LOC 馆/I-LOC 开/O 幕/O 的/O 全/O 国/O 禁/O 毒/O 展/O 览/O /O 在/O 社/O 会/O 上/O 引/O 起/O 了/O 强/O 烈/O 的/O 反/O 响/O 。/O \n",
"\n",
"另/O 外/O /O 还/O 有/O 一/O 个/O 惊/O 人/O 的/O 发/O 现/O /O 有/O 的/O 发/O 展/O 中/O 国/O 家/O 人/O 均/O 国/O 民/O 资/O 源/O 非/O 常/O 丰/O 富/O /O 但/O 发/O 展/O 不/O 起/O 来/O 的/O 原/O 因/O 在/O 于/O 教/O 育/O 水/O 平/O 太/O 低/O 、/O 对/O 技/O 术/O 的/O 理/O 解/O 和/O 把/O 握/O 太/O 低/O 、/O 管/O 理/O 水/O 平/O 太/O 低/O 等/O 等/O /O 一/O 句/O 话/O /O 智/O 力/O 资/O 本/O 太/O 贫/O 乏/O 。/O \n",
"\n",
"这/O 还/O 要/O 看/O 进/O 一/O 步/O 深/O 入/O 调/O 查/O 的/O 结/O 果/O 。/O \n",
"\n"
]
}
],
"source": [
"token_seqs, label_seqs = load_iob2(path+'dh_msra.txt')\n",
"\n",
"print(len(token_seqs), len(label_seqs))\n",
"print() \n",
"show_iob2(token_seqs, label_seqs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Label descriptions\n",
"\n",
"| Label | Description |\n",
"| ---- | ---- |\n",
"| LOC | Location |\n",
"| ORG | Organization |\n",
"| PER | Person |"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'B-LOC', 'B-ORG', 'B-PER', 'I-LOC', 'I-ORG', 'I-PER', 'O'}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"set([label for labels in label_seqs for label in labels])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,987 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# About dmsc_v2\n",
"0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1c0yn3TlkzHYTdEBz3T5arA)\n",
"1. **Overview:** 28 movies, more than 700,000 users, more than 2 million ratings/comments\n",
"2. **Suggested experiments:** recommender systems; sentiment/opinion/review polarity analysis\n",
"3. **Data source:** [Douban Movie](https://movie.douban.com/)\n",
"4. **Original dataset:** [Douban Movie Short Comments Dataset V2](https://www.kaggle.com/utmhikari/doubanmovieshortcomments)\n",
"5. **Processing:**\n",
"    1. Deduplicated and reorganized into a format compatible with [MovieLens](https://grouplens.org/datasets/movielens/)\n",
"    2. Anonymized to protect user privacy"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"path = 'path_to_dmsc_folder/'  # placeholder; must end with a path separator"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. movies.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of movies: 28\n"
]
}
],
"source": [
"movies = pd.read_csv(path + 'movies.csv')\n",
"\n",
"print('Number of movies: %d' % movies.shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| movieId | Movie id (consecutive, starting from 0) |\n",
"| title | English title |\n",
"| title_cn | Chinese title |"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>movieId</th>\n",
" <th>title</th>\n",
" <th>title_cn</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>Avengers Age of Ultron</td>\n",
" <td>复仇者联盟2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>Big Fish and Begonia</td>\n",
" <td>大鱼海棠</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>Captain America Civil War</td>\n",
" <td>美国队长3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>Chinese Zodiac</td>\n",
" <td>十二生肖</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>Chronicles of the Ghostly Tribe</td>\n",
" <td>九层妖塔</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5</td>\n",
" <td>CUG King of Heroes</td>\n",
" <td>大圣归来</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>6</td>\n",
" <td>Forever Young</td>\n",
" <td>栀子花开</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>7</td>\n",
" <td>Goodbye Mr. Loser</td>\n",
" <td>夏洛特烦恼</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>8</td>\n",
" <td>Iron Man</td>\n",
" <td>钢铁侠1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>9</td>\n",
" <td>Journey to the West Conquering the Demons</td>\n",
" <td>西游降魔篇</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>10</td>\n",
" <td>Journey to the West The Demons Strike Back</td>\n",
" <td>西游伏妖篇</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>11</td>\n",
" <td>La La Land</td>\n",
" <td>爱乐之城</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>12</td>\n",
" <td>Lost In Thailand</td>\n",
" <td>泰囧</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>13</td>\n",
" <td>My Sunshine</td>\n",
" <td>何以笙箫默</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>14</td>\n",
" <td>Operation Mekong</td>\n",
" <td>湄公河行动</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>15</td>\n",
" <td>Soulmate</td>\n",
" <td>七月与安生</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>16</td>\n",
" <td>The Avengers</td>\n",
" <td>复仇者联盟</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>17</td>\n",
" <td>The Continent</td>\n",
" <td>后会无期</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>18</td>\n",
" <td>The Ghouls</td>\n",
" <td>寻龙诀</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>19</td>\n",
" <td>The Great Wall</td>\n",
" <td>长城</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>20</td>\n",
" <td>The Left Ear</td>\n",
" <td>左耳</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>21</td>\n",
" <td>The Mermaid</td>\n",
" <td>美人鱼</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>22</td>\n",
" <td>Tiny Times 1.0</td>\n",
" <td>小时代1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>23</td>\n",
" <td>Tiny Times 3.0</td>\n",
" <td>小时代3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>24</td>\n",
" <td>Train to Busan</td>\n",
" <td>釜山行</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>25</td>\n",
" <td>Transformers Age of Extinction</td>\n",
" <td>变形金刚4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>26</td>\n",
" <td>Your Name</td>\n",
" <td>你的名字</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>27</td>\n",
" <td>Zootopia</td>\n",
" <td>疯狂动物城</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" movieId title title_cn\n",
"0 0 Avengers Age of Ultron 复仇者联盟2\n",
"1 1 Big Fish and Begonia 大鱼海棠\n",
"2 2 Captain America Civil War 美国队长3\n",
"3 3 Chinese Zodiac 十二生肖\n",
"4 4 Chronicles of the Ghostly Tribe 九层妖塔\n",
"5 5 CUG King of Heroes 大圣归来\n",
"6 6 Forever Young 栀子花开\n",
"7 7 Goodbye Mr. Loser 夏洛特烦恼\n",
"8 8 Iron Man 钢铁侠1\n",
"9 9 Journey to the West Conquering the Demons 西游降魔篇\n",
"10 10 Journey to the West The Demons Strike Back 西游伏妖篇\n",
"11 11 La La Land 爱乐之城\n",
"12 12 Lost In Thailand 泰囧\n",
"13 13 My Sunshine 何以笙箫默\n",
"14 14 Operation Mekong 湄公河行动\n",
"15 15 Soulmate 七月与安生\n",
"16 16 The Avengers 复仇者联盟\n",
"17 17 The Continent 后会无期\n",
"18 18 The Ghouls 寻龙诀\n",
"19 19 The Great Wall 长城\n",
"20 20 The Left Ear 左耳\n",
"21 21 The Mermaid 美人鱼\n",
"22 22 Tiny Times 1.0 小时代1\n",
"23 23 Tiny Times 3.0 小时代3\n",
"24 24 Train to Busan 釜山行\n",
"25 25 Transformers Age of Extinction 变形金刚4\n",
"26 26 Your Name 你的名字\n",
"27 27 Zootopia 疯狂动物城"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. ratings.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of users: 738701\n",
"Number of ratings: 2125056\n"
]
}
],
"source": [
"ratings = pd.read_csv(path + 'ratings.csv')\n",
"\n",
"print('Number of users: %d' % ratings.userId.unique().shape[0])\n",
"print('Number of ratings: %d' % ratings.shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| Field | Description |\n",
"| ---- | ---- |\n",
"| userId | User id (consecutive, starting from 0) |\n",
"| movieId | Same as movieId in movies.csv |\n",
"| rating | Rating, an integer in [1, 5] |\n",
"| timestamp | Rating timestamp |\n",
"| comment | Comment text |\n",
"| like | Number of likes the comment received |"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>userId</th>\n",
" <th>movieId</th>\n",
" <th>rating</th>\n",
" <th>timestamp</th>\n",
" <th>comment</th>\n",
" <th>like</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1763779</th>\n",
" <td>130888</td>\n",
" <td>24</td>\n",
" <td>5</td>\n",
" <td>1474560000</td>\n",
" <td>原著的剧本不是这样的,而是最后只有那个自私鬼活了下来。孕妇中枪,小孩中枪的时候哭出了声音,...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1608147</th>\n",
" <td>23695</td>\n",
" <td>22</td>\n",
" <td>2</td>\n",
" <td>1377360000</td>\n",
" <td>郭敬明真的要为中国产生如此大规模的青少年脑残群体负一定责任 = =</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1735498</th>\n",
" <td>323858</td>\n",
" <td>24</td>\n",
" <td>3</td>\n",
" <td>1473696000</td>\n",
" <td>三分不能再多。其中一分给壮汉大叔,帅过男主。</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1631095</th>\n",
" <td>218188</td>\n",
" <td>22</td>\n",
" <td>3</td>\n",
" <td>1372953600</td>\n",
" <td>柯震东露点 给三星 后面的彩蛋很欢乐</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1193163</th>\n",
" <td>155900</td>\n",
" <td>17</td>\n",
" <td>4</td>\n",
" <td>1406390400</td>\n",
" <td>给四星不是因为电影有那么好,文艺腔调有,公路片元素够,但好看程度其实低于预期,但是因为是韩...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1874658</th>\n",
" <td>8534</td>\n",
" <td>26</td>\n",
" <td>4</td>\n",
" <td>1480780800</td>\n",
" <td>身体互换和改变未来都是老梗了,算是半新不旧的瓶装了个旧酒吧,不过倒是不错,意外的好看,伏笔...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>645671</th>\n",
" <td>312247</td>\n",
" <td>9</td>\n",
" <td>4</td>\n",
" <td>1476979200</td>\n",
" <td>念念不忘,必有回响…</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1681543</th>\n",
" <td>284941</td>\n",
" <td>23</td>\n",
" <td>4</td>\n",
" <td>1409673600</td>\n",
" <td>看到她们在雪地的那段,居然很感动</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1042238</th>\n",
" <td>100689</td>\n",
" <td>15</td>\n",
" <td>5</td>\n",
" <td>1474214400</td>\n",
" <td>以前看安妮宝贝时期....最喜欢的小说之一</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1672379</th>\n",
" <td>139726</td>\n",
" <td>23</td>\n",
" <td>2</td>\n",
" <td>1406736000</td>\n",
" <td>郭小四不是标榜自己时尚品味吗?四个女主一个镜头换一身皮草哪来的品味啊??(客观的说,叙事增...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1823549</th>\n",
" <td>447412</td>\n",
" <td>25</td>\n",
" <td>2</td>\n",
" <td>1405958400</td>\n",
" <td>擎天柱胸前蓝色的部分装着生命所需的能量和他的记忆。这让我更加坚信一些东西,只是然后的然后我...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1112590</th>\n",
" <td>495975</td>\n",
" <td>16</td>\n",
" <td>4</td>\n",
" <td>1336838400</td>\n",
" <td>浩克抖包袱……</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>210239</th>\n",
" <td>123095</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>1390320000</td>\n",
" <td>轻松愉快,打斗设置还不错</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2093623</th>\n",
" <td>232598</td>\n",
" <td>27</td>\n",
" <td>5</td>\n",
" <td>1474560000</td>\n",
" <td>比之前大热的冰雪奇缘好太多,一部全家人都可以坐在一起看的电影。</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>583777</th>\n",
" <td>322422</td>\n",
" <td>8</td>\n",
" <td>5</td>\n",
" <td>1301500800</td>\n",
" <td>的确比蜘蛛侠超人什么什么的好看</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1914937</th>\n",
" <td>75819</td>\n",
" <td>26</td>\n",
" <td>4</td>\n",
" <td>1473955200</td>\n",
" <td>真的棒。但是我自己还是不那么喜欢。</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1211561</th>\n",
" <td>514748</td>\n",
" <td>17</td>\n",
" <td>4</td>\n",
" <td>1407427200</td>\n",
" <td>。</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1965672</th>\n",
" <td>704638</td>\n",
" <td>26</td>\n",
" <td>5</td>\n",
" <td>1480953600</td>\n",
" <td>陪朋友去看的,本身我是拒绝这类小清新的电影的,而且在刚开始的时候说实话没怎么看懂,不过看到...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1935211</th>\n",
" <td>259717</td>\n",
" <td>26</td>\n",
" <td>4</td>\n",
" <td>1480694400</td>\n",
" <td>时间与空间错乱里的爱情 温暖又幽默</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>839108</th>\n",
" <td>426801</td>\n",
" <td>11</td>\n",
" <td>5</td>\n",
" <td>1486742400</td>\n",
" <td>Here is to the ones who dream.</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" userId movieId rating timestamp \\\n",
"1763779 130888 24 5 1474560000 \n",
"1608147 23695 22 2 1377360000 \n",
"1735498 323858 24 3 1473696000 \n",
"1631095 218188 22 3 1372953600 \n",
"1193163 155900 17 4 1406390400 \n",
"1874658 8534 26 4 1480780800 \n",
"645671 312247 9 4 1476979200 \n",
"1681543 284941 23 4 1409673600 \n",
"1042238 100689 15 5 1474214400 \n",
"1672379 139726 23 2 1406736000 \n",
"1823549 447412 25 2 1405958400 \n",
"1112590 495975 16 4 1336838400 \n",
"210239 123095 3 4 1390320000 \n",
"2093623 232598 27 5 1474560000 \n",
"583777 322422 8 5 1301500800 \n",
"1914937 75819 26 4 1473955200 \n",
"1211561 514748 17 4 1407427200 \n",
"1965672 704638 26 5 1480953600 \n",
"1935211 259717 26 4 1480694400 \n",
"839108 426801 11 5 1486742400 \n",
"\n",
" comment like \n",
"1763779 原著的剧本不是这样的,而是最后只有那个自私鬼活了下来。孕妇中枪,小孩中枪的时候哭出了声音,... 1 \n",
"1608147 郭敬明真的要为中国产生如此大规模的青少年脑残群体负一定责任 = = 0 \n",
"1735498 三分不能再多。其中一分给壮汉大叔,帅过男主。 0 \n",
"1631095 柯震东露点 给三星 后面的彩蛋很欢乐 0 \n",
"1193163 给四星不是因为电影有那么好,文艺腔调有,公路片元素够,但好看程度其实低于预期,但是因为是韩... 0 \n",
"1874658 身体互换和改变未来都是老梗了,算是半新不旧的瓶装了个旧酒吧,不过倒是不错,意外的好看,伏笔... 1 \n",
"645671 念念不忘,必有回响… 0 \n",
"1681543 看到她们在雪地的那段,居然很感动 0 \n",
"1042238 以前看安妮宝贝时期....最喜欢的小说之一 0 \n",
"1672379 郭小四不是标榜自己时尚品味吗?四个女主一个镜头换一身皮草哪来的品味啊??(客观的说,叙事增... 0 \n",
"1823549 擎天柱胸前蓝色的部分装着生命所需的能量和他的记忆。这让我更加坚信一些东西,只是然后的然后我... 0 \n",
"1112590 浩克抖包袱…… 0 \n",
"210239 轻松愉快,打斗设置还不错 0 \n",
"2093623 比之前大热的冰雪奇缘好太多,一部全家人都可以坐在一起看的电影。 0 \n",
"583777 的确比蜘蛛侠超人什么什么的好看 0 \n",
"1914937 真的棒。但是我自己还是不那么喜欢。 0 \n",
"1211561 。 0 \n",
"1965672 陪朋友去看的,本身我是拒绝这类小清新的电影的,而且在刚开始的时候说实话没怎么看懂,不过看到... 0 \n",
"1935211 时间与空间错乱里的爱情 温暖又幽默 0 \n",
"839108 Here is to the ones who dream. 0 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ratings.sample(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Use for sentiment/opinion/review polarity analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Select comments with a clear polarity (1-star and 5-star ratings)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"正向(5星)数目:638106\n",
"负向(1星)数目:190927\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>userId</th>\n",
" <th>movieId</th>\n",
" <th>rating</th>\n",
" <th>timestamp</th>\n",
" <th>comment</th>\n",
" <th>like</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>405540</th>\n",
" <td>251302</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>1436976000</td>\n",
" <td>路人转自来水!大圣帅气!我要生猴子~~~^-^</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>159308</th>\n",
" <td>18639</td>\n",
" <td>2</td>\n",
" <td>5</td>\n",
" <td>1462636800</td>\n",
" <td>冬兵从醒了以后就应该要求被冻起来,美队这个人烂的真要命。心疼tony。</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1329674</th>\n",
" <td>127217</td>\n",
" <td>18</td>\n",
" <td>5</td>\n",
" <td>1451059200</td>\n",
" <td>超级棒!远远超出预期 免费水军来了哈哈哈哈</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1945766</th>\n",
" <td>75720</td>\n",
" <td>26</td>\n",
" <td>5</td>\n",
" <td>1476460800</td>\n",
" <td>为爱而动</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1706244</th>\n",
" <td>29721</td>\n",
" <td>23</td>\n",
" <td>1</td>\n",
" <td>1406131200</td>\n",
" <td>看小时代3的时候真是太壮观了整个场子那个乱啊打电话的聊天的中途上厕所的没办法大家提不起兴趣...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1271715</th>\n",
" <td>546029</td>\n",
" <td>17</td>\n",
" <td>1</td>\n",
" <td>1406217600</td>\n",
" <td>可以给零分么</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>394698</th>\n",
" <td>243184</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>1437926400</td>\n",
" <td>一直听网友说好,今天去电影院看了下。真的不错,是中国动漫的一个值得一看的作品。太多的喜羊羊...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>324077</th>\n",
" <td>208900</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>1437062400</td>\n",
" <td>先吐槽一下自己的泪点,太低了。小和尚太像弟弟小时候的样子了。整部电影是良心之作,国产地影这...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1004222</th>\n",
" <td>186241</td>\n",
" <td>14</td>\n",
" <td>5</td>\n",
" <td>1475942400</td>\n",
" <td>主旋律片的杰出代表,节奏顺畅快速。看得人热血沸腾!</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>198523</th>\n",
" <td>5774</td>\n",
" <td>2</td>\n",
" <td>5</td>\n",
" <td>1462723200</td>\n",
" <td>迄今看过最精彩的漫威电影 其实整个剧情核心是复仇 但是这个复仇点真心满怪的 队长还是一如既...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2014461</th>\n",
" <td>25511</td>\n",
" <td>27</td>\n",
" <td>5</td>\n",
" <td>1457280000</td>\n",
" <td>try everything!动物界的乌托邦 nick真的好苏好腹黑啊啊(原谅我带入了小说</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2101031</th>\n",
" <td>727978</td>\n",
" <td>27</td>\n",
" <td>5</td>\n",
" <td>1462550400</td>\n",
" <td>讲真很棒!</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1614137</th>\n",
" <td>64084</td>\n",
" <td>22</td>\n",
" <td>1</td>\n",
" <td>1374768000</td>\n",
" <td>最后雪中的姐妹情。</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1980114</th>\n",
" <td>248321</td>\n",
" <td>26</td>\n",
" <td>5</td>\n",
" <td>1480867200</td>\n",
" <td>时空的跨越,绝对不能忘记的,你的名字。</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1829632</th>\n",
" <td>18891</td>\n",
" <td>25</td>\n",
" <td>5</td>\n",
" <td>1403884800</td>\n",
" <td>请记住一个特效片是不需要完美剧情的。在电影院看的就是特效,没有其他。给特效满分。顶端水平。...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>276335</th>\n",
" <td>186281</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1443715200</td>\n",
" <td>不知道在演什么鬼</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2090682</th>\n",
" <td>214830</td>\n",
" <td>27</td>\n",
" <td>5</td>\n",
" <td>1457193600</td>\n",
" <td>这狐狸怎么那么苏!!!反差萌的梗简直炉火纯青</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2108227</th>\n",
" <td>731117</td>\n",
" <td>27</td>\n",
" <td>5</td>\n",
" <td>1458403200</td>\n",
" <td>树懒梗可爱到爆。乌托邦社会的构建反讽了乌托邦社会设想,号称没有偏见的世界里,本身就是由偏见...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>864728</th>\n",
" <td>10418</td>\n",
" <td>12</td>\n",
" <td>5</td>\n",
" <td>1355673600</td>\n",
" <td>啥也不说了,从头笑到尾,差点没乐死我,最后又赚了些感动</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>422130</th>\n",
" <td>263856</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>1436544000</td>\n",
" <td>很感动很用心</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" userId movieId rating timestamp \\\n",
"405540 251302 5 5 1436976000 \n",
"159308 18639 2 5 1462636800 \n",
"1329674 127217 18 5 1451059200 \n",
"1945766 75720 26 5 1476460800 \n",
"1706244 29721 23 1 1406131200 \n",
"1271715 546029 17 1 1406217600 \n",
"394698 243184 5 5 1437926400 \n",
"324077 208900 5 5 1437062400 \n",
"1004222 186241 14 5 1475942400 \n",
"198523 5774 2 5 1462723200 \n",
"2014461 25511 27 5 1457280000 \n",
"2101031 727978 27 5 1462550400 \n",
"1614137 64084 22 1 1374768000 \n",
"1980114 248321 26 5 1480867200 \n",
"1829632 18891 25 5 1403884800 \n",
"276335 186281 4 1 1443715200 \n",
"2090682 214830 27 5 1457193600 \n",
"2108227 731117 27 5 1458403200 \n",
"864728 10418 12 5 1355673600 \n",
"422130 263856 5 5 1436544000 \n",
"\n",
" comment like \n",
"405540 路人转自来水!大圣帅气!我要生猴子~~~^-^ 0 \n",
"159308 冬兵从醒了以后就应该要求被冻起来,美队这个人烂的真要命。心疼tony。 0 \n",
"1329674 超级棒!远远超出预期 免费水军来了哈哈哈哈 0 \n",
"1945766 为爱而动 0 \n",
"1706244 看小时代3的时候真是太壮观了整个场子那个乱啊打电话的聊天的中途上厕所的没办法大家提不起兴趣... 0 \n",
"1271715 可以给零分么 0 \n",
"394698 一直听网友说好,今天去电影院看了下。真的不错,是中国动漫的一个值得一看的作品。太多的喜羊羊... 0 \n",
"324077 先吐槽一下自己的泪点,太低了。小和尚太像弟弟小时候的样子了。整部电影是良心之作,国产地影这... 0 \n",
"1004222 主旋律片的杰出代表,节奏顺畅快速。看得人热血沸腾! 0 \n",
"198523 迄今看过最精彩的漫威电影 其实整个剧情核心是复仇 但是这个复仇点真心满怪的 队长还是一如既... 0 \n",
"2014461 try everything!动物界的乌托邦 nick真的好苏好腹黑啊啊(原谅我带入了小说 0 \n",
"2101031 讲真很棒! 0 \n",
"1614137 最后雪中的姐妹情。 0 \n",
"1980114 时空的跨越,绝对不能忘记的,你的名字。 0 \n",
"1829632 请记住一个特效片是不需要完美剧情的。在电影院看的就是特效,没有其他。给特效满分。顶端水平。... 0 \n",
"276335 不知道在演什么鬼 1 \n",
"2090682 这狐狸怎么那么苏!!!反差萌的梗简直炉火纯青 4 \n",
"2108227 树懒梗可爱到爆。乌托邦社会的构建反讽了乌托邦社会设想,号称没有偏见的世界里,本身就是由偏见... 0 \n",
"864728 啥也不说了,从头笑到尾,差点没乐死我,最后又赚了些感动 0 \n",
"422130 很感动很用心 0 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ratings_with_opinions = ratings[(ratings.rating==1) | (ratings.rating==5)]\n",
"\n",
"\n",
"print('正向(5星)数目:%d' % (ratings_with_opinions[ratings_with_opinions.rating==5].shape[0]))\n",
"print('负向(1星)数目:%d' % (ratings_with_opinions[ratings_with_opinions.rating==1].shape[0]))\n",
"\n",
"ratings_with_opinions.sample(20)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "keras"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,780 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# About ez_douban\n",
"0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1DkN1LmdSMzm_jCBKhbPbig)\n",
"1. **Overview:** more than 50,000 movies (over 30,000 with a title, over 20,000 without), about 28,000 users, 2.8 million ratings\n",
"2. **Suggested experiments:** recommender systems\n",
"3. **Data source:** [Douban Movie](https://movie.douban.com/)\n",
"4. **Original dataset:** [Douban-1 and Douban-2](https://sites.google.com/site/erhengzhong/datasets), collected by Dr. Erheng Zhong for papers published at KDD'12, TKDD'14, and SDM'12\n",
"5. **Processing:**\n",
"    1. Removed the unused status field and the invalid ratings from Douban-1, and reorganized the data into a format compatible with [MovieLens](https://grouplens.org/datasets/movielens/)\n",
"    2. Extracted movie and link information from Douban-2 and joined it with the rating data from Douban-1\n",
"    3. Anonymized to protect user privacy"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"path = 'path_to_ez_douban_folder/'  # placeholder; must end with a path separator"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. movies.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of movies (with a title): 33258\n",
"Number of movies (without a title): 24166\n",
"Number of movies (total): 57424\n"
]
}
],
"source": [
"movies = pd.read_csv(path + 'movies.csv')\n",
"\n",
"print('Number of movies (with a title): %d' % movies[~pd.isnull(movies.title)].shape[0])\n",
"print('Number of movies (without a title): %d' % movies[pd.isnull(movies.title)].shape[0])\n",
"print('Number of movies (total): %d' % movies.shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| movieId | Movie id (consecutive, starting from 0) |\n",
"| title | Movie title |"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>movieId</th>\n",
" <th>title</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>41807</th>\n",
" <td>41807</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16521</th>\n",
" <td>16521</td>\n",
" <td>五女拜寿</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10689</th>\n",
" <td>10689</td>\n",
" <td>La pelote de laine</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21653</th>\n",
" <td>21653</td>\n",
" <td>Ma mha 4 khaa khrap</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36630</th>\n",
" <td>36630</td>\n",
" <td>the sky the earth and the rain</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31734</th>\n",
" <td>31734</td>\n",
" <td>Viva María!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31530</th>\n",
" <td>31530</td>\n",
" <td>远路</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22553</th>\n",
" <td>22553</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32346</th>\n",
" <td>32346</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29429</th>\n",
" <td>29429</td>\n",
" <td>The Crazies</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34912</th>\n",
" <td>34912</td>\n",
" <td>Stestí</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10350</th>\n",
" <td>10350</td>\n",
" <td>羊のうた</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31487</th>\n",
" <td>31487</td>\n",
" <td>一触即发</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50688</th>\n",
" <td>50688</td>\n",
" <td>还君明珠</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40769</th>\n",
" <td>40769</td>\n",
" <td>Red Riding Hood</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32748</th>\n",
" <td>32748</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17204</th>\n",
" <td>17204</td>\n",
" <td>작은아씨들</td>\n",
" </tr>\n",
" <tr>\n",
" <th>55870</th>\n",
" <td>55870</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>42879</th>\n",
" <td>42879</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26432</th>\n",
" <td>26432</td>\n",
" <td>后门</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" movieId title\n",
"41807 41807 NaN\n",
"16521 16521 五女拜寿\n",
"10689 10689 La pelote de laine\n",
"21653 21653 Ma mha 4 khaa khrap\n",
"36630 36630 the sky the earth and the rain\n",
"31734 31734 Viva María!\n",
"31530 31530 远路\n",
"22553 22553 NaN\n",
"32346 32346 NaN\n",
"29429 29429 The Crazies\n",
"34912 34912 Stestí\n",
"10350 10350 羊のうた\n",
"31487 31487 一触即发\n",
"50688 50688 还君明珠\n",
"40769 40769 Red Riding Hood\n",
"32748 32748 NaN\n",
"17204 17204 작은아씨들\n",
"55870 55870 NaN\n",
"42879 42879 NaN\n",
"26432 26432 后门"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies.sample(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. ratings.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of users: 28718\n",
"Number of ratings: 2828585\n"
]
}
],
"source": [
"ratings = pd.read_csv(path + 'ratings.csv')\n",
"\n",
"# Count distinct users and total rating rows\n",
"print('Number of users: %d' % ratings.userId.unique().shape[0])\n",
"print('Number of ratings: %d' % ratings.shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| Field | Description |\n",
"| ---- | ---- |\n",
"| userId | User id (numbered consecutively from 0) |\n",
"| movieId | Same movieId as in movies.csv |\n",
"| rating | Rating, an integer in [1, 5] |\n",
"| timestamp | Timestamp of the rating |"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>userId</th>\n",
" <th>movieId</th>\n",
" <th>rating</th>\n",
" <th>timestamp</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1234569</th>\n",
" <td>4825</td>\n",
" <td>14852</td>\n",
" <td>5</td>\n",
" <td>1263084471</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1817521</th>\n",
" <td>7121</td>\n",
" <td>140</td>\n",
" <td>4</td>\n",
" <td>1259054160</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2417373</th>\n",
" <td>9449</td>\n",
" <td>116</td>\n",
" <td>3</td>\n",
" <td>1255344370</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1234106</th>\n",
" <td>4822</td>\n",
" <td>685</td>\n",
" <td>5</td>\n",
" <td>1124800342</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2044878</th>\n",
" <td>7996</td>\n",
" <td>22343</td>\n",
" <td>4</td>\n",
" <td>1254639194</td>\n",
" </tr>\n",
" <tr>\n",
" <th>239277</th>\n",
" <td>947</td>\n",
" <td>5730</td>\n",
" <td>5</td>\n",
" <td>1253992436</td>\n",
" </tr>\n",
" <tr>\n",
" <th>305034</th>\n",
" <td>1178</td>\n",
" <td>9839</td>\n",
" <td>5</td>\n",
" <td>1304648204</td>\n",
" </tr>\n",
" <tr>\n",
" <th>121193</th>\n",
" <td>527</td>\n",
" <td>1512</td>\n",
" <td>4</td>\n",
" <td>1125694603</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2563603</th>\n",
" <td>10758</td>\n",
" <td>738</td>\n",
" <td>4</td>\n",
" <td>1301927887</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2034193</th>\n",
" <td>7949</td>\n",
" <td>1671</td>\n",
" <td>5</td>\n",
" <td>1276176595</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1373543</th>\n",
" <td>5369</td>\n",
" <td>893</td>\n",
" <td>3</td>\n",
" <td>1299972980</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1798131</th>\n",
" <td>7027</td>\n",
" <td>4530</td>\n",
" <td>3</td>\n",
" <td>1178099769</td>\n",
" </tr>\n",
" <tr>\n",
" <th>572517</th>\n",
" <td>2243</td>\n",
" <td>9773</td>\n",
" <td>3</td>\n",
" <td>1187275220</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2160230</th>\n",
" <td>8470</td>\n",
" <td>12</td>\n",
" <td>3</td>\n",
" <td>1306330169</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1672554</th>\n",
" <td>6554</td>\n",
" <td>5637</td>\n",
" <td>3</td>\n",
" <td>1168168788</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1504944</th>\n",
" <td>5920</td>\n",
" <td>6659</td>\n",
" <td>3</td>\n",
" <td>1254041654</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2657986</th>\n",
" <td>17116</td>\n",
" <td>738</td>\n",
" <td>4</td>\n",
" <td>1238829652</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2123663</th>\n",
" <td>8319</td>\n",
" <td>1242</td>\n",
" <td>4</td>\n",
" <td>1225941971</td>\n",
" </tr>\n",
" <tr>\n",
" <th>561109</th>\n",
" <td>2206</td>\n",
" <td>4209</td>\n",
" <td>3</td>\n",
" <td>1307884947</td>\n",
" </tr>\n",
" <tr>\n",
" <th>208970</th>\n",
" <td>887</td>\n",
" <td>4723</td>\n",
" <td>3</td>\n",
" <td>1306314265</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" userId movieId rating timestamp\n",
"1234569 4825 14852 5 1263084471\n",
"1817521 7121 140 4 1259054160\n",
"2417373 9449 116 3 1255344370\n",
"1234106 4822 685 5 1124800342\n",
"2044878 7996 22343 4 1254639194\n",
"239277 947 5730 5 1253992436\n",
"305034 1178 9839 5 1304648204\n",
"121193 527 1512 4 1125694603\n",
"2563603 10758 738 4 1301927887\n",
"2034193 7949 1671 5 1276176595\n",
"1373543 5369 893 3 1299972980\n",
"1798131 7027 4530 3 1178099769\n",
"572517 2243 9773 3 1187275220\n",
"2160230 8470 12 3 1306330169\n",
"1672554 6554 5637 3 1168168788\n",
"1504944 5920 6659 3 1254041654\n",
"2657986 17116 738 4 1238829652\n",
"2123663 8319 1242 4 1225941971\n",
"561109 2206 4209 3 1307884947\n",
"208970 887 4723 3 1306314265"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ratings.sample(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. links.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"links = pd.read_csv(path + 'links.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| Field | Description |\n",
"| ---- | ---- |\n",
"| movieId | Same movieId as in movies.csv and ratings.csv |\n",
"| imdbId | Movie id on IMDb (may be missing) |\n",
"| doubanId | Movie id on Douban |"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>movieId</th>\n",
" <th>imdbId</th>\n",
" <th>doubanId</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>50304</th>\n",
" <td>50304</td>\n",
" <td>NaN</td>\n",
" <td>3712319</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46231</th>\n",
" <td>46231</td>\n",
" <td>NaN</td>\n",
" <td>3035298</td>\n",
" </tr>\n",
" <tr>\n",
" <th>56597</th>\n",
" <td>56597</td>\n",
" <td>NaN</td>\n",
" <td>2980174</td>\n",
" </tr>\n",
" <tr>\n",
" <th>54191</th>\n",
" <td>54191</td>\n",
" <td>86992.0</td>\n",
" <td>1294617</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3418</th>\n",
" <td>3418</td>\n",
" <td>87406.0</td>\n",
" <td>1533608</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6586</th>\n",
" <td>6586</td>\n",
" <td>NaN</td>\n",
" <td>6383567</td>\n",
" </tr>\n",
" <tr>\n",
" <th>52685</th>\n",
" <td>52685</td>\n",
" <td>376706.0</td>\n",
" <td>1770079</td>\n",
" </tr>\n",
" <tr>\n",
" <th>53372</th>\n",
" <td>53372</td>\n",
" <td>218839.0</td>\n",
" <td>1295836</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27540</th>\n",
" <td>27540</td>\n",
" <td>NaN</td>\n",
" <td>2371674</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34467</th>\n",
" <td>34467</td>\n",
" <td>NaN</td>\n",
" <td>4868728</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2301</th>\n",
" <td>2301</td>\n",
" <td>NaN</td>\n",
" <td>3732699</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16687</th>\n",
" <td>16687</td>\n",
" <td>NaN</td>\n",
" <td>4840386</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36301</th>\n",
" <td>36301</td>\n",
" <td>364457.0</td>\n",
" <td>1764523</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44922</th>\n",
" <td>44922</td>\n",
" <td>452640.0</td>\n",
" <td>1920065</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27815</th>\n",
" <td>27815</td>\n",
" <td>114687.0</td>\n",
" <td>1773480</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25370</th>\n",
" <td>25370</td>\n",
" <td>NaN</td>\n",
" <td>4192036</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36070</th>\n",
" <td>36070</td>\n",
" <td>NaN</td>\n",
" <td>4848096</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40954</th>\n",
" <td>40954</td>\n",
" <td>115906.0</td>\n",
" <td>1302469</td>\n",
" </tr>\n",
" <tr>\n",
" <th>38395</th>\n",
" <td>38395</td>\n",
" <td>436784.0</td>\n",
" <td>1857858</td>\n",
" </tr>\n",
" <tr>\n",
" <th>49680</th>\n",
" <td>49680</td>\n",
" <td>NaN</td>\n",
" <td>4168480</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" movieId imdbId doubanId\n",
"50304 50304 NaN 3712319\n",
"46231 46231 NaN 3035298\n",
"56597 56597 NaN 2980174\n",
"54191 54191 86992.0 1294617\n",
"3418 3418 87406.0 1533608\n",
"6586 6586 NaN 6383567\n",
"52685 52685 376706.0 1770079\n",
"53372 53372 218839.0 1295836\n",
"27540 27540 NaN 2371674\n",
"34467 34467 NaN 4868728\n",
"2301 2301 NaN 3732699\n",
"16687 16687 NaN 4840386\n",
"36301 36301 364457.0 1764523\n",
"44922 44922 452640.0 1920065\n",
"27815 27815 114687.0 1773480\n",
"25370 25370 NaN 4192036\n",
"36070 36070 NaN 4848096\n",
"40954 40954 115906.0 1302469\n",
"38395 38395 436784.0 1857858\n",
"49680 49680 NaN 4168480"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"links.sample(20)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "keras"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
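The three ez_douban tables described above share the movieId key, so they can be combined with ordinary pandas joins. The sketch below is only an illustration under stated assumptions: the rows are made-up stand-ins rather than real dataset records, the helper name `summarize_movies` is hypothetical, and `timestamp` is assumed to hold Unix seconds, which the notebook does not state.

```python
import pandas as pd

# Made-up stand-ins for movies.csv, ratings.csv and links.csv.
# Only the column names follow the notebook; the values are invented.
movies = pd.DataFrame({
    "movieId": [0, 1, 2, 3],
    "title": ["Movie A", None, "Movie C", "Movie D"],  # titles may be missing
})
ratings = pd.DataFrame({
    "userId": [0, 0, 1, 1],
    "movieId": [0, 1, 0, 2],
    "rating": [5, 3, 4, 2],
    "timestamp": [1263084471, 1259054160, 1255344370, 1124800342],
})
links = pd.DataFrame({
    "movieId": [0, 1, 2, 3],
    "imdbId": [86992.0, None, 87406.0, None],  # imdbId may be missing
    "doubanId": [1294617, 3712319, 1533608, 3035298],
})


def summarize_movies(movies, ratings, links):
    """Attach external ids and per-movie rating statistics to each movie."""
    stats = (
        ratings.groupby("movieId")["rating"]
        .agg(n_ratings="count", mean_rating="mean")
        .reset_index()
    )
    # Left joins keep movies that have no link row or no ratings at all.
    return (
        movies.merge(links, on="movieId", how="left")
        .merge(stats, on="movieId", how="left")
    )


summary = summarize_movies(movies, ratings, links)

# Assumption: timestamp is in Unix seconds.
rated_at = pd.to_datetime(ratings["timestamp"], unit="s")

print(summary)
print(rated_at.dt.year.tolist())
```

With the real files, the same function could be applied to the `movies`, `ratings`, and `links` frames loaded in the notebook. Movies without any rating end up with missing values in the two statistics columns rather than being dropped.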
@@ -1,357 +0,0 @@
{
"cells": [
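{
"cell_type": "markdown",
"metadata": {},
"source": [
"# financezhidao overview\n",
"0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1z1Rnnk-ubRSvzDu4UvLlIw)\n",
"1. **Data overview:** 770,000 question-and-answer records from the financial industry\n",
"2. **Suggested experiment:** FAQ question-answering system\n",
"3. **Data source:** Baidu Zhidao (百度知道)\n",
"4. **Processing:**\n",
"    1. Removed the id, url, qid, reply_t, and user fields\n",
"    2. Masked sensitive information in the question and reply fields"
]
},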
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
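{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Placeholder: replace with the directory that holds the financezhidao files (keep the trailing slash)\n",
"path = 'path_to_financezhidao_folder/'"
]
},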
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. financezhidao_filter.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"pd_all = pd.read_csv(path + 'financezhidao_filter.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| title | Question title |\n",
"| question | Question body (may be empty) |\n",
"| reply | Reply content |\n",
"| is_best | Whether the reply is the best answer |"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>title</th>\n",
" <th>question</th>\n",
" <th>reply</th>\n",
" <th>is_best</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>678109</th>\n",
" <td>大家好,请问信用卡怎么分期,分期有什么用处呢</td>\n",
" <td>NaN</td>\n",
" <td>分期好提额,但是有利息</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>534025</th>\n",
" <td>本人在银行的存款,别人带本人的身份证可以取出来吗</td>\n",
" <td>NaN</td>\n",
" <td>若使用的是招商银行储蓄卡,在网点取款可代办,取款金额在1万元以上,需出示双人身份证原件和银行...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>501941</th>\n",
" <td>向银行贷款30万一个月要多少利息</td>\n",
" <td>NaN</td>\n",
" <td>1000万</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>734438</th>\n",
" <td>招商信用卡还款怎么还,是每个月固定还多少钱,还是按照我们用款额度来算每个月还多少钱?</td>\n",
" <td>NaN</td>\n",
" <td>消费多少还多少,还款期内免利息。账单出来会提示你全额还多少,最低还多少的。</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>448905</th>\n",
" <td>年利率6每月多少钱</td>\n",
" <td>NaN</td>\n",
" <td>一年按12个月算的</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>521387</th>\n",
" <td>以卡办卡查卡里余额吗</td>\n",
" <td>NaN</td>\n",
" <td>若需查询招行一卡通余额,可通过电话银行,手机银行,网上银行(大众版和专业版),自助设备等渠道...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>758812</th>\n",
" <td>2016年调整金融机构各个银行人民币存贷款基准利率是多少</td>\n",
" <td>NaN</td>\n",
" <td>这个问题的话,本金*利率*时间就可以算出来了总的存款利率的话一般都是有央*规定的,怕出现什么...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>220626</th>\n",
" <td>请问一下,广信贷怎么样?这个理财真的可以赚?</td>\n",
" <td>NaN</td>\n",
" <td>所在城市若有招商银行,也可以了解下招行发售的理财产品,您可以进入招行主页,点击“理财产品”-...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>86984</th>\n",
" <td>公积金断交后,补上可以申请公积金贷款吗</td>\n",
" <td>公积金交了一年了,但是断了大概5个月了,现在想申请公积金贷款,请问补上可以吗</td>\n",
" <td>住房公积金断了,需要当事人准备相应的补交材料给单位经办人,由单位的经办人去有关部门办理补缴手...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20026</th>\n",
" <td>在哪里能借到钱</td>\n",
" <td>NaN</td>\n",
" <td>你要借多少</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>121538</th>\n",
" <td>哪有办理个人信用卡pos机</td>\n",
" <td>NaN</td>\n",
" <td>很多都可以办理</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>467245</th>\n",
" <td>身份证消磁了就不能办银行卡了吗</td>\n",
" <td>NaN</td>\n",
" <td>身份证读不出信息就是无效证件是没法去银行办理业务的目前部分银行支持临时身份证+辅助证明的方式...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>730725</th>\n",
" <td>年薪20万,招行信用卡标准金卡额度能有多少?</td>\n",
" <td>NaN</td>\n",
" <td>正常来说一般是一万,要看你个人的信用度。这个情况要去银行问。</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>517301</th>\n",
" <td>自己可以拿家长的身份证办银行卡么吗</td>\n",
" <td>NaN</td>\n",
" <td>必须本人办理</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>255614</th>\n",
" <td>有没有人现在能借一千以内给我,急需,无前期,走今借到</td>\n",
" <td>NaN</td>\n",
" <td>那么晚了还出来诈骗</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>545539</th>\n",
" <td>招行信用卡查询密码怎么修改有多种方式</td>\n",
" <td>NaN</td>\n",
" <td>可以通过网银大众版、专业版、手机银行、掌上生活客户端、电话银行等渠道修改。</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>747413</th>\n",
" <td>信用卡面签被拒的原因是什么?</td>\n",
" <td>信用卡面签被拒的原因是什么?</td>\n",
" <td>若申请的是招行信用卡,最主要的条件是有稳定的工作和收入,必备申请文件为身份证明复印件和工作证...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>669087</th>\n",
" <td>信用卡还款日提前一天会黑名单</td>\n",
" <td>NaN</td>\n",
" <td>你好,这个是不会的,信用卡还款日是指免息期的最后一天,在这个时间之前全额还款都是没有问题的。...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>237058</th>\n",
" <td>求一个农村人可以借钱的软件,就几百块急用,在网上找了十几个认证了</td>\n",
" <td>求一个农村人可以借钱的软件,就几百块急用,在网上找了十几个认证了半天都不给借,求一个靠谱的</td>\n",
" <td>你好很高兴为您解答:qq现金贷不错</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>131290</th>\n",
" <td>我们办理房贷合同时银行工作人员给信用卡申请来填,那个信用卡的核实信息我答不上会影响放款吗</td>\n",
" <td>我们办理房贷合同时银行工作人员给信用卡申请来填,那个信用卡的核实信息我答不上会影响放款吗急用</td>\n",
" <td>若是在招行申请的个人住房贷款,信用卡的核发情况不影响贷款放款。贷款的最终审核是否能够通过,是...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" title \\\n",
"678109 大家好,请问信用卡怎么分期,分期有什么用处呢 \n",
"534025 本人在银行的存款,别人带本人的身份证可以取出来吗 \n",
"501941 向银行贷款30万一个月要多少利息 \n",
"734438 招商信用卡还款怎么还,是每个月固定还多少钱,还是按照我们用款额度来算每个月还多少钱? \n",
"448905 年利率6每月多少钱 \n",
"521387 以卡办卡查卡里余额吗 \n",
"758812 2016年调整金融机构各个银行人民币存贷款基准利率是多少 \n",
"220626 请问一下,广信贷怎么样?这个理财真的可以赚? \n",
"86984 公积金断交后,补上可以申请公积金贷款吗 \n",
"20026 在哪里能借到钱 \n",
"121538 哪有办理个人信用卡pos机 \n",
"467245 身份证消磁了就不能办银行卡了吗 \n",
"730725 年薪20万,招行信用卡标准金卡额度能有多少? \n",
"517301 自己可以拿家长的身份证办银行卡么吗 \n",
"255614 有没有人现在能借一千以内给我,急需,无前期,走今借到 \n",
"545539 招行信用卡查询密码怎么修改有多种方式 \n",
"747413 信用卡面签被拒的原因是什么? \n",
"669087 信用卡还款日提前一天会黑名单 \n",
"237058 求一个农村人可以借钱的软件,就几百块急用,在网上找了十几个认证了 \n",
"131290 我们办理房贷合同时银行工作人员给信用卡申请来填,那个信用卡的核实信息我答不上会影响放款吗 \n",
"\n",
" question \\\n",
"678109 NaN \n",
"534025 NaN \n",
"501941 NaN \n",
"734438 NaN \n",
"448905 NaN \n",
"521387 NaN \n",
"758812 NaN \n",
"220626 NaN \n",
"86984 公积金交了一年了,但是断了大概5个月了,现在想申请公积金贷款,请问补上可以吗 \n",
"20026 NaN \n",
"121538 NaN \n",
"467245 NaN \n",
"730725 NaN \n",
"517301 NaN \n",
"255614 NaN \n",
"545539 NaN \n",
"747413 信用卡面签被拒的原因是什么? \n",
"669087 NaN \n",
"237058 求一个农村人可以借钱的软件,就几百块急用,在网上找了十几个认证了半天都不给借,求一个靠谱的 \n",
"131290 我们办理房贷合同时银行工作人员给信用卡申请来填,那个信用卡的核实信息我答不上会影响放款吗急用 \n",
"\n",
" reply is_best \n",
"678109 分期好提额,但是有利息 0 \n",
"534025 若使用的是招商银行储蓄卡,在网点取款可代办,取款金额在1万元以上,需出示双人身份证原件和银行... 1 \n",
"501941 1000万 0 \n",
"734438 消费多少还多少,还款期内免利息。账单出来会提示你全额还多少,最低还多少的。 1 \n",
"448905 一年按12个月算的 0 \n",
"521387 若需查询招行一卡通余额,可通过电话银行,手机银行,网上银行(大众版和专业版),自助设备等渠道... 1 \n",
"758812 这个问题的话,本金*利率*时间就可以算出来了总的存款利率的话一般都是有央*规定的,怕出现什么... 0 \n",
"220626 所在城市若有招商银行,也可以了解下招行发售的理财产品,您可以进入招行主页,点击“理财产品”-... 1 \n",
"86984 住房公积金断了,需要当事人准备相应的补交材料给单位经办人,由单位的经办人去有关部门办理补缴手... 1 \n",
"20026 你要借多少 0 \n",
"121538 很多都可以办理 0 \n",
"467245 身份证读不出信息就是无效证件是没法去银行办理业务的目前部分银行支持临时身份证+辅助证明的方式... 0 \n",
"730725 正常来说一般是一万,要看你个人的信用度。这个情况要去银行问。 0 \n",
"517301 必须本人办理 0 \n",
"255614 那么晚了还出来诈骗 0 \n",
"545539 可以通过网银大众版、专业版、手机银行、掌上生活客户端、电话银行等渠道修改。 1 \n",
"747413 若申请的是招行信用卡,最主要的条件是有稳定的工作和收入,必备申请文件为身份证明复印件和工作证... 1 \n",
"669087 你好,这个是不会的,信用卡还款日是指免息期的最后一天,在这个时间之前全额还款都是没有问题的。... 0 \n",
"237058 你好很高兴为您解答:qq现金贷不错 0 \n",
"131290 若是在招行申请的个人住房贷款,信用卡的核发情况不影响贷款放款。贷款的最终审核是否能够通过,是... 1 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_all.sample(n=20)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
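For the FAQ experiment suggested in the financezhidao notebook, one plausible preprocessing step is to turn the four columns into (query, reply) pairs. The sketch below is a hedged illustration, not the dataset authors' method: the rows are invented, the helper names `full_question` and `build_faq_pairs` are hypothetical, and it assumes `title` is always present and that `is_best == 1` marks the best answer.

```python
import pandas as pd


def full_question(title, question):
    """Combine title and question body into one query string."""
    # The question body may be empty, or may simply repeat the title.
    if pd.isna(question) or question == title:
        return title
    # Sometimes the body is the title plus extra detail.
    if question.startswith(title):
        return question
    # Otherwise treat the body as additional text after the title.
    return title + " " + question


def build_faq_pairs(df):
    """Keep only best answers and return a frame with query and reply."""
    best = df[df["is_best"] == 1]
    queries = [full_question(t, q) for t, q in zip(best["title"], best["question"])]
    return pd.DataFrame({"query": queries, "reply": best["reply"].tolist()})


# Invented rows that only mimic the column layout of financezhidao_filter.csv.
toy = pd.DataFrame({
    "title": ["How do I reset my card PIN", "Loan interest", "Can I repay early",
              "Card declined", "Where can I borrow money"],
    "question": [None, "What is the monthly interest on 300k?", "Can I repay early",
                 "Card declined abroad, what now", None],
    "reply": ["Visit a branch with your ID.", "It depends on the rate.",
              "Yes, before the due date.", "Call the hotline.", "How much do you need"],
    "is_best": [1, 1, 1, 1, 0],
})

pairs = build_faq_pairs(toy)
print(pairs)
```

Filtering on `is_best` trades coverage for answer quality; keeping all replies and using `is_best` as a label would be an equally reasonable choice for a ranking experiment.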
@@ -1,357 +0,0 @@
{
"cells": [
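{
"cell_type": "markdown",
"metadata": {},
"source": [
"# lawzhidao_filter overview\n",
"0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/18Lwq16VBo6wBD_qLb3i33g)\n",
"1. **Data overview:** 36,000 legal question-and-answer records\n",
"2. **Suggested experiment:** FAQ question-answering system\n",
"3. **Data source:** Baidu Zhidao (百度知道)\n",
"4. **Processing:**\n",
"    1. Removed the id, url, qid, reply_t, and user fields\n",
"    2. Masked sensitive information in the question and reply fields"
]
},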
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
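{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Placeholder: replace with the directory that holds the lawzhidao files (keep the trailing slash)\n",
"path = 'path_to_lawzhidao_folder/'"
]
},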
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. lawzhidao_filter.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"pd_all = pd.read_csv(path + 'lawzhidao_filter.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| title | Question title |\n",
"| question | Question body (may be empty) |\n",
"| reply | Reply content |\n",
"| is_best | Whether the reply is shown as the best answer on the page |"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>title</th>\n",
" <th>question</th>\n",
" <th>reply</th>\n",
" <th>is_best</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>6725</th>\n",
" <td>请问车险理赔时,全责一方和无责任一方收到待遇的区别</td>\n",
" <td>NaN</td>\n",
" <td>这位朋友提问的有些过于笼统了不是很详细,理论上来讲,从商业险的角度分析,有责任,保险公司才会...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6399</th>\n",
" <td>买保险,一定要找代理人吗,直接去保险公司买不可以吗?</td>\n",
" <td>买保险,一定要找代理人吗,直接去保险公司买不可以吗?</td>\n",
" <td>可以的。可以自行去保险公司进行投保,也可以选择在网上投保。不过有代理人的好处在于可以为被保险...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4242</th>\n",
" <td>机动车撞伤人至骨折保险公司该怎么赔偿</td>\n",
" <td>NaN</td>\n",
" <td>交通事故赔偿是有标准的,因交通事故造成损失,肇事者向受害者、保险公司对承保车辆造成的损失进行...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7481</th>\n",
" <td>贷款买养老保险如何办理?</td>\n",
" <td>贷款买养老保险如何办理?</td>\n",
" <td>助保贷款主要是针对中断缴纳基本养老保险费的接近退休年龄无力续保的困难*员,通过政府担保贴息、...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5674</th>\n",
" <td>摩托车行车证年审应交哪些保险?一定要交驾驶员个人险吗?</td>\n",
" <td>NaN</td>\n",
" <td>摩托车买保险最应该买的就是交强险,一般根据排量的不同共分为三个类别,其中50CC及以下的排量...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1122</th>\n",
" <td>惠*安保费贵不贵?一年需要多少钱?</td>\n",
" <td>NaN</td>\n",
" <td>年缴保费500元,缴费20年,保障30年。</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5511</th>\n",
" <td>农村医保没有交,会把户口注销了吗?本人现不在家无法交医保,乡镇通知我,他说我不交医保就把我的户口</td>\n",
" <td>销了。是真的吗?</td>\n",
" <td>不会的,这是不合法的,新农合是指由政府组织、引导、支持,农民自愿参加,个人、集体和政府多方筹...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7338</th>\n",
" <td>新华保险的保单贷款是怎样还的?</td>\n",
" <td>NaN</td>\n",
" <td>半年要去签一次息,具体情况,可以直接咨询新华人寿保险公司,新华客服热线9##67</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1280</th>\n",
" <td>一起慧99到底有什么优惠相比其他的保险</td>\n",
" <td>NaN</td>\n",
" <td>您好!一起慧99</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6388</th>\n",
" <td>辞职后,养老保险如果不转移会怎么样</td>\n",
" <td>我2010年2月在原公司辞职后,养老保险没有转移。如果不转移,我这部分养老保险会怎么处理?</td>\n",
" <td>会被封存,所以要及时转移。养老保险转移和接续手续:一、申请出具《基本养老保险参保缴费凭证》职...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7920</th>\n",
" <td>慧*安*儿定期重疾是怎么理赔的</td>\n",
" <td>NaN</td>\n",
" <td>首先是报案您或被保险人应在知道保险事故发生之日起10日内通知本公司。如果您或受益人故意或者因...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3134</th>\n",
" <td>构不成住院条件的车祸需要赔付精神损失费误工费营养费护理费吗</td>\n",
" <td>NaN</td>\n",
" <td>只要存在精神损失、误工、需要增加营养、护理的费用,就可以向侵权人主张赔偿责任。</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4015</th>\n",
" <td>基本保险金额是什么意思</td>\n",
" <td>基本保险金额是什么意思</td>\n",
" <td>基本保险金额是保单上明确标注的金额,保险金额是能拿到的保险赔付金额,有些保险条款的基本保险金...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6848</th>\n",
" <td>重大疾病保险有必要买吗?</td>\n",
" <td>我今年25岁,身体很健康,我去买保险,保险公司的人给我的计划里有重大疾病保险的项目,但是我只...</td>\n",
" <td>重大疾病保险还是很有必要买的。我国的医疗保障体系是由基本医保和商业健康保险组成。如果发生重大...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2494</th>\n",
" <td>库*勒妇科商业医保报销范围有哪些?</td>\n",
" <td>库*勒妇科商业医保报销范围有哪些?</td>\n",
" <td>你好,商业医保报销范围比医疗保险报销更广。基本都是能报销的。报销分农村居民和城镇职工:1、居...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7341</th>\n",
" <td>第三者保险营运与非营运什么区别</td>\n",
" <td>第三者保险营运与非营运什么区别</td>\n",
" <td>车辆行驶证的“使用性质“一个是营运,一个是非营运。营运需要在运输管理部门办理车辆的道路运输许...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4997</th>\n",
" <td>犹豫期内退保一定要去原来办理的地点吗?</td>\n",
" <td>犹豫期内退保一定要去原来办理的地点吗?</td>\n",
" <td>要退保必须去保险公司退,在银行的柜台上是没办法退的,而且退保必须由投保人本人持其身份证去退,...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5694</th>\n",
" <td>保险法的构成主要包括</td>\n",
" <td>NaN</td>\n",
" <td>保险法的构成主要包括保险业法、保险合同法*保险特别法。1.保险业法又称保险事业法、保险事业监...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1604</th>\n",
" <td>适合中老年的保险多不多,能买哪些保险?</td>\n",
" <td>NaN</td>\n",
" <td>年龄多大呢?保费预算多少?</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3098</th>\n",
" <td>汽车购置税属于机动车第三者责任险赔偿范围内吗?</td>\n",
" <td>NaN</td>\n",
" <td>购置税你是你购置车辆的时候上牌还需要交的费用。跟保险不是一个范围。</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" title \\\n",
"6725 请问车险理赔时,全责一方和无责任一方收到待遇的区别 \n",
"6399 买保险,一定要找代理人吗,直接去保险公司买不可以吗? \n",
"4242 机动车撞伤人至骨折保险公司该怎么赔偿 \n",
"7481 贷款买养老保险如何办理? \n",
"5674 摩托车行车证年审应交哪些保险?一定要交驾驶员个人险吗? \n",
"1122 惠*安保费贵不贵?一年需要多少钱? \n",
"5511 农村医保没有交,会把户口注销了吗?本人现不在家无法交医保,乡镇通知我,他说我不交医保就把我的户口 \n",
"7338 新华保险的保单贷款是怎样还的? \n",
"1280 一起慧99到底有什么优惠相比其他的保险 \n",
"6388 辞职后,养老保险如果不转移会怎么样 \n",
"7920 慧*安*儿定期重疾是怎么理赔的 \n",
"3134 构不成住院条件的车祸需要赔付精神损失费误工费营养费护理费吗 \n",
"4015 基本保险金额是什么意思 \n",
"6848 重大疾病保险有必要买吗? \n",
"2494 库*勒妇科商业医保报销范围有哪些? \n",
"7341 第三者保险营运与非营运什么区别 \n",
"4997 犹豫期内退保一定要去原来办理的地点吗? \n",
"5694 保险法的构成主要包括 \n",
"1604 适合中老年的保险多不多,能买哪些保险? \n",
"3098 汽车购置税属于机动车第三者责任险赔偿范围内吗? \n",
"\n",
" question \\\n",
"6725 NaN \n",
"6399 买保险,一定要找代理人吗,直接去保险公司买不可以吗? \n",
"4242 NaN \n",
"7481 贷款买养老保险如何办理? \n",
"5674 NaN \n",
"1122 NaN \n",
"5511 销了。是真的吗? \n",
"7338 NaN \n",
"1280 NaN \n",
"6388 我2010年2月在原公司辞职后,养老保险没有转移。如果不转移,我这部分养老保险会怎么处理? \n",
"7920 NaN \n",
"3134 NaN \n",
"4015 基本保险金额是什么意思 \n",
"6848 我今年25岁,身体很健康,我去买保险,保险公司的人给我的计划里有重大疾病保险的项目,但是我只... \n",
"2494 库*勒妇科商业医保报销范围有哪些? \n",
"7341 第三者保险营运与非营运什么区别 \n",
"4997 犹豫期内退保一定要去原来办理的地点吗? \n",
"5694 NaN \n",
"1604 NaN \n",
"3098 NaN \n",
"\n",
" reply is_best \n",
"6725 这位朋友提问的有些过于笼统了不是很详细,理论上来讲,从商业险的角度分析,有责任,保险公司才会... 0 \n",
"6399 可以的。可以自行去保险公司进行投保,也可以选择在网上投保。不过有代理人的好处在于可以为被保险... 1 \n",
"4242 交通事故赔偿是有标准的,因交通事故造成损失,肇事者向受害者、保险公司对承保车辆造成的损失进行... 1 \n",
"7481 助保贷款主要是针对中断缴纳基本养老保险费的接近退休年龄无力续保的困难*员,通过政府担保贴息、... 0 \n",
"5674 摩托车买保险最应该买的就是交强险,一般根据排量的不同共分为三个类别,其中50CC及以下的排量... 1 \n",
"1122 年缴保费500元,缴费20年,保障30年。 1 \n",
"5511 不会的,这是不合法的,新农合是指由政府组织、引导、支持,农民自愿参加,个人、集体和政府多方筹... 1 \n",
"7338 半年要去签一次息,具体情况,可以直接咨询新华人寿保险公司,新华客服热线9##67 0 \n",
"1280 您好!一起慧99 0 \n",
"6388 会被封存,所以要及时转移。养老保险转移和接续手续:一、申请出具《基本养老保险参保缴费凭证》职... 1 \n",
"7920 首先是报案您或被保险人应在知道保险事故发生之日起10日内通知本公司。如果您或受益人故意或者因... 1 \n",
"3134 只要存在精神损失、误工、需要增加营养、护理的费用,就可以向侵权人主张赔偿责任。 0 \n",
"4015 基本保险金额是保单上明确标注的金额,保险金额是能拿到的保险赔付金额,有些保险条款的基本保险金... 1 \n",
"6848 重大疾病保险还是很有必要买的。我国的医疗保障体系是由基本医保和商业健康保险组成。如果发生重大... 1 \n",
"2494 你好,商业医保报销范围比医疗保险报销更广。基本都是能报销的。报销分农村居民和城镇职工:1、居... 0 \n",
"7341 车辆行驶证的“使用性质“一个是营运,一个是非营运。营运需要在运输管理部门办理车辆的道路运输许... 1 \n",
"4997 要退保必须去保险公司退,在银行的柜台上是没办法退的,而且退保必须由投保人本人持其身份证去退,... 1 \n",
"5694 保险法的构成主要包括保险业法、保险合同法*保险特别法。1.保险业法又称保险事业法、保险事业监... 0 \n",
"1604 年龄多大呢?保费预算多少? 0 \n",
"3098 购置税你是你购置车辆的时候上牌还需要交的费用。跟保险不是一个范围。 0 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_all.sample(n=20)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
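The zhidao notebooks above all describe the same four columns, so a quick profile after loading can confirm that a file has the expected layout and show how often `question` is empty. This is a minimal sketch under stated assumptions: `profile_zhidao` and `EXPECTED_COLUMNS` are hypothetical names, and the rows are invented rather than taken from any of the datasets.

```python
import pandas as pd

# Column layout as listed in the field description tables.
EXPECTED_COLUMNS = ["title", "question", "reply", "is_best"]


def profile_zhidao(df):
    """Return basic counts for a zhidao-style frame, or raise if columns are missing."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return {
        "rows": len(df),
        # Share of rows whose question body is empty (NaN after loading).
        "empty_question_ratio": float(df["question"].isna().mean()),
        # Share of rows flagged as the best answer.
        "best_ratio": float((df["is_best"] == 1).mean()),
    }


# Invented rows that only mimic the column layout.
toy = pd.DataFrame({
    "title": ["Q1", "Q2", "Q3", "Q4"],
    "question": [None, "More detail for Q2", None, None],
    "reply": ["A1", "A2", "A3", "A4"],
    "is_best": [1, 0, 0, 1],
})

report = profile_zhidao(toy)
print(report)
```

A check like this does not tell the datasets apart, since they share one schema; it only guards against truncated or differently structured files.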
@@ -1,355 +0,0 @@
{
"cells": [
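{
"cell_type": "markdown",
"metadata": {},
"source": [
"# liantongzhidao_filter overview\n",
"0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1oYi9SfbXpnvreJYGV837Nw)\n",
"1. **Data overview:** 203,000 China Unicom question-and-answer records\n",
"2. **Suggested experiment:** FAQ question-answering system\n",
"3. **Data source:** Baidu Zhidao (百度知道)\n",
"4. **Processing:**\n",
"    1. Removed the id, url, qid, reply_t, and user fields\n",
"    2. Masked sensitive information in the question and reply fields"
]
},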
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
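{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Placeholder: replace with the directory that holds the liantongzhidao files (keep the trailing slash)\n",
"path = 'path_to_liantongzhidao_folder/'"
]
},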
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. liantongzhidao_filter.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"pd_all = pd.read_csv(path + 'liantongzhidao_filter.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| title | Question title |\n",
"| question | Question body (may be empty) |\n",
"| reply | Reply content |\n",
"| is_best | Whether the reply is shown as the best answer on the page |"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>title</th>\n",
" <th>question</th>\n",
" <th>reply</th>\n",
" <th>is_best</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>104525</th>\n",
" <td>拖欠联通话费会有利息出现吗?</td>\n",
" <td>NaN</td>\n",
" <td>应该没有</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>116168</th>\n",
" <td>5S日版为什么插移动卡可以用.联通卡就不读卡</td>\n",
" <td>NaN</td>\n",
" <td>苹果手机卡贴分为移动和联通的,说明卡贴支持移动卡,不支持联通卡,主要是网络制式决定的。联通网...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>154475</th>\n",
" <td>联通空中号激活了也不能打电话是怎么回事</td>\n",
" <td>联通空中号激活了也不能打电话是怎么回事</td>\n",
" <td>手机已激活却无法接打电话的常见原因及解决方法如下:【1】检查手机是否欠费停机,建议缴费充值;...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>153069</th>\n",
" <td>联通48元送2g活动本月月租到底算不算进去?</td>\n",
" <td>NaN</td>\n",
" <td>算</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>195043</th>\n",
" <td>VI###13是不是不支持联通上网卡</td>\n",
" <td>NaN</td>\n",
" <td>VI###13支持联通上网卡。网络参考:主屏尺寸:4.5英寸主屏分辨率:854x480像素后...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5235</th>\n",
" <td>电话号码能定位是真吗</td>\n",
" <td>电话号码能定位是真吗</td>\n",
" <td>当然了</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10472</th>\n",
" <td>索尼LT22i可以刷机到4.1吗</td>\n",
" <td>NaN</td>\n",
" <td>由于手机所支持的网络是由硬件所确定的,无法通过破解软件或者升级软件系统让手机支持其他运营商的...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>86083</th>\n",
" <td>苹果ip##ne手机的个人热点怎么设置使用</td>\n",
" <td>NaN</td>\n",
" <td>1、点击“设置”选项;2、在“设置”界面中找到“个人热点”;3、然后我们可以看到“个人热点”...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>150247</th>\n",
" <td>我用的联通的号码,信号一会有一会没有,请问到底是怎什么回事</td>\n",
" <td>NaN</td>\n",
" <td>信号不好,手机因素,运营商问题,手机卡问题,很多因素你可以到当地联通营业厅寻求帮助</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>202724</th>\n",
" <td>流量畅享包订购生效时间</td>\n",
" <td>NaN</td>\n",
" <td>您订购沃商店/沃游戏流量畅享包后,订购当月立即生效,按月自动续订;退订月底生效,当月可继续使...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44450</th>\n",
" <td>办了腾讯大王卡,激活后,身份证是不是就剩下半俩张卡的机会了</td>\n",
" <td>办了腾讯大王卡,激活后,身份证是不是就剩下半俩张卡的机会了</td>\n",
" <td>每人仅可预约一张音乐小*卡或视频小*卡或腾讯大*卡或腾讯天*卡(一共仅1张)(识别条件为:联...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>65875</th>\n",
" <td>现在有联通的合约机吗</td>\n",
" <td>NaN</td>\n",
" <td>联通有合约机。合约种类大致有存话费送手机、买手机送话费、合约惠机等,具体合约种类可登录联通网...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>51934</th>\n",
" <td>联通卡那个腾讯应用省内定向流量免费是什么意思啊</td>\n",
" <td>NaN</td>\n",
" <td>大王卡,对腾讯的应用,都免流量!</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>155866</th>\n",
" <td>怎么设置电信手机彩铃?</td>\n",
" <td>NaN</td>\n",
" <td>设置中*电信的彩铃可以自己在网上操作的,前提是先开通中*电信的彩铃业务,可以直接致电电信客服...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>118696</th>\n",
" <td>联通手机号挂失还能交费吗</td>\n",
" <td>NaN</td>\n",
" <td>1、挂失状态下可以交费。交费渠道与手机正常状态下是一样的。2、温馨提示:如果号码有套餐,挂失...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>115890</th>\n",
" <td>我刚买了一张联通卡,过了几天我怎么收到达飞即有分期让我还款的信息</td>\n",
" <td>我刚买了一张联通卡,过了几天我怎么收到达飞即有分期让我还款的信息,我又没有借过,该怎么办,打...</td>\n",
" <td>出现此情况一般是有以下几种情况:1、信息可能发错接收人了。2、此卡为二次放号的手机卡,前一个...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>35555</th>\n",
" <td>就是联通网用不了。</td>\n",
" <td>NaN</td>\n",
" <td>如使用联通手机无法上网,可做以下排查:1、升级为4G套餐后如不重启手机则无法正常使用上网功能...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>73496</th>\n",
" <td>生份证复印件被公司拿去开联通号码了怎么办</td>\n",
" <td>生份证复印件被公司拿去开联通号码了怎么办</td>\n",
" <td>你再用原件去注销</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>114899</th>\n",
" <td>手机有4g网络,可是却显示无法上网</td>\n",
" <td>NaN</td>\n",
" <td>1、检查信号是否正常;2、号卡是否欠费;3、如上面2项都正常,可重新开关机、换机换卡测试;4...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>47230</th>\n",
" <td>移动,联通无限打到底是怎么回事</td>\n",
" <td>NaN</td>\n",
" <td>您好!现运营商均有推出各种语音、流量优惠套餐,具体情况建议您可咨询当地客服热线、实体营业厅、...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" title \\\n",
"104525 拖欠联通话费会有利息出现吗? \n",
"116168 5S日版为什么插移动卡可以用.联通卡就不读卡 \n",
"154475 联通空中号激活了也不能打电话是怎么回事 \n",
"153069 联通48元送2g活动本月月租到底算不算进去? \n",
"195043 VI###13是不是不支持联通上网卡 \n",
"5235 电话号码能定位是真吗 \n",
"10472 索尼LT22i可以刷机到4.1吗 \n",
"86083 苹果ip##ne手机的个人热点怎么设置使用 \n",
"150247 我用的联通的号码,信号一会有一会没有,请问到底是怎什么回事 \n",
"202724 流量畅享包订购生效时间 \n",
"44450 办了腾讯大王卡,激活后,身份证是不是就剩下半俩张卡的机会了 \n",
"65875 现在有联通的合约机吗 \n",
"51934 联通卡那个腾讯应用省内定向流量免费是什么意思啊 \n",
"155866 怎么设置电信手机彩铃? \n",
"118696 联通手机号挂失还能交费吗 \n",
"115890 我刚买了一张联通卡,过了几天我怎么收到达飞即有分期让我还款的信息 \n",
"35555 就是联通网用不了。 \n",
"73496 生份证复印件被公司拿去开联通号码了怎么办 \n",
"114899 手机有4g网络,可是却显示无法上网 \n",
"47230 移动,联通无限打到底是怎么回事 \n",
"\n",
" question \\\n",
"104525 NaN \n",
"116168 NaN \n",
"154475 联通空中号激活了也不能打电话是怎么回事 \n",
"153069 NaN \n",
"195043 NaN \n",
"5235 电话号码能定位是真吗 \n",
"10472 NaN \n",
"86083 NaN \n",
"150247 NaN \n",
"202724 NaN \n",
"44450 办了腾讯大王卡,激活后,身份证是不是就剩下半俩张卡的机会了 \n",
"65875 NaN \n",
"51934 NaN \n",
"155866 NaN \n",
"118696 NaN \n",
"115890 我刚买了一张联通卡,过了几天我怎么收到达飞即有分期让我还款的信息,我又没有借过,该怎么办,打... \n",
"35555 NaN \n",
"73496 生份证复印件被公司拿去开联通号码了怎么办 \n",
"114899 NaN \n",
"47230 NaN \n",
"\n",
" reply is_best \n",
"104525 应该没有 0 \n",
"116168 苹果手机卡贴分为移动和联通的,说明卡贴支持移动卡,不支持联通卡,主要是网络制式决定的。联通网... 1 \n",
"154475 手机已激活却无法接打电话的常见原因及解决方法如下:【1】检查手机是否欠费停机,建议缴费充值;... 1 \n",
"153069 算 1 \n",
"195043 VI###13支持联通上网卡。网络参考:主屏尺寸:4.5英寸主屏分辨率:854x480像素后... 1 \n",
"5235 当然了 0 \n",
"10472 由于手机所支持的网络是由硬件所确定的,无法通过破解软件或者升级软件系统让手机支持其他运营商的... 1 \n",
"86083 1、点击“设置”选项;2、在“设置”界面中找到“个人热点”;3、然后我们可以看到“个人热点”... 0 \n",
"150247 信号不好,手机因素,运营商问题,手机卡问题,很多因素你可以到当地联通营业厅寻求帮助 0 \n",
"202724 您订购沃商店/沃游戏流量畅享包后,订购当月立即生效,按月自动续订;退订月底生效,当月可继续使... 1 \n",
"44450 每人仅可预约一张音乐小*卡或视频小*卡或腾讯大*卡或腾讯天*卡(一共仅1张)(识别条件为:联... 1 \n",
"65875 联通有合约机。合约种类大致有存话费送手机、买手机送话费、合约惠机等,具体合约种类可登录联通网... 0 \n",
"51934 大王卡,对腾讯的应用,都免流量! 0 \n",
"155866 设置中*电信的彩铃可以自己在网上操作的,前提是先开通中*电信的彩铃业务,可以直接致电电信客服... 1 \n",
"118696 1、挂失状态下可以交费。交费渠道与手机正常状态下是一样的。2、温馨提示:如果号码有套餐,挂失... 1 \n",
"115890 出现此情况一般是有以下几种情况:1、信息可能发错接收人了。2、此卡为二次放号的手机卡,前一个... 1 \n",
"35555 如使用联通手机无法上网,可做以下排查:1、升级为4G套餐后如不重启手机则无法正常使用上网功能... 1 \n",
"73496 你再用原件去注销 0 \n",
"114899 1、检查信号是否正常;2、号卡是否欠费;3、如上面2项都正常,可重新开关机、换机换卡测试;4... 1 \n",
"47230 您好!现运营商均有推出各种语音、流量优惠套餐,具体情况建议您可咨询当地客服热线、实体营业厅、... 1 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_all.sample(n=20)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
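Note on the preview cell above: `pd_all.sample(n=20)` draws different rows on every run, so the stored output cannot be reproduced exactly. A minimal sketch of a repeatable preview follows, assuming a fixed seed is acceptable; the toy frame and the seed value 42 are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for pd_all; the real notebook loads the dataset CSV instead.
pd_all = pd.DataFrame({
    "title": ["q%d" % i for i in range(100)],
    "is_best": [i % 2 for i in range(100)],
})

# random_state pins the pseudo-random draw, so reruns return the same rows.
first = pd_all.sample(n=20, random_state=42)
second = pd_all.sample(n=20, random_state=42)

print(first.index.equals(second.index))
```

Leaving `random_state` unset keeps the original behaviour, which is fine for casual browsing of the data.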
@@ -1,355 +0,0 @@
{
"cells": [
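{
"cell_type": "markdown",
"metadata": {},
"source": [
"# nonghangzhidao_filter overview\n",
"0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1n-jT9SKkt6cwI_PjCd7i_g)\n",
"1. **Data overview:** 40,000 question-and-answer records about the Agricultural Bank of China\n",
"2. **Suggested experiment:** FAQ question-answering system\n",
"3. **Data source:** Baidu Zhidao\n",
"4. **Processing:**\n",
"    1. Removed the id, url, qid, reply_t, and user fields\n",
"    2. Anonymized the question and reply fields"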
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"path = 'path/to/nonghangzhidao/'  # placeholder; keep the trailing separator, the file name is appended directly"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. nonghangzhidao_filter.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"pd_all = pd.read_csv(path + 'nonghangzhidao_filter.csv')"
]
},
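{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| title | Title of the question |\n",
"| question | Body of the question (may be empty) |\n",
"| reply | Reply text |\n",
"| is_best | Whether this reply is the best answer shown on the page |"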
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>title</th>\n",
" <th>question</th>\n",
" <th>reply</th>\n",
" <th>is_best</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>31655</th>\n",
" <td>广东农行转账到江苏农行,几天可以到账?1月4日晚上10点多转的!</td>\n",
" <td>NaN</td>\n",
" <td>这么久还没有到账的话,建议查询一下是否被退回了,如果未退回的话,需要联系银行查询原因。</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20349</th>\n",
" <td>惠水哪里有小额贷款的,而且抵押的东西能方</td>\n",
" <td>NaN</td>\n",
" <td>留vx..</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20303</th>\n",
" <td>想问一下重庆分行的体检通知还有第二批吗</td>\n",
" <td>NaN</td>\n",
" <td>若客户申请的是农行招聘,则可以参考以下信息:1、请登录农行官网,在“关于农行”栏目下选择点击...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18420</th>\n",
" <td>现在有什么软件借钱可以秒过的没。江湖救急</td>\n",
" <td>NaN</td>\n",
" <td>资料真实有效二十分钟放款</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39804</th>\n",
" <td>想找高利贷怎么找?</td>\n",
" <td>武*那里有高利贷啊?接个几千块就行年后还?有吗</td>\n",
" <td>留你联系方式</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23242</th>\n",
" <td>别人用建行卡往我农行卡转了20万,一天了怎么还不到账?</td>\n",
" <td>别人用建行卡往我农行卡转了20万,一天了怎么还不到账?肯定是</td>\n",
" <td>如果是昨天下午五点后就要等到中午以后</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30656</th>\n",
" <td>我问银行了,说消不了户、只能刷掉</td>\n",
" <td>NaN</td>\n",
" <td>如果使用的是农行信用卡,可以致电信用卡客服40######99反馈核实一下。</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>贷的太少了,可以提前还清贷款,然后多贷点他用吗</td>\n",
" <td>NaN</td>\n",
" <td>建议客户选择正规渠道申请贷款,例如农行“网捷贷”。网捷贷是指农业银行向符合特定条件的农业银行...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2961</th>\n",
" <td>银行理财和证券公司理财一样吗</td>\n",
" <td>NaN</td>\n",
" <td>不太一样,产品的种类风险不同</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10837</th>\n",
" <td>农行提额喜欢刷大额是还是小额</td>\n",
" <td>NaN</td>\n",
" <td>老农现在是印头与时俱进哦比其他银行都大方。</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34726</th>\n",
" <td>存折可以异地取款吗存折取钱一定要本人吗</td>\n",
" <td>NaN</td>\n",
" <td>农行个人活期存折支取方式里如果有凭证件支取,此类存折必须户主本人办理;没有密码的存折只能到开...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2381</th>\n",
" <td>住房公积金可以首付么</td>\n",
" <td>NaN</td>\n",
" <td>不能用公积金来付首付。这个贷款是在购房付了首付款后才能给贷的,也就是说公积金使用只能是与房屋...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21717</th>\n",
" <td>农行有银联标识的社会保障卡能开通网银吗?</td>\n",
" <td>如题</td>\n",
" <td>由农业银行发行的有银联标识的社会保障卡,上面如果有农业银行卡号的话,是可以用本人由身份证和银...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>35871</th>\n",
" <td>求告知宁*装修贷款条件有哪些</td>\n",
" <td>NaN</td>\n",
" <td>以建行家装贷为例:“家装贷”是建设银行所有具有装修融资服务功能的个人贷款产品,包括个人住房抵...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3417</th>\n",
" <td>重*公积金中介能代取的吗</td>\n",
" <td>重*公积金中介能代取的吗个人公积金代取</td>\n",
" <td>一般来说,公积金套现主要存在几个方面的风险:一、中介机构提取完公积金后,有可能会携款潜逃,竹...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25763</th>\n",
" <td>农行的理财产品能买吗</td>\n",
" <td>NaN</td>\n",
" <td>农行理财业务与国内同业同步,迄今为止,已经形成了制度体系较为完善、系统开发不断前进、产品系列...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13381</th>\n",
" <td>出大事了,出大事了,急用钱,请问我</td>\n",
" <td>NaN</td>\n",
" <td>需要多少呢</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7448</th>\n",
" <td>向钱贷会跑路吗?</td>\n",
" <td>NaN</td>\n",
" <td>不会,放心。</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2617</th>\n",
" <td>我想开通农业掌上银行提示要开通短信,可以先开通短信把掌上银行开通后,取消短信服务吗?对其他有影响吗</td>\n",
" <td>我想开通农业掌上银行提示要开通短信,可以先开通短信把掌上银行开通后,取消短信服务吗?对其他有...</td>\n",
" <td>我想开通农业掌上银行提示要开通短信,可以先开通短信把掌上银行开通后,取消短信服务吗?对其他有...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24669</th>\n",
" <td>朋友晚上8点半转账给我到现在还没到帐</td>\n",
" <td>NaN</td>\n",
" <td>现在外面ATM机都是24小时才到帐</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" title \\\n",
"31655 广东农行转账到江苏农行,几天可以到账?1月4日晚上10点多转的! \n",
"20349 惠水哪里有小额贷款的,而且抵押的东西能方 \n",
"20303 想问一下重庆分行的体检通知还有第二批吗 \n",
"18420 现在有什么软件借钱可以秒过的没。江湖救急 \n",
"39804 想找高利贷怎么找? \n",
"23242 别人用建行卡往我农行卡转了20万,一天了怎么还不到账? \n",
"30656 我问银行了,说消不了户、只能刷掉 \n",
"8 贷的太少了,可以提前还清贷款,然后多贷点他用吗 \n",
"2961 银行理财和证券公司理财一样吗 \n",
"10837 农行提额喜欢刷大额是还是小额 \n",
"34726 存折可以异地取款吗存折取钱一定要本人吗 \n",
"2381 住房公积金可以首付么 \n",
"21717 农行有银联标识的社会保障卡能开通网银吗? \n",
"35871 求告知宁*装修贷款条件有哪些 \n",
"3417 重*公积金中介能代取的吗 \n",
"25763 农行的理财产品能买吗 \n",
"13381 出大事了,出大事了,急用钱,请问我 \n",
"7448 向钱贷会跑路吗? \n",
"2617 我想开通农业掌上银行提示要开通短信,可以先开通短信把掌上银行开通后,取消短信服务吗?对其他有影响吗 \n",
"24669 朋友晚上8点半转账给我到现在还没到帐 \n",
"\n",
" question \\\n",
"31655 NaN \n",
"20349 NaN \n",
"20303 NaN \n",
"18420 NaN \n",
"39804 武*那里有高利贷啊?接个几千块就行年后还?有吗 \n",
"23242 别人用建行卡往我农行卡转了20万,一天了怎么还不到账?肯定是 \n",
"30656 NaN \n",
"8 NaN \n",
"2961 NaN \n",
"10837 NaN \n",
"34726 NaN \n",
"2381 NaN \n",
"21717 如题 \n",
"35871 NaN \n",
"3417 重*公积金中介能代取的吗个人公积金代取 \n",
"25763 NaN \n",
"13381 NaN \n",
"7448 NaN \n",
"2617 我想开通农业掌上银行提示要开通短信,可以先开通短信把掌上银行开通后,取消短信服务吗?对其他有... \n",
"24669 NaN \n",
"\n",
" reply is_best \n",
"31655 这么久还没有到账的话,建议查询一下是否被退回了,如果未退回的话,需要联系银行查询原因。 0 \n",
"20349 留vx.. 0 \n",
"20303 若客户申请的是农行招聘,则可以参考以下信息:1、请登录农行官网,在“关于农行”栏目下选择点击... 1 \n",
"18420 资料真实有效二十分钟放款 0 \n",
"39804 留你联系方式 0 \n",
"23242 如果是昨天下午五点后就要等到中午以后 0 \n",
"30656 如果使用的是农行信用卡,可以致电信用卡客服40######99反馈核实一下。 0 \n",
"8 建议客户选择正规渠道申请贷款,例如农行“网捷贷”。网捷贷是指农业银行向符合特定条件的农业银行... 0 \n",
"2961 不太一样,产品的种类风险不同 0 \n",
"10837 老农现在是印头与时俱进哦比其他银行都大方。 0 \n",
"34726 农行个人活期存折支取方式里如果有凭证件支取,此类存折必须户主本人办理;没有密码的存折只能到开... 1 \n",
"2381 不能用公积金来付首付。这个贷款是在购房付了首付款后才能给贷的,也就是说公积金使用只能是与房屋... 1 \n",
"21717 由农业银行发行的有银联标识的社会保障卡,上面如果有农业银行卡号的话,是可以用本人由身份证和银... 1 \n",
"35871 以建行家装贷为例:“家装贷”是建设银行所有具有装修融资服务功能的个人贷款产品,包括个人住房抵... 1 \n",
"3417 一般来说,公积金套现主要存在几个方面的风险:一、中介机构提取完公积金后,有可能会携款潜逃,竹... 1 \n",
"25763 农行理财业务与国内同业同步,迄今为止,已经形成了制度体系较为完善、系统开发不断前进、产品系列... 1 \n",
"13381 需要多少呢 0 \n",
"7448 不会,放心。 0 \n",
"2617 我想开通农业掌上银行提示要开通短信,可以先开通短信把掌上银行开通后,取消短信服务吗?对其他有... 0 \n",
"24669 现在外面ATM机都是24小时才到帐 0 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_all.sample(n=20)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
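Note on the notebook above: the field table says `question` may be empty, and the sample shows it as `NaN` in most rows. For the suggested FAQ experiment, one option is to fall back to `title` when the body is missing and keep only the replies flagged as best. This is a rough sketch on a toy frame; the `query` column name is hypothetical, and treating `is_best == 1` as "best answer" is an assumption read off the sample output.

```python
import pandas as pd

# Toy stand-in with the same four columns as nonghangzhidao_filter.csv.
pd_all = pd.DataFrame({
    "title": ["t1", "t2", "t3"],
    "question": [float("nan"), "q2 details", float("nan")],
    "reply": ["r1", "r2", "r3"],
    "is_best": [1, 0, 1],
})

# Use the question body when present, otherwise fall back to the title.
pd_all["query"] = pd_all["question"].fillna(pd_all["title"])

# Keep only replies flagged as best (assumes 1 means best answer).
faq_pairs = pd_all.loc[pd_all["is_best"] == 1, ["query", "reply"]].reset_index(drop=True)

print(faq_pairs)
```

Whether to drop the non-best replies depends on the experiment; they can also serve as negative examples for a ranking model.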
@@ -1,553 +0,0 @@
{
"cells": [
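{
"cell_type": "markdown",
"metadata": {},
"source": [
"# online_shopping_10_cats overview\n",
"0. **Download:** [GitHub](https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/online_shopping_10_cats/online_shopping_10_cats.zip)\n",
"1. **Data overview:** 10 categories (书籍 books, 平板 tablets, 手机 mobile phones, 水果 fruit, 洗发水 shampoo, 热水器 water heaters, 蒙牛 Mengniu, 衣服 clothing, 计算机 computers, 酒店 hotels), more than 60,000 reviews in total, with about 30,000 positive and about 30,000 negative\n",
"2. **Suggested experiment:** sentiment / opinion / review polarity analysis\n",
"3. **Data source:** various e-commerce platforms; details unknown\n",
"4. **Original datasets:** [中文情感分析语料](https://download.csdn.net/download/weixin_38395744/10231401) and [中文情感分析语料库](https://download.csdn.net/download/u010097581/9919245) (two Chinese sentiment analysis corpora), collected from the web; specific authors and sources unknown\n",
"5. **Processing:**\n",
"    1. Merged the 2 corpora into 1\n",
"    2. Consolidated the scattered Excel and txt files into a single csv file\n",
"    3. Removed duplicates"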
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"path = 'path/to/online_shopping_10_cats/'  # placeholder; keep the trailing separator, the file name is appended directly"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. online_shopping_10_cats.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"评论数目(总体):62774\n",
"评论数目(正向):31728\n",
"评论数目(负向):31046\n"
]
}
],
"source": [
"pd_all = pd.read_csv(path + 'online_shopping_10_cats.csv')\n",
"\n",
"print('评论数目(总体):%d' % pd_all.shape[0])\n",
"print('评论数目(正向):%d' % pd_all[pd_all.label==1].shape[0])\n",
"print('评论数目(负向):%d' % pd_all[pd_all.label==0].shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| cat | Category: one of 书籍, 平板, 手机, 水果, 洗发水, 热水器, 蒙牛, 衣服, 计算机, 酒店 |\n",
"| label | 1 = positive review, 0 = negative review |\n",
"| review | Review text |"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>cat</th>\n",
" <th>label</th>\n",
" <th>review</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>11194</th>\n",
" <td>平板</td>\n",
" <td>0</td>\n",
" <td>什么玩意。刚用一天,就充不上电,开不开机,返厂老麻烦,</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17794</th>\n",
" <td>水果</td>\n",
" <td>1</td>\n",
" <td>买了几次了,价格实惠,口感不错,保鲜好!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29529</th>\n",
" <td>洗发水</td>\n",
" <td>1</td>\n",
" <td>挺值得购买的,有包装买回去送家人,毛巾质量不错。小块的可以拿来当擦手帕。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24976</th>\n",
" <td>水果</td>\n",
" <td>0</td>\n",
" <td>真的就算后悔了。两天才拿到货。还不如水果店买!还都发霉不新鲜了!以后不买了</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28447</th>\n",
" <td>洗发水</td>\n",
" <td>1</td>\n",
" <td>一般般,薄荷洗发水没想象中的凉快</td>\n",
" </tr>\n",
" <tr>\n",
" <th>264</th>\n",
" <td>书籍</td>\n",
" <td>1</td>\n",
" <td>这本书有别于以往看过的早教书籍,结合了说明文的写实,散文的情致和图册的一目了然。特别是读过几...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>53035</th>\n",
" <td>酒店</td>\n",
" <td>1</td>\n",
" <td>酒店的大堂很漂亮,房间不算小,设施还可以也很干净,离码头很近,而且又有车接送,很方便.晚上2...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50250</th>\n",
" <td>计算机</td>\n",
" <td>1</td>\n",
" <td>做工不错,外壳也很漂亮。测试了一下还行!~中通很快啊,13号下午的订单,今天早上就收到了。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>62461</th>\n",
" <td>酒店</td>\n",
" <td>0</td>\n",
" <td>房间空间比较小, 环境比较吵。特别半夜被窗户外面的空调外机的声音吵醒(因为窗外一条巷子之隔,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>52888</th>\n",
" <td>酒店</td>\n",
" <td>1</td>\n",
" <td>清明节入住两天.从进入酒店就感受到无处不在的服务,非常周到,又很得体.从大堂,商务中心,到前...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31429</th>\n",
" <td>洗发水</td>\n",
" <td>0</td>\n",
" <td>感觉不怎么样,刚刚洗完头发又感觉头发干枯枯的而且还是好油</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21443</th>\n",
" <td>水果</td>\n",
" <td>0</td>\n",
" <td>算了,不要买了,先不说个头小,就味道难吃的要死,还没有路边摊卖的好吃,硬,涩,根本就没有苹果...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19374</th>\n",
" <td>水果</td>\n",
" <td>1</td>\n",
" <td>快递神速,品种与描述一样,比上次买的好吃!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28188</th>\n",
" <td>洗发水</td>\n",
" <td>1</td>\n",
" <td>还没有用,不过感觉和实体店买的差不多,等用过之后再追加评价吧</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46182</th>\n",
" <td>衣服</td>\n",
" <td>0</td>\n",
" <td>裤子又大又长,那里像休闲裤,妈的,还修身呢,真是够了</td>\n",
" </tr>\n",
" <tr>\n",
" <th>62616</th>\n",
" <td>酒店</td>\n",
" <td>0</td>\n",
" <td>奇葩的酒店。在一个办公楼里,自己开车去酒店,很难找到,等到了酒店地下停车场,不知道应该坐那部...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44044</th>\n",
" <td>衣服</td>\n",
" <td>0</td>\n",
" <td>我要晕死得节奏,买回来就没穿过,真的是霉!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19456</th>\n",
" <td>水果</td>\n",
" <td>1</td>\n",
" <td>苹果不大,但很脆甜。检查了一下,48个没有烂的,有个别难看的。总体上质量不错</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10562</th>\n",
" <td>平板</td>\n",
" <td>0</td>\n",
" <td>差差差真卡渣渣品牌以后在也不相信大品牌了坑是了</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34199</th>\n",
" <td>洗发水</td>\n",
" <td>0</td>\n",
" <td>这个是6月18当天买的,只有半瓶。购物太差劲了</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" cat label review\n",
"11194 平板 0 什么玩意。刚用一天,就充不上电,开不开机,返厂老麻烦,\n",
"17794 水果 1 买了几次了,价格实惠,口感不错,保鲜好!\n",
"29529 洗发水 1 挺值得购买的,有包装买回去送家人,毛巾质量不错。小块的可以拿来当擦手帕。\n",
"24976 水果 0 真的就算后悔了。两天才拿到货。还不如水果店买!还都发霉不新鲜了!以后不买了\n",
"28447 洗发水 1 一般般,薄荷洗发水没想象中的凉快\n",
"264 书籍 1 这本书有别于以往看过的早教书籍,结合了说明文的写实,散文的情致和图册的一目了然。特别是读过几...\n",
"53035 酒店 1 酒店的大堂很漂亮,房间不算小,设施还可以也很干净,离码头很近,而且又有车接送,很方便.晚上2...\n",
"50250 计算机 1 做工不错,外壳也很漂亮。测试了一下还行!~中通很快啊,13号下午的订单,今天早上就收到了。\n",
"62461 酒店 0 房间空间比较小, 环境比较吵。特别半夜被窗户外面的空调外机的声音吵醒(因为窗外一条巷子之隔,...\n",
"52888 酒店 1 清明节入住两天.从进入酒店就感受到无处不在的服务,非常周到,又很得体.从大堂,商务中心,到前...\n",
"31429 洗发水 0 感觉不怎么样,刚刚洗完头发又感觉头发干枯枯的而且还是好油\n",
"21443 水果 0 算了,不要买了,先不说个头小,就味道难吃的要死,还没有路边摊卖的好吃,硬,涩,根本就没有苹果...\n",
"19374 水果 1 快递神速,品种与描述一样,比上次买的好吃!\n",
"28188 洗发水 1 还没有用,不过感觉和实体店买的差不多,等用过之后再追加评价吧\n",
"46182 衣服 0 裤子又大又长,那里像休闲裤,妈的,还修身呢,真是够了\n",
"62616 酒店 0 奇葩的酒店。在一个办公楼里,自己开车去酒店,很难找到,等到了酒店地下停车场,不知道应该坐那部...\n",
"44044 衣服 0 我要晕死得节奏,买回来就没穿过,真的是霉!\n",
"19456 水果 1 苹果不大,但很脆甜。检查了一下,48个没有烂的,有个别难看的。总体上质量不错\n",
"10562 平板 0 差差差真卡渣渣品牌以后在也不相信大品牌了坑是了\n",
"34199 洗发水 0 这个是6月18当天买的,只有半瓶。购物太差劲了"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_all.sample(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Count the corpus size per category"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"书籍: 3851 (总体), 2100 (正例), 1751 (负例)\n",
"平板: 10000 (总体), 5000 (正例), 5000 (负例)\n",
"手机: 2323 (总体), 1165 (正例), 1158 (负例)\n",
"水果: 10000 (总体), 5000 (正例), 5000 (负例)\n",
"洗发水: 10000 (总体), 5000 (正例), 5000 (负例)\n",
"热水器: 575 (总体), 475 (正例), 100 (负例)\n",
"蒙牛: 2033 (总体), 992 (正例), 1041 (负例)\n",
"衣服: 10000 (总体), 5000 (正例), 5000 (负例)\n",
"计算机: 3992 (总体), 1996 (正例), 1996 (负例)\n",
"酒店: 10000 (总体), 5000 (正例), 5000 (负例)\n"
]
}
],
"source": [
"all_cats = ['书籍', '平板', '手机', '水果', '洗发水', '热水器', '蒙牛', '衣服', '计算机', '酒店']  # all categories\n",
"\n",
"for cat in all_cats:\n",
"    pd_data = pd_all[pd_all['cat'] == cat]  # rows belonging to this category\n",
"    print('{}: {} (总体), {} (正例), {} (负例)'.format(\n",
"        cat, pd_data.shape[0],\n",
"        pd_data[pd_data.label == 1].shape[0], pd_data[pd_data.label == 0].shape[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Load the corpus for selected categories"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"评论数目(总体):17843\n",
"评论数目(正向):9096\n",
"评论数目(负向):8747\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>cat</th>\n",
" <th>label</th>\n",
" <th>review</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1620</th>\n",
" <td>书籍</td>\n",
" <td>1</td>\n",
" <td>符弦歌&amp;凌悠扬,一个背负着道义和家族荣誉,一个洒脱且桀骜不羁,两个完全不相同的人却因为千丝万...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18872</th>\n",
" <td>水果</td>\n",
" <td>1</td>\n",
" <td>一直在吃,烟台苹果,味道不错,物流快</td>\n",
" </tr>\n",
" <tr>\n",
" <th>443</th>\n",
" <td>书籍</td>\n",
" <td>1</td>\n",
" <td>仔细回想这本文集,发现自己喜欢的只是写《教室朝南,没有风筝》的麻宁,不知道是她成长了还是自己...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21437</th>\n",
" <td>水果</td>\n",
" <td>0</td>\n",
" <td>最差的一次购物体验,干瘪,坏心,糟糕透顶</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18321</th>\n",
" <td>水果</td>\n",
" <td>1</td>\n",
" <td>多次购买新鲜爽甜,80个头大大个,物流超快,上午9点前下单,下午16点收货</td>\n",
" </tr>\n",
" <tr>\n",
" <th>568</th>\n",
" <td>书籍</td>\n",
" <td>1</td>\n",
" <td>一开始我是看了当当上的推荐,说不一样的卡梅拉这套书是亚马逊的五星级图书,大家的评论也非常好。...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23927</th>\n",
" <td>水果</td>\n",
" <td>0</td>\n",
" <td>垃圾啊,以后再也不 会买了啊 ,好几个坏的,还有好多歪头歪闹的</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19244</th>\n",
" <td>水果</td>\n",
" <td>1</td>\n",
" <td>包装完好,没有烂果,就是比较小粒,卖相不好。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20643</th>\n",
" <td>水果</td>\n",
" <td>1</td>\n",
" <td>不错不错特别好吃,甜甜的水分还足而且还很脆,第一次在京东买苹果,果然没让我失望,</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22330</th>\n",
" <td>水果</td>\n",
" <td>0</td>\n",
" <td>第一次给差评,刚拿上打开第一个就黑心。差评。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17905</th>\n",
" <td>水果</td>\n",
" <td>1</td>\n",
" <td>妈妈说非常好,谢谢店家,会继续支持</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19439</th>\n",
" <td>水果</td>\n",
" <td>1</td>\n",
" <td>不错不错挺甜的。 收到还凉凉的。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23419</th>\n",
" <td>水果</td>\n",
" <td>0</td>\n",
" <td>吃第一个就是烂的,而且是烂透了的。认栽,图都难得传了!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>355</th>\n",
" <td>书籍</td>\n",
" <td>1</td>\n",
" <td>这本书从男性的视觉诠释了承诺和责任的关系。从达菲一开始的茫然到最后勇敢面对自己的真心,以及对...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24028</th>\n",
" <td>水果</td>\n",
" <td>0</td>\n",
" <td>味同嚼蜡,水泥地里长出来的吗?一点味道都没有还硬的很,颜色很红,个头很小,口感特别差,真后悔</td>\n",
" </tr>\n",
" <tr>\n",
" <th>497</th>\n",
" <td>书籍</td>\n",
" <td>1</td>\n",
" <td>因为众所周知的原因,我一直在内心深处比较抵制日本文化,我们接受的教育也是负面的信息多于正面的...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>52307</th>\n",
" <td>计算机</td>\n",
" <td>0</td>\n",
" <td>噪音稍大,再就是装XP系统确实蓝屏的几率比较大,装VISTA算了,别的缺点暂时真没发觉,水平有限</td>\n",
" </tr>\n",
" <tr>\n",
" <th>51268</th>\n",
" <td>计算机</td>\n",
" <td>0</td>\n",
" <td>可能是主板比较特殊,很多Ghost启动光盘不能识别光驱,不过好像萝卜花园的可以识别。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21227</th>\n",
" <td>水果</td>\n",
" <td>0</td>\n",
" <td>好小一个,根本不是进口的。包装好看而已!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18596</th>\n",
" <td>水果</td>\n",
" <td>1</td>\n",
" <td>好吃真心的好吃赞了,快递特快,继续关注,会回购的</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" cat label review\n",
"1620 书籍 1 符弦歌&凌悠扬,一个背负着道义和家族荣誉,一个洒脱且桀骜不羁,两个完全不相同的人却因为千丝万...\n",
"18872 水果 1 一直在吃,烟台苹果,味道不错,物流快\n",
"443 书籍 1 仔细回想这本文集,发现自己喜欢的只是写《教室朝南,没有风筝》的麻宁,不知道是她成长了还是自己...\n",
"21437 水果 0 最差的一次购物体验,干瘪,坏心,糟糕透顶\n",
"18321 水果 1 多次购买新鲜爽甜,80个头大大个,物流超快,上午9点前下单,下午16点收货\n",
"568 书籍 1 一开始我是看了当当上的推荐,说不一样的卡梅拉这套书是亚马逊的五星级图书,大家的评论也非常好。...\n",
"23927 水果 0 垃圾啊,以后再也不 会买了啊 ,好几个坏的,还有好多歪头歪闹的\n",
"19244 水果 1 包装完好,没有烂果,就是比较小粒,卖相不好。\n",
"20643 水果 1 不错不错特别好吃,甜甜的水分还足而且还很脆,第一次在京东买苹果,果然没让我失望,\n",
"22330 水果 0 第一次给差评,刚拿上打开第一个就黑心。差评。\n",
"17905 水果 1 妈妈说非常好,谢谢店家,会继续支持\n",
"19439 水果 1 不错不错挺甜的。 收到还凉凉的。\n",
"23419 水果 0 吃第一个就是烂的,而且是烂透了的。认栽,图都难得传了!\n",
"355 书籍 1 这本书从男性的视觉诠释了承诺和责任的关系。从达菲一开始的茫然到最后勇敢面对自己的真心,以及对...\n",
"24028 水果 0 味同嚼蜡,水泥地里长出来的吗?一点味道都没有还硬的很,颜色很红,个头很小,口感特别差,真后悔\n",
"497 书籍 1 因为众所周知的原因,我一直在内心深处比较抵制日本文化,我们接受的教育也是负面的信息多于正面的...\n",
"52307 计算机 0 噪音稍大,再就是装XP系统确实蓝屏的几率比较大,装VISTA算了,别的缺点暂时真没发觉,水平有限\n",
"51268 计算机 0 可能是主板比较特殊,很多Ghost启动光盘不能识别光驱,不过好像萝卜花园的可以识别。\n",
"21227 水果 0 好小一个,根本不是进口的。包装好看而已!\n",
"18596 水果 1 好吃真心的好吃赞了,快递特快,继续关注,会回购的"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"target_cats = ['书籍', '水果', '计算机']  # assume only 3 categories are needed: books, fruit, computers\n",
"\n",
"pd_data = pd_all[pd_all.cat.isin(target_cats)]\n",
"\n",
"print('评论数目(总体):%d' % pd_data.shape[0])\n",
"print('评论数目(正向):%d' % pd_data[pd_data.label==1].shape[0])\n",
"print('评论数目(负向):%d' % pd_data[pd_data.label==0].shape[0])\n",
"\n",
"pd_data.sample(20)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
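Note on section 2 of the notebook above: the per-category loop filters the frame once per category and label. The same tally can probably be produced in one pass with `pd.crosstab`. The sketch below runs on a toy frame with the same `cat` and `label` columns; the `total` column name is an addition for illustration, not something the notebook defines.

```python
import pandas as pd

# Toy stand-in for pd_all with the same cat / label columns.
pd_all = pd.DataFrame({
    "cat": ["书籍", "书籍", "水果", "水果", "水果"],
    "label": [1, 0, 1, 1, 0],
})

# One row per category, one column per label value (0 = negative, 1 = positive).
counts = pd.crosstab(pd_all["cat"], pd_all["label"])

# Row sum gives the total number of reviews in each category.
counts["total"] = counts.sum(axis=1)

print(counts)
```

The loop in the notebook remains the simpler choice when only a few categories need to be printed in a fixed order.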
@@ -1,287 +0,0 @@
{
"cells": [
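{
"cell_type": "markdown",
"metadata": {},
"source": [
"# simplifyweibo_4_moods overview\n",
"0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/16c93E5x373nsGozyWevITg)\n",
"1. **Data overview:** more than 360,000 sentiment-labeled Sina Weibo posts covering 4 moods: about 200,000 for 喜悦 (joy) and about 50,000 each for 愤怒 (anger), 厌恶 (disgust), and 低落 (low mood)\n",
"2. **Suggested experiment:** sentiment / opinion / review polarity analysis\n",
"3. **Data source:** [Sina Weibo](https://weibo.com/)\n",
"4. **Original dataset:** [微博情感分析数据集](https://download.csdn.net/download/turkan/9181661) (Weibo sentiment analysis dataset), collected from the web; specific author and source unknown\n",
"5. **Processing:**\n",
"    1. Merged the original 4 files into 1 csv file\n",
"    2. The original corpus was word-segmented; we restored it to unsegmented text\n",
"    3. Unified the encoding to UTF-8\n",
"    4. Removed duplicates"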
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"path = 'path/to/simplifyweibo_4_moods/'  # placeholder; keep the trailing separator, the file name is appended directly"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. simplifyweibo_4_moods.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"微博数目(总体):361744\n",
"微博数目(喜悦):199496\n",
"微博数目(愤怒):51714\n",
"微博数目(厌恶):55267\n",
"微博数目(低落):55267\n"
]
}
],
"source": [
"pd_all = pd.read_csv(path + 'simplifyweibo_4_moods.csv')\n",
"moods = {0: '喜悦', 1: '愤怒', 2: '厌恶', 3: '低落'}\n",
"\n",
"print('微博数目(总体):%d' % pd_all.shape[0])\n",
"\n",
"for label, mood in moods.items(): \n",
" print('微博数目({}):{}'.format(mood, pd_all[pd_all.label==label].shape[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| label | 0 = 喜悦 (joy), 1 = 愤怒 (anger), 2 = 厌恶 (disgust), 3 = 低落 (low mood) |\n",
"| review | Weibo post text |"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>review</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>307114</th>\n",
" <td>3</td>\n",
" <td>回复美国看起来很美,对别人比较狠!对付哪国人,就用哪国人做他的腿,简称狗腿落后的祖宗挨过打!...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>258815</th>\n",
" <td>2</td>\n",
" <td>我表示压力狠大。!哇。犀利妹!偶尔街拍,其实姐只是一个你永远无法超越的传说。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>249801</th>\n",
" <td>1</td>\n",
" <td>可怜,帮这孩子转下,希望不会因为涉嫌联系业务负什么责任啊…………是想粉丝想疯了什么情况啊?想...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>165587</th>\n",
" <td>0</td>\n",
" <td>哦也~ ~ ~ !得瑟哈哈哈耶~ ~ ~ !新logo 。。。。我们的logo 会不会抢了的...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>351395</th>\n",
" <td>3</td>\n",
" <td>我发现真的是最齐全的一张。这是去看北方儿子的时候啊。怀念。对了,我怎么穿那件破衬衫。。好难看...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>339894</th>\n",
" <td>3</td>\n",
" <td>看你那个享受的表情nuna 很感动~</td>\n",
" </tr>\n",
" <tr>\n",
" <th>307523</th>\n",
" <td>3</td>\n",
" <td>不得不轉 !大家淚 奔吧哈哈27開 始,短短8秒,我咽哽了</td>\n",
" </tr>\n",
" <tr>\n",
" <th>124636</th>\n",
" <td>0</td>\n",
" <td>早看到了,再看到还是想笑,好可爱啊</td>\n",
" </tr>\n",
" <tr>\n",
" <th>56901</th>\n",
" <td>0</td>\n",
" <td>快来围观我的小丸子模板~ ~ 哇咔咔~ 得瑟~ ~ ~</td>\n",
" </tr>\n",
" <tr>\n",
" <th>106905</th>\n",
" <td>0</td>\n",
" <td>也未免太厉害了吧.......观看完此视频之后,我终于明白了香港歌星GEM—— 邓紫棋走红的...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>291966</th>\n",
" <td>2</td>\n",
" <td>天啊…是住家发生爆炸了,天热,各位注意安全。一朋友开化工厂的。唉。注意安全。真难以想像,不知...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>321489</th>\n",
" <td>3</td>\n",
" <td>肯德基你就不会带个头,做件好事可爱的脖子们,帮她圆了梦吧~ ~ 小时候来北京,吃过一种小糕点...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>188566</th>\n",
" <td>0</td>\n",
" <td>想去桂林,上学时候就学到一课文说桂林山水甲天下,一直想去看看品橙网国庆旅游胜地创意评奖活动开...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15444</th>\n",
" <td>0</td>\n",
" <td>晃姐姐口才真不是一般人的高,这大概就是文凭带来的区别吧。拿着真文凭的人总会觉得那是自己的底线...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>56820</th>\n",
" <td>0</td>\n",
" <td>火火happy birthday 天蝎座的人虽然喜欢隐藏自己,但是他喜欢掌握每天生活当中与他...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>257031</th>\n",
" <td>2</td>\n",
" <td>好久没看了。。。还是那么的感动~ ~ ~ ~</td>\n",
" </tr>\n",
" <tr>\n",
" <th>144782</th>\n",
" <td>0</td>\n",
" <td>你看他像几岁?关键是牛尔多大?【分享图片】现场挑战高难度抗衰老奇迹~ 看看他都使用倩碧什么产品~</td>\n",
" </tr>\n",
" <tr>\n",
" <th>130776</th>\n",
" <td>0</td>\n",
" <td>比江苏台的好玩这个真的很搞笑,再次推荐!哈哈,这个绝对值得一看,搞笑死了。当然其中的讽刺意味...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59158</th>\n",
" <td>0</td>\n",
" <td>【YMG 推荐】来,哥让你见识下,什么是真正的招财猫!要发财的童鞋抱走~ ~ 在海味舖 買 ?</td>\n",
" </tr>\n",
" <tr>\n",
" <th>240262</th>\n",
" <td>1</td>\n",
" <td>该带套的时候要带上。大哥,你就得瑟吧和吃饭。美女很美很火。因为吃香辣小龙虾,我的衬衫歇火了。...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label review\n",
"307114 3 回复美国看起来很美,对别人比较狠!对付哪国人,就用哪国人做他的腿,简称狗腿落后的祖宗挨过打!...\n",
"258815 2 我表示压力狠大。!哇。犀利妹!偶尔街拍,其实姐只是一个你永远无法超越的传说。\n",
"249801 1 可怜,帮这孩子转下,希望不会因为涉嫌联系业务负什么责任啊…………是想粉丝想疯了什么情况啊?想...\n",
"165587 0 哦也~ ~ ~ !得瑟哈哈哈耶~ ~ ~ !新logo 。。。。我们的logo 会不会抢了的...\n",
"351395 3 我发现真的是最齐全的一张。这是去看北方儿子的时候啊。怀念。对了,我怎么穿那件破衬衫。。好难看...\n",
"339894 3 看你那个享受的表情nuna 很感动~\n",
"307523 3 不得不轉 !大家淚 奔吧哈哈27開 始,短短8秒,我咽哽了\n",
"124636 0 早看到了,再看到还是想笑,好可爱啊\n",
"56901 0 快来围观我的小丸子模板~ ~ 哇咔咔~ 得瑟~ ~ ~\n",
"106905 0 也未免太厉害了吧.......观看完此视频之后,我终于明白了香港歌星GEM—— 邓紫棋走红的...\n",
"291966 2 天啊…是住家发生爆炸了,天热,各位注意安全。一朋友开化工厂的。唉。注意安全。真难以想像,不知...\n",
"321489 3 肯德基你就不会带个头,做件好事可爱的脖子们,帮她圆了梦吧~ ~ 小时候来北京,吃过一种小糕点...\n",
"188566 0 想去桂林,上学时候就学到一课文说桂林山水甲天下,一直想去看看品橙网国庆旅游胜地创意评奖活动开...\n",
"15444 0 晃姐姐口才真不是一般人的高,这大概就是文凭带来的区别吧。拿着真文凭的人总会觉得那是自己的底线...\n",
"56820 0 火火happy birthday 天蝎座的人虽然喜欢隐藏自己,但是他喜欢掌握每天生活当中与他...\n",
"257031 2 好久没看了。。。还是那么的感动~ ~ ~ ~\n",
"144782 0 你看他像几岁?关键是牛尔多大?【分享图片】现场挑战高难度抗衰老奇迹~ 看看他都使用倩碧什么产品~\n",
"130776 0 比江苏台的好玩这个真的很搞笑,再次推荐!哈哈,这个绝对值得一看,搞笑死了。当然其中的讽刺意味...\n",
"59158 0 【YMG 推荐】来,哥让你见识下,什么是真正的招财猫!要发财的童鞋抱走~ ~ 在海味舖 買 ?\n",
"240262 1 该带套的时候要带上。大哥,你就得瑟吧和吃饭。美女很美很火。因为吃香辣小龙虾,我的衬衫歇火了。..."
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_all.sample(20)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
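Note on the load cell in the notebook above: `path + 'simplifyweibo_4_moods.csv'` only yields a valid location when `path` ends with a separator. A small sketch of a more forgiving way to build the location with `pathlib` follows; the `data` folder is a hypothetical placeholder, not a path taken from the repository.

```python
from pathlib import Path

# Hypothetical location; point this at the unpacked dataset folder.
folder = Path("data") / "simplifyweibo_4_moods"

# The / operator inserts the separator, with or without a trailing slash.
csv_path = folder / "simplifyweibo_4_moods.csv"

# Plain string concatenation silently loses the separator.
naive = "data/simplifyweibo_4_moods" + "simplifyweibo_4_moods.csv"

print(csv_path.name)
print(naive)
```

`pd.read_csv` accepts a `Path` object directly, so `csv_path` can be passed to it unchanged.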
@@ -1,341 +0,0 @@
{
"cells": [
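{
"cell_type": "markdown",
"metadata": {},
"source": [
"# touzizhidao_filter overview\n",
"0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1SR5d20DPpU7F1h_OVf64GA)\n",
"1. **Data overview:** 588,000 question-and-answer records about investment\n",
"2. **Suggested experiment:** FAQ question-answering system\n",
"3. **Data source:** Baidu Zhidao\n",
"4. **Processing:**\n",
"    1. Removed the id, url, qid, reply_t, and user fields\n",
"    2. Anonymized the question and reply fields"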
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"path = 'path/to/touzizhidao/'  # placeholder; keep the trailing separator, the file name is appended directly"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. touzizhidao_filter.csv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"pd_all = pd.read_csv(path + 'touzizhidao_filter.csv')"
]
},
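{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| title | Title of the question |\n",
"| question | Body of the question (may be empty) |\n",
"| reply | Reply text |\n",
"| is_best | Whether this reply is the best answer shown on the page |"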
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>title</th>\n",
" <th>question</th>\n",
" <th>reply</th>\n",
" <th>is_best</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>133637</th>\n",
" <td>华夏银行信用卡怎么查询申请进度</td>\n",
" <td>NaN</td>\n",
" <td>信用卡申请进度查询:查询步骤:一、网银查询:1、登录银行信用卡中心页面,然后点击“办卡进度查...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>295236</th>\n",
" <td>我向上海复星投资创业有限公司申请贷款要交1000元保险开户费,交了</td>\n",
" <td>我向上海复星投资创业有限公司申请贷款要交1000元保险开户费,交了过后又说我银行卡不行还要交...</td>\n",
" <td>我的不用</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>329332</th>\n",
" <td>二手房买卖中介收费是多少二手房买卖中介如何收费</td>\n",
" <td>NaN</td>\n",
" <td>二手房交易流程(1)买方咨询买卖双方建立信息沟通渠道,买方了解房屋整体现状及产权状况,要求卖...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>176871</th>\n",
" <td>单位给职工办的社保卡买药里面资金不足怎么办</td>\n",
" <td>单位给职工办的社保卡买药里面资金不足怎么办</td>\n",
" <td>不足的部分需要自己支付医保卡的使用范围主要有以下三个方面:1、用于购药:参保人员在定点药店买...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>485667</th>\n",
" <td>银联医保卡去哪家银行激活。</td>\n",
" <td>NaN</td>\n",
" <td>医保卡上面的银行医保卡激活的步骤:1、带着老卡和新卡到建设银行办理;2、新医保卡的密码是身份...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5012</th>\n",
" <td>买一套大概150万的二手门面房大概要交多少钱</td>\n",
" <td>NaN</td>\n",
" <td>如果购买的是非普通住宅,除了缴纳房屋费用,还需要按以下规定缴纳相关税费:(1)增值税:非住宅...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>475672</th>\n",
" <td>二手房买卖,公共维修基金应怎么处理?是需要下家支付给上家账面余额,还是无偿顺延呢?</td>\n",
" <td>买卖合同中是这样写的:“出卖人同意其缴纳的该房屋专项维修资金(公共维修基金)的账面余额转移至...</td>\n",
" <td>需要办理维修基金过户。无偿顺延就可以。维修基金使用条件:1、维修基金只有在保修期满后,对物业...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>199291</th>\n",
" <td>信用卡全额还款好还是最低还款好</td>\n",
" <td>NaN</td>\n",
" <td>如果条件可以,当然是全额还款好,最低还款是要付利息的,而且还有点高,银行当然希望是最低还款,...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>265499</th>\n",
" <td>花呗如何才能提额</td>\n",
" <td>花呗怎样才能提高额度</td>\n",
" <td>花呗额度取决于芝麻信用分,若要提升额度,需要先提升芝麻信用分,提升芝麻信用分小技巧:1、多在...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>224237</th>\n",
" <td>工行信用卡逾期两个月,没有90天!!</td>\n",
" <td>工行信用卡逾期两个月,没有90天!!银行把卡冻结了,欠款7000,全部还清以后打电话解冻,客...</td>\n",
" <td>可以用,但额度只有2000元,且征信上有逾期记录注销吧</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>271023</th>\n",
" <td>中*房价什么时候会大跌</td>\n",
" <td>NaN</td>\n",
" <td>我感觉房价下降的几率比较小,现在啥都涨价,国家再调控,也不可能让我这月收入几千块钱的人买得起...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14097</th>\n",
" <td>在四*地*,个人所得税达到多少金额</td>\n",
" <td>NaN</td>\n",
" <td>个人所得税征税内容工资、薪金所得,个体工商户的生产、经营所得,他有偿服务活动取得的所得。经营...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>363978</th>\n",
" <td>关于贷款的</td>\n",
" <td>关于贷款的有没有什么借款途径</td>\n",
" <td>有口子。</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>517939</th>\n",
" <td>农村建房可以贷款吗</td>\n",
" <td>NaN</td>\n",
" <td>不可以,银行贷款一般是能够上市交易的房子。贷款需要准备四大类资料:1、个人身份证明:身份证、...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>505671</th>\n",
" <td>2017年甘*个人医保卡能让别人用吗</td>\n",
" <td>NaN</td>\n",
" <td>个人医保卡是不能让别人使用的。医保卡(社保卡)只限本人就医时使用,不能出借给他人。参保人如把...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>117318</th>\n",
" <td>别墅好还是高层好</td>\n",
" <td>别墅好还是高层好</td>\n",
" <td>别墅。还是看你自己的需要还有经济能力了不是房子建的好看就算是别墅的。别墅即别野,讲究的是周围...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>376669</th>\n",
" <td>新车什么时候算上户了也就说法律上属于自己的财产</td>\n",
" <td>NaN</td>\n",
" <td>购房合同签订完了车子就属于个人财产了。中*人*共*国*法通则第七十五条规定:个人财产所有权包...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>179097</th>\n",
" <td>农民59岁买什么养老</td>\n",
" <td>农民59岁买什么养老</td>\n",
" <td>多存点钱。</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>77847</th>\n",
" <td>支付宝转账手续费怎么收的</td>\n",
" <td>NaN</td>\n",
" <td>好想是一个月内不能超过5万没有手续费你好,每个支付宝账户有两万元的免费提现和转账额度,提现和...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>319220</th>\n",
" <td>甘*省企业退休人员养老金怎么调整</td>\n",
" <td>NaN</td>\n",
" <td>2016年,我国实现了企业和机关事业单位养老金待遇同步调整,按6.5%左右提高企业和机关事业...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" title \\\n",
"133637 华夏银行信用卡怎么查询申请进度 \n",
"295236 我向上海复星投资创业有限公司申请贷款要交1000元保险开户费,交了 \n",
"329332 二手房买卖中介收费是多少二手房买卖中介如何收费 \n",
"176871 单位给职工办的社保卡买药里面资金不足怎么办 \n",
"485667 银联医保卡去哪家银行激活。 \n",
"5012 买一套大概150万的二手门面房大概要交多少钱 \n",
"475672 二手房买卖,公共维修基金应怎么处理?是需要下家支付给上家账面余额,还是无偿顺延呢? \n",
"199291 信用卡全额还款好还是最低还款好 \n",
"265499 花呗如何才能提额 \n",
"224237 工行信用卡逾期两个月,没有90天!! \n",
"271023 中*房价什么时候会大跌 \n",
"14097 在四*地*,个人所得税达到多少金额 \n",
"363978 关于贷款的 \n",
"517939 农村建房可以贷款吗 \n",
"505671 2017年甘*个人医保卡能让别人用吗 \n",
"117318 别墅好还是高层好 \n",
"376669 新车什么时候算上户了也就说法律上属于自己的财产 \n",
"179097 农民59岁买什么养老 \n",
"77847 支付宝转账手续费怎么收的 \n",
"319220 甘*省企业退休人员养老金怎么调整 \n",
"\n",
" question \\\n",
"133637 NaN \n",
"295236 我向上海复星投资创业有限公司申请贷款要交1000元保险开户费,交了过后又说我银行卡不行还要交... \n",
"329332 NaN \n",
"176871 单位给职工办的社保卡买药里面资金不足怎么办 \n",
"485667 NaN \n",
"5012 NaN \n",
"475672 买卖合同中是这样写的:“出卖人同意其缴纳的该房屋专项维修资金(公共维修基金)的账面余额转移至... \n",
"199291 NaN \n",
"265499 花呗怎样才能提高额度 \n",
"224237 工行信用卡逾期两个月,没有90天!!银行把卡冻结了,欠款7000,全部还清以后打电话解冻,客... \n",
"271023 NaN \n",
"14097 NaN \n",
"363978 关于贷款的有没有什么借款途径 \n",
"517939 NaN \n",
"505671 NaN \n",
"117318 别墅好还是高层好 \n",
"376669 NaN \n",
"179097 农民59岁买什么养老 \n",
"77847 NaN \n",
"319220 NaN \n",
"\n",
" reply is_best \n",
"133637 信用卡申请进度查询:查询步骤:一、网银查询:1、登录银行信用卡中心页面,然后点击“办卡进度查... 1 \n",
"295236 我的不用 0 \n",
"329332 二手房交易流程(1)买方咨询买卖双方建立信息沟通渠道,买方了解房屋整体现状及产权状况,要求卖... 1 \n",
"176871 不足的部分需要自己支付医保卡的使用范围主要有以下三个方面:1、用于购药:参保人员在定点药店买... 1 \n",
"485667 医保卡上面的银行医保卡激活的步骤:1、带着老卡和新卡到建设银行办理;2、新医保卡的密码是身份... 1 \n",
"5012 如果购买的是非普通住宅,除了缴纳房屋费用,还需要按以下规定缴纳相关税费:(1)增值税:非住宅... 1 \n",
"475672 需要办理维修基金过户。无偿顺延就可以。维修基金使用条件:1、维修基金只有在保修期满后,对物业... 0 \n",
"199291 如果条件可以,当然是全额还款好,最低还款是要付利息的,而且还有点高,银行当然希望是最低还款,... 1 \n",
"265499 花呗额度取决于芝麻信用分,若要提升额度,需要先提升芝麻信用分,提升芝麻信用分小技巧:1、多在... 0 \n",
"224237 可以用,但额度只有2000元,且征信上有逾期记录注销吧 0 \n",
"271023 我感觉房价下降的几率比较小,现在啥都涨价,国家再调控,也不可能让我这月收入几千块钱的人买得起... 0 \n",
"14097 个人所得税征税内容工资、薪金所得,个体工商户的生产、经营所得,他有偿服务活动取得的所得。经营... 1 \n",
"363978 有口子。 0 \n",
"517939 不可以,银行贷款一般是能够上市交易的房子。贷款需要准备四大类资料:1、个人身份证明:身份证、... 1 \n",
"505671 个人医保卡是不能让别人使用的。医保卡(社保卡)只限本人就医时使用,不能出借给他人。参保人如把... 1 \n",
"117318 别墅。还是看你自己的需要还有经济能力了不是房子建的好看就算是别墅的。别墅即别野,讲究的是周围... 0 \n",
"376669 购房合同签订完了车子就属于个人财产了。中*人*共*国*法通则第七十五条规定:个人财产所有权包... 1 \n",
"179097 多存点钱。 0 \n",
"77847 好想是一个月内不能超过5万没有手续费你好,每个支付宝账户有两万元的免费提现和转账额度,提现和... 0 \n",
"319220 2016年,我国实现了企业和机关事业单位养老金待遇同步调整,按6.5%左右提高企业和机关事业... 1 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_all.sample(n=20)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,426 +0,0 @@
{
"cells": [
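{
"cell_type": "markdown",
"metadata": {},
"source": [
"# waimai_10k Overview\n",
"0. **Download:** [GitHub](https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/waimai_10k/waimai_10k.csv)\n",
"1. **Data overview:** User reviews collected from a food-delivery platform: 4,000 positive and about 8,000 negative\n",
"2. **Suggested experiments:** Sentiment / opinion / review polarity analysis\n",
"3. **Data source:** A food-delivery platform\n",
"4. **Original dataset:** [中文短文本情感分析语料 外卖评价](https://download.csdn.net/download/cstkl/10236683) (Chinese short-text sentiment analysis corpus, food-delivery reviews), collected from the web; the original author and source are unknown\n",
"5. **Processing:**\n",
"    1. Merged the original 2 files into 1 file\n",
"    2. Removed duplicates"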
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
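{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"path = 'path/to/waimai_10k/'  # placeholder: folder containing waimai_10k.csv; keep the trailing slash, later cells use path + filename"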
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. waimai_10k.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Reviews (total): 11987\n",
"Reviews (positive): 4000\n",
"Reviews (negative): 7987\n"
]
}
],
"source": [
"pd_all = pd.read_csv(path + 'waimai_10k.csv')\n",
"\n",
"print('Reviews (total): %d' % pd_all.shape[0])\n",
"print('Reviews (positive): %d' % pd_all[pd_all.label==1].shape[0])\n",
"print('Reviews (negative): %d' % pd_all[pd_all.label==0].shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| label | 1 = positive review, 0 = negative review |\n",
"| review | Review text |"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>review</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>1</td>\n",
" <td>送餐特别快,态度也好,辛苦啦</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6632</th>\n",
" <td>0</td>\n",
" <td>点了热带雨林披萨+饮料,和BBQ鸡肉披萨+饮料,送来的是两个奥尔良披萨+两个银耳冰粥,冰凉冰...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8849</th>\n",
" <td>0</td>\n",
" <td>难吃!!!油死了,味道烂</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11114</th>\n",
" <td>0</td>\n",
" <td>今天菜太咸,连着定了3天吃,一天比一天难吃。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11661</th>\n",
" <td>0</td>\n",
" <td>送的太慢了,菜都凉了。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9571</th>\n",
" <td>0</td>\n",
" <td>没有满减!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10614</th>\n",
" <td>0</td>\n",
" <td>差评!定的时间是12点一刻,结果刚11点就送来了!果断退单。送餐前不看时间吗?</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7585</th>\n",
" <td>0</td>\n",
" <td>羊肉串太咸,还有些不新鲜。鸡心和鸡胗烤的太老</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6919</th>\n",
" <td>0</td>\n",
" <td>快递员挺好,速度挺快</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3192</th>\n",
" <td>1</td>\n",
" <td>小炒肉卷饼好辣~</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10224</th>\n",
" <td>0</td>\n",
" <td>送来的时候都凉了,味道一般,鲜果西米露就两口的量,鲜果就是一块西瓜一个西瓜籽</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7295</th>\n",
" <td>0</td>\n",
" <td>没放糖,没放奶油,好难喝</td>\n",
" </tr>\n",
" <tr>\n",
" <th>275</th>\n",
" <td>1</td>\n",
" <td>他家的奶茶超级好喝。。。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8378</th>\n",
" <td>0</td>\n",
" <td>黑椒牛柳饭送成大排饭</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5879</th>\n",
" <td>0</td>\n",
" <td>一个半小时,可以</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7523</th>\n",
" <td>0</td>\n",
" <td>订单满减后应该是24,送过来要收我原价39?你搞笑呐,还少听加多宝!我管你什么美食送的还是你...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6590</th>\n",
" <td>0</td>\n",
" <td>真心也忒慢了,其他都还成</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1703</th>\n",
" <td>1</td>\n",
" <td>非常划算,很好</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5345</th>\n",
" <td>0</td>\n",
" <td>首选是得吐槽一下这家的速度,一个半小时起,然后卷饼包装很不错,酱香鸡肉的比较赞,飘香肘子一般...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1674</th>\n",
" <td>1</td>\n",
" <td>离我们远点55分钟送到的,可以理解,饼和粥都不错</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label review\n",
"25 1 送餐特别快,态度也好,辛苦啦\n",
"6632 0 点了热带雨林披萨+饮料,和BBQ鸡肉披萨+饮料,送来的是两个奥尔良披萨+两个银耳冰粥,冰凉冰...\n",
"8849 0 难吃!!!油死了,味道烂\n",
"11114 0 今天菜太咸,连着定了3天吃,一天比一天难吃。\n",
"11661 0 送的太慢了,菜都凉了。\n",
"9571 0 没有满减!\n",
"10614 0 差评!定的时间是12点一刻,结果刚11点就送来了!果断退单。送餐前不看时间吗?\n",
"7585 0 羊肉串太咸,还有些不新鲜。鸡心和鸡胗烤的太老\n",
"6919 0 快递员挺好,速度挺快\n",
"3192 1 小炒肉卷饼好辣~\n",
"10224 0 送来的时候都凉了,味道一般,鲜果西米露就两口的量,鲜果就是一块西瓜一个西瓜籽\n",
"7295 0 没放糖,没放奶油,好难喝\n",
"275 1 他家的奶茶超级好喝。。。\n",
"8378 0 黑椒牛柳饭送成大排饭\n",
"5879 0 一个半小时,可以\n",
"7523 0 订单满减后应该是24,送过来要收我原价39?你搞笑呐,还少听加多宝!我管你什么美食送的还是你...\n",
"6590 0 真心也忒慢了,其他都还成\n",
"1703 1 非常划算,很好\n",
"5345 0 首选是得吐槽一下这家的速度,一个半小时起,然后卷饼包装很不错,酱香鸡肉的比较赞,飘香肘子一般...\n",
"1674 1 离我们远点55分钟送到的,可以理解,饼和粥都不错"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_all.sample(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Build a balanced corpus"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"pd_positive = pd_all[pd_all.label==1]\n",
"pd_negative = pd_all[pd_all.label==0]\n",
"\n",
"def get_balance_corpus(corpus_size, corpus_pos, corpus_neg):\n",
"    # Draw corpus_size // 2 rows from each class;\n",
"    # sample with replacement only when a class has fewer rows than requested.\n",
"    sample_size = corpus_size // 2\n",
"    pd_corpus_balance = pd.concat([corpus_pos.sample(sample_size, replace=corpus_pos.shape[0]<sample_size),\n",
"                                   corpus_neg.sample(sample_size, replace=corpus_neg.shape[0]<sample_size)])\n",
"\n",
"    print('Reviews (total): %d' % pd_corpus_balance.shape[0])\n",
"    print('Reviews (positive): %d' % pd_corpus_balance[pd_corpus_balance.label==1].shape[0])\n",
"    print('Reviews (negative): %d' % pd_corpus_balance[pd_corpus_balance.label==0].shape[0])\n",
"\n",
"    return pd_corpus_balance"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Reviews (total): 4000\n",
"Reviews (positive): 2000\n",
"Reviews (negative): 2000\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>review</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>10436</th>\n",
" <td>0</td>\n",
" <td>难吃~石锅拌饭居然没酱~而且刚好晚了29分钟</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10468</th>\n",
" <td>0</td>\n",
" <td>等了很久,没关系,毕竟还在约定时间内,可是最让我忍不了的是真的很一般,个人口味吧,反正不和我...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1643</th>\n",
" <td>1</td>\n",
" <td>嗯,纸袋比较高大上</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8723</th>\n",
" <td>0</td>\n",
" <td>海参怎么是生的,没法吃,郁闷</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2431</th>\n",
" <td>1</td>\n",
" <td>送餐很快,送餐人员很热情!~</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5121</th>\n",
" <td>0</td>\n",
" <td>不如以前好吃,肘子都有味儿了!哎!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10565</th>\n",
" <td>0</td>\n",
" <td>东西有些小贵。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2413</th>\n",
" <td>1</td>\n",
" <td>虽然时间长了些但是很准时。下次记得给些番茄酱就更好了。,一个人吃足够了。好好吃</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11937</th>\n",
" <td>0</td>\n",
" <td>11点以前就定的餐,做了1小时48分钟,呵呵,我只想说:拜拜!!!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1024</th>\n",
" <td>1</td>\n",
" <td>很好吃,面皮特别有嚼劲儿,酱料也很好吃</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label review\n",
"10436 0 难吃~石锅拌饭居然没酱~而且刚好晚了29分钟\n",
"10468 0 等了很久,没关系,毕竟还在约定时间内,可是最让我忍不了的是真的很一般,个人口味吧,反正不和我...\n",
"1643 1 嗯,纸袋比较高大上\n",
"8723 0 海参怎么是生的,没法吃,郁闷\n",
"2431 1 送餐很快,送餐人员很热情!~\n",
"5121 0 不如以前好吃,肘子都有味儿了!哎!\n",
"10565 0 东西有些小贵。\n",
"2413 1 虽然时间长了些但是很准时。下次记得给些番茄酱就更好了。,一个人吃足够了。好好吃\n",
"11937 0 11点以前就定的餐,做了1小时48分钟,呵呵,我只想说:拜拜!!!\n",
"1024 1 很好吃,面皮特别有嚼劲儿,酱料也很好吃"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"waimai_10k_ba_4000 = get_balance_corpus(4000, pd_positive, pd_negative)\n",
"\n",
"waimai_10k_ba_4000.sample(10)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
File diff suppressed because it is too large
@@ -1,280 +0,0 @@
{
"cells": [
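{
"cell_type": "markdown",
"metadata": {},
"source": [
"# weibo_senti_100k Overview\n",
"0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1DoQbki3YwqkuwQUOj64R_g)\n",
"1. **Data overview:** About 120,000 sentiment-labeled Sina Weibo posts, roughly 60,000 positive and 60,000 negative\n",
"2. **Suggested experiments:** Sentiment / opinion / review polarity analysis\n",
"3. **Data source:** [Sina Weibo](https://weibo.com/)\n",
"4. **Original dataset:** [新浪微博,情感分析标记语料共12万条](https://download.csdn.net/download/weixin_38442818/10214750) (Sina Weibo sentiment-labeled corpus, 120,000 posts), collected from the web; the original author and source are unknown\n",
"5. **Processing:**\n",
"    1. Merged the original 2 documents into 1 csv file\n",
"    2. Unified the encoding to UTF-8\n",
"    3. Removed duplicates"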
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
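{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"path = 'path/to/weibo_senti_100k/'  # placeholder: folder containing weibo_senti_100k.csv; keep the trailing slash, later cells use path + filename"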
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. weibo_senti_100k.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Posts (total): 119988\n",
"Posts (positive): 59993\n",
"Posts (negative): 59995\n"
]
}
],
"source": [
"pd_all = pd.read_csv(path + 'weibo_senti_100k.csv')\n",
"\n",
"print('Posts (total): %d' % pd_all.shape[0])\n",
"print('Posts (positive): %d' % pd_all[pd_all.label==1].shape[0])\n",
"print('Posts (negative): %d' % pd_all[pd_all.label==0].shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| label | 1 = positive post, 0 = negative post |\n",
"| review | Weibo post text |"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>review</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>62050</th>\n",
" <td>0</td>\n",
" <td>太过分了@Rexzhenghao //@Janie_Zhang:招行最近负面新闻越来越多呀...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>68263</th>\n",
" <td>0</td>\n",
" <td>希望你?得好?我本"?肥血?史"[晕][哈哈]@Pete三姑父</td>\n",
" </tr>\n",
" <tr>\n",
" <th>81472</th>\n",
" <td>0</td>\n",
" <td>有点想参加????[偷?]想安排下时间再决定[抓狂]//@黑晶晶crystal: @细腿大羽...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>42021</th>\n",
" <td>1</td>\n",
" <td>[给力]感谢所有支持雯婕的芝麻![爱你]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7777</th>\n",
" <td>1</td>\n",
" <td>2013最后一天,在新加坡开心度过,向所有的朋友们问声:新年快乐!2014年,我们会更好[调...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>100399</th>\n",
" <td>0</td>\n",
" <td>大中午出门办事找错路,曝晒中。要多杯具有多杯具。[泪][泪][汗]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>82398</th>\n",
" <td>0</td>\n",
" <td>马航还会否认吗?到底在隐瞒啥呢?[抓狂]//@头条新闻: 转发微博</td>\n",
" </tr>\n",
" <tr>\n",
" <th>106423</th>\n",
" <td>0</td>\n",
" <td>克罗地亚球迷很爱放烟火!球又没进,就硝烟四起。[晕]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24798</th>\n",
" <td>1</td>\n",
" <td>[抱抱]福芦 TangRoulou 吉祥书 8.8折优惠 &gt;&gt;&gt; http://t.cn/z...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6598</th>\n",
" <td>1</td>\n",
" <td>回复@钱旭明QXM:[嘻嘻][嘻嘻] //@钱旭明QXM:杨大哥[good][good][g...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>53920</th>\n",
" <td>1</td>\n",
" <td>人家这脸长的!!!!!![哈哈]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15587</th>\n",
" <td>1</td>\n",
" <td>这个价不算高,和一天内训相比相差无几。。[哈哈]//@博通传媒v: 6个月!一个月工资1万,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101237</th>\n",
" <td>0</td>\n",
" <td>终于收工啦,脚丫子快冻掉了[泪][泪][泪]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>82449</th>\n",
" <td>0</td>\n",
" <td>我决定从今天开始我想吃什么就去吃什么,一个人吃也无所谓,重点是不要因为别人的意见委屈了自己[...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32537</th>\n",
" <td>1</td>\n",
" <td>飘雪的北京 需要双份早餐.......//@美食天下: [哈哈]//@王淼Margay: 屁...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10630</th>\n",
" <td>1</td>\n",
" <td>[耶],这个太赞了,生活大爆炸第六季马上要出啦[鼓掌] //@-郑瑜-:这个不错 //@经典...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>85130</th>\n",
" <td>0</td>\n",
" <td>刚追完#倾世皇妃#,#千山暮雪#又紧随其后,网速和更新速度都太不给力,尽管我看过原著,还是焦...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>105956</th>\n",
" <td>0</td>\n",
" <td>晚上看金二胖?察前?,推出的火炮基座?糟了,可以PK了[泪] //@艾米粒er: //@wi...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>72391</th>\n",
" <td>0</td>\n",
" <td>必须把中国足球的伟大,用我的职业演说出来 //@袁腾飞:[泪]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10761</th>\n",
" <td>1</td>\n",
" <td>[鼓掌] //@宁波香格里拉大酒店: 小编来答疑,周五晚惊艳全场的树根蛋糕到底有多长?蛋糕全...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label review\n",
"62050 0 太过分了@Rexzhenghao //@Janie_Zhang:招行最近负面新闻越来越多呀...\n",
"68263 0 希望你?得好?我本"?肥血?史"[晕][哈哈]@Pete三姑父\n",
"81472 0 有点想参加????[偷?]想安排下时间再决定[抓狂]//@黑晶晶crystal: @细腿大羽...\n",
"42021 1 [给力]感谢所有支持雯婕的芝麻![爱你]\n",
"7777 1 2013最后一天,在新加坡开心度过,向所有的朋友们问声:新年快乐!2014年,我们会更好[调...\n",
"100399 0 大中午出门办事找错路,曝晒中。要多杯具有多杯具。[泪][泪][汗]\n",
"82398 0 马航还会否认吗?到底在隐瞒啥呢?[抓狂]//@头条新闻: 转发微博\n",
"106423 0 克罗地亚球迷很爱放烟火!球又没进,就硝烟四起。[晕]\n",
"24798 1 [抱抱]福芦 TangRoulou 吉祥书 8.8折优惠 >>> http://t.cn/z...\n",
"6598 1 回复@钱旭明QXM:[嘻嘻][嘻嘻] //@钱旭明QXM:杨大哥[good][good][g...\n",
"53920 1 人家这脸长的!!!!!![哈哈]\n",
"15587 1 这个价不算高,和一天内训相比相差无几。。[哈哈]//@博通传媒v: 6个月!一个月工资1万,...\n",
"101237 0 终于收工啦,脚丫子快冻掉了[泪][泪][泪]\n",
"82449 0 我决定从今天开始我想吃什么就去吃什么,一个人吃也无所谓,重点是不要因为别人的意见委屈了自己[...\n",
"32537 1 飘雪的北京 需要双份早餐.......//@美食天下: [哈哈]//@王淼Margay: 屁...\n",
"10630 1 [耶],这个太赞了,生活大爆炸第六季马上要出啦[鼓掌] //@-郑瑜-:这个不错 //@经典...\n",
"85130 0 刚追完#倾世皇妃#,#千山暮雪#又紧随其后,网速和更新速度都太不给力,尽管我看过原著,还是焦...\n",
"105956 0 晚上看金二胖?察前?,推出的火炮基座?糟了,可以PK了[泪] //@艾米粒er: //@wi...\n",
"72391 0 必须把中国足球的伟大,用我的职业演说出来 //@袁腾飞:[泪]\n",
"10761 1 [鼓掌] //@宁波香格里拉大酒店: 小编来答疑,周五晚惊艳全场的树根蛋糕到底有多长?蛋糕全..."
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_all.sample(20)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,804 +0,0 @@
{
"cells": [
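{
"cell_type": "markdown",
"metadata": {},
"source": [
"# yf_amazon Overview\n",
"0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1SbfpZb5cm-g2LmnYV_af8Q)\n",
"1. **Data overview:** 520,000 products, more than 1,100 categories, 1.42 million users, and 7.2 million ratings/reviews\n",
"2. **Suggested experiments:** Recommender systems; sentiment / opinion / review polarity analysis\n",
"3. **Data source:** [Amazon](https://www.amazon.cn/)\n",
"4. **Original dataset:** [JD.com E-Commerce Data](http://yongfeng.me/dataset/), collected by Prof. Yongfeng Zhang for a WWW 2015 conference paper\n",
"5. **Processing:**\n",
"    1. Converted full-width characters to half-width and encoded the data as UTF-8\n",
"    2. Reorganized the data into a format compatible with [MovieLens](https://grouplens.org/datasets/movielens/)\n",
"    3. Anonymized the data to protect user privacy"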
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
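{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"path = 'path/to/yf_amazon/'  # placeholder: folder containing the yf_amazon csv files; keep the trailing slash, later cells use path + filename"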
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. products.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Products: 525619\n"
]
}
],
"source": [
"products = pd.read_csv(path + 'products.csv')\n",
"\n",
"print('Products: %d' % products.shape[0])"
]
},
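{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| productId | Product id (consecutive, starting from 0) |\n",
"| name | Product name |\n",
"| catIds | Category ids (consecutive, starting from 0); from left to right: first-, second-, and third-level category |"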
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>productId</th>\n",
" <th>name</th>\n",
" <th>catIds</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>331420</th>\n",
" <td>331420</td>\n",
" <td>欧意金狐狸 女式 皮手套 QT602</td>\n",
" <td>802,143,996</td>\n",
" </tr>\n",
" <tr>\n",
" <th>130945</th>\n",
" <td>130945</td>\n",
" <td>YESO TOT 中性 单肩包/斜挎包 均码 9411</td>\n",
" <td>1111,864,781</td>\n",
" </tr>\n",
" <tr>\n",
" <th>179886</th>\n",
" <td>179886</td>\n",
" <td>李斯特论柏辽兹与舒曼</td>\n",
" <td>832,552,337</td>\n",
" </tr>\n",
" <tr>\n",
" <th>504123</th>\n",
" <td>504123</td>\n",
" <td>Tuscarora 途斯卡洛拉 中性 烈焰驰骋无缝头巾 PSU3083</td>\n",
" <td>1111,522,720</td>\n",
" </tr>\n",
" <tr>\n",
" <th>387785</th>\n",
" <td>387785</td>\n",
" <td>我们的故事:一百个北大荒老知青的人生形态</td>\n",
" <td>832,519,599</td>\n",
" </tr>\n",
" <tr>\n",
" <th>406231</th>\n",
" <td>406231</td>\n",
" <td>图读周易</td>\n",
" <td>832,723,724</td>\n",
" </tr>\n",
" <tr>\n",
" <th>199072</th>\n",
" <td>199072</td>\n",
" <td>Barbie 芭比 女童 运动休闲鞋 A22993</td>\n",
" <td>802,777,601</td>\n",
" </tr>\n",
" <tr>\n",
" <th>518528</th>\n",
" <td>518528</td>\n",
" <td>HiVi 惠威 多媒体音箱 D1080MKII 2.0声道 棕色</td>\n",
" <td>1057,439,1064</td>\n",
" </tr>\n",
" <tr>\n",
" <th>446621</th>\n",
" <td>446621</td>\n",
" <td>HALTI 男式 JUOVAJACKET 芬兰国家队系列 羽绒滑雪服 H0591922</td>\n",
" <td>1111,651,693</td>\n",
" </tr>\n",
" <tr>\n",
" <th>379960</th>\n",
" <td>379960</td>\n",
" <td>塑料回收再生术:百工百技</td>\n",
" <td>832,1096,509</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" productId name catIds\n",
"331420 331420 欧意金狐狸 女式 皮手套 QT602 802,143,996\n",
"130945 130945 YESO TOT 中性 单肩包/斜挎包 均码 9411 1111,864,781\n",
"179886 179886 李斯特论柏辽兹与舒曼 832,552,337\n",
"504123 504123 Tuscarora 途斯卡洛拉 中性 烈焰驰骋无缝头巾 PSU3083 1111,522,720\n",
"387785 387785 我们的故事:一百个北大荒老知青的人生形态 832,519,599\n",
"406231 406231 图读周易 832,723,724\n",
"199072 199072 Barbie 芭比 女童 运动休闲鞋 A22993 802,777,601\n",
"518528 518528 HiVi 惠威 多媒体音箱 D1080MKII 2.0声道 棕色 1057,439,1064\n",
"446621 446621 HALTI 男式 JUOVAJACKET 芬兰国家队系列 羽绒滑雪服 H0591922 1111,651,693\n",
"379960 379960 塑料回收再生术:百工百技 832,1096,509"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"products.sample(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. categories.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Categories: 1175\n"
]
}
],
"source": [
"categories = pd.read_csv(path + 'categories.csv')\n",
"\n",
"print('Categories: %d' % categories.shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| catId | Category id (consecutive, starting from 0) |\n",
"| category | Category name |"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>catId</th>\n",
" <th>category</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>947</th>\n",
" <td>947</td>\n",
" <td>理发器</td>\n",
" </tr>\n",
" <tr>\n",
" <th>818</th>\n",
" <td>818</td>\n",
" <td>电脑硬件</td>\n",
" </tr>\n",
" <tr>\n",
" <th>212</th>\n",
" <td>212</td>\n",
" <td>帐篷</td>\n",
" </tr>\n",
" <tr>\n",
" <th>815</th>\n",
" <td>815</td>\n",
" <td>路由器/中继器</td>\n",
" </tr>\n",
" <tr>\n",
" <th>829</th>\n",
" <td>829</td>\n",
" <td>拉杆箱/包</td>\n",
" </tr>\n",
" <tr>\n",
" <th>391</th>\n",
" <td>391</td>\n",
" <td>女鞋</td>\n",
" </tr>\n",
" <tr>\n",
" <th>756</th>\n",
" <td>756</td>\n",
" <td>大型健身器械</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>11</td>\n",
" <td>其他运动器材</td>\n",
" </tr>\n",
" <tr>\n",
" <th>633</th>\n",
" <td>633</td>\n",
" <td>垂钓用品</td>\n",
" </tr>\n",
" <tr>\n",
" <th>115</th>\n",
" <td>115</td>\n",
" <td>卡通</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" catId category\n",
"947 947 理发器\n",
"818 818 电脑硬件\n",
"212 212 帐篷\n",
"815 815 路由器/中继器\n",
"829 829 拉杆箱/包\n",
"391 391 女鞋\n",
"756 756 大型健身器械\n",
"11 11 其他运动器材\n",
"633 633 垂钓用品\n",
"115 115 卡通"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"categories.sample(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. ratings.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Users: 1424596\n",
"Ratings/reviews (total): 7202921\n",
"\n"
]
}
],
"source": [
"pd_ratings = pd.read_csv(path + 'ratings.csv')\n",
"\n",
"print('Users: %d' % pd_ratings.userId.unique().shape[0])\n",
"print('Ratings/reviews (total): %d\\n' % pd_ratings.shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| userId | User id (consecutive, starting from 0) |\n",
"| productId | The productId from products.csv |\n",
"| rating | Rating, an integer in [1, 5] |\n",
"| timestamp | Timestamp of the rating |\n",
"| title | Review title |\n",
"| comment | Review text |"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>userId</th>\n",
" <th>productId</th>\n",
" <th>rating</th>\n",
" <th>timestamp</th>\n",
" <th>title</th>\n",
" <th>comment</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4287636</th>\n",
" <td>230944.0</td>\n",
" <td>394505</td>\n",
" <td>5.0</td>\n",
" <td>1393084800</td>\n",
" <td>赞!</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3940838</th>\n",
" <td>16628.0</td>\n",
" <td>84789</td>\n",
" <td>5.0</td>\n",
" <td>1389715200</td>\n",
" <td>喜欢</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4064284</th>\n",
" <td>325829.0</td>\n",
" <td>94108</td>\n",
" <td>3.0</td>\n",
" <td>1384531200</td>\n",
" <td>磨脚</td>\n",
" <td>右脚小脚趾磨掉一块皮</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4802616</th>\n",
" <td>586385.0</td>\n",
" <td>254002</td>\n",
" <td>5.0</td>\n",
" <td>1383408000</td>\n",
" <td>哦~</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>292946</th>\n",
" <td>842028.0</td>\n",
" <td>231449</td>\n",
" <td>5.0</td>\n",
" <td>1369324800</td>\n",
" <td>致我们终将逝去的青春</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2306551</th>\n",
" <td>933226.0</td>\n",
" <td>219015</td>\n",
" <td>4.0</td>\n",
" <td>1341763200</td>\n",
" <td>有点大 不过很漂亮</td>\n",
" <td>外观很精致的说 就是外形有点偏大</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1707442</th>\n",
" <td>402851.0</td>\n",
" <td>228321</td>\n",
" <td>5.0</td>\n",
" <td>1374076800</td>\n",
" <td>给宝宝讲讲挺好的,内容简单,便于宝宝理解。</td>\n",
" <td>给宝宝讲讲挺好的,内容简单,便于宝宝理解。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3641724</th>\n",
" <td>123473.0</td>\n",
" <td>515623</td>\n",
" <td>4.0</td>\n",
" <td>1305475200</td>\n",
" <td>书很好,但居然没有包装!?!?!?</td>\n",
" <td>书很好,但居然没有包装!?!?!?这么好的书却没有包装!?!?!?</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1921912</th>\n",
" <td>435946.0</td>\n",
" <td>63238</td>\n",
" <td>4.0</td>\n",
" <td>1357228800</td>\n",
" <td>嗯</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1475151</th>\n",
" <td>1612.0</td>\n",
" <td>139044</td>\n",
" <td>4.0</td>\n",
" <td>1316102400</td>\n",
" <td>一般</td>\n",
" <td>香味没有前面评价那么香,就是普通的爽肤水,有点黏黏的</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" userId productId rating timestamp title \\\n",
"4287636 230944.0 394505 5.0 1393084800 赞! \n",
"3940838 16628.0 84789 5.0 1389715200 喜欢 \n",
"4064284 325829.0 94108 3.0 1384531200 磨脚 \n",
"4802616 586385.0 254002 5.0 1383408000 哦~ \n",
"292946 842028.0 231449 5.0 1369324800 致我们终将逝去的青春 \n",
"2306551 933226.0 219015 4.0 1341763200 有点大 不过很漂亮 \n",
"1707442 402851.0 228321 5.0 1374076800 给宝宝讲讲挺好的,内容简单,便于宝宝理解。 \n",
"3641724 123473.0 515623 4.0 1305475200 书很好,但居然没有包装!?!?!? \n",
"1921912 435946.0 63238 4.0 1357228800 嗯 \n",
"1475151 1612.0 139044 4.0 1316102400 一般 \n",
"\n",
" comment \n",
"4287636 NaN \n",
"3940838 NaN \n",
"4064284 右脚小脚趾磨掉一块皮 \n",
"4802616 NaN \n",
"292946 NaN \n",
"2306551 外观很精致的说 就是外形有点偏大 \n",
"1707442 给宝宝讲讲挺好的,内容简单,便于宝宝理解。 \n",
"3641724 书很好,但居然没有包装!?!?!?这么好的书却没有包装!?!?!? \n",
"1921912 NaN \n",
"1475151 香味没有前面评价那么香,就是普通的爽肤水,有点黏黏的 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_ratings.sample(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4. links.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"links = pd.read_csv(path + 'links.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| productId | The productId from products.csv and ratings.csv |\n",
"| amazonId | Amazon's product identifier |"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>productId</th>\n",
" <th>amazonId</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>436251</th>\n",
" <td>436251</td>\n",
" <td>B00F91KYGK</td>\n",
" </tr>\n",
" <tr>\n",
" <th>194578</th>\n",
" <td>194578</td>\n",
" <td>B00GICSVUK</td>\n",
" </tr>\n",
" <tr>\n",
" <th>336998</th>\n",
" <td>336998</td>\n",
" <td>B00GMKUNBI</td>\n",
" </tr>\n",
" <tr>\n",
" <th>371924</th>\n",
" <td>371924</td>\n",
" <td>B008RIA4AS</td>\n",
" </tr>\n",
" <tr>\n",
" <th>433617</th>\n",
" <td>433617</td>\n",
" <td>B00332FJ7Q</td>\n",
" </tr>\n",
" <tr>\n",
" <th>236918</th>\n",
" <td>236918</td>\n",
" <td>060614479X</td>\n",
" </tr>\n",
" <tr>\n",
" <th>388158</th>\n",
" <td>388158</td>\n",
" <td>B008TI5V2C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>479855</th>\n",
" <td>479855</td>\n",
" <td>B002NSML6I</td>\n",
" </tr>\n",
" <tr>\n",
" <th>311842</th>\n",
" <td>311842</td>\n",
" <td>B001DTWV2C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>445227</th>\n",
" <td>445227</td>\n",
" <td>B0055PT83U</td>\n",
" </tr>\n",
" <tr>\n",
" <th>360465</th>\n",
" <td>360465</td>\n",
" <td>B005UTT2QY</td>\n",
" </tr>\n",
" <tr>\n",
" <th>258363</th>\n",
" <td>258363</td>\n",
" <td>0805092919</td>\n",
" </tr>\n",
" <tr>\n",
" <th>308642</th>\n",
" <td>308642</td>\n",
" <td>B0079WMXT8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>232740</th>\n",
" <td>232740</td>\n",
" <td>B0018HKRAW</td>\n",
" </tr>\n",
" <tr>\n",
" <th>335318</th>\n",
" <td>335318</td>\n",
" <td>B00840LWKU</td>\n",
" </tr>\n",
" <tr>\n",
" <th>497048</th>\n",
" <td>497048</td>\n",
" <td>B003ZI61RA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>388969</th>\n",
" <td>388969</td>\n",
" <td>B00BIUYL06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10448</th>\n",
" <td>10448</td>\n",
" <td>B00GMZ9DKK</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75752</th>\n",
" <td>75752</td>\n",
" <td>B002R0DNB4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>392345</th>\n",
" <td>392345</td>\n",
" <td>B0041IY7CE</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" productId amazonId\n",
"436251 436251 B00F91KYGK\n",
"194578 194578 B00GICSVUK\n",
"336998 336998 B00GMKUNBI\n",
"371924 371924 B008RIA4AS\n",
"433617 433617 B00332FJ7Q\n",
"236918 236918 060614479X\n",
"388158 388158 B008TI5V2C\n",
"479855 479855 B002NSML6I\n",
"311842 311842 B001DTWV2C\n",
"445227 445227 B0055PT83U\n",
"360465 360465 B005UTT2QY\n",
"258363 258363 0805092919\n",
"308642 308642 B0079WMXT8\n",
"232740 232740 B0018HKRAW\n",
"335318 335318 B00840LWKU\n",
"497048 497048 B003ZI61RA\n",
"388969 388969 B00BIUYL06\n",
"10448 10448 B00GMZ9DKK\n",
"75752 75752 B002R0DNB4\n",
"392345 392345 B0041IY7CE"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"links.sample(20)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,736 +0,0 @@
{
"cells": [
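{
"cell_type": "markdown",
"metadata": {},
"source": [
"# About yf_dianping\n",
"0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1yMNvHLl6QYsGbjT7u51Nfg)\n",
"1. **Overview:** about 240,000 restaurants, 540,000 users, and 4.4 million reviews/ratings\n",
"2. **Suggested experiments:** recommender systems; sentiment/opinion/review polarity analysis\n",
"3. **Data source:** [Dianping (大众点评)](http://www.dianping.com/)\n",
"4. **Original dataset:** [Dianping Review Dataset](http://yongfeng.me/dataset/), collected by Prof. Yongfeng Zhang for papers at WWW 2013, SIGIR 2013, and SIGIR 2014\n",
"5. **Processing:**\n",
"    1. Kept only the reviews, ratings, and related information from the original dataset; removed other unneeded information\n",
"    2. Reorganized into a format compatible with [MovieLens](https://grouplens.org/datasets/movielens/)\n",
"    3. Anonymized the data to protect user privacy"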
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [],
"source": [
"path = 'path_to_yf_dianping_folder/'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. restaurants.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Restaurants (with a name): 209132\n",
"Restaurants (without a name): 34115\n",
"Restaurants (total): 243247\n"
]
}
],
"source": [
"restaurants = pd.read_csv(path + 'restaurants.csv')\n",
"\n",
"print('Restaurants (with a name): %d' % restaurants[~pd.isnull(restaurants.name)].shape[0])\n",
"print('Restaurants (without a name): %d' % restaurants[pd.isnull(restaurants.name)].shape[0])\n",
"print('Restaurants (total): %d' % restaurants.shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| restId | Restaurant id (numbered consecutively from 0) |\n",
"| name | Restaurant name |"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>restId</th>\n",
" <th>name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>210902</th>\n",
" <td>210902</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>124832</th>\n",
" <td>124832</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26766</th>\n",
" <td>26766</td>\n",
" <td>香锅制造(新苏天地店)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>91754</th>\n",
" <td>91754</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>204465</th>\n",
" <td>204465</td>\n",
" <td>西部牛扒城(湖塘店)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36475</th>\n",
" <td>36475</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>231861</th>\n",
" <td>231861</td>\n",
" <td>四季火锅</td>\n",
" </tr>\n",
" <tr>\n",
" <th>79816</th>\n",
" <td>79816</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>140694</th>\n",
" <td>140694</td>\n",
" <td>彝家牛汤锅</td>\n",
" </tr>\n",
" <tr>\n",
" <th>169641</th>\n",
" <td>169641</td>\n",
" <td>春秋</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33809</th>\n",
" <td>33809</td>\n",
" <td>九头鸟酒家(永定门店)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>236919</th>\n",
" <td>236919</td>\n",
" <td>老上海城隍庙小吃(人民大学店)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>182387</th>\n",
" <td>182387</td>\n",
" <td>河源三家村酒楼</td>\n",
" </tr>\n",
" <tr>\n",
" <th>140475</th>\n",
" <td>140475</td>\n",
" <td>荣记麻辣烫</td>\n",
" </tr>\n",
" <tr>\n",
" <th>194224</th>\n",
" <td>194224</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>152406</th>\n",
" <td>152406</td>\n",
" <td>鼎丰真(东四马路店)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11701</th>\n",
" <td>11701</td>\n",
" <td>南亚餐厅</td>\n",
" </tr>\n",
" <tr>\n",
" <th>58805</th>\n",
" <td>58805</td>\n",
" <td>益丰坊(虎泉店)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15641</th>\n",
" <td>15641</td>\n",
" <td>万达艾美酒店大堂吧</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43424</th>\n",
" <td>43424</td>\n",
" <td>新美心绿姿生活</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" restId name\n",
"210902 210902 NaN\n",
"124832 124832 NaN\n",
"26766 26766 香锅制造(新苏天地店)\n",
"91754 91754 NaN\n",
"204465 204465 西部牛扒城(湖塘店)\n",
"36475 36475 NaN\n",
"231861 231861 四季火锅\n",
"79816 79816 NaN\n",
"140694 140694 彝家牛汤锅\n",
"169641 169641 春秋\n",
"33809 33809 九头鸟酒家(永定门店)\n",
"236919 236919 老上海城隍庙小吃(人民大学店)\n",
"182387 182387 河源三家村酒楼\n",
"140475 140475 荣记麻辣烫\n",
"194224 194224 NaN\n",
"152406 152406 鼎丰真(东四马路店)\n",
"11701 11701 南亚餐厅\n",
"58805 58805 益丰坊(虎泉店)\n",
"15641 15641 万达艾美酒店大堂吧\n",
"43424 43424 新美心绿姿生活"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"restaurants.sample(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. ratings.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of users: 542706\n",
"Number of ratings/reviews (total): 4422473\n",
"\n",
"Number of overall ratings (in [1,5]): 3293878\n",
"Number of environment ratings (in [1,5]): 4076220\n",
"Number of flavor ratings (in [1,5]): 4093819\n",
"Number of service ratings (in [1,5]): 4076220\n",
"Number of reviews: 4107409\n"
]
}
],
"source": [
"pd_ratings = pd.read_csv(path+'ratings.csv')\n",
"\n",
"print('Number of users: %d' % pd_ratings.userId.unique().shape[0])\n",
"print('Number of ratings/reviews (total): %d\\n' % pd_ratings.shape[0])\n",
"\n",
"print('Number of overall ratings (in [1,5]): %d' % pd_ratings[(pd_ratings.rating>=1) & (pd_ratings.rating<=5)].shape[0])\n",
"print('Number of environment ratings (in [1,5]): %d' % pd_ratings[(pd_ratings.rating_env>=1) & (pd_ratings.rating_env<=5)].shape[0])\n",
"print('Number of flavor ratings (in [1,5]): %d' % pd_ratings[(pd_ratings.rating_flavor>=1) & (pd_ratings.rating_flavor<=5)].shape[0])\n",
"print('Number of service ratings (in [1,5]): %d' % pd_ratings[(pd_ratings.rating_service>=1) & (pd_ratings.rating_service<=5)].shape[0])\n",
"print('Number of reviews: %d' % pd_ratings[~pd_ratings.comment.isna()].shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| userId | User id (numbered consecutively from 0) |\n",
"| restId | Same as restId in restaurants.csv |\n",
"| rating | Overall rating, an integer in [0,5] |\n",
"| rating_env | Environment rating, an integer in [1,5] |\n",
"| rating_flavor | Flavor rating, an integer in [1,5] |\n",
"| rating_service | Service rating, an integer in [1,5] |\n",
"| timestamp | Timestamp of the rating |\n",
"| comment | Review text |"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>userId</th>\n",
" <th>restId</th>\n",
" <th>rating</th>\n",
" <th>rating_env</th>\n",
" <th>rating_flavor</th>\n",
" <th>rating_service</th>\n",
" <th>timestamp</th>\n",
" <th>comment</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3331708</th>\n",
" <td>6802</td>\n",
" <td>183728</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>3.0</td>\n",
" <td>1315673880000</td>\n",
" <td>环境不错,停车方便,交通也比较方便,东西齐全,应有尽有,吃、喝、玩、乐样样齐全,还有个五星级...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3332473</th>\n",
" <td>3106</td>\n",
" <td>183750</td>\n",
" <td>5.0</td>\n",
" <td>4.0</td>\n",
" <td>4.0</td>\n",
" <td>4.0</td>\n",
" <td>1260155880000</td>\n",
" <td>去过两次,都是由日本朋友带着去的,很喜欢那种在小巷子深处的店,总觉得那样的店料理会很好吃。最...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>291609</th>\n",
" <td>39590</td>\n",
" <td>13570</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>1324792500000</td>\n",
" <td>朋友请客,两个人中午去吃的,虽然不是节假日,但人还是非常的多,等了很长时间才上餐,价位偏高,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>749582</th>\n",
" <td>59192</td>\n",
" <td>38519</td>\n",
" <td>4.0</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>2.0</td>\n",
" <td>1321430760000</td>\n",
" <td>十一长假之前,我们的房子终于有了好消息,这个月底就可以拿到钥匙,真是不容易,盼星星盼月亮的,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>719908</th>\n",
" <td>241643</td>\n",
" <td>36382</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1271862180000</td>\n",
" <td>很差的一家店!公司聚餐居然选在这里,真是个大大的失策!\\n点的菜迟迟不上,不知道是故意不上还...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3127953</th>\n",
" <td>12481</td>\n",
" <td>173459</td>\n",
" <td>4.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>1300407540000</td>\n",
" <td>这家是离家最近的一家城市超市了,所以自然要进去随便逛逛啦。\\n因为附近是居民区,自然光顾的主...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2068253</th>\n",
" <td>13070</td>\n",
" <td>115853</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>2.0</td>\n",
" <td>1308671820000</td>\n",
" <td>以前觉得还行,但有了85度之后就不行了。要了个提拉米苏,不行,太甜了。\\n辣松的味道倒不错,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>640356</th>\n",
" <td>168006</td>\n",
" <td>33263</td>\n",
" <td>NaN</td>\n",
" <td>3.0</td>\n",
" <td>5.0</td>\n",
" <td>3.0</td>\n",
" <td>1224868560000</td>\n",
" <td>算比较地道的川菜了 味道辣的很正 强力推荐 据说还是标点美食的... 香辣鸡翅每去必点~!不...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1222261</th>\n",
" <td>76280</td>\n",
" <td>65171</td>\n",
" <td>3.0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>1302136740000</td>\n",
" <td>为什么这么多人说好吃啊?为什么这么多人说肉多啊?难道是我人品有问题?\\n这个也是慕名而去的~...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101366</th>\n",
" <td>67372</td>\n",
" <td>2853</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1283741400000</td>\n",
" <td>两年前经常去这家吃卤煮,感觉特别好吃,可是最近吃了一次,让我大失所望。。。\\n卤煮的汤和食材...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" userId restId rating rating_env rating_flavor rating_service \\\n",
"3331708 6802 183728 3.0 3.0 4.0 3.0 \n",
"3332473 3106 183750 5.0 4.0 4.0 4.0 \n",
"291609 39590 13570 3.0 3.0 2.0 3.0 \n",
"749582 59192 38519 4.0 2.0 3.0 2.0 \n",
"719908 241643 36382 1.0 2.0 1.0 1.0 \n",
"3127953 12481 173459 4.0 3.0 3.0 3.0 \n",
"2068253 13070 115853 3.0 3.0 3.0 2.0 \n",
"640356 168006 33263 NaN 3.0 5.0 3.0 \n",
"1222261 76280 65171 3.0 2.0 2.0 2.0 \n",
"101366 67372 2853 1.0 1.0 1.0 1.0 \n",
"\n",
" timestamp comment \n",
"3331708 1315673880000 环境不错,停车方便,交通也比较方便,东西齐全,应有尽有,吃、喝、玩、乐样样齐全,还有个五星级... \n",
"3332473 1260155880000 去过两次,都是由日本朋友带着去的,很喜欢那种在小巷子深处的店,总觉得那样的店料理会很好吃。最... \n",
"291609 1324792500000 朋友请客,两个人中午去吃的,虽然不是节假日,但人还是非常的多,等了很长时间才上餐,价位偏高,... \n",
"749582 1321430760000 十一长假之前,我们的房子终于有了好消息,这个月底就可以拿到钥匙,真是不容易,盼星星盼月亮的,... \n",
"719908 1271862180000 很差的一家店!公司聚餐居然选在这里,真是个大大的失策!\\n点的菜迟迟不上,不知道是故意不上还... \n",
"3127953 1300407540000 这家是离家最近的一家城市超市了,所以自然要进去随便逛逛啦。\\n因为附近是居民区,自然光顾的主... \n",
"2068253 1308671820000 以前觉得还行,但有了85度之后就不行了。要了个提拉米苏,不行,太甜了。\\n辣松的味道倒不错,... \n",
"640356 1224868560000 算比较地道的川菜了 味道辣的很正 强力推荐 据说还是标点美食的... 香辣鸡翅每去必点~!不... \n",
"1222261 1302136740000 为什么这么多人说好吃啊?为什么这么多人说肉多啊?难道是我人品有问题?\\n这个也是慕名而去的~... \n",
"101366 1283741400000 两年前经常去这家吃卤煮,感觉特别好吃,可是最近吃了一次,让我大失所望。。。\\n卤煮的汤和食材... "
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_ratings.sample(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. links.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [],
"source": [
"links = pd.read_csv(path + 'links.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field descriptions\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| restId | Same as restId in restaurants.csv and ratings.csv |\n",
"| dianpingId | Restaurant ID on Dianping (dianping.com) |"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>restId</th>\n",
" <th>dianpingId</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>138492</th>\n",
" <td>138492</td>\n",
" <td>3566359</td>\n",
" </tr>\n",
" <tr>\n",
" <th>158007</th>\n",
" <td>158007</td>\n",
" <td>2484433</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16170</th>\n",
" <td>16170</td>\n",
" <td>3651451</td>\n",
" </tr>\n",
" <tr>\n",
" <th>116637</th>\n",
" <td>116637</td>\n",
" <td>5143029</td>\n",
" </tr>\n",
" <tr>\n",
" <th>191554</th>\n",
" <td>191554</td>\n",
" <td>2734621</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192481</th>\n",
" <td>192481</td>\n",
" <td>3000367</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40978</th>\n",
" <td>40978</td>\n",
" <td>3168181</td>\n",
" </tr>\n",
" <tr>\n",
" <th>196832</th>\n",
" <td>196832</td>\n",
" <td>3523291</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6048</th>\n",
" <td>6048</td>\n",
" <td>2435827</td>\n",
" </tr>\n",
" <tr>\n",
" <th>200405</th>\n",
" <td>200405</td>\n",
" <td>4130573</td>\n",
" </tr>\n",
" <tr>\n",
" <th>69792</th>\n",
" <td>69792</td>\n",
" <td>2853502</td>\n",
" </tr>\n",
" <tr>\n",
" <th>153075</th>\n",
" <td>153075</td>\n",
" <td>2000257</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8528</th>\n",
" <td>8528</td>\n",
" <td>2651221</td>\n",
" </tr>\n",
" <tr>\n",
" <th>196930</th>\n",
" <td>196930</td>\n",
" <td>3534673</td>\n",
" </tr>\n",
" <tr>\n",
" <th>224063</th>\n",
" <td>224063</td>\n",
" <td>3138160</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3434</th>\n",
" <td>3434</td>\n",
" <td>2185753</td>\n",
" </tr>\n",
" <tr>\n",
" <th>125490</th>\n",
" <td>125490</td>\n",
" <td>2112511</td>\n",
" </tr>\n",
" <tr>\n",
" <th>230533</th>\n",
" <td>230533</td>\n",
" <td>4122445</td>\n",
" </tr>\n",
" <tr>\n",
" <th>130597</th>\n",
" <td>130597</td>\n",
" <td>2632129</td>\n",
" </tr>\n",
" <tr>\n",
" <th>186956</th>\n",
" <td>186956</td>\n",
" <td>2233513</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" restId dianpingId\n",
"138492 138492 3566359\n",
"158007 158007 2484433\n",
"16170 16170 3651451\n",
"116637 116637 5143029\n",
"191554 191554 2734621\n",
"192481 192481 3000367\n",
"40978 40978 3168181\n",
"196832 196832 3523291\n",
"6048 6048 2435827\n",
"200405 200405 4130573\n",
"69792 69792 2853502\n",
"153075 153075 2000257\n",
"8528 8528 2651221\n",
"196930 196930 3534673\n",
"224063 224063 3138160\n",
"3434 3434 2185753\n",
"125490 125490 2112511\n",
"230533 230533 4122445\n",
"130597 130597 2632129\n",
"186956 186956 2233513"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"links.sample(20)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,59 +0,0 @@
![](../images/recruit/jd_header.png)
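【Position】System Architect (AI Products)
Xiamen / 5-10 years / Bachelor's / 15k-25k
---
【Responsibilities】
1. Design the architecture of the customer-service chatbot software system
2. Lead the refactoring of the existing customer-service chatbot software system
3. Define and optimize the custom-development process for the customer-service chatbot software, improving the efficiency of custom development
4. Establish internal technical standards and optimize the full development, testing, and deployment workflow to improve R&D efficiency
5. Provide technical training to R&D/testing staff and guide their daily work, building a scientific, rigorous, and efficient technical team
【Requirements】
1. Bachelor's degree (or above) in computer science, software engineering, automation, or a related field
2. 5+ years of experience in software development or system architecture
3. Expert knowledge of microservice architecture, design patterns, and common data structures and related algorithms
4. Excellent logical thinking and system abstraction skills, and strong engineering implementation skills
【Preferred】
1. 1+ years of software development or system architecture experience on large, complex systems (especially AI or SaaS products)
2. Basic knowledge of natural language processing (or machine learning, AI), with the ability to quickly understand each functional module of the customer-service chatbot software
3. Familiarity with Python development
【Growth path】
1. Vertical: a well-developed job grade and rank system, combined with OKRs, helps you keep advancing on the path to becoming a technical expert
2. Horizontal: a flat organizational structure and frequent technical sharing and exchange activities give you exposure to product and technical roles at the forefront of AI; if you wish, you may apply for a transfer after passing an assessment
【About the team】
1. Serving 3 billion users worldwide with AI technology
2. AI is a sunrise industry at the very center of the boom; we look forward to having you join
3. Geek spirit, technology-driven; building technology with warmth to make the world a better place
4. Irregular monthly team and department-wide sharing and exchange activities, exploring the limitless appeal of cutting-edge AI together…
【Benefits and transportation】
1. 5-day work week (weekends off), balancing work and rest for efficient execution
2. Half an hour of flexible working time each day; compensatory leave can be requested freely (provided work is not affected)
3. Convenient transportation: Lianhua Road Intersection (莲花路口) metro station is right downstairs
4. City-center location, one bus stop from Lianban (莲坂) and Mingfa Commercial Plaza (明发商业广场), with plenty of food and drink options
5. Dining area, café, and massage chairs on site
6. Afternoon tea and holiday gifts
7. A variety of team-building activities
8. After 3+ years of service, employees with excellent performance and values may apply for stock option awards
---
【Contact】
- Mr. Cai, jinhua@kuaishang.com.cn
- Mr. Lan, lanzl@kuaishang.com.cn, 180-3025-1206
- Ms. Ye, yeyp@kuaishang.com.cn, 0592-5380356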
@@ -1,50 +0,0 @@
![](../images/recruit/jd_header.png)
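【Position】Natural Language Processing Algorithm Engineer
Xiamen / 3-5 years / Master's / 15k-25k
---
【Responsibilities】
1. Take part in designing and developing the core modules of the automated-marketing customer-service chatbot software to meet customer needs
2. Focus on several research/application areas and key technologies of dialogue systems, conduct in-depth research, and maintain a technical lead
【Requirements】
1. Master's degree (or above)
2. 3+ years of research or development experience in dialogue/question-answering systems
3. Core technical contributor on dialogue/question-answering systems; understands the design and structure of each module and has in-depth mastery of several modules or key technologies
4. Excellent engineering implementation skills; able to quickly implement innovative technical ideas, with well-structured code and documentation
5. Excellent English reading comprehension, including the ability to read and understand English technical literature
【Preferred】
1. Candidates with a track record of successfully delivering dialogue/question-answering system products are preferred
【About the team】
1. Serving 3 billion users worldwide with AI technology
2. Focused on automated-marketing chatbots for industry verticals; strong customer demand and unlimited product prospects
3. Geek spirit, technology-driven; building technology with warmth to make the world a better place
4. Irregular monthly team and department-wide sharing and exchange activities, and an electric team atmosphere…
【Benefits and transportation】
1. 5-day work week (weekends off), balancing work and rest for efficient execution
2. Half an hour of flexible working time each day; compensatory leave can be requested freely (provided work is not affected)
3. Convenient transportation: Lianhua Road Intersection (莲花路口) metro station is right downstairs
4. City-center location, one bus stop from Lianban (莲坂) and Mingfa Commercial Plaza (明发商业广场), with plenty of food and drink options
5. Dining area, café, and massage chairs on site
6. Afternoon tea and holiday gifts
7. A variety of team-building activities
8. After 3+ years of service, employees with excellent performance and values may apply for stock option awards
---
【Contact】
- Mr. Cai, jinhua@kuaishang.com.cn
- Mr. Lan, lanzl@kuaishang.com.cn, 180-3025-1206
- Ms. Ye, yeyp@kuaishang.com.cn, 0592-5380356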
@@ -1,51 +0,0 @@
![](../images/recruit/jd_header.png)
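【Position】Applied Research in Natural-Language Human-Computer Interaction
Xiamen / 5-10 years / Master's / 20k-35k
---
【Responsibilities】
1. Lead the design and organize the development of automated-marketing customer-service chatbot software for industry verticals
2. Define and optimize the custom-development process for the customer-service chatbot software, significantly improving the efficiency of custom development
3. Track trends in cutting-edge technology and help raise the team's overall technical level
【Requirements】
1. Master's degree (or above)
2. 5+ years of research or development experience in dialogue/question-answering systems
3. Core technical contributor on dialogue/question-answering systems; familiar with the design and structure of each module, and especially proficient in developing dialogue flow management and control (i.e., the central control system)
4. Excellent engineering implementation skills; able to quickly implement innovative technical ideas, with well-structured code and documentation
5. Excellent ability to read English literature, staying current with trends in cutting-edge technology
【Preferred】
1. Candidates with a track record of successfully delivering dialogue/question-answering system products are preferred
【About the team】
1. Serving 3 billion users worldwide with AI technology
2. Focused on automated-marketing chatbots for industry verticals; strong customer demand and unlimited product prospects
3. Geek spirit, technology-driven; building technology with warmth to make the world a better place
4. Irregular monthly team and department-wide sharing and exchange activities, and an electric team atmosphere…
【Benefits and transportation】
1. 5-day work week (weekends off), balancing work and rest for efficient execution
2. Half an hour of flexible working time each day; compensatory leave can be requested freely (provided work is not affected)
3. Convenient transportation: Lianhua Road Intersection (莲花路口) metro station is right downstairs
4. City-center location, one bus stop from Lianban (莲坂) and Mingfa Commercial Plaza (明发商业广场), with plenty of food and drink options
5. Dining area, café, and massage chairs on site
6. Afternoon tea and holiday gifts
7. A variety of team-building activities
8. After 3+ years of service, employees with excellent performance and values may apply for stock option awards
---
【Contact】
- Mr. Cai, jinhua@kuaishang.com.cn
- Mr. Lan, lanzl@kuaishang.com.cn, 180-3025-1206
- Ms. Ye, yeyp@kuaishang.com.cn, 0592-5380356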
@@ -1,6 +0,0 @@
# Initial Ruff Linting
70d7725e5c89bccfe7d4e5a3ccd87e05c642d74b
# Change line-length and ruff format
39bbfdb8298b5faa32e4bc052080d240f6140bea
# pre-commit hooks and ruff
6ed123ecc4aec9da26bd48748df670cd5b42b3cd
@@ -1 +0,0 @@
*.ipynb linguist-documentation
@@ -1,45 +0,0 @@
name: "\U0001F41B Bug Report"
description: Report your bug here.
labels: ["bug"]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for taking the time to fill out this bug report! Any information you can provide about your system and the issue you encountered will help to resolve it faster.
  - type: checkboxes
    attributes:
      label: Have you searched existing issues? 🔎
      description: Please search to see if an [issue](https://github.com/MaartenGr/BERTopic/issues) already exists for the issue you encountered.
      options:
        - label: I have searched and found no existing issues
          required: true
  - type: textarea
    id: describe_the_bug
    attributes:
      label: Describe the bug
      description: Please provide a concise description of the bug. If there is an error, make sure to provide the **full** error log.
      placeholder: Describe the bug
    validations:
      required: true
  - type: textarea
    id: reproduction
    attributes:
      label: Reproduction
      description: Please provide a minimal example, with code, that can be run to reproduce the issue.
      value: |
        ```python
        from bertopic import BERTopic
        ```
  - type: input
    id: bertopic_version
    attributes:
      label: BERTopic Version
      description: What version of BERTopic are you using?
    validations:
      required: true
@@ -1,8 +0,0 @@
blank_issues_enabled: true
contact_links:
  - name: 💡 General questions
    url: https://github.com/MaartenGr/BERTopic/discussions
    about: Ask a question there!
  - name: Want to contribute?
    url: https://github.com/MaartenGr/BERTopic/blob/master/CONTRIBUTING.md
    about: Head to the contributing guidelines
@@ -1,30 +0,0 @@
name: "\U0001F680 Feature request"
description: Submit a proposal/request for a new BERTopic feature
labels: ["Feature request"]
body:
  - type: textarea
    id: feature-request
    validations:
      required: true
    attributes:
      label: Feature request
      description: |
        A clear and concise description of the feature proposal.
  - type: textarea
    id: motivation
    validations:
      required: true
    attributes:
      label: Motivation
      description: |
        Please outline the motivation for the proposal. If this is related to another GitHub issue, please link here too.
  - type: textarea
    id: contribution
    validations:
      required: true
    attributes:
      label: Your contribution
      description: |
        Any help on the implementation of this feature would be greatly appreciated. If you are interested in working on this, make sure to read the [CONTRIBUTING.MD guide](https://github.com/MaartenGr/BERTopic/blob/master/CONTRIBUTING.md)
@@ -1,17 +0,0 @@
# What does this PR do?
<!--
Thank you for considering creating a PR! Before you do, make sure to read through the [contributor guidelines](https://github.com/MaartenGr/BERTopic/blob/master/CONTRIBUTING.md)
-->
<!-- Remove if not applicable -->
Fixes # (issue)
## Before submitting
- [ ] This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
- [ ] Did you read the [contributor guideline](https://github.com/MaartenGr/BERTopic/blob/master/CONTRIBUTING.md)?
- [ ] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes (if applicable)?
- [ ] Did you write any new necessary tests?
@@ -1,39 +0,0 @@
name: Code Checks
on:
  push:
    branches:
      - master
      - dev
  pull_request:
    branches:
      - master
      - dev
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      # Ref: https://github.com/pre-commit/action
      - uses: pre-commit/action@v3.0.1
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12", "3.13"]
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e ".[test]"
      - name: Run Checking Mechanisms
        run: make check
@@ -1,88 +0,0 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
model_dir
model_dir/
test
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/
# Sphinx documentation
docs/_build/
# Jupyter Notebook
.ipynb_checkpoints
notebooks/
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
*.lock
# Artifacts
.idea
.idea/
.vscode
.DS_Store
# mkdocs
site/
@@ -1,20 +0,0 @@
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: trailing-whitespace
        exclude: |
          (?x)^(
            README.md|
            docs/
          )$
      - id: end-of-file-fixer
        exclude_types: [html, svg]
      - id: check-yaml
      - id: check-added-large-files
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.9.9
    hooks:
      - id: ruff
        args: [--fix, --show-fixes, --exit-non-zero-on-fix]
      - id: ruff-format
@@ -1,64 +0,0 @@
# Contributing to BERTopic
Hi! Thank you for considering contributing to BERTopic. Thanks to BERTopic's modular design, new add-ons, backends, representation models, sub-models, and LLMs can be added quickly to keep up with this incredibly fast-paced field.
Whether contributions are new features, better documentation, bug fixes, or improvement on the repository itself, anything is appreciated!
## 📚 Guidelines
### 🤖 Contributing Code
To contribute to this project, we follow an `issue -> pull request` approach for main features and bug fixes. This means that any new feature, bug fix, or anything else that touches on code directly needs to start from an issue first. That way, the main discussion about what needs to be added/fixed can be done in the issue before creating a pull request. This makes sure that we are on the same page before you start coding your pull request. If you start working on an issue, please assign it to yourself but do so after there is an agreement with the maintainer, [@MaartenGr](https://github.com/MaartenGr).
When there is agreement on the assigned approach, a pull request can be created in which the fix/feature can be added. This follows a ["fork and pull request"](https://docs.github.com/en/get-started/quickstart/contributing-to-projects) workflow.
Please do not try to push directly to this repo unless you are a maintainer.
There are exceptions to the `issue -> pull request` approach that are typically small changes that do not need agreements, such as:
* Documentation
* Spelling/grammar issues
* Docstrings
* etc.
There is a large focus on documentation in this repository, so please make sure to add extensive descriptions of features when creating the pull request.
Note that the main focus of pull requests and code should be:
* Easy readability
* Clear communication
* Sufficient documentation
## 🚀 Quick Start
To start contributing, make sure to first start from a fresh environment. Using an environment manager, such as `conda` or `pyenv` helps in making sure that your code is reproducible and tracks the versions you have in your environment.
If you are using conda, you can approach it as follows:
1. Create and activate a new conda environment (e.g., `conda create -n bertopic python=3.9`)
2. Install requirements (e.g., `pip install .[dev]`)
   * This makes sure to also install documentation and testing packages
3. (Optional) Run `make docs` to build your documentation
4. (Optional) Run `make test` to run the unit tests and `make coverage` to check the coverage of unit tests
❗Note: Unit testing the package can take quite some time since it needs to run several variants of the BERTopic pipeline.
## 🧹 Linting and Formatting
We use [Ruff](https://docs.astral.sh/ruff/) to ensure code is uniformly formatted and to avoid common mistakes and bad practices.
* To automatically re-format code, run `make format`
* To check for linting issues, run `make lint` - some issues may be automatically fixed, some will not be
When a pull request is made, the CI will automatically check for linting and formatting issues. However, it will not automatically apply any fixes, so it is easiest to run these commands locally.
If you believe an error is incorrectly flagged, use a [`# noqa:` comment to suppress](https://docs.astral.sh/ruff/linter/#error-suppression), but this is discouraged unless strictly necessary.
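As a minimal sketch of what such a suppression looks like: the constant name `DOCS_URL` below is made up for illustration and is not part of BERTopic, and `E501` is Ruff's line-too-long rule, which is only reported if that rule is enabled in the project's configuration.

```python
# Hypothetical module-level constant; the name is illustrative and not part of BERTopic.
# The trailing `# noqa: E501` asks Ruff to ignore the line-too-long rule on this line only.
DOCS_URL = "https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html"  # noqa: E501

# A bare `# noqa` would silence every rule on the line; naming the rule code keeps the suppression narrow.
print(DOCS_URL)
```

Naming the specific rule code means any other issue on the same line is still reported.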
## 🤓 Collaborative Efforts
If you run into any issue with the above or need help getting started with a pull request, feel free to reach out in the issues! As with all repositories, this one has its particularities as a result of the maintainer's view. Every repository is different, and so are its processes.
## 🏆 Recognition
If your contribution has made its way into a new release of BERTopic, you will be given credit in the changelog of the new release! Regardless of the size of the contribution, any help is greatly appreciated.
## 🎈 Release
BERTopic tries to mostly follow [semantic versioning](https://semver.org/) for its new releases. Even though BERTopic has been around for a few years now, it is still pre-1.0 software. This is intentional: it is a way to keep up with the rapid changes in the field. Backwards compatibility is taken into account, but integrating new features, and thereby keeping up with the field, takes priority. Especially since BERTopic focuses on modularity, flexibility is necessary.
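To make the pre-1.0 point concrete, here is a small sketch of how a `MAJOR.MINOR.PATCH` string can be checked. The helper names `parse_version` and `is_pre_1_0` and the version strings are assumptions chosen for illustration; they are not part of BERTopic, and the parser deliberately ignores pre-release and build tags.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split a plain MAJOR.MINOR.PATCH string into integers (no pre-release or build tags)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_pre_1_0(version: str) -> bool:
    """Under semantic versioning, a major version of 0 marks initial development."""
    return parse_version(version)[0] == 0


# The version strings below are examples only.
print(is_pre_1_0("0.16.0"))  # True
print(is_pre_1_0("1.2.3"))   # False
```

Under semantic versioning, a `0.y.z` release makes no stability promise, which is why pinning an exact version is the safer choice for downstream projects that depend on pre-1.0 software.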
@@ -1,21 +0,0 @@
MIT License
Copyright (c) 2024, Maarten P. Grootendorst
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -1,29 +0,0 @@
test:
	pytest

coverage:
	pytest --cov

format:
	ruff format

lint:
	ruff check --fix

install:
	python -m pip install -e .

install-test:
	python -m pip install -e ".[dev]"

docs:
	mkdocs serve

pypi:
	python -m build
	twine upload dist/*

clean:
	rm -rf **/.ipynb_checkpoints **/.pytest_cache **/__pycache__ **/**/__pycache__ .ipynb_checkpoints .pytest_cache

check: test clean
@@ -1,309 +0,0 @@
[![PyPI Downloads](https://static.pepy.tech/badge/bertopic)](https://pepy.tech/projects/bertopic)
[![PyPI - Python](https://img.shields.io/badge/python-v3.9+-blue.svg)](https://pypi.org/project/bertopic/)
[![Build](https://img.shields.io/github/actions/workflow/status/MaartenGr/BERTopic/testing.yml?branch=master)](https://github.com/MaartenGr/BERTopic/actions)
[![docs](https://img.shields.io/badge/docs-Passing-green.svg)](https://maartengr.github.io/BERTopic/)
[![PyPI - PyPi](https://img.shields.io/pypi/v/BERTopic)](https://pypi.org/project/bertopic/)
[![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/VLAC/blob/master/LICENSE)
[![arXiv](https://img.shields.io/badge/arXiv-2203.05794-<COLOR>.svg)](https://arxiv.org/abs/2203.05794)
# BERTopic
<img src="images/logo.png" width="35%" align="right" />
BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters
allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
BERTopic supports all kinds of topic modeling techniques:
<table>
<tr>
<td><a href="https://maartengr.github.io/BERTopic/getting_started/guided/guided.html">Guided</a></td>
<td><a href="https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html">Supervised</a></td>
<td><a href="https://maartengr.github.io/BERTopic/getting_started/semisupervised/semisupervised.html">Semi-supervised</a></td>
</tr>
<tr>
<td><a href="https://maartengr.github.io/BERTopic/getting_started/manual/manual.html">Manual</a></td>
<td><a href="https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html">Multi-topic distributions</a></td>
<td><a href="https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html">Hierarchical</a></td>
</tr>
<tr>
<td><a href="https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html">Class-based</a></td>
<td><a href="https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html">Dynamic</a></td>
<td><a href="https://maartengr.github.io/BERTopic/getting_started/online/online.html">Online/Incremental</a></td>
</tr>
<tr>
<td><a href="https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html">Multimodal</a></td>
<td><a href="https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html">Multi-aspect</a></td>
<td><a href="https://maartengr.github.io/BERTopic/getting_started/representation/llm.html">Text Generation/LLM</a></td>
</tr>
<tr>
<td><a href="https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html">Zero-shot <b>(new!)</b></a></td>
<td><a href="https://maartengr.github.io/BERTopic/getting_started/merge/merge.html">Merge Models <b>(new!)</b></a></td>
<td><a href="https://maartengr.github.io/BERTopic/getting_started/seed_words/seed_words.html">Seed Words <b>(new!)</b></a></td>
</tr>
</table>
Corresponding Medium posts can be found [here](https://medium.com/data-science/topic-modeling-with-bert-779f7db187e6?sk=0b5a470c006d1842ad4c8a3057063a99), [here](https://medium.com/data-science/using-whisper-and-bertopic-to-model-kurzgesagts-videos-7d8a63139bdf?sk=b1e0fd46f70cb15e8422b4794a81161d) and [here](https://medium.com/data-science/interactive-topic-modeling-with-bertopic-1ea55e7d73d8?sk=03c2168e9e74b6bda2a1f3ed953427e4). For a more detailed overview, you can read the [paper](https://arxiv.org/abs/2203.05794) or see a [brief overview](https://maartengr.github.io/BERTopic/algorithm/algorithm.html).
## Installation
Installation, with sentence-transformers, can be done using [uv](https://docs.astral.sh/uv/):
```bash
uv add bertopic
```
or with [pip](https://github.com/pypa/pip):
```bash
pip install bertopic
```
If you want to install BERTopic with other embedding models, you can choose one of the following:
```bash
# Choose an embedding backend
pip install bertopic[flair,gensim,spacy,use]
# Topic modeling with images
pip install bertopic[vision]
```
For a *light-weight installation* without transformers, UMAP and/or HDBSCAN (for training with Model2Vec or inference), see [this tutorial](https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html#lightweight-installation).
## Getting Started
For an in-depth overview of the features of BERTopic
you can check the [**full documentation**](https://maartengr.github.io/BERTopic/) or you can follow along
with one of the examples below:
| Name | Link |
|---|---|
| Start Here - **Best Practices in BERTopic** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1BoQ_vakEVtojsd2x_U6-_x52OOuqruj2?usp=sharing) |
| **🆕 New!** - Topic Modeling on Large Data (GPU Acceleration) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1W7aEdDPxC29jP99GGZphUlqjMFFVKtBC?usp=sharing) |
| **🆕 New!** - Topic Modeling with Llama 2 🦙 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QCERSMUjqGetGGujdrvv_6_EeoIcd_9M?usp=sharing) |
| **🆕 New!** - Topic Modeling with Quantized LLMs | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1DdSHvVPJA3rmNfBWjCo2P1E9686xfxFx?usp=sharing) |
| Topic Modeling with BERTopic | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing) |
| (Custom) Embedding Models in BERTopic | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/18arPPe50szvcCp_Y6xS56H2tY0m-RLqv?usp=sharing) |
| Advanced Customization in BERTopic | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ClTYut039t-LDtlcd-oQAdXWgcsSGTw9?usp=sharing) |
| (semi-)Supervised Topic Modeling with BERTopic | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bxizKzv5vfxJEB29sntU__ZC7PBSIPaQ?usp=sharing) |
| Dynamic Topic Modeling with Trump's Tweets | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1un8ooI-7ZNlRoK0maVkYhmNRl0XGK88f?usp=sharing) |
| Topic Modeling arXiv Abstracts | [![Kaggle](https://img.shields.io/static/v1?style=for-the-badge&message=Kaggle&color=222222&logo=Kaggle&logoColor=20BEFF&label=)](https://www.kaggle.com/maartengr/topic-modeling-arxiv-abstract-with-bertopic) |
## Quick Start
We start by extracting topics from the well-known 20 newsgroups dataset containing English documents:
```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
```
After generating topics and their probabilities, we can access all of the topics together with their topic representations:
```python
>>> topic_model.get_topic_info()
Topic Count Name
-1 4630 -1_can_your_will_any
0 693 49_windows_drive_dos_file
1 466 32_jesus_bible_christian_faith
2 441 2_space_launch_orbit_lunar
3 381 22_key_encryption_keys_encrypted
...
```
The `-1` topic refers to all outlier documents and is typically ignored. Each word in a topic describes the underlying theme of that topic and can be used
for interpreting it. Next, let's take a look at the most frequent topic that was generated:
```python
>>> topic_model.get_topic(0)
[('windows', 0.006152228076250982),
('drive', 0.004982897610645755),
('dos', 0.004845038866360651),
('file', 0.004140142872194834),
('disk', 0.004131678774810884),
('mac', 0.003624848635985097),
('memory', 0.0034840976976789903),
('software', 0.0034415334250699077),
('email', 0.0034239554442333257),
('pc', 0.003047105930670237)]
```
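Because `.get_topic` returns a plain list of `(word, score)` tuples, it can be post-processed with ordinary Python. The sketch below is only an illustration: `make_label` is a hypothetical helper, not part of BERTopic, and it assumes the `<id>_<word>_<word>...` naming style seen in the `Name` column above.

```python
# Hypothetical helper, not part of BERTopic: build a short label from the
# (word, score) tuples that `.get_topic` returns.
def make_label(topic_id, topic_words, n_words=4):
    # Sort by score (highest first) in case the input is not already ordered.
    ranked = sorted(topic_words, key=lambda pair: pair[1], reverse=True)
    words = [word for word, _ in ranked[:n_words]]
    return "_".join([str(topic_id), *words])


# Sample values copied (rounded) from the output above.
topic_words = [
    ("windows", 0.00615),
    ("drive", 0.00498),
    ("dos", 0.00485),
    ("file", 0.00414),
    ("disk", 0.00413),
]
label = make_label(0, topic_words)
print(label)
```

For real labels, BERTopic's own `.generate_topic_labels()` and `.set_topic_labels()` (listed under Functionality) are the intended route; the helper only shows how the tuples are structured.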
Using `.get_document_info`, we can also extract information at the document level, such as each document's topic, its probability, and whether it is a representative document for its topic, among other fields:
```python
>>> topic_model.get_document_info(docs)
Document Topic Name Top_n_words Probability ...
I am sure some bashers of Pens... 0 0_game_team_games_season game - team - games... 0.200010 ...
My brother is in the market for... -1 -1_can_your_will_any can - your - will... 0.420668 ...
Finally you said what you dream... -1 -1_can_your_will_any can - your - will... 0.807259 ...
Think! It's the SCSI card doing... 49 49_windows_drive_dos_file windows - drive - docs... 0.071746 ...
1) I have an old Jasmine drive... 49 49_windows_drive_dos_file windows - drive - docs... 0.038983 ...
```
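The `Topic` column above uses `-1` for outliers. As a small sketch, assuming `topics` is the list of integer topic ids returned by `fit_transform` (the values below are made up for illustration), the share of outliers and the topic sizes can be computed with the standard library:

```python
from collections import Counter

# Made-up topic assignments for illustration; in practice this would be the
# `topics` list returned by `topic_model.fit_transform(docs)`.
topics = [0, -1, -1, 49, 49, 0, 2, -1]

sizes = Counter(topics)
outlier_share = sizes[-1] / len(topics)

# Topic sizes without the outlier topic, largest first.
topic_sizes = [(t, n) for t, n in sizes.most_common() if t != -1]

print(f"outliers: {outlier_share:.1%}")
print(topic_sizes)
```

If the outlier share turns out to be high, `.reduce_outliers(docs, topics)` from the Functionality section is one option to look at.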
**`🔥 Tip`**: Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages.
## Fine-tune Topic Representations
In BERTopic, there are a number of different [topic representations](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html) that we can choose from. They differ considerably from one another and each gives a different perspective on the topics. A great start is `KeyBERTInspired`, which for many users increases coherence and reduces stopwords in the resulting topic representations:
```python
from bertopic.representation import KeyBERTInspired
# Fine-tune your topic representations
representation_model = KeyBERTInspired()
topic_model = BERTopic(representation_model=representation_model)
```
However, you might want to use something more powerful to describe your clusters. You can even use ChatGPT or other models from OpenAI to generate labels, summaries, phrases, keywords, and more:
```python
import openai
from bertopic.representation import OpenAI
# Fine-tune topic representations with GPT
client = openai.OpenAI(api_key="sk-...")
representation_model = OpenAI(client, model="gpt-4o-mini", chat=True)
topic_model = BERTopic(representation_model=representation_model)
```
**`🔥 Tip`**: Instead of iterating over all of these different topic representations, you can model them simultaneously with [multi-aspect topic representations](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) in BERTopic.
## Visualizations
After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can use one of the [many visualization options](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html) in BERTopic.
For example, we can visualize the topics that were generated in a way very similar to
[LDAvis](https://github.com/cpsievert/LDAvis):
```python
topic_model.visualize_topics()
```
<img src="images/topic_visualization.gif" width="80%" align="center" />
## Modularity
By default, the [main steps](https://maartengr.github.io/BERTopic/algorithm/algorithm.html) for topic modeling with BERTopic are sentence-transformers, UMAP, HDBSCAN, and c-TF-IDF run in sequence. However, BERTopic assumes some independence between these steps, which makes it quite modular. In other words, BERTopic not only allows you to build your own topic model but also to explore several topic modeling techniques on top of your customized topic model:
https://user-images.githubusercontent.com/25746895/218420473-4b2bb539-9dbe-407a-9674-a8317c7fb3bf.mp4
You can swap out any of these models or even remove them entirely. The following steps are completely modular:
1. [Embedding](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html) documents
2. [Reducing dimensionality](https://maartengr.github.io/BERTopic/getting_started/dim_reduction/dim_reduction.html) of embeddings
3. [Clustering](https://maartengr.github.io/BERTopic/getting_started/clustering/clustering.html) reduced embeddings into topics
4. [Tokenization](https://maartengr.github.io/BERTopic/getting_started/vectorizers/vectorizers.html) of topics
5. [Weight](https://maartengr.github.io/BERTopic/getting_started/ctfidf/ctfidf.html) tokens
6. [Represent topics](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html) with one or [multiple](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) representations
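
To make step 5 more concrete, here is a rough, pure-Python sketch of class-based TF-IDF weighting. It follows the formula from the BERTopic paper, `W(t, c) = tf(t, c) * log(1 + A / f(t))`, where `A` is the average number of words per class and `f(t)` is the frequency of term `t` across all classes. This is not BERTopic's implementation (which works on sparse matrices and has extra options); the function name `c_tf_idf` and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_class):
    """Toy c-TF-IDF: docs_per_class maps a class (topic) to its list of tokens."""
    # Term frequency per class: all documents of a class are treated as one document.
    tf = {c: Counter(tokens) for c, tokens in docs_per_class.items()}

    # Frequency of each term across all classes.
    total = Counter()
    for counts in tf.values():
        total.update(counts)

    # A: average number of words per class.
    avg_words = sum(total.values()) / len(tf)

    return {
        c: {t: n * math.log(1 + avg_words / total[t]) for t, n in counts.items()}
        for c, counts in tf.items()
    }


weights = c_tf_idf({
    0: ["windows", "drive", "dos", "windows", "file"],
    1: ["space", "launch", "orbit", "space", "file"],
})
# Highest-weighted terms for class 0, best first.
print(sorted(weights[0], key=weights[0].get, reverse=True)[:2])
```

Terms that are frequent within one class but rare elsewhere get the highest weight, while a term shared by both classes (`file` here) is weighted down.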
## Functionality
BERTopic has many functions, which can quickly become overwhelming. To alleviate this, the tables below give an overview
of the methods, each with a short description of its purpose.
### Common
Below, you will find an overview of common functions in BERTopic.
| Method | Code |
|-----------------------|---|
| Fit the model | `.fit(docs)` |
| Fit the model and predict documents | `.fit_transform(docs)` |
| Predict new documents | `.transform([new_doc])` |
| Access single topic | `.get_topic(topic=12)` |
| Access all topics | `.get_topics()` |
| Get topic freq | `.get_topic_freq()` |
| Get all topic information| `.get_topic_info()` |
| Get all document information| `.get_document_info(docs)` |
| Get representative docs per topic | `.get_representative_docs()` |
| Update topic representation | `.update_topics(docs, n_gram_range=(1, 3))` |
| Generate topic labels | `.generate_topic_labels()` |
| Set topic labels | `.set_topic_labels(my_custom_labels)` |
| Merge topics | `.merge_topics(docs, topics_to_merge)` |
| Reduce nr of topics | `.reduce_topics(docs, nr_topics=30)` |
| Reduce outliers | `.reduce_outliers(docs, topics)` |
| Find topics | `.find_topics("vehicle")` |
| Save model | `.save("my_model", serialization="safetensors")` |
| Load model | `BERTopic.load("my_model")` |
| Get parameters | `.get_params()` |
### Attributes
After having trained your BERTopic model, several attributes are saved within your model. These attributes, in part,
refer to how model information is stored on an estimator during fitting. The attributes that you see below all end in `_` and are
public attributes that can be used to access model information.
| Attribute | Description |
|------------------------|---------------------------------------------------------------------------------------------|
| `.topics_` | The topics that are generated for each document after training or updating the topic model. |
| `.probabilities_` | The probabilities that are generated for each document if HDBSCAN is used. |
| `.topic_sizes_` | The size of each topic |
| `.topic_mapper_` | A class for tracking topics and their mappings anytime they are merged/reduced. |
| `.topic_representations_` | The top *n* terms per topic and their respective c-TF-IDF values. |
| `.c_tf_idf_` | The topic-term matrix as calculated through c-TF-IDF. |
| `.topic_aspects_` | The different aspects, or representations, of each topic. |
| `.topic_labels_` | The default labels for each topic. |
| `.custom_labels_` | Custom labels for each topic as generated through `.set_topic_labels`. |
| `.topic_embeddings_` | The embeddings for each topic if `embedding_model` was used. |
| `.representative_docs_` | The representative documents for each topic if HDBSCAN is used. |
### Variations
There are many different use cases in which topic modeling can be used. As such, several variations of BERTopic have been developed such that one package can be used across many use cases.
| Method | Code |
|-----------------------|---|
| [Topic Distribution Approximation](https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html) | `.approximate_distribution(docs)` |
| [Online Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/online/online.html) | `.partial_fit(doc)` |
| [Semi-supervised Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/semisupervised/semisupervised.html) | `.fit(docs, y=y)` |
| [Supervised Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html) | `.fit(docs, y=y)` |
| [Manual Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/manual/manual.html) | `.fit(docs, y=y)` |
| [Multimodal Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html) | `.fit(docs, images=images)` |
| [Topic Modeling per Class](https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html) | `.topics_per_class(docs, classes)` |
| [Dynamic Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html) | `.topics_over_time(docs, timestamps)` |
| [Hierarchical Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html) | `.hierarchical_topics(docs)` |
| [Guided Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html) | `BERTopic(seed_topic_list=seed_topic_list)` |
| [Zero-shot Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html) | `BERTopic(zeroshot_topic_list=zeroshot_topic_list)` |
| [Merge Multiple Models](https://maartengr.github.io/BERTopic/getting_started/merge/merge.html) | `BERTopic.merge_models([topic_model_1, topic_model_2])` |
### Visualizations
Evaluating topic models can be rather difficult due to the somewhat subjective nature of evaluation.
Visualizing different aspects of the topic model helps in understanding the model and makes it easier
to tweak the model to your liking.
| Method | Code |
|-----------------------|---|
| Visualize Topics | `.visualize_topics()` |
| Visualize Documents | `.visualize_documents()` |
| Visualize Document Hierarchy | `.visualize_hierarchical_documents()` |
| Visualize Topic Hierarchy | `.visualize_hierarchy()` |
| Visualize Topic Tree | `.get_topic_tree(hierarchical_topics)` |
| Visualize Topic Terms | `.visualize_barchart()` |
| Visualize Topic Similarity | `.visualize_heatmap()` |
| Visualize Term Score Decline | `.visualize_term_rank()` |
| Visualize Topic Probability Distribution | `.visualize_distribution(probs[0])` |
| Visualize Topics over Time | `.visualize_topics_over_time(topics_over_time)` |
| Visualize Topics per Class | `.visualize_topics_per_class(topics_per_class)` |
## Citation
To cite the [BERTopic paper](https://arxiv.org/abs/2203.05794), please use the following bibtex reference:
```bibtex
@article{grootendorst2022bertopic,
title={BERTopic: Neural topic modeling with a class-based TF-IDF procedure},
author={Grootendorst, Maarten},
journal={arXiv preprint arXiv:2203.05794},
year={2022}
}
```
@@ -1,9 +0,0 @@
from importlib.metadata import version
from bertopic._bertopic import BERTopic
__version__ = version("bertopic")
__all__ = [
"BERTopic",
]
File diff suppressed because it is too large Load Diff
@@ -1,538 +0,0 @@
import os
import json
import numpy as np
from pathlib import Path
from tempfile import TemporaryDirectory
# HuggingFace Hub
try:
from huggingface_hub import (
create_repo,
get_hf_file_metadata,
hf_hub_download,
hf_hub_url,
repo_type_and_id_from_hf_id,
upload_folder,
)
_has_hf_hub = True
except ImportError:
_has_hf_hub = False
# Typing
from typing import Union
# Pytorch check
try:
import torch
_has_torch = True
except ImportError:
_has_torch = False
# Image check
try:
from PIL import Image
_has_vision = True
except ImportError:
_has_vision = False
TOPICS_NAME = "topics.json"
CONFIG_NAME = "config.json"
HF_WEIGHTS_NAME = "topic_embeddings.bin" # default pytorch pkl
HF_SAFE_WEIGHTS_NAME = "topic_embeddings.safetensors" # safetensors version
CTFIDF_WEIGHTS_NAME = "ctfidf.bin" # default pytorch pkl
CTFIDF_SAFE_WEIGHTS_NAME = "ctfidf.safetensors" # safetensors version
CTFIDF_CFG_NAME = "ctfidf_config.json"
MODEL_CARD_TEMPLATE = """
---
tags:
- bertopic
library_name: bertopic
pipeline_tag: {PIPELINE_TAG}
---
# {MODEL_NAME}
This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
## Usage
To use this model, please install BERTopic:
```
pip install -U bertopic
```
You can use the model as follows:
```python
from bertopic import BERTopic
topic_model = BERTopic.load("{PATH}")
topic_model.get_topic_info()
```
## Topic overview
* Number of topics: {NR_TOPICS}
* Number of training documents: {NR_DOCUMENTS}
<details>
<summary>Click here for an overview of all topics.</summary>
{TOPICS}
</details>
## Training hyperparameters
{HYPERPARAMS}
## Framework versions
{FRAMEWORKS}
"""
def push_to_hf_hub(
model,
repo_id: str,
commit_message: str = "Add BERTopic model",
token: str = None,
revision: str = None,
private: bool = False,
create_pr: bool = False,
model_card: bool = True,
serialization: str = "safetensors",
save_embedding_model: Union[str, bool] = True,
save_ctfidf: bool = False,
):
"""Push your BERTopic model to a HuggingFace Hub.
Arguments:
model: The BERTopic model to push
repo_id: The name of your HuggingFace repository
commit_message: A commit message
token: Token to add if not already logged in
revision: Repository revision
private: Whether to create a private repository
create_pr: Whether to upload the model as a Pull Request
model_card: Whether to automatically create a modelcard
serialization: The type of serialization.
Either `safetensors` or `pytorch`
save_embedding_model: A pointer towards a HuggingFace model to be loaded in with
SentenceTransformers. E.g.,
`sentence-transformers/all-MiniLM-L6-v2`
save_ctfidf: Whether to save c-TF-IDF information
"""
if not _has_hf_hub:
raise ValueError("Make sure you have the huggingface hub installed via `pip install --upgrade huggingface_hub`")
# Create repo if it doesn't exist yet and infer complete repo_id
repo_url = create_repo(repo_id, token=token, private=private, exist_ok=True)
_, repo_owner, repo_name = repo_type_and_id_from_hf_id(repo_url)
repo_id = f"{repo_owner}/{repo_name}"
# Temporarily save model and push to HF
with TemporaryDirectory() as tmpdir:
# Save model weights and config.
model.save(
tmpdir,
serialization=serialization,
save_embedding_model=save_embedding_model,
save_ctfidf=save_ctfidf,
)
# Add README if it does not exist
try:
get_hf_file_metadata(hf_hub_url(repo_id=repo_id, filename="README.md", revision=revision))
except: # noqa: E722
if model_card:
readme_text = generate_readme(model, repo_id)
readme_path = Path(tmpdir) / "README.md"
readme_path.write_text(readme_text, encoding="utf8")
# Upload model
return upload_folder(
repo_id=repo_id,
folder_path=tmpdir,
revision=revision,
create_pr=create_pr,
commit_message=commit_message,
)
def load_local_files(path):
"""Load local BERTopic files."""
# Load json configs
topics = load_cfg_from_json(path / TOPICS_NAME)
params = load_cfg_from_json(path / CONFIG_NAME)
# Load Topic Embeddings
safetensor_path = path / HF_SAFE_WEIGHTS_NAME
if safetensor_path.is_file():
tensors = load_safetensors(safetensor_path)
else:
torch_path = path / HF_WEIGHTS_NAME
if torch_path.is_file():
tensors = torch.load(torch_path, map_location="cpu")
tensors = {k: v.numpy() for k, v in tensors.items()}
# c-TF-IDF
try:
ctfidf_tensors = None
safetensor_path = path / CTFIDF_SAFE_WEIGHTS_NAME
if safetensor_path.is_file():
ctfidf_tensors = load_safetensors(safetensor_path)
else:
torch_path = path / CTFIDF_WEIGHTS_NAME
if torch_path.is_file():
ctfidf_tensors = torch.load(torch_path, map_location="cpu")
ctfidf_tensors = {k: v.numpy() for k, v in ctfidf_tensors.items()}
ctfidf_config = load_cfg_from_json(path / CTFIDF_CFG_NAME)
except: # noqa: E722
ctfidf_config, ctfidf_tensors = None, None
# Load images
images = None
if _has_vision:
try:
Image.open(path / "images/0.jpg")
_has_images = True
except: # noqa: E722
_has_images = False
if _has_images:
topic_list = list(topics["topic_representations"].keys())
images = {}
for topic in topic_list:
image = Image.open(path / f"images/{topic}.jpg")
images[int(topic)] = image
return topics, params, tensors, ctfidf_tensors, ctfidf_config, images
def load_files_from_hf(path):
"""Load files from HuggingFace."""
path = str(path)
# Configs
topics = load_cfg_from_json(hf_hub_download(path, TOPICS_NAME, revision=None))
params = load_cfg_from_json(hf_hub_download(path, CONFIG_NAME, revision=None))
# Topic Embeddings
try:
tensors = hf_hub_download(path, HF_SAFE_WEIGHTS_NAME, revision=None)
tensors = load_safetensors(tensors)
except: # noqa: E722
tensors = hf_hub_download(path, HF_WEIGHTS_NAME, revision=None)
tensors = torch.load(tensors, map_location="cpu")
# c-TF-IDF
try:
ctfidf_config = load_cfg_from_json(hf_hub_download(path, CTFIDF_CFG_NAME, revision=None))
try:
ctfidf_tensors = hf_hub_download(path, CTFIDF_SAFE_WEIGHTS_NAME, revision=None)
ctfidf_tensors = load_safetensors(ctfidf_tensors)
except: # noqa: E722
ctfidf_tensors = hf_hub_download(path, CTFIDF_WEIGHTS_NAME, revision=None)
ctfidf_tensors = torch.load(ctfidf_tensors, map_location="cpu")
except: # noqa: E722
ctfidf_config, ctfidf_tensors = None, None
# Load images if they exist
images = None
if _has_vision:
try:
hf_hub_download(path, "images/0.jpg", revision=None)
_has_images = True
except: # noqa: E722
_has_images = False
if _has_images:
topic_list = list(topics["topic_representations"].keys())
images = {}
for topic in topic_list:
image = Image.open(hf_hub_download(path, f"images/{topic}.jpg", revision=None))
images[int(topic)] = image
return topics, params, tensors, ctfidf_tensors, ctfidf_config, images
def generate_readme(model, repo_id: str):
"""Generate README for HuggingFace model card."""
model_card = MODEL_CARD_TEMPLATE
topic_table_head = "| Topic ID | Topic Keywords | Topic Frequency | Label | \n|----------|----------------|-----------------|-------| \n"
# Get Statistics
model_name = repo_id.split("/")[-1]
params = {param: value for param, value in model.get_params().items() if "model" not in param}
params = "\n".join([f"* {param}: {value}" for param, value in params.items()])
topics = sorted(list(set(model.topics_)))
nr_topics = str(len(set(model.topics_)))
if model.topic_sizes_ is not None:
nr_documents = str(sum(model.topic_sizes_.values()))
else:
nr_documents = ""
# Topic information
topic_keywords = [" - ".join(list(zip(*model.get_topic(topic)))[0][:5]) for topic in topics]
topic_freq = [model.get_topic_freq(topic) for topic in topics]
topic_labels = model.custom_labels_ if model.custom_labels_ else [model.topic_labels_[topic] for topic in topics]
topics = [
f"| {topic} | {topic_keywords[index]} | {topic_freq[topic]} | {topic_labels[index]} | \n"
for index, topic in enumerate(topics)
]
topics = topic_table_head + "".join(topics)
frameworks = "\n".join([f"* {param}: {value}" for param, value in get_package_versions().items()])
# Fill Statistics into model card
model_card = model_card.replace("{MODEL_NAME}", model_name)
model_card = model_card.replace("{PATH}", repo_id)
model_card = model_card.replace("{NR_TOPICS}", nr_topics)
model_card = model_card.replace("{TOPICS}", topics.strip())
model_card = model_card.replace("{NR_DOCUMENTS}", nr_documents)
model_card = model_card.replace("{HYPERPARAMS}", params)
model_card = model_card.replace("{FRAMEWORKS}", frameworks)
# Fill Pipeline tag
has_visual_aspect = check_has_visual_aspect(model)
if not has_visual_aspect:
model_card = model_card.replace("{PIPELINE_TAG}", "text-classification")
else:
model_card = model_card.replace("pipeline_tag: {PIPELINE_TAG}\n", "") # TODO add proper tag for this instance
return model_card
def save_hf(model, save_directory, serialization: str):
"""Save topic embeddings, either safely (using safetensors) or using legacy pytorch."""
tensors = np.array(model.topic_embeddings_, dtype=np.float32)
if serialization == "safetensors":
tensors = {"topic_embeddings": tensors}
save_safetensors(save_directory / HF_SAFE_WEIGHTS_NAME, tensors)
if serialization == "pytorch":
assert _has_torch, "`pip install torch` to save as bin"
tensors = {"topic_embeddings": torch.from_numpy(tensors)}
torch.save(tensors, save_directory / HF_WEIGHTS_NAME)
def save_ctfidf(model, save_directory: str, serialization: str):
"""Save c-TF-IDF sparse matrix."""
indptr = model.c_tf_idf_.indptr
indices = model.c_tf_idf_.indices
data = model.c_tf_idf_.data
shape = np.array(model.c_tf_idf_.shape)
diag = np.array(model.ctfidf_model._idf_diag.data)
if serialization == "safetensors":
tensors = {
"indptr": indptr,
"indices": indices,
"data": data,
"shape": shape,
"diag": diag,
}
save_safetensors(save_directory / CTFIDF_SAFE_WEIGHTS_NAME, tensors)
if serialization == "pytorch":
assert _has_torch, "`pip install torch` to save as .bin"
tensors = {
"indptr": torch.from_numpy(indptr),
"indices": torch.from_numpy(indices),
"data": torch.from_numpy(data),
"shape": torch.from_numpy(shape),
"diag": torch.from_numpy(diag),
}
torch.save(tensors, save_directory / CTFIDF_WEIGHTS_NAME)
def save_ctfidf_config(model, path):
"""Save parameters to recreate CountVectorizer and c-TF-IDF."""
config = {}
# Recreate ClassTfidfTransformer
config["ctfidf_model"] = {
"bm25_weighting": model.ctfidf_model.bm25_weighting,
"reduce_frequent_words": model.ctfidf_model.reduce_frequent_words,
}
# Recreate CountVectorizer
cv_params = model.vectorizer_model.get_params()
del cv_params["tokenizer"], cv_params["preprocessor"], cv_params["dtype"]
if not isinstance(cv_params["analyzer"], str):
del cv_params["analyzer"]
config["vectorizer_model"] = {
"params": cv_params,
"vocab": model.vectorizer_model.vocabulary_,
}
with path.open("w") as f:
json.dump(config, f, indent=2)
def save_config(model, path: str, embedding_model):
"""Save BERTopic configuration."""
path = Path(path)
params = model.get_params()
config = {param: value for param, value in params.items() if "model" not in param}
# Embedding model tag to be used in sentence-transformers
if isinstance(embedding_model, str):
config["embedding_model"] = embedding_model
with path.open("w") as f:
json.dump(config, f, indent=2)
return config
def check_has_visual_aspect(model):
"""Check if model has visual aspect."""
if _has_vision:
for aspect, value in model.topic_aspects_.items():
if isinstance(value[0], Image.Image):
return True
def save_images(model, path: str):
"""Save topic images."""
if _has_vision:
visual_aspects = None
for aspect, value in model.topic_aspects_.items():
if isinstance(value[0], Image.Image):
visual_aspects = model.topic_aspects_[aspect]
break
if visual_aspects is not None:
path.mkdir(exist_ok=True, parents=True)
for topic, image in visual_aspects.items():
image.save(path / f"{topic}.jpg")
def save_topics(model, path: str):
"""Save Topic-specific information."""
path = Path(path)
if _has_vision:
selected_topic_aspects = {}
for aspect, value in model.topic_aspects_.items():
if not isinstance(value[0], Image.Image):
selected_topic_aspects[aspect] = value
else:
selected_topic_aspects["Visual_Aspect"] = True
else:
selected_topic_aspects = model.topic_aspects_
topics = {
"topic_representations": model.topic_representations_,
"topics": [int(topic) for topic in model.topics_],
"topic_sizes": model.topic_sizes_,
"topic_mapper": np.array(model.topic_mapper_.mappings_, dtype=int).tolist(),
"topic_labels": model.topic_labels_,
"custom_labels": model.custom_labels_,
"_outliers": int(model._outliers),
"topic_aspects": selected_topic_aspects,
}
with path.open("w") as f:
json.dump(topics, f, indent=2, cls=NumpyEncoder)
def load_cfg_from_json(json_file: Union[str, os.PathLike]):
"""Load configuration from json."""
with open(json_file, "r", encoding="utf-8") as reader:
text = reader.read()
return json.loads(text)
class NumpyEncoder(json.JSONEncoder):
def default(self, obj):
if isinstance(obj, np.integer):
return int(obj)
if isinstance(obj, np.floating):
return float(obj)
return super(NumpyEncoder, self).default(obj)
def get_package_versions():
"""Get versions of main dependencies of BERTopic."""
try:
import platform
from numpy import __version__ as np_version
from pandas import __version__ as pandas_version
from sklearn import __version__ as sklearn_version
from plotly import __version__ as plotly_version
try:
from importlib.metadata import version
hdbscan_version = version("hdbscan")
except (ImportError, ModuleNotFoundError):
hdbscan_version = None
try:
from umap import __version__ as umap_version
except (ImportError, ModuleNotFoundError):
umap_version = None
try:
from sentence_transformers import __version__ as sbert_version
except (ImportError, ModuleNotFoundError):
sbert_version = None
try:
from numba import __version__ as numba_version
except (ImportError, ModuleNotFoundError):
numba_version = None
try:
from transformers import __version__ as transformers_version
except (ImportError, ModuleNotFoundError):
transformers_version = None
return {
"Numpy": np_version,
"HDBSCAN": hdbscan_version,
"UMAP": umap_version,
"Pandas": pandas_version,
"Scikit-Learn": sklearn_version,
"Sentence-transformers": sbert_version,
"Transformers": transformers_version,
"Numba": numba_version,
"Plotly": plotly_version,
"Python": platform.python_version(),
}
except Exception as e:
return e
def load_safetensors(path):
"""Load safetensors and check whether it is installed."""
try:
import safetensors.numpy
return safetensors.numpy.load_file(path)
except ImportError:
raise ValueError("`pip install safetensors` to load .safetensors")
def save_safetensors(path, tensors):
"""Save safetensors and check whether it is installed."""
try:
import safetensors.numpy
safetensors.numpy.save_file(tensors, path)
except ImportError:
raise ValueError("`pip install safetensors` to save as .safetensors")
@@ -1,228 +0,0 @@
import numpy as np
import pandas as pd
import logging
from collections.abc import Iterable
from scipy.sparse import csr_matrix
from scipy.spatial.distance import squareform
from typing import Optional, Union, Tuple
class MyLogger:
def __init__(self):
self.logger = logging.getLogger("BERTopic")
def configure(self, level):
self.set_level(level)
self._add_handler()
self.logger.propagate = False
def info(self, message):
self.logger.info(f"{message}")
def warning(self, message):
self.logger.warning(f"WARNING: {message}")
def set_level(self, level):
levels = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]
if level in levels:
self.logger.setLevel(level)
def _add_handler(self):
sh = logging.StreamHandler()
sh.setFormatter(logging.Formatter("%(asctime)s - %(name)s - %(message)s"))
self.logger.addHandler(sh)
# Remove duplicate handlers
if len(self.logger.handlers) > 1:
self.logger.handlers = [self.logger.handlers[0]]
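The handler de-duplication in `MyLogger` can be illustrated with the standard library alone. The following is a minimal sketch, not the class above: the helper `get_logger` and the logger name `example_topic_logger` are hypothetical, and this variant checks for an existing handler before adding one instead of trimming the list afterwards.

```python
import logging


def get_logger(name: str, level: str = "INFO") -> logging.Logger:
    """Return a named logger that carries exactly one stream handler."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Keep messages from also being emitted by the root logger
    logger.propagate = False
    # Only attach a handler the first time this logger is configured
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s - %(name)s - %(message)s"))
        logger.addHandler(handler)
    return logger


# Configuring the same logger twice does not duplicate its handler
first = get_logger("example_topic_logger")
second = get_logger("example_topic_logger")
```

Because `logging.getLogger` returns the same object for the same name, repeated configuration would otherwise stack handlers and print every message several times.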
def check_documents_type(documents):
"""Check whether the input documents are indeed a list of strings."""
if isinstance(documents, pd.DataFrame):
raise TypeError("Make sure to supply a list of strings, not a dataframe.")
elif isinstance(documents, Iterable) and not isinstance(documents, str):
if not all(isinstance(doc, str) for doc in documents):
raise TypeError("Make sure that the iterable only contains strings.")
else:
raise TypeError("Make sure that the documents variable is an iterable containing strings only.")
def check_embeddings_shape(embeddings, docs):
"""Check if the embeddings have the correct shape."""
if embeddings is not None:
if not any([isinstance(embeddings, np.ndarray), isinstance(embeddings, csr_matrix)]):
raise ValueError("Make sure to input embeddings as a numpy array or scipy.sparse.csr.csr_matrix. ")
else:
if embeddings.shape[0] != len(docs):
raise ValueError(
"Make sure that the embeddings are a numpy array with shape: "
"(len(docs), vector_dim) where vector_dim is the dimensionality "
"of the vector embeddings. "
)
def check_is_fitted(topic_model):
"""Check whether the model was fitted by verifying that `topics_` has been set.
Arguments:
topic_model: BERTopic instance for which the check is performed.
Returns:
None
Raises:
ValueError: If `topics_` is None, i.e. the model has not been fitted.
"""
msg = "This %(name)s instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator."
if topic_model.topics_ is None:
raise ValueError(msg % {"name": type(topic_model).__name__})
class NotInstalled:
"""This object is used to notify the user that additional dependencies need to be
installed in order to use the requested tool.
"""
def __init__(self, tool, dep, custom_msg=None):
self.tool = tool
self.dep = dep
msg = f"In order to use {self.tool} you will need to install it via:\n\n"
if custom_msg is not None:
msg += custom_msg
else:
msg += f"pip install bertopic[{self.dep}]\n\n"
self.msg = msg
def __getattr__(self, *args, **kwargs):
raise ModuleNotFoundError(self.msg)
def __call__(self, *args, **kwargs):
raise ModuleNotFoundError(self.msg)
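The placeholder pattern used by `NotInstalled` can be sketched in isolation. This is a standalone illustration under stated assumptions, not BERTopic's own code: the class `MissingDependency`, the helper `load_optional`, the module name `not_a_real_module_xyz`, and the package name `example-backend` are all hypothetical.

```python
import importlib


class MissingDependency:
    """Stand-in object that raises a helpful error as soon as it is used."""

    def __init__(self, tool: str, install_hint: str):
        self._msg = f"In order to use {tool} you will need to install it via:\n\n{install_hint}\n"

    def __getattr__(self, name):
        # Only reached for attributes that normal lookup did not find
        raise ModuleNotFoundError(self._msg)

    def __call__(self, *args, **kwargs):
        raise ModuleNotFoundError(self._msg)


def load_optional(module_name: str, tool: str, install_hint: str):
    """Return the imported module, or a placeholder if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError:
        return MissingDependency(tool, install_hint)


# The import itself never fails; the error surfaces only on first use
backend = load_optional("not_a_real_module_xyz", "ExampleBackend", "pip install example-backend")

try:
    backend.SomeClass
except ModuleNotFoundError as err:
    print(err)
```

Deferring the error this way lets a package import cleanly when optional extras are absent, while still telling the user exactly what to install when they reach for the missing feature.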
def validate_distance_matrix(X, n_samples):
"""Validate the distance matrix and convert it to a condensed distance matrix
if necessary.
A valid distance matrix is either a square matrix of shape (n_samples, n_samples)
with zeros on the diagonal and non-negative values or condensed distance matrix
of shape (n_samples * (n_samples - 1) / 2,) containing the upper triangular of the
distance matrix.
Arguments:
X: Distance matrix to validate.
n_samples: Number of samples in the dataset.
Returns:
X: Validated distance matrix.
Raises:
ValueError: If the distance matrix is not valid.
"""
# Make sure it is the 1-D condensed distance matrix with zeros on the diagonal
s = X.shape
if len(s) == 1:
# check it has correct size
n = s[0]
if n != (n_samples * (n_samples - 1) / 2):
raise ValueError("The condensed distance matrix must have shape (n*(n-1)/2,).")
elif len(s) == 2:
# check it has correct size
if (s[0] != n_samples) or (s[1] != n_samples):
raise ValueError("The distance matrix must be of shape (n, n) where n is the number of samples.")
# force zero diagonal and convert to condensed
np.fill_diagonal(X, 0)
X = squareform(X)
else:
raise ValueError(
"The distance matrix must be either a 1-D condensed "
"distance matrix of shape (n*(n-1)/2,) or a "
"2-D square distance matrix of shape (n, n), "
"where n is the number of documents. "
"Got a distance matrix of shape %s" % str(s)
)
# Make sure its entries are non-negative
if np.any(X < 0):
raise ValueError("Distance matrix cannot contain negative values.")
return X
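The relationship between the square and the condensed form can be made concrete with a small pure-Python sketch. It does not use `scipy.spatial.distance.squareform` as the function above does; the helper `to_condensed` is hypothetical and only illustrates the row-major upper-triangle layout and the `n*(n-1)/2` length.

```python
from typing import List


def to_condensed(square: List[List[float]]) -> List[float]:
    """Flatten the strict upper triangle of a square distance matrix, row by row."""
    n = len(square)
    if any(len(row) != n for row in square):
        raise ValueError("The distance matrix must be of shape (n, n).")
    # Entries above the diagonal: (0,1), (0,2), ..., (1,2), ...
    condensed = [square[i][j] for i in range(n) for j in range(i + 1, n)]
    if any(d < 0 for d in condensed):
        raise ValueError("Distance matrix cannot contain negative values.")
    return condensed


square = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 3.0],
    [2.0, 3.0, 0.0],
]
condensed = to_condensed(square)
print(condensed)  # [1.0, 2.0, 3.0]
```

For three samples the condensed vector has `3 * 2 / 2 = 3` entries, which is the size check applied to 1-D input above.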
def get_unique_distances(dists: np.array, noise_max=1e-7) -> np.array:
"""Check if the consecutive elements in the distance array are the same. If so, a small noise
is added to one of the elements to make sure that the array does not contain duplicates.
Arguments:
dists: distance array sorted in the increasing order.
noise_max: the maximal magnitude of noise to be added.
Returns:
Unique distances sorted in the preserved increasing order.
"""
dists_cp = dists.copy()
for i in range(dists.shape[0] - 1):
if dists[i] == dists[i + 1]:
# returns the next unique distance or the current distance with the added noise
next_unique_dist = next((d for d in dists[i + 1 :] if d != dists[i]), dists[i] + noise_max)
# the noise can never be larger than the difference between the next unique distance and the current one
curr_max_noise = min(noise_max, next_unique_dist - dists_cp[i])
dists_cp[i + 1] = np.random.uniform(low=dists_cp[i] + curr_max_noise / 2, high=dists_cp[i] + curr_max_noise)
return dists_cp
def select_topic_representation(
ctfidf_embeddings: Optional[Union[np.ndarray, csr_matrix]] = None,
embeddings: Optional[Union[np.ndarray, csr_matrix]] = None,
use_ctfidf: bool = True,
output_ndarray: bool = False,
) -> Tuple[np.ndarray, bool]:
"""Select the topic representation.
Arguments:
ctfidf_embeddings: The c-TF-IDF embedding matrix
embeddings: The topic embedding matrix
use_ctfidf: Whether to use the c-TF-IDF representation. If False, topics embedding representation is used, if it
exists. Default is True.
output_ndarray: Whether to convert the selected representation into ndarray
Raises:
ValueError:
- If no topic representation was found
- If c-TF-IDF embeddings are not a numpy array or a scipy.sparse.csr_matrix
Returns:
The selected topic representation and a boolean indicating whether it is c-TF-IDF.
"""
def to_ndarray(array: Union[np.ndarray, csr_matrix]) -> np.ndarray:
if isinstance(array, csr_matrix):
return array.toarray()
return array
logger = MyLogger()
if use_ctfidf:
if ctfidf_embeddings is None:
logger.warning(
"No c-TF-IDF matrix was found even though it is supposed to be used (`use_ctfidf` is True). "
"Defaulting to semantic embeddings."
)
repr_, ctfidf_used = embeddings, False
else:
repr_, ctfidf_used = ctfidf_embeddings, True
else:
if embeddings is None:
logger.warning(
"No topic embeddings were found even though they are supposed to be used (`use_ctfidf` is False). "
"Defaulting to c-TF-IDF representation."
)
repr_, ctfidf_used = ctfidf_embeddings, True
else:
repr_, ctfidf_used = embeddings, False
return to_ndarray(repr_) if output_ndarray else repr_, ctfidf_used
@@ -1,60 +0,0 @@
from ._base import BaseEmbedder
from ._word_doc import WordDocEmbedder
from ._utils import languages
from bertopic._utils import NotInstalled
# OpenAI Embeddings
try:
from bertopic.backend._openai import OpenAIBackend
except ModuleNotFoundError:
msg = "`pip install openai` \n\n"
OpenAIBackend = NotInstalled("OpenAI", "OpenAI", custom_msg=msg)
# Cohere Embeddings
try:
from bertopic.backend._cohere import CohereBackend
except ModuleNotFoundError:
msg = "`pip install cohere` \n\n"
CohereBackend = NotInstalled("Cohere", "Cohere", custom_msg=msg)
# Multimodal Embeddings
try:
from bertopic.backend._multimodal import MultiModalBackend
except ModuleNotFoundError:
msg = "`pip install bertopic[vision]` \n\n"
MultiModalBackend = NotInstalled("Vision", "Vision", custom_msg=msg)
# Model2Vec Embeddings
try:
from bertopic.backend._model2vec import Model2VecBackend
except ModuleNotFoundError:
msg = "`pip install model2vec` \n\n"
Model2VecBackend = NotInstalled("Model2Vec", "Model2Vec", custom_msg=msg)
# FastEmbed Embeddings
try:
from bertopic.backend._fastembed import FastEmbedBackend
except ModuleNotFoundError:
msg = "`pip install fastembed` \n\n"
FastEmbedBackend = NotInstalled("FastEmbed", "FastEmbed", custom_msg=msg)
# LangChain Embeddings
try:
from bertopic.backend._langchain import LangChainBackend
except ModuleNotFoundError:
msg = "`pip install langchain` \n\n"
LangChainBackend = NotInstalled("LangChain", "LangChain", custom_msg=msg)
__all__ = [
"BaseEmbedder",
"WordDocEmbedder",
"OpenAIBackend",
"CohereBackend",
"Model2VecBackend",
"MultiModalBackend",
"FastEmbedBackend",
"LangChainBackend",
"languages",
]
@@ -1,62 +0,0 @@
import numpy as np
from typing import List
class BaseEmbedder:
"""The Base Embedder used for creating embedding models.
Arguments:
embedding_model: The main embedding model to be used for extracting
document and word embedding
word_embedding_model: The embedding model used for extracting word
embeddings only. If this model is selected,
then the `embedding_model` is purely used for
creating document embeddings.
"""
def __init__(self, embedding_model=None, word_embedding_model=None):
self.embedding_model = embedding_model
self.word_embedding_model = word_embedding_model
def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
"""Embed a list of n documents/words into an n-dimensional
matrix of embeddings.
Arguments:
documents: A list of documents or words to be embedded
verbose: Controls the verbosity of the process
Returns:
Document/words embeddings with shape (n, m) with `n` documents/words
that each have an embeddings size of `m`
"""
pass
def embed_words(self, words: List[str], verbose: bool = False) -> np.ndarray:
"""Embed a list of n words into an n-dimensional
matrix of embeddings.
Arguments:
words: A list of words to be embedded
verbose: Controls the verbosity of the process
Returns:
Word embeddings with shape (n, m) with `n` words
that each have an embeddings size of `m`
"""
return self.embed(words, verbose)
def embed_documents(self, document: List[str], verbose: bool = False) -> np.ndarray:
"""Embed a list of n documents into an n-dimensional
matrix of embeddings.
Arguments:
document: A list of documents to be embedded
verbose: Controls the verbosity of the process
Returns:
Document embeddings with shape (n, m) with `n` documents
that each have an embeddings size of `m`
"""
return self.embed(document, verbose)
@@ -1,94 +0,0 @@
import time
import numpy as np
from tqdm import tqdm
from typing import Any, List, Mapping
from bertopic.backend import BaseEmbedder
class CohereBackend(BaseEmbedder):
"""Cohere Embedding Model.
Arguments:
client: A `cohere` client.
embedding_model: A Cohere model. Default is "large".
For an overview of models see:
https://docs.cohere.ai/docs/generation-card
delay_in_seconds: If a `batch_size` is given, use this to set
the delay in seconds between batches.
batch_size: The size of each batch.
embed_kwargs: Kwargs passed to `cohere.Client.embed`.
Can be used to define additional parameters
such as `input_type`
Examples:
```python
import cohere
from bertopic.backend import CohereBackend
client = cohere.Client("APIKEY")
cohere_model = CohereBackend(client)
```
If you want to specify `input_type`:
```python
cohere_model = CohereBackend(
client,
embedding_model="embed-english-v3.0",
embed_kwargs={"input_type": "clustering"}
)
```
"""
    def __init__(
        self,
        client,
        embedding_model: str = "large",
        delay_in_seconds: float = None,
        batch_size: int = None,
        embed_kwargs: Mapping[str, Any] = None,
    ):
        super().__init__()
        self.client = client
        self.embedding_model = embedding_model
        self.delay_in_seconds = delay_in_seconds
        self.batch_size = batch_size
        # Copy into a fresh dict so that neither a shared default nor the caller's mapping is mutated
        self.embed_kwargs = dict(embed_kwargs) if embed_kwargs is not None else {}
        if self.embed_kwargs.get("model"):
            self.embedding_model = self.embed_kwargs.get("model")
        else:
            self.embed_kwargs["model"] = self.embedding_model
def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
"""Embed a list of n documents/words into an n-dimensional
matrix of embeddings.
Arguments:
documents: A list of documents or words to be embedded
verbose: Controls the verbosity of the process
Returns:
Document/words embeddings with shape (n, m) with `n` documents/words
that each have an embeddings size of `m`
"""
# Batch-wise embedding extraction
if self.batch_size is not None:
embeddings = []
for batch in tqdm(self._chunks(documents), disable=not verbose):
response = self.client.embed(texts=batch, **self.embed_kwargs)
embeddings.extend(response.embeddings)
# Delay subsequent calls
if self.delay_in_seconds:
time.sleep(self.delay_in_seconds)
# Extract embeddings all at once
else:
response = self.client.embed(texts=documents, **self.embed_kwargs)
embeddings = response.embeddings
return np.array(embeddings)
def _chunks(self, documents):
for i in range(0, len(documents), self.batch_size):
yield documents[i : i + self.batch_size]
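The batch-and-delay loop in `embed` can be tried without an API key. The sketch below is an assumption-laden stand-in: `fake_embed` is a hypothetical replacement for the remote embedding call, and `chunks` and `embed_in_batches` are illustrative names rather than part of the backend.

```python
import time
from typing import Iterator, List, Optional, Sequence


def chunks(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def fake_embed(texts: Sequence[str]) -> List[List[float]]:
    """Hypothetical stand-in for a remote embedding call: one 1-d vector per text."""
    return [[float(len(text))] for text in texts]


def embed_in_batches(
    documents: Sequence[str],
    batch_size: int,
    delay_in_seconds: Optional[float] = None,
) -> List[List[float]]:
    """Embed batch by batch, optionally pausing between calls to respect rate limits."""
    embeddings: List[List[float]] = []
    for batch in chunks(documents, batch_size):
        embeddings.extend(fake_embed(batch))
        if delay_in_seconds:
            time.sleep(delay_in_seconds)
    return embeddings


docs = ["a", "bb", "ccc", "dddd", "eeeee"]
batches = list(chunks(docs, 2))
vectors = embed_in_batches(docs, batch_size=2)
```

The last batch is simply shorter when the number of documents is not a multiple of `batch_size`, so no padding or special casing is needed.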
@@ -1,54 +0,0 @@
import numpy as np
from typing import List
from fastembed import TextEmbedding
from bertopic.backend import BaseEmbedder
class FastEmbedBackend(BaseEmbedder):
"""FastEmbed embedding model.
The FastEmbed embedding model used for generating sentence embeddings.
Arguments:
embedding_model: A FastEmbed embedding model
Examples:
To create a model, you can load in a string pointing to a supported
FastEmbed model:
```python
from bertopic.backend import FastEmbedBackend
sentence_model = FastEmbedBackend("BAAI/bge-small-en-v1.5")
```
"""
def __init__(self, embedding_model: str = "BAAI/bge-small-en-v1.5"):
super().__init__()
supported_models = [m["model"] for m in TextEmbedding.list_supported_models()]
if isinstance(embedding_model, str) and embedding_model in supported_models:
self.embedding_model = TextEmbedding(model_name=embedding_model)
else:
raise ValueError(
"Please select a correct FastEmbed model: \n"
"the model must be a string and must be supported. \n"
"The supported TextEmbedding model list is here: https://qdrant.github.io/fastembed/examples/Supported_Models/"
)
def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
"""Embed a list of n documents/words into an n-dimensional
matrix of embeddings.
Arguments:
documents: A list of documents or words to be embedded
verbose: Controls the verbosity of the process
Returns:
Document/words embeddings with shape (n, m) with `n` documents/words
that each have an embeddings size of `m`
"""
embeddings = np.array(list(self.embedding_model.embed(documents, show_progress_bar=verbose)))
return embeddings
@@ -1,78 +0,0 @@
import numpy as np
from tqdm import tqdm
from typing import Union, List
from flair.data import Sentence
from flair.embeddings import DocumentEmbeddings, TokenEmbeddings, DocumentPoolEmbeddings
from bertopic.backend import BaseEmbedder
class FlairBackend(BaseEmbedder):
"""Flair Embedding Model.
The Flair embedding model used for generating document and
word embeddings.
Arguments:
embedding_model: A Flair embedding model
Examples:
```python
from bertopic.backend import FlairBackend
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
# Create a Flair Embedding model
glove_embedding = WordEmbeddings('crawl')
document_glove_embeddings = DocumentPoolEmbeddings([glove_embedding])
# Pass the Flair model to create a new backend
flair_embedder = FlairBackend(document_glove_embeddings)
```
"""
def __init__(self, embedding_model: Union[TokenEmbeddings, DocumentEmbeddings]):
super().__init__()
# Flair word embeddings
if isinstance(embedding_model, TokenEmbeddings):
self.embedding_model = DocumentPoolEmbeddings([embedding_model])
# Flair document embeddings + disable fine tune to prevent CUDA OOM
# https://github.com/flairNLP/flair/issues/1719
elif isinstance(embedding_model, DocumentEmbeddings):
if "fine_tune" in embedding_model.__dict__:
embedding_model.fine_tune = False
self.embedding_model = embedding_model
else:
raise ValueError(
"Please select a correct Flair model by preparing either a token or document "
"embedding model: \n"
"`from flair.embeddings import TransformerDocumentEmbeddings` \n"
"`roberta = TransformerDocumentEmbeddings('roberta-base')`"
)
def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
"""Embed a list of n documents/words into an n-dimensional
matrix of embeddings.
Arguments:
documents: A list of documents or words to be embedded
verbose: Controls the verbosity of the process
Returns:
Document/words embeddings with shape (n, m) with `n` documents/words
that each have an embeddings size of `m`
"""
embeddings = []
for document in tqdm(documents, disable=not verbose):
try:
sentence = Sentence(document) if document else Sentence("an empty document")
self.embedding_model.embed(sentence)
except RuntimeError:
sentence = Sentence("an empty document")
self.embedding_model.embed(sentence)
embedding = sentence.embedding.detach().cpu().numpy()
embeddings.append(embedding)
embeddings = np.asarray(embeddings)
return embeddings
@@ -1,69 +0,0 @@
import numpy as np
from tqdm import tqdm
from typing import List
from bertopic.backend import BaseEmbedder
from gensim.models.keyedvectors import Word2VecKeyedVectors
class GensimBackend(BaseEmbedder):
"""Gensim Embedding Model.
The Gensim embedding model is typically used for word embeddings with
GloVe, Word2Vec or FastText.
Arguments:
embedding_model: A Gensim embedding model
Examples:
```python
from bertopic.backend import GensimBackend
import gensim.downloader as api
ft = api.load('fasttext-wiki-news-subwords-300')
ft_embedder = GensimBackend(ft)
```
"""
def __init__(self, embedding_model: Word2VecKeyedVectors):
super().__init__()
if isinstance(embedding_model, Word2VecKeyedVectors):
self.embedding_model = embedding_model
else:
raise ValueError(
"Please select a correct Gensim model: \n"
"`import gensim.downloader as api` \n"
"`ft = api.load('fasttext-wiki-news-subwords-300')`"
)
def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
"""Embed a list of n documents/words into an n-dimensional
matrix of embeddings.
Arguments:
documents: A list of documents or words to be embedded
verbose: Controls the verbosity of the process
Returns:
Document/words embeddings with shape (n, m) with `n` documents/words
that each have an embeddings size of `m`
"""
vector_shape = self.embedding_model.get_vector(list(self.embedding_model.index_to_key)[0]).shape[0]
empty_vector = np.zeros(vector_shape)
# Extract word embeddings and pool to document-level
embeddings = []
for doc in tqdm(documents, disable=not verbose, position=0, leave=True):
embedding = [
self.embedding_model.get_vector(word)
for word in doc.split()
if word in self.embedding_model.key_to_index
]
if len(embedding) > 0:
embeddings.append(np.mean(embedding, axis=0))
else:
embeddings.append(empty_vector)
embeddings = np.array(embeddings)
return embeddings
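The pooling step in `GensimBackend.embed` averages the vectors of known words and falls back to a zero vector when none are found. A minimal pure-Python sketch of that idea follows; the `mean_pool` helper and the toy two-dimensional `vectors` dictionary are hypothetical and stand in for a real Gensim `KeyedVectors` model.

```python
from typing import Dict, List


def mean_pool(doc: str, vectors: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the vectors of all in-vocabulary words of a whitespace-split document."""
    found = [vectors[word] for word in doc.split() if word in vectors]
    if not found:
        # No known word: return a zero vector so the output shape stays (n, dim)
        return [0.0] * dim
    # zip(*found) groups the values of each dimension across words
    return [sum(values) / len(found) for values in zip(*found)]


# Toy vocabulary; "bird" is deliberately out of vocabulary
vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}

pooled = mean_pool("cat dog bird", vectors, dim=2)
empty = mean_pool("bird", vectors, dim=2)
print(pooled)  # [0.5, 0.5]
print(empty)   # [0.0, 0.0]
```

Out-of-vocabulary words are skipped rather than counted, so they do not pull the average towards zero.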
@@ -1,104 +0,0 @@
import numpy as np
from tqdm import tqdm
from typing import List
from torch.utils.data import Dataset
from sklearn.preprocessing import normalize
from transformers.pipelines import Pipeline
from bertopic.backend import BaseEmbedder
class HFTransformerBackend(BaseEmbedder):
"""Hugging Face transformers model.
This uses the `transformers.pipelines.pipeline` to define and create
a feature generation pipeline from which embeddings can be extracted.
Arguments:
embedding_model: A Hugging Face feature extraction pipeline
Examples:
To use a Hugging Face transformers model, load in a pipeline and point
to any model found on their model hub (https://huggingface.co/models):
```python
from bertopic.backend import HFTransformerBackend
from transformers.pipelines import pipeline
hf_model = pipeline("feature-extraction", model="distilbert-base-cased")
embedding_model = HFTransformerBackend(hf_model)
```
"""
def __init__(self, embedding_model: Pipeline):
super().__init__()
if isinstance(embedding_model, Pipeline):
self.embedding_model = embedding_model
else:
raise ValueError(
"Please select a correct transformers pipeline. For example: "
"pipeline('feature-extraction', model='distilbert-base-cased', device=0)"
)
def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
"""Embed a list of n documents/words into an n-dimensional
matrix of embeddings.
Arguments:
documents: A list of documents or words to be embedded
verbose: Controls the verbosity of the process
Returns:
Document/words embeddings with shape (n, m) with `n` documents/words
that each have an embeddings size of `m`
"""
dataset = MyDataset(documents)
embeddings = []
for document, features in tqdm(
zip(documents, self.embedding_model(dataset, truncation=True, padding=True)),
total=len(dataset),
disable=not verbose,
):
embeddings.append(self._embed(document, features))
return np.array(embeddings)
def _embed(self, document: str, features: np.ndarray) -> np.ndarray:
"""Mean pooling.
Arguments:
document: The document for which to extract the attention mask
features: The embeddings for each token
Adopted from:
https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2#usage-huggingface-transformers
"""
token_embeddings = np.array(features)
attention_mask = self.embedding_model.tokenizer(document, truncation=True, padding=True, return_tensors="np")[
"attention_mask"
]
input_mask_expanded = np.broadcast_to(np.expand_dims(attention_mask, -1), token_embeddings.shape)
sum_embeddings = np.sum(token_embeddings * input_mask_expanded, 1)
sum_mask = np.clip(
input_mask_expanded.sum(1),
a_min=1e-9,
a_max=input_mask_expanded.sum(1).max(),
)
embedding = normalize(sum_embeddings / sum_mask)[0]
return embedding
class MyDataset(Dataset):
"""Dataset to pass to `transformers.pipelines.pipeline`."""
def __init__(self, docs):
self.docs = docs
def __len__(self):
return len(self.docs)
def __getitem__(self, idx):
return self.docs[idx]
@@ -1,43 +0,0 @@
from typing import List
import numpy as np
from bertopic.backend import BaseEmbedder
from langchain_core.embeddings import Embeddings
class LangChainBackend(BaseEmbedder):
"""LangChain Embedding Model.
This class uses the LangChain Embedding class to embed the documents.
Arguments:
embedding_model: A LangChain Embedding Instance.
Examples:
```python
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from bertopic.backend import LangChainBackend
hf_embedding = HuggingFaceInstructEmbeddings()
langchain_embedder = LangChainBackend(hf_embedding)
```
"""
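    def __init__(self, embedding_model: Embeddings):
        # Initialise the base attributes (e.g. `word_embedding_model`) like the other backends do
        super().__init__()
        self.embedding_model = embedding_model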
def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
"""Embed a list of n documents/words into an n-dimensional
matrix of embeddings.
Arguments:
documents: A list of documents or words to be embedded
verbose: Controls the verbosity of the process
Returns:
Document/words embeddings with shape (n, m) with `n` documents/words
that each have an embeddings size of `m`
"""
# Prepare documents, replacing empty strings with a single space
prepared_documents = [" " if doc == "" else doc for doc in documents]
response = self.embedding_model.embed_documents(prepared_documents)
return np.array(response)
@@ -1,129 +0,0 @@
import numpy as np
from typing import List, Union
from model2vec import StaticModel
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.backend import BaseEmbedder
class Model2VecBackend(BaseEmbedder):
"""Model2Vec embedding model.
Arguments:
embedding_model: Either a model2vec model or a
string pointing to a model2vec model
distill: Indicates whether to distill a sentence-transformers compatible model.
The distillation will happen during fitting of the topic model.
NOTE: Only works if `embedding_model` is a string.
distill_kwargs: Keyword arguments to pass to the distillation process
of `model2vec.distill.distill`
distill_vectorizer: A CountVectorizer used for creating a custom vocabulary
based on the same documents used for topic modeling.
NOTE: If "vocabulary" is in `distill_kwargs`, this will be ignored.
Examples:
To create a model, you can load in a string pointing to a
model2vec model:
```python
from bertopic.backend import Model2VecBackend
sentence_model = Model2VecBackend("minishlab/potion-base-8M")
```
or you can instantiate a model yourself:
```python
from bertopic.backend import Model2VecBackend
from model2vec import StaticModel
embedding_model = StaticModel.from_pretrained("minishlab/potion-base-8M")
sentence_model = Model2VecBackend(embedding_model)
```
If you want to distill a sentence-transformers model with the vocabulary of the documents,
run the following:
```python
from bertopic.backend import Model2VecBackend
sentence_model = Model2VecBackend("sentence-transformers/all-MiniLM-L6-v2", distill=True)
```
"""
    def __init__(
        self,
        embedding_model: Union[str, StaticModel],
        distill: bool = False,
        distill_kwargs: dict = None,
        distill_vectorizer: CountVectorizer = None,
    ):
        super().__init__()
        self.distill = distill
        # Copy the kwargs: `embed` adds a "vocabulary" key, which must not leak
        # into a shared default or into the caller's dictionary
        self.distill_kwargs = dict(distill_kwargs) if distill_kwargs is not None else {}
        self.distill_vectorizer = distill_vectorizer
        self._has_distilled = False
        # When we distill, we need a string pointing to a sentence-transformer model
        if self.distill:
            self._check_model2vec_installation()
            if not self.distill_vectorizer:
                self.distill_vectorizer = CountVectorizer()
            if isinstance(embedding_model, str):
                self.embedding_model = embedding_model
            else:
                raise ValueError("Please pass a string pointing to a sentence-transformer model when distilling.")
        # If we don't distill, we can pass a model2vec model directly or load from a string
        elif isinstance(embedding_model, StaticModel):
            self.embedding_model = embedding_model
        elif isinstance(embedding_model, str):
            self.embedding_model = StaticModel.from_pretrained(embedding_model)
        else:
            raise ValueError(
                "Please select a correct Model2Vec model: \n"
                "`from model2vec import StaticModel` \n"
                "`model = StaticModel.from_pretrained('minishlab/potion-base-8M')`"
            )
def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
"""Embed a list of n documents/words into an n-dimensional
matrix of embeddings.
Arguments:
documents: A list of documents or words to be embedded
verbose: Controls the verbosity of the process
Returns:
Document/words embeddings with shape (n, m) with `n` documents/words
that each have an embeddings size of `m`
"""
# Distill the model
if self.distill and not self._has_distilled:
from model2vec.distill import distill
# Distill with the vocabulary of the documents
if not self.distill_kwargs.get("vocabulary"):
X = self.distill_vectorizer.fit_transform(documents)
word_counts = np.array(X.sum(axis=0)).flatten()
words = self.distill_vectorizer.get_feature_names_out()
vocabulary = [word for word, _ in sorted(zip(words, word_counts), key=lambda x: x[1], reverse=True)]
self.distill_kwargs["vocabulary"] = vocabulary
# Distill the model
self.embedding_model = distill(self.embedding_model, **self.distill_kwargs)
# Distillation should happen only once and not for every embed call
# The distillation should only happen the first time on the entire vocabulary
self._has_distilled = True
# Embed the documents
embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
return embeddings
def _check_model2vec_installation(self):
try:
from model2vec.distill import distill # noqa: F401
except ImportError:
raise ImportError("To distill a model using model2vec, you need to run `pip install model2vec[distill]`")
@@ -1,200 +0,0 @@
import numpy as np
from PIL import Image
from tqdm import tqdm
from typing import List, Union
from sentence_transformers import SentenceTransformer
from bertopic.backend import BaseEmbedder
class MultiModalBackend(BaseEmbedder):
"""Multimodal backend using Sentence-transformers.
The sentence-transformers embedding model used for
generating word, document, and image embeddings.
Arguments:
embedding_model: A sentence-transformers embedding model that
can either embed both images and text or only text.
If it only embeds text, then `image_model` needs
to be used to embed the images.
image_model: A sentence-transformers embedding model that is used
to embed only images.
batch_size: The sizes of image batches to pass
Examples:
To create a model, you can load in a string pointing to a
sentence-transformers model:
```python
from bertopic.backend import MultiModalBackend
sentence_model = MultiModalBackend("clip-ViT-B-32")
```
or you can instantiate a model yourself:
```python
from bertopic.backend import MultiModalBackend
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("clip-ViT-B-32")
sentence_model = MultiModalBackend(embedding_model)
```
"""
def __init__(
self,
embedding_model: Union[str, SentenceTransformer],
image_model: Union[str, SentenceTransformer] = None,
batch_size: int = 32,
):
super().__init__()
self.batch_size = batch_size
# Text or Text+Image model
if isinstance(embedding_model, SentenceTransformer):
self.embedding_model = embedding_model
elif isinstance(embedding_model, str):
self.embedding_model = SentenceTransformer(embedding_model)
else:
raise ValueError(
"Please select a correct SentenceTransformers model: \n"
"`from sentence_transformers import SentenceTransformer` \n"
"`model = SentenceTransformer('clip-ViT-B-32')`"
)
# Image Model
self.image_model = None
if image_model is not None:
if isinstance(image_model, SentenceTransformer):
self.image_model = image_model
elif isinstance(image_model, str):
self.image_model = SentenceTransformer(image_model)
else:
raise ValueError(
"Please select a correct SentenceTransformers model: \n"
"`from sentence_transformers import SentenceTransformer` \n"
"`model = SentenceTransformer('clip-ViT-B-32')`"
)
try:
self.tokenizer = self.embedding_model._first_module().processor.tokenizer
except AttributeError:
self.tokenizer = self.embedding_model.tokenizer
except: # noqa: E722
self.tokenizer = None
def embed(self, documents: List[str], images: List[str] = None, verbose: bool = False) -> np.ndarray:
"""Embed a list of n documents/words or images into an n-dimensional
matrix of embeddings.
Either documents, images, or both can be provided. If both are provided,
then the embeddings are averaged.
Arguments:
documents: A list of documents or words to be embedded
images: A list of image paths to be embedded
verbose: Controls the verbosity of the process
Returns:
Document/words embeddings with shape (n, m) with `n` documents/words
that each have an embeddings size of `m`
"""
# Embed documents
doc_embeddings = None
if documents[0] is not None:
doc_embeddings = self.embed_documents(documents)
# Embed images
image_embeddings = None
if isinstance(images, list):
image_embeddings = self.embed_images(images, verbose)
# Average embeddings
averaged_embeddings = None
if doc_embeddings is not None and image_embeddings is not None:
averaged_embeddings = np.mean([doc_embeddings, image_embeddings], axis=0)
if averaged_embeddings is not None:
return averaged_embeddings
elif doc_embeddings is not None:
return doc_embeddings
elif image_embeddings is not None:
return image_embeddings
def embed_documents(self, documents: List[str], verbose: bool = False) -> np.ndarray:
"""Embed a list of n documents/words into an n-dimensional
matrix of embeddings.
Arguments:
documents: A list of documents or words to be embedded
verbose: Controls the verbosity of the process
Returns:
Document/words embeddings with shape (n, m) with `n` documents/words
that each have an embeddings size of `m`
"""
truncated_docs = [self._truncate_document(doc) for doc in documents]
embeddings = self.embedding_model.encode(truncated_docs, show_progress_bar=verbose)
return embeddings
def embed_words(self, words: List[str], verbose: bool = False) -> np.ndarray:
"""Embed a list of n words into an n-dimensional
matrix of embeddings.
Arguments:
words: A list of words to be embedded
verbose: Controls the verbosity of the process
Returns:
Document/words embeddings with shape (n, m) with `n` documents/words
that each have an embeddings size of `m`
"""
embeddings = self.embedding_model.encode(words, show_progress_bar=verbose)
return embeddings
def embed_images(self, images, verbose):
if self.batch_size:
nr_iterations = int(np.ceil(len(images) / self.batch_size))
# Embed images per batch
embeddings = []
for i in tqdm(range(nr_iterations), disable=not verbose):
start_index = i * self.batch_size
end_index = (i * self.batch_size) + self.batch_size
images_to_embed = [
Image.open(image) if isinstance(image, str) else image for image in images[start_index:end_index]
]
if self.image_model is not None:
img_emb = self.image_model.encode(images_to_embed)
else:
img_emb = self.embedding_model.encode(images_to_embed, show_progress_bar=False)
embeddings.extend(img_emb.tolist())
# Close images
if isinstance(images[0], str):
for image in images_to_embed:
image.close()
embeddings = np.array(embeddings)
else:
images_to_embed = [Image.open(filepath) for filepath in images]
if self.image_model is not None:
embeddings = self.image_model.encode(images_to_embed)
else:
embeddings = self.embedding_model.encode(images_to_embed, show_progress_bar=False)
return embeddings
def _truncate_document(self, document):
if self.tokenizer:
tokens = self.tokenizer.encode(document)
if len(tokens) > 77:
# Skip the starting token, only include 75 tokens
truncated_tokens = tokens[1:76]
document = self.tokenizer.decode(truncated_tokens)
# Recursive call here, because the encode(decode()) can have different result
return self._truncate_document(document)
return document
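The recursive truncation above can be sketched in isolation. This is a minimal illustration, not the library's implementation: `ToyTokenizer` and `truncate` are hypothetical names, and the whitespace tokenizer only stands in for a CLIP-style tokenizer that adds a start and an end token.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer that wraps the tokens in start/end markers."""

    def encode(self, text):
        return ["<s>"] + text.split() + ["</s>"]

    def decode(self, tokens):
        return " ".join(tokens)


def truncate(document, tokenizer, max_tokens=77):
    """Shorten `document` until its encoding fits in `max_tokens` tokens."""
    tokens = tokenizer.encode(document)
    if len(tokens) > max_tokens:
        # Drop the start token and keep max_tokens - 2 content tokens,
        # leaving room for the start and end markers added on re-encoding.
        kept = tokens[1 : max_tokens - 1]
        # Re-check, because encode(decode(...)) is not guaranteed to round-trip
        # with a real subword tokenizer.
        return truncate(tokenizer.decode(kept), tokenizer, max_tokens)
    return document


long_doc = " ".join(["word"] * 200)
short_doc = truncate(long_doc, ToyTokenizer())
print(len(short_doc.split()))
```

With a real subword tokenizer the decoded text may re-encode to a different token count, which is why the check is repeated rather than assumed to hold after one pass.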
@@ -1,88 +0,0 @@
import time
import openai
import numpy as np
from tqdm import tqdm
from typing import List, Mapping, Any
from bertopic.backend import BaseEmbedder
class OpenAIBackend(BaseEmbedder):
"""OpenAI Embedding Model.
Arguments:
client: A `openai.OpenAI` client.
embedding_model: An OpenAI model. Default is "text-embedding-ada-002".
For an overview of models see:
https://platform.openai.com/docs/models/embeddings
delay_in_seconds: If a `batch_size` is given, use this to set
the delay in seconds between batches.
batch_size: The size of each batch.
generator_kwargs: Kwargs passed to `client.embeddings.create`.
Can be used to define custom engines or
deployment_ids.
Examples:
```python
import openai
from bertopic.backend import OpenAIBackend
client = openai.OpenAI(api_key="sk-...")
openai_embedder = OpenAIBackend(client, "text-embedding-ada-002")
```
"""
def __init__(
self,
client: openai.OpenAI,
embedding_model: str = "text-embedding-ada-002",
delay_in_seconds: float = None,
batch_size: int = None,
generator_kwargs: Mapping[str, Any] = {},
):
super().__init__()
self.client = client
self.embedding_model = embedding_model
self.delay_in_seconds = delay_in_seconds
self.batch_size = batch_size
# Copy the mapping so the shared default argument is never mutated below
self.generator_kwargs = dict(generator_kwargs)
if self.generator_kwargs.get("model"):
self.embedding_model = generator_kwargs.get("model")
elif not self.generator_kwargs.get("engine"):
self.generator_kwargs["model"] = self.embedding_model
def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
"""Embed a list of n documents/words into an n-dimensional
matrix of embeddings.
Arguments:
documents: A list of documents or words to be embedded
verbose: Controls the verbosity of the process
Returns:
Document/words embeddings with shape (n, m) with `n` documents/words
that each have an embeddings size of `m`
"""
# Prepare documents, replacing empty strings with a single space
prepared_documents = [" " if doc == "" else doc for doc in documents]
# Batch-wise embedding extraction
if self.batch_size is not None:
embeddings = []
for batch in tqdm(self._chunks(prepared_documents), disable=not verbose):
response = self.client.embeddings.create(input=batch, **self.generator_kwargs)
embeddings.extend([r.embedding for r in response.data])
# Delay subsequent calls
if self.delay_in_seconds:
time.sleep(self.delay_in_seconds)
# Extract embeddings all at once
else:
response = self.client.embeddings.create(input=prepared_documents, **self.generator_kwargs)
embeddings = [r.embedding for r in response.data]
return np.array(embeddings)
def _chunks(self, documents):
for i in range(0, len(documents), self.batch_size):
yield documents[i : i + self.batch_size]
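The batching logic above can be exercised without calling any API. The sketch below mirrors the empty-string replacement and the slicing generator; the free functions `prepare` and `chunks` are illustrative names, not part of the backend.

```python
from typing import Iterator, List


def prepare(documents: List[str]) -> List[str]:
    """Replace empty strings with a single space before embedding."""
    return [" " if doc == "" else doc for doc in documents]


def chunks(documents: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` documents."""
    for start in range(0, len(documents), batch_size):
        yield documents[start : start + batch_size]


batches = list(chunks(prepare(["a", "", "c", "d", "e"]), batch_size=2))
print(batches)  # [['a', ' '], ['c', 'd'], ['e']]
```

The last batch is simply shorter when the number of documents is not a multiple of the batch size, so no padding is needed.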
@@ -1,85 +0,0 @@
import numpy as np
from typing import List, Union
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding
from bertopic.backend import BaseEmbedder
class SentenceTransformerBackend(BaseEmbedder):
"""Sentence-transformers embedding model.
The sentence-transformers embedding model used for generating document and
word embeddings.
Arguments:
embedding_model: A sentence-transformers embedding model
model2vec: Indicates whether `embedding_model` is a model2vec model.
NOTE: Only works if `embedding_model` is a string.
Otherwise, you can pass the model2vec model directly to `embedding_model`.
Examples:
To create a model, you can load in a string pointing to a
sentence-transformers model:
```python
from bertopic.backend import SentenceTransformerBackend
sentence_model = SentenceTransformerBackend("all-MiniLM-L6-v2")
```
or you can instantiate a model yourself:
```python
from bertopic.backend import SentenceTransformerBackend
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
sentence_model = SentenceTransformerBackend(embedding_model)
```
If you want to use a model2vec model without having to install model2vec,
you can pass the model2vec model as a string:
```python
from bertopic.backend import SentenceTransformerBackend
sentence_model = SentenceTransformerBackend("minishlab/potion-base-8M", model2vec=True)
```
"""
def __init__(self, embedding_model: Union[str, SentenceTransformer], model2vec: bool = False):
super().__init__()
self._hf_model = None
if model2vec and isinstance(embedding_model, str):
static_embedding = StaticEmbedding.from_model2vec(embedding_model)
self.embedding_model = SentenceTransformer(modules=[static_embedding])
elif isinstance(embedding_model, SentenceTransformer):
self.embedding_model = embedding_model
elif isinstance(embedding_model, str):
self.embedding_model = SentenceTransformer(embedding_model)
self._hf_model = embedding_model
else:
raise ValueError(
"Please select a correct SentenceTransformers model: \n"
"`from sentence_transformers import SentenceTransformer` \n"
"`model = SentenceTransformer('all-MiniLM-L6-v2')`"
)
def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
"""Embed a list of n documents/words into an n-dimensional
matrix of embeddings.
Arguments:
documents: A list of documents or words to be embedded
verbose: Controls the verbosity of the process
Returns:
Document/words embeddings with shape (n, m) with `n` documents/words
that each have an embeddings size of `m`
"""
embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
return embeddings
@@ -1,68 +0,0 @@
from bertopic.backend import BaseEmbedder
from sklearn.utils.validation import check_is_fitted, NotFittedError
class SklearnEmbedder(BaseEmbedder):
"""Scikit-Learn based embedding model.
This component allows the usage of scikit-learn pipelines for generating document and
word embeddings.
Arguments:
pipe: A scikit-learn pipeline that can `.transform()` text.
Examples:
Scikit-Learn is very flexible and it allows for many representations.
A relatively simple pipeline is shown below.
```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from bertopic import BERTopic
from bertopic.backend import SklearnEmbedder
pipe = make_pipeline(
TfidfVectorizer(),
TruncatedSVD(100)
)
sklearn_embedder = SklearnEmbedder(pipe)
topic_model = BERTopic(embedding_model=sklearn_embedder)
```
This pipeline first constructs a sparse representation based on TF-IDF and then
makes it dense by applying SVD. Alternatively, you might construct something
more elaborate. As long as you construct a scikit-learn compatible pipeline, you
should be able to pass it to BERTopic.
!!! Warning
One caveat to be aware of is that scikit-learn's base `Pipeline` class does not
support the `.partial_fit()`-API. If you have a pipeline that theoretically should
be able to support online learning then you might want to explore
the [scikit-partial](https://github.com/koaning/scikit-partial) project.
"""
def __init__(self, pipe):
super().__init__()
self.pipe = pipe
def embed(self, documents, verbose=False):
"""Embed a list of n documents/words into an n-dimensional
matrix of embeddings.
Arguments:
documents: A list of documents or words to be embedded
verbose: No-op variable that's kept around to keep the API consistent. If you want to get feedback on training times, you should use the sklearn API.
Returns:
Document/words embeddings with shape (n, m) with `n` documents/words
that each have an embeddings size of `m`
"""
try:
check_is_fitted(self.pipe)
embeddings = self.pipe.transform(documents)
except NotFittedError:
embeddings = self.pipe.fit_transform(documents)
return embeddings
@@ -1,94 +0,0 @@
import numpy as np
from tqdm import tqdm
from typing import List
from bertopic.backend import BaseEmbedder
class SpacyBackend(BaseEmbedder):
"""Spacy embedding model.
The Spacy embedding model used for generating document and
word embeddings.
Arguments:
embedding_model: A spacy embedding model
Examples:
To create a Spacy backend, you need to create an nlp object and
pass it through this backend:
```python
import spacy
from bertopic.backend import SpacyBackend
nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
spacy_model = SpacyBackend(nlp)
```
To load in a transformer model use the following:
```python
import spacy
from thinc.api import set_gpu_allocator, require_gpu
from bertopic.backend import SpacyBackend
nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
set_gpu_allocator("pytorch")
require_gpu(0)
spacy_model = SpacyBackend(nlp)
```
If you run into gpu/memory-issues, please use:
```python
import spacy
from bertopic.backend import SpacyBackend
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
spacy_model = SpacyBackend(nlp)
```
"""
def __init__(self, embedding_model):
super().__init__()
if "spacy" in str(type(embedding_model)):
self.embedding_model = embedding_model
else:
raise ValueError(
"Please select a correct Spacy model by either using a string such as 'en_core_web_md' "
"or create an nlp model using: `nlp = spacy.load('en_core_web_md')`"
)
def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
"""Embed a list of n documents/words into an n-dimensional
matrix of embeddings.
Arguments:
documents: A list of documents or words to be embedded
verbose: Controls the verbosity of the process
Returns:
Document/words embeddings with shape (n, m) with `n` documents/words
that each have an embeddings size of `m`
"""
# Handle empty documents, spaCy models automatically map
# empty strings to the zero vector
empty_document = " "
# Extract embeddings
embeddings = []
for doc in tqdm(documents, position=0, leave=True, disable=not verbose):
embedding = self.embedding_model(doc or empty_document)
if embedding.has_vector:
embedding = embedding.vector
else:
embedding = embedding._.trf_data.tensors[-1][0]
if not isinstance(embedding, np.ndarray) and hasattr(embedding, "get"):
# Convert cupy array to numpy array
embedding = embedding.get()
embeddings.append(embedding)
return np.array(embeddings)
@@ -1,55 +0,0 @@
import numpy as np
from tqdm import tqdm
from typing import List
from bertopic.backend import BaseEmbedder
class USEBackend(BaseEmbedder):
"""Universal Sentence Encoder.
USE encodes text into high-dimensional vectors that
are used for semantic similarity in BERTopic.
Arguments:
embedding_model: A USE embedding model
Examples:
```python
import tensorflow_hub
from bertopic.backend import USEBackend
embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
use_embedder = USEBackend(embedding_model)
```
"""
def __init__(self, embedding_model):
super().__init__()
try:
embedding_model(["test sentence"])
self.embedding_model = embedding_model
except TypeError:
raise ValueError(
"Please select a correct USE model: \n"
"`import tensorflow_hub` \n"
"`embedding_model = tensorflow_hub.load(path_to_model)`"
)
def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
"""Embed a list of n documents/words into an n-dimensional
matrix of embeddings.
Arguments:
documents: A list of documents or words to be embedded
verbose: Controls the verbosity of the process
Returns:
Document/words embeddings with shape (n, m) with `n` documents/words
that each have an embeddings size of `m`
"""
embeddings = np.array(
[self.embedding_model([doc]).cpu().numpy()[0] for doc in tqdm(documents, disable=not verbose)]
)
return embeddings
@@ -1,171 +0,0 @@
from ._base import BaseEmbedder
# Imports for light-weight variant of BERTopic
from bertopic.backend._sklearn import SklearnEmbedder
from bertopic._utils import MyLogger
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline as ScikitPipeline
logger = MyLogger()
logger.configure("WARNING")
languages = [
"arabic",
"bulgarian",
"catalan",
"czech",
"danish",
"german",
"greek",
"english",
"spanish",
"estonian",
"persian",
"finnish",
"french",
"canadian french",
"galician",
"gujarati",
"hebrew",
"hindi",
"croatian",
"hungarian",
"armenian",
"indonesian",
"italian",
"japanese",
"georgian",
"korean",
"kurdish",
"lithuanian",
"latvian",
"macedonian",
"mongolian",
"marathi",
"malay",
"burmese",
"norwegian bokmal",
"dutch",
"polish",
"portuguese",
"brazilian portuguese",
"romanian",
"russian",
"slovak",
"slovenian",
"albanian",
"serbian",
"swedish",
"thai",
"turkish",
"ukrainian",
"urdu",
"vietnamese",
"chinese (simplified)",
"chinese (traditional)",
]
def select_backend(embedding_model, language: str = None, verbose: bool = False) -> BaseEmbedder:
"""Select an embedding model based on language or a specific provided model.
When selecting a language, we choose all-MiniLM-L6-v2 for English and
paraphrase-multilingual-MiniLM-L12-v2 for all other languages, as it supports 100+ languages.
If sentence-transformers is not installed, as in a lightweight installation,
a scikit-learn backend is used by default.
Returns:
model: The selected model backend.
"""
logger.set_level("INFO" if verbose else "WARNING")
# BERTopic language backend
if isinstance(embedding_model, BaseEmbedder):
return embedding_model
# Scikit-learn backend
if isinstance(embedding_model, ScikitPipeline):
return SklearnEmbedder(embedding_model)
# Flair word embeddings
if "flair" in str(type(embedding_model)):
from bertopic.backend._flair import FlairBackend
return FlairBackend(embedding_model)
# Spacy embeddings
if "spacy" in str(type(embedding_model)):
from bertopic.backend._spacy import SpacyBackend
return SpacyBackend(embedding_model)
# Gensim embeddings
if "gensim" in str(type(embedding_model)):
from bertopic.backend._gensim import GensimBackend
return GensimBackend(embedding_model)
# USE embeddings
if "tensorflow" in str(type(embedding_model)) and "saved_model" in str(type(embedding_model)):
from bertopic.backend._use import USEBackend
return USEBackend(embedding_model)
# Sentence Transformer embeddings
if "sentence_transformers" in str(type(embedding_model)) or isinstance(embedding_model, str):
from ._sentencetransformers import SentenceTransformerBackend
return SentenceTransformerBackend(embedding_model)
# Hugging Face embeddings
if "transformers" in str(type(embedding_model)) and "pipeline" in str(type(embedding_model)):
from ._hftransformers import HFTransformerBackend
return HFTransformerBackend(embedding_model)
# Model2Vec embeddings
if "model2vec" in str(type(embedding_model)):
from ._model2vec import Model2VecBackend
return Model2VecBackend(embedding_model)
# FastEmbed word embeddings
if "fastembed" in str(type(embedding_model)):
from bertopic.backend._fastembed import FastEmbedBackend
return FastEmbedBackend(embedding_model)
# Select embedding model based on language
if language:
try:
from ._sentencetransformers import SentenceTransformerBackend
if language.lower() in ["english", "en"]:
return SentenceTransformerBackend("sentence-transformers/all-MiniLM-L6-v2")
elif language.lower() in languages or language == "multilingual":
return SentenceTransformerBackend("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
else:
raise ValueError(
f"{language} is currently not supported. However, you can "
f"create any embeddings yourself and pass it through fit_transform(docs, embeddings)\n"
"Else, please select a language from the following list:\n"
f"{languages}"
)
# A ModuleNotFoundError might be a lightweight installation
except ModuleNotFoundError as e:
if e.name != "sentence_transformers":
# Error occurred in a downstream module, probably not a lightweight install
raise e
# Whole sentence_transformers module is missing, probably a lightweight install
if verbose:
logger.info(
"Automatically selecting lightweight scikit-learn embedding backend as sentence-transformers appears to not be installed."
)
pipe = make_pipeline(TfidfVectorizer(), TruncatedSVD(100))
return SklearnEmbedder(pipe)
from ._sentencetransformers import SentenceTransformerBackend
return SentenceTransformerBackend("sentence-transformers/all-MiniLM-L6-v2")
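Because the dispatch relies on substring checks against the type string, operator precedence matters when two substrings must both be present. The sketch below uses a made-up type string and a hypothetical helper `contains_all` to show how `"a" and "b" in s` differs from testing both substrings.

```python
# Hypothetical type string that contains "saved_model" but not "tensorflow"
type_str = "<class 'my_package.saved_model.Thing'>"

# Parsed as `"tensorflow" and ("saved_model" in type_str)`: the non-empty
# literal is always truthy, so only the second substring is really tested.
loose = bool("tensorflow" and "saved_model" in type_str)


def contains_all(text: str, *needles: str) -> bool:
    """Return True only if every needle occurs in `text`."""
    return all(needle in text for needle in needles)


strict = contains_all(type_str, "tensorflow", "saved_model")
print(loose, strict)  # True False
```

An explicit membership test per substring keeps unrelated objects from being routed to the wrong backend.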
@@ -1,43 +0,0 @@
import numpy as np
from typing import List
from bertopic.backend._base import BaseEmbedder
from bertopic.backend._utils import select_backend
class WordDocEmbedder(BaseEmbedder):
"""Combine a document- and word-level embedder."""
def __init__(self, embedding_model, word_embedding_model):
super().__init__()
self.embedding_model = select_backend(embedding_model)
self.word_embedding_model = select_backend(word_embedding_model)
def embed_words(self, words: List[str], verbose: bool = False) -> np.ndarray:
"""Embed a list of n words into an n-dimensional
matrix of embeddings.
Arguments:
words: A list of words to be embedded
verbose: Controls the verbosity of the process
Returns:
Word embeddings with shape (n, m) with `n` words
that each have an embeddings size of `m`
"""
return self.word_embedding_model.embed(words, verbose)
def embed_documents(self, document: List[str], verbose: bool = False) -> np.ndarray:
"""Embed a list of n documents into an n-dimensional
matrix of embeddings.
Arguments:
document: A list of documents to be embedded
verbose: Controls the verbosity of the process
Returns:
Document embeddings with shape (n, m) with `n` documents
that each have an embeddings size of `m`
"""
return self.embedding_model.embed(document, verbose)
@@ -1,5 +0,0 @@
from ._base import BaseCluster
__all__ = [
"BaseCluster",
]
@@ -1,41 +0,0 @@
import numpy as np
class BaseCluster:
"""The Base Cluster class.
Using this class directly in BERTopic will make it skip
over the cluster step. As a result, topics need to be passed
to BERTopic in the form of its `y` parameter in order to create
topic representations.
Examples:
This will skip over the cluster step in BERTopic:
```python
from bertopic import BERTopic
from bertopic.cluster import BaseCluster
empty_cluster_model = BaseCluster()
topic_model = BERTopic(hdbscan_model=empty_cluster_model)
```
Then, this class can be used to perform manual topic modeling.
That is, topic modeling on topics that were already generated before
without the need to learn them:
```python
topic_model.fit(docs, y=y)
```
"""
def fit(self, X, y=None):
if y is not None:
self.labels_ = y
else:
self.labels_ = None
return self
def transform(self, X: np.ndarray) -> np.ndarray:
return X
@@ -1,81 +0,0 @@
import numpy as np
def hdbscan_delegator(model, func: str, embeddings: np.ndarray = None):
"""Function used to select the HDBSCAN-like model for generating
predictions and probabilities.
Arguments:
model: The cluster model.
func: The function to use. Options:
- "approximate_predict"
- "all_points_membership_vectors"
- "membership_vector"
embeddings: Input embeddings for "approximate_predict"
and "membership_vector"
"""
try:
import hdbscan
except (ImportError, ModuleNotFoundError):
# An empty tuple makes isinstance() return False; a None placeholder would raise TypeError
hdbscan = type("hdbscan", (), {"HDBSCAN": ()})()
# Approximate predict
if func == "approximate_predict":
if isinstance(model, hdbscan.HDBSCAN):
predictions, probabilities = hdbscan.approximate_predict(model, embeddings)
return predictions, probabilities
str_type_model = str(type(model)).lower()
if "cuml" in str_type_model and "hdbscan" in str_type_model:
from cuml.cluster import hdbscan as cuml_hdbscan
predictions, probabilities = cuml_hdbscan.approximate_predict(model, embeddings)
return predictions, probabilities
predictions = model.predict(embeddings)
return predictions, None
# All points membership
if func == "all_points_membership_vectors":
if isinstance(model, hdbscan.HDBSCAN):
return hdbscan.all_points_membership_vectors(model)
str_type_model = str(type(model)).lower()
if "cuml" in str_type_model and "hdbscan" in str_type_model:
from cuml.cluster import hdbscan as cuml_hdbscan
return cuml_hdbscan.all_points_membership_vectors(model)
return None
# membership_vector
if func == "membership_vector":
if isinstance(model, hdbscan.HDBSCAN):
probabilities = hdbscan.membership_vector(model, embeddings)
return probabilities
str_type_model = str(type(model)).lower()
if "cuml" in str_type_model and "hdbscan" in str_type_model:
from cuml.cluster import hdbscan as cuml_hdbscan
probabilities = cuml_hdbscan.membership_vector(model, embeddings)
return probabilities
return None
def is_supported_hdbscan(model):
"""Check whether the input model is a supported HDBSCAN-like model."""
try:
import hdbscan
except (ImportError, ModuleNotFoundError):
# An empty tuple makes isinstance() return False; a None placeholder would raise TypeError
hdbscan = type("hdbscan", (), {"HDBSCAN": ()})()
if isinstance(model, hdbscan.HDBSCAN):
return True
str_type_model = str(type(model)).lower()
if "cuml" in str_type_model and "hdbscan" in str_type_model:
return True
return False
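The optional-import fallback used in these helpers can be sketched on its own. The function `optional_class` and the module name below are hypothetical; the point is only that the placeholder has to be something `isinstance()` accepts, since `isinstance(obj, None)` raises `TypeError` while `isinstance(obj, ())` is always `False`.

```python
import importlib


def optional_class(module_name: str, class_name: str):
    """Return the named class if the module imports, else an empty tuple.

    The empty tuple acts as a "matches nothing" value for isinstance().
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return ()
    return getattr(module, class_name, ())


# Assumed not to be installed; any missing module name behaves the same way
MissingCls = optional_class("surely_not_an_installed_module_xyz", "HDBSCAN")
print(isinstance(object(), MissingCls))  # False

# A module that does exist yields the real class
OrderedDictCls = optional_class("collections", "OrderedDict")
print(isinstance(OrderedDictCls(), OrderedDictCls))  # True
```

With this shape the later string-based checks for other HDBSCAN-like models still run when the optional dependency is absent.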
@@ -1,5 +0,0 @@
from ._base import BaseDimensionalityReduction
__all__ = [
"BaseDimensionalityReduction",
]
@@ -1,26 +0,0 @@
import numpy as np
class BaseDimensionalityReduction:
"""The Base Dimensionality Reduction class.
You can use this to skip over the dimensionality reduction step in BERTopic.
Examples:
This will skip over the reduction step in BERTopic:
```python
from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction
empty_reduction_model = BaseDimensionalityReduction()
topic_model = BERTopic(umap_model=empty_reduction_model)
```
"""
def fit(self, X: np.ndarray = None):
return self
def transform(self, X: np.ndarray) -> np.ndarray:
return X
@@ -1,28 +0,0 @@
from ._topics import visualize_topics
from ._heatmap import visualize_heatmap
from ._barchart import visualize_barchart
from ._documents import visualize_documents
from ._term_rank import visualize_term_rank
from ._hierarchy import visualize_hierarchy
from ._datamap import visualize_document_datamap
from ._distribution import visualize_distribution
from ._topics_over_time import visualize_topics_over_time
from ._topics_per_class import visualize_topics_per_class
from ._hierarchical_documents import visualize_hierarchical_documents
from ._approximate_distribution import visualize_approximate_distribution
__all__ = [
"visualize_topics",
"visualize_heatmap",
"visualize_barchart",
"visualize_documents",
"visualize_term_rank",
"visualize_hierarchy",
"visualize_distribution",
"visualize_document_datamap",
"visualize_topics_over_time",
"visualize_topics_per_class",
"visualize_hierarchical_documents",
"visualize_approximate_distribution",
]
@@ -1,100 +0,0 @@
import numpy as np
import pandas as pd
try:
from pandas.io.formats.style import Styler # noqa: F401
HAS_JINJA = True
except (ModuleNotFoundError, ImportError):
HAS_JINJA = False
def visualize_approximate_distribution(
topic_model,
document: str,
topic_token_distribution: np.ndarray,
normalize: bool = False,
):
"""Visualize the topic distribution calculated by `.approximate_distribution`
on a token level, indicating the extent to which certain words or phrases belong
to a specific topic. The assumption here is that a single word can belong to multiple
similar topics and as such gives information about the broader set of topics within
a single document.
Note:
This function will return a stylized pandas dataframe if Jinja2 is installed. If not,
it will only return a pandas dataframe without color highlighting. To install jinja:
`pip install jinja2`
Arguments:
topic_model: A fitted BERTopic instance.
document: The document for which you want to visualize
the approximated topic distribution.
topic_token_distribution: The topic-token distribution of the document as
extracted by `.approximate_distribution`
normalize: Whether to normalize, between 0 and 1 (summing to 1), the
topic distribution values.
Returns:
df: A stylized dataframe indicating the best fitting topics
for each token.
Examples:
```python
# Calculate the topic distributions on a token level
# Note that we need to have `calculate_token_level=True`
topic_distr, topic_token_distr = topic_model.approximate_distribution(
docs, calculate_token_level=True
)
# Visualize the approximated topic distributions
df = topic_model.visualize_approximate_distribution(docs[0], topic_token_distr[0])
df
```
To revert this stylized dataframe back to a regular dataframe,
you can run the following:
```python
df.data.columns = [column.strip() for column in df.data.columns]
df = df.data
```
"""
# Tokenize document
analyzer = topic_model.vectorizer_model.build_tokenizer()
tokens = analyzer(document)
if len(tokens) == 0:
raise ValueError("Make sure that your document contains at least 1 token.")
# Prepare dataframe with results
if normalize:
df = pd.DataFrame(topic_token_distribution / topic_token_distribution.sum()).T
else:
df = pd.DataFrame(topic_token_distribution).T
# Pad each token with a distinct number of trailing spaces so duplicate tokens get unique column names
df.columns = [f"{token}{' ' * i}" for i, token in enumerate(tokens)]
df.index = list(topic_model.topic_labels_.values())[topic_model._outliers :]
df = df.loc[(df.sum(axis=1) != 0), :]
# Style the resulting dataframe
def text_color(val):
color = "white" if val == 0 else "black"
return "color: %s" % color
def highlight_color(data, color="white"):
attr = "background-color: {}".format(color)
return pd.DataFrame(np.where(data == 0, attr, ""), index=data.index, columns=data.columns)
if len(df) == 0:
return df
elif HAS_JINJA:
df = (
df.style.format("{:.3f}")
.background_gradient(cmap="Blues", axis=None)
.applymap(lambda x: text_color(x))
.apply(highlight_color, axis=None)
)
return df
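The column naming used for the token table relies on padding each token with a different number of trailing spaces, which keeps repeated tokens distinct while rendering identically. A minimal sketch with a made-up token list:

```python
tokens = ["the", "cat", "the"]  # illustrative tokens; note the repeated "the"

# Token i gets i trailing spaces, so every label is unique
labels = [f"{token}{' ' * i}" for i, token in enumerate(tokens)]

# Stripping the padding recovers the original tokens
restored = [label.strip() for label in labels]

print(labels)    # ['the', 'cat ', 'the  ']
print(restored)  # ['the', 'cat', 'the']
```

This is why the docstring suggests stripping the column names when converting the styled result back to a regular dataframe.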
@@ -1,132 +0,0 @@
import itertools
import numpy as np
from typing import List, Union
import plotly.graph_objects as go
from plotly.subplots import make_subplots
def visualize_barchart(
topic_model,
topics: List[int] = None,
top_n_topics: int = 8,
n_words: int = 5,
custom_labels: Union[bool, str] = False,
title: str = "<b>Topic Word Scores</b>",
width: int = 250,
height: int = 250,
autoscale: bool = False,
) -> go.Figure:
"""Visualize a barchart of selected topics.
Arguments:
topic_model: A fitted BERTopic instance.
topics: A selection of topics to visualize.
top_n_topics: Only select the top n most frequent topics.
n_words: Number of words to show in a topic
custom_labels: If bool, whether to use custom topic labels that were defined using
`topic_model.set_topic_labels`.
If `str`, it uses labels from other aspects, e.g., "Aspect1".
title: Title of the plot.
width: The width of each figure.
height: The height of each figure.
autoscale: Whether to automatically calculate the height of the figures to fit the whole bar text
Returns:
fig: A plotly figure
Examples:
To visualize the barchart of selected topics
simply run:
```python
topic_model.visualize_barchart()
```
Or if you want to save the resulting figure:
```python
fig = topic_model.visualize_barchart()
fig.write_html("path/to/file.html")
```
<iframe src="../../getting_started/visualization/bar_chart.html"
style="width:1100px; height: 660px; border: 0px;"></iframe>
"""
colors = itertools.cycle(["#D55E00", "#0072B2", "#CC79A7", "#E69F00", "#56B4E9", "#009E73", "#F0E442"])
# Select topics based on top_n and topics args
freq_df = topic_model.get_topic_freq()
freq_df = freq_df.loc[freq_df.Topic != -1, :]
if topics is not None:
topics = list(topics)
elif top_n_topics is not None:
topics = sorted(freq_df.Topic.to_list()[:top_n_topics])
else:
topics = sorted(freq_df.Topic.to_list()[0:6])
# Initialize figure
if isinstance(custom_labels, str):
subplot_titles = [[[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in topics]
subplot_titles = ["_".join([label[0] for label in labels[:4]]) for labels in subplot_titles]
subplot_titles = [label if len(label) < 30 else label[:27] + "..." for label in subplot_titles]
elif topic_model.custom_labels_ is not None and custom_labels:
subplot_titles = [topic_model.custom_labels_[topic + topic_model._outliers] for topic in topics]
else:
subplot_titles = [f"Topic {topic}" for topic in topics]
columns = 4
rows = int(np.ceil(len(topics) / columns))
fig = make_subplots(
rows=rows,
cols=columns,
shared_xaxes=False,
horizontal_spacing=0.1,
vertical_spacing=0.4 / rows if rows > 1 else 0,
subplot_titles=subplot_titles,
)
# Add barchart for each topic
row = 1
column = 1
for topic in topics:
words = [word + " " for word, _ in topic_model.get_topic(topic)][:n_words][::-1]
scores = [score for _, score in topic_model.get_topic(topic)][:n_words][::-1]
fig.add_trace(
go.Bar(x=scores, y=words, orientation="h", marker_color=next(colors)),
row=row,
col=column,
)
if autoscale:
if len(words) > 12:
height = 250 + (len(words) - 12) * 11
if len(words) > 9:
fig.update_yaxes(tickfont=dict(size=(height - 140) // len(words)))
if column == columns:
column = 1
row += 1
else:
column += 1
# Stylize graph
fig.update_layout(
template="plotly_white",
showlegend=False,
title={
"text": f"{title}",
"x": 0.5,
"xanchor": "center",
"yanchor": "top",
"font": dict(size=22, color="Black"),
},
width=width * 4,
height=height * rows if rows > 1 else height * 1.3,
hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
)
fig.update_xaxes(showgrid=True)
fig.update_yaxes(showgrid=True)
return fig
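The subplot placement in the bar chart follows a simple row-major grid with four columns. The sketch below reproduces only that arithmetic; `grid_positions` is a hypothetical helper and no plotting library is involved.

```python
import math
from typing import List, Tuple


def grid_positions(n_items: int, columns: int = 4) -> Tuple[int, List[Tuple[int, int]]]:
    """Return the number of rows and the 1-based (row, column) slot of each item."""
    rows = math.ceil(n_items / columns)
    positions = [(i // columns + 1, i % columns + 1) for i in range(n_items)]
    return rows, positions


rows, positions = grid_positions(6)
print(rows)       # 2
print(positions)  # [(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2)]
```

The loop in the function tracks `row` and `column` incrementally instead, which gives the same slots.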
@@ -1,188 +0,0 @@
import numpy as np
import pandas as pd
from typing import List, Union
from warnings import warn
try:
import datamapplot
from matplotlib.figure import Figure
except ImportError:
warn("Data map plotting is unavailable unless datamapplot is installed.")
# Create a dummy figure type for typing
class Figure(object):
pass
def visualize_document_datamap(
topic_model,
docs: List[str] = None,
topics: List[int] = None,
embeddings: np.ndarray = None,
reduced_embeddings: np.ndarray = None,
custom_labels: Union[bool, str] = False,
title: str = "Documents and Topics",
sub_title: Union[str, None] = None,
width: int = 1200,
height: int = 750,
interactive: bool = False,
enable_search: bool = False,
topic_prefix: bool = False,
datamap_kwds: dict = {},
int_datamap_kwds: dict = {},
) -> Figure:
"""Visualize documents and their topics in 2D as a static plot for publication using
DataMapPlot.
Arguments:
topic_model: A fitted BERTopic instance.
docs: The documents you used when calling either `fit` or `fit_transform`.
topics: A selection of topics to visualize.
Not to be confused with the topics that you get from `.fit_transform`.
For example, if you want to visualize only topics 1 through 5:
`topics = [1, 2, 3, 4, 5]`. Documents not in these topics will be shown
as noise points.
embeddings: The embeddings of all documents in `docs`.
reduced_embeddings: The 2D reduced embeddings of all documents in `docs`.
custom_labels: If bool, whether to use custom topic labels that were defined using
`topic_model.set_topic_labels`.
If `str`, it uses labels from other aspects, e.g., "Aspect1".
title: Title of the plot.
sub_title: Sub-title of the plot.
width: The width of the figure.
height: The height of the figure.
interactive: Whether to create an interactive plot using DataMapPlot's `create_interactive_plot`.
enable_search: Whether to enable search in the interactive plot. Only works if `interactive=True`.
topic_prefix: Whether to prefix the displayed topic name with its topic number, e.g., `Topic-1: ...`.
datamap_kwds: Keyword args to be passed on to DataMapPlot's `create_plot` function
if you are not using the interactive version.
See the DataMapPlot documentation for more details.
int_datamap_kwds: Keyword args to be passed on to DataMapPlot's `create_interactive_plot` function
if you are using the interactive version.
See the DataMapPlot documentation for more details.
Returns:
figure: A Matplotlib Figure object, or the object returned by DataMapPlot's
`create_interactive_plot` if `interactive=True`.
Examples:
To visualize the topics simply run:
```python
topic_model.visualize_document_datamap(docs)
```
Do note that this re-calculates the embeddings and reduces them to 2D.
The advised and preferred pipeline for using this function is as follows:
```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP
# Prepare embeddings
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)
# Train BERTopic
topic_model = BERTopic().fit(docs, embeddings)
# Reduce dimensionality of embeddings, this step is optional
# reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
# Run the visualization with the original embeddings
topic_model.visualize_document_datamap(docs, embeddings=embeddings)
# Or, if you have reduced the original embeddings already:
topic_model.visualize_document_datamap(docs, reduced_embeddings=reduced_embeddings)
```
Or if you want to save the resulting figure:
```python
fig = topic_model.visualize_document_datamap(docs, reduced_embeddings=reduced_embeddings)
fig.savefig("path/to/file.png", bbox_inches="tight")
```
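The effect of `topics` on the displayed labels can be sketched in plain Python.
This is only an illustration: the topic names below are made-up sample data, not
the output of a fitted model. Every topic outside the selection, as well as the
outlier topic -1, ends up labelled as "Unlabelled":
```python
# Hypothetical topic names; a fitted model would derive these from its top words
topic_name_mapping = {-1: "outliers", 0: "space nasa orbit", 1: "car engine dealer", 2: "god jesus bible"}
topic_name_mapping[-1] = "Unlabelled"

# Keep names only for the selected topics
selected_topics = {0, 2}
for topic_num in topic_name_mapping:
    if topic_num not in selected_topics:
        topic_name_mapping[topic_num] = "Unlabelled"

# Map per-document topic assignments to their display names
topic_per_doc = [0, 1, -1, 2, 0]
named_topic_per_doc = [topic_name_mapping[topic] for topic in topic_per_doc]
print(named_topic_per_doc)
```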
<img src="../../getting_started/visualization/datamapplot.png"
alt="DataMapPlot of 20-Newsgroups" width=800 height=800>
"""
topic_per_doc = topic_model.topics_
df = pd.DataFrame({"topic": np.array(topic_per_doc)})
df["doc"] = docs
df["topic"] = topic_per_doc
# Extract embeddings if not already done
if embeddings is None and reduced_embeddings is None:
embeddings_to_reduce = topic_model._extract_embeddings(df.doc.to_list(), method="document")
else:
embeddings_to_reduce = embeddings
# Reduce input embeddings
if reduced_embeddings is None:
try:
from umap import UMAP
umap_model = UMAP(n_neighbors=15, n_components=2, min_dist=0.15, metric="cosine").fit(embeddings_to_reduce)
embeddings_2d = umap_model.embedding_
except (ImportError, ModuleNotFoundError):
raise ModuleNotFoundError(
"UMAP is required if the embeddings are not yet reduced in dimensionality. Please install it using `pip install umap-learn`."
)
else:
embeddings_2d = reduced_embeddings
unique_topics = set(topic_per_doc)
# Prepare text and names
if isinstance(custom_labels, str):
names = [[[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in unique_topics]
names = [" ".join([label[0] for label in labels[:4]]) for labels in names]
names = [label if len(label) < 30 else label[:27] + "..." for label in names]
elif topic_model.custom_labels_ is not None and custom_labels:
names = [topic_model.custom_labels_[topic + topic_model._outliers] for topic in unique_topics]
else:
if topic_prefix:
names = [
f"Topic-{topic}: " + " ".join([word for word, value in topic_model.get_topic(topic)][:3])
for topic in unique_topics
]
else:
names = [" ".join([word for word, value in topic_model.get_topic(topic)][:3]) for topic in unique_topics]
topic_name_mapping = {topic_num: topic_name for topic_num, topic_name in zip(unique_topics, names)}
topic_name_mapping[-1] = "Unlabelled"
# If a set of topics is chosen, set everything else to "Unlabelled"
if topics is not None:
selected_topics = set(topics)
for topic_num in topic_name_mapping:
if topic_num not in selected_topics:
topic_name_mapping[topic_num] = "Unlabelled"
# Map in topic names and plot
named_topic_per_doc = pd.Series(topic_per_doc).map(topic_name_mapping).values
if interactive:
figure = datamapplot.create_interactive_plot(
embeddings_2d,
named_topic_per_doc,
hover_text=docs,
enable_search=enable_search,
width=width,
height=height,
**int_datamap_kwds,
)
else:
figure, _ = datamapplot.create_plot(
embeddings_2d,
named_topic_per_doc,
figsize=(width / 100, height / 100),
dpi=100,
title=title,
sub_title=sub_title,
**datamap_kwds,
)
return figure
@@ -1,109 +0,0 @@
import numpy as np
from typing import Union
import plotly.graph_objects as go
def visualize_distribution(
topic_model,
probabilities: np.ndarray,
min_probability: float = 0.015,
custom_labels: Union[bool, str] = False,
title: str = "<b>Topic Probability Distribution</b>",
width: int = 800,
height: int = 600,
) -> go.Figure:
"""Visualize the distribution of topic probabilities.
Arguments:
topic_model: A fitted BERTopic instance.
probabilities: A 1D array of topic probability scores for a single document.
min_probability: The minimum probability score to visualize.
All others are ignored.
custom_labels: If bool, whether to use custom topic labels that were defined using
`topic_model.set_topic_labels`.
If `str`, it uses labels from other aspects, e.g., "Aspect1".
title: Title of the plot.
width: The width of the figure.
height: The height of the figure.
Examples:
Make sure to fit the model before and only input the
probabilities of a single document:
```python
topic_model.visualize_distribution(probabilities[0])
```
Or if you want to save the resulting figure:
```python
fig = topic_model.visualize_distribution(probabilities[0])
fig.write_html("path/to/file.html")
```
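How `min_probability` filters the bars can be sketched in plain Python. The
probabilities below are made-up values for a single document, used only for
illustration; topics whose probability is below the threshold are not drawn:
```python
# Made-up probabilities for a single document, one value per topic
probabilities = [0.01, 0.40, 0.015, 0.005, 0.57]
min_probability = 0.015

# Keep the topic indices whose probability is at least `min_probability`
labels_idx = [idx for idx, prob in enumerate(probabilities) if prob >= min_probability]
vals = [probabilities[idx] for idx in labels_idx]
print(labels_idx)  # [1, 2, 4]
print(vals)  # [0.4, 0.015, 0.57]
```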
<iframe src="../../getting_started/visualization/probabilities.html"
style="width:1000px; height: 500px; border: 0px;"></iframe>
"""
if len(probabilities.shape) != 1:
raise ValueError(
"This visualization cannot be used if you have set `calculate_probabilities` to False "
"as it uses the topic probabilities of all topics. "
)
if len(probabilities[probabilities >= min_probability]) == 0:
raise ValueError(
"None of the supplied probabilities are equal to or higher than `min_probability`. "
"Lower `min_probability` to prevent this error."
)
# Get the values and indices that equal or exceed the minimum probability
labels_idx = np.argwhere(probabilities >= min_probability).flatten()
vals = probabilities[labels_idx].tolist()
# Create labels
if isinstance(custom_labels, str):
labels = [[[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in labels_idx]
labels = ["_".join([label[0] for label in l[:4]]) for l in labels] # noqa: E741
labels = [label if len(label) < 30 else label[:27] + "..." for label in labels]
elif topic_model.custom_labels_ is not None and custom_labels:
labels = [topic_model.custom_labels_[idx + topic_model._outliers] for idx in labels_idx]
else:
labels = []
# Rebuild the values alongside the labels so that both stay aligned,
# even when several topics share the same probability
vals = []
for idx in labels_idx:
words = topic_model.get_topic(idx)
if words:
label = [word[0] for word in words[:5]]
label = f"<b>Topic {idx}</b>: {'_'.join(label)}"
label = label[:40] + "..." if len(label) > 40 else label
labels.append(label)
vals.append(float(probabilities[idx]))
# Create Figure
fig = go.Figure(
go.Bar(
x=vals,
y=labels,
marker=dict(
color="#C8D2D7",
line=dict(color="#6E8484", width=1),
),
orientation="h",
)
)
fig.update_layout(
xaxis_title="Probability",
title={
"text": f"{title}",
"y": 0.95,
"x": 0.5,
"xanchor": "center",
"yanchor": "top",
"font": dict(size=22, color="Black"),
},
template="simple_white",
width=width,
height=height,
hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
)
return fig
@@ -1,263 +0,0 @@
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from typing import List, Union
def visualize_documents(
topic_model,
docs: List[str],
topics: List[int] = None,
embeddings: np.ndarray = None,
reduced_embeddings: np.ndarray = None,
sample: float = None,
hide_annotations: bool = False,
hide_document_hover: bool = False,
custom_labels: Union[bool, str] = False,
title: str = "<b>Documents and Topics</b>",
width: int = 1200,
height: int = 750,
):
"""Visualize documents and their topics in 2D.
Arguments:
topic_model: A fitted BERTopic instance.
docs: The documents you used when calling either `fit` or `fit_transform`
topics: A selection of topics to visualize.
Not to be confused with the topics that you get from `.fit_transform`.
For example, if you want to visualize only topics 1 through 5:
`topics = [1, 2, 3, 4, 5]`.
embeddings: The embeddings of all documents in `docs`.
reduced_embeddings: The 2D reduced embeddings of all documents in `docs`.
sample: The fraction of documents in each topic that you would like to keep.
Value can be between 0 and 1. Setting this value to, for example,
0.1 (10% of documents in each topic) makes it easier to visualize
millions of documents as a subset is chosen. Topics with fewer than
100 documents are always kept in full.
hide_annotations: Hide the names of the traces on top of each cluster.
hide_document_hover: Hide the content of the documents when hovering over
specific points. Helps to speed up generation of visualization.
custom_labels: If bool, whether to use custom topic labels that were defined using
`topic_model.set_topic_labels`.
If `str`, it uses labels from other aspects, e.g., "Aspect1".
title: Title of the plot.
width: The width of the figure.
height: The height of the figure.
Examples:
To visualize the topics simply run:
```python
topic_model.visualize_documents(docs)
```
Do note that this re-calculates the embeddings and reduces them to 2D.
The advised and preferred pipeline for using this function is as follows:
```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP
# Prepare embeddings
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)
# Train BERTopic
topic_model = BERTopic().fit(docs, embeddings)
# Reduce dimensionality of embeddings, this step is optional
# reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
# Run the visualization with the original embeddings
topic_model.visualize_documents(docs, embeddings=embeddings)
# Or, if you have reduced the original embeddings already:
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)
```
Or if you want to save the resulting figure:
```python
fig = topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)
fig.write_html("path/to/file.html")
```
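The per-topic sampling controlled by `sample` can be sketched as follows. This
is a simplified illustration using only the standard library; `sample_indices`
and its `seed` argument are hypothetical names introduced for this sketch and
are not part of BERTopic:
```python
import random

def sample_indices(topic_per_doc, sample, seed=0):
    # Hypothetical helper for illustration; `seed` only makes the sketch reproducible
    rng = random.Random(seed)
    indices = []
    for topic in sorted(set(topic_per_doc)):
        docs_in_topic = [i for i, t in enumerate(topic_per_doc) if t == topic]
        # Topics with fewer than 100 documents are kept in full
        size = len(docs_in_topic) if len(docs_in_topic) < 100 else int(len(docs_in_topic) * sample)
        indices.extend(rng.sample(docs_in_topic, size))
    return indices

# 200 documents in topic 0 and 50 documents in topic 1
topic_per_doc = [0] * 200 + [1] * 50
indices = sample_indices(topic_per_doc, sample=0.5)
print(len(indices))  # 150: 100 from topic 0 and all 50 from topic 1
```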
<iframe src="../../getting_started/visualization/documents.html"
style="width:1000px; height: 800px; border: 0px;"></iframe>
"""
topic_per_doc = topic_model.topics_
# Sample the data to optimize for visualization and dimensionality reduction
if sample is None or sample > 1:
sample = 1
indices = []
for topic in set(topic_per_doc):
s = np.where(np.array(topic_per_doc) == topic)[0]
size = len(s) if len(s) < 100 else int(len(s) * sample)
indices.extend(np.random.choice(s, size=size, replace=False))
indices = np.array(indices)
df = pd.DataFrame({"topic": np.array(topic_per_doc)[indices]})
df["doc"] = [docs[index] for index in indices]
df["topic"] = [topic_per_doc[index] for index in indices]
# Extract embeddings if not already done
if sample is None:
if embeddings is None and reduced_embeddings is None:
embeddings_to_reduce = topic_model._extract_embeddings(df.doc.to_list(), method="document")
else:
embeddings_to_reduce = embeddings
else:
if embeddings is not None:
embeddings_to_reduce = embeddings[indices]
elif embeddings is None and reduced_embeddings is None:
embeddings_to_reduce = topic_model._extract_embeddings(df.doc.to_list(), method="document")
# Reduce input embeddings
if reduced_embeddings is None:
try:
from umap import UMAP
umap_model = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric="cosine").fit(embeddings_to_reduce)
embeddings_2d = umap_model.embedding_
except (ImportError, ModuleNotFoundError):
raise ModuleNotFoundError(
"UMAP is required if the embeddings are not yet reduced in dimensionality. Please install it using `pip install umap-learn`."
)
elif sample is not None and reduced_embeddings is not None:
embeddings_2d = reduced_embeddings[indices]
elif sample is None and reduced_embeddings is not None:
embeddings_2d = reduced_embeddings
unique_topics = set(topic_per_doc)
if topics is None:
topics = unique_topics
# Combine data
df["x"] = embeddings_2d[:, 0]
df["y"] = embeddings_2d[:, 1]
# Prepare text and names
if isinstance(custom_labels, str):
names = [[[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in unique_topics]
names = ["_".join([label[0] for label in labels[:4]]) for labels in names]
names = [label if len(label) < 30 else label[:27] + "..." for label in names]
elif topic_model.custom_labels_ is not None and custom_labels:
names = [topic_model.custom_labels_[topic + topic_model._outliers] for topic in unique_topics]
else:
names = [
f"{topic}_" + "_".join([word for word, value in topic_model.get_topic(topic)][:3])
for topic in unique_topics
]
# Visualize
fig = go.Figure()
# Outliers and non-selected topics
non_selected_topics = set(unique_topics).difference(topics)
if len(non_selected_topics) == 0:
non_selected_topics = [-1]
selection = df.loc[df.topic.isin(non_selected_topics), :]
selection["text"] = ""
selection.loc[len(selection), :] = [
None,
None,
selection.x.mean(),
selection.y.mean(),
"Other documents",
]
fig.add_trace(
go.Scattergl(
x=selection.x,
y=selection.y,
hovertext=selection.doc if not hide_document_hover else None,
hoverinfo="text",
mode="markers+text",
name="other",
showlegend=False,
marker=dict(color="#CFD8DC", size=5, opacity=0.5),
)
)
# Selected topics
for name, topic in zip(names, unique_topics):
if topic in topics and topic != -1:
selection = df.loc[df.topic == topic, :]
selection["text"] = ""
if not hide_annotations:
selection.loc[len(selection), :] = [
None,
None,
selection.x.mean(),
selection.y.mean(),
name,
]
fig.add_trace(
go.Scattergl(
x=selection.x,
y=selection.y,
hovertext=selection.doc if not hide_document_hover else None,
hoverinfo="text",
text=selection.text,
mode="markers+text",
name=name,
textfont=dict(
size=12,
),
marker=dict(size=5, opacity=0.5),
)
)
# Add grid in a 'plus' shape
x_range = (
df.x.min() - abs((df.x.min()) * 0.15),
df.x.max() + abs((df.x.max()) * 0.15),
)
y_range = (
df.y.min() - abs((df.y.min()) * 0.15),
df.y.max() + abs((df.y.max()) * 0.15),
)
fig.add_shape(
type="line",
x0=sum(x_range) / 2,
y0=y_range[0],
x1=sum(x_range) / 2,
y1=y_range[1],
line=dict(color="#CFD8DC", width=2),
)
fig.add_shape(
type="line",
x0=x_range[0],
y0=sum(y_range) / 2,
x1=x_range[1],
y1=sum(y_range) / 2,
line=dict(color="#9E9E9E", width=2),
)
fig.add_annotation(x=x_range[0], y=sum(y_range) / 2, text="D1", showarrow=False, yshift=10)
fig.add_annotation(y=y_range[1], x=sum(x_range) / 2, text="D2", showarrow=False, xshift=10)
# Stylize layout
fig.update_layout(
template="simple_white",
title={
"text": f"{title}",
"x": 0.5,
"xanchor": "center",
"yanchor": "top",
"font": dict(size=22, color="Black"),
},
width=width,
height=height,
)
fig.update_xaxes(visible=False)
fig.update_yaxes(visible=False)
return fig
@@ -1,136 +0,0 @@
import numpy as np
from typing import List, Union
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics.pairwise import cosine_similarity
from bertopic._utils import select_topic_representation
import plotly.express as px
import plotly.graph_objects as go
def visualize_heatmap(
topic_model,
topics: List[int] = None,
top_n_topics: int = None,
n_clusters: int = None,
use_ctfidf: bool = False,
custom_labels: Union[bool, str] = False,
title: str = "<b>Similarity Matrix</b>",
width: int = 800,
height: int = 800,
) -> go.Figure:
"""Visualize a heatmap of the topics' similarity matrix.
Based on the cosine similarity matrix between topic embeddings (either c-TF-IDF or the embeddings from the embedding
model), a heatmap is created showing the similarity between topics.
Arguments:
topic_model: A fitted BERTopic instance.
topics: A selection of topics to visualize.
top_n_topics: Only select the top n most frequent topics.
n_clusters: Create n clusters and order the similarity
matrix by those clusters.
use_ctfidf: Whether to calculate distances between topics based on c-TF-IDF embeddings. If False, the embeddings
from the embedding model are used.
custom_labels: If bool, whether to use custom topic labels that were defined using
`topic_model.set_topic_labels`.
If `str`, it uses labels from other aspects, e.g., "Aspect1".
title: Title of the plot.
width: The width of the figure.
height: The height of the figure.
Returns:
fig: A plotly figure
Examples:
To visualize the similarity matrix of
topics simply run:
```python
topic_model.visualize_heatmap()
```
Or if you want to save the resulting figure:
```python
fig = topic_model.visualize_heatmap()
fig.write_html("path/to/file.html")
```
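The values shown in the heatmap are pairwise cosine similarities between topic
embeddings. A minimal sketch of that computation in plain Python, assuming
made-up 2D embeddings (`cosine_similarity_matrix` is a hypothetical stand-in for
scikit-learn's `cosine_similarity`, written out only for illustration):
```python
import math

def cosine_similarity_matrix(vectors):
    # Hypothetical plain-Python stand-in for sklearn's `cosine_similarity`
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv) for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]

# Made-up 2D topic embeddings
topic_embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
similarity = cosine_similarity_matrix(topic_embeddings)
print([[round(value, 3) for value in row] for row in similarity])
```
Orthogonal embeddings score 0, while embeddings pointing in the same direction
score 1 regardless of their length.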
<iframe src="../../getting_started/visualization/heatmap.html"
style="width:1000px; height: 720px; border: 0px;"></iframe>
"""
embeddings = select_topic_representation(topic_model.c_tf_idf_, topic_model.topic_embeddings_, use_ctfidf)[0][
topic_model._outliers :
]
# Select topics based on top_n and topics args
freq_df = topic_model.get_topic_freq()
freq_df = freq_df.loc[freq_df.Topic != -1, :]
if topics is not None:
topics = list(topics)
elif top_n_topics is not None:
topics = sorted(freq_df.Topic.to_list()[:top_n_topics])
else:
topics = sorted(freq_df.Topic.to_list())
# Order heatmap by similar clusters of topics
sorted_topics = topics
if n_clusters:
if n_clusters >= len(set(topics)):
raise ValueError("Make sure to set `n_clusters` lower than the total number of unique topics.")
distance_matrix = cosine_similarity(embeddings[topics])
Z = linkage(distance_matrix, "ward")
clusters = fcluster(Z, t=n_clusters, criterion="maxclust")
# Extract new order of topics
mapping = {cluster: [] for cluster in clusters}
for topic, cluster in zip(topics, clusters):
mapping[cluster].append(topic)
mapping = [cluster for cluster in mapping.values()]
sorted_topics = [topic for cluster in mapping for topic in cluster]
# Select embeddings
indices = np.array([topics.index(topic) for topic in sorted_topics])
embeddings = embeddings[indices]
distance_matrix = cosine_similarity(embeddings)
# Create labels
if isinstance(custom_labels, str):
new_labels = [
[[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in sorted_topics
]
new_labels = ["_".join([label[0] for label in labels[:4]]) for labels in new_labels]
new_labels = [label if len(label) < 30 else label[:27] + "..." for label in new_labels]
elif topic_model.custom_labels_ is not None and custom_labels:
new_labels = [topic_model.custom_labels_[topic + topic_model._outliers] for topic in sorted_topics]
else:
new_labels = [[[str(topic), None]] + topic_model.get_topic(topic) for topic in sorted_topics]
new_labels = ["_".join([label[0] for label in labels[:4]]) for labels in new_labels]
new_labels = [label if len(label) < 30 else label[:27] + "..." for label in new_labels]
fig = px.imshow(
distance_matrix,
labels=dict(color="Similarity Score"),
x=new_labels,
y=new_labels,
color_continuous_scale="GnBu",
)
fig.update_layout(
title={
"text": f"{title}",
"y": 0.95,
"x": 0.55,
"xanchor": "center",
"yanchor": "top",
"font": dict(size=22, color="Black"),
},
width=width,
height=height,
hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
)
fig.update_layout(showlegend=True)
fig.update_layout(legend_title_text="Trend")
return fig
@@ -1,375 +0,0 @@
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import math
from typing import List, Union
def visualize_hierarchical_documents(
topic_model,
docs: List[str],
hierarchical_topics: pd.DataFrame,
topics: List[int] = None,
embeddings: np.ndarray = None,
reduced_embeddings: np.ndarray = None,
sample: Union[float, int] = None,
hide_annotations: bool = False,
hide_document_hover: bool = True,
nr_levels: int = 10,
level_scale: str = "linear",
custom_labels: Union[bool, str] = False,
title: str = "<b>Hierarchical Documents and Topics</b>",
width: int = 1200,
height: int = 750,
) -> go.Figure:
"""Visualize documents and their topics in 2D at different levels of hierarchy.
Arguments:
topic_model: A fitted BERTopic instance.
docs: The documents you used when calling either `fit` or `fit_transform`
hierarchical_topics: A dataframe that contains a hierarchy of topics
represented by their parents and their children
topics: A selection of topics to visualize.
Not to be confused with the topics that you get from `.fit_transform`.
For example, if you want to visualize only topics 1 through 5:
`topics = [1, 2, 3, 4, 5]`.
embeddings: The embeddings of all documents in `docs`.
reduced_embeddings: The 2D reduced embeddings of all documents in `docs`.
sample: The fraction of documents in each topic that you would like to keep.
Value can be between 0 and 1. Setting this value to, for example,
0.1 (10% of documents in each topic) makes it easier to visualize
millions of documents as a subset is chosen. Topics with fewer than
100 documents are always kept in full.
hide_annotations: Hide the names of the traces on top of each cluster.
hide_document_hover: Hide the content of the documents when hovering over
specific points. Helps to speed up generation of visualizations.
nr_levels: The number of levels to be visualized in the hierarchy. First, the distances
in `hierarchical_topics.Distance` are split in `nr_levels` lists of distances.
Then, for each list of distances, the merged topics are selected that have a
distance less than or equal to the maximum distance of the selected list of distances.
NOTE: To get all possible merged steps, make sure that `nr_levels` is equal to
the length of `hierarchical_topics`.
level_scale: Whether to apply a linear or logarithmic (log) scale to the levels of the distance
vector. Linear scaling will perform an equal number of merges at each level
while logarithmic scaling will perform more mergers in earlier levels to
provide more resolution at higher levels (this can be useful when the number
of topics is large).
custom_labels: If bool, whether to use custom topic labels that were defined using
`topic_model.set_topic_labels`.
If `str`, it uses labels from other aspects, e.g., "Aspect1".
NOTE: Custom labels are only generated for the original
un-merged topics.
title: Title of the plot.
width: The width of the figure.
height: The height of the figure.
Examples:
To visualize the topics simply run:
```python
topic_model.visualize_hierarchical_documents(docs, hierarchical_topics)
```
Do note that this re-calculates the embeddings and reduces them to 2D.
The advised and preferred pipeline for using this function is as follows:
```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP
# Prepare embeddings
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)
# Train BERTopic and extract hierarchical topics
topic_model = BERTopic().fit(docs, embeddings)
hierarchical_topics = topic_model.hierarchical_topics(docs)
# Reduce dimensionality of embeddings, this step is optional
# reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
# Run the visualization with the original embeddings
topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, embeddings=embeddings)
# Or, if you have reduced the original embeddings already:
topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, reduced_embeddings=reduced_embeddings)
```
Or if you want to save the resulting figure:
```python
fig = topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, reduced_embeddings=reduced_embeddings)
fig.write_html("path/to/file.html")
```
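How `nr_levels` and `level_scale` pick the maximum distance for each level can
be sketched in plain Python. `level_indices` is a hypothetical helper written
for this illustration; it assumes `nr_levels` is at least 2 and not larger than
the number of distances, and it returns positions in
`hierarchical_topics.Distance`, starting with the level that allows the largest distance:
```python
import math

def level_indices(n_distances, nr_levels, level_scale="linear"):
    # Hypothetical helper for illustration; not part of BERTopic
    if level_scale in ("lin", "linear"):
        # Split 0..n-1 into `nr_levels` nearly equal chunks and keep each chunk's last index
        base, extra = divmod(n_distances, nr_levels)
        indices, end = [], 0
        for chunk in range(nr_levels):
            end += base + (1 if chunk < extra else 0)
            indices.append(end - 1)
    elif level_scale in ("log", "logarithmic"):
        # Log-spaced positions between 1 and n-1, rounded to integers
        stop = math.log10(n_distances - 1)
        indices = [round(10 ** (stop * i / (nr_levels - 1))) for i in range(nr_levels)]
    else:
        raise ValueError("level_scale needs to be one of 'log' or 'linear'")
    return indices[::-1]

print(level_indices(20, 5, "linear"))  # [19, 15, 11, 7, 3]
print(level_indices(20, 5, "log"))  # [19, 9, 4, 2, 1]
```
With the logarithmic scale, the selected positions cluster near the start of the
distance list, so the later levels differ by only a few merges.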
Note:
This visualization was inspired by the scatter plot representation of Doc2Map:
https://github.com/louisgeisler/Doc2Map
<iframe src="../../getting_started/visualization/hierarchical_documents.html"
style="width:1000px; height: 770px; border: 0px;"></iframe>
"""
topic_per_doc = topic_model.topics_
# Sample the data to optimize for visualization and dimensionality reduction
if sample is None or sample > 1:
sample = 1
indices = []
for topic in set(topic_per_doc):
s = np.where(np.array(topic_per_doc) == topic)[0]
size = len(s) if len(s) < 100 else int(len(s) * sample)
indices.extend(np.random.choice(s, size=size, replace=False))
indices = np.array(indices)
df = pd.DataFrame({"topic": np.array(topic_per_doc)[indices]})
df["doc"] = [docs[index] for index in indices]
df["topic"] = [topic_per_doc[index] for index in indices]
# Extract embeddings if not already done
if sample is None:
if embeddings is None and reduced_embeddings is None:
embeddings_to_reduce = topic_model._extract_embeddings(df.doc.to_list(), method="document")
else:
embeddings_to_reduce = embeddings
else:
if embeddings is not None:
embeddings_to_reduce = embeddings[indices]
elif embeddings is None and reduced_embeddings is None:
embeddings_to_reduce = topic_model._extract_embeddings(df.doc.to_list(), method="document")
# Reduce input embeddings
if reduced_embeddings is None:
try:
from umap import UMAP
umap_model = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric="cosine").fit(embeddings_to_reduce)
embeddings_2d = umap_model.embedding_
except (ImportError, ModuleNotFoundError):
raise ModuleNotFoundError(
"UMAP is required if the embeddings are not yet reduced in dimensionality. Please install it using `pip install umap-learn`."
)
elif sample is not None and reduced_embeddings is not None:
embeddings_2d = reduced_embeddings[indices]
elif sample is None and reduced_embeddings is not None:
embeddings_2d = reduced_embeddings
# Combine data
df["x"] = embeddings_2d[:, 0]
df["y"] = embeddings_2d[:, 1]
# Create topic list for each level, levels are created by calculating the distance
distances = hierarchical_topics.Distance.to_list()
if level_scale == "log" or level_scale == "logarithmic":
log_indices = (
np.round(
np.logspace(
start=math.log(1, 10),
stop=math.log(len(distances) - 1, 10),
num=nr_levels,
)
)
.astype(int)
.tolist()
)
log_indices.reverse()
max_distances = [distances[i] for i in log_indices]
elif level_scale == "lin" or level_scale == "linear":
max_distances = [
distances[indices[-1]] for indices in np.array_split(range(len(hierarchical_topics)), nr_levels)
][::-1]
else:
raise ValueError("level_scale needs to be one of 'log' or 'linear'")
for index, max_distance in enumerate(max_distances):
# Get topics below `max_distance`
mapping = {topic: topic for topic in df.topic.unique()}
selection = hierarchical_topics.loc[hierarchical_topics.Distance <= max_distance, :]
selection.Parent_ID = selection.Parent_ID.astype(int)
selection = selection.sort_values("Parent_ID")
for row in selection.iterrows():
for topic in row[1].Topics:
mapping[topic] = row[1].Parent_ID
# Make sure the mappings are mapped 1:1
mappings = [True for _ in mapping]
while any(mappings):
for i, (key, value) in enumerate(mapping.items()):
if value in mapping.keys() and key != value:
mapping[key] = mapping[value]
else:
mappings[i] = False
# Create new column
df[f"level_{index + 1}"] = df.topic.map(mapping)
df[f"level_{index + 1}"] = df[f"level_{index + 1}"].astype(int)
# Prepare topic names of original and merged topics
trace_names = []
topic_names = {}
for topic in range(hierarchical_topics.Parent_ID.astype(int).max()):
if topic < hierarchical_topics.Parent_ID.astype(int).min():
if topic_model.get_topic(topic):
if isinstance(custom_labels, str):
trace_name = f"{topic}_" + "_".join(
list(zip(*topic_model.topic_aspects_[custom_labels][topic]))[0][:3]
)
elif topic_model.custom_labels_ is not None and custom_labels:
trace_name = topic_model.custom_labels_[topic + topic_model._outliers]
else:
trace_name = f"{topic}_" + "_".join([word[:20] for word, _ in topic_model.get_topic(topic)][:3])
topic_names[topic] = {
"trace_name": trace_name[:40],
"plot_text": trace_name[:40],
}
trace_names.append(trace_name)
else:
trace_name = (
f"{topic}_"
+ hierarchical_topics.loc[hierarchical_topics.Parent_ID == str(topic), "Parent_Name"].values[0]
)
plot_text = "_".join([name[:20] for name in trace_name.split("_")[:3]])
topic_names[topic] = {
"trace_name": trace_name[:40],
"plot_text": plot_text[:40],
}
trace_names.append(trace_name)
# Prepare traces
all_traces = []
for level in range(len(max_distances)):
traces = []
# Outliers
if topic_model._outliers:
traces.append(
go.Scattergl(
x=df.loc[(df[f"level_{level + 1}"] == -1), "x"],
y=df.loc[df[f"level_{level + 1}"] == -1, "y"],
mode="markers+text",
name="other",
hoverinfo="text",
hovertext=df.loc[(df[f"level_{level + 1}"] == -1), "doc"] if not hide_document_hover else None,
showlegend=False,
marker=dict(color="#CFD8DC", size=5, opacity=0.5),
)
)
# Selected topics
if topics:
selection = df.loc[(df.topic.isin(topics)), :]
unique_topics = sorted([int(topic) for topic in selection[f"level_{level + 1}"].unique()])
else:
unique_topics = sorted([int(topic) for topic in df[f"level_{level + 1}"].unique()])
for topic in unique_topics:
if topic != -1:
if topics:
selection = df.loc[(df[f"level_{level + 1}"] == topic) & (df.topic.isin(topics)), :]
else:
selection = df.loc[df[f"level_{level + 1}"] == topic, :]
if not hide_annotations:
selection.loc[len(selection), :] = None
selection["text"] = ""
selection.loc[len(selection) - 1, "x"] = selection.x.mean()
selection.loc[len(selection) - 1, "y"] = selection.y.mean()
selection.loc[len(selection) - 1, "text"] = topic_names[int(topic)]["plot_text"]
traces.append(
go.Scattergl(
x=selection.x,
y=selection.y,
text=selection.text if not hide_annotations else None,
hovertext=selection.doc if not hide_document_hover else None,
hoverinfo="text",
name=topic_names[int(topic)]["trace_name"],
mode="markers+text",
marker=dict(size=5, opacity=0.5),
)
)
all_traces.append(traces)
# Track and count traces
nr_traces_per_set = [len(traces) for traces in all_traces]
trace_indices = [(0, nr_traces_per_set[0])]
for index, nr_traces in enumerate(nr_traces_per_set[1:]):
start = trace_indices[index][1]
end = nr_traces + start
trace_indices.append((start, end))
# Visualization
fig = go.Figure()
for traces in all_traces:
for trace in traces:
fig.add_trace(trace)
for index in range(len(fig.data)):
if index >= nr_traces_per_set[0]:
fig.data[index].visible = False
# Create and add slider
steps = []
for index, indices in enumerate(trace_indices):
step = dict(
method="update",
label=str(index),
args=[{"visible": [False] * len(fig.data)}],
)
for index in range(indices[1] - indices[0]):
step["args"][0]["visible"][index + indices[0]] = True
steps.append(step)
sliders = [dict(currentvalue={"prefix": "Level: "}, pad={"t": 20}, steps=steps)]
# Add grid in a 'plus' shape
x_range = (
df.x.min() - abs((df.x.min()) * 0.15),
df.x.max() + abs((df.x.max()) * 0.15),
)
y_range = (
df.y.min() - abs((df.y.min()) * 0.15),
df.y.max() + abs((df.y.max()) * 0.15),
)
fig.add_shape(
type="line",
x0=sum(x_range) / 2,
y0=y_range[0],
x1=sum(x_range) / 2,
y1=y_range[1],
line=dict(color="#CFD8DC", width=2),
)
fig.add_shape(
type="line",
x0=x_range[0],
y0=sum(y_range) / 2,
x1=x_range[1],
y1=sum(y_range) / 2,
line=dict(color="#9E9E9E", width=2),
)
fig.add_annotation(x=x_range[0], y=sum(y_range) / 2, text="D1", showarrow=False, yshift=10)
fig.add_annotation(y=y_range[1], x=sum(x_range) / 2, text="D2", showarrow=False, xshift=10)
# Stylize layout
fig.update_layout(
sliders=sliders,
template="simple_white",
title={
"text": f"{title}",
"x": 0.5,
"xanchor": "center",
"yanchor": "top",
"font": dict(size=22, color="Black"),
},
width=width,
height=height,
)
fig.update_xaxes(visible=False)
fig.update_yaxes(visible=False)
return fig
@@ -1,330 +0,0 @@
import numpy as np
import pandas as pd
from typing import Callable, List, Union
from scipy.sparse import csr_matrix
from scipy.cluster import hierarchy as sch
from sklearn.metrics.pairwise import cosine_similarity
from bertopic._utils import select_topic_representation
import plotly.graph_objects as go
import plotly.figure_factory as ff
from bertopic._utils import validate_distance_matrix
def visualize_hierarchy(
topic_model,
orientation: str = "left",
topics: List[int] = None,
top_n_topics: int = None,
use_ctfidf: bool = True,
custom_labels: Union[bool, str] = False,
title: str = "<b>Hierarchical Clustering</b>",
width: int = 1000,
height: int = 600,
hierarchical_topics: pd.DataFrame = None,
linkage_function: Callable[[csr_matrix], np.ndarray] = None,
distance_function: Callable[[csr_matrix], csr_matrix] = None,
color_threshold: int = 1,
) -> go.Figure:
"""Visualize a hierarchical structure of the topics.
A ward linkage function is used to perform the
hierarchical clustering based on the cosine distance
matrix between topic embeddings (either c-TF-IDF or the embeddings from the embedding model).
Arguments:
topic_model: A fitted BERTopic instance.
orientation: The orientation of the figure.
Either 'left' or 'bottom'
topics: A selection of topics to visualize
top_n_topics: Only select the top n most frequent topics
use_ctfidf: Whether to calculate distances between topics based on c-TF-IDF embeddings. If False, the embeddings
from the embedding model are used.
custom_labels: If bool, whether to use custom topic labels that were defined using
`topic_model.set_topic_labels`.
If `str`, it uses labels from other aspects, e.g., "Aspect1".
NOTE: Custom labels are only generated for the original
un-merged topics.
title: Title of the plot.
width: The width of the figure. Only works if orientation is set to 'left'
height: The height of the figure. Only works if orientation is set to 'bottom'
hierarchical_topics: A dataframe that contains a hierarchy of topics
represented by their parents and their children.
NOTE: The hierarchical topic names are only visualized
if both `topics` and `top_n_topics` are not set.
linkage_function: The linkage function to use. Default is:
`lambda x: sch.linkage(x, 'ward', optimal_ordering=True)`
NOTE: Make sure to use the same `linkage_function` as used
in `topic_model.hierarchical_topics`.
distance_function: The distance function to use on the c-TF-IDF matrix. Default is:
`lambda x: 1 - cosine_similarity(x)`.
You can pass any function that returns either a square matrix of
shape (n_samples, n_samples) with zeros on the diagonal and
non-negative values or condensed distance matrix of shape
(n_samples * (n_samples - 1) / 2,) containing the upper
triangular of the distance matrix.
NOTE: Make sure to use the same `distance_function` as used
in `topic_model.hierarchical_topics`.
color_threshold: Value at which the separation of clusters will be made which
will result in different colors for different clusters.
A higher value will typically lead to fewer colored clusters.
Returns:
fig: A plotly figure
Examples:
To visualize the hierarchical structure of
topics simply run:
```python
topic_model.visualize_hierarchy()
```
If you also want the labels of the hierarchical topics visualized,
run the following:
```python
# Extract hierarchical topics and their representations
hierarchical_topics = topic_model.hierarchical_topics(docs)
# Visualize these representations
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
```
If you want to save the resulting figure:
```python
fig = topic_model.visualize_hierarchy()
fig.write_html("path/to/file.html")
```
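A custom `distance_function` may return the condensed form directly.
The following is only a sketch on made-up embeddings; `condensed_cosine_distance`
is a hypothetical helper, not part of BERTopic, and it assumes dense input:
```python
import numpy as np
from scipy.cluster import hierarchy as sch

# Toy stand-in for topic embeddings: 4 topics, 3 dimensions (made-up values)
embeddings = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.9, 0.1, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.9, 0.1],
    ]
)


def condensed_cosine_distance(x):
    # Hypothetical helper: cosine distance as a condensed (upper-triangular) vector
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    # Clip tiny negative values caused by floating-point error
    square = np.clip(1 - unit @ unit.T, 0, None)
    rows, cols = np.triu_indices(len(x), k=1)
    return square[rows, cols]


# n * (n - 1) / 2 pairwise distances, followed by the default ward linkage
distances = condensed_cosine_distance(embeddings)
linkage = sch.linkage(distances, "ward", optimal_ordering=True)
```
A sparse c-TF-IDF matrix would need to be converted to a dense array before
a function like this could be applied to it.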
<iframe src="../../getting_started/visualization/hierarchy.html"
style="width:1000px; height: 680px; border: 0px;"></iframe>
"""
if distance_function is None:
distance_function = lambda x: 1 - cosine_similarity(x)
if linkage_function is None:
linkage_function = lambda x: sch.linkage(x, "ward", optimal_ordering=True)
# Select topics based on top_n and topics args
freq_df = topic_model.get_topic_freq()
freq_df = freq_df.loc[freq_df.Topic != -1, :]
if topics is not None:
topics = list(topics)
elif top_n_topics is not None:
topics = sorted(freq_df.Topic.to_list()[:top_n_topics])
else:
topics = sorted(freq_df.Topic.to_list())
# Select embeddings
all_topics = sorted(list(topic_model.get_topics().keys()))
indices = np.array([all_topics.index(topic) for topic in topics])
# Select topic embeddings
embeddings = select_topic_representation(topic_model.c_tf_idf_, topic_model.topic_embeddings_, use_ctfidf)[0][
indices
]
# Annotations
if hierarchical_topics is not None and len(topics) == len(freq_df.Topic.to_list()):
annotations = _get_annotations(
topic_model=topic_model,
hierarchical_topics=hierarchical_topics,
embeddings=embeddings,
distance_function=distance_function,
linkage_function=linkage_function,
orientation=orientation,
custom_labels=custom_labels,
)
else:
annotations = None
# wrap distance function to validate input and return a condensed distance matrix
distance_function_viz = lambda x: validate_distance_matrix(distance_function(x), embeddings.shape[0])
# Create dendrogram
fig = ff.create_dendrogram(
embeddings,
orientation=orientation,
distfun=distance_function_viz,
linkagefun=linkage_function,
hovertext=annotations,
color_threshold=color_threshold,
)
# Create nicer labels
axis = "yaxis" if orientation == "left" else "xaxis"
if isinstance(custom_labels, str):
new_labels = [
[[str(x), None]] + topic_model.topic_aspects_[custom_labels][x] for x in fig.layout[axis]["ticktext"]
]
new_labels = ["_".join([label[0] for label in labels[:4]]) for labels in new_labels]
new_labels = [label if len(label) < 30 else label[:27] + "..." for label in new_labels]
elif topic_model.custom_labels_ is not None and custom_labels:
new_labels = [
topic_model.custom_labels_[topics[int(x)] + topic_model._outliers] for x in fig.layout[axis]["ticktext"]
]
else:
new_labels = [
[[str(topics[int(x)]), None]] + topic_model.get_topic(topics[int(x)]) for x in fig.layout[axis]["ticktext"]
]
new_labels = ["_".join([label[0] for label in labels[:4]]) for labels in new_labels]
new_labels = [label if len(label) < 30 else label[:27] + "..." for label in new_labels]
# Stylize layout
fig.update_layout(
plot_bgcolor="#ECEFF1",
template="plotly_white",
title={
"text": f"{title}",
"x": 0.5,
"xanchor": "center",
"yanchor": "top",
"font": dict(size=22, color="Black"),
},
hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
)
# Stylize orientation
if orientation == "left":
fig.update_layout(
height=200 + (15 * len(topics)),
width=width,
yaxis=dict(tickmode="array", ticktext=new_labels),
)
# Fix empty space on the bottom of the graph
y_max = max([trace["y"].max() + 5 for trace in fig["data"]])
y_min = min([trace["y"].min() - 5 for trace in fig["data"]])
fig.update_layout(yaxis=dict(range=[y_min, y_max]))
else:
fig.update_layout(
width=200 + (15 * len(topics)),
height=height,
xaxis=dict(tickmode="array", ticktext=new_labels),
)
if hierarchical_topics is not None:
for index in [0, 3]:
axis = "x" if orientation == "left" else "y"
xs = [data["x"][index] for data in fig.data if (data["text"] and data[axis][index] > 0)]
ys = [data["y"][index] for data in fig.data if (data["text"] and data[axis][index] > 0)]
hovertext = [data["text"][index] for data in fig.data if (data["text"] and data[axis][index] > 0)]
fig.add_trace(
go.Scatter(
x=xs,
y=ys,
marker_color="black",
hovertext=hovertext,
hoverinfo="text",
mode="markers",
showlegend=False,
)
)
return fig
def _get_annotations(
topic_model,
hierarchical_topics: pd.DataFrame,
embeddings: csr_matrix,
linkage_function: Callable[[csr_matrix], np.ndarray],
distance_function: Callable[[csr_matrix], csr_matrix],
orientation: str,
custom_labels: bool = False,
) -> List[List[str]]:
"""Get annotations by replicating linkage function calculation in scipy.
Arguments:
topic_model: A fitted BERTopic instance.
hierarchical_topics: A dataframe that contains a hierarchy of topics
represented by their parents and their children.
NOTE: The hierarchical topic names are only visualized
if both `topics` and `top_n_topics` are not set.
embeddings: The c-TF-IDF matrix on which to model the hierarchy
linkage_function: The linkage function to use. Default is:
`lambda x: sch.linkage(x, 'ward', optimal_ordering=True)`
NOTE: Make sure to use the same `linkage_function` as used
in `topic_model.hierarchical_topics`.
distance_function: The distance function to use on the c-TF-IDF matrix. Default is:
`lambda x: 1 - cosine_similarity(x)`.
You can pass any function that returns either a square matrix of
shape (n_samples, n_samples) with zeros on the diagonal and
non-negative values or condensed distance matrix of shape
(n_samples * (n_samples - 1) / 2,) containing the upper
triangular of the distance matrix.
NOTE: Make sure to use the same `distance_function` as used
in `topic_model.hierarchical_topics`.
orientation: The orientation of the figure.
Either 'left' or 'bottom'
custom_labels: Whether to use custom topic labels that were defined using
`topic_model.set_topic_labels`.
NOTE: Custom labels are only generated for the original
un-merged topics.
Returns:
text_annotations: Annotations to be used within Plotly's `ff.create_dendrogram`
"""
df = hierarchical_topics.loc[hierarchical_topics.Parent_Name != "Top", :]
# Calculate distance
X = distance_function(embeddings)
X = validate_distance_matrix(X, embeddings.shape[0])
# Calculate linkage and generate dendrogram
Z = linkage_function(X)
P = sch.dendrogram(Z, orientation=orientation, no_plot=True)
# Store the topic numbers (leaves) corresponding to the x-ticks in the dendrogram
x_ticks = np.arange(5, len(P["leaves"]) * 10 + 5, 10)
x_topic = dict(zip(P["leaves"], x_ticks))
topic_vals = dict()
for key, val in x_topic.items():
topic_vals[val] = [key]
parent_topic = dict(zip(df.Parent_ID, df.Topics))
# loop through every trace (scatter plot) in dendrogram
text_annotations = []
for index, trace in enumerate(P["icoord"]):
fst_topic = topic_vals[trace[0]]
scnd_topic = topic_vals[trace[2]]
if len(fst_topic) == 1:
if isinstance(custom_labels, str):
fst_name = f"{fst_topic[0]}_" + "_".join(
list(zip(*topic_model.topic_aspects_[custom_labels][fst_topic[0]]))[0][:3]
)
elif topic_model.custom_labels_ is not None and custom_labels:
fst_name = topic_model.custom_labels_[fst_topic[0] + topic_model._outliers]
else:
fst_name = "_".join([word for word, _ in topic_model.get_topic(fst_topic[0])][:5])
else:
for key, value in parent_topic.items():
if set(value) == set(fst_topic):
fst_name = df.loc[df.Parent_ID == key, "Parent_Name"].values[0]
if len(scnd_topic) == 1:
if isinstance(custom_labels, str):
scnd_name = f"{scnd_topic[0]}_" + "_".join(
list(zip(*topic_model.topic_aspects_[custom_labels][scnd_topic[0]]))[0][:3]
)
elif topic_model.custom_labels_ is not None and custom_labels:
scnd_name = topic_model.custom_labels_[scnd_topic[0] + topic_model._outliers]
else:
scnd_name = "_".join([word for word, _ in topic_model.get_topic(scnd_topic[0])][:5])
else:
for key, value in parent_topic.items():
if set(value) == set(scnd_topic):
scnd_name = df.loc[df.Parent_ID == key, "Parent_Name"].values[0]
text_annotations.append([fst_name, "", "", scnd_name])
center = (trace[0] + trace[2]) / 2
topic_vals[center] = fst_topic + scnd_topic
return text_annotations
@@ -1,131 +0,0 @@
import numpy as np
from typing import List, Union
import plotly.graph_objects as go
def visualize_term_rank(
topic_model,
topics: List[int] = None,
log_scale: bool = False,
custom_labels: Union[bool, str] = False,
title: str = "<b>Term score decline per Topic</b>",
width: int = 800,
height: int = 500,
) -> go.Figure:
"""Visualize the ranks of all terms across all topics.
Each topic is represented by a set of words. These words, however,
do not all equally represent the topic. This visualization shows
how many words are needed to represent a topic and at which point
the beneficial effect of adding words starts to decline.
Arguments:
topic_model: A fitted BERTopic instance.
topics: A selection of topics to visualize. These will be colored
red where all others will be colored black.
log_scale: Whether to represent the ranking on a log scale
custom_labels: If bool, whether to use custom topic labels that were defined using
`topic_model.set_topic_labels`.
If `str`, it uses labels from other aspects, e.g., "Aspect1".
title: Title of the plot.
width: The width of the figure.
height: The height of the figure.
Returns:
fig: A plotly figure
Examples:
To visualize the ranks of all words across
all topics simply run:
```python
topic_model.visualize_term_rank()
```
Or if you want to save the resulting figure:
```python
fig = topic_model.visualize_term_rank()
fig.write_html("path/to/file.html")
```
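As a rough sketch of what `log_scale=True` does to the scores
(the array below is made-up data, not output from a fitted model):
```python
import numpy as np

# Made-up c-TF-IDF scores: two topics, four ranked terms each
values = np.array([[0.08, 0.04, 0.02, 0.0], [0.05, 0.03, 0.01, 0.005]])

# Zeros have no logarithm, so they are replaced by the smallest
# positive score found across all topics before taking log10
floor = values[values > 0].min()
y = values[0].copy()
y[y == 0] = floor
log_y = np.log10(y)
```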
<iframe src="../../getting_started/visualization/term_rank.html"
style="width:1000px; height: 530px; border: 0px;"></iframe>
<iframe src="../../getting_started/visualization/term_rank_log.html"
style="width:1000px; height: 530px; border: 0px;"></iframe>
Reference:
This visualization was heavily inspired by the
"Term Probability Decline" visualization found in an
analysis by the amazing [tmtoolkit](https://tmtoolkit.readthedocs.io/).
Reference to that specific analysis can be found
[here](https://wzbsocialsciencecenter.github.io/tm_corona/tm_analysis.html).
"""
topics = [] if topics is None else topics
topic_ids = topic_model.get_topic_info().Topic.unique().tolist()
topic_words = [topic_model.get_topic(topic) for topic in topic_ids]
values = np.array([[value[1] for value in values] for values in topic_words])
indices = np.array([[value + 1 for value in range(len(values))] for values in topic_words])
# Create figure
lines = []
for topic, x, y in zip(topic_ids, indices, values):
if not any(y > 1.5):
# labels
if isinstance(custom_labels, str):
label = f"{topic}_" + "_".join(list(zip(*topic_model.topic_aspects_[custom_labels][topic]))[0][:3])
elif topic_model.custom_labels_ is not None and custom_labels:
label = topic_model.custom_labels_[topic + topic_model._outliers]
else:
label = f"<b>Topic {topic}</b>:" + "_".join([word[0] for word in topic_model.get_topic(topic)])
label = label[:50]
# line parameters
color = "red" if topic in topics else "black"
opacity = 1 if topic in topics else 0.1
if any(y == 0):
y[y == 0] = min(values[values > 0])
y = np.log10(y, out=y, where=y > 0) if log_scale else y
line = go.Scatter(
x=x,
y=y,
name="",
hovertext=label,
mode="lines",
opacity=opacity,
line=dict(color=color, width=1.5),
)
lines.append(line)
fig = go.Figure(data=lines)
# Stylize layout
fig.update_xaxes(range=[0, len(indices[0])], tick0=1, dtick=2)
fig.update_layout(
showlegend=False,
template="plotly_white",
title={
"text": f"{title}",
"y": 0.9,
"x": 0.5,
"xanchor": "center",
"yanchor": "top",
"font": dict(size=22, color="Black"),
},
width=width,
height=height,
hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
)
fig.update_xaxes(title_text="Term Rank")
if log_scale:
fig.update_yaxes(title_text="c-TF-IDF score (log scale)")
else:
fig.update_yaxes(title_text="c-TF-IDF score")
return fig
@@ -1,212 +0,0 @@
import numpy as np
import pandas as pd
try:
from umap import UMAP
HAS_UMAP = True
except (ImportError, ModuleNotFoundError):
HAS_UMAP = False
from typing import List, Union
from sklearn.preprocessing import MinMaxScaler
from bertopic._utils import select_topic_representation
import plotly.express as px
import plotly.graph_objects as go
def visualize_topics(
topic_model,
topics: List[int] = None,
top_n_topics: int = None,
use_ctfidf: bool = False,
custom_labels: Union[bool, str] = False,
title: str = "<b>Intertopic Distance Map</b>",
width: int = 650,
height: int = 650,
) -> go.Figure:
"""Visualize topics, their sizes, and their corresponding words.
This visualization is highly inspired by LDAvis, a great visualization
technique typically reserved for LDA.
Arguments:
topic_model: A fitted BERTopic instance.
topics: A selection of topics to visualize
top_n_topics: Only select the top n most frequent topics
use_ctfidf: Whether to use c-TF-IDF representations instead of the embeddings from the embedding model.
custom_labels: If bool, whether to use custom topic labels that were defined using
`topic_model.set_topic_labels`.
If `str`, it uses labels from other aspects, e.g., "Aspect1".
title: Title of the plot.
width: The width of the figure.
height: The height of the figure.
Examples:
To visualize the topics simply run:
```python
topic_model.visualize_topics()
```
Or if you want to save the resulting figure:
```python
fig = topic_model.visualize_topics()
fig.write_html("path/to/file.html")
```
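When `custom_labels` is a string, each label is built from the topic id and
the leading aspect words, then shortened. A minimal sketch of that rule,
where `build_label` and the sample words are hypothetical:
```python
def build_label(topic, aspect_words, max_len=30):
    # Join the topic id with the first three aspect words
    parts = [str(topic)] + [word for word, _ in aspect_words[:3]]
    label = "_".join(parts)
    # Labels of max_len characters or more are cut and end in "..."
    return label if len(label) < max_len else label[: max_len - 3] + "..."


short = build_label(0, [("apple", 0.5), ("banana", 0.4), ("cherry", 0.3), ("date", 0.2)])
long = build_label(12, [("internationalization", 0.5), ("telecommunications", 0.4), ("x", 0.1)])
```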
<iframe src="../../getting_started/visualization/viz.html"
style="width:1000px; height: 680px; border: 0px;"></iframe>
"""
# Select topics based on top_n and topics args
freq_df = topic_model.get_topic_freq()
freq_df = freq_df.loc[freq_df.Topic != -1, :]
if topics is not None:
topics = list(topics)
elif top_n_topics is not None:
topics = sorted(freq_df.Topic.to_list()[:top_n_topics])
else:
topics = sorted(freq_df.Topic.to_list())
# Extract topic words and their frequencies
topic_list = sorted(topics)
frequencies = [topic_model.topic_sizes_[topic] for topic in topic_list]
if isinstance(custom_labels, str):
words = [[[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in topic_list]
words = ["_".join([label[0] for label in labels[:4]]) for labels in words]
words = [label if len(label) < 30 else label[:27] + "..." for label in words]
elif custom_labels and topic_model.custom_labels_ is not None:
words = [topic_model.custom_labels_[topic + topic_model._outliers] for topic in topic_list]
else:
words = [" | ".join([word[0] for word in topic_model.get_topic(topic)[:5]]) for topic in topic_list]
# Embed the selected topic representation (c-TF-IDF or topic embeddings) into 2D
all_topics = sorted(list(topic_model.get_topics().keys()))
indices = np.array([all_topics.index(topic) for topic in topics])
embeddings, c_tfidf_used = select_topic_representation(
topic_model.c_tf_idf_,
topic_model.topic_embeddings_,
use_ctfidf=use_ctfidf,
output_ndarray=True,
)
embeddings = embeddings[indices]
if HAS_UMAP:
if c_tfidf_used:
embeddings = MinMaxScaler().fit_transform(embeddings)
embeddings = UMAP(n_neighbors=2, n_components=2, metric="hellinger", random_state=42).fit_transform(
embeddings
)
else:
embeddings = UMAP(n_neighbors=2, n_components=2, metric="cosine", random_state=42).fit_transform(embeddings)
else:
raise ModuleNotFoundError(
"UMAP is required to reduce the embeddings. Please install it using `pip install umap-learn`."
)
# Visualize with plotly
df = pd.DataFrame(
{
"x": embeddings[:, 0],
"y": embeddings[:, 1],
"Topic": topic_list,
"Words": words,
"Size": frequencies,
}
)
return _plotly_topic_visualization(df, topic_list, title, width, height)
def _plotly_topic_visualization(df: pd.DataFrame, topic_list: List[int], title: str, width: int, height: int):
"""Create plotly-based visualization of topics with a slider for topic selection."""
def get_color(topic_selected):
if topic_selected == -1:
marker_color = ["#B0BEC5" for _ in topic_list]
else:
marker_color = ["red" if topic == topic_selected else "#B0BEC5" for topic in topic_list]
return [{"marker.color": [marker_color]}]
# Prepare figure range
x_range = (
df.x.min() - abs((df.x.min()) * 0.15),
df.x.max() + abs((df.x.max()) * 0.15),
)
y_range = (
df.y.min() - abs((df.y.min()) * 0.15),
df.y.max() + abs((df.y.max()) * 0.15),
)
# Plot topics
fig = px.scatter(
df,
x="x",
y="y",
size="Size",
size_max=40,
template="simple_white",
labels={"x": "", "y": ""},
hover_data={"Topic": True, "Words": True, "Size": True, "x": False, "y": False},
)
fig.update_traces(marker=dict(color="#B0BEC5", line=dict(width=2, color="DarkSlateGrey")))
# Update hover order
fig.update_traces(
hovertemplate="<br>".join(
[
"<b>Topic %{customdata[0]}</b>",
"%{customdata[1]}",
"Size: %{customdata[2]}",
]
)
)
# Create a slider for topic selection
steps = [dict(label=f"Topic {topic}", method="update", args=get_color(topic)) for topic in topic_list]
sliders = [dict(active=0, pad={"t": 50}, steps=steps)]
# Stylize layout
fig.update_layout(
title={
"text": f"{title}",
"y": 0.95,
"x": 0.5,
"xanchor": "center",
"yanchor": "top",
"font": dict(size=22, color="Black"),
},
width=width,
height=height,
hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
xaxis={"visible": False},
yaxis={"visible": False},
sliders=sliders,
)
# Update axes ranges
fig.update_xaxes(range=x_range)
fig.update_yaxes(range=y_range)
# Add grid in a 'plus' shape
fig.add_shape(
type="line",
x0=sum(x_range) / 2,
y0=y_range[0],
x1=sum(x_range) / 2,
y1=y_range[1],
line=dict(color="#CFD8DC", width=2),
)
fig.add_shape(
type="line",
x0=x_range[0],
y0=sum(y_range) / 2,
x1=x_range[1],
y1=sum(y_range) / 2,
line=dict(color="#9E9E9E", width=2),
)
fig.add_annotation(x=x_range[0], y=sum(y_range) / 2, text="D1", showarrow=False, yshift=10)
fig.add_annotation(y=y_range[1], x=sum(x_range) / 2, text="D2", showarrow=False, xshift=10)
fig.data = fig.data[::-1]
return fig
@@ -1,134 +0,0 @@
import pandas as pd
from typing import List, Union
import plotly.graph_objects as go
from sklearn.preprocessing import normalize
def visualize_topics_over_time(
topic_model,
topics_over_time: pd.DataFrame,
top_n_topics: int = None,
topics: List[int] = None,
normalize_frequency: bool = False,
custom_labels: Union[bool, str] = False,
title: str = "<b>Topics over Time</b>",
width: int = 1250,
height: int = 450,
) -> go.Figure:
"""Visualize topics over time.
Arguments:
topic_model: A fitted BERTopic instance.
topics_over_time: The topics you would like to be visualized with the
corresponding topic representation
top_n_topics: To visualize the most frequent topics instead of all
topics: Select which topics you would like to be visualized
normalize_frequency: Whether to normalize each topic's frequency individually
custom_labels: If bool, whether to use custom topic labels that were defined using
`topic_model.set_topic_labels`.
If `str`, it uses labels from other aspects, e.g., "Aspect1".
title: Title of the plot.
width: The width of the figure.
height: The height of the figure.
Returns:
A plotly.graph_objects.Figure including all traces
Examples:
To visualize the topics over time, simply run:
```python
topics_over_time = topic_model.topics_over_time(docs, timestamps)
topic_model.visualize_topics_over_time(topics_over_time)
```
Or if you want to save the resulting figure:
```python
fig = topic_model.visualize_topics_over_time(topics_over_time)
fig.write_html("path/to/file.html")
```
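With `normalize_frequency=True`, each topic's frequencies are rescaled with
scikit-learn's `normalize`, whose default is the L2 norm. A minimal NumPy
sketch of that rescaling on made-up numbers:
```python
import numpy as np

# Made-up frequencies of a single topic across four timestamps
frequency = np.array([3.0, 4.0, 0.0, 12.0])

# L2 normalization: divide by the Euclidean length of the vector
normalized = frequency / np.linalg.norm(frequency)
```
Because every topic is scaled on its own, the normalized values show the
shape of each topic's trend rather than its absolute size.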
<iframe src="../../getting_started/visualization/trump.html"
style="width:1000px; height: 680px; border: 0px;"></iframe>
"""
colors = [
"#E69F00",
"#56B4E9",
"#009E73",
"#F0E442",
"#D55E00",
"#0072B2",
"#CC79A7",
]
# Select topics based on top_n and topics args
freq_df = topic_model.get_topic_freq()
freq_df = freq_df.loc[freq_df.Topic != -1, :]
if topics is not None:
selected_topics = list(topics)
elif top_n_topics is not None:
selected_topics = sorted(freq_df.Topic.to_list()[:top_n_topics])
else:
selected_topics = sorted(freq_df.Topic.to_list())
# Prepare data
if isinstance(custom_labels, str):
topic_names = [[[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in topics]
topic_names = ["_".join([label[0] for label in labels[:4]]) for labels in topic_names]
topic_names = [label if len(label) < 30 else label[:27] + "..." for label in topic_names]
topic_names = {key: topic_names[index] for index, key in enumerate(topic_model.topic_labels_.keys())}
elif topic_model.custom_labels_ is not None and custom_labels:
topic_names = {
key: topic_model.custom_labels_[key + topic_model._outliers] for key, _ in topic_model.topic_labels_.items()
}
else:
topic_names = {
key: value[:40] + "..." if len(value) > 40 else value for key, value in topic_model.topic_labels_.items()
}
topics_over_time["Name"] = topics_over_time.Topic.map(topic_names)
data = topics_over_time.loc[topics_over_time.Topic.isin(selected_topics), :].sort_values(["Topic", "Timestamp"])
# Add traces
fig = go.Figure()
for index, topic in enumerate(data.Topic.unique()):
trace_data = data.loc[data.Topic == topic, :]
topic_name = trace_data.Name.values[0]
words = trace_data.Words.values
if normalize_frequency:
y = normalize(trace_data.Frequency.values.reshape(1, -1))[0]
else:
y = trace_data.Frequency
fig.add_trace(
go.Scatter(
x=trace_data.Timestamp,
y=y,
mode="lines",
marker_color=colors[index % 7],
hoverinfo="text",
name=topic_name,
hovertext=[f"<b>Topic {topic}</b><br>Words: {word}" for word in words],
)
)
# Styling of the visualization
fig.update_xaxes(showgrid=True)
fig.update_yaxes(showgrid=True)
fig.update_layout(
yaxis_title="Normalized Frequency" if normalize_frequency else "Frequency",
title={
"text": f"{title}",
"y": 0.95,
"x": 0.40,
"xanchor": "center",
"yanchor": "top",
"font": dict(size=22, color="Black"),
},
template="simple_white",
width=width,
height=height,
hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
legend=dict(
title="<b>Global Topic Representation</b>",
),
)
return fig
@@ -1,140 +0,0 @@
import pandas as pd
from typing import List, Union
import plotly.graph_objects as go
from sklearn.preprocessing import normalize
def visualize_topics_per_class(
topic_model,
topics_per_class: pd.DataFrame,
top_n_topics: int = 10,
topics: List[int] = None,
normalize_frequency: bool = False,
custom_labels: Union[bool, str] = False,
title: str = "<b>Topics per Class</b>",
width: int = 1250,
height: int = 900,
) -> go.Figure:
"""Visualize topics per class.
Arguments:
topic_model: A fitted BERTopic instance.
topics_per_class: The topics you would like to be visualized with the
corresponding topic representation
top_n_topics: To visualize the most frequent topics instead of all
topics: Select which topics you would like to be visualized
normalize_frequency: Whether to normalize each topic's frequency individually
custom_labels: If bool, whether to use custom topic labels that were defined using
`topic_model.set_topic_labels`.
If `str`, it uses labels from other aspects, e.g., "Aspect1".
title: Title of the plot.
width: The width of the figure.
height: The height of the figure.
Returns:
A plotly.graph_objects.Figure including all traces
Examples:
To visualize the topics per class, simply run:
```python
topics_per_class = topic_model.topics_per_class(docs, classes)
topic_model.visualize_topics_per_class(topics_per_class)
```
Or if you want to save the resulting figure:
```python
fig = topic_model.visualize_topics_per_class(topics_per_class)
fig.write_html("path/to/file.html")
```
<iframe src="../../getting_started/visualization/topics_per_class.html"
style="width:1400px; height: 1000px; border: 0px;"></iframe>
"""
colors = [
"#E69F00",
"#56B4E9",
"#009E73",
"#F0E442",
"#D55E00",
"#0072B2",
"#CC79A7",
]
# Select topics based on top_n and topics args
freq_df = topic_model.get_topic_freq()
freq_df = freq_df.loc[freq_df.Topic != -1, :]
if topics is not None:
selected_topics = list(topics)
elif top_n_topics is not None:
selected_topics = sorted(freq_df.Topic.to_list()[:top_n_topics])
else:
selected_topics = sorted(freq_df.Topic.to_list())
# Prepare data
if isinstance(custom_labels, str):
topic_names = [[[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in topics]
topic_names = ["_".join([label[0] for label in labels[:4]]) for labels in topic_names]
topic_names = [label if len(label) < 30 else label[:27] + "..." for label in topic_names]
topic_names = {key: topic_names[index] for index, key in enumerate(topic_model.topic_labels_.keys())}
elif topic_model.custom_labels_ is not None and custom_labels:
topic_names = {
key: topic_model.custom_labels_[key + topic_model._outliers] for key, _ in topic_model.topic_labels_.items()
}
else:
topic_names = {
key: value[:40] + "..." if len(value) > 40 else value for key, value in topic_model.topic_labels_.items()
}
topics_per_class["Name"] = topics_per_class.Topic.map(topic_names)
data = topics_per_class.loc[topics_per_class.Topic.isin(selected_topics), :]
# Add traces
fig = go.Figure()
for index, topic in enumerate(selected_topics):
if index == 0:
visible = True
else:
visible = "legendonly"
trace_data = data.loc[data.Topic == topic, :]
topic_name = trace_data.Name.values[0]
words = trace_data.Words.values
if normalize_frequency:
x = normalize(trace_data.Frequency.values.reshape(1, -1))[0]
else:
x = trace_data.Frequency
fig.add_trace(
go.Bar(
y=trace_data.Class,
x=x,
visible=visible,
marker_color=colors[index % 7],
hoverinfo="text",
name=topic_name,
orientation="h",
hovertext=[f"<b>Topic {topic}</b><br>Words: {word}" for word in words],
)
)
# Styling of the visualization
fig.update_xaxes(showgrid=True)
fig.update_yaxes(showgrid=True)
fig.update_layout(
xaxis_title="Normalized Frequency" if normalize_frequency else "Frequency",
yaxis_title="Class",
title={
"text": f"{title}",
"y": 0.95,
"x": 0.40,
"xanchor": "center",
"yanchor": "top",
"font": dict(size=22, color="Black"),
},
template="simple_white",
width=width,
height=height,
hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
legend=dict(
title="<b>Global Topic Representation</b>",
),
)
return fig
@@ -1,76 +0,0 @@
from bertopic._utils import NotInstalled
from bertopic.representation._cohere import Cohere
from bertopic.representation._base import BaseRepresentation
from bertopic.representation._keybert import KeyBERTInspired
from bertopic.representation._mmr import MaximalMarginalRelevance
# Llama CPP Generator
try:
from bertopic.representation._llamacpp import LlamaCPP
except ModuleNotFoundError:
msg = "`pip install llama-cpp-python` \n\n"
LlamaCPP = NotInstalled("llama.cpp", "llama-cpp-python", custom_msg=msg)
# Text Generation using transformers
try:
from bertopic.representation._textgeneration import TextGeneration
except ModuleNotFoundError:
msg = "`pip install bertopic` without `--no-deps` \n\n"
TextGeneration = NotInstalled("TextGeneration", "transformers", custom_msg=msg)
# Zero-shot classification using transformers
try:
from bertopic.representation._zeroshot import ZeroShotClassification
except ModuleNotFoundError:
msg = "`pip install bertopic` without `--no-deps` \n\n"
ZeroShotClassification = NotInstalled("ZeroShotClassification", "transformers", custom_msg=msg)
# OpenAI Generator
try:
from bertopic.representation._openai import OpenAI
except ModuleNotFoundError:
msg = "`pip install openai` \n\n"
OpenAI = NotInstalled("OpenAI", "openai", custom_msg=msg)
# LiteLLM Generator
try:
from bertopic.representation._litellm import LiteLLM
except ModuleNotFoundError:
msg = "`pip install litellm` \n\n"
LiteLLM = NotInstalled("LiteLLM", "litellm", custom_msg=msg)
# LangChain Generator
try:
from bertopic.representation._langchain import LangChain
except ModuleNotFoundError:
msg = "`pip install langchain` \n\n"
LangChain = NotInstalled("langchain", "langchain", custom_msg=msg)
# POS using Spacy
try:
from bertopic.representation._pos import PartOfSpeech
except ModuleNotFoundError:
PartOfSpeech = NotInstalled("Part of Speech with Spacy", "spacy")
# Multimodal
try:
from bertopic.representation._visual import VisualRepresentation
except ModuleNotFoundError:
VisualRepresentation = NotInstalled("a visual representation model", "vision")
__all__ = [
"BaseRepresentation",
"TextGeneration",
"ZeroShotClassification",
"KeyBERTInspired",
"PartOfSpeech",
"MaximalMarginalRelevance",
"Cohere",
"OpenAI",
"LangChain",
"LiteLLM",
"LlamaCPP",
"VisualRepresentation",
]
@@ -1,40 +0,0 @@
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.base import BaseEstimator
from typing import Mapping, List, Tuple
class BaseRepresentation(BaseEstimator):
"""The base representation model for fine-tuning topic representations."""
def extract_topics(
self,
topic_model,
documents: pd.DataFrame,
c_tf_idf: csr_matrix,
topics: Mapping[str, List[Tuple[str, float]]],
) -> Mapping[str, List[Tuple[str, float]]]:
"""Extract topics.
Each representation model that inherits this class will have
its arguments (topic_model, documents, c_tf_idf, topics)
automatically passed. Therefore, the representation model
will only have access to the information about topics related
to those arguments.
Arguments:
topic_model: The BERTopic model that is fitted until topic
representations are calculated.
documents: A dataframe with columns "Document" and "Topic"
that contains all documents with each corresponding
topic.
c_tf_idf: A c-TF-IDF representation that is typically
identical to `topic_model.c_tf_idf_` except for
dynamic, class-based, and hierarchical topic modeling
where it is calculated on a subset of the documents.
topics: A dictionary with topic (key) and tuple of word and
weight (value) as calculated by c-TF-IDF. This is the
default topics that are returned if no representation
model is used.
"""
return topic_model.topic_representations_
@@ -1,209 +0,0 @@
import time
import pandas as pd
from tqdm import tqdm
from scipy.sparse import csr_matrix
from typing import Mapping, List, Tuple, Union, Callable
from bertopic.representation._base import BaseRepresentation
from bertopic.representation._utils import truncate_document, validate_truncate_document_parameters
DEFAULT_PROMPT = """
This is a list of texts where each collection of texts describes a topic. After each collection of texts, the name of the topic they represent is mentioned as a short, highly descriptive title
---
Topic:
Sample texts from this topic:
- Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
- Meat, but especially beef, is the worst food in terms of emissions.
- Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.
Keywords: meat beef eat eating emissions steak food health processed chicken
Topic name: Environmental impacts of eating meat
---
Topic:
Sample texts from this topic:
- I have ordered the product weeks ago but it still has not arrived!
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.
- I got a message stating that I received the monitor but that is not true!
- It took a month longer to deliver than was advised...
Keywords: deliver weeks product shipping long delivery received arrived arrive week
Topic name: Shipping and delivery issues
---
Topic:
Sample texts from this topic:
[DOCUMENTS]
Keywords: [KEYWORDS]
Topic name:"""
DEFAULT_SYSTEM_PROMPT = "You are an assistant that extracts high-level topics from texts."
class Cohere(BaseRepresentation):
"""Use the Cohere API to generate topic labels based on their
generative model.
Find more about their models here:
https://docs.cohere.ai/docs
Arguments:
client: A `cohere.Client`
model: Model to use within Cohere, defaults to `"command-r"`.
prompt: The prompt to be used in the model. If no prompt is given,
`self.default_prompt_` is used instead.
NOTE: Use `"[KEYWORDS]"` and `"[DOCUMENTS]"` in the prompt
to decide where the keywords and documents need to be
inserted.
system_prompt: The system prompt to be used in the model. If no system prompt is given,
`self.default_system_prompt_` is used instead.
delay_in_seconds: The delay in seconds between consecutive prompts
in order to prevent RateLimitErrors.
nr_docs: The number of documents to pass to Cohere if a prompt
with the `"[DOCUMENTS]"` tag is used.
diversity: The diversity of documents to pass to Cohere.
Accepts values between 0 and 1. Higher
values pass more diverse documents,
whereas lower values pass more similar documents.
doc_length: The maximum length of each document. If a document is longer,
it will be truncated. If None, the entire document is passed.
tokenizer: The tokenizer used to split the document into segments,
which are counted to determine the length of a document.
* If tokenizer is 'char', then the document is split up
into characters which are counted to adhere to `doc_length`
* If tokenizer is 'whitespace', the document is split up
into words separated by whitespaces. These words are counted
and truncated depending on `doc_length`
* If tokenizer is 'vectorizer', then the internal CountVectorizer
is used to tokenize the document. These tokens are counted
and truncated depending on `doc_length`
* If tokenizer is a callable, then that callable is used to tokenize
the document. These tokens are counted and truncated depending
on `doc_length`
Usage:
To use this, you will need to install cohere first:
`pip install cohere`
Then, get yourself an API key and use Cohere's API as follows:
```python
import cohere
from bertopic.representation import Cohere
from bertopic import BERTopic
# Create your representation model
co = cohere.Client(my_api_key)
representation_model = Cohere(co)
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
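The `doc_length` and `tokenizer` arguments described above bound how much of each
document reaches the prompt. The snippet below is a minimal, standalone sketch of that
idea for the `'char'` and `'whitespace'` settings and for a callable tokenizer.
`truncate_sketch` is a hypothetical helper written for illustration, not BERTopic's
`truncate_document`; in particular, re-joining tokens with spaces is an assumption:
```python
from typing import Callable, Optional, Union


def truncate_sketch(doc: str, doc_length: Optional[int], tokenizer: Union[str, Callable, None]) -> str:
    # Illustrative only: cut `doc` down to at most `doc_length` units
    if doc_length is None or tokenizer is None:
        return doc  # no limit requested, pass the document through unchanged
    if tokenizer == "char":
        return doc[:doc_length]  # count characters
    if tokenizer == "whitespace":
        return " ".join(doc.split()[:doc_length])  # count whitespace-separated words
    if callable(tokenizer):
        return " ".join(tokenizer(doc)[:doc_length])  # count tokens produced by the callable
    raise ValueError("unsupported tokenizer in this sketch")


text = "The parcel arrived three weeks late"
by_char = truncate_sketch(text, 10, "char")
by_word = truncate_sketch(text, 3, "whitespace")
print(by_char)  # The parcel
print(by_word)  # The parcel arrived
```
Counting words rather than characters usually avoids cutting a document in the middle of a word.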
You can also use a custom prompt:
```python
prompt = "I have the following documents: [DOCUMENTS]. What topic do they contain?"
representation_model = Cohere(co, prompt=prompt)
```
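To make the role of the `"[KEYWORDS]"` and `"[DOCUMENTS]"` tags concrete, here is a small
standalone sketch of the substitution: keywords are joined with commas and each document
becomes a `- ` bullet line. `fill_prompt` is a hypothetical helper for illustration; it
mirrors the behaviour described above but is not part of the Cohere or BERTopic API:
```python
NEWLINE = chr(10)  # newline character, spelled this way to keep backslashes out of this docstring


def fill_prompt(prompt: str, keywords: list, docs: list) -> str:
    # Illustrative only: substitute the two tags in a prompt template
    if "[KEYWORDS]" in prompt:
        prompt = prompt.replace("[KEYWORDS]", ", ".join(keywords))
    if "[DOCUMENTS]" in prompt:
        bullets = "".join(f"- {doc}{NEWLINE}" for doc in docs)
        prompt = prompt.replace("[DOCUMENTS]", bullets)
    return prompt


template = "Documents:" + NEWLINE + "[DOCUMENTS]Keywords: [KEYWORDS]"
filled = fill_prompt(template, ["delivery", "late"], ["It never arrived.", "Shipping took a month."])
print(filled)
```
A tag that does not occur in the template is simply skipped, so a prompt may use keywords, documents, or both.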
"""
def __init__(
self,
client,
model: str = "command-r",
prompt: str = None,
system_prompt: str = None,
delay_in_seconds: float = None,
nr_docs: int = 4,
diversity: float = None,
doc_length: int = None,
tokenizer: Union[str, Callable] = None,
):
self.client = client
self.model = model
self.prompt = prompt if prompt is not None else DEFAULT_PROMPT
self.system_prompt = system_prompt if system_prompt is not None else DEFAULT_SYSTEM_PROMPT
self.default_prompt_ = DEFAULT_PROMPT
self.default_system_prompt_ = DEFAULT_SYSTEM_PROMPT
self.delay_in_seconds = delay_in_seconds
self.nr_docs = nr_docs
self.diversity = diversity
self.doc_length = doc_length
self.tokenizer = tokenizer
validate_truncate_document_parameters(self.tokenizer, self.doc_length)
self.prompts_ = []
def extract_topics(
self,
topic_model,
documents: pd.DataFrame,
c_tf_idf: csr_matrix,
topics: Mapping[str, List[Tuple[str, float]]],
) -> Mapping[str, List[Tuple[str, float]]]:
"""Extract topics.
Arguments:
topic_model: A BERTopic model
documents: All input documents
c_tf_idf: The topic c-TF-IDF representation
topics: The candidate topics as calculated with c-TF-IDF
Returns:
updated_topics: Updated topic representations
"""
# Extract the top `nr_docs` representative documents per topic
repr_docs_mappings, _, _, _ = topic_model._extract_representative_docs(
c_tf_idf, documents, topics, 500, self.nr_docs, self.diversity
)
# Generate using Cohere's Language Model
updated_topics = {}
for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
prompt = self._create_prompt(truncated_docs, topic, topics)
self.prompts_.append(prompt)
# Delay
if self.delay_in_seconds:
time.sleep(self.delay_in_seconds)
request = self.client.chat(
model=self.model,
preamble=self.system_prompt,
message=prompt,
max_tokens=50,
stop_sequences=["\n"],
)
label = request.text.strip()
updated_topics[topic] = [(label, 1)] + [("", 0) for _ in range(9)]
return updated_topics
def _create_prompt(self, docs, topic, topics):
keywords = list(zip(*topics[topic]))[0]
# Use the Default Chat Prompt
if self.prompt == DEFAULT_PROMPT:
prompt = self.prompt.replace("[KEYWORDS]", ", ".join(keywords))
prompt = self._replace_documents(prompt, docs)
# Use a custom prompt that leverages keywords, documents or both using
# custom tags, namely [KEYWORDS] and [DOCUMENTS] respectively
else:
prompt = self.prompt
if "[KEYWORDS]" in prompt:
prompt = prompt.replace("[KEYWORDS]", ", ".join(keywords))
if "[DOCUMENTS]" in prompt:
prompt = self._replace_documents(prompt, docs)
return prompt
@staticmethod
def _replace_documents(prompt, docs):
to_replace = ""
for doc in docs:
to_replace += f"- {doc}\n"
prompt = prompt.replace("[DOCUMENTS]", to_replace)
return prompt
@@ -1,222 +0,0 @@
import numpy as np
import pandas as pd
from packaging import version
from scipy.sparse import csr_matrix
from typing import Mapping, List, Tuple, Union
from sklearn.metrics.pairwise import cosine_similarity
from bertopic.representation._base import BaseRepresentation
from sklearn import __version__ as sklearn_version
class KeyBERTInspired(BaseRepresentation):
def __init__(
self,
top_n_words: int = 10,
nr_repr_docs: int = 5,
nr_samples: int = 500,
nr_candidate_words: int = 100,
random_state: int = 42,
):
"""Use a KeyBERT-like model to fine-tune the topic representations.
The algorithm follows KeyBERT but does some optimization in
order to speed up inference.
The steps are as follows. First, we extract the top n representative
documents per topic. To extract the representative documents, we
randomly sample a number of candidate documents per cluster
which is controlled by the `nr_samples` parameter. Then,
the top n representative documents (controlled by the `nr_repr_docs`
parameter) are extracted by calculating
the c-TF-IDF representation for the candidate documents and finding,
through cosine similarity, which are closest to the topic c-TF-IDF representation.
Next, the candidate words per topic are extracted based on their
c-TF-IDF representation; their number is controlled by the
`nr_candidate_words` parameter.
Then, we extract the embeddings for words and representative documents
and create topic embeddings by averaging the representative documents.
Finally, the most similar words to each topic are extracted by
calculating the cosine similarity between word and topic embeddings.
Arguments:
top_n_words: The top n words to extract per topic.
nr_repr_docs: The number of representative documents to extract per cluster.
nr_samples: The number of candidate documents to extract per cluster.
nr_candidate_words: The number of candidate words per cluster.
random_state: The random state for randomly sampling candidate documents.
Usage:
```python
from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic
# Create your representation model
representation_model = KeyBERTInspired()
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
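The last two steps (averaging representative-document embeddings into a topic embedding,
then ranking candidate words by cosine similarity to it) can be sketched without any model.
The vectors below are made-up two-dimensional toy embeddings, and `cosine` and `rank_words`
are hypothetical helpers for illustration; the actual implementation uses the embedding
model and scikit-learn's `cosine_similarity`:
```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank_words(doc_embeddings, word_embeddings, top_n_words):
    # Topic embedding = element-wise mean of the representative document embeddings
    dim = len(doc_embeddings[0])
    topic = [sum(doc[i] for doc in doc_embeddings) / len(doc_embeddings) for i in range(dim)]
    # Score every candidate word against the topic embedding, best first
    scored = [(word, cosine(topic, vec)) for word, vec in word_embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n_words]


docs = [[1.0, 0.0], [0.8, 0.2]]  # toy document embeddings
words = {"delivery": [0.9, 0.1], "shipping": [0.6, 0.4], "banana": [0.0, 1.0]}  # toy word embeddings
top = rank_words(docs, words, top_n_words=2)
print([word for word, _ in top])  # ['delivery', 'shipping']
```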
"""
self.top_n_words = top_n_words
self.nr_repr_docs = nr_repr_docs
self.nr_samples = nr_samples
self.nr_candidate_words = nr_candidate_words
self.random_state = random_state
def extract_topics(
self,
topic_model,
documents: pd.DataFrame,
c_tf_idf: csr_matrix,
topics: Mapping[str, List[Tuple[str, float]]],
embeddings: np.ndarray = None,
) -> Mapping[str, List[Tuple[str, float]]]:
"""Extract topics.
Arguments:
topic_model: A BERTopic model
documents: All input documents
c_tf_idf: The topic c-TF-IDF representation
topics: The candidate topics as calculated with c-TF-IDF
embeddings: Pre-trained document embeddings. These can be used
instead of an embedding model
Returns:
updated_topics: Updated topic representations
"""
# We extract the top n representative documents per class
_, representative_docs, repr_doc_indices, _ = topic_model._extract_representative_docs(
c_tf_idf, documents, topics, self.nr_samples, self.nr_repr_docs
)
# If document embeddings are precomputed, extract the embeddings of the representative documents based on repr_doc_indices
repr_embeddings = None
if embeddings is not None:
repr_embeddings = [embeddings[index] for index in np.concatenate(repr_doc_indices)]
# We extract the top n words per class
topics = self._extract_candidate_words(topic_model, c_tf_idf, topics)
# We calculate the similarity between word and document embeddings and create
# topic embeddings from the representative document embeddings
sim_matrix, words = self._extract_embeddings(
topic_model, topics, representative_docs, repr_doc_indices, repr_embeddings
)
# Find the best matching words based on the similarity matrix for each topic
updated_topics = self._extract_top_words(words, topics, sim_matrix)
return updated_topics
def _extract_candidate_words(
self,
topic_model,
c_tf_idf: csr_matrix,
topics: Mapping[str, List[Tuple[str, float]]],
) -> Mapping[str, List[Tuple[str, float]]]:
"""For each topic, extract candidate words based on the c-TF-IDF
representation.
Arguments:
topic_model: A BERTopic model
c_tf_idf: The topic c-TF-IDF representation
topics: The top words per topic
Returns:
topics: The top `self.nr_candidate_words` candidate words per topic
"""
labels = [int(label) for label in sorted(list(topics.keys()))]
# Scikit-Learn Deprecation: get_feature_names is deprecated in 1.0
# and will be removed in 1.2. Please use get_feature_names_out instead.
if version.parse(sklearn_version) >= version.parse("1.0.0"):
words = topic_model.vectorizer_model.get_feature_names_out()
else:
words = topic_model.vectorizer_model.get_feature_names()
indices = topic_model._top_n_idx_sparse(c_tf_idf, self.nr_candidate_words)
scores = topic_model._top_n_values_sparse(c_tf_idf, indices)
sorted_indices = np.argsort(scores, 1)
indices = np.take_along_axis(indices, sorted_indices, axis=1)
scores = np.take_along_axis(scores, sorted_indices, axis=1)
# Get the top `nr_candidate_words` words per topic based on c-TF-IDF score
topics = {
label: [
(words[word_index], score) if word_index is not None and score > 0 else ("", 0.00001)
for word_index, score in zip(indices[index][::-1], scores[index][::-1])
]
for index, label in enumerate(labels)
}
topics = {label: list(zip(*values[: self.nr_candidate_words]))[0] for label, values in topics.items()}
return topics
def _extract_embeddings(
self,
topic_model,
topics: Mapping[str, List[Tuple[str, float]]],
representative_docs: List[str],
repr_doc_indices: List[List[int]],
repr_embeddings: np.ndarray = None,
) -> Tuple[np.ndarray, List[str]]:
"""Extract the representative document embeddings and create topic embeddings.
Then extract word embeddings and calculate the cosine similarity between topic
embeddings and the word embeddings. Topic embeddings are the average of
representative document embeddings.
Arguments:
topic_model: A BERTopic model
topics: The top words per topic
representative_docs: A flat list of representative documents
repr_doc_indices: The indices of representative documents
that belong to each topic
repr_embeddings: Embeddings of respective representative_docs
Returns:
sim: The similarity matrix between word and topic embeddings
vocab: The unique candidate words across all topics
"""
# Calculate representative document embeddings if there are no precomputed embeddings.
if repr_embeddings is None:
repr_embeddings = topic_model._extract_embeddings(representative_docs, method="document", verbose=False)
topic_embeddings = [np.mean(repr_embeddings[i[0] : i[-1] + 1], axis=0) for i in repr_doc_indices]
# Calculate word embeddings and extract best matching with updated topic_embeddings
vocab = list(set([word for words in topics.values() for word in words]))
word_embeddings = topic_model._extract_embeddings(vocab, method="document", verbose=False)
sim = cosine_similarity(topic_embeddings, word_embeddings)
return sim, vocab
def _extract_top_words(
self,
vocab: List[str],
topics: Mapping[str, List[Tuple[str, float]]],
sim: np.ndarray,
) -> Mapping[str, List[Tuple[str, float]]]:
"""Extract the top n words per topic based on the
similarity matrix between topics and words.
Arguments:
vocab: The unique candidate words across all topics
topics: The top words per topic
sim: The similarity matrix between word and topic embeddings
Returns:
updated_topics: The updated topic representations
"""
labels = [int(label) for label in sorted(list(topics.keys()))]
updated_topics = {}
for i, topic in enumerate(labels):
indices = [vocab.index(word) for word in topics[topic]]
values = sim[:, indices][i]
word_indices = [indices[index] for index in np.argsort(values)[-self.top_n_words :]]
updated_topics[topic] = [
(vocab[index], val) for val, index in zip(np.sort(values)[-self.top_n_words :], word_indices)
][::-1]
return updated_topics
@@ -1,213 +0,0 @@
import pandas as pd
from langchain.docstore.document import Document
from scipy.sparse import csr_matrix
from typing import Callable, Mapping, List, Tuple, Union
from bertopic.representation._base import BaseRepresentation
from bertopic.representation._utils import truncate_document, validate_truncate_document_parameters
DEFAULT_PROMPT = "What are these documents about? Please give a single label."
class LangChain(BaseRepresentation):
"""Using chains in langchain to generate topic labels.
The classic example uses `langchain.chains.question_answering.load_qa_chain`.
This returns a chain that takes a list of documents and a question as input.
You can also use Runnables such as those composed using the LangChain Expression Language.
Arguments:
chain: The langchain chain or Runnable with a `batch` method.
Input keys must be `input_documents` and `question`.
Output key must be `output_text`.
prompt: The prompt to be used in the model. If no prompt is given,
`self.default_prompt_` is used instead.
NOTE: Use `"[KEYWORDS]"` in the prompt
to decide where the keywords need to be
inserted. Keywords won't be included unless
indicated. Unlike other representation models,
LangChain does not use the `"[DOCUMENTS]"` tag
to insert documents into the prompt. The load_qa_chain function
formats the representative documents within the prompt.
nr_docs: The number of documents to pass to LangChain
diversity: The diversity of documents to pass to LangChain.
Accepts values between 0 and 1. Higher
values pass more diverse documents,
whereas lower values pass more similar documents.
doc_length: The maximum length of each document. If a document is longer,
it will be truncated. If None, the entire document is passed.
tokenizer: The tokenizer used to split the document into segments,
which are counted to determine the length of a document.
* If tokenizer is 'char', then the document is split up
into characters which are counted to adhere to `doc_length`
* If tokenizer is 'whitespace', the document is split up
into words separated by whitespaces. These words are counted
and truncated depending on `doc_length`
* If tokenizer is 'vectorizer', then the internal CountVectorizer
is used to tokenize the document. These tokens are counted
and truncated depending on `doc_length`. They are decoded with
whitespaces.
* If tokenizer is a callable, then that callable is used to tokenize
the document. These tokens are counted and truncated depending
on `doc_length`
chain_config: The configuration for the langchain chain. Can be used to set options
like max_concurrency to avoid rate limiting errors.
Usage:
To use this, you will need to install the langchain package first.
Additionally, you will need an underlying LLM to support langchain,
like openai:
`pip install langchain`
`pip install openai`
Then, you can create your chain as follows:
```python
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), chain_type="stuff")
```
Finally, you can pass the chain to BERTopic as follows:
```python
from bertopic.representation import LangChain
# Create your representation model
representation_model = LangChain(chain)
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
You can also use a custom prompt:
```python
prompt = "What are these documents about? Please give a single label."
representation_model = LangChain(chain, prompt=prompt)
```
You can also use a Runnable instead of a chain.
The example below uses the LangChain Expression Language:
```python
from bertopic.representation import LangChain
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatAnthropic
from langchain.schema.document import Document
from langchain.schema.runnable import RunnablePassthrough
from langchain_experimental.data_anonymizer.presidio import PresidioReversibleAnonymizer
representation_prompt = ...
representation_llm = ...
# We will construct a special privacy-preserving chain using Microsoft Presidio
pii_handler = PresidioReversibleAnonymizer(analyzed_fields=["PERSON"])
chain = (
{
"input_documents": (
lambda inp: [
Document(
page_content=pii_handler.anonymize(
d.page_content,
language="en",
),
)
for d in inp["input_documents"]
]
),
"question": RunnablePassthrough(),
}
| load_qa_chain(representation_llm, chain_type="stuff")
| (lambda output: {"output_text": pii_handler.deanonymize(output["output_text"])})
)
representation_model = LangChain(chain, prompt=representation_prompt)
```
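The only contract `LangChain` relies on is that the object passed as `chain` has a `batch`
method that accepts a list of dicts with `input_documents` and `question` keys and returns
a list of dicts with an `output_text` key. A toy stand-in makes that contract explicit.
`EchoChain` and `FakeDocument` are hypothetical names used purely for illustration, and
no LLM is called:
```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    # Stand-in for a LangChain Document, which exposes `page_content`
    page_content: str


class EchoChain:
    # Toy chain: labels each topic with the first word of its first document
    def batch(self, inputs, config=None):
        outputs = []
        for item in inputs:
            first_doc = item["input_documents"][0].page_content
            outputs.append({"output_text": first_doc.split()[0]})
        return outputs


question = "What are these documents about? Please give a single label."
inputs = [
    {"input_documents": [FakeDocument("Shipping took a month")], "question": question},
    {"input_documents": [FakeDocument("Beef has high emissions")], "question": question},
]
labels = [output["output_text"].strip() for output in EchoChain().batch(inputs=inputs, config=None)]
print(labels)  # ['Shipping', 'Beef']
```
Any object honouring this input and output shape should be usable in place of a chain built with `load_qa_chain`.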
"""
def __init__(
self,
chain,
prompt: str = None,
nr_docs: int = 4,
diversity: float = None,
doc_length: int = None,
tokenizer: Union[str, Callable] = None,
chain_config=None,
):
self.chain = chain
self.prompt = prompt if prompt is not None else DEFAULT_PROMPT
self.default_prompt_ = DEFAULT_PROMPT
self.chain_config = chain_config
self.nr_docs = nr_docs
self.diversity = diversity
self.doc_length = doc_length
self.tokenizer = tokenizer
validate_truncate_document_parameters(self.tokenizer, self.doc_length)
def extract_topics(
self,
topic_model,
documents: pd.DataFrame,
c_tf_idf: csr_matrix,
topics: Mapping[str, List[Tuple[str, float]]],
) -> Mapping[str, List[Tuple[str, int]]]:
"""Extract topics.
Arguments:
topic_model: A BERTopic model
documents: All input documents
c_tf_idf: The topic c-TF-IDF representation
topics: The candidate topics as calculated with c-TF-IDF
Returns:
updated_topics: Updated topic representations
"""
# Extract the top `nr_docs` representative documents per topic
repr_docs_mappings, _, _, _ = topic_model._extract_representative_docs(
c_tf_idf=c_tf_idf,
documents=documents,
topics=topics,
nr_samples=500,
nr_repr_docs=self.nr_docs,
diversity=self.diversity,
)
# Generate label using langchain's batch functionality
chain_docs: List[List[Document]] = [
[
Document(page_content=truncate_document(topic_model, self.doc_length, self.tokenizer, doc))
for doc in docs
]
for docs in repr_docs_mappings.values()
]
# `self.chain` must take `input_documents` and `question` as input keys
# Use a custom prompt that leverages keywords, using the tag: [KEYWORDS]
if "[KEYWORDS]" in self.prompt:
prompts = []
for topic in topics:
keywords = list(zip(*topics[topic]))[0]
prompt = self.prompt.replace("[KEYWORDS]", ", ".join(keywords))
prompts.append(prompt)
inputs = [{"input_documents": docs, "question": prompt} for docs, prompt in zip(chain_docs, prompts)]
else:
inputs = [{"input_documents": docs, "question": self.prompt} for docs in chain_docs]
# `self.chain` must return a dict with an `output_text` key
# same output key as the `StuffDocumentsChain` returned by `load_qa_chain`
outputs = self.chain.batch(inputs=inputs, config=self.chain_config)
labels = [output["output_text"].strip() for output in outputs]
updated_topics = {
topic: [(label, 1)] + [("", 0) for _ in range(9)] for topic, label in zip(repr_docs_mappings.keys(), labels)
}
return updated_topics
@@ -1,176 +0,0 @@
import time
from litellm import completion
import pandas as pd
from scipy.sparse import csr_matrix
from typing import Mapping, List, Tuple, Any
from bertopic.representation._base import BaseRepresentation
from bertopic.representation._utils import retry_with_exponential_backoff
DEFAULT_PROMPT = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
Based on the information above, extract a short topic label in the following format:
topic: <topic label>
"""
class LiteLLM(BaseRepresentation):
"""Using the LiteLLM API to generate topic labels.
For an overview of models see:
https://docs.litellm.ai/docs/providers
Arguments:
model: Model to use. Defaults to OpenAI's "gpt-3.5-turbo".
generator_kwargs: Kwargs passed to `litellm.completion`.
prompt: The prompt to be used in the model. If no prompt is given,
`self.default_prompt_` is used instead.
NOTE: Use `"[KEYWORDS]"` and `"[DOCUMENTS]"` in the prompt
to decide where the keywords and documents need to be
inserted.
delay_in_seconds: The delay in seconds between consecutive prompts
in order to prevent RateLimitErrors.
exponential_backoff: Retry requests with a random exponential backoff.
A short sleep is used when a rate limit error is hit,
then the request is retried. The sleep length is increased
each time an error is hit, up to 10 unsuccessful requests.
If True, overrides `delay_in_seconds`.
nr_docs: The number of documents to pass to LiteLLM if a prompt
with the `"[DOCUMENTS]"` tag is used.
diversity: The diversity of documents to pass to LiteLLM.
Accepts values between 0 and 1. Higher
values pass more diverse documents,
whereas lower values pass more similar documents.
Usage:
To use this, you will need to install the litellm package first:
`pip install litellm`
Then, get yourself an API key of any provider (for instance OpenAI) and use it as follows:
```python
import os
from bertopic.representation import LiteLLM
from bertopic import BERTopic
# set ENV variables
os.environ["OPENAI_API_KEY"] = "your-openai-key"
# Create your representation model
representation_model = LiteLLM(model="gpt-3.5-turbo")
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
You can also use a custom prompt:
```python
prompt = "I have the following documents: [DOCUMENTS] \nThese documents are about the following topic: '"
representation_model = LiteLLM(model="gpt-3.5-turbo", prompt=prompt)
```
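The `exponential_backoff` option retries a failed request after an increasing, randomised
wait. A minimal standalone sketch of that pattern follows. `retry_sketch`, its parameters,
and the `flaky` function are hypothetical and written for illustration; they do not
reproduce the signature of BERTopic's `retry_with_exponential_backoff`:
```python
import random
import time


def retry_sketch(func, initial_delay=0.01, base=2.0, max_retries=10, errors=(RuntimeError,)):
    # Wrap `func` so that the listed errors trigger a retry with a growing, jittered delay
    def wrapper(*args, **kwargs):
        delay = initial_delay
        for attempt in range(max_retries + 1):
            try:
                return func(*args, **kwargs)
            except errors:
                if attempt == max_retries:
                    raise  # give up after the final retry
                time.sleep(delay)
                delay *= base * (1 + random.random())  # exponential growth plus jitter

    return wrapper


calls = {"count": 0}


def flaky():
    # Fails twice, then succeeds, to exercise the retry path
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limited")
    return "topic: shipping delays"


result = retry_sketch(flaky)()
print(result, calls["count"])  # topic: shipping delays 3
```
The jitter spreads retries out so that several clients hitting the same limit do not all retry at the same moment.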
""" # noqa: D301
def __init__(
self,
model: str = "gpt-3.5-turbo",
prompt: str = None,
generator_kwargs: Mapping[str, Any] = {},
delay_in_seconds: float = None,
exponential_backoff: bool = False,
nr_docs: int = 4,
diversity: float = None,
):
self.model = model
self.prompt = prompt if prompt else DEFAULT_PROMPT
self.default_prompt_ = DEFAULT_PROMPT
self.delay_in_seconds = delay_in_seconds
self.exponential_backoff = exponential_backoff
self.nr_docs = nr_docs
self.diversity = diversity
self.generator_kwargs = generator_kwargs
if self.generator_kwargs.get("model"):
self.model = generator_kwargs.get("model")
if self.generator_kwargs.get("prompt"):
del self.generator_kwargs["prompt"]
def extract_topics(
self, topic_model, documents: pd.DataFrame, c_tf_idf: csr_matrix, topics: Mapping[str, List[Tuple[str, float]]]
) -> Mapping[str, List[Tuple[str, float]]]:
"""Extract topics.
Arguments:
topic_model: A BERTopic model
documents: All input documents
c_tf_idf: The topic c-TF-IDF representation
topics: The candidate topics as calculated with c-TF-IDF
Returns:
updated_topics: Updated topic representations
"""
# Extract the top n representative documents per topic
repr_docs_mappings, _, _, _ = topic_model._extract_representative_docs(
c_tf_idf, documents, topics, 500, self.nr_docs, self.diversity
)
# Generate using a (Large) Language Model
updated_topics = {}
for topic, docs in repr_docs_mappings.items():
prompt = self._create_prompt(docs, topic, topics)
# Delay
if self.delay_in_seconds:
time.sleep(self.delay_in_seconds)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": prompt},
]
kwargs = {"model": self.model, "messages": messages, **self.generator_kwargs}
if self.exponential_backoff:
response = chat_completions_with_backoff(**kwargs)
else:
response = completion(**kwargs)
label = response["choices"][0]["message"]["content"].strip().replace("topic: ", "")
updated_topics[topic] = [(label, 1)]
return updated_topics
def _create_prompt(self, docs, topic, topics):
keywords = list(zip(*topics[topic]))[0]
# Use the Default Chat Prompt
if self.prompt == DEFAULT_PROMPT:
prompt = self.prompt.replace("[KEYWORDS]", " ".join(keywords))
prompt = self._replace_documents(prompt, docs)
# Use a custom prompt that leverages keywords, documents or both using
# custom tags, namely [KEYWORDS] and [DOCUMENTS] respectively
else:
prompt = self.prompt
if "[KEYWORDS]" in prompt:
prompt = prompt.replace("[KEYWORDS]", " ".join(keywords))
if "[DOCUMENTS]" in prompt:
prompt = self._replace_documents(prompt, docs)
return prompt
@staticmethod
def _replace_documents(prompt, docs):
to_replace = ""
for doc in docs:
to_replace += f"- {doc[:255]}\n"
prompt = prompt.replace("[DOCUMENTS]", to_replace)
return prompt
def chat_completions_with_backoff(**kwargs):
return retry_with_exponential_backoff(
completion,
)(**kwargs)
@@ -1,215 +0,0 @@
import pandas as pd
from tqdm import tqdm
from scipy.sparse import csr_matrix
from llama_cpp import Llama
from typing import Mapping, List, Tuple, Any, Union, Callable
from bertopic.representation._base import BaseRepresentation
from bertopic.representation._utils import truncate_document, validate_truncate_document_parameters
DEFAULT_PROMPT = """
This is a list of texts where each collection of texts describes a topic. After each collection of texts, the name of the topic they represent is mentioned as a short, highly descriptive title
---
Topic:
Sample texts from this topic:
- Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
- Meat, but especially beef, is the worst food in terms of emissions.
- Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.
Keywords: meat beef eat eating emissions steak food health processed chicken
Topic name: Environmental impacts of eating meat
---
Topic:
Sample texts from this topic:
- I have ordered the product weeks ago but it still has not arrived!
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.
- I got a message stating that I received the monitor but that is not true!
- It took a month longer to deliver than was advised...
Keywords: deliver weeks product shipping long delivery received arrived arrive week
Topic name: Shipping and delivery issues
---
Topic:
Sample texts from this topic:
[DOCUMENTS]
Keywords: [KEYWORDS]
Topic name:"""
DEFAULT_SYSTEM_PROMPT = "You are an assistant that extracts high-level topics from texts."
class LlamaCPP(BaseRepresentation):
"""A llama.cpp implementation to use as a representation model.
Arguments:
model: Either a string pointing towards a local LLM or a
`llama_cpp.Llama` object.
prompt: The prompt to be used in the model. If no prompt is given,
`self.default_prompt_` is used instead.
NOTE: Use `"[KEYWORDS]"` and `"[DOCUMENTS]"` in the prompt
to decide where the keywords and documents need to be
inserted.
system_prompt: The system prompt to be used in the model. If no system prompt is given,
`self.default_system_prompt_` is used instead.
pipeline_kwargs: Kwargs that you can pass to the `llama_cpp.Llama`
when it is called such as `max_tokens` to be generated.
nr_docs: The number of documents to pass to the model if a prompt
with the `"[DOCUMENTS]"` tag is used.
diversity: The diversity of documents to pass to the model.
Accepts values between 0 and 1. A higher
value results in passing more diverse documents,
whereas a lower value passes more similar documents.
doc_length: The maximum length of each document. If a document is longer,
it will be truncated. If None, the entire document is passed.
tokenizer: The tokenizer used to split the document into segments,
which are counted to determine the length of a document.
* If tokenizer is 'char', then the document is split up
into characters which are counted to adhere to `doc_length`
* If tokenizer is 'whitespace', the document is split up
into words separated by whitespaces. These words are counted
and truncated depending on `doc_length`
* If tokenizer is 'vectorizer', then the internal CountVectorizer
is used to tokenize the document. These tokens are counted
and truncated depending on `doc_length`
* If tokenizer is a callable, then that callable is used to tokenize
the document. These tokens are counted and truncated depending
on `doc_length`
Usage:
To use llama.cpp, first download the LLM:
```bash
wget https://huggingface.co/TheBloke/zephyr-7B-alpha-GGUF/resolve/main/zephyr-7b-alpha.Q4_K_M.gguf
```
Then, use the model with BERTopic in just a couple of lines:
```python
from bertopic import BERTopic
from bertopic.representation import LlamaCPP
# Use llama.cpp to load in a 4-bit quantized version of Zephyr 7B Alpha
representation_model = LlamaCPP("zephyr-7b-alpha.Q4_K_M.gguf")
# Create our BERTopic model
topic_model = BERTopic(representation_model=representation_model, verbose=True)
```
If you want to have more control over the LLM's parameters, you can run it like so:
```python
from bertopic import BERTopic
from bertopic.representation import LlamaCPP
from llama_cpp import Llama
# Use llama.cpp to load in a 4-bit quantized version of Zephyr 7B Alpha
llm = Llama(model_path="zephyr-7b-alpha.Q4_K_M.gguf", n_gpu_layers=-1, n_ctx=4096, stop="Q:")
representation_model = LlamaCPP(llm)
# Create our BERTopic model
topic_model = BERTopic(representation_model=representation_model, verbose=True)
```
"""
def __init__(
self,
model: Union[str, Llama],
prompt: str = None,
system_prompt: str = None,
pipeline_kwargs: Mapping[str, Any] = {},
nr_docs: int = 4,
diversity: float = None,
doc_length: int = None,
tokenizer: Union[str, Callable] = None,
):
if isinstance(model, str):
self.model = Llama(model_path=model, n_gpu_layers=-1, stop="\n", chat_format="ChatML")
elif isinstance(model, Llama):
self.model = model
else:
raise ValueError(
"Make sure that the model that you "
"pass is either a string referring to a "
"local LLM or a `llama_cpp.Llama` object."
)
self.prompt = prompt if prompt is not None else DEFAULT_PROMPT
self.system_prompt = system_prompt if system_prompt is not None else DEFAULT_SYSTEM_PROMPT
self.default_prompt_ = DEFAULT_PROMPT
self.default_system_prompt_ = DEFAULT_SYSTEM_PROMPT
self.pipeline_kwargs = pipeline_kwargs
self.nr_docs = nr_docs
self.diversity = diversity
self.doc_length = doc_length
self.tokenizer = tokenizer
validate_truncate_document_parameters(self.tokenizer, self.doc_length)
self.prompts_ = []
def extract_topics(
self,
topic_model,
documents: pd.DataFrame,
c_tf_idf: csr_matrix,
topics: Mapping[str, List[Tuple[str, float]]],
) -> Mapping[str, List[Tuple[str, float]]]:
"""Extract topic representations and return a single label.
Arguments:
topic_model: A BERTopic model
documents: All input documents
c_tf_idf: The topic c-TF-IDF representation
topics: The candidate topics as calculated with c-TF-IDF
Returns:
updated_topics: Updated topic representations
"""
# Extract the top `nr_docs` representative documents per topic
repr_docs_mappings, _, _, _ = topic_model._extract_representative_docs(
c_tf_idf, documents, topics, 500, self.nr_docs, self.diversity
)
updated_topics = {}
for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
# Prepare prompt
truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
prompt = self._create_prompt(truncated_docs, topic, topics)
self.prompts_.append(prompt)
# Extract result from generator and use that as label
# topic_description = self.model(prompt, **self.pipeline_kwargs)["choices"]
topic_description = self.model.create_chat_completion(
messages=[{"role": "system", "content": self.system_prompt}, {"role": "user", "content": prompt}],
**self.pipeline_kwargs,
)
label = topic_description["choices"][0]["message"]["content"].strip()
updated_topics[topic] = [(label, 1)] + [("", 0) for _ in range(9)]
return updated_topics
def _create_prompt(self, docs, topic, topics):
keywords = list(zip(*topics[topic]))[0]
# Use the Default Chat Prompt
if self.prompt == DEFAULT_PROMPT:
prompt = self.prompt.replace("[KEYWORDS]", ", ".join(keywords))
prompt = self._replace_documents(prompt, docs)
# Use a custom prompt that leverages keywords, documents or both using
# custom tags, namely [KEYWORDS] and [DOCUMENTS] respectively
else:
prompt = self.prompt
if "[KEYWORDS]" in prompt:
prompt = prompt.replace("[KEYWORDS]", ", ".join(keywords))
if "[DOCUMENTS]" in prompt:
prompt = self._replace_documents(prompt, docs)
return prompt
@staticmethod
def _replace_documents(prompt, docs):
to_replace = ""
for doc in docs:
to_replace += f"- {doc}\n"
prompt = prompt.replace("[DOCUMENTS]", to_replace)
return prompt
@@ -1,128 +0,0 @@
import warnings
import numpy as np
import pandas as pd
from typing import List, Mapping, Tuple
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
from bertopic.representation._base import BaseRepresentation
class MaximalMarginalRelevance(BaseRepresentation):
"""Calculate Maximal Marginal Relevance (MMR)
between candidate keywords and the document.
MMR considers the similarity of keywords/keyphrases with the
document, along with the similarity of already selected
keywords and keyphrases. This results in a selection of keywords
that are relevant to the document while remaining diverse among themselves.
Arguments:
diversity: How diverse the selected keywords/keyphrases are.
Values range between 0 and 1, with 0 being not diverse at all
and 1 being most diverse.
top_n_words: The number of keywords/keyphrases to return
Usage:
```python
from bertopic.representation import MaximalMarginalRelevance
from bertopic import BERTopic
# Create your representation model
representation_model = MaximalMarginalRelevance(diversity=0.3)
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
"""
def __init__(self, diversity: float = 0.1, top_n_words: int = 10):
self.diversity = diversity
self.top_n_words = top_n_words
def extract_topics(
self,
topic_model,
documents: pd.DataFrame,
c_tf_idf: csr_matrix,
topics: Mapping[str, List[Tuple[str, float]]],
) -> Mapping[str, List[Tuple[str, float]]]:
"""Extract topic representations.
Arguments:
topic_model: The BERTopic model
documents: Not used
c_tf_idf: Not used
topics: The candidate topics as calculated with c-TF-IDF
Returns:
updated_topics: Updated topic representations
"""
if topic_model.embedding_model is None:
warnings.warn(
"MaximalMarginalRelevance can only be used if BERTopic was instantiated "
"with the `embedding_model` parameter."
)
return topics
updated_topics = {}
for topic, topic_words in topics.items():
words = [word[0] for word in topic_words]
word_embeddings = topic_model._extract_embeddings(words, method="word", verbose=False)
topic_embedding = topic_model._extract_embeddings(" ".join(words), method="word", verbose=False).reshape(
1, -1
)
topic_words = mmr(
topic_embedding,
word_embeddings,
words,
self.diversity,
self.top_n_words,
)
updated_topics[topic] = [(word, value) for word, value in topics[topic] if word in topic_words]
return updated_topics
def mmr(
doc_embedding: np.ndarray,
word_embeddings: np.ndarray,
words: List[str],
diversity: float = 0.1,
top_n: int = 10,
) -> List[str]:
"""Maximal Marginal Relevance.
Arguments:
doc_embedding: The document embeddings
word_embeddings: The embeddings of the selected candidate keywords/phrases
words: The selected candidate keywords/keyphrases
diversity: The diversity of the selected embeddings.
Values between 0 and 1.
top_n: The top n items to return
Returns:
List[str]: The selected keywords/keyphrases
"""
# Extract similarity within words, and between words and the document
word_doc_similarity = cosine_similarity(word_embeddings, doc_embedding)
word_similarity = cosine_similarity(word_embeddings)
# Initialize candidates and select the best keyword/keyphrase first
keywords_idx = [np.argmax(word_doc_similarity)]
candidates_idx = [i for i in range(len(words)) if i != keywords_idx[0]]
for _ in range(top_n - 1):
# Extract similarities within candidates and
# between candidates and selected keywords/phrases
candidate_similarities = word_doc_similarity[candidates_idx, :]
target_similarities = np.max(word_similarity[candidates_idx][:, keywords_idx], axis=1)
# Calculate MMR
mmr = (1 - diversity) * candidate_similarities - diversity * target_similarities.reshape(-1, 1)
mmr_idx = candidates_idx[np.argmax(mmr)]
# Update keywords & candidates
keywords_idx.append(mmr_idx)
candidates_idx.remove(mmr_idx)
return [words[idx] for idx in keywords_idx]
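The selection loop in `mmr` can be illustrated with a small, dependency-free sketch. It is not the library's implementation: `cosine` and `mmr_sketch` are hypothetical names, the two-dimensional vectors are made-up toy data, and plain Python lists replace the NumPy arrays and scikit-learn `cosine_similarity` used above. Unlike the loop above, the sketch also stops early when it runs out of candidates, which is a choice made here.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_sketch(
    doc_vec: Sequence[float],
    word_vecs: List[Sequence[float]],
    words: List[str],
    diversity: float = 0.1,
    top_n: int = 10,
) -> List[str]:
    """Pick words that are relevant to the document but not redundant."""
    relevance = [cosine(vec, doc_vec) for vec in word_vecs]

    # Start with the single most relevant word
    selected = [max(range(len(words)), key=relevance.__getitem__)]
    candidates = [i for i in range(len(words)) if i != selected[0]]

    while candidates and len(selected) < top_n:

        def score(i: int) -> float:
            # Redundancy is the similarity to the closest already-selected word
            redundancy = max(cosine(word_vecs[i], word_vecs[j]) for j in selected)
            return (1 - diversity) * relevance[i] - diversity * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return [words[i] for i in selected]


# Toy data (assumed for illustration): "beef" is nearly parallel to "meat"
words = ["meat", "beef", "emissions"]
word_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
doc_vec = [1.0, 0.0]

low_diversity = mmr_sketch(doc_vec, word_vecs, words, diversity=0.0, top_n=2)
high_diversity = mmr_sketch(doc_vec, word_vecs, words, diversity=0.9, top_n=2)
```

With `diversity=0.0` the second pick is the near-duplicate "beef"; with `diversity=0.9` the redundancy penalty pushes the choice to "emissions" instead.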
@@ -1,274 +0,0 @@
import time
import openai
import pandas as pd
from tqdm import tqdm
from scipy.sparse import csr_matrix
from typing import Mapping, List, Tuple, Any, Union, Callable
from bertopic.representation._base import BaseRepresentation
from bertopic.representation._utils import (
retry_with_exponential_backoff,
truncate_document,
validate_truncate_document_parameters,
)
DEFAULT_CHAT_PROMPT = """You will extract a short topic label from given documents and keywords.
Here are two examples of topics you created before:
# Example 1
Sample texts from this topic:
- Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
- Meat, but especially beef, is the worst food in terms of emissions.
- Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.
Keywords: meat beef eat eating emissions steak food health processed chicken
topic: Environmental impacts of eating meat
# Example 2
Sample texts from this topic:
- I have ordered the product weeks ago but it still has not arrived!
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.
- I got a message stating that I received the monitor but that is not true!
- It took a month longer to deliver than was advised...
Keywords: deliver weeks product shipping long delivery received arrived arrive week
topic: Shipping and delivery issues
# Your task
Sample texts from this topic:
[DOCUMENTS]
Keywords: [KEYWORDS]
Based on the information above, extract a short topic label (three words at most) in the following format:
topic: <topic_label>
"""
DEFAULT_SYSTEM_PROMPT = "You are an assistant that extracts high-level topics from texts."
class OpenAI(BaseRepresentation):
r"""Use the OpenAI API to generate topic labels based
on one of their chat completion models.
For an overview see:
https://platform.openai.com/docs/models
Arguments:
client: An `openai.OpenAI` client
model: Model to use within OpenAI, defaults to `"gpt-4o-mini"`.
generator_kwargs: Kwargs passed to `client.chat.completions.create`
to control the output.
prompt: The prompt to be used in the model. If no prompt is given,
`self.default_prompt_` is used instead.
NOTE: Use `"[KEYWORDS]"` and `"[DOCUMENTS]"` in the prompt
to decide where the keywords and documents need to be
inserted.
system_prompt: The system prompt to be used in the model. If no system prompt is given,
`self.default_system_prompt_` is used instead.
delay_in_seconds: The delay in seconds between consecutive prompts
in order to prevent RateLimitErrors.
exponential_backoff: Retry requests with a random exponential backoff.
A short sleep is used when a rate limit error is hit,
then the request is retried. The sleep length increases
with each error, up to 10 unsuccessful requests.
Applied in addition to `delay_in_seconds` if that is set.
nr_docs: The number of documents to pass to OpenAI if a prompt
with the `"[DOCUMENTS]"` tag is used.
diversity: The diversity of documents to pass to OpenAI.
Accepts values between 0 and 1. A higher
value results in passing more diverse documents,
whereas a lower value passes more similar documents.
doc_length: The maximum length of each document. If a document is longer,
it will be truncated. If None, the entire document is passed.
tokenizer: The tokenizer used to split the document into segments,
which are counted to determine the length of a document.
* If tokenizer is 'char', then the document is split up
into characters which are counted to adhere to `doc_length`
* If tokenizer is 'whitespace', the document is split up
into words separated by whitespaces. These words are counted
and truncated depending on `doc_length`
* If tokenizer is 'vectorizer', then the internal CountVectorizer
is used to tokenize the document. These tokens are counted
and truncated depending on `doc_length`
* If tokenizer is a callable, then that callable is used to tokenize
the document. These tokens are counted and truncated depending
on `doc_length`
Usage:
To use this, you will need to install the openai package first:
`pip install openai`
Then, get yourself an API key and use OpenAI's API as follows:
```python
import openai
from bertopic.representation import OpenAI
from bertopic import BERTopic
# Create your representation model
client = openai.OpenAI(api_key=MY_API_KEY)
representation_model = OpenAI(client, delay_in_seconds=5)
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
You can also use a custom prompt:
```python
prompt = "I have the following documents: [DOCUMENTS] \nThese documents are about the following topic: '"
representation_model = OpenAI(client, prompt=prompt, delay_in_seconds=5)
```
To choose a model:
```python
representation_model = OpenAI(client, model="gpt-4o-mini", delay_in_seconds=10)
```
"""
def __init__(
self,
client,
model: str = "gpt-4o-mini",
prompt: str = None,
system_prompt: str = None,
generator_kwargs: Mapping[str, Any] = {},
delay_in_seconds: float = None,
exponential_backoff: bool = False,
nr_docs: int = 4,
diversity: float = None,
doc_length: int = None,
tokenizer: Union[str, Callable] = None,
**kwargs,
):
self.client = client
self.model = model
if prompt is None:
self.prompt = DEFAULT_CHAT_PROMPT
else:
self.prompt = prompt
if system_prompt is None:
self.system_prompt = DEFAULT_SYSTEM_PROMPT
else:
self.system_prompt = system_prompt
self.default_prompt_ = DEFAULT_CHAT_PROMPT
self.default_system_prompt_ = DEFAULT_SYSTEM_PROMPT
self.delay_in_seconds = delay_in_seconds
self.exponential_backoff = exponential_backoff
self.nr_docs = nr_docs
self.diversity = diversity
self.doc_length = doc_length
self.tokenizer = tokenizer
validate_truncate_document_parameters(self.tokenizer, self.doc_length)
self.prompts_ = []
self.generator_kwargs = generator_kwargs
if self.generator_kwargs.get("model"):
self.model = generator_kwargs.get("model")
del self.generator_kwargs["model"]
if self.generator_kwargs.get("prompt"):
del self.generator_kwargs["prompt"]
if not self.generator_kwargs.get("stop"):
self.generator_kwargs["stop"] = "\n"
def extract_topics(
self,
topic_model,
documents: pd.DataFrame,
c_tf_idf: csr_matrix,
topics: Mapping[str, List[Tuple[str, float]]],
) -> Mapping[str, List[Tuple[str, float]]]:
"""Extract topics.
Arguments:
topic_model: A BERTopic model
documents: All input documents
c_tf_idf: The topic c-TF-IDF representation
topics: The candidate topics as calculated with c-TF-IDF
Returns:
updated_topics: Updated topic representations
"""
# Extract the top n representative documents per topic
repr_docs_mappings, _, _, _ = topic_model._extract_representative_docs(
c_tf_idf, documents, topics, 500, self.nr_docs, self.diversity
)
# Generate using OpenAI's Language Model
updated_topics = {}
for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
prompt = self._create_prompt(truncated_docs, topic, topics)
self.prompts_.append(prompt)
# Delay
if self.delay_in_seconds:
time.sleep(self.delay_in_seconds)
messages = [
{"role": "system", "content": self.system_prompt},
{"role": "user", "content": prompt},
]
kwargs = {
"model": self.model,
"messages": messages,
**self.generator_kwargs,
}
if self.exponential_backoff:
response = chat_completions_with_backoff(self.client, **kwargs)
else:
response = self.client.chat.completions.create(**kwargs)
# Check whether content was actually generated
# Addresses #1570 for potential issues with OpenAI's content filter
# Addresses #2176 for potential issues when openAI returns a None type object
if response and hasattr(response.choices[0].message, "content"):
label = response.choices[0].message.content.strip().replace("topic: ", "")
else:
label = "No label returned"
updated_topics[topic] = [(label, 1)]
return updated_topics
def _create_prompt(self, docs, topic, topics):
keywords = list(zip(*topics[topic]))[0]
# Use the Default Chat Prompt
if self.prompt == DEFAULT_CHAT_PROMPT:
prompt = self.prompt.replace("[KEYWORDS]", ", ".join(keywords))
prompt = self._replace_documents(prompt, docs)
# Use a custom prompt that leverages keywords, documents or both using
# custom tags, namely [KEYWORDS] and [DOCUMENTS] respectively
else:
prompt = self.prompt
if "[KEYWORDS]" in prompt:
prompt = prompt.replace("[KEYWORDS]", ", ".join(keywords))
if "[DOCUMENTS]" in prompt:
prompt = self._replace_documents(prompt, docs)
return prompt
@staticmethod
def _replace_documents(prompt, docs):
to_replace = ""
for doc in docs:
to_replace += f"- {doc}\n"
prompt = prompt.replace("[DOCUMENTS]", to_replace)
return prompt
def chat_completions_with_backoff(client, **kwargs):
return retry_with_exponential_backoff(
client.chat.completions.create,
errors=(openai.RateLimitError,),
)(**kwargs)
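The tag substitution performed by `_create_prompt` and `_replace_documents` can be sketched on its own. This is a simplified illustration rather than the class's code: `fill_prompt` is a hypothetical helper, and the template and sample texts are invented for the example. Only the `[KEYWORDS]` and `[DOCUMENTS]` tags and the `- {doc}\n` bullet format come from the source.

```python
from typing import List

KEYWORDS_TAG = "[KEYWORDS]"
DOCUMENTS_TAG = "[DOCUMENTS]"


def fill_prompt(template: str, keywords: List[str], docs: List[str]) -> str:
    """Replace the keyword and document tags in a prompt template."""
    prompt = template
    if KEYWORDS_TAG in prompt:
        # Keywords are joined into one comma-separated string
        prompt = prompt.replace(KEYWORDS_TAG, ", ".join(keywords))
    if DOCUMENTS_TAG in prompt:
        # Each document becomes one "- ..." bullet ending in a newline
        bullets = "".join(f"- {doc}\n" for doc in docs)
        prompt = prompt.replace(DOCUMENTS_TAG, bullets)
    return prompt


# Hypothetical template; note there is no newline after the [DOCUMENTS] tag
# because the last bullet already ends with one.
template = "Documents:\n[DOCUMENTS]Keywords: [KEYWORDS]\nTopic name:"
filled = fill_prompt(
    template,
    keywords=["meat", "beef"],
    docs=["Beef has high emissions.", "I eat less meat now."],
)
print(filled)
```

Because every bullet ends with `\n`, a template that puts its own newline right after `[DOCUMENTS]` will produce a blank line before the next section.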
@@ -1,161 +0,0 @@
import numpy as np
import pandas as pd
import spacy
from spacy.matcher import Matcher
from spacy.language import Language
from packaging import version
from scipy.sparse import csr_matrix
from typing import List, Mapping, Tuple, Union
from sklearn import __version__ as sklearn_version
from bertopic.representation._base import BaseRepresentation
class PartOfSpeech(BaseRepresentation):
"""Extract Topic Keywords based on their Part-of-Speech.
DEFAULT_PATTERNS = [
[{'POS': 'ADJ'}, {'POS': 'NOUN'}],
[{'POS': 'NOUN'}],
[{'POS': 'ADJ'}]
]
From candidate topics, as extracted with c-TF-IDF,
find documents that contain keywords found in the
candidate topics. These candidate documents then
serve as the representative set of documents from
which the Spacy model can extract a set of candidate
keywords for each topic.
These candidate keywords are first judged by whether
they fall within the DEFAULT_PATTERNS or the user-defined
pattern. Then, the resulting keywords are sorted by
their respective c-TF-IDF values.
Arguments:
model: The Spacy model to use
top_n_words: The top n words to extract
pos_patterns: Patterns for Spacy to use.
See https://spacy.io/usage/rule-based-matching
Usage:
```python
from bertopic.representation import PartOfSpeech
from bertopic import BERTopic
# Create your representation model
representation_model = PartOfSpeech("en_core_web_sm")
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
You can define custom POS patterns to be extracted:
```python
pos_patterns = [
[{'POS': 'ADJ'}, {'POS': 'NOUN'}],
[{'POS': 'NOUN'}], [{'POS': 'ADJ'}]
]
representation_model = PartOfSpeech("en_core_web_sm", pos_patterns=pos_patterns)
```
"""
def __init__(
self,
model: Union[str, Language] = "en_core_web_sm",
top_n_words: int = 10,
pos_patterns: List[str] = None,
):
if isinstance(model, str):
self.model = spacy.load(model)
elif isinstance(model, Language):
self.model = model
else:
raise ValueError(
"Make sure that the Spacy model that you"
"pass is either a string referring to a"
"Spacy model or a Spacy nlp object."
)
self.top_n_words = top_n_words
if pos_patterns is None:
self.pos_patterns = [
[{"POS": "ADJ"}, {"POS": "NOUN"}],
[{"POS": "NOUN"}],
[{"POS": "ADJ"}],
]
else:
self.pos_patterns = pos_patterns
def extract_topics(
self,
topic_model,
documents: pd.DataFrame,
c_tf_idf: csr_matrix,
topics: Mapping[str, List[Tuple[str, float]]],
) -> Mapping[str, List[Tuple[str, float]]]:
"""Extract topics.
Arguments:
topic_model: A BERTopic model
documents: All input documents
c_tf_idf: Not used
topics: The candidate topics as calculated with c-TF-IDF
Returns:
updated_topics: Updated topic representations
"""
matcher = Matcher(self.model.vocab)
matcher.add("Pattern", self.pos_patterns)
candidate_topics = {}
for topic, values in topics.items():
keywords = list(zip(*values))[0]
# Extract candidate documents
candidate_documents = []
for keyword in keywords:
selection = documents.loc[documents.Topic == topic, :]
selection = selection.loc[selection.Document.str.contains(keyword, regex=False), "Document"]
if len(selection) > 0:
for document in selection[:2]:
candidate_documents.append(document)
candidate_documents = list(set(candidate_documents))
# Extract keywords
docs_pipeline = self.model.pipe(candidate_documents)
updated_keywords = []
for doc in docs_pipeline:
matches = matcher(doc)
for _, start, end in matches:
updated_keywords.append(doc[start:end].text)
candidate_topics[topic] = list(set(updated_keywords))
# Scikit-Learn Deprecation: get_feature_names is deprecated in 1.0
# and will be removed in 1.2. Please use get_feature_names_out instead.
if version.parse(sklearn_version) >= version.parse("1.0.0"):
words = list(topic_model.vectorizer_model.get_feature_names_out())
else:
words = list(topic_model.vectorizer_model.get_feature_names())
# Match updated keywords with c-TF-IDF values
words_lookup = dict(zip(words, range(len(words))))
updated_topics = {topic: [] for topic in topics.keys()}
for topic, candidate_keywords in candidate_topics.items():
word_indices = np.sort(
[words_lookup.get(keyword) for keyword in candidate_keywords if keyword in words_lookup]
)
vals = topic_model.c_tf_idf_[:, word_indices][topic + topic_model._outliers]
indices = np.argsort(np.array(vals.todense().reshape(1, -1))[0])[-self.top_n_words :][::-1]
vals = np.sort(np.array(vals.todense().reshape(1, -1))[0])[-self.top_n_words :][::-1]
topic_words = [(words[word_indices[index]], val) for index, val in zip(indices, vals)]
updated_topics[topic] = topic_words
if len(updated_topics[topic]) < self.top_n_words:
updated_topics[topic] += [("", 0) for _ in range(self.top_n_words - len(updated_topics[topic]))]
return updated_topics
@@ -1,188 +0,0 @@
import pandas as pd
from tqdm import tqdm
from scipy.sparse import csr_matrix
from transformers import pipeline, set_seed
from transformers.pipelines.base import Pipeline
from typing import Mapping, List, Tuple, Any, Union, Callable
from bertopic.representation._base import BaseRepresentation
from bertopic.representation._utils import truncate_document, validate_truncate_document_parameters
DEFAULT_PROMPT = """
I have a topic described by the following keywords: [KEYWORDS].
The name of this topic is:
"""
class TextGeneration(BaseRepresentation):
"""Text2Text or text generation with transformers.
Arguments:
model: A transformers pipeline that should be initialized as "text-generation"
for gpt-like models or "text2text-generation" for T5-like models.
For example, `pipeline('text-generation', model='gpt2')`. If a string
is passed, "text-generation" will be selected by default.
prompt: The prompt to be used in the model. If no prompt is given,
`self.default_prompt_` is used instead.
NOTE: Use `"[KEYWORDS]"` and `"[DOCUMENTS]"` in the prompt
to decide where the keywords and documents need to be
inserted.
pipeline_kwargs: Kwargs that you can pass to the transformers.pipeline
when it is called.
random_state: A random state to be passed to `transformers.set_seed`
nr_docs: The number of documents to pass to the model if a prompt
with the `"[DOCUMENTS]"` tag is used.
diversity: The diversity of documents to pass to the model.
Accepts values between 0 and 1. A higher
value results in passing more diverse documents,
whereas a lower value passes more similar documents.
doc_length: The maximum length of each document. If a document is longer,
it will be truncated. If None, the entire document is passed.
tokenizer: The tokenizer used to split the document into segments,
which are counted to determine the length of a document.
* If tokenizer is 'char', then the document is split up
into characters which are counted to adhere to `doc_length`
* If tokenizer is 'whitespace', the document is split up
into words separated by whitespaces. These words are counted
and truncated depending on `doc_length`
* If tokenizer is 'vectorizer', then the internal CountVectorizer
is used to tokenize the document. These tokens are counted
and truncated depending on `doc_length`
* If tokenizer is a callable, then that callable is used to tokenize
the document. These tokens are counted and truncated depending
on `doc_length`
Usage:
To use a gpt-like model:
```python
from transformers import pipeline
from bertopic.representation import TextGeneration
from bertopic import BERTopic
# Create your representation model
generator = pipeline('text-generation', model='gpt2')
representation_model = TextGeneration(generator)
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
You can use a custom prompt and decide where the keywords should
be inserted with the `[KEYWORDS]` tag, or the documents with the `[DOCUMENTS]` tag:
```python
from transformers import pipeline
from bertopic.representation import TextGeneration
prompt = "I have a topic described by the following keywords: [KEYWORDS]. Based on the previous keywords, what is this topic about?"
# Create your representation model
generator = pipeline('text2text-generation', model='google/flan-t5-base')
representation_model = TextGeneration(generator, prompt=prompt)
```
"""
def __init__(
self,
model: Union[str, Pipeline],
prompt: str = None,
pipeline_kwargs: Mapping[str, Any] = {},
random_state: int = 42,
nr_docs: int = 4,
diversity: float = None,
doc_length: int = None,
tokenizer: Union[str, Callable] = None,
):
self.random_state = random_state
set_seed(random_state)
if isinstance(model, str):
self.model = pipeline("text-generation", model=model)
elif isinstance(model, Pipeline):
self.model = model
else:
raise ValueError(
"Make sure that the HF model that you "
"pass is either a string referring to an "
"HF model or a `transformers.pipeline` object."
)
self.prompt = prompt if prompt is not None else DEFAULT_PROMPT
self.default_prompt_ = DEFAULT_PROMPT
self.pipeline_kwargs = pipeline_kwargs
self.nr_docs = nr_docs
self.diversity = diversity
self.doc_length = doc_length
self.tokenizer = tokenizer
validate_truncate_document_parameters(self.tokenizer, self.doc_length)
self.prompts_ = []
def extract_topics(
self,
topic_model,
documents: pd.DataFrame,
c_tf_idf: csr_matrix,
topics: Mapping[str, List[Tuple[str, float]]],
) -> Mapping[str, List[Tuple[str, float]]]:
"""Extract topic representations and return a single label.
Arguments:
topic_model: A BERTopic model
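documents: All input documents (only used if the prompt contains `[DOCUMENTS]`)
c_tf_idf: The topic c-TF-IDF representation (only used if the prompt contains `[DOCUMENTS]`)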
topics: The candidate topics as calculated with c-TF-IDF
Returns:
updated_topics: Updated topic representations
"""
# Extract the top `nr_docs` representative documents per topic
if self.prompt != DEFAULT_PROMPT and "[DOCUMENTS]" in self.prompt:
repr_docs_mappings, _, _, _ = topic_model._extract_representative_docs(
c_tf_idf, documents, topics, 500, self.nr_docs, self.diversity
)
else:
repr_docs_mappings = {topic: None for topic in topics.keys()}
updated_topics = {}
for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
# Prepare prompt
truncated_docs = (
[truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
if docs is not None
else docs
)
prompt = self._create_prompt(truncated_docs, topic, topics)
self.prompts_.append(prompt)
# Extract result from generator and use that as label
topic_description = self.model(prompt, **self.pipeline_kwargs)
topic_description = [
(description["generated_text"].replace(prompt, ""), 1) for description in topic_description
]
if len(topic_description) < 10:
topic_description += [("", 0) for _ in range(10 - len(topic_description))]
updated_topics[topic] = topic_description
return updated_topics
def _create_prompt(self, docs, topic, topics):
keywords = ", ".join(list(zip(*topics[topic]))[0])
# Use the default prompt and replace keywords
if self.prompt == DEFAULT_PROMPT:
prompt = self.prompt.replace("[KEYWORDS]", keywords)
# Use a prompt that leverages either keywords or documents in
# a custom location
else:
prompt = self.prompt
if "[KEYWORDS]" in prompt:
prompt = prompt.replace("[KEYWORDS]", keywords)
if "[DOCUMENTS]" in prompt:
to_replace = ""
for doc in docs:
to_replace += f"- {doc}\n"
prompt = prompt.replace("[DOCUMENTS]", to_replace)
return prompt
@@ -1,113 +0,0 @@
import random
import time
from typing import Union
def truncate_document(topic_model, doc_length: Union[int, None], tokenizer: Union[str, callable], document: str) -> str:
"""Truncate a document to a certain length.
If you want to add a custom tokenizer, then it will need to have a `decode` and
`encode` method. An example would be the following custom tokenizer:
```python
class Tokenizer:
'A custom tokenizer that splits on commas'
def encode(self, doc):
return doc.split(",")
def decode(self, doc_chunks):
return ",".join(doc_chunks)
```
You can use this tokenizer by passing it to the `tokenizer` parameter.
Arguments:
topic_model: A BERTopic model
doc_length: The maximum length of each document. If a document is longer,
it will be truncated. If None, the entire document is passed.
tokenizer: The tokenizer used to split the document into segments,
which are counted to determine the length of a document.
* If tokenizer is 'char', then the document is split up
into characters which are counted to adhere to `doc_length`
* If tokenizer is 'whitespace', the document is split up
into words separated by whitespaces. These words are counted
and truncated depending on `doc_length`
* If tokenizer is 'vectorizer', then the internal CountVectorizer
is used to tokenize the document. These tokens are counted
and truncated depending on `doc_length`. They are decoded with
whitespaces.
* If tokenizer is a callable, then that callable is used to tokenize
the document. These tokens are counted and truncated depending
on `doc_length`
document: A single document
Returns:
truncated_document: A truncated document
"""
if doc_length is not None:
if tokenizer == "char":
truncated_document = document[:doc_length]
elif tokenizer == "whitespace":
truncated_document = " ".join(document.split()[:doc_length])
elif tokenizer == "vectorizer":
tokenizer = topic_model.vectorizer_model.build_tokenizer()
truncated_document = " ".join(tokenizer(document)[:doc_length])
elif hasattr(tokenizer, "encode") and hasattr(tokenizer, "decode"):
encoded_document = tokenizer.encode(document)
truncated_document = tokenizer.decode(encoded_document[:doc_length])
return truncated_document
return document
def validate_truncate_document_parameters(tokenizer, doc_length) -> Union[None, ValueError]:
"""Validates parameters that are used in the function `truncate_document`."""
if tokenizer is None and doc_length is not None:
raise ValueError(
"Please select from one of the valid options for the `tokenizer` parameter: \n"
"{'char', 'whitespace', 'vectorizer'} \n"
"If `tokenizer` is of type callable ensure it has methods to encode and decode a document \n"
)
elif tokenizer is not None and doc_length is None:
raise ValueError("If `tokenizer` is provided, `doc_length` of type int must be provided as well.")
def retry_with_exponential_backoff(
func,
initial_delay: float = 1,
exponential_base: float = 2,
jitter: bool = True,
max_retries: int = 10,
errors: tuple = None,
):
"""Retry a function with exponential backoff."""
def wrapper(*args, **kwargs):
# Initialize variables
num_retries = 0
delay = initial_delay
# Loop until a successful response or max_retries is hit or an exception is raised
while True:
try:
return func(*args, **kwargs)
# Retry on specific errors
except errors:
# Increment retries
num_retries += 1
# Check if max retries has been reached
if num_retries > max_retries:
raise Exception(f"Maximum number of retries ({max_retries}) exceeded.")
# Increment the delay
delay *= exponential_base * (1 + jitter * random.random())
# Sleep for the delay
time.sleep(delay)
# Raise exceptions for any errors not specified
except Exception as e:
raise e
return wrapper
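The retry loop above can be tried in isolation. The following is a minimal sketch, not the library's implementation: `retry_with_backoff`, its injectable `sleep` argument, and the `flaky` function are hypothetical names introduced here so that the growth of the delay can be observed without actually waiting.

```python
import random
import time


def retry_with_backoff(func, initial_delay=1.0, exponential_base=2.0, jitter=True,
                       max_retries=3, errors=(Exception,), sleep=time.sleep):
    """Hypothetical stand-alone variant of the retry wrapper (assumption: `sleep` is injectable)."""

    def wrapper(*args, **kwargs):
        num_retries = 0
        delay = initial_delay
        while True:
            try:
                return func(*args, **kwargs)
            except errors:
                num_retries += 1
                if num_retries > max_retries:
                    raise RuntimeError(f"Maximum number of retries ({max_retries}) exceeded.")
                # Grow the delay; with jitter enabled a random factor in [1, 2) is applied
                delay *= exponential_base * (1 + jitter * random.random())
                sleep(delay)

    return wrapper


calls = {"count": 0}
delays = []


def flaky():
    """Fails twice with a retryable error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Record the delays instead of sleeping so the example runs instantly
result = retry_with_backoff(flaky, jitter=False, errors=(ConnectionError,), sleep=delays.append)()
```

With `jitter=False` the delay doubles on every failed attempt, so the two failures here record delays of 2 and 4 seconds before the third call succeeds.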
@@ -1,274 +0,0 @@
import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm
from scipy.sparse import csr_matrix
from typing import Mapping, List, Tuple, Union
from transformers.pipelines import Pipeline, pipeline
from bertopic.representation._mmr import mmr
from bertopic.representation._base import BaseRepresentation
class VisualRepresentation(BaseRepresentation):
"""From a collection of representative documents, extract
images to represent topics. These topics are represented by a
collage of images.
Arguments:
nr_repr_images: Number of representative images to extract
nr_samples: The number of candidate documents to extract per cluster.
image_height: The height of the resulting collage
image_squares: Whether to resize each image in the collage
to a square. This can be visually more appealing
if all input images are almost square.
image_to_text_model: The model to caption images.
batch_size: The number of images to pass to the
`image_to_text_model`.
Usage:
```python
from bertopic.representation import VisualRepresentation
from bertopic import BERTopic
# The visual representation is typically not a core representation,
# so it is advisable to pass it to BERTopic as an additional aspect.
# Aspects can be labeled with dictionaries as shown below:
representation_model = {
"Visual_Aspect": VisualRepresentation()
}
# Use the representation model in BERTopic as a separate aspect
topic_model = BERTopic(representation_model=representation_model)
```
"""
def __init__(
self,
nr_repr_images: int = 9,
nr_samples: int = 500,
image_height: int = 600,
image_squares: bool = False,
image_to_text_model: Union[str, Pipeline] = None,
batch_size: int = 32,
):
self.nr_repr_images = nr_repr_images
self.nr_samples = nr_samples
self.image_height = image_height
self.image_squares = image_squares
# Image-to-text model
if isinstance(image_to_text_model, Pipeline):
self.image_to_text_model = image_to_text_model
elif isinstance(image_to_text_model, str):
self.image_to_text_model = pipeline("image-to-text", model=image_to_text_model)
elif image_to_text_model is None:
self.image_to_text_model = None
else:
raise ValueError(
"Please select a correct transformers pipeline. For example: "
"pipeline('image-to-text', model='nlpconnect/vit-gpt2-image-captioning')"
)
self.batch_size = batch_size
def extract_topics(
self,
topic_model,
documents: pd.DataFrame,
c_tf_idf: csr_matrix,
topics: Mapping[str, List[Tuple[str, float]]],
) -> Mapping[str, List[Tuple[str, float]]]:
"""Extract topics.
Arguments:
topic_model: A BERTopic model
documents: All input documents
c_tf_idf: The topic c-TF-IDF representation
topics: The candidate topics as calculated with c-TF-IDF
Returns:
representative_images: Representative images per topic
"""
# Extract image ids of most representative documents
images = documents["Image"].values.tolist()
(_, _, _, repr_docs_ids) = topic_model._extract_representative_docs(
c_tf_idf,
documents,
topics,
nr_samples=self.nr_samples,
nr_repr_docs=self.nr_repr_images,
)
unique_topics = sorted(list(topics.keys()))
# Combine representative images into a single representation
representative_images = {}
for topic in tqdm(unique_topics):
# Get and order representative images
sliced_examplars = repr_docs_ids[topic + topic_model._outliers]
sliced_examplars = [sliced_examplars[i : i + 3] for i in range(0, len(sliced_examplars), 3)]
images_to_combine = [
[
Image.open(images[index]) if isinstance(images[index], str) else images[index]
for index in sub_indices
]
for sub_indices in sliced_examplars
]
# Concatenate representative images
representative_image = get_concat_tile_resize(images_to_combine, self.image_height, self.image_squares)
representative_images[topic] = representative_image
# Make sure to properly close images
if isinstance(images[0], str):
for image_list in images_to_combine:
for image in image_list:
image.close()
return representative_images
def _convert_image_to_text(self, images: List[str], verbose: bool = False) -> List[str]:
"""Convert a list of images to captions.
Arguments:
images: A list of images or words to be converted to text.
verbose: Controls the verbosity of the process
Returns:
List of captions
"""
# Batch-wise image conversion
if self.batch_size is not None:
documents = []
for batch in tqdm(self._chunks(images), disable=not verbose):
outputs = self.image_to_text_model(batch)
captions = [output[0]["generated_text"] for output in outputs]
documents.extend(captions)
# Convert images to text
else:
outputs = self.image_to_text_model(images)
documents = [output[0]["generated_text"] for output in outputs]
return documents
def image_to_text(self, documents: pd.DataFrame, embeddings: np.ndarray) -> pd.DataFrame:
"""Convert images to text."""
# Create image topic embeddings
topics = documents.Topic.values.tolist()
images = documents.Image.values.tolist()
df = pd.DataFrame(np.hstack([np.array(topics).reshape(-1, 1), embeddings]))
image_topic_embeddings = df.groupby(0).mean().values
# Extract image centroids
image_centroids = {}
unique_topics = sorted(list(set(topics)))
for topic, topic_embedding in zip(unique_topics, image_topic_embeddings):
indices = np.array([index for index, t in enumerate(topics) if t == topic])
top_n = min([self.nr_repr_images, len(indices)])
indices = mmr(
topic_embedding.reshape(1, -1),
embeddings[indices],
indices,
top_n=top_n,
diversity=0.1,
)
image_centroids[topic] = indices
# Extract documents
documents = pd.DataFrame(columns=["Document", "ID", "Topic", "Image"])
current_id = 0
for topic, image_ids in tqdm(image_centroids.items()):
selected_images = [
Image.open(images[index]) if isinstance(images[index], str) else images[index] for index in image_ids
]
text = self._convert_image_to_text(selected_images)
for doc, image_id in zip(text, image_ids):
documents.loc[len(documents), :] = [
doc,
current_id,
topic,
images[image_id],
]
current_id += 1
# Properly close images
if isinstance(images[image_ids[0]], str):
for image in selected_images:
image.close()
return documents
def _chunks(self, images):
for i in range(0, len(images), self.batch_size):
yield images[i : i + self.batch_size]
def get_concat_h_multi_resize(im_list):
"""Code adapted from: https://note.nkmk.me/en/python-pillow-concat-images/."""
# Use the tallest image as the target height; the `min_height` name is kept from the adapted code
min_height = max(im.height for im in im_list)
im_list_resize = []
for im in im_list:
# `Image.resize` returns a new image, so keep the result instead of discarding it
im_list_resize.append(im.resize((int(im.width * min_height / im.height), min_height), resample=0))
total_width = sum(im.width for im in im_list_resize)
dst = Image.new("RGB", (total_width, min_height), (255, 255, 255))
pos_x = 0
for im in im_list_resize:
dst.paste(im, (pos_x, 0))
pos_x += im.width
return dst
def get_concat_v_multi_resize(im_list):
"""Code adapted from: https://note.nkmk.me/en/python-pillow-concat-images/."""
# Use the widest image as the target width; the `min_width` name is kept from the adapted code
min_width = max(im.width for im in im_list)
im_list_resize = [im.resize((min_width, int(im.height * min_width / im.width)), resample=0) for im in im_list]
total_height = sum(im.height for im in im_list_resize)
dst = Image.new("RGB", (min_width, total_height), (255, 255, 255))
pos_y = 0
for im in im_list_resize:
dst.paste(im, (0, pos_y))
pos_y += im.height
return dst
def get_concat_tile_resize(im_list_2d, image_height=600, image_squares=False):
"""Code adapted from: https://note.nkmk.me/en/python-pillow-concat-images/."""
images = [[image.copy() for image in images] for images in im_list_2d]
# Resize every image to the same square, a third of the collage height
if image_squares:
width = int(image_height / 3)
height = int(image_height / 3)
images = [[image.resize((width, height)) for image in images] for images in im_list_2d]
# Resize images based on minimum size
else:
min_width = min([min([img.width for img in imgs]) for imgs in im_list_2d])
min_height = min([min([img.height for img in imgs]) for imgs in im_list_2d])
for i, imgs in enumerate(images):
for j, img in enumerate(imgs):
if img.height > img.width:
images[i][j] = img.resize(
(int(img.width * min_height / img.height), min_height),
resample=0,
)
elif img.width > img.height:
images[i][j] = img.resize((min_width, int(img.height * min_width / img.width)), resample=0)
else:
images[i][j] = img.resize((min_width, min_width))
# Resize grid image
images = [get_concat_h_multi_resize(im_list_h) for im_list_h in images]
img = get_concat_v_multi_resize(images)
height_percentage = image_height / float(img.size[1])
adjusted_width = int((float(img.size[0]) * float(height_percentage)))
img = img.resize((adjusted_width, image_height), Image.Resampling.LANCZOS)
return img
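Two pieces of arithmetic in the collage code are easy to check without Pillow: the slicing of representative image ids into rows of three, and the final rescaling of the grid to `image_height`. The sketch below reproduces only that arithmetic; `chunk_rows` and `scale_to_height` are hypothetical helper names that do not exist in the module.

```python
def chunk_rows(ids, per_row=3):
    """Split a flat list of image ids into rows of at most `per_row` items."""
    return [ids[i : i + per_row] for i in range(0, len(ids), per_row)]


def scale_to_height(width, height, target_height):
    """Return the (width, height) of an image rescaled to `target_height`, keeping the aspect ratio."""
    height_percentage = target_height / float(height)
    return int(width * height_percentage), target_height


# Nine representative images become a 3 x 3 grid
rows = chunk_rows(list(range(9)))

# A 900 x 1200 grid rescaled to a collage height of 600
collage_size = scale_to_height(900, 1200, 600)
```

With fewer than nine images the last row is simply shorter, which is why the row and column concatenation helpers have to cope with rows of unequal size.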
@@ -1,104 +0,0 @@
import pandas as pd
from transformers import pipeline
from transformers.pipelines.base import Pipeline
from scipy.sparse import csr_matrix
from typing import Mapping, List, Tuple, Any
from bertopic.representation._base import BaseRepresentation
class ZeroShotClassification(BaseRepresentation):
"""Zero-shot Classification on topic keywords with candidate labels.
Arguments:
candidate_topics: A list of labels to assign to the topics if they
exceed `min_prob`
model: A transformers pipeline that should be initialized as
"zero-shot-classification". For example,
`pipeline("zero-shot-classification", model="facebook/bart-large-mnli")`
pipeline_kwargs: Kwargs that you can pass to the transformers.pipeline
when it is called. NOTE: Use `{"multi_label": True}`
to extract multiple labels for each topic.
min_prob: The minimum probability to assign a candidate label to a topic
Usage:
```python
from bertopic.representation import ZeroShotClassification
from bertopic import BERTopic
# Create your representation model
candidate_topics = ["space and nasa", "bicycles", "sports"]
representation_model = ZeroShotClassification(candidate_topics, model="facebook/bart-large-mnli")
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
"""
def __init__(
self,
candidate_topics: List[str],
model: str = "facebook/bart-large-mnli",
pipeline_kwargs: Mapping[str, Any] = {},
min_prob: float = 0.8,
):
self.candidate_topics = candidate_topics
if isinstance(model, str):
self.model = pipeline("zero-shot-classification", model=model)
elif isinstance(model, Pipeline):
self.model = model
else:
raise ValueError(
"Make sure that the HF model that you "
"pass is either a string referring to a "
"HF model or a `transformers.pipeline` object."
)
self.pipeline_kwargs = pipeline_kwargs
self.min_prob = min_prob
def extract_topics(
self,
topic_model,
documents: pd.DataFrame,
c_tf_idf: csr_matrix,
topics: Mapping[str, List[Tuple[str, float]]],
) -> Mapping[str, List[Tuple[str, float]]]:
"""Extract topics.
Arguments:
topic_model: Not used
documents: Not used
c_tf_idf: Not used
topics: The candidate topics as calculated with c-TF-IDF
Returns:
updated_topics: Updated topic representations
"""
# Classify topics
topic_descriptions = [" ".join(list(zip(*topics[topic]))[0]) for topic in topics.keys()]
classifications = self.model(topic_descriptions, self.candidate_topics, **self.pipeline_kwargs)
# Extract labels
updated_topics = {}
for topic, classification in zip(topics.keys(), classifications):
topic_description = topics[topic]
# Multi-label assignment
if self.pipeline_kwargs.get("multi_label"):
topic_description = []
for label, score in zip(classification["labels"], classification["scores"]):
if score > self.min_prob:
topic_description.append((label, score))
# Single label assignment
elif classification["scores"][0] > self.min_prob:
topic_description = [(classification["labels"][0], classification["scores"][0])]
# Make sure that 10 items are returned
if len(topic_description) == 0:
topic_description = topics[topic]
elif len(topic_description) < 10:
topic_description += [("", 0) for _ in range(10 - len(topic_description))]
updated_topics[topic] = topic_description
return updated_topics
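The label assignment in `extract_topics` can be illustrated without loading a transformers pipeline. The sketch below is an assumption-laden stand-in: `assign_labels` is a hypothetical function, and the `labels` and `scores` lists are hand-written values in the shape a zero-shot pipeline typically returns (sorted by descending score), not real model output.

```python
def assign_labels(keywords, labels, scores, min_prob=0.8, multi_label=False, top_n=10):
    """Pick candidate labels above `min_prob`, otherwise fall back to the c-TF-IDF keywords."""
    description = list(keywords)
    if multi_label:
        # Keep every label whose score exceeds the threshold
        description = [(label, score) for label, score in zip(labels, scores) if score > min_prob]
    elif scores[0] > min_prob:
        # Keep only the best label
        description = [(labels[0], scores[0])]

    # Fall back to the keywords, or pad with empty entries up to `top_n` items
    if len(description) == 0:
        description = list(keywords)
    elif len(description) < top_n:
        description += [("", 0)] * (top_n - len(description))
    return description


keywords = [("nasa", 0.5), ("orbit", 0.4)]

confident = assign_labels(keywords, ["space and nasa", "sports"], [0.93, 0.04])
uncertain = assign_labels(keywords, ["space and nasa", "sports"], [0.50, 0.30])
several = assign_labels(keywords, ["space and nasa", "sports", "bicycles"], [0.90, 0.85, 0.10], multi_label=True)
```

When no label clears `min_prob`, the topic keeps its keyword representation, so a strict threshold leaves more topics unlabeled rather than mislabeled.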
@@ -1,4 +0,0 @@
from ._ctfidf import ClassTfidfTransformer
from ._online_cv import OnlineCountVectorizer
__all__ = ["ClassTfidfTransformer", "OnlineCountVectorizer"]
@@ -1,115 +0,0 @@
from typing import List
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import normalize
from sklearn.utils import check_array
import numpy as np
import scipy.sparse as sp
class ClassTfidfTransformer(TfidfTransformer):
"""A Class-based TF-IDF procedure using scikit-learn's TfidfTransformer as a base.
![](../algorithm/c-TF-IDF.svg)
c-TF-IDF can best be explained as a TF-IDF formula adapted for multiple classes
by joining all documents per class. Thus, each class is converted to a single document
instead of a set of documents. The frequency of each word **x** is extracted
for each class **c** and is **l1** normalized. This constitutes the term frequency.
Then, the term frequency is multiplied with IDF which is the logarithm of 1 plus
the average number of words per class **A** divided by the frequency of word **x**
across all classes.
Arguments:
bm25_weighting: Uses BM25-inspired idf-weighting procedure instead of the procedure
as defined in the c-TF-IDF formula. It uses the following weighting scheme:
`log(1+((avg_nr_samples - df + 0.5) / (df+0.5)))`
reduce_frequent_words: Takes the square root of the bag-of-words after normalizing the matrix.
Helps to reduce the impact of words that appear too frequently.
seed_words: Specific words that will have their idf value increased by
the value of `seed_multiplier`.
NOTE: This will only increase the value of words that have an exact match.
seed_multiplier: The value with which the idf values of the words in `seed_words`
are multiplied.
Examples:
```python
transformer = ClassTfidfTransformer()
```
"""
def __init__(
self,
bm25_weighting: bool = False,
reduce_frequent_words: bool = False,
seed_words: List[str] = None,
seed_multiplier: float = 2,
):
self.bm25_weighting = bm25_weighting
self.reduce_frequent_words = reduce_frequent_words
self.seed_words = seed_words
self.seed_multiplier = seed_multiplier
super(ClassTfidfTransformer, self).__init__()
def fit(self, X: sp.csr_matrix, multiplier: np.ndarray = None):
"""Learn the idf vector (global term weights).
Arguments:
X: A matrix of term/token counts.
multiplier: A multiplier for increasing/decreasing certain IDF scores
"""
X = check_array(X, accept_sparse=("csr", "csc"))
if not sp.issparse(X):
X = sp.csr_matrix(X)
dtype = np.float64
if self.use_idf:
_, n_features = X.shape
# Calculate the frequency of words across all classes
df = np.squeeze(np.asarray(X.sum(axis=0)))
# Calculate the average number of samples as regularization
avg_nr_samples = int(X.sum(axis=1).mean())
# BM25-inspired weighting procedure
if self.bm25_weighting:
idf = np.log(1 + ((avg_nr_samples - df + 0.5) / (df + 0.5)))
# Divide the average number of samples by the word frequency
# +1 is added to force values to be positive
else:
idf = np.log((avg_nr_samples / df) + 1)
# Multiplier to increase/decrease certain idf scores
if multiplier is not None:
idf = idf * multiplier
self._idf_diag = sp.diags(
idf,
offsets=0,
shape=(n_features, n_features),
format="csr",
dtype=dtype,
)
return self
def transform(self, X: sp.csr_matrix):
"""Transform a count-based matrix to c-TF-IDF.
Arguments:
X (sparse matrix): A matrix of term/token counts.
Returns:
X (sparse matrix): A c-TF-IDF matrix
"""
if self.use_idf:
X = normalize(X, axis=1, norm="l1", copy=False)
if self.reduce_frequent_words:
X.data = np.sqrt(X.data)
X = X * self._idf_diag
return X
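The default (non-BM25) weighting in `fit` and `transform` can be worked through by hand on a tiny count matrix. The following is a dense, pure-Python sketch of the same arithmetic rather than the sparse implementation; the function name `c_tf_idf` and the example counts are illustrative assumptions, and it presumes every word and every class has a non-zero count.

```python
import math


def c_tf_idf(counts):
    """Compute c-TF-IDF for a dense classes x vocabulary matrix of word counts."""
    n_features = len(counts[0])

    # Frequency of each word across all classes
    df = [sum(row[j] for row in counts) for j in range(n_features)]

    # Average number of words per class, truncated to an integer as in `fit`
    avg_nr_samples = int(sum(sum(row) for row in counts) / len(counts))

    # idf = log(average words per class / word frequency + 1)
    idf = [math.log(avg_nr_samples / d + 1) for d in df]

    # l1-normalize each class row (term frequency) and multiply by idf
    return [[(count / sum(row)) * weight for count, weight in zip(row, idf)] for row in counts]


# Two classes, three vocabulary words
counts = [
    [3, 1, 0],
    [1, 1, 2],
]
scores = c_tf_idf(counts)
```

Here the first word appears four times overall and the average class length is four, so its idf is `log(4 / 4 + 1) = log(2)`; the first class then scores `0.75 * log(2)` for it.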
@@ -1,158 +0,0 @@
import numpy as np
from itertools import chain
from typing import List
from scipy import sparse
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
class OnlineCountVectorizer(CountVectorizer):
"""An online variant of the CountVectorizer with updating vocabulary.
At each `.partial_fit`, its vocabulary is updated based on any OOV words
it might find. Then, `.update_bow` can be used to track and update
the Bag-of-Words representation. These functions are separated such that
the vectorizer can be used in iteration without updating the Bag-of-Words
representation, which might speed up the fitting process. However, the
`.update_bow` function is used in BERTopic to track changes in the
topic representations and allow for decay.
This class inherits its parameters and attributes from:
`sklearn.feature_extraction.text.CountVectorizer`
Arguments:
decay: A value in [0, 1] that sets the fraction by which the frequencies
in the previous bag-of-words are decreased. For example,
a value of `.1` will decrease the frequencies in the bag-of-words
matrix by 10% at each iteration.
delete_min_df: Delete words at each iteration from its vocabulary
that are below a minimum frequency.
This will keep the resulting bag-of-words matrix small
such that it does not explode in size with increasing
vocabulary. If `decay` is None then this equals `min_df`.
**kwargs: Set of parameters inherited from:
`sklearn.feature_extraction.text.CountVectorizer`
In practice, this means that you can still use parameters
from the original CountVectorizer, like `stop_words` and
`ngram_range`.
Attributes:
X_ (scipy.sparse.csr_matrix) : The Bag-of-Words representation
Examples:
```python
from bertopic.vectorizers import OnlineCountVectorizer
vectorizer = OnlineCountVectorizer(stop_words="english")
for index, doc in enumerate(my_docs):
    # `partial_fit` and `update_bow` both expect a list of documents
    vectorizer.partial_fit([doc])
    # Update and clean the bow every 100 iterations:
    if index % 100 == 0:
        X = vectorizer.update_bow([doc])
```
To use the model in BERTopic:
```python
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer
vectorizer_model = OnlineCountVectorizer(stop_words="english")
topic_model = BERTopic(vectorizer_model=vectorizer_model)
```
References:
Adapted from: https://github.com/idoshlomo/online_vectorizers
"""
def __init__(self, decay: float = None, delete_min_df: float = None, **kwargs):
self.decay = decay
self.delete_min_df = delete_min_df
super(OnlineCountVectorizer, self).__init__(**kwargs)
def partial_fit(self, raw_documents: List[str]) -> "OnlineCountVectorizer":
"""Perform a partial fit and update vocabulary with OOV tokens.
Arguments:
raw_documents: A list of documents
"""
if not hasattr(self, "vocabulary_"):
return self.fit(raw_documents)
analyzer = self.build_analyzer()
analyzed_documents = [analyzer(doc) for doc in raw_documents]
new_tokens = set(chain.from_iterable(analyzed_documents))
oov_tokens = new_tokens.difference(set(self.vocabulary_.keys()))
if oov_tokens:
max_index = max(self.vocabulary_.values())
oov_vocabulary = dict(
zip(
oov_tokens,
list(range(max_index + 1, max_index + 1 + len(oov_tokens), 1)),
)
)
self.vocabulary_.update(oov_vocabulary)
return self
def update_bow(self, raw_documents: List[str]) -> csr_matrix:
"""Create or update the bag-of-words matrix.
Update the bag-of-words matrix by adding the newly transformed
documents. This may add empty columns if new words are found and/or
add empty rows if new topics are found.
During this process, the previous bag-of-words matrix might be
decayed if `self.decay` has been set during init. Similarly, words
that do not exceed `self.delete_min_df` are removed from its
vocabulary and bag-of-words matrix.
Arguments:
raw_documents: A list of documents
Returns:
X_: Bag-of-words matrix
"""
if hasattr(self, "X_"):
X = self.transform(raw_documents)
# Add empty columns if new words are found
columns = csr_matrix((self.X_.shape[0], X.shape[1] - self.X_.shape[1]), dtype=int)
self.X_ = sparse.hstack([self.X_, columns])
# Add empty rows if new topics are found
rows = csr_matrix((X.shape[0] - self.X_.shape[0], self.X_.shape[1]), dtype=int)
self.X_ = sparse.vstack([self.X_, rows])
# Decay of BoW matrix
if self.decay is not None:
self.X_ = self.X_ * (1 - self.decay)
self.X_ += X
else:
self.X_ = self.transform(raw_documents)
if self.delete_min_df is not None:
self._clean_bow()
return self.X_
def _clean_bow(self) -> None:
"""Remove words that do not exceed `self.delete_min_df`."""
# Only keep words with a minimum frequency
indices = np.where(self.X_.sum(0) >= self.delete_min_df)[1]
indices_dict = {index: index for index in indices}
self.X_ = self.X_[:, indices]
# Re-index the vocabulary so it only contains the remaining words
new_vocab = {}
vocabulary_dict = {v: k for k, v in self.vocabulary_.items()}
for i, index in enumerate(indices):
if indices_dict.get(index) is not None:
new_vocab[vocabulary_dict[index]] = i
self.vocabulary_ = new_vocab
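The vocabulary update in `partial_fit` and the decay step in `update_bow` can be sketched with plain dictionaries and lists. This is only an illustration under stated assumptions: `extend_vocabulary` and `decay_counts` are hypothetical helpers, documents are assumed to be tokenized already, and out-of-vocabulary tokens are sorted here for reproducible indices, whereas the class above assigns them in set order.

```python
def extend_vocabulary(vocabulary, tokenized_docs):
    """Append out-of-vocabulary tokens to `vocabulary` with fresh column indices."""
    new_tokens = {token for doc in tokenized_docs for token in doc}
    oov_tokens = sorted(new_tokens - vocabulary.keys())

    # New tokens continue counting from the largest existing index
    next_index = max(vocabulary.values(), default=-1) + 1
    for offset, token in enumerate(oov_tokens):
        vocabulary[token] = next_index + offset
    return vocabulary


def decay_counts(previous_counts, new_counts, decay):
    """Shrink the previous counts by `decay`, then add the counts of the new documents."""
    return [old * (1 - decay) + new for old, new in zip(previous_counts, new_counts)]


vocabulary = extend_vocabulary({"cat": 0, "dog": 1}, [["cat", "bird"], ["ant"]])

# Halve the old counts before adding the new ones
updated = decay_counts([10, 4], [1, 0], decay=0.5)
```

Because existing indices never change, the columns of an earlier bag-of-words matrix stay valid and only empty columns have to be appended for the new tokens.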
@@ -1,32 +0,0 @@
<svg width="228" height="113" viewBox="0 0 228 113" fill="none" xmlns="http://www.w3.org/2000/svg">
<path d="M68.7889 40.7606L54.4174 26.3594C54.1819 26.1238 53.8638 26 53.5317 26H16.34C14.4403 26 12.8962 27.5352 12.8962 29.4337L12.8765 92.5594C12.8765 94.4578 14.4219 96 16.3209 96H65.6905C67.5889 96 69.1343 94.459 69.1349 92.5613L69.1533 41.6413C69.1533 41.3098 69.0225 40.9949 68.7889 40.7606ZM66.634 92.5606C66.634 93.0806 66.2105 93.501 65.6905 93.501H16.3209C15.8003 93.501 15.3768 93.0844 15.3768 92.5644L15.3965 29.4362C15.3965 28.9162 15.8194 28.5003 16.34 28.5003H53.013L66.6517 42.1632L66.634 92.5606Z" fill="black"/>
<path d="M62.2626 40.3752H57.1876C55.8613 40.3752 54.7508 39.3098 54.7508 37.9835V27.2343C54.7508 26.5435 54.1908 25.9841 53.5006 25.9841C52.8105 25.9841 52.2505 26.5441 52.2505 27.2343V37.9835C52.2505 40.6889 54.4816 42.8749 57.187 42.8749H62.2619C62.9521 42.8749 63.5127 42.3162 63.5127 41.6254C63.5127 40.9346 62.9527 40.3752 62.2626 40.3752Z" fill="black"/>
<path d="M78.7584 30.7822L64.387 16.374C64.1514 16.1384 63.8333 16 63.5019 16H26.3095C24.4105 16 22.8746 17.5581 22.8746 19.4571V27.2343C22.8746 27.9251 23.434 28.4844 24.1248 28.4844C24.8156 28.4844 25.3749 27.9244 25.3749 27.2343V19.4571C25.3749 18.9371 25.7902 18.5003 26.3102 18.5003H62.9838L76.6232 32.1689L76.6041 82.574C76.6041 83.0933 76.1813 83.4997 75.6613 83.4997H67.8841C67.1933 83.4997 66.634 84.0597 66.634 84.7498C66.634 85.44 67.1933 86 67.8841 86H75.6613C77.5603 86 79.1044 84.4717 79.1038 82.5733L79.1235 31.6667C79.1235 31.3352 78.9914 31.0165 78.7584 30.7822Z" fill="black"/>
<path d="M72.2333 30.3746H67.1584C65.8321 30.3746 64.7508 29.339 64.7508 28.0127V17.2635C64.7508 16.5733 64.1908 16.0133 63.5006 16.0133C62.8105 16.0133 62.2505 16.5733 62.2505 17.2635V28.0127C62.2505 30.7181 64.453 32.8749 67.1578 32.8749H72.2327C72.9229 32.8749 73.4835 32.3156 73.4835 31.6248C73.4835 30.934 72.9235 30.3746 72.2333 30.3746Z" fill="black"/>
<path d="M22.7838 46.6248H19.7413C19.0511 46.6248 18.4911 47.1841 18.4911 47.8749C18.4911 48.5657 19.0511 49.1251 19.7413 49.1251H22.7838C23.4733 49.1251 24.034 48.5657 24.034 47.8749C24.034 47.1841 23.4733 46.6248 22.7838 46.6248Z" fill="black"/>
<path d="M62.2429 46.6248H28.3991C27.7076 46.6248 27.1489 47.1841 27.1489 47.8749C27.1489 48.5657 27.7083 49.1251 28.3991 49.1251H62.2429C62.9337 49.1251 63.493 48.5657 63.493 47.8749C63.493 47.1841 62.9337 46.6248 62.2429 46.6248Z" fill="black"/>
<path d="M62.2429 52.8749H52.7603C52.0695 52.8749 51.5102 53.4343 51.5102 54.1251C51.5102 54.8159 52.0695 55.3752 52.7603 55.3752H62.2429C62.9337 55.3752 63.493 54.8159 63.493 54.1251C63.493 53.4343 62.9337 52.8749 62.2429 52.8749Z" fill="black"/>
<path d="M47.1457 52.8749H19.7419C19.0518 52.8749 18.4918 53.4343 18.4918 54.1251C18.4918 54.8159 19.0518 55.3752 19.7419 55.3752H47.1457C47.8353 55.3752 48.3959 54.8159 48.3959 54.1251C48.3959 53.4343 47.8353 52.8749 47.1457 52.8749Z" fill="black"/>
<path d="M62.2429 59.1245H19.7419C19.0518 59.1245 18.4918 59.6845 18.4918 60.3746C18.4918 61.0648 19.0518 61.6248 19.7419 61.6248H62.2429C62.9337 61.6248 63.493 61.0648 63.493 60.3746C63.493 59.6845 62.9337 59.1245 62.2429 59.1245Z" fill="black"/>
<path d="M62.2429 77.8749H19.7419C19.0518 77.8749 18.4918 78.4349 18.4918 79.1251C18.4918 79.8152 19.0518 80.3752 19.7419 80.3752H62.2429C62.9337 80.3752 63.493 79.8152 63.493 79.1251C63.493 78.4349 62.9337 77.8749 62.2429 77.8749Z" fill="black"/>
<path d="M22.7838 65.3746H19.7413C19.0511 65.3746 18.4911 65.9346 18.4911 66.6248C18.4911 67.3149 19.0511 67.8749 19.7413 67.8749H22.7838C23.4733 67.8749 24.034 67.3149 24.034 66.6248C24.034 65.9346 23.4733 65.3746 22.7838 65.3746Z" fill="black"/>
<path d="M62.2429 65.3746H28.3991C27.7076 65.3746 27.1489 65.9346 27.1489 66.6248C27.1489 67.3149 27.7083 67.8749 28.3991 67.8749H62.2429C62.9337 67.8749 63.493 67.3149 63.493 66.6248C63.493 65.9346 62.9337 65.3746 62.2429 65.3746Z" fill="black"/>
<path d="M62.2429 71.6248H52.7603C52.0695 71.6248 51.5102 72.1848 51.5102 72.8749C51.5102 73.5651 52.0695 74.1251 52.7603 74.1251H62.2429C62.9337 74.1251 63.493 73.5651 63.493 72.8749C63.493 72.1848 62.9337 71.6248 62.2429 71.6248Z" fill="black"/>
<path d="M47.1457 71.6248H19.7419C19.0518 71.6248 18.4918 72.1848 18.4918 72.8749C18.4918 73.5651 19.0518 74.1251 19.7419 74.1251H47.1457C47.8353 74.1251 48.3959 73.5651 48.3959 72.8749C48.3959 72.1848 47.8353 71.6248 47.1457 71.6248Z" fill="black"/>
<path d="M22.7838 84.1245H19.7413C19.0511 84.1245 18.4911 84.6845 18.4911 85.3746C18.4911 86.0648 19.0511 86.6248 19.7413 86.6248H22.7838C23.4733 86.6248 24.034 86.0648 24.034 85.3746C24.034 84.6845 23.4733 84.1245 22.7838 84.1245Z" fill="black"/>
<path d="M62.2429 84.1245H28.3991C27.7076 84.1245 27.1489 84.6845 27.1489 85.3746C27.1489 86.0648 27.7083 86.6248 28.3991 86.6248H62.2429C62.9337 86.6248 63.493 86.0648 63.493 85.3746C63.493 84.6845 62.9337 84.1245 62.2429 84.1245Z" fill="black"/>
<path d="M72.2143 36.6248H64.7952C64.1044 36.6248 63.5451 37.1841 63.5451 37.8749C63.5451 38.5657 64.1044 39.1251 64.7952 39.1251H72.2136C72.9044 39.1251 73.4644 38.5657 73.4644 37.8749C73.4644 37.1841 72.9051 36.6248 72.2143 36.6248Z" fill="black"/>
<path d="M72.2137 42.8749H67.8841C67.1933 42.8749 66.634 43.4343 66.634 44.1251C66.634 44.8159 67.1933 45.3752 67.8841 45.3752H72.2137C72.9044 45.3752 73.4638 44.8159 73.4638 44.1251C73.4638 43.4343 72.9044 42.8749 72.2137 42.8749Z" fill="black"/>
<path d="M72.2137 49.1245H67.8841C67.1933 49.1245 66.634 49.6838 66.634 50.3746C66.634 51.0654 67.1933 51.6248 67.8841 51.6248H72.2137C72.9044 51.6248 73.4638 51.0654 73.4638 50.3746C73.4638 49.6838 72.9044 49.1245 72.2137 49.1245Z" fill="black"/>
<path d="M72.2136 67.8749H68.267C67.5775 67.8749 67.0168 68.4349 67.0168 69.1251C67.0168 69.8152 67.5775 70.3752 68.267 70.3752H72.2136C72.9044 70.3752 73.4638 69.8152 73.4638 69.1251C73.4638 68.4349 72.9044 67.8749 72.2136 67.8749Z" fill="black"/>
<path d="M72.2137 55.3746H67.8841C67.1933 55.3746 66.634 55.9346 66.634 56.6248C66.634 57.3149 67.1933 57.8749 67.8841 57.8749H72.2137C72.9044 57.8749 73.4638 57.3149 73.4638 56.6248C73.4638 55.934 72.9044 55.3746 72.2137 55.3746Z" fill="black"/>
<path d="M72.2137 61.6248H67.8841C67.1933 61.6248 66.634 62.1848 66.634 62.8749C66.634 63.5651 67.1933 64.1251 67.8841 64.1251H72.2137C72.9044 64.1251 73.4638 63.5651 73.4638 62.8749C73.4638 62.1848 72.9044 61.6248 72.2137 61.6248Z" fill="black"/>
<path d="M72.2137 74.1244H67.8841C67.1933 74.1244 66.634 74.6844 66.634 75.3746C66.634 76.0648 67.1933 76.6248 67.8841 76.6248H72.2137C72.9044 76.6248 73.4638 76.0648 73.4638 75.3746C73.4638 74.6844 72.9044 74.1244 72.2137 74.1244Z" fill="black"/>
<path d="M155.061 57.0607C155.646 56.4749 155.646 55.5251 155.061 54.9393L145.515 45.3934C144.929 44.8076 143.979 44.8076 143.393 45.3934C142.808 45.9792 142.808 46.9289 143.393 47.5147L151.879 56L143.393 64.4853C142.808 65.0711 142.808 66.0208 143.393 66.6066C143.979 67.1924 144.929 67.1924 145.515 66.6066L155.061 57.0607ZM98 57.5H154V54.5H98V57.5Z" fill="black"/>
<path d="M189 13H180V103H189" stroke="black" stroke-width="2"/>
<path d="M204 13H213V103H204" stroke="black" stroke-width="2"/>
<path d="M194.746 16.6543L196 19.2148L198.062 16.666H198.918L196.322 19.8066L197.98 23H197.219L195.883 20.3281L193.721 23H192.871L195.572 19.7305L193.984 16.6543H194.746ZM194.746 30.6543L196 33.2148L198.062 30.666H198.918L196.322 33.8066L197.98 37H197.219L195.883 34.3281L193.721 37H192.871L195.572 33.7305L193.984 30.6543H194.746ZM194.898 50.5723C194.902 50.4395 194.953 50.3242 195.051 50.2266C195.148 50.1289 195.266 50.0781 195.402 50.0742C195.543 50.0742 195.658 50.1211 195.748 50.2148C195.838 50.3086 195.879 50.4258 195.871 50.5664C195.863 50.7031 195.811 50.8184 195.713 50.9121C195.615 51.0059 195.498 51.0527 195.361 51.0527C195.221 51.0566 195.105 51.0137 195.016 50.9238C194.926 50.8301 194.887 50.7129 194.898 50.5723ZM194.898 64.5723C194.902 64.4395 194.953 64.3242 195.051 64.2266C195.148 64.1289 195.266 64.0781 195.402 64.0742C195.543 64.0742 195.658 64.1211 195.748 64.2148C195.838 64.3086 195.879 64.4258 195.871 64.5664C195.863 64.7031 195.811 64.8184 195.713 64.9121C195.615 65.0059 195.498 65.0527 195.361 65.0527C195.221 65.0566 195.105 65.0137 195.016 64.9238C194.926 64.8301 194.887 64.7129 194.898 64.5723ZM194.898 78.5723C194.902 78.4395 194.953 78.3242 195.051 78.2266C195.148 78.1289 195.266 78.0781 195.402 78.0742C195.543 78.0742 195.658 78.1211 195.748 78.2148C195.838 78.3086 195.879 78.4258 195.871 78.5664C195.863 78.7031 195.811 78.8184 195.713 78.9121C195.615 79.0059 195.498 79.0527 195.361 79.0527C195.221 79.0566 195.105 79.0137 195.016 78.9238C194.926 78.8301 194.887 78.7129 194.898 78.5723ZM194.746 86.6543L196 89.2148L198.062 86.666H198.918L196.322 89.8066L197.98 93H197.219L195.883 90.3281L193.721 93H192.871L195.572 89.7305L193.984 86.6543H194.746Z" fill="black"/>
<path d="M203.047 19.2891L202.074 25H201.617L202.504 19.8945L200.906 20.457L200.984 20.0039L202.961 19.2891H203.047Z" fill="black"/>
<path d="M203.523 38.5977L203.461 39H200.004L200.059 38.6211L202.176 36.5234C202.332 36.3672 202.496 36.1992 202.668 36.0195C202.842 35.8398 202.995 35.6471 203.125 35.4414C203.258 35.2357 203.342 35.0169 203.379 34.7852C203.41 34.5638 203.392 34.3672 203.324 34.1953C203.259 34.0234 203.15 33.888 202.996 33.7891C202.842 33.6875 202.651 33.6354 202.422 33.6328C202.161 33.6302 201.93 33.6875 201.727 33.8047C201.526 33.9219 201.361 34.0807 201.23 34.2812C201.103 34.4818 201.018 34.7044 200.977 34.9492H200.523C200.568 34.6237 200.677 34.3307 200.852 34.0703C201.026 33.8099 201.249 33.6055 201.52 33.457C201.793 33.306 202.098 33.2318 202.434 33.2344C202.736 33.237 202.999 33.2995 203.223 33.4219C203.449 33.5443 203.618 33.7188 203.73 33.9453C203.842 34.1719 203.88 34.4388 203.844 34.7461C203.82 34.9518 203.762 35.1497 203.668 35.3398C203.574 35.5273 203.46 35.7083 203.324 35.8828C203.191 36.0547 203.049 36.2188 202.898 36.375C202.747 36.5286 202.602 36.6745 202.461 36.8125L200.645 38.5977H203.523Z" fill="black"/>
<path d="M201.082 94.6953L200.512 98H200.055L200.785 93.7734H201.223L201.082 94.6953ZM200.828 95.625L200.645 95.5078C200.697 95.2786 200.776 95.056 200.883 94.8398C200.99 94.6211 201.122 94.4258 201.281 94.2539C201.443 94.0794 201.628 93.9427 201.836 93.8438C202.047 93.7422 202.281 93.6927 202.539 93.6953C202.771 93.6979 202.962 93.7409 203.113 93.8242C203.267 93.9049 203.385 94.0182 203.469 94.1641C203.552 94.3073 203.604 94.4727 203.625 94.6602C203.648 94.8451 203.646 95.0417 203.617 95.25L203.152 98H202.691L203.164 95.2422C203.193 95.0391 203.191 94.8516 203.16 94.6797C203.132 94.5052 203.059 94.3659 202.941 94.2617C202.824 94.1549 202.648 94.1016 202.414 94.1016C202.206 94.099 202.014 94.1419 201.84 94.2305C201.665 94.3164 201.509 94.4336 201.371 94.582C201.236 94.7279 201.121 94.8919 201.027 95.0742C200.936 95.2539 200.87 95.4375 200.828 95.625Z" fill="black"/>
</svg>


@@ -1,17 +0,0 @@
<svg width="228" height="113" viewBox="0 0 228 113" fill="none" xmlns="http://www.w3.org/2000/svg">
<path d="M51 13H42V103H51" stroke="black" stroke-width="2"/>
<path d="M66 13H75V103H66" stroke="black" stroke-width="2"/>
<path d="M56.7461 16.6543L58 19.2148L60.0625 16.666H60.918L58.3223 19.8066L59.9805 23H59.2188L57.8828 20.3281L55.7207 23H54.8711L57.5723 19.7305L55.9844 16.6543H56.7461ZM56.7461 30.6543L58 33.2148L60.0625 30.666H60.918L58.3223 33.8066L59.9805 37H59.2188L57.8828 34.3281L55.7207 37H54.8711L57.5723 33.7305L55.9844 30.6543H56.7461ZM56.8984 50.5723C56.9023 50.4395 56.9531 50.3242 57.0508 50.2266C57.1484 50.1289 57.2656 50.0781 57.4023 50.0742C57.543 50.0742 57.6582 50.1211 57.748 50.2148C57.8379 50.3086 57.8789 50.4258 57.8711 50.5664C57.8633 50.7031 57.8105 50.8184 57.7129 50.9121C57.6152 51.0059 57.498 51.0527 57.3613 51.0527C57.2207 51.0566 57.1055 51.0137 57.0156 50.9238C56.9258 50.8301 56.8867 50.7129 56.8984 50.5723ZM56.8984 64.5723C56.9023 64.4395 56.9531 64.3242 57.0508 64.2266C57.1484 64.1289 57.2656 64.0781 57.4023 64.0742C57.543 64.0742 57.6582 64.1211 57.748 64.2148C57.8379 64.3086 57.8789 64.4258 57.8711 64.5664C57.8633 64.7031 57.8105 64.8184 57.7129 64.9121C57.6152 65.0059 57.498 65.0527 57.3613 65.0527C57.2207 65.0566 57.1055 65.0137 57.0156 64.9238C56.9258 64.8301 56.8867 64.7129 56.8984 64.5723ZM56.8984 78.5723C56.9023 78.4395 56.9531 78.3242 57.0508 78.2266C57.1484 78.1289 57.2656 78.0781 57.4023 78.0742C57.543 78.0742 57.6582 78.1211 57.748 78.2148C57.8379 78.3086 57.8789 78.4258 57.8711 78.5664C57.8633 78.7031 57.8105 78.8184 57.7129 78.9121C57.6152 79.0059 57.498 79.0527 57.3613 79.0527C57.2207 79.0566 57.1055 79.0137 57.0156 78.9238C56.9258 78.8301 56.8867 78.7129 56.8984 78.5723ZM56.7461 86.6543L58 89.2148L60.0625 86.666H60.918L58.3223 89.8066L59.9805 93H59.2188L57.8828 90.3281L55.7207 93H54.8711L57.5723 89.7305L55.9844 86.6543H56.7461Z" fill="black"/>
<path d="M65.0469 19.2891L64.0742 25H63.6172L64.5039 19.8945L62.9062 20.457L62.9844 20.0039L64.9609 19.2891H65.0469Z" fill="black"/>
<path d="M65.5234 38.5977L65.4609 39H62.0039L62.0586 38.6211L64.1758 36.5234C64.332 36.3672 64.4961 36.1992 64.668 36.0195C64.8424 35.8398 64.9948 35.6471 65.125 35.4414C65.2578 35.2357 65.3424 35.0169 65.3789 34.7852C65.4102 34.5638 65.3919 34.3672 65.3242 34.1953C65.2591 34.0234 65.1497 33.888 64.9961 33.7891C64.8424 33.6875 64.651 33.6354 64.4219 33.6328C64.1615 33.6302 63.9297 33.6875 63.7266 33.8047C63.526 33.9219 63.3607 34.0807 63.2305 34.2812C63.1029 34.4818 63.0182 34.7044 62.9766 34.9492H62.5234C62.5677 34.6237 62.6771 34.3307 62.8516 34.0703C63.026 33.8099 63.2487 33.6055 63.5195 33.457C63.793 33.306 64.0977 33.2318 64.4336 33.2344C64.7357 33.237 64.9987 33.2995 65.2227 33.4219C65.4492 33.5443 65.6185 33.7188 65.7305 33.9453C65.8424 34.1719 65.8802 34.4388 65.8438 34.7461C65.8203 34.9518 65.7617 35.1497 65.668 35.3398C65.5742 35.5273 65.4596 35.7083 65.3242 35.8828C65.1914 36.0547 65.0495 36.2188 64.8984 36.375C64.7474 36.5286 64.6016 36.6745 64.4609 36.8125L62.6445 38.5977H65.5234Z" fill="black"/>
<path d="M63.082 94.6953L62.5117 98H62.0547L62.7852 93.7734H63.2227L63.082 94.6953ZM62.8281 95.625L62.6445 95.5078C62.6966 95.2786 62.776 95.056 62.8828 94.8398C62.9896 94.6211 63.1224 94.4258 63.2812 94.2539C63.4427 94.0794 63.6276 93.9427 63.8359 93.8438C64.0469 93.7422 64.2812 93.6927 64.5391 93.6953C64.7708 93.6979 64.9622 93.7409 65.1133 93.8242C65.2669 93.9049 65.3854 94.0182 65.4688 94.1641C65.5521 94.3073 65.6042 94.4727 65.625 94.6602C65.6484 94.8451 65.6458 95.0417 65.6172 95.25L65.1523 98H64.6914L65.1641 95.2422C65.1927 95.0391 65.1914 94.8516 65.1602 94.6797C65.1315 94.5052 65.0586 94.3659 64.9414 94.2617C64.8242 94.1549 64.6484 94.1016 64.4141 94.1016C64.2057 94.099 64.0143 94.1419 63.8398 94.2305C63.6654 94.3164 63.5091 94.4336 63.3711 94.582C63.2357 94.7279 63.1211 94.8919 63.0273 95.0742C62.9362 95.2539 62.8698 95.4375 62.8281 95.625Z" fill="black"/>
<path d="M161 13H152V103H161" stroke="black" stroke-width="2"/>
<path d="M176 13H185V103H176" stroke="black" stroke-width="2"/>
<path d="M166.746 24.6543L168 27.2148L170.062 24.666H170.918L168.322 27.8066L169.98 31H169.219L167.883 28.3281L165.721 31H164.871L167.572 27.7305L165.984 24.6543H166.746ZM166.746 38.6543L168 41.2148L170.062 38.666H170.918L168.322 41.8066L169.98 45H169.219L167.883 42.3281L165.721 45H164.871L167.572 41.7305L165.984 38.6543H166.746ZM166.746 52.6543L168 55.2148L170.062 52.666H170.918L168.322 55.8066L169.98 59H169.219L167.883 56.3281L165.721 59H164.871L167.572 55.7305L165.984 52.6543H166.746ZM166.746 66.6543L168 69.2148L170.062 66.666H170.918L168.322 69.8066L169.98 73H169.219L167.883 70.3281L165.721 73H164.871L167.572 69.7305L165.984 66.6543H166.746ZM166.746 80.6543L168 83.2148L170.062 80.666H170.918L168.322 83.8066L169.98 87H169.219L167.883 84.3281L165.721 87H164.871L167.572 83.7305L165.984 80.6543H166.746Z" fill="black"/>
<path d="M173.785 28.7168L173.056 33H172.713L173.378 29.1709L172.18 29.5928L172.238 29.2529L173.721 28.7168H173.785Z" fill="black"/>
<path d="M174.143 46.6982L174.096 47H171.503L171.544 46.7158L173.132 45.1426C173.249 45.0254 173.372 44.8994 173.501 44.7646C173.632 44.6299 173.746 44.4854 173.844 44.3311C173.943 44.1768 174.007 44.0127 174.034 43.8389C174.058 43.6729 174.044 43.5254 173.993 43.3965C173.944 43.2676 173.862 43.166 173.747 43.0918C173.632 43.0156 173.488 42.9766 173.316 42.9746C173.121 42.9727 172.947 43.0156 172.795 43.1035C172.645 43.1914 172.521 43.3105 172.423 43.4609C172.327 43.6113 172.264 43.7783 172.232 43.9619H171.893C171.926 43.7178 172.008 43.498 172.139 43.3027C172.27 43.1074 172.437 42.9541 172.64 42.8428C172.845 42.7295 173.073 42.6738 173.325 42.6758C173.552 42.6777 173.749 42.7246 173.917 42.8164C174.087 42.9082 174.214 43.0391 174.298 43.209C174.382 43.3789 174.41 43.5791 174.383 43.8096C174.365 43.9639 174.321 44.1123 174.251 44.2549C174.181 44.3955 174.095 44.5312 173.993 44.6621C173.894 44.791 173.787 44.9141 173.674 45.0312C173.561 45.1465 173.451 45.2559 173.346 45.3594L171.983 46.6982H174.143Z" fill="black"/>
<path d="M172.622 58.6738L172.953 58.6768C173.127 58.6729 173.293 58.6396 173.451 58.5771C173.611 58.5146 173.746 58.4219 173.855 58.2988C173.967 58.1758 174.035 58.0215 174.061 57.8359C174.086 57.666 174.073 57.5176 174.022 57.3906C173.972 57.2617 173.889 57.1611 173.773 57.0889C173.658 57.0146 173.514 56.9766 173.34 56.9746C173.16 56.9727 172.997 57.0088 172.851 57.083C172.704 57.1572 172.582 57.2607 172.484 57.3936C172.389 57.5244 172.324 57.6768 172.291 57.8506H171.951C171.984 57.6182 172.066 57.4131 172.197 57.2354C172.33 57.0576 172.497 56.9199 172.698 56.8223C172.899 56.7227 173.117 56.6738 173.352 56.6758C173.582 56.6758 173.781 56.7256 173.949 56.8252C174.117 56.9229 174.242 57.0596 174.324 57.2354C174.406 57.4111 174.434 57.6152 174.406 57.8477C174.387 58.0215 174.332 58.1748 174.242 58.3076C174.154 58.4385 174.043 58.5488 173.908 58.6387C173.775 58.7266 173.63 58.7939 173.472 58.8408C173.313 58.8857 173.154 58.9092 172.994 58.9111L172.587 58.9082L172.622 58.6738ZM172.575 58.9756L172.61 58.7441H172.977C173.146 58.748 173.307 58.7715 173.457 58.8145C173.609 58.8574 173.742 58.9229 173.855 59.0107C173.971 59.0967 174.057 59.208 174.113 59.3447C174.172 59.4795 174.19 59.6416 174.169 59.8311C174.147 60.0186 174.096 60.1885 174.014 60.3408C173.934 60.4912 173.829 60.6201 173.7 60.7275C173.571 60.835 173.425 60.918 173.261 60.9766C173.097 61.0332 172.922 61.0605 172.736 61.0586C172.557 61.0566 172.393 61.0264 172.244 60.9678C172.096 60.9092 171.968 60.8271 171.86 60.7217C171.755 60.6143 171.676 60.4863 171.623 60.3379C171.572 60.1875 171.555 60.0215 171.57 59.8398L171.91 59.8428C171.893 60.0225 171.916 60.1807 171.98 60.3174C172.047 60.4541 172.146 60.5615 172.276 60.6396C172.409 60.7158 172.565 60.7549 172.745 60.7568C172.937 60.7588 173.108 60.7227 173.261 60.6484C173.415 60.5742 173.541 60.4688 173.639 60.332C173.738 60.1934 173.801 60.0293 173.826 59.8398C173.854 59.6406 173.83 59.4785 173.756 59.3535C173.682 59.2266 173.572 59.1328 173.428 59.0723C173.285 
59.0117 173.123 58.9805 172.941 58.9785L172.575 58.9756Z" fill="black"/>
<path d="M174.444 73.623L174.397 73.9219H171.459L171.497 73.7051L173.914 70.7373H174.222L173.686 71.4727L171.945 73.623H174.444ZM174.298 70.7344L173.562 75H173.22L173.958 70.7344H174.298Z" fill="black"/>
<path d="M172.35 86.8877L172.074 86.8057L172.599 84.7344H174.67L174.623 85.0625H172.848L172.458 86.501C172.575 86.4229 172.703 86.3643 172.842 86.3252C172.98 86.2842 173.12 86.2646 173.261 86.2666C173.454 86.2666 173.621 86.3047 173.762 86.3809C173.902 86.4551 174.016 86.5566 174.102 86.6855C174.189 86.8125 174.249 86.958 174.28 87.1221C174.313 87.2842 174.32 87.4541 174.301 87.6318C174.277 87.8271 174.229 88.0117 174.157 88.1855C174.085 88.3594 173.988 88.5127 173.867 88.6455C173.746 88.7764 173.602 88.8789 173.434 88.9531C173.268 89.0273 173.079 89.0625 172.868 89.0586C172.69 89.0586 172.532 89.0293 172.394 88.9707C172.257 88.9121 172.142 88.8301 172.048 88.7246C171.954 88.6172 171.883 88.4922 171.834 88.3496C171.785 88.2051 171.761 88.0479 171.761 87.8779H172.089C172.089 88.0479 172.117 88.1992 172.174 88.332C172.23 88.4629 172.316 88.5664 172.432 88.6426C172.549 88.7188 172.698 88.7578 172.88 88.7598C173.044 88.7598 173.188 88.7295 173.311 88.6689C173.436 88.6084 173.542 88.5254 173.63 88.4199C173.72 88.3145 173.791 88.1943 173.844 88.0596C173.896 87.9248 173.934 87.7832 173.955 87.6348C173.973 87.5 173.971 87.3711 173.949 87.248C173.928 87.123 173.886 87.0117 173.823 86.9141C173.761 86.8145 173.677 86.7363 173.571 86.6797C173.466 86.6211 173.339 86.5898 173.19 86.5859C173.028 86.584 172.879 86.6094 172.742 86.6621C172.607 86.7148 172.477 86.79 172.35 86.8877Z" fill="black"/>
<path d="M134.061 62.0607C134.646 61.4749 134.646 60.5251 134.061 59.9393L124.515 50.3934C123.929 49.8076 122.979 49.8076 122.393 50.3934C121.808 50.9792 121.808 51.9289 122.393 52.5147L130.879 61L122.393 69.4853C121.808 70.0711 121.808 71.0208 122.393 71.6066C122.979 72.1924 123.929 72.1924 124.515 71.6066L134.061 62.0607ZM91 62.5H133V59.5H91V62.5Z" fill="black"/>
</svg>


@@ -1,14 +0,0 @@
<svg width="228" height="113" viewBox="0 0 228 113" fill="none" xmlns="http://www.w3.org/2000/svg">
<rect x="32" y="12" width="12" height="12" fill="black"/>
<rect x="72" y="9" width="12" height="12" fill="black"/>
<rect x="60" y="32" width="12" height="12" fill="black"/>
<rect x="32" y="44" width="12" height="12" fill="black"/>
<circle cx="166" cy="53" r="6" fill="black"/>
<circle cx="180" cy="19" r="6" fill="black"/>
<circle cx="194" cy="44" r="6" fill="black"/>
<circle cx="154" cy="32" r="6" fill="black"/>
<path d="M90 98L95.1962 107H84.8038L90 98Z" fill="black"/>
<path d="M104 80L109.196 89H98.8038L104 80Z" fill="black"/>
<path d="M121 98L126.196 107H115.804L121 98Z" fill="black"/>
<path d="M127 74L132.196 83H121.804L127 74Z" fill="black"/>
</svg>


File diff suppressed because one or more lines are too long


@@ -1,23 +0,0 @@
<svg width="228" height="113" viewBox="0 0 228 113" fill="none" xmlns="http://www.w3.org/2000/svg">
<line x1="59.8941" y1="40.3059" x2="59.8941" y2="62.85" stroke="black"/>
<line x1="57.9618" y1="40.3059" x2="57.9618" y2="62.85" stroke="black"/>
<line x1="99.1853" y1="40.6618" x2="99.1853" y2="63.2059" stroke="black"/>
<line x1="97.2529" y1="40.6618" x2="97.2529" y2="63.2059" stroke="black"/>
<path d="M51.3695 48.5401V49.794H41.5961V48.5401H51.3695ZM51.3695 53.3565V54.6104H41.5961V53.3565H51.3695Z" fill="#ABA9A9"/>
<path d="M107.229 46.0497L110.651 51.1708L114.084 46.0497H115.748L111.448 52.2606L115.924 58.7294H114.284L110.662 53.3739L107.053 58.7294H105.412L109.889 52.2606L105.588 46.0497H107.229Z" fill="#ABA9A9"/>
<path d="M187.412 51.0458V52.37H175.717V51.0458H187.412ZM182.221 45.5966V58.0184H180.815V45.5966H182.221Z" fill="#ABA9A9"/>
<path d="M126.172 42.3736V60.3736H122.785V42.3736H126.172ZM128.422 54.1627V53.9166C128.422 52.9869 128.555 52.1314 128.82 51.3502C129.086 50.5611 129.473 49.8775 129.981 49.2994C130.488 48.7213 131.113 48.272 131.856 47.9517C132.598 47.6236 133.449 47.4595 134.41 47.4595C135.371 47.4595 136.227 47.6236 136.977 47.9517C137.727 48.272 138.356 48.7213 138.863 49.2994C139.379 49.8775 139.77 50.5611 140.035 51.3502C140.301 52.1314 140.434 52.9869 140.434 53.9166V54.1627C140.434 55.0845 140.301 55.94 140.035 56.7291C139.77 57.5103 139.379 58.1939 138.863 58.7798C138.356 59.358 137.731 59.8072 136.988 60.1275C136.246 60.4478 135.395 60.608 134.434 60.608C133.473 60.608 132.617 60.4478 131.867 60.1275C131.125 59.8072 130.496 59.358 129.981 58.7798C129.473 58.1939 129.086 57.5103 128.82 56.7291C128.555 55.94 128.422 55.0845 128.422 54.1627ZM131.797 53.9166V54.1627C131.797 54.6939 131.844 55.19 131.938 55.6509C132.031 56.1119 132.18 56.5181 132.383 56.8697C132.594 57.2134 132.867 57.483 133.203 57.6783C133.539 57.8736 133.949 57.9713 134.434 57.9713C134.903 57.9713 135.305 57.8736 135.641 57.6783C135.977 57.483 136.246 57.2134 136.449 56.8697C136.653 56.5181 136.801 56.1119 136.895 55.6509C136.996 55.19 137.047 54.6939 137.047 54.1627V53.9166C137.047 53.4009 136.996 52.9166 136.895 52.4634C136.801 52.0025 136.649 51.5963 136.438 51.2447C136.235 50.8853 135.965 50.6041 135.629 50.4009C135.293 50.1978 134.887 50.0963 134.41 50.0963C133.934 50.0963 133.528 50.1978 133.192 50.4009C132.863 50.6041 132.594 50.8853 132.383 51.2447C132.18 51.5963 132.031 52.0025 131.938 52.4634C131.844 52.9166 131.797 53.4009 131.797 53.9166ZM150.535 47.6939H153.594V59.9517C153.594 61.108 153.336 62.0884 152.82 62.8931C152.313 63.7056 151.602 64.3189 150.688 64.733C149.774 65.1548 148.711 65.3658 147.5 65.3658C146.969 65.3658 146.406 65.2955 145.813 65.1548C145.227 65.0142 144.664 64.7955 144.125 64.4986C143.594 64.2017 143.149 63.8267 142.789 63.3736L144.278 61.3814C144.668 61.8345 145.121 62.1861 
145.637 62.4361C146.153 62.6939 146.723 62.8228 147.348 62.8228C147.957 62.8228 148.473 62.7095 148.895 62.483C149.317 62.2642 149.641 61.94 149.867 61.5103C150.094 61.0884 150.207 60.5767 150.207 59.9752V50.6236L150.535 47.6939ZM142.004 54.1861V53.94C142.004 52.9713 142.121 52.0923 142.356 51.3033C142.598 50.5064 142.938 49.8228 143.375 49.2525C143.82 48.6822 144.36 48.2408 144.992 47.9283C145.625 47.6158 146.34 47.4595 147.137 47.4595C147.981 47.4595 148.688 47.6158 149.258 47.9283C149.828 48.2408 150.297 48.6861 150.664 49.2642C151.031 49.8345 151.317 50.5103 151.52 51.2916C151.731 52.065 151.895 52.9127 152.012 53.8345V54.3736C151.895 55.2564 151.719 56.0767 151.485 56.8345C151.25 57.5923 150.942 58.2564 150.559 58.8267C150.176 59.3892 149.699 59.8267 149.129 60.1392C148.567 60.4517 147.895 60.608 147.113 60.608C146.332 60.608 145.625 60.4478 144.992 60.1275C144.367 59.8072 143.832 59.358 143.387 58.7798C142.942 58.2017 142.598 57.522 142.356 56.7408C142.121 55.9595 142.004 55.108 142.004 54.1861ZM145.379 53.94V54.1861C145.379 54.7095 145.43 55.1978 145.531 55.6509C145.633 56.1041 145.789 56.5064 146 56.858C146.219 57.2017 146.488 57.4713 146.809 57.6666C147.137 57.8541 147.524 57.9478 147.969 57.9478C148.586 57.9478 149.09 57.8189 149.481 57.5611C149.871 57.2955 150.164 56.9322 150.36 56.4713C150.555 56.0103 150.668 55.4791 150.699 54.8775V53.3423C150.684 52.8502 150.617 52.4088 150.5 52.0181C150.383 51.6197 150.219 51.2798 150.008 50.9986C149.797 50.7173 149.524 50.4986 149.188 50.3423C148.852 50.1861 148.453 50.108 147.992 50.108C147.547 50.108 147.16 50.2095 146.832 50.4127C146.512 50.608 146.242 50.8775 146.024 51.2213C145.813 51.565 145.653 51.9713 145.543 52.44C145.434 52.9009 145.379 53.4009 145.379 53.94Z" fill="black"/>
<path d="M157.212 52.8801V52.6603C157.212 50.756 157.413 48.9738 157.813 47.3137C158.213 45.6535 158.736 44.1594 159.38 42.8312C160.035 41.5031 160.748 40.3752 161.519 39.4474C162.3 38.5099 163.067 37.8166 163.819 37.3674L164.244 38.5685C163.599 39.0275 162.96 39.6916 162.325 40.5607C161.7 41.4299 161.133 42.4748 160.626 43.6955C160.118 44.9162 159.712 46.2785 159.41 47.7824C159.107 49.2863 158.956 50.8976 158.956 52.6164V52.9094C158.956 54.6281 159.107 56.2394 159.41 57.7433C159.712 59.2473 160.118 60.6096 160.626 61.8303C161.133 63.0607 161.7 64.1154 162.325 64.9943C162.96 65.883 163.599 66.5666 164.244 67.0451L163.819 68.173C163.067 67.7238 162.3 67.0402 161.519 66.1223C160.748 65.2141 160.035 64.1008 159.38 62.7824C158.736 61.4738 158.213 59.9846 157.813 58.3146C157.413 56.6447 157.212 54.8332 157.212 52.8801Z" fill="#ABA9A9"/>
<path d="M221.935 53.2359V53.0162C221.935 51.1119 221.734 49.3297 221.334 47.6695C220.934 46.0093 220.411 44.5152 219.767 43.1871C219.112 41.8589 218.399 40.731 217.628 39.8033C216.847 38.8658 216.08 38.1724 215.328 37.7232L214.903 38.9244C215.548 39.3834 216.188 40.0474 216.822 40.9166C217.447 41.7857 218.014 42.8306 218.521 44.0513C219.029 45.272 219.435 46.6343 219.737 48.1382C220.04 49.6422 220.191 51.2535 220.191 52.9722V53.2652C220.191 54.9839 220.04 56.5953 219.737 58.0992C219.435 59.6031 219.029 60.9654 218.521 62.1861C218.014 63.4166 217.447 64.4713 216.822 65.3502C216.188 66.2388 215.548 66.9224 214.903 67.4009L215.328 68.5289C216.08 68.0797 216.847 67.3961 217.628 66.4781C218.399 65.5699 219.112 64.4566 219.767 63.1382C220.411 61.8297 220.934 60.3404 221.334 58.6705C221.734 57.0005 221.935 55.189 221.935 53.2359Z" fill="#ABA9A9"/>
<path d="M200.208 29.5273L195.345 44H191.935L198.31 26.9375H200.489L200.208 29.5273ZM204.275 44L199.388 29.5273L199.095 26.9375H201.286L207.696 44H204.275ZM204.052 37.6602V40.2031H194.9V37.6602H204.052Z" fill="black"/>
<path d="M171.364 43.9083V61.0177H168.258V47.5294L164.145 48.8888V46.381L171.012 43.9083H171.364Z" fill="black"/>
<line x1="187.929" y1="46.6207" x2="211.762" y2="46.6207" stroke="#ABA9A9"/>
<path d="M201.288 69.1207H198.17V55.2691C198.17 54.316 198.354 53.5152 198.721 52.8668C199.088 52.2105 199.612 51.7144 200.291 51.3785C200.971 51.0425 201.772 50.8746 202.694 50.8746C202.998 50.8746 203.288 50.8941 203.561 50.9332C203.842 50.9722 204.12 51.0269 204.393 51.0972L204.334 53.4527C204.186 53.4136 204.022 53.3863 203.842 53.3707C203.67 53.355 203.479 53.3472 203.268 53.3472C202.846 53.3472 202.487 53.4214 202.19 53.5699C201.893 53.7183 201.666 53.9371 201.51 54.2261C201.362 54.5074 201.288 54.855 201.288 55.2691V69.1207ZM203.854 56.441V58.6675H196.26V56.441H203.854Z" fill="black"/>
<path d="M204.029 65.9971L204.682 67.5322L205.83 65.9971H206.701L205.006 68.1182L206.037 70.2236H205.268L204.568 68.6416L203.377 70.2236H202.514L204.256 68.0479L203.26 65.9971H204.029Z" fill="#0277BD"/>
<path d="M72.0705 47.6938V50.0845H64.6877V47.6938H72.0705ZM66.5158 44.5649H69.8908V56.5532C69.8908 56.9204 69.9377 57.2017 70.0314 57.397C70.133 57.5923 70.2814 57.729 70.4767 57.8071C70.6721 57.8774 70.9182 57.9126 71.215 57.9126C71.426 57.9126 71.6135 57.9048 71.7775 57.8892C71.9494 57.8657 72.0939 57.8423 72.2111 57.8188L72.2228 60.3032C71.9338 60.397 71.6213 60.4712 71.2853 60.5259C70.9494 60.5806 70.5783 60.6079 70.1721 60.6079C69.4299 60.6079 68.7814 60.4868 68.2267 60.2446C67.6799 59.9946 67.258 59.5962 66.9611 59.0493C66.6642 58.5024 66.5158 57.7837 66.5158 56.8931V44.5649ZM78.2932 60.3735H74.8947V46.5688C74.8947 45.6079 75.0822 44.7993 75.4572 44.1431C75.84 43.479 76.3752 42.979 77.0627 42.6431C77.758 42.2993 78.5822 42.1274 79.5353 42.1274C79.8478 42.1274 80.1486 42.1509 80.4377 42.1978C80.7267 42.2368 81.008 42.2876 81.2814 42.3501L81.2463 44.8931C81.0978 44.854 80.9416 44.8267 80.7775 44.811C80.6135 44.7954 80.4221 44.7876 80.2033 44.7876C79.7971 44.7876 79.4494 44.8579 79.1603 44.9985C78.8791 45.1313 78.6642 45.3306 78.5158 45.5962C78.3674 45.8618 78.2932 46.186 78.2932 46.5688V60.3735ZM80.8244 47.6938V50.0845H73.008V47.6938H80.8244Z" fill="black"/>
<path d="M82.291 59.894L82.9434 61.4291L84.0918 59.894H84.9629L83.2676 62.0151L84.2989 64.1205H83.5293L82.8301 62.5385L81.6387 64.1205H80.7754L82.5176 61.9448L81.5215 59.894H82.291ZM90.2442 63.6088C90.416 63.6114 90.5762 63.5789 90.7246 63.5112C90.8731 63.4435 90.9994 63.3471 91.1035 63.2221C91.2077 63.0971 91.2819 62.9526 91.3262 62.7885L91.9981 62.7846C91.9564 63.0685 91.8457 63.3172 91.666 63.5307C91.489 63.7442 91.2715 63.9109 91.0137 64.0307C90.7585 64.1479 90.4916 64.2039 90.2129 64.1987C89.916 64.1935 89.6634 64.1323 89.4551 64.0151C89.2494 63.8953 89.084 63.7364 88.959 63.5385C88.834 63.3406 88.7481 63.1179 88.7012 62.8705C88.6543 62.6205 88.6439 62.364 88.67 62.101L88.6856 61.933C88.7168 61.6492 88.7858 61.3797 88.8926 61.1245C88.9994 60.8666 89.1413 60.6388 89.3184 60.4409C89.4981 60.2403 89.7103 60.0841 89.9551 59.9721C90.1999 59.8601 90.4746 59.808 90.7793 59.8159C91.0762 59.8211 91.334 59.8914 91.5528 60.0268C91.7715 60.1596 91.9408 60.3406 92.0606 60.5698C92.1804 60.7989 92.2403 61.0593 92.2403 61.351L91.5762 61.3471C91.5736 61.1804 91.541 61.0268 91.4785 60.8862C91.416 60.7455 91.3236 60.6323 91.2012 60.5463C91.0788 60.4604 90.9278 60.4135 90.7481 60.4057C90.5319 60.4005 90.3431 60.4409 90.1817 60.5268C90.0228 60.6127 89.8874 60.7312 89.7754 60.8823C89.666 61.0307 89.5788 61.1961 89.5137 61.3784C89.4512 61.5606 89.4082 61.7455 89.3848 61.933L89.3653 62.0971C89.3496 62.2638 89.347 62.4343 89.3575 62.6088C89.3705 62.7833 89.4069 62.9461 89.4668 63.0971C89.5267 63.2455 89.6192 63.3666 89.7442 63.4604C89.8692 63.5541 90.0358 63.6036 90.2442 63.6088Z" fill="#0277BD"/>
<path d="M85.7754 63.2612L85.6817 63.8393C85.6374 64.1231 85.5371 64.3875 85.3809 64.6323C85.2246 64.8771 85.0332 65.0854 84.8067 65.2573L84.416 64.9643C84.5072 64.8523 84.5905 64.7377 84.666 64.6205C84.7416 64.506 84.8054 64.3849 84.8575 64.2573C84.9121 64.1297 84.9538 63.9955 84.9825 63.8549L85.084 63.2612H85.7754Z" fill="#ABA9A9"/>
<path d="M13.8242 58.3923L17.2227 44.5993H19.0625L19.1797 47.5056L15.5469 61.6618H13.6016L13.8242 58.3923ZM11.6797 44.5993L14.4688 58.3454V61.6618H12.3477L8.48047 44.5993H11.6797ZM22.707 58.2868L25.4492 44.5993H28.6602L24.793 61.6618H22.6719L22.707 58.2868ZM19.9414 44.5993L23.3398 58.4391L23.5391 61.6618H21.5938L17.9727 47.4938L18.1133 44.5993H19.9414Z" fill="black"/>
<path d="M29.2528 60.5382L29.9052 62.0734L31.0536 60.5382H31.9247L30.2294 62.6593L31.2606 64.7648H30.4911L29.7919 63.1827L28.6005 64.7648H27.7372L29.4794 62.589L28.4833 60.5382H29.2528ZM37.2059 64.2531C37.3778 64.2557 37.538 64.2231 37.6864 64.1554C37.8348 64.0877 37.9611 63.9913 38.0653 63.8663C38.1695 63.7413 38.2437 63.5968 38.288 63.4327L38.9598 63.4288C38.9182 63.7127 38.8075 63.9614 38.6278 64.1749C38.4507 64.3885 38.2333 64.5551 37.9755 64.6749C37.7203 64.7921 37.4533 64.8481 37.1747 64.8429C36.8778 64.8377 36.6252 64.7765 36.4169 64.6593C36.2111 64.5395 36.0458 64.3807 35.9208 64.1827C35.7958 63.9848 35.7098 63.7622 35.663 63.5148C35.6161 63.2648 35.6057 63.0083 35.6317 62.7452L35.6473 62.5773C35.6786 62.2934 35.7476 62.0239 35.8544 61.7687C35.9611 61.5109 36.1031 61.283 36.2802 61.0851C36.4598 60.8846 36.6721 60.7283 36.9169 60.6163C37.1617 60.5044 37.4364 60.4523 37.7411 60.4601C38.038 60.4653 38.2958 60.5356 38.5145 60.671C38.7333 60.8038 38.9025 60.9848 39.0223 61.214C39.1421 61.4432 39.202 61.7036 39.202 61.9952L38.538 61.9913C38.5354 61.8247 38.5028 61.671 38.4403 61.5304C38.3778 61.3898 38.2854 61.2765 38.163 61.1906C38.0406 61.1046 37.8895 61.0577 37.7098 61.0499C37.4937 61.0447 37.3049 61.0851 37.1434 61.171C36.9846 61.257 36.8492 61.3754 36.7372 61.5265C36.6278 61.6749 36.5406 61.8403 36.4755 62.0226C36.413 62.2049 36.37 62.3898 36.3466 62.5773L36.327 62.7413C36.3114 62.908 36.3088 63.0786 36.3192 63.2531C36.3322 63.4275 36.3687 63.5903 36.4286 63.7413C36.4885 63.8898 36.5809 64.0109 36.7059 64.1046C36.8309 64.1984 36.9976 64.2478 37.2059 64.2531Z" fill="#0277BD"/>
<path d="M32.7372 63.9054L32.6434 64.4835C32.5992 64.7674 32.4989 65.0317 32.3427 65.2765C32.1864 65.5213 31.995 65.7296 31.7684 65.9015L31.3778 65.6085C31.469 65.4965 31.5523 65.382 31.6278 65.2648C31.7033 65.1502 31.7671 65.0291 31.8192 64.9015C31.8739 64.7739 31.9156 64.6398 31.9442 64.4991L32.0458 63.9054H32.7372Z" fill="#ABA9A9"/>
</svg>


Some files were not shown because too many files have changed in this diff.