Complete part of the system crawler.

2025-08-20 21:54:31 +08:00
parent 995ec11144
commit 047bbf8c26
536 changed files with 20 additions and 115899 deletions
@@ -288,7 +288,7 @@ tmp/
 # ==== Configuration and secrets ====
 # Sensitive configuration files
-config.py
+# config.py
 config.ini
 secrets.json
 .secrets
@@ -1,42 +0,0 @@
 # ChineseNlpCorpus
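 Collects, curates, and publishes Chinese natural language processing corpora/datasets, working with like-minded contributors to advance Chinese NLP.
 ## Sentiment / opinion / review polarity analysis
 | Dataset | Overview | Download |
 | ----- | -------- | ------- |
 | ChnSentiCorp_htl_all | 7,000+ hotel reviews: 5,000+ positive and 2,000+ negative | [View](./datasets/ChnSentiCorp_htl_all/intro.ipynb) |
 | waimai_10k | User reviews collected from a food-delivery platform: 4,000 positive and about 8,000 negative | [View](./datasets/waimai_10k/intro.ipynb) |
 | online_shopping_10_cats | 10 categories, 60,000+ reviews in total, about 30,000 positive and 30,000 negative,<br /> covering books, tablets, mobile phones, fruit, shampoo, water heaters, Mengniu, clothing, computers, and hotels | [View](./datasets/online_shopping_10_cats/intro.ipynb) |
 | weibo_senti_100k | 100,000+ sentiment-labeled Sina Weibo posts, about 50,000 positive and 50,000 negative | [View](./datasets/weibo_senti_100k/intro.ipynb) |
 | simplifyweibo_4_moods | 360,000+ sentiment-labeled Sina Weibo posts covering 4 emotions:<br /> about 200,000 for joy, and about 50,000 each for anger, disgust, and depression | [View](./datasets/simplifyweibo_4_moods/intro.ipynb) |
 | dmsc_v2 | 28 movies, 700,000+ users, 2 million+ ratings/reviews | [View](./datasets/dmsc_v2/intro.ipynb) |
 | yf_dianping | 240,000 restaurants, 540,000 users, 4.4 million reviews/ratings | [View](./datasets/yf_dianping/intro.ipynb) |
 | yf_amazon | 520,000 products, 1,100+ categories, 1.42 million users, 7.2 million reviews/ratings | [View](./datasets/yf_amazon/intro.ipynb) |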
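 ## Chinese named entity recognition
 | Dataset | Overview | Download |
 | ----- | -------- | ------- |
 | dh_msra | 50,000+ annotated Chinese named entity recognition examples (locations, organizations, persons) | [View](./datasets/dh_msra/intro.ipynb) |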
 ## Recommender systems
 | Dataset | Overview | Download |
 | ----- | -------- | ------- |
 | ez_douban | 50,000+ movies (30,000+ with titles, 20,000+ without), 28,000 users, 2.8 million ratings | [View](./datasets/ez_douban/intro.ipynb) |
 | dmsc_v2 | 28 movies, 700,000+ users, 2 million+ ratings/reviews | [View](./datasets/dmsc_v2/intro.ipynb) |
 | yf_dianping | 240,000 restaurants, 540,000 users, 4.4 million reviews/ratings | [View](./datasets/yf_dianping/intro.ipynb) |
 | yf_amazon | 520,000 products, 1,100+ categories, 1.42 million users, 7.2 million reviews/ratings | [View](./datasets/yf_amazon/intro.ipynb) |
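 ## FAQ question answering
 | Dataset | Overview | Download |
 | ----- | -------- | ------- |
 | 保险知道 (Insurance Zhidao) | 8,000+ insurance-industry Q&A records, including user questions, community answers, and best answers | [View](./datasets/baoxianzhidao/intro.ipynb) |
 | 安徽电信知道 (Anhui Telecom Zhidao) | 156,000 telecom Q&A records, including user questions, community answers, and best answers | [View](./datasets/anhuidianxinzhidao/intro.ipynb) |
 | 金融知道 (Finance Zhidao) | 770,000 finance-industry Q&A records, including user questions, community answers, and best answers | [View](./datasets/financezhidao/intro.ipynb) |
 | 法律知道 (Law Zhidao) | 36,000 legal Q&A records, including user questions, community answers, and best answers | [View](./datasets/lawzhidao/intro.ipynb) |
 | 联通知道 (China Unicom Zhidao) | 203,000 China Unicom Q&A records, including user questions, community answers, and best answers | [View](./datasets/liantongzhidao/intro.ipynb) |
 | 农行知道 (Agricultural Bank of China Zhidao) | 40,000 Agricultural Bank of China Q&A records, including user questions, community answers, and best answers | [View](./datasets/nonghangzhidao/intro.ipynb) |
 | 保险知道 (Insurance Zhidao) | 588,000 insurance-industry Q&A records, including user questions, community answers, and best answers | [View](./datasets/baoxianzhidao/intro.ipynb) |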
@@ -1,668 +0,0 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# About ChnSentiCorp_htl_all\n",
    "0. **Download:** [Github](https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/ChnSentiCorp_htl_all/ChnSentiCorp_htl_all.csv)\n",
    "1. **Overview:** 7,000+ hotel reviews: 5,000+ positive and 2,000+ negative\n",
    "2. **Suggested task:** sentiment / opinion / review polarity analysis\n",
    "3. **Data source:** [Ctrip](http://www.ctrip.com/)\n",
    "4. **Original dataset:** ChnSentiCorp_htl, compiled by [Tan Songbo (谭松波)](http://people.ucas.ac.cn/~0012244)\n",
    "5. **Processing:**\n",
    "    1. Merged the original 10,000 separate files into a single file\n",
    "    2. Changed the label of negative reviews from -1 to 0\n",
    "    3. Removed duplicates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = 'path/to/ChnSentiCorp_htl_all/'  # placeholder: folder containing the CSV, with a trailing slash"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. ChnSentiCorp_htl_all.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of reviews (total): 7766\n",
      "Number of reviews (positive): 5322\n",
      "Number of reviews (negative): 2444\n"
     ]
    }
   ],
   "source": [
    "pd_all = pd.read_csv(path + 'ChnSentiCorp_htl_all.csv')\n",
    "\n",
    "print('Number of reviews (total): %d' % pd_all.shape[0])\n",
    "print('Number of reviews (positive): %d' % pd_all[pd_all.label==1].shape[0])\n",
    "print('Number of reviews (negative): %d' % pd_all[pd_all.label==0].shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| label | 1 for a positive review, 0 for a negative review |\n",
    "| review | Review text |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5612</th>\n",
       "      <td>0</td>\n",
       "      <td>房间小得无法想象,建议个子大的不要选择,一般的睡觉脚也伸不直.房间不超过10平方,彩电是14...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7321</th>\n",
       "      <td>0</td>\n",
       "      <td>我们一家人带孩子去过“五.一”，在协程网上挑了半天才选中的酒店，但看来还是错了。1.酒店除了...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3870</th>\n",
       "      <td>1</td>\n",
       "      <td>周六到西山去采橘子,路过这家酒店的时候就觉得应该不错的,采好橘子回来天也晚了,就临时决定住在...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4057</th>\n",
       "      <td>1</td>\n",
       "      <td>交通很便利,到渔人码头和港澳码头都在步行的范围之内.CHECKIN和CHECKOUT的速度都...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1452</th>\n",
       "      <td>1</td>\n",
       "      <td>很不错的一个酒店,床很大,很舒服.酒店员工的服务态度很亲切.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4805</th>\n",
       "      <td>1</td>\n",
       "      <td>酒店环境和服务都还不错，地理位置也不错，尤其是酒店北面的川北凉粉确实好吃，不过就是隔音效果不...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6868</th>\n",
       "      <td>0</td>\n",
       "      <td>旧楼改建的酒店，期望不要太高。酒店经理的态度很好，会帮助解决问题。有一位前台小姐的态度实在是...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1345</th>\n",
       "      <td>1</td>\n",
       "      <td>经常去海口出差,但从没住过该酒店.看外表感觉一般吧其实酒店里面还真不错,房间是新装修的(我住...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2026</th>\n",
       "      <td>1</td>\n",
       "      <td>算是海口市比较好的酒店了。处于市中心，购物方便。服务态度好。保险柜出问题了叫人来开，打个电话...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2805</th>\n",
       "      <td>1</td>\n",
       "      <td>感受的是热情的服务！从入门开始，一直很愉快！房间硬件只是准2星的吧，卫生间淋浴头在马桶上方，...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2915</th>\n",
       "      <td>1</td>\n",
       "      <td>房间很整洁，尤其是床上的哪个靠枕是我以前所住过宾馆没有的，红色的很喜庆。虽然是在当地比较繁华...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1803</th>\n",
       "      <td>1</td>\n",
       "      <td>准确的说，酒店的环境很漂亮，房间设施也还行，可以算4星标准。但是，卫生间下水道的气味实在是让...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4729</th>\n",
       "      <td>1</td>\n",
       "      <td>价格越来越高了,周遍不方便,去哪里都需要打车.不过装修风格很时尚舒适.服务态度不错.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1913</th>\n",
       "      <td>1</td>\n",
       "      <td>地理位置不错。但好像人气不太旺。不过下次也会考虑住这的。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7159</th>\n",
       "      <td>0</td>\n",
       "      <td>设施老化，紧靠马路噪音太大。晚上楼上卫生间的水流声和空调噪音非常大，无法入眠，跟总台反映后，...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1119</th>\n",
       "      <td>1</td>\n",
       "      <td>11月份住了一次。1.服务方面还不错，门童挺积极。2.感觉房间略有陈旧。3.早餐品种还算丰富...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2170</th>\n",
       "      <td>1</td>\n",
       "      <td>总的来说，酒店还不错。比较安静，地理位置比较好，服务也不错，包括入住和结账。不太好的地方，7...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2793</th>\n",
       "      <td>1</td>\n",
       "      <td>我喜欢那里,性价比很高地.去太原90%都住在那里的.服务员的服务很不错</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5895</th>\n",
       "      <td>0</td>\n",
       "      <td>非常糟糕！1。我们通过其商务中心包了一辆车游西湖，该车拉我们去不正规景点买茶叶（我们买了），...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4089</th>\n",
       "      <td>1</td>\n",
       "      <td>我是7月9号晚10点多的时候入住的，房间很新，据说是跟格林豪泰是同一公司的，可能是是新开业的...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      label                                             review\n",
       "5612      0  房间小得无法想象,建议个子大的不要选择,一般的睡觉脚也伸不直.房间不超过10平方,彩电是14...\n",
       "7321      0  我们一家人带孩子去过“五.一”，在协程网上挑了半天才选中的酒店，但看来还是错了。1.酒店除了...\n",
       "3870      1  周六到西山去采橘子,路过这家酒店的时候就觉得应该不错的,采好橘子回来天也晚了,就临时决定住在...\n",
       "4057      1  交通很便利,到渔人码头和港澳码头都在步行的范围之内.CHECKIN和CHECKOUT的速度都...\n",
       "1452      1                     很不错的一个酒店,床很大,很舒服.酒店员工的服务态度很亲切.\n",
       "4805      1  酒店环境和服务都还不错，地理位置也不错，尤其是酒店北面的川北凉粉确实好吃，不过就是隔音效果不...\n",
       "6868      0  旧楼改建的酒店，期望不要太高。酒店经理的态度很好，会帮助解决问题。有一位前台小姐的态度实在是...\n",
       "1345      1  经常去海口出差,但从没住过该酒店.看外表感觉一般吧其实酒店里面还真不错,房间是新装修的(我住...\n",
       "2026      1  算是海口市比较好的酒店了。处于市中心，购物方便。服务态度好。保险柜出问题了叫人来开，打个电话...\n",
       "2805      1  感受的是热情的服务！从入门开始，一直很愉快！房间硬件只是准2星的吧，卫生间淋浴头在马桶上方，...\n",
       "2915      1  房间很整洁，尤其是床上的哪个靠枕是我以前所住过宾馆没有的，红色的很喜庆。虽然是在当地比较繁华...\n",
       "1803      1  准确的说，酒店的环境很漂亮，房间设施也还行，可以算4星标准。但是，卫生间下水道的气味实在是让...\n",
       "4729      1         价格越来越高了,周遍不方便,去哪里都需要打车.不过装修风格很时尚舒适.服务态度不错.\n",
       "1913      1                       地理位置不错。但好像人气不太旺。不过下次也会考虑住这的。\n",
       "7159      0  设施老化，紧靠马路噪音太大。晚上楼上卫生间的水流声和空调噪音非常大，无法入眠，跟总台反映后，...\n",
       "1119      1  11月份住了一次。1.服务方面还不错，门童挺积极。2.感觉房间略有陈旧。3.早餐品种还算丰富...\n",
       "2170      1  总的来说，酒店还不错。比较安静，地理位置比较好，服务也不错，包括入住和结账。不太好的地方，7...\n",
       "2793      1                我喜欢那里,性价比很高地.去太原90%都住在那里的.服务员的服务很不错\n",
       "5895      0  非常糟糕！1。我们通过其商务中心包了一辆车游西湖，该车拉我们去不正规景点买茶叶（我们买了），...\n",
       "4089      1  我是7月9号晚10点多的时候入住的，房间很新，据说是跟格林豪泰是同一公司的，可能是是新开业的..."
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd_all.sample(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Building balanced corpora\n",
    "\n",
    "- The original dataset also includes 3 balanced corpora: ChnSentiCorp_htl_ba_2000, ChnSentiCorp_htl_ba_4000, ChnSentiCorp_htl_ba_6000\n",
    "- Similar balanced corpora are easy to build by random sampling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd_positive = pd_all[pd_all.label==1]\n",
    "pd_negative = pd_all[pd_all.label==0]\n",
    "\n",
    "def get_balance_corpus(corpus_size, corpus_pos, corpus_neg):\n",
    "    # Draw corpus_size // 2 reviews per class; sample with replacement only when a class has fewer rows than that\n",
    "    sample_size = corpus_size // 2\n",
    "    pd_corpus_balance = pd.concat([corpus_pos.sample(sample_size, replace=corpus_pos.shape[0]<sample_size),\n",
    "                                   corpus_neg.sample(sample_size, replace=corpus_neg.shape[0]<sample_size)])\n",
    "\n",
    "    print('Number of reviews (total): %d' % pd_corpus_balance.shape[0])\n",
    "    print('Number of reviews (positive): %d' % pd_corpus_balance[pd_corpus_balance.label==1].shape[0])\n",
    "    print('Number of reviews (negative): %d' % pd_corpus_balance[pd_corpus_balance.label==0].shape[0])\n",
    "\n",
    "    return pd_corpus_balance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of reviews (total): 2000\n",
      "Number of reviews (positive): 1000\n",
      "Number of reviews (negative): 1000\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5536</th>\n",
       "      <td>0</td>\n",
       "      <td>建议携程不要和这家酒店合作,名曰三星,要我看准星级都勉强!首先不在市区里面(去涵江区打车还要...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4086</th>\n",
       "      <td>1</td>\n",
       "      <td>感觉比老街口客栈舒适，很中规中矩的3星级，推荐大家住主楼的豪华间，设施比较好，前台和大堂的服...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6112</th>\n",
       "      <td>0</td>\n",
       "      <td>是我遇到的最差的4星酒店，进门没人管，进去要我和大堂打招呼，退房也很慢，不会再去住了</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4440</th>\n",
       "      <td>1</td>\n",
       "      <td>房间的设施不错，由于武夷山市是个小地方，酒店离景区有一定距离，如果没有自己开车就不太方便，但...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2706</th>\n",
       "      <td>1</td>\n",
       "      <td>首次入住该酒店,环境雅致,服务非常不错,很多笑脸,感觉热情,早餐可以接受,有送餐服务以后去徐...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1770</th>\n",
       "      <td>1</td>\n",
       "      <td>不错!就是洗澡的地方小点~~下回去还住这家~~</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4306</th>\n",
       "      <td>1</td>\n",
       "      <td>环境位置很好,房间情况尚可,早餐一般般,价格偏高了一些.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2161</th>\n",
       "      <td>1</td>\n",
       "      <td>位置优越，出行方便。就是房间较小，床位较小，房间装修较旧，其他方面都不错。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7667</th>\n",
       "      <td>0</td>\n",
       "      <td>酒店周围环境差，内部也很旧，卫生不好，很脏，总之没什么好的，下次决不住这。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4419</th>\n",
       "      <td>1</td>\n",
       "      <td>我7月24号入住瑞豪酒店，开始有些不顺利，但是那里的管理还是非常好的，有位姓赵的经理发现问题...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      label                                             review\n",
       "5536      0  建议携程不要和这家酒店合作,名曰三星,要我看准星级都勉强!首先不在市区里面(去涵江区打车还要...\n",
       "4086      1  感觉比老街口客栈舒适，很中规中矩的3星级，推荐大家住主楼的豪华间，设施比较好，前台和大堂的服...\n",
       "6112      0         是我遇到的最差的4星酒店，进门没人管，进去要我和大堂打招呼，退房也很慢，不会再去住了\n",
       "4440      1  房间的设施不错，由于武夷山市是个小地方，酒店离景区有一定距离，如果没有自己开车就不太方便，但...\n",
       "2706      1  首次入住该酒店,环境雅致,服务非常不错,很多笑脸,感觉热情,早餐可以接受,有送餐服务以后去徐...\n",
       "1770      1                            不错!就是洗澡的地方小点~~下回去还住这家~~\n",
       "4306      1                       环境位置很好,房间情况尚可,早餐一般般,价格偏高了一些.\n",
       "2161      1              位置优越，出行方便。就是房间较小，床位较小，房间装修较旧，其他方面都不错。\n",
       "7667      0              酒店周围环境差，内部也很旧，卫生不好，很脏，总之没什么好的，下次决不住这。\n",
       "4419      1  我7月24号入住瑞豪酒店，开始有些不顺利，但是那里的管理还是非常好的，有位姓赵的经理发现问题..."
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ChnSentiCorp_htl_ba_2000 = get_balance_corpus(2000, pd_positive, pd_negative)\n",
    "\n",
    "ChnSentiCorp_htl_ba_2000.sample(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of reviews (total): 4000\n",
      "Number of reviews (positive): 2000\n",
      "Number of reviews (negative): 2000\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3605</th>\n",
       "      <td>1</td>\n",
       "      <td>酒店就在海水浴场旁边，出门到接触到海水两分钟，如果要和海水亲近的朋友，极力推荐。这样游泳换衣...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7260</th>\n",
       "      <td>0</td>\n",
       "      <td>TheWorsehotelinChengdurightnow,checkoutat12.30...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5762</th>\n",
       "      <td>0</td>\n",
       "      <td>房间还算可以，不过前台服务人员的态度，受不了，我晚上11点多到酒店CHEKIN第二天退房的时...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5790</th>\n",
       "      <td>0</td>\n",
       "      <td>酒店设施陈旧，浴缸排水不畅，入住无房，一间16：00，一间22：00，早餐差</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4504</th>\n",
       "      <td>1</td>\n",
       "      <td>虽是公寓式酒店，但其房间整洁程度、全方位的服务都给我留下了很好的印象。丝丝不完善之处在于很多...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5246</th>\n",
       "      <td>1</td>\n",
       "      <td>很好的酒店，很喜欢，房间很干净很漂亮，从房间的窗口看出去，超美的，在市中心区域，出行也非常的...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>624</th>\n",
       "      <td>1</td>\n",
       "      <td>在临沂，这个酒店算是比较有档次的了，给外国客人的服务也比较合格。可惜电视内容比较单调，国外的...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1382</th>\n",
       "      <td>1</td>\n",
       "      <td>4年前住过，我和德国同事都觉得很不错。今年我又选了豪门，还是觉得很好。自助餐品种丰富，房间宽...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3723</th>\n",
       "      <td>1</td>\n",
       "      <td>价格不高,比较实惠,服务也不错,离闹市区不远.交通也比较方便.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3328</th>\n",
       "      <td>1</td>\n",
       "      <td>房间：建筑风格比较独特。木屋矗立在随潮汐涨落的水中，围廊象迷宫一样。看着自己的小屋，却没有直...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      label                                             review\n",
       "3605      1  酒店就在海水浴场旁边，出门到接触到海水两分钟，如果要和海水亲近的朋友，极力推荐。这样游泳换衣...\n",
       "7260      0  TheWorsehotelinChengdurightnow,checkoutat12.30...\n",
       "5762      0  房间还算可以，不过前台服务人员的态度，受不了，我晚上11点多到酒店CHEKIN第二天退房的时...\n",
       "5790      0             酒店设施陈旧，浴缸排水不畅，入住无房，一间16：00，一间22：00，早餐差\n",
       "4504      1  虽是公寓式酒店，但其房间整洁程度、全方位的服务都给我留下了很好的印象。丝丝不完善之处在于很多...\n",
       "5246      1  很好的酒店，很喜欢，房间很干净很漂亮，从房间的窗口看出去，超美的，在市中心区域，出行也非常的...\n",
       "624       1  在临沂，这个酒店算是比较有档次的了，给外国客人的服务也比较合格。可惜电视内容比较单调，国外的...\n",
       "1382      1  4年前住过，我和德国同事都觉得很不错。今年我又选了豪门，还是觉得很好。自助餐品种丰富，房间宽...\n",
       "3723      1                    价格不高,比较实惠,服务也不错,离闹市区不远.交通也比较方便.\n",
       "3328      1  房间：建筑风格比较独特。木屋矗立在随潮汐涨落的水中，围廊象迷宫一样。看着自己的小屋，却没有直..."
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ChnSentiCorp_htl_ba_4000 = get_balance_corpus(4000, pd_positive, pd_negative)\n",
    "\n",
    "ChnSentiCorp_htl_ba_4000.sample(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of reviews (total): 6000\n",
      "Number of reviews (positive): 3000\n",
      "Number of reviews (negative): 3000\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>4817</th>\n",
       "      <td>1</td>\n",
       "      <td>入住的是260元的迷你标准间。感觉比想象的要好很多，房间如果住一个人很合适的，洗手间很大，很...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7021</th>\n",
       "      <td>0</td>\n",
       "      <td>7点到了酒店前台打电话问了楼层说房间可以入住，上楼竟然房间的垃圾成堆根本就没有打扫，下楼要求...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6484</th>\n",
       "      <td>0</td>\n",
       "      <td>又要对他进行点评了，呜呜。。。说什么好呢</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6715</th>\n",
       "      <td>0</td>\n",
       "      <td>看了前面介绍的推荐去入住的，结果很失望，酒店的淋浴居然没有维护设施，洗个澡弄得整个洗手间都淋...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6775</th>\n",
       "      <td>0</td>\n",
       "      <td>酒店的设施太差了，估计连1星级都没有，房间空调都不开的，简直就是一塌糊涂。建议大家不要去预订该酒店</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7575</th>\n",
       "      <td>0</td>\n",
       "      <td>真的差得没话说，但说起来又有一堆。住进去的时候发现没有浴巾，第二天却一直打电话说我们拿了那两...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1615</th>\n",
       "      <td>1</td>\n",
       "      <td>酒店非常好，距离高速出口很近，服务也很到位，值得推荐的酒店，到泰山应该是最好的酒店了．</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6466</th>\n",
       "      <td>0</td>\n",
       "      <td>携城预定员极力推荐这家酒店，相信她才入住了这家，结果到了酒店才发现，连一星级都不如，前台的小...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1392</th>\n",
       "      <td>1</td>\n",
       "      <td>酒店很大，服务太差，Ａ楼房间也老，下次再也不住了。环境很好，打高尔夫的或许可以忍忍吧。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4408</th>\n",
       "      <td>1</td>\n",
       "      <td>房间很大，大的让我去其他宾馆都感觉性价比不高！服务也不错，值得一住！！</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      label                                             review\n",
       "4817      1  入住的是260元的迷你标准间。感觉比想象的要好很多，房间如果住一个人很合适的，洗手间很大，很...\n",
       "7021      0  7点到了酒店前台打电话问了楼层说房间可以入住，上楼竟然房间的垃圾成堆根本就没有打扫，下楼要求...\n",
       "6484      0                               又要对他进行点评了，呜呜。。。说什么好呢\n",
       "6715      0  看了前面介绍的推荐去入住的，结果很失望，酒店的淋浴居然没有维护设施，洗个澡弄得整个洗手间都淋...\n",
       "6775      0  酒店的设施太差了，估计连1星级都没有，房间空调都不开的，简直就是一塌糊涂。建议大家不要去预订该酒店\n",
       "7575      0  真的差得没话说，但说起来又有一堆。住进去的时候发现没有浴巾，第二天却一直打电话说我们拿了那两...\n",
       "1615      1        酒店非常好，距离高速出口很近，服务也很到位，值得推荐的酒店，到泰山应该是最好的酒店了．\n",
       "6466      0  携城预定员极力推荐这家酒店，相信她才入住了这家，结果到了酒店才发现，连一星级都不如，前台的小...\n",
       "1392      1        酒店很大，服务太差，Ａ楼房间也老，下次再也不住了。环境很好，打高尔夫的或许可以忍忍吧。\n",
       "4408      1                房间很大，大的让我去其他宾馆都感觉性价比不高！服务也不错，值得一住！！"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ChnSentiCorp_htl_ba_6000 = get_balance_corpus(6000, pd_positive, pd_negative)\n",
    "\n",
    "ChnSentiCorp_htl_ba_6000.sample(10)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
@@ -1,333 +0,0 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# About anhuidianxinzhidao\n",
    "0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1nrg5SRU3Xy1VN85dd85-vg)\n",
    "1. **Overview:** 156,000 telecom Q&A records\n",
    "2. **Suggested task:** FAQ question answering\n",
    "3. **Data source:** Baidu Zhidao\n",
    "4. **Processing:**\n",
    "    1. Removed the id, url, qid, reply_t, and user fields\n",
    "    2. Anonymized the question and reply fields"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = 'path/to/anhuidianxinzhidao/'  # placeholder: folder containing the CSV, with a trailing slash"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. anhuidianxinzhidao_filter.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd_all = pd.read_csv(path + 'anhuidianxinzhidao_filter.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| title | Title |\n",
    "| question | Question (may be empty) |\n",
    "| reply | Content of each reply |\n",
    "| is_best | Whether the reply is the best answer |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>question</th>\n",
       "      <th>reply</th>\n",
       "      <th>is_best</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>129754</th>\n",
       "      <td>红米no##4x</td>\n",
       "      <td>NaN</td>\n",
       "      <td>可以，</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15843</th>\n",
       "      <td>为什么不能同时用两个电信卡</td>\n",
       "      <td>NaN</td>\n",
       "      <td>您好不可以的，目前推出的手机都是不能同时支持两张电信手机卡的，即使是全网通手机也只能在其中的...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23985</th>\n",
       "      <td>电信181、177、133哪个号段好？</td>\n",
       "      <td>NaN</td>\n",
       "      <td>133的</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>72065</th>\n",
       "      <td>华*荣耀7x和魅蓝note6哪个好</td>\n",
       "      <td>NaN</td>\n",
       "      <td>荣耀畅玩7X很不错，性价比很高，以下是手机的配置：1、外观方面：荣耀畅玩7X采用5.93英寸...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11843</th>\n",
       "      <td>p8青春版电信版多少钱</td>\n",
       "      <td>NaN</td>\n",
       "      <td>您好，这款手机价格参考如下</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3280</th>\n",
       "      <td>华为di####00叫什么</td>\n",
       "      <td>华为di####00叫什么</td>\n",
       "      <td>DI####00是华为畅享6S全网通版。华为畅享6S性价比高,是一款很不错的手机。电信新出流...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>143200</th>\n",
       "      <td>电信版酷派9190L双卡双通可以用移动网络吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>您好电信版双卡双待手机只能使用电信手机卡上网，卡槽2的移动或联通手机卡只能支持2G网络，一般...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>120692</th>\n",
       "      <td>苹果微信载图怎么载图</td>\n",
       "      <td>苹果微信载图怎么载图</td>\n",
       "      <td>您说的应该是截图吧。您可以直接通过苹果手机截图组合按键进行截图操作。直接同时安装电源键和ho...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>109786</th>\n",
       "      <td>天翼网关的wifi被我关了又没有邦定客户端怎么办想再连wifi该怎么办</td>\n",
       "      <td>NaN</td>\n",
       "      <td>您好电信光纤猫的无线网络一般需要破解才能使用的，但破解可能会到帐宽带不稳定或不能正常上网，建...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29030</th>\n",
       "      <td>v*v*x21是不是全网通</td>\n",
       "      <td>v*v*x21是不是全网通</td>\n",
       "      <td>vi###21系列是有vi###21A全网通版本与vi###21移动全网通版本的；此两款机型...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>72603</th>\n",
       "      <td>电信网上营业厅手机卡办理步骤</td>\n",
       "      <td>NaN</td>\n",
       "      <td>中*电信目前是支持网上办理手机号的，下面分享下网上营业厅办理号卡的步骤：1、首先打开浏览器，...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>103229</th>\n",
       "      <td>花呗可以充话费吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>您好，是可以的，目前花呗进行充值话费,每个月只能使用花呗一次,最高不超过500元,如果您已经...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>91507</th>\n",
       "      <td>荣耀8好还是三星noT4好</td>\n",
       "      <td>NaN</td>\n",
       "      <td>如果我选择三星，华为去论坛发个意见都很尴尬。</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>143504</th>\n",
       "      <td>ios10.2.1能降级吗ios10.2.1怎么降级</td>\n",
       "      <td>NaN</td>\n",
       "      <td>IOS设备一旦升级IOS系统就无法降级了，因为：1、IOS采用推荐升级、强制保持最新的升级策...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21999</th>\n",
       "      <td>电信校园网宽带超一分钟多少钱</td>\n",
       "      <td>NaN</td>\n",
       "      <td>由于各地业务情况不同，建议用户通过当地的电信网是营业厅或者手机营业厅了解，也可以直接到附近的...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7644</th>\n",
       "      <td>有没有人办过开发区的电信卡</td>\n",
       "      <td>NaN</td>\n",
       "      <td>您好目前使用电信手机卡的用户非常多，电信手机卡资费更优惠、网络更稳定、网速更快，请放心办理使...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>76835</th>\n",
       "      <td>请问67###18这个电话号码是哪里的</td>\n",
       "      <td>NaN</td>\n",
       "      <td>查吧</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>76752</th>\n",
       "      <td>电信，铁通，移动，广电。那个网速好呢？</td>\n",
       "      <td>NaN</td>\n",
       "      <td>办理宽带推荐您办理电信宽带使用。由于中*电信的服务器、网络架设等较完善，且每年都在不断完善和...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>94290</th>\n",
       "      <td>三星s8+好用不</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S8+的主要特征：1.全视曲面屏:超窄边框、沉浸感视效、双曲面侧屏的显示屏，为您带来更纯粹的...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>79345</th>\n",
       "      <td>一加手机5玩王者会卡吗？</td>\n",
       "      <td>NaN</td>\n",
       "      <td>不会卡，我也推荐你买一加5，它运行内存有8G，玩游戏的时候就能感受到性能有多好，手机不卡，丢...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                      title       question  \\\n",
       "129754                             红米no##4x            NaN   \n",
       "15843                         为什么不能同时用两个电信卡            NaN   \n",
       "23985                   电信181、177、133哪个号段好？            NaN   \n",
       "72065                     华*荣耀7x和魅蓝note6哪个好            NaN   \n",
       "11843                           p8青春版电信版多少钱            NaN   \n",
       "3280                          华为di####00叫什么  华为di####00叫什么   \n",
       "143200               电信版酷派9190L双卡双通可以用移动网络吗            NaN   \n",
       "120692                           苹果微信载图怎么载图     苹果微信载图怎么载图   \n",
       "109786  天翼网关的wifi被我关了又没有邦定客户端怎么办想再连wifi该怎么办            NaN   \n",
       "29030                         v*v*x21是不是全网通  v*v*x21是不是全网通   \n",
       "72603                        电信网上营业厅手机卡办理步骤            NaN   \n",
       "103229                             花呗可以充话费吗            NaN   \n",
       "91507                         荣耀8好还是三星noT4好            NaN   \n",
       "143504           ios10.2.1能降级吗ios10.2.1怎么降级            NaN   \n",
       "21999                        电信校园网宽带超一分钟多少钱            NaN   \n",
       "7644                          有没有人办过开发区的电信卡            NaN   \n",
       "76835                   请问67###18这个电话号码是哪里的            NaN   \n",
       "76752                   电信，铁通，移动，广电。那个网速好呢？            NaN   \n",
       "94290                              三星s8+好用不            NaN   \n",
       "79345                          一加手机5玩王者会卡吗？            NaN   \n",
       "\n",
       "                                                    reply  is_best  \n",
       "129754                                                可以，        0  \n",
       "15843   您好不可以的，目前推出的手机都是不能同时支持两张电信手机卡的，即使是全网通手机也只能在其中的...        1  \n",
       "23985                                                133的        0  \n",
       "72065   荣耀畅玩7X很不错，性价比很高，以下是手机的配置：1、外观方面：荣耀畅玩7X采用5.93英寸...        1  \n",
       "11843                                       您好，这款手机价格参考如下        1  \n",
       "3280    DI####00是华为畅享6S全网通版。华为畅享6S性价比高,是一款很不错的手机。电信新出流...        1  \n",
       "143200  您好电信版双卡双待手机只能使用电信手机卡上网，卡槽2的移动或联通手机卡只能支持2G网络，一般...        1  \n",
       "120692  您说的应该是截图吧。您可以直接通过苹果手机截图组合按键进行截图操作。直接同时安装电源键和ho...        1  \n",
       "109786  您好电信光纤猫的无线网络一般需要破解才能使用的，但破解可能会到帐宽带不稳定或不能正常上网，建...        1  \n",
       "29030   vi###21系列是有vi###21A全网通版本与vi###21移动全网通版本的；此两款机型...        0  \n",
       "72603   中*电信目前是支持网上办理手机号的，下面分享下网上营业厅办理号卡的步骤：1、首先打开浏览器，...        1  \n",
       "103229  您好，是可以的，目前花呗进行充值话费,每个月只能使用花呗一次,最高不超过500元,如果您已经...        0  \n",
       "91507                              如果我选择三星，华为去论坛发个意见都很尴尬。        0  \n",
       "143504  IOS设备一旦升级IOS系统就无法降级了，因为：1、IOS采用推荐升级、强制保持最新的升级策...        1  \n",
       "21999   由于各地业务情况不同，建议用户通过当地的电信网是营业厅或者手机营业厅了解，也可以直接到附近的...        1  \n",
       "7644    您好目前使用电信手机卡的用户非常多，电信手机卡资费更优惠、网络更稳定、网速更快，请放心办理使...        1  \n",
       "76835                                                  查吧        0  \n",
       "76752   办理宽带推荐您办理电信宽带使用。由于中*电信的服务器、网络架设等较完善，且每年都在不断完善和...        1  \n",
       "94290   S8+的主要特征：1.全视曲面屏:超窄边框、沉浸感视效、双曲面侧屏的显示屏，为您带来更纯粹的...        1  \n",
       "79345   不会卡，我也推荐你买一加5，它运行内存有8G，玩游戏的时候就能感受到性能有多好，手机不卡，丢...        1  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd_all.sample(n=20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
@@ -1,355 +0,0 @@
 {
 "cells": [
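  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# baoxianzhidao_filter description\n",
    "0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1cgYeIrJHAgb8D33H09Zc5w)\n",
    "1. **Overview:** more than 8,000 question-and-answer records from the insurance industry\n",
    "2. **Recommended experiment:** FAQ question-answering system\n",
    "3. **Data source:** Baidu Zhidao\n",
    "4. **Processing:**\n",
    "    1. Removed the id, url, qid, reply_t, and user fields\n",
    "    2. Anonymized the question and reply fields"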
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
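  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Placeholder: path to the folder that contains baoxianzhidao_filter.csv.\n",
    "# Keep the trailing separator, because later cells build file names as path + name.\n",
    "path = 'path_to_baoxianzhidao_folder/'"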
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. baoxianzhidao_filter.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 加载数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd_all = pd.read_csv(path + 'baoxianzhidao_filter.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 字段说明\n",
    "\n",
    "| 字段 | 说明 |\n",
    "| ---- | ---- |\n",
    "| title | 问题的标题 |\n",
    "| question | 问题内容（可为空） |\n",
    "| reply| 回复内容 |\n",
    "| is_best| 是否为页面上显示的最佳回答 |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>question</th>\n",
       "      <th>reply</th>\n",
       "      <th>is_best</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>6733</th>\n",
       "      <td>五险两金和五险一金有什么区别</td>\n",
       "      <td>单位招聘，独立待遇中有一项是五险两金。有些单位是五险一金，还有些五险两金。然而我刚毕业小白，...</td>\n",
       "      <td>五险一金是指：医疗保险，生育保险，工伤保险，失业保险和养老保险，还有住房公积金。五险两金指的...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7580</th>\n",
       "      <td>户口不在本地如何办医疗保险</td>\n",
       "      <td>户口不在本地如何办医疗保险</td>\n",
       "      <td>户口不在本地可以办理医保，通常都是以单位名义进行办理。医疗保险分两种办理方式，一种是单位办理...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6310</th>\n",
       "      <td>酒精含量百分之二十八保险公司理赔吗？</td>\n",
       "      <td>NaN</td>\n",
       "      <td>不会赔</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5843</th>\n",
       "      <td>我买的二手车，车险都没过户，怎么交保险</td>\n",
       "      <td>NaN</td>\n",
       "      <td>要看保险合同了，有的是指定被保险人的，如果你出了险，保险公司是不理赔的。建议尽快去过户，或者...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2790</th>\n",
       "      <td>保险买交强险后可加其他险种吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>可以的。车险种类包括：1.交强险，交强险[全称机动车交通事故责任强制保险]是我国首个由国家法...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4301</th>\n",
       "      <td>农村九级伤残赔偿标准我父亲.因矿采煤塌陷至伤残九级应赔多少钱</td>\n",
       "      <td>农村九级伤残赔偿标准我父亲.因矿采煤塌陷至伤残九级应赔多少钱</td>\n",
       "      <td>发生九级伤残的赔偿标准主要包括医疗费用、一次性补偿金等等，具体包括这些：医疗费：以医院发票金...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4685</th>\n",
       "      <td>领着失业金还可以交失业险吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>可以。领取失业金只是说明目前是离职状态，但仍可以居民形式参加保险，但缴纳的只能是医疗保险和养...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7350</th>\n",
       "      <td>车辆上牌照必须在当地上保险吗</td>\n",
       "      <td>车辆上牌照必须在当地上保险吗</td>\n",
       "      <td>不是必须在当地买保险，也可以异地投保，现在很多保险公司开发了异地买汽车保险的购买渠道。但是保...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1611</th>\n",
       "      <td>泰康人寿保险官网产品多不多，能直接在网上买吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>你想买哪方面保险呢，主要是看给你的服务，国寿现在新*市一款你可以考虑下</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5127</th>\n",
       "      <td>车出事故对方全责第三者受伤对方保险应怎样理赔?</td>\n",
       "      <td>NaN</td>\n",
       "      <td>对方的交强险和第三者责任险可以对第三者的伤害进行赔偿。第三者责任险是保险车辆因意外事故致使第...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4743</th>\n",
       "      <td>我主责对方次责,对方摩托车无保险怎么赔付？</td>\n",
       "      <td>我汽车全险有不计免赔，对方摩托车什么都没有。他的车辆损失和医药费是不是由我保险公司出？那我的...</td>\n",
       "      <td>对方无保险需要自费赔付损失。一般在机动车与机动车之间发生交通事故，由保险公司在机动车第三者责...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5729</th>\n",
       "      <td>网上买健康保险不用检查身体吗</td>\n",
       "      <td>我想在慧择网买一款险种，保大病的，但是有个疑问就是，如果不用确认我身体健康就能入保的话，这样...</td>\n",
       "      <td>通常普通的健康保险是不需要体检的，不过如果年龄、保额超过保险公司规定的限度，就一定需要体检。...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3564</th>\n",
       "      <td>招商信诺儿童险如何投保？</td>\n",
       "      <td>NaN</td>\n",
       "      <td>儿童保险是指用于解决其成长过程中所需要的教育、创业、婚嫁等费用，以及应付孩子可能面临的疾病、...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>824</th>\n",
       "      <td>医疗保险请问单位交的医疗保险到底有啥用–手机爱问</td>\n",
       "      <td>NaN</td>\n",
       "      <td>直接到当地社保处办理就可以了</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4856</th>\n",
       "      <td>以前办理过养老金，在交要身份证吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>第二次办理养老保险需要的资料1.本地人才市场《劳动保障事物代理委托协议书》2.身份正原件及复...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2054</th>\n",
       "      <td>江*车可以在防城港买保险吗?</td>\n",
       "      <td>江*车可以在防城港买保险吗?</td>\n",
       "      <td>理论上说是可行的。具体要看各地的政策和监管要求是如何运行，不同的城市对异地投保的情况的规定是...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1415</th>\n",
       "      <td>中英人寿户外保险好吗？有什么好处</td>\n",
       "      <td>NaN</td>\n",
       "      <td>建议直接拨打人寿客服电话咨询</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5225</th>\n",
       "      <td>机动车保险到期多少日内免于处罚</td>\n",
       "      <td>NaN</td>\n",
       "      <td>机动车保险到期就等于无保险，机动车交通事故责任强制保险条例第三十九条：机动车所有人、管理人未...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5596</th>\n",
       "      <td>上学放学途中发生意外，学校购买的意外保险，可以理赔吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>那要看你们学校买的意外保险的条款中有没有限定只负责理赔在校园中发生的意外伤害，如果没有这样的...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7390</th>\n",
       "      <td>办建筑工人意外险需要交什么证件</td>\n",
       "      <td>NaN</td>\n",
       "      <td>需要提供工人的身份证号需要提供建筑公司的组织机构代码证团体意外险投保书填写及盖章一、企业施工...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                               title  \\\n",
       "6733                  五险两金和五险一金有什么区别   \n",
       "7580                   户口不在本地如何办医疗保险   \n",
       "6310              酒精含量百分之二十八保险公司理赔吗？   \n",
       "5843             我买的二手车，车险都没过户，怎么交保险   \n",
       "2790                  保险买交强险后可加其他险种吗   \n",
       "4301  农村九级伤残赔偿标准我父亲.因矿采煤塌陷至伤残九级应赔多少钱   \n",
       "4685                   领着失业金还可以交失业险吗   \n",
       "7350                  车辆上牌照必须在当地上保险吗   \n",
       "1611          泰康人寿保险官网产品多不多，能直接在网上买吗   \n",
       "5127         车出事故对方全责第三者受伤对方保险应怎样理赔?   \n",
       "4743           我主责对方次责,对方摩托车无保险怎么赔付？   \n",
       "5729                  网上买健康保险不用检查身体吗   \n",
       "3564                    招商信诺儿童险如何投保？   \n",
       "824         医疗保险请问单位交的医疗保险到底有啥用–手机爱问   \n",
       "4856                以前办理过养老金，在交要身份证吗   \n",
       "2054                  江*车可以在防城港买保险吗?   \n",
       "1415                中英人寿户外保险好吗？有什么好处   \n",
       "5225                 机动车保险到期多少日内免于处罚   \n",
       "5596      上学放学途中发生意外，学校购买的意外保险，可以理赔吗   \n",
       "7390                 办建筑工人意外险需要交什么证件   \n",
       "\n",
       "                                               question  \\\n",
       "6733  单位招聘，独立待遇中有一项是五险两金。有些单位是五险一金，还有些五险两金。然而我刚毕业小白，...   \n",
       "7580                                      户口不在本地如何办医疗保险   \n",
       "6310                                                NaN   \n",
       "5843                                                NaN   \n",
       "2790                                                NaN   \n",
       "4301                     农村九级伤残赔偿标准我父亲.因矿采煤塌陷至伤残九级应赔多少钱   \n",
       "4685                                                NaN   \n",
       "7350                                     车辆上牌照必须在当地上保险吗   \n",
       "1611                                                NaN   \n",
       "5127                                                NaN   \n",
       "4743  我汽车全险有不计免赔，对方摩托车什么都没有。他的车辆损失和医药费是不是由我保险公司出？那我的...   \n",
       "5729  我想在慧择网买一款险种，保大病的，但是有个疑问就是，如果不用确认我身体健康就能入保的话，这样...   \n",
       "3564                                                NaN   \n",
       "824                                                 NaN   \n",
       "4856                                                NaN   \n",
       "2054                                     江*车可以在防城港买保险吗?   \n",
       "1415                                                NaN   \n",
       "5225                                                NaN   \n",
       "5596                                                NaN   \n",
       "7390                                                NaN   \n",
       "\n",
       "                                                  reply  is_best  \n",
       "6733  五险一金是指：医疗保险，生育保险，工伤保险，失业保险和养老保险，还有住房公积金。五险两金指的...        0  \n",
       "7580  户口不在本地可以办理医保，通常都是以单位名义进行办理。医疗保险分两种办理方式，一种是单位办理...        1  \n",
       "6310                                                不会赔        0  \n",
       "5843  要看保险合同了，有的是指定被保险人的，如果你出了险，保险公司是不理赔的。建议尽快去过户，或者...        0  \n",
       "2790  可以的。车险种类包括：1.交强险，交强险[全称机动车交通事故责任强制保险]是我国首个由国家法...        1  \n",
       "4301  发生九级伤残的赔偿标准主要包括医疗费用、一次性补偿金等等，具体包括这些：医疗费：以医院发票金...        1  \n",
       "4685  可以。领取失业金只是说明目前是离职状态，但仍可以居民形式参加保险，但缴纳的只能是医疗保险和养...        1  \n",
       "7350  不是必须在当地买保险，也可以异地投保，现在很多保险公司开发了异地买汽车保险的购买渠道。但是保...        0  \n",
       "1611                你想买哪方面保险呢，主要是看给你的服务，国寿现在新*市一款你可以考虑下        0  \n",
       "5127  对方的交强险和第三者责任险可以对第三者的伤害进行赔偿。第三者责任险是保险车辆因意外事故致使第...        1  \n",
       "4743  对方无保险需要自费赔付损失。一般在机动车与机动车之间发生交通事故，由保险公司在机动车第三者责...        1  \n",
       "5729  通常普通的健康保险是不需要体检的，不过如果年龄、保额超过保险公司规定的限度，就一定需要体检。...        1  \n",
       "3564  儿童保险是指用于解决其成长过程中所需要的教育、创业、婚嫁等费用，以及应付孩子可能面临的疾病、...        1  \n",
       "824                                      直接到当地社保处办理就可以了        0  \n",
       "4856  第二次办理养老保险需要的资料1.本地人才市场《劳动保障事物代理委托协议书》2.身份正原件及复...        1  \n",
       "2054  理论上说是可行的。具体要看各地的政策和监管要求是如何运行，不同的城市对异地投保的情况的规定是...        1  \n",
       "1415                                     建议直接拨打人寿客服电话咨询        0  \n",
       "5225  机动车保险到期就等于无保险，机动车交通事故责任强制保险条例第三十九条：机动车所有人、管理人未...        1  \n",
       "5596  那要看你们学校买的意外保险的条款中有没有限定只负责理赔在校园中发生的意外伤害，如果没有这样的...        0  \n",
       "7390  需要提供工人的身份证号需要提供建筑公司的组织机构代码证团体意外险投保书填写及盖章一、企业施工...        0  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd_all.sample(n=20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
@@ -1,194 +0,0 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# dh_msra description\n",
    "0. **Download:** [GitHub](https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/dh_msra/dh_msra.zip)\n",
    "1. **Overview:** more than 50,000 annotated Chinese named entity recognition samples ([IOB2](https://dl.acm.org/citation.cfm?id=977059) format, compatible with the [CoNLL 2002](https://www.clips.uantwerpen.be/conll2002/ner/) and [CRF++](https://taku910.github.io/crfpp/#format) conventions)\n",
    "2. **Recommended experiment:** Chinese named entity recognition\n",
    "3. **Data source:** unknown\n",
    "4. **Original dataset:** [zh-NER-TF](https://github.com/Determined22/zh-NER-TF), collected from the web; the original author and source are unknown, possibly the MSRA corpus\n",
    "5. **Processing:**\n",
    "    1. Merged the original 2 files (train and test) into 1 file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import codecs\n",
    "import random\n",
    "\n",
    "import numpy as np"
   ]
  },
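  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Placeholder: path to the folder that contains dh_msra.txt.\n",
    "# Keep the trailing separator, because later cells build file names as path + name.\n",
    "path = 'path_to_dh_msra_folder/'"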
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. dh_msra.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_iob2(file_path):\n",
    "    '''Load data in IOB2 format.'''\n",
    "    token_seqs = []\n",
    "    label_seqs = []\n",
    "    tokens = []\n",
    "    labels = []\n",
    "    # Assumes the file is UTF-8 encoded; stating it avoids platform-dependent defaults\n",
    "    with codecs.open(file_path, encoding='utf-8') as f:\n",
    "        for index, line in enumerate(f):\n",
    "            items = line.strip().split()\n",
    "            if len(items) == 2:\n",
    "                token, label = items\n",
    "                tokens.append(token)\n",
    "                labels.append(label)\n",
    "            elif len(items) == 0:\n",
    "                # A blank line ends the current sentence\n",
    "                if tokens:\n",
    "                    token_seqs.append(tokens)\n",
    "                    label_seqs.append(labels)\n",
    "                    tokens = []\n",
    "                    labels = []\n",
    "            else:\n",
    "                print('Format error. Line index: {} Content: {}'.format(index, line))\n",
    "                continue\n",
    "\n",
    "    if tokens:  # No blank line at the end of the file: append the last sentence manually\n",
    "        token_seqs.append(tokens)\n",
    "        label_seqs.append(labels)\n",
    "\n",
    "    # dtype=object stores sentences of different lengths as a 1-D array of lists\n",
    "    return np.array(token_seqs, dtype=object), np.array(label_seqs, dtype=object)\n",
    "\n",
    "\n",
    "def show_iob2(token_seqs, label_seqs, num=5, shuffle=True):\n",
    "    '''Print samples of IOB2 data as token/label pairs.'''\n",
    "    if shuffle:\n",
    "        # Draw num random sentence indexes (with replacement)\n",
    "        length = len(token_seqs)\n",
    "        indexes = [random.randrange(0, length) for _ in range(num)]\n",
    "        zip_seqs = zip(token_seqs[indexes], label_seqs[indexes])\n",
    "    else:\n",
    "        zip_seqs = zip(token_seqs[0:num], label_seqs[0:num])\n",
    "\n",
    "    for tokens, labels in zip_seqs:\n",
    "        for token, label in zip(tokens, labels):\n",
    "            print('{}/{} '.format(token, label), end='')\n",
    "        print('\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "55289 55289\n",
      "\n",
      "目/O 前/O “/O 继/B-PER 生/I-PER ”/O 共/O 产/O 仔/O 5/O 胎/O ，/O 产/O 下/O 小/O 老/O 虎/O 1/O 8/O 只/O ，/O 堪/O 称/O 虎/O 妈/O 妈/O 中/O 的/O 英/O 雄/O 。/O \n",
      "\n",
      "历/O 史/O 的/O 内/O 涵/O 是/O 很/O 丰/O 富/O 的/O ，/O 经/O 典/O 作/O 家/O 的/O 论/O 断/O 固/O 然/O 有/O 其/O 权/O 威/O 性/O 和/O 合/O 理/O 性/O ，/O 但/O 历/O 史/O 学/O 家/O 显/O 然/O 不/O 能/O 局/O 限/O 于/O 此/O 。/O \n",
      "\n",
      "5/O 月/O 3/O 0/O 日/O 在/O 中/B-LOC 国/I-LOC 革/I-LOC 命/I-LOC 军/I-LOC 事/I-LOC 博/I-LOC 物/I-LOC 馆/I-LOC 开/O 幕/O 的/O 全/O 国/O 禁/O 毒/O 展/O 览/O ，/O 在/O 社/O 会/O 上/O 引/O 起/O 了/O 强/O 烈/O 的/O 反/O 响/O 。/O \n",
      "\n",
      "另/O 外/O ，/O 还/O 有/O 一/O 个/O 惊/O 人/O 的/O 发/O 现/O ：/O 有/O 的/O 发/O 展/O 中/O 国/O 家/O 人/O 均/O 国/O 民/O 资/O 源/O 非/O 常/O 丰/O 富/O ，/O 但/O 发/O 展/O 不/O 起/O 来/O 的/O 原/O 因/O 在/O 于/O 教/O 育/O 水/O 平/O 太/O 低/O 、/O 对/O 技/O 术/O 的/O 理/O 解/O 和/O 把/O 握/O 太/O 低/O 、/O 管/O 理/O 水/O 平/O 太/O 低/O 等/O 等/O ，/O 一/O 句/O 话/O ，/O 智/O 力/O 资/O 本/O 太/O 贫/O 乏/O 。/O \n",
      "\n",
      "这/O 还/O 要/O 看/O 进/O 一/O 步/O 深/O 入/O 调/O 查/O 的/O 结/O 果/O 。/O \n",
      "\n"
     ]
    }
   ],
   "source": [
    "token_seqs, label_seqs = load_iob2(path+'dh_msra.txt')\n",
    "\n",
    "print(len(token_seqs), len(label_seqs))\n",
    "print()    \n",
    "show_iob2(token_seqs, label_seqs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Label description\n",
    "\n",
    "| Label | Description |\n",
    "| ---- | ---- |\n",
    "| LOC | Location |\n",
    "| ORG | Organization |\n",
    "| PER | Person |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'B-LOC', 'B-ORG', 'B-PER', 'I-LOC', 'I-ORG', 'I-PER', 'O'}"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "set([label for labels in label_seqs for label in labels])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
@@ -1,987 +0,0 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# dmsc_v2 description\n",
    "0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1c0yn3TlkzHYTdEBz3T5arA)\n",
    "1. **Overview:** 28 movies, more than 700,000 users, and more than 2 million ratings/comments\n",
    "2. **Recommended experiments:** recommender systems; sentiment/opinion/review polarity analysis\n",
    "3. **Data source:** [Douban Movie](https://movie.douban.com/)\n",
    "4. **Original dataset:** [Douban Movie Short Comments Dataset V2](https://www.kaggle.com/utmhikari/doubanmovieshortcomments)\n",
    "5. **Processing:**\n",
    "    1. Deduplicated and reorganized into a format compatible with [MovieLens](https://grouplens.org/datasets/movielens/)\n",
    "    2. Anonymized to protect user privacy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
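  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Placeholder: path to the folder that contains movies.csv and ratings.csv.\n",
    "# Keep the trailing separator, because later cells build file names as path + name.\n",
    "path = 'path_to_dmsc_folder/'"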
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. movies.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of movies: 28\n"
     ]
    }
   ],
   "source": [
    "movies = pd.read_csv(path + 'movies.csv')\n",
    "\n",
    "print('Number of movies: %d' % movies.shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field description\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| movieId | Movie id (numbered consecutively from 0) |\n",
    "| title | English title |\n",
    "| title_cn | Chinese title |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>title</th>\n",
       "      <th>title_cn</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>Avengers Age of Ultron</td>\n",
       "      <td>复仇者联盟2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Big Fish and Begonia</td>\n",
       "      <td>大鱼海棠</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>Captain America Civil War</td>\n",
       "      <td>美国队长3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>Chinese Zodiac</td>\n",
       "      <td>十二生肖</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>Chronicles of the Ghostly Tribe</td>\n",
       "      <td>九层妖塔</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>CUG King of Heroes</td>\n",
       "      <td>大圣归来</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>Forever Young</td>\n",
       "      <td>栀子花开</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>Goodbye Mr. Loser</td>\n",
       "      <td>夏洛特烦恼</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>Iron Man</td>\n",
       "      <td>钢铁侠1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>Journey to the West Conquering the Demons</td>\n",
       "      <td>西游降魔篇</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>10</td>\n",
       "      <td>Journey to the West The Demons Strike Back</td>\n",
       "      <td>西游伏妖篇</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>11</td>\n",
       "      <td>La La Land</td>\n",
       "      <td>爱乐之城</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>12</td>\n",
       "      <td>Lost In Thailand</td>\n",
       "      <td>泰囧</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>13</td>\n",
       "      <td>My Sunshine</td>\n",
       "      <td>何以笙箫默</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>14</td>\n",
       "      <td>Operation Mekong</td>\n",
       "      <td>湄公河行动</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>15</td>\n",
       "      <td>Soulmate</td>\n",
       "      <td>七月与安生</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>The Avengers</td>\n",
       "      <td>复仇者联盟</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>17</td>\n",
       "      <td>The Continent</td>\n",
       "      <td>后会无期</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>18</td>\n",
       "      <td>The Ghouls</td>\n",
       "      <td>寻龙诀</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>19</td>\n",
       "      <td>The Great Wall</td>\n",
       "      <td>长城</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>20</td>\n",
       "      <td>The Left Ear</td>\n",
       "      <td>左耳</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>21</td>\n",
       "      <td>The Mermaid</td>\n",
       "      <td>美人鱼</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>22</td>\n",
       "      <td>Tiny Times 1.0</td>\n",
       "      <td>小时代1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>23</td>\n",
       "      <td>Tiny Times 3.0</td>\n",
       "      <td>小时代3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>24</td>\n",
       "      <td>Train to Busan</td>\n",
       "      <td>釜山行</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>25</td>\n",
       "      <td>Transformers Age of Extinction</td>\n",
       "      <td>变形金刚4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>26</td>\n",
       "      <td>Your Name</td>\n",
       "      <td>你的名字</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>27</td>\n",
       "      <td>Zootopia</td>\n",
       "      <td>疯狂动物城</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    movieId                                       title title_cn\n",
       "0         0                      Avengers Age of Ultron   复仇者联盟2\n",
       "1         1                        Big Fish and Begonia     大鱼海棠\n",
       "2         2                   Captain America Civil War    美国队长3\n",
       "3         3                              Chinese Zodiac     十二生肖\n",
       "4         4             Chronicles of the Ghostly Tribe     九层妖塔\n",
       "5         5                          CUG King of Heroes     大圣归来\n",
       "6         6                               Forever Young     栀子花开\n",
       "7         7                           Goodbye Mr. Loser    夏洛特烦恼\n",
       "8         8                                    Iron Man     钢铁侠1\n",
       "9         9   Journey to the West Conquering the Demons    西游降魔篇\n",
       "10       10  Journey to the West The Demons Strike Back    西游伏妖篇\n",
       "11       11                                  La La Land     爱乐之城\n",
       "12       12                            Lost In Thailand       泰囧\n",
       "13       13                                 My Sunshine    何以笙箫默\n",
       "14       14                            Operation Mekong    湄公河行动\n",
       "15       15                                    Soulmate    七月与安生\n",
       "16       16                                The Avengers    复仇者联盟\n",
       "17       17                               The Continent     后会无期\n",
       "18       18                                  The Ghouls      寻龙诀\n",
       "19       19                              The Great Wall       长城\n",
       "20       20                                The Left Ear       左耳\n",
       "21       21                                 The Mermaid      美人鱼\n",
       "22       22                              Tiny Times 1.0     小时代1\n",
       "23       23                              Tiny Times 3.0     小时代3\n",
       "24       24                              Train to Busan      釜山行\n",
       "25       25              Transformers Age of Extinction    变形金刚4\n",
       "26       26                                   Your Name     你的名字\n",
       "27       27                                    Zootopia    疯狂动物城"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. ratings.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of users: 738701\n",
      "Number of ratings: 2125056\n"
     ]
    }
   ],
   "source": [
    "ratings = pd.read_csv(path + 'ratings.csv')\n",
    "\n",
    "print('Number of users: %d' % ratings.userId.unique().shape[0])\n",
    "print('Number of ratings: %d' % ratings.shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 字段说明"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| 字段 | 说明 |\n",
    "| ---- | ---- |\n",
    "| userId | 用户 id (从 0 开始，连续编号) |\n",
    "| movieId | 即 movies.csv 中的 movieId|\n",
    "|rating | 评分，[1,5] 之间的整数 | \n",
    "|timestamp | 评分时间戳 |\n",
    "|comment | 评论内容 |\n",
    "| like | 该评论被多少人点赞 |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>movieId</th>\n",
       "      <th>rating</th>\n",
       "      <th>timestamp</th>\n",
       "      <th>comment</th>\n",
       "      <th>like</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1763779</th>\n",
       "      <td>130888</td>\n",
       "      <td>24</td>\n",
       "      <td>5</td>\n",
       "      <td>1474560000</td>\n",
       "      <td>原著的剧本不是这样的，而是最后只有那个自私鬼活了下来。孕妇中枪，小孩中枪的时候哭出了声音，...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1608147</th>\n",
       "      <td>23695</td>\n",
       "      <td>22</td>\n",
       "      <td>2</td>\n",
       "      <td>1377360000</td>\n",
       "      <td>郭敬明真的要为中国产生如此大规模的青少年脑残群体负一定责任 = =</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1735498</th>\n",
       "      <td>323858</td>\n",
       "      <td>24</td>\n",
       "      <td>3</td>\n",
       "      <td>1473696000</td>\n",
       "      <td>三分不能再多。其中一分给壮汉大叔，帅过男主。</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1631095</th>\n",
       "      <td>218188</td>\n",
       "      <td>22</td>\n",
       "      <td>3</td>\n",
       "      <td>1372953600</td>\n",
       "      <td>柯震东露点 给三星 后面的彩蛋很欢乐</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1193163</th>\n",
       "      <td>155900</td>\n",
       "      <td>17</td>\n",
       "      <td>4</td>\n",
       "      <td>1406390400</td>\n",
       "      <td>给四星不是因为电影有那么好，文艺腔调有，公路片元素够，但好看程度其实低于预期，但是因为是韩...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1874658</th>\n",
       "      <td>8534</td>\n",
       "      <td>26</td>\n",
       "      <td>4</td>\n",
       "      <td>1480780800</td>\n",
       "      <td>身体互换和改变未来都是老梗了，算是半新不旧的瓶装了个旧酒吧，不过倒是不错，意外的好看，伏笔...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>645671</th>\n",
       "      <td>312247</td>\n",
       "      <td>9</td>\n",
       "      <td>4</td>\n",
       "      <td>1476979200</td>\n",
       "      <td>念念不忘，必有回响…</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1681543</th>\n",
       "      <td>284941</td>\n",
       "      <td>23</td>\n",
       "      <td>4</td>\n",
       "      <td>1409673600</td>\n",
       "      <td>看到她们在雪地的那段，居然很感动</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1042238</th>\n",
       "      <td>100689</td>\n",
       "      <td>15</td>\n",
       "      <td>5</td>\n",
       "      <td>1474214400</td>\n",
       "      <td>以前看安妮宝贝时期....最喜欢的小说之一</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1672379</th>\n",
       "      <td>139726</td>\n",
       "      <td>23</td>\n",
       "      <td>2</td>\n",
       "      <td>1406736000</td>\n",
       "      <td>郭小四不是标榜自己时尚品味吗？四个女主一个镜头换一身皮草哪来的品味啊？？（客观的说，叙事增...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1823549</th>\n",
       "      <td>447412</td>\n",
       "      <td>25</td>\n",
       "      <td>2</td>\n",
       "      <td>1405958400</td>\n",
       "      <td>擎天柱胸前蓝色的部分装着生命所需的能量和他的记忆。这让我更加坚信一些东西，只是然后的然后我...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1112590</th>\n",
       "      <td>495975</td>\n",
       "      <td>16</td>\n",
       "      <td>4</td>\n",
       "      <td>1336838400</td>\n",
       "      <td>浩克抖包袱……</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>210239</th>\n",
       "      <td>123095</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>1390320000</td>\n",
       "      <td>轻松愉快，打斗设置还不错</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2093623</th>\n",
       "      <td>232598</td>\n",
       "      <td>27</td>\n",
       "      <td>5</td>\n",
       "      <td>1474560000</td>\n",
       "      <td>比之前大热的冰雪奇缘好太多，一部全家人都可以坐在一起看的电影。</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>583777</th>\n",
       "      <td>322422</td>\n",
       "      <td>8</td>\n",
       "      <td>5</td>\n",
       "      <td>1301500800</td>\n",
       "      <td>的确比蜘蛛侠超人什么什么的好看</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1914937</th>\n",
       "      <td>75819</td>\n",
       "      <td>26</td>\n",
       "      <td>4</td>\n",
       "      <td>1473955200</td>\n",
       "      <td>真的棒。但是我自己还是不那么喜欢。</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1211561</th>\n",
       "      <td>514748</td>\n",
       "      <td>17</td>\n",
       "      <td>4</td>\n",
       "      <td>1407427200</td>\n",
       "      <td>。</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1965672</th>\n",
       "      <td>704638</td>\n",
       "      <td>26</td>\n",
       "      <td>5</td>\n",
       "      <td>1480953600</td>\n",
       "      <td>陪朋友去看的，本身我是拒绝这类小清新的电影的，而且在刚开始的时候说实话没怎么看懂，不过看到...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1935211</th>\n",
       "      <td>259717</td>\n",
       "      <td>26</td>\n",
       "      <td>4</td>\n",
       "      <td>1480694400</td>\n",
       "      <td>时间与空间错乱里的爱情 温暖又幽默</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>839108</th>\n",
       "      <td>426801</td>\n",
       "      <td>11</td>\n",
       "      <td>5</td>\n",
       "      <td>1486742400</td>\n",
       "      <td>Here is to the ones who dream.</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         userId  movieId  rating   timestamp  \\\n",
       "1763779  130888       24       5  1474560000   \n",
       "1608147   23695       22       2  1377360000   \n",
       "1735498  323858       24       3  1473696000   \n",
       "1631095  218188       22       3  1372953600   \n",
       "1193163  155900       17       4  1406390400   \n",
       "1874658    8534       26       4  1480780800   \n",
       "645671   312247        9       4  1476979200   \n",
       "1681543  284941       23       4  1409673600   \n",
       "1042238  100689       15       5  1474214400   \n",
       "1672379  139726       23       2  1406736000   \n",
       "1823549  447412       25       2  1405958400   \n",
       "1112590  495975       16       4  1336838400   \n",
       "210239   123095        3       4  1390320000   \n",
       "2093623  232598       27       5  1474560000   \n",
       "583777   322422        8       5  1301500800   \n",
       "1914937   75819       26       4  1473955200   \n",
       "1211561  514748       17       4  1407427200   \n",
       "1965672  704638       26       5  1480953600   \n",
       "1935211  259717       26       4  1480694400   \n",
       "839108   426801       11       5  1486742400   \n",
       "\n",
       "                                                   comment  like  \n",
       "1763779   原著的剧本不是这样的，而是最后只有那个自私鬼活了下来。孕妇中枪，小孩中枪的时候哭出了声音，...     1  \n",
       "1608147                  郭敬明真的要为中国产生如此大规模的青少年脑残群体负一定责任 = =     0  \n",
       "1735498                             三分不能再多。其中一分给壮汉大叔，帅过男主。     0  \n",
       "1631095                                 柯震东露点 给三星 后面的彩蛋很欢乐     0  \n",
       "1193163   给四星不是因为电影有那么好，文艺腔调有，公路片元素够，但好看程度其实低于预期，但是因为是韩...     0  \n",
       "1874658   身体互换和改变未来都是老梗了，算是半新不旧的瓶装了个旧酒吧，不过倒是不错，意外的好看，伏笔...     1  \n",
       "645671                                          念念不忘，必有回响…     0  \n",
       "1681543                                   看到她们在雪地的那段，居然很感动     0  \n",
       "1042238                              以前看安妮宝贝时期....最喜欢的小说之一     0  \n",
       "1672379   郭小四不是标榜自己时尚品味吗？四个女主一个镜头换一身皮草哪来的品味啊？？（客观的说，叙事增...     0  \n",
       "1823549   擎天柱胸前蓝色的部分装着生命所需的能量和他的记忆。这让我更加坚信一些东西，只是然后的然后我...     0  \n",
       "1112590                                            浩克抖包袱……     0  \n",
       "210239                                        轻松愉快，打斗设置还不错     0  \n",
       "2093623                    比之前大热的冰雪奇缘好太多，一部全家人都可以坐在一起看的电影。     0  \n",
       "583777                                     的确比蜘蛛侠超人什么什么的好看     0  \n",
       "1914937                                  真的棒。但是我自己还是不那么喜欢。     0  \n",
       "1211561                                                  。     0  \n",
       "1965672   陪朋友去看的，本身我是拒绝这类小清新的电影的，而且在刚开始的时候说实话没怎么看懂，不过看到...     0  \n",
       "1935211                                  时间与空间错乱里的爱情 温暖又幽默     0  \n",
       "839108                      Here is to the ones who dream.     0  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ratings.sample(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. 用于 情感/观点/评论 倾向性分析"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 筛选出带有较明显倾向性的评论（1星和5星的评分）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "正向（5星）数目：638106\n",
      "负向（1星）数目：190927\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>movieId</th>\n",
       "      <th>rating</th>\n",
       "      <th>timestamp</th>\n",
       "      <th>comment</th>\n",
       "      <th>like</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>405540</th>\n",
       "      <td>251302</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>1436976000</td>\n",
       "      <td>路人转自来水！大圣帅气！我要生猴子~~~^-^</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>159308</th>\n",
       "      <td>18639</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>1462636800</td>\n",
       "      <td>冬兵从醒了以后就应该要求被冻起来，美队这个人烂的真要命。心疼tony。</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1329674</th>\n",
       "      <td>127217</td>\n",
       "      <td>18</td>\n",
       "      <td>5</td>\n",
       "      <td>1451059200</td>\n",
       "      <td>超级棒！远远超出预期 免费水军来了哈哈哈哈</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1945766</th>\n",
       "      <td>75720</td>\n",
       "      <td>26</td>\n",
       "      <td>5</td>\n",
       "      <td>1476460800</td>\n",
       "      <td>为爱而动</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1706244</th>\n",
       "      <td>29721</td>\n",
       "      <td>23</td>\n",
       "      <td>1</td>\n",
       "      <td>1406131200</td>\n",
       "      <td>看小时代3的时候真是太壮观了整个场子那个乱啊打电话的聊天的中途上厕所的没办法大家提不起兴趣...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1271715</th>\n",
       "      <td>546029</td>\n",
       "      <td>17</td>\n",
       "      <td>1</td>\n",
       "      <td>1406217600</td>\n",
       "      <td>可以给零分么</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>394698</th>\n",
       "      <td>243184</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>1437926400</td>\n",
       "      <td>一直听网友说好，今天去电影院看了下。真的不错，是中国动漫的一个值得一看的作品。太多的喜羊羊...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>324077</th>\n",
       "      <td>208900</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>1437062400</td>\n",
       "      <td>先吐槽一下自己的泪点，太低了。小和尚太像弟弟小时候的样子了。整部电影是良心之作，国产地影这...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1004222</th>\n",
       "      <td>186241</td>\n",
       "      <td>14</td>\n",
       "      <td>5</td>\n",
       "      <td>1475942400</td>\n",
       "      <td>主旋律片的杰出代表，节奏顺畅快速。看得人热血沸腾！</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>198523</th>\n",
       "      <td>5774</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>1462723200</td>\n",
       "      <td>迄今看过最精彩的漫威电影 其实整个剧情核心是复仇 但是这个复仇点真心满怪的 队长还是一如既...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2014461</th>\n",
       "      <td>25511</td>\n",
       "      <td>27</td>\n",
       "      <td>5</td>\n",
       "      <td>1457280000</td>\n",
       "      <td>try everything！动物界的乌托邦 nick真的好苏好腹黑啊啊（原谅我带入了小说</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2101031</th>\n",
       "      <td>727978</td>\n",
       "      <td>27</td>\n",
       "      <td>5</td>\n",
       "      <td>1462550400</td>\n",
       "      <td>讲真很棒！</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1614137</th>\n",
       "      <td>64084</td>\n",
       "      <td>22</td>\n",
       "      <td>1</td>\n",
       "      <td>1374768000</td>\n",
       "      <td>最后雪中的姐妹情。</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1980114</th>\n",
       "      <td>248321</td>\n",
       "      <td>26</td>\n",
       "      <td>5</td>\n",
       "      <td>1480867200</td>\n",
       "      <td>时空的跨越，绝对不能忘记的，你的名字。</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1829632</th>\n",
       "      <td>18891</td>\n",
       "      <td>25</td>\n",
       "      <td>5</td>\n",
       "      <td>1403884800</td>\n",
       "      <td>请记住一个特效片是不需要完美剧情的。在电影院看的就是特效，没有其他。给特效满分。顶端水平。...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>276335</th>\n",
       "      <td>186281</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1443715200</td>\n",
       "      <td>不知道在演什么鬼</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2090682</th>\n",
       "      <td>214830</td>\n",
       "      <td>27</td>\n",
       "      <td>5</td>\n",
       "      <td>1457193600</td>\n",
       "      <td>这狐狸怎么那么苏！！！反差萌的梗简直炉火纯青</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2108227</th>\n",
       "      <td>731117</td>\n",
       "      <td>27</td>\n",
       "      <td>5</td>\n",
       "      <td>1458403200</td>\n",
       "      <td>树懒梗可爱到爆。乌托邦社会的构建反讽了乌托邦社会设想，号称没有偏见的世界里，本身就是由偏见...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>864728</th>\n",
       "      <td>10418</td>\n",
       "      <td>12</td>\n",
       "      <td>5</td>\n",
       "      <td>1355673600</td>\n",
       "      <td>啥也不说了，从头笑到尾，差点没乐死我，最后又赚了些感动</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>422130</th>\n",
       "      <td>263856</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>1436544000</td>\n",
       "      <td>很感动很用心</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         userId  movieId  rating   timestamp  \\\n",
       "405540   251302        5       5  1436976000   \n",
       "159308    18639        2       5  1462636800   \n",
       "1329674  127217       18       5  1451059200   \n",
       "1945766   75720       26       5  1476460800   \n",
       "1706244   29721       23       1  1406131200   \n",
       "1271715  546029       17       1  1406217600   \n",
       "394698   243184        5       5  1437926400   \n",
       "324077   208900        5       5  1437062400   \n",
       "1004222  186241       14       5  1475942400   \n",
       "198523     5774        2       5  1462723200   \n",
       "2014461   25511       27       5  1457280000   \n",
       "2101031  727978       27       5  1462550400   \n",
       "1614137   64084       22       1  1374768000   \n",
       "1980114  248321       26       5  1480867200   \n",
       "1829632   18891       25       5  1403884800   \n",
       "276335   186281        4       1  1443715200   \n",
       "2090682  214830       27       5  1457193600   \n",
       "2108227  731117       27       5  1458403200   \n",
       "864728    10418       12       5  1355673600   \n",
       "422130   263856        5       5  1436544000   \n",
       "\n",
       "                                                   comment  like  \n",
       "405540                             路人转自来水！大圣帅气！我要生猴子~~~^-^     0  \n",
       "159308                 冬兵从醒了以后就应该要求被冻起来，美队这个人烂的真要命。心疼tony。     0  \n",
       "1329674                              超级棒！远远超出预期 免费水军来了哈哈哈哈     0  \n",
       "1945766                                               为爱而动     0  \n",
       "1706244   看小时代3的时候真是太壮观了整个场子那个乱啊打电话的聊天的中途上厕所的没办法大家提不起兴趣...     0  \n",
       "1271715                                             可以给零分么     0  \n",
       "394698    一直听网友说好，今天去电影院看了下。真的不错，是中国动漫的一个值得一看的作品。太多的喜羊羊...     0  \n",
       "324077    先吐槽一下自己的泪点，太低了。小和尚太像弟弟小时候的样子了。整部电影是良心之作，国产地影这...     0  \n",
       "1004222                          主旋律片的杰出代表，节奏顺畅快速。看得人热血沸腾！     0  \n",
       "198523    迄今看过最精彩的漫威电影 其实整个剧情核心是复仇 但是这个复仇点真心满怪的 队长还是一如既...     0  \n",
       "2014461      try everything！动物界的乌托邦 nick真的好苏好腹黑啊啊（原谅我带入了小说     0  \n",
       "2101031                                              讲真很棒！     0  \n",
       "1614137                                          最后雪中的姐妹情。     0  \n",
       "1980114                                时空的跨越，绝对不能忘记的，你的名字。     0  \n",
       "1829632   请记住一个特效片是不需要完美剧情的。在电影院看的就是特效，没有其他。给特效满分。顶端水平。...     0  \n",
       "276335                                            不知道在演什么鬼     1  \n",
       "2090682                             这狐狸怎么那么苏！！！反差萌的梗简直炉火纯青     4  \n",
       "2108227   树懒梗可爱到爆。乌托邦社会的构建反讽了乌托邦社会设想，号称没有偏见的世界里，本身就是由偏见...     0  \n",
       "864728                         啥也不说了，从头笑到尾，差点没乐死我，最后又赚了些感动     0  \n",
       "422130                                              很感动很用心     0  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ratings_with_opinions = ratings[(ratings.rating==1) | (ratings.rating==5)]\n",
    "\n",
    "\n",
    "print('正向（5星）数目：%d' % (ratings_with_opinions[ratings_with_opinions.rating==5].shape[0]))\n",
    "print('负向（1星）数目：%d' % (ratings_with_opinions[ratings_with_opinions.rating==1].shape[0]))\n",
    "\n",
    "ratings_with_opinions.sample(20)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "keras"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
@@ -1,780 +0,0 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ez_douban 说明\n",
    "0. **下载地址：** [百度网盘](https://pan.baidu.com/s/1DkN1LmdSMzm_jCBKhbPbig)\n",
    "1. **数据概览：** 5 万多部电影（3 万多有电影名称，2 万多没有电影名称），2.8 万 用户，280 万条评分数据\n",
    "2. **推荐实验：** 推荐系统\n",
    "2. **数据来源：**[豆瓣电影](https://movie.douban.com/)\n",
    "3. **原数据集：** [Douban-1 和 Douban-2](https://sites.google.com/site/erhengzhong/datasets)，这是 Erheng Zhong 博士 为在 KDD'12, TKDD'14, SDM'12 上发表论文而收集的数据\n",
    "4. **加工处理：**\n",
    "    1. 去除 Douban-1 中无用的 status 字段，以及无效的评分，并整理成与 [MovieLens](https://grouplens.org/datasets/movielens/) 兼容的格式\n",
    "    2. 从 Douban-2 中提取电影信息和链接信息，并与 Douban-1 中的评分数据进行联表操作\n",
    "    3. 进行脱敏操作，以保护用户隐私"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "path = 'ez_douban_文件夹_所在_路径'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. movies.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 加载数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "电影数目（有名称）：33258\n",
      "电影数目（没有名称）：24166\n",
      "电影数目（总计）：57424\n"
     ]
    }
   ],
   "source": [
    "movies = pd.read_csv(path + 'movies.csv')\n",
    "\n",
    "print('电影数目（有名称）：%d' % movies[~pd.isnull(movies.title)].shape[0])\n",
    "print('电影数目（没有名称）：%d' % movies[pd.isnull(movies.title)].shape[0])\n",
    "print('电影数目（总计）：%d' % movies.shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 字段说明\n",
    "\n",
    "| 字段 | 说明 |\n",
    "| ---- | ---- |\n",
    "| movieId | 电影 id (从 0 开始，连续编号) |\n",
    "| title | 电影名称 |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>41807</th>\n",
       "      <td>41807</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16521</th>\n",
       "      <td>16521</td>\n",
       "      <td>五女拜寿</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10689</th>\n",
       "      <td>10689</td>\n",
       "      <td>La pelote de laine</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21653</th>\n",
       "      <td>21653</td>\n",
       "      <td>Ma mha 4 khaa khrap</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36630</th>\n",
       "      <td>36630</td>\n",
       "      <td>the sky the earth and the rain</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31734</th>\n",
       "      <td>31734</td>\n",
       "      <td>Viva María!</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31530</th>\n",
       "      <td>31530</td>\n",
       "      <td>远路</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22553</th>\n",
       "      <td>22553</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32346</th>\n",
       "      <td>32346</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29429</th>\n",
       "      <td>29429</td>\n",
       "      <td>The Crazies</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34912</th>\n",
       "      <td>34912</td>\n",
       "      <td>Stestí</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10350</th>\n",
       "      <td>10350</td>\n",
       "      <td>羊のうた</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31487</th>\n",
       "      <td>31487</td>\n",
       "      <td>一触即发</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50688</th>\n",
       "      <td>50688</td>\n",
       "      <td>还君明珠</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40769</th>\n",
       "      <td>40769</td>\n",
       "      <td>Red Riding Hood</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32748</th>\n",
       "      <td>32748</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17204</th>\n",
       "      <td>17204</td>\n",
       "      <td>작은아씨들</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>55870</th>\n",
       "      <td>55870</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>42879</th>\n",
       "      <td>42879</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26432</th>\n",
       "      <td>26432</td>\n",
       "      <td>后门</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       movieId                           title\n",
       "41807    41807                             NaN\n",
       "16521    16521                            五女拜寿\n",
       "10689    10689              La pelote de laine\n",
       "21653    21653             Ma mha 4 khaa khrap\n",
       "36630    36630  the sky the earth and the rain\n",
       "31734    31734                     Viva María!\n",
       "31530    31530                              远路\n",
       "22553    22553                             NaN\n",
       "32346    32346                             NaN\n",
       "29429    29429                     The Crazies\n",
       "34912    34912                          Stestí\n",
       "10350    10350                            羊のうた\n",
       "31487    31487                            一触即发\n",
       "50688    50688                            还君明珠\n",
       "40769    40769                 Red Riding Hood\n",
       "32748    32748                             NaN\n",
       "17204    17204                           작은아씨들\n",
       "55870    55870                             NaN\n",
       "42879    42879                             NaN\n",
       "26432    26432                              后门"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies.sample(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. ratings.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 加载数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "用户数据：28718\n",
      "评分数目：2828585\n"
     ]
    }
   ],
   "source": [
    "ratings = pd.read_csv(path + 'ratings.csv')\n",
    "\n",
    "print('用户数据：%d' % ratings.userId.unique().shape[0])\n",
    "print('评分数目：%d' % ratings.shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 字段说明"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| 字段 | 说明 |\n",
    "| ---- | ---- |\n",
    "| userId | 用户 id (从 0 开始，连续编号) |\n",
    "| movieId | 即 movies.csv 中的 movieId|\n",
    "|rating | 评分，[1,5] 之间的整数 | \n",
    "|timestamp | 评分时间戳 |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>movieId</th>\n",
       "      <th>rating</th>\n",
       "      <th>timestamp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1234569</th>\n",
       "      <td>4825</td>\n",
       "      <td>14852</td>\n",
       "      <td>5</td>\n",
       "      <td>1263084471</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1817521</th>\n",
       "      <td>7121</td>\n",
       "      <td>140</td>\n",
       "      <td>4</td>\n",
       "      <td>1259054160</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2417373</th>\n",
       "      <td>9449</td>\n",
       "      <td>116</td>\n",
       "      <td>3</td>\n",
       "      <td>1255344370</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1234106</th>\n",
       "      <td>4822</td>\n",
       "      <td>685</td>\n",
       "      <td>5</td>\n",
       "      <td>1124800342</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2044878</th>\n",
       "      <td>7996</td>\n",
       "      <td>22343</td>\n",
       "      <td>4</td>\n",
       "      <td>1254639194</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>239277</th>\n",
       "      <td>947</td>\n",
       "      <td>5730</td>\n",
       "      <td>5</td>\n",
       "      <td>1253992436</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>305034</th>\n",
       "      <td>1178</td>\n",
       "      <td>9839</td>\n",
       "      <td>5</td>\n",
       "      <td>1304648204</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>121193</th>\n",
       "      <td>527</td>\n",
       "      <td>1512</td>\n",
       "      <td>4</td>\n",
       "      <td>1125694603</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2563603</th>\n",
       "      <td>10758</td>\n",
       "      <td>738</td>\n",
       "      <td>4</td>\n",
       "      <td>1301927887</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2034193</th>\n",
       "      <td>7949</td>\n",
       "      <td>1671</td>\n",
       "      <td>5</td>\n",
       "      <td>1276176595</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1373543</th>\n",
       "      <td>5369</td>\n",
       "      <td>893</td>\n",
       "      <td>3</td>\n",
       "      <td>1299972980</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1798131</th>\n",
       "      <td>7027</td>\n",
       "      <td>4530</td>\n",
       "      <td>3</td>\n",
       "      <td>1178099769</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>572517</th>\n",
       "      <td>2243</td>\n",
       "      <td>9773</td>\n",
       "      <td>3</td>\n",
       "      <td>1187275220</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2160230</th>\n",
       "      <td>8470</td>\n",
       "      <td>12</td>\n",
       "      <td>3</td>\n",
       "      <td>1306330169</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1672554</th>\n",
       "      <td>6554</td>\n",
       "      <td>5637</td>\n",
       "      <td>3</td>\n",
       "      <td>1168168788</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1504944</th>\n",
       "      <td>5920</td>\n",
       "      <td>6659</td>\n",
       "      <td>3</td>\n",
       "      <td>1254041654</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2657986</th>\n",
       "      <td>17116</td>\n",
       "      <td>738</td>\n",
       "      <td>4</td>\n",
       "      <td>1238829652</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2123663</th>\n",
       "      <td>8319</td>\n",
       "      <td>1242</td>\n",
       "      <td>4</td>\n",
       "      <td>1225941971</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>561109</th>\n",
       "      <td>2206</td>\n",
       "      <td>4209</td>\n",
       "      <td>3</td>\n",
       "      <td>1307884947</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>208970</th>\n",
       "      <td>887</td>\n",
       "      <td>4723</td>\n",
       "      <td>3</td>\n",
       "      <td>1306314265</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         userId  movieId  rating   timestamp\n",
       "1234569    4825    14852       5  1263084471\n",
       "1817521    7121      140       4  1259054160\n",
       "2417373    9449      116       3  1255344370\n",
       "1234106    4822      685       5  1124800342\n",
       "2044878    7996    22343       4  1254639194\n",
       "239277      947     5730       5  1253992436\n",
       "305034     1178     9839       5  1304648204\n",
       "121193      527     1512       4  1125694603\n",
       "2563603   10758      738       4  1301927887\n",
       "2034193    7949     1671       5  1276176595\n",
       "1373543    5369      893       3  1299972980\n",
       "1798131    7027     4530       3  1178099769\n",
       "572517     2243     9773       3  1187275220\n",
       "2160230    8470       12       3  1306330169\n",
       "1672554    6554     5637       3  1168168788\n",
       "1504944    5920     6659       3  1254041654\n",
       "2657986   17116      738       4  1238829652\n",
       "2123663    8319     1242       4  1225941971\n",
       "561109     2206     4209       3  1307884947\n",
       "208970      887     4723       3  1306314265"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ratings.sample(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. links.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "links = pd.read_csv(path + 'links.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| movieId | Same movieId as in movies.csv and ratings.csv |\n",
    "| imdbId | Movie ID on IMDb |\n",
    "| doubanId | Movie ID on Douban |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>imdbId</th>\n",
       "      <th>doubanId</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>50304</th>\n",
       "      <td>50304</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3712319</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>46231</th>\n",
       "      <td>46231</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3035298</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>56597</th>\n",
       "      <td>56597</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2980174</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>54191</th>\n",
       "      <td>54191</td>\n",
       "      <td>86992.0</td>\n",
       "      <td>1294617</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3418</th>\n",
       "      <td>3418</td>\n",
       "      <td>87406.0</td>\n",
       "      <td>1533608</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6586</th>\n",
       "      <td>6586</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6383567</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52685</th>\n",
       "      <td>52685</td>\n",
       "      <td>376706.0</td>\n",
       "      <td>1770079</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>53372</th>\n",
       "      <td>53372</td>\n",
       "      <td>218839.0</td>\n",
       "      <td>1295836</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27540</th>\n",
       "      <td>27540</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2371674</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34467</th>\n",
       "      <td>34467</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4868728</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2301</th>\n",
       "      <td>2301</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3732699</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16687</th>\n",
       "      <td>16687</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4840386</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36301</th>\n",
       "      <td>36301</td>\n",
       "      <td>364457.0</td>\n",
       "      <td>1764523</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44922</th>\n",
       "      <td>44922</td>\n",
       "      <td>452640.0</td>\n",
       "      <td>1920065</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27815</th>\n",
       "      <td>27815</td>\n",
       "      <td>114687.0</td>\n",
       "      <td>1773480</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25370</th>\n",
       "      <td>25370</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4192036</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36070</th>\n",
       "      <td>36070</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4848096</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40954</th>\n",
       "      <td>40954</td>\n",
       "      <td>115906.0</td>\n",
       "      <td>1302469</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38395</th>\n",
       "      <td>38395</td>\n",
       "      <td>436784.0</td>\n",
       "      <td>1857858</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49680</th>\n",
       "      <td>49680</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4168480</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       movieId    imdbId  doubanId\n",
       "50304    50304       NaN   3712319\n",
       "46231    46231       NaN   3035298\n",
       "56597    56597       NaN   2980174\n",
       "54191    54191   86992.0   1294617\n",
       "3418      3418   87406.0   1533608\n",
       "6586      6586       NaN   6383567\n",
       "52685    52685  376706.0   1770079\n",
       "53372    53372  218839.0   1295836\n",
       "27540    27540       NaN   2371674\n",
       "34467    34467       NaN   4868728\n",
       "2301      2301       NaN   3732699\n",
       "16687    16687       NaN   4840386\n",
       "36301    36301  364457.0   1764523\n",
       "44922    44922  452640.0   1920065\n",
       "27815    27815  114687.0   1773480\n",
       "25370    25370       NaN   4192036\n",
       "36070    36070       NaN   4848096\n",
       "40954    40954  115906.0   1302469\n",
       "38395    38395  436784.0   1857858\n",
       "49680    49680       NaN   4168480"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "links.sample(20)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "keras"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
@@ -1,357 +0,0 @@
 {
 "cells": [
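  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# About financezhidao\n",
    "0. **Download:** [Baidu Zhidao](https://pan.baidu.com/s/1z1Rnnk-ubRSvzDu4UvLlIw)\n",
    "1. **Overview:** 770,000 question-and-answer records from the finance industry\n",
    "2. **Suggested experiment:** FAQ question-answering system\n",
    "3. **Data source:** Baidu Zhidao\n",
    "4. **Processing:**\n",
    "    1. Removed the id, url, qid, reply_t, and user fields\n",
    "    2. Anonymized sensitive information in question and reply"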
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = 'path_to_financezhidao_folder/'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. financezhidao_filter.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd_all = pd.read_csv(path + 'financezhidao_filter.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| title | Question title |\n",
    "| question | Question body (may be empty) |\n",
    "| reply | Reply content |\n",
    "| is_best | Whether the reply is the best answer |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>question</th>\n",
       "      <th>reply</th>\n",
       "      <th>is_best</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>678109</th>\n",
       "      <td>大家好，请问信用卡怎么分期，分期有什么用处呢</td>\n",
       "      <td>NaN</td>\n",
       "      <td>分期好提额，但是有利息</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>534025</th>\n",
       "      <td>本人在银行的存款，别人带本人的身份证可以取出来吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>若使用的是招商银行储蓄卡，在网点取款可代办，取款金额在1万元以上，需出示双人身份证原件和银行...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>501941</th>\n",
       "      <td>向银行贷款30万一个月要多少利息</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1000万</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>734438</th>\n",
       "      <td>招商信用卡还款怎么还，是每个月固定还多少钱，还是按照我们用款额度来算每个月还多少钱？</td>\n",
       "      <td>NaN</td>\n",
       "      <td>消费多少还多少，还款期内免利息。账单出来会提示你全额还多少，最低还多少的。</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>448905</th>\n",
       "      <td>年利率6每月多少钱</td>\n",
       "      <td>NaN</td>\n",
       "      <td>一年按12个月算的</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>521387</th>\n",
       "      <td>以卡办卡查卡里余额吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>若需查询招行一卡通余额，可通过电话银行，手机银行，网上银行（大众版和专业版），自助设备等渠道...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>758812</th>\n",
       "      <td>2016年调整金融机构各个银行人民币存贷款基准利率是多少</td>\n",
       "      <td>NaN</td>\n",
       "      <td>这个问题的话，本金*利率*时间就可以算出来了总的存款利率的话一般都是有央*规定的，怕出现什么...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>220626</th>\n",
       "      <td>请问一下，广信贷怎么样？这个理财真的可以赚？</td>\n",
       "      <td>NaN</td>\n",
       "      <td>所在城市若有招商银行，也可以了解下招行发售的理财产品，您可以进入招行主页，点击“理财产品”-...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>86984</th>\n",
       "      <td>公积金断交后，补上可以申请公积金贷款吗</td>\n",
       "      <td>公积金交了一年了，但是断了大概5个月了，现在想申请公积金贷款，请问补上可以吗</td>\n",
       "      <td>住房公积金断了，需要当事人准备相应的补交材料给单位经办人，由单位的经办人去有关部门办理补缴手...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20026</th>\n",
       "      <td>在哪里能借到钱</td>\n",
       "      <td>NaN</td>\n",
       "      <td>你要借多少</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>121538</th>\n",
       "      <td>哪有办理个人信用卡pos机</td>\n",
       "      <td>NaN</td>\n",
       "      <td>很多都可以办理</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>467245</th>\n",
       "      <td>身份证消磁了就不能办银行卡了吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>身份证读不出信息就是无效证件是没法去银行办理业务的目前部分银行支持临时身份证+辅助证明的方式...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>730725</th>\n",
       "      <td>年薪20万，招行信用卡标准金卡额度能有多少？</td>\n",
       "      <td>NaN</td>\n",
       "      <td>正常来说一般是一万，要看你个人的信用度。这个情况要去银行问。</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>517301</th>\n",
       "      <td>自己可以拿家长的身份证办银行卡么吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>必须本人办理</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>255614</th>\n",
       "      <td>有没有人现在能借一千以内给我，急需，无前期，走今借到</td>\n",
       "      <td>NaN</td>\n",
       "      <td>那么晚了还出来诈骗</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>545539</th>\n",
       "      <td>招行信用卡查询密码怎么修改有多种方式</td>\n",
       "      <td>NaN</td>\n",
       "      <td>可以通过网银大众版、专业版、手机银行、掌上生活客户端、电话银行等渠道修改。</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>747413</th>\n",
       "      <td>信用卡面签被拒的原因是什么？</td>\n",
       "      <td>信用卡面签被拒的原因是什么？</td>\n",
       "      <td>若申请的是招行信用卡，最主要的条件是有稳定的工作和收入，必备申请文件为身份证明复印件和工作证...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>669087</th>\n",
       "      <td>信用卡还款日提前一天会黑名单</td>\n",
       "      <td>NaN</td>\n",
       "      <td>你好，这个是不会的，信用卡还款日是指免息期的最后一天，在这个时间之前全额还款都是没有问题的。...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>237058</th>\n",
       "      <td>求一个农村人可以借钱的软件，就几百块急用，在网上找了十几个认证了</td>\n",
       "      <td>求一个农村人可以借钱的软件，就几百块急用，在网上找了十几个认证了半天都不给借，求一个靠谱的</td>\n",
       "      <td>你好很高兴为您解答:qq现金贷不错</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>131290</th>\n",
       "      <td>我们办理房贷合同时银行工作人员给信用卡申请来填，那个信用卡的核实信息我答不上会影响放款吗</td>\n",
       "      <td>我们办理房贷合同时银行工作人员给信用卡申请来填，那个信用卡的核实信息我答不上会影响放款吗急用</td>\n",
       "      <td>若是在招行申请的个人住房贷款，信用卡的核发情况不影响贷款放款。贷款的最终审核是否能够通过，是...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               title  \\\n",
       "678109                        大家好，请问信用卡怎么分期，分期有什么用处呢   \n",
       "534025                      本人在银行的存款，别人带本人的身份证可以取出来吗   \n",
       "501941                              向银行贷款30万一个月要多少利息   \n",
       "734438    招商信用卡还款怎么还，是每个月固定还多少钱，还是按照我们用款额度来算每个月还多少钱？   \n",
       "448905                                     年利率6每月多少钱   \n",
       "521387                                    以卡办卡查卡里余额吗   \n",
       "758812                  2016年调整金融机构各个银行人民币存贷款基准利率是多少   \n",
       "220626                        请问一下，广信贷怎么样？这个理财真的可以赚？   \n",
       "86984                            公积金断交后，补上可以申请公积金贷款吗   \n",
       "20026                                        在哪里能借到钱   \n",
       "121538                                 哪有办理个人信用卡pos机   \n",
       "467245                               身份证消磁了就不能办银行卡了吗   \n",
       "730725                        年薪20万，招行信用卡标准金卡额度能有多少？   \n",
       "517301                             自己可以拿家长的身份证办银行卡么吗   \n",
       "255614                    有没有人现在能借一千以内给我，急需，无前期，走今借到   \n",
       "545539                            招行信用卡查询密码怎么修改有多种方式   \n",
       "747413                                信用卡面签被拒的原因是什么？   \n",
       "669087                                信用卡还款日提前一天会黑名单   \n",
       "237058              求一个农村人可以借钱的软件，就几百块急用，在网上找了十几个认证了   \n",
       "131290  我们办理房贷合同时银行工作人员给信用卡申请来填，那个信用卡的核实信息我答不上会影响放款吗   \n",
       "\n",
       "                                              question  \\\n",
       "678109                                             NaN   \n",
       "534025                                             NaN   \n",
       "501941                                             NaN   \n",
       "734438                                             NaN   \n",
       "448905                                             NaN   \n",
       "521387                                             NaN   \n",
       "758812                                             NaN   \n",
       "220626                                             NaN   \n",
       "86984           公积金交了一年了，但是断了大概5个月了，现在想申请公积金贷款，请问补上可以吗   \n",
       "20026                                              NaN   \n",
       "121538                                             NaN   \n",
       "467245                                             NaN   \n",
       "730725                                             NaN   \n",
       "517301                                             NaN   \n",
       "255614                                             NaN   \n",
       "545539                                             NaN   \n",
       "747413                                  信用卡面签被拒的原因是什么？   \n",
       "669087                                             NaN   \n",
       "237058   求一个农村人可以借钱的软件，就几百块急用，在网上找了十几个认证了半天都不给借，求一个靠谱的   \n",
       "131290  我们办理房贷合同时银行工作人员给信用卡申请来填，那个信用卡的核实信息我答不上会影响放款吗急用   \n",
       "\n",
       "                                                    reply  is_best  \n",
       "678109                                        分期好提额，但是有利息        0  \n",
       "534025  若使用的是招商银行储蓄卡，在网点取款可代办，取款金额在1万元以上，需出示双人身份证原件和银行...        1  \n",
       "501941                                              1000万        0  \n",
       "734438              消费多少还多少，还款期内免利息。账单出来会提示你全额还多少，最低还多少的。        1  \n",
       "448905                                          一年按12个月算的        0  \n",
       "521387  若需查询招行一卡通余额，可通过电话银行，手机银行，网上银行（大众版和专业版），自助设备等渠道...        1  \n",
       "758812  这个问题的话，本金*利率*时间就可以算出来了总的存款利率的话一般都是有央*规定的，怕出现什么...        0  \n",
       "220626  所在城市若有招商银行，也可以了解下招行发售的理财产品，您可以进入招行主页，点击“理财产品”-...        1  \n",
       "86984   住房公积金断了，需要当事人准备相应的补交材料给单位经办人，由单位的经办人去有关部门办理补缴手...        1  \n",
       "20026                                               你要借多少        0  \n",
       "121538                                            很多都可以办理        0  \n",
       "467245  身份证读不出信息就是无效证件是没法去银行办理业务的目前部分银行支持临时身份证+辅助证明的方式...        0  \n",
       "730725                     正常来说一般是一万，要看你个人的信用度。这个情况要去银行问。        0  \n",
       "517301                                             必须本人办理        0  \n",
       "255614                                          那么晚了还出来诈骗        0  \n",
       "545539              可以通过网银大众版、专业版、手机银行、掌上生活客户端、电话银行等渠道修改。        1  \n",
       "747413  若申请的是招行信用卡，最主要的条件是有稳定的工作和收入，必备申请文件为身份证明复印件和工作证...        1  \n",
       "669087  你好，这个是不会的，信用卡还款日是指免息期的最后一天，在这个时间之前全额还款都是没有问题的。...        0  \n",
       "237058                                  你好很高兴为您解答:qq现金贷不错        0  \n",
       "131290  若是在招行申请的个人住房贷款，信用卡的核发情况不影响贷款放款。贷款的最终审核是否能够通过，是...        1  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd_all.sample(n=20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
@@ -1,357 +0,0 @@
 {
 "cells": [
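  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# About lawzhidao_filter\n",
    "0. **Download:** [Baidu Zhidao](https://pan.baidu.com/s/18Lwq16VBo6wBD_qLb3i33g)\n",
    "1. **Overview:** 36,000 legal question-and-answer records\n",
    "2. **Suggested experiment:** FAQ question-answering system\n",
    "3. **Data source:** Baidu Zhidao\n",
    "4. **Processing:**\n",
    "    1. Removed the id, url, qid, reply_t, and user fields\n",
    "    2. Anonymized sensitive information in question and reply"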
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = 'path_to_lawzhidao_folder/'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. lawzhidao_filter.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd_all = pd.read_csv(path + 'lawzhidao_filter.csv')"
   ]
  },
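  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| title | Question title |\n",
    "| question | Question body (may be empty) |\n",
    "| reply | Reply content |\n",
    "| is_best | Whether the reply is shown as the best answer on the page |"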
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>question</th>\n",
       "      <th>reply</th>\n",
       "      <th>is_best</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>6725</th>\n",
       "      <td>请问车险理赔时，全责一方和无责任一方收到待遇的区别</td>\n",
       "      <td>NaN</td>\n",
       "      <td>这位朋友提问的有些过于笼统了不是很详细，理论上来讲，从商业险的角度分析，有责任，保险公司才会...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6399</th>\n",
       "      <td>买保险,一定要找代理人吗,直接去保险公司买不可以吗?</td>\n",
       "      <td>买保险,一定要找代理人吗,直接去保险公司买不可以吗?</td>\n",
       "      <td>可以的。可以自行去保险公司进行投保，也可以选择在网上投保。不过有代理人的好处在于可以为被保险...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4242</th>\n",
       "      <td>机动车撞伤人至骨折保险公司该怎么赔偿</td>\n",
       "      <td>NaN</td>\n",
       "      <td>交通事故赔偿是有标准的，因交通事故造成损失，肇事者向受害者、保险公司对承保车辆造成的损失进行...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7481</th>\n",
       "      <td>贷款买养老保险如何办理？</td>\n",
       "      <td>贷款买养老保险如何办理？</td>\n",
       "      <td>助保贷款主要是针对中断缴纳基本养老保险费的接近退休年龄无力续保的困难*员，通过政府担保贴息、...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5674</th>\n",
       "      <td>摩托车行车证年审应交哪些保险?一定要交驾驶员个人险吗?</td>\n",
       "      <td>NaN</td>\n",
       "      <td>摩托车买保险最应该买的就是交强险，一般根据排量的不同共分为三个类别，其中50CC及以下的排量...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1122</th>\n",
       "      <td>惠*安保费贵不贵？一年需要多少钱？</td>\n",
       "      <td>NaN</td>\n",
       "      <td>年缴保费500元，缴费20年，保障30年。</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5511</th>\n",
       "      <td>农村医保没有交,会把户口注销了吗?本人现不在家无法交医保，乡镇通知我，他说我不交医保就把我的户口</td>\n",
       "      <td>销了。是真的吗？</td>\n",
       "      <td>不会的，这是不合法的，新农合是指由政府组织、引导、支持，农民自愿参加，个人、集体和政府多方筹...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7338</th>\n",
       "      <td>新华保险的保单贷款是怎样还的?</td>\n",
       "      <td>NaN</td>\n",
       "      <td>半年要去签一次息，具体情况，可以直接咨询新华人寿保险公司，新华客服热线9##67</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1280</th>\n",
       "      <td>一起慧99到底有什么优惠相比其他的保险</td>\n",
       "      <td>NaN</td>\n",
       "      <td>您好！一起慧99</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6388</th>\n",
       "      <td>辞职后，养老保险如果不转移会怎么样</td>\n",
       "      <td>我2010年2月在原公司辞职后，养老保险没有转移。如果不转移，我这部分养老保险会怎么处理？</td>\n",
       "      <td>会被封存，所以要及时转移。养老保险转移和接续手续：一、申请出具《基本养老保险参保缴费凭证》职...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7920</th>\n",
       "      <td>慧*安*儿定期重疾是怎么理赔的</td>\n",
       "      <td>NaN</td>\n",
       "      <td>首先是报案您或被保险人应在知道保险事故发生之日起10日内通知本公司。如果您或受益人故意或者因...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3134</th>\n",
       "      <td>构不成住院条件的车祸需要赔付精神损失费误工费营养费护理费吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>只要存在精神损失、误工、需要增加营养、护理的费用，就可以向侵权人主张赔偿责任。</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4015</th>\n",
       "      <td>基本保险金额是什么意思</td>\n",
       "      <td>基本保险金额是什么意思</td>\n",
       "      <td>基本保险金额是保单上明确标注的金额，保险金额是能拿到的保险赔付金额，有些保险条款的基本保险金...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6848</th>\n",
       "      <td>重大疾病保险有必要买吗？</td>\n",
       "      <td>我今年25岁，身体很健康，我去买保险，保险公司的人给我的计划里有重大疾病保险的项目，但是我只...</td>\n",
       "      <td>重大疾病保险还是很有必要买的。我国的医疗保障体系是由基本医保和商业健康保险组成。如果发生重大...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2494</th>\n",
       "      <td>库*勒妇科商业医保报销范围有哪些？</td>\n",
       "      <td>库*勒妇科商业医保报销范围有哪些？</td>\n",
       "      <td>你好，商业医保报销范围比医疗保险报销更广。基本都是能报销的。报销分农村居民和城镇职工：1、居...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7341</th>\n",
       "      <td>第三者保险营运与非营运什么区别</td>\n",
       "      <td>第三者保险营运与非营运什么区别</td>\n",
       "      <td>车辆行驶证的“使用性质“一个是营运，一个是非营运。营运需要在运输管理部门办理车辆的道路运输许...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4997</th>\n",
       "      <td>犹豫期内退保一定要去原来办理的地点吗?</td>\n",
       "      <td>犹豫期内退保一定要去原来办理的地点吗?</td>\n",
       "      <td>要退保必须去保险公司退，在银行的柜台上是没办法退的，而且退保必须由投保人本人持其身份证去退，...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5694</th>\n",
       "      <td>保险法的构成主要包括</td>\n",
       "      <td>NaN</td>\n",
       "      <td>保险法的构成主要包括保险业法、保险合同法*保险特别法。1.保险业法又称保险事业法、保险事业监...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1604</th>\n",
       "      <td>适合中老年的保险多不多，能买哪些保险？</td>\n",
       "      <td>NaN</td>\n",
       "      <td>年龄多大呢？保费预算多少？</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3098</th>\n",
       "      <td>汽车购置税属于机动车第三者责任险赔偿范围内吗？</td>\n",
       "      <td>NaN</td>\n",
       "      <td>购置税你是你购置车辆的时候上牌还需要交的费用。跟保险不是一个范围。</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                 title  \\\n",
       "6725                         请问车险理赔时，全责一方和无责任一方收到待遇的区别   \n",
       "6399                        买保险,一定要找代理人吗,直接去保险公司买不可以吗?   \n",
       "4242                                机动车撞伤人至骨折保险公司该怎么赔偿   \n",
       "7481                                      贷款买养老保险如何办理？   \n",
       "5674                       摩托车行车证年审应交哪些保险?一定要交驾驶员个人险吗?   \n",
       "1122                                 惠*安保费贵不贵？一年需要多少钱？   \n",
       "5511  农村医保没有交,会把户口注销了吗?本人现不在家无法交医保，乡镇通知我，他说我不交医保就把我的户口   \n",
       "7338                                   新华保险的保单贷款是怎样还的?   \n",
       "1280                               一起慧99到底有什么优惠相比其他的保险   \n",
       "6388                                 辞职后，养老保险如果不转移会怎么样   \n",
       "7920                                   慧*安*儿定期重疾是怎么理赔的   \n",
       "3134                     构不成住院条件的车祸需要赔付精神损失费误工费营养费护理费吗   \n",
       "4015                                       基本保险金额是什么意思   \n",
       "6848                                      重大疾病保险有必要买吗？   \n",
       "2494                                 库*勒妇科商业医保报销范围有哪些？   \n",
       "7341                                   第三者保险营运与非营运什么区别   \n",
       "4997                               犹豫期内退保一定要去原来办理的地点吗?   \n",
       "5694                                        保险法的构成主要包括   \n",
       "1604                               适合中老年的保险多不多，能买哪些保险？   \n",
       "3098                           汽车购置税属于机动车第三者责任险赔偿范围内吗？   \n",
       "\n",
       "                                               question  \\\n",
       "6725                                                NaN   \n",
       "6399                         买保险,一定要找代理人吗,直接去保险公司买不可以吗?   \n",
       "4242                                                NaN   \n",
       "7481                                       贷款买养老保险如何办理？   \n",
       "5674                                                NaN   \n",
       "1122                                                NaN   \n",
       "5511                                           销了。是真的吗？   \n",
       "7338                                                NaN   \n",
       "1280                                                NaN   \n",
       "6388      我2010年2月在原公司辞职后，养老保险没有转移。如果不转移，我这部分养老保险会怎么处理？   \n",
       "7920                                                NaN   \n",
       "3134                                                NaN   \n",
       "4015                                        基本保险金额是什么意思   \n",
       "6848  我今年25岁，身体很健康，我去买保险，保险公司的人给我的计划里有重大疾病保险的项目，但是我只...   \n",
       "2494                                  库*勒妇科商业医保报销范围有哪些？   \n",
       "7341                                    第三者保险营运与非营运什么区别   \n",
       "4997                                犹豫期内退保一定要去原来办理的地点吗?   \n",
       "5694                                                NaN   \n",
       "1604                                                NaN   \n",
       "3098                                                NaN   \n",
       "\n",
       "                                                  reply  is_best  \n",
       "6725  这位朋友提问的有些过于笼统了不是很详细，理论上来讲，从商业险的角度分析，有责任，保险公司才会...        0  \n",
       "6399  可以的。可以自行去保险公司进行投保，也可以选择在网上投保。不过有代理人的好处在于可以为被保险...        1  \n",
       "4242  交通事故赔偿是有标准的，因交通事故造成损失，肇事者向受害者、保险公司对承保车辆造成的损失进行...        1  \n",
       "7481  助保贷款主要是针对中断缴纳基本养老保险费的接近退休年龄无力续保的困难*员，通过政府担保贴息、...        0  \n",
       "5674  摩托车买保险最应该买的就是交强险，一般根据排量的不同共分为三个类别，其中50CC及以下的排量...        1  \n",
       "1122                              年缴保费500元，缴费20年，保障30年。        1  \n",
       "5511  不会的，这是不合法的，新农合是指由政府组织、引导、支持，农民自愿参加，个人、集体和政府多方筹...        1  \n",
       "7338           半年要去签一次息，具体情况，可以直接咨询新华人寿保险公司，新华客服热线9##67        0  \n",
       "1280                                           您好！一起慧99        0  \n",
       "6388  会被封存，所以要及时转移。养老保险转移和接续手续：一、申请出具《基本养老保险参保缴费凭证》职...        1  \n",
       "7920  首先是报案您或被保险人应在知道保险事故发生之日起10日内通知本公司。如果您或受益人故意或者因...        1  \n",
       "3134            只要存在精神损失、误工、需要增加营养、护理的费用，就可以向侵权人主张赔偿责任。        0  \n",
       "4015  基本保险金额是保单上明确标注的金额，保险金额是能拿到的保险赔付金额，有些保险条款的基本保险金...        1  \n",
       "6848  重大疾病保险还是很有必要买的。我国的医疗保障体系是由基本医保和商业健康保险组成。如果发生重大...        1  \n",
       "2494  你好，商业医保报销范围比医疗保险报销更广。基本都是能报销的。报销分农村居民和城镇职工：1、居...        0  \n",
       "7341  车辆行驶证的“使用性质“一个是营运，一个是非营运。营运需要在运输管理部门办理车辆的道路运输许...        1  \n",
       "4997  要退保必须去保险公司退，在银行的柜台上是没办法退的，而且退保必须由投保人本人持其身份证去退，...        1  \n",
       "5694  保险法的构成主要包括保险业法、保险合同法*保险特别法。1.保险业法又称保险事业法、保险事业监...        0  \n",
       "1604                                      年龄多大呢？保费预算多少？        0  \n",
       "3098                  购置税你是你购置车辆的时候上牌还需要交的费用。跟保险不是一个范围。        0  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd_all.sample(n=20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
@@ -1,355 +0,0 @@
 {
 "cells": [
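  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# liantongzhidao_filter overview\n",
    "0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1oYi9SfbXpnvreJYGV837Nw)\n",
    "1. **Data overview:** about 203,000 China Unicom Q&A records\n",
    "2. **Recommended experiment:** FAQ question answering system\n",
    "3. **Data source:** Baidu Zhidao\n",
    "4. **Processing:**\n",
    "    1. Removed the id, url, qid, reply_t, and user fields\n",
    "    2. Masked sensitive information in question and reply"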
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = 'path_to_liantongzhidao_folder/'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. liantongzhidao_filter.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd_all = pd.read_csv(path + 'liantongzhidao_filter.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| title | Question title |\n",
    "| question | Question body (may be empty) |\n",
    "| reply | Reply text |\n",
    "| is_best | Whether this reply is the best answer shown on the page |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>question</th>\n",
       "      <th>reply</th>\n",
       "      <th>is_best</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>104525</th>\n",
       "      <td>拖欠联通话费会有利息出现吗？</td>\n",
       "      <td>NaN</td>\n",
       "      <td>应该没有</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>116168</th>\n",
       "      <td>5S日版为什么插移动卡可以用.联通卡就不读卡</td>\n",
       "      <td>NaN</td>\n",
       "      <td>苹果手机卡贴分为移动和联通的，说明卡贴支持移动卡，不支持联通卡，主要是网络制式决定的。联通网...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>154475</th>\n",
       "      <td>联通空中号激活了也不能打电话是怎么回事</td>\n",
       "      <td>联通空中号激活了也不能打电话是怎么回事</td>\n",
       "      <td>手机已激活却无法接打电话的常见原因及解决方法如下：【1】检查手机是否欠费停机，建议缴费充值；...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>153069</th>\n",
       "      <td>联通48元送2g活动本月月租到底算不算进去？</td>\n",
       "      <td>NaN</td>\n",
       "      <td>算</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>195043</th>\n",
       "      <td>VI###13是不是不支持联通上网卡</td>\n",
       "      <td>NaN</td>\n",
       "      <td>VI###13支持联通上网卡。网络参考：主屏尺寸：4.5英寸主屏分辨率：854x480像素后...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5235</th>\n",
       "      <td>电话号码能定位是真吗</td>\n",
       "      <td>电话号码能定位是真吗</td>\n",
       "      <td>当然了</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10472</th>\n",
       "      <td>索尼LT22i可以刷机到4.1吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>由于手机所支持的网络是由硬件所确定的，无法通过破解软件或者升级软件系统让手机支持其他运营商的...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>86083</th>\n",
       "      <td>苹果ip##ne手机的个人热点怎么设置使用</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1、点击“设置”选项;2、在“设置”界面中找到“个人热点”;3、然后我们可以看到“个人热点”...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>150247</th>\n",
       "      <td>我用的联通的号码，信号一会有一会没有，请问到底是怎什么回事</td>\n",
       "      <td>NaN</td>\n",
       "      <td>信号不好，手机因素，运营商问题，手机卡问题，很多因素你可以到当地联通营业厅寻求帮助</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>202724</th>\n",
       "      <td>流量畅享包订购生效时间</td>\n",
       "      <td>NaN</td>\n",
       "      <td>您订购沃商店/沃游戏流量畅享包后，订购当月立即生效，按月自动续订；退订月底生效，当月可继续使...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44450</th>\n",
       "      <td>办了腾讯大王卡，激活后，身份证是不是就剩下半俩张卡的机会了</td>\n",
       "      <td>办了腾讯大王卡，激活后，身份证是不是就剩下半俩张卡的机会了</td>\n",
       "      <td>每人仅可预约一张音乐小*卡或视频小*卡或腾讯大*卡或腾讯天*卡（一共仅1张）（识别条件为：联...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>65875</th>\n",
       "      <td>现在有联通的合约机吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>联通有合约机。合约种类大致有存话费送手机、买手机送话费、合约惠机等，具体合约种类可登录联通网...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>51934</th>\n",
       "      <td>联通卡那个腾讯应用省内定向流量免费是什么意思啊</td>\n",
       "      <td>NaN</td>\n",
       "      <td>大王卡，对腾讯的应用，都免流量！</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>155866</th>\n",
       "      <td>怎么设置电信手机彩铃？</td>\n",
       "      <td>NaN</td>\n",
       "      <td>设置中*电信的彩铃可以自己在网上操作的，前提是先开通中*电信的彩铃业务，可以直接致电电信客服...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>118696</th>\n",
       "      <td>联通手机号挂失还能交费吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1、挂失状态下可以交费。交费渠道与手机正常状态下是一样的。2、温馨提示：如果号码有套餐，挂失...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>115890</th>\n",
       "      <td>我刚买了一张联通卡，过了几天我怎么收到达飞即有分期让我还款的信息</td>\n",
       "      <td>我刚买了一张联通卡，过了几天我怎么收到达飞即有分期让我还款的信息，我又没有借过，该怎么办，打...</td>\n",
       "      <td>出现此情况一般是有以下几种情况：1、信息可能发错接收人了。2、此卡为二次放号的手机卡，前一个...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35555</th>\n",
       "      <td>就是联通网用不了。</td>\n",
       "      <td>NaN</td>\n",
       "      <td>如使用联通手机无法上网，可做以下排查：1、升级为4G套餐后如不重启手机则无法正常使用上网功能...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>73496</th>\n",
       "      <td>生份证复印件被公司拿去开联通号码了怎么办</td>\n",
       "      <td>生份证复印件被公司拿去开联通号码了怎么办</td>\n",
       "      <td>你再用原件去注销</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>114899</th>\n",
       "      <td>手机有4g网络，可是却显示无法上网</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1、检查信号是否正常；2、号卡是否欠费；3、如上面2项都正常，可重新开关机、换机换卡测试；4...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>47230</th>\n",
       "      <td>移动，联通无限打到底是怎么回事</td>\n",
       "      <td>NaN</td>\n",
       "      <td>您好！现运营商均有推出各种语音、流量优惠套餐，具体情况建议您可咨询当地客服热线、实体营业厅、...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                   title  \\\n",
       "104525                    拖欠联通话费会有利息出现吗？   \n",
       "116168            5S日版为什么插移动卡可以用.联通卡就不读卡   \n",
       "154475               联通空中号激活了也不能打电话是怎么回事   \n",
       "153069            联通48元送2g活动本月月租到底算不算进去？   \n",
       "195043                VI###13是不是不支持联通上网卡   \n",
       "5235                          电话号码能定位是真吗   \n",
       "10472                   索尼LT22i可以刷机到4.1吗   \n",
       "86083              苹果ip##ne手机的个人热点怎么设置使用   \n",
       "150247     我用的联通的号码，信号一会有一会没有，请问到底是怎什么回事   \n",
       "202724                       流量畅享包订购生效时间   \n",
       "44450      办了腾讯大王卡，激活后，身份证是不是就剩下半俩张卡的机会了   \n",
       "65875                         现在有联通的合约机吗   \n",
       "51934            联通卡那个腾讯应用省内定向流量免费是什么意思啊   \n",
       "155866                       怎么设置电信手机彩铃？   \n",
       "118696                      联通手机号挂失还能交费吗   \n",
       "115890  我刚买了一张联通卡，过了几天我怎么收到达飞即有分期让我还款的信息   \n",
       "35555                          就是联通网用不了。   \n",
       "73496               生份证复印件被公司拿去开联通号码了怎么办   \n",
       "114899                 手机有4g网络，可是却显示无法上网   \n",
       "47230                    移动，联通无限打到底是怎么回事   \n",
       "\n",
       "                                                 question  \\\n",
       "104525                                                NaN   \n",
       "116168                                                NaN   \n",
       "154475                                联通空中号激活了也不能打电话是怎么回事   \n",
       "153069                                                NaN   \n",
       "195043                                                NaN   \n",
       "5235                                           电话号码能定位是真吗   \n",
       "10472                                                 NaN   \n",
       "86083                                                 NaN   \n",
       "150247                                                NaN   \n",
       "202724                                                NaN   \n",
       "44450                       办了腾讯大王卡，激活后，身份证是不是就剩下半俩张卡的机会了   \n",
       "65875                                                 NaN   \n",
       "51934                                                 NaN   \n",
       "155866                                                NaN   \n",
       "118696                                                NaN   \n",
       "115890  我刚买了一张联通卡，过了几天我怎么收到达飞即有分期让我还款的信息，我又没有借过，该怎么办，打...   \n",
       "35555                                                 NaN   \n",
       "73496                                生份证复印件被公司拿去开联通号码了怎么办   \n",
       "114899                                                NaN   \n",
       "47230                                                 NaN   \n",
       "\n",
       "                                                    reply  is_best  \n",
       "104525                                               应该没有        0  \n",
       "116168  苹果手机卡贴分为移动和联通的，说明卡贴支持移动卡，不支持联通卡，主要是网络制式决定的。联通网...        1  \n",
       "154475  手机已激活却无法接打电话的常见原因及解决方法如下：【1】检查手机是否欠费停机，建议缴费充值；...        1  \n",
       "153069                                                  算        1  \n",
       "195043  VI###13支持联通上网卡。网络参考：主屏尺寸：4.5英寸主屏分辨率：854x480像素后...        1  \n",
       "5235                                                  当然了        0  \n",
       "10472   由于手机所支持的网络是由硬件所确定的，无法通过破解软件或者升级软件系统让手机支持其他运营商的...        1  \n",
       "86083   1、点击“设置”选项;2、在“设置”界面中找到“个人热点”;3、然后我们可以看到“个人热点”...        0  \n",
       "150247          信号不好，手机因素，运营商问题，手机卡问题，很多因素你可以到当地联通营业厅寻求帮助        0  \n",
       "202724  您订购沃商店/沃游戏流量畅享包后，订购当月立即生效，按月自动续订；退订月底生效，当月可继续使...        1  \n",
       "44450   每人仅可预约一张音乐小*卡或视频小*卡或腾讯大*卡或腾讯天*卡（一共仅1张）（识别条件为：联...        1  \n",
       "65875   联通有合约机。合约种类大致有存话费送手机、买手机送话费、合约惠机等，具体合约种类可登录联通网...        0  \n",
       "51934                                    大王卡，对腾讯的应用，都免流量！        0  \n",
       "155866  设置中*电信的彩铃可以自己在网上操作的，前提是先开通中*电信的彩铃业务，可以直接致电电信客服...        1  \n",
       "118696  1、挂失状态下可以交费。交费渠道与手机正常状态下是一样的。2、温馨提示：如果号码有套餐，挂失...        1  \n",
       "115890  出现此情况一般是有以下几种情况：1、信息可能发错接收人了。2、此卡为二次放号的手机卡，前一个...        1  \n",
       "35555   如使用联通手机无法上网，可做以下排查：1、升级为4G套餐后如不重启手机则无法正常使用上网功能...        1  \n",
       "73496                                            你再用原件去注销        0  \n",
       "114899  1、检查信号是否正常；2、号卡是否欠费；3、如上面2项都正常，可重新开关机、换机换卡测试；4...        1  \n",
       "47230   您好！现运营商均有推出各种语音、流量优惠套餐，具体情况建议您可咨询当地客服热线、实体营业厅、...        1  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd_all.sample(n=20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
@@ -1,355 +0,0 @@
 {
 "cells": [
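  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# nonghangzhidao_filter overview\n",
    "0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1n-jT9SKkt6cwI_PjCd7i_g)\n",
    "1. **Data overview:** about 40,000 Agricultural Bank of China Q&A records\n",
    "2. **Recommended experiment:** FAQ question answering system\n",
    "3. **Data source:** Baidu Zhidao\n",
    "4. **Processing:**\n",
    "    1. Removed the id, url, qid, reply_t, and user fields\n",
    "    2. Masked sensitive information in question and reply"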
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = 'path_to_nonghangzhidao_folder/'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. nonghangzhidao_filter.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd_all = pd.read_csv(path + 'nonghangzhidao_filter.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| title | Question title |\n",
    "| question | Question body (may be empty) |\n",
    "| reply | Reply text |\n",
    "| is_best | Whether this reply is the best answer shown on the page |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>question</th>\n",
       "      <th>reply</th>\n",
       "      <th>is_best</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>31655</th>\n",
       "      <td>广东农行转账到江苏农行，几天可以到账？1月4日晚上10点多转的！</td>\n",
       "      <td>NaN</td>\n",
       "      <td>这么久还没有到账的话，建议查询一下是否被退回了，如果未退回的话，需要联系银行查询原因。</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20349</th>\n",
       "      <td>惠水哪里有小额贷款的，而且抵押的东西能方</td>\n",
       "      <td>NaN</td>\n",
       "      <td>留vx..</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20303</th>\n",
       "      <td>想问一下重庆分行的体检通知还有第二批吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>若客户申请的是农行招聘，则可以参考以下信息：1、请登录农行官网，在“关于农行”栏目下选择点击...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18420</th>\n",
       "      <td>现在有什么软件借钱可以秒过的没。江湖救急</td>\n",
       "      <td>NaN</td>\n",
       "      <td>资料真实有效二十分钟放款</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39804</th>\n",
       "      <td>想找高利贷怎么找?</td>\n",
       "      <td>武*那里有高利贷啊？接个几千块就行年后还？有吗</td>\n",
       "      <td>留你联系方式</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23242</th>\n",
       "      <td>别人用建行卡往我农行卡转了20万，一天了怎么还不到账？</td>\n",
       "      <td>别人用建行卡往我农行卡转了20万，一天了怎么还不到账？肯定是</td>\n",
       "      <td>如果是昨天下午五点后就要等到中午以后</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30656</th>\n",
       "      <td>我问银行了，说消不了户、只能刷掉</td>\n",
       "      <td>NaN</td>\n",
       "      <td>如果使用的是农行信用卡，可以致电信用卡客服40######99反馈核实一下。</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>贷的太少了，可以提前还清贷款，然后多贷点他用吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>建议客户选择正规渠道申请贷款，例如农行“网捷贷”。网捷贷是指农业银行向符合特定条件的农业银行...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2961</th>\n",
       "      <td>银行理财和证券公司理财一样吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>不太一样，产品的种类风险不同</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10837</th>\n",
       "      <td>农行提额喜欢刷大额是还是小额</td>\n",
       "      <td>NaN</td>\n",
       "      <td>老农现在是印头与时俱进哦比其他银行都大方。</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34726</th>\n",
       "      <td>存折可以异地取款吗存折取钱一定要本人吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>农行个人活期存折支取方式里如果有凭证件支取，此类存折必须户主本人办理；没有密码的存折只能到开...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2381</th>\n",
       "      <td>住房公积金可以首付么</td>\n",
       "      <td>NaN</td>\n",
       "      <td>不能用公积金来付首付。这个贷款是在购房付了首付款后才能给贷的，也就是说公积金使用只能是与房屋...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21717</th>\n",
       "      <td>农行有银联标识的社会保障卡能开通网银吗？</td>\n",
       "      <td>如题</td>\n",
       "      <td>由农业银行发行的有银联标识的社会保障卡，上面如果有农业银行卡号的话，是可以用本人由身份证和银...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35871</th>\n",
       "      <td>求告知宁*装修贷款条件有哪些</td>\n",
       "      <td>NaN</td>\n",
       "      <td>以建行家装贷为例：“家装贷”是建设银行所有具有装修融资服务功能的个人贷款产品，包括个人住房抵...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3417</th>\n",
       "      <td>重*公积金中介能代取的吗</td>\n",
       "      <td>重*公积金中介能代取的吗个人公积金代取</td>\n",
       "      <td>一般来说，公积金套现主要存在几个方面的风险：一、中介机构提取完公积金后，有可能会携款潜逃，竹...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25763</th>\n",
       "      <td>农行的理财产品能买吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>农行理财业务与国内同业同步，迄今为止，已经形成了制度体系较为完善、系统开发不断前进、产品系列...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13381</th>\n",
       "      <td>出大事了，出大事了，急用钱，请问我</td>\n",
       "      <td>NaN</td>\n",
       "      <td>需要多少呢</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7448</th>\n",
       "      <td>向钱贷会跑路吗？</td>\n",
       "      <td>NaN</td>\n",
       "      <td>不会，放心。</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2617</th>\n",
       "      <td>我想开通农业掌上银行提示要开通短信，可以先开通短信把掌上银行开通后，取消短信服务吗？对其他有影响吗</td>\n",
       "      <td>我想开通农业掌上银行提示要开通短信，可以先开通短信把掌上银行开通后，取消短信服务吗？对其他有...</td>\n",
       "      <td>我想开通农业掌上银行提示要开通短信，可以先开通短信把掌上银行开通后，取消短信服务吗？对其他有...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24669</th>\n",
       "      <td>朋友晚上8点半转账给我到现在还没到帐</td>\n",
       "      <td>NaN</td>\n",
       "      <td>现在外面ATM机都是24小时才到帐</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                   title  \\\n",
       "31655                   广东农行转账到江苏农行，几天可以到账？1月4日晚上10点多转的！   \n",
       "20349                               惠水哪里有小额贷款的，而且抵押的东西能方   \n",
       "20303                                想问一下重庆分行的体检通知还有第二批吗   \n",
       "18420                               现在有什么软件借钱可以秒过的没。江湖救急   \n",
       "39804                                          想找高利贷怎么找?   \n",
       "23242                        别人用建行卡往我农行卡转了20万，一天了怎么还不到账？   \n",
       "30656                                   我问银行了，说消不了户、只能刷掉   \n",
       "8                                贷的太少了，可以提前还清贷款，然后多贷点他用吗   \n",
       "2961                                      银行理财和证券公司理财一样吗   \n",
       "10837                                     农行提额喜欢刷大额是还是小额   \n",
       "34726                                存折可以异地取款吗存折取钱一定要本人吗   \n",
       "2381                                          住房公积金可以首付么   \n",
       "21717                               农行有银联标识的社会保障卡能开通网银吗？   \n",
       "35871                                     求告知宁*装修贷款条件有哪些   \n",
       "3417                                        重*公积金中介能代取的吗   \n",
       "25763                                         农行的理财产品能买吗   \n",
       "13381                                  出大事了，出大事了，急用钱，请问我   \n",
       "7448                                            向钱贷会跑路吗？   \n",
       "2617   我想开通农业掌上银行提示要开通短信，可以先开通短信把掌上银行开通后，取消短信服务吗？对其他有影响吗   \n",
       "24669                                 朋友晚上8点半转账给我到现在还没到帐   \n",
       "\n",
       "                                                question  \\\n",
       "31655                                                NaN   \n",
       "20349                                                NaN   \n",
       "20303                                                NaN   \n",
       "18420                                                NaN   \n",
       "39804                            武*那里有高利贷啊？接个几千块就行年后还？有吗   \n",
       "23242                     别人用建行卡往我农行卡转了20万，一天了怎么还不到账？肯定是   \n",
       "30656                                                NaN   \n",
       "8                                                    NaN   \n",
       "2961                                                 NaN   \n",
       "10837                                                NaN   \n",
       "34726                                                NaN   \n",
       "2381                                                 NaN   \n",
       "21717                                                 如题   \n",
       "35871                                                NaN   \n",
       "3417                                 重*公积金中介能代取的吗个人公积金代取   \n",
       "25763                                                NaN   \n",
       "13381                                                NaN   \n",
       "7448                                                 NaN   \n",
       "2617   我想开通农业掌上银行提示要开通短信，可以先开通短信把掌上银行开通后，取消短信服务吗？对其他有...   \n",
       "24669                                                NaN   \n",
       "\n",
       "                                                   reply  is_best  \n",
       "31655        这么久还没有到账的话，建议查询一下是否被退回了，如果未退回的话，需要联系银行查询原因。        0  \n",
       "20349                                              留vx..        0  \n",
       "20303  若客户申请的是农行招聘，则可以参考以下信息：1、请登录农行官网，在“关于农行”栏目下选择点击...        1  \n",
       "18420                                       资料真实有效二十分钟放款        0  \n",
       "39804                                             留你联系方式        0  \n",
       "23242                                 如果是昨天下午五点后就要等到中午以后        0  \n",
       "30656             如果使用的是农行信用卡，可以致电信用卡客服40######99反馈核实一下。        0  \n",
       "8      建议客户选择正规渠道申请贷款，例如农行“网捷贷”。网捷贷是指农业银行向符合特定条件的农业银行...        0  \n",
       "2961                                      不太一样，产品的种类风险不同        0  \n",
       "10837                              老农现在是印头与时俱进哦比其他银行都大方。        0  \n",
       "34726  农行个人活期存折支取方式里如果有凭证件支取，此类存折必须户主本人办理；没有密码的存折只能到开...        1  \n",
       "2381   不能用公积金来付首付。这个贷款是在购房付了首付款后才能给贷的，也就是说公积金使用只能是与房屋...        1  \n",
       "21717  由农业银行发行的有银联标识的社会保障卡，上面如果有农业银行卡号的话，是可以用本人由身份证和银...        1  \n",
       "35871  以建行家装贷为例：“家装贷”是建设银行所有具有装修融资服务功能的个人贷款产品，包括个人住房抵...        1  \n",
       "3417   一般来说，公积金套现主要存在几个方面的风险：一、中介机构提取完公积金后，有可能会携款潜逃，竹...        1  \n",
       "25763  农行理财业务与国内同业同步，迄今为止，已经形成了制度体系较为完善、系统开发不断前进、产品系列...        1  \n",
       "13381                                              需要多少呢        0  \n",
       "7448                                              不会，放心。        0  \n",
       "2617   我想开通农业掌上银行提示要开通短信，可以先开通短信把掌上银行开通后，取消短信服务吗？对其他有...        0  \n",
       "24669                                  现在外面ATM机都是24小时才到帐        0  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd_all.sample(n=20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
@@ -1,553 +0,0 @@
 {
 "cells": [
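  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# online_shopping_10_cats overview\n",
    "0. **Download:** [GitHub](https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/online_shopping_10_cats/online_shopping_10_cats.zip)\n",
    "1. **Data overview:** 10 categories (books, tablets, mobile phones, fruit, shampoo, water heaters, Mengniu dairy, clothing, computers, hotels), more than 60,000 reviews in total, with about 30,000 positive and about 30,000 negative\n",
    "2. **Recommended experiment:** sentiment / opinion / review polarity analysis\n",
    "3. **Data source:** various e-commerce platforms; details unknown\n",
    "4. **Original datasets:** [Chinese Sentiment Analysis Corpus](https://download.csdn.net/download/weixin_38395744/10231401) and [Chinese Sentiment Analysis Corpus Library](https://download.csdn.net/download/u010097581/9919245), collected online; specific authors and origin unknown\n",
    "5. **Processing:**\n",
    "    1. Merged the 2 corpora into 1\n",
    "    2. Consolidated the original scattered Excel and txt files into a single CSV file\n",
    "    3. Removed duplicates"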
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = 'path_to_online_shopping_10_cats_folder/'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. online_shopping_10_cats.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of reviews (total): 62774\n",
      "Number of reviews (positive): 31728\n",
      "Number of reviews (negative): 31046\n"
     ]
    }
   ],
   "source": [
    "pd_all = pd.read_csv(path + 'online_shopping_10_cats.csv')\n",
    "\n",
    "print('Number of reviews (total): %d' % pd_all.shape[0])\n",
    "print('Number of reviews (positive): %d' % pd_all[pd_all.label==1].shape[0])\n",
    "print('Number of reviews (negative): %d' % pd_all[pd_all.label==0].shape[0])"
   ]
  },
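  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| cat | Category, one of: 书籍 (books), 平板 (tablets), 手机 (mobile phones), 水果 (fruit), 洗发水 (shampoo), 热水器 (water heaters), 蒙牛 (Mengniu), 衣服 (clothing), 计算机 (computers), 酒店 (hotels) |\n",
    "| label | 1 for a positive review, 0 for a negative review |\n",
    "| review | Review text |"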
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cat</th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>11194</th>\n",
       "      <td>平板</td>\n",
       "      <td>0</td>\n",
       "      <td>什么玩意。刚用一天，就充不上电，开不开机，返厂老麻烦，</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17794</th>\n",
       "      <td>水果</td>\n",
       "      <td>1</td>\n",
       "      <td>买了几次了，价格实惠，口感不错，保鲜好！</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29529</th>\n",
       "      <td>洗发水</td>\n",
       "      <td>1</td>\n",
       "      <td>挺值得购买的，有包装买回去送家人，毛巾质量不错。小块的可以拿来当擦手帕。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24976</th>\n",
       "      <td>水果</td>\n",
       "      <td>0</td>\n",
       "      <td>真的就算后悔了。两天才拿到货。还不如水果店买！还都发霉不新鲜了！以后不买了</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28447</th>\n",
       "      <td>洗发水</td>\n",
       "      <td>1</td>\n",
       "      <td>一般般，薄荷洗发水没想象中的凉快</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>264</th>\n",
       "      <td>书籍</td>\n",
       "      <td>1</td>\n",
       "      <td>这本书有别于以往看过的早教书籍，结合了说明文的写实，散文的情致和图册的一目了然。特别是读过几...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>53035</th>\n",
       "      <td>酒店</td>\n",
       "      <td>1</td>\n",
       "      <td>酒店的大堂很漂亮,房间不算小,设施还可以也很干净,离码头很近,而且又有车接送,很方便.晚上2...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50250</th>\n",
       "      <td>计算机</td>\n",
       "      <td>1</td>\n",
       "      <td>做工不错，外壳也很漂亮。测试了一下还行！~中通很快啊，13号下午的订单，今天早上就收到了。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>62461</th>\n",
       "      <td>酒店</td>\n",
       "      <td>0</td>\n",
       "      <td>房间空间比较小， 环境比较吵。特别半夜被窗户外面的空调外机的声音吵醒（因为窗外一条巷子之隔，...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52888</th>\n",
       "      <td>酒店</td>\n",
       "      <td>1</td>\n",
       "      <td>清明节入住两天.从进入酒店就感受到无处不在的服务,非常周到,又很得体.从大堂,商务中心,到前...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31429</th>\n",
       "      <td>洗发水</td>\n",
       "      <td>0</td>\n",
       "      <td>感觉不怎么样，刚刚洗完头发又感觉头发干枯枯的而且还是好油</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21443</th>\n",
       "      <td>水果</td>\n",
       "      <td>0</td>\n",
       "      <td>算了，不要买了，先不说个头小，就味道难吃的要死，还没有路边摊卖的好吃，硬，涩，根本就没有苹果...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19374</th>\n",
       "      <td>水果</td>\n",
       "      <td>1</td>\n",
       "      <td>快递神速，品种与描述一样，比上次买的好吃！</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28188</th>\n",
       "      <td>洗发水</td>\n",
       "      <td>1</td>\n",
       "      <td>还没有用，不过感觉和实体店买的差不多，等用过之后再追加评价吧</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>46182</th>\n",
       "      <td>衣服</td>\n",
       "      <td>0</td>\n",
       "      <td>裤子又大又长，那里像休闲裤，妈的，还修身呢，真是够了</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>62616</th>\n",
       "      <td>酒店</td>\n",
       "      <td>0</td>\n",
       "      <td>奇葩的酒店。在一个办公楼里，自己开车去酒店，很难找到，等到了酒店地下停车场，不知道应该坐那部...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44044</th>\n",
       "      <td>衣服</td>\n",
       "      <td>0</td>\n",
       "      <td>我要晕死得节奏，买回来就没穿过，真的是霉！</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19456</th>\n",
       "      <td>水果</td>\n",
       "      <td>1</td>\n",
       "      <td>苹果不大，但很脆甜。检查了一下，48个没有烂的，有个别难看的。总体上质量不错</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10562</th>\n",
       "      <td>平板</td>\n",
       "      <td>0</td>\n",
       "      <td>差差差真卡渣渣品牌以后在也不相信大品牌了坑是了</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34199</th>\n",
       "      <td>洗发水</td>\n",
       "      <td>0</td>\n",
       "      <td>这个是6月18当天买的，只有半瓶。购物太差劲了</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       cat  label                                             review\n",
       "11194   平板      0                        什么玩意。刚用一天，就充不上电，开不开机，返厂老麻烦，\n",
       "17794   水果      1                               买了几次了，价格实惠，口感不错，保鲜好！\n",
       "29529  洗发水      1               挺值得购买的，有包装买回去送家人，毛巾质量不错。小块的可以拿来当擦手帕。\n",
       "24976   水果      0              真的就算后悔了。两天才拿到货。还不如水果店买！还都发霉不新鲜了！以后不买了\n",
       "28447  洗发水      1                                   一般般，薄荷洗发水没想象中的凉快\n",
       "264     书籍      1  这本书有别于以往看过的早教书籍，结合了说明文的写实，散文的情致和图册的一目了然。特别是读过几...\n",
       "53035   酒店      1  酒店的大堂很漂亮,房间不算小,设施还可以也很干净,离码头很近,而且又有车接送,很方便.晚上2...\n",
       "50250  计算机      1      做工不错，外壳也很漂亮。测试了一下还行！~中通很快啊，13号下午的订单，今天早上就收到了。\n",
       "62461   酒店      0  房间空间比较小， 环境比较吵。特别半夜被窗户外面的空调外机的声音吵醒（因为窗外一条巷子之隔，...\n",
       "52888   酒店      1  清明节入住两天.从进入酒店就感受到无处不在的服务,非常周到,又很得体.从大堂,商务中心,到前...\n",
       "31429  洗发水      0                       感觉不怎么样，刚刚洗完头发又感觉头发干枯枯的而且还是好油\n",
       "21443   水果      0  算了，不要买了，先不说个头小，就味道难吃的要死，还没有路边摊卖的好吃，硬，涩，根本就没有苹果...\n",
       "19374   水果      1                              快递神速，品种与描述一样，比上次买的好吃！\n",
       "28188  洗发水      1                     还没有用，不过感觉和实体店买的差不多，等用过之后再追加评价吧\n",
       "46182   衣服      0                         裤子又大又长，那里像休闲裤，妈的，还修身呢，真是够了\n",
       "62616   酒店      0  奇葩的酒店。在一个办公楼里，自己开车去酒店，很难找到，等到了酒店地下停车场，不知道应该坐那部...\n",
       "44044   衣服      0                              我要晕死得节奏，买回来就没穿过，真的是霉！\n",
       "19456   水果      1             苹果不大，但很脆甜。检查了一下，48个没有烂的，有个别难看的。总体上质量不错\n",
       "10562   平板      0                            差差差真卡渣渣品牌以后在也不相信大品牌了坑是了\n",
       "34199  洗发水      0                            这个是6月18当天买的，只有半瓶。购物太差劲了"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd_all.sample(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Corpus size by category"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "书籍: 3851 (total), 2100 (positive), 1751 (negative)\n",
      "平板: 10000 (total), 5000 (positive), 5000 (negative)\n",
      "手机: 2323 (total), 1165 (positive), 1158 (negative)\n",
      "水果: 10000 (total), 5000 (positive), 5000 (negative)\n",
      "洗发水: 10000 (total), 5000 (positive), 5000 (negative)\n",
      "热水器: 575 (total), 475 (positive), 100 (negative)\n",
      "蒙牛: 2033 (total), 992 (positive), 1041 (negative)\n",
      "衣服: 10000 (total), 5000 (positive), 5000 (negative)\n",
      "计算机: 3992 (total), 1996 (positive), 1996 (negative)\n",
      "酒店: 10000 (total), 5000 (positive), 5000 (negative)\n"
     ]
    }
   ],
   "source": [
    "all_cats = ['书籍', '平板', '手机', '水果', '洗发水', '热水器', '蒙牛', '衣服', '计算机', '酒店']  # all categories (values of the cat column)\n",
    "\n",
    "for cat in all_cats:\n",
    "    pd_data = pd_all[pd_all.cat==cat]\n",
    "    print('{}: {} (total), {} (positive), {} (negative)'.format(cat, pd_data.shape[0],\n",
    "                                                 pd_data[pd_data.label==1].shape[0], pd_data[pd_data.label==0].shape[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Load the corpus for selected categories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of reviews (total): 17843\n",
      "Number of reviews (positive): 9096\n",
      "Number of reviews (negative): 8747\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cat</th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1620</th>\n",
       "      <td>书籍</td>\n",
       "      <td>1</td>\n",
       "      <td>符弦歌&amp;凌悠扬，一个背负着道义和家族荣誉，一个洒脱且桀骜不羁，两个完全不相同的人却因为千丝万...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18872</th>\n",
       "      <td>水果</td>\n",
       "      <td>1</td>\n",
       "      <td>一直在吃，烟台苹果，味道不错，物流快</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>443</th>\n",
       "      <td>书籍</td>\n",
       "      <td>1</td>\n",
       "      <td>仔细回想这本文集，发现自己喜欢的只是写《教室朝南，没有风筝》的麻宁，不知道是她成长了还是自己...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21437</th>\n",
       "      <td>水果</td>\n",
       "      <td>0</td>\n",
       "      <td>最差的一次购物体验，干瘪，坏心，糟糕透顶</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18321</th>\n",
       "      <td>水果</td>\n",
       "      <td>1</td>\n",
       "      <td>多次购买新鲜爽甜，80个头大大个，物流超快，上午9点前下单，下午16点收货</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>568</th>\n",
       "      <td>书籍</td>\n",
       "      <td>1</td>\n",
       "      <td>一开始我是看了当当上的推荐，说不一样的卡梅拉这套书是亚马逊的五星级图书，大家的评论也非常好。...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23927</th>\n",
       "      <td>水果</td>\n",
       "      <td>0</td>\n",
       "      <td>垃圾啊，以后再也不 会买了啊 ，好几个坏的，还有好多歪头歪闹的</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19244</th>\n",
       "      <td>水果</td>\n",
       "      <td>1</td>\n",
       "      <td>包装完好，没有烂果，就是比较小粒，卖相不好。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20643</th>\n",
       "      <td>水果</td>\n",
       "      <td>1</td>\n",
       "      <td>不错不错特别好吃，甜甜的水分还足而且还很脆，第一次在京东买苹果，果然没让我失望，</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22330</th>\n",
       "      <td>水果</td>\n",
       "      <td>0</td>\n",
       "      <td>第一次给差评，刚拿上打开第一个就黑心。差评。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17905</th>\n",
       "      <td>水果</td>\n",
       "      <td>1</td>\n",
       "      <td>妈妈说非常好，谢谢店家，会继续支持</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19439</th>\n",
       "      <td>水果</td>\n",
       "      <td>1</td>\n",
       "      <td>不错不错挺甜的。 收到还凉凉的。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23419</th>\n",
       "      <td>水果</td>\n",
       "      <td>0</td>\n",
       "      <td>吃第一个就是烂的，而且是烂透了的。认栽，图都难得传了！</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>355</th>\n",
       "      <td>书籍</td>\n",
       "      <td>1</td>\n",
       "      <td>这本书从男性的视觉诠释了承诺和责任的关系。从达菲一开始的茫然到最后勇敢面对自己的真心，以及对...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24028</th>\n",
       "      <td>水果</td>\n",
       "      <td>0</td>\n",
       "      <td>味同嚼蜡，水泥地里长出来的吗？一点味道都没有还硬的很，颜色很红，个头很小，口感特别差，真后悔</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>497</th>\n",
       "      <td>书籍</td>\n",
       "      <td>1</td>\n",
       "      <td>因为众所周知的原因，我一直在内心深处比较抵制日本文化，我们接受的教育也是负面的信息多于正面的...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52307</th>\n",
       "      <td>计算机</td>\n",
       "      <td>0</td>\n",
       "      <td>噪音稍大，再就是装XP系统确实蓝屏的几率比较大，装VISTA算了，别的缺点暂时真没发觉，水平有限</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>51268</th>\n",
       "      <td>计算机</td>\n",
       "      <td>0</td>\n",
       "      <td>可能是主板比较特殊，很多Ghost启动光盘不能识别光驱，不过好像萝卜花园的可以识别。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21227</th>\n",
       "      <td>水果</td>\n",
       "      <td>0</td>\n",
       "      <td>好小一个，根本不是进口的。包装好看而已！</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18596</th>\n",
       "      <td>水果</td>\n",
       "      <td>1</td>\n",
       "      <td>好吃真心的好吃赞了，快递特快，继续关注，会回购的</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       cat  label                                             review\n",
       "1620    书籍      1  符弦歌&凌悠扬，一个背负着道义和家族荣誉，一个洒脱且桀骜不羁，两个完全不相同的人却因为千丝万...\n",
       "18872   水果      1                                 一直在吃，烟台苹果，味道不错，物流快\n",
       "443     书籍      1  仔细回想这本文集，发现自己喜欢的只是写《教室朝南，没有风筝》的麻宁，不知道是她成长了还是自己...\n",
       "21437   水果      0                               最差的一次购物体验，干瘪，坏心，糟糕透顶\n",
       "18321   水果      1              多次购买新鲜爽甜，80个头大大个，物流超快，上午9点前下单，下午16点收货\n",
       "568     书籍      1  一开始我是看了当当上的推荐，说不一样的卡梅拉这套书是亚马逊的五星级图书，大家的评论也非常好。...\n",
       "23927   水果      0                    垃圾啊，以后再也不 会买了啊 ，好几个坏的，还有好多歪头歪闹的\n",
       "19244   水果      1                             包装完好，没有烂果，就是比较小粒，卖相不好。\n",
       "20643   水果      1           不错不错特别好吃，甜甜的水分还足而且还很脆，第一次在京东买苹果，果然没让我失望，\n",
       "22330   水果      0                             第一次给差评，刚拿上打开第一个就黑心。差评。\n",
       "17905   水果      1                                  妈妈说非常好，谢谢店家，会继续支持\n",
       "19439   水果      1                                   不错不错挺甜的。 收到还凉凉的。\n",
       "23419   水果      0                        吃第一个就是烂的，而且是烂透了的。认栽，图都难得传了！\n",
       "355     书籍      1  这本书从男性的视觉诠释了承诺和责任的关系。从达菲一开始的茫然到最后勇敢面对自己的真心，以及对...\n",
       "24028   水果      0     味同嚼蜡，水泥地里长出来的吗？一点味道都没有还硬的很，颜色很红，个头很小，口感特别差，真后悔\n",
       "497     书籍      1  因为众所周知的原因，我一直在内心深处比较抵制日本文化，我们接受的教育也是负面的信息多于正面的...\n",
       "52307  计算机      0   噪音稍大，再就是装XP系统确实蓝屏的几率比较大，装VISTA算了，别的缺点暂时真没发觉，水平有限\n",
       "51268  计算机      0         可能是主板比较特殊，很多Ghost启动光盘不能识别光驱，不过好像萝卜花园的可以识别。\n",
       "21227   水果      0                               好小一个，根本不是进口的。包装好看而已！\n",
       "18596   水果      1                           好吃真心的好吃赞了，快递特快，继续关注，会回购的"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target_cats = ['书籍', '水果', '计算机']  # suppose only the 书籍 (books), 水果 (fruit), and 计算机 (computers) categories are needed\n",
    "\n",
    "pd_data = pd_all[pd_all.cat.isin(target_cats)]\n",
    "\n",
    "print('Number of reviews (total): %d' % pd_data.shape[0])\n",
    "print('Number of reviews (positive): %d' % pd_data[pd_data.label==1].shape[0])\n",
    "print('Number of reviews (negative): %d' % pd_data[pd_data.label==0].shape[0])\n",
    "\n",
    "pd_data.sample(20)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
@@ -1,287 +0,0 @@
 {
 "cells": [
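  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# simplifyweibo_4_moods overview\n",
    "0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/16c93E5x373nsGozyWevITg)\n",
    "1. **Data summary:** more than 360,000 sentiment-labeled Sina Weibo posts covering 4 emotions: about 200,000 labeled joy, and about 50,000 each labeled anger, disgust, and sadness\n",
    "2. **Recommended experiments:** sentiment / opinion / review polarity analysis\n",
    "3. **Data source:** [Sina Weibo](https://weibo.com/)\n",
    "4. **Original dataset:** [微博情感分析数据集 (Weibo sentiment analysis dataset)](https://download.csdn.net/download/turkan/9181661), collected online; the specific author and source are unknown\n",
    "5. **Processing:**\n",
    "    1. Merged the original 4 files into 1 csv file\n",
    "    2. The original corpus had been word-segmented; we restored it to unsegmented text\n",
    "    3. Unified the encoding to UTF-8\n",
    "    4. Removed duplicates"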
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = 'path_to_the_simplifyweibo_4_moods_folder'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. simplifyweibo_4_moods.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of posts (total): 361744\n",
      "Number of posts (joy): 199496\n",
      "Number of posts (anger): 51714\n",
      "Number of posts (disgust): 55267\n",
      "Number of posts (sadness): 55267\n"
     ]
    }
   ],
   "source": [
    "pd_all = pd.read_csv(f'{path}/simplifyweibo_4_moods.csv')\n",
    "moods = {0: 'joy', 1: 'anger', 2: 'disgust', 3: 'sadness'}\n",
    "\n",
    "print('Number of posts (total): %d' % pd_all.shape[0])\n",
    "\n",
    "for label, mood in moods.items():\n",
    "    print('Number of posts ({}): {}'.format(mood, pd_all[pd_all.label==label].shape[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| label | 0 joy, 1 anger, 2 disgust, 3 sadness |\n",
    "| review | Weibo post text |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>307114</th>\n",
       "      <td>3</td>\n",
       "      <td>回复美国看起来很美，对别人比较狠！对付哪国人，就用哪国人做他的腿，简称狗腿落后的祖宗挨过打！...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>258815</th>\n",
       "      <td>2</td>\n",
       "      <td>我表示压力狠大。!哇。犀利妹！偶尔街拍，其实姐只是一个你永远无法超越的传说。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>249801</th>\n",
       "      <td>1</td>\n",
       "      <td>可怜，帮这孩子转下，希望不会因为涉嫌联系业务负什么责任啊…………是想粉丝想疯了什么情况啊？想...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>165587</th>\n",
       "      <td>0</td>\n",
       "      <td>哦也~ ~ ~ ！得瑟哈哈哈耶~ ~ ~ ！新logo 。。。。我们的logo 会不会抢了的...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351395</th>\n",
       "      <td>3</td>\n",
       "      <td>我发现真的是最齐全的一张。这是去看北方儿子的时候啊。怀念。对了，我怎么穿那件破衬衫。。好难看...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>339894</th>\n",
       "      <td>3</td>\n",
       "      <td>看你那个享受的表情nuna 很感动~</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>307523</th>\n",
       "      <td>3</td>\n",
       "      <td>不得不轉 ！大家淚 奔吧哈哈27開 始,短短8秒,我咽哽了</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>124636</th>\n",
       "      <td>0</td>\n",
       "      <td>早看到了，再看到还是想笑，好可爱啊</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>56901</th>\n",
       "      <td>0</td>\n",
       "      <td>快来围观我的小丸子模板~ ~ 哇咔咔~ 得瑟~ ~ ~</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>106905</th>\n",
       "      <td>0</td>\n",
       "      <td>也未免太厉害了吧.......观看完此视频之后，我终于明白了香港歌星GEM—— 邓紫棋走红的...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>291966</th>\n",
       "      <td>2</td>\n",
       "      <td>天啊…是住家发生爆炸了，天热，各位注意安全。一朋友开化工厂的。唉。注意安全。真难以想像，不知...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>321489</th>\n",
       "      <td>3</td>\n",
       "      <td>肯德基你就不会带个头，做件好事可爱的脖子们，帮她圆了梦吧~ ~ 小时候来北京，吃过一种小糕点...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>188566</th>\n",
       "      <td>0</td>\n",
       "      <td>想去桂林，上学时候就学到一课文说桂林山水甲天下，一直想去看看品橙网国庆旅游胜地创意评奖活动开...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15444</th>\n",
       "      <td>0</td>\n",
       "      <td>晃姐姐口才真不是一般人的高，这大概就是文凭带来的区别吧。拿着真文凭的人总会觉得那是自己的底线...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>56820</th>\n",
       "      <td>0</td>\n",
       "      <td>火火happy birthday 天蝎座的人虽然喜欢隐藏自己，但是他喜欢掌握每天生活当中与他...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>257031</th>\n",
       "      <td>2</td>\n",
       "      <td>好久没看了。。。还是那么的感动~ ~ ~ ~</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>144782</th>\n",
       "      <td>0</td>\n",
       "      <td>你看他像几岁？关键是牛尔多大?【分享图片】现场挑战高难度抗衰老奇迹~ 看看他都使用倩碧什么产品~</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>130776</th>\n",
       "      <td>0</td>\n",
       "      <td>比江苏台的好玩这个真的很搞笑，再次推荐！哈哈，这个绝对值得一看，搞笑死了。当然其中的讽刺意味...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59158</th>\n",
       "      <td>0</td>\n",
       "      <td>【YMG 推荐】来，哥让你见识下，什么是真正的招财猫！要发财的童鞋抱走~ ~ 在海味舖 買 ?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>240262</th>\n",
       "      <td>1</td>\n",
       "      <td>该带套的时候要带上。大哥，你就得瑟吧和吃饭。美女很美很火。因为吃香辣小龙虾，我的衬衫歇火了。...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        label                                             review\n",
       "307114      3  回复美国看起来很美，对别人比较狠！对付哪国人，就用哪国人做他的腿，简称狗腿落后的祖宗挨过打！...\n",
       "258815      2             我表示压力狠大。!哇。犀利妹！偶尔街拍，其实姐只是一个你永远无法超越的传说。\n",
       "249801      1  可怜，帮这孩子转下，希望不会因为涉嫌联系业务负什么责任啊…………是想粉丝想疯了什么情况啊？想...\n",
       "165587      0  哦也~ ~ ~ ！得瑟哈哈哈耶~ ~ ~ ！新logo 。。。。我们的logo 会不会抢了的...\n",
       "351395      3  我发现真的是最齐全的一张。这是去看北方儿子的时候啊。怀念。对了，我怎么穿那件破衬衫。。好难看...\n",
       "339894      3                                 看你那个享受的表情nuna 很感动~\n",
       "307523      3                      不得不轉 ！大家淚 奔吧哈哈27開 始,短短8秒,我咽哽了\n",
       "124636      0                                  早看到了，再看到还是想笑，好可爱啊\n",
       "56901       0                        快来围观我的小丸子模板~ ~ 哇咔咔~ 得瑟~ ~ ~\n",
       "106905      0  也未免太厉害了吧.......观看完此视频之后，我终于明白了香港歌星GEM—— 邓紫棋走红的...\n",
       "291966      2  天啊…是住家发生爆炸了，天热，各位注意安全。一朋友开化工厂的。唉。注意安全。真难以想像，不知...\n",
       "321489      3  肯德基你就不会带个头，做件好事可爱的脖子们，帮她圆了梦吧~ ~ 小时候来北京，吃过一种小糕点...\n",
       "188566      0  想去桂林，上学时候就学到一课文说桂林山水甲天下，一直想去看看品橙网国庆旅游胜地创意评奖活动开...\n",
       "15444       0  晃姐姐口才真不是一般人的高，这大概就是文凭带来的区别吧。拿着真文凭的人总会觉得那是自己的底线...\n",
       "56820       0  火火happy birthday 天蝎座的人虽然喜欢隐藏自己，但是他喜欢掌握每天生活当中与他...\n",
       "257031      2                             好久没看了。。。还是那么的感动~ ~ ~ ~\n",
       "144782      0   你看他像几岁？关键是牛尔多大?【分享图片】现场挑战高难度抗衰老奇迹~ 看看他都使用倩碧什么产品~\n",
       "130776      0  比江苏台的好玩这个真的很搞笑，再次推荐！哈哈，这个绝对值得一看，搞笑死了。当然其中的讽刺意味...\n",
       "59158       0    【YMG 推荐】来，哥让你见识下，什么是真正的招财猫！要发财的童鞋抱走~ ~ 在海味舖 買 ?\n",
       "240262      1  该带套的时候要带上。大哥，你就得瑟吧和吃饭。美女很美很火。因为吃香辣小龙虾，我的衬衫歇火了。..."
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd_all.sample(20)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
@@ -1,341 +0,0 @@
 {
 "cells": [
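  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# touzizhidao_filter overview\n",
    "0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1SR5d20DPpU7F1h_OVf64GA)\n",
    "1. **Data summary:** 588,000 question-and-answer records from the investment domain\n",
    "2. **Recommended experiments:** FAQ question-answering systems\n",
    "3. **Data source:** Baidu Zhidao\n",
    "4. **Processing:**\n",
    "    1. Removed the id, url, qid, reply_t, and user fields\n",
    "    2. Masked sensitive content in the question and reply fields"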
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = 'path_to_the_touzizhidao_folder'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. touzizhidao_filter.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd_all = pd.read_csv(f'{path}/touzizhidao_filter.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| title | Question title |\n",
    "| question | Question body (may be empty) |\n",
    "| reply | Reply content |\n",
    "| is_best | Whether the reply is the best answer shown on the page |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>question</th>\n",
       "      <th>reply</th>\n",
       "      <th>is_best</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>133637</th>\n",
       "      <td>华夏银行信用卡怎么查询申请进度</td>\n",
       "      <td>NaN</td>\n",
       "      <td>信用卡申请进度查询：查询步骤：一、网银查询：1、登录银行信用卡中心页面，然后点击“办卡进度查...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>295236</th>\n",
       "      <td>我向上海复星投资创业有限公司申请贷款要交1000元保险开户费，交了</td>\n",
       "      <td>我向上海复星投资创业有限公司申请贷款要交1000元保险开户费，交了过后又说我银行卡不行还要交...</td>\n",
       "      <td>我的不用</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>329332</th>\n",
       "      <td>二手房买卖中介收费是多少二手房买卖中介如何收费</td>\n",
       "      <td>NaN</td>\n",
       "      <td>二手房交易流程(1)买方咨询买卖双方建立信息沟通渠道，买方了解房屋整体现状及产权状况，要求卖...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>176871</th>\n",
       "      <td>单位给职工办的社保卡买药里面资金不足怎么办</td>\n",
       "      <td>单位给职工办的社保卡买药里面资金不足怎么办</td>\n",
       "      <td>不足的部分需要自己支付医保卡的使用范围主要有以下三个方面：1、用于购药：参保人员在定点药店买...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>485667</th>\n",
       "      <td>银联医保卡去哪家银行激活。</td>\n",
       "      <td>NaN</td>\n",
       "      <td>医保卡上面的银行医保卡激活的步骤：1、带着老卡和新卡到建设银行办理；2、新医保卡的密码是身份...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5012</th>\n",
       "      <td>买一套大概150万的二手门面房大概要交多少钱</td>\n",
       "      <td>NaN</td>\n",
       "      <td>如果购买的是非普通住宅，除了缴纳房屋费用，还需要按以下规定缴纳相关税费：（1）增值税：非住宅...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>475672</th>\n",
       "      <td>二手房买卖，公共维修基金应怎么处理？是需要下家支付给上家账面余额，还是无偿顺延呢？</td>\n",
       "      <td>买卖合同中是这样写的：“出卖人同意其缴纳的该房屋专项维修资金（公共维修基金）的账面余额转移至...</td>\n",
       "      <td>需要办理维修基金过户。无偿顺延就可以。维修基金使用条件：1、维修基金只有在保修期满后，对物业...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>199291</th>\n",
       "      <td>信用卡全额还款好还是最低还款好</td>\n",
       "      <td>NaN</td>\n",
       "      <td>如果条件可以，当然是全额还款好，最低还款是要付利息的，而且还有点高，银行当然希望是最低还款，...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>265499</th>\n",
       "      <td>花呗如何才能提额</td>\n",
       "      <td>花呗怎样才能提高额度</td>\n",
       "      <td>花呗额度取决于芝麻信用分，若要提升额度，需要先提升芝麻信用分，提升芝麻信用分小技巧：1、多在...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>224237</th>\n",
       "      <td>工行信用卡逾期两个月，没有90天！！</td>\n",
       "      <td>工行信用卡逾期两个月，没有90天！！银行把卡冻结了，欠款7000，全部还清以后打电话解冻，客...</td>\n",
       "      <td>可以用，但额度只有2000元，且征信上有逾期记录注销吧</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>271023</th>\n",
       "      <td>中*房价什么时候会大跌</td>\n",
       "      <td>NaN</td>\n",
       "      <td>我感觉房价下降的几率比较小，现在啥都涨价，国家再调控，也不可能让我这月收入几千块钱的人买得起...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14097</th>\n",
       "      <td>在四*地*，个人所得税达到多少金额</td>\n",
       "      <td>NaN</td>\n",
       "      <td>个人所得税征税内容工资、薪金所得,个体工商户的生产、经营所得,他有偿服务活动取得的所得。经营...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>363978</th>\n",
       "      <td>关于贷款的</td>\n",
       "      <td>关于贷款的有没有什么借款途径</td>\n",
       "      <td>有口子。</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>517939</th>\n",
       "      <td>农村建房可以贷款吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>不可以，银行贷款一般是能够上市交易的房子。贷款需要准备四大类资料：1、个人身份证明：身份证、...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>505671</th>\n",
       "      <td>2017年甘*个人医保卡能让别人用吗</td>\n",
       "      <td>NaN</td>\n",
       "      <td>个人医保卡是不能让别人使用的。医保卡（社保卡）只限本人就医时使用，不能出借给他人。参保人如把...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>117318</th>\n",
       "      <td>别墅好还是高层好</td>\n",
       "      <td>别墅好还是高层好</td>\n",
       "      <td>别墅。还是看你自己的需要还有经济能力了不是房子建的好看就算是别墅的。别墅即别野，讲究的是周围...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>376669</th>\n",
       "      <td>新车什么时候算上户了也就说法律上属于自己的财产</td>\n",
       "      <td>NaN</td>\n",
       "      <td>购房合同签订完了车子就属于个人财产了。中*人*共*国*法通则第七十五条规定：个人财产所有权包...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179097</th>\n",
       "      <td>农民59岁买什么养老</td>\n",
       "      <td>农民59岁买什么养老</td>\n",
       "      <td>多存点钱。</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>77847</th>\n",
       "      <td>支付宝转账手续费怎么收的</td>\n",
       "      <td>NaN</td>\n",
       "      <td>好想是一个月内不能超过5万没有手续费你好，每个支付宝账户有两万元的免费提现和转账额度，提现和...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>319220</th>\n",
       "      <td>甘*省企业退休人员养老金怎么调整</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2016年，我国实现了企业和机关事业单位养老金待遇同步调整，按6.5%左右提高企业和机关事业...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            title  \\\n",
       "133637                            华夏银行信用卡怎么查询申请进度   \n",
       "295236          我向上海复星投资创业有限公司申请贷款要交1000元保险开户费，交了   \n",
       "329332                    二手房买卖中介收费是多少二手房买卖中介如何收费   \n",
       "176871                      单位给职工办的社保卡买药里面资金不足怎么办   \n",
       "485667                              银联医保卡去哪家银行激活。   \n",
       "5012                       买一套大概150万的二手门面房大概要交多少钱   \n",
       "475672  二手房买卖，公共维修基金应怎么处理？是需要下家支付给上家账面余额，还是无偿顺延呢？   \n",
       "199291                            信用卡全额还款好还是最低还款好   \n",
       "265499                                   花呗如何才能提额   \n",
       "224237                         工行信用卡逾期两个月，没有90天！！   \n",
       "271023                                中*房价什么时候会大跌   \n",
       "14097                           在四*地*，个人所得税达到多少金额   \n",
       "363978                                      关于贷款的   \n",
       "517939                                  农村建房可以贷款吗   \n",
       "505671                         2017年甘*个人医保卡能让别人用吗   \n",
       "117318                                   别墅好还是高层好   \n",
       "376669                    新车什么时候算上户了也就说法律上属于自己的财产   \n",
       "179097                                 农民59岁买什么养老   \n",
       "77847                                支付宝转账手续费怎么收的   \n",
       "319220                           甘*省企业退休人员养老金怎么调整   \n",
       "\n",
       "                                                 question  \\\n",
       "133637                                                NaN   \n",
       "295236  我向上海复星投资创业有限公司申请贷款要交1000元保险开户费，交了过后又说我银行卡不行还要交...   \n",
       "329332                                                NaN   \n",
       "176871                              单位给职工办的社保卡买药里面资金不足怎么办   \n",
       "485667                                                NaN   \n",
       "5012                                                  NaN   \n",
       "475672  买卖合同中是这样写的：“出卖人同意其缴纳的该房屋专项维修资金（公共维修基金）的账面余额转移至...   \n",
       "199291                                                NaN   \n",
       "265499                                         花呗怎样才能提高额度   \n",
       "224237  工行信用卡逾期两个月，没有90天！！银行把卡冻结了，欠款7000，全部还清以后打电话解冻，客...   \n",
       "271023                                                NaN   \n",
       "14097                                                 NaN   \n",
       "363978                                     关于贷款的有没有什么借款途径   \n",
       "517939                                                NaN   \n",
       "505671                                                NaN   \n",
       "117318                                           别墅好还是高层好   \n",
       "376669                                                NaN   \n",
       "179097                                         农民59岁买什么养老   \n",
       "77847                                                 NaN   \n",
       "319220                                                NaN   \n",
       "\n",
       "                                                    reply  is_best  \n",
       "133637  信用卡申请进度查询：查询步骤：一、网银查询：1、登录银行信用卡中心页面，然后点击“办卡进度查...        1  \n",
       "295236                                               我的不用        0  \n",
       "329332  二手房交易流程(1)买方咨询买卖双方建立信息沟通渠道，买方了解房屋整体现状及产权状况，要求卖...        1  \n",
       "176871  不足的部分需要自己支付医保卡的使用范围主要有以下三个方面：1、用于购药：参保人员在定点药店买...        1  \n",
       "485667  医保卡上面的银行医保卡激活的步骤：1、带着老卡和新卡到建设银行办理；2、新医保卡的密码是身份...        1  \n",
       "5012    如果购买的是非普通住宅，除了缴纳房屋费用，还需要按以下规定缴纳相关税费：（1）增值税：非住宅...        1  \n",
       "475672  需要办理维修基金过户。无偿顺延就可以。维修基金使用条件：1、维修基金只有在保修期满后，对物业...        0  \n",
       "199291  如果条件可以，当然是全额还款好，最低还款是要付利息的，而且还有点高，银行当然希望是最低还款，...        1  \n",
       "265499  花呗额度取决于芝麻信用分，若要提升额度，需要先提升芝麻信用分，提升芝麻信用分小技巧：1、多在...        0  \n",
       "224237                        可以用，但额度只有2000元，且征信上有逾期记录注销吧        0  \n",
       "271023  我感觉房价下降的几率比较小，现在啥都涨价，国家再调控，也不可能让我这月收入几千块钱的人买得起...        0  \n",
       "14097   个人所得税征税内容工资、薪金所得,个体工商户的生产、经营所得,他有偿服务活动取得的所得。经营...        1  \n",
       "363978                                               有口子。        0  \n",
       "517939  不可以，银行贷款一般是能够上市交易的房子。贷款需要准备四大类资料：1、个人身份证明：身份证、...        1  \n",
       "505671  个人医保卡是不能让别人使用的。医保卡（社保卡）只限本人就医时使用，不能出借给他人。参保人如把...        1  \n",
       "117318  别墅。还是看你自己的需要还有经济能力了不是房子建的好看就算是别墅的。别墅即别野，讲究的是周围...        0  \n",
       "376669  购房合同签订完了车子就属于个人财产了。中*人*共*国*法通则第七十五条规定：个人财产所有权包...        1  \n",
       "179097                                              多存点钱。        0  \n",
       "77847   好想是一个月内不能超过5万没有手续费你好，每个支付宝账户有两万元的免费提现和转账额度，提现和...        0  \n",
       "319220  2016年，我国实现了企业和机关事业单位养老金待遇同步调整，按6.5%左右提高企业和机关事业...        1  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd_all.sample(n=20)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
@@ -1,426 +0,0 @@
 {
 "cells": [
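  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# About waimai_10k\n",
    "0. **Download:** [Github](https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/waimai_10k/waimai_10k.csv)\n",
    "1. **Overview:** User reviews collected from a food-delivery platform: 4,000 positive and about 8,000 negative\n",
    "2. **Recommended experiments:** Sentiment / opinion / review polarity analysis\n",
    "3. **Data source:** A food-delivery platform\n",
    "4. **Original dataset:** [中文短文本情感分析语料 外卖评价](https://download.csdn.net/download/cstkl/10236683), collected from the web; the exact author and origin are unknown\n",
    "5. **Processing:**\n",
    "    1. Merged the 2 original files into 1 file\n",
    "    2. Removed duplicates"
   ]
  },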
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
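  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Placeholder: folder containing waimai_10k.csv. Keep the trailing separator,\n",
    "# because the file name is appended by plain string concatenation.\n",
    "path = 'path/to/waimai_10k/'"
   ]
  },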
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. waimai_10k.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of reviews (total): 11987\n",
      "Number of reviews (positive): 4000\n",
      "Number of reviews (negative): 7987\n"
     ]
    }
   ],
   "source": [
    "pd_all = pd.read_csv(path + 'waimai_10k.csv')\n",
    "\n",
    "print('Number of reviews (total): %d' % pd_all.shape[0])\n",
    "print('Number of reviews (positive): %d' % pd_all[pd_all.label==1].shape[0])\n",
    "print('Number of reviews (negative): %d' % pd_all[pd_all.label==0].shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| label | 1 = positive review, 0 = negative review |\n",
    "| review | Review text |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>1</td>\n",
       "      <td>送餐特别快,态度也好,辛苦啦</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6632</th>\n",
       "      <td>0</td>\n",
       "      <td>点了热带雨林披萨+饮料，和BBQ鸡肉披萨+饮料，送来的是两个奥尔良披萨+两个银耳冰粥，冰凉冰...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8849</th>\n",
       "      <td>0</td>\n",
       "      <td>难吃!!!油死了，味道烂</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11114</th>\n",
       "      <td>0</td>\n",
       "      <td>今天菜太咸，连着定了3天吃，一天比一天难吃。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11661</th>\n",
       "      <td>0</td>\n",
       "      <td>送的太慢了，菜都凉了。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9571</th>\n",
       "      <td>0</td>\n",
       "      <td>没有满减！</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10614</th>\n",
       "      <td>0</td>\n",
       "      <td>差评！定的时间是12点一刻，结果刚11点就送来了！果断退单。送餐前不看时间吗？</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7585</th>\n",
       "      <td>0</td>\n",
       "      <td>羊肉串太咸，还有些不新鲜。鸡心和鸡胗烤的太老</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6919</th>\n",
       "      <td>0</td>\n",
       "      <td>快递员挺好，速度挺快</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3192</th>\n",
       "      <td>1</td>\n",
       "      <td>小炒肉卷饼好辣~</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10224</th>\n",
       "      <td>0</td>\n",
       "      <td>送来的时候都凉了,味道一般,鲜果西米露就两口的量,鲜果就是一块西瓜一个西瓜籽</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7295</th>\n",
       "      <td>0</td>\n",
       "      <td>没放糖，没放奶油，好难喝</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>275</th>\n",
       "      <td>1</td>\n",
       "      <td>他家的奶茶超级好喝。。。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8378</th>\n",
       "      <td>0</td>\n",
       "      <td>黑椒牛柳饭送成大排饭</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5879</th>\n",
       "      <td>0</td>\n",
       "      <td>一个半小时，可以</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7523</th>\n",
       "      <td>0</td>\n",
       "      <td>订单满减后应该是24，送过来要收我原价39？你搞笑呐，还少听加多宝！我管你什么美食送的还是你...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6590</th>\n",
       "      <td>0</td>\n",
       "      <td>真心也忒慢了，其他都还成</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1703</th>\n",
       "      <td>1</td>\n",
       "      <td>非常划算，很好</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5345</th>\n",
       "      <td>0</td>\n",
       "      <td>首选是得吐槽一下这家的速度,一个半小时起,然后卷饼包装很不错,酱香鸡肉的比较赞,飘香肘子一般...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1674</th>\n",
       "      <td>1</td>\n",
       "      <td>离我们远点55分钟送到的，可以理解，饼和粥都不错</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       label                                             review\n",
       "25         1                                     送餐特别快,态度也好,辛苦啦\n",
       "6632       0  点了热带雨林披萨+饮料，和BBQ鸡肉披萨+饮料，送来的是两个奥尔良披萨+两个银耳冰粥，冰凉冰...\n",
       "8849       0                                       难吃!!!油死了，味道烂\n",
       "11114      0                             今天菜太咸，连着定了3天吃，一天比一天难吃。\n",
       "11661      0                                        送的太慢了，菜都凉了。\n",
       "9571       0                                              没有满减！\n",
       "10614      0            差评！定的时间是12点一刻，结果刚11点就送来了！果断退单。送餐前不看时间吗？\n",
       "7585       0                             羊肉串太咸，还有些不新鲜。鸡心和鸡胗烤的太老\n",
       "6919       0                                         快递员挺好，速度挺快\n",
       "3192       1                                           小炒肉卷饼好辣~\n",
       "10224      0             送来的时候都凉了,味道一般,鲜果西米露就两口的量,鲜果就是一块西瓜一个西瓜籽\n",
       "7295       0                                       没放糖，没放奶油，好难喝\n",
       "275        1                                       他家的奶茶超级好喝。。。\n",
       "8378       0                                         黑椒牛柳饭送成大排饭\n",
       "5879       0                                           一个半小时，可以\n",
       "7523       0  订单满减后应该是24，送过来要收我原价39？你搞笑呐，还少听加多宝！我管你什么美食送的还是你...\n",
       "6590       0                                       真心也忒慢了，其他都还成\n",
       "1703       1                                            非常划算，很好\n",
       "5345       0  首选是得吐槽一下这家的速度,一个半小时起,然后卷饼包装很不错,酱香鸡肉的比较赞,飘香肘子一般...\n",
       "1674       1                           离我们远点55分钟送到的，可以理解，饼和粥都不错"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd_all.sample(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Building a balanced corpus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd_positive = pd_all[pd_all.label==1]\n",
    "pd_negative = pd_all[pd_all.label==0]\n",
    "\n",
    "def get_balance_corpus(corpus_size, corpus_pos, corpus_neg):\n",
    "    # Draw corpus_size // 2 rows from each class; sample with replacement\n",
    "    # only when a class has fewer rows than requested.\n",
    "    sample_size = corpus_size // 2\n",
    "    pd_corpus_balance = pd.concat([corpus_pos.sample(sample_size, replace=corpus_pos.shape[0]<sample_size),\n",
    "                                   corpus_neg.sample(sample_size, replace=corpus_neg.shape[0]<sample_size)])\n",
    "    \n",
    "    print('Number of reviews (total): %d' % pd_corpus_balance.shape[0])\n",
    "    print('Number of reviews (positive): %d' % pd_corpus_balance[pd_corpus_balance.label==1].shape[0])\n",
    "    print('Number of reviews (negative): %d' % pd_corpus_balance[pd_corpus_balance.label==0].shape[0])    \n",
    "    \n",
    "    return pd_corpus_balance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of reviews (total): 4000\n",
      "Number of reviews (positive): 2000\n",
      "Number of reviews (negative): 2000\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10436</th>\n",
       "      <td>0</td>\n",
       "      <td>难吃～石锅拌饭居然没酱～而且刚好晚了29分钟</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10468</th>\n",
       "      <td>0</td>\n",
       "      <td>等了很久，没关系，毕竟还在约定时间内，可是最让我忍不了的是真的很一般，个人口味吧，反正不和我...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1643</th>\n",
       "      <td>1</td>\n",
       "      <td>嗯，纸袋比较高大上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8723</th>\n",
       "      <td>0</td>\n",
       "      <td>海参怎么是生的，没法吃，郁闷</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2431</th>\n",
       "      <td>1</td>\n",
       "      <td>送餐很快，送餐人员很热情！～</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5121</th>\n",
       "      <td>0</td>\n",
       "      <td>不如以前好吃，肘子都有味儿了！哎！</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10565</th>\n",
       "      <td>0</td>\n",
       "      <td>东西有些小贵。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2413</th>\n",
       "      <td>1</td>\n",
       "      <td>虽然时间长了些但是很准时。下次记得给些番茄酱就更好了。,一个人吃足够了。好好吃</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11937</th>\n",
       "      <td>0</td>\n",
       "      <td>11点以前就定的餐，做了1小时48分钟，呵呵，我只想说：拜拜！！！</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1024</th>\n",
       "      <td>1</td>\n",
       "      <td>很好吃，面皮特别有嚼劲儿，酱料也很好吃</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       label                                             review\n",
       "10436      0                             难吃～石锅拌饭居然没酱～而且刚好晚了29分钟\n",
       "10468      0  等了很久，没关系，毕竟还在约定时间内，可是最让我忍不了的是真的很一般，个人口味吧，反正不和我...\n",
       "1643       1                                          嗯，纸袋比较高大上\n",
       "8723       0                                     海参怎么是生的，没法吃，郁闷\n",
       "2431       1                                     送餐很快，送餐人员很热情！～\n",
       "5121       0                                  不如以前好吃，肘子都有味儿了！哎！\n",
       "10565      0                                            东西有些小贵。\n",
       "2413       1            虽然时间长了些但是很准时。下次记得给些番茄酱就更好了。,一个人吃足够了。好好吃\n",
       "11937      0                  11点以前就定的餐，做了1小时48分钟，呵呵，我只想说：拜拜！！！\n",
       "1024       1                                很好吃，面皮特别有嚼劲儿，酱料也很好吃"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "waimai_10k_ba_4000 = get_balance_corpus(4000, pd_positive, pd_negative)\n",
    "\n",
    "waimai_10k_ba_4000.sample(10)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
@@ -1,280 +0,0 @@
 {
 "cells": [
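  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# About weibo_senti_100k\n",
    "0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1DoQbki3YwqkuwQUOj64R_g)\n",
    "1. **Overview:** About 120,000 sentiment-labeled Sina Weibo posts, roughly 60,000 positive and 60,000 negative\n",
    "2. **Recommended experiments:** Sentiment / opinion / review polarity analysis\n",
    "3. **Data source:** [Sina Weibo](https://weibo.com/)\n",
    "4. **Original dataset:** [新浪微博，情感分析标记语料共12万条](https://download.csdn.net/download/weixin_38442818/10214750), collected from the web; the exact author and origin are unknown\n",
    "5. **Processing:**\n",
    "    1. Merged the 2 original documents into 1 csv file\n",
    "    2. Unified the encoding to UTF-8\n",
    "    3. Removed duplicates"
   ]
  },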
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
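  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Placeholder: folder containing weibo_senti_100k.csv. Keep the trailing separator,\n",
    "# because the file name is appended by plain string concatenation.\n",
    "path = 'path/to/weibo_senti_100k/'"
   ]
  },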
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. weibo_senti_100k.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of reviews (total): 119988\n",
      "Number of reviews (positive): 59993\n",
      "Number of reviews (negative): 59995\n"
     ]
    }
   ],
   "source": [
    "pd_all = pd.read_csv(path + 'weibo_senti_100k.csv')\n",
    "\n",
    "print('Number of reviews (total): %d' % pd_all.shape[0])\n",
    "print('Number of reviews (positive): %d' % pd_all[pd_all.label==1].shape[0])\n",
    "print('Number of reviews (negative): %d' % pd_all[pd_all.label==0].shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| label | 1 = positive, 0 = negative |\n",
    "| review | Weibo post text |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>62050</th>\n",
       "      <td>0</td>\n",
       "      <td>太过分了@Rexzhenghao  //@Janie_Zhang:招行最近负面新闻越来越多呀...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>68263</th>\n",
       "      <td>0</td>\n",
       "      <td>希望你?得好?我本＂?肥血?史＂[晕][哈哈]@Pete三姑父</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>81472</th>\n",
       "      <td>0</td>\n",
       "      <td>有点想参加????[偷?]想安排下时间再决定[抓狂]//@黑晶晶crystal: @细腿大羽...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>42021</th>\n",
       "      <td>1</td>\n",
       "      <td>[给力]感谢所有支持雯婕的芝麻！[爱你]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7777</th>\n",
       "      <td>1</td>\n",
       "      <td>2013最后一天，在新加坡开心度过，向所有的朋友们问声：新年快乐！2014年，我们会更好[调...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100399</th>\n",
       "      <td>0</td>\n",
       "      <td>大中午出门办事找错路，曝晒中。要多杯具有多杯具。[泪][泪][汗]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>82398</th>\n",
       "      <td>0</td>\n",
       "      <td>马航还会否认吗？到底在隐瞒啥呢？[抓狂]//@头条新闻: 转发微博</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>106423</th>\n",
       "      <td>0</td>\n",
       "      <td>克罗地亚球迷很爱放烟火！球又没进，就硝烟四起。[晕]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24798</th>\n",
       "      <td>1</td>\n",
       "      <td>[抱抱]福芦 TangRoulou 吉祥书 8.8折优惠 &gt;&gt;&gt; http://t.cn/z...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6598</th>\n",
       "      <td>1</td>\n",
       "      <td>回复@钱旭明QXM:[嘻嘻][嘻嘻] //@钱旭明QXM:杨大哥[good][good][g...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>53920</th>\n",
       "      <td>1</td>\n",
       "      <td>人家这脸长的!!!!!![哈哈]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15587</th>\n",
       "      <td>1</td>\n",
       "      <td>这个价不算高，和一天内训相比相差无几。。[哈哈]//@博通传媒v: 6个月！一个月工资1万，...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>101237</th>\n",
       "      <td>0</td>\n",
       "      <td>终于收工啦，脚丫子快冻掉了[泪][泪][泪]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>82449</th>\n",
       "      <td>0</td>\n",
       "      <td>我决定从今天开始我想吃什么就去吃什么，一个人吃也无所谓，重点是不要因为别人的意见委屈了自己[...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32537</th>\n",
       "      <td>1</td>\n",
       "      <td>飘雪的北京 需要双份早餐.......//@美食天下: [哈哈]//@王淼Margay: 屁...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10630</th>\n",
       "      <td>1</td>\n",
       "      <td>[耶]，这个太赞了，生活大爆炸第六季马上要出啦[鼓掌] //@-郑瑜-:这个不错 //@经典...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>85130</th>\n",
       "      <td>0</td>\n",
       "      <td>刚追完#倾世皇妃#，#千山暮雪#又紧随其后，网速和更新速度都太不给力，尽管我看过原著，还是焦...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>105956</th>\n",
       "      <td>0</td>\n",
       "      <td>晚上看金二胖?察前?，推出的火炮基座?糟了，可以PK了[泪] //@艾米粒er: //@wi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>72391</th>\n",
       "      <td>0</td>\n",
       "      <td>必须把中国足球的伟大，用我的职业演说出来 //@袁腾飞:[泪]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10761</th>\n",
       "      <td>1</td>\n",
       "      <td>[鼓掌] //@宁波香格里拉大酒店: 小编来答疑，周五晚惊艳全场的树根蛋糕到底有多长？蛋糕全...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        label                                             review\n",
       "62050       0  太过分了@Rexzhenghao  //@Janie_Zhang:招行最近负面新闻越来越多呀...\n",
       "68263       0                    希望你?得好?我本＂?肥血?史＂[晕][哈哈]@Pete三姑父\n",
       "81472       0  有点想参加????[偷?]想安排下时间再决定[抓狂]//@黑晶晶crystal: @细腿大羽...\n",
       "42021       1                               [给力]感谢所有支持雯婕的芝麻！[爱你]\n",
       "7777        1  2013最后一天，在新加坡开心度过，向所有的朋友们问声：新年快乐！2014年，我们会更好[调...\n",
       "100399      0                  大中午出门办事找错路，曝晒中。要多杯具有多杯具。[泪][泪][汗]\n",
       "82398       0                  马航还会否认吗？到底在隐瞒啥呢？[抓狂]//@头条新闻: 转发微博\n",
       "106423      0                         克罗地亚球迷很爱放烟火！球又没进，就硝烟四起。[晕]\n",
       "24798       1  [抱抱]福芦 TangRoulou 吉祥书 8.8折优惠 >>> http://t.cn/z...\n",
       "6598        1  回复@钱旭明QXM:[嘻嘻][嘻嘻] //@钱旭明QXM:杨大哥[good][good][g...\n",
       "53920       1                                   人家这脸长的!!!!!![哈哈]\n",
       "15587       1  这个价不算高，和一天内训相比相差无几。。[哈哈]//@博通传媒v: 6个月！一个月工资1万，...\n",
       "101237      0                             终于收工啦，脚丫子快冻掉了[泪][泪][泪]\n",
       "82449       0  我决定从今天开始我想吃什么就去吃什么，一个人吃也无所谓，重点是不要因为别人的意见委屈了自己[...\n",
       "32537       1  飘雪的北京 需要双份早餐.......//@美食天下: [哈哈]//@王淼Margay: 屁...\n",
       "10630       1  [耶]，这个太赞了，生活大爆炸第六季马上要出啦[鼓掌] //@-郑瑜-:这个不错 //@经典...\n",
       "85130       0  刚追完#倾世皇妃#，#千山暮雪#又紧随其后，网速和更新速度都太不给力，尽管我看过原著，还是焦...\n",
       "105956      0  晚上看金二胖?察前?，推出的火炮基座?糟了，可以PK了[泪] //@艾米粒er: //@wi...\n",
       "72391       0                    必须把中国足球的伟大，用我的职业演说出来 //@袁腾飞:[泪]\n",
       "10761       1  [鼓掌] //@宁波香格里拉大酒店: 小编来答疑，周五晚惊艳全场的树根蛋糕到底有多长？蛋糕全..."
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd_all.sample(20)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
@@ -1,804 +0,0 @@
 {
 "cells": [
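  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# About yf_amazon\n",
    "0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1SbfpZb5cm-g2LmnYV_af8Q)\n",
    "1. **Overview:** 520,000 products, more than 1,100 categories, 1.42 million users, 7.2 million reviews/ratings\n",
    "2. **Recommended experiments:** Recommender systems; sentiment / opinion / review polarity analysis\n",
    "3. **Data source:** [Amazon](https://www.amazon.cn/)\n",
    "4. **Original dataset:** [JD.com E-Commerce Data](http://yongfeng.me/dataset/), collected by Prof. Yongfeng Zhang for a WWW 2015 conference paper\n",
    "5. **Processing:**\n",
    "    1. Converted full-width characters to half-width and encoded the text as UTF-8\n",
    "    2. Reorganized the data into a format compatible with [MovieLens](https://grouplens.org/datasets/movielens/)\n",
    "    3. Anonymized the data to protect user privacy"
   ]
  },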
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
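  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Placeholder: folder containing the yf_amazon csv files. Keep the trailing separator,\n",
    "# because the file names are appended by plain string concatenation.\n",
    "path = 'path/to/yf_amazon/'"
   ]
  },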
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. products.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of products: 525619\n"
     ]
    }
   ],
   "source": [
    "products = pd.read_csv(path + 'products.csv')\n",
    "\n",
    "print('Number of products: %d' % products.shape[0])"
   ]
  },
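  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| productId | Product id (numbered consecutively from 0) |\n",
    "| name | Product name |\n",
    "| catIds | Category ids (numbered consecutively from 0); from left to right: first-level, second-level, and third-level category |"
   ]
  },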
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>productId</th>\n",
       "      <th>name</th>\n",
       "      <th>catIds</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>331420</th>\n",
       "      <td>331420</td>\n",
       "      <td>欧意金狐狸 女式 皮手套 QT602</td>\n",
       "      <td>802,143,996</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>130945</th>\n",
       "      <td>130945</td>\n",
       "      <td>YESO TOT 中性 单肩包/斜挎包 均码 9411</td>\n",
       "      <td>1111,864,781</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179886</th>\n",
       "      <td>179886</td>\n",
       "      <td>李斯特论柏辽兹与舒曼</td>\n",
       "      <td>832,552,337</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>504123</th>\n",
       "      <td>504123</td>\n",
       "      <td>Tuscarora 途斯卡洛拉 中性 烈焰驰骋无缝头巾 PSU3083</td>\n",
       "      <td>1111,522,720</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>387785</th>\n",
       "      <td>387785</td>\n",
       "      <td>我们的故事:一百个北大荒老知青的人生形态</td>\n",
       "      <td>832,519,599</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>406231</th>\n",
       "      <td>406231</td>\n",
       "      <td>图读周易</td>\n",
       "      <td>832,723,724</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>199072</th>\n",
       "      <td>199072</td>\n",
       "      <td>Barbie 芭比 女童 运动休闲鞋 A22993</td>\n",
       "      <td>802,777,601</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>518528</th>\n",
       "      <td>518528</td>\n",
       "      <td>HiVi 惠威 多媒体音箱 D1080MKII 2.0声道 棕色</td>\n",
       "      <td>1057,439,1064</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>446621</th>\n",
       "      <td>446621</td>\n",
       "      <td>HALTI 男式 JUOVAJACKET 芬兰国家队系列 羽绒滑雪服 H0591922</td>\n",
       "      <td>1111,651,693</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>379960</th>\n",
       "      <td>379960</td>\n",
       "      <td>塑料回收再生术:百工百技</td>\n",
       "      <td>832,1096,509</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        productId                                         name         catIds\n",
       "331420     331420                           欧意金狐狸 女式 皮手套 QT602    802,143,996\n",
       "130945     130945                  YESO TOT 中性 单肩包/斜挎包 均码 9411   1111,864,781\n",
       "179886     179886                                   李斯特论柏辽兹与舒曼    832,552,337\n",
       "504123     504123          Tuscarora 途斯卡洛拉 中性 烈焰驰骋无缝头巾 PSU3083   1111,522,720\n",
       "387785     387785                         我们的故事:一百个北大荒老知青的人生形态    832,519,599\n",
       "406231     406231                                         图读周易    832,723,724\n",
       "199072     199072                    Barbie 芭比 女童 运动休闲鞋 A22993    802,777,601\n",
       "518528     518528             HiVi 惠威 多媒体音箱 D1080MKII 2.0声道 棕色  1057,439,1064\n",
       "446621     446621  HALTI 男式 JUOVAJACKET 芬兰国家队系列 羽绒滑雪服 H0591922   1111,651,693\n",
       "379960     379960                                 塑料回收再生术:百工百技   832,1096,509"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "products.sample(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. categories.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of categories: 1175\n"
     ]
    }
   ],
   "source": [
    "categories = pd.read_csv(path + 'categories.csv')\n",
    "\n",
    "print('Number of categories: %d' % categories.shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| catId | Category id (numbered consecutively from 0) |\n",
    "| category | Category name |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>catId</th>\n",
       "      <th>category</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>947</th>\n",
       "      <td>947</td>\n",
       "      <td>理发器</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>818</th>\n",
       "      <td>818</td>\n",
       "      <td>电脑硬件</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>212</th>\n",
       "      <td>212</td>\n",
       "      <td>帐篷</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>815</th>\n",
       "      <td>815</td>\n",
       "      <td>路由器/中继器</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>829</th>\n",
       "      <td>829</td>\n",
       "      <td>拉杆箱/包</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>391</th>\n",
       "      <td>391</td>\n",
       "      <td>女鞋</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>756</th>\n",
       "      <td>756</td>\n",
       "      <td>大型健身器械</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>11</td>\n",
       "      <td>其他运动器材</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>633</th>\n",
       "      <td>633</td>\n",
       "      <td>垂钓用品</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>115</th>\n",
       "      <td>115</td>\n",
       "      <td>卡通</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     catId category\n",
       "947    947      理发器\n",
       "818    818     电脑硬件\n",
       "212    212       帐篷\n",
       "815    815  路由器/中继器\n",
       "829    829    拉杆箱/包\n",
       "391    391       女鞋\n",
       "756    756   大型健身器械\n",
       "11      11   其他运动器材\n",
       "633    633     垂钓用品\n",
       "115    115       卡通"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "categories.sample(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. ratings.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of users: 1424596\n",
      "Number of ratings/reviews (total): 7202921\n",
      "\n"
     ]
    }
   ],
   "source": [
    "pd_ratings = pd.read_csv(path + 'ratings.csv')\n",
    "\n",
    "print('Number of users: %d' % pd_ratings.userId.unique().shape[0])\n",
    "print('Number of ratings/reviews (total): %d\\n' % pd_ratings.shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| userId | User id (numbered consecutively from 0) |\n",
    "| productId | Same as productId in products.csv |\n",
    "| rating | Rating, an integer in [1,5] |\n",
    "| timestamp | Rating timestamp (Unix time; the sampled values are consistent with seconds) |\n",
    "| title | Review title |\n",
    "| comment | Review text |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>productId</th>\n",
       "      <th>rating</th>\n",
       "      <th>timestamp</th>\n",
       "      <th>title</th>\n",
       "      <th>comment</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>4287636</th>\n",
       "      <td>230944.0</td>\n",
       "      <td>394505</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1393084800</td>\n",
       "      <td>赞!</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3940838</th>\n",
       "      <td>16628.0</td>\n",
       "      <td>84789</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1389715200</td>\n",
       "      <td>喜欢</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4064284</th>\n",
       "      <td>325829.0</td>\n",
       "      <td>94108</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1384531200</td>\n",
       "      <td>磨脚</td>\n",
       "      <td>右脚小脚趾磨掉一块皮</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4802616</th>\n",
       "      <td>586385.0</td>\n",
       "      <td>254002</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1383408000</td>\n",
       "      <td>哦~</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>292946</th>\n",
       "      <td>842028.0</td>\n",
       "      <td>231449</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1369324800</td>\n",
       "      <td>致我们终将逝去的青春</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2306551</th>\n",
       "      <td>933226.0</td>\n",
       "      <td>219015</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1341763200</td>\n",
       "      <td>有点大 不过很漂亮</td>\n",
       "      <td>外观很精致的说 就是外形有点偏大</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1707442</th>\n",
       "      <td>402851.0</td>\n",
       "      <td>228321</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1374076800</td>\n",
       "      <td>给宝宝讲讲挺好的,内容简单,便于宝宝理解。</td>\n",
       "      <td>给宝宝讲讲挺好的,内容简单,便于宝宝理解。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3641724</th>\n",
       "      <td>123473.0</td>\n",
       "      <td>515623</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1305475200</td>\n",
       "      <td>书很好,但居然没有包装!?!?!?</td>\n",
       "      <td>书很好,但居然没有包装!?!?!?这么好的书却没有包装!?!?!?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1921912</th>\n",
       "      <td>435946.0</td>\n",
       "      <td>63238</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1357228800</td>\n",
       "      <td>嗯</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1475151</th>\n",
       "      <td>1612.0</td>\n",
       "      <td>139044</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1316102400</td>\n",
       "      <td>一般</td>\n",
       "      <td>香味没有前面评价那么香,就是普通的爽肤水,有点黏黏的</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           userId  productId  rating   timestamp                  title  \\\n",
       "4287636  230944.0     394505     5.0  1393084800                     赞!   \n",
       "3940838   16628.0      84789     5.0  1389715200                     喜欢   \n",
       "4064284  325829.0      94108     3.0  1384531200                     磨脚   \n",
       "4802616  586385.0     254002     5.0  1383408000                     哦~   \n",
       "292946   842028.0     231449     5.0  1369324800             致我们终将逝去的青春   \n",
       "2306551  933226.0     219015     4.0  1341763200              有点大 不过很漂亮   \n",
       "1707442  402851.0     228321     5.0  1374076800  给宝宝讲讲挺好的,内容简单,便于宝宝理解。   \n",
       "3641724  123473.0     515623     4.0  1305475200      书很好,但居然没有包装!?!?!?   \n",
       "1921912  435946.0      63238     4.0  1357228800                      嗯   \n",
       "1475151    1612.0     139044     4.0  1316102400                     一般   \n",
       "\n",
       "                                   comment  \n",
       "4287636                                NaN  \n",
       "3940838                                NaN  \n",
       "4064284                         右脚小脚趾磨掉一块皮  \n",
       "4802616                                NaN  \n",
       "292946                                 NaN  \n",
       "2306551                   外观很精致的说 就是外形有点偏大  \n",
       "1707442              给宝宝讲讲挺好的,内容简单,便于宝宝理解。  \n",
       "3641724  书很好,但居然没有包装!?!?!?这么好的书却没有包装!?!?!?  \n",
       "1921912                                NaN  \n",
       "1475151         香味没有前面评价那么香,就是普通的爽肤水,有点黏黏的  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd_ratings.sample(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4. links.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "links = pd.read_csv(path + 'links.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| productId | Same as productId in products.csv and ratings.csv |\n",
    "| amazonId | Amazon product identifier |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>productId</th>\n",
       "      <th>amazonId</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>436251</th>\n",
       "      <td>436251</td>\n",
       "      <td>B00F91KYGK</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>194578</th>\n",
       "      <td>194578</td>\n",
       "      <td>B00GICSVUK</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>336998</th>\n",
       "      <td>336998</td>\n",
       "      <td>B00GMKUNBI</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>371924</th>\n",
       "      <td>371924</td>\n",
       "      <td>B008RIA4AS</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>433617</th>\n",
       "      <td>433617</td>\n",
       "      <td>B00332FJ7Q</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>236918</th>\n",
       "      <td>236918</td>\n",
       "      <td>060614479X</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>388158</th>\n",
       "      <td>388158</td>\n",
       "      <td>B008TI5V2C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>479855</th>\n",
       "      <td>479855</td>\n",
       "      <td>B002NSML6I</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>311842</th>\n",
       "      <td>311842</td>\n",
       "      <td>B001DTWV2C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>445227</th>\n",
       "      <td>445227</td>\n",
       "      <td>B0055PT83U</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>360465</th>\n",
       "      <td>360465</td>\n",
       "      <td>B005UTT2QY</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>258363</th>\n",
       "      <td>258363</td>\n",
       "      <td>0805092919</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>308642</th>\n",
       "      <td>308642</td>\n",
       "      <td>B0079WMXT8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>232740</th>\n",
       "      <td>232740</td>\n",
       "      <td>B0018HKRAW</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>335318</th>\n",
       "      <td>335318</td>\n",
       "      <td>B00840LWKU</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>497048</th>\n",
       "      <td>497048</td>\n",
       "      <td>B003ZI61RA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>388969</th>\n",
       "      <td>388969</td>\n",
       "      <td>B00BIUYL06</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10448</th>\n",
       "      <td>10448</td>\n",
       "      <td>B00GMZ9DKK</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75752</th>\n",
       "      <td>75752</td>\n",
       "      <td>B002R0DNB4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>392345</th>\n",
       "      <td>392345</td>\n",
       "      <td>B0041IY7CE</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        productId    amazonId\n",
       "436251     436251  B00F91KYGK\n",
       "194578     194578  B00GICSVUK\n",
       "336998     336998  B00GMKUNBI\n",
       "371924     371924  B008RIA4AS\n",
       "433617     433617  B00332FJ7Q\n",
       "236918     236918  060614479X\n",
       "388158     388158  B008TI5V2C\n",
       "479855     479855  B002NSML6I\n",
       "311842     311842  B001DTWV2C\n",
       "445227     445227  B0055PT83U\n",
       "360465     360465  B005UTT2QY\n",
       "258363     258363  0805092919\n",
       "308642     308642  B0079WMXT8\n",
       "232740     232740  B0018HKRAW\n",
       "335318     335318  B00840LWKU\n",
       "497048     497048  B003ZI61RA\n",
       "388969     388969  B00BIUYL06\n",
       "10448       10448  B00GMZ9DKK\n",
       "75752       75752  B002R0DNB4\n",
       "392345     392345  B0041IY7CE"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "links.sample(20)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
@@ -1,736 +0,0 @@
 {
 "cells": [
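  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# About yf_dianping\n",
    "0. **Download:** [Baidu Netdisk](https://pan.baidu.com/s/1yMNvHLl6QYsGbjT7u51Nfg)\n",
    "1. **Overview:** about 240,000 restaurants, 540,000 users, and 4.4 million reviews/ratings\n",
    "2. **Suggested experiments:** recommender systems; sentiment/opinion/review polarity analysis\n",
    "3. **Data source:** [Dianping](http://www.dianping.com/)\n",
    "4. **Original dataset:** [Dianping Review Dataset](http://yongfeng.me/dataset/), collected by Prof. Yongfeng Zhang for papers at WWW 2013, SIGIR 2013, and SIGIR 2014\n",
    "5. **Processing:**\n",
    "    1. Kept only the reviews, ratings, and related fields from the original dataset and dropped everything else\n",
    "    2. Reformatted to be compatible with [MovieLens](https://grouplens.org/datasets/movielens/)\n",
    "    3. Anonymized to protect user privacy"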
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
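  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = 'path/to/yf_dianping/'  # placeholder: folder holding the CSV files; keep the trailing slash, since file names are appended with +"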
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. restaurants.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of restaurants (with a name): 209132\n",
      "Number of restaurants (without a name): 34115\n",
      "Number of restaurants (total): 243247\n"
     ]
    }
   ],
   "source": [
    "restaurants = pd.read_csv(path + 'restaurants.csv')\n",
    "\n",
    "print('Number of restaurants (with a name): %d' % restaurants[~pd.isnull(restaurants.name)].shape[0])\n",
    "print('Number of restaurants (without a name): %d' % restaurants[pd.isnull(restaurants.name)].shape[0])\n",
    "print('Number of restaurants (total): %d' % restaurants.shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| restId | Restaurant id (numbered consecutively from 0) |\n",
    "| name | Restaurant name |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>restId</th>\n",
       "      <th>name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>210902</th>\n",
       "      <td>210902</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>124832</th>\n",
       "      <td>124832</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26766</th>\n",
       "      <td>26766</td>\n",
       "      <td>香锅制造(新苏天地店)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>91754</th>\n",
       "      <td>91754</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>204465</th>\n",
       "      <td>204465</td>\n",
       "      <td>西部牛扒城(湖塘店)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36475</th>\n",
       "      <td>36475</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>231861</th>\n",
       "      <td>231861</td>\n",
       "      <td>四季火锅</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>79816</th>\n",
       "      <td>79816</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>140694</th>\n",
       "      <td>140694</td>\n",
       "      <td>彝家牛汤锅</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>169641</th>\n",
       "      <td>169641</td>\n",
       "      <td>春秋</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33809</th>\n",
       "      <td>33809</td>\n",
       "      <td>九头鸟酒家(永定门店)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>236919</th>\n",
       "      <td>236919</td>\n",
       "      <td>老上海城隍庙小吃(人民大学店)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>182387</th>\n",
       "      <td>182387</td>\n",
       "      <td>河源三家村酒楼</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>140475</th>\n",
       "      <td>140475</td>\n",
       "      <td>荣记麻辣烫</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>194224</th>\n",
       "      <td>194224</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>152406</th>\n",
       "      <td>152406</td>\n",
       "      <td>鼎丰真(东四马路店)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11701</th>\n",
       "      <td>11701</td>\n",
       "      <td>南亚餐厅</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58805</th>\n",
       "      <td>58805</td>\n",
       "      <td>益丰坊(虎泉店)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15641</th>\n",
       "      <td>15641</td>\n",
       "      <td>万达艾美酒店大堂吧</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43424</th>\n",
       "      <td>43424</td>\n",
       "      <td>新美心绿姿生活</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        restId             name\n",
       "210902  210902              NaN\n",
       "124832  124832              NaN\n",
       "26766    26766      香锅制造(新苏天地店)\n",
       "91754    91754              NaN\n",
       "204465  204465       西部牛扒城(湖塘店)\n",
       "36475    36475              NaN\n",
       "231861  231861             四季火锅\n",
       "79816    79816              NaN\n",
       "140694  140694            彝家牛汤锅\n",
       "169641  169641               春秋\n",
       "33809    33809      九头鸟酒家(永定门店)\n",
       "236919  236919  老上海城隍庙小吃(人民大学店)\n",
       "182387  182387          河源三家村酒楼\n",
       "140475  140475            荣记麻辣烫\n",
       "194224  194224              NaN\n",
       "152406  152406       鼎丰真(东四马路店)\n",
       "11701    11701             南亚餐厅\n",
       "58805    58805         益丰坊(虎泉店)\n",
       "15641    15641        万达艾美酒店大堂吧\n",
       "43424    43424          新美心绿姿生活"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "restaurants.sample(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. ratings.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of users: 542706\n",
      "Number of ratings/reviews (total): 4422473\n",
      "\n",
      "Number of overall ratings (in [1,5]): 3293878\n",
      "Number of environment ratings (in [1,5]): 4076220\n",
      "Number of flavor ratings (in [1,5]): 4093819\n",
      "Number of service ratings (in [1,5]): 4076220\n",
      "Number of reviews: 4107409\n"
     ]
    }
   ],
   "source": [
    "pd_ratings = pd.read_csv(path + 'ratings.csv')\n",
    "\n",
    "print('Number of users: %d' % pd_ratings.userId.unique().shape[0])\n",
    "print('Number of ratings/reviews (total): %d\\n' % pd_ratings.shape[0])\n",
    "\n",
    "print('Number of overall ratings (in [1,5]): %d' % pd_ratings[(pd_ratings.rating>=1) & (pd_ratings.rating<=5)].shape[0])\n",
    "print('Number of environment ratings (in [1,5]): %d' % pd_ratings[(pd_ratings.rating_env>=1) & (pd_ratings.rating_env<=5)].shape[0])\n",
    "print('Number of flavor ratings (in [1,5]): %d' % pd_ratings[(pd_ratings.rating_flavor>=1) & (pd_ratings.rating_flavor<=5)].shape[0])\n",
    "print('Number of service ratings (in [1,5]): %d' % pd_ratings[(pd_ratings.rating_service>=1) & (pd_ratings.rating_service<=5)].shape[0])\n",
    "print('Number of reviews: %d' % pd_ratings[~pd_ratings.comment.isna()].shape[0])"
   ]
  },
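  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| userId | User id (numbered consecutively from 0) |\n",
    "| restId | Same as restId in restaurants.csv |\n",
    "| rating | Overall rating, an integer in [0,5] |\n",
    "| rating_env | Environment rating, an integer in [1,5] |\n",
    "| rating_flavor | Flavor rating, an integer in [1,5] |\n",
    "| rating_service | Service rating, an integer in [1,5] |\n",
    "| timestamp | Rating timestamp (Unix time; the sampled values are consistent with milliseconds) |\n",
    "| comment | Review text |"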
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>restId</th>\n",
       "      <th>rating</th>\n",
       "      <th>rating_env</th>\n",
       "      <th>rating_flavor</th>\n",
       "      <th>rating_service</th>\n",
       "      <th>timestamp</th>\n",
       "      <th>comment</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3331708</th>\n",
       "      <td>6802</td>\n",
       "      <td>183728</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1315673880000</td>\n",
       "      <td>环境不错，停车方便，交通也比较方便，东西齐全，应有尽有，吃、喝、玩、乐样样齐全，还有个五星级...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3332473</th>\n",
       "      <td>3106</td>\n",
       "      <td>183750</td>\n",
       "      <td>5.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1260155880000</td>\n",
       "      <td>去过两次，都是由日本朋友带着去的，很喜欢那种在小巷子深处的店，总觉得那样的店料理会很好吃。最...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>291609</th>\n",
       "      <td>39590</td>\n",
       "      <td>13570</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1324792500000</td>\n",
       "      <td>朋友请客，两个人中午去吃的，虽然不是节假日，但人还是非常的多，等了很长时间才上餐，价位偏高，...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>749582</th>\n",
       "      <td>59192</td>\n",
       "      <td>38519</td>\n",
       "      <td>4.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1321430760000</td>\n",
       "      <td>十一长假之前，我们的房子终于有了好消息，这个月底就可以拿到钥匙，真是不容易，盼星星盼月亮的，...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>719908</th>\n",
       "      <td>241643</td>\n",
       "      <td>36382</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1271862180000</td>\n",
       "      <td>很差的一家店！公司聚餐居然选在这里，真是个大大的失策！\\n点的菜迟迟不上，不知道是故意不上还...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3127953</th>\n",
       "      <td>12481</td>\n",
       "      <td>173459</td>\n",
       "      <td>4.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1300407540000</td>\n",
       "      <td>这家是离家最近的一家城市超市了，所以自然要进去随便逛逛啦。\\n因为附近是居民区，自然光顾的主...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2068253</th>\n",
       "      <td>13070</td>\n",
       "      <td>115853</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1308671820000</td>\n",
       "      <td>以前觉得还行，但有了85度之后就不行了。要了个提拉米苏，不行，太甜了。\\n辣松的味道倒不错，...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>640356</th>\n",
       "      <td>168006</td>\n",
       "      <td>33263</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1224868560000</td>\n",
       "      <td>算比较地道的川菜了 味道辣的很正 强力推荐 据说还是标点美食的... 香辣鸡翅每去必点~！不...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1222261</th>\n",
       "      <td>76280</td>\n",
       "      <td>65171</td>\n",
       "      <td>3.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1302136740000</td>\n",
       "      <td>为什么这么多人说好吃啊？为什么这么多人说肉多啊？难道是我人品有问题？\\n这个也是慕名而去的~...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>101366</th>\n",
       "      <td>67372</td>\n",
       "      <td>2853</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1283741400000</td>\n",
       "      <td>两年前经常去这家吃卤煮，感觉特别好吃，可是最近吃了一次，让我大失所望。。。\\n卤煮的汤和食材...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         userId  restId  rating  rating_env  rating_flavor  rating_service  \\\n",
       "3331708    6802  183728     3.0         3.0            4.0             3.0   \n",
       "3332473    3106  183750     5.0         4.0            4.0             4.0   \n",
       "291609    39590   13570     3.0         3.0            2.0             3.0   \n",
       "749582    59192   38519     4.0         2.0            3.0             2.0   \n",
       "719908   241643   36382     1.0         2.0            1.0             1.0   \n",
       "3127953   12481  173459     4.0         3.0            3.0             3.0   \n",
       "2068253   13070  115853     3.0         3.0            3.0             2.0   \n",
       "640356   168006   33263     NaN         3.0            5.0             3.0   \n",
       "1222261   76280   65171     3.0         2.0            2.0             2.0   \n",
       "101366    67372    2853     1.0         1.0            1.0             1.0   \n",
       "\n",
       "             timestamp                                            comment  \n",
       "3331708  1315673880000  环境不错，停车方便，交通也比较方便，东西齐全，应有尽有，吃、喝、玩、乐样样齐全，还有个五星级...  \n",
       "3332473  1260155880000  去过两次，都是由日本朋友带着去的，很喜欢那种在小巷子深处的店，总觉得那样的店料理会很好吃。最...  \n",
       "291609   1324792500000  朋友请客，两个人中午去吃的，虽然不是节假日，但人还是非常的多，等了很长时间才上餐，价位偏高，...  \n",
       "749582   1321430760000  十一长假之前，我们的房子终于有了好消息，这个月底就可以拿到钥匙，真是不容易，盼星星盼月亮的，...  \n",
       "719908   1271862180000  很差的一家店！公司聚餐居然选在这里，真是个大大的失策！\\n点的菜迟迟不上，不知道是故意不上还...  \n",
       "3127953  1300407540000  这家是离家最近的一家城市超市了，所以自然要进去随便逛逛啦。\\n因为附近是居民区，自然光顾的主...  \n",
       "2068253  1308671820000  以前觉得还行，但有了85度之后就不行了。要了个提拉米苏，不行，太甜了。\\n辣松的味道倒不错，...  \n",
       "640356   1224868560000  算比较地道的川菜了 味道辣的很正 强力推荐 据说还是标点美食的... 香辣鸡翅每去必点~！不...  \n",
       "1222261  1302136740000  为什么这么多人说好吃啊？为什么这么多人说肉多啊？难道是我人品有问题？\\n这个也是慕名而去的~...  \n",
       "101366   1283741400000  两年前经常去这家吃卤煮，感觉特别好吃，可是最近吃了一次，让我大失所望。。。\\n卤煮的汤和食材...  "
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd_ratings.sample(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. links.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [],
   "source": [
    "links = pd.read_csv(path + 'links.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Field descriptions\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| restId | Same as restId in restaurants.csv and ratings.csv |\n",
    "| dianpingId | Restaurant ID on Dianping (大众点评网) |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>restId</th>\n",
       "      <th>dianpingId</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>138492</th>\n",
       "      <td>138492</td>\n",
       "      <td>3566359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>158007</th>\n",
       "      <td>158007</td>\n",
       "      <td>2484433</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16170</th>\n",
       "      <td>16170</td>\n",
       "      <td>3651451</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>116637</th>\n",
       "      <td>116637</td>\n",
       "      <td>5143029</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>191554</th>\n",
       "      <td>191554</td>\n",
       "      <td>2734621</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>192481</th>\n",
       "      <td>192481</td>\n",
       "      <td>3000367</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40978</th>\n",
       "      <td>40978</td>\n",
       "      <td>3168181</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>196832</th>\n",
       "      <td>196832</td>\n",
       "      <td>3523291</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6048</th>\n",
       "      <td>6048</td>\n",
       "      <td>2435827</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>200405</th>\n",
       "      <td>200405</td>\n",
       "      <td>4130573</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>69792</th>\n",
       "      <td>69792</td>\n",
       "      <td>2853502</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>153075</th>\n",
       "      <td>153075</td>\n",
       "      <td>2000257</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8528</th>\n",
       "      <td>8528</td>\n",
       "      <td>2651221</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>196930</th>\n",
       "      <td>196930</td>\n",
       "      <td>3534673</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>224063</th>\n",
       "      <td>224063</td>\n",
       "      <td>3138160</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3434</th>\n",
       "      <td>3434</td>\n",
       "      <td>2185753</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>125490</th>\n",
       "      <td>125490</td>\n",
       "      <td>2112511</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>230533</th>\n",
       "      <td>230533</td>\n",
       "      <td>4122445</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>130597</th>\n",
       "      <td>130597</td>\n",
       "      <td>2632129</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>186956</th>\n",
       "      <td>186956</td>\n",
       "      <td>2233513</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        restId  dianpingId\n",
       "138492  138492     3566359\n",
       "158007  158007     2484433\n",
       "16170    16170     3651451\n",
       "116637  116637     5143029\n",
       "191554  191554     2734621\n",
       "192481  192481     3000367\n",
       "40978    40978     3168181\n",
       "196832  196832     3523291\n",
       "6048      6048     2435827\n",
       "200405  200405     4130573\n",
       "69792    69792     2853502\n",
       "153075  153075     2000257\n",
       "8528      8528     2651221\n",
       "196930  196930     3534673\n",
       "224063  224063     3138160\n",
       "3434      3434     2185753\n",
       "125490  125490     2112511\n",
       "230533  230533     4122445\n",
       "130597  130597     2632129\n",
       "186956  186956     2233513"
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "links.sample(20)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
@@ -1,59 +0,0 @@
 ![](../images/recruit/jd_header.png)
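 [Position] System Architect (AI Products)
 Xiamen / 5-10 years / Bachelor's / 15k-25k
 ---
 [Responsibilities]
 1. Design the architecture of the chatbot customer-service software system
 2. Lead the refactoring of the existing chatbot customer-service software system
 3. Define and optimize the custom-development process for the chatbot customer-service software to make custom development more efficient
 4. Set internal technical standards and optimize the full development, testing, and deployment workflow to raise R&D efficiency
 5. Train R&D and QA staff, guide their day-to-day work, and build a scientific, rigorous, and efficient engineering team
 [Requirements]
 1. Bachelor's degree or above in computer science, software engineering, automation, or a related field
 2. At least 5 years of experience in software development or system architecture
 3. Expert knowledge of microservice architecture, design patterns, and common data structures and algorithms
 4. Excellent logical thinking and system-abstraction skills, and strong engineering ability
 [Preferred]
 1. At least 1 year of software development or architecture experience on large, complex systems (especially AI or SaaS products)
 2. Basic knowledge of natural language processing (or machine learning, AI), and the ability to quickly understand each functional module of the chatbot customer-service software
 3. Familiarity with Python development
 [Career Path]
 1. Vertical growth: a well-defined job-grade system, combined with OKRs, helps you keep advancing on the way to becoming a technical expert
 2. Lateral growth: a flat organization and frequent technical sharing sessions give you exposure to cutting-edge AI product and engineering roles; if you wish, you may apply for a transfer after passing an assessment
 [About the Team]
 1. Serving 3 billion users worldwide with AI technology
 2. AI is a sunrise industry at the very center of the current boom; we look forward to having you on board
 3. Geek spirit, technology-driven: building technology with warmth to make the world better
 4. Team-level and department-level sharing sessions held from time to time every month, exploring the frontier of AI together
 [Benefits and Commute]
 1. Five-day work week (weekends off), balancing work and rest
 2. Half an hour of flexible working time each day; compensatory leave can be requested freely (as long as work is not affected)
 3. Convenient transport: the Lianhua Intersection (莲花路口) metro station is right downstairs
 4. City-center location, one bus stop from Lianban (莲坂) and Mingfa Shopping Plaza (明发商业广场), with plenty of places to eat
 5. Dining area, café, and massage chairs on site
 6. Afternoon tea and holiday gifts
 7. Assorted team-building activities
 8. After 3 or more years of service, employees with excellent performance and values may apply for stock-option awards
 ---
 [Contact]
 - Mr. Cai, jinhua@kuaishang.com.cn
 - Mr. Lan, lanzl@kuaishang.com.cn, 180-3025-1206
 - Ms. Ye, yeyp@kuaishang.com.cn, 0592-5380356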
@@ -1,50 +0,0 @@
 ![](../images/recruit/jd_header.png)
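 [Position] Natural Language Processing Algorithm Engineer
 Xiamen / 3-5 years / Master's / 15k-25k
 ---
 [Responsibilities]
 1. Take part in designing and developing the core modules of the automated-marketing chatbot customer-service software to meet customer needs
 2. Focus on selected research and application areas and key technologies of dialogue systems, conduct in-depth research, and maintain a technical lead
 [Requirements]
 1. Master's degree or above
 2. At least 3 years of research or development experience in dialogue or question-answering systems
 3. A core technical contributor on dialogue or question-answering systems who understands the design and structure of each module and has in-depth command of several modules or key technologies
 4. Strong engineering skills: able to quickly implement novel technical ideas, with well-kept code and documentation standards
 5. Excellent English reading comprehension, including the ability to read and understand English technical literature
 [Preferred]
 1. Candidates with a track record of successful dialogue or question-answering product development are preferred
 [About the Team]
 1. Serving 3 billion users worldwide with AI technology
 2. Focused on automated-marketing chatbots for industry verticals, with strong customer demand and great product prospects
 3. Geek spirit, technology-driven: building technology with warmth to make the world better
 4. Team-level and department-level sharing sessions held from time to time every month, with a high-energy team atmosphere
 [Benefits and Commute]
 1. Five-day work week (weekends off), balancing work and rest
 2. Half an hour of flexible working time each day; compensatory leave can be requested freely (as long as work is not affected)
 3. Convenient transport: the Lianhua Intersection (莲花路口) metro station is right downstairs
 4. City-center location, one bus stop from Lianban (莲坂) and Mingfa Shopping Plaza (明发商业广场), with plenty of places to eat
 5. Dining area, café, and massage chairs on site
 6. Afternoon tea and holiday gifts
 7. Assorted team-building activities
 8. After 3 or more years of service, employees with excellent performance and values may apply for stock-option awards
 ---
 [Contact]
 - Mr. Cai, jinhua@kuaishang.com.cn
 - Mr. Lan, lanzl@kuaishang.com.cn, 180-3025-1206
 - Ms. Ye, yeyp@kuaishang.com.cn, 0592-5380356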
@@ -1,51 +0,0 @@
 ![](../images/recruit/jd_header.png)
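 [Position] Applied Researcher, Natural-Language Human-Computer Interaction
 Xiamen / 5-10 years / Master's / 20k-35k
 ---
 [Responsibilities]
 1. Lead the design and organize the development of automated-marketing chatbot customer-service software for industry verticals
 2. Define and optimize the custom-development process for the chatbot customer-service software, significantly improving custom-development efficiency
 3. Track trends in cutting-edge technology and help raise the team's overall technical level
 [Requirements]
 1. Master's degree or above
 2. At least 5 years of research or development experience in dialogue or question-answering systems
 3. A core technical contributor on dialogue or question-answering systems who is familiar with the design and structure of each module and is especially skilled in developing dialogue flow management and control (the central control system)
 4. Strong engineering skills: able to quickly implement novel technical ideas, with well-kept code and documentation standards
 5. Excellent ability to read English literature, keeping up with trends in cutting-edge technology at all times
 [Preferred]
 1. Candidates with a track record of successful dialogue or question-answering product development are preferred
 [About the Team]
 1. Serving 3 billion users worldwide with AI technology
 2. Focused on automated-marketing chatbots for industry verticals, with strong customer demand and great product prospects
 3. Geek spirit, technology-driven: building technology with warmth to make the world better
 4. Team-level and department-level sharing sessions held from time to time every month, with a high-energy team atmosphere
 [Benefits and Commute]
 1. Five-day work week (weekends off), balancing work and rest
 2. Half an hour of flexible working time each day; compensatory leave can be requested freely (as long as work is not affected)
 3. Convenient transport: the Lianhua Intersection (莲花路口) metro station is right downstairs
 4. City-center location, one bus stop from Lianban (莲坂) and Mingfa Shopping Plaza (明发商业广场), with plenty of places to eat
 5. Dining area, café, and massage chairs on site
 6. Afternoon tea and holiday gifts
 7. Assorted team-building activities
 8. After 3 or more years of service, employees with excellent performance and values may apply for stock-option awards
 ---
 [Contact]
 - Mr. Cai, jinhua@kuaishang.com.cn
 - Mr. Lan, lanzl@kuaishang.com.cn, 180-3025-1206
 - Ms. Ye, yeyp@kuaishang.com.cn, 0592-5380356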
@@ -1,6 +0,0 @@
 # Initial Ruff Linting
 70d7725e5c89bccfe7d4e5a3ccd87e05c642d74b
 # Change line-length and ruff format
 39bbfdb8298b5faa32e4bc052080d240f6140bea
 # pre-commit hooks and ruff
 6ed123ecc4aec9da26bd48748df670cd5b42b3cd
@@ -1 +0,0 @@
 *.ipynb linguist-documentation
@@ -1,45 +0,0 @@
 name: "\U0001F41B Bug Report"
 description: Report your bug here.
 labels: ["bug"]
 body:
  - type: markdown
    attributes:
      value: |
        Thanks for taking the time to fill out this bug report! Any information you can provide about your system and the issue you encountered will help to resolve it faster.
  - type: checkboxes
    attributes:
      label: Have you searched existing issues?  🔎
      description: Please search to see if an [issue](https://github.com/MaartenGr/BERTopic/issues) already exists for the issue you encountered.
      options:
        - label: I have searched and found no existing issues
          required: true
  - type: textarea
    id: describe_the_bug
    attributes:
      label: Describe the bug
      description: Please provide a concise description of the bug. If there is an error, make sure to provide the **full** error log.
      placeholder: Describe the bug
    validations:
      required: true
  - type: textarea
    id: reproduction
    attributes:
      label: Reproduction
      description: Please provide a minimal example, with code, that can be run to reproduce the issue.
      value: |
        ```python
        from bertopic import BERTopic
        ```
  - type: input
    id: bertopic_version
    attributes:
      label: BERTopic Version
      description: What version of BERTopic are you using?
    validations:
      required: true
@@ -1,8 +0,0 @@
 blank_issues_enabled: true
 contact_links:
  - name: 💡 General questions
    url: https://github.com/MaartenGr/BERTopic/discussions
    about: Ask a question there!
  - name: Want to contribute?
    url: https://github.com/MaartenGr/BERTopic/blob/master/CONTRIBUTING.md
    about: Head to the contributing guidelines
@@ -1,30 +0,0 @@
 name: "\U0001F680 Feature request"
 description: Submit a proposal/request for a new BERTopic feature
 labels: ["Feature request"]
 body:
  - type: textarea
    id: feature-request
    validations:
      required: true
    attributes:
      label: Feature request
      description: |
        A clear and concise description of the feature proposal.
  - type: textarea
    id: motivation
    validations:
      required: true
    attributes:
      label: Motivation
      description: |
        Please outline the motivation for the proposal. If this is related to another GitHub issue, please link here too.
  - type: textarea
    id: contribution
    validations:
      required: true
    attributes:
      label: Your contribution
      description: |
        Any help on the implementation of this feature would be greatly appreciated. If you are interested in working on this, make sure to read the [CONTRIBUTING.MD guide](https://github.com/MaartenGr/BERTopic/blob/master/CONTRIBUTING.md)
@@ -1,17 +0,0 @@
 # What does this PR do?
 <!--
 Thank you for considering creating a PR! Before you do, make sure to read through the [contributor guideline](https://github.com/MaartenGr/BERTopic/blob/master/CONTRIBUTING.md).
 -->
 <!-- Remove if not applicable -->
 Fixes # (issue)
 ## Before submitting
 - [ ] This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
 - [ ] Did you read the [contributor guideline](https://github.com/MaartenGr/BERTopic/blob/master/CONTRIBUTING.md)?
 - [ ] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
 - [ ] Did you make sure to update the documentation with your changes (if applicable)?
 - [ ] Did you write any new necessary tests?
@@ -1,39 +0,0 @@
 name: Code Checks
 on:
  push:
    branches:
    - master
    - dev
  pull_request:
    branches:
    - master
    - dev
 jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v5
    # Ref: https://github.com/pre-commit/action
    - uses: pre-commit/action@v3.0.1
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12", "3.13"]
    steps:
    - uses: actions/checkout@v4
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v5
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -e ".[test]"
    - name: Run Checking Mechanisms
      run: make check
@@ -1,88 +0,0 @@
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]
 *$py.class
 # C extensions
 *.so
 # Distribution / packaging
 .Python
 build/
 develop-eggs/
 dist/
 downloads/
 eggs/
 .eggs/
 lib/
 lib64/
 parts/
 sdist/
 var/
 wheels/
 pip-wheel-metadata/
 share/python-wheels/
 *.egg-info/
 .installed.cfg
 *.egg
 MANIFEST
 model_dir
 model_dir/
 test
 # PyInstaller
 #  Usually these files are written by a python script from a template
 #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 *.manifest
 *.spec
 # Installer logs
 pip-log.txt
 pip-delete-this-directory.txt
 # Unit test / coverage reports
 htmlcov/
 .tox/
 .nox/
 .coverage
 .coverage.*
 .cache
 nosetests.xml
 coverage.xml
 *.cover
 .hypothesis/
 .pytest_cache/
 # Sphinx documentation
 docs/_build/
 # Jupyter Notebook
 .ipynb_checkpoints
 notebooks/
 # IPython
 profile_default/
 ipython_config.py
 # pyenv
 .python-version
 # Environments
 .env
 .venv
 env/
 venv/
 ENV/
 env.bak/
 venv.bak/
 *.lock
 # Artifacts
 .idea
 .idea/
 .vscode
 .DS_Store
 # mkdocs
 site/
@@ -1,20 +0,0 @@
 repos:
 -   repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
    -   id: trailing-whitespace
        exclude: |
            (?x)^(
                README.md|
                docs/.*
            )$
    -   id: end-of-file-fixer
        exclude_types: [html, svg]
    -   id: check-yaml
    -   id: check-added-large-files
 -   repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.9.9
    hooks:
    -   id: ruff
        args: [--fix, --show-fixes, --exit-non-zero-on-fix]
    -   id: ruff-format
@@ -1,64 +0,0 @@
 # Contributing to BERTopic
 Hi! Thank you for considering contributing to BERTopic. With the modular nature of BERTopic, many new add-ons, backends, representation models, sub-models, and LLMs can quickly be added to keep up with this incredibly fast-paced field.
 Whether contributions are new features, better documentation, bug fixes, or improvement on the repository itself, anything is appreciated!
 ## 📚 Guidelines
 ### 🤖 Contributing Code
 To contribute to this project, we follow an `issue -> pull request` approach for main features and bug fixes. This means that any new feature, bug fix, or anything else that touches on code directly needs to start from an issue first. That way, the main discussion about what needs to be added/fixed can be done in the issue before creating a pull request. This makes sure that we are on the same page before you start coding your pull request. If you start working on an issue, please assign it to yourself but do so after there is an agreement with the maintainer, [@MaartenGr](https://github.com/MaartenGr).
 When there is agreement on the assigned approach, a pull request can be created in which the fix/feature can be added. This follows a  ["fork and pull request"](https://docs.github.com/en/get-started/quickstart/contributing-to-projects) workflow.
 Please do not try to push directly to this repo unless you are a maintainer.
 There are exceptions to the `issue -> pull request` approach that are typically small changes that do not need agreements, such as:
 * Documentation
 * Spelling/grammar issues
 * Docstrings
 * etc.
 There is a large focus on documentation in this repository, so please make sure to add extensive descriptions of features when creating the pull request.
 Note that the main focus of pull requests and code should be:
 * Easy readability
 * Clear communication
 * Sufficient documentation
 ## 🚀 Quick Start
 To start contributing, make sure to first start from a fresh environment. Using an environment manager, such as `conda` or `pyenv`, helps make sure that your code is reproducible and tracks the versions you have in your environment.
 If you are using conda, you can approach it as follows:
 1. Create and activate a new conda environment (e.g., `conda create -n bertopic python=3.9`)
 2. Install requirements (e.g., `pip install .[dev]`)
  * This makes sure to also install documentation and testing packages
 3. (Optional) Run `make docs` to build your documentation
 4. (Optional) Run `make test` to run the unit tests and `make coverage` to check the coverage of unit tests
 ❗Note: Unit testing the package can take quite some time since it needs to run several variants of the BERTopic pipeline.
 ## 🧹 Linting and Formatting
 We use [Ruff](https://docs.astral.sh/ruff/) to ensure code is uniformly formatted and to avoid common mistakes and bad practices.
 * To automatically re-format code, run `make format`
 * To check for linting issues, run `make lint` - some issues may be automatically fixed, some will not be
 When a pull request is made, the CI will automatically check for linting and formatting issues. However, it will not automatically apply any fixes, so it is easiest to run locally.
 If you believe an error is incorrectly flagged, use a [`# noqa:` comment to suppress](https://docs.astral.sh/ruff/linter/#error-suppression), but this is discouraged unless strictly necessary.
 ## 🤓 Collaborative Efforts
 When you run into any issue with the above or need help getting started with a pull request, feel free to reach out in the issues! As with all repositories, this one has its particularities as a result of the maintainer's view. Each repository is quite different, and so are its processes.
 ## 🏆 Recognition
 If your contribution has made its way into a new release of BERTopic, you will be given credit in the changelog of the new release! Regardless of the size of the contribution, any help is greatly appreciated.
 ## 🎈 Release
 BERTopic tries to mostly follow [semantic versioning](https://semver.org/) for its new releases. Even though BERTopic has been around for a few years now, it is still pre-1.0 software. Given the rapid changes in the field, this versioning is deliberate, as a way to keep up. Backwards compatibility is taken into account, but integrating new features, and thereby keeping up with the field, takes priority. Especially since BERTopic focuses on modularity, flexibility is necessary.
@@ -1,21 +0,0 @@
 MIT License
 Copyright (c) 2024, Maarten P. Grootendorst
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
 in the Software without restriction, including without limitation the rights
 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 copies of the Software, and to permit persons to whom the Software is
 furnished to do so, subject to the following conditions:
 The above copyright notice and this permission notice shall be included in all
 copies or substantial portions of the Software.
 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
@@ -1,29 +0,0 @@
 test:
 	pytest
 coverage:
 	pytest --cov
 format:
 	ruff format
 lint:
 	ruff check --fix
 install:
 	python -m pip install -e .
 install-test:
 	python -m pip install -e ".[dev]"
 docs:
 	mkdocs serve
 pypi:
 	python -m build
 	twine upload dist/*
 clean:
 	rm -rf **/.ipynb_checkpoints **/.pytest_cache **/__pycache__ **/**/__pycache__ .ipynb_checkpoints .pytest_cache
 check: test clean
@@ -1,309 +0,0 @@
 [![PyPI Downloads](https://static.pepy.tech/badge/bertopic)](https://pepy.tech/projects/bertopic)
 [![PyPI - Python](https://img.shields.io/badge/python-v3.9+-blue.svg)](https://pypi.org/project/bertopic/)
 [![Build](https://img.shields.io/github/actions/workflow/status/MaartenGr/BERTopic/testing.yml?branch=master)](https://github.com/MaartenGr/BERTopic/actions)
 [![docs](https://img.shields.io/badge/docs-Passing-green.svg)](https://maartengr.github.io/BERTopic/)
 [![PyPI - PyPi](https://img.shields.io/pypi/v/BERTopic)](https://pypi.org/project/bertopic/)
 [![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/VLAC/blob/master/LICENSE)
 [![arXiv](https://img.shields.io/badge/arXiv-2203.05794-<COLOR>.svg)](https://arxiv.org/abs/2203.05794)
 # BERTopic
 <img src="images/logo.png" width="35%" align="right" /> 
 BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters
 allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
 BERTopic supports all kinds of topic modeling techniques:  
 <table>
  <tr>
    <td><a href="https://maartengr.github.io/BERTopic/getting_started/guided/guided.html">Guided</a></td>
    <td><a href="https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html">Supervised</a></td>
    <td><a href="https://maartengr.github.io/BERTopic/getting_started/semisupervised/semisupervised.html">Semi-supervised</a></td>
 </tr>
   <tr>
    <td><a href="https://maartengr.github.io/BERTopic/getting_started/manual/manual.html">Manual</a></td>
    <td><a href="https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html">Multi-topic distributions</a></td>
    <td><a href="https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html">Hierarchical</a></td>
 </tr>
 <tr>
    <td><a href="https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html">Class-based</a></td>
    <td><a href="https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html">Dynamic</a></td>
    <td><a href="https://maartengr.github.io/BERTopic/getting_started/online/online.html">Online/Incremental</a></td>
 </tr>
 <tr>
    <td><a href="https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html">Multimodal</a></td>
    <td><a href="https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html">Multi-aspect</a></td>
    <td><a href="https://maartengr.github.io/BERTopic/getting_started/representation/llm.html">Text Generation/LLM</a></td>
 </tr>
 <tr>
    <td><a href="https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html">Zero-shot <b>(new!)</b></a></td>
    <td><a href="https://maartengr.github.io/BERTopic/getting_started/merge/merge.html">Merge Models <b>(new!)</b></a></td>
    <td><a href="https://maartengr.github.io/BERTopic/getting_started/seed_words/seed_words.html">Seed Words <b>(new!)</b></a></td>
 </tr>
 </table>
 Corresponding medium posts can be found [here](https://medium.com/data-science/topic-modeling-with-bert-779f7db187e6?sk=0b5a470c006d1842ad4c8a3057063a99
 ), [here](https://medium.com/data-science/using-whisper-and-bertopic-to-model-kurzgesagts-videos-7d8a63139bdf?sk=b1e0fd46f70cb15e8422b4794a81161d
 ) and [here](https://medium.com/data-science/interactive-topic-modeling-with-bertopic-1ea55e7d73d8?sk=03c2168e9e74b6bda2a1f3ed953427e4
 ). For a more detailed overview, you can read the [paper](https://arxiv.org/abs/2203.05794) or see a [brief overview](https://maartengr.github.io/BERTopic/algorithm/algorithm.html). 
 ## Installation
 Installation, with sentence-transformers, can be done using [uv](https://docs.astral.sh/uv/):
 ```bash
 uv add bertopic
 ```
 or with [pip](https://github.com/pypa/pip):
 ```bash
 pip install bertopic
 ```
 If you want to install BERTopic with other embedding models, you can choose one of the following:
 ```bash
 # Choose an embedding backend
 pip install bertopic[flair,gensim,spacy,use]
 # Topic modeling with images
 pip install bertopic[vision]
 ```
 For a *light-weight installation* without transformers, UMAP and/or HDBSCAN (for training with Model2Vec or inference), see [this tutorial](https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html#lightweight-installation).
 ## Getting Started
 For an in-depth overview of the features of BERTopic 
 you can check the [**full documentation**](https://maartengr.github.io/BERTopic/) or you can follow along 
 with one of the examples below:
 | Name  | Link  |
 |---|---|
 | Start Here - **Best Practices in BERTopic**  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1BoQ_vakEVtojsd2x_U6-_x52OOuqruj2?usp=sharing)  |
 | **🆕 New!** - Topic Modeling on Large Data (GPU Acceleration)  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1W7aEdDPxC29jP99GGZphUlqjMFFVKtBC?usp=sharing)  |
 | **🆕 New!** - Topic Modeling with Llama 2 🦙 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QCERSMUjqGetGGujdrvv_6_EeoIcd_9M?usp=sharing)  |
 | **🆕 New!** - Topic Modeling with Quantized LLMs | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1DdSHvVPJA3rmNfBWjCo2P1E9686xfxFx?usp=sharing)  |
 | Topic Modeling with BERTopic  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing)  |
 | (Custom) Embedding Models in BERTopic  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/18arPPe50szvcCp_Y6xS56H2tY0m-RLqv?usp=sharing) |
 | Advanced Customization in BERTopic  |  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ClTYut039t-LDtlcd-oQAdXWgcsSGTw9?usp=sharing) |
 | (semi-)Supervised Topic Modeling with BERTopic  |  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bxizKzv5vfxJEB29sntU__ZC7PBSIPaQ?usp=sharing)  |
 | Dynamic Topic Modeling with Trump's Tweets  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1un8ooI-7ZNlRoK0maVkYhmNRl0XGK88f?usp=sharing)  |
 | Topic Modeling arXiv Abstracts | [![Kaggle](https://img.shields.io/static/v1?style=for-the-badge&message=Kaggle&color=222222&logo=Kaggle&logoColor=20BEFF&label=)](https://www.kaggle.com/maartengr/topic-modeling-arxiv-abstract-with-bertopic) |
 ## Quick Start
 We start by extracting topics from the well-known 20 newsgroups dataset containing English documents:
 ```python
 from bertopic import BERTopic
 from sklearn.datasets import fetch_20newsgroups
 docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
 topic_model = BERTopic()
 topics, probs = topic_model.fit_transform(docs)
 ```
 After generating topics and their probabilities, we can access all of the topics together with their topic representations:
 ```python
 >>> topic_model.get_topic_info()
 Topic	Count	Name
 -1	4630	-1_can_your_will_any
 0	693	49_windows_drive_dos_file
 1	466	32_jesus_bible_christian_faith
 2	441	2_space_launch_orbit_lunar
 3	381	22_key_encryption_keys_encrypted
 ...
 ```
 The `-1` topic refers to all outlier documents and is typically ignored. Each word in a topic describes the underlying theme of that topic and can be used 
 for interpreting that topic. Next, let's take a look at the most frequent topic that was generated:
 ```python
 >>> topic_model.get_topic(0)
 [('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]
 ```  
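Because `-1` marks outliers, downstream analysis often starts by setting those documents aside. A minimal sketch in plain Python, assuming `topics` is the per-document list of topic ids returned by `fit_transform`; `drop_outliers` is a hypothetical helper written for illustration, not part of BERTopic:

```python
def drop_outliers(docs, topics, outlier_label=-1):
    """Return (docs, topics) without the documents assigned to the outlier topic.

    `drop_outliers` and `outlier_label` are illustrative names (assumptions),
    not BERTopic API.
    """
    # Pair every document with its topic id and keep only non-outliers.
    kept = [(doc, topic) for doc, topic in zip(docs, topics) if topic != outlier_label]
    kept_docs = [doc for doc, _ in kept]
    kept_topics = [topic for _, topic in kept]
    return kept_docs, kept_topics


# Toy data standing in for real documents and their assigned topics.
example_docs = ["disk drive question", "random chatter", "lunar orbit news"]
example_topics = [0, -1, 2]
clean_docs, clean_topics = drop_outliers(example_docs, example_topics)
```

Filtering on the label rather than on position keeps documents and topic ids aligned, which matters if the filtered lists are later passed on together.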
 Using `.get_document_info`, we can also extract information at the document level, such as each document's topic, its probability, whether it is a representative document for a topic, etc.:
 ```python
 >>> topic_model.get_document_info(docs)
 Document                               Topic	Name	                        Top_n_words                     Probability    ...
 I am sure some bashers of Pens...	0	0_game_team_games_season	game - team - games...	        0.200010       ...
 My brother is in the market for...      -1     -1_can_your_will_any	        can - your - will...	        0.420668       ...
 Finally you said what you dream...	-1     -1_can_your_will_any	        can - your - will...            0.807259       ...
 Think! It's the SCSI card doing...	49     49_windows_drive_dos_file	windows - drive - docs...	0.071746       ...
 1) I have an old Jasmine drive...	49     49_windows_drive_dos_file	windows - drive - docs...	0.038983       ...
 ```
 **`🔥 Tip`**: Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages. 
 ## Fine-tune Topic Representations
 In BERTopic, there are a number of different [topic representations](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html) that we can choose from. They are all quite different from one another and give interesting perspectives and variations of topic representations. A great start is `KeyBERTInspired`, which for many users increases the coherence of the resulting topic representations and reduces the stopwords in them:
 ```python
 from bertopic.representation import KeyBERTInspired
 # Fine-tune your topic representations
 representation_model = KeyBERTInspired()
 topic_model = BERTopic(representation_model=representation_model)
 ```
 However, you might want to use something more powerful to describe your clusters. You can even use ChatGPT or other models from OpenAI to generate labels, summaries, phrases, keywords, and more:
 ```python
 import openai
 from bertopic.representation import OpenAI
 # Fine-tune topic representations with GPT
 client = openai.OpenAI(api_key="sk-...")
 representation_model = OpenAI(client, model="gpt-4o-mini", chat=True)
 topic_model = BERTopic(representation_model=representation_model)
 ```
 **`🔥 Tip`**: Instead of iterating over all of these different topic representations, you can model them simultaneously with [multi-aspect topic representations](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) in BERTopic. 
 ## Visualizations
 After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good 
 understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can use one of the [many visualization options](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html) in BERTopic. 
 For example, we can visualize the topics that were generated in a way very similar to 
 [LDAvis](https://github.com/cpsievert/LDAvis):
 ```python
 topic_model.visualize_topics()
 ``` 
 <img src="images/topic_visualization.gif" width="80%" align="center" />
 ## Modularity
 By default, the [main steps](https://maartengr.github.io/BERTopic/algorithm/algorithm.html) for topic modeling with BERTopic are sentence-transformers, UMAP, HDBSCAN, and c-TF-IDF, run in sequence. Because these steps are largely independent of one another, BERTopic is quite modular. In other words, BERTopic not only lets you build your own topic model but also lets you explore several topic modeling techniques on top of your customized topic model:
 https://user-images.githubusercontent.com/25746895/218420473-4b2bb539-9dbe-407a-9674-a8317c7fb3bf.mp4
 You can swap out any of these models or even remove them entirely. The following steps are completely modular:
 1. [Embedding](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html) documents
 2. [Reducing dimensionality](https://maartengr.github.io/BERTopic/getting_started/dim_reduction/dim_reduction.html) of embeddings
 3. [Clustering](https://maartengr.github.io/BERTopic/getting_started/clustering/clustering.html) reduced embeddings into topics
 4. [Tokenization](https://maartengr.github.io/BERTopic/getting_started/vectorizers/vectorizers.html) of topics
 5. [Weight](https://maartengr.github.io/BERTopic/getting_started/ctfidf/ctfidf.html) tokens
 6. [Represent topics](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html) with one or [multiple](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) representations
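The six steps above can be pictured as a chain of independent callables, each of which can be swapped without touching the others. The following is a toy, standard-library-only sketch of that idea. The function names (`embed`, `reduce_dims`, `cluster`, `tokenize`, `represent`, `run_pipeline`) and their deliberately trivial logic are hypothetical stand-ins; they are not BERTopic's API and do not reproduce its algorithms.

```python
from collections import Counter
from typing import Dict, List

Vector = List[float]


def embed(docs: List[str]) -> List[Vector]:
    """Stand-in for a sentence encoder: two keyword counts per document."""
    return [[float(d.lower().count("game")), float(d.lower().count("drive"))] for d in docs]


def reduce_dims(vectors: List[Vector]) -> List[Vector]:
    """Stand-in for dimensionality reduction: the toy vectors are already 2-D."""
    return vectors


def cluster(vectors: List[Vector]) -> List[int]:
    """Stand-in for clustering: label each document by its largest coordinate."""
    return [max(range(len(v)), key=v.__getitem__) for v in vectors]


def tokenize(doc: str) -> List[str]:
    """Stand-in for a vectorizer: lowercase whitespace tokens."""
    return doc.lower().split()


def represent(docs, labels, tokenizer, n=2) -> Dict[int, List[str]]:
    """Stand-in for weighting and representation: most frequent tokens per topic."""
    counts: Dict[int, Counter] = {}
    for doc, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(tokenizer(doc))
    return {label: [word for word, _ in c.most_common(n)] for label, c in counts.items()}


def run_pipeline(docs, embedder=embed, reducer=reduce_dims, clusterer=cluster, tokenizer=tokenize):
    """Run the steps in sequence; every step is a replaceable argument."""
    labels = clusterer(reducer(embedder(docs)))
    return labels, represent(docs, labels, tokenizer)


docs = [
    "the game was a great game",
    "my drive failed",
    "team won the game",
    "new drive installed",
]
labels, topics = run_pipeline(docs)

# Swap only the clustering step; every other step is reused unchanged.
single_labels, single_topics = run_pipeline(docs, clusterer=lambda vectors: [0] * len(vectors))
```

Replacing `clusterer` changes the topic assignments while the embedding, tokenization, and representation steps stay exactly as they were, which is the property the list above describes.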
 ## Functionality
 BERTopic has many functions, which can quickly become overwhelming. To help with this, the tables below give an overview
 of all methods together with a short description of their purpose.
 ### Common
 Below, you will find an overview of common functions in BERTopic. 
 | Method | Code  | 
 |-----------------------|---|
 | Fit the model    |  `.fit(docs)` |
 | Fit the model and predict documents  |  `.fit_transform(docs)` |
 | Predict new documents    |  `.transform([new_doc])` |
 | Access single topic   | `.get_topic(topic=12)`  |   
 | Access all topics     |  `.get_topics()` |
 | Get topic freq    |  `.get_topic_freq()` |
 | Get all topic information|  `.get_topic_info()` |
 | Get all document information|  `.get_document_info(docs)` |
 | Get representative docs per topic |  `.get_representative_docs()` |
 | Update topic representation | `.update_topics(docs, n_gram_range=(1, 3))` |
 | Generate topic labels | `.generate_topic_labels()` |
 | Set topic labels | `.set_topic_labels(my_custom_labels)` |
 | Merge topics | `.merge_topics(docs, topics_to_merge)` |
 | Reduce nr of topics | `.reduce_topics(docs, nr_topics=30)` |
 | Reduce outliers | `.reduce_outliers(docs, topics)` |
 | Find topics | `.find_topics("vehicle")` |
 | Save model    |  `.save("my_model", serialization="safetensors")` |
 | Load model    |  `BERTopic.load("my_model")` |
 | Get parameters |  `.get_params()` |
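Of the methods above, `.find_topics` is perhaps the least self-explanatory: it ranks topics by how close their embeddings are to the embedding of a search term. Below is a minimal sketch of that ranking idea, using cosine similarity over hand-made vectors. The vectors, the topic ids, and the helper `rank_topics` are hypothetical and stand in for real embeddings, so treat this as an illustration rather than the library's implementation.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_topics(
    query: List[float], topic_embeddings: Dict[int, List[float]], top_n: int = 2
) -> List[Tuple[int, float]]:
    """Return the top_n (topic id, similarity) pairs, most similar first."""
    scored = [(topic, cosine_similarity(query, emb)) for topic, emb in topic_embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical 3-D topic embeddings and a pretend embedding of the search term.
topic_embeddings = {0: [0.9, 0.1, 0.0], 1: [0.1, 0.8, 0.1], 2: [0.0, 0.2, 0.9]}
query_embedding = [0.8, 0.2, 0.0]

ranked = rank_topics(query_embedding, topic_embeddings)
```

Here `ranked` lists topic 0 first, because its vector points in nearly the same direction as the query.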
 ### Attributes
 After having trained your BERTopic model, several attributes are saved within your model. These attributes, in part, 
 refer to how model information is stored on an estimator during fitting. The attributes that you see below all end in `_` and are 
 public attributes that can be used to access model information. 
 | Attribute | Description |
 |------------------------|---------------------------------------------------------------------------------------------|
 | `.topics_`               | The topics that are generated for each document after training or updating the topic model. |
 | `.probabilities_` | The probabilities that are generated for each document if HDBSCAN is used. |
 | `.topic_sizes_`           | The size of each topic                                                                      |
 | `.topic_mapper_`          | A class for tracking topics and their mappings anytime they are merged/reduced.             |
 | `.topic_representations_` | The top *n* terms per topic and their respective c-TF-IDF values.                           |
 | `.c_tf_idf_`              | The topic-term matrix as calculated through c-TF-IDF.                                       |
 | `.topic_aspects_`          | The different aspects, or representations, of each topic.                                  |
 | `.topic_labels_`          | The default labels for each topic.                                                          |
 | `.custom_labels_`         | Custom labels for each topic as generated through `.set_topic_labels`.                      |
 | `.topic_embeddings_`      | The embeddings for each topic if `embedding_model` was used.                                |
 | `.representative_docs_`   | The representative documents for each topic if HDBSCAN is used.                             |
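The `.c_tf_idf_` attribute holds the class-based TF-IDF weights behind the topic representations. As a rough illustration, the BERTopic paper defines the weight of term t in class c as tf(t, c) · log(1 + A / f(t)), where A is the average number of words per class and f(t) is the frequency of t across all classes. The pure-Python sketch below follows that formula on a toy corpus. The helper name `c_tf_idf` and the whitespace tokenizer are assumptions made for this example; the library itself works on sparse matrices and may normalize differently.

```python
import math
from collections import Counter
from typing import Dict, List


def c_tf_idf(docs_per_class: Dict[int, List[str]]) -> Dict[int, Dict[str, float]]:
    """Weight each term per class with tf * log(1 + A / f_t)."""
    # Treat all documents of a class as one bag of words.
    class_counts = {
        c: Counter(" ".join(docs).lower().split()) for c, docs in docs_per_class.items()
    }
    # Frequency of each term across all classes.
    total: Counter = Counter()
    for counts in class_counts.values():
        total.update(counts)
    # A: average number of words per class.
    avg_words = sum(total.values()) / len(class_counts)
    return {
        c: {term: tf * math.log(1 + avg_words / total[term]) for term, tf in counts.items()}
        for c, counts in class_counts.items()
    }


docs_per_class = {
    0: ["the game was close", "the team won the game"],
    1: ["the drive failed", "format the drive"],
}
weights = c_tf_idf(docs_per_class)
top_terms = {c: max(w, key=w.get) for c, w in weights.items()}
```

Although "the" is the most frequent token in class 0, it appears in both classes and is therefore down-weighted, so the class-specific terms end up on top.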
 ### Variations
 There are many different use cases in which topic modeling can be used. As such, several variations of BERTopic have been developed so that one package can be used across many of them.
 | Method | Code  | 
 |-----------------------|---|
 | [Topic Distribution Approximation](https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html) | `.approximate_distribution(docs)` |
 | [Online Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/online/online.html) | `.partial_fit(doc)` |
 | [Semi-supervised Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/semisupervised/semisupervised.html) | `.fit(docs, y=y)` |
 | [Supervised Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html) | `.fit(docs, y=y)` |
 | [Manual Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/manual/manual.html) | `.fit(docs, y=y)` |
 | [Multimodal Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html) | `.fit(docs, images=images)` |
 | [Topic Modeling per Class](https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html) | `.topics_per_class(docs, classes)` |
 | [Dynamic Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html) | `.topics_over_time(docs, timestamps)` |
 | [Hierarchical Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html) | `.hierarchical_topics(docs)` |
 | [Guided Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html) | `BERTopic(seed_topic_list=seed_topic_list)` |
 | [Zero-shot Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html) | `BERTopic(zeroshot_topic_list=zeroshot_topic_list)` |
 | [Merge Multiple Models](https://maartengr.github.io/BERTopic/getting_started/merge/merge.html) | `BERTopic.merge_models([topic_model_1, topic_model_2])` |
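Some of these variations, such as topics per class and dynamic topic modeling, build on the same bookkeeping: group the already-assigned topics by an extra label (a class or a time bin) and look at how often each topic occurs in each group. Below is a minimal sketch of only that counting step, assuming hypothetical inputs `topics` and `classes` and a hypothetical helper `topic_frequency_per_group`. The actual methods also recompute topic representations per group, which is left out here.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List


def topic_frequency_per_group(topics: List[int], groups: List[Hashable]) -> Dict[Hashable, Counter]:
    """Count how often each topic id occurs within each group label."""
    if len(topics) != len(groups):
        raise ValueError("topics and groups must have the same length")
    freq: Dict[Hashable, Counter] = defaultdict(Counter)
    for topic, group in zip(topics, groups):
        freq[group][topic] += 1
    return dict(freq)


# One topic id per document (-1 marks an outlier) and one class label per document.
topics = [0, 0, 1, -1, 1, 0]
classes = ["sports", "sports", "hardware", "hardware", "hardware", "hardware"]

per_class = topic_frequency_per_group(topics, classes)
```

Passing time bins instead of class labels as `groups` gives the corresponding per-period counts.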
 ### Visualizations
 Evaluating topic models can be rather difficult due to the somewhat subjective nature of evaluation. 
 Visualizing different aspects of the topic model helps in understanding the model and makes it easier 
 to tweak the model to your liking. 
 | Method | Code  | 
 |-----------------------|---|
 | Visualize Topics    |  `.visualize_topics()` |
 | Visualize Documents    |  `.visualize_documents()` |
 | Visualize Document Hierarchy    |  `.visualize_hierarchical_documents()` |
 | Visualize Topic Hierarchy    |  `.visualize_hierarchy()` |
 | Visualize Topic Tree   |  `.get_topic_tree(hierarchical_topics)` |
 | Visualize Topic Terms    |  `.visualize_barchart()` |
 | Visualize Topic Similarity  |  `.visualize_heatmap()` |
 | Visualize Term Score Decline  |  `.visualize_term_rank()` |
 | Visualize Topic Probability Distribution    |  `.visualize_distribution(probs[0])` |
 | Visualize Topics over Time   |  `.visualize_topics_over_time(topics_over_time)` |
 | Visualize Topics per Class | `.visualize_topics_per_class(topics_per_class)` | 
 ## Citation
 To cite the [BERTopic paper](https://arxiv.org/abs/2203.05794), please use the following bibtex reference:
 ```bibtex
 @article{grootendorst2022bertopic,
   title={BERTopic: Neural topic modeling with a class-based TF-IDF procedure},
   author={Grootendorst, Maarten},
   journal={arXiv preprint arXiv:2203.05794},
   year={2022}
 }
 ```
@@ -1,9 +0,0 @@
 from importlib.metadata import version
 from bertopic._bertopic import BERTopic
 __version__ = version("bertopic")
 __all__ = [
    "BERTopic",
 ]
@@ -1,538 +0,0 @@
 import os
 import json
 import numpy as np
 from pathlib import Path
 from tempfile import TemporaryDirectory
 # HuggingFace Hub
 try:
    from huggingface_hub import (
        create_repo,
        get_hf_file_metadata,
        hf_hub_download,
        hf_hub_url,
        repo_type_and_id_from_hf_id,
        upload_folder,
    )
    _has_hf_hub = True
 except ImportError:
    _has_hf_hub = False
 # Typing
 from typing import Union
 # Pytorch check
 try:
    import torch
    _has_torch = True
 except ImportError:
    _has_torch = False
 # Image check
 try:
    from PIL import Image
    _has_vision = True
 except ImportError:
    _has_vision = False
 TOPICS_NAME = "topics.json"
 CONFIG_NAME = "config.json"
 HF_WEIGHTS_NAME = "topic_embeddings.bin"  # default pytorch pkl
 HF_SAFE_WEIGHTS_NAME = "topic_embeddings.safetensors"  # safetensors version
 CTFIDF_WEIGHTS_NAME = "ctfidf.bin"  # default pytorch pkl
 CTFIDF_SAFE_WEIGHTS_NAME = "ctfidf.safetensors"  # safetensors version
 CTFIDF_CFG_NAME = "ctfidf_config.json"
 MODEL_CARD_TEMPLATE = """
 ---
 tags:
 - bertopic
 library_name: bertopic
 pipeline_tag: {PIPELINE_TAG}
 ---
 # {MODEL_NAME}
 This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
 BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
 ## Usage
 To use this model, please install BERTopic:
 ```
 pip install -U bertopic
 ```
 You can use the model as follows:
 ```python
 from bertopic import BERTopic
 topic_model = BERTopic.load("{PATH}")
 topic_model.get_topic_info()
 ```
 ## Topic overview
 * Number of topics: {NR_TOPICS}
 * Number of training documents: {NR_DOCUMENTS}
 <details>
  <summary>Click here for an overview of all topics.</summary>
  {TOPICS}
 </details>
 ## Training hyperparameters
 {HYPERPARAMS}
 ## Framework versions
 {FRAMEWORKS}
 """
 def push_to_hf_hub(
    model,
    repo_id: str,
    commit_message: str = "Add BERTopic model",
    token: str = None,
    revision: str = None,
    private: bool = False,
    create_pr: bool = False,
    model_card: bool = True,
    serialization: str = "safetensors",
    save_embedding_model: Union[str, bool] = True,
    save_ctfidf: bool = False,
 ):
    """Push your BERTopic model to a HuggingFace Hub.
    Arguments:
        model: The BERTopic model to push
        repo_id: The name of your HuggingFace repository
        commit_message: A commit message
        token: Token to add if not already logged in
        revision: Repository revision
        private: Whether to create a private repository
        create_pr: Whether to upload the model as a Pull Request
        model_card: Whether to automatically create a modelcard
        serialization: The type of serialization.
                       Either `safetensors` or `pytorch`
        save_embedding_model: A pointer towards a HuggingFace model to be loaded in with
                                SentenceTransformers. E.g.,
                                `sentence-transformers/all-MiniLM-L6-v2`
        save_ctfidf: Whether to save c-TF-IDF information
    """
    if not _has_hf_hub:
        raise ValueError("Make sure you have the huggingface hub installed via `pip install --upgrade huggingface_hub`")
    # Create repo if it doesn't exist yet and infer complete repo_id
    repo_url = create_repo(repo_id, token=token, private=private, exist_ok=True)
    _, repo_owner, repo_name = repo_type_and_id_from_hf_id(repo_url)
    repo_id = f"{repo_owner}/{repo_name}"
    # Temporarily save model and push to HF
    with TemporaryDirectory() as tmpdir:
        # Save model weights and config.
        model.save(
            tmpdir,
            serialization=serialization,
            save_embedding_model=save_embedding_model,
            save_ctfidf=save_ctfidf,
        )
        # Add README if it does not exist
        try:
            get_hf_file_metadata(hf_hub_url(repo_id=repo_id, filename="README.md", revision=revision))
        except:  # noqa: E722
            if model_card:
                readme_text = generate_readme(model, repo_id)
                readme_path = Path(tmpdir) / "README.md"
                readme_path.write_text(readme_text, encoding="utf8")
        # Upload model
        return upload_folder(
            repo_id=repo_id,
            folder_path=tmpdir,
            revision=revision,
            create_pr=create_pr,
            commit_message=commit_message,
        )
 def load_local_files(path):
    """Load local BERTopic files."""
    # Load json configs
    topics = load_cfg_from_json(path / TOPICS_NAME)
    params = load_cfg_from_json(path / CONFIG_NAME)
    # Load Topic Embeddings
    safetensor_path = path / HF_SAFE_WEIGHTS_NAME
    if safetensor_path.is_file():
        tensors = load_safetensors(safetensor_path)
    else:
        torch_path = path / HF_WEIGHTS_NAME
        if torch_path.is_file():
            tensors = torch.load(torch_path, map_location="cpu")
            tensors = {k: v.numpy() for k, v in tensors.items()}
    # c-TF-IDF
    try:
        ctfidf_tensors = None
        safetensor_path = path / CTFIDF_SAFE_WEIGHTS_NAME
        if safetensor_path.is_file():
            ctfidf_tensors = load_safetensors(safetensor_path)
        else:
            torch_path = path / CTFIDF_WEIGHTS_NAME
            if torch_path.is_file():
                ctfidf_tensors = torch.load(torch_path, map_location="cpu")
                ctfidf_tensors = {k: v.numpy() for k, v in ctfidf_tensors.items()}
        ctfidf_config = load_cfg_from_json(path / CTFIDF_CFG_NAME)
    except:  # noqa: E722
        ctfidf_config, ctfidf_tensors = None, None
    # Load images
    images = None
    if _has_vision:
        try:
            Image.open(path / "images/0.jpg")
            _has_images = True
        except:  # noqa: E722
            _has_images = False
        if _has_images:
            topic_list = list(topics["topic_representations"].keys())
            images = {}
            for topic in topic_list:
                image = Image.open(path / f"images/{topic}.jpg")
                images[int(topic)] = image
    return topics, params, tensors, ctfidf_tensors, ctfidf_config, images
 def load_files_from_hf(path):
    """Load files from HuggingFace."""
    path = str(path)
    # Configs
    topics = load_cfg_from_json(hf_hub_download(path, TOPICS_NAME, revision=None))
    params = load_cfg_from_json(hf_hub_download(path, CONFIG_NAME, revision=None))
    # Topic Embeddings
    try:
        tensors = hf_hub_download(path, HF_SAFE_WEIGHTS_NAME, revision=None)
        tensors = load_safetensors(tensors)
    except:  # noqa: E722
        tensors = hf_hub_download(path, HF_WEIGHTS_NAME, revision=None)
        tensors = torch.load(tensors, map_location="cpu")
    # c-TF-IDF
    try:
        ctfidf_config = load_cfg_from_json(hf_hub_download(path, CTFIDF_CFG_NAME, revision=None))
        try:
            ctfidf_tensors = hf_hub_download(path, CTFIDF_SAFE_WEIGHTS_NAME, revision=None)
            ctfidf_tensors = load_safetensors(ctfidf_tensors)
        except:  # noqa: E722
            ctfidf_tensors = hf_hub_download(path, CTFIDF_WEIGHTS_NAME, revision=None)
            ctfidf_tensors = torch.load(ctfidf_tensors, map_location="cpu")
    except:  # noqa: E722
        ctfidf_config, ctfidf_tensors = None, None
    # Load images if they exist
    images = None
    if _has_vision:
        try:
            hf_hub_download(path, "images/0.jpg", revision=None)
            _has_images = True
        except:  # noqa: E722
            _has_images = False
        if _has_images:
            topic_list = list(topics["topic_representations"].keys())
            images = {}
            for topic in topic_list:
                image = Image.open(hf_hub_download(path, f"images/{topic}.jpg", revision=None))
                images[int(topic)] = image
    return topics, params, tensors, ctfidf_tensors, ctfidf_config, images
 def generate_readme(model, repo_id: str):
    """Generate README for HuggingFace model card."""
    model_card = MODEL_CARD_TEMPLATE
    topic_table_head = "| Topic ID | Topic Keywords | Topic Frequency | Label | \n|----------|----------------|-----------------|-------| \n"
    # Get Statistics
    model_name = repo_id.split("/")[-1]
    params = {param: value for param, value in model.get_params().items() if "model" not in param}
    params = "\n".join([f"* {param}: {value}" for param, value in params.items()])
    topics = sorted(list(set(model.topics_)))
    nr_topics = str(len(set(model.topics_)))
    if model.topic_sizes_ is not None:
        nr_documents = str(sum(model.topic_sizes_.values()))
    else:
        nr_documents = ""
    # Topic information
    topic_keywords = [" - ".join(list(zip(*model.get_topic(topic)))[0][:5]) for topic in topics]
    topic_freq = [model.get_topic_freq(topic) for topic in topics]
    topic_labels = model.custom_labels_ if model.custom_labels_ else [model.topic_labels_[topic] for topic in topics]
    topics = [
        f"| {topic} | {topic_keywords[index]} | {topic_freq[index]} | {topic_labels[index]} | \n"
        for index, topic in enumerate(topics)
    ]
    topics = topic_table_head + "".join(topics)
    frameworks = "\n".join([f"* {param}: {value}" for param, value in get_package_versions().items()])
    # Fill Statistics into model card
    model_card = model_card.replace("{MODEL_NAME}", model_name)
    model_card = model_card.replace("{PATH}", repo_id)
    model_card = model_card.replace("{NR_TOPICS}", nr_topics)
    model_card = model_card.replace("{TOPICS}", topics.strip())
    model_card = model_card.replace("{NR_DOCUMENTS}", nr_documents)
    model_card = model_card.replace("{HYPERPARAMS}", params)
    model_card = model_card.replace("{FRAMEWORKS}", frameworks)
    # Fill Pipeline tag
    has_visual_aspect = check_has_visual_aspect(model)
    if not has_visual_aspect:
        model_card = model_card.replace("{PIPELINE_TAG}", "text-classification")
    else:
        model_card = model_card.replace("pipeline_tag: {PIPELINE_TAG}\n", "")  # TODO add proper tag for this instance
    return model_card
 def save_hf(model, save_directory, serialization: str):
    """Save topic embeddings, either safely (using safetensors) or using legacy pytorch."""
    tensors = np.array(model.topic_embeddings_, dtype=np.float32)
    if serialization == "safetensors":
        tensors = {"topic_embeddings": tensors}
        save_safetensors(save_directory / HF_SAFE_WEIGHTS_NAME, tensors)
    if serialization == "pytorch":
        assert _has_torch, "`pip install torch` to save as bin"
        tensors = {"topic_embeddings": torch.from_numpy(tensors)}
        torch.save(tensors, save_directory / HF_WEIGHTS_NAME)
 def save_ctfidf(model, save_directory: str, serialization: str):
    """Save c-TF-IDF sparse matrix."""
    indptr = model.c_tf_idf_.indptr
    indices = model.c_tf_idf_.indices
    data = model.c_tf_idf_.data
    shape = np.array(model.c_tf_idf_.shape)
    diag = np.array(model.ctfidf_model._idf_diag.data)
    if serialization == "safetensors":
        tensors = {
            "indptr": indptr,
            "indices": indices,
            "data": data,
            "shape": shape,
            "diag": diag,
        }
        save_safetensors(save_directory / CTFIDF_SAFE_WEIGHTS_NAME, tensors)
    if serialization == "pytorch":
        assert _has_torch, "`pip install torch` to save as .bin"
        tensors = {
            "indptr": torch.from_numpy(indptr),
            "indices": torch.from_numpy(indices),
            "data": torch.from_numpy(data),
            "shape": torch.from_numpy(shape),
            "diag": torch.from_numpy(diag),
        }
        torch.save(tensors, save_directory / CTFIDF_WEIGHTS_NAME)
 def save_ctfidf_config(model, path):
    """Save parameters to recreate CountVectorizer and c-TF-IDF."""
    config = {}
    # Recreate ClassTfidfTransformer
    config["ctfidf_model"] = {
        "bm25_weighting": model.ctfidf_model.bm25_weighting,
        "reduce_frequent_words": model.ctfidf_model.reduce_frequent_words,
    }
    # Recreate CountVectorizer
    cv_params = model.vectorizer_model.get_params()
    del cv_params["tokenizer"], cv_params["preprocessor"], cv_params["dtype"]
    if not isinstance(cv_params["analyzer"], str):
        del cv_params["analyzer"]
    config["vectorizer_model"] = {
        "params": cv_params,
        "vocab": model.vectorizer_model.vocabulary_,
    }
    with path.open("w") as f:
        json.dump(config, f, indent=2)
 def save_config(model, path: str, embedding_model):
    """Save BERTopic configuration."""
    path = Path(path)
    params = model.get_params()
    config = {param: value for param, value in params.items() if "model" not in param}
    # Embedding model tag to be used in sentence-transformers
    if isinstance(embedding_model, str):
        config["embedding_model"] = embedding_model
    with path.open("w") as f:
        json.dump(config, f, indent=2)
    return config
 def check_has_visual_aspect(model):
    """Check if model has visual aspect."""
    if _has_vision:
        for aspect, value in model.topic_aspects_.items():
            if isinstance(value[0], Image.Image):
                return True
    # No image-based aspect found (or Pillow is not installed)
    return False
 def save_images(model, path: str):
    """Save topic images."""
    if _has_vision:
        visual_aspects = None
        for aspect, value in model.topic_aspects_.items():
            if isinstance(value[0], Image.Image):
                visual_aspects = model.topic_aspects_[aspect]
                break
        if visual_aspects is not None:
            path.mkdir(exist_ok=True, parents=True)
            for topic, image in visual_aspects.items():
                image.save(path / f"{topic}.jpg")
 def save_topics(model, path: str):
    """Save Topic-specific information."""
    path = Path(path)
    if _has_vision:
        selected_topic_aspects = {}
        for aspect, value in model.topic_aspects_.items():
            if not isinstance(value[0], Image.Image):
                selected_topic_aspects[aspect] = value
            else:
                selected_topic_aspects["Visual_Aspect"] = True
    else:
        selected_topic_aspects = model.topic_aspects_
    topics = {
        "topic_representations": model.topic_representations_,
        "topics": [int(topic) for topic in model.topics_],
        "topic_sizes": model.topic_sizes_,
        "topic_mapper": np.array(model.topic_mapper_.mappings_, dtype=int).tolist(),
        "topic_labels": model.topic_labels_,
        "custom_labels": model.custom_labels_,
        "_outliers": int(model._outliers),
        "topic_aspects": selected_topic_aspects,
    }
    with path.open("w") as f:
        json.dump(topics, f, indent=2, cls=NumpyEncoder)
 def load_cfg_from_json(json_file: Union[str, os.PathLike]):
    """Load configuration from json."""
    with open(json_file, "r", encoding="utf-8") as reader:
        text = reader.read()
    return json.loads(text)
 class NumpyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        return super(NumpyEncoder, self).default(obj)
 def get_package_versions():
    """Get versions of main dependencies of BERTopic."""
    try:
        import platform
        from numpy import __version__ as np_version
        from pandas import __version__ as pandas_version
        from sklearn import __version__ as sklearn_version
        from plotly import __version__ as plotly_version
        try:
            from importlib.metadata import version
            hdbscan_version = version("hdbscan")
        except (ImportError, ModuleNotFoundError):
            hdbscan_version = None
        try:
            from umap import __version__ as umap_version
        except (ImportError, ModuleNotFoundError):
            umap_version = None
        try:
            from sentence_transformers import __version__ as sbert_version
        except (ImportError, ModuleNotFoundError):
            sbert_version = None
        try:
            from numba import __version__ as numba_version
        except (ImportError, ModuleNotFoundError):
            numba_version = None
        try:
            from transformers import __version__ as transformers_version
        except (ImportError, ModuleNotFoundError):
            transformers_version = None
        return {
            "Numpy": np_version,
            "HDBSCAN": hdbscan_version,
            "UMAP": umap_version,
            "Pandas": pandas_version,
            "Scikit-Learn": sklearn_version,
            "Sentence-transformers": sbert_version,
            "Transformers": transformers_version,
            "Numba": numba_version,
            "Plotly": plotly_version,
            "Python": platform.python_version(),
        }
    except Exception as e:
        return e
 def load_safetensors(path):
    """Load safetensors and check whether it is installed."""
    try:
        import safetensors.numpy
        return safetensors.numpy.load_file(path)
    except ImportError:
        raise ValueError("`pip install safetensors` to load .safetensors")
 def save_safetensors(path, tensors):
    """Save safetensors and check whether it is installed."""
    try:
        import safetensors.numpy
        safetensors.numpy.save_file(tensors, path)
    except ImportError:
        raise ValueError("`pip install safetensors` to save as .safetensors")
@@ -1,228 +0,0 @@
 import numpy as np
 import pandas as pd
 import logging
 from collections.abc import Iterable
 from scipy.sparse import csr_matrix
 from scipy.spatial.distance import squareform
 from typing import Optional, Union, Tuple
 class MyLogger:
    def __init__(self):
        self.logger = logging.getLogger("BERTopic")
    def configure(self, level):
        self.set_level(level)
        self._add_handler()
        self.logger.propagate = False
    def info(self, message):
        self.logger.info(f"{message}")
    def warning(self, message):
        self.logger.warning(f"WARNING: {message}")
    def set_level(self, level):
        levels = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]
        if level in levels:
            self.logger.setLevel(level)
    def _add_handler(self):
        sh = logging.StreamHandler()
        sh.setFormatter(logging.Formatter("%(asctime)s - %(name)s - %(message)s"))
        self.logger.addHandler(sh)
        # Remove duplicate handlers
        if len(self.logger.handlers) > 1:
            self.logger.handlers = [self.logger.handlers[0]]
 def check_documents_type(documents):
    """Check whether the input documents are indeed a list of strings."""
    if isinstance(documents, pd.DataFrame):
        raise TypeError("Make sure to supply a list of strings, not a dataframe.")
    elif isinstance(documents, Iterable) and not isinstance(documents, str):
        if not all(isinstance(doc, str) for doc in documents):
            raise TypeError("Make sure that the iterable only contains strings.")
    else:
        raise TypeError("Make sure that the documents variable is an iterable containing strings only.")
 def check_embeddings_shape(embeddings, docs):
    """Check if the embeddings have the correct shape."""
    if embeddings is not None:
        if not any([isinstance(embeddings, np.ndarray), isinstance(embeddings, csr_matrix)]):
            raise ValueError("Make sure to input embeddings as a numpy array or scipy.sparse.csr.csr_matrix. ")
        else:
            if embeddings.shape[0] != len(docs):
                raise ValueError(
                    "Make sure that the embeddings are a numpy array with shape: "
                    "(len(docs), vector_dim) where vector_dim is the dimensionality "
                    "of the vector embeddings. "
                )
 def check_is_fitted(topic_model):
    """Checks if the model was fitted by verifying that `topics_` has been set.
    Arguments:
        topic_model: BERTopic instance for which the check is performed.
    Returns:
        None
    Raises:
        ValueError: If the model has not been fitted yet (`topics_` is None).
    """
    msg = "This %(name)s instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator."
    if topic_model.topics_ is None:
        raise ValueError(msg % {"name": type(topic_model).__name__})
 class NotInstalled:
    """This object is used to notify the user that additional dependencies need to be
    installed in order to use the requested tool.
    """
    def __init__(self, tool, dep, custom_msg=None):
        self.tool = tool
        self.dep = dep
        msg = f"In order to use {self.tool} you will need to install it via:\n\n"
        if custom_msg is not None:
            msg += custom_msg
        else:
            msg += f"pip install bertopic[{self.dep}]\n\n"
        self.msg = msg
    def __getattr__(self, *args, **kwargs):
        raise ModuleNotFoundError(self.msg)
    def __call__(self, *args, **kwargs):
        raise ModuleNotFoundError(self.msg)
 def validate_distance_matrix(X, n_samples):
    """Validate the distance matrix and convert it to a condensed distance matrix
    if necessary.
    A valid distance matrix is either a square matrix of shape (n_samples, n_samples)
    with zeros on the diagonal and non-negative values or condensed distance matrix
    of shape (n_samples * (n_samples - 1) / 2,) containing the upper triangular of the
    distance matrix.
    Arguments:
        X: Distance matrix to validate.
        n_samples: Number of samples in the dataset.
    Returns:
        X: Validated distance matrix.
    Raises:
        ValueError: If the distance matrix is not valid.
    """
    # Make sure it is the 1-D condensed distance matrix with zeros on the diagonal
    s = X.shape
    if len(s) == 1:
        # check it has correct size
        n = s[0]
        if n != (n_samples * (n_samples - 1) / 2):
            raise ValueError("The condensed distance matrix must have shape (n*(n-1)/2,).")
    elif len(s) == 2:
        # check it has correct size
        if (s[0] != n_samples) or (s[1] != n_samples):
            raise ValueError("The distance matrix must be of shape (n, n) where n is the number of samples.")
        # force zero diagonal and convert to condensed
        np.fill_diagonal(X, 0)
        X = squareform(X)
    else:
        raise ValueError(
            "The distance matrix must be either a 1-D condensed "
            "distance matrix of shape (n*(n-1)/2,) or a "
            "2-D square distance matrix of shape (n, n), "
            "where n is the number of documents. "
            "Got a distance matrix of shape %s" % str(s)
        )
    # Make sure its entries are non-negative
    if np.any(X < 0):
        raise ValueError("Distance matrix cannot contain negative values.")
    return X
 def get_unique_distances(dists: np.ndarray, noise_max=1e-7) -> np.ndarray:
    """Check if the consecutive elements in the distance array are the same. If so, a small noise
    is added to one of the elements to make sure that the array does not contain duplicates.
    Arguments:
        dists: distance array sorted in the increasing order.
        noise_max: the maximal magnitude of noise to be added.
    Returns:
         Unique distances, still sorted in increasing order.
    """
    dists_cp = dists.copy()
    for i in range(dists.shape[0] - 1):
        if dists[i] == dists[i + 1]:
            # returns the next unique distance or the current distance with the added noise
            next_unique_dist = next((d for d in dists[i + 1 :] if d != dists[i]), dists[i] + noise_max)
            # the noise can never be larger than the difference between the next unique distance and the current one
            curr_max_noise = min(noise_max, next_unique_dist - dists_cp[i])
            dists_cp[i + 1] = np.random.uniform(low=dists_cp[i] + curr_max_noise / 2, high=dists_cp[i] + curr_max_noise)
    return dists_cp
 def select_topic_representation(
    ctfidf_embeddings: Optional[Union[np.ndarray, csr_matrix]] = None,
    embeddings: Optional[Union[np.ndarray, csr_matrix]] = None,
    use_ctfidf: bool = True,
    output_ndarray: bool = False,
 ) -> Tuple[np.ndarray, bool]:
    """Select the topic representation.
    Arguments:
        ctfidf_embeddings: The c-TF-IDF embedding matrix
        embeddings: The topic embedding matrix
        use_ctfidf: Whether to use the c-TF-IDF representation. If False, the topic embedding representation is used
                    if it exists. Default is True.
        output_ndarray: Whether to convert the selected representation into ndarray
    Raises:
        ValueError:
            - If no topic representation was found
            - If c-TF-IDF embeddings are not a numpy array or a scipy.sparse.csr_matrix
    Returns:
        The selected topic representation and a boolean indicating whether it is c-TF-IDF.
    """
    def to_ndarray(array: Union[np.ndarray, csr_matrix]) -> np.ndarray:
        if isinstance(array, csr_matrix):
            return array.toarray()
        return array
    logger = MyLogger()
    if use_ctfidf:
        if ctfidf_embeddings is None:
            logger.warning(
                "No c-TF-IDF matrix was found even though it is supposed to be used (`use_ctfidf` is True). "
                "Defaulting to semantic embeddings."
            )
            repr_, ctfidf_used = embeddings, False
        else:
            repr_, ctfidf_used = ctfidf_embeddings, True
    else:
        if embeddings is None:
            logger.warning(
                "No topic embeddings were found even though they are supposed to be used (`use_ctfidf` is False). "
                "Defaulting to c-TF-IDF representation."
            )
            repr_, ctfidf_used = ctfidf_embeddings, True
        else:
            repr_, ctfidf_used = embeddings, False
    return to_ndarray(repr_) if output_ndarray else repr_, ctfidf_used
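The square-to-condensed conversion that `validate_distance_matrix` delegates to `squareform` can be sketched without SciPy. This is only an illustration, assuming a symmetric matrix with a zero diagonal; `to_condensed` is a hypothetical helper and not part of the library.

```python
from typing import List, Sequence


def to_condensed(square: Sequence[Sequence[float]]) -> List[float]:
    """Flatten the strict upper triangle of a square distance matrix, row by row."""
    n = len(square)
    if any(len(row) != n for row in square):
        raise ValueError("The distance matrix must be of shape (n, n).")
    # Keep only the entries above the diagonal: (0,1), (0,2), ..., (n-2,n-1)
    condensed = [square[i][j] for i in range(n) for j in range(i + 1, n)]
    if any(d < 0 for d in condensed):
        raise ValueError("Distance matrix cannot contain negative values.")
    return condensed


square = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 3.0],
    [2.0, 3.0, 0.0],
]
condensed = to_condensed(square)
# For n samples the condensed form holds n * (n - 1) / 2 entries
assert len(condensed) == 3 * (3 - 1) // 2
```

The length check mirrors the one the validator applies to 1-D input: a condensed array is only accepted when its size equals n * (n - 1) / 2.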
@@ -1,60 +0,0 @@
 from ._base import BaseEmbedder
 from ._word_doc import WordDocEmbedder
 from ._utils import languages
 from bertopic._utils import NotInstalled
 # OpenAI Embeddings
 try:
    from bertopic.backend._openai import OpenAIBackend
 except ModuleNotFoundError:
    msg = "`pip install openai` \n\n"
    OpenAIBackend = NotInstalled("OpenAI", "OpenAI", custom_msg=msg)
 # Cohere Embeddings
 try:
    from bertopic.backend._cohere import CohereBackend
 except ModuleNotFoundError:
    msg = "`pip install cohere` \n\n"
    CohereBackend = NotInstalled("Cohere", "Cohere", custom_msg=msg)
 # Multimodal Embeddings
 try:
    from bertopic.backend._multimodal import MultiModalBackend
 except ModuleNotFoundError:
    msg = "`pip install bertopic[vision]` \n\n"
    MultiModalBackend = NotInstalled("Vision", "Vision", custom_msg=msg)
 # Model2Vec Embeddings
 try:
    from bertopic.backend._model2vec import Model2VecBackend
 except ModuleNotFoundError:
    msg = "`pip install model2vec` \n\n"
    Model2VecBackend = NotInstalled("Model2Vec", "Model2Vec", custom_msg=msg)
 # FastEmbed Embeddings
 try:
    from bertopic.backend._fastembed import FastEmbedBackend
 except ModuleNotFoundError:
    msg = "`pip install fastembed` \n\n"
    FastEmbedBackend = NotInstalled("FastEmbed", "FastEmbed", custom_msg=msg)
 # LangChain Embeddings
 try:
    from bertopic.backend._langchain import LangChainBackend
 except ModuleNotFoundError:
    msg = "`pip install langchain` \n\n"
    LangChainBackend = NotInstalled("LangChain", "LangChain", custom_msg=msg)
 __all__ = [
    "BaseEmbedder",
    "WordDocEmbedder",
    "OpenAIBackend",
    "CohereBackend",
    "Model2VecBackend",
    "MultiModalBackend",
    "FastEmbedBackend",
    "LangChainBackend",
    "languages",
 ]
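The optional-import pattern above swaps a missing backend for a placeholder that only fails when it is actually used. A minimal sketch of that idea follows; `MissingDependency`, the module name, and the package name are all hypothetical and chosen for illustration, not taken from the library.

```python
class MissingDependency:
    """Hypothetical placeholder that raises as soon as it is called or accessed."""

    def __init__(self, tool: str, install_hint: str):
        self.msg = f"In order to use {tool} you will need to install it via:\n\n{install_hint}\n\n"

    def __getattr__(self, name):
        # Only reached for attributes that normal lookup cannot find
        raise ModuleNotFoundError(self.msg)

    def __call__(self, *args, **kwargs):
        raise ModuleNotFoundError(self.msg)


try:
    # Hypothetical module name that is assumed not to be installed
    import bertopic_example_missing_dependency as Backend
except ModuleNotFoundError:
    Backend = MissingDependency("ExampleBackend", "`pip install example-package`")

# Importing succeeded; the error only surfaces when the backend is used
try:
    Backend()
    error_message = None
except ModuleNotFoundError as error:
    error_message = str(error)
```

The benefit of this design is that importing the package never fails because of an optional extra; the user sees the install hint only at the point where the missing backend is needed.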
@@ -1,62 +0,0 @@
 import numpy as np
 from typing import List
 class BaseEmbedder:
    """The Base Embedder used for creating embedding models.
    Arguments:
        embedding_model: The main embedding model to be used for extracting
                         document and word embeddings
        word_embedding_model: The embedding model used for extracting word
                              embeddings only. If this model is selected,
                              then the `embedding_model` is purely used for
                              creating document embeddings.
    """
    def __init__(self, embedding_model=None, word_embedding_model=None):
        self.embedding_model = embedding_model
        self.word_embedding_model = word_embedding_model
    def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n documents/words into an n-dimensional
        matrix of embeddings.
        Arguments:
            documents: A list of documents or words to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Document/words embeddings with shape (n, m) with `n` documents/words
            that each have an embeddings size of `m`
        """
        pass
    def embed_words(self, words: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n words into an n-dimensional
        matrix of embeddings.
        Arguments:
            words: A list of words to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Word embeddings with shape (n, m) with `n` words
            that each have an embeddings size of `m`
        """
        return self.embed(words, verbose)
    def embed_documents(self, document: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n documents into an n-dimensional
        matrix of embeddings.
        Arguments:
            document: A list of documents to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Document embeddings with shape (n, m) with `n` documents
            that each have an embeddings size of `m`
        """
        return self.embed(document, verbose)
@@ -1,94 +0,0 @@
 import time
 import numpy as np
 from tqdm import tqdm
 from typing import Any, List, Mapping
 from bertopic.backend import BaseEmbedder
 class CohereBackend(BaseEmbedder):
    """Cohere Embedding Model.
    Arguments:
        client: A `cohere` client.
        embedding_model: A Cohere model. Default is "large".
                         For an overview of models see:
                         https://docs.cohere.ai/docs/generation-card
        delay_in_seconds: If a `batch_size` is given, use this to set
                          the delay in seconds between batches.
        batch_size: The size of each batch.
        embed_kwargs: Kwargs passed to `cohere.Client.embed`.
                            Can be used to define additional parameters
                            such as `input_type`
    Examples:
    ```python
    import cohere
    from bertopic.backend import CohereBackend
    client = cohere.Client("APIKEY")
    cohere_model = CohereBackend(client)
    ```
    If you want to specify `input_type`:
    ```python
    cohere_model = CohereBackend(
        client,
        embedding_model="embed-english-v3.0",
        embed_kwargs={"input_type": "clustering"}
    )
    ```
    """
    def __init__(
        self,
        client,
        embedding_model: str = "large",
        delay_in_seconds: float = None,
        batch_size: int = None,
        embed_kwargs: Mapping[str, Any] = None,
    ):
        super().__init__()
        self.client = client
        self.embedding_model = embedding_model
        self.delay_in_seconds = delay_in_seconds
        self.batch_size = batch_size
        # Copy the kwargs so the caller's mapping (and any shared default) is never mutated
        self.embed_kwargs = dict(embed_kwargs) if embed_kwargs else {}
        if self.embed_kwargs.get("model"):
            self.embedding_model = self.embed_kwargs.get("model")
        else:
            self.embed_kwargs["model"] = self.embedding_model
    def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n documents/words into an n-dimensional
        matrix of embeddings.
        Arguments:
            documents: A list of documents or words to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Document/words embeddings with shape (n, m) with `n` documents/words
            that each have an embeddings size of `m`
        """
        # Batch-wise embedding extraction
        if self.batch_size is not None:
            embeddings = []
            for batch in tqdm(self._chunks(documents), disable=not verbose):
                response = self.client.embed(texts=batch, **self.embed_kwargs)
                embeddings.extend(response.embeddings)
                # Delay subsequent calls
                if self.delay_in_seconds:
                    time.sleep(self.delay_in_seconds)
        # Extract embeddings all at once
        else:
            response = self.client.embed(texts=documents, **self.embed_kwargs)
            embeddings = response.embeddings
        return np.array(embeddings)
    def _chunks(self, documents):
        for i in range(0, len(documents), self.batch_size):
            yield documents[i : i + self.batch_size]
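The batch-and-delay loop in `embed` can be tried without a remote client. The sketch below assumes a stand-in `fake_embed` function in place of the Cohere API; `chunks` and `embed_in_batches` are hypothetical free functions that mirror `_chunks` and the batched branch of `embed`.

```python
import time
from typing import Callable, Iterator, List, Optional


def chunks(documents: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` documents."""
    for start in range(0, len(documents), batch_size):
        yield documents[start : start + batch_size]


def embed_in_batches(
    documents: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int,
    delay_in_seconds: Optional[float] = None,
) -> List[List[float]]:
    embeddings: List[List[float]] = []
    for batch in chunks(documents, batch_size):
        embeddings.extend(embed_fn(batch))
        # Pause between calls, e.g. to stay under a rate limit
        if delay_in_seconds:
            time.sleep(delay_in_seconds)
    return embeddings


def fake_embed(batch: List[str]) -> List[List[float]]:
    # Stand-in for a remote API: one 1-D "embedding" per text (its length)
    return [[float(len(text))] for text in batch]


docs = ["a", "bb", "ccc", "dddd", "eeeee"]
batches = list(chunks(docs, 2))
vectors = embed_in_batches(docs, fake_embed, batch_size=2)
```

The last batch is simply shorter when the number of documents is not a multiple of `batch_size`, so no padding is required.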
@@ -1,54 +0,0 @@
 import numpy as np
 from typing import List
 from fastembed import TextEmbedding
 from bertopic.backend import BaseEmbedder
 class FastEmbedBackend(BaseEmbedder):
    """FastEmbed embedding model.
    The FastEmbed embedding model used for generating sentence embeddings.
    Arguments:
        embedding_model: A FastEmbed embedding model
    Examples:
    To create a model, you can load in a string pointing to a supported
    FastEmbed model:
    ```python
    from bertopic.backend import FastEmbedBackend
    sentence_model = FastEmbedBackend("BAAI/bge-small-en-v1.5")
    ```
    """
    def __init__(self, embedding_model: str = "BAAI/bge-small-en-v1.5"):
        super().__init__()
        supported_models = [m["model"] for m in TextEmbedding.list_supported_models()]
        if isinstance(embedding_model, str) and embedding_model in supported_models:
            self.embedding_model = TextEmbedding(model_name=embedding_model)
        else:
            raise ValueError(
                "Please select a correct FastEmbed model: \n"
                "the model must be a string and must be supported. \n"
                "The supported TextEmbedding model list is here: https://qdrant.github.io/fastembed/examples/Supported_Models/"
            )
    def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n documents/words into an n-dimensional
        matrix of embeddings.
        Arguments:
            documents: A list of documents or words to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Document/words embeddings with shape (n, m) with `n` documents/words
            that each have an embeddings size of `m`
        """
        embeddings = np.array(list(self.embedding_model.embed(documents, show_progress_bar=verbose)))
        return embeddings
@@ -1,78 +0,0 @@
 import numpy as np
 from tqdm import tqdm
 from typing import Union, List
 from flair.data import Sentence
 from flair.embeddings import DocumentEmbeddings, TokenEmbeddings, DocumentPoolEmbeddings
 from bertopic.backend import BaseEmbedder
 class FlairBackend(BaseEmbedder):
    """Flair Embedding Model.
    The Flair embedding model used for generating document and
    word embeddings.
    Arguments:
        embedding_model: A Flair embedding model
    Examples:
    ```python
    from bertopic.backend import FlairBackend
    from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
    # Create a Flair Embedding model
    glove_embedding = WordEmbeddings('crawl')
    document_glove_embeddings = DocumentPoolEmbeddings([glove_embedding])
    # Pass the Flair model to create a new backend
    flair_embedder = FlairBackend(document_glove_embeddings)
    ```
    """
    def __init__(self, embedding_model: Union[TokenEmbeddings, DocumentEmbeddings]):
        super().__init__()
        # Flair word embeddings
        if isinstance(embedding_model, TokenEmbeddings):
            self.embedding_model = DocumentPoolEmbeddings([embedding_model])
        # Flair document embeddings + disable fine tune to prevent CUDA OOM
        # https://github.com/flairNLP/flair/issues/1719
        elif isinstance(embedding_model, DocumentEmbeddings):
            if "fine_tune" in embedding_model.__dict__:
                embedding_model.fine_tune = False
            self.embedding_model = embedding_model
        else:
            raise ValueError(
                "Please select a correct Flair model by preparing either a token or document "
                "embedding model: \n"
                "`from flair.embeddings import TransformerDocumentEmbeddings` \n"
                "`roberta = TransformerDocumentEmbeddings('roberta-base')`"
            )
    def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n documents/words into an n-dimensional
        matrix of embeddings.
        Arguments:
            documents: A list of documents or words to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Document/words embeddings with shape (n, m) with `n` documents/words
            that each have an embeddings size of `m`
        """
        embeddings = []
        for document in tqdm(documents, disable=not verbose):
            try:
                sentence = Sentence(document) if document else Sentence("an empty document")
                self.embedding_model.embed(sentence)
            except RuntimeError:
                sentence = Sentence("an empty document")
                self.embedding_model.embed(sentence)
            embedding = sentence.embedding.detach().cpu().numpy()
            embeddings.append(embedding)
        embeddings = np.asarray(embeddings)
        return embeddings
@@ -1,69 +0,0 @@
 import numpy as np
 from tqdm import tqdm
 from typing import List
 from bertopic.backend import BaseEmbedder
 from gensim.models.keyedvectors import Word2VecKeyedVectors
 class GensimBackend(BaseEmbedder):
    """Gensim Embedding Model.
    The Gensim embedding model is typically used for word embeddings with
    GloVe, Word2Vec or FastText.
    Arguments:
        embedding_model: A Gensim embedding model
    Examples:
    ```python
    from bertopic.backend import GensimBackend
    import gensim.downloader as api
    ft = api.load('fasttext-wiki-news-subwords-300')
    ft_embedder = GensimBackend(ft)
    ```
    """
    def __init__(self, embedding_model: Word2VecKeyedVectors):
        super().__init__()
        if isinstance(embedding_model, Word2VecKeyedVectors):
            self.embedding_model = embedding_model
        else:
            raise ValueError(
                "Please select a correct Gensim model: \n"
                "`import gensim.downloader as api` \n"
                "`ft = api.load('fasttext-wiki-news-subwords-300')`"
            )
    def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n documents/words into an n-dimensional
        matrix of embeddings.
        Arguments:
            documents: A list of documents or words to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Document/words embeddings with shape (n, m) with `n` documents/words
            that each have an embeddings size of `m`
        """
        vector_shape = self.embedding_model.get_vector(list(self.embedding_model.index_to_key)[0]).shape[0]
        empty_vector = np.zeros(vector_shape)
        # Extract word embeddings and pool to document-level
        embeddings = []
        for doc in tqdm(documents, disable=not verbose, position=0, leave=True):
            embedding = [
                self.embedding_model.get_vector(word)
                for word in doc.split()
                if word in self.embedding_model.key_to_index
            ]
            if len(embedding) > 0:
                embeddings.append(np.mean(embedding, axis=0))
            else:
                embeddings.append(empty_vector)
        embeddings = np.array(embeddings)
        return embeddings
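The pooling step in `GensimBackend.embed` averages the vectors of known words and falls back to a zero vector when none are known. A pure-Python sketch of that logic follows, assuming a plain dictionary as a stand-in for the Gensim keyed vectors; `pool_document` and the toy vocabulary are hypothetical.

```python
from typing import Dict, List


def pool_document(doc: str, vectors: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the vectors of in-vocabulary words; return zeros if none match."""
    known = [vectors[word] for word in doc.split() if word in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over the vectors dimension by dimension
    return [sum(column) / len(known) for column in zip(*known)]


toy_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
pooled = pool_document("cat dog unicorn", toy_vectors, dim=2)
fallback = pool_document("unicorn", toy_vectors, dim=2)
```

Out-of-vocabulary words are skipped rather than counted as zeros, so they do not pull the average toward the origin.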
@@ -1,104 +0,0 @@
 import numpy as np
 from tqdm import tqdm
 from typing import List
 from torch.utils.data import Dataset
 from sklearn.preprocessing import normalize
 from transformers.pipelines import Pipeline
 from bertopic.backend import BaseEmbedder
 class HFTransformerBackend(BaseEmbedder):
    """Hugging Face transformers model.
    This uses the `transformers.pipelines.pipeline` to define and create
    a feature generation pipeline from which embeddings can be extracted.
    Arguments:
        embedding_model: A Hugging Face feature extraction pipeline
    Examples:
    To use a Hugging Face transformers model, load in a pipeline and point
    to any model found on their model hub (https://huggingface.co/models):
    ```python
    from bertopic.backend import HFTransformerBackend
    from transformers.pipelines import pipeline
    hf_model = pipeline("feature-extraction", model="distilbert-base-cased")
    embedding_model = HFTransformerBackend(hf_model)
    ```
    """
    def __init__(self, embedding_model: Pipeline):
        super().__init__()
        if isinstance(embedding_model, Pipeline):
            self.embedding_model = embedding_model
        else:
            raise ValueError(
                "Please select a correct transformers pipeline. For example: "
                "pipeline('feature-extraction', model='distilbert-base-cased', device=0)"
            )
    def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n documents/words into an n-dimensional
        matrix of embeddings.
        Arguments:
            documents: A list of documents or words to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Document/words embeddings with shape (n, m) with `n` documents/words
            that each have an embeddings size of `m`
        """
        dataset = MyDataset(documents)
        embeddings = []
        for document, features in tqdm(
            zip(documents, self.embedding_model(dataset, truncation=True, padding=True)),
            total=len(dataset),
            disable=not verbose,
        ):
            embeddings.append(self._embed(document, features))
        return np.array(embeddings)
    def _embed(self, document: str, features: np.ndarray) -> np.ndarray:
        """Mean pooling.
        Arguments:
            document: The document for which to extract the attention mask
            features: The embeddings for each token
        Adopted from:
        https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2#usage-huggingface-transformers
        """
        token_embeddings = np.array(features)
        attention_mask = self.embedding_model.tokenizer(document, truncation=True, padding=True, return_tensors="np")[
            "attention_mask"
        ]
        input_mask_expanded = np.broadcast_to(np.expand_dims(attention_mask, -1), token_embeddings.shape)
        sum_embeddings = np.sum(token_embeddings * input_mask_expanded, 1)
        sum_mask = np.clip(
            input_mask_expanded.sum(1),
            a_min=1e-9,
            a_max=input_mask_expanded.sum(1).max(),
        )
        embedding = normalize(sum_embeddings / sum_mask)[0]
        return embedding
 class MyDataset(Dataset):
    """Dataset to pass to `transformers.pipelines.pipeline`."""
    def __init__(self, docs):
        self.docs = docs
    def __len__(self):
        return len(self.docs)
    def __getitem__(self, idx):
        return self.docs[idx]
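The `_embed` method performs masked mean pooling followed by L2 normalization. The sketch below restates that computation for a single document with plain lists, assuming one embedding per token and a 0/1 attention mask; `masked_mean` and `l2_normalize` are hypothetical helpers, not functions from the library.

```python
import math
from typing import List


def masked_mean(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average only the token vectors whose mask entry is non-zero."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    # Guard against division by zero when every token is masked out
    count = max(len(kept), 1e-9)
    return [sum(vec[d] for vec in kept) / count for d in range(dim)]


def l2_normalize(vector: List[float]) -> List[float]:
    norm = math.sqrt(sum(value * value for value in vector))
    return vector if norm == 0 else [value / norm for value in vector]


tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]  # the third token is padding and must not affect the mean
pooled = masked_mean(tokens, mask)
unit = l2_normalize(pooled)
```

Masking matters because padded positions still produce token vectors; averaging them in would shift the document embedding.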
@@ -1,43 +0,0 @@
 from typing import List
 import numpy as np
 from bertopic.backend import BaseEmbedder
 from langchain_core.embeddings import Embeddings
 class LangChainBackend(BaseEmbedder):
    """LangChain Embedding Model.
    This class uses the LangChain Embedding class to embed the documents.
    Arguments:
        embedding_model: A LangChain Embedding Instance.
    Examples:
    ```python
    from langchain_community.embeddings import HuggingFaceInstructEmbeddings
    from bertopic.backend import LangChainBackend
    hf_embedding = HuggingFaceInstructEmbeddings()
    langchain_embedder = LangChainBackend(hf_embedding)
    ```
    """
    def __init__(self, embedding_model: Embeddings):
        super().__init__()
        self.embedding_model = embedding_model
    def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n documents/words into an n-dimensional
        matrix of embeddings.
        Arguments:
            documents: A list of documents or words to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Document/words embeddings with shape (n, m) with `n` documents/words
            that each have an embeddings size of `m`
        """
        # Prepare documents, replacing empty strings with a single space
        prepared_documents = [" " if doc == "" else doc for doc in documents]
        response = self.embedding_model.embed_documents(prepared_documents)
        return np.array(response)
@@ -1,129 +0,0 @@
 import numpy as np
 from typing import List, Union
 from model2vec import StaticModel
 from sklearn.feature_extraction.text import CountVectorizer
 from bertopic.backend import BaseEmbedder
 class Model2VecBackend(BaseEmbedder):
    """Model2Vec embedding model.
    Arguments:
        embedding_model: Either a model2vec model or a
                         string pointing to a model2vec model
        distill: Indicates whether to distill a sentence-transformers compatible model.
                 The distillation will happen during fitting of the topic model.
                 NOTE: Only works if `embedding_model` is a string.
        distill_kwargs: Keyword arguments to pass to the distillation process
                        of `model2vec.distill.distill`
        distill_vectorizer: A CountVectorizer used for creating a custom vocabulary
                            based on the same documents used for topic modeling.
                            NOTE: If "vocabulary" is in `distill_kwargs`, this will be ignored.
    Examples:
    To create a model, you can load in a string pointing to a
    model2vec model:
    ```python
    from bertopic.backend import Model2VecBackend
    sentence_model = Model2VecBackend("minishlab/potion-base-8M")
    ```
    or you can instantiate a model yourself:
    ```python
    from bertopic.backend import Model2VecBackend
    from model2vec import StaticModel
    embedding_model = StaticModel.from_pretrained("minishlab/potion-base-8M")
    sentence_model = Model2VecBackend(embedding_model)
    ```
    If you want to distill a sentence-transformers model with the vocabulary of the documents,
    run the following:
    ```python
    from bertopic.backend import Model2VecBackend
    sentence_model = Model2VecBackend("sentence-transformers/all-MiniLM-L6-v2", distill=True)
    ```
    """
    def __init__(
        self,
        embedding_model: Union[str, StaticModel],
        distill: bool = False,
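        distill_kwargs: dict = None,
        distill_vectorizer: CountVectorizer = None,
    ):
        super().__init__()
        self.distill = distill
        # Copy the kwargs: `embed` later stores the vocabulary in this dict
        self.distill_kwargs = dict(distill_kwargs) if distill_kwargs else {}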
        self.distill_vectorizer = distill_vectorizer
        self._has_distilled = False
        # When we distill, we need a string pointing to a sentence-transformer model
        if self.distill:
            self._check_model2vec_installation()
            if not self.distill_vectorizer:
                self.distill_vectorizer = CountVectorizer()
            if isinstance(embedding_model, str):
                self.embedding_model = embedding_model
            else:
                raise ValueError("Please pass a string pointing to a sentence-transformer model when distilling.")
        # If we don't distill, we can pass a model2vec model directly or load from a string
        elif isinstance(embedding_model, StaticModel):
            self.embedding_model = embedding_model
        elif isinstance(embedding_model, str):
            self.embedding_model = StaticModel.from_pretrained(embedding_model)
        else:
            raise ValueError(
                "Please select a correct Model2Vec model: \n"
                "`from model2vec import StaticModel` \n"
                "`model = StaticModel.from_pretrained('minishlab/potion-base-8M')`"
            )
    def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n documents/words into an n-dimensional
        matrix of embeddings.
        Arguments:
            documents: A list of documents or words to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Document/words embeddings with shape (n, m) with `n` documents/words
            that each have an embeddings size of `m`
        """
        # Distill the model
        if self.distill and not self._has_distilled:
            from model2vec.distill import distill
            # Distill with the vocabulary of the documents
            if not self.distill_kwargs.get("vocabulary"):
                X = self.distill_vectorizer.fit_transform(documents)
                word_counts = np.array(X.sum(axis=0)).flatten()
                words = self.distill_vectorizer.get_feature_names_out()
                vocabulary = [word for word, _ in sorted(zip(words, word_counts), key=lambda x: x[1], reverse=True)]
                self.distill_kwargs["vocabulary"] = vocabulary
            # Distill the model
            self.embedding_model = distill(self.embedding_model, **self.distill_kwargs)
            # Distillation should happen only once and not for every embed call
            # The distillation should only happen the first time on the entire vocabulary
            self._has_distilled = True
        # Embed the documents
        embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
        return embeddings
    def _check_model2vec_installation(self):
        try:
            from model2vec.distill import distill  # noqa: F401
        except ImportError:
            raise ImportError("To distill a model using model2vec, you need to run `pip install model2vec[distill]`")
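Before distilling, `embed` builds a vocabulary ordered from most to least frequent word. The sketch below approximates that step with the standard library only; it assumes simple lowercase whitespace tokenization, which is a simplification of what `CountVectorizer` does, and `frequency_ordered_vocabulary` is a hypothetical name.

```python
from collections import Counter
from typing import List


def frequency_ordered_vocabulary(documents: List[str]) -> List[str]:
    """Return the unique words of the corpus, most frequent first."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    return [word for word, _ in counts.most_common()]


docs = [
    "topic models group documents",
    "topic models need embeddings",
    "topic embeddings",
]
vocabulary = frequency_ordered_vocabulary(docs)
```

Words with equal counts may appear in either order relative to each other; only the descending-frequency ordering is guaranteed.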
@@ -1,200 +0,0 @@
 import numpy as np
 from PIL import Image
 from tqdm import tqdm
 from typing import List, Union
 from sentence_transformers import SentenceTransformer
 from bertopic.backend import BaseEmbedder
 class MultiModalBackend(BaseEmbedder):
    """Multimodal backend using Sentence-transformers.
    The sentence-transformers embedding model used for
    generating word, document, and image embeddings.
    Arguments:
        embedding_model: A sentence-transformers embedding model that
                         can either embed both images and text or only text.
                         If it only embeds text, then `image_model` needs
                         to be used to embed the images.
        image_model: A sentence-transformers embedding model that is used
                     to embed only images.
        batch_size: The size of the image batches to pass
    Examples:
    To create a model, you can load in a string pointing to a
    sentence-transformers model:
    ```python
    from bertopic.backend import MultiModalBackend
    sentence_model = MultiModalBackend("clip-ViT-B-32")
    ```
    or you can instantiate a model yourself:
    ```python
    from bertopic.backend import MultiModalBackend
    from sentence_transformers import SentenceTransformer
    embedding_model = SentenceTransformer("clip-ViT-B-32")
    sentence_model = MultiModalBackend(embedding_model)
    ```
    """
    def __init__(
        self,
        embedding_model: Union[str, SentenceTransformer],
        image_model: Union[str, SentenceTransformer] = None,
        batch_size: int = 32,
    ):
        super().__init__()
        self.batch_size = batch_size
        # Text or Text+Image model
        if isinstance(embedding_model, SentenceTransformer):
            self.embedding_model = embedding_model
        elif isinstance(embedding_model, str):
            self.embedding_model = SentenceTransformer(embedding_model)
        else:
            raise ValueError(
                "Please select a correct SentenceTransformers model: \n"
                "`from sentence_transformers import SentenceTransformer` \n"
                "`model = SentenceTransformer('clip-ViT-B-32')`"
            )
        # Image Model
        self.image_model = None
        if image_model is not None:
            if isinstance(image_model, SentenceTransformer):
                self.image_model = image_model
            elif isinstance(image_model, str):
                self.image_model = SentenceTransformer(image_model)
            else:
                raise ValueError(
                    "Please select a correct SentenceTransformers model: \n"
                    "`from sentence_transformers import SentenceTransformer` \n"
                    "`model = SentenceTransformer('clip-ViT-B-32')`"
                )
        try:
            self.tokenizer = self.embedding_model._first_module().processor.tokenizer
        except AttributeError:
            self.tokenizer = self.embedding_model.tokenizer
        except:  # noqa: E722
            self.tokenizer = None
    def embed(self, documents: List[str], images: List[str] = None, verbose: bool = False) -> np.ndarray:
        """Embed a list of n documents/words or images into an n-dimensional
        matrix of embeddings.
        Either documents, images, or both can be provided. If both are provided,
        then the embeddings are averaged.
        Arguments:
            documents: A list of documents or words to be embedded
            images: A list of image paths to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Document/words embeddings with shape (n, m) with `n` documents/words
            that each have an embeddings size of `m`
        """
        # Embed documents
        doc_embeddings = None
        if documents[0] is not None:
            doc_embeddings = self.embed_documents(documents, verbose)
        # Embed images
        image_embeddings = None
        if isinstance(images, list):
            image_embeddings = self.embed_images(images, verbose)
        # Average embeddings
        averaged_embeddings = None
        if doc_embeddings is not None and image_embeddings is not None:
            averaged_embeddings = np.mean([doc_embeddings, image_embeddings], axis=0)
        if averaged_embeddings is not None:
            return averaged_embeddings
        elif doc_embeddings is not None:
            return doc_embeddings
        elif image_embeddings is not None:
            return image_embeddings
    def embed_documents(self, documents: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n documents/words into an n-dimensional
        matrix of embeddings.
        Arguments:
            documents: A list of documents or words to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Document/words embeddings with shape (n, m) with `n` documents/words
            that each have an embeddings size of `m`
        """
        truncated_docs = [self._truncate_document(doc) for doc in documents]
        embeddings = self.embedding_model.encode(truncated_docs, show_progress_bar=verbose)
        return embeddings
    def embed_words(self, words: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n words into an n-dimensional
        matrix of embeddings.
        Arguments:
            words: A list of words to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Document/words embeddings with shape (n, m) with `n` documents/words
            that each have an embeddings size of `m`
        """
        embeddings = self.embedding_model.encode(words, show_progress_bar=verbose)
        return embeddings
    def embed_images(self, images, verbose):
        if self.batch_size:
            nr_iterations = int(np.ceil(len(images) / self.batch_size))
            # Embed images per batch
            embeddings = []
            for i in tqdm(range(nr_iterations), disable=not verbose):
                start_index = i * self.batch_size
                end_index = (i * self.batch_size) + self.batch_size
                images_to_embed = [
                    Image.open(image) if isinstance(image, str) else image for image in images[start_index:end_index]
                ]
                if self.image_model is not None:
                    img_emb = self.image_model.encode(images_to_embed)
                else:
                    img_emb = self.embedding_model.encode(images_to_embed, show_progress_bar=False)
                embeddings.extend(img_emb.tolist())
                # Close images
                if isinstance(images[0], str):
                    for image in images_to_embed:
                        image.close()
            embeddings = np.array(embeddings)
        else:
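            # Accept both file paths and already-loaded images, as in the batched branch
            images_to_embed = [Image.open(image) if isinstance(image, str) else image for image in images]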
            if self.image_model is not None:
                embeddings = self.image_model.encode(images_to_embed)
            else:
                embeddings = self.embedding_model.encode(images_to_embed, show_progress_bar=False)
        return embeddings
    def _truncate_document(self, document):
        if self.tokenizer:
            tokens = self.tokenizer.encode(document)
            if len(tokens) > 77:
                # Skip the starting token, only include 75 tokens
                truncated_tokens = tokens[1:76]
                document = self.tokenizer.decode(truncated_tokens)
                # Recursive call here, because the encode(decode()) can have different result
                return self._truncate_document(document)
        return document
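The recursive truncation in `_truncate_document` can be illustrated in isolation. The following is a minimal sketch, assuming a hypothetical whitespace tokenizer (`ToyTokenizer`) that wraps text in start and end tokens. It is not the CLIP tokenizer, and the helper name `truncate_document` is illustrative only; the 77-token limit is taken from the code above.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer that adds start/end tokens."""

    def encode(self, text):
        return ["<s>"] + text.split() + ["</s>"]

    def decode(self, tokens):
        return " ".join(tokens)


def truncate_document(document, tokenizer, max_tokens=77):
    """Trim a document until its encoded length fits within max_tokens."""
    tokens = tokenizer.encode(document)
    if len(tokens) > max_tokens:
        # Skip the start token and keep max_tokens - 2 content tokens,
        # leaving room for the start and end tokens added on re-encoding.
        document = tokenizer.decode(tokens[1 : max_tokens - 1])
        # Re-check, because encode(decode(...)) may not round-trip exactly.
        return truncate_document(document, tokenizer, max_tokens)
    return document


long_doc = " ".join(f"w{i}" for i in range(200))
short_doc = truncate_document(long_doc, ToyTokenizer())
print(len(ToyTokenizer().encode(short_doc)))
```

With a real subword tokenizer the decoded text may re-encode to a different number of tokens, which is presumably why the original method calls itself again rather than returning immediately.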
@@ -1,88 +0,0 @@
 import time
 import openai
 import numpy as np
 from tqdm import tqdm
 from typing import List, Mapping, Any
 from bertopic.backend import BaseEmbedder
 class OpenAIBackend(BaseEmbedder):
    """OpenAI Embedding Model.
    Arguments:
        client: A `openai.OpenAI` client.
        embedding_model: An OpenAI model. Default is
                         `"text-embedding-ada-002"`.
                         For an overview of models see:
                         https://platform.openai.com/docs/models/embeddings
        delay_in_seconds: If a `batch_size` is given, use this to set
                          the delay in seconds between batches.
        batch_size: The size of each batch.
        generator_kwargs: Kwargs passed to `client.embeddings.create`.
                          Can be used to define custom engines or
                          deployment_ids.
    Examples:
    ```python
    import openai
    from bertopic.backend import OpenAIBackend
    client = openai.OpenAI(api_key="sk-...")
    openai_embedder = OpenAIBackend(client, "text-embedding-ada-002")
    ```
    """
    def __init__(
        self,
        client: openai.OpenAI,
        embedding_model: str = "text-embedding-ada-002",
        delay_in_seconds: float = None,
        batch_size: int = None,
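        generator_kwargs: Mapping[str, Any] = None,
    ):
        super().__init__()
        self.client = client
        self.embedding_model = embedding_model
        self.delay_in_seconds = delay_in_seconds
        self.batch_size = batch_size
        # Copy the kwargs so that neither the caller's mapping nor a shared default is mutated
        self.generator_kwargs = dict(generator_kwargs) if generator_kwargs else {}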
        if self.generator_kwargs.get("model"):
            self.embedding_model = generator_kwargs.get("model")
        elif not self.generator_kwargs.get("engine"):
            self.generator_kwargs["model"] = self.embedding_model
    def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n documents/words into an n-dimensional
        matrix of embeddings.
        Arguments:
            documents: A list of documents or words to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Document/words embeddings with shape (n, m) with `n` documents/words
            that each have an embeddings size of `m`
        """
        # Prepare documents, replacing empty strings with a single space
        prepared_documents = [" " if doc == "" else doc for doc in documents]
        # Batch-wise embedding extraction
        if self.batch_size is not None:
            embeddings = []
            for batch in tqdm(self._chunks(prepared_documents), disable=not verbose):
                response = self.client.embeddings.create(input=batch, **self.generator_kwargs)
                embeddings.extend([r.embedding for r in response.data])
                # Delay subsequent calls
                if self.delay_in_seconds:
                    time.sleep(self.delay_in_seconds)
        # Extract embeddings all at once
        else:
            response = self.client.embeddings.create(input=prepared_documents, **self.generator_kwargs)
            embeddings = [r.embedding for r in response.data]
        return np.array(embeddings)
    def _chunks(self, documents):
        for i in range(0, len(documents), self.batch_size):
            yield documents[i : i + self.batch_size]
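The batching behaviour of `_chunks` and the optional delay between requests can be sketched without calling the OpenAI API. This is a minimal illustration under stated assumptions: `fake_embed` is a hypothetical stand-in for the remote embeddings call, and `embed_in_batches` is an illustrative name, not part of the backend.

```python
import time
from typing import Iterator, List


def chunks(documents: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size documents."""
    for i in range(0, len(documents), batch_size):
        yield documents[i : i + batch_size]


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for an embeddings API call: one vector per document."""
    return [[float(len(doc))] for doc in batch]


def embed_in_batches(documents: List[str], batch_size: int, delay_in_seconds: float = 0.0):
    """Embed documents batch by batch, optionally pausing between batches."""
    embeddings = []
    for batch in chunks(documents, batch_size):
        embeddings.extend(fake_embed(batch))
        if delay_in_seconds:
            time.sleep(delay_in_seconds)
    return embeddings


docs = ["a", "bb", "ccc", "dddd", "eeeee"]
batches = list(chunks(docs, 2))
vectors = embed_in_batches(docs, batch_size=2)
print(batches)
print(vectors)
```

The last batch may be smaller than `batch_size`, and the order of the returned vectors follows the order of the input documents.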
@@ -1,85 +0,0 @@
 import numpy as np
 from typing import List, Union
 from sentence_transformers import SentenceTransformer
 from sentence_transformers.models import StaticEmbedding
 from bertopic.backend import BaseEmbedder
 class SentenceTransformerBackend(BaseEmbedder):
    """Sentence-transformers embedding model.
    The sentence-transformers embedding model used for generating document and
    word embeddings.
    Arguments:
        embedding_model: A sentence-transformers embedding model
        model2vec: Indicates whether `embedding_model` is a model2vec model.
                   NOTE: Only works if `embedding_model` is a string.
                   Otherwise, you can pass the model2vec model directly to `embedding_model`.
    Examples:
    To create a model, you can load in a string pointing to a
    sentence-transformers model:
    ```python
    from bertopic.backend import SentenceTransformerBackend
    sentence_model = SentenceTransformerBackend("all-MiniLM-L6-v2")
    ```
    or you can instantiate a model yourself:
    ```python
    from bertopic.backend import SentenceTransformerBackend
    from sentence_transformers import SentenceTransformer
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
    sentence_model = SentenceTransformerBackend(embedding_model)
    ```
    If you want to use a model2vec model without having to install model2vec,
    you can pass the model2vec model name as a string and set `model2vec=True`:
    ```python
    from bertopic.backend import SentenceTransformerBackend
    sentence_model = SentenceTransformerBackend("minishlab/potion-base-8M", model2vec=True)
    ```
    """
    def __init__(self, embedding_model: Union[str, SentenceTransformer], model2vec: bool = False):
        super().__init__()
        self._hf_model = None
        if model2vec and isinstance(embedding_model, str):
            static_embedding = StaticEmbedding.from_model2vec(embedding_model)
            self.embedding_model = SentenceTransformer(modules=[static_embedding])
        elif isinstance(embedding_model, SentenceTransformer):
            self.embedding_model = embedding_model
        elif isinstance(embedding_model, str):
            self.embedding_model = SentenceTransformer(embedding_model)
            self._hf_model = embedding_model
        else:
            raise ValueError(
                "Please select a correct SentenceTransformers model: \n"
                "`from sentence_transformers import SentenceTransformer` \n"
                "`model = SentenceTransformer('all-MiniLM-L6-v2')`"
            )
    def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n documents/words into an n-dimensional
        matrix of embeddings.
        Arguments:
            documents: A list of documents or words to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Document/words embeddings with shape (n, m) with `n` documents/words
            that each have an embeddings size of `m`
        """
        embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
        return embeddings
@@ -1,68 +0,0 @@
 from bertopic.backend import BaseEmbedder
 from sklearn.utils.validation import check_is_fitted, NotFittedError
 class SklearnEmbedder(BaseEmbedder):
    """Scikit-Learn based embedding model.
    This component allows the usage of scikit-learn pipelines for generating document and
    word embeddings.
    Arguments:
        pipe: A scikit-learn pipeline that can `.transform()` text.
    Examples:
    Scikit-Learn is very flexible and it allows for many representations.
    A relatively simple pipeline is shown below.
    ```python
    from sklearn.pipeline import make_pipeline
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from bertopic import BERTopic
    from bertopic.backend import SklearnEmbedder
    pipe = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(100)
    )
    sklearn_embedder = SklearnEmbedder(pipe)
    topic_model = BERTopic(embedding_model=sklearn_embedder)
    ```
    This pipeline first constructs a sparse representation based on TF-IDF and then
    makes it dense by applying SVD. Alternatively, you might also construct something
    more elaborate. As long as you construct a scikit-learn compatible pipeline, you
    should be able to pass it to BERTopic.
    !!! Warning
        One caveat to be aware of is that scikit-learn's base `Pipeline` class does not
        support the `.partial_fit()` API. If you have a pipeline that theoretically should
        be able to support online learning, then you might want to explore
        the [scikit-partial](https://github.com/koaning/scikit-partial) project.
    """
    def __init__(self, pipe):
        super().__init__()
        self.pipe = pipe
    def embed(self, documents, verbose=False):
        """Embed a list of n documents/words into an n-dimensional
        matrix of embeddings.
        Arguments:
            documents: A list of documents or words to be embedded
            verbose: No-op variable that's kept around to keep the API consistent. If you want to get feedback on training times, you should use the sklearn API.
        Returns:
            Document/words embeddings with shape (n, m) with `n` documents/words
            that each have an embeddings size of `m`
        """
        try:
            check_is_fitted(self.pipe)
            embeddings = self.pipe.transform(documents)
        except NotFittedError:
            embeddings = self.pipe.fit_transform(documents)
        return embeddings
@@ -1,94 +0,0 @@
 import numpy as np
 from tqdm import tqdm
 from typing import List
 from bertopic.backend import BaseEmbedder
 class SpacyBackend(BaseEmbedder):
    """Spacy embedding model.
    The Spacy embedding model used for generating document and
    word embeddings.
    Arguments:
        embedding_model: A spacy embedding model
    Examples:
    To create a Spacy backend, you need to create an nlp object and
    pass it through this backend:
    ```python
    import spacy
    from bertopic.backend import SpacyBackend
    nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
    spacy_model = SpacyBackend(nlp)
    ```
    To load in a transformer model use the following:
    ```python
    import spacy
    from thinc.api import set_gpu_allocator, require_gpu
    from bertopic.backend import SpacyBackend
    nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
    set_gpu_allocator("pytorch")
    require_gpu(0)
    spacy_model = SpacyBackend(nlp)
    ```
    If you run into gpu/memory-issues, please use:
    ```python
    import spacy
    from bertopic.backend import SpacyBackend
    spacy.prefer_gpu()
    nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
    spacy_model = SpacyBackend(nlp)
    ```
    """
    def __init__(self, embedding_model):
        super().__init__()
        if "spacy" in str(type(embedding_model)):
            self.embedding_model = embedding_model
        else:
            raise ValueError(
                "Please select a correct spaCy model by creating an nlp object, "
                "for example: `nlp = spacy.load('en_core_web_md')`"
            )
    def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n documents/words into an n-dimensional
        matrix of embeddings.
        Arguments:
            documents: A list of documents or words to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Document/words embeddings with shape (n, m) with `n` documents/words
            that each have an embeddings size of `m`
        """
        # Handle empty documents, spaCy models automatically map
        # empty strings to the zero vector
        empty_document = " "
        # Extract embeddings
        embeddings = []
        for doc in tqdm(documents, position=0, leave=True, disable=not verbose):
            embedding = self.embedding_model(doc or empty_document)
            if embedding.has_vector:
                embedding = embedding.vector
            else:
                embedding = embedding._.trf_data.tensors[-1][0]
            if not isinstance(embedding, np.ndarray) and hasattr(embedding, "get"):
                # Convert cupy array to numpy array
                embedding = embedding.get()
            embeddings.append(embedding)
        return np.array(embeddings)
@@ -1,55 +0,0 @@
 import numpy as np
 from tqdm import tqdm
 from typing import List
 from bertopic.backend import BaseEmbedder
 class USEBackend(BaseEmbedder):
    """Universal Sentence Encoder.
    USE encodes text into high-dimensional vectors that
    are used for semantic similarity in BERTopic.
    Arguments:
        embedding_model: A USE embedding model
    Examples:
    ```python
    import tensorflow_hub
    from bertopic.backend import USEBackend
    embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    use_embedder = USEBackend(embedding_model)
    ```
    """
    def __init__(self, embedding_model):
        super().__init__()
        try:
            embedding_model(["test sentence"])
            self.embedding_model = embedding_model
        except TypeError:
            raise ValueError(
                "Please select a correct USE model: \n"
                "`import tensorflow_hub` \n"
                "`embedding_model = tensorflow_hub.load(path_to_model)`"
            )
    def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n documents/words into an n-dimensional
        matrix of embeddings.
        Arguments:
            documents: A list of documents or words to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Document/words embeddings with shape (n, m) with `n` documents/words
            that each have an embeddings size of `m`
        """
        embeddings = np.array(
            [self.embedding_model([doc]).cpu().numpy()[0] for doc in tqdm(documents, disable=not verbose)]
        )
        return embeddings
@@ -1,171 +0,0 @@
 from ._base import BaseEmbedder
 # Imports for light-weight variant of BERTopic
 from bertopic.backend._sklearn import SklearnEmbedder
 from bertopic._utils import MyLogger
 from sklearn.pipeline import make_pipeline
 from sklearn.decomposition import TruncatedSVD
 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.pipeline import Pipeline as ScikitPipeline
 logger = MyLogger()
 logger.configure("WARNING")
 languages = [
    "arabic",
    "bulgarian",
    "catalan",
    "czech",
    "danish",
    "german",
    "greek",
    "english",
    "spanish",
    "estonian",
    "persian",
    "finnish",
    "french",
    "canadian french",
    "galician",
    "gujarati",
    "hebrew",
    "hindi",
    "croatian",
    "hungarian",
    "armenian",
    "indonesian",
    "italian",
    "japanese",
    "georgian",
    "korean",
    "kurdish",
    "lithuanian",
    "latvian",
    "macedonian",
    "mongolian",
    "marathi",
    "malay",
    "burmese",
    "norwegian bokmal",
    "dutch",
    "polish",
    "portuguese",
    "brazilian portuguese",
    "romanian",
    "russian",
    "slovak",
    "slovenian",
    "albanian",
    "serbian",
    "swedish",
    "thai",
    "turkish",
    "ukrainian",
    "urdu",
    "vietnamese",
    "chinese (simplified)",
    "chinese (traditional)",
 ]
 def select_backend(embedding_model, language: str = None, verbose: bool = False) -> BaseEmbedder:
    """Select an embedding model based on language or a specific provided model.
    When selecting a language, we choose all-MiniLM-L6-v2 for English and
    paraphrase-multilingual-MiniLM-L12-v2 for all other languages, as it supports 100+ languages.
    If sentence-transformers is not installed, as in the case of a lightweight installation,
    a scikit-learn backend is the default.
    Returns:
        model: The selected model backend.
    """
    logger.set_level("INFO" if verbose else "WARNING")
    # BERTopic language backend
    if isinstance(embedding_model, BaseEmbedder):
        return embedding_model
    # Scikit-learn backend
    if isinstance(embedding_model, ScikitPipeline):
        return SklearnEmbedder(embedding_model)
    # Flair word embeddings
    if "flair" in str(type(embedding_model)):
        from bertopic.backend._flair import FlairBackend
        return FlairBackend(embedding_model)
    # Spacy embeddings
    if "spacy" in str(type(embedding_model)):
        from bertopic.backend._spacy import SpacyBackend
        return SpacyBackend(embedding_model)
    # Gensim embeddings
    if "gensim" in str(type(embedding_model)):
        from bertopic.backend._gensim import GensimBackend
        return GensimBackend(embedding_model)
    # USE embeddings
    if "tensorflow" in str(type(embedding_model)) and "saved_model" in str(type(embedding_model)):
        from bertopic.backend._use import USEBackend
        return USEBackend(embedding_model)
    # Sentence Transformer embeddings
    if "sentence_transformers" in str(type(embedding_model)) or isinstance(embedding_model, str):
        from ._sentencetransformers import SentenceTransformerBackend
        return SentenceTransformerBackend(embedding_model)
    # Hugging Face embeddings
    if "transformers" in str(type(embedding_model)) and "pipeline" in str(type(embedding_model)):
        from ._hftransformers import HFTransformerBackend
        return HFTransformerBackend(embedding_model)
    # Model2Vec embeddings
    if "model2vec" in str(type(embedding_model)):
        from ._model2vec import Model2VecBackend
        return Model2VecBackend(embedding_model)
    # FastEmbed word embeddings
    if "fastembed" in str(type(embedding_model)):
        from bertopic.backend._fastembed import FastEmbedBackend
        return FastEmbedBackend(embedding_model)
    # Select embedding model based on language
    if language:
        try:
            from ._sentencetransformers import SentenceTransformerBackend
            if language.lower() in ["english", "en"]:
                return SentenceTransformerBackend("sentence-transformers/all-MiniLM-L6-v2")
            elif language.lower() in languages or language.lower() == "multilingual":
                return SentenceTransformerBackend("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
            else:
                raise ValueError(
                    f"{language} is currently not supported. However, you can "
                    f"create any embeddings yourself and pass it through fit_transform(docs, embeddings)\n"
                    "Else, please select a language from the following list:\n"
                    f"{languages}"
                )
        # A ModuleNotFoundError might be a lightweight installation
        except ModuleNotFoundError as e:
            if e.name != "sentence_transformers":
                # Error occurred in a downstream module, probably not a lightweight install
                raise e
            # Whole sentence_transformers module is missing, probably a lightweight install
            if verbose:
                logger.info(
                    "Automatically selecting lightweight scikit-learn embedding backend as sentence-transformers appears to not be installed."
                )
            pipe = make_pipeline(TfidfVectorizer(), TruncatedSVD(100))
            return SklearnEmbedder(pipe)
    from ._sentencetransformers import SentenceTransformerBackend
    return SentenceTransformerBackend("sentence-transformers/all-MiniLM-L6-v2")
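`select_backend` dispatches on substrings of `str(type(embedding_model))`, which avoids importing optional libraries up front. When several substrings must all be present, each one needs its own `in` test, because an expression of the form `"a" and "b" in s` only checks `"b"`. The sketch below shows the difference; `FakePipeline` and `type_name_contains` are hypothetical names used only for illustration.

```python
class FakePipeline:
    """Hypothetical model class used only for this illustration."""


def type_name_contains(obj, *keywords):
    """Return True if every keyword occurs in the string form of obj's type."""
    type_str = str(type(obj))
    return all(keyword in type_str for keyword in keywords)


model = FakePipeline()
type_str = str(type(model))

# A non-empty string is truthy, so this reduces to `"Pipeline" in type_str`.
chained = "tensorflow" and "Pipeline" in type_str

# Each keyword is tested separately, so the missing "tensorflow" is detected.
explicit = type_name_contains(model, "tensorflow", "Pipeline")

print(chained, explicit)
```

A helper like this keeps the import-free dispatch style while making the intent of multi-keyword checks explicit.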
@@ -1,43 +0,0 @@
 import numpy as np
 from typing import List
 from bertopic.backend._base import BaseEmbedder
 from bertopic.backend._utils import select_backend
 class WordDocEmbedder(BaseEmbedder):
    """Combine a document- and word-level embedder."""
    def __init__(self, embedding_model, word_embedding_model):
        super().__init__()
        self.embedding_model = select_backend(embedding_model)
        self.word_embedding_model = select_backend(word_embedding_model)
    def embed_words(self, words: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n words into an n-dimensional
        matrix of embeddings.
        Arguments:
            words: A list of words to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Word embeddings with shape (n, m) with `n` words
            that each have an embeddings size of `m`
        """
        return self.word_embedding_model.embed(words, verbose)
    def embed_documents(self, document: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n documents into an n-dimensional
        matrix of embeddings.
        Arguments:
            document: A list of documents to be embedded
            verbose: Controls the verbosity of the process
        Returns:
            Document embeddings with shape (n, m) with `n` documents
            that each have an embeddings size of `m`
        """
        return self.embedding_model.embed(document, verbose)
@@ -1,5 +0,0 @@
 from ._base import BaseCluster
 __all__ = [
    "BaseCluster",
 ]
@@ -1,41 +0,0 @@
 import numpy as np
 class BaseCluster:
    """The Base Cluster class.
    Using this class directly in BERTopic will make it skip
    over the cluster step. As a result, topics need to be passed
    to BERTopic in the form of its `y` parameter in order to create
    topic representations.
    Examples:
    This will skip over the cluster step in BERTopic:
    ```python
    from bertopic import BERTopic
    from bertopic.cluster import BaseCluster
    empty_cluster_model = BaseCluster()
    topic_model = BERTopic(hdbscan_model=empty_cluster_model)
    ```
    Then, this class can be used to perform manual topic modeling.
    That is, topic modeling on topics that were already generated before
    without the need to learn them:
    ```python
    topic_model.fit(docs, y=y)
    ```
    """
    def fit(self, X, y=None):
        if y is not None:
            self.labels_ = y
        else:
            self.labels_ = None
        return self
    def transform(self, X: np.ndarray) -> np.ndarray:
        return X
@@ -1,81 +0,0 @@
 import numpy as np
 def hdbscan_delegator(model, func: str, embeddings: np.ndarray = None):
    """Function used to select the HDBSCAN-like model for generating
    predictions and probabilities.
    Arguments:
        model: The cluster model.
        func: The function to use. Options:
                - "approximate_predict"
                - "all_points_membership_vectors"
                - "membership_vector"
        embeddings: Input embeddings for "approximate_predict"
                    and "membership_vector"
    """
    try:
        import hdbscan
    except (ImportError, ModuleNotFoundError):
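        # Use a placeholder class, since `isinstance` raises a TypeError when given None
        hdbscan = type("hdbscan", (), {"HDBSCAN": type("HDBSCAN", (), {})})()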
    # Approximate predict
    if func == "approximate_predict":
        if isinstance(model, hdbscan.HDBSCAN):
            predictions, probabilities = hdbscan.approximate_predict(model, embeddings)
            return predictions, probabilities
        str_type_model = str(type(model)).lower()
        if "cuml" in str_type_model and "hdbscan" in str_type_model:
            from cuml.cluster import hdbscan as cuml_hdbscan
            predictions, probabilities = cuml_hdbscan.approximate_predict(model, embeddings)
            return predictions, probabilities
        predictions = model.predict(embeddings)
        return predictions, None
    # All points membership
    if func == "all_points_membership_vectors":
        if isinstance(model, hdbscan.HDBSCAN):
            return hdbscan.all_points_membership_vectors(model)
        str_type_model = str(type(model)).lower()
        if "cuml" in str_type_model and "hdbscan" in str_type_model:
            from cuml.cluster import hdbscan as cuml_hdbscan
            return cuml_hdbscan.all_points_membership_vectors(model)
        return None
    # membership_vector
    if func == "membership_vector":
        if isinstance(model, hdbscan.HDBSCAN):
            probabilities = hdbscan.membership_vector(model, embeddings)
            return probabilities
        str_type_model = str(type(model)).lower()
        if "cuml" in str_type_model and "hdbscan" in str_type_model:
            from cuml.cluster import hdbscan as cuml_hdbscan
            probabilities = cuml_hdbscan.membership_vector(model, embeddings)
            return probabilities
        return None
 def is_supported_hdbscan(model):
    """Check whether the input model is a supported HDBSCAN-like model."""
    try:
        import hdbscan
    except (ImportError, ModuleNotFoundError):
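        # Use a placeholder class, since `isinstance` raises a TypeError when given None
        hdbscan = type("hdbscan", (), {"HDBSCAN": type("HDBSCAN", (), {})})()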
    if isinstance(model, hdbscan.HDBSCAN):
        return True
    str_type_model = str(type(model)).lower()
    if "cuml" in str_type_model and "hdbscan" in str_type_model:
        return True
    return False
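The optional-import pattern used by these helpers can be sketched on its own. The idea is to fall back to a stand-in when a library is missing, so that later `isinstance` checks simply evaluate to `False`. The following is a minimal sketch, assuming a hypothetical helper `optional_class`; it is not part of BERTopic.

```python
import importlib


def optional_class(module_name: str, class_name: str) -> type:
    """Return module_name.class_name, or a placeholder class if it is unavailable."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, class_name)
    except (ImportError, AttributeError):
        # Placeholder type: no real object is ever an instance of it.
        return type(class_name, (), {})


# Hypothetical package name that is assumed not to be installed.
MissingHDBSCAN = optional_class("surely_not_installed_pkg_xyz", "HDBSCAN")
print(isinstance(object(), MissingHDBSCAN))

# An available class is returned unchanged.
OrderedDict = optional_class("collections", "OrderedDict")
print(isinstance(OrderedDict(), OrderedDict))
```

The placeholder must be a class rather than `None`, because `isinstance` requires a type (or a tuple of types) as its second argument.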
@@ -1,5 +0,0 @@
 from ._base import BaseDimensionalityReduction
 __all__ = [
    "BaseDimensionalityReduction",
 ]
@@ -1,26 +0,0 @@
 import numpy as np
 class BaseDimensionalityReduction:
    """The Base Dimensionality Reduction class.
    You can use this to skip over the dimensionality reduction step in BERTopic.
    Examples:
    This will skip over the reduction step in BERTopic:
    ```python
    from bertopic import BERTopic
    from bertopic.dimensionality import BaseDimensionalityReduction
    empty_reduction_model = BaseDimensionalityReduction()
    topic_model = BERTopic(umap_model=empty_reduction_model)
    ```
    """
    def fit(self, X: np.ndarray = None):
        return self
    def transform(self, X: np.ndarray) -> np.ndarray:
        return X
@@ -1,28 +0,0 @@
 from ._topics import visualize_topics
 from ._heatmap import visualize_heatmap
 from ._barchart import visualize_barchart
 from ._documents import visualize_documents
 from ._term_rank import visualize_term_rank
 from ._hierarchy import visualize_hierarchy
 from ._datamap import visualize_document_datamap
 from ._distribution import visualize_distribution
 from ._topics_over_time import visualize_topics_over_time
 from ._topics_per_class import visualize_topics_per_class
 from ._hierarchical_documents import visualize_hierarchical_documents
 from ._approximate_distribution import visualize_approximate_distribution
 __all__ = [
    "visualize_topics",
    "visualize_heatmap",
    "visualize_barchart",
    "visualize_documents",
    "visualize_term_rank",
    "visualize_hierarchy",
    "visualize_distribution",
    "visualize_document_datamap",
    "visualize_topics_over_time",
    "visualize_topics_per_class",
    "visualize_hierarchical_documents",
    "visualize_approximate_distribution",
 ]
@@ -1,100 +0,0 @@
 import numpy as np
 import pandas as pd
 try:
    from pandas.io.formats.style import Styler  # noqa: F401
    HAS_JINJA = True
 except (ModuleNotFoundError, ImportError):
    HAS_JINJA = False
 def visualize_approximate_distribution(
    topic_model,
    document: str,
    topic_token_distribution: np.ndarray,
    normalize: bool = False,
 ):
    """Visualize the topic distribution calculated by `.approximate_distribution`
    on a token level, thereby indicating the extent to which a certain word or phrase belongs
    to a specific topic. The assumption here is that a single word can belong to multiple
    similar topics and as such gives information about the broader set of topics within
    a single document.
    Note:
    This function will return a stylized pandas dataframe if Jinja2 is installed. If not,
    it will only return a pandas dataframe without color highlighting. To install jinja:
    `pip install jinja2`
    Arguments:
        topic_model: A fitted BERTopic instance.
        document: The document for which you want to visualize
                  the approximated topic distribution.
        topic_token_distribution: The topic-token distribution of the document as
                                  extracted by `.approximate_distribution`
        normalize: Whether to normalize, between 0 and 1 (summing to 1), the
                   topic distribution values.
    Returns:
        df: A stylized dataframe indicating the best fitting topics
            for each token.
    Examples:
    ```python
    # Calculate the topic distributions on a token level
    # Note that we need to have `calculate_token_level=True`
    topic_distr, topic_token_distr = topic_model.approximate_distribution(
            docs, calculate_token_level=True
    )
    # Visualize the approximated topic distributions
    df = topic_model.visualize_approximate_distribution(docs[0], topic_token_distr[0])
    df
    ```
    To revert this stylized dataframe back to a regular dataframe,
    you can run the following:
    ```python
    df.data.columns = [column.strip() for column in df.data.columns]
    df = df.data
    ```
    """
    # Tokenize document
    analyzer = topic_model.vectorizer_model.build_tokenizer()
    tokens = analyzer(document)
    if len(tokens) == 0:
        raise ValueError("Make sure that your document contains at least 1 token.")
    # Prepare dataframe with results
    if normalize:
        df = pd.DataFrame(topic_token_distribution / topic_token_distribution.sum()).T
    else:
        df = pd.DataFrame(topic_token_distribution).T
    # Pad each token with a unique number of trailing spaces so repeated tokens get distinct column names
    df.columns = [f"{token}{' ' * i}" for i, token in enumerate(tokens)]
    df.index = list(topic_model.topic_labels_.values())[topic_model._outliers :]
    df = df.loc[(df.sum(axis=1) != 0), :]
    # Style the resulting dataframe
    def text_color(val):
        color = "white" if val == 0 else "black"
        return "color: %s" % color
    def highlight_color(data, color="white"):
        attr = "background-color: {}".format(color)
        return pd.DataFrame(np.where(data == 0, attr, ""), index=data.index, columns=data.columns)
    if len(df) == 0:
        return df
    elif HAS_JINJA:
        df = (
            df.style.format("{:.3f}")
            .background_gradient(cmap="Blues", axis=None)
            .applymap(lambda x: text_color(x))
            .apply(highlight_color, axis=None)
        )
    return df
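The column naming in `visualize_approximate_distribution` above pads each token with a different number of trailing spaces, which appears to be why the docstring strips the column names when reverting to a regular dataframe. The following is a minimal sketch of that idea in plain Python; the `tokens` list is made up for illustration and is not taken from the source.

```python
# Illustrative tokens (an assumption); the repeated "the" would
# otherwise produce duplicate column names.
tokens = ["the", "cat", "saw", "the", "dog"]

# Pad token i with i trailing spaces, mirroring the f-string used above.
padded = [f"{token}{' ' * i}" for i, token in enumerate(tokens)]

# Every padded name is unique even though "the" appears twice.
all_unique = len(set(padded)) == len(padded)

# Stripping the padding recovers the original tokens, duplicates included.
restored = [name.strip() for name in padded]
```

The padding is invisible when the table is rendered, so the displayed headers still read as the plain tokens.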
@@ -1,132 +0,0 @@
 import itertools
 import numpy as np
 from typing import List, Union
 import plotly.graph_objects as go
 from plotly.subplots import make_subplots
 def visualize_barchart(
    topic_model,
    topics: List[int] = None,
    top_n_topics: int = 8,
    n_words: int = 5,
    custom_labels: Union[bool, str] = False,
    title: str = "<b>Topic Word Scores</b>",
    width: int = 250,
    height: int = 250,
    autoscale: bool = False,
 ) -> go.Figure:
    """Visualize a barchart of selected topics.
    Arguments:
        topic_model: A fitted BERTopic instance.
        topics: A selection of topics to visualize.
        top_n_topics: Only select the top n most frequent topics.
        n_words: Number of words to show in a topic
        custom_labels: If bool, whether to use custom topic labels that were defined using
                       `topic_model.set_topic_labels`.
                       If `str`, it uses labels from other aspects, e.g., "Aspect1".
        title: Title of the plot.
        width: The width of each figure.
        height: The height of each figure.
        autoscale: Whether to automatically calculate the height of the figures to fit the whole bar text
    Returns:
        fig: A plotly figure
    Examples:
    To visualize the barchart of selected topics
    simply run:
    ```python
    topic_model.visualize_barchart()
    ```
    Or if you want to save the resulting figure:
    ```python
    fig = topic_model.visualize_barchart()
    fig.write_html("path/to/file.html")
    ```
    <iframe src="../../getting_started/visualization/bar_chart.html"
    style="width:1100px; height: 660px; border: 0px;"></iframe>
    """
    colors = itertools.cycle(["#D55E00", "#0072B2", "#CC79A7", "#E69F00", "#56B4E9", "#009E73", "#F0E442"])
    # Select topics based on top_n and topics args
    freq_df = topic_model.get_topic_freq()
    freq_df = freq_df.loc[freq_df.Topic != -1, :]
    if topics is not None:
        topics = list(topics)
    elif top_n_topics is not None:
        topics = sorted(freq_df.Topic.to_list()[:top_n_topics])
    else:
        topics = sorted(freq_df.Topic.to_list()[0:6])
    # Initialize figure
    if isinstance(custom_labels, str):
        subplot_titles = [[[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in topics]
        subplot_titles = ["_".join([label[0] for label in labels[:4]]) for labels in subplot_titles]
        subplot_titles = [label if len(label) < 30 else label[:27] + "..." for label in subplot_titles]
    elif topic_model.custom_labels_ is not None and custom_labels:
        subplot_titles = [topic_model.custom_labels_[topic + topic_model._outliers] for topic in topics]
    else:
        subplot_titles = [f"Topic {topic}" for topic in topics]
    columns = 4
    rows = int(np.ceil(len(topics) / columns))
    fig = make_subplots(
        rows=rows,
        cols=columns,
        shared_xaxes=False,
        horizontal_spacing=0.1,
        vertical_spacing=0.4 / rows if rows > 1 else 0,
        subplot_titles=subplot_titles,
    )
    # Add barchart for each topic
    row = 1
    column = 1
    for topic in topics:
        words = [word + "  " for word, _ in topic_model.get_topic(topic)][:n_words][::-1]
        scores = [score for _, score in topic_model.get_topic(topic)][:n_words][::-1]
        fig.add_trace(
            go.Bar(x=scores, y=words, orientation="h", marker_color=next(colors)),
            row=row,
            col=column,
        )
        if autoscale:
            if len(words) > 12:
                height = 250 + (len(words) - 12) * 11
            if len(words) > 9:
                fig.update_yaxes(tickfont=dict(size=(height - 140) // len(words)))
        if column == columns:
            column = 1
            row += 1
        else:
            column += 1
    # Stylize graph
    fig.update_layout(
        template="plotly_white",
        showlegend=False,
        title={
            "text": f"{title}",
            "x": 0.5,
            "xanchor": "center",
            "yanchor": "top",
            "font": dict(size=22, color="Black"),
        },
        width=width * 4,
        height=height * rows if rows > 1 else height * 1.3,
        hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
    )
    fig.update_xaxes(showgrid=True)
    fig.update_yaxes(showgrid=True)
    return fig
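`visualize_barchart` above lays the topics out on a four-column grid, advancing a running `row`/`column` counter after each trace. A rough sketch of the same placement logic, written with integer division instead of the running counter; `grid_positions` is a hypothetical helper name introduced here for illustration only.

```python
import math


def grid_positions(n_items: int, columns: int = 4):
    """Return 1-based (row, column) positions, filling each row left to right."""
    return [(i // columns + 1, i % columns + 1) for i in range(n_items)]


n_topics = 6  # made-up number of topics
rows = math.ceil(n_topics / 4)  # same row count as np.ceil(len(topics) / columns)
positions = grid_positions(n_topics)
```

Both approaches should give the same positions; the closed-form version simply avoids mutable loop state.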
@@ -1,188 +0,0 @@
 import numpy as np
 import pandas as pd
 from typing import List, Union
 from warnings import warn
 try:
    import datamapplot
    from matplotlib.figure import Figure
 except ImportError:
    warn("Data map plotting is unavailable unless datamapplot is installed.")
    # Create a dummy figure type for typing
    class Figure(object):
        pass
 def visualize_document_datamap(
    topic_model,
    docs: List[str] = None,
    topics: List[int] = None,
    embeddings: np.ndarray = None,
    reduced_embeddings: np.ndarray = None,
    custom_labels: Union[bool, str] = False,
    title: str = "Documents and Topics",
    sub_title: Union[str, None] = None,
    width: int = 1200,
    height: int = 750,
    interactive: bool = False,
    enable_search: bool = False,
    topic_prefix: bool = False,
    datamap_kwds: dict = {},
    int_datamap_kwds: dict = {},
 ) -> Figure:
    """Visualize documents and their topics in 2D as a static plot for publication using
    DataMapPlot.
    Arguments:
        topic_model:  A fitted BERTopic instance.
        docs: The documents you used when calling either `fit` or `fit_transform`.
        topics: A selection of topics to visualize.
                Not to be confused with the topics that you get from `.fit_transform`.
                For example, if you want to visualize only topics 1 through 5:
                `topics = [1, 2, 3, 4, 5]`. Documents not in these topics will be shown
                as noise points.
        embeddings:  The embeddings of all documents in `docs`.
        reduced_embeddings:  The 2D reduced embeddings of all documents in `docs`.
        custom_labels:  If bool, whether to use custom topic labels that were defined using
                       `topic_model.set_topic_labels`.
                       If `str`, it uses labels from other aspects, e.g., "Aspect1".
        title: Title of the plot.
        sub_title: Sub-title of the plot.
        width: The width of the figure.
        height: The height of the figure.
        interactive: Whether to create an interactive plot using DataMapPlot's `create_interactive_plot`.
        enable_search: Whether to enable search in the interactive plot. Only works if `interactive=True`.
        topic_prefix: Prefix to add to the topic number when displaying the topic name.
        datamap_kwds:  Keyword args to be passed on to DataMapPlot's `create_plot` function
                       if you are not using the interactive version.
                       See the DataMapPlot documentation for more details.
        int_datamap_kwds:  Keyword args to be passed on to DataMapPlot's `create_interactive_plot` function
                           if you are using the interactive version.
                           See the DataMapPlot documentation for more details.
    Returns:
        figure: A Matplotlib Figure object.
    Examples:
    To visualize the topics simply run:
    ```python
    topic_model.visualize_document_datamap(docs)
    ```
    Do note that this re-calculates the embeddings and reduces them to 2D.
    The advised and preferred pipeline for using this function is as follows:
    ```python
    from sklearn.datasets import fetch_20newsgroups
    from sentence_transformers import SentenceTransformer
    from bertopic import BERTopic
    from umap import UMAP
    # Prepare embeddings
    docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
    sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = sentence_model.encode(docs, show_progress_bar=False)
    # Train BERTopic
    topic_model = BERTopic().fit(docs, embeddings)
    # Reduce dimensionality of embeddings, this step is optional
    # reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
    # Run the visualization with the original embeddings
    topic_model.visualize_document_datamap(docs, embeddings=embeddings)
    # Or, if you have reduced the original embeddings already:
    topic_model.visualize_document_datamap(docs, reduced_embeddings=reduced_embeddings)
    ```
    Or if you want to save the resulting figure:
    ```python
    fig = topic_model.visualize_document_datamap(docs, reduced_embeddings=reduced_embeddings)
    fig.savefig("path/to/file.png", bbox_inches="tight")
    ```
    <img src="../../getting_started/visualization/datamapplot.png"
         alt="DataMapPlot of 20-Newsgroups" width="800" height="800">
    """
    topic_per_doc = topic_model.topics_
    df = pd.DataFrame({"topic": np.array(topic_per_doc)})
    df["doc"] = docs
    df["topic"] = topic_per_doc
    # Extract embeddings if not already done
    if embeddings is None and reduced_embeddings is None:
        embeddings_to_reduce = topic_model._extract_embeddings(df.doc.to_list(), method="document")
    else:
        embeddings_to_reduce = embeddings
    # Reduce input embeddings
    if reduced_embeddings is None:
        try:
            from umap import UMAP
            umap_model = UMAP(n_neighbors=15, n_components=2, min_dist=0.15, metric="cosine").fit(embeddings_to_reduce)
            embeddings_2d = umap_model.embedding_
        except (ImportError, ModuleNotFoundError):
            raise ModuleNotFoundError(
                "UMAP is required if the embeddings are not yet reduced in dimensionality. Please install it using `pip install umap-learn`."
            )
    else:
        embeddings_2d = reduced_embeddings
    unique_topics = set(topic_per_doc)
    # Prepare text and names
    if isinstance(custom_labels, str):
        names = [[[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in unique_topics]
        names = [" ".join([label[0] for label in labels[:4]]) for labels in names]
        names = [label if len(label) < 30 else label[:27] + "..." for label in names]
    elif topic_model.custom_labels_ is not None and custom_labels:
        names = [topic_model.custom_labels_[topic + topic_model._outliers] for topic in unique_topics]
    else:
        if topic_prefix:
            names = [
                f"Topic-{topic}: " + " ".join([word for word, value in topic_model.get_topic(topic)][:3])
                for topic in unique_topics
            ]
        else:
            names = [" ".join([word for word, value in topic_model.get_topic(topic)][:3]) for topic in unique_topics]
    topic_name_mapping = {topic_num: topic_name for topic_num, topic_name in zip(unique_topics, names)}
    topic_name_mapping[-1] = "Unlabelled"
    # If a set of topics is chosen, set everything else to "Unlabelled"
    if topics is not None:
        selected_topics = set(topics)
        for topic_num in topic_name_mapping:
            if topic_num not in selected_topics:
                topic_name_mapping[topic_num] = "Unlabelled"
    # Map in topic names and plot
    named_topic_per_doc = pd.Series(topic_per_doc).map(topic_name_mapping).values
    if interactive:
        figure = datamapplot.create_interactive_plot(
            embeddings_2d,
            named_topic_per_doc,
            hover_text=docs,
            enable_search=enable_search,
            width=width,
            height=height,
            **int_datamap_kwds,
        )
    else:
        figure, _ = datamapplot.create_plot(
            embeddings_2d,
            named_topic_per_doc,
            figsize=(width / 100, height / 100),
            dpi=100,
            title=title,
            sub_title=sub_title,
            **datamap_kwds,
        )
    return figure
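In `visualize_document_datamap` above, outliers and any topics outside the `topics` selection are all relabelled as "Unlabelled" before plotting. A small sketch of that relabelling with plain Python containers; the topic names and assignments are invented for illustration.

```python
# Made-up mapping from topic id to a display name (an assumption for this sketch).
topic_name_mapping = {-1: "misc", 0: "sports", 1: "politics", 2: "science"}

# Outliers always end up as "Unlabelled".
topic_name_mapping[-1] = "Unlabelled"

# Keep only the selected topics; everything else is relabelled.
selected_topics = {0, 2}
for topic_num in topic_name_mapping:
    if topic_num not in selected_topics:
        topic_name_mapping[topic_num] = "Unlabelled"

# Map per-document topic ids to their display names.
topic_per_doc = [0, 1, -1, 2, 1]
named_topic_per_doc = [topic_name_mapping[topic] for topic in topic_per_doc]
```

Overwriting values while iterating over the keys is safe here because the dictionary's size never changes.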
@@ -1,109 +0,0 @@
 import numpy as np
 from typing import Union
 import plotly.graph_objects as go
 def visualize_distribution(
    topic_model,
    probabilities: np.ndarray,
    min_probability: float = 0.015,
    custom_labels: Union[bool, str] = False,
    title: str = "<b>Topic Probability Distribution</b>",
    width: int = 800,
    height: int = 600,
 ) -> go.Figure:
    """Visualize the distribution of topic probabilities.
    Arguments:
        topic_model: A fitted BERTopic instance.
        probabilities: An array of probability scores
        min_probability: The minimum probability score to visualize.
                         All others are ignored.
        custom_labels: If bool, whether to use custom topic labels that were defined using
                       `topic_model.set_topic_labels`.
                       If `str`, it uses labels from other aspects, e.g., "Aspect1".
        title: Title of the plot.
        width: The width of the figure.
        height: The height of the figure.
    Examples:
    Make sure to fit the model before and only input the
    probabilities of a single document:
    ```python
    topic_model.visualize_distribution(probabilities[0])
    ```
    Or if you want to save the resulting figure:
    ```python
    fig = topic_model.visualize_distribution(probabilities[0])
    fig.write_html("path/to/file.html")
    ```
    <iframe src="../../getting_started/visualization/probabilities.html"
    style="width:1000px; height: 500px; border: 0px;"></iframe>
    """
    if len(probabilities.shape) != 1:
        raise ValueError(
            "This visualization cannot be used if you have set `calculate_probabilities` to False "
            "as it uses the topic probabilities of all topics. "
        )
    if len(probabilities[probabilities >= min_probability]) == 0:
        raise ValueError(
            "None of the supplied probabilities are equal to or higher than `min_probability`. "
            "Lower `min_probability` to prevent this error."
        )
    # Get the values and indices that equal or exceed the minimum probability
    labels_idx = np.argwhere(probabilities >= min_probability).flatten()
    vals = probabilities[labels_idx].tolist()
    # Create labels
    if isinstance(custom_labels, str):
        labels = [[[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in labels_idx]
        labels = ["_".join([label[0] for label in l[:4]]) for l in labels]  # noqa: E741
        labels = [label if len(label) < 30 else label[:27] + "..." for label in labels]
    elif topic_model.custom_labels_ is not None and custom_labels:
        labels = [topic_model.custom_labels_[idx + topic_model._outliers] for idx in labels_idx]
    else:
        labels = []
        vals = []  # Rebuild the values alongside the labels so both stay aligned
        for idx in labels_idx:
            words = topic_model.get_topic(idx)
            if words:
                label = [word[0] for word in words[:5]]
                label = f"<b>Topic {idx}</b>: {'_'.join(label)}"
                label = label[:40] + "..." if len(label) > 40 else label
                labels.append(label)
                vals.append(float(probabilities[idx]))
    # Create Figure
    fig = go.Figure(
        go.Bar(
            x=vals,
            y=labels,
            marker=dict(
                color="#C8D2D7",
                line=dict(color="#6E8484", width=1),
            ),
            orientation="h",
        )
    )
    fig.update_layout(
        xaxis_title="Probability",
        title={
            "text": f"{title}",
            "y": 0.95,
            "x": 0.5,
            "xanchor": "center",
            "yanchor": "top",
            "font": dict(size=22, color="Black"),
        },
        template="simple_white",
        width=width,
        height=height,
        hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
    )
    return fig
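`visualize_distribution` above keeps only the topics whose probability equals or exceeds `min_probability`. The filtering step can be sketched in isolation with NumPy; the probability values are made up for illustration.

```python
import numpy as np

# Made-up probabilities for a single document (an assumption for this sketch).
probabilities = np.array([0.01, 0.40, 0.015, 0.30, 0.005])
min_probability = 0.015

# Indices of the topics that equal or exceed the threshold.
labels_idx = np.argwhere(probabilities >= min_probability).flatten()

# The matching probability values, in the same order as the indices.
vals = probabilities[labels_idx].tolist()
```

Since the array has no outlier entry, each index in `labels_idx` doubles as the topic id, which is how the function looks up the topic words.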
@@ -1,263 +0,0 @@
 import numpy as np
 import pandas as pd
 import plotly.graph_objects as go
 from typing import List, Union
 def visualize_documents(
    topic_model,
    docs: List[str],
    topics: List[int] = None,
    embeddings: np.ndarray = None,
    reduced_embeddings: np.ndarray = None,
    sample: float = None,
    hide_annotations: bool = False,
    hide_document_hover: bool = False,
    custom_labels: Union[bool, str] = False,
    title: str = "<b>Documents and Topics</b>",
    width: int = 1200,
    height: int = 750,
 ):
    """Visualize documents and their topics in 2D.
    Arguments:
        topic_model: A fitted BERTopic instance.
        docs: The documents you used when calling either `fit` or `fit_transform`
        topics: A selection of topics to visualize.
                Not to be confused with the topics that you get from `.fit_transform`.
                For example, if you want to visualize only topics 1 through 5:
                `topics = [1, 2, 3, 4, 5]`.
        embeddings: The embeddings of all documents in `docs`.
        reduced_embeddings: The 2D reduced embeddings of all documents in `docs`.
        sample: The percentage of documents in each topic that you would like to keep.
                Value can be between 0 and 1. Setting this value to, for example,
                0.1 (10% of documents in each topic) makes it easier to visualize
                millions of documents as a subset is chosen.
        hide_annotations: Hide the names of the traces on top of each cluster.
        hide_document_hover: Hide the content of the documents when hovering over
                             specific points. Helps to speed up generation of visualization.
        custom_labels: If bool, whether to use custom topic labels that were defined using
                       `topic_model.set_topic_labels`.
                       If `str`, it uses labels from other aspects, e.g., "Aspect1".
        title: Title of the plot.
        width: The width of the figure.
        height: The height of the figure.
    Examples:
    To visualize the topics simply run:
    ```python
    topic_model.visualize_documents(docs)
    ```
    Do note that this re-calculates the embeddings and reduces them to 2D.
    The advised and preferred pipeline for using this function is as follows:
    ```python
    from sklearn.datasets import fetch_20newsgroups
    from sentence_transformers import SentenceTransformer
    from bertopic import BERTopic
    from umap import UMAP
    # Prepare embeddings
    docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
    sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = sentence_model.encode(docs, show_progress_bar=False)
    # Train BERTopic
    topic_model = BERTopic().fit(docs, embeddings)
    # Reduce dimensionality of embeddings, this step is optional
    # reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
    # Run the visualization with the original embeddings
    topic_model.visualize_documents(docs, embeddings=embeddings)
    # Or, if you have reduced the original embeddings already:
    topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)
    ```
    Or if you want to save the resulting figure:
    ```python
    fig = topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)
    fig.write_html("path/to/file.html")
    ```
    <iframe src="../../getting_started/visualization/documents.html"
    style="width:1000px; height: 800px; border: 0px;"></iframe>
    """
    topic_per_doc = topic_model.topics_
    # Sample the data to optimize for visualization and dimensionality reduction
    if sample is None or sample > 1:
        sample = 1
    indices = []
    for topic in set(topic_per_doc):
        s = np.where(np.array(topic_per_doc) == topic)[0]
        size = len(s) if len(s) < 100 else int(len(s) * sample)
        indices.extend(np.random.choice(s, size=size, replace=False))
    indices = np.array(indices)
    df = pd.DataFrame({"topic": np.array(topic_per_doc)[indices]})
    df["doc"] = [docs[index] for index in indices]
    df["topic"] = [topic_per_doc[index] for index in indices]
    # Extract embeddings if not already done
    if sample is None:
        if embeddings is None and reduced_embeddings is None:
            embeddings_to_reduce = topic_model._extract_embeddings(df.doc.to_list(), method="document")
        else:
            embeddings_to_reduce = embeddings
    else:
        if embeddings is not None:
            embeddings_to_reduce = embeddings[indices]
        elif embeddings is None and reduced_embeddings is None:
            embeddings_to_reduce = topic_model._extract_embeddings(df.doc.to_list(), method="document")
    # Reduce input embeddings
    if reduced_embeddings is None:
        try:
            from umap import UMAP
            umap_model = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric="cosine").fit(embeddings_to_reduce)
            embeddings_2d = umap_model.embedding_
        except (ImportError, ModuleNotFoundError):
            raise ModuleNotFoundError(
                "UMAP is required if the embeddings are not yet reduced in dimensionality. Please install it using `pip install umap-learn`."
            )
    elif sample is not None and reduced_embeddings is not None:
        embeddings_2d = reduced_embeddings[indices]
    elif sample is None and reduced_embeddings is not None:
        embeddings_2d = reduced_embeddings
    unique_topics = set(topic_per_doc)
    if topics is None:
        topics = unique_topics
    # Combine data
    df["x"] = embeddings_2d[:, 0]
    df["y"] = embeddings_2d[:, 1]
    # Prepare text and names
    if isinstance(custom_labels, str):
        names = [[[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in unique_topics]
        names = ["_".join([label[0] for label in labels[:4]]) for labels in names]
        names = [label if len(label) < 30 else label[:27] + "..." for label in names]
    elif topic_model.custom_labels_ is not None and custom_labels:
        names = [topic_model.custom_labels_[topic + topic_model._outliers] for topic in unique_topics]
    else:
        names = [
            f"{topic}_" + "_".join([word for word, value in topic_model.get_topic(topic)][:3])
            for topic in unique_topics
        ]
    # Visualize
    fig = go.Figure()
    # Outliers and non-selected topics
    non_selected_topics = set(unique_topics).difference(topics)
    if len(non_selected_topics) == 0:
        non_selected_topics = [-1]
    selection = df.loc[df.topic.isin(non_selected_topics), :]
    selection["text"] = ""
    selection.loc[len(selection), :] = [
        None,
        None,
        selection.x.mean(),
        selection.y.mean(),
        "Other documents",
    ]
    fig.add_trace(
        go.Scattergl(
            x=selection.x,
            y=selection.y,
            hovertext=selection.doc if not hide_document_hover else None,
            hoverinfo="text",
            mode="markers+text",
            name="other",
            showlegend=False,
            marker=dict(color="#CFD8DC", size=5, opacity=0.5),
        )
    )
    # Selected topics
    for name, topic in zip(names, unique_topics):
        if topic in topics and topic != -1:
            selection = df.loc[df.topic == topic, :]
            selection["text"] = ""
            if not hide_annotations:
                selection.loc[len(selection), :] = [
                    None,
                    None,
                    selection.x.mean(),
                    selection.y.mean(),
                    name,
                ]
            fig.add_trace(
                go.Scattergl(
                    x=selection.x,
                    y=selection.y,
                    hovertext=selection.doc if not hide_document_hover else None,
                    hoverinfo="text",
                    text=selection.text,
                    mode="markers+text",
                    name=name,
                    textfont=dict(
                        size=12,
                    ),
                    marker=dict(size=5, opacity=0.5),
                )
            )
    # Add grid in a 'plus' shape
    x_range = (
        df.x.min() - abs((df.x.min()) * 0.15),
        df.x.max() + abs((df.x.max()) * 0.15),
    )
    y_range = (
        df.y.min() - abs((df.y.min()) * 0.15),
        df.y.max() + abs((df.y.max()) * 0.15),
    )
    fig.add_shape(
        type="line",
        x0=sum(x_range) / 2,
        y0=y_range[0],
        x1=sum(x_range) / 2,
        y1=y_range[1],
        line=dict(color="#CFD8DC", width=2),
    )
    fig.add_shape(
        type="line",
        x0=x_range[0],
        y0=sum(y_range) / 2,
        x1=x_range[1],
        y1=sum(y_range) / 2,
        line=dict(color="#9E9E9E", width=2),
    )
    fig.add_annotation(x=x_range[0], y=sum(y_range) / 2, text="D1", showarrow=False, yshift=10)
    fig.add_annotation(y=y_range[1], x=sum(x_range) / 2, text="D2", showarrow=False, xshift=10)
    # Stylize layout
    fig.update_layout(
        template="simple_white",
        title={
            "text": f"{title}",
            "x": 0.5,
            "xanchor": "center",
            "yanchor": "top",
            "font": dict(size=22, color="Black"),
        },
        width=width,
        height=height,
    )
    fig.update_xaxes(visible=False)
    fig.update_yaxes(visible=False)
    return fig
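The `sample` argument of `visualize_documents` above is applied per topic, and topics with fewer than 100 documents are kept whole. A minimal sketch of that sampling rule, assuming made-up topic assignments and using a seeded `np.random.default_rng` generator in place of the global `np.random.choice` call so the result is reproducible:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, an assumption for reproducibility

# Made-up assignments: 300 documents in topic 0 and 50 documents in topic 1.
topic_per_doc = np.array([0] * 300 + [1] * 50)
sample = 0.1

indices = []
for topic in np.unique(topic_per_doc):
    members = np.where(topic_per_doc == topic)[0]
    # Small topics are kept whole; larger ones are down-sampled.
    size = len(members) if len(members) < 100 else int(len(members) * sample)
    indices.extend(rng.choice(members, size=size, replace=False))
indices = np.array(indices)
```

Here topic 0 is reduced to 30 documents while topic 1 keeps all 50, so small clusters stay visible in the plot.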
@@ -1,136 +0,0 @@
 import numpy as np
 from typing import List, Union
 from scipy.cluster.hierarchy import fcluster, linkage
 from sklearn.metrics.pairwise import cosine_similarity
 from bertopic._utils import select_topic_representation
 import plotly.express as px
 import plotly.graph_objects as go
 def visualize_heatmap(
    topic_model,
    topics: List[int] = None,
    top_n_topics: int = None,
    n_clusters: int = None,
    use_ctfidf: bool = False,
    custom_labels: Union[bool, str] = False,
    title: str = "<b>Similarity Matrix</b>",
    width: int = 800,
    height: int = 800,
 ) -> go.Figure:
    """Visualize a heatmap of the topic's similarity matrix.
    Based on the cosine similarity matrix between topic embeddings (either c-TF-IDF or the embeddings from the embedding
    model), a heatmap is created showing the similarity between topics.
    Arguments:
        topic_model: A fitted BERTopic instance.
        topics: A selection of topics to visualize.
        top_n_topics: Only select the top n most frequent topics.
        n_clusters: Create n clusters and order the similarity
                    matrix by those clusters.
        use_ctfidf: Whether to calculate distances between topics based on c-TF-IDF embeddings. If False, the embeddings
                    from the embedding model are used.
        custom_labels: If bool, whether to use custom topic labels that were defined using
                       `topic_model.set_topic_labels`.
                       If `str`, it uses labels from other aspects, e.g., "Aspect1".
        title: Title of the plot.
        width: The width of the figure.
        height: The height of the figure.
    Returns:
        fig: A plotly figure
    Examples:
    To visualize the similarity matrix of
    topics simply run:
    ```python
    topic_model.visualize_heatmap()
    ```
    Or if you want to save the resulting figure:
    ```python
    fig = topic_model.visualize_heatmap()
    fig.write_html("path/to/file.html")
    ```
    <iframe src="../../getting_started/visualization/heatmap.html"
    style="width:1000px; height: 720px; border: 0px;"></iframe>
    """
    embeddings = select_topic_representation(topic_model.c_tf_idf_, topic_model.topic_embeddings_, use_ctfidf)[0][
        topic_model._outliers :
    ]
    # Select topics based on top_n and topics args
    freq_df = topic_model.get_topic_freq()
    freq_df = freq_df.loc[freq_df.Topic != -1, :]
    if topics is not None:
        topics = list(topics)
    elif top_n_topics is not None:
        topics = sorted(freq_df.Topic.to_list()[:top_n_topics])
    else:
        topics = sorted(freq_df.Topic.to_list())
    # Order heatmap by similar clusters of topics
    sorted_topics = topics
    if n_clusters:
        if n_clusters >= len(set(topics)):
            raise ValueError("Make sure to set `n_clusters` lower than the total number of unique topics.")
        distance_matrix = cosine_similarity(embeddings[topics])
        Z = linkage(distance_matrix, "ward")
        clusters = fcluster(Z, t=n_clusters, criterion="maxclust")
        # Extract new order of topics
        mapping = {cluster: [] for cluster in clusters}
        for topic, cluster in zip(topics, clusters):
            mapping[cluster].append(topic)
        mapping = [cluster for cluster in mapping.values()]
        sorted_topics = [topic for cluster in mapping for topic in cluster]
    # Select embeddings (rows are indexed by topic id after the outlier row is dropped)
    indices = np.array(sorted_topics)
    embeddings = embeddings[indices]
    distance_matrix = cosine_similarity(embeddings)
    # Create labels
    if isinstance(custom_labels, str):
        new_labels = [
            [[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in sorted_topics
        ]
        new_labels = ["_".join([label[0] for label in labels[:4]]) for labels in new_labels]
        new_labels = [label if len(label) < 30 else label[:27] + "..." for label in new_labels]
    elif topic_model.custom_labels_ is not None and custom_labels:
        new_labels = [topic_model.custom_labels_[topic + topic_model._outliers] for topic in sorted_topics]
    else:
        new_labels = [[[str(topic), None]] + topic_model.get_topic(topic) for topic in sorted_topics]
        new_labels = ["_".join([label[0] for label in labels[:4]]) for labels in new_labels]
        new_labels = [label if len(label) < 30 else label[:27] + "..." for label in new_labels]
    fig = px.imshow(
        distance_matrix,
        labels=dict(color="Similarity Score"),
        x=new_labels,
        y=new_labels,
        color_continuous_scale="GnBu",
    )
    fig.update_layout(
        title={
            "text": f"{title}",
            "y": 0.95,
            "x": 0.55,
            "xanchor": "center",
            "yanchor": "top",
            "font": dict(size=22, color="Black"),
        },
        width=width,
        height=height,
        hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
    )
    fig.update_layout(showlegend=True)
    fig.update_layout(legend_title_text="Trend")
    return fig
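`visualize_heatmap` above builds its matrix from the cosine similarity between the embeddings of the selected topics. A rough NumPy-only sketch of that computation follows; `cosine_similarity_matrix` is a hypothetical stand-in for scikit-learn's `cosine_similarity`, and the embeddings are made up for illustration.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `embeddings`."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T


# Made-up topic embeddings; row i belongs to topic i (outlier row already dropped).
topic_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])

# Select rows by topic id, in the order they should appear on the heatmap axes.
sorted_topics = [3, 0, 2]
similarity = cosine_similarity_matrix(topic_embeddings[sorted_topics])
```

Topics 3 and 0 point in the same direction, so their similarity is 1 even though their norms differ.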
@@ -1,375 +0,0 @@
 import numpy as np
 import pandas as pd
 import plotly.graph_objects as go
 import math
 from typing import List, Union
 def visualize_hierarchical_documents(
    topic_model,
    docs: List[str],
    hierarchical_topics: pd.DataFrame,
    topics: List[int] = None,
    embeddings: np.ndarray = None,
    reduced_embeddings: np.ndarray = None,
    sample: Union[float, int] = None,
    hide_annotations: bool = False,
    hide_document_hover: bool = True,
    nr_levels: int = 10,
    level_scale: str = "linear",
    custom_labels: Union[bool, str] = False,
    title: str = "<b>Hierarchical Documents and Topics</b>",
    width: int = 1200,
    height: int = 750,
 ) -> go.Figure:
    """Visualize documents and their topics in 2D at different levels of hierarchy.
    Arguments:
        topic_model: A fitted BERTopic instance.
        docs: The documents you used when calling either `fit` or `fit_transform`
        hierarchical_topics: A dataframe that contains a hierarchy of topics
                             represented by their parents and their children
        topics: A selection of topics to visualize.
                Not to be confused with the topics that you get from `.fit_transform`.
                For example, if you want to visualize only topics 1 through 5:
                `topics = [1, 2, 3, 4, 5]`.
        embeddings: The embeddings of all documents in `docs`.
        reduced_embeddings: The 2D reduced embeddings of all documents in `docs`.
        sample: The fraction of documents in each topic that you would like to keep.
                Value can be between 0 and 1. Setting this value to, for example,
                0.1 (10% of documents in each topic) makes it easier to visualize
                millions of documents as a subset is chosen. Topics with fewer than
                100 documents are always kept in full.
        hide_annotations: Hide the names of the traces on top of each cluster.
        hide_document_hover: Hide the content of the documents when hovering over
                             specific points. Helps to speed up generation of visualizations.
        nr_levels: The number of levels to be visualized in the hierarchy. First, the distances
                   in `hierarchical_topics.Distance` are split into `nr_levels` lists of distances.
                   Then, for each list of distances, the merged topics are selected that have a
                   distance less than or equal to the maximum distance of the selected list of distances.
                   NOTE: To get all possible merged steps, make sure that `nr_levels` is equal to
                   the length of `hierarchical_topics`.
        level_scale: Whether to apply a linear ("linear" or "lin") or logarithmic ("log" or
                     "logarithmic") scale to the levels of the distance vector. Linear scaling will
                     perform an equal number of merges at each level while logarithmic scaling will
                     perform more merges in earlier levels to provide more resolution at higher
                     levels (this can be useful when the number of topics is large).
        custom_labels: If bool, whether to use custom topic labels that were defined using
                       `topic_model.set_topic_labels`.
                       If `str`, it uses labels from other aspects, e.g., "Aspect1".
                       NOTE: Custom labels are only generated for the original
                       un-merged topics.
        title: Title of the plot.
        width: The width of the figure.
        height: The height of the figure.
    Examples:
    To visualize the topics simply run:
    ```python
    topic_model.visualize_hierarchical_documents(docs, hierarchical_topics)
    ```
    Do note that this re-calculates the embeddings and reduces them to 2D.
    The advised and preferred pipeline for using this function is as follows:
    ```python
    from sklearn.datasets import fetch_20newsgroups
    from sentence_transformers import SentenceTransformer
    from bertopic import BERTopic
    from umap import UMAP
    # Prepare embeddings
    docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
    sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = sentence_model.encode(docs, show_progress_bar=False)
    # Train BERTopic and extract hierarchical topics
    topic_model = BERTopic().fit(docs, embeddings)
    hierarchical_topics = topic_model.hierarchical_topics(docs)
    # Reduce dimensionality of embeddings, this step is optional
    # reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
    # Run the visualization with the original embeddings
    topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, embeddings=embeddings)
    # Or, if you have reduced the original embeddings already:
    topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, reduced_embeddings=reduced_embeddings)
    ```
    Or if you want to save the resulting figure:
    ```python
    fig = topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, reduced_embeddings=reduced_embeddings)
    fig.write_html("path/to/file.html")
    ```
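    The per-topic sampling controlled by `sample` can be sketched with plain NumPy.
    This is only an illustration under stated assumptions: the toy `topic_per_doc`
    array and the seeded `rng` generator are hypothetical and not part of BERTopic.
    ```python
    import numpy as np
    # Hypothetical assignments: topic 0 has 500 documents, topic 1 has 40
    topic_per_doc = np.array([0] * 500 + [1] * 40)
    sample = 0.1
    rng = np.random.default_rng(42)
    indices = []
    for topic in np.unique(topic_per_doc):
        members = np.where(topic_per_doc == topic)[0]
        # Topics with fewer than 100 documents are kept in full
        size = len(members) if len(members) < 100 else int(len(members) * sample)
        indices.extend(rng.choice(members, size=size, replace=False))
    indices = np.array(indices)
    kept_per_topic = {int(t): int((topic_per_doc[indices] == t).sum()) for t in (0, 1)}
    ```
    Under these assumptions, the large topic is reduced to 10% of its documents
    while the small topic is left untouched.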
    Note:
        This visualization was inspired by the scatter plot representation of Doc2Map:
        https://github.com/louisgeisler/Doc2Map
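    How `nr_levels` and `level_scale` pick the distance threshold of each level can be
    sketched as follows. This is a simplified illustration, assuming a made-up list of
    merge distances sorted from the last merge to the first; the `distances` values are
    hypothetical and do not come from a fitted model.
    ```python
    import math
    import numpy as np
    # Hypothetical merge distances (largest first)
    distances = [1.9, 1.6, 1.4, 1.1, 0.9, 0.7, 0.5, 0.3]
    nr_levels = 4
    # Linear: split the row positions into `nr_levels` chunks and keep each chunk's last distance
    chunks = np.array_split(range(len(distances)), nr_levels)
    linear = [distances[chunk[-1]] for chunk in chunks][::-1]
    # Logarithmic: pick row positions that are evenly spaced on a log scale
    log_indices = (
        np.round(np.logspace(start=math.log(1, 10), stop=math.log(len(distances) - 1, 10), num=nr_levels))
        .astype(int)
        .tolist()
    )
    log_indices.reverse()
    logarithmic = [distances[i] for i in log_indices]
    ```
    Each resulting value is the maximum distance allowed at one slider level, so later
    levels include every merge made at earlier levels.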
    <iframe src="../../getting_started/visualization/hierarchical_documents.html"
    style="width:1000px; height: 770px; border: 0px;"></iframe>
    """
    topic_per_doc = topic_model.topics_
    # Sample the data to optimize for visualization and dimensionality reduction
    if sample is None or sample > 1:
        sample = 1
    indices = []
    for topic in set(topic_per_doc):
        s = np.where(np.array(topic_per_doc) == topic)[0]
        size = len(s) if len(s) < 100 else int(len(s) * sample)
        indices.extend(np.random.choice(s, size=size, replace=False))
    indices = np.array(indices)
    df = pd.DataFrame({"topic": np.array(topic_per_doc)[indices]})
    df["doc"] = [docs[index] for index in indices]
    # Extract embeddings if not already done. `sample` is never None at this point
    # (it defaults to 1 above), so the embeddings are always aligned to `indices`.
    if embeddings is not None:
        embeddings_to_reduce = embeddings[indices]
    elif reduced_embeddings is None:
        embeddings_to_reduce = topic_model._extract_embeddings(df.doc.to_list(), method="document")
    # Reduce input embeddings
    if reduced_embeddings is None:
        try:
            from umap import UMAP
            umap_model = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric="cosine").fit(embeddings_to_reduce)
            embeddings_2d = umap_model.embedding_
        except (ImportError, ModuleNotFoundError):
            raise ModuleNotFoundError(
                "UMAP is required if the embeddings are not yet reduced in dimensionality. Please install it using `pip install umap-learn`."
            )
    else:
        # `sample` is always set above, so subset the precomputed 2D embeddings to the sampled documents
        embeddings_2d = reduced_embeddings[indices]
    # Combine data
    df["x"] = embeddings_2d[:, 0]
    df["y"] = embeddings_2d[:, 1]
    # Create topic list for each level, levels are created by calculating the distance
    distances = hierarchical_topics.Distance.to_list()
    if level_scale == "log" or level_scale == "logarithmic":
        log_indices = (
            np.round(
                np.logspace(
                    start=math.log(1, 10),
                    stop=math.log(len(distances) - 1, 10),
                    num=nr_levels,
                )
            )
            .astype(int)
            .tolist()
        )
        log_indices.reverse()
        max_distances = [distances[i] for i in log_indices]
    elif level_scale == "lin" or level_scale == "linear":
        max_distances = [
            distances[indices[-1]] for indices in np.array_split(range(len(hierarchical_topics)), nr_levels)
        ][::-1]
    else:
        raise ValueError("level_scale needs to be one of 'log' or 'linear'")
    for index, max_distance in enumerate(max_distances):
        # Get topics below `max_distance`
        mapping = {topic: topic for topic in df.topic.unique()}
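        # Copy the slice so that the type conversion does not write into a view of `hierarchical_topics`
        selection = hierarchical_topics.loc[hierarchical_topics.Distance <= max_distance, :].copy()
        selection["Parent_ID"] = selection.Parent_ID.astype(int)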
        selection = selection.sort_values("Parent_ID")
        for row in selection.iterrows():
            for topic in row[1].Topics:
                mapping[topic] = row[1].Parent_ID
        # Make sure the mappings are mapped 1:1
        mappings = [True for _ in mapping]
        while any(mappings):
            for i, (key, value) in enumerate(mapping.items()):
                if value in mapping.keys() and key != value:
                    mapping[key] = mapping[value]
                else:
                    mappings[i] = False
        # Create new column
        df[f"level_{index + 1}"] = df.topic.map(mapping)
        df[f"level_{index + 1}"] = df[f"level_{index + 1}"].astype(int)
    # Prepare topic names of original and merged topics
    trace_names = []
    topic_names = {}
    for topic in range(hierarchical_topics.Parent_ID.astype(int).max()):
        if topic < hierarchical_topics.Parent_ID.astype(int).min():
            if topic_model.get_topic(topic):
                if isinstance(custom_labels, str):
                    trace_name = f"{topic}_" + "_".join(
                        list(zip(*topic_model.topic_aspects_[custom_labels][topic]))[0][:3]
                    )
                elif topic_model.custom_labels_ is not None and custom_labels:
                    trace_name = topic_model.custom_labels_[topic + topic_model._outliers]
                else:
                    trace_name = f"{topic}_" + "_".join([word[:20] for word, _ in topic_model.get_topic(topic)][:3])
                topic_names[topic] = {
                    "trace_name": trace_name[:40],
                    "plot_text": trace_name[:40],
                }
                trace_names.append(trace_name)
        else:
            trace_name = (
                f"{topic}_"
                + hierarchical_topics.loc[hierarchical_topics.Parent_ID == str(topic), "Parent_Name"].values[0]
            )
            plot_text = "_".join([name[:20] for name in trace_name.split("_")[:3]])
            topic_names[topic] = {
                "trace_name": trace_name[:40],
                "plot_text": plot_text[:40],
            }
            trace_names.append(trace_name)
    # Prepare traces
    all_traces = []
    for level in range(len(max_distances)):
        traces = []
        # Outliers
        if topic_model._outliers:
            traces.append(
                go.Scattergl(
                    x=df.loc[(df[f"level_{level + 1}"] == -1), "x"],
                    y=df.loc[df[f"level_{level + 1}"] == -1, "y"],
                    mode="markers+text",
                    name="other",
                    hoverinfo="text",
                    hovertext=df.loc[(df[f"level_{level + 1}"] == -1), "doc"] if not hide_document_hover else None,
                    showlegend=False,
                    marker=dict(color="#CFD8DC", size=5, opacity=0.5),
                )
            )
        # Selected topics
        if topics:
            selection = df.loc[(df.topic.isin(topics)), :]
            unique_topics = sorted([int(topic) for topic in selection[f"level_{level + 1}"].unique()])
        else:
            unique_topics = sorted([int(topic) for topic in df[f"level_{level + 1}"].unique()])
        for topic in unique_topics:
            if topic != -1:
                if topics:
                    selection = df.loc[(df[f"level_{level + 1}"] == topic) & (df.topic.isin(topics)), :]
                else:
                    selection = df.loc[df[f"level_{level + 1}"] == topic, :]
                if not hide_annotations:
                    selection.loc[len(selection), :] = None
                    selection["text"] = ""
                    selection.loc[len(selection) - 1, "x"] = selection.x.mean()
                    selection.loc[len(selection) - 1, "y"] = selection.y.mean()
                    selection.loc[len(selection) - 1, "text"] = topic_names[int(topic)]["plot_text"]
                traces.append(
                    go.Scattergl(
                        x=selection.x,
                        y=selection.y,
                        text=selection.text if not hide_annotations else None,
                        hovertext=selection.doc if not hide_document_hover else None,
                        hoverinfo="text",
                        name=topic_names[int(topic)]["trace_name"],
                        mode="markers+text",
                        marker=dict(size=5, opacity=0.5),
                    )
                )
        all_traces.append(traces)
    # Track and count traces
    nr_traces_per_set = [len(traces) for traces in all_traces]
    trace_indices = [(0, nr_traces_per_set[0])]
    for index, nr_traces in enumerate(nr_traces_per_set[1:]):
        start = trace_indices[index][1]
        end = nr_traces + start
        trace_indices.append((start, end))
    # Visualization
    fig = go.Figure()
    for traces in all_traces:
        for trace in traces:
            fig.add_trace(trace)
    for index in range(len(fig.data)):
        if index >= nr_traces_per_set[0]:
            fig.data[index].visible = False
    # Create and add slider
    steps = []
    for index, indices in enumerate(trace_indices):
        step = dict(
            method="update",
            label=str(index),
            args=[{"visible": [False] * len(fig.data)}],
        )
        # Use a separate name so the outer `index` (the slider level) is not shadowed
        for offset in range(indices[1] - indices[0]):
            step["args"][0]["visible"][offset + indices[0]] = True
        steps.append(step)
    sliders = [dict(currentvalue={"prefix": "Level: "}, pad={"t": 20}, steps=steps)]
    # Add grid in a 'plus' shape
    x_range = (
        df.x.min() - abs((df.x.min()) * 0.15),
        df.x.max() + abs((df.x.max()) * 0.15),
    )
    y_range = (
        df.y.min() - abs((df.y.min()) * 0.15),
        df.y.max() + abs((df.y.max()) * 0.15),
    )
    fig.add_shape(
        type="line",
        x0=sum(x_range) / 2,
        y0=y_range[0],
        x1=sum(x_range) / 2,
        y1=y_range[1],
        line=dict(color="#CFD8DC", width=2),
    )
    fig.add_shape(
        type="line",
        x0=x_range[0],
        y0=sum(y_range) / 2,
        x1=x_range[1],
        y1=sum(y_range) / 2,
        line=dict(color="#9E9E9E", width=2),
    )
    fig.add_annotation(x=x_range[0], y=sum(y_range) / 2, text="D1", showarrow=False, yshift=10)
    fig.add_annotation(y=y_range[1], x=sum(x_range) / 2, text="D2", showarrow=False, xshift=10)
    # Stylize layout
    fig.update_layout(
        sliders=sliders,
        template="simple_white",
        title={
            "text": f"{title}",
            "x": 0.5,
            "xanchor": "center",
            "yanchor": "top",
            "font": dict(size=22, color="Black"),
        },
        width=width,
        height=height,
    )
    fig.update_xaxes(visible=False)
    fig.update_yaxes(visible=False)
    return fig
@@ -1,330 +0,0 @@
 import numpy as np
 import pandas as pd
 from typing import Callable, List, Union
 from scipy.sparse import csr_matrix
 from scipy.cluster import hierarchy as sch
 from sklearn.metrics.pairwise import cosine_similarity
 from bertopic._utils import select_topic_representation
 import plotly.graph_objects as go
 import plotly.figure_factory as ff
 from bertopic._utils import validate_distance_matrix
 def visualize_hierarchy(
    topic_model,
    orientation: str = "left",
    topics: List[int] = None,
    top_n_topics: int = None,
    use_ctfidf: bool = True,
    custom_labels: Union[bool, str] = False,
    title: str = "<b>Hierarchical Clustering</b>",
    width: int = 1000,
    height: int = 600,
    hierarchical_topics: pd.DataFrame = None,
    linkage_function: Callable[[csr_matrix], np.ndarray] = None,
    distance_function: Callable[[csr_matrix], csr_matrix] = None,
    color_threshold: int = 1,
 ) -> go.Figure:
    """Visualize a hierarchical structure of the topics.
    A Ward linkage function is used to perform the
    hierarchical clustering based on the cosine distance
    matrix between topic embeddings (either c-TF-IDF or the embeddings from the embedding model).
    Arguments:
        topic_model: A fitted BERTopic instance.
        orientation: The orientation of the figure.
                     Either 'left' or 'bottom'
        topics: A selection of topics to visualize
        top_n_topics: Only select the top n most frequent topics
        use_ctfidf: Whether to calculate distances between topics based on c-TF-IDF embeddings. If False, the embeddings
                    from the embedding model are used.
        custom_labels: If bool, whether to use custom topic labels that were defined using
                       `topic_model.set_topic_labels`.
                       If `str`, it uses labels from other aspects, e.g., "Aspect1".
                       NOTE: Custom labels are only generated for the original
                       un-merged topics.
        title: Title of the plot.
        width: The width of the figure. Only works if orientation is set to 'left'
        height: The height of the figure. Only works if orientation is set to 'bottom'
        hierarchical_topics: A dataframe that contains a hierarchy of topics
                             represented by their parents and their children.
                             NOTE: The hierarchical topic names are only visualized
                             if both `topics` and `top_n_topics` are not set.
        linkage_function: The linkage function to use. Default is:
                          `lambda x: sch.linkage(x, 'ward', optimal_ordering=True)`
                          NOTE: Make sure to use the same `linkage_function` as used
                          in `topic_model.hierarchical_topics`.
        distance_function: The distance function to use on the c-TF-IDF matrix. Default is:
                           `lambda x: 1 - cosine_similarity(x)`.
                            You can pass any function that returns either a square matrix of
                            shape (n_samples, n_samples) with zeros on the diagonal and
                            non-negative values or condensed distance matrix of shape
                            (n_samples * (n_samples - 1) / 2,) containing the upper
                            triangular of the distance matrix.
                           NOTE: Make sure to use the same `distance_function` as used
                           in `topic_model.hierarchical_topics`.
        color_threshold: Value at which the separation of clusters will be made which
                         will result in different colors for different clusters.
                         A higher value will typically lead to fewer colored clusters.
    Returns:
        fig: A plotly figure
    Examples:
    To visualize the hierarchical structure of
    topics simply run:
    ```python
    topic_model.visualize_hierarchy()
    ```
    If you also want the labels of the hierarchical topics visualized,
    run the following:
    ```python
    # Extract hierarchical topics and their representations
    hierarchical_topics = topic_model.hierarchical_topics(docs)
    # Visualize these representations
    topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
    ```
    If you want to save the resulting figure:
    ```python
    fig = topic_model.visualize_hierarchy()
    fig.write_html("path/to/file.html")
    ```
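    Both return shapes accepted from a `distance_function` (square or condensed) can be
    illustrated with plain NumPy. This is a minimal sketch, assuming five random dense
    vectors in place of real topic embeddings; `cosine_distance_square` and
    `to_condensed` are hypothetical helper names, not part of BERTopic.
    ```python
    import numpy as np
    def cosine_distance_square(x):
        # Square (n_samples, n_samples) cosine distance matrix for a dense array
        unit = x / np.linalg.norm(x, axis=1, keepdims=True)
        distances = 1 - unit @ unit.T
        np.fill_diagonal(distances, 0)  # remove floating-point noise on the diagonal
        return np.clip(distances, 0, None)  # guard against tiny negative values
    def to_condensed(square):
        # Upper triangle, row by row, excluding the diagonal
        return square[np.triu_indices(square.shape[0], k=1)]
    x = np.random.default_rng(0).random((5, 8))
    square = cosine_distance_square(x)
    condensed = to_condensed(square)
    ```
    For five samples, the condensed form holds 5 * 4 / 2 = 10 values. A function like
    `cosine_distance_square` could then be passed as `distance_function`, provided the
    same function is also used in `topic_model.hierarchical_topics`.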
    <iframe src="../../getting_started/visualization/hierarchy.html"
    style="width:1000px; height: 680px; border: 0px;"></iframe>
    """
    if distance_function is None:
        distance_function = lambda x: 1 - cosine_similarity(x)
    if linkage_function is None:
        linkage_function = lambda x: sch.linkage(x, "ward", optimal_ordering=True)
    # Select topics based on top_n and topics args
    freq_df = topic_model.get_topic_freq()
    freq_df = freq_df.loc[freq_df.Topic != -1, :]
    if topics is not None:
        topics = list(topics)
    elif top_n_topics is not None:
        topics = sorted(freq_df.Topic.to_list()[:top_n_topics])
    else:
        topics = sorted(freq_df.Topic.to_list())
    # Select embeddings
    all_topics = sorted(list(topic_model.get_topics().keys()))
    indices = np.array([all_topics.index(topic) for topic in topics])
    # Select topic embeddings
    embeddings = select_topic_representation(topic_model.c_tf_idf_, topic_model.topic_embeddings_, use_ctfidf)[0][
        indices
    ]
    # Annotations
    if hierarchical_topics is not None and len(topics) == len(freq_df.Topic.to_list()):
        annotations = _get_annotations(
            topic_model=topic_model,
            hierarchical_topics=hierarchical_topics,
            embeddings=embeddings,
            distance_function=distance_function,
            linkage_function=linkage_function,
            orientation=orientation,
            custom_labels=custom_labels,
        )
    else:
        annotations = None
    # wrap distance function to validate input and return a condensed distance matrix
    distance_function_viz = lambda x: validate_distance_matrix(distance_function(x), embeddings.shape[0])
    # Create dendrogram
    fig = ff.create_dendrogram(
        embeddings,
        orientation=orientation,
        distfun=distance_function_viz,
        linkagefun=linkage_function,
        hovertext=annotations,
        color_threshold=color_threshold,
    )
    # Create nicer labels
    axis = "yaxis" if orientation == "left" else "xaxis"
    if isinstance(custom_labels, str):
        new_labels = [
            [[str(x), None]] + topic_model.topic_aspects_[custom_labels][x] for x in fig.layout[axis]["ticktext"]
        ]
        new_labels = ["_".join([label[0] for label in labels[:4]]) for labels in new_labels]
        new_labels = [label if len(label) < 30 else label[:27] + "..." for label in new_labels]
    elif topic_model.custom_labels_ is not None and custom_labels:
        new_labels = [
            topic_model.custom_labels_[topics[int(x)] + topic_model._outliers] for x in fig.layout[axis]["ticktext"]
        ]
    else:
        new_labels = [
            [[str(topics[int(x)]), None]] + topic_model.get_topic(topics[int(x)]) for x in fig.layout[axis]["ticktext"]
        ]
        new_labels = ["_".join([label[0] for label in labels[:4]]) for labels in new_labels]
        new_labels = [label if len(label) < 30 else label[:27] + "..." for label in new_labels]
    # Stylize layout
    fig.update_layout(
        plot_bgcolor="#ECEFF1",
        template="plotly_white",
        title={
            "text": f"{title}",
            "x": 0.5,
            "xanchor": "center",
            "yanchor": "top",
            "font": dict(size=22, color="Black"),
        },
        hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
    )
    # Stylize orientation
    if orientation == "left":
        fig.update_layout(
            height=200 + (15 * len(topics)),
            width=width,
            yaxis=dict(tickmode="array", ticktext=new_labels),
        )
        # Fix empty space on the bottom of the graph
        y_max = max([trace["y"].max() + 5 for trace in fig["data"]])
        y_min = min([trace["y"].min() - 5 for trace in fig["data"]])
        fig.update_layout(yaxis=dict(range=[y_min, y_max]))
    else:
        fig.update_layout(
            width=200 + (15 * len(topics)),
            height=height,
            xaxis=dict(tickmode="array", ticktext=new_labels),
        )
    if hierarchical_topics is not None:
        for index in [0, 3]:
            axis = "x" if orientation == "left" else "y"
            xs = [data["x"][index] for data in fig.data if (data["text"] and data[axis][index] > 0)]
            ys = [data["y"][index] for data in fig.data if (data["text"] and data[axis][index] > 0)]
            hovertext = [data["text"][index] for data in fig.data if (data["text"] and data[axis][index] > 0)]
            fig.add_trace(
                go.Scatter(
                    x=xs,
                    y=ys,
                    marker_color="black",
                    hovertext=hovertext,
                    hoverinfo="text",
                    mode="markers",
                    showlegend=False,
                )
            )
    return fig
 def _get_annotations(
    topic_model,
    hierarchical_topics: pd.DataFrame,
    embeddings: csr_matrix,
    linkage_function: Callable[[csr_matrix], np.ndarray],
    distance_function: Callable[[csr_matrix], csr_matrix],
    orientation: str,
    custom_labels: Union[bool, str] = False,
 ) -> List[List[str]]:
    """Get annotations by replicating linkage function calculation in scipy.
    Arguments:
        topic_model: A fitted BERTopic instance.
        hierarchical_topics: A dataframe that contains a hierarchy of topics
                             represented by their parents and their children.
                             NOTE: The hierarchical topic names are only visualized
                             if both `topics` and `top_n_topics` are not set.
        embeddings: The c-TF-IDF matrix on which to model the hierarchy
        linkage_function: The linkage function to use. Default is:
                          `lambda x: sch.linkage(x, 'ward', optimal_ordering=True)`
                          NOTE: Make sure to use the same `linkage_function` as used
                          in `topic_model.hierarchical_topics`.
        distance_function: The distance function to use on the c-TF-IDF matrix. Default is:
                           `lambda x: 1 - cosine_similarity(x)`.
                            You can pass any function that returns either a square matrix of
                            shape (n_samples, n_samples) with zeros on the diagonal and
                            non-negative values or condensed distance matrix of shape
                            (n_samples * (n_samples - 1) / 2,) containing the upper
                            triangular of the distance matrix.
                           NOTE: Make sure to use the same `distance_function` as used
                           in `topic_model.hierarchical_topics`.
        orientation: The orientation of the figure.
                     Either 'left' or 'bottom'
        custom_labels: If bool, whether to use custom topic labels that were defined using
                       `topic_model.set_topic_labels`.
                       If `str`, it uses labels from other aspects, e.g., "Aspect1".
                       NOTE: Custom labels are only generated for the original
                       un-merged topics.
    Returns:
        text_annotations: Annotations to be used within Plotly's `ff.create_dendrogram`
    """
    df = hierarchical_topics.loc[hierarchical_topics.Parent_Name != "Top", :]
    # Calculate distance
    X = distance_function(embeddings)
    X = validate_distance_matrix(X, embeddings.shape[0])
    # Calculate linkage and generate dendrogram
    Z = linkage_function(X)
    P = sch.dendrogram(Z, orientation=orientation, no_plot=True)
    # store topic no.(leaves) corresponding to the x-ticks in dendrogram
    x_ticks = np.arange(5, len(P["leaves"]) * 10 + 5, 10)
    x_topic = dict(zip(P["leaves"], x_ticks))
    topic_vals = dict()
    for key, val in x_topic.items():
        topic_vals[val] = [key]
    parent_topic = dict(zip(df.Parent_ID, df.Topics))
    # loop through every trace (scatter plot) in dendrogram
    text_annotations = []
    for index, trace in enumerate(P["icoord"]):
        fst_topic = topic_vals[trace[0]]
        scnd_topic = topic_vals[trace[2]]
        if len(fst_topic) == 1:
            if isinstance(custom_labels, str):
                fst_name = f"{fst_topic[0]}_" + "_".join(
                    list(zip(*topic_model.topic_aspects_[custom_labels][fst_topic[0]]))[0][:3]
                )
            elif topic_model.custom_labels_ is not None and custom_labels:
                fst_name = topic_model.custom_labels_[fst_topic[0] + topic_model._outliers]
            else:
                fst_name = "_".join([word for word, _ in topic_model.get_topic(fst_topic[0])][:5])
        else:
            for key, value in parent_topic.items():
                if set(value) == set(fst_topic):
                    fst_name = df.loc[df.Parent_ID == key, "Parent_Name"].values[0]
        if len(scnd_topic) == 1:
            if isinstance(custom_labels, str):
                scnd_name = f"{scnd_topic[0]}_" + "_".join(
                    list(zip(*topic_model.topic_aspects_[custom_labels][scnd_topic[0]]))[0][:3]
                )
            elif topic_model.custom_labels_ is not None and custom_labels:
                scnd_name = topic_model.custom_labels_[scnd_topic[0] + topic_model._outliers]
            else:
                scnd_name = "_".join([word for word, _ in topic_model.get_topic(scnd_topic[0])][:5])
        else:
            for key, value in parent_topic.items():
                if set(value) == set(scnd_topic):
                    scnd_name = df.loc[df.Parent_ID == key, "Parent_Name"].values[0]
        text_annotations.append([fst_name, "", "", scnd_name])
        center = (trace[0] + trace[2]) / 2
        topic_vals[center] = fst_topic + scnd_topic
    return text_annotations
@@ -1,131 +0,0 @@
 import numpy as np
 from typing import List, Union
 import plotly.graph_objects as go
 def visualize_term_rank(
    topic_model,
    topics: List[int] = None,
    log_scale: bool = False,
    custom_labels: Union[bool, str] = False,
    title: str = "<b>Term score decline per Topic</b>",
    width: int = 800,
    height: int = 500,
 ) -> go.Figure:
    """Visualize the ranks of all terms across all topics.
    Each topic is represented by a set of words. These words, however,
    do not all equally represent the topic. This visualization shows
    how many words are needed to represent a topic and at which point
    the beneficial effect of adding words starts to decline.
    Arguments:
        topic_model: A fitted BERTopic instance.
        topics: A selection of topics to visualize. These will be colored
                red where all others will be colored black.
        log_scale: Whether to represent the ranking on a log scale
        custom_labels: If bool, whether to use custom topic labels that were defined using
                       `topic_model.set_topic_labels`.
                       If `str`, it uses labels from other aspects, e.g., "Aspect1".
        title: Title of the plot.
        width: The width of the figure.
        height: The height of the figure.
    Returns:
        fig: A plotly figure
    Examples:
    To visualize the ranks of all words across
    all topics simply run:
    ```python
    topic_model.visualize_term_rank()
    ```
    Or if you want to save the resulting figure:
    ```python
    fig = topic_model.visualize_term_rank()
    fig.write_html("path/to/file.html")
    ```
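    Before plotting, zero scores are replaced by the smallest positive score, which
    keeps the logarithm defined when `log_scale=True`. A minimal sketch of that step,
    assuming a made-up score matrix (the `values` array is hypothetical data):
    ```python
    import numpy as np
    # Hypothetical term scores for two topics, ordered by rank
    values = np.array([[0.5, 0.1, 0.0], [0.2, 0.05, 0.01]])
    y = values[0].copy()
    # Replace zeros with the smallest positive score across all topics (0.01 here)
    y[y == 0] = values[values > 0].min()
    y = np.log10(y)
    ```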
    <iframe src="../../getting_started/visualization/term_rank.html"
    style="width:1000px; height: 530px; border: 0px;"></iframe>
    <iframe src="../../getting_started/visualization/term_rank_log.html"
    style="width:1000px; height: 530px; border: 0px;"></iframe>
    Reference:
    This visualization was heavily inspired by the
    "Term Probability Decline" visualization found in an
    analysis by the amazing [tmtoolkit](https://tmtoolkit.readthedocs.io/).
    Reference to that specific analysis can be found
    [here](https://wzbsocialsciencecenter.github.io/tm_corona/tm_analysis.html).
    """
    topics = [] if topics is None else topics
    topic_ids = topic_model.get_topic_info().Topic.unique().tolist()
    topic_words = [topic_model.get_topic(topic) for topic in topic_ids]
    values = np.array([[value[1] for value in values] for values in topic_words])
    indices = np.array([[value + 1 for value in range(len(values))] for values in topic_words])
    # Create figure
    lines = []
    for topic, x, y in zip(topic_ids, indices, values):
        if not any(y > 1.5):
            # labels
            if isinstance(custom_labels, str):
                label = f"{topic}_" + "_".join(list(zip(*topic_model.topic_aspects_[custom_labels][topic]))[0][:3])
            elif topic_model.custom_labels_ is not None and custom_labels:
                label = topic_model.custom_labels_[topic + topic_model._outliers]
            else:
                label = f"<b>Topic {topic}</b>:" + "_".join([word[0] for word in topic_model.get_topic(topic)])
                label = label[:50]
            # line parameters
            color = "red" if topic in topics else "black"
            opacity = 1 if topic in topics else 0.1
            if any(y == 0):
                y[y == 0] = min(values[values > 0])
            y = np.log10(y, out=y, where=y > 0) if log_scale else y
            line = go.Scatter(
                x=x,
                y=y,
                name="",
                hovertext=label,
                mode="lines",
                opacity=opacity,
                line=dict(color=color, width=1.5),
            )
            lines.append(line)
    fig = go.Figure(data=lines)
    # Stylize layout
    fig.update_xaxes(range=[0, len(indices[0])], tick0=1, dtick=2)
    fig.update_layout(
        showlegend=False,
        template="plotly_white",
        title={
            "text": f"{title}",
            "y": 0.9,
            "x": 0.5,
            "xanchor": "center",
            "yanchor": "top",
            "font": dict(size=22, color="Black"),
        },
        width=width,
        height=height,
        hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
    )
    fig.update_xaxes(title_text="Term Rank")
    if log_scale:
        fig.update_yaxes(title_text="c-TF-IDF score (log scale)")
    else:
        fig.update_yaxes(title_text="c-TF-IDF score")
    return fig
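The zero handling and log transform in `visualize_term_rank` can be illustrated in isolation. The sketch below is a plain-Python stand-in, not the function's NumPy code, and `log_scale_scores` is a hypothetical helper name. It replaces zero scores with the smallest positive score found across all topics before taking log10, which keeps the log-scale plot free of undefined values.

```python
import math
from typing import List


def log_scale_scores(all_scores: List[List[float]]) -> List[List[float]]:
    """Replace zero scores with the smallest positive score, then take log10.

    Hypothetical helper that mirrors the zero handling in visualize_term_rank.
    """
    # Smallest strictly positive score over every topic, used as the floor.
    floor = min(score for scores in all_scores for score in scores if score > 0)
    return [
        [math.log10(score if score > 0 else floor) for score in scores]
        for scores in all_scores
    ]


# Two topics with three ranked terms each; the last term of each scores zero.
scores = [[0.1, 0.01, 0.0], [1.0, 0.001, 0.0]]
scaled = log_scale_scores(scores)
```

Because the floor is taken over all topics, a zero in one topic is mapped to the smallest positive score of any topic, not only its own.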
@@ -1,212 +0,0 @@
 import numpy as np
 import pandas as pd
 try:
    from umap import UMAP
    HAS_UMAP = True
 except (ImportError, ModuleNotFoundError):
    HAS_UMAP = False
 from typing import List, Union
 from sklearn.preprocessing import MinMaxScaler
 from bertopic._utils import select_topic_representation
 import plotly.express as px
 import plotly.graph_objects as go
 def visualize_topics(
    topic_model,
    topics: List[int] = None,
    top_n_topics: int = None,
    use_ctfidf: bool = False,
    custom_labels: Union[bool, str] = False,
    title: str = "<b>Intertopic Distance Map</b>",
    width: int = 650,
    height: int = 650,
 ) -> go.Figure:
    """Visualize topics, their sizes, and their corresponding words.
    This visualization is highly inspired by LDAvis, a great visualization
    technique typically reserved for LDA.
    Arguments:
        topic_model: A fitted BERTopic instance.
        topics: A selection of topics to visualize
        top_n_topics: Only select the top n most frequent topics
        use_ctfidf: Whether to use c-TF-IDF representations instead of the embeddings from the embedding model.
        custom_labels: If bool, whether to use custom topic labels that were defined using
                       `topic_model.set_topic_labels`.
                       If `str`, it uses labels from other aspects, e.g., "Aspect1".
        title: Title of the plot.
        width: The width of the figure.
        height: The height of the figure.
    Examples:
    To visualize the topics simply run:
    ```python
    topic_model.visualize_topics()
    ```
    Or if you want to save the resulting figure:
    ```python
    fig = topic_model.visualize_topics()
    fig.write_html("path/to/file.html")
    ```
    <iframe src="../../getting_started/visualization/viz.html"
    style="width:1000px; height: 680px; border: 0px;"></iframe>
    """
    # Select topics based on top_n and topics args
    freq_df = topic_model.get_topic_freq()
    freq_df = freq_df.loc[freq_df.Topic != -1, :]
    if topics is not None:
        topics = list(topics)
    elif top_n_topics is not None:
        topics = sorted(freq_df.Topic.to_list()[:top_n_topics])
    else:
        topics = sorted(freq_df.Topic.to_list())
    # Extract topic words and their frequencies
    topic_list = sorted(topics)
    frequencies = [topic_model.topic_sizes_[topic] for topic in topic_list]
    if isinstance(custom_labels, str):
        words = [[[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in topic_list]
        words = ["_".join([label[0] for label in labels[:4]]) for labels in words]
        words = [label if len(label) < 30 else label[:27] + "..." for label in words]
    elif custom_labels and topic_model.custom_labels_ is not None:
        words = [topic_model.custom_labels_[topic + topic_model._outliers] for topic in topic_list]
    else:
        words = [" | ".join([word[0] for word in topic_model.get_topic(topic)[:5]]) for topic in topic_list]
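    # Embed the selected topic representations (c-TF-IDF or embeddings) into 2D.
    # Index with the sorted `topic_list` so rows line up with the labels and sizes above.
    all_topics = sorted(list(topic_model.get_topics().keys()))
    indices = np.array([all_topics.index(topic) for topic in topic_list])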
    embeddings, c_tfidf_used = select_topic_representation(
        topic_model.c_tf_idf_,
        topic_model.topic_embeddings_,
        use_ctfidf=use_ctfidf,
        output_ndarray=True,
    )
    embeddings = embeddings[indices]
    if HAS_UMAP:
        if c_tfidf_used:
            embeddings = MinMaxScaler().fit_transform(embeddings)
            embeddings = UMAP(n_neighbors=2, n_components=2, metric="hellinger", random_state=42).fit_transform(
                embeddings
            )
        else:
            embeddings = UMAP(n_neighbors=2, n_components=2, metric="cosine", random_state=42).fit_transform(embeddings)
    else:
        raise ModuleNotFoundError(
            "UMAP is required to reduce the embeddings. Please install it using `pip install umap-learn`."
        )
    # Visualize with plotly
    df = pd.DataFrame(
        {
            "x": embeddings[:, 0],
            "y": embeddings[:, 1],
            "Topic": topic_list,
            "Words": words,
            "Size": frequencies,
        }
    )
    return _plotly_topic_visualization(df, topic_list, title, width, height)
 def _plotly_topic_visualization(
    df: pd.DataFrame, topic_list: List[int], title: str, width: int, height: int
 ) -> go.Figure:
    """Create plotly-based visualization of topics with a slider for topic selection."""
    def get_color(topic_selected):
        if topic_selected == -1:
            marker_color = ["#B0BEC5" for _ in topic_list]
        else:
            marker_color = ["red" if topic == topic_selected else "#B0BEC5" for topic in topic_list]
        return [{"marker.color": [marker_color]}]
    # Prepare figure range
    x_range = (
        df.x.min() - abs((df.x.min()) * 0.15),
        df.x.max() + abs((df.x.max()) * 0.15),
    )
    y_range = (
        df.y.min() - abs((df.y.min()) * 0.15),
        df.y.max() + abs((df.y.max()) * 0.15),
    )
    # Plot topics
    fig = px.scatter(
        df,
        x="x",
        y="y",
        size="Size",
        size_max=40,
        template="simple_white",
        labels={"x": "", "y": ""},
        hover_data={"Topic": True, "Words": True, "Size": True, "x": False, "y": False},
    )
    fig.update_traces(marker=dict(color="#B0BEC5", line=dict(width=2, color="DarkSlateGrey")))
    # Update hover order
    fig.update_traces(
        hovertemplate="<br>".join(
            [
                "<b>Topic %{customdata[0]}</b>",
                "%{customdata[1]}",
                "Size: %{customdata[2]}",
            ]
        )
    )
    # Create a slider for topic selection
    steps = [dict(label=f"Topic {topic}", method="update", args=get_color(topic)) for topic in topic_list]
    sliders = [dict(active=0, pad={"t": 50}, steps=steps)]
    # Stylize layout
    fig.update_layout(
        title={
            "text": f"{title}",
            "y": 0.95,
            "x": 0.5,
            "xanchor": "center",
            "yanchor": "top",
            "font": dict(size=22, color="Black"),
        },
        width=width,
        height=height,
        hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
        xaxis={"visible": False},
        yaxis={"visible": False},
        sliders=sliders,
    )
    # Update axes ranges
    fig.update_xaxes(range=x_range)
    fig.update_yaxes(range=y_range)
    # Add grid in a 'plus' shape
    fig.add_shape(
        type="line",
        x0=sum(x_range) / 2,
        y0=y_range[0],
        x1=sum(x_range) / 2,
        y1=y_range[1],
        line=dict(color="#CFD8DC", width=2),
    )
    fig.add_shape(
        type="line",
        x0=x_range[0],
        y0=sum(y_range) / 2,
        x1=x_range[1],
        y1=sum(y_range) / 2,
        line=dict(color="#9E9E9E", width=2),
    )
    fig.add_annotation(x=x_range[0], y=sum(y_range) / 2, text="D1", showarrow=False, yshift=10)
    fig.add_annotation(y=y_range[1], x=sum(x_range) / 2, text="D2", showarrow=False, xshift=10)
    fig.data = fig.data[::-1]
    return fig
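The axis padding used in `_plotly_topic_visualization` is easy to misread, so here is a small standalone sketch of the same rule. `padded_range` is a hypothetical helper name and the input values are made up for illustration. Each bound is pushed outward by 15% of its own absolute value, and the midpoint of the padded range is where the "plus"-shaped grid lines are drawn.

```python
from typing import Sequence, Tuple


def padded_range(values: Sequence[float], margin: float = 0.15) -> Tuple[float, float]:
    """Widen [min, max] by `margin` times the absolute value of each bound."""
    low, high = min(values), max(values)
    return (low - abs(low * margin), high + abs(high * margin))


# Illustrative 2D coordinates along one axis.
x_range = padded_range([-2.0, 1.0, 4.0])
# Midpoint of the padded range, where the vertical grid line would sit.
center = sum(x_range) / 2
```

Since the padding is proportional to the bound itself, a bound of exactly 0 receives no padding under this rule.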
@@ -1,134 +0,0 @@
 import pandas as pd
 from typing import List, Union
 import plotly.graph_objects as go
 from sklearn.preprocessing import normalize
 def visualize_topics_over_time(
    topic_model,
    topics_over_time: pd.DataFrame,
    top_n_topics: int = None,
    topics: List[int] = None,
    normalize_frequency: bool = False,
    custom_labels: Union[bool, str] = False,
    title: str = "<b>Topics over Time</b>",
    width: int = 1250,
    height: int = 450,
 ) -> go.Figure:
    """Visualize topics over time.
    Arguments:
        topic_model: A fitted BERTopic instance.
        topics_over_time: The topics you would like to be visualized with the
                          corresponding topic representation
        top_n_topics: To visualize the most frequent topics instead of all
        topics: Select which topics you would like to be visualized
        normalize_frequency: Whether to normalize each topic's frequency individually
        custom_labels: If bool, whether to use custom topic labels that were defined using
                       `topic_model.set_topic_labels`.
                       If `str`, it uses labels from other aspects, e.g., "Aspect1".
        title: Title of the plot.
        width: The width of the figure.
        height: The height of the figure.
    Returns:
        A plotly.graph_objects.Figure including all traces
    Examples:
    To visualize the topics over time, simply run:
    ```python
    topics_over_time = topic_model.topics_over_time(docs, timestamps)
    topic_model.visualize_topics_over_time(topics_over_time)
    ```
    Or if you want to save the resulting figure:
    ```python
    fig = topic_model.visualize_topics_over_time(topics_over_time)
    fig.write_html("path/to/file.html")
    ```
    <iframe src="../../getting_started/visualization/trump.html"
    style="width:1000px; height: 680px; border: 0px;"></iframe>
    """
    colors = [
        "#E69F00",
        "#56B4E9",
        "#009E73",
        "#F0E442",
        "#D55E00",
        "#0072B2",
        "#CC79A7",
    ]
    # Select topics based on top_n and topics args
    freq_df = topic_model.get_topic_freq()
    freq_df = freq_df.loc[freq_df.Topic != -1, :]
    if topics is not None:
        selected_topics = list(topics)
    elif top_n_topics is not None:
        selected_topics = sorted(freq_df.Topic.to_list()[:top_n_topics])
    else:
        selected_topics = sorted(freq_df.Topic.to_list())
    # Prepare data
    if isinstance(custom_labels, str):
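        # Build one label per known topic so the index-based mapping below stays aligned,
        # and so that `topics=None` does not raise a TypeError.
        topic_names = [
            [[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic]
            for topic in topic_model.topic_labels_.keys()
        ]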
        topic_names = ["_".join([label[0] for label in labels[:4]]) for labels in topic_names]
        topic_names = [label if len(label) < 30 else label[:27] + "..." for label in topic_names]
        topic_names = {key: topic_names[index] for index, key in enumerate(topic_model.topic_labels_.keys())}
    elif topic_model.custom_labels_ is not None and custom_labels:
        topic_names = {
            key: topic_model.custom_labels_[key + topic_model._outliers] for key, _ in topic_model.topic_labels_.items()
        }
    else:
        topic_names = {
            key: value[:40] + "..." if len(value) > 40 else value for key, value in topic_model.topic_labels_.items()
        }
    topics_over_time["Name"] = topics_over_time.Topic.map(topic_names)
    data = topics_over_time.loc[topics_over_time.Topic.isin(selected_topics), :].sort_values(["Topic", "Timestamp"])
    # Add traces
    fig = go.Figure()
    for index, topic in enumerate(data.Topic.unique()):
        trace_data = data.loc[data.Topic == topic, :]
        topic_name = trace_data.Name.values[0]
        words = trace_data.Words.values
        if normalize_frequency:
            y = normalize(trace_data.Frequency.values.reshape(1, -1))[0]
        else:
            y = trace_data.Frequency
        fig.add_trace(
            go.Scatter(
                x=trace_data.Timestamp,
                y=y,
                mode="lines",
                marker_color=colors[index % 7],
                hoverinfo="text",
                name=topic_name,
                hovertext=[f"<b>Topic {topic}</b><br>Words: {word}" for word in words],
            )
        )
    # Styling of the visualization
    fig.update_xaxes(showgrid=True)
    fig.update_yaxes(showgrid=True)
    fig.update_layout(
        yaxis_title="Normalized Frequency" if normalize_frequency else "Frequency",
        title={
            "text": f"{title}",
            "y": 0.95,
            "x": 0.40,
            "xanchor": "center",
            "yanchor": "top",
            "font": dict(size=22, color="Black"),
        },
        template="simple_white",
        width=width,
        height=height,
        hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
        legend=dict(
            title="<b>Global Topic Representation</b>",
        ),
    )
    return fig
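`sklearn.preprocessing.normalize` defaults to the L2 norm, so with `normalize_frequency=True` each topic's frequency vector is scaled to unit length rather than to a 0-1 range or to a sum of 1. The following is a plain-Python sketch of that scaling; `l2_normalize` is a hypothetical helper, not part of BERTopic or scikit-learn.

```python
import math
from typing import List


def l2_normalize(frequencies: List[float]) -> List[float]:
    """Scale a frequency vector to unit Euclidean length."""
    norm = math.sqrt(sum(f * f for f in frequencies))
    if norm == 0:
        # An all-zero vector is returned unchanged to avoid dividing by zero.
        return list(frequencies)
    return [f / norm for f in frequencies]


# A topic seen 3 times at the first timestamp and 4 times at the second.
normalized = l2_normalize([3, 4])
```

With this scaling the squared values of a topic's line sum to 1, which makes the shapes of different topics comparable even when their raw counts differ widely.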
@@ -1,140 +0,0 @@
 import pandas as pd
 from typing import List, Union
 import plotly.graph_objects as go
 from sklearn.preprocessing import normalize
 def visualize_topics_per_class(
    topic_model,
    topics_per_class: pd.DataFrame,
    top_n_topics: int = 10,
    topics: List[int] = None,
    normalize_frequency: bool = False,
    custom_labels: Union[bool, str] = False,
    title: str = "<b>Topics per Class</b>",
    width: int = 1250,
    height: int = 900,
 ) -> go.Figure:
    """Visualize topics per class.
    Arguments:
        topic_model: A fitted BERTopic instance.
        topics_per_class: The topics you would like to be visualized with the
                          corresponding topic representation
        top_n_topics: To visualize the most frequent topics instead of all
        topics: Select which topics you would like to be visualized
        normalize_frequency: Whether to normalize each topic's frequency individually
        custom_labels: If bool, whether to use custom topic labels that were defined using
                       `topic_model.set_topic_labels`.
                       If `str`, it uses labels from other aspects, e.g., "Aspect1".
        title: Title of the plot.
        width: The width of the figure.
        height: The height of the figure.
    Returns:
        A plotly.graph_objects.Figure including all traces
    Examples:
    To visualize the topics per class, simply run:
    ```python
    topics_per_class = topic_model.topics_per_class(docs, classes)
    topic_model.visualize_topics_per_class(topics_per_class)
    ```
    Or if you want to save the resulting figure:
    ```python
    fig = topic_model.visualize_topics_per_class(topics_per_class)
    fig.write_html("path/to/file.html")
    ```
    <iframe src="../../getting_started/visualization/topics_per_class.html"
    style="width:1400px; height: 1000px; border: 0px;"></iframe>
    """
    colors = [
        "#E69F00",
        "#56B4E9",
        "#009E73",
        "#F0E442",
        "#D55E00",
        "#0072B2",
        "#CC79A7",
    ]
    # Select topics based on top_n and topics args
    freq_df = topic_model.get_topic_freq()
    freq_df = freq_df.loc[freq_df.Topic != -1, :]
    if topics is not None:
        selected_topics = list(topics)
    elif top_n_topics is not None:
        selected_topics = sorted(freq_df.Topic.to_list()[:top_n_topics])
    else:
        selected_topics = sorted(freq_df.Topic.to_list())
    # Prepare data
    if isinstance(custom_labels, str):
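        # Build one label per known topic so the index-based mapping below stays aligned,
        # and so that `topics=None` does not raise a TypeError.
        topic_names = [
            [[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic]
            for topic in topic_model.topic_labels_.keys()
        ]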
        topic_names = ["_".join([label[0] for label in labels[:4]]) for labels in topic_names]
        topic_names = [label if len(label) < 30 else label[:27] + "..." for label in topic_names]
        topic_names = {key: topic_names[index] for index, key in enumerate(topic_model.topic_labels_.keys())}
    elif topic_model.custom_labels_ is not None and custom_labels:
        topic_names = {
            key: topic_model.custom_labels_[key + topic_model._outliers] for key, _ in topic_model.topic_labels_.items()
        }
    else:
        topic_names = {
            key: value[:40] + "..." if len(value) > 40 else value for key, value in topic_model.topic_labels_.items()
        }
    topics_per_class["Name"] = topics_per_class.Topic.map(topic_names)
    data = topics_per_class.loc[topics_per_class.Topic.isin(selected_topics), :]
    # Add traces
    fig = go.Figure()
    for index, topic in enumerate(selected_topics):
        if index == 0:
            visible = True
        else:
            visible = "legendonly"
        trace_data = data.loc[data.Topic == topic, :]
        topic_name = trace_data.Name.values[0]
        words = trace_data.Words.values
        if normalize_frequency:
            x = normalize(trace_data.Frequency.values.reshape(1, -1))[0]
        else:
            x = trace_data.Frequency
        fig.add_trace(
            go.Bar(
                y=trace_data.Class,
                x=x,
                visible=visible,
                marker_color=colors[index % 7],
                hoverinfo="text",
                name=topic_name,
                orientation="h",
                hovertext=[f"<b>Topic {topic}</b><br>Words: {word}" for word in words],
            )
        )
    # Styling of the visualization
    fig.update_xaxes(showgrid=True)
    fig.update_yaxes(showgrid=True)
    fig.update_layout(
        xaxis_title="Normalized Frequency" if normalize_frequency else "Frequency",
        yaxis_title="Class",
        title={
            "text": f"{title}",
            "y": 0.95,
            "x": 0.40,
            "xanchor": "center",
            "yanchor": "top",
            "font": dict(size=22, color="Black"),
        },
        template="simple_white",
        width=width,
        height=height,
        hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
        legend=dict(
            title="<b>Global Topic Representation</b>",
        ),
    )
    return fig
@@ -1,76 +0,0 @@
 from bertopic._utils import NotInstalled
 from bertopic.representation._cohere import Cohere
 from bertopic.representation._base import BaseRepresentation
 from bertopic.representation._keybert import KeyBERTInspired
 from bertopic.representation._mmr import MaximalMarginalRelevance
 # Llama CPP Generator
 try:
    from bertopic.representation._llamacpp import LlamaCPP
 except ModuleNotFoundError:
    msg = "`pip install llama-cpp-python` \n\n"
    LlamaCPP = NotInstalled("llama.cpp", "llama-cpp-python", custom_msg=msg)
 # Text Generation using transformers
 try:
    from bertopic.representation._textgeneration import TextGeneration
 except ModuleNotFoundError:
    msg = "`pip install bertopic` without `--no-deps` \n\n"
    TextGeneration = NotInstalled("TextGeneration", "transformers", custom_msg=msg)
 # Zero-shot classification using transformers
 try:
    from bertopic.representation._zeroshot import ZeroShotClassification
 except ModuleNotFoundError:
    msg = "`pip install bertopic` without `--no-deps` \n\n"
    ZeroShotClassification = NotInstalled("ZeroShotClassification", "transformers", custom_msg=msg)
 # OpenAI Generator
 try:
    from bertopic.representation._openai import OpenAI
 except ModuleNotFoundError:
    msg = "`pip install openai` \n\n"
    OpenAI = NotInstalled("OpenAI", "openai", custom_msg=msg)
 # LiteLLM Generator
 try:
    from bertopic.representation._litellm import LiteLLM
 except ModuleNotFoundError:
    msg = "`pip install litellm` \n\n"
    LiteLLM = NotInstalled("LiteLLM", "litellm", custom_msg=msg)
 # LangChain Generator
 try:
    from bertopic.representation._langchain import LangChain
 except ModuleNotFoundError:
    msg = "`pip install langchain` \n\n"
    LangChain = NotInstalled("langchain", "langchain", custom_msg=msg)
 # POS using Spacy
 try:
    from bertopic.representation._pos import PartOfSpeech
 except ModuleNotFoundError:
    PartOfSpeech = NotInstalled("Part of Speech with Spacy", "spacy")
 # Multimodal
 try:
    from bertopic.representation._visual import VisualRepresentation
 except ModuleNotFoundError:
    VisualRepresentation = NotInstalled("a visual representation model", "vision")
 __all__ = [
    "BaseRepresentation",
    "TextGeneration",
    "ZeroShotClassification",
    "KeyBERTInspired",
    "PartOfSpeech",
    "MaximalMarginalRelevance",
    "Cohere",
    "OpenAI",
    "LangChain",
    "LiteLLM",
    "LlamaCPP",
    "VisualRepresentation",
 ]
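The try/except blocks above follow an optional-dependency pattern: if a backend cannot be imported, its name is bound to a placeholder so that importing the package still succeeds. The sketch below shows the general idea only. `MissingDependency` and `optional_import` are hypothetical names, and the behavior of BERTopic's own `NotInstalled` class is not shown in this diff, so it should not be assumed to match.

```python
import importlib


class MissingDependency:
    """Hypothetical placeholder that fails only when it is actually used."""

    def __init__(self, feature: str, package: str):
        self.feature = feature
        self.package = package

    def __call__(self, *args, **kwargs):
        raise ModuleNotFoundError(
            f"{self.feature} requires `{self.package}`; "
            f"install it with `pip install {self.package}`."
        )


def optional_import(module_name: str, attribute: str, package: str):
    """Return module.attribute, or a placeholder if the module is missing."""
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError:
        return MissingDependency(attribute, package)
    return getattr(module, attribute)


# An installed module resolves to the real attribute.
sqrt = optional_import("math", "sqrt", "math")
# A missing module resolves to a placeholder that raises on first use.
Missing = optional_import("surely_not_installed_pkg_xyz", "Model", "surely-not-installed")
```

Deferring the error to first use keeps `import` cheap and lets users who never touch a given backend avoid installing it.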
@@ -1,40 +0,0 @@
 import pandas as pd
 from scipy.sparse import csr_matrix
 from sklearn.base import BaseEstimator
 from typing import Mapping, List, Tuple
 class BaseRepresentation(BaseEstimator):
    """The base representation model for fine-tuning topic representations."""
    def extract_topics(
        self,
        topic_model,
        documents: pd.DataFrame,
        c_tf_idf: csr_matrix,
        topics: Mapping[str, List[Tuple[str, float]]],
    ) -> Mapping[str, List[Tuple[str, float]]]:
        """Extract topics.
        Each representation model that inherits this class will have
        its arguments (topic_model, documents, c_tf_idf, topics)
        automatically passed. Therefore, the representation model
        will only have access to the information about topics related
        to those arguments.
        Arguments:
            topic_model: The BERTopic model that is fitted until topic
                         representations are calculated.
            documents: A dataframe with columns "Document" and "Topic"
                       that contains all documents with each corresponding
                       topic.
            c_tf_idf: A c-TF-IDF representation that is typically
                      identical to `topic_model.c_tf_idf_` except for
                      dynamic, class-based, and hierarchical topic modeling
                      where it is calculated on a subset of the documents.
            topics: A dictionary with topic (key) and a list of word and
                    weight tuples (value) as calculated by c-TF-IDF. These
                    are the default topics returned when no representation
                    model is used.
        """
        return topic_model.topic_representations_
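To illustrate the contract of `extract_topics`, here is a minimal standalone sketch. `TopWordsOnly` is a hypothetical class. It deliberately does not import bertopic so that it runs on its own, whereas a real representation model would subclass `BaseRepresentation` and receive a DataFrame and a sparse c-TF-IDF matrix. It keeps only the first `top_n` (word, weight) pairs from the c-TF-IDF candidates.

```python
from typing import Any, List, Mapping, Tuple

Topics = Mapping[int, List[Tuple[str, float]]]


class TopWordsOnly:
    """Hypothetical representation model that truncates each topic's word list."""

    def __init__(self, top_n: int = 3):
        self.top_n = top_n

    def extract_topics(
        self, topic_model: Any, documents: Any, c_tf_idf: Any, topics: Topics
    ) -> Topics:
        # Same four-argument signature as BaseRepresentation.extract_topics;
        # only `topics` is needed for this simple transformation.
        return {topic: words[: self.top_n] for topic, words in topics.items()}


# Candidate words for a single topic, as (word, weight) tuples.
candidates = {0: [("meat", 0.9), ("beef", 0.7), ("eat", 0.5), ("food", 0.2)]}
updated = TopWordsOnly(top_n=2).extract_topics(None, None, None, candidates)
```

The return value has the same shape as the input mapping, which is what lets representation models be swapped in without changing the rest of the pipeline.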
@@ -1,209 +0,0 @@
 import time
 import pandas as pd
 from tqdm import tqdm
 from scipy.sparse import csr_matrix
 from typing import Mapping, List, Tuple, Union, Callable
 from bertopic.representation._base import BaseRepresentation
 from bertopic.representation._utils import truncate_document, validate_truncate_document_parameters
 DEFAULT_PROMPT = """
 This is a list of texts where each collection of texts describes a topic. After each collection of texts, the name of the topic they represent is mentioned as a short, highly descriptive title.
 ---
 Topic:
 Sample texts from this topic:
 - Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
 - Meat, but especially beef, is the worst food in terms of emissions.
 - Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.
 Keywords: meat beef eat eating emissions steak food health processed chicken
 Topic name: Environmental impacts of eating meat
 ---
 Topic:
 Sample texts from this topic:
 - I have ordered the product weeks ago but it still has not arrived!
 - The website mentions that it only takes a couple of days to deliver but I still have not received mine.
 - I got a message stating that I received the monitor but that is not true!
 - It took a month longer to deliver than was advised...
 Keywords: deliver weeks product shipping long delivery received arrived arrive week
 Topic name: Shipping and delivery issues
 ---
 Topic:
 Sample texts from this topic:
 [DOCUMENTS]
 Keywords: [KEYWORDS]
 Topic name:"""
 DEFAULT_SYSTEM_PROMPT = "You are an assistant that extracts high-level topics from texts."
 class Cohere(BaseRepresentation):
    """Use the Cohere API to generate topic labels based on their
    generative model.
    Find more about their models here:
    https://docs.cohere.ai/docs
    Arguments:
        client: A `cohere.Client`
        model: Model to use within Cohere, defaults to `"command-r"`.
        prompt: The prompt to be used in the model. If no prompt is given,
                `self.default_prompt_` is used instead.
                NOTE: Use `"[KEYWORDS]"` and `"[DOCUMENTS]"` in the prompt
                to decide where the keywords and documents need to be
                inserted.
        system_prompt: The system prompt to be used in the model. If no system prompt is given,
                       `self.default_system_prompt_` is used instead.
        delay_in_seconds: The delay in seconds between consecutive prompts
                          in order to prevent RateLimitErrors.
        nr_docs: The number of documents to pass to Cohere if a prompt
                 with the `"[DOCUMENTS]"` tag is used.
        diversity: The diversity of documents to pass to Cohere.
                   Accepts values between 0 and 1. Higher values
                   pass more diverse documents, whereas lower
                   values pass more similar documents.
        doc_length: The maximum length of each document. If a document is longer,
                    it will be truncated. If None, the entire document is passed.
        tokenizer: The tokenizer used to split the document into segments,
                   which are counted to determine the length of a document.
                       * If tokenizer is 'char', then the document is split up
                         into characters which are counted to adhere to `doc_length`
                       * If tokenizer is 'whitespace', the document is split up
                         into words separated by whitespaces. These words are counted
                         and truncated depending on `doc_length`
                       * If tokenizer is 'vectorizer', then the internal CountVectorizer
                         is used to tokenize the document. These tokens are counted
                         and truncated depending on `doc_length`
                       * If tokenizer is a callable, then that callable is used to tokenize
                         the document. These tokens are counted and truncated depending
                         on `doc_length`
    Usage:
    To use this, you will need to install cohere first:
    `pip install cohere`
    Then, get yourself an API key and use Cohere's API as follows:
    ```python
    import cohere
    from bertopic.representation import Cohere
    from bertopic import BERTopic
    # Create your representation model
    co = cohere.Client(my_api_key)
    representation_model = Cohere(co)
    # Use the representation model in BERTopic on top of the default pipeline
    topic_model = BERTopic(representation_model=representation_model)
    ```
    You can also use a custom prompt:
    ```python
    prompt = "I have the following documents: [DOCUMENTS]. What topic do they contain?"
    representation_model = Cohere(co, prompt=prompt)
    ```
    """
    def __init__(
        self,
        client,
        model: str = "command-r",
        prompt: str = None,
        system_prompt: str = None,
        delay_in_seconds: float = None,
        nr_docs: int = 4,
        diversity: float = None,
        doc_length: int = None,
        tokenizer: Union[str, Callable] = None,
    ):
        self.client = client
        self.model = model
        self.prompt = prompt if prompt is not None else DEFAULT_PROMPT
        self.system_prompt = system_prompt if system_prompt is not None else DEFAULT_SYSTEM_PROMPT
        self.default_prompt_ = DEFAULT_PROMPT
        self.default_system_prompt_ = DEFAULT_SYSTEM_PROMPT
        self.delay_in_seconds = delay_in_seconds
        self.nr_docs = nr_docs
        self.diversity = diversity
        self.doc_length = doc_length
        self.tokenizer = tokenizer
        validate_truncate_document_parameters(self.tokenizer, self.doc_length)
        self.prompts_ = []
    def extract_topics(
        self,
        topic_model,
        documents: pd.DataFrame,
        c_tf_idf: csr_matrix,
        topics: Mapping[str, List[Tuple[str, float]]],
    ) -> Mapping[str, List[Tuple[str, float]]]:
        """Extract topics.
        Arguments:
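            topic_model: A BERTopic model, used to extract representative documents
            documents: All input documents with their assigned topics
            c_tf_idf: The topic c-TF-IDF representation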
            topics: The candidate topics as calculated with c-TF-IDF
        Returns:
            updated_topics: Updated topic representations
        """
        # Extract the top `nr_docs` representative documents per topic
        repr_docs_mappings, _, _, _ = topic_model._extract_representative_docs(
            c_tf_idf, documents, topics, 500, self.nr_docs, self.diversity
        )
        # Generate using Cohere's Language Model
        updated_topics = {}
        for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
            truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
            prompt = self._create_prompt(truncated_docs, topic, topics)
            self.prompts_.append(prompt)
            # Delay
            if self.delay_in_seconds:
                time.sleep(self.delay_in_seconds)
            request = self.client.chat(
                model=self.model,
                preamble=self.system_prompt,
                message=prompt,
                max_tokens=50,
                stop_sequences=["\n"],
            )
            label = request.text.strip()
            updated_topics[topic] = [(label, 1)] + [("", 0) for _ in range(9)]
        return updated_topics
    def _create_prompt(self, docs, topic, topics):
        keywords = list(zip(*topics[topic]))[0]
        # Use the Default Chat Prompt
        if self.prompt == DEFAULT_PROMPT:
            prompt = self.prompt.replace("[KEYWORDS]", ", ".join(keywords))
            prompt = self._replace_documents(prompt, docs)
        # Use a custom prompt that leverages keywords, documents or both using
        # custom tags, namely [KEYWORDS] and [DOCUMENTS] respectively
        else:
            prompt = self.prompt
            if "[KEYWORDS]" in prompt:
                prompt = prompt.replace("[KEYWORDS]", ", ".join(keywords))
            if "[DOCUMENTS]" in prompt:
                prompt = self._replace_documents(prompt, docs)
        return prompt
    @staticmethod
    def _replace_documents(prompt, docs):
        to_replace = ""
        for doc in docs:
            to_replace += f"- {doc}\n"
        prompt = prompt.replace("[DOCUMENTS]", to_replace)
        return prompt
@@ -1,222 +0,0 @@
 import numpy as np
 import pandas as pd
 from packaging import version
 from scipy.sparse import csr_matrix
 from typing import Mapping, List, Tuple, Union
 from sklearn.metrics.pairwise import cosine_similarity
 from bertopic.representation._base import BaseRepresentation
 from sklearn import __version__ as sklearn_version
 class KeyBERTInspired(BaseRepresentation):
    def __init__(
        self,
        top_n_words: int = 10,
        nr_repr_docs: int = 5,
        nr_samples: int = 500,
        nr_candidate_words: int = 100,
        random_state: int = 42,
    ):
        """Use a KeyBERT-like model to fine-tune the topic representations.
        The algorithm follows KeyBERT but does some optimization in
        order to speed up inference.
        The steps are as follows. First, we extract the top n representative
        documents per topic, where n is set by the `nr_repr_docs` parameter.
        To do so, we randomly sample a number of candidate documents per
        cluster, which is controlled by the `nr_samples` parameter. Then,
        the top n representative documents are extracted by calculating
        the c-TF-IDF representation for the candidate documents and finding,
        through cosine similarity, which are closest to the topic c-TF-IDF representation.
        Next, the candidate words per topic are extracted based on their
        c-TF-IDF representation; their number is controlled by the
        `nr_candidate_words` parameter.
        Then, we extract the embeddings for words and representative documents
        and create topic embeddings by averaging the representative documents.
        Finally, the most similar words to each topic are extracted by
        calculating the cosine similarity between word and topic embeddings.
        Arguments:
            top_n_words: The top n words to extract per topic.
            nr_repr_docs: The number of representative documents to extract per cluster.
            nr_samples: The number of candidate documents to extract per cluster.
            nr_candidate_words: The number of candidate words per cluster.
            random_state: The random state for randomly sampling candidate documents.
        Usage:
        ```python
        from bertopic.representation import KeyBERTInspired
        from bertopic import BERTopic
        # Create your representation model
        representation_model = KeyBERTInspired()
        # Use the representation model in BERTopic on top of the default pipeline
        topic_model = BERTopic(representation_model=representation_model)
        ```
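        The last two steps described above (averaging the representative
        document embeddings into a topic embedding, then ranking candidate
        words by cosine similarity) can be sketched roughly as follows.
        This is a minimal illustration with made-up 2-d embeddings; the
        helper `rank_words` is hypothetical and not part of BERTopic:
        ```python
        import numpy as np

        def rank_words(doc_embeddings, word_embeddings, words, top_n=3):
            # Topic embedding: mean of the representative document embeddings
            topic_embedding = np.mean(doc_embeddings, axis=0)
            # Cosine similarity between the topic and every candidate word
            word_norms = np.linalg.norm(word_embeddings, axis=1)
            topic_norm = np.linalg.norm(topic_embedding)
            sims = word_embeddings @ topic_embedding / (word_norms * topic_norm)
            # Keep the `top_n` most similar words, best first
            order = np.argsort(sims)[::-1][:top_n]
            return [(words[i], float(sims[i])) for i in order]

        # Toy embeddings, invented purely for illustration
        doc_embeddings = np.array([[1.0, 0.0], [0.8, 0.2]])
        word_embeddings = np.array([[1.0, 0.1], [0.0, 1.0], [0.5, 0.5]])
        words = ["meat", "shipping", "food"]
        ranked = rank_words(doc_embeddings, word_embeddings, words, top_n=2)
        ```
        In the class itself, the embeddings come from the BERTopic embedding
        model (or from precomputed `embeddings`) rather than from toy arrays.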
        """
        self.top_n_words = top_n_words
        self.nr_repr_docs = nr_repr_docs
        self.nr_samples = nr_samples
        self.nr_candidate_words = nr_candidate_words
        self.random_state = random_state
    def extract_topics(
        self,
        topic_model,
        documents: pd.DataFrame,
        c_tf_idf: csr_matrix,
        topics: Mapping[str, List[Tuple[str, float]]],
        embeddings: np.ndarray = None,
    ) -> Mapping[str, List[Tuple[str, float]]]:
        """Extract topics.
        Arguments:
            topic_model: A BERTopic model
            documents: All input documents
            c_tf_idf: The topic c-TF-IDF representation
            topics: The candidate topics as calculated with c-TF-IDF
            embeddings: Pre-trained document embeddings. These can be used
                        instead of an embedding model
        Returns:
            updated_topics: Updated topic representations
        """
        # We extract the top n representative documents per class
        _, representative_docs, repr_doc_indices, _ = topic_model._extract_representative_docs(
            c_tf_idf, documents, topics, self.nr_samples, self.nr_repr_docs
        )
        # If document embeddings are precomputed, extract the embeddings of the representative documents based on repr_doc_indices
        repr_embeddings = None
        if embeddings is not None:
            repr_embeddings = [embeddings[index] for index in np.concatenate(repr_doc_indices)]
        # We extract the top n words per class
        topics = self._extract_candidate_words(topic_model, c_tf_idf, topics)
        # We calculate the similarity between word and document embeddings and create
        # topic embeddings from the representative document embeddings
        sim_matrix, words = self._extract_embeddings(
            topic_model, topics, representative_docs, repr_doc_indices, repr_embeddings
        )
        # Find the best matching words based on the similarity matrix for each topic
        updated_topics = self._extract_top_words(words, topics, sim_matrix)
        return updated_topics
    def _extract_candidate_words(
        self,
        topic_model,
        c_tf_idf: csr_matrix,
        topics: Mapping[str, List[Tuple[str, float]]],
    ) -> Mapping[str, List[Tuple[str, float]]]:
        """For each topic, extract candidate words based on the c-TF-IDF
        representation.
        Arguments:
            topic_model: A BERTopic model
            c_tf_idf: The topic c-TF-IDF representation
            topics: The top words per topic
        Returns:
            topics: The `self.nr_candidate_words` candidate words per topic
        """
        labels = [int(label) for label in sorted(list(topics.keys()))]
        # Scikit-Learn Deprecation: get_feature_names is deprecated in 1.0
        # and will be removed in 1.2. Please use get_feature_names_out instead.
        if version.parse(sklearn_version) >= version.parse("1.0.0"):
            words = topic_model.vectorizer_model.get_feature_names_out()
        else:
            words = topic_model.vectorizer_model.get_feature_names()
        indices = topic_model._top_n_idx_sparse(c_tf_idf, self.nr_candidate_words)
        scores = topic_model._top_n_values_sparse(c_tf_idf, indices)
        sorted_indices = np.argsort(scores, 1)
        indices = np.take_along_axis(indices, sorted_indices, axis=1)
        scores = np.take_along_axis(scores, sorted_indices, axis=1)
        # Get the top `nr_candidate_words` words per topic based on c-TF-IDF score
        topics = {
            label: [
                (words[word_index], score) if word_index is not None and score > 0 else ("", 0.00001)
                for word_index, score in zip(indices[index][::-1], scores[index][::-1])
            ]
            for index, label in enumerate(labels)
        }
        topics = {label: list(zip(*values[: self.nr_candidate_words]))[0] for label, values in topics.items()}
        return topics
    def _extract_embeddings(
        self,
        topic_model,
        topics: Mapping[str, List[Tuple[str, float]]],
        representative_docs: List[str],
        repr_doc_indices: List[List[int]],
        repr_embeddings: np.ndarray = None,
    ) -> Tuple[np.ndarray, List[str]]:
        """Extract the representative document embeddings and create topic embeddings.
        Then extract word embeddings and calculate the cosine similarity between topic
        embeddings and the word embeddings. Topic embeddings are the average of
        representative document embeddings.
        Arguments:
            topic_model: A BERTopic model
            topics: The top words per topic
            representative_docs: A flat list of representative documents
            repr_doc_indices: The indices of representative documents
                              that belong to each topic
            repr_embeddings: Embeddings of respective representative_docs
        Returns:
            sim: The similarity matrix between word and topic embeddings
            vocab: The unique candidate words across all topics
        """
        # Calculate representative document embeddings if there are no precomputed embeddings.
        if repr_embeddings is None:
            repr_embeddings = topic_model._extract_embeddings(representative_docs, method="document", verbose=False)
        topic_embeddings = [np.mean(repr_embeddings[i[0] : i[-1] + 1], axis=0) for i in repr_doc_indices]
        # Calculate word embeddings and extract best matching with updated topic_embeddings
        vocab = list(set([word for words in topics.values() for word in words]))
        word_embeddings = topic_model._extract_embeddings(vocab, method="document", verbose=False)
        sim = cosine_similarity(topic_embeddings, word_embeddings)
        return sim, vocab
    def _extract_top_words(
        self,
        vocab: List[str],
        topics: Mapping[str, List[Tuple[str, float]]],
        sim: np.ndarray,
    ) -> Mapping[str, List[Tuple[str, float]]]:
        """Extract the top n words per topic based on the
        similarity matrix between topics and words.
        Arguments:
            vocab: The unique candidate words across all topics
            topics: The candidate words per topic
            sim: The similarity matrix between word and topic embeddings
        Returns:
            updated_topics: The updated topic representations
        """
        labels = [int(label) for label in sorted(list(topics.keys()))]
        updated_topics = {}
        for i, topic in enumerate(labels):
            indices = [vocab.index(word) for word in topics[topic]]
            values = sim[:, indices][i]
            word_indices = [indices[index] for index in np.argsort(values)[-self.top_n_words :]]
            updated_topics[topic] = [
                (vocab[index], val) for val, index in zip(np.sort(values)[-self.top_n_words :], word_indices)
            ][::-1]
        return updated_topics
@@ -1,213 +0,0 @@
 import pandas as pd
 from langchain.docstore.document import Document
 from scipy.sparse import csr_matrix
 from typing import Callable, Mapping, List, Tuple, Union
 from bertopic.representation._base import BaseRepresentation
 from bertopic.representation._utils import truncate_document, validate_truncate_document_parameters
 DEFAULT_PROMPT = "What are these documents about? Please give a single label."
 class LangChain(BaseRepresentation):
    """Using chains in langchain to generate topic labels.
    The classic example uses `langchain.chains.question_answering.load_qa_chain`.
    This returns a chain that takes a list of documents and a question as input.
    You can also use Runnables such as those composed using the LangChain Expression Language.
    Arguments:
        chain: The langchain chain or Runnable with a `batch` method.
               Input keys must be `input_documents` and `question`.
               Output key must be `output_text`.
        prompt: The prompt to be used in the model. If no prompt is given,
                `self.default_prompt_` is used instead.
                NOTE: Use `"[KEYWORDS]"` in the prompt to decide where the
                keywords need to be inserted. Keywords are not included
                unless this tag is present. Unlike other representation
                models, LangChain does not use the `"[DOCUMENTS]"` tag to
                insert documents into the prompt; the chain (for example
                the one returned by `load_qa_chain`) formats the
                representative documents within the prompt.
        nr_docs: The number of documents to pass to LangChain
        diversity: The diversity of documents to pass to LangChain.
                   Accepts values between 0 and 1. Higher values
                   pass more diverse documents, whereas lower values
                   pass more similar documents.
        doc_length: The maximum length of each document. If a document is longer,
                    it will be truncated. If None, the entire document is passed.
        tokenizer: The tokenizer used to split the document into segments,
                   which are counted to determine the length of a document.
                       * If tokenizer is 'char', then the document is split up
                         into characters which are counted to adhere to `doc_length`
                       * If tokenizer is 'whitespace', the document is split up
                         into words separated by whitespaces. These words are counted
                         and truncated depending on `doc_length`
                       * If tokenizer is 'vectorizer', then the internal CountVectorizer
                         is used to tokenize the document. These tokens are counted
                         and truncated depending on `doc_length`. They are decoded with
                         whitespaces.
                       * If tokenizer is a callable, then that callable is used to tokenize
                         the document. These tokens are counted and truncated depending
                         on `doc_length`
        chain_config: The configuration for the langchain chain. Can be used to set options
                      like max_concurrency to avoid rate limiting errors.
    Usage:
    To use this, you will need to install the langchain package first.
    Additionally, you will need an underlying LLM to support langchain,
    like openai:
    `pip install langchain`
    `pip install openai`
    Then, you can create your chain as follows:
    ```python
    from langchain.chains.question_answering import load_qa_chain
    from langchain.llms import OpenAI
    chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), chain_type="stuff")
    ```
    Finally, you can pass the chain to BERTopic as follows:
    ```python
    from bertopic.representation import LangChain
    # Create your representation model
    representation_model = LangChain(chain)
    # Use the representation model in BERTopic on top of the default pipeline
    topic_model = BERTopic(representation_model=representation_model)
    ```
    You can also use a custom prompt:
    ```python
    prompt = "What are these documents about? Please give a single label."
    representation_model = LangChain(chain, prompt=prompt)
    ```
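    If the prompt contains the `"[KEYWORDS]"` tag, each topic gets its own
    question with the topic keywords filled in. The sketch below only
    illustrates that substitution; the `topics` dictionary and the prompt
    text are made-up examples, and no chain is called:
    ```python
    # Hypothetical topic keywords as (word, score) pairs
    topics = {
        0: [("meat", 0.12), ("beef", 0.10)],
        1: [("delivery", 0.15), ("shipping", 0.11)],
    }
    prompt = "Keywords: [KEYWORDS]. What are these documents about? Please give a single label."

    questions = {}
    for topic, words in topics.items():
        # Keep only the words and join them with commas
        keywords = [word for word, _ in words]
        questions[topic] = prompt.replace("[KEYWORDS]", ", ".join(keywords))
    ```
    Each resulting question is paired with the representative documents of
    its topic and passed to the chain under the `question` key.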
    You can also use a Runnable instead of a chain.
    The example below uses the LangChain Expression Language:
    ```python
    from bertopic.representation import LangChain
    from langchain.chains.question_answering import load_qa_chain
    from langchain.chat_models import ChatAnthropic
    from langchain.schema.document import Document
    from langchain.schema.runnable import RunnablePassthrough
    from langchain_experimental.data_anonymizer.presidio import PresidioReversibleAnonymizer
    representation_prompt = ...
    representation_llm = ...
    # We will construct a special privacy-preserving chain using Microsoft Presidio
    pii_handler = PresidioReversibleAnonymizer(analyzed_fields=["PERSON"])
    chain = (
        {
            "input_documents": (
                lambda inp: [
                    Document(
                        page_content=pii_handler.anonymize(
                            d.page_content,
                            language="en",
                        ),
                    )
                    for d in inp["input_documents"]
                ]
            ),
            "question": RunnablePassthrough(),
        }
        | load_qa_chain(representation_llm, chain_type="stuff")
        | (lambda output: {"output_text": pii_handler.deanonymize(output["output_text"])})
    )
    representation_model = LangChain(chain, prompt=representation_prompt)
    ```
    """
    def __init__(
        self,
        chain,
        prompt: str = None,
        nr_docs: int = 4,
        diversity: float = None,
        doc_length: int = None,
        tokenizer: Union[str, Callable] = None,
        chain_config=None,
    ):
        self.chain = chain
        self.prompt = prompt if prompt is not None else DEFAULT_PROMPT
        self.default_prompt_ = DEFAULT_PROMPT
        self.chain_config = chain_config
        self.nr_docs = nr_docs
        self.diversity = diversity
        self.doc_length = doc_length
        self.tokenizer = tokenizer
        validate_truncate_document_parameters(self.tokenizer, self.doc_length)
    def extract_topics(
        self,
        topic_model,
        documents: pd.DataFrame,
        c_tf_idf: csr_matrix,
        topics: Mapping[str, List[Tuple[str, float]]],
    ) -> Mapping[str, List[Tuple[str, int]]]:
        """Extract topics.
        Arguments:
            topic_model: A BERTopic model
            documents: All input documents
            c_tf_idf: The topic c-TF-IDF representation
            topics: The candidate topics as calculated with c-TF-IDF
        Returns:
            updated_topics: Updated topic representations
        """
        # Extract the top `nr_docs` representative documents per topic
        repr_docs_mappings, _, _, _ = topic_model._extract_representative_docs(
            c_tf_idf=c_tf_idf,
            documents=documents,
            topics=topics,
            nr_samples=500,
            nr_repr_docs=self.nr_docs,
            diversity=self.diversity,
        )
        # Generate label using langchain's batch functionality
        chain_docs: List[List[Document]] = [
            [
                Document(page_content=truncate_document(topic_model, self.doc_length, self.tokenizer, doc))
                for doc in docs
            ]
            for docs in repr_docs_mappings.values()
        ]
        # `self.chain` must take `input_documents` and `question` as input keys
        # Use a custom prompt that leverages keywords, using the tag: [KEYWORDS]
        if "[KEYWORDS]" in self.prompt:
            prompts = []
            for topic in topics:
                keywords = list(zip(*topics[topic]))[0]
                prompt = self.prompt.replace("[KEYWORDS]", ", ".join(keywords))
                prompts.append(prompt)
            inputs = [{"input_documents": docs, "question": prompt} for docs, prompt in zip(chain_docs, prompts)]
        else:
            inputs = [{"input_documents": docs, "question": self.prompt} for docs in chain_docs]
        # `self.chain` must return a dict with an `output_text` key
        # same output key as the `StuffDocumentsChain` returned by `load_qa_chain`
        outputs = self.chain.batch(inputs=inputs, config=self.chain_config)
        labels = [output["output_text"].strip() for output in outputs]
        updated_topics = {
            topic: [(label, 1)] + [("", 0) for _ in range(9)] for topic, label in zip(repr_docs_mappings.keys(), labels)
        }
        return updated_topics
@@ -1,176 +0,0 @@
 import time
 from litellm import completion
 import pandas as pd
 from scipy.sparse import csr_matrix
 from typing import Mapping, List, Tuple, Any
 from bertopic.representation._base import BaseRepresentation
 from bertopic.representation._utils import retry_with_exponential_backoff
 DEFAULT_PROMPT = """
 I have a topic that contains the following documents:
 [DOCUMENTS]
 The topic is described by the following keywords: [KEYWORDS]
 Based on the information above, extract a short topic label in the following format:
 topic: <topic label>
 """
 class LiteLLM(BaseRepresentation):
    """Using the LiteLLM API to generate topic labels.
    For an overview of models see:
    https://docs.litellm.ai/docs/providers
    Arguments:
        model: Model to use. Defaults to OpenAI's "gpt-3.5-turbo".
        generator_kwargs: Kwargs passed to `litellm.completion`.
        prompt: The prompt to be used in the model. If no prompt is given,
                `self.default_prompt_` is used instead.
                NOTE: Use `"[KEYWORDS]"` and `"[DOCUMENTS]"` in the prompt
                to decide where the keywords and documents need to be
                inserted.
        delay_in_seconds: The delay in seconds between consecutive prompts
                          in order to prevent RateLimitErrors.
        exponential_backoff: Retry requests with a random exponential backoff.
                             A short sleep is used when a rate limit error is hit,
                             then the request is retried. The sleep length is increased
                             each time an error is hit, up to 10 unsuccessful requests.
                             If True, overrides `delay_in_seconds`.
        nr_docs: The number of documents to pass to LiteLLM if a prompt
                 with the `"[DOCUMENTS]"` tag is used.
        diversity: The diversity of documents to pass to LiteLLM.
                   Accepts values between 0 and 1. Higher values
                   pass more diverse documents, whereas lower values
                   pass more similar documents.
    Usage:
    To use this, you will need to install the litellm package first:
    `pip install litellm`
    Then, get yourself an API key of any provider (for instance OpenAI) and use it as follows:
    ```python
    import os
    from bertopic.representation import LiteLLM
    from bertopic import BERTopic
    # set ENV variables
    os.environ["OPENAI_API_KEY"] = "your-openai-key"
    # Create your representation model
    representation_model = LiteLLM(model="gpt-3.5-turbo")
    # Use the representation model in BERTopic on top of the default pipeline
    topic_model = BERTopic(representation_model=representation_model)
    ```
    You can also use a custom prompt:
    ```python
    prompt = "I have the following documents: [DOCUMENTS] \nThese documents are about the following topic: '"
    representation_model = LiteLLM(model="gpt-3.5-turbo", prompt=prompt)
    ```
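    With the default prompt, the model is asked to answer in the form
    `topic: <topic label>`, and the label is recovered by stripping
    whitespace and removing that prefix. A minimal sketch of this parsing
    step, using a plain dictionary as a stand-in for the completion
    response (the reply text is invented for illustration):
    ```python
    # Stand-in for a completion response; a real one is returned by litellm
    response = {"choices": [{"message": {"content": "  topic: Shipping and delivery issues  "}}]}

    content = response["choices"][0]["message"]["content"]
    # Strip surrounding whitespace and drop the "topic: " prefix if present
    label = content.strip().replace("topic: ", "")
    ```
    A custom prompt that requests a different answer format will leave the
    reply unchanged apart from the whitespace stripping.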
    """  # noqa: D301
    def __init__(
        self,
        model: str = "gpt-3.5-turbo",
        prompt: str = None,
        generator_kwargs: Mapping[str, Any] = {},
        delay_in_seconds: float = None,
        exponential_backoff: bool = False,
        nr_docs: int = 4,
        diversity: float = None,
    ):
        self.model = model
        self.prompt = prompt if prompt else DEFAULT_PROMPT
        self.default_prompt_ = DEFAULT_PROMPT
        self.delay_in_seconds = delay_in_seconds
        self.exponential_backoff = exponential_backoff
        self.nr_docs = nr_docs
        self.diversity = diversity
        # Copy so that deleting keys below never mutates the caller's dict or the shared default
        self.generator_kwargs = dict(generator_kwargs)
        if self.generator_kwargs.get("model"):
            self.model = generator_kwargs.get("model")
        if self.generator_kwargs.get("prompt"):
            del self.generator_kwargs["prompt"]
    def extract_topics(
        self, topic_model, documents: pd.DataFrame, c_tf_idf: csr_matrix, topics: Mapping[str, List[Tuple[str, float]]]
    ) -> Mapping[str, List[Tuple[str, float]]]:
        """Extract topics.
        Arguments:
            topic_model: A BERTopic model
            documents: All input documents
            c_tf_idf: The topic c-TF-IDF representation
            topics: The candidate topics as calculated with c-TF-IDF
        Returns:
            updated_topics: Updated topic representations
        """
        # Extract the top n representative documents per topic
        repr_docs_mappings, _, _, _ = topic_model._extract_representative_docs(
            c_tf_idf, documents, topics, 500, self.nr_docs, self.diversity
        )
        # Generate using a (Large) Language Model
        updated_topics = {}
        for topic, docs in repr_docs_mappings.items():
            prompt = self._create_prompt(docs, topic, topics)
            # Delay
            if self.delay_in_seconds:
                time.sleep(self.delay_in_seconds)
            messages = [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt},
            ]
            kwargs = {"model": self.model, "messages": messages, **self.generator_kwargs}
            if self.exponential_backoff:
                response = chat_completions_with_backoff(**kwargs)
            else:
                response = completion(**kwargs)
            label = response["choices"][0]["message"]["content"].strip().replace("topic: ", "")
            updated_topics[topic] = [(label, 1)]
        return updated_topics
    def _create_prompt(self, docs, topic, topics):
        keywords = list(zip(*topics[topic]))[0]
        # Use the Default Chat Prompt
        if self.prompt == DEFAULT_PROMPT:
            prompt = self.prompt.replace("[KEYWORDS]", " ".join(keywords))
            prompt = self._replace_documents(prompt, docs)
        # Use a custom prompt that leverages keywords, documents or both using
        # custom tags, namely [KEYWORDS] and [DOCUMENTS] respectively
        else:
            prompt = self.prompt
            if "[KEYWORDS]" in prompt:
                prompt = prompt.replace("[KEYWORDS]", " ".join(keywords))
            if "[DOCUMENTS]" in prompt:
                prompt = self._replace_documents(prompt, docs)
        return prompt
    @staticmethod
    def _replace_documents(prompt, docs):
        to_replace = ""
        for doc in docs:
            to_replace += f"- {doc[:255]}\n"
        prompt = prompt.replace("[DOCUMENTS]", to_replace)
        return prompt
 def chat_completions_with_backoff(**kwargs):
    return retry_with_exponential_backoff(
        completion,
    )(**kwargs)
@@ -1,215 +0,0 @@
 import pandas as pd
 from tqdm import tqdm
 from scipy.sparse import csr_matrix
 from llama_cpp import Llama
 from typing import Mapping, List, Tuple, Any, Union, Callable
 from bertopic.representation._base import BaseRepresentation
 from bertopic.representation._utils import truncate_document, validate_truncate_document_parameters
 DEFAULT_PROMPT = """
 This is a list of texts where each collection of texts describes a topic. After each collection of texts, the name of the topic they represent is mentioned as a short, highly descriptive title.
 ---
 Topic:
 Sample texts from this topic:
 - Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
 - Meat, but especially beef, is the worst food in terms of emissions.
 - Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.
 Keywords: meat beef eat eating emissions steak food health processed chicken
 Topic name: Environmental impacts of eating meat
 ---
 Topic:
 Sample texts from this topic:
 - I have ordered the product weeks ago but it still has not arrived!
 - The website mentions that it only takes a couple of days to deliver but I still have not received mine.
 - I got a message stating that I received the monitor but that is not true!
 - It took a month longer to deliver than was advised...
 Keywords: deliver weeks product shipping long delivery received arrived arrive week
 Topic name: Shipping and delivery issues
 ---
 Topic:
 Sample texts from this topic:
 [DOCUMENTS]
 Keywords: [KEYWORDS]
 Topic name:"""
 DEFAULT_SYSTEM_PROMPT = "You are an assistant that extracts high-level topics from texts."
 class LlamaCPP(BaseRepresentation):
    """A llama.cpp implementation to use as a representation model.
    Arguments:
        model: Either a string pointing towards a local LLM or a
                `llama_cpp.Llama` object.
        prompt: The prompt to be used in the model. If no prompt is given,
                `self.default_prompt_` is used instead.
                NOTE: Use `"[KEYWORDS]"` and `"[DOCUMENTS]"` in the prompt
                to decide where the keywords and documents need to be
                inserted.
        system_prompt: The system prompt to be used in the model. If no system prompt is given,
                       `self.default_system_prompt_` is used instead.
        pipeline_kwargs: Kwargs that you can pass to the `llama_cpp.Llama`
                         when it is called such as `max_tokens` to be generated.
        nr_docs: The number of documents to pass to the model if a prompt
                 with the `"[DOCUMENTS]"` tag is used.
        diversity: The diversity of documents to pass to the model.
                   Accepts values between 0 and 1. Higher values
                   pass more diverse documents, whereas lower values
                   pass more similar documents.
        doc_length: The maximum length of each document. If a document is longer,
                    it will be truncated. If None, the entire document is passed.
        tokenizer: The tokenizer used to split the document into segments,
                   which are counted to determine the length of a document.
                       * If tokenizer is 'char', then the document is split up
                         into characters which are counted to adhere to `doc_length`
                       * If tokenizer is 'whitespace', the document is split up
                         into words separated by whitespaces. These words are counted
                         and truncated depending on `doc_length`
                       * If tokenizer is 'vectorizer', then the internal CountVectorizer
                         is used to tokenize the document. These tokens are counted
                         and truncated depending on `doc_length`
                       * If tokenizer is a callable, then that callable is used to tokenize
                         the document. These tokens are counted and truncated depending
                         on `doc_length`
    Usage:
    To use llama.cpp, first download the LLM:
    ```bash
    wget https://huggingface.co/TheBloke/zephyr-7B-alpha-GGUF/resolve/main/zephyr-7b-alpha.Q4_K_M.gguf
    ```
    Then, we can use the model with BERTopic in just a couple of lines:
    ```python
    from bertopic import BERTopic
    from bertopic.representation import LlamaCPP
    # Use llama.cpp to load in a 4-bit quantized version of Zephyr 7B Alpha
    representation_model = LlamaCPP("zephyr-7b-alpha.Q4_K_M.gguf")
    # Create our BERTopic model
    topic_model = BERTopic(representation_model=representation_model, verbose=True)
    ```
    If you want to have more control over the LLM's parameters, you can run it like so:
    ```python
    from bertopic import BERTopic
    from bertopic.representation import LlamaCPP
    from llama_cpp import Llama
    # Use llama.cpp to load in a 4-bit quantized version of Zephyr 7B Alpha
    llm = Llama(model_path="zephyr-7b-alpha.Q4_K_M.gguf", n_gpu_layers=-1, n_ctx=4096, stop="Q:")
    representation_model = LlamaCPP(llm)
    # Create our BERTopic model
    topic_model = BERTopic(representation_model=representation_model, verbose=True)
    ```
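    The effect of `doc_length` together with the 'char' and 'whitespace'
    tokenizers described above can be sketched as follows. The helper
    `truncate` is a hypothetical illustration of the documented behaviour,
    not the library's own `truncate_document` function:
    ```python
    def truncate(document, doc_length=None, tokenizer="whitespace"):
        # No limit given: pass the entire document
        if doc_length is None:
            return document
        # 'char': count and cut individual characters
        if tokenizer == "char":
            return document[:doc_length]
        # 'whitespace': count and cut whitespace-separated words
        if tokenizer == "whitespace":
            return " ".join(document.split()[:doc_length])
        raise ValueError("This sketch only covers 'char' and 'whitespace'")

    text = "I ordered the product weeks ago"
    by_char = truncate(text, 9, "char")
    by_word = truncate(text, 3, "whitespace")
    ```
    Shorter documents leave more of the context window for the few-shot
    examples and keywords in the prompt.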
    """
    def __init__(
        self,
        model: Union[str, Llama],
        prompt: str = None,
        system_prompt: str = None,
        pipeline_kwargs: Mapping[str, Any] = {},
        nr_docs: int = 4,
        diversity: float = None,
        doc_length: int = None,
        tokenizer: Union[str, Callable] = None,
    ):
        if isinstance(model, str):
            self.model = Llama(model_path=model, n_gpu_layers=-1, stop="\n", chat_format="ChatML")
        elif isinstance(model, Llama):
            self.model = model
        else:
            raise ValueError(
                "Make sure that the model that you "
                "pass is either a string referring to a "
                "local LLM or a `llama_cpp.Llama` object."
            )
        self.prompt = prompt if prompt is not None else DEFAULT_PROMPT
        self.system_prompt = system_prompt if system_prompt is not None else DEFAULT_SYSTEM_PROMPT
        self.default_prompt_ = DEFAULT_PROMPT
        self.default_system_prompt_ = DEFAULT_SYSTEM_PROMPT
        self.pipeline_kwargs = pipeline_kwargs
        self.nr_docs = nr_docs
        self.diversity = diversity
        self.doc_length = doc_length
        self.tokenizer = tokenizer
        validate_truncate_document_parameters(self.tokenizer, self.doc_length)
        self.prompts_ = []
    def extract_topics(
        self,
        topic_model,
        documents: pd.DataFrame,
        c_tf_idf: csr_matrix,
        topics: Mapping[str, List[Tuple[str, float]]],
    ) -> Mapping[str, List[Tuple[str, float]]]:
        """Extract topic representations and return a single label.
        Arguments:
            topic_model: A BERTopic model
            documents: All input documents
            c_tf_idf: The topic c-TF-IDF representation
            topics: The candidate topics as calculated with c-TF-IDF
        Returns:
            updated_topics: Updated topic representations
        """
        # Extract the top `nr_docs` representative documents per topic
        repr_docs_mappings, _, _, _ = topic_model._extract_representative_docs(
            c_tf_idf, documents, topics, 500, self.nr_docs, self.diversity
        )
        updated_topics = {}
        for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
            # Prepare prompt
            truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
            prompt = self._create_prompt(truncated_docs, topic, topics)
            self.prompts_.append(prompt)
            # Extract result from generator and use that as label
            # topic_description = self.model(prompt, **self.pipeline_kwargs)["choices"]
            topic_description = self.model.create_chat_completion(
                messages=[{"role": "system", "content": self.system_prompt}, {"role": "user", "content": prompt}],
                **self.pipeline_kwargs,
            )
            label = topic_description["choices"][0]["message"]["content"].strip()
            updated_topics[topic] = [(label, 1)] + [("", 0) for _ in range(9)]
        return updated_topics
    def _create_prompt(self, docs, topic, topics):
        keywords = list(zip(*topics[topic]))[0]
        # Use the Default Chat Prompt
        if self.prompt == DEFAULT_PROMPT:
            prompt = self.prompt.replace("[KEYWORDS]", ", ".join(keywords))
            prompt = self._replace_documents(prompt, docs)
        # Use a custom prompt that leverages keywords, documents or both using
        # custom tags, namely [KEYWORDS] and [DOCUMENTS] respectively
        else:
            prompt = self.prompt
            if "[KEYWORDS]" in prompt:
                prompt = prompt.replace("[KEYWORDS]", ", ".join(keywords))
            if "[DOCUMENTS]" in prompt:
                prompt = self._replace_documents(prompt, docs)
        return prompt
    @staticmethod
    def _replace_documents(prompt, docs):
        to_replace = ""
        for doc in docs:
            to_replace += f"- {doc}\n"
        prompt = prompt.replace("[DOCUMENTS]", to_replace)
        return prompt
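The `_create_prompt` and `_replace_documents` helpers above fill the `[KEYWORDS]` and `[DOCUMENTS]` tags by plain string replacement. A minimal standalone sketch of that substitution, assuming a hypothetical `fill_prompt` helper and a made-up template (neither is part of BERTopic):

```python
from typing import Sequence

KEYWORDS_TAG = "[KEYWORDS]"
DOCUMENTS_TAG = "[DOCUMENTS]"


def fill_prompt(template: str, keywords: Sequence[str], docs: Sequence[str]) -> str:
    """Replace the tags in `template`; a tag that is absent is simply skipped."""
    prompt = template
    if KEYWORDS_TAG in prompt:
        # Keywords are joined into one comma-separated string
        prompt = prompt.replace(KEYWORDS_TAG, ", ".join(keywords))
    if DOCUMENTS_TAG in prompt:
        # Each document becomes one "- ..." bullet ending in a newline
        bullet_list = "".join(f"- {doc}\n" for doc in docs)
        prompt = prompt.replace(DOCUMENTS_TAG, bullet_list)
    return prompt


# Hypothetical template and inputs, for illustration only
template = "Documents:\n[DOCUMENTS]Keywords: [KEYWORDS]\nLabel:"
filled = fill_prompt(template, ["meat", "beef"], ["Beef emits a lot.", "I eat less meat."])
print(filled)
```

A tag missing from the template is skipped rather than treated as an error, which mirrors the custom-prompt branch of `_create_prompt`.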
@@ -1,128 +0,0 @@
 import warnings
 import numpy as np
 import pandas as pd
 from typing import List, Mapping, Tuple
 from scipy.sparse import csr_matrix
 from sklearn.metrics.pairwise import cosine_similarity
 from bertopic.representation._base import BaseRepresentation
 class MaximalMarginalRelevance(BaseRepresentation):
    """Calculate Maximal Marginal Relevance (MMR)
    between candidate keywords and the document.
    MMR considers the similarity of keywords/keyphrases with the
    document, along with their similarity to the already selected
    keywords and keyphrases. This results in a selection of keywords
    that are relevant to the document while remaining diverse among themselves.
    Arguments:
        diversity: How diverse the selected keywords/keyphrases are.
                    Values range between 0 and 1 with 0 being not diverse at all
                    and 1 being most diverse.
        top_n_words: The number of keywords/keyphrases to return
    Usage:
    ```python
    from bertopic.representation import MaximalMarginalRelevance
    from bertopic import BERTopic
    # Create your representation model
    representation_model = MaximalMarginalRelevance(diversity=0.3)
    # Use the representation model in BERTopic on top of the default pipeline
    topic_model = BERTopic(representation_model=representation_model)
    ```
    """
    def __init__(self, diversity: float = 0.1, top_n_words: int = 10):
        self.diversity = diversity
        self.top_n_words = top_n_words
    def extract_topics(
        self,
        topic_model,
        documents: pd.DataFrame,
        c_tf_idf: csr_matrix,
        topics: Mapping[str, List[Tuple[str, float]]],
    ) -> Mapping[str, List[Tuple[str, float]]]:
        """Extract topic representations.
        Arguments:
            topic_model: The BERTopic model
            documents: Not used
            c_tf_idf: Not used
            topics: The candidate topics as calculated with c-TF-IDF
        Returns:
            updated_topics: Updated topic representations
        """
        if topic_model.embedding_model is None:
            warnings.warn(
                "MaximalMarginalRelevance can only be used if BERTopic was instantiated "
                "with the `embedding_model` parameter."
            )
            return topics
        updated_topics = {}
        for topic, topic_words in topics.items():
            words = [word[0] for word in topic_words]
            word_embeddings = topic_model._extract_embeddings(words, method="word", verbose=False)
            topic_embedding = topic_model._extract_embeddings(" ".join(words), method="word", verbose=False).reshape(
                1, -1
            )
            topic_words = mmr(
                topic_embedding,
                word_embeddings,
                words,
                self.diversity,
                self.top_n_words,
            )
            updated_topics[topic] = [(word, value) for word, value in topics[topic] if word in topic_words]
        return updated_topics
 def mmr(
    doc_embedding: np.ndarray,
    word_embeddings: np.ndarray,
    words: List[str],
    diversity: float = 0.1,
    top_n: int = 10,
 ) -> List[str]:
    """Maximal Marginal Relevance.
    Arguments:
        doc_embedding: The document embeddings
        word_embeddings: The embeddings of the selected candidate keywords/phrases
        words: The selected candidate keywords/keyphrases
        diversity: The diversity of the selected embeddings.
                   Values between 0 and 1.
        top_n: The top n items to return
    Returns:
            List[str]: The selected keywords/keyphrases
    """
    # Extract similarity within words, and between words and the document
    word_doc_similarity = cosine_similarity(word_embeddings, doc_embedding)
    word_similarity = cosine_similarity(word_embeddings)
    # Initialize candidates and select the best keyword/keyphrase up front
    keywords_idx = [np.argmax(word_doc_similarity)]
    candidates_idx = [i for i in range(len(words)) if i != keywords_idx[0]]
    # Stop early when fewer than `top_n` candidate words are available
    for _ in range(min(top_n - 1, len(candidates_idx))):
        # Extract similarities within candidates and
        # between candidates and selected keywords/phrases
        candidate_similarities = word_doc_similarity[candidates_idx, :]
        target_similarities = np.max(word_similarity[candidates_idx][:, keywords_idx], axis=1)
        # Calculate MMR
        mmr = (1 - diversity) * candidate_similarities - diversity * target_similarities.reshape(-1, 1)
        mmr_idx = candidates_idx[np.argmax(mmr)]
        # Update keywords & candidates
        keywords_idx.append(mmr_idx)
        candidates_idx.remove(mmr_idx)
    return [words[idx] for idx in keywords_idx]
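To make the effect of `diversity` in `mmr` concrete, here is a small NumPy-only sketch of the same greedy selection on hand-made 2-D vectors. The `cosine` and `mmr_select` names and the toy embeddings are illustrative assumptions, not BERTopic code; real embeddings come from the embedding model.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `a` and the rows of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def mmr_select(doc_vec, word_vecs, words, diversity=0.1, top_n=2):
    """Greedy MMR: trade relevance to the document against redundancy."""
    word_doc = cosine(word_vecs, doc_vec)     # shape (n_words, 1)
    word_word = cosine(word_vecs, word_vecs)  # shape (n_words, n_words)
    # Start with the word that is most similar to the document
    selected = [int(np.argmax(word_doc))]
    candidates = [i for i in range(len(words)) if i != selected[0]]
    for _ in range(min(top_n - 1, len(candidates))):
        relevance = word_doc[candidates, 0]
        # Redundancy: highest similarity to any already selected word
        redundancy = word_word[np.ix_(candidates, selected)].max(axis=1)
        scores = (1 - diversity) * relevance - diversity * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return [words[i] for i in selected]


# Toy data: "beef" and "steak" point the same way, "emissions" does not
doc = np.array([[1.0, 0.0]])
vecs = np.array([[1.0, 0.1], [1.0, 0.2], [0.6, 0.8]])
words = ["beef", "steak", "emissions"]

low = mmr_select(doc, vecs, words, diversity=0.1)
high = mmr_select(doc, vecs, words, diversity=0.7)
print(low, high)
```

With a low `diversity` the near-duplicate "steak" is picked second; with a high value the redundancy penalty dominates and "emissions" is picked instead.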
@@ -1,274 +0,0 @@
 import time
 import openai
 import pandas as pd
 from tqdm import tqdm
 from scipy.sparse import csr_matrix
 from typing import Mapping, List, Tuple, Any, Union, Callable
 from bertopic.representation._base import BaseRepresentation
 from bertopic.representation._utils import (
    retry_with_exponential_backoff,
    truncate_document,
    validate_truncate_document_parameters,
 )
 DEFAULT_CHAT_PROMPT = """You will extract a short topic label from given documents and keywords.
 Here are two examples of topics you created before:
 # Example 1
 Sample texts from this topic:
 - Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
 - Meat, but especially beef, is the worst food in terms of emissions.
 - Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.
 Keywords: meat beef eat eating emissions steak food health processed chicken
 topic: Environmental impacts of eating meat
 # Example 2
 Sample texts from this topic:
 - I have ordered the product weeks ago but it still has not arrived!
 - The website mentions that it only takes a couple of days to deliver but I still have not received mine.
 - I got a message stating that I received the monitor but that is not true!
 - It took a month longer to deliver than was advised...
 Keywords: deliver weeks product shipping long delivery received arrived arrive week
 topic: Shipping and delivery issues
 # Your task
 Sample texts from this topic:
 [DOCUMENTS]
 Keywords: [KEYWORDS]
 Based on the information above, extract a short topic label (three words at most) in the following format:
 topic: <topic_label>
 """
 DEFAULT_SYSTEM_PROMPT = "You are an assistant that extracts high-level topics from texts."
 class OpenAI(BaseRepresentation):
    r"""Using the OpenAI API to generate topic labels based
    on one of their ChatCompletion models.
    For an overview see:
    https://platform.openai.com/docs/models
    Arguments:
        client: A `openai.OpenAI` client
        model: Model to use within OpenAI, defaults to `"gpt-4o-mini"`.
        generator_kwargs: Kwargs passed to `client.chat.completions.create`
                          for fine-tuning the output.
        prompt: The prompt to be used in the model. If no prompt is given,
                `self.default_prompt_` is used instead.
                NOTE: Use `"[KEYWORDS]"` and `"[DOCUMENTS]"` in the prompt
                to decide where the keywords and documents need to be
                inserted.
        system_prompt: The system prompt to be used in the model. If no system prompt is given,
                       `self.default_system_prompt_` is used instead.
        delay_in_seconds: The delay in seconds between consecutive prompts
                          in order to prevent RateLimitErrors.
        exponential_backoff: Retry requests with a random exponential backoff.
                             A short sleep is used when a rate limit error is hit,
                             then the request is retried. The sleep length increases
                             with each error, up to 10 unsuccessful requests.
                             If True, overrides `delay_in_seconds`.
        nr_docs: The number of documents to pass to OpenAI if a prompt
                 with the `"[DOCUMENTS]"` tag is used.
        diversity: The diversity of documents to pass to OpenAI.
                   Accepts values between 0 and 1. A higher
                   value results in passing more diverse documents,
                   whereas a lower value passes more similar documents.
        doc_length: The maximum length of each document. If a document is longer,
                    it will be truncated. If None, the entire document is passed.
        tokenizer: The tokenizer used to split the document into segments,
                   which are counted to determine the length of a document.
                       * If tokenizer is 'char', then the document is split up
                         into characters which are counted to adhere to `doc_length`
                       * If tokenizer is 'whitespace', the document is split up
                         into words separated by whitespaces. These words are counted
                         and truncated depending on `doc_length`
                       * If tokenizer is 'vectorizer', then the internal CountVectorizer
                         is used to tokenize the document. These tokens are counted
                         and truncated depending on `doc_length`
                       * If tokenizer is a callable, then that callable is used to tokenize
                         the document. These tokens are counted and truncated depending
                         on `doc_length`
    Usage:
    To use this, you will need to install the openai package first:
    `pip install openai`
    Then, get yourself an API key and use OpenAI's API as follows:
    ```python
    import openai
    from bertopic.representation import OpenAI
    from bertopic import BERTopic
    # Create your representation model
    client = openai.OpenAI(api_key=MY_API_KEY)
    representation_model = OpenAI(client, delay_in_seconds=5)
    # Use the representation model in BERTopic on top of the default pipeline
    topic_model = BERTopic(representation_model=representation_model)
    ```
    You can also use a custom prompt:
    ```python
    prompt = "I have the following documents: [DOCUMENTS] \nThese documents are about the following topic: '"
    representation_model = OpenAI(client, prompt=prompt, delay_in_seconds=5)
    ```
    To choose a model:
    ```python
    representation_model = OpenAI(client, model="gpt-4o-mini", delay_in_seconds=10)
    ```
    """
    def __init__(
        self,
        client,
        model: str = "gpt-4o-mini",
        prompt: str = None,
        system_prompt: str = None,
        generator_kwargs: Mapping[str, Any] = {},
        delay_in_seconds: float = None,
        exponential_backoff: bool = False,
        nr_docs: int = 4,
        diversity: float = None,
        doc_length: int = None,
        tokenizer: Union[str, Callable] = None,
        **kwargs,
    ):
        self.client = client
        self.model = model
        if prompt is None:
            self.prompt = DEFAULT_CHAT_PROMPT
        else:
            self.prompt = prompt
        if system_prompt is None:
            self.system_prompt = DEFAULT_SYSTEM_PROMPT
        else:
            self.system_prompt = system_prompt
        self.default_prompt_ = DEFAULT_CHAT_PROMPT
        self.default_system_prompt_ = DEFAULT_SYSTEM_PROMPT
        self.delay_in_seconds = delay_in_seconds
        self.exponential_backoff = exponential_backoff
        self.nr_docs = nr_docs
        self.diversity = diversity
        self.doc_length = doc_length
        self.tokenizer = tokenizer
        validate_truncate_document_parameters(self.tokenizer, self.doc_length)
        self.prompts_ = []
        # Copy so that neither the caller's dict nor the shared default is mutated below
        self.generator_kwargs = dict(generator_kwargs)
        if self.generator_kwargs.get("model"):
            self.model = generator_kwargs.get("model")
            del self.generator_kwargs["model"]
        if self.generator_kwargs.get("prompt"):
            del self.generator_kwargs["prompt"]
        if not self.generator_kwargs.get("stop"):
            self.generator_kwargs["stop"] = "\n"
    def extract_topics(
        self,
        topic_model,
        documents: pd.DataFrame,
        c_tf_idf: csr_matrix,
        topics: Mapping[str, List[Tuple[str, float]]],
    ) -> Mapping[str, List[Tuple[str, float]]]:
        """Extract topics.
        Arguments:
            topic_model: A BERTopic model
            documents: All input documents
            c_tf_idf: The topic c-TF-IDF representation
            topics: The candidate topics as calculated with c-TF-IDF
        Returns:
            updated_topics: Updated topic representations
        """
        # Extract the top n representative documents per topic
        repr_docs_mappings, _, _, _ = topic_model._extract_representative_docs(
            c_tf_idf, documents, topics, 500, self.nr_docs, self.diversity
        )
        # Generate using OpenAI's Language Model
        updated_topics = {}
        for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
            truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
            prompt = self._create_prompt(truncated_docs, topic, topics)
            self.prompts_.append(prompt)
            # Delay
            if self.delay_in_seconds:
                time.sleep(self.delay_in_seconds)
            messages = [
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": prompt},
            ]
            kwargs = {
                "model": self.model,
                "messages": messages,
                **self.generator_kwargs,
            }
            if self.exponential_backoff:
                response = chat_completions_with_backoff(self.client, **kwargs)
            else:
                response = self.client.chat.completions.create(**kwargs)
            # Check whether content was actually generated
            # Addresses #1570 for potential issues with OpenAI's content filter
            # Addresses #2176 for potential issues when OpenAI returns a None type object
            if response and hasattr(response.choices[0].message, "content"):
                label = response.choices[0].message.content.strip().replace("topic: ", "")
            else:
                label = "No label returned"
            updated_topics[topic] = [(label, 1)]
        return updated_topics
    def _create_prompt(self, docs, topic, topics):
        keywords = list(zip(*topics[topic]))[0]
        # Use the Default Chat Prompt
        if self.prompt == DEFAULT_CHAT_PROMPT:
            prompt = self.prompt.replace("[KEYWORDS]", ", ".join(keywords))
            prompt = self._replace_documents(prompt, docs)
        # Use a custom prompt that leverages keywords, documents or both using
        # custom tags, namely [KEYWORDS] and [DOCUMENTS] respectively
        else:
            prompt = self.prompt
            if "[KEYWORDS]" in prompt:
                prompt = prompt.replace("[KEYWORDS]", ", ".join(keywords))
            if "[DOCUMENTS]" in prompt:
                prompt = self._replace_documents(prompt, docs)
        return prompt
    @staticmethod
    def _replace_documents(prompt, docs):
        to_replace = ""
        for doc in docs:
            to_replace += f"- {doc}\n"
        prompt = prompt.replace("[DOCUMENTS]", to_replace)
        return prompt
 def chat_completions_with_backoff(client, **kwargs):
    return retry_with_exponential_backoff(
        client.chat.completions.create,
        errors=(openai.RateLimitError,),
    )(**kwargs)
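`retry_with_exponential_backoff` is imported from `bertopic.representation._utils` and its body is not shown in this file. The sketch below is one common way such a wrapper could work, written under assumptions: the `retry_with_backoff` name, its parameters, and the `FakeRateLimitError` stand-in are hypothetical, and the real helper may differ.

```python
import random
import time
from typing import Callable, Tuple, Type


def retry_with_backoff(
    func: Callable,
    errors: Tuple[Type[BaseException], ...],
    initial_delay: float = 1.0,
    base: float = 2.0,
    jitter: bool = True,
    max_retries: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable:
    """Wrap `func` so that the listed errors trigger a growing sleep and a retry."""

    def wrapper(*args, **kwargs):
        delay = initial_delay
        retries = 0
        while True:
            try:
                return func(*args, **kwargs)
            except errors:
                retries += 1
                if retries > max_retries:
                    raise  # give up and surface the last error
                # Grow the delay; jitter spreads out simultaneous clients
                delay *= base * (1 + (random.random() if jitter else 0.0))
                sleep(delay)

    return wrapper


class FakeRateLimitError(Exception):
    """Stand-in for `openai.RateLimitError`, so the sketch runs offline."""


calls = {"n": 0}


def flaky():
    # Fails twice, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeRateLimitError("slow down")
    return "ok"


delays = []  # record the sleeps instead of actually waiting
result = retry_with_backoff(flaky, (FakeRateLimitError,), jitter=False, sleep=delays.append)()
print(result, delays)
```

Injecting `sleep` keeps the demonstration instant and deterministic; in real use the default `time.sleep` would apply.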
@@ -1,161 +0,0 @@
 import numpy as np
 import pandas as pd
 import spacy
 from spacy.matcher import Matcher
 from spacy.language import Language
 from packaging import version
 from scipy.sparse import csr_matrix
 from typing import List, Mapping, Tuple, Union
 from sklearn import __version__ as sklearn_version
 from bertopic.representation._base import BaseRepresentation
 class PartOfSpeech(BaseRepresentation):
    """Extract Topic Keywords based on their Part-of-Speech.
    DEFAULT_PATTERNS = [
                [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
                [{'POS': 'NOUN'}],
                [{'POS': 'ADJ'}]
    ]
    From candidate topics, as extracted with c-TF-IDF,
    find documents that contain keywords found in the
    candidate topics. These candidate documents then
    serve as the representative set of documents from
    which the Spacy model can extract a set of candidate
    keywords for each topic.
    These candidate keywords are first judged by whether
    they fall within the DEFAULT_PATTERNS or the user-defined
    pattern. Then, the resulting keywords are sorted by
    their respective c-TF-IDF values.
    Arguments:
        model: The Spacy model to use
        top_n_words: The top n words to extract
        pos_patterns: Patterns for Spacy to use.
                      See https://spacy.io/usage/rule-based-matching
    Usage:
    ```python
    from bertopic.representation import PartOfSpeech
    from bertopic import BERTopic
    # Create your representation model
    representation_model = PartOfSpeech("en_core_web_sm")
    # Use the representation model in BERTopic on top of the default pipeline
    topic_model = BERTopic(representation_model=representation_model)
    ```
    You can define custom POS patterns to be extracted:
    ```python
    pos_patterns = [
                [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
                [{'POS': 'NOUN'}], [{'POS': 'ADJ'}]
    ]
    representation_model = PartOfSpeech("en_core_web_sm", pos_patterns=pos_patterns)
    ```
    """
    def __init__(
        self,
        model: Union[str, Language] = "en_core_web_sm",
        top_n_words: int = 10,
        pos_patterns: List[str] = None,
    ):
        if isinstance(model, str):
            self.model = spacy.load(model)
        elif isinstance(model, Language):
            self.model = model
        else:
            raise ValueError(
                "Make sure that the Spacy model that you "
                "pass is either a string referring to a "
                "Spacy model or a Spacy nlp object."
            )
        self.top_n_words = top_n_words
        if pos_patterns is None:
            self.pos_patterns = [
                [{"POS": "ADJ"}, {"POS": "NOUN"}],
                [{"POS": "NOUN"}],
                [{"POS": "ADJ"}],
            ]
        else:
            self.pos_patterns = pos_patterns
    def extract_topics(
        self,
        topic_model,
        documents: pd.DataFrame,
        c_tf_idf: csr_matrix,
        topics: Mapping[str, List[Tuple[str, float]]],
    ) -> Mapping[str, List[Tuple[str, float]]]:
        """Extract topics.
        Arguments:
            topic_model: A BERTopic model
            documents: All input documents
            c_tf_idf: Not used
            topics: The candidate topics as calculated with c-TF-IDF
        Returns:
            updated_topics: Updated topic representations
        """
        matcher = Matcher(self.model.vocab)
        matcher.add("Pattern", self.pos_patterns)
        candidate_topics = {}
        for topic, values in topics.items():
            keywords = list(zip(*values))[0]
            # Extract candidate documents
            candidate_documents = []
            for keyword in keywords:
                selection = documents.loc[documents.Topic == topic, :]
                selection = selection.loc[selection.Document.str.contains(keyword, regex=False), "Document"]
                if len(selection) > 0:
                    for document in selection[:2]:
                        candidate_documents.append(document)
            candidate_documents = list(set(candidate_documents))
            # Extract keywords
            docs_pipeline = self.model.pipe(candidate_documents)
            updated_keywords = []
            for doc in docs_pipeline:
                matches = matcher(doc)
                for _, start, end in matches:
                    updated_keywords.append(doc[start:end].text)
            candidate_topics[topic] = list(set(updated_keywords))
        # Scikit-Learn Deprecation: get_feature_names is deprecated in 1.0
        # and will be removed in 1.2. Please use get_feature_names_out instead.
        if version.parse(sklearn_version) >= version.parse("1.0.0"):
            words = list(topic_model.vectorizer_model.get_feature_names_out())
        else:
            words = list(topic_model.vectorizer_model.get_feature_names())
        # Match updated keywords with c-TF-IDF values
        words_lookup = dict(zip(words, range(len(words))))
        updated_topics = {topic: [] for topic in topics.keys()}
        for topic, candidate_keywords in candidate_topics.items():
            word_indices = np.sort(
                [words_lookup.get(keyword) for keyword in candidate_keywords if keyword in words_lookup]
            )
            vals = topic_model.c_tf_idf_[:, word_indices][topic + topic_model._outliers]
            indices = np.argsort(np.array(vals.todense().reshape(1, -1))[0])[-self.top_n_words :][::-1]
            vals = np.sort(np.array(vals.todense().reshape(1, -1))[0])[-self.top_n_words :][::-1]
            topic_words = [(words[word_indices[index]], val) for index, val in zip(indices, vals)]
            updated_topics[topic] = topic_words
            if len(updated_topics[topic]) < self.top_n_words:
                updated_topics[topic] += [("", 0) for _ in range(self.top_n_words - len(updated_topics[topic]))]
        return updated_topics
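The last step of `extract_topics` above ranks the spaCy-matched candidates by their c-TF-IDF weight and pads the result to `top_n_words`. A simplified sketch of that ranking with plain Python and NumPy, assuming a hypothetical `rank_candidates` helper and a dense score vector for a single topic (the class itself slices a sparse matrix):

```python
import numpy as np


def rank_candidates(candidates, vocabulary, topic_scores, top_n=3):
    """Keep candidates that exist in the vocabulary, sort by score, pad to `top_n`."""
    lookup = {word: idx for idx, word in enumerate(vocabulary)}
    scored = [
        (word, float(topic_scores[lookup[word]]))
        for word in candidates
        if word in lookup  # phrases unknown to the vectorizer are dropped
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    ranked = scored[:top_n]
    # Pad with empty entries so every topic has the same number of slots
    ranked += [("", 0)] * (top_n - len(ranked))
    return ranked


# Made-up vocabulary and scores, for illustration only
vocabulary = ["fast delivery", "delivery", "late", "monitor"]
scores = np.array([0.4, 0.9, 0.2, 0.1])
ranked = rank_candidates(["late", "delivery", "slow courier"], vocabulary, scores)
print(ranked)
```

Candidates such as "slow courier" that the vectorizer never saw have no c-TF-IDF value, so they cannot be ranked and fall out of the result.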
@@ -1,188 +0,0 @@
 import pandas as pd
 from tqdm import tqdm
 from scipy.sparse import csr_matrix
 from transformers import pipeline, set_seed
 from transformers.pipelines.base import Pipeline
 from typing import Mapping, List, Tuple, Any, Union, Callable
 from bertopic.representation._base import BaseRepresentation
 from bertopic.representation._utils import truncate_document, validate_truncate_document_parameters
 DEFAULT_PROMPT = """
 I have a topic described by the following keywords: [KEYWORDS].
 The name of this topic is:
 """
 class TextGeneration(BaseRepresentation):
    """Text2Text or text generation with transformers.
    Arguments:
        model: A transformers pipeline that should be initialized as "text-generation"
               for gpt-like models or "text2text-generation" for T5-like models.
               For example, `pipeline('text-generation', model='gpt2')`. If a string
               is passed, "text-generation" will be selected by default.
        prompt: The prompt to be used in the model. If no prompt is given,
                `self.default_prompt_` is used instead.
                NOTE: Use `"[KEYWORDS]"` and `"[DOCUMENTS]"` in the prompt
                to decide where the keywords and documents need to be
                inserted.
        pipeline_kwargs: Kwargs that you can pass to the transformers.pipeline
                         when it is called.
        random_state: A random state to be passed to `transformers.set_seed`
        nr_docs: The number of documents to pass to the model if a prompt
                 with the `"[DOCUMENTS]"` tag is used.
        diversity: The diversity of documents to pass to the model.
                   Accepts values between 0 and 1. A higher
                   value results in passing more diverse documents,
                   whereas a lower value passes more similar documents.
        doc_length: The maximum length of each document. If a document is longer,
                    it will be truncated. If None, the entire document is passed.
        tokenizer: The tokenizer used to split the document into segments,
                   which are counted to determine the length of a document.
                       * If tokenizer is 'char', then the document is split up
                         into characters which are counted to adhere to `doc_length`
                       * If tokenizer is 'whitespace', the document is split up
                         into words separated by whitespaces. These words are counted
                         and truncated depending on `doc_length`
                       * If tokenizer is 'vectorizer', then the internal CountVectorizer
                         is used to tokenize the document. These tokens are counted
                         and truncated depending on `doc_length`
                       * If tokenizer is a callable, then that callable is used to tokenize
                         the document. These tokens are counted and truncated depending
                         on `doc_length`
    Usage:
    To use a gpt-like model:
    ```python
    from transformers import pipeline
    from bertopic.representation import TextGeneration
    from bertopic import BERTopic
    # Create your representation model
    generator = pipeline('text-generation', model='gpt2')
    representation_model = TextGeneration(generator)
    # Use the representation model in BERTopic on top of the default pipeline
    topic_model = BERTopic(representation_model=representation_model)
    ```
    You can use a custom prompt and decide where the keywords should
    be inserted by using the `[KEYWORDS]` tag, or the documents with the `[DOCUMENTS]` tag:
    ```python
    from transformers import pipeline
    from bertopic.representation import TextGeneration
    prompt = "I have a topic described by the following keywords: [KEYWORDS]. Based on the previous keywords, what is this topic about?"
    # Create your representation model
    generator = pipeline('text2text-generation', model='google/flan-t5-base')
    representation_model = TextGeneration(generator, prompt=prompt)
    ```
    """
    def __init__(
        self,
        model: Union[str, Pipeline],
        prompt: str = None,
        pipeline_kwargs: Mapping[str, Any] = {},
        random_state: int = 42,
        nr_docs: int = 4,
        diversity: float = None,
        doc_length: int = None,
        tokenizer: Union[str, Callable] = None,
    ):
        self.random_state = random_state
        set_seed(random_state)
        if isinstance(model, str):
            self.model = pipeline("text-generation", model=model)
        elif isinstance(model, Pipeline):
            self.model = model
        else:
            raise ValueError(
                "Make sure that the HF model that you "
                "pass is either a string referring to a "
                "HF model or a `transformers.pipeline` object."
            )
        self.prompt = prompt if prompt is not None else DEFAULT_PROMPT
        self.default_prompt_ = DEFAULT_PROMPT
        self.pipeline_kwargs = pipeline_kwargs
        self.nr_docs = nr_docs
        self.diversity = diversity
        self.doc_length = doc_length
        self.tokenizer = tokenizer
        validate_truncate_document_parameters(self.tokenizer, self.doc_length)
        self.prompts_ = []
    def extract_topics(
        self,
        topic_model,
        documents: pd.DataFrame,
        c_tf_idf: csr_matrix,
        topics: Mapping[str, List[Tuple[str, float]]],
    ) -> Mapping[str, List[Tuple[str, float]]]:
        """Extract topic representations and return a single label.
        Arguments:
            topic_model: A BERTopic model
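            documents: All input documents; only used when a custom prompt contains `[DOCUMENTS]`
            c_tf_idf: The topic c-TF-IDF representation; only used when a custom prompt contains `[DOCUMENTS]`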
            topics: The candidate topics as calculated with c-TF-IDF
        Returns:
            updated_topics: Updated topic representations
        """
        # Extract the top `nr_docs` representative documents per topic, but only if the prompt needs them
        if self.prompt != DEFAULT_PROMPT and "[DOCUMENTS]" in self.prompt:
            repr_docs_mappings, _, _, _ = topic_model._extract_representative_docs(
                c_tf_idf, documents, topics, 500, self.nr_docs, self.diversity
            )
        else:
            repr_docs_mappings = {topic: None for topic in topics.keys()}
        updated_topics = {}
        for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
            # Prepare prompt
            truncated_docs = (
                [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
                if docs is not None
                else docs
            )
            prompt = self._create_prompt(truncated_docs, topic, topics)
            self.prompts_.append(prompt)
            # Extract result from generator and use that as label
            topic_description = self.model(prompt, **self.pipeline_kwargs)
            topic_description = [
                (description["generated_text"].replace(prompt, ""), 1) for description in topic_description
            ]
            if len(topic_description) < 10:
                topic_description += [("", 0) for _ in range(10 - len(topic_description))]
            updated_topics[topic] = topic_description
        return updated_topics
    def _create_prompt(self, docs, topic, topics):
        keywords = ", ".join(list(zip(*topics[topic]))[0])
        # Use the default prompt and replace keywords
        if self.prompt == DEFAULT_PROMPT:
            prompt = self.prompt.replace("[KEYWORDS]", keywords)
        # Use a prompt that leverages either keywords or documents in
        # a custom location
        else:
            prompt = self.prompt
            if "[KEYWORDS]" in prompt:
                prompt = prompt.replace("[KEYWORDS]", keywords)
            if "[DOCUMENTS]" in prompt:
                to_replace = ""
                for doc in docs:
                    to_replace += f"- {doc}\n"
                prompt = prompt.replace("[DOCUMENTS]", to_replace)
        return prompt
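The placeholder substitution performed by `_create_prompt` can be tried in isolation. The following is a minimal sketch, not the class's actual method: `fill_prompt`, the template text, and the sample documents are hypothetical names and values chosen for illustration, and treating missing documents as an empty list is an assumption of this sketch.

```python
from typing import Iterable, Sequence


def fill_prompt(template: str, keywords: Sequence[str], docs: Iterable[str] = ()) -> str:
    """Replace the [KEYWORDS] and [DOCUMENTS] tags in a prompt template."""
    prompt = template
    if "[KEYWORDS]" in prompt:
        # Keywords are joined into one comma-separated string
        prompt = prompt.replace("[KEYWORDS]", ", ".join(keywords))
    if "[DOCUMENTS]" in prompt:
        # Each document becomes one bullet line, mirroring the loop above
        bullet_list = "".join(f"- {doc}\n" for doc in docs)
        prompt = prompt.replace("[DOCUMENTS]", bullet_list)
    return prompt


# Illustrative template and inputs (made up for this example)
template = "Keywords: [KEYWORDS]\nSample documents:\n[DOCUMENTS]Topic label:"
filled = fill_prompt(
    template,
    ["space", "nasa", "orbit"],
    ["The shuttle launched.", "NASA delayed the mission."],
)
print(filled)
```

Because each bullet line already ends with a newline, a template should not add another line break directly after the `[DOCUMENTS]` tag.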
@@ -1,113 +0,0 @@
 import random
 import time
 from typing import Union
 def truncate_document(topic_model, doc_length: Union[int, None], tokenizer: Union[str, callable], document: str) -> str:
    """Truncate a document to a certain length.
    If you want to add a custom tokenizer, then it will need to have a `decode` and
    `encode` method. An example would be the following custom tokenizer:
    ```python
    class Tokenizer:
        'A custom tokenizer that splits on commas'
        def encode(self, doc):
            return doc.split(",")
        def decode(self, doc_chunks):
            return ",".join(doc_chunks)
    ```
    You can use this tokenizer by passing it to the `tokenizer` parameter.
    Arguments:
        topic_model: A BERTopic model
        doc_length: The maximum length of each document. If a document is longer,
                    it will be truncated. If None, the entire document is passed.
        tokenizer: The tokenizer used to split the document into segments
                   whose count determines the length of a document.
                       * If tokenizer is 'char', then the document is split up
                         into characters which are counted to adhere to `doc_length`
                       * If tokenizer is 'whitespace', the document is split up
                         into words separated by whitespaces. These words are counted
                         and truncated depending on `doc_length`
                       * If tokenizer is 'vectorizer', then the internal CountVectorizer
                         is used to tokenize the document. These tokens are counted
                         and truncated depending on `doc_length`. They are decoded with
                         whitespaces.
                       * If tokenizer is a callable, then that callable is used to tokenize
                         the document. These tokens are counted and truncated depending
                         on `doc_length`
        document: A single document
    Returns:
        truncated_document: A truncated document
    """
    if doc_length is not None:
        if tokenizer == "char":
            truncated_document = document[:doc_length]
        elif tokenizer == "whitespace":
            truncated_document = " ".join(document.split()[:doc_length])
        elif tokenizer == "vectorizer":
            tokenizer = topic_model.vectorizer_model.build_tokenizer()
            truncated_document = " ".join(tokenizer(document)[:doc_length])
        elif hasattr(tokenizer, "encode") and hasattr(tokenizer, "decode"):
            encoded_document = tokenizer.encode(document)
            truncated_document = tokenizer.decode(encoded_document[:doc_length])
        else:
            # Fail with a clear message instead of an UnboundLocalError below
            raise ValueError(
                "`tokenizer` must be 'char', 'whitespace', 'vectorizer', "
                "or an object with `encode` and `decode` methods."
            )
        return truncated_document
    return document
 def validate_truncate_document_parameters(tokenizer, doc_length) -> None:
    """Validates parameters that are used in the function `truncate_document`."""
    if tokenizer is None and doc_length is not None:
        raise ValueError(
            "Please select from one of the valid options for the `tokenizer` parameter: \n"
            "{'char', 'whitespace', 'vectorizer'} \n"
            "If `tokenizer` is of type callable ensure it has methods to encode and decode a document \n"
        )
    elif tokenizer is not None and doc_length is None:
        raise ValueError("If `tokenizer` is provided, `doc_length` of type int must be provided as well.")
 def retry_with_exponential_backoff(
    func,
    initial_delay: float = 1,
    exponential_base: float = 2,
    jitter: bool = True,
    max_retries: int = 10,
    errors: tuple = None,
 ):
    """Retry a function with exponential backoff."""
    def wrapper(*args, **kwargs):
        # Initialize variables
        num_retries = 0
        delay = initial_delay
        # Loop until a successful response or max_retries is hit or an exception is raised
        while True:
            try:
                return func(*args, **kwargs)
            # Retry on specific errors
            except errors:
                # Increment retries
                num_retries += 1
                # Check if max retries has been reached
                if num_retries > max_retries:
                    raise Exception(f"Maximum number of retries ({max_retries}) exceeded.")
                # Increment the delay
                delay *= exponential_base * (1 + jitter * random.random())
                # Sleep for the delay
                time.sleep(delay)
            # Raise exceptions for any errors not specified
            except Exception as e:
                raise e
    return wrapper
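The delay schedule of `retry_with_exponential_backoff` is easier to see without real sleeping. The sketch below is a simplified stand-in rather than the function above: `backoff_delays`, `retry`, and the injectable `sleep` parameter are hypothetical helpers, jitter is disabled so the numbers are deterministic, and the last failure is re-raised as-is instead of being wrapped in a generic `Exception`.

```python
import random
from typing import Callable, List, Tuple, Type


def backoff_delays(
    initial_delay: float = 1,
    exponential_base: float = 2,
    max_retries: int = 3,
    jitter: bool = False,
) -> List[float]:
    """Return the successive sleep durations produced by the update rule above."""
    delays = []
    delay = initial_delay
    for _ in range(max_retries):
        # Same rule as the wrapper: the delay grows before each sleep
        delay *= exponential_base * (1 + jitter * random.random())
        delays.append(delay)
    return delays


def retry(
    func: Callable,
    errors: Tuple[Type[BaseException], ...],
    max_retries: int = 3,
    sleep: Callable[[float], None] = lambda seconds: None,
) -> Callable:
    """Retry `func` on `errors`; `sleep` is injectable so demos do not wait."""

    def wrapper(*args, **kwargs):
        for delay in backoff_delays(max_retries=max_retries):
            try:
                return func(*args, **kwargs)
            except errors:
                sleep(delay)
        # Final attempt: any error now propagates to the caller
        return func(*args, **kwargs)

    return wrapper


calls = {"n": 0}


def flaky() -> str:
    """Fail twice, then succeed (illustrative only)."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


slept: List[float] = []
result = retry(flaky, errors=(ConnectionError,), sleep=slept.append)()
print(result, slept)
```

Note that because the delay is multiplied before the first sleep, the first wait is `initial_delay * exponential_base` rather than `initial_delay` itself.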
@@ -1,274 +0,0 @@
 import numpy as np
 import pandas as pd
 from PIL import Image
 from tqdm import tqdm
 from scipy.sparse import csr_matrix
 from typing import Mapping, List, Tuple, Union
 from transformers.pipelines import Pipeline, pipeline
 from bertopic.representation._mmr import mmr
 from bertopic.representation._base import BaseRepresentation
 class VisualRepresentation(BaseRepresentation):
    """From a collection of representative documents, extract
    images to represent topics. These topics are represented by a
    collage of images.
    Arguments:
        nr_repr_images: Number of representative images to extract
        nr_samples: The number of candidate documents to extract per cluster.
        image_height: The height of the resulting collage
        image_squares: Whether to resize each image in the collage
                       to a square. This can be visually more appealing
                       if the input images are all nearly square.
        image_to_text_model: The model to caption images.
        batch_size: The number of images to pass to the
                    `image_to_text_model`.
    Usage:
    ```python
    from bertopic.representation import VisualRepresentation
    from bertopic import BERTopic
    # The visual representation is typically not a core representation,
    # so it is advised to pass it to BERTopic as an additional aspect.
    # Aspects can be labeled with dictionaries as shown below:
    representation_model = {
        "Visual_Aspect": VisualRepresentation()
    }
    # Use the representation model in BERTopic as a separate aspect
    topic_model = BERTopic(representation_model=representation_model)
    ```
    """
    def __init__(
        self,
        nr_repr_images: int = 9,
        nr_samples: int = 500,
        image_height: int = 600,
        image_squares: bool = False,
        image_to_text_model: Union[str, Pipeline] = None,
        batch_size: int = 32,
    ):
        self.nr_repr_images = nr_repr_images
        self.nr_samples = nr_samples
        self.image_height = image_height
        self.image_squares = image_squares
        # Image-to-text model
        if isinstance(image_to_text_model, Pipeline):
            self.image_to_text_model = image_to_text_model
        elif isinstance(image_to_text_model, str):
            self.image_to_text_model = pipeline("image-to-text", model=image_to_text_model)
        elif image_to_text_model is None:
            self.image_to_text_model = None
        else:
            raise ValueError(
                "Please select a correct transformers pipeline. For example: "
                "pipeline('image-to-text', model='nlpconnect/vit-gpt2-image-captioning')"
            )
        self.batch_size = batch_size
    def extract_topics(
        self,
        topic_model,
        documents: pd.DataFrame,
        c_tf_idf: csr_matrix,
        topics: Mapping[str, List[Tuple[str, float]]],
    ) -> Mapping[str, List[Tuple[str, float]]]:
        """Extract topics.
        Arguments:
            topic_model: A BERTopic model
            documents: All input documents
            c_tf_idf: The topic c-TF-IDF representation
            topics: The candidate topics as calculated with c-TF-IDF
        Returns:
            representative_images: Representative images per topic
        """
        # Extract image ids of most representative documents
        images = documents["Image"].values.tolist()
        (_, _, _, repr_docs_ids) = topic_model._extract_representative_docs(
            c_tf_idf,
            documents,
            topics,
            nr_samples=self.nr_samples,
            nr_repr_docs=self.nr_repr_images,
        )
        unique_topics = sorted(list(topics.keys()))
        # Combine representative images into a single representation
        representative_images = {}
        for topic in tqdm(unique_topics):
            # Get and order representative images
            sliced_examplars = repr_docs_ids[topic + topic_model._outliers]
            sliced_examplars = [sliced_examplars[i : i + 3] for i in range(0, len(sliced_examplars), 3)]
            images_to_combine = [
                [
                    Image.open(images[index]) if isinstance(images[index], str) else images[index]
                    for index in sub_indices
                ]
                for sub_indices in sliced_examplars
            ]
            # Concatenate representative images
            representative_image = get_concat_tile_resize(images_to_combine, self.image_height, self.image_squares)
            representative_images[topic] = representative_image
            # Make sure to properly close images
            if isinstance(images[0], str):
                for image_list in images_to_combine:
                    for image in image_list:
                        image.close()
        return representative_images
    def _convert_image_to_text(self, images: List[str], verbose: bool = False) -> List[str]:
        """Convert a list of images to captions.
        Arguments:
            images: A list of images or words to be converted to text.
            verbose: Controls the verbosity of the process
        Returns:
            List of captions
        """
        # Batch-wise image conversion
        if self.batch_size is not None:
            documents = []
            for batch in tqdm(self._chunks(images), disable=not verbose):
                outputs = self.image_to_text_model(batch)
                captions = [output[0]["generated_text"] for output in outputs]
                documents.extend(captions)
        # Convert images to text
        else:
            outputs = self.image_to_text_model(images)
            documents = [output[0]["generated_text"] for output in outputs]
        return documents
    def image_to_text(self, documents: pd.DataFrame, embeddings: np.ndarray) -> pd.DataFrame:
        """Convert images to text."""
        # Create image topic embeddings
        topics = documents.Topic.values.tolist()
        images = documents.Image.values.tolist()
        df = pd.DataFrame(np.hstack([np.array(topics).reshape(-1, 1), embeddings]))
        image_topic_embeddings = df.groupby(0).mean().values
        # Extract image centroids
        image_centroids = {}
        unique_topics = sorted(list(set(topics)))
        for topic, topic_embedding in zip(unique_topics, image_topic_embeddings):
            indices = np.array([index for index, t in enumerate(topics) if t == topic])
            top_n = min([self.nr_repr_images, len(indices)])
            indices = mmr(
                topic_embedding.reshape(1, -1),
                embeddings[indices],
                indices,
                top_n=top_n,
                diversity=0.1,
            )
            image_centroids[topic] = indices
        # Extract documents
        documents = pd.DataFrame(columns=["Document", "ID", "Topic", "Image"])
        current_id = 0
        for topic, image_ids in tqdm(image_centroids.items()):
            selected_images = [
                Image.open(images[index]) if isinstance(images[index], str) else images[index] for index in image_ids
            ]
            text = self._convert_image_to_text(selected_images)
            for doc, image_id in zip(text, image_ids):
                documents.loc[len(documents), :] = [
                    doc,
                    current_id,
                    topic,
                    images[image_id],
                ]
                current_id += 1
            # Properly close images
            if isinstance(images[image_ids[0]], str):
                for image in selected_images:
                    image.close()
        return documents
    def _chunks(self, images):
        for i in range(0, len(images), self.batch_size):
            yield images[i : i + self.batch_size]
 def get_concat_h_multi_resize(im_list):
    """Code adapted from: https://note.nkmk.me/en/python-pillow-concat-images/."""
    # Scale every image to the tallest image in the row
    max_height = max(im.height for im in im_list)
    # `Image.resize` returns a new image, so the results must be collected
    im_list_resize = [
        im.resize((int(im.width * max_height / im.height), max_height), resample=0) for im in im_list
    ]
    total_width = sum(im.width for im in im_list_resize)
    dst = Image.new("RGB", (total_width, max_height), (255, 255, 255))
    pos_x = 0
    for im in im_list_resize:
        dst.paste(im, (pos_x, 0))
        pos_x += im.width
    return dst
 def get_concat_v_multi_resize(im_list):
    """Code adapted from: https://note.nkmk.me/en/python-pillow-concat-images/."""
    # Scale every image to the widest image in the column
    max_width = max(im.width for im in im_list)
    im_list_resize = [im.resize((max_width, int(im.height * max_width / im.width)), resample=0) for im in im_list]
    total_height = sum(im.height for im in im_list_resize)
    dst = Image.new("RGB", (max_width, total_height), (255, 255, 255))
    pos_y = 0
    for im in im_list_resize:
        dst.paste(im, (0, pos_y))
        pos_y += im.height
    return dst
 def get_concat_tile_resize(im_list_2d, image_height=600, image_squares=False):
    """Code adapted from: https://note.nkmk.me/en/python-pillow-concat-images/."""
    images = [[image.copy() for image in images] for images in im_list_2d]
    # Resize every image to the same square size
    if image_squares:
        width = int(image_height / 3)
        height = int(image_height / 3)
        images = [[image.resize((width, height)) for image in images] for images in im_list_2d]
    # Resize images based on minimum size
    else:
        min_width = min([min([img.width for img in imgs]) for imgs in im_list_2d])
        min_height = min([min([img.height for img in imgs]) for imgs in im_list_2d])
        for i, imgs in enumerate(images):
            for j, img in enumerate(imgs):
                if img.height > img.width:
                    images[i][j] = img.resize(
                        (int(img.width * min_height / img.height), min_height),
                        resample=0,
                    )
                elif img.width > img.height:
                    images[i][j] = img.resize((min_width, int(img.height * min_width / img.width)), resample=0)
                else:
                    images[i][j] = img.resize((min_width, min_width))
    # Resize grid image
    images = [get_concat_h_multi_resize(im_list_h) for im_list_h in images]
    img = get_concat_v_multi_resize(images)
    height_percentage = image_height / float(img.size[1])
    adjusted_width = int((float(img.size[0]) * float(height_percentage)))
    img = img.resize((adjusted_width, image_height), Image.Resampling.LANCZOS)
    return img
@@ -1,104 +0,0 @@
 import pandas as pd
 from transformers import pipeline
 from transformers.pipelines.base import Pipeline
 from scipy.sparse import csr_matrix
 from typing import Mapping, List, Tuple, Any
 from bertopic.representation._base import BaseRepresentation
 class ZeroShotClassification(BaseRepresentation):
    """Zero-shot Classification on topic keywords with candidate labels.
    Arguments:
        candidate_topics: A list of labels to assign to the topics if they
                          exceed `min_prob`
        model: A transformers pipeline that should be initialized as
               "zero-shot-classification". For example,
               `pipeline("zero-shot-classification", model="facebook/bart-large-mnli")`
        pipeline_kwargs: Kwargs that you can pass to the transformers.pipeline
                         when it is called. NOTE: Use `{"multi_label": True}`
                         to extract multiple labels for each topic.
        min_prob: The minimum probability to assign a candidate label to a topic
    Usage:
    ```python
    from bertopic.representation import ZeroShotClassification
    from bertopic import BERTopic
    # Create your representation model
    candidate_topics = ["space and nasa", "bicycles", "sports"]
    representation_model = ZeroShotClassification(candidate_topics, model="facebook/bart-large-mnli")
    # Use the representation model in BERTopic on top of the default pipeline
    topic_model = BERTopic(representation_model=representation_model)
    ```
    """
    def __init__(
        self,
        candidate_topics: List[str],
        model: str = "facebook/bart-large-mnli",
        pipeline_kwargs: Mapping[str, Any] = {},
        min_prob: float = 0.8,
    ):
        self.candidate_topics = candidate_topics
        if isinstance(model, str):
            self.model = pipeline("zero-shot-classification", model=model)
        elif isinstance(model, Pipeline):
            self.model = model
        else:
            raise ValueError(
                "Make sure that the HF model that you "
                "pass is either a string referring to a "
                "HF model or a `transformers.pipeline` object."
            )
        self.pipeline_kwargs = pipeline_kwargs
        self.min_prob = min_prob
    def extract_topics(
        self,
        topic_model,
        documents: pd.DataFrame,
        c_tf_idf: csr_matrix,
        topics: Mapping[str, List[Tuple[str, float]]],
    ) -> Mapping[str, List[Tuple[str, float]]]:
        """Extract topics.
        Arguments:
            topic_model: Not used
            documents: Not used
            c_tf_idf: Not used
            topics: The candidate topics as calculated with c-TF-IDF
        Returns:
            updated_topics: Updated topic representations
        """
        # Classify topics
        topic_descriptions = [" ".join(list(zip(*topics[topic]))[0]) for topic in topics.keys()]
        classifications = self.model(topic_descriptions, self.candidate_topics, **self.pipeline_kwargs)
        # Extract labels
        updated_topics = {}
        for topic, classification in zip(topics.keys(), classifications):
            topic_description = topics[topic]
            # Multi-label assignment
            if self.pipeline_kwargs.get("multi_label"):
                topic_description = []
                for label, score in zip(classification["labels"], classification["scores"]):
                    if score > self.min_prob:
                        topic_description.append((label, score))
            # Single label assignment
            elif classification["scores"][0] > self.min_prob:
                topic_description = [(classification["labels"][0], classification["scores"][0])]
            # Make sure that 10 items are returned
            if len(topic_description) == 0:
                topic_description = topics[topic]
            elif len(topic_description) < 10:
                topic_description += [("", 0) for _ in range(10 - len(topic_description))]
            updated_topics[topic] = topic_description
        return updated_topics
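The label-selection rules in `extract_topics` do not depend on the transformers pipeline and can be checked on a plain dictionary. The following is a minimal sketch under stated assumptions: `select_labels` is a hypothetical helper, and the labels and scores are made-up values shaped like one zero-shot classification result, not real model output.

```python
from typing import Dict, List, Tuple

Label = Tuple[str, float]


def select_labels(
    classification: Dict[str, list],
    fallback: List[Label],
    min_prob: float = 0.8,
    multi_label: bool = False,
    size: int = 10,
) -> List[Label]:
    """Pick candidate labels above `min_prob`, else keep the original keywords."""
    if multi_label:
        # Keep every label whose score exceeds the threshold
        chosen = [
            (label, score)
            for label, score in zip(classification["labels"], classification["scores"])
            if score > min_prob
        ]
    elif classification["scores"][0] > min_prob:
        # Keep only the best-scoring label
        chosen = [(classification["labels"][0], classification["scores"][0])]
    else:
        chosen = []

    if not chosen:
        # Nothing passed the threshold: fall back to the c-TF-IDF keywords
        return list(fallback)
    # Pad with empty entries so that `size` items are returned
    return chosen + [("", 0)] * (size - len(chosen))


# Illustrative values only
classification = {
    "labels": ["sports", "bicycles", "space and nasa"],
    "scores": [0.93, 0.85, 0.10],
}
keywords = [("game", 0.05), ("team", 0.04)]

single = select_labels(classification, keywords)
multi = select_labels(classification, keywords, multi_label=True)
strict = select_labels(classification, keywords, min_prob=0.95)
print(single[:2], multi[:3], strict)
```

Unlike the method above, this sketch returns a copy of the fallback list, so padding never mutates the original keyword list.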
@@ -1,4 +0,0 @@
 from ._ctfidf import ClassTfidfTransformer
 from ._online_cv import OnlineCountVectorizer
 __all__ = ["ClassTfidfTransformer", "OnlineCountVectorizer"]
@@ -1,115 +0,0 @@
 from typing import List
 from sklearn.feature_extraction.text import TfidfTransformer
 from sklearn.preprocessing import normalize
 from sklearn.utils import check_array
 import numpy as np
 import scipy.sparse as sp
 class ClassTfidfTransformer(TfidfTransformer):
    """A Class-based TF-IDF procedure using scikit-learn's TfidfTransformer as a base.
    ![](../algorithm/c-TF-IDF.svg)
    c-TF-IDF can best be explained as a TF-IDF formula adapted for multiple classes
    by joining all documents per class. Thus, each class is converted to a single document
    instead of a set of documents. The frequency of each word **x** is extracted
    for each class **c** and is **l1** normalized. This constitutes the term frequency.
    Then, the term frequency is multiplied with IDF which is the logarithm of 1 plus
    the average number of words per class **A** divided by the frequency of word **x**
    across all classes.
    Arguments:
        bm25_weighting: Uses BM25-inspired idf-weighting procedure instead of the procedure
                        as defined in the c-TF-IDF formula. It uses the following weighting scheme:
                        `log(1+((avg_nr_samples - df + 0.5) / (df+0.5)))`
        reduce_frequent_words: Takes the square root of the bag-of-words after normalizing the matrix.
                               Helps to reduce the impact of words that appear too frequently.
        seed_words: Specific words that will have their idf value increased by
                    the value of `seed_multiplier`.
                    NOTE: This will only increase the value of words that have an exact match.
        seed_multiplier: The value with which the idf values of the words in `seed_words`
                         are multiplied.
    Examples:
    ```python
    transformer = ClassTfidfTransformer()
    ```
    """
    def __init__(
        self,
        bm25_weighting: bool = False,
        reduce_frequent_words: bool = False,
        seed_words: List[str] = None,
        seed_multiplier: float = 2,
    ):
        self.bm25_weighting = bm25_weighting
        self.reduce_frequent_words = reduce_frequent_words
        self.seed_words = seed_words
        self.seed_multiplier = seed_multiplier
        super(ClassTfidfTransformer, self).__init__()
    def fit(self, X: sp.csr_matrix, multiplier: np.ndarray = None):
        """Learn the idf vector (global term weights).
        Arguments:
            X: A matrix of term/token counts.
            multiplier: A multiplier for increasing/decreasing certain IDF scores
        """
        X = check_array(X, accept_sparse=("csr", "csc"))
        if not sp.issparse(X):
            X = sp.csr_matrix(X)
        dtype = np.float64
        if self.use_idf:
            _, n_features = X.shape
            # Calculate the frequency of words across all classes
            df = np.squeeze(np.asarray(X.sum(axis=0)))
            # Calculate the average number of samples as regularization
            avg_nr_samples = int(X.sum(axis=1).mean())
            # BM25-inspired weighting procedure
            if self.bm25_weighting:
                idf = np.log(1 + ((avg_nr_samples - df + 0.5) / (df + 0.5)))
            # Divide the average number of samples by the word frequency
            # +1 is added to force values to be positive
            else:
                idf = np.log((avg_nr_samples / df) + 1)
            # Multiplier to increase/decrease certain idf scores
            if multiplier is not None:
                idf = idf * multiplier
            self._idf_diag = sp.diags(
                idf,
                offsets=0,
                shape=(n_features, n_features),
                format="csr",
                dtype=dtype,
            )
        return self
    def transform(self, X: sp.csr_matrix):
        """Transform a count-based matrix to c-TF-IDF.
        Arguments:
            X (sparse matrix): A matrix of term/token counts.
        Returns:
            X (sparse matrix): A c-TF-IDF matrix
        """
        if self.use_idf:
            X = normalize(X, axis=1, norm="l1", copy=False)
            if self.reduce_frequent_words:
                X.data = np.sqrt(X.data)
            X = X * self._idf_diag
        return X
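The default weighting in `fit` and `transform` can be reproduced by hand on a tiny count matrix. This is a pure-Python sketch of that default path only (no BM25 weighting, no square-root reduction, no seed words); `c_tf_idf` and the example counts are hypothetical and chosen for illustration.

```python
import math
from typing import List


def c_tf_idf(counts: List[List[int]]) -> List[List[float]]:
    """Compute c-TF-IDF for a class-by-term count matrix (rows are classes)."""
    n_terms = len(counts[0])
    # Frequency of each term across all classes
    df = [sum(row[j] for row in counts) for j in range(n_terms)]
    # Average number of words per class, truncated to an integer as in `fit`
    avg_nr_samples = int(sum(sum(row) for row in counts) / len(counts))
    # +1 inside the logarithm keeps every idf value positive
    idf = [math.log(avg_nr_samples / df_j + 1) for df_j in df]

    weighted = []
    for row in counts:
        total = sum(row)
        # L1-normalise the row to get term frequencies, then scale by idf
        tf = [count / total if total else 0.0 for count in row]
        weighted.append([t * w for t, w in zip(tf, idf)])
    return weighted


# Two classes, three terms (made-up counts)
counts = [
    [2, 0, 2],
    [0, 3, 1],
]
scores = c_tf_idf(counts)
print(scores)
```

In this example the first term occurs only in the first class, so it receives the largest idf weight, while the shared third term is weighted lower.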
@@ -1,158 +0,0 @@
 import numpy as np
 from itertools import chain
 from typing import List
 from scipy import sparse
 from scipy.sparse import csr_matrix
 from sklearn.feature_extraction.text import CountVectorizer
 class OnlineCountVectorizer(CountVectorizer):
    """An online variant of the CountVectorizer with updating vocabulary.
    At each `.partial_fit`, its vocabulary is updated based on any OOV words
    it might find. Then, `.update_bow` can be used to track and update
    the Bag-of-Words representation. These functions are separated such that
    the vectorizer can be used in iteration without updating the Bag-of-Words
    representation, which might speed up the fitting process. However, the
    `.update_bow` function is used in BERTopic to track changes in the
    topic representations and allow for decay.
    This class inherits its parameters and attributes from:
        `sklearn.feature_extraction.text.CountVectorizer`
    Arguments:
        decay: A value between [0, 1] that sets the fraction by which the frequencies
               of the previous bag-of-words are decreased. For example,
               a value of `.1` will decrease the frequencies in the bag-of-words
               matrix by 10% at each iteration.
        delete_min_df: Delete words at each iteration from its vocabulary
                       that are below a minimum frequency.
                       This will keep the resulting bag-of-words matrix small
                       such that it does not explode in size with increasing
                       vocabulary. If `decay` is None then this equals `min_df`.
        **kwargs: Set of parameters inherited from:
                  `sklearn.feature_extraction.text.CountVectorizer`
                  In practice, this means that you can still use parameters
                  from the original CountVectorizer, like `stop_words` and
                  `ngram_range`.
    Attributes:
        X_ (scipy.sparse.csr_matrix) : The Bag-of-Words representation
    Examples:
    ```python
    from bertopic.vectorizers import OnlineCountVectorizer
    vectorizer = OnlineCountVectorizer(stop_words="english")
    for index, doc in enumerate(my_docs):
        # Both methods expect a list of documents, not a single string
        vectorizer.partial_fit([doc])
        # Update and clean the bow every 100 iterations:
        if index % 100 == 0:
            X = vectorizer.update_bow([doc])
    ```
    To use the model in BERTopic:
    ```python
    from bertopic import BERTopic
    from bertopic.vectorizers import OnlineCountVectorizer
    vectorizer_model = OnlineCountVectorizer(stop_words="english")
    topic_model = BERTopic(vectorizer_model=vectorizer_model)
    ```
    References:
        Adapted from: https://github.com/idoshlomo/online_vectorizers
    """
    def __init__(self, decay: float = None, delete_min_df: float = None, **kwargs):
        self.decay = decay
        self.delete_min_df = delete_min_df
        super(OnlineCountVectorizer, self).__init__(**kwargs)
    def partial_fit(self, raw_documents: List[str]) -> "OnlineCountVectorizer":
        """Perform a partial fit and update vocabulary with OOV tokens.
        Arguments:
            raw_documents: A list of documents
        """
        if not hasattr(self, "vocabulary_"):
            return self.fit(raw_documents)
        analyzer = self.build_analyzer()
        analyzed_documents = [analyzer(doc) for doc in raw_documents]
        new_tokens = set(chain.from_iterable(analyzed_documents))
        oov_tokens = new_tokens.difference(set(self.vocabulary_.keys()))
        if oov_tokens:
            max_index = max(self.vocabulary_.values())
            oov_vocabulary = dict(
                zip(
                    oov_tokens,
                    list(range(max_index + 1, max_index + 1 + len(oov_tokens), 1)),
                )
            )
            self.vocabulary_.update(oov_vocabulary)
        return self
    def update_bow(self, raw_documents: List[str]) -> csr_matrix:
        """Create or update the bag-of-words matrix.
        Update the bag-of-words matrix by adding the newly transformed
        documents. This may add empty columns if new words are found and/or
        add empty rows if new topics are found.
        During this process, the previous bag-of-words matrix might be
        decayed if `self.decay` has been set during init. Similarly, words
        that do not exceed `self.delete_min_df` are removed from its
        vocabulary and bag-of-words matrix.
        Arguments:
            raw_documents: A list of documents
        Returns:
            X_: Bag-of-words matrix
        """
        if hasattr(self, "X_"):
            X = self.transform(raw_documents)
            # Add empty columns if new words are found
            columns = csr_matrix((self.X_.shape[0], X.shape[1] - self.X_.shape[1]), dtype=int)
            self.X_ = sparse.hstack([self.X_, columns])
            # Add empty rows if new topics are found
            rows = csr_matrix((X.shape[0] - self.X_.shape[0], self.X_.shape[1]), dtype=int)
            self.X_ = sparse.vstack([self.X_, rows])
            # Decay of BoW matrix
            if self.decay is not None:
                self.X_ = self.X_ * (1 - self.decay)
            self.X_ += X
        else:
            self.X_ = self.transform(raw_documents)
        if self.delete_min_df is not None:
            self._clean_bow()
        return self.X_
    def _clean_bow(self) -> None:
        """Remove words that do not exceed `self.delete_min_df`."""
        # Only keep words with a minimum frequency
        indices = np.where(self.X_.sum(0) >= self.delete_min_df)[1]
        indices_dict = {index: index for index in indices}
        self.X_ = self.X_[:, indices]
        # Rebuild the vocabulary so that it only contains the retained words
        new_vocab = {}
        vocabulary_dict = {v: k for k, v in self.vocabulary_.items()}
        for i, index in enumerate(indices):
            if indices_dict.get(index) is not None:
                new_vocab[vocabulary_dict[index]] = i
        self.vocabulary_ = new_vocab
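The two core ideas of `OnlineCountVectorizer`, growing the vocabulary with out-of-vocabulary tokens and decaying old counts before adding new ones, can be sketched with plain dictionaries and lists. `extend_vocabulary` and `decay_and_add` are hypothetical helpers, not part of the class; sorting the new tokens is an assumption made here so that the assigned indices are deterministic, whereas the class assigns them in set iteration order.

```python
from typing import Dict, Iterable, List


def extend_vocabulary(vocabulary: Dict[str, int], tokens: Iterable[str]) -> Dict[str, int]:
    """Append out-of-vocabulary tokens after the current highest index."""
    oov_tokens = sorted(set(tokens) - set(vocabulary))
    next_index = max(vocabulary.values(), default=-1) + 1
    for offset, token in enumerate(oov_tokens):
        vocabulary[token] = next_index + offset
    return vocabulary


def decay_and_add(previous: List[float], new: List[float], decay: float) -> List[float]:
    """Decay the previous counts of one row, then add the new counts."""
    # Pad the shorter row with zeros, like the empty columns added in `update_bow`
    width = max(len(previous), len(new))
    previous = previous + [0] * (width - len(previous))
    new = new + [0] * (width - len(new))
    return [old * (1 - decay) + fresh for old, fresh in zip(previous, new)]


vocabulary = extend_vocabulary({"cat": 0, "dog": 1}, ["dog", "bird", "ant"])
row = decay_and_add([10, 4], [1, 0, 2], decay=0.5)
print(vocabulary, row)
```

Existing tokens keep their indices, so previously computed columns stay aligned with the vocabulary after each update.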
@@ -1,32 +0,0 @@
 <svg width="228" height="113" viewBox="0 0 228 113" fill="none" xmlns="http://www.w3.org/2000/svg">
 <path d="M68.7889 40.7606L54.4174 26.3594C54.1819 26.1238 53.8638 26 53.5317 26H16.34C14.4403 26 12.8962 27.5352 12.8962 29.4337L12.8765 92.5594C12.8765 94.4578 14.4219 96 16.3209 96H65.6905C67.5889 96 69.1343 94.459 69.1349 92.5613L69.1533 41.6413C69.1533 41.3098 69.0225 40.9949 68.7889 40.7606ZM66.634 92.5606C66.634 93.0806 66.2105 93.501 65.6905 93.501H16.3209C15.8003 93.501 15.3768 93.0844 15.3768 92.5644L15.3965 29.4362C15.3965 28.9162 15.8194 28.5003 16.34 28.5003H53.013L66.6517 42.1632L66.634 92.5606Z" fill="black"/>
 <path d="M62.2626 40.3752H57.1876C55.8613 40.3752 54.7508 39.3098 54.7508 37.9835V27.2343C54.7508 26.5435 54.1908 25.9841 53.5006 25.9841C52.8105 25.9841 52.2505 26.5441 52.2505 27.2343V37.9835C52.2505 40.6889 54.4816 42.8749 57.187 42.8749H62.2619C62.9521 42.8749 63.5127 42.3162 63.5127 41.6254C63.5127 40.9346 62.9527 40.3752 62.2626 40.3752Z" fill="black"/>
 <path d="M78.7584 30.7822L64.387 16.374C64.1514 16.1384 63.8333 16 63.5019 16H26.3095C24.4105 16 22.8746 17.5581 22.8746 19.4571V27.2343C22.8746 27.9251 23.434 28.4844 24.1248 28.4844C24.8156 28.4844 25.3749 27.9244 25.3749 27.2343V19.4571C25.3749 18.9371 25.7902 18.5003 26.3102 18.5003H62.9838L76.6232 32.1689L76.6041 82.574C76.6041 83.0933 76.1813 83.4997 75.6613 83.4997H67.8841C67.1933 83.4997 66.634 84.0597 66.634 84.7498C66.634 85.44 67.1933 86 67.8841 86H75.6613C77.5603 86 79.1044 84.4717 79.1038 82.5733L79.1235 31.6667C79.1235 31.3352 78.9914 31.0165 78.7584 30.7822Z" fill="black"/>
 <path d="M72.2333 30.3746H67.1584C65.8321 30.3746 64.7508 29.339 64.7508 28.0127V17.2635C64.7508 16.5733 64.1908 16.0133 63.5006 16.0133C62.8105 16.0133 62.2505 16.5733 62.2505 17.2635V28.0127C62.2505 30.7181 64.453 32.8749 67.1578 32.8749H72.2327C72.9229 32.8749 73.4835 32.3156 73.4835 31.6248C73.4835 30.934 72.9235 30.3746 72.2333 30.3746Z" fill="black"/>
 <path d="M22.7838 46.6248H19.7413C19.0511 46.6248 18.4911 47.1841 18.4911 47.8749C18.4911 48.5657 19.0511 49.1251 19.7413 49.1251H22.7838C23.4733 49.1251 24.034 48.5657 24.034 47.8749C24.034 47.1841 23.4733 46.6248 22.7838 46.6248Z" fill="black"/>
 <path d="M62.2429 46.6248H28.3991C27.7076 46.6248 27.1489 47.1841 27.1489 47.8749C27.1489 48.5657 27.7083 49.1251 28.3991 49.1251H62.2429C62.9337 49.1251 63.493 48.5657 63.493 47.8749C63.493 47.1841 62.9337 46.6248 62.2429 46.6248Z" fill="black"/>
 <path d="M62.2429 52.8749H52.7603C52.0695 52.8749 51.5102 53.4343 51.5102 54.1251C51.5102 54.8159 52.0695 55.3752 52.7603 55.3752H62.2429C62.9337 55.3752 63.493 54.8159 63.493 54.1251C63.493 53.4343 62.9337 52.8749 62.2429 52.8749Z" fill="black"/>
 <path d="M47.1457 52.8749H19.7419C19.0518 52.8749 18.4918 53.4343 18.4918 54.1251C18.4918 54.8159 19.0518 55.3752 19.7419 55.3752H47.1457C47.8353 55.3752 48.3959 54.8159 48.3959 54.1251C48.3959 53.4343 47.8353 52.8749 47.1457 52.8749Z" fill="black"/>
 <path d="M62.2429 59.1245H19.7419C19.0518 59.1245 18.4918 59.6845 18.4918 60.3746C18.4918 61.0648 19.0518 61.6248 19.7419 61.6248H62.2429C62.9337 61.6248 63.493 61.0648 63.493 60.3746C63.493 59.6845 62.9337 59.1245 62.2429 59.1245Z" fill="black"/>
 <path d="M62.2429 77.8749H19.7419C19.0518 77.8749 18.4918 78.4349 18.4918 79.1251C18.4918 79.8152 19.0518 80.3752 19.7419 80.3752H62.2429C62.9337 80.3752 63.493 79.8152 63.493 79.1251C63.493 78.4349 62.9337 77.8749 62.2429 77.8749Z" fill="black"/>
 <path d="M22.7838 65.3746H19.7413C19.0511 65.3746 18.4911 65.9346 18.4911 66.6248C18.4911 67.3149 19.0511 67.8749 19.7413 67.8749H22.7838C23.4733 67.8749 24.034 67.3149 24.034 66.6248C24.034 65.9346 23.4733 65.3746 22.7838 65.3746Z" fill="black"/>
 <path d="M62.2429 65.3746H28.3991C27.7076 65.3746 27.1489 65.9346 27.1489 66.6248C27.1489 67.3149 27.7083 67.8749 28.3991 67.8749H62.2429C62.9337 67.8749 63.493 67.3149 63.493 66.6248C63.493 65.9346 62.9337 65.3746 62.2429 65.3746Z" fill="black"/>
 <path d="M62.2429 71.6248H52.7603C52.0695 71.6248 51.5102 72.1848 51.5102 72.8749C51.5102 73.5651 52.0695 74.1251 52.7603 74.1251H62.2429C62.9337 74.1251 63.493 73.5651 63.493 72.8749C63.493 72.1848 62.9337 71.6248 62.2429 71.6248Z" fill="black"/>
 <path d="M47.1457 71.6248H19.7419C19.0518 71.6248 18.4918 72.1848 18.4918 72.8749C18.4918 73.5651 19.0518 74.1251 19.7419 74.1251H47.1457C47.8353 74.1251 48.3959 73.5651 48.3959 72.8749C48.3959 72.1848 47.8353 71.6248 47.1457 71.6248Z" fill="black"/>
 <path d="M22.7838 84.1245H19.7413C19.0511 84.1245 18.4911 84.6845 18.4911 85.3746C18.4911 86.0648 19.0511 86.6248 19.7413 86.6248H22.7838C23.4733 86.6248 24.034 86.0648 24.034 85.3746C24.034 84.6845 23.4733 84.1245 22.7838 84.1245Z" fill="black"/>
 <path d="M62.2429 84.1245H28.3991C27.7076 84.1245 27.1489 84.6845 27.1489 85.3746C27.1489 86.0648 27.7083 86.6248 28.3991 86.6248H62.2429C62.9337 86.6248 63.493 86.0648 63.493 85.3746C63.493 84.6845 62.9337 84.1245 62.2429 84.1245Z" fill="black"/>
 <path d="M72.2143 36.6248H64.7952C64.1044 36.6248 63.5451 37.1841 63.5451 37.8749C63.5451 38.5657 64.1044 39.1251 64.7952 39.1251H72.2136C72.9044 39.1251 73.4644 38.5657 73.4644 37.8749C73.4644 37.1841 72.9051 36.6248 72.2143 36.6248Z" fill="black"/>
 <path d="M72.2137 42.8749H67.8841C67.1933 42.8749 66.634 43.4343 66.634 44.1251C66.634 44.8159 67.1933 45.3752 67.8841 45.3752H72.2137C72.9044 45.3752 73.4638 44.8159 73.4638 44.1251C73.4638 43.4343 72.9044 42.8749 72.2137 42.8749Z" fill="black"/>
 <path d="M72.2137 49.1245H67.8841C67.1933 49.1245 66.634 49.6838 66.634 50.3746C66.634 51.0654 67.1933 51.6248 67.8841 51.6248H72.2137C72.9044 51.6248 73.4638 51.0654 73.4638 50.3746C73.4638 49.6838 72.9044 49.1245 72.2137 49.1245Z" fill="black"/>
 <path d="M72.2136 67.8749H68.267C67.5775 67.8749 67.0168 68.4349 67.0168 69.1251C67.0168 69.8152 67.5775 70.3752 68.267 70.3752H72.2136C72.9044 70.3752 73.4638 69.8152 73.4638 69.1251C73.4638 68.4349 72.9044 67.8749 72.2136 67.8749Z" fill="black"/>
 <path d="M72.2137 55.3746H67.8841C67.1933 55.3746 66.634 55.9346 66.634 56.6248C66.634 57.3149 67.1933 57.8749 67.8841 57.8749H72.2137C72.9044 57.8749 73.4638 57.3149 73.4638 56.6248C73.4638 55.934 72.9044 55.3746 72.2137 55.3746Z" fill="black"/>
 <path d="M72.2137 61.6248H67.8841C67.1933 61.6248 66.634 62.1848 66.634 62.8749C66.634 63.5651 67.1933 64.1251 67.8841 64.1251H72.2137C72.9044 64.1251 73.4638 63.5651 73.4638 62.8749C73.4638 62.1848 72.9044 61.6248 72.2137 61.6248Z" fill="black"/>
 <path d="M72.2137 74.1244H67.8841C67.1933 74.1244 66.634 74.6844 66.634 75.3746C66.634 76.0648 67.1933 76.6248 67.8841 76.6248H72.2137C72.9044 76.6248 73.4638 76.0648 73.4638 75.3746C73.4638 74.6844 72.9044 74.1244 72.2137 74.1244Z" fill="black"/>
 <path d="M155.061 57.0607C155.646 56.4749 155.646 55.5251 155.061 54.9393L145.515 45.3934C144.929 44.8076 143.979 44.8076 143.393 45.3934C142.808 45.9792 142.808 46.9289 143.393 47.5147L151.879 56L143.393 64.4853C142.808 65.0711 142.808 66.0208 143.393 66.6066C143.979 67.1924 144.929 67.1924 145.515 66.6066L155.061 57.0607ZM98 57.5H154V54.5H98V57.5Z" fill="black"/>
 <path d="M189 13H180V103H189" stroke="black" stroke-width="2"/>
 <path d="M204 13H213V103H204" stroke="black" stroke-width="2"/>
 <path d="M194.746 16.6543L196 19.2148L198.062 16.666H198.918L196.322 19.8066L197.98 23H197.219L195.883 20.3281L193.721 23H192.871L195.572 19.7305L193.984 16.6543H194.746ZM194.746 30.6543L196 33.2148L198.062 30.666H198.918L196.322 33.8066L197.98 37H197.219L195.883 34.3281L193.721 37H192.871L195.572 33.7305L193.984 30.6543H194.746ZM194.898 50.5723C194.902 50.4395 194.953 50.3242 195.051 50.2266C195.148 50.1289 195.266 50.0781 195.402 50.0742C195.543 50.0742 195.658 50.1211 195.748 50.2148C195.838 50.3086 195.879 50.4258 195.871 50.5664C195.863 50.7031 195.811 50.8184 195.713 50.9121C195.615 51.0059 195.498 51.0527 195.361 51.0527C195.221 51.0566 195.105 51.0137 195.016 50.9238C194.926 50.8301 194.887 50.7129 194.898 50.5723ZM194.898 64.5723C194.902 64.4395 194.953 64.3242 195.051 64.2266C195.148 64.1289 195.266 64.0781 195.402 64.0742C195.543 64.0742 195.658 64.1211 195.748 64.2148C195.838 64.3086 195.879 64.4258 195.871 64.5664C195.863 64.7031 195.811 64.8184 195.713 64.9121C195.615 65.0059 195.498 65.0527 195.361 65.0527C195.221 65.0566 195.105 65.0137 195.016 64.9238C194.926 64.8301 194.887 64.7129 194.898 64.5723ZM194.898 78.5723C194.902 78.4395 194.953 78.3242 195.051 78.2266C195.148 78.1289 195.266 78.0781 195.402 78.0742C195.543 78.0742 195.658 78.1211 195.748 78.2148C195.838 78.3086 195.879 78.4258 195.871 78.5664C195.863 78.7031 195.811 78.8184 195.713 78.9121C195.615 79.0059 195.498 79.0527 195.361 79.0527C195.221 79.0566 195.105 79.0137 195.016 78.9238C194.926 78.8301 194.887 78.7129 194.898 78.5723ZM194.746 86.6543L196 89.2148L198.062 86.666H198.918L196.322 89.8066L197.98 93H197.219L195.883 90.3281L193.721 93H192.871L195.572 89.7305L193.984 86.6543H194.746Z" fill="black"/>
 <path d="M203.047 19.2891L202.074 25H201.617L202.504 19.8945L200.906 20.457L200.984 20.0039L202.961 19.2891H203.047Z" fill="black"/>
 <path d="M203.523 38.5977L203.461 39H200.004L200.059 38.6211L202.176 36.5234C202.332 36.3672 202.496 36.1992 202.668 36.0195C202.842 35.8398 202.995 35.6471 203.125 35.4414C203.258 35.2357 203.342 35.0169 203.379 34.7852C203.41 34.5638 203.392 34.3672 203.324 34.1953C203.259 34.0234 203.15 33.888 202.996 33.7891C202.842 33.6875 202.651 33.6354 202.422 33.6328C202.161 33.6302 201.93 33.6875 201.727 33.8047C201.526 33.9219 201.361 34.0807 201.23 34.2812C201.103 34.4818 201.018 34.7044 200.977 34.9492H200.523C200.568 34.6237 200.677 34.3307 200.852 34.0703C201.026 33.8099 201.249 33.6055 201.52 33.457C201.793 33.306 202.098 33.2318 202.434 33.2344C202.736 33.237 202.999 33.2995 203.223 33.4219C203.449 33.5443 203.618 33.7188 203.73 33.9453C203.842 34.1719 203.88 34.4388 203.844 34.7461C203.82 34.9518 203.762 35.1497 203.668 35.3398C203.574 35.5273 203.46 35.7083 203.324 35.8828C203.191 36.0547 203.049 36.2188 202.898 36.375C202.747 36.5286 202.602 36.6745 202.461 36.8125L200.645 38.5977H203.523Z" fill="black"/>
 <path d="M201.082 94.6953L200.512 98H200.055L200.785 93.7734H201.223L201.082 94.6953ZM200.828 95.625L200.645 95.5078C200.697 95.2786 200.776 95.056 200.883 94.8398C200.99 94.6211 201.122 94.4258 201.281 94.2539C201.443 94.0794 201.628 93.9427 201.836 93.8438C202.047 93.7422 202.281 93.6927 202.539 93.6953C202.771 93.6979 202.962 93.7409 203.113 93.8242C203.267 93.9049 203.385 94.0182 203.469 94.1641C203.552 94.3073 203.604 94.4727 203.625 94.6602C203.648 94.8451 203.646 95.0417 203.617 95.25L203.152 98H202.691L203.164 95.2422C203.193 95.0391 203.191 94.8516 203.16 94.6797C203.132 94.5052 203.059 94.3659 202.941 94.2617C202.824 94.1549 202.648 94.1016 202.414 94.1016C202.206 94.099 202.014 94.1419 201.84 94.2305C201.665 94.3164 201.509 94.4336 201.371 94.582C201.236 94.7279 201.121 94.8919 201.027 95.0742C200.936 95.2539 200.87 95.4375 200.828 95.625Z" fill="black"/>
 </svg>
@@ -1,17 +0,0 @@
 <svg width="228" height="113" viewBox="0 0 228 113" fill="none" xmlns="http://www.w3.org/2000/svg">
 <path d="M51 13H42V103H51" stroke="black" stroke-width="2"/>
 <path d="M66 13H75V103H66" stroke="black" stroke-width="2"/>
 <path d="M56.7461 16.6543L58 19.2148L60.0625 16.666H60.918L58.3223 19.8066L59.9805 23H59.2188L57.8828 20.3281L55.7207 23H54.8711L57.5723 19.7305L55.9844 16.6543H56.7461ZM56.7461 30.6543L58 33.2148L60.0625 30.666H60.918L58.3223 33.8066L59.9805 37H59.2188L57.8828 34.3281L55.7207 37H54.8711L57.5723 33.7305L55.9844 30.6543H56.7461ZM56.8984 50.5723C56.9023 50.4395 56.9531 50.3242 57.0508 50.2266C57.1484 50.1289 57.2656 50.0781 57.4023 50.0742C57.543 50.0742 57.6582 50.1211 57.748 50.2148C57.8379 50.3086 57.8789 50.4258 57.8711 50.5664C57.8633 50.7031 57.8105 50.8184 57.7129 50.9121C57.6152 51.0059 57.498 51.0527 57.3613 51.0527C57.2207 51.0566 57.1055 51.0137 57.0156 50.9238C56.9258 50.8301 56.8867 50.7129 56.8984 50.5723ZM56.8984 64.5723C56.9023 64.4395 56.9531 64.3242 57.0508 64.2266C57.1484 64.1289 57.2656 64.0781 57.4023 64.0742C57.543 64.0742 57.6582 64.1211 57.748 64.2148C57.8379 64.3086 57.8789 64.4258 57.8711 64.5664C57.8633 64.7031 57.8105 64.8184 57.7129 64.9121C57.6152 65.0059 57.498 65.0527 57.3613 65.0527C57.2207 65.0566 57.1055 65.0137 57.0156 64.9238C56.9258 64.8301 56.8867 64.7129 56.8984 64.5723ZM56.8984 78.5723C56.9023 78.4395 56.9531 78.3242 57.0508 78.2266C57.1484 78.1289 57.2656 78.0781 57.4023 78.0742C57.543 78.0742 57.6582 78.1211 57.748 78.2148C57.8379 78.3086 57.8789 78.4258 57.8711 78.5664C57.8633 78.7031 57.8105 78.8184 57.7129 78.9121C57.6152 79.0059 57.498 79.0527 57.3613 79.0527C57.2207 79.0566 57.1055 79.0137 57.0156 78.9238C56.9258 78.8301 56.8867 78.7129 56.8984 78.5723ZM56.7461 86.6543L58 89.2148L60.0625 86.666H60.918L58.3223 89.8066L59.9805 93H59.2188L57.8828 90.3281L55.7207 93H54.8711L57.5723 89.7305L55.9844 86.6543H56.7461Z" fill="black"/>
 <path d="M65.0469 19.2891L64.0742 25H63.6172L64.5039 19.8945L62.9062 20.457L62.9844 20.0039L64.9609 19.2891H65.0469Z" fill="black"/>
 <path d="M65.5234 38.5977L65.4609 39H62.0039L62.0586 38.6211L64.1758 36.5234C64.332 36.3672 64.4961 36.1992 64.668 36.0195C64.8424 35.8398 64.9948 35.6471 65.125 35.4414C65.2578 35.2357 65.3424 35.0169 65.3789 34.7852C65.4102 34.5638 65.3919 34.3672 65.3242 34.1953C65.2591 34.0234 65.1497 33.888 64.9961 33.7891C64.8424 33.6875 64.651 33.6354 64.4219 33.6328C64.1615 33.6302 63.9297 33.6875 63.7266 33.8047C63.526 33.9219 63.3607 34.0807 63.2305 34.2812C63.1029 34.4818 63.0182 34.7044 62.9766 34.9492H62.5234C62.5677 34.6237 62.6771 34.3307 62.8516 34.0703C63.026 33.8099 63.2487 33.6055 63.5195 33.457C63.793 33.306 64.0977 33.2318 64.4336 33.2344C64.7357 33.237 64.9987 33.2995 65.2227 33.4219C65.4492 33.5443 65.6185 33.7188 65.7305 33.9453C65.8424 34.1719 65.8802 34.4388 65.8438 34.7461C65.8203 34.9518 65.7617 35.1497 65.668 35.3398C65.5742 35.5273 65.4596 35.7083 65.3242 35.8828C65.1914 36.0547 65.0495 36.2188 64.8984 36.375C64.7474 36.5286 64.6016 36.6745 64.4609 36.8125L62.6445 38.5977H65.5234Z" fill="black"/>
 <path d="M63.082 94.6953L62.5117 98H62.0547L62.7852 93.7734H63.2227L63.082 94.6953ZM62.8281 95.625L62.6445 95.5078C62.6966 95.2786 62.776 95.056 62.8828 94.8398C62.9896 94.6211 63.1224 94.4258 63.2812 94.2539C63.4427 94.0794 63.6276 93.9427 63.8359 93.8438C64.0469 93.7422 64.2812 93.6927 64.5391 93.6953C64.7708 93.6979 64.9622 93.7409 65.1133 93.8242C65.2669 93.9049 65.3854 94.0182 65.4688 94.1641C65.5521 94.3073 65.6042 94.4727 65.625 94.6602C65.6484 94.8451 65.6458 95.0417 65.6172 95.25L65.1523 98H64.6914L65.1641 95.2422C65.1927 95.0391 65.1914 94.8516 65.1602 94.6797C65.1315 94.5052 65.0586 94.3659 64.9414 94.2617C64.8242 94.1549 64.6484 94.1016 64.4141 94.1016C64.2057 94.099 64.0143 94.1419 63.8398 94.2305C63.6654 94.3164 63.5091 94.4336 63.3711 94.582C63.2357 94.7279 63.1211 94.8919 63.0273 95.0742C62.9362 95.2539 62.8698 95.4375 62.8281 95.625Z" fill="black"/>
 <path d="M161 13H152V103H161" stroke="black" stroke-width="2"/>
 <path d="M176 13H185V103H176" stroke="black" stroke-width="2"/>
 <path d="M166.746 24.6543L168 27.2148L170.062 24.666H170.918L168.322 27.8066L169.98 31H169.219L167.883 28.3281L165.721 31H164.871L167.572 27.7305L165.984 24.6543H166.746ZM166.746 38.6543L168 41.2148L170.062 38.666H170.918L168.322 41.8066L169.98 45H169.219L167.883 42.3281L165.721 45H164.871L167.572 41.7305L165.984 38.6543H166.746ZM166.746 52.6543L168 55.2148L170.062 52.666H170.918L168.322 55.8066L169.98 59H169.219L167.883 56.3281L165.721 59H164.871L167.572 55.7305L165.984 52.6543H166.746ZM166.746 66.6543L168 69.2148L170.062 66.666H170.918L168.322 69.8066L169.98 73H169.219L167.883 70.3281L165.721 73H164.871L167.572 69.7305L165.984 66.6543H166.746ZM166.746 80.6543L168 83.2148L170.062 80.666H170.918L168.322 83.8066L169.98 87H169.219L167.883 84.3281L165.721 87H164.871L167.572 83.7305L165.984 80.6543H166.746Z" fill="black"/>
 <path d="M173.785 28.7168L173.056 33H172.713L173.378 29.1709L172.18 29.5928L172.238 29.2529L173.721 28.7168H173.785Z" fill="black"/>
 <path d="M174.143 46.6982L174.096 47H171.503L171.544 46.7158L173.132 45.1426C173.249 45.0254 173.372 44.8994 173.501 44.7646C173.632 44.6299 173.746 44.4854 173.844 44.3311C173.943 44.1768 174.007 44.0127 174.034 43.8389C174.058 43.6729 174.044 43.5254 173.993 43.3965C173.944 43.2676 173.862 43.166 173.747 43.0918C173.632 43.0156 173.488 42.9766 173.316 42.9746C173.121 42.9727 172.947 43.0156 172.795 43.1035C172.645 43.1914 172.521 43.3105 172.423 43.4609C172.327 43.6113 172.264 43.7783 172.232 43.9619H171.893C171.926 43.7178 172.008 43.498 172.139 43.3027C172.27 43.1074 172.437 42.9541 172.64 42.8428C172.845 42.7295 173.073 42.6738 173.325 42.6758C173.552 42.6777 173.749 42.7246 173.917 42.8164C174.087 42.9082 174.214 43.0391 174.298 43.209C174.382 43.3789 174.41 43.5791 174.383 43.8096C174.365 43.9639 174.321 44.1123 174.251 44.2549C174.181 44.3955 174.095 44.5312 173.993 44.6621C173.894 44.791 173.787 44.9141 173.674 45.0312C173.561 45.1465 173.451 45.2559 173.346 45.3594L171.983 46.6982H174.143Z" fill="black"/>
 <path d="M172.622 58.6738L172.953 58.6768C173.127 58.6729 173.293 58.6396 173.451 58.5771C173.611 58.5146 173.746 58.4219 173.855 58.2988C173.967 58.1758 174.035 58.0215 174.061 57.8359C174.086 57.666 174.073 57.5176 174.022 57.3906C173.972 57.2617 173.889 57.1611 173.773 57.0889C173.658 57.0146 173.514 56.9766 173.34 56.9746C173.16 56.9727 172.997 57.0088 172.851 57.083C172.704 57.1572 172.582 57.2607 172.484 57.3936C172.389 57.5244 172.324 57.6768 172.291 57.8506H171.951C171.984 57.6182 172.066 57.4131 172.197 57.2354C172.33 57.0576 172.497 56.9199 172.698 56.8223C172.899 56.7227 173.117 56.6738 173.352 56.6758C173.582 56.6758 173.781 56.7256 173.949 56.8252C174.117 56.9229 174.242 57.0596 174.324 57.2354C174.406 57.4111 174.434 57.6152 174.406 57.8477C174.387 58.0215 174.332 58.1748 174.242 58.3076C174.154 58.4385 174.043 58.5488 173.908 58.6387C173.775 58.7266 173.63 58.7939 173.472 58.8408C173.313 58.8857 173.154 58.9092 172.994 58.9111L172.587 58.9082L172.622 58.6738ZM172.575 58.9756L172.61 58.7441H172.977C173.146 58.748 173.307 58.7715 173.457 58.8145C173.609 58.8574 173.742 58.9229 173.855 59.0107C173.971 59.0967 174.057 59.208 174.113 59.3447C174.172 59.4795 174.19 59.6416 174.169 59.8311C174.147 60.0186 174.096 60.1885 174.014 60.3408C173.934 60.4912 173.829 60.6201 173.7 60.7275C173.571 60.835 173.425 60.918 173.261 60.9766C173.097 61.0332 172.922 61.0605 172.736 61.0586C172.557 61.0566 172.393 61.0264 172.244 60.9678C172.096 60.9092 171.968 60.8271 171.86 60.7217C171.755 60.6143 171.676 60.4863 171.623 60.3379C171.572 60.1875 171.555 60.0215 171.57 59.8398L171.91 59.8428C171.893 60.0225 171.916 60.1807 171.98 60.3174C172.047 60.4541 172.146 60.5615 172.276 60.6396C172.409 60.7158 172.565 60.7549 172.745 60.7568C172.937 60.7588 173.108 60.7227 173.261 60.6484C173.415 60.5742 173.541 60.4688 173.639 60.332C173.738 60.1934 173.801 60.0293 173.826 59.8398C173.854 59.6406 173.83 59.4785 173.756 59.3535C173.682 59.2266 173.572 59.1328 173.428 59.0723C173.285 
59.0117 173.123 58.9805 172.941 58.9785L172.575 58.9756Z" fill="black"/>
 <path d="M174.444 73.623L174.397 73.9219H171.459L171.497 73.7051L173.914 70.7373H174.222L173.686 71.4727L171.945 73.623H174.444ZM174.298 70.7344L173.562 75H173.22L173.958 70.7344H174.298Z" fill="black"/>
 <path d="M172.35 86.8877L172.074 86.8057L172.599 84.7344H174.67L174.623 85.0625H172.848L172.458 86.501C172.575 86.4229 172.703 86.3643 172.842 86.3252C172.98 86.2842 173.12 86.2646 173.261 86.2666C173.454 86.2666 173.621 86.3047 173.762 86.3809C173.902 86.4551 174.016 86.5566 174.102 86.6855C174.189 86.8125 174.249 86.958 174.28 87.1221C174.313 87.2842 174.32 87.4541 174.301 87.6318C174.277 87.8271 174.229 88.0117 174.157 88.1855C174.085 88.3594 173.988 88.5127 173.867 88.6455C173.746 88.7764 173.602 88.8789 173.434 88.9531C173.268 89.0273 173.079 89.0625 172.868 89.0586C172.69 89.0586 172.532 89.0293 172.394 88.9707C172.257 88.9121 172.142 88.8301 172.048 88.7246C171.954 88.6172 171.883 88.4922 171.834 88.3496C171.785 88.2051 171.761 88.0479 171.761 87.8779H172.089C172.089 88.0479 172.117 88.1992 172.174 88.332C172.23 88.4629 172.316 88.5664 172.432 88.6426C172.549 88.7188 172.698 88.7578 172.88 88.7598C173.044 88.7598 173.188 88.7295 173.311 88.6689C173.436 88.6084 173.542 88.5254 173.63 88.4199C173.72 88.3145 173.791 88.1943 173.844 88.0596C173.896 87.9248 173.934 87.7832 173.955 87.6348C173.973 87.5 173.971 87.3711 173.949 87.248C173.928 87.123 173.886 87.0117 173.823 86.9141C173.761 86.8145 173.677 86.7363 173.571 86.6797C173.466 86.6211 173.339 86.5898 173.19 86.5859C173.028 86.584 172.879 86.6094 172.742 86.6621C172.607 86.7148 172.477 86.79 172.35 86.8877Z" fill="black"/>
 <path d="M134.061 62.0607C134.646 61.4749 134.646 60.5251 134.061 59.9393L124.515 50.3934C123.929 49.8076 122.979 49.8076 122.393 50.3934C121.808 50.9792 121.808 51.9289 122.393 52.5147L130.879 61L122.393 69.4853C121.808 70.0711 121.808 71.0208 122.393 71.6066C122.979 72.1924 123.929 72.1924 124.515 71.6066L134.061 62.0607ZM91 62.5H133V59.5H91V62.5Z" fill="black"/>
 </svg>
@@ -1,14 +0,0 @@
 <svg width="228" height="113" viewBox="0 0 228 113" fill="none" xmlns="http://www.w3.org/2000/svg">
 <rect x="32" y="12" width="12" height="12" fill="black"/>
 <rect x="72" y="9" width="12" height="12" fill="black"/>
 <rect x="60" y="32" width="12" height="12" fill="black"/>
 <rect x="32" y="44" width="12" height="12" fill="black"/>
 <circle cx="166" cy="53" r="6" fill="black"/>
 <circle cx="180" cy="19" r="6" fill="black"/>
 <circle cx="194" cy="44" r="6" fill="black"/>
 <circle cx="154" cy="32" r="6" fill="black"/>
 <path d="M90 98L95.1962 107H84.8038L90 98Z" fill="black"/>
 <path d="M104 80L109.196 89H98.8038L104 80Z" fill="black"/>
 <path d="M121 98L126.196 107H115.804L121 98Z" fill="black"/>
 <path d="M127 74L132.196 83H121.804L127 74Z" fill="black"/>
 </svg>
@@ -1,23 +0,0 @@
 <svg width="228" height="113" viewBox="0 0 228 113" fill="none" xmlns="http://www.w3.org/2000/svg">
 <line x1="59.8941" y1="40.3059" x2="59.8941" y2="62.85" stroke="black"/>
 <line x1="57.9618" y1="40.3059" x2="57.9618" y2="62.85" stroke="black"/>
 <line x1="99.1853" y1="40.6618" x2="99.1853" y2="63.2059" stroke="black"/>
 <line x1="97.2529" y1="40.6618" x2="97.2529" y2="63.2059" stroke="black"/>
 <path d="M51.3695 48.5401V49.794H41.5961V48.5401H51.3695ZM51.3695 53.3565V54.6104H41.5961V53.3565H51.3695Z" fill="#ABA9A9"/>
 <path d="M107.229 46.0497L110.651 51.1708L114.084 46.0497H115.748L111.448 52.2606L115.924 58.7294H114.284L110.662 53.3739L107.053 58.7294H105.412L109.889 52.2606L105.588 46.0497H107.229Z" fill="#ABA9A9"/>
 <path d="M187.412 51.0458V52.37H175.717V51.0458H187.412ZM182.221 45.5966V58.0184H180.815V45.5966H182.221Z" fill="#ABA9A9"/>
 <path d="M126.172 42.3736V60.3736H122.785V42.3736H126.172ZM128.422 54.1627V53.9166C128.422 52.9869 128.555 52.1314 128.82 51.3502C129.086 50.5611 129.473 49.8775 129.981 49.2994C130.488 48.7213 131.113 48.272 131.856 47.9517C132.598 47.6236 133.449 47.4595 134.41 47.4595C135.371 47.4595 136.227 47.6236 136.977 47.9517C137.727 48.272 138.356 48.7213 138.863 49.2994C139.379 49.8775 139.77 50.5611 140.035 51.3502C140.301 52.1314 140.434 52.9869 140.434 53.9166V54.1627C140.434 55.0845 140.301 55.94 140.035 56.7291C139.77 57.5103 139.379 58.1939 138.863 58.7798C138.356 59.358 137.731 59.8072 136.988 60.1275C136.246 60.4478 135.395 60.608 134.434 60.608C133.473 60.608 132.617 60.4478 131.867 60.1275C131.125 59.8072 130.496 59.358 129.981 58.7798C129.473 58.1939 129.086 57.5103 128.82 56.7291C128.555 55.94 128.422 55.0845 128.422 54.1627ZM131.797 53.9166V54.1627C131.797 54.6939 131.844 55.19 131.938 55.6509C132.031 56.1119 132.18 56.5181 132.383 56.8697C132.594 57.2134 132.867 57.483 133.203 57.6783C133.539 57.8736 133.949 57.9713 134.434 57.9713C134.903 57.9713 135.305 57.8736 135.641 57.6783C135.977 57.483 136.246 57.2134 136.449 56.8697C136.653 56.5181 136.801 56.1119 136.895 55.6509C136.996 55.19 137.047 54.6939 137.047 54.1627V53.9166C137.047 53.4009 136.996 52.9166 136.895 52.4634C136.801 52.0025 136.649 51.5963 136.438 51.2447C136.235 50.8853 135.965 50.6041 135.629 50.4009C135.293 50.1978 134.887 50.0963 134.41 50.0963C133.934 50.0963 133.528 50.1978 133.192 50.4009C132.863 50.6041 132.594 50.8853 132.383 51.2447C132.18 51.5963 132.031 52.0025 131.938 52.4634C131.844 52.9166 131.797 53.4009 131.797 53.9166ZM150.535 47.6939H153.594V59.9517C153.594 61.108 153.336 62.0884 152.82 62.8931C152.313 63.7056 151.602 64.3189 150.688 64.733C149.774 65.1548 148.711 65.3658 147.5 65.3658C146.969 65.3658 146.406 65.2955 145.813 65.1548C145.227 65.0142 144.664 64.7955 144.125 64.4986C143.594 64.2017 143.149 63.8267 142.789 63.3736L144.278 61.3814C144.668 61.8345 145.121 62.1861 
145.637 62.4361C146.153 62.6939 146.723 62.8228 147.348 62.8228C147.957 62.8228 148.473 62.7095 148.895 62.483C149.317 62.2642 149.641 61.94 149.867 61.5103C150.094 61.0884 150.207 60.5767 150.207 59.9752V50.6236L150.535 47.6939ZM142.004 54.1861V53.94C142.004 52.9713 142.121 52.0923 142.356 51.3033C142.598 50.5064 142.938 49.8228 143.375 49.2525C143.82 48.6822 144.36 48.2408 144.992 47.9283C145.625 47.6158 146.34 47.4595 147.137 47.4595C147.981 47.4595 148.688 47.6158 149.258 47.9283C149.828 48.2408 150.297 48.6861 150.664 49.2642C151.031 49.8345 151.317 50.5103 151.52 51.2916C151.731 52.065 151.895 52.9127 152.012 53.8345V54.3736C151.895 55.2564 151.719 56.0767 151.485 56.8345C151.25 57.5923 150.942 58.2564 150.559 58.8267C150.176 59.3892 149.699 59.8267 149.129 60.1392C148.567 60.4517 147.895 60.608 147.113 60.608C146.332 60.608 145.625 60.4478 144.992 60.1275C144.367 59.8072 143.832 59.358 143.387 58.7798C142.942 58.2017 142.598 57.522 142.356 56.7408C142.121 55.9595 142.004 55.108 142.004 54.1861ZM145.379 53.94V54.1861C145.379 54.7095 145.43 55.1978 145.531 55.6509C145.633 56.1041 145.789 56.5064 146 56.858C146.219 57.2017 146.488 57.4713 146.809 57.6666C147.137 57.8541 147.524 57.9478 147.969 57.9478C148.586 57.9478 149.09 57.8189 149.481 57.5611C149.871 57.2955 150.164 56.9322 150.36 56.4713C150.555 56.0103 150.668 55.4791 150.699 54.8775V53.3423C150.684 52.8502 150.617 52.4088 150.5 52.0181C150.383 51.6197 150.219 51.2798 150.008 50.9986C149.797 50.7173 149.524 50.4986 149.188 50.3423C148.852 50.1861 148.453 50.108 147.992 50.108C147.547 50.108 147.16 50.2095 146.832 50.4127C146.512 50.608 146.242 50.8775 146.024 51.2213C145.813 51.565 145.653 51.9713 145.543 52.44C145.434 52.9009 145.379 53.4009 145.379 53.94Z" fill="black"/>
 <path d="M157.212 52.8801V52.6603C157.212 50.756 157.413 48.9738 157.813 47.3137C158.213 45.6535 158.736 44.1594 159.38 42.8312C160.035 41.5031 160.748 40.3752 161.519 39.4474C162.3 38.5099 163.067 37.8166 163.819 37.3674L164.244 38.5685C163.599 39.0275 162.96 39.6916 162.325 40.5607C161.7 41.4299 161.133 42.4748 160.626 43.6955C160.118 44.9162 159.712 46.2785 159.41 47.7824C159.107 49.2863 158.956 50.8976 158.956 52.6164V52.9094C158.956 54.6281 159.107 56.2394 159.41 57.7433C159.712 59.2473 160.118 60.6096 160.626 61.8303C161.133 63.0607 161.7 64.1154 162.325 64.9943C162.96 65.883 163.599 66.5666 164.244 67.0451L163.819 68.173C163.067 67.7238 162.3 67.0402 161.519 66.1223C160.748 65.2141 160.035 64.1008 159.38 62.7824C158.736 61.4738 158.213 59.9846 157.813 58.3146C157.413 56.6447 157.212 54.8332 157.212 52.8801Z" fill="#ABA9A9"/>
 <path d="M221.935 53.2359V53.0162C221.935 51.1119 221.734 49.3297 221.334 47.6695C220.934 46.0093 220.411 44.5152 219.767 43.1871C219.112 41.8589 218.399 40.731 217.628 39.8033C216.847 38.8658 216.08 38.1724 215.328 37.7232L214.903 38.9244C215.548 39.3834 216.188 40.0474 216.822 40.9166C217.447 41.7857 218.014 42.8306 218.521 44.0513C219.029 45.272 219.435 46.6343 219.737 48.1382C220.04 49.6422 220.191 51.2535 220.191 52.9722V53.2652C220.191 54.9839 220.04 56.5953 219.737 58.0992C219.435 59.6031 219.029 60.9654 218.521 62.1861C218.014 63.4166 217.447 64.4713 216.822 65.3502C216.188 66.2388 215.548 66.9224 214.903 67.4009L215.328 68.5289C216.08 68.0797 216.847 67.3961 217.628 66.4781C218.399 65.5699 219.112 64.4566 219.767 63.1382C220.411 61.8297 220.934 60.3404 221.334 58.6705C221.734 57.0005 221.935 55.189 221.935 53.2359Z" fill="#ABA9A9"/>
 <path d="M200.208 29.5273L195.345 44H191.935L198.31 26.9375H200.489L200.208 29.5273ZM204.275 44L199.388 29.5273L199.095 26.9375H201.286L207.696 44H204.275ZM204.052 37.6602V40.2031H194.9V37.6602H204.052Z" fill="black"/>
 <path d="M171.364 43.9083V61.0177H168.258V47.5294L164.145 48.8888V46.381L171.012 43.9083H171.364Z" fill="black"/>
 <line x1="187.929" y1="46.6207" x2="211.762" y2="46.6207" stroke="#ABA9A9"/>
 <path d="M201.288 69.1207H198.17V55.2691C198.17 54.316 198.354 53.5152 198.721 52.8668C199.088 52.2105 199.612 51.7144 200.291 51.3785C200.971 51.0425 201.772 50.8746 202.694 50.8746C202.998 50.8746 203.288 50.8941 203.561 50.9332C203.842 50.9722 204.12 51.0269 204.393 51.0972L204.334 53.4527C204.186 53.4136 204.022 53.3863 203.842 53.3707C203.67 53.355 203.479 53.3472 203.268 53.3472C202.846 53.3472 202.487 53.4214 202.19 53.5699C201.893 53.7183 201.666 53.9371 201.51 54.2261C201.362 54.5074 201.288 54.855 201.288 55.2691V69.1207ZM203.854 56.441V58.6675H196.26V56.441H203.854Z" fill="black"/>
 <path d="M204.029 65.9971L204.682 67.5322L205.83 65.9971H206.701L205.006 68.1182L206.037 70.2236H205.268L204.568 68.6416L203.377 70.2236H202.514L204.256 68.0479L203.26 65.9971H204.029Z" fill="#0277BD"/>
 <path d="M72.0705 47.6938V50.0845H64.6877V47.6938H72.0705ZM66.5158 44.5649H69.8908V56.5532C69.8908 56.9204 69.9377 57.2017 70.0314 57.397C70.133 57.5923 70.2814 57.729 70.4767 57.8071C70.6721 57.8774 70.9182 57.9126 71.215 57.9126C71.426 57.9126 71.6135 57.9048 71.7775 57.8892C71.9494 57.8657 72.0939 57.8423 72.2111 57.8188L72.2228 60.3032C71.9338 60.397 71.6213 60.4712 71.2853 60.5259C70.9494 60.5806 70.5783 60.6079 70.1721 60.6079C69.4299 60.6079 68.7814 60.4868 68.2267 60.2446C67.6799 59.9946 67.258 59.5962 66.9611 59.0493C66.6642 58.5024 66.5158 57.7837 66.5158 56.8931V44.5649ZM78.2932 60.3735H74.8947V46.5688C74.8947 45.6079 75.0822 44.7993 75.4572 44.1431C75.84 43.479 76.3752 42.979 77.0627 42.6431C77.758 42.2993 78.5822 42.1274 79.5353 42.1274C79.8478 42.1274 80.1486 42.1509 80.4377 42.1978C80.7267 42.2368 81.008 42.2876 81.2814 42.3501L81.2463 44.8931C81.0978 44.854 80.9416 44.8267 80.7775 44.811C80.6135 44.7954 80.4221 44.7876 80.2033 44.7876C79.7971 44.7876 79.4494 44.8579 79.1603 44.9985C78.8791 45.1313 78.6642 45.3306 78.5158 45.5962C78.3674 45.8618 78.2932 46.186 78.2932 46.5688V60.3735ZM80.8244 47.6938V50.0845H73.008V47.6938H80.8244Z" fill="black"/>
 <path d="M82.291 59.894L82.9434 61.4291L84.0918 59.894H84.9629L83.2676 62.0151L84.2989 64.1205H83.5293L82.8301 62.5385L81.6387 64.1205H80.7754L82.5176 61.9448L81.5215 59.894H82.291ZM90.2442 63.6088C90.416 63.6114 90.5762 63.5789 90.7246 63.5112C90.8731 63.4435 90.9994 63.3471 91.1035 63.2221C91.2077 63.0971 91.2819 62.9526 91.3262 62.7885L91.9981 62.7846C91.9564 63.0685 91.8457 63.3172 91.666 63.5307C91.489 63.7442 91.2715 63.9109 91.0137 64.0307C90.7585 64.1479 90.4916 64.2039 90.2129 64.1987C89.916 64.1935 89.6634 64.1323 89.4551 64.0151C89.2494 63.8953 89.084 63.7364 88.959 63.5385C88.834 63.3406 88.7481 63.1179 88.7012 62.8705C88.6543 62.6205 88.6439 62.364 88.67 62.101L88.6856 61.933C88.7168 61.6492 88.7858 61.3797 88.8926 61.1245C88.9994 60.8666 89.1413 60.6388 89.3184 60.4409C89.4981 60.2403 89.7103 60.0841 89.9551 59.9721C90.1999 59.8601 90.4746 59.808 90.7793 59.8159C91.0762 59.8211 91.334 59.8914 91.5528 60.0268C91.7715 60.1596 91.9408 60.3406 92.0606 60.5698C92.1804 60.7989 92.2403 61.0593 92.2403 61.351L91.5762 61.3471C91.5736 61.1804 91.541 61.0268 91.4785 60.8862C91.416 60.7455 91.3236 60.6323 91.2012 60.5463C91.0788 60.4604 90.9278 60.4135 90.7481 60.4057C90.5319 60.4005 90.3431 60.4409 90.1817 60.5268C90.0228 60.6127 89.8874 60.7312 89.7754 60.8823C89.666 61.0307 89.5788 61.1961 89.5137 61.3784C89.4512 61.5606 89.4082 61.7455 89.3848 61.933L89.3653 62.0971C89.3496 62.2638 89.347 62.4343 89.3575 62.6088C89.3705 62.7833 89.4069 62.9461 89.4668 63.0971C89.5267 63.2455 89.6192 63.3666 89.7442 63.4604C89.8692 63.5541 90.0358 63.6036 90.2442 63.6088Z" fill="#0277BD"/>
 <path d="M85.7754 63.2612L85.6817 63.8393C85.6374 64.1231 85.5371 64.3875 85.3809 64.6323C85.2246 64.8771 85.0332 65.0854 84.8067 65.2573L84.416 64.9643C84.5072 64.8523 84.5905 64.7377 84.666 64.6205C84.7416 64.506 84.8054 64.3849 84.8575 64.2573C84.9121 64.1297 84.9538 63.9955 84.9825 63.8549L85.084 63.2612H85.7754Z" fill="#ABA9A9"/>
 <path d="M13.8242 58.3923L17.2227 44.5993H19.0625L19.1797 47.5056L15.5469 61.6618H13.6016L13.8242 58.3923ZM11.6797 44.5993L14.4688 58.3454V61.6618H12.3477L8.48047 44.5993H11.6797ZM22.707 58.2868L25.4492 44.5993H28.6602L24.793 61.6618H22.6719L22.707 58.2868ZM19.9414 44.5993L23.3398 58.4391L23.5391 61.6618H21.5938L17.9727 47.4938L18.1133 44.5993H19.9414Z" fill="black"/>
 <path d="M29.2528 60.5382L29.9052 62.0734L31.0536 60.5382H31.9247L30.2294 62.6593L31.2606 64.7648H30.4911L29.7919 63.1827L28.6005 64.7648H27.7372L29.4794 62.589L28.4833 60.5382H29.2528ZM37.2059 64.2531C37.3778 64.2557 37.538 64.2231 37.6864 64.1554C37.8348 64.0877 37.9611 63.9913 38.0653 63.8663C38.1695 63.7413 38.2437 63.5968 38.288 63.4327L38.9598 63.4288C38.9182 63.7127 38.8075 63.9614 38.6278 64.1749C38.4507 64.3885 38.2333 64.5551 37.9755 64.6749C37.7203 64.7921 37.4533 64.8481 37.1747 64.8429C36.8778 64.8377 36.6252 64.7765 36.4169 64.6593C36.2111 64.5395 36.0458 64.3807 35.9208 64.1827C35.7958 63.9848 35.7098 63.7622 35.663 63.5148C35.6161 63.2648 35.6057 63.0083 35.6317 62.7452L35.6473 62.5773C35.6786 62.2934 35.7476 62.0239 35.8544 61.7687C35.9611 61.5109 36.1031 61.283 36.2802 61.0851C36.4598 60.8846 36.6721 60.7283 36.9169 60.6163C37.1617 60.5044 37.4364 60.4523 37.7411 60.4601C38.038 60.4653 38.2958 60.5356 38.5145 60.671C38.7333 60.8038 38.9025 60.9848 39.0223 61.214C39.1421 61.4432 39.202 61.7036 39.202 61.9952L38.538 61.9913C38.5354 61.8247 38.5028 61.671 38.4403 61.5304C38.3778 61.3898 38.2854 61.2765 38.163 61.1906C38.0406 61.1046 37.8895 61.0577 37.7098 61.0499C37.4937 61.0447 37.3049 61.0851 37.1434 61.171C36.9846 61.257 36.8492 61.3754 36.7372 61.5265C36.6278 61.6749 36.5406 61.8403 36.4755 62.0226C36.413 62.2049 36.37 62.3898 36.3466 62.5773L36.327 62.7413C36.3114 62.908 36.3088 63.0786 36.3192 63.2531C36.3322 63.4275 36.3687 63.5903 36.4286 63.7413C36.4885 63.8898 36.5809 64.0109 36.7059 64.1046C36.8309 64.1984 36.9976 64.2478 37.2059 64.2531Z" fill="#0277BD"/>
 <path d="M32.7372 63.9054L32.6434 64.4835C32.5992 64.7674 32.4989 65.0317 32.3427 65.2765C32.1864 65.5213 31.995 65.7296 31.7684 65.9015L31.3778 65.6085C31.469 65.4965 31.5523 65.382 31.6278 65.2648C31.7033 65.1502 31.7671 65.0291 31.8192 64.9015C31.8739 64.7739 31.9156 64.6398 31.9442 64.4991L32.0458 63.9054H32.7372Z" fill="#ABA9A9"/>
 </svg>