Add BERTopic.
After reducing the dimensionality of our input embeddings, we need to cluster them into groups of similar embeddings to extract our topics.
This clustering step is important: the better our clustering technique performs, the more accurate our topic representations are.

In BERTopic, we typically use HDBSCAN because it is capable of capturing structures with different densities. However, there is no single perfect
clustering model, and you might want to use something entirely different for your use case. Moreover, what if a new state-of-the-art model
is released tomorrow? We would like to be able to use that in BERTopic, right? Since BERTopic assumes some independence among steps, we can allow for this modularity:

<figure markdown>
<figcaption></figcaption>
</figure>

As a result, the `hdbscan_model` parameter in BERTopic now accepts a variety of clustering models. To be used this way, the class should
provide the following methods and attributes:

* `.fit(X)`
    * A function that can be used to fit the model
* `.predict(X)`
    * A predict function that transforms the input to cluster labels
* `.labels_`
    * The labels after fitting the model

In other words, it should have the following structure:

```python
class ClusterModel:
    def fit(self, X):
        # Placeholder: a real model stores one cluster label per row of X here
        self.labels_ = None
        return self

    def predict(self, X):
        # Placeholder: a real model returns one cluster label per row of X
        return X
```

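As a minimal sketch of that structure, assuming NumPy is available, the toy class below assigns each point to the nearest of a set of fixed centroids. The class name `NearestCentroidModel` and its `centroids` argument are hypothetical and not part of BERTopic; the only goal is to show `.fit`, `.predict`, and `.labels_` working together.

```python
import numpy as np


class NearestCentroidModel:
    """Toy cluster model (hypothetical) exposing .fit, .predict and .labels_."""

    def __init__(self, centroids):
        # Fixed cluster centers, shape (n_clusters, n_features)
        self.centroids = np.asarray(centroids, dtype=float)
        self.labels_ = None

    def _assign(self, X):
        X = np.asarray(X, dtype=float)
        # Euclidean distance from every point to every centroid,
        # shape (n_samples, n_clusters)
        distances = np.linalg.norm(
            X[:, None, :] - self.centroids[None, :, :], axis=2
        )
        # Index of the closest centroid is the cluster label
        return distances.argmin(axis=1)

    def fit(self, X):
        # Store one label per training point, as the interface requires
        self.labels_ = self._assign(X)
        return self

    def predict(self, X):
        # Label unseen points with the same nearest-centroid rule
        return self._assign(X)


X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.0]])
model = NearestCentroidModel(centroids=[[0, 0], [5, 5]]).fit(X)
print(model.labels_)                # [0 0 1 1]
print(model.predict([[4.0, 4.0]]))  # [1]
```

Since this class has the required methods and attribute, an instance of it should in principle be usable through the `hdbscan_model` parameter, although a fixed-centroid model like this one is only meant as an illustration.
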
In this section, we will go through several examples of clustering algorithms and how they can be implemented.

## **HDBSCAN**
By default, BERTopic uses HDBSCAN to perform its clustering. To use an HDBSCAN model with custom parameters,
we simply define it and pass it to BERTopic:

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model)
```

Here, you can set any of HDBSCAN's parameters to optimize performance against whatever validation metrics you are using.

## **k-Means**
Although HDBSCAN works quite well in BERTopic and is typically advised, you might want to use k-Means instead.
It allows you to select how many clusters you would like and forces every single point to be in a cluster. Therefore, no
outliers will be created. This also has a disadvantage: when every single point is forced into a cluster, the clusters
are highly likely to contain noise, which can hurt the topic representations. As a small tip, using
`vectorizer_model=CountVectorizer(stop_words="english")` helps quite a bit to improve the topic representations.

Having said that, using k-Means is quite straightforward:

```python
from bertopic import BERTopic
from sklearn.cluster import KMeans

cluster_model = KMeans(n_clusters=50)
topic_model = BERTopic(hdbscan_model=cluster_model)
```

!!! note
    As you might have noticed, the `cluster_model` is passed to `hdbscan_model` which might be a bit confusing considering
    you are not passing an HDBSCAN model. For now, the name of the parameter is kept the same to adhere to the current
    state of the API. Changing the name could lead to deprecation issues, which I want to prevent as much as possible.

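To make the "no outliers" behavior of k-Means concrete, here is a small sketch on toy data, assuming NumPy and scikit-learn are installed. It uses scikit-learn's `DBSCAN` purely as a stand-in for a density-based method such as HDBSCAN, so the parameter values (`eps`, `min_samples`) are illustrative assumptions and not BERTopic defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Two tight groups plus one far-away point that belongs to neither
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [20.0, 20.0],
])

# k-Means assigns every point, including the stray one, to some cluster
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).labels_

# A density-based model can mark the stray point as noise (label -1)
dbscan_labels = DBSCAN(eps=0.5, min_samples=3).fit(X).labels_

print(kmeans_labels)  # only labels 0 and 1 appear
print(dbscan_labels)  # the last point is labeled -1
```

With k-Means, the stray point ends up inside one of the clusters, and in a topic model its text would then feed into that topic's representation.
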
## **Agglomerative Clustering**
Beyond k-Means, `sklearn` offers many more clustering algorithms that you can use. Some of these models do
not have a `.predict()` method but can still be used in BERTopic. However, using BERTopic's `.transform()` function
with such a model will then raise an error.

Here, we will demonstrate Agglomerative Clustering:

```python
from bertopic import BERTopic
from sklearn.cluster import AgglomerativeClustering

cluster_model = AgglomerativeClustering(n_clusters=50)
topic_model = BERTopic(hdbscan_model=cluster_model)
```

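If you are unsure whether a given clustering model offers `.predict()`, a quick check before relying on `.transform()` can help. The sketch below assumes scikit-learn is installed; `supports_predict` is a hypothetical helper written for this example and not a BERTopic function.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans


def supports_predict(cluster_model):
    """Return True if the model exposes a callable .predict() method."""
    return callable(getattr(cluster_model, "predict", None))


print(supports_predict(KMeans(n_clusters=50)))                   # True
print(supports_predict(AgglomerativeClustering(n_clusters=50)))  # False
```
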
## **cuML HDBSCAN**

Although the original HDBSCAN implementation is an amazing technique, it may have difficulty handling large amounts of data. Instead,
we can use [cuML](https://rapids.ai/start.html#rapids-release-selector) to speed up HDBSCAN through GPU acceleration:

```python
from bertopic import BERTopic
from cuml.cluster import HDBSCAN

hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model)
```

The great thing about using cuML's HDBSCAN implementation is that it supports many features of the original implementation. For example,
`calculate_probabilities=True` also works!

!!! note
    As of the v0.13 release, it is not yet possible to calculate the topic-document probability matrix for unseen data (i.e., `.transform`) using cuML's HDBSCAN.
    However, it is still possible to calculate the topic-document probability matrix for the data on which the model was trained (i.e., `.fit` and `.fit_transform`).

!!! note
    If you want to install cuML together with BERTopic using Google Colab, you can run the following code:

    ```bash
    !pip install bertopic
    !pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com
    !pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com
    !pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
    !pip install --upgrade cupy-cuda11x -f https://pip.cupy.dev/aarch64
    ```