Add BERTopic.
@@ -0,0 +1,138 @@
An important aspect of BERTopic is the dimensionality reduction of the input embeddings. Because embeddings are often high-dimensional, clustering them becomes difficult due to the curse of dimensionality.

A solution is to reduce the dimensionality of the embeddings to a workable dimensional space (e.g., 5) for clustering algorithms to work with.
UMAP is used as a default in BERTopic since it can capture both the local and global structure of the high-dimensional space in lower dimensions.
However, there are other solutions out there, such as PCA, that users might be interested in trying out. Since BERTopic assumes some independence between steps, we can
use any other dimensionality reduction algorithm. The image below illustrates this modularity:

<figure markdown>
<figcaption></figcaption>
</figure>

As a result, the `umap_model` parameter in BERTopic now allows for a variety of dimensionality reduction models. To be used there, a class should have
the following methods:

* `.fit(X)`
    * A function that can be used to fit the model
* `.transform(X)`
    * A function that transforms the input to a lower-dimensional space

In other words, it should have the following structure:

```python
class DimensionalityReduction:
    def fit(self, X):
        # Learn whatever the model needs from X, then return the instance
        return self

    def transform(self, X):
        # Return the reduced representation of X (here: X unchanged)
        return X
```
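
As a concrete instance of the structure above, here is a minimal sketch of a custom model that performs a Gaussian random projection using only NumPy. The class name `RandomProjection` and its parameters (`n_components`, `random_state`) are hypothetical and chosen for illustration. The optional `y` argument in `fit` is an assumption that follows the scikit-learn convention and is ignored here.

```python
import numpy as np


class RandomProjection:
    """Hypothetical example: reduce dimensionality with a Gaussian random projection."""

    def __init__(self, n_components=5, random_state=42):
        self.n_components = n_components
        self.random_state = random_state
        self.components_ = None

    def fit(self, X, y=None):
        # Draw one random projection matrix of shape (n_features, n_components).
        # A standard deviation of 1/sqrt(n_components) keeps squared norms
        # unchanged in expectation.
        X = np.asarray(X)
        rng = np.random.default_rng(self.random_state)
        self.components_ = rng.normal(
            scale=1.0 / np.sqrt(self.n_components),
            size=(X.shape[1], self.n_components),
        )
        return self

    def transform(self, X):
        # Project the input onto the lower-dimensional space drawn in fit().
        if self.components_ is None:
            raise RuntimeError("Call fit() before transform().")
        return np.asarray(X) @ self.components_


# Stand-in for document embeddings: 100 vectors with 384 dimensions each
embeddings = np.random.default_rng(0).normal(size=(100, 384))
reduced = RandomProjection(n_components=5).fit(embeddings).transform(embeddings)
print(reduced.shape)
```

Since the class exposes both `.fit` and `.transform`, an instance should be usable in the same way as the models in the sections that follow, for example `BERTopic(umap_model=RandomProjection())`.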

In this section, we will go through several examples of dimensionality reduction techniques and how they can be implemented.

## **UMAP**
By default, BERTopic uses UMAP to perform its dimensionality reduction. To use a UMAP model with custom parameters,
we simply define it and pass it to BERTopic:

```python
from bertopic import BERTopic
from umap import UMAP

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')
topic_model = BERTopic(umap_model=umap_model)
```

Here, you can set any UMAP parameter and tune it for the best performance based on whatever validation metrics you are using.
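
One way to organize such tuning is to enumerate candidate UMAP settings and build one model per combination. The following is a rough sketch under stated assumptions: the helper names `umap_param_grid` and `build_topic_models` and the candidate values are hypothetical, and scoring each model is left to whatever validation metric you use. BERTopic and UMAP are imported inside the second helper so that the grid itself can be inspected without them.

```python
from itertools import product


def umap_param_grid(n_neighbors=(5, 15, 50), n_components=(2, 5, 10)):
    """Yield one dict of UMAP keyword arguments per parameter combination."""
    for neighbors, components in product(n_neighbors, n_components):
        yield {
            "n_neighbors": neighbors,
            "n_components": components,
            "min_dist": 0.0,
            "metric": "cosine",
        }


def build_topic_models(grid):
    """Create one unfitted BERTopic model per UMAP configuration."""
    # Imported here so the grid above can be built without these packages.
    from bertopic import BERTopic
    from umap import UMAP

    return [BERTopic(umap_model=UMAP(**params)) for params in grid]


grid = list(umap_param_grid())
print(len(grid))  # 3 x 3 = 9 candidate configurations
```

Each model returned by `build_topic_models(grid)` can then be fitted and compared on the same documents.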

## **PCA**
Although UMAP works quite well in BERTopic and is typically advised, you might want to use PCA instead. It can be faster to train and to perform
inference with. To use PCA, we can simply import it from `sklearn` and pass it to the `umap_model` parameter:

```python
from bertopic import BERTopic
from sklearn.decomposition import PCA

dim_model = PCA(n_components=5)
topic_model = BERTopic(umap_model=dim_model)
```

As a small note, PCA and k-Means have worked quite well in my experiments and might be interesting to use instead of PCA and HDBSCAN.
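
The PCA and k-Means combination could look like the sketch below. It assumes that the clustering model is swapped through BERTopic's `hdbscan_model` parameter, analogous to `umap_model`. The helper name `build_pca_kmeans_model` and the value `n_clusters=50` are illustrative choices, not recommendations.

```python
def build_pca_kmeans_model(n_components=5, n_clusters=50):
    """Return an unfitted BERTopic model that uses PCA and k-Means."""
    # Imported inside the function so the heavy dependencies load only when needed.
    from bertopic import BERTopic
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    dim_model = PCA(n_components=n_components)
    cluster_model = KMeans(n_clusters=n_clusters)  # illustrative cluster count
    return BERTopic(umap_model=dim_model, hdbscan_model=cluster_model)
```

Calling `build_pca_kmeans_model()` returns a model that is fitted as usual. Keep in mind that k-Means assigns every document to a cluster, so no outlier topic is produced.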

!!! note
    As you might have noticed, the `dim_model` is passed to `umap_model`, which might be a bit confusing considering
    you are not passing a UMAP model. For now, the name of the parameter is kept the same to adhere to the current
    state of the API. Changing the name could lead to deprecation issues, which I want to prevent as much as possible.

## **Truncated SVD**
Like PCA, there are many more dimensionality reduction techniques in `sklearn` that you can use. Here, we will demonstrate Truncated SVD,
but any model can be used as long as it has both a `.fit()` and a `.transform()` method:

```python
from bertopic import BERTopic
from sklearn.decomposition import TruncatedSVD

dim_model = TruncatedSVD(n_components=5)
topic_model = BERTopic(umap_model=dim_model)
```
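
Because the requirement is purely about the presence of two methods, it can be checked before a model is handed to BERTopic. The helper below is a hypothetical convenience function, not part of BERTopic; it only verifies that `fit` and `transform` exist and are callable.

```python
def is_valid_dim_model(model):
    """Return True if the model exposes callable fit and transform methods."""
    return all(callable(getattr(model, name, None)) for name in ("fit", "transform"))


class Identity:
    """Toy model with both methods; returns its input unchanged."""

    def fit(self, X):
        return self

    def transform(self, X):
        return X


class FitOnly:
    """Toy model that lacks transform, so it does not qualify."""

    def fit(self, X):
        return self


print(is_valid_dim_model(Identity()))  # True
print(is_valid_dim_model(FitOnly()))   # False
```

This matters for estimators such as scikit-learn's `TSNE`, which offers `fit_transform` but no separate `transform` method and therefore does not satisfy the requirement as is.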

## **cuML UMAP**

Although the original UMAP implementation is an amazing technique, it may have difficulty handling large amounts of data. Instead,
we can use [cuML](https://rapids.ai/start.html#rapids-release-selector) to speed up UMAP through GPU acceleration:

```python
from bertopic import BERTopic
from cuml.manifold import UMAP

umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
topic_model = BERTopic(umap_model=umap_model)
```
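
Since cuML needs a supported NVIDIA GPU, code that has to run on machines with and without one may want to fall back to the CPU implementation. Below is a minimal sketch, assuming both packages expose a `UMAP` class that accepts the parameters used here. `pick_umap_backend` and `make_umap` are hypothetical helper names, and the check only confirms that a package is installed, not that a GPU is actually usable.

```python
from importlib.util import find_spec


def pick_umap_backend():
    """Return "cuml" if cuML is installed, "umap" if umap-learn is, else None."""
    for package in ("cuml", "umap"):
        if find_spec(package) is not None:
            return package
    return None


def make_umap(**params):
    """Build a UMAP model, preferring the GPU implementation when installed."""
    backend = pick_umap_backend()
    if backend == "cuml":
        from cuml.manifold import UMAP
    elif backend == "umap":
        from umap import UMAP
    else:
        raise ImportError("Neither cuML nor umap-learn is installed.")
    return UMAP(**params)


print(pick_umap_backend())
```

The result of `make_umap(n_components=5, n_neighbors=15, min_dist=0.0)` can then be passed to `BERTopic` through `umap_model`, exactly as in the snippet above.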

!!! note
    If you want to install cuML together with BERTopic using Google Colab, you can run the following code:

    ```bash
    !pip install bertopic
    !pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com
    !pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com
    !pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
    !pip install --upgrade cupy-cuda11x -f https://pip.cupy.dev/aarch64
    ```

## **Skip dimensionality reduction**
Although BERTopic applies dimensionality reduction as a default in its pipeline, this is a step that you might want to skip. To do so, we create an "empty" model that simply returns the data passed to it:

```python
from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction

# Fit BERTopic without actually performing any dimensionality reduction
empty_dimensionality_model = BaseDimensionalityReduction()
topic_model = BERTopic(umap_model=empty_dimensionality_model)
```

In other words, we go from this pipeline:

<br>
<div class="svg_image">
--8<-- "docs/getting_started/dim_reduction/default_pipeline.svg"
</div>
<br>

To the following pipeline:

<br>
<div class="svg_image">
--8<-- "docs/getting_started/dim_reduction/no_dimensionality.svg"
</div>

<br>