Add BERTopic.

戒酒的李白
2025-08-12 19:01:20 +08:00
parent e2323d579c
commit c5c530775e
256 changed files with 28666 additions and 0 deletions
@@ -0,0 +1,18 @@
<svg width="534" height="57" viewBox="0 0 534 57" fill="none" xmlns="http://www.w3.org/2000/svg">
<rect width="534" height="57" fill="white"/>
<rect x="0.5" y="14.5" width="88" height="42" fill="white" stroke="black"/>
<text fill="#757474" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="10" letter-spacing="0em"><tspan x="30" y="10.9697">SBERT</tspan></text>
<text fill="#757474" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="10" letter-spacing="0em"><tspan x="183" y="10.9697">UMAP</tspan></text>
<text fill="#757474" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="10" letter-spacing="0em"><tspan x="313" y="10.9697">HDBSCAN</tspan></text>
<text fill="#757474" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="10" letter-spacing="0em"><tspan x="468" y="10.9697">c-TF-IDF</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="9" y="38.7637">Embeddings</tspan></text>
<rect x="142.5" y="14.5" width="105" height="42" fill="white" stroke="black"/>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="156.094" y="33.7637">Dimensionality &#10;</tspan><tspan x="171.762" y="47.7637">reduction</tspan></text>
<path d="M126.707 33.7071C127.098 33.3166 127.098 32.6834 126.707 32.2929L120.343 25.9289C119.953 25.5384 119.319 25.5384 118.929 25.9289C118.538 26.3195 118.538 26.9526 118.929 27.3431L124.586 33L118.929 38.6569C118.538 39.0474 118.538 39.6805 118.929 40.0711C119.319 40.4616 119.953 40.4616 120.343 40.0711L126.707 33.7071ZM99 34H126V32H99V34Z" fill="black"/>
<rect x="295.5" y="14.5" width="91" height="42" fill="white" stroke="black"/>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="317" y="38.7637">Clustering</tspan></text>
<path d="M285.707 33.7071C286.098 33.3166 286.098 32.6834 285.707 32.2929L279.343 25.9289C278.953 25.5384 278.319 25.5384 277.929 25.9289C277.538 26.3195 277.538 26.9526 277.929 27.3431L283.586 33L277.929 38.6569C277.538 39.0474 277.538 39.6805 277.929 40.0711C278.319 40.4616 278.953 40.4616 279.343 40.0711L285.707 33.7071ZM258 34H285V32H258V34Z" fill="black"/>
<rect x="442.5" y="14.5" width="91" height="42" fill="white" stroke="black"/>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="472.404" y="30.7637">Topic &#10;</tspan><tspan x="450.215" y="44.7637">representation</tspan></text>
<path d="M426.707 33.7071C427.098 33.3166 427.098 32.6834 426.707 32.2929L420.343 25.9289C419.953 25.5384 419.319 25.5384 418.929 25.9289C418.538 26.3195 418.538 26.9526 418.929 27.3431L424.586 33L418.929 38.6569C418.538 39.0474 418.538 39.6805 418.929 40.0711C419.319 40.4616 419.953 40.4616 420.343 40.0711L426.707 33.7071ZM399 34H426V32H399V34Z" fill="black"/>
</svg>


@@ -0,0 +1,138 @@
An important aspect of BERTopic is the dimensionality reduction of the input embeddings. As embeddings are often high in dimensionality, clustering becomes difficult due to the curse of dimensionality.
A solution is to reduce the dimensionality of the embeddings to a workable dimensional space (e.g., 5) for clustering algorithms to work with.
UMAP is the default in BERTopic since it can preserve both the local and global structure of the high-dimensional space in lower dimensions.
However, there are other solutions out there, such as PCA, that users might be interested in trying out. Since BERTopic assumes some independence between steps, we can
use any other dimensionality reduction algorithm. The image below illustrates this modularity:
<figure markdown>
![Image title](dimensionality.svg)
<figcaption></figcaption>
</figure>
As a result, the `umap_model` parameter in BERTopic accepts a variety of dimensionality reduction models. To be accepted, the model's class should have
the following methods:

* `.fit(X)`
    * A function that can be used to fit the model
* `.transform(X)`
    * A transform function that transforms the input to a lower dimensional size

In other words, it should have the following structure:
```python
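class DimensionalityReduction:
    def fit(self, X):
        # Learn whatever the reduction needs from X, then return the fitted model
        return self

    def transform(self, X):
        # Return the reduced representation of X (this skeleton returns it unchanged)
        return X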
```
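
To make the structure above more concrete, here is a minimal sketch of a custom model that follows it. The class name `RandomProjectionReducer` and its parameters are hypothetical and not part of BERTopic; the sketch swaps in a plain Gaussian random projection purely for illustration and assumes only that `numpy` is installed:

```python
import numpy as np


class RandomProjectionReducer:
    """Hypothetical reducer: projects embeddings onto random Gaussian directions."""

    def __init__(self, n_components=5, random_state=42):
        self.n_components = n_components
        self.random_state = random_state
        self.components_ = None

    def fit(self, X):
        # Draw one random direction per output dimension, scaled so that
        # squared norms are preserved in expectation.
        rng = np.random.default_rng(self.random_state)
        n_features = np.asarray(X).shape[1]
        self.components_ = rng.normal(
            scale=1.0 / np.sqrt(self.n_components),
            size=(n_features, self.n_components),
        )
        return self

    def transform(self, X):
        # Project the input onto the directions drawn in fit()
        return np.asarray(X) @ self.components_


# Stand-in for document embeddings: 100 documents, 384 dimensions
embeddings = np.random.default_rng(0).normal(size=(100, 384))
reduced = RandomProjectionReducer(n_components=5).fit(embeddings).transform(embeddings)
print(reduced.shape)  # (100, 5)
```

Given the two-method interface described above, an instance of such a class could then be passed as `BERTopic(umap_model=RandomProjectionReducer())`.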
In this section, we will go through several examples of dimensionality reduction techniques and how they can be implemented.
## **UMAP**
As a default, BERTopic uses UMAP to perform its dimensionality reduction. To use a UMAP model with custom parameters,
we simply define it and pass it to BERTopic:
```python
from bertopic import BERTopic
from umap import UMAP
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')
topic_model = BERTopic(umap_model=umap_model)
```
Here, you can set any of UMAP's parameters to optimize performance against whatever validation metrics you are using.
## **PCA**
Although UMAP works quite well in BERTopic and is typically advised, you might want to use PCA instead. It can be faster both to train and to run
inference with. To use PCA, we can simply import it from `sklearn` and pass it to the `umap_model` parameter:
```python
from bertopic import BERTopic
from sklearn.decomposition import PCA
dim_model = PCA(n_components=5)
topic_model = BERTopic(umap_model=dim_model)
```
As a small note, PCA and k-Means have worked quite well in my experiments and might be an interesting alternative to PCA and HDBSCAN.
!!! note
    As you might have noticed, the `dim_model` is passed to `umap_model` which might be a bit confusing considering
    you are not passing a UMAP model. For now, the name of the parameter is kept the same to adhere to the current
    state of the API. Changing the name could lead to deprecation issues, which I want to prevent as much as possible.
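
The PCA and k-Means pairing mentioned above can be tried on its own before wiring it into BERTopic. The following is only a sketch under stated assumptions: the embeddings are synthetic stand-ins (three artificial groups), and the cluster count of 3 is chosen to match that synthetic data rather than taken from any real corpus:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for sentence embeddings: 3 well-separated groups in 50 dimensions
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 50))
embeddings = np.vstack([c + rng.normal(size=(40, 50)) for c in centers])

# Step 1: reduce to 5 dimensions, as in the PCA example above
dim_model = PCA(n_components=5)
reduced = dim_model.fit(embeddings).transform(embeddings)

# Step 2: cluster the reduced vectors. Unlike HDBSCAN, k-Means needs the number
# of clusters up front and assigns every point to a cluster (no outlier label).
cluster_model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = cluster_model.fit_predict(reduced)

print(reduced.shape)  # (120, 5)
print(len(set(labels.tolist())))  # number of distinct clusters found
```

In BERTopic itself, a k-Means model is, to my knowledge, supplied through the `hdbscan_model` parameter in the same way `dim_model` is supplied through `umap_model`; check the clustering documentation of your version before relying on this.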
## **Truncated SVD**
Besides PCA, `sklearn` offers many more dimensionality reduction techniques that you can use. Here, we will demonstrate Truncated SVD,
but any model can be used as long as it has both a `.fit()` and a `.transform()` method:
```python
from bertopic import BERTopic
from sklearn.decomposition import TruncatedSVD
dim_model = TruncatedSVD(n_components=5)
topic_model = BERTopic(umap_model=dim_model)
```
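
Since the only requirement is the presence of `.fit()` and `.transform()`, it can be handy to check a candidate model before passing it on. The helper below is a hypothetical convenience, not a BERTopic function. It is useful because not every `sklearn` model qualifies: `TSNE`, for instance, provides `fit_transform` but no separate `transform`.

```python
def check_reducer_interface(model):
    """Return the required method names that `model` lacks (or that are not callable)."""
    required = ("fit", "transform")
    return [name for name in required if not callable(getattr(model, name, None))]


class FitOnly:
    # Deliberately incomplete: has fit() but no transform()
    def fit(self, X):
        return self


class PassThrough:
    def fit(self, X):
        return self

    def transform(self, X):
        return X


print(check_reducer_interface(PassThrough()))  # []
print(check_reducer_interface(FitOnly()))      # ['transform']
```

An empty list means the model exposes both methods; it does not guarantee that they behave sensibly on your embeddings.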
## **cuML UMAP**
Although the original UMAP implementation works well, it may struggle with large amounts of data. Instead,
we can use [cuML](https://rapids.ai/start.html#rapids-release-selector) to speed up UMAP through GPU acceleration:
```python
from bertopic import BERTopic
from cuml.manifold import UMAP
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
topic_model = BERTopic(umap_model=umap_model)
```
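
If the same script has to run on machines with and without a GPU, one option is to fall back to the CPU implementation when cuML cannot be imported. This is a sketch of that pattern rather than anything BERTopic prescribes; the helper name `load_umap_class` is hypothetical:

```python
def load_umap_class():
    """Return a UMAP class, preferring the GPU implementation when it is importable."""
    try:
        from cuml.manifold import UMAP  # GPU version, requires a RAPIDS install
        return UMAP
    except ImportError:
        pass
    try:
        from umap import UMAP  # CPU version from the umap-learn package
        return UMAP
    except ImportError:
        return None


UMAP = load_umap_class()
if UMAP is None:
    print("Neither cuML nor umap-learn is installed")
else:
    # Same parameters as in the cuML example above
    umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
    print(f"Using {UMAP.__module__}.{UMAP.__name__}")
```

The resulting `umap_model` is passed to `BERTopic(umap_model=umap_model)` exactly as before.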
!!! note
    If you want to install cuML together with BERTopic using Google Colab, you can run the following code:

    ```bash
    !pip install bertopic
    !pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com
    !pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com
    !pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
    !pip install --upgrade cupy-cuda11x -f https://pip.cupy.dev/aarch64
    ```
## **Skip dimensionality reduction**
Although BERTopic applies dimensionality reduction by default in its pipeline, this is a step that you might want to skip. To do so, we can use an "empty" model that simply returns the data passed to it:
```python
from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction
# Fit BERTopic without actually performing any dimensionality reduction
empty_dimensionality_model = BaseDimensionalityReduction()
topic_model = BERTopic(umap_model=empty_dimensionality_model)
```
In other words, we go from this pipeline:
<br>
<div class="svg_image">
--8<-- "docs/getting_started/dim_reduction/default_pipeline.svg"
</div>
<br>
To the following pipeline:
<br>
<div class="svg_image">
--8<-- "docs/getting_started/dim_reduction/no_dimensionality.svg"
</div>
<br>
@@ -0,0 +1,53 @@
<svg width="445" height="278" viewBox="0 0 445 278" fill="none" xmlns="http://www.w3.org/2000/svg">
<rect x="132" y="240" width="118" height="38" fill="#64B5F6"/>
<rect x="224" y="230" width="20" height="8" fill="#64B5F6"/>
<rect x="196" y="230" width="20" height="8" fill="#64B5F6"/>
<rect x="168" y="230" width="20" height="8" fill="#64B5F6"/>
<rect x="140" y="230" width="20" height="8" fill="#64B5F6"/>
<text fill="white" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="20" font-weight="bold" letter-spacing="0em"><tspan x="158.256" y="267.939">SBERT</tspan></text>
<rect y="200" width="118" height="38" fill="#E57373"/>
<rect x="92" y="190" width="20" height="8" fill="#E57373"/>
<rect x="64" y="190" width="20" height="8" fill="#E57373"/>
<rect x="36" y="190" width="20" height="8" fill="#E57373"/>
<rect x="8" y="190" width="20" height="8" fill="#E57373"/>
<text fill="white" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="20" font-weight="bold" letter-spacing="0em"><tspan x="29.2539" y="227.939">UMAP</tspan></text>
<rect x="132" y="200" width="118" height="38" fill="#E57373"/>
<rect x="224" y="190" width="20" height="8" fill="#E57373"/>
<rect x="196" y="190" width="20" height="8" fill="#E57373"/>
<rect x="168" y="190" width="20" height="8" fill="#E57373"/>
<rect x="140" y="190" width="20" height="8" fill="#E57373"/>
<text fill="white" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="20" font-weight="bold" letter-spacing="0em"><tspan x="170.902" y="227.939">PCA</tspan></text>
<rect x="327" y="200" width="118" height="38" fill="#E57373"/>
<rect x="419" y="190" width="20" height="8" fill="#E57373"/>
<rect x="391" y="190" width="20" height="8" fill="#E57373"/>
<rect x="363" y="190" width="20" height="8" fill="#E57373"/>
<rect x="335" y="190" width="20" height="8" fill="#E57373"/>
<text fill="white" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="14" font-weight="bold" letter-spacing="0em"><tspan x="335.886" y="225.058">TruncatedSVD</tspan></text>
<circle cx="266.5" cy="218.5" r="5.5" fill="black"/>
<circle cx="285.5" cy="218.5" r="5.5" fill="black"/>
<circle cx="307.5" cy="218.5" r="5.5" fill="black"/>
<rect x="132" y="130" width="118" height="38" fill="#4DB6AC"/>
<rect x="224" y="120" width="20" height="8" fill="#4DB6AC"/>
<rect x="196" y="120" width="20" height="8" fill="#4DB6AC"/>
<rect x="168" y="120" width="20" height="8" fill="#4DB6AC"/>
<rect x="140" y="120" width="20" height="8" fill="#4DB6AC"/>
<text fill="white" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="20" font-weight="bold" letter-spacing="0em"><tspan x="141.342" y="157.939">HDBSCAN</tspan></text>
<rect x="132" y="90" width="118" height="38" fill="#FFD54F"/>
<rect x="224" y="80" width="20" height="8" fill="#FFD54F"/>
<rect x="196" y="80" width="20" height="8" fill="#FFD54F"/>
<rect x="168" y="80" width="20" height="8" fill="#FFD54F"/>
<rect x="140" y="80" width="20" height="8" fill="#FFD54F"/>
<text fill="white" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="13" font-weight="bold" letter-spacing="0em"><tspan x="138.346" y="114.161">CountVectorizer</tspan></text>
<rect x="132" y="50" width="118" height="38" fill="#90A4AE"/>
<rect x="224" y="40" width="20" height="8" fill="#90A4AE"/>
<rect x="196" y="40" width="20" height="8" fill="#90A4AE"/>
<rect x="168" y="40" width="20" height="8" fill="#90A4AE"/>
<rect x="140" y="40" width="20" height="8" fill="#90A4AE"/>
<text fill="white" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="20" font-weight="bold" letter-spacing="0em"><tspan x="146.938" y="76.9395">c-TF-IDF</tspan></text>
<rect x="132" y="10" width="118" height="38" fill="#3F51B5"/>
<rect x="224" width="20" height="8" fill="#3F51B5"/>
<rect x="196" width="20" height="8" fill="#3F51B5"/>
<rect x="168" width="20" height="8" fill="#3F51B5"/>
<rect x="140" width="20" height="8" fill="#3F51B5"/>
<text fill="white" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="14" font-weight="bold" letter-spacing="0em"><tspan x="161.065" y="27.0576">Optional&#10;</tspan><tspan x="150.271" y="44.0576">Fine-tuning</tspan></text>
</svg>


@@ -0,0 +1,14 @@
<svg width="374" height="57" viewBox="0 0 374 57" fill="none" xmlns="http://www.w3.org/2000/svg">
<rect width="374" height="57" fill="white"/>
<rect x="0.5" y="14.5" width="88" height="42" fill="white" stroke="black"/>
<text fill="#757474" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="10" letter-spacing="0em"><tspan x="30" y="10.9697">SBERT</tspan></text>
<text fill="#757474" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="10" letter-spacing="0em"><tspan x="156" y="10.9697">HDBSCAN</tspan></text>
<text fill="#757474" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="10" letter-spacing="0em"><tspan x="308" y="10.9697">c-TF-IDF</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="9" y="38.7637">Embeddings</tspan></text>
<rect x="135.5" y="14.5" width="91" height="42" fill="white" stroke="black"/>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="157" y="38.7637">Clustering</tspan></text>
<path d="M125.707 33.7071C126.098 33.3166 126.098 32.6834 125.707 32.2929L119.343 25.9289C118.953 25.5384 118.319 25.5384 117.929 25.9289C117.538 26.3195 117.538 26.9526 117.929 27.3431L123.586 33L117.929 38.6569C117.538 39.0474 117.538 39.6805 117.929 40.0711C118.319 40.4616 118.953 40.4616 119.343 40.0711L125.707 33.7071ZM98 34H125V32H98V34Z" fill="black"/>
<rect x="282.5" y="14.5" width="91" height="42" fill="white" stroke="black"/>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="312.404" y="30.7637">Topic &#10;</tspan><tspan x="290.215" y="44.7637">representation</tspan></text>
<path d="M266.707 33.7071C267.098 33.3166 267.098 32.6834 266.707 32.2929L260.343 25.9289C259.953 25.5384 259.319 25.5384 258.929 25.9289C258.538 26.3195 258.538 26.9526 258.929 27.3431L264.586 33L258.929 38.6569C258.538 39.0474 258.538 39.6805 258.929 40.0711C259.319 40.4616 259.953 40.4616 260.343 40.0711L266.707 33.7071ZM239 34H266V32H239V34Z" fill="black"/>
</svg>
