BERTopic approaches topic modeling as a clustering task: it clusters semantically similar documents and extracts common topics from those clusters. A disadvantage of this approach is that each document is assigned to a single cluster and therefore to a single topic. In practice, documents may contain a mixture of topics. One way to account for this is to split the documents into sentences and feed those to BERTopic.

Another option is to use a cluster model that can perform soft clustering, like HDBSCAN. As BERTopic focuses on modularity, we may still want to model that mixture of topics even when we are using a hard-clustering model, like k-Means, without the need to split up our documents. This is where `.approximate_distribution` comes in!

<br>
<div class="svg_image">
--8<-- "docs/getting_started/distribution/approximate_distribution.svg"
</div>
<br>

To perform this approximation, each document is split into tokens using the tokenizer of the `CountVectorizer`. Then, a **sliding window** is applied to each document, creating subsets of the document. For example, with a window size of 3 and a stride of 1, the document:

> Solving the right problem is difficult.

can be split up into `solving the right`, `the right problem`, `right problem is`, and `problem is difficult`. These are called token sets.
For each of these token sets, we calculate their c-TF-IDF representation and find out how similar they are to the previously generated topics.
Then, the similarities to the topics for each token set are summed to create a topic distribution for the entire document.

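The windowing step described above can be sketched in a few lines of plain Python. This is only an illustration, not BERTopic's implementation: `tokenize` is a hypothetical stand-in for the `CountVectorizer` tokenizer (here it simply lowercases the text and keeps runs of word characters), and `token_sets` is a name chosen for this sketch.

```python
import re


def tokenize(document):
    # Hypothetical stand-in for the CountVectorizer tokenizer:
    # lowercase the text and keep runs of word characters.
    return re.findall(r"\w+", document.lower())


def token_sets(tokens, window=3, stride=1):
    # Slide a fixed-size window over the tokens, moving `stride` tokens per step.
    # Only full windows are kept; a document shorter than the window yields one set.
    if len(tokens) < window:
        return [" ".join(tokens)]
    return [
        " ".join(tokens[start:start + window])
        for start in range(0, len(tokens) - window + 1, stride)
    ]


sets = token_sets(tokenize("Solving the right problem is difficult."), window=3, stride=1)
for token_set in sets:
    print(token_set)
```

Running this prints the four token sets listed above, one per line.
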
Although it is often said that documents can contain a mixture of topics, such mixtures are often modeled by assigning each word to a single topic.
With this approach, we take into account that there may be multiple topics for a single word.

We can make this multiple-topic word assignment a bit more accurate by then splitting these token sets up into individual tokens and assigning
the topic distributions for each token set to each individual token. That way, we can visualize the extent to which a certain word contributes
to a document's topic distribution.

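A toy illustration of how token-set scores could be aggregated, both per document and per token. The similarity scores are made-up numbers for two hypothetical topics, not output from a real model, and normalizing the document-level sums so they add up to 1 is an assumption of this sketch rather than something stated above.

```python
# Token sets for the example sentence (window=3, stride=1), represented by the
# positions of the tokens each set covers.
tokens = ["solving", "the", "right", "problem", "is", "difficult"]
window = 3
set_positions = [
    list(range(start, start + window))
    for start in range(len(tokens) - window + 1)
]

# Made-up similarity of each token set to two hypothetical topics.
set_scores = [
    [0.50, 0.25],
    [0.25, 0.25],
    [0.25, 0.50],
    [0.00, 0.75],
]
n_topics = 2

# Document level: sum the scores of all token sets, then normalize (assumption).
doc_scores = [sum(scores[t] for scores in set_scores) for t in range(n_topics)]
total = sum(doc_scores)
doc_distribution = [score / total for score in doc_scores]

# Token level: every token inherits the scores of each token set it appears in.
token_scores = [[0.0] * n_topics for _ in tokens]
for positions, scores in zip(set_positions, set_scores):
    for position in positions:
        for t in range(n_topics):
            token_scores[position][t] += scores[t]

print(doc_distribution)
for token, scores in zip(tokens, token_scores):
    print(f"{token:>10}: {scores}")
```

Tokens in the middle of the document are covered by more token sets than tokens at the edges, so they accumulate more weight in this simple scheme.
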
## **Example**

To calculate our topic distributions, we first need to fit a basic topic model:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Load the 20 Newsgroups corpus without headers, footers, and quoted replies
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

# Fit a topic model with default settings
topic_model = BERTopic().fit(docs)
```

After doing so, we can approximate the topic distributions for our documents:

```python
topic_distr, _ = topic_model.approximate_distribution(docs)
```

The resulting `topic_distr` is an *n* x *m* matrix where *n* is the number of documents and *m* the number of topics. We can then visualize the distribution
of topics in a document:

```python
topic_model.visualize_distribution(topic_distr[1])
```

<iframe src="distribution_viz.html" style="width:1000px; height: 620px; border: 0px;"></iframe>

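If we prefer numbers over a plot, the rows of such a matrix can also be inspected directly. The helper below is a hypothetical convenience function, not part of BERTopic, and the toy matrix merely stands in for a real `topic_distr`. It assumes that column *i* holds the weight of topic *i*, and it works on any sequence of rows, including a NumPy array.

```python
def top_topics(row, k=3):
    # Pair each column index with its weight and keep the k largest non-zero weights.
    ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
    return [(topic, weight) for topic, weight in ranked[:k] if weight > 0]


# Toy stand-in for `topic_distr`: 2 documents (rows) x 4 topics (columns).
toy_topic_distr = [
    [0.0, 0.7, 0.1, 0.2],
    [0.5, 0.0, 0.5, 0.0],
]

print(top_topics(toy_topic_distr[0], k=2))  # [(1, 0.7), (3, 0.2)]
```
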
Although a topic distribution is nice, we may want to see how each token contributes to a specific topic. To do so, we first need to
calculate topic distributions on a token level and then visualize the results:

```python
# Calculate the topic distributions on a token-level
topic_distr, topic_token_distr = topic_model.approximate_distribution(docs, calculate_tokens=True)

# Visualize the token-level distributions
df = topic_model.visualize_approximate_distribution(docs[1], topic_token_distr[1])
df
```

<br><br>
<img src="distribution.png">
<br><br>

!!! tip
    You can also approximate the topic distributions for unseen documents. It will not be as accurate as `.transform` but it is quite fast and can serve you well in a production setting.

!!! note
    To get the stylized dataframe for `.visualize_approximate_distribution` you will need to have Jinja installed. If you do not have this installed, an unstylized dataframe will be returned instead. You can install Jinja via `pip install jinja2`.

## **Parameters**
A few parameters are of particular interest and are discussed below.

### **batch_size**
Creating token sets for each document can result in quite a large list of token sets. Computing the similarity of these token sets with the topics can in turn produce a matrix that is too large to fit into memory. To circumvent this, we can process the documents in batches to limit memory usage. The value of `batch_size` indicates the number of documents that will be processed at once:

```python
topic_distr, _ = topic_model.approximate_distribution(docs, batch_size=500)
```

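Conceptually, batching means walking through the documents in fixed-size chunks so that only one chunk's token sets need to be held in memory at a time. The generator below is a generic sketch of that idea, not BERTopic's internal code; `batches` and `toy_docs` are names invented for this example.

```python
def batches(documents, batch_size):
    # Yield consecutive slices of at most `batch_size` documents.
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]


toy_docs = [f"document {i}" for i in range(1200)]
sizes = [len(batch) for batch in batches(toy_docs, batch_size=500)]
print(sizes)  # [500, 500, 200]
```
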
### **window**
The number of tokens that are combined into a token set is defined by the `window` parameter. Because we are applying a sliding window, we can change its size. A larger window takes more tokens into account, but setting it too large can result in considering too much information. Personally, I like to have this window between 4 and 8:

```python
topic_distr, _ = topic_model.approximate_distribution(docs, window=4)
```

### **stride**
By default, the sliding window shifts 1 token to the right each time to create its token sets. As a result, especially with large windows, a single token gets judged several times. We can use the `stride` parameter to increase the number of tokens the window shifts to the right. By increasing
this value, we judge each token less frequently, which often results in a much faster calculation. Combining this parameter with `window` is preferred. For example, if we have a very large dataset, we can set `stride=4` and `window=8` to judge token sets that contain 8 tokens but that are shifted by 4 tokens
each time. As a result, this increases the computational speed quite a bit:

```python
topic_distr, _ = topic_model.approximate_distribution(docs, window=8, stride=4)
```

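To get a feel for the saving, we can count how many token sets a single document produces under different settings. The formula assumes that only full windows are kept and that no padding is applied, which is an assumption of this sketch; `n_token_sets` is a hypothetical helper and the document length is made up.

```python
def n_token_sets(n_tokens, window, stride):
    # Number of full windows that fit when the window moves `stride` tokens per step.
    if n_tokens < window:
        return 1
    return (n_tokens - window) // stride + 1


n_tokens = 1000  # hypothetical document length in tokens
print(n_token_sets(n_tokens, window=8, stride=1))  # 993
print(n_token_sets(n_tokens, window=8, stride=4))  # 249
```

With a stride of 4, roughly a quarter as many token sets have to be compared against the topics.
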
### **use_embedding_model**
By default, we compare the c-TF-IDF representations of the token sets with those of all topics. Due to its bag-of-words representation, this is quite fast. However, you might want to use the selected `embedding_model` instead to do this comparison. Do note that, due to the many token sets, this is often quite a bit slower:

```python
topic_distr, _ = topic_model.approximate_distribution(docs, use_embedding_model=True)
```