Add BERTopic.

2025-08-12 19:01:20 +08:00
parent e2323d579c
commit c5c530775e
256 changed files with 28666 additions and 0 deletions
@@ -0,0 +1,82 @@
+# c-TF-IDF
+
+To get an accurate representation of the topics from the bag-of-words matrix, BERTopic adjusts TF-IDF to work at the cluster/categorical/topic level instead of the document level. This adjusted TF-IDF representation is called **c-TF-IDF** and takes into account what makes the documents in one cluster different from the documents in another cluster:
+
+<img class="w-6/12" src="../../algorithm/c-TF-IDF.svg">
+
+<br>
+Each cluster is converted to a single document instead of a set of documents. Then, we extract the frequency of word `x` in class `c`, where `c` refers to the cluster we created before. This results in our class-based `tf` representation. This representation is L1-normalized to account for the differences in topic sizes.
+  <br><br>
+Then, we take the logarithm of one plus the average number of words per class `A` divided by the frequency of word `x` across all classes. Adding one inside the logarithm forces the values to be positive. This results in our class-based `idf` representation. As with the classic TF-IDF, we then multiply `tf` by `idf` to get the importance score per word in each class. In other words, the classical TF-IDF procedure is **not** used here; instead, a modified version of the algorithm is applied at the class level, which yields a better representation of each topic.
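+
+Written out, and assuming the notation `tf_{x,c}` for the L1-normalized frequency of word `x` in class `c` and `f_x` for the frequency of word `x` across all classes (symbols chosen here for illustration), the score described above is:
+
+```latex
+W_{x,c} = tf_{x,c} \times \log\left(1 + \frac{A}{f_x}\right)
+```
+
+As a small worked case with hypothetical counts and the natural logarithm: if the classes contain on average `A = 100` words and word `x` occurs `f_x = 25` times overall, its `idf` is `log(1 + 4) = log(5) ≈ 1.61`, whereas a word occurring 400 times overall gets `log(1.25) ≈ 0.22`.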
+
+Since the topic representation is somewhat independent of the clustering step, we can change what the c-TF-IDF representation looks like. This can take the form of parameter tuning, different weighting schemes, or a diversity metric applied on top of it. This allows for some modularity concerning the weighting scheme:
+
+<figure markdown>
+  ![Image title](ctfidf.svg)
+  <figcaption></figcaption>
+</figure>
+
+
+This class-based TF-IDF representation is enabled by default in BERTopic. However, we can explicitly pass it to BERTopic through the `ctfidf_model` parameter, which allows for parameter tuning and customization of the topic extraction technique:
+
+```python
+from bertopic import BERTopic
+from bertopic.vectorizers import ClassTfidfTransformer
+
+ctfidf_model = ClassTfidfTransformer()
+topic_model = BERTopic(ctfidf_model=ctfidf_model)
+```
+
+## **Parameters**
+There are two parameters worth exploring in the `ClassTfidfTransformer`, namely `bm25_weighting` and `reduce_frequent_words`.
+
+
+### bm25_weighting
+
+The `bm25_weighting` parameter is a boolean that indicates whether a class-based BM-25 weighting measure is used instead of the default method defined in the formula at the beginning of this page.
+
+Instead of using the following weighting scheme:
+
+<img class="w-6/12" src="idf.svg">
+
+
+the class-based BM-25 weighting is used:
+
+<img class="w-6/12" src="bm25.svg">
+
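+As a sketch of the difference between the two schemes, with `A` the average number of words per class and `f_x` the frequency of word `x` across all classes, and assuming the class-based BM-25 variant follows the usual BM-25 `idf` form (the image above remains the authoritative definition), the two weightings can be written as:
+
+```latex
+\text{default:} \quad \log\left(1 + \frac{A}{f_x}\right)
+\qquad
+\text{BM-25:} \quad \log\left(1 + \frac{A - f_x + 0.5}{f_x + 0.5}\right)
+```
+
+Under this assumption, the BM-25 term simplifies to `log((A + 1) / (f_x + 0.5))`. It is smaller than the default term for every word that occurs at least once and becomes non-positive once `f_x >= A + 0.5`, so very frequent words are penalized more strongly.
+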
+On smaller datasets, this variant can be more robust to stop words that appear in your data. It can be enabled as follows:
+
+```python
+from bertopic import BERTopic
+from bertopic.vectorizers import ClassTfidfTransformer
+
+ctfidf_model = ClassTfidfTransformer(bm25_weighting=True)
+topic_model = BERTopic(ctfidf_model=ctfidf_model)
+```
+
+
+### reduce_frequent_words
+
+Some words appear quite often in every topic but are generally not considered stop words, so they are not found in the `CountVectorizer(stop_words="english")` list. To further reduce the impact of these frequent words, we can use `reduce_frequent_words` to take the square root of the term frequency after normalizing the frequency matrix.
+
+Instead of the default term frequency:
+
+<img class="w-8/12" src="tf.svg">
+
+we take the square root of the term frequency after normalizing the frequency matrix:
+
+<img class="w-8/12" src="tf_reduced.svg">
+
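+To illustrate with hypothetical values: two words with normalized term frequencies of `0.64` and `0.01` differ by a factor of 64 before the square root and by a factor of 8 after it:
+
+```latex
+\sqrt{0.64} = 0.8, \qquad \sqrt{0.01} = 0.1, \qquad \frac{0.8}{0.1} = 8
+```
+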
+Although seemingly a small change, it can have quite a large effect on the number of stop words in the resulting topic representations. It can be enabled as follows:
+
+
+```python
+from bertopic import BERTopic
+from bertopic.vectorizers import ClassTfidfTransformer
+
+ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
+topic_model = BERTopic(ctfidf_model=ctfidf_model)
+```
+
+!!! tip
+	Both parameters can be used simultaneously: `ClassTfidfTransformer(bm25_weighting=True, reduce_frequent_words=True)`
@@ -0,0 +1,53 @@
+<svg width="445" height="248" viewBox="0 0 445 248" fill="none" xmlns="http://www.w3.org/2000/svg">
+<rect x="132" y="210" width="118" height="38" fill="#64B5F6"/>
+<rect x="224" y="200" width="20" height="8" fill="#64B5F6"/>
+<rect x="196" y="200" width="20" height="8" fill="#64B5F6"/>
+<rect x="168" y="200" width="20" height="8" fill="#64B5F6"/>
+<rect x="140" y="200" width="20" height="8" fill="#64B5F6"/>
+<text fill="white" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="20" font-weight="bold" letter-spacing="0em"><tspan x="158.256" y="237.939">SBERT</tspan></text>
+<rect x="132" y="170" width="118" height="38" fill="#E57373"/>
+<rect x="224" y="160" width="20" height="8" fill="#E57373"/>
+<rect x="196" y="160" width="20" height="8" fill="#E57373"/>
+<rect x="168" y="160" width="20" height="8" fill="#E57373"/>
+<rect x="140" y="160" width="20" height="8" fill="#E57373"/>
+<text fill="white" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="20" font-weight="bold" letter-spacing="0em"><tspan x="161.254" y="197.939">UMAP</tspan></text>
+<rect x="132" y="130" width="118" height="38" fill="#4DB6AC"/>
+<rect x="224" y="120" width="20" height="8" fill="#4DB6AC"/>
+<rect x="196" y="120" width="20" height="8" fill="#4DB6AC"/>
+<rect x="168" y="120" width="20" height="8" fill="#4DB6AC"/>
+<rect x="140" y="120" width="20" height="8" fill="#4DB6AC"/>
+<text fill="white" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="20" font-weight="bold" letter-spacing="0em"><tspan x="141.342" y="157.939">HDBSCAN</tspan></text>
+<rect y="50" width="118" height="38" fill="#90A4AE"/>
+<rect x="92" y="40" width="20" height="8" fill="#90A4AE"/>
+<rect x="64" y="40" width="20" height="8" fill="#90A4AE"/>
+<rect x="36" y="40" width="20" height="8" fill="#90A4AE"/>
+<rect x="8" y="40" width="20" height="8" fill="#90A4AE"/>
+<text fill="white" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="13" font-weight="bold" letter-spacing="0em"><tspan x="23.1357" y="66.1606">c-TF-IDF +&#10;</tspan><tspan x="40.4521" y="82.1606">BM25</tspan></text>
+<rect x="132" y="90" width="118" height="38" fill="#FFD54F"/>
+<rect x="224" y="80" width="20" height="8" fill="#FFD54F"/>
+<rect x="196" y="80" width="20" height="8" fill="#FFD54F"/>
+<rect x="168" y="80" width="20" height="8" fill="#FFD54F"/>
+<rect x="140" y="80" width="20" height="8" fill="#FFD54F"/>
+<text fill="white" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="13" font-weight="bold" letter-spacing="0em"><tspan x="138.346" y="113.161">CountVectorizer</tspan></text>
+<rect x="132" y="50" width="118" height="38" fill="#90A4AE"/>
+<rect x="224" y="40" width="20" height="8" fill="#90A4AE"/>
+<rect x="196" y="40" width="20" height="8" fill="#90A4AE"/>
+<rect x="168" y="40" width="20" height="8" fill="#90A4AE"/>
+<rect x="140" y="40" width="20" height="8" fill="#90A4AE"/>
+<text fill="white" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="20" font-weight="bold" letter-spacing="0em"><tspan x="146.938" y="77.9395">c-TF-IDF</tspan></text>
+<rect x="132" y="10" width="118" height="38" fill="#3F51B5"/>
+<rect x="224" width="20" height="8" fill="#3F51B5"/>
+<rect x="196" width="20" height="8" fill="#3F51B5"/>
+<rect x="168" width="20" height="8" fill="#3F51B5"/>
+<rect x="140" width="20" height="8" fill="#3F51B5"/>
+<text fill="white" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="14" font-weight="bold" letter-spacing="0em"><tspan x="161.065" y="26.0576">Optional&#10;</tspan><tspan x="150.271" y="43.0576">Fine-tuning</tspan></text>
+<rect x="327" y="50" width="118" height="38" fill="#90A4AE"/>
+<rect x="419" y="40" width="20" height="8" fill="#90A4AE"/>
+<rect x="391" y="40" width="20" height="8" fill="#90A4AE"/>
+<rect x="363" y="40" width="20" height="8" fill="#90A4AE"/>
+<rect x="335" y="40" width="20" height="8" fill="#90A4AE"/>
+<text fill="white" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="14" font-weight="bold" letter-spacing="0em"><tspan x="345.326" y="67.0576">c-TF-IDF + &#10;</tspan><tspan x="336.453" y="84.0576">Normalization</tspan></text>
+<circle cx="266.5" cy="68.5" r="5.5" fill="black"/>
+<circle cx="285.5" cy="68.5" r="5.5" fill="black"/>
+<circle cx="307.5" cy="68.5" r="5.5" fill="black"/>
+</svg>