Add BERTopic.

2025-08-12 19:01:20 +08:00
parent e2323d579c
commit c5c530775e
256 changed files with 28666 additions and 0 deletions
@@ -0,0 +1,88 @@
+In BERTopic, you have several options to nudge the creation of topics toward certain pre-specified topics. Here, we will be looking at semi-supervised topic modeling with BERTopic.
+
+Semi-supervised modeling allows us to steer the dimensionality reduction of the embeddings into a space that closely follows any labels we might already have.
+
+<br>
+<div class="svg_image">
+--8<-- "docs/getting_started/semisupervised/semisupervised.svg"
+</div>
+<br>
+
+In other words, we use a semi-supervised UMAP instance to reduce the dimensionality of embeddings before clustering the documents
+with HDBSCAN.
+
+First, let us prepare the data needed for our topic model:
+
+```python
+from bertopic import BERTopic
+from sklearn.datasets import fetch_20newsgroups
+
+data = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))
+docs = data["data"]
+categories = data["target"]
+category_names = data["target_names"]
+```
+
+We are using the popular 20 Newsgroups dataset, which contains roughly 18,000 newsgroup posts, each
+assigned to one of 20 categories. Using this dataset, we can extract a topic model while
+taking the underlying categories into account. These categories are stored in the variable `categories`.
+
+Each document can be put into one of the following categories:
+
+```python
+>>> category_names
+
+['alt.atheism',
+ 'comp.graphics',
+ 'comp.os.ms-windows.misc',
+ 'comp.sys.ibm.pc.hardware',
+ 'comp.sys.mac.hardware',
+ 'comp.windows.x',
+ 'misc.forsale',
+ 'rec.autos',
+ 'rec.motorcycles',
+ 'rec.sport.baseball',
+ 'rec.sport.hockey',
+ 'sci.crypt',
+ 'sci.electronics',
+ 'sci.med',
+ 'sci.space',
+ 'soc.religion.christian',
+ 'talk.politics.guns',
+ 'talk.politics.mideast',
+ 'talk.politics.misc',
+ 'talk.religion.misc']
+```
+
+To perform this semi-supervised approach, we take some pre-defined labels and simply pass them to the `y` parameter when fitting BERTopic. These labels can correspond to pre-defined topics or simply group documents that you feel belong together regardless of their content. BERTopic will nudge the creation of topics toward these categories
+using the pre-defined labels.
+
+When labels are available for every document, we simply pass all categories:
+
+```python
+topic_model = BERTopic(verbose=True).fit(docs, y=categories)
+```
+
+The topic model will be much more attuned to the categories that were defined previously. However, this does not mean that only topics for these categories will be found. BERTopic is likely to find more specific topics within the categories you have already defined. This allows you to discover previously unknown topics!
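One way to check whether a category was split into finer-grained topics is to cross-tabulate topic assignments against the original labels. The following is a minimal sketch, not part of the committed document: the helper `topics_per_category` and the toy values are hypothetical, and in practice the topic list would presumably come from the fitted model (for example its `topics_` attribute).

```python
from collections import Counter, defaultdict


def topics_per_category(topics, categories, category_names):
    """Count how many documents of each category landed in each topic."""
    table = defaultdict(Counter)
    for topic, category in zip(topics, categories):
        table[category_names[category]][topic] += 1
    return table


# Toy values for illustration only (hypothetical, not real model output).
toy_names = ["rec.autos", "sci.space"]
toy_categories = [0, 0, 0, 1, 1, 1]
toy_topics = [3, 3, 7, 5, 5, -1]  # -1 marks documents treated as outliers

table = topics_per_category(toy_topics, toy_categories, toy_names)
for name, counts in table.items():
    print(name, dict(counts))
```

A category whose documents are spread over several topic ids, as `rec.autos` is in the toy data, suggests the model found sub-topics inside that label.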
+
+## **Partial labels**
+
+At times, you might only have labels for a subset of documents. Fortunately, we can still use those labels to at least nudge the documents for which those labels exist. The documents for which we do not have labels are assigned a -1. For this example, imagine we only have the labels of categories that are related to computers and we want to create a topic model using semi-supervised modeling:
+
+```python
+# Keep labels only for the computer-related categories
+labels_to_add = ['comp.graphics', 'comp.os.ms-windows.misc',
+                 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
+                 'comp.windows.x']
+indices = [category_names.index(label) for label in labels_to_add]
+# Documents from any other category get -1, meaning "no label"
+y = [label if label in indices else -1 for label in categories]
+```
+
+The `y` variable contains many -1 values since we do not know all the categories.
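To make the masking step concrete, here is a small self-contained sketch of the same logic on toy data. The names `toy_names`, `toy_categories`, and `toy_y` are hypothetical stand-ins for `category_names`, `categories`, and `y`; the values are invented for illustration.

```python
from collections import Counter

# Toy stand-ins for `category_names` and `categories` (hypothetical values).
toy_names = ["alt.atheism", "comp.graphics", "rec.autos", "comp.windows.x"]
toy_categories = [0, 1, 2, 3, 1, 0]

# Labels we pretend to know; every other document is treated as unlabeled.
labels_to_add = ["comp.graphics", "comp.windows.x"]
keep = {toy_names.index(label) for label in labels_to_add}
toy_y = [label if label in keep else -1 for label in toy_categories]

print(toy_y)           # one entry per document, -1 where the label is unknown
print(Counter(toy_y))  # how many documents are labeled vs. unlabeled
```

Counting the entries this way is a quick sanity check that the number of labeled documents matches expectations before fitting the model.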
+
+Next, we use those newly constructed labels to fit BERTopic again in a semi-supervised manner:
+
+```python
+topic_model = BERTopic(verbose=True).fit(docs, y=y)
+```
+
+And that is it! By defining certain classes for our documents, we can steer the topic modeling towards modeling the pre-defined categories.
@@ -0,0 +1,21 @@
+<svg width="534" height="135" viewBox="0 0 534 135" fill="none" xmlns="http://www.w3.org/2000/svg">
+<rect width="534" height="57" fill="white"/>
+<rect x="0.5" y="14.5" width="88" height="42" fill="white" stroke="black"/>
+<text fill="#757474" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="10" letter-spacing="0em"><tspan x="30" y="10.9697">SBERT</tspan></text>
+<text fill="#757474" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="10" letter-spacing="0em"><tspan x="183" y="10.9697">UMAP</tspan></text>
+<text fill="#757474" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="10" letter-spacing="0em"><tspan x="313" y="10.9697">HDBSCAN</tspan></text>
+<text fill="#757474" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="10" letter-spacing="0em"><tspan x="468" y="10.9697">c-TF-IDF</tspan></text>
+<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="9" y="38.7637">Embeddings</tspan></text>
+<rect x="142.5" y="14.5" width="105" height="42" fill="white" stroke="black"/>
+<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="156.094" y="33.7637">Dimensionality &#10;</tspan><tspan x="171.762" y="47.7637">reduction</tspan></text>
+<rect x="162.5" y="104.5" width="62" height="30" fill="white" stroke="black"/>
+<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="176.336" y="123.764">Labels</tspan></text>
+<path d="M126.707 33.7071C127.098 33.3166 127.098 32.6834 126.707 32.2929L120.343 25.9289C119.953 25.5384 119.319 25.5384 118.929 25.9289C118.538 26.3195 118.538 26.9526 118.929 27.3431L124.586 33L118.929 38.6569C118.538 39.0474 118.538 39.6805 118.929 40.0711C119.319 40.4616 119.953 40.4616 120.343 40.0711L126.707 33.7071ZM99 34H126V32H99V34Z" fill="black"/>
+<rect x="295.5" y="14.5" width="91" height="42" fill="white" stroke="black"/>
+<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="317" y="38.7637">Clustering</tspan></text>
+<path d="M285.707 33.7071C286.098 33.3166 286.098 32.6834 285.707 32.2929L279.343 25.9289C278.953 25.5384 278.319 25.5384 277.929 25.9289C277.538 26.3195 277.538 26.9526 277.929 27.3431L283.586 33L277.929 38.6569C277.538 39.0474 277.538 39.6805 277.929 40.0711C278.319 40.4616 278.953 40.4616 279.343 40.0711L285.707 33.7071ZM258 34H285V32H258V34Z" fill="black"/>
+<rect x="442.5" y="14.5" width="91" height="42" fill="white" stroke="black"/>
+<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="472.404" y="30.7637">Topic &#10;</tspan><tspan x="450.215" y="44.7637">representation</tspan></text>
+<path d="M426.707 33.7071C427.098 33.3166 427.098 32.6834 426.707 32.2929L420.343 25.9289C419.953 25.5384 419.319 25.5384 418.929 25.9289C418.538 26.3195 418.538 26.9526 418.929 27.3431L424.586 33L418.929 38.6569C418.538 39.0474 418.538 39.6805 418.929 40.0711C419.319 40.4616 419.953 40.4616 420.343 40.0711L426.707 33.7071ZM399 34H426V32H399V34Z" fill="black"/>
+<path d="M194.707 66.2929C194.317 65.9024 193.683 65.9024 193.293 66.2929L186.929 72.6569C186.538 73.0474 186.538 73.6805 186.929 74.0711C187.319 74.4616 187.953 74.4616 188.343 74.0711L194 68.4142L199.657 74.0711C200.047 74.4616 200.681 74.4616 201.071 74.0711C201.462 73.6805 201.462 73.0474 201.071 72.6569L194.707 66.2929ZM195 94L195 67L193 67L193 94L195 94Z" fill="black"/>
+</svg>