Add BERTopic.

2025-08-12 19:01:20 +08:00
parent e2323d579c
commit c5c530775e
256 changed files with 28666 additions and 0 deletions
@@ -0,0 +1,144 @@
+Saving, loading, and sharing a BERTopic model can be done in several ways. The `.safetensors` format is generally recommended because it is a small, safe, and fast way to save your BERTopic model. Other formats, such as `.pickle` and PyTorch `.bin`, are also possible.
+
+## **Saving**
+
+There are three methods for saving BERTopic:
+
+1. A light model with `.safetensors` and config files
+2. A light model with pytorch `.bin` and config files
+3. A full model with `.pickle`
+
+
+!!! Tip "Tip"
+    It is advised to use methods 1 or 2 for saving as they generate very small models. Method 1 (`safetensors`) in particular
+    is a relatively safe format compared to the other methods.
+
+The methods are used as follows:
+
+```python
+from bertopic import BERTopic
+
+# my_docs is your own list of documents (strings)
+topic_model = BERTopic().fit(my_docs)
+
+# Method 1 - safetensors
+embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
+topic_model.save("path/to/my/model_dir", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)
+
+# Method 2 - pytorch
+embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
+topic_model.save("path/to/my/model_dir", serialization="pytorch", save_ctfidf=True, save_embedding_model=embedding_model)
+
+# Method 3 - pickle
+topic_model.save("my_model", serialization="pickle")
+```
+
+!!! Warning "Warning"
+    When saving the model, make sure to also keep track of the versions of dependencies and Python used.
+    Loading and saving the model should be done using the same dependencies and Python. Moreover, models
+    saved in one version of BERTopic are not guaranteed to load in other versions.
+
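One way to act on this warning is to write the Python version and the versions of key packages to a small JSON file whenever you save a model. The sketch below is only an illustration and not part of BERTopic: the helper name `record_environment`, the output path, and the package list are all assumptions you would adapt to your own setup.

```python
import json
import platform
from importlib import metadata
from pathlib import Path

# Illustrative list of distributions to track; adjust to what your model actually uses
DEFAULT_PACKAGES = ("bertopic", "sentence-transformers", "umap-learn", "hdbscan", "scikit-learn", "numpy")


def record_environment(out_path, packages=DEFAULT_PACKAGES):
    """Write the Python version and installed package versions to a JSON file."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of silently skipping them
            versions[name] = None

    info = {"python": platform.python_version(), "packages": versions}

    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(info, indent=2))
    return info
```

Calling something like `record_environment("path/to/my/model_environment.json")` right after `topic_model.save(...)` gives you a record to compare against the environment in which the model is later loaded.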
+
+### **Pickle Drawbacks**
+Saving the model with `pickle` allows for saving the entire topic model, including dimensionality reduction and clustering algorithms, but has several drawbacks:
+
+* Arbitrary code can be run from `.pickle` files
+* The resulting model is rather large (often > 500MB) since all sub-models need to be saved
+* Explicit and specific version control is needed, as pickled models typically only load if the environment is exactly the same
+
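The first drawback can be demonstrated with the standard library alone. In the sketch below, a hypothetical `Payload` class (not part of BERTopic) uses `__reduce__` to tell pickle which callable to invoke while loading. The expression evaluated here is harmless, but a malicious file could call anything that is importable on your machine.

```python
import pickle


class Payload:
    """Hypothetical object whose pickled form is an instruction to call eval()."""

    def __reduce__(self):
        # Instead of object state, pickle stores "call eval('21 * 2')"
        return (eval, ("21 * 2",))


data = pickle.dumps(Payload())

# Unpickling runs the stored call; the result is whatever eval() returned,
# not a Payload instance
result = pickle.loads(data)
print(type(result).__name__, result)
```

This is why `.pickle` files should only be loaded when they come from a source you trust.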
+
+### **Safetensors and Pytorch Advantages**
+Saving the topic model with `.safetensors` or `pytorch` has a number of advantages:
+
+* `.safetensors` is a relatively **safe format**
+* The resulting model can be **very small** (often < 20MB) since no sub-models need to be saved
+* Although version control is important, there is a bit more **flexibility** with respect to specific versions of packages
+* More easily used in **production**
+* **Share** models with the HuggingFace Hub
+
+<br><br>
+<img src="serialization.png">
+<br><br>
+
+The image above compares the sizes of a model trained on 100,000 documents when saved with `safetensors`, `pytorch`, and `pickle`. The difference in size is mostly explained by the efficient saving procedure and by the fact that the clustering and dimensionality reduction models are not saved with safetensors/pytorch, since inference can be done based on the topic embeddings.
+
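To run a similar comparison on your own models, you can measure what each serialization writes to disk. The helper below is a minimal sketch using only the standard library. The name `size_on_disk_mb` is hypothetical; it handles both a single file (as written by `pickle`) and a directory (as written by `safetensors` or `pytorch`).

```python
from pathlib import Path


def size_on_disk_mb(path):
    """Return the size of a file, or the total size of all files under a directory, in MB."""
    path = Path(path)
    if path.is_file():
        total_bytes = path.stat().st_size
    else:
        # Sum every regular file below the directory, including nested ones
        total_bytes = sum(f.stat().st_size for f in path.rglob("*") if f.is_file())
    return total_bytes / 1e6
```

For example, `size_on_disk_mb("path/to/my/model_dir")` and `size_on_disk_mb("my_model")` can be compared after saving the same model with different serializations.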
+
+## **HuggingFace Hub**
+
+When you have created a BERTopic model, you can easily share it with others through the HuggingFace Hub. First, you need to log in to your HuggingFace account, which you can do in a number of ways:
+
+* Log in to your Hugging Face account with the command below
+
+```bash
+huggingface-cli login
+
+# or using an environment variable
+huggingface-cli login --token $HUGGINGFACE_TOKEN
+```
+
+* Alternatively, you can log in programmatically using `login()` in a notebook or a script
+
+```python
+from huggingface_hub import login
+login()
+```
+
+* Or you can pass a token directly through the `token` parameter
+
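For non-interactive environments such as scheduled jobs, one option is to read the token from an environment variable and hand it to `login`. This is a minimal sketch, assuming the token is stored in `HUGGINGFACE_TOKEN` as in the shell example above; the helper names `read_hf_token` and `login_from_env` are hypothetical.

```python
import os


def read_hf_token(env=None, var_name="HUGGINGFACE_TOKEN"):
    """Return the token stored in the given environment mapping, or raise if it is missing."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set {var_name} before logging in")
    return token


def login_from_env():
    """Log in to the Hugging Face Hub without an interactive prompt."""
    # Imported lazily so read_hf_token can be used without huggingface_hub installed
    from huggingface_hub import login

    login(token=read_hf_token())


# Usage (requires huggingface_hub and a valid token in the environment):
# login_from_env()
```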
+When you have logged in to your HuggingFace account, you can save and upload the model as follows:
+
+```python
+from bertopic import BERTopic
+
+# Train model
+topic_model = BERTopic().fit(my_docs)
+
+# Push to HuggingFace Hub
+topic_model.push_to_hf_hub(
+    repo_id="MaartenGr/BERTopic_ArXiv",
+    save_ctfidf=True
+)
+
+# Load from HuggingFace
+loaded_model = BERTopic.load("MaartenGr/BERTopic_ArXiv")
+```
+
+### **Parameters**
+There are a number of parameters that may be worthwhile to know:
+
+* `private`
+    * Whether to create a private repository
+* `serialization`
+    * The type of serialization. Either `safetensors` or `pytorch`. Make sure to run `pip install safetensors` for safetensors.
+* `save_embedding_model`
+    * A pointer towards a HuggingFace model to be loaded in with SentenceTransformers. E.g., `sentence-transformers/all-MiniLM-L6-v2`
+* `save_ctfidf`
+    * Whether to save c-TF-IDF information
+
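As a sketch of how these parameters fit together, the snippet below collects them into keyword arguments for `push_to_hf_hub` and rejects serialization values other than the two listed above. The helper `build_push_kwargs` and the repository name `my-username/my-bertopic-model` are hypothetical, and the defaults chosen here are illustrative rather than BERTopic's own.

```python
ALLOWED_SERIALIZATIONS = ("safetensors", "pytorch")


def build_push_kwargs(
    repo_id,
    private=False,
    serialization="safetensors",
    save_embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    save_ctfidf=True,
):
    """Validate the options described above and return them as keyword arguments."""
    if serialization not in ALLOWED_SERIALIZATIONS:
        raise ValueError(
            f"serialization must be one of {ALLOWED_SERIALIZATIONS}, got {serialization!r}"
        )
    return {
        "repo_id": repo_id,
        "private": private,
        "serialization": serialization,
        "save_embedding_model": save_embedding_model,
        "save_ctfidf": save_ctfidf,
    }


push_kwargs = build_push_kwargs("my-username/my-bertopic-model", private=True)

# With a trained model and an authenticated session:
# topic_model.push_to_hf_hub(**push_kwargs)
```

Validating up front means a typo such as `serialization="pickle"` fails immediately rather than partway through an upload.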
+
+## **Loading**
+
+To load a model:
+
+```python
+from bertopic import BERTopic
+
+# Load from directory
+loaded_model = BERTopic.load("path/to/my/model_dir")
+
+# Load from file
+loaded_model = BERTopic.load("my_model")
+
+# Load from HuggingFace
+loaded_model = BERTopic.load("MaartenGr/BERTopic_Wikipedia")
+```
+
+The embedding model cannot always be saved with a non-pickle method, for example when you are using OpenAI embeddings. In that case, define the embedding model yourself and pass it in when loading:
+
+
+```python
+# Define embedding model
+import openai
+from bertopic.backend import OpenAIBackend
+
+client = openai.OpenAI(api_key="sk-...")
+embedding_model = OpenAIBackend(client, "text-embedding-ada-002")
+
+# Load model and add embedding model
+loaded_model = BERTopic.load("path/to/my/model_dir", embedding_model=embedding_model)
+```