Add BERTopic.

This commit is contained in:
戒酒的李白
2025-08-12 19:01:20 +08:00
parent e2323d579c
commit c5c530775e
256 changed files with 28666 additions and 0 deletions
@@ -0,0 +1,123 @@
<svg width="1173" height="612" viewBox="0 0 1173 612" fill="none" xmlns="http://www.w3.org/2000/svg">
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="35" letter-spacing="0em"><tspan x="505" y="35.894">the</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="35" letter-spacing="0em"><tspan x="567" y="35.894">right</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="35" letter-spacing="0em"><tspan x="655" y="35.894">problem</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="35" letter-spacing="0em"><tspan x="798" y="35.894">is</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="35" letter-spacing="0em"><tspan x="835" y="35.894">difficult</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="35" letter-spacing="0em"><tspan x="372" y="35.894">Solving</tspan></text>
<rect x="362.5" y="0.5" width="604" height="47" stroke="black"/>
<line x1="499.5" y1="2.18557e-08" x2="499.5" y2="47" stroke="black"/>
<line x1="563.5" y1="1" x2="563.5" y2="48" stroke="black"/>
<line x1="649.5" y1="2.18557e-08" x2="649.5" y2="47" stroke="black"/>
<line x1="792.5" y1="2.18557e-08" x2="792.5" y2="47" stroke="black"/>
<line x1="829.5" y1="2.18557e-08" x2="829.5" y2="47" stroke="black"/>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="25" letter-spacing="0em"><tspan x="348" y="152.204">right </tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="25" letter-spacing="0em"><tspan x="302" y="152.204">the </tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="25" letter-spacing="0em"><tspan x="209" y="152.204">Solving </tspan></text>
<rect x="200.5" y="125.779" width="214" height="35" stroke="black" stroke-dasharray="2 2"/>
<line x1="298.307" y1="125" x2="298.307" y2="160.269" stroke="black" stroke-dasharray="2 2"/>
<line x1="343.333" y1="125.75" x2="343.333" y2="161.02" stroke="black" stroke-dasharray="2 2"/>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="25" letter-spacing="0em"><tspan x="460" y="152.204">the </tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="25" letter-spacing="0em"><tspan x="507" y="152.204">right</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="25" letter-spacing="0em"><tspan x="572.956" y="151.924">problem</tspan></text>
<rect x="452.5" y="125.779" width="214" height="35" stroke="black" stroke-dasharray="2 2"/>
<line x1="502.307" y1="125" x2="502.307" y2="160.269" stroke="black" stroke-dasharray="2 2"/>
<line x1="564.333" y1="125.75" x2="564.333" y2="161.02" stroke="black" stroke-dasharray="2 2"/>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="25" letter-spacing="0em"><tspan x="711.405" y="152.204">right </tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="25" letter-spacing="0em"><tspan x="776.956" y="151.924">problem</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="25" letter-spacing="0em"><tspan x="878.405" y="152.204">is</tspan></text>
<rect x="704.5" y="125.779" width="198" height="35" stroke="black" stroke-dasharray="2 2"/>
<line x1="769.409" y1="125" x2="769.409" y2="160.269" stroke="black" stroke-dasharray="2 2"/>
<line x1="872.769" y1="125.75" x2="872.769" y2="161.02" stroke="black" stroke-dasharray="2 2"/>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="25" letter-spacing="0em"><tspan x="948" y="151.924">problem </tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="25" letter-spacing="0em"><tspan x="1051" y="151.924">is</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="25" letter-spacing="0em"><tspan x="1079" y="151.924">difficult</tspan></text>
<rect x="940.5" y="125.779" width="230.609" height="35" stroke="black" stroke-dasharray="2 2"/>
<line x1="1047.86" y1="125" x2="1047.86" y2="160.269" stroke="black" stroke-dasharray="2 2"/>
<line x1="1075.37" y1="125.75" x2="1075.37" y2="161.02" stroke="black" stroke-dasharray="2 2"/>
<path d="M303.025 114.779C302.903 115.318 303.24 115.853 303.779 115.975L312.556 117.965C313.095 118.087 313.631 117.749 313.753 117.211C313.875 116.672 313.537 116.136 312.998 116.014L305.196 114.246L306.965 106.444C307.087 105.905 306.749 105.369 306.211 105.247C305.672 105.125 305.136 105.463 305.014 106.002L303.025 114.779ZM395.467 56.1541L303.467 114.154L304.533 115.846L396.533 57.8459L395.467 56.1541Z" fill="black"/>
<path d="M1064.33 115.944C1064.85 115.763 1065.13 115.193 1064.94 114.671L1061.98 106.172C1061.8 105.65 1061.23 105.375 1060.71 105.556C1060.19 105.738 1059.91 106.308 1060.1 106.83L1062.73 114.385L1055.17 117.016C1054.65 117.198 1054.37 117.768 1054.56 118.289C1054.74 118.811 1055.31 119.086 1055.83 118.905L1064.33 115.944ZM943.565 57.9003L1063.56 115.9L1064.44 114.1L944.435 56.0997L943.565 57.9003Z" fill="black"/>
<path d="M636.52 351.854C636.992 351.567 637.141 350.952 636.854 350.48L632.174 342.793C631.887 342.321 631.272 342.171 630.8 342.458C630.328 342.746 630.179 343.361 630.466 343.833L634.626 350.666L627.793 354.826C627.321 355.113 627.171 355.728 627.458 356.2C627.746 356.672 628.361 356.821 628.833 356.534L636.52 351.854ZM306.764 271.972L635.764 351.972L636.236 350.028L307.236 270.028L306.764 271.972Z" fill="black"/>
<path d="M660.103 349.995C660.652 349.938 661.052 349.446 660.995 348.897L660.069 339.945C660.012 339.396 659.52 338.996 658.971 339.053C658.422 339.11 658.022 339.601 658.079 340.151L658.902 348.108L650.945 348.931C650.396 348.988 649.996 349.48 650.053 350.029C650.11 350.578 650.601 350.978 651.151 350.921L660.103 349.995ZM563.369 271.776L659.369 349.776L660.631 348.224L564.631 270.224L563.369 271.776Z" fill="black"/>
<path d="M701.011 348.85C700.928 349.396 701.304 349.906 701.85 349.989L710.747 351.343C711.293 351.426 711.803 351.05 711.886 350.504C711.969 349.958 711.594 349.448 711.048 349.365L703.139 348.162L704.343 340.253C704.426 339.707 704.05 339.197 703.504 339.114C702.958 339.031 702.448 339.406 702.365 339.952L701.011 348.85ZM807.407 270.195L701.407 348.195L702.593 349.805L808.593 271.805L807.407 270.195Z" fill="black"/>
<path d="M725.149 350.475C724.859 350.945 725.005 351.561 725.475 351.851L733.133 356.578C733.603 356.868 734.219 356.722 734.51 356.252C734.8 355.782 734.654 355.166 734.184 354.876L727.376 350.674L731.578 343.867C731.868 343.397 731.722 342.78 731.252 342.49C730.782 342.2 730.166 342.346 729.876 342.816L725.149 350.475ZM1063.77 270.027L725.77 350.027L726.23 351.973L1064.23 271.973L1063.77 270.027Z" fill="black"/>
<path d="M561.293 115.707C561.683 116.098 562.317 116.098 562.707 115.707L569.071 109.343C569.462 108.953 569.462 108.319 569.071 107.929C568.681 107.538 568.047 107.538 567.657 107.929L562 113.586L556.343 107.929C555.953 107.538 555.319 107.538 554.929 107.929C554.538 108.319 554.538 108.953 554.929 109.343L561.293 115.707ZM561 57L561 115L563 115L563 57L561 57Z" fill="black"/>
<path d="M806.293 115.707C806.683 116.098 807.317 116.098 807.707 115.707L814.071 109.343C814.462 108.953 814.462 108.319 814.071 107.929C813.681 107.538 813.047 107.538 812.657 107.929L807 113.586L801.343 107.929C800.953 107.538 800.319 107.538 799.929 107.929C799.538 108.319 799.538 108.953 799.929 109.343L806.293 115.707ZM806 57L806 115L808 115L808 57L806 57Z" fill="black"/>
<rect x="240.5" y="238.5" width="127" height="20.8389" stroke="black"/>
<rect x="240.5" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="261.732" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="282.964" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="304.197" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="325.429" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="346.661" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<path d="M303.293 230.707C303.683 231.098 304.317 231.098 304.707 230.707L311.071 224.343C311.462 223.953 311.462 223.319 311.071 222.929C310.681 222.538 310.047 222.538 309.657 222.929L304 228.586L298.343 222.929C297.953 222.538 297.319 222.538 296.929 222.929C296.538 223.319 296.538 223.953 296.929 224.343L303.293 230.707ZM303 172L303 230L305 230L305 172L303 172Z" fill="black"/>
<rect x="498.5" y="238.5" width="127" height="20.8389" stroke="black"/>
<rect x="498.5" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="519.732" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="540.964" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="562.197" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="583.429" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="604.661" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="617.5" y="364.5" width="127" height="20.8389" stroke="black"/>
<rect x="617.5" y="364.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="638.732" y="364.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="659.964" y="364.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="681.197" y="364.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="702.429" y="364.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="723.661" y="364.5" width="20.8389" height="20.8389" stroke="black"/>
<path d="M561.293 230.707C561.683 231.098 562.317 231.098 562.707 230.707L569.071 224.343C569.462 223.953 569.462 223.319 569.071 222.929C568.681 222.538 568.047 222.538 567.657 222.929L562 228.586L556.343 222.929C555.953 222.538 555.319 222.538 554.929 222.929C554.538 223.319 554.538 223.953 554.929 224.343L561.293 230.707ZM561 172L561 230L563 230L563 172L561 172Z" fill="black"/>
<path d="M681.293 460.707C681.683 461.098 682.317 461.098 682.707 460.707L689.071 454.343C689.462 453.953 689.462 453.319 689.071 452.929C688.681 452.538 688.047 452.538 687.657 452.929L682 458.586L676.343 452.929C675.953 452.538 675.319 452.538 674.929 452.929C674.538 453.319 674.538 453.953 674.929 454.343L681.293 460.707ZM681 402L681 460L683 460L683 402L681 402Z" fill="black"/>
<rect x="743.5" y="238.5" width="127" height="20.8389" stroke="black"/>
<rect x="743.5" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="764.732" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="785.964" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="807.197" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="828.429" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="849.661" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<path d="M806.293 230.707C806.683 231.098 807.317 231.098 807.707 230.707L814.071 224.343C814.462 223.953 814.462 223.319 814.071 222.929C813.681 222.538 813.047 222.538 812.657 222.929L807 228.586L801.343 222.929C800.953 222.538 800.319 222.538 799.929 222.929C799.538 223.319 799.538 223.953 799.929 224.343L806.293 230.707ZM806 172L806 230L808 230L808 172L806 172Z" fill="black"/>
<rect x="1000.5" y="238.5" width="127" height="20.8389" stroke="black"/>
<rect x="1000.5" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="1021.73" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="1042.96" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="1064.2" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="1085.43" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<rect x="1106.66" y="238.5" width="20.8389" height="20.8389" stroke="black"/>
<path d="M1063.29 230.707C1063.68 231.098 1064.32 231.098 1064.71 230.707L1071.07 224.343C1071.46 223.953 1071.46 223.319 1071.07 222.929C1070.68 222.538 1070.05 222.538 1069.66 222.929L1064 228.586L1058.34 222.929C1057.95 222.538 1057.32 222.538 1056.93 222.929C1056.54 223.319 1056.54 223.953 1056.93 224.343L1063.29 230.707ZM1063 172L1063 230L1065 230L1065 172L1063 172Z" fill="black"/>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="20" letter-spacing="0em"><tspan x="30.5156" y="146.939">create token sets</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="20" letter-spacing="0em"><tspan x="4.08203" y="258.939">topic-token set similarity</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="20" letter-spacing="0em"><tspan x="356.066" y="379.939">document-topic distribution</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="20" letter-spacing="0em"><tspan x="179.961" y="548.939">multi-topic assignment </tspan><tspan x="239.375" y="572.939">on a token level</tspan></text>
<rect x="398" y="528" width="82" height="24" fill="#F1F1F1"/>
<rect x="478" y="528" width="82" height="24" fill="#0A539E"/>
<rect x="560" y="528" width="82" height="24" fill="#85BCDC"/>
<rect x="642" y="528" width="82" height="24" fill="#EAF2FB"/>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Quicksand" font-size="18" font-weight="bold" letter-spacing="0em"><tspan x="489" y="493.25">solving</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="16" font-weight="bold" letter-spacing="0em"><tspan x="412" y="544.852">topic 2</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="16" font-weight="bold" letter-spacing="0em"><tspan x="412" y="520.852">topic 1</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="16" font-weight="bold" letter-spacing="0em"><tspan x="412" y="568.852">topic 3</tspan></text>
<rect x="398" y="576" width="82" height="24" fill="#F1F1F1"/>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="16" font-weight="bold" letter-spacing="0em"><tspan x="412" y="592.852">topic 4</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="18" font-weight="bold" letter-spacing="0em"><tspan x="587" y="493.146">the</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="18" font-weight="bold" letter-spacing="0em"><tspan x="663" y="493.146">right</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="18" font-weight="bold" letter-spacing="0em"><tspan x="731" y="493.146">problem</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="18" font-weight="bold" letter-spacing="0em"><tspan x="840" y="493.146">is</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="18" font-weight="bold" letter-spacing="0em"><tspan x="897" y="493.146">difficult</tspan></text>
<line x1="398" y1="503" x2="970" y2="503" stroke="#BDBDBD" stroke-width="2"/>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="504" y="544.764">0.75</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="590" y="544.764">0.32</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="671" y="544.764">0.16</tspan></text>
<rect x="560" y="576" width="82" height="24" fill="#D0E1F2"/>
<rect x="642" y="576" width="82" height="24" fill="#B7D4EA"/>
<rect x="724" y="576" width="82" height="24" fill="#0A539E"/>
<rect x="806" y="576" width="82" height="24" fill="#85BCDC"/>
<rect x="888" y="576" width="82" height="24" fill="#B7D4EA"/>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="590" y="592.764">0.21</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="671" y="592.764">0.29</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="750" y="592.764">0.81</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="837" y="592.764">0.47</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="917" y="592.764">0.26</tspan></text>
<rect x="806" y="504" width="82" height="24" fill="#EAF2FB"/>
<rect x="888" y="504" width="82" height="24" fill="#85BCDC"/>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="837" y="520.764">0.12</tspan></text>
<text fill="black" xml:space="preserve" style="white-space: pre" font-family="Tahoma" font-size="12" letter-spacing="0em"><tspan x="917" y="520.764">0.33</tspan></text>
</svg>


@@ -0,0 +1,107 @@
BERTopic approaches topic modeling as a clustering task and attempts to cluster semantically similar documents to extract common topics. A disadvantage of this approach is that each document is assigned to a single cluster and therefore to a single topic. In practice, documents may contain a mixture of topics. This can be accounted for by splitting the documents into sentences and feeding those to BERTopic.
Another option is to use a cluster model that can perform soft clustering, like HDBSCAN. As BERTopic focuses on modularity, we may still want to model that mixture of topics even when we are using a hard-clustering model, like k-Means, without needing to split up our documents. This is where `.approximate_distribution` comes in!
<br>
<div class="svg_image">
--8<-- "docs/getting_started/distribution/approximate_distribution.svg"
</div>
<br>
To perform this approximation, each document is split into tokens according to the tokenizer provided in the `CountVectorizer`. Then, a **sliding window** is applied to each document, creating subsets of the document. For example, with a window size of 3 and stride of 1, the document:
> Solving the right problem is difficult.
can be split up into `solving the right`, `the right problem`, `right problem is`, and `problem is difficult`. These are called token sets.
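
The sliding window above can be sketched in a few lines of plain Python. This is only an illustration, not BERTopic's internal code: the helper name `token_sets` is hypothetical, and we assume simple lowercase whitespace tokenization instead of the `CountVectorizer` tokenizer.

```python
def token_sets(tokens, window=3, stride=1):
    """Return the sliding-window token sets for a list of tokens."""
    # A document shorter than the window yields a single token set
    if len(tokens) <= window:
        return [" ".join(tokens)]
    return [
        " ".join(tokens[start:start + window])
        for start in range(0, len(tokens) - window + 1, stride)
    ]

# Assumption: lowercase whitespace tokenization, punctuation already removed
tokens = "solving the right problem is difficult".split()
print(token_sets(tokens, window=3, stride=1))
# ['solving the right', 'the right problem', 'right problem is', 'problem is difficult']
```
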
For each of these token sets, we calculate their c-TF-IDF representation and find out how similar they are to the previously generated topics.
Then, the similarities to the topics for each token set are summed to create a topic distribution for the entire document.
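
As a rough sketch of this summation step, assume we already have one row of topic similarities per token set. The similarity values and the helper name `document_distribution` are made up for illustration, and scaling the result so that it sums to 1 is an assumption of this sketch rather than something stated above.

```python
def document_distribution(similarities):
    """Sum per-token-set topic similarities column-wise and scale them to sum to 1."""
    n_topics = len(similarities[0])
    totals = [sum(row[topic] for row in similarities) for topic in range(n_topics)]
    norm = sum(totals)
    # Guard against a document whose token sets match no topic at all
    return [value / norm if norm else 0.0 for value in totals]

# Hypothetical similarities: one row per token set, one column per topic
similarities = [
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.5],
]
print(document_distribution(similarities))
# [0.5, 0.25, 0.25]
```
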
Although it is often said that documents can contain a mixture of topics, these are often modeled by assigning each word to a single topic.
With this approach, we take into account that there may be multiple topics for a single word.
We can make this multiple-topic word assignment a bit more accurate by then splitting these token sets up into individual tokens and assigning
the topic distributions for each token set to each individual token. That way, we can visualize the extent to which a certain word contributes
to a document's topic distribution.
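
To make the token-level idea concrete, here is a minimal sketch in which every token accumulates the topic similarities of all token sets that contain it. The function name `token_level_distribution` and the similarity values are hypothetical, and padding at the document edges is ignored.

```python
def token_level_distribution(n_tokens, set_similarities, window=3, stride=1):
    """Give each token the summed topic similarities of the token sets covering it."""
    n_topics = len(set_similarities[0])
    token_distr = [[0.0] * n_topics for _ in range(n_tokens)]
    for set_index, row in enumerate(set_similarities):
        start = set_index * stride  # position of the first token in this token set
        for position in range(start, min(start + window, n_tokens)):
            for topic, value in enumerate(row):
                token_distr[position][topic] += value
    return token_distr

# Hypothetical similarities for the 4 token sets of a 6-token document and 2 topics
set_similarities = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 0.5]]
for row in token_level_distribution(6, set_similarities):
    print(row)
```

Tokens in the middle of the document are covered by more token sets than tokens at the edges, so they accumulate more evidence in this simplified version.
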
## **Example**
To calculate our topic distributions, we first need to fit a basic topic model:
```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic().fit(docs)
```
After doing so, we can approximate the topic distributions for our documents:
```python
topic_distr, _ = topic_model.approximate_distribution(docs)
```
The resulting `topic_distr` is an *n* x *m* matrix where *n* is the number of documents and *m* the number of topics. We can then visualize the distribution
of topics in a document:
```python
topic_model.visualize_distribution(topic_distr[1])
```
<iframe src="distribution_viz.html" style="width:1000px; height: 620px; border: 0px;"></iframe>
Although a topic distribution is nice, we may want to see how each token contributes to a specific topic. To do so, we need to first
calculate topic distributions on a token level and then visualize the results:
```python
# Calculate the topic distributions on a token-level
topic_distr, topic_token_distr = topic_model.approximate_distribution(docs, calculate_tokens=True)
# Visualize the token-level distributions
df = topic_model.visualize_approximate_distribution(docs[1], topic_token_distr[1])
df
```
<br><br>
<img src="distribution.png">
<br><br>
!!! tip
    You can also approximate the topic distributions for unseen documents. It will not be as accurate as `.transform` but it is quite fast and can serve you well in a production setting.
!!! note
    To get the stylized dataframe for `.visualize_approximate_distribution` you will need to have Jinja installed. If you do not have this installed, an unstylized dataframe will be returned instead. You can install Jinja via `pip install jinja2`.
## **Parameters**
A few parameters are of particular interest and are discussed below.
### **batch_size**
Creating token sets for each document can result in quite a large list of token sets. The similarity of these token sets with the topics can result in a large matrix that might no longer fit into memory. To circumvent this, we can process batches of documents instead to reduce the memory overhead. The value for `batch_size` indicates the number of documents that will be processed at once:
```python
topic_distr, _ = topic_model.approximate_distribution(docs, batch_size=500)
```
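
The batching idea itself can be sketched independently of BERTopic. The generator name `batches` is hypothetical and only illustrates how a list of documents might be processed `batch_size` documents at a time:

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder documents, used only to show the resulting batch sizes
example_docs = [f"document {i}" for i in range(1200)]
print([len(batch) for batch in batches(example_docs, batch_size=500)])
# [500, 500, 200]
```

Only the token sets and similarity matrix of one batch need to be held in memory at a time, which is why a smaller `batch_size` lowers peak memory usage.
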
### **window**
The number of tokens that are combined into token sets is defined by the `window` parameter. Since we are applying a sliding window, we can change the size of the window. A larger window takes more tokens into account, but setting it too large can result in considering too much information at once. Personally, I like to have this window between 4 and 8:
```python
topic_distr, _ = topic_model.approximate_distribution(docs, window=4)
```
### **stride**
The sliding window that is applied to a document shifts, by default, 1 token to the right each time to create its token sets. As a result, especially with large windows, a single token gets judged several times. We can use the `stride` parameter to increase the number of tokens the window shifts to the right. By increasing
this value, we judge each token less frequently, which often results in a much faster calculation. Combining this parameter with `window` is preferred. For example, if we have a very large dataset, we can set `stride=4` and `window=8` to judge token sets that contain 8 tokens but that are shifted by 4 tokens
each time. As a result, this increases the computational speed quite a bit:
```python
topic_distr, _ = topic_model.approximate_distribution(docs, window=8, stride=4)
```
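
To get a feel for why a larger stride is faster, we can count how many token sets a document produces. This is a back-of-the-envelope sketch with a hypothetical helper `count_token_sets`; it ignores padding and any trailing partial window.

```python
def count_token_sets(n_tokens, window, stride):
    """Number of full sliding windows over a document of `n_tokens` tokens."""
    if n_tokens <= window:
        return 1
    return (n_tokens - window) // stride + 1

# A 100-token document with a window of 8 tokens
print(count_token_sets(100, window=8, stride=1))  # 93
print(count_token_sets(100, window=8, stride=4))  # 24
```

With `stride=4`, roughly a quarter as many token sets need to be compared against the topics.
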
### **use_embedding_model**
By default, we compare the c-TF-IDF representations of the token sets with those of all topics. Due to its bag-of-words representation, this is quite fast. However, you might want to use the selected `embedding_model` instead to do this comparison. Do note that, due to the many token sets, this is often quite a bit slower computationally:
```python
topic_distr, _ = topic_model.approximate_distribution(docs, use_embedding_model=True)
```
Binary file not shown.


File diff suppressed because one or more lines are too long