Add BERTopic.
This commit is contained in:
@@ -0,0 +1,360 @@
|
||||
The topics that you create can be hierarchically reduced. In order to understand the potential hierarchical
|
||||
structure of the topics, we can use `scipy.cluster.hierarchy` to create clusters and visualize how
|
||||
they relate to one another. This might help to select an appropriate `nr_topics` when reducing the number
|
||||
of topics that you have created. To visualize this hierarchy, run the following:
|
||||
|
||||
```python
|
||||
topic_model.visualize_hierarchy()
|
||||
```
|
||||
|
||||
<iframe src="hierarchy.html" style="width:1000px; height: 680px; border: 0px;""></iframe>
|
||||
|
||||
!!! note
|
||||
Do note that this is not the actual procedure of `.reduce_topics()` when `nr_topics` is set to
|
||||
auto since HDBSCAN is used to automatically extract topics. The visualization above closely resembles
|
||||
the actual procedure of `.reduce_topics()` when any number of `nr_topics` is selected.
|
||||
|
||||
### **Hierarchical labels**
|
||||
|
||||
Although visualizing this hierarchy gives us information about the structure, it would be helpful to see what happens
|
||||
to the topic representations when merging topics. To do so, we first need to calculate the representations of the
|
||||
hierarchical topics:
|
||||
|
||||
|
||||
First, we train a basic BERTopic model:
|
||||
|
||||
```python
|
||||
from bertopic import BERTopic
|
||||
from sklearn.datasets import fetch_20newsgroups
|
||||
|
||||
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))["data"]
|
||||
topic_model = BERTopic(verbose=True)
|
||||
topics, probs = topic_model.fit_transform(docs)
|
||||
hierarchical_topics = topic_model.hierarchical_topics(docs)
|
||||
```
|
||||
|
||||
To visualize these results, we simply need to pass the resulting `hierarchical_topics` to our `.visualize_hierarchy` function:
|
||||
|
||||
```python
|
||||
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
|
||||
```
|
||||
<iframe src="hierarchical_topics.html" style="width:1000px; height: 2150px; border: 0px;""></iframe>
|
||||
|
||||
|
||||
If you **hover** over the black circles, you will see the topic representation at that level of the hierarchy. These representations
|
||||
help you understand the effect of merging certain topics. Some might be logical to merge whilst others might not. Moreover,
|
||||
we can now see which sub-topics can be found within certain larger themes.
|
||||
|
||||
### **Text-based topic tree**
|
||||
|
||||
Although this gives a nice overview of the potential hierarchy, hovering over all black circles can be tiresome. Instead, we can
|
||||
use `topic_model.get_topic_tree` to create a text-based representation of this hierarchy. Although the general structure is more difficult
|
||||
to view, we can see better which topics could be logically merged:
|
||||
|
||||
```python
|
||||
>>> tree = topic_model.get_topic_tree(hierarchical_topics)
|
||||
>>> print(tree)
|
||||
.
|
||||
└─atheists_atheism_god_moral_atheist
|
||||
├─atheists_atheism_god_atheist_argument
|
||||
│ ├─■──atheists_atheism_god_atheist_argument ── Topic: 21
|
||||
│ └─■──br_god_exist_genetic_existence ── Topic: 124
|
||||
└─■──moral_morality_objective_immoral_morals ── Topic: 29
|
||||
```
|
||||
|
||||
<details>
|
||||
<summary>Click here to view the full tree.</summary>
|
||||
|
||||
```bash
|
||||
.
|
||||
├─people_armenian_said_god_armenians
|
||||
│ ├─god_jesus_jehovah_lord_christ
|
||||
│ │ ├─god_jesus_jehovah_lord_christ
|
||||
│ │ │ ├─jehovah_lord_mormon_mcconkie_god
|
||||
│ │ │ │ ├─■──ra_satan_thou_god_lucifer ── Topic: 94
|
||||
│ │ │ │ └─■──jehovah_lord_mormon_mcconkie_unto ── Topic: 78
|
||||
│ │ │ └─jesus_mary_god_hell_sin
|
||||
│ │ │ ├─jesus_hell_god_eternal_heaven
|
||||
│ │ │ │ ├─hell_jesus_eternal_god_heaven
|
||||
│ │ │ │ │ ├─■──jesus_tomb_disciples_resurrection_john ── Topic: 69
|
||||
│ │ │ │ │ └─■──hell_eternal_god_jesus_heaven ── Topic: 53
|
||||
│ │ │ │ └─■──aaron_baptism_sin_law_god ── Topic: 89
|
||||
│ │ │ └─■──mary_sin_maria_priest_conception ── Topic: 56
|
||||
│ │ └─■──marriage_married_marry_ceremony_marriages ── Topic: 110
|
||||
│ └─people_armenian_armenians_said_mr
|
||||
│ ├─people_armenian_armenians_said_israel
|
||||
│ │ ├─god_homosexual_homosexuality_atheists_sex
|
||||
│ │ │ ├─homosexual_homosexuality_sex_gay_homosexuals
|
||||
│ │ │ │ ├─■──kinsey_sex_gay_men_sexual ── Topic: 44
|
||||
│ │ │ │ └─homosexuality_homosexual_sin_homosexuals_gay
|
||||
│ │ │ │ ├─■──gay_homosexual_homosexuals_sexual_cramer ── Topic: 50
|
||||
│ │ │ │ └─■──homosexuality_homosexual_sin_paul_sex ── Topic: 27
|
||||
│ │ │ └─god_atheists_atheism_moral_atheist
|
||||
│ │ │ ├─islam_quran_judas_islamic_book
|
||||
│ │ │ │ ├─■──jim_context_challenges_articles_quote ── Topic: 36
|
||||
│ │ │ │ └─islam_quran_judas_islamic_book
|
||||
│ │ │ │ ├─■──islam_quran_islamic_rushdie_muslims ── Topic: 31
|
||||
│ │ │ │ └─■──judas_scripture_bible_books_greek ── Topic: 33
|
||||
│ │ │ └─atheists_atheism_god_moral_atheist
|
||||
│ │ │ ├─atheists_atheism_god_atheist_argument
|
||||
│ │ │ │ ├─■──atheists_atheism_god_atheist_argument ── Topic: 21
|
||||
│ │ │ │ └─■──br_god_exist_genetic_existence ── Topic: 124
|
||||
│ │ │ └─■──moral_morality_objective_immoral_morals ── Topic: 29
|
||||
│ │ └─armenian_armenians_people_israel_said
|
||||
│ │ ├─armenian_armenians_israel_people_jews
|
||||
│ │ │ ├─tax_rights_government_income_taxes
|
||||
│ │ │ │ ├─■──rights_right_slavery_slaves_residence ── Topic: 106
|
||||
│ │ │ │ └─tax_government_taxes_income_libertarians
|
||||
│ │ │ │ ├─■──government_libertarians_libertarian_regulation_party ── Topic: 58
|
||||
│ │ │ │ └─■──tax_taxes_income_billion_deficit ── Topic: 41
|
||||
│ │ │ └─armenian_armenians_israel_people_jews
|
||||
│ │ │ ├─gun_guns_militia_firearms_amendment
|
||||
│ │ │ │ ├─■──blacks_penalty_death_cruel_punishment ── Topic: 55
|
||||
│ │ │ │ └─■──gun_guns_militia_firearms_amendment ── Topic: 7
|
||||
│ │ │ └─armenian_armenians_israel_jews_turkish
|
||||
│ │ │ ├─■──israel_israeli_jews_arab_jewish ── Topic: 4
|
||||
│ │ │ └─■──armenian_armenians_turkish_armenia_azerbaijan ── Topic: 15
|
||||
│ │ └─stephanopoulos_president_mr_myers_ms
|
||||
│ │ ├─■──serbs_muslims_stephanopoulos_mr_bosnia ── Topic: 35
|
||||
│ │ └─■──myers_stephanopoulos_president_ms_mr ── Topic: 87
|
||||
│ └─batf_fbi_koresh_compound_gas
|
||||
│ ├─■──reno_workers_janet_clinton_waco ── Topic: 77
|
||||
│ └─batf_fbi_koresh_gas_compound
|
||||
│ ├─batf_koresh_fbi_warrant_compound
|
||||
│ │ ├─■──batf_warrant_raid_compound_fbi ── Topic: 42
|
||||
│ │ └─■──koresh_batf_fbi_children_compound ── Topic: 61
|
||||
│ └─■──fbi_gas_tear_bds_building ── Topic: 23
|
||||
└─use_like_just_dont_new
|
||||
├─game_team_year_games_like
|
||||
│ ├─game_team_games_25_year
|
||||
│ │ ├─game_team_games_25_season
|
||||
│ │ │ ├─window_printer_use_problem_mhz
|
||||
│ │ │ │ ├─mhz_wire_simms_wiring_battery
|
||||
│ │ │ │ │ ├─simms_mhz_battery_cpu_heat
|
||||
│ │ │ │ │ │ ├─simms_pds_simm_vram_lc
|
||||
│ │ │ │ │ │ │ ├─■──pds_nubus_lc_slot_card ── Topic: 119
|
||||
│ │ │ │ │ │ │ └─■──simms_simm_vram_meg_dram ── Topic: 32
|
||||
│ │ │ │ │ │ └─mhz_battery_cpu_heat_speed
|
||||
│ │ │ │ │ │ ├─mhz_cpu_speed_heat_fan
|
||||
│ │ │ │ │ │ │ ├─mhz_cpu_speed_heat_fan
|
||||
│ │ │ │ │ │ │ │ ├─■──fan_cpu_heat_sink_fans ── Topic: 92
|
||||
│ │ │ │ │ │ │ │ └─■──mhz_speed_cpu_fpu_clock ── Topic: 22
|
||||
│ │ │ │ │ │ │ └─■──monitor_turn_power_computer_electricity ── Topic: 91
|
||||
│ │ │ │ │ │ └─battery_batteries_concrete_duo_discharge
|
||||
│ │ │ │ │ │ ├─■──duo_battery_apple_230_problem ── Topic: 121
|
||||
│ │ │ │ │ │ └─■──battery_batteries_concrete_discharge_temperature ── Topic: 75
|
||||
│ │ │ │ │ └─wire_wiring_ground_neutral_outlets
|
||||
│ │ │ │ │ ├─wire_wiring_ground_neutral_outlets
|
||||
│ │ │ │ │ │ ├─wire_wiring_ground_neutral_outlets
|
||||
│ │ │ │ │ │ │ ├─■──leds_uv_blue_light_boards ── Topic: 66
|
||||
│ │ │ │ │ │ │ └─■──wire_wiring_ground_neutral_outlets ── Topic: 120
|
||||
│ │ │ │ │ │ └─scope_scopes_phone_dial_number
|
||||
│ │ │ │ │ │ ├─■──dial_number_phone_line_output ── Topic: 93
|
||||
│ │ │ │ │ │ └─■──scope_scopes_motorola_generator_oscilloscope ── Topic: 113
|
||||
│ │ │ │ │ └─celp_dsp_sampling_antenna_digital
|
||||
│ │ │ │ │ ├─■──antenna_antennas_receiver_cable_transmitter ── Topic: 70
|
||||
│ │ │ │ │ └─■──celp_dsp_sampling_speech_voice ── Topic: 52
|
||||
│ │ │ │ └─window_printer_xv_mouse_windows
|
||||
│ │ │ │ ├─window_xv_error_widget_problem
|
||||
│ │ │ │ │ ├─error_symbol_undefined_xterm_rx
|
||||
│ │ │ │ │ │ ├─■──symbol_error_undefined_doug_parse ── Topic: 63
|
||||
│ │ │ │ │ │ └─■──rx_remote_server_xdm_xterm ── Topic: 45
|
||||
│ │ │ │ │ └─window_xv_widget_application_expose
|
||||
│ │ │ │ │ ├─window_widget_expose_application_event
|
||||
│ │ │ │ │ │ ├─■──gc_mydisplay_draw_gxxor_drawing ── Topic: 103
|
||||
│ │ │ │ │ │ └─■──window_widget_application_expose_event ── Topic: 25
|
||||
│ │ │ │ │ └─xv_den_polygon_points_algorithm
|
||||
│ │ │ │ │ ├─■──den_polygon_points_algorithm_polygons ── Topic: 28
|
||||
│ │ │ │ │ └─■──xv_24bit_image_bit_images ── Topic: 57
|
||||
│ │ │ │ └─printer_fonts_print_mouse_postscript
|
||||
│ │ │ │ ├─printer_fonts_print_font_deskjet
|
||||
│ │ │ │ │ ├─■──scanner_logitech_grayscale_ocr_scanman ── Topic: 108
|
||||
│ │ │ │ │ └─printer_fonts_print_font_deskjet
|
||||
│ │ │ │ │ ├─■──printer_print_deskjet_hp_ink ── Topic: 18
|
||||
│ │ │ │ │ └─■──fonts_font_truetype_tt_atm ── Topic: 49
|
||||
│ │ │ │ └─mouse_ghostscript_midi_driver_postscript
|
||||
│ │ │ │ ├─ghostscript_midi_postscript_files_file
|
||||
│ │ │ │ │ ├─■──ghostscript_postscript_pageview_ghostview_dsc ── Topic: 104
|
||||
│ │ │ │ │ └─midi_sound_file_windows_driver
|
||||
│ │ │ │ │ ├─■──location_mar_file_host_rwrr ── Topic: 83
|
||||
│ │ │ │ │ └─■──midi_sound_driver_blaster_soundblaster ── Topic: 98
|
||||
│ │ │ │ └─■──mouse_driver_mice_ball_problem ── Topic: 68
|
||||
│ │ │ └─game_team_games_25_season
|
||||
│ │ │ ├─1st_sale_condition_comics_hulk
|
||||
│ │ │ │ ├─sale_condition_offer_asking_cd
|
||||
│ │ │ │ │ ├─condition_stereo_amp_speakers_asking
|
||||
│ │ │ │ │ │ ├─■──miles_car_amfm_toyota_cassette ── Topic: 62
|
||||
│ │ │ │ │ │ └─■──amp_speakers_condition_stereo_audio ── Topic: 24
|
||||
│ │ │ │ │ └─games_sale_pom_cds_shipping
|
||||
│ │ │ │ │ ├─pom_cds_sale_shipping_cd
|
||||
│ │ │ │ │ │ ├─■──size_shipping_sale_condition_mattress ── Topic: 100
|
||||
│ │ │ │ │ │ └─■──pom_cds_cd_sale_picture ── Topic: 37
|
||||
│ │ │ │ │ └─■──games_game_snes_sega_genesis ── Topic: 40
|
||||
│ │ │ │ └─1st_hulk_comics_art_appears
|
||||
│ │ │ │ ├─1st_hulk_comics_art_appears
|
||||
│ │ │ │ │ ├─lens_tape_camera_backup_lenses
|
||||
│ │ │ │ │ │ ├─■──tape_backup_tapes_drive_4mm ── Topic: 107
|
||||
│ │ │ │ │ │ └─■──lens_camera_lenses_zoom_pouch ── Topic: 114
|
||||
│ │ │ │ │ └─1st_hulk_comics_art_appears
|
||||
│ │ │ │ │ ├─■──1st_hulk_comics_art_appears ── Topic: 105
|
||||
│ │ │ │ │ └─■──books_book_cover_trek_chemistry ── Topic: 125
|
||||
│ │ │ │ └─tickets_hotel_ticket_voucher_package
|
||||
│ │ │ │ ├─■──hotel_voucher_package_vacation_room ── Topic: 74
|
||||
│ │ │ │ └─■──tickets_ticket_june_airlines_july ── Topic: 84
|
||||
│ │ │ └─game_team_games_season_hockey
|
||||
│ │ │ ├─game_hockey_team_25_550
|
||||
│ │ │ │ ├─■──espn_pt_pts_game_la ── Topic: 17
|
||||
│ │ │ │ └─■──team_25_game_hockey_550 ── Topic: 2
|
||||
│ │ │ └─■──year_game_hit_baseball_players ── Topic: 0
|
||||
│ │ └─bike_car_greek_insurance_msg
|
||||
│ │ ├─car_bike_insurance_cars_engine
|
||||
│ │ │ ├─car_insurance_cars_radar_engine
|
||||
│ │ │ │ ├─insurance_health_private_care_canada
|
||||
│ │ │ │ │ ├─■──insurance_health_private_care_canada ── Topic: 99
|
||||
│ │ │ │ │ └─■──insurance_car_accident_rates_sue ── Topic: 82
|
||||
│ │ │ │ └─car_cars_radar_engine_detector
|
||||
│ │ │ │ ├─car_radar_cars_detector_engine
|
||||
│ │ │ │ │ ├─■──radar_detector_detectors_ka_alarm ── Topic: 39
|
||||
│ │ │ │ │ └─car_cars_mustang_ford_engine
|
||||
│ │ │ │ │ ├─■──clutch_shift_shifting_transmission_gear ── Topic: 88
|
||||
│ │ │ │ │ └─■──car_cars_mustang_ford_v8 ── Topic: 14
|
||||
│ │ │ │ └─oil_diesel_odometer_diesels_car
|
||||
│ │ │ │ ├─odometer_oil_sensor_car_drain
|
||||
│ │ │ │ │ ├─■──odometer_sensor_speedo_gauge_mileage ── Topic: 96
|
||||
│ │ │ │ │ └─■──oil_drain_car_leaks_taillights ── Topic: 102
|
||||
│ │ │ │ └─■──diesel_diesels_emissions_fuel_oil ── Topic: 79
|
||||
│ │ │ └─bike_riding_ride_bikes_motorcycle
|
||||
│ │ │ ├─bike_ride_riding_bikes_lane
|
||||
│ │ │ │ ├─■──bike_ride_riding_lane_car ── Topic: 11
|
||||
│ │ │ │ └─■──bike_bikes_miles_honda_motorcycle ── Topic: 19
|
||||
│ │ │ └─■──countersteering_bike_motorcycle_rear_shaft ── Topic: 46
|
||||
│ │ └─greek_msg_kuwait_greece_water
|
||||
│ │ ├─greek_msg_kuwait_greece_water
|
||||
│ │ │ ├─greek_msg_kuwait_greece_dog
|
||||
│ │ │ │ ├─greek_msg_kuwait_greece_dog
|
||||
│ │ │ │ │ ├─greek_kuwait_greece_turkish_greeks
|
||||
│ │ │ │ │ │ ├─■──greek_greece_turkish_greeks_cyprus ── Topic: 71
|
||||
│ │ │ │ │ │ └─■──kuwait_iraq_iran_gulf_arabia ── Topic: 76
|
||||
│ │ │ │ │ └─msg_dog_drugs_drug_food
|
||||
│ │ │ │ │ ├─dog_dogs_cooper_trial_weaver
|
||||
│ │ │ │ │ │ ├─■──clinton_bush_quayle_reagan_panicking ── Topic: 101
|
||||
│ │ │ │ │ │ └─dog_dogs_cooper_trial_weaver
|
||||
│ │ │ │ │ │ ├─■──cooper_trial_weaver_spence_witnesses ── Topic: 90
|
||||
│ │ │ │ │ │ └─■──dog_dogs_bike_trained_springer ── Topic: 67
|
||||
│ │ │ │ │ └─msg_drugs_drug_food_chinese
|
||||
│ │ │ │ │ ├─■──msg_food_chinese_foods_taste ── Topic: 30
|
||||
│ │ │ │ │ └─■──drugs_drug_marijuana_cocaine_alcohol ── Topic: 72
|
||||
│ │ │ │ └─water_theory_universe_science_larsons
|
||||
│ │ │ │ ├─water_nuclear_cooling_steam_dept
|
||||
│ │ │ │ │ ├─■──rocketry_rockets_engines_nuclear_plutonium ── Topic: 115
|
||||
│ │ │ │ │ └─water_cooling_steam_dept_plants
|
||||
│ │ │ │ │ ├─■──water_dept_phd_environmental_atmospheric ── Topic: 97
|
||||
│ │ │ │ │ └─■──cooling_water_steam_towers_plants ── Topic: 109
|
||||
│ │ │ │ └─theory_universe_larsons_larson_science
|
||||
│ │ │ │ ├─■──theory_universe_larsons_larson_science ── Topic: 54
|
||||
│ │ │ │ └─■──oort_cloud_grbs_gamma_burst ── Topic: 80
|
||||
│ │ │ └─helmet_kirlian_photography_lock_wax
|
||||
│ │ │ ├─helmet_kirlian_photography_leaf_mask
|
||||
│ │ │ │ ├─kirlian_photography_leaf_pictures_deleted
|
||||
│ │ │ │ │ ├─deleted_joke_stuff_maddi_nickname
|
||||
│ │ │ │ │ │ ├─■──joke_maddi_nickname_nicknames_frank ── Topic: 43
|
||||
│ │ │ │ │ │ └─■──deleted_stuff_bookstore_joke_motto ── Topic: 81
|
||||
│ │ │ │ │ └─■──kirlian_photography_leaf_pictures_aura ── Topic: 85
|
||||
│ │ │ │ └─helmet_mask_liner_foam_cb
|
||||
│ │ │ │ ├─■──helmet_liner_foam_cb_helmets ── Topic: 112
|
||||
│ │ │ │ └─■──mask_goalies_77_santore_tl ── Topic: 123
|
||||
│ │ │ └─lock_wax_paint_plastic_ear
|
||||
│ │ │ ├─■──lock_cable_locks_bike_600 ── Topic: 117
|
||||
│ │ │ └─wax_paint_ear_plastic_skin
|
||||
│ │ │ ├─■──wax_paint_plastic_scratches_solvent ── Topic: 65
|
||||
│ │ │ └─■──ear_wax_skin_greasy_acne ── Topic: 116
|
||||
│ │ └─m4_mp_14_mw_mo
|
||||
│ │ ├─m4_mp_14_mw_mo
|
||||
│ │ │ ├─■──m4_mp_14_mw_mo ── Topic: 111
|
||||
│ │ │ └─■──test_ensign_nameless_deane_deanebinahccbrandeisedu ── Topic: 118
|
||||
│ │ └─■──ites_cheek_hello_hi_ken ── Topic: 3
|
||||
│ └─space_medical_health_disease_cancer
|
||||
│ ├─medical_health_disease_cancer_patients
|
||||
│ │ ├─■──cancer_centers_center_medical_research ── Topic: 122
|
||||
│ │ └─health_medical_disease_patients_hiv
|
||||
│ │ ├─patients_medical_disease_candida_health
|
||||
│ │ │ ├─■──candida_yeast_infection_gonorrhea_infections ── Topic: 48
|
||||
│ │ │ └─patients_disease_cancer_medical_doctor
|
||||
│ │ │ ├─■──hiv_medical_cancer_patients_doctor ── Topic: 34
|
||||
│ │ │ └─■──pain_drug_patients_disease_diet ── Topic: 26
|
||||
│ │ └─■──health_newsgroup_tobacco_vote_votes ── Topic: 9
|
||||
│ └─space_launch_nasa_shuttle_orbit
|
||||
│ ├─space_moon_station_nasa_launch
|
||||
│ │ ├─■──sky_advertising_billboard_billboards_space ── Topic: 59
|
||||
│ │ └─■──space_station_moon_redesign_nasa ── Topic: 16
|
||||
│ └─space_mission_hst_launch_orbit
|
||||
│ ├─space_launch_nasa_orbit_propulsion
|
||||
│ │ ├─■──space_launch_nasa_propulsion_astronaut ── Topic: 47
|
||||
│ │ └─■──orbit_km_jupiter_probe_earth ── Topic: 86
|
||||
│ └─■──hst_mission_shuttle_orbit_arrays ── Topic: 60
|
||||
└─drive_file_key_windows_use
|
||||
├─key_file_jpeg_encryption_image
|
||||
│ ├─key_encryption_clipper_chip_keys
|
||||
│ │ ├─■──key_clipper_encryption_chip_keys ── Topic: 1
|
||||
│ │ └─■──entry_file_ripem_entries_key ── Topic: 73
|
||||
│ └─jpeg_image_file_gif_images
|
||||
│ ├─motif_graphics_ftp_available_3d
|
||||
│ │ ├─motif_graphics_openwindows_ftp_available
|
||||
│ │ │ ├─■──openwindows_motif_xview_windows_mouse ── Topic: 20
|
||||
│ │ │ └─■──graphics_widget_ray_3d_available ── Topic: 95
|
||||
│ │ └─■──3d_machines_version_comments_contact ── Topic: 38
|
||||
│ └─jpeg_image_gif_images_format
|
||||
│ ├─■──gopher_ftp_files_stuffit_images ── Topic: 51
|
||||
│ └─■──jpeg_image_gif_format_images ── Topic: 13
|
||||
└─drive_db_card_scsi_windows
|
||||
├─db_windows_dos_mov_os2
|
||||
│ ├─■──copy_protection_program_software_disk ── Topic: 64
|
||||
│ └─■──db_windows_dos_mov_os2 ── Topic: 8
|
||||
└─drive_card_scsi_drives_ide
|
||||
├─drive_scsi_drives_ide_disk
|
||||
│ ├─■──drive_scsi_drives_ide_disk ── Topic: 6
|
||||
│ └─■──meg_sale_ram_drive_shipping ── Topic: 12
|
||||
└─card_modem_monitor_video_drivers
|
||||
├─■──card_monitor_video_drivers_vga ── Topic: 5
|
||||
└─■──modem_port_serial_irq_com ── Topic: 10
|
||||
```
|
||||
</details>
|
||||
|
||||
## **Visualize Hierarchical Documents**
|
||||
We can extend the previous method by calculating the topic representation at different levels of the hierarchy and
|
||||
plotting them on a 2D plane. To do so, we first need to calculate the hierarchical topics:
|
||||
|
||||
```python
|
||||
from sklearn.datasets import fetch_20newsgroups
|
||||
from sentence_transformers import SentenceTransformer
|
||||
from bertopic import BERTopic
|
||||
from umap import UMAP
|
||||
|
||||
# Prepare embeddings
|
||||
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
|
||||
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
|
||||
embeddings = sentence_model.encode(docs, show_progress_bar=False)
|
||||
|
||||
# Train BERTopic and extract hierarchical topics
|
||||
topic_model = BERTopic().fit(docs, embeddings)
|
||||
hierarchical_topics = topic_model.hierarchical_topics(docs)
|
||||
```
|
||||
Then, we can visualize the hierarchical documents by either supplying it with our embeddings or by
|
||||
reducing their dimensionality ourselves:
|
||||
|
||||
```python
|
||||
# Run the visualization with the original embeddings
|
||||
topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, embeddings=embeddings)
|
||||
|
||||
# Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
|
||||
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
|
||||
topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, reduced_embeddings=reduced_embeddings)
|
||||
```
|
||||
|
||||
<iframe src="hierarchical_documents.html" style="width:1200px; height: 800px; border: 0px;""></iframe>
|
||||
|
||||
!!! note
|
||||
The visualization above was generated with the additional parameter `hide_document_hover=True` which disables the
|
||||
option to hover over the individual points and see the content of the documents. This makes the resulting visualization
|
||||
smaller and fit into your RAM. However, it might be interesting to set `hide_document_hover=False` to hover
|
||||
over the points and see the content of the documents.
|
||||
Reference in New Issue
Block a user