{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example Usage of TopicGPT: Amazon Reviews" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we use the Amazon Reviews dataset to show how TopicGPT can help analyze a large corpus of text." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from topicgpt.TopicGPT import TopicGPT" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# load api key\n", "import os\n", "api_key_openai = os.environ.get('OPENAI_API_KEY')\n", "\n", "import openai\n", "\n", "openai.organization = \"org-MOfdTrYSke1pXhlAdLXxwDKx\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# data from https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews?resource=download\n", "\n", "review_data = pd.read_csv(\"../Data/AmazonReviews/amazon_review_polarity_csv/train.csv\", header=None) # load the train set; the file has no header row\n", "\n", "reviews = list(review_data[2])\n", "reviews = reviews[:10000] # only consider the first 10k reviews" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "tm = TopicGPT(\n", " openai_api_key = api_key_openai,\n", " corpus_instruction= \"The Amazon reviews dataset consists of reviews from Amazon. 
The data span a period of 18 years, including 10000 reviews up to March 2013.\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tm.fit(reviews)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "tm.save_embeddings() #save the computed embeddings for later use" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tm.visualize_clusters()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Topic 0: Musical genres and characteristics,\n", " Topic 1: Sci-fi TV show.,\n", " Topic 2: Film Genres,\n", " Topic 3: Paranormal phenomena and UFO sightings,\n", " Topic 4: Earbuds and Headsets,\n", " Topic 5: Book Review Topics,\n", " Topic 6: Gluten-free Cookbook,\n", " Topic 7: Air Mattresses,\n", " Topic 8: Crime and Investigation.,\n", " Topic 9: Printer Troubleshooting,\n", " Topic 10: Hiking Footwear,\n", " Topic 11: Shapewear,\n", " Topic 12: Dance Instruction,\n", " Topic 13: Parenting and Education,\n", " Topic 14: Electronic Gadgets,\n", " Topic 15: Video Games,\n", " Topic 16: MP3 Player Issues,\n", " Topic 17: Camera Accessories,\n", " Topic 18: Power Adapters,\n", " Topic 19: Product Quality,\n", " Topic 20: Ancient civilizations and anthropology.,\n", " Topic 21: Router Connectivity,\n", " Topic 22: Technical Issues,\n", " Topic 23: Puritanical Society,\n", " Topic 24: Sci-fi Space Exploration,\n", " Topic 25: Beauty Products,\n", " Topic 26: Sexual Vibrators,\n", " Topic 27: Home Safety,\n", " Topic 28: Product Quality,\n", " Topic 29: Customer Service Experience,\n", " Topic 30: Textbook Quality,\n", " Topic 31: Programming Documentation,\n", " Topic 32: Hardware Tools,\n", " Topic 33: Product Quality,\n", " Topic 34: Educational Toys,\n", " Topic 35: Appliances,\n", " Topic 36: Kitchenware,\n", " Topic 37: Supernatural Witches,\n", " Topic 38: Horror 
Comics,\n", " Topic 39: Dystopian society,\n", " Topic 40: Emotional Turmoil,\n", " Topic 41: Book genres,\n", " Topic 42: Economic and Political Critique,\n", " Topic 43: Poorly Written Erotica,\n", " Topic 44: Dystopian Surveillance State,\n", " Topic 45: Experimental Poetry,\n", " Topic 46: Formatting Issues,\n", " Topic 47: Language Learning Resources,\n", " Topic 48: Book genres,\n", " Topic 49: Home Improvement,\n", " Topic 50: Religious Texts.]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm.topic_lis" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wed Sep 6 20:17:07 2023 Building and compiling search function\n" ] } ], "source": [ "# load the model if available\n", "import pickle\n", "with open(\"../Data/SavedTopicRepresentations/TopicGPT_amazonReviews.pkl\", \"rb\") as f:\n", " tm = pickle.load(f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see what topic 2 is about" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The common topic of the given words is \"Movie Reviews\".\n", "\n", "Aspects:\n", "1. Genre: animated, slasher, noir, zombie, thriller.\n", "2. Quality: watchable, unwatchable, dreadful, cheesy, ridiculous.\n", "3. Filmmaking: directors, filmmakers, screenwriter, cinematography, filmmaking.\n", "4. Audience reaction: scariest, thrilling, hilarious, disappointing, shocking.\n", "5. 
Technical aspects: widescreen, dolby, surround, cinematography, special effects.\n" ] } ], "source": [ "print(tm.topic_lis[2].topic_description)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GPT wants to the call the function: {\n", " \"name\": \"identify_topic_idx\",\n", " \"arguments\": \"{\\n \\\"query\\\": \\\"Avatar\\\"\\n}\"\n", "}\n", "GPT wants to the call the function: {\n", " \"name\": \"get_topic_information\",\n", " \"arguments\": \"{\\n \\\"topic_idx_lis\\\": [2]\\n}\"\n", "}\n", "Yes, the movie \"Avatar\" is mentioned in topic 2, which is about film genres. However, the specific context or sentiment of the mention is not provided.\n" ] }, { "data": { "text/plain": [ "{2: '\\n Topic index: 2\\n Topic name: Film Genres\\n Topic description: The common topic of the given words is \"Movie Reviews\".\\n\\nAspects:\\n1. Genre: animated, slasher, noir, zombie, thriller.\\n2. Quality: watchable, unwatchable, dreadful, cheesy, ridiculous.\\n3. Filmmaking: directors, filmmakers, screenwriter, cinematography, filmmaking.\\n4. Audience reaction: scariest, thrilling, hilarious, disappointing, shocking.\\n5. 
Technical aspects: widescreen, dolby, surround, cinematography, special effects.\\n Topic topwords: [\\'theaters\\', \\'flicks\\', \\'renting\\', \\'animated\\', \\'robots\\', \\'gang\\', \\'rental\\', \\'filmmakers\\', \\'aerial\\', \\'unrated\\', \\'dubbing\\', \\'credits\\', \\'accents\\', \\'slasher\\', \\'scares\\', \\'noir\\', \\'cop\\', \\'cartoons\\', \\'unwatchable\\', \\'watchable\\', \\'directors\\', \\'porn\\', \\'scariest\\', \\'fights\\', \\'aired\\', \\'adaptation\\', \\'locations\\', \\'wrestling\\', \\'zombie\\', \\'corny\\', \\'watches\\', \\'screenwriter\\', \\'chicks\\', \\'theatrical\\', \\'lesbian\\', \\'sequels\\', \\'villa\\', \\'countryside\\', \\'grabs\\', \\'producer\\', \\'combo\\', \\'redeem\\', \\'trapped\\', \\'marine\\', \\'idiot\\', \\'depicted\\', \\'storylines\\', \\'millions\\', \\'ruin\\', \\'flop\\', \\'thrills\\', \\'claustrophobic\\', \\'caves\\', \\'non-stop\\', \\'blockbuster\\', \\'soldier\\', \\'actresses\\', \\'cinematic\\', \\'breathtaking\\', \\'comedian\\', \\'theatres\\', \\'pilots\\', \\'airplanes\\', \\'widescreen\\', \\'steals\\', \\'grainy\\', \\'acclaimed\\', \\'surround\\', \\'channels\\', \\'innovative\\', \\'buff\\', \\'talents\\', \\'motorcycle\\', \\'soccer\\', \\'courage\\', \\'distracting\\', \\'scenario\\', \\'cult\\', \\'portrays\\', \\'compelled\\', \\'scale\\', \\'segments\\', \\'butt\\', \\'promising\\', \\'cliched\\', \\'sub\\', \\'artsy\\', \\'faces\\', \\'kills\\', \\'rubbish\\', \\'involving\\', \\'vulgar\\', \\'incoherent\\', \\'costumes\\', \\'accent\\', \\'disjointed\\', \\'gory\\', \\'gruesome\\', \\'low-budget\\', \\'dreadful\\', \\'scream\\', \\'host\\', \\'fright\\', \\'action-packed\\', \\'goofy\\', \\'caving\\', \\'driven\\', \\'nude\\', \\'re-make\\', \\'audiences\\', \\'over-the-top\\', \\'filmmaking\\', \\'popcorn\\', \\'aircraft\\', \\'newer\\', \\'sport\\', \\'kicks\\', \\'ventriloquist\\', \\'crying\\', \\'rescue\\', \\'buffs\\', \\'deluxe\\', \\'cheated\\', \\'lackluster\\', 
\\'dolby\\', \\'strangers\\', \\'pun\\', \\'pitiful\\', \\'adore\\', \\'suffers\\', \\'initially\\', \\'loser\\', \\'arts\\', \\'shocking\\', \\'criminals\\', \\'weapons\\', \\'lion\\', \\'eerie\\', \\'preview\\', \\'youth\\', \\'parent\\', \\'poignant\\', \\'presence\\', \\'decade\\', \\'sympathetic\\', \\'interview\\', \\'achieve\\', \\'jobs\\', \\'motivation\\', \\'blonde\\', \\'poem\\', \\'transitions\\', \\'convincing\\', \\'marketing\\', \\'consumers\\', \\'candy\\', \\'asks\\', \\'walked\\', \\'miscast\\', \\'stunts\\', \\'nasty\\', \\'zombies\\', \\'thrill\\', \\'vomit\\', \\'stupidity\\', \\'stinker\\', \\'freaking\\', \\'disgusting\\', \\'ridiculously\\', \\'farce\\', \\'chills\\', \\'brainless\\', \\'dude\\', \\'naked\\', \\'ratings\\', \\'appalling\\', \\'pretends\\', \\'disgust\\', \\'filming\\', \\'starred\\', \\'punches\\', \\'downhill\\', \\'des\\', \\'sappy\\', \\'breathless\\', \\'screening\\', \\'funniest\\', \\'rude\\', \\'tripe\\', \\'hairy\\', \\'handsome\\', \\'unfunny\\', \\'genuinely\\', \\'fighters\\', \\'spoiled\\', \\'fighter\\', \\'backdrop\\', \\'dogfight\\', \\'planes\\', \\'aviation\\', \\'fast-paced\\', \\'dogfights\\', \\'chasing\\', \\'destruction\\', \\'dramas\\', \\'silent\\', \\'specials\\', \\'westerns\\', \\'permanent\\', \\'blurry\\', \\'blu-rays\\', \\'bikers\\', \\'crazed\\', \\'transfers\\', \\'cue\\', \\'deaf\\', \\'skipping\\', \\'commercials\\', \\'gritty\\', \\'exit\\', \\'jumped\\', \\'continuity\\', \\'robbed\\', \\'knocked\\', \\'sync\\', \\'boxed\\', \\'studios\\', \\'remastered\\', \\'pacing\\', \\'boat\\', \\'headed\\', \\'lonely\\', \\'misfortune\\', \\'marry\\', \\'beg\\', \\'jungle\\', \\'alley\\', \\'suffer\\', \\'victim\\', \\'sympathy\\', \\'wouldnt\\', \\'hurts\\', \\'handled\\', \\'shouting\\', \\'crawl\\', \\'border\\', \\'intent\\', \\'flashbacks\\', \\'viewed\\', \\'spite\\', \\'throat\\', \\'sacrificing\\', \\'ticket\\', \\'painfully\\', \\'passionate\\', \\'martial\\', \\'narration\\', 
\\'reaction\\', \\'deaths\\', \\'neighbors\\', \\'offend\\', \\'visually\\', \\'perverse\\', \\'realy\\', \\'goal\\', \\'tender\\', \\'portray\\', \\'bullets\\', \\'engaged\\', \\'coverage\\', \\'monotonous\\', \\'unlikely\\', \\'historically\\', \\'banal\\', \\'credibility\\', \\'one-liners\\', \\'depicts\\', \\'caring\\', \\'divorced\\', \\'grave\\', \\'sincere\\', \\'reaches\\', \\'meaningful\\', \\'mild\\', \\'souls\\', \\'downright\\', \\'dramatically\\', \\'involves\\', \\'understatement\\', \\'hates\\', \\'crosses\\', \\'workers\\', \\'interactions\\', \\'overwhelming\\', \\'statues\\', \\'sum\\', \\'photographed\\', \\'ranks\\', \\'aged\\', \\'region\\', \\'post-apocalyptic\\', \\'spoil\\', \\'dud\\', \\'qualities\\', \\'merit\\', \\'borrowed\\', \\'adapted\\', \\'scripts\\', \\'weaker\\', \\'justify\\', \\'purposes\\', \\'coherent\\', \\'gratuitous\\', \\'creature\\', \\'segment\\', \\'whim\\', \\'determine\\', \\'firefighters\\', \\'ok.\\', \\'inaccurate\\', \\'warmth\\', \\'turkey\\', \\'installment\\', \\'picky\\', \\'remotely\\', \\'shell\\', \\'receipt\\', \\'complained\\', \\'phoned\\', \\'cheesey\\', \\'shooting\\', \\'freak\\', \\'cheezy\\', \\'shoots\\', \\'travesty\\', \\'plotless\\', \\'drunk\\', \\'vile\\', \\'spy\\', \\'laughably\\', \\'shameless\\', \\'actors/actresses\\', \\'screams\\', \\'embarrassing\\', \\'rape\\', \\'embarrassed\\', \\'foul\\', \\'half-hour\\', \\'chases\\', \\'retarded\\', \\'crawling\\', \\'spoiler\\', \\'lustful\\', \\'pissed\\', \\'filmmaker\\', \\'disgrace\\', \\'spliced\\', \\'depths\\', \\'uncensored\\', \\'originality\\', \\'twins\\', \\'must-see\\', \\'unreal\\', \\'mansion\\', \\'cameo\\', \\'rendering\\', \\'marvelous\\', \\'comedies\\', \\'crippled\\', \\'comedians\\', \\'marvel\\', \\'fought\\', \\'ghostly\\', \\'spooky\\', \\'bare\\', \\'phenomenal\\', \\'robot\\', \\'underrated\\', \\'beneath\\', \\'comical\\', \\'landing\\', \\'crew\\', \\'heartwarming\\', \\'mute\\', \\'finale\\', \\'teaser\\', 
\\'airplane\\', \\'nonetheless\\', \\'campy\\', \\'autobots\\', \\'five-star\\', \\'beating\\', \\'reruns\\', \\'marries\\', \\'fond\\', \\'flawless\\', \\'avenger\\', \\'cavalry\\', \\'faded\\', \\'laughter\\', \\'streaming\\', \\'viewings\\', \\'pixelated\\', \\'letterboxed\\', \\'dub\\', \\'biker\\', \\'bluray\\', \\'spiderman\\', \\'beloved\\', \\'beautifull\\', \\'broadcast\\', \\'boxset\\', \\'swinging\\', \\'restored\\', \\'captioning\\', \\'organ\\', \\'skips\\', \\'previews\\', \\'disapointment\\', \\'tickets\\', \\'peanuts\\', \\'holidays\\', \\'insomnia\\', \\'geared\\', \\'suit\\', \\'perfection\\', \\'crisp\\', \\'mesmerizing\\', \\'butter\\', \\'bus\\', \\'vehicle\\', \\'cloying\\', \\'butchered\\', \\'slap\\', \\'spots\\', \\'angles\\', \\'builds\\', \\'tossed\\', \\'facial\\', \\'ripping\\', \\'jumping\\', \\'manufactured\\', \\'glued\\', \\'struck\\', \\'idiots\\', \\'cliff\\', \\'bullet\\', \\'accident\\', \\'performs\\', \\'shakes\\', \\'lossless\\', \\'expired\\', \\'yell\\', \\'grass\\', \\'hurl\\', \\'walks\\', \\'flesh\\', \\'quirky\\', \\'sooo\\', \\'expedition\\', \\'stilted\\', \\'so-so\\', \\'sue\\', \\'cheating\\', \\'regrettably\\', \\'cried\\', \\'alike\\', \\'afterwards\\', \\'escapes\\', \\'headache\\', \\'den\\', \\'shy\\', \\'bisexual\\', \\'effeminate\\', \\'wit\\', \\'suffering\\', \\'cabin\\', \\'minimal\\', \\'riveting\\', \\'brow\\', \\'racism\\', \\'thankfully\\', \\'kinky\\', \\'songwriter\\', \\'mind-numbing\\', \\'leap\\', \\'artifacts\\', \\'attraction\\', \\'enticing\\', \\'mill\\', \\'collectors\\', \\'whiny\\', \\'bickering\\', \\'flawed\\', \\'ton\\', \\'adopted\\', \\'tortured\\', \\'assumed\\', \\'nominated\\', \\'maintain\\']'}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm.pprompt(\"Is the movie Avatar mentioned in topic 2?\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To check the output, we actually inspect the respective document at index 
1498: " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Not just because of the 3D, but because this is the version where they made an effort to optimize the picture! They released this on blu ray like like avatar was released. First they release the movie without any optimization and special features. Which means the picture looks better than DVD but not the best that blu ray can be.(which means a grainy looking picture that looks like the characters are in a sandstorm and there is a lack of detail that you expect in a blu ray). The they make the limited edition which is made the way a blu ray is supposed to be down. So if you are wondering which one to choose, this is the one you want! All the features with the visuals to boot!\n" ] } ], "source": [ "print(tm.topic_lis[2].documents[1498])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us go on with the analysis. Since it is easy to lose track of all the topics, let's find out which one is about books." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GPT wants to the call the function: {\n", " \"name\": \"identify_topic_idx\",\n", " \"arguments\": \"{\\n \\\"query\\\": \\\"books\\\"\\n}\"\n", "}\n", "Topic 5 is about books.\n" ] }, { "data": { "text/plain": [ "5" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm.pprompt(\"Which topic is about books?\")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The common topic of the given words is \"Book Reviews\". \n", "\n", "Various aspects and sub-topics of this topic include:\n", "1. Characters: likable, endearing, mighty, crew, pals\n", "2. Storyline: satirical, mythological, strange, satirical, endings\n", "3. 
Writing style: well-crafted, inviting, brilliantly, sarcastic\n", "4. Themes: philosophical, allusions, belief, religion, obsession\n", "5. Critique: uneven, dissatisfaction, novice, pale, unsuccessful\n" ] } ], "source": [ "print(tm.topic_lis[5].topic_description)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GPT wants to the call the function: {\n", " \"name\": \"knn_search\",\n", " \"arguments\": \"{\\n \\\"topic_index\\\": 4,\\n \\\"query\\\": \\\"Harry Potter\\\",\\n \\\"k\\\": 5\\n}\"\n", "}\n", "No, there is no mention of Harry Potter in topic 5. The documents that are most closely related to the query \"Harry Potter\" do not mention the topic.\n" ] }, { "data": { "text/plain": [ "([\"Unable to use. Compartments too tiny and too deep to reach in to get earrings - and I don't have unusually large fingers.\",\n", " \"I wanted an in ear blue tooth headset but couldn't get them to stay in. these made it work!\",\n", " 'people should buy headsets to fit these things bc they are so essential..A must have for any headset that will fit them',\n", " 'I am am a musician so I use these with earbuds coming off my computer within my headphones which is powered from my Marshall Amplifier. They transfer the sound well and are very well made.',\n", " 'The Jabra Eargels are wonderful.. I cannot believe how comfortable, and user friendly they are. Thank you, if not for these, I could not use my bluetooth.Great Merchandise..'],\n", " [71, 27, 30, 72, 69])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm.pprompt(\"Is Harry Potter mentioned in topic 5?\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Topic 0 (\"Musical genres and characteristics\") sounds a bit general and from the visual inspection it seems to contain a lot of documents. 
So let's break it down a little bit" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GPT wants to the call the function: {\n", " \"name\": \"split_topic_kmeans\",\n", " \"arguments\": \"{\\n \\\"topic_idx\\\": 0,\\n \\\"inplace\\\": true\\n}\"\n", "}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Epochs completed: 100%| ██████████ 100/100 [00:03]\n", "Computing word-topic matrix: 100%|██████████| 1/1 [00:01<00:00, 1.29s/it]\n", "Epochs completed: 100%| ██████████ 100/100 [00:01]\n", "Epochs completed: 100%| ██████████ 100/100 [00:01]\n", " 0%| | 0/1 [00:00