Added a base model class and training scripts for various sentiment analysis models, including Naive Bayes, SVM, XGBoost, LSTM, and BERT. Also, improved prediction functionality and the model loading mechanism.

2025-08-04 22:07:30 +08:00
parent bd60e2ed1b
commit 43525c5ca8
23 changed files with 1940 additions and 2362 deletions
@@ -1,32 +1,107 @@
-# WeiboSentiment
-Sentiment analysis of Chinese Weibo posts using various machine learning methods    
-Corpus source: https://github.com/dengxiuqi/weibo2018
---
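-##### "Weibo sentiment analysis" was my undergraduate graduation project and my first NLP project, so I am publishing it here for discussion.
-##### Update 2021.06.07: The earlier version was written rather casually. I did not expect it to pass 100 stars, and some students new to NLP have contacted me privately about this project, so here is an update.
-* Refactored the project architecture and code to improve readability
-* The features, data processing methods, and model details in each file avoid repetition as far as possible, to give readers more points of reference
-* The neural networks now use pytorch; if you need the `tensorflow 1.0` code, roll back to version `445998`.    
-* Added a `Bert` model
-* Because many APIs are incompatible between old and new gensim versions, gensim was updated to version 4.0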
----
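-#### Project description
-* Training set of 10,000 examples, test set of 500 examples
-* Binary classifiers are built and trained with several models: Naive Bayes, SVM, XGBoost, LSTM, and Bert
-* The first 3 models are all trained end to end
-* The LSTM first pretrains Word2Vec word vectors, then trains the neural network
-* `Bert` uses the pretrained model from HIT (Harbin Institute of Technology); the output at Bert's `[CLS]` position is fed to a downstream network for fine-tuning. The pretrained model must be downloaded manually:    
-    * GitHub download: https://github.com/ymcui/Chinese-BERT-wwm
-    * Baidu Netdisk: https://pan.baidu.com/s/16z-ybrqT6wLdy_mLHtywSw  password: djkj
-    * After downloading, place the folder under `./model` and rename `bert_config.json` to `config.json`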
---
-#### Experimental results
-Test-set results for each classifier  
+# Weibo Sentiment Analysis - Traditional Machine Learning Methods

-|Model|Accuracy|AUC|
-| :---: | :---: | :---: |
-|1.bayes|0.856| - |
-|2.svm|0.856| - |
-|3.xgboost|0.86| 0.904 |
-|4.lstm|0.87| 0.931 |
-|5.bert|0.87| 0.929 |
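+## Project overview
+
+This project applies 5 traditional machine learning methods to binary sentiment classification (positive/negative) of Chinese Weibo posts:
+
+- **Naive Bayes**: probabilistic classification on a bag-of-words model
+- **SVM**: support vector machine on TF-IDF features  
+- **XGBoost**: gradient-boosted decision trees
+- **LSTM**: recurrent neural network + Word2Vec word vectors
+- **BERT + classification head**: pretrained language model followed by a classifier (which I consider to also fall within traditional ML)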
+
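+## Model performance
+
+Results on the Weibo sentiment dataset (10,000 training examples, 500 test examples):
+
+| Model | Accuracy | AUC | Characteristics |
+|------|--------|-----|------|
+| Naive Bayes | 85.6% | - | Fast, low memory footprint |
+| SVM | 85.6% | - | Good generalization |
+| XGBoost | 86.0% | 90.4% | Stable performance, supports feature importance |
+| LSTM | 87.0% | 93.1% | Captures sequence information and context |
+| BERT + classification head | 87.0% | 92.9% | Strong semantic understanding |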
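
For reference, the accuracy and AUC columns above can be computed from test labels and predicted positive-class probabilities. The following is a minimal standard-library sketch, not the project's evaluation code: the function names, the 0.5 threshold, and the toy data are all assumptions made for illustration.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], probs: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def auc(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Toy data invented for illustration; these are not results from the project.
toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.4, 0.35, 0.2]
print(accuracy(toy_labels, toy_probs))
print(auc(toy_labels, toy_probs))
```

The pairwise AUC here is quadratic in the number of examples, which is fine for a 500-example test set; a library routine such as scikit-learn's `roc_auc_score` would be the usual choice in practice.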
+
+## Environment setup
+
+```bash
+pip install -r requirements.txt
+```
+
+Data file layout:
+```
+data/
+├── weibo2018/
+│   ├── train.txt
+│   └── test.txt
+└── stopwords.txt
+```
+
+## Training the models (each script can also be run without arguments)
+
+### Naive Bayes
+```bash
+python bayes_train.py
+```
+
+### SVM
+```bash
+python svm_train.py --kernel rbf --C 1.0
+```
+
+### XGBoost
+```bash
+python xgboost_train.py --max_depth 6 --eta 0.3 --num_boost_round 200
+```
+
+### LSTM
+```bash
+python lstm_train.py --epochs 5 --batch_size 100 --hidden_size 64
+```
+
+### BERT
+```bash
+python bert_train.py --epochs 10 --batch_size 100 --learning_rate 1e-3
+```
+
+Note: the BERT model automatically downloads the Chinese pretrained model (bert-base-chinese)
+
+## Running predictions
+
+### Interactive prediction (recommended)
+```bash
+python predict.py
+```
+
+### Command-line prediction
+```bash
+# Single-model prediction
+python predict.py --model_type bert --text "今天天气真好，心情很棒"
+
+# Multi-model ensemble prediction
+python predict.py --ensemble --text "这部电影太无聊了"
+```
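
The README does not describe how `--ensemble` combines the individual models. One common approach is soft voting: average each model's positive-class probability and threshold the mean. The sketch below illustrates that idea only; `soft_vote`, the threshold, and the example probabilities are hypothetical and not taken from `predict.py`.

```python
from typing import Dict, Tuple


def soft_vote(probs_by_model: Dict[str, float], threshold: float = 0.5) -> Tuple[str, float]:
    """Average positive-class probabilities across models and threshold the mean."""
    if not probs_by_model:
        raise ValueError("need at least one model probability")
    mean_prob = sum(probs_by_model.values()) / len(probs_by_model)
    label = "positive" if mean_prob >= threshold else "negative"
    return label, mean_prob


# Hypothetical per-model probabilities that a text is positive.
example = {"bayes": 0.25, "svm": 0.5, "bert": 0.0}
label, mean_prob = soft_vote(example)
print(label, mean_prob)
```

Soft voting needs every model to expose a probability; models that only output hard labels would need majority voting instead.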
+
+## 文件结构
+
+```
+WeiboSentiment_MachineLearning/
+├── bayes_train.py           # Naive Bayes training
+├── svm_train.py             # SVM training
+├── xgboost_train.py         # XGBoost training
+├── lstm_train.py            # LSTM training
+├── bert_train.py            # BERT training
+├── predict.py               # Unified prediction program
+├── base_model.py            # Base model class
+├── utils.py                 # Utility functions
+├── requirements.txt         # Dependencies
+├── model/                   # Saved models
+└── data/                    # Data directory
+```
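
The contents of `base_model.py` are not shown in this part of the diff. As a rough illustration of what a shared base class for the five trainers could look like, here is a sketch with a common train/predict/save/load interface. Every name in it (`BaseModel`, `MajorityModel`, the JSON state format) is an assumption, and the real class may differ substantially.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, List


class BaseModel(ABC):
    """Hypothetical shared interface; the project's base_model.py may differ."""

    @abstractmethod
    def train(self, texts: List[str], labels: List[int]) -> None:
        """Fit the model on labelled texts."""

    @abstractmethod
    def predict(self, texts: List[str]) -> List[int]:
        """Return one 0/1 label per input text."""

    @abstractmethod
    def state(self) -> Dict[str, Any]:
        """Return a JSON-serializable snapshot of the learned parameters."""

    @abstractmethod
    def set_state(self, state: Dict[str, Any]) -> None:
        """Restore learned parameters from a snapshot."""

    def save(self, path: Path) -> None:
        Path(path).write_text(json.dumps(self.state()), encoding="utf-8")

    @classmethod
    def load(cls, path: Path) -> "BaseModel":
        model = cls()
        model.set_state(json.loads(Path(path).read_text(encoding="utf-8")))
        return model


class MajorityModel(BaseModel):
    """Toy subclass that always predicts the most frequent training label."""

    def __init__(self) -> None:
        self.majority = 0

    def train(self, texts: List[str], labels: List[int]) -> None:
        self.majority = int(sum(labels) * 2 >= len(labels))

    def predict(self, texts: List[str]) -> List[int]:
        return [self.majority] * len(texts)

    def state(self) -> Dict[str, Any]:
        return {"majority": self.majority}

    def set_state(self, state: Dict[str, Any]) -> None:
        self.majority = int(state["majority"])


# Round-trip through a temporary file to show save and load working together.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "majority.json"
    model = MajorityModel()
    model.train(["good", "bad", "great"], [1, 0, 1])
    model.save(model_path)
    predictions = MajorityModel.load(model_path).predict(["any text"])
print(predictions)
```

Keeping persistence in the base class means a prediction script only has to know which subclass to load, not how each model stores its weights.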
+
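+## Notes
+
+1. **BERT model**: the first run automatically downloads the pretrained model (about 400MB)
+2. **LSTM model**: training takes a long time; a GPU is recommended
+3. **Saved models** are written to the `model/` directory; make sure there is enough disk space
+4. **Memory requirements**: BERT > LSTM > XGBoost > SVM > Naive Bayes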