Compare commits: main..535ec5d568 (4 commits)

| Author | SHA1 | Date |
|---|---|---|
| | 535ec5d568 | |
| | 79c8f03573 | |
| | 011ab4cd81 | |
| | 294e4885b1 | |
@@ -32,7 +32,6 @@ jrxml_chunker_output/
# IDE
.idea/
.vscode/
。claude/
*.swp
*.swo
*~

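@@ -1,94 +1,79 @@
# JRXML RAG Project

A RAG-based retrieval system for JasperReports JRXML templates and Markdown documents, built as groundwork for a custom JRXML agent.
A RAG (Retrieval-Augmented Generation) question-answering system for JasperReports JRXML templates, built as groundwork for a custom JRXML agent.

It supports semantic chunking, vectorization, persistent storage in Chroma, and natural-language queries over JRXML templates and Markdown documents. **All three core steps support incremental processing.**
## Project Overview

This project semantically chunks JasperReports JRXML template files, vectorizes them, and stores them in a Chroma vector database, so that the structure, configuration, and logic of report templates can be retrieved and understood through natural-language queries.
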
## Project Structure

```
rag_jrxml/
├── collect_jrxml.py # JRXML file collection
├── jrxml_chunker.py # JRXML semantic chunking engine (v3.0)
├── md_chunker.py # Markdown semantic chunking engine
├── batch_chunker.py # Unified batch chunking entry point (JRXML + MD, incremental supported)
├── down_embedding_model.py # Embedding model download
├── embed_chunks.py # Chunk vectorization (incremental supported)
├── import_to_chroma.py # Chroma vector import (incremental supported)
├── query_chroma.py # Semantic search queries
├── config.py # Unified configuration management (.env)
├── .env / .env.example # Environment variable configuration
├── requirements.txt # Python dependencies
├── jrxml_source/ # JRXML source files
├── jrxml_chunker_output/ # Chunking output
├── embeddings/ # Vector output
├── chroma_db/ # Chroma persistent database
└── docs/file_guide.md # Detailed file function guide
RAG-jaspersoft/
├── collect_jrxml.py # JRXML file collection script
├── jrxml_chunker.py # Core JRXML semantic chunking engine
├── jrxml_banch_chunker.py # Batch chunking entry script
├── down_embedding_model.py # Embedding model download script
├── embed_chunks.py # Chunk vectorization script
├── import_to_chroma.py # Vector import into the Chroma database
├── query_chroma.py # Semantic search query tool
├── jrxml_source/ # JRXML source file directory
├── jrxml_chunker_output/ # Chunking output directory
│ ├── all_chunks.json # Merged file of all chunks
│ ├── processing_stats.json # Processing statistics report
│ └── per_file/ # Chunks grouped by source file
├── models/ # Embedding model directory
│ └── Qwen3-Embedding-4B/ # Qwen3 embedding model
├── embeddings/ # Vector output directory
│ ├── embeddings.npy # Vector matrix
│ ├── chunks.json # Original chunks
│ └── embeddings.pkl # Full data pickle
├── chroma_db/ # Chroma vector database
└── docs/ # Project documentation
    └── file_guide.md # File function guide
```

## Requirements
## Quick Start

### Requirements

- Python 3.11+
- NVIDIA GPU (8 GB+ VRAM recommended) or CPU
- CUDA 12.1+ (GPU mode)
- NVIDIA GPU (recommended, 8 GB+ VRAM) or CPU
- CUDA 12.1+ (GPU mode)

## Installation and Configuration
### Install Dependencies

```bash
pip install -r requirements.txt
cp .env.example .env # edit .env to adjust the model, paths, and other parameters
# Install PyTorch (CUDA build)
uv pip install torch --index-url https://download.pytorch.org/whl/cu130

# Install the other dependencies
uv pip install sentence-transformers chromadb numpy tqdm
```

Main configuration options:

| Variable | Description | Default |
| --- | --- | --- |
| `EMBEDDING_MODEL_NAME` | Embedding model (Hub name) | `Qwen/Qwen3-Embedding-0.6B` |
| `EMBEDDING_MODEL_PATH` | Local model path | `models/Qwen3-Embedding-0.6B` |
| `MAX_CHUNK_SIZE` | Maximum characters per chunk | `2000` |
| `BATCH_SIZE` | Vectorization batch size | `16` |
| `CHROMA_COLLECTION_NAME` | Chroma collection name | `jrxml_chunks` |
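
As a rough illustration of how these options might be consumed, the sketch below reads them from a mapping (the process environment by default) and falls back to the defaults listed in the table. The `Settings` dataclass and `load_settings` helper are hypothetical names chosen for this example; the project's real `config.py` is not shown in this diff, and parsing of the `.env` file itself is left out.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the options listed in the table."""
    embedding_model_name: str
    embedding_model_path: str
    max_chunk_size: int
    batch_size: int
    chroma_collection_name: str


def load_settings(env=None) -> Settings:
    """Read options from a mapping (os.environ by default), using the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        embedding_model_name=env.get("EMBEDDING_MODEL_NAME", "Qwen/Qwen3-Embedding-0.6B"),
        embedding_model_path=env.get("EMBEDDING_MODEL_PATH", "models/Qwen3-Embedding-0.6B"),
        max_chunk_size=int(env.get("MAX_CHUNK_SIZE", "2000")),
        batch_size=int(env.get("BATCH_SIZE", "16")),
        chroma_collection_name=env.get("CHROMA_COLLECTION_NAME", "jrxml_chunks"),
    )
```

Passing a plain dictionary, for example `load_settings({"BATCH_SIZE": "2"})`, makes it easy to try out overrides without touching the real environment.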

---

## First Use: Full Build

Build the vector database from scratch by running the three steps in order:

### Step 1: Collect & Chunk
### Full Workflow

```bash
# Collect JRXML template files
# 1. Collect JRXML files
python collect_jrxml.py

# Unified chunking (mixed JRXML + Markdown directory)
python batch_chunker.py ./jrxml_source --output ./jrxml_chunker_output
```
# 2. Semantic chunking
python jrxml_banch_chunker.py ./jrxml_source --output ./jrxml_chunker_output

Outputs `jrxml_chunker_output/all_chunks.json` and `processing_stats.json`.

### Step 2: Vectorize

```bash
# Download the embedding model (first run only)
# 3. Download the embedding model (first run)
python down_embedding_model.py

# Full vectorization
python embed_chunks.py
```
# 4. Vectorize
python embed_chunks.py --batch_size 2

Outputs `embeddings/embeddings.npy`, `chunks.json`, and other files.

### Step 3: Import into Chroma

```bash
# Full import (creates a new collection)
# 5. Import into the Chroma database
python import_to_chroma.py

# 6. Start querying
python query_chroma.py
```

Outputs the persistent vector database `chroma_db/`.

### Step 4: Query
### Quick Query

```bash
# Interactive mode
@@ -96,86 +81,51 @@ python query_chroma.py

# Single query
python query_chroma.py "how to change the report title"

# Filter by type
python query_chroma.py "how to write the SQL query" --filter_field query
python query_chroma.py "report parameters" --threshold 0.5 --n_results 10
```

---

## Incremental Update: Adding New Templates

Once a database exists, adding new templates does not require a rebuild. After placing the new files in the source directory:

```bash
# Step 1: incremental chunking (skips already-processed files and merges into the existing results)
python batch_chunker.py ./jrxml_source --incremental

# Step 2: incremental vectorization (encodes only new chunks and merges into the existing vectors)
python embed_chunks.py --incremental

# Step 3: incremental import (appends to the existing collection without deleting existing data)
python import_to_chroma.py --incremental
```

How each of the three `--incremental` flags works:

| Step | How processed items are identified | When there is no new data |
| --- | --- | --- |
| `batch_chunker` | Compares file paths against `processing_stats.json` | Prints "没有新文件需要处理" (no new files to process) |
| `embed_chunks` | Deduplicates by `(context, chunk_id)` | Prints "没有新 chunks 需要向量化" (no new chunks to vectorize) |
| `import_to_chroma` | Queries the IDs already in Chroma | Prints "没有新数据需要导入" (no new data to import) |

---

## Chunk Types

### JRXML
The system splits JRXML templates into chunks of the following semantic types:

| Type | Description |
|---|---|
| `report_overview` | Report overview (with data source analysis) |
| `datasource_config` | Data source configuration |
| `query` | Data query (SQL/HQL/XPath/JSON, etc.) |
|------|------|
| `report_overview` | Overall report overview, with data source analysis |
| `datasource_config` | Data source configuration properties |
| `query` | Data query (SQL/HQL/XPath, etc.) |
| `parameters` | Parameter definitions |
| `fields` / `field` | Field definitions |
| `fields` | Field definitions |
| `sortFields` | Sort fields |
| `filterExpression` | Filter expression |
| `variables_*` | Variable definitions (by resetType) |
| `variables_*` | Variable definitions (grouped by reset type) |
| `styles` | Style definitions |
| `dataset` | Dataset definition |
| `group` | Group definition |
| `band_*` | Standard bands (title/detail/pageHeader, etc.) |
| `groups` | Group definitions |
| `band_*` | Standard bands (title/detail/pageHeader, etc.) |
| `chart` | Chart element |
| `crosstab` | Crosstab element |
| `subreport` | Subreport element |
| `component` | Component element (lists, etc.) |

### Markdown

| Type | Description |
|---|---|
| `section_h1` | Level-1 heading section |
| `section_h2` / `section_h3` | Level-2/3 heading section |
| `section_installation` | Installation/deployment section |
| `section_configuration` | Configuration section |
| `section_api` | API section |
| `section_example` | Examples/usage section |
| `section_faq` | FAQ section |
| `section_changelog` | Changelog section |
| `code` | Code block |

## Supported JRXML Data Sources

SQL/JDBC · HQL/Hibernate · XPath/XML · JSON · JSONQL · CSV · Data Adapter (Excel/XML/HTTP) · Bean Collection · Empty
| `component` | Component element (lists, etc.) |
| `dataset` | Dataset definition |

## Tech Stack

- **Chunking engine**: XML semantic parsing (JRXML) + structured Markdown parsing
- **Embedding model**: Qwen3-Embedding (FP16 supported, replaceable)
- **Chunking engine**: Semantic chunker based on XML parsing
- **Embedding model**: Qwen3-Embedding-4B (FP16 half precision supported)
- **Vector database**: ChromaDB (persistent mode, cosine similarity)
- **Embedding framework**: Sentence-Transformers
- **Vector database**: ChromaDB (persistent, cosine similarity)
- **Deep learning**: PyTorch + CUDA (CPU compatible)
- **Deep learning**: PyTorch + CUDA

## Performance Reference

| Hardware | Model | Batch Size | Speed |
|------|------|-----------|------|
| RTX 4060 Laptop 8GB | Qwen3-Embedding-4B (FP16) | 2 | ~1.2s/chunk |
| RTX 4060 Laptop 8GB | all-MiniLM-L6-v2 | 64 | ~0.001s/chunk |

> Building the database offline is a one-time cost; an online query takes only 1-2 seconds.

## License

MIT
MIT
@@ -1,296 +0,0 @@
"""
batch_chunker.py
Unified batch chunking entry point; handles mixed JRXML and Markdown files
"""

import os
import sys
import json
import time
from pathlib import Path
from datetime import datetime
from collections import defaultdict

from jrxml_chunker import JRXMLSemanticChunker
from md_chunker import MarkdownSemanticChunker, save_chunks_to_json


SUPPORTED_EXTENSIONS = ('.jrxml', '.JRXML', '.md', '.markdown')


def batch_chunk_with_report(input_dir: str = None, output_dir: str = None,
                            max_chunk_size: int = 2000, incremental: bool = False):
    """
    Batch chunking with mixed JRXML and Markdown support

    Args:
        input_dir: input directory
        output_dir: output directory
        max_chunk_size: maximum number of characters per chunk
        incremental: incremental mode; process only new files and merge into existing results
    """
    if input_dir is None:
        print("错误:请指定输入目录")
        return None

    input_path = Path(input_dir).resolve()

    if not input_path.exists():
        print(f"❌ 目录不存在: {input_path}")
        return None

    if not input_path.is_dir():
        print(f"❌ 不是目录: {input_path}")
        return None

    if output_dir is None:
        output_dir = input_path.parent / f"{input_path.stem}_chunks"
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    print(f"\n{'='*60}")
    print(f"统一批量分块 v1.0" + (" [增量模式]" if incremental else ""))
    print(f"{'='*60}")
    print(f"输入目录: {input_path}")
    print(f"输出目录: {output_path}")
    print(f"{'='*60}\n")

    # Incremental mode: load existing data and skip already-processed files
    existing_chunks = []
    processed_files = set()
    if incremental:
        existing_chunks_path = output_path / "all_chunks.json"
        existing_stats_path = output_path / "processing_stats.json"
        if existing_chunks_path.exists() and existing_stats_path.exists():
            with open(existing_chunks_path, 'r', encoding='utf-8') as f:
                existing_chunks = json.load(f)
            with open(existing_stats_path, 'r', encoding='utf-8') as f:
                existing_stats = json.load(f)
            processed_files = set(existing_stats.get("chunks_per_file", {}).keys())
            print(f"增量模式: 已有 {len(existing_chunks)} 个 chunks, {len(processed_files)} 个已处理文件")
        else:
            print(f"增量模式: 未找到已有数据,切换为全量处理")
            incremental = False

    # Initialize the chunkers
    jrxml_chunker = JRXMLSemanticChunker(max_chunk_size=max_chunk_size)
    md_chunker = MarkdownSemanticChunker(max_chunk_size=max_chunk_size)

    # Collect all supported files
    files_by_ext = defaultdict(list)
    for ext in SUPPORTED_EXTENSIONS:
        files_by_ext[ext] = list(input_path.rglob(f"*{ext}"))

    # Incremental mode: filter out already-processed files
    total_found = sum(len(f) for f in files_by_ext.values())
    if incremental and processed_files:
        skipped = 0
        for ext in SUPPORTED_EXTENSIONS:
            new_list = []
            for f in files_by_ext[ext]:
                if str(f.relative_to(input_path)) in processed_files:
                    skipped += 1
                else:
                    new_list.append(f)
            files_by_ext[ext] = new_list
        print(f"扫描到 {total_found} 个文件, 跳过 {skipped} 个已处理")
    else:
        print(f"扫描到 {total_found} 个文件")

    total_files = sum(len(f) for f in files_by_ext.values())
    for ext, files in files_by_ext.items():
        if files:
            print(f" {ext}: {len(files)} 个")

    if total_files == 0:
        print("✅ 没有新文件需要处理")
        result_stats = existing_stats.copy() if (incremental and processed_files) else {}
        return {
            "chunks": existing_chunks,
            "stats": result_stats,
            "output_path": str(output_path)
        }

    # Statistics
    all_chunks = []
    stats = {
        "total_files": total_found,
        "success": 0,
        "failed": 0,
        "total_chunks": 0,
        "failed_files": [],
        "chunks_per_file": defaultdict(int),
        "chunk_types": defaultdict(int),
        "files_by_type": {"jrxml": 0, "markdown": 0},
        "started_at": datetime.now().isoformat()
    }

    start_time = time.time()

    # Process JRXML files
    jrxml_files = files_by_ext.get('.jrxml', []) + files_by_ext.get('.JRXML', [])
    if jrxml_files:
        print(f"\n📄 处理 JRXML 文件 ({len(jrxml_files)} 个)...")
        for i, jrxml_file in enumerate(jrxml_files, 1):
            relative_path = jrxml_file.relative_to(input_path)

            try:
                file_start = time.time()
                chunks = jrxml_chunker.chunk_file(str(jrxml_file))
                file_duration = time.time() - file_start

                all_chunks.extend(chunks)

                stats["success"] += 1
                stats["files_by_type"]["jrxml"] += 1
                stats["total_chunks"] += len(chunks)
                stats["chunks_per_file"][str(relative_path)] = len(chunks)

                for chunk in chunks:
                    stats["chunk_types"][f"jrxml_{chunk['chunk_type']}"] += 1

                print(f"[{i}/{len(jrxml_files)}] ✅ JRXML: {relative_path} → {len(chunks)} chunks ({file_duration:.2f}s)")

            except Exception as e:
                stats["failed"] += 1
                error_info = {"file": str(relative_path), "type": "jrxml", "error": str(e)}
                stats["failed_files"].append(error_info)
                print(f"[{i}/{len(jrxml_files)}] ❌ JRXML: {relative_path} → {e}")

    # Process Markdown files
    md_files = files_by_ext.get('.md', []) + files_by_ext.get('.markdown', [])
    if md_files:
        print(f"\n📝 处理 Markdown 文件 ({len(md_files)} 个)...")
        for i, md_file in enumerate(md_files, 1):
            relative_path = md_file.relative_to(input_path)

            try:
                file_start = time.time()
                chunks = md_chunker.chunk_file(str(md_file))
                file_duration = time.time() - file_start

                all_chunks.extend(chunks)

                stats["success"] += 1
                stats["files_by_type"]["markdown"] += 1
                stats["total_chunks"] += len(chunks)
                stats["chunks_per_file"][str(relative_path)] = len(chunks)

                for chunk in chunks:
                    stats["chunk_types"][f"md_{chunk['chunk_type']}"] += 1

                print(f"[{i}/{len(md_files)}] ✅ MD: {relative_path} → {len(chunks)} chunks ({file_duration:.2f}s)")

            except Exception as e:
                stats["failed"] += 1
                error_info = {"file": str(relative_path), "type": "markdown", "error": str(e)}
                stats["failed_files"].append(error_info)
                print(f"[{i}/{len(md_files)}] ❌ MD: {relative_path} → {e}")

    total_duration = time.time() - start_time
    stats["processing_time"] = round(total_duration, 2)
    stats["finished_at"] = datetime.now().isoformat()

    # Incremental mode: merge old and new data
    if incremental and existing_chunks:
        merged_chunks = existing_chunks + all_chunks
        print(f"\n合并: 已有 {len(existing_chunks)} + 新增 {len(all_chunks)} = {len(merged_chunks)} 个 chunks")
        all_chunks = merged_chunks

        # Merge statistics
        merged_stats = existing_stats.copy()
        merged_stats["success"] = existing_stats.get("success", 0) + stats["success"]
        merged_stats["failed"] = existing_stats.get("failed", 0) + stats["failed"]
        merged_stats["total_chunks"] = existing_stats.get("total_chunks", 0) + stats["total_chunks"]
        merged_stats["processing_time"] = round(existing_stats.get("processing_time", 0) + total_duration, 2)
        merged_stats["finished_at"] = stats["finished_at"]
        for fp, count in stats["chunks_per_file"].items():
            merged_stats["chunks_per_file"][fp] = count
        for ct, count in stats["chunk_types"].items():
            merged_stats["chunk_types"][ct] = merged_stats.get("chunk_types", {}).get(ct, 0) + count
        merged_stats["files_by_type"]["jrxml"] = existing_stats.get("files_by_type", {}).get("jrxml", 0) + stats["files_by_type"]["jrxml"]
        merged_stats["files_by_type"]["markdown"] = existing_stats.get("files_by_type", {}).get("markdown", 0) + stats["files_by_type"]["markdown"]
        if stats["failed_files"]:
            merged_stats.setdefault("failed_files", []).extend(stats["failed_files"])
        stats_serializable = {k: (dict(v) if isinstance(v, defaultdict) else v) for k, v in merged_stats.items()}
    else:
        stats_serializable = {k: (dict(v) if isinstance(v, defaultdict) else v) for k, v in stats.items()}

    # Save all chunks
    all_chunks_path = output_path / "all_chunks.json"
    save_chunks_to_json(all_chunks, str(all_chunks_path))

    # Save the statistics report
    stats_path = output_path / "processing_stats.json"
    with open(stats_path, "w", encoding="utf-8") as f:
        json.dump(stats_serializable, f, ensure_ascii=False, indent=2)

    # Print the summary
    total_success = stats_serializable.get("success", stats["success"])
    total_failed = stats_serializable.get("failed", stats["failed"])
    total_chunks_count = stats_serializable.get("total_chunks", stats["total_chunks"])
    jrxml_count = stats_serializable.get("files_by_type", {}).get("jrxml", stats["files_by_type"]["jrxml"])
    md_count = stats_serializable.get("files_by_type", {}).get("markdown", stats["files_by_type"]["markdown"])

    print(f"\n{'='*60}")
    print(f"处理完成!")
    print(f"{'='*60}")
    print(f"✅ 成功: {total_success} 文件 (JRXML: {jrxml_count}, MD: {md_count})")
    print(f"❌ 失败: {total_failed} 文件")
    print(f"📦 总 Chunks: {total_chunks_count}")
    print(f"⏱️ 总耗时: {total_duration:.2f}s")
    print(f"📂 输出目录: {output_path}")
    print(f"\n主要文件:")
    print(f" - {all_chunks_path}")
    print(f" - {stats_path}")

    display_types = stats_serializable.get("chunk_types", stats.get("chunk_types", {}))
    if display_types:
        print(f"\nChunk 类型分布 (前 10):")
        sorted_types = sorted(display_types.items(), key=lambda x: -x[1])[:10]
        for ct, count in sorted_types:
            print(f" {ct}: {count}")

    if stats["failed_files"]:
        print(f"\n⚠️ 失败文件详情:")
        for fail in stats["failed_files"][:10]:
            print(f" - {fail['file']} ({fail['type']}): {fail['error']}")

    return {
        "chunks": all_chunks,
        "stats": stats_serializable,
        "output_path": str(output_path)
    }


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("=" * 60)
        print("统一批量分块 v1.0")
        print("支持 JRXML 和 Markdown 文件")
        print("=" * 60)
        print("\n用法:")
        print(" python batch_chunker.py <目录路径>")
        print(" python batch_chunker.py <目录路径> --output <输出目录>")
        print(" python batch_chunker.py <目录路径> --incremental")
        print("\n示例:")
        print(" python batch_chunker.py ./jrxml_source")
        print(" python batch_chunker.py ./docs")
        print(" python batch_chunker.py ./ --output ./chunks")
        print(" python batch_chunker.py ./jrxml_source --incremental # 增量分块")
        sys.exit(0)

    input_path = sys.argv[1]

    output_dir = None
    if "--output" in sys.argv:
        idx = sys.argv.index("--output")
        if idx + 1 < len(sys.argv):
            output_dir = sys.argv[idx + 1]

    incremental = "--incremental" in sys.argv

    if os.path.isdir(input_path):
        batch_chunk_with_report(input_path, output_dir, incremental=incremental)
    else:
        print(f"❌ 路径无效或不是目录: {input_path}")
+191
-195
@@ -4,262 +4,260 @@

---

## 1. collect_jrxml.py: JRXML File Collection
## 1. collect_jrxml.py: JRXML File Collection Script

**Function**: Recursively collects `.jrxml` files from a JasperReports template library directory and copies them into the project's `jrxml_source` directory.
**Function**: Recursively collects all `.jrxml` files from the specified JasperReports template library directory and copies them into the project's `jrxml_source` directory.

**Input**:
- Source directory: `C:\Users\zy187\JaspersoftWorkspace\JasperReportsSamples` (configurable)

**Output**:
- `jrxml_source/` directory containing all collected JRXML files

**Usage**:
```bash
python collect_jrxml.py
```

---

## 2. jrxml_chunker.py: JRXML Semantic Chunking Engine (v3.0)

**Function**: Splits a single JRXML file by semantic structure. Called by `batch_chunker.py`; can also be used on its own.

**Input**: Path to a single `.jrxml` file (or a directory)

**Output**: A list of `JRXMLChunk` objects with the following fields:
- `chunk_id`: sequence number within the file
- `chunk_type`: chunk type (`query`, `band_detail`, `chart`, etc.)
- `human_description`: human-readable description
- `raw_xml`: raw XML fragment
- `context`: name of the owning report
- `metadata`: metadata (report_name, band_name, element_kind, etc.)

**Standalone usage**:
```bash
python jrxml_chunker.py report.jrxml # single file
python jrxml_chunker.py ./jrxml_source/ # directory
```
**Core logic**:
- Recursively walks the source directory with `os.walk()`
- Selects files with the `.jrxml` suffix
- Resolves filename conflicts automatically (appends a numeric suffix)
- Preserves file metadata with `shutil.copy2()`
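
The collection logic listed above (walk, filter by suffix, rename on conflict, copy with metadata) can be sketched roughly as follows. This is an illustration rather than the project's actual `collect_jrxml.py`: the function name `collect_jrxml`, the `_1`, `_2` suffix scheme, and the case-insensitive suffix match are all assumptions.

```python
import os
import shutil
from pathlib import Path


def collect_jrxml(source_dir: str, target_dir: str) -> list:
    """Copy every .jrxml file under source_dir into target_dir (flat layout)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for root, _dirs, files in os.walk(source_dir):
        for name in sorted(files):
            # Assumption: match the suffix case-insensitively.
            if not name.lower().endswith(".jrxml"):
                continue
            src = Path(root) / name
            dest = target / name
            # Assumed conflict scheme: append _1, _2, ... until the name is free.
            counter = 1
            while dest.exists():
                dest = target / f"{src.stem}_{counter}{src.suffix}"
                counter += 1
            shutil.copy2(src, dest)  # copy2 also copies timestamps and permissions
            copied.append(dest)
    return copied
```

The target directory should sit outside the source tree; otherwise `os.walk()` may revisit files that were just copied.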
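
---

## 3. md_chunker.py: Markdown Semantic Chunking Engine
## 2. jrxml_chunker.py: Core JRXML Semantic Chunking Engine

**Function**: Intelligently chunks Markdown files by structural elements such as heading levels, code blocks, and tables. Called by `batch_chunker.py`; can also be used on its own.
**Function**: Splits a single JRXML file by semantic structure into multiple chunks, each containing a human-readable description, the raw XML, and metadata.

**Chunking strategy**:
- Splits sections by heading level (H1/H2/H3); special section types are detected automatically for H2
- Code blocks and tables become standalone chunks
- Overlong sections are split again by paragraph/sentence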
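
To make the strategy above concrete, here is a deliberately simplified sketch of heading-based splitting with fenced code blocks emitted as standalone chunks. It is not the real `md_chunker.py`: the `chunk_markdown` function and the top-level `heading` key are hypothetical, and table handling, special H2 types, and the secondary split of overlong sections are omitted.

```python
import re

HEADING_RE = re.compile(r"^(#{1,3})\s+(.*)$")
FENCE = "`" * 3  # fenced-code delimiter, built this way so the example embeds cleanly


def chunk_markdown(text: str) -> list:
    """Split Markdown into H1-H3 sections; fenced code blocks become their own chunks."""
    chunks = []
    section = {"chunk_type": "section_h1", "heading": "", "lines": []}
    code_lines = None  # None while outside a fenced code block

    def flush_section():
        body = "\n".join(section["lines"]).strip()
        if section["heading"] or body:
            chunks.append({"chunk_type": section["chunk_type"],
                           "heading": section["heading"],
                           "raw_content": body})

    for line in text.splitlines():
        if line.startswith(FENCE):
            if code_lines is None:
                code_lines = []  # opening fence
            else:
                # Closing fence: the code chunk is emitted immediately,
                # so it precedes the text chunk of its enclosing section.
                chunks.append({"chunk_type": "code",
                               "heading": section["heading"],
                               "raw_content": "\n".join(code_lines)})
                code_lines = None
            continue
        if code_lines is not None:
            code_lines.append(line)  # '#' lines inside code are not headings
            continue
        match = HEADING_RE.match(line)
        if match:
            flush_section()
            section = {"chunk_type": f"section_h{len(match.group(1))}",
                       "heading": match.group(2).strip(),
                       "lines": []}
        else:
            section["lines"].append(line)
    flush_section()
    return chunks
```

Tracking the fence state before testing for headings is the key design point: without it, a `#` comment inside a shell snippet would be mistaken for a new section.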

**Standalone usage**:
```bash
python md_chunker.py doc.md # single file
python md_chunker.py ./docs/ # directory
```

---

## 4. batch_chunker.py: Unified Batch Chunking Entry Point

**Function**: Unified entry point for batch processing of mixed JRXML + Markdown input. **Supports incremental mode**.
**Input**:
- Path to a single JRXML file

**Output**:
- `all_chunks.json`: all chunks merged
- `processing_stats.json`: processing statistics (per-file chunk counts, type distribution)
- A list of `JRXMLChunk` objects, each containing:
  - `chunk_id`: unique identifier
  - `chunk_type`: chunk type (e.g. `query`, `field`, `band_title`)
  - `human_description`: structured, human-readable description
  - `raw_xml`: raw XML fragment
  - `context`: context information (name of the owning report)
  - `metadata`: metadata dictionary

**Full mode** (first build):
**Core classes**:
- `JRXMLChunk`: data structure for a single chunk
- `JRXMLSemanticChunker`: main chunker; supports multiple data source types (SQL, HQL, XPath, JSON, CSV, etc.)

**Chunking strategy**:
- Classifies by XML element type (field, parameter, variable, band, chart, etc.)
- Extracts data source configuration and query statements
- Preserves the hierarchy between elements
- Generates a structured, human-readable description for each chunk

**Usage**:
```bash
python batch_chunker.py ./jrxml_source --output ./jrxml_chunker_output
```
# Process a single file
python jrxml_chunker.py report.jrxml

**Incremental mode** (`--incremental`), appending new files:
```bash
python batch_chunker.py ./jrxml_source --incremental
# Process an entire directory
python jrxml_chunker.py ./jrxml_source/
```

Incremental mode logic:
1. Load the existing `processing_stats.json` to get the list of processed files
2. Scan the input directory and skip already-processed files automatically
3. Chunk only the new files
4. Merge the old and new `all_chunks.json` and statistics, then save
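
Steps 1 and 2 of the incremental logic can be sketched as a small standalone helper. It mirrors the approach visible in the `batch_chunker.py` source in this diff (the keys of `chunks_per_file` are paths relative to the input directory), but the function name `find_new_files` is hypothetical, and the single lower-cased suffix check is a simplification of the original per-extension `rglob` calls.

```python
import json
from pathlib import Path


def find_new_files(input_dir: str, stats_path: str,
                   extensions=(".jrxml", ".md", ".markdown")) -> list:
    """Return supported files under input_dir that the stats file does not list yet."""
    root = Path(input_dir).resolve()
    processed = set()
    stats_file = Path(stats_path)
    if stats_file.exists():
        with open(stats_file, "r", encoding="utf-8") as f:
            # Keys of "chunks_per_file" are paths relative to the input directory.
            processed = set(json.load(f).get("chunks_per_file", {}).keys())
    new_files = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in extensions:
            continue
        if str(path.relative_to(root)) not in processed:
            new_files.append(path)
    return new_files
```

Because files are identified by relative path only, a file that was edited in place after its first run would still be treated as processed; detecting content changes would need something extra, such as a stored hash.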

---

## 5. down_embedding_model.py: Embedding Model Download
## 3. jrxml_banch_chunker.py: Batch Chunking Entry Script

**Function**: Downloads the embedding model from HuggingFace Hub to local storage. Supports a China mirror (`hf-mirror.com`) and resumable downloads.
**Function**: Batch-processes all JRXML files in a directory and generates a statistics report and per-file output.

**Input**:
- JRXML file directory (default: `jrxml_source`)

**Output**:
- `jrxml_chunker_output/all_chunks.json`: merged file of all chunks
- `jrxml_chunker_output/processing_stats.json`: processing statistics (success/failure counts, elapsed time, chunk type distribution)
- `jrxml_chunker_output/per_file/`: separate chunk files grouped by source file

**Core functions**:
- `batch_chunk_with_report()`: batch-processes a directory
- `chunk_single_file_with_report()`: processes a single file

**Usage**:
```bash
# Use the default input directory
python jrxml_banch_chunker.py

# Specify the input directory
python jrxml_banch_chunker.py ./jrxml_source

# Specify the output directory
python jrxml_banch_chunker.py ./jrxml_source --output ./my_output
```

---

## 4. down_embedding_model.py: Embedding Model Download Script

**Function**: Downloads the Qwen3-Embedding-4B embedding model from HuggingFace Hub to local storage.

**Input**:
- HuggingFace model repository: `Qwen/Qwen3-Embedding-4B`

**Output**:
- `models/Qwen3-Embedding-4B/` directory containing the complete model files

**Features**:
- Uses a China mirror to speed up downloads (`hf-mirror.com`)
- Supports resumable downloads
- Installs dependencies automatically

**Usage**:
```bash
python down_embedding_model.py
```

---

## 6. embed_chunks.py: Chunk Vectorization
## 5. embed_chunks.py: Chunk Vectorization Script

**Function**: Converts chunks into vectors. Supports GPU/CPU and FP16 half precision, and **supports incremental mode**.
**Function**: Uses an embedding model to convert the chunked text into vector representations, with GPU acceleration and FP16 half precision supported.

**Input**:
- `jrxml_chunker_output/all_chunks.json` (default)

**Output**:
- `embeddings/embeddings.npy`: vector matrix (float32)
- `embeddings/chunks.json`: original chunks
- `embeddings/chunk_id_map.json` / `chunk_type_map.json`
- `embeddings/embeddings.pkl`: full pickle
- `embeddings/embeddings.npy`: vector matrix (float32)
- `embeddings/chunk_id_map.json`: chunk ID map
- `embeddings/chunk_type_map.json`: chunk type map
- `embeddings/chunks.json`: copy of the original chunks
- `embeddings/embeddings.pkl`: full data pickle

**Full mode** (first vectorization):
**Core functions**:
- `build_text_for_embedding()`: converts a chunk into text suitable for vectorization (concatenates type, description, XML, and metadata)
- `main()`: main flow (load → encode → save → quality check)

**Features**:
- Detects CUDA/CPU automatically
- FP16 half precision enabled by default (saves about 50% of VRAM)
- Supports online models from HuggingFace Hub
- Vector normalization + NaN detection

**Usage**:
```bash
# Use the default settings
python embed_chunks.py
python embed_chunks.py --batch_size 2 # adjust the batch size
python embed_chunks.py --model_path "all-MiniLM-L6-v2" # switch models
python embed_chunks.py --no_fp16 # disable half precision
```

**Incremental mode** (`--incremental` / `-i`), encoding only new chunks:
```bash
python embed_chunks.py --incremental
```
# Specify the model and batch size
python embed_chunks.py --model_path "sentence-transformers/all-MiniLM-L6-v2" --batch_size 64

Incremental mode logic:
1. Load the existing `embeddings.npy` + `chunks.json`
2. Deduplicate by `(context, chunk_id)`
3. Vectorize only the new chunks
4. Merge the old and new data, then save
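
The deduplication in step 2 can be illustrated with a few lines of plain Python. The `(context, chunk_id)` key comes from the description above; the helper name `select_new_chunks` is hypothetical, and the actual encoding and saving steps of `embed_chunks.py` are not reproduced here.

```python
def select_new_chunks(existing_chunks: list, candidate_chunks: list) -> list:
    """Return the candidates whose (context, chunk_id) key has not been seen yet."""
    seen = {(c.get("context"), c.get("chunk_id")) for c in existing_chunks}
    new_chunks = []
    for chunk in candidate_chunks:
        key = (chunk.get("context"), chunk.get("chunk_id"))
        if key not in seen:
            seen.add(key)  # also drops duplicates within the candidates themselves
            new_chunks.append(chunk)
    return new_chunks
```

Only the chunks returned here would need to be encoded. Their vectors would then be appended to the existing matrix in the same order as the chunks are appended to `chunks.json` (for instance with `numpy.vstack`), so that row `i` of the matrix still corresponds to chunk `i`.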
# Use the local Qwen3 model
python embed_chunks.py --batch_size 2

# Disable FP16
python embed_chunks.py --no_fp16 --batch_size 1
```

---

## 7. import_to_chroma.py: Chroma Vector Import
## 6. import_to_chroma.py: Vector Import into the Chroma Database

**Function**: Imports vector data into the persistent Chroma database. **Supports incremental mode**.
**Function**: Imports the generated vectors and chunks into the persistent Chroma vector database.

**Full mode** (first import; deletes the old collection and rebuilds it):
**Input**:
- `embeddings/embeddings.npy`: vector matrix
- `embeddings/chunks.json`: chunk data

**Output**:
- `chroma_db/`: Chroma persistent database directory
- Collection name: `jrxml_chunks` (default)

**Core logic**:
- Load the vectors and chunks
- Initialize the Chroma PersistentClient
- Create the collection (cosine similarity)
- Import in batches (1000 records per batch)
- Extract metadata (chunk_type, report_name, band_name, etc.)
- Run a quick verification query

**Usage**:
```bash
# Use the default settings
python import_to_chroma.py
```

**Incremental mode** (`--incremental` / `-i`), appending new records:
```bash
python import_to_chroma.py --incremental
# Specify paths
python import_to_chroma.py --embeddings_dir ./embeddings --chroma_path ./chroma_db
```

Incremental mode logic:
1. `get_or_create_collection` (existing data is not deleted)
2. Query the IDs already in Chroma
3. Skip already-imported records and append only new data

---

## 8. query_chroma.py: Semantic Search Queries
## 7. query_chroma.py: Semantic Search Query Tool

**Function**: Queries the Chroma database in natural language.
**Function**: Queries the Chroma database in natural language to retrieve relevant JRXML chunks.

**Input**:
- The user's natural-language query
- Optional metadata filter conditions

**Output**:
- Retrieval results ranked by similarity (including chunk type, report name, band, and content summary)

**Core class**:
- `JRXMLSearcher`: the searcher; wraps model loading, vector encoding, and Chroma queries

**Core methods**:
- `search()`: basic semantic search
- `search_with_threshold()`: search with a similarity threshold
- `format_result()`: formats the output

**Two modes**:
- **Single query**: `python query_chroma.py "query text"`
- **Interactive mode**: `python query_chroma.py` (supports consecutive queries and inline commands)
1. **Single command-line query**: `python query_chroma.py "query text"`
2. **Interactive mode**: `python query_chroma.py` (supports consecutive queries and inline commands)

**Interactive mode commands**:
```
filter:<type> filter by chunk_type (e.g. filter:query)
t:<threshold> similarity threshold 0~1 (e.g. t:0.5)
k:<count> number of results to return (e.g. k:10)
filter:<type> filter by chunk_type (e.g. filter:query)
t:<threshold> set the similarity threshold 0~1 (e.g. t:0.5)
k:<count> set the number of results to return (e.g. k:10)
```
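
One way such inline commands could be handled is sketched below. This is a guess at the behavior, not the code in `query_chroma.py`: the function `parse_interactive_input`, the `state` dictionary and its keys, and the assumption that each command is entered on its own line are all hypothetical.

```python
from typing import Optional


def parse_interactive_input(line: str, state: dict) -> Optional[str]:
    """Apply an inline command to state and return None, or return the line as a query."""
    text = line.strip()
    if text.startswith("filter:"):
        # An empty value (just "filter:") clears the chunk_type filter.
        state["filter"] = text[len("filter:"):].strip() or None
        return None
    if text.startswith("t:"):
        value = float(text[2:])
        if not 0.0 <= value <= 1.0:
            raise ValueError("threshold must be between 0 and 1")
        state["threshold"] = value
        return None
    if text.startswith("k:"):
        state["n_results"] = int(text[2:])
        return None
    return text or None  # anything else is treated as the query itself
```

A caller would keep one `state` dictionary for the whole session, feed each input line through this function, and run a search only when a non-`None` query string comes back.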

**Usage examples**:
**Usage**:
```bash
python query_chroma.py # interactive mode
python query_chroma.py "how to change the report title" # single query
python query_chroma.py "how to write SQL" --filter_field query
python query_chroma.py "parameters" --threshold 0.5 --n_results 10
```

---

## 9. config.py: Unified Configuration Management

**Function**: Loads all configuration from `.env`; every script reads its settings through this module.

```bash
python config.py # print the current configuration
```

---

## 10. jrxml_banch_chunker.py: Legacy Entry Point (Deprecated)

**Function**: Batch chunking for JRXML only. Superseded by `batch_chunker.py`; kept for compatibility with the old workflow.

---

## Usage Scenarios

### Scenario A: Building the Database for the First Time

```bash
# 1. Prepare the source files
python collect_jrxml.py
# Place Markdown documents in jrxml_source/ or a directory of your choice

# 2. Full chunking
python batch_chunker.py ./jrxml_source

# 3. Download the model + full vectorization
python down_embedding_model.py
python embed_chunks.py

# 4. Full import
python import_to_chroma.py

# 5. Start querying
# Interactive mode
python query_chroma.py
```

### Scenario B: Adding New Templates/Documents
# Single query
python query_chroma.py "how to change the report title"

```bash
# After placing the new .jrxml / .md files in the source directory:
# Filter by type
python query_chroma.py "how to write SQL" --filter_field query

# 1. Incremental chunking: skips already-processed files automatically
python batch_chunker.py ./jrxml_source --incremental

# 2. Incremental vectorization: encodes only new chunks
python embed_chunks.py --incremental

# 3. Incremental import: appends to the existing database
python import_to_chroma.py --incremental
```

### Scenario C: Switching the Embedding Model

```bash
# 1. Edit .env to change EMBEDDING_MODEL_NAME / EMBEDDING_MODEL_PATH
# 2. Download the new model
python down_embedding_model.py

# 3. Re-vectorize (full)
python embed_chunks.py

# 4. Rebuild the database
python import_to_chroma.py
# Set the threshold and number of results
python query_chroma.py "report parameters" --threshold 0.5 --n_results 10
```

---

## Data Flow
## Data Flow Overview

```
┌─────────────────────┐
│ JRXML templates (.jrxml) │
│ Markdown docs (.md) │
└──────────┬──────────┘
│ collect_jrxml.py / manual placement
▼
┌─────────────────────┐
│ jrxml_source/ │
└──────────┬──────────┘
│ batch_chunker.py [--incremental]
▼
┌─────────────────┐
│ JasperReports │ C:\Users\...\JasperReportsSamples
│ template library │
└────────┬────────┘
│ collect_jrxml.py
▼
┌─────────────────┐
│ jrxml_source/ │ collected JRXML files
└────────┬────────┘
│ jrxml_banch_chunker.py (calls jrxml_chunker.py)
▼
┌──────────────────────┐
│ jrxml_chunker_output/│ all_chunks.json
└──────────┬───────────┘
│ embed_chunks.py [--incremental]
▼
│ jrxml_chunker_output/│ all_chunks.json + per_file/
└────────┬─────────────┘
│ embed_chunks.py (uses Qwen3-Embedding-4B)
▼
┌─────────────────┐
│ embeddings/ │ embeddings.npy + chunks.json
└────────┬────────┘
│ import_to_chroma.py [--incremental]
│ import_to_chroma.py
▼
┌─────────────────┐
│ chroma_db/ │ Chroma vector database
@@ -267,20 +265,18 @@ python import_to_chroma.py
│ query_chroma.py
▼
┌─────────────────┐
│ Natural-language query │ returns relevant chunks
│ User query │ natural language → relevant JRXML chunks
└─────────────────┘
```
|
||||
|
||||
## 依赖关系
|
||||
|
||||
```
|
||||
query_chroma.py ──────────► chromadb, sentence_transformers, torch
|
||||
import_to_chroma.py ──────► chromadb, numpy
|
||||
embed_chunks.py ──────────► sentence_transformers, torch, numpy
|
||||
down_embedding_model.py ──► huggingface_hub
|
||||
batch_chunker.py ─────────► jrxml_chunker.py, md_chunker.py
|
||||
md_chunker.py ────────────► 标准库 (re, json, pathlib)
|
||||
jrxml_chunker.py ─────────► xml.etree.ElementTree (标准库)
|
||||
config.py ────────────────► standard library (os, pathlib)
|
||||
collect_jrxml.py ─────────► standard library (os, shutil)
|
||||
```
|
||||
query_chroma.py ──────► chromadb, sentence_transformers, torch
|
||||
import_to_chroma.py ──► chromadb, numpy
|
||||
embed_chunks.py ──────► sentence_transformers, torch, numpy
|
||||
down_embedding_model.py ► huggingface_hub
|
||||
jrxml_banch_chunker.py ─► jrxml_chunker.py
|
||||
jrxml_chunker.py ─────► xml.etree.ElementTree (standard library)
|
||||
collect_jrxml.py ─────► standard library (os, shutil)
|
||||
```
|
||||
+18
-64
@@ -20,8 +20,7 @@ from config import (
|
||||
def build_text_for_embedding(chunk: dict) -> str:
|
||||
"""
|
||||
Convert a single chunk into text suitable for embedding.
|
||||
Concatenates: type, description, context, key metadata, and a content excerpt.
|
||||
Supports JRXML chunks (raw_xml) and Markdown chunks (raw_content).
|
||||
Concatenates: type, description, context, key metadata, and an XML excerpt.
|
||||
"""
|
||||
parts = [
|
||||
f"[ChunkType: {chunk.get('chunk_type', 'unknown')}]",
|
||||
@@ -31,10 +30,9 @@ def build_text_for_embedding(chunk: dict) -> str:
|
||||
if context:
|
||||
parts.append(f"Context: {context}")
|
||||
|
||||
# Supports two formats: raw_xml (JRXML) and raw_content (Markdown)
|
||||
raw_content = chunk.get('raw_xml', '') or chunk.get('raw_content', '')
|
||||
if raw_content:
|
||||
parts.append(f"Content: {raw_content[:500]}")
|
||||
raw_xml = chunk.get('raw_xml', '')
|
||||
if raw_xml:
|
||||
parts.append(f"XML: {raw_xml[:500]}")
|
||||
|
||||
meta = chunk.get('metadata', {})
|
||||
if meta:
|
||||
@@ -50,16 +48,12 @@ def build_text_for_embedding(chunk: dict) -> str:
|
||||
parts.append(f"Element: {meta['element_kind']}")
|
||||
if 'query_language' in meta:
|
||||
parts.append(f"QueryLang: {meta['query_language']}")
|
||||
if 'language' in meta:
|
||||
parts.append(f"CodeLang: {meta['language']}")
|
||||
if 'heading' in meta:
|
||||
parts.append(f"Section: {meta['heading']}")
|
||||
return "\n".join(parts)
|
||||
|
||||
|
||||
def main(chunks_json_path: str = None, output_dir: str = None,
|
||||
model_path: str = None, batch_size: int = None, normalize: bool = True,
|
||||
use_fp16: bool = None, incremental: bool = False):
|
||||
use_fp16: bool = None):
|
||||
"""
|
||||
Main flow:
|
||||
1. Load the chunk JSON
|
||||
@@ -92,7 +86,7 @@ def main(chunks_json_path: str = None, output_dir: str = None,
|
||||
|
||||
if not chunks_json_path.exists():
|
||||
print(f"❌ Chunks file not found: {chunks_json_path}")
|
||||
print(f" Run batch_chunker.py first to generate chunks")
|
||||
print(f" Run jrxml_banch_chunker.py first to generate chunks")
|
||||
return None
|
||||
|
||||
print(f"\n{'='*60}")
|
||||
@@ -133,33 +127,6 @@ def main(chunks_json_path: str = None, output_dir: str = None,
|
||||
print(f" GPU: {torch.cuda.get_device_name(0)}")
|
||||
print(f" GPU memory: {torch.cuda.memory_allocated(0)/1024**3:.2f} GB / {torch.cuda.get_device_properties(0).total_memory/1024**3:.2f} GB")
|
||||
|
||||
# Incremental mode: load existing vectors and process only new chunks
|
||||
existing_chunks = []
|
||||
existing_embeddings = None
|
||||
if incremental:
|
||||
existing_chunks_path = output_dir / "chunks.json"
|
||||
existing_emb_path = output_dir / "embeddings.npy"
|
||||
if existing_chunks_path.exists() and existing_emb_path.exists():
|
||||
with open(existing_chunks_path, 'r', encoding='utf-8') as f:
|
||||
existing_chunks = json.load(f)
|
||||
existing_embeddings = np.load(existing_emb_path)
|
||||
existing_keys = {(c.get('context', ''), c.get('chunk_id', -1)) for c in existing_chunks}
|
||||
new_chunks = [c for c in chunks if (c.get('context', ''), c.get('chunk_id', -1)) not in existing_keys]
|
||||
skipped = len(chunks) - len(new_chunks)
|
||||
print(f"\n🔄 Incremental mode: {len(existing_chunks)} existing chunks, skipped {skipped} duplicates, {len(new_chunks)} new")
|
||||
chunks = new_chunks
|
||||
else:
|
||||
print(f"\n🔄 Incremental mode: no existing vector data found, falling back to a full run")
|
||||
incremental = False
|
||||
|
||||
if not chunks:
|
||||
print("✅ No new chunks to embed")
|
||||
return {
|
||||
"chunks": len(existing_chunks),
|
||||
"embedding_dim": existing_embeddings.shape[1] if existing_embeddings is not None else 0,
|
||||
"output_dir": str(output_dir)
|
||||
}
|
||||
|
||||
print(f"\n🛠️ Building text representations...")
|
||||
texts = []
|
||||
chunk_ids = []
|
||||
@@ -180,52 +147,42 @@ def main(chunks_json_path: str = None, output_dir: str = None,
|
||||
)
|
||||
print(f" Embeddings shape: {embeddings.shape}")
|
||||
|
||||
# Merge with the existing vectors
|
||||
if existing_embeddings is not None and len(existing_chunks) > 0:
|
||||
all_embeddings = np.concatenate([existing_embeddings, embeddings], axis=0)
|
||||
all_chunks = existing_chunks + chunks
|
||||
else:
|
||||
all_embeddings = embeddings
|
||||
all_chunks = chunks
|
||||
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
np.save(output_dir / "embeddings.npy", all_embeddings.astype('float32'))
|
||||
all_chunk_ids = [c.get('chunk_id', -1) for c in all_chunks]
|
||||
all_chunk_types = [c.get('chunk_type', 'unknown') for c in all_chunks]
|
||||
np.save(output_dir / "embeddings.npy", embeddings.astype('float32'))
|
||||
with open(output_dir / "chunk_id_map.json", 'w', encoding='utf-8') as f:
|
||||
json.dump(all_chunk_ids, f, ensure_ascii=False, indent=2)
|
||||
json.dump(chunk_ids, f, ensure_ascii=False, indent=2)
|
||||
with open(output_dir / "chunk_type_map.json", 'w', encoding='utf-8') as f:
|
||||
json.dump(all_chunk_types, f, ensure_ascii=False, indent=2)
|
||||
json.dump(chunk_types, f, ensure_ascii=False, indent=2)
|
||||
with open(output_dir / "chunks.json", 'w', encoding='utf-8') as f:
|
||||
json.dump(all_chunks, f, ensure_ascii=False, indent=2)
|
||||
json.dump(chunks, f, ensure_ascii=False, indent=2)
|
||||
with open(output_dir / "embeddings.pkl", 'wb') as f:
|
||||
pickle.dump({
|
||||
'chunks': all_chunks,
|
||||
'embeddings': all_embeddings,
|
||||
'chunks': chunks,
|
||||
'embeddings': embeddings,
|
||||
'texts': texts,
|
||||
'normalized': normalize
|
||||
}, f)
|
||||
|
||||
nan_count = np.isnan(all_embeddings).sum()
|
||||
nan_count = np.isnan(embeddings).sum()
|
||||
print(f"\n📊 Quality check:")
|
||||
print(f" NaN values: {nan_count}")
|
||||
norms = np.linalg.norm(all_embeddings, axis=1)
|
||||
norms = np.linalg.norm(embeddings, axis=1)
|
||||
print(f" Norms: min={norms.min():.4f}, max={norms.max():.4f}, mean={norms.mean():.4f}")
|
||||
|
||||
print(f"\n✅ Vector data saved to: {output_dir}/")
|
||||
print(f" Files: embeddings.npy, chunk_id_map.json, chunk_type_map.json, chunks.json, embeddings.pkl")
|
||||
|
||||
type_counts = {}
|
||||
for ct in all_chunk_types:
|
||||
for ct in chunk_types:
|
||||
type_counts[ct] = type_counts.get(ct, 0) + 1
|
||||
print(f"\n📈 Chunk type distribution:")
|
||||
for ct, count in sorted(type_counts.items(), key=lambda x: -x[1]):
|
||||
print(f" {ct}: {count}")
|
||||
|
||||
return {
|
||||
"chunks": len(all_chunks),
|
||||
"embedding_dim": all_embeddings.shape[1],
|
||||
"chunks": len(chunks),
|
||||
"embedding_dim": embeddings.shape[1],
|
||||
"output_dir": str(output_dir)
|
||||
}
|
||||
|
||||
@@ -248,8 +205,6 @@ if __name__ == "__main__":
|
||||
help="Skip vector normalization")
|
||||
parser.add_argument("--no_fp16", action="store_true",
|
||||
help="Disable FP16 half precision (enabled by default; saves roughly 50%% of GPU memory)")
|
||||
parser.add_argument("--incremental", "-i", action="store_true",
|
||||
help="Incremental mode: embed only new chunks and append them to the existing vector data")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
@@ -259,6 +214,5 @@ if __name__ == "__main__":
|
||||
model_path=args.model_path,
|
||||
batch_size=args.batch_size,
|
||||
normalize=not args.no_normalize,
|
||||
use_fp16=not args.no_fp16,
|
||||
incremental=args.incremental
|
||||
use_fp16=not args.no_fp16
|
||||
)
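The incremental branch in this file identifies a chunk by the pair `(context, chunk_id)` and embeds only chunks whose pair is not yet stored. A minimal, self-contained sketch of that selection step follows. The function names `chunk_key` and `select_new_chunks` are hypothetical, and the sample dictionaries are illustrative data rather than real pipeline output.

```python
def chunk_key(chunk: dict) -> tuple:
    """Identity of a chunk: (context, chunk_id), mirroring the incremental branch."""
    return (chunk.get("context", ""), chunk.get("chunk_id", -1))


def select_new_chunks(existing: list, incoming: list) -> list:
    """Return the incoming chunks whose key is not already present."""
    existing_keys = {chunk_key(c) for c in existing}
    return [c for c in incoming if chunk_key(c) not in existing_keys]


# Illustrative data only.
existing = [{"context": "a.jrxml", "chunk_id": 0}]
incoming = [
    {"context": "a.jrxml", "chunk_id": 0},  # already embedded, skipped
    {"context": "a.jrxml", "chunk_id": 1},  # new chunk id
    {"context": "b.jrxml", "chunk_id": 0},  # same id, different context
]
new_chunks = select_new_chunks(existing, incoming)
print(len(new_chunks))  # 2
```

One consequence of keying on `(context, chunk_id)` alone: a chunk whose content changed but whose key stayed the same is treated as already embedded, so a full run may be needed after editing source files.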
|
||||
+17
-59
@@ -1,7 +1,6 @@
|
||||
"""
|
||||
import_to_chroma.py
|
||||
Import chunk vectors into the Chroma database
|
||||
Supports importing JRXML chunks and Markdown chunks together
|
||||
Import previously generated chunk vectors into the Chroma database
|
||||
"""
|
||||
|
||||
import os
|
||||
@@ -17,8 +16,7 @@ from config import EMBEDDINGS_DIR, CHROMA_DB_PATH, CHROMA_COLLECTION_NAME
|
||||
|
||||
def main(embeddings_dir: str = None,
|
||||
chroma_path: str = None,
|
||||
collection_name: str = None,
|
||||
incremental: bool = False):
|
||||
collection_name: str = None):
|
||||
"""
|
||||
Read vectors and chunks from the embeddings directory and import them into the persistent Chroma database
|
||||
|
||||
@@ -71,55 +69,33 @@ def main(embeddings_dir: str = None,
|
||||
chroma_path.mkdir(parents=True, exist_ok=True)
|
||||
client = chromadb.PersistentClient(path=str(chroma_path))
|
||||
|
||||
if incremental:
|
||||
try:
|
||||
collection = client.get_collection(collection_name)
|
||||
existing_ids = set(collection.get()['ids'])
|
||||
print(f" Incremental mode: collection '{collection_name}' already has {len(existing_ids)} records")
|
||||
except Exception:
|
||||
collection = client.create_collection(
|
||||
name=collection_name,
|
||||
metadata={"hnsw:space": "cosine"}
|
||||
)
|
||||
existing_ids = set()
|
||||
print(f" Incremental mode: created new collection '{collection_name}'")
|
||||
else:
|
||||
try:
|
||||
client.delete_collection(collection_name)
|
||||
print(f" Deleted old collection '{collection_name}'")
|
||||
except Exception:
|
||||
pass
|
||||
collection = client.create_collection(
|
||||
name=collection_name,
|
||||
metadata={"hnsw:space": "cosine"}
|
||||
)
|
||||
existing_ids = set()
|
||||
try:
|
||||
client.delete_collection(collection_name)
|
||||
print(f" Deleted old collection '{collection_name}'")
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
collection = client.create_collection(
|
||||
name=collection_name,
|
||||
metadata={"hnsw:space": "cosine"}
|
||||
)
|
||||
|
||||
print(f"\n🛠️ Preparing data for import...")
|
||||
ids = []
|
||||
documents = []
|
||||
metadatas = []
|
||||
embeddings_list = []
|
||||
skipped = 0
|
||||
|
||||
seen_ids = {}
|
||||
for i, chunk in enumerate(tqdm(chunks, desc="Preparing data")):
|
||||
raw_id = str(chunk.get("chunk_id", i))
|
||||
context = chunk.get("context", "")
|
||||
|
||||
if raw_id in seen_ids:
|
||||
seen_ids[raw_id] += 1
|
||||
unique_chunk_id = f"{raw_id}_{seen_ids[raw_id]}"
|
||||
chunk_id = f"{raw_id}_{seen_ids[raw_id]}"
|
||||
else:
|
||||
seen_ids[raw_id] = 0
|
||||
unique_chunk_id = raw_id
|
||||
|
||||
# Incremental mode: skip records that were already imported
|
||||
if incremental and unique_chunk_id in existing_ids:
|
||||
skipped += 1
|
||||
continue
|
||||
|
||||
ids.append(unique_chunk_id)
|
||||
chunk_id = raw_id
|
||||
ids.append(chunk_id)
|
||||
|
||||
doc_text = chunk.get("human_description", "")
|
||||
documents.append(doc_text)
|
||||
@@ -129,6 +105,7 @@ def main(embeddings_dir: str = None,
|
||||
if chunk_type:
|
||||
meta["chunk_type"] = chunk_type
|
||||
|
||||
context = chunk.get("context", "")
|
||||
if context:
|
||||
meta["context"] = context
|
||||
|
||||
@@ -141,26 +118,10 @@ def main(embeddings_dir: str = None,
|
||||
meta["element_kind"] = chunk_meta["element_kind"]
|
||||
if "query_language" in chunk_meta:
|
||||
meta["query_language"] = chunk_meta["query_language"]
|
||||
# Markdown-specific metadata
|
||||
if "heading" in chunk_meta:
|
||||
meta["heading"] = chunk_meta["heading"]
|
||||
if "heading_level" in chunk_meta:
|
||||
meta["heading_level"] = chunk_meta["heading_level"]
|
||||
if "language" in chunk_meta:
|
||||
meta["code_language"] = chunk_meta["language"]
|
||||
|
||||
metadatas.append(meta)
|
||||
embeddings_list.append(embeddings[i].tolist())
|
||||
|
||||
if incremental and skipped > 0:
|
||||
print(f" Incremental mode: skipped {skipped} existing records")
|
||||
|
||||
if not ids:
|
||||
print(f"\n✅ No new data to import; the collection is already up to date")
|
||||
print(f" Database path: {chroma_path}")
|
||||
print(f" Records in collection: {collection.count()}")
|
||||
return collection
|
||||
|
||||
print(f"\n📥 Importing into Chroma in batches (1000 per batch)...")
|
||||
import_batch_size = 1000
|
||||
start_time = time.time()
|
||||
@@ -212,14 +173,11 @@ if __name__ == "__main__":
|
||||
help=f"Chroma database path (default: {CHROMA_DB_PATH})")
|
||||
parser.add_argument("--collection_name", "-n", default=CHROMA_COLLECTION_NAME,
|
||||
help=f"Collection name (default: {CHROMA_COLLECTION_NAME})")
|
||||
parser.add_argument("--incremental", "-i", action="store_true",
|
||||
help="Incremental mode: import only new records without deleting existing data")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
main(
|
||||
embeddings_dir=args.embeddings_dir,
|
||||
chroma_path=args.chroma_path,
|
||||
collection_name=args.collection_name,
|
||||
incremental=args.incremental
|
||||
collection_name=args.collection_name
|
||||
)
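Chroma requires unique record IDs, while `chunk_id` values restart for every source file, so the import loop above appends a counter suffix to repeated IDs. A minimal sketch of that disambiguation, isolated from the Chroma calls, is shown here. The function name `make_unique_ids` is hypothetical.

```python
def make_unique_ids(raw_ids: list) -> list:
    """Suffix repeated IDs with _1, _2, ... in order of appearance."""
    seen = {}  # raw id -> number of repeats seen so far
    unique = []
    for raw_id in raw_ids:
        if raw_id in seen:
            seen[raw_id] += 1
            unique.append(f"{raw_id}_{seen[raw_id]}")
        else:
            seen[raw_id] = 0
            unique.append(raw_id)
    return unique


print(make_unique_ids(["0", "1", "0", "0"]))  # ['0', '1', '0_1', '0_2']
```

The resulting IDs depend on the order of the input list, so an incremental import only matches earlier records if chunks are presented in the same order. A raw ID that already looks like `0_1` could also collide with a generated one; including the `context` in the ID would be one way to avoid both issues, though that is a design suggestion rather than what the script does.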
|
||||
-358
@@ -1,358 +0,0 @@
|
||||
"""
|
||||
md_chunker.py
|
||||
Markdown semantic chunker
|
||||
Supports structure-aware chunking of heading levels, code blocks, tables, and similar elements
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
from typing import List, Dict, Tuple
|
||||
from pathlib import Path
|
||||
from dataclasses import dataclass, field, asdict
|
||||
|
||||
|
||||
@dataclass
|
||||
class MDChunk:
|
||||
"""Single Markdown chunk data structure"""
|
||||
chunk_id: int
|
||||
chunk_type: str
|
||||
human_description: str
|
||||
raw_content: str
|
||||
context: str
|
||||
metadata: Dict = field(default_factory=dict)
|
||||
|
||||
|
||||
class MarkdownSemanticChunker:
|
||||
"""
|
||||
Markdown semantic chunker v1.0
|
||||
Chunking strategy:
|
||||
1. Split into large sections by heading level (H1/H2/H3...)
|
||||
2. Each code block becomes its own chunk
|
||||
3. Each table becomes its own chunk
|
||||
4. Overlong sections are split again by sentence/paragraph
|
||||
"""
|
||||
|
||||
# Heading patterns
|
||||
HEADING_PATTERN = re.compile(r'^(#{1,6})\s+(.+)$', re.MULTILINE)
|
||||
|
||||
# Code block pattern (fenced)
|
||||
CODE_BLOCK_PATTERN = re.compile(r'```(\w*)\n([\s\S]*?)```', re.MULTILINE)
|
||||
|
||||
# Inline code pattern
|
||||
INLINE_CODE_PATTERN = re.compile(r'`([^`]+)`')
|
||||
|
||||
# Table pattern
|
||||
TABLE_PATTERN = re.compile(r'\|.+\|\n\|[-| :]+\|\n((?:\|.+\|\n)*)', re.MULTILINE)
|
||||
|
||||
# List pattern
|
||||
LIST_PATTERN = re.compile(r'^(\s*[-*+]\s+.+)+', re.MULTILINE)
|
||||
|
||||
def __init__(self, max_chunk_size: int = 2000):
|
||||
self.max_chunk_size = max_chunk_size
|
||||
|
||||
def chunk_file(self, file_path: str) -> List[Dict]:
|
||||
"""Process a single Markdown file"""
|
||||
if not os.path.exists(file_path):
|
||||
raise FileNotFoundError(f"File not found: {file_path}")
|
||||
|
||||
with open(file_path, 'r', encoding='utf-8') as f:
|
||||
content = f.read()
|
||||
|
||||
file_name = Path(file_path).stem
|
||||
chunks = []
|
||||
chunk_id = 0
|
||||
|
||||
# Try to extract the document title (the first H1)
|
||||
title_match = re.search(r'^#\s+(.+)$', content, re.MULTILINE)
|
||||
doc_title = title_match.group(1).strip() if title_match else file_name
|
||||
|
||||
# Split by structural elements
|
||||
segments = self._split_by_structure(content)
|
||||
|
||||
for segment in segments:
|
||||
seg_type = segment['type']
|
||||
seg_content = segment['content']
|
||||
|
||||
if not seg_content.strip():
|
||||
continue
|
||||
|
||||
# Build the description
|
||||
description = self._build_description(seg_type, seg_content, doc_title)
|
||||
|
||||
# If it exceeds the maximum length, try a second split
|
||||
if len(seg_content) > self.max_chunk_size:
|
||||
sub_chunks = self._split_large_chunk(
|
||||
seg_content, seg_type, doc_title, chunk_id
|
||||
)
|
||||
chunks.extend([asdict(c) for c in sub_chunks])
|
||||
chunk_id += len(sub_chunks)
|
||||
else:
|
||||
chunks.append(asdict(MDChunk(
|
||||
chunk_id=chunk_id,
|
||||
chunk_type=seg_type,
|
||||
human_description=description,
|
||||
raw_content=seg_content.strip(),
|
||||
context=f"{doc_title}",
|
||||
metadata=segment.get('metadata', {})
|
||||
)))
|
||||
chunk_id += 1
|
||||
|
||||
return chunks
|
||||
|
||||
def _split_by_structure(self, content: str) -> List[Dict]:
|
||||
"""
|
||||
Split the content by Markdown structure
|
||||
Returns: [{'type': 'h1/h2/code/table/paragraph', 'content': '...', 'metadata': {...}}]
|
||||
"""
|
||||
segments = []
|
||||
|
||||
# First extract all code blocks (keeping their positions for later processing)
|
||||
code_blocks = []
|
||||
code_pattern = re.compile(r'(```\w*\n[\s\S]*?```)', re.MULTILINE)
|
||||
|
||||
last_end = 0
|
||||
for match in code_pattern.finditer(content):
|
||||
# Process the plain text before the code block
|
||||
before = content[last_end:match.start()]
|
||||
if before.strip():
|
||||
segments.extend(self._process_text_section(before))
|
||||
|
||||
# Add the code block
|
||||
code_blocks.append(match.group(1))
|
||||
lang_match = re.match(r'```(\w*)', match.group(1))
|
||||
lang = lang_match.group(1) if lang_match else ''
|
||||
segments.append({
|
||||
'type': 'code',
|
||||
'content': match.group(1),
|
||||
'metadata': {'language': lang}
|
||||
})
|
||||
last_end = match.end()
|
||||
|
||||
# Process the remaining text
|
||||
remaining = content[last_end:]
|
||||
if remaining.strip():
|
||||
segments.extend(self._process_text_section(remaining))
|
||||
|
||||
return segments
|
||||
|
||||
def _process_text_section(self, text: str) -> List[Dict]:
|
||||
"""Process a plain-text region, extracting headings and sections"""
|
||||
segments = []
|
||||
|
||||
# Split by headings
|
||||
lines = text.split('\n')
|
||||
current_section = []
|
||||
current_heading_level = 0
|
||||
current_heading = ''
|
||||
|
||||
for line in lines:
|
||||
heading_match = re.match(r'^(#{1,6})\s+(.+)', line)
|
||||
if heading_match:
|
||||
# Save the previous section
|
||||
if current_section:
|
||||
section_text = '\n'.join(current_section).strip()
|
||||
if section_text:
|
||||
segments.append({
|
||||
'type': self._get_section_type(current_heading_level, current_heading),
|
||||
'content': section_text,
|
||||
'metadata': {
|
||||
'heading': current_heading,
|
||||
'heading_level': current_heading_level
|
||||
}
|
||||
})
|
||||
current_section = []
|
||||
|
||||
# Start a new heading section
|
||||
current_heading_level = len(heading_match.group(1))
|
||||
current_heading = heading_match.group(2).strip()
|
||||
else:
|
||||
current_section.append(line)
|
||||
|
||||
# Save the final section
|
||||
if current_section:
|
||||
section_text = '\n'.join(current_section).strip()
|
||||
if section_text:
|
||||
segments.append({
|
||||
'type': self._get_section_type(current_heading_level, current_heading),
|
||||
'content': section_text,
|
||||
'metadata': {
|
||||
'heading': current_heading,
|
||||
'heading_level': current_heading_level
|
||||
}
|
||||
})
|
||||
|
||||
return segments
|
||||
|
||||
def _get_section_type(self, level: int, heading: str) -> str:
|
||||
"""Determine the section type from the heading level and text"""
|
||||
heading_lower = heading.lower()
|
||||
|
||||
if level == 1:
|
||||
return 'section_h1'
|
||||
elif level == 2:
|
||||
# Detect special section types
|
||||
if any(kw in heading_lower for kw in ['install', '安装', 'setup', '部署']):
|
||||
return 'section_installation'
|
||||
elif any(kw in heading_lower for kw in ['config', '配置', 'setting']):
|
||||
return 'section_configuration'
|
||||
elif any(kw in heading_lower for kw in ['api', '接口']):
|
||||
return 'section_api'
|
||||
elif any(kw in heading_lower for kw in ['example', '示例', 'usage', '使用']):
|
||||
return 'section_example'
|
||||
elif any(kw in heading_lower for kw in ['faq', 'question', '问题', '常见']):
|
||||
return 'section_faq'
|
||||
elif any(kw in heading_lower for kw in ['changelog', '更新', 'release']):
|
||||
return 'section_changelog'
|
||||
return 'section_h2'
|
||||
elif level == 3:
|
||||
return 'section_h3'
|
||||
else:
|
||||
return 'section_other'
|
||||
|
||||
def _build_description(self, chunk_type: str, content: str, doc_title: str) -> str:
|
||||
"""Generate a human-readable description for a chunk"""
|
||||
lines = content.split('\n')[:5]
|
||||
preview = ' '.join(line.strip() for line in lines if line.strip())[:150]
|
||||
|
||||
if chunk_type == 'code':
|
||||
lang = ''
|
||||
lang_match = re.match(r'```(\w*)', content)
|
||||
if lang_match:
|
||||
lang = lang_match.group(1) or 'text'
|
||||
return f"Code block (language: {lang}) in {doc_title}. Preview: {preview}"
|
||||
|
||||
elif chunk_type.startswith('section_'):
|
||||
heading = content.split('\n')[0] if '\n' in content else content[:50]
|
||||
heading_clean = re.sub(r'^#+\s+', '', heading)
|
||||
type_hint = chunk_type.replace('section_', '')
|
||||
return f"[{type_hint.upper()}] {heading_clean}. Content: {preview}"
|
||||
|
||||
else:
|
||||
return f"Document section in {doc_title}. Content: {preview}"
|
||||
|
||||
def _split_large_chunk(self, content: str, chunk_type: str,
|
||||
doc_title: str, start_id: int) -> List[MDChunk]:
|
||||
"""Split an overlong chunk a second time"""
|
||||
chunks = []
|
||||
|
||||
# Split by paragraph (blank lines)
|
||||
paragraphs = re.split(r'\n\n+', content)
|
||||
current_chunk = []
|
||||
current_size = 0
|
||||
|
||||
for para in paragraphs:
|
||||
para_size = len(para)
|
||||
|
||||
if current_size + para_size > self.max_chunk_size and current_chunk:
|
||||
# The current block is full; emit a chunk
|
||||
chunk_text = '\n\n'.join(current_chunk)
|
||||
chunks.append(MDChunk(
|
||||
chunk_id=start_id + len(chunks),
|
||||
chunk_type=f"{chunk_type}_part",
|
||||
human_description=f"Part of {doc_title} ({chunk_type}): {chunk_text[:100]}...",
|
||||
raw_content=chunk_text,
|
||||
context=f"{doc_title} (continued)",
|
||||
metadata={'part': len(chunks) + 1, 'original_type': chunk_type}
|
||||
))
|
||||
current_chunk = []
|
||||
current_size = 0
|
||||
|
||||
current_chunk.append(para)
|
||||
current_size += para_size + 2
|
||||
|
||||
# Handle the remaining content
|
||||
if current_chunk:
|
||||
chunk_text = '\n\n'.join(current_chunk)
|
||||
chunks.append(MDChunk(
|
||||
chunk_id=start_id + len(chunks),
|
||||
chunk_type=f"{chunk_type}_part",
|
||||
human_description=f"Part of {doc_title} ({chunk_type}): {chunk_text[:100]}...",
|
||||
raw_content=chunk_text,
|
||||
context=f"{doc_title} (continued)",
|
||||
metadata={'part': len(chunks) + 1, 'original_type': chunk_type}
|
||||
))
|
||||
|
||||
return chunks if chunks else [MDChunk(
|
||||
chunk_id=start_id,
|
||||
chunk_type=chunk_type,
|
||||
human_description=f"{doc_title}: {content[:100]}...",
|
||||
raw_content=content[:self.max_chunk_size],
|
||||
context=doc_title,
|
||||
metadata={'truncated': True}
|
||||
)]
|
||||
|
||||
def chunk_directory(self, dir_path: str, extensions: tuple = ('.md', '.markdown')) -> List[Dict]:
|
||||
"""Process every Markdown file under a directory"""
|
||||
all_chunks = []
|
||||
file_count = 0
|
||||
|
||||
for root, _, files in os.walk(dir_path):
|
||||
for file in files:
|
||||
if file.lower().endswith(extensions):
|
||||
file_path = os.path.join(root, file)
|
||||
try:
|
||||
chunks = self.chunk_file(file_path)
|
||||
all_chunks.extend(chunks)
|
||||
file_count += 1
|
||||
print(f"OK {file_path}: {len(chunks)} chunks")
|
||||
except Exception as e:
|
||||
print(f"FAIL {file_path}: {e}")
|
||||
|
||||
print(f"\nTotal: {file_count} files, {len(all_chunks)} chunks")
|
||||
return all_chunks
|
||||
|
||||
|
||||
def save_chunks_to_json(chunks: List[Dict], output_path: str):
|
||||
"""Save chunks to a JSON file"""
|
||||
with open(output_path, 'w', encoding='utf-8') as f:
|
||||
json.dump(chunks, f, ensure_ascii=False, indent=2)
|
||||
print(f"Saved {len(chunks)} chunks to {output_path}")
|
||||
|
||||
|
||||
def print_chunk_summary(chunks: List[Dict]):
|
||||
"""Print per-type chunk counts"""
|
||||
type_counts = {}
|
||||
for chunk in chunks:
|
||||
chunk_type = chunk["chunk_type"]
|
||||
type_counts[chunk_type] = type_counts.get(chunk_type, 0) + 1
|
||||
|
||||
print("\nChunk Type Summary:")
|
||||
for chunk_type, count in sorted(type_counts.items(), key=lambda x: -x[1]):
|
||||
print(f" {chunk_type}: {count}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import sys
|
||||
|
||||
chunker = MarkdownSemanticChunker(max_chunk_size=2000)
|
||||
|
||||
if len(sys.argv) > 1:
|
||||
path = sys.argv[1]
|
||||
if os.path.isdir(path):
|
||||
all_chunks = chunker.chunk_directory(path)
|
||||
output_path = os.path.join(os.path.dirname(path.rstrip("/\\")) if os.path.dirname(path) else ".",
|
||||
os.path.basename(path.rstrip("/\\")) + "_md_chunks.json")
|
||||
save_chunks_to_json(all_chunks, output_path)
|
||||
print_chunk_summary(all_chunks)
|
||||
else:
|
||||
chunks = chunker.chunk_file(path)
|
||||
output_path = path.replace(".md", "_chunks.json").replace(".markdown", "_chunks.json")
|
||||
save_chunks_to_json(chunks, output_path)
|
||||
|
||||
print(f"\n{'='*60}")
|
||||
print("Chunking Results Preview")
|
||||
print(f"{'='*60}")
|
||||
for chunk in chunks[:10]:
|
||||
print(f"\n[Chunk {chunk['chunk_id']}] Type: {chunk['chunk_type']}")
|
||||
print(f"Description: {chunk['human_description'][:120]}...")
|
||||
print(f"Content length: {len(chunk['raw_content'])} chars")
|
||||
if len(chunks) > 10:
|
||||
print(f"\n... and {len(chunks) - 10} more chunks")
|
||||
|
||||
print_chunk_summary(chunks)
|
||||
else:
|
||||
print("=" * 60)
|
||||
print("Markdown Semantic Chunking v1.0")
|
||||
print("=" * 60)
|
||||
print("\nUsage: python md_chunker.py <md_file_or_directory>")
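The heading-based pass in `_process_text_section` groups body lines under the most recent heading. The sketch below is a simplified, standalone version of that idea, not the file's exact code: the names `split_by_headings` and `_emit` are hypothetical, and the section-type classification and code-block handling are left out.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+)")


def _emit(sections: list, heading: str, level: int, body: list) -> None:
    """Append a section if its body has any non-whitespace content."""
    content = "\n".join(body).strip()
    if content:
        sections.append(
            {"heading": heading, "heading_level": level, "content": content}
        )


def split_by_headings(text: str) -> list:
    """Group lines under the most recent heading; heading lines are metadata only."""
    sections = []
    heading, level, body = "", 0, []
    for line in text.split("\n"):
        match = HEADING.match(line)
        if match:
            _emit(sections, heading, level, body)
            heading, level, body = match.group(2).strip(), len(match.group(1)), []
        else:
            body.append(line)
    _emit(sections, heading, level, body)
    return sections


doc = "# Title\nintro\n\n## Setup\nstep one\nstep two\n## Empty\n"
for section in split_by_headings(doc):
    print(section["heading_level"], section["heading"], repr(section["content"]))
```

As in the original method, a heading with no body text (here `## Empty`) produces no section, and the heading text lives only in the metadata rather than in the content.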
|
||||
@@ -0,0 +1,36 @@
|
||||
*.7z filter=lfs diff=lfs merge=lfs -text
|
||||
*.arrow filter=lfs diff=lfs merge=lfs -text
|
||||
*.bin filter=lfs diff=lfs merge=lfs -text
|
||||
*.bz2 filter=lfs diff=lfs merge=lfs -text
|
||||
*.ckpt filter=lfs diff=lfs merge=lfs -text
|
||||
*.ftz filter=lfs diff=lfs merge=lfs -text
|
||||
*.gz filter=lfs diff=lfs merge=lfs -text
|
||||
*.h5 filter=lfs diff=lfs merge=lfs -text
|
||||
*.joblib filter=lfs diff=lfs merge=lfs -text
|
||||
*.lfs.* filter=lfs diff=lfs merge=lfs -text
|
||||
*.mlmodel filter=lfs diff=lfs merge=lfs -text
|
||||
*.model filter=lfs diff=lfs merge=lfs -text
|
||||
*.msgpack filter=lfs diff=lfs merge=lfs -text
|
||||
*.npy filter=lfs diff=lfs merge=lfs -text
|
||||
*.npz filter=lfs diff=lfs merge=lfs -text
|
||||
*.onnx filter=lfs diff=lfs merge=lfs -text
|
||||
*.ot filter=lfs diff=lfs merge=lfs -text
|
||||
*.parquet filter=lfs diff=lfs merge=lfs -text
|
||||
*.pb filter=lfs diff=lfs merge=lfs -text
|
||||
*.pickle filter=lfs diff=lfs merge=lfs -text
|
||||
*.pkl filter=lfs diff=lfs merge=lfs -text
|
||||
*.pt filter=lfs diff=lfs merge=lfs -text
|
||||
*.pth filter=lfs diff=lfs merge=lfs -text
|
||||
*.rar filter=lfs diff=lfs merge=lfs -text
|
||||
*.safetensors filter=lfs diff=lfs merge=lfs -text
|
||||
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
||||
*.tar.* filter=lfs diff=lfs merge=lfs -text
|
||||
*.tar filter=lfs diff=lfs merge=lfs -text
|
||||
*.tflite filter=lfs diff=lfs merge=lfs -text
|
||||
*.tgz filter=lfs diff=lfs merge=lfs -text
|
||||
*.wasm filter=lfs diff=lfs merge=lfs -text
|
||||
*.xz filter=lfs diff=lfs merge=lfs -text
|
||||
*.zip filter=lfs diff=lfs merge=lfs -text
|
||||
*.zst filter=lfs diff=lfs merge=lfs -text
|
||||
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
||||
tokenizer.json filter=lfs diff=lfs merge=lfs -text
|
||||
@@ -0,0 +1,10 @@
|
||||
{
|
||||
"word_embedding_dimension": 2560,
|
||||
"pooling_mode_cls_token": false,
|
||||
"pooling_mode_mean_tokens": false,
|
||||
"pooling_mode_max_tokens": false,
|
||||
"pooling_mode_mean_sqrt_len_tokens": false,
|
||||
"pooling_mode_weightedmean_tokens": false,
|
||||
"pooling_mode_lasttoken": true,
|
||||
"include_prompt": true
|
||||
}
|
||||
@@ -0,0 +1,291 @@
|
||||
---
|
||||
license: apache-2.0
|
||||
base_model:
|
||||
- Qwen/Qwen3-4B-Base
|
||||
tags:
|
||||
- transformers
|
||||
- sentence-transformers
|
||||
- sentence-similarity
|
||||
- feature-extraction
|
||||
- text-embeddings-inference
|
||||
---
|
||||
# Qwen3-Embedding-4B
|
||||
|
||||
<p align="center">
|
||||
<img src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/logo_qwen3.png" width="400"/>
|
||||
</p>
|
||||
|
||||
## Highlights
|
||||
|
||||
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundational model. The Qwen3 Embedding series represents significant advancements in multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining.
|
||||
|
||||
**Exceptional Versatility**: The embedding model has achieved state-of-the-art performance across a wide range of downstream application evaluations. The 8B size embedding model ranks **No.1** in the MTEB multilingual leaderboard (as of June 5, 2025, score **70.58**), while the reranking model excels in various text retrieval scenarios.
|
||||
|
||||
**Comprehensive Flexibility**: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to diverse use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios.
|
||||
|
||||
**Multilingual Capability**: The Qwen3 Embedding series offers support for over 100 languages, thanks to the multilingual capabilities of Qwen3 models. This includes various programming languages, and provides robust multilingual, cross-lingual, and code retrieval capabilities.
|
||||
|
||||
## Model Overview
|
||||
|
||||
**Qwen3-Embedding-4B** has the following features:
|
||||
|
||||
- Model Type: Text Embedding
|
||||
- Supported Languages: 100+ Languages
|
||||
- Number of Parameters: 4B
|
||||
- Context Length: 32k
|
||||
- Embedding Dimension: Up to 2560, supports user-defined output dimensions ranging from 32 to 2560
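To illustrate what a user-defined output dimension means in practice, here is a minimal sketch of the usual Matryoshka-style (MRL) reduction: keep the leading components of the full vector and rescale to unit length. This is an assumption about the common procedure, not something the model card spells out. The function name `truncate_and_normalize` is hypothetical, and the four-element vector is a stand-in for a real 2560-dimensional embedding.

```python
import math


def truncate_and_normalize(vector: list, dim: int) -> list:
    """Keep the first `dim` components and rescale to unit L2 norm."""
    if not 1 <= dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to rescale for an all-zero prefix
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0, 0.5]  # stand-in for a 2560-dimensional embedding
short = truncate_and_normalize(full, 2)
print(short)  # [0.6, 0.8]
```

Renormalizing matters when the vectors are later compared by cosine similarity or stored in an index that assumes unit length. Sentence Transformers also exposes a `truncate_dim` constructor argument that may serve the same purpose without manual slicing.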
|
||||
|
||||
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our [blog](https://qwenlm.github.io/blog/qwen3-embedding/) and [GitHub](https://github.com/QwenLM/Qwen3-Embedding).
|
||||
|
||||
## Qwen3 Embedding Series Model list
|
||||
|
||||
| Model Type | Models | Size | Layers | Sequence Length | Embedding Dimension | MRL Support | Instruction Aware |
|
||||
|------------------|----------------------|------|--------|-----------------|---------------------|-------------|----------------|
|
||||
| Text Embedding | [Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) | 0.6B | 28 | 32K | 1024 | Yes | Yes |
|
||||
| Text Embedding | [Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B) | 4B | 36 | 32K | 2560 | Yes | Yes |
|
||||
| Text Embedding | [Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B) | 8B | 36 | 32K | 4096 | Yes | Yes |
|
||||
| Text Reranking | [Qwen3-Reranker-0.6B](https://huggingface.co/Qwen/Qwen3-Reranker-0.6B) | 0.6B | 28 | 32K | - | - | Yes |
|
||||
| Text Reranking | [Qwen3-Reranker-4B](https://huggingface.co/Qwen/Qwen3-Reranker-4B) | 4B | 36 | 32K | - | - | Yes |
|
||||
| Text Reranking | [Qwen3-Reranker-8B](https://huggingface.co/Qwen/Qwen3-Reranker-8B) | 8B | 36 | 32K | - | - | Yes |
|
||||
|
||||
> **Note**:
|
||||
> - `MRL Support` indicates whether the embedding model supports custom dimensions for the final embedding.
|
||||
> - `Instruction Aware` notes whether the embedding or reranking model supports customizing the input instruction according to different tasks.
|
||||
> - Our evaluation indicates that, for most downstream tasks, using instructions (instruct) typically yields an improvement of 1% to 5% compared to not using them. Therefore, we recommend that developers create tailored instructions specific to their tasks and scenarios. In multilingual contexts, we also advise users to write their instructions in English, as most instructions utilized during the model training process were originally written in English.
|
||||
|
||||
## Usage
|
||||
|
||||
With Transformers versions earlier than 4.51.0, you may encounter the following error:
|
||||
```
|
||||
KeyError: 'qwen3'
|
||||
```
|
||||
|
||||
### Sentence Transformers Usage

```python
# Requires transformers>=4.51.0
# Requires sentence-transformers>=2.7.0

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("Qwen/Qwen3-Embedding-4B")

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# together with setting `padding_side` to "left":
# model = SentenceTransformer(
#     "Qwen/Qwen3-Embedding-4B",
#     model_kwargs={"attn_implementation": "flash_attention_2", "device_map": "auto"},
#     tokenizer_kwargs={"padding_side": "left"},
# )

# The queries and documents to embed
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

# Encode the queries and documents. Note that queries benefit from using a prompt
# Here we use the prompt called "query" stored under `model.prompts`, but you can
# also pass your own prompt via the `prompt` argument
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
# tensor([[0.7534, 0.1147],
#         [0.0320, 0.6258]])
```

### Transformers Usage

```python
# Requires transformers>=4.51.0
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-Embedding-4B', padding_side='left')
model = AutoModel.from_pretrained('Qwen/Qwen3-Embedding-4B')

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = AutoModel.from_pretrained('Qwen/Qwen3-Embedding-4B', attn_implementation="flash_attention_2", torch_dtype=torch.float16).cuda()

max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict.to(model.device)
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7534257769584656, 0.1146894246339798], [0.03198453038930893, 0.6258305311203003]]
```

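The `last_token_pool` helper above picks the hidden state of the last real token in each sequence, which is why the padding side matters. The following dependency-free sketch illustrates the same indexing rule on plain Python lists; it decides per row rather than per batch, and the names `last_token_index` and `pool_last_token` as well as the toy data are hypothetical.

```python
def last_token_index(mask_row):
    """Return the position of the last real token (mask value 1)."""
    if mask_row[-1] == 1:
        # Left-padded (or unpadded): the final position is a real token.
        return len(mask_row) - 1
    # Right-padded: real tokens occupy the first sum(mask_row) positions.
    return sum(mask_row) - 1


def pool_last_token(hidden_states, attention_mask):
    """Select one vector per sequence, mirroring last-token pooling."""
    return [
        sequence[last_token_index(mask_row)]
        for sequence, mask_row in zip(hidden_states, attention_mask)
    ]


# Two toy sequences of four positions with 1-dimensional "hidden states".
hidden = [
    [[0.0], [0.0], [1.5], [2.5]],  # left-padded: real tokens at positions 2 and 3
    [[3.5], [4.5], [0.0], [0.0]],  # right-padded: real tokens at positions 0 and 1
]
mask = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
]
pooled = pool_last_token(hidden, mask)
print(pooled)
```

With left padding every sequence ends on a real token, so the tensor version can take the fast path of slicing the final position for the whole batch.
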
### vLLM Usage

```python
# Requires vllm>=0.8.5
import torch
import vllm
from vllm import LLM

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

model = LLM(model="Qwen/Qwen3-Embedding-4B", task="embed")

outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7525103688240051, 0.1143278032541275], [0.030893627554178238, 0.6239761114120483]]
```

📌 **Tip**: We recommend that developers customize the `instruct` according to their specific scenarios, tasks, and languages. Our tests have shown that in most retrieval scenarios, not using an `instruct` on the query side can lead to a drop in retrieval performance by approximately 1% to 5%.

### Text Embeddings Inference (TEI) Usage

You can run or deploy TEI on NVIDIA GPUs with:

```bash
docker run --gpus all -p 8080:80 -v hf_cache:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7.2 --model-id Qwen/Qwen3-Embedding-4B --dtype float16
```

Or on CPU devices with:

```bash
docker run -p 8080:80 -v hf_cache:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.2 --model-id Qwen/Qwen3-Embedding-4B --dtype float16
```

Then generate the embeddings by sending an HTTP POST request:

```bash
curl http://localhost:8080/embed \
    -X POST \
    -d '{"inputs": ["Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: What is the capital of China?", "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: Explain gravity"]}' \
    -H "Content-Type: application/json"
```

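The same request can be issued from Python. This is a minimal standard-library sketch that assumes a TEI container is listening on `http://localhost:8080` as in the commands above and that `/embed` answers with a JSON list of vectors; the helper names `build_payload` and `embed` are hypothetical.

```python
import json
import urllib.request

TASK = "Given a web search query, retrieve relevant passages that answer the query"


def build_payload(queries, task=TASK):
    """Wrap each query in the instruction format and encode the JSON body."""
    inputs = [f"Instruct: {task}\nQuery: {query}" for query in queries]
    return json.dumps({"inputs": inputs}).encode("utf-8")


def embed(queries, url="http://localhost:8080/embed"):
    """POST the queries to a running TEI server and return the parsed response."""
    request = urllib.request.Request(
        url,
        data=build_payload(queries),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Building the payload needs no server; calling embed(...) does.
payload = build_payload(["What is the capital of China?", "Explain gravity"])
print(payload.decode("utf-8"))
```

Keeping payload construction separate from the network call makes the instruction formatting easy to check without a running container.
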
## Evaluation

### MTEB (Multilingual)

| Model | Size | Mean (Task) | Mean (Type) | Bitext Mining | Class. | Clust. | Inst. Retri. | Multi. Class. | Pair. Class. | Rerank | Retri. | STS |
|----------------------------------|:-------:|:-------------:|:-------------:|:--------------:|:--------:|:--------:|:--------------:|:---------------:|:--------------:|:--------:|:--------:|:------:|
| NV-Embed-v2 | 7B | 56.29 | 49.58 | 57.84 | 57.29 | 40.80 | 1.04 | 18.63 | 78.94 | 63.82 | 56.72 | 71.10 |
| GritLM-7B | 7B | 60.92 | 53.74 | 70.53 | 61.83 | 49.75 | 3.45 | 22.77 | 79.94 | 63.78 | 58.31 | 73.33 |
| BGE-M3 | 0.6B | 59.56 | 52.18 | 79.11 | 60.35 | 40.88 | -3.11 | 20.1 | 80.76 | 62.79 | 54.60 | 74.12 |
| multilingual-e5-large-instruct | 0.6B | 63.22 | 55.08 | 80.13 | 64.94 | 50.75 | -0.40 | 22.91 | 80.86 | 62.61 | 57.12 | 76.81 |
| gte-Qwen2-1.5B-instruct | 1.5B | 59.45 | 52.69 | 62.51 | 58.32 | 52.05 | 0.74 | 24.02 | 81.58 | 62.58 | 60.78 | 71.61 |
| gte-Qwen2-7b-Instruct | 7B | 62.51 | 55.93 | 73.92 | 61.55 | 52.77 | 4.94 | 25.48 | 85.13 | 65.55 | 60.08 | 73.98 |
| text-embedding-3-large | - | 58.93 | 51.41 | 62.17 | 60.27 | 46.89 | -2.68 | 22.03 | 79.17 | 63.89 | 59.27 | 71.68 |
| Cohere-embed-multilingual-v3.0 | - | 61.12 | 53.23 | 70.50 | 62.95 | 46.89 | -1.89 | 22.74 | 79.88 | 64.07 | 59.16 | 74.80 |
| gemini-embedding-exp-03-07 | - | 68.37 | 59.59 | 79.28 | 71.82 | 54.59 | 5.18 | **29.16** | 83.63 | 65.58 | 67.71 | 79.40 |
| **Qwen3-Embedding-0.6B** | 0.6B | 64.33 | 56.00 | 72.22 | 66.83 | 52.33 | 5.09 | 24.59 | 80.83 | 61.41 | 64.64 | 76.17 |
| **Qwen3-Embedding-4B** | 4B | 69.45 | 60.86 | 79.36 | 72.33 | 57.15 | **11.56** | 26.77 | 85.05 | 65.08 | 69.60 | 80.86 |
| **Qwen3-Embedding-8B** | 8B | **70.58** | **61.69** | **80.89** | **74.00** | **57.65** | 10.06 | 28.66 | **86.40** | **65.63** | **70.88** | **81.08** |

> **Note**: Scores for the compared models were retrieved from the MTEB online [leaderboard](https://huggingface.co/spaces/mteb/leaderboard) on May 24th, 2025.

### MTEB (Eng v2)

| MTEB English / Models | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retri. | STS | Summ. |
|--------------------------------|:--------:|:------------:|:------------:|:--------:|:--------:|:-------------:|:---------:|:--------:|:-------:|:-------:|
| multilingual-e5-large-instruct | 0.6B | 65.53 | 61.21 | 75.54 | 49.89 | 86.24 | 48.74 | 53.47 | 84.72 | 29.89 |
| NV-Embed-v2 | 7.8B | 69.81 | 65.00 | 87.19 | 47.66 | 88.69 | 49.61 | 62.84 | 83.82 | 35.21 |
| GritLM-7B | 7.2B | 67.07 | 63.22 | 81.25 | 50.82 | 87.29 | 49.59 | 54.95 | 83.03 | 35.65 |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.20 | 63.26 | 85.84 | 53.54 | 87.52 | 49.25 | 50.25 | 82.51 | 33.94 |
| stella_en_1.5B_v5 | 1.5B | 69.43 | 65.32 | 89.38 | 57.06 | 88.02 | 50.19 | 52.42 | 83.27 | 36.91 |
| gte-Qwen2-7B-instruct | 7.6B | 70.72 | 65.77 | 88.52 | 58.97 | 85.9 | 50.47 | 58.09 | 82.69 | 35.74 |
| gemini-embedding-exp-03-07 | - | 73.3 | 67.67 | 90.05 | **59.39** | **87.7** | 48.59 | 64.35 | 85.29 | **38.28** |
| **Qwen3-Embedding-0.6B** | 0.6B | 70.70 | 64.88 | 85.76 | 54.05 | 84.37 | 48.18 | 61.83 | 86.57 | 33.43 |
| **Qwen3-Embedding-4B** | 4B | 74.60 | 68.10 | 89.84 | 57.51 | 87.01 | 50.76 | 68.46 | **88.72** | 34.39 |
| **Qwen3-Embedding-8B** | 8B | **75.22** | **68.71** | **90.43** | 58.57 | 87.52 | **51.56** | **69.44** | 88.58 | 34.83 |

### C-MTEB (MTEB Chinese)

| C-MTEB | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS |
|------------------|--------|------------|------------|--------|--------|-------------|---------|-------|-------|
| multilingual-e5-large-instruct | 0.6B | 58.08 | 58.24 | 69.80 | 48.23 | 64.52 | 57.45 | 63.65 | 45.81 |
| bge-multilingual-gemma2 | 9B | 67.64 | 68.52 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.12 | 67.79 | 72.53 | 54.61 | 79.5 | 68.21 | 71.86 | 60.05 |
| gte-Qwen2-7B-instruct | 7.6B | 71.62 | 72.19 | 75.77 | 66.06 | 81.16 | 69.24 | 75.70 | 65.20 |
| ritrieve_zh_v1 | 0.3B | 72.71 | 73.85 | 76.88 | 66.5 | **85.98** | **72.86** | 76.97 | **63.92** |
| **Qwen3-Embedding-0.6B** | 0.6B | 66.33 | 67.45 | 71.40 | 68.74 | 76.42 | 62.58 | 71.03 | 54.52 |
| **Qwen3-Embedding-4B** | 4B | 72.27 | 73.51 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 |
| **Qwen3-Embedding-8B** | 8B | **73.84** | **75.00** | **76.97** | **80.08** | 84.23 | 66.99 | **78.21** | 63.53 |

## Citation

If you find our work helpful, please consider citing it.

```
@article{qwen3embedding,
  title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
  author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
  journal={arXiv preprint arXiv:2506.05176},
  year={2025}
}
```
@@ -0,0 +1,30 @@
{
  "architectures": [
    "Qwen3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 2560,
  "initializer_range": 0.02,
  "intermediate_size": 9728,
  "max_position_embeddings": 40960,
  "max_window_layers": 36,
  "model_type": "qwen3",
  "num_attention_heads": 32,
  "num_hidden_layers": 36,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "sliding_window": null,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.51.2",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151665
}
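The attention fields in the config above can be turned into projection sizes. The sketch below assumes the standard grouped-query attention layout, where `q_proj` maps the hidden state to `num_attention_heads * head_dim` features and `k_proj`/`v_proj` map it to `num_key_value_heads * head_dim`; shapes are written as `(out_features, in_features)`, and the function name `projection_shapes` is hypothetical.

```python
# Values copied from the config hunk above.
config = {
    "hidden_size": 2560,
    "head_dim": 128,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
}


def projection_shapes(cfg):
    """Derive attention projection weight shapes as (out_features, in_features)."""
    hidden = cfg["hidden_size"]
    q_out = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_out = cfg["num_key_value_heads"] * cfg["head_dim"]
    return {
        "q_proj": (q_out, hidden),
        "k_proj": (kv_out, hidden),
        "v_proj": (kv_out, hidden),
        "o_proj": (hidden, q_out),
        # Number of query heads that share one key/value head.
        "queries_per_kv_head": cfg["num_attention_heads"] // cfg["num_key_value_heads"],
    }


shapes = projection_shapes(config)
for name, value in shapes.items():
    print(name, value)
```

Under these assumptions the query projection is wider than the hidden size, because `head_dim` is set explicitly rather than derived as `hidden_size / num_attention_heads`.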
@@ -0,0 +1,8 @@
{
  "prompts": {
    "query": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:",
    "document": ""
  },
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}
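The `prompts` mapping above is what `prompt_name="query"` refers to in the Sentence Transformers example. This sketch assumes a named prompt is simply prepended to the input text and that a `null` default means no prefix; the helper `apply_prompt` is hypothetical and only mimics that behaviour.

```python
# Values copied from the Sentence Transformers config hunk above.
st_config = {
    "prompts": {
        "query": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:",
        "document": "",
    },
    "default_prompt_name": None,
}


def apply_prompt(text, prompt_name=None, cfg=st_config):
    """Prepend the named prompt, falling back to the configured default."""
    name = prompt_name if prompt_name is not None else cfg["default_prompt_name"]
    prefix = cfg["prompts"][name] if name is not None else ""
    return prefix + text


query_text = apply_prompt("Explain gravity", prompt_name="query")
document_text = apply_prompt("The capital of China is Beijing.")
print(repr(query_text))
print(repr(document_text))
```

Because the default prompt name is `null` and the `document` prompt is empty, only queries receive the instruction prefix.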
@@ -0,0 +1,6 @@
{
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "max_new_tokens": 2048,
  "transformers_version": "4.51.3"
}
File diff suppressed because it is too large
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,405 @@
{
  "metadata": {
    "total_size": 8043548672
  },
  "weight_map": {
    "embed_tokens.weight": "model-00001-of-00002.safetensors",
    "layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.0.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "layers.0.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.1.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "layers.1.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.10.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "layers.10.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.11.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "layers.11.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.12.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "layers.12.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.13.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "layers.13.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "layers.14.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.14.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "layers.14.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "layers.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "layers.15.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.15.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "layers.15.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "layers.16.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.16.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "layers.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "layers.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "layers.16.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.16.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "layers.16.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "layers.17.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.17.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "layers.17.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "layers.17.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "layers.17.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.17.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "layers.17.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "layers.17.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "layers.18.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.18.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "layers.18.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "layers.18.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "layers.18.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.18.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "layers.18.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "layers.18.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "layers.18.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "layers.18.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "layers.19.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.19.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "layers.19.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "layers.19.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "layers.19.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.19.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "layers.19.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "layers.19.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "layers.19.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "layers.19.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "layers.19.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "layers.2.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "layers.2.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "layers.20.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.20.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
    "layers.20.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "layers.20.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "layers.20.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.20.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
    "layers.20.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "layers.20.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "layers.20.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "layers.20.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "layers.20.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "layers.21.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.21.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
    "layers.21.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
    "layers.21.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
    "layers.21.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.21.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
    "layers.21.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
    "layers.21.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
    "layers.21.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
    "layers.21.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "layers.21.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
    "layers.22.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.22.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
    "layers.22.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
    "layers.22.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
    "layers.22.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.22.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
    "layers.22.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
    "layers.22.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
    "layers.22.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
    "layers.22.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "layers.22.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
    "layers.23.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.23.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
    "layers.23.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
    "layers.23.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
    "layers.23.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.23.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
    "layers.23.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
    "layers.23.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
    "layers.23.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
    "layers.23.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "layers.23.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
    "layers.24.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.24.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
    "layers.24.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
    "layers.24.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
    "layers.24.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.24.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
    "layers.24.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
    "layers.24.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
    "layers.24.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
    "layers.24.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "layers.24.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
    "layers.25.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.25.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
    "layers.25.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
    "layers.25.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
    "layers.25.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.25.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
    "layers.25.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
    "layers.25.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
    "layers.25.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
    "layers.25.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "layers.25.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
    "layers.26.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.26.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
    "layers.26.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
    "layers.26.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
    "layers.26.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.26.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
    "layers.26.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
    "layers.26.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
    "layers.26.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
    "layers.26.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "layers.26.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
    "layers.27.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.27.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
    "layers.27.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
    "layers.27.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
    "layers.27.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.27.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
    "layers.27.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
    "layers.27.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
    "layers.27.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
    "layers.27.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "layers.27.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
    "layers.28.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.28.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
    "layers.28.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
    "layers.28.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
    "layers.28.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.28.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
    "layers.28.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
    "layers.28.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
    "layers.28.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
    "layers.28.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "layers.28.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
    "layers.29.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.29.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
    "layers.29.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
    "layers.29.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
    "layers.29.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
    "layers.29.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
    "layers.29.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
    "layers.29.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
    "layers.29.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
    "layers.29.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "layers.29.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.3.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.3.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.30.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.30.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.30.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.30.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.30.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.30.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.30.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.30.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.30.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.30.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.30.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.31.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.31.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.31.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.31.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.31.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.31.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.31.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.31.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.31.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.31.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.31.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.32.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.32.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.32.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.32.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.32.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.32.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.32.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.32.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.32.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.32.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.32.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.33.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.33.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.33.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.33.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.33.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.33.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.33.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.33.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.33.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.33.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.33.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.34.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.34.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.34.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.34.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.34.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.34.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.34.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.34.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.34.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.34.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.34.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.35.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.35.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.35.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.35.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.35.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.35.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.35.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.35.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.35.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.35.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.35.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
|
||||
"layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.4.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.4.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.5.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.5.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.6.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.6.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.7.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.7.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.8.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.8.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.9.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.9.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"norm.weight": "model-00002-of-00002.safetensors"
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,20 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  },
  {
    "idx": 2,
    "name": "2",
    "path": "2_Normalize",
    "type": "sentence_transformers.models.Normalize"
  }
]
Binary file not shown.
@@ -0,0 +1,208 @@
{
  "add_bos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "151643": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151644": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151645": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151646": {
      "content": "<|object_ref_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151647": {
      "content": "<|object_ref_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151648": {
      "content": "<|box_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151649": {
      "content": "<|box_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151650": {
      "content": "<|quad_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151651": {
      "content": "<|quad_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151652": {
      "content": "<|vision_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151653": {
      "content": "<|vision_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151654": {
      "content": "<|vision_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151655": {
      "content": "<|image_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151656": {
      "content": "<|video_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151657": {
      "content": "<tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151658": {
      "content": "</tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151659": {
      "content": "<|fim_prefix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151660": {
      "content": "<|fim_middle|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151661": {
      "content": "<|fim_suffix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151662": {
      "content": "<|fim_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151663": {
      "content": "<|repo_name|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151664": {
      "content": "<|file_sep|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    }
  },
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "bos_token": null,
"chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n 
{{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "extra_special_tokens": {},
  "model_max_length": 131072,
  "pad_token": "<|endoftext|>",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null
}
File diff suppressed because one or more lines are too long
@@ -1,10 +0,0 @@
# Core dependencies
torch>=2.0.0
sentence-transformers>=2.2.0
chromadb>=0.4.0
numpy>=1.24.0
tqdm>=4.65.0
huggingface_hub>=0.19.0

# Optional - for LangChain document conversion
langchain>=0.1.0