Merge remote v4/v5 features (multimodal chat input, layered generation, annotation detection) with local v3 features (dialog file upload, XLSX support, session fix)

Key resolutions:
- agent/nodes.py: Merged session_id exclusion fix with new persistable fields (ocr_extraction_result, annotation_result, layout_schema, ocr_elements)
- app.py: Adopted st-multimodal-chatinput for unified paste/drop/upload, removed custom JS paste bridge
- backend/file_parser.py: Kept local XLSX parser, added remote XLS/DOC parsers
- CLAUDE.md + CODE_GUIDE.md: Merged documentation from both branches

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 10:05:43 +08:00
parent 87ead4fa6a 43a0542a11
commit 2befd44430
22 changed files with 2114 additions and 507 deletions
@@ -11,20 +11,21 @@
 3. [Architecture Overview](#3-架构全景图)
 4. [Data Bus: AgentState](#4-数据总线agentstate)
 5. [State Machine: graph.py](#5-状态机graphpy)
-6. [The 14 Nodes in Detail: nodes.py](#6-14-个节点详解nodespy)
+6. [The 18 Nodes in Detail: nodes.py](#6-18-个节点详解nodespy)
 7. [LLM Call Layer: llm.py](#7-llm-调用层llmpy)
 8. [Prompt System: prompts](#8-prompt-系统prompts)
 9. [RAG and Vector Search](#9-rag-与向量搜索)
-10. [Self-Growing Error Knowledge Base](#10-错误自增长知识库)
-11. [Layout Analyzer](#11-布局分析器)
-12. [File Parser](#12-文件解析器)
-12b. [OCR Document Field Extractor](#12b-ocr-单据字段提取器)
-13. [Validation Service](#13-验证服务)
-14. [Session Persistence](#14-会话持久化)
-15. [Streamlit UI: app.py](#15-streamlit-uiapppy)
-16. [Configuration Reference](#16-配置参考)
-17. [How to Add New Features](#17-如何添加新功能)
-18. [Debugging Guide](#18-调试指南)
+10. [Layered Precise Generation](#10-分层精确生成)
+11. [Self-Growing Error Knowledge Base](#11-错误自增长知识库)
+12. [Layout Analyzer](#12-布局分析器)
+13. [File Parser](#13-文件解析器)
+14. [Validation Service](#14-验证服务)
+15. [Session Persistence](#15-会话持久化)
+16. [Logging System: logger.py](#16-日志系统loggerpy)
+17. [Streamlit UI: app.py](#17-streamlit-uiapppy)
+18. [Configuration Reference](#18-配置参考)
+19. [How to Add New Features](#19-如何添加新功能)
+20. [Debugging Guide](#20-调试指南)

 ---

@@ -91,7 +92,10 @@ streamlit run app.py --server.port 8501
 │                                                               │
 │  load_session → process_input → manage_context → save_snapshot│
 │     → classify_intent                                         │
-│        ├─ initial_generation → retrieve → generate            │
+│        ├─ initial_generation → retrieve                       │
+│        │    ├─ [with schema] → generate_skeleton → refine     │
+│        │    │       → map_fields (3-stage precise generation) │
+│        │    └─ [no schema] → generate (original 1-shot)       │
 │        ├─ modify_report      → modify_jrxml                   │
 │        ├─ consult_question   → handle_consult                 │
 │        ├─ undo_modification  → handle_undo                    │
@@ -116,7 +120,7 @@ streamlit run app.py --server.port 8501
   ┌──────────┐   ┌──────────────┐   ┌───────────────┐
   │backend/  │   │prompts/      │   │validation_    │
   │llm.py    │   │loader.py     │   │service/main.py│
-   │logger.py │   │*.md (7       │   │(FastAPI,      │
+   │logger.py │   │*.md (10      │   │(FastAPI,      │
   │rag_      │   │templates)    │   │own process)   │
   │adapter.py│   └──────────────┘   └───────────────┘
   │error_kb  │
@@ -131,6 +135,12 @@ streamlit run app.py --server.port 8501
   │.py       │
   │file_     │
   │parser.py │
+   │ocr_      │
+   │extractor │
+   │.py       │
+   │annotation│
+   │_detector │
+   │.py       │
   │validation│
   │.py       │
   │session.py│
@@ -153,7 +163,7 @@ streamlit run app.py --server.port 8501

 ## 4. 数据总线：AgentState

-`agent/state.py`: defines just 23 fields and contains no logic.
+`agent/state.py`: defines just 28 fields and contains no logic.

 ```python
 class AgentState(TypedDict, total=False):
@@ -194,9 +204,14 @@ class AgentState(TypedDict, total=False):
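    # ── Failure context hand-off ──
    pending_failure_context: dict  # Failure info stashed after retries are exhausted; injected automatically on the next user input

-    # ── OCR document field extraction ──
-    ocr_extraction_result: dict   # OCR field extraction result (from OcrExtractor)
-    uploaded_file_path: str        # Temporary path of the uploaded image
+    # ── Layered precise generation (v5) ──
+    layout_schema: dict       # Output of extract_layout_schema(): column + region structure
+    ocr_elements: list        # Raw OCR row data (used for coordinate sampling in stage 2)
+
+    # ── OCR and annotations (v3/v4) ──
+    ocr_extraction_result: dict  # Precise OCR field extraction result
+    uploaded_file_path: str      # Temporary path of the uploaded image
+    annotation_result: dict      # Annotation detection result (circles + arrows)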
 ```

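 **Data flow**: each node function receives `state` and returns the modified `state` (in practice a dict). LangGraph automatically merges the returned value into the global state.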
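
The merge step described above can be pictured with plain dicts. This is a simplified sketch, not LangGraph's actual implementation: `merge_state` and `fake_generate` are hypothetical names, only three illustrative fields are declared, and it assumes the default behavior in which a returned key overwrites the stored value.

```python
from typing import TypedDict


class AgentState(TypedDict, total=False):
    # Three illustrative fields only; the real state declares 28.
    user_input: str
    current_jrxml: str
    retry_count: int


def fake_generate(state: AgentState) -> dict:
    """Hypothetical node: returns only the keys it changed."""
    return {"current_jrxml": "<jasperReport/>", "retry_count": 0}


def merge_state(state: AgentState, update: dict) -> dict:
    """Simplified stand-in for the framework merge: returned keys overwrite."""
    merged = dict(state)
    merged.update(update)
    return merged


state = {"user_input": "build an employee report"}
state = merge_state(state, fake_generate(state))
print(state)
```

Keys the node did not return (here `user_input`) survive the merge untouched, which is why nodes can stay small and focused.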
@@ -225,6 +240,13 @@ def route_by_intent(state) -> Literal["retrieve", "modify_jrxml", ...]:
 def route_after_validate(state) -> Literal["finalize", "explain_error"]:
    return "finalize" if state.get("status") == "pass" else "explain_error"

+def route_after_retrieve(state) -> Literal["generate", "generate_skeleton"]:
+    """Take the 3-stage precise path when layout_schema has rows; otherwise use the original 1-shot path."""
+    schema = state.get("layout_schema")
+    if schema and isinstance(schema, dict) and schema.get("total_rows", 0) > 0:
+        return "generate_skeleton"
+    return "generate"
+
 def route_after_correct(state) -> Literal["validate", "finalize"]:
    return "validate" if state.get("retry_count", 0) < MAX_RETRY else "finalize"
 ```
@@ -234,6 +256,7 @@ def route_after_correct(state) -> Literal["validate", "finalize"]:

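 **Key routing logic**:
 - `route_by_intent`: an 8-way branch on intent, the "traffic hub" of the whole system
+- `route_after_retrieve`: with a layout_schema → 3-stage precise generation (generate_skeleton → refine_layout → map_fields); without one → the original 1-shot generate
 - `route_after_save`: preview/export intents **skip validation** and go straight to finalize (this is the key to the preview fix)
 - `route_after_correct`: continues the validation loop while the retry count is < 3, otherwise gives up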
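
To see how the validate/correct loop terminates, the two routing functions can be driven by a small simulation. This is a sketch under assumptions: `simulate` is a hypothetical helper, `MAX_RETRY = 3` is inferred from the "< 3" limit, and it assumes `correct_jrxml` is the node that increments `retry_count`.

```python
MAX_RETRY = 3  # inferred from the "< 3" retry limit


def route_after_validate(state: dict) -> str:
    return "finalize" if state.get("status") == "pass" else "explain_error"


def route_after_correct(state: dict) -> str:
    return "validate" if state.get("retry_count", 0) < MAX_RETRY else "finalize"


def simulate(validation_results: list) -> list:
    """Walk the validate/correct loop with scripted validation outcomes."""
    state = {"retry_count": 0}
    outcomes = iter(validation_results)
    path = []
    node = "validate"
    while node != "finalize":
        path.append(node)
        if node == "validate":
            state["status"] = next(outcomes, "fail")
            node = route_after_validate(state)
        elif node == "explain_error":
            node = "correct_jrxml"
        else:  # correct_jrxml
            state["retry_count"] += 1  # assumption: this node bumps the counter
            node = route_after_correct(state)
    path.append("finalize")
    return path


recovered = simulate(["fail", "pass"])
exhausted = simulate(["fail"] * 10)
print(recovered)
print(exhausted.count("correct_jrxml"), "corrections before giving up")
```

With these assumptions the graph never attempts more than three corrections, regardless of how many times validation fails.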

@@ -246,7 +269,7 @@ def build_graph():
    # Register nodes
    workflow.add_node("load_session", load_session_node)
    workflow.add_node("process_input", process_input)
-    # ... 14 nodes
+    # ... 18 nodes

    # Wire up edges
    workflow.set_entry_point("load_session")
@@ -288,38 +311,53 @@ def build_graph():
    retrieve  modify  save_  handle_  handle_  handle_
              _jrxml  session consult  undo     reset
        │        │               │       │        │
-        ▼        │               │       ▼        │
-    generate     │               │    save_session │
-        │        │               │       │        │
-        └───┬────┘               │       ▼        │
-            │                    │    finalize     │
-            ▼                    │                 │
-       save_session ◄───────────┘                 │
-            │                                      │
-            ├── preview/export? ──► finalize       │
-            │                                      │
-            ▼                                      │
-        validate ◄────────────────────────────────┘
-         │     │
-     pass     fail
-         │     │
-         │     ▼
-         │  explain_error
-         │     │
-         │     ▼
-         │  correct_jrxml
-         │     │
-         │     ├── retry < 3? ──► validate (loop)
-         │     │
-         │     └── retry >= 3? ──► finalize (give up)
-         │
-         ▼
-     finalize ──► END
+   ┌────┤        │               │       ▼        │
+   │    │        │               │    save_session │
+   ▼    │        │               │       │        │
+ generate│       │               │       ▼        │
+(1-shot) │       │               │    finalize     │
+   │     │       │               │                 │
+   │     ▼       │               │                 │
+   │  generate  │               │                 │
+   │  _skeleton │               │                 │
+   │     │      │               │                 │
+   │     ▼      │               │                 │
+   │  refine   │               │                 │
+   │  _layout  │               │                 │
+   │     │      │               │                 │
+   │     ▼      │               │                 │
+   │  map_     │               │                 │
+   │  fields   │               │                 │
+   │     │      │               │                 │
+   └──┬──┘      │               │                 │
+      │         │               │                 │
+      ▼         │               │                 │
+ save_session ◄─┘               │                 │
+      │                          │                 │
+      ├── preview/export? ──► finalize            │
+      │                          ▲                │
+      ▼                          │                │
+  validate ◄─────────────────────┘                │
+   │     │                                        │
+ pass    fail                                     │
+   │     │                                        │
+   │     ▼                                        │
+   │  explain_error                               │
+   │     │                                        │
+   │     ▼                                        │
+   │  correct_jrxml                               │
+   │     │                                        │
+   │     ├── retry < 3? ──► validate (loop)       │
+   │     │                                        │
+   │     └── retry >= 3? ──► finalize (give up)   │
+   │                                              │
+   ▼                                              │
+finalize ──► END                                  │
 ```

 ---

-## 6. The 14 Nodes in Detail: nodes.py
+## 6. The 18 Nodes in Detail: nodes.py

 `agent/nodes.py` is the "flesh and blood" of the system; each node implements one processing step.

@@ -596,17 +634,20 @@ def load_prompt(name: str) -> str:

 This means you can edit `prompts/*.md` directly and the change takes effect on the next request, with no restart needed.
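
A minimal sketch of why edits apply without a restart, assuming the loader simply re-reads the file on each call. The `prompts_dir` parameter is added here for illustration only; the real `load_prompt(name)` may locate and cache files differently.

```python
import tempfile
from pathlib import Path


def load_prompt(name: str, prompts_dir: Path) -> str:
    """Read the template from disk on every call, so edits apply immediately."""
    return (prompts_dir / f"{name}.md").read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    prompts_dir = Path(tmp)
    template = prompts_dir / "consult.md"

    template.write_text("Answer the question: {question}", encoding="utf-8")
    first = load_prompt("consult", prompts_dir)

    # Simulate editing the file while the app keeps running.
    template.write_text("Answer briefly: {question}", encoding="utf-8")
    second = load_prompt("consult", prompts_dir)

print(first.format(question="What is a band?"))
print(second.format(question="What is a band?"))
```

The trade-off is one small file read per request in exchange for never serving a stale template.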

-### 8.2 The 7 Prompt Files
+### 8.2 The 10 Prompt Files

 | File | Calling node | Placeholders | Purpose |
 |------|---------|--------|------|
 | `intent_classify.md` | classify_intent | `{has_report}`, `{user_input}` | 8-way intent classification |
 | `initial_generation.md` | generate | `{context}`, `{user_request}` | Initial JRXML generation |
-| `modification.md` | modify_jrxml | `{current_jrxml}`, `{conversation_history}`, `{modification_request}` | Modify an existing JRXML |
+| `modification.md` | modify_jrxml | `{current_jrxml}`, `{conversation_history}`, `{modification_request}`, `{ocr_context}` | Modify an existing JRXML |
 | `correction.md` | correct_jrxml | `{current_jrxml}`, `{error_msg}`, `{explanation}` | Fix validation errors |
 | `explain_error.md` | explain_error | `{error_msg}`, `{jrxml_snippet}` | Turn technical errors into plain language |
 | `compression.md` | manage_context | `{conversation_text}` | Conversation summary compression |
 | `consult.md` | handle_consult | `{question}` | Q&A consultation |
+| `skeleton_generation.md` | generate_skeleton | `{layout_schema}`, `{context}`, `{user_request}` | Skeleton JRXML (`$F{field_N}`) |
+| `refine_layout.md` | refine_layout | `{current_jrxml}`, `{sampled_coordinates}` | Pixel-level position refinement |
+| `field_mapping.md` | map_fields | `{current_jrxml}`, `{ocr_fields}` | Placeholder → real field name |

 ### 8.3 Writing Prompt Templates

@@ -663,7 +704,72 @@ class RAGSearcher:

 ---

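-## 10. Self-Growing Error Knowledge Base
+## 10. Layered Precise Generation
+
+Designed for the A4 report image upload scenario. It addresses the problem of too many OCR elements (hundreds) making the LLM prompt excessively long.
+
+### 10.1 Trigger Conditions
+
+The 3-stage pipeline runs only when both conditions hold:
+- `intent == "initial_generation"` (a new report)
+- `layout_schema` exists and `total_rows > 0` (a layout schema was extracted successfully)
+
+Every other case keeps its existing path with zero behavior change: text-only initial generation still uses the original 1-shot `generate` node, and intents such as modify_report are routed as before.
+
+### 10.2 The 3-Stage Pipeline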
+
+```
+Upload A4 image
+  │ analyze_layout() → layout dict
+  │ extract_layout_schema() → schema
+  ▼
+route_after_retrieve()
+  ├─ schema present → generate_skeleton → refine_layout → map_fields
+  └─ no schema → generate (original 1-shot)
+```
+
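+**Phase 1: generate_skeleton**
+- Input: the compressed layout schema (`schema_text`: column definitions + regions + width classes)
+- Output: a skeleton JRXML in which every field is a `$F{field_N}` placeholder
+- Goal: a correct band structure and approximate positions
+
+**Phase 2: refine_layout**
+- Input: the current JRXML + sampled coordinates (header row + first data row + last row)
+- Output: the JRXML after pixel-level position refinement
+- Goal: exact x/y/w/h values; intermediate rows are handled by interpolation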
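
The interpolation idea in Phase 2 can be sketched as follows. This is an illustration only: it assumes evenly spaced rows (linear interpolation) between the two sampled rows, and `interpolate_row_y` is a hypothetical helper, not a function from the codebase.

```python
def interpolate_row_y(first_y: float, last_y: float, row_count: int) -> list:
    """Evenly space row_count rows between the first and last sampled rows."""
    if row_count < 1:
        return []
    if row_count == 1:
        return [first_y]
    step = (last_y - first_y) / (row_count - 1)
    return [first_y + i * step for i in range(row_count)]


# Only the first and last data rows were sampled; the 6 rows between are derived.
rows = interpolate_row_y(first_y=120.0, last_y=260.0, row_count=8)
print(rows)
```

Sampling three rows instead of every row is what keeps the Phase 2 prompt short even when the page holds hundreds of OCR elements.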
+
+**Phase 3: map_fields**
+- Input: the current JRXML + the list of OCR field names (from `ocr_extraction_result.fields`)
+- Output: `$F{field_N}` → real field names (such as `$F{name}`, `$F{department}`)
+- Goal: a complete JRXML that is readable and compiles
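
In the pipeline this mapping is produced by the LLM through `field_mapping.md`. The deterministic sketch below only illustrates the intended transformation; the helper name `replace_placeholders` and the 1-based meaning of `N` are assumptions.

```python
import re

PLACEHOLDER = re.compile(r"\$F\{field_(\d+)\}")


def replace_placeholders(jrxml: str, field_names: list) -> str:
    """Replace $F{field_N} with $F{<name>}, treating N as a 1-based index."""

    def substitute(match):
        index = int(match.group(1)) - 1
        if 0 <= index < len(field_names):
            return "$F{" + field_names[index] + "}"
        return match.group(0)  # leave unknown placeholders untouched

    return PLACEHOLDER.sub(substitute, jrxml)


skeleton = "<a>$F{field_1}</a><b>$F{field_2}</b><c>$F{field_9}</c>"
mapped = replace_placeholders(skeleton, ["name", "department"])
print(mapped)
```

A real JRXML would also need its `<field>` declarations renamed consistently, which this sketch does not attempt.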
+
+**Key design**: the intermediate stages (skeleton/refinement) skip validation; only the final mapped result enters the validate loop.
+
+### 10.3 extract_layout_schema()
+
+Located in `backend/layout_analyzer.py` and called after `analyze_layout()`:
+
+```python
+def extract_layout_schema(layout_result: dict) -> dict:
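+    # Column detection: cluster by X coordinate; same column when X-center distance < avg_width * 0.5
+    # Region classification: row[0] with few elements → title; row[1] → header; last 1-2 rows → footer
+    # Width classification: < 10% of A4 width → narrow; > 25% → wide; otherwise → medium
+    # Returns: {columns, regions, total_rows, total_columns, a4_dimensions, schema_text}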
+```
+
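+Example `schema_text`: `"报表布局: 5列 x 10行, A4纵向\n列定义: 序号(窄), 姓名(中), 部门(中), 职位(中), 入职日期(宽)\n区域: 标题(1行) → 表头(1行) → 数据(8行)"`. In English: "Report layout: 5 columns x 10 rows, A4 portrait / Column definitions: No. (narrow), Name (medium), Department (medium), Position (medium), Hire date (wide) / Regions: title (1 row) → header (1 row) → data (8 rows)".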
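
The column-detection and width-classification rules can be sketched as below. This is not the project's implementation: `cluster_columns` and `classify_width` are hypothetical helpers, elements are assumed to be dicts with `x` and `w` keys, and the A4 width of 595 is an assumed page width in points.

```python
def center_x(element: dict) -> float:
    return element["x"] + element["w"] / 2


def cluster_columns(elements: list) -> list:
    """Group elements into columns: same column if X centers differ by < avg_width * 0.5."""
    if not elements:
        return []
    avg_width = sum(e["w"] for e in elements) / len(elements)
    threshold = avg_width * 0.5
    columns = []
    for element in sorted(elements, key=center_x):
        if columns:
            last = columns[-1]
            last_center = sum(center_x(e) for e in last) / len(last)
            if abs(center_x(element) - last_center) < threshold:
                last.append(element)
                continue
        columns.append([element])
    return columns


def classify_width(width: float, page_width: float) -> str:
    """< 10% of page width is narrow, > 25% is wide, anything else is medium."""
    ratio = width / page_width
    if ratio < 0.10:
        return "narrow"
    if ratio > 0.25:
        return "wide"
    return "medium"


A4_WIDTH = 595  # assumed page width in points
elements = [
    {"x": 40, "w": 30}, {"x": 42, "w": 28},    # first column
    {"x": 100, "w": 80}, {"x": 104, "w": 76},  # second column
    {"x": 300, "w": 160},                      # third column
]
columns = cluster_columns(elements)
print([len(c) for c in columns])
print([classify_width(c[0]["w"], A4_WIDTH) for c in columns])
```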
+
+### 10.4 _format_row_coordinates()
+
+```python
+def _format_row_coordinates(row: dict) -> dict:
+    # Convert the elements of a single OCR row into {y_center, columns: [{col, x, y, w, h, font_size, text}]}
+    # Sorted left to right by x coordinate
+```
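
One possible shape for this helper, as a sketch only. The input layout is an assumption: the row is taken to hold an `elements` list of dicts with `x`, `y`, `w`, `h`, and optional `font_size` and `text` keys, and `y_center` is taken to be the mean of the element centers.

```python
def format_row_coordinates(row: dict) -> dict:
    """Return {y_center, columns: [...]} with columns ordered left to right."""
    elements = sorted(row.get("elements", []), key=lambda e: e["x"])
    columns = [
        {
            "col": index,
            "x": e["x"],
            "y": e["y"],
            "w": e["w"],
            "h": e["h"],
            "font_size": e.get("font_size"),
            "text": e.get("text", ""),
        }
        for index, e in enumerate(elements)
    ]
    centers = [c["y"] + c["h"] / 2 for c in columns]
    y_center = sum(centers) / len(centers) if centers else None
    return {"y_center": y_center, "columns": columns}


row = {
    "elements": [
        {"x": 200, "y": 100, "w": 80, "h": 20, "text": "Department"},
        {"x": 40, "y": 100, "w": 30, "h": 20, "text": "No."},
    ]
}
formatted = format_row_coordinates(row)
print(formatted["y_center"], [c["text"] for c in formatted["columns"]])
```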
+
+---
+
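+## 11. Self-Growing Error Knowledge Base

 `backend/error_kb.py`: automatically accumulates error cases that were corrected successfully and offers them as references the next time a similar error appears.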

@@ -709,9 +815,9 @@ Each record in ChromaDB:

 ---

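-## 11. Layout Analyzer
+## 12. Layout Analyzer

-`backend/layout_analyzer.py`: processes user-uploaded images/PDFs and identifies the report's layout structure.
+`backend/layout_analyzer.py`: processes user-uploaded images/PDFs and identifies the report's layout structure. It also provides `extract_layout_schema()`, which derives a compact column + region description from OCR row data (used for layered precise generation).

 ### 11.1 Three Processing Paths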

@@ -772,7 +878,7 @@ def _parse_jrxml_sections(jrxml):

 ---

-## 12. File Parser
+## 13. File Parser

 `backend/file_parser.py`: a unified entry point for multi-format file parsing.

@@ -784,79 +890,25 @@ def parse_file(file_path, file_type="") -> dict:
    # .png/.jpg/.jpeg/.bmp/.webp → _parse_image()
    # .pdf                         → _parse_pdf()
    # .docx                        → _parse_docx()
+    # .xlsx                        → _parse_xlsx()
+    # .xls                         → _parse_xls()
+    # .doc                         → _parse_doc()
    # anything else                → _parse_text()  (UTF-8 / GBK)
 ```

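 ### Fallback Chain of Each Parser

-- **Images**: EasyOCR (ch_sim+en) → PaddleOCR → return metadata only + install hint
+- **Images**: PaddleOCR (preferred for precise recognition) → EasyOCR (ch_sim+en) → return metadata only + install hint
 - **PDF**: pdfplumber → PyMuPDF → fail
 - **DOCX**: python-docx (including table content extraction) → fail
+- **XLSX**: openpyxl (with multi-sheet support) → fail
+- **XLS**: xlrd (legacy Excel format) → fail
+- **DOC**: olefile (binary format, best-effort extraction) → fail
 - **Text**: UTF-8 → GBK → fail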
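
The chains above share one pattern: try each option in order and stop at the first success. A minimal sketch for the text case follows, where `decode_with_fallback` is a hypothetical helper and raising `ValueError` on total failure is an assumption about how "fail" is reported.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "gbk")) -> tuple:
    """Try each encoding in order; return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # fall through to the next encoding in the chain
    raise ValueError("could not decode with any of: " + ", ".join(encodings))


utf8_result = decode_with_fallback("报表".encode("utf-8"))
gbk_result = decode_with_fallback("报表".encode("gbk"))
print(utf8_result, gbk_result)
```

Order matters: UTF-8 is strict enough to reject most GBK byte sequences, so trying it first avoids silently misreading UTF-8 files as GBK.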

 ---

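-## 12b. OCR Document Field Extractor
-
-`backend/ocr_extractor.py`: two-stage precise extraction of field values from document images.
-
-### 12b.1 Data Model
-
-```python
-@dataclass
-class OcrTextElement:        # OCR text element with precise coordinates
-    text: str
-    x_min, y_min, x_max, y_max: float
-    confidence: float = 1.0
-    # Properties: center_x, center_y, width, height, bbox
-
-@dataclass
-class ExtractedField:        # Extracted field result
-    field_name: str
-    field_value: str
-    bbox: list[float]
-    confidence: float
-    extraction_method: str    # exact_match / kv_pair / regex / table_match / none
-
-@dataclass
-class ExtractionResult:      # Complete extraction result
-    file_path: str
-    image_size: tuple
-    fields: list[ExtractedField]
-    all_elements: list[OcrTextElement]
-    errors: list[str]
-    ocr_available: bool
-```
-
-### 12b.2 Two-Stage Pipeline
-
-**Stage 1: document analysis** (`_analyze_document`):
-- Load the image → `_ocr_elements_enhanced()` → EasyOCR (ch_sim+en) → PaddleOCR fallback
-- Filter out low-confidence elements using `OCR_CONFIDENCE_THRESHOLD` (default 0.5)
-- Return a list of `OcrTextElement` sorted by (y, x)
-
-**Stage 2: field extraction** (`_extract_field`):
-Tries 4 strategies in priority order:
-1. **Exact key-value pair** (`_exact_kv_match`, conf=0.95/0.85): a "field name: value" pattern within a single element
-2. **Fuzzy key-value pair** (`_fuzzy_kv_match`, conf=0.75/0.60): adjacent-element matching, searching the same row and the next row
-3. **Regex patterns** (`_regex_match`, conf=0.70/0.60): 12 predefined patterns (invoice code/number/amount/date, etc.)
-4. **Table structure** (`_table_match`, conf=0.55): row/column grouping + header matching
-
-### 12b.3 Integration Points
-
-- **`process_input`**: called automatically once an uploaded image is detected, passing 17 default Chinese field names
-- **Result injection**: extracted field values are prepended to `user_input` with the prefix `[OCR 单据字段提取结果]` ("OCR document field extraction result")
-- **Result display**: the "🔍 OCR 单据字段提取结果" collapsible area in the `app.py` summary card
-
-### 12b.4 Fallback Behavior
-
-- If either OCR engine is unavailable, it falls back silently without affecting the main flow
-- Two reuse paths: `extract()` (full pipeline) and `extract_from_layout_result()` (reuses an existing layout analysis)
-- Convenience functions: `extract_ocr_fields()`, `extract_from_layout()`
-
----- 
-
-## 13. Validation Service
+## 14. Validation Service

 `validation_service/main.py`: a standalone FastAPI process that provides JRXML validation.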

@@ -892,7 +944,7 @@ def validate_jrxml(jrxml_text):

 ---

-## 14. Session Persistence
+## 15. Session Persistence

 `backend/session.py`: simple JSON-file-based CRUD, with one file per session.

@@ -920,7 +972,7 @@ generate_session_id() → str                     # UUID hex[:12]

 ---

-## 15. Logging System: logger.py
+## 16. Logging System: logger.py

 `backend/logger.py` provides structured logging and serves as the system's "black box" recorder.

@@ -975,14 +1027,14 @@ backend/logger.py

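 ### 15.5 The `@log_node` Decorator

-All 17 nodes in [agent/nodes.py](file:///d:/Idea%20Project/jaspersoft/agent/nodes.py) use the `@log_node("node name")` decorator, which automatically records:
+All 18 nodes in [agent/nodes.py](file:///d:/Idea%20Project/jaspersoft/agent/nodes.py) use the `@log_node("node name")` decorator, which automatically records:
 - **Entry log**: a summary of the state when the node starts executing
 - **Exit log**: a summary of the state when the node finishes, plus the elapsed time (duration_ms)
 - **Exception log**: the error message plus a state summary when the node raises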
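
A decorator with these three behaviors could look roughly like the sketch below. It is not the code in `backend/logger.py`: the `summarize` helper, the logger name, and the message format are all assumptions made for illustration.

```python
import functools
import logging
import time

logger = logging.getLogger("agent.nodes")  # hypothetical logger name


def summarize(state: dict) -> dict:
    """Hypothetical state summary: key names only, to keep log lines small."""
    return {"keys": sorted(state)}


def log_node(name: str):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(state: dict):
            logger.info("enter %s %s", name, summarize(state))
            start = time.perf_counter()
            try:
                result = func(state)
            except Exception:
                # Log the failure with a state summary, then re-raise unchanged.
                logger.exception("error in %s %s", name, summarize(state))
                raise
            duration_ms = (time.perf_counter() - start) * 1000
            logger.info("exit %s %s duration_ms=%.1f", name, summarize(result), duration_ms)
            return result
        return wrapper
    return decorator


@log_node("process_input")
def process_input(state: dict) -> dict:
    return {**state, "processed": True}


result = process_input({"user_input": "hello"})
print(result)
```

Because `functools.wraps` preserves the wrapped function's name, the decorated nodes can still be registered with the graph under their own identities.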

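 ### 15.6 The `@_log_route` Decorator

-All 8 routing functions in [agent/graph.py](file:///d:/Idea%20Project/jaspersoft/agent/graph.py) use `@_log_route("route name")`, which automatically records every routing decision (from → to).
+All 9 routing functions in [agent/graph.py](file:///d:/Idea%20Project/jaspersoft/agent/graph.py) use `@_log_route("route name")`, which automatically records every routing decision (from → to).

 ### 15.7 Log Analysis Examples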

@@ -999,7 +1051,7 @@ jq 'select(.extra.direction=="response") | {caller: .extra.caller, ms: .extra.du

 ---

-## 16. Streamlit UI: app.py
+## 17. Streamlit UI: app.py

 `app.py` is the entry point of the whole system, at roughly 560 lines. It is divided into several areas:

@@ -1096,7 +1148,7 @@ parent.addEventListener('keydown', function(e) {

 ---

-## 17. Configuration Reference
+## 18. Configuration Reference

 All configuration is managed through the `.env` file. The full list of options:

@@ -1127,7 +1179,7 @@ parent.addEventListener('keydown', function(e) {

 ---

-## 18. How to Add New Features
+## 19. How to Add New Features

 ### 18.1 Adding a New Intent Type

@@ -1171,7 +1223,7 @@ elif provider == "my_provider":

 ---

-## 19. Debugging Guide
+## 20. Debugging Guide

 ### 19.1 Common Issues

@@ -1251,24 +1303,25 @@ st.json(state)  # print the full state (for debugging; remember to remove it)

 | File | Lines | Role |
 |------|------|------|
-| `app.py` | ~530 | Streamlit UI entry point |
-| `agent/state.py` | ~40 | State type definitions |
-| `agent/nodes.py` | ~523 | 14 workflow nodes |
-| `agent/graph.py` | ~232 | State graph compilation + routing |
+| `app.py` | ~690 | Streamlit UI entry point (multimodal chat input) |
+| `agent/state.py` | ~52 | State type definitions (28 fields) |
+| `agent/nodes.py` | ~900 | 18 workflow nodes |
+| `agent/graph.py` | ~270 | State graph compilation + routing (9 routing functions) |
 | `backend/llm.py` | ~105 | LLM factory (3 backends) |
 | `backend/rag_adapter.py` | ~156 | ChromaDB semantic search |
 | `backend/error_kb.py` | ~226 | Error knowledge base |
 | `backend/embeddings.py` | ~49 | Embedding model factory |
-| `backend/file_parser.py` | ~194 | Multi-format file parsing |
-| `backend/layout_analyzer.py` | ~495 | A4 template layout analysis |
-| `backend/ocr_extractor.py` | ~797 | Precise OCR document field extraction (two stages + 4 strategies) |
+| `backend/file_parser.py` | ~320 | Multi-format file parsing (7 formats) |
+| `backend/layout_analyzer.py` | ~600 | A4 template layout analysis + layout schema extraction |
+| `backend/ocr_extractor.py` | ~380 | Precise OCR field extraction |
+| `backend/annotation_detector.py` | ~250 | Annotation detection (circles + arrows) |
 | `backend/validation.py` | ~27 | Validation service HTTP client |
 | `backend/session.py` | ~113 | Session JSON CRUD |
 | `prompts/loader.py` | ~54 | Prompt hot reloading |
-| `prompts/*.md` (7 files) | - | Prompt templates |
+| `prompts/*.md` (10 files) | - | Prompt templates |
 | `validation_service/main.py` | ~130 | FastAPI validation service |
 | `tests/test_ocr_extraction.py` | ~543 | OCR extractor unit tests (48 cases) |
 | `start.bat` | - | One-click start script (Windows) |
 | `stop.bat` | - | One-click stop script (Windows) |
 | `.env.example` | ~62 | Configuration template |
-| `requirements.txt` | ~32 | Python dependencies |
+| `requirements.txt` | ~42 | Python dependencies |