fix: band-level windowed refine_layout + programmatic map_fields to prevent 91.5% content loss

Root cause: given the full 34k-char JRXML, the LLM regenerated the report
from scratch instead of modifying coordinates in place, shrinking the
output to ~3k chars.

Solution (programmatic node control, not prompt engineering):

- New agent/jrxml_windower.py: decompose the JRXML into a header (never
  sent to the LLM) plus individual bands. Split bands >4000 chars at
  element boundaries. Reassemble with element-count validation (roll
  back if the count changes by more than 10%).

- Rewrite refine_layout: per-band windowed LLM processing (~2-4k chars
  each). LLM cannot "reimagine" the entire report.

- Rewrite map_fields: fully programmatic regex replacement of
  $F{field_N} with the real field name. Zero LLM calls, zero content
  loss.

- _sanitize_field_name: non-ASCII chars escaped to _uXXXX_ format for
  valid JRXML identifiers.

- Tests: 48 new unit tests (windower 28 + map_fields 20). All passing.
  Full suite 385 tests, zero regressions.
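
The two programmatic safeguards described above (element-count rollback and regex-based field mapping) can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: the function names, the way elements are counted (opening tags matched by a regex), the 4-digit lowercase hex form of the `_uXXXX_` escape, and the `{N: real_name}` mapping shape are all assumptions. Only `$F{field_N}` expressions are rewritten here; field declarations are out of scope for the sketch.

```python
import re
from typing import Dict

# Assumption: an "element" is any opening or self-closing XML tag.
ELEMENT_OPEN = re.compile(r"<[A-Za-z_][\w:.-]*")
# Placeholder form taken from the commit message: $F{field_N}
PLACEHOLDER = re.compile(r"\$F\{field_(\d+)\}")


def count_elements(fragment: str) -> int:
    """Count opening tags in a JRXML fragment (closing tags are ignored)."""
    return len(ELEMENT_OPEN.findall(fragment))


def accept_or_rollback(original: str, refined: str, tolerance: float = 0.10) -> str:
    """Keep the refined text only if its element count stays within tolerance."""
    before = count_elements(original)
    after = count_elements(refined)
    if before == 0:
        return refined if after == 0 else original
    if abs(after - before) / before > tolerance:
        return original  # too much changed: discard the LLM output
    return refined


def sanitize_field_name(name: str) -> str:
    """Escape each non-ASCII character as _uXXXX_ (hex code point, assumed format)."""
    return "".join(ch if ch.isascii() else f"_u{ord(ch):04x}_" for ch in name)


def map_fields(jrxml: str, real_names: Dict[int, str]) -> str:
    """Replace $F{field_N} with $F{<sanitized real name>}; no LLM involved."""
    def replace(match: "re.Match[str]") -> str:
        index = int(match.group(1))
        if index not in real_names:
            return match.group(0)  # leave unmapped placeholders untouched
        return "$F{" + sanitize_field_name(real_names[index]) + "}"

    return PLACEHOLDER.sub(replace, jrxml)
```

Because both steps are deterministic string operations, the worst case is that a band or placeholder is left unchanged; neither step can shrink the document the way a full regeneration can.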
2026-05-24 08:55:38 +08:00
parent bb6cc6e241
commit bd5bfbac2d
80 changed files with 39463 additions and 108 deletions
@@ -51,3 +51,14 @@ class AgentState(TypedDict, total=False):
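# Requirement 9: layered precise generation
layout_schema: dict # output of extract_layout_schema(): column + region structure
ocr_elements: list # raw OCR line data (used for stage-two coordinate sampling)
# Requirement 10: multi-tenant knowledge base
kb_id: str # ID of the knowledge base bound to the current session
kb_fields: list # field definitions extracted from the KB [{name, description, type, required}]
kb_field_mapping: dict # OCR field -> KB field mapping {"工单号": "billNo", ...}
uploaded_template_jrxml: str # raw text of the JRXML template uploaded in the conversation
uploaded_template_params: list # parsed parameters [{name, type}]
kb_template_jrxml: str # template JRXML retrieved from the KB
kb_template_name: str # name of the retrieved template
datasource_mode: str # "parameter" or "jdbc"
db_config: dict # JDBC connection config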
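
Since `AgentState` is declared with `total=False`, every key in the hunk above is optional, so consumers should read the new keys defensively. A minimal sketch of that pattern, using a reduced copy of the class; the helper `needs_db_config` and the `"url"` key inside `db_config` are hypothetical and not part of this commit:

```python
from typing import TypedDict


class AgentState(TypedDict, total=False):
    # Reduced copy of the fields added in this hunk, for illustration only.
    kb_id: str
    kb_field_mapping: dict
    datasource_mode: str  # "parameter" or "jdbc"
    db_config: dict


def needs_db_config(state: AgentState) -> bool:
    """Hypothetical check: JDBC mode was requested but no connection config is set."""
    # .get() is used because total=False means any key may be absent.
    return state.get("datasource_mode") == "jdbc" and not state.get("db_config")
```

An empty state is valid under `total=False`, so `needs_db_config({})` simply evaluates to `False` rather than raising `KeyError`.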