fix: band-level windowed refine_layout + programmatic map_fields to prevent 91.5% content loss
Root cause: when given the full 34k-char JRXML, the LLM regenerated the report
from scratch instead of modifying coordinates in place, shrinking the output
to ~3k chars.
Solution (programmatic node control, not prompt engineering):
- New agent/jrxml_windower.py: decompose the JRXML into a header (never sent
  to the LLM) plus individual bands. Split bands >4000 chars at element
  boundaries. Reassemble with element-count validation (a change of more
  than 10% triggers a rollback).
- Rewrite refine_layout: per-band windowed LLM processing (~2-4k chars per
  call), so the LLM cannot "reimagine" the entire report.
- Rewrite map_fields: 100% programmatic regex replacement of $F{field_N}
  with the real field name. Zero LLM calls, zero content loss.
- _sanitize_field_name: non-ASCII chars are escaped to the _uXXXX_ format to
  produce valid JRXML identifiers.
- Tests: 48 new unit tests (windower 28 + map_fields 20), all passing.
  Full suite: 385 tests, zero regressions.
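
The windowing and rollback rules described in the list above could look roughly like the sketch below. It is not the code in agent/jrxml_windower.py, which this commit page does not show. The function names, the choice of `<reportElement` as the counted tag, and the greedy packing strategy are all assumptions made for illustration. Only the 4000-char limit and the 10% threshold come from the commit message.

```python
from __future__ import annotations

import re

# Values taken from the commit message; every name below is hypothetical.
MAX_CHUNK_CHARS = 4000   # bands longer than this are split
MAX_COUNT_DRIFT = 0.10   # element-count change above this triggers a rollback

# Assumption: elements are counted via their <reportElement> child tag.
_ELEMENT_RE = re.compile(r"<reportElement\b")


def count_elements(jrxml: str) -> int:
    """Count report elements in a JRXML fragment."""
    return len(_ELEMENT_RE.findall(jrxml))


def split_at_element_boundaries(
    elements: list[str], limit: int = MAX_CHUNK_CHARS
) -> list[str]:
    """Greedily pack whole elements into chunks of at most `limit` chars.

    An element is never cut in half, so a single element longer than
    `limit` ends up alone in an oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for element in elements:
        if current and len(current) + len(element) > limit:
            chunks.append(current)
            current = ""
        current += element
    if current:
        chunks.append(current)
    return chunks


def accept_or_rollback(
    original: str, refined: str, max_drift: float = MAX_COUNT_DRIFT
) -> str:
    """Return `refined` unless its element count drifted too far from `original`."""
    before = count_elements(original)
    after = count_elements(refined)
    if before == 0:
        # Nothing to compare against: accept only if no elements appeared.
        return refined if after == 0 else original
    drift = abs(after - before) / before
    return original if drift > max_drift else refined
```

Because the chunks are built by concatenating whole elements, joining them back together reproduces the original band byte for byte, so the only source of change is what the LLM returns for each window.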
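
As a rough illustration of the programmatic map_fields rewrite, the sketch below replaces `$F{field_N}` placeholders with sanitized real names using a single regex pass. It is not the committed implementation. The signatures, the `names` mapping shape, the uppercase hex digits in `_uXXXX_`, and the decision to leave unknown placeholders untouched are assumptions.

```python
from __future__ import annotations

import re

# Matches placeholders such as $F{field_3}; the index is captured as group 1.
_PLACEHOLDER_RE = re.compile(r"\$F\{field_(\d+)\}")


def sanitize_field_name(name: str) -> str:
    """Escape every non-ASCII character as _uXXXX_ (hex code point).

    Assumption: uppercase hex, zero-padded to at least four digits.
    Code points above U+FFFF produce more than four digits.
    """
    return "".join(
        ch if ch.isascii() else f"_u{ord(ch):04X}_" for ch in name
    )


def map_fields(jrxml: str, names: dict[int, str]) -> str:
    """Replace $F{field_N} with $F{<real name>} without calling an LLM.

    `names` maps the placeholder index N to the real field name
    (a hypothetical shape chosen for this sketch).
    """

    def replace(match: re.Match[str]) -> str:
        index = int(match.group(1))
        if index not in names:
            # Unknown placeholder: keep it unchanged rather than guess.
            return match.group(0)
        return "$F{" + sanitize_field_name(names[index]) + "}"

    return _PLACEHOLDER_RE.sub(replace, jrxml)


if __name__ == "__main__":
    source = "<textFieldExpression>$F{field_1}</textFieldExpression>"
    print(map_fields(source, {1: "billNo"}))
```

Since the substitution only touches text that matches the placeholder pattern, everything else in the JRXML passes through unchanged, which is what rules out content loss in this step. A real implementation would presumably also rename the matching `<field name="field_N">` declarations, which this sketch leaves out.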
@@ -51,3 +51,14 @@ class AgentState(TypedDict, total=False):
    # Requirement 9: layered precise generation
    layout_schema: dict  # Output of extract_layout_schema(): column and region structure
    ocr_elements: list  # Raw OCR line data (used for stage-two coordinate sampling)

    # Requirement 10: multi-tenant knowledge base
    kb_id: str  # ID of the knowledge base bound to the current session
    kb_fields: list  # Field definitions extracted from the KB [{name, description, type, required}]
    kb_field_mapping: dict  # OCR field → KB field mapping {"工单号": "billNo", ...}
    uploaded_template_jrxml: str  # Raw text of the JRXML template uploaded in the conversation
    uploaded_template_params: list  # Parsed parameters [{name, type}]
    kb_template_jrxml: str  # Template JRXML retrieved from the KB
    kb_template_name: str  # Name of the retrieved template
    datasource_mode: str  # "parameter" or "jdbc"
    db_config: dict  # JDBC connection config