fix: OCR fidelity scoring reform — prevent false fail from language-mismatched field names
Root cause (from review): field_coverage compared English JRXML field names against Chinese OCR field names with set intersection, which always yields zero. Combined with the 0.5 weight in the score formula, this caused valid JRXML (XSD pass, 82% element coverage) to score 0.41 < 0.5 → fail → correction loop → progressive destruction.

Changes:
- Scoring weight: element_coverage 0.8 + field_coverage 0.2 (was 0.5 + 0.5)
- Validate node: only fail on fidelity when BOTH score < 0.5 AND element_coverage < 0.4
- Field name regex: \w+ → [^"]+ to support non-ASCII field names
- Field matching: also try _sanitize_field_name conversion (Chinese → _uXXXX_)
- correction.md: namespace check always active, not conditional on error keywords
+16
-9
@@ -1207,8 +1207,8 @@ def _check_ocr_fidelity(jrxml: str, state: dict) -> dict:
     else:
         element_coverage = 1.0

-    # 2. Field-name coverage
-    jrxml_fields = set(re.findall(r'<field name="(\w+)"', jrxml))
+    # 2. Field-name coverage (English field names vs. Chinese OCR field names inherently mismatch, so the weight is reduced)
+    jrxml_fields = set(re.findall(r'<field name="([^"]+)"', jrxml))
     ocr_field_names = set()
     ocr_fields = ocr_result.get("fields", []) if isinstance(ocr_result, dict) else []
     for f in ocr_fields:
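As a sanity check on the regex change in this hunk, the sketch below runs both patterns over a made-up JRXML fragment (the field names are hypothetical, not taken from the repository). One caveat: in Python 3, `\w` in a str pattern already matches CJK characters, so the wider class `[^"]+` mainly helps with names containing characters outside the word class, such as hyphens or dots.

```python
import re

OLD_PATTERN = r'<field name="(\w+)"'
NEW_PATTERN = r'<field name="([^"]+)"'

# Hypothetical JRXML fragment; these field names are illustrative only.
SAMPLE_JRXML = """
<field name="invoice_code" class="java.lang.String"/>
<field name="发票代码" class="java.lang.String"/>
<field name="发票-号码" class="java.lang.String"/>
<field name="total.amount" class="java.lang.String"/>
"""

old_fields = set(re.findall(OLD_PATTERN, SAMPLE_JRXML))
new_fields = set(re.findall(NEW_PATTERN, SAMPLE_JRXML))

# Names the old pattern silently dropped: \w+ must be followed directly by
# the closing quote, so a hyphen or dot inside the name breaks the match.
missed_by_old = new_fields - old_fields

print(sorted(old_fields))
print(sorted(missed_by_old))
```

Whether hyphenated or dotted names actually occur in the generated JRXML is not stated in the commit; the point is only that the new pattern captures any name up to the closing quote.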
@@ -1219,10 +1219,15 @@ def _check_ocr_fidelity(jrxml: str, state: dict) -> dict:

     if ocr_field_names and jrxml_fields:
         matched = jrxml_fields & ocr_field_names
-        field_coverage = len(matched) / max(len(ocr_field_names), 1)
-        unmatched = ocr_field_names - jrxml_fields
-        if unmatched:
-            sample = list(unmatched)[:8]
+        # Also try matching after _sanitize_field_name escaping (Chinese → _uXXXX_)
+        sanitized_ocr = {_sanitize_field_name(n) for n in ocr_field_names}
+        matched_via_sanitize = jrxml_fields & sanitized_ocr
+        all_matched = matched | matched_via_sanitize
+        field_coverage = len(all_matched) / max(len(ocr_field_names), 1)
+        still_unmatched = {n for n in ocr_field_names
+                           if n not in jrxml_fields and _sanitize_field_name(n) not in jrxml_fields}
+        if still_unmatched:
+            sample = list(still_unmatched)[:8]
             issues.append(f"OCR 字段未在 JRXML 中声明: {', '.join(sample)}")
     elif ocr_field_names and not jrxml_fields:
         field_coverage = 0.0
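The two-stage matching in this hunk can be exercised in isolation. The sketch below is a minimal stand-alone version: `sanitize_field_name` is a hypothetical stand-in for the project's `_sanitize_field_name` (its real escaping rules are not shown in this diff; only the `_uXXXX_` shape is taken from the commit message), and the sample field names are invented.

```python
from typing import Set, Tuple


def sanitize_field_name(name: str) -> str:
    """Hypothetical stand-in for _sanitize_field_name (real rules unknown).

    Keeps ASCII letters, digits and underscores; every other character
    becomes _uXXXX_, where XXXX is its code point in uppercase hex.
    """
    return "".join(
        ch if ch.isascii() and (ch.isalnum() or ch == "_") else f"_u{ord(ch):04X}_"
        for ch in name
    )


def field_coverage(jrxml_fields: Set[str], ocr_names: Set[str]) -> Tuple[float, Set[str]]:
    """Mirror the hunk: direct match first, then match on sanitized names."""
    matched = jrxml_fields & ocr_names
    sanitized_ocr = {sanitize_field_name(n) for n in ocr_names}
    all_matched = matched | (jrxml_fields & sanitized_ocr)
    coverage = len(all_matched) / max(len(ocr_names), 1)
    still_unmatched = {
        n for n in ocr_names
        if n not in jrxml_fields and sanitize_field_name(n) not in jrxml_fields
    }
    return coverage, still_unmatched


# One field declared under its plain name, one under its escaped name,
# and one OCR field that the JRXML does not declare at all.
jrxml_fields = {"invoice_code", sanitize_field_name("金额")}
ocr_names = {"invoice_code", "金额", "日期"}
coverage, missing = field_coverage(jrxml_fields, ocr_names)
print(round(coverage, 3), missing)
```

Here two of the three OCR names are covered, one directly and one through escaping, so only the undeclared name ends up in the reported issue.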
@@ -1247,8 +1252,9 @@ def _check_ocr_fidelity(jrxml: str, state: dict) -> dict:
             f"OCR 布局分析有 {ocr_columns} 列"
         )

-    # Overall score
-    score = round(field_coverage * 0.5 + element_coverage * 0.5, 3)
+    # Overall score (element coverage is the primary weight at 0.8, field coverage is secondary at 0.2)
+    # English field names not matching Chinese OCR field names is expected, so field coverage is advisory only
+    score = round(element_coverage * 0.8 + field_coverage * 0.2, 3)
     return {
         "score": score,
         "field_coverage": round(field_coverage, 3),
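To make the effect of the reweighting concrete, here is a small worked example using the case from the commit message (82% element coverage, zero field coverage). The function names `score_old` and `score_new` are illustrative labels for the two formulas in this hunk, not names from the codebase.

```python
def score_old(element_coverage: float, field_coverage: float) -> float:
    # Formula before this commit: equal weights.
    return round(field_coverage * 0.5 + element_coverage * 0.5, 3)


def score_new(element_coverage: float, field_coverage: float) -> float:
    # Formula after this commit: element coverage dominates.
    return round(element_coverage * 0.8 + field_coverage * 0.2, 3)


# The failing case described in the commit message.
old = score_old(element_coverage=0.82, field_coverage=0.0)
new = score_new(element_coverage=0.82, field_coverage=0.0)
print(old, new)
```

With the old weights the score lands at 0.41, below the 0.5 threshold; with the new weights the same inputs give 0.656.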
@@ -1291,7 +1297,8 @@ def validate(state: AgentState) -> Dict:
     if fidelity["issues"]:
         if state["status"] == "pass":
             # XSD passed but content fidelity is insufficient → downgrade to fail
-            if fidelity["score"] < 0.5:
+            # Require score < 0.5 AND element_coverage < 0.4 (a field-name language mismatch alone should not cause a fail)
+            if fidelity["score"] < 0.5 and fidelity.get("element_coverage", 0) < 0.4:
                 state["status"] = "fail"
                 state["error_msg"] = (
                     f"[内容保真度不足] 得分 {fidelity['score']:.2f}/1.0。"
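The tightened gate in this hunk is easiest to compare side by side. The sketch below lifts the two conditions into hypothetical helper functions (`downgrade_old` and `downgrade_new` are not names from the codebase) and feeds them invented fidelity dicts.

```python
def downgrade_old(fidelity: dict) -> bool:
    # Condition before this commit: score alone decides.
    return fidelity["score"] < 0.5


def downgrade_new(fidelity: dict) -> bool:
    # Condition after this commit: low score AND low element coverage.
    # A missing element_coverage key defaults to 0, so the gate then
    # falls back to the score check alone.
    return fidelity["score"] < 0.5 and fidelity.get("element_coverage", 0) < 0.4


# Invented sample inputs for illustration.
cases = {
    "moderate element coverage": {"score": 0.4, "element_coverage": 0.5},
    "sparse layout": {"score": 0.24, "element_coverage": 0.3},
    "key missing": {"score": 0.4},
}

results = {name: (downgrade_old(f), downgrade_new(f)) for name, f in cases.items()}
for name, (old, new) in results.items():
    print(f"{name}: old={old} new={new}")
```

Only the first case changes outcome: a report with moderate element coverage but a low overall score is no longer downgraded to fail.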
@@ -8,7 +8,7 @@
 - If the current JRXML is empty or too short (<200 characters), regenerate the complete JRXML from the OCR data and layout schema provided below instead of emitting a placeholder stub.
 - If the error is "字段 'field_N' 未在 <field> 部分声明" (field 'field_N' is not declared in the <field> section), you **must** add a `<field name="field_N" class="java.lang.String"/>` declaration for every missing field_N. These are placeholder fields and must not be deleted. Also make sure every $F{{field_N}} reference has a matching <field> declaration.
 - If the error is "字段 'field_N' 未在 <field> 部分声明" and OCR field data is available, try replacing $F{{field_N}} with the corresponding real field name from the OCR data (e.g. $F{{invoice_code}}), and update the <field> declaration and all references accordingly.
-- **If the error contains "No matching global declaration available for the validation root", "namespace", or "命名空间"**, the namespace URL is wrong. The root element must have exactly this form: `<jasperReport xmlns="http://jasperreports.sourceforge.net/jasperreports" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://jasperreports.sourceforge.net/jasperreports http://jasperreports.sourceforge.net/xsd/jasperreport.xsd">`. Remove every ns0: prefix, remove every `xmlns:ns0` declaration, and remove the `ns0:` prefix from every element tag.
+- **Always check and fix the namespace**: the root element must have exactly this form: `<jasperReport xmlns="http://jasperreports.sourceforge.net/jasperreports" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://jasperreports.sourceforge.net/jasperreports http://jasperreports.sourceforge.net/xsd/jasperreport.xsd">`. Remove every ns0: prefix, remove every `xmlns:ns0` declaration, and remove the `ns0:` prefix from every element tag.

 Current JRXML (with errors):
 {current_jrxml}
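The namespace rule in the prompt above can also be checked mechanically. The following is a rough sketch, not code from this repository: `namespace_problems` is a hypothetical helper built on the standard-library `xml.etree.ElementTree`, and it only looks at the root element's namespace and at leftover `ns0` prefixes.

```python
import xml.etree.ElementTree as ET

# Namespace URL taken from the root element required by the prompt.
JASPER_NS = "http://jasperreports.sourceforge.net/jasperreports"


def namespace_problems(jrxml: str) -> list:
    """Return a list of namespace issues found in a JRXML string."""
    problems = []
    if "ns0:" in jrxml or "xmlns:ns0" in jrxml:
        problems.append("ns0 prefix present")
    try:
        root = ET.fromstring(jrxml)
    except ET.ParseError as exc:
        return problems + [f"not well-formed: {exc}"]
    # ElementTree expands the tag to {namespace}localname.
    if root.tag != f"{{{JASPER_NS}}}jasperReport":
        problems.append(f"unexpected root tag: {root.tag}")
    return problems


# Minimal invented documents for illustration.
good = f'<jasperReport xmlns="{JASPER_NS}" name="r"/>'
bad = '<ns0:jasperReport xmlns:ns0="http://example.com/wrong" name="r"/>'

good_problems = namespace_problems(good)
bad_problems = namespace_problems(bad)
print(good_problems)
print(bad_problems)
```

A check like this could run before the correction prompt is invoked, but whether the pipeline does so is not visible in this commit.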