Zotero History
- Date item added to Zotero:: 2026-03-23
- First date annotations or notes modified:: 2026-03-23
- Last date annotations or notes modified:: 2026-03-27
- Export date:: 2026-03-27
Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction
Cite
Fang, Y., Chen, B., Peng, J., Li, X., Xi, Y., Zhang, C., & Zhong, G. (2025). Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction (arXiv:2505.24347). arXiv. https://doi.org/10.48550/arXiv.2505.24347
TL;DR
Contribution:: A three-stage, hallucination-suppressing ASR error-correction framework (Pre-Detection + CoT Subtasks + Verification) that works with a general-purpose LLM alone, without fine-tuning or external information
Pros:: No extra training or data required; applicable to both Chinese and English; each stage contributes modularly to hallucination suppression
Cons:: Slight increase in deletion/insertion errors (hallucinations not fully eliminated); CoT sharply raises token cost (~4x); precise quantification is difficult for sentences with multiple errors
Study Snapshot
Key takeaway:: Applying an LLM with a plain prompt increases CER by 949% relative to the baseline (5.06→53.10), but the three-stage framework achieves up to a 21% relative improvement over the baseline.
Methods:: Conformer-based AED ASR + GPT-4o/DeepSeek-V2, benchmarked on AISHELL-1/2 and LibriSpeech
Outcomes:: Effective at correcting substitution errors; improves Noun Recall; also valid in chunk-based (streaming) decoding settings.
Results:: 21% (AISHELL-1), 11% (AISHELL-2), 9% (LibriSpeech test-clean), 11.4% (test-other) relative CER/WER reductions
Implementations
- ๊ณต์ ๊ตฌํ: teamtee/LLM-ASR-Error-Correction
- ๊ฐ์ธ ๊ตฌํ: lots-o/paper-to-code (@fangFewerHallucinationsMore2025)
Meta
Author:: Fang, Yangui
Author:: Chen, Baixu
Author:: Peng, Jing
Author:: Li, Xu
Author:: Xi, Yu
Author:: Zhang, Chengwei
Author:: Zhong, Guohui
Title:: Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction
Short Title:: Fewer Hallucinations, More Verification
Year:: 2025
Citekey:: @fangFewerHallucinationsMore2025
itemType:: preprint
DOI:: 10.48550/arXiv.2505.24347
Abstract
Automatic Speech Recognition (ASR) error correction aims to correct recognition errors while preserving accurate text. Although traditional approaches demonstrate moderate effectiveness, LLMs offer a paradigm that eliminates the need for training and labeled data. However, directly using LLMs will encounter hallucinations problem, which may lead to the modification of the correct text. To address this problem, we propose the Reliable LLM Correction Framework (RLLM-CF), which consists of three stages: (1) error pre-detection, (2) chain-of-thought sub-tasks iterative correction, and (3) reasoning process verification. The advantage of our method is that it does not require additional information or fine-tuning of the model, and ensures the correctness of the LLM correction under multi-pass programming. Experiments on AISHELL-1, AISHELL-2, and Librispeech show that the GPT-4o model enhanced by our framework achieves 21%, 11%, 9%, and 11.4% relative reductions in CER/WER.
Reading notes
๐ด Problems
Highlight (1 page, edited: 2026-03-27)
Although traditional approaches demonstrate moderate effectiveness, LLMs offer a paradigm that eliminates the need for training and labeled data. However, directly using LLMs will encounter hallucinations problem, which may lead to the modification of the correct text.
Problems:
LLM์ ASR ๊ต์ ์ ์ง์ ์ ์ฉํ๋ฉด hallucination์ผ๋ก ์ธํด ์ฌ๋ฐ๋ฅธ ํ ์คํธ๊น์ง ์๋ชป ์์ ํ๋ ์ญํจ๊ณผ ๋ฐ์ โ RLLM-CF ์ค๊ณ์ ํต์ฌ ๋๊ธฐ
Highlight (1 page, edited: 2026-03-27)
However, these methods typically require either domain-specific fine-tuning or additional contextual information, which constrains their scalability in real-world applications. Therefore, leveraging the general knowledge embedded in LLMs without fine-tuning or external inputs emerges as a more practical and scalable alternative. Nevertheless, directly applying general LLMs often results in hallucination issues [20], posing a significant challenge to reliable correction.
Problems:
๊ธฐ์กด ๋ฐฉ๋ฒ๋ค์ domain fine-tuning ๋๋ ์ธ๋ถ ์ ๋ณด(N-best list ๋ฑ) ํ์ โ ํ์ฅ์ฑ ์ ํ.
์ผ๋ฐ LLM์ ์ง์ ์ ์ฉํ๋ฉด hallucination์ด ๋ฐ์ํ์ฌ reliable correction์ ์ ํดํจ.
Highlight (2 page, edited: 2026-03-27)
Hallucinations in LLMs refer to instances where the generated responses, while grammatically correct, fluent, and plausible, deviate from the input or contradict factual information [21], [22]. In the context of ASR error correction, model-induced hallucinations are closely linked to transcription errors.
Problems:
LLM hallucination ์ ์ โ ๋ฌธ๋ฒ์ ์ผ๋ก ์ ์ฐฝํ๊ณ ๊ทธ๋ด๋ฏํ๋ ์ ๋ ฅ๊ณผ ๋ค๋ฅด๊ฑฐ๋ ์ฌ์ค๊ณผ ๋ชจ์๋๋ ์๋ต. ASR ๊ต์ ๋งฅ๋ฝ์์ ์๋ณธ ์ค๋์ค์ ์๋ ๋ด์ฉ์ fabrication โ ์ ์ฌ ์ค๋ฅ์ ์ง๊ฒฐ
Image (3 page, edited: 2026-03-27)
Problems:
[Table I analysis] Causes of LLM hallucination by type:
[Faithful Hallucinations]
Instruction Violation: refuses the correction request ("Sorry, I can't answer")
Redundant Output: appends unnecessary text to the correction ("This answer is: …")
Continue Writing: arbitrarily continues and extends the original sentence
Blank Output: produces no output at all
Repeated Output: repeats a word endlessly
Grammar Correction: over-corrects grammar itself rather than ASR errors (the key failure mode of wrongly modifying correct text)
[Factual Hallucinations]
- Make Mistake: replaces an erroneous word with another erroneous word (hask → some other word rather than task)
→ Link to the paper's claims: the design rationale for Pre-Detection blocking Faithful hallucinations and Verification blocking both types
Highlight (4 page, edited: 2026-03-27)
we categorize LLM hallucinations in ASR error correction into two types: faithful hallucinationsโincluding instruction violations, redundant outputs, continuation of writing, blank outputs, and grammar correctionsโand factual hallucinations, characterized by content errors.
Problems:
Table I ๊ทผ๊ฑฐ ํ ์คํธ: hallucination์ Faithful(๋ช ๋ น์๋ฐ/์ค๋ณต์ถ๋ ฅ/์ด์ด์ฐ๊ธฐ/๋น์ถ๋ ฅ/๋ฐ๋ณต์ถ๋ ฅ/๋ฌธ๋ฒ๊ณผ๊ต์ )๊ณผ Factual(๋ด์ฉ์ค๋ฅ) ๋ ์ ํ์ผ๋ก ๋ถ๋ฅ. ์ด ๋ถ๋ฅ๊ฐ RLLM-CF 3๋จ๊ณ ์ค๊ณ ๊ทผ๊ฑฐ โ Pre-Detection์ Faithful์, Verification์ ๋ ์ ํ ๋ชจ๋ ์ต์
๐ก Prior Research
Highlight (1 page, edited: 2026-03-27)
Autoregressive models exploit encoder-decoder architectures with Connectionist Temporal Classification (CTC) loss [10], including translation-style correction frameworks [7], [8] and entity-aware transformers [9]. In parallel, non-autoregressive edit-based models such as FastCorrect [1], [2] and SoftCorrect [4] predict edit operations through duration modeling and integrate CTC loss with sequence-to-sequence frameworks to enhance error detection.
Prior Research:
๊ธฐ์กด ASR ๊ต์ ๋ ๊ฐ์ง ํจ๋ฌ๋ค์ โ (1) autoregressive seq2seq (CTC loss, translation-style, entity-aware transformer) (2) non-autoregressive edit ๊ธฐ๋ฐ (FastCorrect, SoftCorrect). ๋ชจ๋ ๋๊ท๋ชจ labeled data์ task-specific training ์์กด
Highlight (2 page, edited: 2026-03-27)
In recent years, integrating LLMs into ASR error correction pipelines has attracted increasing attention. Min and Wang [20] investigated the direct application of LLMs for error correction and concluded that models such as GPT-4o are ineffective due to hallucination issues. Ma et al. [19] explored combining Nbest rescoring with LLM-based correction; however, obtaining the N-best list from the ASR system may impose additional costs in practice. Yang et al. [18] compared LLM-based rescoring and generation methods, demonstrating that the latter outperforms the former when domain-specific information is provided or the LLM is fine-tuned. Nevertheless, such approaches require either domain knowledge or fine-tuning.
Prior Research:ย
Min & Wang [20] โ GPT-4o๋ hallucination์ผ๋ก ์ง์ ๊ต์ ์ ๋นํจ๊ณผ์ . Ma et al. [19] โ N-best rescoring + LLM ๊ฒฐํฉ, ๊ทธ๋ฌ๋ N-best ํ๋ ์ถ๊ฐ ๋น์ฉ. Yang et al. [18] โ ๋๋ฉ์ธ ์ ๋ณด/fine-tuning ์ generation > rescoring์ด๋ ์ธ๋ถ ์์ ์์กด. ๋ชจ๋ supplementary resource ๋๋ fine-tuning ํ์
๐ต Main Idea
Highlight (1 page, edited: 2026-03-27)
we propose the Reliable LLM Correction Framework (RLLMCF), which consists of three stages: (1) error pre-detection, (2) chain-of-thought sub-tasks iterative correction, and (3) reasoning process verification. The advantage of our method is that it does not require additional information or fine-tuning of the model, and ensures the correctness of the LLM correction under multipass programming.
Main Idea:
RLLM-CF three-stage framework → (1) Error Pre-Detection (2) CoT Subtask Iterative Correction (3) Reasoning Process Verification. Suppresses hallucinations using only the LLM's pretrained knowledge, with no fine-tuning and no external information
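The three-stage control flow above can be sketched as a small driver. This is a minimal paraphrase of the framework's Algorithm 1, not the authors' code: `detect_errors`, `cot_correct`, and `verify` are hypothetical placeholders for the three LLM prompt stages.

```python
# Minimal sketch of the RLLM-CF control flow (paraphrasing the paper's
# Algorithm 1). detect_errors / cot_correct / verify stand in for the
# three LLM prompt stages and are hypothetical, not the authors' code.
MAX_RETRIES = 3  # the paper retries low-confidence corrections up to 3 times

def rllm_cf(sentence, detect_errors, cot_correct, verify):
    # Stage 1: error pre-detection -- correct sentences pass through untouched.
    if not detect_errors(sentence):
        return sentence
    # Stage 2 + Stage 3: CoT correction, re-tried until verification passes.
    for _ in range(MAX_RETRIES):
        candidate, reasoning = cot_correct(sentence)
        if verify(candidate, reasoning):  # format + completed reasoning steps
            return candidate
    # Fall back to the original text when verification never passes.
    return sentence

# Toy usage with stub stages (any callables with these shapes work):
fixed = rllm_cf(
    "a new bread of bakers",
    detect_errors=lambda s: "bread" in s,
    cot_correct=lambda s: (s.replace("bread", "breed"), "reasoning trace"),
    verify=lambda c, r: True,
)
print(fixed)  # a new breed of bakers
```

The fallback on verification failure is what makes the framework conservative: a hallucinated correction that cannot be verified is discarded rather than emitted.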
Image (3 page, edited: 2026-03-27)
Main Idea:
Prompt ์ค๊ณ ๊ตฌ์กฐ (Figure 2): Task/Scene โ Error Detection โ 4๊ฐ Sub-Tasks (Locating / Pronunciation / Candidates / Selection) โ Format & Flag ์ ๊ณ์ธต์ ๊ตฌ์กฐ. โThink many times and be sure of the outcomeโ ๋ฌธ๊ตฌ๊ฐ iterative correction์ prompt ์์ค ์ ๋ ์ฅ์น. few-shot 3๊ฐ ์์ ํฌํจ์ผ๋ก LLM ์ถ๋ ฅ ๊ณต๊ฐ์ ๊ตฌ์กฐํ
๐ข Methods
Image (2 page, edited: 2026-03-26)
Highlight (3 page, edited: 2026-03-27)
This highlights the necessity of adopting an error prevention first strategy for correction tasks. To prevent LLMs from altering correct content, we first instruct the model to detect errors in the input sentence. If no errors are detected, the sentence is directly retained; otherwise, the model proceeds to the correction stage, referred to as Stage 1 in Algorithm 1.
Methods:
Stage 1 (Error Pre-Detection): detect errors first; if none are found, return the input unchanged → blocks the LLM from modifying correct text at the source. The core of the "error prevention first" strategy
Highlight (3 page, edited: 2026-03-27)
To address this, we decompose the correction task into four subtasksโlocalization, pronunciation assessment, candidate generation, and candidate selectionโfollowing a CoT strategy to improve reasoning reliability, as illustrated in Figure 3.
Methods:
Stage 2 (CoT Subtask Iterative Correction): decompose correction into 4 subtasks → (1) localization (2) pronunciation assessment (3) candidate generation (4) candidate selection. If confidence is low, repeat the correction up to 3 times
Highlight (3 page, edited: 2026-03-27)
Following the correction, a verification step is conducted to ensure compliance with task instructions. Specifically, we employ the model's output to assess: (1) whether the answer conforms to the required format, and (2) whether all reasoning steps are correctly completed. Only when both criteria are satisfied…
Methods:
Stage 3 (Answer Verification): (1) whether the required format is followed (2) whether all reasoning steps are completed → the correction is adopted only when both conditions are met; otherwise the original text is returned
Image (3 page, edited: 2026-03-27)
Methods:
Figure 3 CoT worked-example analysis: the "bread" → "breed" correction. (1) Locate: pinpoint "bread" (2) Pronunciation: confirm /bred/ (3) Candidates: generate "breed" /briːd/, "bled" /blɛd/, "brand" /brænd/ (4) Selection: choose "breed" given the context ("bakers"). Combining phonetic similarity with context is the key → demonstrates correcting homophone-driven ASR errors with pure language-model knowledge
Highlight (4 page, edited: 2026-03-27)
In our experiment, we adopt a conformer-based attentionencoder-decoder (AED) ASR model trained with the WeNet toolkit [28], following the U2++ [29] method, across three datasets. For AISHELL-1 and LibriSpeech, a non-streaming model architecture is employed, and four decoding strategies are evaluated: Attention, Attention Rescoring, CTC Greedy Search, and CTC Prefix Search. To further assess the performance of streaming models, experiments are conducted on AISHELL-2 using the Attention Rescoring decoding method with a chunk size of 16. The Conformer encoder comprises 12 blocks with an attention dimension of 256, 4 attention heads, and 2048 linear units, incorporating relative positional encoding and Swish activation. The decoder is implemented as a bidirectional Transformer with three forward and three backward layers. The network follows a hybrid CTC/attention architecture, with the CTC loss weight set to 0.3.
Methods:
ASR ๋ชจ๋ธ ๊ตฌ์ฑ ์์ธ: Conformer encoder (12 blocks, dim 256, 4 heads, 2048 FFN, relative positional encoding, Swish activation) + Bidirectional Transformer decoder (3 forward + 3 backward layers). Hybrid CTC/Attention, CTC loss weight 0.3. AISHELL-2๋ chunk size 16 streaming ๋ชจ๋ธ. WeNet toolkit + U2++ ๋ฐฉ๋ฒ โ 4๊ฐ์ง decoding ์ ๋ต(Attention / Attention Rescore / CTC Greedy / CTC Prefix) ๋ชจ๋ ์ง์
๐ Limitations
Highlight (5 page, edited: 2026-03-27)
Experiments with DeepSeek-V2 on LibriSpeech yielded suboptimal performance, as DeepSeek-V2 is primarily designed for Chinese and demonstrates limited capability on English datasets.
Limitations:
Weak DeepSeek-V2 English performance: DeepSeek-V2 is designed primarily for Chinese and is suboptimal on LibriSpeech (English), while GPT-4o improves on both Chinese and English → RLLM-CF performance depends heavily on the quality of the backbone LLM. This tensions with the generality claim (no fine-tuning, cross-domain): in practice a strong multilingual LLM is a prerequisite
Highlight (5 page, edited: 2026-03-27)
However, a slight increase in deletion and insertion errors is observed, primarily due to hallucinations.
Limitations:
๊ต์ ํ deletion/insertion ์ค๋ฅ ์ํญ ์ฆ๊ฐ โ hallucination ์์ ์ ๊ฑฐ ๋ถ๊ฐ. substitution ๊ต์ ์๋ ํจ๊ณผ์ ์ด๋ ๊ตฌ์กฐ์ ์ค๋ฅ(์ฝ์ /์ญ์ ) ์ต์ ์ ํ๊ณ
Highlight (6 page, edited: 2026-03-27)
For sentences containing multiple errors, the overall error count was reduced, although precise quantification remains challenging.
Limitations:
๋ณต์ ์ค๋ฅ ํฌํจ ๋ฌธ์ฅ์์ ์ค๋ฅ ๊ฐ์๋ ์ ๋ฐ ์ ๋ํ ์ด๋ ค์. ์ค์ ๋ก 98๊ฐ์ ์๋ ์ฌ๋ฐ๋ฅธ ๋ฌธ์ฅ์ด ์๋ชป ๊ต์ ๋จ(383๊ฐ ๊ต์ ์ฑ๊ณต ๋๋น) โ hallucination ์์ ์ ๊ฑฐ ๋ถ๊ฐ ํ๊ณ ์ฌํ์ธ
๐ฃ Key Concepts to Clarify
Highlight (1 page, edited: 2026-03-27)
Connectionist Temporal Classification (CTC) loss
Key Concepts to Clarify:
CTC (Connectionist Temporal Classification): a sequence-labeling loss that trains even when input and output lengths differ. In ASR it solves the alignment problem between speech frames and text tokens. CTC Greedy / Prefix Search are strategies for decoding this output → the basis for the four decoding rows in Tables II–III
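As a toy illustration of how CTC resolves the length mismatch, CTC greedy decoding collapses repeated per-frame labels and then drops blanks. This is a generic sketch of the decode rule, not the WeNet implementation:

```python
from itertools import groupby

BLANK = "<b>"  # CTC blank symbol (illustrative name)

def ctc_greedy_collapse(frame_labels):
    """CTC greedy decode rule: merge consecutive repeats, then drop blanks."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return [label for label in collapsed if label != BLANK]

# 8 frames of per-frame argmax labels decode to the 3-token output "c a t":
print(ctc_greedy_collapse(["c", "c", BLANK, "a", "a", BLANK, "t", "t"]))
```

The blank symbol is what lets CTC emit genuinely repeated tokens ("l", blank, "l" survives as "l l"), which is why greedy and prefix search are distinct decoding strategies.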
Highlight (1 page, edited: 2026-03-27)
Chain-of-Thought (CoT) prompting to enhance reasoning [15].
Key Concepts to Clarify:
Chain-of-Thought (CoT) Prompting: a prompting technique that induces the LLM to generate explicit intermediate reasoning steps instead of only the final answer [Wei et al., 2022]. Improves reliability on complex reasoning → the theoretical basis of RLLM-CF Stage 2. The key contribution is adding iterative correction because plain CoT alone is insufficient
Highlight (1 page, edited: 2026-03-27)
Fine-tuning approaches, often using low-rank adaptation (LoRA)
Key Concepts to Clarify:
LoRA (Low-Rank Adaptation): a lightweight fine-tuning technique that updates only a small number of parameters via low-rank matrix factorization instead of retraining all parameters of a large model. This paper uses not even LoRA → the meaning of the "no fine-tuning" claim: pure prompting with no parameter updates of any kind, LoRA included
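The parameter savings behind LoRA come from replacing a full d_out × d_in weight update with two low-rank factors. A quick back-of-the-envelope with illustrative dimensions:

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameter counts: full update of one weight matrix W
    versus a LoRA update W + B @ A with B: (d_out, rank), A: (rank, d_in)."""
    full = d_in * d_out            # full fine-tuning of the matrix
    lora = rank * (d_in + d_out)   # only the two low-rank factors
    return full, lora

# Illustrative 4096x4096 projection with rank 8:
full, lora = lora_params(4096, 4096, rank=8)
print(full // lora)  # full fine-tuning trains 256x more parameters
```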
Highlight (2 page, edited: 2026-03-27)
however, obtaining the N-best list from the ASR system may impose additional costs in practice.
Key Concepts to Clarify:
N-best list: instead of returning only the single best result, the ASR decoder returns the top-N candidate sequences by probability. An LLM can then rescore these N candidates and pick the most plausible one. But generating the N-best list itself adds compute cost → why this paper does not use it
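N-best rescoring reduces to picking the hypothesis that maximizes some LLM score. A minimal sketch, where the scoring function is a hypothetical placeholder for an LLM log-likelihood:

```python
def rescore_nbest(nbest, llm_score):
    """Pick the highest-scoring hypothesis from an ASR N-best list.
    llm_score is a hypothetical stand-in for an LLM likelihood function."""
    return max(nbest, key=llm_score)

# Toy scorer that prefers the contextually plausible hypothesis:
nbest = ["a new bread of bakers", "a new breed of bakers"]
score = lambda s: 1.0 if "breed" in s else 0.0
print(rescore_nbest(nbest, score))  # a new breed of bakers
```

Note the contrast with RLLM-CF: rescoring can only choose among ASR-provided candidates, while generation-style correction can produce words absent from the N-best list.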
Highlight (5 page, edited: 2026-03-27)
To analyze noun recall within substitution errors, noun filtering was performed using the method proposed in [32] for Chinese (AISHELL-1/2) and [33] for English (LibriSpeech).
Key Concepts to Clarify:
Noun Recall: a metric measuring how well proper nouns/terminology are recovered after correction. Measured separately because homophone-driven ASR errors concentrate in proper nouns. Noun filtering uses N-LTP [32] for Chinese and NLTK [33] for English. Alongside CER/WER gains, the Noun Recall improvement is a key strength of LLM correction → context-driven homophone disambiguation
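Once the taggers have extracted the nouns, the metric itself is simple set overlap between reference nouns and the corrected hypothesis. A generic sketch (the paper's exact counting protocol is not specified here):

```python
def noun_recall(reference_nouns, hypothesis_tokens):
    """Fraction of reference nouns that survive in the corrected hypothesis.
    reference_nouns: nouns extracted from the ground-truth transcript
    (via a POS tagger such as N-LTP or NLTK, per the paper)."""
    hyp = set(hypothesis_tokens)
    hits = sum(1 for noun in reference_nouns if noun in hyp)
    return hits / len(reference_nouns)

# The homophone fix recovers both reference nouns:
print(noun_recall({"bakers", "breed"}, "a new breed of bakers".split()))
```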
๐ช Results
Highlight (1 page, edited: 2026-03-27)
Experiments on AISHELL-1, AISHELL-2, and Librispeech show that the GPT-4o model enhanced by our framework achieves 21%, 11%, 9%, and 11.4% relative reductions in CER/WER.
Results:
GPT-4o + RLLM-CF final performance → AISHELL-1: 21%, AISHELL-2: 11%, LibriSpeech test-clean: 9%, test-other: 11.4% relative CER/WER reduction. See Tables II–IV for detailed numbers
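The headline percentages are relative reductions. For example, the AISHELL-1 CTC result quoted in this note (5.17 → 4.06) reproduces the ~21% figure:

```python
def relative_reduction(before, after):
    """Relative CER/WER reduction in percent (negative = degradation)."""
    return 100.0 * (before - after) / before

# AISHELL-1 CTC decoding, GPT-4o: 5.17 -> 4.06
print(round(relative_reduction(5.17, 4.06), 1))  # 21.5

# Plain-prompt ablation: 5.06 -> 53.10 is a ~949% CER increase
print(round(relative_reduction(5.06, 53.10)))  # -949
```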
Image (5 page, edited: 2026-03-27)
Results:
[Table II analysis] AISHELL-1 CER / Noun Recall (GPT-4o vs DeepSeek-V2):
[GPT-4o]
Attention: 5.06→4.32 (-14.6%), Attention Rescoring: 4.62→4.01 (-13%)
CTC Greedy/Prefix: 5.17→4.06 (-21%) → source of the 21% figure in the abstract
Average 17.4% relative reduction; Noun Recall up 2–3 pp
[DeepSeek-V2]
Best is CTC Prefix: 5.17→4.48 (-13%)
Overall behind GPT-4o
[Links to the paper's claims]
Substitution errors drop sharply while del/ins rise slightly → hallucinations not fully removed, consistent with Limitations
GPT-4o consistently beats DeepSeek-V2 → model quality directly drives correction performance
Noun Recall improves → LLMs are effective at correcting homophone-driven proper-noun errors
Image (5 page, edited: 2026-03-27)
Results:
[Table III analysis] LibriSpeech WER (GPT-4o, test-clean/other):
[test-clean]
Attention Rescoring: 3.35→3.19 (-4.8%, lowest WER) → source of the 9% figure in the abstract
Average 5.72% relative reduction
[test-other (noisy condition)]
CTC Greedy: 9.52→8.45 (-11.4%) → source of the 11.4% figure in the abstract
Larger absolute reduction than test-clean → the more errors, the bigger the correction gain
[Links to the paper's claims]
Also effective in English → supports the cross-language generality claim without fine-tuning
Lower than AISHELL-1's 21% → the LLM is stronger at Chinese homophone patterns
Same del/ins increase pattern → hallucinations persist regardless of language
[Table IV analysis] AISHELL-2 streaming CER:
GPT-4o: 5.57→4.95 (-11%) → source of the 11% figure in the abstract
Confirms RLLM-CF is effective for streaming models too
Image (5 page, edited: 2026-03-27)
Results:
[Figure 4 analysis] Sentence-level flow over 7,176 sentences: Pre-Detection flags 2,043 errors → 1,915 retained with confidence, 128 abandoned after retries → Verification drops a further 347 → final 1,568 correction attempts → 383 successful corrections / 98 mis-corrections (originally correct sentences damaged). Precision view: 383/(383+98) ≈ 79.6% → quantifies the limit that hallucinations cannot be fully removed
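The ~79.6% figure is simply the precision of the attempted corrections, computed from the Figure 4 counts quoted above:

```python
# Correction precision from the Figure 4 sentence counts:
successful = 383         # sentences actually fixed
wrongly_modified = 98    # originally correct sentences damaged
precision = successful / (successful + wrongly_modified)
print(round(100 * precision, 1))  # 79.6
```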
๐ Ablation Study
Image (5 page, edited: 2026-03-27)
Ablation Study:
[Table V analysis] CER by component (AISHELL-1, DeepSeek-V2, Attention decoding):
[CER by stage]
Baseline: 5.06 (no correction)
+Base (plain prompt): 53.10 (+949%) → hallucination explosion; unusable
+Pre-Detection: 8.19 → largest single improvement; suppresses Faithful hallucinations
+CoT Sub-Tasks: 7.05 → suppresses insertions/deletions by constraining the LLM's output space
+Iterative Correction: 4.89 → first point below baseline; overcomes the single-pass limit
+Answer Verification: 4.69 → removes residual hallucinations
[Token cost]
- Base: 62k/89k → CoT: 239k/122k (input surges) → finishing at 251k/260k with Iterative+Verification
[Links to the paper's claims]
Pre-Detection is the key contributing stage → validates the "error prevention first" strategy
CoT alone cannot get below baseline → Iterative correction is the necessary complement
Verification adds gains at nearly zero cost (no token increase) → efficient design
Components are sequentially complementary → justifies the need for all three stages
[Table V analysis] Token consumption:
Base: 62k/89k → +Pre-Detection: 72k/87k (slight input increase) → +CoT Sub-Tasks: 239k/122k (input surges, output drops) → +Iterative+Verification: 251k/260k (output surges).
CoT is the main cost driver, more than tripling input tokens.
Verification improves performance with almost no extra tokens → the most efficient component. In the experiments, multiple sentences were batched per inference call to cut token cost