Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction

TL;DR

Contribution:: A three-stage hallucination-suppressing ASR correction framework (Pre-Detection + CoT Subtask + Verification) that runs on a general-purpose LLM alone, with no fine-tuning or external information
Pros:: No extra training or data required; applies to both Chinese and English; each stage contributes modularly to hallucination suppression
Cons:: Slight increase in deletion/insertion errors (hallucinations are not fully eliminated); token cost surges (~4x) once CoT is introduced; gains on multi-error sentences are hard to quantify precisely

Study Snapshot

Key takeaway:: With a plain-prompt LLM, CER explodes by 949% over the baseline (5.06 → 53.10), whereas the three-stage framework improves on the baseline by up to 21%.
Methods:: Conformer-based AED ASR + GPT-4o/DeepSeek-V2, benchmarked on AISHELL-1/2 + LibriSpeech
Outcomes:: Effective at correcting substitution errors, improves Noun Recall, and remains effective under chunk-based (streaming) decoding.
Results:: Relative CER/WER reductions of 21% (AISHELL-1), 11% (AISHELL-2), 9% (LibriSpeech test-clean), and 11.4% (test-other)



Reading notes

๐Ÿ”ด Problems

Highlight (1 page, edited: 2026-03-27)

Although traditional approaches demonstrate moderate effectiveness, LLMs offer a paradigm that eliminates the need for training and labeled data. However, directly using LLMs will encounter hallucinations problem, which may lead to the modification of the correct text.

Problems:

LLM์„ ASR ๊ต์ •์— ์ง์ ‘ ์ ์šฉํ•˜๋ฉด hallucination์œผ๋กœ ์ธํ•ด ์˜ฌ๋ฐ”๋ฅธ ํ…์ŠคํŠธ๊นŒ์ง€ ์ž˜๋ชป ์ˆ˜์ •ํ•˜๋Š” ์—ญํšจ๊ณผ ๋ฐœ์ƒ โ†’ RLLM-CF ์„ค๊ณ„์˜ ํ•ต์‹ฌ ๋™๊ธฐ

Highlight (1 page, edited: 2026-03-27)

However, these methods typically require either domain-specific fine-tuning or additional contextual information, which constrains their scalability in real-world applications. Therefore, leveraging the general knowledge embedded in LLMs without fine-tuning or external inputs emerges as a more practical and scalable alternative. Nevertheless, directly applying general LLMs often results in hallucination issues [20], posing a significant challenge to reliable correction.

Problems:

  • Existing methods require either domain fine-tuning or external information (e.g., an N-best list) → limited scalability.

  • Directly applying a general-purpose LLM triggers hallucinations that undermine reliable correction.

Highlight (2 page, edited: 2026-03-27)

Hallucinations in LLMs refer to instances where the generated responses, while grammatically correct, fluent, and plausible, deviate from the input or contradict factual information [21], [22]. In the context of ASR error correction, model-induced hallucinations are closely linked to transcription errors.

Problems:

LLM hallucination ์ •์˜ โ€” ๋ฌธ๋ฒ•์ ์œผ๋กœ ์œ ์ฐฝํ•˜๊ณ  ๊ทธ๋Ÿด๋“ฏํ•˜๋‚˜ ์ž…๋ ฅ๊ณผ ๋‹ค๋ฅด๊ฑฐ๋‚˜ ์‚ฌ์‹ค๊ณผ ๋ชจ์ˆœ๋˜๋Š” ์‘๋‹ต. ASR ๊ต์ • ๋งฅ๋ฝ์—์„œ ์›๋ณธ ์˜ค๋””์˜ค์— ์—†๋Š” ๋‚ด์šฉ์„ fabrication โ†’ ์ „์‚ฌ ์˜ค๋ฅ˜์™€ ์ง๊ฒฐ

Image (3 page, edited: 2026-03-27)

Problems:

[Table I ๋ถ„์„] LLM hallucination ์œ ํ˜•๋ณ„ ์ƒ์„ธ:

[Faithful Hallucinations]

  • Instruction Violation: ๊ต์ • ๋Œ€์‹  ์งˆ๋ฌธ ๊ฑฐ๋ถ€ ์‘๋‹ต (โ€œSorry, I canโ€™t answerโ€)

  • Redundant Output: ๊ต์ • ๊ฒฐ๊ณผ์— ๋ถˆํ•„์š”ํ•œ ํ…์ŠคํŠธ ์ถ”๊ฐ€ (โ€œThis answer is: โ€ฆโ€œ)

  • Continue Writing: ์›๋ฌธ์„ ์ž„์˜๋กœ ์ด์–ด์จ์„œ ํ™•์žฅ

  • Blank Output: ์•„๋ฌด๊ฒƒ๋„ ์ถœ๋ ฅํ•˜์ง€ ์•Š์Œ

  • Repeated Output: ๋‹จ์–ด๋ฅผ ๋ฌดํ•œ ๋ฐ˜๋ณต ์ถœ๋ ฅ

  • Grammar Correction: ASR ์˜ค๋ฅ˜๊ฐ€ ์•„๋‹Œ ๋ฌธ๋ฒ• ์ž์ฒด๋ฅผ ๊ณผ๊ต์ • (์˜ฌ๋ฐ”๋ฅธ ํ…์ŠคํŠธ๋ฅผ ์ž˜๋ชป ์ˆ˜์ •ํ•˜๋Š” ํ•ต์‹ฌ ๋ฌธ์ œ)

[Factual Hallucinations]

  • Make Mistake: ์˜ค๋ฅ˜ ๋‹จ์–ด๋ฅผ ๋‹ค๋ฅธ ์˜ค๋ฅ˜ ๋‹จ์–ด๋กœ ๋Œ€์ฒด (haskโ†’task๊ฐ€ ์•„๋‹Œ ๋‹ค๋ฅธ ๋‹จ์–ด๋กœ)

โ†’ ๋…ผ๋ฌธ ์ฃผ์žฅ ์—ฐ๊ฒฐ: Pre-Detection์ด Faithful์„, Verification์ด ๋‘ ์œ ํ˜• ๋ชจ๋‘๋ฅผ ์ฐจ๋‹จํ•˜๋Š” ์„ค๊ณ„ ๊ทผ๊ฑฐ
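The taxonomy above suggests that several faithful-hallucination types are detectable from surface features of the output alone. A minimal heuristic screen (the rules and thresholds are my own illustration, not the paper's detection logic):

```python
import re
from typing import Optional

def classify_faithful_hallucination(asr_input: str, llm_output: str) -> Optional[str]:
    """Heuristic screen for the faithful-hallucination types in Table I.

    These rules are illustrative guesses, not the paper's method."""
    out = llm_output.strip()
    if not out:
        return "blank_output"
    if out.lower().startswith(("sorry", "i can't", "i cannot")):
        return "instruction_violation"
    tokens = out.split()
    # A long output with almost no distinct tokens suggests degenerate looping.
    if len(tokens) >= 10 and len(set(tokens)) <= 2:
        return "repeated_output"
    # Boilerplate prefixes such as "This answer is:" are redundant output.
    if re.match(r"(?i)^(the\s+)?(this\s+)?answer\s+is\b", out):
        return "redundant_output"
    # Output far longer than the input suggests continuation writing.
    if len(tokens) > 2 * max(len(asr_input.split()), 1):
        return "continue_writing"
    return None  # no faithful hallucination caught by these heuristics

print(classify_faithful_hallucination("he has a task", "Sorry, I can't answer"))  # instruction_violation
```

Grammar over-correction and factual (make-mistake) hallucinations are not surface-detectable this way, which is presumably why the paper resorts to a verification stage instead.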

Highlight (4 page, edited: 2026-03-27)

we categorize LLM hallucinations in ASR error correction into two types: faithful hallucinationsโ€”including instruction violations, redundant outputs, continuation of writing, blank outputs, and grammar correctionsโ€”and factual hallucinations, characterized by content errors.

Problems:

Supporting text for Table I: hallucinations are classified into Faithful (instruction violation / redundant output / continuation / blank output / repeated output / grammar over-correction) and Factual (content errors). This taxonomy grounds the three-stage RLLM-CF design — Pre-Detection suppresses faithful hallucinations; Verification suppresses both types

๐ŸŸก Prior Research

Highlight (1 page, edited: 2026-03-27)

Autoregressive models exploit encoder-decoder architectures with Connectionist Temporal Classification (CTC) loss [10], including translation-style correction frameworks [7], [8] and entity-aware transformers [9]. In parallel, non-autoregressive edit-based models such as FastCorrect [1], [2] and SoftCorrect [4] predict edit operations through duration modeling and integrate CTC loss with sequence-to-sequence frameworks to enhance error detection.

Prior Research:

Two existing ASR-correction paradigms — (1) autoregressive seq2seq (CTC loss, translation-style, entity-aware transformers); (2) non-autoregressive edit-based models (FastCorrect, SoftCorrect). Both depend on large-scale labeled data and task-specific training

Highlight (2 page, edited: 2026-03-27)

In recent years, integrating LLMs into ASR error correction pipelines has attracted increasing attention. Min and Wang [20] investigated the direct application of LLMs for error correction and concluded that models such as GPT-4o are ineffective due to hallucination issues. Ma et al. [19] explored combining N-best rescoring with LLM-based correction; however, obtaining the N-best list from the ASR system may impose additional costs in practice. Yang et al. [18] compared LLM-based rescoring and generation methods, demonstrating that the latter outperforms the former when domain-specific information is provided or the LLM is fine-tuned. Nevertheless, such approaches require either domain knowledge or fine-tuning.

Prior Research:

Min & Wang [20] — even GPT-4o is ineffective for direct correction due to hallucinations. Ma et al. [19] — combine N-best rescoring with an LLM, but obtaining the N-best list adds cost. Yang et al. [18] — generation beats rescoring when domain information or fine-tuning is available, but both rely on external resources. All require supplementary resources or fine-tuning

๐Ÿ”ต Main Idea

Highlight (1 page, edited: 2026-03-27)

we propose the Reliable LLM Correction Framework (RLLM-CF), which consists of three stages: (1) error pre-detection, (2) chain-of-thought sub-tasks iterative correction, and (3) reasoning process verification. The advantage of our method is that it does not require additional information or fine-tuning of the model, and ensures the correctness of the LLM correction under multipass programming.

Main Idea:

The three-stage RLLM-CF framework — (1) Error Pre-Detection, (2) CoT Subtask Iterative Correction, (3) Reasoning Process Verification. Suppresses hallucinations using only the LLM's pretrained knowledge, with no fine-tuning or external information
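The three stages reduce to plain control flow. A minimal sketch, where the `llm` callable and the Yes/No prompt strings are hypothetical stand-ins for the paper's actual prompts:

```python
def rllm_cf_correct(sentence: str, llm, max_iters: int = 3) -> str:
    """Control-flow sketch of the three RLLM-CF stages.

    Prompts and response parsing are illustrative assumptions."""
    # Stage 1: error pre-detection -- keep correct sentences untouched.
    detect = llm(f"Does this transcript contain an ASR error? Yes/No: {sentence}")
    if detect.strip().lower().startswith("no"):
        return sentence
    # Stage 2: CoT subtask correction, repeated while confidence is low.
    for _ in range(max_iters):
        answer = llm(
            "Correct the transcript step by step "
            "(locate -> pronounce -> candidates -> select): " + sentence
        )
        # Stage 3: verification -- accept only well-formed, complete reasoning.
        verdict = llm(f"Is this answer correctly formatted and fully reasoned? Yes/No: {answer}")
        if verdict.strip().lower().startswith("yes"):
            return answer
    # Fall back to the original sentence when verification never passes.
    return sentence
```

A real implementation would parse structured LLM responses rather than Yes/No prefixes, but the fallback-to-original behavior is the point: the framework prefers leaving text alone over risking a hallucinated edit.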

Image (3 page, edited: 2026-03-27)

Main Idea:

Prompt design structure (Figure 2): a hierarchy of Task/Scene → Error Detection → 4 Sub-Tasks (Locating / Pronunciation / Candidates / Selection) → Format & Flag. The phrase "Think many times and be sure of the outcome" is the prompt-level trigger for iterative correction. Three few-shot examples structure the LLM's output space

๐ŸŸข Methods

Image (2 page, edited: 2026-03-26)

Highlight (3 page, edited: 2026-03-27)

This highlights the necessity of adopting an error prevention first strategy for correction tasks. To prevent LLMs from altering correct content, we first instruct the model to detect errors in the input sentence. If no errors are detected, the sentence is directly retained; otherwise, the model proceeds to the correction stage, referred to as Stage 1 in Algorithm 1.

Methods:

Stage 1 (Error Pre-Detection): detect errors first; if none are found, return the input unchanged → blocks the LLM from modifying correct text at the source. The heart of the "error prevention first" strategy

Highlight (3 page, edited: 2026-03-27)

To address this, we decompose the correction task into four subtasksโ€”localization, pronunciation assessment, candidate generation, and candidate selectionโ€”following a CoT strategy to improve reasoning reliability, as illustrated in Figure 3.

Methods:

Stage 2 (CoT Subtask Iterative Correction): decomposes correction into four subtasks — (1) localization, (2) pronunciation assessment, (3) candidate generation, (4) candidate selection. When confidence is low, correction is repeated up to 3 times

Highlight (3 page, edited: 2026-03-27)

Following the correction, a verification step is conducted to ensure compliance with task instructions. Specifically, we employ the model’s output to assess: (1) whether the answer conforms to the required format, and (2) whether all reasoning steps are correctly completed. Only when both criteria are satisfied…

Methods:

Stage 3 (Answer Verification): checks (1) compliance with the required format and (2) completion of all reasoning steps — the correction is adopted only when both hold; otherwise the original sentence is returned
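The two Stage-3 criteria amount to cheap structural checks on the model's raw output. A sketch assuming hypothetical `Answer:` and `Step N` markers (the paper's actual format flags differ):

```python
import re

def verify_answer(raw_output: str) -> bool:
    """Sketch of the two Stage-3 checks: (1) required output format present,
    (2) all four reasoning steps completed. Markers are assumptions."""
    # (1) assume the final answer must appear on a line like "Answer: <text>"
    has_answer = re.search(r"^Answer:\s*\S+", raw_output, re.MULTILINE) is not None
    # (2) assume each subtask emits a numbered step marker
    steps_done = all(f"Step {i}" in raw_output for i in range(1, 5))
    return has_answer and steps_done

demo = "Step 1 ... Step 2 ... Step 3 ... Step 4 ...\nAnswer: the breed of dog"
print(verify_answer(demo))  # True
```

Because the check only inspects structure, it is nearly free in tokens, which matches the ablation finding that Verification adds performance at almost no cost.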

Image (3 page, edited: 2026-03-27)

Methods:

Figure 3 worked example: correcting "bread" → "breed". (1) Locate: pinpoint "bread"; (2) Pronunciation: confirm /bred/; (3) Candidates: generate "breed" /briːd/, "bled" /blɛd/, "brand" /brænd/; (4) Selection: choose "breed" given the context ("bakers"). Combining pronunciation similarity with context is the key — demonstrating that homophone-driven ASR errors can be corrected from pure language-model knowledge
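The pronunciation step of that example can be approximated with edit distance over phoneme strings. A toy sketch with a made-up mini lexicon (`PRON` is illustrative; a real system would use a pronunciation dictionary):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via a one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

# Toy ASCII pronunciations (illustrative stand-ins for real IPA entries).
PRON = {"bread": "bred", "breed": "brid", "bled": "bled", "brand": "brand"}

def rank_candidates(error_word: str, candidates: list) -> list:
    """Order candidates by phonetic closeness to the misrecognized word,
    mirroring the pronunciation step of the Figure 3 example."""
    return sorted(candidates, key=lambda w: edit_distance(PRON[error_word], PRON[w]))

print(rank_candidates("bread", ["breed", "bled", "brand"]))  # ['breed', 'bled', 'brand']
```

Pronunciation alone cannot separate "breed" from "bled" here (both are one edit away), which is exactly why the final selection step falls back on sentence context.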

Highlight (4 page, edited: 2026-03-27)

In our experiment, we adopt a conformer-based attention encoder-decoder (AED) ASR model trained with the WeNet toolkit [28], following the U2++ [29] method, across three datasets. For AISHELL-1 and LibriSpeech, a non-streaming model architecture is employed, and four decoding strategies are evaluated: Attention, Attention Rescoring, CTC Greedy Search, and CTC Prefix Search. To further assess the performance of streaming models, experiments are conducted on AISHELL-2 using the Attention Rescoring decoding method with a chunk size of 16. The Conformer encoder comprises 12 blocks with an attention dimension of 256, 4 attention heads, and 2048 linear units, incorporating relative positional encoding and Swish activation. The decoder is implemented as a bidirectional Transformer with three forward and three backward layers. The network follows a hybrid CTC/attention architecture, with the CTC loss weight set to 0.3.

Methods:

ASR ๋ชจ๋ธ ๊ตฌ์„ฑ ์ƒ์„ธ: Conformer encoder (12 blocks, dim 256, 4 heads, 2048 FFN, relative positional encoding, Swish activation) + Bidirectional Transformer decoder (3 forward + 3 backward layers). Hybrid CTC/Attention, CTC loss weight 0.3. AISHELL-2๋Š” chunk size 16 streaming ๋ชจ๋ธ. WeNet toolkit + U2++ ๋ฐฉ๋ฒ• โ†’ 4๊ฐ€์ง€ decoding ์ „๋žต(Attention / Attention Rescore / CTC Greedy / CTC Prefix) ๋ชจ๋‘ ์ง€์›

๐ŸŸ  Limitations

Highlight (5 page, edited: 2026-03-27)

Experiments with DeepSeek-V2 on LibriSpeech yielded suboptimal performance, as DeepSeek-V2 is primarily designed for Chinese and demonstrates limited capability on English datasets.

Limitations:

DeepSeek-V2 underperforms on English: designed primarily for Chinese, it is suboptimal on LibriSpeech, whereas GPT-4o is strong in both languages → RLLM-CF performance depends heavily on backbone-LLM quality. This cuts against the generality claim (no fine-tuning, cross-domain): in practice, a strong multilingual LLM is a prerequisite

Highlight (5 page, edited: 2026-03-27)

However, a slight increase in deletion and insertion errors is observed, primarily due to hallucinations.

Limitations:

๊ต์ • ํ›„ deletion/insertion ์˜ค๋ฅ˜ ์†Œํญ ์ฆ๊ฐ€ โ†’ hallucination ์™„์ „ ์ œ๊ฑฐ ๋ถˆ๊ฐ€. substitution ๊ต์ •์—๋Š” ํšจ๊ณผ์ ์ด๋‚˜ ๊ตฌ์กฐ์  ์˜ค๋ฅ˜(์‚ฝ์ž…/์‚ญ์ œ) ์–ต์ œ์— ํ•œ๊ณ„

Highlight (6 page, edited: 2026-03-27)

For sentences containing multiple errors, the overall error count was reduced, although precise quantification remains challenging.

Limitations:

For sentences with multiple errors, the reduction in error count is hard to quantify precisely. In fact, 98 originally correct sentences were wrongly corrected (versus 383 successful corrections) → reconfirms that hallucinations cannot be fully eliminated

๐ŸŸฃ Key Concepts to Clarify

Highlight (1 page, edited: 2026-03-27)

Connectionist Temporal Classification (CTC) loss

Key Concepts to Clarify:

CTC (Connectionist Temporal Classification): a sequence-labeling loss that trains despite mismatched input/output lengths, solving the audio-frame-to-text-token alignment problem in ASR. CTC Greedy / Prefix Search are strategies for decoding this output → the basis for the four decoding rows in Tables II–III
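The core of CTC decoding is collapsing frame-level labels into a shorter text sequence. A minimal greedy-decoding sketch showing why input and output lengths need not match:

```python
def ctc_greedy_collapse(frame_labels: list, blank: str = "-") -> str:
    """CTC greedy decoding: merge repeated frame labels, then drop blanks.

    A toy illustration; real decoders work on per-frame probability vectors."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)

# 8 audio frames collapse to the 3-character output "cat".
print(ctc_greedy_collapse(["c", "c", "-", "a", "a", "-", "t", "t"]))  # cat
```

The blank symbol is what lets genuinely doubled letters survive the merge: `["h","e","l","-","l","o"]` decodes to "hello", not "helo".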

Highlight (1 page, edited: 2026-03-27)

Chain-of-Thought (CoT) prompting to enhance reasoning [15].

Key Concepts to Clarify:

Chain-of-Thought (CoT) Prompting: a prompting technique that makes the LLM generate explicit intermediate reasoning steps instead of emitting only the final answer [Wei et al., 2022]. Improves reliability on complex reasoning → the theoretical basis for RLLM-CF Stage 2. The key contribution is recognizing that plain CoT alone was insufficient, hence the added iterative correction

Highlight (1 page, edited: 2026-03-27)

Fine-tuning approaches, often using low-rank adaptation (LoRA)

Key Concepts to Clarify:

LoRA (Low-Rank Adaptation): a lightweight fine-tuning technique that updates only a small number of parameters via low-rank matrix factorization instead of retraining the full model. This paper uses not even LoRA → the meaning of the "no fine-tuning" claim: pure prompting with no parameter updates of any kind, LoRA included
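The LoRA idea in one line: freeze W and learn a rank-r delta B·A. A toy parameter-count comparison with illustrative dimensions (4096 and rank 8 are assumptions, not values from the paper):

```python
import numpy as np

def lora_update(W: np.ndarray, A: np.ndarray, B: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """LoRA keeps W frozen and adds a learned low-rank delta: W' = W + alpha * B @ A."""
    return W + alpha * (B @ A)

d, k, r = 4096, 4096, 8      # illustrative hidden sizes and rank
full = d * k                 # parameters updated by full fine-tuning of W
lora = r * (d + k)           # parameters in the factors B (d x r) and A (r x k)
print(full, lora, round(full / lora))  # 16777216 65536 256
```

Even this ~256x reduction still means parameter updates and training data, which is what the paper's "no fine-tuning" setup avoids entirely.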

Highlight (2 page, edited: 2026-03-27)

however, obtaining the N-best list from the ASR system may impose additional costs in practice.

Key Concepts to Clarify:

N-best list: instead of returning only the top-1 result, the ASR decoder returns the N highest-probability candidate sequences; an LLM can then reselect (rescore) the best among them. But generating the N-best list itself adds computation → why this paper avoids it
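A minimal rescoring sketch: combine ASR and LM scores over the N candidates and keep the argmax. The 0.7/0.3 weights and the toy LM are assumptions for illustration:

```python
def rescore_nbest(hypotheses: list, lm_score) -> str:
    """Pick the candidate maximizing a weighted sum of ASR and LM log-scores.

    hypotheses: list of (text, asr_log_score) pairs; weights are illustrative."""
    return max(hypotheses, key=lambda h: 0.7 * h[1] + 0.3 * lm_score(h[0]))[0]

# Toy LM that prefers the fluent candidate.
lm = {"a nice breed of dog": -1.0, "a nice bread of dog": -5.0}.get
nbest = [("a nice bread of dog", -2.0), ("a nice breed of dog", -2.3)]
print(rescore_nbest(nbest, lm))  # a nice breed of dog
```

Note that even though the ASR model scored "bread" higher, the LM term flips the decision — the upside of N-best rescoring that RLLM-CF trades away to avoid the extra decoding cost.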

Highlight (5 page, edited: 2026-03-27)

To analyze noun recall within substitution errors, noun filtering was performed using the method proposed in [32] for Chinese (AISHELL-1/2) and [33] for English (LibriSpeech).

Key Concepts to Clarify:

Noun Recall: measures how well proper nouns / technical terms are restored after correction. ASR errors are especially frequent on homophone-driven proper nouns, hence the separate metric. Nouns are filtered with N-LTP [32] for Chinese and NLTK [33] for English. Improved Noun Recall alongside CER/WER gains is a core strength of LLM correction — context-driven homophone disambiguation
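Once nouns are extracted, the metric itself is a simple recall. A sketch that abstracts away the N-LTP/NLTK tagging step:

```python
def noun_recall(reference_nouns: list, hypothesis: str) -> float:
    """Fraction of reference nouns that survive in the (corrected) transcript.

    Noun extraction (N-LTP / NLTK in the paper) is assumed done upstream."""
    hyp_tokens = set(hypothesis.split())
    matched = sum(1 for noun in reference_nouns if noun in hyp_tokens)
    return matched / len(reference_nouns) if reference_nouns else 1.0

ref_nouns = ["bakers", "breed", "dog"]
print(noun_recall(ref_nouns, "the bakers chose a breed of dog"))  # 1.0
print(noun_recall(ref_nouns, "the bakers chose a bread of dog"))  # ≈ 0.667
```

A homophone substitution like "bread" hurts Noun Recall without inflating CER/WER much, which is why the paper reports it separately.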

๐ŸŸช Results

Highlight (1 page, edited: 2026-03-27)

Experiments on AISHELL-1, AISHELL-2, and Librispeech show that the GPT-4o model enhanced by our framework achieves 21%, 11%, 9%, and 11.4% relative reductions in CER/WER.

Result:

Final GPT-4o + RLLM-CF performance — relative CER/WER reductions: AISHELL-1 21%, AISHELL-2 11%, LibriSpeech test-clean 9%, test-other 11.4%. See Tables II–IV for detailed numbers

Image (5 page, edited: 2026-03-27)

Results:

[Table II ๋ถ„์„] AISHELL-1 CER / Noun Recall (GPT-4o vs DeepSeek-V2):

[GPT-4o]

  • Attention: 5.06โ†’4.32 (-14.6%), Attention Rescore: 4.62โ†’4.01 (-13%)

  • CTC Greedy/Prefix: 5.17โ†’4.06 (-21%) โ† Abstract 21% ์ˆ˜์น˜ ์ถœ์ฒ˜

  • ํ‰๊ท  17.4% ์ƒ๋Œ€ ๊ฐ์†Œ, Noun Recall +2~3pp ํ–ฅ์ƒ

[DeepSeek-V2]

  • ์ตœ๊ณ  CTC Prefix: 5.17โ†’4.48 (-13%)

  • GPT-4o ๋Œ€๋น„ ์ „๋ฐ˜์ ์œผ๋กœ ์—ด์„ธ

[๋…ผ๋ฌธ ์ฃผ์žฅ ์—ฐ๊ฒฐ]

  • sub ์˜ค๋ฅ˜ ํฌ๊ฒŒ ๊ฐ์†Œ, del/ins ์†Œํญ ์ฆ๊ฐ€ โ†’ hallucination ์™„์ „ ์ œ๊ฑฐ ๋ถˆ๊ฐ€, Limitations์™€ ์ผ์น˜

  • GPT-4o > DeepSeek-V2 ์ผ๊ด€ ์šฐ์„ธ โ†’ ๋ชจ๋ธ ํ’ˆ์งˆ์ด ๊ต์ • ์„ฑ๋Šฅ์— ์ง๊ฒฐ

  • Noun Recall ํ–ฅ์ƒ โ†’ LLM์ด ๋™์Œ์ด์˜์–ด ๊ธฐ๋ฐ˜ ๊ณ ์œ ๋ช…์‚ฌ ์˜ค๋ฅ˜ ๊ต์ •์— ํšจ๊ณผ์ 
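The relative reductions quoted above can be recomputed from the absolute CERs (small differences from the bulleted figures are rounding):

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative CER/WER reduction, in percent."""
    return (before - after) / before * 100

# Recomputing the Table II deltas from the absolute CERs quoted above.
print(round(relative_reduction(5.06, 4.32), 1))  # 14.6  (Attention)
print(round(relative_reduction(4.62, 4.01), 1))  # 13.2  (Attention Rescoring)
print(round(relative_reduction(5.17, 4.06), 1))  # 21.5  (CTC Greedy/Prefix)
```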

Image (5 page, edited: 2026-03-27)

Results:

[Table III ๋ถ„์„] LibriSpeech WER (GPT-4o, test-clean/other):

[test-clean]

  • Attention Rescore: 3.35โ†’3.19 (-4.8%, ์ตœ์ € WER) โ† Abstract 9% ์ˆ˜์น˜ ์ถœ์ฒ˜

  • ํ‰๊ท  5.72% ์ƒ๋Œ€ ๊ฐ์†Œ

[test-other (์žก์Œ ํ™˜๊ฒฝ)]

  • CTC Greedy: 9.52โ†’8.45 (-11.4%) โ† Abstract 11.4% ์ˆ˜์น˜ ์ถœ์ฒ˜

  • ์ ˆ๋Œ€ ๊ฐ์†Œํญ์ด test-clean๋ณด๋‹ค ํผ โ†’ ์˜ค๋ฅ˜ ๋งŽ์„์ˆ˜๋ก ๊ต์ • ํšจ๊ณผ ์ฆ๊ฐ€

[๋…ผ๋ฌธ ์ฃผ์žฅ ์—ฐ๊ฒฐ]

  • ์˜์–ด์—์„œ๋„ ์œ ํšจ โ†’ fine-tuning ์—†์ด ์–ธ์–ด ๋ฒ”์šฉ์„ฑ ํ™•๋ณด ์ฃผ์žฅ ์ง€์ง€

  • AISHELL-1์˜ 21% ๋Œ€๋น„ ๋‚ฎ์Œ โ†’ LLM์˜ ์ค‘๊ตญ์–ด ๋™์Œ์ด์˜์–ด ํŒจํ„ด ์ธ์‹์ด ๋” ๊ฐ•ํ•จ

  • del/ins ์ฆ๊ฐ€ ํŒจํ„ด ๋™์ผ โ†’ hallucination ์–ธ์–ด ๋ฌด๊ด€ํ•˜๊ฒŒ ์ž”์กด

[Table IV ๋ถ„์„] AISHELL-2 Streaming CER:

  • GPT-4o: 5.57โ†’4.95 (-11%) โ† Abstract 11% ์ˆ˜์น˜ ์ถœ์ฒ˜

  • Streaming ๋ชจ๋ธ์—์„œ๋„ RLLM-CF ์œ ํšจ์„ฑ ํ™•์ธ

Image (5 page, edited: 2026-03-27)

Results:

[Figure 4 ๋ถ„์„] 7,176 ๋ฌธ์žฅ sentence-level ํ๋ฆ„: Pre-Detection์—์„œ 2,043๊ฐœ ์˜ค๋ฅ˜ ๊ฐ์ง€ โ†’ ์ด ์ค‘ 1,915๊ฐœ confidence ํ™•๋ณด, 128๊ฐœ ๋ฐ˜๋ณต ํ›„ ํฌ๊ธฐ โ†’ Verification์—์„œ 347๊ฐœ ์ถ”๊ฐ€ ํƒˆ๋ฝ โ†’ ์ตœ์ข… 1,568๊ฐœ ๊ต์ • ์‹œ๋„ โ†’ 383๊ฐœ ์„ฑ๊ณต ๊ต์ • / 98๊ฐœ ์˜ค์ •์ •(์›๋ž˜ ๋งž๋Š” ๋ฌธ์žฅ์„ ํ‹€๋ฆฌ๊ฒŒ). ์ •๋ฐ€๋„ ๊ด€์ : 383/(383+98) = ์•ฝ 79.6% โ†’ hallucination ์™„์ „ ์ œ๊ฑฐ ๋ถˆ๊ฐ€ ํ•œ๊ณ„ ์ˆ˜์น˜ํ™”

๐Ÿ”˜ Ablation Study

Image (5 page, edited: 2026-03-27)

Ablation Study:

[Table V ๋ถ„์„] ์ปดํฌ๋„ŒํŠธ๋ณ„ CER ๋ณ€ํ™” (AISHELL-1, DeepSeek-V2, Attention Decoding):

[๋‹จ๊ณ„๋ณ„ CER]

  • Baseline: 5.06 (๊ต์ • ์—†์Œ)

  • +Base (plain prompt): 53.10 (+949%) โ†’ hallucination ํญ๋ฐœ, ์‚ฌ์šฉ ๋ถˆ๊ฐ€ ์ˆ˜์ค€

  • +Pre-Detection: 8.19 โ†’ ๊ฐ€์žฅ ํฐ ๋‹จ์ผ ๊ฐœ์„ , Faithful hallucination ์–ต์ œ

  • +CoT Sub-Tasks: 7.05 โ†’ insertion/deletion ์–ต์ œ, LLM ์ถœ๋ ฅ ๊ณต๊ฐ„ ์ œํ•œ

  • +Iterative Correction: 4.89 โ†’ baseline ์ดํ•˜ ์ฒซ ์ง„์ž…, ๋‹จ์ผ ํŒจ์Šค ํ•œ๊ณ„ ๊ทน๋ณต

  • +Answer Verification: 4.69 โ†’ ์ตœ์ข… hallucination ์ž”์กด ์ œ๊ฑฐ

[Token ๋น„์šฉ]

  • Base: 62k/89k โ†’ CoT: 239k/122k (์ž…๋ ฅ ๊ธ‰์ฆ) โ†’ Iterative+Verification: 251k/260k ์•ˆ์ •

[๋…ผ๋ฌธ ์ฃผ์žฅ ์—ฐ๊ฒฐ]

  • Pre-Detection์ด ํ•ต์‹ฌ ๊ธฐ์—ฌ ๋‹จ๊ณ„ โ†’ โ€œerror prevention firstโ€ ์ „๋žต ์œ ํšจ์„ฑ ์‹ค์ฆ

  • CoT ๋‹จ๋…์œผ๋กœ๋Š” baseline ์ดํ•˜ ๋‹ฌ์„ฑ ๋ถˆ๊ฐ€ โ†’ Iterative๊ฐ€ ํ•„์ˆ˜ ๋ณด์™„์žฌ

  • Verification ์ถ”๊ฐ€๋Š” ๊ฑฐ์˜ ๋ฌด๋น„์šฉ(ํ† ํฐ ์ฆ๊ฐ€ ์—†์Œ) ๋Œ€๋น„ ํšจ๊ณผ โ†’ ํšจ์œจ์  ์„ค๊ณ„

  • ๊ฐ ์ปดํฌ๋„ŒํŠธ ์ˆœ์ฐจ์  ์ƒํ˜ธ๋ณด์™„ โ†’ 3๋‹จ๊ณ„ ์ „์ฒด ํ•„์š”์„ฑ ์ •๋‹นํ™”

[Table V ๋ถ„์„] ํ† ํฐ ์†Œ๋น„ ๋ถ„์„:

  • Base: 62k/89k โ†’ +Pre-Detection: 72k/87k (์ž…๋ ฅ ์†Œํญ ์ฆ๊ฐ€) โ†’ +CoT Sub-Tasks: 239k/122k (์ž…๋ ฅ ๊ธ‰์ฆ, ์ถœ๋ ฅ ๊ฐ์†Œ) โ†’ +Iterative+Verification: 251k/260k (์ถœ๋ ฅ ๊ธ‰์ฆ).ย 

  • CoT๊ฐ€ ์ž…๋ ฅ ํ† ํฐ์„ 3๋ฐฐ ์ด์ƒ ์ฆ๊ฐ€์‹œํ‚ค๋Š” ์ฃผ์š” ๋น„์šฉ ์›์ธ.ย 

  • Verification์€ ํ† ํฐ ์ถ”๊ฐ€ ๋น„์šฉ ๊ฑฐ์˜ ์—†์ด ์„ฑ๋Šฅ ํ–ฅ์ƒ โ†’ ๊ฐ€์žฅ ํšจ์œจ์ ์ธ ์ปดํฌ๋„ŒํŠธ. ์‹คํ—˜ ์‹œ ์—ฌ๋Ÿฌ ๋ฌธ์žฅ์„ ๋ฌถ์–ด์„œ inference โ†’ ํ† ํฐ ๋น„์šฉ ์ ˆ๊ฐ