BERTScore: Evaluating Text Generation with BERT

TL;DR

Contribution:: An automatic evaluation metric for text generation built on BERT contextual embeddings. Replaces n-gram exact matching with token-level cosine similarity to evaluate semantic similarity

Pros:: High correlation with human judgments across 363 systems. Task-agnostic (MT, captioning). Supports 104 languages. Requires no external resources

Cons:: Unsuitable for document-level evaluation due to the max-length limit (512 tokens for BERT). Cannot detect factual errors. Performance varies with the model, layer, and IDF configuration

Study Snapshot

Key takeaway:: Contextual embeddings + greedy matching resolve the paraphrase/synonym matching limitations of existing n-gram metrics. However, unsuitable for faithfulness evaluation

Methods:: (1) Pass the reference and candidate through BERT to extract contextual embeddings (2) build the pairwise token cosine-similarity matrix (3) greedy matching: Recall = row-wise max, Precision = column-wise max (4) optionally average with IDF weights (5) F1 = 2PR/(P+R)

Outcomes:: WMT18 system-level: outperforms BLEU, METEOR, and YiSi-1 on most language pairs. Segment-level: surpasses even the supervised RUSE. Image captioning: beats the task-specific SPICE

Results:: On PAWS adversarial examples, other metrics fall to chance level while BERTScore drops only slightly. Intermediate layers are optimal (Appendix B). Replacing greedy matching with WMD optimal matching brings no consistent improvement (Appendix C)



Reading notes


๐Ÿ”ด Problems

Highlight (1 page, edited: 2026-04-12)

n-gram overlap between the candidate and the reference. While this provides a simple and general measure, it fails to account for meaning-preserving lexical and compositional diversity.

Problems:

The fundamental limitation of n-gram-overlap metrics (BLEU): they fail to account for meaning-preserving lexical and compositional diversity.

  • Even though "consumers prefer imported cars" and "people like foreign cars" are semantically equivalent, the differing surface forms receive a low score

  • This is the core motivation behind BERTScore's design

Highlight (1 page, edited: 2026-04-12)

This leads to performance underestimation when semantically-correct phrases are penalized because they differ from the surface form of the reference.

Problems:

์˜๋ฏธ์ ์œผ๋กœ ์˜ฌ๋ฐ”๋ฅธ ๋ฒˆ์—ญ์ด reference์™€ ํ‘œ๋ฉด ํ˜•ํƒœ๊ฐ€ ๋‹ค๋ฅด๋‹ค๋Š” ์ด์œ ๋กœ ๊ณผ์†Œํ‰๊ฐ€๋จ

  • exact match ๋ฐฉ์‹์˜ ๊ตฌ์กฐ์  ํ•œ๊ณ„: ๋™์˜์–ด, ํŒจ๋Ÿฌํ”„๋ ˆ์ด์ฆˆ๋ฅผ ํฌ์ฐฉ ๋ถˆ๊ฐ€

  • ์ด ๋ฌธ์ œ๊ฐ€ BERTScore์—์„œ contextual embedding + cosine similarity๋กœ ๋Œ€์ฒด๋˜๋Š” ์ง์ ‘์  ๋™๊ธฐ

Highlight (1 page, edited: 2026-04-12)

Second, n-gram models fail to capture distant dependencies and penalize semantically-critical ordering changes (Isozaki et al., 2010).

Problems:

n-gram ๋ชจ๋ธ์˜ ๋‘ ๋ฒˆ์งธ ํ•œ๊ณ„: ์›๊ฑฐ๋ฆฌ ์˜์กด์„ฑ ํฌ์ฐฉ ์‹คํŒจ + ์˜๋ฏธ์ ์œผ๋กœ ์ค‘์š”ํ•œ ์–ด์ˆœ ๋ณ€ํ™”์— ๋Œ€ํ•œ ํŽ˜๋„ํ‹ฐ ๋ถ€์กฑ

  • ์˜ˆ: โ€œA because Bโ€์™€ โ€œB because Aโ€๋ฅผ window=2์ธ BLEU๊ฐ€ ๊ฑฐ์˜ ๊ตฌ๋ถ„ํ•˜์ง€ ๋ชปํ•จ

  • contextual embedding์€ self-attention์œผ๋กœ unbounded dependency๋ฅผ ํฌ์ฐฉํ•˜์—ฌ ํ•ด๊ฒฐ

๐ŸŸก Prior Research

Highlight (2 page, edited: 2026-04-12)

METEOR (Banerjee & Lavie, 2005) computes Exact-P1 and Exact-R1 while allowing backing-off from exact unigram matching to matching word stems, synonyms, and paraphrases.

Prior Research:

METEOR์˜ ์ ‘๊ทผ: exact match ์‹คํŒจ ์‹œ stem/synonym/paraphrase๋กœ fallback

  • ์™ธ๋ถ€ ๋ฆฌ์†Œ์Šค(stemmer, synonym lexicon, paraphrase table)์— ์˜์กด

  • 5๊ฐœ ์–ธ์–ด๋งŒ ์ „์ฒด ์ง€์›, 11๊ฐœ๋Š” ๋ถ€๋ถ„ ์ง€์› โ†’ BERTScore๋Š” BERT 104๊ฐœ ์–ธ์–ด ํ™œ์šฉ

Highlight (3 page, edited: 2026-04-12)

All these methods require costly human judgments as supervision for each dataset, and risk poor generalization to new domains, even within a known language and task

Prior Research:

Limitations of learned metrics (BEER, BLEND, RUSE)

  • Require costly human judgments as supervision for each dataset

  • Risk poor generalization to new domains, even within the same language and task

  • BERTScore avoids this problem because it is not optimized for a specific evaluation task

Highlight (3 page, edited: 2026-04-12)

However, we use contextual embeddings, which capture the specific use of a token in a sentence, and potentially capture sequence information.

Prior Research:

What sets it apart from existing embedding-based metrics (MEANT 2.0, YiSi-1)

  • Static word embeddings give a single context-independent vector per word → contextual embeddings give different vectors depending on context

  • No external linguistic structure (e.g., semantic parses) needed → better language portability

๐Ÿ”ต Main Idea

Highlight (1 page, edited: 2026-04-12)

In this paper, we introduce BERTSCORE, a language generation evaluation metric based on pretrained BERT contextual embeddings (Devlin et al., 2019)

Main Idea:

Core proposal: an automatic evaluation metric for text generation based on pretrained BERT contextual embeddings

  • Task-agnostic: applies broadly across generation tasks such as MT and image captioning

  • Evaluated on the outputs of 363 systems, demonstrating high correlation with human judgments

Highlight (1 page, edited: 2026-04-12)

BERTSCORE computes the similarity of two sentences as a sum of cosine similarities between their tokensโ€™ embeddings.

Main Idea:

One-line summary: compute the similarity of two sentences as the sum of cosine similarities between their token-level contextual embeddings

  • Soft similarity instead of exact match → handles paraphrases and synonyms naturally

  • Greedy matching aligns each token with the most similar token in the other sentence (the released library wraps this end to end; see the sketch below)
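For orientation, the authors' pip-installable implementation (`bert-score`) exposes the whole pipeline; a minimal usage sketch (defaults may differ across versions):

```python
# pip install bert-score
from bert_score import score

cands = ["people like foreign cars"]
refs = ["consumers prefer imported cars"]

# lang="en" picks an English model; rescale_with_baseline applies the
# baseline rescaling described under Methods below.
P, R, F1 = score(cands, refs, lang="en", rescale_with_baseline=True)
print(f"P={P.item():.3f} R={R.item():.3f} F1={F1.item():.3f}")
```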

Image (4 page, edited: 2026-04-12)

Main Idea:

  • Feed the candidate and reference through BERT to obtain contextual embeddings, score similarity with the cosine similarity of every token pair, and weight each token by IDF.

๐ŸŸข Methods

Highlight (4 page, edited: 2026-04-12)

We combine precision and recall to compute an F1 measure.

Methods:

  • Recall: reference์˜ ๊ฐ ํ† ํฐ์„ candidate์—์„œ greedy matching (row-wise max)

  • Precision: candidate์˜ ๊ฐ ํ† ํฐ์„ reference์—์„œ greedy matching (column-wise max)

  • F1์ด ๋Œ€๋ถ€๋ถ„์˜ ์„ค์ •์—์„œ ๊ฐ€์žฅ ์•ˆ์ •์ ์ธ ๋ฉ”ํŠธ๋ฆญ

Image (4 page, edited: 2026-04-12)

  • BERTScore recall: row-wise max pooling

  • BERTScore precision: column-wise max pooling

  • (optional) IDF weighting: across the M reference sentences, count 1 if a token appears in a sentence and 0 otherwise, then average and take the log: idf(w) = −log((1/M) Σᵢ 𝟙[w ∈ x⁽ⁱ⁾])

Highlight (4 page, edited: 2026-04-12)

BERTSCORE enables us to easily incorporate importance weighting. We experiment with inverse document frequency (idf) scores computed from the test corpus.

Methods:

IDF ๊ฐ€์ค‘์น˜ ์ ์šฉ:

  • ๊ฐœ reference ๋ฌธ์žฅ์—์„œ ํ† ํฐ ์ถœํ˜„ ๋นˆ๋„์˜ ์—ญ์ˆ˜(๋กœ๊ทธ ์Šค์ผ€์ผ)

  • tf๋Š” ๋‹จ์ผ ๋ฌธ์žฅ์ด๋ฏ€๋กœ 1๋กœ ๊ฐ€์ •, idf๋งŒ ์‚ฌ์šฉ

  • ํฌ๊ท€ ํ† ํฐ์— ๋†’์€ ๊ฐ€์ค‘์น˜ โ†’ ๋ฌธ์žฅ ์œ ์‚ฌ๋„์— ๋” ํฐ ๊ธฐ์—ฌ

Image (5 page, edited: 2026-03-29)

  • Because the score is built from cosine similarities, its theoretical bound is [−1, +1], but in practice the authors observe values in a much narrower band (in very high dimensions it is extremely hard for embeddings to reach cosine similarities near −1 or +1).

  • To improve the readability of the score, the authors therefore estimate an empirical lower bound b and rescale, so that the computed scores spread out over roughly (0, 1) (see the sketch below).

๐ŸŸ  Limitations

Highlight (7 page, edited: 2026-04-12)

Overall, we find that applying importance weighting using idf at times provides small benefit, but in other cases does not help. Understanding better when such importance weighting is likely to help is an important direction for future work, and likely depends on the domain of the text and the available test data. We continue without idf weighting for the rest of our experiments.

Limitations:

IDF weighting์˜ ํšจ๊ณผ๊ฐ€ ๋ถˆ์ผ๊ด€

  • ์ผ๋ถ€ ์„ค์ •์—์„œ๋งŒ ์†Œํญ ๊ฐœ์„ , ๋‹ค๋ฅธ ๊ฒฝ์šฐ์—๋Š” ๋„์›€ ์•ˆ ๋จ

  • ๋„๋ฉ”์ธยทํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง€๋ฉฐ, ์ด๋ฅผ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ด future work

  • ์ดํ›„ ์‹คํ—˜์—์„œ๋Š” IDF ์—†์ด ์ง„ํ–‰

Highlight (9 page, edited: 2026-04-12)

However, there is no one configuration of BERTSCORE that clearly outperforms all others.

Limitations:

๋ชจ๋“  ์„ค์ •์—์„œ ์ตœ์ ์ธ ๋‹จ์ผ ๊ตฌ์„ฑ์€ ์—†์Œ

  • ๋ชจ๋ธ ์„ ํƒ(BERT vs RoBERTa vs multilingual), ๋ ˆ์ด์–ด, IDF ์‚ฌ์šฉ ์—ฌ๋ถ€ ๋“ฑ์ด ๋„๋ฉ”์ธ/์–ธ์–ด์— ๋”ฐ๋ผ ๋‹ค๋ฆ„

  • ์˜์–ด: RoBERTa-large 24-layer ๊ถŒ์žฅ

  • ๋น„์˜์–ด: multilingual BERT ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•˜๋‚˜ ์ €์ž์› ์–ธ์–ด์—์„œ ๋ถˆ์•ˆ์ •

๐ŸŸฃ Key Concepts to Clarify

Highlight (3 page, edited: 2026-04-12)

Instead of greedy matching, WMD (Kusner et al., 2015), WMDO (Chow et al., 2019), and SMS (Clark et al., 2019) propose to use optimal matching based on earth moverโ€™s distance (Rubner et al., 1998).

Key Concepts to Clarify:

Greedy matching vs optimal matching

  • Greedy: each token is matched to the single most similar token on the other side (tokens may be reused) → simple and fast

  • Optimal (EMD/WMD): a globally optimal assignment → higher computational cost

  • BERTScore chooses greedy; Appendix C confirms there is no consistent gain from optimal matching

  • MoverScore makes the opposite choice in the same setting, using optimal (WMD) matching (a toy contrast is sketched after this list)
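A toy contrast on a hand-made similarity matrix; `linear_sum_assignment` solves a 1:1 assignment, a simplification of the flow-based WMD, but it shows how the two matching regimes can disagree:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy reference-by-candidate cosine-similarity matrix.
sim = np.array([[0.9, 0.2, 0.1],
                [0.8, 0.7, 0.3],
                [0.1, 0.6, 0.5]])

# Greedy (BERTScore recall side): every reference token takes its best match,
# reuse allowed -> plain row-wise max.
greedy = sim.max(axis=1).mean()

# Optimal one-to-one assignment: maximize total similarity, no reuse.
rows, cols = linear_sum_assignment(-sim)
optimal = sim[rows, cols].mean()

print(greedy, optimal)  # ~0.767 vs 0.700: the two regimes genuinely differ
```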

Highlight (4 page, edited: 2026-04-12)

In contrast to prior word embeddings (Mikolov et al., 2013; Pennington et al., 2014), contextual embeddings, such as BERT (Devlin et al., 2019) and ELMO (Peters et al., 2018), can generate different vector representations for the same word in different sentences depending on the surrounding words, which form the context of the target word.

Key Concepts to Clarify:

The key difference between contextual and static embeddings

  • Static (Word2Vec, GloVe): one fixed vector per word

  • Contextual (BERT, ELMo): the same word gets a different vector depending on the surrounding context

  • The Transformer's self-attention encodes contextual information → polysemy and word-order changes can be captured (see the sketch below)

๐ŸŸช Results

Image (7 page, edited: 2026-04-12)

Results:

[Table 4 ๋ถ„์„] Segment-level ์„ฑ๋Šฅ:

  • BERTScore๊ฐ€ ๋ชจ๋“  ๋ฉ”ํŠธ๋ฆญ ๋Œ€๋น„ ์œ ์˜ํ•˜๊ฒŒ ๋†’์€ ์„ฑ๋Šฅ

  • BLEU ๋Œ€๋น„ ํŠนํžˆ ํฐ ๊ฐœ์„  โ€” ๊ฐœ๋ณ„ ๋ฌธ์žฅ ๋ถ„์„์— ์ ํ•ฉ

  • ์‹ฌ์ง€์–ด supervised ๋ฉ”ํŠธ๋ฆญ RUSE๋„ segment-level์—์„œ๋Š” BERTScore์— ์—ด์„ธ

Image (8 page, edited: 2026-04-12)

Results:

[Table 5 ๋ถ„์„] Image captioning (COCO 2015 Captioning Challenge) ๊ฒฐ๊ณผ:

  • M1: ์ƒ์„ฑ ์บก์…˜์ด ์ธ๊ฐ„ ์บก์…˜ ๋Œ€๋น„ better or equal๋กœ ํ‰๊ฐ€๋œ ๋น„์œจ

  • M2: ์ƒ์„ฑ ์บก์…˜์ด ์ธ๊ฐ„ ์บก์…˜๊ณผ **๊ตฌ๋ถ„ ๋ถˆ๊ฐ€๋Šฅ(indistinguishable)**ํ•œ ๋น„์œจ (๋” ์—„๊ฒฉ)

  • BERTScore๊ฐ€ task-agnostic ๋ฉ”ํŠธ๋ฆญ ์ค‘ M1/M2 ๋ชจ๋‘์—์„œ ํฐ ํญ์œผ๋กœ ์ตœ๊ณ  ์„ฑ๋Šฅ

  • BLEU, ROUGE ๋“ฑ n-gram ๋ฉ”ํŠธ๋ฆญ์€ human judgment์™€ ์•ฝํ•œ ์ƒ๊ด€๊ด€๊ณ„

  • SPICE(task-specific)๋ณด๋‹ค๋„ ์šฐ์ˆ˜ โ†’ ๋ณ„๋„ task ์ตœ์ ํ™” ์—†์ด๋„ ๊ฐ•๋ ฅํ•œ ๋ฒ”์šฉ์„ฑ

Image (8 page, edited: 2026-04-12)

Results:

[Table 6 ๋ถ„์„] Adversarial robustness (PAWS):

  • ๋Œ€๋ถ€๋ถ„์˜ ๋ฉ”ํŠธ๋ฆญ์ด QQP์—์„œ๋Š” ์ ์ ˆํ•˜๋‚˜ PAWS ์ ๋Œ€์  ์˜ˆ์ œ์—์„œ chance ์ˆ˜์ค€๊นŒ์ง€ ํ•˜๋ฝ

  • BERTScore๋Š” ์†Œํญ ํ•˜๋ฝ๋งŒ โ€” contextual embedding์ด word swapping์—๋„ ์˜๋ฏธ ์ฐจ์ด๋ฅผ ํฌ์ฐฉ

  • ์˜ˆ: โ€œFlights from New York to Floridaโ€ vs โ€œFlights from Florida to New Yorkโ€ ๊ตฌ๋ถ„ ๊ฐ€๋Šฅ

๐Ÿ”˜ Ablation Study

Image (16 page, edited: 2026-04-12)

Ablation Study:

[Appendix B] BERT layer-selection experiment:

  • Intermediate layers perform best for every model

  • The final layers specialize in the pretraining objective (masked-token prediction in BERT's case) and are less suited to semantic similarity

  • WMT16 is used as a validation set to search for the best layer per model (see the sketch below)

Highlight (19 page, edited: 2026-04-12)

Third, replacing greedy matching with WMD does not lead to consistent improvement.

Ablation Study:

[Appendix C] Greedy matching vs WMD (optimal) comparison:

  • Replacing greedy matching with WMD brings no consistent improvement; under identical settings, BERTScore with greedy matching is actually better in the majority of cases

  • Conclusion: greedy matching is sufficient for text generation evaluation, and the extra computational cost of optimal matching is not justified

  • This is a key point of difference from MoverScore