BERTScore

TL;DR

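Uses the pretrained language model (PLM) BERT to measure token-level similarity, so that the score reflects semantic similarity.

Optionally applies importance weighting, using IDF as the token weights.

When evaluating text generation models, scoring by lexical matching alone has clear limits. BERTScore, one of the methods proposed to overcome them, uses a pretrained language model such as BERT to evaluate the semantic similarity between a model's output and a reference sentence. It reflects contextual meaning better than traditional n-gram-based metrics and thereby provides more reliable evaluation across a range of natural language processing tasks.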

Methods

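Pass the candidate and the reference through BERT to obtain contextual embeddings.
Compute the cosine similarity for every token pair.
From the similarity matrix (rows = reference tokens, columns = candidate tokens), BERTScore recall is obtained by row-wise max pooling (one value per reference token) and precision by column-wise max pooling (one value per candidate token); each is then averaged over its tokens.
(optional) Apply importance weighting using IDF.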
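
The first three steps can be sketched in plain Python. This is a minimal illustration rather than the official implementation: the 2-dimensional vectors are made-up stand-ins for BERT's contextual embeddings, the helper names `cosine` and `greedy_prf` are hypothetical, and the optional IDF weighting is left out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_prf(ref_vecs, cand_vecs):
    """Unweighted greedy-matching precision, recall, and F1."""
    # Similarity matrix: rows = reference tokens, columns = candidate tokens.
    sim = [[cosine(r, c) for c in cand_vecs] for r in ref_vecs]
    # Recall: best candidate match for each reference token (row-wise max).
    recall = sum(max(row) for row in sim) / len(ref_vecs)
    # Precision: best reference match for each candidate token (column-wise max).
    precision = sum(max(col) for col in zip(*sim)) / len(cand_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up 2-d "embeddings"; real BERT vectors have hundreds of dimensions.
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]
cand_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

precision, recall, f1 = greedy_prf(ref_vecs, cand_vecs)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy input every reference vector has an exact counterpart in the candidate, so recall is 1, while the extra candidate vector `[1.0, 1.0]` only partially matches and pulls precision below 1.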

Examples

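Consider the following two sentences:

Reference sentence (A): [“cat”, “on”, “mat”]
Candidate sentence (B): [“cat”, “sits”, “on”, “rug”]

Suppose we compute an embedding for each word with BERT and obtain the following cosine similarity matrix between the two sentences: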

Token (A)	cat (B)	sits (B)	on (B)	rug (B)
cat (A)	0.95	0.20	0.10	0.05
on (A)	0.10	0.30	0.90	0.10
mat (A)	0.05	0.40	0.20	0.80

Recall (row-wise max pooling) measures how much of the information in reference sentence A is captured by candidate sentence B. To compute it, take the maximum of each row and average those values:

cat (A): max(0.95, 0.20, 0.10, 0.05) = 0.95
on (A): max(0.10, 0.30, 0.90, 0.10) = 0.90
mat (A): max(0.05, 0.40, 0.20, 0.80) = 0.80

$R_{BERT} = \frac{0.95 + 0.90 + 0.80}{3} = 0.883$

Precision (column-wise max pooling) measures how well each word of candidate sentence B is matched by some word in reference sentence A. To compute it, take the maximum of each column and average those values:

cat (B): max(0.95, 0.10, 0.05) = 0.95
sits (B): max(0.20, 0.30, 0.40) = 0.40
on (B): max(0.10, 0.90, 0.20) = 0.90
rug (B): max(0.05, 0.10, 0.80) = 0.80

$P_{BERT} = \frac{0.95 + 0.40 + 0.90 + 0.80}{4} = 0.763$

The F1 score is the harmonic mean of precision and recall:

$F_{BERT} = 2 \cdot \frac{P _{BERT} \cdot R _{BERT}}{P _{BERT} + R _{BERT}} = 2 \cdot \frac{0.763 \cdot 0.883}{0.763 + 0.883} = 0.819$
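
The arithmetic above can be checked with a few lines of plain Python. This sketch only reuses the similarity matrix given in the example; the variable names are illustrative. Because it keeps unrounded intermediate values, F1 comes out at roughly 0.818, whereas the hand calculation above rounds P and R first and arrives at 0.819.

```python
# Cosine similarity matrix from the example:
# rows = reference (A), columns = candidate (B).
sim = [
    [0.95, 0.20, 0.10, 0.05],  # cat (A)
    [0.10, 0.30, 0.90, 0.10],  # on (A)
    [0.05, 0.40, 0.20, 0.80],  # mat (A)
]

row_max = [max(row) for row in sim]        # best match per reference token
col_max = [max(col) for col in zip(*sim)]  # best match per candidate token

recall = sum(row_max) / len(row_max)
precision = sum(col_max) / len(col_max)
f1 = 2 * precision * recall / (precision + recall)

print(f"R={recall:.4f} P={precision:.4f} F1={f1:.4f}")
```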

Code

Official implementation: Tiiiger/bert_score

Reference: https://www.comet.com/site/blog/bertscore-for-llm-evaluation/

Tokenization & Truncation

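def sent_encode(tokenizer, sent):
    # Encode one sentence into token IDs, including [CLS] and [SEP].
    return tokenizer.encode(
        sent, add_special_tokens=True,
        max_length=tokenizer.model_max_length,  # e.g. BERT=512, kcbert=300
        truncation=True  # tokens beyond max_length are cut off (no warning)
    )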

[CLS] and [SEP] are added automatically → the usable token budget is max_length - 2
With document-level input, everything past the limit is lost
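
The effect of the two special tokens on the budget can be illustrated with a toy stand-in for the tokenizer. This is a sketch under stated assumptions: `encode_with_truncation` is a hypothetical helper, not the Hugging Face implementation, and the token IDs are placeholders.

```python
CLS_ID, SEP_ID = 101, 102  # illustrative IDs for [CLS] and [SEP]

def encode_with_truncation(content_ids, max_length):
    """Toy stand-in for tokenizer.encode(..., truncation=True).

    Keeps the first max_length - 2 content tokens and wraps them in
    [CLS] ... [SEP]; everything after that is dropped without warning.
    """
    kept = content_ids[: max_length - 2]
    return [CLS_ID] + kept + [SEP_ID]

content = list(range(1000, 1010))  # 10 hypothetical content token IDs
encoded = encode_with_truncation(content, max_length=8)

print(len(encoded))   # total length, including the two special tokens
print(encoded[1:-1])  # surviving content tokens
print(content[6:])    # tokens that were silently dropped
```

With `max_length=8`, only six of the ten content tokens survive, and the tail of the input never reaches the model.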

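Layer Selection

# After loading the model, physically drop every layer above the chosen one
# (assumes `import torch.nn as nn` and a BERT-style `model` with `encoder.layer`)
model.encoder.layer = nn.ModuleList(
    [layer for layer in model.encoder.layer[:num_layers]]
)
# e.g. num_layers=9 → only layers 0-8 are kept; the output of layer 8 is the final embedding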

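The default is the best-performing layer for each model (tuned on WMT16 in Appendix B of the paper)
Intermediate layers work best for semantic similarity; the top layers are specialized for MLM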
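
Why slicing the layer list is enough can be seen in a toy model of the encoder stack. This is only an illustration: `make_layer` and `run_encoder` are hypothetical helpers, and each "layer" merely records its index instead of transforming embeddings.

```python
def make_layer(index):
    """Hypothetical stand-in for a Transformer layer: appends its index to a trace."""
    def layer(trace):
        return trace + [index]
    return layer

def run_encoder(layers, trace):
    """Apply each layer in order and return the final output."""
    for layer in layers:
        trace = layer(trace)
    return trace

full_stack = [make_layer(i) for i in range(12)]  # e.g. a 12-layer encoder
num_layers = 9
truncated_stack = full_stack[:num_layers]        # keep layers 0..8 only

output = run_encoder(truncated_stack, [])
print(output)  # indices of the layers that were actually applied
```

Because the encoder simply runs whatever layers remain in the list, the output of layer 8 becomes the final output, and the dropped layers cost no compute.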

Core Scoring: `greedy_cos_idf`

import torch

def greedy_cos_idf(ref_embedding, ref_masks, ref_idf,
                   hyp_embedding, hyp_masks, hyp_idf):
    # Simplified excerpt: ref_masks / hyp_masks are accepted but not applied here.

    # (1) L2-normalize in place → dot product = cosine similarity
    ref_embedding.div_(torch.norm(ref_embedding, dim=-1).unsqueeze(-1))
    hyp_embedding.div_(torch.norm(hyp_embedding, dim=-1).unsqueeze(-1))

    # (2) Similarity matrix: [batch, hyp_len, ref_len]
    sim = torch.bmm(hyp_embedding, ref_embedding.transpose(1, 2))

    # (3) Greedy matching
    word_precision = sim.max(dim=2)[0]  # each candidate token → best match in ref
    word_recall    = sim.max(dim=1)[0]  # each reference token → best match in hyp

    # (4) IDF-weighted average → sentence-level scores
    hyp_idf.div_(hyp_idf.sum(dim=1, keepdim=True))  # normalize in place (sum = 1)
    ref_idf.div_(ref_idf.sum(dim=1, keepdim=True))

    P = (word_precision * hyp_idf).sum(dim=1)  # Precision
    R = (word_recall    * ref_idf).sum(dim=1)  # Recall
    F = 2 * P * R / (P + R)                    # F1
    return P, R, F

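Understanding the dim direction: mapping to the Examples matrix above

The `sim` in the code is the transpose of the Examples matrix.

Examples: rows = Reference (A), columns = Candidate (B) → shape [3, 4]

Code: rows = Candidate (hyp), columns = Reference (ref) → shape [4, 3] (per batch item)

Transposing the Examples matrix gives the code's `sim`, so the two reductions map as follows:

Operation	Reduced axis	Direction	Meaning	Examples equivalent
`sim.max(dim=2)`	ref axis	→	best match in ref for each hyp token	column-wise max (Precision)
`sim.max(dim=1)`	hyp axis	↓	best match in hyp for each ref token	row-wise max (Recall)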
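
The transpose relationship can be verified on the Examples matrix with nested lists in place of tensors. This is a sketch for a single batch item; the list comprehensions are plain-Python equivalents of the `max` reductions, not the library calls themselves.

```python
# Examples layout: rows = reference (A), columns = candidate (B) -> shape [3, 4]
examples = [
    [0.95, 0.20, 0.10, 0.05],
    [0.10, 0.30, 0.90, 0.10],
    [0.05, 0.40, 0.20, 0.80],
]

# Code layout: rows = candidate (hyp), columns = reference (ref) -> shape [4, 3]
sim = [list(col) for col in zip(*examples)]

# Equivalent of sim.max(dim=2) for one batch item: reduce over the ref axis.
word_precision = [max(row) for row in sim]
# Equivalent of sim.max(dim=1) for one batch item: reduce over the hyp axis.
word_recall = [max(col) for col in zip(*sim)]

# Same values as column-wise / row-wise max on the Examples layout.
assert word_precision == [max(col) for col in zip(*examples)]
assert word_recall == [max(row) for row in examples]

print(word_precision)
print(word_recall)
```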

Step-by-Step Summary

Step	Operation	Code
Similarity matrix	L2 normalization, then batched matmul	`bmm(hyp, ref.T)` → `[B, H, R]`
Precision	best ref match for each hyp token	`sim.max(dim=2)`
Recall	best hyp match for each ref token	`sim.max(dim=1)`
IDF weighting	multiply by normalized IDF weights and sum	`(word_score * idf).sum()`
F1	harmonic mean	`F = 2PR/(P+R)`
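
The IDF weighting step can be made concrete with the per-token precision scores from the Examples section. In this sketch the IDF weights are hypothetical values chosen for illustration, and `idf_weighted_mean` is an illustrative helper rather than a function from the official code.

```python
def idf_weighted_mean(word_scores, idf_weights):
    """Normalize the IDF weights to sum to 1, then take the weighted sum."""
    total = sum(idf_weights)
    return sum(s * w / total for s, w in zip(word_scores, idf_weights))

# Per-token best-match scores for the candidate tokens in the Examples section.
word_precision = [0.95, 0.40, 0.90, 0.80]  # cat, sits, on, rug

# Hypothetical IDF weights: the function word "on" gets a low weight.
hyp_idf = [2.0, 2.0, 0.5, 2.0]

uniform = idf_weighted_mean(word_precision, [1.0] * 4)
weighted = idf_weighted_mean(word_precision, hyp_idf)
print(f"uniform={uniform:.4f} weighted={weighted:.4f}")
```

With uniform weights the result equals the plain average. Down-weighting "on", which matched well, shifts the score toward the content words and lowers precision in this particular example.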

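IDF Computation

from collections import Counter, defaultdict
from math import log

def get_idf_dict(arr, tokenizer, nthreads=4):
    # nthreads is unused in this simplified, single-threaded excerpt
    idf_count = Counter()
    num_docs = len(arr)

    # (1) Tokenize each reference sentence and accumulate document frequency (DF)
    for sent in arr:
        tokens = sent_encode(tokenizer, sent)
        for token_id in set(tokens):     # set → ignore repeats within a sentence
            idf_count[token_id] += 1     # number of sentences containing the token (DF)

    # (2) IDF = log((N+1)/(df+1)), i.e. add-one (Laplace) smoothing
    idf_dict = defaultdict(
        lambda: log((num_docs + 1) / 1)  # default for unseen tokens (maximum IDF)
    )
    idf_dict.update({
        idx: log((num_docs + 1) / (c + 1))
        for (idx, c) in idf_count.items()
    })
    return idf_dict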

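(1) Iterate over the sentences of the reference corpus and accumulate per-token document frequency (DF). set(tokens) prevents double counting within a single sentence
(2) Add-one (Laplace) smoothing avoids division by zero for tokens with df = 0. Unseen tokens receive the maximum IDF, log(N+1)
When IDF is not used (idf=False), all tokens are weighted equally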
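
The smoothed IDF formula can be tried on a three-sentence toy corpus. This sketch assumes whitespace tokenization instead of a BERT tokenizer, and `toy_idf` is a hypothetical helper that mirrors the formula above.

```python
from collections import Counter, defaultdict
from math import log

def toy_idf(sentences):
    """Smoothed IDF over whitespace tokens (assumption: no BERT tokenizer)."""
    num_docs = len(sentences)
    df = Counter()
    for sent in sentences:
        df.update(set(sent.split()))  # count each token once per sentence
    idf = defaultdict(lambda: log((num_docs + 1) / 1))  # unseen tokens
    idf.update({tok: log((num_docs + 1) / (c + 1)) for tok, c in df.items()})
    return idf

refs = ["the cat sat", "the dog ran", "the cat ran"]
idf = toy_idf(refs)

for token in ["the", "cat", "dog", "zebra"]:
    print(token, round(idf[token], 3))
```

A token that appears in every sentence ("the") gets IDF log(4/4) = 0, rarer tokens get progressively larger values, and the unseen token "zebra" gets the maximum, log(4).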

Key Concepts to Clarify

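Layer selection: not all BERT layers are equally useful. The paper finds that intermediate layers work best for semantic similarity, while the final layer is specialized for the pretraining objective and performs worse. The best layer for each model is searched using WMT16 as the validation set (Appendix B).
Greedy matching vs. optimal matching: BERTScore uses greedy matching (each token is matched to its most similar token on the other side). Replacing it with WMD-based optimal matching (Earth Mover’s Distance) brings no consistent improvement (Appendix C). MoverScore, in the same setting, opts for optimal matching.
Baseline rescaling: cosine similarity theoretically ranges over $[-1, 1]$, but in practice the scores fall in a narrow interval. An empirical lower bound $b$ is estimated from random sentence pairs drawn from Common Crawl, and the score is rescaled as $\hat{R}_{BERT} = \frac{R_{BERT} - b}{1 - b}$ to improve readability. This does not affect ranking ability.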
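
Baseline rescaling is a simple linear map, which a few lines of Python make explicit. In this sketch the baseline `b = 0.85` is a made-up value for illustration (real baselines depend on the model, layer, and language), and `rescale` is an illustrative helper name.

```python
def rescale(score, baseline):
    """Linearly map `baseline` to 0 while keeping 1 at 1."""
    return (score - baseline) / (1 - baseline)

b = 0.85  # hypothetical baseline; real values depend on model, layer, language
raw_scores = [0.86, 0.883, 0.95]
rescaled = [rescale(s, b) for s in raw_scores]

print([round(r, 3) for r in rescaled])

# The map is monotonic, so the ordering of the scores is unchanged.
assert sorted(rescaled) == rescaled
```

Raw scores that sit close together near the top of the range are spread over a wider interval, which makes differences easier to read without changing which system ranks higher.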

Connections

MoverScore: also based on contextual embeddings, but uses optimal matching (Word Mover’s Distance, WMD)
BLEU: the representative n-gram-based metric whose limitations BERTScore aims to address
METEOR: supplements exact match with stem/synonym fallbacks; a precursor to BERTScore

Source Trail

This note was extracted from @zhangBERTScoreEvaluatingText2020
KoBERTScore (Korean implementation)
