Dataset Notes


1. Dataset Information

  • Name: NegBench: Vision-Language Models Do Not Understand Negation
  • Authors: Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip Torr, Yoon Kim, Marzyeh Ghassemi
  • Venue: CVPR 2025
  • Description:
    • A large-scale benchmark that systematically evaluates how well vision-language models (VLMs) understand negation
    • Negation understanding is essential in practical scenarios such as "retrieve images that do not contain a specific object", yet prior work has barely explored it
    • Shows empirically that fine-tuning on synthetic data can improve negation understanding
  • License: MIT (applies to the code)
  • Resources:

⚠️ Note: Some of the evaluation datasets (CheXpert, COCO, VOC2007, MSR-VTT) must be downloaded separately from their original sources.
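Because these datasets have to be fetched by hand, a quick check for missing ones can save a failed evaluation run. The sketch below assumes a hypothetical layout with one subdirectory per dataset under a common root. The directory names and the `missing_datasets` helper are illustrative and not part of the NegBench code.

```python
from pathlib import Path

# Hypothetical directory names, one per separately downloaded dataset.
REQUIRED_DATASETS = ("CheXpert", "COCO", "VOC2007", "MSR-VTT")


def missing_datasets(root, names=REQUIRED_DATASETS):
    """Return the dataset names that have no directory under `root`."""
    root = Path(root)
    return [name for name in names if not (root / name).is_dir()]


if __name__ == "__main__":
    # "data" is an assumed location; adjust to wherever the downloads live.
    for name in missing_datasets("data"):
        print(f"missing: {name} (download it from its original source)")
```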


2. Data Structure

Modalities and Sources

Modality         Data source                    Tasks
Image            COCO 2017 Val                  MCQ, Retrieval
Image            VOC2007                        MCQ
Image            Synthetic (Stable Diffusion)   MCQ, Retrieval
Image (medical)  CheXpert                       Binary MCQ
Video            MSR-VTT                        MCQ, Retrieval
  • Features:
    • 79,000 examples across 18 task variants
    • Comprehensive evaluation spanning three domains: images, video, and medical imaging
  • Number of samples: 79K (total)
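The modality, source, and task combinations listed in this section can be written down as a small lookup structure. This is only a sketch: the `Source` class and the `sources_for` helper are hypothetical names for illustration, and the entries simply mirror the table in this note.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    modality: str
    dataset: str
    tasks: tuple


# Entries copied from the modality/source table in this note.
SOURCES = (
    Source("image", "COCO 2017 Val", ("MCQ", "Retrieval")),
    Source("image", "VOC2007", ("MCQ",)),
    Source("image", "Synthetic (Stable Diffusion)", ("MCQ", "Retrieval")),
    Source("image (medical)", "CheXpert", ("Binary MCQ",)),
    Source("video", "MSR-VTT", ("MCQ", "Retrieval")),
)


def sources_for(task):
    """List the datasets that support the given task name (exact match)."""
    return [s.dataset for s in SOURCES if task in s.tasks]


print(sources_for("Retrieval"))
```

Calling `sources_for("Binary MCQ")` returns only CheXpert, since the match is on the exact task name.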

3. Construction Method

  • Method: rule-based templates plus Llama 3.1-based rephrasing (MSR-VTT video captions)
  • Sources: captions from existing datasets such as COCO, VOC2007, CheXpert, and MSR-VTT, converted into negated expressions
  • Fine-tuning data (separate from the benchmark):
    • CC12M-NegCap: ~30M negated captions (derived from CC12M)
    • CC12M-NegMCQ: ~40M negated MCQs (derived from CC12M)
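To make the rule-based template step concrete, here is a minimal sketch of turning an existing caption into a negated one. The templates, the `negate_caption` function, and the idea of passing in a known-absent object are all assumptions for illustration. The actual NegBench templates are not given in this note, and the Llama 3.1 rephrasing step is not modeled here.

```python
import random

# Hypothetical templates; the benchmark's real templates may differ.
TEMPLATES = (
    "{caption} There is no {absent} in the image.",
    "{caption} The image does not include any {absent}.",
    "No {absent} is present. {caption}",
)


def negate_caption(caption, absent_object, rng=None):
    """Append or prepend a negated statement about an object that is absent."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    caption = caption.strip()
    if not caption.endswith("."):
        caption += "."
    template = rng.choice(TEMPLATES)
    return template.format(caption=caption, absent=absent_object)


print(negate_caption("A man rides a horse on the beach", "dog"))
```

In a fuller pipeline, the absent object would come from the source dataset's annotations, so the negated statement is known to be true for that image.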

4. Tasks and Usage

Main Tasks

  1. Retrieval with Negation: retrieve the correct image or video for a negated query (e.g., "an image without X")
  2. Multiple Choice Questions (MCQ) with Negated Captions: choose the correct caption from candidates that contain negation
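Both tasks can be scored from a matrix of model similarity scores. The sketch below assumes such scores are already available (for example, image-text similarities from a CLIP-style model) and that retrieval is measured as recall@k. The function names and the choice of k are illustrative rather than taken from the benchmark code.

```python
def mcq_accuracy(scores, answers):
    """scores[i][j]: similarity of item i to candidate caption j.
    answers[i]: index of the correct caption for item i."""
    correct = 0
    for row, answer in zip(scores, answers):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += predicted == answer
    return correct / len(answers)


def recall_at_k(scores, targets, k):
    """scores[q][i]: similarity of query q to gallery item i.
    targets[q]: index of the correct gallery item for query q."""
    hits = 0
    for row, target in zip(scores, targets):
        ranked = sorted(range(len(row)), key=row.__getitem__, reverse=True)
        hits += target in ranked[:k]
    return hits / len(targets)


# Toy similarity scores for two items and three candidates each.
toy_scores = [[0.1, 0.9, 0.3], [0.8, 0.2, 0.5]]
print(mcq_accuracy(toy_scores, answers=[1, 2]))
print(recall_at_k(toy_scores, targets=[1, 2], k=2))
```

For MCQ only the top-scoring option counts, while recall@k gives credit whenever the correct item appears anywhere in the top k.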

Benchmark Performance

Key findings:

  • Modern VLMs (CLIP, NegCLIP, CoNCLIP, and others) perform close to chance level on negation tasks
  • On queries that contain negation, models tend to effectively ignore the negation words
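A toy example can show why ignoring negation words pushes a model toward chance. The scorer below is a deliberately simple word-overlap measure, not how CLIP or any real VLM computes similarity. `NEGATION_WORDS`, `overlap_score`, and `chance_accuracy` are hypothetical names, and the 4-option case is an assumption used only to illustrate the arithmetic.

```python
NEGATION_WORDS = {"no", "not", "without", "none"}


def _tokens(text, drop_negation):
    words = {w.strip(".,").lower() for w in text.split()}
    return words - NEGATION_WORDS if drop_negation else words


def overlap_score(query, caption, drop_negation=True):
    """Jaccard overlap of word sets; optionally discards negation words."""
    q = _tokens(query, drop_negation)
    c = _tokens(caption, drop_negation)
    union = q | c
    return len(q & c) / len(union) if union else 0.0


def chance_accuracy(num_options):
    """Expected accuracy of uniform random guessing on an N-way MCQ."""
    return 1.0 / num_options


query = "a street with no cars"
wrong_caption = "a street with cars"
# With negation dropped, the contradicting caption looks like a perfect match.
print(overlap_score(query, wrong_caption, drop_negation=True))
print(overlap_score(query, wrong_caption, drop_negation=False))
print(chance_accuracy(2), chance_accuracy(4))  # binary MCQ vs. assumed 4-way MCQ
```

When the negated and affirmative captions receive the same score, the choice between them is effectively a coin flip, which is what chance-level accuracy means for a binary MCQ.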

Effect of fine-tuning (synthetic data derived from CC12M):

Method             Negated Retrieval   Negated MCQ
Baseline (CLIP)    baseline            baseline
+ CC12M-NegCap     +10% recall         -
+ CC12M-NegMCQ     -                   +40% accuracy

Recall +10% (Retrieval)

  • The gain is real but small, which suggests that understanding negated expressions alone is unlikely to resolve the more fundamental limitations of retrieval

Accuracy +40% (MCQ)

  • The gain is large: MCQ is a comparison among answer options, so learning what the negated expressions mean translates directly into higher accuracy

Open Questions

  • How improved negation understanding affects general VLM performance (whether a trade-off exists)
  • Whether the approach extends to more complex logical negation (double negation, conditional negation)
  • The need to evaluate negation understanding in non-English languages such as Korean

Reference

  • NegConstraint: a dataset related to negation constraints in text generation