Google Study Finds AI Benchmarks Ignore Human Disagreement
A Google study reveals that the standard practice of using 3-5 human raters per test example is often not enough for reliable AI benchmarks. The way annotation budgets are allocated is just as important as the budget itself.
Why it matters
This study highlights a key limitation in how AI systems are currently evaluated, which has important implications for the development and deployment of reliable AI technologies.
Key Points
- Standard AI benchmarks use too few human raters per test example
- Human disagreement is systematically ignored in current benchmarks
- Allocation of annotation budgets is crucial for reliable benchmarks
Details
The Google study found that the standard practice of using 3-5 human raters per test example is often insufficient for reliable AI benchmarks. Human raters frequently disagree on the correct labels or annotations, but current benchmarks do not account for this disagreement. The study also suggests that how an annotation budget is allocated - whether to spend it on more raters per example or on more examples overall - matters just as much as the total budget size. Properly accounting for human variability and disagreement is critical for developing AI systems that generalize and perform well in the real world.
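To make the allocation trade-off concrete, the sketch below is a minimal, hypothetical simulation (not taken from the study): with a fixed total number of annotations, it compares spending the budget on many examples with few raters each versus fewer examples with more raters each, and reports how much the resulting benchmark score fluctuates. All parameters and distributions here are illustrative assumptions.

```python
# Illustrative simulation of annotation-budget allocation (hypothetical parameters).
import numpy as np

rng = np.random.default_rng(0)

def simulate_benchmark(n_items, raters_per_item, n_trials=2000):
    """Estimate how noisy a benchmark score is when each item's label
    is decided by a majority vote of `raters_per_item` raters."""
    scores = []
    for _ in range(n_trials):
        # Each item has a latent probability that a rater marks the model's
        # answer as correct; variation across items is the source of disagreement.
        p_agree = rng.beta(2, 2, size=n_items)
        # Majority vote over the raters assigned to each item.
        votes = rng.binomial(raters_per_item, p_agree)
        majority_correct = votes > raters_per_item / 2
        scores.append(majority_correct.mean())
    scores = np.array(scores)
    return scores.mean(), scores.std()

budget = 3000  # total annotations available (hypothetical)
for raters in (1, 3, 5, 15):
    n_items = budget // raters
    mean, std = simulate_benchmark(n_items, raters)
    print(f"{raters:>2} raters x {n_items:>4} items -> score {mean:.3f} +/- {std:.3f}")
```

Under these toy assumptions, piling more raters onto fewer examples reduces per-item label noise but increases sampling noise from having fewer examples, which is the kind of trade-off the study's point about budget allocation refers to.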