Evaluating AI Model Integrity: Uncovering Leakage and Fixing Inflated Baselines
The article discusses how the authors found leakage issues in their AI model's evaluation process, leading to inflated baseline performance. They outline two key bugs and the steps taken to fix them, ensuring honest and transparent reporting of model capabilities.
Why it matters
Ensuring the integrity of AI model evaluation is critical for building trustworthy and reliable systems, especially in finance applications where the stakes are high.
Key Points
1. Leakage from same-symbol data across training and validation splits
2. Forward-return window leakage in training labels
3. Adoption of symbol-disjoint splits and purge-embargo windows to fix the issues
4. Commitment to publishing honest baselines and evaluation details
Details
The article describes how the authors' internal baseline for a chart-embedding AI model was inflated by 0.4 percentage points due to two data-leakage bugs. The first bug: training and validation splits were made by date, so the model could find near-duplicates of validation samples, charts of the same ticker, in the training set. The second bug: forward-return labels in the training data were computed over windows that overlapped the validation period.

To address both issues, the authors implemented symbol-disjoint splits, in which no ticker appears in more than one split, and a purge-embargo window that keeps label windows from leaking information across splits.

They also committed to publishing honest baselines and evaluation details, acknowledging that on a clean holdout set the true direction-prediction accuracy is 51.2%, closer to a coin flip. The authors emphasize that these fixes matter for AI agents relying on their chart-embedding service: inflated baselines propagate downstream into position sizing, stop placement, and confidence calibration.
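The two fixes described above can be sketched in a few lines. This is a minimal illustration, not the authors' actual implementation; the function names, parameters, and the representation of time as integer day offsets are all assumptions made for the example.

```python
import random

def symbol_disjoint_split(symbols, val_frac=0.2, seed=0):
    """Assign each ticker to exactly one split, so near-duplicate
    charts of the same symbol cannot appear in both training and
    validation. (Illustrative sketch, not the article's code.)"""
    rng = random.Random(seed)
    pool = sorted(set(symbols))
    rng.shuffle(pool)
    n_val = max(1, int(len(pool) * val_frac))
    return set(pool[n_val:]), set(pool[:n_val])  # (train, val)

def purge_embargo(train_times, val_start, horizon_days, embargo_days):
    """Drop training samples whose forward-return label window
    [t, t + horizon_days) could overlap the validation period,
    plus an extra embargo buffer before it. Times are integer
    day offsets (an assumption for this sketch)."""
    cutoff = val_start - horizon_days - embargo_days
    return [t for t in train_times if t <= cutoff]

train, val = symbol_disjoint_split(
    ["AAPL", "MSFT", "GOOG", "TSLA", "AMZN"], val_frac=0.4, seed=1
)
kept = purge_embargo([0, 5, 10, 15, 20],
                     val_start=20, horizon_days=5, embargo_days=2)
```

The key property is that the two mechanisms address different leakage paths: the split is disjoint in *symbols* (no ticker in both sets), while the purge-embargo is disjoint in *time* (no training label window reaches into the validation period).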