Benchmarking Benchmark Leakage in Large Language Models
Background:
The study focuses on the growing problem of benchmark dataset leakage in the training of large language models (LLMs). Such leakage can inflate benchmark scores, undermine the reliability of evaluations, and lead to unfair comparisons between models, hindering the field's progress.
Objective:
To develop a detection pipeline that can identify whether LLMs have been trained on benchmark data, helping preserve the integrity of model evaluations.
Methodology:
The researchers introduce two metrics, Perplexity and N-gram Accuracy, that measure how precisely a model predicts benchmark content and thereby indicate potential data leakage. They apply these metrics to 31 LLMs, evaluating them on mathematical reasoning benchmarks; a sketch of both metrics follows below.
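To make the two metrics concrete, here is a minimal sketch of how they can be computed with a HuggingFace causal language model. The function names, the n-gram length, and the number of sampled starting points are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of the two leakage-detection metrics, assuming a HuggingFace
# causal LM. Defaults (n-gram length, number of starting points) are illustrative.
import math
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perplexity(model, tokenizer, text: str) -> float:
    """Perplexity of `text` under the model: exp of the mean token NLL.
    Unusually low perplexity on benchmark samples can indicate memorization."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(input_ids, labels=input_ids).loss
    return math.exp(loss.item())


def ngram_accuracy(model, tokenizer, text: str, n: int = 5, starts: int = 8) -> float:
    """Fraction of greedily predicted n-grams that exactly match the reference.
    High accuracy on benchmark text suggests the model saw it during training."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0].to(model.device)
    if len(ids) < n + 2:
        return 0.0
    # Sample starting positions; at each one, greedily decode n tokens and
    # compare them with the ground-truth continuation.
    positions = random.sample(range(1, len(ids) - n), min(starts, len(ids) - n - 1))
    hits = 0
    for pos in positions:
        prefix = ids[:pos].unsqueeze(0)
        with torch.no_grad():
            out = model.generate(prefix, max_new_tokens=n, do_sample=False)
        predicted = out[0, pos:pos + n]
        reference = ids[pos:pos + n]
        hits += int(torch.equal(predicted, reference))
    return hits / len(positions)


# Example usage (model name and sample text are placeholders):
# tok = AutoTokenizer.from_pretrained("gpt2")
# lm = AutoModelForCausalLM.from_pretrained("gpt2")
# sample = "Natalia sold clips to 48 of her friends in April ..."
# print(perplexity(lm, tok, sample), ngram_accuracy(lm, tok, sample))
```

A model that has memorized a benchmark instance will typically show both a markedly lower perplexity and a higher n-gram accuracy on that instance than on unseen text of similar style.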
Key Findings:
- Significant instances of potential data leakage were found across several models.
- Models such as Qwen-1.8B, Aquila2, and InternLM2 showed unusually high prediction accuracy on benchmark test sets, suggesting exposure to this data during training.
- The study proposes the adoption of a "Benchmark Transparency Card" for documenting model training and data usage, promoting transparency and ethical development.
Implications:
This research underscores the need for clear documentation and ethical guidelines in AI development to prevent data leakage and ensure fair and accurate model evaluations.