Abstract

Techniques for evaluating LLMs have regrettably lagged behind advances in LLM development, making it difficult to quantify progress. My research calls for rethinking the fundamental principles underlying the evaluation of Transformer-based language models. I will discuss work on applying language models to real tasks, as well as on selecting test data for efficient and robust evaluation. I will present challenges that arise when we do not know what the ground truth might be. Finally, I will discuss some ideas on evaluating Transformer language models without involving language.
