How to Improve Your LLM: Combine Evaluations with Analytics
The future of LLM evaluations resembles software testing more than benchmarking. Real-world testing means asking an LLM to do what users actually ask of it, such as producing a Dad joke like this zinger: “I’m reading a book about gravity, and it’s impossible to put down.”
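To make that concrete, an evaluation can be written like a unit test. Below is a minimal sketch, assuming the OpenAI Python client and pytest; the model name, the `generate_dad_joke` helper, and the specific assertions are illustrative placeholders, not a prescribed setup.

```python
# Sketch: an LLM eval written as a software test.
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def generate_dad_joke(topic: str) -> str:
    """Ask the model for a one-line Dad joke about the given topic."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": f"Tell me a one-line Dad joke about {topic}.",
        }],
    )
    return response.choices[0].message.content

def test_dad_joke_about_gravity():
    joke = generate_dad_joke("gravity")
    # Property-style checks instead of exact-match scoring:
    assert joke, "model returned an empty response"
    assert len(joke) < 200, "a one-liner should stay short"
    assert "gravity" in joke.lower(), "joke should stay on topic"
```

Run under `pytest`, this behaves like any other regression test: it fails loudly when a model or prompt change breaks behavior you care about.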
Machine learning benchmarks, such as those Google published for Gemini 2 last week, precision and recall for classifying dog and cat photos, or the BLEU score for measuring machine translation quality, provide a high-level comparison of relative model performance.
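As a toy illustration of what those benchmark-style metrics actually compute, here is a short sketch using scikit-learn for precision and recall and NLTK for BLEU; the labels and sentences are made up for the example.

```python
# Toy illustration of benchmark-style metrics: precision/recall for a
# cat-vs-dog classifier and BLEU for machine translation.
from sklearn.metrics import precision_score, recall_score
from nltk.translate.bleu_score import sentence_bleu

# Classifier outputs: 1 = dog, 0 = cat (made-up labels)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
print("precision:", precision_score(y_true, y_pred))  # ~0.67
print("recall:", recall_score(y_true, y_pred))        # ~0.67

# BLEU compares candidate n-grams against one or more references;
# weights=(0.5, 0.5) scores unigrams and bigrams only.
reference = [["the", "cat", "sits", "on", "the", "mat"]]
candidate = ["the", "cat", "sat", "on", "the", "mat"]
print("BLEU-2:", sentence_bleu(reference, candidate, weights=(0.5, 0.5)))  # ~0.71
```

Metrics like these are cheap to compute over labeled datasets, which is exactly what makes them useful for coarse model comparison and insufficient for testing your application's real behavior.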


