
With AI models clobbering every benchmark, it’s time for human evaluation
Artificial intelligence has traditionally progressed through automated accuracy tests on tasks intended to approximate human knowledge. Carefully constructed benchmarks such as the General Language Understanding Evaluation (GLUE), the Massive Multitask Language Understanding (MMLU) dataset, and "Humanity's Last Exam" used large…