PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models

This application presents the results of several models that we evaluated on a verbal reasoning challenge (Papers, ArXiv). The overall results are below.

model                                        total  correct  accuracy
completions-sonnet3p7_20250219_nothinking    594    266      0.45
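
The accuracy value appears to be the number of correct answers divided by the total number of questions, rounded to two decimal places. The sketch below checks that arithmetic for the row above; the function name `accuracy` is hypothetical and is not taken from the application's code.

```python
def accuracy(correct: int, total: int) -> float:
    """Return the fraction of questions answered correctly.

    Assumption: accuracy = correct / total, which matches the reported row.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


# Values from the results row for completions-sonnet3p7_20250219_nothinking.
acc = accuracy(correct=266, total=594)
print(f"{acc:.2f}")  # prints 0.45
```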