A small benchmark of short natural-language prompts that look easy but expose weak reasoning — goal grounding, world-state tracking, social pragmatics, modified-riddle templates, literal precision ...