The Gold Medals in an Empty Room: Diagnosing Metalinguistic Reasoning in LLMs with Camlang
Abstract
Camlang, a constructed language, is used to evaluate whether LLMs can master unfamiliar languages through metalinguistic reasoning, revealing that current models lack systematic grammatical mastery compared to humans.
Large Language Models (LLMs) achieve gold-medal performance across many benchmarks, yet it remains unclear whether such success reflects genuine reasoning or surface pattern matching. From a cognitive science perspective, an informative test is whether models can master an unfamiliar language through explicit metalinguistic deductive learning, a paradigm in which human learners reliably internalise grammatical systems through metalinguistic reasoning. We address this question with Camlang, a novel constructed language that exhibits naturalistic yet unattested feature combinations. Camlang consists of two explicit resources, a grammar book and a bilingual dictionary, which mirror adult second-language learning via explicit grammar rules and lexical lookup, and which enable us to disentangle errors in morpho-syntax, lexical semantics, and sentence-level reasoning. Human experiments show that these resources are sufficient for participants to acquire Camlang and solve Camlang tasks successfully. To operationalise evaluation, we adapt CommonsenseQA into Camlang, creating Camlang-CSQA-v0, the first task in a broader suite where solving questions requires applying grammar rules and lexical mappings. Experimental results show that GPT-5 achieves 98% EM accuracy in English but only 47% in Camlang, far below the human performance of 87%, while other state-of-the-art reasoning LLMs perform even worse. Human verification further reveals that most model successes stem from shallow lexical alignment, while GPT-5 shows limited but emerging metalinguistic awareness rather than the systematic grammatical mastery that humans demonstrate. Camlang establishes a cognitively grounded evaluation paradigm that exposes fundamental gaps between current models and human metalinguistic competence.
Community
This paper introduces Camlang, a typologically plausible yet novel constructed (artificial) language, presented with a grammar book and a bilingual dictionary. By adapting CommonsenseQA into Camlang, the paper tests not only grammatical rule acquisition but also the integration of explicit rules with commonsense reasoning. Experimental results show that GPT-5 achieves 98% EM accuracy in English but only 47% in Camlang, while other reasoning LLMs perform even worse. In contrast, human participant experiments show that these resources are sufficient for participants to acquire Camlang, reaching 87% EM accuracy in Camlang compared to 91% in English. Human verification analysis further shows that most model successes stem from shallow lexical alignment, while GPT-5 shows limited but emerging metalinguistic awareness rather than the systematic grammatical mastery that humans demonstrate.