Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias

Dentella, Vittoria; Guenther, Fritz; Leivada, Evelina

Identifying data

Identifier: imarina:9440482

Handle: https://hdl.handle.net/20.500.11797/imarina9440482

Authors:
Dentella, Vittoria; Guenther, Fritz; Leivada, Evelina

Abstract:
Humans are universally good in providing stable and accurate judgments about what forms part of their language and what not. Large Language Models (LMs) are claimed to possess human-like language abilities; hence, they are expected to emulate this behavior by providing both stable and accurate answers, when asked whether a string of words complies with or deviates from their next-word predictions. This work tests whether stability and accuracy are showcased by GPT-3/text-davinci-002, GPT-3/text-davinci-003, and ChatGPT, using a series of judgment tasks that tap on 8 linguistic phenomena: plural attraction, anaphora, center embedding, comparatives, intrusive resumption, negative polarity items, order of adjectives, and order of adverbs. For every phenomenon, 10 sentences (5 grammatical and 5 ungrammatical) are tested, each randomly repeated 10 times, totaling 800 elicited judgments per LM (total n = 2,400). Our results reveal variable above-chance accuracy in the grammatical condition, below-chance accuracy in the ungrammatical condition, a significant instability of answers across phenomena, and a yes-response bias for all the tested LMs. Furthermore, we found no evidence that repetition aids the Models to converge on a processing strategy that culminates in stable answers, either accurate or inaccurate. We demonstrate that the LMs' performance in identifying (un)grammatical word patterns is in stark contrast to what is observed in humans (n = 80, tested on the same tasks) and argue that adopting LMs as theories of human language is not motivated at their current stage of development.

Other:

Author as given in the article: Dentella, Vittoria; Guenther, Fritz; Leivada, Evelina
Department: Estudis Anglesos i Alemanys (English and German Studies)
URV author(s): Dentella, Vittoria
Keywords: Bias; Cognition; Cognitive models; Humans; Judgment; Language; Language models; Linguistics
Subject areas: Anthropology; Anthropology / archaeology; Astronomy / physics; Biodiversity; Biotechnology; Computer science; Agricultural sciences I; Environmental sciences; Biological sciences I; Biological sciences II; Biological sciences III; Physical education; Engineering I; Engineering II; Engineering III; Engineering IV; Pharmacy; General or multidisciplinary; Geosciences; Geography; Interdisciplinary; Mathematics / probability and statistics; Medicine I; Medicine II; Medicine III; Veterinary medicine; Multidisciplinary; Multidisciplinary sciences; Dentistry; Psychology; Chemistry; Collective health; Animal science / fishery resources
License for use: https://creativecommons.org/licenses/by/3.0/es/
Author's email address: vittoria.dentella@estudiants.urv.cat
Author identifier: 0000-0001-6697-9184
Record creation date: 2025-02-18
Deposited article version: info:eu-repo/semantics/publishedVersion
Article reference according to the original source: Proceedings of the National Academy of Sciences of the United States of America. 120 (51): e2309583120-
Item reference in APA style: Dentella, Vittoria; Guenther, Fritz; Leivada, Evelina (2023). Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias. Proceedings of the National Academy of Sciences of the United States of America, 120(51), e2309583120-. DOI: 10.1073/pnas.2309583120
License document URL: https://repositori.urv.cat/ca/proteccio-de-dades/
Institution: Universitat Rovira i Virgili
Journal publication year: 2023
Publication type: Journal Publications
