Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias

  • Identifying data

    Identifier: imarina:9440482
    Authors:
    Dentella, Vittoria; Guenther, Fritz; Leivada, Evelina
    Abstract:
    Humans are universally good in providing stable and accurate judgments about what forms part of their language and what not. Large Language Models (LMs) are claimed to possess human-like language abilities; hence, they are expected to emulate this behavior by providing both stable and accurate answers, when asked whether a string of words complies with or deviates from their next-word predictions. This work tests whether stability and accuracy are showcased by GPT-3/text-davinci-002, GPT-3/text-davinci-003, and ChatGPT, using a series of judgment tasks that tap on 8 linguistic phenomena: plural attraction, anaphora, center embedding, comparatives, intrusive resumption, negative polarity items, order of adjectives, and order of adverbs. For every phenomenon, 10 sentences (5 grammatical and 5 ungrammatical) are tested, each randomly repeated 10 times, totaling 800 elicited judgments per LM (total n = 2,400). Our results reveal variable above-chance accuracy in the grammatical condition, below-chance accuracy in the ungrammatical condition, a significant instability of answers across phenomena, and a yes-response bias for all the tested LMs. Furthermore, we found no evidence that repetition aids the Models to converge on a processing strategy that culminates in stable answers, either accurate or inaccurate. We demonstrate that the LMs' performance in identifying (un)grammatical word patterns is in stark contrast to what is observed in humans (n = 80, tested on the same tasks) and argue that adopting LMs as theories of human language is not motivated at their current stage of development.
  • Other:

    Author as given in the article: Dentella, Vittoria; Guenther, Fritz; Leivada, Evelina
    Department: Estudis Anglesos i Alemanys (English and German Studies)
    URV author(s): Dentella, Vittoria
    Keywords: Bias; Cognition; Cognitive models; Humans; Judgment; Language; Language models; Linguistics
    Subject areas: Anthropology; Anthropology / archaeology; Astronomy / physics; Biodiversity; Biotechnology; Computer science; Agricultural sciences I; Environmental sciences; Biological sciences I; Biological sciences II; Biological sciences III; Physical education; Engineering I; Engineering II; Engineering III; Engineering IV; Pharmacy; General or multidisciplinary; Geosciences; Geography; Interdisciplinary; Mathematics / probability and statistics; Medicine I; Medicine II; Medicine III; Veterinary medicine; Multidisciplinary; Multidisciplinary sciences; Dentistry; Psychology; Chemistry; Public health; Animal science / fishery resources
    License of use: https://creativecommons.org/licenses/by/3.0/es/
    Author's email address: vittoria.dentella@estudiants.urv.cat
    Author identifier: 0000-0001-6697-9184
    Record creation date: 2025-02-18
    Version of the deposited article: info:eu-repo/semantics/publishedVersion
    Article reference according to the original source: Proceedings of the National Academy of Sciences of the United States of America. 120 (51): e2309583120
    Item reference in APA style: Dentella, Vittoria; Guenther, Fritz; Leivada, Evelina (2023). Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias. Proceedings of the National Academy of Sciences of the United States of America, 120(51), e2309583120. DOI: 10.1073/pnas.2309583120
    License document URL: https://repositori.urv.cat/ca/proteccio-de-dades/
    Entity: Universitat Rovira i Virgili
    Journal publication year: 2023
    Publication type: Journal Publications