A White-Box Sociolinguistic Model for Gender Detection

  • Identifying data

    Identifier: imarina:9247856
    Authors:
    Morales Sanchez, Damian; Moreno, Antonio; Jimenez Lopez, Maria Dolores
    Abstract:
    Within the area of Natural Language Processing, we approached the Author Profiling task as a text classification problem. Based on the author's writing style, sociodemographic information, such as the author's gender, age, or native language, can be predicted. The exponential growth of user-generated data and the development of Machine-Learning techniques have led to significant advances in automatic gender detection. Unfortunately, gender detection models often become black boxes in terms of interpretability. In this paper, we propose a tree-based computational model for gender detection made up of 198 features. Unlike previous work on gender detection, we organized the features from a linguistic perspective into six categories: orthographic, morphological, lexical, syntactic, digital, and pragmatics-discursive. We implemented a Decision-Tree classifier to evaluate the performance of all feature combinations, and the experiments revealed that, on average, the classification accuracy increased by up to 3.25% with the addition of feature sets. The maximum classification accuracy was reached by a three-level model that combined lexical, syntactic, and digital features. We present the most relevant features for gender detection according to the trees generated by the classifier and contextualize the significance of the computational results with the linguistic patterns defined by previous research in relation to gender.
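
The abstract mentions evaluating "all feature combinations" across the six linguistic categories. As a rough illustration of what that search space looks like, the sketch below enumerates every non-empty combination of the six category names. It is not the authors' code: the function names and the `evaluate` callback are hypothetical placeholders, and in the paper the scoring step would be a Decision-Tree classifier trained on the corresponding features.

```python
from itertools import combinations

# The six linguistic categories named in the abstract.
CATEGORIES = [
    "orthographic", "morphological", "lexical",
    "syntactic", "digital", "pragmatics-discursive",
]


def feature_set_combinations(categories):
    """Yield every non-empty combination of categories, smallest first."""
    for size in range(1, len(categories) + 1):
        for combo in combinations(categories, size):
            yield combo


def best_combination(categories, evaluate):
    """Return the (combination, score) pair with the highest score.

    `evaluate` is a hypothetical callback that maps a tuple of category
    names to a score, e.g. the accuracy of a classifier trained on them.
    """
    scored = ((combo, evaluate(combo)) for combo in feature_set_combinations(categories))
    return max(scored, key=lambda pair: pair[1])


combos = list(feature_set_combinations(CATEGORIES))
print(len(combos))  # 2**6 - 1 = 63 non-empty combinations
```

With six categories there are 63 candidate models, so an exhaustive search of this kind stays cheap; the three-category combination reported as best in the abstract (lexical, syntactic, digital) is one of them.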
  • Other:

    Author according to the article: Morales Sanchez, Damian; Moreno, Antonio; Jimenez Lopez, Maria Dolores
    Department: Enginyeria Informàtica i Matemàtiques; Filologies Romàniques
    URV author(s): Jiménez López, María Dolores / Morales Sánchez, Damián / Moreno Ribas, Antonio
    Keywords: Machine learning; Gender detection; Computational sociolinguistics; Author profiling; Author
    Subject areas: Química; Process chemistry and technology; Physics, applied; Materials science, multidisciplinary; Materials science (miscellaneous); Materials science (all); Materiais; Instrumentation; General materials science; General engineering; Fluid flow and transfer processes; Engineering, multidisciplinary; Engineering (miscellaneous); Engineering (all); Engenharias ii; Engenharias i; Computer science applications; Ciências biológicas iii; Ciências biológicas ii; Ciências biológicas i; Ciências agrárias i; Ciência de alimentos; Chemistry, multidisciplinary; Biodiversidade; Astronomia / física
    License: https://creativecommons.org/licenses/by/3.0/es/
    Author email addresses: damian.morales@urv.cat antonio.moreno@urv.cat mariadolores.jimenez@urv.cat
    Author identifier: 0000-0003-3945-2314 0000-0001-5544-3210
    Record creation date: 2024-10-12
    Deposited article version: info:eu-repo/semantics/publishedVersion
    Original source link: https://www.mdpi.com/2076-3417/12/5/2676
    License document URL: https://repositori.urv.cat/ca/proteccio-de-dades/
    Article reference according to the original source: Applied Sciences-Basel. 12 (5): 2676-
    Item reference in APA style: Morales Sanchez, Damian; Moreno, Antonio; Jimenez Lopez, Maria Dolores (2022). A White-Box Sociolinguistic Model for Gender Detection. Applied Sciences-Basel, 12(5), 2676-. DOI: 10.3390/app12052676
    Article DOI: 10.3390/app12052676
    Institution: Universitat Rovira i Virgili
    Journal publication year: 2022
    Publication type: Journal Publications
  • Keywords:

    Chemistry, Multidisciplinary; Computer Science Applications; Engineering (Miscellaneous); Engineering, Multidisciplinary; Fluid Flow and Transfer Processes; Instrumentation; Materials Science (Miscellaneous); Materials Science, Multidisciplinary; Physics, Applied; Process Chemistry and Technology
    Machine learning
    Gender detection
    Computational sociolinguistics
    Author profiling
    Author
    Química
    Process chemistry and technology
    Physics, applied
    Materials science, multidisciplinary
    Materials science (miscellaneous)
    Materials science (all)
    Materiais
    Instrumentation
    General materials science
    General engineering
    Fluid flow and transfer processes
    Engineering, multidisciplinary
    Engineering (miscellaneous)
    Engineering (all)
    Engenharias ii
    Engenharias i
    Computer science applications
    Ciências biológicas iii
    Ciências biológicas ii
    Ciências biológicas i
    Ciências agrárias i
    Ciência de alimentos
    Chemistry, multidisciplinary
    Biodiversidade
    Astronomia / física