
A White-Box Sociolinguistic Model for Gender Detection

  • Identifying data

    Identifier:  imarina:9247856
    Authors:  Sanchez, DM; Moreno, A; Lopez, MDJ
    Abstract:
    Within the area of Natural Language Processing, we approached the Author Profiling task as a text classification problem. Based on the author's writing style, sociodemographic information, such as the author's gender, age, or native language, can be predicted. The exponential growth of user-generated data and the development of machine-learning techniques have led to significant advances in automatic gender detection. Unfortunately, gender detection models often become black boxes in terms of interpretability. In this paper, we propose a tree-based computational model for gender detection made up of 198 features. Unlike previous work on gender detection, we organized the features from a linguistic perspective into six categories: orthographic, morphological, lexical, syntactic, digital, and pragmatics-discursive. We implemented a decision-tree classifier to evaluate the performance of all feature combinations, and the experiments revealed that, on average, the classification accuracy increased up to 3.25% with the addition of feature sets. The maximum classification accuracy was reached by a three-level model that combined lexical, syntactic, and digital features. We present the most relevant features for gender detection according to the trees generated by the classifier and contextualize the significance of the computational results with the linguistic patterns defined by previous research in relation to gender.
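
    The evaluation over "all feature combinations" mentioned in the abstract can be pictured with a minimal sketch. This is an illustration only, not the authors' code: the function names `feature_set_combinations` and `mean_accuracy_by_level` are hypothetical, and the accuracy scores are assumed to come from a classifier trained elsewhere.

```python
from itertools import combinations

# The six linguistic feature categories named in the abstract.
CATEGORIES = [
    "orthographic", "morphological", "lexical",
    "syntactic", "digital", "pragmatics-discursive",
]


def feature_set_combinations(categories):
    """Return every non-empty combination of categories, smallest first."""
    return [
        combo
        for size in range(1, len(categories) + 1)
        for combo in combinations(categories, size)
    ]


def mean_accuracy_by_level(scores):
    """Average accuracy per number of combined feature sets.

    `scores` maps a combination (tuple of category names) to an accuracy
    value. How those accuracies are obtained is an assumption left to the
    caller, e.g. cross-validating a decision-tree classifier.
    """
    by_level = {}
    for combo, accuracy in scores.items():
        by_level.setdefault(len(combo), []).append(accuracy)
    return {
        level: sum(values) / len(values)
        for level, values in sorted(by_level.items())
    }


combos = feature_set_combinations(CATEGORIES)
print(len(combos))  # 63, i.e. 2**6 - 1 non-empty subsets
```

    Grouping scores by the number of combined categories is one way to inspect how accuracy changes as feature sets are added; the three-level model in the abstract corresponds to combinations of size three.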
  • Other:

    Link to the original source: https://www.mdpi.com/2076-3417/12/5/2676
    Item reference in APA style: Sanchez, DM; Moreno, A; Lopez, MDJ (2022). A White-Box Sociolinguistic Model for Gender Detection. Applied Sciences-Basel, 12(5), 2676-. DOI: 10.3390/app12052676
    Article reference according to the original source: Applied Sciences-Basel. 12 (5): 2676-
    Article DOI: 10.3390/app12052676
    Year of journal publication: 2022-03-01
    Institution: Universitat Rovira i Virgili
    Deposited article version: info:eu-repo/semantics/publishedVersion
    Record creation date: 2026-05-09
    URV author(s): Jiménez López, María Dolores / Morales Sánchez, Damián / Moreno Ribas, Antonio
    Department: Enginyeria Informàtica i Matemàtiques, Filologies Romàniques
    License document URL: https://repositori.urv.cat/ca/proteccio-de-dades/
    Publication type: Journal Publications
    Author according to the article: Sanchez, DM; Moreno, A; Lopez, MDJ
    Link to the usage license: https://creativecommons.org/licenses/by/3.0/es/
    Subject areas: Process chemistry and technology, Physics, applied, Materials science, multidisciplinary, Materials science (miscellaneous), Materials science (all), Instrumentation, General materials science, General engineering, Fluid flow and transfer processes, Engineering, multidisciplinary, Engineering (miscellaneous), Engineering (all), Computer science applications, Biological sciences I, Agricultural sciences I, Chemistry, multidisciplinary
    Author email addresses: damian.morales@urv.cat, antonio.moreno@urv.cat, mariadolores.jimenez@urv.cat
  • Keywords:

    Quality education
    Machine learning
    Gender detection
    Computational sociolinguistics
    Author profiling
    Author
    Chemistry, Multidisciplinary
    Computer Science Applications
    Engineering (Miscellaneous)
    Engineering
    Fluid Flow and Transfer Processes
    Instrumentation
    Materials Science (Miscellaneous)
    Materials Science
    Physics, Applied
    Process Chemistry and Technology
    Materials science (all)
    General materials science
    General engineering
    Engineering (all)
    Biological sciences I
    Agricultural sciences I