Repositori institucional URV
TITLE:
A White-Box Sociolinguistic Model for Gender Detection - imarina:9247856

URV's Author/s:Jiménez López, María Dolores / Morales Sánchez, Damián / Moreno Ribas, Antonio
Author, as it appears in the article:Morales Sanchez, Damian; Moreno, Antonio; Jimenez Lopez, Maria Dolores
Author's mail:damian.morales@urv.cat
antonio.moreno@urv.cat
mariadolores.jimenez@urv.cat
Author identifier:0000-0003-3945-2314
0000-0001-5544-3210
Journal publication year:2022
Publication Type:Journal Publications
APA:Morales Sanchez, Damian; Moreno, Antonio; Jimenez Lopez, Maria Dolores (2022). A White-Box Sociolinguistic Model for Gender Detection. Applied Sciences-Basel, 12(5). DOI: 10.3390/app12052676
Paper original source:Applied Sciences-Basel. 12 (5)
Abstract:Within the area of Natural Language Processing, we approached the Author Profiling task as a text classification problem. Based on the author's writing style, sociodemographic information, such as the author's gender, age, or native language can be predicted. The exponential growth of user-generated data and the development of Machine-Learning techniques have led to significant advances in automatic gender detection. Unfortunately, gender detection models often become black-boxes in terms of interpretability. In this paper, we propose a tree-based computational model for gender detection made up of 198 features. Unlike the previous works on gender detection, we organized the features from a linguistic perspective into six categories: orthographic, morphological, lexical, syntactic, digital, and pragmatics-discursive. We implemented a Decision-Tree classifier to evaluate the performance of all feature combinations, and the experiments revealed that, on average, the classification accuracy increased up to 3.25% with the addition of feature sets. The maximum classification accuracy was reached by a three-level model that combined lexical, syntactic, and digital features. We present the most relevant features for gender detection according to the trees generated by the classifier and contextualize the significance of the computational results with the linguistic patterns defined by previous research in relation to gender.
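The abstract mentions evaluating all combinations of the six feature categories. As a minimal sketch of what that search space looks like (not the authors' implementation), the snippet below enumerates every non-empty combination of categories and picks the best one according to a scoring callback. The names `feature_set_combinations`, `best_combination`, and `evaluate` are hypothetical; in practice `evaluate` would train and score a decision-tree classifier on the features belonging to the given categories.

```python
from itertools import combinations

# Category names as listed in the abstract.
CATEGORIES = [
    "orthographic", "morphological", "lexical",
    "syntactic", "digital", "pragmatics-discursive",
]


def feature_set_combinations(categories):
    """Yield every non-empty combination of categories, smallest first."""
    for level in range(1, len(categories) + 1):
        for combo in combinations(categories, level):
            yield combo


def best_combination(categories, evaluate):
    """Return the (combination, score) pair with the highest score.

    `evaluate` is a hypothetical callback (an assumption of this sketch):
    it takes a tuple of category names and returns a numeric score,
    for example a cross-validated accuracy.
    """
    scored = ((combo, evaluate(combo)) for combo in feature_set_combinations(categories))
    return max(scored, key=lambda pair: pair[1])


all_combos = list(feature_set_combinations(CATEGORIES))
```

With six categories there are 2^6 - 1 = 63 non-empty combinations, of which 20 combine exactly three categories, so an exhaustive evaluation is cheap at this granularity.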
Article's DOI:10.3390/app12052676
Link to the original source:https://www.mdpi.com/2076-3417/12/5/2676
Paper version:info:eu-repo/semantics/publishedVersion
Licence for use:https://creativecommons.org/licenses/by/3.0/es/
Department:Enginyeria Informàtica i Matemàtiques
Filologies Romàniques
Licence document URL:https://repositori.urv.cat/ca/proteccio-de-dades/
Thematic Areas:Chemistry
Process chemistry and technology
Physics, applied
Materials science, multidisciplinary
Materials science (miscellaneous)
Materials science (all)
Materials
Instrumentation
General materials science
General engineering
Fluid flow and transfer processes
Engineering, multidisciplinary
Engineering (miscellaneous)
Engineering (all)
Engineering II
Engineering I
Computer science applications
Biological sciences III
Biological sciences II
Biological sciences I
Agricultural sciences I
Food science
Chemistry, multidisciplinary
Biodiversity
Astronomy / physics
Keywords:Machine learning
Gender detection
Computational sociolinguistics
Author profiling
Author
Entity:Universitat Rovira i Virgili
Record's date:2024-09-07

Available files
File | Description | Format
DocumentPrincipal | DocumentPrincipal | application/pdf

© 2011 Universitat Rovira i Virgili