URV's Author/s: | Jiménez López, María Dolores / Morales Sánchez, Damián / Moreno Ribas, Antonio |
Author, as appears in the article.: | Morales Sanchez, Damian; Moreno, Antonio; Jimenez Lopez, Maria Dolores; |
Author's mail: | damian.morales@urv.cat antonio.moreno@urv.cat mariadolores.jimenez@urv.cat |
Author identifier: | 0000-0003-3945-2314 0000-0001-5544-3210 |
Journal publication year: | 2022 |
Publication Type: | Journal Publications |
APA: | Morales Sanchez, Damian; Moreno, Antonio; Jimenez Lopez, Maria Dolores (2022). A White-Box Sociolinguistic Model for Gender Detection. Applied Sciences-Basel, 12(5), -. DOI: 10.3390/app12052676 |
Paper original source: | Applied Sciences-Basel. 12 (5): |
Abstract: | Within the area of Natural Language Processing, we approached the Author Profiling task as a text classification problem. Based on the author's writing style, sociodemographic information, such as the author's gender, age, or native language, can be predicted. The exponential growth of user-generated data and the development of Machine-Learning techniques have led to significant advances in automatic gender detection. Unfortunately, gender detection models often become black boxes in terms of interpretability. In this paper, we propose a tree-based computational model for gender detection made up of 198 features. Unlike previous work on gender detection, we organized the features from a linguistic perspective into six categories: orthographic, morphological, lexical, syntactic, digital, and pragmatics-discursive. We implemented a Decision-Tree classifier to evaluate the performance of all feature combinations, and the experiments revealed that, on average, the classification accuracy increased by up to 3.25% with the addition of feature sets. The maximum classification accuracy was reached by a three-level model that combined lexical, syntactic, and digital features. We present the most relevant features for gender detection according to the trees generated by the classifier and contextualize the significance of the computational results with the linguistic patterns defined by previous research in relation to gender. |
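The abstract's "all feature combinations" can be illustrated with a minimal sketch. It only enumerates the combinations of the six feature categories; it does not reproduce the authors' classifier or data. The function name `feature_set_combinations` and the variable names are assumptions made for illustration, and "level" is taken here to mean the number of categories combined.

```python
from itertools import combinations

# The six feature categories named in the abstract.
CATEGORIES = ["orthographic", "morphological", "lexical",
              "syntactic", "digital", "pragmatics-discursive"]

def feature_set_combinations(categories):
    """Yield every non-empty combination of categories, smallest first.

    Hypothetical helper: in an experiment, each combination would select
    the feature columns passed to a decision-tree classifier.
    """
    for level in range(1, len(categories) + 1):
        for combo in combinations(categories, level):
            yield combo

all_combos = list(feature_set_combinations(CATEGORIES))
three_level = [c for c in all_combos if len(c) == 3]

print(len(all_combos))   # 2**6 - 1 = 63 non-empty combinations
print(len(three_level))  # C(6, 3) = 20 three-category combinations
print(("lexical", "syntactic", "digital") in three_level)  # True
```

Under this reading, the best-performing model reported in the abstract corresponds to one of the 20 three-category combinations.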
Article's DOI: | 10.3390/app12052676 |
Link to the original source: | https://www.mdpi.com/2076-3417/12/5/2676 |
Paper version: | info:eu-repo/semantics/publishedVersion |
Licence for use: | https://creativecommons.org/licenses/by/3.0/es/ |
Department: | Enginyeria Informàtica i Matemàtiques; Filologies Romàniques |
Licence document URL: | https://repositori.urv.cat/ca/proteccio-de-dades/ |
Thematic Areas: | Química; Process chemistry and technology; Physics, applied; Materials science, multidisciplinary; Materials science (miscellaneous); Materials science (all); Materiais; Instrumentation; General materials science; General engineering; Fluid flow and transfer processes; Engineering, multidisciplinary; Engineering (miscellaneous); Engineering (all); Engenharias ii; Engenharias i; Computer science applications; Ciências biológicas iii; Ciências biológicas ii; Ciências biológicas i; Ciências agrárias i; Ciência de alimentos; Chemistry, multidisciplinary; Biodiversidade; Astronomia / física |
Keywords: | Machine learning; Gender detection; Computational sociolinguistics; Author profiling; Author |
Entity: | Universitat Rovira i Virgili |
Record's date: | 2024-09-07 |
Type: | Journal Publications |
Contributor: | Universitat Rovira i Virgili |
Title: | A White-Box Sociolinguistic Model for Gender Detection |
Subject: | Chemistry, Multidisciplinary; Computer Science Applications; Engineering (Miscellaneous); Engineering, Multidisciplinary; Fluid Flow and Transfer Processes; Instrumentation; Materials Science (Miscellaneous); Materials Science, Multidisciplinary; Physics, Applied; Process Chemistry and Technology; Machine learning; Gender detection; Computational sociolinguistics; Author profiling; Author; Química; Process chemistry and technology; Physics, applied; Materials science, multidisciplinary; Materials science (miscellaneous); Materials science (all); Materiais; Instrumentation; General materials science; General engineering; Fluid flow and transfer processes; Engineering, multidisciplinary; Engineering (miscellaneous); Engineering (all); Engenharias ii; Engenharias i; Computer science applications; Ciências biológicas iii; Ciências biológicas ii; Ciências biológicas i; Ciências agrárias i; Ciência de alimentos; Chemistry, multidisciplinary; Biodiversidade; Astronomia / física |
Date: | 2022 |
Creator: | Morales Sanchez, Damian; Moreno, Antonio; Jimenez Lopez, Maria Dolores |
Rights: | info:eu-repo/semantics/openAccess |
File | Description | Format | |
---|---|---|---|
DocumentPrincipal | DocumentPrincipal | application/pdf |
© 2011 Universitat Rovira i Virgili