Identifier: TDX:4222
Authors: Morales Sánchez, Damián
Abstract:
This dissertation, framed within the field of computational sociolinguistics, explores the use of sociolinguistic-derived features in Artificial Intelligence-based computational models for automatic gender detection in Spanish texts.
Our interest lies in designing computational models based on white-box machine learning algorithms and fuzzy logic with sociolinguistic-inspired features.
Drawing on publications from the language and gender field, the computer-mediated communication and gender research area, and computational sociolinguistics, we developed a characterisation of gender organised by linguistic level. This characterisation serves as the foundation of our experimental analysis.
In the experimental analysis, we applied the Decision Tree algorithm with orthographic, morphological, lexical, syntactic, digital, and pragmatic-discursive features to the PAN-AP-13 dataset to identify sociolinguistic patterns of gender. From this first computational experiment, we extended our analysis to other datasets and algorithms: in addition to PAN-AP-13 and the Decision Tree algorithm, we explored the PAN-AP-15, PAN-AP-17, PAN-AP-18, and PAN-AP-19 datasets and the Random Forest and XGBoost algorithms. We designed 63 models from the combinations of the feature sets. The resulting models, none of which used more than 160 linguistic features, achieved a classification accuracy of around 70%.
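The figure of 63 models is consistent with taking every non-empty combination of the six feature sets (2^6 - 1 = 63). The sketch below only illustrates that count. It assumes, without confirmation from the abstract, that this is how the combinations were formed, and the names `FEATURE_SETS` and `feature_set_combinations` are hypothetical.

```python
from itertools import combinations

# The six feature sets named in the abstract.
FEATURE_SETS = [
    "orthographic",
    "morphological",
    "lexical",
    "syntactic",
    "digital",
    "pragmatic-discursive",
]


def feature_set_combinations(feature_sets):
    """Return every non-empty combination of the given feature sets.

    Assumption: each model is trained on one such combination.
    """
    return [
        combo
        for size in range(1, len(feature_sets) + 1)
        for combo in combinations(feature_sets, size)
    ]


combos = feature_set_combinations(FEATURE_SETS)
print(len(combos))  # 63
```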
We concluded the experimental analysis with a sociolinguistic characterisation of gender based on 39 patterns organised according to their robustness.
Our theoretical proposal presents 64 fuzzy models, of which 57 are ensemble fuzzy models whose final output was calculated using the majority vote scheme. According to the results, the Orthographic, Lexical, Syntactic, Digital, and Pragmatic-Discursive (OLSDP) ensemble model produced the best results.
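As a minimal sketch of a majority vote scheme, the function below returns the label predicted by most base models. It is not the dissertation's implementation: the function name, the label strings, and the idea of one base model per feature set are assumptions made for illustration, as is the tie-breaking rule.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label that most base models predicted.

    Assumption of this sketch: ties go to the label that appears
    first in `predictions`.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    # max() returns the first maximal key in insertion order,
    # which gives the tie-breaking rule described above.
    return max(counts, key=counts.get)


# Hypothetical outputs of five base models (for example, one per feature
# set in an OLSDP-style ensemble) for a single text.
votes = ["female", "male", "female", "female", "male"]
result = majority_vote(votes)
print(result)  # female
```

With an odd number of base models and two classes, as in this example, a tie cannot occur, so the tie-breaking rule only matters for ensembles with an even number of members.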
White-box machine learning algorithms and fuzzy logic, along with sociolinguistic-inspired features, must be incorporated into automatic gender identification in order to elucidate the complex relationship between language and gender.