Survey and evaluation of Web search engine hit counts as research tools in computational linguistics

  • Identifying data

    Identifier: imarina:5131992
    Authors:
    Sanchez, David; Martinez-Sanahuja, Laura; Batet, Montserrat
  • Other:

    Authors as listed in the article: Sanchez, David; Martinez-Sanahuja, Laura; Batet, Montserrat
    Department: Enginyeria Informàtica i Matemàtiques
    URV author(s): Batet Sanromà, Montserrat / Sánchez Ruenes, David
    Keywords: Web search engines; Semantic similarity; Information distribution; Hit counts; Computational linguistics
    Abstract: In recent years, many studies on computational linguistics have employed the Web as source for research. Specifically, the distribution of textual data in the Web is used to drive linguistic analyses in tasks such as information extraction, knowledge acquisition or natural language processing. For these purposes, commercial Web search engines are commonly used as the low-entry-cost way to access Web data and, more specifically, to estimate the distribution of the entity(ies) of interest from the hit count the search engines provide when querying such entities. Even though several studies have evaluated the effectiveness of Web search engines as information retrieval tools from the perspective of the end users, few authors have assessed the suitability of hit counts as research tools in computational linguistics; moreover, studies so far have focused on the most well-known search engines (typically Google, Bing and Yahoo!), and neglected potentially interesting alternatives that have recently surfaced. To fill this gap, in this work, we first compile and survey the general-purpose search engines that are currently available. Then, we evaluate the suitability of the hit counts they provide under several perspectives that are relevant for computational linguistics: flexibility of the query language, linguistic coherence, mathematical coherence and temporal consistency. The results of our survey show that, even though the choice of a particular search engine has been generally ignored by researchers relying on Web data, there are significant quality differences between the hit counts of current search engines, and that the most well-known and widely-used search engines do not provide the best results. In this respect, we also identify the search engines whose hit counts are best suited for research.
    Subject areas: Software; Sociology; Medicine I; Interdisciplinary; Information systems; Hardware and architecture; Engineering IV; Engineering III; Computer science, information systems; Biological sciences I; Computer science
    License of use: https://creativecommons.org/licenses/by/3.0/es/
    Author email addresses: montserrat.batet@urv.cat; david.sanchez@urv.cat
    Author identifiers: 0000-0001-8174-7592; 0000-0001-7275-7887
    Record creation date: 2024-09-07
    Deposited article version: info:eu-repo/semantics/acceptedVersion
    Link to the original source: https://www.sciencedirect.com/science/article/abs/pii/S0306437917303290
    License document URL: https://repositori.urv.cat/ca/proteccio-de-dades/
    Article reference according to the original source: Information Systems, 73, 50-60
    Item reference in APA style: Sanchez, David; Martinez-Sanahuja, Laura; Batet, Montserrat (2018). Survey and evaluation of Web search engine hit counts as research tools in computational linguistics. Information Systems, 73, 50-60. DOI: 10.1016/j.is.2017.12.007
    Article DOI: 10.1016/j.is.2017.12.007
    Institution: Universitat Rovira i Virgili
    Journal publication year: 2018
    Publication type: Journal Publications
  • Keywords:

    Computer Science, Information Systems; Hardware and Architecture; Information Systems; Software
    Web search engines
    Semantic similarity
    Information distribution
    Hit counts
    Computational linguistics
    Software
    Sociology
    Medicine I
    Interdisciplinary
    Information systems
    Hardware and architecture
    Engineering IV
    Engineering III
    Computer science, information systems
    Biological sciences I
    Computer science