CoHAtNet: An integrated convolutional-transformer architecture with hybrid self-attention for end-to-end camera localization

Hasan, H; Garcia, MA; Rashwan, H; Puig, D

doi:10.1016/j.imavis.2025.105674

Identifying data

Identifier: imarina:9463954

Handle: https://hdl.handle.net/20.500.11797/imarina9463954

Authors: Hasan, H; Garcia, MA; Rashwan, H; Puig, D

Abstract:
Camera localization refers to the process of automatically determining the position and orientation of a camera within its 3D environment from the images it captures. Traditional camera localization methods often rely on Convolutional Neural Networks, which are effective at extracting local visual features but struggle to capture long-range dependencies critical for accurate localization. In contrast, Transformer-based approaches model global contextual relationships appropriately, although they often lack precision in fine-grained spatial representations. To bridge this gap, we introduce CoHAtNet, a novel Convolutional Hybrid-Attention Network that tightly integrates convolutional and self-attention mechanisms. Unlike previous hybrid models that stack convolutional and attention layers separately, CoHAtNet embeds local features extracted via Mobile Inverted Bottleneck Convolution blocks directly into the Value component of the self-attention mechanism of Transformers. This yields a hybrid self-attention block capable of dynamically capturing both local spatial detail and global semantic context within a single attention layer. Additionally, CoHAtNet enables modality-level fusion by processing RGB and depth data jointly in a unified pipeline, allowing the model to leverage complementary appearance and geometric cues throughout. Extensive evaluations have been conducted on two widely used camera localization datasets: 7-Scenes (RGB-D) and Cambridge Landmarks (RGB). Experimental results show that CoHAtNet achieves state-of-the-art performance in both translation and orientation accuracy. These results highlight the effectiveness of our hybrid design in challenging indoor and outdoor environments. This makes CoHAtNet a strong candidate for end-to-end camera localization tasks.
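The idea of placing local features into the Value component of self-attention can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`local_features`, `hybrid_self_attention`), the token and weight shapes, and the choice to build the values purely from locally smoothed tokens are all assumptions made for illustration, and a simple moving average over neighbouring tokens stands in for a Mobile Inverted Bottleneck Convolution block.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_features(tokens, kernel=3):
    # Hypothetical stand-in for an MBConv block: a per-channel moving
    # average over neighbouring tokens (edge-padded), so every output
    # row mixes information from nearby positions only.
    pad = kernel // 2
    padded = np.pad(tokens, ((pad, pad), (0, 0)), mode="edge")
    return np.stack(
        [padded[i:i + kernel].mean(axis=0) for i in range(tokens.shape[0])]
    )

def hybrid_self_attention(tokens, w_q, w_k, w_v):
    # Queries and keys come from the raw tokens (global context).
    q = tokens @ w_q
    k = tokens @ w_k
    # Assumption: the values are projected from the local features,
    # so the attention weights aggregate locally enriched content.
    v = local_features(tokens) @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, dim = 6, 4  # illustrative sizes, not taken from the paper
tokens = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
out, weights = hybrid_self_attention(tokens, w_q, w_k, w_v)
```

In this sketch the attention weights still span all tokens, while the content being aggregated has already been mixed locally, which is one plausible reading of "local detail and global context within a single attention layer".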
Other:

Original source link: https://www.sciencedirect.com/science/article/pii/S0262885625002628?via%3Dihub
Item reference in APA style: Hasan, H; Garcia, MA; Rashwan, H; Puig, D (2025). CoHAtNet: An integrated convolutional-transformer architecture with hybrid self-attention for end-to-end camera localization. Image and Vision Computing, 162, 105674. DOI: 10.1016/j.imavis.2025.105674
Article reference according to the original source: Image And Vision Computing. 162 105674-
Article DOI: 10.1016/j.imavis.2025.105674
Journal publication year: 2025-10-01
Entity: Universitat Rovira i Virgili
Deposited article version: info:eu-repo/semantics/publishedVersion
Record creation date: 2026-02-13
URV author(s): Abdellatif Fatahallah Ibrahim Mahmoud, Hatem / Puig Valls, Domènec Savi
Department: Enginyeria Informàtica i Matemàtiques (Computer Engineering and Mathematics)
License document URL: https://repositori.urv.cat/ca/proteccio-de-dades/
Publication type: Journal Publications
Author according to the article: Hasan, H; Garcia, MA; Rashwan, H; Puig, D
Access to the license of use: https://creativecommons.org/licenses/by/3.0/es/
Subject areas: Arts / music, Biotechnology, Computer science, Biological sciences I, Computer science, artificial intelligence, Computer science, software engineering, Computer science, software, graphics, programming, Computer science, theory & methods, Computer vision and pattern recognition, Law, Electrical and electronic engineering, Engineering IV, Engineering, electrical & electronic, Interdisciplinary, Mathematics / probability and statistics, Optics, Chemistry, Signal processing
Author's email address: domenec.puig@urv.cat, hatem.abdellatif@urv.cat

Keywords:

3-d environments
Affordable and clean energy
Attention mechanisms
Camera localization
Cameras
Coatnet
Convolution
Convolutional neural network
Convolutional neural networks
End to end
Hybrid cnn-transformer
Hybrid cnn-transformers
Hybrid self-attention
Image processing
Localization method
Position and orientations
Semantics
Computer Science
Artificial Intelligence
Software Engineering
Software
Graphics
Programming
Theory & Methods
Computer Vision and Pattern Recognition
Electrical and Electronic Engineering
Engineering
Electrical & Electronic
Optics
Signal Processing
Arts / music
Biotechnology
Computer science
Biological sciences I
Law
Engineering IV
Interdisciplinary
Mathematics / probability and statistics
Chemistry
Documents:

Main document