CoHAtNet: An integrated convolutional-transformer architecture with hybrid self-attention for end-to-end camera localization

Hasan, H; Garcia, MA; Rashwan, H; Puig, D

doi:10.1016/j.imavis.2025.105674

Identification data

Identifier: imarina:9463954

Handle: https://hdl.handle.net/20.500.11797/imarina9463954

Authors: Hasan, H; Garcia, MA; Rashwan, H; Puig, D

Abstract:
Camera localization refers to the process of automatically determining the position and orientation of a camera within its 3D environment from the images it captures. Traditional camera localization methods often rely on Convolutional Neural Networks, which are effective at extracting local visual features but struggle to capture long-range dependencies critical for accurate localization. In contrast, Transformer-based approaches model global contextual relationships appropriately, although they often lack precision in fine-grained spatial representations. To bridge this gap, we introduce CoHAtNet, a novel Convolutional Hybrid-Attention Network that tightly integrates convolutional and self-attention mechanisms. Unlike previous hybrid models that stack convolutional and attention layers separately, CoHAtNet embeds local features extracted via Mobile Inverted Bottleneck Convolution blocks directly into the Value component of the self-attention mechanism of Transformers. This yields a hybrid self-attention block capable of dynamically capturing both local spatial detail and global semantic context within a single attention layer. Additionally, CoHAtNet enables modality-level fusion by processing RGB and depth data jointly in a unified pipeline, allowing the model to leverage complementary appearance and geometric cues throughout. Extensive evaluations have been conducted on two widely used camera localization datasets: 7-Scenes (RGB-D) and Cambridge Landmarks (RGB). Experimental results show that CoHAtNet achieves state-of-the-art performance in both translation and orientation accuracy. These results highlight the effectiveness of our hybrid design in challenging indoor and outdoor environments. This makes CoHAtNet a strong candidate for end-to-end camera localization tasks.
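Illustrative sketch (not from the paper): the hybrid self-attention idea described in the abstract can be pictured as ordinary scaled dot-product attention in which the Value matrix is computed from convolutional features rather than from the raw tokens. The code below is a minimal, dependency-free approximation under loud assumptions: tokens form a 1D sequence, a depthwise 1D convolution stands in for the Mobile Inverted Bottleneck Convolution block, and every name (local_features, hybrid_self_attention, w_q, w_k, w_v, kernel) is hypothetical. It is not the authors' implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def local_features(tokens, kernel):
    """Depthwise 1D convolution along the token axis, zero padded.

    Assumption: a simplified stand-in for the MBConv block.
    """
    half = len(kernel) // 2
    n, d = len(tokens), len(tokens[0])
    out = []
    for i in range(n):
        row = [0.0] * d
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:
                for c in range(d):
                    row[c] += w * tokens[j][c]
        out.append(row)
    return out


def hybrid_self_attention(tokens, w_q, w_k, w_v, kernel):
    """Attention whose Value is projected from convolutional features."""
    d_k = len(w_q[0])
    q = matmul(tokens, w_q)                          # global queries
    k = matmul(tokens, w_k)                          # global keys
    v = matmul(local_features(tokens, kernel), w_v)  # local detail enters via V
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy usage: three 2-dimensional tokens, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = hybrid_self_attention(
    tokens, identity, identity, identity, [0.25, 0.5, 0.25]
)
```

In this sketch the attention weights still depend only on the global query-key similarity, while the values they mix already carry neighbourhood information, which is one way to read "within a single attention layer" in the abstract. How the paper handles 2D feature maps, multiple heads, and RGB-D fusion is not specified here and is left out.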
Others:

Link to the original source: https://www.sciencedirect.com/science/article/pii/S0262885625002628?via%3Dihub
APA: Hasan, H; Garcia, MA; Rashwan, H; Puig, D (2025). CoHAtNet: An integrated convolutional-transformer architecture with hybrid self-attention for end-to-end camera localization. Image and Vision Computing, 162, 105674. DOI: 10.1016/j.imavis.2025.105674
Paper original source: Image and Vision Computing, 162, 105674
Article's DOI: 10.1016/j.imavis.2025.105674
Journal publication year: 2025-10-01
Entity: Universitat Rovira i Virgili
Paper version: info:eu-repo/semantics/publishedVersion
Record's date: 2026-02-13
URV's Author/s: Abdellatif Fatahallah Ibrahim Mahmoud, Hatem / Puig Valls, Domènec Savi
Department: Enginyeria Informàtica i Matemàtiques
Licence document URL: https://repositori.urv.cat/ca/proteccio-de-dades/
Publication Type: Journal Publications
Author, as appears in the article: Hasan, H; Garcia, MA; Rashwan, H; Puig, D
Licence for use: https://creativecommons.org/licenses/by/3.0/es/
Thematic Areas: Arts / music, Biotechnology, Computer science, Biological sciences I, Computer science, artificial intelligence, Computer science, software engineering, Computer science, software, graphics, programming, Computer science, theory & methods, Computer vision and pattern recognition, Law, Electrical and electronic engineering, Engineering IV, Engineering, electrical & electronic, Interdisciplinary, Mathematics / probability and statistics, Optics, Chemistry, Signal processing
Author's mail: domenec.puig@urv.cat, hatem.abdellatif@urv.cat

Keywords:

3-d environments
Affordable and clean energy
Attention mechanisms
Camera localization
Cameras
Coatnet
Convolution
Convolutional neural network
Convolutional neural networks
End to end
Hybrid cnn-transformer
Hybrid cnn-transformers
Hybrid self-attention
Image processing
Localization method
Position and orientations
Semantics
Computer Science
Artificial Intelligence
Software Engineering
Software
Graphics
Programming
Theory & Methods
Computer Vision and Pattern Recognition
Electrical and Electronic Engineering
Engineering
Electrical & Electronic
Optics
Signal Processing
Arts / music
Biotechnology
Computer science
Biological sciences I
Law
Engineering IV
Interdisciplinary
Mathematics / probability and statistics
Chemistry