We are looking for a highly motivated Deep Learning Engineer with strong expertise in computer vision and audio analysis. The ideal candidate should be comfortable working with CNNs, pretrained models, attention mechanisms, vision transformers, video processing, object detection, and audio classification. You will be part of a team developing AI-driven solutions using multi-modal deep learning.

Responsibilities

Design and implement deep learning models for image, video, object detection, and audio classification tasks.
Apply and fine-tune CNN-based architectures and Vision Transformers (e.g., ViT, Swin).
Integrate attention mechanisms (e.g., SE, CBAM, Transformer attention) into model architectures for enhanced feature learning.
Utilize pretrained models for transfer learning and multi-task learning.
Work with video data using spatiotemporal modeling techniques (e.g., 3D CNNs, temporal attention).
Extract and process features from audio using spectrograms, MFCCs, or learned embeddings.
Evaluate and optimize models for speed, accuracy, and robustness.
Collaborate across teams to deploy models into production.

Requirements

Strong programming skills in Python, with experience in PyTorch or TensorFlow.
Hands-on experience with CNNs, pretrained networks, and attention modules.
Solid knowledge of Vision Transformers, including recent architectures (e.g., Swin, DeiT).
Understanding of attention mechanisms (self-attention, cross-attention, squeeze-and-excitation, etc.).
Experience implementing and training object detection models (YOLO, SSD, Faster R-CNN, RetinaNet, DETR).
Experience in video analysis and temporal modeling.
Strong grasp of audio classification workflows and features.
Experience handling large-scale datasets and designing data pipelines.
Familiarity with training strategies for deep models, including learning rate scheduling, early stopping, and data augmentation.

This job was posted by Anshika Mahapatra from CureBay.