Multimodal Vision LLM Engineer

Experience: 3 years

Salary: 15 - 30 Lacs

Posted: 1 hour ago | Platform: Glassdoor


Work Mode: On-site

Job Type: Full Time

Job Description

About the Role

We are building the next generation of spatial intelligence, where robots and 3D systems understand and interact with the world in real time. As a Multimodal LLM Engineer, you will design, train, and deploy vision-language models that understand detected objects, 3D environments, and dynamic scenes. Your work will enable robots and digital tools to reason about objects, context, safety, and actions, entirely on-device.

You will collaborate closely with perception, robotics, and systems engineers to bring together 3D vision, object detection, and LLM reasoning into a unified real-time intelligence engine.

This is a highly technical role with direct impact on core product capabilities.

Responsibilities

  • Develop and fine-tune multimodal LLMs (vision-language, 3D-language, object-context reasoning).
  • Build pipelines that fuse object detection, 3D data, bounding boxes, and sensor inputs into LLM tokens.
  • Architect models that interpret dynamic scenes, track changes, and deliver contextual reasoning.
  • Implement region-based reasoning, spatial attention, temporal understanding, and affordance prediction.
  • Train and optimize models using frameworks such as LLaVA, Qwen-VL, InternVL, CLIP/SigLIP, SAM, DETR, or custom backbones.
  • Convert raw perception output into structured representations (scene graphs, spatial embeddings); a minimal illustrative sketch of this step follows the list below.
  • Work with Robotics/Systems teams to integrate LLM reasoning into real-time pipelines (30–60 FPS).
  • Develop scalable data pipelines for multimodal datasets (images, detections, 3D meshes, text descriptions).
  • Perform model evaluation on context understanding, safety judgment, and action recommendation.
  • Collaborate on model compression and deployment for edge devices (Rockchip, Jetson, Apple M-series).
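
To give candidates a concrete feel for the perception-to-language fusion described above, here is a minimal, illustrative sketch in plain Python (standard library only). All names such as `Detection`, `detections_to_scene_graph`, and `scene_graph_to_prompt` are hypothetical and do not refer to an existing internal API; the sketch simply shows one way detector output and rough 3D positions could be organised into a scene-graph-style structure and serialised into a text prompt for a vision-language model.

```python
import json
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detected object (hypothetical schema, for illustration only)."""
    label: str                                     # e.g. "person", "forklift"
    confidence: float                              # detector score in [0, 1]
    bbox_xyxy: Tuple[float, float, float, float]   # 2D box in pixels
    position_m: Tuple[float, float, float]         # rough 3D position in metres


def detections_to_scene_graph(detections: List[Detection],
                              min_confidence: float = 0.5) -> dict:
    """Turn raw detections into a simple scene-graph-like structure."""
    kept = [d for d in detections if d.confidence >= min_confidence]
    objects = [
        {
            "id": f"obj_{i}",
            "label": d.label,
            "bbox": [round(v, 1) for v in d.bbox_xyxy],
            "position_m": [round(v, 2) for v in d.position_m],
        }
        for i, d in enumerate(kept)
    ]
    # Naive pairwise relation: mark two objects "near" if within 2 metres.
    relations = []
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            dist = math.dist(objects[i]["position_m"], objects[j]["position_m"])
            if dist < 2.0:
                relations.append(
                    {"subject": objects[i]["id"], "relation": "near",
                     "object": objects[j]["id"]}
                )
    return {"objects": objects, "relations": relations}


def scene_graph_to_prompt(scene: dict, question: str) -> str:
    """Serialise the scene graph into text that a VLM/LLM could consume."""
    return (
        "You are given a structured description of the current scene.\n"
        f"SCENE: {json.dumps(scene)}\n"
        f"QUESTION: {question}\n"
        "Answer using only the objects listed in SCENE."
    )


if __name__ == "__main__":
    dets = [
        Detection("person", 0.92, (100, 80, 180, 300), (1.2, 0.0, 3.5)),
        Detection("forklift", 0.88, (300, 60, 560, 320), (2.0, 0.0, 4.0)),
        Detection("pallet", 0.41, (500, 200, 640, 330), (3.1, 0.0, 5.2)),  # below threshold
    ]
    scene = detections_to_scene_graph(dets)
    print(scene_graph_to_prompt(scene, "Is anyone standing close to the forklift?"))
```

In a production pipeline the text serialisation would more likely be replaced by learned spatial embeddings or dedicated tokens, and the relation extraction would be far richer, but the overall shape of the fusion step is similar.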

Minimum Qualifications

  • MS or PhD in Computer Science, AI/ML, Robotics, or related field—or equivalent experience.
  • 3+ years of experience building deep learning models, including transformers.
  • Hands-on experience with multimodal models (VLMs) or LLM fine-tuning.
  • Strong understanding of one or more of the following:
      • Vision Transformers (ViT, SigLIP)
      • CLIP-style contrastive models
      • LLaVA / BLIP / Qwen-VL / InternVL
      • DETR / SAM / YOLO / 3D perception networks
  • Advanced Python and PyTorch skills.
  • Experience training models with large datasets and distributed systems.
  • Solid understanding of model architecture fundamentals (attention, tokenization, embeddings).

Job Type: Full-time

Pay: ₹1,500,000.00 - ₹3,000,000.00 per year
