Multimodal Vision LLM Engineer

Experience: 3 years

Salary: 15 - 30 Lacs

Posted: 1 hour ago | Platform: Glassdoor


Work Mode: On-site

Job Type: Full Time

Job Description

About the Role

We are building the next generation of spatial intelligence, where robots and 3D systems understand and interact with the world in real time. As a Multimodal LLM Engineer, you will design, train, and deploy vision-language models that understand detected objects, 3D environments, and dynamic scenes. Your work will enable robots and digital tools to reason about objects, context, safety, and actions, entirely on-device.

You will collaborate closely with perception, robotics, and systems engineers to bring together 3D vision, object detection, and LLM reasoning into a unified real-time intelligence engine.

This is a highly technical role with direct impact on core product capabilities.

Responsibilities

  • Develop and fine-tune multimodal LLMs (vision-language, 3D-language, object-context reasoning).
  • Build pipelines that fuse object detection, 3D data, bounding boxes, and sensor inputs into LLM tokens.
  • Architect models that interpret dynamic scenes, track changes, and deliver contextual reasoning.
  • Implement region-based reasoning, spatial attention, temporal understanding, and affordance prediction.
  • Train and optimize models using frameworks such as LLaVA, Qwen-VL, InternVL, CLIP/SigLIP, SAM, DETR, or custom backbones.
  • Convert raw perception output into structured representations (scene graphs, spatial embeddings); a minimal illustrative sketch of this step follows the list below.
  • Work with Robotics/Systems teams to integrate LLM reasoning into real-time pipelines (30–60 FPS).
  • Develop scalable data pipelines for multimodal datasets (images, detections, 3D meshes, text descriptions).
  • Perform model evaluation on context understanding, safety judgment, and action recommendation.
  • Collaborate on model compression and deployment for edge devices (Rockchip, Jetson, Apple M-series).
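
To give candidates a concrete feel for the perception-to-language fusion described above, here is a minimal, illustrative sketch in plain Python (standard library only). All names such as `Detection`, `detections_to_scene_graph`, and `scene_graph_to_prompt` are hypothetical and do not refer to an existing internal API; the sketch simply shows one way detector output and rough 3D positions could be organised into a scene-graph-style structure and serialised into a text prompt for a vision-language model.

```python
import json
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detected object (hypothetical schema, for illustration only)."""
    label: str                                     # e.g. "person", "forklift"
    confidence: float                              # detector score in [0, 1]
    bbox_xyxy: Tuple[float, float, float, float]   # 2D box in pixels
    position_m: Tuple[float, float, float]         # rough 3D position in metres


def detections_to_scene_graph(detections: List[Detection],
                              min_confidence: float = 0.5) -> dict:
    """Turn raw detections into a simple scene-graph-like structure."""
    kept = [d for d in detections if d.confidence >= min_confidence]
    objects = [
        {
            "id": f"obj_{i}",
            "label": d.label,
            "bbox": [round(v, 1) for v in d.bbox_xyxy],
            "position_m": [round(v, 2) for v in d.position_m],
        }
        for i, d in enumerate(kept)
    ]
    # Naive pairwise relation: mark two objects "near" if within 2 metres.
    relations = []
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            dist = math.dist(objects[i]["position_m"], objects[j]["position_m"])
            if dist < 2.0:
                relations.append(
                    {"subject": objects[i]["id"], "relation": "near",
                     "object": objects[j]["id"]}
                )
    return {"objects": objects, "relations": relations}


def scene_graph_to_prompt(scene: dict, question: str) -> str:
    """Serialise the scene graph into text that a VLM/LLM could consume."""
    return (
        "You are given a structured description of the current scene.\n"
        f"SCENE: {json.dumps(scene)}\n"
        f"QUESTION: {question}\n"
        "Answer using only the objects listed in SCENE."
    )


if __name__ == "__main__":
    dets = [
        Detection("person", 0.92, (100, 80, 180, 300), (1.2, 0.0, 3.5)),
        Detection("forklift", 0.88, (300, 60, 560, 320), (2.0, 0.0, 4.0)),
        Detection("pallet", 0.41, (500, 200, 640, 330), (3.1, 0.0, 5.2)),  # below threshold
    ]
    scene = detections_to_scene_graph(dets)
    print(scene_graph_to_prompt(scene, "Is anyone standing close to the forklift?"))
```

In a production pipeline the text serialisation would more likely be replaced by learned spatial embeddings or dedicated tokens, and the relation extraction would be far richer, but the overall shape of the fusion step is similar.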

Minimum Qualifications

  • MS or PhD in Computer Science, AI/ML, Robotics, or related field—or equivalent experience.
  • 3+ years of experience building deep learning models, including transformers.
  • Hands-on experience with multimodal models (VLMs) or LLM fine-tuning.
  • Strong understanding of one or more of the following:
      • Vision Transformers (ViT, SigLIP)
      • CLIP-style contrastive models
      • LLaVA / BLIP / Qwen-VL / InternVL
      • DETR / SAM / YOLO / 3D perception networks
  • Advanced Python and PyTorch skills.
  • Experience training models with large datasets and distributed systems.
  • Solid understanding of model architecture fundamentals (attention, tokenization, embeddings).

Job Type: Full-time

Pay: ₹1,500,000.00 - ₹3,000,000.00 per year
