Generative AI Engineer Vision-Language Model (VLM)

1 - 5 years

0 Lacs

Posted: 2 days ago | Platform: Shine

Work Mode

On-site

Job Type

Full Time

Job Description

As a Vision-Language Model Developer, you will develop, fine-tune, and evaluate vision-language models such as CLIP, Flamingo, BLIP, GPT-4V, and LLaVA. You will design and build multimodal pipelines that integrate image or video input with natural language understanding or generation, working with large-scale image-text datasets such as LAION, COCO, and Visual Genome for training and validation. You will also implement zero-shot and few-shot multimodal inference, retrieval, captioning, VQA (Visual Question Answering), and grounding.

You will collaborate with product teams, ML engineers, and data scientists to deliver real-world multimodal applications, and you will optimize model inference performance and resource utilization in production environments using tools such as ONNX and TensorRT. You will conduct error analysis and ablation studies and propose improvements in vision-language alignment. If you join a research-driven team, you are expected to contribute to research papers, documentation, or patents.

Qualifications for this role include a Bachelor's, Master's, or PhD in Computer Science, AI, Machine Learning, or a related field. You should have at least 2 years of experience in computer vision or NLP, with at least 1 year in multimodal ML or VLMs. Strong Python programming skills are necessary, along with experience in libraries such as PyTorch, HuggingFace Transformers, OpenCV, and torchvision. Familiarity with VLM architectures such as CLIP, BLIP, Flamingo, LLaVA, Kosmos, and GPT-4V is expected. Experience with dataset curation, image-caption pair processing, and image-text embedding strategies is also required, as is a solid understanding of transformers, cross-attention mechanisms, and contrastive learning.

This is a full-time position with a day shift schedule. The work location is in person.
