Posted: 1 week ago | Platform: On-site | Full Time
Key Responsibilities:
- Develop, fine-tune, and evaluate vision-language models (e.g., CLIP, Flamingo, BLIP, GPT-4V, LLaVA).
- Design and build multimodal pipelines that integrate image/video input with natural-language understanding or generation.
- Work with large-scale image-text datasets (e.g., LAION, COCO, Visual Genome) for training and validation.
- Implement zero-shot/few-shot multimodal inference, retrieval, captioning, visual question answering (VQA), and grounding (see the inference sketch below the listing).
- Collaborate closely with product teams, ML engineers, and data scientists to deliver real-world multimodal applications.
- Optimize model inference performance and resource utilization in production environments (ONNX, TensorRT, etc.; see the export sketch below the listing).
- Conduct error analysis and ablation studies, and propose improvements to visual-language alignment.
- Contribute to research papers, documentation, or patents when working on a research-driven team.

Required Skills & Qualifications:
- Bachelor’s/Master’s/PhD in Computer Science, AI, Machine Learning, or a related field.
- 2+ years of experience in computer vision or NLP, with at least one year in multimodal ML or VLMs.
- Strong programming skills in Python, with experience in libraries such as PyTorch, HuggingFace Transformers, OpenCV, and torchvision.
- Familiarity with VLM architectures: CLIP, BLIP, Flamingo, LLaVA, Kosmos, GPT-4V, etc.
- Experience with dataset curation, image-caption pair processing, and image-text embedding strategies.
- Solid understanding of transformers, cross-attention mechanisms, and contrastive learning (see the loss sketch below the listing).

Job Type: Full-time
Pay: ₹25,000.00 - ₹40,000.00 per year
Schedule: Day shift
Work Location: In person
Neuronest AI Pvt Ltd
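To illustrate the zero-shot multimodal inference named in the responsibilities, here is a minimal sketch of zero-shot image classification with CLIP via the HuggingFace Transformers API. The checkpoint name, image path, and candidate captions are illustrative assumptions, not part of the role description.

```python
# Minimal sketch: zero-shot image classification with CLIP
# (HuggingFace Transformers). Checkpoint, image path, and labels
# are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns
# them into a distribution over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```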
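For the inference-optimization responsibility, the export sketch below shows one common first step: exporting a vision encoder to ONNX for optimized serving. The resnet18 backbone, input resolution, and file name are placeholders standing in for a real VLM image tower.

```python
# Minimal sketch: exporting a vision encoder to ONNX. The resnet18
# backbone is a placeholder for a real VLM image encoder; the input
# resolution and output file name are assumptions.
import torch
from torchvision.models import resnet18

model = resnet18(weights=None).eval()  # placeholder image encoder
dummy = torch.randn(1, 3, 224, 224)    # assumed input resolution

torch.onnx.export(
    model, dummy, "encoder.onnx",
    input_names=["pixel_values"], output_names=["features"],
    dynamic_axes={"pixel_values": {0: "batch"}},  # allow variable batch size
    opset_version=17,
)
```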
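Finally, for the contrastive learning named in the qualifications, the loss sketch below gives a minimal version of the symmetric InfoNCE objective used in CLIP-style image-text pretraining. The batch size, embedding width, and temperature value are illustrative assumptions.

```python
# Minimal sketch: symmetric contrastive (InfoNCE) loss as used in
# CLIP-style image-text pretraining. Shapes and temperature are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) similarity matrix; matching pairs sit on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Example usage with random embeddings (illustrative only):
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```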