Machine Learning Operations Engineer (ML Ops - 2)

5 - 8 years

0 Lacs

Posted: 1 day ago | Platform: GlassDoor

Work Mode

Remote

Job Type

Full Time

Job Description

We are looking for a Machine Learning Operations Engineer to join our team and design, build, and integrate ML Ops for large-scale, distributed machine learning systems, with a focus on cutting-edge tools, distributed GPU training, and enhancing research experimentation.

Roles & Responsibilities:

  • Architect, build, and integrate the end-to-end life cycle of large-scale, distributed machine learning systems (i.e., ML Ops) using cutting-edge tools and frameworks.
  • Develop tools and services for the explainability of ML solutions.
  • Implement distributed cloud GPU training approaches for deep learning models.
  • Build software and tools that speed up experimentation for the research team and extract insights from those experiments.
  • Identify and evaluate new patterns and technologies to improve the performance, maintainability, and elegance of our machine learning systems.
  • Lead and execute technical projects to completion. Communicate with peers to build requirements and track progress.
  • Mentor fellow engineers in your areas of expertise. Contribute to a team culture that values effective collaboration, technical excellence, and innovation.
  • Collaborate with engineers across various functions to solve complex data problems at scale.

Qualifications:

  • 5 - 8 years of professional experience implementing MLOps frameworks to scale ML in production.
  • Master’s degree or PhD in Computer Science or in the Machine Learning / Deep Learning domains.

Must-have:

  • Hands-on experience with Kubernetes, Kubeflow, MLflow, SageMaker, and other ML model experiment management tools, including training, inference, and evaluation.
  • Experience in ML model serving (TorchServe, TensorFlow Serving, NVIDIA Triton Inference Server, etc.).
  • Proficiency with ML model training frameworks (PyTorch, PyTorch Lightning, TensorFlow, etc.).
  • Experience with GPU computing for data-parallel and model-parallel training.
  • Solid software engineering skills in developing systems for production.
  • Strong expertise in Python.
  • Experience building end-to-end data systems as an ML Engineer, Platform Engineer, or equivalent.
  • Experience working with cloud data processing technologies (AWS services such as S3, ECR, and Lambda; Spark, Dask, Elasticsearch, Presto, SQL, etc.).
  • Geospatial / remote sensing experience is a plus.

Competencies:

  • Excellent debugging and critical thinking skills.
  • Excellent analytical and problem-solving skills.
  • Ability to work in a fast-paced, team-based environment.

Benefits:

  • Medical Health Cover for you and your family, including unlimited online doctor consultations
  • Access to mental health experts for you and your family
  • Dedicated allowances for learning and skill development
  • Comprehensive leave policy with casual leaves, paid leaves, marriage leaves, bereavement leaves
  • Twice-yearly appraisals

Job Type: Full-time

Work Location: In person
