Overview
We are seeking a highly skilled and proactive AI Solutions SRE Lead to oversee the maintenance, optimization, and ongoing performance of deployed AI/ML systems and solutions. In this role, you'll act as the bridge between innovation and operations, ensuring our AI solutions consistently deliver value and operate seamlessly in real-world environments. You will lead efforts to monitor deployments, troubleshoot issues, and define best practices for sustaining AI systems throughout their lifecycle.
Responsibilities
Monitoring & Sustenance:
- Lead the post-deployment lifecycle of AI solutions, ensuring continued functionality, reliability, and scalability.
- Establish monitoring frameworks to oversee system performance, usage, and metrics for AI/ML models and APIs.
- Detect anomalies in AI systems, troubleshoot operational issues, and initiate timely corrective actions.
Performance Optimization:
- Continuously assess and optimize the performance of AI models to maintain efficiency and accuracy in production environments.
- Collaborate with data scientists and engineers to refine algorithms, retrain models, and update solutions as needed.
- Implement automation where possible to streamline maintenance processes.
Stakeholder Collaboration:
- Work with cross-functional teams (engineering, product, operations, etc.) to ensure alignment of AI sustainment activities with business goals.
- Communicate effectively with stakeholders to provide updates on system health, risks, and improvements.
Governance & Best Practices:
- Define and implement best practices for sustaining AI solutions, including documentation, testing protocols, and version control.
- Ensure compliance with ethical AI standards, regulatory guidelines, and established governance frameworks.
- Manage and mitigate risks associated with model drift, data shifts, and system vulnerabilities.
Incident Management:
- Lead responses to critical incidents involving AI systems by performing root cause analysis and deploying solutions for quick resolution.
- Advocate for proactive risk prevention and early detection strategies.
Mentorship:
- Mentor and develop junior team members, fostering their skills in AI observability and domain-specific knowledge in ML, Computer Vision, and Generative AI.
Qualifications
Required:
- Bachelor's degree in Computer Science, Engineering, Data Science, or related field; advanced degree preferred.
- 9+ years of experience in machine learning, data science, or software engineering roles, with significant exposure to Computer Vision and Generative AI projects.
- 4+ years of experience specifically focused on developing and sustaining AI/ML applications and solutions.
- Strong programming skills in languages such as Python, Java, or Go.
- Extensive experience with AI/ML frameworks (e.g., TensorFlow, PyTorch, scikit-learn) and cloud platforms (e.g., AWS, Azure, GCP).
- Proficiency in data visualization tools and techniques (e.g., Grafana, Tableau, D3.js).
- Deep understanding of AI/ML concepts, including model training, evaluation, and deployment, with specific knowledge of Computer Vision and Generative AI techniques.
- Experience with monitoring and observability tools such as Prometheus, ELK stack, or similar systems.
- Excellent problem-solving skills and ability to troubleshoot complex AI systems across various domains.
- Proven track record of mentoring and developing junior team members in AI-related roles.
Preferred:
- Experience with MLOps practices and tools, particularly for large-scale AI systems.
- Familiarity with AI ethics and responsible AI principles, especially as they relate to Generative AI.
- Knowledge of relevant AI regulations and compliance requirements, including those specific to Computer Vision applications.
- Experience with distributed systems and large-scale data processing for AI applications.
- Contributions to open-source projects or research publications on production-scale AI solutions.
- Previous experience with large-scale AI/ML solutions in production environments.
- Knowledge of DevOps principles and CI/CD pipelines specific to AI/ML systems.
Key Competencies
- Strong analytical and critical thinking skills
- Excellent communication and collaboration abilities
- Proactive and self-motivated work ethic
- Ability to explain complex technical concepts to both technical and non-technical audiences
- Adaptability and willingness to learn in a rapidly evolving field
- Strong mentorship and leadership skills
- Deep curiosity and passion for AI, particularly in ML, Computer Vision, and Generative AI domains
We are looking for a passionate and innovative individual who can help us build robust, transparent, and reliable AI systems while nurturing the growth of our team. If you have a strong background in AI/ML, with specific expertise in Computer Vision and Generative AI, and a keen interest in observability and system reliability, we encourage you to apply.