5 - 10 years
7 - 12 Lacs
Bengaluru
Work from Office
About the Job: The Data Development Insights & Strategy (DDIS) team is seeking a Senior AI Engineer to design, scale, and maintain our AI model lifecycle framework within Red Hat's OpenShift AI and RHEL AI infrastructures. As a Senior AI Engineer, you will contribute to managing and optimizing large-scale AI models, collaborating with cross-functional teams to ensure high availability, continuous monitoring, and efficient integration of new model updates, while driving innovation through emerging AI technologies.

In this role, you will leverage your expertise in AI, MLOps/LLMOps, cloud computing, and distributed systems to enhance model performance, scalability, and operational efficiency. You will work in close collaboration with the Products & Global Engineering (P&GE) and IT AI Infra teams, ensuring seamless model deployment and maintenance in a secure, high-performance environment. This is an exciting opportunity to drive AI model advancements and contribute to the operational success of mission-critical applications.

What you will do:
- Develop and maintain the lifecycle framework for AI models within Red Hat's OpenShift and RHEL AI infrastructure, ensuring security, scalability, and efficiency throughout the process.
- Design, implement, and optimize CI/CD pipelines and automation for deploying AI models at scale using tools like Git, Jenkins, and Terraform, ensuring zero disruption during updates and integration.
- Continuously monitor and improve model performance using tools such as OpenLLMetry, Splunk, and Catchpoint, while responding to performance degradation and model-related issues.
- Work closely with cross-functional teams, including the Products & Global Engineering (P&GE) and IT AI Infra teams, to seamlessly integrate new models or model updates into production systems with minimal downtime and disruption.
- Enable a structured process for handling feature requests (RFEs), prioritization, and resolution, ensuring transparent communication and timely resolution of model issues.
- Assist in fine-tuning and enhancing large-scale models, including foundational models like Mistral and Llama, while ensuring computational resources are optimally allocated (GPU management, cost management strategies).
- Drive performance improvements, model updates, and releases on a quarterly basis, ensuring that all RFEs are processed and resolved within 30 days.
- Collaborate with stakeholders to align AI model updates with evolving business needs, data changes, and emerging technologies.
- Contribute to mentoring junior engineers, fostering a collaborative and innovative environment.

What you will bring:
- A bachelor's or master's degree in Computer Science, Data Science, Machine Learning, or a related technical field. Hands-on experience that demonstrates your ability and interest in AI engineering and MLOps will be considered in lieu of formal degree requirements.
- Experience programming in Python, with a strong understanding of machine learning frameworks and tools.
- Experience working with cloud platforms such as AWS, GCP, or Azure, and familiarity with deploying and maintaining AI models at scale in these environments.

As a Senior AI Engineer, you will be most successful if you have experience working with large-scale distributed systems and infrastructure, especially in production environments where AI and LLM models are deployed and maintained. You should be comfortable troubleshooting, optimizing, and automating workflows related to AI model deployment, monitoring, and lifecycle management. We value a strong ability to debug and optimize model performance and to automate manual tasks wherever possible.
Additionally, you should be well versed in managing AI model infrastructure using containerization technologies like Kubernetes and OpenShift, and have hands-on experience with performance monitoring tools (e.g., OpenLLMetry, Splunk, Catchpoint). We also expect a solid understanding of GPU-based computing and resource optimization, with a background in high-performance computing (e.g., CUDA, vLLM, MIG, TGI, TEI). Experience working in Agile development environments and collaborating within cross-functional teams to solve complex problems and drive AI model updates will be key to your success in this role.

Desired skills:
- 5+ years of experience in AI or MLOps, with a focus on deploying, maintaining, and optimizing large-scale AI models in production.
- Expertise in deploying and managing models in cloud environments (AWS, GCP, Azure) and on containerized platforms like OpenShift or Kubernetes.
- Familiarity with large-scale distributed systems and experience managing their performance and scalability.
- Experience with performance monitoring and analysis tools such as OpenLLMetry, Prometheus, or Splunk.
- Deep understanding of GPU-based deployment strategies and computational cost management.
- Strong experience in managing model lifecycle processes, from training to deployment, monitoring, and updates.
- Ability to mentor junior engineers and promote knowledge sharing across teams.
- Excellent communication skills, both verbal and written, with the ability to engage with technical and non-technical stakeholders.
- A passion for innovation and continuous learning in the rapidly evolving field of AI and machine learning.
Posted 2 months ago
2 - 7 years
4 - 9 Lacs
Maharashtra
Work from Office
Description: Mumbai/Bangalore. Generic JD.

What will SREs do?
- Provide hands-on, 24x7 SRE support, including incident management, problem management, root cause analysis, monitoring, alerting, maintenance of infrastructure, and compliance.
- Track, audit, monitor, and implement on technical work streams.
- Act as portfolio SME (Subject Matter Expert): understand and document common components, core functionalities, and infrastructure of supported applications.
- Be an escalation point in the on-call rotation, and support our maintenance, scheduled work, support, and release deployment requirements.
- Lead incident management and problem management for applications in scope, including ownership and fulfillment of RCA action items.
- Focus on continuous improvement and technical standards.
- Drive improvements in productivity, monitoring, tooling, and best practices.
- Manage technology currency (server patching, certificate renewal, compliance, etc.) with a keen eye on automation opportunities.
- Drive best-in-class technical solutions by closely tracking industry-leading solutions and applying them to the RBC environment and needs.
- Leverage the value in unit, department, and enterprise-wide teams to develop better solutions and achieve a cross-enterprise mindset.

Engineering
- Develop SRE solutions (monitoring and alerting, machine learning anomaly detection, self-healing, and reliability testing).
- Apply design thinking and an agile mindset in working with SREs, Scrum Masters, and Incident Leads.
- Contribute to and leverage best practices in SRE.
- Simplify development by building repeatable solutions to manual tasks.
- Support the unit's goals to adopt automation solutions for applications in scope.

Production Support
- Perform a production support role, including off-hours support and rotational on-call support, compensated accordingly with overtime pay, lieu time, and an on-call allowance.
- Assist in incident management and problem management for applications in scope.
- Continuously evaluate what went well, what went wrong, and what can be done to improve and prevent issues in the future.
- Maintain technology currency (perform server patching, certificate renewal, etc.) with a keen eye on automation opportunities.
- Ensure availability and uptime of applications in scope, as per service level objectives.
- Ensure compliance of all systems and applications in scope, including maintaining segregation of duties.

Technical Consultation
- Support initiatives outside of application or squad-level scope.
- Consult on products built to other teams in RBPT and the enterprise.

Innovation and Learning
- Stay abreast of technology change and learn constantly, through official training assignments and self-assigned learning.
- Provide demos to the team at large of new technology findings.

Advanced knowledge of the following SRE practices and technologies, with 3-5 years of experience in a related field:
- Python, YAML, Shell scripting
- Azure, Linux
- Dynatrace, Prometheus, PagerDuty, Moog, Splunk, Elastic, Azure Monitor
- Chaos Engineering
- MQ, Kafka
- Performing a production support role, including off-hours support
- In-depth hands-on experience in a variety of SRE tools (Ansible, Azure Automation, Catchpoint)

Additional Details
- Named Job Posting? (if Yes, needs to be approved by SCSC): No
- Global Grade: C
- Level: To Be Defined
- Remote work possibility: Yes
- Global Role Family: To be defined
- Local Role Name: To be defined
- Local Skills: reliability metrics; reliability controls
- Languages Required: English
- Role Rarity: To Be Defined
Posted 2 months ago
10 - 14 years
12 - 16 Lacs
Bengaluru
Work from Office
About the Job: The Data Development Insights & Strategy (DDIS) team is seeking a Principal AI Engineer to lead the design, development, and optimization of AI model lifecycle frameworks within Red Hat's OpenShift AI and RHEL AI infrastructures. As a Principal AI Engineer, you will play a key leadership role in overseeing the strategic direction of AI model deployment and lifecycle management, collaborating across teams to ensure seamless integration, scalability, and performance of mission-critical AI models.

In this role, you will drive the development of innovative solutions for the AI model lifecycle, applying your deep expertise in MLOps/LLMOps, cloud computing, and distributed systems. You will be a technical leader who mentors and guides teams in collaboration with Products & Global Engineering (P&GE) and IT AI Infra to ensure efficient model deployment and maintenance in secure, scalable environments. This is an exciting opportunity for someone who wants to take a leadership role in shaping the strategic direction of Red Hat's AI initiatives and driving the optimization of AI models and technologies.

What you will do:
- Lead the design and development of scalable, efficient, and secure AI model lifecycle frameworks within Red Hat's OpenShift and RHEL AI infrastructures, ensuring models are deployed and maintained with minimal disruption and optimal performance.
- Define and implement the strategy for optimizing AI model deployment, scaling, and integration across hybrid cloud environments (AWS, GCP, Azure), working with cross-functional teams to ensure consistently high availability and operational excellence.
- Spearhead the creation and optimization of CI/CD pipelines and automation for AI model deployments, leveraging tools such as Git, Jenkins, and Terraform, ensuring zero disruption during updates and integration.
- Champion the use of advanced monitoring tools (e.g., OpenLLMetry, Splunk, Catchpoint) to monitor and optimize model performance, responding to issues and leading the troubleshooting of complex problems related to AI and LLM models.
- Lead cross-functional collaboration with the Products & Global Engineering (P&GE) and IT AI Infra teams to ensure seamless integration of new models or model updates into production systems, adhering to best practices and minimizing downtime.
- Define and oversee the structured process for handling feature requests (RFEs), prioritization, and resolution, ensuring transparency and timely delivery of updates and enhancements.
- Lead and influence the adoption of new AI technologies, tools, and frameworks to ensure that Red Hat remains at the forefront of AI and machine learning advancements.
- Drive performance improvements, model updates, and releases on a quarterly basis, ensuring RFEs are processed and resolved within agreed-upon timeframes and driving business adoption.
- Oversee the fine-tuning and enhancement of large-scale models, including foundational models like Mistral and Llama, ensuring the optimal allocation of computational resources (GPU management, cost management strategies).
- Lead a team of engineers, mentoring junior and senior talent, fostering an environment of collaboration and continuous learning, and driving the technical growth of the team.
- Contribute to strategic discussions with leadership, influencing the direction of AI initiatives and ensuring alignment with broader business goals and technological advancements.

What you will bring:
A bachelor's or master's degree in Computer Science, Data Science, Machine Learning, or a related technical field is required. Hands-on experience and demonstrated leadership in AI engineering and MLOps will be considered in lieu of formal degree requirements.
10+ years of experience in AI or MLOps, with at least 3 years in a technical leadership role managing the deployment, optimization, and lifecycle of large-scale AI models. You should have deep expertise in cloud platforms (AWS, GCP, Azure) and containerized environments (OpenShift, Kubernetes), with a proven track record in scaling and managing AI infrastructure in production. You should also have experience optimizing large-scale distributed AI systems, automating deployment pipelines using CI/CD tools like Git, Jenkins, and Terraform, and leading performance monitoring using tools such as OpenLLMetry, Splunk, or Catchpoint. A strong background in GPU-based computing and resource optimization (e.g., CUDA, MIG, vLLM), along with comfort in high-performance computing environments, is expected.

Your leadership skills will be key, as you will mentor and guide engineers while fostering a collaborative, high-performance culture. You should also have a demonstrated ability to drive innovation, solve complex technical challenges, and work cross-functionally with teams to deliver AI model updates that align with evolving business needs. A solid understanding of Agile development processes and excellent communication skills are essential for this role. Lastly, a passion for AI, continuous learning, and staying ahead of industry trends will be vital to your success at Red Hat.

Desired skills:
- 10+ years of experience in AI, MLOps, or related fields, with a substantial portion of that time spent in technical leadership roles driving the strategic direction of AI infrastructure and model lifecycle management.
- Extensive experience with foundational models such as Mistral, Llama, and GPT, and their deployment, tuning, and scaling in production environments.
- Proven ability to influence and drive AI and MLOps roadmaps, shaping technical strategy and execution in collaboration with senior leadership.
- In-depth experience with performance monitoring, resource optimization, and troubleshooting of AI models in complex distributed environments.
- Strong background in high-performance distributed systems and container orchestration, particularly for AI/ML workloads.
- Proven experience in guiding and mentoring engineering teams to build high-performance capabilities, fostering a culture of continuous improvement and technical innovation.

As a Principal AI Engineer at Red Hat, you will have the opportunity to drive major strategic AI initiatives, influence the future of AI infrastructure, and lead a high-performing engineering team. This is a unique opportunity for a seasoned AI professional to shape the future of AI model lifecycle management at scale. If you're ready to take on a technical leadership role with a high level of responsibility and impact, we encourage you to apply.
Posted 3 months ago