Site Reliability Engineer on AI Platform, Director

Morgan Stanley

6 years

1 - 5 Lacs

bengaluru

Posted: 17 hours ago | Platform: Glassdoor

Skills Required

reliability ai support ml engineering data security training inference automation compliance constraints software combination kubernetes aws azure api development rest communication technology efficiency consistency controls governance power management stability model design code provisioning storage network orchestration metrics analysis remediation planning scaling strategies scheduling forecasting deployment rollback integration backup restore documentation programming scripting python java containerization docker monitoring logging datadog terraform helm ansible networking tcp ip routing tuning drive audit collaboration cortex architecture pipeline kafka spark sql redis service diversity recruiting

Work Mode

On-site

Job Type

Part Time

Job Description

Site Reliability Engineer on AI Platform, Director

We're seeking someone to join our AI Platform team as a Site Reliability Engineer to help support, scale, and harden the infrastructure that powers our AI/ML systems. You will collaborate closely with infrastructure engineering, cloud engineering, data engineering, and security teams to ensure availability, reliability, performance, and security of production AI workloads (training, inference, data pipelines) in a regulated, high-stakes financial environment. As an SRE on the AI platform, you will bring deep operations, automation, and systems engineering skills to enable our models and pipelines to run reliably at scale, while balancing cost, security, and compliance constraints.

The ideal candidate will have strong hands-on experience supporting software platforms built on any combination of the following: Kubernetes, cloud (AWS, Azure, and/or Google), API-based development, REST frameworks, data engineering, and large-scale API gateway environments. Knowledge of AI/ML and hands-on experience implementing solutions using Generative AI are also preferable. The candidate will have great communication skills, a team-based mentality, and a strong passion for using AI to increase productivity and to help generate new ideas for product and technical improvements.

Our mission is to develop a firmwide Artificial Intelligence (AI) Development Platform that aligns with the firm's Technology principles; drives efficiency, consistency, controls, security, and strong governance; and promotes innovation, enabling teams to build applications that leverage AI capabilities and accelerate the adoption of AI across our businesses.

In the Technology division, we leverage innovation to build the connections and capabilities that power our Firm, enabling our clients and colleagues to redefine markets and shape the future of our communities. This is an SRE on the AI platform position at Director level, which is part of the Infrastructure Production Management & Reliability Engineering job family that maintains the stability and reliability of the organization's infrastructure systems, ensuring optimal performance and availability to support business operations.

Morgan Stanley is an industry leader in financial services, known for mobilizing capital to help governments, corporations, institutions, and individuals around the world achieve their financial goals.

What you'll do in the role:

Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving).
Design and build automation for core platform capabilities, reducing manual toil.
Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards.
Build and maintain Grafana dashboards for the metrics scraped by Prometheus.
Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation.
Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting.
Optimize cost vs. performance tradeoffs in large-scale compute environments.
Harden systems for security, compliance, auditability, and data governance.
Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems.
Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms.
Maintain runbooks, operational playbooks, documentation, and training materials.
Participate in on-call rotations and respond to production incidents 24/7 as needed.
Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability.

What you'll bring to the role:

At least 6 years’ relevant experience would generally be expected to develop the skills required for this role.
Production experience in SRE / Infrastructure / ops for large-scale systems
Strong programming/scripting skills (Python, Go, Java, or equivalent)
Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, PagerDuty, etc.)

Nice to have

Understanding of SRE techniques.
Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
Solid experience in capacity planning, performance tuning, scaling, and incident response
Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
Excellent communication, documentation, and cross-team collaboration skills
Proven track record of reducing operational toil via automation
Proficiency with open-source telemetry and observability tools including Grafana, Loki, Prometheus, and Cortex.
Good knowledge of microservice-based architecture and industry standards for both public and private cloud.
Knowledge of data pipeline technologies (Kafka, Spark, Flink, etc.)
Good knowledge of various data stores (SQL, Redis, Kafka, Snowflake, etc.) for cloud application storage.
Experience working with Generative AI development, embeddings, and fine-tuning of Generative AI models
Experience in high-performance computing (HPC), distributed GPU cluster scheduling (e.g. Slurm, Kubernetes GPU scheduling)
Understanding of ModelOps / MLOps / LLMOps.
Experience with chaos engineering, canary deployments, blue/green rollouts

WHAT YOU CAN EXPECT FROM MORGAN STANLEY:

We are committed to maintaining the first-class service and high standard of excellence that have defined Morgan Stanley for over 89 years. Our values - putting clients first, doing the right thing, leading with exceptional ideas, committing to diversity and inclusion, and giving back - aren’t just beliefs, they guide the decisions we make every day to do what's best for our clients, communities and more than 80,000 employees in 1,200 offices across 42 countries. At Morgan Stanley, you’ll find an opportunity to work alongside the best and the brightest, in an environment where you are supported and empowered. Our teams are relentless collaborators and creative thinkers, fueled by their diverse backgrounds and experiences. We are proud to support our employees and their families at every point along their work-life journey, offering some of the most attractive and comprehensive employee benefits and perks in the industry. There’s also ample opportunity to move about the business for those who show passion and grit in their work.

To learn more about our offices across the globe, please copy and paste https://www.morganstanley.com/about-us/global-offices into your browser.

Morgan Stanley is an equal opportunities employer. We work to provide a supportive and inclusive environment where all individuals can maximize their full potential. Our skilled and creative workforce is comprised of individuals drawn from a broad cross section of the global communities in which we operate and who reflect a variety of backgrounds, talents, perspectives, and experiences. Our strong commitment to a culture of inclusion is evident through our constant focus on recruiting, developing, and advancing individuals based on their skills and talents.

Morgan Stanley

Financial Services

New York NY
