
8 SLIs Jobs

JobPe aggregates results for easy application access, but you actually apply on the job portal directly.

3.0 - 5.0 years

0 Lacs

hyderabad, telangana, india

On-site

There's nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems.

As a Site Reliability Engineer III at JPMorgan Chase within the Chief Technology Office team, you will solve complex and broad business problems with simple and straightforward solutions. Through code and cloud infrastructure, you will configure, maintain, monitor, and optimize applications and their associated infrastructure to independently decompose and iteratively improve on existing solutions. You are a significant contributor to your team by sharing your knowledge of end-to-end operations, availability, reliability, and scalability of your application or platform.

Job responsibilities
- Guides and assists others in the areas of building appropriate level designs and gaining consensus from peers where appropriate
- Collaborates with other software engineers and teams to design and implement deployment approaches using automated continuous integration and continuous delivery pipelines
- Collaborates with other software engineers and teams to design, develop, test, and implement availability, reliability, scalability, and solutions in their applications
- Implements infrastructure, configuration, and network as code for the applications and platforms in your remit
- Collaborates with technical experts, key stakeholders, and team members to resolve complex problems
- Understands service level indicators and utilizes service level objectives to proactively resolve issues before they impact customers
- Supports the adoption of site reliability engineering best practices within your team

Required qualifications, capabilities, and skills
- Formal training or certification on Site Reliability Engineering concepts and 3+ years applied experience
- Experience in SRE, DevOps, or application support roles, with knowledge of SLIs/SLOs, incident response, and troubleshooting.
- Familiarity with monitoring and observability tools (e.g., Grafana, Prometheus, Splunk, OpenTelemetry).
- Hands-on experience with CI/CD pipelines (Jenkins, including global libraries), infrastructure as code (Terraform), version control (Git), containerization (Docker), and orchestration (Kubernetes).
- Exposure to cloud platforms (AWS, GCP, or Azure) and automating infrastructure and deployments.
- Willingness to participate in on-call rotation and respond to production incidents.

Preferred qualifications, capabilities, and skills
- Familiarity with banking, fintech, or regulated environments.
- Participation in game days or chaos engineering.
- Interest in sharing knowledge and best practices with peers.

Posted 2 days ago


4.0 - 9.0 years

20 - 27 Lacs

bengaluru

Hybrid

We are seeking a passionate and skilled Site Reliability Engineer (SRE) to join our team. In this role, you will ensure high availability, performance, and security of our systems while proactively identifying and resolving reliability issues. You will be responsible for monitoring, troubleshooting, automation, and building resilient infrastructure that supports millions of users globally.

Key Responsibilities
- Monitor, troubleshoot, and resolve live-site issues to maintain uptime, performance, and security.
- Define and manage SLIs, SLOs, and error budgets to ensure reliable user experiences.
- Consolidate infrastructure monitoring and alerting into unified systems (e.g., Prometheus + Alertmanager) while enhancing alerts with contextual information (dashboards, runbooks, severity levels).
- Continuously improve infrastructure by upgrading and patching OS, databases, networking, and related components.
- Optimize on-call processes, lead incident response, root-cause analysis, and post-mortems.
- Build self-healing systems, automate repetitive/manual tasks, and proactively identify opportunities to improve uptime.

What You Will Bring
- Strong SRE mindset: proactive in spotting problems, performance bottlenecks, and areas for improvement.
- Hands-on expertise with observability tools and strong troubleshooting skills in distributed systems.
- Ability to work in a fast-paced, results-driven environment that demands operational excellence.
- Strong problem-solving skills with a track record of developing and implementing solutions.
- Excellent organizational and multitasking skills to handle multiple complex priorities under tight deadlines.

Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related technical field.
- 2+ years of experience managing distributed systems & web applications with high uptime requirements (10M+ users preferred).
- Proficiency in Linux and LAMP stack environments.
- Experience with observability tools (e.g., Prometheus, Grafana, New Relic, CloudWatch, ELK, Zabbix).
- Experience with Infrastructure as Code (IaC) tools (e.g., Ansible, Terraform, Terragrunt).
- Strong ownership mindset, bias for action, and ability to deliver results end-to-end.
- Excellent written and verbal communication skills.

Preferred Qualifications
- Familiarity with cloud computing and the AWS ecosystem.
- Programming experience to automate infrastructure tasks.
- Flexibility to work during off-schedule hours (evenings/weekends) if required.

Posted 1 week ago


16.0 - 18.0 years

0 Lacs

hyderabad, telangana, india

On-site

Making the World More Resilient - One Application at a Time!

At Swiss Re, our mission is to make the world more resilient. As a leading global reinsurance company, we help individuals, businesses, and societies recover from disaster and build confidence for the future. To fulfil this mission, we must ensure our own systems and operations are equally resilient. In the Property & Casualty Reinsurance division, the stability and reliability of our IT systems directly impact our ability to deliver on this promise. That's why we're looking for a Lead Reliability Architect who will champion the resilience of our application landscape - ensuring our systems are built to withstand disruption, adapt quickly, and perform reliably even in the face of the unexpected.

Key Responsibilities
As our Lead Reliability Architect, you will:
- Own and shape the reliability strategy for our Property & Casualty IT landscape, ensuring alignment with Swiss Re's broader technology and business objectives.
- Oversee the reliability and resilience characteristics of our business-critical application portfolio and drive their continuous improvement.
- Define and maintain blueprints, guidelines, and best practices for resilience, high availability, disaster recovery, and fault tolerance - ensuring they are practical, actionable, and consistently applied across all development teams.
- Work directly with application development teams to support the implementation of these blueprints and architectural principles across the whole Software Development Lifecycle.
- Define and govern the monitoring & alerting baseline for our applications, which includes defining golden signals, SLIs, and SLOs across the whole system landscape.
- Drive the adoption of the OpenTelemetry framework in our observability stack - across applications, platforms, and shared infrastructure.
- Partner closely with Operations (Run) teams to analyze operational incidents and derive actionable insights for improving system reliability and fault response capabilities.
- Act as a bridge between engineering and operations, fostering a culture of reliability, accountability, and continuous improvement.
- Mentor teams and advocate for SRE practices, ensuring a consistent understanding and application of resilience and observability standards across our engineering workforce.

About You
We are looking for a candidate with a balanced profile of deep technical expertise and strong leadership capabilities.

Professional & Technical Skills
- Overall 16+ years of experience in the technology domain.
- Well-established track record and senior-level hands-on background in software and reliability engineering with a focus on distributed systems and high-availability architectures in public cloud environments (ideally Azure).
- Deep expertise in reliability and resilience engineering, including concepts like redundancy and failover, fault tolerance and graceful degradation, circuit breakers, retry patterns, chaos engineering, and auto-healing.
- Solid experience in operating applications at scale, ideally within regulated or mission-critical environments.
- Familiarity with Google's Site Reliability Engineering (SRE) practices, especially around SLIs and SLOs, error budgets, and operational readiness.
- Strong background in monitoring, telemetry, and observability, with a focus on defining effective metrics and alerts that reduce noise and improve incident detection.
- Hands-on experience with OpenTelemetry and related observability tools (e.g., Prometheus, Grafana, Jaeger, Elastic, etc.) would be a plus.
- Experience collaborating in DevOps and hybrid cloud environments, ideally with exposure to containerized and microservices architectures.

Personal & Leadership Skills
- Strong thought leadership and influencing skills: the ability to challenge the status quo and advocate for meaningful change.
- Architectural mindset, with a structured approach to problem-solving and strong planning and design capabilities.
- High personal integrity, accountability, and a proactive approach to ownership and decision-making.
- Excellent collaboration and communication skills, able to build trusted relationships across teams, functions, and geographies.
- Team player with the ability to work across disciplines and bring people together around shared goals.
- Demonstrated ability to foster understanding between application development and operations teams - serving as a translator and facilitator between the two worlds.
- Fluent in English, both written and spoken.

Reference Code: 134808

Posted 1 week ago


5.0 - 9.0 years

0 Lacs

karnataka

On-site

The role of Engineering Manager - Site Reliability is to primarily manage, mentor, and develop a team of Site Reliability Engineers, ensuring the development of both the individual and the team as a whole is in line with organizational objectives and direction. You will be responsible for managing all activities in scope through the direction of activities, designing new products, and modifying existing designs to ensure deliverables are on time and of acceptable quality. It is crucial for you to analyze technology trends, human resource needs, and market demand to plan projects that ensure resilience in line with current demand and future ambition. Additionally, you will be expected to confer with leaders, production, key stakeholders, and marketing teams to determine engineering feasibility, cost-effectiveness, scalability, and time-to-market for new and existing products. In this role, your responsibilities will include managing people by inspiring, growing, and developing individuals through the creation of personal development plans, leveraging available learning resources, and offering stretch opportunities. You will need to ensure delivery by tracking team health metrics and KPIs, monitoring roadmap progress, identifying blockers, and resolving or escalating them. End to End System Ownership involves actively monitoring application health and performance, setting and monitoring relevant metrics, and taking action accordingly. You will also be responsible for reducing business continuity risks and bus factor by applying state-of-the-art practices and tools, and writing appropriate documentation such as runbooks and OpDocs. Technical Incident Management will require you to address and resolve live production issues, improve the overall reliability of systems through root cause analysis, and contribute to postmortem processes and logging live issues. 
Building software applications will involve utilizing relevant development languages, applying knowledge of systems, services, and tools appropriate for the business area, writing readable and reusable code, and ensuring the quality of applications through standard testing techniques and methods. As an Engineering Manager - Site Reliability, you should possess strong people management skills and experience, excellent communication and stakeholder management skills, good commercial awareness, and technical vision. You are expected to be a humble and thoughtful technology leader who leads by example and gains your teammates' respect through actions rather than title. Experience in software development, building complex and scalable solutions, and leading and managing a team of engineers in a fast-paced and complex environment is essential. Proficiency in at least one programming language (Java, C/C++, Python, Go), ability to formulate software solutions from scratch, understanding of Service-Oriented Architecture, Microservices & OOP patterns, hands-on experience in Linux administration and troubleshooting, creative problem-solving approach, practical experience in defining SLIs and SLOs, strong analytical skills, and a data-driven mindset are also required. If your application is successful, your personal data may be used for a pre-employment screening check by a third party as permitted by applicable law. The pre-employment screening may include employment history, education, and other information necessary for determining your qualifications and suitability for the position.

Posted 1 week ago


5.0 - 9.0 years

0 Lacs

pune, maharashtra

On-site

As a Site Reliability Engineer (SRE) at UBS, you will play a crucial role in ensuring the availability, performance, and resilience of our platforms in a mission-critical financial environment. Your primary responsibility will be to design, implement, and maintain highly available and fault-tolerant systems, with a focus on building and operating reliable, scalable systems in regulated industries such as banking and financial services. You will work closely with engineering, infrastructure, and security teams to build secure, observable, and automated systems, while fostering a culture of operational excellence. Your role will involve defining and monitoring Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to guarantee system reliability and customer satisfaction. Additionally, you will lead incident response, post-mortems, and root cause analysis for production issues, as well as collaborate with development teams to embed reliability into the software development lifecycle. Joining the Operating Systems and Middleware (OSM) team at UBS, you will be part of a globally distributed team that supports critical infrastructure across different time zones using a follow-the-sun support model. Operating in a collaborative Agile environment, you will have the opportunity to work alongside talented engineers who are passionate about building reliable systems and solving complex problems. We value transparency, shared responsibility, and continuous learning, empowering our engineers to take ownership, innovate, and continuously improve. The ideal candidate for this role will have proven expertise in Site Reliability Engineering, with a background in software engineering, infrastructure, or operations. You should possess hands-on experience with cloud platforms like Azure, operating systems such as Linux RHEL7+, and networking fundamentals. 
A solid understanding of networking and storage technologies, authentication and naming services, scripting and automation, as well as infrastructure as code tools is essential. Additionally, you should demonstrate a metrics- and automation-driven mindset, strong collaboration and communication skills, and a proactive, ownership-driven attitude. Desirable additions to your expertise include experience with chaos engineering, resilience testing, disaster recovery planning, financial transaction systems, real-time data pipelines, core banking platforms, CI/CD pipelines, containerization, and orchestration. UBS offers a dynamic and inclusive work environment where diversity is celebrated, and employees are supported with new challenges, growth opportunities, and flexible working options. Join us at UBS, where collaboration and individual empowerment drive our success.

Posted 2 weeks ago


3.0 - 7.0 years

0 Lacs

karnataka

On-site

As a Support Engineer with experience in maintaining and supporting solutions in a cloud-based environment (GCP or AWS), you will be responsible for ensuring the smooth operation of monitoring tools such as ELK, Dynatrace, CloudWatch, Cloud Logging, Cloud Monitoring, and New Relic. Your primary focus will be to implement and maintain monitoring and self-healing strategies to proactively prevent production incidents. You will also be required to conduct root cause analysis of production issues and design on-call and escalation processes. In addition, you will participate in the design and implementation of serviceability solutions for monitoring and alerting, as well as debugging production issues across services and levels of the stack. Collaborating closely with the platform engineering team, you will help establish and improve production support approaches and participate in defining SLIs and SLOs to demonstrate efficiency and value to business partners. Your responsibilities will also include interacting with and testing APIs, participating in out-of-business-hours deployments and support on rotation with team members, and being familiar with agile development techniques. L3 Support experience is considered an asset for this role. In return, we offer competitive salaries, comprehensive health benefits, flexible work hours, remote work options, professional development and training opportunities, and a supportive and inclusive work environment.

Posted 1 month ago


5.0 - 9.0 years

0 Lacs

haryana

On-site

Cvent is a global leader in meeting, event, travel, and hospitality technology, with a workforce of over 4000 employees worldwide. Our cloud-based solutions cater to more than 28,000 customers in over 100 countries, including 80% of the Fortune 100 companies. As a Lead - Site Reliability Engineer at Cvent, you will leverage your expertise in development and operations to identify and address issues, develop universal solutions, and provide guidance to junior staff. Your responsibilities will also include enabling and supporting multi-disciplinary teams, resolving complex development and automation challenges, promoting Cvent's standards and best practices, ensuring the scalability and performance of our product suite, and collaborating with various teams to establish effective monitoring and alerting strategies.

Key Responsibilities:
- Utilize advanced knowledge in development and operations to prioritize and resolve issues
- Mentor and support junior staff members
- Empower and collaborate with multi-disciplinary teams across different applications and locations
- Address complex development, automation, and business process challenges
- Advocate for Cvent standards and best practices
- Ensure product scalability, performance, and resilience
- Establish monitoring and alerting strategies for new applications
- Share best practices with acquisition's DevOps team
- Develop automation solutions for deployment targeting multiple environments
- Assist in achieving zero-downtime deployments for legacy code base
- Contribute to Open Source projects
- Automate tasks to streamline operations

Requirements:
- Knowledge of SDLC methodologies, preferably Agile
- Proficiency in Java, Python, or Ruby
- Experience with managing AWS services
- Familiarity with configuration management tools like Chef, Puppet, or Ansible
- Strong Windows and Linux administration skills
- Working knowledge of APM, monitoring, and logging tools
- Experience with 3-tier application stacks and incident response
- Familiarity with build tools such as Jenkins, CircleCI, etc.
- Exposure to containerization concepts like Docker, ECS, EKS, Kubernetes
- Experience with NoSQL databases like MongoDB, Couchbase, Postgres, etc.
- Self-motivated with the ability to work independently

Preferred Skills:
- Understanding of F5 load balancing concepts
- Basic knowledge of observability, SLIs/SLOs, and message queues
- Familiarity with basic networking concepts
- Experience with package managers like Nexus, Artifactory, etc.
- Strong communication and people management skills

Join us at Cvent to be part of a dynamic team that is driving innovation and excellence in the world of event management technology.

Posted 1 month ago


15.0 - 19.0 years

0 Lacs

haryana

On-site

As the Vice President of DevOps & SRE, you will hold a senior leadership position with the primary responsibility of driving platform reliability, secure operations, and DevOps excellence throughout the enterprise. Your role will involve integrating site reliability engineering practices with scalable DevOps automation and maintaining a robust cybersecurity posture. Leading high-performing teams, defining technology strategy, managing infrastructure, and safeguarding systems and data to support business growth and digital innovation will be key aspects of your role. You will be expected to lead enterprise-wide DevOps adoption and continuous delivery transformation, implementing and optimizing CI/CD pipelines, infrastructure-as-code (IaC), and cloud-native architectures. Championing automation in deployment, monitoring, and infrastructure provisioning will be essential, along with experience in containerization (Kubernetes, Docker), service mesh, and serverless environments. Facilitating collaboration between development, operations, and QA for rapid and reliable releases will also be a critical part of your responsibilities. Establishing and leading the Site Reliability Engineering (SRE) function to ensure system reliability, scalability, and performance will be another key aspect of your role. You will define and monitor SLAs, SLOs, and SLIs for critical applications and services, drive incident management, root cause analysis, and foster a postmortem culture. Developing and deploying observability strategies using tools like Prometheus, Grafana, Zabbix, or enterprise tools such as New Relic, Dynatrace, or Splunk will also be within your purview. In terms of leadership and strategic alignment, you will build and mentor cross-functional teams across DevOps and SRE, partnering with engineering, product, and business leaders to align technical initiatives with organizational goals. 
Managing departmental budgets, tools, and vendor relationships, as well as reporting on KPIs, operational health, security posture, and risk to the executive leadership team will also be part of your responsibilities. To qualify for this role, you must hold a Bachelor's or Master's in Computer Science, Engineering, or a related field, along with at least 15 years of experience in IT/engineering, including a minimum of 5 years in leadership roles. Proven expertise in implementing DevOps, SRE, and security practices at scale, as well as hands-on experience with AWS, Azure, or GCP, CI/CD tools, and SRE observability platforms, are essential requirements for this position.

Posted 1 month ago
