15 SLO Jobs

JobPe aggregates results for easy application access, but you actually apply on the job portal directly.

8.0 - 10.0 years

0 Lacs

bengaluru, karnataka, india

On-site

The NVIDIA GPU Cloud (NGC) organization is looking for software engineering talent to build NVIDIA's accelerated compute cloud services. These services include software to manage hardware and network provisioning to build a multi-tenant infrastructure. As a software engineer, you will work with other software engineers, product architects, and product managers as a collaborative team to deliver end-to-end software solutions to manage complex cloud infrastructure deployments. You will write services and software that align with the broad architectural vision for the NVIDIA Cloud Platform, working with other teams to develop a robust and scalable system. You own your code - from development to commit to test to production. We expect you to be passionate about code quality, testing, deployment efficiency/simplicity and bringing amazing products to market.

What you will be doing:
- Work with NVIDIA internal customers.
- Design and build scalable software systems to manage NVIDIA's cloud infrastructure.
- Build network and systems automation software for managing a multi-tenant cloud infrastructure.
- Participate in open-source communities of software we leverage and build.
- Present to internal stakeholders and NVIDIA leadership on roadmaps, vision, and demos.

What we need to see:
- 8+ years of experience with designing and building distributed software systems.
- BS/MS degree in Computer Science or related areas (or equivalent experience).
- Demonstrated ability to write code in a mainstream systems programming language such as C, C++, Golang, or Rust.
- Demonstrated ability to design and implement maintainable APIs for consumers.
- Practical experience with asynchronous programming, type safety, threading models, state machines and data structures.
- Background in data persistence (SQL or similar).
- Understanding of secure communication protocols (mutual-TLS, IPsec, or similar).

Ways to stand out from the crowd:
- Experience in a Hyperscale Cloud Service Provider (public facing or not).
- Understanding of networking protocols such as IP, IPv6, BGP, HTTP, ICMP, and tunneling protocols (VXLAN, Geneve, FoU, GRE).
- Background with host management systems (DHCP, Redfish, UEFI) and host security services such as TPM, TXT, and SecureBoot.
- Kubernetes and/or distributed task scheduling.
- Knowledge of SRE principles (observability, SLOs, logging, etc.).

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence. NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and talented people in the world working for us. If you're creative and passionate about developing cloud services we want to hear from you!

Posted 2 days ago

3.0 - 5.0 years

0 Lacs

hyderabad, telangana, india

On-site

There's nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems. As a Site Reliability Engineer III at JPMorgan Chase within the Chief Technology Office team, you will solve complex and broad business problems with simple and straightforward solutions. Through code and cloud infrastructure, you will configure, maintain, monitor, and optimize applications and their associated infrastructure to independently decompose and iteratively improve on existing solutions. You are a significant contributor to your team by sharing your knowledge of end-to-end operations, availability, reliability, and scalability of your application or platform. Job responsibilities Guides and assists others in the areas of building appropriate level designs and gaining consensus from peers where appropriate Collaborates with other software engineers and teams to design and implement deployment approaches using automated continuous integration and continuous delivery pipelines Collaborates with other software engineers and teams to design, develop, test, and implement availability, reliability, scalability, and solutions in their applications Implements infrastructure, configuration, and network as code for the applications and platforms in your remit Collaborates with technical experts, key stakeholders, and team members to resolve complex problems Understands service level indicators and utilizes service level objectives to proactively resolve issues before they impact customers Supports the adoption of site reliability engineering best practices within your team Required qualifications, capabilities, and skills Formal training or certification on Site Reliability Engineering concepts and 3+ years applied experience Experience in SRE, DevOps, or application support roles, with knowledge of SLIs/SLOs, incident response, and troubleshooting. 
Familiarity with monitoring and observability tools (e.g., Grafana, Prometheus, Splunk, OpenTelemetry). Hands-on experience with CI/CD pipelines (Jenkins, including global libraries), infrastructure as code (Terraform), version control (Git), containerization (Docker), and orchestration (Kubernetes). Exposure to cloud platforms (AWS, GCP, or Azure) and automating infrastructure and deployments. Willingness to participate in on-call rotation and respond to production incidents. Preferred qualifications, capabilities, and skills Familiarity with banking, fintech, or regulated environments. Participation in game days or chaos engineering. Interest in sharing knowledge and best practices with peers.

Posted 2 days ago

4.0 - 9.0 years

20 - 27 Lacs

bengaluru

Hybrid

We are seeking a passionate and skilled Site Reliability Engineer (SRE) to join our team. In this role, you will ensure high availability, performance, and security of our systems while proactively identifying and resolving reliability issues. You will be responsible for monitoring, troubleshooting, automation, and building resilient infrastructure that supports millions of users globally. Key Responsibilities Monitor, troubleshoot, and resolve live-site issues to maintain uptime, performance, and security. Define and manage SLIs, SLOs, and error budgets to ensure reliable user experiences. Consolidate infrastructure monitoring and alerting into unified systems (e.g., Prometheus + Alertmanager) while enhancing alerts with contextual information (dashboards, runbooks, severity levels). Continuously improve infrastructure by upgrading and patching OS, databases, networking, and related components. Optimize on-call processes, lead incident response, root-cause analysis, and post-mortems. Build self-healing systems, automate repetitive/manual tasks, and proactively identify opportunities to improve uptime. What You Will Bring Strong SRE mindset: proactive in spotting problems, performance bottlenecks, and areas for improvement. Hands-on expertise with observability tools and strong troubleshooting skills in distributed systems. Ability to work in a fast-paced, results-driven environment that demands operational excellence. Strong problem-solving skills with a track record of developing and implementing solutions. Excellent organizational and multitasking skills to handle multiple complex priorities under tight deadlines. Requirements Bachelor’s degree in Computer Science, Engineering, or a related technical field. 2+ years of experience managing distributed systems & web applications with high uptime requirements (10M+ users preferred). Proficiency in Linux and LAMP stack environments.
Experience with observability tools (e.g., Prometheus, Grafana, New Relic, CloudWatch, ELK, Zabbix). Experience with Infrastructure as Code (IaC) tools (e.g., Ansible, Terraform, Terragrunt). Strong ownership mindset, bias for action, and ability to deliver results end-to-end. Excellent written and verbal communication skills. Preferred Qualifications Familiarity with cloud computing and the AWS ecosystem. Programming experience to automate infrastructure tasks. Flexibility to work during off-schedule hours (evenings/weekends) if required.

Posted 1 week ago

16.0 - 18.0 years

0 Lacs

hyderabad, telangana, india

On-site

Making the World More Resilient - One Application at a Time! At Swiss Re, our mission is to make the world more resilient. As a leading global reinsurance company, we help individuals, businesses, and societies recover from disaster and build confidence for the future. To fulfil this mission, we must ensure our own systems and operations are equally resilient. In the Property & Casualty Reinsurance division, the stability and reliability of our IT systems directly impact our ability to deliver on this promise. That's why we're looking for a Lead Reliability Architect who will champion the resilience of our application landscape - ensuring our systems are built to withstand disruption, adapt quickly, and perform reliably even in the face of the unexpected. Key Responsibilities As our Lead Reliability Architect, you will: Own and shape the reliability strategy for our Property & Casualty IT landscape, ensuring alignment with Swiss Re's broader technology and business objectives. Oversee the reliability and resilience characteristics of our business-critical application portfolio and drive their continuous improvement. Define and maintain blueprints, guidelines, and best practices for resilience, high availability, disaster recovery, and fault tolerance - ensuring they are practical, actionable, and consistently applied across all development teams. Work directly with application development teams to support the implementation of these blueprints and architectural principles across the whole Software Development Lifecycle. Define and govern the monitoring & alerting baseline for our applications, which includes defining golden signals, SLIs, and SLOs across the whole system landscape. Drive the adoption of the OpenTelemetry framework in our observability stack - across applications, platforms, and shared infrastructure.
Partner closely with Operations (Run) teams to analyze operational incidents and derive actionable insights for improving system reliability and fault response capabilities. Act as a bridge between engineering and operations, fostering a culture of reliability, accountability, and continuous improvement. Mentor teams and advocate for SRE practices, ensuring a consistent understanding and application of resilience and observability standards across our engineering workforce. About You We are looking for a candidate with a balanced profile of deep technical expertise and strong leadership capabilities. Professional & Technical Skills Overall 16+ years of experience in the technology domain. Well-established track record and senior-level hands-on background in software and reliability engineering with a focus on distributed systems and high-availability architectures in public cloud environments (ideally Azure). Deep expertise in reliability and resilience engineering, including concepts like redundancy and failover, fault tolerance and graceful degradation, circuit breakers, retry patterns, chaos engineering, and auto-healing. Solid experience in operating applications at scale, ideally within regulated or mission-critical environments. Familiarity with Google's Site Reliability Engineering (SRE) practices, especially around SLIs and SLOs, error budgets, and operational readiness. Strong background in monitoring, telemetry, and observability, with a focus on defining effective metrics and alerts that reduce noise and improve incident detection. Hands-on experience with OpenTelemetry and related observability tools (e.g., Prometheus, Grafana, Jaeger, Elastic) would be a plus. Experience collaborating in DevOps and hybrid cloud environments, ideally with exposure to containerized and microservices architectures. Personal & Leadership Skills Strong thought leadership and influencing skills: ability to challenge the status quo and advocate for meaningful change.
Architectural mindset, with a structured approach to problem-solving and strong planning and design capabilities. High personal integrity, accountability, and a proactive approach to ownership and decision-making. Excellent collaboration and communication skills, able to build trusted relationships across teams, functions, and geographies. Team player with the ability to work across disciplines and bring people together around shared goals. Demonstrated ability to foster understanding between application development and operations teams - serving as a translator and facilitator between the two worlds. Fluent in English, both written and spoken. Reference Code: 134808

Posted 1 week ago

5.0 - 9.0 years

0 Lacs

karnataka

On-site

The role of Engineering Manager - Site Reliability is to primarily manage, mentor, and develop a team of Site Reliability Engineers, ensuring the development of both the individual and the team as a whole are in line with organizational objectives and direction. You will be responsible for managing all activities in scope through the direction of activities, designing new products, and modifying existing designs to ensure deliverables are on time and of acceptable quality. It is crucial for you to analyze technology trends, human resource needs, and market demand to plan projects that ensure resilience in line with current demand and future ambition. Additionally, you will be expected to confer with leaders, production, key stakeholders, and marketing teams to determine engineering feasibility, cost-effectiveness, scalability, and time-to-market for new and existing products. In this role, your responsibilities will include managing people by inspiring, growing, and developing individuals through the creation of personal development plans, leveraging available learning resources, and offering stretch opportunities. You will need to ensure delivery by tracking team health metrics and KPIs, monitoring roadmap progress, identifying blockers, and resolving or escalating them. End to End System Ownership involves actively monitoring application health and performance, setting and monitoring relevant metrics, and taking action accordingly. You will also be responsible for reducing business continuity risks and bus factor by applying state-of-the-art practices and tools, and writing appropriate documentation such as runbooks and OpDocs. Technical Incident Management will require you to address and resolve live production issues, improve the overall reliability of systems through root cause analysis, and contribute to postmortem processes and logging live issues. 
Building software applications will involve utilizing relevant development languages, applying knowledge of systems, services, and tools appropriate for the business area, writing readable and reusable code, and ensuring the quality of applications through standard testing techniques and methods. As an Engineering Manager - Site Reliability, you should possess strong people management skills and experience, excellent communication and stakeholder management skills, good commercial awareness, and technical vision. You are expected to be a humble and thoughtful technology leader who leads by example and gains your teammates' respect through actions rather than title. Experience in software development, building complex and scalable solutions, and leading and managing a team of engineers in a fast-paced and complex environment is essential. Proficiency in at least one programming language (Java, C/C++, Python, Go), ability to formulate software solutions from scratch, understanding of Service-Oriented Architecture, Microservices & OOP patterns, hands-on experience in Linux administration and troubleshooting, creative problem-solving approach, practical experience in defining SLIs and SLOs, strong analytical skills, and a data-driven mindset are also required. If your application is successful, your personal data may be used for a pre-employment screening check by a third party as permitted by applicable law. The pre-employment screening may include employment history, education, and other information necessary for determining your qualifications and suitability for the position.

Posted 1 week ago

5.0 - 7.0 years

0 Lacs

bengaluru, karnataka, india

On-site

Senior Software Engineer (TypeScript Developer) Position Overview Job Title: Senior Software Engineer (TypeScript Developer) Corporate Title: AVP Location: Bangalore, India Role Description You will be joining the TDI Engineering Platforms and Practice group as a full stack developer working on our target state secure pipelines and control automation stack. The pipeline is a key component in providing a frictionless software delivery experience for our customers and will be used by the entire organization. You will be responsible for designing, building and supporting a variety of automation including GitHub Actions and Workflows and backend processes (Java/TypeScript), ensuring the highest standards of compliance without hindering the pace of delivery of our customer teams. This is a rare opportunity to help shape the future technology and culture of our firm. What we'll offer you As part of our flexible scheme, here are just some of the benefits that you'll enjoy Best in class leave policy Gender neutral parental leaves 100% reimbursement under childcare assistance benefit (gender neutral) Sponsorship for Industry relevant certifications and education Employee Assistance Program for you and your family members Comprehensive Hospitalization Insurance for you and your dependents Accident and Term life Insurance Complimentary Health screening for 35 yrs. and above Your key responsibilities Building secure and reusable CICD components to provide provenance and governance around our SDLC practice, ensuring high quality compliance Integrate with existing developer tooling to gather information and automate Ensuring the highest standards in security and supply chain integrity in-line with NIST, SLSA and other standards Direct customer engagement to gather requirements and understand the disparate ways teams build software today Developing supporting materials (software, training materials, workshops) to facilitate adoption Continuously measure the success of our solutions via a data driven approach, feedback and continuous improvement Your skills and experience Experienced full stack developer (Java/JVM/TypeScript), likely 5+ years in industry Extensive DevOps experience including CICD, SLI/SLOs, error budgets et al Extensive automation experience including GitHub Actions, TFE, scripting such as Ansible or similar Experience of varied orchestration technologies such as TeamCity, Jenkins and Cloud ready tools like ArgoCD and Tekton a plus Cloud (K8s and/or GCP) expertise - training can be provided Understanding of security concerns and frameworks such as SLSA and ensuring provenance of the SBOM a plus Proven communication and influencing skills, experience coaching and mentoring a plus How we'll support you Training and development to help you excel in your career Coaching and support from experts in your team A culture of continuous learning to aid progression A range of flexible benefits that you can tailor to suit your needs About us and our teams Please visit our company website for further information: We strive for a culture in which we are empowered to excel together every day. This includes acting responsibly, thinking commercially, taking initiative and working collaboratively. Together we share and celebrate the successes of our people. Together we are Deutsche Bank Group. We welcome applications from all people and promote a positive, fair and inclusive work environment.

Posted 1 week ago

5.0 - 9.0 years

0 Lacs

pune, maharashtra

On-site

As a Site Reliability Engineer (SRE) at UBS, you will play a crucial role in ensuring the availability, performance, and resilience of our platforms in a mission-critical financial environment. Your primary responsibility will be to design, implement, and maintain highly available and fault-tolerant systems, with a focus on building and operating reliable, scalable systems in regulated industries such as banking and financial services. You will work closely with engineering, infrastructure, and security teams to build secure, observable, and automated systems, while fostering a culture of operational excellence. Your role will involve defining and monitoring Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to guarantee system reliability and customer satisfaction. Additionally, you will lead incident response, post-mortems, and root cause analysis for production issues, as well as collaborate with development teams to embed reliability into the software development lifecycle. Joining the Operating Systems and Middleware (OSM) team at UBS, you will be part of a globally distributed team that supports critical infrastructure across different time zones using a follow-the-sun support model. Operating in a collaborative Agile environment, you will have the opportunity to work alongside talented engineers who are passionate about building reliable systems and solving complex problems. We value transparency, shared responsibility, and continuous learning, empowering our engineers to take ownership, innovate, and continuously improve. The ideal candidate for this role will have proven expertise in Site Reliability Engineering, with a background in software engineering, infrastructure, or operations. You should possess hands-on experience with cloud platforms like Azure, operating systems such as Linux RHEL7+, and networking fundamentals. 
A solid understanding of networking and storage technologies, authentication and naming services, scripting and automation, as well as infrastructure as code tools is essential. Additionally, you should demonstrate a metrics- and automation-driven mindset, strong collaboration and communication skills, and a proactive, ownership-driven attitude. Desirable additions to your expertise include experience with chaos engineering, resilience testing, disaster recovery planning, financial transaction systems, real-time data pipelines, core banking platforms, CI/CD pipelines, containerization, and orchestration. UBS offers a dynamic and inclusive work environment where diversity is celebrated, and employees are supported with new challenges, growth opportunities, and flexible working options. Join us at UBS, where collaboration and individual empowerment drive our success.

Posted 2 weeks ago

10.0 - 14.0 years

0 Lacs

pune, maharashtra

On-site

As a Technical Product Manager for the internal Observability & Insights Platform, you will play a crucial role in defining the product strategy, overseeing discovery and delivery, and ensuring that engineers and stakeholders across 350+ services can effectively build, debug, and operate with confidence. Your responsibilities will include owning and evolving a platform encompassing logging (ELK stack), metrics (Prometheus, Grafana, Thanos), tracing (Jaeger), structured audit logs, and SIEM integrations, while competing with high-cost solutions like Datadog and Honeycomb. Your impact will be both technical and strategic, with a focus on enhancing developer experience, reducing operational noise, and driving platform efficiency and cost visibility.

Key Deliverables:
- Successfully manage and deliver initiatives from the Observability Roadmap / Job Jar, tracked via RAG status and Jira epics.
- Conduct structured discoveries for upcoming capabilities such as SIEM exporter, SDK adoption, and trace sampling.
- Design and implement scorecards in Port to measure observability maturity across teams.
- Ensure feature parity and stakeholder migration in cost-saving initiatives like Datadog and Prometheus.
- Track and report platform usage, reliability, and cost metrics aligned with business outcomes.
- Drive feature documentation, adoption plans, and enablement sessions across engineering.

Jobs To Be Done:
- Define and evolve the observability product roadmap covering Logs, Metrics, Traces, SDK, Dashboards, and SIEM.
- Lead dual-track agile product discovery for upcoming initiatives, gathering context, defining problems, and validating feasibility.
- Collaborate with engineering managers to break down initiatives into quarterly deliverables, epics, and sprint-level execution.
- Maintain the Observability Job Jar and present RAG status every 2 weeks with confidence supported by Jira hygiene.
- Define and track metrics to measure the success of each platform capability including SLOs, cost savings, and adoption percentage.
- Collaborate closely with FinOps, Security, and Platform teams to ensure observability aligns with cost, compliance, and operational goals.
- Promote the adoption of SDKs, scorecards, and dashboards through enablement, documentation, and evangelism.

Ways Of Working:
- Operate in dual-track agile mode, discovering next quarter's priorities while delivering the current quarter's committed outcomes.
- Maintain a GPS PRD (Product Requirements Doc) for each major initiative, defining the problem, rationale, and value measurement.
- Collaborate deeply with engineers in backlog grooming, planning, demos, and retrospectives.
- Follow RAG-based reporting with stakeholders, escalating risks early and presenting mitigation paths clearly.
- Operate with full visibility in Jira, driving delivery rhythm across sprints.
- Use quarterly Job Jar reviews to recalibrate product priorities, staffing needs, and stakeholder alignment.

Requirements:
- 10+ years of product management experience, preferably in platform/infrastructure products.
- Demonstrated success in managing internal developer platforms or observability tooling.
- Experience in launching or migrating enterprise-scale telemetry stacks like Datadog, Prometheus/Grafana, Honeycomb, Jaeger.
- Ability to translate complex engineering requirements into structured product plans with measurable outcomes.
- Strong technical background in cloud-native environments such as EKS, Kafka, Elasticsearch.
- Excellent documentation and storytelling skills, especially to influence engineers and non-technical stakeholders.

Success Metrics:
- Reduction in Datadog/Honeycomb usage & cost post migration.
- Uptime & latency of observability pipelines (Jaeger, ELK, Prometheus).
- Scorecard improvement across teams (Bronze, Silver, Gold).
- Number of issues detected/resolved using the new observability stack.
- Time to incident triage with new tracing/logging capabilities.

Posted 3 weeks ago

7.0 - 10.0 years

3 - 8 Lacs

Bengaluru, Karnataka, India

On-site

Partner with application developers and solution architects to ensure services are built for scale and performance. Lead setting service-level objectives, agreements and indicators (SLOs, SLAs and SLIs) for the underlying service by collaborating with Application Development, Product and Business Owners. Design, develop and create scripts/software/tools that will improve the reliability of systems in Production, including fixing issues, responding to incidents and taking on-call responsibilities. Improve the overall resilience of a system and provide visibility to the health and performance of services across all applications and infrastructure. Improve service performance metrics like latency, page load speed and ETL and help proactively identify performance issues across the system. Implement monitoring solutions and create Dashboards and Alerts based on the four golden signals of SRE, providing a single source to determine the overall performance and availability of the services they support. Write, update, and use documentation, including runbooks/playbooks. Automate work including infrastructure needs, testing, failover solutions, failure mitigation, and much more. Use Chaos Engineering to test what you build under real-world conditions. Spread information across DevOps and business teams, encouraging a blameless culture focused on workflow visibility and collaboration. Perform root-cause analysis of complex problems involving multiple parties, networks, hardware, and software that relate to scaling and performance. Serve as technical owner to ensure delivery of SRE initiatives. Perform deliverable reviews and coach the team in areas of SRE expertise. Provide continuous competitive and best-practices research, leverage industry resources and market trends, and liaise with internal stakeholders.

Posted 1 month ago

Apply

3.0 - 7.0 years

0 Lacs

karnataka

On-site

As a Support Engineer with experience in maintaining and supporting solutions in a cloud-based environment (GCP or AWS), you will be responsible for ensuring the smooth operation of monitoring tools such as ELK, Dynatrace, CloudWatch, Cloud Logging, Cloud Monitoring, and New Relic. Your primary focus will be to implement and maintain monitoring and self-healing strategies to proactively prevent production incidents. You will also be required to conduct root cause analysis of production issues and design on-call and escalation processes. In addition, you will participate in the design and implementation of serviceability solutions for monitoring and alerting, as well as debugging production issues across services and levels of the stack. Collaborating closely with the platform engineering team, you will help establish and improve production support approaches and participate in defining SLIs and SLOs to demonstrate efficiency and value to business partners. Your responsibilities will also include interacting with and testing APIs, participating in out-of-business-hours deployments and support on rotation with team members, and being familiar with agile development techniques. L3 Support experience is considered an asset for this role. In return, we offer competitive salaries, comprehensive health benefits, flexible work hours, remote work options, professional development and training opportunities, and a supportive and inclusive work environment.

Posted 1 month ago

Apply

5.0 - 10.0 years

5 - 10 Lacs

Pune, Maharashtra, India

On-site

Role Overview: Business Operations Site Reliability Engineer (SRE): The role of the Business Operations team is to act as the production readiness steward for Mastercard products. As a BizOps SRE, the primary responsibility is ensuring the stability and health of the platform. Foster developer run ownership and empower developers to build resilient products. Support developers during the application build phase with operational design, automation, capacity planning, and monitoring, ensuring fault-tolerant and scalable products. Create and enforce operational standards while fostering an agile and learning culture. Focus on triage and root cause analysis, understanding the business impact of products, and performing blameless post-mortems. Engage early in the development lifecycle to be proactive and manage production and change activities to maximize customer experience. Focus on risk management, compliance, and risk mitigation across all environments. Align product and customer-focused priorities with operational needs by providing continuous feedback throughout the lifecycle. Mission: The mission is to ensure production readiness through close collaboration with developers to design, build, implement, and support technology services. Ensure operational criteria such as system availability, capacity, performance, monitoring, self-healing, and deployment automation are implemented throughout the delivery process. Lead the DevOps transformation at Mastercard through tooling and by advocating for change and standards across development, quality, release, and product organizations. Support daily operations with a hyper-focus on triage and root cause analysis, understanding business impacts and conducting blameless post-mortems. Shift left in the development process, becoming more proactive to maximize customer experience and increase the value of supported applications. 
Focus on streamlining and standardizing application-specific support activities and centralizing points of interaction for both internal and external partners. Communicate effectively with key stakeholders to align product and customer-focused priorities with operational needs. Key Responsibilities: Operational Readiness Architect: Serve as the primary contact responsible for the overall health, performance, and capacity of applications. Support services before they go live by engaging in system design consulting, capacity planning, and launch reviews. Partner with development and product teams to establish monitoring and alerting strategies, ensuring zero downtime during deployment. Site Reliability Engineering (SRE): Ensure application scalability, performance, and resilience. Practice sustainable incident response and blameless post-mortems. Take a holistic approach to problem-solving and optimize recovery time. Automate data-driven alerts to proactively escalate issues and work with development teams to establish Service Level Objectives (SLOs) to improve reliability. DevOps/Automation: Address complex development, automation, and business process challenges. Engage in and improve the entire lifecycle of services, from inception and design to deployment, operation, and refinement. Support the CI/CD pipeline, ensuring smooth promotion of software into higher environments through validation and operational gating. Lead Mastercard in DevOps automation and best practices. Increase automation and tooling to reduce manual interventions and toil. ITSM Practices: Analyze ITSM activities of the platform and provide feedback to development teams on operational gaps or resiliency concerns. Role Qualifications: Education and Experience: BS degree in Computer Science, a related technical field (e.g., physics, mathematics), or equivalent practical experience. Exposure to coding and/or scripting. 
An appetite for pushing the boundaries of automation and exploring new technology, infrastructure, and practices to scale architecture for future growth. Technical and Analytical Skills: Experience with algorithms, data structures, scripting, pipeline management, and software design. Systematic problem-solving approach with strong communication skills and a sense of ownership. Interest in designing, analyzing, and troubleshooting large-scale distributed systems. Comfortable collaborating with cross-functional teams to ensure expected system behavior is understood and monitoring is in place to detect anomalies. Additional Skills: Ability to balance doing things correctly with fixing issues quickly. Flexible and pragmatic, working towards the long-term health of systems. Willingness to learn and take on challenging opportunities while being part of a matrix-based, diverse, and geographically distributed team. Ability to prioritize and build relationships across development, operations, and product teams.

Posted 1 month ago

Apply

5.0 - 9.0 years

0 Lacs

haryana

On-site

Cvent is a global leader in meeting, event, travel, and hospitality technology, with a workforce of over 4000 employees worldwide. Our cloud-based solutions cater to more than 28,000 customers in over 100 countries, including 80% of the Fortune 100 companies. As a Lead - Site Reliability Engineer at Cvent, you will leverage your expertise in development and operations to identify and address issues, develop universal solutions, and provide guidance to junior staff. Your responsibilities will also include enabling and supporting multi-disciplinary teams, resolving complex development and automation challenges, promoting Cvent's standards and best practices, ensuring the scalability and performance of our product suite, and collaborating with various teams to establish effective monitoring and alerting strategies. Key Responsibilities: - Utilize advanced knowledge in development and operations to prioritize and resolve issues - Mentor and support junior staff members - Empower and collaborate with multi-disciplinary teams across different applications and locations - Address complex development, automation, and business process challenges - Advocate for Cvent standards and best practices - Ensure product scalability, performance, and resilience - Establish monitoring and alerting strategies for new applications - Share best practices with acquisition's DevOps team - Develop automation solutions for deployment targeting multiple environments - Assist in achieving zero-down-time deployments for legacy code base - Contribute to Open Source projects - Automate tasks to streamline operations Requirements: - Knowledge of SDLC methodologies, preferably Agile - Proficiency in Java, Python, or Ruby - Experience with managing AWS services - Familiarity with configuration management tools like Chef, Puppet, or Ansible - Strong Windows and Linux administration skills - Working knowledge of APM, monitoring, and logging tools - Experience with 3-tier application stacks and 
incident response - Familiarity with build tools such as Jenkins, CircleCI, etc. - Exposure to containerization concepts like Docker, ECS, EKS, Kubernetes - Experience with NoSQL databases like MongoDB, Couchbase, Postgres, etc. - Self-motivated with the ability to work independently Preferred Skills: - Understanding of F5 load balancing concepts - Basic knowledge of observability, SLIs/SLOs, and message queues - Familiarity with basic networking concepts - Experience with package managers like Nexus, Artifactory, etc. - Strong communication and people management skills Join us at Cvent to be part of a dynamic team that is driving innovation and excellence in the world of event management technology.

Posted 1 month ago

Apply

3.0 - 5.0 years

0 - 3 Lacs

Hyderabad, Telangana, India

On-site

Job description: The SRE function is a highly visible force multiplier with a growth mindset, going through a period of increased investment, where you can contribute to the delivery of a highly reliable banking solution. As part of an SRE squad, you will partner with engineering teams within Macquarie to help develop and drive the adoption of SRE best practices and tooling across the organisation. The role will require close engagement and collaboration with the whole engineering community. You will be involved in projects such as measuring, testing and improving our resilience (Chaos engineering), our capacity to deal with increasing load (Demand forecasting and capacity planning), our ability to make changes safely (Change management and System Design) and our Observability (Metrics, monitoring, and alerts). What you offer: Strong experience in software engineering and system design utilising Java, Golang or a similar language. Understanding of the benefits and correct use of SLOs, metrics, logs and traces. Cloud Native at heart, ready to build on the shoulders of giants. Excellent understanding of modern software development practices, tools and technologies. Strong DevOps fundamentals with preference for Java, Golang, Microservices and other cloud technologies. Experience in APM and Observability tools, such as NewRelic, DataDog, Dynatrace, Grafana stack etc.

Posted 1 month ago

Apply

15.0 - 19.0 years

0 Lacs

haryana

On-site

As the Vice President of DevOps & SRE, you will hold a senior leadership position with the primary responsibility of driving platform reliability, secure operations, and DevOps excellence throughout the enterprise. Your role will involve integrating site reliability engineering practices with scalable DevOps automation and maintaining a robust cybersecurity posture. Leading high-performing teams, defining technology strategy, managing infrastructure, and safeguarding systems and data to support business growth and digital innovation will be key aspects of your role. You will be expected to lead enterprise-wide DevOps adoption and continuous delivery transformation, implementing and optimizing CI/CD pipelines, infrastructure-as-code (IaC), and cloud-native architectures. Championing automation in deployment, monitoring, and infrastructure provisioning will be essential, along with experience in containerization (Kubernetes, Docker), service mesh, and serverless environments. Facilitating collaboration between development, operations, and QA for rapid and reliable releases will also be a critical part of your responsibilities. Establishing and leading the Site Reliability Engineering (SRE) function to ensure system reliability, scalability, and performance will be another key aspect of your role. You will define and monitor SLAs, SLOs, and SLIs for critical applications and services, drive incident management, root cause analysis, and foster a postmortem culture. Developing and deploying observability strategies using tools like Prometheus, Grafana, Zabbix, or enterprise tools such as New Relic, Dynatrace, or Splunk will also be within your purview. In terms of leadership and strategic alignment, you will build and mentor cross-functional teams across DevOps and SRE, partnering with engineering, product, and business leaders to align technical initiatives with organizational goals. 
Managing departmental budgets, tools, and vendor relationships, as well as reporting on KPIs, operational health, security posture, and risk to the executive leadership team will also be part of your responsibilities. To qualify for this role, you must hold a Bachelor's or Master's degree in Computer Science, Engineering, or a related field, along with 15+ years of experience in IT/engineering, including 5+ years in leadership roles. Proven expertise in implementing DevOps, SRE, and security practices at scale, as well as hands-on experience with AWS, Azure, or GCP, CI/CD tools, and SRE observability platforms, are essential requirements for this position.

Posted 1 month ago

Apply
Featured Companies