95 Site Reliability Jobs - Page 2

JobPe aggregates results for easy application access, but you actually apply on the job portal directly.

4.0 - 8.0 years

10 - 15 Lacs

Mohali

Work from Office

About the Role We are seeking a highly skilled Sr. Site Reliability Engineer (SRE) to lead the implementation, optimization, and management of our observability stack across cloud infrastructure. You will play a key role in ensuring the reliability, scalability, and performance of our platform, spanning microservices on Kubernetes/EC2 and mission-critical systems. This role requires strong problem-solving, automation mindset, and a proactive approach to incident management. Key Responsibilities Design, implement, and manage monitoring, logging, and alerting systems across production and non-production environments. Lead incident response, root cause analysis, and post-mortem practices for continuous improvement. Define and implement disaster recovery strategies with regular testing. Collaborate with development teams to define and track SLAs/SLOs for critical services. Optimize AWS cloud infrastructure for cost efficiency, reliability, and scalability. Build and maintain automation frameworks for deployment, scaling, and recovery using Terraform, GitLab CI/CD, and Kubernetes. Administer Kubernetes clusters, troubleshoot performance bottlenecks, and ensure high availability. Manage databases (PostgreSQL or similar), including replication and disaster recovery strategies. Contribute to infrastructure security, compliance, and best practices. Participate in the on-call rotation and handle high-priority incidents under pressure. Required Skills & Experience 4+ years of experience as an SRE, DevOps, or similar role. Strong hands-on experience with AWS services: EC2, EKS, RDS, Cognito, CloudWatch, etc. Proven expertise in Kubernetes administration in production environments. Proficiency in scripting/programming: Python, Bash, Chef (recipes, cookbooks), Ansible. Strong knowledge of Infrastructure as Code (Terraform/CloudFormation). Deep experience with observability tools: Prometheus, Grafana, ELK stack, distributed tracing. 
Database administration experience with PostgreSQL or similar systems. Understanding of network protocols, load balancing, and security best practices. Experience in CI/CD pipelines and GitOps workflows. Ability to handle multiple incidents and prioritize effectively under pressure. Exposure to monitoring solutions like Splunk, Datadog, Dynatrace. Preferred Qualifications AWS Certified Solutions Architect or AWS DevOps Engineer certification. Certified Kubernetes Administrator (CKA). Why Join Us Be part of a fast-growing HealthTech startup transforming healthcare technology. Work with modern tools, cutting-edge infrastructure, and a collaborative team. Opportunity to own end-to-end infrastructure reliability and automation. Competitive salary and growth opportunities.

Posted 3 weeks ago

4.0 - 8.0 years

15 - 20 Lacs

Mohali

Work from Office

About the Role We are seeking a highly skilled Sr. Site Reliability Engineer (SRE) to lead the implementation, optimization, and management of our observability stack across cloud infrastructure. You will play a key role in ensuring the reliability, scalability, and performance of our platform, spanning microservices on Kubernetes/EC2 and mission-critical systems. This role requires strong problem-solving, automation mindset, and a proactive approach to incident management. Key Responsibilities Design, implement, and manage monitoring, logging, and alerting systems across production and non-production environments. Lead incident response, root cause analysis, and post-mortem practices for continuous improvement. Define and implement disaster recovery strategies with regular testing. Collaborate with development teams to define and track SLAs/SLOs for critical services. Optimize AWS cloud infrastructure for cost efficiency, reliability, and scalability. Build and maintain automation frameworks for deployment, scaling, and recovery using Terraform, GitLab CI/CD, and Kubernetes. Administer Kubernetes clusters, troubleshoot performance bottlenecks, and ensure high availability. Manage databases (PostgreSQL or similar), including replication and disaster recovery strategies. Contribute to infrastructure security, compliance, and best practices. Participate in the on-call rotation and handle high-priority incidents under pressure. Required Skills & Experience 4+ years of experience as an SRE, DevOps, or similar role. Strong hands-on experience with AWS services: EC2, EKS, RDS, Cognito, CloudWatch, etc. Proven expertise in Kubernetes administration in production environments. Proficiency in scripting/programming: Python, Bash, Chef (recipes, cookbooks), Ansible. Strong knowledge of Infrastructure as Code (Terraform/CloudFormation). Deep experience with observability tools: Prometheus, Grafana, ELK stack, distributed tracing. 
Database administration experience with PostgreSQL or similar systems. Understanding of network protocols, load balancing, and security best practices. Experience in CI/CD pipelines and GitOps workflows. Ability to handle multiple incidents and prioritize effectively under pressure. Exposure to monitoring solutions like Splunk, Datadog, Dynatrace. Preferred Qualifications AWS Certified Solutions Architect or AWS DevOps Engineer certification. Certified Kubernetes Administrator (CKA). Why Join Us Be part of a fast-growing HealthTech startup transforming healthcare technology. Work with modern tools, cutting-edge infrastructure, and a collaborative team. Opportunity to own end-to-end infrastructure reliability and automation. Competitive salary and growth opportunities.

Posted 3 weeks ago

6.0 - 9.0 years

12 - 16 Lacs

Pune

Work from Office

We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools: logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD: GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 3 weeks ago

8.0 - 10.0 years

9 - 13 Lacs

Coimbatore

Work from Office

Job Summary : We are seeking an experienced Site Reliability Engineer (SRE) who will play a critical role in ensuring the reliability, performance, and scalability of our payment systems. The ideal candidate will possess deep expertise in DevOps automation, enterprise monitoring, and cloud platforms, along with a strong background in Card Payment systems. This role requires hands-on technical skills, a passion for problem-solving, and the ability to collaborate across teams in a fast-paced, dynamic environment. Key Responsibilities : Reliability & Performance : - Ensure the reliability, availability, and performance of critical payment platforms and services. - Drive root cause analysis (RCA) and implement long-term solutions to prevent recurrence of incidents. - Manage capacity planning, scalability, and performance tuning across cloud and on-prem environments. - Lead and participate in the on-call rotation, providing timely support and issue resolution. DevOps Automation & CI/CD : - Design, implement, and maintain CI/CD pipelines using Jenkins, GitHub, and other DevOps tools. - Automate infrastructure deployment, configuration, and monitoring, following Infrastructure as Code (IaC) principles. - Enhance automation for routine operational tasks, incident response, and self-healing capabilities. Monitoring & Observability : - Implement and manage enterprise monitoring solutions including Splunk, Dynatrace, Prometheus, and Grafana. - Build real-time dashboards, alerts, and reporting to proactively identify system anomalies. - Continuously improve observability, logging, and tracing across all environments. Cloud Platforms & Infrastructure : - Work with AWS, Azure, and PCF (Pivotal Cloud Foundry) environments, managing cloud-native services and infrastructure. - Design and optimize cloud architecture for reliability and cost-efficiency. - Collaborate with cloud security and networking teams to ensure secure and compliant infrastructure. 
Payment Systems Expertise : - Apply your understanding of Card Payment systems to ensure platform reliability and compliance. - Troubleshoot payment-related issues, ensuring minimal impact on transaction flows and customer experience. - Collaborate with product and development teams to ensure alignment with business objectives.

Posted 3 weeks ago

3.0 - 5.0 years

7 - 11 Lacs

Coimbatore

Work from Office

We are seeking an experienced Site Reliability Engineer (SRE) who will play a critical role in ensuring the reliability, performance, and scalability of our payment systems. The ideal candidate will possess deep expertise in DevOps automation, enterprise monitoring, and cloud platforms, along with a strong background in Card Payment systems. This role requires hands-on technical skills, a passion for problem-solving, and the ability to collaborate across teams in a fast-paced, dynamic environment. Key Responsibilities : Reliability & Performance : - Ensure the reliability, availability, and performance of critical payment platforms and services. - Drive root cause analysis (RCA) and implement long-term solutions to prevent recurrence of incidents. - Manage capacity planning, scalability, and performance tuning across cloud and on-prem environments. - Lead and participate in the on-call rotation, providing timely support and issue resolution. DevOps Automation & CI/CD : - Design, implement, and maintain CI/CD pipelines using Jenkins, GitHub, and other DevOps tools. - Automate infrastructure deployment, configuration, and monitoring, following Infrastructure as Code (IaC) principles. - Enhance automation for routine operational tasks, incident response, and self-healing capabilities. Monitoring & Observability : - Implement and manage enterprise monitoring solutions including Splunk, Dynatrace, Prometheus, and Grafana. - Build real-time dashboards, alerts, and reporting to proactively identify system anomalies. - Continuously improve observability, logging, and tracing across all environments. Cloud Platforms & Infrastructure : - Work with AWS, Azure, and PCF (Pivotal Cloud Foundry) environments, managing cloud-native services and infrastructure. - Design and optimize cloud architecture for reliability and cost-efficiency. - Collaborate with cloud security and networking teams to ensure secure and compliant infrastructure. 
Payment Systems Expertise : - Apply your understanding of Card Payment systems to ensure platform reliability and compliance. - Troubleshoot payment-related issues, ensuring minimal impact on transaction flows and customer experience. - Collaborate with product and development teams to ensure alignment with business objectives.

Posted 3 weeks ago

5.0 - 7.0 years

14 - 19 Lacs

Hyderabad

Work from Office

The ideal candidate is a Senior Site Reliability Engineer with strong expertise in CI/CD pipeline design, infrastructure automation, and backend service development. They have hands-on experience with Node.js, Python scripting, and managing large-scale Kubernetes clusters. The candidate is well-versed in AWS cloud infrastructure, including AWS CDK, and has a deep understanding of DevOps and security best practices. Familiarity with ArgoCD, Kustomize, and GitOps workflows is a strong advantage. They should also be capable of monitoring and optimizing system performance, ensuring reliability and scalability across environments, and collaborating with cross-functional teams. Responsibilities : - Lead the design and implementation of CI/CD pipelines to streamline deployment processes. - Develop and maintain backend services using Node.js, focusing on security and mitigating cyber vulnerabilities. - Automate processes using Python scripting to build utilities that support CI/CD pipelines. - Manage large-scale infrastructure and multiple Kubernetes clusters to ensure optimal performance and reliability. - Implement AWS infrastructure solutions, utilizing AWS CDK and core AWS services to enhance our cloud capabilities. - Collaborate with cross-functional teams to ensure seamless integration of services and infrastructure. - Monitor system performance and troubleshoot issues to maintain high availability and reliability. Qualifications we seek in you : Minimum Qualifications / Skills : - Proven experience in a Senior SRE or similar role. - Strong expertise in CI/CD deployments. - Working knowledge of Python scripting for automation. - Experience in developing and maintaining backend services using Node.js. - Practical experience with AWS infrastructure, including strong working knowledge of AWS CDK and core AWS services. Preferred Qualifications/ Skills : - Familiarity with ArgoCD and Kustomize. 
- Hands-on experience in managing large-scale infrastructure and multiple Kubernetes clusters. - Strong understanding of security best practices in software development.

Posted 3 weeks ago

5.0 - 10.0 years

8 - 12 Lacs

Noida

Work from Office

We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 3 weeks ago

6.0 - 9.0 years

12 - 16 Lacs

Pune

Work from Office

We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools: logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD: GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 4 weeks ago

5.0 - 8.0 years

5 - 8 Lacs

Gurgaon, Haryana, India

On-site

NVIDIA has continuously reinvented itself. Our invention of the GPU sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. Today, research in artificial intelligence is booming worldwide, which calls for highly scalable and massively parallel computation horsepower, at which NVIDIA GPUs excel. NVIDIA is a learning machine that constantly evolves by adapting to new opportunities that are hard to solve, that only we can address, and that matter to the world. This is our life's work: to amplify human creativity and intelligence. As an NVIDIAN, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join our diverse team and see how you can make a lasting impact on the world! As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance and drive foundational improvements and automation to improve researchers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other; we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment.
We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow. What You'll Be Doing In this role you will be building and improving our ecosystem around GPU-accelerated computing including developing large-scale automation solutions. You will also be maintaining and building deep learning AI-HPC GPU clusters at scale and supporting our researchers to run their flows on our clusters including performance analysis and optimizations of deep learning workflows. You will design, implement and support operational and reliability aspects of large-scale distributed systems with a focus on performance at scale, real-time monitoring, logging, and alerting. Design and implement state-of-the-art GPU compute clusters. Optimize cluster operations for maximum reliability, efficiency, and performance. Drive foundational improvements and automation to enhance researcher productivity. Troubleshoot, diagnose, and root-cause system failures, and isolate the components/failure scenarios while working with internal & external partners. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. Practice sustainable incident response and blameless postmortems, and be part of an on-call rotation to support production systems. Write and review code, develop documentation and capacity plans, debug the hardest problems, live, on some of the largest and most complex systems in the world. Implement remediations across software and hardware stack according to plan, while keeping a thorough procedural record and data log, and manage upgrades and automated rollbacks across all clusters. What We Need To See Bachelor's degree in Computer Science, Electrical Engineering or related field or equivalent experience with a minimum of 5 years of experience designing and operating large-scale compute infrastructure.
Proven experience in site reliability engineering for high-performance computing environments with operational experience with clusters of at least 2K GPUs. Deep understanding of GPU computing and AI infrastructure. Passion for solving complex technical challenges and optimizing system performance. Experience with AI/HPC advanced job schedulers, and ideally familiarity with schedulers such as Slurm. Working knowledge of cluster configuration management tools such as BCM or Ansible and infrastructure level applications, such as Kubernetes, Terraform, MySQL, etc. In-depth understanding of container technologies like Docker, Enroot, etc. Experience programming in Python and Bash scripting. Ways To Stand Out From The Crowd Interest in crafting, analyzing, and fixing large-scale distributed systems. Familiarity with NVIDIA GPUs, CUDA programming, NCCL, MLPerf benchmarking, InfiniBand with IPoIB and RDMA. Experience with Cloud Deployment, BCM, Terraform. Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads. Multi-cloud experience.

Posted 1 month ago

5.0 - 8.0 years

5 - 8 Lacs

Pune, Maharashtra, India

On-site

NVIDIA has continuously reinvented itself. Our invention of the GPU sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. Today, research in artificial intelligence is booming worldwide, which calls for highly scalable and massively parallel computation horsepower, at which NVIDIA GPUs excel. NVIDIA is a learning machine that constantly evolves by adapting to new opportunities that are hard to solve, that only we can address, and that matter to the world. This is our life's work: to amplify human creativity and intelligence. As an NVIDIAN, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join our diverse team and see how you can make a lasting impact on the world! As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance and drive foundational improvements and automation to improve researchers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other; we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment.
We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow. What You'll Be Doing In this role you will be building and improving our ecosystem around GPU-accelerated computing including developing large-scale automation solutions. You will also be maintaining and building deep learning AI-HPC GPU clusters at scale and supporting our researchers to run their flows on our clusters including performance analysis and optimizations of deep learning workflows. You will design, implement and support operational and reliability aspects of large-scale distributed systems with a focus on performance at scale, real-time monitoring, logging, and alerting. Design and implement state-of-the-art GPU compute clusters. Optimize cluster operations for maximum reliability, efficiency, and performance. Drive foundational improvements and automation to enhance researcher productivity. Troubleshoot, diagnose, and root-cause system failures, and isolate the components/failure scenarios while working with internal & external partners. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. Practice sustainable incident response and blameless postmortems, and be part of an on-call rotation to support production systems. Write and review code, develop documentation and capacity plans, debug the hardest problems, live, on some of the largest and most complex systems in the world. Implement remediations across software and hardware stack according to plan, while keeping a thorough procedural record and data log, and manage upgrades and automated rollbacks across all clusters. What We Need To See Bachelor's degree in Computer Science, Electrical Engineering or related field or equivalent experience with a minimum of 5 years of experience designing and operating large-scale compute infrastructure.
Proven experience in site reliability engineering for high-performance computing environments with operational experience with clusters of at least 2K GPUs. Deep understanding of GPU computing and AI infrastructure. Passion for solving complex technical challenges and optimizing system performance. Experience with AI/HPC advanced job schedulers, and ideally familiarity with schedulers such as Slurm. Working knowledge of cluster configuration management tools such as BCM or Ansible and infrastructure level applications, such as Kubernetes, Terraform, MySQL, etc. In-depth understanding of container technologies like Docker, Enroot, etc. Experience programming in Python and Bash scripting. Ways To Stand Out From The Crowd Interest in crafting, analyzing, and fixing large-scale distributed systems. Familiarity with NVIDIA GPUs, CUDA programming, NCCL, MLPerf benchmarking, InfiniBand with IPoIB and RDMA. Experience with Cloud Deployment, BCM, Terraform. Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads. Multi-cloud experience.

Posted 1 month ago

5.0 - 8.0 years

5 - 8 Lacs

Hyderabad, Telangana, India

On-site

NVIDIA has continuously reinvented itself. Our invention of the GPU sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. Today, research in artificial intelligence is booming worldwide, which calls for highly scalable and massively parallel computation horsepower, at which NVIDIA GPUs excel. NVIDIA is a learning machine that constantly evolves by adapting to new opportunities that are hard to solve, that only we can address, and that matter to the world. This is our life's work: to amplify human creativity and intelligence. As an NVIDIAN, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join our diverse team and see how you can make a lasting impact on the world! As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance and drive foundational improvements and automation to improve researchers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other; we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment.
We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow. What You'll Be Doing In this role you will be building and improving our ecosystem around GPU-accelerated computing including developing large-scale automation solutions. You will also be maintaining and building deep learning AI-HPC GPU clusters at scale and supporting our researchers to run their flows on our clusters including performance analysis and optimizations of deep learning workflows. You will design, implement and support operational and reliability aspects of large-scale distributed systems with a focus on performance at scale, real-time monitoring, logging, and alerting. Design and implement state-of-the-art GPU compute clusters. Optimize cluster operations for maximum reliability, efficiency, and performance. Drive foundational improvements and automation to enhance researcher productivity. Troubleshoot, diagnose, and root-cause system failures, and isolate the components/failure scenarios while working with internal & external partners. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. Practice sustainable incident response and blameless postmortems, and be part of an on-call rotation to support production systems. Write and review code, develop documentation and capacity plans, debug the hardest problems, live, on some of the largest and most complex systems in the world. Implement remediations across software and hardware stack according to plan, while keeping a thorough procedural record and data log, and manage upgrades and automated rollbacks across all clusters. What We Need To See Bachelor's degree in Computer Science, Electrical Engineering or related field or equivalent experience with a minimum of 5 years of experience designing and operating large-scale compute infrastructure.
Proven experience in site reliability engineering for high-performance computing environments with operational experience with clusters of at least 2K GPUs. Deep understanding of GPU computing and AI infrastructure. Passion for solving complex technical challenges and optimizing system performance. Experience with AI/HPC advanced job schedulers, and ideally familiarity with schedulers such as Slurm. Working knowledge of cluster configuration management tools such as BCM or Ansible and infrastructure level applications, such as Kubernetes, Terraform, MySQL, etc. In-depth understanding of container technologies like Docker, Enroot, etc. Experience programming in Python and Bash scripting. Ways To Stand Out From The Crowd Interest in crafting, analyzing, and fixing large-scale distributed systems. Familiarity with NVIDIA GPUs, CUDA programming, NCCL, MLPerf benchmarking, InfiniBand with IPoIB and RDMA. Experience with Cloud Deployment, BCM, Terraform. Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads. Multi-cloud experience.

Posted 1 month ago

Apply

5.0 - 8.0 years

5 - 8 Lacs

Bengaluru, Karnataka, India

On-site

NVIDIA has continuously reinvented itself. Our invention of the GPU sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. Today, research in artificial intelligence is booming worldwide, which calls for highly scalable and massively parallel computation horsepower at which NVIDIA GPUs excel. NVIDIA is a learning machine that constantly evolves by adapting to new opportunities that are hard to solve, that only we can address, and that matter to the world. This is our life's work, to amplify human creativity and intelligence. As an NVIDIAN, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join our diverse team and see how you can make a lasting impact on the world! As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance and to drive foundational improvements and automation to improve researchers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other; we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment.
We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow. What You'll Be Doing: In this role you will be building and improving our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions. You will also be maintaining and building deep learning AI-HPC GPU clusters at scale and supporting our researchers to run their flows on our clusters, including performance analysis and optimization of deep learning workflows. You will design, implement and support operational and reliability aspects of large-scale distributed systems with a focus on performance at scale, real-time monitoring, logging, and alerting. Design and implement state-of-the-art GPU compute clusters. Optimize cluster operations for maximum reliability, efficiency, and performance. Drive foundational improvements and automation to enhance researcher productivity. Troubleshoot, diagnose, and root-cause system failures, and isolate the components and failure scenarios while working with internal and external partners. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. Practice sustainable incident response and blameless postmortems, and be part of an on-call rotation to support production systems. Write and review code, develop documentation and capacity plans, and debug the hardest problems, live, on some of the largest and most complex systems in the world. Implement remediations across the software and hardware stack according to plan, while keeping a thorough procedural record and data log, and manage upgrades and automated rollbacks across all clusters. What We Need To See: Bachelor's degree in Computer Science, Electrical Engineering or a related field, or equivalent experience, with a minimum of 5 years of experience designing and operating large-scale compute infrastructure.
Proven experience in site reliability engineering for high-performance computing environments, with operational experience on a cluster of at least 2K GPUs. Deep understanding of GPU computing and AI infrastructure. Passion for solving complex technical challenges and optimizing system performance. Experience with advanced AI/HPC job schedulers, ideally including familiarity with schedulers such as Slurm. Working knowledge of cluster configuration management tools such as BCM or Ansible and infrastructure-level applications such as Kubernetes, Terraform, MySQL, etc. In-depth understanding of container technologies like Docker, Enroot, etc. Experience programming in Python and Bash scripting. Ways To Stand Out From The Crowd: Interest in crafting, analyzing, and fixing large-scale distributed systems. Familiarity with NVIDIA GPUs, CUDA programming, NCCL, MLPerf benchmarking, and InfiniBand with IPoIB and RDMA. Experience with cloud deployment, BCM, Terraform. Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads. Multi-cloud experience.

Posted 1 month ago

Apply

5.0 - 8.0 years

5 - 8 Lacs

Pune, Maharashtra, India

On-site

NVIDIA has continuously reinvented itself. Our invention of the GPU sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. Today, research in artificial intelligence is booming worldwide, which calls for highly scalable and massively parallel computation horsepower at which NVIDIA GPUs excel. NVIDIA is a learning machine that constantly evolves by adapting to new opportunities that are hard to solve, that only we can address, and that matter to the world. This is our life's work, to amplify human creativity and intelligence. As an NVIDIAN, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join our diverse team and see how you can make a lasting impact on the world! As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance and to drive foundational improvements and automation to improve researchers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other; we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment.
We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow. What You'll Be Doing: In this role you will be building and improving our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions. You will also be maintaining and building deep learning AI-HPC GPU clusters at scale and supporting our researchers to run their flows on our clusters, including performance analysis and optimization of deep learning workflows. You will design, implement and support operational and reliability aspects of large-scale distributed systems with a focus on performance at scale, real-time monitoring, logging, and alerting. Design and implement state-of-the-art GPU compute clusters. Optimize cluster operations for maximum reliability, efficiency, and performance. Drive foundational improvements and automation to enhance researcher productivity. Troubleshoot, diagnose, and root-cause system failures, and isolate the components and failure scenarios while working with internal and external partners. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. Practice sustainable incident response and blameless postmortems, and be part of an on-call rotation to support production systems. Write and review code, develop documentation and capacity plans, and debug the hardest problems, live, on some of the largest and most complex systems in the world. Implement remediations across the software and hardware stack according to plan, while keeping a thorough procedural record and data log, and manage upgrades and automated rollbacks across all clusters. What We Need To See: Bachelor's degree in Computer Science, Electrical Engineering or a related field, or equivalent experience, with a minimum of 5 years of experience designing and operating large-scale compute infrastructure.
Proven experience in site reliability engineering for high-performance computing environments, with operational experience on a cluster of at least 2K GPUs. Deep understanding of GPU computing and AI infrastructure. Passion for solving complex technical challenges and optimizing system performance. Experience with advanced AI/HPC job schedulers, ideally including familiarity with schedulers such as Slurm. Working knowledge of cluster configuration management tools such as BCM or Ansible and infrastructure-level applications such as Kubernetes, Terraform, MySQL, etc. In-depth understanding of container technologies like Docker, Enroot, etc. Experience programming in Python and Bash scripting. Ways To Stand Out From The Crowd: Interest in crafting, analyzing, and fixing large-scale distributed systems. Familiarity with NVIDIA GPUs, CUDA programming, NCCL, MLPerf benchmarking, and InfiniBand with IPoIB and RDMA. Experience with cloud deployment, BCM, Terraform. Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads. Multi-cloud experience.

Posted 1 month ago

Apply

3.0 - 7.0 years

0 Lacs

pune, maharashtra

On-site

The ideal candidate for this position should have hands-on experience in Site Reliability and DevOps, along with expertise in Kubernetes, Docker, Terraform, and CI/CD. As a Level M professional, you will be working in US EST hours with Pune being the preferred location. Your responsibilities will include designing, developing, and deploying software systems and infrastructure to enhance reliability, scalability, and performance. You will be expected to identify manual processes that can be automated to improve operational efficiency. Implementing monitoring and alerting systems to proactively identify and address issues will be a key part of your role. Collaborating with customers for architecture reviews and developing new features to enhance the reliability and scalability of the platform will also be part of your duties. Working closely with various application teams to understand platform issues and design solutions for monitoring and issue resolution will be essential. You will be responsible for designing recovery and resiliency strategies for different applications. Identifying opportunities for technological improvements and the need for new tools to support capacity planning, disaster recovery, and resiliency will also be part of your role. Additionally, you will architect and implement packages/modules that can serve as blueprints for implementation by different application teams.

Posted 1 month ago

Apply

10.0 - 14.0 years

0 Lacs

karnataka

On-site

We are looking for a skilled technical leader capable of developing tools and services to enhance the test automation, test reporting, and test debugging processes for our team of automation engineers. Your role will involve guiding the automation of test infrastructure provisioning, scaling, and more. Additionally, as part of the team, you will be responsible for building frameworks to facilitate the integration of automated testing into CI/CD pipelines across various languages and frameworks. Your technical expertise and leadership will play a crucial role in fostering a culture of site reliability, test automation, shared ownership, and transparency. Your responsibilities will include building and supporting tools and services to enhance our automated test platform, researching and implementing ways to improve user experience and reduce manual tasks, leading infrastructure automation efforts, spearheading test automation frameworks and CI/CD integration, managing test environments and infrastructure, promoting agile processes and fast release cycles, architecting monitoring and alerting systems for comprehensive test lifecycle observability, developing playbooks for incident response and disaster recovery, and instilling a culture of site reliability, shared ownership, and automation throughout the organization. You will also be involved in technical design reviews, code quality processes, and utilizing GenAI/ML tools for test development and triage processes. The ideal candidate will have a strong problem-solving ability, a passion for building usable and scalable systems, the ability to collaborate effectively across teams, a sense of responsibility and ownership, excellent communication skills, comfort with ambiguity, and a curiosity for constant learning and professional growth. 
Additionally, you should possess over 10 years of experience in product quality, automation, and/or DevOps, hold a Bachelor's or Master's degree in Computer Science, Engineering, or a related field, and demonstrate hands-on experience in developing, deploying, and securing services, particularly in regulated environments. Experience with software development productivity metrics, infrastructure provisioning using code and scripts, networking, big data technologies, databases, Linux administration, microservices, distributed systems, performance optimizations, public cloud providers, and VMware is preferred. Experience in cybersecurity and AI/ML testing would be an added advantage. If you are excited about tackling complex challenges, driving innovation, and leading technical initiatives to enhance test automation processes, we encourage you to apply for this role and be a part of our dynamic team.

Posted 1 month ago

Apply

6.0 - 11.0 years

8 - 12 Lacs

Mumbai, Delhi / NCR, Bengaluru

Work from Office

Observability & SRE Engineer Azure & Splunk (3 Months) Role Overview : We are looking for a highly skilled Observability and Site Reliability Engineer (SRE) with strong experience in Splunk integration with Azure, cloud-native monitoring, and chaos engineering practices. The ideal candidate will play a key role in improving system reliability, monitoring capabilities, and resilience across our Azure cloud infrastructure. Key Responsibilities : Design, implement, and manage observability solutions using Splunk integrated with Azure Monitor, Log Analytics, and Application Insights. Develop and maintain monitoring, alerting, and dashboarding solutions to ensure system health and performance. Implement Azure Chaos Engineering tools and scenarios to proactively test the resilience of cloud applications. Collaborate with application and infrastructure teams to identify SLOs/SLIs and define reliability objectives. Automate incident detection and response processes using Splunk alerts, Azure Automation, and scripting. Conduct root cause analysis (RCA) and post-incident reviews to drive continuous improvement. Drive the adoption of SRE principles and practices across engineering teams. Location - Delhi / NCR, Bangalore, Mumbai, Pune

Posted 1 month ago

Apply

5.0 - 10.0 years

8 - 12 Lacs

Surat

Work from Office

We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 1 month ago

Apply

5.0 - 10.0 years

8 - 12 Lacs

Gurugram

Work from Office

We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 1 month ago

Apply

5.0 - 10.0 years

8 - 12 Lacs

Kanpur

Work from Office

Job Description : We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 1 month ago

Apply

5.0 - 10.0 years

8 - 12 Lacs

Kolkata

Work from Office

Job Description : We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 1 month ago

Apply

5.0 - 10.0 years

8 - 12 Lacs

Ahmedabad

Remote

We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 1 month ago

Apply

5.0 - 10.0 years

8 - 12 Lacs

Chennai

Work from Office

We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 1 month ago

Apply

6.0 - 9.0 years

12 - 16 Lacs

Pune

Work from Office

We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities : - Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps) - Manage observability tools : logs, metrics, traces, and alerts - Tune backend services & GKE workloads (Node.js, Django, Go, Java) - Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets) - Lead incident responses & perform root cause analyses - Standardize secrets, tagging & infra consistency across environments - Enhance CI/CD pipelines & collaborate on better rollout strategies Must-Have Skills : - 5-10 years in DevOps / SRE / Infra roles - Kubernetes (GKE preferred) - IaC with Terraform & Helm - CI/CD : GitHub Actions + GitOps (ArgoCD / Flux) - Cloud architecture expertise (IAM, VPC, Secrets) - Strong scripting/coding & backend debugging skills (Node.js, Django, etc.) - Incident management with tools like Datadog & PagerDuty - Excellent communicator & documenter Tech Stack : - GKE, Kubernetes, Terraform, Helm - GitHub Actions, ArgoCD / Flux - Datadog, PagerDuty - CloudSQL, Cloudflare, IAM, Secrets You're : - A proactive team player & strong individual contributor - Confident yet humble - Curious, driven & always learning - Not afraid to solve deep infrastructure challenges

Posted 1 month ago

Apply

4.0 - 5.0 years

8 - 12 Lacs

Gurugram

Work from Office

Position Overview : We are seeking an SRE to join our high-impact platform engineering team. You will maintain SLAs for real-time services deployed across hybrid clouds and Kubernetes clusters, contributing to automation, observability, and availability goals. Roles and Responsibilities : - Monitor application and infrastructure metrics; build dashboards and alerts (Prometheus, Grafana, ELK). - Automate health checks, incident remediation, and reliability guardrails. - Manage on-call rotations, conduct root cause analysis, and implement postmortem action plans. - Define and track SLOs, SLIs, and error budgets. - Use chaos engineering and resilience testing to ensure fault tolerance. Must Have Skills : - 4 - 5 years of experience in managing production-grade Kubernetes clusters and cloud-native platforms. - Proficiency in Linux system internals, containers, and networking. - Scripting/automation expertise in Python/Go/Shell. - Familiarity with incident management, runbooks, and observability standards. - Exposure to service discovery, DNS routing, and load balancing is a bonus. Qualification : BE/BTech/MCA/ME/MTech/MS in Computer Science or a related technical field or equivalent practical experience. Location : Gurugram / Onsite. About Nomiso : Our mission is to Empower and Enhance the lives of our customers, through efficient solutions for their At Nomiso we encourage entrepreneurial spirit to learn, grow and improve. A great workplace thrives on ideas and opportunities. We're in pursuit of colleagues who share similar passions, are nimble and thrive when challenged. We offer a positive, stimulating and fun environment with opportunities to grow, a fast-paced approach to innovation, and a place where your views are valued and encouraged. We are an equal opportunity employer and are committed to diversity, equity, and inclusion. 
We do not discriminate on race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, disability status, or any other protected characteristics.

Posted 1 month ago

Apply

6.0 - 10.0 years

0 Lacs

karnataka

On-site

The Lead Associate, Release Management position at BetaNXT involves supporting the scheduling, coordination, and verification of all Mediant's technology/application releases. You will collaborate with QA and Developer Leads to ensure builds are validated before deployment to Production and organize artifacts for release. Additionally, you will work with IT management to enhance software engineering processes and practices related to building, deploying, updating software, and maintaining environments. As part of your responsibilities, you will assist in triaging issues in Production, performing Root Cause Analysis to identify bug introductions, and providing feedback to enhance engineering processes. Your key functions will include implementing and managing release processes for software applications, APIs, and various IT initiatives. You will validate release features, prepare release instructions, and coordinate resources required for deployment. Working closely with QA Leads, you will establish and maintain a bug triage process, prioritize bugs for fixes, and ensure timely resolution by the scrum team. Collaboration with Developers, QA, and DevOps teams to identify and evaluate risks related to releases is essential. You will conduct Root Cause Analysis for discovered bugs, assist in troubleshooting production issues, and coordinate resources to address them. Managing projects and interdependencies to ensure production readiness for all system updates will also be part of your role. To be successful in this position, you should have at least 6+ years of experience and be familiar with build, deployment, and versioning software such as Bamboo and BitBucket. Experience working in a Cloud environment, preferably AWS, is required. Previous experience in the financial services and securities industry is preferred. 
You should be comfortable testing software applications, APIs, and database objects/SQL, with experience in DevOps, Site Reliability, or Release Management for a rapidly growing company. Familiarity with software development tools like Git, GitLab, Docker, Postman, and Splunk is beneficial. A B.S. degree is required, while an advanced degree or equivalent experience is preferred. Strong project management and communication skills are necessary, along with experience in Software Quality Assurance or Verification of Release Builds. Experience with build and release processes, especially deploying in a Cloud environment, is preferred. Familiarity with Agile/Scrum development methodologies and SQL skills to write queries and understand existing stored procedures and functions are also valuable assets in this role.

Posted 1 month ago

Apply
Featured Companies