17961 Reliability Jobs - Page 19

JobPe aggregates results for easy application access, but you actually apply on the job portal directly.

8.0 years

0 Lacs

Greater Kolkata Area

On-site

Work Location: PAN India
Duration: 12 Months (Extendable)
Shift: Rotational shifts, including night shifts and weekend availability
Years of Experience: 8+ Years

🔧 Job Summary
We are seeking an experienced and versatile Site Reliability Engineer (SRE) / Observability Engineer to join our project delivery team. The ideal candidate will bring a deep understanding of modern cloud infrastructure, monitoring tools, and automation practices to ensure system uptime, scalability, and performance across a distributed environment.

🎯 Key Responsibilities

Site Reliability Engineering
Design, build, and maintain scalable, reliable infrastructure.
Automate provisioning/configuration using tools like Terraform, Ansible, Chef, or Puppet.
Develop automation tools/scripts in Python, Go, Java, or Bash.
Administer and optimize Linux/Unix systems and network components (TCP/IP, DNS, load balancers).
Deploy and manage infrastructure on AWS or Kubernetes platforms.
Build and maintain CI/CD pipelines (e.g., Jenkins, ArgoCD).
Monitor production systems with tools such as Prometheus, Grafana, Nagios, Datadog.
Conduct postmortems and define SLAs/SLOs to ensure high system reliability.
Plan and implement capacity management, failover systems, and auto-scaling mechanisms.

Observability Engineering
Instrument services for metrics/logs/traces using OpenTelemetry, Prometheus, Jaeger, etc.
Manage observability stacks (e.g., Grafana, ELK Stack, Splunk, Datadog, Honeycomb).
Work with time-series databases (e.g., InfluxDB, Prometheus) and log aggregation tools.
Build actionable alerts and dashboards to reduce alert fatigue and increase insight.
Advocate for observability best practices with developers and define performance KPIs.

✅ Required Skills & Qualifications
Proven experience as an SRE or Observability Engineer in production environments.
Strong Linux/Unix and cloud infrastructure skills (especially AWS, Kubernetes).
Proficient in scripting and automation (Python, Go, Bash, Java).
Expertise in observability, monitoring, and alerting systems.
Experience in Infrastructure as Code (IaC) and modern CI/CD practices.
Strong troubleshooting skills and ability to respond to live production issues.
Comfortable with rotational shifts, including nights and weekends.

Mandatory Technical Skills
Ansible, AWS Automation Services, AWS CloudFormation, AWS CodePipeline, AWS CodeDeploy, AWS DevOps Services

Posted 11 hours ago

8.0 years

0 Lacs

Greater Kolkata Area

On-site

Linkedin logo

Work Location: PAN India
Duration: 12 Months (Extendable)
Shift: Rotational shifts including night shifts and weekend availability
Years of Experience: 8+ Years

Job Summary

We are seeking an experienced and versatile Site Reliability Engineer (SRE) / Observability Engineer to join our project delivery team. The ideal candidate will bring a deep understanding of modern cloud infrastructure, monitoring tools, and automation practices to ensure system uptime, scalability, and performance across a distributed environment.

Key Responsibilities

Site Reliability Engineering

  • Design, build, and maintain scalable, reliable infrastructure.
  • Automate provisioning/configuration using tools like Terraform, Ansible, Chef, or Puppet.
  • Develop automation tools/scripts in Python, Go, Java, or Bash.
  • Administer and optimize Linux/Unix systems and network components (TCP/IP, DNS, load balancers).
  • Deploy and manage infrastructure on AWS or Kubernetes platforms.
  • Build and maintain CI/CD pipelines (e.g., Jenkins, ArgoCD).
  • Monitor production systems with tools such as Prometheus, Grafana, Nagios, Datadog.
  • Conduct postmortems and define SLAs/SLOs to ensure high system reliability.
  • Plan and implement capacity management, failover systems, and auto-scaling mechanisms.

Observability Engineering

  • Instrument services for metrics/logs/traces using OpenTelemetry, Prometheus, Jaeger, etc.
  • Manage observability stacks (e.g., Grafana, ELK Stack, Splunk, Datadog, Honeycomb).
  • Work with time-series databases (e.g., InfluxDB, Prometheus) and log aggregation tools.
  • Build actionable alerts and dashboards to reduce alert fatigue and increase insight.
  • Advocate for observability best practices with developers and define performance KPIs.

Required Skills & Qualifications

  • Proven experience as an SRE or Observability Engineer in production environments.
  • Strong Linux/Unix and cloud infrastructure skills (especially AWS, Kubernetes).
  • Proficient in scripting and automation (Python, Go, Bash, Java).
  • Expertise in observability, monitoring, and alerting systems.
  • Experience in Infrastructure as Code (IaC) and modern CI/CD practices.
  • Strong troubleshooting skills and ability to respond to live production issues.
  • Comfortable with rotational shifts, including nights and weekends.

Mandatory Technical Skills

  • Ansible
  • AWS Automation Services
  • AWS CloudFormation
  • AWS CodePipeline
  • AWS CodeDeploy
  • AWS DevOps Services

Posted 11 hours ago

Exploring Reliability Jobs in India

The job market for reliability professionals in India is growing rapidly as companies across various industries recognize the importance of ensuring their systems and products are dependable and consistent. Reliability jobs in India offer competitive salaries, career progression opportunities, and the chance to work on cutting-edge technologies.

Top Hiring Locations in India

  1. Bangalore
  2. Pune
  3. Hyderabad
  4. Chennai
  5. Mumbai

Average Salary Range

The average salary range for reliability professionals in India varies by experience level:

  • Entry-level: ₹4-6 lakhs per annum
  • Mid-level: ₹8-12 lakhs per annum
  • Experienced: ₹15-20 lakhs per annum

Career Path

In the field of reliability, a typical career path may include roles such as:

  1. Junior Reliability Engineer
  2. Reliability Engineer
  3. Senior Reliability Engineer
  4. Reliability Manager
  5. Director of Reliability

Related Skills

In addition to expertise in reliability, professionals in this field are often expected to have or develop skills in:

  • Data analysis
  • Programming languages like Python or R
  • Statistical modeling
  • Root cause analysis

Interview Questions

  • What is the difference between reliability and availability? (basic)
  • How do you measure system reliability? (medium)
  • Can you explain the Weibull distribution and its use in reliability analysis? (advanced)
  • What is FMEA (Failure Modes and Effects Analysis) and how is it used in reliability engineering? (medium)
  • How do you prioritize reliability improvements in a system with limited resources? (advanced)
  • Describe a time when you successfully improved the reliability of a system. (basic)
  • What tools do you use for reliability analysis and prediction? (medium)
  • How do you calculate Mean Time Between Failures (MTBF)? (basic)
  • Explain the concept of RBD (Reliability Block Diagram) and its importance. (medium)
  • How do you conduct a reliability test for a new product? (advanced)
  • What is the difference between predictive and preventive maintenance in the context of reliability engineering? (basic)
  • How do you define a reliability goal for a system? (medium)
  • Describe a situation where you had to deal with unexpected system failures. How did you handle it? (basic)
  • What is the role of redundancy in improving system reliability? (medium)
  • How do you ensure the reliability of a software application? (basic)
  • Can you explain the concept of Mean Time To Repair (MTTR) and its significance in reliability analysis? (medium)
  • What are the common failure modes in electrical systems and how do you address them? (advanced)
  • How do you establish a reliability testing plan for a complex system? (medium)
  • Describe your experience with Failure Reporting, Analysis, and Corrective Action System (FRACAS). (advanced)
  • How do you handle conflicting priorities between reliability, cost, and time-to-market? (medium)
  • What is the role of HALT (Highly Accelerated Life Testing) in reliability testing? (advanced)
  • How do you conduct a reliability growth analysis for a product over its lifecycle? (medium)
  • Can you explain the concept of Reliability Centered Maintenance (RCM) and its benefits? (advanced)
  • Describe a situation where you had to communicate reliability issues to non-technical stakeholders. How did you approach it? (medium)
  • How do you stay updated with the latest trends and developments in the field of reliability engineering? (basic)
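
Several of the questions above (MTBF, MTTR, availability, redundancy, and Reliability Block Diagrams) come down to a few standard formulas, and it can help to practice them on concrete numbers. The following is a minimal sketch, not a definitive implementation: the FailureLog record and all function names are hypothetical, the availability figure is the usual steady-state approximation MTBF / (MTBF + MTTR), and the block-diagram helpers assume that blocks fail independently.

```python
from dataclasses import dataclass
from math import prod
from typing import Sequence


@dataclass
class FailureLog:
    """Hypothetical summary of one observation window for a repairable system."""
    operating_hours: float  # total time the system was up
    repair_hours: float     # total time spent restoring service
    failures: int           # number of failures observed in the window


def mtbf(log: FailureLog) -> float:
    """Mean Time Between Failures: uptime divided by number of failures."""
    if log.failures <= 0:
        raise ValueError("MTBF is undefined when no failures were observed")
    return log.operating_hours / log.failures


def mttr(log: FailureLog) -> float:
    """Mean Time To Repair: total repair time divided by number of failures."""
    if log.failures <= 0:
        raise ValueError("MTTR is undefined when no failures were observed")
    return log.repair_hours / log.failures


def availability(log: FailureLog) -> float:
    """Steady-state availability, MTBF / (MTBF + MTTR)."""
    up, down = mtbf(log), mttr(log)
    return up / (up + down)


def series_reliability(blocks: Sequence[float]) -> float:
    """Series blocks: every block must work, so reliabilities multiply."""
    return prod(blocks)


def parallel_reliability(blocks: Sequence[float]) -> float:
    """Parallel (redundant) blocks: the group fails only if all blocks fail."""
    return 1.0 - prod(1.0 - r for r in blocks)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any real system.
    log = FailureLog(operating_hours=980.0, repair_hours=20.0, failures=4)
    print(f"MTBF: {mtbf(log):.1f} h")
    print(f"MTTR: {mttr(log):.1f} h")
    print(f"Availability: {availability(log):.2%}")
    print(f"Two 0.9 blocks in series:   {series_reliability([0.9, 0.9]):.2f}")
    print(f"Two 0.9 blocks in parallel: {parallel_reliability([0.9, 0.9]):.2f}")
```

The series and parallel helpers also illustrate the redundancy question: duplicating a block raises the group's reliability above that of a single block, whereas chaining blocks in series lowers it.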

Closing Remark

As you prepare for interviews in the field of reliability, remember to showcase your problem-solving skills, technical knowledge, and experience in improving system dependability. Stay confident, stay curious, and best of luck in your job search journey!

Featured Companies