Staff Engineer - SRE

12 years

0 Lacs

Posted: 2 months ago | Platform: LinkedIn

Work Mode

On-site

Job Type

Full Time

Job Description

Responsibilities:

  • The Site Reliability Engineering (SRE) team is responsible for the reliability, scalability, stability, and performance of systems and services.
  • Collaborate with cross-functional teams to design, build, and maintain systems, while troubleshooting issues as they arise. Act as a bridge between development and operations teams.
  • Work closely with business teams to define Service Level Objectives (SLOs) and Service Level Agreements (SLAs) for critical systems, ensuring uptime in alignment with these standards.
  • Deploy and manage monitoring tools to gain insights into system health and performance.
  • Analyze performance, identify bottlenecks, and implement solutions to improve scalability and reduce latency.
  • Develop scripts, tools, and automation frameworks to reduce manual intervention in deployment, monitoring, and scaling.
  • Partner with development teams to design and implement observability practices such as logging, metrics, and tracing, enabling proactive issue diagnosis and resolution.
  • Create actionable alerts in monitoring systems to ensure rapid response to potential production incidents.
  • Forecast resource requirements and provision effectively for current and future demand.
  • Design and execute chaos experiments to test system resiliency against failures.
  • Own, define, and implement Disaster Recovery (DR) processes for systems, including conducting both planned and unplanned DR drills to ensure preparedness.
  • Ensure adherence to security best practices during system design and operations.
  • Maintain and update documentation, playbooks, and process guidelines.
  • Publish KPI reports and system health updates regularly for business stakeholders.

Requirements:

  • Must-have: Bachelor’s degree (preferably in Computer Science or a related field), or equivalent experience.
  • Must-have: 12+ years of overall IT experience.
  • Must-have: 7+ years of proven work experience as a Senior Site Reliability Engineer or similar role.
  • Must-have: 5+ years of AWS Cloud experience with certifications such as AWS Certified DevOps Engineer, SysOps, or Security.
  • Must-have: 3+ years of experience using a wide range of AWS technologies (e.g., EC2, RDS, ELB, S3, VPC, CloudWatch, Monitoring Tools) to develop and maintain secure AWS-based cloud solutions.
  • Must-have: 2+ years of experience in CDN and/or cache systems such as Fastly, Akamai, or CloudFront.
  • Strong expertise in cloud deployments (AWS / Docker / Kubernetes).
  • Hands-on experience with IaC and automation tooling such as Terraform, Chef, and Ansible, plus scripting in Shell, Groovy, Python, etc.
  • Experience with monitoring systems like CloudWatch, New Relic, Datadog/Splunk, and the ELK stack.
  • Experience managing cloud network resources (AWS preferred), including VPC, URL proxies, PrivateLink, DNS, ACLs, firewalls, and C2S access points.
  • Operational knowledge of CI/CD tools such as GitHub Actions and Jenkins.
  • Experience with additional tools such as JIRA, Bitbucket, Fortify, SonarQube, Nexus, and Nexus IQ.
  • Experience with configuration automation tools like Puppet, Ansible, Chef, or Salt.
  • Strong scripting and automation skills (e.g., Bash, Python).
  • Proficiency in Windows and Linux system administration.
  • Strong analytical and problem-solving skills to resolve complex infrastructure and deployment issues.
  • Excellent attention to detail, verbal/written communication, and documentation skills.

Forbes Advisor

Consumer Services

Jersey City, New Jersey
