Site Reliability Engineer

6 years

0 Lacs

Posted: 4 hours ago | Platform: LinkedIn

Work Mode

On-site

Job Type

Full Time

Job Description

Position Description

This role is responsible for ensuring the availability, reliability, performance, and scalability of cloud and network systems. The engineer will focus on automating manual processes, improving system resilience, and driving operational excellence across distributed platforms. The position requires strong technical leadership, hands-on development skills, and the ability to collaborate with cross-functional teams.

Key Responsibilities

  • Design, develop, configure, and deploy code to enhance service reliability for new and existing systems, maintaining high standards of code quality.
  • Conduct code reviews and provide actionable feedback to improve development practices.
  • Lead troubleshooting, debugging, and architectural analysis of complex systems.
  • Automate routine operational tasks to improve efficiency and reduce manual effort.
  • Participate in an on-call rotation and lead incident response efforts when required.
  • Create and maintain documentation including design specifications, runbooks, playbooks, and system analysis reports.
  • Implement and manage SRE monitoring backend systems using Go (Golang), Postgres, and OpenTelemetry.
  • Develop tooling and automation using Terraform and other Infrastructure-as-Code tools to enhance observability and proactive issue detection.
  • Optimize and manage infrastructure hosted on Google Cloud Platform (GCP), ensuring performance, cost efficiency, and scalability.
  • Partner with development teams to improve reliability and performance using platform engineering best practices.
  • Build and maintain automated solutions for monitoring, performance tuning, disaster recovery, and operational tasks.
  • Troubleshoot and resolve issues across development, testing, and production environments.
  • Lead post-incident reviews and implement preventive measures to avoid recurrences.
  • Apply and maintain security best practices, support compliance efforts, and participate in security audits and vulnerability assessments.
  • Identify performance bottlenecks using code profiling, configuration tuning, and system analysis.
  • Monitor, analyze, and improve system performance through metrics and observability tools.
  • Contribute to internal knowledge bases and technical documentation.

Required Skills

  • Programming: Go (Golang), API Development
  • Cloud/Monitoring: Experience with GCP, monitoring frameworks, and system reliability tools
  • Infrastructure Automation: Terraform or similar IaC tools

Preferred Skills

  • Dynatrace
  • Google Cloud Platform (advanced)

Experience Required

  • 6+ years of overall IT experience
  • 4+ years in software development
  • Practical experience in two programming languages or advanced expertise in one

Education

  • Bachelor’s Degree (required)

Skills: infrastructure, code, reliability, automation, cloud
