Job Title: Site Reliability Engineer
Department: Engineering / Infrastructure
Reports To: SRE Manager / DevOps Lead
Location: Bangalore, India
Role Summary
The Site Reliability Engineer (SRE) is responsible for the availability, performance, and scalability of critical systems. The role involves managing CI/CD pipelines, monitoring production environments, automating operations, and improving platform reliability in collaboration with development and infrastructure teams.
Key Responsibilities
- Manage alerting and monitoring for critical production systems.
- Operate and enhance CI/CD pipelines and improve deployment and rollback strategies.
- Work with central platform teams on reliability initiatives.
- Automate testing, regression, and build tooling for operational efficiency.
- Execute non-functional requirements (NFR) testing on production systems.
- Plan and implement Debian version migrations with minimal disruption.
Required Qualifications & Skills
- CI/CD and Packaging Tools:
  - Hands-on experience with Jenkins, Docker, and JFrog for packaging and deployment.
- Operating System Expertise:
  - Experience with Debian OS migration and upgrade processes.
- Monitoring Systems:
  - Knowledge of Grafana, Nagios, and other observability tools.
- Configuration Management:
  - Proficiency with Ansible, Puppet, or Chef.
- Version Control:
  - Working knowledge of Git and related version control systems.
- Kubernetes:
  - Deep understanding of Kubernetes architecture, deployment pipelines, and debugging.
  - Ability to deploy components with detailed insight into:
    - Configuration parameters and system requirements
    - Monitoring and alerting needs
    - Performance tuning
    - Designing for high availability and fault tolerance
- Networking:
  - Understanding of TCP/IP, UDP, multicast, and broadcast.
  - Experience with tcpdump and Wireshark for network diagnostics.
- Linux & Databases:
  - Strong skills in Linux tools and scripting.
  - Familiarity with MySQL and NoSQL database systems.
Soft Skills
- Strong problem-solving and analytical skills
- Effective communication and collaboration with cross-functional teams
- Ownership mindset and accountability
- Adaptability to fast-paced and dynamic environments
- Detail-oriented and proactive approach
Preferred Qualifications
- Bachelor’s degree in Computer Science, Engineering, or related technical field
- Certifications in Kubernetes (CKA/CKAD), Linux, or DevOps practices
- Experience with cloud platforms (AWS, GCP, Azure)
- Exposure to service mesh, observability stacks, or SRE toolkits
Key Relationships
- Internal: DevOps, Infrastructure, Software Development, QA, Security Teams
- External: Tool vendors, platform service providers (if applicable)
Role Dimensions
- Impact on uptime and reliability of business-critical services
- Ownership of CI/CD and production deployment processes
- Contribution to cross-team reliability and scalability initiatives
Success Measures (KPIs)
- System uptime and availability (SLA adherence)
- Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) for incidents
- Deployment success rate and rollback frequency
- Automation coverage of operational tasks
- Completion of OS migration and infrastructure upgrade projects
Competency Framework Alignment
- Technical Mastery: Infrastructure, automation, CI/CD, Kubernetes, monitoring
- Execution Excellence: Timely project delivery, process improvements
- Collaboration: Cross-functional team engagement and support
- Resilience: Problem solving under pressure and incident response
- Innovation: Continuous improvement of operational reliability and performance