Job Description
Site Reliability Engineer - System
Experience: 7+ Years
Summary
We are seeking a skilled and proactive Site Reliability Engineer (SRE) to join our team. The ideal candidate will have extensive experience in Linux systems administration, an understanding of database management, and a proven track record of troubleshooting complex, system-level issues. You will be responsible for ensuring the reliability, performance, and scalability of our production environments, balancing system and database stability through robust monitoring, debugging, and automation practices.
Responsibilities:
Lead incident response and resolution: Proactively troubleshoot, debug, and resolve complex system-level incidents and outages, encompassing Linux operating systems, applications, and database technologies.
Conduct deep-dive root cause analysis: Perform thorough post-incident analysis to identify underlying issues in production environments, implementing sustainable solutions.
Design and implement robust monitoring: Develop, maintain, and enhance comprehensive system and database monitoring, alerting, and observability solutions (e.g., Grafana, Prometheus, PMM).
Drive automation and efficiency: Automate Linux system administration tasks, operational runbooks, and database maintenance to improve system reliability, consistency, and operational efficiency.
Collaborate on resilient deployments: Partner with development and engineering teams to ensure seamless, reliable, and secure software deployments and infrastructure changes.
Architect scalable infrastructure: Contribute to the architectural design and implementation of highly scalable, resilient, and performant infrastructure solutions.
Enhance on-call effectiveness: Participate in and continuously improve on-call rotations, developing tools and processes to reduce alert fatigue and minimize human error.
Foster technical growth: Mentor and guide junior Site Reliability Engineers (SREs), promoting knowledge sharing and skill development within the team.
Qualifications:
Extensive Linux Expertise: Proven experience in advanced Linux systems administration, including a deep understanding of file systems, kernel tuning (sysctl), and performance optimization.
Advanced Troubleshooting & Debugging: Exceptional ability to debug and rapidly resolve complex, distributed system-level issues in high-pressure production environments.
Configuration Management: Hands-on experience with industry-standard configuration management tools (e.g., SaltStack, Ansible, Puppet).
Load Balancing & Proxying: Practical experience with load balancing technologies (e.g., Nginx, HAProxy, LVS) and their configuration for high availability.
Containerization & Orchestration: Strong understanding and practical experience with containerization (e.g., Docker) and container orchestration platforms (e.g., Kubernetes, Mesosphere).
Monitoring & Alerting Tooling: Proficiency in implementing, maintaining, and leveraging system and database monitoring platforms (e.g., Grafana, Prometheus, PMM) and custom scripting for alerts.
Automation & Scripting Mastery: Highly proficient in developing automation solutions using scripting languages (e.g., Python, Shell scripting, Go) for operational tasks.
Networking Fundamentals: Solid understanding of core networking concepts and protocols (e.g., TCP/IP, DNS, DHCP, BGP, iptables, IP and routing protocols).
Database Administration Fundamentals: Strong grasp of relational database concepts and practical experience with database administration principles.
Preferred Qualifications
Cloud Infrastructure Experience: Experience managing and troubleshooting private/on-premise cloud environments, with a focus on identifying and mitigating hardware-related issues and their impact.
Relational Database Specialization: Deep practical experience with MariaDB, Percona Server, and/or MySQL, encompassing advanced database administration, performance tuning, and complex replication topologies.
Backup & Recovery Expertise: Hands-on experience with robust backup and restore technologies, including ZFS.
Message Queuing Systems: Familiarity with message queuing systems like RabbitMQ (RMQ).
PhonePe Full Time Employee Benefits (Not applicable for Intern or Contract Roles)
Insurance Benefits - Medical Insurance, Critical Illness Insurance, Accidental Insurance, Life Insurance
Wellness Program - Employee Assistance Program, Onsite Medical Center, Emergency Support System
Parental Support - Maternity Benefit, Paternity Benefit Program, Adoption Assistance Program, Day-care Support Program
Mobility Benefits - Relocation benefits, Transfer Support Policy, Travel Policy
Retirement Benefits - Employee PF Contribution, Flexible PF Contribution, Gratuity, NPS, Leave Encashment
Other Benefits - Higher Education Assistance, Car Lease, Salary Advance Policy
Working at PhonePe is a rewarding experience! Great people, a work environment that thrives on creativity, and the opportunity to take on roles beyond a defined job description are just some of the reasons you should work with us. Read more about PhonePe on our blog.
Life at PhonePe
PhonePe in the news