Job Description
Role Overview:
PhonePe Limited is looking for a skilled Site Reliability Engineer - Database with 4 to 8 years of experience to join its team. In this role, you will be responsible for the design, provisioning, and lifecycle management of large-scale MySQL/Galera multi-master clusters across multiple geographic locations. You will ensure the resilience, scalability, and performance of a distributed, high-volume database infrastructure while driving strategic improvements to it.

Key Responsibilities:
- Lead the design, provisioning, and lifecycle management of large-scale MySQL/Galera multi-master clusters across multiple geographic locations.
- Develop and implement database reliability strategies, including automated failure recovery and disaster recovery solutions.
- Investigate and resolve database-related issues, including performance problems, connectivity issues, and data corruption.
- Own and continuously improve performance tuning (query optimization, indexing, and resource management), security hardening, and high availability of database systems.
- Standardize and automate database operational tasks such as upgrades, backups, schema changes, and replication management.
- Drive capacity planning, monitoring, and incident response across the infrastructure.
- Proactively identify, diagnose, and resolve complex production issues in collaboration with the engineering team.
- Participate in and enhance on-call rotations, implementing tools to reduce alert fatigue and human error.
- Develop and maintain observability tooling for database systems.
- Mentor and guide junior SREs and DBAs, fostering knowledge sharing and skill development within the team.

Qualifications Required:
- Expertise in Linux systems administration, scripting (Bash/Python), file systems, disk management, and debugging system-level performance issues.
- 4+ years of hands-on experience in MySQL database administration in large-scale, high-availability environments.
- Deep understanding of MySQL internals, the InnoDB storage engine, replication mechanisms (async, semi-sync, Galera), and tuning parameters.
- Proven experience managing 100+ production clusters and databases larger than 1 TB in size.
- Hands-on experience with Galera clusters is a strong plus.
- Familiarity with Infrastructure-as-Code tools such as Ansible, Terraform, or similar.
- Experience with observability tools such as Prometheus, Grafana, or Percona Monitoring & Management.
- Exposure to NoSQL databases (e.g., Aerospike) is a plus.
- Experience working in on-premise environments is highly desirable.