Job Description
We are seeking individuals who can offer informed and unique perspectives, enjoy collaborating with cross-functional teams, and continuously push boundaries to create reliable, scalable solutions and enhance user experiences.

Your main responsibilities will include analyzing the technologies currently used within the company and devising monitoring and notification tools to improve observability. You will ensure system stability by proactively identifying failure scenarios and implementing solutions that reduce MTTR. Developing solutions to boost system performance, with a strong emphasis on high availability, scalability, and resilience, will be a key focus. You will also integrate telemetry and alerting platforms to monitor and improve system reliability, and you will adhere to industry best practices for system development, configuration management, and deployment. Additionally, you will facilitate information flow between teams by documenting acquired knowledge. Staying current with modern technologies and trends will enable you to advocate for their incorporation into products where they bring value. In incident management, you will troubleshoot production issues, conduct root cause analysis (RCA), and actively share insights to improve system reliability and internal knowledge.

The ideal candidate has experience troubleshooting and optimizing high-performance microservices architectures running on Kubernetes and AWS in highly available production environments. A minimum of 5 years of experience in software development using languages such as Python, Java, or Go, with a strong foundation in data structures, algorithms, problem-solving, and complexity analysis, is required. The SRE selection process includes a coding challenge.
You should be curious and proactive in identifying performance bottlenecks, scalability issues, and resilience problems, and adept at resolving them. Familiarity with observability tools and data collection is essential. Knowledge of databases such as RDS, NoSQL stores, and distributed TiDB is preferred. Strong communication skills, a collaborative approach, and a proactive attitude toward delivering results are highly valued, as is a willingness to embrace challenges and see them through to completion.

Preferred qualifications include expertise in container image management and optimization; experience in large distributed system architecture and capacity planning; an understanding of Infrastructure as Code (IaC) and automation tools such as Terraform and CloudFormation; a background in SRE/DevOps concepts and implementation; and proficiency in managing monitoring tools such as CloudWatch, VictoriaMetrics, and Prometheus, and in reporting with Snowflake and Sigma. In-depth knowledge of web technologies such as CloudFront and Nginx, and experience designing, implementing, or maintaining disaster recovery strategies and multi-region architectures for high availability, resilience, and business continuity across critical systems, are advantageous. Proficiency in Japanese and English is a plus, although language skills are not mandatory because professional translators are available.

**Working Conditions**

**Employment Status:** Full Time

**Office Location:** Gurugram (WeWork)

The development center requires your presence at the Gurugram office to help establish a strong core team.