Site Reliability Engineer - 2

3.0 - 5.0 years

5.0 - 7.0 Lacs P.A.

Bengaluru

Posted: 1 week ago | Platform: Naukri

Skills Required

Unix, Automation, Linux, Networking, Architecture, Information security, Troubleshooting, Distribution system, Monitoring, Python

Work Mode

Work from Office

Job Type

Full Time

Job Description

Scale Your Impact: Join MoEngage as a Site Reliability Engineer (SRE-2)!

About MoEngage

Join the league of engineers building and scaling a platform that defines customer engagement for the world's leading brands! MoEngage is the engine behind personalized experiences for over 900 million consumers, processing a staggering 80 billion messages monthly. We're not just a SaaS company; we're pioneers in intelligent customer engagement, and our commitment to delivering a reliable, high-performance platform at massive scale is unwavering.

At MoEngage, you'll be part of a team that thrives on technical excellence, ownership, and continuous improvement. We're a company that invests in its people and technology, fostering an environment where you can tackle complex challenges, contribute to significant architectural decisions, and directly influence the success of our global customers. If you're looking for a place where your SRE expertise can truly shine and where you can grow into a technical leader, look no further.

The Opportunity: Site Reliability Engineer (SRE-2)

Are you an SRE with a few years under your belt, itching to take on more significant challenges and drive impactful reliability initiatives? Do you have a solid grasp of cloud platforms and container orchestration, and a burning desire to automate everything in sight? As an SRE-2 at MoEngage, you'll be a critical member of our SRE team, responsible for the health and performance of key services and contributing directly to the evolution of our infrastructure at a scale that few engineers get to experience. This is your chance to deepen your technical expertise, take on more ownership, and mentor emerging talent while working on a platform that operates at the cutting edge.

What You'll Do to Keep Our Engines Roaring

Be a Reliability Champion: Take ownership of the reliability, performance, and efficiency of critical services.
Automate, Automate, Automate: Design, develop, and implement robust automation solutions to eliminate toil, streamline operations, and improve system resilience.
Battle Incidents (and Win): Lead troubleshooting efforts for complex production incidents, perform in-depth root cause analysis, and implement sustainable preventative measures.
Sculpt Our Infrastructure: Actively contribute to the design, implementation, and optimization of our cloud infrastructure on AWS and GCP, leveraging your expertise in technologies like Kubernetes.
Enhance Observability: Implement and refine advanced monitoring, alerting, and logging solutions to gain deep insights into system behavior and predict potential issues.
Collaborate for Success: Partner closely with development teams to influence architectural decisions, ensuring reliability, scalability, and security are built in from the start.
Strengthen Our Security Posture: Implement and advocate for advanced security practices within our infrastructure and operational workflows.
Drive Efficiency: Analyze and optimize cloud infrastructure spend, identifying and implementing cost-saving opportunities.
Guide the Next Wave: Mentor and guide SRE-1 engineers, contributing to growth and knowledge sharing within the team.
Be Ready for Action: Participate in our on-call rotation, acting as a key point of escalation and resolution for critical issues.

What Makes You the Ideal Candidate

3-5 years of hands-on experience in Site Reliability Engineering, DevOps, or a similar role with a strong focus on production systems.
Demonstrated expertise in Python or Go, with a proven track record of automating complex tasks.
Strong command of AWS and/or GCP cloud platforms.
In-depth experience with containerization and orchestration using Kubernetes (K8s, ArgoCD, Helm/Kustomize).
Experience with infrastructure-as-code tools like Terraform or Ansible is highly valued.
Solid understanding of, and experience with, monitoring and observability stacks (VictoriaMetrics, Prometheus, Grafana, ELK stack, etc.).
Deep knowledge of Linux/Unix system internals and advanced networking concepts.
Proven ability to diagnose and resolve complex issues in large-scale distributed systems.
A strong understanding of Cloud Security and Information Security principles and best practices.
Experience with cloud cost analysis and optimization techniques.
Familiarity with CI/CD pipelines and GitOps methodologies.
Experience with messaging queues and distributed systems (Celery, Kafka) is a plus.
Excellent communication, collaboration, and problem-solving skills.
A desire to mentor and lead by example.

Marketing Technology
Mumbai
