Lead Engineer - Cloud Reliability

8 - 12 years

0 Lacs

Posted: 3 weeks ago | Platform: Shine

Work Mode

On-site

Job Type

Full Time

Job Description

You will be responsible for the following duties and responsibilities:

- Design, implement, and maintain monitoring, alerting, and logging solutions for webMethods, GemFire, AWS services, and Kubernetes clusters to proactively identify and resolve issues.
- Develop and implement automation for operational tasks, incident response, and system provisioning/de-provisioning.
- Participate in on-call rotations to respond to critical incidents, troubleshoot complex problems, and perform root cause analysis (RCA).
- Identify and eliminate toil through automation and process improvements.
- Conduct performance tuning and capacity planning for all supported platforms.

You should have expertise in the following platforms:

- webMethods: Support, maintain, and optimize webMethods Integration Server, Universal Messaging, API Gateway, and related components. Experience with webMethods upgrades, patching, and configuration management.
- GemFire: Administer and optimize GemFire clusters, ensuring high availability, data consistency, and performance for critical applications. Troubleshoot GemFire-related issues, including cache misses, replication problems, and member failures.
- AWS Cloud: Manage and optimize AWS cloud resources (EC2, S3, RDS, VPC, IAM, CloudWatch, Lambda, etc.) for scalability, security, and cost-efficiency.
- Rancher Kubernetes: Administer, troubleshoot, and optimize Kubernetes clusters managed by Rancher. Experience with Helm charts, Kubernetes operators, ingress controllers, and network policies.

You should possess the following skills, experience, and requirements:

- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- 8+ years of experience in an SRE, DevOps, or highly technical operations role.
- Deep expertise in at least two, and strong proficiency in all, of the following:
  - webMethods Integration Platform (Integration Server, Universal Messaging, API Gateway).
  - VMware GemFire (or other distributed in-memory data grids like Apache Geode, Redis Enterprise).
  - AWS cloud services (EC2, S3, RDS, VPC, CloudWatch, EKS, etc.).
  - Kubernetes administration, particularly with Rancher and EKS.
- Strong scripting and programming skills: Python, Go, Java, Bash.
- Experience with Infrastructure as Code (IaC) tools such as Terraform or CloudFormation.
- Proficiency with CI/CD pipelines (e.g., Jenkins, GitLab CI, AWS CodePipeline).
- Experience with monitoring and logging tools (e.g., Dynatrace, Prometheus, Grafana, ELK Stack, Datadog, Splunk).
- Solid understanding of networking concepts (TCP/IP, DNS, Load Balancing, VPNs).
- Excellent problem-solving, analytical, and communication skills.
- Ability to work effectively in a fast-paced, collaborative environment.

Nice to have skills include:

- Experience with other integration platforms or message brokers.
- Knowledge of other distributed databases or caching technologies.
- AWS Certifications.
- Kubernetes Certifications (CKA, CKAD, CKS).
- Experience with chaos engineering principles and tools.
- Familiarity with agile methodologies.
