SRE - AWS
Job Summary
We are looking for an experienced and driven Senior Site Reliability Engineer (SRE) to architect, implement, and maintain robust cloud infrastructure. This role demands a deep understanding of AWS, Kubernetes, ECS, and the ability to build scalable, secure, and highly available infrastructure from scratch. The ideal candidate will be a strong advocate for DevOps principles, automation, and reliability, and will possess the skills to support and optimize complex microservices-based architectures.
Key Responsibilities
Infrastructure Design & Implementation
- Design and build highly scalable, fault-tolerant, and secure cloud infrastructure using AWS, Kubernetes, and ECS.
- Lead efforts in infrastructure as code (IaC) using tools like Terraform or CloudFormation.
- Develop and enforce best practices for infrastructure provisioning, security, and cost optimization.
System Reliability & Performance
- Ensure availability, performance, scalability, and security of production systems.
- Implement observability strategies including monitoring, logging, and alerting using tools such as Prometheus, Grafana, ELK, or Datadog.
- Analyse system performance metrics and proactively identify potential issues and bottlenecks.
DevOps & Automation
- Build and maintain CI/CD pipelines to streamline code deployments across environments.
- Drive automation in infrastructure provisioning, configuration management, and operational tasks.
- Ensure repeatable and reliable deployments using containers and orchestration tools like Kubernetes and ECS.
Service Management
- Own the SRE lifecycle, including incident management, postmortems, root cause analysis, and runbook creation.
- Collaborate closely with development and QA teams to ensure seamless microservices integration, deployment, and lifecycle management.
- Maintain service-level objectives (SLOs), service-level agreements (SLAs), and error budgets.
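As a rough illustration of the error-budget arithmetic behind that last responsibility, here is a minimal Python sketch. The function names, the 30-day window, and the 99.9% target are hypothetical choices for illustration, not figures taken from this role.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction, e.g. 0.999 for "three nines" (assumed example).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% target over 30 days allows about 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After 10.8 minutes of downtime, about 75% of the budget is left.
    print(round(budget_remaining(0.999, 10.8), 2))
```

In practice the downtime figure would come from monitoring data rather than a hard-coded value, and teams often track budget burn rate as well as the remaining fraction.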
Security & Compliance
- Implement and enforce cloud security best practices for networking, identity and access management, and data protection.
- Support audits, compliance assessments, and vulnerability remediation.
- Monitor for security anomalies and work with security teams to respond to threats.
Technical Skills
- 6+ years of hands-on experience in Site Reliability Engineering, DevOps, or Cloud Engineering.
- Expertise in AWS services such as EC2, S3, RDS, IAM, VPC, Lambda, and CloudWatch.
- Strong knowledge of Kubernetes and container orchestration best practices.
- Experience managing services on Amazon ECS (Fargate or EC2).
- Proficient in infrastructure-as-code tools like Terraform, CloudFormation, or Pulumi.
- Skilled in scripting languages such as Python, Bash, or Go.
- Solid grasp of networking, load balancing, DNS, and firewall rules in cloud environments.
- Deep understanding of microservices architectures, API gateways, and service meshes.
Soft Skills
- Proven leadership and cross-functional collaboration skills.
- Strong problem-solving and incident-resolution mindset.
- Clear communication, documentation, and stakeholder reporting abilities.
- Passion for continuous improvement and automation.
Preferred Qualifications
- AWS certifications such as AWS Certified DevOps Engineer, Solutions Architect – Professional, or equivalent.
- Familiarity with service meshes like Istio or Linkerd.
- Experience with serverless architectures and event-driven systems.
- Knowledge of regulatory compliance (SOC2, ISO 27001, GDPR) in cloud environments.
Skills – AWS Cloud, CI/CD, EC2, Kubernetes, Grafana, Datadog, Python
Key Responsibilities:
Cloud Platform: GCP
- Infrastructure Automation: Design, implement, and manage infrastructure as code using Terraform to provision and manage GCP resources.
- Container Orchestration: Deploy and manage Kubernetes clusters, ensuring efficient operation of containerized applications.
- Continuous Integration/Continuous Deployment (CI/CD): Develop and maintain CI/CD pipelines using Jenkins to automate application build, test, and deployment processes.
- Containerization: Collaborate with development teams to containerize applications using Docker and manage deployments with Helm Charts.
- Code Quality Assurance: Integrate and manage SonarQube to ensure code quality and security standards are met.
- Monitoring and Logging: Implement and manage monitoring solutions using Datadog to ensure system health, performance, and security.
- Collaboration: Work closely with cross-functional teams, including developers, QA, and operations, to streamline processes and improve productivity.
Requirements:
- Experience: 5+ years in DevOps or cloud engineering roles, with at least 3 years of relevant experience in the specified technologies.
- Technical Proficiency:
o Hands-on experience with GCP services and architecture.
o Proficiency in Terraform for infrastructure as code implementations.
o Strong understanding and experience with Kubernetes and Docker.
o Experience in setting up and managing CI/CD pipelines using Jenkins.
o Familiarity with Helm Charts for application deployment.
o Experience with SonarQube for code quality analysis.
o Proficiency in monitoring and logging tools, particularly Datadog.
- Scripting Skills: Proficiency in scripting languages such as Bash or Python is an added advantage.
- Soft Skills:
o Strong problem-solving abilities and analytical thinking.
o Excellent communication skills, both verbal and written.
o Ability to work collaboratively in a team environment.
o Strong organizational and time management skills.
Skills – Terraform, Kubernetes, Cluster, Docker, GCP, SonarQube