Urgent Hiring!!
Location : Remote
Role : Staff Engineer - SRE
Experience : 10+ years
The Site Reliability Engineering (SRE) team is responsible for the reliability, scalability, stability and performance of systems and services.
- They work with cross-functional teams to design, build and maintain systems and they
troubleshoot issues when they arise. They bridge the gap between development and
operations teams.
- They work closely with business teams to define Service Level Objectives
(SLOs) and Service Level Agreements (SLAs) for critical systems. They also monitor and
maintain the uptime of these systems in line with the defined SLOs and SLAs.
- They deploy and manage monitoring tools to gain insights on system health and
performance.
- They analyze performance, identify bottlenecks and implement solutions to
improve a system’s scalability and reduce its latency.
- They develop scripts and implement tools and automation frameworks to reduce the
manual effort involved in deployment, monitoring and scaling.
- They work with development teams to design and develop observability
practices such as logging, metrics and tracing. They aim to diagnose and troubleshoot
issues proactively.
- They create actionable alerts on monitoring systems to ensure rapid response to
potential production incidents.
- They forecast resource needs and provision adequately for current and future demand.
- They design and execute “chaos experiments” to test a system’s resilience to failure.
- They own, define and implement the Disaster Recovery (DR) processes for systems.
- They also conduct planned and unplanned mock DR drills to test for response
preparedness during production incidents.
- They ensure that security best practices are followed and implemented during design
and operations of systems.
- They also own and maintain documentation of processes, playbooks, and systems.
- They publish KPI reports and other system health updates on a regular basis to the
business.
Requirements
- Must-have - Bachelor's degree, preferably in CS or a related field, or equivalent
experience
- Must-have - 12+ years of overall IT experience
- Must-have - 7+ years of proven work experience as a Senior Site Reliability Engineer or
in a similar position.
- Must-have - 5+ years of AWS Cloud experience, with an AWS certification such as
DevOps Engineer, SysOps or Security.
- Must-have - AWS experience - 3+ years of experience using a broad range of AWS
technologies (e.g. EC2, RDS, ELB, S3, VPC, CloudWatch & Monitoring Tools) to develop
and maintain an AWS-based cloud solution, with an emphasis on best-practice cloud security.
- Must-have - 2+ years of experience with CDN and/or cache systems like Fastly, Akamai,
CloudFront, etc.
- Proven understanding of and strong experience with cloud deployments (AWS / Docker /
Kubernetes)
- Knowledge of IaC and provisioning tools and languages like Terraform, Chef, Ansible,
Shell, Groovy, Python, etc.
- Experience with monitoring systems such as CloudWatch, New Relic, Datadog/Splunk,
ELK stack.
- Experience managing cloud network resources (AWS preferred) such as CloudWatch,
VPC, URL proxies, PrivateLink, DNS, ACLs, firewalls, and C2S access points.
- Platform or application engineering and operational knowledge of CI/CD
tooling like GitHub Actions, Jenkins, etc.
- Experience with other tooling like Jira, Bitbucket, Jenkins, Fortify,
SonarQube, Nexus, Nexus IQ
- Experience with configuration automation tools like Puppet/Ansible/Chef/Salt
- Scripting Skills: Strong scripting (e.g. Bash & Python) and automation skills.
- Operating Systems: Windows and Linux system administration.
- Problem Solving: Ability to analyze and resolve complex infrastructure resource and
application deployment issues
- Strong attention to detail. Excellent verbal and written communication skills. Strong
documentation skills.
Good To Have
- Experience with Terraform/Ansible/Chef/Puppet
- Experience with GitHub Actions
- Experience with CloudFront, Fastly
- Oversees team members performing these functions
- Anticipates problems and future technical needs and takes necessary steps to address
issues.
- Works primarily in server-side technologies and is comfortable with client-side work
whenever required
- Enthusiastically follows technology trends and software engineering best practices
Perks
- Day off on the 3rd Friday of every month (one long weekend each month)
- Monthly Wellness Reimbursement Program to promote health and well-being
- Paid paternity and maternity leave
Notice Period: Immediate to 30 days
Email to : sharmila.m@aptita.com