Job Description
You will partner with product owners and business subject matter experts to analyze business needs and improve the supportability, scalability, and recoverability of the engineered solution, ensuring that the technical solution aligns with business requirements and the operations team's methodologies. You will drive improvements in service availability through automation that reduces mean time to recovery, and you will develop methods for autonomous recovery and self-repairing systems that are consistent with RFPIO architecture, design, and development standards. Coordinating and planning system releases and hotfixes is within your scope, as is developing methods for simplified triage through checklists, runbooks, and standard operating procedures.

You will adopt new methodologies that enhance business flexibility and agility, and support software development by contributing operational enhancements to non-functional requirements. You will be expected to improve service levels using key performance indicators, monitoring, non-functional testing, and availability reports, with a service-focused approach built on continuous process improvement. You will also participate in chaos testing to improve system resiliency, mentor other engineers, and provide overall technical leadership to smaller working teams when necessary. Staying abreast of the latest development tools, technology ideas, patterns, and methodologies is essential, as is sharing knowledge by communicating results and ideas effectively to key stakeholders.

You should have 3 to 5 years of experience in a Site Reliability Engineering, DevOps, or infrastructure-focused role.
Experience supporting internet-facing production services and distributed systems is required. Proficiency in implementing and coordinating telemetry with monitoring and observability tools such as Splunk, Grafana, or Prometheus is necessary. Coding experience in a high-level programming language such as Java or Python is a must, as is experience managing, scaling, and troubleshooting Java applications. You should be an advocate for automation who believes in reducing operational load through software, and you should bring a strong sense of ownership. An understanding of service deployment packaging, strategies, and tooling is required. Proficiency with common authentication schemes, certificates, and secure secrets management is essential. You must be able to design and implement automated configuration management processes for repeatable and consistent service deployment.

Familiarity with cloud infrastructure concepts such as zones, regions, and VPCs is beneficial. A bachelor's or master's degree in Computer Science, or equivalent industry experience, is preferred. Prior experience as a Site Reliability Engineer, software engineer, DevOps engineer, or system administrator is advantageous, as is experience with system automation technology such as Ansible, container technologies, and cloud services.