Overview
Monitor and manage system reliability and performance, ensuring systems meet SLOs. Communicate reliability concerns and their potential impact to key stakeholders. Promote the prioritization of reliability throughout the software development life cycle. Design, code, test, and deliver solutions to automate manual operations. Participate in on-call rotations, provide support for SRE systems, and lead or participate in post-mortem incident analysis.

Key Responsibilities:
Demonstrate and innovate SRE practices by collaborating with stakeholders to implement important SRE principles and objectives and create new practices where applicable.
Partner with product and platform teams to define and track service level objectives (SLOs) and service level indicators (SLIs).
Engage in system design, capacity planning, and architecture discussions to ensure operational requirements are met.
Share lessons learned and best practices regarding reliability and performance with stakeholders and team members.
Assist in training and mentoring junior SREs to ensure best practices are followed and scaled within the organization.
Pursue continuous improvement opportunities to stay up to date on SRE methods and trends, and participate in organizational learning initiatives.
Support governance and ensure compliance with policies by collaborating with security, compliance, and other teams.
Respond promptly to requests for assistance from technical customers, providing engineering support and best-practice guidance.
Adhere to and suggest improvements to standard operating procedures, and advocate for automation and workflow optimization.

Operational Resiliency Architect:
Support application health, performance, and capacity.
Assist in system design consulting, capacity planning, and launch reviews.
Collaborate with development and product teams to establish monitoring and alerting strategies.

DevOps/Automation:
Engage in development, automation, and business process improvement.
Support CI/CD pipelines and promote software into higher environments.
Increase automation and tooling to reduce manual intervention.

ITSM Practices:
Analyze ITSM activities and provide feedback to development teams on operational gaps or resiliency concerns.
Perform root cause analysis of incidents and work with development teams to resolve issues.

Preferred Skills and Experience:
Coding experience in one or more programming languages such as Java, Python, or Go.
Familiarity with cloud platforms like AWS, Azure, or GCP.
Experience with message queue (MQ) technologies like RabbitMQ, Kafka, or similar.
Experience with observability tools like Splunk, Dynatrace, Prometheus, or Datadog.
Knowledge of industry-standard CI/CD tools like Git/Bitbucket, Jenkins, Maven, and Artifactory.
Understanding of client-server relationships, network concepts, and operating system navigation.
Familiarity with Kubernetes and configuration management tools.

General Skills and Competencies:
Ability to work with development, operations, and product teams.
Strong verbal and written communication skills, including the ability to explain technical issues to non-technical audiences.
Critical thinking skills and a proactive approach to problem-solving.
A mindset geared towards continuous improvement and learning.
Ability to work effectively in a team and share best practices.