Role and Responsibilities
Look for more of the following:
- Digital experience: anything involving web or mobile apps, or APIs
- Engineering experience with programming languages such as Python and JavaScript (this is a good one to look for)
- Longer tenure in previous roles
Reporting to Engineering, the Site Reliability Engineer will play a critical role in driving innovation and growth for the Banking Solutions, Payments and Capital Markets business. In this role, the candidate will have the opportunity to make a lasting impact on the company's transformation journey, drive customer-centric innovation and automation, and position the organization as a leader in the competitive banking, payments and investment landscape. Specifically, the Site Reliability Engineer will be responsible for the following:
- Design and maintain monitoring solutions and alerting mechanisms for infrastructure, application performance, and user experience metrics, enabling proactive issue detection and mitigation.
- Implement automation tools and processes to automate routine tasks, scale infrastructure, and ensure seamless deployments, updates, and rollbacks with minimal user impact.
- Ensure the reliability, availability, and performance of applications and services, focusing on minimizing downtime, optimizing response times, and maintaining high availability for users.
- Lead incident response, including identification, triage, resolution, and post-incident analysis to prevent recurrence and improve system resilience.
- Conduct capacity planning, performance tuning, and resource optimization across environments, collaborating with development and operations teams to meet scalability and performance goals.
- Collaborate with security teams to implement security best practices, perform vulnerability assessments, and ensure compliance with security standards and regulatory requirements for applications.
- Manage deployment pipelines, release processes, and configuration management for app deployments, ensuring consistency, reliability, and version control across environments.
- Identify areas for improvement in reliability, performance, and efficiency through data analysis, root cause analysis, and trend analysis, and drive initiatives to enhance system reliability and operational efficiency.
- Create and maintain documentation, runbooks, and knowledge base articles for operational procedures, troubleshooting guides, and best practices, and promote knowledge sharing within the team.
- Develop and test disaster recovery plans, backup strategies, and failover mechanisms for app services, ensuring business continuity and data integrity in case of failures or disasters.
- Collaborate with development, QA, DevOps, and product teams to ensure alignment on reliability goals, performance metrics, release schedules, and incident response processes.
- Participate in on-call rotations and provide 24/7 support for critical incidents, troubleshoot issues, and coordinate with teams for resolution, escalation, and follow-up actions as per defined SLAs.
Professional Qualifications
- Proficient in development technologies, architectures, and platforms (web, API) to understand system complexities and performance considerations.
- Experience with cloud platforms (e.g., AWS, Azure, Google Cloud) and infrastructure as code (IaC) tools for managing app infrastructure and deployments.
- Knowledge of monitoring tools (e.g., Prometheus, Grafana, Datadog, New Relic) and logging frameworks (e.g., Splunk, Sumo Logic, ELK Stack) for real-time visibility into system health, performance metrics, and user experience.
- Experience in incident management, including incident response, triage, root cause analysis (RCA), and post-mortem reviews to prevent recurring issues.
- Strong troubleshooting skills to diagnose complex technical issues in app environments, infrastructure, networking, and performance bottlenecks.
- Proficiency in scripting languages (e.g., Python, Bash) and automation tools (e.g., Terraform, Ansible) for automating routine tasks, deployments, and infrastructure management.
- Experience in implementing continuous integration/continuous deployment (CI/CD) pipelines for apps using tools like Jenkins, GitLab CI/CD, or Azure DevOps.
- Expertise in setting up monitoring solutions, configuring alerts, and creating dashboards to monitor system performance, application metrics, and user experience.
- Familiarity with APM (Application Performance Monitoring) tools to analyze app performance, identify bottlenecks, and optimize resource utilization.
- Familiarity with RUM (Real User Monitoring) for tracking and analyzing user interaction and system performance.
- Commitment to continuous learning, staying updated with industry trends, new technologies, and best practices in app reliability, performance, and operations.
- Adaptability to evolving requirements, technologies, and business needs, with a focus on driving continuous improvement and operational excellence.
Personal Characteristics
- Demonstrates judgment and flexibility; thinks issues through and develops solutions that take the broader context into account; deals positively with shifting demands on time and priorities and with rapidly changing environments.
- Takes an ownership approach to engineering and product outcomes.
- Action-oriented self-starter who can set strategy and drive execution with a “roll up the sleeves” approach.
- Excellent interpersonal communication, negotiation and influencing skills to work effectively with all stakeholders (internal & external), making information-based decisions.
- Penchant for excellence, both personally and professionally, demonstrated by intellectual curiosity, record of accomplishment, and reputation; shows strong attention to detail and implementation of best practices with an inclination for continuous improvement.
- Ability to quickly establish strong credibility with employees, business partners and external resources.
- Embodies and delivers the firm's values and culture towards colleagues, clients, and communities:
  - Win as one team
  - Lead with integrity
  - Be the change
Team,
Look for and ask about the skills below when screening SRE candidates.
| Skill Area | Description | Must-Have |
|---|---|---|
| Programming & Scripting | Proficiency in Python, Go, Java, or similar | Yes |
| Automation & Tooling Development | Building internal tools, bots, CLIs, scripts to automate ops | Yes |
| CI/CD Pipeline Engineering | Designing and maintaining robust pipelines (GitHub Actions, Jenkins, ArgoCD, etc.) | Yes |
| Observability Tooling | Writing custom exporters, alerts, or improving logging/tracing infra | Yes |
| Platform Engineering | Building self-service platforms (e.g., developer portals, internal PaaS) | Yes |
| Developer Experience (DevEx) | Reducing friction for engineers: faster builds, better onboarding, tooling | Yes |
| Infrastructure as Code (IaC) | Comfortable using IaC (Terraform/CDK), but not the main focus | Good to have |
| Incident Tooling | Building/improving tooling for incident response, runbooks, auto-remediation | Yes |
| API & Service Reliability | Writing code to ensure availability, retries, graceful degradation | Yes |
| Performance Engineering | Profiling code, reducing latency in user-facing or platform services | Yes |
| Testing for Reliability | Writing integration tests, chaos tests, synthetic checks | Yes |
| Cloud-Native Dev (AWS/GCP/Azure) | Uses SDKs and APIs programmatically, not just infra setup | Yes |
| Security Engineering (DevSecOps) | Implements security controls in code (e.g., secret scanning, access guards) | Good to have |
| Documentation & Dev Portals | Documents systems, APIs, contributes to internal wikis or portals | Yes |
| Collaboration with Developers | Works closely with dev teams to integrate reliability into SDLC | Yes |