As a Software Engineering Architect on the Cloud Incident Team, you will play a critical role in enhancing the resilience, performance, and reliability of Western Union's cloud-based applications. You will work closely with Site Reliability Engineers (SREs), Cloud Operations, and Application Teams to diagnose, remediate, and prevent high-impact incidents across Tier 0 and Tier 1 services. This role requires a deep understanding of cloud-native architectures, incident management, automation, and performance engineering. You will be responsible for designing and implementing scalable solutions that improve fault tolerance, observability, and recovery mechanisms within our cloud ecosystem.
Key Responsibilities
- Lead architecture and design efforts for cloud-based applications, focusing on reliability, scalability, and performance.
- Collaborate with the Cloud Incident Response Team to troubleshoot, mitigate, and prevent critical application failures.
- Implement self-healing, auto-recovery, and failover automation to improve operational resilience.
- Work closely with SREs and DevOps teams to optimize cloud observability, logging, and monitoring strategies.
- Drive incident post-mortem analysis, identifying root causes and systemic issues and developing long-term architectural fixes.
- Develop reusable patterns and best practices for cloud-native application deployments (Kubernetes, serverless, containerization).
- Improve the response time and efficiency of troubleshooting processes through AI-driven incident detection and automation.
- Partner with security teams to ensure compliance with best practices in cloud security, resilience, and operational stability.
- Mentor and guide engineering teams on performance tuning, application scalability, and disaster recovery architecture.
Required Skills and Qualifications
- 12+ years of total experience, including 5+ years in software architecture, application performance engineering, or cloud solutions design.
- Strong expertise in AWS, with a focus on high-availability and disaster recovery architectures.
- Proficiency in Kubernetes, Docker, Terraform, and Infrastructure-as-Code (IaC) best practices.
- Hands-on experience with incident management, observability tools (Splunk, Dynatrace, Prometheus, Grafana), and AIOps-driven troubleshooting.
- Deep understanding of microservices architecture, event-driven systems, and API design.
- Experience with CI/CD pipelines, automation frameworks, and DevSecOps methodologies.
- Strong programming/scripting skills in Python, Java, or Bash.
- Familiarity with SRE principles, the ITIL framework, and cloud-native application reliability best practices.
- Excellent problem-solving skills with a data-driven approach to diagnosing performance bottlenecks and failures.
Preferred Qualifications
- Experience in financial services, fintech, or payment processing environments.
- Expertise in designing real-time transaction monitoring and failure recovery systems.
- Certifications in AWS or related cloud technologies are a plus.
Benefits
- Employees' Provident Fund (EPF)
- Gratuity Payment
- Public holidays
- Annual Leave, Sick Leave, Compensatory Leave, and Maternity/Paternity Leave
- Annual Health Checkup
- Hospitalization Insurance Coverage (Mediclaim)
- Group Life Insurance, Group Personal Accident Insurance Coverage, Business Travel Insurance
- Cab Facility
- Relocation Benefit