-
Define and drive the long-term reliability and scalability strategy for the Adobe Pass platform, aligning with product and business goals.
-
Architect large-scale, distributed, and multi-region systems designed for resiliency, observability, and self-healing.
-
Anticipate systemic risks and design proactive mitigation strategies that eliminate single points of failure across critical services.
-
Partner with software architecture and infrastructure teams to evolve the platform toward greater reliability, efficiency, and cost optimization.
Automation, Observability & Reliability Engineering
-
Build and champion advanced automation frameworks that enable zero-touch operations across deployment, recovery, and scaling workflows.
-
Introduce AI/ML-based predictive monitoring and anomaly detection systems to anticipate failures before they impact users.
-
Lead organization-wide reliability initiatives such as chaos engineering, error budgets, and SLO adoption, driving measurable reliability improvements.
-
Continuously refine observability architecture (metrics, traces, logs) to ensure comprehensive, actionable insights into production health.
Incident Response & Operational Excellence
-
Serve as a technical authority during high-impact incidents, guiding cross-functional teams through real-time mitigation and long-term prevention.
-
Establish and enforce best-in-class incident management frameworks that improve MTTR and MTBF and reduce incident recurrence rates.
-
Lead blameless postmortems and translate findings into actionable reliability roadmaps.
-
Drive reliability reviews and operational readiness assessments for all major product launches.
Performance, Scalability & Cost Efficiency
-
Lead large-scale performance tuning and capacity engineering efforts, ensuring optimal resource utilization and cost efficiency across environments.
-
Identify architectural bottlenecks, drive performance benchmarking, and influence platform evolution for better scalability and elasticity.
-
Partner with FinOps and CloudOps to optimize spend while maintaining reliability SLAs and SLOs.
Cross-Team Leadership & Mentorship
-
Mentor and coach SREs and software engineers, cultivating deep reliability-first thinking across teams.
-
Serve as a thought leader in reliability engineering, driving best practices, evangelizing an automation-first culture, and influencing technical standards across multiple teams.
-
Collaborate with engineering leaders, PMs, and operations to align priorities, set strategic goals, and deliver on high-impact reliability initiatives.
-
Lead technical deep dives and design reviews, ensuring all systems are built to scale securely and reliably.
Qualifications
-
Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
-
12+ years of experience in site reliability, production engineering, or large-scale distributed system operations.
-
Proven track record of designing and managing highly available, globally distributed systems in cloud-native environments (AWS, Azure, GCP).
-
Expert-level proficiency in one or more programming/scripting languages (Python, Go, Java, Bash) for automation and tooling.
-
Deep understanding of Kubernetes, microservices, and service mesh architectures.
-
Advanced experience with Infrastructure as Code (Terraform, CloudFormation) and CI/CD automation frameworks.
-
Mastery of observability and monitoring stacks (Prometheus, Grafana, Datadog, OpenTelemetry).
-
Strong expertise in networking, storage, and distributed databases (both SQL and NoSQL).
-
Demonstrated ability to influence architectural decisions and drive reliability strategy across organizations.
-
Exceptional communication, leadership, and stakeholder management skills.
Preferred Qualifications
-
Experience designing reliability frameworks or SRE platforms at scale (error budgets, chaos engineering, reliability reviews).
-
Prior experience in high-traffic or latency-sensitive systems (media streaming, advertising, or real-time platforms).
-
Familiarity with big data ecosystems (Kafka, Spark, Hadoop) and large-scale data ingestion pipelines.
-
Hands-on experience with security, compliance, and governance in production environments (SOC2, GDPR, ISO27001).
-
Cloud or Kubernetes certifications (AWS Solutions Architect Professional, CKA/CKAD, GCP Professional Cloud Architect).
-
Published contributions or conference talks on reliability, automation, or distributed systems.
Adobe aims to make accessible to any and all users. If you have a disability or special need that requires accommodation to navigate our website or complete the application process, email or call .