Who We Are
Wayfair operates one of the largest custom e-commerce logistics networks in the U.S., with 1.6 million square meters of logistics space and an inherently complex system architecture. To meet the performance, reliability, and scalability demands of our systems, the Reliability Platforms team builds foundational platform products that empower teams across the company to deliver exceptional customer experiences.
We build and own Observability Platforms, Platform Insights, Performance Engineering (Chaos Testing), and Reliability Tooling: the cornerstones of platform health and operational excellence. Our mission is to make reliability a product that's usable, measurable, and integral to every developer's workflow.
We're looking for a strong engineering leader to define strategy, develop talent, and drive adoption of our reliability platforms across Wayfair's engineering ecosystem.
What You'll Do
Pod Leadership & Strategic Direction
- Define and execute the charter for the Reliability Platforms Pod, consisting of 2-4 atomic teams, aligned with the broader Superpod mission and Strategic Technology Objectives (STOs).
- Take full ownership of the systems within the pod, ensuring technical decisions and outcomes drive measurable impact and business reliability.
- Identify and manage cross-superpod and cross-functional dependencies, driving cohesive delivery and reducing system fragility.
Platform & Product Leadership
- Build and scale Platform-as-a-Product offerings with strong developer empathy, clear reliability KPIs, and intuitive user experience.
- Architect systems and services that integrate observability, resiliency, chaos testing, and platform insights into core infrastructure workflows.
- Lead performance engineering efforts, including load testing, chaos simulations, and failure modeling, to drive a proactive reliability culture.
Technical & Operational Excellence
- Co-create and review technical artifacts such as design documents, architecture reviews, and postmortem analyses.
- Champion engineering best practices across observability, SLO definition, automation, and incident response.
- Support the translation of non-technical business requirements into robust engineering solutions, especially around risk mitigation and system uptime.
People Development & Team Building
- Attract, mentor, and grow diverse engineering talent across ICs and managers.
- Foster a high-trust, inclusive, and collaborative team culture focused on continuous improvement.
- Coach teams through architectural tradeoffs, prioritization under ambiguity, and delivery of platform features with wide organizational impact.
Stakeholder Engagement
- Represent the team in cross-functional and leadership forums to advocate for reliability priorities.
- Work closely with Product, SRE, Infrastructure, and Developer Experience stakeholders to align platform initiatives with company-wide goals.
What You'll Need
- 16+ years of software engineering experience, with at least 10 years in technical leadership roles, and 5+ years leading managers.
- Proven ownership of Pod-level systems and delivery of complex platform initiatives aligned with STOs and cross-org reliability goals.
- Deep experience with cloud-native platform engineering, including microservices, event-driven systems, and observability tools (Prometheus, Datadog, OpenTelemetry).
- Hands-on expertise in applying SRE principles (SLOs, error budgets, automated remediation) to platform and infrastructure layers.
- Proficiency with Infrastructure-as-Code and automation (e.g., Terraform, Kubernetes), driving repeatable, self-service reliability tooling.
- Demonstrated success in delivering developer-centric tooling (e.g., APIs, CLIs, dashboards) to improve adoption and system insights.
- Strong communication and stakeholder engagement skills; comfortable influencing with and without authority.
- A pragmatic, results-oriented mindset with the ability to balance technical excellence with business impact.