This is a night‐shift role (U.S. Eastern Time hours).
As an L3, you are the technical anchor for complex production issues across our DSP data and bidding surfaces. You’ll lead SEV‐1/2 incident bridges, perform deep SQL and system investigations, implement durable fixes, and partner with Platform/Data/Backend Engineering on performance and cost optimization. You’ll mentor L2s, harden runbooks, and raise the reliability bar across Snowflake, Postgres, MySQL, and Athena.
Key Responsibilities
- Major incident leadership: Drive SEV‐1/2 resolution; coordinate war rooms; communicate impact and mitigation clearly to U.S. business stakeholders and leadership.
- Advanced diagnostics & performance tuning:
  ○ Snowflake: micro‐partitioning, clustering, warehouse sizing/concurrency, query plan analysis.
  ○ Athena/Presto: partitioning, file formats (Parquet/ORC), partition pruning, S3 layout.
  ○ Postgres/MySQL: indexing strategies, query plans, connection pooling, VACUUM/ANALYZE, parameter tuning.
- Sustainable engineering fixes: Ship PRs or scripts for hotfixes, add guardrails/data quality checks, automate remediations, and eliminate toil for L2s.
- DSP domain problem‐solving: Debug bidder behavior (latency, QPS, timeout budgets), ORTB field mismatches, price‐floor issues, segment join/refresh rates, attribution/reporting gaps.
- Observability & SLOs: Define and refine SLOs, create high‐signal alerts and dashboards, and drive post‐mortem RCAs with action items to closure.
- Mentorship & knowledge: Level up the L2 team through training, improved runbooks, and tooling; set escalation criteria and acceptance checklists.
- Cost & performance stewardship: Optimize compute/storage spend (warehouse sizing, query rewrites, data retention strategies) while maintaining SLAs.
What We Are Looking For
- 5–8+ years in production engineering/support for large‐scale data or ad‐tech platforms, including leading SEV‐1/2 incidents.
- Expert‐level SQL and proven tuning experience across Snowflake, Postgres, MySQL, and Athena.
- Proficiency with Python/Bash, Linux, and at least one observability stack (Datadog/Grafana/Kibana).
- Deep working knowledge of DSP/bidding ecosystems: ORTB, bidder latency envelopes, win‐rate dynamics, pacing algorithms, audience pipelines, creative approvals/brand safety.
- Strong stakeholder communication with U.S. teams; comfortable writing exec‐level incident summaries and presenting RCAs.
- Sustainable mindset: prioritize root‐cause elimination, automation, and change management that reduces risk and ticket load over time.
Nice to have
- Experience with Kafka/Kinesis, S3/Glue/Lambda, Looker/Mode/Tableau.
- Familiarity with privacy/compliance (HIPAA, SOC 2, consent frameworks) and log‐level data sharing practices.
Success metrics
- Reduced SEV‐1/2 frequency and duration, percentage of recurring issues eliminated, performance and cost improvements delivered, quality of RCAs and engineering fixes, and L2 readiness uplift.
Work hours & on‐call
- Core coverage 9:00am–6:00pm ET, with flexibility for incident bridges.
- Primary on‐call rotation for critical incidents (with backup L2 rotation).
Skills: Athena, Kibana, data, Python, SQL, Grafana, Datadog, Bash, PostgreSQL, Snowflake, MySQL