About the Organization
Join Oracle Cloud Infrastructure's Observability organization, a core OCI pillar enabling reliability, visibility, and operational excellence across all OCI services. The Telemetry Alarming team owns the monitoring and alerting layer that transforms raw telemetry into actionable insights for both OCI customers and internal service teams.
Our Mission
OCI Observability is building a world-class Integrated Observability and Management Platform that delivers seamless visibility across OCI, other clouds, and on-premises environments. The platform unifies Logging, Monitoring, Auditing, SIEM, Events, and Inventory into a cohesive experience, providing actionable insights into the health, performance, and security of distributed systems.
What You'll Do
Our systems evaluate millions of metrics per second across thousands of tenants, ensuring timely detection of anomalies, outages, and performance regressions at cloud scale. As a Principal Engineer, you will lead the architecture, design, and technical direction of the next-generation Alarming platform, driving high availability, low-latency signal evaluation, intelligent suppression, and seamless integration with OCI's unified Observability suite. This role provides deep ownership, visibility across the Observability stack, and the opportunity to shape OCI's technical direction in monitoring and telemetry infrastructure.
- Define architecture, design, and technical direction for large-scale telemetry services.
- Lead design reviews, guide implementation quality, and ensure long-term maintainability.
- Collaborate with partner teams across OCI Observability, Control Plane, and Developer Platform.
- Mentor engineers, raise technical standards, and foster a culture of excellence and ownership.
- Anticipate and mitigate systemic risks, ensuring reliability and resilience at global scale.
What You'll Get
- A supportive, engineering-driven culture that values innovation and technical rigor.
- Exposure to massive-scale distributed systems and deep infrastructure challenges.
- The agility of a focused team combined with the reach and stability of Oracle.
- Opportunities to expand skills across OCI's broad cloud ecosystem.
- Continuous technical development and leadership growth.
- Comprehensive benefits and a collaborative, high-caliber engineering community.
Job Responsibilities
- Define architecture and lead development of large-scale Monitoring and Alarming services handling multi-region, multi-tenant workloads.
- Design high-throughput evaluation pipelines for time-series data, optimized for low latency and fault tolerance.
- Drive the evolution of core capabilities such as alarm suppression, composite conditions, and intelligent correlation.
- Collaborate with peer teams in Telemetry to deliver integrated Observability experiences.
- Establish performance, reliability, and efficiency benchmarks for the Alarming platform.
- Mentor engineers, perform design reviews, and set technical standards across the organization.
- Lead incident analysis, root-cause investigations, and architectural remediation of complex production issues.
- Contribute to OCI-wide initiatives improving telemetry ingestion, query efficiency, and alerting reliability.
Required Qualifications
- BS/MS in Computer Science or a related field, or equivalent practical experience.
- 6+ years of hands-on engineering experience, including 2+ years designing and leading cloud-scale systems.
- Expertise in distributed systems, microservices, and cloud-native architecture.
- Deep proficiency in at least one major programming language (Java, Go, or C#).
- Experience with one or more public clouds (OCI, AWS, Azure, GCP).
- Strong analytical, design, and debugging skills.
- Ability to communicate complex ideas clearly and lead technical discussions across teams.
- Passion for observability, telemetry, and building systems that operate at global scale.