AI Data Platform Reliability & Validation Engineer

3 - 7 years

0 Lacs

Posted: 17 hours ago | Platform: Shine

Work Mode

On-site

Job Type

Full Time

Job Description

You will be joining Oracle's AI Data Platform team as an experienced engineer to contribute to the reliability of the AI platform. Your role is crucial in ensuring the robustness, performance, and reliability of the enterprise-scale, AI-powered data platform.

Your responsibilities will include:

- Designing, developing, and executing end-to-end scenario validations that mirror real-world usage of complex AI data platform workflows such as data ingestion, transformation, and ML pipeline orchestration.
- Collaborating with product, engineering, and field teams to identify gaps in coverage and propose test automation strategies.
- Developing and maintaining automated test frameworks supporting end-to-end, integration, performance, and regression testing for distributed data/AI services.
- Monitoring system health across the stack, proactively detecting failures or SLA breaches.
- Championing SRE best practices including observability, incident management, blameless postmortems, and runbook automation.
- Analyzing logs, traces, and metrics to identify reliability, latency, and scalability issues and driving root cause analysis and corrective actions.
- Partnering with engineering to improve high availability, fault tolerance, and continuous delivery (CI/CD).
- Participating in on-call rotation to support critical services, ensuring rapid resolution and minimizing customer impact.

Desired Qualifications:

- Bachelor's or master's degree in computer science, engineering, or related field, or demonstrated equivalent experience.
- 3+ years of experience in software QA/validation, SRE, or DevOps roles, ideally in data platforms, cloud, or AI/ML environments.
- Proficiency with DevOps automation and tools for continuous integration, deployment, and monitoring such as Terraform, Jenkins, GitLab CI/CD, and Prometheus.
- Working knowledge of distributed systems, data engineering pipelines, and cloud-native architectures (OCI, AWS, Azure, GCP, etc.).
- Strong proficiency in Java, Python, and related technologies.
- Hands-on experience with test automation frameworks like Selenium, pytest, JUnit, and scripting in Python, Bash, etc.
- Familiarity with SRE practices like service-level objectives (SLO/SLA), incident response, and observability tools like Prometheus, Grafana, ELK.
- Strong troubleshooting and analytical skills with a passion for reliability engineering and process automation.
- Excellent communication and cross-team collaboration abilities.

About Us:

Oracle, a global leader in cloud solutions, leverages tomorrow's technology to address today's challenges. Operating with integrity for over 40 years, Oracle partners with industry leaders across various sectors. The company is committed to fostering an inclusive workforce that offers opportunities for all. Oracle provides competitive benefits, flexible medical, life insurance, and retirement options, along with volunteer programs for community engagement. Inclusivity is a core value, and Oracle is dedicated to integrating people with disabilities at all stages of the employment process. If you require accessibility assistance or accommodation for a disability, please contact accommodation-request_mb@oracle.com or call +1 888 404 2494 in the United States.

Oracle

Information Technology

Redwood City
