On-site
Part Time
At eBay, we're more than a global ecommerce leader — we’re changing the way the world shops and sells. Our platform empowers millions of buyers and sellers in more than 190 markets around the world. We’re committed to pushing boundaries and leaving our mark as we reinvent the future of ecommerce for enthusiasts.
Our customers are our compass, authenticity thrives, bold ideas are welcome, and everyone can bring their unique selves to work — every day. We're in this together, sustaining the future of our customers, our company, and our planet.
Join a team of passionate thinkers, innovators, and dreamers — and help us connect people and build communities to create economic opportunity for all.
At eBay, we are building the next-generation AI platform to power intelligent experiences for millions of users worldwide. Our AI Platform (AIP) provides the scalable, secure, and efficient foundation for deploying and optimizing advanced machine learning and large language model (LLM) workloads at production scale. We enable teams across eBay to move from experimentation to global deployment with speed, reliability, and efficiency.
We are seeking an experienced Machine Learning Platform Support Engineer to join our AI Platform team. In this role, you will be the first line of support (L1) for ML workloads running on Kubernetes and Ray.io clusters. You will be responsible for triaging, monitoring, and resolving platform-related issues across ML training, inference, model deployment, and GPU resource allocation.
This position includes participation in on-call rotations (PagerDuty) and requires close collaboration with ML Platform engineers, researchers, and platform teams to ensure the reliability, scalability, and usability of the AI Platform. You will play a critical role in ensuring operational excellence and maintaining the uptime of the core infrastructure that powers eBay’s global AI and ML systems.
Serve as the first point of contact (L1) for all support requests related to the AI/ML Platform, including ML training, inference, model deployment, and GPU allocation.
Provide operational and on-call (PagerDuty) support for Ray.io and Kubernetes clusters running distributed ML workloads across cloud and on-prem environments.
Monitor, triage, and resolve platform incidents involving job failures, scaling errors, cluster instability, or GPU resource contention.
Manage GPU quota allocation and scheduling across multiple user teams, ensuring compliance with approved quotas and optimal resource utilization.
Support Ray Train/Tune for large-scale distributed training and Ray Serve for autoscaled inference, maintaining performance and service reliability.
Troubleshoot Kubernetes workloads, including pod scheduling, networking, image issues, and resource exhaustion in multi-tenant namespaces.
Collaborate with platform engineers, SREs, and ML practitioners to resolve infrastructure, orchestration, and dependency issues impacting ML workloads.
Improve observability, monitoring, and alerting for Ray and Kubernetes clusters using Prometheus, Grafana, and OpenTelemetry to enable proactive issue detection.
Maintain and enhance runbooks, automation scripts, and knowledge base documentation to accelerate incident resolution and reduce recurring support requests.
Participate in root cause analysis (RCA) and post-incident reviews, contributing to platform improvements and automation initiatives to minimize downtime.
Bachelor’s or Master’s degree in Computer Science, Engineering, or related technical discipline (or equivalent experience).
5+ years of experience in ML operations, DevOps, or platform support for distributed AI/ML systems.
Proven experience providing L1/L2 and on-call support for Ray.io and Kubernetes-based clusters supporting ML training and inference workloads.
Strong understanding of Ray cluster operations, including autoscaling, job scheduling, and workload orchestration across heterogeneous compute (CPU/GPU/accelerators).
Hands-on experience managing Kubernetes control plane and data plane components, multi-tenant namespaces, RBAC, ingress, and resource isolation.
Expertise in GPU scheduling, allocation, and monitoring (NVIDIA device plugin, MIG configuration, CUDA/NCCL optimization).
Proficiency in Python and/or Go for automation, diagnostics, and operational tooling in distributed environments.
Working knowledge of Kubernetes and cloud-native environments (AWS, GCP, Azure) and CI/CD pipelines.
Experience with observability stacks (Prometheus, Grafana, OpenTelemetry) and incident management tools (PagerDuty, ServiceNow).
Familiarity with ML frameworks such as TensorFlow and PyTorch, and their integration within distributed Ray/Kubernetes clusters.
Strong debugging, analytical, and communication skills to collaborate effectively with cross-functional engineering and research teams.
A customer-centric, operationally disciplined mindset focused on maintaining platform reliability, performance, and user satisfaction.
Please see the Talent Privacy Notice for information regarding how eBay handles your personal data collected when you use the eBay Careers website or apply for a job with eBay.
eBay is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, sex, sexual orientation, gender identity, veteran status, disability, or other legally protected status. If you have a need that requires accommodation, please contact us at talent@ebay.com. We will make every effort to respond to your request for accommodation as soon as possible. View our accessibility statement to learn more about eBay's commitment to ensuring digital accessibility for people with disabilities.