We work on Apple-scale opportunities and challenges. We are engineers at heart, and we like solving technical problems. We believe a good engineer has the curiosity to dig into the inner workings of technology and is always experimenting, reading, and learning. If you are a software engineer with a passion for code who loves digging into the internals of any technology and is fascinated by distributed systems architecture, we want to hear from you.
Description
We are seeking a highly skilled LLM Ops and ML Ops Engineer to lead the deployment, scaling, monitoring, and optimization of large language models (LLMs) across diverse environments. This role is critical to ensuring our machine learning systems are production-ready, high-performing, and resilient. The ideal candidate will have deep expertise in Python or Go programming, a comprehensive understanding of LLM internals, and hands-on experience with a range of inference engines and deployment strategies. This person should be able to balance multiple competing priorities and deliver solutions in a timely manner, understand complex architectures, and be comfortable working with multiple teams.
Key Responsibilities
- Design and build scalable infrastructure for fine-tuning and deploying large language models.
- Develop and optimize inference pipelines using popular frameworks and engines (e.g., TensorRT, vLLM, Triton Inference Server).
- Implement observability solutions for model performance, latency, throughput, GPU/TPU utilization, and memory efficiency.
- Own the end-to-end lifecycle of LLMs in production, from experimentation to continuous integration and continuous deployment (CI/CD).
- Collaborate with research scientists, ML engineers, and backend teams to operationalize groundbreaking LLM architectures.
- Automate and harden model deployment workflows using Python, Kubernetes, containers, and orchestration tools such as Argo Workflows and GitOps.
- Design reproducible model packaging, versioning, and rollback strategies for large-scale serving.
- Stay current with advances in LLM inference acceleration, quantization, distillation, and model compilation techniques (e.g., GGUF, AWQ, FP8).
Minimum Qualifications
- 5+ years of experience in LLM/ML Ops, DevOps, or infrastructure engineering with a focus on machine learning systems.
- Advanced proficiency in Python or Go, with the ability to write clean, performant, and maintainable production code.
- Deep understanding of transformer architectures, LLM tokenization, attention mechanisms, memory management, and batching strategies.
- Proven experience deploying and optimizing LLMs using multiple inference engines.
- Strong background in containerization and orchestration (Kubernetes, Helm).
- Familiarity with monitoring tools (e.g., Prometheus, Grafana), logging frameworks, and performance profiling.
Preferred Qualifications
- Experience integrating LLMs into microservices or edge inference platforms.
- Experience with Ray for distributed inference.
- Hands-on experience with quantization libraries.
- Contributions to open-source ML infrastructure or LLM optimization tools.
- Familiarity with cloud platforms (AWS, GCP) and infrastructure-as-code (Terraform).
- Exposure to secure and compliant model deployment workflows.