- Drives the execution of multiple business plans and projects by identifying customer and operational needs; developing and communicating business plans and priorities; removing barriers and obstacles that impact performance; providing resources; identifying performance standards; measuring progress and adjusting performance accordingly; developing contingency plans; and demonstrating adaptability and supporting continuous learning
- Provides supervision and development opportunities for associates by selecting and training; mentoring; assigning duties; building a team-based work environment; establishing performance expectations and conducting regular performance evaluations; providing recognition and rewards; coaching for success and improvement; and promoting a belonging mindset in the workplace
- Promotes and supports company policies, procedures, mission, values, and standards of ethics and integrity by training and providing direction to others in their use and application; ensuring compliance with them; and utilizing and supporting the Open Door Policy
- Ensures business needs are being met by evaluating the ongoing effectiveness of current plans, programs, and initiatives; consulting with business partners, managers, co-workers, or other key stakeholders; soliciting, evaluating, and applying suggestions for improving efficiency and cost-effectiveness; and participating in and supporting community outreach events
What you'll do:
We are seeking a Principal Machine Learning Engineer to set the technical direction and lead the design of large-scale AI/ML platforms and products. You combine deep hands-on expertise with strategic vision to drive company-wide impact.
As a Principal ML Engineer, you'll:
* Define and drive the multi-year technical strategy and architecture for catalog AI/ML systems across classification, attribution, trust & safety and generative experiences.
* Architect production-grade GenAI systems including retrieval orchestration, tool/use-case routing, policy engines, and evaluation harnesses.
* Lead the design of training and serving stacks for LLMs and multi-modal models: distributed fine-tuning, parameter-efficient methods, distillation, quantization and compilation.
* Own and improve reliability, latency and cost SLOs for online inference; design autoscaling, caching, batching, and GPU/accelerator utilization strategies.
* Establish platform standards for data quality, lineage, governance and cost management across ML pipelines.
* Build reusable evaluation and observability frameworks (offline metrics, human-in-the-loop, A/B testing, canaries) with robust telemetry and alerting.
* Champion MLOps/LLMOps best practices: CI/CD for models, registries, feature/embedding stores, vector databases, rollout/rollback, shadowing and drift detection.
* Mentor and develop Senior/Staff engineers and scientists; raise the bar on code quality, design reviews, technical writing and operational excellence.
* Influence cross-functional roadmaps and align with Product, Platform, Data Engineering, Security and Infrastructure to deliver outcomes at global scale.
* Publish, file patents and build partnerships with academia/open-source communities; represent Walmart at top-tier AI/ML venues.
* Ensure solutions meet privacy, security, compliance and Responsible AI requirements.
* Architect and lead fine-tuning, distillation and domain adaptation of foundation models (LLMs and vision-language) for retail-specific tasks, including instruction-tuning and RAG.
* Optimize inference at scale using quantization, tensor/pipeline parallelism, speculative decoding, KV-cache management and high-performance serving (e.g., Triton, vLLM, TensorRT-LLM).
* Design rigorous evaluation and safety frameworks for generative systems (LLM-as-judge, red teaming, factuality, bias/fairness, toxicity) and enforce policy through guardrails.
What you'll bring:
* PhD with 7+ years of relevant experience / Master's with 10+ years / Bachelor's with 12+ years in Computer Science or a strongly quantitative field.
* Demonstrated leadership delivering multiple large-scale ML/GenAI products in production with measurable business impact.
* Deep expertise in transformer architectures, retrieval-augmented generation, parameter-efficient fine-tuning, model distillation and evaluation.
* Strong programming skills in Python (and optionally C++/Rust) with production experience in PyTorch/TensorFlow/JAX and modern data/streaming frameworks (e.g., Spark, Flink, Kafka).
* Proven experience with distributed training/serving on Kubernetes and cloud environments; familiarity with CUDA/NCCL, sharding/ZeRO, mixed precision and performance profiling.
* Strong grasp of data governance, privacy, security and Responsible AI; experience operationalizing policies and controls.
* Track record of publications or intellectual property generation in top-tier venues.
* Excellent communication, stakeholder management and mentoring skills; high ownership and bias for action.
Good to have:
* Experience with the eCommerce domain, product knowledge graphs and taxonomy/attribute systems.