Senior Staff AI/ML Scale Engineer

4 - 8 years

6 - 10 Lacs

Posted: 1 day ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

About Marvell

Your Team, Your Impact

This team at Marvell develops Murals, a next-generation AI/ML infrastructure simulation and design platform that enables in-depth analysis and optimization of large-scale training and inference workloads. Leveraging trace-driven simulation, performance modeling, and hardware/software co-design, the team helps shape scalable and resilient solutions for advanced workloads such as LLMs, DLRMs, GenAI, and GNNs. Working closely with system architects, hardware designers, and ML practitioners, the team explores innovative ways to optimize compute, memory, and networking subsystems across complex datacenter environments.

What You Can Expect

  • Simulation & Modeling - Implement workflows to study AI/ML workloads using trace-driven and analytical models.

  • Performance Analysis - Profile and analyze system bottlenecks across compute, memory, and network layers.

  • Networking Studies - Evaluate collective communication performance (all-reduce, all-to-all, reduce-scatter) across different topologies and fabrics.

  • Tooling & Automation - Develop utilities for trace generation, merging, conversion, and visualization.

  • Prototype & Validation - Test distributed training and inference pipelines in simulated and real environments.

  • Hardware/Software Co-Design - Collaborate on emerging technologies (CXL, DPUs, NVLink, PCIe, UET/UEC, in-network compute).

  • Scaling Studies - Conduct performance projections and trade-off studies for next-gen AI infrastructure.

  • Knowledge Sharing - Document workflows, publish internal reports, and drive peer learning.

What We're Looking For

  • Bachelor's, Master's, or PhD in Computer Science, Electrical Engineering, or a related field with 4-12 years of relevant professional experience.

  • Strong foundation in computer architecture, distributed systems, AI/ML, and operating systems.

  • Solid networking fundamentals including TCP/IP, RDMA, RoCE, UET/UEC, and switching/routing.

  • Experience with simulation frameworks (e.g., Astra-Sim, Chakra, gem5, SST, NS-3).

  • Hands-on experience with PyTorch/TensorFlow and distributed training frameworks (DDP, Horovod, DeepSpeed).

  • Strong programming skills in Python, C++, and scripting for automation.

  • Familiarity with interconnect and memory technologies (CXL, PCIe, NVLink, UAL).

  • Experience with profiling, telemetry, observability, and debugging tools.

  • Knowledge of collective communication algorithms and topology-aware scheduling.

  • Exposure to AI accelerators, memory disaggregation, DPUs, and custom silicon.

  • Familiarity with visualization tools (Perfetto, Chrome Tracing, Chakra Timeline, Flamegraphs).

  • Experience with large-scale AI training pipelines and scaling studies.

  • Interest in energy/performance trade-offs and resilience techniques.

Marvell Semiconductors

Semiconductors

Santa Clara
