Experience: 5.0 - 9.0 years
Salary: 0 Lacs
Location: Chennai, All India
Work mode: On-site
As an engineer in this role, you will be responsible for building and optimizing high-throughput, low-latency LLM inference infrastructure using open-source models such as Qwen, LLaMA, and Mixtral on multi-GPU systems such as A100/H100. Your ownership will include performance tuning, model hosting, routing logic, speculative decoding, and cost-efficiency tooling.

Key Requirements:
- Deep experience with vLLM, tensor/pipeline parallelism, and KV cache management
- Strong grasp of CUDA-level inference bottlenecks, FlashAttention2, and quantization
- Familiarity with FP8, INT4, and speculative decoding (e.g., TwinPilots, PowerInfer)
- Proven ability to scale LLMs across multi-GPU nodes using TP, D...
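For context on the serving stack named above, the following is a minimal sketch of multi-GPU serving with vLLM and tensor parallelism; the model checkpoint, GPU count, and sampling values are illustrative assumptions, not requirements of the role.

```python
# Minimal sketch: serve an open-source model with vLLM using tensor parallelism.
# The checkpoint name, GPU count, and sampling settings are assumptions chosen
# for illustration (the Qwen family is mentioned in the posting; the exact
# checkpoint is not).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",  # assumed checkpoint for illustration
    tensor_parallel_size=2,          # shard weights across 2 GPUs (e.g. A100/H100)
    gpu_memory_utilization=0.90,     # leave most of VRAM for weights + KV cache
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain KV cache management in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

Depending on the vLLM version, the same constructor also accepts quantized checkpoints (e.g. quantization="awq") and pipeline-parallel settings when a single node's memory is not enough.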
Posted 3 days ago