Posted: 1 week ago
On-site | Contractual
Overview: Seeking an engineer to build and optimize high-throughput, low-latency LLM inference infrastructure using open-source models (Qwen, LLaMA, Mixtral) on multi-GPU systems (A100/H100). You’ll own performance tuning, model hosting, routing logic, speculative decoding, and cost-efficiency tooling.

Must-Have Skills:
- Deep experience with vLLM, tensor/pipeline parallelism, and KV cache management (see the vLLM sketch after this section)
- Strong grasp of CUDA-level inference bottlenecks, FlashAttention-2, and quantization
- Familiarity with FP8, INT4, and speculative decoding (e.g., TwinPilots, PowerInfer)
- Proven ability to scale LLMs across multi-GPU nodes (TP, DDP, inference routing)
- Systems-level Python, containerized deployments (Docker, GCP/AWS), and load testing with Locust (see the Locust sketch after this section)

Bonus:
- Experience with any-to-any model routing (e.g., text2sql, speech2text)
- Exposure to LangGraph, Triton kernels, or custom inference engines
- Has tuned models to under $0.50 per million tokens of inference at scale

Highlight: Highly competitive rate card for the best candidate fit.
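For candidates gauging the first must-have, the following is a minimal sketch of tensor-parallel serving with vLLM's offline LLM API; the model name, GPU count, and memory settings are illustrative assumptions, not details of this role's stack.

    # Tensor-parallel inference with vLLM across 4 GPUs (assumed setup).
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct",  # assumed model; any listed family works
        tensor_parallel_size=4,            # shard weights and KV cache across 4 GPUs (TP)
        gpu_memory_utilization=0.90,       # leave headroom for the paged KV cache
        max_model_len=8192,                # bounds per-sequence KV cache allocation
    )

    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate(["Explain KV cache paging in one paragraph."], params)
    for out in outputs:
        print(out.outputs[0].text)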
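Similarly, for the Locust requirement, a minimal sketch of a load test against an OpenAI-compatible /v1/completions endpoint such as the one vLLM's API server exposes; the host, payload, and model id are assumptions.

    # Locust load test against an OpenAI-compatible completions endpoint
    # (e.g., vLLM's API server). Payload values are illustrative assumptions.
    from locust import HttpUser, task, between

    class InferenceUser(HttpUser):
        wait_time = between(0.5, 2.0)  # per-user think time between requests

        @task
        def completion(self):
            self.client.post(
                "/v1/completions",
                json={
                    "model": "Qwen/Qwen2.5-7B-Instruct",  # assumed model id
                    "prompt": "Summarize tensor parallelism in one sentence.",
                    "max_tokens": 128,
                },
                name="completions",  # aggregate stats under one label
            )

Run with, e.g., locust -f loadtest.py --host http://localhost:8000 (vLLM's API server defaults to port 8000).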
Constient Global Solutions
Chennai, Tamil Nadu, India
Experience: Not specified
Salary: Not disclosed