ROLE OVERVIEW
We are seeking a highly skilled Senior Audio ML Engineer to develop, optimize, and deploy advanced speech processing models across distributed GPU clusters. You will drive audio ML initiatives, architect scalable training pipelines for speech generation and recognition, and ensure production-ready deployment of state-of-the-art audio models, including TTS, ASR, voice cloning, and speech translation systems.
KEY RESPONSIBILITIES
- Design, develop, and enhance speech processing models including TTS, ASR, Speaker Diarization, Source Separation, and Speech-to-Speech Translation systems for production use cases.
- Architect and optimize distributed training pipelines across GPU clusters for large-scale speech model training using advanced parallelization strategies.
- Fine-tune and customize speech foundation models using proprietary audio datasets, advanced training techniques, and comprehensive evaluation frameworks.
- Develop state-of-the-art voice cloning systems with zero-shot capabilities, emotion control, accent flexibility, pitch variation, and cross-lingual expressivity.
- Design high-performance inference pipelines for speech models using TensorRT, ONNX, quantization, streaming, and GPU optimization techniques.
- Ensure all speech models are production-grade: robust, scalable, monitored, and integrated into real-time audio processing systems.
- Research and evaluate cutting-edge architectures in speech synthesis, recognition, and multimodal audio-visual systems.
- Collaborate with the audio ML team to drive technical excellence and knowledge sharing across speech processing initiatives.
ADDITIONAL RESPONSIBILITIES
- Architect end-to-end speech processing systems including distributed training, model serving, real-time inference, and continuous model improvements.
- Work with infrastructure teams to optimize GPU cluster utilization, implement efficient data loading pipelines, and manage large-scale audio dataset processing.
- Build comprehensive model evaluation frameworks covering word error rate (WER), mean opinion score (MOS), speaker similarity metrics, latency benchmarks, and audio quality assessments.
- Drive experimentation with novel architectures including neural vocoders, diffusion-based TTS, transformer variants, and multimodal speech systems.
- Collaborate cross-functionally with product, backend, audio engineering, and DevOps teams to deliver end-to-end speech AI features.
- Implement robust training monitoring, experiment tracking, and model versioning systems for reproducible speech model development.
- Handle domain-shifted conditions, multilingual datasets, and challenging acoustic environments in production deployments.
- Contribute to team knowledge sharing through technical documentation, code reviews, and best practices in distributed speech model training.
REQUIRED QUALIFICATIONS
- 3-5+ years of experience in audio/speech machine learning, deep learning for speech processing, or audio signal processing systems.
- Proven expertise with state-of-the-art speech frameworks including ASR models (Whisper, Conformer), TTS systems (VITS, FastSpeech, Tacotron), and voice cloning architectures.
- Hands-on experience with distributed training across GPU clusters using PyTorch DDP, DeepSpeed, FairScale, or similar frameworks.
- Strong knowledge of audio processing libraries (librosa, torchaudio, SpeechBrain) and speech-specific data pipelines.
- Expert-level experience with model optimization for speech (TensorRT, ONNX Runtime, quantization) and real-time audio inference systems.
- Solid understanding of GPU cluster management, CUDA optimization, mixed precision training, and large-scale audio data handling.
- Experience with Speaker Diarization, Source Separation, Noise Cancellation, and Speech-to-Speech Translation systems.
- Strong technical communication skills and ability to work collaboratively in cross-functional teams.
- Master's or PhD in Electrical Engineering, Computer Science, or related field with specialization in Speech Processing or Audio ML.
NOTE: We also accept international applicants.