Senior ASR/TTS Specialist - AI Agent Integration Expert
Company: EXL Service
Type: Full-time
Experience: 3+ years
Position Summary
We seek an exceptional Senior ASR/TTS Specialist to lead speech AI initiatives and integrate advanced speech technologies with AI agent frameworks. This role focuses on fine-tuning ASR/TTS models, implementing MLOps best practices, and building production-ready speech AI systems that power next-generation conversational AI agents.
Key Responsibilities
Speech AI Model Development & Integration
- Model Fine-tuning: Customize state-of-the-art ASR/TTS models for domain-specific applications with <300ms latency
- Speech-to-Speech Systems: Build end-to-end S2S pipelines using Amazon Nova Sonic v1.0, Azure OpenAI Realtime (GPT-4o), and Gemini 2.5 Flash Native Audio
- Multi-modal Integration: Develop speech models that integrate with vision and text modalities in AI agents
- Agent Framework Integration: Implement speech capabilities with LangChain/LangGraph, CrewAI, AutoGen, LlamaIndex, and the OpenAI Assistants API (illustrative sketch below)
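As a rough illustration of the kind of integration involved (a minimal sketch under stated assumptions, not EXL's actual stack), the example below wraps open-source Whisper transcription as a LangChain tool. The model size, audio file name, and tool name are placeholder assumptions; a fine-tuned, domain-specific checkpoint would replace the base model in practice.

```python
# Illustrative only: exposes Whisper ASR as a tool an agent framework can call.
# Model size, file name, and tool name are assumptions for this sketch.
import whisper
from langchain_core.tools import tool

# A fine-tuned, domain-specific checkpoint would normally be loaded here;
# "base" is just a placeholder.
_asr_model = whisper.load_model("base")

@tool
def transcribe_audio(audio_path: str) -> str:
    """Transcribe a local audio file to text for downstream agent reasoning."""
    result = _asr_model.transcribe(audio_path)
    return result["text"]

# An agent executor (LangChain, CrewAI, etc.) would receive `transcribe_audio`
# in its tool list; here we only invoke it directly as a smoke test.
if __name__ == "__main__":
    print(transcribe_audio.invoke("sample_call.wav"))
```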
MLOps & Production Engineering
- Model Lifecycle: Implement comprehensive MLOps pipelines using MLflow, Weights & Biases, and automated CI/CD
- Multi-cloud Deployment: Deploy speech models across AWS Bedrock, Google Cloud AI, and Azure Cognitive Services
- Real-time Processing: Build WebSocket-based streaming audio systems handling 1000+ concurrent connections (sketch below)
- Production Monitoring: Implement WER tracking, latency monitoring, and multi-provider failover mechanisms
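A minimal sketch of the streaming-audio pattern, assuming the open-source `websockets` package (recent version): each client pushes raw audio frames over a WebSocket and receives partial results back. The port, chunk handling, and transcriber stub are assumptions for the example, not a production design.

```python
# Minimal WebSocket audio-streaming sketch (assumes a recent `websockets` release).
# The port and the transcriber stub are placeholders.
import asyncio
import websockets

async def fake_streaming_asr(chunk: bytes) -> str:
    # Placeholder for a real streaming ASR call (cloud API or in-process model).
    return f"[{len(chunk)} bytes received]"

async def handle_stream(websocket):
    # Each client streams raw audio frames; partial transcripts are sent back.
    async for chunk in websocket:
        partial = await fake_streaming_asr(chunk)
        await websocket.send(partial)

async def main():
    # A production deployment would add auth, backpressure, and connection limits
    # to reach the 1000+ concurrent-connection target mentioned above.
    async with websockets.serve(handle_stream, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```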
Research & Development
- Cutting-edge Research: Stay current with the latest speech AI breakthroughs and implement novel architectures
- Performance Optimization: Optimize models for real-time inference using TensorRT, ONNX, and edge deployment (export sketch below)
- Data Pipeline Engineering: Build scalable audio ingestion, preprocessing, and augmentation systems
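For context on the optimization path, here is a hedged sketch of exporting a toy PyTorch acoustic module to ONNX for optimized inference. The module, tensor shapes, and file name are illustrative assumptions; a real ASR encoder would be exported the same way before ONNX Runtime or TensorRT deployment.

```python
# Hedged sketch: export a toy PyTorch module to ONNX for optimized inference.
# The module, shapes, and file name are illustrative assumptions.
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Stand-in for a real ASR encoder; maps 80-dim log-mel frames to logits."""
    def __init__(self, vocab_size: int = 32):
        super().__init__()
        self.proj = nn.Linear(80, vocab_size)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        return self.proj(mel)

model = TinyAcousticModel().eval()
dummy_mel = torch.randn(1, 200, 80)  # (batch, frames, mel bins)

# Dynamic axes keep batch size and sequence length flexible at inference time.
torch.onnx.export(
    model,
    dummy_mel,
    "acoustic_model.onnx",
    input_names=["mel"],
    output_names=["logits"],
    dynamic_axes={"mel": {0: "batch", 1: "frames"}, "logits": {0: "batch", 1: "frames"}},
)
# The exported graph can then be served with ONNX Runtime or converted for TensorRT.
```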
Required Qualifications
Core Technical Skills (Must-Have)
Speech AI Models (3+ years experience):
- ASR Systems: Amazon Nova Sonic v1.0, Google Speech-to-Text, Azure Speech Services, Whisper, Wav2Vec2, Riva
- TTS Systems: Google TTS, Azure Cognitive Services TTS, ElevenLabs (REST/WebSocket), Tortoise, VITS, FastSpeech2
- Speech-to-Speech: Direct S2S without intermediate text, multimodal audio processing
- Cloud Services: AWS Bedrock Runtime, Google Cloud AI (Gemini API), Azure OpenAI Services
Programming & Frameworks:
- Languages: Expert Python, proficient C++/Rust for optimization
- ML Frameworks: Advanced PyTorch, TensorFlow 2.x, JAX/Flax
- Audio Processing: librosa, torchaudio, soundfile, WebRTC, µ-law/PCM conversion (conversion sketch below)
- Agent Frameworks: Hands-on experience with 3+ of: LangChain, CrewAI, AutoGen, LlamaIndex, OpenAI Assistants
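As a small, hedged example of the µ-law/PCM conversion mentioned above, the sketch below uses torchaudio's companding helpers; the sample values are illustrative (telephony audio is typically 8 kHz, 8-bit µ-law).

```python
# Sketch of µ-law <-> linear PCM conversion with torchaudio; values are illustrative.
import torch
import torchaudio.functional as F

# Pretend this is a decoded waveform normalized to [-1.0, 1.0].
waveform = torch.linspace(-1.0, 1.0, steps=8)

# Encode to 256-level µ-law (the companding used on telephony trunks)...
mu_law = F.mu_law_encoding(waveform, quantization_channels=256)

# ...and decode back to approximately the original linear PCM values.
restored = F.mu_law_decoding(mu_law, quantization_channels=256)

print(mu_law)
print(torch.max(torch.abs(waveform - restored)))  # small quantization error
```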
MLOps & Infrastructure (Essential)
MLOps Tools (2+ years):
- Experiment Management: MLflow, Weights & Biases (tracking sketch below)
- Model Serving: TorchServe, TensorFlow Serving, NVIDIA Triton
- Workflow Orchestration: Apache Airflow, Kubeflow, Prefect
- Containerization: Docker, Kubernetes for speech model deployment
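A minimal sketch of the experiment-tracking workflow with MLflow follows; the experiment name, parameters, and metric values are placeholders for the example, not real project numbers.

```python
# Hedged MLflow tracking sketch; names and values are placeholders.
import mlflow

mlflow.set_experiment("asr-finetuning-demo")

with mlflow.start_run(run_name="whisper-domain-adapt"):
    # Log the hyperparameters that define this fine-tuning run.
    mlflow.log_param("base_model", "whisper-base")
    mlflow.log_param("learning_rate", 1e-5)

    # In a real pipeline these values would come from an evaluation loop.
    mlflow.log_metric("wer", 0.048)
    mlflow.log_metric("p95_latency_ms", 270)

    # Checkpoints and configs could be attached with mlflow.log_artifact(...).
```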
Cloud & Production:
- Multi-cloud Experience: AWS (Bedrock, Nova Sonic), Google Cloud (Gemini, Speech APIs), Azure (OpenAI Services)
- Real-time Systems: Sub-300ms latency, WebSocket architecture, telecom integration (Genesys AudioConnector)
- Monitoring: Audio quality metrics, model drift detection, production reliability (99.9% uptime); WER tracking sketch below
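To illustrate the WER tracking side of monitoring, here is a hedged sketch using the open-source `jiwer` package; the reference/hypothesis strings and the alert threshold are assumptions for the example.

```python
# Illustrative WER tracking sketch using `jiwer`; strings and threshold are assumptions.
import jiwer

reference = "please transfer me to the billing department"
hypothesis = "please transfer me to billing department"

error_rate = jiwer.wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")

# A production monitor would aggregate WER over sampled calls and alert
# (or trigger provider failover) when it drifts past an agreed threshold.
WER_ALERT_THRESHOLD = 0.05
if error_rate > WER_ALERT_THRESHOLD:
    print("WER above threshold - investigate model drift or audio quality.")
```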
Preferred Qualifications
Advanced Specializations
- Multi-lingual Processing: Cross-lingual transfer learning, zero-shot adaptation
- Domain Expertise: Healthcare, legal, technical domain speech AI
- Edge AI: TensorRT, Core ML, ONNX optimization for mobile/edge deployment
- Research Background: Publications in ICASSP, INTERSPEECH, ICML, NeurIPS
Leadership & Education
- Team Leadership : Experience leading speech AI teams and technical initiatives
- Education : MS/PhD in Computer Science, Electrical Engineering, or related field
- Open Source : Contributions to speech AI libraries and frameworks
Technical Environment
Production Technology Stack
Core Technologies:
- Languages: Python, C++, Rust, TypeScript
- Frameworks: PyTorch, TensorFlow, JAX, LangChain, CrewAI, AutoGen
- Cloud Services: AWS Bedrock, Google Cloud AI, Azure OpenAI Services
- Audio Tools: librosa, torchaudio, WebRTC, FFmpeg
- MLOps: MLflow, Kubeflow, Docker, Kubernetes, NVIDIA Triton
- Databases: Vector DBs (Pinecone, Weaviate), PostgreSQL, Redis
Production Models:
- Amazon Nova Sonic v1.0 (speech-to-speech streaming)
- Gemini 2.5 Flash Native Audio Dialog (multimodal processing)
- Azure OpenAI GPT-4o (realtime voice conversations)
- ElevenLabs (voice cloning and synthesis)
Infrastructure
- GPU Clusters: NVIDIA A100/H100 for model training
- Edge Deployment: NVIDIA Jetson, ARM-based targets
- Real-time Requirements: <300ms latency, 1000+ concurrent streams
- Enterprise Integration: Genesys AudioConnector, SIP protocol, telephony systems
Key Projects & Success Metrics
Primary Focus Areas
- Next-gen S2S Systems: Amazon Nova Sonic, Azure OpenAI Realtime, Gemini Native Audio
- Multi-cloud Integration: Unified APIs across AWS, Google Cloud, Azure
- Conversational AI Agents: Low-latency speech-enabled customer service bots
- Telecom Integration: Enterprise telephony and AudioConnector systems
- Domain-specific Models: Medical, legal, technical vocabulary fine-tuning
Success Metrics
- Performance: <5% WER for domain-specific tasks
- Latency: <300ms end-to-end
- Reliability: 99.9% uptime for production services
- Scale: 1000+ concurrent speech streams