Senior Data Engineer - Data Processing & Feature Engineering
Location: Coimbatore
Experience Level: 6+ years
About the Role
We are seeking exceptional Senior Data Engineers to build the data foundation powering Velogent AI's autonomous agents. You will design and implement large-scale data ingestion, processing, and feature engineering systems that transform unstructured enterprise data (invoices, documents, transactions, RFQs) into structured, high-quality datasets. Your work enables agentic AI systems to make accurate, compliance-aware decisions while meeting the data quality, lineage, and auditability standards required by regulated industries.
Core Responsibilities
- Design and architect end-to-end data pipelines processing large volumes of unstructured enterprise data (documents, PDFs, transaction records, email, etc.)
- Build sophisticated data ingestion frameworks supporting multiple data sources and formats with automated validation and quality checks (a brief pipeline sketch follows this list)
- Implement large-scale data processing solutions using distributed computing frameworks that handle terabytes of data efficiently
- Develop advanced feature engineering pipelines extracting meaningful signals from unstructured data (document classification, entity extraction, semantic tagging)
- Design data warehousing architecture supporting both operational (near real-time) and analytical queries for agentic AI reasoning
- Build robust data quality frameworks ensuring high data accuracy critical for agent decision-making and regulatory compliance
- Implement data governance patterns including lineage tracking, metadata management, and audit trails for regulated environments
- Optimize data pipeline performance, reliability, and cost through intelligent partitioning, caching, and resource optimization
- Lead data security implementation protecting sensitive information (PII, financial data, healthcare records) with encryption and access controls
- Collaborate with AI engineers to understand data requirements and optimize data for model training and inference
- Establish best practices for data documentation, SLA management, and operational excellence
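To give a flavor of the ingestion-and-validation work described above, here is a minimal sketch using PySpark. The bucket paths, column names, and validation rules are hypothetical stand-ins, not a production schema.

```python
# Minimal ingestion sketch with automated quality gates (PySpark).
# Paths and column names below are hypothetical examples.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("invoice-ingest").getOrCreate()

raw = spark.read.json("s3://example-bucket/raw/invoices/")

# Quality gates: required key present, amount present and non-negative.
is_valid = (
    F.col("invoice_id").isNotNull()
    & F.col("amount").isNotNull()
    & (F.col("amount") >= 0)
)

valid = raw.filter(is_valid).withColumn("ingest_date", F.current_date())
rejected = raw.filter(~is_valid)  # quarantined for inspection and audit

rejected.write.mode("append").parquet("s3://example-bucket/quarantine/invoices/")
valid.write.mode("append").partitionBy("ingest_date").parquet(
    "s3://example-bucket/clean/invoices/"
)
```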
Must-Have Qualifications
- Unstructured Data Expertise: Production experience ingesting and processing large volumes of unstructured data (documents, PDFs, images, text, logs)
- Large-Scale Data Processing: Advanced expertise with distributed data processing frameworks (Apache Spark, Flink, or cloud-native alternatives like AWS Glue)
- Feature Engineering: Deep knowledge of advanced feature engineering techniques for ML systems, including automated feature extraction and transformation
- Python Proficiency: Expert-level Python for data processing, ETL pipeline development, and data science workflows
- NLP/Text Processing: Strong background in NLP and text analysis techniques for document understanding, entity extraction, and semantic processing
- Data Architecture: Experience designing data warehouses, data lakes, or lakehouse architectures supporting both batch and real-time processing
- ETL/ELT Pipeline Design: Proven expertise building production-grade ETL/ELT pipelines with error handling, retry logic, and monitoring (a retry-pattern sketch follows this list)
- Cloud Data Platforms: Advanced experience with AWS data services (S3, Athena, Glue, RDS, DynamoDB) or equivalent cloud platforms
- Data Quality & Governance: Understanding of data quality frameworks, metadata management, and data governance practices
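To illustrate the retry logic called out above, here is a minimal, dependency-free sketch of exponential backoff with jitter; `extract_batch` and its source path are hypothetical stand-ins.

```python
# Retry-with-exponential-backoff sketch for a flaky extract step.
import logging
import random
import time

log = logging.getLogger("etl")

def with_retries(fn, max_attempts=3, base_delay=1.0):
    """Run fn(), retrying on failure with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                log.exception("giving up after %d attempts", attempt)
                raise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            log.warning("attempt %d failed; retrying in %.1fs", attempt, delay)
            time.sleep(delay)

# Usage (extract_batch is a hypothetical extract function):
# rows = with_retries(lambda: extract_batch("s3://example-bucket/raw/"))
```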
Nice-to-Have Qualifications
- Experience with document parsing and layout analysis libraries (unstructured.io, PyPDF, etc.); a brief extraction sketch follows this list
- Knowledge of information extraction pipelines and vector databases for semantic search
- Familiarity with Apache Kafka or other event streaming platforms for real-time data processing
- Experience with dbt (data build tool) or similar data transformation frameworks
- Understanding of data privacy and compliance frameworks (GDPR, HIPAA, SOC2)
- Experience optimizing costs in cloud data platforms through intelligent resource allocation
- Background in building recommendation systems or ranking systems using feature engineering
- Knowledge of graph databases and knowledge graphs for relationship extraction
- Familiarity with computer vision techniques for document analysis and processing
- Published work or open-source contributions in NLP, document processing, or data engineering
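For the document parsing item above, a tiny sketch of text extraction with pypdf; the file name is a hypothetical example.

```python
# Text extraction from a PDF with pypdf; the file path is illustrative.
from pypdf import PdfReader

reader = PdfReader("example_invoice.pdf")
# extract_text() can return None for image-only pages, hence the fallback.
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])
```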
What You'll Work With
- Large-scale document processing pipelines handling millions of invoices, contracts, and business documents
- Apache Spark and distributed computing frameworks for ETL
- AWS data services (S3, Glue, Athena, RDS) for data infrastructure
- Advanced NLP and text processing libraries (spaCy, transformers, LangChain); an entity extraction sketch appears after this list
- Vector databases and semantic search infrastructure
- Data quality and monitoring frameworks
- Cloud data warehouses and data lakes on AWS
- Compliance and governance frameworks for regulated industries
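As a taste of the NLP tooling listed above, a minimal entity extraction sketch with spaCy; the model name and invoice text are illustrative, and it assumes the `en_core_web_sm` model is installed.

```python
# Entity extraction with spaCy, of the kind used in document understanding.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Invoice 4711 from Acme Corp, dated 12 March 2024, totals $1,250.00.")

# Each entity is a text span with a predicted label (ORG, DATE, MONEY, ...).
for ent in doc.ents:
    print(ent.text, ent.label_)
```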