1. Design, build, and maintain scalable data pipelines and workflows using Databricks (SQL, PySpark, Delta Lake).
2. Develop efficient ETL/ELT pipelines for structured and semi-structured data using Databricks notebooks/jobs.
3. Integrate and transform large-scale datasets from multiple sources into unified, analytics-ready outputs.
4. Optimize Spark jobs and manage Delta Lake performance using techniques such as partitioning, Z-ordering, broadcast joins, and caching.
5. Design and implement data ingestion pipelines for RESTful APIs, transforming JSON responses into Spark tables (see the sketch after this list).
6. Apply best practices in data modeling and data warehousing concepts.
7. Perform data validation and quality checks.
8. Work with various data formats, including JSON, Parquet, and Avro.
9. Build and manage data orchestration pipelines, including linked services and datasets for Databricks and SQL Server/RDBMS sources.
10. Collaborate closely with Data Scientists, Data Analysts, Business Analysts, and Data Architects to deliver trusted, high-quality datasets.
11. Contribute to data governance and metadata documentation, and ensure adherence to data quality standards.
12. Use version control tools (e.g., Git) and CI/CD pipelines to manage code deployment and workflow changes.
13. Develop real-time and batch processing pipelines for streaming data sources such as MQTT, Kafka, and Event Hub.
14. Integrate with internal and external APIs.
15. Work with Git, DevOps pipelines, and Agile delivery methodologies.
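For illustration only, the sketch below shows the kind of pipeline described in items 4 and 5: pulling JSON from a REST API, landing it as a partitioned Delta table, and running a Z-order optimization. The endpoint URL, column names (order_ts, amount, customer_id), and table path are hypothetical placeholders, not details from this posting.

```python
# Hypothetical sketch: REST API -> JSON -> Spark DataFrame -> partitioned Delta table.
import json

import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = (
    SparkSession.builder
    .appName("rest-ingestion-sketch")
    .getOrCreate()
)

# Fetch a page of records from a (hypothetical) REST endpoint.
response = requests.get("https://api.example.com/v1/orders", timeout=30)
response.raise_for_status()
records = response.json()  # assumed to be a list of JSON objects

# Parse the JSON payload into a DataFrame; a production job would pin an
# explicit schema instead of relying on inference.
rdd = spark.sparkContext.parallelize([json.dumps(r) for r in records])
df = spark.read.json(rdd)

# Light transformation: typed columns plus a partition key.
df = (
    df.withColumn("order_date", to_date(col("order_ts")))
      .withColumn("amount", col("amount").cast("double"))
)

# Write to Delta, partitioned by date so downstream queries can prune files.
(
    df.write.format("delta")
      .mode("append")
      .partitionBy("order_date")
      .save("/mnt/lake/silver/orders")
)

# Periodic maintenance (Databricks Delta): Z-order frequently filtered columns.
spark.sql("OPTIMIZE delta.`/mnt/lake/silver/orders` ZORDER BY (customer_id)")
```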
Required qualifications to be successful in this role:
1. 5+ years of hands-on experience in data engineering or big data development.
2. Demonstrated experience designing, building, and maintaining scalable data pipelines and workflows using Databricks (SQL, PySpark, Delta Lake).
3. Experience developing efficient ETL/ELT pipelines for structured and semi-structured data using Databricks notebooks/jobs.
4. Strong hands-on experience with Databricks and Apache Spark (PySpark/SQL).
5. Proven experience with AWS services or Azure Data Lake and related services.
6. Proficiency in SQL and Python for data processing, transformation, and validation (a brief sketch follows this list).
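As a minimal sketch of the SQL/Python validation work referenced in item 6, the snippet below runs basic row-count and null-rate checks on a Delta table before it is published downstream. The table path, key column, and 1% threshold are illustrative assumptions, not requirements from this posting.

```python
# Hypothetical data-quality gate on a Delta table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("dq-check-sketch").getOrCreate()

# Load the (hypothetical) table produced by the ingestion pipeline.
df = spark.read.format("delta").load("/mnt/lake/silver/orders")

total_rows = df.count()
null_keys = df.filter(col("customer_id").isNull()).count()

# Fail fast if the load is empty or the key column is too sparse.
assert total_rows > 0, "orders table loaded zero rows"
assert null_keys / total_rows < 0.01, (
    f"customer_id null rate {null_keys / total_rows:.2%} exceeds 1%"
)
```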