We are seeking a talented and experienced Senior Data Engineer to join our Catalog Management department. In this role, you will design, develop, and maintain robust, scalable data pipelines and infrastructure on Google Cloud Platform (GCP). You will work closely with data scientists, analysts, and other engineers to ensure data is readily available, reliable, and optimized for a range of analytical and operational needs. A strong focus on building automated testing into our data solutions is a must. The ideal candidate has a deep background in Java development, Apache Spark, and GCP services, and a passion for building high-quality data solutions.
Responsibilities:
- Data Pipeline Development: Design, develop, and maintain efficient, scalable data pipelines using Apache Spark (primarily in Java), Apache Beam, or Kubeflow to ingest, process, and transform large datasets from a variety of sources (see the pipeline sketch following this list).
- GCP Infrastructure Management: Build, configure, and manage data infrastructure components on GCP, including BigQuery, Dataflow, Dataproc, Cloud Storage, Pub/Sub, and Cloud Functions.
- API Development and Maintenance: Develop and maintain RESTful APIs using Spring Boot to provide secure, reliable access to processed data and data services (a minimal controller sketch also follows this list).
- Data Modeling and Design: Design and implement optimized data models for analytical and operational use cases, considering performance, scalability, and data integrity.
- Data Quality Assurance: Implement comprehensive data quality checks and monitoring systems to ensure data accuracy, consistency, and reliability throughout the data lifecycle.
- Test Automation: Develop and maintain automated unit, integration, and end-to-end tests for data pipelines and APIs to ensure code quality and prevent regressions.
- Performance Optimization and Monitoring: Proactively monitor system performance, reliability, and scalability. Analyze performance metrics (CPU, memory, network) to identify bottlenecks, maintain system health, and ensure cost-efficiency.
- Collaboration and Communication: Collaborate with data scientists, analysts, product managers, architects, and other engineers to understand data requirements and translate them into effective technical solutions.
- Documentation: Create and maintain clear, comprehensive, and up-to-date documentation for data pipelines, infrastructure, and APIs, including design specifications, operational procedures, and troubleshooting guides.
- CI/CD Implementation: Implement and maintain robust CI/CD pipelines for automated deployment of data solutions, ensuring rapid and reliable releases.
- Production Support and Incident Management: Provide timely and effective support for production systems, including incident management, root cause analysis, and resolution.
- Continuous Learning: Stay current with the latest trends and technologies in data engineering, GCP, and related fields, and proactively identify opportunities to improve existing systems and processes.
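To make the pipeline work concrete, below is a minimal sketch in Java of the kind of Spark job this role involves: read raw events from Cloud Storage, transform them, and load them into BigQuery via the spark-bigquery connector. The bucket, column names, and table IDs are hypothetical placeholders, not a description of our actual pipelines.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

// Minimal sketch: ingest raw catalog events from Cloud Storage, clean and
// aggregate them, and write the result to BigQuery. All paths, column
// names, and table IDs below are hypothetical placeholders.
public class CatalogEventPipeline {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("catalog-event-pipeline")
                .getOrCreate();

        // Ingest: newline-delimited JSON dropped into a GCS bucket.
        Dataset<Row> raw = spark.read().json("gs://example-bucket/catalog/events/*.json");

        // Transform: drop malformed rows, then count events per product per day.
        Dataset<Row> daily = raw
                .filter(col("product_id").isNotNull())
                .withColumn("event_date", to_date(col("event_timestamp")))
                .groupBy("product_id", "event_date")
                .agg(count("*").alias("event_count"));

        // Load: write to BigQuery using the spark-bigquery connector, which
        // stages data in a temporary GCS bucket for indirect writes.
        daily.write()
                .format("bigquery")
                .option("table", "example_dataset.daily_product_events")
                .option("temporaryGcsBucket", "example-staging-bucket")
                .mode("overwrite")
                .save();

        spark.stop();
    }
}
```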
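The API work follows a similar shape. The minimal Spring Boot controller below exposes the output of a pipeline like the one above; the route, DTO, and service interface are hypothetical, and a real implementation would wire the service to a backing store such as BigQuery.

```java
import java.util.List;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// Minimal sketch of a read-only endpoint exposing processed catalog data.
// The record, service interface, and route are hypothetical placeholders.
@RestController
@RequestMapping("/api/v1/products")
public class ProductMetricsController {

    // Simple DTO for a daily event count; a record keeps it immutable.
    public record DailyEventCount(String productId, String eventDate, long eventCount) {}

    // Hypothetical service; a concrete bean would query the processed store.
    public interface ProductMetricsService {
        List<DailyEventCount> dailyCounts(String productId);
    }

    private final ProductMetricsService service;

    // Constructor injection keeps the controller easy to unit-test.
    public ProductMetricsController(ProductMetricsService service) {
        this.service = service;
    }

    @GetMapping("/{productId}/daily-counts")
    public List<DailyEventCount> dailyCounts(@PathVariable String productId) {
        return service.dailyCounts(productId);
    }
}
```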
Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 4-6 years of experience in data engineering or a related role.
- Strong proficiency in Java programming.
- Extensive experience with Apache Spark for data processing.
- Solid experience with Google Cloud Platform (GCP) services, including BigQuery, Dataflow, Dataproc, Cloud Storage, and Pub/Sub.
- Experience developing RESTful APIs using Spring Boot.
- Experience with test automation frameworks (e.g., JUnit, Mockito, REST Assured).
- Experience with CI/CD pipelines (e.g., Jenkins, GitLab CI, Cloud Build).
- Excellent problem-solving and analytical skills.
- Strong communication and collaboration skills.
Preferred Qualifications:
- Experience with other data processing technologies (e.g., Apache Beam, Flink).
- Experience with infrastructure-as-code tools (e.g., Terraform, Cloud Deployment Manager).
- Experience with data visualization tools (e.g., Tableau, Looker).
- Experience with containerization technologies (e.g., Docker, Kubernetes).
- Understanding of AI/GenAI concepts and their data requirements.
- Experience building data pipelines to support AI/ML models.
- Strong expertise in API testing tools (e.g., Postman).
- Solid experience in performance testing using JMeter.
Technical Skills:
- Strong proficiency in Java programming, including functional constructs (lambdas, streams) for scripting and automation.
- Solid understanding of cloud platforms, with a strong preference for Google Cloud Platform (GCP).
- Proven experience with modern test automation frameworks (e.g., JUnit, Mockito); a test sketch follows this list.
- Familiarity with system performance monitoring and analysis (CPU, memory, network).
- Experience with production monitoring, alerting, and support best practices.
- Strong debugging and troubleshooting skills to identify and resolve complex technical issues.
- Strong analytical, problem-solving, and communication skills.
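As a sketch of the test-automation expectations above, here is a JUnit 5 + Mockito unit test for the hypothetical controller from the API example earlier in this posting. It simply verifies that the controller delegates to its service; real suites would also cover integration and end-to-end paths.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.when;

import java.util.List;
import org.junit.jupiter.api.Test;

// JUnit 5 + Mockito sketch against the hypothetical ProductMetricsService
// from the controller example above.
class ProductMetricsControllerTest {

    @Test
    void dailyCountsDelegatesToService() {
        // Stub the service to return one known row for a known product ID.
        ProductMetricsController.ProductMetricsService service =
                mock(ProductMetricsController.ProductMetricsService.class);
        when(service.dailyCounts("p-123")).thenReturn(List.of(
                new ProductMetricsController.DailyEventCount("p-123", "2024-01-01", 42L)));

        ProductMetricsController controller = new ProductMetricsController(service);
        List<ProductMetricsController.DailyEventCount> result = controller.dailyCounts("p-123");

        // The controller should pass the stubbed data through unchanged.
        assertEquals(1, result.size());
        assertEquals(42L, result.get(0).eventCount());
        verify(service).dailyCounts("p-123");
    }
}
```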