As we develop an enterprise-wide data and digital strategy that moves us toward a greater focus on data and data-driven insights, we are seeking an Assistant Scientist for our Data Engineering team. The role will support the team's efforts to create, enhance, and stabilize the enterprise data lake through the development of data pipelines. This role requires a team player who works well with colleagues from other disciplines to deliver data in an efficient and strategic manner.
What you'll be DOING
What will your essential responsibilities include?
- Proficiency in designing, developing, and maintaining ETL pipelines using PySpark within the Azure Databricks environment, with expertise in using Delta tables for optimized data storage and processing (a minimal sketch follows this list).
- Experience with source control and CI/CD pipelines, including creating builds from GitHub repositories and automating release processes for data ingestion workflows and Databricks jobs using Azure DevOps or Harness.
- Ability to monitor and analyze the performance of ETL jobs, identifying and resolving issues promptly, and implementing improvements to enhance efficiency and throughput.
- Robust troubleshooting skills to diagnose system performance bottlenecks related to data processing and implement effective solutions to optimize system performance.
- Collaborative communication skills to coordinate with cross-functional teams, ensuring seamless integration of data pipelines into broader system architecture and workflows.
- Commitment to data integrity and quality assurance, maintaining high standards across all pipelines and environments.
- Knowledge of secure coding practices to develop and maintain resilient, vulnerability-free code that adheres to security standards and best practices.
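
Purely as an illustration of the pipeline work described above, and not a prescribed implementation, the sketch below shows a minimal PySpark ETL step that reads raw files, applies basic cleansing, and appends the result to a Delta table. It assumes a Databricks environment where `spark` is predefined; the ADLS path, schema, table, and column names are hypothetical placeholders.

```python
# Minimal sketch only: an illustrative PySpark ETL step writing to a Delta table.
# Assumes a Databricks notebook where `spark` is predefined; the ADLS path,
# table, and column names below are hypothetical placeholders.
from pyspark.sql import functions as F

# Extract: read raw files landed in ADLS.
raw = (
    spark.read
    .option("header", "true")
    .csv("abfss://raw@examplestorage.dfs.core.windows.net/policies/")
)

# Transform: normalize column names (dict/list comprehensions), cast types,
# de-duplicate on the business key, and stamp the ingestion date.
renames = {c: c.strip().lower() for c in raw.columns}
cleaned = (
    raw.select([F.col(c).alias(renames[c]) for c in raw.columns])
       .withColumn("premium", F.col("premium").cast("double"))
       .withColumn("ingest_date", F.current_date())
       .dropDuplicates(["policy_id"])
)

# Load: append to a partitioned Delta table in the enterprise data lake.
(
    cleaned.write
    .format("delta")
    .mode("append")
    .partitionBy("ingest_date")
    .saveAsTable("enterprise_lake.policies")
)
```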
You will report to the Associate Scientist, Data Engineering.
What you will BRING
We're looking for someone who has these abilities and skills:
Required Skills and Abilities:
- Business & Insurance Acumen: Understanding the core principles of the insurance industry and overall business operations. This knowledge enables the data professional to interpret data insights in a meaningful way, aligning data solutions with business objectives, risk management, and regulatory requirements specific to insurance.
- Digital Literacy: Proficiency in understanding and utilizing digital tools, platforms, and data technologies. This includes familiarity with cloud computing platforms like Azure, data warehousing, and automation tools, allowing for effective communication with technical teams and translating business needs into technical solutions.
- Stakeholder Management: Effectively engaging with diverse stakeholders, including business leaders, data users, and technical teams, to gather requirements, communicate progress, and ensure that data platforms meet organizational needs. Building effective relationships and managing expectations are key to successful project delivery.
- Passion for Data & Data-Driven Culture: Demonstrated enthusiasm for working with data, with a clear focus on delivering value within a data-centric organization, and an outstanding sense of ownership and care for quality work.
- Education: Bachelor's degree in Computer Science, Mathematics, Statistics, Finance, or a related technical field, or equivalent professional experience.
- Data Engineering & Distributed Computing: Outstanding background in software development with hands-on experience in ingesting, transforming, and storing large datasets using PySpark within Azure Databricks, complemented by a solid understanding of distributed computing concepts.
- ETL Pipeline Development: Proven experience in designing and developing scalable ETL pipelines using PySpark on Azure Databricks, with proficiency in Python scripting, including techniques such as list comprehensions and dictionaries.
- Data Warehousing Expertise: Relevant years of experience and proficiency in data warehousing concepts, ensuring effective data modeling, storage, and retrieval strategies.
- Database & SQL Skills: Proficiency in SQL and database design principles, enabling efficient data querying, schema design, and performance optimization.
- Delta Lake Operations: Hands-on experience working with Delta Lake, including performing merge operations, insert overwrites, and partition management to maintain data integrity and optimize performance (see the sketch after this list).
- CI/CD & Orchestration: Practical experience implementing CI/CD pipelines using Azure DevOps or Harness, along with orchestration tools such as Azure Data Factory (ADF) or Stonebranch for automated workflows and deployment.
- Cloud Platform Knowledge: Familiarity with Azure cloud services, including Azure Synapse Analytics and Azure Data Lake Storage (ADLS), supporting scalable data solutions.
- Version Control & Build Management: Experience with GitHub for version control and managing build and release processes.
- Additional Tools (Plus): Exposure to Informatica ETL tools is advantageous, providing additional flexibility in data integration approaches.
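
As a hedged illustration of the Delta Lake operations named in the list above (a merge/upsert and a partition-scoped insert overwrite), the sketch below uses the standard Delta Lake Python API. It assumes a Databricks environment with `spark` predefined; the table, column, and partition values are placeholders, not part of any actual pipeline.

```python
# Minimal sketch only: a Delta Lake merge (upsert) and a partition-scoped
# insert overwrite using replaceWhere. Assumes Databricks with `spark`
# predefined; table, column, and partition values are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

updates = spark.table("enterprise_lake.policy_updates")

# Upsert: merge incoming rows into the target Delta table on the business key.
target = DeltaTable.forName(spark, "enterprise_lake.policies")
(
    target.alias("t")
    .merge(updates.alias("u"), "t.policy_id = u.policy_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Insert overwrite: rebuild a single ingest_date partition; the rows written
# must satisfy the replaceWhere predicate.
one_day = updates.where(F.col("ingest_date") == "2024-01-01")
(
    one_day.write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", "ingest_date = '2024-01-01'")
    .saveAsTable("enterprise_lake.policies")
)
```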
Desired Skills and Abilities:
- Customer Centricity: Prioritizing the needs of internal and external stakeholders by developing data solutions that support business objectives, ensuring data integrity, security, and usability to meet organizational goals.
- Cross-Functional Collaboration: Working effectively with data engineers, data scientists, business analysts, and IT teams to design, develop, and integrate data pipelines, fostering open communication and shared understanding to achieve seamless system architecture.
- Analytical & Strategic Mindset: Leveraging data insights and technical expertise to identify opportunities for process improvements, troubleshoot system performance issues, and develop scalable solutions aligned with long-term organizational strategies.
- Resilience: Remaining focused and adaptable when facing difficulties such as system failures, data inconsistencies, or evolving project requirements, ensuring continuous progress and reliability in delivering data platform solutions.
- Growth Mindset: Continuously seeking to learn new tools, technologies, and best practices in data engineering and AI platforms, embracing feedback, and striving for excellence to enhance personal and team performance.
- Performance Excellence: Maintaining high standards in developing secure, efficient, and reliable data pipelines, monitoring system performance, and optimizing processes to deliver high-quality results consistently.