Posted: 2 weeks ago
Platform:
Remote
Full Time
Key Responsibilities:
- Design and develop scalable PySpark pipelines to ingest, parse, and process XML datasets with extreme hierarchical complexity.
- Implement efficient XPath expressions, recursive parsing techniques, and custom schema definitions to extract data from nested XML structures.
- Optimize Spark jobs through partitioning, caching, and parallel processing to handle terabytes of XML data efficiently.
- Transform raw hierarchical XML data into structured DataFrames for analytics, machine learning, and reporting use cases.
- Collaborate with data architects and analysts to define data models for nested XML schemas.
- Troubleshoot performance bottlenecks and ensure reliability in distributed environments (e.g., AWS, Databricks, Hadoop).
- Document parsing logic, data lineage, and optimization strategies for maintainability.

Qualifications:
- 5+ years of hands-on experience with PySpark and Spark XML libraries (e.g., `spark-xml`) in production environments.
- Proven track record of parsing XML data with 20+ levels of nesting using recursive methods and schema inference.
- Expertise in XPath, XQuery, and DataFrame transformations (e.g., `explode`, `struct`, `selectExpr`) for hierarchical data.
- Strong understanding of Spark optimization techniques: partitioning strategies, broadcast variables, and memory management.
- Experience with distributed computing frameworks (e.g., Hadoop, YARN) and cloud platforms (AWS, Azure, GCP).
- Familiarity with big data file formats (Parquet, Avro) and orchestration tools (Airflow, Luigi).
- Bachelor's degree in Computer Science, Data Engineering, or a related field.

Preferred Skills:
- Experience with schema evolution and versioning for nested XML/JSON datasets.
- Knowledge of Scala or Java for extending Spark XML libraries.
- Exposure to Databricks, Delta Lake, or similar platforms.
- Certifications in AWS/Azure big data technologies.
Victrix Systems And Labs
8.0 - 14.0 Lacs P.A.