
Closed
Posted on
Pay on delivery
The goal is to put together a complete, production-ready reference that moves our legacy Hive ecosystem into Apache Iceberg while keeping performance front and centre.

Scope of work
• Author a concise yet comprehensive guide that walks through every major Iceberg concept (catalogs, snapshots, partition evolution, time-travel, schema evolution, compaction, MERGE INTO, etc.) and pair each topic with at least one runnable exercise so the team can practise what they read.
• Build a reusable codebase that translates existing HQL/Hive queries into Iceberg-compatible syntax. Wherever possible, refactor logic so queries can scale smoothly when the data volume grows from 1 TB toward the upper end of our 10 TB footprint.
• Replace our current Sqoop import jobs with PySpark scripts that land data straight into Iceberg tables, taking advantage of Spark 3, the DataFrame API, and Iceberg's write options.
• Deliver a proof of concept that ingests ten representative tables (covering initial full load, daily incremental, and back-fill of historical partitions) while demonstrating snapshot isolation and fast rollback.
• Provide benchmark notes that highlight the scalability gains (partition pruning, vectorised reads, write-amplification reduction, etc.) we achieve after migration.

Acceptance criteria
– Exercises compile and run on Spark 3.x with Iceberg ≥ 1.2.
– Conversion utilities handle typical Hive constructs (dynamic partitions, CTAS, ORC/Parquet formats) without manual rewrites.
– PySpark ingestion completes within the target window on a 3-node test cluster at ~2 TB scale and shows linear growth characteristics as data volume increases.
– POC tables support time-travel queries and exhibit consistent results before and after incremental loads.

All source code should be version-controlled (Git) with README instructions and sample datasets so the internal team can reproduce results instantly.
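A minimal sketch of the kind of conversion utility described above, covering the common case of a Hive CTAS with STORED AS ORC/PARQUET. The function name, the sample table names, and the regex coverage are illustrative assumptions, not the delivered tool; `write.format.default` is the standard Iceberg table property for the file format.

```python
import re

def ctas_to_iceberg(hql: str) -> str:
    """Rewrite a Hive CTAS statement into Spark 3 Iceberg syntax.

    Hive declares the file format with STORED AS ORC|PARQUET; the Iceberg
    equivalent is USING iceberg, with the format carried as a table property.
    """
    m = re.search(r"STORED\s+AS\s+(ORC|PARQUET)", hql, re.IGNORECASE)
    if not m:
        return hql  # nothing to rewrite; pass the statement through
    fmt = m.group(1).lower()
    return re.sub(
        r"STORED\s+AS\s+(ORC|PARQUET)",
        f"USING iceberg TBLPROPERTIES ('write.format.default'='{fmt}')",
        hql,
        flags=re.IGNORECASE,
    )

# Example: hypothetical source/target table names
print(ctas_to_iceberg(
    "CREATE TABLE sales_ice STORED AS PARQUET AS SELECT * FROM sales_hive"
))
```

Because STORED AS appears before AS SELECT in a Hive CTAS, the substitution lands the USING clause in the position Spark SQL expects; a production utility would need a real parser rather than a regex to cover the full HQL grammar.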
Project ID: 40304621
4 proposals
Remote project
Active 27 days ago
4 freelancers are bidding on average ₹909 INR for this job

Your Iceberg migration from legacy Hive with that 10 TB footprint sounds like a solid modernization move. I'll build you the complete POC with PySpark scripts replacing those Sqoop jobs, plus a practical guide covering catalogs, time-travel, and schema evolution, with hands-on exercises your team can actually run. I built something similar with my price aggregation engine, which processes massive datasets; I had to optimize for performance and scalability when tracking 800+ products across multiple sources. I have also created automated systems that handle complex data pipelines, so I know how to structure reusable code that scales. You can check out my work at ffulb.com. I can start immediately and deliver the full migration POC with benchmarks within 2 weeks. The conversion utilities will handle your dynamic partitions and format transitions without manual rewrites.
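As a sketch of how that dynamic-partition handling can look (the helper name and sample SQL are illustrative assumptions, not the final utility): Iceberg on Spark 3 resolves partition columns from the data itself, so Hive's explicit PARTITION (...) clause is dropped, and overwrite-only-the-incoming-partitions semantics come from setting `spark.sql.sources.partitionOverwriteMode=dynamic` on the session.

```python
import re

def dynamic_insert_to_iceberg(hql: str) -> str:
    """Rewrite a Hive dynamic-partition INSERT for an Iceberg target table.

    Hive:     INSERT OVERWRITE TABLE t PARTITION (dt) SELECT ...
    Iceberg:  INSERT OVERWRITE t SELECT ...
    (with spark.sql.sources.partitionOverwriteMode=dynamic set separately)
    """
    out = re.sub(r"INSERT\s+OVERWRITE\s+TABLE\b", "INSERT OVERWRITE",
                 hql, flags=re.IGNORECASE)
    out = re.sub(r"\bPARTITION\s*\([^)]*\)", "", out, flags=re.IGNORECASE)
    return re.sub(r"[ \t]{2,}", " ", out)  # tidy doubled spaces left behind

# Hypothetical table names for illustration
print(dynamic_insert_to_iceberg(
    "INSERT OVERWRITE TABLE sales PARTITION (dt) "
    "SELECT id, amt, dt FROM staging"
))
```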
₹1,384 INR in 2 days
0.0
0.0

I am a senior Data Engineer with deep expertise in Hive, Spark, PySpark, SQL, ETL, and Apache Iceberg migration. I have successfully delivered multiple enterprise data warehouse modernization projects from legacy Hive to Iceberg, focusing on performance, scalability, and production readiness.

I can fully deliver your project within 7 days, including:
• A complete Iceberg guide with runnable exercises covering catalogs, snapshots, partition evolution, time-travel, compaction, MERGE INTO, etc.
• A reusable codebase to convert Hive/HQL queries to Iceberg-compatible syntax for 1 TB–10 TB scaling.
• PySpark scripts to replace Sqoop jobs for full load, daily incremental, and back-fill into Iceberg tables using the Spark 3 DataFrame API.
• A working POC for 10 tables with snapshot isolation and fast rollback demonstrated.
• A benchmark report on scalability, partition pruning, vectorized reads, and write-amplification reduction.
• All code versioned in Git with a clear README and reproducible steps.

All deliverables will comply with your acceptance criteria:
• Runnable on Spark 3.x + Iceberg ≥ 1.2
• Support for Hive dynamic partitions, CTAS, ORC/Parquet
• Linear scalability on a 3-node cluster at 2 TB+ data
• Correct time-travel and incremental load results

I am ready to start immediately and deliver a production-ready solution on time. Looking forward to working with you.
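The daily incremental loads mentioned above are typically expressed as MERGE INTO statements, which Iceberg supports on Spark 3 as its upsert syntax. A minimal sketch of a statement builder follows; the catalog, table, and column names are hypothetical placeholders.

```python
def build_daily_merge(target, source, keys, update_cols):
    """Build a Spark SQL MERGE INTO statement for a daily incremental load.

    Rows matching on the key columns are updated in place; new rows from the
    staging source are inserted.
    """
    all_cols = list(keys) + list(update_cols)
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in update_cols)
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"s.{c}" for c in all_cols)
    return (
        f"MERGE INTO {target} t\n"
        f"USING {source} s\n"
        f"ON {on_clause}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )

# Hypothetical Iceberg target and staging source
print(build_daily_merge("lake.db.orders", "staging.orders_delta",
                        keys=["order_id"], update_cols=["status", "amount"]))
```

Generating the statement as a string keeps the builder testable without a Spark cluster; in the actual job it would be passed to `spark.sql(...)`.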
₹600 INR in 4 days
0.0
0.0

Hello, I can help fix your Iceberg migration and ingestion POC by identifying pipeline issues, resolving ingestion errors, and ensuring smooth data flow with proper schema handling and performance tuning. I have experience working with Apache Iceberg, ETL pipelines, and cloud-based data lakes, ensuring reliable and scalable solutions. I’ll debug the current setup, optimize ingestion, and deliver a stable, production-ready POC. Let’s discuss your current errors and setup. Regards, Bharti
₹1,050 INR in 7 days
0.0
0.0

Bangalore, India
Payment method verified
Member since Nov 3, 2015
₹600-1500 INR