Title
AWS re:Invent 2022 - Migrating to Amazon EMR to reduce costs and simplify operations (ANT337)
Summary
-
LendingClub's Migration Experience:
- LendingClub, a fintech company, migrated from an on-premise Hadoop environment to Amazon EMR.
- They faced challenges with their on-premise setup, including limited burst capacity, difficulty in upgrades, and a lack of agility.
- The migration to EMR allowed them to handle burst workloads, automate node recovery, and improve cost efficiency.
- They adopted a lift-and-shift strategy, ensuring identical outputs in both on-premise and AWS environments.
- They encountered issues with S3 performance and had to refactor jobs to work with S3 instead of HDFS.
- They developed tools for data validation, replication monitoring, and job performance comparison.
- They achieved better scaling with EMR managed scaling and custom metrics.
- They plan to further optimize costs, explore containers and serverless options, and adopt new data formats like Iceberg and Hudi.
-
Thomson Reuters' Migration Experience:
- Thomson Reuters migrated their on-premise Hadoop systems to AWS to address challenges such as multi-tenant clusters, inefficient DevOps, and lack of active-active redundancy.
- They developed a bidirectional copying system to migrate workflows incrementally without impacting other workflows.
- They established shared frameworks and utilities to ensure consistency across migrations.
- They created a catalog for data and exposed it to the company, leading to new business features.
- They moved from multi-tenant clusters to ephemeral EMR clusters, improving stability and allowing per-workflow tech stack updates.
- They reduced compute resources by nearly half and improved time to market for new features.
- They leveraged Apache Hoodie for near-real-time data replication and incremental updates.
- They emphasized the importance of a strong foundation, leveraging AWS expertise, and empowering developers.
Insights
-
LendingClub's Insights:
- Migrating to the cloud requires careful planning, especially when dealing with data fabric changes like moving from HDFS to S3.
- Custom tooling and monitoring are crucial for ensuring data accuracy and SLA compliance during migration.
- EMR managed scaling and custom metrics can significantly improve resource utilization and cost efficiency.
- Open source solutions and collaboration with AWS support can help overcome migration challenges.
-
Thomson Reuters' Insights:
- Incremental migration with a bidirectional copying system can untangle complex dependencies and facilitate a smoother transition.
- Establishing shared frameworks, utilities, and a data catalog can standardize the migration process and improve data accessibility.
- Ephemeral EMR clusters can address the challenges of multi-tenant environments and improve workflow stability.
- Leveraging newer technologies like Apache Hoodie can drastically reduce data replication times and enable more dynamic data processing.
- Empowering developers to directly engage with AWS support can expedite problem resolution and reduce bottlenecks in the migration process.