Title
AWS re:Invent 2022 - Build your open data lakehouse with Dremio and AWS (PRT085)
Summary
- Brett Roberts, Principal Partner Solutions Architect at Dremio, discusses building an open data lakehouse with Dremio and AWS.
- Dremio is an open data lakehouse platform that allows high-performance BI directly on the data lake without data movement or copying.
- Dremio contributes to open source with projects like Apache Arrow, Project Nessie, and Apache Iceberg.
- Data lakehouses combine the best features of data warehouses and data lakes into a unified architecture.
- Key components of a data lakehouse include scalable data lake storage (e.g., Amazon S3), open file and table formats (e.g., Parquet, Apache Iceberg), query engines, and a self-service semantic layer.
- Dremio's reference architecture with AWS includes Dremio Arctic (an intelligent metastore) and integrates with AWS services such as Glue and Lake Formation for security and governance.
- An open architecture provides flexibility and makes it easier to adopt new technologies as they emerge.
- Dremio offers a SaaS version, Dremio Cloud, that users can use to try out the platform.
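The layering listed in the summary (data lake storage, an open table format on top of it, and interchangeable query engines) can be illustrated with a toy model. This is a minimal sketch in plain Python, not Dremio, Apache Iceberg, or Amazon S3 code. Every class and function name below (ObjectStore, Table, count_engine, sum_engine) is hypothetical and exists only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectStore:
    """Hypothetical stand-in for data lake storage such as Amazon S3."""
    objects: dict = field(default_factory=dict)  # object key -> list of row dicts

    def put(self, key, rows):
        self.objects[key] = rows

    def get(self, key):
        return self.objects[key]


@dataclass
class Table:
    """Hypothetical stand-in for an open table format: metadata that lists data files."""
    name: str
    data_files: list = field(default_factory=list)  # keys into the object store


def count_engine(store, table):
    """First toy 'query engine': counts rows by reading the table's data files."""
    return sum(len(store.get(key)) for key in table.data_files)


def sum_engine(store, table, column):
    """Second toy 'query engine': sums one column over the same data files."""
    return sum(row[column] for key in table.data_files for row in store.get(key))


# Write the data once into shared storage.
store = ObjectStore()
store.put("sales/part-0", [{"amount": 10}, {"amount": 5}])
store.put("sales/part-1", [{"amount": 7}])
sales = Table("sales", ["sales/part-0", "sales/part-1"])

# Both engines read the same files through the same table metadata.
row_count = count_engine(store, sales)
total = sum_engine(store, sales, "amount")
print(row_count, total)
```

The point of the sketch is that the table metadata, not any one engine, defines what the table is, so a second engine can be added without copying the underlying objects.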
Insights
- The concept of a data lakehouse is gaining traction as organizations seek to modernize their data architectures by combining the benefits of data lakes and data warehouses.
- Open source contributions by Dremio, such as Apache Arrow, are significant in the data community, with Arrow seeing around 60 million downloads per month.
- The move towards open architectures is driven by the need for flexibility, cost-effectiveness, and the ability to avoid vendor lock-in.
- Dremio's partnership with AWS and its use of services like S3, Glue, and Lake Formation indicate strong integration with the AWS ecosystem, which benefits AWS customers looking to implement a data lakehouse.
- The emphasis on a self-service semantic layer and the ability to perform Git-like operations on data catalogs (Project Nessie) highlight the importance of agility and collaboration in modern data teams.
- The presentation suggests a growing trend towards decoupling compute from data, so that multiple query engines can work on the same data sets without creating redundant copies, which improves efficiency and reduces costs.
- Dremio's offering of a free SaaS version for users to test the platform demonstrates confidence in its product and a customer-centric approach to adoption.
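The Git-like catalog operations mentioned above can be pictured as branches that hold pointers to table snapshots rather than copies of the data. The following is a minimal sketch of that idea in plain Python. It is not Project Nessie's API. The Catalog class, its methods, and the snapshot ids are hypothetical, and the merge shown is a naive overwrite with no conflict detection.

```python
class Catalog:
    """Toy catalog: each branch maps table names to a snapshot id."""

    def __init__(self):
        self.branches = {"main": {}}

    def commit(self, branch, table, snapshot_id):
        # Point the table at a new snapshot on the given branch.
        self.branches[branch][table] = snapshot_id

    def create_branch(self, name, from_branch="main"):
        # Copies pointers only; no data files are duplicated.
        self.branches[name] = dict(self.branches[from_branch])

    def merge(self, source, target="main"):
        # Naive merge (assumption): source pointers overwrite target pointers.
        self.branches[target].update(self.branches[source])


catalog = Catalog()
catalog.commit("main", "sales", "snap-1")

# Work in isolation on a branch; main is unaffected until the merge.
catalog.create_branch("etl")
catalog.commit("etl", "sales", "snap-2")
main_before_merge = catalog.branches["main"]["sales"]

catalog.merge("etl")
main_after_merge = catalog.branches["main"]["sales"]
print(main_before_merge, main_after_merge)
```

Readers on the main branch keep seeing the old snapshot while the branch is being changed, and the merge publishes the new snapshot in one step.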