Analyzing Streaming Data with Apache Druid (ANT213)

Title

AWS re:Invent 2023 - Analyzing streaming data with Apache Druid (ANT213)

Summary

  • Apache Druid is used for analyzing streaming data in real time; its true stream ingestion allows sub-second analysis.
  • Real-time operations and context-aware decisions are key benefits, with examples from ThousandEyes and Atlassian demonstrating Druid's high concurrency and real-time analytics capabilities.
  • Operational visibility at scale is a major use case; the New York Stock Exchange is given as an example because it needs to handle large-scale network attacks.
  • Druid originated in the ad tech industry and is still widely used for real-time advertising decisioning, with Reddit as an example.
  • Streaming data is growing rapidly, with an estimated 50% of high-value business data expected to be streamed within three years.
  • Druid's design ensures high scalability, high concurrency, and nonstop reliability, with data continuously backed up to S3.
  • The database can handle both real-time and historical data, and can ingest data via streams or in batch.
  • Druid was created by developers at an ad tech company in 2010 and was open-sourced in 2012; Netflix was the first major adopter.
  • Imply, founded by the creators of Apache Druid, aims to support and advance open-source Druid, offering enhanced security, cloud deployment options, and support services.
  • Druid is available on AWS Marketplace and integrates with AWS services like Kinesis and MSK.
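
As a rough illustration of the stream ingestion and MSK integration points above, the sketch below builds a Druid Kafka supervisor spec (Amazon MSK exposes a Kafka-compatible endpoint) and a Druid SQL request body as plain Python dictionaries. The datasource name, topic, broker address, and dimension names are hypothetical placeholders, not details from the talk, and the field layout follows the publicly documented Kafka supervisor spec as best understood here, so it should be checked against the Druid version in use.

```python
import json


def build_kafka_supervisor_spec(datasource, topic, bootstrap_servers, dimensions):
    """Return a Druid Kafka supervisor spec as a plain dict (illustrative sketch)."""
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Assumes each event carries an ISO-8601 "timestamp" field.
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": list(dimensions)},
                "granularitySpec": {
                    "segmentGranularity": "hour",
                    "queryGranularity": "none",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                # For MSK this would be the cluster's bootstrap broker string.
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
                "useEarliestOffset": True,
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


def build_sql_request(datasource):
    """Return a Druid SQL API request body counting events from the last minute."""
    query = (
        f'SELECT COUNT(*) AS events FROM "{datasource}" '
        "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' MINUTE"
    )
    return {"query": query}


if __name__ == "__main__":
    # Hypothetical names, chosen only for illustration.
    spec = build_kafka_supervisor_spec(
        datasource="network_events",
        topic="network-events",
        bootstrap_servers="broker-1.example.internal:9092",
        dimensions=["src_ip", "dst_ip", "status"],
    )
    print(json.dumps(spec, indent=2))
    print(json.dumps(build_sql_request("network_events")))
```

In a typical deployment the spec would be sent with an HTTP POST to the supervisor endpoint (/druid/indexer/v1/supervisor) and the query to the SQL endpoint (/druid/v2/sql); the sketch stops at building the JSON so it can run without a cluster.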

Insights

  • Apache Druid is particularly well-suited for applications requiring real-time analytics and decision-making, such as operational monitoring, observability, and interactive data exploration.
  • The growth of streaming data and the need for real-time analysis are driving the adoption of technologies like Druid, which can handle the scale and speed required by modern data streams.
  • Druid's ability to provide high concurrency without significant resource expenditure makes it a cost-effective solution for organizations with large numbers of concurrent users.
  • The open-source nature of Druid, coupled with commercial support and services from Imply, provides flexibility for organizations to choose the level of support and security they need.
  • Druid's integration with AWS services and its availability in the AWS Marketplace make it easier for AWS customers to adopt, and reflect the close relationship between cloud services and streaming data analytics platforms.
  • The use cases presented in the talk, such as the New York Stock Exchange for operational visibility and Reddit for advertising decisioning, demonstrate the wide applicability of Druid across different industries and scenarios.