Netflix Maestro: Orchestrating Scaled Data & ML Workflows in the Cloud (NFX308)

Title

AWS re:Invent 2023 - Netflix Maestro: Orchestrating scaled data & ML workflows in the cloud (NFX308)

Summary

  • Netflix Maestro is the workflow orchestrator Netflix uses internally to manage and automate ETL pipelines and machine learning workflows.
  • Maestro is a fully managed service that offers workflow-as-a-service to thousands of Netflix users, with an emphasis on reliability and scalability.
  • It features a workflow engine, an alerting service, error classification services, and a user interface with templates and a domain-specific language for easy workflow definition.
  • Maestro supports a variety of use cases, including data processing, model training, A/B testing, and more.
  • It was built in-house due to the lack of existing solutions that could handle Netflix's scale and variety of workflows.
  • Maestro's architecture includes an API gateway, a core workflow engine with versioning and triggering support, and integration with downstream services via Kafka.
  • The Maestro DSL (Domain-Specific Language) is available in YAML, Python, and Java, making workflow definitions readable, reproducible, and debuggable.
  • Maestro supports parameterized workflows with features like conditional branching and sub-workflows, as well as dynamic code injection for custom logic.
  • It is extensible, allowing users to create new step types and bring their own compute resources.
  • Maestro executes workflows efficiently and reliably: each job runs in isolation in a Docker container, and notebooks are executed with Papermill.
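
The parameterized workflows, conditional branching, and sub-workflows mentioned in the summary can be illustrated with a small, generic Python sketch. This is not the Maestro DSL or API, which the summary does not show. Every name here (Step, run_workflow, cleanup_subworkflow, and the date, table, and mode parameters) is hypothetical, and the rule that a step is skipped when one of its upstream steps was skipped is an assumption of this sketch only.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, List, Optional

Params = Dict[str, Any]
Results = Dict[str, Any]


@dataclass
class Step:
    """One unit of work in the toy workflow (hypothetical, not a Maestro type)."""
    name: str
    run: Callable[[Params, Results], Any]
    depends_on: List[str] = field(default_factory=list)
    # Optional predicate on the parameters; the step is skipped when it is false.
    condition: Optional[Callable[[Params], bool]] = None


def run_workflow(steps: List[Step], params: Params) -> Results:
    """Run steps in dependency order and return the results of executed steps."""
    by_name = {step.name: step for step in steps}
    graph = {step.name: set(step.depends_on) for step in steps}
    results: Results = {}
    for name in TopologicalSorter(graph).static_order():
        step = by_name[name]
        # Assumption of this sketch: skip a step if any upstream step was skipped.
        if any(dep not in results for dep in step.depends_on):
            continue
        if step.condition is not None and not step.condition(params):
            continue  # branch not selected by the parameters
        results[name] = step.run(params, results)
    return results


def cleanup_subworkflow(params: Params, results: Results) -> Results:
    """A step that runs a nested workflow, standing in for a sub-workflow."""
    inner = [Step("vacuum", lambda p, r: f"vacuumed {p['table']}")]
    return run_workflow(inner, params)


steps = [
    Step("extract", lambda p, r: f"rows for {p['date']}"),
    Step("full_refresh", lambda p, r: "rebuilt " + p["table"],
         depends_on=["extract"], condition=lambda p: p["mode"] == "full"),
    Step("incremental", lambda p, r: "appended to " + p["table"],
         depends_on=["extract"], condition=lambda p: p["mode"] == "incremental"),
    Step("cleanup", cleanup_subworkflow, depends_on=["incremental"]),
]

results = run_workflow(
    steps, {"date": "2023-01-01", "table": "plays", "mode": "incremental"}
)
print(results)
```

With mode set to "incremental", the extract, incremental, and cleanup steps run and full_refresh is skipped. With mode set to "full", only extract and full_refresh run, because cleanup depends on the skipped incremental branch. The same workflow definition is reused for both cases, and only the parameters change.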

Insights

  • Netflix's decision to build Maestro in-house highlights the unique challenges faced by large-scale data-driven companies and the limitations of existing workflow orchestration tools.
  • Maestro's design emphasizes user-friendliness and flexibility, catering to a diverse user base with different technical backgrounds and preferences.
  • The use of domain-specific languages and parameterization in Maestro simplifies the process of defining complex workflows, making it accessible to both engineers and non-engineers.
  • The integration of Maestro with other Netflix tools like Metaflow suggests a cohesive ecosystem for data and ML operations at Netflix.
  • Maestro's ability to handle spiky and uneven loads, with tens of thousands of workflows and millions of jobs per day, demonstrates its robustness and the importance of scalability in workflow orchestration.
  • The presentation of Maestro at AWS re:Invent 2023 indicates Netflix's willingness to share its internal tools and practices with the broader tech community, potentially influencing the development of similar tools in the industry.