Title
AWS re:Invent 2023 - Netflix Maestro: Orchestrating scaled data & ML workflows in the cloud (NFX308)
Summary
- Netflix Maestro is the workflow orchestrator Netflix uses internally to manage and automate ETL pipelines and machine learning workflows.
- Maestro runs as a fully managed service, offering workflow as a service to thousands of Netflix users with an emphasis on reliability and scalability.
- It comprises a workflow engine, an alerting service, error classification services, and a user interface with templates and a domain-specific language for defining workflows.
- Maestro supports a variety of use cases, including data processing, model training, A/B testing, and more.
- Netflix built it in-house because no existing solution could handle the scale and variety of its workflows.
- Maestro's architecture includes an API gateway, a core workflow engine with versioning and triggering support, and integration with downstream services via Kafka.
- The Maestro DSL (Domain-Specific Language) is available in YAML, Python, and Java, making workflow definitions readable, reproducible, and debuggable.
- Maestro supports parameterized workflows with features like conditional branching and sub-workflows, as well as dynamic code injection for custom logic.
- It is extensible, allowing users to create new step types and bring their own compute resources.
- Maestro executes workflows efficiently and reliably: each job runs in isolation in a Docker container, and notebooks are executed with Papermill.
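
The parameterized workflows, conditional branching, and sub-workflows mentioned above can be illustrated with a small sketch. This is not Maestro's DSL or API, which the summary does not show; the `Step` and `Workflow` classes and every parameter name below are hypothetical, chosen only to make the ideas concrete in plain Python.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Params = Dict[str, Any]


@dataclass
class Step:
    """Hypothetical unit of work; `condition` decides whether it runs."""
    name: str
    action: Callable[[Params], Any]
    condition: Optional[Callable[[Params], bool]] = None


@dataclass
class Workflow:
    """Hypothetical workflow: an ordered list of steps or nested workflows."""
    name: str
    steps: List[Any] = field(default_factory=list)

    def run(self, params: Params) -> List[str]:
        """Run items in order and return the names of the steps that executed."""
        executed: List[str] = []
        for item in self.steps:
            if isinstance(item, Workflow):
                # Sub-workflow: shares the parameters, names are prefixed.
                executed += [f"{item.name}.{n}" for n in item.run(params)]
            elif item.condition is None or item.condition(params):
                # Store the step's result so later steps can read it.
                params[item.name] = item.action(params)
                executed.append(item.name)
        return executed


# A sub-workflow whose "publish" step is gated by a parameter.
training = Workflow("train", [
    Step("fit", lambda p: f"model-{p['region']}"),
    Step("publish", lambda p: True, condition=lambda p: p["publish"]),
])

pipeline = Workflow("daily", [
    Step("extract", lambda p: f"rows for {p['date']}"),
    training,
    Step("report", lambda p: "ok"),
])

executed = pipeline.run({"date": "2023-11-28", "region": "us", "publish": False})
print(executed)  # ['extract', 'train.fit', 'report']
```

The same `pipeline` definition produces a different execution path when `publish` is set to `True`, which is the point of parameterization: one definition, many runs.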
Insights
- Netflix's decision to build Maestro in-house highlights the unique challenges faced by large-scale data-driven companies and the limitations of existing workflow orchestration tools.
- Maestro's design emphasizes user-friendliness and flexibility, catering to a diverse user base with different technical backgrounds and preferences.
- The use of domain-specific languages and parameterization in Maestro simplifies the process of defining complex workflows, making it accessible to both engineers and non-engineers.
- The integration of Maestro with other Netflix tools like Metaflow suggests a cohesive ecosystem for data and ML operations at Netflix.
- Maestro's ability to handle spiky and uneven loads, with tens of thousands of workflows and millions of jobs per day, demonstrates its robustness and the importance of scalability in workflow orchestration.
- By presenting Maestro at AWS re:Invent 2023, Netflix shows a willingness to share its internal tools and practices with the broader tech community, which could influence the development of similar tools in the industry.