How Netflix Orchestrates Millions of Workflow Jobs with Maestro

Netflix developed Maestro, a scalable workflow orchestrator, to replace Meson, which struggled with increasing workloads due to its single-leader architecture. Maestro uses a microservices-based design, distributed queues, and CockroachDB for horizontal scalability, supporting time-based scheduling, event-driven triggers, and dynamic workflows with features like foreach loops and parameterization. It caters to diverse users via multiple DSLs (YAML, Python, Java), UI-based workflow creation, and integrations like Metaflow.
Core Technical Concepts & Technologies
- Workflow Orchestration (DAG-based execution)
- Microservices Architecture (stateless services)
- Distributed Queues (decoupled communication)
- CockroachDB (distributed SQL for state storage)
- Time-Based & Event-Driven Scheduling (cron, signals)
- Dynamic Workflows (parameterization, foreach loops)
- Execution Abstractions (predefined step types, notebooks, Docker)
- Multi-DSL Support (YAML, Python, Java)
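To make the DAG-based execution idea concrete, here is a minimal sketch of how an orchestrator might derive a run order from step dependencies. The workflow, its step names, and the `execution_waves` helper are hypothetical illustrations, not Maestro's actual engine or API.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"extract"},
    "publish": {"transform", "validate"},
}


def execution_waves(dag):
    """Group steps into waves; every step in a wave has all dependencies met."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock dependents
    return waves


waves = execution_waves(workflow)
print(waves)  # [['extract'], ['transform', 'validate'], ['publish']]
```

Steps within one wave are independent of each other, so a scheduler could in principle dispatch them in parallel.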
Key Points
Meson’s Limitations
- Single-leader architecture led to scaling bottlenecks.
- Required vertical scaling, which eventually ran into AWS instance size limits.
- Struggled with peak loads (e.g., midnight UTC workflows).
Maestro’s Architecture
- Workflow Engine: Manages DAGs, step execution, and dynamic workflows (e.g., foreach loops).
- Time-Based Scheduler: Cron-like triggers with deduplication for exactly-once execution.
- Signal Service: Event-driven triggers (e.g., S3 updates, internal events) with lineage tracking.
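One common way to get exactly-once behavior from a scheduler that may deliver the same tick more than once is to claim a unique key per (workflow, scheduled time) before starting a run. The sketch below assumes that approach; the `TriggerDeduplicator` class, the `fire` helper, and the in-memory set are hypothetical stand-ins for whatever uniquely keyed storage a real system would use, and are not taken from Maestro.

```python
from datetime import datetime, timezone


class TriggerDeduplicator:
    """In-memory stand-in for a uniquely keyed table of fired triggers."""

    def __init__(self):
        self._seen = set()

    def try_claim(self, workflow_id, scheduled_at):
        """Return True only for the first claim of a (workflow, time) pair."""
        key = (workflow_id, scheduled_at.isoformat())
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def fire(dedup, started, workflow_id, scheduled_at):
    """Start a run only if this trigger has not been claimed before."""
    if dedup.try_claim(workflow_id, scheduled_at):
        started.append((workflow_id, scheduled_at))


dedup = TriggerDeduplicator()
started = []
midnight = datetime(2024, 1, 1, tzinfo=timezone.utc)

for _ in range(3):  # simulate the same tick being delivered three times
    fire(dedup, started, "daily_report", midnight)
fire(dedup, started, "daily_report", datetime(2024, 1, 2, tzinfo=timezone.utc))

print(len(started))  # 2
```

In a distributed deployment the claim would need to be atomic across processes, for example through a unique constraint in the database, which this single-process sketch does not model.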
Scalability Techniques
- Stateless microservices + horizontal scaling.
- Distributed queues for reliable inter-service communication.
- CockroachDB for consistent, scalable state storage.
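The combination above can be illustrated with a small single-process sketch: interchangeable workers hold no state of their own, pull jobs from a shared queue, and write results to a shared store. Here `queue.Queue` stands in for a distributed queue and a locked dictionary stands in for the database; all names are hypothetical and this is not Maestro's implementation.

```python
import queue
import threading

job_queue = queue.Queue()      # stand-in for a distributed queue
state_store = {}               # stand-in for the shared database
store_lock = threading.Lock()


def worker(worker_id):
    """Stateless worker: everything it knows comes from the queue and store."""
    while True:
        job = job_queue.get()
        if job is None:  # sentinel telling this worker to shut down
            job_queue.task_done()
            return
        result = job["value"] * 2  # placeholder for real step execution
        with store_lock:
            state_store[job["id"]] = {
                "status": "SUCCEEDED",
                "result": result,
                "worker": worker_id,
            }
        job_queue.task_done()


def run(num_workers, jobs):
    """Scale out by choosing num_workers; the jobs themselves are unchanged."""
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for job in jobs:
        job_queue.put(job)
    for _ in threads:
        job_queue.put(None)  # one sentinel per worker
    job_queue.join()
    for t in threads:
        t.join()


run(4, [{"id": f"job-{n}", "value": n} for n in range(10)])
print(len(state_store))  # 10
```

Because no worker keeps state between jobs, adding capacity is a matter of raising `num_workers`, which is the property that horizontal scaling relies on.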
Execution Abstractions
- Step Types: Predefined templates (Spark, SQL, etc.).
- Notebook Execution: Direct Jupyter notebook support.
- Docker Jobs: Custom logic via containers.
User Flexibility
- DSLs (YAML, Python, Java) and UI for workflow creation.
- Metaflow Integration: Pythonic DAGs for data scientists.
Advanced Features
- Parameterized Workflows: Dynamic backfills (e.g., date ranges).
- Rollup & Aggregated Views: Unified status tracking for complex workflows.
- Event Publishing: Internal/external (Kafka/SNS) for real-time monitoring.
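A rollup view can be thought of as a recursive aggregation of leaf step statuses across nested workflows. The sketch below assumes a simple tree of dictionaries and a made-up precedence rule for the overall status; the structure, the status names, and both helper functions are hypothetical rather than Maestro's data model.

```python
from collections import Counter


def rollup(node):
    """Count leaf step statuses across an arbitrarily nested workflow tree."""
    if "children" not in node:
        return Counter({node["status"]: 1})
    total = Counter()
    for child in node["children"]:
        total += rollup(child)
    return total


def overall_status(counts):
    """Assumed precedence: any failure wins, then any running step."""
    if counts["FAILED"] > 0:
        return "FAILED"
    if counts["RUNNING"] > 0:
        return "RUNNING"
    return "SUCCEEDED"


# Hypothetical run with a nested foreach over three dates.
workflow_run = {
    "name": "backfill",
    "children": [
        {"name": "prepare", "status": "SUCCEEDED"},
        {
            "name": "foreach_dates",
            "children": [
                {"name": "2024-01-01", "status": "SUCCEEDED"},
                {"name": "2024-01-02", "status": "FAILED"},
                {"name": "2024-01-03", "status": "RUNNING"},
            ],
        },
    ],
}

summary = rollup(workflow_run)
print(dict(summary), overall_status(summary))
```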
Technical Specifications & Examples
- Foreach Loop:
  steps:
    - foreach:
        input: ${date_range}
        steps:
          - notebook:
              params:
                date: ${item}
- Signal Service: Subscribes to events (e.g., s3://data-ready) to trigger workflows.
- CockroachDB: Ensures strong consistency for workflow state across regions.
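The foreach pattern listed above can be sketched as a simple expansion: one step template is instantiated once per item, with `${item}` substituted into its parameters. The `date_range` and `expand_foreach` helpers, the dictionary layout, and the dates are hypothetical illustrations and do not reflect Maestro's real DSL semantics.

```python
from datetime import date, timedelta
from string import Template


def date_range(start, end):
    """Return ISO date strings from start to end, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def expand_foreach(step_template, items):
    """Create one concrete step per item by filling in ${item} placeholders."""
    expanded = []
    for item in items:
        params = {
            name: Template(value).substitute(item=item)
            for name, value in step_template["params"].items()
        }
        expanded.append({"type": step_template["type"], "params": params})
    return expanded


# Hypothetical template mirroring the notebook step in the foreach example.
template = {"type": "notebook", "params": {"date": "${item}"}}
steps = expand_foreach(template, date_range(date(2024, 1, 1), date(2024, 1, 3)))

for step in steps:
    print(step)
```

A backfill over a longer period would only change the arguments to `date_range`; the template itself stays the same, which is the point of parameterizing the workflow.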
Key Takeaways
- Horizontal Scaling: Maestro’s stateless microservices and distributed queues overcome single-node bottlenecks.
- Flexible Triggers: Combines time-based and event-driven scheduling for efficiency.
- User-Centric Design: Supports engineers (Docker/APIs), data scientists (notebooks), and analysts (UI).
- Observability: Rollup views and event publishing enable real-time workflow tracking.
- Dynamic Workflows: Parameterization and foreach loops reduce manual definition overhead.
Limitations & Future Work
- Complexity: Deeply nested workflows may require careful monitoring.
- Learning Curve: Multiple DSLs/APIs could overwhelm new users.
- Open-Source Adoption: External use cases may reveal edge cases not yet addressed.
References: Netflix Tech Blog, Maestro GitHub.
This article was originally published on ByteByteGo