Managed Kubeflow Pipelines on Vertex โ declarative ML workflows that scale out from self-hosted ThreadPoolExecutor patterns to fully-managed orchestration with experiment tracking, model registry integration, and event-triggered execution.
What it is: Managed runtime for Kubeflow Pipelines. A pipeline is a DAG of containerized components โ typically Python functions decorated with @component. Inputs and outputs flow as typed Artifacts between steps.
Native integrations: Vertex AI Experiments tracks parameters / metrics / artifacts. Vertex AI Model Registry versions output models. Cloud Scheduler triggers pipeline runs on cron. Pub/Sub triggers event-driven runs. All composes without writing glue code.
Direct alternatives: Cloud Composer (managed Airflow), GitHub Actions (CI/CD-native), custom Dagster deployments, Prefect Cloud. Vertex Pipelines wins for ML-specific workflows where Experiments + Registry + Endpoints integration matters.
@dsl.component(packages_to_install=["scikit-learn", "pandas"])
def train_model(
train_data: Input[Dataset],
test_data: Input[Dataset],
n_estimators: int,
model_output: Output[Model],
metrics_output: Output[Metrics],
):
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
df_train = pd.read_csv(train_data.path)
df_test = pd.read_csv(test_data.path)
model = RandomForestClassifier(n_estimators=n_estimators)
model.fit(df_train.drop("y", axis=1), df_train["y"])
accuracy = model.score(df_test.drop("y", axis=1), df_test["y"])
metrics_output.log_metric("accuracy", accuracy)
# ... save model to model_output.path
@dsl.pipeline(name="nightly-fraud-retrain")
def fraud_pipeline(n_estimators: int = 100):
data = load_data()
train, test = split_data(data=data.outputs["data"])
drift_check = drift_detection(new_data=train.outputs["train"])
with dsl.If(drift_check.outputs["drift_detected"] == True):
trained = train_model(
train_data=train.outputs["train"],
test_data=test.outputs["test"],
n_estimators=n_estimators,
)
register = register_model(model=trained.outputs["model_output"])
deploy = canary_deploy(model_version=register.outputs["version"])Each @dsl.component compiles to a Docker container; the @dsl.pipeline compiles to a Kubeflow YAML; both run on Vertex's managed compute when submitted. Conditionals, loops, and parallel branches are all first-class.
Cloud Scheduler triggers the pipeline at 2am. Steps: extract last-24h data โ drift detection vs baseline โ if drift detected, train new model on TPU v5e โ evaluate against held-out test โ register in Model Registry โ 5% canary deploy to Endpoints โ monitor for 24h via Cloud Monitoring โ if metrics hold, promote to 100%. Whole loop is declarative โ no Airflow DAGs to maintain.
50M documents need nightly re-embedding for the enterprise RAG index. Pipeline steps: list_new_docs (Cloud Storage diff) โ parallel chunking (sharded across 20 workers) โ parallel embedding via Gemini embedding API โ upsert to Vertex Vector Search. Parallelism is declared in the DAG โ Pipelines handles worker allocation. Cost-effective because compute scales to zero between runs.
The AI Factory's 30-hour autonomous run โ generating 1,600+ AI-authored solutions across 3 languages โ maps to a Vertex Pipeline. Each agentic generation is a component; parallel branches per language; judge model evaluation as a downstream step; self-correction loop via dsl.While; final artifact registry. The same code that runs on local ThreadPoolExecutor for dev scales to Pipelines for production.
Cloud Pub/Sub event from PagerDuty triggers the pipeline. Steps: parse incident โ retrieve similar past incidents from Vector Search โ fetch current system metrics from BigQuery โ ask Gemini Pro for diagnostic hypothesis โ if confidence high, auto-execute remediation runbook โ else page on-call with the analysis. Audit-compliant โ every step logged to Cloud Logging.
From ThreadPoolExecutor: Each thread-pool task becomes a @dsl.component. Replace shared in-memory state with typed Artifacts passed between components. The hard part isn't the code โ it's defining the right artifact contracts so components can run independently. Plan a 2-3 week port for a complex agentic flow.
From Airflow: @task functions become @dsl.component. PythonOperator becomes the component decorator. XCom is replaced by typed Artifacts. Sensors map to Cloud Pub/Sub triggers. The conceptual model is similar; the syntax is cleaner because of the typed Artifact contracts.
When NOT to migrate: if your workflow is non-ML and you already have a healthy Airflow / Dagster / Prefect deployment, the migration cost rarely justifies the move. Vertex Pipelines shines specifically for ML/agentic workloads where Experiments + Registry + Endpoints integration matters.