Vertex AI Pipelines

Managed Kubeflow Pipelines on Vertex — declarative ML workflows that scale out from self-hosted ThreadPoolExecutor patterns to fully-managed orchestration with experiment tracking, model registry integration, and event-triggered execution.

TL;DR

What it is: Managed runtime for Kubeflow Pipelines. A pipeline is a DAG of containerized components — typically Python functions decorated with @component. Inputs and outputs flow as typed Artifacts between steps.

Native integrations: Vertex AI Experiments tracks parameters / metrics / artifacts. Vertex AI Model Registry versions output models. Cloud Scheduler triggers pipeline runs on cron. Pub/Sub triggers event-driven runs. All composes without writing glue code.

Direct alternatives: Cloud Composer (managed Airflow), GitHub Actions (CI/CD-native), custom Dagster deployments, Prefect Cloud. Vertex Pipelines wins for ML-specific workflows where Experiments + Registry + Endpoints integration matters.

The Component Model

@dsl.component(packages_to_install=["scikit-learn", "pandas"])
def train_model(
    train_data: Input[Dataset],
    test_data: Input[Dataset],
    n_estimators: int,
    model_output: Output[Model],
    metrics_output: Output[Metrics],
):
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    df_train = pd.read_csv(train_data.path)
    df_test = pd.read_csv(test_data.path)
    model = RandomForestClassifier(n_estimators=n_estimators)
    model.fit(df_train.drop("y", axis=1), df_train["y"])
    accuracy = model.score(df_test.drop("y", axis=1), df_test["y"])
    metrics_output.log_metric("accuracy", accuracy)
    # ... save model to model_output.path

@dsl.pipeline(name="nightly-fraud-retrain")
def fraud_pipeline(n_estimators: int = 100):
    data = load_data()
    train, test = split_data(data=data.outputs["data"])
    drift_check = drift_detection(new_data=train.outputs["train"])
    with dsl.If(drift_check.outputs["drift_detected"] == True):
        trained = train_model(
            train_data=train.outputs["train"],
            test_data=test.outputs["test"],
            n_estimators=n_estimators,
        )
        register = register_model(model=trained.outputs["model_output"])
        deploy = canary_deploy(model_version=register.outputs["version"])

Each @dsl.component compiles to a Docker container; the @dsl.pipeline compiles to a Kubeflow YAML; both run on Vertex's managed compute when submitted. Conditionals, loops, and parallel branches are all first-class.

Production Use Cases

Use Case 1 — Nightly model retraining with drift gate

Cloud Scheduler triggers the pipeline at 2am. Steps: extract last-24h data → drift detection vs baseline → if drift detected, train new model on TPU v5e → evaluate against held-out test → register in Model Registry → 5% canary deploy to Endpoints → monitor for 24h via Cloud Monitoring → if metrics hold, promote to 100%. Whole loop is declarative — no Airflow DAGs to maintain.

Use Case 2 — Batch RAG ingestion at scale

50M documents need nightly re-embedding for the enterprise RAG index. Pipeline steps: list_new_docs (Cloud Storage diff) → parallel chunking (sharded across 20 workers) → parallel embedding via Gemini embedding API → upsert to Vertex Vector Search. Parallelism is declared in the DAG — Pipelines handles worker allocation. Cost-effective because compute scales to zero between runs.

Use Case 3 — Agentic workflow that spans hours

The AI Factory's 30-hour autonomous run — generating 1,600+ AI-authored solutions across 3 languages — maps to a Vertex Pipeline. Each agentic generation is a component; parallel branches per language; judge model evaluation as a downstream step; self-correction loop via dsl.While; final artifact registry. The same code that runs on local ThreadPoolExecutor for dev scales to Pipelines for production.

Use Case 4 — Event-driven incident response pipeline

Cloud Pub/Sub event from PagerDuty triggers the pipeline. Steps: parse incident → retrieve similar past incidents from Vector Search → fetch current system metrics from BigQuery → ask Gemini Pro for diagnostic hypothesis → if confidence high, auto-execute remediation runbook → else page on-call with the analysis. Audit-compliant — every step logged to Cloud Logging.

Migrating from ThreadPoolExecutor / Airflow

From ThreadPoolExecutor: Each thread-pool task becomes a @dsl.component. Replace shared in-memory state with typed Artifacts passed between components. The hard part isn't the code — it's defining the right artifact contracts so components can run independently. Plan a 2-3 week port for a complex agentic flow.

From Airflow: @task functions become @dsl.component. PythonOperator becomes the component decorator. XCom is replaced by typed Artifacts. Sensors map to Cloud Pub/Sub triggers. The conceptual model is similar; the syntax is cleaner because of the typed Artifact contracts.

When NOT to migrate: if your workflow is non-ML and you already have a healthy Airflow / Dagster / Prefect deployment, the migration cost rarely justifies the move. Vertex Pipelines shines specifically for ML/agentic workloads where Experiments + Registry + Endpoints integration matters.

Glossary

Vertex AI Pipelines

ThreadPoolExecutor

AI Factory

Multi-Agent Orchestration

Vertex AI

Gemini

Vertex Vector Search

Judge Model

Agentic Systems

RAG

Vertex AI Pipelines

TL;DR

The Component Model

Production Use Cases

Use Case 1 — Nightly model retraining with drift gate

Use Case 2 — Batch RAG ingestion at scale

Use Case 3 — Agentic workflow that spans hours

Use Case 4 — Event-driven incident response pipeline

Migrating from ThreadPoolExecutor / Airflow

Glossary

Related Reading