Editor’s note: Vlado Djerek, Lead DevOps Engineer at Akvelon and an active Flyte contributor, wrote this blog post. It shows, step by step, how to run Flyte on GKE Autopilot with dynamic GPU scheduling.
Running GPU-heavy ML workloads on GKE Autopilot introduces one challenge: acquiring GPU resources, which are often scarce. Fortunately, Flyte, GKE Autopilot, and Dynamic Workload Scheduler work together seamlessly, requiring only minimal configuration to make it all click.
This article presents a reproducible, cloud-native deployment that addresses this challenge. By integrating Flyte with Kueue and GKE’s Dynamic Workload Scheduler, the setup gates GPU tasks until resources are provisioned, slashing idle spend while remaining 100% Autopilot-compliant.
The approach, developed by Akvelon’s DevOps team and showcased at a recent Flyte community sync, is now featured on the GKE AI Labs site. Links to the open-source Terraform and Helm files, as well as the demo video, appear below.
Introduction
Flyte is a powerful open-source platform for orchestrating large-scale ML pipelines, and it works well with GKE Autopilot. Autopilot handles infrastructure behind the scenes, yet GPUs still require explicit setup from engineers. GPU resources are limited, so they must be provisioned dynamically.
This configuration is particularly relevant for ML workflows, which often require on-demand access to GPUs and flexible scheduling logic. While Flyte works smoothly with Kubernetes-native features, GKE Autopilot requires stricter validation on pod specifications, which means teams must include node selectors and tolerations to schedule GPU pods correctly. So the question is:
How do you deploy Flyte on GKE Autopilot in a way that supports GPU-heavy workloads, uses cloud resources efficiently, and avoids brittle, manual setup?
Our team needed a solution that:
- Provisions infrastructure using declarative tools
- Deploys Flyte in a way that respects GKE Autopilot constraints
- Enables GPU workflows to run only when needed
- Maintains flexibility, reproducibility, and usability by others in the community
See how we put all the pieces together, from infrastructure to GPU task execution.
Step-by-Step: How to Set Up Flyte on GKE Autopilot
To make Flyte work reliably on GKE Autopilot, our team focused on building a repeatable, cloud-native setup — one that could be shared with the open-source community and used as a foundation for ML teams working in Google Cloud.
Instead of working around platform limitations, we leaned into the strengths of GCP. The setup combines Terraform for infrastructure provisioning, Helm for deploying Flyte, and smart use of GCP-native services, such as Cloud SQL, Artifact Registry, and Autopilot for dynamic scheduling.
This resulted in a clean, modular deployment pipeline that can:
- Spin up a GKE Autopilot cluster with GPU support.
- Configure Cloud SQL for Flyte’s metadata storage.
- Push container images to Artifact Registry.
- Install Flyte using its official Helm chart.
- Run example workflows using FlyteKit from a local environment.
Our team also enabled access to Flyte’s web UI and gRPC endpoints via port forwarding from a local machine, a lightweight way to interact with the cluster during testing and development. Publishing secure endpoints for broader team use is planned as a next step.
This foundation let us quickly test Flyte’s behavior on GKE Autopilot and made it easier to layer in more advanced capabilities, like dynamic workload scheduling using Kueue.
Making GPU Workloads Smarter With Kueue and Dynamic Scheduling
Once the core Flyte setup was in place, we turned to one of the most common resource management problems in ML infrastructure: how to use GPUs efficiently without overprovisioning.
In many Kubernetes environments, teams pre-allocate GPU nodes and keep them online even when idle, wasting resources and driving up costs. That approach doesn’t align with the flexibility promised by GKE Autopilot, and it certainly doesn’t scale well for bursty, experiment-heavy ML workflows.
To solve this, we integrated Kueue, a Kubernetes-native job queueing system, with GKE’s Dynamic Workload Scheduler (DWS). This combination allowed us to gate Flyte tasks that required GPUs and only launch them once the platform confirmed GPU resources were actually available.
In practice, this is how a Flyte task gets scheduled and executed when GPUs are involved:
- Flyte creates the pod.
- Kueue holds the pod until DWS provisions GPU nodes and signals readiness.
- Once resources are available, Kueue lets the pod proceed.
- The released pod then runs as part of the Flyte workflow, with Flyte tracking its execution.
The diagram below shows how the system components interact during this GPU scheduling flow:
This approach eliminates the need to keep GPU nodes online 24/7. It is clean, cost-efficient, and keeps infrastructure lean while still supporting complex AI workflows.
To enable this behavior, we made a few lightweight adjustments using the FlyteKit SDK, including:
- Adding GPU resource requests and limits.
- Applying node selectors required by GKE Autopilot.
- Labeling pods (e.g., with kueue.x-k8s.io/queue-name), so Kueue can assign them to the correct queue.
We leveraged FlyteKit’s image specification feature to define custom container images with the necessary pip dependencies, built locally using FlyteKit’s BuildKit integration. These were pushed to Artifact Registry; because the registry is regional, GKE can stream the container image on demand, speeding up pod startup and reducing wait times for GPU task execution. This particular image was optimized for tasks like GPU benchmarking with Torch.
The takeaway: engineers don’t need to modify Flyte’s core components or wait for new features. GKE’s APIs and FlyteKit’s pod customization options make this integration seamless.
To make the task lifecycle even more transparent, we've created a simplified timeline showing how a GPU task progresses, from submission to execution.
What’s Demonstrated in the Video Session: Hello World, GPU Tasks, and Queued Scheduling
In the live session, I walk the Flyte community through a working example of the setup, showing how well Flyte can run on GKE Autopilot with GPU tasks and dynamic workload scheduling in place.
Here’s what you can take away from the demo:
- FlyteKit workflows run reliably on GKE Autopilot with no changes to Flyte’s internals, though some setup is required to adapt FlyteKit tasks to GKE Autopilot’s GPU scheduling constraints.
- The workflow-to-execution pipeline is quick and clean, from Python code to Flyte dashboard.
- Kueue integrates seamlessly, holding GPU workloads until resources are available and releasing them automatically.
- Flyte’s dashboard surfaces key execution details like status, logs, and timing, though queue states such as ‘SchedulingGated’ are not yet visible in Flyte’s UI; they’re only observable at the pod level in Kubernetes.
- FlyteKit and Kubernetes-native features make it easy to customize pod specs and GPU resource configs without modifying Flyte itself.
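The pod-level gate mentioned above can still be inspected outside Flyte's UI: a held pod carries an entry under `spec.schedulingGates`. A minimal sketch of checking for this, using plain dicts in place of real pod objects and assuming Kueue's admission gate name:

```python
def is_gated(pod: dict) -> bool:
    """Return True if the pod is held by a scheduling gate (SchedulingGated)."""
    return bool(pod.get("spec", {}).get("schedulingGates"))


# Illustrative pod fragments, shaped like kubectl's JSON output
held = {"spec": {"schedulingGates": [{"name": "kueue.x-k8s.io/admission"}]}}
released = {"spec": {}}

print(is_gated(held), is_gated(released))  # True False
```

In practice the same check is a one-liner with kubectl against the live cluster.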
It’s a practical example of how modern DevOps teams can combine open-source tools with cloud-native infrastructure to support AI/ML workloads more efficiently.
Watch the full session with the demo:
What You Can Do With This Setup
This setup makes running Flyte on GKE Autopilot a practical option for real-world ML workloads. By combining Flyte with Kueue and GCP’s Dynamic Workload Scheduler, teams gain the flexibility and automation needed for GPU-heavy pipelines.
Instead of overprovisioning resources or relying on brittle manual setups, engineers can now:
- Deploy Flyte in a GCP-native way using Terraform and Helm, remaining fully Autopilot-compliant.
- Let Kueue + DWS add GPUs only when a task requires them.
- Fork the open-source repo and integrate their own Flyte workflows.
- Let infrastructure scale dynamically as GPU resources become available.
- Keep workflows fully cloud-native without needing to modify Flyte’s internals.
- Focus on building and running ML pipelines, not spending time on cluster maintenance.
This solution turns a complex, cloud-native deployment into a repeatable pattern that other teams can use, adapt, and extend. It’s an open-source solution shaped by real engineering needs.
How Akvelon Can Help
Akvelon’s engineering teams work closely with clients to build scalable, cost-efficient infrastructure for modern AI/ML development, from designing cloud-native platforms to optimizing resource usage to implementing workflow orchestration at scale.
As a Google Cloud Partner, we help organizations maximize performance, manage costs, and reduce operational risks. Global enterprises trust our DevOps engineering services and AI/ML development expertise as they develop the next generation of AI-powered systems.

Vlado Djerek
Lead DevOps Engineer at Akvelon