# Modelplane Documentation

> Modelplane is the open source control plane for AI model serving. It extends Crossplane to manage AI inference across a fleet of GPU clusters.

---

# Overview

Source: /overview/


Modelplane is the open source control plane for AI inference. It's software you
install and run in your own environment, and it orchestrates the models, serving
stack, and infrastructure across cloud, neocloud, and on-premise. Modelplane
supports running any model and any engine on any infrastructure, with the
frontier-level serving topologies and performance the largest models demand,
from a single GPU to disaggregated, multi-node deployments.

Modelplane operates across the whole fleet: provisioning inference clusters,
scheduling model deployments on compatible clusters, autoscaling model replicas
across clusters, caching model weights across clusters, and routing across
clusters.

It's an active system that is always reconciling the fleet toward the state you
declare. You install Modelplane on a Kubernetes cluster, which becomes the
control cluster for your inference fleet. It's built on
[Crossplane](https://crossplane.io) and fully integrates with your existing
platform systems.


> **Warning:** Modelplane is under active development. We have opted to build the
> project in the open, collaborating with the broad AI inference community on
> integrations and capabilities.

## Deploy a model

Modelplane's API is declarative, designed for platform teams responsible for the
inference infrastructure and developers deploying models on that infrastructure.

Once a platform team has provisioned inference clusters and declared the available
GPUs and networking fabric, an ML development team deploys a model with a
declarative manifest:

```yaml {nocopy=true}
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen-demo
  namespace: ml-team
spec:
  replicas: 1
  engines:
  - name: qwen
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          - cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            args: ["--model=Qwen/Qwen2.5-0.5B-Instruct"]
```

Modelplane schedules a model replica onto an inference cluster with free,
compatible GPUs and memory, and deploys the serving engine. To expose an
OpenAI-compatible endpoint, declare a `ModelService`:

```yaml {nocopy=true}
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen-demo
```

## A universal control plane for AI inference


Modelplane is designed to be a universal control plane for inference. It runs
inference clusters on any cloud, neocloud, or on-premise environment, or any
combination of them. Modelplane can provision the clusters for you, or you can
bring your own.

It supports any serving engine that runs as a container, and can serve
frontier-quality models using advanced topologies including tensor parallel,
pipeline parallel, data and expert parallel, and prefill/decode disaggregation.
Modelplane works across different accelerators and networking fabrics, and
schedules each model's replicas by matching the model's hardware requirements to
the hardware available across your clusters.

## What Modelplane is not

Modelplane is not a serving engine like vLLM, SGLang, or TensorRT-LLM. Modelplane
composes serving engines and orchestrates them fleet-wide across cloud, neocloud,
and on-premise. Modelplane is not a managed inference service like Baseten,
Together, or Fireworks. These offer cloud services, while Modelplane is
self-hosted software.

## Next steps


- [Get started](/getting-started/): Go from nothing to a live OpenAI-compatible
  endpoint in about 45 minutes.
- [Why Modelplane](/overview/why/): Learn more about Modelplane's capabilities
  and how it works.


---

# Deploy a Model

Source: /models/model-deployment/

**API:** [`modelplane.ai/v1alpha1` · ModelDeployment](/reference/modeldeployments/)

A `ModelDeployment` is the ML team's primary interface. You describe the model
you want served, the hardware it needs, and how many copies to run; Modelplane
schedules it onto matching clusters and keeps it running. You never name a
cluster.

Modelplane is unopinionated about the engine itself. You bring the container and
its flags, and Modelplane shapes a serving topology around it. Parallelism,
quantization, and KV transfer are all set by the engine flags you write;
Modelplane never injects them.

A deployment's `spec.engines` describes its topology through two choices:

- **One pod or a gang**: whether an engine is a single `Standalone` pod or a
  `Leader` with one or more `Worker` pods coordinating across nodes.
- **Unified or disaggregated**: whether `spec.serving.mode` keeps prefill and
  decode together (`Unified`, the default) or splits them across two engines
  (`PrefillDecode`).

How many of each to run is a separate question, covered in
[Sizing a deployment](#sizing-a-deployment).

## Single-node

Single-node is the default, and it's what the [getting started tour](/getting-started/)
deploys. One `Standalone` member is one pod on one node, claiming that node's
GPUs through its `nodeSelector`. It's usually the right choice when a model fits
on a single node. Within a node, tensor parallelism is an engine flag
(`--tensor-parallel-size`), not a Modelplane concept.

```yaml {nocopy=true}
engines:
- name: qwen
  members:
  - role: Standalone        # one pod, one node
```

## Multi-node

When a model is too large for one node's GPUs, make the engine a gang: a `Leader`
and a `Worker` whose `worker.nodes` expands to that many worker pods, one per
node. The pods serve the model together; how the model splits across them
(tensor, pipeline, data, or expert parallelism) is up to your engine flags.

A gang should use a [`ModelCache`](/models/model-cache/) via
`spec.modelCacheRef`, so every pod mounts the same weights instead of each
pulling its own.

```yaml {nocopy=true}
modelCacheRef:
  name: qwen3-coder         # recommended for gangs
engines:
- name: qwen3-coder
  members:
  - role: Leader
  - role: Worker
    worker:
      nodes: 1              # one worker pod per node
```
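
The gang shape and the engine flags have to agree on how many GPUs exist. As a rough sanity check, the sketch below compares the GPUs a gang claims with the GPUs the engine expects. It assumes the engine's world size is the product of its tensor and pipeline parallel sizes, with no data parallelism. The function names are illustrative and not part of Modelplane; the numbers are the ones used by the Qwen3-Coder manifest in the examples on this page.

```python
def gang_gpu_count(worker_nodes: int, gpus_per_pod: int) -> int:
    """GPUs claimed by one gang: a Leader plus `worker_nodes` Worker pods."""
    return (1 + worker_nodes) * gpus_per_pod


def engine_world_size(tensor_parallel: int, pipeline_parallel: int) -> int:
    """GPUs the engine expects, assuming no data parallelism."""
    return tensor_parallel * pipeline_parallel


# worker.nodes: 1 and count: 8 on each member, with
# --tensor-parallel-size=8 and --pipeline-parallel-size=2.
claimed = gang_gpu_count(worker_nodes=1, gpus_per_pod=8)
expected = engine_world_size(tensor_parallel=8, pipeline_parallel=2)
print(claimed, expected, claimed == expected)  # 16 16 True
```

If the two numbers differ, the flags and the `worker.nodes` or `count` values likely disagree, which is worth checking before applying the manifest.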

A member's `env` can read pod fields through `valueFrom.fieldRef`. For example,
setting vLLM's `VLLM_HOST_IP` from `status.podIP` lets the engine on a multi-NIC
RDMA node bind the right interface instead of guessing it.

## Disaggregated serving

The prefill and decode phases have opposite hardware profiles, and on one engine
a prefill burst stalls the decodes already running. Set
`spec.serving.mode: PrefillDecode` to run them as two engines, one marking
`phase: Prefill` and the other `phase: Decode`. Modelplane fronts the pair with
inference-aware routing that sequences prefill then decode, moving the KV cache
between them. Each phase can sit on the GPU class that suits it.

```yaml {nocopy=true}
serving:
  mode: PrefillDecode       # the two engines below are one P/D pair
engines:
- name: prefill
  phase: Prefill
- name: decode
  phase: Decode
```

Disaggregation pays off for large models under load with strict latency targets
and long context. For small models or low traffic, the KV-transfer overhead
outweighs the benefit, so unified serving is the default.

Disaggregated serving requires an engine image that includes the **NIXL**
KV-transfer runtime. vLLM's `NixlConnector` (and SGLang's prefill/decode
transfer) import the `nixl` package, so disaggregated engines crash at startup
with `NIXL is not available` on an image that lacks it. Recent vanilla
`vllm/vllm-openai` images include NIXL, so pin a current tag rather than an old
one. The engine image is yours to choose, so this is a prerequisite Modelplane
does not bundle for you.

## Requesting GPUs

You don't name a cluster or a GPU model. Instead each member's `nodeSelector`
lists the hardware its pods need, and Modelplane finds a node pool that has it.
The platform team publishes node pools as `InferenceClass` resources, each
describing the devices its nodes carry. Your request is matched against them.

A request names a device (`gpu`), how many of it each pod needs (`count`), and
one or more `selectors` the device must match:

```yaml {nocopy=true}
nodeSelector:
  devices:
  - name: gpu
    count: 1                # one GPU per pod
    selectors:
    - cel: |
        device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
```

Each selector is a single [CEL](https://cel.dev/) expression (CEL is a small
expression language) that returns true or false for one device. The part in
brackets, `"gpu.nvidia.com"`, is the GPU vendor's driver. The fields after it,
like `memory` or `architecture`, are what the platform team published for that
device. This one says "match a GPU whose memory is at least 40Gi." A device has
to match every selector in the request, so give two selectors to mean "Hopper,
with at least 80Gi."
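
To make the "every selector must match" rule concrete, here is a small Python model of it. This is only a sketch of the semantics, not how Modelplane or DRA evaluates CEL: `parse_quantity` is a simplified stand-in for CEL's `quantity()` that handles binary suffixes only, the devices are plain dictionaries, and the `hopper_gpu` values are hypothetical. The `23034Mi` figure is the L4 capacity declared in the Qwen3-8B example `InferenceClass`.

```python
# Binary suffixes only; the real Kubernetes quantity grammar is larger.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(text: str) -> int:
    """Return the number of bytes for a quantity such as "40Gi"."""
    for suffix, factor in _BINARY.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


def matches(device: dict, selectors) -> bool:
    """A device matches only if every selector returns True."""
    return all(selector(device) for selector in selectors)


# "Hopper, with at least 80Gi" expressed as two selectors.
selectors = [
    lambda d: d["attributes"]["architecture"] == "Hopper",
    lambda d: parse_quantity(d["capacity"]["memory"]) >= parse_quantity("80Gi"),
]

hopper_gpu = {"attributes": {"architecture": "Hopper"}, "capacity": {"memory": "80Gi"}}
l4_gpu = {"attributes": {"architecture": "Ada Lovelace"}, "capacity": {"memory": "23034Mi"}}

print(matches(hopper_gpu, selectors))  # True
print(matches(l4_gpu, selectors))      # False

# Published capacity, not the nominal size, decides a memory match.
print(parse_quantity("23034Mi") >= parse_quantity("20Gi"))  # True
print(parse_quantity("23034Mi") >= parse_quantity("24Gi"))  # False
```

The last two lines show why a request should be written against the capacity the platform team published: a `24Gi` selector would not match a device that reports `23034Mi`.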

### Requesting more than one device

`devices` is a list, so a member can ask for distinct kinds of hardware at once,
each its own entry with its own `count` and `selectors`. A node pool matches the
member only when it satisfies every entry. This is how you ask for both a GPU and
a fast NIC on the same node:

```yaml {nocopy=true}
nodeSelector:
  devices:
  - name: gpu
    count: 8
    selectors:
    - cel: device.attributes["gpu.nvidia.com"].architecture == "Hopper"
  - name: nic
    count: 1
    selectors:
    - cel: device.attributes["nic.nvidia.com"].linkType == "infiniband"
```

### What you can match on

Each selector is evaluated against one device and must return a boolean. The
device exposes three things:

- `device.driver`: the device's driver, a string.
- `device.attributes["<driver>"].<name>`: a typed attribute (string, bool, int,
  or version), such as `architecture` or `cudaComputeCapability`.
- `device.capacity["<driver>"].<name>`: a capacity quantity, such as `memory`.

Two helpers build comparable values: `quantity()` parses Kubernetes quantities
like `"40Gi"`, and `semver()` parses versions like `"9.0.0"`. Both support
`compareTo` (which orders two values), `isGreaterThan`, and `isLessThan`. Within
a selector, combine conditions with the usual CEL operators (`==`, `!=`, `>=`,
`&&`, `||`).

```yaml {nocopy=true}
selectors:
# Capacity: at least 40Gi of GPU memory. >= 0 reads as "left is at least right".
- cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
# Attribute equality: a specific architecture.
- cel: device.attributes["gpu.nvidia.com"].architecture == "Hopper"
# Version attribute: a CUDA compute capability strictly greater than 8.9.0.
- cel: device.attributes["gpu.nvidia.com"].cudaComputeCapability.isGreaterThan(semver("8.9.0"))
# Driver: match any device from a given driver.
- cel: device.driver == "gpu.nvidia.com"
# Presence: only match a device that publishes a given domain.
- cel: '"gpu.nvidia.com" in device.attributes'
# Two conditions in one selector.
- cel: |
    device.attributes["gpu.nvidia.com"].architecture == "Hopper" &&
    device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("80Gi")) >= 0
```

This is the Kubernetes DRA device selector expression surface. The
Kubernetes-specific CEL extension libraries (such as regular expressions and IP
address helpers) aren't available. Selectors in practice are attribute and
capacity comparisons like those above.

### Seeing what's available

To see what you can match against, list the classes the platform team has
published and look at the devices each one declares:

```bash
kubectl get inferenceclass
kubectl describe inferenceclass gke-l4-1x-g2
```

The `describe` output shows each device's driver, attributes (like
`architecture`), and capacity (like `memory`), which are exactly the keys your
selectors read. If a selector asks for something no published class offers, the
deployment won't schedule.

## Sizing a deployment

Three independent numbers control how many pods a deployment runs:

- **`spec.replicas`** stamps out whole copies of the entire topology. Each
  replica is a complete serving instance, and replicas usually land on different
  clusters. This is the scaling axis (see [Scaling](#scaling)).
- **`engines[].copies`** runs several identical copies of one engine within a
  replica, on the same cluster. It's a fixed number, sized once, never
  autoscaled. Copies make a replica more resilient within its cluster: a node
  failure drops one copy instead of taking the whole replica out of service. In
  disaggregated serving they also set the prefill-to-decode ratio.
- **`worker.nodes`** sets how many nodes one gang spans: a `Leader` plus that
  many `Worker` pods. It's how big a single multi-node engine is.
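
The three numbers multiply. The sketch below works through the arithmetic under two assumptions that are not stated above: `copies` defaults to 1 when omitted, and a `Standalone` engine can be modeled as a gang with zero workers. The `Engine` class and `total_pods` function are illustrative helpers, not Modelplane APIs, and the deployment shape is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    name: str
    copies: int = 1        # engines[].copies (assumed default of 1)
    worker_nodes: int = 0  # worker.nodes; 0 models a Standalone engine

    def pods_per_copy(self) -> int:
        # A Standalone engine is one pod; a gang is a Leader plus its Workers.
        return 1 + self.worker_nodes


def total_pods(replicas: int, engines: list[Engine]) -> int:
    """Pods across the fleet: replicas x sum(copies x pods per copy)."""
    per_replica = sum(e.copies * e.pods_per_copy() for e in engines)
    return replicas * per_replica


# Hypothetical disaggregated deployment: 2 replicas, a 3:1 prefill-to-decode
# ratio, and a decode engine that spans two nodes.
engines = [
    Engine("prefill", copies=3),
    Engine("decode", copies=1, worker_nodes=1),
]
print(total_pods(2, engines))  # 2 * (3*1 + 1*2) = 10
```

Only the first factor, `replicas`, changes when the deployment scales; the per-replica shape stays fixed.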

## Scaling

`spec.replicas` is the only scaling axis. Each replica is a complete,
fixed-shape serving instance, so scaling adds or removes whole instances across
the fleet. Because the deployment exposes the Kubernetes scale subresource,
`kubectl scale` and KEDA work without anything extra. There's no in-cluster pod
autoscaling.
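
As an illustration of scaling by hand, the sketch below builds a `kubectl scale` invocation from Python. The resource name `modeldeployment` is an assumption derived from the `ModelDeployment` kind, so confirm it with `kubectl api-resources` on your control cluster. `scale_command` is a hypothetical helper, and the deployment name and namespace are the ones used in the examples on this page.

```python
import shlex


def scale_command(name: str, namespace: str, replicas: int) -> list[str]:
    """Build a `kubectl scale` invocation for a ModelDeployment."""
    if replicas < 0:
        raise ValueError("replicas must not be negative")
    return [
        "kubectl", "scale", f"modeldeployment/{name}",  # assumed resource name
        "--namespace", namespace,
        f"--replicas={replicas}",
    ]


cmd = scale_command("qwen3-8b", "ml-team", 2)
print(shlex.join(cmd))
# To run it against a control cluster: subprocess.run(cmd, check=True)
```

Building the argument list separately from running it keeps the command easy to log or review before it touches the fleet.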

## Choosing a topology

| Topology | Use when | How you set it |
|----------|----------|----------------|
| Single-node | The model fits on one node's GPUs | One `Standalone` member (the default) |
| Multi-node | The model is too large for one node | A `Leader` and one or more `Worker` members, ideally with a `modelCacheRef` |
| Disaggregated serving | Large model, heavy load, strict latency, long context | `serving.mode: PrefillDecode` with two phase engines |

## Examples


**Single-node:** `model-deployment.yaml`

```yaml
# A ModelDeployment deploys a model to one or more inference clusters.
# The scheduler picks clusters by clusterSelector labels and nodeSelector
# device requests, gated on available nodes. Each matched cluster gets one
# ModelReplica.
#
# The control plane creates a unified OpenAI-compatible endpoint:
#   http://<gateway-address>/<namespace>/<name>/v1/chat/completions
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-8b
  namespace: ml-team
spec:
  # Number of ModelReplicas to fan out to. Each replica is a complete
  # serving instance scheduled to one InferenceCluster.
  replicas: 1

  # Optional: restrict the scheduler to clusters with specific labels.
  # clusterSelector:
  #   matchLabels:
  #     modelplane.ai/region: us-central

  # Engines are an array of inference engines. This model is one engine, one
  # Standalone member, one pod - the simplest shape. The engine composes to a
  # Deployment fronted by a Service.
  engines:
  - name: qwen3-8b
    members:
    # A Standalone member is a single self-contained engine pod. Its template
    # carries the container named "engine" - the inference engine; its image,
    # command, and args pass through verbatim.
    - role: Standalone
      # The member's per-node device request: a list of DRA device requests
      # describing what each of the member's pods needs from its node. The
      # scheduler matches each against a candidate pool's InferenceClass
      # devices and pins the member to a pool that satisfies them. Each
      # request's CEL is real DRA CEL over a single device; quantity() and
      # semver() are helpers. claim: DRA devices also become requests in the
      # DRA ResourceClaim the serving pods claim GPUs through, so an engine
      # must declare the GPUs it needs.
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          # Qwen3-8B fits comfortably on an L4; 20Gi selects one without
          # over-constraining. A larger model would ask for more memory or a
          # specific architecture here. This CEL is real DRA CEL: the scheduler
          # matches it against the pool's declared device, and DRA matches it
          # again against the GPU's ResourceSlice when it binds the claim.
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            args:
            - "--model=Qwen/Qwen3-8B"
            - "--served-model-name=qwen"
            - "--reasoning-parser=qwen3"
            - "--default-chat-template-kwargs={\"enable_thinking\": false}"
            - "--enable-auto-tool-choice"
            - "--tool-call-parser=hermes"
```

  
**Multi-node:** `model-deployment-multinode.yaml`

```yaml
# A ModelDeployment serving one model across two nodes.
#
# When a model is too large to fit on one node's GPUs, make an engine a gang:
# give it a Leader and a Worker member, whose worker.nodes expands to that many
# worker pods, one per node. The scheduler picks a cluster with a pool that has
# enough GPUs per node and enough nodes for the whole gang, and Modelplane
# composes a LeaderWorkerSet-backed serving instance on it. The worker joins the
# leader through $(MODELPLANE_LEADER_ADDRESS), which Modelplane injects.
#
# Multi-node engines require a ModelCache: every pod in the gang mounts it at
# /mnt/models. When a member brings its own command, Modelplane does not inject
# --model, so the leader points the engine at the mount explicitly.
#
# This shape (vLLM's native multiprocessing backend, TP within a node and PP
# across nodes) is the one validated serving Qwen3-Coder-480B; see
# examples/qwen3-coder/ for the full platform side.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-coder
  namespace: ml-team
spec:
  replicas: 1
  modelCacheRef:
    name: qwen3-coder
  engines:
  - name: qwen3-coder
    members:
    - role: Leader
      nodeSelector:
        devices:
        - name: gpu
          count: 8
          selectors:
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("120Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            command:
            - /bin/sh
            - -c
            - >-
              exec vllm serve /mnt/models
              --served-model-name=qwen3-coder
              --tensor-parallel-size=8
              --pipeline-parallel-size=2
              --distributed-executor-backend=mp
              --nnodes=2 --node-rank=0
              --master-addr=$(MODELPLANE_LEADER_ADDRESS)
              --max-model-len=32768
              --port=8000
    - role: Worker
      worker:
        nodes: 1
      nodeSelector:
        devices:
        - name: gpu
          count: 8
          selectors:
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("120Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            command:
            - /bin/sh
            - -c
            - >-
              exec vllm serve /mnt/models
              --served-model-name=qwen3-coder
              --tensor-parallel-size=8
              --pipeline-parallel-size=2
              --distributed-executor-backend=mp
              --nnodes=2 --node-rank=1
              --master-addr=$(MODELPLANE_LEADER_ADDRESS)
              --headless
              --max-model-len=32768
```
  
---

# Get started

Source: /getting-started/


Modelplane is an open source control plane for AI inference. It separates two
concerns: a platform team managing GPU capacity, and ML teams deploying models
against it. Without it, every change on one side creates work for the other.
When the platform team updates infrastructure, ML teams have to react. When
model requirements change, the platform team gets a request.

With Modelplane, the platform team publishes hardware without knowing what
models will run on it. The ML team declares what a model needs without knowing
what clusters exist. The control plane resolves it and keeps it current as
both sides change.

In this tour, you'll switch between provisioning infrastructure and declaring a
model to see how they interact. By the end you'll have a GPU fleet across three regions and one OpenAI-compatible endpoint routing to a model served across two of them.

This is not a production setup and takes around 45 minutes to run.

## What you'll build

The platform team provisions a starter cluster and grows it to two A100 regions;
the ML team serves a model on the L4, then scales it onto an A100, all behind one
endpoint.



## Before you begin

You'll need [kind](https://kind.sigs.k8s.io/),
[kubectl](https://kubernetes.io/docs/tasks/tools/), and
[Helm](https://helm.sh/docs/intro/install/) installed, plus an AWS or GCP account
with permission to create clusters. Each step covers what it needs as you reach
it.

## The tour

1. [Installation](/getting-started/installation/): stand up the Modelplane control plane.
2. [Build the platform](/getting-started/build-the-platform/): provision your first GPU cluster.
3. [Deploying a model](/getting-started/deploying-a-model/): serve a model and send it a request.
4. [Scale the platform](/getting-started/scale-the-platform/): grow to a multi-region fleet.
5. [Scale the model](/getting-started/scale-the-model/): serve the model from two regions behind one endpoint.

First, follow the [Installation](/getting-started/installation/) guide.


---

# Installation

Source: /getting-started/installation/

The control plane is where everything in Modelplane runs. In this step you'll install it on a local kind cluster: Crossplane provides reconciliation, and the Modelplane Configuration adds the Modelplane APIs. No cloud yet; that comes next.

This step takes about five minutes.

## Prerequisites

Install [kind](https://kind.sigs.k8s.io/),
[kubectl](https://kubernetes.io/docs/tasks/tools/), and
[Helm](https://helm.sh/docs/intro/install/) on your machine.


> **Note:** You can run your Modelplane control plane anywhere. This tour uses
> kind for illustration.

## Install the control plane

Crossplane provides the reconciliation engine and package management. Create the
kind cluster, then install Crossplane with Helm:

```bash
kind create cluster --name modelplane
```

```bash
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm repo update crossplane-stable
helm install crossplane crossplane-stable/crossplane \
  --namespace crossplane-system --create-namespace \
  --set "args={--enable-dependency-version-upgrades}" \
  --wait
```

Apply the bootstrap resources. They grant Crossplane the permissions it needs to
manage your cluster:

```shell
kubectl apply -f /examples/getting-started/prerequisites.yaml
```


Review the prerequisites manifest, `prerequisites.yaml`:

```yaml
# Modelplane prerequisites. Apply once after installing Crossplane.
#
# These resources grant Crossplane and provider-helm the permissions
# needed to compose Gateway API, MetalLB, and Service/EndpointSlice
# routing resources. They cannot be self-composed because Crossplane
# needs the permissions before it can compose anything.
---
apiVersion: v1
kind: Namespace
metadata:
  name: modelplane-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: crossplane-compose-modelplane
  labels:
    rbac.crossplane.io/aggregate-to-crossplane: "true"
rules:
- apiGroups: [""]
  resources: ["namespaces"]
  verbs: ["*"]
# Selectorless Service plus EndpointSlice composed by ModelEndpoint to route
# the control plane gateway to a remote model endpoint.
- apiGroups: [""]
  resources: ["services"]
  verbs: ["*"]
- apiGroups: ["discovery.k8s.io"]
  resources: ["endpointslices"]
  verbs: ["*"]
- apiGroups: ["gateway.networking.k8s.io"]
  resources: ["gateways", "gatewayclasses", "httproutes"]
  verbs: ["*"]
- apiGroups: ["gateway.envoyproxy.io"]
  resources: ["backends"]
  verbs: ["*"]
- apiGroups: ["metallb.io"]
  resources: ["ipaddresspools", "l2advertisements"]
  verbs: ["*"]
- apiGroups: ["protection.crossplane.io"]
  resources: ["usages"]
  verbs: ["*"]
---
# Give provider-helm a deterministic SA name so we can grant it
# permissions. Without this, the SA name has a random hash.
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: provider-helm-modelplane
spec:
  serviceAccountTemplate:
    metadata:
      name: provider-helm-modelplane
---
# Grant provider-helm cluster-admin. It installs full Helm charts
# (MetalLB, Envoy Gateway, LeaderWorkerSet, etc.) that create arbitrary
# resource types across namespaces.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: provider-helm-modelplane
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: provider-helm-modelplane
  namespace: crossplane-system
---
# Apply the DRC to provider-helm automatically via ImageConfig.
apiVersion: pkg.crossplane.io/v1beta1
kind: ImageConfig
metadata:
  name: provider-helm-modelplane
spec:
  matchImages:
  - type: Prefix
    prefix: xpkg.upbound.io/upbound/provider-helm
  runtime:
    configRef:
      name: provider-helm-modelplane
```
  
## Install Modelplane

The Modelplane Configuration adds the Modelplane APIs and the composition
functions that reconcile them:


`configuration.yaml`

```yaml
# The Modelplane Crossplane Configuration. Installing it adds the Modelplane
# APIs (InferenceGateway, InferenceClass, InferenceCluster, ModelDeployment,
# ModelCache, ModelService) and the composition functions that reconcile them.
apiVersion: pkg.crossplane.io/v1
kind: Configuration
metadata:
  name: modelplane
spec:
  package: xpkg.upbound.io/modelplane/modelplane:v0.1.0
```
  
Wait until the configuration is healthy:

```bash
kubectl wait configuration/modelplane --for=condition=Healthy --timeout=5m
```

## Next step

The control plane is running but has nothing to schedule against yet. In the
next step, you'll [build the platform](/getting-started/build-the-platform/) to provision a GPU cluster and
publish what hardware it offers.


---

# Qwen3-8B

Source: /examples/qwen3-8b/


An 8.2B dense chat model on a single NVIDIA L4. The smallest recipe: one
`Standalone` engine, no cache, weights pulled straight from Hugging Face.

This recipe was run end to end; the `InferenceClass` and `ModelDeployment` are
the exact manifests from that run. Apply the platform side first, then the ML
side.

## Platform


`inference-class.yaml`

```yaml
# InferenceClass for the L4 shape, validated serving Qwen3-8B on EKS.
#
# One NVIDIA L4 on an EKS g6.xlarge. The single GPU is a claim: DRA device;
# the scheduler matches a ModelDeployment's nodeSelector against its declared
# capacity and DRA binds it to the serving pod.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: eks-l4-1x-g6
spec:
  description: "EKS g6.xlarge, 1x NVIDIA L4"
  provisioning:
    provider: EKS
    eks:
      instanceType: g6.xlarge
      diskSizeGb: 100
      accelerator:
        type: nvidia-l4
        count: 1
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 1
    attributes:
      architecture: { string: Ada Lovelace }
    capacity:
      # The L4's real usable VRAM as the NVIDIA DRA driver reports it, not the
      # nominal 24GB.
      memory: { value: "23034Mi" }
```

  
`inference-cluster.yaml`

```yaml
# An EKS InferenceCluster with one L4 node pool, labeled for the
# ModelDeployment's clusterSelector to target.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: eks-l4
  labels:
    modelplane.ai/region: us
spec:
  cluster:
    source: EKS
    eks:
      region: us-west-2
  nodePools:
  - name: gpu-l4
    className: eks-l4-1x-g6
    nodeCount: 1
    minNodeCount: 1
    maxNodeCount: 1
    zones:
    - us-west-2a
```

  
## Deployment


`model-deployment.yaml`

```yaml
# Qwen3-8B served on a single NVIDIA L4, validated end to end on EKS.
#
# An 8.2B dense model is a single Standalone engine: one self-contained vLLM
# pod, no ModelCache, weights pulled straight from Hugging Face. The flags carry
# real meaning beyond fit:
#
#   --tool-call-parser=hermes        the parser for Qwen3 dense (qwen3_xml is
#                                    for Qwen3-Coder, not this model). Qwen3's
#                                    tool-use template ships in the tokenizer,
#                                    so no --chat-template is needed.
#   --reasoning-parser=qwen3 with
#   --default-chat-template-kwargs   turns thinking off. Qwen3 thinks by
#                                    default, burying a one-line answer under a
#                                    <think> block and forbidding greedy decode.
#   --max-model-len / --gpu-memory-utilization  L4 fit, not correctness.
#
# No --port or --host: Modelplane's routing expects the engine on its default
# :8000 with a /health probe, and passes args through verbatim.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-8b
  namespace: ml-team
spec:
  replicas: 1
  clusterSelector:
    matchLabels:
      modelplane.ai/region: us
  engines:
  - name: qwen3-8b
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            args:
            - "--model=Qwen/Qwen3-8B"
            - "--served-model-name=qwen"
            - "--max-model-len=16384"
            - "--gpu-memory-utilization=0.92"
            - "--reasoning-parser=qwen3"
            - "--default-chat-template-kwargs={\"enable_thinking\": false}"
            - "--enable-auto-tool-choice"
            - "--tool-call-parser=hermes"
```

  
`model-service.yaml`

```yaml
# Exposes the qwen3-8b deployment's endpoints as a single OpenAI-compatible URL.
# Modelplane labels each composed ModelEndpoint with the deployment name, so this
# selector reaches every replica. Read the public address from status.address:
#   kubectl get ms qwen3-8b -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen3-8b
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen3-8b
```

  
---

# Set Up the Gateway

Source: /platform/inference-gateway/

**API:** [`modelplane.ai/v1alpha1` · InferenceGateway](/reference/inferencegateways/)

The `InferenceGateway` sets up the control plane's front door: one unified,
OpenAI-compatible address that every `ModelService` is exposed through, routing
each request on to the inference cluster serving it.

The `InferenceGateway` is a singleton: create exactly one, named `default`, on
your Modelplane control plane. It fronts every inference cluster in the fleet, so
you don't create one per cluster.

The `backend` field selects which gateway implementation backs it. `Traefik` is
the only value today.

On a cloud cluster with a native LoadBalancer controller, the gateway's `Service`
gets an external address on its own. On kind or bare-metal, where there's no such
controller, set `spec.traefik.loadBalancer: MetalLB` and give it an address pool
in `spec.traefik.metallb.addressPool` so the gateway gets an IP. See the example
below.

Once the gateway is ready, read its external address from `status.address`:

```bash
kubectl get ig default -o jsonpath='{.status.address}'
```

That address is the host of every `ModelService` URL
(`http://<address>/<namespace>/<service>`), so it's what you hand to ML teams.

## Example


`inference-gateway.yaml`

```yaml
# The InferenceGateway creates a unified, OpenAI-compatible endpoint on the
# control plane cluster. It installs Traefik Proxy and creates a Gateway that
# routes traffic to model replicas on remote inference clusters.
#
# Create one InferenceGateway per control plane. It must be named "default".
#
# For kind or bare-metal clusters, set loadBalancer to MetalLB and configure an
# address pool. For cloud clusters with native LoadBalancer support, omit the
# loadBalancer field entirely.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceGateway
metadata:
  name: default
spec:
  backend: Traefik
  traefik:
    version: "40.2.0"

    # Remove the loadBalancer section if your cluster supports LoadBalancer
    # services natively (e.g. GKE, EKS).
    loadBalancer: MetalLB
    metallb:
      addressPool: "172.18.255.200-172.18.255.250"
```
  
---

# Why Modelplane

Source: /overview/why/


Organizations are increasingly choosing open-weight models: they can be
post-trained, including with reinforcement learning, to compete with frontier
models, and they put cost, governance, and data sovereignty back under the
organization's control. As adoption grows, platform teams are increasingly asked
to provide GPU inference to their ML and development teams the same way they
already provide cloud infrastructure.

## Kubernetes is becoming the default orchestrator

Kubernetes is rapidly becoming the default orchestrator for inference. The broader 
cloud-native community is investing heavily to make it a first-class platform for
AI workloads, adding device-aware scheduling, multi-node inference, distributed
serving, and accelerator management. The major open source inference projects are
converging on it; among them are vLLM, SGLang, NVIDIA Dynamo, llm-d, Ray, Slurm,
KubeAI, and Kueue. Neoclouds like Baseten and CoreWeave have standardized on
Kubernetes for their own operations. Inside a single cluster, the open source
stack is now strong.

## Inference is a fleet problem

Inference, however, almost always runs across more than one cluster. Accelerator
availability scatters capacity across hardware types, providers, and regions.
Sovereignty and compliance pin workloads to specific locations. Operators run
across multiple clouds and on-premise environments. Large clusters
concentrate failure and risk, so fleets of smaller clusters are often preferable,
and inference workloads don't bin-pack the way other workloads do.

Inference grows into a fleet, and a new set of problems appears above
any single cluster:

- Deciding where each model runs across available capacity.
- Optimizing placement across heterogeneous accelerators.
- Failing over across clouds and regions.
- Routing by cost, latency, and sovereignty requirements.
- Provisioning new capacity as demand grows.
- Caching and distributing model weights across the fleet.
- Managing the lifecycle of models, clusters, and infrastructure as one system.

Open source projects address pieces of this, but none brings them together in a
fleet-wide system of record that manages placement, caching, capacity, policy,
and routing. The labs, hyperscalers, and managed providers have all solved these
problems in a proprietary way, but the open equivalent does not yet exist.

## Modelplane extends Kubernetes to manage the fleet

Modelplane does for the fleet what Kubernetes does for the cluster. It's the open
source control plane above your inference clusters across cloud, neocloud, and
on-premise: it places model deployments, autoscales replicas, provisions and
manages the infrastructure underneath, caches and distributes model weights, and
routes inference through one unified gateway with fallback to managed providers.
It turns "I need this model served" into a stable endpoint for any ML team.

Modelplane composes existing open source inference projects rather than
replacing them, and stays neutral across models, accelerators, clouds, and
serving stacks. It's built on [Crossplane](https://crossplane.io) and extends
Kubernetes to manage inference at the fleet level. Modelplane is open source,
Apache 2.0 licensed, and we plan to donate it to a neutral open source
foundation later this year.


- **[How Modelplane works](/overview/how-it-works/)**: The architecture, the
  resources, and what happens when you deploy a model.
- **FAQ**: How Modelplane compares to cluster orchestrators and managed
  providers, and what it requires.


---

# Build the platform

Source: /getting-started/build-the-platform/

This is the platform team's side of Modelplane. You set up the gateway that
fronts your models, give the control plane cloud credentials, and register your
first GPU cluster: a hardware profile published as an `InferenceClass` and an
`InferenceCluster` that offers it.

In the next step, the ML team will create a model deployment that schedules
against this capacity without knowing which cluster it runs on.

## Prerequisites


**EKS**

- An AWS account with permissions to create EKS clusters, VPCs, and IAM roles
- AWS access key ID and secret access key

**GKE**

- A GCP account with permissions to create GKE clusters, VPCs, and IAM roles
- A GCP service account JSON key


## Set up the InferenceGateway


The `InferenceGateway` installs Traefik Proxy and MetalLB on the control plane.
Traefik routes inference traffic to model replicas. MetalLB assigns Traefik's
`LoadBalancer` service an external IP on kind, which doesn't have a cloud load
balancer. You need one named `default` per control plane.


If you run the control plane on a cloud cluster with native `LoadBalancer`
support, omit the `loadBalancer` field.


`inference-gateway.yaml`

```yaml
# The InferenceGateway creates a unified, OpenAI-compatible endpoint on the
# control plane cluster. It installs Traefik Proxy and creates a Gateway that
# routes traffic to model replicas on remote inference clusters.
#
# Create one InferenceGateway per control plane. It must be named "default".
#
# For kind or bare-metal clusters, set loadBalancer to MetalLB and configure an
# address pool. For cloud clusters with native LoadBalancer support, omit the
# loadBalancer field entirely.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceGateway
metadata:
  name: default
spec:
  backend: Traefik
  traefik:
    version: "40.2.0"

    # Remove the loadBalancer section if your cluster supports LoadBalancer
    # services natively (e.g. GKE, EKS).
    loadBalancer: MetalLB
    metallb:
      addressPool: "172.18.255.200-172.18.255.250"
```

  
Wait until the gateway is ready:

```bash
kubectl wait --for=condition=Ready ig/default --timeout=5m
```

## Configure cloud credentials

Give the control plane credentials so it can provision clusters in your cloud
account.


**EKS**

Create an AWS credentials file:

```ini
[default]
aws_access_key_id = 
aws_secret_access_key = 
```

Create a Kubernetes secret:

```bash
kubectl create secret generic aws-creds \
  --from-file=credentials= \
  -n crossplane-system
```

Apply the `ClusterProviderConfig` referencing your secret:

`clusterproviderconfig-aws.yaml`

```yaml
# Points the AWS provider at the credentials Secret you created. Named default,
# so InferenceClusters with an EKS source use it without further configuration.
apiVersion: aws.m.upbound.io/v1beta1
kind: ClusterProviderConfig
metadata:
  name: default
spec:
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: aws-creds
      key: credentials
```

  
**GKE**

Create a Kubernetes secret:

```bash
kubectl create secret generic gcp-creds \
  --from-file=credentials=.json \
  -n crossplane-system
```

Apply the `ClusterProviderConfig`, setting `projectID` to your GCP project:

`clusterproviderconfig-gke.yaml`

```yaml
# Points the GCP provider at the credentials Secret you created. Named default,
# so InferenceClusters with a GKE source use it without further configuration.
apiVersion: gcp.m.upbound.io/v1beta1
kind: ClusterProviderConfig
metadata:
  name: default
spec:
  projectID: my-gcp-project  # replace with your GCP project
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: gcp-creds
      key: credentials
```

```bash
curl -fsSL /examples/getting-started/clusterproviderconfig-gke.yaml \
  | sed 's/my-gcp-project//' \
  | kubectl apply -f -
```


## Publish hardware and register the cluster

The `InferenceClass` describes a hardware profile and how to provision it. The
`InferenceCluster` registers a cluster that offers it. Apply both:


**EKS**

`platform.yaml`

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: l4-1x-g6
spec:
  description: "EKS g6.xlarge, 1x NVIDIA L4"
  provisioning:
    provider: EKS
    eks:
      instanceType: g6.xlarge
      diskSizeGb: 50
      accelerator:
        type: nvidia-l4
        count: 1
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 1
    attributes:
      architecture: { string: Ada Lovelace }
    capacity:
      memory: { value: "23034Mi" }   # L4's real reported VRAM (not the nominal 24GB)
---
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: eks-us-east
  labels:
    modelplane.ai/region: us-east
spec:
  cluster:
    source: EKS
    eks:
      region: us-east-1
  nodePools:
  - name: gpu-l4
    className: l4-1x-g6
    nodeCount: 1
    minNodeCount: 1
    maxNodeCount: 1
    zones:
    - us-east-1b
```

Modelplane provisions the cluster. This takes about 15 minutes:

```bash
kubectl wait --for=condition=Ready ic/eks-us-east --timeout=20m
```


**GKE**

Apply the manifest, setting the cluster's `project` to your GCP project:

`platform.yaml`

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: gke-l4-1x-g2
spec:
  description: "GKE g2-standard-8, 1x NVIDIA L4"
  provisioning:
    provider: GKE
    gke:
      machineType: g2-standard-8
      diskSizeGb: 100
      accelerator:
        type: nvidia-l4
        count: 1
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 1
    attributes:
      architecture: { string: Ada Lovelace }
    capacity:
      memory: { value: "23034Mi" }   # L4's real reported VRAM (not the nominal 24GB)
---
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: starter
  labels:
    modelplane.ai/region: us-central
spec:
  cluster:
    source: GKE
    gke:
      project: my-gcp-project
      region: us-central1
  nodePools:
  - name: gpu-l4
    className: gke-l4-1x-g2
    nodeCount: 1
    minNodeCount: 1   # keep >=1; the autoscaler can't scale a GPU pool up from 0 for DRA pods
    maxNodeCount: 2
    zones:
    - us-central1-a
```

```bash
curl -fsSL /examples/getting-started/gke/platform.yaml \
  | sed 's/my-gcp-project//' \
  | kubectl apply -f -
```

Modelplane provisions the cluster. This takes about 15 minutes:

```bash
kubectl wait --for=condition=Ready ic/starter --timeout=20m
```


    Note
  
  
    Modelplane is reconciling the infrastructure against the source of truth, the
manifest you just applied.

While you wait, Modelplane is creating the EKS or GKE cluster and its GPU node
pool, then installing the inference stack: LeaderWorkerSet for multi-node
serving, llm-d for inference-aware routing, Envoy Gateway for traffic
management, and the storage class for model weights. This is the same
reconciliation loop Crossplane uses to configure other infrastructure, extended
to the inference layer.

  
Once the cluster is `Ready` the ML team can deploy a model on it.


    Note
  
  
    A cloud GPU cluster costs money while it runs. To stop the tour and resume
later, follow Clean up.
  

## Next step

Now that the platform is provisioned, the ML team can [deploy a model](/getting-started/deploying-a-model/) by describing what the model needs, not the infrastructure.


---

# Define Hardware Classes

Source: /platform/inference-class/

**API:** [`modelplane.ai/v1alpha1` · InferenceClass](/reference/inferenceclasses/)


An `InferenceClass` is a tested recipe for a GPU node pool. It bundles:


- **Devices**: the node's hardware as a list of Dynamic Resource Allocation (DRA)
  style devices, each with a driver, count, typed attributes, and capacity. The
  scheduler matches a member's `nodeSelector` against these devices, and GPUs
  bind to pods through DRA.
- **Provisioning** (optional): how to create a node pool of this class on a
  specific cloud. Classes without provisioning are for existing clusters where
  the pool already exists.

Different clouds and GPU types imply different classes. A GKE L4 pool is
`gke-l4-1x-g2`. A bare-metal H100 pool is `h100-8x-byo` (no provisioning).

## Describing devices

A class's `devices` follow Kubernetes
[Dynamic Resource Allocation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)
(DRA), the mechanism modern Kubernetes uses to match GPUs to pods. Each device
has a `driver` (the vendor that owns it, such as `gpu.nvidia.com`), a `count`
(how many a node has), typed `attributes` (such as `architecture`), and
`capacity` (quantities, such as `memory`). This mirrors the shape the GPU's DRA
driver publishes on a real node, so what you declare here is what an ML team's
`nodeSelector` matches against and what DRA binds at runtime.

You author the attribute and capacity keys, and there's no fixed list. Pick the
ones an ML team would reasonably select on, such as GPU memory, architecture, and
compute capability, and use the same names the driver reports.

## DRA and synthetic devices

Each device sets a `claim` discriminator:

- **`DRA`** (the default) is hardware a real DRA driver exposes, today GPUs.
  Modelplane both schedules against it and binds it to pods.
- **`Synthetic`** is described for scheduling only, never claimed. Use it for
  hardware that matters for placement but has no DRA driver yet, like an
  InfiniBand fabric.

## The device contract

The `driver`, attribute keys, and capacity keys a class declares are a contract
with the ML team: a `ModelDeployment`'s `nodeSelector` matches a pool only if the
class publishes the attributes and capacity it asks for. ML teams write those
matches as [CEL](https://cel.dev/) selectors over the keys you publish here. For
GPUs, these keys should mirror what the DRA driver reports, so the same selector
that places a deployment on the pool also binds the right device.

Publish a device's real usable capacity, not its nominal spec. An `80GB` H100
reports about `81559Mi` of usable memory, so a class that declares `80Gi` would
let a `nodeSelector` asking for `>= 80Gi` match the pool but then fail to bind the
GPU.
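
The arithmetic behind that warning can be checked with a few lines of Python. This is a minimal sketch, not Modelplane code: `to_bytes` and `satisfies` are hypothetical helpers that handle only the binary suffixes used on this page, and the comparison stands in for what a CEL selector such as `compareTo(quantity("80Gi")) >= 0` would evaluate.

```python
# Binary-suffix multipliers for the Kubernetes quantities used on this page.
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as "81559Mi" or "80Gi" to bytes (binary suffixes only)."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: treat as a plain byte count


def satisfies(capacity: str, minimum: str) -> bool:
    """True when a declared capacity meets a ">= minimum" request."""
    return to_bytes(capacity) >= to_bytes(minimum)


nominal = "80Gi"      # what the spec sheet suggests
reported = "81559Mi"  # what the driver reports, per the text above

print(to_bytes("80Gi") // 2**20)    # 80Gi expressed in Mi: 81920
print(satisfies(nominal, "80Gi"))   # True: a class declaring 80Gi matches the selector
print(satisfies(reported, "80Gi"))  # False: the real device falls short
```

Because `80Gi` is `81920Mi`, a class that declares the nominal figure promises about 361Mi more than the device has, which is exactly the gap between matching the pool and binding the GPU.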

## Examples


**GKE L4**

`inference-class-gke-l4.yaml`

```yaml
# An InferenceClass describing GKE g2-standard-8 with one NVIDIA L4 GPU.
#
# The provisioning block tells Modelplane how to create a node pool of
# this class on GKE. The devices block describes what hardware a node of
# this class has, DRA-style - used by the scheduler to match models to
# clusters, and to form DRA ResourceClaims for claim: DRA devices.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: gke-l4-1x-g2
spec:
  description: "GKE g2-standard-8, 1x NVIDIA L4"
  provisioning:
    provider: GKE
    gke:
      machineType: g2-standard-8
      diskSizeGb: 100
      accelerator:
        type: nvidia-l4
        count: 1
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 1
    attributes:
      architecture: { string: Ada Lovelace }
    capacity:
      memory: { value: "23034Mi" }
```

  
**EKS L4**

`inference-class-eks-l4.yaml`

```yaml
# An InferenceClass describing EKS g6.xlarge with one NVIDIA L4 GPU.
#
# The provisioning block tells Modelplane how to create a node group of
# this class on EKS. The devices block describes what hardware a node of
# this class has, DRA-style - used by the scheduler to match models to
# clusters, and to form DRA ResourceClaims for claim: DRA devices.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: eks-l4-1x-g6
spec:
  description: "EKS g6.xlarge, 1x NVIDIA L4"
  provisioning:
    provider: EKS
    eks:
      instanceType: g6.xlarge
      diskSizeGb: 100
      accelerator:
        type: nvidia-l4
        count: 1
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 1
    attributes:
      architecture: { string: Ada Lovelace }
    capacity:
      # The L4's real usable VRAM, what the NVIDIA DRA driver reports in the
      # node's ResourceSlice, not its nominal 24GB. The scheduler matches a
      # nodeSelector against this, so declaring the marketing number would let
      # it place a model that wants 23Gi onto a node DRA then can't satisfy.
      memory: { value: "23034Mi" }
```

  
**H100 bare-metal**

`inference-class-h100-byo.yaml`

```yaml
# An InferenceClass describing a BYO 8x H100 node pool.
#
# No provisioning block: this class describes hardware that already
# exists on a bring-your-own cluster. Modelplane copies the devices block
# onto the cluster's status.gpuPools for the scheduler to match against.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: h100-8x-byo
spec:
  description: "BYO 8x NVIDIA H100 80GB"
  # DRA-style devices. These are the contract a ModelDeployment's
  # nodeSelector matches against. Keys are bare names; the domain comes
  # from each device's driver. claim: DRA devices are emitted as requests
  # in a DRA ResourceClaim; claim: Synthetic devices (here the InfiniBand
  # fabric, which has no DRA driver) are matched for scheduling only.
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 8
    attributes:
      architecture: { string: Hopper }
      cudaComputeCapability: { version: "9.0.0" }
    capacity:
      # The H100 80GB's real usable VRAM, what the NVIDIA DRA driver reports,
      # not its nominal 80GB. A nodeSelector asking for >= 80Gi would never bind.
      memory: { value: "81559Mi" }
  - name: nic
    claim: Synthetic
    driver: nic.nvidia.com
    count: 8
    attributes:
      linkType: { string: infiniband }
```
  
---

# Expose a Model

Source: /models/model-service/

**API:** [`modelplane.ai/v1alpha1` · ModelService](/reference/modelservices/)

A [`ModelDeployment`](/models/model-deployment/) serves a model, but its
replicas are scattered across the fleet with no single address. A `ModelService`
gives them one: a stable, unified, OpenAI-compatible URL that load-balances
across every replica, wherever it runs.

A service selects what to route to by label. Behind the scenes, Modelplane
creates one `ModelEndpoint`, a single reachable backend, for each replica of a
deployment and labels it. Two of those labels carry routing intent:

- `modelplane.ai/deployment`: the deployment the replica belongs to.
- `modelplane.ai/cluster`: the cluster the replica runs on.

Modelplane creates an endpoint only once its replica is Ready, serving and
reachable, and withdraws it if the replica later goes unhealthy. A service only
ever routes to replicas that can actually answer, so a deployment that's still
starting or scaling up has fewer endpoints behind its URL until those replicas
come up. You don't create endpoints yourself. You point a service at them.

`spec.endpoints` is a list, and the entries combine: the service routes to every
endpoint that any entry matches. The patterns below build on that.

## Route to a whole deployment

The common case: one selector matching a deployment's name reaches every replica,
wherever in the fleet they run.

```yaml {nocopy=true}
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen3-8b   # every replica of this deployment
```

## Route to part of a deployment

Add a second label to narrow within a deployment. A selector matches an endpoint
only when all its labels match, so pairing the deployment with a cluster routes to
just that cluster's replicas. This is how you take a cluster out of service
without redeploying: point the service at the clusters you want and leave one out,
and traffic drains to the rest.

```yaml {nocopy=true}
spec:
  endpoints:
  # Only the replicas on prod-us-east, e.g. while draining another cluster.
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen3-8b
        modelplane.ai/cluster: prod-us-east
```
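
The matching rules described so far (every label within one selector must match, and the service routes to whatever any entry matches) can be sketched in a few lines of Python. This is an illustration of the semantics only, not Modelplane's implementation; the endpoint names and the `prod-us-west` cluster are made up for the example.

```python
def selector_matches(match_labels: dict, labels: dict) -> bool:
    """A selector matches only when every one of its labels is present with the same value."""
    return all(labels.get(key) == value for key, value in match_labels.items())


def select_endpoints(entries: list, endpoints: list) -> list:
    """Names of the endpoints matched by at least one entry, in input order."""
    return [
        ep["name"]
        for ep in endpoints
        if any(selector_matches(entry, ep["labels"]) for entry in entries)
    ]


# Hypothetical endpoints, labeled the way the text describes.
endpoints = [
    {"name": "ep-0", "labels": {"modelplane.ai/deployment": "qwen3-8b",
                                "modelplane.ai/cluster": "prod-us-east"}},
    {"name": "ep-1", "labels": {"modelplane.ai/deployment": "qwen3-8b",
                                "modelplane.ai/cluster": "prod-us-west"}},
    {"name": "ep-2", "labels": {"modelplane.ai/deployment": "qwen3-8b-v2",
                                "modelplane.ai/cluster": "prod-us-east"}},
]

whole = [{"modelplane.ai/deployment": "qwen3-8b"}]
narrowed = [{"modelplane.ai/deployment": "qwen3-8b",
             "modelplane.ai/cluster": "prod-us-east"}]

print(select_endpoints(whole, endpoints))     # ['ep-0', 'ep-1']
print(select_endpoints(narrowed, endpoints))  # ['ep-0']
```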

## Route across several deployments

Give more than one entry to front several deployments behind the same URL. Each
entry contributes its matched endpoints, and traffic spreads evenly across every
one.

```yaml {nocopy=true}
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen3-8b
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen3-8b-v2
```

This is the shape an A/B test or a canary rollout would take, but note traffic is
split **evenly** across the matched endpoints today. Weighting one entry over
another, to send, say, 5% of traffic to a canary, is tracked in
[#90](https://github.com/modelplaneai/modelplane/issues/90). Until then the split
follows endpoint counts, not a ratio you set.
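
To see what "follows endpoint counts" means in practice, here is a small Python sketch. It assumes each matched endpoint receives an equal share of requests, as the text states; `traffic_share` is a hypothetical helper and the replica counts are invented for the example.

```python
from collections import Counter


def traffic_share(endpoint_deployments: list) -> dict:
    """Fraction of requests per deployment when every endpoint gets an equal share."""
    counts = Counter(endpoint_deployments)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Hypothetical fleet: 3 ready replicas of the stable deployment, 1 of the candidate.
shares = traffic_share(["qwen3-8b"] * 3 + ["qwen3-8b-v2"])
print(shares)  # {'qwen3-8b': 0.75, 'qwen3-8b-v2': 0.25}
```

Under this model, approximating a 5% canary would take 19 stable endpoints for each canary endpoint, which is why replica count is a poor substitute for a weight.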

The entries don't have to be deployments. One can select a manually created
[ModelEndpoint](/models/model-endpoint/) that points at an external
provider, so a service can send overflow or break-glass traffic to a SaaS
endpoint alongside your own replicas:

```yaml {nocopy=true}
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: kimi-k2
  - selector:
      matchLabels:
        modelplane.ai/external-provider: together
```

Endpoints with different path layouts coexist behind the one URL.

## Sending a request

The service's public address is on `status.address`, in the form
`http://<gateway>/<namespace>/<service-name>`:

```bash
ADDRESS=$(kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}')
```

Append the OpenAI path and send a request. The `model` field is the name the
engine serves (its `--served-model-name`, or the model's Hugging Face id if you
didn't set one):

```bash
curl "$ADDRESS/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
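
The same request can be built from Python with only the standard library. This is a sketch under assumptions: `build_chat_request` is a hypothetical helper, the gateway hostname is a placeholder, and the request is constructed but not sent so the example runs without a live gateway.

```python
import json
import urllib.request


def build_chat_request(address: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for a ModelService address."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=address.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder address; read the real one from status.address.
req = build_chat_request("http://gateway.example.com/ml-team/qwen", "qwen", "Hello!")
print(req.full_url)

# To send it against a live gateway:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```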

## Alternate APIs

We call the endpoint OpenAI-compatible because the engines are, not because
Modelplane imposes it. The route matches the `/<namespace>/<service>/` prefix and
preserves the path below it on the way to the engine, so any API the engine serves
is reachable on the same URL.

Take a vLLM replica that also serves the Anthropic Messages API. It answers on
`.../v1/messages`, so a client that speaks it (including Claude Code, via
`ANTHROPIC_BASE_URL`) talks to it directly. The engine's operational paths come
through the same way: `.../health` and the Prometheus `.../metrics` are reachable
on the service URL.
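
The prefix rule can be pictured as a simple path rewrite. The sketch below assumes the `/<namespace>/<service>/` prefix is stripped and the remainder is forwarded unchanged, which is how the description above reads; `engine_path` is a hypothetical helper, not a Modelplane or gateway API.

```python
from typing import Optional


def engine_path(request_path: str, namespace: str, service: str) -> Optional[str]:
    """Return the path forwarded to the engine, or None when the route does not match."""
    prefix = f"/{namespace}/{service}/"
    if not request_path.startswith(prefix):
        return None
    # Everything below the prefix is passed through unchanged.
    return "/" + request_path[len(prefix):]


print(engine_path("/ml-team/qwen/v1/chat/completions", "ml-team", "qwen"))  # /v1/chat/completions
print(engine_path("/ml-team/qwen/v1/messages", "ml-team", "qwen"))          # /v1/messages
print(engine_path("/ml-team/qwen/metrics", "ml-team", "qwen"))              # /metrics
print(engine_path("/other-team/qwen/health", "ml-team", "qwen"))            # None
```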

There's one exception, and it's set by the deployment rather than the service.
[Disaggregated serving](/models/model-deployment/#disaggregated-serving)
reads OpenAI-format request bodies to pick a prefill and decode worker, so a
request in another API shape still reaches the engine but skips that
cache-aware routing. Unified serving forwards every API shape the same way.

## Example


`model-service.yaml`

```yaml
# A ModelService exposes one or more ModelDeployments via a single
# OpenAI-compatible endpoint. It composes a Gateway-API HTTPRoute on the
# control plane that load-balances across every ModelEndpoint matching
# its selector.
#
# Modelplane composes one ModelEndpoint per ModelReplica, labeled
# `modelplane.ai/deployment: <deployment-name>`. So a ModelService with
# that label selector reaches every replica of the named deployment.
#
# Once the service is ready, its public address is on status.address:
#   kubectl get ms qwen3-8b -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen3-8b
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen3-8b
```

  
---

# How it schedules

Source: /architecture/scheduling/

**API:** [`modelplane.ai/v1alpha1` · ModelDeployment](/reference/modeldeployments/)

When an ML team creates a [ModelDeployment](/models/model-deployment/),
the fleet scheduler decides which cluster each replica runs on and which node
pool each engine uses. Platform teams don't drive it directly, but what they
publish, the clusters, their labels, and each pool's
[InferenceClass](/platform/inference-class/), is exactly what the
scheduler matches against. This page explains how it places work and where it
deliberately stops short, so you can reason about why a deployment landed where it
did.

## A pure function of observed state

The scheduler recomputes the whole placement from scratch on every reconcile. It
reads the deployment, every `InferenceCluster` with its published capacity, and
every existing `ModelReplica`, and returns a placement. Given the same inputs it
returns the same placement, so it's safe to run continuously.

The key consequence is stability. Existing replicas are *inputs*, not decisions.
A healthy replica is never moved to improve the global picture, even if a better
cluster appears later. This keeps placement from churning underneath a running
deployment.

## Two-level matching

The scheduler picks a `(cluster, pool)` for each replica in two stages, matching
against what the platform team published.

1. **Clusters** are filtered by `clusterSelector.matchLabels` against the
   standard Kubernetes labels on each `InferenceCluster`: tier, region, provider,
   compliance posture. This is organizational metadata, so string equality is
   enough. An unset selector matches every cluster.
2. **Pools** are filtered by matching each device request in a member's
   `nodeSelector.devices` against the devices a pool's `InferenceClass` publishes.
   A request is a real DRA request: a `count` and CEL selectors over a device's
   attributes and capacity, such as "a GPU with at least 141Gi of memory." A pool
   fits a member when it has devices satisfying every request, with `count` to
   cover them.

The CEL is the same expression an ML engineer would write in a DRA
`ResourceClaim`, evaluated against the devices the `InferenceClass` declares. The
keys a platform team puts on a class are the contract: a `nodeSelector` matches a
pool only if the class publishes the attributes and capacity it asks for.
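
The two stages can be sketched as two small predicates. This is a simplified illustration, not the scheduler's code: Python callables stand in for CEL selectors, capacity is pre-parsed into Mi, and the sketch does not model two requests competing for the same device's count. All class and function names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Device:
    driver: str
    count: int
    attributes: dict = field(default_factory=dict)
    capacity_mi: dict = field(default_factory=dict)  # capacity in Mi, pre-parsed for brevity


@dataclass
class Request:
    count: int
    selector: Callable[[Device], bool]  # stand-in for a CEL expression


def cluster_matches(match_labels: dict, cluster_labels: dict) -> bool:
    """Stage 1: string equality on every requested label; an empty selector matches all."""
    return all(cluster_labels.get(k) == v for k, v in match_labels.items())


def pool_fits(requests: list, devices: list) -> bool:
    """Stage 2: every request needs a device that passes its selector with enough count."""
    return all(
        any(req.selector(dev) and dev.count >= req.count for dev in devices)
        for req in requests
    )


l4_pool = [Device("gpu.nvidia.com", 1, {"architecture": "Ada Lovelace"}, {"memory": 23034})]
want_20gi = Request(
    count=1,
    selector=lambda d: d.driver == "gpu.nvidia.com"
    and d.capacity_mi.get("memory", 0) >= 20 * 1024,
)
want_141gi = Request(count=1, selector=lambda d: d.capacity_mi.get("memory", 0) >= 141 * 1024)

print(cluster_matches({"modelplane.ai/region": "us-east"},
                      {"modelplane.ai/region": "us-east"}))  # True
print(pool_fits([want_20gi], l4_pool))   # True
print(pool_fits([want_141gi], l4_pool))  # False
```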

## Co-scheduling and pools

A replica is a set of engines placed together on one cluster. Within a replica,
every member of a single engine is placed on **one** pool: each member carries
its own `nodeSelector`, but the scheduler requires a single pool that satisfies
them all.

It works this way because the members of an engine, its gang, coordinate over
their pool's interconnect fabric, and the scheduler can't reason about fabric.
Pool identity is the finest grain it has. An engine split across pools risks
landing its members on different fabrics. The collective then never forms, and
the gang hangs with no clear error. To avoid that, the scheduler never splits an
engine: an engine that no single pool satisfies isn't scheduled on that cluster.
Different engines of the same replica can use different pools, but all on the
same cluster.

```mermaid
graph TD
    subgraph cluster ["One InferenceCluster"]
        subgraph pool1 ["Pool A"]
            L["prefill engine\nLeader + Worker\n(whole gang, one pool)"]
        end
        subgraph pool2 ["Pool B"]
            D["decode engine\nStandalone"]
        end
    end

    R["ModelReplica"] --> L
    R --> D
```

A member with no `nodeSelector` claims no devices. It matches the engine's pool
at no node cost and rides along on the gang's nodes, packed there by the
cluster's own scheduler.

## Counting capacity in nodes

Capacity is gated on **nodes**, not on individual GPUs. The only number the
scheduler reads from a member is its node cost:

```text
nodes = pods × copies
pods  = 1 for a Standalone or Leader, or worker.nodes for a Worker
```

A member that resolves no `claim: DRA` device, because it carried no
`nodeSelector` or matched only synthetic devices, costs zero nodes. The scheduler
sums the cost of a replica's members and places the replica only where every
engine's pool has enough free nodes, tracking a running ledger so it never
overcommits a cluster.
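
The node-cost rule above is small enough to write out. The following Python sketch assumes a simplified `Member` shape (the field names are hypothetical stand-ins for the real API) and checks a replica against a table of free nodes per pool; it does not model the running ledger across deployments.

```python
from dataclasses import dataclass


@dataclass
class Member:
    role: str                # "Standalone", "Leader", or "Worker"
    copies: int = 1
    worker_nodes: int = 1    # stands in for worker.nodes; only read for a Worker
    claims_dra: bool = True  # False with no nodeSelector, or only synthetic matches


def node_cost(member: Member) -> int:
    """nodes = pods x copies, or zero when the member resolves no claim: DRA device."""
    if not member.claims_dra:
        return 0
    pods = member.worker_nodes if member.role == "Worker" else 1
    return pods * member.copies


def replica_fits(engines: dict, free_nodes: dict) -> bool:
    """engines maps pool name -> members placed there; free_nodes maps pool -> free nodes."""
    return all(
        sum(node_cost(m) for m in members) <= free_nodes.get(pool, 0)
        for pool, members in engines.items()
    )


# Hypothetical replica: a multi-node prefill gang on one pool, decode on another.
replica = {
    "pool-a": [Member(role="Leader"), Member(role="Worker", worker_nodes=3)],
    "pool-b": [Member(role="Standalone", copies=2)],
}

print(sum(node_cost(m) for m in replica["pool-a"]))       # 4
print(replica_fits(replica, {"pool-a": 4, "pool-b": 2}))  # True
print(replica_fits(replica, {"pool-a": 4, "pool-b": 1}))  # False
```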

This accounting is deliberately coarse. The control-plane scheduler answers
"could this cluster plausibly host this replica," not "exactly which GPU does
each pod get." Device-level contention between deployments is left to DRA
admission on the workload cluster, which is authoritative: it rejects a pod whose
`ResourceClaim` can't be satisfied, and the next reconcile sees the updated
state.

## Pinning placement to a pool

The scheduler's pool choice is enforced, not advisory. Each scheduled pod carries
a Kubernetes `nodeSelector` on the `modelplane.ai/pool` node label, so it can only
land on the pool the scheduler chose. Without it, the cluster's scheduler could
place a pod on any pool whose devices match its DRA claim, and the fleet's
per-pool accounting would drift from where pods actually run.

Modelplane labels the nodes of every pool it provisions. On a BYO
(`source: Existing`) cluster it doesn't provision the nodes, so the operator must
label each pool's nodes `modelplane.ai/pool=<nodePools[].name>` themselves, or
worker pods for that pool stay `Pending`.

## Scaling, retention, and re-placement

Scheduling runs in two phases each reconcile:


- **Retain.** Each existing replica keeps its cluster if the cluster still exists
  and every member's pinned pool still matches its (possibly edited)
  `nodeSelector`. A degraded cluster, one that's not Ready or has no gateway
  address, is still retained; transient outages surface through the deployment's
  conditions, not re-placement.
- **Fill.** If the deployment wants more replicas than were retained, the
  shortfall is placed one at a time, each onto the eligible cluster hosting the
  fewest of this deployment's replicas, spreading before packing. If it wants
  fewer, the highest-index replicas are dropped first.
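
The fill phase can be sketched as follows. This is an illustration under assumptions, not the scheduler's implementation: it takes the retained replicas as given, ignores capacity, reuses the lowest free replica index, and breaks ties between equally loaded clusters by name. None of those three choices is specified above.

```python
def fill(desired: int, retained: dict, eligible: list) -> dict:
    """retained maps replica index -> cluster name; returns the new index -> cluster map."""
    placement = dict(retained)

    # Scale down: drop the highest-index replicas first.
    surplus = max(0, len(placement) - desired)
    for index in sorted(placement, reverse=True)[:surplus]:
        del placement[index]

    # Scale up: place the shortfall one replica at a time, spreading before packing.
    next_index = 0
    while len(placement) < desired and eligible:
        while next_index in placement:
            next_index += 1  # assumption: reuse the lowest free index
        hosted = {cluster: 0 for cluster in eligible}
        for cluster in placement.values():
            if cluster in hosted:
                hosted[cluster] += 1
        # Fewest of this deployment's replicas wins; ties broken by name (assumption).
        placement[next_index] = min(eligible, key=lambda c: (hosted[c], c))
    return placement


print(fill(3, {0: "a"}, ["a", "b"]))                  # {0: 'a', 1: 'b', 2: 'a'}
print(fill(1, {0: "a", 1: "b", 2: "a"}, ["a", "b"]))  # {0: 'a'}
```

Retained replicas are passed in rather than recomputed, which mirrors the point that existing replicas are inputs to the scheduler, not decisions it revisits.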


A replica never changes cluster. If its cluster is deleted, the replica stops
being emitted, Crossplane garbage-collects it, and the fill phase mints a fresh
replica elsewhere. Moving is always delete-plus-create, mirroring how Kubernetes
treats a pod whose node is gone.

## Known limitations

The scheduler is built to be conservative and predictable rather than optimal.
Two limits follow from that, both tracked for future work:

- **A whole node is charged per pod**
  ([#172](https://github.com/modelplaneai/modelplane/issues/172)). A pod that
  claims one GPU of an eight-GPU node still charges the whole node in the
  scheduler's accounting. This is safe: it can only under-count a pool's
  capacity, never overcommit it. The cost is that it can strand GPUs when
  deployments use sub-node engines.
- **An engine can't span pools, even on one fabric**
  ([#149](https://github.com/modelplaneai/modelplane/issues/149)). Because the
  scheduler has no concept of fabric, it refuses to split a gang across pools at
  all. That forecloses a legitimate case, GPU workers on one pool and a no-GPU
  coordinator on another within the same fabric, until fabric-aware placement
  lands.
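
To make the cost of whole-node charging concrete, the arithmetic can be written out as below. The function name and its parameters are hypothetical, chosen only for this illustration.

```python
def stranded_gpus(pods, gpus_per_pod, gpus_per_node):
    """GPUs the accounting treats as used but no pod actually claims."""
    if gpus_per_pod > gpus_per_node:
        raise ValueError("a pod cannot claim more GPUs than one node has")
    charged_nodes = pods                 # one whole node is charged per pod
    claimed_gpus = pods * gpus_per_pod   # GPUs the pods really claim
    return charged_nodes * gpus_per_node - claimed_gpus


# Two pods that each claim one GPU of an eight-GPU node charge two nodes
# (16 GPUs) while claiming 2, so 14 GPUs are stranded.
print(stranded_gpus(pods=2, gpus_per_pod=1, gpus_per_node=8))
```

A pod that claims all eight GPUs strands none, so the limit only bites for sub-node engines.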


---

# How Modelplane works

Source: /overview/how-it-works/


Modelplane runs as a control plane on its own cluster, the **control cluster**,
above the **inference clusters** that actually serve models. It's built on
[Crossplane](https://crossplane.io): platform teams and developers describe what
they want as Kubernetes resources, and Modelplane continuously reconciles the
fleet to match, composing the clusters, scheduling replicas, and exposing
endpoints. This page is the full tour. It covers the architecture and resources, then walks through what happens when you deploy a model.

## Modelplane API

Modelplane's API is two sets of resources, one per team, with everything in
between filled in for you. Platform teams describe the fleet, ML teams describe a
model, and Modelplane composes the rest.


**Platform team creates**

- **`InferenceGateway`.** The unified, OpenAI-compatible entry point on the control cluster.
- **`InferenceClass`.** A tested hardware recipe for a node pool: the devices it offers and how to provision it.
- **`InferenceCluster`.** A Kubernetes cluster in the fleet, provisioned by Modelplane or brought as-is.

**ML team creates**

- **`ModelDeployment`.** A model to serve, with engines, replica count, and an optional cache.
- **`ModelService`.** One OpenAI-compatible endpoint, load-balanced across the endpoints it selects.
- **`ModelCache`.** Model weights staged once per cluster on shared storage.

**Modelplane composes**

- **`ModelReplica`.** One complete serving instance on a specific cluster.
- **`ModelEndpoint`.** A reachable endpoint, one per replica or set manually for an external provider.

The hierarchy mirrors Kubernetes core one scope up: `ModelDeployment` →
`ModelReplica` → `ModelService` → `ModelEndpoint` parallels `Deployment` → `Pod` → `Service` →
`Endpoint`, across a fleet instead of within a single cluster.

## What the control plane reconciles

Once the resources exist, Modelplane keeps the fleet matching them. Five concerns
run continuously:

1. **Provisioning.** From an `InferenceCluster`, Modelplane creates a full cluster 
   and its GPU node pools, or brings in a cluster you already run on
   any Kubernetes, and installs the serving stack on each.
2. **Scheduling.** A two-level scheduler places work: it pins each `ModelReplica`
   to a cluster and pool whose hardware meets the model's requirements, then the
   cluster's own scheduler binds the GPUs to the serving pods through DRA.
3. **Autoscaling.** Replicas are the scaling axis. Scaling a `ModelDeployment`'s
   `spec.replicas` adds or removes whole serving instances through the standard
   Kubernetes scale subresource, so both `kubectl scale` and a KEDA `ScaledObject`
   work out of the box.
4. **Routing.** A `ModelService` exposes one OpenAI-compatible endpoint through
   the gateway and load-balances across the deployment's `ModelEndpoints`,
   wherever their replicas run. `ModelEndpoints` can also point at external
   inference services.
5. **Caching.** A `ModelCache` stages model weights on cluster storage once, so
   serving pods read them locally instead of re-downloading on every start.

## Universal compatibility

Modelplane is deliberately unopinionated about the engine. A `ModelDeployment`
describes the *shape* of a deployment, how many pods, on how many nodes, with
which devices, and nothing about how the engine runs internally. The engine flags
you write carry parallelism (tensor, pipeline, data, expert), quantization, and KV
transfer; Modelplane never injects them.

This is what lets one API serve any container-based engine and any topology
without special cases. Modelplane composes the engine onto the right cluster
resource and injects almost nothing, just the address a multi-node leader is
reachable at, so a worker can join it. New engines and new parallelism strategies
work without a change to Modelplane. The community publishes recipes (worked, copyable
manifests) to bridge the gap that flexibility leaves, rather than hard-coding
choices into the API.

## Fleet scheduler

For each replica, the scheduler picks a `(cluster, pool)` in two steps:

1. **Filter clusters** by `clusterSelector.matchLabels` against the standard
   Kubernetes labels on each `InferenceCluster`, the organizational metadata:
   tier, region, provider, compliance posture.
2. **Filter pools** by matching each device request in the deployment's
   `nodeSelector.devices` against the pool's `InferenceClass`. A request is based
   on DRA: a `count` and CEL selectors over a device's attributes and capacity, like
   "a GPU with at least 141Gi of memory." A pool fits when it has the devices the
   model asks for and enough free nodes to hold a replica.

Capacity is accounted at the node level across the fleet, so Modelplane never
overcommits a pool. Replicas are pinned to their cluster once placed and stay
there across reconciles; if a cluster is deleted, the scheduler re-places its
replicas elsewhere. [How it schedules](/architecture/scheduling/)
covers the placement rules and their limits in full.
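
The two filtering steps can be modeled as a short Python sketch. It is a simplification under stated assumptions, not the scheduler's code: every type and function name here is hypothetical, and a minimum-memory threshold plus a match on the device name stands in for the CEL selectors a real request carries.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str             # e.g. "gpu" or "efa"
    count: int            # devices of this kind on each node
    memory_gib: int = 0   # declared capacity, 0 when not applicable


@dataclass
class Pool:
    name: str
    devices: list
    free_nodes: int


@dataclass
class Cluster:
    name: str
    labels: dict
    pools: list = field(default_factory=list)


@dataclass
class Request:
    name: str
    count: int
    min_memory_gib: int = 0   # stand-in for a CEL selector over capacity


def labels_match(labels, match_labels):
    """Step 1: every selector label must be present with the same value."""
    return all(labels.get(k) == v for k, v in match_labels.items())


def pool_fits(pool, requests, nodes_needed=1):
    """Step 2: the pool offers every requested device and has free nodes."""
    def satisfied(request):
        return any(
            d.name == request.name
            and d.count >= request.count
            and d.memory_gib >= request.min_memory_gib
            for d in pool.devices
        )
    return pool.free_nodes >= nodes_needed and all(satisfied(r) for r in requests)


def candidates(clusters, match_labels, requests, nodes_needed=1):
    """All (cluster, pool) pairs that survive both filters."""
    return [
        (c.name, p.name)
        for c in clusters if labels_match(c.labels, match_labels)
        for p in c.pools if pool_fits(p, requests, nodes_needed)
    ]
```

Calling `candidates` with a `modelplane.ai/region: us` selector and requests for 8 GPUs of at least 120 GiB plus 16 EFA interfaces returns only the pools on `us` clusters that declare both devices and still have enough free nodes.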

## Deploying a model

Creating a `ModelDeployment` kicks off the loop end to end. The scheduler
discovers the ready clusters (filtered by your label selector if you set one),
matches each engine's device requests against their pools, and pins each replica
to a cluster that fits. Modelplane composes a `ModelReplica` on each chosen
cluster, turns it into the right serving workload there, creates a `ModelEndpoint`
per replica, and your `ModelService` routes traffic across them through one stable
endpoint on the gateway. Scale the deployment up or down and the same loop
re-converges.

## Serving topologies

A single-node deployment composes to a Kubernetes Deployment fronted by a
service. When a model is too large for one node, an engine becomes a gang: a
`Leader` member and one or more `Worker` members that Modelplane composes into a
LeaderWorkerSet, serving the model together across nodes. Gang deployments
should stage their weights through a `ModelCache`, so the pods share one copy
instead of each pulling the same model.

Disaggregated serving splits prefill and decode into separate engines
(`serving.mode: PrefillDecode`) that run on the same cluster and hand off the KV
cache between them. Modelplane wires up the cluster-edge routing that pairs each
request's prefill and decode; the engines carry the KV-transfer flags. Both are
described in full in the [model deployment docs](/models/model-deployment/).

## Next steps


- **FAQ.** Quick answers on how Modelplane compares and what it requires.
- **Get started.** Put it together: deploy Modelplane and serve a model.


---

# Qwen3-Coder-480B

Source: /examples/qwen3-coder/


A 480B code MoE (35B active). Two validated shapes: the BF16 weights span two
H200 nodes as a gang over EFA, served from a `ModelCache`; the FP8 checkpoint
fits one node, so it runs as a single `Standalone` engine on SGLang with no
cache.

Both shapes were run end to end; the `InferenceClass` and `ModelDeployment` are
the exact manifests from those runs. Apply the platform side first, then the ML
side. The `InferenceCluster` carries an EC2 capacity reservation placeholder to
edit before applying.

## Platform


**Multi-node (BF16):** `inference-class.yaml`

```yaml
# InferenceClass for the H200 shape, validated serving Qwen3-Coder-480B
# multi-node on EKS. 8x NVIDIA H200 on an EKS p5en.48xlarge, with EFA.
#
# Both the GPU and the EFA fabric are claim: DRA devices. A multi-node gang's
# nodeSelector requests both, so the scheduler co-schedules the whole gang on a
# pool that has them and DRA binds 8 GPUs + 16 EFA interfaces per pod. The EFA
# device is installed by the EFA DRA driver (DRANET) in the serving stack.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: eks-h200-8x-p5en
spec:
  description: "EKS p5en.48xlarge, 8x NVIDIA H200, EFA"
  provisioning:
    provider: EKS
    eks:
      instanceType: p5en.48xlarge
      diskSizeGb: 1024
      accelerator:
        type: nvidia-h200
        count: 8
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 8
    attributes:
      architecture: { string: Hopper }
    capacity:
      memory: { value: "140Gi" }   # advertised below the ~141 GiB the driver reports
  - name: efa
    claim: DRA
    driver: dra.net
    deviceClassName: efa.networking.k8s.aws
    count: 16
```

  
`inference-cluster.yaml`:

```yaml
# An EKS InferenceCluster with a two-node H200 pool over EFA, validated serving
# Qwen3-Coder-480B as a multi-node gang. The H200 nodes come from an EC2
# Capacity Block reserved for ML.
#
# fabric: EFA turns on Elastic Fabric Adapter for the gang's cross-node traffic;
# without it multi-node NCCL falls back to TCP, which is slow and unstable.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: eks-coder
  labels:
    modelplane.ai/region: us
spec:
  cluster:
    source: EKS
    eks:
      region: us-east-2
  nodePools:
  - name: gpu-h200
    className: eks-h200-8x-p5en
    nodeCount: 2
    minNodeCount: 2
    maxNodeCount: 2
    zones:
    - us-east-2b
    fabric: EFA
    capacityBlock:
      capacityReservationId: cr-0123456789abcdef0  # replace with your reservation ID
```

  
```bash
curl -fsSL /examples/examples/qwen3-coder/inference-cluster.yaml \
  | sed 's/cr-0123456789abcdef0/<your-reservation-id>/' \
  | kubectl apply -f -
```


**Single-node (FP8):** `inference-class-fp8.yaml`

```yaml
# InferenceClass for the H200 shape without EFA, validated serving the FP8
# Qwen3-Coder-480B checkpoint single-node on SGLang.
#
# The FP8 weights (~480 GB) fit on one 8x H200 node, so this needs no second
# node, no fabric, and no ModelCache - the GPU is the only device.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: eks-h200-8x-p5en
spec:
  description: "EKS p5en.48xlarge, 8x NVIDIA H200"
  provisioning:
    provider: EKS
    eks:
      instanceType: p5en.48xlarge
      diskSizeGb: 1024
      accelerator:
        type: nvidia-h200
        count: 8
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 8
    attributes:
      architecture: { string: Hopper }
    capacity:
      memory: { value: "140Gi" }   # advertised below the ~141 GiB the driver reports
```

  
## Deployment


**Multi-node (BF16):** `model-cache.yaml`

```yaml
# The shared, read-write-many cache the multi-node gang serves from. Hydrated
# once per matched cluster from the gated Hugging Face repo; every gang pod
# mounts it at /mnt/models. ~960 GB of BF16 weights, so sizeGiB leaves headroom.
#
# The repo is gated, so it needs a Hugging Face token. Create the authSecret once
# in the ModelCache's namespace on the control plane; Modelplane propagates it to
# each matched cluster.
apiVersion: modelplane.ai/v1alpha1
kind: ModelCache
metadata:
  name: qwen3-coder
  namespace: ml-team
spec:
  source: HuggingFace
  huggingFace:
    repo: Qwen/Qwen3-Coder-480B-A35B-Instruct
    authSecret:
      name: hf-token
      key: HF_TOKEN
    sizeGiB: 1100
```

  
**Multi-node (BF16):** `model-deployment.yaml`

```yaml
# Qwen3-Coder-480B served BF16 across two H200 nodes, validated end to end on
# EKS over EFA. A 480B MoE doesn't fit one node, so the engine is a Leader +
# Worker gang spanning two nodes via LeaderWorkerSet, both pods mounting the
# shared ModelCache at /mnt/models.
#
# Each member requests 8 GPUs + 16 EFA interfaces per node; the scheduler
# co-schedules the gang on the H200 pool. The worker joins the leader through
# $(MODELPLANE_LEADER_ADDRESS), which Modelplane injects.
#
# Notes on the engine flags:
#   --distributed-executor-backend=mp with --nnodes/--node-rank/--master-addr/
#     --headless is vLLM's native multiprocessing multi-node path.
#     vllm/vllm-openai:v0.23.0 no longer ships Ray, so the Ray-based
#     multi-node-serving.sh helper doesn't work on this image; the MP backend
#     needs nothing extra.
#   TP8 x PP2: tensor-parallel within a node over NVLink, pipeline-parallel
#     across the two nodes. tensor-parallel-size = GPUs per node,
#     pipeline-parallel-size = nodes.
#   --tool-call-parser=qwen3_xml is the parser for Qwen3-Coder specifically
#     (the dense Qwen3 models use hermes). The model is non-thinking, so there's
#     no reasoning parser.
#   --max-model-len=32768 caps context to fit; the native 256K isn't needed.
#   FI_PROVIDER=efa selects the EFA libfabric provider for NCCL's cross-node
#     traffic; NCCL_DEBUG=INFO logs which transport NCCL actually picked.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-coder
  namespace: ml-team
spec:
  replicas: 1
  clusterSelector:
    matchLabels:
      modelplane.ai/region: us
  modelCacheRef:
    name: qwen3-coder
  engines:
  - name: qwen3-coder
    members:
    - role: Leader
      nodeSelector:
        devices:
        - name: gpu
          count: 8
          selectors:
          - cel: |
              device.driver == "gpu.nvidia.com" && device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("120Gi")) >= 0
        - name: efa
          count: 16
          selectors:
          - cel: |
              device.driver == "dra.net"
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            env:
            - name: NCCL_DEBUG
              value: "INFO"
            - name: FI_PROVIDER
              value: "efa"
            command:
            - /bin/sh
            - -c
            - >-
              exec vllm serve /mnt/models
              --served-model-name=qwen3-coder
              --tensor-parallel-size=8
              --pipeline-parallel-size=2
              --distributed-executor-backend=mp
              --nnodes=2 --node-rank=0
              --master-addr=$(MODELPLANE_LEADER_ADDRESS)
              --max-model-len=32768
              --gpu-memory-utilization=0.92
              --enable-auto-tool-choice
              --tool-call-parser=qwen3_xml
              --port=8000
    - role: Worker
      worker:
        nodes: 1
      nodeSelector:
        devices:
        - name: gpu
          count: 8
          selectors:
          - cel: |
              device.driver == "gpu.nvidia.com" && device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("120Gi")) >= 0
        - name: efa
          count: 16
          selectors:
          - cel: |
              device.driver == "dra.net"
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            env:
            - name: NCCL_DEBUG
              value: "INFO"
            - name: FI_PROVIDER
              value: "efa"
            command:
            - /bin/sh
            - -c
            - >-
              exec vllm serve /mnt/models
              --served-model-name=qwen3-coder
              --tensor-parallel-size=8
              --pipeline-parallel-size=2
              --distributed-executor-backend=mp
              --nnodes=2 --node-rank=1
              --master-addr=$(MODELPLANE_LEADER_ADDRESS)
              --headless
              --max-model-len=32768
              --gpu-memory-utilization=0.92
```

  
**Multi-node (BF16):** `model-service.yaml`

```yaml
# Exposes the multi-node BF16 qwen3-coder deployment as a single
# OpenAI-compatible URL. Read the public address from status.address:
#   kubectl get ms qwen3-coder -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen3-coder
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen3-coder
```

  
**Single-node (FP8):** `model-deployment-fp8.yaml`

```yaml
# Qwen3-Coder-480B served FP8 on a single 8x H200 node with SGLang, validated
# end to end on EKS. The FP8 checkpoint (~480 GB) fits one node, so this is a
# single Standalone engine: no second node, no EFA, no ModelCache. The engine
# pulls the public FP8 repo straight to the node's local disk.
#
# SGLang-specific notes:
#   --ep-size 8 is required, not optional. Pure --tp-size 8 fails at FP8 weight
#     creation ("output_size ... not divisible by ... block_n = 128"): the
#     block-FP8 MoE doesn't shard evenly across 8 tensor-parallel ranks. Expert
#     parallelism shards whole experts and gets past it.
#   --tool-call-parser qwen3_coder is SGLang's parser name for this model
#     (vLLM's is qwen3_xml). The model is non-thinking, so no reasoning parser.
#   Image tag matters: lmsysorg/sglang v0.5.11-v0.5.13(.post1) -runtime images
#     are broken (ModuleNotFoundError: distro). v0.5.10.post1-runtime is the
#     most recent clean tag with Qwen3-Coder support.
#   --host 0.0.0.0 --port 8000: SGLang defaults to 127.0.0.1:30000, but
#     Modelplane's contract is 0.0.0.0:8000 with a /health probe. Args pass
#     through verbatim - Modelplane injects nothing for a non-vLLM engine.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-coder-sgl
  namespace: ml-team
spec:
  replicas: 1
  clusterSelector:
    matchLabels:
      modelplane.ai/region: us
  engines:
  - name: qwen3-coder-sgl
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 8
          selectors:
          - cel: |
              device.driver == "gpu.nvidia.com" && device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("120Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: lmsysorg/sglang:v0.5.10.post1-runtime
            command:
            - /bin/sh
            - -c
            - >-
              exec python3 -m sglang.launch_server
              --model-path Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8
              --served-model-name qwen3-coder
              --tp-size 8
              --ep-size 8
              --context-length 32768
              --page-size 32
              --trust-remote-code
              --tool-call-parser qwen3_coder
              --host 0.0.0.0
              --port 8000
```

  
**Single-node (FP8):** `model-service-fp8.yaml`

```yaml
# Exposes the single-node FP8 qwen3-coder-sgl deployment as a single
# OpenAI-compatible URL. Read the public address from status.address:
#   kubectl get ms qwen3-coder-sgl -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen3-coder-sgl
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen3-coder-sgl
```

  
---

# Cache Model Weights

Source: /models/model-cache/


**API:** [`modelplane.ai/v1alpha1` · ModelCache](/reference/modelcaches/)

A `ModelCache` stages a model's weights on shared workload-cluster storage,
fetched once from the configured source rather than downloaded again on every pod
start. `ModelDeployments` reference a cache via `spec.modelCacheRef.name`, and
Modelplane mounts it at `/mnt/models` in every serving pod, shared across the
pods of a multi-node engine. The engine reads weights locally from the mount.

`ModelCache` is recommended for multi-node deployments and optional for
single-node cold-start optimization.

## What to cache

The required `source` field names the kind of source, and the matching source
object is set alongside it. Setting `source: HuggingFace` selects `spec.huggingFace`, which
carries the `repo` to fetch, an optional `revision` (branch, tag, or commit), and
`sizeGiB`, how much storage the weights get on each cluster. Size it to the
model, since a value below the model's size leaves no room to stage the weights.
`HuggingFace` is the only source today.
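
Model sizes are usually quoted in decimal gigabytes while `sizeGiB` is in binary gibibytes, so a quick conversion helps when picking a value. The helper below is a sketch: the function name and the 10% default headroom are assumptions for illustration, not a Modelplane recommendation.

```python
import math

GIB = 2 ** 30  # bytes per GiB


def suggested_size_gib(weights_gb, headroom=0.10):
    """Smallest whole sizeGiB that holds the weights plus fractional headroom."""
    weights_gib = weights_gb * 1e9 / GIB
    return math.ceil(weights_gib * (1 + headroom))


# Roughly 960 GB of weights is about 894 GiB before any headroom is added.
print(suggested_size_gib(960))
```

The Qwen3-Coder example stages roughly 960 GB of BF16 weights with `sizeGiB: 1100`, which sits comfortably above this floor.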

The cache mounts at `/mnt/models` on every consuming pod, so the engine's args
reference that path (`--model=/mnt/models` for vLLM) rather than the source.

## Authenticating

A gated or private model needs a credential to fetch. When a cache stages the
weights, the credential lives on the cache: set `authSecret` to name a Secret in
the cache's namespace, and Modelplane propagates it to every cluster the cache
stages to, where hydration reads it.

Create the Secret once on the control plane, then reference it:

```bash
kubectl create secret generic hf-token \
  --namespace ml-team \
  --from-literal=HF_TOKEN=hf_xxxxxxxx
```

```yaml {nocopy=true}
spec:
  source: HuggingFace
  huggingFace:
    repo: Qwen/Qwen3-Coder-480B-A35B-Instruct
    authSecret:
      name: hf-token         # a Secret in this ModelCache's namespace
      key: HF_TOKEN          # defaults to HF_TOKEN
    sizeGiB: 1100
```

Without a cache, the engine fetches the model itself at startup, so the
credential goes on the `ModelDeployment` instead, as `HF_TOKEN` in the engine
container's `env`.

## Where to cache

An optional `clusterSelector` scopes where the cache is staged. Omitting it
stages the cache on every cluster in the fleet; setting `matchLabels` restricts
it to clusters carrying those labels. A `ModelDeployment` that references the cache
places *new* replicas only onto clusters within this footprint, so narrowing the
selector also narrows where replicas can land: a replica never schedules to a
cluster the cache didn't stage to. Replicas already running are left where they
are.
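
The narrowing effect is an intersection of two label selectors, sketched below. The function names and the plain-dict cluster representation are hypothetical; the sketch covers only where *new* replicas may land and ignores device matching and capacity.

```python
def labels_match(labels, match_labels):
    """True when every selector label is present with the same value."""
    return all(labels.get(k) == v for k, v in (match_labels or {}).items())


def placeable_clusters(clusters, deployment_selector=None, cache_selector=None):
    """Clusters where a new replica of a cache-backed deployment may land.

    clusters maps a cluster name to its labels. A missing selector matches
    every cluster, mirroring an omitted clusterSelector.
    """
    return [
        name
        for name, labels in clusters.items()
        if labels_match(labels, deployment_selector)
        and labels_match(labels, cache_selector)
    ]
```

With no cache selector, every cluster the deployment's own selector matches is a candidate; adding `modelplane.ai/tier: frontier` to the cache removes any cluster without that label from the candidates.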

## Loading from cache

A cache only pays off if the engine reads from it quickly. With its default
loader an engine can read a large model from shared storage slowly enough that
the cache makes cold starts *worse* than fetching the model directly, since you
pay to hydrate the cache and then wait on a slow read. Choose a fast loader with
your engine flags.

For vLLM on EKS, `--load-format=runai_streamer` reads from the EFS-backed cache
dramatically faster than the default loader (minutes rather than tens of
minutes for a large model), tuned further with `--model-loader-extra-config`:

```yaml {nocopy=true}
args:
- --model=/mnt/models
- --load-format=runai_streamer
- --model-loader-extra-config={"concurrency":16,"distributed":true}
```

The right loader and settings depend on the engine and the storage backend, so
treat these as a starting point and measure your own cold-start time. The
[Kimi-K2 example](/examples/kimi-k2/) uses this configuration end to
end.

## Storage prerequisites


The cache PVC needs a `ReadWriteMany` (RWX) StorageClass on the workload cluster.
What the platform admin must set up depends on the cluster source:


- **GKE** and **EKS:** auto-provisioned. Nothing for the admin to do.
- **Existing:** the admin sets up a `ReadWriteMany` StorageClass on the cluster.

Either way, your `ModelCache` and `ModelDeployment` specs are the same. How
storage is provided on each cluster source, and how to bring your own backend, is
covered in [Register a Cluster](/platform/inference-cluster/#cache-storage).

## Example


`model-cache.yaml`:

```yaml
# A ModelCache stages a model artifact on workload-cluster storage as a
# first-class resource. Modelplane composes a ReadWriteMany PVC on each matched
# cluster and hydrates it once from the configured source. A ModelDeployment
# references it via spec.modelCacheRef; the PVC mounts at /mnt/models read-write
# into every serving pod, so the engine reads weights locally instead of
# fetching them at boot.
apiVersion: modelplane.ai/v1alpha1
kind: ModelCache
metadata:
  name: qwen3-coder
  namespace: ml-team
spec:
  source: HuggingFace
  huggingFace:
    repo: Qwen/Qwen3-Coder-480B-A35B-Instruct
    # Gated repo, so a Hugging Face token is needed. Create this Secret once in
    # the ModelCache's namespace on the control plane; Modelplane propagates it
    # to each matched cluster.
    authSecret:
      name: hf-token
      key: HF_TOKEN
    sizeGiB: 1100
  # Optional: stage only on clusters matching these labels. Omit to stage on
  # every cluster in the fleet. Narrowing this also narrows where a referencing
  # ModelDeployment can place new replicas.
  # clusterSelector:
  #   matchLabels:
  #     modelplane.ai/tier: frontier
```

  
---

# Deploying a model

Source: /getting-started/deploying-a-model/


Now that the platform is provisioned, the ML team can declare what a model needs
with a `ModelDeployment`. Describe the hardware requirements, and the scheduler
places the model against the capacity the platform team published.

## Create a deployment

Create a namespace for the model:

```bash
kubectl create namespace ml-team
```

The device selector matches against the capacity declared in the
`InferenceClass`, not the pod's resource requests. Any L4 node satisfies
`>= 20Gi`, so this deployment runs on the cluster you just added:


`model-deployment.yaml` (the same manifest on EKS and GKE):

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen-demo
  namespace: ml-team
spec:
  replicas: 1
  engines:
  - name: qwen
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          # Any L4 satisfies >= 20Gi. The selector matches against the capacity
          # declared in the InferenceClass, not the pod's resource requests.
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            args:
            - --model=Qwen/Qwen2.5-0.5B-Instruct
            - --dtype=half
```

  
  
Wait until `REPLICAS` shows `1`:

```bash
kubectl get md -n ml-team --watch
```

To see which cluster the scheduler chose:

```bash
kubectl get modelreplica -n ml-team
```

```shell {nocopy=true}
NAME              CLUSTER       SYNCED   READY   COMPOSITION                   AGE
qwen-demo-7323a   eks-us-east   True     True    modelreplicas.modelplane.ai   12m
```

The ML team never named a cluster. The scheduler matched the GPU requirement
(`>= 20Gi`) against the `InferenceClass` the platform team published and made
the placement. 

## Expose the model

A `ModelService` selects `ModelEndpoints` by label and creates a Gateway API
`HTTPRoute` that routes to them. Modelplane creates one `ModelEndpoint` per
replica, labeled with the deployment name:


`model-service.yaml`:

```yaml
# A ModelService exposes one or more ModelDeployments via a single
# OpenAI-compatible endpoint. It composes a Gateway-API HTTPRoute on the
# control plane that load-balances across every ModelEndpoint matching
# its selector.
#
# Modelplane composes one ModelEndpoint per ModelReplica, labeled
# `modelplane.ai/deployment: <deployment-name>`. So a ModelService with
# that label selector reaches every replica of the named deployment.
#
# Once the service is ready, its public address is on status.address:
#   kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen-demo
```

  
The request path is `/<namespace>/<modelservice-name>/...` (`/ml-team/qwen/` in
this example), from the `ModelService` named `qwen`. The `model` field in the
request body is the Hugging Face id `Qwen/Qwen2.5-0.5B-Instruct`, since this
deployment doesn't set `--served-model-name`.

## Send a request

Read the endpoint's public address from the `ModelService` status:

```bash
ADDRESS=$(kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}')
```

Send a request to it:

```bash
kubectl run -i --rm curl-test \
  --image=curlimages/curl \
  --restart=Never \
  --env="ADDRESS=$ADDRESS" \
  -- sh -c 'curl -v "$ADDRESS/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"Qwen/Qwen2.5-0.5B-Instruct\",\"messages\":[{\"role\":\"user\",\"content\":\"What is Kubernetes in one sentence?\"}],\"max_tokens\":100}"'
```

The request routes to the replica on the cluster Modelplane placed it on.
You should get a response in a few seconds:

```json {nocopy=true}
{
  "id": "chatcmpl-c88b1429-067d-40a5-971c-ab9c54153c26",
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Kubernetes (K8s) is an open-source platform for automating the deployment, scaling, and management of containerized applications. It provides scalable orchestration capabilities that enable developers to deploy complex applications quickly and efficiently across various environments."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 37,
    "completion_tokens": 48,
    "total_tokens": 85
  }
}

```
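
The same request can be built from Python with only the standard library. This is a sketch: `chat_request` and `send` are hypothetical helper names, and it assumes the address read from `status.address` includes its `http://` or `https://` scheme and is reachable from where the script runs.

```python
import json
import urllib.request


def chat_request(address, model, prompt, max_tokens=100):
    """Build (but do not send) a chat completion request for a ModelService."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=address.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Pass the value of `$ADDRESS` as `address` and `Qwen/Qwen2.5-0.5B-Instruct` as `model`, then read the reply from `send(...)["choices"][0]["message"]["content"]`.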

## Next step

The platform team declared capacity, and in this guide the ML team deployed a
model behind a stable endpoint. Neither team needed to know what the other was doing. Modelplane matched them.

In the next step, the platform team grows the fleet. [Scale the platform](/getting-started/scale-the-platform/) to add more clusters across regions.


---

# Kimi-K2

Source: /examples/kimi-k2/


A 1T MoE (1 trillion parameters) served prefill/decode disaggregated across two
H200 nodes: two engines, one per phase, with Modelplane composing the llm-d
routing layer between them. This recipe serves an INT4 quantization of the
model; the native FP8 weights need four such nodes.

This recipe was run end to end; the `InferenceClass` and `ModelDeployment` are
the exact manifests from that run. Apply the platform side first, then the ML
side. The `InferenceCluster` carries an EC2 capacity reservation placeholder to
edit before applying.

## Platform


`inference-class.yaml`:

```yaml
# InferenceClass for the H200 shape, validated serving Kimi-K2 prefill/decode
# disaggregated on EKS. 8x NVIDIA H200 on an EKS p5en.48xlarge, with EFA.
#
# Both the GPU and the EFA fabric are claim: DRA devices. The two P/D engines
# each request 8 GPUs + 16 EFA interfaces, and the scheduler places one on each
# H200 node; NIXL ships KV cache between them over EFA.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: eks-h200-8x-p5en
spec:
  description: "EKS p5en.48xlarge, 8x NVIDIA H200, EFA"
  provisioning:
    provider: EKS
    eks:
      instanceType: p5en.48xlarge
      diskSizeGb: 1024
      accelerator:
        type: nvidia-h200
        count: 8
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 8
    attributes:
      architecture: { string: Hopper }
    capacity:
      memory: { value: "140Gi" }   # advertised below the ~141 GiB the driver reports
  - name: efa
    claim: DRA
    driver: dra.net
    deviceClassName: efa.networking.k8s.aws
    count: 16
```

  
`inference-cluster.yaml`:

```yaml
# An EKS InferenceCluster with a two-node H200 pool over EFA, validated serving
# Kimi-K2 as a prefill/decode pair (one engine per node). The H200 nodes come
# from an EC2 Capacity Block reserved for ML.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: eks-kimi
  labels:
    modelplane.ai/region: us
spec:
  cluster:
    source: EKS
    eks:
      region: us-east-2
  nodePools:
  - name: gpu-h200
    className: eks-h200-8x-p5en
    nodeCount: 2
    minNodeCount: 2
    maxNodeCount: 2
    zones:
    - us-east-2b
    fabric: EFA
    capacityBlock:
      capacityReservationId: cr-0123456789abcdef0  # replace with your reservation ID
```

  
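```bash
# CAPACITY_RESERVATION_ID is a placeholder variable: export your own EC2
# capacity reservation ID before running this.
curl -fsSL /examples/examples/kimi-k2/inference-cluster.yaml \
  | sed "s/cr-0123456789abcdef0/${CAPACITY_RESERVATION_ID}/" \
  | kubectl apply -f -
```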


## Deployment


`model-cache.yaml`

```yaml
# The shared, read-write-many cache both P/D engines serve from. Hydrated once
# per matched cluster; both phases mount the same RWX volume at /mnt/models.
#
# This validated run served an INT4 quantization of Kimi K2 rather than the
# native 1T-parameter FP8 model, which would need four 8x H200 nodes. The quant
# repo is public, so no authSecret is needed here.
apiVersion: modelplane.ai/v1alpha1
kind: ModelCache
metadata:
  name: kimi-k2
  namespace: ml-team
spec:
  source: HuggingFace
  huggingFace:
    repo: RedHatAI/Kimi-K2-Instruct-quantized.w4a16
    sizeGiB: 600
```

  
`model-deployment.yaml`

```yaml
# Kimi-K2 served prefill/decode disaggregated across two H200 nodes, validated
# end to end on EKS. serving.mode: PrefillDecode makes the two engines below one
# P/D pair: Modelplane composes the llm-d routing layer (InferencePool, endpoint
# picker, the NIXL pd-sidecar) between them. Each engine is a single-node 8-GPU
# Standalone pod; the scheduler places prefill on one node and decode on the
# other, and KV cache ships between them over EFA via NIXL.
#
# Notes on the engine flags (most P/D machinery is Modelplane's; the engine
# config has real sharp edges):
#   EP, not TP8. --tensor-parallel-size=1 --data-parallel-size=8
#     --enable-expert-parallel is vLLM's DeepSeek-V3 / Kimi single-node recipe
#     (Kimi-K2 is DeepSeek-V3 arch). It keeps each expert whole on a GPU and
#     dodges the Marlin %128 alignment trap that a plain TP8 layout would hit.
#   --load-format=runai_streamer cold-reads ~509 GiB per engine off the shared
#     RWX cache in ~6 minutes (vs ~45 with the default loader).
#   --tokenizer / --override-generation-config are workarounds for two bugs in
#     this specific quant repo (an off-by-2 in its bundled tokenizer, and a
#     wrong eos_token_id), not normal flags. The override pulls a gated repo at
#     startup, so HF_TOKEN is set on both engines.
#   The decode engine runs on :8001; the pd-sidecar owns :8000.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: kimi-k2
  namespace: ml-team
spec:
  replicas: 1
  modelCacheRef:
    name: kimi-k2
  serving:
    mode: PrefillDecode            # the two engines below are one P/D pair
  engines:
  - name: kimi-prefill
    phase: Prefill
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 8
          selectors:
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("130Gi")) >= 0
        - name: efa
          count: 16
          selectors:
          - cel: |
              device.driver == "dra.net"
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            command: ["vllm", "serve", "/mnt/models"]
            args:
            - --served-model-name=kimi-k2
            - --quantization=compressed-tensors
            - --tensor-parallel-size=1
            - --data-parallel-size=8
            - --enable-expert-parallel
            - --block-size=64
            - --max-model-len=131072
            - --trust-remote-code
            - --tool-call-parser=kimi_k2
            - --enable-auto-tool-choice
            - --load-format=runai_streamer
            - --model-loader-extra-config={"concurrency":16,"distributed":true}
            - --tokenizer=moonshotai/Kimi-K2-Instruct
            - --override-generation-config={"eos_token_id":163586}
            - --port=8000
            - --kv-transfer-config={"kv_connector":"NixlConnector","kv_role":"kv_producer"}
            env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef: { name: hf-token, key: HF_TOKEN }
  - name: kimi-decode
    phase: Decode
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 8
          selectors:
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("130Gi")) >= 0
        - name: efa
          count: 16
          selectors:
          - cel: |
              device.driver == "dra.net"
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            command: ["vllm", "serve", "/mnt/models"]
            args:
            - --served-model-name=kimi-k2
            - --quantization=compressed-tensors
            - --tensor-parallel-size=1
            - --data-parallel-size=8
            - --enable-expert-parallel
            - --block-size=64
            - --max-model-len=131072
            - --trust-remote-code
            - --tool-call-parser=kimi_k2
            - --enable-auto-tool-choice
            - --load-format=runai_streamer
            - --model-loader-extra-config={"concurrency":16,"distributed":true}
            - --tokenizer=moonshotai/Kimi-K2-Instruct
            - --override-generation-config={"eos_token_id":163586}
            - --port=8001                                      # pd-sidecar owns 8000
            - --kv-transfer-config={"kv_connector":"NixlConnector","kv_role":"kv_consumer"}
            env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef: { name: hf-token, key: HF_TOKEN }
```

  
`model-service.yaml`

```yaml
# Exposes the kimi-k2 deployment as a single OpenAI-compatible URL. Read the
# public address from status.address:
#   kubectl get ms kimi-k2 -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: kimi-k2
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: kimi-k2
```
  
---

# Register a Cluster

Source: /platform/inference-cluster/

**API:** [`modelplane.ai/v1alpha1` · InferenceCluster](/reference/inferenceclusters/)

An `InferenceCluster` represents a Kubernetes cluster configured for model
serving. Platform teams create these to provide GPU capacity.


Each cluster has:

- A **cluster source**: `GKE` or `EKS` (Modelplane provisions the full cluster)
  or `Existing` (bring a cluster you manage yourself). See
  [Supported Providers](/platform/providers/) for the clouds and
  neoclouds Modelplane runs on.
- One or more **node pools**, each referencing an `InferenceClass` for its
  hardware capabilities and provisioning recipe.
- **Labels** for organizational metadata: tier, region, provider. These are the
  matching surface for `ModelDeployment.clusterSelector`.

Modelplane installs the serving stack it needs on every cluster it manages,
including existing clusters, which it assumes are solely for its use.

## Ownership and requirements

Modelplane assumes exclusive ownership of every `InferenceCluster`. The fleet
scheduler's capacity accounting relies on Modelplane being the only thing placing
GPU workloads on the cluster, so dedicate each cluster to Modelplane rather than
sharing it with other workloads.

Modelplane also has opinions about how a cluster is set up: its Kubernetes
version, the components it installs, and required features like DRA for binding
GPUs to pods. On provisioned clusters Modelplane handles this for you. On an
existing cluster the platform team must meet the requirements.

## Provisioned and existing clusters

The `cluster.source` discriminator picks one of two models:

- **Provisioned (`GKE`, `EKS`).** Modelplane creates the cluster and its GPU node
  pools from each pool's `InferenceClass`, labels the pool's nodes so the
  scheduler's placement is enforced, and provisions the storage class for model
  weights. It also injects a non-GPU **system pool** with opinionated defaults to
  run the inference stack, so you only declare the GPU pools you want.
- **Existing (`Existing`).** A kubeconfig `Secret` provides access to a cluster
  you run yourself. Modelplane installs the serving stack it needs but doesn't
  provision infrastructure, and each pool's `InferenceClass` provides hardware
  capabilities for scheduling only. You're responsible for the cluster meeting
  Modelplane's requirements, including labeling each pool's nodes
  `modelplane.ai/pool=<pool-name>` (see
  [how scheduling pins placement](/architecture/scheduling/#pinning-placement-to-a-pool)).
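On an existing cluster, a quick way to confirm the pool labels are in place is to inspect the node list. The sketch below assumes its input is the parsed output of `kubectl get nodes -o json`. The node names in the sample and the helper names are hypothetical, and the check is an illustration rather than something Modelplane ships.

```python
import json

POOL_LABEL = "modelplane.ai/pool"


def nodes_by_pool(node_list: dict) -> dict[str, list[str]]:
    """Group node names by their pool label; unlabeled nodes are grouped under ""."""
    pools: dict[str, list[str]] = {}
    for node in node_list.get("items", []):
        meta = node.get("metadata", {})
        pool = (meta.get("labels") or {}).get(POOL_LABEL, "")
        pools.setdefault(pool, []).append(meta.get("name", "<unnamed>"))
    return pools


def pools_without_nodes(node_list: dict, expected_pools: list[str]) -> list[str]:
    """Return the expected pool names that no node is labeled for."""
    found = nodes_by_pool(node_list)
    return [pool for pool in expected_pools if not found.get(pool)]


# Hypothetical sample in the shape of `kubectl get nodes -o json`.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "gpu-node-a", "labels": {"modelplane.ai/pool": "gpu-h100"}}},
  {"metadata": {"name": "gpu-node-b", "labels": {}}}
]}
""")

print(nodes_by_pool(sample))
print(pools_without_nodes(sample, ["gpu-h100", "gpu-h200"]))
```

Nodes grouped under the empty key are the ones still missing the label.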

## Examples


    
`inference-cluster-gke.yaml`

```yaml
# An InferenceCluster backed by a GKE cluster.
#
# Modelplane provisions the full GKE cluster (VPC, subnet, system pool,
# GPU pools, service account, IAM bindings) and installs the inference
# stack (cert-manager, Envoy Gateway, Prometheus, LeaderWorkerSet,
# Gateway API).
#
# The system pool that hosts control-plane components is provisioned
# automatically and is not declared here. Only GPU pools - each
# referencing an InferenceClass that describes the hardware shape and
# how to provision it - need to be declared.
#
# Add labels to this InferenceCluster to control which deployments land on it
# via a ModelDeployment's clusterSelector.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: gke-us-central
  labels:
    modelplane.ai/region: us-central
spec:
  cluster:
    source: GKE
    gke:
      project: my-gcp-project  # Replace with your GCP project ID.
      region: us-central1

  nodePools:
  - name: gpu-l4
    className: gke-l4-1x-g2
    nodeCount: 1
    minNodeCount: 1
    maxNodeCount: 4
    zones:
    - us-central1-a
    - us-central1-c
```

  
`inference-cluster-eks.yaml`

```yaml
# An InferenceCluster backed by an EKS cluster.
#
# Modelplane provisions the full EKS cluster (VPC, subnets, internet
# gateway, IAM roles for the cluster and nodes, system + GPU node
# groups, vpc-cni / kube-proxy / coredns addons) and installs the
# inference stack (cert-manager, Traefik, Prometheus, KEDA,
# LeaderWorkerSet).
#
# The system node group that hosts control-plane components is
# provisioned automatically and is not declared here. Only GPU node
# groups - each referencing an InferenceClass that describes the
# hardware shape and how to provision it - need to be declared.
#
# Modelplane provisions EFS RWX storage for ModelCache on EKS: an
# Elastic-throughput file system, mount targets, the EFS CSI driver, and
# a 'modelplane-rwx-efs' StorageClass pinned to it. The admin does
# nothing, and provisioned EKS clusters take no StorageClass override.
#
# Delete this InferenceCluster with foreground cascading deletion for a
# clean teardown:
#
#   kubectl delete inferencecluster eks-us-west --cascade=foreground
#
# The inference stack runs on the EKS cluster and must uninstall while
# the cluster's API server and kubeconfig still exist - otherwise its
# Helm releases hang, and a load balancer one of them created can leak
# its security group and block the VPC from deleting. Foreground
# deletion holds the cluster until the stack is uninstalled. Background
# deletion (the kubectl default) tears everything down at once and can
# orphan cloud resources.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: eks-us-west
  labels:
    modelplane.ai/region: us-west
spec:
  cluster:
    source: EKS
    eks:
      region: us-west-2

  nodePools:
  - name: gpu-l4
    className: eks-l4-1x-g6
    nodeCount: 1
    minNodeCount: 1
    maxNodeCount: 4
    zones:
    - us-west-2a
    - us-west-2b
```

  
`inference-cluster-existing.yaml`

```yaml
# An InferenceCluster using an existing cluster you manage yourself.
#
# Provide a kubeconfig Secret so Modelplane can install the inference
# stack and deploy models. Each GPU pool references an InferenceClass
# that describes the hardware - used by the scheduler to know what
# capacity is available.
#
# The kubeconfig Secret must exist in the control plane cluster before
# creating this InferenceCluster.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: byo-us-east
  labels:
    modelplane.ai/region: us-east
spec:
  cluster:
    source: Existing
    existing:
      secretRef:
        name: byo-cluster-kubeconfig
        key: kubeconfig

      # Optional: a cloud identity Secret for pulling images from private
      # registries or accessing cloud APIs from the remote cluster.
      # identitySecretRef:
      #   name: byo-cluster-sa-key
      #   key: private_key

  # Each pool's nodes must be labeled modelplane.ai/pool=<name> (here
  # modelplane.ai/pool=gpu-h100). The scheduler pins a worker to its pool by
  # this label; Modelplane provisions and labels EKS/GKE pools itself, but on a
  # BYO cluster you label the nodes. Without it worker pods stay Pending.
  nodePools:
  - name: gpu-h100
    className: h100-8x-byo
    nodeCount: 2
```

  
## Cache storage

A [ModelCache](/models/model-cache/) stages model weights on a
`ReadWriteMany` (RWX) StorageClass on the workload cluster. Where that comes from
depends on the source:


- **`GKE`** (Filestore Enterprise) and **`EKS`** (EFS): auto-provisioned. The
  StorageClass is fixed on these sources, so there is nothing for the admin to do.
- **`Existing`**: bring your own. Create an RWX StorageClass on the cluster, with
  any backend that supports automatic PVC provisioning (WekaIO, NetApp Trident,
  `FSx` for NetApp, and similar), and name it in
  `cluster.existing.cache.storageClassName`.


The ML team's `ModelCache` and `ModelDeployment` specs are the same regardless of
which backing storage a cluster uses.


---

# FAQ

Source: /overview/faq/


Short answers to the questions that come up first, with links to the full
treatment. If you're new here, read the [Introduction](/overview/)
and [How Modelplane works](/overview/how-it-works/) first.

## What Modelplane is


**Is Modelplane a serving engine like vLLM?**

No, Modelplane is the control plane *above* the engine. It composes serving
engines like vLLM, SGLang, and NVIDIA TensorRT-LLM, and operates them across a
fleet of clusters. It doesn't serve tokens itself. You bring the engine;
Modelplane schedules it, routes to it, scales it, and caches its weights across
your inference fleet.
  

**Does Modelplane replace vLLM or SGLang?**

No, they run the model; Modelplane runs the fleet. A `ModelDeployment` carries
your engine container and its flags, and Modelplane composes it onto the right
cluster. Switching or upgrading engines is a change to your deployment, not to
Modelplane.
  

**How is Modelplane different from KServe or NVIDIA Dynamo?**

Scope. KServe and Dynamo are cluster orchestrators: they schedule, scale, route,
and cache within a single Kubernetes cluster. Modelplane runs those operations
across a fleet of clusters, clouds, and regions. Modelplane uses llm-d for
multi-node serving and KV-cache management, as do KServe and Dynamo. Deeper
integration with NVIDIA Dynamo is planned for future releases.
  

**How is Modelplane different from a managed provider like Baseten or Fireworks?**

Managed providers run fleet-scale serving inside their own closed platform.
Modelplane is the open equivalent that runs in infrastructure you own. The
difference is not scope but openness: Modelplane is open source, runs in your
own infrastructure, is community-driven, and is neutral across the stack. You
can still route to a managed provider from Modelplane.
  

## What it supports


**What models does Modelplane support?**

Modelplane supports any model, including open weights, custom models, and just
about anything that can be downloaded from Hugging Face, NVIDIA NGC, and other
registries.
  

**Does Modelplane support NVIDIA?**

Yes, across the stack. NVIDIA is the most widely available accelerator on the
clouds Modelplane runs on and the primary target today. Modelplane binds NVIDIA
GPUs to pods through Dynamic Resource Allocation (DRA), matching devices by
attributes such as GPU memory and architecture with CEL selectors.

The software stack rides on the engine-agnostic API. NVIDIA NIM microservices and
the TensorRT-LLM engine run as engine containers like any other, Modelplane stages
weights and NIM-style artifacts from NVIDIA NGC alongside Hugging Face and other
registries, and the inference stack it installs includes NVIDIA Dynamo and llm-d,
with deeper Dynamo integration on the roadmap.

  
**Which engines and accelerators are supported?**

The API is engine-agnostic: any engine that runs as a container works, and its
flags are yours to write. Multiple accelerators are supported as long as they
can be bound through DRA, and the device model (DRA plus CEL selectors) is built
to extend to other accelerators and fabrics.
  

**Which clouds or neoclouds does Modelplane support?**

Today Modelplane provisions clusters on a few hyperscalers and neoclouds, and
supports bringing your own Kubernetes cluster anywhere. More provisioners are on
the roadmap; the bring-your-own path means you can run on any Kubernetes now. See
[Supported Providers](/platform/providers/) for the full matrix of clouds,
neoclouds, and their Crossplane providers.
  

**Can I bring my own cluster, or run on a neocloud or on-premise?**

Yes, an `InferenceCluster` with `source: Existing` registers a cluster you
already run, through its kubeconfig. Modelplane installs the serving stack it
needs but doesn't provision the infrastructure. This is how you run on neoclouds
and on-premise today.
  

## What it requires


**Where does Modelplane run?**

Modelplane runs as a control plane on a control cluster: an ordinary Kubernetes
cluster with Crossplane installed, with no GPUs of its own. The inference clusters
it manages do the serving, and each needs Dynamic Resource Allocation (DRA,
Kubernetes v1.35+) to bind GPUs to pods. Modelplane assumes exclusive ownership of
every inference cluster, so dedicate each one to Modelplane rather than sharing it
with other workloads.
  

**Do I need Crossplane?**

Yes, Modelplane is built on Crossplane and requires it. If your platform team
already runs Crossplane to manage cloud infrastructure, Modelplane is the same
pattern applied to inference. Modelplane uses Crossplane's function framework and
shares its infrastructure providers.
  

## What it can do


**How does Modelplane decide where a model runs?**

Two-level matching. First it filters clusters by their labels (tier, region,
provider) against your `clusterSelector`. Then it filters node pools by matching
your device requests against each pool's `InferenceClass`. These are real DRA
requests with CEL selectors over GPU memory, architecture, and other attributes.
It places each replica on a cluster and pool that fits and has free capacity.
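The two filters can be pictured with a small sketch. This is an illustration only, not Modelplane's scheduler: it reduces the DRA and CEL device matching to a GPU count and a per-GPU memory threshold, the `Cluster` and `Pool` types and the `free_nodes` field are hypothetical, and the L4 memory figure in the sample fleet is an assumed value.

```python
from dataclasses import dataclass, field

_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Parse a binary-suffixed quantity such as "130Gi" into bytes (a small subset)."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


@dataclass
class Pool:
    name: str
    gpu_count: int
    gpu_memory: str
    free_nodes: int  # hypothetical stand-in for capacity accounting


@dataclass
class Cluster:
    name: str
    labels: dict[str, str]
    pools: list[Pool] = field(default_factory=list)


def match(clusters: list[Cluster], selector: dict[str, str],
          gpus: int, min_memory: str) -> list[tuple[str, str]]:
    """Return (cluster, pool) name pairs that pass both levels of filtering."""
    need = parse_quantity(min_memory)
    placements = []
    for cluster in clusters:
        # Level 1: every selector label must match the cluster's labels.
        if any(cluster.labels.get(k) != v for k, v in selector.items()):
            continue
        # Level 2: the pool must satisfy the device request and have room.
        for pool in cluster.pools:
            if (pool.gpu_count >= gpus
                    and parse_quantity(pool.gpu_memory) >= need
                    and pool.free_nodes > 0):
                placements.append((cluster.name, pool.name))
    return placements


fleet = [
    Cluster("eks-kimi", {"modelplane.ai/region": "us"},
            [Pool("gpu-h200", 8, "140Gi", free_nodes=2)]),
    Cluster("gke-us-central", {"modelplane.ai/region": "us-central"},
            [Pool("gpu-l4", 1, "24Gi", free_nodes=1)]),  # assumed L4 memory
]

print(match(fleet, {"modelplane.ai/region": "us"}, gpus=8, min_memory="130Gi"))
print(match(fleet, {}, gpus=1, min_memory="16Gi"))
```

An empty selector skips the first level, so only the device requirements narrow the result.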
  

**Can I serve across regions and clusters behind one endpoint?**

Yes, that's the point. A `ModelService` exposes one OpenAI-compatible endpoint
and load-balances across every replica of a deployment, wherever they run.
  

**Can I route to a managed provider?**

Yes, a `ModelService` can include a manually created `ModelEndpoint` that points
at an external SaaS endpoint like Together or Baseten alongside your self-hosted
replicas, and load-balances across all of them.
  

**How do large or multi-node models work?**

An engine can be a gang: a leader and one or more workers that Modelplane composes
into a LeaderWorkerSet across nodes. You write the coordination (like Ray or
vLLM's data-parallel coordinator) in the engine flags, and Modelplane injects
the leader's address so the workers can join it. Multi-node deployments stage
weights through a `ModelCache`.
  

**What about disaggregated prefill/decode?**

Set `serving.mode: PrefillDecode` and define separate prefill and decode engines.
Both run on the same cluster and hand off the KV cache over a fast fabric, and
Modelplane configures the cluster-edge routing that pairs each request. The
KV-transfer flags live in your engine config.
  

**How does scaling work?**

Replicas are the only scaling axis. Each replica is a complete serving instance;
scaling `spec.replicas` adds or removes whole instances across the fleet. Because
a `ModelDeployment` exposes the Kubernetes scale subresource, `kubectl scale` and
KEDA work without anything extra. There's no per-pod autoscaling inside a cluster.
  

**How are model weights handled?**

A `ModelCache` stages weights once per cluster on shared (ReadWriteMany) storage,
and every pod reads them locally. Pods don't re-download on each start, and
concurrent starts don't race. It hydrates from Hugging Face today, is optional for
single-node deployments, and is recommended for multi-node ones.
  

## The project


**Why did you pick Modelplane as a name for the project?**

It's a fusion of AI Model and Control Plane. We also like that it implies that AI
models are their own layer (or plane) in the stack.
  

**What does the logo signify?**

Three popsicle sticks assembled to make a model plane. Balsa wood planes were the
inspiration.
  

**Is Modelplane production-ready?**

Not yet. Modelplane is in early development and moving fast, so treat it as early
software. The platform docs are specific about what's available today versus
what's planned. We are building it in the open.
  

**What's the license and governance?**

Modelplane is licensed under Apache 2.0, with no usage caps or token metering,
and is developed in the open. It's neutral across models, engines, accelerators,
and clouds, and is intended for donation to a neutral open source foundation.
It's a project from Upbound, the team behind Rook and Crossplane, both CNCF
Graduated and widely adopted projects.
  

**How do I get involved?**

Issues, discussions, and contributions are welcome on GitHub. See
`CONTRIBUTING.md` for development setup and the project's conventions.
  

## Next steps


- **Get started**: Deploy Modelplane and serve your first model.
- **How Modelplane works**: The architecture and the control loop, in one page.


---

# Glossary

Source: /overview/glossary/


## Modelplane

The open source control plane software. You install Modelplane on a Kubernetes
cluster (the **control cluster**). Modelplane never serves tokens itself; it
orchestrates the clusters and engines that do.

## Control cluster

The Kubernetes cluster where Modelplane runs. It needs no GPUs. It holds
Modelplane's Crossplane-based components and the API resources you apply to
declare your fleet.

## Inference cluster

A GPU cluster in the fleet where serving engines run and tokens are produced.
Modelplane can provision inference clusters on EKS, GKE, and other providers, or
you can bring your own through an `InferenceCluster` with `source: Existing`.

## Fleet

All inference clusters managed by a single Modelplane control cluster.

## Platform

The inference infrastructure the platform team
provisions using `InferenceGateway`, `InferenceClass`, and `InferenceCluster`
resources. This is distinct from Modelplane itself, which runs on the control
cluster above the fleet.

## Platform team

The infrastructure team responsible for GPU capacity. They create
`InferenceCluster`, `InferenceClass`, and `InferenceGateway` resources,
provisioning the fleet that ML teams deploy against.


## ML team


The development team deploying models. They create `ModelDeployment`,
`ModelService`, and `ModelCache` resources, declaring what a model needs without
knowing which cluster it runs on.


---

# AI tools

Source: /overview/ai-tools/


The Modelplane docs are built to be read by AI assistants as well as people. You
can connect a coding agent directly to this site, pull any page as Markdown, or
point a model at a single index file that lists the whole documentation set.
Every page also carries a **Copy page** menu next to its title with the same
shortcuts.

## Connect to the MCP server

The documentation MCP server lets an assistant search these docs and read any
page in real time, so its answers track the current content instead of its
training data. It exposes two tools:

- `search_modelplane_docs`: search the docs and get back the most relevant sections with their titles, URLs, and snippets.
- `get_modelplane_doc`: fetch the full Markdown of a single page.

The server URL is:

```plaintext
https://docs.modelplane.ai/mcp
```


**Claude Code**

```bash
claude mcp add --transport http modelplane-docs https://docs.modelplane.ai/mcp
```

**Claude Desktop**

Open Settings, go to Connectors, and choose **Add custom connector**. Name it `modelplane-docs`, enter the server URL above, and enable the connector when you start a conversation.

**Cursor**

Open the command palette, run **Cursor Settings: MCP**, and add a server to `mcp.json`:

```json
{
  "mcpServers": {
    "modelplane-docs": {
      "url": "https://docs.modelplane.ai/mcp"
    }
  }
}
```

**VS Code**

Create `.vscode/mcp.json` in your workspace:

```json
{
  "servers": {
    "modelplane-docs": {
      "type": "http",
      "url": "https://docs.modelplane.ai/mcp"
    }
  }
}
```

**Other**

Any MCP client that speaks the streamable HTTP transport can connect to the server URL directly. No authentication is required.


The **Copy page** menu on every page also has **Connect to Cursor** and **Connect to VS Code** shortcuts that install the server in one click.

## Read pages as Markdown

Every page is also published as raw Markdown. Add `index.md` to any page URL:

```plaintext
https://docs.modelplane.ai/models/model-deployment/index.md
```

The **Copy page** control next to each title copies that Markdown to your clipboard, and **View as Markdown** opens it in the browser. Paste it into any assistant when you want to ground a question in a specific page.
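When scripting against the docs, the Markdown URL rule is easy to apply programmatically. A minimal sketch, assuming page URLs end in a directory-style path as in the example above; the helper name is illustrative and it only builds the URL, it doesn't fetch anything:

```python
from urllib.parse import urlsplit, urlunsplit


def markdown_url(page_url: str) -> str:
    """Return the raw-Markdown URL for a docs page by appending index.md."""
    parts = urlsplit(page_url)
    # Normalize the path to end in "/" so index.md lands inside the page directory.
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    # Query strings and fragments are dropped; they don't apply to the raw file.
    return urlunsplit((parts.scheme, parts.netloc, path + "index.md", "", ""))


print(markdown_url("https://docs.modelplane.ai/models/model-deployment/"))
```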

## llms.txt

For tools that index a whole site, the docs publish the [`llms.txt`](https://llmstxt.org) format:

- [`llms.txt`](/llms.txt): a short index of every page with links and descriptions.
- [`llms-full.txt`](/llms-full.txt): every page concatenated into one Markdown file.

## Page menu reference

The **Copy page** menu next to each title has these actions:

| Action | What it does |
| --- | --- |
| Copy page | Copies the page as Markdown to your clipboard. |
| View as Markdown | Opens the page as raw Markdown. |
| Copy MCP Server | Copies the MCP server URL to your clipboard. |
| Connect to Cursor | Installs the MCP server in Cursor. |
| Connect to VS Code | Installs the MCP server in VS Code. |
  
---

# Architecture

Source: /architecture/


Modelplane's central design choice is to build the control plane on
[Crossplane](https://crossplane.io) rather than as a bespoke set of Kubernetes
controllers. Everything else here follows from that. This section assumes you're
comfortable with Kubernetes; the rest of the Crossplane vocabulary you need is
below.

## Crossplane in brief

[Crossplane](https://crossplane.io) extends Kubernetes to manage things beyond
the cluster (cloud infrastructure, SaaS, and in Modelplane's case inference
fleets) through the same declarative, reconciled API model. Three of its concepts
matter here:

- **Composite Resources (XRs)** are custom resources whose controller, instead of
  talking to an external API directly, declares a set of other resources that
  should exist. Every Modelplane API, including `InferenceCluster`,
  `ModelDeployment`, and `ModelService`, is an XR.
- **Composition functions** are that controller logic. A function is a small gRPC
  service handed the observed XR and the resources it depends on, which returns
  the desired child resources. An XR runs a pipeline of one or more functions
  every reconcile; in Modelplane each is typically a single function, so the rest
  of this section says "the function" for short.
- **Providers** are controllers that manage external systems through their own
  managed resources: `provider-gcp` and `provider-aws` for cloud APIs,
  `provider-helm` for Helm releases, `provider-kubernetes` for arbitrary objects
  on any cluster. A composition function composes these like any other resource.

Put together: a Modelplane API is an XR, its logic is a composition function, and
the function composes a mix of plain Kubernetes objects, other Modelplane XRs, and
provider resources.

The resource model mirrors Kubernetes core, one scope up:
`ModelDeployment` → `ModelReplica` → `ModelService` → `ModelEndpoint` parallels
`Deployment` → `Pod` → `Service` → `Endpoint`, but across a fleet of clusters
rather than within one. A `ModelDeployment` composes a `ModelReplica` per replica,
a `ModelReplica` composes the serving workload on its target cluster, and a
`ModelService` routes across the `ModelEndpoint`s. If you know how those core
objects relate, you already know the shape of Modelplane's.

## Why Crossplane?

Modelplane is, at its core, a system that turns declarative resources into
composed infrastructure spanning cloud accounts, many Kubernetes clusters, and
the workloads on them. That's the problem Crossplane solves, and it helps in two
ways: providers and functions.

**Providers** give us reach. Modelplane has to provision Kubernetes clusters and
all the infrastructure they need across different clouds, then install software
onto them. That's an enormous surface, and providers cover it without us rolling
our own controllers for each cloud API and Helm release.

**Functions** are where Modelplane's own logic lives, and writing it as
composition functions buys several things:

- **Business logic, not controller plumbing.** A function computes desired state
  from observed state. Crossplane handles the fiddly Kubernetes controller
  details (watches, requeues, finalizers, and drift correction) that a
  hand-written controller gets wrong in a dozen subtle ways. Less plumbing to
  write and maintain means we move faster.
- **Testability.** A function is a pure function of its inputs, so you can test
  it as a black box: feed it an XR and its dependencies, assert on the resources
  it returns. The whole test runs in process, with no API server to stand up.
- **The right language for each job.** Functions can be written in any language.
  Modelplane's are Python, for fast iteration on the serving and scheduling logic
  and because Python is the common language of the ML world, which lowers the bar
  for contributors. The performance-sensitive distributed-systems core stays in
  Go, where Crossplane and its providers already are.
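The testability point is easiest to see with a toy function. The sketch below is not Modelplane's code and does not use the Crossplane function SDK: it models resources as plain dictionaries, and the `ModelReplica` shape, its API group, and the `<name>-<index>` naming are assumptions made for illustration.

```python
def compose_replicas(deployment: dict) -> list[dict]:
    """Return one desired ModelReplica per requested replica. Pure: no I/O, no state."""
    name = deployment["metadata"]["name"]
    namespace = deployment["metadata"]["namespace"]
    count = deployment["spec"].get("replicas", 1)
    return [
        {
            # Hypothetical shape; the real ModelReplica schema is not shown here.
            "apiVersion": "modelplane.ai/v1alpha1",
            "kind": "ModelReplica",
            "metadata": {"name": f"{name}-{index}", "namespace": namespace},
        }
        for index in range(count)
    ]


# A black-box check: feed in an observed resource, assert on what comes back.
observed = {
    "metadata": {"name": "qwen-demo", "namespace": "ml-team"},
    "spec": {"replicas": 2},
}
desired = compose_replicas(observed)
print([resource["metadata"]["name"] for resource in desired])
```

Because the function only maps input to output, the same input always yields the same desired resources, and the check needs no API server.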

The bet underneath both is that inference infrastructure is the same shape of
problem as cloud infrastructure, which Crossplane manages well. Building on it
lets Modelplane spend its effort on the part that's actually inference-specific.

## The control cluster and the fleet

Modelplane runs on a **control cluster** and manages a fleet of **workload
clusters**, the `InferenceCluster`s. The split is deliberate: the control plane
holds no GPUs and serves no tokens. It schedules, composes, and routes; the
workload clusters do the serving.

The control cluster runs Crossplane, the Modelplane composition functions (one
per resource, each a pod Crossplane calls per reconcile), the providers, and the
control-plane gateway. It also holds every Modelplane resource and the
`ProviderConfig`s that let the providers reach each workload cluster, built from
that cluster's kubeconfig.

Crossplane core drives everything. On each reconcile, it asks a function what a
resource should compose and gets back the desired resources. Core then reconciles
them, applying the provider resources that the providers act on. A function only
computes desired state. It never reaches a provider or a cluster itself.

```mermaid
flowchart TB
    subgraph control["Control cluster"]
        cp["Crossplane core"]
        fns["Modelplane functions\n(one pod per resource)"]
        prov["Providers\ngcp · aws · helm · kubernetes"]
        gw["Control-plane gateway"]
    end
    subgraph fleet["Fleet"]
        wc1["Workload cluster A"]
        wc2["Workload cluster B"]
    end
    cp <-->|"desired state (gRPC)"| fns
    cp -->|composes| prov
    cp -->|composes| gw
    prov -->|provision + install via kubeconfig| wc1
    prov -->|provision + install via kubeconfig| wc2
```

Modelplane installs a serving stack on each workload cluster: the components a
cluster needs to serve models, providing inference-aware routing through Gateway
API, multi-node serving, GPU binding through DRA, and observability, among others.
The exact components evolve, but Modelplane composes and owns all of them. For
provisioned clusters the providers also create the cluster and its node pools
first.

## How a deployment is composed

A resource composes others, which compose others, until the tree bottoms out in
provider resources and plain Kubernetes objects. A `ModelDeployment` is the
clearest example. Its function schedules the replicas, then composes a
`ModelReplica` for each, and a `ModelEndpoint` for each replica that's ready to
serve. Each `ModelReplica` function composes the serving workload, a Deployment or
a LeaderWorkerSet, onto its target workload cluster through provider-kubernetes.

```mermaid
flowchart TD
    md["ModelDeployment"]
    mr1["ModelReplica\n(cluster A)"]
    mr2["ModelReplica\n(cluster B)"]
    me1["ModelEndpoint\n(cluster A)"]
    me2["ModelEndpoint\n(cluster B)"]
    wl1["Deployment / LeaderWorkerSet\non workload cluster A"]
    wl2["Deployment / LeaderWorkerSet\non workload cluster B"]

    md --> mr1
    md --> mr2
    md --> me1
    md --> me2
    mr1 --> wl1
    mr2 --> wl2
```
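
The fan-out in the diagram can be sketched in a few lines. This is an illustration under stated assumptions, not the actual function: `Placement` and `compose_deployment` are hypothetical names, and scheduling is assumed to have already picked a cluster for each replica.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    cluster: str  # workload cluster chosen by scheduling (assumed input)
    ready: bool   # whether the replica on that cluster is ready to serve


def compose_deployment(placements: list[Placement]) -> list[tuple[str, str]]:
    """Return (kind, cluster) pairs a ModelDeployment would compose."""
    composed: list[tuple[str, str]] = []
    for placement in placements:
        # Every scheduled replica gets a ModelReplica.
        composed.append(("ModelReplica", placement.cluster))
        # Only replicas that are ready to serve get a ModelEndpoint.
        if placement.ready:
            composed.append(("ModelEndpoint", placement.cluster))
    return composed


tree = compose_deployment(
    [Placement("cluster-a", ready=True), Placement("cluster-b", ready=False)]
)
for kind, cluster in tree:
    print(f"{kind} on {cluster}")
```

With one replica ready and one still starting, the sketch composes two `ModelReplica`s but only one `ModelEndpoint`, which is why a service never routes to a replica that can't serve yet.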

The platform resources compose the same way. An `InferenceCluster` composes a
`GKECluster` or `EKSCluster` (the cloud infrastructure, via the cloud providers)
and a `ServingStack` (the per-cluster software install, via provider-helm and
provider-kubernetes). Engines bind GPUs through DRA: each `claim: DRA` device in a
member's `nodeSelector` becomes a request in the `ResourceClaim` the serving pods
claim through.

## The request path

A served request crosses two gateways, both built on Gateway API. The
**control-plane gateway** is the front door: a `ModelService` composes an
`HTTPRoute` on it that matches the service's path prefix and forwards to the
matched `ModelEndpoint`s, each of which is a `Service` pointing at a workload
cluster's gateway address. The **workload-cluster gateway** then routes from the
cluster edge to the engine pods.

```mermaid
flowchart LR
    client["Client"]
    cpgw["Control-plane gateway"]
    wcgw["Workload-cluster gateway"]
    engine["Engine pods\n(vLLM, SGLang, ...)"]

    client -->|service path| cpgw
    cpgw -->|per-replica path| wcgw
    wcgw -->|engine path| engine
```

Each hop rewrites the path: the control plane rewrites the public prefix to the
replica's path, and the workload gateway strips that down to what the engine
serves. This per-backend path rewriting is the main thing the control-plane
gateway has to support, and it narrows which Gateway API implementations can fill
the role.
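
A minimal sketch of the two rewrites, assuming the public prefix is `/<namespace>/<service>` and inventing a per-replica path (`/replicas/qwen-demo-7323a`) purely for illustration. The real per-replica path format is internal and not documented here, and `rewrite_prefix` is a hypothetical helper, not a Gateway API call.

```python
def rewrite_prefix(path: str, old_prefix: str, new_prefix: str) -> str:
    """Swap old_prefix for new_prefix, matching whole path segments only."""
    old = old_prefix.rstrip("/")
    if path != old and not path.startswith(old + "/"):
        raise ValueError(f"{path!r} is not under {old_prefix!r}")
    rest = path[len(old):]
    # An empty result means the request targeted the prefix itself.
    return (new_prefix.rstrip("/") + rest) or "/"


public = "/ml-team/qwen/v1/chat/completions"

# Hop 1, control-plane gateway: public prefix -> per-replica path (hypothetical).
at_workload = rewrite_prefix(public, "/ml-team/qwen", "/replicas/qwen-demo-7323a")

# Hop 2, workload-cluster gateway: strip the replica path for the engine.
at_engine = rewrite_prefix(at_workload, "/replicas/qwen-demo-7323a", "/")

print(at_workload)
print(at_engine)
```

Matching on whole segments matters: a naive string prefix check would let `/ml-team/qwen-other/...` match the `/ml-team/qwen` service.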

Which gateway sits at each layer is internal, not part of the API. The
[`InferenceGateway`](/platform/inference-gateway/) `backend` field
is an enum precisely so the control-plane gateway can grow other options over
time. Target the `ModelService` URL rather than either gateway directly.


---

# Llama-3.1-8B

Source: /examples/llama-3.1-8b/


An 8B dense chat model on a single NVIDIA L4. The entry recipe: one `Standalone`
engine, no cache, public weights from a Hugging Face mirror. It carries no
`clusterSelector`, so device capacity alone matches it to any compatible L4 in
the fleet.

This recipe was run end to end on GKE; the `InferenceClass`, `InferenceCluster`,
and `ModelDeployment` are the exact manifests from that run. The EKS platform
shape is the standard single-L4 recipe. It passes server validation but was not
served in this run. Apply the platform side first, then the ML side. The GKE
`InferenceCluster` carries a GCP project placeholder to edit before applying.

## Platform


`inference-class-eks.yaml`:

```yaml
# InferenceClass for the L4 shape on EKS, validated serving Llama-3.1-8B.
#
# One NVIDIA L4 on a g6.2xlarge. The GPU is declared as a DRA device: the
# scheduler matches a ModelDeployment's nodeSelector against this capacity, then
# DRA binds the physical GPU to the serving pod.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: eks-l4-1x-g6
spec:
  description: "EKS g6.2xlarge, 1x NVIDIA L4"
  provisioning:
    provider: EKS
    eks:
      instanceType: g6.2xlarge
      diskSizeGb: 100
      accelerator:
        type: nvidia-l4
        count: 1
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 1
    attributes:
      architecture: { string: Ada Lovelace }
    capacity:
      # The L4's real usable VRAM as the NVIDIA DRA driver reports it, not the
      # nominal 24GB.
      memory: { value: "23034Mi" }
```

  
`inference-cluster-eks.yaml`:

```yaml
# EKS InferenceCluster with one L4 node pool. No clusterSelector targets it; the
# ModelDeployment matches on device capacity alone, so it lands here or on any
# other compatible cluster in the fleet.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: eks-l4-single
  labels:
    modelplane.ai/cloud: eks
    modelplane.ai/region: us-west
spec:
  cluster:
    source: EKS
    eks:
      region: us-west-2
  nodePools:
  - name: gpu-l4
    className: eks-l4-1x-g6
    nodeCount: 1
    zones:
    - us-west-2a
    minNodeCount: 0
    maxNodeCount: 4
```

  
`inference-class-gke.yaml`:

```yaml
# InferenceClass for the L4 shape on GKE, validated serving Llama-3.1-8B.
#
# One NVIDIA L4 on a g2-standard-8. The GPU is declared as a DRA device: the
# scheduler matches a ModelDeployment's nodeSelector against this capacity, then
# DRA binds the physical GPU to the serving pod.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: gke-l4-1x-g2
spec:
  description: "GKE g2-standard-8, 1x NVIDIA L4"
  provisioning:
    provider: GKE
    gke:
      machineType: g2-standard-8
      diskSizeGb: 100
      accelerator:
        type: nvidia-l4
        count: 1
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 1
    attributes:
      architecture: { string: Ada Lovelace }
    capacity:
      # The L4's real usable VRAM as the NVIDIA DRA driver reports it, not the
      # nominal 24GB.
      memory: { value: "23034Mi" }
```

  
`inference-cluster-gke.yaml`:

```yaml
# GKE InferenceCluster with one L4 node pool. Replace the project ID before
# applying. No clusterSelector targets it; the ModelDeployment matches on device
# capacity alone, so it lands here or on any other compatible cluster.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: gke-l4-single
  labels:
    modelplane.ai/cloud: gke
    modelplane.ai/region: us-central
spec:
  cluster:
    source: GKE
    gke:
      project: my-gcp-project  # Replace with your GCP project ID.
      region: us-central1
  nodePools:
  - name: gpu-l4
    className: gke-l4-1x-g2
    nodeCount: 1
    zones:
    - us-central1-a
    minNodeCount: 0
    maxNodeCount: 4
```

  
```bash
# Replace YOUR_PROJECT_ID with your GCP project ID.
curl -fsSL /examples/examples/llama-3.1-8b/inference-cluster-gke.yaml \
  | sed 's/my-gcp-project/YOUR_PROJECT_ID/' \
  | kubectl apply -f -
```


## Deployment


`model-deployment.yaml`:

```yaml
# Llama-3.1-8B Instruct served on a single NVIDIA L4 by vLLM, validated end to
# end on GKE (the model layer is cloud-agnostic; the same manifest serves on EKS).
#
# 8B in bf16 is ~16Gi of weights, leaving room for the KV cache on the L4's
# ~23Gi. Llama's default context is 128K, whose KV cache does not fit beside the
# weights, so --max-model-len caps it at 8192 - raise it only as far as the
# leftover VRAM allows.
#
# Weights come from the public NousResearch mirror, so no Hugging Face token is
# needed. The gated meta-llama/Llama-3.1-8B-Instruct original needs an hf-token
# Secret on the *workload* cluster (the engine pod reads it, not the control
# plane) and HF_TOKEN passed on the engine container.
#
# No --port or --host: Modelplane's routing expects the engine on its default
# :8000 with a /health probe, and passes args through verbatim.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: llama-3-1-8b
  namespace: ml-team
spec:
  # One replica, matched to any compatible InferenceCluster by device capacity.
  replicas: 1
  engines:
  - name: llama
    members:
    # A single self-contained vLLM pod. The container named "engine" is the
    # inference server; its image and args pass through verbatim.
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          # An 8B model needs most of an L4. >=20Gi selects the L4 (which
          # reports ~23Gi) without over-constraining. DRA evaluates this CEL
          # against the InferenceClass device, then against the GPU's
          # ResourceSlice when it binds the claim.
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.7.3
            args:
            - "--model=NousResearch/Meta-Llama-3.1-8B-Instruct"
            # The id clients pass as "model" in OpenAI requests.
            - "--served-model-name=llama-3.1-8b"
            # Cap the context so the KV cache fits beside the weights on the L4.
            - "--max-model-len=8192"
```
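
The memory reasoning behind `--max-model-len=8192` can be checked with back-of-the-envelope arithmetic. The sketch below assumes the architecture values from the published Llama-3.1-8B config (32 layers, 8 KV heads, head dimension 128) and roughly 8.03 billion parameters. None of these numbers come from this page, and the estimate ignores activations and engine overhead.

```python
GIB = 1024 ** 3

# Assumed from the published Llama-3.1-8B config; not stated in this document.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BYTES_PER_VALUE = 2   # bf16
PARAMS = 8.03e9       # approximate parameter count


def kv_cache_bytes(tokens: int) -> int:
    """Bytes of KV cache for `tokens` tokens of context (keys plus values)."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * tokens


weights_gib = PARAMS * BYTES_PER_VALUE / GIB
print(f"weights:       {weights_gib:.1f} GiB")                    # 15.0 GiB
print(f"KV @ 128K ctx: {kv_cache_bytes(131072) / GIB:.1f} GiB")   # 16.0 GiB
print(f"KV @ 8192 ctx: {kv_cache_bytes(8192) / GIB:.1f} GiB")     # 1.0 GiB
```

Under these assumptions, weights plus a full 128K-token cache need about 31 GiB, well past the L4's reported `23034Mi` (about 22.5 GiB). Capping the context at 8192 brings a single full-length sequence down to about 1 GiB of cache, leaving the remaining VRAM for concurrent sequences.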

  
`model-service.yaml`:

```yaml
# Exposes the llama-3-1-8b deployment's endpoints as a single OpenAI-compatible
# URL. Modelplane labels each composed ModelEndpoint with the deployment name, so
# this selector reaches every replica. Read the public address from
# status.address:
#   kubectl get ms llama-3-1-8b -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: llama-3-1-8b
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: llama-3-1-8b
```
  
---

# Route to External Providers

Source: /models/model-endpoint/

**API:** [`modelplane.ai/v1alpha1` · ModelEndpoint](/reference/modelendpoints/)

A `ModelEndpoint` is a single reachable inference endpoint that a
[`ModelService`](/models/model-service/) can route to. Modelplane creates
one for each of your replicas automatically, but you can also create one by hand
to point at an inference endpoint Modelplane doesn't run, most often a SaaS
provider like Together or Baseten. A service treats both the same, so you can
front your own replicas and an external provider behind one URL: send overflow to
the provider when your fleet is busy, or fail over to it as a break-glass option.

## Routing to an external provider

Create a `ModelEndpoint` with three things:

```yaml {nocopy=true}
apiVersion: modelplane.ai/v1alpha1
kind: ModelEndpoint
metadata:
  name: kimi-k2-together
  namespace: ml-team
  labels:
    # 1. A label of your own for a ModelService to select on. Any label
    #    works; modelplane.ai/external-provider is a readable convention.
    modelplane.ai/external-provider: together
spec:
  # 2. The provider's base URL.
  url: https://api.together.xyz/
  # 3. The path to rewrite requests to. A ModelService receives requests at
  #    /<namespace>/<service>/v1/... and rewrites them to this prefix, so an
  #    OpenAI-compatible provider that serves /v1/... takes /v1/.
  rewritePath: /v1/
```

Then point a [`ModelService`](/models/model-service/) at it. Selecting
`modelplane.ai/external-provider: together` routes to the provider; adding a
second entry for a deployment fronts both behind one URL, so traffic can spill
over to the provider alongside your own replicas:

```yaml {nocopy=true}
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: kimi-k2
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: kimi-k2          # your own replicas
  - selector:
      matchLabels:
        modelplane.ai/external-provider: together  # the endpoint above
```

The provider must speak the OpenAI API, since that's the contract a
`ModelService` exposes. Anything OpenAI-compatible works; `url` and `rewritePath`
are all that change between providers.
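
To see how `url` and `rewritePath` combine, here is a sketch of the upstream URL a request would resolve to. It assumes the service prefix is `/<namespace>/<service>/v1/`, as the manifest comment above describes. `upstream_url` is a hypothetical helper for illustration, not part of Modelplane.

```python
from urllib.parse import urljoin


def upstream_url(request_path: str, namespace: str, service: str,
                 url: str, rewrite_path: str) -> str:
    """Map a ModelService request path onto an external provider URL."""
    service_prefix = f"/{namespace}/{service}/v1/"
    if not request_path.startswith(service_prefix):
        raise ValueError(f"{request_path!r} is not under {service_prefix!r}")
    suffix = request_path[len(service_prefix):]
    # Normalize so "/v1" and "/v1/" behave the same.
    rewritten = rewrite_path.rstrip("/") + "/" + suffix
    return urljoin(url, rewritten)


print(upstream_url(
    "/ml-team/kimi-k2/v1/chat/completions",
    namespace="ml-team",
    service="kimi-k2",
    url="https://api.together.xyz/",
    rewrite_path="/v1/",
))
# https://api.together.xyz/v1/chat/completions
```

A provider that serves its OpenAI-compatible API under a different prefix would change only `rewrite_path`; the client-facing path stays the same.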


## Example


`model-endpoint.yaml`:

```yaml
# Modelplane composes a ModelEndpoint per ModelReplica automatically. Create one
# manually only to register an external inference endpoint with a ModelService,
# for example a SaaS provider like Together or Baseten.
#
# Give it a label of your own for a ModelService to select on
# (modelplane.ai/external-provider is a readable convention), and set
# url/rewritePath to the provider's OpenAI-compatible endpoint.
apiVersion: modelplane.ai/v1alpha1
kind: ModelEndpoint
metadata:
  name: qwen3-coder-together
  namespace: ml-team
  labels:
    modelplane.ai/external-provider: together
spec:
  url: https://api.together.xyz/
  rewritePath: /v1/
```

  
---

# Scale the platform

Source: /getting-started/scale-the-platform/


You have one L4 cluster with a running model. In this guide, you'll add two
larger-GPU clusters in different regions to grow the fleet available to the ML team.

Provisioning two more clusters takes about 10 to 15 minutes.

## Register more clusters


On EKS, register two more clusters with a bigger hardware class: `L40S`
(`48 Gi`) in `us-west` and `eu-central`:

  
`platform-scale.yaml` (EKS):

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: l40s-1x-g6e
spec:
  description: "EKS g6e.xlarge, 1x NVIDIA L40S"
  provisioning:
    provider: EKS
    eks:
      instanceType: g6e.xlarge
      diskSizeGb: 100
      accelerator:
        type: nvidia-l40s
        count: 1
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 1
    attributes:
      architecture: { string: Ada Lovelace }
    capacity:
      memory: { value: "46068Mi" }
---
# g6e.xlarge is available in us-east-1, us-west-2, and eu-central-1.
# eu-west-1 does NOT have g6e.xlarge.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: eks-us-west
  labels:
    modelplane.ai/region: us-west
spec:
  cluster:
    source: EKS
    eks:
      region: us-west-2
  nodePools:
  - name: gpu-l40s
    className: l40s-1x-g6e
    nodeCount: 1
    minNodeCount: 1
    maxNodeCount: 1
    zones:
    - us-west-2a
---
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: eks-eu-central
  labels:
    modelplane.ai/region: eu-central
spec:
  cluster:
    source: EKS
    eks:
      region: eu-central-1
  nodePools:
  - name: gpu-l40s
    className: l40s-1x-g6e
    nodeCount: 1
    minNodeCount: 1
    maxNodeCount: 1
    zones:
    - eu-central-1a
```

  
> **Note:** `g6e.xlarge` runs ~$2/hr on demand. Two of them plus the `L4` from
> earlier is a few dollars for this tour. Clean up when you're done (see
> [Clean up](/getting-started/clean-up/)).
  

On GKE, register two more clusters with a bigger hardware class: `A100`
(`40 Gi`) in `us-west` and `us-east`. Apply the manifest, setting each cluster's
`project` to your GCP project:

  
`platform-scale.yaml` (GKE):

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: gke-a100-40-1x
spec:
  description: "GKE a2-highgpu-1g, 1x NVIDIA A100 40GB"
  provisioning:
    provider: GKE
    gke:
      machineType: a2-highgpu-1g
      diskSizeGb: 200
      accelerator:
        type: nvidia-tesla-a100
        count: 1
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 1
    attributes:
      architecture: { string: Ampere }
      cudaComputeCapability: { version: "8.0.0" }
    capacity:
      # A100 40GB real reported VRAM. Keep the selector at >= 35Gi (not >= 40Gi)
      # so it reliably clears the L4 (24Gi) without hitting the boundary.
      memory: { value: "40960Mi" }
---
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: gpu-us-west
  labels:
    modelplane.ai/region: us-west
spec:
  cluster:
    source: GKE
    gke:
      project: my-gcp-project
      region: us-west1
  nodePools:
  - name: gpu-a100
    className: gke-a100-40-1x
    nodeCount: 1
    minNodeCount: 1   # keep >=1; the autoscaler can't scale a GPU pool up from 0 for DRA pods
    maxNodeCount: 2
    zones:
    - us-west1-b
---
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: gpu-us-east
  labels:
    modelplane.ai/region: us-east
spec:
  cluster:
    source: GKE
    gke:
      project: my-gcp-project
      region: us-east1
  nodePools:
  - name: gpu-a100
    className: gke-a100-40-1x
    nodeCount: 1
    minNodeCount: 1   # keep >=1; the autoscaler can't scale a GPU pool up from 0 for DRA pods
    maxNodeCount: 2
    zones:
    - us-east1-b
```

  
```bash
# Replace YOUR_PROJECT_ID with your GCP project ID.
curl -fsSL /examples/getting-started/gke/platform-scale.yaml \
  | sed 's/my-gcp-project/YOUR_PROJECT_ID/g' \
  | kubectl apply -f -
```


> **Note:** `a2-highgpu-1g` runs ~$3.50/hr on demand. Two of them plus the `L4`
> from earlier is a few dollars for this tour. Clean up when you're done (see
> [Clean up](/getting-started/clean-up/)).
  

Modelplane provisions both clusters in parallel:

```bash
kubectl wait --for=condition=Ready ic --all --timeout=20m
```

## Your model keeps running

Growing the fleet doesn't disturb anything already deployed. `qwen-demo` stays
on its original cluster, and the two new clusters add capacity the moment
they're `Ready`, with no interruption for the ML team. A replica only moves if
its deployment changes in a way that no longer fits where it runs.

## Next step

The fleet now spans three clusters across three regions. The ML team is next. [Scale the model](/getting-started/scale-the-model/) to serve it from two regions behind a single endpoint.


---

# Supported Providers

Source: /platform/providers/

Modelplane is built on [Crossplane](https://crossplane.io) and shares its
infrastructure providers, so the set of clouds and neoclouds it reaches grows
alongside Crossplane itself. This page shows where Modelplane runs today and
where it's headed.

A provider can show up here in three ways:


- **Provisioning supported.** Modelplane creates and manages the whole cluster
  from an `InferenceCluster`, selected through `provisioning.provider`. GKE and
  EKS work this way today.
- **Bring your own supported.** Register a cluster you already run with
  `source: Existing`. This works on any provider whose Kubernetes meets
  Modelplane's requirements (Dynamic Resource Allocation and a recent Kubernetes
  version), so you can run on the providers below now, ahead of native
  provisioning.
- **Crossplane provider exists.** A Crossplane provider is published for the
  cloud. That provider is the path by which native provisioning lands, so it
  marks where Modelplane can grow next.


## Clouds and neoclouds

Listed alphabetically, spanning hyperscalers and GPU-specialist neoclouds. Each
runs a managed Kubernetes service with GPU node pools, so the bring-your-own path
covers them all today. Where a Crossplane provider exists, it's the path to
native provisioning.

| Provider / service | Accelerators | Provisioning | BYO | Crossplane |
| --- | --- | --- | --- | --- |
| Alibaba Cloud (ACK) | NVIDIA | Planned | ✓ | |
| AWS (EKS) | NVIDIA, Trainium | ✓ | ✓ | |
| Civo (K3s) | NVIDIA | Planned | ✓ | |
| CoreWeave (CKS) | NVIDIA | Planned | ✓ | none yet |
| Crusoe (CMK) | NVIDIA, AMD | Planned | ✓ | none yet |
| DigitalOcean (DOKS) | NVIDIA, AMD | Planned | ✓ | |
| Fluidstack | NVIDIA | Planned | ✓ | none yet |
| Google Cloud (GKE) | NVIDIA, TPU | ✓ | ✓ | |
| Huawei Cloud (CCE) | NVIDIA, Ascend | Planned | ✓ | |
| IBM Cloud (IKS) | NVIDIA | Planned | ✓ | none active |
| Lambda | NVIDIA | Planned | ✓ | none yet |
| Linode / Akamai (LKE) | NVIDIA | Planned | ✓ | |
| Microsoft Azure (AKS) | NVIDIA | Planned | ✓ | |
| Nebius | NVIDIA | Planned | ✓ | none yet |
| Oracle Cloud (OKE) | NVIDIA, AMD | Planned | ✓ | |
| OVHcloud | NVIDIA | Planned | ✓ | |
| Scaleway (Kapsule) | NVIDIA | Planned | ✓ | |
| Tencent Cloud (TKE) | NVIDIA | Planned | ✓ | |
| Voltage Park | NVIDIA | Planned | ✓ | none yet |
| Vultr (VKE) | NVIDIA, AMD | Planned | ✓ | |
  
> **Note:** **On-premises and bare metal.** Bring an on-prem cluster the same
> way as any other: stand up Kubernetes on your own hardware (like NVIDIA DGX
> BasePOD or SuperPOD) with NVIDIA Base Command Manager, Run:ai, or your own
> tooling, then register it with `source: Existing`. Provisioning it for you is
> on the roadmap too. Modelplane can drive NVIDIA Base Command Manager or other
> bare-metal Kubernetes provisioners through Crossplane, the same pattern it
> uses in the cloud.

  
Native provisioning expands as more Crossplane providers ship; until then, the
bring-your-own path runs Modelplane on any conformant Kubernetes cluster today.


> **Tip:** Don't see your cloud or neocloud, or want to be added? Open an issue
> and we'll track it.
  

- **Register a Cluster.** Add a cluster to Modelplane, provisioned or
  bring-your-own.
- **Define Hardware Classes.** Describe the GPUs and provisioning recipe each
  node pool uses.


---

# API Reference

Source: /reference/


Modelplane's API is a set of Kubernetes custom resources. Each type below has
its own page with the full spec and status schema, a runnable example, and
fields you can link to directly. For release history, see the
[GitHub releases page](https://github.com/modelplaneai/modelplane/releases).


---

# Scale the model

Source: /getting-started/scale-the-model/

A `ModelService` can front more than one `ModelDeployment`. Here you add a second
deployment, pinned to a different region, and point the same service at both. The
endpoint you already curled stays the same. Behind it, traffic now load-balances
across two regions.

```mermaid
graph LR
    subgraph fleet ["Fleet"]
        IC1["us-east\nL4"]
        IC2["us-west\nlarger GPU"]
    end

    subgraph ml ["ML team"]
        MD1["ModelDeployment\nqwen-demo"]
        MD2["ModelDeployment\nqwen-west\nclusterSelector: us-west"]
        MS["ModelService qwen\n/ml-team/qwen/v1/..."]
    end

    IC1 --> MD1
    IC2 --> MD2
    MD1 --> MS
    MD2 --> MS
```

## Deploy to a second region

The new deployment uses a `clusterSelector` to pin its replica to the `us-west`
cluster you added in the last step, and selects the larger GPU there:


`model-deployment-west.yaml` (EKS):

```yaml
# A second ModelDeployment, pinned to a different region with clusterSelector.
# It carries its own deployment name, so the ModelService can front it alongside
# qwen-demo and serve both from one endpoint.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen-west
  namespace: ml-team
spec:
  replicas: 1
  # clusterSelector filters which InferenceClusters this deployment can land on.
  # This pins the replica to the us-west cluster you added in "Scale the platform".
  clusterSelector:
    matchLabels:
      modelplane.ai/region: us-west
  engines:
  - name: qwen
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          # L40S (46068Mi) qualifies; L4 (23034Mi) does not.
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            args:
            - --model=Qwen/Qwen2.5-0.5B-Instruct
            - --dtype=half
```

  
`model-deployment-west.yaml` (GKE):

```yaml
# A second ModelDeployment, pinned to a different region with clusterSelector.
# It carries its own deployment name, so the ModelService can front it alongside
# qwen-demo and serve both from one endpoint.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen-west
  namespace: ml-team
spec:
  replicas: 1
  # clusterSelector filters which InferenceClusters this deployment can land on.
  # This pins the replica to the us-west cluster you added in "Scale the platform".
  clusterSelector:
    matchLabels:
      modelplane.ai/region: us-west
  engines:
  - name: qwen
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          # A100 40GB (40960Mi) qualifies; L4 (23034Mi) does not.
          # Threshold at 35Gi not 40Gi to avoid the boundary on A100's exact reported VRAM.
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("35Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            args:
            - --model=Qwen/Qwen2.5-0.5B-Instruct
            - --dtype=half
```
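
The memory thresholds in these selectors are easier to reason about with the quantities converted to bytes. The sketch below is a simplified stand-in for the CEL `compareTo` check, assuming only the `Ki`/`Mi`/`Gi` suffixes used on this page. `parse_quantity` and `qualifies` are hypothetical helpers and do not implement the full Kubernetes quantity grammar.

```python
UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_quantity(quantity: str) -> int:
    """Parse a binary-suffixed quantity such as '23034Mi' into bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count


def qualifies(capacity: str, threshold: str) -> bool:
    """Mirror `capacity.compareTo(quantity(threshold)) >= 0`."""
    return parse_quantity(capacity) >= parse_quantity(threshold)


# Capacities as declared in the InferenceClass manifests in these guides.
gpus = {"L4": "23034Mi", "L40S": "46068Mi", "A100 40GB": "40960Mi"}
for name, memory in gpus.items():
    print(f"{name}: >=35Gi {qualifies(memory, '35Gi')}, >=40Gi {qualifies(memory, '40Gi')}")
```

`40960Mi` is exactly `40Gi`, so the A100 passes a `>= 40Gi` check with no margin at all. A device reporting even slightly less would be rejected, which is the boundary the `35Gi` threshold avoids.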

  
Wait until its replica is `Ready`, then check placement. You now have one replica
per region:

```bash
kubectl get modelreplica -n ml-team
```

```shell {nocopy=true}
NAME              CLUSTER       SYNCED   READY   COMPOSITION                   AGE
qwen-demo-7323a   eks-us-east   True     True    modelreplicas.modelplane.ai   42m
qwen-west-92535   eks-us-west   True     True    modelreplicas.modelplane.ai   8m
```

## Front both with one service

Update the `ModelService` to select both deployments. Each entry in
`spec.endpoints` adds its matching replicas to the same endpoint:


`model-service-multi.yaml`:

```yaml
# The same ModelService as before, now selecting two deployments. Each entry in
# spec.endpoints adds every ModelEndpoint matching its selector to the same
# OpenAI-compatible endpoint, so /ml-team/qwen/ load-balances across qwen-demo
# and qwen-west, wherever their replicas run.
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen-demo
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen-west
```
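
Selection across several `spec.endpoints` entries behaves like a union of label matches. The sketch below illustrates that under standard `matchLabels` semantics, where every listed label must match. The endpoint names are illustrative, and `select_endpoints` is a hypothetical helper rather than Modelplane's implementation.

```python
def matches(labels: dict[str, str], match_labels: dict[str, str]) -> bool:
    """True when every key/value in match_labels is present in labels."""
    return all(labels.get(key) == value for key, value in match_labels.items())


def select_endpoints(endpoints: list[dict], selectors: list[dict[str, str]]) -> list[str]:
    """Return names of endpoints matched by at least one selector entry."""
    return [
        endpoint["name"]
        for endpoint in endpoints
        if any(matches(endpoint["labels"], selector) for selector in selectors)
    ]


# Illustrative ModelEndpoints, labeled with their deployment name.
endpoints = [
    {"name": "qwen-demo-7323a", "labels": {"modelplane.ai/deployment": "qwen-demo"}},
    {"name": "qwen-west-92535", "labels": {"modelplane.ai/deployment": "qwen-west"}},
    {"name": "llama-3-1-8b-1a2b3", "labels": {"modelplane.ai/deployment": "llama-3-1-8b"}},
]
selectors = [
    {"modelplane.ai/deployment": "qwen-demo"},
    {"modelplane.ai/deployment": "qwen-west"},
]
print(select_endpoints(endpoints, selectors))
# ['qwen-demo-7323a', 'qwen-west-92535']
```

Adding a deployment to the service is therefore one more selector entry, and endpoints that match no entry are never part of the pool.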

  
The endpoint URL doesn't change. Clients that had this URL before still have it;
they don't know the fleet changed. The gateway load-balances across both regions,
and losing one region keeps the other serving. Send the same request as before:

```bash
ADDRESS=$(kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}')
```

```bash
kubectl run -i --rm curl-test \
  --image=curlimages/curl \
  --restart=Never \
  --env="ADDRESS=$ADDRESS" \
  -- sh -c 'curl -v "$ADDRESS/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"Qwen/Qwen2.5-0.5B-Instruct\",\"messages\":[{\"role\":\"user\",\"content\":\"What is Kubernetes in one sentence?\"}],\"max_tokens\":100}"'
```

## That's the tour

You stood up a control plane, built a multi-region GPU fleet, deployed a model
across it, and ended with one stable endpoint serving requests. The platform
team published hardware. The ML team described what the model needs. Modelplane
placed the replicas and served them behind a single endpoint.

[Clean up](/getting-started/clean-up/) tears everything down
when you're done.

For more on the resources you used:

* [InferenceClass](/platform/inference-class/)
* [InferenceCluster](/platform/inference-cluster/)
* [ModelDeployment](/models/model-deployment/)
* [ModelService](/models/model-service/)

Modelplane is in active development and we're building in the open. If you're
running your own inference fleet and want to shape where this goes, we'd love to
hear from you. Star the [repository](https://github.com/modelplaneai/modelplane),
join us in [Slack](https://slack.crossplane.io), or read the
[manifesto](https://modelplane.ai).


---

# Clean up

Source: /getting-started/clean-up/

Delete the model resources, clusters, and finally the control plane.

## Delete model resources

Delete model resources before clusters. Deleting a cluster first leaves the
deployments reconciling against infrastructure that no longer exists.

```bash
kubectl delete md --all -n ml-team
kubectl delete ms --all -n ml-team
```

Wait for all model replicas to finish deleting:

```bash
kubectl get modelreplica -n ml-team --watch
```

## Delete the clusters

Delete all clusters with foreground cascading deletion. The serving stack on each
workload cluster must uninstall while that cluster's API server is still
reachable. Foreground deletion holds each cluster object until its stack
finishes. Background deletion can orphan cloud resources.

```bash
kubectl delete ic --all --cascade=foreground
```

Wait until all clusters are deleted:

```bash
kubectl get ic --watch
```

## Delete the control plane

Delete the kind cluster:

```bash
kind delete cluster --name modelplane
```


---

# EKSCluster

Source: /reference/eksclusters/

An EKSCluster provisions an EKS cluster with dedicated node groups for GPU inference and system workloads. It outputs a Secret containing the cluster kubeconfig that consumers use to target the cluster. The kubeconfig embeds a static bearer token that the AWS provider refreshes.

---

# GKECluster

Source: /reference/gkeclusters/

A GKECluster provisions a GKE cluster with dedicated node pools for GPU inference and system workloads. It outputs secrets containing the cluster kubeconfig and a GCP service account key that consumers can use to target the cluster.

---

# ServingStack

Source: /reference/servingstacks/

A ServingStack installs the serving substrate (LeaderWorkerSet, Gateway API, cert-manager, Prometheus) on a Kubernetes cluster.