# Modelplane Documentation

> Modelplane is the open source control plane for AI model serving. It extends Crossplane to manage AI inference across a fleet of GPU clusters.

---

# Overview

Source: /overview/

Modelplane is the open source control plane for AI inference. It's software you install and run in your own environment, and it orchestrates the models, serving stack, and infrastructure across cloud, neocloud, and on-premise.

Modelplane supports running any model and any engine on any infrastructure, with the frontier-level serving topologies and performance the largest models demand, from a single GPU to disaggregated, multi-node deployments.

Modelplane operates across the whole fleet: provisioning inference clusters, scheduling model deployments on compatible clusters, autoscaling model replicas across clusters, caching model weights across clusters, and routing across clusters. It's an active system that is always reconciling the fleet toward the state you declare.

You install Modelplane on a Kubernetes cluster, which becomes the control cluster for your inference fleet. It's built on [Crossplane](https://crossplane.io) and fully integrates with your existing platform systems.

**Warning:** Modelplane is under active development. We have opted to build the project in the open, collaborating with the broad AI inference community on integrations and capabilities.

## Deploy a model

Modelplane's API is declarative, designed for platform teams responsible for the inference infrastructure and developers deploying models on that infrastructure.
Once a platform team has provisioned inference clusters and declared the available GPUs and networking fabric, an ML development team deploys a model with a declarative manifest:

```yaml {nocopy=true}
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen-demo
  namespace: ml-team
spec:
  replicas: 1
  engines:
  - name: qwen
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          - cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            args: ["--model=Qwen/Qwen2.5-0.5B-Instruct"]
```

Modelplane schedules a model replica onto an inference cluster with free, compatible GPUs and memory, and deploys the serving engine. To expose an OpenAI-compatible endpoint, declare a `ModelService`:

```yaml {nocopy=true}
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen-demo
```

## A universal control plane for AI inference

Modelplane is designed to be a universal control plane for inference. It runs inference clusters on any cloud, neocloud, or on-premise environment, or any combination of them. Modelplane can provision the clusters for you, or you can bring your own.

It supports any serving engine that runs as a container, and can serve frontier-quality models using advanced topologies including tensor parallel, pipeline parallel, data and expert parallel, and prefill/decode disaggregation.

Modelplane works across different accelerators and networking fabrics, and schedules each model's replicas by matching the model's hardware requirements to the hardware available across your clusters.

## What Modelplane is not

Modelplane is not a serving engine like vLLM, SGLang, or TensorRT-LLM. Modelplane composes serving engines and orchestrates them fleet-wide across cloud, neocloud, and on-premise.
Modelplane is not a managed inference service like Baseten, Together, or Fireworks. These offer cloud services, while Modelplane is self-hosted software.

## Next steps

- **Get started**: Go from nothing to a live OpenAI-compatible endpoint in about 45 minutes.
- **Why Modelplane**: Learn more about Modelplane’s capabilities and how it works.

---

# Deploy a Model

Source: /models/model-deployment/

**API:** [`modelplane.ai/v1alpha1` · ModelDeployment](/reference/modeldeployments/)

A `ModelDeployment` is the ML team's primary interface. You describe the model you want served, the hardware it needs, and how many copies to run; Modelplane schedules it onto matching clusters and keeps it running. You never name a cluster.

Modelplane is unopinionated about the engine itself. You bring the container and its flags, and Modelplane shapes a serving topology around it. The engine flags you write carry parallelism, quantization, and KV transfer; Modelplane never injects them.

A deployment's `spec.engines` describes its topology through two choices:

- **One pod or a gang**: whether an engine is a single `Standalone` pod or a `Leader` with one or more `Worker` pods coordinating across nodes.
- **Unified or disaggregated**: whether `spec.serving.mode` keeps prefill and decode together (`Unified`, the default) or splits them across two engines (`PrefillDecode`).

How many of each to run is a separate question, covered in [Sizing a deployment](#sizing-a-deployment).

## Single-node

The default, and what the [getting started tour](/getting-started/) deploys. One `Standalone` member is one pod on one node, claiming that node's GPUs through its `nodeSelector`. It's usually the right choice when a model fits on a single node. Within a node, tensor parallelism is an engine flag (`--tensor-parallel-size`), not a Modelplane concept.
```yaml {nocopy=true}
engines:
- name: qwen
  members:
  - role: Standalone   # one pod, one node
```

## Multi-node

When a model is too large for one node's GPUs, make the engine a gang: a `Leader` and a `Worker` whose `worker.nodes` expands to that many worker pods, one per node. The pods serve the model together; how the model splits across them (tensor, pipeline, data, or expert parallelism) is up to your engine flags.

A gang should use a [`ModelCache`](/models/model-cache/) via `spec.modelCacheRef`, so every pod mounts the same weights instead of each pulling its own.

```yaml {nocopy=true}
modelCacheRef:
  name: qwen3-coder    # recommended for gangs
engines:
- name: qwen3-coder
  members:
  - role: Leader
  - role: Worker
    worker:
      nodes: 1         # one worker pod per node
```

A member's `env` can read pod fields through `valueFrom.fieldRef`, like setting vLLM's `VLLM_HOST_IP` from `status.podIP`, which multi-NIC RDMA nodes need so the engine binds the right interface instead of guessing it.

## Disaggregated serving

The prefill and decode phases have opposite hardware profiles, and on one engine a prefill burst stalls the decodes already running. Set `spec.serving.mode: PrefillDecode` to run them as two engines, one marking `phase: Prefill` and the other `phase: Decode`. Modelplane fronts the pair with inference-aware routing that sequences prefill then decode, moving the KV cache between them. Each phase can sit on the GPU class that suits it.

```yaml {nocopy=true}
serving:
  mode: PrefillDecode  # the two engines below are one P/D pair
engines:
- name: prefill
  phase: Prefill
- name: decode
  phase: Decode
```

Disaggregation pays off for large models under load with strict latency targets and long context. For small models or low traffic, the KV-transfer overhead outweighs the benefit, so unified serving is the default.

Disaggregated serving requires an engine image that includes the **NIXL** KV-transfer runtime.
vLLM's `NixlConnector` and SGLang's prefill/decode transfer both import the `nixl` package, so disaggregated engines crash at startup with `NIXL is not available` on an image that lacks it. Recent vanilla `vllm/vllm-openai` images include NIXL, so pin a current tag rather than an old one. The engine image is yours to choose, so this is a prerequisite Modelplane does not bundle for you.

## Requesting GPUs

You don't name a cluster or a GPU model. Instead each member's `nodeSelector` lists the hardware its pods need, and Modelplane finds a node pool that has it. The platform team publishes node pools as `InferenceClass` resources, each describing the devices its nodes carry. Your request is matched against them.

A request names a device (`gpu`), how many of it each pod needs (`count`), and one or more `selectors` the device must match:

```yaml {nocopy=true}
nodeSelector:
  devices:
  - name: gpu
    count: 1           # one GPU per pod
    selectors:
    - cel: |
        device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
```

Each selector is a single expression in [CEL](https://cel.dev/), a small expression language, that returns true or false for one device. The part in brackets, `"gpu.nvidia.com"`, is the GPU vendor's driver. The fields after it, like `memory` or `architecture`, are what the platform team published for that device. This one says "match a GPU whose memory is at least 40Gi."

A device has to match every selector in the request. Give two selectors to mean "Hopper, with at least 80Gi."

### Requesting more than one device

`devices` is a list, so a member can ask for distinct kinds of hardware at once, each its own entry with its own `count` and `selectors`. A node pool matches the member only when it satisfies every entry.
This is how you ask for both a GPU and a fast NIC on the same node:

```yaml {nocopy=true}
nodeSelector:
  devices:
  - name: gpu
    count: 8
    selectors:
    - cel: device.attributes["gpu.nvidia.com"].architecture == "Hopper"
  - name: nic
    count: 1
    selectors:
    - cel: device.attributes["nic.nvidia.com"].linkType == "infiniband"
```

### What you can match on

Each selector is evaluated against one device and must return a boolean. The device exposes three things:

- `device.driver`: the device's driver, a string.
- `device.attributes["<domain>"].<name>`: a typed attribute (string, bool, int, or version), such as `architecture` or `cudaComputeCapability`.
- `device.capacity["<domain>"].<name>`: a capacity quantity, such as `memory`.

Two helpers build comparable values: `quantity()` parses Kubernetes quantities like `"40Gi"`, and `semver()` parses versions like `"9.0.0"`. Both support `compareTo` (which orders two values), `isGreaterThan`, and `isLessThan`. Combine selectors with the usual CEL operators (`==`, `!=`, `>=`, `&&`, `||`).

```yaml {nocopy=true}
selectors:
# Capacity: at least 40Gi of GPU memory. >= 0 reads as "left is at least right".
- cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
# Attribute equality: a specific architecture.
- cel: device.attributes["gpu.nvidia.com"].architecture == "Hopper"
# Version attribute: a minimum CUDA compute capability.
- cel: device.attributes["gpu.nvidia.com"].cudaComputeCapability.isGreaterThan(semver("8.9.0"))
# Driver: match any device from a given driver.
- cel: device.driver == "gpu.nvidia.com"
# Presence: only match a device that publishes a given domain.
- cel: '"gpu.nvidia.com" in device.attributes'
# Two conditions in one selector.
- cel: |
    device.attributes["gpu.nvidia.com"].architecture == "Hopper" &&
    device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("80Gi")) >= 0
```

This is the Kubernetes DRA device selector expression surface.
The Kubernetes-specific CEL extension libraries (such as regular expressions and IP address helpers) aren't available. Selectors in practice are attribute and capacity comparisons like those above.

### Seeing what's available

To see what you can match against, list the classes the platform team has published and look at the devices each one declares:

```bash
kubectl get inferenceclass
kubectl describe inferenceclass gke-l4-1x-g2
```

The `describe` output shows each device's driver, attributes (like `architecture`), and capacity (like `memory`), which are exactly the keys your selectors read. If a selector asks for something no published class offers, the deployment won't schedule.

## Sizing a deployment

Three independent numbers control how many pods a deployment runs:

- **`spec.replicas`** stamps out whole copies of the entire topology. Each replica is a complete serving instance, and replicas usually land on different clusters. This is the scaling axis (see [Scaling](#scaling)).
- **`engines[].copies`** runs several identical copies of one engine within a replica, on the same cluster. It's a fixed number, sized once, never autoscaled. Copies make a replica more resilient within its cluster: a node failure drops one copy instead of taking the whole replica out of service. In disaggregated serving they also set the prefill-to-decode ratio.
- **`worker.nodes`** sets how many nodes one gang spans: a `Leader` plus that many `Worker` pods. It's how big a single multi-node engine is.

## Scaling

`spec.replicas` is the only scaling axis. Each replica is a complete, fixed-shape serving instance, so scaling adds or removes whole instances across the fleet. Because the deployment exposes the Kubernetes scale subresource, `kubectl scale` and KEDA work without anything extra. There's no in-cluster pod autoscaling.
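As a rough illustration of how the three numbers from [Sizing a deployment](#sizing-a-deployment) combine, the sketch below counts the pods a deployment would run. It assumes the counts simply multiply (replicas × copies × pods per engine copy) and that `copies` defaults to 1 when unset. The `Engine` dataclass and the `total_pods` function are hypothetical names used for illustration only; they are not part of Modelplane's API.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    """Hypothetical stand-in for one entry in spec.engines."""
    name: str
    copies: int = 1        # engines[].copies (a default of 1 is an assumption)
    worker_nodes: int = 0  # worker.nodes; 0 models a Standalone engine


def pods_per_engine_copy(engine: Engine) -> int:
    # A Standalone engine is one pod; a gang is a Leader plus worker.nodes Workers.
    return 1 + engine.worker_nodes


def total_pods(replicas: int, engines: list[Engine]) -> int:
    # Each replica is a whole copy of the topology, so the per-replica
    # pod count is multiplied by spec.replicas.
    per_replica = sum(e.copies * pods_per_engine_copy(e) for e in engines)
    return replicas * per_replica


if __name__ == "__main__":
    single_node = [Engine("qwen")]
    gang = [Engine("qwen3-coder", worker_nodes=1)]
    disaggregated = [Engine("prefill", copies=2), Engine("decode", copies=1)]

    print(total_pods(1, single_node))    # one Standalone pod
    print(total_pods(1, gang))           # Leader + 1 Worker
    print(total_pods(3, disaggregated))  # 3 replicas x (2 prefill + 1 decode)
```

Under these assumptions, changing `spec.replicas` from 1 to 3 triples the pod count without changing the shape of any single replica, which matches the description of replicas as the only scaling axis.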
## Choosing a topology

| Topology | Use when | How you set it |
|----------|----------|----------------|
| Single-node | The model fits on one node's GPUs | One `Standalone` member (the default) |
| Multi-node | The model is too large for one node | A `Leader` and one or more `Worker` members, ideally with a `modelCacheRef` |
| Disaggregated serving | Large model, heavy load, strict latency, long context | `serving.mode: PrefillDecode` with two phase engines |

## Examples

**Single-node:** `model-deployment.yaml`

```yaml
# A ModelDeployment deploys a model to one or more inference clusters.
# The scheduler picks clusters by clusterSelector labels and nodeSelector
# device requests, gated on available nodes. Each matched cluster gets one
# ModelReplica.
#
# The control plane creates a unified OpenAI-compatible endpoint:
#   http://///v1/chat/completions
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-8b
  namespace: ml-team
spec:
  # Number of ModelReplicas to fan out to. Each replica is a complete
  # serving instance scheduled to one InferenceCluster.
  replicas: 1

  # Optional: restrict the scheduler to clusters with specific labels.
  # clusterSelector:
  #   matchLabels:
  #     modelplane.ai/region: us-central

  # Engines are an array of inference engines. This model is one engine, one
  # Standalone member, one pod - the simplest shape. The engine composes to a
  # Deployment fronted by a Service.
  engines:
  - name: qwen3-8b
    members:
    # A Standalone member is a single self-contained engine pod. Its template
    # carries the container named "engine" - the inference engine; its image,
    # command, and args pass through verbatim.
    - role: Standalone
      # The member's per-node device request: a list of DRA device requests
      # describing what each of the member's pods needs from its node. The
      # scheduler matches each against a candidate pool's InferenceClass
      # devices and pins the member to a pool that satisfies them. Each
      # request's CEL is real DRA CEL over a single device; quantity() and
      # semver() are helpers. claim: DRA devices also become requests in the
      # DRA ResourceClaim the serving pods claim GPUs through, so an engine
      # must declare the GPUs it needs.
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          # Qwen3-8B fits comfortably on an L4; 20Gi selects one without
          # over-constraining. A larger model would ask for more memory or a
          # specific architecture here. This CEL is real DRA CEL: the scheduler
          # matches it against the pool's declared device, and DRA matches it
          # again against the GPU's ResourceSlice when it binds the claim.
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            args:
            - "--model=Qwen/Qwen3-8B"
            - "--served-model-name=qwen"
            - "--reasoning-parser=qwen3"
            - "--default-chat-template-kwargs={\"enable_thinking\": false}"
            - "--enable-auto-tool-choice"
            - "--tool-call-parser=hermes"
```

**Multi-node:** `model-deployment-multinode.yaml`

```yaml
# A ModelDeployment serving one model across two nodes.
#
# When a model is too large to fit on one node's GPUs, make an engine a gang:
# give it a Leader and a Worker member, whose worker.nodes expands to that many
# worker pods, one per node. The scheduler picks a cluster with a pool that has
# enough GPUs per node and enough nodes for the whole gang, and Modelplane
# composes a LeaderWorkerSet-backed serving instance on it. The worker joins the
# leader through $(MODELPLANE_LEADER_ADDRESS), which Modelplane injects.
#
# Multi-node engines require a ModelCache: every pod in the gang mounts it at
# /mnt/models. When a member brings its own command, Modelplane does not inject
# --model, so the leader points the engine at the mount explicitly.
#
# This shape (vLLM's native multiprocessing backend, TP within a node and PP
# across nodes) is the one validated serving Qwen3-Coder-480B; see
# examples/qwen3-coder/ for the full platform side.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-coder
  namespace: ml-team
spec:
  replicas: 1
  modelCacheRef:
    name: qwen3-coder
  engines:
  - name: qwen3-coder
    members:
    - role: Leader
      nodeSelector:
        devices:
        - name: gpu
          count: 8
          selectors:
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("120Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            command:
            - /bin/sh
            - -c
            - >-
              exec vllm serve /mnt/models
              --served-model-name=qwen3-coder
              --tensor-parallel-size=8
              --pipeline-parallel-size=2
              --distributed-executor-backend=mp
              --nnodes=2
              --node-rank=0
              --master-addr=$(MODELPLANE_LEADER_ADDRESS)
              --max-model-len=32768
              --port=8000
    - role: Worker
      worker:
        nodes: 1
      nodeSelector:
        devices:
        - name: gpu
          count: 8
          selectors:
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("120Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            command:
            - /bin/sh
            - -c
            - >-
              exec vllm serve /mnt/models
              --served-model-name=qwen3-coder
              --tensor-parallel-size=8
              --pipeline-parallel-size=2
              --distributed-executor-backend=mp
              --nnodes=2
              --node-rank=1
              --master-addr=$(MODELPLANE_LEADER_ADDRESS)
              --headless
              --max-model-len=32768
```

---

# Get started

Source: /getting-started/

Modelplane is an open source control plane for AI inference. It separates two concerns: a platform team managing GPU capacity, and ML teams deploying models against it.

Without it, every change on one side creates work for the other. When the platform team updates infrastructure, ML teams have to react. When model requirements change, the platform team gets a request.

With Modelplane, the platform team publishes hardware without knowing what models will run on it. The ML team declares what a model needs without knowing what clusters exist. The control plane resolves it and keeps it current as both sides change.

In this tour, you'll switch between provisioning infrastructure and declaring a model to see how they interact.
By the end you'll have a GPU fleet across three regions and one OpenAI-compatible endpoint routing to a model served across two of them. This is not a production setup and takes around 45 minutes to run.

## What you'll build

The platform team provisions a starter cluster and grows it to two A100 regions; the ML team serves a model on the L4, then scales it onto an A100, all behind one endpoint.