Kubernetes v1.35 [stable] (enabled by default)

This page describes dynamic resource allocation (DRA) in Kubernetes.
DRA is a Kubernetes feature that lets you request and share resources among Pods. These resources are often attached devices like hardware accelerators.
With DRA, device drivers and cluster admins define device classes that are available to claim in workloads. Kubernetes allocates matching devices to specific claims and places the corresponding Pods on nodes that can access the allocated devices.
Allocating resources with DRA is a similar experience to dynamic volume provisioning, in which you use PersistentVolumeClaims to claim storage capacity from storage classes and request the claimed capacity in your Pods.
DRA provides a flexible way to categorize, request, and use devices in your cluster. Using DRA provides benefits like the following:

- Request devices at the workload level instead of per container
- Share a device among multiple containers or Pods
- Filter devices by attributes or capacity by using expressions
These benefits provide significant improvements in the device allocation workflow when compared to device plugins, which require per-container device requests, don't support device sharing, and don't support expression-based device filtering.
The workflow of using DRA to allocate devices involves the following types of users:
Device owner: responsible for devices. Device owners might be commercial vendors, the cluster operator, or another entity. To use DRA, devices must have DRA-compatible drivers that do the following:
Cluster admin: responsible for configuring clusters and nodes, attaching devices, installing drivers, and similar tasks. To use DRA, cluster admins do the following:
Workload operator: responsible for deploying and managing workloads in the cluster. To use DRA to allocate devices to Pods, workload operators do the following:
DRA uses the following Kubernetes API kinds to provide the core allocation
functionality. All of these API kinds are included in the resource.k8s.io/v1
API group.
A DeviceClass lets cluster admins or device drivers define categories of devices in the cluster. DeviceClasses tell operators what devices they can request and how they can request those devices. You can use Common Expression Language (CEL) to select devices based on specific attributes. A ResourceClaim that references the DeviceClass can then request specific configurations within the DeviceClass.
To create a DeviceClass, see Set Up DRA in a Cluster.
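As a sketch, a minimal DeviceClass might select every device published by one driver; the driver name resource-driver.example.com and the class name here are placeholders, not real identifiers:

```yaml
# Hypothetical DeviceClass that matches all devices from one driver.
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: example-device-class
spec:
  selectors:
  - cel:
      # CEL expression evaluated against each candidate device.
      expression: device.driver == "resource-driver.example.com"
```

Workload operators can then reference example-device-class from the requests in their ResourceClaims.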
A ResourceClaim defines the resources that a workload needs. Every ResourceClaim has requests that reference a DeviceClass and select devices from that DeviceClass. ResourceClaims can also use selectors to filter for devices that meet specific requirements, and can use constraints to limit the devices that can satisfy a request. ResourceClaims can be created by workload operators or can be generated by Kubernetes based on a ResourceClaimTemplate. A ResourceClaimTemplate defines a template that Kubernetes can use to auto-generate ResourceClaims for Pods.
The method that you use depends on your requirements, as follows:
Note: Some of these methods require the DRAWorkloadResourceClaims feature to be enabled.

When you define a workload, you can use Common Expression Language (CEL) to filter for specific device attributes or capacity. The available parameters for filtering depend on the device and the drivers.
If you directly reference a specific ResourceClaim in a Pod, that ResourceClaim must already exist in the same namespace as the Pod. If the ResourceClaim doesn't exist in the namespace, the Pod won't schedule. This behavior is similar to how a PersistentVolumeClaim must exist in the same namespace as a Pod that references it.
You can reference an auto-generated ResourceClaim in a Pod, but this isn't recommended because auto-generated ResourceClaims are bound to the lifetime of the Pod or PodGroup that triggered the generation.
To learn how to claim resources using one of these methods, see Allocate Devices to Workloads with DRA.
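For orientation, a claim made through a ResourceClaimTemplate and consumed by a Pod might look like the following sketch; the class name example-device-class and the container image are assumptions:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-device-template
spec:
  spec:
    devices:
      requests:
      - name: req-0
        exactly:
          deviceClassName: example-device-class
          count: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: app
    image: registry.example/app:latest
    resources:
      claims:
      - name: device   # refers to the entry in spec.resourceClaims
  resourceClaims:
  - name: device
    resourceClaimTemplateName: single-device-template
```

Kubernetes generates one ResourceClaim from the template for the Pod and allocates a matching device before scheduling the Pod.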
Kubernetes v1.36 [stable] (enabled by default)

You can provide a prioritized list of subrequests for requests in a ResourceClaim or ResourceClaimTemplate. The scheduler then selects the first subrequest that can be allocated. This allows users to specify alternative devices that can be used by the workload if the primary choice is not available.
In the example below, the ResourceClaimTemplate requests a device with the color black and the size large. If a device with those attributes is not available, the Pod cannot be scheduled. With the prioritized list feature, a second alternative can be specified, which requests two devices with the color white and the size small. The large black device is allocated if it is available. If it is not, but two small white devices are available, the Pod can still run.
```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: prioritized-list-claim-template
spec:
  spec:
    devices:
      requests:
      - name: req-0
        firstAvailable:
        - name: large-black
          deviceClassName: resource.example.com
          selectors:
          - cel:
              expression: |-
                device.attributes["resource-driver.example.com"].color == "black" &&
                device.attributes["resource-driver.example.com"].size == "large"
        - name: small-white
          deviceClassName: resource.example.com
          selectors:
          - cel:
              expression: |-
                device.attributes["resource-driver.example.com"].color == "white" &&
                device.attributes["resource-driver.example.com"].size == "small"
          count: 2
```
If the pod is eligible for multiple nodes in the cluster, the scheduler will use the index of chosen subrequests from any prioritized lists as one of the inputs when it scores each node. So nodes that can allocate devices requested in a higher ranked subrequest are more likely to be chosen than nodes that can only allocate devices for lower ranked subrequests.
The decision is made on a per-Pod basis, so if the Pod is a member of a ReplicaSet or similar grouping, you cannot rely on all the members of the group having the same subrequest chosen. Your workload must be able to accommodate this.
Kubernetes v1.36 [alpha] (disabled by default)

When you organize Pods with the Workload API, you can reserve ResourceClaims for entire PodGroups instead of individual Pods, and generate ResourceClaimTemplates for a PodGroup instead of a single Pod, allowing the Pods within a PodGroup to share access to devices allocated to the generated ResourceClaim.
This feature targets two problems:

- A ResourceClaim's status.reservedFor list can only contain 256 items. Since kube-scheduler only records individual Pods in that list, only 256 Pods can share a ResourceClaim. By allowing PodGroups to be recorded in status.reservedFor, many more than 256 Pods can share a ResourceClaim.
- ResourceClaims generated from a ResourceClaimTemplate are normally generated per Pod, so Pods created from the same template cannot share the allocated devices. Generating one ResourceClaim for an entire PodGroup lets the Pods in the group share access to the allocated devices.

The PodGroup API defines a spec.resourceClaims field with the same structure
and similar meaning as the spec.resourceClaims field in the Pod API:
```yaml
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-group
  namespace: some-ns
spec:
  ...
  resourceClaims:
  - name: pg-claim
    resourceClaimName: my-pg-claim
  - name: pg-claim-template
    resourceClaimTemplateName: my-pg-template
```
Like claims made by Pods, claims for PodGroups defining a resourceClaimName
refer to a ResourceClaim by name. Claims defining a resourceClaimTemplateName
refer to a ResourceClaimTemplate which replicates into one ResourceClaim for the
entire PodGroup that can be shared amongst its Pods.
When a Pod defines a claim with a name, resourceClaimName, and
resourceClaimTemplateName that all match one of its PodGroup's
spec.resourceClaims, then kube-scheduler reserves the ResourceClaim for the
PodGroup instead of the Pod. If the Pod's claim does not match one made by its
PodGroup, then kube-scheduler reserves the ResourceClaim for the Pod. In either
case, reservation is recorded in the ResourceClaim's status.reservedFor.
PodGroup reservations and the corresponding resource allocation persist in the
ResourceClaim until the PodGroup is deleted, even if the group no longer has any
Pods.
When a Pod claim matching a PodGroup claim defines a
resourceClaimTemplateName, then one ResourceClaim is generated for the
PodGroup. Other Pods in the group defining the same claim will share that
generated ResourceClaim instead of prompting a new ResourceClaim to be generated
for each Pod. Whether or not a resourceClaimTemplateName claim matches a
PodGroup claim, the name of the generated ResourceClaim is recorded in the Pod's
status.resourceClaimStatuses.
ResourceClaims generated from a ResourceClaimTemplate for a PodGroup follow the lifecycle of the PodGroup. The ResourceClaim is first created when both the PodGroup and its ResourceClaimTemplate exist. The ResourceClaim is deleted after the PodGroup has been deleted and the ResourceClaim is no longer reserved.
Consider the following example:
```yaml
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-group
  namespace: some-ns
spec:
  ...
  resourceClaims:
  - name: pg-claim
    resourceClaimName: my-pg-claim
  - name: pg-claim-template
    resourceClaimTemplateName: my-pg-template
---
apiVersion: v1
kind: Pod
metadata:
  name: training-group-pod-1
  namespace: some-ns
spec:
  ...
  schedulingGroup:
    podGroupName: training-group
  resourceClaims:
  - name: pod-claim
    resourceClaimName: my-pod-claim
  - name: pod-claim-template
    resourceClaimTemplateName: my-pod-template
  - name: pg-claim
    resourceClaimName: my-pg-claim
  - name: pg-claim-template
    resourceClaimTemplateName: my-pg-template
```
In this example, the training-group PodGroup has one Pod named training-group-pod-1.
The Pod's pod-claim and pod-claim-template claims do not match
any claim made by the PodGroup, so those claims are not affected by the
PodGroup: ResourceClaim my-pod-claim becomes reserved for the Pod and a
ResourceClaim is generated from ResourceClaimTemplate my-pod-template and also
becomes reserved for the Pod. The pg-claim and pg-claim-template do match
claims made by the PodGroup. ResourceClaim my-pg-claim becomes reserved for
the PodGroup and a ResourceClaim is generated from ResourceClaimTemplate
my-pg-template and also becomes reserved for the PodGroup.
Associating ResourceClaims with Workload API resources is an alpha feature and
only enabled when the DRAWorkloadResourceClaims feature gate
is enabled in the kube-apiserver, kube-controller-manager, kube-scheduler, and kubelet.
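As an illustration, with a kubeadm-managed cluster the gate could be switched on for the control plane components via the ClusterConfiguration; this is a sketch, and how you pass component flags depends on how your cluster is deployed:

```yaml
# Hypothetical kubeadm ClusterConfiguration fragment; kubelets additionally
# need DRAWorkloadResourceClaims=true in their own feature gates.
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
apiServer:
  extraArgs:
  - name: feature-gates
    value: DRAWorkloadResourceClaims=true
controllerManager:
  extraArgs:
  - name: feature-gates
    value: DRAWorkloadResourceClaims=true
scheduler:
  extraArgs:
  - name: feature-gates
    value: DRAWorkloadResourceClaims=true
```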
Each ResourceSlice represents one or more devices in a pool. The pool is managed by a device driver, which creates and manages ResourceSlices. The resources in a pool might be represented by a single ResourceSlice or span multiple ResourceSlices.
ResourceSlices provide useful information to device users and to the scheduler, and are crucial for dynamic resource allocation. Every ResourceSlice must include the following information:
Drivers use a controller to reconcile ResourceSlices in the cluster with the information that the driver has to publish. This controller overwrites any manual changes, such as cluster users creating or modifying ResourceSlices.
Consider the following example ResourceSlice:
```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: cat-slice
spec:
  driver: "resource-driver.example.com"
  pool:
    generation: 1
    name: "black-cat-pool"
    resourceSliceCount: 1
  # The allNodes field defines whether any node in the cluster can access the device.
  allNodes: true
  devices:
  - name: "large-black-cat"
    attributes:
      color:
        string: "black"
      size:
        string: "large"
      cat:
        bool: true
```
This ResourceSlice is managed by the resource-driver.example.com driver in the
black-cat-pool pool. The allNodes: true field indicates that any node in the
cluster can access the devices. There's one device in the ResourceSlice, named
large-black-cat, with the following attributes:
- color: black
- size: large
- cat: true

A DeviceClass could select this ResourceSlice by using these attributes, and a ResourceClaim could filter for specific devices in that DeviceClass.
The order in which the Kubernetes scheduler evaluates devices for allocation is determined by the lexicographical sorting of ResourceSlice and resource pool names. The scheduler uses a first-fit strategy, meaning it selects the first available device that satisfies the claim's requirements.
This allows the priority of resource allocation to be influenced by the names assigned to pools and ResourceSlices. Note that pools without binding conditions are always evaluated before those with binding conditions, regardless of their names.
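As a sketch, a driver that wants one pool tried before another can pick pool names whose lexicographic order matches the intended priority; all names below are hypothetical:

```yaml
# "pool-0-fast" sorts before "pool-1-slow", so during first-fit evaluation
# devices in this pool are considered first (binding conditions aside).
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: worker-1-pool-0-fast
spec:
  driver: resource-driver.example.com
  nodeName: worker-1
  pool:
    name: pool-0-fast
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: fast-device-0
---
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: worker-1-pool-1-slow
spec:
  driver: resource-driver.example.com
  nodeName: worker-1
  pool:
    name: pool-1-slow
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: slow-device-0
```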
For drivers built using the k8s.io/dynamic-resources/kubeletplugin Go package or
the ResourceSlice controller from that module, these components automatically handle
ResourceSlice naming to ensure they are evaluated in the order specified by the driver.
The following sections describe the workflow for the various types of DRA users and for the Kubernetes system during dynamic resource allocation.
ResourceSlice creation: drivers in the cluster create ResourceSlices that represent one or more devices in a managed pool of similar devices.
Workload creation: the cluster control plane checks new workloads for references to ResourceClaimTemplates or to specific ResourceClaims.
ResourceClaim generation: if the workload references a ResourceClaimTemplate, the resourceclaim-controller generates ResourceClaims for the workload.
ResourceSlice filtering: for every Pod, Kubernetes checks the ResourceSlices in the cluster to find a device that satisfies all of the following criteria:
Resource allocation: after finding an eligible ResourceSlice for a Pod's ResourceClaim, the Kubernetes scheduler updates the ResourceClaim with the allocation details. The scheduler uses a first-fit strategy and evaluates pools and ResourceSlices in lexicographical order by their names. Drivers can prioritize specific slices or pools by naming them appropriately. For details, see Naming and prioritization.
Pod scheduling: when resource allocation is complete, the scheduler places the Pod on a node that can access the allocated resource. The device driver and the kubelet on that node configure the device and the Pod's access to the device.
You can check the status of dynamically allocated resources by using any of the following methods:
The PodResourcesLister kubelet gRPC service lets you monitor in-use devices.
The DynamicResource message provides information that's specific to dynamic
resource allocation, such as the device name and the claim name. For details,
see
Monitoring device plugin resources.
Kubernetes v1.33 [beta] (enabled by default)

DRA drivers can report driver-specific
device status
data for each allocated device in the status.devices field of a ResourceClaim.
For example, the driver might list the IP addresses that are assigned to a
network interface device. Updating this field requires specific synthetic RBAC permissions,
see
Hardening Guide - Dynamic Resource Allocation
and
Harden Dynamic Resource Allocation in Your Cluster.
The accuracy of the information that a driver adds to a ResourceClaim
status.devices field depends on the driver. Evaluate drivers to decide whether
you can rely on this field as the only source of device information.
If you disable the
DRAResourceClaimDeviceStatus feature gate, the
status.devices field automatically gets cleared when storing the ResourceClaim.
ResourceClaim device status is supported when a DRA driver can update an
existing ResourceClaim to set the status.devices field.
For details about the status.devices field, see the
ResourceClaim API reference.
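For example, a network DRA driver might publish status like the following sketch; the driver name, device name, condition, and networkData values are illustrative, not taken from a real driver:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
...
status:
  devices:
  - driver: net-driver.example.com
    pool: pool-1
    device: nic-1
    conditions:
    - type: Ready
      status: "True"
      reason: Configured
      lastTransitionTime: "2025-11-05T18:15:37Z"
    networkData:
      interfaceName: eth1
      ips:
      - 10.9.1.2/24
```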
Kubernetes v1.36 [beta] (enabled by default)

Kubernetes provides a mechanism for monitoring and reporting the health of dynamically allocated infrastructure resources. For stateful applications running on specialized hardware, it is critical to know when a device has failed or become unhealthy. It is also helpful to find out if the device recovers.
To use this functionality, the ResourceHealthStatus feature gate must be enabled (beta and enabled by default since v1.36), and the DRA driver must implement the DRAResourceHealth gRPC service.
When a DRA driver detects that an allocated device has become unhealthy, it reports this status back to the kubelet. This health information is then exposed directly in the Pod's status. The kubelet populates the allocatedResourcesStatus field in the status of each container, detailing the health of each device assigned to that container. Each resource health entry can include an optional message field with additional human-readable context about the health status, such as error details or failure reasons.
If the kubelet does not receive a health update from a DRA driver within a timeout period, the device's health status is marked as "Unknown". DRA drivers can configure this timeout on a per-device basis by setting the health_check_timeout_seconds field in the DeviceHealth gRPC message. If not specified, the kubelet uses a default timeout of 30 seconds. This allows different hardware types (for example, GPUs, FPGAs, or storage devices) to use appropriate timeout values based on their health-reporting characteristics.
This provides crucial visibility for users and controllers to react to hardware failures. For a Pod that is failing, you can inspect this status to determine if the failure was related to an unhealthy device.
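In a Pod's status, the reported health might look like the following sketch; the resourceID format is driver-specific and the values here are illustrative:

```yaml
status:
  containerStatuses:
  - name: app
    allocatedResourcesStatus:
    - name: claim:device   # the Pod-level claim named "device"
      resources:
      - resourceID: resource-driver.example.com/worker-1-pool/device-1
        health: Unhealthy
        # Optional human-readable context supplied by the driver.
        message: "device reported uncorrectable memory errors"
```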
When you (or another API client) create a Pod with spec.nodeName already set, the scheduler is bypassed.
If a ResourceClaim needed by that Pod does not exist yet, is not allocated,
or is not reserved for the Pod, then the kubelet fails to run the Pod and
re-checks periodically, because those requirements might still be fulfilled later.
Such a situation can also arise when support for dynamic resource allocation was not enabled in the scheduler at the time when the Pod got scheduled (version skew, configuration, feature gate, and so on). kube-controller-manager detects this and tries to make the Pod runnable by reserving the required ResourceClaims. However, this only works if those ResourceClaims were already allocated by the scheduler for some other Pod.
It is better to avoid bypassing the scheduler because a Pod that is assigned to a node blocks normal resources (RAM, CPU) that then cannot be used for other Pods while the Pod is stuck. To make a Pod run on a specific node while still going through the normal scheduling flow, create the Pod with a node selector that exactly matches the desired node:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-cats
spec:
  nodeSelector:
    kubernetes.io/hostname: name-of-the-intended-node
  ...
```
You may also be able to mutate the incoming Pod, at admission time, to unset
the .spec.nodeName field and to use a node selector instead.
The following sections describe DRA features that support advanced use cases. Usage of them is optional and may only be relevant with DRA drivers that support them.
Some of them are available in the Alpha or Beta feature stage. Those depend on feature gates and may depend on additional API groups. For more information, see Set up DRA in the cluster.
Kubernetes v1.36 [stable] (enabled by default)

You can mark a request in a ResourceClaim or ResourceClaimTemplate as having privileged features for maintenance and troubleshooting tasks. A request with admin access grants access to in-use devices and may enable additional permissions when making the device available in a container:
```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: large-black-cat-claim-template
spec:
  spec:
    devices:
      requests:
      - name: req-0
        exactly:
          deviceClassName: resource.example.com
          allocationMode: All
          adminAccess: true
```
Admin access is a privileged mode and should not be granted to regular users in
multi-tenant clusters. Only users authorized to
create ResourceClaim or ResourceClaimTemplate objects in namespaces labeled with
resource.kubernetes.io/admin-access: "true" (case-sensitive) can use the
adminAccess field. This ensures that non-admin users cannot misuse the
feature.
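For example, a cluster admin might create a dedicated namespace for maintenance workloads and label it accordingly; the namespace name is an example:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: device-maintenance
  labels:
    # Only ResourceClaims and ResourceClaimTemplates created in namespaces
    # with this label may set adminAccess: true.
    resource.kubernetes.io/admin-access: "true"
```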
Admin access is a beta feature and is enabled by default with the
DRAAdminAccess feature gate
in the kube-apiserver, kube-scheduler, and kubelet.
Kubernetes v1.36 [beta] (enabled by default)

Starting in Kubernetes v1.36, DRA enforces fine-grained authorization checks for updates
to ResourceClaim status by using synthetic subresources and node-aware verbs.
For security hardening guidance, including RBAC examples for scheduler and DRA drivers, see Hardening Guide - Dynamic Resource Allocation.
For a step-by-step cluster administrator procedure, see Harden Dynamic Resource Allocation in Your Cluster.
The following sections describe DRA features that are available in the Alpha or Beta feature stage. They depend on enabling feature gates and may depend on additional API groups. For more information, see Set up DRA in the cluster.
Kubernetes v1.36 [beta] (enabled by default)

You can provide an extended resource name for a DeviceClass. The scheduler then selects the devices matching the class for the extended resource requests. This allows users to continue using extended resource requests in a Pod to request either extended resources provided by a device plugin, or DRA devices. The same extended resource can be provided either by a device plugin or by DRA on a single cluster node, and by a device plugin on some nodes and DRA on other nodes in the same cluster.
In the example below, the DeviceClass is given the extendedResourceName example.com/gpu.
If a Pod requests the extended resource example.com/gpu: 2, it can be scheduled to
a node with two or more devices matching the DeviceClass.
```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  selectors:
  - cel:
      expression: |-
        device.driver == 'gpu.example.com' &&
        device.attributes['gpu.example.com'].type == 'gpu'
  extendedResourceName: example.com/gpu
```
In addition, users can use a special extended resource name to allocate devices
without having to explicitly create a ResourceClaim. The name is formed from the
prefix deviceclass.resource.kubernetes.io/ followed by the DeviceClass name.
This works for any DeviceClass, even if it does not specify an extended resource name.
The resulting ResourceClaim will contain a request for an ExactCount of the
specified number of devices of that DeviceClass.
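For example, a Pod could request one device from the gpu.example.com DeviceClass purely through its container resources; this sketch assumes that DeviceClass exists, and the image name is a placeholder:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: implicit-claim-pod
spec:
  containers:
  - name: app
    image: registry.example/app:latest
    resources:
      # Special extended resource name: prefix + DeviceClass name.
      requests:
        deviceclass.resource.kubernetes.io/gpu.example.com: "1"
      limits:
        deviceclass.resource.kubernetes.io/gpu.example.com: "1"
```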
Extended resource allocation by DRA is a beta feature and is enabled by default with the
DRAExtendedResource feature gate
in the kube-apiserver, kube-scheduler, kube-controller-manager, and kubelet.
Kubernetes v1.36 [beta] (enabled by default)

Devices represented in DRA don't necessarily have to be a single unit connected to a single machine; they can also be a logical device composed of multiple devices connected to multiple machines. These devices might consume overlapping resources of the underlying physical devices, meaning that when one logical device is allocated, other devices will no longer be available.
In the ResourceSlice API, this is represented as a list of named CounterSets, each of which contains a set of named counters. The counters represent the resources available on the physical device that are used by the logical devices advertised through DRA.
Logical devices can specify the ConsumesCounters list. Each entry contains a reference to a CounterSet and a set of named counters with the amounts they will consume. So for a device to be allocatable, the referenced counter sets must have sufficient quantity for the counters referenced by the device.
CounterSets must be specified in separate ResourceSlices from devices. Devices can consume counters from any CounterSet defined in the same resource pool as the device.
Here is an example of two devices, each consuming 6Gi of memory from a shared counter with 8Gi of memory. Thus, only one of the devices can be allocated at any point in time. The scheduler handles this and it is transparent to the consumer as the ResourceClaim API is not affected.
```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: resourceslice-with-countersets
spec:
  nodeName: worker-1
  pool:
    name: pool
    generation: 1
    resourceSliceCount: 2
  driver: dra.example.com
  sharedCounters:
  - name: gpu-1-counters
    counters:
      memory:
        value: 8Gi
---
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: resourceslice-with-devices
spec:
  nodeName: worker-1
  pool:
    name: pool
    generation: 1
    resourceSliceCount: 2
  driver: dra.example.com
  devices:
  - name: device-1
    consumesCounters:
    - counterSet: gpu-1-counters
      counters:
        memory:
          value: 6Gi
  - name: device-2
    consumesCounters:
    - counterSet: gpu-1-counters
      counters:
        memory:
          value: 6Gi
```
Partitionable devices are a beta feature, enabled when the
DRAPartitionableDevices feature gate
is kept enabled in the kube-apiserver and kube-scheduler.
Kubernetes v1.36 [beta] (enabled by default)

The consumable capacity feature allows the same device to be consumed by multiple independent ResourceClaims, with the Kubernetes scheduler managing how much of the device's capacity is used up by each claim. This is analogous to how Pods can share the resources on a Node; ResourceClaims can share the resources on a Device.
The device driver can set the allowMultipleAllocations field in .spec.devices of a ResourceSlice
to allow allocating that device to multiple independent ResourceClaims, or to multiple requests within a ResourceClaim.
Users can set the capacity field in spec.devices.requests of a ResourceClaim to specify the device resource requirements for each allocation.
For a device that allows multiple allocations, the requested capacity is drawn from (consumed from) its total capacity,
a concept known as consumable capacity.
The scheduler then ensures that the aggregate consumed capacity across all claims does not exceed the device's overall capacity.
Furthermore, driver authors can use requestPolicy constraints on individual device capacities to control
how those capacities are consumed.
For example, a driver author can specify that a given capacity can only be consumed in increments of 1Gi.
Here is an example of a network device which allows multiple allocations and contains a consumable bandwidth capacity.
```yaml
kind: ResourceSlice
apiVersion: resource.k8s.io/v1
metadata:
  name: resourceslice
spec:
  nodeName: worker-1
  pool:
    name: pool
    generation: 1
    resourceSliceCount: 1
  driver: dra.example.com
  devices:
  - name: eth1
    allowMultipleAllocations: true
    attributes:
      name:
        string: "eth1"
    capacity:
      bandwidth:
        requestPolicy:
          default: "1M"
          validRange:
            min: "1M"
            step: "8"
        value: "10G"
```
The consumable capacity can be requested as shown in the following example.
```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: bandwidth-claim-template
spec:
  spec:
    devices:
      requests:
      - name: req-0
        exactly:
          deviceClassName: resource.example.com
          capacity:
            requests:
              bandwidth: 1G
```
The allocation result will include the consumed capacity and the identifier of the share.
```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
...
status:
  allocation:
    devices:
      results:
      - consumedCapacity:
          bandwidth: 1G
        device: eth1
        shareID: "a671734a-e8e5-11e4-8fde-42010af09327"
```
In this example, a multiply-allocatable device was chosen. However, any resource.example.com device
with at least the requested 1G of bandwidth could have met the requirement.
If a non-multiply-allocatable device had been chosen, the allocation would have consumed the entire device.
To force the use of only multiply-allocatable devices, you can use the CEL expression device.allowMultipleAllocations == true.
When requesting multiple devices in a ResourceClaim, you can use the DistinctAttribute constraint to ensure that each allocated device has a different value for a specified attribute. This constraint was introduced with the consumable capacity feature.
The DistinctAttribute constraint is particularly useful when working with multiply-allocatable devices. It prevents the scheduler from allocating the same device multiple times within a single ResourceClaim, even when that device allows multiple allocations.
Beyond preventing duplicate allocations, this constraint helps optimize performance by ensuring devices are distributed based on their attributes. For example, you can use it to distribute devices across different NUMA nodes to optimize memory bandwidth and reduce contention.
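A sketch of such a constraint follows; the attribute name resource-driver.example.com/numaNode is hypothetical, and the exact field layout should be checked against your cluster's API version:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: distinct-devices-claim
spec:
  devices:
    requests:
    - name: req-0
      exactly:
        deviceClassName: resource.example.com
        count: 2
    constraints:
    - requests: ["req-0"]
      # Each allocated device must report a different value
      # for this attribute.
      distinctAttribute: resource-driver.example.com/numaNode
```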
Kubernetes v1.36 [beta] (enabled by default)

Device taints are similar to node taints: a taint has a string key, a string value, and an effect. The effect is applied to the ResourceClaim which is using a tainted device and to all Pods referencing that ResourceClaim. The "NoSchedule" effect prevents scheduling those Pods. Tainted devices are ignored when trying to allocate a ResourceClaim, because using them would prevent scheduling of Pods.
The "NoExecute" effect implies "NoSchedule" and in addition causes eviction of all Pods which have been scheduled already. This eviction is implemented in the device taint eviction controller in kube-controller-manager by deleting affected Pods.
The "None" effect is ignored by the scheduler and eviction controller. DRA drivers can use it to communicate exceptions to admins or other controllers, like for example degraded health of a device. Admins can also use it to do dry-runs of pod eviction in DeviceTaintRules (more on that below).
ResourceClaims can tolerate taints. If a taint is tolerated, its effect does not apply. An empty toleration matches all taints. A toleration can be limited to certain effects and/or match certain key/value pairs. A toleration can check that a certain key exists, regardless which value it has, or it can check for specific values of a key. For more information on this matching see the node taint concepts.
Eviction can be delayed by tolerating a taint for a certain duration. That delay starts at the time when a taint gets added to a device, which is recorded in a field of the taint.
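A toleration that delays eviction might be sketched as follows; the taint key is hypothetical, and the placement of tolerations under a request should be checked against your cluster's API version:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: tolerating-claim-template
spec:
  spec:
    devices:
      requests:
      - name: req-0
        exactly:
          deviceClassName: resource.example.com
          tolerations:
          # Tolerate the hypothetical taint for 5 minutes before
          # the Pods using this claim get evicted.
          - key: dra.example.com/unhealthy
            operator: Exists
            effect: NoExecute
            tolerationSeconds: 300
```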
Taints apply as described above also to ResourceClaims allocating "all" devices on a node. All devices must be untainted or all of their taints must be tolerated. Allocating a device with admin access (described above) is not exempt either. An admin using that mode must explicitly tolerate all taints to access tainted devices.
Device taints and tolerations are a beta feature, enabled when the
DRADeviceTaints feature gate
is kept enabled in the kube-apiserver, kube-controller-manager, and kube-scheduler.
To use DeviceTaintRules, the resource.k8s.io/v1beta2 API version must be
enabled together with the DRADeviceTaintRules feature gate.
In contrast to DRADeviceTaints, DRADeviceTaintRules is off by default because of this dependency
on the beta API group, which has to be off by default.
You can add taints to devices in the following ways.
A DRA driver can add taints to the device information that it publishes in ResourceSlices. Consult the documentation of a DRA driver to learn whether the driver uses taints and what their keys and values are.
Kubernetes v1.36 [beta] (disabled by default)

An admin or a control plane component can taint devices without having to tell the DRA driver to include taints in its device information in ResourceSlices. They do that by creating DeviceTaintRules. Each DeviceTaintRule adds one taint to devices which match the device selector. Without such a selector, no devices are tainted. This makes it harder to accidentally evict all pods using ResourceClaims when leaving out the selector by mistake.
Devices can be selected by giving the name of a DeviceClass, driver, pool, and/or device. The DeviceClass selects all devices that are selected by the selectors in that DeviceClass. With just the driver name, an admin can taint all devices managed by that driver, for example while doing some kind of maintenance of that driver across the entire cluster. Adding a pool name can limit the taint to a single node, if the driver manages node-local devices.
Finally, adding the device name can select one specific device. The device name and pool name can also be used alone, if desired. For example, drivers for node-local devices are encouraged to use the node name as their pool name. Then tainting with that pool name automatically taints all devices on a node.
Drivers might use stable names like "gpu-0" that hide which specific device is currently assigned to that name. To support tainting a specific hardware instance, CEL selectors can be used in a DeviceTaintRule to match a vendor-specific unique ID attribute, if the driver supports one for its hardware.
The taint applies as long as the DeviceTaintRule exists. It can be modified and removed at any time. Here is one example of a DeviceTaintRule for a fictional DRA driver:
apiVersion: resource.k8s.io/v1beta2
kind: DeviceTaintRule
metadata:
  name: example
spec:
  # The entire hardware installation for this
  # particular driver is broken.
  # Evict all pods and don't schedule new ones.
  deviceSelector:
    driver: dra.example.com
  taint:
    key: dra.example.com/unhealthy
    value: Broken
    effect: NoExecute
The kube-apiserver automatically tracks when this taint was created by setting the
timeAdded field in the spec. The toleration period starts at that timestamp.
During updates that change the effect (see the simulated eviction flow
below), the kube-apiserver automatically updates the timestamp. Users can control
the timestamp explicitly by setting the field when creating a DeviceTaintRule, or
by changing it to a different value when updating one.
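For example, to start the toleration period at a chosen time rather than at creation time, the timestamp can be set explicitly. This is a sketch for the same fictional driver; the timeAdded field sits under spec.taint, matching the describe output shown in this section:

```yaml
apiVersion: resource.k8s.io/v1beta2
kind: DeviceTaintRule
metadata:
  name: example-explicit-time
spec:
  deviceSelector:
    driver: dra.example.com
  taint:
    key: dra.example.com/unhealthy
    value: Broken
    effect: NoExecute
    # Explicit timestamp: the toleration period starts here
    # instead of at the moment the rule was created.
    timeAdded: "2025-11-05T18:00:00Z"
```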
The status contains a condition added by the eviction controller:
kubectl describe devicetaintrules
Name:         example
...
Spec:
  Device Selector:
    Driver:  dra.example.com
  Taint:
    Effect:      NoExecute
    Key:         dra.example.com/unhealthy
    Time Added:  2025-11-05T18:15:37Z
    Value:       Broken
Status:
  Conditions:
    Last Transition Time:  2025-11-05T18:15:37Z
    Message:               1 pod evicted since starting the controller.
    Observed Generation:   1
    Reason:                Completed
    Status:                False
    Type:                  EvictionInProgress
Events:                    <none>
Pods get evicted by deleting them. Usually this happens very quickly, except when a toleration for the taint delays eviction for a certain period or when many pods need to be evicted. When it takes longer, the message provides information about the current status:
2 pods need to be evicted in 2 different namespaces. 1 pod evicted since starting the controller.
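A toleration that delays eviction is declared on the device request in the ResourceClaim. The following is a minimal sketch, assuming the resource.k8s.io/v1 ResourceClaim layout with a hypothetical DeviceClass name; check the device taints API documentation for the exact field placement:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: tolerant-claim
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: example-class   # hypothetical DeviceClass
        tolerations:
        - key: dra.example.com/unhealthy
          operator: Exists
          effect: NoExecute
          # Keep running for 5 minutes after the device is tainted.
          tolerationSeconds: 300
```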
The condition can be used to check whether an eviction is currently active:
kubectl wait --for=condition=EvictionInProgress=false DeviceTaintRule/example
Beware of a potential race between the scheduler and the controller observing the
new taint at different times: pods may still be scheduled at a time when the
controller thinks that there are none left to evict and therefore sets this
condition to False. In practice, this race is made very unlikely by updating
the status only after an intentional delay of a few seconds.
For effect: None, the message provides information about the number of
affected devices, how many of those are allocated, and how many pods would be
evicted if the effect was NoExecute. This can be used to do a dry-run before
actually triggering eviction:
1. Create a DeviceTaintRule with the desired selectors and effect: None.
2. Review the message:

   3 published devices selected. 1 allocated device selected.
   1 pod would be evicted in 1 namespace if the effect was NoExecute.
   This information will not be updated again. Recreate the DeviceTaintRule to trigger an update.

   Published devices are those listed in ResourceSlices. Tainting them prevents allocation for new pods. Only allocated devices cause eviction of the pods using them.
3. Edit the DeviceTaintRule and change the effect to NoExecute.
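The dry-run step might look like the following sketch for a fictional driver; the taint layout mirrors the NoExecute example earlier, with only the effect changed:

```yaml
apiVersion: resource.k8s.io/v1beta2
kind: DeviceTaintRule
metadata:
  name: maintenance-dry-run
spec:
  deviceSelector:
    driver: dra.example.com
    pool: worker-1   # hypothetical pool name limiting the rule to one node
  taint:
    key: dra.example.com/maintenance
    value: Planned
    effect: None     # dry run: report impact without evicting anything
```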
Kubernetes v1.36 [alpha](disabled by default)
You can query the availability of devices in resource pools using the ResourcePoolStatusRequest API. This provides visibility into how many devices are available, allocated, or unavailable across your cluster's DRA resource pools.
To check resource pool status:
Create a ResourcePoolStatusRequest specifying the driver name (required) and optionally a limit on the number of pools returned. You can also limit it to a single pool by specifying a pool name:
apiVersion: resource.k8s.io/v1beta2
kind: ResourcePoolStatusRequest
metadata:
  name: check-gpus
spec:
  driver: example.com/gpu
  # Optional: filter to a specific pool
  # poolName: my-pool
  # Optional: limit number of pools returned (default: 100, max: 1000)
  # limit: 10
Wait for the controller to process the request:
kubectl wait --for=condition=Complete resourcepoolstatusrequest/check-gpus --timeout=30s
Read the status to see pool availability:
kubectl get resourcepoolstatusrequest/check-gpus -o yaml
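The populated status might look like the following sketch; all values are hypothetical, and the field names are those described in the list that follows:

```yaml
status:
  poolCount: 2
  pools:
  - driver: example.com/gpu
    poolName: my-pool
    generation: 3
    resourceSliceCount: 1
    totalDevices: 8
    allocatedDevices: 5
    unavailableDevices: 1
    # availableDevices = totalDevices - allocatedDevices - unavailableDevices
    availableDevices: 2
    nodeName: worker-1
  conditions:
  - type: Complete
    status: "True"
```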
The status includes:
- poolCount: total number of pools matching the filter (may exceed the number
  of pools listed if truncated by the limit).
- pools: a list of pool details, each containing:
  - driver and poolName: identify the pool.
  - generation: the latest pool generation observed across ResourceSlices.
  - resourceSliceCount: the number of ResourceSlices making up the pool.
  - totalDevices: total devices in the pool.
  - allocatedDevices: devices currently allocated to claims.
  - availableDevices: devices available for allocation
    (totalDevices - allocatedDevices - unavailableDevices).
  - unavailableDevices: devices not available due to taints or other conditions.
  - nodeName: the node associated with the pool, if any.
  - validationError: set when the pool's data could not be fully validated
    (for example, during a generation rollout). When set, device count fields
    may be unset.
- conditions: includes Complete (success) or Failed (error) condition types.

Delete the request when done:
kubectl delete resourcepoolstatusrequest/check-gpus
ResourcePoolStatusRequest objects are processed once by a controller in kube-controller-manager. The spec is immutable once created, and the entire object becomes immutable once the status is populated. To get updated availability data, delete and recreate the request. Completed requests are automatically cleaned up after 1 hour.
This feature requires explicit RBAC permissions on the ResourcePoolStatusRequest resource. No default ClusterRoles include this permission.
Resource pool status is an alpha feature and only enabled when the
DRAResourcePoolStatus feature gate
is enabled in the kube-apiserver and kube-controller-manager.
Kubernetes v1.36 [beta](enabled by default)
Device Binding Conditions allow the Kubernetes scheduler to delay Pod binding until external resources, such as fabric-attached GPUs or reprogrammable FPGAs, are confirmed to be ready.
This waiting behavior is implemented in the PreBind phase of the scheduling framework. During this phase, the scheduler checks whether all required device conditions are satisfied before proceeding with binding.
This improves scheduling reliability by avoiding premature binding and enables coordination with external device controllers.
To use this feature, device drivers (typically managed by driver owners) must publish the
following fields in the Device section of a ResourceSlice. Cluster administrators
must enable the DRADeviceBindingConditions and DRAResourceClaimDeviceStatus feature
gates for the scheduler to honor these fields.
- bindingConditions: condition types that must all become True (in the
  status.conditions field of the associated ResourceClaim) before the Pod can
  be bound. These conditions typically represent readiness signals, such as
  DeviceAttached or DeviceInitialized.
- bindingFailureConditions: condition types that indicate failure. If any of
  these conditions becomes True, the scheduler clears the allocation and
  reschedules the Pod.
- bindsToNode: when set to true, the scheduler records the selected node name in the
  status.allocation.nodeSelector field of the ResourceClaim.
  This does not affect the Pod's spec.nodeSelector. Instead, it sets a node selector
  inside the ResourceClaim, which external controllers can use to perform node-specific
  operations such as device attachment or preparation.

All condition types listed in bindingConditions and bindingFailureConditions are evaluated
from the status.conditions field of the ResourceClaim.
External controllers are responsible for updating these conditions using standard Kubernetes
condition semantics (type, status, reason, message, lastTransitionTime).
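For example, an external controller reporting device readiness might patch the ResourceClaim status with a condition like the following. This is a sketch: the condition type matches the ResourceSlice example later in this section, and the reason and message values are hypothetical:

```yaml
status:
  conditions:
  - type: dra.example.com/is-prepared
    status: "True"
    reason: DeviceAttached          # hypothetical reason
    message: Device attached to the selected node
    lastTransitionTime: "2025-11-05T18:15:37Z"
```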
The scheduler waits up to 600 seconds (default) for all bindingConditions to become True.
If the timeout is reached or any bindingFailureConditions are True, the scheduler
clears the allocation and reschedules the Pod.
A cluster administrator can configure this timeout duration by editing the kube-scheduler configuration file.
An example of configuring this timeout in KubeSchedulerConfiguration is given below:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: DynamicResources
    args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: DynamicResourcesArgs
      bindingTimeout: 60s
Here is an example of a ResourceSlice that you might see in a cluster where there's a DRA driver in use, and that driver supports binding conditions:
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: gpu-slice-1
spec:
  driver: dra.example.com
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: accelerator-type
        operator: In
        values:
        - "high-performance"
  pool:
    name: gpu-pool
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-1
    attributes:
      vendor:
        string: "example"
      model:
        string: "example-gpu"
    bindsToNode: true
    bindingConditions:
    - dra.example.com/is-prepared
    bindingFailureConditions:
    - dra.example.com/preparing-failed
This example ResourceSlice has the following properties:
- The node selector restricts scheduling to nodes labeled accelerator-type=high-performance,
  so that the scheduler uses only a specific set of eligible nodes.
- Because bindsToNode is true, the scheduler picks a node (for example, node-3) and sets
  the status.allocation.nodeSelector field in the ResourceClaim to that node name.
- The dra.example.com/is-prepared binding condition indicates that the device gpu-1
  must be prepared (the is-prepared condition has a status of True) before binding.
- If gpu-1 device preparation fails (the preparing-failed condition has a status of True), the scheduler aborts binding.

Device binding conditions is a beta feature and is enabled by default, controlled by the
DRADeviceBindingConditions feature gate
in the kube-apiserver and kube-scheduler.
Kubernetes v1.36 [alpha](disabled by default)
Devices managed by DRA can have an underlying footprint composed of node-allocatable
resources, such as cpu, memory, hugepages, or ephemeral-storage.
This feature integrates these DRA-based requests into the scheduler's standard
accounting alongside regular Pod spec requests for these resources.
Users (PodSpec authors) can use a mixture of Pod-level resources, container-level resources, and resource claims with associated node-allocatable resources. These devices represent resources like CPUs or memory directly, or they could be accelerators, network interface cards, or other devices that require some host resources when allocated. The DRA driver will populate information in the ResourceSlice that tells the scheduler how to calculate the node allocatable resources when the device is allocated to a ResourceClaim. PodSpec authors do not need to make that calculation themselves.
When authoring a PodSpec using claims for these types of devices, there are a few things to be aware of:
DRA drivers declare this node allocatable resource footprint using the
nodeAllocatableResourceMappings field on devices within a ResourceSlice.
This mapping translates the requested DRA device or capacity into standard
resources that are tracked in the node's status.allocatable (note that extended
resources are not supported for this mapping). This is useful both for drivers that directly
expose native resources (like a CPU or Memory DRA driver) and for devices that
require auxiliary node dependencies (like an accelerator that needs host memory).
This mapping defines the translation of the requested DRA device or capacity units to the corresponding quantity of the node-allocatable resource. The scheduler calculates the exact quantity using:
- If capacityKey is not set, the allocationMultiplier multiplies the device
  count allocated to the claim. The allocationMultiplier defaults to 1 if not
  specified.
- If capacityKey is set, it references a capacity name defined in the device's
  capacity map. The scheduler looks up the amount of that capacity consumed by
  the claim and multiplies it by the allocationMultiplier.

Here is an example where a CPU DRA driver exposes a CPU socket as a pool of 128
CPUs using DRA consumable capacity. The capacityKey links the consumed
cpu.example.com/cpu capacity directly to the node's standard cpu
allocatable resource:
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: my-node-cpus
spec:
  driver: cpu.example.com
  nodeName: my-node
  pool:
    name: socket-cpus
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: socket0cpus
    allowMultipleAllocations: true
    capacity:
      "cpu.example.com/cpu": "128"
    nodeAllocatableResourceMappings:
      cpu:
        capacityKey: "cpu.example.com/cpu"
        # allocationMultiplier defaults to 1 if omitted
  - name: socket1cpus
    allowMultipleAllocations: true
    capacity:
      "cpu.example.com/cpu": "128"
    nodeAllocatableResourceMappings:
      cpu:
        capacityKey: "cpu.example.com/cpu"
        # allocationMultiplier defaults to 1 if omitted
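As a worked example of the calculation rules above: if a ResourceClaim consumes 4 units of the cpu.example.com/cpu capacity from socket0cpus, the scheduler charges the following against the node's cpu allocatable:

```
accounted cpu = consumed capacity × allocationMultiplier
              = 4 × 1            (multiplier defaults to 1)
              = 4 CPUs
```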
Here is an example of a resource slice where an accelerator requires an additional 8Gi of memory per device instance to function:
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: my-node-xpus
spec:
  driver: xpu.example.com
  nodeName: my-node
  pool:
    name: xpu-pool
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: xpu-model-x-001
    attributes:
      example.com/model:
        string: "model-x"
    nodeAllocatableResourceMappings:
      memory:
        allocationMultiplier: "8Gi"
After a Pod is successfully bound to the node, the exact quantities of
node-allocatable resources allocated via DRA are included in the Pod's
status.nodeAllocatableResourceClaimStatuses field.
Node-allocatable resources is an alpha feature and is enabled when the
DRANodeAllocatableResources feature gate is enabled in the kube-apiserver,
kube-scheduler, and kubelet. In the alpha phase, the kubelet does not account
for these resources when determining QoS classes, configuring cgroups, or making
eviction decisions.
Kubernetes v1.36 [alpha]
DRA drivers can expose device metadata such as device attributes (PCI bus addresses or mdevUUID for mediated devices) or network configuration directly to containers as JSON files. This lets applications inside the container discover information about allocated devices without querying the Kubernetes API or building custom controllers.
KEP-5304 defines a device metadata protocol that drivers must follow so applications inside the container see a consistent layout across drivers and clusters. The DRA kubelet plugin library implements this protocol for you; the rest of this section describes how to use it.
Device metadata follows the same rules as device access: it is available inside a container only when that container requests the device in its container specification, and not otherwise. For how to request DRA devices in Pods and containers, see Request devices in workloads using DRA.
The protocol consists of four rules:
File paths. Metadata files live inside containers under
/var/run/kubernetes.io/dra-device-attributes. For a directly referenced
ResourceClaim the path is
resourceclaims/<claimName>/<requestName>/<driverName>-metadata.json; for a
claim created from a ResourceClaimTemplate the path is
resourceclaimtemplates/<podClaimName>/<requestName>/<driverName>-metadata.json
(where podClaimName is pod.spec.resourceClaims[].name).
In cases where the ResourceClaim request uses the
prioritized list feature, only the top-level request
name is used for the <requestName> segment in the file path (that is,
the /<subrequest> portion is dropped). Inside the
JSON file, the requests[].name field carries the full
<request>/<subrequest> reference (for example, gpu/high-memory) so
that consumers can identify which alternative was allocated.
The path constants are defined in
k8s.io/dynamic-resource-allocation/api/metadata.
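As an illustration using the hypothetical names from the JSON example later in this section (a Pod claim named gpu created from a ResourceClaimTemplate, a request named gpu, and the driver gpu.example.com), the metadata file would appear inside the container at:

```
/var/run/kubernetes.io/dra-device-attributes/resourceclaimtemplates/gpu/gpu/gpu.example.com-metadata.json
```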
JSON API. Each file is a stream of one or more
DeviceMetadata
objects serialized as versioned JSON with apiVersion and kind, following
Kubernetes API conventions. The same metadata is encoded once per supported
API version (newest first). All objects in the stream are semantically
equivalent; consumers should use the first object they can decode.
Generation. When a driver updates a metadata file the embedded
metadata.generation field must increase so consumers can detect changes.
Container exposure. Files are typically exposed via CDI bind-mounts, but other mechanisms are permitted as long as the file appears at the correct path and is read-only inside the container.
Device metadata is a driver-side feature that does not require any Kubernetes
API changes or feature gates. Using the DRA kubelet plugin library is a common
way to implement a driver, but drivers can be built in other ways as well.
Drivers that use the kubelet plugin enable this feature by passing the
EnableDeviceMetadata and MetadataVersions
options
when starting the plugin. MetadataVersions specifies which API versions are
serialized into the metadata file and must be set explicitly by the driver.
Check the documentation of your DRA driver to learn whether device metadata is
supported and how to enable it.
When device metadata is enabled, the driver generates metadata files and CDI bind-mount specifications while preparing the allocated devices for the pod, before the consuming containers start. The metadata appears inside containers at the well-known paths as defined above.
When a single request allocates devices from multiple DRA drivers, each driver
writes its own metadata file. Containers enumerate *-metadata.json files in
the request directory to discover all devices.
The Go package
k8s.io/dynamic-resource-allocation/devicemetadata
provides utilities for reading and decoding these metadata files by applications
inside the container.
Each metadata file conforms to the
DeviceMetadata
API (metadata.resource.k8s.io/v1alpha1).
The following example shows a metadata file for a GPU device allocated through
a ResourceClaimTemplate:
{
  "kind": "DeviceMetadata",
  "apiVersion": "metadata.resource.k8s.io/v1alpha1",
  "metadata": {
    "name": "pod0-gpu-2kqrd",
    "namespace": "gpu-test1",
    "uid": "c7e7b22e-239b-4498-b27c-7f1344481e14",
    "generation": 1
  },
  "podClaimName": "gpu",
  "requests": [
    {
      "name": "gpu",
      "devices": [
        {
          "driver": "gpu.example.com",
          "pool": "worker-0",
          "name": "gpu-0",
          "attributes": {
            "driverVersion": {
              "version": "1.0.0"
            },
            "index": {
              "int": 0
            },
            "model": {
              "string": "LATEST-GPU-MODEL"
            },
            "uuid": {
              "string": "gpu-18db0e85-99e9-c746-8531-ffeb86328b39"
            }
          }
        }
      ]
    }
  ]
}
Drivers provide metadata in one of two ways:
- Statically: the driver writes the metadata file once while preparing the
  devices and does not change it afterwards.
- Dynamically: the driver updates the metadata file during the lifetime of the
  Pod, incrementing metadata.generation so consumers can detect changes. The
  MetadataUpdater API in the DRA kubelet plugin library handles generation
  bookkeeping automatically for driver authors.

In both cases, metadata remains available to each consuming container for the lifetime of that container. Metadata files are cleaned up after all containers in the Pod have terminated.
To learn how to use device metadata in your workloads, see Access DRA device metadata.
Custom, hand-crafted drivers that do not use the DRA kubelet plugin library
must implement the device metadata protocol
themselves. That means writing DeviceMetadata JSON at the correct file paths,
incrementing metadata.generation on every update, and exposing the files
read-only inside the container through CDI or an equivalent mechanism.
Kubernetes v1.36 [alpha](disabled by default)
This feature improves the ResourceSlice API, allowing DRA drivers to specify list values for device attributes instead of only scalars. This is useful for modeling more complex internal node topologies, for example when a CPU has adjacency to multiple PCIe roots.
For ResourceClaim authors (end users), this means that the matchAttribute and distinctAttribute constraints work better for these cases:
- matchAttribute: the attribute values must have a non-empty list intersection,
  rather than be identical (scalar values are treated as single-item lists).
  This just means that if one driver publishes a single value for, say, the
  PCIe root, and another driver publishes a list, the constraint is met as long
  as the single value appears somewhere in the list.
- distinctAttribute: the attribute values must be pairwise disjoint (no value
  shared between any two devices).

To help ResourceClaim authors use attributes that may be lists inside CEL expressions, this feature also introduces an includes() CEL function.
# Scalar attribute (backward compatible)
# assume: device.attributes["dra.example.com"].model = "model-a"
device.attributes["dra.example.com"].model.includes("model-a") # true
device.attributes["dra.example.com"].model.includes("model-b") # false
# List-type attribute (requires DRAListTypeAttributes)
# assume: device.attributes["dra.example.com"].supported-models = ["model-a", "model-b"]
device.attributes["dra.example.com"].supported-models.includes("model-a") # true
device.attributes["dra.example.com"].supported-models.includes("model-c") # false
By default, each DeviceAttribute holds exactly one scalar value: a boolean, an integer,
a string, or a semantic version string. The DRAListTypeAttributes feature gate extends
DeviceAttribute with four list-type fields, allowing a device to advertise multiple
values for a single attribute:
- bools — a list of boolean values
- ints — a list of 64-bit integer values
- strings — a list of strings (each at most 64 characters)
- versions — a list of semantic version strings per the semver.org 2.0.0 spec
  (each at most 64 characters)

The total number of individual attribute values per device (scalar fields plus all list elements combined) is limited to 48. When any device in a ResourceSlice uses list-type attributes or other advanced features such as taints, the ResourceSlice is limited to at most 64 devices.
Here is an example of a device advertising multiple supported models using a list-type string attribute:
kind: ResourceSlice
apiVersion: resource.k8s.io/v1
metadata:
  name: example-resourceslice
spec:
  nodeName: worker-1
  pool:
    name: pool
    generation: 1
    resourceSliceCount: 1
  driver: dra.example.com
  devices:
  - name: gpu-0
    attributes:
      dra.example.com/supported-models:
        strings:
        - model-a
        - model-b
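A ResourceClaim can then select devices that support a given model with the includes() function in a CEL selector. The following is a sketch assuming the resource.k8s.io/v1 ResourceClaim layout and a hypothetical DeviceClass name:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: need-model-a
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: example-class   # hypothetical DeviceClass
        selectors:
        - cel:
            # Matches gpu-0 above, whose supported-models list
            # contains "model-a".
            expression: device.attributes["dra.example.com"].supported-models.includes("model-a")
```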
List type attributes is an alpha feature and only enabled when the
DRAListTypeAttributes feature gate
is enabled in the kube-apiserver and kube-scheduler.