Implementing Resource Isolation in Doris Offline Analysis Data Warehouse Cluster
I. Background
With rapid business growth, the offline analysis data warehouse Doris cluster is facing resource contention among diverse workloads such as daily data science jobs, offline batch processing, and customer-facing report queries. To address this, resource limits need to be applied to the requests of different accounts. After research, the official Doris Workload Group solution was adopted.
Doris version: doris-2.1.8-1-834d802457
II. Workload Group Introduction
Workload Group is an in-process resource isolation mechanism provided by Apache Doris that achieves resource isolation between different business loads through fine-grained division of CPU, memory, and IO resources within BE processes.
The principle is as shown in the diagram below:

Currently supported isolation capabilities include:
| Resource Type | Isolation Method | Description |
|---|---|---|
| CPU | Soft limit / Hard limit | Soft limit allocates CPU time by weight; hard limit sets an absolute upper bound |
| Memory | Soft limit / Hard limit | When a hard limit is exceeded, queries are automatically killed to free memory |
| IO | Speed limit | Limits IO throughput when reading local/remote files |
| Concurrency | Queuing mechanism | Queries exceeding the concurrency limit enter a queue and wait |
For more information: https://doris.apache.org/zh-CN/docs/2.1/admin-manual/workload-management/workload-group
III. Doris Cluster Adjustment to Support Cgroup (Manual Deployment)
3.1 Cgroup Environment Configuration
Check the system’s supported CGroup version:
root@doristest:~# cat /proc/filesystems | grep cgroup
nodev cgroup
nodev cgroup2
Confirm the effective CGroup version:
root@doristest:~# ls /sys/fs/cgroup/cpu/
ls: cannot access '/sys/fs/cgroup/cpu/': No such file or directory
root@doristest:~# ls /sys/fs/cgroup/cgroup.controllers
/sys/fs/cgroup/cgroup.controllers # Indicates cgroup v2 is in effect
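The two checks above can be collapsed into a small helper; this is an illustrative sketch (the function name and the path argument are ours, not from the Doris docs):

```shell
#!/usr/bin/env bash
# Report which cgroup version is active under a mount root
# (defaults to /sys/fs/cgroup). cgroup v2 exposes a cgroup.controllers
# file at the root; v1 exposes per-controller directories instead.
detect_cgroup_version() {
  local root="${1:-/sys/fs/cgroup}"
  if [ -f "$root/cgroup.controllers" ]; then
    echo "v2"
  else
    echo "v1"
  fi
}
```

On the test host above, this would print `v2`, matching the manual checks.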
3.2 Create Doris CGroup Directory
# ========== CGroup v1 Operations ==========
mkdir /sys/fs/cgroup/cpu/doris
# Set permissions (root is the user running BE)
chmod 770 /sys/fs/cgroup/cpu/doris
chown -R root:root /sys/fs/cgroup/cpu/doris
# ========== CGroup v2 Operations ==========
mkdir /sys/fs/cgroup/doris
# Set permissions
chmod 770 /sys/fs/cgroup/doris
chown -R root:root /sys/fs/cgroup/doris # Adjust as needed, currently root
3.3 CGroup v2 Additional Configuration
CGroup v2 has stricter permission control and requires additional configuration:
# Modify root directory cgroup.procs file permissions
chmod a+w /sys/fs/cgroup/cgroup.procs
# Enable CPU controller
# Enter doris directory
cd /sys/fs/cgroup/doris
# Enable the CPU controller by writing to the parent's cgroup.subtree_control
# (often already enabled by default, in which case this step can be skipped)
echo +cpu > ../cgroup.subtree_control
# Verify: Check if cpu.max file appears in doris directory
/sys/fs/cgroup/doris/cpu.max
# And cgroup.controllers should contain cpu
# cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
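For convenience, the cgroup v2 steps in 3.2 and 3.3 can be combined into one idempotent script. This is a sketch under the assumption that BE runs as root; the function name and the root-path parameter are ours (the parameter exists only so the script can be exercised against a scratch directory):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Prepare a cgroup v2 "doris" group under the given root
# (defaults to /sys/fs/cgroup).
setup_doris_cgroup_v2() {
  local root="${1:-/sys/fs/cgroup}"
  # 3.2: create the group directory and set permissions (BE runs as root here)
  mkdir -p "$root/doris"
  chmod 770 "$root/doris"
  # 3.3: cgroup v2 is strict about moving processes, so open up cgroup.procs
  chmod a+w "$root/cgroup.procs"
  # 3.3: delegate the cpu controller so doris/cpu.max appears
  echo "+cpu" > "$root/cgroup.subtree_control"
}
```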
3.4 Configure BE and Restart
Edit the be.conf file:
# ========== CGroup v1 Configuration ==========
doris_cgroup_cpu_path = /sys/fs/cgroup/cpu/doris
# ========== CGroup v2 Configuration ==========
doris_cgroup_cpu_path = /sys/fs/cgroup/doris
Restart BE service:
# Restart BE
./bin/stop_be.sh
./bin/start_be.sh
Verification: Check the be.INFO log
grep "add thread" log/be.INFO
# Appearance of "add thread xxx to group" indicates successful configuration
3.5 Persistent Configuration (Optional)
CGroup configuration will be cleared after machine reboot. It is recommended to use systemd to create an auto-configuration service on boot:
# /etc/systemd/system/doris-cgroup.service
# (this example uses the CGroup v1 path; for v2, use /sys/fs/cgroup/doris)
[Unit]
Description=Doris CGroup Setup
Before=doris-be.service
[Service]
Type=oneshot
ExecStart=/bin/bash -c 'mkdir -p /sys/fs/cgroup/cpu/doris && chmod 770 /sys/fs/cgroup/cpu/doris && chown -R root:root /sys/fs/cgroup/cpu/doris'
[Install]
WantedBy=multi-user.target
Set to auto-start on boot:
systemctl enable doris-cgroup.service
IV. Batch Enable Cgroup on Doris Nodes via Ansible
4.1 Preparation
Before executing batch deployment, confirm the following information:
- Confirm CGroup version: CGroup version (v1 or v2) for all Doris BE nodes
- Prepare host inventory: Edit Ansible’s hosts file to add all BE nodes
- Confirm BE user: User running Doris BE service (currently root)
4.2 Configure Host Inventory
Edit the hosts file:
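The inventory content itself was not preserved in this write-up; a minimal sketch, in which the group name, IPs, and user are placeholders to be replaced with your actual BE nodes:

```ini
# hosts -- Ansible inventory (all values below are placeholders)
[doris_be]
10.0.0.11
10.0.0.12
10.0.0.13

[doris_be:vars]
ansible_user=root
```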
4.3 Create Ansible Playbook
Create file setup_doris_cgroup.yaml:
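The original playbook content was not preserved; the following is a minimal sketch based on the manual steps in section III, assuming the BE user is root and the hosts sit in a `doris_be` inventory group (both assumptions are ours):

```yaml
# setup_doris_cgroup.yaml -- illustrative sketch, not the original playbook
- hosts: doris_be
  become: true
  vars:
    cgroup_version: v2          # set to v1 on CGroup v1 hosts
    doris_cgroup_path: "{{ '/sys/fs/cgroup/doris' if cgroup_version == 'v2'
                           else '/sys/fs/cgroup/cpu/doris' }}"
  tasks:
    - name: Create the Doris cgroup directory
      file:
        path: "{{ doris_cgroup_path }}"
        state: directory
        mode: "0770"
        owner: root
        group: root

    - name: Make root cgroup.procs writable (cgroup v2 only)
      file:
        path: /sys/fs/cgroup/cgroup.procs
        mode: a+w
      when: cgroup_version == 'v2'

    - name: Delegate the cpu controller to children (cgroup v2 only)
      shell: echo +cpu > /sys/fs/cgroup/cgroup.subtree_control
      when: cgroup_version == 'v2'
```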
4.4 Execute Batch Deployment
Run a syntax check and a dry run first:
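The original commands were not preserved; with the file names used above, they would look like:

```shell
# Syntax check only; nothing runs on the hosts
ansible-playbook --syntax-check setup_doris_cgroup.yaml
# Dry run: report what would change without applying anything
ansible-playbook -i hosts --check setup_doris_cgroup.yaml
```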
4.5 Run the Deployment and Verify the Results
Run the playbook for real:
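Again sketched from the file names above, since the original command was not preserved:

```shell
ansible-playbook -i hosts setup_doris_cgroup.yaml
```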
Adjust Doris BE configuration via Doris Manager (rolling restart):
# ========== CGroup v1 Configuration ==========
doris_cgroup_cpu_path = /sys/fs/cgroup/cpu/doris
# ========== CGroup v2 Configuration ==========
doris_cgroup_cpu_path = /sys/fs/cgroup/doris

Verification:
At this point, Doris BE nodes have enabled Cgroup resource limits.
V. Workload Group Management
5.1 Create Workload Group
Basic example (CPU soft limit)
Complete configuration example
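The example statements were lost from this write-up; the following is a sketch built from the attributes documented in section 5.2, with illustrative (not recommended) values:

```sql
-- Basic example: CPU soft limit only (weight-based sharing)
CREATE WORKLOAD GROUP g1
PROPERTIES (
    'cpu_share' = '1024'
);

-- Fuller example: memory hard limit plus concurrency control
CREATE WORKLOAD GROUP g2
PROPERTIES (
    'cpu_share' = '2048',
    'memory_limit' = '30%',
    'enable_memory_overcommit' = 'false',
    'max_concurrency' = '100',
    'max_queue_size' = '200',
    'queue_timeout' = '30000'
);
```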
5.2 Attribute Details
| Attribute Name | Type | Default Value | Value Range | Description |
|---|---|---|---|---|
| cpu_share | INT | -1 | [1, 10000] | CPU soft limit weight, higher value means higher priority |
| cpu_hard_limit | INT | -1 | [1%, 100%] | CPU hard limit percentage (new in version 2.1) |
| memory_limit | FLOAT | -1 | (0%, 100%] | Memory limit percentage |
| enable_memory_overcommit | BOOL | true | true/false | true=soft limit (can exceed), false=hard limit |
| max_concurrency | INT | 2147483647 | [0, 2147483647] | Maximum concurrent queries |
| max_queue_size | INT | 0 | [0, 2147483647] | Queue length (0=no queuing) |
| queue_timeout | INT | 0 | [0, 2147483647] | Queue timeout (milliseconds) |
| scan_thread_num | INT | -1 | [1, 2147483647] | Scan thread count (-1=use BE configuration) |
| max_remote_scan_thread_num | INT | -1 | [1, 2147483647] | Maximum scan thread count for external tables |
| min_remote_scan_thread_num | INT | -1 | [1, 2147483647] | Minimum scan thread count for external tables |
| read_bytes_per_second | INT | -1 | [1, 9223372036854775807] | Internal table read IO limit (bytes/second) |
| remote_read_bytes_per_second | INT | -1 | [1, 9223372036854775807] | External table read IO limit (bytes/second) |
Notes:
- Must specify at least one attribute when creating
- CGroup v1 cpu_share default is 1024, range 2-262144
- CGroup v2 cpu_share default is 100, range 1-10000
- Sum of all Workload Group memory_limit cannot exceed 100%
- Sum of all Workload Group cpu_hard_limit cannot exceed 100%
5.3 Modify Workload Group
-- Modify CPU soft limit weight
ALTER WORKLOAD GROUP g1 PROPERTIES('cpu_share' = '4096');
-- Change memory limit to hard limit
ALTER WORKLOAD GROUP g1 PROPERTIES(
'memory_limit' = '30%',
'enable_memory_overcommit' = 'false'
);
-- Modify concurrency and queuing parameters
ALTER WORKLOAD GROUP g1 PROPERTIES(
'max_concurrency' = '100',
'max_queue_size' = '200',
'queue_timeout' = '30000'
);
-- Add CPU hard limit (hard limit mode must be enabled first)
ALTER WORKLOAD GROUP g1 PROPERTIES('cpu_hard_limit' = '20%');
5.4 Delete Workload Group
-- Delete specified Workload Group
DROP WORKLOAD GROUP g1;
Note: The default normal group cannot be deleted.
VI. User Binding and Authorization
6.1 View Available Workload Groups
-- View Workload Groups that the current user has permission to use
SELECT name FROM information_schema.workload_groups;
6.2 Authorization
-- Grant user permission to use specified Workload Group
GRANT USAGE_PRIV ON WORKLOAD GROUP 'g1' TO 'user_1'@'%';
-- Grant permission to all Workload Groups
GRANT USAGE_PRIV ON WORKLOAD GROUP '*' TO 'user_1'@'%';
-- Revoke permission
REVOKE USAGE_PRIV ON WORKLOAD GROUP 'g1' FROM 'user_1'@'%';
6.3 Binding Method 1: User Properties (Recommended)
Set default Workload Group for user (persistent):
-- Set user's default Workload Group
SET PROPERTY FOR 'user_1' 'default_workload_group' = 'g1';
-- View user properties
SHOW PROPERTY FOR 'user_1';
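The heading above implies a second binding method; per the Doris workload management docs, a group can also be bound for a single session via the `workload_group` session variable (non-persistent, overrides the user default):

```sql
-- Bind for the current session only
SET workload_group = 'g1';
```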
VII. Monitoring and Viewing
7.1 View Workload Group List
-- Show statement
SHOW WORKLOAD GROUPS;
-- System table query
SELECT * FROM information_schema.workload_groups;
7.2 View Resource Usage
-- View memory usage of each Workload Group (unit: MB)
SELECT
workload_group_id,
name,
MEMORY_USAGE_BYTES / 1024 / 1024 AS mem_used_mb,
CPU_USAGE_PERCENT AS cpu_percent
FROM information_schema.workload_group_resource_usage;
-- View details of specific Workload Group
SELECT * FROM information_schema.workload_groups WHERE name = 'g1';
7.3 System Table Field Description
workload_groups table
| Field | Description |
|---|---|
| ID | Workload Group ID |
| NAME | Name |
| CPU_SHARE | CPU soft limit weight |
| MEMORY_LIMIT | Memory limit percentage |
| ENABLE_MEMORY_OVERCOMMIT | Whether memory overcommit is allowed |
| MAX_CONCURRENCY | Maximum concurrency |
| MAX_QUEUE_SIZE | Queue length |
| QUEUE_TIMEOUT | Queue timeout |
| CPU_HARD_LIMIT | CPU hard limit percentage |
| SCAN_THREAD_NUM | Scan thread count |
| READ_BYTES_PER_SECOND | Internal table IO limit |
| REMOTE_READ_BYTES_PER_SECOND | External table IO limit |
VIII. CPU Soft/Hard Limit Mode Switching
8.1 Mode Description
| Mode | Characteristics | Applicable Scenarios |
|---|---|---|
| CPU soft limit | Allocates CPU time by weight, can use all CPU when idle | Large load fluctuations, want to fully utilize resources |
| CPU hard limit | Absolute upper limit, cannot exceed even if CPU is idle | Need strict resource isolation, guarantee SLA |
Note: A cluster can use only one of the two modes at a time; they cannot be mixed.
8.2 Switch from Soft Limit to Hard Limit
Step 1: Set hard limit values for all Workload Groups
-- Must set cpu_hard_limit for all Groups
ALTER WORKLOAD GROUP g1 PROPERTIES('cpu_hard_limit' = '30%');
ALTER WORKLOAD GROUP g2 PROPERTIES('cpu_hard_limit' = '20%');
ALTER WORKLOAD GROUP normal PROPERTIES('cpu_hard_limit' = '50%');
-- Verify: Sum of all Group's cpu_hard_limit <= 100%
SELECT SUM(cpu_hard_limit) FROM information_schema.workload_groups
WHERE cpu_hard_limit > 0;
Step 2: Enable cluster hard limit switch
-- Takes effect immediately in memory (lost after FE restart)
ADMIN SET FRONTEND CONFIG ("enable_cpu_hard_limit" = "true");
-- Persistent configuration: Modify fe.conf for all FEs
echo "experimental_enable_cpu_hard_limit = true" >> fe/conf/fe.conf
8.3 Switch from Hard Limit Back to Soft Limit
-- Disable hard limit switch (automatically switches back to soft limit mode)
ADMIN SET FRONTEND CONFIG ("enable_cpu_hard_limit" = "false");
For persistence, modify fe.conf to false or delete the configuration.
The above is the overall solution for adding resource limits to the Doris offline analysis data warehouse.
Reference document: https://doris.apache.org/zh-CN/docs/2.1/admin-manual/workload-management/workload-group