Hermes Multi-Agent Collaboration: Architecture Design for Load-Balanced Batch Skill Execution

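1 Background and Problem Statement

1.1 Business Scenario

In pharmaceutical medical-communications work, batch-collecting policy information for cities nationwide is a typical high-concurrency, long-running job: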

Metric	Value	Notes
Task volume	700+	100+ cities × 7 policy categories
Time per task	150s	search + extract + write
Serial execution	29h	700 × 150s (unacceptable)

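1.2 Core Problems

1. Concurrency ceiling: a single Agent can run a Skill with only limited concurrency (bounded by API rate limits and the single-process toolset).

2. Skill consistency: when several Agents run the same Skill in parallel, how do we make sure every Agent has the same Skill capability?

3. Load balancing: the task count far exceeds what a single Agent can handle, so tasks must be routed efficiently to idle Agents.

4. Fault recovery: the failure of some Workers must not disrupt the overall task queue.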

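1.3 Goals

Performance target

Cut the execution time for 700+ tasks from 29 hours to under 30 minutes.

Multiple Agents run the same Skill in parallel with identical capabilities.
Failed tasks are retried or re-queued automatically without affecting overall progress.
Observability: task progress, Worker status, and success rate can be monitored throughout the run.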

2 Architecture Overview

2.1 Overall Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                        User / Orchestrator Agent                            │
│            (analyze job → create task queue → monitor progress)             │
└─────────────────────────────────┬───────────────────────────────────────────┘
                                  │ create N tasks
                                  ↓
┌─────────────────────────────────────────────────────────────────────────────┐
│                           Kanban Scheduling Layer                           │
│                                                                             │
│   dispatch_interval_seconds: 5s    max_spawn: 15                            │
│   dispatch_in_gateway: true        failure_limit: 5                         │
│                                                                             │
│   ┌─ Ready ──────→ scheduler spawns Workers up to max_spawn                 │
│   ├─ Running ────→ tracks tasks in progress                                 │
│   ├─ Done ───────→ marked complete                                          │
│   └─ Blocked ────→ waiting for manual intervention or dependencies          │
└─────────────────────────────────┬───────────────────────────────────────────┘
                                  │ spawn_nowait()
                                  │ subprocess.Popen("hermes -p <profile> chat -q work kanban task <id>")
                                  ↓
    ┌─────────────────────────────┼───────────────────────────────────────────┐
    │                             │                                           │
    │  ┌─────────────────────┐    │    ┌─────────────────────┐                │
    │  │  Worker 1           │    │    │  Worker 2           │  ...           │
    │  │  hermes -p          │    │    │  hermes -p          │                │
    │  │  collector_a        │    │    │  collector_b        │                │
    │  │  ───────────────────│    │    │  ───────────────────│                │
    │  │  Load Skill:        │    │    │  Load Skill:        │                │
    │  │  policy-batch-      │    │    │  policy-batch-      │                │
    │  │  collector          │    │    │  collector          │                │
    │  │  ───────────────────│    │    │  ───────────────────│                │
    │  │  kanban_read_task() │    │    │  kanban_read_task() │                │
    │  │  → run collection   │    │    │  → run collection   │                │
    │  │  kanban_complete()  │    │    │  kanban_complete()  │                │
    │  └─────────────────────┘    │    └─────────────────────┘                │
    │                             │                                           │
    │  15 Workers ...             │  N Workers                                │
    └─────────────────────────────┼───────────────────────────────────────────┘
                                  ↓
          ┌───────────────────────────────────────────────┐
          │             Shared result storage             │
          │  ~/七大政策/{city}/{policy_no-policy_name}/   │
          │  ├── 01-指标汇总表.md  (metrics summary)      │
          │  ├── 02-来源明细表.md  (source details)       │
          │  └── 03-缺口与待补充.md  (gaps and to-dos)    │
          └───────────────────────────────────────────────┘

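2.2 Core Design Principles

① Task queueing: all batch tasks enter the Kanban queue and are dispatched centrally by the scheduler.

② Worker pooling: a fixed number of Worker slots, reused on demand; as soon as a task finishes, the freed slot takes a new task.

③ Skill centralization: the Skill itself lives in one place, and every Profile shares it through symlinks.

④ Process isolation: each Worker is an independent OS process with its own model calls, so Workers do not block one another.

⑤ Autonomous scheduling: the scheduler is embedded in the Gateway, ticks at a fixed interval, and backfills free slots without human intervention.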

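3 Core Components

3.1 Kanban Scheduler

The scheduler scans the pool of Ready tasks at a fixed interval and spawns Workers up to the max_spawn limit. The tick interval determines how quickly a freed slot is refilled after a task completes.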

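Parameter	Default	Description
`dispatch_interval_seconds`	60s	Polling interval. At 5s there is almost no idle gap between tasks
`max_spawn`	null	Maximum number of spawns per tick; null = unlimited
`dispatch_in_gateway`	true	The scheduler runs embedded in the Gateway process
`failure_limit`	5	A task is blocked automatically after N consecutive failures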

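Dispatch logic (pseudocode)

# kanban_db.py dispatch_once()
def dispatch_once():
    spawned = 0  # Workers started during this tick
    ready_rows = db.select("SELECT * FROM tasks WHERE status = 'ready'")
    for row in ready_rows:
        # max_spawn = None (null in the config) means no per-tick limit
        if max_spawn is not None and spawned >= max_spawn:
            break
        # Claim the task first so it cannot be picked up twice (15-minute TTL)
        claimed = claim_task(conn, row["id"], ttl_seconds=900)
        if claimed:
            pid = spawn_nowait(
                "hermes -p <profile> chat -q work kanban task <id>",
                cwd=workspace
            )
            spawned += 1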
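
The claim step is what keeps two dispatch passes from starting the same task twice. The following is a minimal runnable sketch of that idea against an in-memory SQLite database. The table layout (`tasks` with `id`, `status`, `claim_expires`) and both helper bodies are assumptions made for illustration, not the actual kanban.db schema, and the spawn step is replaced by collecting task IDs so the example has no side effects.

```python
import sqlite3
import time


def claim_task(conn, task_id, ttl_seconds=900, now=None):
    """Atomically move a task from 'ready' to 'running'.

    Returns True only when this UPDATE actually changed the row, so a task
    that was already claimed cannot be claimed a second time.
    """
    now = time.time() if now is None else now
    cur = conn.execute(
        "UPDATE tasks SET status = 'running', claim_expires = ? "
        "WHERE id = ? AND status = 'ready'",
        (now + ttl_seconds, task_id),
    )
    conn.commit()
    return cur.rowcount == 1


def dispatch_once(conn, max_spawn=None):
    """Claim ready tasks up to max_spawn and return the claimed IDs."""
    claimed_ids = []
    rows = conn.execute(
        "SELECT id FROM tasks WHERE status = 'ready' ORDER BY id"
    ).fetchall()
    for (task_id,) in rows:
        if max_spawn is not None and len(claimed_ids) >= max_spawn:
            break
        if claim_task(conn, task_id):
            claimed_ids.append(task_id)  # a real dispatcher would spawn a Worker here
    return claimed_ids


# Hypothetical schema, used only for this demonstration
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT, claim_expires REAL)"
)
conn.executemany(
    "INSERT INTO tasks (id, status) VALUES (?, 'ready')",
    [(i,) for i in range(1, 6)],
)
conn.commit()

first_tick = dispatch_once(conn, max_spawn=3)
second_tick = dispatch_once(conn, max_spawn=3)
```

With five ready tasks and `max_spawn=3`, the first pass claims three tasks and the second pass claims the remaining two. Putting the `status = 'ready'` condition inside the UPDATE itself is the design choice that makes the claim safe without a separate lock.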

Worker lifecycle

Ready → claim_task() → spawn_nowait() → Worker process starts

↓ kanban_read_task() → parse parameters → skill_view() → execute

kanban_complete() → process exits → slot refilled on the next tick

3.2 Hermes Profile (Worker Identity)

A Profile is an independent Hermes configuration unit with its own config file, SOUL.md, Skill links, API keys, and session storage.

┌──────────────────────────────────────────────────────────┐
│     3 Profiles (matching the 3 OpenClaw Agent instances) │
│                                                          │
│  collector_a  ──→ Worker pool (up to 5 concurrent)       │
│  collector_b  ──→ Worker pool (up to 5 concurrent)       │
│  collector_c  ──→ Worker pool (up to 5 concurrent)       │
│                                                          │
│          3 × 5 = 15 maximum concurrent slots             │
└──────────────────────────────────────────────────────────┘

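3.3 Skill Sharing Mechanism (Core)

Recommended approach: share the Skill directory through symlinks

The Skill itself lives in one place (~/.hermes/skills/), and each Profile references that real directory through a symlink: a single source with multiple references. A relative link target is resolved from the directory that contains the link, so from ~/.hermes/profiles/<name>/skills/ it takes three `..` segments to reach ~/.hermes/.

~/.hermes/skills/                           ← the Skill itself (single source of truth)
└── policy-batch-collector/
    ├── SKILL.md
    └── references/

~/.hermes/profiles/
├── collector_a/
│   └── skills/
│       └── policy-batch-collector/  ──→ ../../../skills/policy-batch-collector
├── collector_b/
│   └── skills/
│       └── policy-batch-collector/  ──→ ../../../skills/policy-batch-collector
└── collector_c/
    └── skills/
        └── policy-batch-collector/  ──→ ../../../skills/policy-batch-collector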

Advantages

Update the Skill once and every Profile picks up the change immediately · no repeated installs · clear version management
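
As a hedged illustration of the linking scheme, the sketch below builds the same layout inside a temporary directory that stands in for ~/.hermes. The helper name `link_skill` is hypothetical; the relative target is computed with `os.path.relpath` rather than typed by hand, which avoids miscounting the `..` segments.

```python
import os
import tempfile

SKILL = "policy-batch-collector"
PROFILES = ["collector_a", "collector_b", "collector_c"]


def link_skill(hermes_home, profile, skill=SKILL):
    """Create profiles/<profile>/skills/<skill> as a relative symlink.

    Returns the relative target that was written into the link.
    """
    skills_dir = os.path.join(hermes_home, "profiles", profile, "skills")
    os.makedirs(skills_dir, exist_ok=True)
    link_path = os.path.join(skills_dir, skill)
    source = os.path.join(hermes_home, "skills", skill)
    # A relative target is resolved from the directory that holds the link.
    target = os.path.relpath(source, start=skills_dir)
    if os.path.islink(link_path):
        os.remove(link_path)  # replace a stale link instead of nesting a new one
    os.symlink(target, link_path)
    return target


# Throwaway directory standing in for ~/.hermes
home = tempfile.mkdtemp()
os.makedirs(os.path.join(home, "skills", SKILL))
with open(os.path.join(home, "skills", SKILL, "SKILL.md"), "w") as f:
    f.write("# policy-batch-collector\n")

targets = [link_skill(home, p) for p in PROFILES]
resolved = {
    os.path.realpath(os.path.join(home, "profiles", p, "skills", SKILL))
    for p in PROFILES
}
```

After the loop, `resolved` holds a single path, which is the property the design relies on: every Profile reads the same files, so there is nothing to keep in sync.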

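3.4 Task Structure Design

The task body is a JSON string. The Worker reads it, parses the parameters, and loads the corresponding Skill. In the example, "深圳" is Shenzhen and "07-落户" is policy 07, household registration:

// Task body format
{
  "city": "深圳",
  "policy": "07-落户",
  "year": 2026,
  "skill": "policy-batch-collector",
  "output_base": "~/七大政策"
}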

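1. Read the task: kanban_read_task() → parse the body JSON

2. Load the Skill: skill_view(name=body["skill"]) → already mounted through the symlink

3. Run the collection: web_search() → parse into structured data → write_file() writes the results

4. Mark complete: kanban_complete(summary="深圳-落户-采集完成"), i.e. "Shenzhen, household registration, collection complete"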
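
The parsing in step 1 can be sketched as follows. This is an illustrative assumption about how a Worker might validate the body before doing any work: the `CollectionTask` class, the `parse_task_body` helper, and the rule that all five fields are required are not taken from Hermes. The output directory follows the `{output_base}/{city}/{policy}` layout shown in the architecture diagram.

```python
import json
import os
from dataclasses import dataclass

REQUIRED = ("city", "policy", "year", "skill", "output_base")


@dataclass(frozen=True)
class CollectionTask:
    city: str
    policy: str
    year: int
    skill: str
    output_base: str

    def output_dir(self):
        """Result directory <output_base>/<city>/<policy>, with ~ expanded."""
        base = os.path.expanduser(self.output_base)
        return os.path.join(base, self.city, self.policy)


def parse_task_body(body):
    """Parse the JSON task body and fail fast when a field is missing."""
    data = json.loads(body)
    missing = [key for key in REQUIRED if key not in data]
    if missing:
        raise ValueError(f"task body is missing fields: {missing}")
    return CollectionTask(**{key: data[key] for key in REQUIRED})


# An absolute output_base is used here so the result does not depend on $HOME
body = json.dumps(
    {
        "city": "深圳",
        "policy": "07-落户",
        "year": 2026,
        "skill": "policy-batch-collector",
        "output_base": "/tmp/七大政策",
    },
    ensure_ascii=False,
)
task = parse_task_body(body)
```

Rejecting an incomplete body up front lets the Worker report a clear failure instead of running a 150-second collection that cannot write its results.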

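4 Parallel Mode Comparison

Dimension	A: Kanban Worker pool	B: delegate_task batch
Concurrency	configurable, 15+	3 by default
Task persistence	SQLite; survives power loss	in memory; lost on power loss
Worker independence	independent OS process	independent conversation
Skill loading	full set (via symlinks)	inherits the parent agent's toolset
Best for	minute-scale long tasks, batch collection	second-scale quick subtasks
Fault recovery	failed tasks are re-run automatically	everything is lost if the parent process fails
Scheduling latency	backfill on a 5s tick	immediate (bounded by the parent process)

Selection conclusion

Batch Skill execution (minute-scale tasks) should use the Kanban Worker pool; delegate_task suits lightweight parallel reasoning tasks.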

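5 Skill Consistency Guarantees

5.1 Comparison of Three Approaches

Approach	Implementation difficulty	Update complexity	Consistency	Recommendation
① Install separately in each Profile	Low	High (repeat N times)	Depends on manual discipline	Not recommended
② Load dynamically from the task body	Medium	Medium	Loaded at run time	Fallback
③ Share through symlinks	Low	Low (update once, effective everywhere)	100%	Recommended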

5.2 Detailed Design of Approach ③

Main Skill directory (single source for the version)
~/.hermes/skills/
└── policy-batch-collector/
    ├── SKILL.md
    └── references/
        ├── cities.json
        └── policy-types.json

Profile Skill links (read-only references)
~/.hermes/profiles/
├── collector_a/skills/policy-batch-collector/  → ../../../skills/policy-batch-collector
├── collector_b/skills/policy-batch-collector/  → ../../../skills/policy-batch-collector
└── collector_c/skills/policy-batch-collector/  → ../../../skills/policy-batch-collector

Skill update procedure

# Update only the main directory; every Profile picks up the change immediately
cd ~/.hermes/skills/policy-batch-collector
git pull  # or: hermes skills update policy-batch-collector

# Verify the link
ls -la ~/.hermes/profiles/collector_a/skills/
# lrwxr-xr-x  policy-batch-collector -> ../../../skills/policy-batch-collector
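
The manual `ls -la` check can also be automated. The sketch below is one possible checker, with a hypothetical function name and status labels: it reports a Profile whose Skill entry is missing, is a copied directory rather than a link, or is a link that resolves somewhere other than the main Skill directory (for example, a relative target with too few `..` segments).

```python
import os
import tempfile


def check_skill_links(hermes_home, profiles, skill):
    """Return {profile: status} for each Profile's Skill entry.

    Status is 'ok', 'missing', 'not-a-symlink', or 'wrong-target'.
    """
    expected = os.path.realpath(os.path.join(hermes_home, "skills", skill))
    report = {}
    for profile in profiles:
        link = os.path.join(hermes_home, "profiles", profile, "skills", skill)
        if not os.path.lexists(link):
            report[profile] = "missing"
        elif not os.path.islink(link):
            report[profile] = "not-a-symlink"
        elif os.path.realpath(link) != expected:
            report[profile] = "wrong-target"
        else:
            report[profile] = "ok"
    return report


# Throwaway directory standing in for ~/.hermes
home = tempfile.mkdtemp()
skill = "policy-batch-collector"
os.makedirs(os.path.join(home, "skills", skill))
for name in ("collector_a", "collector_b", "collector_c"):
    os.makedirs(os.path.join(home, "profiles", name, "skills"))

# collector_a: correct relative link (three levels up)
os.symlink(
    os.path.join("..", "..", "..", "skills", skill),
    os.path.join(home, "profiles", "collector_a", "skills", skill),
)
# collector_b: one level short, so the link points inside profiles/
os.symlink(
    os.path.join("..", "..", "skills", skill),
    os.path.join(home, "profiles", "collector_b", "skills", skill),
)
# collector_c: a copied directory instead of a link
os.makedirs(os.path.join(home, "profiles", "collector_c", "skills", skill))

report = check_skill_links(
    home, ["collector_a", "collector_b", "collector_c", "collector_d"], skill
)
```

Running a check like this before a batch starts catches a Profile that would otherwise run with a stale or absent Skill.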

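6 Strategies for Search API Rate Limits

With many Workers running concurrently, the search API is the main bottleneck. The table below compares the rate limits of the main APIs and the strategy for each:

API	Quota	Concurrency limit	Multi-Profile strategy
Brave Free	2,000 requests/month	1 QPS	3 Profiles × different keys
Tavily PRO	per-account rate limit	~1 QPS	rotate API keys
MiniMax Token	14 requests/2min	cooldown period	set a 2min cooldown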

A separate .env file for each Profile

# Profile A
echo "BRAVE_SEARCH_API_KEY=key_a" > ~/.hermes/profiles/collector_a/.env

# Profile B
echo "BRAVE_SEARCH_API_KEY=key_b" > ~/.hermes/profiles/collector_b/.env

# Profile C
echo "BRAVE_SEARCH_API_KEY=key_c" > ~/.hermes/profiles/collector_c/.env

Note

Writing files with an echo redirect from the Terminal tool is blocked; use the execute_code tool or a Python script to write these files instead.
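
To respect a limit such as 1 QPS, a Worker can space out its own search calls. The following is a minimal sketch of a minimum-interval limiter. The class name and the injectable `clock` and `sleep` parameters are choices made for this example (they let the demonstration run instantly with a fake clock), not part of Hermes or any search SDK.

```python
import time


class MinIntervalLimiter:
    """Allow at most one call per `interval` seconds (1 QPS when interval=1.0)."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no call has been made yet

    def acquire(self):
        """Block until the next call is allowed; return the seconds waited."""
        now = self._clock()
        waited = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            waited = self._next_allowed - now
            self._sleep(waited)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return waited


class FakeClock:
    """Deterministic clock so the example does not really sleep."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


clock = FakeClock()
limiter = MinIntervalLimiter(1.0, clock=clock.now, sleep=clock.sleep)
waits = [limiter.acquire() for _ in range(3)]  # three back-to-back calls
```

Because each Worker is a separate OS process, an in-process limiter like this bounds only one Worker. When several Workers share one key, the combined rate can still exceed the limit, which is why the table pairs each Profile with its own key.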

7 Deployment and Configuration Steps

1. Prepare the environment: create the Profiles

hermes profile create collector_a
hermes profile create collector_b
hermes profile create collector_c
# Verify
hermes profile list

2. Deploy the Skill centrally: symlinks

# Install once
hermes skills install policy-batch-collector

# Create the symlinks (-n replaces an existing link instead of nesting a new one inside it)
for profile in collector_a collector_b collector_c; do
    mkdir -p ~/.hermes/profiles/$profile/skills
    ln -sfn ../../../skills/policy-batch-collector \
           ~/.hermes/profiles/$profile/skills/policy-batch-collector
done

# Verify
ls -la ~/.hermes/profiles/collector_a/skills/

3. Configure the scheduling parameters

# Edit ~/.hermes/config.yaml
kanban:
  dispatch_in_gateway: true
  dispatch_interval_seconds: 5     # lowered from the 60s default
  max_spawn: 15                   # 3 Profiles × 5 concurrent
  failure_limit: 5

4. Configure the API key (for each Profile)

# Write the files with the execute_code tool
python3 -c "
import os
for profile, key in [
    ('collector_a', 'brave_key_a'),
    ('collector_b', 'brave_key_b'),
    ('collector_c', 'brave_key_c'),
]:
    path = os.path.expanduser(f'~/.hermes/profiles/{profile}/.env')
    with open(path, 'w') as f:
        f.write(f'BRAVE_SEARCH_API_KEY={key}\n')
    os.chmod(path, 0o600)
"

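5. Create the tasks in bulk

# Rotate the assignee so tasks are spread across all three Profiles
profiles=(collector_a collector_b collector_c)
i=0
for city in $(cat cities.txt); do
    for policy in 01 02 03 04 05 06 07; do
        hermes kanban create "Collect: ${city} policy ${policy}" \
            --assignee "${profiles[i % 3]}" \
            --body "{\"city\":\"${city}\",\"policy\":\"${policy}\"}"
        i=$((i + 1))
    done
done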
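
The same task list can be generated in Python, which makes the distribution across Profiles easy to inspect before anything is created. This is a sketch under assumptions: `build_tasks` is a hypothetical helper, and the round-robin assignment is one simple way to spread load, not something prescribed by Hermes. It only builds the (title, assignee, body) tuples and does not call the CLI.

```python
import json
from collections import Counter
from itertools import cycle, product

PROFILES = ["collector_a", "collector_b", "collector_c"]
POLICIES = [f"{n:02d}" for n in range(1, 8)]  # "01" through "07"


def build_tasks(cities, policies=POLICIES, profiles=PROFILES):
    """Yield (title, assignee, body_json) for every city × policy pair."""
    assignees = cycle(profiles)  # round-robin over the Profiles
    for city, policy in product(cities, policies):
        body = json.dumps({"city": city, "policy": policy}, ensure_ascii=False)
        yield f"Collect: {city} policy {policy}", next(assignees), body


tasks = list(build_tasks(["深圳", "广州", "上海"]))
per_profile = Counter(assignee for _, assignee, _ in tasks)
```

For three cities this yields 21 tasks, seven per Profile. Using `json.dumps` to build the body also avoids the hand-escaped quotes that the shell version needs.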

6. Start the Gateway

# Restart so the scheduling parameters take effect
kill $(cat ~/.hermes/gateway.pid)
hermes gateway start --no-browser

# Verify the scheduler
grep "kanban dispatcher" ~/.hermes/logs/gateway.log
# Expected: kanban dispatcher: embedded in gateway (interval=5.0s)

8 Monitoring and Operations

8.1 Monitoring Commands

# Live task status counts
hermes kanban list --json | python3 -c "
import json,sys
d=json.load(sys.stdin)
print('Ready:',sum(1 for t in d if t['status']=='ready'))
print('Running:',sum(1 for t in d if t['status']=='running'))
print('Done:',sum(1 for t in d if t['status']=='done'))
print('Blocked:',sum(1 for t in d if t['status']=='blocked'))
"

# Live Worker processes
ps aux | grep "hermes.*chat.*work.kanban" | grep -v grep

# Live output file count (assumes /tmp/start_time was created with touch when the run began)
watch -n 5 'find ~/七大政策/ -name "*.md" -newer /tmp/start_time 2>/dev/null | wc -l'

# Live Gateway log
tail -f ~/.hermes/logs/gateway.log | grep -i "dispatch\|spawn\|complete\|block"
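
For a single progress figure instead of four separate counts, the status list can be reduced with `collections.Counter`. The sketch below assumes the same shape the one-liner above relies on (a JSON array of objects with a `status` field); the `summarize` helper and the sample data are illustrative only.

```python
import json
from collections import Counter


def summarize(tasks):
    """Return per-status counts plus the percentage of tasks that are done."""
    counts = Counter(task["status"] for task in tasks)
    total = len(tasks)
    done_pct = round(100 * counts["done"] / total, 1) if total else 0.0
    return {"total": total, "counts": dict(counts), "done_pct": done_pct}


# Sample input shaped like the output of `hermes kanban list --json`
sample = json.loads(
    '[{"status": "done"}, {"status": "done"}, {"status": "running"},'
    ' {"status": "ready"}, {"status": "blocked"}]'
)
summary = summarize(sample)
```

In practice the list would come from `json.load(sys.stdin)` at the end of the same pipe as the command above.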

8.2 Web UI

Kanban board visualization

http://localhost:9119/kanban

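8.3 Fault Recovery

Failure scenario	Detection	Recovery action
Worker process disappears (task stuck)	The next tick finds that the process no longer exists	The scheduler re-spawns automatically
Task fails 5 times in a row	failure_limit counter	Blocked automatically, awaiting manual handling
API rate limit (error 2056)	2056 detected in the Worker log	Lengthen the cooldown; the scheduler retries naturally
Profile crashes	Repeated spawn failures	Switch to another Profile and continue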
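
The failure_limit row can be read as a small state machine: a failed task goes back to Ready until its consecutive failure count reaches the limit, at which point it is Blocked. The sketch below illustrates that rule; the `TaskState` class and `record_result` function are hypothetical names, and resetting the counter on success is an assumption implied by "consecutive".

```python
from dataclasses import dataclass


@dataclass
class TaskState:
    status: str = "ready"
    consecutive_failures: int = 0


def record_result(task, succeeded, failure_limit=5):
    """Update a task after a Worker run finishes."""
    if succeeded:
        task.status = "done"
        task.consecutive_failures = 0
        return task
    task.consecutive_failures += 1
    if task.consecutive_failures >= failure_limit:
        task.status = "blocked"  # needs manual handling
    else:
        task.status = "ready"  # re-queued; picked up again on a later tick
    return task


task = TaskState()
history = []
for _ in range(5):  # five failed runs in a row
    record_result(task, succeeded=False)
    history.append(task.status)
```

The first four failures leave the task in Ready, and the fifth blocks it, so a task that can never succeed stops consuming Worker slots.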

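9 Performance Estimates

Using 700 tasks (100 cities × 7 policies) as the example, the theoretical execution time under different concurrency settings compares as follows:

Configuration	Estimated time	Basis
Serial (1 Worker)	29h	700 × 150s
3 Profiles × 5 concurrent	~1.9h	15 concurrent Workers
Theoretical target	<30min	ideal state; requires more than 15 concurrent Workers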

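Main influencing factors

API rate limits (which call for rotating several API keys), differences in the complexity of each city's policy content, and how many pages of search results are read all affect the actual run time.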
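
The estimates follow from simple arithmetic, sketched below under idealized assumptions: every task takes exactly 150 s, Workers run in full waves, and dispatch gaps and rate limits are ignored. The function names are illustrative.

```python
import math


def wall_clock_seconds(tasks, seconds_per_task, concurrency):
    """Ideal run time when tasks execute in ceil(tasks / concurrency) waves."""
    return math.ceil(tasks / concurrency) * seconds_per_task


def required_concurrency(tasks, seconds_per_task, deadline_seconds):
    """Smallest number of slots that finishes all tasks before the deadline."""
    waves = deadline_seconds // seconds_per_task  # tasks one slot can finish
    if waves < 1:
        raise ValueError("deadline is shorter than a single task")
    return math.ceil(tasks / waves)


serial = wall_clock_seconds(700, 150, 1)          # seconds with 1 Worker
pooled = wall_clock_seconds(700, 150, 15)         # seconds with 15 Workers
needed = required_concurrency(700, 150, 30 * 60)  # slots for a 30-minute run
```

Serial execution takes 105,000 s (about 29.2 h). Fifteen slots need 47 waves, or 7,050 s (just under 2 h). A 30-minute deadline leaves room for 12 tasks per slot, so roughly 59 concurrent Workers would be needed, about four times the 15-slot configuration, before accounting for API rate limits.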

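10 OpenClaw Migration Mapping

OpenClaw concept	Hermes equivalent	Notes
3 Agents	3 Profiles	`collector_a/b/c`
5 concurrent sessions per Agent	`max_spawn: 15`	Global cap
15 resident slots	`dispatch_interval_seconds: 5`	A 5s tick leaves almost no idle gap
manifest.json state	SQLite (`kanban.db`)	More robust; survives power loss
No monitoring UI	Web UI + logs	Better observability
Backfill as soon as a task completes	Ready column backfills automatically	Handled by the scheduler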

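11 Architecture Extensions

11.1 Assign Profiles by City Region

collector_north  → northern cities (Beijing, Tianjin, Shenyang...)
collector_south  → southern cities (Guangzhou, Shenzhen, Chengdu...)
collector_east   → eastern cities (Shanghai, Hangzhou, Nanjing...)

Advantage: SOUL.md can be extended to give each Profile a regional-expert role, so that each region's policies are handled by a Worker familiar with that region.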
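
Regional assignment comes down to a lookup from city to Profile at task-creation time. The sketch below uses only the nine cities named above; the mapping tables, the `assignee_for` helper, and the fallback to `collector_a` for unlisted cities are assumptions made for illustration.

```python
REGION_PROFILES = {
    "north": "collector_north",
    "south": "collector_south",
    "east": "collector_east",
}

# City-to-region table limited to the examples listed in this section
CITY_REGIONS = {
    "北京": "north", "天津": "north", "沈阳": "north",
    "广州": "south", "深圳": "south", "成都": "south",
    "上海": "east", "杭州": "east", "南京": "east",
}


def assignee_for(city, default="collector_a"):
    """Return the Profile responsible for a city, or the default if unmapped."""
    region = CITY_REGIONS.get(city)
    return REGION_PROFILES.get(region, default)


assignments = {city: assignee_for(city) for city in ["北京", "深圳", "杭州", "拉萨"]}
```

The returned name would be passed as the `--assignee` value when the task is created. An explicit default keeps cities that are not yet in the table from being dropped.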

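11.2 Assign Profiles by Policy Type

collector_medical  → handles 01, cross-region medical care
collector_housing  → handles 03, cross-region housing provident fund loans
collector_school   → handles 06, children's schooling

Advantage: specialization. Each Profile's Workers can build up collection experience in their own domain.

11.3 Priority Queue

hermes kanban create "Collect: Shenzhen household registration" \
    --assignee collector_a \
    --priority high \
    --body '{"city":"深圳","policy":"07"}'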

12 Appendix

12.1 Related Skills

Skill name	Description
`policy-batch-collector`	Batch collection scheduler for Chinese city policies
`kanban-orchestrator`	Conventions for Kanban task decomposition and orchestration
`kanban-worker`	Worker execution conventions and anti-patterns
`hermes-kanban-batch-collection`	Complete OpenClaw migration guide

12.2 Key Configuration File Paths

~/.hermes/config.yaml               # Main config (scheduling parameters)
~/.hermes/profiles/<name>/          # Per-Profile isolated config
~/.hermes/skills/                   # Centralized Skill directory
~/.hermes/kanban.db                 # Task queue (SQLite)
~/.hermes/logs/gateway.log          # Scheduler log

12.3 Command Quick Reference

# Profile management
hermes profile create <name>
hermes profile list
hermes profile show <name>

# Skill management
hermes skills install <skill-id>
hermes skills list

# Kanban operations
hermes kanban create "title" --assignee <profile> --body "json"
hermes kanban list --json
hermes kanban dispatch
hermes kanban show <task-id> --json

# Logs
tail -f ~/.hermes/logs/gateway.log

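Document Revision History

Version	Date	Changes
v1.0	2026-05-12	Initial draft completed

Next steps

This is an architecture design document. For the actual deployment, refer to the corresponding deployment checklist.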