What Is HeraldVox

HeraldVox (唤达) is a voice interface layer for Coding Agents. The core flow is simple: the user speaks → speech recognition → natural language understanding → command dispatched to the Coding Agent → the Agent executes → the result is read back to the user via TTS.
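The flow above can be sketched as a chain of four stages. This is only an illustrative outline, not HeraldVox's actual code: the `VoicePipeline` class, its field names, and the stand-in lambdas are all hypothetical, and each stage is reduced to a plain function so the data flow is visible.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Hypothetical wiring of the stages named in the text."""
    transcribe: Callable[[bytes], str]  # STT: audio -> transcript
    understand: Callable[[str], str]    # NLU: transcript -> agent command
    run_agent: Callable[[str], str]     # Coding Agent: command -> result text
    speak: Callable[[str], bytes]       # TTS: result text -> audio

    def handle_utterance(self, audio: bytes) -> bytes:
        text = self.transcribe(audio)
        command = self.understand(text)
        result = self.run_agent(command)
        return self.speak(result)


# Stand-in stages (assumptions for illustration): "audio" is just UTF-8 text.
pipeline = VoicePipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    understand=lambda text: text.strip().lower(),
    run_agent=lambda command: f"done: {command}",
    speak=lambda text: text.encode("utf-8"),
)

reply = pipeline.handle_utterance(b"  Run the tests ")
```

Keeping each stage behind a plain callable is one way to swap in a real recognizer, agent, or synthesizer without touching the control flow.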

System Architecture · Three Layers
Voice Interface Layer (self-built): wake word detection, STT, dialog management, TTS
Agent Orchestration Layer: multi-agent concurrency, command routing, E2E encrypted communication, mobile remote control
Coding Agent Execution Layer: Claude Code, Codex, Gemini CLI, Openclaw
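To make "multi-agent concurrency" in the orchestration layer concrete, here is a minimal sketch using Python's `asyncio`. It assumes each agent can be driven as an independent asynchronous task; `run_agent` and `run_all` are hypothetical names, and the sleep is a stand-in for real agent work. How HeraldVox actually schedules agents is not described in the text.

```python
import asyncio
from typing import List, Tuple


async def run_agent(name: str, command: str, seconds: float) -> str:
    # Stand-in for real agent work (assumption): just wait, then report.
    await asyncio.sleep(seconds)
    return f"{name}: finished '{command}'"


async def run_all(jobs: List[Tuple[str, str, float]]) -> List[str]:
    # gather() runs the tasks concurrently and returns results in input order.
    return await asyncio.gather(*(run_agent(*job) for job in jobs))


results = asyncio.run(run_all([
    ("Claude Code", "refactor auth", 0.02),
    ("Codex", "write tests", 0.01),
]))
```

Both jobs wait at the same time, so the total wall time is roughly that of the slowest job rather than the sum.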

The Problem: Terminals Lock Most People Out

Coding Agents are the most powerful way to use AI right now: they can operate your computer, run commands, write code, and automate almost everything. But their default interface looks like this:

bash — 80×24
user@~/project $

I have two kids, one just over one year old and one just over three. A large part of my day goes to being with them and doing housework, so I am constantly stepping away from my computer. But every time I wanted to give an Agent a new instruction, I had to sit down, open a terminal, and type. Every time I left, the Agent just sat there waiting for me.

The problem wasn't that AI wasn't powerful enough. The problem was that the interface chained me to my desk.

Voice Engine: Fully Self-Built Pipeline

This is the most technically complex part of the product. From wake word to final TTS output, the entire pipeline is self-built, with no dependency on any third-party voice platform.

Wake Word (local, offline) → STT (Chinese/English, technical term optimization) → Dialog Management (multi-turn context) → Agent Execution (parallel agents) → TTS (streaming, low latency)

Wake word detection runs locally and offline with no network dependency, so no audio reaches any server before activation. STT is specifically optimized for technical terminology: in Coding Agent scenarios, users say many function names, component names, and file paths, and general-purpose STT often does not recognize them accurately enough. TTS is streaming: speech is synthesized while the text is still being generated, which significantly reduces time-to-first-audio.
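Two of the ideas above can be illustrated with a short sketch. The text does not say how the term optimization or the streaming is implemented, so both functions here are assumptions: `apply_term_dict` shows one simple approach (post-correcting a transcript against a dictionary of spoken forms), and `sentence_chunks` shows how text arriving token by token could be cut into sentences so a synthesizer can start speaking before the full reply exists. The dictionary entries are invented examples.

```python
import re
from typing import Dict, Iterable, Iterator

# Hypothetical term dictionary: spoken form -> written identifier.
TERM_DICT: Dict[str, str] = {
    "use effect": "useEffect",
    "package dot json": "package.json",
    "read me": "README",
}


def apply_term_dict(transcript: str, terms: Dict[str, str] = TERM_DICT) -> str:
    """Replace spoken forms with their written technical terms."""
    for spoken, written in terms.items():
        pattern = r"\b" + re.escape(spoken) + r"\b"
        transcript = re.sub(pattern, lambda _m, w=written: w,
                            transcript, flags=re.IGNORECASE)
    return transcript


# ASCII terminators must be followed by whitespace, so the dot in
# "package.json" does not end a sentence; full-width ones end it directly.
SENTENCE_END = re.compile(r"[.!?](?=\s)|[。!?]")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as it is complete in the token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if match is None:
                break
            sentence = buffer[:match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream


fixed = apply_term_dict("open package dot json and fix the use effect hook")
chunks = list(sentence_chunks(["Tests ", "passed. ", "Two files ", "changed", ". Done"]))
```

In this sketch the first sentence becomes available after the second token, which is the property that lets synthesis begin while later text is still being generated.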

Supported Agents

HeraldVox currently supports four mainstream Coding Agents. You can run several in parallel and switch between them with a single voice command.

Claude Code (Anthropic)
Codex (OpenAI)
Gemini CLI (Google)
Openclaw (open-source general-purpose agent)
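Switching between agents by voice can be pictured as a small router that tracks which agent is active. The sketch below is an assumption about how such routing might work, not the product's real logic: the `AgentRouter` class and the "switch to <agent>" phrase are hypothetical, and "sending" a command is reduced to appending it to a per-agent list.

```python
import re
from typing import Dict, List, Tuple

AGENTS: Tuple[str, ...] = ("claude code", "codex", "gemini cli", "openclaw")


class AgentRouter:
    """Routes each utterance to the active agent, or switches agents."""

    def __init__(self, agents: Tuple[str, ...] = AGENTS) -> None:
        self.sessions: Dict[str, List[str]] = {name: [] for name in agents}
        self.active = agents[0]

    def handle(self, utterance: str) -> str:
        text = utterance.strip()
        # Hypothetical switch phrase: "switch to <agent name>".
        match = re.fullmatch(r"switch to (.+)", text.lower())
        if match:
            target = match.group(1)
            if target not in self.sessions:
                return f"unknown agent: {target}"
            self.active = target
            return f"active agent: {target}"
        self.sessions[self.active].append(text)
        return f"sent to {self.active}"


router = AgentRouter()
first = router.handle("Fix the failing login test")
switched = router.handle("Switch to Codex")
second = router.handle("Add a README")
unknown = router.handle("switch to copilot")
```

Each agent keeps its own command history in this sketch, so switching away from an agent does not discard what was already sent to it.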

End-to-End Encrypted Communication

Communication between the phone and the desktop Agent is built on libsodium, using X25519 key exchange plus XSalsa20-Poly1305 symmetric encryption. The relay server only forwards encrypted packets and cannot read any of their content.

Phone (local key) ↔ E2E ↔ Relay server (forwards only, cannot see content) ↔ E2E ↔ Desktop Agent (local key)
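The relay's role can be sketched without any cryptography at all, which is the point: it only needs routing metadata. The code below is a hypothetical model, not HeraldVox's relay. The `Envelope` and `Relay` names and the session-id routing are assumptions, and the "ciphertext" is random stand-in bytes; no encryption is performed here. The two size constants match libsodium's `crypto_box` construction (24-byte nonce, 16-byte authentication tag), which is the X25519 plus XSalsa20-Poly1305 combination the text names.

```python
import os
from dataclasses import dataclass
from typing import Dict, List

NONCE_LEN = 24  # crypto_box nonce size in bytes (XSalsa20)
MAC_LEN = 16    # crypto_box authentication tag size in bytes (Poly1305)


@dataclass(frozen=True)
class Envelope:
    session_id: str    # routing metadata: the only field the relay uses
    nonce: bytes       # public per-message nonce
    ciphertext: bytes  # opaque to the relay; would include the auth tag


class Relay:
    """Holds no keys; queues opaque envelopes per session."""

    def __init__(self) -> None:
        self.queues: Dict[str, List[Envelope]] = {}

    def forward(self, envelope: Envelope) -> None:
        # Only shape checks are possible: the relay cannot decrypt anything.
        if len(envelope.nonce) != NONCE_LEN or len(envelope.ciphertext) < MAC_LEN:
            raise ValueError("malformed envelope")
        self.queues.setdefault(envelope.session_id, []).append(envelope)

    def drain(self, session_id: str) -> List[Envelope]:
        return self.queues.pop(session_id, [])


relay = Relay()
# Stand-in for an encrypted payload (assumption): random bytes, not real output
# of an encryption call.
envelope = Envelope("session-1", os.urandom(NONCE_LEN), os.urandom(MAC_LEN + 32))
relay.forward(envelope)
delivered = relay.drain("session-1")
```

In a real deployment the phone and desktop would produce and open the ciphertext with keys that never leave the devices, so nothing in `Relay` would need to change.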

The underlying layer is built on an open-source project with 16,000+ GitHub stars. The code is auditable, and no data is collected.

How I Did It: Using AI Agents to Build AI Agents

The most interesting thing about this project is that the tools I used to build HeraldVox are the very tools it now supports. My workflow looked like this:

1. Describe the requirement (natural language, in Chinese): by voice or text, explain how the feature should work.
2. The Agent generates code: the Coding Agent understands the requirement and writes the implementation.
3. I review the logic and run tests: I judge whether the product behavior matches intent, without reading every line of code.
4. Find issues, then describe the fix: I tell the Agent in plain language what is wrong and how it should change.
5. Loop until it passes: repeat the steps above. I own product judgment, and the Agent owns implementation.
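The loop above can be written down as a small function. This is a sketch of the process only, with hypothetical names throughout: `generate` stands in for a Coding Agent, `review` stands in for the human check plus tests and returns `None` when the result is accepted, and `max_rounds` is an assumed safety limit that the text does not mention.

```python
from typing import Callable, Optional, Tuple


def build_feature(
    requirement: str,
    generate: Callable[[str], str],          # stand-in for the Coding Agent
    review: Callable[[str], Optional[str]],  # returns feedback, or None if OK
    max_rounds: int = 5,
) -> Tuple[str, int]:
    prompt = requirement
    for round_number in range(1, max_rounds + 1):
        code = generate(prompt)
        feedback = review(code)
        if feedback is None:
            return code, round_number
        # Describe the fix in plain language and try again.
        prompt = f"{requirement}\nFix: {feedback}"
    raise RuntimeError(f"not accepted after {max_rounds} rounds")


# Toy stand-ins (assumptions): the first attempt forgets the user's name.
def toy_generate(prompt: str) -> str:
    return "greet(name) v2" if "Fix:" in prompt else "greet() v1"


def toy_review(code: str) -> Optional[str]:
    return None if "name" in code else "greeting must include the user's name"


code, rounds = build_feature("Greet the user by name", toy_generate, toy_review)
```

The reviewer only ever returns behavior-level feedback, which mirrors the division of labor in the list: product judgment on one side, implementation on the other.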

I built this company while raising two kids. Days with the kids, nights writing code. Sometimes I got up at 3 a.m. to debug, then was woken again at 6 when the younger one cried. That rhythm went on for nearly a year. The product's core business code now exceeds 200,000 lines.

It's Live Now

HeraldVox is a web app adapted for both mobile and desktop, with nothing to download or install. The current version is completely free.

Start for free

Free · End-to-end encrypted · Zero data collection