I'm Not a Programmer. I Built a 200,000-Line Product Solo Using Voice-Controlled AI.

What Is HeraldVox

HeraldVox (唤达) is a voice interface layer for Coding Agents. The core flow is simple: the user speaks → speech recognition → natural language understanding → a command is sent to the Coding Agent → the Agent executes it → the result is read back to the user via text-to-speech (TTS).
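The flow above can be sketched as a chain of four stages. This is only an illustration, not HeraldVox's actual code: the `Pipeline` class, its field names, and the stub stages are all hypothetical, and each stub stands in for a real component (speech recognizer, language understanding, Coding Agent, speech synthesizer).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    """Hypothetical voice pipeline: each field is one stage of the flow."""
    stt: Callable[[bytes], str]        # audio -> transcript
    understand: Callable[[str], str]   # transcript -> agent instruction
    run_agent: Callable[[str], str]    # instruction -> result text
    tts: Callable[[str], bytes]        # result text -> audio

    def handle(self, audio: bytes) -> bytes:
        transcript = self.stt(audio)
        instruction = self.understand(transcript)
        result = self.run_agent(instruction)
        return self.tts(result)


# Stub stages so the sketch runs without any real speech or agent backend.
pipeline = Pipeline(
    stt=lambda audio: audio.decode("utf-8"),
    understand=lambda text: text.strip().lower(),
    run_agent=lambda instruction: f"done: {instruction}",
    tts=lambda text: text.encode("utf-8"),
)

reply = pipeline.handle(b"  Run the tests ")
```

Keeping each stage behind a plain function signature means any one of them could be swapped out without touching the others.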

System Architecture: Three Layers

Voice Interface Layer (self-built)

🎙 Wake word detection · 👂 STT · 🧠 Dialog management · 🔊 TTS

↕

Agent Orchestration Layer

⚡ Multi-agent concurrency · 🔀 Command routing · 🔒 E2E encrypted communication · 📱 Mobile remote control

↕

Coding Agent Execution Layer

Claude Code · Codex · Gemini CLI · Openclaw

The Problem: Terminals Lock Most People Out

Coding Agents are currently the most powerful way to use AI: they can operate your computer, run commands, write code, and automate almost anything. But the default interface looks like this:

bash — 80×24

user@~/project $

I have two kids, one just over a year old and one just over three. Much of my day goes to looking after them and doing housework, so I am constantly stepping away from the computer. But every time I needed to give an Agent an instruction, I had to sit down, open a terminal, and type. And every time I walked away, the Agent just sat there waiting for me.

The problem wasn't that AI wasn't powerful enough. The problem was that the interface chained me to my desk.

Voice Engine: Fully Self-Built Pipeline

This is the most technically complex part of the product. From wake word to final TTS output, the entire pipeline is self-built, with no dependency on any third-party voice platform.

🎙 Wake Word: local, offline

👂 STT: Chinese and English, with technical-term optimization

🧠 Dialog Management: multi-turn context

⚡ Agent Execution: multiple Agents in parallel

🔊 TTS: streaming, low latency

Wake word detection runs locally and offline: it needs no network connection, and no audio reaches any server before activation. STT is specifically optimized for technical terminology, because in Coding Agent scenarios users say many function names, component names, and file paths that general-purpose STT often gets wrong. TTS is streamed: audio is synthesized while the text is still being generated, which significantly reduces time-to-first-audio.
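Two of the ideas above are easy to illustrate in a few lines. The sketch below is an assumption about how such steps could work, not a description of HeraldVox's engine: `TERM_DICT`, `correct_terms`, and `sentence_chunks` are hypothetical names, and the dictionary entries are made-up examples. The first function rewrites common mis-hearings into project-specific spellings after recognition; the second releases each sentence to the synthesizer as soon as it is complete instead of waiting for the full reply.

```python
import re
from typing import Iterable, Iterator, Mapping

# Hypothetical term dictionary: what the recognizer tends to hear -> intended spelling.
TERM_DICT = {
    "use effect": "useEffect",
    "package dot json": "package.json",
    "read me": "README",
}


def correct_terms(transcript: str, terms: Mapping[str, str] = TERM_DICT) -> str:
    """Replace known mis-hearings with the intended technical term (case-insensitive)."""
    for heard, intended in terms.items():
        pattern = r"\b" + re.escape(heard) + r"\b"
        transcript = re.sub(
            pattern, lambda _m, value=intended: value, transcript, flags=re.IGNORECASE
        )
    return transcript


# A sentence ends at ., ! or ? followed by whitespace, or at a CJK full stop.
SENTENCE_END = re.compile(r"[.!?](?=\s)|[。！？]")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as it is complete, so synthesis can start early."""
    buffer = ""
    for token in tokens:
        buffer += token
        match = SENTENCE_END.search(buffer)
        while match is not None:
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
            match = SENTENCE_END.search(buffer)
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


fixed = correct_terms("Open package dot json and fix the Use Effect hook")
chunks = list(sentence_chunks(["Tests ", "passed. ", "Two files", " changed. Done"]))
```

With sentence-level chunks, the first piece of audio can be produced after the first sentence arrives rather than after the whole response, which is where the time-to-first-audio saving comes from.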

Supported Agents

HeraldVox currently supports four major Coding Agents. You can run several in parallel and switch between them with a single voice command.

Claude Code · Anthropic

Codex · OpenAI

Gemini CLI · Google

🔓 Openclaw · Open-source general-purpose Agent
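Switching between Agents by voice can be pictured as a small router that keeps one session per Agent and tracks which one is active. The sketch below is hypothetical: the `Router` class and the "switch to ..." phrasing are assumptions for illustration, and a plain list stands in for each Agent's running session, so no real concurrency is shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List

AGENTS = ("Claude Code", "Codex", "Gemini CLI", "Openclaw")


@dataclass
class Router:
    """Hypothetical command router: one instruction queue per Agent."""
    active: str = AGENTS[0]
    queues: Dict[str, List[str]] = field(
        default_factory=lambda: {name: [] for name in AGENTS}
    )

    def handle(self, utterance: str) -> str:
        text = utterance.strip()
        lowered = text.lower()
        if lowered.startswith("switch to "):
            wanted = lowered[len("switch to "):].strip()
            for name in AGENTS:
                if name.lower() == wanted:
                    self.active = name
                    return f"Switched to {name}"
            return f"Unknown agent: {wanted}"
        # Anything that is not a switch command goes to the active Agent.
        self.queues[self.active].append(text)
        return f"Sent to {self.active}"


router = Router()
first = router.handle("Run the tests")
switched = router.handle("Switch to Gemini CLI")
second = router.handle("Fix the lint errors")
```

Because every Agent keeps its own queue, an instruction sent before a switch stays with the Agent it was meant for.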

End-to-End Encrypted Communication

Communication between the phone and the desktop Agent is built on libsodium, using X25519 key exchange and XSalsa20-Poly1305 symmetric encryption. The relay server only forwards encrypted packets and cannot read their contents.

📱 Phone (local key) → 🔒 E2E → ☁️ Relay server (forwards only, sees nothing) → 🔒 E2E → 💻 Desktop Agent (local key)

Built on an open-source project with 16,000+ GitHub stars: the code is auditable, with zero data collection.

How I Did It: Using AI Agents to Build AI Agents

The most interesting thing about this project is that the tools I used to build HeraldVox are the very tools it now supports. My workflow looked like this:

Describe the requirement (in natural language, in Chinese): by voice or text, explain clearly how the feature should work.

Agent generates code: the Coding Agent interprets the requirement and writes the implementation.

I review the logic and run tests: I judge whether the product behaves as intended, without needing to read every line of code.

Find issues, then describe the fix: I tell the Agent in plain language what is wrong and how it should change.

Loop until it passes: repeat the steps above. I own product judgment; the Agent owns implementation.

I built this while raising two kids: days with them, nights writing code. Sometimes I got up at 3 a.m. to chase a bug, only to be woken again at 6 when the younger one cried. That rhythm went on for nearly a year. The product's core business code now exceeds 200,000 lines.

It's Live Now

HeraldVox is a web app that works on both mobile and desktop, with nothing to download or install. The current version is completely free.

Start for free

Free · End-to-end encrypted · Zero data collection
