Distributed SFU probers
voicegw livekit sfu --load measures SFU capacity from a single host. That answers “what can this one machine push,” not “does my SFU hold up when clients from several regions join the same room at once.” Distributed mode answers the second question: one coordinator synchronizes N probers, each running from a different region, all ramping the same room together.
See the voicegw livekit sfu reference for the combined-report format. This page covers deploying the probers.
How it works
- The coordinator (sfu --coordinator --expect N) serves a small HTTP barrier. It hands every prober the same job (room, ramp tiers, duration, thresholds) and a shared start_at timestamp once all N have registered.
- Each prober (sfu --report-to <url> --vantage <label>) registers, waits for the barrier so all vantages start at the same instant, ramps the shared room, and posts its per-tier measurements back.
- When every prober has reported, the coordinator aggregates (summing clients per tier, taking the worst rtt / loss / quality), prints the combined capacity, and deletes the shared rooms.
The coordinator needs the [server] extra (pip install 'voicegateway[server]') for its HTTP layer. Probers need only the base install.
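The aggregation rule described above (sum clients per tier, keep the worst rtt / loss / quality) can be sketched in a few lines. This is an illustrative sketch, not the coordinator's actual code: the TierReport type, its field names and units, and the assumption that a higher quality score is better are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierReport:
    # Hypothetical per-tier measurement; field names and units are assumptions.
    clients: int
    rtt_ms: float
    loss_pct: float
    quality: float  # assumed: higher is better


def aggregate(reports: dict[str, list[TierReport]]) -> list[TierReport]:
    """Combine per-vantage tier reports into one combined report per tier."""
    combined = []
    # zip() pairs tier 1 of every vantage, then tier 2, and so on.
    # It stops at the shortest list, so a vantage that reported fewer
    # tiers truncates the combined result.
    for tier in zip(*reports.values()):
        combined.append(
            TierReport(
                clients=sum(r.clients for r in tier),      # total clients in the room
                rtt_ms=max(r.rtt_ms for r in tier),        # worst (highest) rtt
                loss_pct=max(r.loss_pct for r in tier),    # worst (highest) loss
                quality=min(r.quality for r in tier),      # worst (lowest) quality
            )
        )
    return combined


reports = {
    "sjc": [TierReport(10, 40.0, 0.1, 4.5), TierReport(20, 55.0, 0.4, 4.2)],
    "fra": [TierReport(10, 120.0, 0.3, 4.0), TierReport(20, 180.0, 1.2, 3.1)],
}
combined = aggregate(reports)
```

With these made-up numbers, the second combined tier carries 40 clients and takes its rtt, loss, and quality from the fra vantage, because fra is the worst on all three.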
Coordinator
Run the coordinator somewhere the probers can reach over HTTP (a bastion host, a small VM, or a Fly machine with an internal address).
Probers on Fly.io
The deploy/prober/ directory ships a Dockerfile and an example fly.toml. The image is a run-to-completion job that runs one prober and exits.
On Fly, each prober reads FLY_REGION and reports it as its vantage label, so a machine in sjc shows up as the sjc vantage with no per-region config. The ramp and duration are dictated by the coordinator (every vantage runs the same job), so there is nothing to set for them on the prober.
Any host that can run a container works the same way: set COORDINATOR_URL, the LiveKit creds, and VOICEGW_REGION, then run the image. Fly is just a convenient way to place probers in specific regions.
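One way to picture how a prober might choose its vantage label from the environment is sketched below. The function name resolve_vantage is hypothetical, and the lookup order (FLY_REGION first, then VOICEGW_REGION) is an assumption made for illustration; the real image may resolve the two variables differently.

```python
import os


def resolve_vantage(env=None):
    """Return the vantage label from the environment.

    Assumption: FLY_REGION is checked first, then VOICEGW_REGION.
    The actual precedence inside the prober image is not documented here.
    """
    env = os.environ if env is None else env
    for key in ("FLY_REGION", "VOICEGW_REGION"):
        value = env.get(key, "").strip()
        if value:
            return value
    raise ValueError("set FLY_REGION or VOICEGW_REGION to label this vantage")


# On Fly the platform sets FLY_REGION; elsewhere the operator sets VOICEGW_REGION.
on_fly = resolve_vantage({"FLY_REGION": "sjc"})
elsewhere = resolve_vantage({"VOICEGW_REGION": "home-lab"})
```

Passing a plain dict instead of os.environ keeps the sketch easy to try without changing the real environment.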
Limitations
- The coordinator endpoint is unauthenticated. /register, /report, and /result have no auth, so anyone who can reach the port can inject fake reports or read the result. Run the coordinator on a private network the probers can reach (a VPC, Fly private networking, an SSH tunnel), not a public interface, and only for the duration of the run.
- Per-tier concurrency drifts after the first tier. The barrier synchronizes only the shared start; each vantage then advances to its next ramp tier as soon as its own measurement finishes. Vantages stay aligned at the first tier, but faster ones run ahead on later tiers, so the combined per-tier client sums are exact at tier one and an upper bound thereafter. Read the combined knee as approximate, and lean on the baseline and first-tier numbers for the tightest signal.
- If a prober dies, the run degrades rather than hangs. The coordinator stops after its timeout (default 10 minutes) and aggregates whatever reported; a prober that never clears the barrier gives up after its own timeout. A missing vantage shows up under dropped in the report.
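The degrade-rather-than-hang behavior in the last item can be sketched as a bounded wait. This is an illustration under assumptions, not the coordinator's implementation: collect_reports, the poll callback, and the polling interval are hypothetical, and expected is taken to be the set of vantage labels the coordinator is waiting on. Only the 10-minute default comes from the text above.

```python
import time

DEFAULT_TIMEOUT_S = 600.0  # the 10-minute default stated above


def collect_reports(expected, poll, timeout_s=DEFAULT_TIMEOUT_S,
                    clock=time.monotonic, sleep=time.sleep, interval_s=1.0):
    """Wait for reports until all arrive or the deadline passes.

    `poll` is a hypothetical callable returning {vantage: report} for the
    reports received so far. Returns (reports, dropped).
    """
    deadline = clock() + timeout_s
    reports = {}
    while len(reports) < len(expected) and clock() < deadline:
        reports.update(poll())
        if len(reports) < len(expected):
            sleep(interval_s)  # back off before polling again
    # Anything expected but never reported is listed as dropped
    # instead of blocking the run forever.
    dropped = sorted(set(expected) - set(reports))
    return reports, dropped


# Simulated run with a fake clock: "fra" never reports.
now = [0.0]
reports, dropped = collect_reports(
    ["sjc", "fra"],
    poll=lambda: {"sjc": {"tiers": []}},
    timeout_s=3.0,
    clock=lambda: now[0],
    sleep=lambda s: now.__setitem__(0, now[0] + s),
)
```

In the simulated run the loop gives up once the fake clock reaches the 3-second deadline, and the aggregation proceeds with the single report that arrived.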
Cost and safety
Distributed probing opens real SFU connections from many hosts at once. Unlike latency, it does not invoke STT/LLM/TTS providers (there is no agent in the loop), so there is no per-turn provider cost. It does consume SFU capacity for the duration of the ramp, so run it against a test project or during a maintenance window, not against production traffic.
Related
- voicegw livekit sfu: the command reference and combined-report format.
- Deploy on Fly.io: deploying the VoiceGateway engine itself.