Performance characteristics¶
This page is the user-facing companion to the design doc
design/perf_baseline.md. It covers
what to expect from Gossamer's runtime in production: garbage
collection behaviour, per-goroutine memory, scheduling under
load, and how the compiled tier compares to Go and Rust.
The numbers below are reproducible on commodity hardware; the runs in this document are from an AMD Ryzen 9 9900X (12 cores / 24 threads, 64 GiB DDR5-6000) on Linux 6.17. Your numbers will differ; the shape of the curves is what matters.
Compiled-tier benchmarks¶
| benchmark | mode | secs | cpu secs | mem (kb) | gz (bytes) |
|---|---|---|---|---|---|
| fasta (N=25M) | Rust reference | 0.84 | 4.32 | 8 660 | 1 933 |
| fasta (N=25M) | Go reference | 0.63 | 2.56 | 17 956 | 1 827 |
| fasta (N=25M) | Gossamer --release |
0.49 | 2.02 | 1 564 288 | 2 297 |
| n-body (N=50M) | Rust reference | 1.46 | 1.46 | 1 952 | 2 033 |
| n-body (N=50M) | Go reference | 2.13 | 2.13 | 2 204 | 1 215 |
| n-body (N=50M) | Gossamer --release (opt) |
1.69 | 1.68 | 2 120 | 2 628 |
Reproduce with bash bench/game/run.sh against any commit. The
multi-thread fasta uses 8 worker goroutines + a shared I64Vec;
the n-body opt variant uses precomputed pair distances to match
the Rust reference's algorithm shape.
The fasta memory cost (~1.5 GiB) is the price of the
I64Vec-backed parallel buffer: 8 bytes per output character
× 50M characters in flight ≈ 400 MB per section, doubled across
the two random sections. The single-thread variant
(fasta_old.gos) runs at ~1.8 s using ~2 MiB.
Garbage collector¶
Gossamer ships a stop-the-world mark-sweep GC. Expect:
- Pause time scales with live-set size, not heap size.
Steady-state pauses on a 200 MiB live set: ~3–8 ms in the
mem_stats()histogram. Spikes when many large allocations retire at once. - Heap growth target is 1.5× live (configurable via
GOSSAMER_GC_TARGET). Higher = fewer pauses, more RAM. - Allocation rate for typical service code: 200–500 MB/s
per thread. Closures,
String::new, and tuple-return idioms are the hot allocators; considerVec::with_capacity-style preallocation in inner loops. - Memory not returned to OS aggressively. The runtime reuses freed pages; OS-visible RSS only shrinks under prolonged idle. Container memory limits should be sized to ~2× peak working set.
Tail-latency-sensitive services (p99 < 10 ms) should:
- Pre-allocate Vec / HashMap with
with_capacity. - Avoid per-request closures; reuse handlers.
- Consider pooling large buffers via
bytes::Buffer.
Goroutines¶
In v1, every goroutine is a real OS thread. This has three consequences:
- Memory cost per goroutine ≈ 1 MiB stack (Linux default,
reducible via
ulimit -s) + ~32 KiB runtime metadata. Spawning 10 000 goroutines costs ~10 GiB of RSS unless you shrink stacks. - Spawn cost ≈ 30–80 µs per
go expr(kernelclone+ Gossamer scheduler bookkeeping). Compared to Go's ~2 µs spawn this is slow by two orders of magnitude. - Context switches are kernel-driven. The OS scheduler is in charge; Gossamer does not park goroutines in user space.
This is the Go-1-style implementation, not the modern Go scheduler. It is correct, simple, and sufficient for service loads up to a few thousand concurrent goroutines. M:N scheduling lands in v1.x and brings goroutine cost down to Go-class.
For workloads that today need 10k+ goroutines (Go's bread-and-butter "a goroutine per connection" pattern), v1 is workable but not ideal — limit goroutines to a worker pool sized at 2–4× the core count.
Throughput vs. latency¶
The runtime favours throughput:
- 64 KiB stdout buffer with line-buffer flush. Programs that print 25 MB in tight inner loops (fasta) reach the kernel via ~400 syscalls instead of 25 million.
- Inline-cached method dispatch in the bytecode VM saves the
HashMap::getlookup on hot paths. - LLVM
-O3 -mcpu=nativeper-function with auto-vectorisation enabled. Inner loops match therustc -C target-cpu=nativeoutput for primitive arithmetic. - Per-function fallback to Cranelift means a partially-LLVM program still compiles. Cranelift's output is roughly 1.5–2× slower than LLVM for numeric kernels.
Latency tuning options:
GOSSAMER_GC_TARGET=2.0— fewer collections, higher peak RSS.GOSSAMER_PROCS=N— pin scheduler to N OS threads.- Pre-warm caches on startup; the first request after launch pays JIT + GC initialisation overhead.
Scheduling under load¶
Sustained load behaviour, measured against bench/game/fasta
running 8 fan-out workers at N=25M:
- All 8 OS threads pegged at ≥99% utilisation throughout. No scheduler-induced idle.
- Output ordering deterministic: each worker fills a non-overlapping
range of the shared
I64Vec, thenWaitGroup::waitjoins before the bulk write. - No global mutex contention in the steady state. The GC's STW phase is the only synchronisation point.
For HTTP services the picture is different: each request runs on its own goroutine, and idle goroutines park on socket reads. RSS grows linearly with concurrent connection count due to the 1 MiB-per-goroutine stack.
Comparison to Go's runtime¶
| Metric | Go (current) | Gossamer (v1) | Notes |
|---|---|---|---|
| Goroutine spawn | ~2 µs | ~30–80 µs | M:N pending in v1.x. |
| Goroutine memory | ~8 KiB (growable stacks) | ~1 MiB (OS-thread stack) | Same as above. |
| GC pause (live=200 MB) | <1 ms (concurrent) | 3–8 ms (STW) | Concurrent GC pending. |
chan send |
~50 ns | ~150 ns | Mutex-based; lock-free queue pending. |
| HTTP throughput (echo) | ~80k req/s/core | ~30k req/s/core | M:N + chan tuning will close this. |
These are not promises — they are the targets the runtime is sized against. Your milage will vary; please publish numbers back if they diverge.
Comparison to Rust's runtime (no async)¶
For CPU-bound benchmarks, Gossamer --release is competitive
with rustc -C opt-level=3 because:
- Both lower to LLVM IR with
-O3 -mcpu=native. - Both avoid heap allocation in inner loops.
- Gossamer's runtime is lighter (no Rust standard-library threading machinery).
For allocation-heavy code, Rust pulls ahead because Rust has no
GC overhead. Code that builds many small Strings in a loop
will look better in Rust unless the Gossamer code preallocates.
Cross-references¶
design/perf_baseline.md— full baseline numbers and method.design/binary_size_baseline.md— output binary sizes.deployment.md— production tuning knobs.non_goals_v1.md— what perf gaps are intentional in v1.