Performance tuning
Tune the TopGun server for your workload: memory ceiling, eviction thresholds, write-behind buffer, and connection limits. Env-var-configurable knobs are documented below. Deeper configuration (concurrency, timeouts, connection buffers) requires programmatic ServerConfig / ConnectionConfig struct fields.
Measured benchmarks
Measured performance (single node, M1 Max)
Benchmarks measured 2026-04-18 on M1 Max using the in-process load harness (cargo bench --bench load_harness, 200 connections, 30s). Fire-and-forget measured two consecutive runs at 483K and 487K ops/sec (~0.8% variation) after the harness was updated to retry with backoff on transient socket back-pressure rather than break on the first ENOBUFS. Fire-and-wait measures round-trip to ACK; fire-and-forget measures raw push throughput. Run cargo bench --bench load_harness to reproduce on your hardware.
Memory and eviction
The server uses an in-memory record cache with LRU eviction. When memory usage rises above the high-water mark, the eviction orchestrator removes least-recently-used records until usage falls back to the low-water mark.
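The high-water/low-water behaviour can be illustrated with a minimal sketch. The `LruCache` type and `evict_tick` method below are hypothetical stand-ins, not the server's actual eviction code; the sketch only assumes that a tick does nothing until usage exceeds the high mark and then evicts from the least-recently-used end until usage reaches the low mark.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the record cache. Entries are ordered from
/// least recently used (front) to most recently used (back).
pub struct LruCache {
    entries: VecDeque<(String, usize)>, // (key, size in bytes)
    used_bytes: usize,
}

impl LruCache {
    pub fn new() -> Self {
        Self { entries: VecDeque::new(), used_bytes: 0 }
    }

    /// Insert a record at the most-recently-used end.
    pub fn insert(&mut self, key: &str, size: usize) {
        self.entries.push_back((key.to_string(), size));
        self.used_bytes += size;
    }

    pub fn used_bytes(&self) -> usize {
        self.used_bytes
    }

    /// One orchestrator tick: return immediately unless usage is above `high`;
    /// otherwise evict from the LRU end until usage is at or below `low`.
    /// Returns the evicted keys in eviction order.
    pub fn evict_tick(&mut self, high: usize, low: usize) -> Vec<String> {
        let mut evicted = Vec::new();
        if self.used_bytes <= high {
            return evicted;
        }
        while self.used_bytes > low {
            match self.entries.pop_front() {
                Some((key, size)) => {
                    self.used_bytes -= size;
                    evicted.push(key);
                }
                None => break,
            }
        }
        evicted
    }
}
```

Because a tick that starts below the high mark evicts nothing, usage between the two marks is left alone; the gap between them is what prevents a burst of writes from triggering eviction on every tick.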
All four env vars below are real — sourced from packages/server-rust/src/storage/eviction_config.rs:
# Memory ceiling and eviction thresholds (all real env vars)
TOPGUN_MAX_RAM_MB=2048 # default 1024 MB
TOPGUN_EVICTION_HIGH_PCT=85 # default 85; eviction starts at this % of ceiling
TOPGUN_EVICTION_LOW_PCT=70 # default 70; eviction stops at this % of ceiling
TOPGUN_EVICTION_INTERVAL_MS=1000 # default 1000 ms; orchestrator tick

| Variable | Default | Source | Notes |
|---|---|---|---|
| TOPGUN_MAX_RAM_MB | 1024 | eviction_config.rs:85 | RAM ceiling in MB; eviction engages above high-water mark |
| TOPGUN_EVICTION_HIGH_PCT | 85 | eviction_config.rs:105 | Percent of ceiling at which eviction starts (0–100) |
| TOPGUN_EVICTION_LOW_PCT | 70 | eviction_config.rs:125 | Percent of ceiling at which eviction stops; must be < high |
| TOPGUN_EVICTION_INTERVAL_MS | 1000 | eviction_config.rs:145 | Orchestrator tick interval in ms |
Why the 1024 MB / 85 / 70 defaults? The ceiling is a conservative value tuned for a single-node HN-demo server. The 85/70 water marks follow the Hazelcast LRU eviction reference pattern: the 15-percentage-point gap is wide enough to avoid eviction churn under bursty write loads. See Configuration defaults rationale.
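To see what the percentages mean in absolute terms, the sketch below resolves the three threshold variables into byte values and applies the constraints from the table (high within 0 to 100, low strictly below high). `EvictionSettings`, `from_lookup`, and `watermarks_bytes` are hypothetical names for illustration, not the contents of eviction_config.rs, and the sketch assumes 1 MB means 1024 × 1024 bytes.

```rust
/// Hypothetical settings struct; fields mirror the env vars documented above.
#[derive(Debug, PartialEq)]
pub struct EvictionSettings {
    pub max_ram_mb: u64,
    pub high_pct: u64,
    pub low_pct: u64,
}

impl EvictionSettings {
    /// Build settings from a lookup function such as
    /// `|k| std::env::var(k).ok()`, using the documented defaults
    /// for any variable that is unset.
    pub fn from_lookup<F>(lookup: F) -> Result<Self, String>
    where
        F: Fn(&str) -> Option<String>,
    {
        let get = |key: &str, default: u64| -> Result<u64, String> {
            match lookup(key) {
                Some(raw) => raw
                    .trim()
                    .parse::<u64>()
                    .map_err(|e| format!("{}: {}", key, e)),
                None => Ok(default),
            }
        };
        let settings = Self {
            max_ram_mb: get("TOPGUN_MAX_RAM_MB", 1024)?,
            high_pct: get("TOPGUN_EVICTION_HIGH_PCT", 85)?,
            low_pct: get("TOPGUN_EVICTION_LOW_PCT", 70)?,
        };
        if settings.high_pct > 100 {
            return Err("TOPGUN_EVICTION_HIGH_PCT must be within 0-100".to_string());
        }
        if settings.low_pct >= settings.high_pct {
            return Err("TOPGUN_EVICTION_LOW_PCT must be below TOPGUN_EVICTION_HIGH_PCT".to_string());
        }
        Ok(settings)
    }

    /// (high, low) water marks in bytes, assuming 1 MB = 1024 * 1024 bytes.
    pub fn watermarks_bytes(&self) -> (u64, u64) {
        let ceiling = self.max_ram_mb * 1024 * 1024;
        (ceiling * self.high_pct / 100, ceiling * self.low_pct / 100)
    }
}
```

Under that assumption, the defaults start eviction at roughly 870 MB of usage and stop it at roughly 717 MB.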
Write-behind buffer
The write-behind layer decouples mutation latency from persistence latency by buffering writes in per-partition coalesced queues and flushing on a configurable schedule.
All three env vars below are real — sourced from packages/server-rust/src/storage/datastores/write_behind.rs:
# Write-behind buffer (all real env vars)
TOPGUN_WRITEBEHIND_FLUSH_INTERVAL_MS=1000 # default 1000 ms; flush cadence
TOPGUN_WRITEBEHIND_BATCH_SIZE=100 # default 100; max entries per flush
TOPGUN_WRITEBEHIND_CAPACITY=10000 # default 10000; bounded buffer size

| Variable | Default | Source | Notes |
|---|---|---|---|
| TOPGUN_WRITEBEHIND_FLUSH_INTERVAL_MS | 1000 | write_behind.rs:79 | How often the background flush task runs |
| TOPGUN_WRITEBEHIND_BATCH_SIZE | 100 | write_behind.rs:99 | Max records flushed per tick |
| TOPGUN_WRITEBEHIND_CAPACITY | 10000 | write_behind.rs:119 | Bounded buffer size; once full, writes apply back-pressure |
Durability note: Write-behind holds acknowledged writes in memory for about 1 s (the default flush interval) before persisting them to disk. This is acceptable for demo-tier deployments. Crash-safe shutdown drain and WAL recovery are tracked separately (out of scope for this release). Until they land, an unclean shutdown can lose buffered writes that have not yet been flushed.
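The interaction between coalescing, the bounded capacity, and the per-tick batch size can be sketched as follows. `WriteBehindBuffer` here is a hypothetical, single-queue simplification (the server uses per-partition queues, and its real types are not shown in this guide); it assumes that a repeated write to a pending key replaces the buffered value rather than consuming another slot.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical coalescing buffer: at most one pending write per key.
pub struct WriteBehindBuffer {
    capacity: usize,
    order: VecDeque<String>,          // keys in first-write order
    pending: HashMap<String, String>, // key -> latest buffered value
}

impl WriteBehindBuffer {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, order: VecDeque::new(), pending: HashMap::new() }
    }

    /// Buffer a write. A write to an already-pending key overwrites the value
    /// in place (coalescing). Returns Err when the buffer is full so the
    /// caller can apply back-pressure.
    pub fn push(&mut self, key: &str, value: &str) -> Result<(), &'static str> {
        if let Some(slot) = self.pending.get_mut(key) {
            *slot = value.to_string();
            return Ok(());
        }
        if self.pending.len() >= self.capacity {
            return Err("buffer full");
        }
        self.order.push_back(key.to_string());
        self.pending.insert(key.to_string(), value.to_string());
        Ok(())
    }

    /// One flush tick: remove and return up to `batch_size` entries,
    /// oldest key first, each carrying its latest value.
    pub fn flush(&mut self, batch_size: usize) -> Vec<(String, String)> {
        let mut batch = Vec::new();
        while batch.len() < batch_size {
            match self.order.pop_front() {
                Some(key) => {
                    if let Some(value) = self.pending.remove(&key) {
                        batch.push((key, value));
                    }
                }
                None => break,
            }
        }
        batch
    }

    pub fn len(&self) -> usize {
        self.pending.len()
    }
}
```

In this model, anything still in `pending` when the process dies is exactly the data the durability note warns about.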
Programmatic tuning
Deeper configuration is controlled via Rust struct fields, not env vars. To change these values, modify the config structs and rebuild the binary.
// packages/server-rust/src/service/config.rs
pub struct ServerConfig {
pub max_concurrent_operations: u32, // default 1000
pub default_operation_timeout_ms: u64, // default 30000
pub max_query_records: u32, // default 10000
pub gc_interval_ms: u64, // default 60000
// ...
}
// packages/server-rust/src/network/config.rs
pub struct ConnectionConfig {
pub outbound_channel_capacity: usize, // default 256
pub send_timeout: Duration, // default 5s
pub idle_timeout: Duration, // default 60s
pub ws_write_buffer_size: usize, // default 128 KB
pub ws_max_write_buffer_size: usize, // default 512 KB
}
pub struct NetworkConfig {
pub request_timeout: Duration, // default 30s
pub max_body_size: usize, // default 2 MB
pub cors_max_age: Duration, // default 86400s
pub rate_limit_per_ip: u32, // default 100
pub rate_limit_burst: u32, // default 50
// ...
}

Not env-var configurable today
These fields are Rust struct defaults. Env-var overrides for concurrency, timeout, and connection-buffer settings are tracked at TODO-400. Until then, modify the config structs and rebuild.
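As an illustration of overriding a few fields while keeping the rest at their defaults, the sketch below uses Rust's struct update syntax. The `ConnectionConfig` defined here is a local stand-in that mirrors the fields listed above; whether the real struct implements `Default`, and whether KB means 1024 bytes, are assumptions made for this example. `slow_consumer_config` is a hypothetical helper.

```rust
use std::time::Duration;

/// Local stand-in mirroring the ConnectionConfig fields listed above.
/// The Default impl is an assumption for illustration; the real struct in
/// packages/server-rust/src/network/config.rs may be constructed differently.
#[derive(Debug, Clone)]
pub struct ConnectionConfig {
    pub outbound_channel_capacity: usize,
    pub send_timeout: Duration,
    pub idle_timeout: Duration,
    pub ws_write_buffer_size: usize,
    pub ws_max_write_buffer_size: usize,
}

impl Default for ConnectionConfig {
    fn default() -> Self {
        Self {
            outbound_channel_capacity: 256,
            send_timeout: Duration::from_secs(5),
            idle_timeout: Duration::from_secs(60),
            ws_write_buffer_size: 128 * 1024,     // 128 KB, assuming KB = 1024 bytes
            ws_max_write_buffer_size: 512 * 1024, // 512 KB
        }
    }
}

/// Hypothetical override for slow consumers on high-latency links:
/// a deeper outbound queue and a longer send timeout, everything else default.
pub fn slow_consumer_config() -> ConnectionConfig {
    ConnectionConfig {
        outbound_channel_capacity: 1024,
        send_timeout: Duration::from_secs(10),
        ..ConnectionConfig::default()
    }
}
```

Keeping overrides in one small function like this makes it easier to see, at rebuild time, which values differ from the shipped defaults.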
ServerConfig fields
| Field | Default | When to adjust |
|---|---|---|
| max_concurrent_operations | 1000 | Increase when load-shed errors appear under high load |
| default_operation_timeout_ms | 30000 ms | Decrease for faster failure detection; increase for complex queries |
| max_query_records | 10000 | Increase for large dataset queries; decrease if memory constrained |
| gc_interval_ms | 60000 ms | Decrease if stale data accumulating; increase if GC overhead visible |
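To make the link between max_concurrent_operations and load-shed errors concrete, here is a minimal sketch of a concurrency cap. This guide does not describe how the server enforces the limit, so `OperationLimiter` is purely hypothetical: it admits an operation only while the in-flight count is below the cap and rejects (sheds) it otherwise.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical concurrency cap, analogous in spirit to
/// max_concurrent_operations (not the server's actual mechanism).
pub struct OperationLimiter {
    max: u32,
    in_flight: AtomicU32,
}

impl OperationLimiter {
    pub fn new(max: u32) -> Self {
        Self { max, in_flight: AtomicU32::new(0) }
    }

    /// Try to admit one operation. Returns false (the operation is shed)
    /// when the number of in-flight operations has reached the cap.
    pub fn try_acquire(&self) -> bool {
        let mut current = self.in_flight.load(Ordering::Acquire);
        loop {
            if current >= self.max {
                return false;
            }
            match self.in_flight.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                Err(actual) => current = actual, // raced with another thread; retry
            }
        }
    }

    /// Mark one admitted operation as finished.
    pub fn release(&self) {
        self.in_flight.fetch_sub(1, Ordering::AcqRel);
    }

    pub fn in_flight(&self) -> u32 {
        self.in_flight.load(Ordering::Acquire)
    }
}
```

In a model like this, raising the cap trades fewer rejections for more simultaneous work, which is why the table suggests increasing it only when shed errors actually appear.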
ConnectionConfig fields
| Field | Default | When to adjust |
|---|---|---|
| outbound_channel_capacity | 256 | Increase for slow consumers; decrease if per-connection memory is high |
| send_timeout | 5s | Increase for high-latency clients; decrease to drop slow clients faster |
| idle_timeout | 60s | Decrease to reclaim idle connections; increase if frequent reconnects |
| ws_write_buffer_size | 128 KB | Increase if large messages are being fragmented |
| ws_max_write_buffer_size | 512 KB | Increase for very large payloads |
NetworkConfig fields
| Field | Default | When to adjust |
|---|---|---|
| request_timeout | 30s | Decrease to fail hung requests faster |
| max_body_size | 2 MB | Increase for large batch writes |
| rate_limit_per_ip | 100 req/s | Decrease to throttle aggressive clients |
| rate_limit_burst | 50 | Decrease to tighten burst tolerance |
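One common way a per-second rate and a burst size combine is a token bucket, sketched below. This is an assumption about the general technique, not a description of the server's rate limiter: `TokenBucket` is a hypothetical type in which the burst value is the bucket capacity and the rate is the refill speed. Time is passed in explicitly so the behaviour is deterministic.

```rust
use std::time::Duration;

/// Hypothetical token bucket: `burst` is the bucket capacity and
/// `rate_per_sec` is how many tokens are added back per second.
pub struct TokenBucket {
    rate_per_sec: f64,
    capacity: f64,
    tokens: f64,
    last: Duration, // time of the last refill, measured from an arbitrary start
}

impl TokenBucket {
    /// The bucket starts full, so an idle client may send `burst` requests at once.
    pub fn new(rate_per_sec: u32, burst: u32) -> Self {
        Self {
            rate_per_sec: rate_per_sec as f64,
            capacity: burst as f64,
            tokens: burst as f64,
            last: Duration::ZERO,
        }
    }

    /// Refill according to elapsed time, then try to spend one token.
    /// Returns true if the request is allowed.
    pub fn allow(&mut self, now: Duration) -> bool {
        if now > self.last {
            let elapsed = (now - self.last).as_secs_f64();
            self.tokens = (self.tokens + elapsed * self.rate_per_sec).min(self.capacity);
            self.last = now;
        }
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

Under this model, lowering the burst shrinks how many requests an idle client can send back-to-back, while lowering the rate reduces the sustained throughput each IP can reach.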
OS-level tuning
For production servers handling thousands of connections, tune the operating system:
Linux sysctl settings
# Increase socket backlog
sudo sysctl -w net.core.somaxconn=65535
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=65535
# Increase file descriptor limits (applies to the current shell session only)
ulimit -n 65535
# Enable TCP keepalive tuning
sudo sysctl -w net.ipv4.tcp_keepalive_time=60
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=10
sudo sysctl -w net.ipv4.tcp_keepalive_probes=6
# Persist settings in /etc/sysctl.conf (tee runs under sudo; a plain >> redirect would not)
sudo tee -a /etc/sysctl.conf > /dev/null << EOF
net.core.somaxconn=65535
net.ipv4.tcp_max_syn_backlog=65535
net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_intvl=10
net.ipv4.tcp_keepalive_probes=6
EOF

File descriptor limits
# /etc/security/limits.conf
topgun soft nofile 65535
topgun hard nofile 65535
topgun soft nproc 65535
topgun hard nproc 65535

Docker / Kubernetes note
When running in containers, ensure the host has these settings applied. You may also need to set ulimits in your Docker Compose or Kubernetes pod spec.

Observability
Prometheus metrics are exposed at GET /metrics. All four metric names below are real — verified against packages/server-rust/src/service/middleware/metrics.rs and network/handlers/metrics_endpoint.rs:
Grafana dashboard queries
# Operation throughput
rate(topgun_operations_total[5m])
# p99 latency
histogram_quantile(0.99, rate(topgun_operation_duration_seconds_bucket[5m]))
# Error rate
rate(topgun_operation_errors_total[5m])
# Active connections
topgun_active_connections

Critical metrics
| Metric | Type | Labels | Healthy range | Action if exceeded |
|---|---|---|---|---|
| topgun_active_connections | Gauge | none | Depends on capacity | Add nodes or increase file descriptors |
| topgun_operations_total | Counter | service, outcome | Steady rate | Investigate spikes by service label |
| topgun_operation_duration_seconds | Histogram | service | p99 < 500ms | Profile slow service, check DB |
| topgun_operation_errors_total | Counter | service, error | Near 0 | Investigate by error label |
Recommended alerts
| Alert name | Condition | Severity |
|---|---|---|
| HighErrorRate | rate(topgun_operation_errors_total[5m]) > 10 | Critical |
| HighLatency | histogram_quantile(0.99, rate(topgun_operation_duration_seconds_bucket[5m])) > 0.5 | Warning |
| ConnectionSpike | topgun_active_connections > 1000 | Warning |
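The HighLatency condition compares an estimated p99 against 0.5 seconds. To show where that estimate comes from, the sketch below reproduces the linear-interpolation approach that Prometheus documents for histogram_quantile, applied to a set of cumulative buckets. The function is an illustrative re-implementation written for this guide, not code from Prometheus or TopGun, and the bucket boundaries in the usage note are made up.

```rust
/// Estimate quantile `q` (0.0..=1.0) from cumulative histogram buckets given
/// as (upper_bound_seconds, cumulative_count), sorted by upper bound, with the
/// final bucket being +Inf (f64::INFINITY).
pub fn histogram_quantile(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    let total = buckets.last()?.1;
    if total <= 0.0 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let rank = q * total; // how many observations lie at or below the quantile
    let mut lower_bound = 0.0;
    let mut lower_count = 0.0;
    for &(upper_bound, count) in buckets {
        if count >= rank {
            if upper_bound.is_infinite() {
                // The quantile falls in the +Inf bucket: report the
                // highest finite bound instead of infinity.
                return Some(lower_bound);
            }
            let in_bucket = count - lower_count;
            if in_bucket <= 0.0 {
                return Some(upper_bound);
            }
            // Assume observations are spread evenly inside the bucket.
            let fraction = (rank - lower_count) / in_bucket;
            return Some(lower_bound + (upper_bound - lower_bound) * fraction);
        }
        lower_bound = upper_bound;
        lower_count = count;
    }
    None
}
```

For example, with hypothetical buckets of 50 observations at or below 0.1 s, 90 at or below 0.5 s, and 100 at or below 1.0 s, the p99 estimate is about 0.95 s, which would trip the HighLatency alert. Because the result is interpolated within a bucket, its precision depends on how finely the bucket boundaries are spaced around the threshold.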