Performance tuning

Tune the TopGun server for your workload: memory ceiling, eviction thresholds, write-behind buffer, and connection limits. Env-var-configurable knobs are documented below. Deeper configuration (concurrency, timeouts, connection buffers) requires programmatic ServerConfig / ConnectionConfig struct fields.

Measured benchmarks

Measured performance (single node, M1 Max)

37,000+ ops/sec fire-and-wait (ACK-verified)

480,000+ ops/sec fire-and-forget

1.5 ms p50 write latency

Benchmarks were measured on 2026-04-18 on an M1 Max using the in-process load harness (cargo bench --bench load_harness, 200 connections, 30 s). Fire-and-wait measures the round trip to ACK; fire-and-forget measures raw push throughput. Two consecutive fire-and-forget runs measured 483K and 487K ops/sec (~0.8% variation) after the harness was updated to retry with backoff on transient socket back-pressure rather than break on the first ENOBUFS. Run cargo bench --bench load_harness to reproduce on your hardware.

Memory and eviction

The server uses an in-memory record cache with LRU eviction. When memory usage rises above the high-water mark, the eviction orchestrator removes least-recently-used records until it drops to the low-water mark.

All four env vars below exist in the current source (packages/server-rust/src/storage/eviction_config.rs):

.env

# Memory ceiling and eviction thresholds (all real env vars)
TOPGUN_MAX_RAM_MB=2048             # default 1024 MB
TOPGUN_EVICTION_HIGH_PCT=85        # default 85 — eviction starts at this % of ceiling
TOPGUN_EVICTION_LOW_PCT=70         # default 70 — eviction stops at this % of ceiling
TOPGUN_EVICTION_INTERVAL_MS=1000   # default 1000 ms — orchestrator tick

Variable	Default	Source	Notes
`TOPGUN_MAX_RAM_MB`	`1024`	`eviction_config.rs:85`	RAM ceiling in MB; eviction engages above high-water mark
`TOPGUN_EVICTION_HIGH_PCT`	`85`	`eviction_config.rs:105`	Percent of ceiling at which eviction starts (0–100)
`TOPGUN_EVICTION_LOW_PCT`	`70`	`eviction_config.rs:125`	Percent of ceiling at which eviction stops; must be < high
`TOPGUN_EVICTION_INTERVAL_MS`	`1000`	`eviction_config.rs:145`	Orchestrator tick interval in ms
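
To make the threshold arithmetic concrete, here is a minimal sketch of how the ceiling and the two percentages might translate into byte thresholds. The `Watermarks` type and both functions are hypothetical names chosen for illustration, not TopGun's API, and the sketch assumes 1 MB is treated as 1024 × 1024 bytes.

```rust
/// Byte thresholds derived from the documented env vars.
/// Hypothetical type for illustration; not part of TopGun's API.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Watermarks {
    high_bytes: u64,
    low_bytes: u64,
}

/// Converts a ceiling in MB plus high/low percentages into byte thresholds.
/// Returns None when the percentages violate the documented constraints
/// (high within 0..=100, low strictly below high).
fn watermarks(max_ram_mb: u64, high_pct: u64, low_pct: u64) -> Option<Watermarks> {
    if high_pct > 100 || low_pct >= high_pct {
        return None;
    }
    // Assumption: 1 MB is treated as 1024 * 1024 bytes.
    let ceiling = max_ram_mb * 1024 * 1024;
    Some(Watermarks {
        high_bytes: ceiling * high_pct / 100,
        low_bytes: ceiling * low_pct / 100,
    })
}

/// Bytes an eviction pass would need to free: nothing until usage exceeds
/// the high-water mark, then enough to fall back to the low-water mark.
fn bytes_to_evict(used_bytes: u64, w: Watermarks) -> u64 {
    if used_bytes > w.high_bytes {
        used_bytes - w.low_bytes
    } else {
        0
    }
}

fn main() {
    let w = watermarks(1024, 85, 70).expect("defaults satisfy low < high");
    println!("high = {} bytes, low = {} bytes", w.high_bytes, w.low_bytes);
    println!(
        "to evict at 900 MB used: {} bytes",
        bytes_to_evict(900 * 1024 * 1024, w)
    );
}
```

The gap between the two marks is what gives the orchestrator hysteresis: once a pass finishes, usage has to grow by the full gap before the next pass starts.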

Why the 1024 MB / 85 / 70 defaults? The ceiling is a conservative value tuned for a single-node HN-demo server. The 85/70 water marks follow the Hazelcast LRU eviction reference pattern: the 15-percentage-point gap is wide enough to avoid eviction churn under bursty write loads. See Configuration defaults rationale.

Write-behind buffer

The write-behind layer decouples mutation latency from persistence latency by buffering writes in per-partition coalesced queues and flushing on a configurable schedule.

The three buffer env vars below exist in the current source (packages/server-rust/src/storage/datastores/write_behind.rs). The table also lists TOPGUN_WAL_FSYNC_POLICY, which the durability note covers:

.env

# Write-behind buffer (all real env vars)
TOPGUN_WRITEBEHIND_FLUSH_INTERVAL_MS=1000   # default 1000 ms — flush cadence
TOPGUN_WRITEBEHIND_BATCH_SIZE=100           # default 100 — max entries per flush
TOPGUN_WRITEBEHIND_CAPACITY=10000           # default 10000 — bounded buffer size

Variable	Default	Source	Notes
`TOPGUN_WRITEBEHIND_FLUSH_INTERVAL_MS`	`1000`	`write_behind.rs:79`	How often the background flush task runs
`TOPGUN_WRITEBEHIND_BATCH_SIZE`	`100`	`write_behind.rs:99`	Max records flushed per tick
`TOPGUN_WRITEBEHIND_CAPACITY`	`10000`	`write_behind.rs:119`	Bounded buffer size; once full, writes apply back-pressure
`TOPGUN_WAL_FSYNC_POLICY`	`batched`	`write_behind.rs`	WAL fsync aggressiveness: `batched` (default, throughput), `per_op` (acked == durable under `kill -9`), or `none` (tests only)
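
As a rough sizing aid, the three buffer settings can be combined into a back-of-the-envelope estimate. The sketch below assumes that exactly one batch of at most `TOPGUN_WRITEBEHIND_BATCH_SIZE` records is flushed per tick across the whole server, and that no coalescing takes place. Both are simplifying assumptions, not documented behavior: per-partition queues and coalescing of repeated writes to the same key would raise the effective rate. The function names are hypothetical.

```rust
/// Upper bound on records persisted per second, ASSUMING one batch of at
/// most `batch_size` records is flushed per tick (not confirmed by the docs).
fn max_flush_rate_per_sec(batch_size: u64, flush_interval_ms: u64) -> f64 {
    batch_size as f64 * 1000.0 / flush_interval_ms as f64
}

/// Seconds until a buffer of `capacity` entries fills under a sustained
/// ingest rate, or None when flushing keeps up with ingest.
/// Ignores coalescing, so it is a pessimistic estimate.
fn seconds_until_full(capacity: u64, ingest_per_sec: f64, flush_per_sec: f64) -> Option<f64> {
    let net_growth = ingest_per_sec - flush_per_sec;
    if net_growth <= 0.0 {
        None
    } else {
        Some(capacity as f64 / net_growth)
    }
}

fn main() {
    // Documented defaults: batch 100, interval 1000 ms, capacity 10000.
    let flush = max_flush_rate_per_sec(100, 1000);
    println!("flush ceiling under these assumptions: {} records/s", flush);

    // Hypothetical sustained ingest of 600 distinct-key writes per second.
    match seconds_until_full(10_000, 600.0, flush) {
        Some(secs) => println!("buffer would fill after about {} s", secs),
        None => println!("flushing keeps up with ingest"),
    }
}
```

If an estimate like this shows the buffer filling quickly for your write pattern, raising the batch size or shortening the flush interval are the two settings to try first.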

Durability note: The default TOPGUN_WAL_FSYNC_POLICY=batched acks a write once its WAL frame is appended, then fsyncs on a ~10 ms group-commit timer, so an unclean kill -9 can drop the last few milliseconds of acked writes. This is a deliberate throughput default for demo-tier single-node deployments; the originating client keeps the write in local storage and re-converges on reconnect. If you need acked-implies-durable semantics, set TOPGUN_WAL_FSYNC_POLICY=per_op, which fsyncs every WAL frame before the write acks; expect materially lower write throughput in exchange. Separately, the write-behind buffer holds acked writes for ~1 s before persisting them to the durable backend. Crash-safe shutdown drain and WAL recovery are tracked separately and are out of scope for this release.
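
The three policy values and their durability guarantee can be summarized in a small sketch. The `FsyncPolicy` enum, its variant names, and the choice to reject unknown values are assumptions made for illustration; the server's own parsing code may differ, for example by falling back to the default.

```rust
use std::str::FromStr;

/// Hypothetical model of the three documented policy values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FsyncPolicy {
    /// Ack after WAL append, fsync on a group-commit timer (documented default).
    Batched,
    /// Fsync every WAL frame before acking.
    PerOp,
    /// No fsync at all; documented as tests only.
    Disabled,
}

impl FromStr for FsyncPolicy {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "batched" => Ok(FsyncPolicy::Batched),
            "per_op" => Ok(FsyncPolicy::PerOp),
            "none" => Ok(FsyncPolicy::Disabled),
            // Assumption: this sketch rejects unknown values outright.
            other => Err(format!("unknown fsync policy: {}", other)),
        }
    }
}

impl FsyncPolicy {
    /// True only when an ack guarantees the write survives `kill -9`.
    fn acked_implies_durable(self) -> bool {
        matches!(self, FsyncPolicy::PerOp)
    }
}

fn main() {
    let raw = std::env::var("TOPGUN_WAL_FSYNC_POLICY")
        .unwrap_or_else(|_| "batched".to_string());
    match raw.parse::<FsyncPolicy>() {
        Ok(policy) => println!(
            "{:?}: acked implies durable = {}",
            policy,
            policy.acked_implies_durable()
        ),
        Err(e) => eprintln!("{}", e),
    }
}
```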

Programmatic tuning

Deeper configuration is controlled via Rust struct fields, not env vars. To change these values, modify the config structs and rebuild the binary.

packages/server-rust/src/service/config.rs + network/config.rs

// packages/server-rust/src/service/config.rs
pub struct ServerConfig {
    pub max_concurrent_operations: u32,   // default 1000
    pub default_operation_timeout_ms: u64, // default 30000
    pub max_query_records: u32,           // default 10000
    pub gc_interval_ms: u64,              // default 60000
    // ...
}

// packages/server-rust/src/network/config.rs
use std::time::Duration;
pub struct ConnectionConfig {
    pub outbound_channel_capacity: usize,  // default 256
    pub send_timeout: Duration,            // default 5s
    pub idle_timeout: Duration,            // default 60s
    pub ws_write_buffer_size: usize,       // default 128 KB
    pub ws_max_write_buffer_size: usize,   // default 512 KB
}

pub struct NetworkConfig {
    pub request_timeout: Duration,         // default 30s
    pub max_body_size: usize,             // default 2 MB
    pub cors_max_age: Duration,           // default 86400s
    pub rate_limit_per_ip: u32,           // default 100
    pub rate_limit_burst: u32,            // default 50
    // ...
}

Not env-var configurable today

These fields are Rust struct defaults. Env-var overrides for concurrency, timeout, and connection-buffer settings are tracked at TODO-400. Until then, modify the config structs and rebuild.
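
One way to override a few fields while keeping the rest at their defaults is Rust's struct update syntax. The sketch below uses a local stand-in for `ServerConfig` that contains only the four fields shown above, and it assumes a `Default` implementation carrying the documented defaults. Whether the real struct implements `Default` is not confirmed here, and the override values are arbitrary examples.

```rust
/// Local stand-in for the real ServerConfig, limited to the documented fields.
/// Assumption: the real struct has more fields and may not implement Default.
#[derive(Debug, Clone, PartialEq)]
struct ServerConfig {
    max_concurrent_operations: u32,
    default_operation_timeout_ms: u64,
    max_query_records: u32,
    gc_interval_ms: u64,
}

impl Default for ServerConfig {
    /// Mirrors the defaults listed in the documentation.
    fn default() -> Self {
        ServerConfig {
            max_concurrent_operations: 1000,
            default_operation_timeout_ms: 30_000,
            max_query_records: 10_000,
            gc_interval_ms: 60_000,
        }
    }
}

/// Example override: more concurrency and a shorter timeout; every other
/// field keeps its default through struct update syntax.
fn high_throughput_config() -> ServerConfig {
    ServerConfig {
        max_concurrent_operations: 4000,
        default_operation_timeout_ms: 10_000,
        ..ServerConfig::default()
    }
}

fn main() {
    println!("{:#?}", high_throughput_config());
}
```

Keeping overrides in one small function like this makes it easier to see which values deviate from the shipped defaults when you rebuild.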

ServerConfig fields

Field	Default	When to adjust
`max_concurrent_operations`	1000	Increase when load-shed errors appear under high load
`default_operation_timeout_ms`	30000 ms	Decrease for faster failure detection; increase for complex queries
`max_query_records`	10000	Increase for large dataset queries; decrease if memory is constrained
`gc_interval_ms`	60000 ms	Decrease if stale data is accumulating; increase if GC overhead is visible

ConnectionConfig fields

Field	Default	When to adjust
`outbound_channel_capacity`	256	Increase for slow consumers; decrease if per-connection memory is high
`send_timeout`	5s	Increase for high-latency clients; decrease to drop slow clients faster
`idle_timeout`	60s	Decrease to reclaim idle connections sooner; increase if clients reconnect frequently
`ws_write_buffer_size`	128 KB	Increase if large messages are being fragmented
`ws_max_write_buffer_size`	512 KB	Increase for very large payloads
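
Before raising the buffer or channel settings, it can help to estimate the worst-case memory they imply across all connections. The sketch below is plain arithmetic with hypothetical function names. It assumes every connection fills its write buffer to `ws_max_write_buffer_size` at the same time, and it uses an assumed average message size of 1 KiB for queued channel messages, which is a made-up figure to replace with your own.

```rust
const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;

/// Worst-case bytes held in WebSocket write buffers if every connection
/// fills its buffer to the configured maximum simultaneously.
fn worst_case_write_buffer_bytes(connections: u64, ws_max_write_buffer_size: u64) -> u64 {
    connections * ws_max_write_buffer_size
}

/// Worst-case bytes queued in outbound channels, given an ASSUMED average
/// message size (the docs do not state one).
fn worst_case_channel_bytes(
    connections: u64,
    outbound_channel_capacity: u64,
    avg_message_bytes: u64,
) -> u64 {
    connections * outbound_channel_capacity * avg_message_bytes
}

fn main() {
    let connections = 1000;
    // Documented defaults: 512 KB max write buffer, channel capacity 256.
    let buffers = worst_case_write_buffer_bytes(connections, 512 * KIB);
    let channels = worst_case_channel_bytes(connections, 256, KIB);
    println!("write buffers: {} MiB", buffers / MIB);
    println!("outbound channels: {} MiB", channels / MIB);
}
```

These are upper bounds rather than typical usage, but they show how quickly per-connection settings multiply at high connection counts.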

NetworkConfig fields

Field	Default	When to adjust
`request_timeout`	30s	Decrease to fail hung requests faster
`max_body_size`	2 MB	Increase for large batch writes
`rate_limit_per_ip`	100 req/s	Decrease to throttle aggressive clients
`rate_limit_burst`	50	Decrease to tighten burst tolerance
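
To build intuition for how a sustained rate and a burst allowance interact, here is a minimal token-bucket sketch. A token bucket is one common way to implement such limits; this page does not say which algorithm the server uses, so treat the `TokenBucket` type and its behavior as an illustrative assumption. Time is passed in explicitly so the example is deterministic.

```rust
/// Illustrative token bucket: refills at `rate_per_sec`, holds at most `burst`.
/// Hypothetical type; not the server's rate limiter.
struct TokenBucket {
    rate_per_sec: f64,
    burst: f64,
    tokens: f64,
    last_secs: f64,
}

impl TokenBucket {
    /// Starts full, so a fresh client can spend the whole burst at once.
    fn new(rate_per_sec: f64, burst: f64) -> Self {
        TokenBucket { rate_per_sec, burst, tokens: burst, last_secs: 0.0 }
    }

    /// Refills for the elapsed time, then spends one token if available.
    fn try_acquire(&mut self, now_secs: f64) -> bool {
        let elapsed = (now_secs - self.last_secs).max(0.0);
        self.tokens = (self.tokens + elapsed * self.rate_per_sec).min(self.burst);
        self.last_secs = now_secs;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Documented defaults: 100 per second, burst of 50.
    let mut bucket = TokenBucket::new(100.0, 50.0);

    // 80 requests arriving at the same instant: only the burst is admitted.
    let first = (0..80).filter(|_| bucket.try_acquire(0.0)).count();
    println!("admitted at t=0: {} of 80", first);

    // A quarter second later, 0.25 s * 100/s of budget has been refilled.
    let second = (0..30).filter(|_| bucket.try_acquire(0.25)).count();
    println!("admitted at t=0.25: {} of 30", second);
}
```

Under this model, lowering the burst value tightens how many requests a client can send back to back, while lowering the rate reduces its long-run throughput.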

OS-level tuning

Linux kernel parameters for high-connection workloads.

For production servers handling thousands of connections, tune the operating system:

Linux sysctl settings

Linux kernel tuning

# Increase socket backlog
sudo sysctl -w net.core.somaxconn=65535
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=65535

# Raise the file descriptor limit (applies to the current shell session only)
ulimit -n 65535

# Tune TCP keepalive to detect dead peers sooner
sudo sysctl -w net.ipv4.tcp_keepalive_time=60
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=10
sudo sysctl -w net.ipv4.tcp_keepalive_probes=6

# Persist settings in /etc/sysctl.conf (tee runs as root; a plain >> redirect would not)
sudo tee -a /etc/sysctl.conf > /dev/null << EOF
net.core.somaxconn=65535
net.ipv4.tcp_max_syn_backlog=65535
net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_intvl=10
net.ipv4.tcp_keepalive_probes=6
EOF

File descriptor limits

/etc/security/limits.conf

# /etc/security/limits.conf
topgun soft nofile 65535
topgun hard nofile 65535
topgun soft nproc 65535
topgun hard nproc 65535

Docker / Kubernetes note

When running in containers, ensure the host has these settings applied. You may also need to set ulimits in your Docker Compose or Kubernetes pod spec.
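
To confirm which file descriptor limit a process actually received, for example inside a container, you can read `/proc/self/limits` on Linux. The sketch below is a standalone check with a hypothetical function name. It is not part of TopGun and reports the limits of whatever process runs it.

```rust
use std::fs;

/// Parses the "Max open files" row of a /proc/<pid>/limits dump into
/// (soft, hard). The value "unlimited" is mapped to u64::MAX.
fn parse_max_open_files(limits: &str) -> Option<(u64, u64)> {
    const LABEL: &str = "Max open files";
    let line = limits.lines().find(|l| l.starts_with(LABEL))?;
    let mut fields = line[LABEL.len()..].split_whitespace();
    let parse = |s: &str| -> Option<u64> {
        if s == "unlimited" {
            Some(u64::MAX)
        } else {
            s.parse().ok()
        }
    };
    let soft = parse(fields.next()?)?;
    let hard = parse(fields.next()?)?;
    Some((soft, hard))
}

fn main() {
    match fs::read_to_string("/proc/self/limits") {
        Ok(text) => match parse_max_open_files(&text) {
            Some((soft, hard)) => println!("nofile soft={} hard={}", soft, hard),
            None => println!("no \"Max open files\" row found"),
        },
        // /proc is Linux-specific, so this branch is expected elsewhere.
        Err(e) => println!("cannot read /proc/self/limits: {}", e),
    }
}
```

If the soft value is still at a low default after you edit limits.conf, the process was probably started outside a login session that applies those limits, or the container runtime set its own ulimits.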

Observability

Prometheus metrics are exposed at GET /metrics. All four metric names below were verified against packages/server-rust/src/service/middleware/metrics.rs and network/handlers/metrics_endpoint.rs:

Grafana dashboard queries

PromQL queries

# Operation throughput
rate(topgun_operations_total[5m])

# p99 latency
histogram_quantile(0.99, rate(topgun_operation_duration_seconds_bucket[5m]))

# Error rate
rate(topgun_operation_errors_total[5m])

# Active connections
topgun_active_connections

Critical metrics

Metric	Type	Labels	Healthy range	Action if exceeded
`topgun_active_connections`	Gauge	—	Depends on capacity	Add nodes or increase file descriptors
`topgun_operations_total`	Counter	service, outcome	Steady rate	Investigate spikes by service label
`topgun_operation_duration_seconds`	Histogram	service	p99 < 500ms	Profile slow service, check DB
`topgun_operation_errors_total`	Counter	service, error	Near 0	Investigate by error label

Recommended alerts

Alert name	Condition	Severity
HighErrorRate	`rate(topgun_operation_errors_total[5m]) > 10`	Critical
HighLatency	`histogram_quantile(0.99, rate(topgun_operation_duration_seconds_bucket[5m])) > 0.5`	Warning
ConnectionSpike	`topgun_active_connections > 1000`	Warning
