
Performance tuning

Tune the TopGun server for your workload: memory ceiling, eviction thresholds, write-behind buffer, and connection limits. The knobs configurable through environment variables are documented first. Deeper configuration (concurrency, timeouts, connection buffers) requires changing ServerConfig, ConnectionConfig, or NetworkConfig struct fields in code.

Measured benchmarks

Measured performance (single node, M1 Max)

37,000+ ops/sec fire-and-wait (ACK-verified)
480,000+ ops/sec fire-and-forget
1.5 ms p50 write latency

Benchmarks were measured on 2026-04-18 on an M1 Max using the in-process load harness (cargo bench --bench load_harness, 200 connections, 30 s). Fire-and-wait measures the round trip to ACK; fire-and-forget measures raw push throughput. Two consecutive fire-and-forget runs measured 483K and 487K ops/sec (~0.8% variation) after the harness was updated to retry with backoff on transient socket back-pressure rather than stop at the first ENOBUFS. Run cargo bench --bench load_harness to reproduce on your hardware.
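The retry-with-backoff behavior described for the harness can be sketched as follows. This is an illustrative helper, not the harness code: the function name, its parameters, and the doubling schedule are all assumptions, and the caller decides which errors count as transient.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper (not from the TopGun harness).
/// Retries `op` while it fails with an error that `is_transient` accepts,
/// doubling the delay after each failed attempt (capped at `max_delay`).
/// At least one attempt is always made. Returns the first success, the first
/// non-transient error, or the last transient error once `max_attempts`
/// attempts have been used.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    max_delay: Duration,
    is_transient: impl Fn(&E) -> bool,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt < max_attempts && is_transient(&err) => {
                // Transient failure: wait, then try again with a longer delay.
                sleep(delay);
                delay = (delay * 2).min(max_delay);
                attempt += 1;
            }
            // Non-transient error, or attempts exhausted.
            Err(err) => return Err(err),
        }
    }
}
```

A send loop would pass a predicate that recognizes socket back-pressure, for example by comparing the raw OS error code against ENOBUFS. The numeric value of ENOBUFS is platform-specific, so it should be taken from the platform's headers rather than hard-coded.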

Memory and eviction

The server uses an in-memory record cache with LRU eviction. When memory usage rises above the high-water mark, the eviction orchestrator removes least-recently-used records until it drops to the low-water mark.

The four environment variables below are defined in packages/server-rust/src/storage/eviction_config.rs:

.env
# Memory ceiling and eviction thresholds (all real env vars)
TOPGUN_MAX_RAM_MB=2048             # default 1024 MB
TOPGUN_EVICTION_HIGH_PCT=85        # default 85 — eviction starts at this % of ceiling
TOPGUN_EVICTION_LOW_PCT=70         # default 70 — eviction stops at this % of ceiling
TOPGUN_EVICTION_INTERVAL_MS=1000   # default 1000 ms — orchestrator tick

| Variable | Default | Source | Notes |
| --- | --- | --- | --- |
| TOPGUN_MAX_RAM_MB | 1024 | eviction_config.rs:85 | RAM ceiling in MB; eviction engages above high-water mark |
| TOPGUN_EVICTION_HIGH_PCT | 85 | eviction_config.rs:105 | Percent of ceiling at which eviction starts (0–100) |
| TOPGUN_EVICTION_LOW_PCT | 70 | eviction_config.rs:125 | Percent of ceiling at which eviction stops; must be < high |
| TOPGUN_EVICTION_INTERVAL_MS | 1000 | eviction_config.rs:145 | Orchestrator tick interval in ms |

Why the 1024 MB / 85 / 70 defaults? The ceiling is conservative, tuned for a single-node HN-demo server. The 85/70 water marks follow the Hazelcast LRU eviction reference pattern: the 15-percentage-point gap is wide enough to avoid eviction churn under bursty write loads. See Configuration defaults rationale.
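To make the water-mark arithmetic concrete, here is a minimal sketch of how the three settings could translate into byte thresholds and a start/stop decision. The type and function names are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption; the real logic lives in eviction_config.rs and the eviction orchestrator and may differ.

```rust
/// Hypothetical byte thresholds derived from TOPGUN_MAX_RAM_MB,
/// TOPGUN_EVICTION_HIGH_PCT and TOPGUN_EVICTION_LOW_PCT.
#[derive(Debug, Clone, Copy)]
pub struct EvictionThresholds {
    pub high_bytes: u64,
    pub low_bytes: u64,
}

/// Converts a RAM ceiling in MB plus high/low percentages into byte thresholds.
/// Assumes 1 MB = 1024 * 1024 bytes. Returns None when high exceeds 100 or
/// low is not strictly below high.
pub fn thresholds(max_ram_mb: u64, high_pct: u64, low_pct: u64) -> Option<EvictionThresholds> {
    if high_pct > 100 || low_pct >= high_pct {
        return None;
    }
    let ceiling_bytes = max_ram_mb * 1024 * 1024;
    Some(EvictionThresholds {
        high_bytes: ceiling_bytes * high_pct / 100,
        low_bytes: ceiling_bytes * low_pct / 100,
    })
}

/// One orchestrator tick: decides whether eviction should be running.
/// Eviction starts only above the high-water mark and, once running,
/// continues until usage is at or below the low-water mark (hysteresis).
pub fn should_evict(used_bytes: u64, currently_evicting: bool, t: EvictionThresholds) -> bool {
    if currently_evicting {
        used_bytes > t.low_bytes
    } else {
        used_bytes > t.high_bytes
    }
}
```

Under these assumptions the defaults give a high-water mark of about 870 MiB and a low-water mark of about 717 MiB. Usage between the two marks leaves the current state unchanged, which is what prevents the orchestrator from flapping on every tick.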

Write-behind buffer

The write-behind layer decouples mutation latency from persistence latency by buffering writes in per-partition coalesced queues and flushing on a configurable schedule.

The three environment variables below are defined in packages/server-rust/src/storage/datastores/write_behind.rs:

.env
# Write-behind buffer (all real env vars)
TOPGUN_WRITEBEHIND_FLUSH_INTERVAL_MS=1000   # default 1000 ms — flush cadence
TOPGUN_WRITEBEHIND_BATCH_SIZE=100           # default 100 — max entries per flush
TOPGUN_WRITEBEHIND_CAPACITY=10000           # default 10000 — bounded buffer size

| Variable | Default | Source | Notes |
| --- | --- | --- | --- |
| TOPGUN_WRITEBEHIND_FLUSH_INTERVAL_MS | 1000 | write_behind.rs:79 | How often the background flush task runs |
| TOPGUN_WRITEBEHIND_BATCH_SIZE | 100 | write_behind.rs:99 | Max records flushed per tick |
| TOPGUN_WRITEBEHIND_CAPACITY | 10000 | write_behind.rs:119 | Bounded buffer size; once full, writes apply back-pressure |

Durability note: the write-behind layer holds acknowledged writes in memory for roughly 1 s at the default flush interval before persisting them to disk. This is acceptable for demo-tier deployments. Crash-safe shutdown drain and WAL recovery are tracked separately and are out of scope for this release. Until they land, an unclean shutdown can lose buffered writes that have not yet been flushed.
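The interaction between coalescing, the bounded capacity, and the per-tick batch limit can be illustrated with a small model. This is a sketch of the general technique, not TopGun's write_behind.rs: the type, its methods, and the single-queue layout are assumptions made for illustration.

```rust
use std::collections::{HashMap, VecDeque};

/// Illustrative bounded, coalescing write buffer (hypothetical type).
pub struct WriteBehindBuffer {
    capacity: usize,                  // plays the role of TOPGUN_WRITEBEHIND_CAPACITY
    batch_size: usize,                // plays the role of TOPGUN_WRITEBEHIND_BATCH_SIZE
    pending: HashMap<String, String>, // latest value per key
    order: VecDeque<String>,          // keys in first-write order
}

impl WriteBehindBuffer {
    pub fn new(capacity: usize, batch_size: usize) -> Self {
        Self { capacity, batch_size, pending: HashMap::new(), order: VecDeque::new() }
    }

    /// Buffers a write. A repeated key overwrites the pending value in place
    /// (coalescing) and takes no extra slot. Returns false when the buffer is
    /// full, which is the point where a caller would apply back-pressure.
    pub fn put(&mut self, key: &str, value: &str) -> bool {
        if let Some(slot) = self.pending.get_mut(key) {
            *slot = value.to_string();
            return true;
        }
        if self.pending.len() >= self.capacity {
            return false;
        }
        self.pending.insert(key.to_string(), value.to_string());
        self.order.push_back(key.to_string());
        true
    }

    /// One flush tick: removes and returns up to `batch_size` of the oldest
    /// pending entries, each carrying its latest value.
    pub fn flush_tick(&mut self) -> Vec<(String, String)> {
        let mut batch = Vec::new();
        while batch.len() < self.batch_size {
            match self.order.pop_front() {
                Some(key) => {
                    if let Some(value) = self.pending.remove(&key) {
                        batch.push((key, value));
                    }
                }
                None => break,
            }
        }
        batch
    }

    /// Number of distinct keys currently waiting to be flushed.
    pub fn len(&self) -> usize {
        self.pending.len()
    }
}
```

In this model a queue drains at most batch_size entries per tick, so with the defaults (100 entries every 1000 ms) a sustained stream of writes to distinct keys above that rate eventually fills the 10000-entry buffer. Whether the real limits apply per partition or across the whole server is not stated here, so treat the numbers as an intuition aid when choosing batch size and flush interval.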

Programmatic tuning

Deeper configuration is controlled via Rust struct fields, not env vars. To change these values, modify the config structs and rebuild the binary.

packages/server-rust/src/service/config.rs + network/config.rs
// packages/server-rust/src/service/config.rs
pub struct ServerConfig {
    pub max_concurrent_operations: u32,   // default 1000
    pub default_operation_timeout_ms: u64, // default 30000
    pub max_query_records: u32,           // default 10000
    pub gc_interval_ms: u64,              // default 60000
    // ...
}

// packages/server-rust/src/network/config.rs
pub struct ConnectionConfig {
    pub outbound_channel_capacity: usize,  // default 256
    pub send_timeout: Duration,            // default 5s
    pub idle_timeout: Duration,            // default 60s
    pub ws_write_buffer_size: usize,       // default 128 KB
    pub ws_max_write_buffer_size: usize,   // default 512 KB
}

pub struct NetworkConfig {
    pub request_timeout: Duration,         // default 30s
    pub max_body_size: usize,             // default 2 MB
    pub cors_max_age: Duration,           // default 86400s
    pub rate_limit_per_ip: u32,           // default 100
    pub rate_limit_burst: u32,            // default 50
    // ...
}

Not env-var configurable today

These fields are Rust struct defaults. Env-var overrides for concurrency, timeout, and connection-buffer settings are tracked at TODO-400. Until then, modify the config structs and rebuild.

ServerConfig fields

| Field | Default | When to adjust |
| --- | --- | --- |
| max_concurrent_operations | 1000 | Increase when load-shed errors appear under high load |
| default_operation_timeout_ms | 30000 ms | Decrease for faster failure detection; increase for complex queries |
| max_query_records | 10000 | Increase for large dataset queries; decrease if memory constrained |
| gc_interval_ms | 60000 ms | Decrease if stale data accumulating; increase if GC overhead visible |

ConnectionConfig fields

| Field | Default | When to adjust |
| --- | --- | --- |
| outbound_channel_capacity | 256 | Increase for slow consumers; decrease if per-connection memory is high |
| send_timeout | 5s | Increase for high-latency clients; decrease to drop slow clients faster |
| idle_timeout | 60s | Decrease to reclaim idle connections; increase if frequent reconnects |
| ws_write_buffer_size | 128 KB | Increase if large messages are being fragmented |
| ws_max_write_buffer_size | 512 KB | Increase for very large payloads |
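As an example of applying the ConnectionConfig guidance, the sketch below overrides two fields and keeps the rest at their defaults using struct update syntax. The struct here is a local stand-in so the example compiles on its own; the existence of a Default impl on the real type, and the reading of KB as 1024 bytes, are assumptions not confirmed by this page.

```rust
use std::time::Duration;

/// Local stand-in for `ConnectionConfig`; the real struct lives in
/// packages/server-rust/src/network/config.rs and may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct ConnectionConfig {
    pub outbound_channel_capacity: usize,
    pub send_timeout: Duration,
    pub idle_timeout: Duration,
    pub ws_write_buffer_size: usize,
    pub ws_max_write_buffer_size: usize,
}

impl Default for ConnectionConfig {
    /// Defaults copied from the documented values (assuming 1 KB = 1024 bytes).
    fn default() -> Self {
        Self {
            outbound_channel_capacity: 256,
            send_timeout: Duration::from_secs(5),
            idle_timeout: Duration::from_secs(60),
            ws_write_buffer_size: 128 * 1024,
            ws_max_write_buffer_size: 512 * 1024,
        }
    }
}

/// Hypothetical profile for slow consumers on high-latency links: a deeper
/// outbound queue and a longer send timeout, everything else left at default.
/// The values 1024 and 15 s are illustrative, not recommendations.
pub fn slow_consumer_config() -> ConnectionConfig {
    ConnectionConfig {
        outbound_channel_capacity: 1024,
        send_timeout: Duration::from_secs(15),
        ..ConnectionConfig::default()
    }
}
```

Keeping overrides in one small function like this makes the deviation from defaults easy to review, and easy to drop once env-var overrides become available.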

NetworkConfig fields

| Field | Default | When to adjust |
| --- | --- | --- |
| request_timeout | 30s | Decrease to fail hung requests faster |
| max_body_size | 2 MB | Increase for large batch writes |
| rate_limit_per_ip | 100 req/s | Decrease to throttle aggressive clients |
| rate_limit_burst | 50 | Decrease to tighten burst tolerance |

OS-level tuning

Linux kernel parameters for high-connection workloads.

For production servers handling thousands of connections, tune the operating system:

Linux sysctl settings

Linux kernel tuning
# Increase socket backlog
sudo sysctl -w net.core.somaxconn=65535
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=65535

# Increase the file descriptor limit (applies to the current shell session only)
ulimit -n 65535

# Enable TCP keepalive tuning
sudo sysctl -w net.ipv4.tcp_keepalive_time=60
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=10
sudo sysctl -w net.ipv4.tcp_keepalive_probes=6

# Persist settings in /etc/sysctl.conf (appending requires root)
sudo tee -a /etc/sysctl.conf > /dev/null << EOF
net.core.somaxconn=65535
net.ipv4.tcp_max_syn_backlog=65535
net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_intvl=10
net.ipv4.tcp_keepalive_probes=6
EOF

File descriptor limits

/etc/security/limits.conf
# /etc/security/limits.conf
topgun soft nofile 65535
topgun hard nofile 65535
topgun soft nproc 65535
topgun hard nproc 65535

Docker / Kubernetes note

When running in containers, ensure the host has these settings applied. You may also need to set ulimits in your Docker Compose or Kubernetes pod spec.
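Because limits set on the host do not always reach the server process, especially inside containers, it can help to check what the process actually received. The sketch below reads the Linux-specific /proc/self/limits file and extracts the open-file limits. The function names are hypothetical and this check is not part of TopGun.

```rust
use std::fs;

/// Extracts the soft and hard "Max open files" limits from the text of
/// /proc/<pid>/limits (Linux). An inner `None` means "unlimited".
/// Returns `None` when the line is missing or cannot be parsed.
pub fn parse_max_open_files(limits_text: &str) -> Option<(Option<u64>, Option<u64>)> {
    // The line looks like: "Max open files   1024   65535   files".
    let rest = limits_text
        .lines()
        .find_map(|line| line.strip_prefix("Max open files"))?;
    let mut fields = rest.split_whitespace();
    let parse = |s: &str| {
        if s == "unlimited" {
            Some(None)
        } else {
            s.parse::<u64>().ok().map(Some)
        }
    };
    let soft = parse(fields.next()?)?;
    let hard = parse(fields.next()?)?;
    Some((soft, hard))
}

/// Reads the limits of the current process. Returns `None` on platforms
/// without /proc or when the file cannot be read.
pub fn current_process_nofile() -> Option<(Option<u64>, Option<u64>)> {
    let text = fs::read_to_string("/proc/self/limits").ok()?;
    parse_max_open_files(&text)
}
```

Logging this pair at startup shows quickly whether a limits.conf entry or a container ulimit setting took effect. A soft limit that is still 1024 usually means the setting did not propagate.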

Observability

Prometheus metrics are exposed at GET /metrics. The four metric names below were verified against packages/server-rust/src/service/middleware/metrics.rs and network/handlers/metrics_endpoint.rs:

Grafana dashboard queries

PromQL queries
# Operation throughput
rate(topgun_operations_total[5m])

# p99 latency
histogram_quantile(0.99, rate(topgun_operation_duration_seconds_bucket[5m]))

# Error rate
rate(topgun_operation_errors_total[5m])

# Active connections
topgun_active_connections

Critical metrics

| Metric | Type | Labels | Healthy range | Action if exceeded |
| --- | --- | --- | --- | --- |
| topgun_active_connections | Gauge | (none) | Depends on capacity | Add nodes or increase file descriptors |
| topgun_operations_total | Counter | service, outcome | Steady rate | Investigate spikes by service label |
| topgun_operation_duration_seconds | Histogram | service | p99 < 500ms | Profile slow service, check DB |
| topgun_operation_errors_total | Counter | service, error | Near 0 | Investigate by error label |

| Alert name | Condition | Severity |
| --- | --- | --- |
| HighErrorRate | rate(topgun_operation_errors_total[5m]) > 10 | Critical |
| HighLatency | histogram_quantile(0.99, rate(topgun_operation_duration_seconds_bucket[5m])) > 0.5 | Warning |
| ConnectionSpike | topgun_active_connections > 1000 | Warning |
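The p99 figures used in the latency query and alert are estimates: Prometheus finds the bucket containing the target rank and interpolates linearly inside it. The sketch below reimplements that idea in simplified form to show where the number comes from. The function is illustrative, the bucket boundaries in the usage note are made up, and TopGun's actual bucket layout is not stated on this page.

```rust
/// Simplified model of quantile estimation over cumulative histogram buckets.
/// `buckets` holds (upper_bound_seconds, cumulative_count) pairs sorted by
/// upper bound, with the last bucket being +Inf. Returns `None` for fewer
/// than two buckets, no observations, or `q` outside 0.0..=1.0.
pub fn histogram_quantile(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    let (_, total) = *buckets.last()?;
    if buckets.len() < 2 || total <= 0.0 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let rank = q * total;
    // First bucket whose cumulative count reaches the target rank.
    let idx = buckets.iter().position(|&(_, count)| count >= rank)?;
    if idx == buckets.len() - 1 {
        // Rank falls in the +Inf bucket: report the highest finite bound.
        return Some(buckets[buckets.len() - 2].0);
    }
    // The lowest bucket is assumed to start at 0.
    let (lower_bound, lower_count) = if idx == 0 { (0.0, 0.0) } else { buckets[idx - 1] };
    let (upper_bound, upper_count) = buckets[idx];
    let in_bucket = upper_count - lower_count;
    if in_bucket <= 0.0 {
        return Some(lower_bound);
    }
    // Linear interpolation between the bucket's bounds.
    Some(lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket)
}
```

With hypothetical buckets at 0.1 s, 0.25 s, 0.5 s, 1 s, and +Inf holding cumulative counts 900, 980, 995, 999, and 1000, the 99th percentile lands in the 0.25 to 0.5 s bucket and is interpolated within it. The estimate is only as precise as the bucket layout, so a threshold such as 0.5 s is most meaningful when it coincides with a bucket boundary.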