Latency Measurement Methodology

Repeatable. Comparable. Tail‑aware. NDA‑ready.

This page documents how we measure and validate latency for HFT Ready Node deployments and how we perform instance / host enumeration within a region to identify low‑variance candidates. The public version is intentionally scoped; venue‑specific matrices, raw samples and PoP details are shared under NDA.

p50/p95/p99/p99.9 · Jitter + drift · Instance enumeration · Change regression gates · Reproducible reports

1) Definitions & Measurement Scope

We avoid single “best” numbers. We report distributions, confidence, and boundary conditions.

Latency metrics

  • RTT — round‑trip time between defined endpoints.
  • OWD — one‑way delay (requires disciplined time sync).
  • Jitter — dispersion of latency; we track variance and tail behavior.
  • Tail latency — p95/p99/p99.9 (and p99.99 if sample size supports it).

Endpoints & boundaries

  • Host‑local boundary: app → kernel → NIC (where applicable).
  • Network boundary: node ↔ PoP edge ↔ venue handoff (explicitly defined).
  • Scope statement: every report states what is inside/outside measurement.
  • Path type: dedicated vs shared connectivity is explicitly labeled.

Comparability rules

  • Compare only runs with the same payload size, protocol, and time window.
  • Record kernel/driver/firmware versions and NIC settings as part of the dataset.
  • Separate “host” effects from “network” effects via controlled test layers.

Why percentiles

  • Median can look good while tails are unusable for HFT.
  • We optimize for stable distributions and predictable tails.
  • We treat a small p99 regression as a production risk, even if p50 improves.
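As a minimal sketch of why percentiles matter (synthetic data, not a real measurement), two candidates can share a median while having completely different tails:

```python
import numpy as np

rng = np.random.default_rng(7)
# Two synthetic candidates: same median, very different tails.
stable = rng.normal(10.0, 0.2, 100_000)            # tight distribution
spiky = np.where(rng.random(100_000) < 0.02,       # 2% tail events
                 rng.normal(40.0, 5.0, 100_000),
                 rng.normal(10.0, 0.2, 100_000))

for name, s in [("stable", stable), ("spiky", spiky)]:
    p50, p99, p999 = np.percentile(s, [50, 99, 99.9])
    print(f"{name}: p50={p50:.1f} p99={p99:.1f} p99.9={p999:.1f}")
```

Both candidates report a p50 near 10, but only the distribution view exposes the unusable tail.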

2) Measurement Harness & Controls

A measurement is only as good as its controls. We standardize warm‑up, CPU affinity, clock discipline, and logging.

Workload isolation

  • Dedicated CPU sets for the test process, RX/TX handling, and logging.
  • IRQ affinity strategy; consistent queue mapping.
  • Stable power state policy (no opportunistic frequency surprises during runs).
  • Noise control: disable/contain background services that add jitter.
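A minimal sketch of the CPU-pinning part (Linux only; the core layout shown in the comment is hypothetical, not a recommended mapping):

```python
import os

def pin_to_cpus(cpus: set[int]) -> set[int]:
    """Pin the current process to a dedicated CPU set (Linux only).

    Returns the affinity mask actually in effect afterwards."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cpus)      # 0 = current process
        return set(os.sched_getaffinity(0))
    return cpus  # non-Linux fallback: affinity APIs not available

# Hypothetical layout: core 2 for the test process, core 3 for logging.
# test_mask = pin_to_cpus({2})
```

IRQ affinity and queue mapping are set separately (via sysfs/ethtool on Linux); this only covers the process side.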

Run phases

  • Warm‑up: stabilize caches, frequency behavior, and network buffers.
  • Calibration: sanity checks for loss, clock state, and baseline jitter.
  • Sampling: multiple independent trials, separated in time.
  • Cooldown: record post‑run health signals.
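The phase sequence above can be sketched as a small harness skeleton (the `measure` callable is a placeholder for the real probe; counts are illustrative):

```python
import statistics

def run_trial(measure, warmup_n=100, sample_n=1000):
    """Warm-up -> calibration -> sampling, as separate phases.

    `measure` is any zero-arg callable returning one latency sample."""
    for _ in range(warmup_n):                       # warm-up: discard
        measure()
    baseline = [measure() for _ in range(warmup_n)]  # calibration
    jitter = statistics.pstdev(baseline)             # baseline jitter check
    samples = [measure() for _ in range(sample_n)]   # sampling phase
    return {"baseline_jitter": jitter, "samples": samples}

# Usage with a dummy probe:
result = run_trial(lambda: 10.0, warmup_n=10, sample_n=50)
```

Independent trials are then separated in time by scheduling multiple `run_trial` invocations, and cooldown signals are recorded outside this loop.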

Clock discipline

  • RTT measurements do not require absolute time sync, but we still track clock stability.
  • OWD reporting is only enabled when clock sync and drift constraints are satisfied.
  • Time sync health is logged and included in the report appendix.
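The OWD gating rule can be sketched as a simple predicate (the thresholds here are illustrative placeholders, not our production values):

```python
def owd_reporting_enabled(offset_us: float, drift_ppm: float,
                          max_offset_us: float = 5.0,
                          max_drift_ppm: float = 1.0) -> bool:
    """Enable OWD reporting only when the measured sync offset and
    clock drift are both within bounds; otherwise fall back to RTT."""
    return abs(offset_us) <= max_offset_us and abs(drift_ppm) <= max_drift_ppm

print(owd_reporting_enabled(2.1, 0.4))   # healthy clock -> OWD allowed
print(owd_reporting_enabled(12.0, 0.4))  # offset too large -> RTT only
```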

Data capture

  • Store raw samples (timestamped), plus summary aggregates.
  • Capture host signals: CPU freq stats, IRQ load, NIC queue stats, drops/retransmits.
  • All datasets are versioned by: node profile + config hash + test suite version.
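The config hash in the versioning triple can be sketched as a canonical-JSON digest (field names below are hypothetical examples of what a profile might contain):

```python
import hashlib
import json

def config_hash(profile: dict) -> str:
    """Stable short hash over the host/NIC configuration attached to
    every dataset, so runs are only ever compared like-for-like."""
    canonical = json.dumps(profile, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

cfg = {"kernel": "6.1.0", "nic_driver": "ice 1.9.11", "irq_affinity": "dedicated"}
print(config_hash(cfg))  # same config -> same hash, regardless of key order
```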

3) Instance / Host Enumeration Within a Region

When cloud resources are used adjacent to venues (research, gateways, bursting, or hybrid designs), we perform systematic enumeration to identify candidates with low tails and stable jitter. This is a repeatable process, not an ad‑hoc “try a few instances” exercise.

Candidate pool construction

  • Define region + zone scope, allowed instance families, and minimum network capabilities.
  • Generate a candidate list using stratified sampling across zones, instance types, and placement options.
  • Track each candidate’s hardware fingerprint (where available): CPU model and stepping, NIC type, virtualization stack hints.

Exploration vs exploitation

  • Exploration: sample broadly to map variance across the region.
  • Exploitation: spend more trials on top candidates to estimate tails with higher confidence.
  • Use adaptive scheduling: promote candidates that remain stable across time windows.
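One way to sketch the adaptive scheduling idea is a successive-halving style budget: each round, re-measure the survivors with more trials and keep the half with the best tail. This is an illustrative structure with synthetic probes, not our actual sampling budget:

```python
import numpy as np

def adaptive_schedule(candidates, measure, rounds=3, keep_frac=0.5, trials=50):
    """Successive-halving style allocation: each round doubles the trial
    budget for survivors and keeps the fraction with the lowest p99.

    `measure(c, n)` returns n latency samples for candidate c."""
    pool = list(candidates)
    for r in range(rounds):
        scores = {c: np.percentile(measure(c, trials * 2 ** r), 99) for c in pool}
        pool.sort(key=lambda c: scores[c])
        pool = pool[:max(1, int(len(pool) * keep_frac))]
    return pool

rng = np.random.default_rng(1)
tails = {"a": 1.0, "b": 5.0, "c": 0.2, "d": 3.0}   # synthetic tail widths
probe = lambda c, n: rng.normal(10, tails[c], n)
best = adaptive_schedule(tails, probe)
print(best)
```

Candidates that stay stable across rounds accumulate more trials, which is exactly where tail estimates need the extra confidence.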

Measurement dimensions

  • Multiple payload sizes (small control frames vs larger application frames).
  • Protocol profiles: UDP‑like vs TCP‑like behavior (without exposing proprietary details publicly).
  • Different time windows (time‑of‑day effect), repeated across days.

Ranking objective

  • Multi‑objective score: tail percentiles + jitter + loss + stability over time.
  • Penalize “fast median, bad tail” candidates.
  • Prefer candidates with small drift and fewer outliers across runs.
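The ranking objective can be sketched as a weighted score over the bulleted dimensions; the weights and the synthetic candidates below are purely illustrative:

```python
import numpy as np

def candidate_score(samples, loss_rate, w=(1.0, 1.0, 0.5, 200.0)):
    """Illustrative multi-objective score (lower is better): tail level,
    tail ratio (penalizes 'fast median, bad tail'), jitter (IQR), loss."""
    p50, p99 = np.percentile(samples, [50, 99])
    iqr = np.subtract(*np.percentile(samples, [75, 25]))
    return w[0] * p99 + w[1] * (p99 / p50) + w[2] * iqr + w[3] * loss_rate

rng = np.random.default_rng(3)
steady = rng.normal(11.0, 0.3, 20_000)               # slower median, tight tail
fast_bad = np.where(rng.random(20_000) < 0.03,       # fast median, 3% tail events
                    rng.normal(60, 5, 20_000),
                    rng.normal(9.5, 0.3, 20_000))
print(candidate_score(steady, 0.0) < candidate_score(fast_bad, 0.0))
```

The tail-ratio term is what makes the steadier-but-slower candidate win here, matching the penalty rule above.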

NDA note: Venue‑specific adjacency heuristics, exact sampling budgets, and candidate fingerprints are shared only under NDA. The process above describes the structure without revealing sensitive venue proximity details.

4) Statistical Treatment (Tail‑Focused)

Latency distributions are heavy‑tailed and often multi‑modal. We use robust statistics and explicitly track uncertainty.

Outliers & heavy tails

  • We do not blindly remove outliers. We label them and correlate with loss/host signals.
  • Outlier handling uses robust rules: median absolute deviation (MAD), Hampel filters, and signal‑correlated tagging.
  • We distinguish “measurement noise” from “real tail events”.
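The MAD-based tagging rule can be sketched as follows; note that it labels points rather than deleting them, so tags can later be correlated with loss and host signals:

```python
import numpy as np

def tag_outliers(samples, k=3.0):
    """MAD-based tagging: mark points beyond k robust deviations from
    the median. Data is left untouched; only boolean tags are produced."""
    samples = np.asarray(samples, dtype=float)
    med = np.median(samples)
    mad = np.median(np.abs(samples - med))
    scale = 1.4826 * mad if mad > 0 else 1e-9   # MAD -> sigma-equivalent
    return np.abs(samples - med) / scale > k

x = [10.0, 10.1, 9.9, 10.2, 9.8, 55.0]          # one obvious tail event
tags = tag_outliers(x)
print(tags)
```

A Hampel filter applies the same statistic over a sliding window; the global version above is the simplest form.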

Percentile estimation

  • Percentiles are computed from raw samples with consistent interpolation rules.
  • For tails (p99/p99.9) we require adequate sample size and report confidence bounds.
  • We avoid comparing p99.9 across tests with insufficient N (explicitly flagged in reports).

Confidence intervals

  • Bootstrap resampling to estimate confidence intervals for percentiles.
  • Separate CI per candidate and per time window (to detect instability).
  • Report uncertainty alongside headline numbers to prevent overfitting to noise.
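A minimal sketch of the percentile-bootstrap CI (synthetic samples; resample counts are illustrative):

```python
import numpy as np

def bootstrap_percentile_ci(samples, q=99.0, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI: resample with replacement, recompute the
    percentile each time, take the empirical (alpha/2, 1-alpha/2) interval."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples)
    boots = [np.percentile(rng.choice(samples, samples.size, replace=True), q)
             for _ in range(n_boot)]
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

rng = np.random.default_rng(42)
x = rng.normal(10, 1, 5_000)
lo, hi = bootstrap_percentile_ci(x, q=99.0)
print(f"p99 = {np.percentile(x, 99):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Running separate intervals per candidate and per time window, as above, is what exposes instability rather than averaging it away.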

Stability scoring

  • Use dispersion metrics (IQR, tail ratio p99/p50) to quantify jitter.
  • Drift modeling: compare distributions over time windows rather than single aggregates.
  • Flag multi‑modality (mixture behavior) that may indicate path changes or noisy neighbors.
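The dispersion metrics and window-over-window comparison can be sketched like this (the drift threshold is illustrative, and the windows are synthetic):

```python
import numpy as np

def stability_metrics(samples):
    """Dispersion summary used for jitter scoring: IQR and tail ratio."""
    p25, p50, p75, p99 = np.percentile(samples, [25, 50, 75, 99])
    return {"iqr": p75 - p25, "tail_ratio": p99 / p50}

def drifted(window_a, window_b, max_tail_ratio_delta=0.1):
    """Compare distributions across time windows instead of relying on
    a single aggregate (threshold is a placeholder)."""
    a, b = stability_metrics(window_a), stability_metrics(window_b)
    return abs(a["tail_ratio"] - b["tail_ratio"]) > max_tail_ratio_delta

rng = np.random.default_rng(5)
w1 = rng.normal(10, 0.3, 10_000)
w2 = rng.normal(10, 0.3, 10_000)                     # same regime
w3 = np.concatenate([w1, rng.normal(25, 2, 300)])    # tail events appear
print(drifted(w1, w2), drifted(w1, w3))
```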

5) Drift Detection & Continuous Regression

Once in production, the question becomes: “Did anything change?” We detect drift early and prevent silent regressions.

Baseline & guard bands

  • Baseline distributions are captured at handover with scope and config hash.
  • Guard bands defined per metric (e.g., p99.9 and jitter) with escalation thresholds.
  • Different guard bands for host‑dominated vs network‑dominated paths.

Change-point detection

  • We use sequential monitoring methods (e.g., EWMA/CUSUM‑style logic) to detect small sustained shifts.
  • When drift triggers: correlate with host signals, loss, and recent change sets.
  • Confirm via targeted re‑sampling before rollback or mitigation.
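A minimal one-sided CUSUM sketch, to show how small sustained shifts accumulate into an alarm (the `k`/`h` parameters and the sample series are illustrative):

```python
def cusum_drift(samples, target, k=0.5, h=5.0):
    """One-sided CUSUM over standardized latency samples: accumulate
    upward excess beyond k sigma and alarm when the sum crosses h."""
    mean, std = target
    s = 0.0
    for i, x in enumerate(samples):
        z = (x - mean) / std
        s = max(0.0, s + z - k)      # drains near baseline, builds on shifts
        if s > h:
            return i                  # index where drift is flagged
    return None

baseline = (10.0, 0.5)
steady = [10.0, 10.2, 9.8, 10.1] * 10
shifted = steady + [10.8] * 20        # small sustained +1.6 sigma shift
print(cusum_drift(steady, baseline), cusum_drift(shifted, baseline))
```

A single 10.8 sample would never trip a threshold test; the sustained run does, which is exactly the class of regression EWMA/CUSUM-style monitoring exists to catch.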

Regression gates

  • Firmware/BIOS/kernel/driver changes go through a controlled test suite.
  • We compare distributions — not just averages — before approving rollout.
  • Rollback plan is mandatory for any change that affects latency risk.
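The "compare distributions, not just averages" gate can be sketched as a per-percentile guard-band check; the relative deltas and synthetic candidates below are illustrative only:

```python
import numpy as np

def gate_passes(baseline, candidate, max_p50_delta=0.05, max_p99_delta=0.02):
    """Reject the change if any watched percentile regresses beyond its
    guard band (tighter band on the tail than on the median)."""
    for q, band in {50: max_p50_delta, 99: max_p99_delta}.items():
        b, c = np.percentile(baseline, q), np.percentile(candidate, q)
        if (c - b) / b > band:       # worse than baseline + guard band
            return False
    return True

rng = np.random.default_rng(9)
base = rng.normal(10, 0.3, 50_000)
good = rng.normal(9.9, 0.3, 50_000)                  # p50 and p99 improve
bad = np.where(rng.random(50_000) < 0.02,            # p50 fine, p99 regresses
               rng.normal(20, 1, 50_000),
               rng.normal(10, 0.3, 50_000))
print(gate_passes(base, good), gate_passes(base, bad))
```

The `bad` candidate would sail through a mean-only comparison; the percentile gate rejects it.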

Incident playbooks

  • Classify: host regression vs network regression vs external venue behavior.
  • Fast triage signals: drops, retransmits, queue pressure, IRQ imbalance, clock drift.
  • Mitigation paths: route change, config rollback, node failover (if HA).

6) Deliverables & Report Structure

The report is designed to be auditable: scope, tooling, config versions, and raw sample summaries.

Report sections

  • Scope statement: endpoints, boundaries, RTT/OWD, path type.
  • Config baseline: node type, BIOS profile ID, kernel/driver versions, NIC settings.
  • Results: percentile table, jitter metrics, drift comparison across windows.
  • Quality signals: loss/retransmits, time sync health, host load signals.
  • Interpretation: where variance comes from, recommended mitigations.

Data appendix

  • Sampling plan: number of trials, duration, payload profiles.
  • Bootstrap confidence intervals for percentiles (where applicable).
  • Outlier annotations and correlations (e.g., with loss or host events).
  • Candidate ranking methodology (high level; details under NDA).

Reproducibility

  • Test suite versioning and configuration hashing.
  • Repeatable run commands and environment constraints (documented).
  • Clear “can’t compare” flags when conditions differ.

What we don’t publish

  • Exact venue adjacency heuristics and PoP‑specific sensitive details.
  • Raw venue matrices and sensitive topology info (shared under NDA only).
  • Client‑specific identifiers; reports can be anonymized for internal sharing.

7) NDA Package

Under NDA we can provide the full methodology pack and venue‑specific artifacts required for engineering evaluation.

Included under NDA

  • PoP catalog by region (APAC/US/EU) and feasibility notes
  • Venue latency matrices (scoped) and historical drift snapshots
  • Sample full reports with raw distribution summaries
  • Candidate enumeration details (budgets, stratification rules, stability gates)
  • Change control policy and regression gate thresholds

Evaluation flow

  • NDA → share target venues and constraints
  • Propose PoPs + node types + acceptance scope
  • Deliver baseline report → agree guard bands
  • Go‑live with continuous regression monitoring

Want the full pack?

Send target venues, region (APAC/US/EU), node‑type preference, and any constraints (time sync, NIC preferences, OS requirements).