Latency Measurement Methodology

Repeatable. Comparable. Tail‑aware. NDA‑ready.

This page documents how we measure and validate latency for HFT Ready Node deployments and how we perform instance / host enumeration within a region to identify low‑variance candidates. The public version is intentionally scoped; venue‑specific matrices, raw samples and PoP details are shared under NDA.

p50/p95/p99/p99.9 · Jitter + drift · Instance enumeration · Change regression gates · Reproducible reports

1) Definitions & Measurement Scope

We avoid single “best” numbers. We report distributions, confidence, and boundary conditions.

Latency metrics

  • RTT — round‑trip time between defined endpoints.
  • OWD — one‑way delay (requires disciplined time sync).
  • Jitter — dispersion of latency; we track variance and tail behavior.
  • Tail latency — p95/p99/p99.9 (and p99.99 if sample size supports it).

Endpoints & boundaries

  • Host‑local boundary: app → kernel → NIC (where applicable).
  • Network boundary: node ↔ PoP edge ↔ venue handoff (explicitly defined).
  • Scope statement: every report states what is inside/outside measurement.
  • Path type: dedicated vs shared connectivity is explicitly labeled.

Comparability rules

  • Compare only runs with the same payload size, protocol, and time window.
  • Record kernel/driver/firmware versions and NIC settings as part of the dataset.
  • Separate “host” effects from “network” effects via controlled test layers.

Why percentiles

  • Median can look good while tails are unusable for HFT.
  • We optimize for stable distributions and predictable tails.
  • We treat a small p99 regression as a production risk, even if p50 improves.
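As a minimal sketch of why percentiles matter (synthetic data, not a real measurement), two candidates can share a median while having completely different tails:

```python
import numpy as np

rng = np.random.default_rng(7)
# Two synthetic candidates: same median, very different tails.
stable = rng.normal(10.0, 0.2, 100_000)            # tight distribution
spiky = np.where(rng.random(100_000) < 0.02,       # 2% tail events
                 rng.normal(40.0, 5.0, 100_000),
                 rng.normal(10.0, 0.2, 100_000))

for name, s in [("stable", stable), ("spiky", spiky)]:
    p50, p99, p999 = np.percentile(s, [50, 99, 99.9])
    print(f"{name}: p50={p50:.1f} p99={p99:.1f} p99.9={p999:.1f}")
```

Both candidates report a p50 near 10, but only the distribution view exposes the unusable tail.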

2) Measurement Harness & Controls

A measurement is only as good as its controls. We standardize warm‑up, CPU affinity, clock discipline, and logging.

Workload isolation

  • Dedicated CPU sets for the test process, RX/TX handling, and logging.
  • IRQ affinity strategy; consistent queue mapping.
  • Stable power state policy (no opportunistic frequency surprises during runs).
  • Noise control: disable/contain background services that add jitter.
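A minimal sketch of the CPU-pinning part (Linux only; the core layout shown in the comment is hypothetical, not a recommended mapping):

```python
import os

def pin_to_cpus(cpus: set[int]) -> set[int]:
    """Pin the current process to a dedicated CPU set (Linux only).

    Returns the affinity mask actually in effect afterwards."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cpus)      # 0 = current process
        return set(os.sched_getaffinity(0))
    return cpus  # non-Linux fallback: affinity APIs not available

# Hypothetical layout: core 2 for the test process, core 3 for logging.
# test_mask = pin_to_cpus({2})
```

IRQ affinity and queue mapping are set separately (via sysfs/ethtool on Linux); this only covers the process side.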

Run phases

  • Warm‑up: stabilize caches, frequency behavior, and network buffers.
  • Calibration: sanity checks for loss, clock state, and baseline jitter.
  • Sampling: multiple independent trials, separated in time.
  • Cooldown: record post‑run health signals.
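The phase sequence above can be sketched as a small harness skeleton (the `measure` callable is a placeholder for the real probe; counts are illustrative):

```python
import statistics

def run_trial(measure, warmup_n=100, sample_n=1000):
    """Warm-up -> calibration -> sampling, as separate phases.

    `measure` is any zero-arg callable returning one latency sample."""
    for _ in range(warmup_n):                       # warm-up: discard
        measure()
    baseline = [measure() for _ in range(warmup_n)]  # calibration
    jitter = statistics.pstdev(baseline)             # baseline jitter check
    samples = [measure() for _ in range(sample_n)]   # sampling phase
    return {"baseline_jitter": jitter, "samples": samples}

# Usage with a dummy probe:
result = run_trial(lambda: 10.0, warmup_n=10, sample_n=50)
```

Independent trials are then separated in time by scheduling multiple `run_trial` invocations, and cooldown signals are recorded outside this loop.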

Clock discipline

  • RTT measurements do not require absolute time sync, but we still track clock stability.
  • OWD reporting is only enabled when clock sync and drift constraints are satisfied.
  • Time sync health is logged and included in the report appendix.
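The OWD gating rule can be sketched as a simple predicate (the thresholds here are illustrative placeholders, not our production values):

```python
def owd_reporting_enabled(offset_us: float, drift_ppm: float,
                          max_offset_us: float = 5.0,
                          max_drift_ppm: float = 1.0) -> bool:
    """Enable OWD reporting only when the measured sync offset and
    clock drift are both within bounds; otherwise fall back to RTT."""
    return abs(offset_us) <= max_offset_us and abs(drift_ppm) <= max_drift_ppm

print(owd_reporting_enabled(2.1, 0.4))   # healthy clock -> OWD allowed
print(owd_reporting_enabled(12.0, 0.4))  # offset too large -> RTT only
```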

Data capture

  • Store raw samples (timestamped), plus summary aggregates.
  • Capture host signals: CPU freq stats, IRQ load, NIC queue stats, drops/retransmits.
  • All datasets are versioned by: node profile + config hash + test suite version.
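The config hash in the versioning triple can be sketched as a canonical-JSON digest (field names below are hypothetical examples of what a profile might contain):

```python
import hashlib
import json

def config_hash(profile: dict) -> str:
    """Stable short hash over the host/NIC configuration attached to
    every dataset, so runs are only ever compared like-for-like."""
    canonical = json.dumps(profile, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

cfg = {"kernel": "6.1.0", "nic_driver": "ice 1.9.11", "irq_affinity": "dedicated"}
print(config_hash(cfg))  # same config -> same hash, regardless of key order
```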

3) Instance / Host Enumeration Within a Region

When cloud resources are used adjacent to venues (research, gateways, bursting, or hybrid designs), we perform systematic enumeration to identify candidates with low tails and stable jitter. This is a repeatable process, not an ad‑hoc “try a few instances” exercise.

Candidate pool construction

  • Define region + zone scope, allowed instance families, and minimum network capabilities.
  • Generate a candidate list using stratified sampling across zones, instance types, and placement options.
  • Track each candidate’s hardware fingerprint (where available): CPU model and stepping, NIC type, virtualization stack hints.

Exploration vs exploitation

  • Exploration: sample broadly to map variance across the region.
  • Exploitation: spend more trials on top candidates to estimate tails with higher confidence.
  • Use adaptive scheduling: promote candidates that remain stable across time windows.
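One way to sketch the adaptive scheduling idea is a successive-halving style budget: each round, re-measure the survivors with more trials and keep the half with the best tail. This is an illustrative structure with synthetic probes, not our actual sampling budget:

```python
import numpy as np

def adaptive_schedule(candidates, measure, rounds=3, keep_frac=0.5, trials=50):
    """Successive-halving style allocation: each round doubles the trial
    budget for survivors and keeps the fraction with the lowest p99.

    `measure(c, n)` returns n latency samples for candidate c."""
    pool = list(candidates)
    for r in range(rounds):
        scores = {c: np.percentile(measure(c, trials * 2 ** r), 99) for c in pool}
        pool.sort(key=lambda c: scores[c])
        pool = pool[:max(1, int(len(pool) * keep_frac))]
    return pool

rng = np.random.default_rng(1)
tails = {"a": 1.0, "b": 5.0, "c": 0.2, "d": 3.0}   # synthetic tail widths
probe = lambda c, n: rng.normal(10, tails[c], n)
best = adaptive_schedule(tails, probe)
print(best)
```

Candidates that stay stable across rounds accumulate more trials, which is exactly where tail estimates need the extra confidence.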

Measurement dimensions

  • Multiple payload sizes (small control frames vs larger application frames).
  • Protocol profiles: UDP‑like vs TCP‑like behavior (without exposing proprietary details publicly).
  • Different time windows (time‑of‑day effect), repeated across days.

Ranking objective

  • Multi‑objective score: tail percentiles + jitter + loss + stability over time.
  • Penalize “fast median, bad tail” candidates.
  • Prefer candidates with small drift and fewer outliers across runs.
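The ranking objective can be sketched as a weighted score over the bulleted dimensions; the weights and the synthetic candidates below are purely illustrative:

```python
import numpy as np

def candidate_score(samples, loss_rate, w=(1.0, 1.0, 0.5, 200.0)):
    """Illustrative multi-objective score (lower is better): tail level,
    tail ratio (penalizes 'fast median, bad tail'), jitter (IQR), loss."""
    p50, p99 = np.percentile(samples, [50, 99])
    iqr = np.subtract(*np.percentile(samples, [75, 25]))
    return w[0] * p99 + w[1] * (p99 / p50) + w[2] * iqr + w[3] * loss_rate

rng = np.random.default_rng(3)
steady = rng.normal(11.0, 0.3, 20_000)               # slower median, tight tail
fast_bad = np.where(rng.random(20_000) < 0.03,       # fast median, 3% tail events
                    rng.normal(60, 5, 20_000),
                    rng.normal(9.5, 0.3, 20_000))
print(candidate_score(steady, 0.0) < candidate_score(fast_bad, 0.0))
```

The tail-ratio term is what makes the steadier-but-slower candidate win here, matching the penalty rule above.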

NDA note: Venue‑specific adjacency heuristics, exact sampling budgets, and candidate fingerprints are shared only under NDA. The process above describes the structure without revealing sensitive venue proximity details.

4) Statistical Treatment (Tail‑Focused)

Latency distributions are heavy‑tailed and often multi‑modal. We use robust statistics and explicitly track uncertainty.

Outliers & heavy tails

  • We do not blindly remove outliers. We label them and correlate with loss/host signals.
  • Outlier handling uses robust rules: median absolute deviation (MAD), Hampel filters, and signal‑correlated tagging.
  • We distinguish “measurement noise” from “real tail events”.
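The MAD-based tagging rule can be sketched as follows; note that it labels points rather than deleting them, so tags can later be correlated with loss and host signals:

```python
import numpy as np

def tag_outliers(samples, k=3.0):
    """MAD-based tagging: mark points beyond k robust deviations from
    the median. Data is left untouched; only boolean tags are produced."""
    samples = np.asarray(samples, dtype=float)
    med = np.median(samples)
    mad = np.median(np.abs(samples - med))
    scale = 1.4826 * mad if mad > 0 else 1e-9   # MAD -> sigma-equivalent
    return np.abs(samples - med) / scale > k

x = [10.0, 10.1, 9.9, 10.2, 9.8, 55.0]          # one obvious tail event
tags = tag_outliers(x)
print(tags)
```

A Hampel filter applies the same statistic over a sliding window; the global version above is the simplest form.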

Percentile estimation

  • Percentiles are computed from raw samples with consistent interpolation rules.
  • For tails (p99/p99.9) we require adequate sample size and report confidence bounds.
  • We avoid comparing p99.9 across tests with insufficient N (explicitly flagged in reports).

Confidence intervals

  • Bootstrap resampling to estimate confidence intervals for percentiles.
  • Separate CI per candidate and per time window (to detect instability).
  • Report uncertainty alongside headline numbers to prevent overfitting to noise.
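A minimal sketch of the percentile-bootstrap CI (synthetic samples; resample counts are illustrative):

```python
import numpy as np

def bootstrap_percentile_ci(samples, q=99.0, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI: resample with replacement, recompute the
    percentile each time, take the empirical (alpha/2, 1-alpha/2) interval."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples)
    boots = [np.percentile(rng.choice(samples, samples.size, replace=True), q)
             for _ in range(n_boot)]
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

rng = np.random.default_rng(42)
x = rng.normal(10, 1, 5_000)
lo, hi = bootstrap_percentile_ci(x, q=99.0)
print(f"p99 = {np.percentile(x, 99):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Running separate intervals per candidate and per time window, as above, is what exposes instability rather than averaging it away.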

Stability scoring

  • Use dispersion metrics (IQR, tail ratio p99/p50) to quantify jitter.
  • Drift modeling: compare distributions over time windows rather than single aggregates.
  • Flag multi‑modality (mixture behavior) that may indicate path changes or noisy neighbors.
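The dispersion metrics and window-over-window comparison can be sketched like this (the drift threshold is illustrative, and the windows are synthetic):

```python
import numpy as np

def stability_metrics(samples):
    """Dispersion summary used for jitter scoring: IQR and tail ratio."""
    p25, p50, p75, p99 = np.percentile(samples, [25, 50, 75, 99])
    return {"iqr": p75 - p25, "tail_ratio": p99 / p50}

def drifted(window_a, window_b, max_tail_ratio_delta=0.1):
    """Compare distributions across time windows instead of relying on
    a single aggregate (threshold is a placeholder)."""
    a, b = stability_metrics(window_a), stability_metrics(window_b)
    return abs(a["tail_ratio"] - b["tail_ratio"]) > max_tail_ratio_delta

rng = np.random.default_rng(5)
w1 = rng.normal(10, 0.3, 10_000)
w2 = rng.normal(10, 0.3, 10_000)                     # same regime
w3 = np.concatenate([w1, rng.normal(25, 2, 300)])    # tail events appear
print(drifted(w1, w2), drifted(w1, w3))
```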

5) Drift Detection & Continuous Regression

Once in production, the question becomes: “Did anything change?” We detect drift early and prevent silent regressions.

Baseline & guard bands

  • Baseline distributions are captured at handover with scope and config hash.
  • Guard bands defined per metric (e.g., p99.9 and jitter) with escalation thresholds.
  • Different guard bands for host‑dominated vs network‑dominated paths.

Change-point detection

  • We use sequential monitoring methods (e.g., EWMA/CUSUM‑style logic) to detect small sustained shifts.
  • When drift triggers: correlate with host signals, loss, and recent change sets.
  • Confirm via targeted re‑sampling before rollback or mitigation.
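A minimal one-sided CUSUM sketch, to show how small sustained shifts accumulate into an alarm (the `k`/`h` parameters and the sample series are illustrative):

```python
def cusum_drift(samples, target, k=0.5, h=5.0):
    """One-sided CUSUM over standardized latency samples: accumulate
    upward excess beyond k sigma and alarm when the sum crosses h."""
    mean, std = target
    s = 0.0
    for i, x in enumerate(samples):
        z = (x - mean) / std
        s = max(0.0, s + z - k)      # drains near baseline, builds on shifts
        if s > h:
            return i                  # index where drift is flagged
    return None

baseline = (10.0, 0.5)
steady = [10.0, 10.2, 9.8, 10.1] * 10
shifted = steady + [10.8] * 20        # small sustained +1.6 sigma shift
print(cusum_drift(steady, baseline), cusum_drift(shifted, baseline))
```

A single 10.8 sample would never trip a threshold test; the sustained run does, which is exactly the class of regression EWMA/CUSUM-style monitoring exists to catch.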

Regression gates

  • Firmware/BIOS/kernel/driver changes go through a controlled test suite.
  • We compare distributions — not just averages — before approving rollout.
  • Rollback plan is mandatory for any change that affects latency risk.
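The "compare distributions, not just averages" gate can be sketched as a per-percentile guard-band check; the relative deltas and synthetic candidates below are illustrative only:

```python
import numpy as np

def gate_passes(baseline, candidate, max_p50_delta=0.05, max_p99_delta=0.02):
    """Reject the change if any watched percentile regresses beyond its
    guard band (tighter band on the tail than on the median)."""
    for q, band in {50: max_p50_delta, 99: max_p99_delta}.items():
        b, c = np.percentile(baseline, q), np.percentile(candidate, q)
        if (c - b) / b > band:       # worse than baseline + guard band
            return False
    return True

rng = np.random.default_rng(9)
base = rng.normal(10, 0.3, 50_000)
good = rng.normal(9.9, 0.3, 50_000)                  # p50 and p99 improve
bad = np.where(rng.random(50_000) < 0.02,            # p50 fine, p99 regresses
               rng.normal(20, 1, 50_000),
               rng.normal(10, 0.3, 50_000))
print(gate_passes(base, good), gate_passes(base, bad))
```

The `bad` candidate would sail through a mean-only comparison; the percentile gate rejects it.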

Incident playbooks

  • Classify: host regression vs network regression vs external venue behavior.
  • Fast triage signals: drops, retransmits, queue pressure, IRQ imbalance, clock drift.
  • Mitigation paths: route change, config rollback, node failover (if HA).

6) Deliverables & Report Structure

The report is designed to be auditable: scope, tooling, config versions, and raw sample summaries.

Report sections

  • Scope statement: endpoints, boundaries, RTT/OWD, path type.
  • Config baseline: node type, BIOS profile ID, kernel/driver versions, NIC settings.
  • Results: percentile table, jitter metrics, drift comparison across windows.
  • Quality signals: loss/retransmits, time sync health, host load signals.
  • Interpretation: where variance comes from, recommended mitigations.

Data appendix

  • Sampling plan: number of trials, duration, payload profiles.
  • Bootstrap confidence intervals for percentiles (where applicable).
  • Outlier annotations and correlations (e.g., with loss or host events).
  • Candidate ranking methodology (high level; details under NDA).

Reproducibility

  • Test suite versioning and configuration hashing.
  • Repeatable run commands and environment constraints (documented).
  • Clear “can’t compare” flags when conditions differ.

What we don’t publish

  • Exact venue adjacency heuristics and PoP‑specific sensitive details.
  • Raw venue matrices and sensitive topology info (shared under NDA only).
  • Client‑specific identifiers; reports can be anonymized for internal sharing.

7) NDA Package

Under NDA we can provide the full methodology pack and venue‑specific artifacts required for engineering evaluation.

Included under NDA

  • PoP catalog by region (APAC/US/EU) and feasibility notes
  • Venue latency matrices (scoped) and historical drift snapshots
  • Sample full reports with raw distribution summaries
  • Candidate enumeration details (budgets, stratification rules, stability gates)
  • Change control policy and regression gate thresholds

Evaluation flow

  • NDA → share target venues and constraints
  • Propose PoPs + node types + acceptance scope
  • Deliver baseline report → agree guard bands
  • Go‑live with continuous regression monitoring

Want the full pack?

Send target venues, region (APAC/US/EU), node‑type preference, and any constraints (time sync, NIC preferences, OS requirements).