
HFT Technology: The Latency Stack

From 1000ms to 0.5µs — a 2,000,000× improvement in 28 years. The network, hardware, and software innovations that define modern market microstructure.
Keywords: 0.5µs latency, kernel bypass, FPGA, microwave, speed of light, co-location
Latency improvement, 1995-2023: 2,000,000× (1000ms → 0.0005ms), from a dial-up modem to an FPGA with kernel bypass on co-located hardware.

Latency Evolution

The defining curve of electronic market structure — log-scale descent from human-speed to physics-limited

Latency by Year (Log Scale) — 9 Data Points

Latency Breakdown

Every data point in the descent — from dial-up to FPGA, color-coded by technology era

Complete Latency History
Year | Latency | Medium / Technology | Context | Era
1995 | 1000ms | Dial-up modem | Seconds to execute. Phone-based trading. | --
2000 | 100ms | T1 lines to exchanges | Electronic exchanges. Sub-second execution becomes possible. | --
2005 | 10ms | Co-located fiber | Servers in exchange data centers. Reg NMS drives speed competition. | --
2007 | 1ms | Optimized fiber + co-lo | Spread Networks. Sub-millisecond matching engines. | --
2010 | 100.0µs | FPGA + co-location | Hardware acceleration. FPGA processes market data in microseconds. | --
2012 | 10.0µs | Microwave + FPGA | McKay Brothers microwave. Chicago-NJ in 4.1ms one-way (vs 6.5ms fiber). | --
2015 | 5.0µs | Millimeter wave + custom NIC | Kernel bypass networking. Custom network cards process packets in nanoseconds. | --
2018 | 1.0µs | Laser + FPGA + custom silicon | Sub-microsecond tick-to-trade. ASIC-based feed handlers. | --
2023 | 500ns | Custom ASIC + integrated optics | Nanosecond-scale decisions. Physics limit (speed of light) becomes binding constraint. | --
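
The improvement factors implied by the history table can be checked with a few lines of arithmetic. The sketch below copies the nine data points from the table (converted to nanoseconds) and divides one year's latency by another's. The names `Milestone`, `kHistory`, and `improvement` are illustrative, not taken from the original page.

```cpp
#include <cassert>

// One row of the latency history table, with latency converted to nanoseconds.
struct Milestone {
    int year;
    double latency_ns;
};

// Values copied from the table above (1000ms = 1e9 ns, 500ns = 500 ns).
const Milestone kHistory[] = {
    {1995, 1e9}, {2000, 1e8}, {2005, 1e7}, {2007, 1e6}, {2010, 1e5},
    {2012, 1e4}, {2015, 5e3}, {2018, 1e3}, {2023, 500.0},
};

// Returns how many times lower the latency is in to_year than in from_year.
// Both years must appear in the table.
double improvement(int from_year, int to_year) {
    double from_ns = 0.0;
    double to_ns = 0.0;
    for (const Milestone& m : kHistory) {
        if (m.year == from_year) from_ns = m.latency_ns;
        if (m.year == to_year) to_ns = m.latency_ns;
    }
    assert(from_ns > 0.0 && to_ns > 0.0);
    return from_ns / to_ns;
}
```

With the table's figures, `improvement(1995, 2023)` is 2,000,000, and the three eras 1995-2005, 2005-2015, and 2015-2023 come out at 100×, 2,000×, and 10× respectively.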

Network Infrastructure

Transmission Media Comparison

Network Layer Technologies
Medium | Speed (fraction of c) | Latency/km | Bandwidth | Weather Sensitivity | Cost Tier
Fiber Optic | -- | -- | Terabits/s | -- | --
Microwave | -- | -- | ~1 Gbps | -- | --
Millimeter Wave | -- | -- | ~10 Gbps | -- | --
Free-Space Optical (Laser) | -- | -- | ~100 Gbps | -- | --

Speed Comparison — Fraction of c

Propagation Speed by Medium

Key Routes

Chicago — New Jersey
~1,200 km
The canonical HFT route

CME Aurora, IL ↔ NYSE Mahwah, NJ. The most fought-over data path in finance. Fiber: ~6.5ms one-way. Microwave: ~4.0ms one-way. That ~2.5ms gap, which comes from light traveling slower in glass than in air and from fiber's longer route, is worth hundreds of millions annually. Spread Networks spent $300M laying a fiber route in 2010; it was obsoleted by microwave within two years.

London — Frankfurt
~640 km
LSE ↔ Eurex

LD4 Slough ↔ FR2 Frankfurt. European backbone. Microwave towers across the North Sea and Low Countries. Shorter distance means tighter absolute latency margins. Sub-4ms one-way via microwave. The Channel crossing complicates microwave line-of-sight — requires relay towers on high points in Belgium/Netherlands.

Tokyo — Osaka
~500 km
TSE ↔ OSE

Kanto ↔ Kansai. JPX arbitrage route. Mountainous terrain limits microwave options — Japan's topography favors fiber or millimeter-wave solutions with more relay hops. Sub-3ms one-way is the target. Less competitive than US/EU routes but still multi-million dollar infrastructure.

Why Light is Faster in Air Than Glass

The refractive index problem: Light in a vacuum travels at c = 299,792 km/s. In fiber optic cable, light travels through a glass core (silica, SiO₂) with a refractive index of n ≈ 1.47. The effective speed is c/n ≈ 203,940 km/s — about 68% of c. In air, n ≈ 1.0003, so microwave propagation is effectively at c.

The physics: The refractive index arises because the light's oscillating electric field polarizes the electron clouds of the silicon and oxygen atoms in the glass lattice. The driven electrons radiate secondary waves that lag slightly behind the incoming field, and the superposition of the original and secondary waves advances more slowly than light in a vacuum. The cumulative effect along the fiber reduces the effective group velocity. Air molecules are too sparse to cause significant delay at microwave frequencies.

Propagation Delay

v_fiber = c / n ≈ 299,792 / 1.47 ≈ 203,940 km/s
v_air ≈ c ≈ 299,792 km/s

Chicago–NJ (1,200 km):
  t_fiber = 1,200 / 203,940 ≈ 5.88 ms
  t_air = 1,200 / 299,792 ≈ 4.00 ms
  Δt ≈ 1.88 ms, an eternity in HFT

Real-world fiber paths are longer than straight-line distance (routing around obstacles, rights-of-way). Actual fiber latency Chicago-NJ is ~6.5ms vs ~4.0ms microwave. The microwave advantage is even larger than refractive index alone suggests because the path is closer to geodesic.
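
The propagation-delay arithmetic above is easy to reproduce for any path length or medium. The following is a minimal sketch that uses only the constants quoted in this section (c = 299,792 km/s, n = 1.47 for fiber, n = 1.0003 for air). The function names are illustrative, and the calculation deliberately ignores route detours, relay hops, and equipment delay.

```cpp
#include <cassert>

constexpr double kVacuumKmPerSec = 299792.0;  // c, as quoted in the text
constexpr double kFiberIndex = 1.47;          // silica fiber core
constexpr double kAirIndex = 1.0003;          // air at microwave frequencies

// One-way propagation delay in milliseconds over path_km through a medium
// with the given refractive index (v = c / n).
double one_way_delay_ms(double path_km, double refractive_index) {
    assert(path_km >= 0.0 && refractive_index >= 1.0);
    const double speed_km_per_s = kVacuumKmPerSec / refractive_index;
    return path_km / speed_km_per_s * 1e3;
}

// Propagation delay per kilometre, in microseconds.
double delay_us_per_km(double refractive_index) {
    return one_way_delay_ms(1.0, refractive_index) * 1e3;
}
```

For the 1,200 km straight-line path this gives about 5.88 ms in fiber and about 4.00 ms in air, or roughly 4.9 µs and 3.3 µs per kilometre.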

Compute Hardware

Hardware Comparison

Compute Technologies
Hardware | Typical Latency | Flexibility | Power | Use Case
CPU (x86) | ~1-10µs | Highest | 150-300W | --
GPU | ~10-100µs | Medium | 300-700W | --
FPGA | ~100ns-1µs | Medium | 10-75W | --
ASIC | <100ns | None | 5-50W | --

Latency by Hardware Type (Log Scale)

Processing Latency — CPU → GPU → FPGA → ASIC

FPGA Deep Dive

What is an FPGA? A Field-Programmable Gate Array is a chip containing millions of configurable logic blocks (CLBs) connected by a programmable interconnect fabric. Unlike a CPU — which fetches, decodes, and executes instructions sequentially through a pipeline — an FPGA implements logic directly in hardware. There is no instruction fetch. There is no branch prediction. There is no cache miss. The logic IS the circuit. When a market data packet arrives at the FPGA's input pins, the processing begins at the speed of electrical signal propagation through the gate fabric — nanoseconds, not microseconds.

How does it achieve nanosecond latency? Consider parsing a FIX message to extract a price update. On a CPU: the NIC DMAs the packet into memory, the kernel processes the interrupt, copies the packet to userspace (or you bypass the kernel), your application reads the bytes, branches through parsing logic, updates the order book, evaluates the strategy, and builds a response. Dozens of pipeline stages, cache accesses, and branch predictions. On an FPGA: the Ethernet frame enters the FPGA's integrated MAC, the FIX parser is a state machine implemented in combinational logic and registers that extracts fields as bits arrive (no store-and-forward), the order book update is a parallel lookup in on-chip BRAM, the strategy evaluation is a combinational circuit, and the outbound order is serialized onto the wire, all in a single pass through the gate fabric. Wire-to-wire in under 1 microsecond.
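
The "extract fields as bits arrive" idea can be modelled in software as a state machine that sees one byte per call and never buffers the whole message. The sketch below is a C++ model of that behaviour, not RTL and not any firm's implementation. It assumes the standard FIX tag=value encoding with SOH (0x01) delimiters and uses tag 44 (Price) in the usage note; the class and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Software model of a cut-through field extractor: one byte per call, the way
// streaming logic sees one byte (or word) per clock. Fields are tag=value<SOH>.
class StreamingFieldExtractor {
public:
    explicit StreamingFieldExtractor(std::uint32_t target_tag)
        : target_tag_(target_tag) {}

    // Consume one byte. Returns true on the SOH that ends the target field.
    bool feed(std::uint8_t byte) {
        constexpr std::uint8_t kSoh = 0x01;
        if (state_ == State::kTag) {
            if (byte == '=') {
                state_ = State::kValue;
                if (tag_ == target_tag_) {  // start of the field we want
                    mantissa_ = 0;
                    decimals_ = 0;
                    seen_point_ = false;
                }
            } else if (byte >= '0' && byte <= '9') {
                tag_ = tag_ * 10 + (byte - '0');
            }
            return false;
        }
        if (byte == kSoh) {  // end of value: return to tag state
            const bool hit = (tag_ == target_tag_);
            tag_ = 0;
            state_ = State::kTag;
            return hit;
        }
        if (tag_ == target_tag_) {  // accumulate digits as a fixed-point value
            if (byte == '.') {
                seen_point_ = true;
            } else if (byte >= '0' && byte <= '9') {
                mantissa_ = mantissa_ * 10 + (byte - '0');
                if (seen_point_) ++decimals_;
            }
        }
        return false;
    }

    std::int64_t mantissa() const { return mantissa_; }  // digits without the point
    int decimals() const { return decimals_; }           // digits after the point

private:
    enum class State { kTag, kValue };
    State state_ = State::kTag;
    std::uint32_t target_tag_;
    std::uint32_t tag_ = 0;
    std::int64_t mantissa_ = 0;
    int decimals_ = 0;
    bool seen_point_ = false;
};

// Feeds a message byte by byte and returns the index of the byte at which the
// target field became available, or -1 if it never appeared.
long first_hit_index(StreamingFieldExtractor& fx, const std::string& msg) {
    for (std::size_t i = 0; i < msg.size(); ++i) {
        if (fx.feed(static_cast<std::uint8_t>(msg[i]))) return static_cast<long>(i);
    }
    return -1;
}
```

Fed the 32-byte message `8=FIX.4.2<SOH>35=X<SOH>44=101.25<SOH>38=100<SOH>` with target tag 44, the price is available at byte index 24, before the rest of the message has arrived, with mantissa 10125 and 2 decimals. A hardware parser gets its latency advantage from the same property: the decision can start while the tail of the frame is still on the wire.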

Key platforms: Xilinx (now AMD) Alveo U250/U55C and Intel (formerly Altera) Stratix 10 are the dominant HFT FPGA platforms. The Alveo U250 offers ~1.7M LUTs, 54MB of on-chip SRAM, and 100GbE integrated MACs. Firms like Optiver, Citadel Securities, and Jump Trading have dedicated FPGA engineering teams of 20-50 people, each maintaining custom RTL codebases of hundreds of thousands of lines of SystemVerilog.

Hardware-defined trading: The concept is simple but profound — instead of writing software that runs on general-purpose hardware, you design the hardware itself to implement your trading logic. The FPGA is not running your strategy. The FPGA IS your strategy, expressed as a physical circuit. This is why FPGA latency is measured in nanoseconds: there are no software abstractions to traverse. The penalty is flexibility — changing the strategy means resynthesizing the bitstream (hours to days), not recompiling code (seconds). The teams that win are those that architect their FPGAs with the right abstraction boundaries: generic feed handlers and order encoders in fixed RTL, with parameterized strategy logic that can be reconfigured quickly.

The CPU–FPGA Split

On the FPGA (Critical Path)

Feed handler: Ethernet MAC → IP/UDP parsing → exchange market data protocol decode (ITCH, NYSE Arca, CME MDP3) → order book reconstruction. All in streaming logic, no store-and-forward.

Pre-trade risk: Position limits, order rate limits, price band checks, fat-finger guards. These are simple comparators and counters that must never be bypassed — and on an FPGA they add <10ns.

Order encoder: Strategy decision → FIX/native order-entry protocol encoding (e.g., OUCH) → TCP/IP framing → wire. The inverse of the feed handler, equally latency-critical.

Simple strategies: Market making with deterministic spread logic, statistical arbitrage triggers, cross-venue price comparisons. Anything that can be expressed as combinational logic or simple state machines.
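
The pre-trade risk checks listed above reduce to a handful of comparisons that can all be evaluated at once. The sketch below models that in C++: every check is computed unconditionally and the results are OR-ed into a reject mask, mirroring parallel comparators rather than a chain of early returns. The struct layouts, field names, and reject codes are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical limit, state, and order records; field names are illustrative.
struct RiskLimits {
    std::int64_t max_order_qty;          // fat-finger guard
    std::int64_t max_abs_position;       // position limit
    std::int64_t price_band_ticks;       // allowed distance from reference price
    std::uint32_t max_orders_in_window;  // order rate limit
};

struct RiskState {
    std::int64_t position;
    std::int64_t reference_price_ticks;
    std::uint32_t orders_in_window;  // maintained by a counter elsewhere
};

struct Order {
    std::int64_t signed_qty;  // positive = buy, negative = sell
    std::int64_t price_ticks;
};

constexpr std::uint32_t kRejectFatFinger = 1u << 0;
constexpr std::uint32_t kRejectPosition = 1u << 1;
constexpr std::uint32_t kRejectPriceBand = 1u << 2;
constexpr std::uint32_t kRejectRate = 1u << 3;

inline std::int64_t abs64(std::int64_t v) { return v < 0 ? -v : v; }

// Returns 0 if the order passes, otherwise a bitmask of failed checks.
// All four comparisons are evaluated regardless of the others.
std::uint32_t check_order(const Order& o, const RiskState& s, const RiskLimits& l) {
    std::uint32_t reject = 0;
    reject |= (abs64(o.signed_qty) > l.max_order_qty) ? kRejectFatFinger : 0u;
    reject |= (abs64(s.position + o.signed_qty) > l.max_abs_position) ? kRejectPosition : 0u;
    reject |= (abs64(o.price_ticks - s.reference_price_ticks) > l.price_band_ticks)
                  ? kRejectPriceBand : 0u;
    reject |= (s.orders_in_window >= l.max_orders_in_window) ? kRejectRate : 0u;
    return reject;
}
```

Returning a mask instead of a single code keeps the logic branch-free and tells the control plane exactly which limits were hit.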

On the CPU (Non-Critical Path)

Complex strategy logic: Portfolio optimization, multi-leg options pricing, machine learning inference, regime detection. These require floating-point arithmetic, large memory, and algorithmic flexibility that FPGAs handle poorly.

Risk management: End-of-day P&L, portfolio Greeks, VaR calculations, margin monitoring. Important but not latency-sensitive — milliseconds are fine.

Configuration & control: Parameter updates to the FPGA (spread widths, position limits, symbol universe), monitoring dashboards, logging, compliance recording.

Recovery & resilience: Gap detection, sequence number recovery, reconnection logic, failover coordination. The CPU manages the system's lifecycle; the FPGA handles the hot path.

Software Stack

Stack Layers

Kernel Bypass (DPDK/ef_vi)

Lock-Free Data Structures

Custom Memory Allocators

C++ / Rust (Critical Path)

FPGA HDL (Verilog/VHDL)

Python (Research/Backtesting)

The Latency Stack — Where Every Microsecond Goes

Savings by Software Optimization

Kernel bypass eliminates the entire Linux networking stack from the critical path. Standard path: NIC → DMA to ring buffer → interrupt → kernel softirq → sk_buff allocation → protocol processing → socket buffer → copy to userspace → application. With kernel bypass (DPDK, Solarflare OpenOnload, ef_vi, or ExaNIC): NIC → DMA directly to userspace-mapped memory → application polls the buffer. No interrupts, no context switches, no copies. Saves ~10µs per packet. This is table stakes: every serious HFT firm uses kernel bypass.

Lock-free data structures replace mutexes and condition variables with atomic compare-and-swap (CAS) operations and memory ordering guarantees. A lock acquisition on a contended mutex can cost 5-15µs (futex syscall, context switch, scheduler). A CAS operation costs ~20ns. Lock-free SPSC (single-producer, single-consumer) ring buffers for inter-thread communication are the standard pattern. The Disruptor pattern (LMAX) demonstrated this at scale in 2011.
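
A bounded SPSC ring buffer is short enough to sketch in full. The version below is a minimal illustration using `std::atomic`, not the LMAX Disruptor and not any particular firm's queue. Because there is exactly one producer and one consumer, it needs only acquire/release loads and stores; a CAS loop would become necessary with multiple producers or consumers. The class name, the power-of-two capacity requirement, and the 64-byte cache-line alignment are assumptions of this sketch.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Bounded single-producer/single-consumer queue. try_push may be called from
// one thread only, try_pop from one other thread only.
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer side. Returns false if the ring is full.
    bool try_push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) return false;
        buf_[head & (Capacity - 1)] = value;
        // Release: the slot write above becomes visible before the new head.
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns false if the ring is empty.
    bool try_pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return false;
        out = buf_[tail & (Capacity - 1)];
        // Release: the slot read above completes before the slot is handed back.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

private:
    // Separate cache lines so producer and consumer do not false-share.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
    alignas(64) std::array<T, Capacity> buf_{};
};
```

In a trading process the producer would typically be the thread polling the NIC and the consumer the strategy thread, each pinned to its own core and spinning on `try_pop` or `try_push` instead of blocking.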

Custom memory allocators replace malloc/free (which maintain general-purpose free lists and can fall back to mmap/brk syscalls) with pre-allocated pools of fixed-size objects. No syscalls, no fragmentation, deterministic allocation time. Object pools are initialized at startup and their memory is never returned to the OS. This eliminates ~1µs of jitter per allocation and, critically, eliminates the tail latency spikes from garbage collection or heap compaction.
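
A fixed-size object pool of the kind described above can be sketched in a few dozen lines. This version keeps all slots inside the pool object and threads a free list through the unused ones, so `create` and `destroy` are a couple of pointer moves with no syscalls. It is a single-threaded illustration under those assumptions, and the class name and interface are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>

// Pool of N slots, each big enough for one T. Not thread-safe.
template <typename T, std::size_t N>
class ObjectPool {
    static_assert(N > 0, "pool needs at least one slot");

public:
    ObjectPool() {
        // Chain every slot into the free list once, at startup.
        for (std::size_t i = 0; i + 1 < N; ++i) slots_[i].next = &slots_[i + 1];
        slots_[N - 1].next = nullptr;
        free_ = &slots_[0];
    }
    ObjectPool(const ObjectPool&) = delete;
    ObjectPool& operator=(const ObjectPool&) = delete;

    // Constructs a T in a free slot; returns nullptr if the pool is exhausted.
    template <typename... Args>
    T* create(Args&&... args) {
        if (free_ == nullptr) return nullptr;
        Slot* slot = free_;
        free_ = slot->next;
        return new (slot->storage) T(std::forward<Args>(args)...);
    }

    // Destroys the object and pushes its slot back on the free list.
    void destroy(T* obj) {
        assert(obj != nullptr);
        obj->~T();
        Slot* slot = reinterpret_cast<Slot*>(obj);
        slot->next = free_;
        free_ = slot;
    }

private:
    // A slot holds either a free-list link or the storage for a live object.
    union Slot {
        Slot* next;
        alignas(T) unsigned char storage[sizeof(T)];
    };
    Slot slots_[N];
    Slot* free_ = nullptr;
};
```

Because the free list is last-in, first-out, a slot that was just released is the next one handed out, which tends to keep recently used memory warm in cache.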

FPGA on the critical path removes software entirely for feed handling and order encoding. The software stack applies only to the strategy layer and non-latency-sensitive operations. The critical path — market data in, order out — never touches a CPU.

Language Landscape

HFT Language Usage by Domain
C++ (dominant): 90%
Core trading systems, strategy engines, feed handlers. C++17/20/23. Templates for zero-cost abstractions. constexpr everything. No exceptions on hot path. -O3 -march=native.
Rust (emerging): 15%
New systems at some firms. Memory safety without GC. Zero-cost abstractions. Competitive latency with C++. Adoption limited by ecosystem maturity and hiring pool.
Java (legacy): 25%
LMAX, some banks. GC-tuned (Azul Zing C4, ZGC). Viable for mid-frequency. 99th percentile latency is the problem: GC pauses create tail risk. Mechanical sympathy (cache-line alignment, off-heap allocation) mitigates but doesn't eliminate.
Python (research only): 40%
Research, backtesting, data analysis, model training. Never on the hot path. NumPy/Pandas for offline analysis. Jupyter for exploration. ML model training (PyTorch/JAX). The model is trained in Python; inference runs in C++ or on an FPGA.
SystemVerilog / VHDL (FPGA)
Hardware description languages for FPGA design. Not "programming" in the traditional sense: you're describing circuits. Synthesized to bitstreams that configure the FPGA gate fabric. 100K+ lines of RTL for a production feed handler + order gateway.

The Physics Limit

When you've eliminated every software microsecond, what's left is the speed of light

Speed of Light (vacuum): 299,792 km/s
Speed in Fiber: ~203,940 km/s (c/1.47)
Speed in Air (microwave): ~299,700 km/s (≈ c)

Minimum Theoretical Latency by Route
Route | Distance | Fiber (c/1.47) | Microwave (≈c) | Physics Limit
Chicago ↔ New Jersey | ~1,200 km | ~5.88ms | ~4.00ms | 4.00ms one-way
London ↔ Frankfurt | ~640 km | ~3.14ms | ~2.13ms | 2.13ms one-way
Tokyo ↔ Osaka | ~500 km | ~2.45ms | ~1.67ms | 1.67ms one-way
Within data center | ~100m | ~0.49µs | N/A | 0.33µs

What's left to optimize when you're at the speed of light?

1. Path shortening. Real fiber routes are 20-40% longer than the geodesic because they follow railroad rights-of-way, avoid mountains, and navigate urban infrastructure. Microwave paths are closer to geodesic but require line-of-sight relay towers every 50-80 km. Millimeter-wave and free-space laser links can achieve near-geodesic paths with fewer relay points but are more susceptible to atmospheric attenuation.

2. Processing latency. Even at 500ns wire-to-wire on an FPGA, there are gates to optimize. Rebalancing pipeline stages can raise the clock rate and trim cycles from the path. Combinational shortcuts (carry-look-ahead adders, parallel-prefix structures) save individual clock cycles. When your competitor is at 500ns and you're at 480ns, those 20ns matter: they are the difference between being first or second in the matching engine queue.

3. Serialization delay. A 64-byte Ethernet frame at 10GbE takes 51.2ns to serialize. At 100GbE it takes 5.12ns. Moving to 100GbE (or 400GbE) reduces serialization delay by an order of magnitude. This is a real and measurable improvement for minimum-size order packets.

4. Switch hop elimination. Every network switch adds 200-400ns of latency. Direct NIC-to-NIC connections (cross-connects within a data center) eliminate switch hops. Some firms negotiate with exchanges for dedicated cross-connects to the matching engine.

5. Hollow-core fiber. An emerging technology in which light travels through an air core (n ≈ 1.0) instead of glass. This would give fiber close to the speed advantage of microwave with the reliability advantage of a physical cable. Currently expensive and limited in availability, but it represents the next frontier.
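
The serialization figures in item 3 follow directly from frame size divided by line rate. A minimal sketch of that arithmetic is shown below. The function name is illustrative, and the calculation covers only the frame bits themselves, ignoring preamble and inter-frame gap.

```cpp
#include <cassert>
#include <cstddef>

// Time to clock a frame onto the wire, in nanoseconds.
// bits / (Gbit/s) yields nanoseconds directly.
double serialization_delay_ns(std::size_t frame_bytes, double link_gbps) {
    assert(link_gbps > 0.0);
    return static_cast<double>(frame_bytes) * 8.0 / link_gbps;
}
```

For a 64-byte frame this gives 51.2 ns at 10GbE, 5.12 ns at 100GbE, and 1.28 ns at 400GbE.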

The diminishing returns curve: The first 10 years of HFT optimization (1995-2005) delivered a 100× improvement and removed 990ms. The next 10 years (2005-2015) delivered a further 2,000×, but removed only about 10ms. The last 8 years (2015-2023) delivered 10×, removing 4.5µs. Each subsequent microsecond costs exponentially more to eliminate. The industry is approaching an asymptote defined by physics, and the competitive advantage has shifted from raw latency to consistency (low jitter), intelligence (better signals), and risk management (surviving adverse selection).

Co-Location Geography

Where the matching engines live — and why even rack placement matters

Primary US Data Centers
Exchange | Location | Facility
CME Group | Aurora, IL | CME co-lo facility. Futures, options on futures. E-mini S&P, crude oil, treasury futures. The Globex matching engine runs here.
NYSE | Mahwah, NJ | 400,000 sq ft facility. Equities matching engine. The physical successor to the trading floor at 11 Wall Street.
NASDAQ | Carteret, NJ | Equinix NY11. NASDAQ matching engine. Also hosts dark pools and ATS operators seeking proximity.
CBOE | Secaucus, NJ | Options matching engine. SPX options, VIX futures. Equinix NY4/NY5 complex.

Why New Jersey?

Historical gravity: Wall Street was in Manhattan. Early electronic trading infrastructure was built nearby. As trading went electronic and matching engines needed space, power, and cooling that Manhattan couldn't provide, data centers moved across the Hudson to northern New Jersey — close enough to maintain low-latency connections to Wall Street firms, but with the space, power grid capacity, and cost structure needed for warehouse-scale computing.

The NJ data center corridor: Mahwah, Secaucus, Carteret, and Weehawken form a dense corridor of financial data centers. Equinix alone operates several major facilities in the area (its NY-series sites). CyrusOne, QTS, and Digital Realty have additional campuses. The density creates a network effect: once the exchanges are there, the firms must be there, which attracts more exchanges, dark pools, and service providers.

Power: Northern NJ has robust grid infrastructure. A single co-location facility can draw 20-40 MW. The area benefits from multiple utility feeds and proximity to natural gas generation.

Fair Co-Location

The cable length problem: Even within a data center, cable length varies. A rack 10 meters from the matching engine switch has roughly 33-50ns less latency than a rack 20 meters away (33ns at c, about 49ns at fiber speed). Over a year of trading, that consistent advantage translates to thousands of queue-priority wins. Regulators and exchanges recognized this asymmetry.

Equalized cable lengths: Most major exchanges now mandate equal cable lengths to all co-located participants. Every participant's cross-connect to the matching engine is the same length — excess cable is coiled on the rack. CME's Aurora facility and NYSE's Mahwah facility both implement equalized cables. This doesn't eliminate the co-location advantage over remote participants, but it ensures fairness among co-located firms.

What you're actually buying: Co-location isn't just about cable length. It's about being on the same switch fabric as the matching engine, eliminating WAN hops, and having deterministic latency. A co-located firm has ~1µs round-trip to the matching engine. A firm in Manhattan has ~200µs. A firm in Chicago has ~8ms. The co-location advantage is 200× over Manhattan and 8,000× over Chicago.

The cost of proximity: A single rack in a Tier 1 exchange co-location facility costs $10,000-$20,000/month. A dedicated cross-connect is additional. Power is metered separately. A typical HFT firm occupies 4-8 racks per venue across 5-10 venues. Annual co-location costs alone can exceed $2-5M before any hardware is installed. This is the rent — the barrier to entry that separates HFT firms from retail traders more effectively than any technological advantage.

The Full Stack

The complete path from market event to order acknowledgment — annotated with approximate latency at each hop

End-to-End Latency Path
Stage | Latency | What happens
Market Event | T=0 | price change
Matching Engine | ~1µs | exchange match
Co-lo Switch | +300ns | network hop
NIC | +100ns | DMA to FPGA
FPGA Feed | +200ns | parse + book
Strategy | +100ns | decision logic
FPGA Order | +150ns | encode + risk
NIC | +50ns | serialize
Switch | +300ns | network hop
Exchange | ~1µs | order accepted

Total Internal Latency: ~900ns, under 1 microsecond

The critical path budget: From the moment a market data packet enters the co-lo switch to the moment the outbound order packet leaves the NIC, processing takes ~900ns: ~300ns for the inbound switch hop plus ~600ns from NIC to NIC (100 + 200 + 100 + 150 + 50). The exchange adds ~1µs on each side (feed dissemination + order matching), and the outbound switch hop adds another ~300ns. Total round-trip from market event to order in the matching engine queue: approximately 3-4µs when co-located. This is the state of the art in 2023-2024. Further improvement requires eliminating switch hops (direct cross-connects), faster FPGA fabric, and smaller packet sizes.
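
The budget can be totalled directly from the per-hop figures in the path above. The sketch below lists the nine timed hops in order and sums any contiguous span of them. The `Hop` struct, the `kPath` array, and `span_ns` are illustrative names, and the two "~1µs" exchange figures are entered as 1000 ns.

```cpp
#include <cassert>
#include <cstddef>

// One timed hop of the end-to-end path, latency in nanoseconds.
struct Hop {
    const char* name;
    int latency_ns;
};

// Hops in path order, figures copied from the end-to-end latency path.
const Hop kPath[] = {
    {"Matching engine (exchange match)", 1000},  // index 0
    {"Co-lo switch (network hop)", 300},         // index 1
    {"NIC (DMA to FPGA)", 100},                  // index 2
    {"FPGA feed (parse + book)", 200},           // index 3
    {"Strategy (decision logic)", 100},          // index 4
    {"FPGA order (encode + risk)", 150},         // index 5
    {"NIC (serialize)", 50},                     // index 6
    {"Switch (network hop)", 300},               // index 7
    {"Exchange (order accepted)", 1000},         // index 8
};
const std::size_t kHopCount = sizeof(kPath) / sizeof(kPath[0]);

// Sum of latencies for hops first..last, inclusive.
int span_ns(std::size_t first, std::size_t last) {
    assert(first <= last && last < kHopCount);
    int total = 0;
    for (std::size_t i = first; i <= last; ++i) total += kPath[i].latency_ns;
    return total;
}
```

With these figures, the NIC-to-NIC span (`span_ns(2, 6)`) is 600 ns, adding the inbound switch hop (`span_ns(1, 6)`) gives the ~900 ns internal figure, and the whole path (`span_ns(0, 8)`) comes to 3,200 ns, which sits inside the quoted 3-4µs range.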

Total Round-Trip Budget

Latency Budget Breakdown