F1 2026 — HFT Technology: The Latency Stack

Latency Improvement 1995 — 2023

2,000,000×

1000ms → 0.0005ms

Dial-up modem → FPGA with kernel bypass on co-located hardware

Latency Evolution

The defining curve of electronic market structure — log-scale descent from human-speed to physics-limited

Latency by Year (Log Scale) — 9 Data Points

Latency Breakdown

Every data point in the descent — from dial-up to FPGA, color-coded by technology era

Complete Latency History

Year	Latency	Medium / Technology	Context	Era
1995	1000ms	Dial-up modem	Seconds to execute. Phone-based trading.	--
2000	100ms	T1 lines to exchanges	Electronic exchanges. Sub-second execution becomes possible.	--
2005	10ms	Co-located fiber	Servers in exchange data centers. Reg NMS drives speed competition.	--
2007	1ms	Optimized fiber + co-lo	Straighter, purpose-built fiber routes (Spread Networks went live in 2010). Sub-millisecond matching engines.	--
2010	100.0µs	FPGA + co-location	Hardware acceleration. FPGA processes market data in microseconds.	--
2012	10.0µs	Microwave + FPGA	McKay Brothers microwave. Chicago-NJ in 4.1ms one-way (vs 6.5ms fiber).	--
2015	5.0µs	Millimeter wave + custom NIC	Kernel bypass networking. Custom network cards process packets in nanoseconds.	--
2018	1.0µs	Laser + FPGA + custom silicon	Sub-microsecond tick-to-trade. ASIC-based feed handlers.	--
2023	500ns	Custom ASIC + integrated optics	Nanosecond-scale decisions. Physics limit (speed of light) becomes binding constraint.	--
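The headline improvement factor can be recomputed directly from the table. A minimal sketch in Python; the `HISTORY` list and the `improvement` helper are illustrative names that do not appear on the original page, and the latencies are the table's values converted to nanoseconds:

```python
# Latency figures copied from the Complete Latency History table, in nanoseconds.
HISTORY = [
    (1995, 1_000_000_000),  # 1000 ms
    (2000, 100_000_000),    # 100 ms
    (2005, 10_000_000),     # 10 ms
    (2007, 1_000_000),      # 1 ms
    (2010, 100_000),        # 100 µs
    (2012, 10_000),         # 10 µs
    (2015, 5_000),          # 5 µs
    (2018, 1_000),          # 1 µs
    (2023, 500),            # 500 ns
]


def improvement(start_year, end_year):
    """Return how many times lower the latency is at end_year than at start_year."""
    by_year = dict(HISTORY)
    return by_year[start_year] / by_year[end_year]


if __name__ == "__main__":
    # Ratio between each pair of consecutive rows, then the overall ratio.
    for (y0, _), (y1, _) in zip(HISTORY, HISTORY[1:]):
        print(f"{y0} -> {y1}: {improvement(y0, y1):,.0f}x")
    print(f"1995 -> 2023: {improvement(1995, 2023):,.0f}x")
```

Run as a script, this prints the ratio for each step and ends with the overall 2,000,000x figure (1000 ms divided by 500 ns).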

Network Infrastructure

Transmission Media Comparison

Network Layer Technologies

Medium	Speed (fraction of c)	Latency/km	Bandwidth	Weather Sensitivity	Cost Tier
Fiber Optic	--	--	Terabits/s	--	--
Microwave	--	--	~1 Gbps	--	--
Millimeter Wave	--	--	~10 Gbps	--	--
Free-Space Optical (Laser)	--	--	~100 Gbps	--	--

Speed Comparison — Fraction of c

Propagation Speed by Medium

Key Routes

Chicago — New Jersey

~1,200 km

The canonical HFT route

CME Aurora, IL ↔ NYSE Mahwah, NJ. The most fought-over data path in finance. Fiber: ~6.5ms one-way. Microwave: ~4.0ms one-way. That ~2.5ms gap between light in glass and radio in air is worth hundreds of millions annually. Spread Networks spent $300M laying a fiber route in 2010; it was obsoleted by microwave within two years.

London — Frankfurt

~640 km

LSE ↔ Eurex

LD4 Slough ↔ FR2 Frankfurt. European backbone. Microwave towers across the North Sea and Low Countries. Shorter distance means tighter absolute latency margins. Sub-4ms one-way via microwave. The Channel crossing complicates microwave line-of-sight — requires relay towers on high points in Belgium/Netherlands.

Tokyo — Osaka

~500 km

TSE ↔ OSE

Kanto ↔ Kansai. JPX arbitrage route. Mountainous terrain limits microwave options — Japan's topography favors fiber or millimeter-wave solutions with more relay hops. Sub-3ms one-way is the target. Less competitive than US/EU routes but still multi-million dollar infrastructure.

Why Light is Faster in Air Than Glass

The refractive index problem: Light in a vacuum travels at c = 299,792 km/s. In fiber optic cable, light travels through a glass core (silica, SiO₂) with a refractive index of n ≈ 1.47. The effective speed is c/n ≈ 203,940 km/s — about 68% of c. In air, n ≈ 1.0003, so microwave propagation is effectively at c.

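The physics: The refractive index arises because the light's oscillating electric field polarizes the electron clouds of the silicon and oxygen atoms in the glass lattice. The driven electrons re-radiate at the same frequency with a phase lag, and the sum of the incoming and re-radiated waves advances more slowly than light in a vacuum. (The popular picture of photons being absorbed and re-emitted atom by atom is a simplification; true absorption would scatter the signal rather than guide it.) The cumulative effect reduces the effective group velocity. Air molecules are too sparse to cause significant delay at microwave frequencies.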

Propagation Delay

v_fiber = c / n ≈ 299,792 / 1.47 ≈ 203,940 km/s
v_air ≈ c ≈ 299,792 km/s

Chicago–NJ (1,200 km):
  t_fiber = 1200 / 203,940 ≈ 5.88 ms
  t_air = 1200 / 299,792 ≈ 4.00 ms
  Δt ≈ 1.88 ms — an eternity in HFT

Real-world fiber paths are longer than straight-line distance (routing around obstacles, rights-of-way). Actual fiber latency Chicago-NJ is ~6.5ms vs ~4.0ms microwave. The microwave advantage is even larger than refractive index alone suggests because the path is closer to geodesic.
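The gap between the ~5.88ms straight-line figure and the ~6.5ms observed fiber latency can be turned into an implied route length. A rough sketch, assuming all of the 6.5ms is propagation delay (no allowance for repeaters or switching); the helper name `implied_path_km` is illustrative:

```python
C_KM_PER_S = 299_792   # speed of light in vacuum, km/s (from the text)
N_FIBER = 1.47         # refractive index of the fiber core (from the text)
GEODESIC_KM = 1_200    # approximate Chicago-NJ distance (from the text)


def implied_path_km(one_way_ms, refractive_index):
    """Path length that would produce the given one-way latency."""
    speed_km_per_s = C_KM_PER_S / refractive_index
    return one_way_ms / 1_000 * speed_km_per_s


fiber_km = implied_path_km(6.5, N_FIBER)
excess = fiber_km / GEODESIC_KM - 1  # fractional detour relative to the geodesic

print(f"Implied fiber path: {fiber_km:,.0f} km ({excess:.0%} longer than {GEODESIC_KM:,} km)")
```

Under that assumption the fiber route works out to roughly 1,326 km, about 10% longer than the ~1,200 km straight-line distance.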

Compute Hardware

Hardware Comparison

Compute Technologies

Hardware	Typical Latency	Flexibility	Power	Use Case
CPU (x86)	~1-10μs	Highest	150-300W	--
GPU	~10-100μs	Medium	300-700W	--
FPGA	~100ns-1μs	Medium	10-75W	--
ASIC	<100ns	None	5-50W	--

Latency by Hardware Type (Log Scale)

Processing Latency — CPU → GPU → FPGA → ASIC

FPGA Deep Dive

What is an FPGA? A Field-Programmable Gate Array is a chip containing millions of configurable logic blocks (CLBs) connected by a programmable interconnect fabric. Unlike a CPU — which fetches, decodes, and executes instructions sequentially through a pipeline — an FPGA implements logic directly in hardware. There is no instruction fetch. There is no branch prediction. There is no cache miss. The logic IS the circuit. When a market data packet arrives at the FPGA's input pins, the processing begins at the speed of electrical signal propagation through the gate fabric — nanoseconds, not microseconds.

How does it achieve nanosecond latency? Consider parsing a FIX message to extract a price update. On a CPU: the NIC DMAs the packet into memory, the kernel processes the interrupt, copies the packet to userspace (or you bypass the kernel), your application reads the bytes, branches through parsing logic, updates the order book, evaluates the strategy, and builds a response. That path involves dozens of pipeline stages, cache accesses, and branch predictions. On an FPGA: the Ethernet frame enters the FPGA's integrated MAC, the FIX parser is a hardware state machine that extracts fields as bytes arrive (no store-and-forward), the order book update is a parallel lookup in on-chip BRAM, the strategy evaluation is a combinational circuit, and the outbound order is serialized onto the wire, all in a single pass through the gate fabric. Wire-to-wire in under 1 microsecond.
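The "extract fields as bytes arrive" idea can be illustrated in software. The following is only an analogy, written in Python rather than RTL: a two-state machine that consumes one byte per call and emits a field the moment its delimiter arrives, without buffering the whole message. The class name `StreamingFixParser` is hypothetical, and the sample bytes are a toy tag=value fragment (tag 55 is Symbol, tag 270 is MDEntryPx), not a complete valid FIX message.

```python
SOH = 0x01  # FIX field delimiter


class StreamingFixParser:
    """Byte-at-a-time tag=value parser: a software analogy for a streaming
    hardware state machine (no store-and-forward of the full message)."""

    IN_TAG, IN_VALUE = 0, 1

    def __init__(self, wanted_tags):
        self.wanted = set(wanted_tags)
        self.state = self.IN_TAG
        self.tag = 0
        self.value = bytearray()

    def feed(self, byte):
        """Consume one byte; return (tag, value) when a wanted field completes."""
        if self.state == self.IN_TAG:
            if byte == ord("="):
                self.state = self.IN_VALUE
            else:
                # Accumulate the decimal tag number digit by digit.
                self.tag = self.tag * 10 + (byte - ord("0"))
            return None

        # IN_VALUE: collect bytes until the SOH delimiter.
        if byte != SOH:
            if self.tag in self.wanted:
                self.value.append(byte)
            return None

        result = (self.tag, bytes(self.value)) if self.tag in self.wanted else None
        # Reset for the next field.
        self.state, self.tag, self.value = self.IN_TAG, 0, bytearray()
        return result


if __name__ == "__main__":
    parser = StreamingFixParser(wanted_tags={55, 270})
    fragment = b"35=X\x0155=ESZ5\x01270=4512.25\x01"
    for b in fragment:
        field = parser.feed(b)
        if field:
            print(field)
```

In hardware the same structure would process a fixed number of bytes per clock cycle; the sketch only shows the control flow, not the timing.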

Key platforms: Xilinx (now AMD) Alveo U250/U55C and Intel (formerly Altera) Stratix 10 are the dominant HFT FPGA platforms. The Alveo U250 offers ~1.3M LUTs, 54MB of on-chip BRAM, and 100GbE integrated MACs. Firms like Optiver, Citadel Securities, and Jump Trading have dedicated FPGA engineering teams of 20-50 people, each maintaining custom RTL codebases of hundreds of thousands of lines of SystemVerilog.

Hardware-defined trading: The concept is simple but profound — instead of writing software that runs on general-purpose hardware, you design the hardware itself to implement your trading logic. The FPGA is not running your strategy. The FPGA IS your strategy, expressed as a physical circuit. This is why FPGA latency is measured in nanoseconds: there are no software abstractions to traverse. The penalty is flexibility — changing the strategy means resynthesizing the bitstream (hours to days), not recompiling code (seconds). The teams that win are those that architect their FPGAs with the right abstraction boundaries: generic feed handlers and order encoders in fixed RTL, with parameterized strategy logic that can be reconfigured quickly.

The CPU–FPGA Split

On the FPGA (Critical Path)

Feed handler: Ethernet MAC → IP/UDP parsing → exchange market data protocol decode (ITCH, ARCA, CME MDP3) → order book reconstruction. All in streaming logic, no store-and-forward.

Pre-trade risk: Position limits, order rate limits, price band checks, fat-finger guards. These are simple comparators and counters that must never be bypassed, and on an FPGA they add <10ns.

Order encoder: Strategy decision → FIX/native order entry protocol encoding (e.g. OUCH) → TCP/IP framing → wire. The inverse of the feed handler, equally latency-critical.

Simple strategies: Market making with deterministic spread logic, statistical arbitrage triggers, cross-venue price comparisons. Anything that can be expressed as combinatorial logic or simple state machines.
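The pre-trade risk checks listed above really are just comparators and counters. A minimal software sketch, assuming integer tick prices and a rate window that some external timer resets; every name and limit here (`RiskLimits`, `PreTradeRisk`, the rejection strings) is hypothetical:

```python
from dataclasses import dataclass


@dataclass
class RiskLimits:
    max_order_qty: int          # fat-finger guard
    max_position: int           # absolute position limit
    price_band_ticks: int       # allowed distance from the reference price
    max_orders_per_window: int  # order rate limit


class PreTradeRisk:
    def __init__(self, limits):
        self.limits = limits
        self.position = 0
        self.orders_in_window = 0

    def check(self, side, qty, price_ticks, reference_ticks):
        """Return a rejection reason, or None if the order passes.
        side is +1 for buy and -1 for sell."""
        lim = self.limits
        if qty <= 0 or qty > lim.max_order_qty:
            return "fat_finger_qty"
        if abs(price_ticks - reference_ticks) > lim.price_band_ticks:
            return "price_band"
        if abs(self.position + side * qty) > lim.max_position:
            return "position_limit"
        if self.orders_in_window >= lim.max_orders_per_window:
            return "order_rate"
        self.orders_in_window += 1  # count only orders that pass
        return None

    def on_fill(self, side, qty):
        self.position += side * qty

    def on_window_reset(self):
        self.orders_in_window = 0
```

On an FPGA each comparison would be evaluated in parallel rather than in sequence; the sequential `if` chain here is only for readability.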

On the CPU (Non-Critical Path)

Complex strategy logic: Portfolio optimization, multi-leg options pricing, machine learning inference, regime detection. These require floating-point arithmetic, large memory, and algorithmic flexibility that FPGAs handle poorly.

Risk management: End-of-day P&L, portfolio Greeks, VaR calculations, margin monitoring. Important but not latency-sensitive — milliseconds are fine.

Configuration & control: Parameter updates to the FPGA (spread widths, position limits, symbol universe), monitoring dashboards, logging, compliance recording.

Recovery & resilience: Gap detection, sequence number recovery, reconnection logic, failover coordination. The CPU manages the system's lifecycle; the FPGA handles the hot path.
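Gap detection on a sequenced feed is one of the CPU-side tasks that is simple to sketch. The example below assumes a feed whose messages carry a monotonically increasing sequence number; the `GapDetector` class is illustrative and does not model any particular exchange's recovery protocol:

```python
class GapDetector:
    """Tracks the next expected sequence number and reports missing ranges."""

    def __init__(self, first_expected=1):
        self.expected = first_expected

    def on_message(self, seq):
        """Return an inclusive (start, end) range of missing sequence numbers,
        or None if nothing new is missing."""
        if seq < self.expected:
            return None  # duplicate or late retransmission; nothing to request
        gap = (self.expected, seq - 1) if seq > self.expected else None
        self.expected = seq + 1
        return gap


if __name__ == "__main__":
    detector = GapDetector()
    for seq in (1, 2, 5, 4, 6):
        gap = detector.on_message(seq)
        if gap:
            print(f"request retransmission of {gap[0]}..{gap[1]}")
```

A real handler would hand the reported range to whatever retransmission or snapshot mechanism the venue provides.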

Software Stack

Stack Layers

Kernel Bypass (DPDK/ef_vi)

Lock-Free Data Structures

Custom Memory Allocators

C++ / Rust (Critical Path)

FPGA HDL (Verilog/VHDL)

Python (Research/Backtesting)

The Latency Stack — Where Every Microsecond Goes

Savings by Software Optimization

Kernel bypass eliminates the entire Linux networking stack from the critical path. Standard path: NIC → DMA to ring buffer → interrupt → kernel softirq → sk_buff allocation → protocol processing → socket buffer → copy to userspace → application. With kernel bypass (DPDK, Solarflare OpenOnload, ef_vi, or ExaNIC): NIC → DMA directly to userspace-mapped memory → application polls the buffer. No interrupts, no context switches, no copies. Saves ~10µs per packet. This is table stakes: every serious HFT firm uses kernel bypass.

Lock-free data structures replace mutexes and condition variables with atomic compare-and-swap (CAS) operations and memory ordering guarantees. A lock acquisition on a contended mutex can cost 5-15µs (futex syscall, context switch, scheduler). A CAS operation costs ~20ns. Lock-free SPSC (single-producer, single-consumer) ring buffers for inter-thread communication are the standard pattern. The Disruptor pattern (LMAX) demonstrated this at scale in 2011.
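The index discipline of an SPSC ring buffer can be shown in a few lines. This Python sketch illustrates only the logic: Python has no explicit memory-ordering controls, so a production version would be written in C++ or Rust with atomic loads and stores using acquire/release ordering. In the single-producer, single-consumer case each index has exactly one writer, so no CAS loop is needed. The class name `SpscRing` is an assumption for illustration.

```python
class SpscRing:
    """Single-producer, single-consumer ring buffer with power-of-two capacity.
    None is used as the 'empty' result, so None cannot be stored as an item."""

    def __init__(self, capacity):
        if capacity < 2 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two >= 2")
        self._slots = [None] * capacity
        self._mask = capacity - 1
        self._head = 0  # next index to read; written only by the consumer
        self._tail = 0  # next index to write; written only by the producer

    def try_push(self, item):
        """Producer side: return False instead of blocking when full."""
        if self._tail - self._head == len(self._slots):
            return False
        self._slots[self._tail & self._mask] = item
        self._tail += 1  # publish after the slot is written (release store in C++)
        return True

    def try_pop(self):
        """Consumer side: return None instead of blocking when empty."""
        if self._head == self._tail:
            return None
        item = self._slots[self._head & self._mask]
        self._head += 1  # free the slot after reading it (release store in C++)
        return item
```

The power-of-two capacity lets the wrap-around be a bit mask rather than a modulo, and neither side ever waits on a lock.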

Custom memory allocators replace malloc/free (which use mmap/brk syscalls and maintain free lists) with pre-allocated pools of fixed-size objects. No syscalls, no fragmentation, deterministic allocation time. Object pools are initialized at startup and never freed. This eliminates ~1µs of jitter per allocation and, critically, eliminates the tail latency spikes from garbage collection or heap compaction.
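A fixed-size object pool follows the same pattern in any language: allocate everything at startup, then only hand out and take back. The Python sketch below shows the shape of the idea; Python still manages memory underneath, so it demonstrates the interface rather than the latency benefit. `Order` and `OrderPool` are hypothetical names.

```python
class Order:
    """A reusable order record with a fixed set of fields."""

    __slots__ = ("symbol", "side", "qty", "price_ticks")

    def __init__(self):
        self.reset()

    def reset(self):
        self.symbol = ""
        self.side = 0
        self.qty = 0
        self.price_ticks = 0


class OrderPool:
    def __init__(self, size):
        # All allocation happens here, once, at startup.
        self._free = [Order() for _ in range(size)]

    def acquire(self):
        """Return a pooled Order, or None if the pool is exhausted."""
        return self._free.pop() if self._free else None

    def release(self, order):
        """Clear the order and return it to the pool for reuse."""
        order.reset()
        self._free.append(order)
```

Returning None on exhaustion, rather than allocating more, keeps the hot path free of allocator calls; sizing the pool is then a startup-time decision.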

FPGA on the critical path removes software entirely for feed handling and order encoding. The software stack applies only to the strategy layer and non-latency-sensitive operations. The critical path — market data in, order out — never touches a CPU.

Language Landscape

HFT Language Usage by Domain

C++ (dominant): 90%

Core trading systems, strategy engines, feed handlers. C++17/20/23. Templates for zero-cost abstractions. constexpr everything. No exceptions on hot path. -O3 -march=native.

Rust (emerging): 15%

New systems at some firms. Memory safety without GC. Zero-cost abstractions. Competitive latency with C++. Adoption limited by ecosystem maturity and hiring pool.

Java (legacy): 25%

LMAX, some banks. GC-tuned (Azul Zing C4, ZGC). Viable for mid-frequency. 99th percentile latency is the problem — GC pauses create tail risk. Mechanical sympathy (cache-line alignment, off-heap allocation) mitigates but doesn't eliminate.

Python (research only): 40%

Research, backtesting, data analysis, model training. Never on the hot path. NumPy/Pandas for offline analysis. Jupyter for exploration. ML model training (PyTorch/JAX). The model is trained in Python; inference runs in C++ or on an FPGA.

SystemVerilog / VHDL (FPGA)

Hardware description languages for FPGA design. Not "programming" in the traditional sense — you're describing circuits. Synthesized to bitstreams that configure the FPGA gate fabric. 100K+ lines of RTL for a production feed handler + order gateway.

The Physics Limit

When you've eliminated every software microsecond, what's left is the speed of light

Speed of Light (vacuum)

299,792

km/s

Speed in Fiber

~203,940

km/s (c/1.47)

Speed in Air (microwave)

~299,700

km/s (≈ c)

Minimum Theoretical Latency by Route

Route	Distance	Fiber (c/1.47)	Microwave (≈c)	Physics Limit
Chicago ↔ New Jersey	~1,200 km	~5.88ms	~4.00ms	4.00ms one-way
London ↔ Frankfurt	~640 km	~3.14ms	~2.13ms	2.13ms one-way
Tokyo ↔ Osaka	~500 km	~2.45ms	~1.67ms	1.67ms one-way
Within data center	~100m	~0.49µs	N/A	0.33µs
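The table values follow from distance divided by propagation speed. A short sketch that reproduces them, using the document's figures for c and the fiber refractive index; the function name `one_way_ms` and the `ROUTES_KM` dictionary are illustrative:

```python
C_KM_PER_S = 299_792  # speed of light in vacuum, km/s
N_FIBER = 1.47        # refractive index of silica fiber

ROUTES_KM = {
    "Chicago-New Jersey": 1_200,
    "London-Frankfurt": 640,
    "Tokyo-Osaka": 500,
}


def one_way_ms(distance_km, refractive_index=1.0):
    """Minimum one-way propagation delay in milliseconds."""
    return distance_km * refractive_index / C_KM_PER_S * 1_000


if __name__ == "__main__":
    for route, km in ROUTES_KM.items():
        fiber = one_way_ms(km, N_FIBER)
        air = one_way_ms(km)
        print(f"{route}: fiber {fiber:.2f} ms, air {air:.2f} ms")
```

The same function gives the in-building row when called with 0.1 km: about 0.49 µs in fiber and 0.33 µs at vacuum speed.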

What's left to optimize when you're at the speed of light?

1. Path shortening. Real fiber routes are 20-40% longer than the geodesic because they follow railroad rights-of-way, avoid mountains, and navigate urban infrastructure. Microwave paths are closer to geodesic but require line-of-sight relay towers every 50-80 km. Millimeter-wave and free-space laser links can achieve near-geodesic paths with fewer relay points but are more susceptible to atmospheric attenuation.

2. Processing latency. Even at 500ns wire-to-wire on an FPGA, there are gates to optimize. Careful pipelining lets the design run at a higher clock frequency, so each stage takes less time. Combinational shortcuts (look-ahead adders, parallel prefix structures) save individual clock cycles. When your competitor is at 500ns and you're at 480ns, those 20ns matter: it's the difference between being first or second in the matching engine queue.

3. Serialization delay. A 64-byte Ethernet frame at 10GbE takes 51.2ns to serialize. At 100GbE it takes 5.12ns. Moving to 100GbE (or 400GbE) reduces serialization delay by an order of magnitude. This is a real and measurable improvement for minimum-size order packets.

4. Switch hop elimination. Every network switch adds 200-400ns of latency. Direct NIC-to-NIC connections (cross-connects within a data center) eliminate switch hops. Some firms negotiate with exchanges for dedicated cross-connects to the matching engine.

5. Hollow-core fiber. This is an emerging technology in which light travels through an air core (n ≈ 1.0) instead of glass. It would give fiber close to the speed advantage of microwave with the reliability advantage of a physical cable. Currently expensive and limited in availability, but it represents the next frontier.
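The serialization figures in item 3 come from frame size divided by line rate. A small sketch, assuming only the frame bytes themselves are counted (preamble and inter-frame gap are ignored); `serialization_ns` is an illustrative name:

```python
def serialization_ns(frame_bytes, link_gbps):
    """Time to clock a frame onto the wire, in nanoseconds.
    Bits divided by gigabits-per-second gives nanoseconds directly."""
    return frame_bytes * 8 / link_gbps


if __name__ == "__main__":
    for gbps in (10, 100, 400):
        print(f"64-byte frame at {gbps}GbE: {serialization_ns(64, gbps):.2f} ns")
```

For a 64-byte frame this gives 51.2 ns at 10GbE, 5.12 ns at 100GbE, and 1.28 ns at 400GbE.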

The diminishing returns curve: Going by the latency history above, the first 10 years of HFT optimization (1995-2005) delivered a 100× improvement and the next 10 years (2005-2015) a further 2,000×, as co-location, FPGAs, and microwave arrived in quick succession. The last 8 years (2015-2023) delivered only 10×. Each subsequent microsecond costs exponentially more to eliminate. The industry is approaching an asymptote defined by physics, and the competitive advantage has shifted from raw latency to consistency (low jitter), intelligence (better signals), and risk management (surviving adverse selection).

Co-Location Geography

Where the matching engines live — and why even rack placement matters

Primary US Data Centers

Exchange	Location	Facility
CME Group	Aurora, IL	CME co-lo facility. Futures, options on futures. E-mini S&P, crude oil, treasury futures. The Globex matching engine runs here.
NYSE	Mahwah, NJ	400,000 sq ft facility. Equities matching engine. The physical successor to the trading floor at 11 Wall Street.
NASDAQ	Carteret, NJ	Equinix NY11. NASDAQ matching engine. Also hosts dark pools and ATS operators seeking proximity.
CBOE	Secaucus, NJ	Options matching engine. SPX options, VIX futures. Equinix NY4/NY5 complex.

Why New Jersey?

Historical gravity: Wall Street was in Manhattan. Early electronic trading infrastructure was built nearby. As trading went electronic and matching engines needed space, power, and cooling that Manhattan couldn't provide, data centers moved across the Hudson to northern New Jersey — close enough to maintain low-latency connections to Wall Street firms, but with the space, power grid capacity, and cost structure needed for warehouse-scale computing.

The NJ data center corridor: Mahwah, Secaucus, Carteret, and Weehawken form a dense corridor of financial data centers. Equinix alone operates multiple major facilities in the area under its NY designations. CyrusOne, QTS, and Digital Realty have additional campuses. The density creates a network effect: once the exchanges are there, the firms must be there, which attracts more exchanges, dark pools, and service providers.

Power: Northern NJ has robust grid infrastructure. A single co-location facility can draw 20-40 MW. The area benefits from multiple utility feeds and proximity to natural gas generation.

Fair Co-Location

The cable length problem: Even within a data center, cable length varies. A rack 10 meters from the matching engine switch has roughly 33-50ns less latency than a rack 20 meters away (about 3.3ns per meter at vacuum speed, about 4.9ns per meter in glass fiber). Over a year of trading, a consistent advantage of that size translates to thousands of queue-priority wins. Regulators and exchanges recognized this asymmetry.
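The per-meter cost of cable depends on the propagation speed assumed. A rough sketch that brackets it with two assumptions: vacuum speed as the lower bound, and the fiber index of 1.47 used elsewhere in this document. The helper name `cable_delay_ns` is illustrative.

```python
C_M_PER_NS = 0.299792  # speed of light in vacuum, meters per nanosecond


def cable_delay_ns(length_m, refractive_index=1.47):
    """One-way propagation delay through a cable of the given length."""
    return length_m * refractive_index / C_M_PER_NS


if __name__ == "__main__":
    extra_m = 20 - 10  # difference between the two rack positions
    print(f"vacuum bound: {cable_delay_ns(extra_m, 1.0):.1f} ns")
    print(f"glass fiber:  {cable_delay_ns(extra_m):.1f} ns")
```

For the 10-meter difference this gives about 33.4 ns at vacuum speed and about 49.0 ns in glass fiber.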

Equalized cable lengths: Most major exchanges now mandate equal cable lengths to all co-located participants. Every participant's cross-connect to the matching engine is the same length — excess cable is coiled on the rack. CME's Aurora facility and NYSE's Mahwah facility both implement equalized cables. This doesn't eliminate the co-location advantage over remote participants, but it ensures fairness among co-located firms.

What you're actually buying: Co-location isn't just about cable length. It's about being on the same switch fabric as the matching engine, eliminating WAN hops, and having deterministic latency. A co-located firm has ~1µs round-trip to the matching engine. A firm in Manhattan has ~200µs. A firm in Chicago has ~8ms. The co-location advantage is 200× over Manhattan and 8,000× over Chicago.

The cost of proximity: A single rack in a Tier 1 exchange co-location facility costs $10,000-$20,000/month. A dedicated cross-connect is additional. Power is metered separately. A typical HFT firm occupies 4-8 racks per venue across 5-10 venues. Annual co-location costs alone can exceed $2-5M before any hardware is installed. This is the rent — the barrier to entry that separates HFT firms from retail traders more effectively than any technological advantage.
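The annual figure can be bracketed from the per-rack numbers. A back-of-envelope sketch covering rack rent only (cross-connects, power, and hardware excluded, as in the text); `annual_rack_rent` is an illustrative name:

```python
def annual_rack_rent(racks_per_venue, venues, monthly_cost_per_rack):
    """Yearly rack rent in dollars across all venues."""
    return racks_per_venue * venues * monthly_cost_per_rack * 12


low = annual_rack_rent(racks_per_venue=4, venues=5, monthly_cost_per_rack=10_000)
high = annual_rack_rent(racks_per_venue=8, venues=10, monthly_cost_per_rack=20_000)

print(f"Rack rent range: ${low:,} to ${high:,} per year")
```

The low end of the stated ranges gives $2.4M per year and the high end $19.2M, so the $2-5M figure corresponds to the smaller footprints.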

The Full Stack

The complete path from market event to order acknowledgment — annotated with approximate latency at each hop

End-to-End Latency Path

Stage	Added Latency	Activity
Market Event	T=0	price change
Matching Engine	~1µs	exchange match
Co-lo Switch	+300ns	network hop
NIC	+100ns	DMA to FPGA
FPGA Feed	+200ns	parse + book
Strategy	+100ns	decision logic
FPGA Order	+150ns	encode + risk
NIC	+50ns	serialize
Switch	+300ns	network hop
Exchange	~1µs	order accepted

Total Internal Latency: ~900ns — under 1 microsecond

The critical path budget: From the moment a market data packet enters the co-lo switch to the moment the outbound order packet leaves the NIC, the entire processing takes ~900ns (300 + 100 + 200 + 100 + 150 + 50). The exchange adds ~1µs on each side (feed dissemination + order matching), and the return switch hop adds another ~300ns. Total round-trip from market event to order in the matching engine queue: approximately 3-4µs when co-located. This is the state of the art in 2023-2024. Further improvement requires eliminating switch hops (direct cross-connects), faster FPGA fabric, and smaller packet sizes.
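Summing the per-hop figures from the end-to-end path makes the budget explicit. The stage list below copies the approximate values shown in the path; the variable names are illustrative:

```python
# (stage, added latency in nanoseconds), in path order.
PATH_NS = [
    ("Matching engine (exchange match)", 1_000),
    ("Co-lo switch", 300),
    ("NIC (DMA to FPGA)", 100),
    ("FPGA feed (parse + book)", 200),
    ("Strategy (decision logic)", 100),
    ("FPGA order (encode + risk)", 150),
    ("NIC (serialize)", 50),
    ("Switch (return hop)", 300),
    ("Exchange (order accepted)", 1_000),
]

# Inbound switch through outbound NIC: the ~900 ns internal figure.
internal_ns = sum(ns for _, ns in PATH_NS[1:7])

# Everything from market event to order accepted.
total_ns = sum(ns for _, ns in PATH_NS)

print(f"internal: {internal_ns} ns, end to end: {total_ns / 1000:.1f} µs")
```

The internal portion sums to 900 ns and the full path to 3.2 µs, which sits at the low end of the 3-4 µs range quoted above.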

Total Round-Trip Budget

Latency Budget Breakdown