Finance Notes


Ch 6. HFT Optimization – Architecture and Operating System

The most important question to ask is what we are trying to achieve – what level of performance is good enough for HFT strategies?

  • Context switches (target: <= 20 µs)
  • Building lock-free data structures (target: <= 20 µs)

Context Switches

  • Definition:
    • Operation by which the state of a running process/thread is saved and the state of a different process/thread is restored.
    • Allows resumption of execution where it left off.
    • Foundation for multitasking in modern operating systems (OSs), creating the illusion of running more processes than there are CPU cores.

Types of Context Switches

  1. Hardware or Software Context Switches:

    • Hardware Context Switching:
      • Uses special hardware features (e.g., Task State Segments (TSSs)).
      • Saves register and processor state for the current process, then switches to a different process.
      • Generally faster thanks to dedicated registers and instructions, but can be slower in some cases because all registers must be saved unconditionally.
    • Software Context Switching:
      • Saves the current stack pointer and loads the new stack pointer to execute new code.
      • Registers, flags, data segments, and other relevant registers are pushed onto the old stack and popped off the new stack.
      • Preferred in modern OSs for better fault tolerance and customization of saved/restored registers.
  2. Context Switches Between Threads or Processes:

    • Process Switching Latency:
      • Latency associated with switching between processes.
      • More time-consuming because the new process's code and data must be fetched into the cache, evicting the old process's.
      • Requires flushing virtual-memory translation structures (e.g., the Translation Lookaside Buffer (TLB)).
    • Thread Switching Latency:
      • Latency associated with switching between threads.
      • Generally faster since threads share the same address space, reducing the need for flushing/cleaning memory structures.
      • Less overhead in virtual memory management compared to process switching.

Why Context Switches Are Beneficial

  • Multitasking:

    • Task schedulers in modern OSs switch processes in and out of the CPU.
    • Reasons for switching:
      • Process completion.
      • Waiting on I/O or synchronization.
      • Preventing CPU-intensive processes from starving other processes of CPU time.
  • Interrupt Handling:

    • Common in modern architectures.
    • Processes initiate I/O operations and are blocked until completion.
    • Scheduler switches out blocked processes, resuming others.
    • OS installs interrupt handlers to manage resource access (e.g., disk, NICs).
    • Upon I/O completion, interrupt handlers wake up the initiating process.
  • User and Kernel Mode Switching:

    • Example: Disk or packet read completion.
    • Part of the operation occurs in kernel space (e.g., invoking interrupt handler).
    • Data processing usually occurs in user space.
    • Some user space instructions force transitions to kernel mode.
    • Context switches may occur during these transitions on some systems.

Steps and Operations Involved in a Context Switch

  1. Saving the State of the Current Process:

    • Save the state in a Process Control Block (PCB), which includes:
      • Registers
      • Stack Pointer (SP)
      • Program Counter (PC)
      • Memory maps
      • Various tables and lists related to the current thread or process.
  2. Cache and TLB Management:

    • Flush and/or invalidate the cache.
    • Flush the Translation Lookaside Buffer (TLB), which handles virtual to physical memory address translations.
  3. Restoring the State for the Next Process:

    • Restore the state by loading the registers and data from the PCB of the next thread or process to be run.

Why Context Switches Are Bad for HFT

Default CPU Task Scheduler Behavior:

  • Default algorithms aim for:
    • Fairness in CPU resource allocation.
    • Energy conservation and improved efficiency.
    • Maximizing CPU throughput, e.g., by scheduling either the shortest or the longest jobs first.

HFT Application Requirements:

  • Energy Efficiency:

    • Prefer not to conserve energy.
    • Support for overclocked servers, which are not energy-efficient.
    • Measures to prevent server overheating are secondary.
  • Scheduling and Priority Control:

    • Critical to prioritize HFT processes over low-priority tasks.
    • Avoid CPU starvation for HFT applications by ensuring they get maximum CPU time.
    • Prevent preemption of HFT threads/processes, regardless of their CPU consumption.

Strategies for Optimizing HFT Performance:

  • Kernel and OS Parameter Adjustments:

    • Modify kernel and OS settings to prioritize HFT requirements.
  • Core Pinning:

    • Pin critical HFT processes to specific, isolated, and dedicated CPU cores.
    • Ensure the scheduler never preempts HFT processes running on these cores.
  • Non-HFT Process Management:

    • Move non-HFT processes to a small subset of cores to isolate them from HFT operations.

Expensive Tasks in Context Switching

  1. Task Scheduling:

    • Overhead: Determining which process/thread to run next can be time-consuming and adds overhead to the context switch.
  2. Flushing the Translation Lookaside Buffer (TLB):

    • Expensive: The TLB must be flushed to clear stale virtual-to-physical address translations; repopulating it through TLB misses afterwards is costly.
  3. Cache Invalidation:

    • Expensive: Similar to TLB invalidation, cache invalidation involves:
      • Writing edited data from the cache to memory.
      • Fetching new code from memory to replace old code in the cache (cache miss).
      • Initial cache misses slow down the newly scheduled process as it resumes execution after the context switch.

Techniques to Avoid or Minimize Context Switches

  1. Pinning Threads to CPU Cores:

    • CPU Isolation: Implement CPU isolation by pinning critical or CPU-intensive threads (hot or spinning threads) to specific cores.
    • Benefits: Ensures minimal to no context switches for these threads, optimizing performance.
  2. Avoiding System Calls That Lead to Pre-emption:

    • Minimize Blocking System Calls:
      • System calls that perform blocking disk or network I/O put the calling thread to sleep, resulting in a context switch.
      • Reduce the use of blocking system calls to minimize context switches.
    • Use Kernel Bypass:
      • Bypass system calls for network I/O operations, which are common in HFT applications.
      • Kernel Bypass Overview: Avoids system-call overhead by performing I/O directly from user space (typically by polling the NIC from a dedicated core), eliminating the associated context switches.

Building Lock-Free Data Structures

Why Locks Are Needed (Non-HFT Applications)

  • Concurrent Access: Ensuring multiple threads/processes can access shared resources safely.
  • Synchronization Primitives: Using mechanisms like mutexes, semaphores, and critical sections to prevent data corruption in thread-unsafe code sections.
    • Mutexes: Ensure mutual exclusion, allowing only one thread to access a resource at a time.
    • Semaphores: Control access to a resource by multiple threads.
    • Critical Sections: Protect portions of code that access shared resources.

Problems and Inefficiencies with Using Locks

  • Blocking: When a thread attempts to acquire a lock already held by another thread, it blocks until the lock is released, leading to:
    • Increased latency.
    • Reduced throughput.
    • Potential deadlocks.