After several years of developing high-performance trading systems, I have arrived at some rules of thumb. By low-latency or high-frequency trading, I mean software that must make a buy or sell decision within 3 µs (microseconds).
To achieve this, I’ve learned that I have to forget everything about modern software engineering. You have to change your mindset entirely: latency is king, no matter how ugly your code is.
Below, I summarize the obstacles I’ve found in developing this kind of system.
Programming Language: There is no perfect language for this kind of work, but choose carefully. It is not enough to know how to use the language; you have to master it. Understand what each instruction does, how memory is managed each time you create or use an object, and so on. If you are using C# or Java, you must master the garbage collector, which can be a killer. My choice has always been C/C++.
Choose your types: Tell me what types you are using, and I will tell you how slow you can be. Avoid strings, dates, BigDecimal, autoboxing, and complex data structures with hidden costs (e.g., an ArrayList that grows, stacks, Maps that rehash).
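As a rough illustration of the point about types, a price can be carried as a scaled 64-bit integer instead of a string or a decimal object. The names below (Price, kScale, parse_price) are hypothetical, and the four-decimal scale is an assumption made for the example. This is a sketch, not a production parser.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed scale for this sketch: prices are stored in units of 1/10000.
constexpr std::int64_t kScale = 10000;
constexpr int kDecimals = 4;

// Hypothetical fixed-point price: one integer, trivially copyable, no heap use.
struct Price {
    std::int64_t ticks;  // price multiplied by kScale
};

// Parse a non-negative decimal such as "101.25" without allocating.
// Assumes well-formed input (digits with at most one '.').
// Fractional digits beyond kDecimals are truncated.
inline Price parse_price(const char* text, std::size_t len) {
    std::int64_t whole = 0;
    std::int64_t frac = 0;
    int frac_digits = 0;
    bool after_dot = false;
    for (std::size_t i = 0; i < len; ++i) {
        const char c = text[i];
        if (c == '.') { after_dot = true; continue; }
        assert(c >= '0' && c <= '9');  // debug-only input check
        if (!after_dot) {
            whole = whole * 10 + (c - '0');
        } else if (frac_digits < kDecimals) {
            frac = frac * 10 + (c - '0');
            ++frac_digits;
        }
    }
    // Pad the fraction so "101.25" becomes 2500 ten-thousandths.
    for (; frac_digits < kDecimals; ++frac_digits) frac *= 10;
    return Price{whole * kScale + frac};
}
```

With this layout, comparing or adding two prices is a single integer instruction, and nothing on the hot path touches the allocator.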
Avoid Exception Handling: Yes, avoid it! It’s expensive. Exception handling adds at least 10-20% to execution time. Compilers need to add extra code and take care of additional stack management to handle exceptions. That costs time. And before somebody tells me about GCC using the zero-cost model, I would say: please profile your system and measure it! Remember, each microsecond counts.
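One common way to keep exceptions off the hot path is to return a status code and branch on it. The sketch below assumes a made-up Order type and OrderStatus enum; none of these names come from a real library, and whether this is actually faster in a given system is something to measure, as noted above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical status codes returned instead of throwing.
enum class OrderStatus : std::uint8_t { Ok, BadQuantity, BadPrice };

// Hypothetical order record for this example.
struct Order {
    std::int64_t price_ticks;
    std::int32_t quantity;
};

struct Position {
    std::int64_t quantity = 0;
};

// noexcept states that no exception leaves this function;
// failures are reported through the return value.
inline OrderStatus validate(const Order& order) noexcept {
    if (order.quantity <= 0) return OrderStatus::BadQuantity;
    if (order.price_ticks <= 0) return OrderStatus::BadPrice;
    return OrderStatus::Ok;
}

// Precondition: the order was already validated by the caller.
// The assert is compiled out with -DNDEBUG, so release builds pay nothing.
inline void apply_fill(Position& position, const Order& order) noexcept {
    assert(validate(order) == OrderStatus::Ok);
    position.quantity += order.quantity;
}
```

The caller checks the returned OrderStatus with a plain if, so the error path is an ordinary branch rather than stack unwinding.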
Threads: Threads block and context-switch, the scheduler intervenes, and performance is difficult to reason about when there are many threads. Understand how they behave on your OS and how your hardware architecture works with threads. I know, it’s boring but essential. You don’t need to design fancy threading systems (e.g., ring buffers). In most cases, the simpler, the better. My best approach: pin each thread to a core and use busy spinning, so the core is always looking at the queue.
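The pinned, busy-spinning approach might look roughly like the sketch below. It assumes Linux for the pinning call (sched_setaffinity with pid 0 applies to the calling thread) and uses a hypothetical single-slot Mailbox shared by exactly one producer and one consumer. The 64-byte alignment is an assumed cache-line size.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#ifdef __linux__
#include <sched.h>  // sched_setaffinity, CPU_ZERO, CPU_SET
#endif

// Pin the calling thread to one core. Linux-only in this sketch;
// returns false if pinning failed or is not supported here.
inline bool pin_current_thread(int core) noexcept {
#ifdef __linux__
    assert(core >= 0 && core < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
#else
    (void)core;
    return false;
#endif
}

// Hypothetical single-slot mailbox: one producer, one consumer.
struct Mailbox {
    alignas(64) std::atomic<bool> full{false};
    std::int64_t payload = 0;
};

// Producer side: returns false if the previous message is still unread.
inline bool try_publish(Mailbox& box, std::int64_t value) noexcept {
    if (box.full.load(std::memory_order_acquire)) return false;
    box.payload = value;
    box.full.store(true, std::memory_order_release);  // publish the payload
    return true;
}

// Consumer side: busy-spin (no sleep, no blocking call) until a message arrives.
inline std::int64_t spin_receive(Mailbox& box) noexcept {
    while (!box.full.load(std::memory_order_acquire)) {
        // spin: the core keeps polling instead of yielding to the scheduler
    }
    const std::int64_t value = box.payload;
    box.full.store(false, std::memory_order_release);  // hand the slot back
    return value;
}
```

In use, the consumer thread would call pin_current_thread once at startup and then loop on spin_receive. The trade-off is that the pinned core runs at 100% all the time, which is the price of never waiting for a wake-up.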
Caches: Access latency ranges from roughly a nanosecond for the L1 cache up to about 10 ms for disk; main memory is about 100 ns. To be fast enough, you need to consider where your data is stored. Make sure that your algorithms and data structures take advantage of the L1 cache as much as they can.
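To make the cache point concrete, here is a minimal sketch of a layout that keeps hot data contiguous. It assumes a 64-byte cache line (typical on x86-64, but worth checking on your hardware) and a hypothetical book depth of 8 levels, so all prices on one side fit in a single line. BookSide and first_level_at_or_below are invented names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed depth: 8 levels x 8 bytes = 64 bytes, one assumed cache line.
constexpr std::size_t kLevels = 8;
static_assert(sizeof(std::int64_t) * kLevels == 64,
              "prices are meant to fill exactly one 64-byte line");

// Struct-of-arrays: prices and quantities live in separate contiguous
// arrays, so a scan over prices pulls in only price data.
struct alignas(64) BookSide {
    std::array<std::int64_t, kLevels> price_ticks{};
    std::array<std::int32_t, kLevels> quantity{};
    std::size_t count = 0;  // number of levels in use
};

// Linear scan over the price array. Returns the index of the first level
// whose price is at or below `limit`, or side.count if there is none.
inline std::size_t first_level_at_or_below(const BookSide& side,
                                           std::int64_t limit) noexcept {
    assert(side.count <= kLevels);
    for (std::size_t i = 0; i < side.count; ++i) {
        if (side.price_ticks[i] <= limit) return i;
    }
    return side.count;
}
```

For a handful of levels, a linear scan over one cache line is often competitive with a tree or hash lookup that chases pointers across memory, but as with everything here, measure it.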
Layers of abstraction: Forget encapsulation and making your code nice, clean, and reusable. When data is passed from one layer to another, it is usually copied, and every copy costs time. The scheduler also de-prioritizes our process to give other processes their “fair share”, meaning tons of CPU cycles are lost!
Warming up the data: Pre-allocate all your data structures before the main system starts. Also reuse objects, so you don’t have to allocate them later. Remember, allocation is expensive.
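Pre-allocation and reuse can be sketched with a fixed-capacity pool. Everything here (OrderPool, PooledOrder, the capacity parameter) is hypothetical and only illustrates the idea: all storage is reserved when the pool is constructed, and acquiring or releasing an object only moves an index on a free list.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical object that gets reused instead of reallocated.
struct PooledOrder {
    std::int64_t price_ticks = 0;
    std::int32_t quantity = 0;
};

// Fixed-capacity pool: no allocation after construction.
template <std::size_t Capacity>
class OrderPool {
public:
    OrderPool() noexcept : free_count_(Capacity) {
        for (std::size_t i = 0; i < Capacity; ++i) free_[i] = i;
    }

    // Returns nullptr when the pool is exhausted instead of allocating more.
    PooledOrder* acquire() noexcept {
        if (free_count_ == 0) return nullptr;
        return &slots_[free_[--free_count_]];
    }

    // Hands a slot back to the pool. The pointer must come from acquire().
    void release(PooledOrder* order) noexcept {
        const std::size_t index =
            static_cast<std::size_t>(order - slots_.data());
        assert(index < Capacity && free_count_ < Capacity);
        free_[free_count_++] = index;
    }

    std::size_t available() const noexcept { return free_count_; }

private:
    std::array<PooledOrder, Capacity> slots_{};   // storage reserved up front
    std::array<std::size_t, Capacity> free_{};    // stack of free slot indices
    std::size_t free_count_;
};
```

A pool like this would be constructed during startup, before the first market data arrives. Sizing it is a design decision: running out returns nullptr here, so the capacity has to cover the worst case you expect.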