Readit News
JedMartin commented on Wall Street’s ‘Private Rooms’   bloomberg.com/news/featur... · Posted by u/SirLJ
modderation · 6 months ago
I think it's an interesting thought experiment. What would happen if the stock market were quantized to a blind one-trade-per-minute granularity?

I suspect this would put everyone on more even footing, with less focus on beating causality and light lag and more on using the acquired information to make longer-term decisions. This would open things up to anyone with a computer and disposable income, though it would disappoint anyone in the high-frequency trading field.

JedMartin · 6 months ago
You would change the rules, but I think the result would largely remain the same. As a market participant with the fastest access to data from other markets, news, and similar sources, as well as low order entry latency, you would still be able to profit from information asymmetry.

Imagine that a company announces the approval of its new vaccine a few milliseconds before the periodic trade occurs. As an HFT firm, you have the technology to enter, cancel, or modify your orders before the periodic auction takes place, while less sophisticated players remain oblivious to what just happened. The same applies to price movements on venues trading the same instrument, its derivatives, or even correlated assets in different parts of the world.

On the other hand, you risk increasing price volatility (especially in cases where there is an imbalance between buyers and sellers during the periodic auction) and making markets less liquid.

JedMartin commented on Wall Street’s ‘Private Rooms’   bloomberg.com/news/featur... · Posted by u/SirLJ
superzamp · 6 months ago
Attempts at this already exist; the IEX [1] exchange is one example, albeit on a less ambitious scale than your idea:

> It's a simple technology: 38 miles of coiled cable that incoming orders and messages must traverse before arriving at the exchange’s matching engine. This physical distance results in a 350-microsecond delay, giving the exchange time to take in market data from other venues—which is not delayed—and update prices before executing trades

JedMartin · 6 months ago
IntelligentCross Midpoint (a dark pool) is a better example, since it actually does periodic matching every couple of milliseconds [1]. IEX just introduces additional latency for everyone.

[1] https://www.imperativex.com/products

JedMartin commented on Wall Street’s ‘Private Rooms’   bloomberg.com/news/featur... · Posted by u/SirLJ
hayst4ck · 6 months ago
> won't impact prices

I strongly suspect that Wall Street has looked at 401(k)s/index funds as a giant money-filled piñata. It is a huge pile of money following a well-understood algorithm, which makes it vulnerable to attack.

I suspect that this is the absolute core of the "dark pool" strategy. Any trade that happens behind closed doors and "doesn't impact prices" means that an index fund is buying or selling at a price other than the "real" price, making dark pools functionally a wealth transfer from grandma to an institutional trader.

JedMartin · 6 months ago
It's actually the other way around. As a big fund looking to trade a large number of shares in the public market, you'll quickly realize that the market tends to move away from you, and statistically, you're more likely to get a bad deal than a good one. Even if you try to be smart about execution by splitting your orders into chunks, randomizing order sizes, and similar tactics, there is still a huge information asymmetry between you and more sophisticated players. In many cases, they can classify your orders based on different characteristics of your order flow (such as latency profile), distinguishing them from so-called toxic flow from other HFT firms.

The purpose of these private rooms is to separate your orders from those players so that you trade against other uninformed parties, making your chances of getting a good or bad deal closer to 50/50.

JedMartin commented on C++ patterns for low-latency applications including high-frequency trading   arxiv.org/abs/2309.04259... · Posted by u/chris_overseas
sneilan1 · a year ago
>> In the get method, you're returning a pointer to the element within the queue after bumping the consumer position (which frees the slot for the producer), so it can get overwritten while the user is accessing it. And then your producer and consumer positions will most likely end up in the same cache line, leading to false sharing.

I did not realize this. Thank you so much for pointing this out. I'm going to take a look.

>> use std::atomic for your producer

Yes, it is hard to get these data structures right. I used Martin Fowler's description of the LMAX architecture, which did not mention atomics. https://martinfowler.com/articles/lmax.html I'll check out the paper.

JedMartin · a year ago
I have absolutely no idea how this works in Java, but in C++, there are a few reasons you need std::atomic here:

1. You need to make sure that modifying the producer/consumer position is actually atomic. This may end up being the same instruction that the compiler would use for modifying a non-atomic variable, but that will depend on your target architecture and the size of the data type. Without std::atomic, it may also generate multiple instructions to implement that load/store or use an instruction which is non-atomic at the CPU level. See [1] for more information.

2. You're using the positions for synchronization between the producer and consumer. When incrementing the reader position, you're essentially freeing a slot for the producer, so you need to make sure all reads from that slot happen before you do it. When incrementing the producer position, you're indicating that the slot is ready to be consumed, so you need to make sure that all stores to that slot happen before that. Things may go wrong here due to reordering by the compiler or by the CPU [2], so you need to instruct both that a certain memory ordering is required. Reordering by the compiler can be prevented with a compiler-level memory barrier - asm volatile("" ::: "memory"). Depending on your CPU architecture, you may or may not also need a memory barrier instruction to prevent reordering by the CPU at runtime.

The good news is that std::atomic does all of that for you if you pick the right memory ordering, and by default it uses the strongest one (sequentially consistent ordering). In this particular case, I think you could relax the constraints a bit and use memory_order_acquire on the consumer side and memory_order_release on the producer side [3].
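To make point 2 concrete, here's the classic message-passing pattern as a minimal sketch of my own (not code from the paper or the repo): the release store publishes the payload, and any thread that observes the flag via an acquire load is guaranteed to also see the payload write.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // plain, non-atomic store
    ready.store(true, std::memory_order_release);  // publish: earlier writes can't sink below this
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) // spin until published
        ;
    return payload;                                // guaranteed to observe 42
}
```

On x86 both of these compile to plain loads and stores (the hardware is already strongly ordered), but they still prevent the compiler from reordering; on weaker architectures like ARM they additionally emit the required barrier instructions.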

[1] https://preshing.com/20130618/atomic-vs-non-atomic-operation...

[2] https://en.wikipedia.org/wiki/Memory_ordering

[3] https://en.cppreference.com/w/cpp/atomic/memory_order

JedMartin commented on C++ patterns for low-latency applications including high-frequency trading   arxiv.org/abs/2309.04259... · Posted by u/chris_overseas
sneilan1 · a year ago
I've got an implementation of a stock exchange that uses the LMAX disruptor pattern in C++ https://github.com/sneilan/stock-exchange

And a basic implementation of the LMAX disruptor as a couple C++ files https://github.com/sneilan/lmax-disruptor-tutorial

I've been looking to rebuild this in Rust, however. I reached the point where I implemented my own WebSocket protocol, authentication system, SSL, etc. Then I realized that memory management and dependencies are a lot easier in Rust, especially for a one-man software project.

JedMartin · a year ago
It's not easy to get data structures like this right in C++. There are a couple of problems with your implementation of the queue. Memory accesses can be reordered by both the compiler and the CPU, so you should use std::atomic for your producer and consumer positions to get the barriers described in the original LMAX Disruptor paper. In the get method, you're returning a pointer to the element within the queue after bumping the consumer position (which frees the slot for the producer), so it can get overwritten while the user is accessing it. And then your producer and consumer positions will most likely end up in the same cache line, leading to false sharing.

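A single-producer/single-consumer sketch incorporating all three fixes (acquire/release atomics, copying the element out before bumping the consumer position, and cache-line-aligned indices) might look like this; it's my own illustration, not code from the linked repo:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t N>
class SpscQueue {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
public:
    bool push(const T& item) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) return false;               // full
        buf_[head & (N - 1)] = item;                      // write payload first...
        head_.store(head + 1, std::memory_order_release); // ...then publish the slot
        return true;
    }

    std::optional<T> pop() {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return std::nullopt;            // empty
        T item = buf_[tail & (N - 1)];                    // copy out BEFORE freeing the slot
        tail_.store(tail + 1, std::memory_order_release); // only now may the producer reuse it
        return item;
    }

private:
    std::array<T, N> buf_{};
    // Separate cache lines for the two indices to avoid false sharing.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};
```

Returning a copy from pop rather than a pointer trades a small cost per element for safety: the slot is only marked free after the element has been read.
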
JedMartin commented on Low Latency Optimization: Understanding Pages (Part 1)   hudsonrivertrading.com/hr... · Posted by u/Jumptadel
pca006132 · 3 years ago
Just wondering, how useful is it to get code and stack memory into hugepages? I thought you usually access them sequentially, so putting them in hugepages wouldn't matter that much.

JedMartin · 3 years ago
Most of the benefit comes from the fact that you end up with far fewer TLB misses, since a single mapping covers a large chunk of memory. A predictable memory access pattern helps with cache misses thanks to hardware prefetching, but as far as I know, on most CPUs the hardware prefetcher won't cross a page boundary if doing so would cause a TLB miss.

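For heap memory, the simplest way to see this effect is to hint the kernel toward transparent huge pages. A sketch, assuming Linux with THP enabled (MADV_HUGEPAGE is only a hint, not a guarantee): one 2 MiB page needs a single TLB entry where 4 KiB pages would need 512.

```cpp
#include <cstddef>
#include <cstring>
#include <sys/mman.h>

void* map_huge_region(std::size_t len) {
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return nullptr;
    madvise(p, len, MADV_HUGEPAGE); // best effort; silently ignored if THP is off
    std::memset(p, 0, len);         // touch the pages so they actually get faulted in
    return p;
}
```

Code and stack memory are harder, because those mappings are set up by the kernel and the dynamic loader before your code runs - hence the remapping tricks discussed below in this thread.
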
JedMartin commented on Low Latency Optimization: Understanding Pages (Part 1)   hudsonrivertrading.com/hr... · Posted by u/Jumptadel
menaerus · 3 years ago
It can be done by manually remapping the relevant sections upon application startup.

Perhaps [1] is a good resource to start with (page nr. 7). Example code is here [2]. And [3] makes some experiments with it.

[1] https://www.kernel.org/doc/ols/2006/ols2006v2-pages-83-90.pd...

[2] https://github.com/intel/iodlr/blob/master/large_page-c/exam...

[3] https://easyperf.net/blog/2022/09/01/Utilizing-Huge-Pages-Fo...

JedMartin · 3 years ago
Thanks, cool stuff - especially liblppreload.so described in [2] and [3]. I'll give it a try. Do you have any tips on how to achieve the same for the stack?
JedMartin commented on Low Latency Optimization: Understanding Pages (Part 1)   hudsonrivertrading.com/hr... · Posted by u/Jumptadel
anonymoushn · 3 years ago
Your comment is correct but might cause readers to underestimate how annoying this tuning work is and how difficult it is to get everything into hugepages (executable memory and stack memory and shared libraries if applicable, not just specific heap allocations). We are trading a joke asset class on joke venues that have millisecond-scale jitter, so we can get away with using io_uring instead of kernel bypass networking.
JedMartin · 3 years ago
The part about getting everything into hugepages sounds interesting. Any idea where I can find some resources on that? Most of what I was able to find only tells you how to do it for heap allocations.
