Highly Parallel Small Processes In Erlang

Recently, my employer needed a network proxy which could handle tens of thousands of concurrent UDP packets, each handled by a very short-lived process. We had implementations in Java, C, and Go, but none of them could keep up with our requirements. So we tried writing a new version in Erlang/OTP, and there, after some trial and error, we finally had some success.

For those of you who are not yet familiar with Erlang, you can learn the basics here. Suffice it to say that Erlang is a functional language which focuses on lightweight (non-OS) processes and data immutability. Instead of using locks and mutexes, it uses actors and messages. These attributes make it quite simple to write applications which can take full advantage of modern multi-core/multi-processor systems.

As a quick introduction to our goal: we are proxying a UDP-based protocol and modifying its responses in real time, based on the original message and the reply from the proxied server. This means that a packet is received by the Erlang application, several lightweight processes are spawned in parallel, and when the response from the proxied server is received, the application applies policies and sends a (possibly modified) response. This all happens in roughly 15 microseconds (down from our original 2-3 millisecond estimate).
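The flow above can be sketched roughly as follows. This is not the production code; the module and function names are mine, and the actual forwarding and policy logic is elided:

```erlang
%% Minimal sketch: a gen_udp listener that spawns one lightweight
%% process per incoming packet, so the listener loop can go straight
%% back to receiving.
-module(udp_listener).
-export([start/1]).

start(Port) ->
    {ok, Socket} = gen_udp:open(Port, [binary, {active, true}]),
    loop(Socket).

loop(Socket) ->
    receive
        {udp, Socket, Host, InPort, Packet} ->
            %% Hand each packet off to its own short-lived process.
            spawn(fun() -> handle_packet(Socket, Host, InPort, Packet) end),
            loop(Socket)
    end.

handle_packet(Socket, Host, InPort, Packet) ->
    %% In the real application this forwards to the proxied server,
    %% applies policies to the reply, and answers the client. Here we
    %% simply echo the packet back as a placeholder.
    gen_udp:send(Socket, Host, InPort, Packet).
```

Because the spawned handlers are Erlang processes rather than OS threads, tens of thousands of them can be in flight at once.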

As anyone who works with socket programming knows, creating a socket is a relatively expensive operation requiring a low-level system call. For performance reasons, a programmer will typically create a socket pool and re-use sockets whenever possible. We did the same here, originally creating a gen_server module which managed the pool of sockets for us. The problem with using a gen_server for checking pooled workers in and out is that a gen_server handles its messages serially and cannot work on the pool in parallel. This, in turn, caused our application to become somewhat serialized and build up a backlog of requests to the socket pool.
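A sketch of that first attempt illustrates the bottleneck (hypothetical names, not our production module). Every checkout is a `gen_server:call` into a single process, so all pool operations queue up in one mailbox no matter how many packet handlers run in parallel:

```erlang
%% Hypothetical sketch of a gen_server-owned socket pool. The pool
%% state is a simple list of pre-opened UDP sockets; checkout/checkin
%% are both funneled through this one process.
-module(socket_pool).
-behaviour(gen_server).
-export([start_link/1, checkout/0, checkin/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link(Size) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, Size, []).

checkout() ->
    %% Serialization point: every caller waits its turn in the
    %% pool server's mailbox.
    gen_server:call(?MODULE, checkout).

checkin(Socket) ->
    gen_server:cast(?MODULE, {checkin, Socket}).

init(Size) ->
    Sockets = [begin {ok, S} = gen_udp:open(0, [binary, {active, false}]), S end
               || _ <- lists:seq(1, Size)],
    {ok, Sockets}.

handle_call(checkout, _From, [S | Rest]) ->
    {reply, {ok, S}, Rest};
handle_call(checkout, _From, []) ->
    {reply, {error, empty}, []}.

handle_cast({checkin, S}, Sockets) ->
    {noreply, [S | Sockets]}.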

On our second attempt, we moved to dispcount. dispcount uses a more parallel methodology which never blocks, but can sometimes lose a message. In production, the number of lost packets for our high-traffic UDP service was just too high. That said, dispcount is very fast for protocols which do not need 100% reliability or which have their own retry facilities (e.g., HTTP).

What we finally did was integrate our pooled sockets into the gen_server which handled the incoming UDP packets. I created a queue which was stored in the state of the UDP gen_server; whenever a new packet was received, a pooled socket was removed from the front of the queue and added to the back. This meant that there was never more than one queue operation needed for any given packet, and no additional messaging when the pooled socket was done being used. The only drawback is that unless you create a sufficiently large queue, you could end up handing out the same pooled socket to more than one user at a time. Creating a large enough pool for such a low-latency protocol was quite simple with Erlang's lightweight processes, though. We started with a pool size of 5,000 and have been able to increase it to as much as 15,000, or reduce it to as low as 1,000, without incident; memory use was the only thing that changed.
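The final design might look something like this (again a sketch under my own names, with the forwarding and policy logic elided). The gen_server that owns the listening socket also holds the pool as a `queue` in its state, so handing out a socket is a single rotate of that queue with no check-in message ever needed:

```erlang
%% Sketch: the UDP gen_server keeps a queue of pre-opened sockets in
%% its own state and rotates it once per packet.
-module(udp_proxy).
-behaviour(gen_server).
-export([start_link/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link(Port, PoolSize) ->
    gen_server:start_link(?MODULE, {Port, PoolSize}, []).

init({Port, PoolSize}) ->
    {ok, Listen} = gen_udp:open(Port, [binary, {active, true}]),
    Pool = queue:from_list(
             [begin {ok, S} = gen_udp:open(0, [binary, {active, false}]), S end
              || _ <- lists:seq(1, PoolSize)]),
    {ok, #{listen => Listen, pool => Pool}}.

handle_info({udp, Listen, Host, InPort, Packet},
            State = #{listen := Listen, pool := Pool0}) ->
    %% One queue operation per packet: take a socket from the front
    %% and immediately place it on the back. If the pool is too small,
    %% the same socket can be in use by two handlers at once.
    {{value, Sock}, Pool1} = queue:out(Pool0),
    Pool2 = queue:in(Sock, Pool1),
    spawn(fun() -> proxy_packet(Sock, Host, InPort, Packet) end),
    {noreply, State#{pool := Pool2}};
handle_info(_Other, State) ->
    {noreply, State}.

handle_call(_Req, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.

proxy_packet(_Sock, _Host, _InPort, _Packet) ->
    ok.  %% forwarding and policy application elided
```

The core trick is that `queue:out/1` plus `queue:in/2` replaces both the checkout call and the checkin message of a conventional pool.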

I hope that this article is useful and informative, and I hope to write more articles like this in the near future.

Comments

Unknown said…
Hi Deven
Very interesting. Can you share the code somewhere for reference?
Deven Phillips said…
Unfortunately, I cannot. The work was done for my employer and I cannot share it.
