Why is reservoir sampling used in the RandomChoiceSelection policy?

idotobi · June 5, 2022, 12:57pm

Hi all,

The question comes purely from curiosity.

I’ve been trying to understand better how the load-balancing in Caddy works, when I stumbled upon the code for the RandomChoiceSelection.
Luckily a comment later in the the file links to Reservoir sampling - Wikipedia, which this algorithm seems to be.

I think I understand from Wikipedia that the main benefits of Reservoir sampling are:

no need to know the size of the pool upfront (might apply as not all upstreams are available???)
pool can be bigger than memory (not applicable for UpstreamPool?)

While, the downside at least for the Algorithm R used here is:

scales with linearly with the pool_size
need to draw a random number for every element in the pool

So my question is: Why was this algorithm chosen for the RandomChoiceSelection? As far as I understand only the first benefit somewhat applies, but I have the feeling that I’m missing something more relevant .

al-caddyserver · June 5, 2022, 6:30pm

As you can see in the comment of this line is this the “power of two” algorithm.

github.com

caddyserver/caddy/blob/40b193fb791e241dec630a37167aac576375bc96/modules/caddyhttp/reverseproxy/selectionpolicies.go#L79-L80

      
        
            	// The size of the sub-pool created from the larger upstream pool. The default value
            	// is 2 and the maximum at selection time is the size of the upstream pool.

As you also can see in the documentation can be selected between some different load-balancing algorithms.

Every algorithm have there use cases I assume that this is the reason why several algorithms are offered from the developers.

A quite nice test with the “power of two” Algorithm is done from the HAProxy Developer and nice to read.

matt · June 5, 2022, 10:24pm

Seconding what Aleks said, and adding to it:

Bingo. We don’t know how many upstreams are available, so we effectively don’t know the size of the pool (since we ignore unavailable hosts in the pool). So we treat iterating the array like reading a stream, in that both the array and hypothetical stream are effectively of an unknown size. Reservoir sampling guarantees an equal opportunity for all available hosts.

Nice feature, but yeah, not necessary here.

Yeah, but pool sizes don’t grow by orders of magnitude, usually, so that’s lucky.

We didn’t used to use this algorithm; I think before it used to be something like choose a random number r between 0 and n, where n is the size of the pool. This is quick and constant-time, but if the host at r is unavailable, we’d just iterate until we found the next available one. This gives available upstreams 2/n chance of being selected (assuming the next host is always available; the more hosts are unavailable, the higher the numerator becomes) instead of 1/n, which isn’t evenly distributed.

So if you want an even distribution with an unknown number of available hosts, reservoir sampling is the way to go IMO.

system · August 4, 2022, 12:57pm

This topic was automatically closed after 60 days. New replies are no longer allowed.