I’ve been trying to understand better how the load-balancing in Caddy works, when I stumbled upon the code for the RandomChoiceSelection.
Luckily a comment later in the the file links to Reservoir sampling - Wikipedia, which this algorithm seems to be.
I think I understand from Wikipedia that the main benefits of Reservoir sampling are:
no need to know the size of the pool upfront (might apply as not all upstreams are available???)
pool can be bigger than memory (not applicable for UpstreamPool?)
While, the downside at least for the Algorithm R used here is:
scales with linearly with the pool_size
need to draw a random number for every element in the pool
So my question is: Why was this algorithm chosen for the RandomChoiceSelection? As far as I understand only the first benefit somewhat applies, but I have the feeling that I’m missing something more relevant .
Bingo. We don’t know how many upstreams are available, so we effectively don’t know the size of the pool (since we ignore unavailable hosts in the pool). So we treat iterating the array like reading a stream, in that both the array and hypothetical stream are effectively of an unknown size. Reservoir sampling guarantees an equal opportunity for all available hosts.
Nice feature, but yeah, not necessary here.
Yeah, but pool sizes don’t grow by orders of magnitude, usually, so that’s lucky.
We didn’t used to use this algorithm; I think before it used to be something like choose a random number r between 0 and n, where n is the size of the pool. This is quick and constant-time, but if the host at r is unavailable, we’d just iterate until we found the next available one. This gives available upstreams 2/n chance of being selected (assuming the next host is always available; the more hosts are unavailable, the higher the numerator becomes) instead of 1/n, which isn’t evenly distributed.
So if you want an even distribution with an unknown number of available hosts, reservoir sampling is the way to go IMO.