为了从C ++中的泊松分布中提取随机数,通常建议使用
In order to draw random number from a Poisson distribution in C++, it is generally advised to use
RNG_type rng; std::poisson_distribution<size_t> d(1e-6); auto r = d(rng);每次调用std::poisson_distribution对象时,都会消耗整个随机位序列(例如32位,其中 std :: mt19937 ,64位用于 std :: mt19937_64 ).令我惊讶的是,在这么低的平均值(mean = 1e-6)下,绝大多数情况下,只有几位足以确定要返回的值是0.然后可以将其他位缓存起来以备后用.
At each call of the std::poisson_distribution object, an entire sequence of random bits is consumed (e.g. 32 bits with std::mt19937, 64 bits for std::mt19937_64). It strikes me that with such low mean (mean = 1e-6), the vast majority of times, only a few bits are enough to determine that the value to return is 0. The other bits could then be cached for later use.
假设将设置为true的位序列与泊松分布的高返回值相关联,则使用均值1e-6时,任何不以19 true开头的序列都必须返回零!确实,
Assuming that a sequence of bits set to true is associated to a high returned value from the Poisson distribution, when using a mean of 1e-6, any sequence not starting with 19 trues necessarily returns a zero! Indeed,
1 - 1/2^19 < P(0, 1e-6) < 1 - 1/2^20,其中P(n, r)表示从均值r的泊松分布得出n的概率.一种不浪费比特的算法将使用一半的时间,四分之一的时间,三分之八的时间....
, where P(n, r) denotes the probability of drawing n from a Poisson distribution with mean r. An algorithm that does not waste bits would use one bit half of the time, two bits a quarter of the times, three bits an eighth of the times, ....
是否存在一种可以通过绘制泊松数时消耗尽可能少的位来提高性能的算法?当我们认为平均值较低时,与std::poisson_distribution相比,还有其他方法可以改善性能吗?
Is there an algorithm out there that can improve performance by consuming as few bits as possible when drawing Poisson numbers? Is there another way to improve performance compared to std::poisson_distribution when we consider a low mean?
回应@ Jarod42的评论谁说
In response to @Jarod42's comment who said
想知道如果使用更少的位不会破坏等概率...
Wonder if using fewer bits don't break equiprobability...
我认为这不会破坏等概率.在模糊测试中,我考虑了一个具有简单bernoulli分布的相同问题.我以概率1/2^4采样为true,以概率1 - 1/2^4采样为false.函数drawWithoutWastingBits一旦在缓存中看到一个真值,就会停止,并且函数drawWastingBits会消耗4位,无论这些位是什么.
I don't think it would break equiprobability. In a vague attempt to test it, I consider the same question with a simple bernoulli distribution. I am sampling true with a probability 1/2^4 and sampling false with a probability 1 - 1/2^4. The function drawWithoutWastingBits stops as soon as it sees a true in the cache and the function drawWastingBits consumes 4 bits whatever these bits are.
#include <iostream> #include <vector> #include <string> #include <algorithm> #include <random> bool drawWithoutWastingBits(std::vector<bool>& cache, size_t& cache_index) { /* Get a true with probability 1/2^4 (=1/16=0.0625) and a false otherwise */ size_t nbTrues = 0; while (cache[cache_index]) { ++nbTrues; ++cache_index; if (nbTrues == 4) { return true; } } ++cache_index; return false; } bool drawWastingBits(std::vector<bool>& cache, size_t& cache_index) { /* Get a true with probability 1/2^4 (=1/16=0.0625) and a false otherwise */ bool isAnyTrue = false; for (size_t i = 0 ; i < 4; ++i) { if (cache[cache_index]) { isAnyTrue = true; } ++cache_index; } return !isAnyTrue; } int main() { /* Just cache a lot of bits in advance in `cache`. The same sequence of bits will be used by both function. I am just caching way enough bits to make sure they don't run out of bits below I made sure to have the same number of zeros and ones so that any deviation is caused by the methodology and not by the RNG */ // Produce cache std::vector<bool> cache; size_t nbBitsToCache = 1e7; cache.reserve(nbBitsToCache); for (size_t i = 0 ; i < nbBitsToCache/2 ; ++i) { cache.push_back(false); cache.push_back(true); } // Shuffle cache { std::mt19937 mt(std::random_device{}()); std::shuffle(cache.begin(), cache.end(), mt); } // Draw without wasting bits { size_t nbDraws = 1e6; size_t cache_index = 0; std::pair<size_t, size_t> outcomes = {0,0}; for (size_t r = 0 ; r < nbDraws ; ++r) { drawWithoutWastingBits(cache, cache_index) ? ++outcomes.first : ++outcomes.second; assert(cache_index <= cache.size()); } assert(outcomes.first + outcomes.second == nbDraws); std::cout << "Draw Without Wasting Bits: prob true = " << (double)outcomes.first / nbDraws << "\n"; } // Draw wasting bits { size_t nbDraws = 1e6; size_t cache_index = 0; std::pair<size_t, size_t> outcomes = {0,0}; for (size_t r = 0 ; r < nbDraws ; ++r) { drawWastingBits(cache, cache_index) ? ++outcomes.first : ++outcomes.second; assert(cache_index <= cache.size()); } assert(outcomes.first + outcomes.second == nbDraws); std::cout << "Draw Wit Wasting Bits: prob true = " << (double)outcomes.first / nbDraws << "\n"; } }可能的输出
Draw Without Wasting Bits: prob true = 0.062832 Draw Wit Wasting Bits: prob true = 0.062363推荐答案
Devroye的 非统一随机变量生成 (第505页和第86页)提到了按顺序搜索算法进行的反演.
Devroye's Non-Uniform Random Variate Generation, pp. 505 and 86, mentions an inversion by sequential search algorithm.
基于该算法,如果您知道mean远小于1,则如果您在[0,1]中生成统一的随机数u,则如果u <= exp(-mean),则Poisson变量将为0,否则大于0.
Based on that algorithm, if you know the mean is considerably less than 1, then if you generate a uniform random number u in [0, 1], the Poisson variable will be 0 if u <= exp(-mean), and greater than 0 otherwise.
如果均值较低并且可以容忍近似分布,则可以使用以下方法(请参阅"离散高斯差分隐私算法"):
If the mean is low and you can tolerate an approximate distribution, then you can use the following approach (see Appendix A of "The Discrete Gaussian for Differential Privacy"):
例如,假设平均值为1e-6(1/1000000),先生成伯努利(1/1000000)数,然后生成伯努利(1/2000000),依此类推,直到您以这种方式生成0.如果生成的偶数个偶数,则Poisson变量正好为0.否则,Poisson变量为1或更大,并且需要更慢"的算法.
For example, say the mean is 1e-6 (1/1000000), Generate a Bernoulli(1/1000000) number, then Bernoulli(1/2000000), etc. until you generate 0 this way. If an even number of ones were generated, then the Poisson variable is exactly 0. Otherwise, the Poisson variable is 1 or greater and a "slower" algorithm is necessary.
下面的算法是一个示例,该算法基于第505页和第86页的算法,但仅对Poisson变量1或更大的样本进行采样:
One example is the algorithm below, which is based on the one from pages 505 and 86, but only samples Poisson variables 1 or greater:
METHOD Poisson1OrGreater(mean) sum=Math.exp(-mean) prod=sum u=RNDRANGE(sum, 1) i=0 while i==0 or u>sum prod*=mean/(i+1) sum+=prod i=i+1 end return i END METHOD但是,此方法不是很健壮,特别是因为它使用接近1(浮点空间更稀疏)的数字而不是接近0的数字.
This method, though, is not very robust, especially since it uses numbers close to 1 (where the floating-point space is more sparse) rather than numbers close to 0.
编辑(5月7日):
请注意,n个独立的Poisson(mean)随机数的总和是分布的Poisson(mean*n)(第501页).因此,此答案中的上述讨论适用于n泊松随机数之和,只要它们的均值的n倍较小即可.例如,要生成平均值为1e-6的1000个泊松随机数之和,只需生成一个平均值为0.001的单个泊松随机数.这样可以节省调用随机数生成器的费用.
Note that the sum of n independent Poisson(mean) random numbers is Poisson(mean*n) distributed (p. 501). Thus, the discussion above in this answer applies to a sum of n Poisson random numbers as long as n times their mean remains small. For example, to generate a sum of 1000 Poisson random numbers with a mean of 1e-6, simply generate a single Poisson random number with a mean of 0.001. This will save considerably on calls to the random number generator.
编辑(5月13日):进行了常规编辑.
EDIT (May 13): Edited generally.