最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

python - Algorithm to select multiple non-overlapping subsequences of given sequence - Stack Overflow

programmeradmin2浏览0评论

You have been given the input data (a sequence of items), for example:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

and your task is to randomly select M non-overlapping samples (subsequences) of the same size N.

For example, if the task was to select 3 samples of size 3, one of the solutions would be:

[3, 4, 5]
[8, 9, 10]
[11, 12, 13]

The samples are unordered so [8, 9, 10], [3, 4, 5], [11, 12, 13] is the same solution. All solutions are expected to have an equal probability of being selected.


My algorithm:

  1. Select randomly first sample: [11, 12, 13]
  2. Remaining input data are [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] and [14, 15, 16].
  3. Select randomly second sample from the remaining input data: [3, 4, 5].
  4. Remaining input data (big enough) are [6, 7, 8, 9, 10] and [14, 15, 16].
  5. Select randomly third sample from the remaining input data: [8, 9, 10].

Sadly, this algorithm does not work when the samples are too big. If the task was to select 3 samples of size 5, there exists a solution, but if you use my algorithm and select randomly the first sample as [3, 4, 5, 6, 7], the algorithm will fail.


Of course there is also an obvious brute-force algorithm: find all possible solutions and select randomly one of them. But I was hoping for something more "clever" (time and space efficient).

You have been given the input data (a sequence of items), for example:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

and your task is to randomly select M non-overlapping samples (subsequences) of the same size N.

For example, if the task was to select 3 samples of size 3, one of the solutions would be:

[3, 4, 5]
[8, 9, 10]
[11, 12, 13]

The samples are unordered so [8, 9, 10], [3, 4, 5], [11, 12, 13] is the same solution. All solutions are expected to have an equal probability of being selected.


My algorithm:

  1. Select randomly first sample: [11, 12, 13]
  2. Remaining input data are [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] and [14, 15, 16].
  3. Select randomly second sample from the remaining input data: [3, 4, 5].
  4. Remaining input data (big enough) are [6, 7, 8, 9, 10] and [14, 15, 16].
  5. Select randomly third sample from the remaining input data: [8, 9, 10].

Sadly, this algorithm does not work when the samples are too big. If the task was to select 3 samples of size 5, there exists a solution, but if you use my algorithm and select randomly the first sample as [3, 4, 5, 6, 7], the algorithm will fail.


Of course there is also an obvious brute-force algorithm: find all possible solutions and select randomly one of them. But I was hoping for something more "clever" (time and space efficient).

Share Improve this question edited Mar 15 at 12:58 Jeyekomon asked Mar 15 at 8:27 JeyekomonJeyekomon 3,4933 gold badges33 silver badges42 bronze badges 4
  • 1 Is it a requirement that each possible outcome has an equal probability? – trincot Commented Mar 15 at 9:10
  • @trincot Well, yes, I think. It sounds natural to me to expect that. I hope I'm not missing something but getting some outcomes more frequently than others does not seem desired to me. – Jeyekomon Commented Mar 15 at 9:40
  • OK, could you edit the question to settle this requirement? – trincot Commented Mar 15 at 9:48
  • @trincot Thank you for your note about the probability of outcomes. I've found some promising solution attempts, but now I see that surprisingly, they fail to yield outcomes with equal probability. – Jeyekomon Commented Mar 16 at 13:07
Add a comment  | 

2 Answers 2

Reset to default 5
import random

seq = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
M = 3
N = 3

dividers = sorted(random.sample(range(len(seq) - M*N + M), M))
starts = [d + i*(N-1) for i, d in enumerate(dividers)]
samples = [seq[start : start+N] for start in starts]
print(samples)

Example output (Attempt This Online!):

[[3, 4, 5], [10, 11, 12], [13, 14, 15]]

Idea is that if you want 3 subsequences of size 3 from overall 16 items, then 7 elements are not in the subsequences. They're instead in the 4 gaps before/after the subsequences. Each gap can be empty. So find 4 nonnegative integers with sum 7 and use them as gap sizes. How to do that was inspired by this answer.

Alternative explanation: If we had N=1, this would simply be sampling M single items. But in general, each of them is accompanied by the next N-1 items to make up a subsequence. So we basically do sample(range(len(seq)), M) for the N=1 case, but for the general case shrink the range by M*(N-1) to allow room for the M times N-1 accompaniers, which we then add.

I tried choosing samples 120000 times and counting how often each starts tuple occurred. Each of the 120 possibilities occurred about 1000 times as expected:

120
((0, 3, 6), 1030)
((0, 3, 7), 1007)
((0, 3, 8), 956)
...
((6, 9, 13), 1010)
((6, 10, 13), 1006)
((7, 10, 13), 986)

Code:

import random
from collections import Counter

seq = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
M = 3
N = 3

ctr = Counter()
for _ in range(120_000):
    dividers = sorted(random.sample(range(len(seq) - M*N + M), M))
    starts = [d + i*(N-1) for i, d in enumerate(dividers)]
    ctr[*starts] += 1

print(len(ctr))
for x in sorted(ctr.items()):
    print(x)

Attempt This Online!

The key to the solution is to know how many solutions are there.

For the sake of exposition, let's assume that you have a list of 10 elements (numbered 0 through 9), and you need to select 3 subsequences, each of length 2.

Your first subsequence (that is, the one that contains the lowest-numbered element) can start at element 0, or at element 1, or at element 2, or at element 3, or at element 4. (Not at element 5 or beyond, because then you cannot select two more subsequences).

If your first subsequence starts at 0, you have 15 ways to chose the other two subsequences. You can use the stars-and-bars method to compute this number. Specifically, you have to compute the number of ways you can choose 3 non-negative numbers that sum up to 4, where 3 is the number of gaps between your two subsequences, and 4 is the number of elements that don't end up being selected.

If your first subsequence starts at 1, you have 10 ways to chose the other two subsequences.

And so on, you have 6, 3 and 1 ways, respectively, for other starting numbers.

In total, there are 35 solutions. Select a random number uniformly distributed in [1..35], and select your first subsequence accordingly (if it's between 1 and 15, your first subsequence starts at 0; if between 16 and 25, it starts at 1, etc). Here, you have reduced the problem to a smaller one: you have less subsequences to choose from a shorter sequence. Repeat until you have no subsequences to choose.

发布评论

评论列表(0)

  1. 暂无评论