python - Algorithm to select multiple non-overlapping subsequences of given sequence

You have been given the input data (a sequence of items), for example:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

and your task is to randomly select M non-overlapping samples (subsequences) of the same size N.

For example, if the task was to select 3 samples of size 3, one of the solutions would be:

[3, 4, 5]
[8, 9, 10]
[11, 12, 13]

The samples are unordered so [8, 9, 10], [3, 4, 5], [11, 12, 13] is the same solution. All solutions are expected to have an equal probability of being selected.

My algorithm:

Select randomly first sample: [11, 12, 13]
Remaining input data are [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] and [14, 15, 16].
Select randomly second sample from the remaining input data: [3, 4, 5].
Remaining input data (big enough) are [6, 7, 8, 9, 10] and [14, 15, 16].
Select randomly third sample from the remaining input data: [8, 9, 10].

Sadly, this algorithm does not work when the samples are too big. If the task was to select 3 samples of size 5, there exists a solution, but if you use my algorithm and select randomly the first sample as [3, 4, 5, 6, 7], the algorithm will fail.

Of course there is also an obvious brute-force algorithm: find all possible solutions and select randomly one of them. But I was hoping for something more "clever" (time and space efficient).

You have been given the input data (a sequence of items), for example:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

and your task is to randomly select M non-overlapping samples (subsequences) of the same size N.

For example, if the task was to select 3 samples of size 3, one of the solutions would be:

[3, 4, 5]
[8, 9, 10]
[11, 12, 13]

The samples are unordered so [8, 9, 10], [3, 4, 5], [11, 12, 13] is the same solution. All solutions are expected to have an equal probability of being selected.

My algorithm:

Select randomly first sample: [11, 12, 13]
Remaining input data are [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] and [14, 15, 16].
Select randomly second sample from the remaining input data: [3, 4, 5].
Remaining input data (big enough) are [6, 7, 8, 9, 10] and [14, 15, 16].
Select randomly third sample from the remaining input data: [8, 9, 10].

Of course there is also an obvious brute-force algorithm: find all possible solutions and select randomly one of them. But I was hoping for something more "clever" (time and space efficient).

Share Improve this question edited Mar 15 at 12:58 asked Mar 15 at 8:27 Jeyekomon 3,4933 gold badges33 silver badges42 bronze badges

1 Is it a requirement that each possible outcome has an equal probability? – trincot Commented Mar 15 at 9:10
@trincot Well, yes, I think. It sounds natural to me to expect that. I hope I'm not missing something but getting some outcomes more frequently than others does not seem desired to me. – Jeyekomon Commented Mar 15 at 9:40
OK, could you edit the question to settle this requirement? – trincot Commented Mar 15 at 9:48
@trincot Thank you for your note about the probability of outcomes. I've found some promising solution attempts, but now I see that surprisingly, they fail to yield outcomes with equal probability. – Jeyekomon Commented Mar 16 at 13:07

Add a comment |

2 Answers 2

Sorted by: Reset to default 5

import random

seq = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
M = 3
N = 3

dividers = sorted(random.sample(range(len(seq) - M*N + M), M))
starts = [d + i*(N-1) for i, d in enumerate(dividers)]
samples = [seq[start : start+N] for start in starts]
print(samples)

Example output (Attempt This Online!):

[[3, 4, 5], [10, 11, 12], [13, 14, 15]]

Idea is that if you want 3 subsequences of size 3 from overall 16 items, then 7 elements are not in the subsequences. They're instead in the 4 gaps before/after the subsequences. Each gap can be empty. So find 4 nonnegative integers with sum 7 and use them as gap sizes. How to do that was inspired by this answer.

Alternative explanation: If we had N=1, this would simply be sampling M single items. But in general, each of them is accompanied by the next N-1 items to make up a subsequence. So we basically do sample(range(len(seq)), M) for the N=1 case, but for the general case shrink the range by M*(N-1) to allow room for the M times N-1 accompaniers, which we then add.

I tried choosing samples 120000 times and counting how often each starts tuple occurred. Each of the 120 possibilities occurred about 1000 times as expected:

120
((0, 3, 6), 1030)
((0, 3, 7), 1007)
((0, 3, 8), 956)
...
((6, 9, 13), 1010)
((6, 10, 13), 1006)
((7, 10, 13), 986)

Code:

import random
from collections import Counter

seq = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
M = 3
N = 3

ctr = Counter()
for _ in range(120_000):
    dividers = sorted(random.sample(range(len(seq) - M*N + M), M))
    starts = [d + i*(N-1) for i, d in enumerate(dividers)]
    ctr[*starts] += 1

print(len(ctr))
for x in sorted(ctr.items()):
    print(x)

Attempt This Online!

The key to the solution is to know how many solutions are there.

For the sake of exposition, let's assume that you have a list of 10 elements (numbered 0 through 9), and you need to select 3 subsequences, each of length 2.

Your first subsequence (that is, the one that contains the lowest-numbered element) can start at element 0, or at element 1, or at element 2, or at element 3, or at element 4. (Not at element 5 or beyond, because then you cannot select two more subsequences).

If your first subsequence starts at 0, you have 15 ways to chose the other two subsequences. You can use the stars-and-bars method to compute this number. Specifically, you have to compute the number of ways you can choose 3 non-negative numbers that sum up to 4, where 3 is the number of gaps between your two subsequences, and 4 is the number of elements that don't end up being selected.

If your first subsequence starts at 1, you have 10 ways to chose the other two subsequences.

And so on, you have 6, 3 and 1 ways, respectively, for other starting numbers.

In total, there are 35 solutions. Select a random number uniformly distributed in [1..35], and select your first subsequence accordingly (if it's between 1 and 15, your first subsequence starts at 0; if between 16 and 25, it starts at 1, etc). Here, you have reduced the problem to a smaller one: you have less subsequences to choose from a shorter sequence. Repeat until you have no subsequences to choose.

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

python - Algorithm to select multiple non-overlapping subsequences of given sequence - Stack Overflow

2 Answers 2

与本文相关的文章

评论列表(0)