python - Re-construct a string from a probability distribution of substrings

I was working on practice problems for a job-interview, and I came across this problem that stumped me.

The Problem

The task is to reconstruct a string of a specific length from a probability distribution of substrings of length L.

Given:

L: Length of substrings
n: Length of the output string
p: A probability distribution over all possible length-L substrings
A dictionary mapping each substring to its probability.
The total size is thus (size_of_alphabet)^L
Assume a small fixed-size alphabet (like 10)

I need to reconstruct a string, S of length n that matches p as closely as possible.

The Challenge

The main complexity comes from the fact that the frequencies in p can be non-integer/decimal values. For instance, if substring "abc" has a probability of 0.25, this doesn't cleanly translate to a whole number of occurrences in a discrete string.

Example Input/Output

Given

alphabet: [A, B, C] (3 characters)
L=2
n=3
p=[0.5, 0.25, 0.25, 0, 0, 0, 0, 0, 0] (size of p is 3^k = 9; for each possible substring)

There are two answers: 'AAB' or 'AAC'

Why?

The desired frequencies of length L substrings is for a length L string is:

p * (n-L+1) = p * 2 = [1, 0.5, 0.5, 0, ...]

Meaning we need 1 'AA', 0.5 of 'AB' and 0.5 'AC'.

We can't have fractions of a substring, so we need to decide. Since they are equally likely, we can pick either of them and they will create a string that is equally as optimal:

[AA, AB] => AAB
[AA, AC] => AAC

What I've Tried

De Bruijn Graph Approach:

Built a de Bruijn graph where nodes are L-1-substrings (prefixes and suffixes), and edges represent the substrings themselves.

Attempted to find a path that approximates the target distribution.

This works if p*(n-L+1) gives a vector with round integer frequencies, but I'm not sure how to modify the algorithm for decimal frequencies.

Integer Approximation:

Converted probabilities to integer counts by rounding, tried to find a path that uses each L-substring approximately the right number of times.

Calculated error as the sum of squared differences between target and actual frequencies.

Question

Is there an algorithm to achieve as close to a perfect of a string as possible for this problem.

Specifically:

How can I better handle the non-integer frequencies?
Are there established algorithms for reconstructing strings from probabilistic L- substring distributions?

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

python - Re-construct a string from a probability distribution of substrings - Stack Overflow

The Problem

The Challenge

Example Input/Output

What I've Tried

Question

与本文相关的文章

评论列表(0)