I was working on practice problems for a job-interview, and I came across this problem that stumped me.
The Problem
The task is to reconstruct a string of a specific length from a probability distribution of substrings of length L
.
Given:
L
: Length of substringsn
: Length of the output stringp
: A probability distribution over all possible length-L
substringsA dictionary mapping each substring to its probability.
The total size is thus (size_of_alphabet)^
L
Assume a small fixed-size alphabet (like 10)
I need to reconstruct a string, S
of length n
that matches p
as closely as possible.
The Challenge
The main complexity comes from the fact that the frequencies in p
can be non-integer/decimal values. For instance, if substring "abc" has a probability of 0.25, this doesn't cleanly translate to a whole number of occurrences in a discrete string.
Example Input/Output
Given
- alphabet: [A, B, C] (3 characters)
- L=2
- n=3
- p=[0.5, 0.25, 0.25, 0, 0, 0, 0, 0, 0] (size of
p
is 3^k = 9; for each possible substring)
There are two answers: 'AAB' or 'AAC'
Why?
The desired frequencies of length L
substrings is for a length L
string is:
p * (n-L+1) = p * 2 = [1, 0.5, 0.5, 0, ...]
Meaning we need 1 'AA', 0.5 of 'AB' and 0.5 'AC'.
We can't have fractions of a substring, so we need to decide. Since they are equally likely, we can pick either of them and they will create a string that is equally as optimal:
- [AA, AB] => AAB
- [AA, AC] => AAC
What I've Tried
De Bruijn Graph Approach:
Built a de Bruijn graph where nodes are L-1
-substrings (prefixes and suffixes), and edges represent the substrings themselves.
Attempted to find a path that approximates the target distribution.
This works if p*(n-L+1)
gives a vector with round integer frequencies, but I'm not sure how to modify the algorithm for decimal frequencies.
Integer Approximation:
Converted probabilities to integer counts by rounding, tried to find a path that uses each L
-substring approximately the right number of times.
Calculated error as the sum of squared differences between target and actual frequencies.
Question
Is there an algorithm to achieve as close to a perfect of a string as possible for this problem.
Specifically:
How can I better handle the non-integer frequencies?
Are there established algorithms for reconstructing strings from probabilistic L- substring distributions?