最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

python - Re-construct a string from a probability distribution of substrings - Stack Overflow

programmeradmin6浏览0评论

I was working on practice problems for a job-interview, and I came across this problem that stumped me.

The Problem

The task is to reconstruct a string of a specific length from a probability distribution of substrings of length L.

Given:

  • L: Length of substrings

  • n: Length of the output string

  • p: A probability distribution over all possible length-L substrings

  • A dictionary mapping each substring to its probability.

  • The total size is thus (size_of_alphabet)^L

  • Assume a small fixed-size alphabet (like 10)

I need to reconstruct a string, S of length n that matches p as closely as possible.

The Challenge

The main complexity comes from the fact that the frequencies in p can be non-integer/decimal values. For instance, if substring "abc" has a probability of 0.25, this doesn't cleanly translate to a whole number of occurrences in a discrete string.

Example Input/Output

Given

  • alphabet: [A, B, C] (3 characters)
  • L=2
  • n=3
  • p=[0.5, 0.25, 0.25, 0, 0, 0, 0, 0, 0] (size of p is 3^k = 9; for each possible substring)

There are two answers: 'AAB' or 'AAC'

Why?

The desired frequencies of length L substrings is for a length L string is:

p * (n-L+1) = p * 2 = [1, 0.5, 0.5, 0, ...]

Meaning we need 1 'AA', 0.5 of 'AB' and 0.5 'AC'.

We can't have fractions of a substring, so we need to decide. Since they are equally likely, we can pick either of them and they will create a string that is equally as optimal:

  1. [AA, AB] => AAB
  2. [AA, AC] => AAC

What I've Tried

De Bruijn Graph Approach:

Built a de Bruijn graph where nodes are L-1-substrings (prefixes and suffixes), and edges represent the substrings themselves.

Attempted to find a path that approximates the target distribution.

This works if p*(n-L+1) gives a vector with round integer frequencies, but I'm not sure how to modify the algorithm for decimal frequencies.

Integer Approximation:

Converted probabilities to integer counts by rounding, tried to find a path that uses each L-substring approximately the right number of times.

Calculated error as the sum of squared differences between target and actual frequencies.

Question

Is there an algorithm to achieve as close to a perfect of a string as possible for this problem.

Specifically:

  • How can I better handle the non-integer frequencies?

  • Are there established algorithms for reconstructing strings from probabilistic L- substring distributions?

发布评论

评论列表(0)

  1. 暂无评论