I am wondering how to generate a set of similar strings based on Levenshtein distance (string edit distance). Ideally, I would like to pass in a source string (i.e. the string used to generate other strings similar to it), the number of strings to generate, and a threshold as parameters, i.e. the similarity among the strings in the generated set should be greater than the threshold. Which Python package(s) should I use to achieve that? Or any ideas on how to implement this?
Recommended answer:
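I think you can approach the problem the other way round: instead of searching for strings within a given distance, generate them by applying edits to the source string.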
- Given a string, say it is sittin.
- Given a threshold (edit distance), say it is k.
- Then you apply combinations of different "edits" in up to k steps, so each result is at most k edits away from the source.
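Note that random edits can cancel each other out, so the actual distance may end up smaller than the number of edits applied. To check a generated string against the threshold, the distance can be computed with the standard dynamic-programming recurrence. This is a minimal sketch; the function name `levenshtein` is only illustrative and does not come from a particular package.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # prev[j] is the distance between the prefix of a seen so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("sittin", "sitting"))  # 1
print(levenshtein("kitten", "sitting"))  # 3
```

A generated candidate can then be kept only if `1 <= levenshtein(source, candidate) <= k`.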
For example, let's say k = 2, and assume the allowed edit modes are:
- delete a character
- add a character
- substitute one character for another.
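Each of these edit modes is a single slicing operation on a Python string. A small sketch follows; the helper names `delete_char`, `add_char` and `substitute_char` are made up for illustration.

```python
def delete_char(s: str, i: int) -> str:
    """Remove the character at index i."""
    return s[:i] + s[i + 1:]


def add_char(s: str, i: int, c: str) -> str:
    """Insert character c before index i (i == len(s) appends)."""
    return s[:i] + c + s[i:]


def substitute_char(s: str, i: int, c: str) -> str:
    """Replace the character at index i with c."""
    return s[:i] + c + s[i + 1:]


print(delete_char("sittin", 0))           # ittin
print(add_char("sittin", 6, "g"))         # sitting
print(substitute_char("sittin", 0, "k"))  # kittin
```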
Then the logic is roughly as follows:
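    import random
    import string

    source = 'sittin'
    n, k = 5, 2  # suppose you want n strings, each at most k edits from the source

    for _ in range(n):
        s = source
        # apply between 1 and k random edits; edits can cancel out,
        # so the actual edit distance may be lower than the number applied
        for _ in range(random.randint(1, k)):
            mode = random.choice(['delete', 'add', 'substitute'])
            c = random.choice(string.ascii_lowercase)
            if mode == 'add':
                i = random.randint(0, len(s))
                s = s[:i] + c + s[i:]
            elif s:  # delete and substitute need a non-empty string
                i = random.randrange(len(s))
                s = s[:i] + (c if mode == 'substitute' else '') + s[i + 1:]
        print(s)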
If you need to stick with a pre-defined dictionary, that adds some complexity, but it is still doable. In that case, each edit must be valid, i.e. it must produce a string that is in the dictionary.
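One way to sketch the dictionary-constrained variant is a breadth-first expansion that keeps a string only when it is a dictionary word. This assumes "valid" means every intermediate string must itself be in the dictionary, which is a fairly strict reading. The names `neighbors` and `similar_words`, the lowercase alphabet and the sample word set are all illustrative assumptions.

```python
import random
import string


def neighbors(s, alphabet=string.ascii_lowercase):
    """All strings exactly one delete, add or substitute away from s."""
    out = set()
    for i in range(len(s) + 1):
        for c in alphabet:
            out.add(s[:i] + c + s[i:])          # add c before index i
            if i < len(s):
                out.add(s[:i] + c + s[i + 1:])  # substitute s[i] with c
        if i < len(s):
            out.add(s[:i] + s[i + 1:])          # delete s[i]
    out.discard(s)  # substituting a character with itself is not an edit
    return out


def similar_words(source, dictionary, k, n, seed=None):
    """Up to n dictionary words reachable from source in at most k valid edits."""
    rng = random.Random(seed)
    seen, frontier = {source}, {source}
    for _ in range(k):
        # assumption: every intermediate string must itself be a dictionary word
        frontier = {w for s in frontier for w in neighbors(s) if w in dictionary} - seen
        seen |= frontier
    found = sorted(seen - {source})
    return rng.sample(found, min(n, len(found)))


words = {"sitting", "sifting", "setting", "kitten", "sit"}
print(similar_words("sittin", words, k=2, n=3, seed=0))
```

With this word set, "kitten" is not returned even though it is two edits from "sittin", because neither intermediate string ("kittin" or "sitten") is in the dictionary. If only the final string needs to be a word, the dictionary check can be applied to the last step alone, at the cost of enumerating many more candidates.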