python - Most performant approach to find closest match from unordered collection - Stack Overflow

I'm wondering what the best approach is for finding the closest match to a given value from a collection of items. The most important part is the lookup time relative to the input size; the data can be shuffled and moved around as much as needed, as long as lookups become faster as a result.

Here is the initial script:

import random

MAX = 1_000_000
MIN = 0

target: float = 333_333.0
collection: set[float] = {random.uniform(MIN, MAX) for _ in range(100_000)}


# --- Code here to find closest as fast as possible, now and for future lookups ---


assert closest in collection and all(abs(target - v) >= delta for v in collection)

Approach 1

Iterate through all values and update the closest match as you go. Very simple, but O(n) per lookup is very slow for big collections.

closest: float = next(iter(collection))  # get any element
delta: float = abs(closest - target)

for v in collection:
    tmp_delta = abs(v - target)

    if tmp_delta < delta:
        closest = v
        delta = tmp_delta

Approach 2

Sort the data, then find the closest value via binary search. Sorting takes O(n log n), but every future lookup will only take O(log n)!

import bisect

sorted_collection: list[float] = sorted(collection)

def find_closest(sorted_collection: list[float], target: float) -> float:
    idx = bisect.bisect_right(sorted_collection, target)

    if idx == 0:
        return sorted_collection[0]
    if idx == len(sorted_collection):
        return sorted_collection[-1]

    before, after = sorted_collection[idx - 1], sorted_collection[idx]
    return before if target - before <= after - target else after

closest = find_closest(sorted_collection, target)

Approach 3

Using a dict with some custom hashing? I thought about implementing a __hash__ method for a custom class that could then be used to find values with (amortized) O(1) lookup time, but I couldn't get anything working without making prior assumptions about the data involved, and I wonder if it is even possible.
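For reference, here is what such assumptions could buy. The sketch below is purely illustrative (the bin width, the closest_binned helper, and the outward ring search are made-up details, not from the original post): if the value range is known and the values are spread fairly evenly, bucketing them into fixed-width bins keyed by an integer index gives expected O(1) lookups, degrading toward O(n) for badly clustered data.

import random

MIN, MAX = 0, 1_000_000
collection: set[float] = {random.uniform(MIN, MAX) for _ in range(100_000)}

BIN_WIDTH = 10.0  # hypothetical tuning knob: ~1 value per bin on average here

# O(n) preprocessing: group every value by its integer bin index.
bins: dict[int, list[float]] = {}
for v in collection:
    bins.setdefault(int(v // BIN_WIDTH), []).append(v)

def closest_binned(target: float) -> float:
    center = int(target // BIN_WIDTH)
    max_radius = int((MAX - MIN) // BIN_WIDTH) + 1
    for radius in range(max_radius + 1):
        candidates = bins.get(center - radius, []) + bins.get(center + radius, [])
        if candidates:
            # A value one ring further out can still be closer than one
            # found in this ring, so widen the search by one more bin.
            candidates += bins.get(center - radius - 1, []) + bins.get(center + radius + 1, [])
            return min(candidates, key=lambda v: abs(v - target))
    raise ValueError("collection is empty")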

And that is where I got to, in a nutshell. I am wondering if there is a faster way than the simple sorting + binary search approach, and if my idea of a dict + custom hash function is somehow possible.

asked Mar 17 at 14:48 by JoniKauf
  • It depends. If you can leverage some attribute of your data, then huge performance benefits are possible. For example: using a quad tree when the data is arranged on a plane. But in general, it all depends. – ravenspoint Commented Mar 17 at 14:54
  • Will your data's "closeness" always be determined by a single attribute with a clear ordering? Even if that attribute is a point in 2D space, the binary-search approach will break down. – tobias_k Commented Mar 17 at 15:04
  • "without making some previous assumptions about the data involved" - What kind of assumptions (are apparently not ok to make)? – no comment Commented Mar 17 at 16:49
  • You might be interested in sortedcontainers: grantjenks/docs/sortedcontainers – Simon Goater Commented Mar 17 at 19:26

1 Answer


With floats, binary search scales as well as the best known approaches. However, it requires looking at memory locations that are far apart from each other. A B-tree also scales as O(log n), but by placing each set of decisions close together in memory, it gains a constant-factor improvement of up to an order of magnitude.
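As a concrete illustration of the cache-locality point (my sketch, not part of the original answer): the third-party sortedcontainers package mentioned in the comments stores values in a B-tree-like list-of-lists layout, so the bisect logic from Approach 2 carries over unchanged, and the structure additionally supports fast insertion and removal for future updates. Assuming sortedcontainers is installed:

from sortedcontainers import SortedList  # third-party: pip install sortedcontainers

def closest_in(sl: SortedList, target: float) -> float:
    # Same neighbour check as the plain-list binary search (Approach 2).
    idx = sl.bisect_right(target)
    if idx == 0:
        return sl[0]
    if idx == len(sl):
        return sl[-1]
    before, after = sl[idx - 1], sl[idx]
    return before if target - before <= after - target else after

sl = SortedList(collection)   # O(n log n) build, cache-friendly layout
print(closest_in(sl, 333_333.0))
sl.add(42.0)                  # stays sorted; later lookups remain fast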

The hash idea doesn't work because a good hash function scatters nearby values in a pseudorandom way. Therefore you have no better way to find those nearby values than to search the whole table.
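A quick way to see this (my illustration, not from the original answer): two floats that differ only slightly land in unrelated buckets, so there is no "neighbouring bucket" to probe.

# CPython hashes a float that equals an integer like that integer,
# while a nearby non-integral float hashes to an unrelated value.
print(hash(333_333.0))     # 333333
print(hash(333_333.0001))  # a large integer unrelated to the one above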
