I'm wondering what the best approach is for finding the closest match to a given value in a collection of items. The most important part is the lookup time relative to the input size; the data can be shuffled and moved around as much as needed, as long as lookups get faster as a result.
Here is the initial script:
import random

MAX = 1_000_000
MIN = 0
target: float = 333_333.0
collection: set[float] = {random.uniform(MIN, MAX) for _ in range(100_000)}
# --- Code here to find closest as fast as possible, now and for future lookups ---
assert closest in collection and all(abs(target - v) >= delta for v in collection)
Approach 1
Iterate through all values and update closest accordingly. Very simple, but O(n) per lookup, so very slow for big collections.
closest: float = next(iter(collection))  # get any element
delta: float = abs(closest - target)
for v in collection:
    tmp_delta = abs(v - target)
    if tmp_delta < delta:
        closest = v
        delta = tmp_delta
Approach 2
Sort the data and then find the closest via binary search. Sorting takes O(n log n), but each future lookup only takes O(log n)!
import bisect

sorted_collection: list[float] = sorted(collection)

def find_closest(sorted_collection: list[float], target: float) -> float:
    idx = bisect.bisect_right(sorted_collection, target)
    if idx == 0:
        return sorted_collection[0]
    if idx == len(sorted_collection):
        return sorted_collection[-1]
    before, after = sorted_collection[idx - 1], sorted_collection[idx]
    return before if target - before <= after - target else after
Approach 3
Using a dict with some custom hashing? I thought about implementing a __hash__ method for a custom class that could then be used to find values with (armored) O(1) lookup time, but I couldn't get anything quite working without making some prior assumptions about the data involved, and I wonder if it is even possible.
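For illustration, here is a rough sketch of one such assumption-based variant: bucket the values by a fixed width into a dict, so a lookup only scans the target's bucket and its neighbours. It assumes the values are spread roughly uniformly over [MIN, MAX]; BUCKET_WIDTH and the helper name are made up for the example.

from collections import defaultdict

BUCKET_WIDTH = 10.0  # arbitrary choice; a poorly chosen width degrades performance

buckets: dict[int, list[float]] = defaultdict(list)
for v in collection:
    buckets[int(v // BUCKET_WIDTH)].append(v)

def closest_via_buckets(target: float) -> float:
    key = int(target // BUCKET_WIDTH)
    radius = 0
    candidates: list[float] = []
    while not candidates:
        # expand outwards ring by ring until some bucket is non-empty
        candidates = [v for k in {key - radius, key + radius} if k in buckets for v in buckets[k]]
        radius += 1
    # check one extra ring: a value just across a bucket boundary may still be closer
    candidates += [v for k in {key - radius, key + radius} if k in buckets for v in buckets[k]]
    return min(candidates, key=lambda v: abs(v - target))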
And that is where I got to in a nutshell. I am wondering if there is a faster way than the simple sorting + binary search approach, and whether my idea of a dict + custom hash function is possible at all.
- 3 It depends. If you can leverage some attribute of your data, then huge performance benefits are possible. For example: using a quad tree when the data is arranged on a plane. But in general, it all depends. – ravenspoint Commented Mar 17 at 14:54
- 1 Will your data's "closeness" always be determined by a single attribute with a clear ordering? If that attribute is, say, a point in 2D space, the binary-search approach will break down. – tobias_k Commented Mar 17 at 15:04
- What does "armored" mean? – no comment Commented Mar 17 at 16:41
- "without making some previous assumptions about the data involved" - What kind of assumptions (are apparently not ok to make)? – no comment Commented Mar 17 at 16:49
- You might be interested in this grantjenks/docs/sortedcontainers – Simon Goater Commented Mar 17 at 19:26
1 Answer
With floats, binary search scales as well as the best known approaches. However, it has to touch memory locations that lie far apart from each other. A B-tree also scales as O(log n), but by keeping each group of comparisons close together in memory it gains roughly an order-of-magnitude constant-factor improvement.
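For example, the sortedcontainers library mentioned in the comments stores the data in chunked lists, which keeps groups of comparisons close together in memory while still giving O(log n) lookups. A sketch, assuming pip install sortedcontainers (whether it actually beats a plain sorted list + bisect is worth benchmarking on your data):

from sortedcontainers import SortedList

sl = SortedList(collection)  # built once, kept sorted on later inserts

def closest_in_sortedlist(sl: SortedList, target: float) -> float:
    idx = sl.bisect_right(target)
    if idx == 0:
        return sl[0]
    if idx == len(sl):
        return sl[-1]
    before, after = sl[idx - 1], sl[idx]
    return before if target - before <= after - target else after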
The hash idea doesn't work because a good hash function scatters nearby values in a pseudorandom way, so you have no better way to find those nearby values than to scan the whole table.
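A quick way to see this with CPython's built-in float hashing (the values here are chosen arbitrarily):

# non-integral floats that are numerically close hash to unrelated values,
# so a dict offers no handle on "neighbouring" keys, only exact matches
for v in (333333.1, 333333.1000001, 333333.2):
    print(v, hash(v))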