
pyspark - Which optimizations are being done in Spark's MinHashLSH banding?


I have two related questions about the optimizations in (Py)Spark's MinHashLSH. I didn't find answers in the docs or online, and reading through the code didn't give me clarity either. Let's consider approxSimilarityJoin.
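For concreteness, here is a minimal runnable example of the call in question (the toy data is my own; the API calls are PySpark's actual MinHashLSH / approxSimilarityJoin):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Sets encoded as sparse binary vectors.
dfA = spark.createDataFrame(
    [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0])),
     (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]))],
    ["id", "features"])
dfB = spark.createDataFrame(
    [(2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]))],
    ["id", "features"])

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(dfA)

# Returns pairs whose Jaccard *distance* is below the threshold (0.8 here).
model.approxSimilarityJoin(dfA, dfB, 0.8, distCol="JaccardDistance").show()
```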

  1. Is Spark using the "banding" scheme to optimize the lookup? Some authors distinguish between "plain" MinHash and MinHashLSH, where the LSH variant extends MinHash search by comparing only bands of hash values at a time and using a voting scheme to achieve sub-linear search time in the number of examples searched against (a concrete sketch of what I mean follows below). Examples:
  • (HTML article; original link not preserved)
  • (PDF; original link not preserved)

One doesn't need bands to do an approximate similarity join, and my best guess is that Spark does not use them. However, part 2 of my question threw me off...
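To pin down the terminology, here is a self-contained sketch of the banding scheme I mean (following the usual Mining of Massive Datasets treatment; everything here is my own illustration, not Spark code):

```python
from collections import defaultdict
from itertools import combinations

def banded_candidates(signatures, bands, rows_per_band):
    """signatures: item_id -> MinHash signature (list of ints) with
    len(signature) == bands * rows_per_band. Returns pairs that collide
    in at least one band."""
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for item_id, sig in signatures.items():
            # The bucket key is the *entire* band: two items become
            # candidates only if all rows_per_band values in a band match.
            band = tuple(sig[b * rows_per_band:(b + 1) * rows_per_band])
            buckets[band].append(item_id)
        for bucket in buckets.values():
            candidates.update(combinations(sorted(bucket), 2))
    return candidates

# With b bands of r rows, a pair with Jaccard similarity s becomes a
# candidate with probability 1 - (1 - s**r)**b: the sub-linear "S-curve".
sigs = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 8, 9, 9]}
print(banded_candidates(sigs, bands=2, rows_per_band=2))
```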

  2. How is Spark using parallelism efficiently in this context, and in particular is it using the hash values to determine partitions? (See, e.g., All executors dead MinHash LSH PySpark approxSimilarityJoin self-join on EMR cluster.) Partitioning by hash value sounds like it could be a variant of, or equivalent to, the banding scheme mentioned above, which made me doubt my guess that Spark does nothing like banding and instead performs straight MinHash estimation.
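From my (possibly wrong) reading of LSH.approxSimilarityJoin in Spark's LSH.scala, the join is roughly equivalent to the following DataFrame-level paraphrase (reusing model, dfA, and dfB from the snippet above; entry and hashValue match the internal column names, the rest is illustrative):

```python
import pyspark.sql.functions as F

hashedA = model.transform(dfA)  # adds the array<vector> column "hashes"
hashedB = model.transform(dfB)

# posexplode yields one row per (hash-table index, single MinHash value).
explodedA = hashedA.select(
    F.col("id").alias("idA"), F.col("features").alias("featuresA"),
    F.posexplode("hashes").alias("entry", "hashValue"))
explodedB = hashedB.select(
    F.col("id").alias("idB"), F.col("features").alias("featuresB"),
    F.posexplode("hashes").alias("entry", "hashValue"))

# Equi-join on (table index, hash value): the shuffle keys are the hash
# values themselves, so only colliding rows meet in a partition.
candidates = (explodedA.join(explodedB, on=["entry", "hashValue"])
              .select("idA", "idB", "featuresA", "featuresB")
              .distinct())
# Spark then computes the exact Jaccard distance for each candidate pair
# and filters by the threshold (omitted here).
```

If that reading is correct, each hash table behaves like a band of a single row, which is exactly what blurs the line between "straight" MinHash and banding for me.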

Any clarity on what improvements (if any) are achieved beyond the linear-time estimation of Jaccard similarity afforded by "vanilla" MinHash would be appreciated!

PS: I do know that MinHash without banding is itself a variant of LSH, and that it's accurate to call anything using MinHash an LSH. However, some authors use the term LSH specifically for the banded extension of "straight" MinHash.

Thank you!
