
pyspark - Which optimizations are being done in Spark's MinHashLSH banding?


I have two related questions about the optimizations in (Py)Spark's MinHashLSH. I didn't find answers in the docs or online, and reading through the code didn't give me clarity either. Let's consider approxSimilarityJoin.
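For concreteness, here is a minimal runnable example of the call in question (the toy data is my own; the API calls are PySpark's actual MinHashLSH / approxSimilarityJoin):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Sets encoded as sparse binary vectors.
dfA = spark.createDataFrame(
    [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0])),
     (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]))],
    ["id", "features"])
dfB = spark.createDataFrame(
    [(2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]))],
    ["id", "features"])

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(dfA)

# Returns pairs whose Jaccard *distance* is below the threshold (0.8 here).
model.approxSimilarityJoin(dfA, dfB, 0.8, distCol="JaccardDistance").show()
```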

  1. Is Spark using the "banding" scheme to optimize the lookup? Some authors distinguish between "plain" MinHash and MinHashLSH, where the LSH variant extends MinHash search by comparing only bands of hash values at a time and using a voting scheme to achieve sub-linear search time in the number of examples searched against (a concrete sketch of what I mean follows below). Examples:
  • (HTML article; original link not preserved)
  • (PDF; original link not preserved)

One doesn't need bands to do an approximate similarity join, and my best guess is that Spark does not use them. However, part 2 of my question threw me off...
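To pin down the terminology, here is a self-contained sketch of the banding scheme I mean (following the usual Mining of Massive Datasets treatment; everything here is my own illustration, not Spark code):

```python
from collections import defaultdict
from itertools import combinations

def banded_candidates(signatures, bands, rows_per_band):
    """signatures: item_id -> MinHash signature (list of ints) with
    len(signature) == bands * rows_per_band. Returns pairs that collide
    in at least one band."""
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for item_id, sig in signatures.items():
            # The bucket key is the *entire* band: two items become
            # candidates only if all rows_per_band values in a band match.
            band = tuple(sig[b * rows_per_band:(b + 1) * rows_per_band])
            buckets[band].append(item_id)
        for bucket in buckets.values():
            candidates.update(combinations(sorted(bucket), 2))
    return candidates

# With b bands of r rows, a pair with Jaccard similarity s becomes a
# candidate with probability 1 - (1 - s**r)**b: the sub-linear "S-curve".
sigs = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 8, 9, 9]}
print(banded_candidates(sigs, bands=2, rows_per_band=2))
```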

  2. How is Spark using parallelism efficiently in this context, and in particular is it using the hash values to determine partitions? (See, e.g., All executors dead MinHash LSH PySpark approxSimilarityJoin self-join on EMR cluster.) Partitioning by hash value sounds like it could be a variant of, or equivalent to, the banding scheme mentioned above, which made me doubt my guess that Spark does nothing like banding and instead performs straight MinHash estimation.
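From my (possibly wrong) reading of LSH.approxSimilarityJoin in Spark's LSH.scala, the join is roughly equivalent to the following DataFrame-level paraphrase (reusing model, dfA, and dfB from the snippet above; entry and hashValue match the internal column names, the rest is illustrative):

```python
import pyspark.sql.functions as F

hashedA = model.transform(dfA)  # adds the array<vector> column "hashes"
hashedB = model.transform(dfB)

# posexplode yields one row per (hash-table index, single MinHash value).
explodedA = hashedA.select(
    F.col("id").alias("idA"), F.col("features").alias("featuresA"),
    F.posexplode("hashes").alias("entry", "hashValue"))
explodedB = hashedB.select(
    F.col("id").alias("idB"), F.col("features").alias("featuresB"),
    F.posexplode("hashes").alias("entry", "hashValue"))

# Equi-join on (table index, hash value): the shuffle keys are the hash
# values themselves, so only colliding rows meet in a partition.
candidates = (explodedA.join(explodedB, on=["entry", "hashValue"])
              .select("idA", "idB", "featuresA", "featuresB")
              .distinct())
# Spark then computes the exact Jaccard distance for each candidate pair
# and filters by the threshold (omitted here).
```

If that reading is correct, each hash table behaves like a band of a single row, which is exactly what blurs the line between "straight" MinHash and banding for me.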

Any clarity on what improvements (if any) are achieved beyond the linear-time estimation of Jaccard similarity afforded by "vanilla" MinHash would be appreciated!

PS: I do know that MinHash without banding is itself a variant of LSH, and that it's accurate to call anything using MinHash an LSH. However, some authors use the term LSH specifically for the banded extension of "straight" MinHash.

Thank you!
