Efficiency of sortWithinPartitions and dropDuplicates in Spark - Stack Overflow

I need to de-duplicate a dataset whose fields are id, name, interestedProduct, and _timeStamp. The data is stored in a Delta Lake table. I have read that sorting within partitions before dropping duplicates can make de-duplication more efficient. Here is my code:

df = spark.sql('select * from dwd.tb_customer')
cols = ['id', 'name', 'interestedProduct', '_timeStamp']
df = df.sortWithinPartitions('id').dropDuplicates(cols)

In my tests, this approach does make de-duplication faster. My question is this: sortWithinPartitions sorts rows only inside each partition, while dropDuplicates de-duplicates globally. So why would sorting the data first make the de-duplication more efficient?

I have also come across the claim that, although dropDuplicates is a global de-duplication, sortWithinPartitions('id').dropDuplicates(cols) is optimized by the executor so that the de-duplication happens inside each partition. If that is correct, and the de-duplication is done per partition before the results are merged, then without a repartition('id') to co-locate rows with the same id, the final result could still contain duplicate records.

So how exactly does dropDuplicates work? Why can sortWithinPartitions + dropDuplicates be more efficient? And when the two are used together, is it true that the executor optimizes dropDuplicates to de-duplicate within each partition?
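To make the concern about partition-local de-duplication concrete, here is a minimal plain-Python sketch. It is only a toy model and does not show how Spark actually plans dropDuplicates. The sample rows and the helper names (split_round_robin, split_by_id, dedupe_within_partitions) are hypothetical and exist only for illustration. The sketch shows that de-duplicating inside each partition leaves duplicates behind when identical rows sit in different partitions, and removes all of them when rows are first co-located by id.

```python
# Hypothetical rows: (id, name, interestedProduct, _timeStamp)
rows = [
    (1, "Ann", "laptop", "2024-01-01"),
    (2, "Bob", "phone", "2024-01-02"),
    (1, "Ann", "laptop", "2024-01-01"),  # exact duplicate of the first row
    (3, "Cid", "tablet", "2024-01-03"),
    (2, "Bob", "phone", "2024-01-02"),   # exact duplicate of the second row
]


def split_round_robin(data, n):
    """Spread rows over n partitions without looking at their content."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(data):
        parts[i % n].append(row)
    return parts


def split_by_id(data, n):
    """Place rows with the same id in the same partition (toy repartition by id)."""
    parts = [[] for _ in range(n)]
    for row in data:
        parts[hash(row[0]) % n].append(row)
    return parts


def dedupe_within_partitions(parts):
    """Sort and de-duplicate each partition on its own, then concatenate."""
    result = []
    for part in parts:
        seen = set()
        for row in sorted(part):  # sorting is local to the partition
            if row not in seen:
                seen.add(row)
                result.append(row)
    return result


local_only = dedupe_within_partitions(split_round_robin(rows, 2))
co_partitioned = dedupe_within_partitions(split_by_id(rows, 2))

print(len(local_only))      # 4: the Bob row survives in both partitions
print(len(co_partitioned))  # 3: every duplicate met its twin in one partition
```

On a real cluster, calling df.explain() on the de-duplicated DataFrame prints the physical plan, which is one way to check whether Spark inserts a shuffle before the final de-duplication step.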
