最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

javascript - difference of two arrays with 10 million items - _.difference is too slow - Stack Overflow

programmeradmin1浏览0评论

I have two arrays with user id's and I want to check different items in them.

arr1 = [123, 456, 789];
arr2 = [123, 456, 789, 098];

The problem is: these arrays can have 10 or 20 millions of items.

I'm trying with underscore.difference() but it took 10 mins to plete.

Is there any faster way to do this?

I have two arrays with user id's and I want to check different items in them.

arr1 = [123, 456, 789];
arr2 = [123, 456, 789, 098];

The problem is: these arrays can have 10 or 20 millions of items.

I'm trying with underscore.difference() but it took 10 mins to plete.

Is there any faster way to do this?

Share Improve this question edited May 1, 2014 at 18:25 user2864740 62k15 gold badges157 silver badges227 bronze badges asked May 1, 2014 at 18:19 user3175226user3175226 3,6698 gold badges30 silver badges48 bronze badges 4
  • Are the arrays sorted? – Cameron Commented May 1, 2014 at 18:20
  • 2 if your elements work as object keys, it would probably be quicker to turn one array's elements into object keys with a value of 1, then filter your second array according to the object having or not having the elmement as a key. sorting will be slow with that many items, but object lookup is very fast and avoids the loop-de-loop that underscore's contains() method utilizes. – dandavis Commented May 1, 2014 at 18:22
  • 2 Oh, it looks like _.difference doesn't even use a set .. so using a set/probe will be much better bounds. See stackoverflow./questions/13147278/… – user2864740 Commented May 1, 2014 at 18:23
  • 3 If you're working with 10 million items then maybe you should put your data in a database. – mu is too short Commented May 1, 2014 at 18:49
Add a ment  | 

4 Answers 4

Reset to default 3

How about converting the arrays to object to reduce the sorting plexity:

var arr1 = [123, 456, 789], arr2 = [123, 456, 789, 098];

function toObject(arr){
    return arr.reduce(function(o, v, i) {
      o[v] = i;
      return o;
    }, {});
}

var o1 = toObject(arr1), o2 = toObject(arr2), diff = [];

for(var prop in o2){
    if(o1[prop] === undefined)
        diff.push(prop);
}

console.log(diff);

You would obviously need to start on the biggest set.

http://jsfiddle/sHUw5/

Another thing to consider is to sort your collections and do a binary search to reduce the plexity from (O)N to (O)log2N for each array (if I'm thinking straight).

Use native js, rather than a library that tries to acmodate lots of scenarios / inputs.

  • Skip all the function calls and scope changes.
  • Minimize property lookup/namespace walking
  • Don't keep getting the array length on each loop

Simple optimization:

var array1 = [];
var array2 = [];

var difference = [];

for(var i = 0; len = array1.length; i < len; i++)
{
    var value = array1[i];

    if(value == array2[i])
    {
        continue;
    }

    if(array2.indexOf(value) == -1)
    {
        difference.push(value);
    }
}

This considers you do not have the numbers 0 or 1 in the arrays:

var arr1 = [123, 456, 789,3],
    arr2 = [123, 456, 789, 098],    
    has = {},
    different=[],
    length1=arr1.length,
    length2=arr2.length;



for(var i=0;i<length1;i++){
        has[arr1[i]]=true;
}
for(var i=0;i<length2;i++){
    var val=arr2[i];
    if(has[val] === undefined){
       has[val]=val;
    }
    else{
        if(has[val]!=val){
            has[val]=false;
        }
   }
}
for(var i in has){
    if (has[i]) different.push(i);
}

If you want to check also for 0 and 1:

for(var i=0;i<length1;i++){
    has[arr1[i]]=NaN;
}
for(var i=0;i<length2;i++){
    var val=arr2[i];
    if(has[val] === undefined){
        has[val]=null;
    }
    else{
        if(has[val]!=null){
            has[val]=true;
        }
    }
}
for(var i in has){
    if (!has[i]) different.push(i);
}

here is a fast way of cheating the nested iteration that _.difference will invoke:

var arr1 = [123, 456, 789],
    arr2 = [123, 456, 789, 098],    
     has = {};

arr1.forEach(function(a){ this[a]=1;}, has);
alert(  arr2.filter(function(a){return !this[a]; }, has)   );

by using this in the iteration, we hand a pure function to JS that can be executed at the maximum possible speed.

note that this won't work for arrays of objects, or mixed-type arrays like [1, "1"], but it should work for the problem described and demonstrated by the question.

edit: it you want bi-directional pares (like arr1 having, arr2 missing or vice-versa), reverse and repeat the code above. you'll still only be at 40million putations pared to 100trillion that an indexOf()-using method will cost...

发布评论

评论列表(0)

  1. 暂无评论