I have a JSON file with 28 million strings; the size is roughly 15 GB. Some of the strings are duplicates, so I need to create a new JSON file with just the unique ones. I'm guessing the vast majority of them are unique, but I need to find out the exact number. The strings are all less than 100 characters. Here is an example of the data:
[
'4zWMS2IHAKcsrVtrUBFXIFjkwvbiCyiK',
'btFqRsglI1Dh81jpgmnRhKPGIBbe2cU7',
'8Us6mE6zWfyOpjhXsJssE65LrOFc7yr6',
...
]
My first approach was to create a JavaScript object and set all of the keys of the object to the strings. Then I would check the number of keys, and that would be my unique count. Unfortunately, I ran into a limit: JavaScript objects can only have ~8M keys.
My next approach was to create a new array in JavaScript, iterate through my strings, and use the .indexOf method to see if I had already added the string to my array. Unfortunately, this is way too slow.
Can anyone think of a way I can do this in JavaScript? I am also OK switching to a different language if this is not the right tool for the job.
6 Answers
Computers are insane. I was inspired by the comments to shard the keys out across many objects so I would not hit the ~8M-key limit in JavaScript. I then used GitHub Copilot to write this code in about 30 seconds, just by writing the comments and having Copilot fill in the rest:
// Use Node's built-in fs module (aliased as NodeFileSystem here)
const NodeFileSystem = require('fs');
// Find the files we want to count
let files = NodeFileSystem.readdirSync('data/imported');
let keyArray = [];
let keyArrayLength = 0;
// Add all of the keys to an array
for(let file of files) {
    if(file.endsWith('.json')) {
        console.log('Parsing file', file);
        let data = JSON.parse(NodeFileSystem.readFileSync('data/imported/'+file));
        // Add the data item.key to the array
        for(let item of data) {
            keyArray.push(item.key);
            keyArrayLength++;
        }
    }
}
console.log('Total array length:', keyArrayLength);
// JavaScript will only allow us to have ~8 million keys in an object, so we need to shard the array
// into several objects, using the first characters of each key
let characterCountToUseForSharding = 2;
// An object to store the sharded objects
let shardedObjects = {};
// Loop through the key array
let processedCount = 0;
for(let key of keyArray) {
    processedCount++;
    if(processedCount % 1000000 === 0) {
        let processCountWithCommas = processedCount.toLocaleString();
        console.log('Processed', processCountWithCommas, 'of', keyArrayLength.toLocaleString());
    }
    // Get the first characterCountToUseForSharding characters of the key
    let shardingKey = key.substring(0, characterCountToUseForSharding);
    // If the sharded object doesn't exist, create it
    if(!shardedObjects[shardingKey]) {
        shardedObjects[shardingKey] = {};
        // console.log('Created sharded object', shardingKey);
    }
    // Add the key to the sharded object
    shardedObjects[shardingKey][key] = true;
}
// Count the keys in each sharded object
let total = 0;
for(let shardingKey in shardedObjects) {
    let keyCount = Object.keys(shardedObjects[shardingKey]).length;
    console.log('Sharding key', shardingKey, 'has', keyCount, 'keys');
    total += keyCount;
}
// Log the totals
console.log('Total keys:', keyArrayLength);
console.log('Total unique keys:', total);
// Percentage
let percentage = (total / keyArrayLength) * 100;
console.log('Unique percentage:', percentage);
// Duplicate keys
let duplicateKeys = keyArrayLength - total;
console.log('Duplicate keys:', duplicateKeys);
Here you can see the output and speed:
https://www.youtube.com/watch?v=RurzxMjbazg
I am running it on an M1 MacBook Pro and just amazed at the speed. The fan did not even turn on.
I don't have a ready-to-use solution, but I can drop here a bunch of advice that could lead you to an efficient solution:
If you are using Node.js, consider writing a C++ addon. With C++ you can write low-level code that could lead to faster and more efficient software. Writing C++ code will let you use the right data structure for this task, manage memory efficiently, and get around many of Node.js's limitations.
Work with file units. I think the file system is the key to your problem. In other words, you can apply an on-disk sort to your file (e.g. the Linux sort command), then split your 15 GB file into, let's say, 200 MB chunks. You can then load each chunk with Node.js and remove duplicates, paying attention to the fact that the last rows of the nth chunk can have duplicates at the start of the (n+1)-th chunk (see the sketch after this list).
(bonus) Exploit all the threads. Even if this alone won't solve the memory problems, combined with the multi-chunk solution it will lead to faster code.
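To make the chunk-boundary point concrete, here is a minimal Node.js sketch of the counting step, assuming the data has already been sorted on disk (e.g. with the Linux sort command) and rewritten as one string per line; the file name strings.sorted.txt is a hypothetical placeholder:
// Minimal sketch: count unique strings in a file that has already been
// sorted on disk, so duplicates appear on adjacent lines.
// Assumes 'strings.sorted.txt' holds one string per line (hypothetical file name).
const fs = require('fs');
const readline = require('readline');

async function countUniqueSorted(path) {
    const rl = readline.createInterface({
        input: fs.createReadStream(path),
        crlfDelay: Infinity,
    });
    let previous = null;
    let uniqueCount = 0;
    for await (const line of rl) {
        // Because the input is sorted, a string only needs to be compared
        // with the immediately preceding line to detect a duplicate.
        if (line !== previous) {
            uniqueCount++;
            previous = line;
        }
    }
    return uniqueCount;
}

countUniqueSorted('strings.sorted.txt')
    .then((count) => console.log('Unique strings:', count));
Because only the previous line is kept in memory, this never loads more than one string at a time, regardless of the file size.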
I would suggest importing the data into a database; MongoDB or PostgreSQL will do just fine.
But if you really need a JS solution, here is an implementation of ShardSet(), which stores strings in configurable buckets (256 by default).
Tested on 5M strings in a 250MB data.json, it took ~15s with deno and ~86s with sort | uniq.
% ls -lah data.json
-rw-r--r-- 1 me staff 250M 11 16 16:50 data.json
% time ./scripts/distinct.sh
5242880
./scripts/distinct.sh 14.49s user 1.68s system 117% cpu 13.717 total
% time ./scripts/sort-uniq.sh
5242882
./scripts/sort-uniq.sh 86.34s user 0.99s system 106% cpu 1:22.36 total
Check out this repo to see if it suits your needs: deno-demo-large-json
Note: the string hash method taken from V8 is quite good; a random dataset hashes into the buckets evenly.
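The repo has the actual implementation; purely as a rough illustration of the idea (my own sketch, with FNV-1a standing in for V8's string hash, and the class name ShardSet reused for readability), the sharding could look like this:
// Minimal sketch of the sharded-set idea (not the linked repo's actual code):
// hash each string to one of 256 buckets and keep a Set per bucket, so no
// single Set or object has to hold all of the entries.
const BUCKET_COUNT = 256;

// FNV-1a, used here as a stand-in for V8's string hash.
function fnv1a(str) {
    let hash = 0x811c9dc5;
    for (let i = 0; i < str.length; i++) {
        hash ^= str.charCodeAt(i);
        hash = Math.imul(hash, 0x01000193);
    }
    return hash >>> 0;
}

class ShardSet {
    constructor(bucketCount = BUCKET_COUNT) {
        this.buckets = Array.from({ length: bucketCount }, () => new Set());
    }
    add(str) {
        this.buckets[fnv1a(str) % this.buckets.length].add(str);
    }
    get size() {
        return this.buckets.reduce((sum, bucket) => sum + bucket.size, 0);
    }
}

// Usage: add every string, then read .size for the unique count.
const shardSet = new ShardSet();
for (const str of ['a', 'b', 'a']) shardSet.add(str);
console.log(shardSet.size); // 2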
Since you said you've loaded your data and created an Array, you could try this approach. You can play with the TOTAL_CHUNK variable to check whether it provides some help.
N.B. I've never worked with such a huge amount of JSON data.
// Sample data containing duplicates; replace with your parsed strings.
const data = [
  1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
  15, 16, 17, 18, 19, 20,
];
// Spread the values across TOTAL_CHUNK Sets so no single Set has to hold everything.
const [store, TOTAL_CHUNK] = [[], 18];
let [count, index, uniqueCount] = [0, 0, 0];
while (count < data.length) {
  // Items are assigned to chunks round-robin.
  const currentIndex = index % TOTAL_CHUNK;
  const item = data[index];
  if (store[currentIndex] === undefined) store[currentIndex] = new Set();
  // An item is unique only if no chunk has seen it yet.
  let isUnique = true;
  for (let i = 0; i < TOTAL_CHUNK; i++) {
    if (store[i]?.has(item)) {
      isUnique = false;
      break;
    }
  }
  if (isUnique) {
    store[currentIndex].add(item);
    uniqueCount++;
  }
  count++;
  index++;
}
console.log(uniqueCount);
Others have handled the "javascript isn't really the best thing for this" aspect, but I'll try to answer the question directly, without opinion about context:
Have you tried constructing a Set from the array? It's likely this will take a Good Long While, but I doubt there's any JS solution that won't take a Good Long While.
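For what it's worth, a minimal sketch of that approach, assuming the strings are already parsed into an in-memory array as described in the question:
// Minimal sketch, assuming the 28M strings are already in an array.
const strings = ['a', 'b', 'a']; // placeholder for the real data
const unique = new Set(strings);
console.log('Unique count:', unique.size);
// To write the unique strings back out as JSON:
// require('fs').writeFileSync('unique.json', JSON.stringify([...unique]));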
You can do this with a custom hash function and a custom hash map. Basically, to get around memory size limits in the runtime, you're going to want to limit the hash map size, let's say to 64K buckets (2^16 entries). That means your hash needs to be 16-bit. So:
1. Create a string -> 16-bit integer hash function
2. Create an array of string arrays with 2^16 entries
3. For each string in the input:
3.1 Compute the hash; this becomes your array index
3.2 Look up your array at the computed index
3.3 If it's empty, simply add the string
3.4 If it's not empty, compare the string to all others in that bucket (normal linear search) and only add it if not present
You may need to adjust the hash table size until you find the performance sweet spot. 64K buckets leads to roughly 500 strings in each bucket, which is a lot of collisions, but still orders of magnitude faster than doing 28 million comparisons on every string.
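A minimal JavaScript sketch of that scheme, with a simple illustrative 16-bit hash (my own choice, not a tuned or recommended function):
// Bucketed approach: a 16-bit string hash plus an array of 2^16 buckets,
// each holding the strings that hashed to it.
const BUCKETS = 1 << 16;

// Simple 16-bit string hash (illustrative only).
function hash16(str) {
    let h = 0;
    for (let i = 0; i < str.length; i++) {
        h = Math.imul(h, 31) + str.charCodeAt(i);
    }
    return (h >>> 0) & 0xffff;
}

function countUnique(strings) {
    const table = new Array(BUCKETS);
    let unique = 0;
    for (const str of strings) {
        const index = hash16(str);
        if (table[index] === undefined) table[index] = [];
        // Linear search within the bucket; buckets stay small enough for this.
        if (!table[index].includes(str)) {
            table[index].push(str);
            unique++;
        }
    }
    return unique;
}

console.log(countUnique(['a', 'b', 'a'])); // 2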
Comment from Konrad (Nov 5, 2022 at 17:16): uniq in bash (stackoverflow.com/questions/23740545/…), although sort will be slow