javascript - MongoDB: Find all lower/uppercase duplicates in DB

There is a huge collection with 600,000 documents. Unfortunately, there are duplicates, which I want to find. These duplicates differ only in the case (upper/lower) of the first letter.

{ key: "Find me" },
{ key: "find me" },
{ key: "Don't find me" }, // just one document for this string
{ key: "don't find me either" } // just one document for this string

Now I want to get all duplicates, that is, every string for which both an uppercase AND a lowercase variant exist.

asked Dec 5, 2016 at 15:52 by user3142695

600k doesn't seem like a lot. Assuming these strings are not too long (i.e. not books), all of them should fit in memory. With an average of 80 chars (about one line in a terminal) per document, that is only ~48 MB. So I suggest loading all of them into a database client and processing them in memory. It could be done in Mongo as well (db-side functions), but that would block the whole database. You could also try map/reduce, but that seems to be a more complex solution. I think those are all the choices you've got. – freakish Commented Dec 5, 2016 at 15:57
Sounds good. As every entry is really small (10-20 characters on average), it then becomes an ordinary JavaScript question of getting duplicates out of an array. – user3142695 Commented Dec 5, 2016 at 16:09
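
The in-memory approach suggested in the comments could look roughly like the sketch below. It assumes the keys have already been loaded into a plain array; the function names are illustrative and not part of any library. Only the first character is lowercased, since the duplicates in the question differ only in their first letter.

```javascript
// Lowercase only the first character of a key.
function normalizeFirstLetter(key) {
  return key.charAt(0).toLowerCase() + key.slice(1);
}

// Group keys by their normalized form and return only the groups that
// contain more than one distinct spelling. Exact repeats of the same
// spelling are collapsed by the Set and therefore not reported.
function findCaseDuplicates(keys) {
  const groups = new Map();
  for (const key of keys) {
    const normalized = normalizeFirstLetter(key);
    if (!groups.has(normalized)) {
      groups.set(normalized, new Set());
    }
    groups.get(normalized).add(key);
  }
  return [...groups.values()]
    .filter((variants) => variants.size > 1)
    .map((variants) => [...variants]);
}

const sampleKeys = ["Find me", "find me", "Don't find me", "don't find me either"];
console.log(findCaseDuplicates(sampleKeys)); // one group: "Find me" and "find me"
```

With 600,000 short strings this is a single pass over the array, so memory rather than time is the main consideration.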

2 Answers

MongoDB provides a $toLower aggregation operator that you can use here.

Here is a way to output every key that appears more than once (replace db.collection with the name of your collection):

db.collection.aggregate([ 
    { $group: 
        { 
            _id: { $toLower: "$key" }, 
            cnt: { "$sum": 1 } 
        }
    },
    { $match: 
        { cnt: {$gt: 1 } } 
    }
])

First, the $group stage groups the documents by key (case-insensitively). The number of documents for each key is accumulated in cnt. After the $group stage, you end up with something like:

 {"_id": "find me", "cnt": 2}
 {"_id": "other key", "cnt": 1}
 ...

Then, the $match stage filters those results, retaining only the ones with a cnt greater than 1.

Note: the code above is for the mongo shell. You can do much the same from JavaScript using the mongodb driver; quoting operator names such as $group is optional there, since they are valid JavaScript identifiers.
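
As a rough sketch of how this might look from Node.js, the pipeline can be built as a plain array and passed to the driver's aggregate(). The function names and the `field` parameter are hypothetical, and `collection` is assumed to be a Collection object from the official mongodb driver. This variant also collects the distinct spellings with $addToSet, so a key stored twice with identical case is not reported as a case duplicate.

```javascript
// Build the aggregation pipeline for a given field name (hypothetical helper).
function buildCaseDuplicatePipeline(field) {
  return [
    {
      $group: {
        _id: { $toLower: "$" + field },      // case-insensitive grouping key
        variants: { $addToSet: "$" + field }, // distinct spellings in the group
        cnt: { $sum: 1 },                     // number of documents in the group
      },
    },
    // "variants.1" exists only if the set holds at least two spellings.
    { $match: { "variants.1": { $exists: true } } },
  ];
}

// Assumption: `collection` comes from the mongodb driver, e.g. db.collection("...").
async function findCaseDuplicatesInCollection(collection, field = "key") {
  return collection.aggregate(buildCaseDuplicatePipeline(field)).toArray();
}
```

Note that $toLower lowercases the whole string, so keys that differ in the case of any letter, not just the first, end up in the same group.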

Here is a find query that returns a document from the user collection whose name is either harendra or Harendra; the match is case-insensitive (small and capital letters).

User.findOne({name: {$regex: '^Harendra$', $options: 'i'}})
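
The query above looks up one known name, so it would have to be run once per candidate value. If the value comes from data rather than a literal, it may contain characters that are special in a regular expression and should be escaped first. A minimal sketch follows; both helper names are hypothetical, and `User` in the usage comment is assumed to be a Mongoose-style model as in the query above.

```javascript
// Escape characters that have a special meaning in regular expressions.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a filter value that matches `value` exactly, ignoring case.
function caseInsensitiveExact(value) {
  return { $regex: "^" + escapeRegExp(value) + "$", $options: "i" };
}

// Usage (assumed model): User.findOne({ name: caseInsensitiveExact("Harendra") })
const filter = caseInsensitiveExact("Harendra");
console.log(filter.$regex); // ^Harendra$
```

A case-insensitive regex like this generally cannot make efficient use of an ordinary index, which matters on a collection of this size.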
