
javascript - Median of medians - is this possible or is there a different way - Stack Overflow


Currently I am aggregating a large amount of data on a daily basis, and for each day I calculate the median of that day's values. Now I need to aggregate these daily results into monthly ones, and of course I need to calculate the median again.

Is there a way to calculate a median of medians that is statistically correct? I want to avoid using the raw data again, because there is a huge amount of it :)

As a small proof of concept I wrote this JavaScript, maybe it helps to find a way:

var aSortedNumberGroups = [];
var aSortedNumbers = [];
var aMedians = [];

Math.median = function(aData)
{
    var fMedian = 0;
    var iIndex = Math.floor(aData.length/2);
    if (!(aData.length%2)) {
        fMedian = (aData[iIndex-1]+aData[iIndex])/2;
    } else {
        fMedian = aData[iIndex];
    }

    return fMedian;
};

for (var iCurrGroupNum = 0; iCurrGroupNum < 5; ++iCurrGroupNum) {
    var aCurrNums = [];
    for (var iCurrNum = 0; iCurrNum < 1000; ++iCurrNum) {
        var iCurrRandomNumber = Math.floor(Math.random()*10001);
        aCurrNums.push(iCurrRandomNumber);
        aSortedNumbers.push(iCurrRandomNumber);
    }
    // Numeric ascending sort; the parameter names must match the names used in the body.
    aCurrNums.sort(function(iNumA, iNumB) {
        return iNumA - iNumB;
    });
    aSortedNumberGroups.push(aCurrNums);
    aMedians.push(Math.median(aCurrNums));
}

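// Math.median expects sorted input, so sort both arrays numerically first.
aMedians.sort(function(iNumA, iNumB) { return iNumA - iNumB; });
aSortedNumbers.sort(function(iNumA, iNumB) { return iNumA - iNumB; });

console.log("Medians of each group: "+JSON.stringify(aMedians, null, 4));
console.log("Median of medians: "+Math.median(aMedians));
console.log("Median of all: "+Math.median(aSortedNumbers));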

As you will see, there is often a huge gap between the median of all raw numbers and the median of medians, and I would like them to be pretty close to each other.

Thanks a lot!


asked Feb 23, 2012 at 14:48 by Taris
  • It would be possible to compute the mean, but you would have to store the old numbers somewhere in order to determine the median. – dana, commented Feb 23, 2012 at 14:57
  • I already considered the quicksort-motivated median-of-medians algorithm, since it at least takes some of the raw data away. It was the best approach I could find at the moment :) – Taris, commented Feb 23, 2012 at 15:07

4 Answers


You don't actually "calculate" a median, you "discover" it by distributing the values into subsets. The only optimization for this is a reloadable "tick chart" or running tally: store each distinct value together with the number of times it occurred. That way you can recreate the distribution without having to reparse the raw data. This is only a small optimization, but depending on how much repetition there is in the data set, you could save yourself tons of MB and at the very least a bunch of processor cycles.

Think of it in JSON: { '1': 3, '5': 12, '7': 4 }, meaning '1' has occurred 3 times, '5' has occurred 12 times, etc.

Then persist those counts, starting at the beginning of the time period for which you want to get a median.

Hope this helps -ck
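
As a rough sketch of the tally idea, assuming each day is stored as a { value: count } object like the JSON shown in this answer: the daily tallies can be merged by adding counts, and the median is found by walking the sorted values until the middle rank is reached. The function names (mergeCounts, valueAtRank, medianFromCounts) and the sample days are made up for illustration.

```javascript
// Merge several { value: count } tallies into one by adding the counts.
function mergeCounts(tallies) {
    const merged = {};
    for (const tally of tallies) {
        for (const key of Object.keys(tally)) {
            merged[key] = (merged[key] || 0) + tally[key];
        }
    }
    return merged;
}

// Return the value at a 0-based rank, given [value, count] pairs sorted by value.
function valueAtRank(pairs, rank) {
    let seen = 0;
    for (const [value, count] of pairs) {
        seen += count;
        if (rank < seen) return value;
    }
    throw new Error("rank out of range");
}

// Exact median of the distribution described by a tally.
function medianFromCounts(tally) {
    const pairs = Object.keys(tally)
        .map(function (key) { return [Number(key), tally[key]]; })
        .sort(function (a, b) { return a[0] - b[0]; });
    const total = pairs.reduce(function (sum, pair) { return sum + pair[1]; }, 0);
    if (total === 0) throw new Error("empty tally");
    const upper = Math.floor(total / 2);
    if (total % 2 === 1) return valueAtRank(pairs, upper);
    return (valueAtRank(pairs, upper - 1) + valueAtRank(pairs, upper)) / 2;
}

// Hypothetical daily tallies.
const day1 = { '1': 3, '5': 12, '7': 4 };
const day2 = { '1': 1, '9': 10 };
const month = mergeCounts([day1, day2]);
console.log(medianFromCounts(month)); // 5
```

The monthly result is exact because the merged tally describes the full distribution; the saving only pays off when values repeat often.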

No, unfortunately there is no way to calculate the median from the medians of subsets of the whole and still be statistically accurate. If you wanted to calculate the mean, however, you could use the means of the subsets, provided that they are of equal size (or that you weight each mean by its subset's size).

ck's optimization above could be of assistance to you.
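
To make the difference concrete, here is a small sketch with made-up numbers. The two groups and the helper names median and mean are assumptions chosen for illustration. It shows a median of medians that is far from the true median, while the size-weighted mean of the group means matches the overall mean.

```javascript
// Median of an unsorted array (sorts a copy, leaves the input untouched).
function median(values) {
    const sorted = values.slice().sort(function (a, b) { return a - b; });
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function mean(values) {
    return values.reduce(function (sum, v) { return sum + v; }, 0) / values.length;
}

// Two hypothetical "days" of unequal size.
const groups = [[1, 2, 3], [10]];
const all = [].concat.apply([], groups);

// Medians do not combine: the group medians are 2 and 10.
const medianOfMedians = median(groups.map(function (g) { return median(g); }));
console.log(medianOfMedians); // 6
console.log(median(all));     // 2.5

// Means do combine when each one is weighted by its group size.
const weightedMean = groups.reduce(function (sum, g) {
    return sum + mean(g) * g.length;
}, 0) / all.length;
console.log(weightedMean);    // 4
console.log(mean(all));       // 4
```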

I know this is a very dated thread, but future readers may find Tukey's ninther method quite relevant. Analysis here: http://www.johndcook.com/blog/2009/06/23/tukey-median-ninther/

-kg
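
For readers unfamiliar with it, the ninther is the median of the medians of three samples of three values each. The sketch below is one possible reading of that idea, not code from the linked analysis: picking nine evenly spaced elements is an assumed sampling choice, and the names medianOfThree and ninther are made up. The result is an estimate of the median, not the exact value.

```javascript
// Median of three numbers without sorting.
function medianOfThree(a, b, c) {
    return Math.max(Math.min(a, b), Math.min(Math.max(a, b), c));
}

// Ninther: median of the medians of three triples.
// Assumption: the nine samples are taken at evenly spaced positions.
function ninther(values) {
    if (values.length < 9) throw new Error("need at least nine values");
    const step = (values.length - 1) / 8;
    const s = [];
    for (let i = 0; i < 9; i++) {
        s.push(values[Math.round(i * step)]);
    }
    return medianOfThree(
        medianOfThree(s[0], s[1], s[2]),
        medianOfThree(s[3], s[4], s[5]),
        medianOfThree(s[6], s[7], s[8])
    );
}

// The triples have medians 2, 4 and 6, so the ninther is 4,
// while the exact median of these nine values is 5.
console.log(ninther([1, 2, 9, 3, 4, 8, 5, 6, 7])); // 4
```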

Yet another approach is to take each day's data, parse it, and store it in sorted order. For a given day you can just look at the median piece of data and you've got your answer.

At the end of the month you can do a quick-select to find the median. You can take advantage of the sorted order of each day's data to do a binary search to split it. The result is that your end of month processing will be very, very quick.

The same kind of data, organized in the same kind of way, will also let you do various percentiles very cheaply. The only hard part is extracting each day's raw data and sorting it.
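
One way the end-of-month step could look, assuming each day's values are already stored as a sorted numeric array: pick a pivot, use binary search in every day's array to count how many values fall below it, and narrow the search windows until the wanted rank is pinned down. All names here (lowerBound, upperBound, selectAcrossSorted, medianAcrossSorted) and the sample data are hypothetical.

```javascript
// Index of the first element >= target within arr[lo..hi).
function lowerBound(arr, target, lo, hi) {
    while (lo < hi) {
        const mid = (lo + hi) >>> 1;
        if (arr[mid] < target) lo = mid + 1; else hi = mid;
    }
    return lo;
}

// Index of the first element > target within arr[lo..hi).
function upperBound(arr, target, lo, hi) {
    while (lo < hi) {
        const mid = (lo + hi) >>> 1;
        if (arr[mid] <= target) lo = mid + 1; else hi = mid;
    }
    return lo;
}

// Value at 0-based rank k across all sorted arrays, without merging them.
function selectAcrossSorted(days, k) {
    const total = days.reduce(function (n, d) { return n + d.length; }, 0);
    if (k < 0 || k >= total) throw new Error("rank out of range");
    const lo = days.map(function () { return 0; });
    const hi = days.map(function (d) { return d.length; });
    for (;;) {
        // Pivot: middle element of the widest remaining window.
        let widest = -1;
        for (let i = 0; i < days.length; i++) {
            if (hi[i] > lo[i] &&
                (widest < 0 || hi[i] - lo[i] > hi[widest] - lo[widest])) {
                widest = i;
            }
        }
        const pivot = days[widest][(lo[widest] + hi[widest]) >>> 1];
        // Count values below the pivot and values not above it.
        const lbs = [], ubs = [];
        let below = 0, notAbove = 0;
        for (let i = 0; i < days.length; i++) {
            lbs[i] = lowerBound(days[i], pivot, lo[i], hi[i]);
            ubs[i] = upperBound(days[i], pivot, lo[i], hi[i]);
            below += lbs[i];
            notAbove += ubs[i];
        }
        if (k < below) {
            for (let i = 0; i < days.length; i++) hi[i] = lbs[i];
        } else if (k >= notAbove) {
            for (let i = 0; i < days.length; i++) lo[i] = ubs[i];
        } else {
            return pivot;
        }
    }
}

function medianAcrossSorted(days) {
    const total = days.reduce(function (n, d) { return n + d.length; }, 0);
    const upper = Math.floor(total / 2);
    if (total % 2 === 1) return selectAcrossSorted(days, upper);
    return (selectAcrossSorted(days, upper - 1) + selectAcrossSorted(days, upper)) / 2;
}

// Three hypothetical sorted "days" that together hold 0..9.
const sortedDays = [[1, 3, 5], [2, 4, 6], [0, 7, 8, 9]];
console.log(medianAcrossSorted(sortedDays)); // 4.5
```

Calling selectAcrossSorted with other ranks gives percentiles from the same stored data.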
