How can I filter out sudden increases in a column value using pandas?

I'm working with a mileage dataset and am having trouble programming a way to filter out clearly wrong mileages (e.g. 350,000 and 1234).

VIN	MILEAGE	DATE
a	10000	2024-01-01
a	20000	2024-02-01
a	350000	2024-03-01
a	30000	2024-04-01
a	1234	2024-05-01
a	40000	2024-06-01

I first tried filtering out decreasing mileages by grouping by VIN and taking the difference from one mileage to the next, but because of the 350,000 this also filters out 30,000 and 40,000 (which should remain).
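
For illustration, here is a minimal sketch of one possible reading of that first attempt (an assumption, since the original code was not posted): keep dropping any row whose mileage is lower than the previous one within its VIN until nothing decreases. The helper name `drop_decreases` is made up for this example.

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    "VIN": ["a"] * 6,
    "MILEAGE": [10000, 20000, 350000, 30000, 1234, 40000],
    "DATE": pd.to_datetime([
        "2024-01-01", "2024-02-01", "2024-03-01",
        "2024-04-01", "2024-05-01", "2024-06-01",
    ]),
})

def drop_decreases(frame):
    """Hypothetical helper: repeatedly remove rows lower than their predecessor."""
    frame = frame.sort_values(["VIN", "DATE"])
    while True:
        decreasing = frame.groupby("VIN")["MILEAGE"].diff().lt(0)
        if not decreasing.any():
            return frame
        frame = frame[~decreasing]

naive = drop_decreases(df)
print(naive["MILEAGE"].tolist())  # [10000, 20000, 350000]
```

The bad 350,000 reading survives because it is never lower than its predecessor, and every valid reading after it is removed instead.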

My next thought was to check the next few rows and see whether removing the 350,000 would make the mileage follow a slightly increasing trend (basically, if the mileage decreases, remove the previous row and check again), but then I'm not sure how I would treat the 1234. My last resort would be to say that mileage can't increase or decrease by more than some amount X, but I was wondering whether there is a more programmatic approach to filtering out these errors.

Thank you!

asked Mar 18 at 7:04 by Bobbert

1 Answer

Here is one option that gives you the general logic; it will need to be fine-tuned for your data.

You can combine detection of outliers with detection of non-monotonic values (relative to their neighbors), and drop only the rows that are flagged by both:

# parameters for outlier detection
window = 3  # increase this if you have a lot of data points
n_std = 1   # half-width of the accepted band, in rolling standard deviations

# sort so that the rows line up with the per-VIN rolling results below
df = df.sort_values(by=['VIN', 'DATE'])
g = df.groupby('VIN')['MILEAGE']
r = g.rolling(window=window, min_periods=1, center=True)

std = r.std().droplevel('VIN')
median = r.median().droplevel('VIN')

# is the value not monotonic?
# (it is lower than the previous value, or the next value is lower than it)
diff = g.diff().lt(0)
m1 = diff | diff.shift(-1, fill_value=False)

# is the value an outlier?
m2 = ~df['MILEAGE'].between(median-n_std*std, median+n_std*std)

# filter out unwanted values
out = df[~(m1 & m2)]

Output:

  VIN  MILEAGE        DATE
0   a    10000  2024-01-01
1   a    20000  2024-02-01
3   a    30000  2024-04-01
5   a    40000  2024-06-01

Intermediates:

  VIN  MILEAGE        DATE            std   median   diff     m1     m2  m1&m2
0   a    10000  2024-01-01    7071.067812  15000.0  False  False  False  False
1   a    20000  2024-02-01  193476.958146  20000.0  False  False  False  False
2   a   350000  2024-03-01  187705.443004  30000.0  False   True   True   True
3   a    30000  2024-04-01  193591.152308  30000.0   True   True  False  False
4   a     1234  2024-05-01   20125.794030  30000.0   True   True   True   True
5   a    40000  2024-06-01   27411.701479  20617.0  False  False  False  False
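
To try this end to end, here is a self-contained sketch of the same logic wrapped in a function. The function name `drop_mileage_outliers`, the extra VIN `b` rows, and the assumption of a unique row index are all mine and are not part of the question; treat it as a starting point rather than a finished implementation.

```python
import pandas as pd

# VIN "a" is the sample from the question; VIN "b" is made-up clean data
# added only to show that each VIN is handled independently.
df = pd.DataFrame({
    "VIN": ["a"] * 6 + ["b"] * 3,
    "MILEAGE": [10000, 20000, 350000, 30000, 1234, 40000, 5000, 6000, 7000],
    "DATE": pd.to_datetime([
        "2024-01-01", "2024-02-01", "2024-03-01",
        "2024-04-01", "2024-05-01", "2024-06-01",
        "2024-01-15", "2024-02-15", "2024-03-15",
    ]),
})

def drop_mileage_outliers(df, window=3, n_std=1):
    """Drop rows that are both non-monotonic and outside median +/- n_std*std.

    Assumes columns VIN, MILEAGE, DATE and a unique row index.
    """
    df = df.sort_values(["VIN", "DATE"])
    g = df.groupby("VIN")["MILEAGE"]

    # centered rolling statistics per VIN, realigned to the row index
    r = g.rolling(window=window, min_periods=1, center=True)
    std = r.std().droplevel("VIN").reindex(df.index)
    median = r.median().droplevel("VIN").reindex(df.index)

    # non-monotonic: lower than the previous row, or followed by a lower row
    drop = g.diff().lt(0)
    next_drop = drop.groupby(df["VIN"]).shift(-1, fill_value=False)
    non_monotonic = drop | next_drop

    # outlier: outside the rolling band around the rolling median
    lower, upper = median - n_std * std, median + n_std * std
    outlier = ~df["MILEAGE"].between(lower, upper)

    return df[~(non_monotonic & outlier)]

out = drop_mileage_outliers(df)
print(out.loc[out["VIN"] == "a", "MILEAGE"].tolist())  # [10000, 20000, 30000, 40000]
```

Requiring both conditions is what keeps the 30,000 row: it is non-monotonic (it follows the 350,000), but it still sits inside the rolling band, so it is not dropped.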

[Figure: plot of the outlier detection bounds]
