
How can I filter out sudden increases in a column value using pandas? - Stack Overflow


I'm working with a mileage dataset and am having trouble programming a way to filter out clearly wrong mileages (e.g. 350,000 and 1234).

VIN MILEAGE DATE
a 10000 2024-01-01
a 20000 2024-02-01
a 350000 2024-03-01
a 30000 2024-04-01
a 1234 2024-05-01
a 40000 2024-06-01

I first tried filtering out decreasing mileages by grouping by VIN and taking the difference between consecutive mileages, but because of the 350,000, that also filters out 30,000 and 40,000 (which should remain).
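As an illustration of that first attempt, here is a minimal sketch. The original code was not posted, so this is an assumed reconstruction: it keeps a reading only if it is at least the running maximum for its VIN, which reproduces the behavior described (everything after the 350,000 is dropped).

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "VIN": ["a"] * 6,
    "MILEAGE": [10000, 20000, 350000, 30000, 1234, 40000],
    "DATE": pd.to_datetime([
        "2024-01-01", "2024-02-01", "2024-03-01",
        "2024-04-01", "2024-05-01", "2024-06-01",
    ]),
})

df = df.sort_values(["VIN", "DATE"])

# Highest mileage seen so far for each vehicle (assumed approach, not from the question).
running_max = df.groupby("VIN")["MILEAGE"].cummax()

# Keep only readings that never fall below an earlier reading.
naive = df[df["MILEAGE"].ge(running_max)]

print(naive["MILEAGE"].tolist())
```

Once the bad 350,000 reading becomes the running maximum, every later valid reading looks like a decrease, which is why 30,000 and 40,000 disappear along with 1234.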

My next thought was to check the next few rows and see whether removing the 350,000 would make the mileage follow a slightly increasing trend (basically, if mileage decreases, remove the previous row and check again), but then I'm not sure how I would treat the 1234. My last resort would be to say that mileage can't increase or decrease by more than some amount X, but I was wondering if there is a programmatic approach to filtering out these errors.

Thank you!

asked Mar 18 at 7:04 by Bobbert

1 Answer


Here is one option that gives you a general logic, which would need to be fine-tuned.

You can combine detection of outliers and of non-monotonic values (relative to their neighbors), and drop a row only when both checks flag it:

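# parameters for outlier detection
window = 3  # rolling window size; increase this if you have a lot of data points
n_std = 1   # half-width of the accepted band, in rolling standard deviations

# per-VIN mileage in chronological order
g = df.sort_values(by='DATE').groupby('VIN')['MILEAGE']
r = g.rolling(window=window, min_periods=1, center=True)

# centered rolling statistics, re-aligned to the original index
std = r.std().droplevel('VIN')
median = r.median().droplevel('VIN')

# is the value not monotonic?
# flag the row where the mileage drops and the row just before it (within the same VIN)
diff = g.diff().lt(0)
m1 = diff | diff.groupby(df['VIN']).shift(-1, fill_value=False)

# is the value an outlier?
m2 = ~df['MILEAGE'].between(median-n_std*std, median+n_std*std)

# filter out unwanted values
out = df[~(m1 & m2)]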

Output:

  VIN  MILEAGE        DATE
0   a    10000  2024-01-01
1   a    20000  2024-02-01
3   a    30000  2024-04-01
5   a    40000  2024-06-01

Intermediates:

  VIN  MILEAGE        DATE            std   median   diff     m1     m2  m1&m2
0   a    10000  2024-01-01    7071.067812  15000.0  False  False  False  False
1   a    20000  2024-02-01  193476.958146  20000.0  False  False  False  False
2   a   350000  2024-03-01  187705.443004  30000.0  False   True   True   True
3   a    30000  2024-04-01  193591.152308  30000.0   True   True  False  False
4   a     1234  2024-05-01   20125.794030  30000.0   True   True   True   True
5   a    40000  2024-06-01   27411.701479  20617.0  False  False  False  False
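To inspect these intermediates on your own data while tuning `window` and `n_std`, the logic can be wrapped in a helper. This is only a sketch: the function name `mileage_diagnostics`, the `keep` column, and the second vehicle `b` are hypothetical additions for illustration, and it assumes the DataFrame has a unique index.

```python
import pandas as pd

def mileage_diagnostics(df, window=3, n_std=1):
    """Return df sorted by VIN/DATE with the intermediate columns and a `keep` flag."""
    d = df.sort_values(["VIN", "DATE"]).copy()
    g = d.groupby("VIN")["MILEAGE"]
    r = g.rolling(window=window, min_periods=1, center=True)

    # Centered rolling statistics per VIN, aligned back to the row index.
    d["std"] = r.std().droplevel("VIN")
    d["median"] = r.median().droplevel("VIN")

    # A drop in mileage flags both the row itself and the row before it.
    drop = g.diff().lt(0)
    prev_of_drop = drop.groupby(d["VIN"]).shift(-1, fill_value=False).astype(bool)
    d["m1"] = drop | prev_of_drop

    # Outlier: outside median +/- n_std rolling standard deviations.
    lower = d["median"] - n_std * d["std"]
    upper = d["median"] + n_std * d["std"]
    d["m2"] = ~d["MILEAGE"].between(lower, upper)

    # A row is removed only when both checks agree.
    d["keep"] = ~(d["m1"] & d["m2"])
    return d

dates = ["2024-01-01", "2024-02-01", "2024-03-01",
         "2024-04-01", "2024-05-01", "2024-06-01"]
df = pd.DataFrame({
    "VIN": ["a"] * 6 + ["b"] * 3,  # "b" is a made-up second vehicle
    "MILEAGE": [10000, 20000, 350000, 30000, 1234, 40000, 5000, 6000, 7000],
    "DATE": pd.to_datetime(dates + dates[:3]),
})

diag = mileage_diagnostics(df)
out = diag.loc[diag["keep"], ["VIN", "MILEAGE", "DATE"]]
print(diag)
```

Keeping the flags as columns makes it easy to see which of the two checks caused a row to be dropped before committing to the filtered result.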

Plot of the outlier detection bounds: (image not included)
