I'm working with a mileage dataset and am having trouble programming a way to filter out clearly wrong mileages (e.g. 350,000 and 1234).
VIN | MILEAGE | DATE |
---|---|---|
a | 10000 | 2024-01-01 |
a | 20000 | 2024-02-01 |
a | 350000 | 2024-03-01 |
a | 30000 | 2024-04-01 |
a | 1234 | 2024-05-01 |
a | 40000 | 2024-06-01 |
I first tried filtering out decreasing mileages by grouping by VIN and taking the difference from one mileage to the next, but because of the 350,000 this also filters out 30,000 and 40,000 (which should remain).
My next thought was to look at the next few rows: if mileage decreases, remove the previous row (here the 350,000) and check whether the remaining mileages follow a slightly increasing trend. But then I'm not sure how I would treat the 1234. My last resort would be to say that mileage can't increase or decrease by more than X between readings, but I was wondering if there is a more programmatic approach to filtering out these errors.
Thank you!
asked Mar 18 at 7:04 by Bobbert
1 Answer
Here is one option that gives you a general logic; it would need to be fine-tuned.
You can combine detection of outliers and of non-monotonic values (respective to their neighbors):
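import pandas as pd

# parameters for outlier detection
window = 3  # increase this if you have a lot of data points
n_std = 1   # half-width of the accepted band, in rolling standard deviations

# sort first so the rolling results line up with df row for row
df = df.sort_values(by=['VIN', 'DATE'])
g = df.groupby('VIN')['MILEAGE']
# centered rolling statistics per VIN; droplevel restores df's index
r = g.rolling(window=window, min_periods=1, center=True)
std = r.std().droplevel('VIN')
median = r.median().droplevel('VIN')

# is the value not monotonic?
diff = g.diff().lt(0)         # lower than the previous reading
m1 = diff | g.diff(-1).gt(0)  # or higher than the next reading (per VIN)

# is the value an outlier?
m2 = ~df['MILEAGE'].between(median-n_std*std, median+n_std*std)

# filter out unwanted values (both non-monotonic and an outlier)
out = df[~(m1 & m2)]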
Output:
VIN MILEAGE DATE
0 a 10000 2024-01-01
1 a 20000 2024-02-01
3 a 30000 2024-04-01
5 a 40000 2024-06-01
Intermediates:
VIN MILEAGE DATE std median diff m1 m2 m1&m2
0 a 10000 2024-01-01 7071.067812 15000.0 False False False False
1 a 20000 2024-02-01 193476.958146 20000.0 False False False False
2 a 350000 2024-03-01 187705.443004 30000.0 False True True True
3 a 30000 2024-04-01 193591.152308 30000.0 True True False False
4 a 1234 2024-05-01 20125.794030 30000.0 True True True True
5 a 40000 2024-06-01 27411.701479 20617.0 False False False False
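To try the logic end to end, here is a self-contained sketch that rebuilds the sample data from the question and wraps the steps above in a function. The name `drop_bad_mileages` and its default parameters are my own choices for illustration, not part of pandas, and the thresholds are only starting points that would need tuning on real data.

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'VIN': ['a'] * 6,
    'MILEAGE': [10000, 20000, 350000, 30000, 1234, 40000],
    'DATE': pd.to_datetime(['2024-01-01', '2024-02-01', '2024-03-01',
                            '2024-04-01', '2024-05-01', '2024-06-01']),
})


def drop_bad_mileages(df, window=3, n_std=1):
    """Hypothetical helper: drop rows that are both non-monotonic and outliers."""
    # sort so that the per-VIN rolling results line up with df row for row
    df = df.sort_values(by=['VIN', 'DATE'])
    g = df.groupby('VIN')['MILEAGE']
    r = g.rolling(window=window, min_periods=1, center=True)
    std = r.std().droplevel('VIN')
    median = r.median().droplevel('VIN')

    # non-monotonic: lower than the previous reading, or higher than the next one
    m1 = g.diff().lt(0) | g.diff(-1).gt(0)
    # outlier: outside median +/- n_std * std of the centered rolling window
    m2 = ~df['MILEAGE'].between(median - n_std * std, median + n_std * std)
    return df[~(m1 & m2)]


out = drop_bad_mileages(df)
print(out['MILEAGE'].tolist())
```

Requiring both conditions is what keeps the 30,000 row: it is lower than the reading before it, so it is flagged as non-monotonic, but it sits inside the rolling band and is therefore not treated as an outlier.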
Plot of the outlier detection bounds:
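A minimal sketch of how such a plot could be drawn, assuming matplotlib is installed and a single VIN as in the sample data. The styling and the output file name `mileage_bounds.png` are arbitrary choices for illustration, so the result will not necessarily match the original figure.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove to open a window instead
import matplotlib.pyplot as plt
import pandas as pd

# sample data from the question (one VIN only)
df = pd.DataFrame({
    'VIN': ['a'] * 6,
    'MILEAGE': [10000, 20000, 350000, 30000, 1234, 40000],
    'DATE': pd.to_datetime(['2024-01-01', '2024-02-01', '2024-03-01',
                            '2024-04-01', '2024-05-01', '2024-06-01']),
})
window, n_std = 3, 1

# same centered rolling statistics as in the filter
r = df.groupby('VIN')['MILEAGE'].rolling(window=window, min_periods=1, center=True)
std = r.std().droplevel('VIN')
median = r.median().droplevel('VIN')
lower, upper = median - n_std * std, median + n_std * std

fig, ax = plt.subplots()
ax.plot(df['DATE'], df['MILEAGE'], marker='o', label='MILEAGE')
ax.plot(df['DATE'], median, linestyle='--', label='rolling median')
# shaded band: values outside it count as outliers (m2)
ax.fill_between(df['DATE'], lower, upper, alpha=0.2, label='median +/- n_std*std')
ax.set_xlabel('DATE')
ax.set_ylabel('MILEAGE')
ax.legend()
fig.savefig('mileage_bounds.png')
```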