python - Replace values in series with first regex group if found, if not leave nan

Say I have a dataframe such as

df = pd.DataFrame(data = {'col1': [1,np.nan, '>3', 'NA'], 'col2':["3.","<5.0",np.nan, 'NA']})

out:
    col1    col2
0   1   3.
1   NaN <5.0
2   >3  NaN
3   NA  NA

What I would like is to strip stuff like "<" or ">", and get to floats only

out:
    col1    col2
0   1   3.
1   NaN 5.0
2   3   NaN
3   NaN NaN

I thought about something like

df['col2'].replace({'.*' : r"[-+]?(?:\d*\.*\d+)"}, regex=True, inplace=True)

the idea being, replace anything with the regex for float, (I think), but this fails

error: bad escape \d at position 8

I tried along the lines of

df['col2'].replace({r"[-+]?(?:\d*\.*\d+)": r"\1"}, regex=True, inplace=True)

assuming ">"or "<" come before the float number, but then this fails (does not catch anything) if the field is a string or np.nan.

Any suggestions please?

Say I have a dataframe such as

df = pd.DataFrame(data = {'col1': [1,np.nan, '>3', 'NA'], 'col2':["3.","<5.0",np.nan, 'NA']})

out:
    col1    col2
0   1   3.
1   NaN <5.0
2   >3  NaN
3   NA  NA

What I would like is to strip stuff like "<" or ">", and get to floats only

out:
    col1    col2
0   1   3.
1   NaN 5.0
2   3   NaN
3   NaN NaN

I thought about something like

df['col2'].replace({'.*' : r"[-+]?(?:\d*\.*\d+)"}, regex=True, inplace=True)

the idea being, replace anything with the regex for float, (I think), but this fails

error: bad escape \d at position 8

I tried along the lines of

df['col2'].replace({r"[-+]?(?:\d*\.*\d+)": r"\1"}, regex=True, inplace=True)

assuming ">"or "<" come before the float number, but then this fails (does not catch anything) if the field is a string or np.nan.

Any suggestions please?

Share Improve this question asked Mar 20 at 20:53 user37292 4044 silver badges12 bronze badges

not your initial problem, but note your regex for float matches ".3" but not "3.", as well as matching "1..2" – ysth Commented Mar 20 at 21:09
It should be {r"^.*?([-+]?\d*\.?\d+).*" : r"\1"} – Wiktor Stribiżew Commented Mar 20 at 21:25
I notice that in your pattern you do not have a capturing group (...) that you could refer to in the replacement string with the \. Also, the \d*\.\d+ requires that you have a 1 or more digits (+) after the literal dot, so the pattern cannot match 3.. And, the \.* matches 0 or more (*) literal dots, you want to do \.? which will match 0 or 1 literal dots. – rich neadle Commented Mar 20 at 21:29

Add a comment |

4 Answers 4

Sorted by: Reset to default 3

This should work. Logic: First match, but do not include in the capture group \1, anything we do not want to keep at the beginning of the number with the negated character class [^...].

Python (re module flavor regex):

pattern = "^[^\w+-\.]*([+-]?\d*\.?\d*)"
replacement = \1

Regex Demo: https://regex101/r/Halz4X/2

NOTES:

^ Matches the beginning of the string.
[^\w+-\.]* *Negated class [^...] that matches anything that is not an alphanumeric or underscore, _, character (\w), or a literal +, - or ., 0 or more times (*).
( Begin first (and only) capture group, referred to in the replacement string with \1.
[+-]?\d*.?\d*
- [+-]? Match 0 or 1 (?) + or -
- \d* Match 0 or more (*) digits \d.
- \. Match a literal dot, ..
- \d* Match 0 or more (*) digits \d.
) End capture (group 1 \1).
Replacement string: \1 replaces the matched string with only the characters in the capture group 1.

Try to run the code below

import re
p = repile("\<|\>")

df['col1'] = df['col1'].apply(lambda x: p.sub("", str(x)) if p.findall(str(x)) else x)
df['col2'] = df['col2'].apply(lambda x: p.sub("", str(x)) if p.findall(str(x)) else x)

Then you have the result.

out:
    col1    col2
0   1   3.
1   NaN 5.0
2   3   NaN
3   NA  NA

and if you would like to get float type, you can try this.

import numpy as np

df['col1'] = df['col1'].apply(lambda x: np.nan if x=='NA' else float(x))
df['col2'] = df['col2'].apply(lambda x: np.nan if x=='NA' else float(x))

Then you have the result as well.

out:
    col1    col2
0   1.0   3.0
1   NaN   5.0
2   3.0   NaN
3   NaN   NaN

I suggest using:

import numpy as np
import pandas as pd
import re

df = pd.DataFrame(data = {'col1': [1,np.nan, '>3', 'NA'], 'col2':["3.","<5.0",np.nan, 'NA']})

# from ">|<" fields match only numbers to group 1
pattern=repile(r"^[<>]([+-]?[0-9]*\.?[0-9]*)$")

(
    df
    # remove ">" and "<" with technique suggested by rich neadle
    .replace(pattern,r"\1",regex=True)
    # map litearl "NA" to numpys Not a Number
    .replace("NA",np.nan)
    # cast float type across dataframe
    .astype(float)
)

This Python code produces the desired results on a list named col2. This may give you an idea on how this could be done with the dataframes. Here I am just working with a list. (Note: In my other answer, I focused only on getting the floats, but did not address the requirement of having all other elements replaced with NaN.)

APPROACH: Get every element from the col2. Check to see if it is an element with a float, replace the value of the element with the float. If the element does not contain a float, replace it with literal NaN.

ASSUMPTIONS:

The string element containing a float has a string begins with 0 or 1 characters that is NOT an alphanumerical character, or underscore (\w), _, ., #, $, +, or -; Followed by 0 or 1 + or - characters;
Followed by 0 or more digits (\d), followed by 0 or 1 literal . , followed by 0 or more digits.
The string ends in a digit.
It may not consist of only of non-alphanumeric or non-underscore characters.

PYTHON CODE (re regex module):

import re

col2 = ["1", "np.nan", "3", "NA", "3.", "<5.0", ">-5", "-5", ">>>>>>5", "<<<<", "<<4", "<<4.9", "<4.9", "nan", "NA", ">>", ".", "+", "-", "...9898d7afs33", "#32211", "..", "$44", "33$"]

pattern_to_find_floats = r'^(?!\W+$)[^\w.#$+-]?([+-]?\d*\.?\d*)$'
pattern = repile(pattern_to_find_floats)

def repl(match1, pattern=pattern):
    if re.match(pattern, match1.group(0) ):
        # IF the string matches the pattern_to_find_floats. Return the updated string with float with optional +/- sign.
        return re.sub(pattern, r"\1", match1.group(0))
    else: 
        return 'NaN'        
updated_col2 = []

for item in col2:
    # SYNTAX: updated_string = re.sub(pattern, replacement, original_string)
    updated_item = re.sub(r".*", repl, item) 
    updated_col2.append(updated_item)

print(f"Original col2: {col2}")
print(f"Updated col2: {updated_col2}")

Regex Demo (pattern_to_find_floats): https://regex101/r/7HfIly/6

RESULT:

Original col2: ['1', 'np.nan', '3', 'NA', '3.', '<5.0', '>-5', '-5', '>>>>>>5', '<<<<', '<<4', '<<4.9', '<4.9', 'nan', 'NA', '>>', '.', '+', '-', '...9898d7afs33', '#32211', '..', '$44', '33$']
Updated col2: ['1', 'NaN', '3', 'NaN', '3.', '5.0', '-5', '-5', 'NaN', 'NaN', 'NaN', 'NaN', '4.9', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN']

CONVERSION FROM TO:

Original --> Replacement
1 --> 1
np.nan --> NaN
3 --> 3
NA --> NaN
3. --> 3.
<5.0 --> 5.0
>-5 --> -5
-5 --> -5
>>>>>>5 --> NaN
<<<< --> NaN
<<4 --> NaN
<<4.9 --> NaN
<4.9 --> 4.9
nan --> NaN
NA --> NaN
>> --> NaN
. --> NaN
+ --> NaN
- --> NaN
...9898d7afs33 --> NaN
#32211 --> NaN
.. --> NaN
$44 --> NaN
33$ --> NaN

REGEX NOTES:

^ Match beginning of string.
(?!\W+$) Negative lookahead assertion (?!...) Matches if the string from beginning ^ to end $ DOES NOT contain only (1 or more +) non-alphanumeric and non-underscore characters \W. Does not consume any characters.
[^\w.#$+-]? Positive character class [...] matches 0 or 1 (+) characters that is alphanumerical or underscore character (\w), literal ., #, $, +, -.
( Begin capture group (...), group \1 for the float.
[+-]? Match 0 or 1 + or - characters.
\d* Match 0 or more digits \d.
\.? Match 0 or 1 literal ..
\d* Match 0 or more digits \d.
) Close capture group, group \1.
$ Matches end of string.
The regex replacement string r"\1" replaces the entire match with the characters captured in group 1 (\1).

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

python - Replace values in series with first regex group if found, if not leave nan - Stack Overflow

4 Answers 4

与本文相关的文章

评论列表(0)