Say I have a dataframe such as
df = pd.DataFrame(data = {'col1': [1,np.nan, '>3', 'NA'], 'col2':["3.","<5.0",np.nan, 'NA']})
out:
col1 col2
0 1 3.
1 NaN <5.0
2 >3 NaN
3 NA NA
What I would like is to strip stuff like "<" or ">", and get to floats only
out:
col1 col2
0 1 3.
1 NaN 5.0
2 3 NaN
3 NaN NaN
I thought about something like
df['col2'].replace({'.*' : r"[-+]?(?:\d*\.*\d+)"}, regex=True, inplace=True)
the idea being, replace anything with the regex for float, (I think), but this fails
error: bad escape \d at position 8
I tried along the lines of
df['col2'].replace({r"[-+]?(?:\d*\.*\d+)": r"\1"}, regex=True, inplace=True)
assuming ">"or "<" come before the float number, but then this fails (does not catch anything) if the field is a string or np.nan
.
Any suggestions please?
Say I have a dataframe such as
df = pd.DataFrame(data = {'col1': [1,np.nan, '>3', 'NA'], 'col2':["3.","<5.0",np.nan, 'NA']})
out:
col1 col2
0 1 3.
1 NaN <5.0
2 >3 NaN
3 NA NA
What I would like is to strip stuff like "<" or ">", and get to floats only
out:
col1 col2
0 1 3.
1 NaN 5.0
2 3 NaN
3 NaN NaN
I thought about something like
df['col2'].replace({'.*' : r"[-+]?(?:\d*\.*\d+)"}, regex=True, inplace=True)
the idea being, replace anything with the regex for float, (I think), but this fails
error: bad escape \d at position 8
I tried along the lines of
df['col2'].replace({r"[-+]?(?:\d*\.*\d+)": r"\1"}, regex=True, inplace=True)
assuming ">"or "<" come before the float number, but then this fails (does not catch anything) if the field is a string or np.nan
.
Any suggestions please?
Share Improve this question asked Mar 20 at 20:53 user37292user37292 4044 silver badges12 bronze badges 3 |4 Answers
Reset to default 3This should work. Logic: First match, but do not include in the capture group \1
, anything we do not want to keep at the beginning of the number with the negated character class [^...]
.
Python (re module flavor regex):
pattern = "^[^\w+-\.]*([+-]?\d*\.?\d*)"
replacement = \1
Regex Demo: https://regex101/r/Halz4X/2
NOTES:
^
Matches the beginning of the string.[^\w+-\.]*
*Negated class[^...]
that matches anything that is not an alphanumeric or underscore,_
, character (\w
), or a literal+
,-
or.
, 0 or more times (*
).(
Begin first (and only) capture group, referred to in the replacement string with\1
.- [+-]?\d*.?\d*
[+-]?
Match 0 or 1 (?
)+
or-
\d*
Match 0 or more (*
) digits\d
.\.
Match a literal dot,.
.\d*
Match 0 or more (*
) digits\d
.
)
End capture (group 1\1
).- Replacement string:
\1
replaces the matched string with only the characters in the capture group 1.
Try to run the code below
import re
p = repile("\<|\>")
df['col1'] = df['col1'].apply(lambda x: p.sub("", str(x)) if p.findall(str(x)) else x)
df['col2'] = df['col2'].apply(lambda x: p.sub("", str(x)) if p.findall(str(x)) else x)
Then you have the result.
out:
col1 col2
0 1 3.
1 NaN 5.0
2 3 NaN
3 NA NA
and if you would like to get float type, you can try this.
import numpy as np
df['col1'] = df['col1'].apply(lambda x: np.nan if x=='NA' else float(x))
df['col2'] = df['col2'].apply(lambda x: np.nan if x=='NA' else float(x))
Then you have the result as well.
out:
col1 col2
0 1.0 3.0
1 NaN 5.0
2 3.0 NaN
3 NaN NaN
I suggest using:
import numpy as np
import pandas as pd
import re
df = pd.DataFrame(data = {'col1': [1,np.nan, '>3', 'NA'], 'col2':["3.","<5.0",np.nan, 'NA']})
# from ">|<" fields match only numbers to group 1
pattern=repile(r"^[<>]([+-]?[0-9]*\.?[0-9]*)$")
(
df
# remove ">" and "<" with technique suggested by rich neadle
.replace(pattern,r"\1",regex=True)
# map litearl "NA" to numpys Not a Number
.replace("NA",np.nan)
# cast float type across dataframe
.astype(float)
)
This Python code produces the desired results on a list named col2
. This may give you an idea on how this could be done with the dataframes. Here I am just working with a list. (Note: In my other answer, I focused only on getting the floats, but did not address the requirement of having all other elements replaced with NaN
.)
APPROACH: Get every element from the col2
. Check to see if it is an element with a float, replace the value of the element with the float. If the element does not contain a float, replace it with literal NaN
.
ASSUMPTIONS:
- The string element containing a float has a string begins with 0 or 1 characters that is NOT an alphanumerical character, or underscore (
\w
),_
,.
,#
,$
,+
, or-
; Followed by 0 or 1+
or-
characters; - Followed by 0 or more digits (
\d
), followed by 0 or 1 literal.
, followed by 0 or more digits. - The string ends in a digit.
- It may not consist of only of non-alphanumeric or non-underscore characters.
PYTHON CODE (re
regex module):
import re
col2 = ["1", "np.nan", "3", "NA", "3.", "<5.0", ">-5", "-5", ">>>>>>5", "<<<<", "<<4", "<<4.9", "<4.9", "nan", "NA", ">>", ".", "+", "-", "...9898d7afs33", "#32211", "..", "$44", "33$"]
pattern_to_find_floats = r'^(?!\W+$)[^\w.#$+-]?([+-]?\d*\.?\d*)$'
pattern = repile(pattern_to_find_floats)
def repl(match1, pattern=pattern):
if re.match(pattern, match1.group(0) ):
# IF the string matches the pattern_to_find_floats. Return the updated string with float with optional +/- sign.
return re.sub(pattern, r"\1", match1.group(0))
else:
return 'NaN'
updated_col2 = []
for item in col2:
# SYNTAX: updated_string = re.sub(pattern, replacement, original_string)
updated_item = re.sub(r".*", repl, item)
updated_col2.append(updated_item)
print(f"Original col2: {col2}")
print(f"Updated col2: {updated_col2}")
Regex Demo (pattern_to_find_floats): https://regex101/r/7HfIly/6
RESULT:
Original col2: ['1', 'np.nan', '3', 'NA', '3.', '<5.0', '>-5', '-5', '>>>>>>5', '<<<<', '<<4', '<<4.9', '<4.9', 'nan', 'NA', '>>', '.', '+', '-', '...9898d7afs33', '#32211', '..', '$44', '33$']
Updated col2: ['1', 'NaN', '3', 'NaN', '3.', '5.0', '-5', '-5', 'NaN', 'NaN', 'NaN', 'NaN', '4.9', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN']
CONVERSION FROM TO:
Original --> Replacement
1 --> 1
np.nan --> NaN
3 --> 3
NA --> NaN
3. --> 3.
<5.0 --> 5.0
>-5 --> -5
-5 --> -5
>>>>>>5 --> NaN
<<<< --> NaN
<<4 --> NaN
<<4.9 --> NaN
<4.9 --> 4.9
nan --> NaN
NA --> NaN
>> --> NaN
. --> NaN
+ --> NaN
- --> NaN
...9898d7afs33 --> NaN
#32211 --> NaN
.. --> NaN
$44 --> NaN
33$ --> NaN
REGEX NOTES:
^
Match beginning of string.(?!\W+$)
Negative lookahead assertion(?!...)
Matches if the string from beginning^
to end$
DOES NOT contain only (1 or more+
) non-alphanumeric and non-underscore characters\W
. Does not consume any characters.[^\w.#$+-]?
Positive character class[...]
matches 0 or 1 (+
) characters that is alphanumerical or underscore character (\w
), literal.
,#
,$
,+
,-
.(
Begin capture group(...)
, group\1
for the float.[+-]?
Match 0 or 1+
or-
characters.\d*
Match 0 or more digits\d
.\.?
Match 0 or 1 literal.
.\d*
Match 0 or more digits\d
.)
Close capture group, group\1
.$
Matches end of string.- The regex replacement string
r"\1"
replaces the entire match with the characters captured in group 1 (\1
).
{r"^.*?([-+]?\d*\.?\d+).*" : r"\1"}
– Wiktor Stribiżew Commented Mar 20 at 21:25(...)
that you could refer to in the replacement string with the\
. Also, the\d*\.\d+
requires that you have a 1 or more digits (+
) after the literal dot, so the pattern cannot match3.
. And, the\.*
matches 0 or more (*
) literal dots, you want to do\.?
which will match 0 or 1 literal dots. – rich neadle Commented Mar 20 at 21:29