The latest pandas version casts types into NumPy types. To cast a Series of integers to strings, I thought astype(str) would be enough:
import pandas as pd
import numpy as np
list_of_str = list(pd.Series([1234, 123, 345]).to_frame()[0].unique().astype(str))
list_of_str
returns [np.str_('1234'), np.str_('123'), np.str_('345')].
And likewise
list_of_str = list(pd.Series([1234, 123, 345]).to_frame()[0].unique().astype(np.str_))
list_of_str
returns [np.str_('1234'), np.str_('123'), np.str_('345')].
Is there an efficient way to cast to the Python str type that does not require a list comprehension, such as:
list_of_str = [str(s) for s in list_of_str]
list_of_str
which finally returns ['1234', '123', '345']?
2 Answers
If you want Python objects, there is no "efficient" way to perform the conversion. NumPy does not necessarily handle strings as efficiently as numeric data.
The list comprehension is a perfectly valid option. Alternatively:
list(map(str, pd.Series([1234, 123, 345]).unique()))
Or with drop_duplicates in place of unique:
pd.Series([1234, 123, 345]).drop_duplicates().astype(str).tolist()
Output:
['1234', '123', '345']
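To double-check that both alternatives give plain Python strings rather than np.str_, here is a minimal sketch. The variable names via_map and via_drop are only illustrative. It tests with type(x) is str rather than isinstance, because np.str_ is a subclass of str and would pass an isinstance check.

```python
import pandas as pd

s = pd.Series([1234, 123, 345])

# Option 1: map the built-in str over the NumPy array returned by unique()
via_map = list(map(str, s.unique()))

# Option 2: stay in pandas, then convert to a Python list
via_drop = s.drop_duplicates().astype(str).tolist()

# Exact type check: np.str_ would fail this, plain str passes
print(all(type(x) is str for x in via_map))
print(all(type(x) is str for x in via_drop))
print(via_map, via_drop)
```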
Speed comparison (working with strings in NumPy is not faster than a Python loop):
# initializing an array with 1M items
a = np.arange(1_000_000)
%timeit a.astype(str).tolist()
277 ms ± 95.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit [str(x) for x in a]
194 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list(map(str, a))
159 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
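The %timeit lines above only work in IPython or Jupyter. A rough equivalent for a plain Python script, using the standard timeit module, could look like the sketch below. The array is smaller than the 1M items above just to keep the run short, and the candidates dict is only an illustrative way to organize the three variants. Timings depend on the machine and on the NumPy version, so treat the numbers it prints as indicative only.

```python
import timeit

import numpy as np

# Smaller than the 1M-item array above, to keep the script quick
a = np.arange(100_000)

# The three variants compared above, wrapped so timeit can call them
candidates = {
    "a.astype(str).tolist()": lambda: a.astype(str).tolist(),
    "[str(x) for x in a]": lambda: [str(x) for x in a],
    "list(map(str, a))": lambda: list(map(str, a)),
}

# Sanity check: every variant should build the same list of strings
results = {name: fn() for name, fn in candidates.items()}
reference = results["list(map(str, a))"]
same_output = all(r == reference for r in results.values())
print("same output:", same_output)

# Best of 3 single runs per variant, reported in milliseconds
for name, fn in candidates.items():
    best = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{name}: {best * 1e3:.1f} ms")
```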
It's because series.unique() returns a numpy.ndarray object, and calling .astype(str) on a NumPy array returns a NumPy text type:
>>> pd.Series([1234, 123, 345]).unique()
array([1234, 123, 345])
>>> np.array([1234, 123, 345]).astype(str)
array(['1234', '123', '345'], dtype='<U21')
According to the NumPy docs, the Python type str is converted to the scalar type np.str_, so an array with dtype('U<length>') results.
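As a small sketch of what this means in practice: wrapping the array in list() iterates over it and yields NumPy scalars, while ndarray.tolist() converts each element to the nearest built-in Python type. The names as_scalars and as_builtin are only illustrative. As far as I know, the np.str_('...') display itself comes from the scalar repr change in NumPy 2, and np.str_ is still a subclass of str, so the values compare equal to ordinary strings either way.

```python
import numpy as np
import pandas as pd

arr = pd.Series([1234, 123, 345]).unique().astype(str)

# list(arr) iterates the array, so each element is a NumPy scalar (np.str_)
as_scalars = list(arr)

# arr.tolist() converts each element to a built-in Python object (str)
as_builtin = arr.tolist()

print(type(as_scalars[0]))
print(type(as_builtin[0]))

# np.str_ subclasses str, so the two lists still compare equal
print(isinstance(as_scalars[0], str))
print(as_scalars == as_builtin)
```

So if the goal is only to avoid the list comprehension, swapping list(...) for .tolist() on the array should be enough.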
to_frame then back to Series: .to_frame()[0]? – mozway, commented 2 days ago
[0]. – SeF, commented yesterday