Just curious whether option (b) is more efficient than option (a). At first glance, option (a) performs several times more operations than option (b). But when I ran some simulations with a million rows in df, option (b) was only slightly faster on average. Does that mean pandas automatically groups all the scalar operations in option (a)?
(a) The variables a, b, c, d, e, f are all scalars.
df['val2'] = (a*b+c*d)*df['val1']*e/f
(b)
x = (a*b+c*d)*e/f
df['val2'] = df['val1']*x
1 Answer
Yes, it is better to pre-compute x. What actually matters is operator precedence and the order in which the operations are performed.
Assuming s is your Series, when you run (a*b+c*d)*s*e/f you perform two multiplications and one division on the full Series. If you pre-compute x, or write (a*b+c*d)*e/f*s, then only one operation involves the Series.
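To make the operation count concrete, here is a minimal pure-Python sketch. `CountingVector` and `count_passes` are hypothetical names invented for this illustration; the class is only a toy stand-in for a Series, not how pandas is implemented. It counts how many times an expression walks over the whole data.

```python
class CountingVector:
    """Toy stand-in for a Series (hypothetical): counts element-wise passes."""
    passes = 0  # class-level counter shared by all instances

    def __init__(self, values):
        self.values = list(values)

    def _apply(self, func):
        # Every vector operation touches all elements once.
        CountingVector.passes += 1
        return CountingVector(func(v) for v in self.values)

    def __mul__(self, k):
        return self._apply(lambda v: v * k)

    __rmul__ = __mul__  # scalar * vector falls back to the same element-wise product

    def __truediv__(self, k):
        return self._apply(lambda v: v / k)


def count_passes(expr):
    """Reset the counter, evaluate expr(), return (passes, resulting values)."""
    CountingVector.passes = 0
    result = expr()
    return CountingVector.passes, result.values


a = b = c = d = e = f = 2
s = CountingVector(range(5))

ungrouped_passes, ungrouped = count_passes(lambda: (a*b + c*d) * s * e / f)
grouped_passes, grouped = count_passes(lambda: (a*b + c*d) * e / f * s)

print(ungrouped_passes, grouped_passes)  # 3 1
```

Python evaluates the expression one binary operation at a time, so the vector type only ever sees a single operation and has no chance to regroup the scalars. The same reasoning applies to a pandas Series.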
Example:
%timeit x*s
1.19 ms ± 73.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit (a*b+c*d)*s*e/f
3.45 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit s*(a*b+c*d)*e/f
3.63 ms ± 84.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# now let's force the scalar operation to be grouped
%timeit s*((a*b+c*d)*e/f)
1.21 ms ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit (a*b+c*d)*e/f*s
1.14 ms ± 80.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Setup:
import numpy as np
import pandas as pd
s = pd.Series(np.arange(1_000_000))
a = b = c = d = e = f = 2
x = (a*b+c*d)*e/f
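The %timeit lines only work in IPython or Jupyter. As a rough sketch, the comparison could be reproduced in a plain script with the standard `timeit` module. The repeat counts and the `candidates` and `timings` names are arbitrary choices for this illustration, and the timings will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000))
a = b = c = d = e = f = 2

candidates = {
    "ungrouped": lambda: (a*b + c*d) * s * e / f,    # three Series operations
    "grouped": lambda: s * ((a*b + c*d) * e / f),    # one Series operation
}

# Both forms should produce the same values for these inputs.
assert np.allclose(candidates["ungrouped"](), candidates["grouped"]())

timings = {}
for label, func in candidates.items():
    # Best of 5 repeats, 20 calls each, reported as seconds per call.
    timings[label] = min(timeit.repeat(func, repeat=5, number=20)) / 20
    print(f"{label}: {timings[label] * 1e3:.2f} ms per call")
```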
In the initial (a*b+c*d)*df['val1']*e/f, the order of the operations is:
a*b         # ab      #
c*d         # cd      # scalars
ab + cd     # abcd    #
s * abcd    # sabcd   #
e * sabcd   # esabcd  # Series
esabcd / f  #         #
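The evaluation order listed above can be checked with a small tracing sketch. `Traced` is a hypothetical helper written for this illustration: a symbolic operand that logs each binary operation as Python performs it and records whether the result would be a scalar or a Series.

```python
class Traced:
    """Symbolic operand (hypothetical) that logs each binary operation."""
    log = []

    def __init__(self, name, is_series=False):
        self.name = name
        self.is_series = is_series

    def _binary(self, other, symbol):
        # The result is Series-sized as soon as either operand is a Series.
        is_series = self.is_series or other.is_series
        kind = "Series" if is_series else "scalar"
        Traced.log.append(f"{self.name} {symbol} {other.name} -> {kind}")
        return Traced(f"({self.name} {symbol} {other.name})", is_series)

    def __mul__(self, other):
        return self._binary(other, "*")

    def __add__(self, other):
        return self._binary(other, "+")

    def __truediv__(self, other):
        return self._binary(other, "/")


a, b, c, d, e, f = (Traced(n) for n in "abcdef")
s = Traced("s", is_series=True)

result = (a*b + c*d) * s * e / f
for step in Traced.log:
    print(step)
```

The log contains six steps: the first three produce scalars and the last three produce a Series, which matches the breakdown above.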