python - Should we pre-calculate scalar calculations before we apply them to dataframe columns? - Stack Overflow

Is option (b) more efficient than option (a)? At first glance, option (a) performs several times more operations than option (b). But when I ran some simulations on a DataFrame with a million rows, option (b) was only a fraction faster on average. Does this mean that pandas automatically groups all the scalar operations in option (a)?

(a) Variables a, b, c, d, e, and f are all scalars.

    df['val2'] = (a*b+c*d)*df['val1']*e/f

(b)

    x = (a*b+c*d)*e/f
    df['val2'] = df['val1']*x


1 Answer

Yes, it is better to pre-compute x. pandas does not regroup the scalar operations for you; what matters is operator precedence and the order in which Python performs the operations.

Assuming s is your Series, when you run (a*b+c*d)*s*e/f you perform two multiplications and one division on the full Series, because * and / have the same precedence and are evaluated left to right. If you pre-compute x or write (a*b+c*d)*e/f*s, only one operation involves the Series.
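To make the left-to-right evaluation visible, here is a minimal sketch that counts how many operations actually touch the Series. `CountingSeries` and `count_series_ops` are hypothetical helpers written for this illustration only; they are not part of pandas.

```python
import numpy as np
import pandas as pd


class CountingSeries:
    """Hypothetical wrapper (not a pandas class) that counts Series-level operations."""

    ops = 0  # shared counter, reset before each measurement

    def __init__(self, data):
        self.data = data

    def _apply(self, func):
        # Every call here corresponds to one full pass over the Series.
        CountingSeries.ops += 1
        return CountingSeries(func(self.data))

    def __mul__(self, other):
        return self._apply(lambda s: s * other)

    # scalar * wrapper falls back to the reflected method
    __rmul__ = __mul__

    def __truediv__(self, other):
        return self._apply(lambda s: s / other)


def count_series_ops(expr):
    """Evaluate expr on a small wrapped Series and return the operation count."""
    CountingSeries.ops = 0
    expr(CountingSeries(pd.Series(np.arange(5))))
    return CountingSeries.ops


a = b = c = d = e = f = 2

# Series appears in the middle: every later operator works on the Series.
ungrouped = count_series_ops(lambda s: (a*b + c*d) * s * e / f)

# Scalars are fully combined before the Series is touched.
grouped = count_series_ops(lambda s: (a*b + c*d) * e / f * s)

print(ungrouped, grouped)  # 3 1
```

The counts line up with the explanation above: three Series-level operations in the original expression, one when the scalars are combined first.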

Example:

    %timeit x*s
    1.19 ms ± 73.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

    %timeit (a*b+c*d)*s*e/f
    3.45 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %timeit s*(a*b+c*d)*e/f
    3.63 ms ± 84.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    # now let's force the scalar operation to be grouped
    %timeit s*((a*b+c*d)*e/f)
    1.21 ms ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

    %timeit (a*b+c*d)*e/f*s
    1.14 ms ± 80.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Setup:

    import numpy as np
    import pandas as pd

    s = pd.Series(np.arange(1_000_000))
    a = b = c = d = e = f = 2
    x = (a*b + c*d) * e / f  # all scalars combined once, up front

In the initial (a*b+c*d)*df['val1']*e/f, the order of the operations is:

    a*b         # ab      (scalar)
    c*d         # cd      (scalar)
    ab + cd     # abcd    (scalar)
    abcd * s    # sabcd   (Series)
    sabcd * e   # esabcd  (Series)
    esabcd / f  # result  (Series)
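For readers outside IPython, the comparison can be reproduced roughly with the standard `timeit` module. This is only a sketch: the function names `ungrouped` and `grouped` are made up for this example, and the absolute timings will differ by machine and pandas version. The results are compared with `np.allclose` rather than strict equality, since regrouping floating-point operations can change the last few bits in general.

```python
import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000))
a = b = c = d = e = f = 2


def ungrouped():
    # Three passes over the Series: abcd * s, then * e, then / f.
    return (a*b + c*d) * s * e / f


def grouped():
    # One pass over the Series: the scalar factor is computed first.
    x = (a*b + c*d) * e / f
    return s * x


# Both forms should agree up to floating-point rounding.
same = np.allclose(ungrouped(), grouped())

# Best of 5 repeats, 20 calls each, reported as seconds per call.
t_ungrouped = min(timeit.repeat(ungrouped, number=20, repeat=5)) / 20
t_grouped = min(timeit.repeat(grouped, number=20, repeat=5)) / 20

print(f"same result: {same}")
print(f"ungrouped: {t_ungrouped * 1e3:.2f} ms per call")
print(f"grouped:   {t_grouped * 1e3:.2f} ms per call")
```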