
Reducing the execution time of pandas code

Problem description

I have a large dataframe containing logs of users on a website, and I need to find the duration of each visit for each user.

I have 3.5 million rows and 450k unique users.

Here is my code:

temp=df["server.REMOTE_ADDR"]# main df with timestamps and ip adresses user_db = df["server.REMOTE_ADDR"]# df with all IP adresses user_db = user_db.drop_duplicates() # drop duplicate IP time_thresh = 15*60 # if user inactive for 15 minutes, it's a new visit temp_moyen=[] # array for mean times temp_min=[] # array for minimal time temp_max=[] # array for max time nb_visites=[] # array for number of visit for k,user in enumerate(user_db.values): # for each user print("User {}/{}").format(k+1,len(user_db.values)) t0=[] # time of beginning of visit tf=[] # time of end of visit times_db = df[temp == user]["server.date"].values # retrieve all timestamps for current user times_db = [dateutil.parser.parse(times) for times in times_db] # parse to datetime i=1 last_t = times_db[0] delta = 0 while i<len(times_db): # while there is still a timestamp in the list t0.append(times_db[i-1]) # begin the first visit delta=0 while (delta < time_thresh and i<len(times_db)): # while not inactive for 15 minutes delta = (times_db[i]-last_t).total_seconds() last_t = times_db[i] i+=1 if i!=len(times_db): #if not last run tf.append(times_db[i-2]) else: # if no more timestamp, record the last one as end of last visit tf.append(times_db[-1]) if len(times_db)<=1: # if only one timestamp, tf = t0 tf.append(times_db[-1]) diff=[(final-first).total_seconds() for first,final in zip(t0,tf)] # evaluate diff between each t0 and tf temp_moyen.append(np.mean(diff)) # add to the lists temp_min.append(np.min(diff)) temp_max.append(np.max(diff)) nb_visites.append(len(diff)) user_db=user_db.to_frame() # convert to dataframe user_db["temp_moyen"]=temp_moyen # add columns for each information (mean,min,max,number of visits) user_db["temp_min"]=temp_min user_db["temp_max"]=temp_max user_db["nb_visites"]=nb_visites

This code works, but it is very slow: 200 users/minute on my computer. What can I do to:

  • identify the bottleneck?

  • speed it up?

EDIT: As requested, my data look like this: for each user, I have a list of timestamps: [100, 101, 104, 106, 109, 200, 209, 211, 213]

I need to find how many visits a single user made, e.g. in this case it would represent two visits, 100-109 and 200-213. The first visit lasted 9, the second lasted 13, so I can get the mean, min and max of the visit durations. A small sketch of this rule follows below.
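To make the splitting rule concrete, here is a minimal plain-Python sketch of the logic described above; the threshold value and the sample timestamps come from the question, and the helper name split_visits is only for illustration:

import numpy as np

def split_visits(timestamps, time_thresh=15):
    """Split sorted timestamps into visits wherever the gap exceeds time_thresh."""
    visits = [[timestamps[0]]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > time_thresh:
            visits.append([cur])       # gap too large: a new visit starts
        else:
            visits[-1].append(cur)     # same visit continues
    return visits

times = [100, 101, 104, 106, 109, 200, 209, 211, 213]
durations = [v[-1] - v[0] for v in split_visits(times)]
print(durations)                                        # [9, 13]
print(np.mean(durations), np.min(durations), np.max(durations), len(durations))  # 11.0 9 13 2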

EDIT 2: The bottleneck is here (277 ms out of the 300 ms per loop):

times_db = df[temp == user]["server.date"].values # retrieve all timestamps for current user

I put it in a list comprehension before the for loop, but it is still slow:

times_db_all = [df[temp == user]["server.date"].values for user in user_db.values]

%timeit times_db_all = [df_temp[temp == user]["server.date"].values for user in user_db.values[0:3]]
1 loops, best of 3: 848 ms per loop   # 848 ms for 3 users!!
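One likely reason this is slow is that the boolean mask temp == user rescans all 3.5 million rows once per user, i.e. 450k full scans. A hedged sketch of grouping the frame by IP once instead, reusing the column names from the question:

import pandas as pd

# one pass over the 3.5M rows instead of one full scan per user
grouped = df.groupby("server.REMOTE_ADDR")["server.date"]

# per-user access is now cheap: iterate over (ip, timestamps) pairs
times_db_all = {user: ts.values for user, ts in grouped}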

My db looks like this:

user_ip  | server.date
1.1.1.1  | datetime.datetime(2017, 1, 3, 0, 0, 3, tzinfo=tzutc())
1.1.1.1  | datetime.datetime(2017, 1, 4, 1, 7, 30, tzinfo=tzutc())
3.3.3.3  | datetime.datetime(2017, 1, 4, 5, 58, 52, tzinfo=tzutc())
1.1.1.1  | datetime.datetime(2017, 1, 10, 16, 22, 56, tzinfo=tzutc())
4.4.4.4  | datetime.datetime(2017, 1, 10, 16, 23, 1, tzinfo=tzutc())
...

Recommended answer

To continue from my comment about removing the loop: as I see it, you have a bunch of activity timestamps, and you are assuming that as long as these timestamps are close together they relate to a single visit, and otherwise they represent different visits. For example, [100, 101, 104, 106, 109, 200, 209, 211, 213] would represent two visits, 100-109 and 200-213. To speed this up, you could do the following using scipy:

import scipy

cutoff = 15
times = scipy.array([100, 101, 104, 106, 109, 200, 209, 211, 213, 300, 310, 325])
delta = times[1:] - times[:-1]
which = delta > cutoff                     # identifies which gaps represent a new visit
N_visits = which.sum() + 1                 # note the +1 for 'fence post'
L_boundaries = scipy.zeros((N_visits,))    # generating these arrays might be unnecessary and relatively slow
R_boundaries = scipy.zeros((N_visits,))
L_boundaries[1:] = times[1:][which]
R_boundaries[:-1] = times[:-1][which]
L_boundaries[0] = times[0]                 # fill in the start of the first visit
R_boundaries[-1] = times[-1]               # and the end of the last one
visit_lengths = R_boundaries - L_boundaries
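For the sample times array above, with the first and last boundaries filled in as in the snippet, the result can be checked directly (my own verification, not part of the original answer):

print(visit_lengths)   # expected: [  9.  13.  25.]  -> visits 100-109, 200-213, 300-325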

This can probably be made even faster, but it is probably already a lot faster than your current loop.

The following is probably a little faster, at the expense of clarity in the code:

import scipy

cutoff = 15
times = scipy.array([100, 101, 104, 106, 109, 200, 209, 211, 213, 300, 310, 325])
which = times[1:] - times[:-1] > cutoff
N_visits = which.sum() + 1                  # fence post
visit_lengths = scipy.zeros((N_visits,))    # it is probably inevitable to have to generate this new array
visit_lengths[0] = times[:-1][which][0] - times[0]
visit_lengths[1:-1] = times[:-1][which][1:] - times[1:][which][:-1]
visit_lengths[-1] = times[-1] - times[1:][which][-1]

I also think that if you don't care too much about the first and last visits, it might be worth considering to just ignore these.

EDIT based on the OP's edit

You should maybe look at pandas.pydata/pandas-docs/stable/indexing.html. I think what is slow is the fact that you are making a copy of part of your dataframe for every user, that is, df[temp == user] makes a new dataframe and stores it as times_db; maybe it would be faster to put the resulting values into a numpy array? You could also perform the parsing to datetime for the whole dataframe first.
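Putting those two suggestions together (parse the dates once for the whole frame, and avoid building a sub-dataframe per user), one possible vectorised sketch looks like this; the column names and the 15-minute threshold come from the question, and the helper visit_stats is hypothetical:

import pandas as pd

TIME_THRESH = 15 * 60   # 15 minutes, as in the question

# parse to datetime once for the whole frame instead of once per user
df["server.date"] = pd.to_datetime(df["server.date"], utc=True)
df = df.sort_values(["server.REMOTE_ADDR", "server.date"])

def visit_stats(ts):
    """Mean/min/max visit duration (seconds) and visit count for one user's sorted timestamps."""
    new_visit = ts.diff().dt.total_seconds() > TIME_THRESH   # True where a gap starts a new visit
    visit_id = new_visit.cumsum()                            # label each timestamp with its visit
    bounds = ts.groupby(visit_id).agg(["min", "max"])        # first/last timestamp of each visit
    durations = (bounds["max"] - bounds["min"]).dt.total_seconds()
    return pd.Series({"temp_moyen": durations.mean(),
                      "temp_min": durations.min(),
                      "temp_max": durations.max(),
                      "nb_visites": len(durations)})

user_db = df.groupby("server.REMOTE_ADDR")["server.date"].apply(visit_stats).unstack()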
