I have a large dataframe containing logs of users on a website, and I need to find the duration of each visit for each user.
I have 3.5 million rows and 450k unique users.
Here is my code:
import dateutil.parser
import numpy as np

temp = df["server.REMOTE_ADDR"]          # main df with timestamps and IP addresses
user_db = df["server.REMOTE_ADDR"]       # df with all IP addresses
user_db = user_db.drop_duplicates()      # drop duplicate IPs
time_thresh = 15 * 60                    # if a user is inactive for 15 minutes, it's a new visit
temp_moyen = []                          # list for mean durations
temp_min = []                            # list for minimal durations
temp_max = []                            # list for max durations
nb_visites = []                          # list for numbers of visits

for k, user in enumerate(user_db.values):  # for each user
    print("User {}/{}".format(k + 1, len(user_db.values)))
    t0 = []  # times of beginning of visits
    tf = []  # times of end of visits
    times_db = df[temp == user]["server.date"].values  # retrieve all timestamps for current user
    times_db = [dateutil.parser.parse(times) for times in times_db]  # parse to datetime
    i = 1
    last_t = times_db[0]
    delta = 0
    while i < len(times_db):  # while there is still a timestamp in the list
        t0.append(times_db[i - 1])  # begin a new visit
        delta = 0
        while delta < time_thresh and i < len(times_db):  # while not inactive for 15 minutes
            delta = (times_db[i] - last_t).total_seconds()
            last_t = times_db[i]
            i += 1
        if i != len(times_db):  # if not the last run
            tf.append(times_db[i - 2])
        else:  # if no more timestamps, record the last one as the end of the last visit
            tf.append(times_db[-1])
    if len(times_db) <= 1:  # if only one timestamp, tf = t0 (a visit of duration 0)
        t0.append(times_db[-1])
        tf.append(times_db[-1])
    diff = [(final - first).total_seconds() for first, final in zip(t0, tf)]  # diff between each t0 and tf
    temp_moyen.append(np.mean(diff))  # append the stats to the lists
    temp_min.append(np.min(diff))
    temp_max.append(np.max(diff))
    nb_visites.append(len(diff))

user_db = user_db.to_frame()  # convert to dataframe
user_db["temp_moyen"] = temp_moyen  # add a column for each statistic (mean, min, max, number of visits)
user_db["temp_min"] = temp_min
user_db["temp_max"] = temp_max
user_db["nb_visites"] = nb_visites
This code works, but it is very slow: 200 users/minute on my computer. What can I do to:
identify the bottleneck? (see the profiling sketch after this list)
speed it up?
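For the first question, a minimal sketch using the standard library profiler; process_users is a hypothetical wrapper around the per-user loop above, and in IPython the %prun and %timeit magics give the same information interactively:

import cProfile
import pstats

# profile one full run and dump the stats to a file;
# process_users is a hypothetical function wrapping the per-user loop
cProfile.run("process_users(df)", "profile.out")

# print the 10 entries with the largest cumulative time
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(10)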
EDIT: As requested, my data look like this: for each user, I have a list of timestamps: [100, 101, 104, 106, 109, 200, 209, 211, 213]
I need to find how many visits a single user made; e.g. in this case, it would represent two visits, 100-109 and 200-213. The first visit lasted 9, the second lasted 13, so I can compute the mean, min, and max of the visit durations.
EDIT 2: The bottleneck is here (277 ms out of 300 ms per loop):
times_db = df[temp == user]["server.date"].values  # retrieve all timestamps for current user
I put it in a list comprehension before the for loop, but it is still slow:
times_db_all = [df[temp == user]["server.date"].values for user in user_db.values]

%timeit times_db_all = [df_temp[temp == user]["server.date"].values for user in user_db.values[0:3]]
1 loops, best of 3: 848 ms per loop  # 848 ms for just 3 users!

My dataframe looks like this:
user_ip  | server.date
1.1.1.1    datetime.datetime(2017, 1, 3, 0, 0, 3, tzinfo=tzutc()),
1.1.1.1    datetime.datetime(2017, 1, 4, 1, 7, 30, tzinfo=tzutc()),
3.3.3.3    datetime.datetime(2017, 1, 4, 5, 58, 52, tzinfo=tzutc()),
1.1.1.1    datetime.datetime(2017, 1, 10, 16, 22, 56, tzinfo=tzutc())
4.4.4.4    datetime.datetime(2017, 1, 10, 16, 23, 1, tzinfo=tzutc())
....

Answer:
To continue from my comment about removing the loop: as I see it, you have a bunch of activity timestamps, and you are assuming that as long as these timestamps are close together they relate to a single visit, and otherwise they represent different visits. For example, [100, 101, 104, 106, 109, 200, 209, 211, 213] would represent two visits, 100-109 and 200-213. To speed this up, you could do the following using scipy:
import scipy

cutoff = 15
times = scipy.array([100, 101, 104, 106, 109, 200, 209, 211, 213, 300, 310, 325])
delta = times[1:] - times[:-1]
which = delta > cutoff            # identifies which gaps represent a new visit
N_visits = which.sum() + 1        # note the +1 for 'fence post'
L_boundaries = scipy.zeros((N_visits,))  # generating these arrays might be unnecessary and relatively slow
R_boundaries = scipy.zeros((N_visits,))
L_boundaries[0] = times[0]        # the first visit starts at the first timestamp
R_boundaries[-1] = times[-1]      # the last visit ends at the last timestamp
L_boundaries[1:] = times[1:][which]
R_boundaries[:-1] = times[:-1][which]
visit_lengths = R_boundaries - L_boundaries
This can probably be made even faster, but it should already be a lot faster than your current loop.
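As a sanity check on the sample data: with cutoff = 15, the only gaps above the cutoff are 200 - 109 = 91 and 300 - 213 = 87, so which has two True entries, N_visits is 3, the boundaries come out as L = [100, 200, 300] and R = [109, 213, 325], and visit_lengths is [9., 13., 25.].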
The following is probably a little faster, at the expense of clarity in the code:
import scipy

cutoff = 15
times = scipy.array([100, 101, 104, 106, 109, 200, 209, 211, 213, 300, 310, 325])
which = times[1:] - times[:-1] > cutoff
N_visits = which.sum() + 1  # fence post
visit_lengths = scipy.zeros((N_visits,))  # it is probably inevitable to have to generate this new array
visit_lengths[0] = times[:-1][which][0] - times[0]
visit_lengths[1:-1] = times[:-1][which][1:] - times[1:][which][:-1]
visit_lengths[-1] = times[-1] - times[1:][which][-1]
I also think that if you don't care too much about the first and last visits, it might be worth simply ignoring them.
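A sketch of that idea, written with numpy (which is what scipy.array delegates to anyway): dropping the first and last visits leaves a single vectorized expression with no boundary handling at all:

import numpy as np

cutoff = 15
times = np.array([100, 101, 104, 106, 109, 200, 209, 211, 213, 300, 310, 325])
which = np.diff(times) > cutoff

# ends of the interior visits minus their starts; the first and last visits are dropped
interior_lengths = times[:-1][which][1:] - times[1:][which][:-1]
print(interior_lengths)  # [13] -- only the 200-213 visit remains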
EDIT based on OP EDIT
You should maybe look at http://pandas.pydata.org/pandas-docs/stable/indexing.html. I think what is slow is that you are making a copy of part of your dataframe for every user: df[temp == user] builds a new dataframe and stores it as times_db. It might be faster to put the resulting values into a numpy array instead. You could also perform the parsing to datetime once, for the whole dataframe, before the loop.
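To make that concrete, here is a sketch of a groupby-based version, under a few assumptions: the column names from the question, tz-naive timestamps after a single pd.to_datetime pass, and the same 15-minute threshold. groupby partitions the frame once, so the 450k per-user boolean-mask scans disappear:

import numpy as np
import pandas as pd

time_thresh = 15 * 60  # seconds of inactivity that start a new visit

# parse once, for the whole dataframe, instead of per user inside the loop
df["server.date"] = pd.to_datetime(df["server.date"])

def visit_stats(dates):
    # dates: all timestamps of one user, as a pandas Series
    t = np.sort(dates.to_numpy())
    gaps = np.diff(t) / np.timedelta64(1, "s")  # gaps between consecutive hits, in seconds
    which = gaps > time_thresh                  # True where a new visit starts
    n_visits = which.sum() + 1
    starts = np.empty(n_visits, dtype=t.dtype)
    ends = np.empty(n_visits, dtype=t.dtype)
    starts[0], ends[-1] = t[0], t[-1]
    starts[1:] = t[1:][which]
    ends[:-1] = t[:-1][which]
    durations = (ends - starts) / np.timedelta64(1, "s")
    return pd.Series({"temp_moyen": durations.mean(),
                      "temp_min": durations.min(),
                      "temp_max": durations.max(),
                      "nb_visites": n_visits})

user_db = df.groupby("server.REMOTE_ADDR")["server.date"].apply(visit_stats).unstack()

As a bonus, this handles users with a single timestamp gracefully (one visit of duration 0) instead of needing a special case.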