pandas - Is there a way to break down a log file into columns using python - Stack Overflow

I am working on a school project in which I am supposed to clip unnecessary data from a log file, and we were told to use Python. First, I am supposed to divide the log data into columns and then proceed with everything else.

So this is the solution I came up with, though it's not quite working:

import pandas as pd
import pytz
from datetime import datetime
import re

def parse_str(x):
    if x is None:
        return '-'
    else:
        return x[1:-1]

def parse_datetime(x):
    try:
        dt = datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
        dt_tz = int(x[-6:-3])*60+int(x[-3:-1])
        return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))
    except ValueError:
        return datetime.now()

def parse_int(x):
    return int(x) if x is not None else 0

data = pd.read_csv(
    'Log_jeden_den.log',
    sep=r'\s(?=(?:[^"]*"[^"]/")*[^"]*$)(?![^\\[]*\\])',
    engine='python',
    na_values='-',
    header=None,
    usecols=['ip', 'request', 'status', 'size', 'referer', 'user_agent'],
    names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
    converters={'time': parse_datetime,
                'request': parse_str,
                'status': parse_int,
                'size': parse_int,
                'referer': parse_str,
                'user_agent': parse_str})
print(data.head())

This is what I get:

Basically I need each part of the log to be split into the mentioned columns.

The log file looks like this:

193.87.12.30 - - [19/Feb/2020:06:25:50 +0100] "GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"

193.87.12.30 - - [19/Feb/2020:06:25:55 +0100] "GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"

193.87.12.30 - - [19/Feb/2020:06:25:56 +0100] "GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"

193.87.12.30 - - [19/Feb/2020:06:25:57 +0100] "GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"

193.87.12.30 - - [19/Feb/2020:06:25:49 +0100] "GET / HTTP/1.1" 200 20925 "-" "libwww-perl/6.08"

23.100.232.233 - - [19/Feb/2020:06:25:49 +0100] "GET /media-a-marketing/dianie-na-univerzite/kalendar-udalosti/815-den-otvorenych-dveri-2018 HTTP/1.1" 200 26802 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)"

193.87.12.30 - - [19/Feb/2020:06:25:46 +0100] "GET / HTTP/1.1" 200 20925 "-" "libwww-perl/6.08"

I tried this:

usecols=[0, 3, 4, 5, 6, 7, 8]

But I am getting an error:

To sum up: I need to break the log data down into columns. My code somewhat does that, but incorrectly: it creates the columns but does not divide the data between them, and instead puts everything into the first column. Naming the columns didn't help, and selecting them by number caused a ParserError.

I know I can just open the file in Excel and split it into columns, but I would really like to do it the "proper", hard way. Is it possible, though? Thanks in advance for any advice.

asked Mar 9 at 9:54 by o_d3n1s_o, edited Mar 9 at 20:12
  • You need to provide a sample of a log file (as text - not an image) and explain what output you expect (from that sample) versus what you're actually getting – Adon Bilivit Commented Mar 9 at 11:04
  • The log file sample provided – o_d3n1s_o Commented Mar 10 at 12:10
  • Please avoid images whenever you can: Useful website idownvotedbecau.se/imageofcode. Please format your log file, possible also in a code block. – Daraan Commented Mar 16 at 10:01

1 Answer

data = pd.read_csv(
    'Log_jeden_den.log',
    sep=r'\s+(?=(?:[^"]*"[^"]*")*[^"]*$)(?=(?:[^\[]*\[[^\]]*\])*[^\]]*$)',
    engine='python',
    na_values='-',
    header=None,
    usecols=[0, 3, 4, 5, 6, 7, 8],
    names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
    converters={'time': parse_datetime,
                'request': parse_str,
                'status': parse_int,
                'size': parse_int,
                'referer': parse_str,
                'user_agent': parse_str})
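
To see what the separator above does without involving pandas, it can be applied to a single line with `re.split`. The sketch below is only an illustration: `split_log_line`, `SEP`, `QUESTION_SEP` and `SAMPLE` are names made up for this example, and the sample line is copied from the question. The pattern splits on whitespace only when the rest of the line contains complete `"..."` pairs and complete `[...]` groups, so spaces inside the quoted request and inside the bracketed timestamp are left alone. The separator from the question is included for comparison; on this sample it appears never to match, which would be consistent with everything ending up in the first column.

```python
import re

# Separator from the answer above: whitespace that is outside double quotes
# and outside square brackets (each lookahead checks the rest of the line).
SEP = re.compile(
    r'\s+(?=(?:[^"]*"[^"]*")*[^"]*$)'
    r'(?=(?:[^\[]*\[[^\]]*\])*[^\]]*$)'
)

# Separator from the question, copied verbatim for comparison.
QUESTION_SEP = re.compile(r'\s(?=(?:[^"]*"[^"]/")*[^"]*$)(?![^\\[]*\\])')

# One sample line taken from the log excerpt in the question.
SAMPLE = (
    '193.87.12.30 - - [19/Feb/2020:06:25:50 +0100] '
    '"GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"'
)


def split_log_line(line, pattern=SEP):
    """Split one log line into raw fields using the given separator regex."""
    return pattern.split(line.strip())


fields = split_log_line(SAMPLE)
print(len(fields))   # number of fields produced by the answer's separator
for index, field in enumerate(fields):
    print(index, field)

# With the question's separator the line comes back unsplit.
print(len(split_log_line(SAMPLE, QUESTION_SEP)))
```

For this sample the answer's separator yields nine fields, indexed 0 to 8. Fields 1 and 2 are the two `-` placeholders, which is why `usecols=[0, 3, 4, 5, 6, 7, 8]` skips them and leaves seven columns to match the seven entries in `names`.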
