pandas - Is there a way to break down a log file into columns using python

I am working on a school project in which I am supposed to strip unnecessary data from a log file, and we were told to use Python. First, I need to split the log data into columns; everything else comes after that.

So this is the solution I came up with, though it's not quite working:

import pandas as pd
import pytz
from datetime import datetime
import re

def parse_str(x):
    if x is None:
        return '-'
    else:
        return x[1:-1]

def parse_datetime(x):
    try:
        dt = datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
        dt_tz = int(x[-6:-3])*60+int(x[-3:-1])
        return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))
    except ValueError:
        return datetime.now()

def parse_int(x):
    return int(x) if x is not None else 0

data = pd.read_csv(
    'Log_jeden_den.log',
    sep=r'\s(?=(?:[^"]*"[^"]/")*[^"]*$)(?![^\\[]*\\])',
    engine='python',
    na_values='-',
    header=None,
    usecols=['ip', 'request', 'status', 'size', 'referer', 'user_agent'],
    names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
    converters={'time': parse_datetime,
                'request': parse_str,
                'status': parse_int,
                'size': parse_int,
                'referer': parse_str,
                'user_agent': parse_str})
print(data.head())

This is what I get:

Basically I need each part of the log to be split into the mentioned columns.

Log file looks like this:

193.87.12.30 - - [19/Feb/2020:06:25:50 +0100] "GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"

193.87.12.30 - - [19/Feb/2020:06:25:55 +0100] "GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"

193.87.12.30 - - [19/Feb/2020:06:25:56 +0100] "GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"

193.87.12.30 - - [19/Feb/2020:06:25:57 +0100] "GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"

193.87.12.30 - - [19/Feb/2020:06:25:49 +0100] "GET / HTTP/1.1" 200 20925 "-" "libwww-perl/6.08"

23.100.232.233 - - [19/Feb/2020:06:25:49 +0100] "GET /media-a-marketing/dianie-na-univerzite/kalendar-udalosti/815-den-otvorenych-dveri-2018 HTTP/1.1" 200 26802 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)"

193.87.12.30 - - [19/Feb/2020:06:25:46 +0100] "GET / HTTP/1.1" 200 20925 "-" "libwww-perl/6.08"

I tried this:

usecols=[0, 3, 4, 5, 6, 7, 8]

But then I get a ParserError.

To sum up: I need to break the log data down into columns. My code partly does that, but incorrectly. It creates the columns but does not distribute the data between them; everything ends up in the first column. Naming the columns didn't help, and selecting them by number caused a ParserError.

I know I can just open the file in Excel and split it into columns, but I would really like to do it the "proper", hard way. Is that possible? Thanks in advance for any advice.

asked Mar 9 at 9:54 by o_d3n1s_o, edited Mar 9 at 20:12

You need to provide a sample of a log file (as text - not an image) and explain what output you expect (from that sample) versus what you're actually getting – Adon Bilivit Commented Mar 9 at 11:04
The log file sample provided – o_d3n1s_o Commented Mar 10 at 12:10
Please avoid images whenever you can. Useful website: idownvotedbecau.se/imageofcode. Please format your log file, possibly also in a code block. – Daraan Commented Mar 16 at 10:01


1 Answer

data = pd.read_csv(
    'Log_jeden_den.log',
    sep=r'\s+(?=(?:[^"]*"[^"]*")*[^"]*$)(?=(?:[^\[]*\[[^\]]*\])*[^\]]*$)',
    engine='python',
    na_values='-',
    header=None,
    usecols=[0, 3, 4, 5, 6, 7, 8],
    names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
    converters={'time': parse_datetime,
                'request': parse_str,
                'status': parse_int,
                'size': parse_int,
                'referer': parse_str,
                'user_agent': parse_str})
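
The answer's code comes without explanation, so here is a minimal sketch (not part of the original answer) that compares the two separators on one sample line from the question, using only the standard `re` module. The names `OLD_SEP`, `NEW_SEP` and `SAMPLE` are made up for illustration. It assumes pandas splits each line with the regex much as `re.split` does, which is my understanding of the Python engine but is not something the answer states.

```python
import re

# Separator from the question. Its first lookahead looks for the literal
# sequence quote, one character, slash, quote, which never occurs in this log.
OLD_SEP = r'\s(?=(?:[^"]*"[^"]/")*[^"]*$)(?![^\\[]*\\])'

# Separator from the answer: split on whitespace only when it is followed by
# an even number of double quotes (so we are outside a quoted field) and by
# no unmatched closing bracket (so we are outside the [timestamp] field).
NEW_SEP = r'\s+(?=(?:[^"]*"[^"]*")*[^"]*$)(?=(?:[^\[]*\[[^\]]*\])*[^\]]*$)'

# One line copied from the sample log in the question.
SAMPLE = ('193.87.12.30 - - [19/Feb/2020:06:25:50 +0100] '
          '"GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"')

old_fields = re.split(OLD_SEP, SAMPLE)
new_fields = re.split(NEW_SEP, SAMPLE)

print(len(old_fields))  # number of fields found by the question's separator
print(len(new_fields))  # number of fields found by the answer's separator
for index, field in enumerate(new_fields):
    print(index, field)
```

On this sample the question's separator never matches, so the whole line stays in one field. That is consistent with everything landing in the first column, and with `usecols=[0, 3, 4, 5, 6, 7, 8]` failing because columns 3 to 8 do not exist. The answer's separator yields nine fields. Positions 1 and 2 are the two `-` placeholders, which is why `usecols` skips them and the seven `names` line up with positions 0 and 3 to 8.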
