最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

adding interval to multiple bin counts in pythonpandas - Stack Overflow

programmeradmin0浏览0评论

Context: Given a piece of paper with text/drawings sporadically printed on it, determine how many unmarked strips of a given width could be cut from it (without doing anything clever like offsetting the cuts according to text position). Data comes in the form of a pandas DataFrame with x0 and x1 columns, which are the bounding box values in the relevant dimension.

My initial attempt to solve this was to use pd.cut() and DataFrame.groupby() to bin everything by an np.arange() and count the empty bins, but this only works if x1-x0 < strip_width:

def count_unmarked_strips(df, total_width, strip_width):
    bins = np.arange(0, total_width, strip_width)
    count_min = df.groupby(pd.cut(df['x0'], bins=bins))['x0'].count().to_numpy()
    count_max = df.groupby(pd.cut(df['x1'], bins=bins))['x1'].count().to_numpy()
    return ((count_min + count_max) == 0).sum()

This already seems like a bad way to do this and I cannot come up with a way to extend this to work beyond the above limitation. More generally, is there a good/reasonably performant way to compare two sets of intervals like this?

Update with alternate solution:

def count_unmarked_strips_alt(df, total_width, strip_width):
    bin_edges = np.arange(0, total_width, strip_width)
    bins = np.zeros(len(bin_edges)-1, dtype=bool)
    for i in range(len(bins)):
        bin_count = df.loc[
            ( # if writing starts in strip
                (df['x0'] >= bin_edges[i])
                & (df['x0'] <= bin_edges[i+1])
            )
            | ( # if writing ends in strip
                (df['x1'] >= bin_edges[i])
                & (df['x1'] <= bin_edges[i+1])
            )
            | ( # if writing crosses strip boundary
                (df['x0'] <= bin_edges[i])
                & (df['x1'] >= bin_edges[i])
            )
        ].shape[0]
        bins[i] = bin_count == 0
    return bins.sum()

This seems to work, but is much slower to run, which I would like to improve.

Example data:

total_width = 20.
strip_width = 2. # both functions give the same result
# strip_width = 1. # first function overcounts unmarked strips

df = pd.DataFrame()
df['x0'] = [0.1, 6.3, 2.5, 17.2, 11.8, 5.3]
df['x1'] = [0.7, 7.9, 5.4, 19.0, 12.2, 5.7]

so, e.g. we have a 20cm width of paper with 6 markings on of different sizes, bounded on the left hand side by x0 and on the right hand side by x1. If we were to cut the paper into equal strips with 2cm width, we would find 2 of them would be unmarked.

Context: Given a piece of paper with text/drawings sporadically printed on it, determine how many unmarked strips of a given width could be cut from it (without doing anything clever like offsetting the cuts according to text position). Data comes in the form of a pandas DataFrame with x0 and x1 columns, which are the bounding box values in the relevant dimension.

My initial attempt to solve this was to use pd.cut() and DataFrame.groupby() to bin everything by an np.arange() and count the empty bins, but this only works if x1-x0 < strip_width:

def count_unmarked_strips(df, total_width, strip_width):
    bins = np.arange(0, total_width, strip_width)
    count_min = df.groupby(pd.cut(df['x0'], bins=bins))['x0'].count().to_numpy()
    count_max = df.groupby(pd.cut(df['x1'], bins=bins))['x1'].count().to_numpy()
    return ((count_min + count_max) == 0).sum()

This already seems like a bad way to do this and I cannot come up with a way to extend this to work beyond the above limitation. More generally, is there a good/reasonably performant way to compare two sets of intervals like this?

Update with alternate solution:

def count_unmarked_strips_alt(df, total_width, strip_width):
    bin_edges = np.arange(0, total_width, strip_width)
    bins = np.zeros(len(bin_edges)-1, dtype=bool)
    for i in range(len(bins)):
        bin_count = df.loc[
            ( # if writing starts in strip
                (df['x0'] >= bin_edges[i])
                & (df['x0'] <= bin_edges[i+1])
            )
            | ( # if writing ends in strip
                (df['x1'] >= bin_edges[i])
                & (df['x1'] <= bin_edges[i+1])
            )
            | ( # if writing crosses strip boundary
                (df['x0'] <= bin_edges[i])
                & (df['x1'] >= bin_edges[i])
            )
        ].shape[0]
        bins[i] = bin_count == 0
    return bins.sum()

This seems to work, but is much slower to run, which I would like to improve.

Example data:

total_width = 20.
strip_width = 2. # both functions give the same result
# strip_width = 1. # first function overcounts unmarked strips

df = pd.DataFrame()
df['x0'] = [0.1, 6.3, 2.5, 17.2, 11.8, 5.3]
df['x1'] = [0.7, 7.9, 5.4, 19.0, 12.2, 5.7]

so, e.g. we have a 20cm width of paper with 6 markings on of different sizes, bounded on the left hand side by x0 and on the right hand side by x1. If we were to cut the paper into equal strips with 2cm width, we would find 2 of them would be unmarked.

Share Improve this question edited Mar 4 at 12:44 Idle_92 asked Mar 3 at 15:02 Idle_92Idle_92 691 silver badge7 bronze badges 4
  • I don't know if I understand what you try to do. But first idea ( abs(df['x0'] - df['x1'])//strip_width ).sum(). Maybe better add small example data as DataFrame and expected result - so we could run it. – furas Commented Mar 3 at 22:08
  • 2 Please create mock input dataframe with expected output. – Scott Boston Commented Mar 3 at 22:19
  • I've created a dataframe to test with, as well as parameters to hopefully illustrate what I mean. I also just realised that the arange is exclusive of the max value, so I am missing a bin at the end. Seems to be consistent for both methods though and can be fixed easily enough, but is not the focus of my question – Idle_92 Commented Mar 4 at 12:35
  • I wonder if the poor title lead to this question not gaining much traction? – Idle_92 Commented Mar 5 at 20:32
Add a comment  | 

1 Answer 1

Reset to default 0

Rethinking this, I came up with a more performant solution. Iterating over each pair of x0 and x1 values and converting them from floating point positions to the corresponding indices of the bins array works and is much faster than calling df.loc() within a loop.

 def count_unmarked_strips(df, total_width, strip_width):
    n_total = int(total_width // strip_width)
    bins = np.ones(n_total, dtype=bool)
    for x_range in df[['x0', 'x1']].to_numpy():
        x0, x1 = (x_range // strip_width).astype(int)
        bins[x0:x1+1] = False

    return bins.sum()
发布评论

评论列表(0)

  1. 暂无评论