I have a dataset, part of which looks like this:
customer | product | price | quantity | sale_time |
---|---|---|---|---|
C060235 | P0204 | 6.99 | 2 | 2024-03-11 08:24:11 |
C045298 | P0167 | 14.99 | 1 | 2024-03-11 08:35:06 |
... | ... | ... | ... | ... |
C039877 | P0024 | 126.95 | 1 | 2024-09-30 21:18:45 |
What I want is a list of unique customer, product pairs with the total sales, so something like:
customer | product | total |
---|---|---|
C0000105 | P0168 | 643.78 |
C0000105 | P0204 | 76.88 |
... | ... | ... |
C1029871 | P1680 | 435.44 |
Here's my attempt at constructing this. This gives me the grand total of all sales, which isn't what I want. What's a correct approach?
import polars as pl

db.select(
    (
        pl.col('customer'),
        pl.col('product'),
        pl.col('quantity').mul(pl.col('price')).alias('total')
    )
).group_by(('customer', 'product'))
asked Mar 13 at 15:28 by Scott Deerwester, edited Mar 13 at 16:34
Comment (Starship Remembers Shadow, Mar 13 at 15:43): Can you please add the exact output that you get when you run that code?
3 Answers
To do this, calculate the sale amount for each row, group by both the customer and product columns, and then sum the calculated amounts within each group.
Your current query has a few issues:
- You're selecting product and customer but grouping by item_lookup_key and shopper_card_number
- You need to use an aggregation function after grouping
This approach works:
db.group_by(["customer", "product"]).agg([
    ((pl.col("quantity") * pl.col("price")).sum()).alias("total")
])
A more concise alternative is Expr.dot:
db.group_by("customer", "product").agg(
    total=pl.col("quantity").dot("price")
)
As you haven't shown all the columns named in your example (i.e. 'item_lookup_key' and 'shopper_card_number'), here's a trivial example that hopefully provides enough for you to progress.
NB: I am using polars 1.24.0 (Linux Mint 20.x).
cat wester.py
import polars as pl

# Sample dataset
data = {
    "customer": ["C060235", "C045298", "C039877", "C060235", "C039877"],
    "product": ["P0204", "P0167", "P0024", "P0204", "P0024"],
    "price": [6.99, 14.99, 126.95, 6.99, 126.95],
    "quantity": [2, 1, 1, 3, 2],
    "sale_time": [
        "2024-03-11 08:24:11",
        "2024-03-11 08:35:06",
        "2024-09-30 21:18:45",
        "2024-04-15 10:12:30",
        "2024-10-01 15:22:10",
    ],
}
df = pl.DataFrame(data)

# total sales by (customer, product)
result = (
    df.with_columns((pl.col("price") * pl.col("quantity")).alias("total_sales"))
    .group_by(["customer", "product"])
    .agg(pl.sum("total_sales").alias("total_sales"))
)
print(result)
#
python wester.py
shape: (3, 3)
┌──────────┬─────────┬─────────────┐
│ customer ┆ product ┆ total_sales │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞══════════╪═════════╪═════════════╡
│ C039877 ┆ P0024 ┆ 380.85 │
│ C045298 ┆ P0167 ┆ 14.99 │
│ C060235 ┆ P0204 ┆ 34.95 │
└──────────┴─────────┴─────────────┘
df.group_by("customer", "product").agg(total=pl.col("quantity").dot("price"))
Expr.dot computes the sum of the products (i.e., the dot product).
There is also no need for a list (square brackets) in either group_by or agg.