最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

database - Cannot SAMPLE BY when reading from an external parquet file - Stack Overflow

programmeradmin0浏览0评论

I am trying to use the built-in parquet demo file inside questdb (trades.parquet) to see how it works, but I am stuck as I cannot use SAMPLE BY.

I can do this

select * from read_parquet('trades.parquet');

And it works as expected. I can see the output below with the columns symbol, side, price, amount, and timestamp. The dataset is already sorted by increasing timestamp on the file.

I now want to do the simplest SAMPLE BY query

select timestamp, avg(price) from read_parquet('trades.parquet') SAMPLE by 1m;

But I am getting an error base query does not provide ASC order over dedicated TIMESTAMP column.

I tried creating a table and then doing INSERT INTO tb SELECT * FROM read_parquet('trades.parquet') and I can then use SAMPLE BY. But in this case I am duplicating data and I was hoping to do time-series queries in-place over my file.

I am trying to use the built-in parquet demo file inside questdb (trades.parquet) to see how it works, but I am stuck as I cannot use SAMPLE BY.

I can do this

select * from read_parquet('trades.parquet');

And it works as expected. I can see the output below with the columns symbol, side, price, amount, and timestamp. The dataset is already sorted by increasing timestamp on the file.

I now want to do the simplest SAMPLE BY query

select timestamp, avg(price) from read_parquet('trades.parquet') SAMPLE by 1m;

But I am getting an error base query does not provide ASC order over dedicated TIMESTAMP column.

I tried creating a table and then doing INSERT INTO tb SELECT * FROM read_parquet('trades.parquet') and I can then use SAMPLE BY. But in this case I am duplicating data and I was hoping to do time-series queries in-place over my file.

Share Improve this question asked Jan 29 at 17:52 Javier RamirezJavier Ramirez 4,0951 gold badge27 silver badges36 bronze badges
Add a comment  | 

1 Answer 1

Reset to default 1

The problem with the SAMPLE BY is that it works over the designated timestamp. When you create a table without a designated timestamp, you get the same error.

Since the parquet file is an external file, QuestDB doesn't know which column, if any, is the designated timestamp. It cannot just use the timestamp column, as for SAMPLE BY to work, the data must be sorted by increasing timestamp. When tables are created inside QuestDB, the increasing timestamp order is always implicit, as QuestDB forces that when persisting data on disk, but when using an external file, it cannot assume anything.

In QuestDB we can always hint the designated timestamp by doing this:

select timestamp, avg(price) from (read_parquet('trades.parquet') timestamp(timestamp)) sample by 1m;

This should work fine in this case, as we know the file is already sorted by time. However, if we were not sure about the original order, we could always add an ORDER BY, as in

select timestamp, avg(price) from ((read_parquet('trades.parquet') order by timestamp  ) timestamp(timestamp)) SAMPLE BY 1m

Which is shorthand for

select timestamp, avg(price) from (SELECT * FROM (read_parquet('trades.parquet') order by timestamp  ) timestamp(timestamp)) SAMPLE BY 1m

Or if we prefer a CTE for clarity

WITH timestamped AS (
  (
   SELECT * FROM read_parquet('trades.parquet') 
   order by timestamp  
  )    
  timestamp(timestamp ) 
  )
 SELECT  timestamp, avg(price) from timestamped SAMPLE BY 1m;
发布评论

评论列表(0)

  1. 暂无评论