node.js - Javascript - Read parquet data (with snappy compression) from AWS s3 bucket

In Node.js, I am trying to read a parquet file (compression='snappy'), but without success.

I used the https://github.com/ironSource/parquetjs npm module to open a local file and read it, but reading from the cursor throws a cryptic error: 'not yet implemented'. It does not matter which compression (plain, rle, or snappy) was used to create the input file; the same error is thrown.

Here is my code:

const parquet = require('parquetjs');

const readParquet = async (fileKey) => {

  const filePath = 'parquet-test-file.plain'; // 'snappy';

  console.log('----- reading file : ', filePath);
  let reader = await parquet.ParquetReader.openFile(filePath);
  console.log('---- ParquetReader initialized....');

  // create a new cursor
  let cursor = reader.getCursor();

  // read all records from the file and print them
  if (cursor) {
    console.log('---- cursor initialized....');

    let record = await cursor.next() ; // this line throws exception
    while (record) {
      console.log(record);
      record = await cursor.next();
    }
  }

  await reader.close();
  console.log('----- done with reading parquet file....');

  return;
};

Call to read:

let dt = readParquet(fileKeys.dataFileKey);
dt
  .then((value) => console.log('--------SUCCESS', value))
  .catch((error) => {
    console.log('-------FAILURE ', error); // Random error
    console.log(error.stack);
  })

More info:

1. I have generated my parquet files in Python using pyarrow.parquet.
2. I used 'SNAPPY' compression while writing the file.
3. I can read these files in Python without any issue.
4. My schema is not fixed (unknown) each time I write a parquet file. I do not create a schema while writing.
5. error.stack prints undefined in the console.
6. console.log('-------FAILURE ', error); prints "not yet implemented".

I would like to know if anyone has encountered a similar problem and has ideas or a solution to share. BTW, my parquet files are stored in an AWS S3 bucket (unlike in this test code). I still have to find a way to read a parquet file from an S3 bucket.

Any help, suggestions, or code examples will be highly appreciated.


asked Jul 18, 2018 at 21:34 by Test Test; edited Oct 10, 2018 at 19:53 by Community Bot

Did you find a solution? – Joe, commented Mar 1, 2019 at 3:32


2 Answers

Use the AWS SDK (var AWS = require('aws-sdk');) to get the data from S3.

Then use node-parquet to read the parquet file into a variable.

const np = require('node-parquet');

// Read from a file:
var reader = new np.ParquetReader(`file.parquet`);
var parquet_info = reader.info();
var parquet_rows = reader.rows();
reader.close();
parquet_rows = parquet_rows + "\n";
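
The two steps above (fetch the object from S3, then read it from disk) could be combined roughly as follows. This is a minimal sketch, assuming version 2 of the AWS SDK, where s3.getObject(params).promise() resolves to an object whose Body is a Buffer. The helper name downloadToTempFile and the 'parquet-' temp directory prefix are hypothetical, chosen only for illustration. Because the reader above is given a file path, the object is written to a temporary file first.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: download one S3 object to a temporary local file.
// `s3` is assumed to be an AWS SDK v2 client (new AWS.S3()), passed in by the caller.
async function downloadToTempFile(s3, params) {
  // AWS SDK v2: getObject(...).promise() resolves with the object data in Body
  const data = await s3.getObject(params).promise();

  // Create a fresh temp directory so concurrent downloads do not collide
  const dir = await fs.promises.mkdtemp(path.join(os.tmpdir(), 'parquet-'));
  const filePath = path.join(dir, path.basename(params.Key));

  await fs.promises.writeFile(filePath, data.Body);
  return filePath;
}
```

With a real client this might be called as const filePath = await downloadToTempFile(new AWS.S3(), {Bucket: 'x', Key: 'y'}); and the result passed to new np.ParquetReader(filePath). The whole object is held in memory before it is written, so this only suits files that fit comfortably in RAM.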

There is a fork of https://github.com/ironSource/parquetjs here: https://github.com/ZJONSSON/parquetjs, which is a "lite" version of the ironSource project. You can install it using npm install parquetjs-lite.

The ZJONSSON project comes with a function ParquetReader.openS3, which accepts an S3 client (from version 2 of the AWS SDK) and params ({Bucket: 'x', Key: 'y'}). You might want to try it and see if that works for you.

If you are using version 3 of the AWS SDK / S3 client, I have a compatible fork here: https://github.com/entitycs/parquetjs (see tag feature/openS3v3).

Example usage from the project's README.md:

const parquet = require("parquetjs-lite");

const params = {
  Bucket: 'xxxxxxxxxxx',
  Key: 'xxxxxxxxxxx'
};
// v2 example (use either this block or the v3 block below, not both)
const AWS = require('aws-sdk');
const client = new AWS.S3({
  accessKeyId: 'xxxxxxxxxxx',
  secretAccessKey: 'xxxxxxxxxxx'
});
let reader = await parquet.ParquetReader.openS3(client,params);

// v3 example (alternative to the v2 block above)
const {S3Client, HeadObjectCommand, GetObjectCommand} = require('@aws-sdk/client-s3');
const client = new S3Client({region:"us-east-1"});
let reader = await parquet.ParquetReader.openS3(
  {S3Client:client, HeadObjectCommand, GetObjectCommand},
  params
);

// create a new cursor
let cursor = reader.getCursor();

// read all records from the file and print them
let record = null;
while (record = await cursor.next()) {
  console.log(record);
}
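
The cursor loop above could be wrapped in a small helper that also closes the reader when reading fails. This is a sketch, not part of parquetjs: readAllRecords is a hypothetical name, and it only assumes that the reader exposes getCursor() and close(), and that cursor.next() resolves to a falsy value after the last row, as in the README example.

```javascript
// Hypothetical helper: collect every record from an already opened reader.
// Assumes the reader interface used above: getCursor(), cursor.next(), close().
async function readAllRecords(reader) {
  const records = [];
  try {
    const cursor = reader.getCursor();
    let record = null;
    // cursor.next() is expected to resolve to a falsy value after the last row
    while ((record = await cursor.next())) {
      records.push(record);
    }
  } finally {
    // Close the reader whether or not reading succeeded
    await reader.close();
  }
  return records;
}
```

For example: const rows = await readAllRecords(await parquet.ParquetReader.openS3(client, params)); Collecting all rows in an array is only reasonable for small files; for larger ones, process each record inside the loop instead.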
