
javascript - How would I parse a large TSV file in node.js? - Stack Overflow


I'm extremely new to Node and JS. I have a large TSV file (1.5 GB) that I need to read and parse into either an array or a JSON object. How would I go about doing that? I don't get an error when I run the code below, but the callback is never entered.

var d3 = require("d3-dsv");

d3.tsvParse("amazon_reviews_us_Mobile_Apps_v1_00.tsv", function(error, data) 
{
    var sum = 0;
    data.forEach(function(d) 
    {
        d.helpful_votes += d.helpful_votes;
        sum += d.helpful_votes;
    });
    console.log("Total Helpful Votes: " + sum);
});

Any help would be appreciated.

asked Oct 7, 2020 at 3:38 by Roux
  • Two problems: it should be d3.tsv, not d3.tsvParse, which works only with strings. Also, D3 v5 and above uses Fetch API, meaning it should be d3.tsv(url).then(etc...). – Gerardo Furtado Commented Oct 7, 2020 at 3:55
  • @GerardoFurtado I have tried both of these. d3.tsv gives me function does not exist and d3.tsv(url).then gives me fetch is undefined, even when I installed the d3-fetch module. – Roux Commented Oct 7, 2020 at 22:39
  • What's your D3 version? – Gerardo Furtado Commented Oct 7, 2020 at 22:52
  • @GerardoFurtado 2.0.0. I installed it by using npm install d3-dsv – Roux Commented Oct 7, 2020 at 22:56
  • Are you sure? This is 8 years old! – Gerardo Furtado Commented Oct 7, 2020 at 23:14

2 Answers


You need a module that provides a streaming parser for TSV files, meaning one that doesn't load the whole file into memory. You can use Node's built-in readline module if your per-line parsing is synchronous:

const {createInterface} = require("readline");
const {createReadStream} = require("fs");

createInterface({input: createReadStream("amazon_reviews_us_Mobile_Apps_v1_00.tsv")})
   // 'line' fires once for every line of the file
   .on('line', (data) => doSomethingWith(data.split("\t")))
   // readline emits 'close' (not 'end') once the input has been fully read
   .on('close', () => doSomethingWhenDone())

You wrote that you want to parse that file and turn it into an array or object of some sort. You'll still need to keep an eye on memory usage, but you could use my scramjet module, which lets you transform the data any way you like:

const {StringStream} = require("scramjet");
const {createReadStream, createWriteStream} = require("fs");

// read the file
StringStream.from(createReadStream("amazon_reviews_us_Mobile_Apps_v1_00.tsv"))
    // parse as csv, using a tab as the delimiter
    .CSVParse({delimiter: "\t"})
    // whatever you return here replaces the entry;
    // this can be asynchronous too, so you can do requests...
    .map((entry) => doSomething(entry))
    // serialize the entries as a JSON array and write it to a file
    .toJSONArray()
    .pipe(createWriteStream("somefile.json"))

Let me know what you are trying to achieve besides counting, and I'll edit the answer.

BTW, for just counting votes the solution by @hugo-elhaj-lahsen is also good, I'm not sure why it was downvoted.

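Use d3.tsvParse with a row function. Since your file is very large, one optimisation is to skip the separate forEach pass over the parsed rows and do the summing in the row function that D3 calls once per row at parse time. Note that tsvParse takes the file contents as a string, not a file name, and returns the rows directly rather than a promise. This means the whole file has to be read into a single string first, which may not be possible for a file as large as 1.5 GB:

var fs = require("fs");
var d3 = require("d3-dsv");

var sum = 0;

// tsvParse expects the file contents as a string, not a file name
var text = fs.readFileSync("amazon_reviews_us_Mobile_Apps_v1_00.tsv", "utf8");

d3.tsvParse(text, function (d) {
  sum += +d.helpful_votes; // fields are parsed as strings, so coerce to a number
  return d; // the row function must return the row to keep it in the result
});

console.log("Total helpful votes", sum);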