
node.js - Best way to read from a big CSV file without loading everything to memory using Javascript - Stack Overflow


I'm using Atom/Electron to build an app that draws data-driven visualisations over video. Each video has a corresponding CSV file with information for each frame. The videos are about 100 minutes long, so the files contain plenty of data!

The problem I'm having is that it takes a couple of seconds to load and parse the file. Most of the time this is not an issue. But I need to make playlists of parts of videos, and loading the whole CSV file every time the video changes is not a viable option.

I've been looking into file streaming options such as fast-csv, but I didn't manage to start reading from an arbitrary part of the file.

EDIT: from the fs documentation. In this case, the question is: how can I know which byte corresponds to the position I want in the file?

options can include start and end values to read a range of bytes from the file instead of the entire file. Both start and end are inclusive and start at 0.
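
As an illustration, here is a minimal sketch that builds an index of the byte offset where each row starts and then uses those start/end options to stream just a slice of the file; the file name and the assumption of '\n' line endings are hypothetical:

```js
const fs = require('fs');

// One-off indexing pass: record the byte offset at which every row begins.
function buildRowOffsets(path) {
  const buf = fs.readFileSync(path);
  const offsets = [0];
  for (let i = 0; i < buf.length; i++) {
    if (buf[i] === 0x0a) offsets.push(i + 1); // next row starts after each '\n'
  }
  return offsets;
}

// Stream only the bytes belonging to rows fromRow..toRow (inclusive).
// Assumes toRow is not the very last row of the file.
function readRows(path, offsets, fromRow, toRow) {
  return fs.createReadStream(path, {
    start: offsets[fromRow],
    end: offsets[toRow + 1] - 1, // inclusive, up to that row's trailing '\n'
  });
}

const offsets = buildRowOffsets('frames.csv');
readRows('frames.csv', offsets, 1000, 1010).pipe(process.stdout);
```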

What do you think would be the best and most performant approach to this situation?

Concretely:

Is there a way to start reading a stream from any part of a CSV file?

Is there another storage method that would let me solve this problem better?

UPDATE:

In the end, I solved this by storing the data in a binary file. Since I know how many columns the file has, I can read straight from the relevant segment of the file without any performance implications.
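
A minimal sketch of reading such a fixed-width binary layout directly; the column count, the choice of one Float64 per value, and the file name are illustrative assumptions:

```js
const fs = require('fs');

const NUM_COLUMNS = 8;                        // assumed number of values per frame
const BYTES_PER_VALUE = 8;                    // one Float64 per value
const ROW_SIZE = NUM_COLUMNS * BYTES_PER_VALUE;

// Read the values for a single frame straight from its offset in the file.
function readFrame(fd, frameIndex) {
  const buf = Buffer.alloc(ROW_SIZE);
  fs.readSync(fd, buf, 0, ROW_SIZE, frameIndex * ROW_SIZE);
  const values = [];
  for (let col = 0; col < NUM_COLUMNS; col++) {
    values.push(buf.readDoubleLE(col * BYTES_PER_VALUE));
  }
  return values;
}

const fd = fs.openSync('frames.bin', 'r');
console.log(readFrame(fd, 5000));             // values for frame 5000
fs.closeSync(fd);
```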

asked Jun 22, 2015 at 19:08 by limoragni · edited Mar 23, 2018 at 10:27
  • Are the column widths fixed in the CSV? (e.g. column 1 is always 20 characters) – Devin H. Commented Jun 22, 2015 at 19:18
  • This question may be of use: stackoverflow.com/questions/6156501/… I don't know enough about node to answer for sure, but it seems to me you'd want to send one line of the csv at a time. – Devin H. Commented Jun 22, 2015 at 19:22
  • @DevinH. no, but each column is always a number, so it wouldn't be difficult to use some kind of padding. Ideally I would prefer not to do this. I was thinking of indexing the byte position of each row and storing an array with the positions. It wouldn't be a problem to do some processing when the file is loaded for the first time. – limoragni Commented Jun 22, 2015 at 19:23
  • Specify a buffer chunk size on each read? That way you only read small chunks at a time (I'm not sure, but that looks like my first approach) – Bwaxxlo Commented Jun 22, 2015 at 22:31
  • @Bwaxxio I think you've misunderstood the question. He seems to be trying to random-access small parts of the file, not stream the whole thing. – David Knipe Commented Jun 22, 2015 at 22:58

2 Answers


I would highly recommend Papaparse for this. It allows streaming of a CSV 'row-by-row', which can be processed in JSON format based on headers in the file.

Within a config object passed to the parsing function, you can give a 'step' parameter, which is a function to be carried out for each row of the file as it steps through.

Note: it can also be configured to use a worker thread for increased performance when handling very large CSVs.

http://papaparse.com/docs
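
A minimal sketch of that row-by-row streaming from Node; the file name is an assumption, and the exact shape of the step callback's results is worth checking against the Papa Parse docs:

```js
const fs = require('fs');
const Papa = require('papaparse');

// Stream the CSV and handle one parsed row at a time instead of
// loading the whole file into memory.
Papa.parse(fs.createReadStream('frames.csv'), {
  header: true,          // rows come back keyed by column name
  dynamicTyping: true,   // convert numeric fields to numbers
  step: (results) => {
    const row = results.data; // the current parsed row
    // ...use the row to drive the visualisation for this frame...
  },
  complete: () => console.log('done parsing'),
});
```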

Per my comment, Sqlite seems to be what you're looking for. It may not be your permanent solution in the long run, but it will certainly work for the time being while you decide if you want to stick with it or code your own solution.

Inner Workings of Sqlite

Sqlite is optimized to the core, but it has three major features that cause it to execute faster than normal disk reads, especially compared with reading CSV files:

  1. The entire database (every database you create) is stored in a single file, not multiple files or records.
  2. This file is paged into 1024-byte (1K) chunks, allowing you to jump around the data easily.
  3. (Really part of 2) The entire database and paging system is one massive B-tree that usually takes under 10 jumps to find any given data. So in layman's terms, extremely fast!

If you are really interested in understanding the full scope of all of this, I have found no better explanation than this amazing blog post by Julia Evans.
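
To make that concrete, here is a rough sketch of looking rows up from Node; the better-sqlite3 package and the table and column names are assumptions, not part of this answer:

```js
const Database = require('better-sqlite3');

// The whole database is one file; open it read-only.
const db = new Database('frames.db', { readonly: true });

// Fetch one frame's row, or a contiguous range of frames for a playlist segment.
// Assumes frame_index is the table's (indexed) primary key.
const oneFrame = db.prepare('SELECT * FROM frames WHERE frame_index = ?').get(5000);
const segment = db
  .prepare('SELECT * FROM frames WHERE frame_index BETWEEN ? AND ? ORDER BY frame_index')
  .all(5000, 5600);

console.log(oneFrame, segment.length);
```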

Possible Drawbacks

Besides the inner workings, Sqlite is designed to be client-side, working on the user's machine. If this is not a viable solution, there are workarounds. Sqlite can be used behind a web server, for example, but it thrives best in a standalone or mixed installation. Also remember that every client's computer is different. One computer may process records faster than the next, but in general you do not need to worry, since client-side computers are usually under little load.

  • Standalone would require everything to be on the client's side. This is typically how Sqlite is used. I have used it for games in the past, utilizing sqlite4java's API to connect to the database with Java; the API made the whole experience feel like PHP and MySQL on a server. You may need to find other APIs, since Sqlite is written in C.
  • Mixed installation is done the same way as standalone, but you code into your program a link to an actual server. For the games I helped make, we would track things like scores and user data, and then periodically pass this to an actual server in the background if we could get a connection. This also works in reverse: you can start the user with nothing, but on first run it can download everything you need and from then on keep itself up to date with what is on the server.

Summary

Sqlite will work for what you need, but it may require a little homework to set up the way you need. Sqlite4java, for example, is easy to install but confusing to learn since its documentation is so poor; Stack Overflow got me through it, though. Sqlite is also a use-it-and-forget-it type of installation, so to answer your question: it will handle 25 rows a second like cake. You do not need to worry about optimizing it, only your own code.
