javascript - What is the best way to convert from CSV to JSON when commas and quotations may be in the fields? - Stack Overflow

I want to be able to convert a CSV to JSON. The csv comes in as free text like this (with the newlines):

name,age,booktitle
John,2,Hello World
Mary,3,""Alas, What Can I do?""
Joseph,5,"Waiting, waiting, waiting"

My problem as you can tell is the file...

  • Has got some interior commas in some fields, though they are wrapped in at least one double quote.
  • There could be double quotes within the file.

I would like the output to not have any leading and trailing quotes for each field... how can I correctly create a JSON object parsed out from the csv string that represents this CSV accurately? (without the leading and trailing quotes).

I usually use:

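var mycsvstring = 'name,age,booktitle\nJohn,2,Hello World'; // raw CSV text
var finalconvertedjson = [];
var lines = mycsvstring.split('\n');
var headerfields = lines[0].split(','); // get headers here

for (var i = 1; i < lines.length; i++) {
  // loop through each line and set a key for each header field that corresponds to the appropriate lines[i]
  // (naive comma split: breaks on commas or quotes inside a field)
  var values = lines[i].split(',');
  var row = {};
  for (var j = 0; j < headerfields.length; j++) {
    row[headerfields[j]] = values[j];
  }
  finalconvertedjson.push(row);
}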

Asked Dec 6, 2019 at 18:33 by Rolando; edited Dec 6, 2019 at 18:47
  • stackoverflow.com/help/how-to-ask – Rob Commented Dec 6, 2019 at 18:36
  • Perhaps this page might be helpful: stackoverflow.com/questions/27979002/… – The fourth bird Commented Dec 6, 2019 at 18:36
  • CSV to JSON: convert the CSV into JavaScript objects, then JSON.stringify (do you really want JSON, a string representation of data, or do you just want an array of data?) P.S. there is no such thing as a "JSON object" – crashmstr Commented Dec 6, 2019 at 18:38
  • @Thefourthbird I can't rely on having a delimiter that is not a comma; any symbol is fair game. I would like to get this working with commas. – Rolando Commented Dec 6, 2019 at 18:39
  • You can add your own logic that reads character by character. Keep looking for a quote and a comma: if you find a comma first, take the value up to the next comma; if you find a quote, take the value up to the next quote, on the condition that the next character is a comma or the end of the line. – Pranav Asthana Commented Dec 6, 2019 at 19:06
1 Answer

My first guess is to use a regular expression. You can try this one I've just whipped up:

/\s*(")?(.*?)\1\s*(?:,|$)/gm

This can be used to extract fields, so headers can be grabbed with it as well. The first capture group is used as an optional quote-grabber with a backreference (\1), so the actual data is in the second capture group.
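
As a quick illustration of how the two capture groups behave, here is a minimal sketch that runs the pattern over a single line from the question. The helper name `extractFields` is hypothetical and not part of the answer's code, and the `m` flag is dropped on the assumption that the input is one line at a time.

```javascript
// The pattern from above, applied to one line at a time (so no `m` flag).
const fieldPattern = /\s*(")?(.*?)\1\s*(?:,|$)/g;

// Hypothetical helper: returns the second capture group of every match.
function extractFields(line) {
  return [...line.matchAll(fieldPattern)].map(m => m[2]);
}

const fields = extractFields('Joseph,5,"Waiting, waiting, waiting"');
console.log(fields);
// → [ 'Joseph', '5', 'Waiting, waiting, waiting', '' ]
// The final empty string is a zero-width match at the end of the line.
```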

Here's an example of it in use. I had to use a slice to cut off the last match in all cases, since allowing for blank fields with the * wildcard (things like f1,,f3) put a zero-width match at the end. This was easier to get rid of in-code rather than with some regex trickery. Finally, I've got 'extra_i' as a default/placeholder value if there are some extra columns not accounted for by the headers. You should probably swap that part out to fit your own needs.

/**
 * Takes a raw CSV string and converts it to a JavaScript object.
 * @param {string} text The raw CSV string.
 * @param {string[]} headers An optional array of headers to use. If none are
 * given, they are pulled from the first line of `text`.
 * @param {string} quoteChar A character to use as the encapsulating character.
 * @param {string} delimiter A character to use between columns.
 * @returns {object[]} An array of JavaScript objects containing headers as keys
 * and row entries as values.
 */
function csvToJson(text, headers, quoteChar = '"', delimiter = ',') {
  // Escape regex metacharacters so characters such as '|' or '.' are matched literally
  const esc = s => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const regex = new RegExp(`\\s*(${esc(quoteChar)})?(.*?)\\1\\s*(?:${esc(delimiter)}|$)`, 'gs');

  const match = line => [...line.matchAll(regex)]
    .map(m => m[2])  // we only want the second capture group
    .slice(0, -1);   // cut off blank match at the end

  const lines = text.split('\n');
  const heads = headers ?? match(lines.shift());

  return lines.map(line => {
    return match(line).reduce((acc, cur, i) => {
      // Replace blank matches with `null`; convert numeric strings (including "0") to numbers
      const num = Number(cur);
      const val = cur.length <= 0 ? null : (cur.trim() === '' || Number.isNaN(num)) ? cur : num;
      const key = heads[i] ?? `extra_${i}`;
      return { ...acc, [key]: val };
    }, {});
  });
}

const testString = `name,age,quote
John,,Hello World
Mary,23,""Alas, What Can I do?""
Joseph,45,"Waiting, waiting, waiting"
"Donaldson Jones"   , sixteen,    ""Hello, "my" friend!""`;

console.log(csvToJson(testString));
console.log(csvToJson(testString, ['foo', 'bar', 'baz']));
console.log(csvToJson(testString, ['col_0']));

As a bonus, I've written this to accept a list of strings to use as the headers instead, since I know firsthand that not all CSV files have a header row. When headers are passed in, the first line of the text is treated as data.


Note: This regex approach does not work if your values have newlines in them, because it relies on splitting the string at the newlines. I did look into using a regular expression to split the lines only at newlines outside of quotes, which almost worked, but took upwards of 30 seconds on anything longer than a few lines.
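
One way to avoid a slow regular expression for this step might be to scan the text once and track whether the current position is inside quotes. The sketch below is an assumption about how that could look, not something taken from the answer; `splitRows` is a hypothetical name, and it assumes every row contains an even number of quote characters.

```javascript
// Hypothetical helper: split CSV text into rows, ignoring newlines inside quotes.
function splitRows(text, quoteChar = '"') {
  const rows = [];
  let current = '';
  let inQuotes = false;

  for (const ch of text) {
    if (ch === quoteChar) {
      inQuotes = !inQuotes;          // every quote character flips the state
      current += ch;
    } else if (ch === '\n' && !inQuotes) {
      rows.push(current);            // a newline outside quotes ends the row
      current = '';
    } else {
      current += ch;                 // includes newlines inside quotes
    }
  }
  rows.push(current);                // whatever follows the last newline
  return rows;
}

const rows = splitRows('id,note\n1,"line one\nline two"\n2,plain');
console.log(rows.length); // 3
```

Because it visits each character once, the running time grows linearly with the input. The rows it returns could in principle be handed to the field-matching step in place of the plain split on newlines.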

If you want to get full functionality, your best bet would be to find an existing parsing library, or to write your own: one that counts occurrences of quotes to figure out if you're inside or outside a "cell" at the moment as you iterate through them.
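
For the write-your-own route, here is a minimal sketch of a character-by-character parser. It is not the answerer's code: `parseCsv` is a hypothetical name, and it assumes RFC 4180-style input, where a literal quote inside a quoted field is written as two quotes. Rows like the question's `""Alas, What Can I do?""` are not valid under that convention and would still be split at the comma.

```javascript
// Hypothetical parser: returns an array of rows, each an array of field strings.
function parseCsv(text, delimiter = ',', quoteChar = '"') {
  const rows = [];
  let row = [];
  let field = '';
  let inQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === quoteChar && text[i + 1] === quoteChar) {
        field += quoteChar;   // doubled quote inside a quoted field is a literal quote
        i++;                  // skip the second quote of the pair
      } else if (ch === quoteChar) {
        inQuotes = false;     // closing quote
      } else {
        field += ch;          // delimiters and newlines are kept verbatim
      }
    } else if (ch === quoteChar) {
      inQuotes = true;        // opening quote
    } else if (ch === delimiter) {
      row.push(field);        // end of field
      field = '';
    } else if (ch === '\n') {
      row.push(field);        // end of field and end of row
      rows.push(row);
      row = [];
      field = '';
    } else if (ch !== '\r') {
      field += ch;            // ordinary character; \r from Windows line endings is skipped
    }
  }
  // Flush the last row unless the text ended with a newline (or was empty)
  if (field !== '' || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

const sample = 'name,quote\nJoseph,"Waiting, waiting"\nAnn,"She said ""hi""\nand left"';
const [header, ...records] = parseCsv(sample);
const objects = records.map(r => Object.fromEntries(header.map((h, i) => [h, r[i]])));
console.log(JSON.stringify(objects));
```

All values come back as strings here; number conversion and header overrides, as in the regex version, would have to be layered on top. For production data, an existing parsing library is likely the safer choice.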
