javascript - Parse semi-structured values - Stack Overflow

It's my first question here. I tried to find an answer but honestly couldn't figure out which terms to search for, so sorry if this has been asked before.

Here it goes: I have thousands of records in a .txt file, in this format:

(1, 3, 2, 1, 'John (Finances)'),
(2, 7, 2, 1, 'Mary Jane'),
(3, 7, 3, 2, 'Gerald (Janitor), Broflowski'),

... and so on. The first value is the PK, the other 3 are Foreign Keys, the 5th is a string.

I need to parse them as JSON (or something) in JavaScript, but I'm having trouble because some strings contain parentheses and commas (the 3rd record, "Janitor", for example), so I can't simply use substring. Maybe I could trim the right part, but I was wondering if there is some smarter way to parse it.

Any help would be really appreciated.

Thanks!

asked Jun 16, 2014 at 21:16 by Fabio; edited Jun 17, 2014 at 6:19 by Andrew Savinykh
  • Looks like you can .split(/'\)\,\s*/) to put each one in its own array slot, and [].map it from there... – dandavis Commented Jun 16, 2014 at 21:20
  • OK, but some strings have ")," or a comma alone... then it could result in more than 5 positions in the array, couldn't it? – Fabio Commented Jun 16, 2014 at 21:23
  • Inside the iteration, you can split on comma, and then set col[4]=col.slice(4).join(",") to get any commas and text in the string back. If that style of solution does not work, you need more consistent or explicit input... – dandavis Commented Jun 16, 2014 at 21:25
  • Wait, is each entry on its own line in the file? (question edited) If so, then simply split by lines and then turn each line into an array. – dandavis Commented Jun 16, 2014 at 21:36
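
The split-and-rejoin idea from the comments above can be sketched roughly as follows. This is only an illustration under stated assumptions: one record per line, exactly four integer columns before the string, and no escape sequences inside the quoted text. The helper name parseLine and the sample variable are hypothetical.

```javascript
// Sketch of the "split on comma, then rejoin the tail" idea.
// Assumes one record per line and exactly four integer columns.
function parseLine(line) {
    var trimmed = line.trim();
    // Keep what sits between the leading "(" and the last ")".
    var inner = trimmed.slice(1, trimmed.lastIndexOf(")"));
    var cols = inner.split(",");
    // Everything after the fourth comma belongs to the string column,
    // so glue those pieces back together with the commas restored.
    var text = cols.slice(4).join(",").trim();
    // Convert the four keys to numbers and strip the outer quotes.
    return cols.slice(0, 4).map(Number).concat(text.slice(1, -1));
}

var sample = [
    "(1, 3, 2, 1, 'John (Finances)'),",
    "(2, 7, 2, 1, 'Mary Jane'),",
    "(3, 7, 3, 2, 'Gerald (Janitor), Broflowski'),"
].join("\n");

var rows = sample.split("\n").map(parseLine);
console.log(JSON.stringify(rows));
```

Because only the first four commas are treated as separators, any further commas or parentheses end up inside the fifth element unchanged.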

5 Answers

You can't (read: probably shouldn't) use a regular expression for this. What if the parentheses contain another pair, or one is mismatched?

The good news is that you can easily construct a tokenizer/parser for this. The idea is to keep track of your current state and act accordingly.

Here is a sketch of a parser I've just written; the point is to show you the general idea. Let me know if you have any conceptual questions about it.

It works, but I beg you not to use it in production before understanding and patching it.


How it works

So, how do we build a parser? We start by enumerating the states it can be in:

var State = { // remember which state the parser is at.
     BeforeRecord:0, // at the (
     DuringInts:1, // at one of the integers
     DuringString:2, // reading the name string
     AfterRecord:3 // after the )
};

We'll need to keep track of the output, and the current working object since we'll parse these one at a time.

var records = []; // to contain the results
var state = State.BeforeRecord;

Now we iterate over the string, reading one character at a time and dispatching on the current state:

for(var i = 0;i < input.length; i++){
    if(state === State.BeforeRecord){
        // handle logic when in (
    }
    ...
    if(state === State.AfterRecord){
        // handle that state
    }
}

Now, all that's left is to consume the input into the working object at each state:

  • If we're at the (, we start parsing a record and skip any whitespace
  • Read each integer and discard the , that follows it
  • After four integers, read the string from the opening ' up to the next '
  • After the string, read until the ), store the object, and start the cycle again.

The implementation is not very difficult either.


The parser

var State = { // keep track of the state
     BeforeRecord:0,
     DuringInts:1,
     DuringString:2,
     AfterRecord:3
};
var records = []; // to contain the results
var state = State.BeforeRecord;
var input = " (1, 3, 2, 1, 'John (Finances)'), (2, 7, 2, 1, 'Mary Jane'), (3, 7, 3, 2, 'Gerald (Janitor), Broflowski'),"; // sample input

var workingRecord = {}; // what we're reading into.

for(var i = 0;i < input.length; i++){
    var token = input[i]; // read the current input
    if(state === State.BeforeRecord){ // before reading a record
        if(token === ' ') continue; // ignore whitespaces between records
        if(token === '('){ state = State.DuringInts; continue; }
        throw new Error("Expected ( before new record");
    }
    if(state === State.DuringInts){
        if(token === ' ') continue; // ignore whitespace
        for(var j = 0; j < 4; j++){
            if(token === ' ') {token = input[++i]; j--; continue;} // ignore whitespace 
             var curNum = '';
             while(token != ","){
                  if(!/[0-9]/.test(token)) throw new Error("Expected number, got " + token);
                  curNum += token;
                  token = input[++i]; // get the next token
             }
             workingRecord[j] = Number(curNum); // set the data on the record
             token = input[++i]; // remove the comma
        }
        state = State.DuringString;
        continue; // progress the loop
    }
    if(state === State.DuringString){
         if(token === ' ') continue; // skip whitespace
         if(token === "'"){
             var str = "";
             token = input[++i];
             var lenGuard = 1000;
             while(token !== "'"){
                 str+=token;
                 if(lenGuard-- === 0) throw new Error("Error, string length bounded by 1000");
                 token = input[++i];
             }
             workingRecord.str = str;
             token = input[++i]; // step onto the closing ) (not validated); the loop's i++ then moves past it
             state = State.AfterRecord;
             continue;
         }
    }
    if(state === State.AfterRecord){
        if(token === ' ') continue; // ignore whitespace
        if(token === ',') { // got the "," between records
            state = State.BeforeRecord;
            records.push(workingRecord);
            workingRecord = {}; // new record;
            continue;
        }
        throw new Error("Invalid token found " + token);
    }
}
console.log(records); // logs [Object, Object, Object]
                      // each object has four numbers and a string, for example
                      // records[0][0] is 1, records[0][1] is 3 and so on,
                      // records[0].str is "John (Finances)"
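
As a possible starting point for the "patching" mentioned above, here is a sketch that keeps the same hand-written parsing idea but restructures it as small reader functions with a shared position, instead of an explicit state variable. It is a variation, not the parser above: it treats newlines as whitespace, checks that the ) is really there, and allows the last record to omit its trailing comma. The function name parseRecords is hypothetical, and, like the original, it assumes the string ends at the first ' after it opens.

```javascript
// Hypothetical helper; a variation on the parser above, not a drop-in copy.
function parseRecords(input) {
    var records = [];
    var i = 0; // current position in the input

    function skipWhitespace() {
        while (i < input.length && /\s/.test(input[i])) i++;
    }
    function expect(ch) {
        skipWhitespace();
        if (input[i] !== ch) {
            throw new Error("Expected " + ch + " at position " + i);
        }
        i++;
    }
    function readInt() {
        skipWhitespace();
        var start = i;
        while (i < input.length && /[0-9]/.test(input[i])) i++;
        if (start === i) throw new Error("Expected number at position " + i);
        return Number(input.slice(start, i));
    }
    function readString() {
        expect("'");
        // Assumption: the string ends at the next single quote.
        var end = input.indexOf("'", i);
        if (end === -1) throw new Error("Unterminated string at position " + i);
        var str = input.slice(i, end);
        i = end + 1;
        return str;
    }

    skipWhitespace();
    while (i < input.length) {
        var record = {};
        expect("(");
        for (var j = 0; j < 4; j++) {
            record[j] = readInt();
            expect(",");
        }
        record.str = readString();
        expect(")");
        skipWhitespace();
        if (i < input.length) expect(","); // comma between records
        records.push(record);
        skipWhitespace();
    }
    return records;
}

var parsed = parseRecords(
    "(1, 3, 2, 1, 'John (Finances)'),\n" +
    "(2, 7, 2, 1, 'Mary Jane'),\n" +
    "(3, 7, 3, 2, 'Gerald (Janitor), Broflowski'),\n"
);
console.log(parsed.length, parsed[2].str);
```

The records keep the same shape as above (numeric keys 0 to 3 plus str), so code written against the original output should not need to change.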

I echo Ben's sentiments about regular expressions usually being bad for this, and I completely agree with him that tokenizers are the best tool here.

However, given a few caveats, you can use a regular expression here. This is because any ambiguities in your (, ), , and ' can be attributed (AFAIK) to your final column, since all of the other columns will always be integers.

So, given:

  1. The input is perfectly formed (with no unexpected (, ), , or ').
  2. Each record is on a new line, per your edit.
  3. The only new lines in your input are the ones that break to the next record.

... the following should work (Note "new lines" here are \n. If they're \r\n, change them accordingly):

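var input = "(1, 3, 2, 1, 'John (Finances)'),\n" +
            "(2, 7, 2, 1, 'Mary Jane'),\n" +
            "(3, 7, 3, 2, 'Gerald (Janitor), Broflowski'),"; // replace with your input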
var output = input.split(/\n/g).map(function (cols) {
    cols = cols.match(/^\((\d+), (\d+), (\d+), (\d+), '(.*)'\)/).slice(1);

    return cols.slice(0, 4).map(Number).concat(cols[4]);
});

The code splits on new lines, then goes through row by row and splits into cells using a regular expression, which greedily attributes as much as it can to the final cell. It then turns the first 4 elements into integers, and sticks the 5th element (the string) onto the end.

This gives you an array of records, where each record is itself an array. The first 4 elements are your keys (the PK followed by the three FKs, as integers) and the 5th element is the string.

For example, given your input, use output[2][4] to get "Gerald (Janitor), Broflowski", and output[1][0] to get the PK 2 of the second record (don't forget JavaScript arrays are zero-indexed).

You can see it working here: http://jsfiddle.net/56ThR/
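
Since the question asks for JSON, a possible follow-up is to map each match onto an object with named fields and serialize the result. The sketch below reuses the same regular expression; the field names in FIELDS and the helper name toObjects are hypothetical placeholders, and lines that do not match (such as a trailing blank line) are silently skipped, which may or may not be what you want.

```javascript
// Field names are hypothetical; rename them to match the real schema.
var FIELDS = ["id", "fkA", "fkB", "fkC", "name"];
var ROW = /^\((\d+), (\d+), (\d+), (\d+), '(.*)'\)/;

function toObjects(input) {
    return input.split(/\r?\n/)
        .map(function (line) { return line.match(ROW); })
        .filter(Boolean) // drop blank or non-matching lines
        .map(function (m) {
            var obj = {};
            FIELDS.forEach(function (key, k) {
                // Capture groups start at index 1; the first four are integers.
                obj[key] = k < 4 ? Number(m[k + 1]) : m[k + 1];
            });
            return obj;
        });
}

var sampleText =
    "(1, 3, 2, 1, 'John (Finances)'),\n" +
    "(2, 7, 2, 1, 'Mary Jane'),\n" +
    "(3, 7, 3, 2, 'Gerald (Janitor), Broflowski'),\n";

var objects = toObjects(sampleText);
var json = JSON.stringify(objects, null, 2);
console.log(json);
```

Splitting on /\r?\n/ covers both \n and \r\n line endings, so the note above about changing the separator does not apply to this variant.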

Another option would be to convert it into something that looks like an Array and eval it. I know it is not recommended to use eval, but it's a cool solution :)

var lines = input.split("\n");
var output = [];

for(var v in lines){

    // Remove opening ( 
    lines[v] = lines[v].slice(1);

    // Remove closing ) and what is after
    lines[v] = lines[v].slice(0, lines[v].lastIndexOf(')'));

    output[v] = eval("[" + lines[v] + "]");       
}

So, the eval parameter would look like: [1, 3, 2, 1, 'John (Finances)'], which is indeed an Array.

Demo: http://jsfiddle.net/56ThR/3/

And, it can also be written shorter like this:

var lines = input.split("\n");
var output = lines.map( function(el) { 
    return eval("[" + el.slice(1).slice(0, el.lastIndexOf(')') - 1) + "]");
});

Demo: http://jsfiddle.net/56ThR/4/

You can always do it "manually" :)

var lines = input.split("\n");
var output = [];

for(var v in lines){

    output[v] = [];

    // Remove opening (
    lines[v] = lines[v].slice(1);

    // Get integers
    for(var i = 0; i < 4; ++i){
         var pos = lines[v].indexOf(',');
         output[v][i] = parseInt(lines[v].slice(0, pos), 10);
         lines[v] = lines[v].slice(pos+1);   
    }

    // Get string between apostrophes
    lines[v] = lines[v].slice(lines[v].indexOf("'") + 1);
    output[v][4] = lines[v].slice(0, lines[v].indexOf("'"));
}

Demo: http://jsfiddle.net/56ThR/2/
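
One limitation of the manual version is that it closes the string at the first apostrophe it finds, so a name such as O'Brien would be cut short. A possible variation, assuming each line holds exactly one record and the string is the last quoted item on it, is to take the first ' as the opening quote and the last ' as the closing one. The function name parseManually is hypothetical.

```javascript
// Assumes one record per line, shaped like (int, int, int, int, 'text'),
function parseManually(input) {
    return input.split(/\r?\n/).filter(function (line) {
        return line.trim() !== ""; // ignore blank lines
    }).map(function (line) {
        var open = line.indexOf("'");      // first quote opens the string
        var close = line.lastIndexOf("'"); // last quote closes it
        if (open === -1 || close === open) {
            throw new Error("No quoted string in line: " + line);
        }
        // The integers sit between "(" and the opening quote.
        var ints = line.slice(line.indexOf("(") + 1, open).split(",")
            .slice(0, 4)
            .map(function (s) { return parseInt(s, 10); });
        return ints.concat(line.slice(open + 1, close));
    });
}

var manual = parseManually(
    "(3, 7, 3, 2, 'Gerald (Janitor), Broflowski'),\n" +
    "(4, 1, 1, 1, 'Pat O'Brien'),\n"
);
console.log(JSON.stringify(manual));
```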

What you have here is basically a CSV (comma-separated values) file which you wish to parse.

The easiest way would be to use an external library that will take care of most of the issues you have.

Example: the jquery-csv library is a good one. https://code.google.com/p/jquery-csv/
