it's my first question here. I tried to find an answer but couldn't, honestly, figure out which terms should I use, so sorry if it has been asked before.
Here it goes: I have thousands of records in a .txt file, in this format:
(1, 3, 2, 1, 'John (Finances)'),
(2, 7, 2, 1, 'Mary Jane'),
(3, 7, 3, 2, 'Gerald (Janitor), Broflowski'),
... and so on. The first value is the PK, the other 3 are Foreign Keys, the 5th is a string.
I need to parse them as JSON (or something) in Javascript, but I'm having troubles because some strings have parentheses+comma (on 3rd record, "Janitor", e.g.), so I can't use substring... maybe trimming the right part, but I was wondering if there is some smarter way to parse it.
Any help would be really appreciated.
Thanks!
it's my first question here. I tried to find an answer but couldn't, honestly, figure out which terms should I use, so sorry if it has been asked before.
Here it goes: I have thousands of records in a .txt file, in this format:
(1, 3, 2, 1, 'John (Finances)'),
(2, 7, 2, 1, 'Mary Jane'),
(3, 7, 3, 2, 'Gerald (Janitor), Broflowski'),
... and so on. The first value is the PK, the other 3 are Foreign Keys, the 5th is a string.
I need to parse them as JSON (or something) in Javascript, but I'm having troubles because some strings have parentheses+comma (on 3rd record, "Janitor", e.g.), so I can't use substring... maybe trimming the right part, but I was wondering if there is some smarter way to parse it.
Any help would be really appreciated.
Thanks!
Share Improve this question edited Jun 17, 2014 at 6:19 Andrew Savinykh 26.3k20 gold badges107 silver badges164 bronze badges asked Jun 16, 2014 at 21:16 FabioFabio 1139 bronze badges 4- looks like you can .split(/'\)\,\s*/) to put each one in it's own array slot, and [].map it from there... – dandavis Commented Jun 16, 2014 at 21:20
- 1 Ok, but some strings have ")," or comma alone... then it could result in more then 5 positions in the array, isn't? – Fabio Commented Jun 16, 2014 at 21:23
- inside the iteration, you can split on comma, and then set col[4]=col.slice(4).join(",") to get any commas and text in the string back. if that style of solution does not work, you need more consistent or explicit input... – dandavis Commented Jun 16, 2014 at 21:25
- 2 wait, is each entry on it's own line in the file? (question edited) if so, then simply split by lines and then turn each line into an array – dandavis Commented Jun 16, 2014 at 21:36
5 Answers
Reset to default 14You can't (read probably shouldn't) use a regular expression for this. What if the parentheses contain another pair or one is mismatched?
The good news is that you can easily construct a tokenizer/parser for this. The idea is to keep track of your current state and act accordingly.
Here is a sketch for a parser I've just written here, the point is to show you the general idea. Let me know if you have any conceptual questions about it.
It works demo here but I beg you not to use it in production before understanding and patching it.
How it works
So, how do we build a parser:
var State = { // remember which state the parser is at.
BeforeRecord:0, // at the (
DuringInts:1, // at one of the integers
DuringString:2, // reading the name string
AfterRecord:3 // after the )
};
We'll need to keep track of the output, and the current working object since we'll parse these one at a time.
var records = []; // to contain the results
var state = State.BeforeRecord;
Now, we iterate the string, keep progressing in it and read the next character
for(var i = 0;i < input.length; i++){
if(state === State.BeforeRecord){
// handle logic when in (
}
...
if(state === State.AfterRecord){
// handle that state
}
}
Now, all that's left is to consume it into the object at each state:
- If it's at
(
we start parsing and skip any whitespaces - Read all the integers and ditch the
,
- After four integers, read the string from
'
to the next'
reaching the end of it - After the string, read until the
)
, store the object, and start the cycle again.
The implementation is not very difficult too.
The parser
var State = { // keep track of the state
BeforeRecord:0,
DuringInts:1,
DuringString:2,
AfterRecord:3
};
var records = []; // to contain the results
var state = State.BeforeRecord;
var input = " (1, 3, 2, 1, 'John (Finances)'), (2, 7, 2, 1, 'Mary Jane'), (3, 7, 3, 2, 'Gerald (Janitor), Broflowski')," // sample input
var workingRecord = {}; // what we're reading into.
for(var i = 0;i < input.length; i++){
var token = input[i]; // read the current input
if(state === State.BeforeRecord){ // before reading a record
if(token === ' ') continue; // ignore whitespaces between records
if(token === '('){ state = State.DuringInts; continue; }
throw new Error("Expected ( before new record");
}
if(state === State.DuringInts){
if(token === ' ') continue; // ignore whitespace
for(var j = 0; j < 4; j++){
if(token === ' ') {token = input[++i]; j--; continue;} // ignore whitespace
var curNum = '';
while(token != ","){
if(!/[0-9]/.test(token)) throw new Error("Expected number, got " + token);
curNum += token;
token = input[++i]; // get the next token
}
workingRecord[j] = Number(curNum); // set the data on the record
token = input[++i]; // remove the comma
}
state = State.DuringString;
continue; // progress the loop
}
if(state === State.DuringString){
if(token === ' ') continue; // skip whitespace
if(token === "'"){
var str = "";
token = input[++i];
var lenGuard = 1000;
while(token !== "'"){
str+=token;
if(lenGuard-- === 0) throw new Error("Error, string length bounded by 1000");
token = input[++i];
}
workingRecord.str = str;
token = input[++i]; // remove )
state = State.AfterRecord;
continue;
}
}
if(state === State.AfterRecord){
if(token === ' ') continue; // ignore whitespace
if(token === ',') { // got the "," between records
state = State.BeforeRecord;
records.push(workingRecord);
workingRecord = {}; // new record;
continue;
}
throw new Error("Invalid token found " + token);
}
}
console.log(records); // logs [Object, Object, Object]
// each object has four numbers and a string, for example
// records[0][0] is 1, records[0][1] is 3 and so on,
// records[0].str is "John (Finances)"
I echo Ben's sentiments about regular expressions usually being bad for this, and I completely agree with him that tokenizers are the best tool here.
However, given a few caveats, you can use a regular expression here. This is because any ambiguities in your (
, )
, ,
and '
can be attributed (AFAIK) to your final column; as all of the other columns will always be integers.
So, given:
- The input is perfectly formed (with no unexpected
(
,)
,,
or'
). - Each record is on a new line, per your edit
- The only new lines in your input will be to break to the next record
... the following should work (Note "new lines" here are \n
. If they're \r\n
, change them accordingly):
var input = /* Your input */;
var output = input.split(/\n/g).map(function (cols) {
cols = cols.match(/^\((\d+), (\d+), (\d+), (\d+), '(.*)'\)/).slice(1);
return cols.slice(0, 4).map(Number).concat(cols[4]);
});
The code splits on new lines, then goes through row by row and splits into cells using a regular expression, which greedily attributes as much as it can to the final cell. It then turns the first 4 elements into integers, and sticks the 5th element (the string) onto the end.
This gives you an array of records, where each record is itself an array. The first 4 elements are your PK's (as integers) and your 5th element is the string.
For example, given your input, use output[0][4]
to get "Gerald (Janitor), Broflowski"
, and output[1][0]
to get the first PK 2
for the second record (don't forget JavaScript arrays are zero-indexed).
You can see it working here: http://jsfiddle.net/56ThR/
Another option would be to convert it into something that looks like an Array
and eval
it. I know it is not recommended to use eval, but it's a cool solution :)
var lines = input.split("\n");
var output = [];
for(var v in lines){
// Remove opening (
lines[v] = lines[v].slice(1);
// Remove closing ) and what is after
lines[v] = lines[v].slice(0, lines[v].lastIndexOf(')'));
output[v] = eval("[" + lines[v] + "]");
}
So, the eval parameter
would look like: [1, 3, 2, 1, 'John (Finances)']
, which is indeed an Array.
Demo: http://jsfiddle.net/56ThR/3/
And, it can also be written shorter like this:
var lines = input.split("\n");
var output = lines.map( function(el) {
return eval("[" + el.slice(1).slice(0, el.lastIndexOf(')') - 1) + "]");
});
Demo: http://jsfiddle.net/56ThR/4/
You can always do it "manually" :)
var lines = input.split("\n");
var output = [];
for(var v in lines){
output[v] = [];
// Remove opening (
lines[v] = lines[v].slice(1);
// Get integers
for(var i = 0; i < 4; ++i){
var pos = lines[v].indexOf(',');
output[v][i] = parseInt(lines[v].slice(0, pos));
lines[v] = lines[v].slice(pos+1);
}
// Get string betwen apostrophes
lines[v] = lines[v].slice(lines[v].indexOf("'") + 1);
output[v][4] = lines[v].slice(0, lines[v].indexOf("'"));
}
Demo: http://jsfiddle.net/56ThR/2/
What you have here is basically a csv (comma separated value) file which you wish to parse.
The easiest way would be to use an wxternal library that will take care of most of the issues you have
Example: jquery csv library is a good one. https://code.google.com/p/jquery-csv/