I have a lex file, and based on the original, I would like to add support for recognizing the operator .: (a dot immediately followed by a colon). My modifications are as follows:
%option caseless
ALP [a-z]+
NUM [0-9]+
REF {ALP}{NUM}
_NAME_0 [^\-\^\]`~!@#$%&*()+=|{}[;:'",<>/ .1234567890?\n]
_NAME_1 {_NAME_0}|[0-9.?]
NAME {_NAME_0}{_NAME_1}*
OP_1 ":"
OP_2 ".:"
T (" "|\n)*
%s S1 S2
%%
<S1>{
{T}{OP_1} { printf(": "); }
{T}{OP_2} { printf(".: "); }
{T}{NAME} {
    BEGIN(S2);
    yyless(0);
}
}
<S2>{
{T}{REF} { printf("tokRef "); BEGIN(S1); }
{T}{NAME} { printf("tokName "); BEGIN(S1); }
}
%%
int main() {
    BEGIN(S1);
    yylex();
    return 0;
}
int yywrap() { return 1; }
With these rules, some inputs are scanned as follows:
input          output
a1:b1      --> tokRef : tokRef
a1 : b1    --> tokRef : tokRef
a1. : b1   --> tokName : tokRef
a1 .: b1   --> tokRef .: tokRef
a1.:b1     --> tokName : tokRef
The output for the last input is not what I expect. I would like to get the following result instead:
a1.:b1     --> tokRef .: tokRef
In my application, the original regular expression definitions, such as REF, NAME, OP_1, and T, cannot be modified. Can I achieve the desired effect by modifying the newly added regular expression (OP_2) or the state machine rules?
1 Answer
It's easier to understand what's going on if you print which part of the input is matched by which rule. Add %option debug at the beginning of your program.
If we do that, for the input a1.:b1, we get the following:
--(end of buffer or a NUL)
--accepting rule at line 23 ("a1.")
--accepting rule at line 31 ("a1.")
--accepting rule at line 21 (":")
--accepting rule at line 23 ("b1")
--accepting rule at line 30 ("b1")
--(end of buffer or a NUL)
--accepting default rule ("
")
tokName : tokRef
--(end of buffer or a NUL)
--EOF (start condition 1)
NAME allows the character . and, as you can see, it matches the string a1. (including the trailing dot).
Can you change that without editing the expression NAME?
No.
Flex's matching rule is simple: it always takes the longest match, and when two rules match the same length it takes the one listed first.
It will never match just a1 to NAME if it is allowed to match the longer string a1. instead.
I'm not sure what your reasons are for keeping the original rules intact.
Maybe you could work around it by adding code to the action for NAME that checks whether the match ends with . and applies some special logic there, but that will get complicated really fast.
I can't help more without a better understanding of what you want to achieve.
[...] .[ \t]*: ? I mean other than taste. In general parser knobs prefer the whitespace decisions to lie in the parser, not the scanner. But even C's continuation (trailing \\ on line) is logically handled in the scanner. – mevets Commented Jan 19 at 0:13
[...] {T}{OP_2} { printf(".: "); } are new additions I made based on the original, and they can be modified. What cannot be modified are the original regular expression contents, namely REF, NAME, OP_1, and T. Of course, under the current conditions, even if I modify OP_2 to \.[ \t]*":", the issue still cannot be resolved. @mevets – lijiang99 Commented Jan 19 at 6:59
[...] {T} should be a separate rule that does nothing, not part of all the other rules. – user207421 Commented Jan 19 at 7:03
[...] "\.:", and (3) putting the rule for OP_2 before the rule for OP_1. – user207421 Commented Jan 22 at 3:55