c - Use flex to parse ":" and ".:" in expressions

I have a lex file, and based on the original, I would like to add support for recognizing (.:). My modifications are as follows:

%option caseless

ALP     [a-z]+
NUM     [0-9]+
REF     {ALP}{NUM}

_NAME_0     [^\-\^\]`~!@#$%&*()+=|{}[;:'",<>/ .1234567890?\n]
_NAME_1     {_NAME_0}|[0-9.?]
NAME        {_NAME_0}{_NAME_1}*

OP_1        ":"
OP_2        ".:"

T       (" "|\n)*

%s S1 S2

%%
<S1>{
    {T}{OP_1} { printf(": "); }
    {T}{OP_2} { printf(".: "); }
    {T}{NAME} {
        BEGIN(S2);
        yyless(0);
    }
}

<S2>{
    {T}{REF} { printf("tokRef "); BEGIN(S1); }
    {T}{NAME} { printf("tokName "); BEGIN(S1); }
}
%%

int main() {
    BEGIN(S1);
    yylex();
    return 0;
}

int yywrap() { return 1; }

The parsing results of existing rules for some inputs are as follows:

input               output
a1:b1        -->    tokRef : tokRef
a1   : b1    -->    tokRef : tokRef
a1.  : b1    -->    tokName : tokRef
a1  .: b1    -->    tokRef .: tokRef

a1.:b1       -->    tokName : tokRef

The output result of the last one is not as expected. I hope to get the following results:

a1.:b1       -->    tokRef .: tokRef

In my application, the original regular expression definitions, such as REF, NAME, OP_1, and T, cannot be modified. Can I achieve the desired effect by modifying the newly added regular expression (OP_2) or the state machine rules?

asked Jan 18 at 23:21 by lijiang99, edited Jan 19 at 6:53

Why can't OP_2 be .[ \t]*:? I mean other than taste. In general parser knobs prefer the whitespace decisions to lie in the parser, not the scanner. But even C's continuation (trailing \\ on line) is logically handled in the scanner. – mevets Commented Jan 19 at 0:13
@mevets Sorry for the confusion caused by my previous description, which wasn't clear enough. I have now updated the problem description. OP_2 and its {T}{OP_2} { printf(".: "); } are new additions I made based on the original, and they can be modified. What cannot be modified are the original regular expression contents, namely REF, NAME, OP_1, and T. Of course, under the current conditions, even if I modify OP_2 to \.[ \t]*":", the issue still cannot be resolved. – lijiang99 Commented Jan 19 at 6:59
{T} should be a separate rule that does nothing, not part of all the other rules. – user207421 Commented Jan 19 at 7:03
@mevets That is entirely incorrect. Parser guys use the scanner to deal with the whitespace, which never gets into the parser at all. Have a look at the grammar of practically any programming language. – user207421 Commented Jan 19 at 7:10
Try (1) the change I suggested above, (2) escaping the '.' in the definition, thus: "\.:", and (3) putting the rule for OP_2 before the rule for OP_1. – user207421 Commented Jan 22 at 3:55

1 Answer

It's easier to understand what's going on if you print which part of the input is matched by which rule. Add %option debug at the beginning of your lex file.

If we do that, for the input a1.:b1, we get the following:

--(end of buffer or a NUL)
--accepting rule at line 23 ("a1.")
--accepting rule at line 31 ("a1.")
--accepting rule at line 21 (":")
--accepting rule at line 23 ("b1")
--accepting rule at line 30 ("b1")
--(end of buffer or a NUL)
--accepting default rule ("
")
tokName : tokRef
--(end of buffer or a NUL)
--EOF (start condition 1)

NAME allows the character '.', and as you can see, it matches the string "a1." (trailing dot included).

Can you change it, without editing the expression NAME? No.

The way Flex chooses between rules is simple: it always takes the longest match, and if two rules match the same length, the rule listed first in the file wins.

It will never match just "a1" as NAME when it is allowed to match the longer string "a1.".
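To make the longest-match effect concrete, here is a small stand-alone C sketch that computes by hand how many characters REF and NAME would consume at the start of an input. It is not flex-generated code: the helper names (is_name0, is_name1, name_match_len, ref_match_len) are made up for illustration, the leading {T} whitespace is ignored, and the character sets are copied from the definitions in the question.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Characters excluded by _NAME_0 in the question's definitions. */
static const char NAME0_EXCLUDED[] =
    "-^]`~!@#$%&*()+=|{}[;:'\",<>/ .1234567890?\n";

/* _NAME_0: any character outside the excluded set (NUL = end of input). */
static int is_name0(char c)
{
    return c != '\0' && strchr(NAME0_EXCLUDED, c) == NULL;
}

/* _NAME_1: {_NAME_0} or one of [0-9.?]. */
static int is_name1(char c)
{
    return is_name0(c) || (c != '\0' && strchr("0123456789.?", c) != NULL);
}

/* Length of the longest prefix of s matching NAME = {_NAME_0}{_NAME_1}*. */
static size_t name_match_len(const char *s)
{
    assert(s != NULL);
    if (!is_name0(s[0]))
        return 0;
    size_t n = 1;
    while (is_name1(s[n]))
        n++;
    return n;
}

/* Length of the longest prefix of s matching REF = {ALP}{NUM} (caseless). */
static size_t ref_match_len(const char *s)
{
    assert(s != NULL);
    size_t i = 0;
    while ((s[i] >= 'a' && s[i] <= 'z') || (s[i] >= 'A' && s[i] <= 'Z'))
        i++;
    if (i == 0)
        return 0;                 /* no letters: REF cannot match */
    size_t digits_start = i;
    while (s[i] >= '0' && s[i] <= '9')
        i++;
    return i > digits_start ? i : 0;  /* at least one digit is required */
}
```

For "a1.:b1", name_match_len returns 3 and ref_match_len returns 2, so the NAME rule wins on length. For "a1  .: b1" both return 2, and the tie goes to the REF rule because it is listed first in S2.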

I'm not sure what your reasons are for keeping the original rules intact. Maybe you could work around it by adding code to the action for NAME that checks whether the match ends with '.' and applies some special logic there, but that will get complicated really fast. I can't help more without a better understanding of what you want to achieve.
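As a rough idea of what such a check might look like, the following C helper decides how much of a {T}{NAME} match to keep. Everything here is an assumption for illustration: the function name keep_len_for_ref_dot is hypothetical, and the sketch presumes the action can somehow learn the character that follows the match (in a flex action that would probably mean reading it with input() and pushing it back with unput(), which has its own pitfalls). If the match is a REF plus exactly one trailing '.', and the next character is ':', it reports one character less, so the action could hand the '.' back with yyless() and print tokRef instead of tokName.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the original scanner.
 * text/len: lexeme matched by {T}{NAME}; it may start with spaces/newlines.
 * next:     the input character right after the lexeme (0 if there is none).
 * Returns the number of lexeme characters to keep: len - 1 when the lexeme
 * is {T}{REF} followed by a single '.' and next is ':', otherwise len. */
static size_t keep_len_for_ref_dot(const char *text, size_t len, int next)
{
    assert(text != NULL);
    size_t i = 0;

    /* Skip the {T} prefix: spaces and newlines. */
    while (i < len && (text[i] == ' ' || text[i] == '\n'))
        i++;

    /* {ALP}: one or more letters (caseless, as in the question). */
    size_t alpha_start = i;
    while (i < len && ((text[i] >= 'a' && text[i] <= 'z') ||
                       (text[i] >= 'A' && text[i] <= 'Z')))
        i++;
    if (i == alpha_start)
        return len;

    /* {NUM}: one or more digits. */
    size_t digits_start = i;
    while (i < len && text[i] >= '0' && text[i] <= '9')
        i++;
    if (i == digits_start)
        return len;

    /* Exactly one '.' must remain, and the operator must continue with ':'. */
    if (i + 1 == len && text[i] == '.' && next == ':')
        return len - 1;

    return len;
}
```

With this check, "a1." followed by ':' would be cut back to "a1", while "a1." followed by a space (as in "a1.  : b1") stays a NAME, so the existing outputs in the question would not change. Whether this is robust enough for the real grammar is untested.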
