c - Use flex to parse ":" and ".:" in expressions - Stack Overflow

I have a lex file, and based on the original, I would like to add support for recognizing (.:). My modifications are as follows:

%option caseless

ALP     [a-z]+
NUM     [0-9]+
REF     {ALP}{NUM}

_NAME_0     [^\-\^\]`~!@#$%&*()+=|{}[;:'",<>/ .1234567890?\n]
_NAME_1     {_NAME_0}|[0-9.?]
NAME        {_NAME_0}{_NAME_1}*

OP_1        ":"
OP_2        ".:"

T       (" "|\n)*

%s S1 S2

%%
<S1>{
    {T}{OP_1} { printf(": "); }
    {T}{OP_2} { printf(".: "); }
    {T}{NAME} {
        BEGIN(S2);
        yyless(0);
    }
}

<S2>{
    {T}{REF} { printf("tokRef "); BEGIN(S1); }
    {T}{NAME} { printf("tokName "); BEGIN(S1); }
}
%%

int main() {
    BEGIN(S1);
    yylex();
    return 0;
}

int yywrap() { return 1; }

The parsing results of existing rules for some inputs are as follows:

input               output
a1:b1        -->    tokRef : tokRef
a1   : b1    -->    tokRef : tokRef
a1.  : b1    -->    tokName : tokRef
a1  .: b1    -->    tokRef .: tokRef

a1.:b1       -->    tokName : tokRef

The output result of the last one is not as expected. I hope to get the following results:

a1.:b1       -->    tokRef .: tokRef

In my application, the original regular expression definitions, such as REF, NAME, OP_1, and T, cannot be modified. Can I achieve the desired effect by modifying the newly added regular expression (OP_2) or the state machine rules?

Asked Jan 18 at 23:21 by lijiang99; edited Jan 19 at 6:53.
  • Why can't OP_2 be .[ \t]*:? I mean other than taste. In general parser knobs prefer the whitespace decisions to lie in the parser, not the scanner. But even C's continuation (trailing \\ on line) is logically handled in the scanner. – mevets Commented Jan 19 at 0:13
  • Sorry for the confusion caused by my previous description, which wasn't clear enough. I have now updated the problem description. OP_2 and its rule {T}{OP_2} { printf(".: "); } are new additions I made on top of the original, and they can be modified. What cannot be modified are the original regular expressions, namely REF, NAME, OP_1, and T. Under the current conditions, even if I change OP_2 to \.[ \t]*":", the issue is still not resolved. @mevets – lijiang99 Commented Jan 19 at 6:59
  • {T} should be a separate rule that does nothing, not part of all the other rules. – user207421 Commented Jan 19 at 7:03
  • @mevets That is entirely incorrect. Parser guys use the scanner to deal with the whitespace, which never gets into the parser at all. Have a look at the grammar of practically any programming language. – user207421 Commented Jan 19 at 7:10
  • Try (1) the change I suggested above, (2) escaping the '.' in the definition, thus: "\.:", and (3) putting the rule for OP_2 before the rule for OP_1. – user207421 Commented Jan 22 at 3:55

1 Answer

It's easier to understand what's going on if you print which part of the input is matched by which rule. Add %option debug at the beginning of your program.

If we do that, for the input a1.:b1, we get the following:

--(end of buffer or a NUL)
--accepting rule at line 23 ("a1.")
--accepting rule at line 31 ("a1.")
--accepting rule at line 21 (":")
--accepting rule at line 23 ("b1")
--accepting rule at line 30 ("b1")
--(end of buffer or a NUL)
--accepting default rule ("
")
tokName : tokRef
--(end of buffer or a NUL)
--EOF (start condition 1)

NAME allows the character "." after its first character, and as you can see, it matches the string "a1." (including the trailing dot).

Can you change it, without editing the expression NAME? No.

The way Flex chooses between rules is simple: it always takes the longest match, and when two rules match the same length, the one listed first wins.

It will never match just "a1" as NAME when NAME can also match the longer string "a1.".

I'm not sure what your reasons are for keeping the original rules intact. As a workaround, you could add code to the action for NAME that checks whether the match ends with "." and applies some special logic in that case, but that will get complicated really fast. I can't help more without a better understanding of what you want to achieve.
