CPP - Editorial

gamabunta · May 15, 2013, 9:41pm

PROBLEM LINKS

DIFFICULTY

HARD

PREREQUISITES

String Parsing, Predictive Parsing of an LL(1) Grammar, Regular Expressions

PROBLEM

You are given a string which defines what an ID should look like followed by a list of IDs. You have to report which one of them are valid.

QUICK EXPLANATION

Refer to the problem statement to carefully understand the description of the grammar for parsing the expressions that describe the valid IDs for each test case. Following that description should lead you to the following context free grammar (from the Problem Setter’s notes).

Tokens

or      : "or"
times   : "times"
optional: "optional"
letter  : "letter"
letters : "letters"
digit   : "digit"
digits  : "digits"
to      : "to"
exactly : "exactly"
upto    : "upto"
l       : [A-Z]
d       : [0-9]
num     : [0-9]{2,} //take care of leading zeroes separately
string  : [A-Z0-9]+

Grammar

S   : E S'
    ;

S'  : E S'
    |
    ;

E   : T E'
    ;

E'  : or E
    |
    ;

T   : F T'
    ;

T'  : num T''
    | d T''
    | optional T'
    |
    ;

T'' : times T'
    |
    ;

F   : d F1
    | l F2
    | upto num F3
    | upto d F3
    | exactly num F4
    | exactly d F4
    | digit
    | letter
    | string
    | num
    | ( S )
    ;

F1  : to d
    |
    ;

F2  : to l
    |
    ;

F3  : digits
    | letters
    ;

F4  : digits
    | letters
    ;

Note that “|” followed by nothing indicates the ‘epsilon production’.

The grammar is indeed a LL(1) grammar. This means that you can always figure out which of the above rules to apply after parsing some part of the string, just by looking at the next token.

EXPLANATION

After taking the input you may first want to clean the spaces in the expression

Add spaces around paranthesis
Remove repeated spaces

This ensures that all tokens are now separated by single space. Now you can tokenize the string and mark each token to the type of token provided.

The next step is to parse the list of tokens according to the rules of the grammar provided. It is important to note that given a LL(1) grammar, you can parse the array of tokens in linear time. The tester’s solution below uses the Recursive Descent Parsing with the simplification that because the grammar is LL(1), you never have to back track.

The outcome of parsing the expression is creating a meaningful regular expression. As well as, calculating the largest possible ID that can be generated so as to discard the case if it exceeds the bound on the length of the IDs.

The GNU C library has <regex.h> include that lets you use create a regex using the POSIX regular expressions syntax.

Once you have the regular expression compiled, you can execute it on all the IDs to check whether they match or they don’t.

Refer to the Setter’s Solution for a clean implementation of the Predictive parsing and learn how the various rules map to ideas in POSIX regular expressions. Each rule’s corresponding method are named accordingly and the approach is easy to follow.

SETTER’S SOLUTION

Can be found here. (some permission issue on uploaded solution, fixing ASAP)

TESTER’S SOLUTION

Can be found here.

mugurelionut · May 15, 2013, 11:40pm

I cannot access the setter’s and tester’s solutions at the moment, so I am not entirely sure, but, from reading the editorial, it seems that the algorithmically interesting part (deciding if a string matches a regular expression) was “delegated” to the regex.h header/library. What is the time complexity of:

“compiling” a regular expression ? (I am using the term from the editorial)
deciding if a given string matches the regular expression ?

For me, efficiently deciding if a given string matches the regular expression was the most interesting part of the problem. I first built an evaluation tree for the expression (although not in linear time because of parantheses) and then I built a non-deterministic finite automaton (NFA) out of the evaluation tree (the NFA can be built in linear time in the size of the tree). Then, using the NFA, we can decide if a string of length L matches the initial expression or not in O(L * N), where N = the number of states of the NFA (although this is usually much faster, because the number of matched states at each step is usually much smaller than N).

The alternative would have been to turn the NFA into a deterministic finite automaton (DFA) in which case deciding if a string matches the regular expression takes only O(L) time. The drawback is that the DFA may have an exponential number of states relative to the number of states of the NFA (and even if this were not the case in this problem, building the DFA would still be algorithmically challenging).

Personally I would have expected some discussion about automata in the editorial (because that’s what regex.h probably uses).

cyberax · May 16, 2013, 12:52am

i totally agree with you. i didn’t succeed in solving this problem because of the second part (meaning matching the string ids to their regexp format). the re module in Python is waaayyyy too slow to handle the number of strings provided by the input test case, and i was convinced that this part could have been implemented with a finite state automaton (but unfortunately, i’ve been really busy at work at the end of the contest, and i’ve not been able to find time to implement it properly).

http://www.codechef.com/viewsolution/2134402

shasha1s2 · May 16, 2013, 12:56am

When will be setter’s and tester’s sol accessible

bhambya · May 16, 2013, 12:59am

mugurelionut is right, if we use regex.h to do the job, this problem will become fairly easy. I think the toughest part of the problem was upto N digits/letters and there should have been at least some explanation of it

gamabunta · May 16, 2013, 1:31am

From the discussions on this problem and going through setter’s and tester’s solutions I assumed that this was the approach I could most easily explain. I never bothered to read and understand how a regex works, but do have some idea about FSM.

I did some googling and found this link very helpful in understanding just what, implementing regex is about.

The implementation varies from language to language. glibc implements NFA Algorithm with caching to DFA. Python uses backtracking.

bugkiller · May 16, 2013, 1:31am

Yes, @pulkit has done a similar thing. Nice solution though. http://www.codechef.com/viewsolution/2122210

vinayak_garg · May 16, 2013, 1:37am

My solution in practice section.

It has 3 parts-

DFA to do lexical analysis. Which breaks Format string into tokens and lexemes.
Predictive parser for LL(1) grammar. The grammar of Format string is represented as-

S -> E S | E
E -> T or E | T
T -> F num times T | F optional T | F
F -> d to d | l to l | digit | letter | upto num digits | exactly num digits | upto num letters | exactly num letters | string | ( S )

But this grammar needs to be left-factored and the left-recursion needs to be removed before building predictive parser. The editorial shows Left factored and left recursion eliminated version of this grammar.
In predictive parsing, we essentially build a parsing table which has some production to be applied for each input symbol. This in turn requires building First and Follow Set from Left factored and left recursion eliminated grammar. I am skipping the details as the theory is best read from book.
Essential fact is that this parsing is linear time. And apart from parsing we perform some actions for each production (or rule in grammar) applied. I performed three actions while parsing-
1. Error checking - If there is no matching entry in parsing table for current symbol, the format is erroneous and rejected.
2. Construct regex pattern. More below.
3. Check if format can match ID string which exceed constraint on ID string.
It was quite easy to see that the format string are simply describing regular expressions. Like A to D can be converted to [A-D]. Each keyword (optional, to, or, times etc.) could be converted to regex pattern. So while parsing we construct this pattern string. And now we “compile” it using regex.h regcomp(). Once compiled we simply match each ID string using regexec() in linear time.

Note-

A more elegant solution is “Recursive descent parser”, but since it uses backtracking, it is too slow.
No one used predictive parsing in solution. It is too much of theory and not much used in practice. Also it is a top-down parser, while there are bottom-up parser also which are much more powerful.
Only few used regex library in the solution. Most language (C, python, Java) have it.
STL regex could not be used unfortunately. As it requires C++11. And regex.h is not available on MSVC.

EDIT - The test cases were weak for the given constraints on length of F and length of IDS. Ideally I should have lowered them further. I apologize if it prevented you from solving the problem.

gamabunta · May 16, 2013, 1:39am

@bhambya I am sorry the solutions are still not up I have asked the codechef team to fix the permission issues, but the recent migrations of the site (for performance) has kept them quite busy.

The upto N digits part is actually very simple using POSIX style regex(s) by using [0-9]{0,N} or [A-Z]{0,N}.

I left that part to be discovered from the Setter’s Solution which is really very very clean.

vinayak_garg · May 16, 2013, 1:57am

Setter’s solution - http://www.codechef.com/viewsolution/2164314

cyberax · May 16, 2013, 3:44am

oh ! thank you for the link, it’s exactly what i was looking for !

inseder · May 16, 2013, 11:57am

My opinion is that is kind of akward that an editorial indicates that a problem should be solved by “GNU C library” (probably the submission should have been restricted to C/C++). I tried the same thing in Java but it’s too slow.

Related with NFA @mugurelionut: can you please explain the matching in O(N*L). I have tried a simple DF traversal on NFA with first matching char edges and only after empty edges.
Thanks.

anton_lunyov · May 17, 2013, 12:23pm

Actually, it’s impossible to solve the problem using regex.h for every possible input under the given constraints. After several annoying hours of thinking how to solve the problem and reading several advanced papers where the best known solutions to regexp matching problem were described, I’ve found test case where NFA has a lot of states and a lot of epsilon-transitions (about 30000 states), while it accepts only strings of length upto 2000. I’ve decided to mail the testgen to [email protected] and what was the surprise when author responded that his solution is too slow on this test and official test data are “easy solvable” using some “practical approach”. Here is the testgen of this testcase:

#include <cstdio>
#include <algorithm>
#include <string>
using namespace std;

int main()
{
  freopen("XXX.in","wb",stdout);
  string base = "(upto 12 digits 12 times ab)12 times";
  string col[100];
  int k=0;
  for(int a=0;a<10;a++)
    for(int b=0;b<10;b++)
    {
      base[25] = a+'0';
      base[26] = b+'0';
      col[k++] = base;
    }
  int T = 100;
  printf("%d\n", T);
  for(int t=0;t<T;t++) {
    random_shuffle(col, col+100);
    string pat = col[0];
    for(int i=1;pat.size() + 3 + col[i].size() <= 1000;i++)
      pat += " or" + col[i];
    puts(pat.c_str());
    int Q = 1000;
    printf("%d\n",Q);
    for(int q=0;q<Q;q++) {
      int L = 78;
      string quer = "";
      for(int i=0;i<L;i++)
        quer += rand()%10 + '0';
      puts(quer.c_str());
    }
  }
  return 0;
}

Actually much tougher tests could be created but even this one seems tough enough.
Let me know if someone has a solution that could solve this test in time (without exploiting its explicit structure).

I even believe that it is impossible to solve the problem by any known algorithm under the given constraints.

mugurelionut · May 17, 2013, 1:56pm

Well… we start at the initial state of the NFA and “expand” the set of reachable states along the empty edges. Then, for each character from the string, we first move along the char edge from each reachable state and then we “expand” the set of (new) reachable states along the empty edges. Since each state has O(1) empty edges in this NFA (unless I computed something wrong), each “expansion” along empty edges takes only O(N) time (e.g. use a BFS and initially add to the queue all the reachable states). Thus, you get O(N*L) overall.

mugurelionut · May 17, 2013, 2:10pm

@anton_lunyov: I can confirm that my solution CANNOT solve this test case under the given time constraints (the reason is, of course, the large number of states and transitions in the NFA). Since the NFA is too slow, I am wondering if it could be turned into a DFA (with a sufficiently low number of states and sufficiently quickly)… Now I am also wondering if anyone could solve the problem within the given time constraints. I guess I could try to run the source code of the other contestants, but I do not have enough time now…

inseder · May 17, 2013, 2:33pm

@mugurelionut - I have tried this exact algorithm in java but it seems that java is not good enough (TLE). Thak you very much for explanation. My solution here - http://www.codechef.com/viewsolution/2164945

balajiganapath · May 17, 2013, 2:45pm

@mugurelionut @anton_lunyov @vinayak.garg My solution takes about 2 minutes on this input! The NFA my code constructs has 43800 states for this input. During the contest, I was looking for ways to convert the NFA into a DFA. I had submitted a solution using only NFA to check some assertions when, to my surprise, it got accepted. I suspected that the test cases were weak, but I never expected that it would take 2 minutes.

anton_lunyov · May 17, 2013, 2:54pm

I forgot that after, I invented even tougher test case where NFA (quite compressed) has about 160000 states. Here the testgen:

#include <cstdio>
#include <algorithm>
#include <string>
using namespace std;

int main()
{
  freopen("XXXX.in","wb",stdout);
  string suffix = ")optional 10 times 9 times A)11 times";
  int T = 100;
  printf("%d\n", T);
  for(int t=0;t<T;t++) {
    int len=0;
    string col[26*26];
    for(char a='A';a<='Z';a++)
      for(char b='A';b<='Z';b++)
      {
        string cur = "";
        cur += a;
        cur += b;
        col[len++] = cur;
      }
    random_shuffle(col, col+len);
    string pat = "((" + col[0];
    for(int i=1; pat.size() + 6 + suffix.size() <= 1000;i++)
      pat += " or " + col[i];
    pat += suffix;
    if(pat.size() > 1000) exit(1);
    puts(pat.c_str());
    int Q = 1000;
    printf("%d\n",Q);
    for(int q=0;q<Q;q++) {
      int L = 79;
      string quer = "";
      for(int i=0;i<L;i++)
        quer += rand()%26 + 'A';
      quer += "AA";
      puts(quer.c_str());
    }
  }
  return 0;
}

The structure of the test is

((AA or AB or AC or ... or AZ or BA or BB or ...)optional 10 times 9 times A) 11 times

but with randomness in choosing pairs of chars. Constraint 1000 allows to use up to 160 pairs of chars. So before applying times commands we already have NFA with about 160 states using more or less compressed form of NFA (less compressed form have 320 states, while original Thompson construction about 480 or even more). Then we should copy this NFA 999 times to get extremely large one. Moreover optional, used here, guarantees that when checking strings the average number of states we are in, will be comparable with overall number of states in NFA.

@balajiganapath
It seems that this a principal problem setter position here. He knows about possible weakness of test data but intentionally set high constraints and generate only some more or less random and corner cases, easy solvable using regex.h.

UPD
In tesgen I assumed that 1MB contained 2²⁰ bytes. If problem setter comment about size of input assumes that 1MB is 1000000 bytes then 79 in int L = 79; should decreased a bit.

vinayak_garg · May 17, 2013, 6:18pm

@inseder: I ran your solution and it takes 33 sec on first file (2.0 KB) and gives WA. One case it fails on

1

(((38)or(38)or(38))or(UM))

2

38

UM

It is quite strange that NFA approach is taking so much time. I unfortunately am not comfortable enough with java to find the reason.

anton_lunyov · May 17, 2013, 7:53pm

BTW.
Here is probably the most recent article about regexp matching problem that improves all algorithms known before.
I learned it more or less fully but algorithm described there is quite cumbersome to code and improvement of asymptotic is rather small comparing with classical simple Thompson NFA approach, so possibly it will not give any time improvement.