I am learning about regular expressions and would like an explanation of the following:
- pattern matching with regular expressions
- parsing with regular expressions
- Why is it fine to pattern match HTML strings, but parsing HTML is not?
If you can give me an example of each, it would really help me understand.
This is where my question comes from:
You can fish for specific bits of info using RegExps, but you can’t parse (i.e. extract the tree structure of) a language with arbitrary nesting using a regular grammar (which RegExps implement and extend a bit).
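As a rough illustration of "fishing", here is a minimal Python sketch that pulls `href` values out of an HTML string. The sample string and the name `HREF_RE` are made up for this example, and the pattern assumes double-quoted attribute values. It finds the bits it is looking for without knowing anything about how the tags nest.

```python
import re

# Hypothetical sample input; any HTML fragment with links would do.
html = '<p>See <a href="https://example.com/a">A</a> and <a class="x" href="/b">B</a>.</p>'

# Match an <a ...> opening tag and capture the double-quoted href value.
# This looks at flat text only; it has no idea which element contains which.
HREF_RE = re.compile(r'<a\b[^>]*?\bhref="([^"]*)"', re.IGNORECASE)

links = HREF_RE.findall(html)
print(links)
```

This is pattern matching: it answers "which links appear in this text?", not "what is the structure of this document?".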
RegExps can parse languages with nested tags up to a finite depth that is specified in the RegExp. If your RegExp can go three levels deep, and you feed it a source document with four levels of nesting, it will fail. You could write a RegExp for four levels, but it would fail at five, etc…
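The depth limit can be seen in a small Python sketch. The helper `nested_div_pattern` is hypothetical and only handles bare `<div>` tags with plain text between them; it builds a pattern by hand for a fixed number of nesting levels, and that number is baked into the pattern.

```python
import re

def nested_div_pattern(depth: int) -> re.Pattern:
    """Build a regex for text containing <div>...</div> nested at most `depth` levels."""
    content = r"[^<]*"  # level 0: plain text, no tags at all
    for _ in range(depth):
        # Level k: text interleaved with complete divs whose content is level k-1.
        content = rf"[^<]*(?:<div>{content}</div>[^<]*)*"
    return re.compile(content)

three_levels = "<div><div><div>x</div></div></div>"
four_levels = "<div>" + three_levels + "</div>"

depth3 = nested_div_pattern(3)
depth4 = nested_div_pattern(4)

print(bool(depth3.fullmatch(three_levels)))  # the depth-3 pattern accepts 3 levels
print(bool(depth3.fullmatch(four_levels)))   # and rejects 4 levels
print(bool(depth4.fullmatch(four_levels)))   # a depth-4 pattern is needed for that
```

However large `depth` is made, a document nested one level deeper will not match, which is the point of the paragraph above.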
Even a context-free grammar probably wouldn’t do it, though: HTML as it is accepted by browsers (also known as "tag soup") is full of quirks and fallbacks that are hard or impossible to express without a Turing-complete system.
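For contrast, here is a minimal sketch of actual parsing in Python, using the standard library's `html.parser` plus an explicit stack to build a tree. The `TreeBuilder` class and the `parse_tree` and `max_depth` helpers are invented for this example. It ignores text and attributes, assumes well-nested input with no void elements such as `<br>`, and does not attempt any of the browser-style recovery for tag soup mentioned above.

```python
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    """Builds a nested dict tree of elements from tag events, using a stack."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "children": []}
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Naive rule: close an element only if it is the innermost open one.
        if len(self.stack) > 1 and self.stack[-1]["tag"] == tag:
            self.stack.pop()

def parse_tree(html: str) -> dict:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def max_depth(node: dict) -> int:
    """Number of element levels below this node."""
    return max((1 + max_depth(child) for child in node["children"]), default=0)

tree = parse_tree("<div>" * 5 + "x" + "</div>" * 5)
print(max_depth(tree))
```

The stack is what a plain regular expression lacks: it can grow as deep as the input requires, so the same code handles five levels or fifty without being rewritten.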
Answered By – Pygy
Answer Checked By – Pedro (AngularFixing Volunteer)