regex - Why do most languages implement wildcard regular expressions inefficiently? -


i given link following article regarding implementation of regular expressions in many modern languages.

http://swtch.com/~rsc/regexp/regexp1.html

tl;dnr: regular expressions such (a?)^na^n fixed $n$ take exponential time matched against, say, a^n because implemented via backtracking on string when matching ? section. implementing these nfa keeping state lists makes more efficient obvious reasons

the details of how each language implements these isn't detailed (and article old), i'm curious: what, if any, drawbacks of using nfa opposed other implementation techniques. thing can come bells , whistles of libraries either a) building nfa features impractical or b) there conflicting performance issue between expression above , other, possibly more common, operation.

while possible construct dfas handle these complex cases (the tcl re engine, written henry spencer, proof example; article linked indicated performance data) it's exceptionally hard.

one key thing though if can detect never need matching group information, can (for many res, without internal backreferences) transform re 1 only uses parentheses grouping allowing more efficient re generated (so (a?){n}a{n} — i'm using modern conventional syntax — becomes equivalent a{n,2n}). backreferences break major optimisation; it's not nothing in henry's re code (alluded above) there code comment describing them “feature black lagoon”. 1 of best comments i've ever read in code (with exception of references academic papers describe algorithm encoded).

on other hand, perl/pcre style engines recursive-descent evaluation schemes, can ascribe saner set of semantics mixed greediness res, , many other things besides. (at extreme end of this, recursive patterns — (?r) et al — impossible automata-theoretic approaches. require stack match, making them formally not regular expressions.)

on practical level, cost of building nfa , dfa compile can quite high. need clever caching make not expensive. , on practical level, pcre , perl implementations have had lot more developer effort applied them.


Comments

Popular posts from this blog

php - regexp cyrillic filename not matches -

c# - OpenXML hanging while writing elements -

sql - Select Query has unexpected multiple records (MS Access) -