tcl regexp performance

Discussion:

(too old to reply)

2022-04-05 09:47:07 UTC

I was reading an article [1] about algorithms implementing regular expressions when I saw this picture [2], tcl regex performance is six orders of magnitude faster than perl regex for certain regular expression, I was really shocked and couldn't believe it, so I've checked, and that's true, tcl regex works in microseconds where perl regex consume seconds:

% set s [string repeat a 30]
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
% set r "[string repeat a? 30][string repeat a 30]"
a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
% time {regex $r $s m}
199 microseconds per iteration

now 1649120014 s
ok
exec time: 53.366160 s
now 1649120014 s
dif 0 s
Presione una tecla para continuar . . .

$ time ./p.pl
ok
real 0m 52.30s
user 0m 52.12s
sys 0m 0.01s

with p.l :

#!/bin/perl
$s = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
if ( $s =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/ ){
print "ok\n";
}

I really cannot believe perl regex which is the de facto standard performs so badly, the same applies to pcre lib and so most scripting languages like python, ruby... It's true this is only in certain cases which may be not so common but the article gives examples of common regex affected.

Another point to tcl bag ;-)

[1] https://swtch.com/~rsc/regexp/regexp1.html
[2] Loading Image...

2022-04-05 09:49:41 UTC

Permalink

Post by pd
now 1649120014 s
ok
exec time: 53.366160 s
now 1649120014 s
dif 0 s
Presione una tecla para continuar . . .

you can ignore this, is just another test with extra checking, you can see exec time is 53 s about the same of the linux shell measure with time p.pl

Harald Oehlmann

2022-04-05 10:04:21 UTC

Permalink

Post by pd

Post by pd
now 1649120014 s
ok
exec time: 53.366160 s
now 1649120014 s
dif 0 s
Presione una tecla para continuar . . .

you can ignore this, is just another test with extra checking, you can see exec time is 53 s about the same of the linux shell measure with time p.pl

PD,

thank you. Yes, TCL is sometimes quite fast.
AFAIK, the used regexp lib requires UTF-16 data input.
In consequence, each string must be converted from utf-8 to utf-16
first. As TCL has no utf-16 internal string type, the conversion is done
quite frequently. Also other utf-16 interfaces like the Windows OS API
require this constant conversion. There is the idea to also support
utf-16 as internal data type form many commands, so we do not need to
recode so often. I suppose, this would again boost the regexp performance.

Take care,
Harald

2022-04-05 10:38:08 UTC

Permalink

Post by Harald Oehlmann
thank you. Yes, TCL is sometimes quite fast.
AFAIK, the used regexp lib requires UTF-16 data input.
In consequence, each string must be converted from utf-8 to utf-16
first. As TCL has no utf-16 internal string type, the conversion is done
quite frequently. Also other utf-16 interfaces like the Windows OS API
require this constant conversion. There is the idea to also support
utf-16 as internal data type form many commands, so we do not need to
recode so often. I suppose, this would again boost the regexp performance.

it's really a very good job for tcl developers team, and if performance will be increased even more with internal support of utf-16 then it will be lightning fast, really today there's no reason to choose other scripting langs specially perl for heavy regex or string processing tasks, tcl should be considered as a real winner.

For me it's very surprissing this perl (and similar) performance because perl is *the* language when you think of regex processing and pcre is *the* algorithm for implementing it, I'm aware poor performance should be corner cases but it's so significative I really cannot believe a better algorithm has been already chosen.

2022-04-05 10:32:00 UTC

Permalink

Post by pd
$ time ./p.pl
ok
real 0m 52.30s
user 0m 52.12s
sys 0m 0.01s
#!/bin/perl
$s = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
if ( $s =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/ ){
print "ok\n";
}

in case you think the time difference is not due to regex processing itself but the overtime due to if and print here you are the prove it is not the case:

$ time ./p2.pl
real 0m 53.08s
user 0m 52.57s
sys 0m 0.03s

with p2.pl :

#!/bin/perl
$s = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
$s =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/ ;

Christian Gollwitzer

2022-04-05 18:48:06 UTC

Permalink

Post by pd
I was reading an article [1] about algorithms implementing regular expressions when I saw this picture [2], tcl regex performance is six orders of magnitude faster than perl regex for certain regular expression, I was really shocked and couldn't believe it

Yes, it is true, now here comes the drawback: this speed advtantage
comes for certain regular expressions where there are many backtracking
points in the RE. It only occurs in specially crafted REs. If you
benchmark real-world REs, then Tcl is on the average 2x slower then the
Perl RE engine, and especially compared to the pcre JIT compiler.
Furthermore, the engine is not any longer maintained IIRC. I cannot off
the top of my head find the citations for this, but overall, I'm not
overly convinced, unfortunately.

Christian

2022-04-06 13:27:03 UTC

Permalink

Post by Christian Gollwitzer
Furthermore, the engine is not any longer maintained IIRC. I cannot off
the top of my head find the citations for this, but overall, I'm not
overly convinced, unfortunately.

do you mean tcl regex it's not manteined? what a pity!

jtyler

2022-04-06 19:25:37 UTC

Permalink

Post by pd
% set s [string repeat a 30]
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
% set r "[string repeat a? 30][string repeat a 30]"
a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
% time {regex $r $s m}
199 microseconds per iteration

A lot goes on behind the scenes in tcl. There's caching of the r.e.
expression going on somewhere. I'm not sure if the r.e. engine caches
them (say after compiling a r.e.). Or, perhaps tcl's object system does
it or both.

The below try with a second variable, and making a copy it's not likely
to know about, suggests to me the r.e. engine is doing the caching.

One should probably also use the repeat factor with the [time] command
to get a better timing.

1 % set s [string repeat a 30]
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2 % set r "[string repeat a? 30][string repeat a 30]"
a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

3 % time {regexp $r $s m}
344 microseconds per iteration

4 % time {regexp $r $s m}
96 microseconds per iteration

5 % set r2 $r
a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

6 % time {regexp $r2 $s m}
100 microseconds per iteration

7 % set r2 [string range $r 0 end]
a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

8 % time {regexp $r2 $s m}
95 microseconds per iteration

Christian Gollwitzer

2022-04-06 20:00:17 UTC

Permalink

Post by jtyler

A lot goes on behind the scenes in tcl. There's caching of the r.e.
expression going on somewhere.

That's not the reason: the regexp engine in Tcl transform the NFA for
the regexp into a DFA - those are two standard implementations for
regexp state machines. For some specially crafted REs like the ones
above, with many overlapping matches, the NFA becomes very slow -
exponential runtime in the length of the expression, IIRC - whereas the
DFA still works quickly.

You cand find an overview comparison of these algorithms here:

https://zherczeg.github.io/sljit/regex_compare.html

Christian

jtyler

2022-04-06 22:13:52 UTC

Permalink

Post by Christian Gollwitzer

Post by jtyler

A lot goes on behind the scenes in tcl. There's caching of the r.e.
expression going on somewhere.

That's not the reason: the regexp engine in Tcl transform the NFA for
the regexp into a DFA - those are two standard implementations for
regexp state machines. For some specially crafted REs like the ones
above, with many overlapping matches, the NFA becomes very slow -
exponential runtime in the length of the expression, IIRC - whereas the
DFA still works quickly.
https://zherczeg.github.io/sljit/regex_compare.html
Christian

Isn't there some caching going on as well?

I see things have changed since I last studied r.e. using Kernighan's
book software tools way back in my youth. But I would imagine that the
division into compiling an r.e. and matching likely still exist and the
difference between the first run and the second could be that the r.e.
string had been processed, and apparently doesn't need to be done in the
second run.

Or can you think of some other reason why the second run is so much
faster than the first?

I seem to recall reading somewhere that the tcl_obj has a type of r.e.
that it can reuse in the same way as shimmering between strings and
binary numbers. Could that be the case here?

Christian Gollwitzer

2022-04-07 05:26:49 UTC

Permalink

Post by jtyler

Post by Christian Gollwitzer

Post by jtyler

A lot goes on behind the scenes in tcl. There's caching of the r.e.
expression going on somewhere.

That's not the reason: the regexp engine in Tcl transform the NFA for
the regexp into a DFA - those are two standard implementations for
regexp state machines. For some specially crafted REs like the ones
above, with many overlapping matches, the NFA becomes very slow -
exponential runtime in the length of the expression, IIRC - whereas
the DFA still works quickly.
https://zherczeg.github.io/sljit/regex_compare.html
Christian

Isn't there some caching going on as well?

Yes, indeed, the compilation of the regexp is cached in the internal
representation of the Tcl_Obj for the regexp.

Post by jtyler
Or can you think of some other reason why the second run is so much
faster than the first?

You are right that this is the reason that the second run is faster than
the first. I was answering to the huge speed difference (seconds vs.
microseconds) between the Perl RE engine and the Tcl RE engine for this
particular (maliciously crafted) RE that the OP mentioned. Sorry for
being unclear.

Post by jtyler
I seem to recall reading somewhere that the tcl_obj has a type of r.e.
that it can reuse in the same way as shimmering between strings and
binary numbers. Could that be the case here?

You can check this for yourself easily with
tcl::unsupported::representation:

(chris) 50 % set RE {a*b}
a*b
(chris) 51 % ::tcl::unsupported::representation $RE
value is a pure string with a refcount of 5, object pointer at
0x7f823f35a7f0, string representation "a*b"
(chris) 52 % regexp $RE "blaaaaaab"
1
(chris) 53 % ::tcl::unsupported::representation $RE
value is a regexp with a refcount of 4, object pointer at
0x7f823f35a7f0, internal representation 0x7f823f36b910:0x7f823f35b150,
string representation "a*b"

Christian

jtyler

2022-04-07 18:16:51 UTC

Permalink

Post by Christian Gollwitzer
You can check this for yourself easily with
(chris) 50 % set RE {a*b}
a*b
(chris) 51 % ::tcl::unsupported::representation $RE
value is a pure string with a refcount of 5, object pointer at
0x7f823f35a7f0, string representation "a*b"
(chris) 52 % regexp $RE "blaaaaaab"
1
(chris) 53 % ::tcl::unsupported::representation $RE
value is a regexp with a refcount of 4, object pointer at
0x7f823f35a7f0, internal representation 0x7f823f36b910:0x7f823f35b150,
string representation "a*b"
Christian

Thanks, I was confusing tcl objects, I thought the object was the
variable, not the data.

Below I see that it doesn't increment the ref counter for the $RE, but
does when a*b is entered literally, but they are the same object.

Most interesting, time to re-read Ashok's book chapter 10.

% set RE {a*b}
a*b

% ::tcl::unsupported::representation $RE
value is a pure string with a refcount of 4, object pointer at 04E0BA00,
string representation "a*b"

% ::tcl::unsupported::representation $RE
value is a pure string with a refcount of 4, object pointer at 04E0BA00,
string representation "a*b"

% ::tcl::unsupported::representation a*b
value is a pure string with a refcount of 5, object pointer at 04E0BA00,
string representation "a*b"

% ::tcl::unsupported::representation a*b
value is a pure string with a refcount of 6, object pointer at 04E0BA00,
string representation "a*b"

Christian Gollwitzer

2022-04-07 19:47:56 UTC

Permalink

Post by jtyler
Thanks, I was confusing tcl objects, I thought the object was the
variable, not the data.
Below I see that it doesn't increment the ref counter for the $RE, but
does when a*b is entered literally, but they are the same object.

...this is complicated. If you look at the reference count values ifrom
the the interactive shell, in order to print it, the refcount increases
for the string. Addditionally, for literals there is some kind of
literal caching going on. In a compiled proc, the bytecode compiler also
moves the literal out of the place so that repeated execution of a
command wit a literal value in a loop uses always the cached Tcl_obj and
so on.

Not easy to follow this in detail, in the code you just need to make
sure that every increment is paired with a decrement and then the
lifetime management works correctly.

Christian