regexp-match: tune chunking of UTF-8 decoding

A `string-split` on a big string with lots of small matches sends the
regexp matcher the same big string many times. Decoding 1024 bytes of
UTF-8 on each of those calls is too much. Decoding 32 bytes is a better
trade-off between chunking for large matches and being lazy for small
matches.
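
As a rough illustration of the idea only (the actual matcher code is C, and
the names `read-more-chunk`, `decoded-upto`, and `need-upto` here are made up
for this sketch), lazy chunked decoding amounts to decoding just past what the
matcher currently needs, rounded up to a small chunk:

```racket
#lang racket/base

;; Minimal sketch of lazy chunked decoding (assumed names; the real code
;; also has to handle a multi-byte UTF-8 sequence split at the boundary).
(define chunk-size 32)

;; Decode `bstr` from `decoded-upto` up to `need-upto`, rounded up to the
;; next chunk boundary; return the decoded piece and the new position.
(define (read-more-chunk bstr decoded-upto need-upto)
  (define target
    (min (bytes-length bstr)
         (* chunk-size (ceiling (/ need-upto chunk-size)))))
  (values (bytes->string/utf-8 bstr #\? decoded-upto target)
          target))
```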

For example, on a 60MB string with a space every 15 characters or so,
splitting on a space is about 3 times as fast with this adjustment.
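
A rough way to reproduce that kind of measurement (my own sketch, not the
benchmark actually used; sizes and timings are approximate and
machine-dependent):

```racket
#lang racket

;; ~64 million characters: 16-character pieces, each ending in a space.
(define big
  (build-string (* 4 1000 1000 16)
                (lambda (i) (if (= 15 (modulo i 16)) #\space #\a))))

;; Splitting on a space produces ~4 million small matches, which is the
;; case where per-call decoding cost dominates.
(time (length (string-split big " ")))
```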

I tried a few chunk sizes, and 32 worked the best in my experiments.
Naturally, as more bytes are read, the chunk size ramps up, so it's
a question of initial size; larger matches are relatively insensitive to
the initial size (so, again, it makes little sense to cater to large
matches with a large initial decoding size of 1024 bytes).
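
One way to read "the chunk size ramps up" (an assumption on my part about the
exact growth policy) is that the amount decoded per request grows with how
much has already been decoded, so only the first few reads feel the initial
size:

```racket
#lang racket/base

;; Sketch of a ramping policy (assumed, not the actual C code): grow the
;; chunk in proportion to what has already been decoded.
(define (next-chunk already-decoded)
  (max 32 already-decoded))

;; Successive chunks: 32, 32, 64, 128, 256, ... so decoding n bytes takes
;; only O(log n) requests once past the initial 32-byte reads.
```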
commit 1809df456a
parent 53edc9f258
Author: Matthew Flatt
Date:   2014-07-24 15:38:55 +01:00


@@ -2912,7 +2912,7 @@ regtry(regexp *prog, char *string, int stringpos, int stringlen, rx_lazy_str_t *
     }
   }
 
-#define LAZY_STRING_CHUNK_SIZE 1024
+#define LAZY_STRING_CHUNK_SIZE 32
 
 static void read_more_from_lazy_string(Regwork *rw, rxpos need_total)
 {