regexp-match: tune chunking of UTF-8 decoding
A `string-split` on a big string with lots of small matches sends the regexp matcher the big string many times. Decoding 1024 bytes each time is too much; decoding 32 bytes is a better trade-off between chunking for large matches and staying lazy for small matches. For example, on a 60MB string with a space every 15 characters or so, splitting on a space is about 3 times as fast with this adjustment. I tried a few chunk sizes, and 32 worked best in my experiments. Naturally, the chunk size ramps up as more bytes are read, so this is only a question of the initial size; larger matches are relatively insensitive to the initial size (so, again, it makes little sense to cater to large matches with a large initial decoding size of 1024 bytes).
parent 53edc9f258
commit 1809df456a
@@ -2912,7 +2912,7 @@ regtry(regexp *prog, char *string, int stringpos, int stringlen, rx_lazy_str_t *
 }
 }
 
-#define LAZY_STRING_CHUNK_SIZE 1024
+#define LAZY_STRING_CHUNK_SIZE 32
 
 static void read_more_from_lazy_string(Regwork *rw, rxpos need_total)
 {
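The ramp-up that the commit message refers to is not visible in this hunk. As a rough illustration only, here is a minimal sketch of a lazy reader whose chunk size starts small and doubles on each additional request; the struct, field names, and doubling policy are assumptions for the sketch, not the actual read_more_from_lazy_string implementation.

#include <stddef.h>
#include <string.h>

#define INITIAL_CHUNK_SIZE 32

/* Hypothetical lazy-string state: `src` is the raw (e.g. UTF-8) input,
   `decoded`/`decoded_len` hold what has been converted so far, and
   `chunk_size` starts at 32 and grows as more bytes are requested. */
typedef struct {
  const char *src;
  size_t src_len;
  size_t src_pos;        /* how much of `src` has been consumed */
  char *decoded;         /* decoded buffer (preallocated for this sketch) */
  size_t decoded_len;
  size_t chunk_size;     /* starts at INITIAL_CHUNK_SIZE */
} lazy_str;

/* Ensure at least `need_total` decoded bytes are available, decoding in
   chunks that double on every iteration, so large matches quickly reach
   big reads while small matches pay only for ~32 bytes up front. */
static void read_more(lazy_str *ls, size_t need_total)
{
  while (ls->decoded_len < need_total && ls->src_pos < ls->src_len) {
    size_t want = ls->chunk_size;
    if (want > ls->src_len - ls->src_pos)
      want = ls->src_len - ls->src_pos;

    /* A real implementation would run a UTF-8 decoder here; this sketch
       just copies bytes to keep the example self-contained. */
    memcpy(ls->decoded + ls->decoded_len, ls->src + ls->src_pos, want);
    ls->decoded_len += want;
    ls->src_pos += want;

    /* Ramp up: each additional request decodes a larger chunk. */
    ls->chunk_size *= 2;
  }
}

With this kind of policy, the trade-off in the commit is only about the initial value: starting at 1024 makes every tiny match on a huge string pay for a kilobyte of decoding, while starting at 32 keeps small matches cheap and the doubling still catches up quickly for long matches.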