regexp-match: tune chunking of UTF-8 decoding

A `string-split` on a big string with lots of small matches sends the
regexp matcher the same big string many times. Decoding 1024 bytes of
UTF-8 on each of those calls is too much. Decoding 32 bytes is a better
trade-off between chunking for large matches and being lazy for small
matches.
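
As a rough illustration of the idea only (the actual matcher code is C, and
the names `read-more-chunk`, `decoded-upto`, and `need-upto` here are made up
for this sketch), lazy chunked decoding amounts to decoding just past what the
matcher currently needs, rounded up to a small chunk:

```racket
#lang racket/base

;; Minimal sketch of lazy chunked decoding (assumed names; the real code
;; also has to handle a multi-byte UTF-8 sequence split at the boundary).
(define chunk-size 32)

;; Decode `bstr` from `decoded-upto` up to `need-upto`, rounded up to the
;; next chunk boundary; return the decoded piece and the new position.
(define (read-more-chunk bstr decoded-upto need-upto)
  (define target
    (min (bytes-length bstr)
         (* chunk-size (ceiling (/ need-upto chunk-size)))))
  (values (bytes->string/utf-8 bstr #\? decoded-upto target)
          target))
```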

For example, on a 60MB string with a space every 15 characters or so,
splitting on a space is about 3 times as fast with this adjustment.
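
A rough way to reproduce that kind of measurement (my own sketch, not the
benchmark actually used; sizes and timings are approximate and
machine-dependent):

```racket
#lang racket

;; ~64 million characters: 16-character pieces, each ending in a space.
(define big
  (build-string (* 4 1000 1000 16)
                (lambda (i) (if (= 15 (modulo i 16)) #\space #\a))))

;; Splitting on a space produces ~4 million small matches, which is the
;; case where per-call decoding cost dominates.
(time (length (string-split big " ")))
```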

I tried a few chunk sizes, and 32 worked the best in my experiments.
Naturally, as more bytes are read, the chunk size ramps up, so it's
a question of initial size; larger matches are relatively insensitive to
the initial size (so, again, it makes little sense to cater to large
matches with a large initial decoding size of 1024 bytes).
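
One way to read "the chunk size ramps up" (an assumption on my part about the
exact growth policy) is that the amount decoded per request grows with how
much has already been decoded, so only the first few reads feel the initial
size:

```racket
#lang racket/base

;; Sketch of a ramping policy (assumed, not the actual C code): grow the
;; chunk in proportion to what has already been decoded.
(define (next-chunk already-decoded)
  (max 32 already-decoded))

;; Successive chunks: 32, 32, 64, 128, 256, ... so decoding n bytes takes
;; only O(log n) requests once past the initial 32-byte reads.
```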
commit 1809df456a
parent 53edc9f258
Author: Matthew Flatt
Date:   2014-07-24 15:38:55 +01:00


@@ -2912,7 +2912,7 @@ regtry(regexp *prog, char *string, int stringpos, int stringlen, rx_lazy_str_t *
     }
   }
 
-#define LAZY_STRING_CHUNK_SIZE 1024
+#define LAZY_STRING_CHUNK_SIZE 32
 
 static void read_more_from_lazy_string(Regwork *rw, rxpos need_total)
 {