Regular Expressions: To Compile or not to Compile ?
When using regular expressions one often has to decide whether to compile the expressions before applying them.
Below is a simple test I ran and the results.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 |
tee ./regextest.py <<"_EOF_" import time, re text1 = '<h1>match me</h1>' regex1= '<h1>(.*?)</h1>' text2 = '<h2>match me</h2>' regex2= '<h2>(.*?)</h2>' text3 = '<h3>match me</h3>' regex3= '<h3>(.*?)</h3>' re_flags = re.IGNORECASE|re.MULTILINE|re.DOTALL|re.UNICODE arr =[] for (step,max_iterations) in [(1,10), (10,100),(100,1000),(1000,10000),(10000,100000),(100000,1000000)]: arr += [num for num in range (step, max_iterations + step, step)] for max in arr: print('-'*50) t1 = time.time() re_compiled = re.compile(regex1+str(max), re_flags) # pre-compiling and storing regex for _ in range(0, max): re_compiled.findall(text1+str(max)+' '+str(_)) t2 = time.time() print("%5i calls in % 8.3f ms - compiled_once"%(max, (t2-t1)*1000.0) ) t3 = time.time() regex2_max = regex2+str(max) # doing this here to avoid costs of string concatation in a loop for _ in range(0, max): re.compile(regex2_max, re_flags).findall(text2+str(max)+' '+str(_)) t4 = time.time() print("%5i calls in % 8.3f ms - compiled_every_call"%(max, (t4-t3)*1000.0) ) t5 = time.time() regex3_max = regex3+str(max)# doing this here to avoid costs of string concatation in a loop for _ in range(0, max): re.findall(regex3_max, text3+str(max)+' '+str(_), re_flags) t6 = time.time() print("%5i calls in % 8.3f ms - uncompiled"%(max, (t6-t5)*1000.0) ) print('-'*50) _EOF_ |
Results
1-1.000.000 Calls
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 |
> python regextest.py python regextest.py -------------------------------------------------- 1 calls in 0.194 ms - compiled_once 1 calls in 0.124 ms - compiled_every_call 1 calls in 0.124 ms - uncompiled -------------------------------------------------- 2 calls in 0.116 ms - compiled_once 2 calls in 0.132 ms - compiled_every_call 2 calls in 0.173 ms - uncompiled -------------------------------------------------- 3 calls in 0.117 ms - compiled_once 3 calls in 0.134 ms - compiled_every_call 3 calls in 0.133 ms - uncompiled -------------------------------------------------- 4 calls in 0.129 ms - compiled_once 4 calls in 0.175 ms - compiled_every_call 4 calls in 0.128 ms - uncompiled -------------------------------------------------- 5 calls in 0.122 ms - compiled_once 5 calls in 0.129 ms - compiled_every_call 5 calls in 0.135 ms - uncompiled -------------------------------------------------- 6 calls in 0.133 ms - compiled_once 6 calls in 0.144 ms - compiled_every_call 6 calls in 0.139 ms - uncompiled -------------------------------------------------- 7 calls in 0.126 ms - compiled_once 7 calls in 0.144 ms - compiled_every_call 7 calls in 0.146 ms - uncompiled -------------------------------------------------- 8 calls in 0.141 ms - compiled_once 8 calls in 0.151 ms - compiled_every_call 8 calls in 0.146 ms - uncompiled -------------------------------------------------- 9 calls in 0.136 ms - compiled_once 9 calls in 0.155 ms - compiled_every_call 9 calls in 0.143 ms - uncompiled -------------------------------------------------- 10 calls in 0.139 ms - compiled_once 10 calls in 0.160 ms - compiled_every_call 10 calls in 0.161 ms - uncompiled -------------------------------------------------- 10 calls in 0.029 ms - compiled_once 10 calls in 0.054 ms - compiled_every_call 10 calls in 0.042 ms - uncompiled -------------------------------------------------- 20 calls in 0.155 ms - compiled_once 20 calls in 0.198 ms - compiled_every_call 20 calls in 0.228 ms - uncompiled -------------------------------------------------- 30 calls in 0.175 ms - compiled_once 30 calls in 0.219 ms - compiled_every_call 30 calls in 0.223 ms - uncompiled -------------------------------------------------- 40 calls in 0.195 ms - compiled_once 40 calls in 0.260 ms - compiled_every_call 40 calls in 0.257 ms - uncompiled -------------------------------------------------- 50 calls in 0.214 ms - compiled_once 50 calls in 0.289 ms - compiled_every_call 50 calls in 0.294 ms - uncompiled -------------------------------------------------- 60 calls in 0.253 ms - compiled_once 60 calls in 0.319 ms - compiled_every_call 60 calls in 0.327 ms - uncompiled -------------------------------------------------- 70 calls in 0.259 ms - compiled_once 70 calls in 0.377 ms - compiled_every_call 70 calls in 0.368 ms - uncompiled -------------------------------------------------- 80 calls in 0.274 ms - compiled_once 80 calls in 0.408 ms - compiled_every_call 80 calls in 0.408 ms - uncompiled -------------------------------------------------- 90 calls in 0.298 ms - compiled_once 90 calls in 0.425 ms - compiled_every_call 90 calls in 0.445 ms - uncompiled -------------------------------------------------- 100 calls in 0.323 ms - compiled_once 100 calls in 0.486 ms - compiled_every_call 100 calls in 0.670 ms - uncompiled -------------------------------------------------- 100 calls in 0.211 ms - compiled_once 100 calls in 0.354 ms - compiled_every_call 100 calls in 0.360 ms - uncompiled -------------------------------------------------- 200 calls in 0.535 ms - compiled_once 200 calls in 0.796 ms - compiled_every_call 200 calls in 0.849 ms - uncompiled -------------------------------------------------- 300 calls in 0.741 ms - compiled_once 300 calls in 1.282 ms - compiled_every_call 300 calls in 1.123 ms - uncompiled -------------------------------------------------- 400 calls in 1.067 ms - compiled_once 400 calls in 1.504 ms - compiled_every_call 400 calls in 1.596 ms - uncompiled -------------------------------------------------- 500 calls in 1.123 ms - compiled_once 500 calls in 1.906 ms - compiled_every_call 500 calls in 1.853 ms - uncompiled -------------------------------------------------- 600 calls in 1.295 ms - compiled_once 600 calls in 2.165 ms - compiled_every_call 600 calls in 2.265 ms - uncompiled -------------------------------------------------- 700 calls in 1.524 ms - compiled_once 700 calls in 2.682 ms - compiled_every_call 700 calls in 2.670 ms - uncompiled -------------------------------------------------- 800 calls in 1.714 ms - compiled_once 800 calls in 2.885 ms - compiled_every_call 800 calls in 3.064 ms - uncompiled -------------------------------------------------- 900 calls in 1.926 ms - compiled_once 900 calls in 3.324 ms - compiled_every_call 900 calls in 3.186 ms - uncompiled -------------------------------------------------- 1000 calls in 2.056 ms - compiled_once 1000 calls in 3.557 ms - compiled_every_call 1000 calls in 3.983 ms - uncompiled -------------------------------------------------- 1000 calls in 2.058 ms - compiled_once 1000 calls in 3.790 ms - compiled_every_call 1000 calls in 3.671 ms - uncompiled -------------------------------------------------- 2000 calls in 4.053 ms - compiled_once 2000 calls in 7.037 ms - compiled_every_call 2000 calls in 7.269 ms - uncompiled -------------------------------------------------- 3000 calls in 6.313 ms - compiled_once 3000 calls in 10.690 ms - compiled_every_call 3000 calls in 10.786 ms - uncompiled -------------------------------------------------- 4000 calls in 8.843 ms - compiled_once 4000 calls in 14.790 ms - compiled_every_call 4000 calls in 14.555 ms - uncompiled -------------------------------------------------- 5000 calls in 10.437 ms - compiled_once 5000 calls in 17.659 ms - compiled_every_call 5000 calls in 23.852 ms - uncompiled -------------------------------------------------- 6000 calls in 12.337 ms - compiled_once 6000 calls in 21.825 ms - compiled_every_call 6000 calls in 21.183 ms - uncompiled -------------------------------------------------- 7000 calls in 14.900 ms - compiled_once 7000 calls in 25.740 ms - compiled_every_call 7000 calls in 26.754 ms - uncompiled -------------------------------------------------- 8000 calls in 17.000 ms - compiled_once 8000 calls in 30.551 ms - compiled_every_call 8000 calls in 33.780 ms - uncompiled -------------------------------------------------- 9000 calls in 18.430 ms - compiled_once 9000 calls in 31.664 ms - compiled_every_call 9000 calls in 32.703 ms - uncompiled -------------------------------------------------- 10000 calls in 22.263 ms - compiled_once 10000 calls in 37.236 ms - compiled_every_call 10000 calls in 36.782 ms - uncompiled -------------------------------------------------- 10000 calls in 22.298 ms - compiled_once 10000 calls in 36.057 ms - compiled_every_call 10000 calls in 36.947 ms - uncompiled -------------------------------------------------- 20000 calls in 41.064 ms - compiled_once 20000 calls in 72.689 ms - compiled_every_call 20000 calls in 72.712 ms - uncompiled -------------------------------------------------- 30000 calls in 60.894 ms - compiled_once 30000 calls in 108.503 ms - compiled_every_call 30000 calls in 105.168 ms - uncompiled -------------------------------------------------- 40000 calls in 80.844 ms - compiled_once 40000 calls in 138.031 ms - compiled_every_call 40000 calls in 146.786 ms - uncompiled -------------------------------------------------- 50000 calls in 101.330 ms - compiled_once 50000 calls in 177.560 ms - compiled_every_call 50000 calls in 173.693 ms - uncompiled -------------------------------------------------- 60000 calls in 128.290 ms - compiled_once 60000 calls in 213.301 ms - compiled_every_call 60000 calls in 223.361 ms - uncompiled -------------------------------------------------- 70000 calls in 141.226 ms - compiled_once 70000 calls in 248.786 ms - compiled_every_call 70000 calls in 246.285 ms - uncompiled -------------------------------------------------- 80000 calls in 170.583 ms - compiled_once 80000 calls in 284.985 ms - compiled_every_call 80000 calls in 311.732 ms - uncompiled -------------------------------------------------- 90000 calls in 189.461 ms - compiled_once 90000 calls in 329.027 ms - compiled_every_call 90000 calls in 326.422 ms - uncompiled -------------------------------------------------- 100000 calls in 217.344 ms - compiled_once 100000 calls in 351.583 ms - compiled_every_call 100000 calls in 363.048 ms - uncompiled -------------------------------------------------- 100000 calls in 202.079 ms - compiled_once 100000 calls in 356.967 ms - compiled_every_call 100000 calls in 366.751 ms - uncompiled -------------------------------------------------- 200000 calls in 410.131 ms - compiled_once 200000 calls in 707.814 ms - compiled_every_call 200000 calls in 700.848 ms - uncompiled -------------------------------------------------- 300000 calls in 607.796 ms - compiled_once 300000 calls in 1033.031 ms - compiled_every_call 300000 calls in 1073.781 ms - uncompiled -------------------------------------------------- 400000 calls in 806.791 ms - compiled_once 400000 calls in 1417.941 ms - compiled_every_call 400000 calls in 1385.285 ms - uncompiled -------------------------------------------------- 500000 calls in 1007.073 ms - compiled_once 500000 calls in 1705.375 ms - compiled_every_call 500000 calls in 1862.589 ms - uncompiled -------------------------------------------------- 600000 calls in 1208.090 ms - compiled_once 600000 calls in 2125.673 ms - compiled_every_call 600000 calls in 2082.850 ms - uncompiled -------------------------------------------------- 700000 calls in 1429.684 ms - compiled_once 700000 calls in 2401.972 ms - compiled_every_call 700000 calls in 2543.541 ms - uncompiled -------------------------------------------------- 800000 calls in 1662.572 ms - compiled_once 800000 calls in 2849.011 ms - compiled_every_call 800000 calls in 2805.143 ms - uncompiled -------------------------------------------------- 900000 calls in 1940.630 ms - compiled_once 900000 calls in 3150.189 ms - compiled_every_call 900000 calls in 3312.602 ms - uncompiled -------------------------------------------------- 1000000 calls in 2137.428 ms - compiled_once 1000000 calls in 3732.527 ms - compiled_every_call 1000000 calls in 3600.674 ms - uncompiled -------------------------------------------------- |
re.py module
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 |
def findall(pattern, string, flags=0): """Return a list of all non-overlapping matches in the string. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.""" return _compile(pattern, flags).findall(string) # .... _cache = {} _cache_repl = {} _pattern_type = type(sre_compile.compile("", 0)) _MAXCACHE = 100 def _compile(*key): # internal: compile pattern cachekey = (type(key[0]),) + key p = _cache.get(cachekey) if p is not None: return p pattern, flags = key if isinstance(pattern, _pattern_type): if flags: raise ValueError('Cannot process flags argument with a compiled pattern') return pattern if not sre_compile.isstring(pattern): raise TypeError, "first argument must be string or compiled pattern" try: p = sre_compile.compile(pattern, flags) except error, v: raise error, v # invalid expression if len(_cache) >= _MAXCACHE: _cache.clear() _cache[cachekey] = p return p |
Conclusion
1. Despite several tests, the very first call is always somewhat slower that others.
This could be due to the RE module file includes and initialisation.
2. Reusing pre-compiled regular expressions does indeed improve the execution speed (1.5-2x)
My previous tests showed much bigger differences, but this turned out to be nothing but a test error.
Python string concatenation “got” in the way.
regex2_max = regex2+str(max) <= this is expensive, especially when doing it in a loop the current tests now ensures that the same number of string concatenation are done in on 3 times of tests. 3. Pre-compiling for each call everytime is meaningless because re.match compiles the regular expression before matching any ways. 4. The timing differences between pre-compiled or uncompiled regular expressions is much smaller(1.5-2x), than I even anticipated 5. Based on the numbers above, pre-compilation of regular expressions is an over-optimization UNLESS: a. one has more than 100 different regular expression and b. they are ALL used very frequently - as in 1.000 of times in a few minutes. It may be interesting to actually count the number of times the regular expression are accessed beforehand.
References
1. Python Regular Expression Documentation: http://docs.python.org/2/library/re.html
2. Is Regex Pre-Compilation Worth It: http://stackoverflow.com/questions/452104/is-it-worth-using-pythons-re-compile