Way too often, I want to know whether a filename string “ends with” anything in a list of various file extensions. In a pinch, I could use an `if` statement with multiple conditions, like this:
```python
def is_probably_image_file(filename: str) -> bool:
    if (
        filename.endswith(".bmp")
        or filename.endswith(".gif")
        or filename.endswith(".jpeg")
        or filename.endswith(".jpg")
        or filename.endswith(".png")
        or filename.endswith(".tif")
        or filename.endswith(".tiff")
        or filename.endswith(".webp")
    ):
        return True
    else:
        return False
```
(For today, let’s set aside the alternative strategy of checking the file’s magic number.)
More generally, I’ve been thinking about the efficiency of checking multiple possible suffixes against a query string using Python’s `endswith`. I’m a huge trie fan, so it occurred to me that maybe loading the suffixes into a trie would permit a more efficient comparison between the query string and the suffixes.
Lest we prematurely lurch toward optimization, there are some factors we ought to consider first, such as how often we’ll have to perform this `endswith` check. Is it a handful of times? And are there just a few suffixes to check? A humble `if` statement might do the trick in such cases. But suppose we know the check will be performed many times (say, on a continuous schedule or inside a hot loop). Or what if we have a very long list of suffixes to check against? Now the space usage and/or compute overhead of a trie, hash table, or other sophisticated `multi-endswith` strategy could be a justifiable tradeoff.
I heard Python 3.11 got a speed boost, so exploring `multi-endswith` seemed like a fun way to test-drive 3.11 while investigating whether a trie might be more efficient than Python’s built-in `endswith`.
All my Python code for this experiment is on GitHub. I wrote and tested four implementations of a `multi-endswith`:
- using Python’s built-in `endswith`, `any`, and a generator expression over the suffixes (sketched below)
- again using `endswith`, but this time passing in a tuple of the suffixes as an argument (also sketched below)
- using a trie class, creating and populating a new trie every time the function was invoked during the timing suite (so, a cold trie, as opposed to the hot trie described next)
- using a trie class again, but this time creating and prepopulating it before the timing suite starts and using this loaded trie for all invocations of the associated function during the suite (so, a hot trie)
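For concreteness, here’s roughly what the first two implementations look like. This is a minimal sketch; the signatures are my guess at the shape of the real versions, which live in the repo.

```python
def multi_endswith_generator(filename: str, suffixes: list[str]) -> bool:
    # any() short-circuits as soon as one suffix matches
    return any(filename.endswith(suffix) for suffix in suffixes)


def multi_endswith_tuple(filename: str, suffixes: tuple[str, ...]) -> bool:
    # endswith() accepts a tuple of suffixes directly
    return filename.endswith(suffixes)
```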
I used `timeit` to keep track of things. I experimented with values for `timeit`’s `number=` argument from 1,000 up to 1,000,000 by powers of 10, and it didn’t seem to affect the relative outcomes, so I used `number=1000` for a swifter finish.
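As a rough illustration (not the repo’s actual driver), each implementation got timed along these lines, using the `multi_endswith_tuple` sketch from above; the filename is a placeholder:

```python
import timeit

suffixes = (".bmp", ".gif", ".jpeg", ".jpg", ".png", ".tif", ".tiff", ".webp")

elapsed = timeit.timeit(
    lambda: multi_endswith_tuple("holiday-photo.jpeg", suffixes),
    number=1000,  # larger values didn't change the relative rankings
)
print(elapsed)
```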
I ran things on my PC on the `python:3.11.0` Docker image using the following incantation. (I verified that `/usr/local/bin/python` in this image really was Python 3.11.0.)

```
docker run -it --volume L:\Dropbox\py\python-multi-endswith:/root/py python:3.11.0 /usr/local/bin/python /root/py/main.py
```
I guess that’s enough about methodology. Let’s check out the results.
With a mere 8 suffixes in the `endswith` lookup, here’s how the functions ranked. Python’s built-in `endswith` using a tuple beats the two next-place contenders (the naive approach and the hot trie) by a factor of about 3. (For just this first run of 8 suffixes, I included in the timing suite what I termed a “naive” implementation, which is simply the multiple-condition `if` implementation shown at the top of this blog post.) At this early stage, we can also see the cold trie’s object-creation overhead hammering its performance. Finally, just a quick reminder that my Python code for all this is in the repo.
(all values are multiples of the first result, sorted by median)

| function | min | max | median |
| --- | --- | --- | --- |
| `multi_endswith_tuple` | 1.000 | 1.000 | 1.000 |
| `multi_endswith_naive` | 3.028 | 2.908 | 3.027 |
| `multi_endswith_hot_trie` | 3.146 | 2.752 | 3.212 |
| `multi_endswith_generator` | 8.331 | 6.851 | 8.328 |
| `multi_endswith_cold_trie` | 66.453 | 51.766 | 65.858 |
Now we do it again but with 1,008 suffixes. Python’s built-in `endswith`-and-tuple continues to shine, outperforming the next-fastest implementation (the hot trie) by a factor of 3.3.
(all values are multiples of the first result, sorted by median)

| function | min | max | median |
| --- | --- | --- | --- |
| `multi_endswith_tuple` | 1.000 | 1.000 | 1.000 |
| `multi_endswith_hot_trie` | 3.157 | 3.219 | 3.303 |
| `multi_endswith_generator` | 8.181 | 8.131 | 8.207 |
| `multi_endswith_cold_trie` | 4760.420 | 4317.807 | 4769.483 |
One more time, now with a whopping 10,008 suffixes. Although the hot trie narrows the gap somewhat, `endswith`-and-tuple decisively wins again, trouncing the hot trie by a factor of 2.3.
(all values are multiples of the first result, sorted by median)

| function | min | max | median |
| --- | --- | --- | --- |
| `multi_endswith_tuple` | 1.000 | 1.000 | 1.000 |
| `multi_endswith_hot_trie` | 2.305 | 2.264 | 2.316 |
| `multi_endswith_generator` | 8.458 | 9.229 | 8.749 |
| `multi_endswith_cold_trie` | 40847.541 | 39204.021 | 40158.531 |
Honestly, the results surprised me! I was sure that as the suffix count soared toward 10,000, the hot trie would pull ahead, thanks to the optimized way it stores and searches the suffixes.
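In case it helps to see what I mean, here’s a minimal sketch of how a suffix trie can back a `multi-endswith` check, assuming an approach like the one in my trie class (the real code is in the repo): store each suffix reversed, then walk the query string from its last character backward.

```python
class SuffixTrie:
    """A hypothetical reversed-suffix trie for multi-endswith checks."""

    def __init__(self, suffixes):
        self.root = {}
        for suffix in suffixes:
            node = self.root
            for ch in reversed(suffix):      # store each suffix reversed
                node = node.setdefault(ch, {})
            node["$"] = True                 # terminal marker

    def matches_end(self, s: str) -> bool:
        node = self.root
        for ch in reversed(s):               # walk the query from its end
            if "$" in node:
                return True                  # a stored suffix ended here
            node = node.get(ch)
            if node is None:
                return False
        return "$" in node                   # s itself equals a stored suffix
```

A lookup costs at most the length of the longest suffix, regardless of how many suffixes the trie holds, which is exactly why I expected the hot trie to win at scale.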
Right about now you’re probably thinking the same thing I was: what’s cooking in Python’s `endswith` source code? So here is the CPython source for Python 3.11’s `endswith` (source):
```c
static PyObject *
unicode_endswith(PyObject *self,
                 PyObject *args)
{
    PyObject *subobj;
    PyObject *substring;
    Py_ssize_t start = 0;
    Py_ssize_t end = PY_SSIZE_T_MAX;
    int result;

    if (!stringlib_parse_args_finds("endswith", args, &subobj, &start, &end))
        return NULL;
    if (PyTuple_Check(subobj)) {
        Py_ssize_t i;
        for (i = 0; i < PyTuple_GET_SIZE(subobj); i++) {
            substring = PyTuple_GET_ITEM(subobj, i);
            if (!PyUnicode_Check(substring)) {
                PyErr_Format(PyExc_TypeError,
                             "tuple for endswith must only contain str, "
                             "not %.100s",
                             Py_TYPE(substring)->tp_name);
                return NULL;
            }
            result = tailmatch(self, substring, start, end, +1);
            if (result == -1)
                return NULL;
            if (result) {
                Py_RETURN_TRUE;
            }
        }
        Py_RETURN_FALSE;
    }
    if (!PyUnicode_Check(subobj)) {
        PyErr_Format(PyExc_TypeError,
                     "endswith first arg must be str or "
                     "a tuple of str, not %.100s", Py_TYPE(subobj)->tp_name);
        return NULL;
    }
    result = tailmatch(self, subobj, start, end, +1);
    if (result == -1)
        return NULL;
    return PyBool_FromLong(result);
}
```
Line 13701 of the file, `for (i = 0; i < PyTuple_GET_SIZE(subobj); i++)`, confirms that Python 3.11’s `endswith` processes the tuple linearly. (By the way, `endswith`’s sibling function `startswith`, which sits immediately above `endswith` in the source, has basically the same linear, looping algorithmic structure.) Evidently the strategy of tackling things one at a time works for `endswith`! There may also be more going on that I haven’t uncovered yet, such as how Python caches the interpreted Python code of my alternative implementations versus how it processes the already compiled C source code. That might be fodder for a follow-up experiment on this same subject.
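In other words, the tuple form behaves like this Python sketch (mine, for illustration), just executed in C without interpreter overhead:

```python
def endswith_tuple_in_python(s: str, suffixes: tuple[str, ...]) -> bool:
    # mirrors the C loop: a linear scan that returns on the first tail match
    for suffix in suffixes:
        if s.endswith(suffix):  # stands in for tailmatch() in the C source
            return True
    return False
```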
What are some takeaways from this little experiment? I propose three. One, algorithm bake-offs are fun and can confirm or refute guesses and hunches. Two, remember to benchmark using different sizes of data. And three, pop the hood on your favorite programming language’s source code once in a while. You might be surprised at what you find!