
Improving Code Robustness by Deriving Component Index for RegEx Matches


Update 2019-12-12: Email Comment: Named Groups

This post describes an idea for handling regular expressions (regex) a bit more robustly. It is not specific to Python, but I use Python snippets to explain it. I don't claim to be the first person to do it this way, and as so often, once you have seen it, it may seem obvious.

TL;DR: I match the regular expression against a known example string and look up where each interesting chunk ends up in the resulting groups. That gives me the group index of each component without counting parentheses by hand, which helps whenever the regex changes.

Example Code

Consider the following example. I parse file names that follow a certain file name convention in order to extract their components: a date-stamp, a description, one or more tags, and the file extension. With these components, I build a new file name in which a string is appended to the description, just like appendfilename does.

import re

filename = '2019-12-10 A nice file name -- foo bar.txt'
append_string = 'with more information'  # append this to file description

# build and pre-compile regular expression:
FILENAME_REGEX = re.compile(r'(.+?) (.+) -- (.+)\.(.+)')  # raw string keeps '\.' intact

# print the parsed components:
filename_components = re.match(FILENAME_REGEX, filename)
print('groups: ' + str(filename_components.groups()))

# check for date-stamp:
if filename_components.group(1):
    print('filename contains a date-stamp: ' + filename_components.group(1))

# appending the new string to the description:
new_filename = filename_components.group(1) + ' ' + \
    filename_components.group(2) + ' ' + \
    append_string.strip() + ' -- ' + \
    filename_components.group(3) + '.' + \
    filename_components.group(4)
print('new filename is: ' + new_filename)	  

#+RESULTS:

 groups: ('2019-12-10', 'A nice file name', 'foo bar', 'txt')
 filename contains a date-stamp: 2019-12-10
 new filename is: 2019-12-10 A nice file name with more information -- foo bar.txt	  
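
The snippet above takes for granted that the file name follows the convention. When it does not, re.match() returns None and the first call to group() raises an AttributeError. The following is a minimal sketch of a guard, assuming a hypothetical helper name parse_filename() that is not part of the original code:

```python
import re

FILENAME_REGEX = re.compile(r'(.+?) (.+) -- (.+)\.(.+)')


def parse_filename(filename):
    """Return the tuple of matched groups, or None if the convention is not met.

    parse_filename() is a hypothetical helper, used here for illustration only.
    """
    match = FILENAME_REGEX.match(filename)
    if match is None:
        return None
    return match.groups()


print(parse_filename('2019-12-10 A nice file name -- foo bar.txt'))
# ('2019-12-10', 'A nice file name', 'foo bar', 'txt')
print(parse_filename('notes.txt'))
# None
```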

Improvement: Named Index Variables

Now imagine that you already know that the file name convention will get extended, or that you might want to match the components in a more fine-grained way.

This will cause changes to FILENAME_REGEX as well as to the index numbers passed to filename_components.group(). If you use these indices in several places, you might want to give them good names in order to stay sane while reading and maintaining the code. This results in the following changed sections:

FILENAME_REGEX = re.compile(r'(.+?) (.+) -- (.+)\.(.+)')
DATESTAMP_IDX = 1
DESCRIPTION_IDX = 2
TAGS_INDEX = 3
EXTENSION_INDEX = 4	  

... as well as ...

if filename_components.group(DATESTAMP_IDX):
    print('filename contains a date-stamp: ' + filename_components.group(DATESTAMP_IDX))

new_filename = filename_components.group(DATESTAMP_IDX) + ' ' + \
    filename_components.group(DESCRIPTION_IDX) + ' ' + \
    append_string.strip() + ' -- ' + \
    filename_components.group(TAGS_INDEX) + '.' + \
    filename_components.group(EXTENSION_INDEX)	  

The code is now noticeably easier to read.

When you then start to change the regex definition, you end up looking for the new index numbers all the time:

# build and pre-compile regular expression with advanced time- and date-stamps:
TIMESTAMP_PATTERN = r'(\d{4,4})-([01]\d)-([0123]\d)([- _T][012]\d\.[012345]\d\.[012345]\d)?'
FILENAME_REGEX = re.compile('(' + TIMESTAMP_PATTERN + r') (.+) -- (.+)\.(.+)')
DATESTAMP_IDX = 1
DESCRIPTION_IDX = 6  # changed
TAGS_INDEX = 7       # changed
EXTENSION_INDEX = 8  # changed	  
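
To see where the new numbers come from, it helps to print every group together with its number. This is only an illustrative sketch: the example file name with a time-stamp and the variable numbered_groups are my own choices, not part of the original code.

```python
import re

TIMESTAMP_PATTERN = r'(\d{4,4})-([01]\d)-([0123]\d)([- _T][012]\d\.[012345]\d\.[012345]\d)?'
FILENAME_REGEX = re.compile('(' + TIMESTAMP_PATTERN + r') (.+) -- (.+)\.(.+)')

# assumed example file name that exercises the optional time part as well
example = '2019-12-10T21.25.38 A nice file name -- foo bar.txt'
match = FILENAME_REGEX.match(example)

# group() counts from 1, so enumerate with start=1
numbered_groups = list(enumerate(match.groups(), start=1))
for number, text in numbered_groups:
    print(number, repr(text))
# 1 '2019-12-10T21.25.38'
# 2 '2019'
# 3 '12'
# 4 '10'
# 5 'T21.25.38'
# 6 'A nice file name'
# 7 'foo bar'
# 8 'txt'
```

Every additional pair of parentheses in front of a component shifts its number by one, which is why the description moved from 2 to 6.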

This could get tedious.

I did it like this too many times. This was stupid. This had to be improved.

Improvement: Determining Index Variables by Using Known Pattern Matching

The idea is to take an example file name that contains all interesting components, match it against the regex, and look up each known component in the resulting tuple of groups to get its index. This is done by the function get_index() shown below:

import re

def get_index(regex):
    """Derive the group numbers by matching a known example file name."""
    components = re.match(regex, '2019-12-11T21.25.38 my description -- tag1 tag2.myextension')
    groups = components.groups()
    datestamp_idx = groups.index('2019-12-11') + 1  # tuple index starts with 0, group() with 1
    description_idx = groups.index('my description') + 1
    tags_index = groups.index('tag1 tag2') + 1
    extension_index = groups.index('myextension') + 1
    return datestamp_idx, description_idx, tags_index, extension_index

filename = '2019-12-10 A nice file name -- foo bar.txt'
append_string = 'with more information'  # append this to file description

TIMESTAMP_PATTERN = r'(\d{4,4}-[01]\d-[0123]\d)([- _T][012]\d\.[012345]\d\.[012345]\d)?'
FILENAME_REGEX = re.compile('(' + TIMESTAMP_PATTERN + r') (.+) -- (.+)\.(.+)')
DATESTAMP_IDX, \
DESCRIPTION_IDX, \
TAGS_INDEX, \
EXTENSION_INDEX = get_index(FILENAME_REGEX)

# print the parsed components:
filename_components = re.match(FILENAME_REGEX, filename)
print('groups: ' + str(filename_components.groups()))

# check for date-stamp:
if filename_components.group(DATESTAMP_IDX):
    print('filename contains a date-stamp: ' + filename_components.group(DATESTAMP_IDX))

# appending the new string to the description:
new_filename = filename_components.group(DATESTAMP_IDX) + ' ' + \
    filename_components.group(DESCRIPTION_IDX) + ' ' + \
    append_string.strip() + ' -- ' + \
    filename_components.group(TAGS_INDEX) + '.' + \
    filename_components.group(EXTENSION_INDEX)
print('new filename is: ' + new_filename)	  

#+RESULTS:

 groups: ('2019-12-10', '2019-12-10', None, 'A nice file name', 'foo bar', 'txt')
 filename contains a date-stamp: 2019-12-10
 new filename is: 2019-12-10 A nice file name with more information -- foo bar.txt	  

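As you can see, the string within get_index() needs to be a static example string that the regex is known to match. Its components need to be unique, because index() returns the first group that equals the search string. The comparison is on whole strings: a group must capture exactly 'my description', not merely contain it. When get_index() cannot determine an index, it raises a ValueError as soon as the module is loaded. When the regex does not match the example string at all, re.match() returns None and the call to groups() raises an AttributeError. Either way, the mistake shows up immediately.

This trick for deriving the group indices makes changing a regex a bit less error-prone.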
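
The failure behavior can be made a bit more explicit. The following is a sketch of a generalized variant, not the code I actually use: the function name derive_group_indices(), the dictionary of expected components, and the stricter "exactly one matching group" check are assumptions added for illustration.

```python
import re


def derive_group_indices(regex, example, expected):
    """Map each component name to the group() number whose text equals it.

    Hypothetical helper: `expected` maps a component name to the exact
    string that one group is supposed to capture from `example`.
    """
    match = regex.match(example)
    if match is None:
        raise ValueError('regex does not match the example string: ' + example)
    groups = match.groups()
    indices = {}
    for name, text in expected.items():
        # stricter than tuple.index() alone: refuse ambiguous components
        if groups.count(text) != 1:
            raise ValueError('expected exactly one group equal to %r for %s, got groups %r'
                             % (text, name, groups))
        indices[name] = groups.index(text) + 1  # group() counts from 1
    return indices


TIMESTAMP_PATTERN = r'(\d{4,4}-[01]\d-[0123]\d)([- _T][012]\d\.[012345]\d\.[012345]\d)?'
FILENAME_REGEX = re.compile('(' + TIMESTAMP_PATTERN + r') (.+) -- (.+)\.(.+)')
EXAMPLE = '2019-12-11T21.25.38 my description -- tag1 tag2.myextension'
EXPECTED = {'datestamp': '2019-12-11', 'description': 'my description',
            'tags': 'tag1 tag2', 'extension': 'myextension'}

INDICES = derive_group_indices(FILENAME_REGEX, EXAMPLE, EXPECTED)
print(INDICES)
# {'datestamp': 2, 'description': 4, 'tags': 5, 'extension': 6}

# A changed regex that no longer captures the plain date in its own group:
BROKEN_REGEX = re.compile(
    r'(\d{4,4}-[01]\d-[0123]\d[- _T][012]\d\.[012345]\d\.[012345]\d) (.+) -- (.+)\.(.+)')
try:
    derive_group_indices(BROKEN_REGEX, EXAMPLE, EXPECTED)
except ValueError as error:
    print('caught:', error)
```

The second call fails because no group equals '2019-12-11' any more, which is exactly the kind of silent index shift the trick is meant to catch.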

Of course, this is no replacement for proper unit tests.
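
As a rough sketch of what such a unit test could look like, here is a small example using the unittest module from the standard library. The test class name and the chosen test cases are assumptions for illustration; the expected tuples follow from the regex used above.

```python
import re
import unittest

TIMESTAMP_PATTERN = r'(\d{4,4}-[01]\d-[0123]\d)([- _T][012]\d\.[012345]\d\.[012345]\d)?'
FILENAME_REGEX = re.compile('(' + TIMESTAMP_PATTERN + r') (.+) -- (.+)\.(.+)')


class FilenameRegexTest(unittest.TestCase):
    """Hypothetical test cases that pin down the group layout."""

    def test_date_only(self):
        match = FILENAME_REGEX.match('2019-12-10 A nice file name -- foo bar.txt')
        self.assertEqual(match.groups(),
                         ('2019-12-10', '2019-12-10', None,
                          'A nice file name', 'foo bar', 'txt'))

    def test_date_and_time(self):
        match = FILENAME_REGEX.match(
            '2019-12-11T21.25.38 my description -- tag1 tag2.myextension')
        self.assertEqual(match.groups(),
                         ('2019-12-11T21.25.38', '2019-12-11', 'T21.25.38',
                          'my description', 'tag1 tag2', 'myextension'))

    def test_rejects_other_names(self):
        self.assertIsNone(FILENAME_REGEX.match('notes.txt'))


# run the tests without exiting the interpreter
suite = unittest.defaultTestLoader.loadTestsFromTestCase(FilenameRegexTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```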

I hope this little idea might improve your regex code.

Improvement: Python Named Groups

Update 2019-12-12: Bernhard wrote me in an email comment that there is an even better method: named groups. I was not aware of them. Thanks for this high-value input. The (?P<name>...) syntax used below is Python's; many other regex engines support named groups as well, sometimes with a slightly different syntax.

Here is my example code from above rewritten with named groups, my new favorite method to deal with this:

import re

filename = '2019-12-10 A nice file name -- foo bar.txt'
append_string = 'with more information'  # append this to file description

TIMESTAMP_PATTERN = r'(\d{4,4}-[01]\d-[0123]\d)([- _T][012]\d\.[012345]\d\.[012345]\d)?'
FILENAME_REGEX = re.compile('(?P<timestamp>' + TIMESTAMP_PATTERN + ') ' +
                            r'(?P<description>.+) -- (?P<tags>.+)\.(?P<extension>.+)')

# print the parsed components:
filename_components = re.match(FILENAME_REGEX, filename)
print('groups: ' + str(filename_components.groups()))

# check for date-stamp:
if filename_components.group('timestamp'):
    print('filename contains a date-stamp: ' + filename_components.group('timestamp'))

# appending the new string to the description:
new_filename = filename_components.group('timestamp') + ' ' + \
    filename_components.group('description') + ' ' + \
    append_string.strip() + ' -- ' + \
    filename_components.group('tags') + '.' + \
    filename_components.group('extension')
print('new filename is: ' + new_filename)	  

#+RESULTS:

 groups: ('2019-12-10', '2019-12-10', None, 'A nice file name', 'foo bar', 'txt')
 filename contains a date-stamp: 2019-12-10
 new filename is: 2019-12-10 A nice file name with more information -- foo bar.txt	  
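
Named groups come with two more conveniences in Python's re module that fit the topic of this post: Match.groupdict() returns all named components at once, and Pattern.groupindex tells you which group number each name ended up with. A short sketch, where the variable names components and group_numbers are my own:

```python
import re

TIMESTAMP_PATTERN = r'(\d{4,4}-[01]\d-[0123]\d)([- _T][012]\d\.[012345]\d\.[012345]\d)?'
FILENAME_REGEX = re.compile('(?P<timestamp>' + TIMESTAMP_PATTERN + ') ' +
                            r'(?P<description>.+) -- (?P<tags>.+)\.(?P<extension>.+)')

match = FILENAME_REGEX.match('2019-12-10 A nice file name -- foo bar.txt')

# all named components as a dictionary; unnamed groups are left out
components = match.groupdict()
print(components)
# {'timestamp': '2019-12-10', 'description': 'A nice file name', 'tags': 'foo bar', 'extension': 'txt'}

# mapping of group name to group number, derived by the regex engine itself
group_numbers = dict(FILENAME_REGEX.groupindex)
print(group_numbers)
# {'timestamp': 1, 'description': 4, 'tags': 5, 'extension': 6}
```

With groupindex, the regex engine does the counting that get_index() derived from an example string.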
