I know, I'll use a regex!

This past week, a colleague asked me for help with a shell script that he had come across while investigating how we run one of our data ingestion pipelines. The shell script was designed to clean input CSV files if they had lines which didn't match a specific pattern.

Now to start with, the script was run over a directory and used a very gnarly bit of shell globbing to generate a list of files in a subdirectory. That list was then iterated over to check for a .csv extension.

[Please save your eye-rolls and "but couldn't they..." for later].

Once that list of files had been weeded to only contain CSVs, each of those files was catted and read line by line to see if the line matched a desired pattern - using shell regular expression parsing. If the line did not match the pattern, it was deleted. The matching lines were then written to a new file.

[Again, please save your eye-rolls and "but couldn't they..." for later].

The klaxons went off for my colleague when he saw the regex:

  while IFS="" read -r line && [ -n "$line" ]; do
        if [[ "$line" =~ ^\"[0-9]{4}-([0][0-9]|1[0-2])-([0-2][0-9]|3[01])\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",[^,]*,\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",.*$ ]]; then
              echo "$line"
        fi
  done < "${f}" > "${NEW}"

My eyes got whiplash. To make it easier to understand, let's put each element of the pattern on its own line:

  ^\"[0-9]{4}-([0][0-9]|1[0-2])-([0-2][0-9]|3[01])\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  [^,]*,
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  \"[^\"]*\",
  .*$

Which is really something. The first field matches a date in "yyyy-mm-dd" form (more or less - the month sub-pattern also admits "00"), then we have twelve fields which must be enclosed in double quotes, one field which must not be quoted, another twelve quoted fields, and finally any remaining fields, which we don't care about.


I told my colleague that this wasn't a good way of doing things (he agreed).

There are better ways to achieve this, so let's walk through them.

Firstly, the shell globbing. There's a Unix command to generate a list of filesystem entries which match particular criteria. It's called find. If we want a list of files which have a 'csv' extension we do this:

$ find DIR -type f -name \*.csv

In place of DIR you can use '.', an absolute path, or any other way of naming a directory in the filesystem.

Now since we want this in a list to iterate over, let's put it in a variable:

$ CSVfiles=$( find DIR -type f \( -name \*.csv -o -name \*.CSV \) )

(The escaped parentheses group the two -name tests so that -type f applies to both; without them it binds only to the first. You can also redirect stderr to /dev/null, with 2>/dev/null inside the parens, if you'd like.)
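With the list in a variable, a plain for loop gets us to each file in turn. Here's a sketch using '.' as the directory; note that the unquoted expansion splits on whitespace, so this assumes none of the file names contain spaces:

```shell
# Sketch: visit each file that find produced. Assumes no spaces in
# the file names, since unquoted $CSVfiles splits on whitespace.
CSVfiles=$( find . -type f \( -name \*.csv -o -name \*.CSV \) 2>/dev/null )

for f in $CSVfiles; do
    echo "processing ${f}"
    # ... the line filtering for ${f} goes here ...
done
```

If you do have to cope with spaces in names, find's -print0 piped into a while read loop is the usual escape hatch, but for this pipeline the simple loop was enough.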

Now that we've got our list, we can move to the second phase - removing lines which do not match our pattern. Let's try this first with awk. Awk has the concept of a Field Separator, and since CSV files are Comma-Separated-Value files, let's make use of that feature. We also know that we are only really interested in two fields - the first (yyyy-mm-dd) and the fourteenth.

$ awk -F',' '$1 ~ /"[0-9]{4}-([0][0-9]|1[0-2])-([0-2][0-9]|3[01])"/ &&
    $14 !~ /".*"/ {print}' < "$old" > "$new"

That's still rather ugly but considerably easier to read. For the record, ~ is awk's regex-match operator, and !~ is its negation - "does not match".

We could also do this with grep, but at the cost of using more of that horrible regex.
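For completeness, here's roughly what that looks like (a sketch; the file names are placeholders). This only checks the leading date field - making grep enforce the unquoted fourteenth field as well means pasting in the entire original pattern, which is exactly what we were trying to get away from:

```shell
# Sketch: keep only lines whose first field is a quoted yyyy-mm-dd date.
# old/new are placeholder file names for this illustration.
old=input.csv
new=cleaned.csv
grep -E '^"[0-9]{4}-(0[0-9]|1[0-2])-([0-2][0-9]|3[01])",' "$old" > "$new"
```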

In my opinion a better method is to cons up a Python script for this validation purpose, and we don't need to use the CSV module.

from collections import UserString
from datetime import datetime

infile = open("/path/to/file.csv", "r+")
input = infile.readlines()

linecount = len(input)

for line in input[:]:   # iterate over a copy so we can delete from input

    fields = line.split(",")
    togo = False

    try:
        # The field is quoted in the file, so strip the quotes first
        datetime.strptime(fields[0].strip('"'), "%Y-%m-%d")
    except ValueError as _ve:
        togo = True

    # The fourteenth field is index 13, since lists count from zero;
    # a line that is too short to have one gets thrown out as well
    if len(fields) < 14 or '"' in fields[13] or \
            not UserString(fields[13]).isnumeric():
        togo = True
    if togo:
        input.remove(line)

if len(input) != linecount:
    # We've modified the input, so have to write out a new version, but
    # let's overwrite our input file rather than creating a new instance.
    infile.seek(0)
    for line in input:
        infile.write(line)
    infile.truncate()

infile.close()

This script is pretty close to how I would write it in C (could you tell?).

We first open the file (for reading and writing) and read in every line, which yields us a list. While it's not the most memory-efficient way of approaching this problem, it does make processing more efficient because it's one read(), rather than one-read-per-line. We store the number of lines that we've read in for comparison at the end of our loop, and then start the processing.

Since this is a CSV file we know we can split() on the comma, and having done so, we check that we can parse the first field. We're not assigning the result of datetime.strptime() to a variable because we only care that the parse succeeds, not what the object's value is. The second check is that the fourteenth field contains no double quote and that its content is in fact numeric. If either check fails, we know to delete the line from our input.

Finally, if we have in fact had to delete any lines, we rewind our file (I was going to write pointer, but it's a File object. Told you it was close to C!) to the start, write out each remaining line (each still carries its trailing newline), truncate whatever is left over from the old contents, and close the file.

Whenever I think about regexes, especially the ones I've written in C over the years, I think about this quote which Jeffrey Friedl wrote about a long time ago:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

It was true when I first heard it some time during my first year of uni, and still true today.