<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>/devices/pseudo/bitbucket@0,0:pseudo (Posts about regexp)</title><link>https://www.jmcpdotcom.com/blog/</link><description></description><atom:link href="https://www.jmcpdotcom.com/blog/categories/regexp.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2022 &lt;a href="mailto:blogadmin@jmcpdotcom.com"&gt;jmcp&lt;/a&gt; </copyright><lastBuildDate>Thu, 21 Apr 2022 02:58:36 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>I know, I'll use a regex!</title><link>https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/</link><dc:creator>jmcp</dc:creator><description>&lt;p&gt;This past week, a colleague asked me for help with a shell script that
he had come across while investigating how we run one of our data ingestion
pipelines. The shell script was designed to clean input CSV files if they
had lines which didn't match a specific pattern.&lt;/p&gt;
&lt;p&gt;Now to start with, the script was run over a directory and used a &lt;em&gt;very&lt;/em&gt;
gnarly bit of shell &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Glob_(programming)"&gt;globbing&lt;/a&gt;  to generate a list of files in a subdirectory.
That list was then iterated over to check for a &lt;cite&gt;.csv&lt;/cite&gt; extension.&lt;/p&gt;
&lt;p&gt;[Please save your eye-rolls and "but couldn't they..." for later].&lt;/p&gt;
&lt;p&gt;Once that list of files had been weeded to only contain CSVs, each of those
files was catted and read line by line to see if the line matched a desired
pattern - using shell regular expression parsing. If the line did not match
the pattern, it was deleted. The matching lines were then written to a new
file.&lt;/p&gt;
&lt;p&gt;[Again, please save your eye-rolls and "but couldn't they..." for later].&lt;/p&gt;
&lt;p&gt;The klaxons went off for my colleague when he saw the regex:&lt;/p&gt;
&lt;pre class="code shell"&gt;&lt;a id="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-1" name="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-1"&gt;&lt;/a&gt;&lt;span class="nv"&gt;NEW&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="p"&gt;%.csv&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;_clean.csv&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-2" name="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-2"&gt;&lt;/a&gt;  &lt;span class="o"&gt;{&lt;/span&gt;
&lt;a id="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-3" name="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-3"&gt;&lt;/a&gt;  &lt;span class="nv"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;
&lt;a id="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-4" name="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-4"&gt;&lt;/a&gt;  &lt;span class="nb"&gt;read&lt;/span&gt;
&lt;a id="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-5" name="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-5"&gt;&lt;/a&gt;  &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt; &lt;span class="nb"&gt;read&lt;/span&gt; -r line &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; -n &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;
&lt;a id="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-6" name="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-6"&gt;&lt;/a&gt;  &lt;span class="k"&gt;do&lt;/span&gt;
&lt;a id="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-7" name="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-7"&gt;&lt;/a&gt;        &lt;span class="nv"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;buffer&lt;/span&gt;&lt;span class="si"&gt;}${&lt;/span&gt;&lt;span class="nv"&gt;line&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;a id="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-8" name="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-8"&gt;&lt;/a&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$buffer&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;~ ^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;-9&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;-&lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="o"&gt;][&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;-9&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;-2&lt;span class="o"&gt;])&lt;/span&gt;-&lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;-2&lt;span class="o"&gt;][&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;-9&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;01&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span 
class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="o"&gt;[&lt;/span&gt;^,&lt;span class="o"&gt;]&lt;/span&gt;*,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span 
class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span 
class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,.*$ &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-9" name="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-9"&gt;&lt;/a&gt;        &lt;span class="k"&gt;then&lt;/span&gt;
&lt;a id="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-10" name="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-10"&gt;&lt;/a&gt;              &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$buffer&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;a id="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-11" name="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-11"&gt;&lt;/a&gt;              &lt;span class="nv"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;
&lt;a id="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-12" name="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-12"&gt;&lt;/a&gt;        &lt;span class="k"&gt;else&lt;/span&gt;
&lt;a id="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-13" name="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-13"&gt;&lt;/a&gt;              &lt;span class="nv"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;buffer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; "&lt;/span&gt;
&lt;a id="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-14" name="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-14"&gt;&lt;/a&gt;        &lt;span class="k"&gt;fi&lt;/span&gt;
&lt;a id="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-15" name="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-15"&gt;&lt;/a&gt;  &lt;span class="k"&gt;done&lt;/span&gt;
&lt;a id="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-16" name="rest_code_a42af1a719fb4516ae880c31ce5b7f7f-16"&gt;&lt;/a&gt;  &lt;span class="o"&gt;}&lt;/span&gt; &amp;lt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &amp;gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NEW&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/pre&gt;&lt;p&gt;My eyes got whiplash. To make it easier to understand, let's put each element of
the pattern on a single line:&lt;/p&gt;
&lt;div class="code"&gt;&lt;table class="codetable"&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-1"&gt;&lt;code data-line-number=" 1"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-1" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-1"&gt;&lt;/a&gt; ^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;-9&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;-&lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="o"&gt;][&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;-9&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;-2&lt;span class="o"&gt;])&lt;/span&gt;-&lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;-2&lt;span class="o"&gt;][&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;-9&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;01&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-2"&gt;&lt;code data-line-number=" 2"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-2" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-2"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-3"&gt;&lt;code data-line-number=" 3"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-3" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-3"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-4"&gt;&lt;code data-line-number=" 4"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-4" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-4"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-5"&gt;&lt;code data-line-number=" 5"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-5" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-5"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-6"&gt;&lt;code data-line-number=" 6"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-6" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-6"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-7"&gt;&lt;code data-line-number=" 7"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-7" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-7"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-8"&gt;&lt;code data-line-number=" 8"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-8" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-8"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-9"&gt;&lt;code data-line-number=" 9"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-9" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-9"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-10"&gt;&lt;code data-line-number="10"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-10" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-10"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-11"&gt;&lt;code data-line-number="11"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-11" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-11"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-12"&gt;&lt;code data-line-number="12"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-12" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-12"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-13"&gt;&lt;code data-line-number="13"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-13" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-13"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-14"&gt;&lt;code data-line-number="14"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-14" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-14"&gt;&lt;/a&gt; &lt;span class="o"&gt;[&lt;/span&gt;^,&lt;span class="o"&gt;]&lt;/span&gt;*,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-15"&gt;&lt;code data-line-number="15"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-15" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-15"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-16"&gt;&lt;code data-line-number="16"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-16" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-16"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-17"&gt;&lt;code data-line-number="17"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-17" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-17"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-18"&gt;&lt;code data-line-number="18"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-18" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-18"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-19"&gt;&lt;code data-line-number="19"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-19" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-19"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-20"&gt;&lt;code data-line-number="20"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-20" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-20"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-21"&gt;&lt;code data-line-number="21"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-21" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-21"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-22"&gt;&lt;code data-line-number="22"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-22" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-22"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-23"&gt;&lt;code data-line-number="23"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-23" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-23"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-24"&gt;&lt;code data-line-number="24"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-24" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-24"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-25"&gt;&lt;code data-line-number="25"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-25" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-25"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-26"&gt;&lt;code data-line-number="26"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-26" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-26"&gt;&lt;/a&gt; &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;^&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;*&lt;span class="se"&gt;\"&lt;/span&gt;,
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="linenos linenodiv"&gt;&lt;a href="https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/#rest_code_a2c8b959f269425d95ab26eb16c2e463-27"&gt;&lt;code data-line-number="27"&gt;&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td class="code"&gt;&lt;code&gt;&lt;a id="rest_code_a2c8b959f269425d95ab26eb16c2e463-27" name="rest_code_a2c8b959f269425d95ab26eb16c2e463-27"&gt;&lt;/a&gt; .*$
&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;p&gt;Which is really something. The first field matches a date format - "yyyy-mm-dd"
(which is ok), then we have 12 fields where we care that they are enclosed in
double quotes, one field that we want to &lt;em&gt;not&lt;/em&gt; be quoted, another 12 fields
which are quoted again, and any other fields we don't care about.&lt;/p&gt;
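&lt;p&gt;To make that field structure explicit, here is a minimal Python sketch that assembles an equivalent pattern from its parts instead of typing it out by hand. The names DATE, QUOTED, UNQUOTED, FIELDS and ROW_PATTERN, and the sample row, are made up for illustration and do not come from the original script.&lt;/p&gt;

```python
import re

# Illustrative names; none of these come from the original script.
DATE = r'"[0-9]{4}-(0[0-9]|1[0-2])-([0-2][0-9]|3[01])"'
QUOTED = r'"[^"]*"'   # a field wrapped in double quotes
UNQUOTED = r'[^,]*'   # field 14: any run of non-comma characters

# 1 date field + 12 quoted + 1 unquoted + 12 quoted = 26 fields,
# each followed by a comma, then anything up to the end of the line.
FIELDS = [DATE] + [QUOTED] * 12 + [UNQUOTED] + [QUOTED] * 12
ROW_PATTERN = re.compile("^" + "".join(f + "," for f in FIELDS) + ".*$")

# A made-up row with the expected shape.
sample = ",".join(
    ['"2020-08-08"'] + ['"x"'] * 12 + ["42"] + ['"y"'] * 12 + ["tail"]
)
print(bool(ROW_PATTERN.match(sample)))
```

&lt;p&gt;Written this way the counts (12, 1, 12) sit in one place, which is about the only thing the one-line version makes hard to check.&lt;/p&gt;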
&lt;p&gt;Wow.&lt;/p&gt;
&lt;p&gt;I told my colleague that this wasn't a good way of doing things (he agreed).&lt;/p&gt;
&lt;p&gt;There are better ways to achieve this, so let's walk through them.&lt;/p&gt;
&lt;p&gt;Firstly, the shell globbing. There's a Unix command to generate a list of
filesystem entries which match particular criteria. It's called &lt;a class="reference external" href="https://www.gnu.org/software/findutils/manual/html_mono/find.html"&gt;find&lt;/a&gt;. If
we want a list of files which have a 'csv' extension we do this:&lt;/p&gt;
&lt;pre class="code shell"&gt;&lt;a id="rest_code_2db4cc40a0ac43b89f513f220e3f20aa-1" name="rest_code_2db4cc40a0ac43b89f513f220e3f20aa-1"&gt;&lt;/a&gt;$ find DIR -type f -name &lt;span class="se"&gt;\*&lt;/span&gt;.csv
&lt;/pre&gt;&lt;p&gt;You can use '.' or '*' or any way of representing a DIRectory in the filesystem.&lt;/p&gt;
&lt;p&gt;Now since we want this in a list to iterate over, let's put it in a variable:&lt;/p&gt;
&lt;pre class="code shell"&gt;&lt;a id="rest_code_9e14475bec5642209b963aecbcd59b2e-1" name="rest_code_9e14475bec5642209b963aecbcd59b2e-1"&gt;&lt;/a&gt;$ &lt;span class="nv"&gt;CSVfiles&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt; find DIR -type f &lt;span class="se"&gt;\(&lt;/span&gt; -name &lt;span class="se"&gt;\*&lt;/span&gt;.csv -o -name &lt;span class="se"&gt;\*&lt;/span&gt;.CSV &lt;span class="se"&gt;\)&lt;/span&gt; &lt;span class="k"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;p&gt;(You can redirect stderr to /dev/null, with &lt;em&gt;2&amp;gt;/dev/null&lt;/em&gt; inside the parens if you'd like).&lt;/p&gt;
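&lt;p&gt;If the cleanup ends up in Python anyway, the same listing can be produced without shelling out. This is only a sketch: &lt;cite&gt;find_csv_files&lt;/cite&gt; is a name I made up for illustration, and it leans on &lt;cite&gt;pathlib&lt;/cite&gt; from the standard library to do roughly what the find command above does, matching the extension in either case.&lt;/p&gt;

```python
from pathlib import Path


def find_csv_files(top):
    """Return regular files under top whose extension is .csv, in any case."""
    return sorted(
        path
        for path in Path(top).rglob("*")
        # is_file() plays the role of "-type f"; lower() covers .csv and .CSV
        if path.is_file() and path.suffix.lower() == ".csv"
    )
```

&lt;p&gt;Calling &lt;cite&gt;find_csv_files("DIR")&lt;/cite&gt; gives back a sorted list of Path objects, and unlike a whitespace-separated shell variable it copes with spaces in filenames.&lt;/p&gt;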
&lt;p&gt;Now that we've got our list, we can move to the second phase - removing lines
which do not match our pattern. Let's try this first with &lt;a class="reference external" href="https://www.gnu.org/software/gawk"&gt;awk&lt;/a&gt;. Awk has
the concept of a &lt;a class="reference external" href="https://www.gnu.org/software/gawk/manual/gawk.html#Field-Separators"&gt;Field Separator&lt;/a&gt;, and since CSV files are Comma-&lt;em&gt;Separated&lt;/em&gt;-Value
files, let's make use of that feature. We also know that we are only really interested
in two fields - the first (yyyy-mm-dd) and the fourteenth.&lt;/p&gt;
&lt;pre class="code shell"&gt;&lt;a id="rest_code_6653f13ddc394679b71007bf043b5abf-1" name="rest_code_6653f13ddc394679b71007bf043b5abf-1"&gt;&lt;/a&gt;$ awk -F&lt;span class="s1"&gt;','&lt;/span&gt; &lt;span class="s1"&gt;'$1 ~ /"[0-9]{4}-([0][0-9]|1[0-2])-([0-2][0-9]|3[01])"/ &amp;amp;&amp;amp;&lt;/span&gt;
&lt;a id="rest_code_6653f13ddc394679b71007bf043b5abf-2" name="rest_code_6653f13ddc394679b71007bf043b5abf-2"&gt;&lt;/a&gt;&lt;span class="s1"&gt;    $14 !~ /".*"/ {print}'&lt;/span&gt; &amp;lt; &lt;span class="nv"&gt;$old&lt;/span&gt; &amp;gt; &lt;span class="nv"&gt;$new&lt;/span&gt;
&lt;/pre&gt;&lt;p&gt;That's still rather ugly but considerably easier to read. For the record,
the bare ~ is awk's regular expression match operator, and !~ is its negation (does not match).&lt;/p&gt;
&lt;p&gt;We could also do this with &lt;a class="reference external" href="https://www.gnu.org/software/grep/manual/grep.html"&gt;grep&lt;/a&gt;, but at the cost of using more of that horrible regex.&lt;/p&gt;
&lt;p&gt;In my opinion a better method is to cons up a Python script for this validation
purpose, and we don't need to use the &lt;a class="reference external" href="https://docs.python.org/3.8/library/csv.html"&gt;CSV&lt;/a&gt; module.&lt;/p&gt;
&lt;pre class="code python"&gt;&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-1" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-1"&gt;&lt;/a&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-2" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-2"&gt;&lt;/a&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-3" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-3"&gt;&lt;/a&gt;&lt;span class="c1"&gt;# "r+" opens the file for reading and writing; "rw" is not a valid mode.&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-4" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-4"&gt;&lt;/a&gt;&lt;span class="n"&gt;infile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"/path/to/file.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"r+"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-5" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-5"&gt;&lt;/a&gt;&lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;infile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readlines&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-6" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-6"&gt;&lt;/a&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-7" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-7"&gt;&lt;/a&gt;&lt;span class="n"&gt;linecount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-8" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-8"&gt;&lt;/a&gt;&lt;span class="n"&gt;kept&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-9" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-9"&gt;&lt;/a&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-10" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-10"&gt;&lt;/a&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-11" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-11"&gt;&lt;/a&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-12" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-12"&gt;&lt;/a&gt;    &lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rstrip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;","&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-13" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-13"&gt;&lt;/a&gt;    &lt;span class="n"&gt;togo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-14" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-14"&gt;&lt;/a&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-15" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-15"&gt;&lt;/a&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-16" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-16"&gt;&lt;/a&gt;        &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strptime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s2"&gt;"%Y-%m-&lt;/span&gt;&lt;span class="si"&gt;%d&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-17" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-17"&gt;&lt;/a&gt;    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="ne"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-18" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-18"&gt;&lt;/a&gt;        &lt;span class="n"&gt;togo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-19" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-19"&gt;&lt;/a&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-20" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-20"&gt;&lt;/a&gt;    &lt;span class="c1"&gt;# The fourteenth field is index 13; lines with too few fields go too.&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-21" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-21"&gt;&lt;/a&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="s1"&gt;'"'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnumeric&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-22" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-22"&gt;&lt;/a&gt;        &lt;span class="n"&gt;togo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-23" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-23"&gt;&lt;/a&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;togo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-24" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-24"&gt;&lt;/a&gt;        &lt;span class="n"&gt;kept&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-25" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-25"&gt;&lt;/a&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-26" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-26"&gt;&lt;/a&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kept&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;linecount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-27" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-27"&gt;&lt;/a&gt;    &lt;span class="c1"&gt;# We've modified the input, so have to write out a new version, but&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-28" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-28"&gt;&lt;/a&gt;    &lt;span class="c1"&gt;# let's overwrite our input file rather than creating a new instance.&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-29" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-29"&gt;&lt;/a&gt;    &lt;span class="n"&gt;infile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-30" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-30"&gt;&lt;/a&gt;    &lt;span class="n"&gt;infile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writelines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kept&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-31" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-31"&gt;&lt;/a&gt;    &lt;span class="c1"&gt;# Cut off whatever is left of the old, longer contents.&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-32" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-32"&gt;&lt;/a&gt;    &lt;span class="n"&gt;infile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;truncate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-33" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-33"&gt;&lt;/a&gt;
&lt;a id="rest_code_3b1a2a15c3004168a8e590e47adf46fe-34" name="rest_code_3b1a2a15c3004168a8e590e47adf46fe-34"&gt;&lt;/a&gt;&lt;span class="n"&gt;infile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;p&gt;This script is pretty close to how I would write it in C (could you tell?).&lt;/p&gt;
&lt;p&gt;We first open the file (for reading &lt;em&gt;and&lt;/em&gt; writing) and read in every line,
which yields us a list. While it's not the most memory-efficient way of
approaching this problem, it does make processing more efficient because
it's one &lt;cite&gt;read()&lt;/cite&gt;, rather than one-read-per-line. We store the number of lines
that we've read in for comparison at the end of our loop, and then start the
processing.&lt;/p&gt;
&lt;p&gt;Since this is a &lt;a class="reference external" href="https://docs.python.org/3.8/library/csv.html"&gt;CSV&lt;/a&gt; file we know we can &lt;cite&gt;split()&lt;/cite&gt; on the comma (as long as no
quoted field contains an embedded comma), and having done so, we check that we
can parse the first field. We're not assigning the result of
&lt;cite&gt;datetime.strptime()&lt;/cite&gt; to a variable because we only care that we &lt;em&gt;can&lt;/em&gt;
parse it, rather than what the object's value is. The second check is to see that
we cannot find a double quote in the fourteenth field, and that the content of
the field is in fact numeric. If either of these checks fails, we know to
delete the line from our input.&lt;/p&gt;
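&lt;p&gt;The two checks can also be bundled into a single predicate, which makes them easy to test on their own. Treat this as a sketch rather than the script we actually ran: &lt;cite&gt;line_is_valid&lt;/cite&gt; is a hypothetical name, and I'm assuming the fourteenth field sits at index 13 and that a line with too few fields should be rejected.&lt;/p&gt;

```python
from datetime import datetime


def line_is_valid(line):
    """True if field 1 parses as a yyyy-mm-dd date and field 14 is an unquoted number."""
    fields = line.rstrip("\n").split(",")
    try:
        date_field, number_field = fields[0], fields[13]
    except IndexError:
        # Not enough fields to have a fourteenth one.
        return False
    try:
        datetime.strptime(date_field, "%Y-%m-%d")
    except ValueError:
        return False
    # isnumeric() is False for empty strings and for anything containing a quote.
    return number_field.isnumeric()
```

&lt;p&gt;One side effect of using &lt;cite&gt;strptime()&lt;/cite&gt; here is that an impossible date such as 2020-02-30 is rejected, whereas a digits-only pattern would let it through.&lt;/p&gt;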
&lt;p&gt;Finally, if we have in fact had to delete any lines, we rewind our file
(I was going to write pointer, but it's a File object. Told you it was close
to C!) to the start, and write the remaining lines back out
before closing the file.&lt;/p&gt;
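&lt;p&gt;One detail worth spelling out when overwriting a file in place: the new contents are shorter than the old ones, so the leftover tail has to be cut off after writing. Here is a minimal sketch of that rewind, write and truncate sequence. The function name &lt;cite&gt;rewrite_in_place&lt;/cite&gt; and its &lt;cite&gt;keep&lt;/cite&gt; callback are made up for illustration.&lt;/p&gt;

```python
def rewrite_in_place(path, keep):
    """Keep only the lines of path for which keep(line) is true; return how many were dropped."""
    with open(path, "r+") as handle:
        lines = handle.readlines()
        kept = [line for line in lines if keep(line)]
        if len(kept) != len(lines):
            handle.seek(0)            # rewind to the start of the file
            handle.writelines(kept)   # each line still carries its own newline
            handle.truncate()         # drop the stale tail of the old contents
    return len(lines) - len(kept)
```

&lt;p&gt;Without the &lt;cite&gt;truncate()&lt;/cite&gt; call the file would keep its original length, with the end of the old data still sitting after the new last line.&lt;/p&gt;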
&lt;p&gt;Whenever I think about regexes, &lt;em&gt;especially&lt;/em&gt; the ones I've written in C
over the years, I think about this quote which &lt;a class="reference external" href="http://regex.info/blog/2006-09-15/247"&gt;Jeffrey Friedl&lt;/a&gt; wrote about
a long time ago:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Some people, when confronted with a problem, think
“I know, I'll use regular expressions.”   Now they have two problems.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It was true when I first heard it some time during my first year of uni, and
still true today.&lt;/p&gt;
&lt;!-- put references after this point --&gt;</description><category>awk</category><category>data cleaning</category><category>Data Engineering</category><category>grep</category><category>programming</category><category>regex</category><category>regexes</category><category>regexp</category><category>regular expressions</category><category>sed</category><category>software engineering</category><category>SQL</category><guid>https://www.jmcpdotcom.com/blog/posts/2020-08-08-i-know-ill-use-a-regex/</guid><pubDate>Fri, 07 Aug 2020 16:00:00 GMT</pubDate></item></channel></rss>