My 2022 in Cycling

It's time to wrap up my year in cycling. I've ridden 4709km, which is less than my goal of 6500km, but considering my mental and physical health this year I'm pretty happy with it. The 302km I rode this week alone is my biggest week ever.

December 2022 has been a big month in general. With the exception of the 1st when it was _rather_ wet, the 5th and 6th when I went to a conference (YOW! 2022) and the 7th-8th when C was ill with strep throat, I rode every day. After the school term ended I was able to take the time to ride to and from my now-regular bunch rides with the Graceville Bike Community (GBC) group, and found that I am indeed able to keep up with the slightly faster speed group (30km/h average).


Overall, though, my riding has been uneven. At the start of the year I was 93kg and lacking in motivation, and when I did get on my bike it was an effort. February was starting to improve over January, but then we got rain for three solid weeks and the house flooded. Dealing with the cleanup was much more important, so I rarely got out. In April the weather started cooling down, which was appreciated, but in the last week we all came down with COVID-19 and were knocked flat. I was off work for 2 weeks, and when I was back on deck there were days where I could only get through half a day before needing a long nap. I feel very fortunate that my employer has been very understanding of that.

You'll note in this yearly calendar panel that there were two rides in May. I remember them vividly. All week I had been feeling better and better, and had planned on doing a relaxed River Loop. However, when I got to the Jindalee bridge I realised I wouldn't be able to complete it, so I just rode to the West End ferry terminal instead. Getting home from that (25km from there, usually an easy hour and 15) was a struggle. I managed to get out the following day for a few loops around Jindalee, but that was it - I was exhausted.

I couldn't get my head right for riding at all in June (I put on 3kg, too), and barely managed to get any rides in during July or August. My post-COVID recovery was accelerating though, and was helped by a conscious decision to try to hit a specific weight by my birthday in November.

J and I had also gone to see our counsellor and I noted that the massive layoffs (August 2017) I had somehow managed to survive were still rocking me and that I really, _really_ missed my friends. At that point I made a decision to finally go and see what the GBC bunch rides were like. My riding mates Shane and Adam had been with them for years and I kept seeing their bunch rides on Strava. So on the 30th of August I rocked up for my first ride.

The GBC "bunch" is made up of several speed groups - a 34km/h (and faster) group, a 30-32km/h group, a 28-29km/h group, and a "lifestyle" group of around 22km/h. I hoped I would be able to keep up with the 28-29 group, and was truly delighted that in fact I did keep up. The people in the group were also very welcoming and considerate, so I ... kept turning up.

I also decided that rather than driving to Darra station then catching the train in to the office ($work likes me to go in one day a week), I would ride in instead. We've got pretty good end-of-trip facilities - secure bike storage, and a towel service with very nice showering and changing facilities - so the only thing stopping me being a bike commuter is me (or the rain). The route is pretty flat (20km each way), and the 2-2.5km through the CBD is generally ok - I've only had 2 too-close passes, and both appeared careless rather than deliberate. It's a great feeling to know that even though I'm spending around the same amount of time as if I drove and caught the train, I'm _exercising_, outdoors, and feeling the wind on my face.

I've also learned new routes from the bunch rides. One that we usually do on a Friday is Spinkbrae Street in Fig Tree Pocket, which is _steep_. I can climb it in about 2 minutes, though I'm still figuring out the cadence I need. The road surface is awful, too - well past time for a re-do by the council.

Another new route I've done is the ride out to Nudgee Beach, which was long and hot and took me to places I've always wanted to explore. Not having a guide to get there was the impediment, but I'm over that and look forward to doing it again.


As the year wound down I rode more. I decided that yes, I would in fact ride the Tour De Brisbane (medio) on 2 April 2023, that I would ride both Gap Creek and Coot-tha back segments before the end of the year, and that for 2023 my weekly distance goal is 200km rather than 150km. After all, I've been able to hit the 150km goal each week since mid-November, and it feels good to be able to do more.


I'm happy with what I've been able to accomplish through riding this year - making new friends, discovering new routes, losing weight (ending the year at 85kg) and improving my physical and mental health. I've got achievable goals lined up for next year and I'm looking forward to putting in the effort to achieve them.

All in all, not a bad year.

Five letter words in the English language

Like many people I quite enjoy playing Wordle, and I quite enjoy playing Worldle, too. I like both of these games so much that I've made completing them my Start Of Day (SOD) procedure.

Yesterday's Worldle was Vatican City, a place that J and I visited when we had an amazing five week long trip to Europe in 2005. Worldle gives you a black-on-white image of the geographic area to guess, and each day's image is approximately the same size in your window.

I thought this was pretty easy to guess - to me it just looks like a fortified place like a castle or ... a city state. There aren't too many of those on this planet so it was a 1 from 1 situation. I forgot to save a copy of the clue version, so imagine this picture below but with everything inside the walls filled in, in black:

Vatican City map from Wikimedia Commons

This morning I got to thinking - if you've only got the edges of a word (start and end letters), how many unique combinations are there? [Yes, I'm only thinking about the English language].

A quick check of /usr/share/dict/words on my workstation (standard Linux dictionary installed) shows that out of 104334 words there are 7044 with five letters. Clarifying that just a little, if you remove those that are capitalised (proper nouns and acronyms) you're left with 4667 five letter words.

>>> import re
>>> words = open("words", "r").readlines()
>>> len(words)
104334
>>> fivers = [j.strip() for j in words if len(j) == 6]   # five letters plus the trailing newline
>>> len(fivers)
7044
>>> fivers[0:20]
["ABC's", "ABM's", 'AFAIK', "AFC's", "AMD's", 'ANSIs', 'ANZUS', "AOL's", 'ASCII', "ASL's", 'ASPCA', "ATM's", "ATP's", 'AWACS', "AWS's", "AZT's", 'Aaron', 'Abbas', 'Abdul', "Abe's"]

>>> fivers = []
>>> for w in words:
...     m = re.match("^[a-z]{5}$", w.strip())
...     if m:
...         fivers.append(m.group(0))
...
>>> len(fivers)
4667
>>> fivers[0:20]
['abaci', 'aback', 'abaft', 'abase', 'abash', 'abate', 'abbey', 'abbot', 'abeam', 'abets', 'abhor', 'abide', 'abler', 'abode', 'abort', 'about', 'above', 'abuse', 'abuts', 'abuzz']

That's good enough to look at the initial and final letter combinations. To do that we'll use a set:

>>> enders = [(i[0], i[-1]) for i in fivers]
>>> len(enders)
4667
>>> snend = set(enders)
>>> len(snend)
>>> snend
{('a', 'z'), ('t', 's'), ('p', 'x'), ('d', 'f'), ('s', 'i'), ('u', 'e'), ('z', 'i'), ('v', 'a'), ('g', 'a'), ('y', 'h'), ('j', 't'), ('t', 'x'), ('m', 'i'), ('l', 'e'), ('j', 'n'), ('k', 't'), ('q', 'k'), ('o', 'l'), ('h', 'x'), ('l', 'h'), ('u', 'a'), ('k', 'n'), ('b', 'l'), ('d', 'i'), ('y', 'a'), ('v', 'd'), ('g', 'd'), ('v', 'g'), ('k', 'u'), ('o', 't'), ('r', 'y'), ('c', 'r'), ('c', 'l'), ('s', 'p'), ('j', 'i'), ('l', 'a'), ('b', 't'), ('m', 'm'), ('k', 'i'), ('u', 'd'), ('c', 'f'), ('n', 'r'), ('u', 'g'), ('n', 'l'), ('p', 't'), ('y', 'd'), ('f', 's'), ('y', 'g'), ('t', 'l'), ('p', 'n'), ('d', 'm'), ('u', 'r'), ('a', 's'), ('l', 'd'), ('q', 'y'), ('p', 'u'), ('x', 'v'), ('l', 'g'), ('m', 'c'), ('b', 'i'), ('d', 'p'), ('t', 't'), ('k', 'z'), ('r', 'x'), ('h', 't'), ('h', 'n'), ('s', 'e'), ('s', 's'), ('s', 'h'), ('o', 'm'), ('e', 'y'), ('h', 'u'), ('x', 'n'), ('z', 'h'), ('w', 'n'), ('t', 'i'), ('g', 'n'), ('b', 'm'), ('m', 'e'), ('m', 'h'), ('c', 'm'), ('b', 'p'), ('z', 'a'), ('d', 'e'), ('d', 'h'), ('c', 'o'), ('p', 'p'), ('a', 'l'), ('s', 'w'), ('b', 'c'), ('m', 'a'), ('j', 'e'), ('t', 'm'), ('c', 'k'), ('f', 't'), ('v', 'o'), ('h', 'z'), ('j', 's'), ('c', 'c'), ('r', 't'), ('a', 'f'), ('k', 'e'), ('n', 'o'), ('z', 'd'), ('a', 't'), ('r', 'n'), ('k', 's'), ('t', 'p'), ('d', 'a'), ('m', 'w'), ('w', 'z'), ('f', 'u'), ('g', 'z'), ('s', 'r'), ('o', 'e'), ('s', 'l'), ('m', 'd'), ('y', 'o'), ('u', 'k'), ('b', 'e'), ('f', 'i'), ('s', 'f'), ('b', 'h'), ('d', 'd'), ('a', 'i'), ('m', 'r'), ('q', 't'), ('l', 'o'), ('d', 'g'), ('p', 'e'), ('q', 'n'), ('p', 's'), ('o', 'a'), ('p', 'h'), ('c', 'y'), ('d', 'r'), ('b', 'a'), ('t', 'e'), ('o', 'w'), ('f', 'm'), ('t', 'h'), ('c', 'a'), ('j', 'r'), ('n', 'y'), ('e', 't'), ('j', 'l'), ('h', 'e'), ('a', 'm'), ('b', 'w'), ('o', 'd'), ('e', 'n'), ('h', 's'), ('k', 'l'), ('f', 'p'), ('o', 'g'), ('u', 'y'), ('n', 'a'), ('b', 'd'), ('x', 's'), ('b', 'g'), ('y', 'y'), ('t', 'a'), ('w', 's'), ('o', 'r'), ('i', 's'), ('c', 'd'), ('c', 
'g'), ('a', 'c'), ('b', 'r'), ('l', 'y'), ('s', 'm'), ('t', 'w'), ('x', 'x'), ('n', 'd'), ('i', 'x'), ('p', 'r'), ('b', 'f'), ('p', 'l'), ('t', 'd'), ('s', 'o'), ('t', 'g'), ('s', 'k'), ('p', 'f'), ('f', 'e'), ('s', 'c'), ('t', 'r'), ('f', 'h'), ('a', 'e'), ('m', 'o'), ('r', 's'), ('h', 'r'), ('a', 'h'), ('c', 'v'), ('h', 'l'), ('d', 'b'), ('t', 'f'), ('d', 'o'), ('l', 'x'), ('f', 'a'), ('w', 'l'), ('i', 'l'), ('d', 'k'), ('a', 'a'), ('j', 'p'), ('j', 'o'), ('k', 'o'), ('w', 't'), ('q', 's'), ('v', 't'), ('i', 't'), ('g', 't'), ('s', 'y'), ('a', 'w'), ('v', 'n'), ('k', 'k'), ('z', 'y'), ('f', 'd'), ('f', 'g'), ('o', 'o'), ('a', 'd'), ('b', 'b'), ('u', 't'), ('a', 'g'), ('s', 'a'), ('p', 'm'), ('b', 'o'), ('m', 'y'), ('f', 'r'), ('u', 'n'), ('o', 'c'), ('x', 'i'), ('f', 'l'), ('r', 'r'), ('c', 'b'), ('e', 's'), ('r', 'l'), ('y', 'n'), ('a', 'r'), ('b', 'k'), ('g', 'i'), ('p', 'o'), ('d', 'y'), ('f', 'f'), ('l', 't'), ('n', 'b'), ('p', 'k'), ('l', 'n'), ('s', 'd'), ('p', 'c'), ('s', 'g'), ('h', 'm'), ('k', 'h'), ('t', 'o'), ('x', 'm'), ('t', 'k'), ('h', 'o'), ('i', 'm'), ('g', 'm'), ('q', 'l'), ('j', 'a'), ('t', 'c'), ('k', 'a'), ('o', 'y'), ('h', 'c'), ('r', 'i'), ('w', 'p'), ('q', 'f'), ('g', 'p'), ('b', 'y'), ('d', 'x'), ('j', 'd'), ('p', 'y'), ('e', 'l'), ('k', 'd'), ('r', 'm'), ('p', 'a'), ('q', 'i'), ('t', 'y'), ('h', 'h'), ('f', 'o'), ('r', 'p'), ('w', 'e'), ('p', 'w'), ('a', 'o'), ('i', 'e'), ('g', 'e'), ('f', 'k'), ('w', 'h'), ('v', 's'), ('f', 'c'), ('g', 's'), ('g', 'h'), ('m', 't'), ('r', 'c'), ('a', 'k'), ('p', 'd'), ('h', 'a'), ('m', 'n'), ('p', 'g'), ('c', 'x'), ('e', 'i'), ('q', 'm'), ('d', 't'), ('u', 's'), ('s', 'b'), ('w', 'a'), ('d', 'n'), ('y', 's'), ('h', 'd'), ('w', 'w'), ('h', 'g'), ('l', 's'), ('r', 'e'), ('r', 'h'), ('w', 'd'), ('f', 'y'), ('i', 'd'), ('w', 'g'), ('i', 'g'), ('g', 'g'), ('a', 'y'), ('e', 'p'), ('o', 'n'), ('w', 'r'), ('v', 'r'), ('i', 'r'), ('g', 'r'), ('v', 'l'), ('r', 'a'), ('b', 'n'), ('g', 'l'), ('e', 'c'), ('c', 't'), 
('k', 'b'), ('q', 'e'), ('w', 'f'), ('c', 'n'), ('b', 'u'), ('g', 'f'), ('q', 'h'), ('r', 'w'), ('u', 'l'), ('n', 't'), ('y', 'l'), ('n', 'n'), ('r', 'd'), ('r', 'g'), ('t', 'n'), ('c', 'i'), ('q', 'a'), ('l', 'r'), ('a', 'x'), ('e', 'e'), ('l', 'l'), ('y', 't'), ('e', 'h'), ('p', 'b'), ('z', 's'), ('n', 'i'), ('b', 'z'), ('t', 'b'), ('u', 'i'), ('e', 'a'), ('j', 'y'), ('m', 's'), ('k', 'y'), ('q', 'r'), ('v', 'm'), ('d', 's'), ('e', 'w'), ('c', 'p'), ('t', 'z'), ('l', 'i'), ('w', 'o'), ('e', 'd'), ('i', 'o'), ('g', 'o'), ('e', 'g'), ('f', 'n'), ('w', 'k'), ('a', 'n'), ('e', 'r'), ('u', 'p'), ('a', 'u'), ('o', 's'), ('z', 'l'), ('b', 's'), ('s', 't'), ('c', 'e'), ('l', 'p'), ('r', 'b'), ('m', 'l'), ('s', 'n'), ('c', 's'), ('c', 'h'), ('r', 'o'), ('h', 'y'), ('v', 'e'), ('l', 'c'), ('m', 'f'), ('s', 'u'), ('b', 'x'), ('v', 'h'), ('d', 'l'), ('n', 'e'), ('f', 'z'), ('w', 'y'), ('n', 's'), ('n', 'h'), ('i', 'y'), ('g', 'y')}

That's quite a few pairs! I'm easily amused by things like this, so let's see how many words are in the list which start with 'g' and end with 'y':

>>> gy = [j for j in fivers if j.startswith("g") and j.endswith("y")]
>>> gy
['gabby', 'gaily', 'gamey', 'gassy', 'gaudy', 'gauzy', 'gawky', 'gayly', 'geeky', 'giddy', 'gimpy', 'gipsy', 'glory', 'gluey', 'godly', 'golly', 'goody', 'gooey', 'goofy', 'gouty', 'gravy', 'grimy', 'gully', 'gummy', 'gunny', 'guppy', 'gushy', 'gusty', 'gutsy', 'gypsy']
>>> len(gy)
30

Let's check the distribution (aren't buckets fun?) amongst all the start/end combinations:

>>> buckets = {}
>>> for b in snend:
...     buckets[b] = len([j for j in fivers if j.startswith(b[0]) and j.endswith(b[1])])
...
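As a self-contained aside, the same bucketing can be done in one pass with collections.Counter; this sketch uses a tiny made-up word list rather than the full dictionary so it runs anywhere:

```python
from collections import Counter

# A tiny stand-in for the real five-letter word list (made-up sample data)
sample_fivers = ["abaci", "aback", "abbey", "gabby", "gaudy", "gully"]

# Count words per (first letter, last letter) bucket in one pass
sample_buckets = Counter((w[0], w[-1]) for w in sample_fivers)

print(sample_buckets[("g", "y")])   # 'gabby', 'gaudy' and 'gully' -> 3
```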

I admit some surprise at seeing that there are 92 start/end combinations which have only one word in the list:

>>> unobuckets = [b for b in buckets if buckets[b] == 1]
>>> len(unobuckets)
92
>>> unobuckets
[('a', 'z'), ('p', 'x'), ('z', 'i'), ('y', 'h'), ('t', 'x'), ('j', 'n'), ('h', 'x'), ('k', 'n'), ('d', 'i'), ('y', 'a'), ('v', 'g'), ('k', 'u'), ('j', 'i'), ('k', 'i'), ('u', 'g'), ('y', 'g'), ('q', 'y'), ('p', 'u'), ('x', 'v'), ('l', 'g'), ('b', 'i'), ('d', 'p'), ('k', 'z'), ('r', 'x'), ('h', 'u'), ('x', 'n'), ('z', 'a'), ('b', 'c'), ('h', 'z'), ('a', 'f'), ('n', 'o'), ('z', 'd'), ('f', 'u'), ('g', 'z'), ('y', 'o'), ('u', 'k'), ('f', 'i'), ('q', 'n'), ('o', 'w'), ('f', 'm'), ('j', 'l'), ('f', 'p'), ('o', 'g'), ('x', 's'), ('x', 'x'), ('c', 'v'), ('d', 'b'), ('t', 'f'), ('l', 'x'), ('j', 'p'), ('z', 'y'), ('b', 'b'), ('o', 'c'), ('g', 'i'), ('f', 'f'), ('n', 'b'), ('k', 'h'), ('i', 'm'), ('j', 'a'), ('h', 'c'), ('q', 'f'), ('d', 'x'), ('q', 'i'), ('r', 'p'), ('f', 'c'), ('q', 'm'), ('w', 'a'), ('w', 'w'), ('h', 'g'), ('i', 'g'), ('e', 'p'), ('e', 'c'), ('b', 'u'), ('r', 'w'), ('n', 't'), ('n', 'n'), ('q', 'a'), ('p', 'b'), ('n', 'i'), ('b', 'z'), ('u', 'i'), ('q', 'r'), ('v', 'm'), ('t', 'z'), ('l', 'i'), ('a', 'u'), ('z', 'l'), ('l', 'p'), ('r', 'b'), ('m', 'f'), ('s', 'u'), ('f', 'z')]

Let's choose five:

>>> for b in ("a", "z"), ("f", "z"), ("q", "n"), ("r", "x"), ("d", "x"):
...     q = [j for j in fivers if j.startswith(b[0]) and j.endswith(b[1])]
...     print(q)

I didn't really have a point to make here, I just wanted to share my amusement at how many five letter words there are to guess compared to the approximately 300 geographic entities on the planet that you'll be shown the edges of in Worldle. Also that knowing Python means you can make short work of asking and answering these questions.

A ramble on tech industry hiring

There's always discourse (on twitter, reddit, HN etc) about hiring, JDs, interview processes etc. Mostly about how it all sucks (which is true to various degrees).

My colleague left (_1_ day after his probation ended; I was _unhappy_) and I got to write his replacement's JD. I've become increasingly aware over the last few years of the implicit and explicit bias in tech hiring, and vowed that if I was ever in a position to do something about it, I would.

Since I've got a BA, not a CS or EE degree, I've always felt like I didn't quite fit in in the industry. This, despite having 2y of CS and Physics in my BA where I wound up majoring in Maths and Modern History. For a long time I put myself in with the "non-traditional entry to tech" pool.

After leaving Oracle in 2019 I talked with AWS and GCP recruiters, who all emphasized questions on CS theory and recommended specific study guides for their interview processes... which I read ... and I still couldn't answer their questions. They made me feel incompetent, so I eventually stopped responding to them.

At that point I'd been a successful Solaris kernel engineer for many years. I learnt good software engineering theory and practice from some of the giants in the industry. Just as important, perhaps even more so, was the _culture_ I imbibed. I'll ramble on about that some other time.

In Q3 2019 I had an interview with a different FAANG-like entity. It took seven hours, and I thought I got out lightly. There were three questions in that time which got me mad. One was from a PhD student about three years in, who asked why I hadn't written a Python generator (I had not had a need to). The second was from a former CS prof, who was disgusted that I had never implemented a thread stack.

Him: How can you call yourself a software engineer if you've never implemented a thread stack?

Me: In the Unix kernel space, that's a solved problem. And solved by people who are much smarter than me.

(We then spent the rest of the hour-long session whiteboarding (ugh) stacks with queues and vice versa).

Seriously - I've met and worked with some of the people who worked on threading in the Solaris kernel. They are incredibly smart and excellent engineers. More importantly to me, though, they showed me both how and why the implementation worked. Never in a "I'm so smart" fashion, but always matter-of-fact, "I worked it out, I need to know that you can do it too".

The third question was from one of the overseas members of the panel, who after expressing horror that I didn't know how the internals of Apache Spark worked, demanded that I size - off the cuff - an ElasticSearch-based solution to a text-processing problem that they had just described to me. Including how much RAM, how many cores, how much disk and network capacity... I told the interviewer (and, later, the hiring manager who asked me for feedback on the whole process) that I didn't think that was a reasonable question under any circumstance.


After that FAANG-like entity I wound up at my current employer. I had two interviews, one with the hiring manager and team lead, the other with the rest of the team. Rather than the torture session I anticipated, these were two very pleasant discussions. We talked about what they did, what I'd done, and got to know each other. When I met the wider team via zoom on my first day (we were in lockdown so that was my only day in an office for another 4-5 months) my Director started the meeting by welcoming me and asking me for a short intro. I was delighted and relieved to discover that the people I'd interviewed with were representative of the rest of the org.

A few months after I started we got some interns via a bootcamp called Coder Academy, and I was asked to be their onboarding buddy. We had lots of sessions in their six weeks, talked about anything and everything, and I was truly delighted when my boss announced on their last day of interning that we'd hired them as employees.

These three (two in my team, one in an adjacent team) people brought home to me just how much I knew, and how much of what they wanted to know I had learnt in my two years of CS - three decades previously. What's that I hear? The claim that you _must_ have a degree in order to succeed in this industry? Utter tosh. Bollocks. A complete lie. These three (and this is my personal experience - I'm not alone here, check twitter!) demonstrate that there is more than one pathway into tech, and no singular definition of success.

Back to job descriptions.

My JD and my colleague's were not wordsmithed. They were also chock-full of "must have"s and specific tech stack references. I've been in this role for seven and a half months now, but even in my first week I knew that general principles, an inquisitive nature and a desire to make this SCALE were more important than any particular "do you have this certified piece of knowledge right now".

I spent maybe two hours re-writing the piece. I asked a very good friend (ht @girlgerms) for help and she gave me an excellent phrase to use, which was along these lines:

Your work history should show that you are already skilled in, or have the ability to quickly become skilled in these areas

With this way of expressing our needs, it was a fairly small jump to using the indirect language of "generic technology space, we're using (X)" to round out the requirements:

Data Pipeline Technologies such as (but not limited to):
  • Apache Kafka, including Confluent or Aiven’s managed offerings

  • Relational Databases and SQL. We primarily use PostgreSQL and MS SQL

  • Containerisation

In my part of the company we're gearing up for a hiring spree, and my group (I've been reorg'd into another team) needs two "data devs", one Java dev, a business analyst and a product manager. Several other groups within our org are also down at least one, sometimes two developers - so there's a lot of activity right now around writing job descriptions. In the first two weeks of April I spent several hours with my new manager working on wordsmithing our requirements, putting a lot of effort into removing language that we believe actively discourages people who do not look like us (we fit the stereotype) from applying for these roles.

Just before Easter the manager of another team within our org sent an operations-focused JD around for comment, and it was ... ok. Ok in that it asked for specific tech and a degree, but that was about it. This JD fit the stereotype, and it was meh in the 90s, it's definitely not good enough now, 20+ years later.

I suggested he change some of the language, focus on principles rather than specific tech, and entirely remove mention of requiring a degree. Yesterday afternoon he pinged wanting a bit of clarification so I went into more detail, and I am delighted that his second draft (again, sent to our Director and our peers) did just that. There's still some wordsmithing required, but overall it's a much better document.

Yesterday morning I had a chat with our org's Recruiting Partner about my former colleague's replacement. He clued me in on a few of the vagaries of recruiting, which was really helpful, and then we talked about interviewing. He was relieved to hear that I flat out refuse to do whiteboarding, live coding or leetcode - for myself, and for candidates. He also asked me why any candidate would want to work with me on my team. I admitted that I hadn't thought of an answer for that question - but I came up with something that I think is an ok platform to build a better answer on. By the time I get to interview any candidate I'll have worked on it a lot more :).

That's enough rambling and self-aggrandizement. Time to wrap this up.

I'm now in a position where I get to directly and formally influence how my company presents itself to potential employees. I do not believe that there is a candidate "pipeline problem" - there are plenty of candidates out there who could fill any of the roles that we have, _we_ have to make ourselves attractive so that they'll consider us in the first place. _How_ we do that starts with what we write, and for too long (waay too long) our industry has been focused on some specific hiring language and practices which are both implicitly and explicitly exclusive. I want that to change, so I'm making changes.

Interview questions - A Complaint

A few days ago I tweeted


Alt text:

Somebody asked (last ~week) something similar to "what questions do you always ask in an #software #engineering #interview?"

The response I saw (but failed to hit like on for bookmarking purposes) was "You hit X on the keyboard. What happens next?"

I was annoyed by this response for several reasons. So annoyed, in fact, that while we were away on a long-planned family holiday I lay awake one night thinking of the many ways in which one could respond.

The "always ask" really got to me. Why is that particular question something that the interviewer always asks? Is it appropriate for every role? At what level and in how much detail do they expect the candidate to respond?

Thinking back to my very first encounter with a computer at all, it was an Apple //e at my primary school, early 1980s. You know the one, its user reference manual came with the complete circuit schematics and AppleBasic listing. I don't recall exactly how Apple handled the keyboard interface, but I'll punt and suggest that everything was hardwired....

And it was. I found the Apple //e Hardware Schematics! If you piece together page 4, page 2 and the keyboard circuit schematic -- J17 is where the keyboard ribbon cable connects to the motherboard, and the circuit schematic shows very clearly that the keyboard is a row+column lookup. Very hard-wired.


From the keyboard circuit schematic we see that key 'X' is on (20, 4) which maps to Y2 (pin 19 in UE14) and X3 (pin 31). The markings on UE14 are for the GI (General Instruments) AY-5-3600-PRO, which is a Keyboard Encoder.

From UE14 we go through UE12; the keyboard ROM looks up the specific rendering for 'X', pushes that onto the main data bus and sends an interrupt to the 6502B CPU (see page 1 schematic). Then the relevant display function in ROM is invoked to push the actual "how to render this character" instructions into the video (NTSC or PAL) controller chip and thence to the physical display hardware.
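As a toy model of that row+column lookup - the lookup table here is invented for illustration, it is not the real //e encoder contents:

```python
# Toy keyboard matrix: the encoder scans rows and columns, and the
# (row, column) pair indexes a lookup table to produce the character.
# This layout is made up for the example, not the actual matrix.
KEY_MATRIX = {
    (20, 4): "X",   # mirrors the (20, 4) position mentioned above
    (1, 1): "A",
}

def encode_keypress(row, col):
    """Return the character for a key pressed at (row, col), as the
    keyboard encoder's lookup would, or None for an empty position."""
    return KEY_MATRIX.get((row, col))

print(encode_keypress(20, 4))   # -> X
```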

By the way, you might be wondering what UE14 actually means. The 'U' means that the component is an integrated circuit, and E14 is a row+column grid reference so you know where to look for it on the physical board.


It's somewhat obscured, but on the left of the PCB you can see a D and an E, while along the bottom you can see the numbers 1-15. This is a different PCB to the one in the schematics: the 6502 is at E1 here (so it would be UE1 on the schematic), while in the schematics I've found (see page 1) it's UC4.

Later, there were PCs with keyboards connected by a cable, and as it happens, there was an Intel 8042 inside as the microcontroller - which did pretty much the same thing as the entire Apple //e, except that the interrupt generated by the Intel 8042 was sent out along the cable to the keyboard port on the main motherboard... where again, an interrupt was generated and all the other lookups occurred prior to sending out to the display.

That's all well and good, but how about a more complicated system - one of the many UNIXes such as Solaris, say, or a minicomputer with hardwired terminals like a PDP-11? One of those hardwired terminals used essentially the same principles described above, but had a much more complicated kernel to push the data into. They also had local display buffer storage, so that wasn't something the kernel needed to worry about. What the kernel was interested in (please don't anthropomorphise computers, they hate it) was sending the keystrokes to the correct process.
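A massively simplified sketch of that bookkeeping - the names and structures here are invented, and bear no resemblance to real PDP-11 or Solaris internals:

```python
# Each terminal line has a foreground process; the "kernel" appends
# incoming keystrokes to that process's input buffer.
foreground = {"tty1": "shell", "tty2": "editor"}
input_buffers = {"shell": [], "editor": []}

def keystroke(tty, ch):
    """Route one keystroke from a terminal to its foreground process."""
    owner = foreground[tty]
    input_buffers[owner].append(ch)

for ch in "ls":
    keystroke("tty1", ch)

print(input_buffers["shell"])   # -> ['l', 's']
```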

Some Pythonic Kafka stuff

I've been actively interviewing over the last few months, and was recently talking with a cloud streaming provider about a Staff Engineer SRE role specializing in Apache Kafka. I've been doing a lot of work in that space for $employer and quite like it as a data streaming solution.

The interviews went well, and they sent me a do-at-home challenge with a one week timeout.

The gist of it was I had to create a Flask app which allowed the user to enter a URL to monitor for status on a given frequency, use Python to write a Kafka Producer to publish this data to a topic, and write a Kafka Consumer to read from the topic and insert into a PostgreSQL database.


I did a brief investigation into using threads within a Flask app for the monitoring code, but quickly decided that a better architecture would be to do the monitoring via a separate daemon. Separation of concerns to allow for easier maintenance. Suddenly I'm all about ongoing maintenance rather than how quickly I can deliver a new feature... Hmmm.

The next step was to sketch out the table schema I wanted in the database:

CREATE SEQUENCE public.urltargets_seq
        INCREMENT BY 1
        MINVALUE 1
        MAXVALUE 9223372036854775807
        START 1
        CACHE 1
        NO CYCLE;

CREATE SEQUENCE public.monitor_results_seq
        INCREMENT BY 1
        MINVALUE 1
        MAXVALUE 9223372036854775807
        START 1
        CACHE 1
        NO CYCLE;

CREATE TABLE IF NOT EXISTS urltargets (
        urltargets_pk           int4 NOT NULL DEFAULT nextval('urltargets_seq'::regclass),
        urltarget               varchar(1024) NOT NULL,
        monitor_frequency       int NOT NULL CHECK (monitor_frequency in (1, 2, 5, 10, 20, 30)),
        CONSTRAINT urltargets_pkey PRIMARY KEY (urltargets_pk)
);

CREATE TABLE IF NOT EXISTS monitor_results (
        monitor_results_pk      int4 NOT NULL DEFAULT nextval('monitor_results_seq'::regclass),
        http_status             int NOT NULL,
        start_time              timestamp with time zone NOT NULL,
        duration                int4 NOT NULL,
        urltarget_fk            int4 NOT NULL,
        CONSTRAINT monitor_results_fk_fkey FOREIGN KEY (urltarget_fk) REFERENCES urltargets(urltargets_pk)
);
Having decided that I would offer monitoring frequencies of 1, 2, 5, 10, 20 and 30 minutes, I created views for the Flask app to use as well, rather than direct queries. They all look like this, with other values substituted in as you would expect.

            SELECT mr.monitor_results_pk
                    ,  ut.urltarget
                    ,  mr.http_status
                    ,  mr.start_time
                    ,  mr.duration
            FROM monitor_results mr
            JOIN urltargets ut on ut.urltargets_pk = mr.urltarget_fk
            WHERE ut.monitor_frequency = 1

Since I really like seeing schemata visualised, I created a nice(ish) ERD as well:


Well, that was straightforward - how about the application and daemon?

I split out the setup functionality into a separate file importable by the Flask app, the monitor daemon and the consumer. This contained the database connection, Kafka Producer and Kafka Consumer code. There's an interesting little niggle in the Kafka Producer setup which is not immediately obvious and required a bit of digging in StackOverflow as well as enabling debug output with librdkafka:

  from confluent_kafka import KafkaError, Producer

  def _get_kafka_configuration():
      """Common function to retrieve the Kafka configuration."""
      global appconfig
      configuration = {
          "bootstrap.servers": appconfig["kafka"]["broker"],
          "transactional.id": "website-monitor",
          "ssl.key.location": appconfig["kafka"]["keyfile"],
          "ssl.certificate.location": appconfig["kafka"]["certfile"],
          "ssl.ca.location": appconfig["kafka"]["cafile"],
          "security.protocol": "SSL",
          # "debug": "eos, broker, admin",  # re-enable if needed
          "transaction.timeout.ms": 60000,
          "enable.idempotence": True,
      }
      return configuration

  def setup_kafka_producer(view):
      """Creates the connection to our Kafka brokers so we can publish
      messages to the topics we want. We take the {view} argument so
      that we can avoid blatting multiple producers together and getting
      errors from the broker about zombies and fencing. See
      ... for more details.
      Return: a Producer
      """
      configuration = _get_kafka_configuration()
      configuration["transactional.id"] = "website-monitor" + str(view)
      kafkaProducer = Producer(configuration)
      try:
          kafkaProducer.init_transactions()
      except KafkaError as ke:
          # If we can't do this, then we have to quit
          print(f"""Producer failed to init_transactions(), throwing {ke}""")
          raise
      return kafkaProducer
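The Consumer end isn't shown above; here's a minimal sketch of what it could look like, with the message-parsing split out so it can be tested without a broker. The field names and the consume loop are assumptions (a confluent_kafka consumer and a psycopg2-style cursor), not the code I submitted:

```python
import json

def message_to_row(payload):
    """Turn a monitor-result message (JSON bytes) into a tuple ready
    for INSERT. These field names are assumed, not the real schema."""
    data = json.loads(payload)
    return (data["http_status"], data["start_time"],
            data["duration"], data["urltarget_fk"])

def consume_forever(consumer, cursor):
    """Poll the topic and insert each result into PostgreSQL (sketch)."""
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        cursor.execute(
            "INSERT INTO monitor_results "
            "(http_status, start_time, duration, urltarget_fk) "
            "VALUES (%s, %s, %s, %s)",
            message_to_row(msg.value()))
```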

When I was working with the daemon, my first iteration tried opening the DB connection and Producer in each thread's __init__() function (one thread for each frequency), and ... that didn't work.

The DB connection is not picklable, so does _not_ survive the call to os.fork(). Once I had rewritten the setup and run methods to get the DB connection, that part was groovy.

The Kafka Producer still required a bit of work. After reading through StackOverflow and the upstream docs for librdkafka, I saw that I needed to similarly delay initialising the producer until the thread's run() method was called. I also observed that each Producer should initialise the transaction feature, but leave the begin ... end of the transaction to when it was called to publish a message.

I still had a problem, though - some transactions would get through, but then the Producer would be fenced. This was the niggle, and where the StackOverflow comments helped me out:

Finally, in distributed environments, applications will crash or —worse!— temporarily lose connectivity to the rest of the system. Typically, new instances are automatically started to replace the ones which were deemed lost. Through this process, we may have multiple instances processing the same input topics and writing to the same output topics, causing duplicate outputs and violating the exactly once processing semantics.

We call this the problem of “zombie instances.” [emphasis added]

I realised that I was giving the same transactional id to each of the six producer instances (setting the 'transactional.id' in the configuration dict generated by _get_kafka_configuration()), so I needed to uniqify them somehow. I decided to pass the monitoring frequency of the thread to the setup function, and ... booyah, I had messages being published.

That was a really nice feeling.

There is one other aspect of the monitoring daemon that I need to mention. Since each thread reads its list of URLs to monitor each time it wakes, I wanted to parallelize this effort. Monitoring each of the URLs in series could easily take too long from a sleep(...) point of view, and I really did not want to just fork a whole heap of processes and threads either - avoiding the potential for a fork-bomb.

To work around this I used the Python standard library's concurrent.futures module with a ThreadPoolExecutor, submitting a future for each target URL. Adding attributes to the future object enabled me to use an add_done_callback so that when the future crystallized it would then publish the message.
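Stripped of the daemon specifics, the pattern looks something like this (a sketch: check_url is faked, and "publishing" just appends to a list):

```python
import concurrent.futures as cf

def check_url(url):
    # Stand-in for the real HTTP check: pretend everything returns 200
    return (url, 200)

published = []

def construct_and_publish(future):
    # The attributes attached to the future below travel with it, so
    # the callback has everything it needs to build the message
    if future.cancelled() or future.exception():
        return
    published.append((future.args, future.result()))

targets = ["https://example.org/a", "https://example.org/b"]

with cf.ThreadPoolExecutor(max_workers=5) as executor:
    for tgt in targets:
        future = executor.submit(check_url, tgt)
        future.args = tgt     # arbitrary attributes on a Future are fine
        future.add_done_callback(construct_and_publish)

print(len(published))   # 2
```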

  def run(self):
      """ Runs the monitor, updates account-keeping and kicks off
      notifications if required. Then back to sleep.
      """
      self._producer = setup_kafka_producer(self._view)
      self._conn = setup_db()
      self._cursor = self._conn.cursor()
      while True:
          alltargets = self._get_targets()
          if alltargets:
              # We use a 'with' statement to ensure threads in the pool
              # are cleaned up promptly
              self._futures = []
              with cf.ThreadPoolExecutor(max_workers=50) as executor:
                  for tgt in alltargets:
                      future = executor.submit(check_url,
                                               tgt[0], tgt[1])
                      future.args = (tgt[0], tgt[1])
                      future.producer = self._producer
                      future.add_done_callback(construct_and_publish)
                      self._futures.append(future)
          # self._view is this thread's monitoring frequency
          sleep(self._view)

The check and publish methods are outside of the thread definition:

  def construct_and_publish(input):
      """ Callback function for the concurrent.future that each thread
      makes use of to query a website. Turns the future's attributes
      into a message for the 'url-monitor-results' topic, then publishes
      that message to the topic.
      """
      if input.cancelled() or input.exception():
          print(f"""Monitor attempt for {input.args} failed""",
                file=stderr, flush=True)
      else:
          message = json.dumps(dict(zip(msgFields, input.result())))
          input.producer.begin_transaction()
          input.producer.produce("url-monitor-results", value=message)
          input.producer.commit_transaction()

  def check_url(url, fk):
      """ Performs an 'HTTP GET' of the supplied url and returns a tuple
      containing (fk, start_time, duration, http_status).
      The start_time is expressed in seconds since the UNIX Epoch.
      """
      start_time = datetime.timestamp(datetime.now())
      result = requests.get(url)
      duration = datetime.timestamp(datetime.now()) - start_time
      return (fk, start_time, duration, result.status_code)

With the monitoring daemon written, I now needed the other end of the pipe - writing the Kafka Consumer to read from the topic and insert into the database. This was straightforward: we're polling for messages on both configured topics, when we read one we write it to the appropriate DB table using a prepared statement, commit and then do it all again with a while loop.

  urlToMonitorStmt = "INSERT INTO urltargets (urltarget, monitor_frequency) "
  urlToMonitorStmt += "VALUES (%(urltarget)s, %(monitor_frequency)s)"
  urlMonitorResultsStmt = "INSERT INTO monitor_results (http_status, "
  urlMonitorResultsStmt += "urltarget_fk, start_time, duration) "
  urlMonitorResultsStmt += "VALUES (%(http_status)s, %(targetId)s, "
  urlMonitorResultsStmt += "to_timestamp(%(start_time)s), %(duration)s)"

  lookups = {
      "url-to-monitor": urlToMonitorStmt,
      "url-monitor-results": urlMonitorResultsStmt
  }

  if __name__ == "__main__":
      consumer = setup_kafka_consumer()
      connection = setup_db()
      while True:
          with connection.cursor() as curs:
              msglist = consumer.consume(200)
              for msg in msglist:
                  if not msg:
                      continue
                  elif msg.error():
                      print("Received error during poll: {error}".format(
                          error=msg.error()))
                  else:
                      stmt = lookups[msg.topic()]
                      values = json.loads(msg.value().decode('utf-8'))
                      curs.execute(stmt, values)
          connection.commit()

Of course there should be error handling for the execute(). There should also be packaging and tests and templates for the report. Do not @ me, etc etc.

The reason why all these pieces are missing is because the day before I was due to hand this assignment in to my interviewer, I received a very, very nice offer from another company that I'd also been interviewing with - and I accepted it.

An unexpected live coding challenge

A few weeks ago I was in a technical interview, and was asked to do a live coding challenge. I was surprised, because this is the sort of thing that I expect a recruiter and hiring manager to mention ahead of time. Preparation is very important, and while I know that there are many people for whom live coding is a thrill, there are plenty of other people for whom it can be a terrifying experience.

I'm glad to say that I'm not terrified by it, but it's definitely not an ideal environment for me.

So after a few minutes of me describing what I've done in my career (it seemed pretty clear that the interviewer hadn't actually read my resume), and a few technical questions, we got into the challenge.

For a given string composed of parenthesis ("(", "{", "["), check if the string is valid parenthesis.
1. "()" -- valid
2. "({})" -- valid
3. "(}{)" -- invalid
4. "{()}[{}]" -- valid
5. "({(}))" -- invalid

I noted immediately that this is an issue which requires the processing function to track state, because you not only need to determine open and closed pairings, but also what type it is.

It took a minute to sketch out the classifications that I needed, talking through my decision process all the while:

OPENS = ["(", "{", "["]
CLOSES = [")", "}", "]"]

braces = [ "{", "}"]
parens = [ "(", ")"]
brackets = [ "[", "]"]

classes = { "braces": braces,
            "parens": parens,
            "brackets": brackets
          }

I was able to stub out a check function pretty quickly, but got stuck when I went from the stub to implementation, because I realised that I needed to keep track of what the previous element in the string was.

Oh no! How do I do that? (A stack, btw)

Mental blank :(

I needed time to jog my memory, so I asked the interviewer to tell me about himself, what he does on the team and a few other questions.

This, I think, was a very good decision - with the focus of the interview not on me, I could step back and think about what basic data types in Python I could use to implement a stack.

The data type I needed is indeed pretty basic: a list().
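As a quick illustration of list-as-stack (nothing specific to the interview problem here):

```python
stack = list()
stack.append("(")      # push
stack.append("{")      # push
top = stack[-1]        # peek - "{" stays on the stack
popped = stack.pop()   # pop - removes and returns "{"
print(top, popped, stack)   # { { ['(']
```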

A Python list() lets you push (the append() operation) and pop (the pop() operation), so with the addition of another data structure

counts = { "braces": 0,
           "parens": 0,
           "brackets": 0
         }

and a short function to return the class of the element

def __classof(c):
    """ returns whether 'c' is in brackets, braces or parens """
    if c in braces:
        return "braces"
    elif c in brackets:
        return "brackets"
    else:
        return "parens"

we're now in a much better position for the algorithm.

By this time I had also calmed myself down, because everything came together pretty easily for me.

With the above code blocks already noted, here is the body of the function:

def check_valid(input):
    """ For the given examples at top, determine validity.
        Assumption: the input is _only_ braces, parens and brackets
    """
    # start
    c = input[0]

    stack = list()
    stack.append(c)
    counts[__classof(c)] += 1

    for c in input[1:]:
        if c in OPENS:
            ## increment count & add to stack
            counts[__classof(c)] += 1
            stack.append(c)
        else:
            ## closing checks
            if not stack or __classof(c) != __classof(stack[-1]):
                return "invalid"
            else:
                # decrement count & pop the stack
                counts[__classof(c)] -= 1
                stack.pop()

    return "valid"

We're playing fast and loose here with input validity checking - there's no "is this an empty string?" and we're not handling a single-character string, let alone validating that our input only contains braces, parens and brackets.

With this main() function, though, we're doing pretty well:

## main
strings = ["""()""",
           """({})""",
           """(}{)""",
           """{()}[{}]""",
           """({(}))""",
           """](){}"""
           ]

for element in strings:
    print("{0:20} {1}".format(element, check_valid(element)))
which gives us the following output:

()                   valid
({})                 valid
(}{)                 invalid
{()}[{}]             valid
({(}))               invalid
](){}                valid

Using the criteria specified, the final case should be invalid, given that it starts with a terminating rather than an initiating/opening element - there's nothing to balance that element with. By that point, however, my time was up, so I didn't worry about it.

My interviewer then asked whether I had considered using recursion to solve the problem.

I hadn't considered recursion because I generally don't have to for the use-cases I need to write - and in this particular problem space it didn't seem to me to be an effective use of resources.

Consider the longest case, {()}[{}]. If you're recursing on the check function, then you'll wind up calling the function four times, so that's four new stack frames to be created and destroyed. That doesn't strike me as particularly efficient in time or space. Iterating over the input, however, avoids all of the setup + teardown overhead.
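For comparison, a recursive version might look like the sketch below (hypothetical, and not the interview code - it recurses once per character, so the stack-frame overhead is even more pronounced than the four calls estimated above, and it matches exact characters rather than just classes):

```python
PAIRS = {")": "(", "}": "{", "]": "["}

def check_recursive(s, stack=()):
    """One stack frame per character: the 'stack' of unmatched
    openers is threaded through the recursion as a tuple."""
    if not s:
        return "valid" if not stack else "invalid"
    c, rest = s[0], s[1:]
    if c in PAIRS.values():                   # an opener: push it
        return check_recursive(rest, stack + (c,))
    if not stack or stack[-1] != PAIRS[c]:    # a closer with no match
        return "invalid"
    return check_recursive(rest, stack[:-1])  # matched: pop and continue

print(check_recursive("{()}[{}]"))   # valid
print(check_recursive("({(}))"))     # invalid
```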

Anyway, it was a relatively fun exercise, and I'm glad I did it. I was able to keep a cool head and buy myself enough time to jog my memory and finish the problem, and it worked the first time (I know, that _never_ happens!).

For future encounters like this, I think it's handy to remember these points:

  1. Breathe

  2. Talk through what you are doing, and why

  3. If you hit a problem, see point 1, and then say that you're stuck and need to think through a particular part of the issue.

  4. If you need to stop talking so you can think, say that that's what you need to do.

It is my impression that if your interviewer is a decent person, they will help you work through your point of stuckness so that you can complete as much as possible of the task.

Why do I see "Duplicate main class"?

I've recently started work on improving my skills and knowledge in the Java ecosystem, and while working on a previous post I burned several hours trying to work out why I was seeing this error:

[ERROR] ..../[42,1] duplicate class.... bearer_token_cli.bearerTokenCLI

I didn't find the answers at StackOverflow to be very useful, because they invariably said something along the lines of "clean your project and let the IDE re-index things, it'll be fine".

Which is not a solution - it's like "curing" a memory leak by rebooting the host. I like to know the why of a problem.

I eventually re-re-read the message from the Maven compiler plugin and noticed that it was trying to compile 2 source files. For an exploratory project which only had one source file, this was unexpected:

[INFO] --- maven-compiler-plugin:3.8.1:compile (default-compile) @ bearer_token_cli ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 2 source files to /home/jmcp/IdeaProjects/bearer-token-cli/target/classes

Why did I now have two files? The answer lies in a bit of laziness on my part. The previous post had hard-coded credentials and URLs, but I really wanted to start using a getopt()-like library called picocli, and rather than git commit ... I just copied my first version of the source to a new file and kept on editing.

Apart from the relevant information (2 source files) being in an [INFO] block rather than in the [ERROR] block, why on earth couldn't it have printed the names of the files it was compiling along the way?

If you come across this error, a quick

$ find src/main/java -name \*.java |xargs grep -i "public static void main"

should help you find where that erroneous main class is hiding.

Here's where I get a bit ranty. One of the patterns that we invented and applied to the Solaris OS/Net (ON) source tree during the development of #ProjectLullaby was that for every subdirectory which contained source files, the Makefile specified each file individually.

There are solid reasons for this, starting with the need to ensure that when you build part or all of the tree, we do not miss dependencies. Over the years there were enough instances of "developer changes code, adds new file or renames/splits old file, FAILS TO CHECK IN NEW FILES, breaks build after integration" that we forced specificity. You want to add, rename or delete files that the build depends on? It's ON YOU to make sure we're tracking them. A broken build meant either a followup changeset (with "(add missing file)" etc appended to the comment), or getting backed out.

While I'm enjoying some aspects of developing in Java and I do like leaving a lot of heavy lifting to a framework or a toolset, the heavy reliance in the Java world on an IDE to do thinking for you leaves me cold.

Queensland's 2011 floods

It's now ten years since we experienced the Queensland floods of December 2010-January 2011. I took quite a few photos around the Centenary Suburbs and put some of them into a twitter thread last week. I've put those and many more together into an album for the record.

For our part, we got off lightly. The waters came to within about 1km of our home, and while Energex shut down the West Darra substation at 1pm on the day the waters rose on our part of the river, power was back on again 24 hours later. J was pregnant with #Child2 so the lack of air movement during an incredibly hot and humid night was very draining. But that was it for us. Many people were a lot more affected; the Mud Army helped with cleanup and it was heartbreaking to see just how many homes were damaged.

Access token retrieval in Python and Java

At $work I'm part of our API team, enabling access to the rather large datasets that we have acquired (and generated) over the years. We make heavy use of Postman for testing, but every now and again I find that I want to do something on the commandline.

All of our APIs require authorization, for which we use OAuth, and specifically the Bearer Token.

Rather than having to fire up Postman to extract a Bearer Token, I decided to write a Python script to do it for me, so I could then set an environment variable and pass that to curl as a command-line argument. I was a little lazy and hard-coded my clientid and secret - I'm not going to be requesting anybody else's token!

import requests

# $AUTHSERVER stands in for our actual auth endpoint
url = "https://$AUTHSERVER/?grant_type=client_credentials" \
      "&client_id={client_id}&client_secret={client_secret}"

client_id = "nope"
client_secret = "still_nope"

resp = requests.get(url.format(client_id=client_id,
                               client_secret=client_secret))
print("export BEARER=\"Authorization: Bearer " +
      resp.json()["access_token"] + "\"")

I'm taking advantage of the fact that I know through many years of use that the requests package does a lot of heavy lifting for me, particularly the JSON decoding.

Since $work has a shutdown between Christmas and New Year, I figured that I would spend some time implementing this in Java. Not because I have a need to, but because I need to get more Java under my belt since $work is a Java shop and cross-pollination / polyglotting is useful.

The first step was to determine how to send an HTTP GET for a specific URI. A few searches later and I'd arrived at

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpResponse.BodyHandlers;
and since it seems cleaner to import the Exceptions that these throw, I added

import java.io.IOException;
import java.net.URISyntaxException;
A bit more hard-coding laziness for the clientid and secret and I had the beginnings of a solution. (Note that I do not claim that any of this is using Best Practices, it's just a building block).

class bearerTokenCLI {

    private static URI authServer;
    private static HttpResponse<String> response;

    public static void main(String... args) {

        try {
            authServer = new URI("https", null,
                    "$AUTHSERVER", 443, null,
                    "grant_type=client_credentials" +
                            "&client_id=nope" +
                            "&client_secret=still_node", null);
        } catch (URISyntaxException exc) {
            System.out.println("Received URISyntaxException");
            System.exit(1);    // can't continue without a valid URI
        }

        System.out.println("Requesting " + authServer.toString());

Ok so far - we've created a new URI object, caught the specific exception that it could throw, and (because I'm learning) printed the stringified version of the URI to stdout.

Now we need an HttpRequest to send via an HttpClient:

HttpRequest request = HttpRequest.newBuilder(authServer).build();
HttpClient client = HttpClient.newHttpClient();

try {
    response = client.send(request, BodyHandlers.ofString());
} catch (java.io.IOException | java.lang.InterruptedException jiie) {
    /*
     * Note that this catch() uses Java7++ syntax for handling
     * multiple exceptions in the same block
     */
    jiie.printStackTrace();
}
Assuming we didn't get an exception, we need to check that the HTTP Status Code of the response is OK, or 200:

if (response.statusCode() != 200) {
    /*
     * Something went wrong so print the url we requested, status code,
     * an error message and the response body as text.
     */
    System.out.println("Request was unsuccessful. " +
            "Received status code " + response.statusCode());
    System.out.println("URL requested was\n" + authServer.toString());
    System.out.println("Response body text:\n" + response.body());
    System.exit(1);
}

If it isn't ok, we bail out, and otherwise we check for the Content-Type header being set to 'application/json'. Why that specific value? If you refer to the RFCs for OAuth (RFC6749 and RFC6750) specifically section 5 of the latter, you see that

The parameters are included in the entity-body of the HTTP response using the "application/json" media type as defined by [RFC4627]. The parameters are serialized into a JavaScript Object Notation (JSON) structure by adding each parameter at the highest structure level. Parameter names and string values are included as JSON strings. Numerical values are included as JSON numbers. The order of parameters does not matter and can vary.

Let's check for that type then.

/*
 * Check that we've got 'application/json' as the Content-Type.
 * Per RFC7231 we know that Content-Type is a semicolon-delimited
 * string of
 *     type/subtype;charset;...
 * More importantly, we know from RFC6750 section 5 that the
 * response type MUST be JSON.
 */
List<String> contentTypeHeader = response.headers().map().get("Content-Type");
if (contentTypeHeader.isEmpty()) {
    System.out.println("ERROR: Content-Type header is empty!");
    System.exit(1);
}

Since contentTypeHeader is a List<T> we can either iterate over it, or, since we know that it can only occur once in an HTTP response we can grab element 0 directly. Here's iteration (and yes, I know we should be confirming that we've actually got 'application/json', do not @ me, etc etc):

for (String el: contentTypeHeader) {
    String contentType = el.split(";")[0];
    System.out.println("Actual Content-Type bit:   " + contentType);
}
On the other hand, making use of our knowledge that there's only one Content-Type header in the response, and per RFC7231 we know the format of the header, we can take a shortcut and grab the value directly:

String contentType = contentTypeHeader.get(0).split(";")[0];
if (!contentType.equalsIgnoreCase("application/json")) {
    /* Not JSON! */
    System.out.println("Content-Type is " + contentType +
            " not application/json. Exiting");
    System.exit(1);
}
So far, so good. It took me several hours to get to this point because I not only had to refresh my memory of those RFCs, but also realise that a short-circuit was possible.

Now we can move onto the response body text. By way of printing out response.getClass() I know that the response is an instance of an HttpResponse implementation class, and visual inspection of it shows that it's JSON. But how do I turn that into a structure that I can pull the access_token information from?

At first I tried using Google's GSON but I just couldn't get my head around it. I need to find and understand more code examples. Until I do that, however, I turned to Jackson JR, which I found a lot more straightforward.

We need another import, this time

import com.fasterxml.jackson.jr.ob.JSON;

And then we construct a Map<String, Object> from the response body:

try {
    Map<String, Object> containerJSON = JSON.std.mapFrom(response.body());
    String accessToken = containerJSON.get("access_token").toString();
    System.out.println("export BEARER=\"BEARER " + accessToken + "\"\n");
} catch (IOException exc) {
    System.out.println("Caught exception " + exc.toString());
    System.out.println("Message:\n" + exc.getMessage());
}
You'll observe that I'm again being a bit lazy here by wrapping this block in the one try {...} catch (..) {..} block. Why so? Because by this point we should be certain that we've actually got an access_token element in the response, and if we don't then there's something going wrong upstream.

Finally, how do we build this thing? As much as I'd like to just run javac over the source and create a solitary jar, I've found that including external dependencies is made immensely easier by using a build system like Maven, Ant or Gradle in the Java ecosystem. For C, of course, there's no place like make(1) (ok, or GNU Make).

I started with using a Maven *archetype*, added this dependency to pom.xml (jackson-jr-objects is the artifact that provides com.fasterxml.jackson.jr.ob.JSON; use whatever version is current):

<dependency>
    <groupId>com.fasterxml.jackson.jr</groupId>
    <artifactId>jackson-jr-objects</artifactId>
    <version>2.12.0</version>
</dependency>

and added the Maven Assembly plugin to the <build> lifecycle. Then building was a matter of

$ mvn clean install package assembly:single

and then I could run the package with

$ java -jar target/bearer_token_cli-1.0-SNAPSHOT-jar-with-dependencies.jar

All up, I estimate that researching and writing this in Java took me about 12 hours. Most of which was ecosystem research and exploration. There was only one syntax issue which tripped me up - I needed an Array and for about 10 minutes was searching through Javadocs for an appropriate class before I remembered I could use String[] arrName. Sigh.

Learning the ecosystem is the thing I'm finding most difficult with Java - it's huge and there are so many different classes to solve overlapping problems. I haven't even begun to work with @Annotations or dependency injection for my own code yet. Truth be told, after a decade+ of working in Solaris, the idea that anything could be injected into the code I've written puts a chill down my spine. I'm sure I'll get past it one day.

I know, I'll use a regex!

This past week, a colleague asked me for help with a shell script that he had come across while investigating how we run one of our data ingestion pipelines. The shell script was designed to clean input CSV files if they had lines which didn't match a specific pattern.

Now to start with, the script was run over a directory and used a very gnarly bit of shell globbing to generate a list of files in a subdirectory. That list was then iterated over to check for a .csv extension.

[Please save your eye-rolls and "but couldn't they..." for later].

Once that list of files had been weeded to only contain CSVs, each of those files was catted and read line by line to see if the line matched a desired pattern - using shell regular expression parsing. If the line did not match the pattern, it was deleted. The matching lines were then written to a new file.

[Again, please save your eye-rolls and "but couldn't they..." for later].

The klaxons went off for my colleague when he saw the regex:

  {
  while IFS="" read -r line && [ -n "$line" ]
  do
        if [[ "$buffer" =~ ^\"[0-9]{4}-([0][0-9]|1[0-2])-([0-2][0-9]|3[01])\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",[^,]*,\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",.*$ ]];
        then
              echo "$buffer"
        fi
        buffer="${buffer} "
  done
  } < "${f}" > "${NEW}"

My eyes got whiplash. To make it easier to understand, let's put each element of the pattern on its own line:

  ^\"[0-9]{4}-([0][0-9]|1[0-2])-([0-2][0-9]|3[01])\",
  \"[^\"]*\",
  ... (this quoted-field element, repeated)
  [^,]*,
  \"[^\"]*\",
  ... (and repeated again)
  .*$

Which is really something. The first field matches a date format - "yyyy-mm-dd" (which is ok), then we have 12 fields where we care that they are enclosed in double quotes, one field that we want to not be quoted, another 12 fields which are quoted again, and any other fields we don't care about.


I told my colleague that this wasn't a good way of doing things (he agreed).

There are better ways to achieve this, so let's walk through them.

Firstly, the shell globbing. There's a Unix command to generate a list of filesystem entries which match particular criteria. It's called find. If we want a list of files which have a 'csv' extension we do this:

$ find DIR -type f -name \*.csv

You can use '.', an absolute path, or any other way of representing a DIRectory in the filesystem.

Now since we want this in a list to iterate over, let's put it in a variable:

$ CSVfiles=$( find DIR -type f \( -name \*.csv -o -name \*.CSV \) )

(You can redirect stderr to /dev/null, with 2>/dev/null inside the parens if you'd like).

Now that we've got our list, we can move to the second phase - removing lines which do not match our pattern. Let's try this first with awk. Awk has the concept of a Field Separator, and since CSV files are Comma-Separated-Value files, let's make use of that feature. We also know that we are only really interested in two fields - the first (yyyy-mm-dd) and the fourteenth.

$ awk -F',' '$1 ~ /"[0-9]{4}-([0][0-9]|1[0-2])-([0-2][0-9]|3[01])"/ &&
    $14 !~ /".*"/ {print}' < $old > $new

That's still rather ugly but considerably easier to read. For the record, the bare ~ is awk's regex-match operator, and !~ is its negation.

We could also do this with grep, but at the cost of using more of that horrible regex.

In my opinion a better method is to cons up a Python script for this validation purpose, and we don't need to use the CSV module.

from collections import UserString
from datetime import datetime

infile = open("/path/to/file.csv", "r+")
input = infile.readlines()

linecount = len(input)

# iterate over a copy, so we can safely delete from the original
for line in input[:]:

    fields = line.split(",")
    togo = False

    try:
        # strip the surrounding quotes before parsing the date
        datetime.strptime(fields[0].strip('"'), "%Y-%m-%d")
    except ValueError as _ve:
        togo = True

    if '"' in fields[14] or not UserString(fields[14]).isnumeric():
        togo = True
    if togo:
        input.remove(line)

if len(input) != linecount:
    # We've modified the input, so have to write out a new version, but
    # let's overwrite our input file rather than creating a new instance.
    infile.seek(0)
    infile.truncate()
    for line in input:
        infile.write(line)

infile.close()
This script is pretty close to how I would write it in C (could you tell?).

We first open the file (for reading and writing) and read in every line, which yields us a list. While it's not the most memory-efficient way of approaching this problem, it does make processing more efficient because it's one read(), rather than one-read-per-line. We store the number of lines that we've read in for comparison at the end of our loop, and then start the processing.

Since this is a CSV file we know we can split() on the comma, and having done so, we check that we can parse the first field. We're not assigning to a variable with datetime.strptime() because we only care that we can rather than what the object's value is. The second check is to see that we cannot find the double apostrophe in the element, and that the content of the field is in fact numeric. If neither of these checks succeed, we know to delete the line from our input.

Finally, if we have in fact had to delete any lines, we rewind our file (I was going to write pointer, but it's a File object. Told you it was close to C!) to the start, and write out each line of input with a newline character before closing the file.
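The rewind-and-overwrite step can be seen in isolation in this sketch (using a temporary file with fake contents, so it's safe to run):

```python
import tempfile

# Three fake lines; we'll drop the middle one and rewrite in place
with tempfile.NamedTemporaryFile("w+", suffix=".csv", delete=False) as f:
    f.write("keep,1\ndrop,x\nkeep,2\n")
    f.seek(0)
    lines = f.readlines()
    kept = [line for line in lines if not line.startswith("drop")]
    if len(kept) != len(lines):
        f.seek(0)      # rewind to the start of the file
        f.truncate()   # throw away the old contents
        f.writelines(kept)

with open(f.name) as check:
    print(check.read())   # the 'drop' line is gone
```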

Whenever I think about regexes, especially the ones I've written in C over the years, I think about this quote which Jeffrey Friedl wrote about a long time ago:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

It was true when I first heard it some time during my first year of uni, and still true today.