Posts for year 2020

Access token retrieval in Python and Java

At $work I'm part of our API team, enabling access to the rather large datasets that we have acquired (and generated) over the years. We make heavy use of Postman for testing, but every now and again I find that I want to do something on the commandline.

All of our APIs require authorization, for which we use OAuth, and specifically the Bearer Token.

Rather than having to fire up Postman to extract a Bearer Token, I decided to write a Python script to do it for me, so I could then set an environment variable and pass that to curl as a command-line argument. I was a little lazy and hard-coded my clientid and secret - I'm not going to be requesting anybody else's token!

#!/usr/bin/python3
import requests

url="""https://$AUTHSERVER/access/oauth/token?grant_type=client_credentials&client_id={client_id}&client_secret={client_secret}"""
client_id = "nope"
client_secret = "still_nope"

resp = requests.get(url.format(client_id=client_id,
                               client_secret=client_secret))
print("export BEARER=\"Authorization: Bearer " +
      resp.json()["access_token"] + "\"")

I'm taking advantage of the fact that, as I know from many years of use, the requests package does a lot of heavy lifting for me, particularly the JSON decoding.
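
For comparison, without that convenience the decode would be an explicit call to the json module. This is just a sketch of the equivalent, not something the script above needs:

import json

# resp is the requests.Response object from the script above
decoded = json.loads(resp.text)
print(decoded["access_token"])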

Since $work has a shutdown between Christmas and New Year, I figured that I would spend some time implementing this in Java. Not because I have a need to, but because I need to get more Java under my belt since $work is a Java shop and cross-pollination / polyglotting is useful.

The first step was to determine how to send an HTTP GET for a specific URI. A few searches later and I'd arrived at

import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpResponse.BodyHandlers;
import java.net.URI;

and since it seems cleaner to import the Exceptions that these throw (plus the collection types we will need later), I added

import java.io.IOException;
import java.net.URISyntaxException;
import java.util.List;
import java.util.Map;

A bit more hard-coding laziness for the clientid and secret and I had the beginnings of a solution. (Note that I do not claim that any of this is using Best Practices, it's just a building block).

class BearerTokenCLI {

    private static URI authServer;
    private static HttpResponse<String> response;

    public static void main(String... args) {

        try {
            authServer = new URI("https", null,
                    "$AUTHSERVER", 443,
                    "/access/oauth/token",
                    "grant_type=client_credentials" +
                            "&client_id=nope" +
                            "&client_secret=still_node", null);
        } catch (URISyntaxException exc) {
            System.out.println("Received URISyntaxException");
            System.out.println(exc.getReason());
            System.out.println(exc.getMessage());
            System.out.println(exc.getInput());
            System.exit(1);
        }

        System.out.println("Requesting " + authServer.toString());

Ok so far - we've created a new URI object, caught (and bailed out on) the specific exception that it could throw, and (because I'm learning) printed the stringified version of the URI to stdout.

Now we need an HttpRequest to send via an HttpClient:

HttpRequest request = HttpRequest.newBuilder(authServer).build();
HttpClient client = HttpClient.newHttpClient();

try {
    response = client.send(request, BodyHandlers.ofString());
} catch (java.io.IOException | java.lang.InterruptedException jiie) {
    /*
     * Note that this catch() uses Java7++ syntax for handling
     * multiple exceptions in the same block
     */
    System.out.println("Received java.io.IOException");
    System.out.println(jiie.getCause());
    System.out.println(jiie.getMessage());
    System.out.println(jiie.getStackTrace());
}

Assuming we didn't get an exception, we need to check that the HTTP Status Code of the response is OK, or 200:

if (response.statusCode() != 200) {
    /*
     * Something went wrong so print the url we requested, status code,
     * an error message and the response body as text.
     */
    System.out.println("Request was unsuccessful. " +
            "Received status code " + response.statusCode());
    System.out.println("URL requested was\n" + authServer.toString());
    System.out.println("Response body text:\n" + response.body());
    System.exit(1);
}

If it isn't OK we bail out; otherwise we check that the Content-Type header is set to 'application/json'. Why that specific value? If you refer to the RFCs for OAuth (RFC6749 and RFC6750), specifically section 5.1 of the former, you see that

The parameters are included in the entity-body of the HTTP response using the "application/json" media type as defined by [RFC4627]. The parameters are serialized into a JavaScript Object Notation (JSON) structure by adding each parameter at the highest structure level. Parameter names and string values are included as JSON strings. Numerical values are included as JSON numbers. The order of parameters does not matter and can vary.

Let's check for that type then.

/*
 * Check that we've got 'application/json' as the Content-Type.
 * Per https://tools.ietf.org/html/rfc7231#section-3.1 we know that
 * Content-Type is a semicolon-delimited string of
 *     type/subtype;charset;...
 * More importantly, we know from https://tools.ietf.org/html/rfc6749#section-5.1
 * that the response type MUST be JSON.
 */
List<String> contentTypeHeader = response.headers().allValues("Content-Type");
if (contentTypeHeader.isEmpty()) {
    System.out.println("ERROR: Content-Type header is missing!");
    System.out.println(response.headers());
    System.exit(1);
}

Since contentTypeHeader is a List<T> we can either iterate over it, or, since we know that it can only occur once in an HTTP response, we can grab element 0 directly. Here's iteration (and yes, I know we should be confirming that we've actually got 'application/json', do not @ me, etc etc):

for (String el: contentTypeHeader) {
    String contentType = el.split(";")[0];
    System.out.println("Actual Content-Type bit:   " + contentType);
}

On the other hand, making use of our knowledge that there's only one Content-Type header in the response, and per RFC7231 we know the format of the header, we can take a shortcut and grab the value directly:

String contentType = contentTypeHeader.get(0).split(";")[0];
if (!contentType.equalsIgnoreCase("application/json")) {
    /* Not JSON! */
    System.out.println("Content-Type is " + contentType +
            " not application/json. Exiting");
    System.exit(1);
}

So far, so good. It took me several hours to get to this point, because I not only had to refresh my memory of those RFCs but also had to realise that a short-circuit was possible.

Now we can move on to the response body text. By way of printing out response.getClass() I know that the response is an instance of class jdk.internal.net.http.HttpResponseImpl, and visual inspection of it shows that it's JSON. But how do I turn that into a structure that I can pull the access_token information from?

At first I tried using Google's GSON but I just couldn't get my head around it. I need to find and understand more code examples. Until I do that, however, I turned to Jackson JR, which I found a lot more straightforward.

We need another import, this time

import com.fasterxml.jackson.jr.ob.JSON;

And then we construct a Map<String, Object> from the response body:

try {
    Map<String, Object> containerJSON = JSON.std.mapFrom(response.body());
    String accessToken = containerJSON.get("access_token").toString();
    System.out.println("export BEARER=\"BEARER " + accessToken + "\"\n");
} catch (IOException exc) {
    System.out.println("Caught exception " + exc.toString());
    System.out.println("Message:\n" + exc.getMessage());
}

You'll observe that I'm again being a bit lazy here by wrapping this block in the one try {...} catch (...) {...} block. Why so? Because by this point we should be certain that we've actually got an access_token element in the response, and if we don't then something has gone wrong upstream.

Finally, how do we build this thing? As much as I'd like to just run javac over the source and create a solitary jar, I've found that including external dependencies is made immensely easier by using a build system like Maven, Ant or Gradle in the Java ecosystem. For C, of course, there's no place like make(1S) (ok, or GNU Make).

I started with a Maven *archetype*, added this dependency to pom.xml:

<dependencies>
    <dependency>
        <groupId>com.fasterxml.jackson.jr</groupId>
        <artifactId>jackson-jr-objects</artifactId>
        <version>2.12.0</version>
    </dependency>
</dependencies>

and added the Maven Assembly plugin to the <build> lifecycle. Then building was a matter of

$ mvn clean install package assembly:single

and then I could run the package with

$ java -jar target/bearer_token_cli-1.0-SNAPSHOT-jar-with-dependencies.jar

All up, I estimate that researching and writing this in Java took me about 12 hours, most of which was ecosystem research and exploration. There was only one syntax issue which tripped me up - I needed an array, and for about 10 minutes I was searching through the Javadocs for an appropriate class before I remembered I could use String[] arrName. Sigh.

Learning the ecosystem is the thing I'm finding most difficult with Java - it's huge and there are so many different classes to solve overlapping problems. I haven't even begun to work with @Annotations or dependency injection for my own code yet. Truth be told, after a decade+ of working in Solaris, the idea that anything could be injected into the code I've written puts a chill down my spine. I'm sure I'll get past it one day.




I know, I'll use a regex!

This past week, a colleague asked me for help with a shell script that he had come across while investigating how we run one of our data ingestion pipelines. The shell script was designed to clean input CSV files if they had lines which didn't match a specific pattern.

Now to start with, the script was run over a directory and used a very gnarly bit of shell globbing to generate a list of files in a subdirectory. That list was then iterated over to check for a .csv extension.

[Please save your eye-rolls and "but couldn't they..." for later].

Once that list of files had been weeded to only contain CSVs, each of those files was catted and read line by line to see if the line matched a desired pattern - using shell regular expression parsing. If the line did not match the pattern, it was deleted. The matching lines were then written to a new file.

[Again, please save your eye-rolls and "but couldn't they..." for later].

The klaxons went off for my colleague when he saw the regex:

NEW=${f%.csv}_clean.csv;
  {
  buffer=""
  read
  while IFS="" read -r line && [ -n "$line" ]
  do
        buffer="${buffer}${line}"
        if [[ "$buffer" =~ ^\"[0-9]{4}-([0][0-9]|1[0-2])-([0-2][0-9]|3[01])\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",[^,]*,\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",\"[^\"]*\",.*$ ]];
        then
              echo "$buffer"
              buffer=""
        else
              buffer="${buffer} "
        fi
  done
  } < "${f}" > "${NEW}"

My eyes got whiplash. To make it easier to understand, let's put each element of the pattern on a single line:

^\"[0-9]{4}-([0][0-9]|1[0-2])-([0-2][0-9]|3[01])\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
[^,]*,
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
\"[^\"]*\",
.*$

Which is really something. The first field matches a date format - "yyyy-mm-dd" (which is ok), then we have 12 fields where we care that they are enclosed in double quotes, one field that we want to not be quoted, another 12 fields which are quoted again, and any other fields we don't care about.

Wow.

I told my colleague that this wasn't a good way of doing things (he agreed).

There are better ways to achieve this, so let's walk through them.

Firstly, the shell globbing. There's a Unix command to generate a list of filesystem entries which match particular criteria. It's called find. If we want a list of files which have a 'csv' extension we do this:

$ find DIR -type f -name \*.csv

You can use '.' or '*' or any way of representing a DIRectory in the filesystem.

Now since we want this in a list to iterate over, let's put it in a variable:

$ CSVfiles=$( find DIR -type f \( -name \*.csv -o -name \*.CSV \) )

(You can redirect stderr to /dev/null with 2>/dev/null inside the command substitution if you'd like. The escaped parentheses group the two -name tests so that -type f applies to both.)

Now that we've got our list, we can move to the second phase - removing lines which do not match our pattern. Let's try this first with awk. Awk has the concept of a Field Separator, and since CSV files are Comma-Separated-Value files, let's make use of that feature. We also know that we are only really interested in two fields - the first (yyyy-mm-dd) and the fourteenth.

$ awk -F',' '$1 ~ /"[0-9]{4}-([0][0-9]|1[0-2])-([0-2][0-9]|3[01])"/ &&
    $14 !~ /".*"/ {print}' < $old > $new

That's still rather ugly, but considerably easier to read. For the record, the bare ~ is awk's regex match operator, and !~ negates the match.

We could also do this with grep, but at the cost of using more of that horrible regex.

In my opinion a better method is to cons up a Python script for this validation purpose, and we don't need to use the CSV module.

from collections import UserString
from datetime import datetime

# Open for both reading and writing so we can rewrite the file in place.
infile = open("/path/to/file.csv", "r+")
input = infile.readlines()

linecount = len(input)
keep = []

for line in input:

    fields = line.split(",")
    togo = False

    try:
        # The first field is quoted, so strip the quotes before parsing.
        datetime.strptime(fields[0].strip('"'), "%Y-%m-%d")
    except ValueError as _ve:
        togo = True

    # The fourteenth field (index 13) must be unquoted and numeric.
    if '"' in fields[13] or not UserString(fields[13]).isnumeric():
        togo = True
    if not togo:
        keep.append(line)

if len(keep) != linecount:
    # We've modified the input, so have to write out a new version, but
    # let's overwrite our input file rather than creating a new instance.
    infile.seek(0)
    infile.truncate()
    infile.writelines(keep)

infile.close()

This script is pretty close to how I would write it in C (could you tell?).

We first open the file (for reading and writing) and read in every line, which yields us a list. While it's not the most memory-efficient way of approaching this problem, it does make processing more efficient because it's one read(), rather than one-read-per-line. We store the number of lines that we've read in for comparison at the end of our loop, and then start the processing.

Since this is a CSV file we know we can split() on the comma, and having done so, we check that we can parse the first field as a date (after stripping its surrounding quotes). We're not assigning the result of datetime.strptime() to a variable because we only care that the parse succeeds, not what the resulting object's value is. The second check is that the fourteenth field does not contain a double quote and that its content is in fact numeric. If either check fails, we know not to keep the line.

Finally, if we have in fact dropped any lines, we rewind our file (I was going to write pointer, but it's a File object. Told you it was close to C!) to the start, truncate it, and write the lines we kept back out before closing the file.

Whenever I think about regexes, especially the ones I've written in C over the years, I think about this quote which Jeffrey Friedl wrote about a long time ago:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

It was true when I first heard it some time during my first year of uni, and still true today.




Common Table Expressions and an ORM

I solved a gnarly performance problem last week and I'd like to share with you what I learnt while doing it.

We received an alert from one of our automated monitoring systems that a particular query being run via an API was taking too long. And by "too long" I mean minutes. Since database engines are optimised to return data to you in milliseconds, this seemed wildly wrong.

The first step when you are checking a query for performance is to use the query analyzer. This is a very powerful tool in any database, and helps you see just what the engine is going to do for any particular bit of SQL. The term you're looking for in the documentation is EXPLAIN.

We thought the query was fairly simple. It followed this general form:

SELECT columns, COUNT(SELECT specific column WHERE conditions)
FROM database
WHERE conditions
ORDER BY ordering criteria;

and gave us not just results with specific conditions, but the count of rows which met those conditions too.

You will notice that I did not use first set of conditions and second set of conditions. This is because the selection criteria were in fact the same. That was the first clue.

A second clue was that one of these selection conditions was a range check - is the column's value between A and B? [We actually had two, but having just one was sufficient to show the problem]. This was expressed as

columnA >= smallervalue AND
columnA <= largervalue

So you'll see that for each row we had two comparisons to do. I tried making that condition singular, by using just >= smallervalue (I also tried <= largervalue - no difference in timing) and while that did make the query faster, it did not reflect what we need to be able to do, so that was out.

Back to the query planner. It wasn't happy. It turns out that the most efficient plan it could come up with was using a Bitmap Index Scan. Twice. For the same query - but throwing the results of the first query away before running the whole thing again.

I knew that there had to be a better way (tm) - and there was. Here's the first thing I learned: the SQL standard has a thing called a Common Table Expression or CTE. This is a way of creating a temporary table that is used for just that query.

With this knowledge, I could now re-write the query in a considerably more efficient fashion. (It took me several days to figure this out - I'm no SQL guru!)

WITH better_query AS
    (SELECT columnname
    FROM tablename
    WHERE conditions)
SELECT columns, count(better_query.columnname)
FROM tablename
GROUP BY grouping criteria
ORDER BY ordering criteria;

Excellent! ... except that that was giving me all the rows in tablename which matched from better_query rather than the specific rows which met conditions. To solve this niggle I needed a join.

WITH better_query AS
    (SELECT columnname
    FROM tablename
    WHERE conditions)
SELECT columns, count(better_query.columnname)
FROM tablename TN
JOIN better_query BQ on BQ.columnname = TN.columnname
GROUP BY grouping criteria
ORDER BY ordering criteria;

The query planner was a lot happier. Happier to the tune of a 20x improvement.

There might have been a cheer or two.

So how do we put this into our application codebase?

While a major reason $employer hired me was my Python skills, $employer isn't a Python shop by any stretch of the imagination. We're a Java shop, and I'm cross-pollinating :-).

Part of that process has been learning how we abstract database specificities out of the code - we use an Object-Relational Mapping (aka ORM) to cope with using more than one vendor's database. There are a few options for Java ORMs and the one that we use is jOOQ. This is the first time I've used an ORM (of course I've read about them, but years ago) so this too was something new that I've learned. Heading to the documentation I read a lot, getting my head around the concepts and syntax.

It took me several days of research and hacking (stackoverflow, github, come on down!) to work out the correct syntax.

Here's the gist of what we needed:

public class QueryClass {

    private final String withClauseName = "betterQuery";
    private final String withClauseField = "fieldName";

    ....

    CommonTableExpression<Record1<Integer>> withClause = DSL.name(withClauseName)
        .fields(withClauseField)
        .as(select(columnName)
            .from(tableName)
            .where(conditions)
        );

    Field<Integer> results = DSL.field(context.select(
        count(DSL.field(columnName)))
        .from(withClause)
        .limit(1)
    ).as("results");

    /*
     * append results field to array of column names in the select, then ...
     */

    List<Typename> resultSet = context.with(withClause)
        .select(columns)
        .from(tableName)
        .join(withClause)
            .on(withClause.field(withClauseField, typename.class)
                .eq(tablename.columnName))
        .groupBy(columnName)
        .orderBy(columnName)
    ....
};

No, it's not simple - but using jOOQ does mean that we can keep our code pretty close to how we would write bare SQL (in this case that would still be rather gnarly), and that is definitely a win.

Acknowledgement

Since this is another SQL-based post, I'd like to acknowledge the massive knowledge and confidence boost I got from reading Julia Evans (aka b0rk)'s SQL zine Become a SELECT star. She has a beautiful style and breaks complex technical concepts down into easily understandable chunks with great illustrations. Reading that zine advanced my database confidence immensely.




A brief introduction to the Python Database API

As popular as NoSQL databases currently are, there is still an immense amount of data in the world for which the relational database is still the most appropriate way to store, access and manipulate it. For those of us who like to use Python rather than JDBC, we are fortunate to have bindings for the major databases which implement the Python Database API, aka PEP249.

This standard is important because it allows you to write your application in a way that promotes cross-platform (cross-database engine) portability. While you do still need to be aware of differences between the databases (for example, Oracle's VARCHAR2 vs PostgreSQL's VARCHAR), the essential tasks that you need to accomplish when making use of a database are abstracted from you.
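
To make that abstraction concrete, here is a hedged sketch of the shape PEP249 guarantees: the module name and the connect() arguments differ between bindings, but the connect/cursor/execute/fetch pattern does not. The helper function and its arguments are my own illustration, not part of either binding.

import importlib

def fetch_one_row(binding, connect_kwargs, query):
    # binding is the module name, e.g. "psycopg2" or "cx_Oracle";
    # connect_kwargs are whatever that binding's connect() expects.
    dbmodule = importlib.import_module(binding)
    connection = dbmodule.connect(**connect_kwargs)
    try:
        cursor = connection.cursor()
        cursor.execute(query)
        return cursor.fetchone()
    finally:
        connection.close()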

In this brief introduction I will show you how to install two popular Python database binding packages, connect to a database and run queries.

Installing the bits you need

I assume that you've got Python installed on your system already, and that the database you want to connect to is set up and running. What you need now are the bindings for that particular database. For the Oracle database, this is a package known as cx_Oracle. For PostgreSQL, I suggest psycopg2 though there are other bindings available.

You will definitely need to install the pre-packaged database client libraries on your host (or in your container) as well. For Oracle, look for your platform in the Oracle Instant Client download pages; for PostgreSQL on Linux you will need libpq-dev as supplied by your package manager. On macOS, please find the appropriate option from the PostgreSQL macOS download page.

After you've installed these pieces, you can then pip install --user cx_Oracle or pip install --user psycopg2. My personal preference is to install bindings in a venv. For Solaris 11 you can get the cx_Oracle bits as well as the Oracle Instant Client by uttering pkg install cx_Oracle. [On a personal note, I made that possible - see the History tab on the Solaris Userland cx_Oracle github].

Getting a connection to the database

Now that you have the correct packages installed, you can start investigating. The first thing to do is start an interactive interpreter and import the modules you need:

$ python3.8
>>> import cx_Oracle

(Note that unless you're using Solaris' pre-packaged version, you must point LD_LIBRARY_PATH to where the module can locate the Instant Client libraries, specifically libclntsh.so)

$ python3.8
>>> import psycopg2
>>> from psycopg2.extras import execute_batch

We need a connection to the database, which in Oracle terms is a DSN, or Data Source Name. This is made up of the username, password, host (either hostname or IP address), port and database service name. Put together, the host, port and service name form what Oracle calls an Easy Connect string.

The common example you will see in Oracle documentation (and on stackoverflow!) is

scott/tiger@orcl:1521/orcl

I'm a bit over seeing that, so I've created a different username and DBname:

DEMOUSER/DemoDbUser1@dbhost:1521/demodb

Since Oracle defaults to using port 1521 on the host, you only need to specify the port number if your database is listening on a different port.

An Oracle connection is therefore

user = "DEMOUSER"
passwd = "DemoDbUser1"
servicename = "dbhost:1521/demodb"

try:
    connection = cx_Oracle.connect(user, passwd, servicename)
except cx_Oracle.DatabaseError as dbe:
    print("""Unable to obtain a connection to {servicename}""".format(servicename=servicename))
    raise

With PostgreSQL we can also specify whether we want SSL enabled.

try:
    connection = psycopg2.connect(dbname=dbname,
                                  user=dbuser,
                                  password=dbpassword,
                                  host=dbhost,
                                  port=dbport,
                                  sslmode=dbsslmode)
except psycopg2.OperationalError as dboe:
    print("""Unable to obtain a connection to the database.\n
          Please check your database connection details\n
          Notify dbadmin@{dbhost}""".format(dbhost=dbhost),
          file=sys.stderr, flush=True)
    raise

Once we have the connection, we need a cursor so we can execute statements:

cursor = connection.cursor()

Now that we have our connection and a cursor to use, what can we do with it? SQL baby!

We need a handy dataset to muck around with, so I've got just the thing courtesy of my previous post about determining your electorate. My JSON files have a fairly basic structure reflecting my needs for that project, but that is still sufficient for our purposes here.

>>> import json
>>> actf = json.load(open("find-my-electorate/json/ACT.json", "r"))
>>> actf.keys()
dict_keys(['Brindabella', 'Ginninderra', 'Kurrajong', 'Murrumbidgee', 'Yerrabi'])
>>> actf["Yerrabi"].keys()
dict_keys(['jurisdiction', 'locality', 'blocks', 'coords'])

The 'jurisdiction' key is the state or territory name (ACT, Australian Capital Territory in this case), the 'locality' is the electorate name, 'blocks' is a list of the Australian Bureau of Statistics Mesh Block which the electorate contains, and 'coords' is a list of the latitude, longitude pairs which are the boundaries of those blocks.

Now we need a schema to operate within. This is reasonably simple - if you've done any work with SQL databases. To start with, we need two SEQUENCEs, which are database objects from which any database user may generate a unique integer. This is the easiest and cheapest way that I know of to generate values for primary key columns. (For a good explanation of them, see the Oracle 19c CREATE SEQUENCE page). After that, we'll create two tables: electorate and geopoints. (We're going to ignore the Mesh Block data, it's not useful for this example).

-- This is the Oracle DB form
DROP TABLE GEOPOINTS CASCADE CONSTRAINTS;
DROP TABLE ELECTORATES CASCADE CONSTRAINTS;

DROP SEQUENCE ELECTORATE_SEQ;
DROP SEQUENCE LATLONG_SEQ;

CREATE SEQUENCE ELECTORATE_SEQ INCREMENT BY 1 MINVALUE 1 NOMAXVALUE;
CREATE SEQUENCE LATLONG_SEQ INCREMENT BY 1 MINVALUE 1 NOMAXVALUE;

CREATE TABLE ELECTORATES (
    ELECTORATE_PK   NUMBER DEFAULT DEMOUSER.ELECTORATE_SEQ.NEXTVAL NOT NULL PRIMARY KEY,
    ELECTORATE      VARCHAR2(64) NOT NULL,
    STATENAME       VARCHAR2(3) NOT NULL -- Using the abbreviated form
);

CREATE TABLE GEOPOINTS (
    LATLONG_PK      NUMBER DEFAULT DEMOUSER.LATLONG_SEQ.NEXTVAL NOT NULL PRIMARY KEY,
    LATITUDE        NUMBER NOT NULL,
    LONGITUDE       NUMBER NOT NULL,
    ELECTORATE_FK   NUMBER REFERENCES ELECTORATES(ELECTORATE_PK)
);

-- Now for the PostgreSQL form
DROP TABLE IF EXISTS GEOPOINTS CASCADE;
DROP TABLE IF EXISTS ELECTORATES CASCADE;

DROP SEQUENCE IF EXISTS ELECTORATE_SEQ CASCADE;
DROP SEQUENCE IF EXISTS LATLONG_SEQ CASCADE;

CREATE SEQUENCE ELECTORATE_SEQ INCREMENT BY 1 MINVALUE 1 NO MAXVALUE;
CREATE SEQUENCE LATLONG_SEQ INCREMENT BY 1 MINVALUE 1 NO MAXVALUE;

CREATE TABLE ELECTORATES (
    ELECTORATE_PK   INTEGER DEFAULT NEXTVAL('ELECTORATE_SEQ') NOT NULL PRIMARY KEY,
    ELECTORATE      VARCHAR(64) NOT NULL,
    STATENAME       VARCHAR(3) NOT NULL -- Using the abbreviated form
);

CREATE TABLE GEOPOINTS (
    LATLONG_PK      INTEGER DEFAULT NEXTVAL('LATLONG_SEQ') NOT NULL PRIMARY KEY,
    LATITUDE        NUMERIC NOT NULL,
    LONGITUDE       NUMERIC NOT NULL,
    ELECTORATE_FK   INTEGER REFERENCES ELECTORATES(ELECTORATE_PK) NOT NULL
);

We need to execute these statements one by one, using cursor.execute(). Come back once you've done that and we've got our schema set up.
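
If you have saved the appropriate DDL into a file, one rough way of running it statement by statement looks like the sketch below. The filename and the naive split on semicolons are assumptions that happen to work for the simple statements above; this is not a general SQL parser.

with open("schema.sql") as ddl:
    statements = ddl.read().split(";")

for statement in statements:
    statement = statement.strip()
    if not statement:
        continue
    try:
        cursor.execute(statement)
    except Exception as exc:
        # The DROPs will fail harmlessly the first time through,
        # when there is nothing to drop yet.
        print("Statement failed ({exc}), continuing".format(exc=exc))

connection.commit()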

[I see that a short time has passed - welcome back]

Now we need to populate our database. For both connection types, we'll make use of prepared statements, which allow us to insert, update, delete or select many rows at a time. For the Oracle connection we'll use the executemany() function, but for PostgreSQL's psycopg2 bindings we'll use execute_batch() instead. (See the psycopg2 website note about it).

For the Oracle examples, we'll use a bind variable so that when we INSERT into the ELECTORATES table we get the ELECTORATE_PK to use in the subsequent INSERT to the GEOPOINTS table. That saves us a query to get that information. The psycopg2 module does not support output bind variables, unfortunately, so there we fetch the returned key instead. Since we have many records to process, I've created some small functions to enable DRY:

# Oracle version
>>> epk = cursor.var(int)
>>> electStmt = """INSERT INTO ELECTORATES (ELECTORATE, STATENAME)
...                VALUES (:electorate, :statename)
...                RETURNING ELECTORATE_PK INTO :epk"""
>>> geoStmt = """INSERT INTO GEOPOINTS(LATITUDE, LONGITUDE, ELECTORATE_FK)
...              VALUES (:lat, :long, :fk)"""

>>> def addstate(statename):
...     for electorate in statename:
...         edict = {"electorate": statename[electorate]["locality"],
...                  "statename": statename[electorate]["jurisdiction"],
...                  "epk": epk}
...         cursor.execute(electStmt, edict)
...         points = list()
...         for latlong in statename[electorate]["coords"]:
...             points.append({"latitude": float(latlong[1]),
...                            "longitude": float(latlong[0]),
...                            "fk": int(epk.getvalue(0)[0])})
...         cursor.executemany(geoStmt, points)
...         connection.commit()

>>> allpoints = dict()
>>> states = ["act", "nsw", "nt", "qld", "sa", "tas", "wa", "vic", "federal"]

>>> for st in states:
...     allpoints[st] = json.load(open("json/{stu}.json".format(stu=st.upper()), "r"))
...     addstate(allpoints[st])

For the PostgreSQL version, we need to subtly change the INSERT statements:

>>> electStmt = """INSERT INTO ELECTORATES (ELECTORATE, STATENAME)
...                VALUES (%(electorate)s, %(statename)s)
...                RETURNING ELECTORATE_PK"""
>>> geoStmt = """INSERT INTO GEOPOINTS(LATITUDE, LONGITUDE, ELECTORATE_FK)
...                VALUES (%(latitude)s, %(longitude)s, %(fk)s)"""

Another thing we need to change is the first execute, because bind variables are an Oracle extension, and we're also going to change from executemany() to execute_batch():

>>> def addstate(statename):
...     for electorate in statename:
...         edict = {"electorate": statename[electorate]["locality"],
...                  "statename": statename[electorate]["jurisdiction"]}
...         cursor.execute(electStmt, edict)
...         epk = cursor.fetchone()[0]
...         points = list()
...         for latlong in statename[electorate]["coords"]:
...             points.append({"latitude": float(latlong[1]),
...                            "longitude": float(latlong[0]), "fk": epk})
...         execute_batch(cursor, geoStmt, points)
...         connection.commit()

The rest of the data loading is the same as with Oracle. Since I'm curious about efficiencies, I did a rough benchmark of the data load with executemany() as well, and it was around half the speed of using execute_batch(). YMMV, of course, so always test your code with your data and as close to real-world conditions as possible.
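
If you want to run a similar comparison yourself, something as blunt as wrapping the load in a timer gives a rough figure. This uses the addstate() function and allpoints dictionary from above; it's a sketch, not a rigorous benchmark.

import time

start = time.perf_counter()
for st in states:
    addstate(allpoints[st])
elapsed = time.perf_counter() - start
print("Loaded {n} jurisdictions in {t:.1f} seconds".format(n=len(states), t=elapsed))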

Now that we have the data loaded we can do some investigations. I'm going to show the PostgreSQL output for this and call out differences with Oracle where necessary.

While some parts of our state and territory borders follow rivers and mountain ranges, quite a lot of them follow specific latitudes and longitudes. The border between Queensland and New South Wales, for example is mostly along the 29th parallel. Between the Northern Territory and South Australia it's the 26th parallel, and that between South Australia and Western Australia is along meridian 129 East.

If you go to a mapping service and plug in 29S,141E (the line between Queensland and New South Wales):

/images/2020/python-db-api/29s141e-400x400.png

you'll see that the exact point is not where the boundary is actually drawn. That means we need to use some fuzziness in our matching.

>>> cursor.execute("""SELECT STATENAME, ELECTORATE FROM ELECTORATES E WHERE ELECTORATE_PK IN
...                   (SELECT DISTINCT ELECTORATE_FK FROM GEOPOINTS WHERE LATITUDE BETWEEN -29.001 AND -28.995)
...                   ORDER BY STATENAME, ELECTORATE""")
>>> results = cursor.fetchall()
>>> for s, e in results:
...     print("{s:6} {e}".format(s=s, e=e))
...
NSW    Ballina
NSW    Barwon
NSW    Clarence
NSW    Lismore
NSW    New England
NSW    Northern Tablelands
NSW    Page
NSW    Parkes
QLD    Maranoa
QLD    Southern Downs
QLD    Warrego
SA     Giles
SA     Grey
SA     Stuart
SA     Stuart
WA     Durack
WA     Geraldton
WA     Kalgoorlie
WA     Moore
WA     North West Central

Hmm. Not only did I not want SA or WA electorates returned, the database has also given me the federal electorates as well. Let's update our electorates table to reflect that jurisdictional issue. (This is the PostgreSQL form; Oracle 19c has no BOOLEAN column type, so there you would use something like NUMBER(1) instead):

>>> cursor.execute("""ALTER TABLE ELECTORATES ADD FEDERAL BOOLEAN""")
>>> connection.commit()

Popping into a psql session for a moment, let's see what we have:

demodb=> \d electorates
                                    Table "public.electorates"
    Column     |         Type          | Collation | Nullable |               Default
---------------+-----------------------+-----------+----------+-------------------------------------
 electorate_pk | integer               |           | not null | nextval('electorate_seq'::regclass)
 electorate    | character varying(64) |           | not null |
 statename     | character varying(3)  |           | not null |
 federal       | boolean               |           |          |
Indexes:
    "electorates_pkey" PRIMARY KEY, btree (electorate_pk)
Referenced by:
    TABLE "geopoints" CONSTRAINT "geopoints_electorate_fk_fkey" FOREIGN KEY (electorate_fk) REFERENCES electorates(electorate_pk)

Now let's populate that column. Unfortunately, though, some state electorates have the same name as federal electorates - and some electorate names exist in more than one state, too! (I'm looking at you, Bass!). Tempting as it is to zorch our db and start from scratch, I'm going to take advantage of this information:

  1. We added the federal electorate list after all the states,

  2. the federal list was constructed starting with the ACT, and therefore

  3. Federal electorates will have a higher primary key value than all the states and territories.

With that in mind here's the first federal electorate entry:

demodb=> SELECT ELECTORATE_PK, ELECTORATE, STATENAME FROM ELECTORATES WHERE STATENAME = 'ACT' ORDER BY ELECTORATE_PK, STATENAME;
 electorate_pk |  electorate  | statename
---------------+--------------+-----------
             1 | Brindabella  | ACT
             2 | Ginninderra  | ACT
             3 | Kurrajong    | ACT
             4 | Murrumbidgee | ACT
             5 | Yerrabi      | ACT
           416 | Bean         | ACT
           417 | Canberra     | ACT
           418 | Fenner       | ACT
(8 rows)

Let's check how many electorates we have in total, and what the highest primary key is:

>>> cursor.execute("""SELECT COUNT(ELECTORATE_PK), MAX(ELECTORATE_PK) FROM ELECTORATES""")
>>> cursor.fetchall()
[(566, 566)]

And a quick check to see that we do in fact have 151 electorates from Bean (primary key 416) onwards:

>>> 566 - 415
151

We do. Onwards.

>>> cursor.execute("""UPDATE ELECTORATES SET FEDERAL = TRUE WHERE ELECTORATE_PK > 415""")
>>> connection.commit()
>>> cursor.execute("""SELECT COUNT(*) FROM ELECTORATES WHERE FEDERAL IS TRUE""")
>>> cursor.fetchall()
[(151,)]

Likewise, we'll set the others to federal=False:

>>> cursor.execute("""UPDATE ELECTORATES SET FEDERAL = FALSE WHERE ELECTORATE_PK < 416""")
>>> connection.commit()

Back to our queries. I want to see both sorts of electorates, but grouped by whether they are federal or state electorates:

>>> cursor.execute("""SELECT E.STATENAME, E.ELECTORATE, E.FEDERAL FROM ELECTORATES E
...                   WHERE E.ELECTORATE_PK IN
...                       (SELECT DISTINCT ELECTORATE_FK FROM GEOPOINTS WHERE LATITUDE BETWEEN -29.001 AND -28.995)
...                   GROUP BY E.STATENAME, E.ELECTORATE, E.FEDERAL
...                   ORDER BY STATENAME, ELECTORATE""")
>>> fedstate = cursor.fetchall()
>>> fedstate
[('NSW', 'Ballina', False), ('NSW', 'Barwon', False), ('NSW', 'Clarence', False), ('NSW', 'Lismore', False), ('NSW', 'New England', True), ('NSW', 'Northern Tablelands', False), ('NSW', 'Page', True), ('NSW', 'Parkes', True), ('QLD', 'Maranoa', True), ('QLD', 'Southern Downs', False), ('QLD', 'Warrego', False), ('SA', 'Giles', False), ('SA', 'Grey', True), ('SA', 'Stuart', False), ('WA', 'Durack', True), ('WA', 'Geraldton', False), ('WA', 'Kalgoorlie', False), ('WA', 'Moore', False), ('WA', 'North West Central', False)]

>>> def tfy(input):
...     if input:
...         return "yes"
...     else:
...         return ""

>>> for res in fedstate:
...     fmtstr = """{statename:6} {electorate:30} {federal}"""
...     print(fmtstr.format(statename=res[0], electorate=res[1], federal=tfy(res[2])))
...
NSW    Ballina
NSW    Barwon
NSW    Clarence
NSW    Lismore
NSW    New England                    yes
NSW    Northern Tablelands
NSW    Page                           yes
NSW    Parkes                         yes
QLD    Maranoa                        yes
QLD    Southern Downs
QLD    Warrego
SA     Giles
SA     Grey                           yes
SA     Stuart
WA     Durack                         yes
WA     Geraldton
WA     Kalgoorlie
WA     Moore
WA     North West Central

I could have added a WHERE E.FEDERAL = TRUE to the query, too.
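
That variant is just one extra condition on the same query - shown here for completeness, and it should return only the rows marked "yes" above:

>>> cursor.execute("""SELECT E.STATENAME, E.ELECTORATE, E.FEDERAL FROM ELECTORATES E
...                   WHERE E.ELECTORATE_PK IN
...                       (SELECT DISTINCT ELECTORATE_FK FROM GEOPOINTS
...                        WHERE LATITUDE BETWEEN -29.001 AND -28.995)
...                   AND E.FEDERAL = TRUE
...                   GROUP BY E.STATENAME, E.ELECTORATE, E.FEDERAL
...                   ORDER BY STATENAME, ELECTORATE""")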

Finally, let's see what state electorates in WA and SA are on the border:

>>> cursor.execute("""SELECT STATENAME, ELECTORATE FROM ELECTORATES E
...                   WHERE ELECTORATE_PK IN
...                       (SELECT DISTINCT ELECTORATE_FK FROM GEOPOINTS WHERE LATITUDE BETWEEN -60.00 AND -25.995
...                        AND LONGITUDE BETWEEN 128.995 AND 129.1)
...                   AND FEDERAL = FALSE ORDER BY STATENAME, ELECTORATE""")
>>> results = cursor.fetchall()
>>> for _res in results:
...     print("""{statename:6} {electorate}""".format(statename=_res[0], electorate=_res[1]))
...
NT     Namatjira
SA     Giles
WA     Kalgoorlie
WA     North West Central

Why do we have that electorate from the Northern Territory? It's because the coordinates are a bit fuzzy and we had to use a range (the BETWEEN) in our query.
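
If we really only care about the SA and WA sides of the meridian, one blunt workaround is to filter on the state as well - this papers over the imprecision rather than fixing it:

>>> cursor.execute("""SELECT STATENAME, ELECTORATE FROM ELECTORATES E
...                   WHERE ELECTORATE_PK IN
...                       (SELECT DISTINCT ELECTORATE_FK FROM GEOPOINTS
...                        WHERE LATITUDE BETWEEN -60.00 AND -25.995
...                        AND LONGITUDE BETWEEN 128.995 AND 129.1)
...                   AND FEDERAL = FALSE
...                   AND STATENAME IN ('SA', 'WA')
...                   ORDER BY STATENAME, ELECTORATE""")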

Finally

I apologise, because while this was supposed to be a brief introduction I did get side-tracked with data investigation. I suppose that's an occupational hazard since I'm a Data Engineer working for a company which provides GIS-related services. Anyway, in terms of depth, this was definitely only scratching the surface of what is possible with Python and databases. I encourage you to go and read the Python Database API as well as the SQL reference manual(s) for your chosen database and its binding documentation.