Since I departed from my comfortable niche in Solaris engineering earlier
this year, I've spent a considerable amount of time and energy in re-training
and upskilling to assist my employment prospects. Apart from acquainting
myself with a lot of terminology, I've written code. A lot of code. Most of
it, as it turns out, has been related to microservices.
This post is about a microservice I wrote to assist with accessibility in a
specific part of the Australian electoral process (finding out which
electorate you live in) and some supporting digressions.
You can find all the code for this microservice and its associated data
preparation in my GitHub repos grabbag and find-my-electorate.
On 18 May 2019, Australia had a federal election, and in the lead up to
that event I became very interested in political polling. While I have a few
ideas on the subject which are on my back burner, mind-mapping the various
components of political polling got me wondering: how do the various state,
territory and federal electoral commissions map a voter's address to an
electorate?
My first port of call was the Australian Electoral Commission and their
Find my electorate site. This is nicely laid out and lets you find out
which electorate you are in - by postcode. This is all well and good if you're
in a densely populated area, like the electorate of Brisbane, which covers
just three suburbs. If, however, you choose somewhere else, like 2620, which
covers a lot of Canberra and surrounding districts, you wind up with several
electorates covering that one postcode.
The AEC's website is written in ASP.NET, which is up to the task, but
when you have more than one page of results the authors of the page make use
of some (to my mind) squirrelly features and callbacks which make scraping the
site difficult. As best I can determine, the AEC doesn't provide an API to
access this information, so Another Method was required.
At this point, I turned to the standard libraries for this sort of thing in the
Python world: Beautiful Soup and requests. I started by setting up a
quick venv to keep the dependencies contained (json and csv are part of the
standard library, so only requests and bs4 need installing):
$ python3.7 -m venv scraping-venv
$ . scraping-venv/bin/activate
(scraping-venv) $ pip install requests bs4
Now since we know the URL to query, we can fetch the first page of responses
very easily:
import requests
from bs4 import BeautifulSoup

url = "https://electorate.aec.gov.au/LocalitySearchResults.aspx?"
url += "filter={0}&filterby=Postcode"
# Request the results for postcode 2620 and parse them
result = requests.post(url.format(2620))
resh = BeautifulSoup(result.text, "html.parser")
Beautiful Soup parses the response text, and gives us a tree-like structure
to work with. Making use of the Chrome devtools (or the Firefox devtools)
I could see that I needed to find a <table> with an id of
ContentPlaceHolderBody_gridViewLocalities - what a mouthful! - and then
process all the table rows (<tr>) within that table:
tblAttr = "ContentPlaceHolderBody_gridViewLocalities"
restbl = resh.find_all(name="table", attrs={"id": tblAttr})
rows = restbl[0].find_all("tr")
Using a for loop we can construct a dict of the data that we actually
need. Simple!
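As a rough sketch (the column order here is my assumption, inferred from the
script output shown further down, so treat it as illustrative rather than the
exact code):

results = []
for row in rows[1:]:   # skip the header row
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) >= 4:
        results.append({"state": cells[0],
                        "postcode": cells[1],
                        "locality": cells[2],
                        "electorate": cells[3]})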
What do we do, though, when we want to get the second or later pages of
results? This is where the squirrelly features and callbacks come in. The page
makes use of an __EVENTARGUMENT element which is POSTed as payload
back to the same URL. The way that we determine this is to look for a row with
the class pagingLink, then for each table data (<td>) element check
for its contents matching this regex:
".*__doPostBack.'(.*?gridViewLocalities)','(Page.[0-9]+)'.*"
And after that we can recursively call our query with the extra payload data
in the argument list:
def queryAEC(postcode, extrapage):
    """
    Queries the AEC url and returns soup. If extrapage is empty
    then we pass the soup to findFollowups before returning.
    """
    url = "https://electorate.aec.gov.au/LocalitySearchResults.aspx?"
    url += "filter={0}&filterby=Postcode"
    if not extrapage:
        res = requests.post(url.format(postcode))
    else:
        # POST the callback argument naming the page of results we want
        payload = {"__EVENTARGUMENT": extrapage}
        res = requests.post(url.format(postcode), data=payload)
    return BeautifulSoup(res.text, "html.parser")
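The findFollowups helper mentioned in that docstring isn't shown here, but a
minimal sketch of it, following the pagingLink-and-regex approach described
above (the exact AEC markup is an assumption on my part), might look like:

import re

pagere = re.compile(r".*__doPostBack.'(.*?gridViewLocalities)','(Page.[0-9]+)'.*")

def findFollowups(soup):
    """
    Returns the __EVENTARGUMENT values (e.g. "Page$2") for any
    additional pages of results found in the paging row.
    """
    followups = []
    for row in soup.find_all("tr", attrs={"class": "pagingLink"}):
        for td in row.find_all("td"):
            match = pagere.match(str(td))
            if match:
                followups.append(match.group(2))
    return followups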
I now had a script to run which extracted this info and pretty-printed it
(as well as the same info in JSON):
$ ./postcode.py 2620
State Postcode Locality Electorate
ACT 2620 BEARD Canberra
ACT 2620 BOOTH DISTRICT Bean
NSW 2620 BURRA Eden-Monaro
NSW 2620 CARWOOLA Eden-Monaro
NSW 2620 CLEAR RANGE Eden-Monaro
ACT 2620 CORIN DAM Bean
NSW 2620 CRESTWOOD Eden-Monaro
NSW 2620 ENVIRONA Eden-Monaro
NSW 2620 FERNLEIGH PARK Eden-Monaro
NSW 2620 GOOGONG Eden-Monaro
NSW 2620 GREENLEIGH Eden-Monaro
NSW 2620 GUNDAROO Eden-Monaro
ACT 2620 HUME Bean
NSW 2620 KARABAR Eden-Monaro
ACT 2620 KOWEN DISTRICT Canberra
ACT 2620 KOWEN FOREST Canberra
NSW 2620 MICHELAGO Eden-Monaro
ACT 2620 OAKS ESTATE Canberra
ACT 2620 PADDYS RIVER DISTRICT Bean
NSW 2620 QUEANBEYAN Eden-Monaro
NSW 2620 YARROW Eden-Monaro
NSW 2620 QUEANBEYAN EAST Eden-Monaro
NSW 2620 QUEANBEYAN WEST Eden-Monaro
NSW 2620 RADCLIFFE Eden-Monaro
ACT 2620 RENDEZVOUS CREEK DISTRICT Bean
ACT 2620 ROYALLA Bean
NSW 2620 ROYALLA Eden-Monaro
NSW 2620 SUTTON Eden-Monaro
ACT 2620 TENNENT DISTRICT Bean
ACT 2620 THARWA Bean
NSW 2620 THARWA Eden-Monaro
NSW 2620 THE ANGLE Eden-Monaro
NSW 2620 THE RIDGEWAY Eden-Monaro
NSW 2620 TINDERRY Eden-Monaro
NSW 2620 TRALEE Eden-Monaro
ACT 2620 TUGGERANONG DISTRICT Bean
NSW 2620 URILA Eden-Monaro
NSW 2620 WAMBOIN Eden-Monaro
ACT 2620 WILLIAMSDALE Bean
NSW 2620 WILLIAMSDALE Eden-Monaro
That really is quite a few suburbs.
So now that we've got a way to extract that information, how do we make it
available and useful for everybody? With a microservice! I hear you cry.
The very first microservice I wrote (in 2011-12, the subject of a future
post) used CherryPy, because we'd embedded it within Solaris IPS (image
packaging system) and didn't need any further corporate approvals. The path of
least resistance. This time, however, I was unconstrained regarding approvals,
so I had to choose between Django and Flask. For no particular reason, I
chose Flask.
It was pretty easy to cons up the requisite templates, and write the
/results method. It was at this point that my "extend the fix" habit
(learnt via the Kepner-Tregoe Analytical Troubleshooting training many
years ago) kicked in, and I started exploring the Electoral Commission of
Queensland website for the same sort of information. To my surprise, the
relatively straightforward interface of the AEC was not available, and the
closest analogue was an interactive map.
After a brief phone conversation with ECQ and more digging, I discovered
that the 2017 boundaries were available from QLD Spatial in shapefile,
MapInfo and Google Maps KML formats. This was very useful, because KML can be
mucked about with directly using Beautiful Soup. After not too much effort I
had the latitude+longitude pairs for the boundaries extracted and stored as
JSON. My phone conversation with ECQ also took me down the path
of wanting to translate a street address into GeoJSON - and that took me
to the Google Maps API. I did investigate OpenStreetMap's API, but
testing a few specific locations (addresses where we've lived over the
years) gave me significantly different latitude+longitude results. I bit the
bullet and got a Google Maps API key.
The next step was to research how to find out if a specific point is located
within a polygon, and to my delight the Even-odd rule has example code
in Python, which needed only a small change to work with my data
arrangement.
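For reference, the even-odd test, lightly adapted here to take each vertex as
an (x, y) tuple (roughly the data-arrangement change I needed), looks like
this:

def is_point_in_path(x, y, poly):
    """
    Even-odd rule: cast a ray to the right of (x, y) and count how
    many polygon edges it crosses; an odd count means "inside".
    poly -- a list of (x, y) vertex tuples
    """
    num = len(poly)
    j = num - 1
    inside = False
    for i in range(num):
        xi, yi = poly[i]
        xj, yj = poly[j]
        # Does edge (i, j) straddle the ray's height, and is the
        # crossing point to the right of our point?
        if ((yi > y) != (yj > y)) and \
                (x < (xj - xi) * (y - yi) / (yj - yi) + xi):
            inside = not inside
        j = i
    return inside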
With that knowledge in hand, it was time to turn the handle on the
Google Maps API:
keyarg = "&key={gmapkey}"
queryurl = "https://maps.googleapis.com/maps/api/geocode/json?address="
queryurl += "{addr} Australia"
queryurl += keyarg
...
# Helper functions
def get_geoJson(addr):
    """
    Queries the Google Maps API for specified address, returns
    a dict of the formatted address, the state/territory name, and
    a float-ified version of the latitude and longitude.
    """
    res = requests.get(queryurl.format(addr=addr, gmapkey=gmapkey))
    dictr = {}
    if not res.ok or res.json()["status"] == "ZERO_RESULTS":
        # Pass the raw response back so the caller can report the error
        dictr["res"] = res
    else:
        rresj = res.json()["results"][0]
        dictr["formatted_address"] = rresj["formatted_address"]
        dictr["latlong"] = rresj["geometry"]["location"]
        for el in rresj["address_components"]:
            if el["types"][0] == "administrative_area_level_1":
                dictr["state"] = el["short_name"]
    return dictr
When you provide an address, we send that to Google, which does a best-effort
match on the text address and then returns GeoJSON for that match. For
example, if you enter 42 Wallaby Way, Sydney the best-effort match will give
you 42 Rock Wallaby Way, Blaxland NSW 2774, Australia.
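In code, using the helper above, that's:

dictr = get_geoJson("42 Wallaby Way, Sydney")
print(dictr["formatted_address"])
# 42 Rock Wallaby Way, Blaxland NSW 2774, Australia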
I now had a way to translate a street address into a federal electorate, but
with incomplete per-State data my app wasn't finished. I managed to get
Federal, Queensland, New South Wales, Victoria and Tasmania data fairly easily
(see the links below) and South Australia's data came via personal email after
an enquiry through their contact page. I didn't get any response to several
contact attempts with either Western Australia or the Northern Territory, and
the best I could get for the ACT was their electorate to suburb associations.
I remembered that the Australian Bureau of Statistics has a standard
called the Australian Statistical Geography Standard (ASGS), and the smallest
unit of that is called a Mesh Block:
Mesh Blocks (MBs) are the smallest geographical area defined by the
ABS. They are designed as geographic building blocks rather than as areas
for the release of statistics themselves. All statistical areas in the
ASGS, both ABS and Non ABS Structures, are built up from Mesh Blocks. As a
result the design of Mesh Blocks takes into account many factors including
administrative boundaries such as Cadastre, Suburbs and Localities and
LGAs as well as land uses and dwelling distribution.
(emphasis added)
Mesh Blocks are then aggregated into SA1s:
Statistical Areas Level 1 (SA1s) are designed to maximise the spatial
detail available for Census data. Most SA1s have a population of between
200 to 800 persons with an average population of approximately 400
persons. This is to optimise the balance between spatial detail and the
ability to cross classify Census variables without the resulting counts
becoming too small for use. SA1s aim to separate out areas with different
geographic characteristics within Suburb and Locality boundaries. In rural
areas they often combine related Locality boundaries. SA1s are
aggregations of Mesh Blocks.
(emphasis added)
With this knowledge, and a handy SA1-to-electoral-division map in CSV format
$ head australia-whole/SED_2018_AUST.csv
SA1_MAINCODE_2016,SED_CODE_2018,SED_NAME_2018,STATE_CODE_2016,STATE_NAME_2016,AREA_ALBERS_SQKM
10102100701,10031,Goulburn,1,New South Wales,362.8727
10102100702,10053,Monaro,1,New South Wales,229.7459
10102100703,10053,Monaro,1,New South Wales,2.3910
10102100704,10053,Monaro,1,New South Wales,1.2816
10102100705,10053,Monaro,1,New South Wales,1.1978
....
I went looking into the SA1 information from the ABS shapefile covering the
whole of the country. Transforming the shapefile into KML is done with
ogr2ogr, and also provides us with an XML schema definition.
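The conversion itself is a one-liner; the filenames here are my assumption,
based on the CSV above:

$ ogr2ogr -f KML SED_2018_AUST.kml SED_2018_AUST.shp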
From the CSV header line above we can see that we want the
SA1_MAINCODE_2016 and (for validation) the STATE_NAME_2016 fields. Having
made a per-state list of the SA1s, we go back to the KML and process each
member of the document:
<gml:featureMember>
  <ogr:SED_2018_AUST fid="SED_2018_AUST.0">
    <ogr:geometryProperty>
      <gml:Polygon srsName="EPSG:4283">
        <gml:outerBoundaryIs>
          <gml:LinearRing>
            <gml:coordinates>
....
            </gml:coordinates>
          </gml:LinearRing>
        </gml:outerBoundaryIs>
      </gml:Polygon>
    </ogr:geometryProperty>
    <ogr:SED_CODE18>30028</ogr:SED_CODE18>
    <ogr:SED_NAME18>Gladstone</ogr:SED_NAME18>
    <ogr:AREASQKM18>2799.9552</ogr:AREASQKM18>
  </ogr:SED_2018_AUST>
</gml:featureMember>
The gml:coordinates are what we really need: they're space-separated
lat,long pairs.
for feature in sakml.findAll("gml:featureMember"):
    sa1 = feature.find("ogr:SA1_MAIN16").text
    mb_coord[sa1] = mb_to_points(feature)

# Aggregate each block's coordinates into its electorate
for block in mb_to_sed:
    electorate = mb_to_sed[block]
    sed_to_mb[electorate]["coords"].extend(mb_coord[block])
After which we can write each jurisdiction's dict of localities and lat/long
coordinates out as JSON using json.dump(localitydict, outfile).
To confirm that I had the correct data, I wrote another simple quick-n-dirty
script jsoncheck.py to diff the SA1-acquired JSON against my other
extractions. There was one important difference found: Queensland has a new
electorate, McConnel, which was created after the most recent ABS SA1
allocation.
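The gist of jsoncheck.py (assuming both files map electorate names to their
coordinate lists, as described above) is just a comparison of the key sets:

import json
import sys

def loadf(fname):
    with open(fname) as inf:
        return json.load(inf)

left, right = loadf(sys.argv[1]), loadf(sys.argv[2])
# Report electorates that appear in one extraction but not the other
for name in sorted(set(left) | set(right)):
    if name not in left:
        print("only in {0}: {1}".format(sys.argv[2], name))
    elif name not in right:
        print("only in {0}: {1}".format(sys.argv[1], name))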
So that's the data preparation done, back to the Flask app! The app listens
at the root (/), and presents a simple text form. Hitting enter after typing
in an address routes the POST request to the results function, where we
call out to the Google Maps API, load the relevant state's JSONified
electorate list, and then locate the Federal division. There are 151 Federal
divisions, so it's not necessarily a bad thing to search through each on an
alphabetic basis and break when we get a match; I haven't figured out a way to
(time and space)-efficiently hash the coordinates vs divisions. After
determining the Federal division we then use the same method to check against
the identified state's electorate list.
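Strung together, the results function looks roughly like the sketch below. The
load_electorates and find_electorate names are stand-ins of mine (for loading
the pre-generated JSON and for the even-odd search shown earlier), and the
form field name is an assumption:

from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/results", methods=["POST"])
def results():
    dictr = get_geoJson(request.form["address"])
    latlong = dictr["latlong"]
    # Federal first, then the state or territory we geocoded to
    division = find_electorate(load_electorates("federal"),
                               latlong["lat"], latlong["lng"])
    electorate = find_electorate(load_electorates(dictr["state"]),
                                 latlong["lat"], latlong["lng"])
    return render_template("results.html",
                           address=dictr["formatted_address"],
                           division=division,
                           electorate=electorate)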
The first version of the app just returned the two electorate names, but I
didn't think that was very friendly, so I added another call to the
Google Maps API to retrieve a 400x400 image showing the supplied address on
the map; clicking on that map takes you to the larger Google-hosted map. I
also added links to the Wikipedia entries for the Federal and state
electorates. To render the image's binary data we use b64encode:
from base64 import b64encode
from urllib.parse import quote

keyarg = "&key={gmapkey}"
imgurl = "https://maps.googleapis.com/maps/api/staticmap?size=400x400"
imgurl += "&center={lati},{longi}&scale=1&maptype=roadmap&zoom=13"
imgurl += "&markers=X|{lati},{longi}"
imgurl += keyarg

# Let's provide a Google Maps static picture of the location
# Adapted from
# https://stackoverflow.com/questions/25140826/generate-image-embed-in-flask-with-a-data-uri/25141268#25141268
#
def get_image(latlong):
    """
    latlong -- a dict of the x and y coordinates of the location
    Returns a base64-encoded image
    """
    turl = imgurl.format(longi=latlong["lng"],
                         lati=latlong["lat"],
                         gmapkey=gmapkey)
    res = requests.get(turl)
    return b64encode(res.content)
....
# and in the results function
img_data = get_image(dictr["latlong"])
return render_template("results.html",
                       ...,
                       img_data=format(quote(img_data))
                       ...)
Putting that all together gives us a rendered page that looks like this:
To finalise the project, I ran it through flake8 again (I do this every few
saves), and then git commit followed by git push.
Reference data locations
Tasmania's state parliament has multi-member electorates, which have the same
boundaries as their 5 Federal divisions.
South Australia data was provided via direct personal email.
Australian Capital Territory, Western Australia and Northern Territory data
was extracted from the ABS shapefile after ogr2ogr-converting from
MapInfo Interchange Format.