Spirograph math

Yesterday's post about Mary Wagner's giant spirographs prompted two lines of thought:

  1. Wouldn't it be cool to actually make some of those cogs?
  2. Wouldn't it be cool to prototype them in code?

It didn't take me long to put the first of these on ice. My local maker space has all the necessary kit to convert a sketch on my laptop into something solid but oh my, gears are hard. There seem to be all sorts of ways of getting a pair of gears to grind immovably into one another. I suspect I'd get to explore that space thoroughly before actually getting something to work. I don't have the patience for that. (If you are more patient than me, check out this guide to gear constrution from Make.)

Let's turn to the code then.

There are plenty of spirograph examples out in the wild (the one by Nathan Friend is probably most well known). But if I want to build one of my own, that's going to need some math.

Wikipedia didn't let me down. Apparently the pattern produced by a spirograph is an example of a roulette, the curve that is traced out by a point on a given curve as that curve rolls without slipping on a second fixed curve. In the case of a spirograph we have two roulettes:

  1. A circle rolling without slipping inside a circle (this is a hypotrochoid); and
  2. A circle rolling without slipping outside a circle (this is an epicycloid)

Here's a hypotrochoid:

Source: https://commons.wikimedia.org/wiki/File:HypotrochoidOutThreeFifths.gif

And here's an epicycloid:

Source: https://commons.wikimedia.org/wiki/File:EpitrochoidOn3-generation.gif

The math is straightforward (here and here) so no real barrier to coding this up, right?

Hmm. As usual one line of enquiry opens up several others. Sure, spirographs look nice but they are rather simple, no? What are our options if we want something more complex but still aesthetically satisfying? Wouldn't it be more fun to play with that? I mean the math couldn't be much harder, right?

More on that tomorrow…

Belemnite battlefields

I have been tinkering around with the code in Pearson's "Generative Art" and came up with this and the image above.

The author's intention was to show how complex patterns could spontaneously emerge from the interaction of multiple components that individually exhibit simple behavior. In this case, a couple of dozen invisible disks track across the display; whenever the disks intersect they cause a circle to suddenly appear at the point of intersection. So a little like the bubble chamber that I'd used in physics lab at university. You don't observe the particle directly, you observe the evidence of its passage through a medium. (My code is here if you want to play around with it.)

But what the images really remind me of are the belemnites that Cynthia and I had tried (and failed) to find in the fossil beds of Dorset and Yorkshire.

Belemnites are the fossilized remains of little squid-like creatures. Most of the times you find individuals—a small one looks a little like a shark's tooth—but sometimes you find a mass of them in a single slab, a "belemnite battlefield".

Source: thefossilforum.com

How do these battlefields occur? I found a paper by Doyle and MacDonald from 1993 in which the authors identify five pathways: post-spawning mortality, catastrophic mass mortality, predation concentration, stratigraphical condensation, and resedimentation. I don't know which occurred in the slab shown above but hey, I now know more potential pathways than I did before I read that paper.

And that's why art is fun. One moment you are messing around with RGB codes to get just the right sandy color for the background of a picture, the next you are researching natural phenomena that until an hour ago you never knew even had a name.

How many times am I going to forget this?

I have been using matplotlib on and off for over 15 years; either using it in its own right (the early days) or as some deeply embedded package whenever I use pandas (last couple years).

So why do I always always forget how to get the damn graphics to display?

Here's an example of plotting from the documentation on pandas.DataFrame.hist:

>>> df = pd.DataFrame({
...     'length': [1.5, 0.5, 1.2, 0.9, 3],
...     'width': [0.7, 0.2, 0.15, 0.2, 1.1]},
...     index= ['pig', 'rabbit', 'duck', 'chicken', 'horse'])
>>> hist = df.hist(bins=3)

When you run that, do you see a nice blue histogram describing the dimensions of farmyard animals? Lucky you. I don't. Because I forgot the magic incantation that needs to go after this.

import matplotlib.pylot as plt
plt.show()

And another thing: this shouldn't work. According to their own documentation, getting matplotlib to operate on OSX—which is what I'm attempting—involves a titanic struggle of framework builds against regular builds, incompatible backends, and bizarre specificity regarding my environment manager (my use of conda apparently means that any perceived irregularities in behaviour are my own damn fault).

OK, I'm done venting. Until the next time I forget the incantation and have to re-learn this.

Random colors again

I had speculated in an earlier post that I could get better (pseudo-)random colors if I pulled from color mixing models like Pantone or perceptual color models like NCS or Munsell. I'd love to, but it turns out that they are all copyright. That's a dead end then.

However my instinct that HSB is probably a better model from which to draw than RGB is supported elsewhere. There's an old post on Martin Ankeri's blog that has an interesting discussion on the issue, and he comes out firmly on the HSB side.

So let's think about how I want to vary hue, saturation, and brightness:

  • Hue: I don't really care about the underlying color so I will just pull from a uniform distribution between 0 and 1. (By the way, I'm assuming HSB are all on a 0-1 scale).

  • Saturation: I know I'm going to want some sort of lumpiness but I'm not sure exactly how much and where. Maybe I want a lot of the colors bleached out. Or really saturated. This suggests that I need should pull randomly from a beta distribution and then play around with the distribution's parameters to change where the lump/lumps occur.

  • Brightness: ditto for brightness. I want it non-uniform but tweakable. Another beta distribution.

Code

I'm now using the python version of Processing instead of Java. It may have some issue around importing third-party libraries but this last week of using the Java version has just reminded me how much I don't like that damn language.

The translation of the Java code to Python was simple; the inclusion of the beta distributions is straightforward too as they are already available in Python's random module.

If you want to play with the code it is in my repository here. If you want to mess around with the paramaters in a beta distributions and see the impact on the shape of the distribution, there's a handly online tool here.

Results

I show a couple of examples below. Personally I think they are an improvement on the earlier examples. (And yes, I've gone full Richter. No gutters here.)

(α, β) saturation = (1.2, 0.9); brightness = (0.8, 0.4)
(α, β) saturation = (1.1, 0.9); brightness = (0.9, 0.6)

Obviously, the colors are different every time I run the code. But it's the overall feel of how these colors work together that interests me. And this overall feel is—again, as one would expect—pretty sensitive with respect to the parameters you set in the beta-distributions for saturation and brightness.

"Random" colors? Trickier than I thought

Here was my first attempt at a simple grid of colored squares. And yes, it is more than a little influenced by Richter's 4900 Colours.

That is odd. I am using the RGB color model—typical in computer graphics—in which each element has 256 levels. I randomly select one of the 256 levels for each of the three dimensions, so I have a space of 2563 possible colors from which to draw. That's a big number (16,777,216 to be precise).

So why do so many of those 25 squares look so similar? Who knew there were so many minty greens?

Maybe it is a function of my color model. Let's try Hue-Saturation-Brightness instead.

I guess it is more visually appealing but I seem to have an awful lot of almost-blacks now. In short: just as bad.

Color models

I simplistically thought "OK, so I might be making some false comparisons here. Who says that an HSB model needs to have 256 levels on each of its components? And is a unit difference equally perceptible irrespective of level, i.e. is the color (0, 0, 0) as distinguishable from (0, 0, 1) as the color (255, 255, 255) is from (255, 255, 254)? Perhaps I just need to choose a different number of levels and I'll get 'better' random colors".

Not so much. A quick Google search reveals a huge rabbit hole down which I could disappear. Even simple questions like "How many colors can we perceive?" give answers ranging from 100,000 to 10,000,000. And there are a bunch of perceptual colour models that I had never heard of before. NCS? Munsell? I never got past Goethe. Maybe I should just find a Pantone color list and draw randomly off that.

This might take a while.

Code

By the way, if you want a copy of the Processing script that produced these images, you can find it here.

Run Processing from the command line

I have recently started to play with Processing as I explore generative art.

It's a lovely piece of software but, coming out of the box, it does not really fit my typical workflow. The problem? It's an IDE. I can't stand IDEs. Not only do I have to learn a new language but I have to learn a whole new toolkit for working in it. I much much prefer being able to compose a script in my favorite text editor and then summon that script from the command line.

Fortunately Processing supports this workflow. I found this post at dsfcode.com that does a great job of describing how to set it up. And once I had done so it becomes so easy to get into my usual edit/run/re-edit cycle.

The only thing to watch out for is that the order of parameters in the processing-java command matters. So this command succeeds:

$ processing-java --sketch=`pwd`/waveclock/ \
> --output=`pwd`/../outputs/waveclock/ \
> --force --run
Finished

but this does not:

$ processing-java --sketch=`pwd`/waveclock/ \
> --output=`pwd`/../outputs/waveclock/ \
> --run --force
The output folder already exists. Use --force to remove it.

By the way, the sketch that I am running is a slight modification of case study 4.2 in Pearson's "Generative Art" and its output is shown above. My code is here.

Password management, one last time

TLDR: use 1password.

My hacked solution (here and here) has been kinda sorta OK for the past three years. But some of the kludges were painful. pass works great on OSX but doesn't play well on Windows or at all on Android and iOS. KeepassX works on the latter platforms but keeping the .kdbx file synchronized in Dropbox was problematic (namely, it wouldn't; and I had to come up with some odd and now forgotten hack to force a refresh of the local file). It was all just ugly and it was causing me headaches.

And then I discovered 1password. Oh, it's truly lovely. It syncs effortlessly across almost all the platforms I care about. In version 7, it now checks whether passwords are on a "known compromised" list. I can use markdown in the notes, which is useful because dammit, some sites appear to revel in the complexity of their user security and I need to write myself detailed instructions on how to navigate them. Their family plan allows me to manage shared and private vaults between me and my girlfriend.

"Almost" all the platforms? Yes, almost. I have an elderly iPad that I am resisting throwing away, even though the majority of its apps can no longer be updated because it cannot run a shiny new iOS. (Planned obsolescence bugs the heck out of me.) My solution: don't do anything on it that requires greater security than Netflix. I mean honestly, how many platforms do I really need to access my bank from?

In short, use 1password. It saves a lot of effort and is likely to do the vast majority of what you need it to.

More on pass

The approach to password management that I am trialing (detailed here) is almost perfect. The one thing that was bugging me was that I could not get bash completion to work. And when I am visiting http://this-site-has-a-weirdly-longURL.com that becomes something of a nuisance.

What was the problem? Short version: a missing forward slash in a directory name.

Long version: I had installed the password store in a Dropbox subfolder so that I could access it on multiple machines. That meant that I needed to set the environment variable PASSWORD_STORE_DIR to its location. Consequently I had this line in ~/.bash_profile:

export PASSWORD_STORE_DIR=~/Dropbox/.password-store

This looked like it was working. pass was storing and recalling passwords quite happily; the password store was synchronizing across my machines. So why the heck was bash completion not working?

Next step: try bash completion after I have turned on command and parameter logging. I do this in bash thus:

$ set -x

The effect of this command is

$ help set
(snip)
-x  Print commands and their arguments as they are executed

When I have pass attempt to complete amazon.com after the first two characters, I get this:

$ pass am+ COMPREPLY=()
+ local cur=am
+ local 'commands=init ls find grep show insert generate edit rm mv cp git help version'
+ [[ 1 -gt 1 ]]
+ COMPREPLY+=($(compgen -W "${commands}" -- ${cur}))
++ compgen -W 'init ls find grep show insert generate edit rm mv cp git help version' -- am
+ _pass_complete_entries 1
+ prefix=/Users/robert/Dropbox/.password-store
+ suffix=.gpg
+ autoexpand=1
+ local 'IFS=
'
+ items=($(compgen -f $prefix$cur))
++ compgen -f /Users/robert/Dropbox/.password-stoream
+ local items

That compgen command in the penultimate line does not look correct, does it? It rather looks as if I need to add a terminating / to the value in PASSWORD_STORE_DIR.

So I turn off logging (set +x), append the forward-slash to the directory name and bingo, bash completion is working.

Percentiles in Tableau

The Tableau Support Communities contain several threads on how to calculate percentiles. Here is one that dates back to 2011 and is still going strong. It seems that historically (i.e. pre version 9), the calculation of percentile required all sorts of homegrown calculated fields that use Tableau's LOOKUP() and WINDOW_*() functions and other abstruse and barely documented features of Tableau's inner workings.

Now that we have the PERCENTILE() function and Level-of-Detail calculations, it seems to be a lot simpler. Here is the code that I use to tercile the items on all the orders in Tableau's "superstore" dataset by sales value:

IF [Sales] > {FIXED : PERCENTILE([Sales], 0.667)}
    THEN "H"
ELSEIF [Sales] > {FIXED : PERCENTILE([Sales], 0.333)}
    THEN "M"
ELSE
    "L"
END

Dropping this dimension into a crosstab confirms that (i) each tercile contains the same number of items and (ii) the minimum and maximum of each tercile do not overlap.

tercile minimum sale/$ maximum sale/$ count
H 127.96 22,638.48 3,329
M 25.20 127.95 3,334
L 0.44 25.18 3,331

Isn't there a term missing from the LOD expression?

Yes. All the documentation I have found suggests that the first of my LOD expressions should look like this:

{FIXED [grain] : PERCENTILE([Sales], 0.667)}

Omitting the "grain" qualifier seems to cause the expression to be evaluated at the finest grain possible, namely the individual row within the dataset. In this case, that is just what I want.

Sidebar: Why do I want to tercile anyway?

Splitting a continuous variable into discrete ranges aids communication and non-experts' interpretation of results. But how many discrete ranges should one use? Well, that depends on (i) the question you are trying to answer and (ii) the established practice in that particular discipline. For example, in pharmaceutical sales everything gets split into deciles: the things that a pharma rep does with a decile 10 physician are very different to the things she does with a decile 1 physician.

Personally, I like splitting into an odd number of ranges as it allows some items to be average. That central category contains the peak of the bell-curve and some stuff either side: in many cases I have found that this provides a better mapping of my analysis to the real-world problem that the analysis is attempting to solve. (I suspect that this is the flip-side of the problem in social sciences about whether a Likert scale should contain an odd or even number of terms; see link for discussion.)

Here is more evidence to support the odd-is-better-than-even position: Beyond the median split: Splitting a predictor into 3 parts.