Random colors again

I had speculated in an earlier post that I could get better (pseudo-)random colors if I pulled from color mixing models like Pantone or perceptual color models like NCS or Munsell. I'd love to, but it turns out that they are all copyright. That's a dead end then.

However my instinct that HSB is probably a better model from which to draw than RGB is supported elsewhere. There's an old post on Martin Ankeri's blog that has an interesting discussion on the issue, and he comes out firmly on the HSB side.

So let's think about how I want to vary hue, saturation, and brightness:

  • Hue: I don't really care about the underlying color so I will just pull from a uniform distribution between 0 and 1. (By the way, I'm assuming HSB are all on a 0-1 scale).

  • Saturation: I know I'm going to want some sort of lumpiness but I'm not sure exactly how much and where. Maybe I want a lot of the colors bleached out. Or really saturated. This suggests that I need should pull randomly from a beta distribution and then play around with the distribution's parameters to change where the lump/lumps occur.

  • Brightness: ditto for brightness. I want it non-uniform but tweakable. Another beta distribution.


I'm now using the python version of Processing instead of Java. It may have some issue around importing third-party libraries but this last week of using the Java version has just reminded me how much I don't like that damn language.

The translation of the Java code to Python was simple; the inclusion of the beta distributions is straightforward too as they are already available in Python's random module.

If you want to play with the code it is in my repository here. If you want to mess around with the paramaters in a beta distributions and see the impact on the shape of the distribution, there's a handly online tool here.


I show a couple of examples below. Personally I think they are an improvement on the earlier examples. (And yes, I've gone full Richter. No gutters here.)

(α, β) saturation = (1.2, 0.9); brightness = (0.8, 0.4)
(α, β) saturation = (1.1, 0.9); brightness = (0.9, 0.6)

Obviously, the colors are different every time I run the code. But it's the overall feel of how these colors work together that interests me. And this overall feel is—again, as one would expect—pretty sensitive with respect to the parameters you set in the beta-distributions for saturation and brightness.

Harlequin Romance Titles: Postscript

Does the little script that I described in yesterday's post, which invents random titles for Harlequin romance novels, actually tell us anything about neural networks? No, it does not. You are not going to get any insight into why you should be using a recurrent network architecture rather than any other; you are not going to get any insight into—well—anything really. Honestly, it is not much more than a "Hello World" program that proves you have successfully installed TensorFlow and Keras. It's a piece of fluff, that's all.

It's a fun piece of fluff though. And I'd still like to read "The Wolf of When".

Randomly generated titles for Harlequin romance novels

You can use recurrent neural networks (RNNs) to generate text: feed one the names of death metal bands (say) and it will start creating it's own. Ditto ice-cream flavors, paint colors, Trump tweets, etc.

One popular package that allows the likes of me to play with this technology was developed by a data scientist, Max Woolf. He's got a great how-to page on his blog. Lifehacker also have an extremely simple and helpful guide that relies on Max's work.


What random text should I generate? I have to feed an RNN a few thousand examples before it can start creating its own (or at least creating text that feels realistic). Fortunately I found a list of all of Harlequin's books from 1949 to 2012; about 4400 titles in total. That is large enough to give the RNN a chance to identify the patterns in the titles, small enough that the RNN will not need to run for hours before giving a result.


You can find the full version of this in my git repository here. The short version is:

  • Install python3, tensorflow, and textgenrnn on your computer.
  • Create "training" and "run" scripts in python. The training script reads all the example data and discovers its underlying patterns, the run script generates new examples.
  • Enjoy!


The RNN's "run" step has a temperature parameter that you can dial up or down betwen zero and one. I think this is something of a misnomer. Lifehacker describe it as a "creativity dial"; I prefer to think of it as a weirdness control. Here are some of the titles that the RNN produced, ordered by temperature.

The Man From The Heart
The Baby Bonding
The Wild Sister
The Man In The Bride
The Sun And The Sheikh
The Sheikh's Convenient Wife
The Girl At Saltbush Bay
The Bachelor And The Playboy
The Dream And The Sunrancher
The Touch Of The Single Man
The Billionaire Bride
Hunter's Daughter
The Wolf Of When
A Savage Sanctuary
The Rancher's Forever Drum
Only My Heart Of Hearts
The Sheikh's Daughter
The Sheriff's Mother Bride
Chateau Pland
The Golden Pag
Just Mother And The Candles
Reluctant Paragon
The Unexpected Islands
Expecting the Young Nurse
The Girl In A Whirlwind
In Village Touch (Doctor Season)
Rapture Of The Parka
Portrait Of Works!
"Trave Palagry Surrender, Baby"
Bridegroom On Her Secret

So, which of these would you read?

Percentiles in Tableau

The Tableau Support Communities contain several threads on how to calculate percentiles. Here is one that dates back to 2011 and is still going strong. It seems that historically (i.e. pre version 9), the calculation of percentile required all sorts of homegrown calculated fields that use Tableau's LOOKUP() and WINDOW_*() functions and other abstruse and barely documented features of Tableau's inner workings.

Now that we have the PERCENTILE() function and Level-of-Detail calculations, it seems to be a lot simpler. Here is the code that I use to tercile the items on all the orders in Tableau's "superstore" dataset by sales value:

IF [Sales] > {FIXED : PERCENTILE([Sales], 0.667)}
    THEN "H"
ELSEIF [Sales] > {FIXED : PERCENTILE([Sales], 0.333)}
    THEN "M"

Dropping this dimension into a crosstab confirms that (i) each tercile contains the same number of items and (ii) the minimum and maximum of each tercile do not overlap.

tercile minimum sale/$ maximum sale/$ count
H 127.96 22,638.48 3,329
M 25.20 127.95 3,334
L 0.44 25.18 3,331

Isn't there a term missing from the LOD expression?

Yes. All the documentation I have found suggests that the first of my LOD expressions should look like this:

{FIXED [grain] : PERCENTILE([Sales], 0.667)}

Omitting the "grain" qualifier seems to cause the expression to be evaluated at the finest grain possible, namely the individual row within the dataset. In this case, that is just what I want.

Sidebar: Why do I want to tercile anyway?

Splitting a continuous variable into discrete ranges aids communication and non-experts' interpretation of results. But how many discrete ranges should one use? Well, that depends on (i) the question you are trying to answer and (ii) the established practice in that particular discipline. For example, in pharmaceutical sales everything gets split into deciles: the things that a pharma rep does with a decile 10 physician are very different to the things she does with a decile 1 physician.

Personally, I like splitting into an odd number of ranges as it allows some items to be average. That central category contains the peak of the bell-curve and some stuff either side: in many cases I have found that this provides a better mapping of my analysis to the real-world problem that the analysis is attempting to solve. (I suspect that this is the flip-side of the problem in social sciences about whether a Likert scale should contain an odd or even number of terms; see link for discussion.)

Here is more evidence to support the odd-is-better-than-even position: Beyond the median split: Splitting a predictor into 3 parts.