Harlequin Romance Titles: Postscript

Does the little script that I described in yesterday's post, which invents random titles for Harlequin romance novels, actually tell us anything about neural networks? No, it does not. You are not going to get any insight into why you should be using a recurrent network architecture rather than any other; you are not going to get any insight into—well—anything really. Honestly, it is not much more than a "Hello World" program that proves you have successfully installed TensorFlow and Keras. It's a piece of fluff, that's all.

It's a fun piece of fluff though. And I'd still like to read "The Wolf of When".

Randomly generated titles for Harlequin romance novels

You can use recurrent neural networks (RNNs) to generate text: feed one the names of death metal bands (say) and it will start creating its own. Ditto ice-cream flavors, paint colors, Trump tweets, etc.

One popular package that allows the likes of me to play with this technology was developed by the data scientist Max Woolf. He's got a great how-to page on his blog. Lifehacker also has an extremely simple and helpful guide that relies on Max's work.


What random text should I generate? I have to feed an RNN a few thousand examples before it can start creating its own (or at least creating text that feels realistic). Fortunately I found a list of all of Harlequin's books from 1949 to 2012; about 4400 titles in total. That is large enough to give the RNN a chance to identify the patterns in the titles, small enough that the RNN will not need to run for hours before giving a result.


You can find the full version of this in my git repository here. The short version is:

  • Install python3, tensorflow, and textgenrnn on your computer.
  • Create "training" and "run" scripts in python. The training script reads all the example data and discovers its underlying patterns; the run script generates new examples.
  • Enjoy!


The RNN's "run" step has a temperature parameter that you can dial up or down between zero and one. I think this is something of a misnomer. Lifehacker describes it as a "creativity dial"; I prefer to think of it as a weirdness control. Here are some of the titles that the RNN produced, ordered by temperature.

The Man From The Heart
The Baby Bonding
The Wild Sister
The Man In The Bride
The Sun And The Sheikh
The Sheikh's Convenient Wife
The Girl At Saltbush Bay
The Bachelor And The Playboy
The Dream And The Sunrancher
The Touch Of The Single Man
The Billionaire Bride
Hunter's Daughter
The Wolf Of When
A Savage Sanctuary
The Rancher's Forever Drum
Only My Heart Of Hearts
The Sheikh's Daughter
The Sheriff's Mother Bride
Chateau Pland
The Golden Pag
Just Mother And The Candles
Reluctant Paragon
The Unexpected Islands
Expecting the Young Nurse
The Girl In A Whirlwind
In Village Touch (Doctor Season)
Rapture Of The Parka
Portrait Of Works!
"Trave Palagry Surrender, Baby"
Bridegroom On Her Secret

So, which of these would you read?
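Incidentally, if you are wondering what temperature actually does under the hood: it rescales the network's output scores before sampling the next character. Here is a minimal stdlib sketch of the general idea (textgenrnn's actual implementation differs in its details). A low temperature makes the most likely character almost certain; a temperature near one samples from the model's raw distribution, weirdness and all.

```python
import math
import random

def sample(logits, temperature=1.0):
    """Sample an index from raw scores after a temperature-scaled softmax.

    Low temperature -> sharper distribution (safer, duller output);
    temperature near 1 -> the model's own distribution (weirder output).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return random.choices(range(len(logits)), weights=weights)[0]
```

Dial the temperature down toward zero and the division blows up the gap between the top score and the rest, which is why low-temperature titles all look like "The Sheikh's Convenient Wife" and high-temperature ones look like "Rapture Of The Parka".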

"Random" colors? Trickier than I thought

Here was my first attempt at a simple grid of colored squares. And yes, it is more than a little influenced by Richter's 4900 Colours.

That is odd. I am using the RGB color model—typical in computer graphics—in which each element has 256 levels. I randomly select one of the 256 levels for each of the three dimensions, so I have a space of 256³ possible colors from which to draw. That's a big number (16,777,216 to be precise).

So why do so many of those 25 squares look so similar? Who knew there were so many minty greens?

Maybe it is a function of my color model. Let's try Hue-Saturation-Brightness instead.

I guess it is more visually appealing but I seem to have an awful lot of almost-blacks now. In short: just as bad.
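For the curious, here is roughly what the two sampling schemes look like in plain Python; the stdlib colorsys module converts HSB (which colorsys calls HSV) back to RGB for display. The 16,777,216 figure falls straight out of the first function.

```python
import colorsys
import random

def random_rgb():
    """One of 256**3 == 16,777,216 possible colors."""
    return tuple(random.randrange(256) for _ in range(3))

def random_hsb():
    """Random hue/saturation/brightness, converted back to RGB for display."""
    h, s, b = random.random(), random.random(), random.random()
    return tuple(round(c * 255) for c in colorsys.hsv_to_rgb(h, s, b))
```

Uniform sampling in HSB also explains the almost-blacks: every hue collapses to black as brightness approaches zero, so a sizeable slice of the space is perceptually near-identical.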

Color models

I simplistically thought "OK, so I might be making some false comparisons here. Who says that an HSB model needs to have 256 levels on each of its components? And is a unit difference equally perceptible irrespective of level, i.e. is the color (0, 0, 0) as distinguishable from (0, 0, 1) as the color (255, 255, 255) is from (255, 255, 254)? Perhaps I just need to choose a different number of levels and I'll get 'better' random colors".

Not so much. A quick Google search reveals a huge rabbit hole down which I could disappear. Even simple questions like "How many colors can we perceive?" give answers ranging from 100,000 to 10,000,000. And there are a bunch of perceptual color models that I had never heard of before. NCS? Munsell? I never got past Goethe. Maybe I should just find a Pantone color list and draw randomly off that.

This might take a while.


By the way, if you want a copy of the Processing script that produced these images, you can find it here.

Why am I playing with generative art?

Because of this. And because I'm a philistine.

Source: Gerhard Richter, 4900 Colours, #479

It is usually Richter's big smeary abstracts that grab my attention (like this one); I've only recently come across his series 4900 Colours and 1024 Farben. Unfortunately I have only encountered these online (this site has a great analysis of why I should seek one out in person); I think that is why I indulged myself in the response of the philistine: "Huh, I bet I could do that".

Long story short: even creating a small panel with smooth, well-executed, regular patches of color is beyond my abilities (good enough for my living room wall, though). A more immediate and positive effect is that this project got me into generative art.

How so? Well, before committing acrylic to board I wanted to get an idea of what the end product would look like. So why not prototype it? Sure, I could write a python script to knock something like this out, but I don't need to reinvent the wheel when tools like Processing and NodeBox already exist.

My first attempts with NodeBox were a bust. I liked the idea of playing with a GUI that allows you to build a network of nodes and edges that represent series of actions and the flow of their results, but in practice it was just too clunky and I was too impatient. "Why does this node have so many input connections? Why can't I right-click this edge? How the heck do I reseed the random number generator?".

So Processing it was. Processing comes in two flavors: Java and Python. I have been programming Python for years and I really wanted to like that version. But it is just not as mature and complete as the original Java. I guess it is time for me to brush up my Java.

These are early days in my exploration of generative art. I think I am going to kick off with a Richter simulator. Seems simple, right? I shall post images and code when I can.

Photographs I have failed to take

This is not my photograph.

Source: @helengrantsays

I had intended to write about a fun bit of synchronicity that I recently encountered. I have been reading Austin Kleon's "Show Your Work!". One chapter—"Open Up Your Cabinet of Curiosities"—talks about sharing your sources of inspiration, the items that pique your interest, the oddities that you find especially engaging. Indeed, the decision to bring back this blog from its three-year hiatus was a direct result of reading that chapter.

Then just a week or so later, I'm with my partner in the Scottish National Gallery of Modern Art and we find ourselves opposite a fine example of such a cabinet: the one that was assembled by the surrealist Roland Penrose. It contains artefacts from the sublime to the silly. A cast of Lee Miller's hands. A glass pig.

"Nice", I think. "I should take a photo of that. But my phone's in the bottom of my bag and there's probably too much reflection and anyway am I getting peckish? Yes I think I am, maybe we should get a cup of tea and a bun and anyway I can probably get a perfectly good picture from the museum's website".

Of course, when I get home, there's no such image to be found. It turns out that the museum is surprisingly parsimonious with the images that it shares. I finally track down a photo: it appears that this cabinet has only been photographed and shared twice; one of those photos is on the blog and Instagram feed of Helen Grant. And how about photos of the items in the case? Maybe a photo of the other cast that Penrose commissioned, his foot and Miller's nestled together? No, nothing. Absolutely no trace online.

I hope that I can learn from this. Apparently it's not enough to have the camera with you, you actually need to use it. I could probably apply the same lesson to that almost empty notebook that is weighing down my pocket.

Fingers crossed.

Run Processing from the command line

I have recently started to play with Processing as I explore generative art.

It's a lovely piece of software but, out of the box, it does not really fit my typical workflow. The problem? It's an IDE. I can't stand IDEs. Not only do I have to learn a new language, but I have to learn a whole new toolkit for working in it. I much, much prefer composing a script in my favorite text editor and then summoning it from the command line.

Fortunately Processing supports this workflow. I found this post at dsfcode.com that does a great job of describing how to set it up. And once I had done so, it became easy to get into my usual edit/run/re-edit cycle.

The only thing to watch out for is that the order of parameters in the processing-java command matters. So this command succeeds:

$ processing-java --sketch=`pwd`/waveclock/ \
> --output=`pwd`/../outputs/waveclock/ \
> --force --run

but this does not:

$ processing-java --sketch=`pwd`/waveclock/ \
> --output=`pwd`/../outputs/waveclock/ \
> --run --force
The output folder already exists. Use --force to remove it.

By the way, the sketch that I am running is a slight modification of case study 4.2 in Pearson's "Generative Art" and its output is shown above. My code is here.

Password management, one last time

TL;DR: use 1Password.

My hacked solution (here and here) has been kinda sorta OK for the past three years. But some of the kludges were painful. pass works great on OSX but poorly on Windows, and not at all on Android or iOS. KeepassX works on those platforms, but keeping the .kdbx file synchronized in Dropbox was problematic (it simply wouldn't sync, and I had to come up with some odd and now forgotten hack to force a refresh of the local file). It was all just ugly and it was causing me headaches.

And then I discovered 1Password. Oh, it's truly lovely. It syncs effortlessly across almost all the platforms I care about. As of version 7, it checks whether passwords appear on a "known compromised" list. I can use markdown in the notes, which is useful because, dammit, some sites appear to revel in the complexity of their user security and I need to write myself detailed instructions on how to navigate them. And the family plan allows me to manage shared and private vaults between me and my girlfriend.

"Almost" all the platforms? Yes, almost. I have an elderly iPad that I am resisting throwing away, even though the majority of its apps can no longer be updated because it cannot run a shiny new iOS. (Planned obsolescence bugs the heck out of me.) My solution: don't do anything on it that requires greater security than Netflix. I mean honestly, how many platforms do I really need to access my bank from?

In short, use 1Password. It saves a lot of effort and is likely to do the vast majority of what you need it to.

More on pass

The approach to password management that I am trialing (detailed here) is almost perfect. The one thing that was bugging me was that I could not get bash completion to work. And when I am visiting http://this-site-has-a-weirdly-longURL.com that becomes something of a nuisance.

What was the problem? Short version: a missing forward slash in a directory name.

Long version: I had installed the password store in a Dropbox subfolder so that I could access it on multiple machines. That meant that I needed to set the environment variable PASSWORD_STORE_DIR to its location. Consequently I had this line in ~/.bash_profile:

export PASSWORD_STORE_DIR=~/Dropbox/.password-store

This looked like it was working. pass was storing and recalling passwords quite happily; the password store was synchronizing across my machines. So why the heck was bash completion not working?

Next step: try bash completion after I have turned on command and parameter logging. I do this in bash thus:

$ set -x

The effect of this command is

$ help set
-x  Print commands and their arguments as they are executed

When I have bash attempt to complete amazon.com after I type the first two characters, I get this:

$ pass am
+ COMPREPLY=()
+ local cur=am
+ local 'commands=init ls find grep show insert generate edit rm mv cp git help version'
+ [[ 1 -gt 1 ]]
+ COMPREPLY+=($(compgen -W "${commands}" -- ${cur}))
++ compgen -W 'init ls find grep show insert generate edit rm mv cp git help version' -- am
+ _pass_complete_entries 1
+ prefix=/Users/robert/Dropbox/.password-store
+ suffix=.gpg
+ autoexpand=1
+ local 'IFS=
+ items=($(compgen -f $prefix$cur))
++ compgen -f /Users/robert/Dropbox/.password-stoream
+ local items

That compgen command in the penultimate line does not look correct, does it? It rather looks as if I need to add a terminating / to the value in PASSWORD_STORE_DIR.

So I turn off logging (set +x), append the forward-slash to the directory name and bingo, bash completion is working.
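The failure mode is easy to reproduce: the completion script builds its glob by naive string concatenation, so without the trailing slash the store path and the partial entry name simply fuse. A two-line Python illustration, using the paths from the trace above:

```python
prefix = "/Users/robert/Dropbox/.password-store"   # no trailing slash
cur = "am"

print(prefix + cur)          # .../.password-stoream  -- a path that cannot exist
print(prefix + "/" + cur)    # .../.password-store/am -- what compgen needs
```

Which is why appending the slash to PASSWORD_STORE_DIR is the whole fix.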

Percentiles in Tableau

The Tableau Support Communities contain several threads on how to calculate percentiles. Here is one that dates back to 2011 and is still going strong. It seems that historically (i.e. pre version 9), the calculation of percentile required all sorts of homegrown calculated fields that use Tableau's LOOKUP() and WINDOW_*() functions and other abstruse and barely documented features of Tableau's inner workings.

Now that we have the PERCENTILE() function and Level-of-Detail calculations, it seems to be a lot simpler. Here is the code that I use to tercile the items on all the orders in Tableau's "superstore" dataset by sales value:

IF [Sales] > {FIXED : PERCENTILE([Sales], 0.667)}
    THEN "H"
ELSEIF [Sales] > {FIXED : PERCENTILE([Sales], 0.333)}
    THEN "M"
ELSE "L"
END

Dropping this dimension into a crosstab confirms that (i) each tercile contains the same number of items and (ii) the minimum and maximum of each tercile do not overlap.

tercile   minimum sale/$   maximum sale/$   count
H                 127.96        22,638.48   3,329
M                  25.20           127.95   3,334
L                   0.44            25.18   3,331
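The same tercile logic is easy to sanity-check outside Tableau. Here is a stdlib sketch on some stand-in data: statistics.quantiles with n=3 yields the two cut points, playing the role of the PERCENTILE calls.

```python
import random
import statistics
from collections import Counter

random.seed(0)
sales = [random.lognormvariate(3, 1.2) for _ in range(9_000)]  # stand-in data

lo, hi = statistics.quantiles(sales, n=3)   # 33.3rd and 66.7th percentiles

def tercile(s):
    return "H" if s > hi else "M" if s > lo else "L"

print(Counter(tercile(s) for s in sales))   # three roughly equal buckets
```

As in the crosstab above, each bucket ends up with almost exactly a third of the rows.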

Isn't there a term missing from the LOD expression?

Yes. All the documentation I have found suggests that the first of my LOD expressions should look like this:

{FIXED [grain] : PERCENTILE([Sales], 0.667)}

Omitting the "grain" qualifier seems to cause the expression to be evaluated at the finest grain possible, namely the individual row within the dataset. In this case, that is just what I want.

Sidebar: Why do I want to tercile anyway?

Splitting a continuous variable into discrete ranges aids communication and non-experts' interpretation of results. But how many discrete ranges should one use? Well, that depends on (i) the question you are trying to answer and (ii) the established practice in that particular discipline. For example, in pharmaceutical sales everything gets split into deciles: the things that a pharma rep does with a decile 10 physician are very different to the things she does with a decile 1 physician.

Personally, I like splitting into an odd number of ranges as it allows some items to be average. That central category contains the peak of the bell-curve and some stuff either side: in many cases I have found that this provides a better mapping of my analysis to the real-world problem that the analysis is attempting to solve. (I suspect that this is the flip-side of the problem in social sciences about whether a Likert scale should contain an odd or even number of terms; see link for discussion.)

Here is more evidence to support the odd-is-better-than-even position: Beyond the median split: Splitting a predictor into 3 parts.

Fun with hsxkpasswd

It is all very well coming up with a nice new method of organizing my passwords, but what should those passwords actually be? Randall Munroe addressed this in a well-known xkcd comic, and the approach has been extended by Bart Busschots, who coded it up and made it available here.

It is a lot of fun playing with this password generator (how many words? Padding, yes or no? Separators, trailing and leading digits? What does that do to my entropy?) but it does not feel practical when I want to explore lots of options and then generate lots of passwords. Fortunately Busschots has provided a Perl module, XKPasswd.pm, which you can download and tinker with to your heart's content.

Here are the steps that I took.

1. Download and install hsxkpasswd
It does not look as if anyone has built a homebrew package yet, so you need to pick the module up at the author's website (here) and follow the install instructions. These involve a sudo to cpan, which I am not wildly keen on. I just don't like having to resort to sudo when I'm installing packages. But I am not going to expend the effort to work out how to get round this, so I guess I should just stop grumbling.

[Update: Bart has posted details of how to install without sudo. See comments below.]

2. Configure hsxkpasswd
The documentation is good, with plenty of examples that you can modify. Here is my .hsxkpasswdrc file that creates a preset RH that I can call from the command line.

$ cat ~/.hsxkpasswdrc
{
  "custom_presets" : {
    "RH" : {
      "description" : "Robert default",
      "config" : {
        "word_length_min" : 4,
        "word_length_max" : 8,
        "num_words" : 3,
        "separator_character" : " ",
        "padding_digits_before" : 4,
        "padding_digits_after" : 0,
        "padding_type" : "NONE",
        "padding_character" : "RANDOM",
        "padding_characters_before" : 0,
        "padding_characters_after" : 0,
        "case_transform" : "RANDOM",
        "allow_accents" : 0
      }
    }
  }
}

Simple, no?
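To get a feel for what that preset produces, here is a hypothetical stdlib sketch of the same recipe (three words of 4-8 letters, each upper- or lower-cased at random, space-separated, four leading digits). The real module does considerably more, not least the entropy accounting.

```python
import random

def rh_password(dictionary):
    """Mimic the RH preset: 4 leading digits, then three random-case words."""
    words = [w for w in dictionary if 4 <= len(w) <= 8]
    chosen = [random.choice(words) for _ in range(3)]
    chosen = [w.upper() if random.random() < 0.5 else w.lower() for w in chosen]
    digits = "".join(str(random.randrange(10)) for _ in range(4))
    return " ".join([digits] + chosen)
```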

3. Find a decent dictionary
You don't need to play with the password generator too long before you start noticing repeats (Mexico again? Really?). When you run the module with the --verbose option you see why: it is pulling from a dictionary that only contains about a thousand words. That is more than enough to get good entropies but it also makes for some pretty dull (and hence unmemorable) passwords.

How about a 10,000 word dictionary? Here is one that is extracted from Google's Trillion Word Corpus.

Not enough? (Answer: no.) That same page contains a link to Peter Norvig's 1/3 million most frequent English words. This may be of limited use for text analysis (it is not cleaned and is full of proper nouns, contractions, etc.) but for my purpose is perfect. Just use your favorite text manipulation tool to jettison the word counts and chop it down to a single column of words (I used csvkit), use head to shorten the file to the desired word count (I thought 30,000 was a good number) and you have a dictionary that can replace the default.
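If you would rather not reach for csvkit and head, the whole clean-up is a few lines of Python. A sketch, assuming the word-then-count-per-line format that Norvig's file uses (the filenames are whatever you chose when downloading and saving):

```python
def top_words(lines, n=30_000):
    """Drop the counts, keep the word column, truncate to the top n entries."""
    return [line.split()[0] for line in lines[:n]]

# usage:
# with open("count_1w.txt") as src:            # your downloaded copy
#     words = top_words(src.readlines())
# with open("dictionary.txt", "w") as dst:
#     dst.write("\n".join(words))
```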

4. Generate passwords
The -p option allows me to specify my preset, -d allows me to override the default dictionary and --verbose gives me the entropy.

$ hsxkpasswd -p RH -d NOVIGwords.30000.txt 10 --verbose

Source: Crypt::HSXKPasswd::Dictionary::Basic (loaded from: the file(s) NOVIGwords.30000.txt)
# words: 26821
# words of valid length: 19058 (71%)
Contains Accented Characters: NO
Password length: between 19 & 31
Permutations (brute-force): between 3.77x10^37 & 2.03x10^61 (average 2.77x10^49)
Permutations (given dictionary & config): 5.53x10^17
Entropy (Brute-Force): between 124bits and 203bits (average 164bits)
Entropy (given dictionary & config): 58bits

19k words to choose from? Nice. And that entropy is a good size too (the module flags a warning if the entropy is lower than ~48 or so). So what do the passwords look like?

0517 bolster tyco peugeot
5552 OTAGO external cons
4813 received genetic MARA
5401 dieting prevail ARTERIAL
8730 DAZZLING sixteen GAMES
6815 teaches WEAKEN GUTS
9579 ABOLISH buddhism neff
8707 mortar BUCKET RUINS
9703 solitary LIBRA TRUTHFUL

OK, I like them. And who is ever going to guess a password that contains the name of a dodgy French car?
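As a footnote, the reported entropy is easy to reproduce by hand. With 19,058 usable words, three word slots, four leading digits, and each word independently upper- or lower-cased, the permutation count is 19058³ × 10⁴ × 2³, which lines up with the 5.53×10^17 and 58 bits that the module reports:

```python
import math

words = 19_058                        # "# words of valid length" from the report
perms = words**3 * 10**4 * 2**3       # 3 words x 4 digits x per-word case flip

print(f"{perms:.2e}")                 # ~5.54e+17
print(int(math.log2(perms)))          # 58 bits
```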