More on pass

The approach to password management that I am trialing (detailed here) is almost perfect. The one thing that was bugging me was that I could not get bash completion to work. And when I am visiting http://this-site-has-a-weirdly-longURL.com that becomes something of a nuisance.

What was the problem? Short version: a missing forward slash in a directory name.

Long version: I had installed the password store in a Dropbox subfolder so that I could access it on multiple machines. That meant that I needed to set the environment variable PASSWORD_STORE_DIR to its location. Consequently I had this line in ~/.bash_profile:

export PASSWORD_STORE_DIR=~/Dropbox/.password-store

This looked like it was working. pass was storing and recalling passwords quite happily; the password store was synchronizing across my machines. So why the heck was bash completion not working?

Next step: try bash completion after I have turned on command and parameter logging. I do this in bash thus:

$ set -x

The effect of this command is

$ help set
(snip)
-x  Print commands and their arguments as they are executed

When I have pass attempt to complete amazon.com after the first two characters, I get this:

$ pass am+ COMPREPLY=()
+ local cur=am
+ local 'commands=init ls find grep show insert generate edit rm mv cp git help version'
+ [[ 1 -gt 1 ]]
+ COMPREPLY+=($(compgen -W "${commands}" -- ${cur}))
++ compgen -W 'init ls find grep show insert generate edit rm mv cp git help version' -- am
+ _pass_complete_entries 1
+ prefix=/Users/robert/Dropbox/.password-store
+ suffix=.gpg
+ autoexpand=1
+ local 'IFS=
'
+ items=($(compgen -f $prefix$cur))
++ compgen -f /Users/robert/Dropbox/.password-stoream
+ local items

That compgen command in the penultimate line does not look correct, does it? It rather looks as if I need to add a terminating / to the value in PASSWORD_STORE_DIR.

So I turn off logging (set +x), append the forward-slash to the directory name and bingo, bash completion is working.
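
The failure is nothing more than string concatenation, which a few lines of bash can reproduce. This is a minimal sketch and not the completion script itself: the variable names prefix and cur are borrowed from the trace above, the store path is the one from my ~/.bash_profile, and the names broken and fixed are made up for illustration.

```shell
# Reproduce the concatenation seen in the trace (illustrative values only).
cur=am

prefix="$HOME/Dropbox/.password-store"     # no trailing slash
broken="$prefix$cur"                       # ends in ".password-stoream"

prefix="$HOME/Dropbox/.password-store/"    # trailing slash appended
fixed="$prefix$cur"                        # ends in ".password-store/am"

echo "$broken"
echo "$fixed"

# The corrected setting, as it would appear in ~/.bash_profile.
export PASSWORD_STORE_DIR="$HOME/Dropbox/.password-store/"
```

With the slash in place, compgen -f is handed a path inside the store rather than a non-existent sibling of it.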

Percentiles in Tableau

The Tableau Support Communities contain several threads on how to calculate percentiles. Here is one that dates back to 2011 and is still going strong. It seems that historically (i.e. pre version 9), the calculation of percentile required all sorts of homegrown calculated fields that use Tableau's LOOKUP() and WINDOW_*() functions and other abstruse and barely documented features of Tableau's inner workings.

Now that we have the PERCENTILE() function and Level-of-Detail calculations, it seems to be a lot simpler. Here is the code that I use to tercile the items on all the orders in Tableau's "superstore" dataset by sales value:

IF [Sales] > {FIXED : PERCENTILE([Sales], 0.667)}
    THEN "H"
ELSEIF [Sales] > {FIXED : PERCENTILE([Sales], 0.333)}
    THEN "M"
ELSE
    "L"
END

Dropping this dimension into a crosstab confirms that (i) each tercile contains almost exactly the same number of items and (ii) the ranges of the three terciles do not overlap.

tercile   minimum sale/$   maximum sale/$   count
H                 127.96        22,638.48   3,329
M                  25.20           127.95   3,334
L                   0.44            25.18   3,331

Isn't there a term missing from the LOD expression?

Yes. All the documentation I have found suggests that the first of my LOD expressions should look like this:

{FIXED [grain] : PERCENTILE([Sales], 0.667)}

Omitting the "grain" qualifier makes the expression table-scoped: the percentile is computed once across every row in the dataset, and each row's [Sales] is then compared with that single value. In this case, that is just what I want.

Sidebar: Why do I want to tercile anyway?

Splitting a continuous variable into discrete ranges aids communication and non-experts' interpretation of results. But how many discrete ranges should one use? Well, that depends on (i) the question you are trying to answer and (ii) the established practice in that particular discipline. For example, in pharmaceutical sales everything gets split into deciles: the things that a pharma rep does with a decile 10 physician are very different to the things she does with a decile 1 physician.

Personally, I like splitting into an odd number of ranges as it allows some items to be average. That central category contains the peak of the bell-curve and some stuff either side: in many cases I have found that this provides a better mapping of my analysis to the real-world problem that the analysis is attempting to solve. (I suspect that this is the flip-side of the problem in social sciences about whether a Likert scale should contain an odd or even number of terms; see link for discussion.)

Here is more evidence to support the odd-is-better-than-even position: Beyond the median split: Splitting a predictor into 3 parts.

Fun with hsxkpasswd

It is all very well coming up with a nice new method of organizing my passwords, but what should those passwords actually be? Randall Munroe addressed this in a well-known xkcd comic, and this approach has been extended by Bart Busschots, who coded it and made it available here.

It is a lot of fun playing with this password generator (how many words? Padding, yes or no? Separators, trailing and leading digits? What does that do to my entropy?) but it does not feel practical when I want to explore lots of options and then generate lots of passwords. Fortunately Busschots has provided a Perl module XKPasswd.pm, which you can download and tinker with to your heart's content.

Here are the steps that I took.

1. Download and install hsxkpasswd
It does not look as if anyone has built a Homebrew package yet, so you need to pick the module up at the author's website (here) and follow the install instructions. These involve a sudo to cpan, which I am not wildly keen about. I just don't like having to resort to sudo when I'm installing packages. But I am not going to expend the effort to work out how to get round this so I guess I should just stop grumbling.

[Update: Bart has posted details of how to install without sudo. See comments below.]

2. Configure hsxkpasswd
The documentation is good, with plenty of examples that you can modify. Here is my .hsxkpasswdrc file that creates a preset RH that I can call from the command line.

$ cat ~/.hsxkpasswdrc
{
  "custom_presets" : {
    "RH" : {
      "description" : "Robert default",
      "config" : {
        "word_length_min" : 4,
        "word_length_max" : 8,
        "num_words" : 3,
        "separator_character" : " ",
        "padding_digits_before" : 4,
        "padding_digits_after" : 0,
        "padding_type" : "NONE",
        "padding_character" : "RANDOM",
        "padding_characters_before" : 0,
        "padding_characters_after" : 0,
        "case_transform" : "RANDOM",
        "allow_accents" : 0
      }
    }
  }
}

Simple, no?

3. Find a decent dictionary
You don't need to play with the password generator too long before you start noticing repeats (Mexico again? Really?). When you run the module with the --verbose option you see why: it is pulling from a dictionary that only contains about a thousand words. That is more than enough to get good entropies but it also makes for some pretty dull (and hence unmemorable) passwords.

How about a 10,000 word dictionary? Here is one that is extracted from Google's Trillion Word Corpus.

Not enough? (Answer: no.) That same page contains a link to Peter Norvig's 1/3 million most frequent English words. This may be of limited use for text analysis (it is not cleaned and is full of proper nouns, contractions, etc.) but for my purpose it is perfect. Just use your favorite text manipulation tool to jettison the word counts and chop it down to a single column of words (I used csvkit), use head to shorten the file to the desired word count (I thought 30,000 was a good number) and you have a dictionary that can replace the default.
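
For anyone without csvkit, the same trimming could be sketched with cut and head. This is not the command I ran: it swaps cut in for csvkit, assumes the frequency list is tab-separated with the word in the first column, and the function name make_dictionary and the input file name in the example are hypothetical.

```shell
# Hypothetical helper: turn a "word<TAB>count" frequency list into a plain
# one-word-per-line dictionary holding only the first N entries.
make_dictionary() {
    # $1 = frequency list, $2 = number of words to keep, $3 = output file
    cut -f1 "$1" | head -n "$2" > "$3"
}

# Example call (the input file name is an assumption):
# make_dictionary count_1w.txt 30000 NOVIGwords.30000.txt
```

Because the list is sorted by frequency, head keeps the most common words and drops the long tail.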

4. Generate passwords
The -p option allows me to specify my preset, -d allows me to override the default dictionary and --verbose gives me the entropy.

$ hsxkpasswd -p RH -d NOVIGwords.30000.txt 10 --verbose

*DICTIONARY*
Source: Crypt::HSXKPasswd::Dictionary::Basic (loaded from: the file(s) NOVIGwords.30000.txt)
# words: 26821
# words of valid length: 19058 (71%)
Contains Accented Characters: NO
<snip>
*PASSWORD STATISTICS*
Password length: between 19 & 31
Permutations (brute-force): between 3.77x10^37 & 2.03x10^61 (average 2.77x10^49)
Permutations (given dictionary & config): 5.53x10^17
Entropy (Brute-Force): between 124bits and 203bits (average 164bits)
Entropy (given dictionary & config): 58bits
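
The "given dictionary & config" figures in this output can be reproduced by hand. The breakdown below is my own reading of the RH preset, not something taken from the module's documentation: three words drawn independently from the 19,058 words of valid length, four random leading digits, and an independent upper-or-lower-case choice for each word. The variable names perms and bits are mine.

```shell
# Assumed breakdown of the search space for the RH preset:
#   19058^3  three words drawn from the words of valid length
#   10^4     four leading digits
#   2^3      each word independently all-upper or all-lower case
perms=$(awk 'BEGIN { printf "%.0f", 19058^3 * 10^4 * 2^3 }')

# Entropy in bits is log2 of the number of permutations, truncated.
bits=$(awk 'BEGIN { printf "%d", log(19058^3 * 10^4 * 2^3) / log(2) }')

echo "permutations: $perms"
echo "entropy: $bits bits"
```

Under those assumptions the arithmetic comes out at roughly 5.5x10^17 permutations and 58 bits, which agrees with what the module reports.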

19k words to choose from? Nice. And that entropy is a good size too (the module flags a warning if the entropy drops below about 48 bits). So what do the passwords look like?

0517 bolster tyco peugeot
5552 OTAGO external cons
4881 DATETIME HEARING replace
4813 received genetic MARA
5401 dieting prevail ARTERIAL
8730 DAZZLING sixteen GAMES
6815 teaches WEAKEN GUTS
9579 ABOLISH buddhism neff
8707 mortar BUCKET RUINS
9703 solitary LIBRA TRUTHFUL

OK, I like them. And who is ever going to guess a password that contains the name of a dodgy French car?

Pattern-matching in vim solves my problem

I am gradually migrating passwords out of my old encrypted passwords.txt and into pass, as detailed here. Just so I can keep track of what has migrated and what has not, I prefix each migrated line in passwords.txt with the text "===". But these migrated lines are scattered all over a 700-line file. How can I gather the lines in one place so that I can quickly scan what I have migrated?

Using vi's pattern-matching rules, that's how. The following command matches all the lines prefixed with "===" and moves them to the bottom of the file.

:g/^===/m$

Why to the bottom (m$) and not the top (m0)? Because this preserves the lines' ordering, so any multi-line entries stay in their original order and are easier to read.
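
The ordering point can be checked outside vim. The sketch below does not run the vim command: it uses grep and sed on a made-up five-line sample to build the same results that :g/^===/m$ and :g/^===/m0 would leave in the buffer. The sample text and the variable names are hypothetical.

```shell
# Hypothetical sample standing in for passwords.txt.
sample=$(printf '%s\n' 'alpha' '=== one' 'beta' '=== two' 'gamma')

# Same result as :g/^===/m$ : unmarked lines first, then the marked
# lines in their original order.
moved_to_bottom=$(
    printf '%s\n' "$sample" | grep -v '^==='
    printf '%s\n' "$sample" | grep '^==='
)

# Same result as :g/^===/m0 : each marked line in turn is put on top of
# the file, so the marked lines end up in reverse order.
moved_to_top=$(
    printf '%s\n' "$sample" | grep '^===' | sed '1!G;h;$!d'
    printf '%s\n' "$sample" | grep -v '^==='
)

echo "$moved_to_bottom"
echo "$moved_to_top"
```

In the first result "=== one" still precedes "=== two"; in the second the two lines have swapped places.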

Password management: at last, a simple solution

If I want to use KeePass on OS X then I need to install Mono, and that galls.

LastPass looks nice and shiny. I tried it for a couple of weeks but I found one too many sites with which it did not play nicely (refusing to populate the fields on my bank's login screen, specifically) so no, that won't do.

I was about to use my old approach of using GPG to encrypt a big ugly text file when I came across pass. Now that looks interesting. It uses GPG, it has useful off-the-shelf functionality (add/edit/delete/copy to clipboard), and it has a flexible structure that allows one to manage additional PINs and security questions. Importantly, I do not decrypt every one of my passwords simultaneously when I just need one of them, which is a less than desirable side effect of my current approach.

I suspect that it will fit in nicely with the command line tools that I use to manage my to-do lists (Taskwarrior) and Engineer's Notebook (jrnl).

I shall install it and report back.