September 2019
Here we've applied the same kind of clustering that we used on the relative usage rate time series of Twitter n-grams to baby names. The historical records of US baby names since 1880 come from the Social Security Administration.
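As a rough sketch of what a "relative usage rate" time series could look like for names, the snippet below divides each name's yearly count by the total count recorded that year. This normalization is an assumption on my part, not necessarily the exact definition used for the plots, and `usage_rates` and the toy records are hypothetical.

```python
from collections import defaultdict

def usage_rates(records):
    """Turn (year, name, count) records into per-name relative usage rates.

    Returns (years, rates), where rates[name][i] is the share of all
    recorded counts in years[i] that belong to that name.
    Assumption: rate = name count / total count in that year.
    """
    totals = defaultdict(int)   # year -> total count across all names
    counts = defaultdict(dict)  # name -> {year: count}
    for year, name, count in records:
        totals[year] += count
        # Accumulate so that repeated (year, name) rows are summed.
        counts[name][year] = counts[name].get(year, 0) + count
    years = sorted(totals)
    rates = {
        name: [by_year.get(y, 0) / totals[y] for y in years]
        for name, by_year in counts.items()
    }
    return years, rates

# Made-up toy records for illustration only (not real SSA counts).
toy = [(1880, "John", 60), (1880, "Mary", 40),
       (1881, "John", 50), (1881, "Mary", 150)]
years, rates = usage_rates(toy)
```

A name missing from a given year simply gets a rate of zero for that year, so every series has the same length.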
For the next plot we take a smaller subset, keeping only names that were used at least 603 times in some year (for a total of 2000 names). The colors here are the result of running k-means in the PCA coefficient space, with k = 50. We can see more traditional male names (William, John, Robert) in the lower left-hand corner, which seem to be stable over time. In the upper left, there are some interesting discontinuities for many female names (Jeanette, Sara), where the usage rate drops abruptly before jumping back up. I have no idea why this happens, but it doesn't seem to be an artifact of the data. On the right-hand side, there seem to be more novel names, with lower average usage rates.
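The steps described here (filter names by peak yearly count, project each time series onto principal components, run k-means on the coefficients) can be sketched roughly as follows. This is a minimal NumPy illustration under my own assumptions, not the code behind the plot: the function names, the number of components, and the synthetic series are all hypothetical, and the toy example uses k = 2 instead of 50.

```python
import numpy as np

def filter_by_peak(counts, min_peak):
    """Boolean mask of rows whose largest yearly count is >= min_peak."""
    return counts.max(axis=1) >= min_peak

def pca_coefficients(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)  # center each time step (column)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T  # shape: (n_series, n_components)

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center.
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Peak-count filter on made-up counts (rows = names, columns = years).
mask = filter_by_peak(np.array([[10, 700], [5, 6]]), min_peak=603)

# Synthetic usage-rate series: 10 roughly flat and 10 steadily rising.
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 40)
flat = 0.5 + 0.01 * rng.standard_normal((10, 40))
rising = t + 0.01 * rng.standard_normal((10, 40))
series = np.vstack([flat, rising])

coeffs = pca_coefficients(series, n_components=2)
labels = kmeans(coeffs, k=2)
```

On the synthetic data the flat and rising series end up in different clusters, which is the same kind of grouping the colors in the plot encode. Clustering in the PCA coefficient space rather than on the raw series keeps the distance computation low-dimensional.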