Principal Component Analysis of Baby Name Time Series

September 2019

Here we've applied to baby names the same kind of clustering we used on Twitter relative usage rate time series for n-grams. Historical records of baby names in the US since 1880 are collected from the Social Security Administration.

We take the yearly counts for each name and subset the dataset to include only those names which were used at least 60 times in a year (for a total of 10003 names). Next, we divide by the total number of births recorded per year by the SSA to obtain each name's usage rate over time. To visualize the space of name-usage-rate shapes, we then take the logarithm, perform PCA, and use t-SNE to reduce the name time series, represented by their PCA eigenfunction coefficients, down to 2 dimensions. While t-SNE does not preserve global distances, it is designed to preserve local neighborhoods, so names that are close together after embedding were most likely close before.
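The preprocessing and PCA steps above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the code behind the plots: the function name `usage_rate_pca`, the `pseudocount` (added so that years with zero uses do not produce log(0)), the number of retained components, and the synthetic counts are all hypothetical. The t-SNE step is left out here; the resulting coefficients could be passed to any t-SNE implementation, for example scikit-learn's `TSNE`.

```python
import numpy as np

def usage_rate_pca(counts, births_per_year, min_peak_count=60,
                   pseudocount=0.5, n_components=10):
    """counts: (n_names, n_years) yearly counts; births_per_year: (n_years,)."""
    counts = np.asarray(counts, dtype=float)
    births_per_year = np.asarray(births_per_year, dtype=float)

    # Keep names that reach the count threshold in at least one year.
    keep = counts.max(axis=1) >= min_peak_count
    kept = counts[keep]

    # Usage rate = count / total births that year, then the logarithm.
    # ASSUMPTION: the pseudocount only guards against log(0).
    log_rates = np.log((kept + pseudocount) / births_per_year)

    # PCA via SVD of the mean-centered matrix (one row per name).
    centered = log_rates - log_rates.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, len(s))
    coeffs = u[:, :k] * s[:k]   # each name's coordinates in PCA space
    components = vt[:k]         # principal "shapes" over the years
    return keep, coeffs, components

# Synthetic stand-in for the SSA data (hypothetical numbers).
rng = np.random.default_rng(0)
n_names, n_years = 200, 50
births = rng.integers(1_000_000, 4_000_000, size=n_years)
mean_counts = rng.uniform(1, 200, size=(n_names, 1))
counts = rng.poisson(lam=mean_counts, size=(n_names, n_years))

keep, coeffs, components = usage_rate_pca(counts, births)
print(coeffs.shape)
```

Each row of `coeffs` describes one name as a weighted combination of the rows of `components`, which is the representation the embedding and clustering steps operate on.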

For the next plot we take a smaller subset, only keeping names which were used at least 603 times in any year (for a total of 2000 names). The colors here are the result of running k-means in the PCA coefficient space, with k = 50. Here we can see more traditional male names (William, John, Robert) in the lower left-hand corner, which seem to be stable over time. In the upper left, there are some interesting discontinuous jumps for many female names (Jeanette, Sara), where the usage rate drops abruptly before jumping back up again. I have no idea why this happens, but it doesn't seem to be an artifact of the data. On the right-hand side, there seem to be more novel names, with lower average usage rates.
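For readers unfamiliar with the clustering step, here is a minimal NumPy sketch of k-means (Lloyd's algorithm) applied to rows of a coefficient matrix. It is an illustration only: the `kmeans` helper, the toy data, and the choice of k = 3 are assumptions made for brevity, and in practice a library implementation such as scikit-learn's `KMeans` would be the more usual choice.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Initialize centers at k distinct, randomly chosen data points.
    centers = points[rng.choice(len(points), size=k, replace=False)]

    def assign(centers):
        # Squared Euclidean distance from every point to every center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):          # leave an empty cluster's center as is
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    # Final assignment so labels match the returned centers.
    return assign(centers), centers

# Toy stand-in for PCA coefficients: three shifted Gaussian blobs in 5-D.
rng = np.random.default_rng(1)
offsets = np.repeat(np.array([[0.0], [6.0], [-6.0]]), 100, axis=0)
toy_coeffs = rng.normal(size=(300, 5)) + offsets

labels, centers = kmeans(toy_coeffs, k=3)
print(np.bincount(labels, minlength=3))
```

Because the clustering runs in the PCA coefficient space rather than in the 2-D embedding, the colors are independent of how any particular t-SNE run arranges the points.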

It is important to remember that t-SNE is an inherently stochastic process, so the axes and absolute positions of the names in the embedded space have no intrinsic meaning, and they change each time the embedding is recomputed. Only the relative proximity of names is informative.
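The run-to-run variation can be seen directly in a small experiment. The sketch below assumes scikit-learn's `TSNE` (the post does not say which implementation was used) and random toy data in place of the real PCA coefficients; the `embed` helper and all parameter values are illustrative. Fixing `random_state` is one way to make a particular layout reproducible, though it does not make the axes any more meaningful.

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-in for PCA coefficients: 60 points in 10 dimensions.
rng = np.random.default_rng(0)
toy_coeffs = rng.normal(size=(60, 10))

def embed(seed):
    # method="exact" is practical for a small toy dataset like this one.
    tsne = TSNE(n_components=2, perplexity=10, init="random",
                method="exact", random_state=seed)
    return tsne.fit_transform(toy_coeffs)

same_seed_a = embed(0)
same_seed_b = embed(0)
other_seed = embed(1)

# Compare coordinates between runs with equal and different seeds.
print(np.allclose(same_seed_a, same_seed_b))
print(np.allclose(same_seed_a, other_seed))
```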