Methodologies

Company Mapping

In publicly available workforce data, we see many different company names entered by users that may all refer to the same company (e.g., Bank of America and BofA both refer to the same company). There is also the added issue of understanding subsidiaries (e.g., Sprint is really a subsidiary of SoftBank). Both of these factors are critical to resolve in order to determine the accurate number of people that work for a single company (headcount). In the chart below, we take into account the number of employees associated with “Systems, Inc.”:

[Chart: Company Mapping]

In order to increase the confidence of company mapping, we try to gather as many of the following fields as are available: Company Name (required), Alternative Company Names, Company Headquarters (City, State, and Country), Ticker, ISIN, and LinkedIn URL.
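
As a sketch of how these fields can be combined, the toy matcher below scores a pair of company records: an exact match on a strong identifier (ticker, ISIN, LinkedIn URL) is treated as near-conclusive, while fuzzy name similarity and headquarters agreement supply the rest. The field names, suffix list, and weights are illustrative assumptions, not our production logic.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and strip common corporate suffixes before comparison."""
    name = name.lower().strip()
    for suffix in (" inc", " inc.", " corp", " corp.", " llc", " ltd"):
        if name.endswith(suffix):
            name = name[: -len(suffix)].rstrip(" ,.")
    return name

def match_score(record_a, record_b):
    """Combine evidence across fields into a single match score in [0, 1]."""
    # An exact match on a strong identifier is near-conclusive.
    for key in ("ticker", "isin", "linkedin_url"):
        if record_a.get(key) and record_a.get(key) == record_b.get(key):
            return 1.0

    # Fuzzy similarity over the name and any alternative names.
    names_a = [record_a["name"]] + record_a.get("alt_names", [])
    names_b = [record_b["name"]] + record_b.get("alt_names", [])
    name_sim = max(
        SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        for a in names_a for b in names_b
    )

    # Headquarters agreement adds supporting evidence.
    hq_match = 1.0 if record_a.get("hq") == record_b.get("hq") else 0.0
    return 0.8 * name_sim + 0.2 * hq_match
```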

Sampling Weights

The raw data we collect is not an accurate representation of a company’s workforce. Certain roles have a higher likelihood of being represented than others (e.g., white-collar positions have a higher likelihood than blue-collar positions), and people in certain regions have a higher likelihood of being represented than others (e.g., people in big cities have a higher likelihood of being represented than people in small cities). To generate data that is representative of the true underlying population, we must employ sampling weights to adjust for these biases in both occupation and location.

To generate sampling factors, we first probabilistically classify each job title and location to standard government occupational codes. For a previously unseen title, we infer occupational codes based on position in the learned occupational space. Observations in our dataset are stratified by occupation and country, summed and smoothed to account for potential misclassifications in government codes, and then compared with government level statistics on total workforce. For the United States (using Bureau of Labor Statistics data), detailed occupational breakdowns are available, which allow us to infer the likelihood of representation in our dataset. We then extrapolate to other countries using international statistics (International Labor Organization data) with a similar methodology. The likelihood of representation is currently assumed independent between occupation and country, and the final likelihood is generated by taking the product of the two likelihoods.
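
The final step (taking the product of the two likelihoods and inverting it into a weight) can be sketched as follows; the probability values here are made-up illustrations, not real representation rates.

```python
def representation_likelihood(p_occupation, p_country):
    """Likelihood of appearing in the raw sample, assuming occupation and
    country effects are independent, as described above."""
    return p_occupation * p_country

def sampling_weights(observations):
    """Inverse-probability weights, rescaled to sum to the sample size."""
    raw = [1.0 / representation_likelihood(p_occ, p_ctry)
           for p_occ, p_ctry in observations]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

# Made-up likelihoods: a white-collar worker in a well-covered country vs.
# a blue-collar worker in a poorly covered one.
weights = sampling_weights([(0.8, 0.9), (0.2, 0.3)])
```

The under-represented observation receives a proportionally larger weight, so weighted aggregates recover the true population mix.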

[Chart: Sampling Weights]

Lags in Reporting

People can be slow to update their profiles when they make a transition to a new job, and this lag can be quite severe when there are layoffs at a company. This introduces a systematic underrepresentation of the rate at which people transition in the very recent past, relative to the true rate of transition. To adjust for this effect, we employ sampling methods to provide an unbiased estimate of the flows of employees at every time period. As seen in the chart below, our model is able to correct for this underrepresentation of inflow and outflow in the most recent period.

[Chart: Lags in reporting]

And below is the change in headcount over time.

[Chart: Lags in reporting - Headcount]

Our model is built around a Bayesian Structural Time-Series (BSTS). The model uses the flow deviation relative to all companies, unemployment statistics, seasonality, and the expected lag distribution as features, and infers the distribution of coefficients that best predicts the observed flows. The model is first trained with naive prior distributions on a set of several thousand representative companies; the updated (and weighted) posterior distributions are then applied to make predictions for any company of interest.
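
The BSTS model itself is beyond a short example, but the core intuition of the lag correction can be sketched deterministically: if we know what fraction of transitions is typically reported within k periods, we can scale up recently observed flows accordingly. The lag distribution below is an assumed illustration, not our estimated one.

```python
def correct_for_reporting_lag(observed_flows, lag_cdf):
    """Scale recently observed flows up by the fraction expected to have
    been reported so far.

    observed_flows: transition counts per period, most recent last.
    lag_cdf[k]: probability that a transition is reported within k periods.
    """
    corrected = []
    n = len(observed_flows)
    for i, flow in enumerate(observed_flows):
        age = n - 1 - i  # periods elapsed since this one ended
        reported_fraction = lag_cdf[min(age, len(lag_cdf) - 1)]
        corrected.append(flow / reported_fraction)
    return corrected

# Assumed lag distribution: 40% of moves reported within the current
# period, 70% within one period, 90% within two, effectively all by three.
lag_cdf = [0.4, 0.7, 0.9, 1.0]
corrected = correct_for_reporting_lag([100, 100, 100, 40], lag_cdf)
```

Older periods are left nearly untouched, while the most recent period (where only 40 of an expected ~100 moves have surfaced) is scaled back up.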

Jobs Taxonomy

Our job taxonomy allows us to reduce the tens of millions of unique job titles seen in publicly reported workforce data to a manageable set of occupations. This job taxonomy is a core part of what enables us to compare employees across different companies and industries, even though companies often have very different conventions for what job titles they use.

We start by defining what a job is. A job is, simply, a bundle of activities. If we can represent the activities that people do, we can determine which jobs belong in the same occupation. There are two datasets of activities that we use: the first is the bullet points that people put on resumes or online profiles for each position, and the second is the responsibilities section of online job postings. From this training data, we build a mathematical representation of every job title, independent of seniority-related keywords like “senior” or “principal”. This abstracted representation allows us to then calculate distances between all job titles.
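
As a simplified sketch of this idea, each title below is represented as the mean of toy activity embeddings, with seniority keywords stripped before titles are keyed; distances between titles are cosine distances. The vectors are illustrative stand-ins for learned embeddings.

```python
import math

SENIORITY = {"senior", "principal", "junior", "lead", "staff"}

def strip_seniority(title):
    """Drop seniority-related keywords so titles abstract over level."""
    return " ".join(w for w in title.lower().split() if w not in SENIORITY)

def mean_vector(vectors):
    """A title's representation: the mean of its activity embeddings."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# Toy activity embeddings (in practice these come from resume bullet
# points and job-posting responsibility sections).
activity_vecs = {
    "senior software engineer": [[1.0, 0.1], [0.9, 0.2]],
    "software developer": [[0.95, 0.15]],
    "sales manager": [[0.1, 1.0]],
}
titles = {strip_seniority(t): mean_vector(vs) for t, vs in activity_vecs.items()}
```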

Once we have a mathematical representation for every job, we cluster them into broader occupations. The clustering algorithm starts with many distinct occupations and iteratively combines occupational clusters into broader and broader occupations. That way, users who are interested in a broad view of a company can segment employees into a small number of groups and users who are interested in a specific view of a company can segment employees into a large number of groups.

We then choose a representative title for each cluster to become the name of the occupation. We choose this title based on a combination of title frequency and proximity to the cluster centroid. While the name alone may not be completely representative of each title within the cluster, the name represents the closest title at a given level.
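
A minimal sketch of this selection step, assuming each candidate title carries a frequency and a position in the embedding space; the equal weighting of frequency and centroid proximity is an illustrative choice, not the production weighting.

```python
import math

def pick_representative(titles, centroid):
    """Pick the cluster's name: balance how common a title is against how
    close it sits to the cluster centroid."""
    max_freq = max(t["frequency"] for t in titles)

    def score(t):
        proximity = 1.0 / (1.0 + math.dist(t["vector"], centroid))
        # Equal weighting is an illustrative choice.
        return 0.5 * t["frequency"] / max_freq + 0.5 * proximity

    return max(titles, key=score)["title"]
```

A rare title sitting exactly on the centroid will still lose to a very common title nearby, which keeps cluster names recognizable.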

The plot below shows our entire landscape of job titles, broken into 27 clusters. Each circle represents an abstracted job title within our taxonomy and is sized based on a measure of how commonly seen it is. Circles belonging to the same cluster have the same color. More similar job titles are closer together in this ‘jobscape’ and less similar ones are further apart. Use the search box below to find a job title.

Combining Job Clusters

Using a custom agglomerative clustering algorithm, we build a hierarchy of cluster titles that runs from 1500 independent job categories, at the most granular level, up to 7 job categories, at the highest level. At every level (i.e. to go from 1500 clusters to 7 clusters), we combine clusters of job titles that inhabit a similar area of the occupational landscape, and rename the cluster using the same naming scheme outlined above.
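
A generic version of this bottom-up merging can be sketched as follows; it uses plain centroid distance as the merge criterion, rather than our custom algorithm's criteria.

```python
import math

def agglomerate(clusters, target_k):
    """Repeatedly merge the two closest clusters until target_k remain.

    Each cluster is a list of points; cluster distance is the distance
    between centroids (a simplification of the method described above).
    """
    def centroid(c):
        return [sum(p[i] for p in c) / len(c) for i in range(len(c[0]))]

    clusters = [list(c) for c in clusters]
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters
```

Recording the merge order yields the full hierarchy, so any intermediate number of clusters can be read off the same tree.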

Agglomerative clustering allows us to deliver multiple levels of job title granularities, each representing different cuts of the same taxonomy tree. We can view the tree from the top down (as shown in the visualization below) so that every broad job category can be subdivided into several job types - for example, Engineer can be subdivided into Mechanical Engineer, Software Engineer, Data Engineer, etc.

The plot below represents our job taxonomy in a hierarchical format. Each node on the tree can be thought of as the name of a cluster, which incorporates multiple job titles. As we move up the tree, we decrease the number of clusters and combine job titles from different clusters if necessary. Some of the most popular job titles contained within each cluster can be seen by hovering over that node on the tree (these would be akin to the biggest circles within each colored cluster on the ‘Jobscape’ scatterplot in the Clustering Job Titles section).

Our abstracted representation of job titles has an additional advantage, in that we are not strictly tied to any one hierarchy. We may use the abstract representation to create granular job categories upon client request for uncommon titles that may be too specific for most common uses. This approach also gives us the flexibility to define new job categories, while retaining the rigorous methodology used in our common hierarchical tree.

Uniformity of Clusters

Today’s standard clustering algorithms are extremely poor at deriving uniformly sized groups. In order to derive a set of clusters that could be used as a universal occupational taxonomy, we developed a proprietary clustering algorithm that can overcome this limitation.

Inferring Occupation

To infer occupation names at an individual level, we take into account a person’s listed skills, job descriptions, previous job titles, personal connections, and company/title level aggregated descriptions to help inform a person’s occupational classification. In practice, this allows us to make informed predictions about a person’s occupation in the absence of any specific job title, or where certain pieces of information may be missing. For example, an employee with the title “manager” may be classified as a technical manager based on their prior experience in software development, along with strong connections to other technical staff.

This approach helps us distinguish cases where identical titles represent fundamentally different occupations in different industries. For example, an associate at an investment bank will fundamentally differ from an associate at a law firm. Our model helps predict whether these same titles should be in the same cluster across the two industries: if they are similar in their inputs, they will be; otherwise, they may be matched to different occupations.

In detail, the inference model makes a prediction of a person’s job in the abstracted occupational vector space described above. The model comprises several sub-models, each of which makes an independent prediction about a person’s job title. For instance, one model uses a person’s skills to predict their job title. Another model mean-encodes the person’s written descriptions to predict their job title. Another model reads in the person’s past employment history to help predict what their current job title should be. All of these (and others) are combined and weighted by their predictive quality in order to generate a final predicted position in the abstract occupational vector space, which can then be matched to our hierarchy or used to generate custom job clusters.
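
The final combination step can be sketched as a weighted average in the occupational vector space; the sub-model predictions and quality weights below are assumed purely for illustration.

```python
def ensemble_predict(predictions, weights):
    """Weighted average of several sub-models' predicted positions in the
    occupational vector space; weights reflect measured predictive quality."""
    total = sum(weights)
    dim = len(predictions[0])
    return [sum(w * p[i] for w, p in zip(weights, predictions)) / total
            for i in range(dim)]

# Assumed predictions from a skills model, a description model, and an
# employment-history model, with illustrative quality weights.
combined = ensemble_predict(
    [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2]],
    [0.5, 0.3, 0.2],
)
```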

Uncommon Titles

Although we train mathematical representations of job titles based on their text descriptions, we can assign representations to all titles, even ones that have never appeared in our sample and have no text descriptions. We do this through FastText embeddings, a common and well-documented model that we train on our dataset of job titles. This allows us to learn representations of substrings of individual words, so that even if there is a small misspelling or a new job title phrasing, we can still relate it to job titles for which we have well-trained representations. Combined with the inference method outlined above, this allows us to classify every single position, no matter how uncommon.
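
FastText's subword idea can be illustrated without the full model: if two titles share many character n-grams, a misspelled or novel title still lands near its well-trained neighbors. The Jaccard overlap below is a rough stand-in for similarity in the learned embedding space, not the embedding arithmetic itself.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Subword character n-grams with boundary markers, as in FastText."""
    word = f"<{word}>"
    return {word[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)}

def subword_similarity(a, b):
    """Jaccard overlap of the two titles' character n-gram sets."""
    grams_a, grams_b = set(), set()
    for w in a.lower().split():
        grams_a |= char_ngrams(w)
    for w in b.lower().split():
        grams_b |= char_ngrams(w)
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

A typo like "sofware engneer" still shares most of its n-grams with "software engineer", so it resolves to the right neighborhood.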

Skill Clustering

Clustering Skills

Our skills taxonomy is similar to our Jobs Taxonomy in that it allows us to reduce the tens of millions of unique skills seen in publicly reported workforce data to a manageable set of skills. The taxonomy further enables us to compare employees across different companies, industries, and job titles.

We start by gathering skills seen in public data and determining which different phrasings may represent the same skill (e.g., JavaScript and JS).
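
A minimal sketch of this normalization step, using a hand-written alias table; in practice the mapping is learned from data rather than enumerated by hand.

```python
# A small, illustrative alias table, not the real learned mapping.
SKILL_ALIASES = {
    "js": "javascript",
    "java script": "javascript",
    "ms excel": "microsoft excel",
    "excel": "microsoft excel",
}

def canonical_skill(raw):
    """Map a raw skill string to its canonical form, if one is known."""
    key = raw.strip().lower()
    return SKILL_ALIASES.get(key, key)
```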

Once we have a representation for every skill, we cluster them into broader categories. The clustering algorithm starts with many distinct skills and iteratively combines clusters into broader and broader skill sets. That way, users who are interested in a broad view of a company can segment skills into a small number of groups and users who are interested in a specific view of a company can segment skills into a large number of groups.

We then choose a representative title for each cluster to become the name of the skillset. We choose this title based on a combination of title frequency and the title’s proximity to the cluster centroid. While the name alone may not be completely representative of each title within the cluster, the name represents the closest title at a given level.

The plot below shows our entire landscape of skills, broken into ___ clusters. Each circle represents an abstracted skill title within our taxonomy and is sized based on a measure of how commonly it’s seen. Circles belonging to the same cluster have the same color. More similar skills are closer together in this ‘skillscape’ and less similar ones are further apart. Use the search box below to find a skill.

Combining Skill Clusters

Using a custom agglomerative clustering algorithm, we’ve built a hierarchy of cluster titles that runs from 4000 independent skill categories, at the most granular level, up to 25 skill categories, at the highest level. At every level (i.e. to go from 4000 clusters to 25 clusters), we combine clusters of skills that inhabit a similar area of the skill landscape, and rename the cluster using the same naming scheme outlined in Clustering Skills.

The plot below represents our skill taxonomy in a hierarchical format. Each node on the tree can be thought of as the name of a cluster, which incorporates multiple skills. As we move up the tree, we decrease the number of clusters. Some of the most popular skills contained within each cluster can be seen by hovering over that node on the tree (these would be akin to the biggest circles within each colored cluster on the ‘Skillscape’ scatterplot in the Clustering Skills section).

Gender and Ethnicity

Gender is predicted from first name, based on Social Security Administration data. The predicted gender is represented as the probability of the name being male or female. For example, if 70% of the people named Lauren in the Social Security dataset are female and 30% are male, our model will output a probability of 0.7 that the person is female and 0.3 that the person is male.

To predict ethnicity, we use a Bayesian model that considers a person’s first name, last name, and location (based on the US Census). The model assumes the three variables are conditionally independent given ethnicity, meaning that once ethnicity is fixed, knowing someone’s first name tells us nothing further about their last name or location. The output format is similar to the gender model’s: a probability for each possible ethnicity for that name/location combination.
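
Under that conditional-independence assumption, the posterior is a standard naive-Bayes product over the observed fields. The sketch below uses made-up labels and probabilities purely for illustration, not census figures.

```python
def ethnicity_posterior(priors, likelihoods, observed):
    """Naive-Bayes posterior over ethnicities.

    priors: P(ethnicity).
    likelihoods[field][value][ethnicity]: P(value | ethnicity).
    observed: field -> value (e.g. first name, last name, location).
    Conditional independence given ethnicity lets us multiply the terms.
    """
    scores = {}
    for eth, prior in priors.items():
        p = prior
        for field, value in observed.items():
            # Unseen values contribute no evidence (factor of 1 for all).
            p *= likelihoods.get(field, {}).get(value, {}).get(eth, 1.0)
        scores[eth] = p
    total = sum(scores.values())
    return {eth: p / total for eth, p in scores.items()}

# Made-up priors and likelihoods, purely for illustration.
priors = {"group_a": 0.6, "group_b": 0.4}
likelihoods = {"first_name": {"kim": {"group_a": 0.2, "group_b": 0.8}}}
posterior = ethnicity_posterior(priors, likelihoods, {"first_name": "kim"})
```

When no field is informative, the posterior simply equals the prior, which mirrors the fallback to country-wide distributions described below.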

The ethnicity model is trained using name and geography data from the US Census, and predicts a probability of the person being a particular race from the set {White, Black, Asian Pacific Islander (API), Hispanic or Latino, American Indian or Alaska Native, Two or More Races (Multiple)}. In case of missing data or uncommon names, we default to country-wide ethnic distributions, which represent our prior probability of ethnicity before seeing the person’s name or location. To improve the accuracy of the model outside of the US, we include ethnic distribution data for countries that do not contain a mono-ethnic vast majority (for example, Canada), and adjust the probability on this distribution rather than the US census geography distribution. For countries that contain a mono-ethnic vast majority (for example, Japan), we set the majority ethnicity to have the vast majority of population density, and uniformly distribute the remaining density across the non-majority ethnicities. This allows the model to assign the majority ethnicity to people for whom we do not have strong contradictory information, while still assigning non-majority ethnicities when there is strong evidence to suggest otherwise.

Explore how the models work using the visualization below! Use the search boxes to view gender and ethnicity probabilities for a specific name or name/location combination.

Salary

We predict the salary for each position based on the following covariates: job title, seniority of the position, company, and country. The model architecture is that of an ensemble of regressors, each of which is trained to predict the base salary of the position. We train this model using over 50 million salaries from job postings and publicly available visa applications, and use country level multipliers and inflation rates to adjust for salaries outside of the US and over time. The predicted salary will always be given in USD and will represent the nominal base yearly wage.
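
The country-multiplier and inflation adjustments can be sketched as simple scaling factors applied to a US base prediction; the numbers below are hypothetical, not our actual multipliers or rates.

```python
def adjust_salary(base_usd, country_multiplier=1.0, inflation_rates=()):
    """Adjust a predicted US base salary to another country's pay level and
    carry it forward by compounding annual inflation."""
    salary = base_usd * country_multiplier
    for rate in inflation_rates:
        salary *= 1.0 + rate
    return salary

# Hypothetical: a 120k USD prediction, scaled to a country at 0.6x US pay
# levels, then carried forward through two years of 3% inflation.
adjusted = adjust_salary(120_000, country_multiplier=0.6,
                         inflation_rates=(0.03, 0.03))
```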

Due to the large number of jobs and companies that exist, we use abstract representations of companies and job titles in order to get estimates for any uncommon or unseen company/title pair. The advantage of this approach is that even for companies where we lack salary data, we may infer predicted salaries based on transitions into and out of that company. For instance, if a company hires software engineers directly from Google and Apple, this informs our prediction about the type of salary offers given to these employees upon transition.

While our model does not explicitly use individual features such as gender, ethnicity, education, skills, MSA, and others to predict salary, we still allow the data to be broken down by these groups, and there are often significant levels of associated signal. These are real effects, and they occur due to the high levels of correlation among predictors of salaries. For instance, education may not be an explicit input into the model, but educational background has a significant effect on the types of jobs one is likely to obtain, and thus indirectly affects salary predictions.