Company Mapping

We match the list of companies requested by the client to our internal company set. The client can provide information in the following fields to increase confidence in the match: company name (required), alternative company names, company headquarters location (city, state, and country), ticker, ISIN, and LinkedIn URL. The more information provided, the more confident we can be that we are matching to the correct internal company.
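The multi-field matching idea above can be sketched as a simple confidence score: every client-supplied field that agrees with a candidate internal record adds to the score, normalized by the fields actually supplied. The field names, weights, and exact-equality comparison below are all illustrative assumptions, not the production matching logic.

```python
# Hypothetical field weights; the real matcher is more sophisticated
# (e.g. fuzzy name comparison rather than exact equality).
FIELD_WEIGHTS = {
    "name": 0.40,        # required
    "alt_names": 0.10,
    "hq_location": 0.15,
    "ticker": 0.15,
    "isin": 0.15,
    "linkedin_url": 0.05,
}

def match_confidence(client_record: dict, candidate: dict) -> float:
    """Score a candidate internal company against a client-provided record.

    Each supplied field that agrees with the candidate adds its weight;
    the result is normalized by the total weight of supplied fields, so
    more corroborating fields yield higher confidence.
    """
    supplied = [f for f in FIELD_WEIGHTS if client_record.get(f)]
    if "name" not in supplied:
        raise ValueError("company name is required")
    agreed = sum(
        FIELD_WEIGHTS[f]
        for f in supplied
        if str(client_record[f]).lower() == str(candidate.get(f, "")).lower()
    )
    total = sum(FIELD_WEIGHTS[f] for f in supplied)
    return agreed / total

client = {"name": "Acme Corp", "ticker": "ACME"}
exact = {"name": "Acme Corp", "ticker": "ACME"}
print(round(match_confidence(client, exact), 2))  # 1.0
```

Supplying only a name caps the achievable evidence, which is why additional fields such as ticker or ISIN raise confidence in the match.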

Sampling Weights

The raw data we analyze is a non-random sample of a company’s population. To generate data that is representative of the true underlying population of interest, we employ sampling techniques based on the representativeness of the different groups we observe. The variables we use to stratify our estimates are occupation and country, which are assumed to be independent factors contributing to the likelihood of observation.

To generate sampling factors, we first probabilistically classify job titles to standard government occupational codes using learned representations of individual titles. For a previously unseen title, we infer occupational codes based on its position in the learned occupational space. Observations in our dataset are stratified by occupation and country, summed and smoothed to account for potential misclassifications in government codes, and then compared with government-level statistics on the total workforce. For the United States, detailed occupational breakdowns are available and allow us to infer the likelihood of representation in our dataset. We then extrapolate to other countries using international statistics and a similar methodology. The likelihood of representation is currently assumed to be independent between occupation and country, so the final likelihood is the product of the two likelihoods.
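The independence assumption above implies a simple form for the sampling weight of each (occupation, country) stratum: the inverse of the product of the two observation likelihoods. The numbers below are made up for illustration; the real pipeline smooths counts and derives likelihoods from official workforce statistics.

```python
# Hypothetical likelihoods of a worker appearing in our sample,
# by occupation and by country (invented values).
p_obs_occupation = {"engineer": 0.8, "driver": 0.2}
p_obs_country = {"US": 0.6, "IN": 0.3}

def sampling_weight(occupation: str, country: str) -> float:
    """Inverse of the joint observation likelihood, assuming the
    occupation and country factors are independent."""
    p = p_obs_occupation[occupation] * p_obs_country[country]
    return 1.0 / p

# A driver in India is under-observed, so each observed driver there
# stands in for many more workers than an observed US engineer does.
print(sampling_weight("engineer", "US"))  # 1 / (0.8 * 0.6)
print(sampling_weight("driver", "IN"))    # 1 / (0.2 * 0.3)
```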

Lags in Reporting

It takes time for the data to update, both because of web-scraping latency and because individuals may delay updating their online professional profiles. This introduces a systematic underrepresentation of transitions in the very recent past, relative to the true rate of transition. To adjust for this effect, we employ sampling methods to provide an unbiased estimate of employee flows at every time period.
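The core of such a correction can be sketched as follows: if only a known fraction of transitions from a given period has been reported by now, scale the observed count up by that fraction. The lag distribution below is invented; the actual model estimates the expected lag distribution jointly with the other features described next.

```python
# Hypothetical cumulative reporting-lag distribution:
# P(transition is reported within k months of occurring).
reported_by_lag = [0.25, 0.60, 0.85, 0.95, 1.00]

def adjusted_flow(observed: float, months_ago: int) -> float:
    """Scale an observed transition count up by the expected
    reporting coverage at this lag."""
    k = min(months_ago, len(reported_by_lag) - 1)
    return observed / reported_by_lag[k]

# 10 transitions observed for the most recent month imply that
# roughly 40 actually occurred, since only 25% have been reported.
print(adjusted_flow(10, 0))  # 40.0
```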

Our model is built around a Bayesian Structural Time-Series (BSTS) model. It uses the flow deviation relative to all companies, unemployment statistics, seasonality, and the expected lag distribution as features, and infers the distribution of coefficients that best predicts the observed flows. The model first trains with naive prior distributions on a set of several thousand representative companies, then applies the updated (and weighted) posterior distributions to make predictions for any company of interest.

Jobs Taxonomy

Our job taxonomy allows us to reduce the tens of millions of unique job titles seen in publicly reported workforce data to a manageable set of occupations. This job taxonomy is a core part of what enables us to compare employees across different companies and industries, even though companies often have very different conventions for what job titles they use.

We start by defining what a job is. A job is, simply, a bundle of activities. If we can represent the activities that people do, we can determine which jobs belong in the same occupation. We use two datasets of activities: the first is the bullet points that people put on resumes or online profiles for each position, and the second is the responsibilities section of online job postings. From this training data, we build a mathematical representation of every job title, independent of seniority-related keywords like “senior” or “principal”. This abstracted representation allows us to calculate distances between all job titles.
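A toy version of this idea: strip seniority keywords from titles, represent each title as a vector, and compare titles by cosine similarity. The bag-of-words "embedding" and seniority list below are stand-ins; the production system learns representations from activity text.

```python
from collections import Counter
import math

# Illustrative seniority keywords to strip before comparison.
SENIORITY = {"senior", "principal", "junior", "staff", "lead"}

def title_vector(title: str) -> Counter:
    """Toy title representation: a bag of non-seniority words."""
    words = [w for w in title.lower().split() if w not in SENIORITY]
    return Counter(words)

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Seniority words do not affect the distance between titles.
print(round(cosine(title_vector("senior software engineer"),
                   title_vector("software engineer")), 6))  # 1.0
```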

Once we have a mathematical representation for every job, we cluster them into broader occupations. The clustering algorithm starts with many distinct occupations and iteratively combines occupational clusters into broader and broader occupations. That way, users who are interested in a broad view of a company can segment employees into a small number of groups and users who are interested in a specific view of a company can segment employees into a large number of groups.

We then choose a representative title for each cluster to become the name of the occupation. We choose this title based on a combination of title frequency and proximity to the cluster centroid. While the name alone may not be completely representative of each title within the cluster, the name represents the closest title at a given level.
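The naming step above can be sketched as a ranking over cluster members that trades off frequency against distance to the centroid. The specific combination below (frequency divided by one plus distance) is an illustrative assumption, not the production scoring formula.

```python
import math

def centroid(vectors):
    """Mean vector of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_name(members):
    """members: list of (title, frequency, embedding_vector) tuples.
    Higher frequency and smaller centroid distance both raise the score."""
    c = centroid([vec for _, _, vec in members])
    def score(member):
        title, freq, vec = member
        return freq / (1.0 + distance(vec, c))
    return max(members, key=score)[0]

# Invented cluster members in a 2-D toy embedding space.
members = [
    ("software engineer", 900, [1.0, 0.0]),
    ("programmer", 100, [0.9, 0.1]),
    ("build engineer", 50, [0.4, 0.8]),
]
print(cluster_name(members))  # prints "software engineer"
```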

The plot below shows our entire landscape of job titles, broken into 27 clusters. Each circle represents an abstracted job title within our taxonomy and is sized based on a measure of how commonly seen it is. Circles belonging to the same cluster have the same color. More similar job titles are closer together in this ‘jobscape’ and less similar ones are further apart.

Combining Job Clusters

Using a custom agglomerative clustering algorithm, we build a hierarchy of cluster titles that runs from 1500 independent job categories, at the most granular level, up to 7 job categories, at the highest level. At every step up the hierarchy (moving from 1500 clusters toward 7), we combine clusters of job titles that inhabit a similar area of the occupational landscape, and rename the combined cluster using the naming scheme outlined above.
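A minimal agglomerative run over a handful of 2-D "title" points illustrates the mechanic: repeatedly merge the closest pair of clusters and record the partition at each level, so the same tree can be cut at any granularity. This sketch uses plain centroid-distance merging; the actual algorithm is custom and, as discussed below, also enforces more uniform cluster sizes.

```python
import math

def _centroid_dist(a, b):
    """Euclidean distance between the centroids of two point lists."""
    ca = (sum(p[0] for p in a) / len(a), sum(p[1] for p in a) / len(a))
    cb = (sum(p[0] for p in b) / len(b), sum(p[1] for p in b) / len(b))
    return math.dist(ca, cb)

def agglomerate(points):
    """Merge the closest pair of clusters until one remains,
    recording the clustering at every level of granularity."""
    clusters = [[p] for p in points]
    levels = {len(clusters): [list(c) for c in clusters]}
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = _centroid_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        levels[len(clusters)] = [list(c) for c in clusters]
    return levels

# Two tight groups of toy title embeddings.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
levels = agglomerate(points)
print(len(levels[2]))  # the 2-cluster cut separates the two groups
```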

Agglomerative clustering allows us to deliver multiple levels of job title granularities, each representing different cuts of the same taxonomy tree. We can view the tree from the top down (as shown in the visualization below) so that every broad job category can be subdivided into several job types - for example, Engineer can be subdivided into Mechanical Engineer, Software Engineer, Data Engineer, etc.

The plot below represents our job taxonomy in a hierarchical format. Each node on the tree can be thought of as the name of a cluster, which incorporates multiple job titles. As we move up the tree, we decrease the number of clusters and combine job titles from different clusters if necessary. Some of the most popular job titles contained within each cluster can be seen by hovering over that node on the tree (these would be akin to the biggest circles within each colored cluster on the ‘Jobscape’ scatterplot in the Clustering Job Titles section).

Our abstracted representation of job titles has an additional advantage: we are not strictly tied to any one hierarchy. Upon client request, we can use the abstract representation to create granular job categories for uncommon titles that are too specific for most common uses. This approach also gives us the flexibility to define new job categories, while retaining the rigorous methodology used in our common hierarchical tree.

Uniformity of Clusters

Today’s standard clustering algorithms are extremely poor at deriving uniformly sized groups. In order to derive a set of clusters that could be used as a universal occupational taxonomy, we developed a proprietary clustering algorithm that can overcome this limitation.

Inferring Occupation

To infer occupations at an individual level, we take into account a person’s listed skills, job descriptions, previous job titles, personal connections, and company/title-level aggregated descriptions. In practice, this allows us to make informed predictions about a person’s occupation in the absence of any specific job title, or where certain pieces of information are missing. For example, an employee with the title “manager” may be classified as a technical manager based on their prior experience in software development, along with strong connections to other technical staff.

This approach helps us distinguish cases where identical titles represent fundamentally different occupations in different industries. For example, an associate at an investment bank differs fundamentally from an associate at a law firm. Our model predicts whether these identical titles should be in the same cluster across the two industries: if their inputs are similar, they will be; otherwise, they may be matched to different occupations.

In detail, the inference model makes a prediction of a person’s job in the abstracted occupational vector space described above. The model comprises several independent models, each of which makes independent predictions about a person’s job title. For instance, one model uses a person’s skills to predict their job title. Another model mean encodes the person’s written descriptions to predict their job title. Another model reads in the person’s past employment history to help predict what their current job title should be. All of these (and others) are combined and weighted by their predictive quality in order to generate a final predicted position in the abstract occupational vector space, which can then be matched to our hierarchy or used to generate custom job clusters.
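The combination step described above can be sketched as a weighted average of each sub-model's predicted position in the occupational vector space. The sub-model names, vectors, and weights below are invented; in practice the weights come from each model's measured predictive quality.

```python
def combine_predictions(predictions):
    """predictions: list of (vector, weight) pairs, one per sub-model.
    Returns the weight-normalized average vector, i.e. the final
    predicted position in the occupational vector space."""
    total = sum(w for _, w in predictions)
    dim = len(predictions[0][0])
    return [
        sum(v[i] * w for v, w in predictions) / total
        for i in range(dim)
    ]

# Invented 2-D predictions from three independent sub-models.
skills_model = ([0.9, 0.1], 0.5)       # skills strongly suggest axis 0
history_model = ([0.8, 0.2], 0.3)      # employment history agrees
description_model = ([0.5, 0.5], 0.2)  # written description is ambiguous

final = combine_predictions([skills_model, history_model, description_model])
print([round(x, 2) for x in final])  # [0.79, 0.21]
```

Because each sub-model predicts independently, a missing input (say, no listed skills) simply drops one term from the average rather than breaking the prediction.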

Uncommon Titles

Although we train mathematical representations of job titles based on their text descriptions, we can assign representations to all titles, even ones that have never appeared in our sample and have no text descriptions. We do this through FastText embeddings, a common and well-documented model that we train on our dataset of job titles. This allows us to learn representations of substrings of individual words, so that even with a small misspelling or a new job title phrasing, we can still relate it to job titles for which we have well-trained representations. Combined with the inference method outlined above, this allows us to classify every single position, no matter how uncommon.
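A toy character-n-gram comparison shows why subword modeling is robust to misspellings: "softwar enginer" shares most of its trigrams with "software engineer". FastText learns vectors for such substrings; this sketch merely counts overlap and is not the embedding model itself.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Character n-grams with word-boundary markers, FastText-style."""
    padded = f"<{text.lower()}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a: str, b: str) -> float:
    """Jaccard overlap between the two titles' n-gram sets."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)

typo = ngram_similarity("software engineer", "softwar enginer")
unrelated = ngram_similarity("software engineer", "registered nurse")
print(typo > unrelated)  # True: the misspelling stays close
```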

Gender and Ethnicity

To predict ethnicity, we use a Bayesian model that takes into account a person’s first name, last name, and location. The model assumes conditional independence among all three variables, meaning that it assumes no correlation between first names, last names, and locations when calculating ethnicity. The model is trained using name and Zip Code Tabulation Area (ZCTA) location data from the 2010 US Census, and predicts the probability of a person belonging to each race in the set {white, black, api, hispanic, multiple, native}. In cases of missing data or uncommon names, we default to country-wide ethnic distributions, which represent our prior belief about possible ethnicity before seeing the person’s name or location.

To improve the accuracy of the model outside of the US, we include ethnic distribution data for countries that do not have a mono-ethnic vast majority, and condition the probability on this distribution rather than the ZCTA distribution. For countries that do have a mono-ethnic vast majority, we assign the vast majority of the probability density to the majority ethnicity and distribute the remaining density uniformly across the non-majority ethnicities. This allows the model to assign the majority ethnicity to people for whom we do not have strong contradictory information, while still assigning non-majority ethnicities when there is strong evidence to suggest otherwise.
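The conditional-independence assumption gives the model a naive-Bayes form: P(race | first, last, location) is proportional to P(race) × P(first | race) × P(last | race) × P(location | race). All probabilities in the sketch below are fabricated for illustration; the real values come from Census name and ZCTA tables.

```python
RACES = ["white", "black", "api", "hispanic", "multiple", "native"]

def posterior(prior, likelihoods):
    """Naive-Bayes posterior over races.

    prior: {race: P(race)}. likelihoods: a list of {race: P(x | race)}
    dicts, one per observed variable (first name, last name, location).
    Missing variables are simply left out, so with no evidence the
    prior (the country-wide distribution) dominates.
    """
    scores = {}
    for r in RACES:
        p = prior.get(r, 0.0)
        for lk in likelihoods:
            p *= lk.get(r, 0.0)
        scores[r] = p
    z = sum(scores.values()) or 1.0
    return {r: s / z for r, s in scores.items()}

# Fabricated prior and one fabricated surname likelihood.
prior = {"white": 0.6, "black": 0.13, "api": 0.06,
         "hispanic": 0.18, "multiple": 0.02, "native": 0.01}
p_last = {"white": 0.02, "black": 0.01, "api": 0.30,
          "hispanic": 0.01, "multiple": 0.02, "native": 0.01}

post = posterior(prior, [p_last])
print(max(post, key=post.get))  # prints "api"
```

Note how a strongly informative surname likelihood overrides the prior, while an empty likelihood list falls back to the prior exactly, mirroring the missing-data behavior described above.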

Gender is predicted from first name based on Social Security Administration data.

Salary Prediction

We predict the salary for each position based on the following covariates: job title, seniority of the position, company, and country. The model architecture is that of an ensemble of regressors, each of which is trained to predict the base salary of the position. We train this model using over 50 million salaries from job postings and publicly available visa applications, and use country level multipliers and inflation rates to adjust for salaries outside of the US and over time. The predicted salary will always be given in USD and will represent the nominal base yearly wage.

Due to the large number of jobs and companies that exist, we use abstract representations of companies and job titles in order to get estimates for any uncommon or unseen company/title pair. The advantage of this approach is that even for companies where we lack salary data, we may infer predicted salaries based on transitions into and out of that company. For instance, if a company hires software engineers directly from Google and Apple, this informs our prediction about the type of salary offers given to these employees upon transition.
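The adjustment pipeline around the base-salary model can be sketched as: predict a nominal US-dollar base salary from the covariates, then apply country-level multipliers and inflation factors. The stand-in predictor, multipliers, and inflation figures below are invented for the example.

```python
# Invented country multipliers and US inflation factors (base year 2020).
COUNTRY_MULTIPLIER = {"US": 1.00, "IN": 0.25, "DE": 0.80}
INFLATION = {2020: 1.00, 2021: 1.047, 2022: 1.128}

def base_salary_usd(title: str, seniority: int) -> float:
    """Stand-in for the ensemble of regressors over job title,
    seniority, company, and country features."""
    base = {"software engineer": 120_000, "analyst": 80_000}[title]
    return base * (1.0 + 0.10 * seniority)  # toy seniority premium

def predicted_salary(title, seniority, country, year):
    """Nominal base yearly wage in USD, adjusted for country and year."""
    return (base_salary_usd(title, seniority)
            * COUNTRY_MULTIPLIER[country]
            * INFLATION[year])

print(round(predicted_salary("software engineer", 2, "DE", 2020)))  # 115200
```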

While our model does not explicitly use individual features such as gender, ethnicity, education, skills, or MSA to predict salary, we still allow the data to be broken down by these groups, and there is often significant associated signal. These are real effects, and they occur due to the high levels of correlation among predictors of salary. For instance, education is not an explicit input to the model, but educational background has a significant effect on the types of jobs one is likely to obtain, and thus indirectly affects salary predictions.