Company Mapping

In publicly available profile data, we see many different company names that all refer to the same company (e.g. Bank of America and BofA both refer to the same company). There’s also the added issue of understanding subsidiaries (e.g. Sprint is really a subsidiary of SoftBank). Both of these factors are critical to resolve in order to determine the accurate number of people that work for a single company (headcount).

In the chart below, we are taking into account the number of employees that are associated with “Systems, Inc.”:

[Chart: Company Mapping]

To increase confidence in company mapping, we gather as many of the following fields as are available: Company Name, Alternative Company Names, Company Headquarters (City, State, and Country), Ticker, ISIN, LinkedIn URL, etc.
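As a rough illustration of the two problems described above (aliases and subsidiaries), the sketch below resolves a raw company string to a canonical company and optionally rolls it up to its parent. The `ALIASES` and `PARENTS` tables and the function names are hypothetical and hand-written for this example; a real mapping would draw on the fields listed above rather than on name matching alone.

```python
import re
from typing import Optional

# Hypothetical alias table: normalized name -> canonical company.
ALIASES = {
    "bank of america": "Bank of America",
    "bofa": "Bank of America",
    "sprint": "Sprint",
    "softbank": "SoftBank",
}
# Hypothetical subsidiary -> parent links.
PARENTS = {"Sprint": "SoftBank"}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


def resolve(name: str, roll_up: bool = True) -> Optional[str]:
    """Return the canonical company, or its ultimate parent when roll_up is True."""
    company = ALIASES.get(normalize(name))
    if company is None:
        return None
    seen = set()  # guards against accidental cycles in PARENTS
    while roll_up and company in PARENTS and company not in seen:
        seen.add(company)
        company = PARENTS[company]
    return company
```

With this table, "BofA" and "Bank of America" count toward the same headcount, and Sprint employees can be counted under SoftBank or separately depending on `roll_up`.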

Sampling Weights

Raw online professional data does not accurately represent a company’s workforce. Certain roles have a higher or lower likelihood of being represented (e.g. white collar positions have a higher likelihood of being represented than blue collar positions) and people in certain regions have a higher or lower likelihood of being represented (e.g. people in big cities have a higher likelihood of being represented than people in small cities). To generate data that is representative of the true population, we employ sampling weights to adjust for occupation and location bias.

For example, if an engineer in the US has a 90% chance of having an online profile, we would consider every engineer in the US to represent about 1.1 people (1 / 0.9). If a nurse in Germany has a 25% chance of having an online profile, we would consider every nurse in Germany to represent 4 people (1 / 0.25).

To generate sampling weights, we first probabilistically map each job title and location to standard government occupational codes. For a previously unseen title, we infer occupational codes based on position data. Observations in our dataset are stratified by occupation and country, summed and smoothed to account for potential misclassifications in government codes, and then compared with government-level statistics on the total workforce. Detailed occupational breakdowns are available for the United States (using Bureau of Labor Statistics data), allowing us to infer the likelihood of representation in our dataset. We then extrapolate to other countries using international statistics (using International Labour Organization data) with a similar methodology. The likelihood of representation is currently assumed independent between occupation and country, and the final likelihood of representation is generated by taking the product of the occupation and country likelihoods.
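The arithmetic behind the weights can be sketched as follows: the weight is the reciprocal of the likelihood of representation, and that likelihood is the product of an occupation term and a country term. The function names and the split of 25% into 0.5 × 0.5 in the usage note are illustrative assumptions, not published figures.

```python
def sampling_weight(p_occupation: float, p_country: float) -> float:
    """Number of people that one observed profile stands for."""
    # Independence assumption from the text: multiply the two likelihoods.
    p = p_occupation * p_country
    if not 0 < p <= 1:
        raise ValueError("likelihood of representation must be in (0, 1]")
    return 1.0 / p


def weighted_headcount(profiles) -> float:
    """Sum the weights of observed profiles, given (p_occupation, p_country) pairs."""
    return sum(sampling_weight(p_occ, p_cty) for p_occ, p_cty in profiles)
```

For instance, `sampling_weight(0.5, 0.5)` gives 4.0, matching the nurse example, and two such profiles give a weighted headcount of 8.0.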

Lags in Reporting

People can be slow to update their profiles when they transition to a new job, especially if they were laid off. To adjust for this lag, we employ a nowcasting model that provides an estimate of the inflows and outflows of employees in recent months. As shown in the chart below, our model is able to correct for this underrepresentation in the most recent period.

[Chart: Lags in reporting]

The chart below shows the change in headcount over time.

[Chart: Lags in reporting - Headcount]

Our model predicts the inflows and outflows that will be revealed once all public profiles have been updated. It does this by taking snapshots of companies’ reported workforces at various periods of time and comparing them to future snapshots, factoring unreported flows from the past into its predictions of current unreported flows. These predictions also draw from unemployment statistics and seasonal trends, among other factors.
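One simple way to picture the snapshot comparison is a completeness ratio: for each reporting lag, how much of the eventually revealed flow was already visible. The sketch below is a deliberately reduced stand-in for the nowcasting model; it ignores unemployment statistics and seasonal trends, and the function names and input layout are assumptions made for illustration.

```python
def completeness_by_lag(history):
    """Estimate the share of flows already reported at each lag.

    history: iterable of (lag_months, reported_at_lag, reported_eventually),
    taken from past periods whose profiles have since been fully updated.
    """
    totals = {}
    for lag, early, final in history:
        e, f = totals.get(lag, (0, 0))
        totals[lag] = (e + early, f + final)
    return {lag: e / f for lag, (e, f) in totals.items() if f > 0}


def nowcast(observed: float, lag: int, completeness: dict) -> float:
    """Scale a currently observed flow up to its expected final value."""
    return observed / completeness[lag]
```

If past snapshots showed only half of the eventual outflows one month after the fact, an observed outflow of 30 in the latest month would be nowcast as 60.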

Jobs Taxonomy

Our job taxonomy allows us to reduce the tens of millions of unique job titles seen in public workforce data to a manageable set of occupations. Using this occupation set, we can compare employees across companies and industries that may have different job naming conventions.

We start by identifying the activities that are associated with each job. We draw from two activity datasets: the first contains job descriptions from resumes and online profiles, and the second contains the “responsibilities” sections of online job postings. From this training data, we build a mathematical representation of every job title, independent of seniority-related keywords like “senior” or “principal”. These representations allow us to then calculate distances between all job titles, with jobs involving more similar activities being closer to each other.
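To make the idea of a title representation concrete, the sketch below uses a plain bag-of-words vector over activity text and cosine distance between vectors. This is an assumption chosen for brevity; the text does not say which representation or distance is actually used, and `SENIORITY_WORDS` contains only the two keywords named above.

```python
import math
from collections import Counter

SENIORITY_WORDS = {"senior", "principal"}  # the two examples named in the text


def strip_seniority(title: str) -> str:
    """Drop seniority keywords so the representation is seniority-independent."""
    return " ".join(w for w in title.lower().split() if w not in SENIORITY_WORDS)


def activity_vector(descriptions) -> Counter:
    """Count the words across all activity descriptions for one title."""
    return Counter(w for d in descriptions for w in d.lower().split())


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 0 for identical activity profiles and 1 for no shared words."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0
```

Titles whose descriptions share many activity words end up at a small distance, while titles with no activities in common sit at distance 1.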

After we map the distances between job titles, we group them into broader occupations. Our clustering algorithm starts with many distinct occupations and iteratively combines them into clusters representing broader and broader occupational groups. As a result, users can segment employees of interest into a small number of broad occupational groups or a large number of specific groups.
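The iterative merging described above resembles agglomerative clustering. The sketch below is a minimal centroid-based version, assuming job titles are already points in a vector space; it is not the production algorithm, and the function name and merge rule are illustrative.

```python
import math


def agglomerate(points, target):
    """Merge the two clusters with the closest centroids until `target` remain."""
    clusters = [[p] for p in points]

    def center(cluster):
        return [sum(coords) / len(cluster) for coords in zip(*cluster)]

    while len(clusters) > target:
        # Find the pair of clusters whose centroids are nearest to each other.
        _, i, j = min(
            (math.dist(center(clusters[i]), center(clusters[j])), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Calling the function with a large `target` yields many specific groups, and calling it with a small `target` yields a few broad ones, which mirrors the choice between granular and general occupational groups.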

We then choose a representative title for each occupational cluster. We choose this title based on its frequency in the cluster and its closeness to the cluster’s center. While the name alone may not be completely representative of each title within the cluster, the name represents the closest title at a given level.
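A possible way to combine the two criteria is sketched below. The text does not specify how frequency and closeness are traded off, so the scoring formula (frequency share divided by one plus the distance to the count-weighted center) is purely an assumption for illustration, as are the function names.

```python
import math


def centroid(vectors, weights):
    """Weighted mean of the title vectors in a cluster."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total for i in range(dim)]


def representative_title(titles):
    """Pick a cluster name from a dict of title -> (count, vector)."""
    names = list(titles)
    counts = [titles[n][0] for n in names]
    center = centroid([titles[n][1] for n in names], counts)
    total = sum(counts)

    def score(name):
        count, vec = titles[name]
        # Hypothetical trade-off: frequent titles near the center score highest.
        return (count / total) / (1.0 + math.dist(vec, center))

    return max(names, key=score)
```

In a cluster where "software engineer" is both the most common title and close to the center, it is chosen over a rare outlier title.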

The plot below shows our entire landscape of job titles, broken into 27 color-coded clusters. Each circle represents a job title within our taxonomy, with more common titles having larger circles, and more similar jobs being closer together. Use the search box below to find a job title.

Combining Job Clusters

Using a clustering algorithm, we build a hierarchy of cluster titles that distills 1500 distinct job categories down to 7 general job categories. To move from more distinct groups to more general ones, we combine clusters of job titles that inhabit a similar area of the occupational landscape and rename the cluster using the same naming scheme outlined above.

The plot below represents our job taxonomy in a hierarchical format. Each node on the tree can be thought of as the name of a cluster, which incorporates multiple job titles. As we move up the tree, we decrease the number of clusters and combine job titles from different clusters, if necessary. Some of the most popular job titles contained within each cluster can be seen by hovering over that cluster’s node.

Our representation of job titles has an additional advantage: We are not strictly tied to any one hierarchy. We can use the representation to create granular job categories for specific titles upon request. Or we can have the flexibility to define new job categories, while retaining the rigorous methodology used in our common hierarchical tree.

Uniformity of Clusters

Many standard clustering algorithms struggle to create groups of the same size. We developed a proprietary clustering algorithm that overcomes this limitation. For example, as our algorithm broadens from a set of 300 jobs to a set of 50 more general groups, it ensures that each group from the second set has a roughly equal number of jobs from the first set (6 in this case). That way, we know that each group has a similar level of generality.

Inferring Occupation

To infer occupation names at an individual level, we take into account a person’s listed skills, job descriptions, previous job titles, personal connections, and company/title from their online job profiles. In practice, this allows us to make informed predictions about a person’s occupation in cases where certain pieces of information, including specific job titles, may be missing. For example, an employee with the title “manager” may be classified as a technical manager based on their prior experience in software development, along with strong connections to other technical staff.

Similarly, this approach helps us distinguish between different occupations under the same title by industry. For example, an associate at an investment bank will fundamentally differ from an associate at a law firm.

Further, the model comprises several algorithms, each of which makes independent predictions about a person’s job title. One model uses a person’s skills to predict their job title; another analyzes the person’s descriptions of their job. These sets of predictions (and others) are weighted by their predictive quality and combined to predict the job’s position in the landscape of occupations, which can then be matched to our hierarchy, or used to create custom job clusters.
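The combination step can be pictured as a weighted average of predicted positions, followed by a nearest-cluster lookup. The sketch below assumes each model outputs a vector in the occupational landscape together with a quality weight; the weights, vectors, and function names are hypothetical.

```python
import math


def combine(predictions):
    """Weighted mean of predicted positions, given (weight, vector) pairs."""
    total = sum(w for w, _ in predictions)
    dim = len(predictions[0][1])
    return [sum(w * v[i] for w, v in predictions) / total for i in range(dim)]


def nearest_cluster(position, centers):
    """Match a position to the closest cluster center, given name -> vector."""
    return min(centers, key=lambda name: math.dist(position, centers[name]))
```

For example, if a skills-based model with weight 3 points toward technical management and a description-based model with weight 1 points elsewhere, the combined position lands closer to the technical cluster.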

Uncommon Titles

Although we develop mathematical representations of job titles based on their text descriptions, we can assign representations to all titles, even if they have never shown up before in our sample and do not have a description. We do this through fastText embeddings, which build representations from character-level subword information so that similar words and phrases end up close together. This allows us to recognize titles that are misspelled or phrased differently as part of our existing list of job titles. Combined with the inference method outlined above, this also allows us to classify every unique position, no matter how uncommon it is.
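The reason misspelled titles can still be matched is that most of their character n-grams survive the typo. The sketch below is not fastText itself; it is a simplified stand-in that compares sets of character trigrams with Jaccard similarity, and the function names are made up for this example.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Character n-grams of the text, with boundary markers at both ends."""
    padded = f"<{text.lower()}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard overlap between the n-gram sets of two titles."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)


def closest_known_title(title: str, known) -> str:
    """Return the known title that shares the most n-grams with the input."""
    return max(known, key=lambda k: similarity(title, k))
```

A misspelling such as "sofware enginer" still shares most of its trigrams with "software engineer", so it resolves to that title rather than to an unrelated one.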

Skill Clustering

Clustering Skills

Our skills taxonomy is similar to our Jobs Taxonomy, as it creates a universal skill language to easily compare employees across different jobs, companies, and industries.

We group together the different ways that people describe the same skill to get a set of unique skills. Then, we cluster them into broader categories. The clustering algorithm starts with many distinct skills and iteratively combines clusters into broader and broader skill sets. Users can view broad or specific skill groups, depending on their needs.

We then choose a representative name for each skill set cluster. We choose this title based on how often it appears in the cluster and its closeness to the center of the cluster.

The plot below shows our entire landscape of skills grouped into clusters. Each circle represents a standardized skill title within our taxonomy; the bigger the circle is, the more common the skill is among workers. Circles belonging to the same cluster have the same color, and more similar skills are closer together in this ‘skillscape.’ Use the search box below to find a skill.

Combining Skill Clusters

Using a clustering algorithm, we’ve built a hierarchy of cluster titles that runs from 25 general skill categories to 4000 specific skill categories. At every level of specificity, we combine clusters of skills that inhabit a similar area of the skill landscape, and rename the cluster using the same naming scheme outlined in Clustering Skills.

The plot below demonstrates our skill taxonomy’s hierarchical format. Each node on the tree can be thought of as the name of a cluster, which comprises multiple skills. As we move up the tree, we decrease the number of clusters. Some of the most popular skills contained within each cluster can be seen by hovering over that cluster’s node on the tree. General clusters near the top of the tree are akin to the biggest circles within each colored cluster on the ‘Skillscape’ scatterplot in the Clustering Skills section.

Gender and Ethnicity

We predict an individual’s gender using their first name by estimating the probabilities that the name is male or female. The model is informed by Social Security Administration data. For example, if 70% of people named Lauren are female and 30% are male, our model will output a 0.7 probability that the person is female and a 0.3 probability that the person is male.
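The Lauren example amounts to normalizing name counts into probabilities. The sketch below assumes a table of counts by first name and gender is available; the function name, the dictionary layout, and the 50/50 fallback for names with no data are assumptions made for illustration.

```python
def gender_probabilities(counts: dict) -> dict:
    """Turn {"female": n, "male": m} counts for a first name into probabilities."""
    total = counts.get("female", 0) + counts.get("male", 0)
    if total == 0:
        # Hypothetical fallback when a name has no recorded counts.
        return {"female": 0.5, "male": 0.5}
    return {
        "female": counts.get("female", 0) / total,
        "male": counts.get("male", 0) / total,
    }
```

With 70 female and 30 male records for a name, the function returns a probability of 0.7 for female and 0.3 for male.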

Similarly, we predict an individual’s ethnicity using first name, last name, and location. Drawing on US Census data, the model estimates the probability that a given individual belongs to each ethnic group in the set {White, Black, API (Asian and Pacific Islander), Hispanic, Multiple (Two or More Ethnicities), Native}.

In cases of missing data or uncommon names, we default to countrywide ethnic distributions. To improve the accuracy of the model outside of the US, we use ethnic distribution data for countries whenever available, especially for ethnically non-homogeneous countries like Canada and Australia. For ethnically homogeneous countries like Japan or Ghana, we set the share of the majority ethnic group in the population to 99% and split the remainder evenly among the non-majority groups.
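The default distribution for an ethnically homogeneous country follows directly from the rule above: 99% for the majority group and the remaining 1% split evenly across the other five groups. The sketch below uses the group labels from the text; the function name is hypothetical.

```python
GROUPS = ["White", "Black", "API", "Hispanic", "Multiple", "Native"]


def homogeneous_prior(majority: str, majority_share: float = 0.99) -> dict:
    """Default ethnicity distribution for an ethnically homogeneous country."""
    if majority not in GROUPS:
        raise ValueError(f"unknown group: {majority}")
    # Split what is left evenly among the non-majority groups.
    rest = (1.0 - majority_share) / (len(GROUPS) - 1)
    return {g: (majority_share if g == majority else rest) for g in GROUPS}
```

Each non-majority group therefore receives 0.2% of the probability mass, and the six shares sum to 1.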

Use the search boxes to view gender and ethnicity probabilities for a specific name or name/location combination.

Salary

Our new salary model predicts the salary for each position from job title, company, location, years of experience, and seniority, as well as the year observed. The model is trained on salaries from publicly available visa application data, self-reported data, and job postings. This new approach improves the precision of the model’s predictions and allows for more nuanced predictions (by recognizing differences across specific locations in the US, for example).

To account for differences in salaries across countries, the salary model incorporates country-level multipliers. The model also accounts for changes in salary over time using inflation rates. When salaries are unavailable for a certain company, we infer them from the transitions into and out of that company. For example, if we have salary data on software engineers from other tech companies but not from Apple, we predict the salaries of Apple engineers based on the salaries of closely related tech companies. While our model does not explicitly use individual features such as gender, ethnicity, education, and skills to predict salaries, we still allow the data to be broken down by these factors, as there are often significant differences between these groups. For instance, education may not be factored explicitly into the model, but education background affects the types of jobs one is likely to obtain. As a result, our salary data indirectly captures pay gaps between genders and ethnicities.
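The country and time adjustments can be sketched as two multiplicative factors applied to a base prediction. The text does not give the functional form, so the compounding of a constant annual inflation rate, the parameter names, and the example figures are all assumptions.

```python
def adjust_salary(
    base_salary: float,
    country_multiplier: float,
    annual_inflation: float,
    years: int,
) -> float:
    """Scale a base salary by a country multiplier and compound inflation."""
    # Hypothetical form: constant annual inflation compounded over `years`.
    return base_salary * country_multiplier * (1.0 + annual_inflation) ** years
```

Under these assumptions, a base salary of 100,000 with a country multiplier of 0.5 and no inflation becomes 50,000, and two years of 2% inflation with a multiplier of 1.0 gives 104,040.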

Seniority

The seniority metric is created using an ensemble model. First, information about an individual’s current job, including their title, company, and industry, is used to generate an initial seniority score. Second, details about an individual’s job history, such as the duration of their previous employment and the seniority of previous positions, are taken into account to create a second seniority score. Finally, an individual’s age is used to generate a third seniority score. The scores from these models are averaged together to arrive at a continuous seniority metric for an individual.

To convert from this continuous seniority metric into an ordinal value, we gather samples of seniority predictions corresponding to recognizable keywords such as “junior”, “senior”, “director”, etc. and map the metric to the most likely bin. This allows us to attach meaning to the raw metric values, and to bin seniorities into discrete buckets.
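The averaging and binning steps can be sketched as below. The average of the three scores follows the description above; mapping to "the most likely bin" is approximated here by the nearest keyword mean, which is a simplification, and the sample values and function names are invented for illustration.

```python
from statistics import mean


def seniority_score(current_job: float, job_history: float, age_based: float) -> float:
    """Average the three component scores into one continuous metric."""
    return mean([current_job, job_history, age_based])


def fit_bins(samples: dict) -> dict:
    """Summarize sampled scores per keyword, e.g. {"senior": [3.0, 3.4]}."""
    return {keyword: mean(scores) for keyword, scores in samples.items()}


def to_bin(score: float, bin_centers: dict) -> str:
    """Assign the ordinal bin whose typical score is closest to `score`."""
    return min(bin_centers, key=lambda k: abs(score - bin_centers[k]))
```

Given sampled scores for "junior", "senior", and "director" titles, a person whose three component scores average 3.0 would fall in the "senior" bucket if that keyword's typical score is the closest.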