Methodologies

Company Mapping

In publicly available workforce data, we see many different company names entered by users that may all refer to the same company (e.g., Bank of America and BofA both refer to the same company). There is also the added issue of understanding subsidiaries (e.g., Sprint is really a subsidiary of SoftBank). Both of these factors are critical to resolve in order to determine the accurate number of people that work for a single company (headcount). In the chart below, we take into account the number of employees associated with “Systems, Inc.”:

[Chart: Company Mapping]

In order to increase the confidence of company mapping, we try to gather as many of the following fields as are available: Company Name (required), Alternative Company Names, Company Headquarters (City, State, and Country), Ticker, ISIN, and LinkedIn URL.
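
As a sketch of how these fields can be combined, the toy matcher below scores a pair of company records: an exact match on a strong identifier (ticker, ISIN, LinkedIn URL) is treated as near-conclusive, while fuzzy name similarity and headquarters agreement supply the rest. The field names, suffix list, and weights are illustrative assumptions, not our production logic.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and strip common corporate suffixes before comparison."""
    name = name.lower().strip()
    for suffix in (" inc", " inc.", " corp", " corp.", " llc", " ltd"):
        if name.endswith(suffix):
            name = name[: -len(suffix)].rstrip(" ,.")
    return name

def match_score(record_a, record_b):
    """Combine evidence across fields into a single match score in [0, 1]."""
    # An exact match on a strong identifier is near-conclusive.
    for key in ("ticker", "isin", "linkedin_url"):
        if record_a.get(key) and record_a.get(key) == record_b.get(key):
            return 1.0

    # Fuzzy similarity over the name and any alternative names.
    names_a = [record_a["name"]] + record_a.get("alt_names", [])
    names_b = [record_b["name"]] + record_b.get("alt_names", [])
    name_sim = max(
        SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        for a in names_a for b in names_b
    )

    # Headquarters agreement adds supporting evidence.
    hq_match = 1.0 if record_a.get("hq") == record_b.get("hq") else 0.0
    return 0.8 * name_sim + 0.2 * hq_match
```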

Sampling Weights

The raw data we collect is not an accurate representation of a company’s workforce. Certain roles have a higher likelihood of being represented than others (e.g., white-collar positions have a higher likelihood than blue-collar positions), and people in certain regions have a higher likelihood of being represented than others (e.g., people in big cities have a higher likelihood of being represented than people in small cities). To generate data that is representative of the true underlying population, we must employ sampling weights to adjust for these biases in both occupation and location.

To generate sampling factors, we first probabilistically classify each job title and location to standard government occupational codes. For a previously unseen title, we infer occupational codes based on position in the learned occupational space. Observations in our dataset are stratified by occupation and country, summed and smoothed to account for potential misclassifications in government codes, and then compared with government level statistics on total workforce. For the United States (using Bureau of Labor Statistics data), detailed occupational breakdowns are available, which allow us to infer the likelihood of representation in our dataset. We then extrapolate to other countries using international statistics (International Labor Organization data) with a similar methodology. The likelihood of representation is currently assumed independent between occupation and country, and the final likelihood is generated by taking the product of the two likelihoods.
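
The final step (taking the product of the two likelihoods and inverting it into a weight) can be sketched as follows; the probability values here are made-up illustrations, not real representation rates.

```python
def representation_likelihood(p_occupation, p_country):
    """Likelihood of appearing in the raw sample, assuming occupation and
    country effects are independent, as described above."""
    return p_occupation * p_country

def sampling_weights(observations):
    """Inverse-probability weights, rescaled to sum to the sample size."""
    raw = [1.0 / representation_likelihood(p_occ, p_ctry)
           for p_occ, p_ctry in observations]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

# Made-up likelihoods: a white-collar worker in a well-covered country vs.
# a blue-collar worker in a poorly covered one.
weights = sampling_weights([(0.8, 0.9), (0.2, 0.3)])
```

The under-represented observation receives a proportionally larger weight, so weighted aggregates recover the true population mix.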

[Chart: Sampling Weights]

Lags in Reporting

People can be slow to update their profiles when they make a transition to a new job, and this lag can be quite severe when there are layoffs at a company. This introduces a systematic underrepresentation of the rate at which people transition in the very recent past, relative to the true rate of transition. To adjust for this effect, we employ sampling methods to provide an unbiased estimate of the flows of employees at every time period. As seen in the chart below, our model is able to correct for this underrepresentation of inflow and outflow in the most recent period.

[Chart: Lags in reporting]

And below is the change in headcount over time.

[Chart: Lags in reporting - Headcount]

Our model is built around a Bayesian Structural Time-Series (BSTS). The model uses the flow deviation relative to all companies, unemployment statistics, seasonality, and the expected lag distribution as features, and infers the distribution of coefficients that best predicts the observed flows. The model is first trained with naive prior distributions on a set of several thousand representative companies; the updated (and weighted) posterior distributions are then applied to make predictions for any company of interest.
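
The BSTS model itself is beyond a short example, but the core intuition of the lag correction can be sketched deterministically: if we know what fraction of transitions is typically reported within k periods, we can scale up recently observed flows accordingly. The lag distribution below is an assumed illustration, not our estimated one.

```python
def correct_for_reporting_lag(observed_flows, lag_cdf):
    """Scale recently observed flows up by the fraction expected to have
    been reported so far.

    observed_flows: transition counts per period, most recent last.
    lag_cdf[k]: probability that a transition is reported within k periods.
    """
    corrected = []
    n = len(observed_flows)
    for i, flow in enumerate(observed_flows):
        age = n - 1 - i  # periods elapsed since this one ended
        reported_fraction = lag_cdf[min(age, len(lag_cdf) - 1)]
        corrected.append(flow / reported_fraction)
    return corrected

# Assumed lag distribution: 40% of moves reported within the current
# period, 70% within one period, 90% within two, effectively all by three.
lag_cdf = [0.4, 0.7, 0.9, 1.0]
corrected = correct_for_reporting_lag([100, 100, 100, 40], lag_cdf)
```

Older periods are left nearly untouched, while the most recent period (where only 40 of an expected ~100 moves have surfaced) is scaled back up.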

Jobs Taxonomy

Our job taxonomy allows us to reduce the tens of millions of unique job titles seen in publicly reported workforce data to a manageable set of occupations. This job taxonomy is a core part of what enables us to compare employees across different companies and industries, even though companies often have very different conventions for what job titles they use.

We start by defining what a job is. A job is, simply, a bundle of activities. If we can represent the activities that people do, we can determine which jobs belong in the same occupation. There are two datasets of activities that we use: the first is the bullet points that people put on resumes or online profiles for each position, and the second is the responsibilities section of online job postings. From this training data, we build a mathematical representation of every job title, independent of seniority-related keywords like “senior” or “principal”. This abstracted representation allows us to then calculate distances between all job titles.
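
As a simplified sketch of this idea, each title below is represented as the mean of toy activity embeddings, with seniority keywords stripped before titles are keyed; distances between titles are cosine distances. The vectors are illustrative stand-ins for learned embeddings.

```python
import math

SENIORITY = {"senior", "principal", "junior", "lead", "staff"}

def strip_seniority(title):
    """Drop seniority-related keywords so titles abstract over level."""
    return " ".join(w for w in title.lower().split() if w not in SENIORITY)

def mean_vector(vectors):
    """A title's representation: the mean of its activity embeddings."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# Toy activity embeddings (in practice these come from resume bullet
# points and job-posting responsibility sections).
activity_vecs = {
    "senior software engineer": [[1.0, 0.1], [0.9, 0.2]],
    "software developer": [[0.95, 0.15]],
    "sales manager": [[0.1, 1.0]],
}
titles = {strip_seniority(t): mean_vector(vs) for t, vs in activity_vecs.items()}
```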

Once we have a mathematical representation for every job, we cluster them into broader occupations. The clustering algorithm starts with many distinct occupations and iteratively combines occupational clusters into broader and broader occupations. That way, users who are interested in a broad view of a company can segment employees into a small number of groups and users who are interested in a specific view of a company can segment employees into a large number of groups.

We then choose a representative title for each cluster to become the name of the occupation. We choose this title based on a combination of title frequency and proximity to the cluster centroid. While the name alone may not be completely representative of each title within the cluster, the name represents the closest title at a given level.
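
A minimal sketch of this selection step, assuming each candidate title carries a frequency and a position in the embedding space; the equal weighting of frequency and centroid proximity is an illustrative choice, not the production weighting.

```python
import math

def pick_representative(titles, centroid):
    """Pick the cluster's name: balance how common a title is against how
    close it sits to the cluster centroid."""
    max_freq = max(t["frequency"] for t in titles)

    def score(t):
        proximity = 1.0 / (1.0 + math.dist(t["vector"], centroid))
        # Equal weighting is an illustrative choice.
        return 0.5 * t["frequency"] / max_freq + 0.5 * proximity

    return max(titles, key=score)["title"]
```

A rare title sitting exactly on the centroid will still lose to a very common title nearby, which keeps cluster names recognizable.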

The plot below shows our entire landscape of job titles, broken into 27 clusters. Each circle represents an abstracted job title within our taxonomy and is sized based on a measure of how commonly seen it is. Circles belonging to the same cluster have the same color. More similar job titles are closer together in this ‘jobscape’ and less similar ones are further apart. Use the search box below to find a job title.

Combining Job Clusters

Using a custom agglomerative clustering algorithm, we build a hierarchy of cluster titles that runs from 1500 independent job categories, at the most granular level, up to 7 job categories, at the highest level. At every level (i.e. to go from 1500 clusters to 7 clusters), we combine clusters of job titles that inhabit a similar area of the occupational landscape, and rename the cluster using the same naming scheme outlined above.
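
A generic version of this bottom-up merging can be sketched as follows; it uses plain centroid distance as the merge criterion, rather than our custom algorithm's criteria.

```python
import math

def agglomerate(clusters, target_k):
    """Repeatedly merge the two closest clusters until target_k remain.

    Each cluster is a list of points; cluster distance is the distance
    between centroids (a simplification of the method described above).
    """
    def centroid(c):
        return [sum(p[i] for p in c) / len(c) for i in range(len(c[0]))]

    clusters = [list(c) for c in clusters]
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters
```

Recording the merge order yields the full hierarchy, so any intermediate number of clusters can be read off the same tree.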

Agglomerative clustering allows us to deliver multiple levels of job title granularities, each representing different cuts of the same taxonomy tree. We can view the tree from the top down (as shown in the visualization below) so that every broad job category can be subdivided into several job types - for example, Engineer can be subdivided into Mechanical Engineer, Software Engineer, Data Engineer, etc.

The plot below represents our job taxonomy in a hierarchical format. Each node on the tree can be thought of as the name of a cluster, which incorporates multiple job titles. As we move up the tree, we decrease the number of clusters and combine job titles from different clusters if necessary. Some of the most popular job titles contained within each cluster can be seen by hovering over that node on the tree (these would be akin to the biggest circles within each colored cluster on the ‘Jobscape’ scatterplot in the Clustering Job Titles section).

Our abstracted representation of job titles has an additional advantage, in that we are not strictly tied to any one hierarchy. We may use the abstract representation to create granular job categories upon client request for uncommon titles that may be too specific for most common uses. This approach also gives us the flexibility to define new job categories, while retaining the rigorous methodology used in our common hierarchical tree.

Uniformity of Clusters

Today’s standard clustering algorithms are extremely poor at deriving uniformly sized groups. In order to derive a set of clusters that could be used as a universal occupational taxonomy, we developed a proprietary clustering algorithm that can overcome this limitation.

Inferring Occupation

To infer occupation names at an individual level, we take into account a person’s listed skills, job descriptions, previous job titles, personal connections, and company/title level aggregated descriptions to help inform a person’s occupational classification. In practice, this allows us to make informed predictions about a person’s occupation in the absence of any specific job title, or where certain pieces of information may be missing. For example, an employee with the title “manager” may be classified as a technical manager based on their prior experience in software development, along with strong connections to other technical staff.

This approach helps us distinguish cases where identical titles represent fundamentally different occupations in different industries. For example, an associate at an investment bank will fundamentally differ from an associate at a law firm. Our model helps predict whether these same titles should be in the same cluster across the two industries: if they are similar in their inputs, they will be; otherwise, they may be matched to different occupations.

In detail, the inference model makes a prediction of a person’s job in the abstracted occupational vector space described above. The model comprises several sub-models, each of which makes an independent prediction about a person’s job title. For instance, one model uses a person’s skills to predict their job title. Another model mean-encodes the person’s written descriptions to predict their job title. Another model reads in the person’s past employment history to help predict what their current job title should be. All of these (and others) are combined and weighted by their predictive quality in order to generate a final predicted position in the abstract occupational vector space, which can then be matched to our hierarchy or used to generate custom job clusters.
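
The final combination step can be sketched as a weighted average in the occupational vector space; the sub-model predictions and quality weights below are assumed purely for illustration.

```python
def ensemble_predict(predictions, weights):
    """Weighted average of several sub-models' predicted positions in the
    occupational vector space; weights reflect measured predictive quality."""
    total = sum(weights)
    dim = len(predictions[0])
    return [sum(w * p[i] for w, p in zip(weights, predictions)) / total
            for i in range(dim)]

# Assumed predictions from a skills model, a description model, and an
# employment-history model, with illustrative quality weights.
combined = ensemble_predict(
    [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2]],
    [0.5, 0.3, 0.2],
)
```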

Uncommon Titles

Although we train mathematical representations of job titles based on their text descriptions, we can assign representations to all titles, even ones that have never appeared in our sample and have no text descriptions. We do this through FastText embeddings, a common and well-documented model that we train on our dataset of job titles. This allows us to learn representations of substrings of individual words, so that even if there is a small misspelling or a new job title phrasing, we can still relate it to job titles for which we have well-trained representations. Combined with the inference method outlined above, this allows us to classify every single position, no matter how uncommon.
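
FastText's subword idea can be illustrated without the full model: if two titles share many character n-grams, a misspelled or novel title still lands near its well-trained neighbors. The Jaccard overlap below is a rough stand-in for similarity in the learned embedding space, not the embedding arithmetic itself.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Subword character n-grams with boundary markers, as in FastText."""
    word = f"<{word}>"
    return {word[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)}

def subword_similarity(a, b):
    """Jaccard overlap of the two titles' character n-gram sets."""
    grams_a, grams_b = set(), set()
    for w in a.lower().split():
        grams_a |= char_ngrams(w)
    for w in b.lower().split():
        grams_b |= char_ngrams(w)
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

A typo like "sofware engneer" still shares most of its n-grams with "software engineer", so it resolves to the right neighborhood.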

Skill Clustering

Clustering Skills

Our skills taxonomy is similar to our Jobs Taxonomy in that it allows us to reduce the tens of millions of unique skills seen in publicly reported workforce data to a manageable set of skills. The taxonomy further enables us to compare employees across different companies, industries, and job titles.

We start by gathering skills seen in public data and determining which different phrasings may represent the same skill (e.g., JavaScript and JS).
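
A minimal sketch of this normalization step, using a hand-written alias table; in practice the mapping is learned from data rather than enumerated by hand.

```python
# A small, illustrative alias table, not the real learned mapping.
SKILL_ALIASES = {
    "js": "javascript",
    "java script": "javascript",
    "ms excel": "microsoft excel",
    "excel": "microsoft excel",
}

def canonical_skill(raw):
    """Map a raw skill string to its canonical form, if one is known."""
    key = raw.strip().lower()
    return SKILL_ALIASES.get(key, key)
```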

Once we have a representation for every skill, we cluster them into broader categories. The clustering algorithm starts with many distinct skills and iteratively combines clusters into broader and broader skill sets. That way, users who are interested in a broad view of a company can segment skills into a small number of groups and users who are interested in a specific view of a company can segment skills into a large number of groups.

We then choose a representative title for each cluster to become the name of the skillset. We choose this title based on a combination of title frequency and the title’s proximity to the cluster centroid. While the name alone may not be completely representative of each title within the cluster, the name represents the closest title at a given level.

The plot below shows our entire landscape of skills, broken into ___ clusters. Each circle represents an abstracted skill title within our taxonomy and is sized based on a measure of how commonly it’s seen. Circles belonging to the same cluster have the same color. More similar skills are closer together in this ‘skillscape’ and less similar ones are further apart. Use the search box below to find a skill.

Combining Skill Clusters

Using a custom agglomerative clustering algorithm, we’ve built a hierarchy of cluster titles that runs from 4000 independent skill categories, at the most granular level, up to 25 skill categories, at the highest level. At every level (i.e. to go from 4000 clusters to 25 clusters), we combine clusters of skills that inhabit a similar area of the skill landscape, and rename the cluster using the same naming scheme outlined in Clustering Skills.

The plot below represents our skill taxonomy in a hierarchical format. Each node on the tree can be thought of as the name of a cluster, which incorporates multiple skills. As we move up the tree, we decrease the number of clusters. Some of the most popular skills contained within each cluster can be seen by hovering over that node on the tree (these would be akin to the biggest circles within each colored cluster on the ‘Skillscape’ scatterplot in the Clustering Skills section).

Gender and Ethnicity

Gender is predicted from first name, based on Social Security Administration data. The predicted gender is represented as the probability of the name being male or female. For example, if 70% of the people named Lauren in the Social Security dataset are female and 30% are male, our model will output a probability of 0.7 that the person is female and 0.3 that the person is male.

To predict ethnicity, we use a Bayesian model that considers a person’s first name, last name, and location (based on the US Census). The model assumes the three variables are conditionally independent given ethnicity, meaning that once ethnicity is fixed, knowing someone’s first name tells us nothing further about their last name or location. The output format is similar to the gender model’s: a probability for each possible ethnicity for that name/location combination.
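
Under that conditional-independence assumption, the posterior is a standard naive-Bayes product over the observed fields. The sketch below uses made-up labels and probabilities purely for illustration, not census figures.

```python
def ethnicity_posterior(priors, likelihoods, observed):
    """Naive-Bayes posterior over ethnicities.

    priors: P(ethnicity).
    likelihoods[field][value][ethnicity]: P(value | ethnicity).
    observed: field -> value (e.g. first name, last name, location).
    Conditional independence given ethnicity lets us multiply the terms.
    """
    scores = {}
    for eth, prior in priors.items():
        p = prior
        for field, value in observed.items():
            # Unseen values contribute no evidence (factor of 1 for all).
            p *= likelihoods.get(field, {}).get(value, {}).get(eth, 1.0)
        scores[eth] = p
    total = sum(scores.values())
    return {eth: p / total for eth, p in scores.items()}

# Made-up priors and likelihoods, purely for illustration.
priors = {"group_a": 0.6, "group_b": 0.4}
likelihoods = {"first_name": {"kim": {"group_a": 0.2, "group_b": 0.8}}}
posterior = ethnicity_posterior(priors, likelihoods, {"first_name": "kim"})
```

When no field is informative, the posterior simply equals the prior, which mirrors the fallback to country-wide distributions described below.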

The ethnicity model is trained using name and geography data from the US Census, and predicts a probability of the person being a particular race from the set {White, Black, Asian Pacific Islander (API), Hispanic or Latino, American Indian or Alaska Native, Two or More Races (Multiple)}. In case of missing data or uncommon names, we default to country-wide ethnic distributions, which represent our prior probability of ethnicity before seeing the person’s name or location. To improve the accuracy of the model outside of the US, we include ethnic distribution data for countries that do not contain a mono-ethnic vast majority (for example, Canada), and adjust the probability on this distribution rather than the US census geography distribution. For countries that contain a mono-ethnic vast majority (for example, Japan), we set the majority ethnicity to have the vast majority of population density, and uniformly distribute the remaining density across the non-majority ethnicities. This allows the model to assign the majority ethnicity to people for whom we do not have strong contradictory information, while still assigning non-majority ethnicities when there is strong evidence to suggest otherwise.

Explore how the models work using the visualization below! Use the search boxes to view gender and ethnicity probabilities for a specific name or name/location combination.

Salary

We predict the salary for each position based on the following covariates: job title, seniority of the position, company, and country. The model architecture is that of an ensemble of regressors, each of which is trained to predict the base salary of the position. We train this model using over 50 million salaries from job postings and publicly available visa applications, and use country level multipliers and inflation rates to adjust for salaries outside of the US and over time. The predicted salary will always be given in USD and will represent the nominal base yearly wage.
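
The country-multiplier and inflation adjustments can be sketched as simple scaling factors applied to a US base prediction; the numbers below are hypothetical, not our actual multipliers or rates.

```python
def adjust_salary(base_usd, country_multiplier=1.0, inflation_rates=()):
    """Adjust a predicted US base salary to another country's pay level and
    carry it forward by compounding annual inflation."""
    salary = base_usd * country_multiplier
    for rate in inflation_rates:
        salary *= 1.0 + rate
    return salary

# Hypothetical: a 120k USD prediction, scaled to a country at 0.6x US pay
# levels, then carried forward through two years of 3% inflation.
adjusted = adjust_salary(120_000, country_multiplier=0.6,
                         inflation_rates=(0.03, 0.03))
```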

Due to the large number of jobs and companies that exist, we use abstract representations of companies and job titles in order to get estimates for any uncommon or unseen company/title pair. The advantage of this approach is that even for companies where we lack salary data, we may infer predicted salaries based on transitions into and out of that company. For instance, if a company hires software engineers directly from Google and Apple, this informs our prediction about the type of salary offers given to these employees upon transition.

While our model does not explicitly use individual features such as gender, ethnicity, education, skills, MSA, and others to predict salary, we still allow the data to be broken down by these groups, and there are often significant levels of associated signal. These are real effects, and they occur due to the high levels of correlation among predictors of salaries. For instance, education may not be an explicit input into the model, but educational background has a significant effect on the types of jobs one is likely to obtain, and thus indirectly affects salary predictions.