FAQs

Trials and Data

Which data delivery methods do you offer? We offer a variety of data delivery methods, including flat files, API, self-service Dashboard access, and custom reports. Flat files can be delivered using Amazon S3, AWS Data Exchange, Snowflake, or a link to a zipped version of your flat file. Our most popular delivery method is an Amazon S3 bucket, where we can deliver Parquet or CSV files to our clients.

How can I access my Amazon S3 bucket? First, install the AWS Command Line Interface (AWS CLI) on your local machine; AWS's documentation covers the installation process. Once the AWS CLI is installed, use the commands below to access your bucket and your files:

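# Enter your AWS access key ID, secret access key, and default region when prompted (one-time setup):
$ aws configure
# To list the contents of your S3 bucket, use the following code:
$ aws s3 ls s3://revelio-client-<client-name>/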
# To copy all files from your S3 bucket to the current working directory on your local machine, use the following code:
$ aws s3 cp s3://revelio-client-<client-name>/ ./ --recursive
# To copy all files from a folder in your S3 bucket to the current working directory on your local machine, use the following code:
$ aws s3 cp s3://revelio-client-<client-name>/<folder-name>/ ./ --recursive

When is your data delivered? And how frequently is it updated? Clients receive updated data for the previous month on the 15th of each month. For example, December data (including headcounts, inflows, outflows, etc.) would all become available by January 15th.

If you would like your data to be updated more frequently, we also offer a daily data feed. Keep in mind that the daily data feed is not as comprehensive as the monthly updates.

When does your data start? Our employee sentiment data dates back to 2007. Workforce dynamics data dates back to 2008. Job postings data dates back to 2019, but postings that predate 2019 are also available for an additional charge.

What are your data sources? Our data is sourced from a variety of publicly accessible datasets including data from online professional profiles, online employee reviews, H-1B visa filings, job postings on aggregator sites and career pages, and WARN layoff notices.

How many companies and positions do you cover? Our data covers all public and private companies, which comes out to roughly 20 million companies globally, and it observes over 400 million active positions.

Can you cover companies that are not in your sample file? Yes, if there are any specific companies you are interested in tracking that are not included in the standard trial or sample files, we can include them upon request.

How do you treat company subsidiaries? When a company acquires or merges with another company, we include the subsidiary as part of the parent company, even retroactively, before the acquisition took place. For example, we include all Whole Foods employees as part of Amazon during 2008-2016, even though Amazon only acquired Whole Foods in 2017. We do this to avoid an artificial spike in headcount when an acquisition or spinoff occurs.

Can you provide data on any additional granularities and outcomes? We can provide data along custom granularities, including veteran status and highest degree, upon request. Other outcomes that we can provide include a gender diversity score, an ethnic diversity score, and suitability for remote work. For individual position files, we can provide position descriptions.

Modeling

How do you compensate for some people not having online profiles? Because we collect our data from online professional profiles, our data is drawn from a non-representative sample of the underlying population. We impose sampling weights to adjust for roles and locations that are underrepresented in the sample. For example, if an engineer in the US has a 90% chance of having an online profile, we would consider every engineer in the US that we see to actually represent about 1.1 people (1/0.9). If a nurse in Germany has a 25% chance of having an online profile, we would consider every nurse in Germany that we see to actually represent 4 people (1/0.25). This allows us to approximate, as closely as possible, the true size of the underlying population.
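The arithmetic in this answer is inverse-probability weighting: each observed profile counts as 1/p people, where p is the chance that someone in that role and location has an online profile. A minimal Python sketch follows. The function name `sampling_weight` and the observed headcount are illustrative assumptions, not Revelio Labs' production model.

```python
def sampling_weight(profile_probability: float) -> float:
    """Return how many people one observed profile represents (1 / p)."""
    if not 0 < profile_probability <= 1:
        raise ValueError("profile_probability must be in (0, 1]")
    return 1.0 / profile_probability


# Probabilities taken from the example in the answer above.
engineer_us = sampling_weight(0.90)    # roughly 1.1 people per observed profile
nurse_germany = sampling_weight(0.25)  # 4 people per observed profile

# Estimated population = observed profiles * weight.
# The observed count below is a made-up number for illustration.
observed_nurses = 500
estimated_nurses = observed_nurses * nurse_germany
```

Weighted counts of this kind are generally not whole numbers, which is consistent with the decimal counts in the delivered files.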

How are companies being mapped? Companies are mapped to Revelio Labs' proprietary company universe using company identifiers such as company name, ticker, or website. Each company in the Revelio Labs universe has an RCID (Revelio Labs Company Identifier) associated with it. We assign weights to the different identifiers given to us and map them to our internal RCID universe. Then, we give each potential pairing a probability score, with 1 being a definite match and 0 being a definite mismatch. Finally, we choose the highest-scoring pairing as our match.
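To make the weight, score, and select steps concrete, here is a toy Python sketch. Everything in it is an assumption made for illustration: the identifier weights, the exact-match comparison, the scoring rule, and the RCID values are invented, and the actual mapping model is not described in enough detail here to reproduce.

```python
# Hypothetical identifier weights: how much a match on each identifier counts.
IDENTIFIER_WEIGHTS = {"name": 0.3, "ticker": 0.3, "website": 0.4}


def normalize(value) -> str:
    """Lowercase and trim an identifier so trivial differences do not matter."""
    return str(value).strip().lower()


def match_score(query: dict, candidate: dict) -> float:
    """Score a pairing in [0, 1]: weighted share of provided identifiers that agree."""
    provided = [key for key in IDENTIFIER_WEIGHTS if query.get(key)]
    total = sum(IDENTIFIER_WEIGHTS[key] for key in provided)
    if total == 0:
        return 0.0
    matched = sum(
        IDENTIFIER_WEIGHTS[key]
        for key in provided
        if normalize(query[key]) == normalize(candidate.get(key, ""))
    )
    return matched / total


def best_match(query: dict, universe: dict) -> tuple:
    """Return (rcid, score) for the highest-scoring candidate in the universe."""
    scored = ((rcid, match_score(query, cand)) for rcid, cand in universe.items())
    return max(scored, key=lambda pair: pair[1])


# Toy universe keyed by made-up RCIDs.
universe = {
    101: {"name": "Acme Corp", "ticker": "ACME", "website": "acme.example"},
    202: {"name": "Acme Holdings", "ticker": "ACH", "website": "acmeholdings.example"},
}
query = {"name": "acme corp", "ticker": "ACME"}
rcid, score = best_match(query, universe)
```

A real matcher would likely use fuzzy name comparison rather than exact equality; the sketch only shows how per-identifier weights can roll up into a single score between 0 and 1.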

What is the relationship between count, inflow, and outflow in the Workforce Dynamics dataset? To recreate the count metric for a given granularity g at time t, use the following formula:

\[count_g(t) = count_g(t-1)+ inflow_g(t) - outflow_g(t-1)\]
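As a sanity check on a delivered file, the recurrence can be applied month by month. The Python sketch below implements the formula exactly as written above, including the \(t-1\) index on outflow. The function name and the monthly values are hypothetical.

```python
def rebuild_counts(initial_count, inflows, outflows):
    """Rebuild the count series from count(0), inflow(t), and outflow(t).

    Applies count(t) = count(t-1) + inflow(t) - outflow(t-1).
    inflows[0] is not needed by the recurrence and is ignored.
    """
    counts = [initial_count]
    for t in range(1, len(inflows)):
        counts.append(counts[t - 1] + inflows[t] - outflows[t - 1])
    return counts


# Hypothetical monthly values for one granularity (for example, one company).
inflows = [0.0, 12.5, 8.0, 10.0]
outflows = [5.0, 7.5, 6.0, 4.0]
counts = rebuild_counts(100.0, inflows, outflows)
```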

Why are the counts, inflows, and outflows columns decimals, rather than integers? Our data uses time-scaling and cross-sectional models to adjust for lags in reporting and sampling bias. The weights applied in these models produce non-integer values for counts, inflows, and outflows.

Is it true that the change in counts must be equal to inflows minus outflows? Yes.

In what cases will your employee headcounts differ from employee headcounts in a company's annual report? Our employee headcounts will often differ from those in a company's 10-K, because 10-K figures omit contingent workers, who in many cases make up the majority of a company's workforce. Our reporting, by contrast, covers the entire workforce, including contingent workers.

How is the Prestige score generated? We take publicly available university rankings to determine a base score for individuals from those universities. We then derive a score for companies, based on which universities their employees attended. The company scores and the university scores are used to iteratively generate the Prestige score, filling in information gaps until the algorithm converges. The model is constructed so that someone who attended a low-ranking school but went on to work at a prestigious firm would still receive a high Prestige score.

What kinds of positions fall into each Seniority level? Our Seniority Model assigns seniority scores to positions in one of seven levels. These levels are:

  1. Entry Level / Intern (Ex. Accounting Intern; Software Engineer Trainee; Paralegal)

  2. Junior Level (Ex. Account Receivable Bookkeeper; Junior Software QA Engineer; Legal Adviser)

  3. Associate/Analyst Level (Ex. Senior Tax Accountant; Lead Electrical Engineer; Attorney)

  4. Manager Level (Ex. Account Manager; Superintendent Engineer; Lead Lawyer)

  5. Vice President Level (Ex. Chief of Accountants; VP Network Engineering; Head of Legal)

  6. Director Level (Ex. Managing Director, Treasury; Director of Engineering, Backend Systems; Attorney, Partner)

  7. C-suite Level (Ex. CFO; COO; CEO)

The example job titles above are titles that we can expect at each Seniority level. However, depending on the specific characteristics of the company and position, these titles could also appear at slightly higher or lower levels.

How is the Gender/Ethnicity Entropy metric generated? The Gender/Ethnicity Entropy metric aims to ordinally rank the diversity of a company’s workforce. The metric uses a modified version of the Shannon Index to calculate diversity scores in relation to a company’s peers, while taking into account occupation and region. To calculate the final score, we assign the percentile to which the company of interest falls in relation to all other peer companies, given the region and occupation type. This provides a relative score that is easily interpretable.

The Gender/Ethnicity Entropy metric is designed on a linear scale of 1-10, where a score of 1 indicates poor diversity and a score of 10 indicates high diversity. The scale may also be interpreted as a percentile with respect to peers (i.e. 1 corresponds to a diversity score between the 0 and 10th percentile, 2 corresponds to a diversity score between the 10th and 20th percentile, and so forth). The metric is available at any level of granularity, and may also be tuned to have an increased sensitivity to particular groups of people. The metrics are reported independently for gender and ethnicity diversity.

We have taken measures to evaluate the accuracy of both of these models. We evaluate our Gender Model by comparing its predicted share of females to the share of females reported by pronouns in the “recommended” section of profiles (self-reported pronouns are not available on public profiles but can be visible in the text of recommendations). In doing so, we find that our Gender Model has an accuracy of 96.16%. Further, we validate our Ethnicity Model by comparing the shares of ethnic groups in each US Metropolitan Statistical Area (MSA) reported by official statistics to those predicted by our model. We find that our Ethnicity Model has an accuracy of 93.28%.
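For readers unfamiliar with the Shannon Index, the Python sketch below computes the standard (unmodified) index for a company and converts its standing among peers into a 1-10 score using the percentile bins described above. The modifications Revelio Labs applies, and the way peers are selected by region and occupation, are not reproduced here. The function names and headcounts are hypothetical.

```python
import math


def shannon_index(group_counts):
    """Standard Shannon Index H = -sum(p_i * ln(p_i)) over non-empty groups."""
    total = sum(group_counts)
    shares = [count / total for count in group_counts if count > 0]
    return -sum(p * math.log(p) for p in shares)


def decile_score(value, peer_values):
    """Map a value to a 1-10 score from its percentile among peer values.

    A value above 0-10% of peers scores 1, above 10-20% scores 2, and so on.
    Integer arithmetic avoids floating-point edge cases at bin boundaries.
    """
    below = sum(1 for peer in peer_values if peer < value)
    return min(10, (below * 10) // len(peer_values) + 1)


# Hypothetical headcounts by group for five peer companies.
peers = [[50, 50], [80, 20], [95, 5], [60, 40], [99, 1]]
peer_entropy = [shannon_index(counts) for counts in peers]

# Hypothetical company of interest with a 70/30 split.
company_entropy = shannon_index([70, 30])
score = decile_score(company_entropy, peer_entropy)
```

An even split maximizes the index (ln 2 for two groups), so more balanced workforces rank higher among their peers.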

How are Sentiment scores generated? Our Sentiment Model uses Natural Language Processing to capture employee sentiment on specific topics across raw user reviews of companies. The model is built on a Transformer architecture and trained on an entailment task, which lets it predict the probability that a given topic, phrase, or sentence follows from the text of interest. We then apply the model to a task known as Zero-Shot Classification, where we allow both positive and negative text to be matched to a predefined topic list. We assume that positive reviews classified to a given topic correspond to positive sentiment for that topic and negative reviews correspond to negative sentiment for that topic. For every review, we can then compute a weighted sentiment score based on how relevant a given topic was to the positive or negative portion of the review. To offset any negative bias in the reviews, we normalize the scores by assuming that they are normally distributed and reporting how many standard deviations a review's score for a given topic lies from the mean. These scores can then be averaged across any level of granularity to produce a final sentiment score.
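The normalization step at the end of this answer is a z-score. The Python sketch below shows only that step and the subsequent averaging; the Transformer-based topic classification is out of scope. The review scores, company labels, and function name are made up for illustration, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Convert raw scores to z-scores: (score - mean) / standard deviation."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(score - mu) / sigma for score in scores]


# Hypothetical (company, raw sentiment score) pairs for a single topic.
reviews = [("A", -0.8), ("A", -0.5), ("B", 0.2), ("B", -0.3), ("B", -0.6)]

# Normalize across all reviews so the overall mean becomes 0.
z_scores = standardize([score for _, score in reviews])

# Average the normalized scores at the chosen granularity (here, by company).
by_company = {}
for (company, _), z in zip(reviews, z_scores):
    by_company.setdefault(company, []).append(z)
company_sentiment = {name: mean(values) for name, values in by_company.items()}
```

Because the scores are centered on the overall mean, a company whose reviews are negative but less negative than average ends up with a positive normalized score.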

How are attrition rates, hiring rates and growth rates calculated? The attrition rate \(a_g(t)\) and hiring rate \(h_g(t)\) are calculated at the chosen granularity level \(g\) and month \(t\). In the formulas below, \(o_g(j)\) and \(i_g(j)\) denote the outflows and inflows at granularity level \(g\) in month \(j\), and \(\bar{c}_g(t)\) denotes the average headcount at that granularity over the last 12 months. The growth rate is the difference between the hiring rate and the attrition rate.

\[ \begin{align}\begin{aligned}a_g(t) = 100 \cdot \frac{\sum_{j=t-11}^{t} o_g(j)}{\bar{c}_g(t)}\\h_g(t) = 100 \cdot \frac{\sum_{j=t-11}^{t} i_g(j)}{\bar{c}_g(t)}\\\bar{c}_g(t) = \frac{1}{12} \sum_{j=t-11}^{t} c_g(j)\end{aligned}\end{align} \]

In other words, the attrition rate is the 12-month moving sum of outflows divided by the 12-month moving average of headcount, while the hiring rate is the 12-month moving sum of inflows divided by the 12-month moving average of headcount.
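The formulas above translate directly into a few lines of Python. This is a sketch with a hypothetical function name and toy input series chosen for easy arithmetic (they are not meant to be realistic workforce data); it returns rates in percent, matching the factor of 100 in the formulas.

```python
def trailing_rates(counts, inflows, outflows, t):
    """Return (attrition, hiring, growth) in percent for month index t.

    Uses the 12 months ending at t (indices t-11 through t), so t >= 11.
    """
    if t < 11:
        raise ValueError("need at least 12 months of history")
    window = slice(t - 11, t + 1)
    avg_count = sum(counts[window]) / 12          # 12-month average headcount
    attrition = 100 * sum(outflows[window]) / avg_count
    hiring = 100 * sum(inflows[window]) / avg_count
    growth = hiring - attrition                   # growth = hiring - attrition
    return attrition, hiring, growth


# Toy monthly series for one granularity over 12 months.
counts = [200.0] * 12
inflows = [4.0] * 12
outflows = [3.0] * 12
attrition, hiring, growth = trailing_rates(counts, inflows, outflows, t=11)
```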

How can I calculate average prestige from my workforce dynamics file? Average prestige can be calculated with the total_prestige field and the prestige_weight field using the sample code below.

select
    company,
    month,
    sum(total_prestige)/sum(prestige_weight) as avg_prestige
from wf_table
group by company, month;

Which languages can you translate for job titles and descriptions? We can translate job titles and descriptions from all languages using industry-standard translation software. In addition, for seven languages (Spanish, French, Portuguese, German, Italian, Dutch, and Chinese) we have developed in-house models, built using FastText, that translate text with even higher precision than the standard software.

How can I recreate the plots I see on the Dashboard using the data from my data feed? You can recreate key metrics such as hiring rate, attrition rate, and salary from the data in your data feed with the following sample code:

Hiring rate:

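select
    company,
    month,
    -- 12-month rolling sum of inflows and rolling average of headcount
    sum(inflow_sum) over(partition by company order by month rows between 11 preceding and current row) as inflow_rolling_sum,
    avg(count_sum) over(partition by company order by month rows between 11 preceding and current row) as count_rolling_avg,
    -- window expressions are repeated because many SQL dialects cannot reference a select alias in the same select list;
    -- multiply by 100 to express the rate as a percentage
    sum(inflow_sum) over(partition by company order by month rows between 11 preceding and current row)
        / nullif(avg(count_sum) over(partition by company order by month rows between 11 preceding and current row), 0) as hiring_rate
from wf_table;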

Attrition rate:

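select
    company,
    month,
    -- 12-month rolling sum of outflows and rolling average of headcount
    sum(outflow_sum) over(partition by company order by month rows between 11 preceding and current row) as outflow_rolling_sum,
    avg(count_sum) over(partition by company order by month rows between 11 preceding and current row) as count_rolling_avg,
    -- window expressions are repeated because many SQL dialects cannot reference a select alias in the same select list;
    -- multiply by 100 to express the rate as a percentage
    sum(outflow_sum) over(partition by company order by month rows between 11 preceding and current row)
        / nullif(avg(count_sum) over(partition by company order by month rows between 11 preceding and current row), 0) as attrition_rate
from wf_table;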

Salary:

select
    company,
    month,
    sum(salary)/sum(count) as salary
from wf_table
group by company, month;

Why do my plots look different than those on the Dashboard? The data on the Dashboard may differ from the data in your data feed due to differing model versions as the data on the Dashboard reflects our latest available models. More information on the methodology for each model can be found in Methodologies.

Known Issues and Updates

Known Issues

Skill Data Sparsity

Issue: Our profile data is combined from multiple sources that gather publicly available profiles. Around May 2021, user skills disappeared from the majority of public profiles. Skills remain visible on a minority of the public profiles we collect, but for most existing users we no longer see recently added skills, and for most new users we see no skills at all.

Scope: This affects the individual level skill_file and the workforce dynamics skill_file. The workforce dynamics skill_file still tracks users’ observable skills across different positions.

Solution: We will continue to capture the skills when they appear on public profiles, and monitor for any changes in their availability. Additionally, we recently implemented a model which predicts missing skills that may be useful in filling the gaps in the individual level skill_file as well as improving the currency of the workforce dynamics skill_file.

Updated: June 9, 2022

Bug Fixes

Highest Degree

We fixed a bug where an individual’s reported highest degree was sometimes not their actual highest degree.

Updated: June 13, 2022

Updates

Company Mapping 2.0

We recently released a new company mapping model that maps company entities to Revelio Labs’ proprietary company universe through RCIDs (Revelio Labs Company IDs). This approach improves on our previous model, allowing for greater consistency and accuracy in mapping, as well as greater flexibility in resolving mapping inconsistencies.

Salary Model 2.0

We recently released a new salary model using tree-based prediction models and an ensemble model to improve overall salary predictions. This model also provides significant improvements in the salary predictions across different geographies within the U.S. and salary predictions for C-suite employees. Additionally, the new salary predictions yield an interval rather than a point estimate.

Updated: June 2, 2022

Please feel free to reach out directly with any questions: info@reveliolabs.com