Datasets

Workforce Dynamics

This dataset contains aggregated workforce statistics. Every row is a distinct level of aggregation and month combination. Generally, the broadest configuration of this dataset is the company and month level. In that example, every row is a company and month combination. Imagine, then, that we would like to include country as a level of aggregation. Each row would correspond to a company, country, and month combination. The dataset at the company-country-month level can be aggregated to create the company-month dataset.

Let’s take a look at an example output in which the levels of aggregation are company and country, tracked across month. The outcome of interest is count: the total headcount for a given level of aggregation and month combination, measured at the end of that month.

company      country   month     count
Company A    U.S.      2021-01   10
Company A    U.S.      2021-02   12
Company A    U.S.      2021-03   14
Company A    Canada    2021-01   10
Company A    Canada    2021-02   11
Company A    Canada    2021-03   9

This also lets us visualize the table as a graph, with month along the x-axis and the outcome count along the y-axis. Each level of aggregation, in this case (Company A, U.S.) and (Company A, Canada), can then be viewed as an entity whose count is tracked over time (month) on this graph. This generalization, where we provide outcomes at a level of aggregation across time, forms the basis of our workforce dynamics data.

Note that it’s easy to compute a coarser level of aggregation from a finer one. To roll up to the company and month level, we sum across the country column to get:

company      month     count
Company A    2021-01   20 (10+10)
Company A    2021-02   23 (12+11)
Company A    2021-03   23 (14+9)
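
The roll-up from company-country-month to company-month can be sketched in a few lines of Python. This is a minimal illustration built on the example rows above; the tuple layout and the function name `roll_up_to_company_month` are assumptions made for the example, not part of any delivered file.

```python
from collections import defaultdict

# Example rows at the company-country-month level:
# (company, country, month, count)
rows = [
    ("Company A", "U.S.", "2021-01", 10),
    ("Company A", "U.S.", "2021-02", 12),
    ("Company A", "U.S.", "2021-03", 14),
    ("Company A", "Canada", "2021-01", 10),
    ("Company A", "Canada", "2021-02", 11),
    ("Company A", "Canada", "2021-03", 9),
]


def roll_up_to_company_month(rows):
    """Sum count across countries for each (company, month) pair."""
    totals = defaultdict(float)
    for company, _country, month, count in rows:
        totals[(company, month)] += count
    return dict(totals)


company_month = roll_up_to_company_month(rows)
for (company, month), count in sorted(company_month.items()):
    print(company, month, count)
```

The same pattern applies to any additive outcome: drop the column being aggregated away from the grouping key and sum what remains.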

  • Count (float): We estimate the total number of employees that fit each granularity level at the end of a given month, using the proprietary algorithm described below.

  • Inflow/Outflow (float): In addition to the number of employees, we also estimate the total inflow and outflow of employees at each granularity level at the end of a given month.

  • Salary (float): We predict the salary for each position based on role, seniority, company, and country using a regression-based model. We train this model on over 200 million salaries from job postings and publicly available labor certification applications, and use country-level inflation to estimate the change in salary over time. We get an out-of-sample root mean squared error (RMSE) of 14%. Salary in long_file shows the sum of salaries of employees in the particular granularity level.

  • Prestige (float): We generate prestige for each university, degree level, company, and individual. We initialize the prestige of universities from publicly available scores, then propagate it through the relationships between universities, individuals, and companies that we observe in our data until each individual converges on a prestige score.

  • Month (time): The month and year of the position are provided in “YYYY-MM” format. Each deliverable file contains monthly data up to the previous month end.

  • Company (categorical): The RL delivery file can provide insight on all public (and private) companies. By default, companies are defined at the holding-company level, and all subsidiaries held by the top parent company are included in the company level. The parent companies covered by Revelio include those mapped by FactSet Research Systems Inc. as well as companies defined manually at the client’s request

  • Region (categorical): The coarsest geographical granularity is the region level. The 16 region names are as follows:

    • Arab States

    • Northern Africa

    • South-Eastern Asia

    • Central America

    • Northern America

    • Southern Asia

    • Central and Western Asia

    • Northern Europe

    • Southern Europe

    • Eastern Asia

    • Pacific Islands

    • Sub-Saharan Africa

    • Eastern Europe

    • South America

    • Western Europe

  • Country (categorical): The profiles come from 232 different countries

  • State (categorical): For US and US territories, granularity level can be specified at state level. States include 50 states and 9 territories (American Samoa, Guam, Northern Mariana Islands, Puerto Rico, Virgin Islands, Minor Outlying Islands, Micronesia, Marshall Islands, Palau)

  • MSA (categorical): For US states and US territories, the most granular geography is the Metropolitan Statistical Area (MSA).

  • Job Category (categorical): In addition to geographical granularities, role level granularities can also be specified. The most basic job category classification groups roles into the following 7 groups:

    • Admin

    • Engineer

    • Finance

    • Marketing

    • Operations

    • Sales

    • Scientist

    The job role taxonomy is developed by our proprietary representation and clustering algorithms. We develop mathematical representations of each job title using the title itself, the text description of the position (written either by individuals describing their own experiences or by employers on a job posting), and the individual’s skills, associates, and previous experience. Our clustering algorithm is in the family of hierarchical/agglomerative clustering algorithms. This means that we begin with every job title occupying its own cluster, then iteratively combine clusters based on a set of criteria. This allows for complete flexibility in the number of clusters. We can create a large, granular set of clusters, a medium set, a broad set, etc., and have all sets of clusters map perfectly into each other. We update this taxonomy periodically to adjust to the changing occupational landscape. Aside from the 7 job categories above, the most common job clusterings are done at 150 groups and 1000 groups.

  • Seniority (ordinal): Seniority ranges from 1 to 4. 1 is the most junior and 4 is the most senior. Our seniority model is a predicted seniority based on the title, accounting for industry and company size. Age and tenure do not directly determine seniority. If there are two people with identical titles in the same industry in companies of the same size, those people in their position would get the same seniority score even if they have different lengths of time in the workforce
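
The nesting property of the job role taxonomy described under Job Category (fine clusters mapping perfectly into broader ones) follows directly from how agglomerative clustering works. The toy sketch below is not Revelio's proprietary algorithm: the job titles, the one-dimensional "representations", and the single-linkage merge criterion are all assumptions chosen to keep the example small.

```python
# Hypothetical 1-D "representations" of job titles. Real representations
# would be high-dimensional vectors built from titles, descriptions, etc.
titles = {
    "software engineer": 1.0,
    "backend developer": 1.2,
    "data scientist": 3.0,
    "research scientist": 3.3,
    "account executive": 8.0,
    "sales associate": 8.4,
}


def agglomerate(points):
    """Single-linkage agglomerative clustering.

    Returns a list of partitions, starting with one cluster per title and
    ending with a single cluster. Each partition is obtained from the
    previous one by merging the two closest clusters.
    """
    clusters = [frozenset([name]) for name in points]
    history = [list(clusters)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                dist = min(
                    abs(points[a] - points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append(list(clusters))
    return history


history = agglomerate(titles)
by_k = {len(partition): partition for partition in history}

fine = by_k[3]   # partition with three clusters
broad = by_k[2]  # partition with two clusters

# Every fine cluster sits entirely inside one broad cluster.
nested = all(any(f <= b for b in broad) for f in fine)
print(nested)
```

Because each coarser partition is produced only by merging clusters from the finer one, any two cuts of the same merge history are guaranteed to nest, whatever the number of clusters chosen.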

Statistics that can be included are as follows:

  • Levels of Aggregation:
    • Month (time): The month and year of the position, provided in “YYYY-MM” format.

    • Region (categorical): The most coarse geographical granularity with 16 discrete levels

    • Country (categorical): 232 different countries

    • State (categorical): For US and US territories, state level location

    • MSA (categorical): For US states and US territories, metropolitan statistical area

    • Job_category (categorical): Aggregated position role with 7 discrete levels

    • Seniority (ordinal): Seniority level with 4 discrete levels

    • Highest_degree (categorical): Highest degree attained by workers at the granularity level

    • Gender (categorical): Gender is calculated as a probability based on the likelihood of the first name being male or female

    • Ethnicity (categorical): Ethnicity is estimated based on the likelihood of both the first and last name as well as location

    • Veteran (categorical): Veteran status

  • Outcomes:
    • Count (float): Headcount for a given month

    • Inflow/Outflow (float): Total inflow and outflow of employees at each granularity level.

    • Salary (float): Sum of estimated salaries in the particular granularity level

    • Prestige (float): Average prestige score of the specified granularity level

    • Remote_suitability (float): Revelio score for suitability of a role for remote work

    • Duration (float): Average tenure of employees at a given granularity level in years

    • Hiring (float): Sum of inflows at a given level of granularity over the last year divided by the average counts at that granularity over the last year

    • Attrition (float): Sum of outflows at a given level of granularity over the last year divided by the average counts at that granularity over the last year

    • Gender_entropy (float): Gender diversity score at a given granularity level

    • Ethnicity_entropy (float): Ethnicity diversity score at a given granularity level
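
The Hiring and Attrition definitions above amount to a trailing-window ratio: the sum of a flow divided by the average count over the same window. The sketch below illustrates that arithmetic under stated assumptions: "the last year" is taken to mean the 12 most recent monthly rows, the monthly numbers are invented for the example, and the function name `trailing_rate` is hypothetical.

```python
def trailing_rate(flows, counts):
    """Sum of flows over the window divided by the average count over it."""
    if not counts or len(flows) != len(counts):
        raise ValueError("flows and counts must be non-empty and equal length")
    average_count = sum(counts) / len(counts)
    return sum(flows) / average_count


# Invented monthly values for one granularity level over 12 months.
counts = [100.0] * 12
inflows = [2.0] * 12
outflows = [1.0] * 12

hiring = trailing_rate(inflows, counts)      # 24 / 100
attrition = trailing_rate(outflows, counts)  # 12 / 100
print(round(hiring, 2), round(attrition, 2))
```

Note that these rates are ratios, so they should be recomputed from summed flows and counts after a roll-up rather than summed or averaged across rows.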

Transitions

  • User_id: Revelio user id

  • Month: Month

  • Prev_rn: Position order of previous role

  • Prev_company: Previous company name

  • Prev_sector: Previous sector

  • Prev_industry: Previous industry

  • Prev_region: Previous region

  • Prev_jobtitle: Previous job title

  • Prev_job_category: Aggregated previous position role with 7 discrete levels

  • Prev_role_k50: Aggregated previous position role with 50 discrete levels

  • Prev_role_k150: Aggregated previous position role with 150 discrete levels

  • Prev_seniority: Previous seniority level with 4 discrete levels

  • Prev_enddate: End date of previous position

  • Prev_salary: Estimated salary of the previous role

  • New_rn: Position order of new role

  • New_company: New company name

  • New_sector: New sector

  • New_industry: New industry

  • New_region: New region

  • New_jobtitle: New job title

  • New_job_category: Aggregated new position role with 7 discrete levels

  • New_role_k50: Aggregated new position role with 50 discrete levels

  • New_role_k150: Aggregated new position role with 150 discrete levels

  • New_seniority: New seniority level with 4 discrete levels

  • New_startdate: Start date of new position

  • New_salary: Estimated salary of the new role

Job Posting Dynamics

This dataset contains aggregated job posting statistics. Every row is a distinct level of aggregation and month combination. Generally, the broadest configuration of this dataset is the company and month level. If country is included as a level of aggregation, each row corresponds to a company, country, and month combination. The dataset at the company-country-month level can be aggregated to create the company-month dataset. For more information on the levels of aggregation, please refer to the Workforce Dynamics section.

  • Granularity_id: Revelio internal ID

  • Start_date: Month

  • Active_posting: Number of active postings during that month

  • New_posting: Number of new postings during that month

  • Removed_posting: Number of postings removed during that month

  • Active_salary_avg: Average salary for active postings during that month

  • New_salary_avg: Average salary for new postings during that month

  • Removed_salary_avg: Average salary for postings that got removed during that month

  • Filling_time_avg: Average time to fill

  • Company: Company name

  • State: Location of job posting

  • Role_k7: aggregated posting role with 7 discrete levels

  • Role_k150: aggregated posting role with 150 discrete levels

  • Seniority: seniority level with 4 discrete levels
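
When rolling this dataset up to a coarser level, posting counts can be summed, but averages such as Active_salary_avg cannot: they need to be weighted by the number of postings behind each row. The sketch below shows the difference. It assumes the average in each row covers the same postings as the matching count (postings without salary information may break that assumption), and the numbers and the function name `roll_up_average` are invented for illustration.

```python
def roll_up_average(rows):
    """Combine per-group averages into one average, weighted by group size.

    rows is a list of (n_postings, avg_salary) pairs.
    """
    total = sum(n for n, _ in rows)
    if total == 0:
        return None
    return sum(n * avg for n, avg in rows) / total


# Invented company-state-month rows for one company and one month:
# (Active_posting, Active_salary_avg)
state_rows = [(30, 100_000.0), (10, 60_000.0)]

weighted = roll_up_average(state_rows)
unweighted = sum(avg for _, avg in state_rows) / len(state_rows)
print(weighted, unweighted)
```

Here the weighted result is 90,000 while the naive mean of the two averages is 80,000, because three quarters of the postings sit in the higher-paying group.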

Individual Job Postings

RL also provides individual level job postings data.

  • Job_id: posting key

  • Company: Name of the company

  • Company_cleaned: Standardized company name

  • Post_date: date at which the job was posted

  • Remove_date: date at which the job was removed. If null, it hasn’t been removed yet.

  • Title: Raw job title

  • Title_cleaned: standardized title

  • Role_k150: aggregated position role with 150 discrete levels

  • Role_k50: aggregated position role with 50 discrete levels

  • Role_k7: aggregated position role with 7 discrete levels

  • Status: Discrete posting status includes: open, closed, expired and pending close

  • Salary: salary information from the posting.

  • Location, city, state, state_long, zip, county, latitude, longitude: listed location for posting

  • Region_state: Metropolitan Statistical Area of the posting

  • Industry: listed industry of posting

  • Industry_cleaned: standardized listed industry

  • Company_size_min, company_size_max: Minimum and maximum company size on company profile.
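
Because the removal date is null for postings that are still live, any duration calculation has to decide how to treat open postings. The sketch below measures open postings up to a chosen reference date. It assumes the dates have already been parsed into `datetime.date` objects (the delivered string format is not specified here), and the function name `days_open` is hypothetical.

```python
from datetime import date


def days_open(post_date, remove_date, as_of):
    """Days a posting has been live.

    Postings that have not been removed (remove_date is None) are
    measured up to the as_of reference date.
    """
    end = remove_date if remove_date is not None else as_of
    return (end - post_date).days


as_of = date(2021, 3, 15)
closed = days_open(date(2021, 1, 4), date(2021, 2, 3), as_of)
still_open = days_open(date(2021, 3, 1), None, as_of)
print(closed, still_open)
```

Durations for open postings are right-censored, so averaging them together with closed postings will understate the eventual time a posting stays up.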

Employee Sentiment

RL provides company review data with the following information. Note that not all rating fields are required to be filled out by the reviewer. Also, some ratings (i.e., ‘culture and values’ and ‘diversity and inclusion’) were added more recently.

  • Review_id: review key

  • Review_language_id: Indicates the language of the review. Most reviews are automatically translated to English, however some remain in their native language.

  • Location, city, state, country: listed location of the reviewer

  • Job_title_name: raw position title of the reviewer

  • Review_date_time: time when review was posted

  • Review_featured: indicates whether review is featured on company page

  • Review_iscovid19: indicates whether review mentions the covid19 pandemic

  • Reviewer_employment_status: indicates employment type of reviewer (freelance, part time, intern, contract, regular)

  • Reviewer_job_ending_year: final year of the reviewer’s employment with the company

  • Reviewer_length_of_employment: number of years the reviewer worked at the company

  • Reviewer_current_job: indicates whether the reviewer is a current or former employee

  • Rating_overall: overall rating of company (integer values 1 to 5, with 5 being the best)

  • Rating_business_outlook: business outlook rating (positive, negative, neutral)

  • Rating_career_opportunities: rating of career opportunities (1 to 5, with half-points awarded, and 5 being the best)

  • Rating_ceo: approval rating of the CEO (approve, disapprove, no opinion)

  • Rating_compensation_and_benefits: rating of employee compensation and benefits (1 to 5, with half-points awarded, and 5 being the best)

  • Rating_culture_and_values: rating of company culture and values (integer values 1 to 5, with 5 being the best)

  • Rating_diversity_and_inclusion: rating of company diversity and inclusion (integer values 1 to 5, with 5 being the best)

  • Rating_recommend_to_friend: indicates whether reviewer would recommend a friend to the company (positive, negative)

  • Rating_senior_leadership: rating of senior management (1 to 5, with half-points awarded, and 5 being the best)

  • Rating_work_life_balance: rating of work-life balance (1 to 5, with half-points awarded, and 5 being the best)

  • Review_summary: title of review

  • Review_advice: reviewer’s advice to management

  • Review_pros: positive review of company

  • Review_cons: negative review of company

  • Review_count_helpful: number of users who found the review helpful

  • Review_count_not_helpful: number of users who found the review unhelpful
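
Since reviewers may leave rating fields blank and some ratings were introduced later, averages should be taken over the reviews that actually carry a value rather than over all reviews. A minimal sketch, assuming each review is a dictionary keyed by the field names listed above and that missing ratings appear as `None` or as absent keys; the review values are invented.

```python
def mean_rating(reviews, field):
    """Average a rating field over the reviews where it is present."""
    values = [r[field] for r in reviews if r.get(field) is not None]
    if not values:
        return None
    return sum(values) / len(values)


# Invented reviews; the second rating is missing from two of them.
reviews = [
    {"Rating_overall": 4, "Rating_culture_and_values": None},
    {"Rating_overall": 2, "Rating_culture_and_values": 5},
    {"Rating_overall": 3},
]

overall = mean_rating(reviews, "Rating_overall")
culture = mean_rating(reviews, "Rating_culture_and_values")
print(overall, culture)
```

Reporting the number of non-missing values alongside each average helps when comparing fields with very different coverage.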

Layoff Notices

We also collect WARN layoff data, which records when a firm plans to lay off a significant portion of its workforce. The WARN Act (Worker Adjustment and Retraining Notification Act) requires that mass layoffs and plant closures be registered with states and the Department of Labor in advance, which allows compliance assistance materials to be provided to help workers and employers understand their rights and responsibilities.

We provide the WARN data at the notice level, where each row represents a layoff notice.

  • Company_name: name of company registering layoff

  • State: state where layoff is occurring

  • City: city where layoff is occurring

  • County_or_region: The county or larger region, where applicable, where layoff is occurring

  • Num_employees: Number of employees to be laid off

  • Layoff_start_date: date as of which layoffs will be effective

  • Layoff_end_date: date by which all workers laid off in this notice will be laid off

  • Notice_date: date the notice was filed

  • Layoff_type: the type of layoff occurring (large layoff, closure, etc.)

Individual Level Data

RL also provides individual level position data. These files contain user-level information on current or historical positions, educational history, name, and demographics information.

position_file

This file contains the individual level position data

  • Position_id: Revelio job id

  • User_id: Revelio user id

  • Location: Job location string from profile

  • Country: Country of position, imputed from location

  • State: State of position; if missing, we infer it from the user’s current state

  • Msa: MSA of position; if missing, we infer it from the user’s current location

  • Title: reported position title

  • Company: Online profile company name

  • Company_cleaned: Revelio company name (cleaning function applied to company)

  • Company_priname: Revelio mapped primary company name

  • Companyurl: Online profile url for employer

  • Company: Name of the company

  • Industry: Reported industry

  • Naics_code: 2-6 digit North American Industry Classification System (NAICS) code

  • Naics_description: NAICS code definition

  • Final_parent_company: The final parent company is the top-level company of which this company is a subsidiary (i.e. the company at the top of the corporate hierarchy in which this company resides). For example, the final parent company for both Google and Waymo is Alphabet.

  • Seniority: Seniority level with 4 discrete levels

  • Role_k1000: Aggregated position role with 1000 discrete levels

  • Salary: Modeled salary for the position

  • Remote_suitability: Revelio remote suitability score

  • Soc6d: Mapped Standard Occupational Classification (SOC) code at the 6-digit level

  • Soc6d_title: Standard Occupational Classification (SOC) title

  • Description: User reported description of position

  • Start_date: Position start date if reported, null otherwise.

  • End_date: Position end date if reported, null otherwise.

  • Sequenceno: Chronological order of positions in a user’s profile

  • Rn: Estimate of position order based on sequenceno and startdate
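
Position records like these can be ordered per user to recover job-to-job moves, similar in spirit to the Prev_company and New_company fields in the Transitions dataset. The sketch below is one simple way to do that and is not the method used to build Transitions: it orders positions by Start_date only, skips positions with a null start date, and uses invented records with just three of the fields listed above.

```python
from collections import defaultdict
from datetime import date

# Invented position records with a subset of the position_file fields.
positions = [
    {"User_id": 1, "Company": "Company A", "Start_date": date(2018, 6, 1)},
    {"User_id": 1, "Company": "Company C", "Start_date": date(2021, 9, 1)},
    {"User_id": 2, "Company": "Company A", "Start_date": None},
    {"User_id": 1, "Company": "Company B", "Start_date": date(2020, 2, 1)},
]


def consecutive_moves(positions):
    """Pair each dated position with the next one for the same user."""
    by_user = defaultdict(list)
    for position in positions:
        if position["Start_date"] is not None:
            by_user[position["User_id"]].append(position)

    moves = []
    for user_id, user_positions in by_user.items():
        ordered = sorted(user_positions, key=lambda p: p["Start_date"])
        for prev, new in zip(ordered, ordered[1:]):
            moves.append((user_id, prev["Company"], new["Company"]))
    return moves


moves = consecutive_moves(positions)
print(moves)
```

In practice the Rn field already provides an estimated position order, so sorting on Rn instead of Start_date would be the natural substitute when it is available.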

user_file

This file contains the individual level user data

  • User_id: Revelio user id

  • Firstname: first name (parsed from fullname)

  • Lastname: last name (parsed from fullname)

  • Fullname: Name reported on online profile

  • Location: profile location

  • Country: profile country

  • Years: Estimated age in years

  • Title: Current job title reported on online profile

  • Currentindustry: current industry reported on online profile

  • Languages: Comma-separated list of languages listed on the user’s profile

  • Interests: Comma-separated list of interests listed on the user’s profile

  • Courses_taken: Comma-separated list of courses listed on the user’s profile

  • People_also_viewed: Comma-separated list of “people also viewed” listed on the user’s profile

  • Numconnections: number of connections (max is 500) on online profile

  • Url: url of profile

  • F_prob: probability of user being female

  • M_prob: probability of user being male

  • Api_prob: probability of user being Asian/Pacific Islander

  • Black_prob: probability of user being Black or African American

  • Hispanic_prob: probability of user being Hispanic or Latino

  • Multiple_prob: probability of user being two or more races

  • Native_prob: probability of user being American Indian or Alaskan Native

  • White_prob: probability of user being Non-Hispanic White

  • Updated_dt: the last date the profile was scraped
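
Because gender and ethnicity are delivered as probabilities rather than labels, one way to aggregate them is to sum the probabilities, which gives an expected headcount per group without forcing a hard classification on any individual. The sketch below illustrates this with F_prob and M_prob; the user rows are invented, and this is offered as a possible approach rather than a description of how the aggregated datasets are built.

```python
# Invented user rows with the two gender probability fields.
users = [
    {"User_id": 1, "F_prob": 0.9, "M_prob": 0.1},
    {"User_id": 2, "F_prob": 0.2, "M_prob": 0.8},
    {"User_id": 3, "F_prob": 0.5, "M_prob": 0.5},
]

# Summing probabilities yields the expected number of users in each group.
expected_female = sum(u["F_prob"] for u in users)
expected_male = sum(u["M_prob"] for u in users)
print(round(expected_female, 2), round(expected_male, 2))
```

Expected counts are generally fractional, and they stay consistent under roll-ups because sums of probabilities add across groups.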

education_file

This file contains the individual level education data

  • User_id: Revelio user id

  • Campus: campus name (university)

  • Campus_cleaned: Revelio cleaned campus name

  • University_priname_usa: Mapped university name from USA rankings

  • University_priname_world: Mapped university from world rankings

  • University_priname: Mapped university name

  • Universityurl: university url of online university profile

  • Major: listed degree type (e.g. bachelor of science)

  • Specialization: listed field of study (e.g. physics)

  • Startdate: start date

  • Enddate: end date

  • Sequenceno: chronological order

  • Degree: Degree title

  • Field: Degree Field

  • Degree_level: Code for degree level (0: empty, 1: High School, 2: Associate, 3: Bachelor, 4: Master, 5: MBA, 6: Doctor)
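
The Degree_level codes make it straightforward to derive a single highest degree per user from multiple education rows. The sketch below takes the maximum code per user. It assumes that a larger code always means a higher degree (which ranks MBA above Master), and the input rows and the function name `highest_degree` are invented for illustration.

```python
# Degree_level codes as documented above.
DEGREE_LABELS = {
    0: "empty",
    1: "High School",
    2: "Associate",
    3: "Bachelor",
    4: "Master",
    5: "MBA",
    6: "Doctor",
}


def highest_degree(education_rows):
    """Map each user to the label of their largest Degree_level code.

    education_rows is an iterable of (User_id, Degree_level) pairs.
    """
    best = {}
    for user_id, level in education_rows:
        best[user_id] = max(best.get(user_id, 0), level)
    return {user_id: DEGREE_LABELS[level] for user_id, level in best.items()}


# Invented rows: user 1 has two degrees, user 3 has an empty entry.
rows = [(1, 3), (1, 4), (2, 1), (3, 0)]
result = highest_degree(rows)
print(result)
```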

skill_file

This file contains the individual level skills data. RL uses proprietary algorithms to cluster the skill universe into distinct clusters of skills. The clustering can be as coarse as 25 groups and as fine as over 20,000 groups. The default skill clustering is done at 50 groups.

  • User_id: Revelio user id

  • Skill: single skill from profile

  • Skill_k50: aggregated skill with 50 discrete levels