Datasets
Workforce Dynamics
This dataset contains aggregated workforce statistics. Each row corresponds to one combination of aggregation levels and month. The broadest configuration of this dataset is generally the company and month level, where each row observes a particular company in a given month. If we include country as a level of aggregation, each row instead corresponds to a company, country, and month combination. The dataset at the company-country-month level can be aggregated to create the company-month dataset.
Consider an example output where the levels of aggregation are company and country, tracked across months, and the outcome of interest is count, the total headcount for each combination of aggregation levels and month (measured at the end of that month):
company | country | month | count |
---|---|---|---|
Company A | U.S. | 2021-01 | 10 |
Company A | U.S. | 2021-02 | 12 |
Company A | U.S. | 2021-03 | 14 |
Company A | Canada | 2021-01 | 10 |
Company A | Canada | 2021-02 | 11 |
Company A | Canada | 2021-03 | 9 |
The table can also be visualized as a graph, with month along the X-axis and the outcome count along the Y-axis. In this case, (Company A, U.S.) and (Company A, Canada) are the entities whose count is tracked over time (month) on the graph.
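One way to prepare the table for such a graph is to reshape it into one time series per entity. The following is a minimal sketch using only the Python standard library; the `rows` list re-enters the example table by hand, and the function name `to_series` is hypothetical rather than part of any RL deliverable.

```python
from collections import defaultdict

# Example rows copied from the table above: (company, country, month, count).
rows = [
    ("Company A", "U.S.", "2021-01", 10),
    ("Company A", "U.S.", "2021-02", 12),
    ("Company A", "U.S.", "2021-03", 14),
    ("Company A", "Canada", "2021-01", 10),
    ("Company A", "Canada", "2021-02", 11),
    ("Company A", "Canada", "2021-03", 9),
]


def to_series(rows):
    """Group rows into {(company, country): [(month, count), ...]}."""
    series = defaultdict(list)
    for company, country, month, count in rows:
        series[(company, country)].append((month, count))
    # "YYYY-MM" strings sort chronologically, so a plain sort orders the X-axis.
    return {entity: sorted(points) for entity, points in series.items()}


series = to_series(rows)
for entity, points in series.items():
    print(entity, points)
```

Each key of `series` is one entity, and each value is the list of (month, count) points that would form that entity's line on the graph.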
Note that it’s easy to compute a broader level of aggregation from a narrower level of aggregation. To reduce our previous example to the company and month level, we can sum across the country column to get:
company | month | count |
---|---|---|
Company A | 2021-01 | 20 (10+10) |
Company A | 2021-02 | 23 (12+11) |
Company A | 2021-03 | 23 (14+9) |
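The roll-up from the company-country-month level to the company-month level can be sketched in a few lines of Python. This is only an illustration using the standard library; the `rows` list re-enters the example data by hand, and the function name `aggregate_to_company_month` is hypothetical.

```python
from collections import defaultdict

# Example rows at the company-country-month level (copied from the first table).
rows = [
    {"company": "Company A", "country": "U.S.", "month": "2021-01", "count": 10},
    {"company": "Company A", "country": "U.S.", "month": "2021-02", "count": 12},
    {"company": "Company A", "country": "U.S.", "month": "2021-03", "count": 14},
    {"company": "Company A", "country": "Canada", "month": "2021-01", "count": 10},
    {"company": "Company A", "country": "Canada", "month": "2021-02", "count": 11},
    {"company": "Company A", "country": "Canada", "month": "2021-03", "count": 9},
]


def aggregate_to_company_month(rows):
    """Sum `count` across countries for each (company, month) pair."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["company"], row["month"])] += row["count"]
    return dict(totals)


company_month = aggregate_to_company_month(rows)
for (company, month), count in sorted(company_month.items()):
    print(company, month, count)
```

The same pattern applies to any additive outcome: dropping a level of aggregation amounts to summing over it.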
Count (float): The total number of employees for a specific level of granularity for each month.
Inflow/Outflow (float): In addition to the number of employees, we also estimate the total inflow (people joining) and outflow (people leaving) in a given month.
Salary (float): We predict the salary for each position based on role, seniority, company, and country using a regression-based model. We train this model using over 200 million salaries from job postings and publicly available labor certification applications, and use country-level inflation rates to estimate the change in salary over time. We get an out-of-sample root mean squared error (RMSE) of 14%. The Salary column in long_file shows the sum of the salaries of employees at the given granularity level.
Prestige (float): We generate a prestige score for each university, degree level, company, and individual. We start from publicly available prestige scores for universities, then incorporate the relationships between universities, individuals, and companies that we observe in our data until each individual converges on a prestige score.
Month (time): The month and year of the position are provided in “YYYY-MM” format. Each deliverable file contains monthly data up to the previous month’s end.
Company (categorical): The RL delivery file can provide insights on all public (and many private) companies. By default, companies are defined at the holding company level, where all subsidiaries held by the top parent company are included. The list of parent companies covered by Revelio includes those mapped by FactSet Research Systems Inc., in addition to companies defined manually at the client’s request.
Region (categorical): The coarsest geographical granularity is the region level. The 15 region names are as follows:
Arab States
Northern Africa
South-Eastern Asia
Central America
Northern America
Southern Asia
Central and Western Asia
Northern Europe
Southern Europe
Eastern Asia
Pacific Islands
Sub-Saharan Africa
Eastern Europe
South America
Western Europe
Country (categorical): The granularity can be specified at the country level for 232 distinct countries.
State (categorical): For US and US territories, the granularity can be specified at the state level. This level includes 50 states and 9 territories (American Samoa, Guam, Northern Mariana Islands, Puerto Rico, Virgin Islands, Minor Outlying Islands, Micronesia, Marshall Islands, Palau).
MSA (categorical): For US states and US territories, the most granular geography is the Metropolitan Statistical Area (MSA).
Job Category (categorical): In addition to geographical granularities, role-level granularities can also be specified. The most basic job category classification groups roles into the following 7 groups:
Admin
Engineer
Finance
Marketing
Operations
Sales
Scientist
The job role taxonomy is developed using our proprietary representation and clustering algorithms. We develop mathematical representations of each job title using the title itself, the text description of the position (from either individuals describing their own experiences or employers on a job posting), individuals’ skills, associates, and previous experience. Our clustering algorithm belongs to the family of hierarchical/agglomerative clustering algorithms. This means that we begin with every job title occupying its own cluster, then iteratively combine clusters based on a set of criteria. This allows complete flexibility in the number of clusters. We update this taxonomy periodically to adjust to the changing occupational landscape. Aside from the 7-group classification above, the most common job clusterings use 150 groups and 1000 groups.
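To make the bottom-up idea concrete, here is a toy sketch of agglomerative clustering in Python. It is not the proprietary algorithm described above: the 2-D vectors are made-up stand-ins for job title representations, and the merge criterion (single linkage on Euclidean distance) is an assumption chosen only for simplicity.

```python
import math
from itertools import combinations


def agglomerate(points, k):
    """Merge clusters bottom-up until only k remain (single linkage, Euclidean)."""
    k = max(1, k)
    # Start with every title occupying its own cluster.
    clusters = [[name] for name in points]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[x], points[y]) for x in a for y in b)

    while len(clusters) > k:
        # Find the two closest clusters (i < j) and merge j into i.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Made-up 2-D "representations" for four job titles.
titles = {
    "software engineer": (0.0, 0.0),
    "backend developer": (0.1, 0.2),
    "account executive": (5.0, 5.0),
    "sales manager": (5.3, 4.9),
}
print(agglomerate(titles, 2))
```

Because merging stops whenever the requested number of clusters is reached, the same procedure can produce 7, 150, or 1000 groups from one set of titles, which is the flexibility the paragraph above refers to.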
Seniority (ordinal): Seniority ranges from 1 to 4. 1 is the most junior, and 4 is the most senior. Our seniority model predicts seniority based on the title, accounting for industry and company size. Age and tenure do not directly determine our seniority measure.
Statistics that can be included are as follows:
- Levels of Aggregation:
Month (time): The month and year of the position, provided in “YYYY-MM” format
Region (categorical): The coarsest geographical granularity, with 15 discrete levels
Country (categorical): 232 different countries
State (categorical): For US and US territories, state level location
MSA (categorical): For US states and US territories, metropolitan statistical area
Job_category (categorical): Aggregated position role with 7 discrete levels
Seniority (ordinal): Seniority level with 4 discrete levels
Highest_degree (categorical): Highest degree attained by workers at the granularity level
Gender (categorical): Gender is calculated as a probability based on the likelihood of the first name being male or female
Ethnicity (categorical): Ethnicity is estimated based on the likelihood of both the first and last name as well as location
Veteran (categorical): Veteran status
- Outcomes:
Count (float): Headcount for a given month
Inflow/Outflow (float): Total inflow and outflow of employees at each granularity level
Salary (float): Sum of estimated salaries at the given granularity level
Prestige (float): Average prestige score at the given granularity level
Remote_suitability (float): Revelio score for the suitability of a role for remote work
Duration (float): Average tenure of employees at a given granularity level in years
Hiring (float): Sum of inflows at a given level of granularity over the last year divided by the average counts at that granularity over the last year
Attrition (float): Sum of outflows at a given level of granularity over the last year divided by the average counts at that granularity over the last year
Gender_entropy (float): Gender diversity score at a given granularity level
Ethnicity_entropy (float): Ethnicity diversity score at a given granularity level
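Several of the outcomes above are simple ratios of other outcomes. The sketch below illustrates two of them under stated assumptions: an average salary derived as the Salary sum divided by Count, and the Hiring and Attrition rates computed as trailing sums of flows over the average count. The function names and all numbers are hypothetical, and "the last year" is taken here to mean the 12 most recent monthly values.

```python
def average_salary(salary_sum, count):
    """Mean salary for one cell, or None when the headcount is zero."""
    if count == 0:
        return None
    return salary_sum / count


def trailing_rate(flows, counts):
    """Sum of monthly flows divided by the average monthly count.

    `flows` and `counts` are equal-length lists covering the trailing window
    (assumed here to be 12 monthly values).
    """
    if len(flows) != len(counts) or not counts:
        raise ValueError("flows and counts must be non-empty and the same length")
    average_count = sum(counts) / len(counts)
    if average_count == 0:
        return None
    return sum(flows) / average_count


# Made-up trailing 12 months for a single granularity cell.
counts = [100.0] * 12
inflows = [2.0] * 12
outflows = [1.0] * 12

hiring = trailing_rate(inflows, counts)       # 24 inflows / average count of 100
attrition = trailing_rate(outflows, counts)   # 12 outflows / average count of 100
print(hiring, attrition, average_salary(850_000.0, 10.0))
```

Returning `None` for empty cells is a design choice made for this sketch; a real pipeline might prefer to drop such cells or flag them explicitly.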
Transitions
User_id: Revelio user id
Month: Month
Prev_company: Previous company name
Prev_sector: Previous sector
Prev_industry: Previous industry
Prev_region: Previous region
Prev_jobtitle: Previous job title
Prev_job_category: Aggregated previous position role with 7 discrete levels
Prev_role_k50: Aggregated previous position role with 50 discrete levels
Prev_role_k150: Aggregated previous position role with 150 discrete levels
Prev_seniority: Previous seniority level with 4 discrete levels
Prev_enddate: End date of previous position
Prev_salary: Estimated salary of the previous role
New_company: New company name
New_sector: New sector
New_industry: New industry
New_region: New region
New_jobtitle: New job title
New_job_category: Aggregated new position role with 7 discrete levels
New_role_k50: Aggregated new position role with 50 discrete levels
New_role_k150: Aggregated new position role with 150 discrete levels
New_seniority: New seniority level with 4 discrete levels
New_startdate: Start date of new position
New_salary: Estimated salary of the new role
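As an illustration of how these columns might be used, the sketch below counts moves between company pairs and computes the percentage change in estimated salary for each transition. The rows are made up, only a subset of the columns is shown, and the helper name `salary_change_pct` is hypothetical.

```python
from collections import Counter

# Made-up transition rows using a subset of the columns listed above.
transitions = [
    {"User_id": 1, "Prev_company": "Company A", "New_company": "Company B",
     "Prev_salary": 80000.0, "New_salary": 92000.0},
    {"User_id": 2, "Prev_company": "Company A", "New_company": "Company B",
     "Prev_salary": 70000.0, "New_salary": 77000.0},
    {"User_id": 3, "Prev_company": "Company B", "New_company": "Company A",
     "Prev_salary": 90000.0, "New_salary": 85500.0},
]

# Number of moves for each (origin, destination) company pair.
flows = Counter((t["Prev_company"], t["New_company"]) for t in transitions)


def salary_change_pct(t):
    """Percentage change in estimated salary across one transition."""
    return 100.0 * (t["New_salary"] - t["Prev_salary"]) / t["Prev_salary"]


changes = [round(salary_change_pct(t), 1) for t in transitions]
print(flows)
print(changes)
```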
Job Posting Dynamics
This dataset contains aggregated job posting statistics. Every row is a distinct level of aggregation and month combination. Generally, the broadest configuration of this dataset is the company and month level. Each row would correspond to a company and month combination. For more information on the levels of aggregation, please refer to the Workforce Dynamics section.
Granularity_id: Revelio internal ID
Start_date: Month
Active_posting: Number of active postings during that month
New_posting: Number of new postings during that month
Removed_posting: Number of postings removed during that month
Active_salary_avg: Average salary for active postings during that month
New_salary_avg: Average salary for new postings during that month
Removed_salary_avg: Average salary for postings that got removed during that month
Filling_time_avg: Average time to fill
Company: Company name
State: Location of job posting
Role_k7: Aggregated posting role with 7 discrete levels
Role_k150: Aggregated posting role with 150 discrete levels
Seniority: Seniority level with 4 discrete levels
Individual Job Postings
RL also provides individual-level job postings data.
Job_id: Posting key
Company: Name of the company
Company_cleaned: Standardized company name
Post_date: Date at which the job was posted
Remove_date: Date at which the job was removed. If null, the posting has not been removed yet.
Title: Raw job title
Title_cleaned: Standardized title
Role_k150: Aggregated position role with 150 discrete levels
Role_k50: Aggregated position role with 50 discrete levels
Role_k7: Aggregated position role with 7 discrete levels
Status: Posting status, one of: open, closed, expired, or pending close
Salary: Salary information from the posting.
Location, city, state, state_long, zip, county, latitude, longitude: Listed location for posting
Region_state: Metropolitan Statistical Area of the posting
Industry: Listed industry of posting
Industry_cleaned: Standardized listed industry
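A common use of the posting and removal dates is to measure how long a posting stayed live. The sketch below shows one way to do this, treating a null removal date as "still open" and measuring up to a reference date. The `as_of` parameter, the helper name `days_open`, the key spelling `Remove_date`, and the sample rows are all assumptions made for illustration.

```python
from datetime import date


def days_open(post_date, remove_date, as_of):
    """Days a posting has been live; a null remove date means it is still open."""
    end = remove_date if remove_date is not None else as_of
    return (end - post_date).days


# Made-up postings; None stands in for a null removal date.
postings = [
    {"Job_id": "a1", "Post_date": date(2023, 1, 10), "Remove_date": date(2023, 2, 9)},
    {"Job_id": "b2", "Post_date": date(2023, 3, 1), "Remove_date": None},
]

as_of = date(2023, 3, 31)
durations = {
    p["Job_id"]: days_open(p["Post_date"], p["Remove_date"], as_of)
    for p in postings
}
print(durations)
```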
Employee Sentiment
RL provides company review data with the following information. Note that reviewers are not required to fill out every rating field. Also, some ratings (i.e., ‘culture and values’ and ‘diversity and inclusion’) were added more recently.
Review_id: Review key
Review_language_id: Indicates the language of the review. Most reviews are automatically translated to English. However, some remain in their native language.
Location, city, state, country: Listed location of the reviewer
Job_title_name: Raw position title of the reviewer
Review_date_time: Time when review was posted
Review_featured: Indicates whether review is featured on the company page
Review_iscovid19: Indicates whether review mentions the Covid-19 pandemic
Reviewer_employment_status: Indicates employment type of the reviewer (freelance, part time, intern, contract, regular)
Reviewer_job_ending_year: Final year of the reviewer’s employment with the company
Reviewer_length_of_employment: Number of years the reviewer worked at the company
Reviewer_current_job: Indicates whether the reviewer is a current or former employee
Rating_overall: Overall rating of company (integer values from 1 to 5, with 5 being the best)
Rating_business_outlook: Business outlook rating (positive, negative, neutral)
Rating_career_opportunities: Rating of career opportunities (from 1 to 5, with half-points awarded, and 5 being the best)
Rating_ceo: Approval rating of the CEO (approve, disapprove, no opinion)
Rating_compensation_and_benefits: Rating of employee compensation and benefits (from 1 to 5, with half-points awarded, and 5 being the best)
Rating_culture_and_values: Rating of company culture and values (integer values from 1 to 5, with 5 being the best)
Rating_diversity_and_inclusion: Rating of company diversity and inclusion (integer values from 1 to 5, with 5 being the best)
Rating_recommend_to_friend: Indicates whether the reviewer would recommend the company to a friend (positive, negative)
Rating_senior_leadership: Rating of senior management (from 1 to 5, with half-points awarded, and 5 being the best)
Rating_work_life_balance: Rating of work-life balance (from 1 to 5, with half-points awarded, and 5 being the best)
Review_summary: Title of review
Review_advice: Reviewer’s advice to management
Review_pros: Positive review of company
Review_cons: Negative review of company
Review_count_helpful: Number of users who found the review helpful
Review_count_not_helpful: Number of users who found the review unhelpful
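Because reviewers may leave rating fields blank, any aggregate should skip missing values rather than treat them as zero. A minimal sketch, with made-up reviews and a hypothetical helper name `mean_rating`:

```python
def mean_rating(reviews, field):
    """Average a rating field over the reviews that actually filled it in."""
    values = [r[field] for r in reviews if r.get(field) is not None]
    if not values:
        return None
    return sum(values) / len(values)


# Made-up reviews: a rating may be None or absent when the reviewer skipped it.
reviews = [
    {"Review_id": 1, "Rating_overall": 4, "Rating_work_life_balance": 3.5},
    {"Review_id": 2, "Rating_overall": 2, "Rating_work_life_balance": None},
    {"Review_id": 3, "Rating_overall": 5},
]

print(mean_rating(reviews, "Rating_overall"))
print(mean_rating(reviews, "Rating_work_life_balance"))
print(mean_rating(reviews, "Rating_culture_and_values"))
```

The last call returns `None` because no review in the sample carries that rating, which mirrors the situation for ratings that were introduced more recently.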
Layoff Notices
We also collect WARN layoff data, which records when a firm plans to lay off a significant portion of its workforce. The WARN Act (Worker Adjustment and Retraining Notification Act) ensures that mass layoffs and plant closures are registered with states and the Department of Labor in advance, which allows for the provision of compliance assistance materials that help workers and employers understand their rights and responsibilities.
We provide the WARN data at the notice level, where each row represents a layoff notice.
Company_name: Name of company registering layoff
State: State where layoff is occurring
City: City where layoff is occurring
County_or_region: The county or larger region, where applicable, where layoff is occurring
Num_employees: Number of employees to be laid off
Layoff_start_date: Date as of which layoffs will be effective
Layoff_end_date: Date by which all workers laid off in this notice will be laid off
Notice_date: Date the notice was filed
Layoff_type: The type of layoff occurring (large layoff, closure, etc)
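Since each row is one notice, typical summaries involve summing across notices or comparing dates within a notice. The sketch below totals affected employees by state and computes the number of days between the notice date and the layoff start date. The notices are made up for illustration.

```python
from collections import defaultdict
from datetime import date

# Made-up notices using a subset of the columns listed above.
notices = [
    {"Company_name": "Company A", "State": "CA", "Num_employees": 120,
     "Notice_date": date(2023, 1, 2), "Layoff_start_date": date(2023, 3, 3)},
    {"Company_name": "Company B", "State": "CA", "Num_employees": 80,
     "Notice_date": date(2023, 2, 1), "Layoff_start_date": date(2023, 4, 2)},
    {"Company_name": "Company C", "State": "NY", "Num_employees": 50,
     "Notice_date": date(2023, 5, 1), "Layoff_start_date": date(2023, 5, 31)},
]

# Total employees affected per state.
totals_by_state = defaultdict(int)
for notice in notices:
    totals_by_state[notice["State"]] += notice["Num_employees"]

# Days of advance notice given before the layoffs take effect.
lead_days = [
    (notice["Layoff_start_date"] - notice["Notice_date"]).days
    for notice in notices
]

print(dict(totals_by_state))
print(lead_days)
```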
Individual Level Data
RL also provides individual-level position data. These files contain user-level information on current or historical positions, educational history, name, and demographic information.
position_file
This file contains the individual level position data. Each row is a position held by an individual.
Position_id: Revelio job ID
User_id: Revelio user ID
Location: Job location string from profile
Region: Region of position (Ex. Southern Asia, Western Europe)
Country: Country of position (imputed from location)
State: State of position (if missing, we infer it from the user’s current state)
Msa: MSA of position (if missing, we infer it from the user’s current location)
Company: Company name (raw from online profile)
Companyurl: URL for employer (from online profile)
Company_cleaned: Company name (from online profile, cleaned of special characters)
Title: Position title (raw from online profile)
Mapped_role: Position title (Revelio mapped)
Seniority: Seniority level with 4 discrete levels
Role_k7: Aggregated position role with 7 discrete levels (also available at other levels of aggregation)
Salary: Modeled salary for the position
Remote_suitability: Revelio remote suitability score
Description: User reported description of position
Startdate: Position start date if reported, null otherwise.
Enddate: Position end date if reported, null otherwise.
Rn: Chronological order of the position in a user’s profile (i.e., 1 corresponds to the earliest position reported)
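The Rn column makes it straightforward to rebuild a user's career path from unordered position rows. The following is a minimal sketch with made-up rows and a hypothetical helper name `career_path`; pairing consecutive positions is shown only as one possible way to think about moves between roles.

```python
def career_path(positions, user_id):
    """Titles held by one user, ordered from earliest (Rn = 1) to latest."""
    own = [p for p in positions if p["User_id"] == user_id]
    return [p["Title"] for p in sorted(own, key=lambda p: p["Rn"])]


# Made-up position rows, deliberately out of order.
positions = [
    {"User_id": 7, "Rn": 2, "Title": "Senior Analyst"},
    {"User_id": 7, "Rn": 1, "Title": "Analyst"},
    {"User_id": 8, "Rn": 1, "Title": "Engineer"},
    {"User_id": 7, "Rn": 3, "Title": "Manager"},
]

path = career_path(positions, 7)
# Consecutive positions form (previous, new) pairs.
moves = list(zip(path, path[1:]))
print(path)
print(moves)
```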
user_file
This file contains the individual level user data. Each row is an individual’s public profile.
User_id: Revelio user id
Firstname: First name (parsed from fullname)
Lastname: Last name (parsed from fullname)
Fullname: Name reported on online profile
Location: Profile location
Country: Profile country
Title: Current job title reported on online profile
Currentindustry: Current industry reported on online profile
Url: URL of profile
F_prob: Probability of user being female
M_prob: Probability of user being male
Api_prob: Probability of user being Asian/Pacific Islander
Black_prob: Probability of user being Black or African American
Hispanic_prob: Probability of user being Hispanic or Latino
Multiple_prob: Probability of user being two or more races
Native_prob: Probability of user being American Indian or Alaskan Native
White_prob: Probability of user being Non-Hispanic White
Highest_degree: The highest level of education reported (Ex. Bachelor, High School)
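Because gender and ethnicity are delivered as probabilities rather than hard labels, one reasonable way to summarize a group is to average the probabilities, which gives the expected share of each category. This is a sketch under that assumption; the users and the helper name `expected_share` are made up.

```python
def expected_share(users, field):
    """Average probability across users, i.e. the expected share of the group."""
    if not users:
        return None
    return sum(u[field] for u in users) / len(users)


# Made-up users with gender probabilities only.
users = [
    {"User_id": 1, "F_prob": 0.9, "M_prob": 0.1},
    {"User_id": 2, "F_prob": 0.2, "M_prob": 0.8},
    {"User_id": 3, "F_prob": 0.5, "M_prob": 0.5},
]

female_share = expected_share(users, "F_prob")
male_share = expected_share(users, "M_prob")
print(round(female_share, 3), round(male_share, 3))
```

Averaging probabilities avoids forcing an uncertain profile (such as the third user here) into a single category.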
education_file
This file contains the individual level education data. Each row is an educational record.
User_id: Revelio user id
Campus: Campus name (university)
University_priname_usa: Mapped university name from USA rankings
University_priname_world: Mapped university from world rankings
University_priname: Mapped university name
Universityurl: URL of the university’s online profile
Major: Listed degree type (e.g. Bachelor of Science)
Specialization: Listed field of study (e.g. Physics)
Startdate: Start date
Enddate: End date
Sequenceno: Chronological order
Degree: Degree title
Field: Degree field
Degree_level: Code for degree level (0: empty, 1: High School, 2: Associate, 3: Bachelor, 4: Master, 5: MBA, 6: Doctor)
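The Degree_level codes can be decoded with a simple lookup. The sketch below also derives a highest degree per user by taking the largest code, which assumes the codes are ordered from lowest to highest attainment; that ordering, the helper name `highest_degree_by_user`, and the sample records are assumptions made for illustration.

```python
# Code-to-label mapping taken from the Degree_level description above.
DEGREE_LEVELS = {
    0: "empty",
    1: "High School",
    2: "Associate",
    3: "Bachelor",
    4: "Master",
    5: "MBA",
    6: "Doctor",
}


def highest_degree_by_user(records):
    """Return {User_id: label} using the largest Degree_level code per user."""
    best = {}
    for record in records:
        user_id, level = record["User_id"], record["Degree_level"]
        if level > best.get(user_id, -1):
            best[user_id] = level
    return {user_id: DEGREE_LEVELS[level] for user_id, level in best.items()}


# Made-up education records.
records = [
    {"User_id": 1, "Degree_level": 3},
    {"User_id": 1, "Degree_level": 4},
    {"User_id": 2, "Degree_level": 1},
    {"User_id": 3, "Degree_level": 0},
]

print(highest_degree_by_user(records))
```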
skill_file
This file contains the individual level skills data. RL uses proprietary algorithms to cluster the skill universe into distinct clusters of skills. The clustering can be as coarse as 25 groups and as fine as over 20,000 groups. The default skill clustering is done at 50 groups.
User_id: Revelio user id
Skill: Single skill from profile
Skill_k50: Aggregated skill with 50 discrete levels (also available at other levels of aggregation)