# FAQs

## Trials and Delivery

What are data delivery types? We deliver data through flat files, API, dashboard, or custom reports. Because the delivery files are large, most data is delivered in CSV format through an Amazon S3 bucket. Each row of the dataset represents an aggregate count of the number of employees at the particular granularity level.

When is your data delivered? And what period does it cover? We deliver all data for the latest month by the 15th of each following month. For example, by January 15th, we would deliver all inflows and outflows that occurred over the course of December.

How long do trials run? We do not have a hard limit on the length of trials, but we’d like trials to be limited to two to three months if possible. Dashboard access expires after three months, and we will check in on your progress with the data feed at that point. We’re more than happy to work with you during the trial and share example use cases and sample code to make the process as seamless as possible.

## Coverage

When does your data start from? Our data goes back to 2008. This is possible because each online professional profile contains a full history, giving us rich longitudinal/panel data. Reliable postings data is available starting in 2018.

How many companies do you cover? Our data covers 5,000 public companies and 1 million private companies.

Can you cover companies that are not in your sample file? Yes. Further, if there are any specific companies you are interested in tracking that are not included in the trial, we can often include them upon request.

How frequently is the data updated? Our standard update cadence is monthly. On or before the 15th of each month, we fully refresh our data sets by merging in new data and applying various models to correct for bias and lag.

If you need data more frequently than monthly, we have a daily feed as well. The daily feed is not as comprehensive as the monthly updates, and is much closer to being “raw” data.

How do you treat company subsidiaries? When a company acquires or merges with another company, we include the subsidiary as part of the parent company, even retroactively for the period before the acquisition took place. For example, we include all Whole Foods employees as part of Amazon during 2008-2016, even though Amazon only acquired Whole Foods in 2017. We do this to avoid an artificial spike in inflows and outflows when an acquisition or spinoff occurs. We can provide data for companies without their subsidiaries upon request.
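The retroactive roll-up can be pictured with a minimal sketch. The `PARENT` mapping, the `roll_up` helper, and the headcount numbers are all made up for illustration; this is not the actual pipeline.

```python
# Hypothetical subsidiary -> parent mapping (illustrative only).
PARENT = {"Whole Foods": "Amazon"}


def roll_up(rows):
    """Aggregate (company, year, headcount) rows to the parent company.

    The mapping is applied to every year, including years before the
    acquisition, so the parent series shows no artificial jump.
    """
    totals = {}
    for company, year, headcount in rows:
        key = (PARENT.get(company, company), year)
        totals[key] = totals.get(key, 0.0) + headcount
    return totals


# Made-up headcounts, not real data.
rows = [
    ("Amazon", 2015, 100.0),
    ("Whole Foods", 2015, 40.0),
    ("Amazon", 2018, 150.0),
    ("Whole Foods", 2018, 45.0),
]
rolled = roll_up(rows)
```

Here `rolled` contains only Amazon keys, with the Whole Foods rows folded in for both 2015 and 2018.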

## Modeling

How do you compensate for some people not having online profiles? Because we collect our data from online professional profiles, our data is drawn from a non-representative sample of the underlying population. We apply sampling weights to adjust for roles and locations that are underrepresented in the sample. For example, if an engineer in the US has a 90% chance of having an online profile, we would consider every engineer in the US that we see to actually represent about 1.1 people (1/0.9). If a nurse in Germany has a 25% chance of having an online profile, we would consider every nurse in Germany that we see to actually represent 4 people (1/0.25). This allows us to approximate the true size of the underlying population as closely as possible.
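The weighting in these examples is an inverse-probability weight. A minimal sketch, using the two coverage rates from the text and made-up observed counts (`sampling_weight`, `coverage`, and `observed` are illustrative names, not part of any delivered file):

```python
def sampling_weight(profile_probability: float) -> float:
    """Number of people that one observed profile is taken to represent."""
    if not 0.0 < profile_probability <= 1.0:
        raise ValueError("profile_probability must be in (0, 1]")
    return 1.0 / profile_probability


# Coverage rates from the examples above; observed counts are made up.
coverage = {("engineer", "US"): 0.90, ("nurse", "Germany"): 0.25}
observed = {("engineer", "US"): 900, ("nurse", "Germany"): 50}

# Weighted estimate of the underlying population for each group.
estimated = {
    key: observed[key] * sampling_weight(prob) for key, prob in coverage.items()
}
```

With these inputs, 900 observed US engineers scale up to roughly 1,000 people and 50 observed German nurses scale up to 200.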

How does company mapping work? Company mapping is a three-step process. The first step matches each desired company to a company in our coverage universe through a heuristic matching process based on common identifiers such as name, ticker, website, ISIN, etc. The second step identifies all the subsidiaries of each company using labeled data from Factset and D&B. The third step gathers a set of positions for each company by clustering together sets of company names used in positions that point to the same online professional URL and performing a many-to-one matching of name clusters to each identified company and subsidiary. This allows us to identify all the colloquial names for each requested company and all of its subsidiaries.

Why are the counts, inflows, and outflows columns floats, rather than integers? If gender and/or ethnicity are included as granularity columns, an observation of a position is assigned to every category probabilistically (where the total probability sums to 1). Therefore, the headcount in a single row is not an integer, but when aggregated over these categories, the raw count sums to an integer. However, in order to correct for sampling bias in our data, we use models to predict true inflow, outflow, and headcount. The expected counts are not integers, and therefore the estimated headcount (the “count” column), even when aggregated, tends not to be an integer.
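The first effect (fractional rows that still sum to a whole number before bias correction) can be seen in a small sketch. The category labels and probabilities below are invented for illustration:

```python
from collections import defaultdict

# Three hypothetical observed positions. Each one is split across
# categories with probabilities that sum to 1.
positions = [
    {"female": 0.8, "male": 0.2},
    {"female": 0.3, "male": 0.7},
    {"female": 0.5, "male": 0.5},
]


def fractional_counts(observations):
    """Sum the per-category probabilities over all observed positions."""
    counts = defaultdict(float)
    for probabilities in observations:
        for category, probability in probabilities.items():
            counts[category] += probability
    return dict(counts)


counts = fractional_counts(positions)
total = sum(counts.values())
```

Each category count is fractional (about 1.6 and 1.4 here), while `total` recovers the three observed positions up to floating-point rounding.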

Is it true that the change in counts must be equal to inflows minus outflows? Yes.
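A quick consistency check on a delivered series could look like the sketch below. It assumes (this timing convention is an assumption, not stated in the FAQ) that the flows recorded for month `t` explain the change from the count at `t-1` to the count at `t`. The function name and tolerance are illustrative.

```python
def flow_identity_violations(counts, inflows, outflows, tol=1e-6):
    """Return the months t where counts[t] - counts[t-1] differs from
    inflows[t] - outflows[t] by more than tol."""
    violations = []
    for t in range(1, len(counts)):
        change = counts[t] - counts[t - 1]
        net_flow = inflows[t] - outflows[t]
        if abs(change - net_flow) > tol:
            violations.append(t)
    return violations


# Made-up series: month 3 is deliberately inconsistent.
counts = [100.0, 103.5, 102.0, 110.0]
inflows = [0.0, 5.0, 1.5, 4.0]
outflows = [0.0, 1.5, 3.0, 2.0]
bad_months = flow_identity_violations(counts, inflows, outflows)
```

A tolerance is used because the columns are floats rather than integers.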

In what cases will your employee headcounts differ from employee headcounts in a company’s annual report? Companies’ 10-Ks understate headcount because they omit contingent workers, who in many cases make up the majority of a company’s workforce. We include everyone who identifies with a company, including the contingent workforce, which gives a more comprehensive view of company headcounts.

Have you done any post checking on accuracy? How do you build confidence that the data are representative? One of the big challenges with workforce data is that companies tend to report only their employees, not contingent workers. However, for many companies the contingent workforce can be quite large, in some cases one-third to two-thirds of the workforce. So we have to find a way to validate the data other than comparing our numbers to the reported counts. We use information in our salary prediction model, such as H-1B visa data and job posting data, to collect over 80 million salaries that match to each role at a company. We then add up those salaries and compare them to each public company’s published operating expenses. The predicted vs. published operating expenses have a root mean squared error of 8%, indicating close alignment with real values.
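One way to read a percentage RMSE is as the root mean square of the relative errors between predicted and published values. The sketch below uses that interpretation, which is an assumption; the FAQ does not spell out the exact definition, and the numbers are invented.

```python
import math


def relative_rmse(predicted, published):
    """Root mean squared relative error, as a fraction (0.08 means 8%)."""
    if len(predicted) != len(published) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [((p - a) / a) ** 2 for p, a in zip(predicted, published)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up operating expenses for two hypothetical companies.
error = relative_rmse(predicted=[110.0, 90.0], published=[100.0, 100.0])
```

Dividing by the published value makes errors comparable across companies of very different sizes.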

How does prestige work? Prestige is computed by collecting university rankings and deducing a base score for people from those universities. We then derive a score for companies where those users have worked, based on information available at the user level. Using those company scores, we iteratively generate new user scores and company scores, filling in information gaps until the algorithm converges. The model is constructed so that even if someone went to a low-ranking school but then went on to work at a prestigious firm, their ranking would still be high.

We generate prestige for each university, degree level, company, and individual. We initialize the prestige of universities from publicly available scores and then propagate it through the relationships between universities, individuals, and companies that we observe in our data until each individual converges on a prestige score. The prestige for each individual is assumed constant historically. Prestige in long_file shows the average prestige of employees at the particular granularity level.
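The iterative propagation can be illustrated with a toy sketch. Everything here is a simplification chosen for illustration: the 50/50 blend between a person’s base score and their employers’ scores, the use of plain averages, and all names and numbers are assumptions, not the production model.

```python
def propagate_prestige(base_scores, employment, max_iter=100, tol=1e-9):
    """Toy score propagation between people and companies.

    base_scores: person -> university-derived score, or None if unknown.
    employment:  person -> list of companies the person has worked at.
    """
    known = [s for s in base_scores.values() if s is not None]
    if not known:
        raise ValueError("at least one base score is required")
    default = sum(known) / len(known)
    person = {p: (default if s is None else s) for p, s in base_scores.items()}
    company = {}
    for _ in range(max_iter):
        # Company score: average score of the people who worked there.
        members = {}
        for p, companies in employment.items():
            for c in companies:
                members.setdefault(c, []).append(person[p])
        company = {c: sum(v) / len(v) for c, v in members.items()}
        # Person score: blend of base score and average employer score.
        updated = {}
        for p, base in base_scores.items():
            companies = employment.get(p, [])
            if not companies:
                updated[p] = person[p]
                continue
            employer_mean = sum(company[c] for c in companies) / len(companies)
            updated[p] = employer_mean if base is None else 0.5 * base + 0.5 * employer_mean
        delta = max(abs(updated[p] - person[p]) for p in person)
        person = updated
        if delta < tol:
            break
    return person, company


# Hypothetical people: B attended a low-ranked school but works with A.
people, companies = propagate_prestige(
    base_scores={"A": 0.9, "B": 0.2},
    employment={"A": ["X"], "B": ["X"]},
)
```

In this toy setup, B’s score ends above their 0.2 base score because of their employer, which mirrors the behavior described above in a much simplified form.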

How does seniority work? Our seniority model, by design, is based on the expected seniority of the title, accounting for industry and company size. Age and tenure do not directly determine seniority. If there are two people with identical titles in the same industry in companies of the same size, those people in their position would get the same seniority score even if they had very different lengths of time in the workforce.

How does gender/ethnicity entropy work? The gender/ethnicity metric aims to rank the diversity of a workforce at a given company on an ordinal scale. The metric is available at any level of granularity, and takes into account both location and job category when computing the score of any particular group of individuals. The metric is designed on a scale of 1-10, and scores are reported independently for gender and ethnicity diversity. The scale is linear, where a score of 1 indicates poor diversity and a score of 10 indicates high diversity. The scale may also be interpreted as a percentile with respect to peers (e.g. 1 corresponds to between the 0th and 10th percentile, 2 corresponds to between the 10th and 20th percentile). To design the metric, we used a modified version of a Shannon Index to calculate diversity scores in relation to a company’s peers, while taking into account occupation and region. The method can easily be extended to take into account other regional/occupation differences as desired. The ethnicity diversity score may also be tuned to have increased sensitivity to particular groups. To calculate the final score, we assign the quantile into which the target company falls in relation to all other peer companies, given the region and occupation type. This provides a relative score that is easily interpretable.
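A rough sketch of the two ingredients, a Shannon index and a decile score against peers, is shown below. It uses the standard, unmodified Shannon index, so it is only an approximation of the metric described above; the function names and peer values are made up.

```python
import math


def shannon_index(group_sizes):
    """Standard Shannon index H = -sum(p * ln p) over non-empty groups."""
    total = sum(group_sizes)
    shares = [size / total for size in group_sizes if size > 0]
    return -sum(p * math.log(p) for p in shares)


def decile_score(target_value, peer_values):
    """Map the share of peers below the target to a 1-10 score.

    Score 1 covers the 0th-10th percentile, score 2 the 10th-20th, etc.
    """
    below = sum(1 for v in peer_values if v < target_value)
    return min(10, (below * 10) // len(peer_values) + 1)


# Hypothetical peer group: ten companies with made-up index values.
peers = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.55, 0.60, 0.65, 0.69]
target = shannon_index([50, 50])  # an evenly split two-group workforce
score = decile_score(target, peers)
```

An even two-way split gives the maximum two-group index of ln 2, which ranks above every made-up peer here and so receives a score of 10.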

How are attrition rates, hiring rates and growth rates calculated? The attrition rate $$a_g(t)$$ and hiring rate $$h_g(t)$$ are calculated at the chosen granularity level $$g$$ for month $$t$$. In the formulae below, $$o_g(j)$$ and $$i_g(j)$$ denote the outflows and inflows at granularity level $$g$$ in month $$j$$, and $$\bar{c}_g(t)$$ denotes the average headcount at that granularity over the last 12 months. The growth rate is the difference between the hiring rate and the attrition rate.

\begin{align}\begin{aligned}a_g(t) = 100 \cdot \frac{\sum_{j=t-11}^{t} o_g(j)}{\bar{c}_g(t)}\\h_g(t) = 100 \cdot \frac{\sum_{j=t-11}^{t} i_g(j)}{\bar{c}_g(t)}\\\bar{c}_g(t) = \frac{1}{12} \sum_{j=t-11}^{t} c_g(j)\end{aligned}\end{align}
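The formulae translate directly into code. The sketch below assumes monthly series stored as lists indexed by month, with at least 12 months available up to and including `t`; the function name and the sample numbers are illustrative.

```python
def trailing_rates(counts, inflows, outflows, t):
    """Attrition, hiring, and growth rates (in percent) for month index t,
    using the 12-month window j = t-11 .. t."""
    if t < 11:
        raise ValueError("need 12 months of history up to and including t")
    window = range(t - 11, t + 1)
    average_count = sum(counts[j] for j in window) / 12
    attrition = 100 * sum(outflows[j] for j in window) / average_count
    hiring = 100 * sum(inflows[j] for j in window) / average_count
    return {"attrition": attrition, "hiring": hiring, "growth": hiring - attrition}


# Made-up flat series: 100 people, 2 hires and 1 departure each month.
rates = trailing_rates(
    counts=[100.0] * 12,
    inflows=[2.0] * 12,
    outflows=[1.0] * 12,
    t=11,
)
```

For this flat series the attrition rate is 12%, the hiring rate is 24%, and the growth rate is 12%.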

## Known Issues

### Skill Data Sparsity

The Issue: Our profile data is combined from multiple sources, all of which gather only publicly available profiles. Around May 2021, user skills disappeared from the majority of public profiles. They remain visible on a small minority of the public profiles we collect, but it may be important for your use case to understand that we do not see the most recently added skills for most existing users, and we do not see any skills for most new users.

The Scope: This affects the user Skill File and the Skill Dynamics File. The Skill Dynamics File still tracks the skills that we do see as they follow a user to new positions.

The Outlook: We will continue to capture the skills when they appear on public profiles, and monitor for any changes in their availability. We have a model close to production which predicts missing skills that may be useful in filling the gaps in the Skill File as well as improving the currency of the Skill Dynamics File.

Updated: June 3, 2022

## Bug Fixes

### Highest Degree

(June 13, 2022) We fixed a bug where the reported highest degree was sometimes clearly not the person’s highest degree.