Words Matter: Gender, Jobs, and Applicant Behavior | Center for the Advanced Study of India (CASI)

CASI Seminar

in partnership with the South Asia Center

Kanika Mahajan

Assistant Professor of Economics, Ashoka University

Thursday, April 7, 2022 - 12:00PM

A Virtual CASI Seminar via Zoom — 12 noon EDT | 9:30pm IST

(English captions & Hindi subtitles available)

About the Seminar:
Prof. Mahajan examines employer preferences for hiring men vs. women using newly collected data on approximately 160,000 job ads posted on an online job portal in India, linked with 6.45 million applications. She applies machine learning algorithms on text contained in job ads to construct measures that indicate whether the job ad text is predictive of an employer's explicit gender preference. She finds that advertised wages are lowest in jobs where employers prefer women, even when the job text only contains implicit markers of female preference, and that these jobs also attract a larger share of female applicants. She then systematically uncovers what lies beneath these relationships by retrieving words that are predictive of an explicit gender preference, or gendered words, and assigning them to the categories of hard and soft skills, personality traits, and flexibility. She finds that skills-related female-gendered words have low returns but attract a higher share of female applicants, while male-gendered words indicating decreased flexibility (e.g., frequent travel or unusual working hours) have high returns but result in a smaller share of female applicants. This contributes to a gender earnings gap. Her findings highlight how non-wage elements in job text are associated with search behavior in the labor market.

About the Speaker:
Kanika Mahajan is an Assistant Professor of Economics at Ashoka University, Sonepat, India. Her primary research interests include empirical development economics in the field of gender, labor, and agriculture. As part of her research agenda on gender and labor, she is currently working on issues around stagnation of women's labor force participation in urban India and decline in female employment in rural areas—exploring both the supply side and the demand side linkages. Her other projects in the area of gender examine links between economic shocks and women's employment, gender and sanitation, and violence against women. In the context of COVID-19, her research examines resilience of supply chains in agriculture and manufacturing sectors in India.

FULL TRANSCRIPT:

Naveen Bharathi:

Hello and welcome to CASI's spring seminar series. I'm Naveen. I'm a postdoctoral researcher at CASI and, along with my colleagues, moderate this series. Today's talk is in partnership with the South Asia Center. Before I introduce today's speaker, just to put in a plug for our last seminar for the spring semester on April 14th, same time, by Dr. Swagato Ganguly of The Times of India, who's also currently a visiting CASI fellow. He'll talk about Indian foreign policy and the current state of play in Indo-Pacific geopolitics. Please register for this event on CASI's website. Today, I'm delighted to welcome Professor Kanika Mahajan, Assistant Professor of Economics at Ashoka University. Her primary research interests include empirical development economics in the field of gender, labor and agriculture. As part of her research agenda on gender and labor, she's currently working on issues around stagnation of women's labor force participation in urban India and decline in female employment in rural areas. Her other projects in the area of gender examine links between economic shocks and women's employment, gender and sanitation and violence against women. In the context of COVID-19, her research examines the resilience of supply chains in agriculture and manufacturing sectors in India.

In today's seminar Kanika examines employer preferences for hiring men versus women using newly collected data on approximately 160,000 job ads posted on an online job portal in India. She applies machine learning algorithms on text contained in job ads to construct measures to indicate whether the job ad text is predictive of an employer's explicit gender preference. She also finds gender earnings gaps and hiring preferences. Her findings highlight how non-wage elements in job ads are associated with search behavior in the labor market. Kanika will present for 30 minutes and we'll have questions from the audience for around 30 minutes. And if you have any clarificatory questions you want to ask her, please enter them in the chat box directly to me, and please keep your questions brief and to the point so we can get to as many as possible, and apologies in advance if I can't get to everyone.

So if you have questions at the end, again, please enter them directly to me, Naveen Bharathi, in the chat box. Finally, please be mindful about muting your mics throughout the duration of the event. Once again, thank you for your interest and for being here today. With that, I'm going to hand over the platform to Kanika. Thank you.

Kanika Mahajan:

Thank you, Naveen and thank you, CASI, for having me over. Just sharing my screen. Okay. I hope everyone can see my screen. So I'm very glad to be presenting again at CASI, though virtually this time, and hopefully we'll be at CASI soon in person. So the paper and the work that I'm presenting today is co-authored with Sugat Chaturvedi at ISI, Delhi and with Zahra Siddique at the University of Bristol. And this paper is motivated by the extant literature in economics, which has looked at gender disparities in employment and wages. Since we're looking in the context of India at these gender gaps, specifically to mention, India has probably one of the largest employment gaps in gender across the world. Only about 22% of urban women participate in the labor market whereas about 90% of urban men are employed in the working age group. The wage disparity is also quite large. On an average, an urban woman who's employed would earn about 65 to 70% of the wages of an employed man.

And a large part of this gender wage gap has been shown to be explained by occupational segregation, not only in India, but across the world. So occupational segregation has received a lot of attention in the literature on gender gaps. However, of late, more recent literature has started looking at within-occupation disparity. So what the literature finds is that the occupational gaps have been closing down. However, the wage gaps have not been closing at the rate at which the occupation disparities have been closing. And hence this literature focuses on looking at within-occupation attributes, which can explain some of the disparities that you observe in the labor market across gender.

And here in this paper, what we want to focus on is the application stage. So the first stage in the labor market, when the candidates may be choosing a particular job to apply to, and what is the role that the attributes which are shown in the job ads can play in the process? How can these attributes affect the posted wage and how do they actually direct the job search behavior of men and women differently in the labor market, and hence affect the final outcomes that we see in terms of the gender wage gaps, even at the application stage?

So essentially that's the main objective of this paper. So what we do in this paper to achieve our objective is we exploit this occurrence, which is peculiar to developing countries. It's something that you won't see in developed-country labor markets: the job ads in India allow employers to state an explicit gender preference. So some of the government job portals actually have a field which the employers can fill in, where they can mention whether they want a man or a woman for a job, whereas private portals don't allow you to directly put it in a field, but employers have found their way. So for example, here on the left-hand side, you see a job ad where, in the job description, the employer has written that the applying candidate should be a female only. Whereas in the right-hand side job, you can see that the job title contains "male".

So clearly the employer is proactively looking for a male to hire in this particular job. This is the aspect of the Indian labor market that we actually exploit in this paper. So we first examine these explicit gender preferences of employers on an online job portal. We then use the expression of the explicit preferences to actually get how predictive any job ad may be of a gender preference, given the text that is coming in these job ads, and using that, we basically arrive at measures of implicit maleness or femaleness that could be associated with job ads. We then uncover what are the attributes which are going behind the measure of implicit maleness or femaleness that we are finding. And we systematically uncover these attributes by looking at four main categories which are important in the labor market, which are hard skills, soft skills, personality, and job flexibility.

And we then see how these attributes are related to posted wages and the search behavior of male and female candidates on the platform and how they may eventually lead to the gender wage gaps arising at the application stage. So, broad findings: we find that on this portal, about 7.5% of job ads exhibit an explicit gender preference, of which slightly more than half are for women and about 2.5% are for men. And job ads which have an explicit female preference actually have the lowest advertised wage or posted wage on the portal. Now jobs may not have an explicit preference, but they may contain words or text which is like that of a job ad which explicitly requests a male or a female. And there we find that if the text of a job ad is predictive of some kind of a female preference, or at least what we see as female preference using the explicit job ads, then it has a significantly lower posted wage.

And it also receives an increased share of female applicants. Lastly, when we undertake the category-wise analysis, we find that the two most important categories, which seem to be directing the behavior of candidates, are skills and job flexibility. So as I [inaudible 00:08:51] detail out later, in terms of how exactly we pin down on skills and flexibility as being the two critical factors in explaining the behavior. In terms of literature that we contribute to, there is literature which has looked at employer gender preferences in job ads, using data on China and Indonesia, and the primary aim of the literature has been to see what kind of job ads have a gender preference, largely eliciting a negative skill-targeting relationship where job ads which have such gender preferences tend to be low-skilled types. And the second behavior this literature has uncovered is what happens if you take away a gender request from a job ad.

So using an exogenous experiment in China, [inaudible 00:09:38] actually show that if you take away the presence of a gender request suddenly from a job portal, there is a diversification in the way candidates apply to different jobs. So in this paper we build on this literature, and we extend it to show what is really going on behind these preferences. Why are the employers even posting these preferences? And what are the kind of attributes of jobs which may be correlated with even the existence of these preferences in the labor market to begin with? The second literature that we relate to is the literature that looks at job ad text in general and how it can lead to not just gender disparities, but labor market outcomes in general. So for instance, if you change the occupation to a male type, then do you see the female responses fall in response to just changing the occupational nature of a particular job?

The third literature that we contribute to is the one that looks at within-occupation job attributes, which may matter. And this literature has largely looked at commuting times as well as part-time versus full-time flexibility that may be valued more by females vis-à-vis men in the labor market. So let me jump into the data that we are using in this study for the purposes that I've already highlighted. So we use data on about 1.6 [inaudible 00:11:01] job ads from an Indian online job portal that matches young, urban job seekers. On an average, these are mid- to high-skill job seekers. So that's the labor market that this job portal is capturing. We use the job ads which have been posted between July 2018 and February 2020. An interesting feature of this data set is that about 87% of the job ads posted on this portal have an advertised wage. And usually papers that use wages from online portals or from job ads at least don't have that much [inaudible 00:11:34]. So it's interesting that we are able to do much more with this [inaudible 00:11:37] data as compared to the existing literature.

And we look at about 1 million job seekers. And during this time period, they have made about 6.45 million applications. So some of the key variables that we construct in our analysis: one is the explicit gender preference. Now, as I mentioned, this portal doesn't have... It's a private portal. It doesn't have any field per se. So we do a textual search where we search for terms which can stand for an explicit female preference, like female, females, women, if any of these terms are coming either in the title or in the job description, and we call these ads F jobs. We then search for words that may indicate an explicit male preference. And we call these M jobs. Some ads may contain neither of the words, or they may contain both of them. These ads are referred to as N jobs in our analysis.
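A minimal sketch of the textual search described above. The female terms are the ones named in the talk; the male terms and the function name `classify_ad` are assumptions made for illustration and are not the authors' actual term lists.

```python
import re

# Female terms as named in the talk; the male terms are an assumed mirror image.
# Word boundaries (\b) keep "male" from matching inside "female".
FEMALE_TERMS = re.compile(r"\b(female|females|women)\b", re.IGNORECASE)
MALE_TERMS = re.compile(r"\b(male|males|men)\b", re.IGNORECASE)

def classify_ad(title: str, description: str) -> str:
    """Label an ad F, M, or N from its title and description text."""
    text = f"{title} {description}"
    has_female = bool(FEMALE_TERMS.search(text))
    has_male = bool(MALE_TERMS.search(text))
    if has_female and not has_male:
        return "F"
    if has_male and not has_female:
        return "M"
    # Neither kind of term, or both kinds: treated as an N job.
    return "N"
```

A keyword match like this can mislabel ads (for example, "women's clothing store"), which is one reason a manual check of the labels is useful.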

And we also get these ads double-checked. So apart from doing the textual search, we also hired two independent annotators who actually went through these job ads and then confirmed that indeed the job ads that we marked as F and M and N jobs do indicate the same as what we marked them. So in total, what we find eventually is that about 7.5% of job ads indicate a gender preference, with slightly more than half for women and slightly less for men. The other important variables are the education and the experience requirements in the job ad, and I'll detail the variables later in terms of how we include them in our analysis. The third most important variable is the occupation classification because we are interested in controlling for and looking at within-occupation variation.

So here the job portal doesn't give us any occupation classification for a job ad. So what we do is that we use the job titles, which have been shown to be more informative about occupations, at least in the US job market. And here either one can use a manual classification using... If similar words are coming into job titles, they can be clubbed together, or one can actually cluster together job titles which not only have the same words, but which may have semantically similar words. For instance, a translator or a transcriber is a similar word. So if there are two job ads, one for a transcriber and the other for a translator, they would actually be put together by this clustering [inaudible 00:14:01]. So we use this clustering [inaudible 00:14:04] but we also check the robustness of our results to the existing manual classification that the literature has used.

In terms of how the broad data looks... So on our job ad front, about 50% of the jobs require at least college education, whereas in terms of experience, 67% require less than one year of experience. So they're mostly for freshers. And the average wage offered on the portal is around 213,000 rupees per annum. So these are the education and the experience requirement categories that we have constructed using the detailed categorization that the portal had, and these are the ones that we will be controlling for in our analysis as we go further in the regressions. In the applicant dataset, the average job seeker age is around 24 years on this portal. About 80%, 86% in fact, have a university degree. And about 35% of the job seekers on the portal are women. The posted wages on this portal are slightly higher than what the average urban worker would get, since it's catering to the mid- to high-skill level workers. So it's representative of that particular part of the market, not of the lower-wage segment or the lower-skill segment really.

Now coming to our next important variable in terms of defining the implicit femaleness and maleness, which I spoke about a few minutes ago. So implicit femaleness is defined as the probability of placing an explicit female request, conditional on the job text which appears in the job title and in the job description. And we similarly define maleness, which is the probability of having an explicit male request. To derive these probabilities, we basically use a multinomial logistic regression. We train it and then we predict the maleness and femaleness for the remaining sample, for which there is no explicit request available. And even for the explicit ones, one can actually predict using the machine learning algorithm. So just to give you a broad idea on what we are finding: typical occupations like beautician, personal secretaries, school teachers are very high on [inaudible 00:16:21], whereas occupations such as cargo loader, delivery executive are very high on maleness. But not only is there occupational differentiation; even within an occupation, we see a lot of variation in the maleness and femaleness attributes, and why are these coming in?

So let me just give you an example here. So let's take the example of a business development manager. Okay. There are these two job ads. Now, both the job ads have the same title. However, in the first job ad that you see here, the words in blue show that these words are contributing to maleness and the words in red show that these words are contributing to femaleness. And on an average, what we find is that this job exhibits a female preference, whereas in the second job for a business development manager, on an average, there are more words which are contributing towards maleness, and with the same job title, given the words which are coming into these jobs, they're reflective of a preference for a male. And in fact, these are the actual job ads that we've taken. Of course, we've taken out the [inaudible 00:17:18] et cetera, so what you see are this announced sector, which are used for training of the model, but what one can see is that it is not just about the occupation or what is the title of the job.

It also depends on what the content of the job is, what that job really wants in the person who is to be hired, that would determine whether it has a male or a female preference. So the first thing we do is that we test some of the hypotheses in the existing literature, which have already been looked at using datasets from China and Indonesia. One is a negative skill-targeting relationship, where we find that the gender preference is more likely to be in ads which have low education, low experience. Additionally, since we have wage data, we also find that these jobs have lower wages. At the same time, when we look at male preference, where the dependent variable takes a value of one if there's a male preference, zero if none, and minus one if there's a female preference, we find that if you have a male preference, an explicit preference, these jobs have a higher wage as compared to the other jobs in the sample.

The next thing we check is what happens to the total applications and the share of female applicants to a job ad when there are changes in the explicit gender request, and here again, what we find is that if there is an explicit female preference, the share of female applicants is higher, while if there is a male preference, the share of female applicants is lower. So these test some of the basic hypotheses that we would have about how people would behave on the portal. So next what we do is that we exploit the maleness and the femaleness measures that we've constructed in our data set. And we want to see how the wages differ by femaleness and maleness. So here, our dependent variable is the log of the wage in job ad i, which is for an occupation j, in state s, in time t. Here, femaleness is denoted by FP and maleness of a job ad is denoted by MP.

Xs are the job-ad-specific variables, which I just discussed: the education and experience requirements. We have a set of occupation and state fixed effects. We also cluster the standard errors, assuming that there is correlation within an occupation and a state, and we run these regressions separately for the F, M and N jobs, just to see how things change across these three different types of jobs. So what do we find? So let's concentrate on the N jobs, because maleness and femaleness have been derived from F and M jobs, so the sample there is very small as well. But here, what we are basically finding is that an increase in femaleness, irrespective of whether we control for occupation or we don't control for occupation, in columns three and four, tends to decrease the wages which are posted on a particular job ad. An increase in maleness also decreases it, but the decline in wages because of high femaleness is almost double that for maleness.
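The wage regression described above can be written out as follows. This is a sketch assembled from the verbal description only: the coefficient symbols and the fixed-effect symbols are assumed notation, and any time controls are omitted because the talk does not state how the time index t enters.

```latex
% i = job ad, j = occupation, s = state, t = time
% FP_i = femaleness of ad i, MP_i = maleness of ad i
% X_i  = education and experience requirements of ad i
% \mu_j, \lambda_s = occupation and state fixed effects (assumed symbols)
\ln w_{ijst} = \alpha + \beta_F \, FP_i + \beta_M \, MP_i
             + X_i' \gamma + \mu_j + \lambda_s + \varepsilon_{ijst}
```

In this notation, the finding reported for N jobs is that both \beta_F and \beta_M are negative, with \beta_F roughly twice \beta_M in magnitude. Standard errors are clustered at the occupation-by-state level.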

So, as I said, these are low-skill jobs, which anyways would have an [inaudible 00:20:27] preference. So if you explicitly show a very high maleness or femaleness, you do have lower wages, but it's much more negative for femaleness as compared to maleness, is what we find. And these coefficients are significantly different from each other. Next, what we want to see is how the application behavior changes when femaleness and maleness change and how that interacts with the existence of a gender request. So for instance here, what is the exercise that I'm doing in my head? So we have the jobs with a female preference which are red, neutral which are black, and blue which are male preference jobs. Now looking at job ads which have very low femaleness, now if such a job ad [inaudible 00:21:15] it's a stereotypically low femaleness job, if there is a female request in such a job, do women respond to it? No, they don't. There's no statistically significant difference across these three lines when there is very low femaleness. It increases sharply as soon as femaleness increases somewhat.

However, if you look at men, even if there is very low maleness associated, if there is a male request, they would apply to a job ad. So in a way this ambiguity aversion, which is typically being spoken about, seems to be holding in this context, and that is what we see in the context of the labor markets, where requests seem to be having some gender-differentiated behavior, even across different types of job ads. The next thing we are interested in doing is finding out what is going on behind this black box of maleness and femaleness that I just showed to you. And in order to be able to do that, what we do is that we find, for the predictions that we were getting from the logistic model, which are the words in the job ads which are contributing to this maleness or femaleness. And this is done using an algorithm which is called LIME, so if anyone's interested, they can find out more about it, but the way it operates is that it randomly removes one or more words from each job ad and then sees how it changes the contribution of that particular job ad towards maleness and femaleness. And then it assigns those scores to the words which are being removed randomly from the set of job ads.
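The word-removal idea described above can be illustrated with a simplified perturbation sketch. This is not the LIME library and not the authors' pipeline: LIME fits a weighted local linear model on the perturbed samples, whereas this sketch just compares the average prediction with and without each word. The toy scoring function `toy_femaleness` and its word weights are hypothetical stand-ins for the trained classifier.

```python
import math
import random

# Hypothetical stand-in for a trained classifier: a logistic score over toy word weights.
TOY_WEIGHTS = {"receptionist": 1.2, "fluent": 0.4, "travel": -1.5}

def toy_femaleness(words):
    """Return a pseudo-probability of a female request given the words kept."""
    z = sum(TOY_WEIGHTS.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-z))

def perturbation_scores(words, predict, n_samples=2000, seed=0):
    """Score each (distinct) word by mean prediction when kept minus when removed."""
    rng = random.Random(seed)
    kept_sum = [0.0] * len(words)
    kept_n = [0] * len(words)
    gone_sum = [0.0] * len(words)
    gone_n = [0] * len(words)
    for _ in range(n_samples):
        # Randomly drop each word with probability one half.
        mask = [rng.random() < 0.5 for _ in words]
        p = predict([w for w, keep in zip(words, mask) if keep])
        for i, keep in enumerate(mask):
            if keep:
                kept_sum[i] += p
                kept_n[i] += 1
            else:
                gone_sum[i] += p
                gone_n[i] += 1
    return {
        w: kept_sum[i] / kept_n[i] - gone_sum[i] / gone_n[i]
        for i, w in enumerate(words)
    }

scores = perturbation_scores(["receptionist", "fluent", "travel", "office"], toy_femaleness)
```

Words with a positive score push the prediction towards femaleness in this toy setup and words with a negative score push it away; the same exercise run against a maleness prediction would give each word's maleness contribution.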

However, to be able to do this, we cannot use all the words which come in all the job ads. There are too many of these words. So what we do is that we use the most commonly or frequently occurring words. And we put a limit that we will look at the words which occur at least 10 times in the job ads which have an explicit gender preference. Now why these job ads? Because these job ads are the ones which are being used to even predict maleness and femaleness. So these are the words which really matter for us. And what we find is that these 3,000 words actually contain 92% of all word occurrences in the N jobs as well. So we're not missing words, really. However, we then need to manually classify these 3,000 words into a certain set of categories which are meaningful, which make sense given the literature on the labor markets, where it has been shown that skills, personality, and job flexibility may be some of the attributes which matter.

And hence we take these broad categories based on the [inaudible 00:23:53] in work and labor market, as well as psychology, to classify these words into different categories. Then what we do is that for each word, we obtain a relevance score. Now, what is this relevance score? So as I said before, the LIME algorithm basically gives us the contribution of each word towards femaleness and maleness. So we construct this relevance score by subtracting the score towards maleness from the score towards femaleness. So if it's a net positive score, then that word is contributing more towards predicting femaleness as compared to maleness. And hence that word is predictive of a female preference across job ads, whereas if it's negative, then it's indicative of a preference for a male. That's where it's contributing more. So what do we find in general? So here I've just listed them, just to show you what these words are and how they really look.

So these are the top 20 words that we found in each skill category which have contributed to femaleness, in the first column, and maleness, in the second column, from each type of workforce. Now in the hard skills, what we find is that skills which are associated with beauticians like pedicure or manicure, or accounting skills like Tally, ledger skills, or basic computer skills, or skills related to design such as Corel, or [inaudible 00:25:14], making like Zoho, et cetera, somehow they are more in the hard skills when it comes to hard skills for women. Whereas for men, skills related to software or hardware or engineering, like WPM, RCM, QC, engineering short terms, tend to come in more when looking at the contribution of these skills towards a male preference.

On the soft skills, we find that for women, fluency across languages is a predominant theme across these words, whereas for men, while fluency matters, things like negotiation, supervision, liaising, all these also come in, which are not present for women in column three. We look at words on personality, and for women, we find that, again, being presentable, good looking matters, having confidence; also "entrepreneur" comes in, so that's good to see. And for men, you have some words which are related to appearance, but a lot of words are towards being energetic, being able to handle high-pressure situations, being methodical, being passionate, leaving a good first impression, being resourceful. So different sorts of words and adjectives are being used when employers are trying to indicate their preference for a particular gender.

On the job flexibility side, words such as being able to work from home or being able to work through Skype calls seem to indicate a preference for women, whereas words which are related to the job requiring travel or working on the weekends, on rotational shifts, night shifts, or a job that may require you to relocate to different locations generally are associated more with male preference in the job market. Then what we do with that... We want to be able to construct some meaningful score and look at aggregate categories eventually. We can't look at all the 6,000 words. So we've aggregated these words into these five categories of skills, soft skills, personality, flexibility, and the other words that we could not [inaudible 00:27:24] right. To arrive at a net score for every category, we basically sum up the scores of all the words that we have classified under that category. First we take the sum over femaleness and we then take the sum over maleness.

And then, just like we did for every word, we subtract the two for every category. If the net score is positive, then it contributes towards femaleness, which is indicated by FW here. Whereas if the net score is negative, we take the absolute value and we plug it into the contribution towards the male gender category score. And then we standardize each of the gender category scores to be able to compare them and just to make sense of the coefficients that we get eventually in the regression. So here we undertake a similar kind of [inaudible 00:28:18] regressions that I showed you earlier. And we see what are the effects of these different gender category scores on the wages which are posted on the job portal. Here again, I would like you to concentrate on the N jobs because the sample for the F and the M jobs is very limited. What we find is that if an ad contains words which are indicative of hard skills, which are generally used for women, it sees a lower posted wage. For female soft skills and personalities, it's actually positive and significant. The other female categories don't matter. For male categories, almost all categories have a significantly positive effect on the wages posted for a job ad.
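The construction of the gender category scores described above can be sketched as follows. The function names, the dictionary layout, and the example words and numbers are all hypothetical; only the steps (sum femaleness and maleness contributions within a category, subtract, route the net score to a female or male score, then standardize) come from the talk.

```python
from statistics import mean, pstdev

def category_gender_scores(word_scores, word_category):
    """Aggregate per-word contributions into female (FW) and male (MW) category scores.

    word_scores:   {word: (femaleness_contribution, maleness_contribution)}
    word_category: {word: category name, e.g. "hard_skills" or "flexibility"}
    """
    fem_sum, male_sum = {}, {}
    for word, (fem, male) in word_scores.items():
        cat = word_category[word]
        fem_sum[cat] = fem_sum.get(cat, 0.0) + fem
        male_sum[cat] = male_sum.get(cat, 0.0) + male
    scores = {}
    for cat in fem_sum:
        net = fem_sum[cat] - male_sum[cat]
        # Positive net goes to the female score; the absolute value of a
        # negative net goes to the male score.
        scores[cat] = {"FW": max(net, 0.0), "MW": max(-net, 0.0)}
    return scores

def standardize(values):
    """Z-score a list of category scores across ads so coefficients are comparable."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]

# Hypothetical ad with three scored words.
ad_scores = category_gender_scores(
    {"tally": (0.5, 0.0), "travel": (0.0, 0.75), "night": (0.25, 0.5)},
    {"tally": "hard_skills", "travel": "flexibility", "night": "flexibility"},
)
```

For this hypothetical ad, hard skills net out towards femaleness and flexibility nets out towards maleness, so each category contributes to exactly one of the two scores.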

And here I would like you to keep in mind these negative and positive coefficients for hard skills and flexibility. And why? Because when we look at how the search behavior of candidates is different, a greater fraction of applicants are women when a job ad contains words which come in the category of female hard skills, and the female applicant share is lower when there are more male flexibility-related words which are coming in these job ads. So these two sorts of things stand out, where these two job attributes have higher posted wages, whereas the behavior of candidates reverses, where women tend to apply more if there are female hard skills being listed in a job ad, and they tend to apply less if there are male flexibility-related words which are coming across the job ads.

So we test the robustness of these regressions that I've shown you. In the regressions which use candidate application data, we also control... We conduct these regressions and estimate them at the candidate level as well. It's just that these regressions become huge and incomputable, they take a lot of time to converge, and hence we show them in the appendix, but controlling for the candidate characteristics, things completely hold. We also use the alternative manual classification of occupations that I talked about earlier. We also control for firm-specific fixed effects and see if results go away, but they don't. We also use a different method of construction of gender category scores, where the score associated with a word may be context-specific for a particular job ad. One can also tweak that.

And even if we use context-specific scores for words which come in a particular job [inaudible 00:30:58] for instance, a word when it's coming on its own may have a different contribution towards maleness and femaleness. But if that word is coming with two other words, it may have a different contribution towards maleness and femaleness. And that's what we test the robustness of our estimation results to, and we find that similar patterns are found there as well when we change the way we estimate the scores.

As another robustness check, what we do is that we directly know from just... We just undertake a regression where we want to see, at the word level, which words attract a higher share of female applicants. And here, what we do is we first regress the female applicant share on all the other Xs that we have, which are fixed effects and the job characteristics. We get the residuals, and using these residuals as a dependent variable, we then estimate a ridge regression model. Since this is textual data, we can't estimate it with the usual regression model. And we obtain a coefficient for every word, which is nothing but the marginal effect of every word on the female applicant share.

And what we find is that when we undertake this exercise, there is a very... At least there's a positive and high correlation for words coming in hard skills and flexibility, whereas the correlation is really not there for words coming in the category of soft skills and personality traits, which is again related to what we saw, that it's the hard skills and flexibility which seem to be contributing to the patterns that we see in the data and not surely soft skills and personality traits. So a second way of validating the previous findings using just a different method of doing the same thing.

So let me just chuck this out in the interest of time. Coming to how we put all these findings together: I've shown you different sets of results, different patterns in the data, but how is this contributing to the emergence of disparities in the labor market, at least at the application stage? Here the disparity that we look at is the gender wage gap. And we want to see how application behavior, along different kinds of attributes, is contributing to the gender wage gap, which I haven't yet shown you, but which exists on the job portal between men and women. For doing this, we use a semi-parametric decomposition method, which was proposed by DiNardo and co-authors in 1996. It's nothing but a re-weighting method in which we just re-weight group A to look like group B.

And then we see, if group A looked like group B on certain characteristics, what the extent of the wage gap would have been if these two groups had similar observable characteristics. That is essentially what this technique is trying to do. Using this re-weighting method, we first estimate the baseline wage gap, which is essentially the difference between the average wage of the jobs that a male candidate applies to on this job portal vis-à-vis a female candidate, if the female candidate had the same observable characteristics, like education, experience, and age, as the male applicants on the job portal. So the first thing we do is construct this baseline wage gap using this re-weighting method. Once we have the baseline wage gap, we then undertake further re-weighting, in which we decompose what part of this wage gap is explained by the application behavior of candidates along different sets of attributes.
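
The baseline step of this re-weighting can be illustrated as follows. This is a rough sketch under strong assumptions: the data are invented, the only observable characteristic is a single discrete education cell, and weights are computed as cell-share ratios. In applied work the weights are usually estimated from a probit or logit of group membership on the characteristics; for fully discrete characteristics the cell-share ratio is the same quantity.

```python
from collections import Counter

def cell_weights(target_cells, source_cells):
    """Weight per cell = share of the cell in the target group / share in the source group."""
    target_counts = Counter(target_cells)
    source_counts = Counter(source_cells)
    n_target, n_source = len(target_cells), len(source_cells)
    return {
        cell: (target_counts[cell] / n_target) / (count / n_source)
        for cell, count in source_counts.items()
    }

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical applicants: (education cell, log posted wage of the job applied to).
men = [("graduate", 10.0), ("graduate", 10.2),
       ("postgraduate", 10.6), ("postgraduate", 10.8)]
women = [("graduate", 10.0), ("postgraduate", 10.5),
         ("postgraduate", 10.5), ("postgraduate", 10.7)]

# Re-weight women so that their education mix matches that of men.
weights_by_cell = cell_weights([c for c, _ in men], [c for c, _ in women])
women_weights = [weights_by_cell[c] for c, _ in women]

men_mean = sum(w for _, w in men) / len(men)
women_raw_mean = sum(w for _, w in women) / len(women)
women_reweighted_mean = weighted_mean([w for _, w in women], women_weights)

raw_gap = men_mean - women_raw_mean
baseline_gap = men_mean - women_reweighted_mean
print(f"raw gap: {raw_gap:.4f}, gap at equal characteristics: {baseline_gap:.4f}")
```

In this toy sample the women are more educated than the men, so the raw gap understates the gap at equal characteristics. The further decomposition described in the talk repeats the re-weighting with additional attributes of the jobs applied to (occupation, gender requests, and so on) and reads off how much of the baseline gap each one accounts for.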

So I'll show the set of attributes that we look at in our analysis and their contributions. The baseline wage gap is 3.5%. So on average, men apply to job ads which have a posted wage 3.5% higher than the ads that female candidates apply to on the job portal. Now, one has to remember that this wage gap looks small here, but on the portal, the employer is actually posting a wage range. It's a minimum and a maximum, and we're taking the midpoint of that range to get the gender wage gap, as in all the regressions that we've seen so far. Plus, females on this job portal are a very selected sample. So if this is a highly motivated set of females who are looking for jobs, and even then we find this level of wage gap, then it does show that even amongst the most highly motivated women who are career-oriented, one can see these wage gaps arising at the application stage itself.

So this baseline wage gap that we have found, we then decompose it into the part that is explained by a certain attribute. In model one, the attributes that we are interested in are the occupations and locations. So of this 0.0349, or 3.5%, about 1.6% is actually being explained by the differential application behavior of men and women to different kinds of occupations on the job portal. And this is about 45% of the raw wage gap. So about 45% of the wage gap can be explained just by women and men applying to different occupations on the job portal. The next attribute we look at is: on top of the occupations, what if you also control for gender requests? If you control for gender requests, then the additional gain that we get, in the same proportion terms, is 7%.

So again, though it looks small, one has to remember that this additional share is coming from the existence of gender requests on the job portal, and only 7% of the job ads on this portal actually had a gender request. So even with 7% of job ads, one is able to explain 7% of the gender wage gap. The phenomenon of gender requests is much more common on the government portals in India, and if that's the benchmark one uses, then this can be much higher as the proportion of job ads that exhibit a gender request increases on any job portal.

The next attribute we are interested in is maleness and femaleness, the implicit measures that we [inaudible 00:37:19], and here we find that on top of occupation, which explains 45%, about 13% is explained by the implicit measures that we've constructed using the data on the explicit measures themselves. And about 19% is explained by the gendered words that we just discussed, which were extracted from the maleness and femaleness exhibited by different job ads. So these are approximately one third of the effect of occupations. So if the literature has found that working in different occupations explains a large portion of the wage gap, then these attributes, even within occupation, would explain about one third as much of the wage gap that one sees, even within the same occupation across men and women. [inaudible 00:38:14]
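
The shares quoted in this part of the talk can be checked with simple arithmetic. The snippet below only restates the numbers as spoken (rounded figures, so results are approximate); the variable names are illustrative.

```python
# Figures as quoted in the talk (rounded by the speaker).
baseline_gap = 0.0349        # baseline gender gap in posted wages applied to
occupation_part = 0.016      # part explained by occupation and location

occupation_share = occupation_part / baseline_gap
print(f"occupation share of the gap: {occupation_share:.1%}")

# Additional shares of the gap explained on top of occupation, as quoted.
additional_shares = {
    "gender requests": 0.07,
    "implicit maleness/femaleness": 0.13,
    "gendered words": 0.19,
}
quoted_occupation_share = 0.45
relative_to_occupation = {
    name: share / quoted_occupation_share
    for name, share in additional_shares.items()
}
for name, ratio in relative_to_occupation.items():
    print(f"{name}: {ratio:.2f} of the occupation effect")
```

The implicit measures and the gendered words come out at roughly 0.29 and 0.42 of the occupation effect, which is the "approximately one third" mentioned above.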

So in this work of ours, basically our aim was to look more deeply into the question of how words in job ads may be related to a particular gender, to particular stereotypes, and how they can actually affect wages as well as the search behavior of men and women differently in the labor market at the application stage itself, and how they may eventually contribute towards the wage gap at the application stage. So what we find is pretty [inaudible 00:38:46]: women are applying to jobs which are stereotypically female, or which have high femaleness, but these jobs also have the lowest advertised wages, and women are willing to pay for attributes such as workplace flexibility. And these are the findings with a set of young, very motivated, skilled people who are just entering the labor market. And we have this [inaudible 00:39:10] which has shown that early career decisions matter and early career shocks matter.

So they can, of course, have cumulative consequences for later-life career trajectories for men and women. There's a very recent paper that has actually shown, using data from [inaudible 00:39:29], if I'm not wrong, that when it comes to the final wage gap observed in the main population, about 70% of that wage gap is actually coming from the application stage itself. So they've been able to link the employer-employee data with the application stage data. So given that number, I think the research shows that there are some important features to be looked at at the application stage itself, and certain policy measures, especially I think [inaudible 00:40:00] gender requests, may be a good way to at least start with in the developing countries to be able to minimize some of these gaps that we see across gender. So let me just stop there. I think I've taken more time.

Naveen Bharathi:

Thanks Kanika. Thanks for a fascinating presentation. Rashi, do you want to go first?

Rashi Sabherwal:

Oh yeah, sure. Thank you so much, Kanika, for this fascinating presentation. I had a three part question and I'll try to keep it really short, but it's all related to the coding of the implicit bias. So I was curious about the training data and how the hard skills and soft skills were coded. How did you determine what was female and what was male? The second question I had was how you conceptualize education and skills? Do you see it as post-treatment effects of gender, because obviously education and skills, like low education or low skills, drive wage effects as well.

So I was just wondering how you think about it: are you thinking about it as correlated with gender, are you trying to separate them out and trying to control for them? And the third question was more of a... How are applicants seeing these words, and are they seeing them the same way as you're coding them? For example, there are some ambiguous ones, like school teacher as an occupation could be seen as female. It could also be seen as male, but I was wondering if an ID, or something like that, with applicants might be useful, or an ID with a subset, so that that can help prove your results further.

Kanika Mahajan:

Thanks Rashi. I actually did not get the third question. So you were talking about the IIT [crosstalk 00:42:01].

Rashi Sabherwal:

Oh yeah. I was just asking about how do we... Is there any indication of how applicants are seeing these words? So are applicants seeing these words as gendered, just to cross-verify if applicants are actually perceiving these words as gendered or not?

Kanika Mahajan:

Yeah. Okay. So thanks, those are really great questions, actually. So first, on assigning these words to female and male: we are not doing any assignment to female and male. The only assignment that we did was classifying them into hard skills, soft skills, flexibility, or personality and appearance. That's the only manual intervention that we did. We then let the algorithm decide, because we had the predictions from the logistic model. So the training data set is the set of job ads in which we have the explicit gender request, right? So we basically make use of the existence of gender requests. If we did not have them, we couldn't have given these gendered scores. So we basically exploited that bit, that these explicit references exist, right?

And there are certain texts or words which are being used, right? So we've taken out the male and female words, but we've kept the other words. And then we see, you know, which are these other words which are... Well, can we even predict it? Maybe we wouldn't have had good predictive power, but actually our model is pretty good eventually. I didn't show you the model stats, but the correlation was around 0.44, 0.545, which is pretty high for a machine learning model. So that very fact gave us an idea that, you know, there seems to be some consistency, a certain kind of pattern in certain kinds of jobs in which there seems to be a higher preference for men or women, or at least an exclusive preference for a particular gender.

So essentially, after classification, it was the logistic classifier scores that we used along with... So in machine learning, after you've made the predictions, you can also go back and check which are the facets of your data which have contributed to those predictions. And that's called explainable AI, essentially. So we use methods in explainable AI to be able to then back out, to be able to open the black box, basically. Okay, we made these predictions, but what are these words really? We have to understand that, right? What is going on behind this femaleness and maleness that we've just coded? The contribution towards a male or a female request is what makes a particular word either male-gendered or female-gendered. So we are not coding it ourselves. It is what the algorithm is giving to us.
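
To make the idea of "backing out" word contributions concrete, here is a deliberately simplified sketch. It is not the pipeline from the paper: it uses a binary female-request versus male-request logistic model (the talk describes a model with a third, no-preference category), plain gradient descent, invented ads, and raw coefficients as word scores rather than a dedicated explainability method. All names are hypothetical.

```python
import math

# Explicit gender words are dropped before training, as described in the talk.
EXPLICIT_GENDER_WORDS = {"male", "female"}

def build_vocabulary(ads):
    return sorted({w for words, _ in ads for w in words} - EXPLICIT_GENDER_WORDS)

def to_row(words, vocabulary):
    return [1.0 if term in words else 0.0 for term in vocabulary]

def train_logistic(rows, labels, steps=2000, learning_rate=0.5, l2=0.01):
    """Batch gradient descent on L2-penalized logistic loss."""
    k, n = len(rows[0]), len(rows)
    weights, bias = [0.0] * k, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * k, 0.0
        for row, y in zip(rows, labels):
            z = bias + sum(w * x for w, x in zip(weights, row))
            error = 1.0 / (1.0 + math.exp(-z)) - y
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * (g / n + l2 * w)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

# Hypothetical training ads with an explicit request: 1 = female, 0 = male.
ads = [
    ({"female", "receptionist", "typing"}, 1),
    ({"female", "typing"}, 1),
    ({"female", "receptionist"}, 1),
    ({"male", "travel", "night"}, 0),
    ({"male", "travel"}, 0),
    ({"male", "night"}, 0),
]
vocabulary = build_vocabulary(ads)
rows = [to_row(words, vocabulary) for words, _ in ads]
labels = [label for _, label in ads]
weights, bias = train_logistic(rows, labels)
word_scores = dict(zip(vocabulary, weights))

def word_contributions(words, scores):
    """Per-word contribution to the log-odds of a female request for one ad."""
    return {w: scores[w] for w in words if w in scores}

# An ad with no explicit request: which of its words push the prediction which way?
new_ad = {"typing", "travel", "excel"}
contributions = word_contributions(new_ad, word_scores)
print(contributions)
```

Words with positive scores are "female-gendered" in this toy setup and words with negative scores are "male-gendered". Words never seen in training, such as "excel" here, receive no score, which is one reason predictions for ads outside the training set inherit the characteristics of the ads that do carry explicit requests.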

That's the first question. The second question, yes. We are using education and experience more as control variables, to be able to control for them and their effects on posted wages and applicant behavior, because of course the candidate characteristics are also different, and men and women may behave differently towards different kinds of job ads on the portal. So they're essentially being used as controls. We've not really looked at their effects in detail. That's not been the main interest in this paper. And third, a very interesting point. How do applicants see it? In fact, that's why we wanted to check the applicant share. If they respond to femaleness, then clearly they're also seeing some bit of femaleness, like women are seeing some bit of femaleness in these job ads, and it's not just a figment of our coding it as being female. So that somehow gives us some sort of affirmation, some strength and faith in our results, that, you know, it's not just an arbitrary coding of anything.

We do see that as femaleness increases, there is an increased share of female applicants. We do see that as maleness increases, there is an increased share of male applicants. So clearly they are not perfect. It's not that you find zero female applicants if there is very high maleness. No, they reduce, but they don't become zero. So of course it's not zero-one, but there is definitely some perception change that happens when the wordings change for a particular job ad, so yeah. Let me just stop there.

Rashi Sabherwal:

Okay, thank you.

Naveen Bharathi:

The last question was an interesting one, whether... I think you missed that question. How do applicants perceive these advertisements?

Tariq Thachil:

No, that's the one she was just responding to, yeah.

Kanika Mahajan:

Yeah. So basically, I mean, there's no direct way for us to test. I wish we could actually test, we could show these jobs to people. Actually, we could do that. We could actually show these job ads to a set of students and just ask them, "What do you think?" Just ask them some belief- or perception-based question. I think that's a good idea. One can actually supplement it, and using just the same job ads that we've used in our test and train data, one could actually do it, but we've not done it. So far, we've only relied on whether the patterns are consistent, but this is something which is doable. It's a very small lab experiment, should actually be done, and it [inaudible 00:47:27] sort of... Yeah. Clarify certain things for sure.

Naveen Bharathi:

Thanks Kanika. And I have another question, adding onto Rashi's questions. How do you see the supply side of this, in the sense that some educational courses seem to have more women than men, like STEM maybe. There's already a kind of discrimination happening in what course to choose, and how does that show up in these ads? Do you see some sort of a correlation in... I think there are a few papers, I can't recall the names, where they do show that STEM courses have fewer women compared to men. Similarly for lots of other courses. I also noticed architecture, you mentioned it as a soft male, hard female skill, maybe, I don't know [crosstalk 00:48:22].

Kanika Mahajan:

Yeah. So in jobs, which probably... I don't know why, but architecture seems to be more women kind of, like architectural skills. [crosstalk 00:48:29] I'm not sure why.

Naveen Bharathi:

So how do-

Kanika Mahajan:

But there could be some randomness here also. I'm not saying that the words are perfectly classified.

Naveen Bharathi:

No, no. I didn't mean that, but I want to know whether you take that into account in a sense that whether these courses for some STEM courses have more men than women, like lots of courses, like diploma, mechanical engineering may have more, right? So even the supply side itself is so restricted.

Kanika Mahajan:

Actually we have the detailed education qualifications of the candidates, so we control for them. So essentially we control for them when we run these candidate-level regressions, which are the ones for candidate behavior. I didn't show them. All the regressions that I showed were at the job ad level because they're easier regressions to run. They converge very quickly, but we did control for the candidate-specific education. So there we knew whether a candidate has done, say, a BA in a STEM or in a non-STEM degree. But what I think is interesting from this aspect of skills is eventually the supply side of skills. So what is being asked for in the market may also be... These attributes are also stereotypically formed, right? So the employers are more likely to request a man if an engineering skill is being required for a particular job.

It will also be reflective of the stereotypes that generally prevail in society. And it could just be that there are fewer women who have these skills. So there may be a stereotype or a belief amongst the employers and amongst the candidates that women candidates generally don't have enough STEM skills, and hence they're not even eligible to apply for a lot of these jobs. So it could just be reflective of that sort of general equilibrium. But I agree with you. That needs more thought. One needs to look at the skills data for this, to be able to exactly tease out which are the skills which women and men have and how different these are from the supply side itself.

But I don't know if you had that in mind, but that's depending of...

Naveen Bharathi:

Yeah, exactly, yeah.

Kanika Mahajan:

But that's a very important question, which I think has been thrown up by our findings as the next question to answer: we see that hard skills make men and women apply differently, but is it because they just perceive these hard skills differently? Or is it because men and women, having undergone different kinds of education or taken different courses, have different skills? So there could be these two explanations, which are plausible in this case.

Naveen Bharathi:

Mm-hmm. One question by Chris, do you want to go next? So can you go next? I'll wait for him to unmute.

Tariq Thachil:

Sure. Yeah. Kanika, thanks so much for that paper. There are so many interesting findings and a lot of work that's gone into this. So I guess one very small question, which is just my unfamiliarity, it may just be a mechanical thing about how you guys did this. I was struck in figure four, with your heat map, that sometimes the same word would have opposite coding. So software or business could be coded both ways. Is that just based on what else is in the ad? I wasn't quite clear on how that could happen. That's a very small question, but yeah.

Kanika Mahajan:

No, so that's a very important question actually, because in the heat map that I showed you, the word scores were derived based on the context. So the software score would come not just from software, but from the two words which are beside software. So we [crosstalk 00:52:31]-

Tariq Thachil:

Right. So then when you're giving us the kind of words that are most commonly associated, it's based on an overall... It doesn't mean that that word is always involved in that way,-

Kanika Mahajan:

A median.

Tariq Thachil:

But yeah. Okay.

Kanika Mahajan:

Yeah.

Tariq Thachil:

Got it.

Kanika Mahajan:

Yes. A median [inaudible 00:52:44], right.
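
The exchange above, where a word's score depends on its neighbors and the reported word-level score is a median over occurrences, can be illustrated with a short sketch. The per-occurrence scores below are invented, and how they would be produced from the classifier is not shown; the function names are assumptions.

```python
from collections import defaultdict
from statistics import median

def context_windows(tokens):
    """Yield (left neighbor, word, right neighbor) for every token in one ad."""
    padded = [None] + list(tokens) + [None]
    for i in range(1, len(padded) - 1):
        yield padded[i - 1], padded[i], padded[i + 1]

def median_word_scores(occurrence_scores):
    """Collapse per-occurrence contextual scores into one median score per word."""
    by_word = defaultdict(list)
    for word, score in occurrence_scores:
        by_word[word].append(score)
    return {word: median(scores) for word, scores in by_word.items()}

# The same word appears in different contexts across ads.
windows = list(context_windows(["software", "sales", "travel"]))
print(windows)

# Hypothetical contextual scores (positive = towards a female request).
occurrence_scores = [
    ("software", 0.30),   # e.g. next to "support"
    ("software", -0.10),  # e.g. next to "engineer"
    ("software", 0.20),
    ("travel", -0.50),
    ("travel", -0.70),
]
word_level = median_word_scores(occurrence_scores)
print(word_level)
```

A single occurrence of "software" can therefore score in the opposite direction from its word-level median, which is why the same word can appear with both codings in a heat map of individual ads.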

Tariq Thachil:

Okay. That was a small mechanical question, but I think the two larger questions I had, in some ways, because there's so many different findings, I think it can be a challenge to back out, like what are the... It doesn't have to be necessarily a single coherent big picture, but one thought I had was that there are a series of connected findings in the paper that suggest that maybe some of the action is elsewhere. So for example, I was struck by the fact that both male and female in table one, right, that you have... And you touched on this for just a second, but the idea that if there's an explicit femaleness or maleness in the end jobs column in table one, you have a negative... It's a negative effect on advertised wages.

And to me, that begs the question of, is the... And you said, okay, overall, this is relatively lower wage jobs, but it's also... to me that wasn't a fully compelling explanation of that, why, what does it mean that any gender indication is identifying a subset of jobs that have lower wages, including for men? So that, I mean, that may have to do with... I would be interested in just unpacking that the gender doesn't always just mean the gender gap between men and women, but here there's also a gendered aspect to just both any gender being indicated in a job ad, and I think at least the way that you dealt with it in the presentation, I'd like to hear more about that because I think it suggests that there's also a larger story happening within which the story that you're telling is happening, but it's also an alter... I think about that finding in conjunction with the finding that female personality actually relates to higher wages.

What I mean is if you look at... There are so many findings in the paper that I can potentially come up with a different narrative. I understand the aggregate findings on application rates, where you're like, this is actually a mixed and very diverse picture of the employment portal. And what we are actually throwing up is that there's this complex, very complex landscape, in which a gender gap exists. But really what we are showing is that it's also a very complex hiring landscape. And that relates to the second thought I had, which was on employer intent.

And I know that's not what you're dealing with here, and that's a different data challenge, but theoretically, is this a model in which employers are aware and using this strategically to signal their ideas? Or is it a either implicit or steer, but even with an implicit, there's a sense that there's an implicit goal in mind, or is it that they're actually completely unaware that they're producing these different applicant streams? And I'm not sure that you will deal with this in this paper, but if we're really thinking about how this paper opens up new research on the kind of labor market in India, to me, a lot of it is in this kind of... The most interesting and arresting findings are that it's actually a complicated landscape and I'm not even sure I could understand, as an employer, what my intent would be in that landscape with particular job ads. So, sorry, it's a long winded question, but that's my thought.

Kanika Mahajan:

No, thanks a lot. I think I'll take the second question first because we thought a lot about it. We actually thought a lot and the writing underwent a lot of change actually, because of the changing thoughts as we processed the results. And I think, again, it would be wrong to say that it is the employer's intent always. If you are expressing an explicit request, of course that's your intent, but if I'm not putting an explicit request, I just want someone who can work on weekends. Or if it's a traveling job, it's a marketing job, and I want someone to travel from one part of the city to the other part of the city, do I really want a man or a woman?

It may just be a job attribute that I want from my employee. Now that job attribute has low flexibility and hence women will not apply for it, but is that my intent, that I don't want a woman? I'm not very sure. I don't think one can see that from this analysis so far. It's still ambiguous. I wouldn't hang my hat on either that employers have intent or that they don't have intent, because in some cases they may even have an intent, where at least some employers may know, okay, we don't want women for this job. So let's put these outright, upfront, these are the qualities that we want and [inaudible 00:57:15] they won't apply for these jobs. But that's something which I agree is, again, up for future work, in the sense of how employers perceive it. And so if you show them these job ads-

Tariq Thachil:

Exactly.

Kanika Mahajan:

How do the employers perceive them to be: do they think that a particular job ad is wanting a female? Or do they just think that it is just certain job attributes that that particular job has? So again, it's a question for future research, just to see how employers perceive this. Second, on the question of the negative effect on wages of both femaleness and maleness. Now again here... These scores have been constructed with the base category of no preference, right? So this is a multinomial model where you have male, female, and then you have the third category of neutralness. So one thing we understood is that, in general, these typical gendered words... Now, again, this is also a problem of data. I will tell you why: because our training data itself is using these explicit gender preference job ads. And now the training data itself is finding that these job ads, on average, have lower skills. Okay? That is what will get reflected eventually, even if you predict for the other jobs.

So here, I also think that it's an artifact of the way we've constructed it using explicit measures, and hence it's the gender category scores which I find more useful, where we know what's going on. When we look at just the overall maleness and femaleness, it's not really clear what it is that is going into this maleness and femaleness, which are being predicted and which are being taken from the explicit gender requests. But when we look at the gender categories for the gendered words, essentially, we know what we're looking at because we can make sense of it. So that's why the first part, I think, is just an artifact of the model itself, where the model has been trained and predicted using the data on these gender requests themselves. So if I used another way of arriving at maleness and femaleness, maybe I wouldn't have gotten it.

Tariq Thachil:

Mm-hmm. No, that might be important to... I mean, and maybe it's there in the paper and I missed it, but that may be important to acknowledge, especially [crosstalk 00:59:51]-

Kanika Mahajan:

Quite [inaudible 00:59:51] knowledge. Yeah.

Tariq Thachil:

This is the limitation, and also, would that then lead to limitations in other parts of the analysis? If we take that implication forward, is there a reason to think it's just contained to that particular result, or that it might also... Because that's the kind of thing that you're taking. I understand what you're saying. This is the advantage of having multiple different cuts like you guys do in the paper. But I think making that a little explicit and saying that it's a contained problem would be helpful, or understanding where it would be contained to.

Kanika Mahajan:

No, definitely. Yeah. I think it needs to be acknowledged and the fact that your eventual predictions are not a part of your training. Right?

Tariq Thachil:

Right.

Kanika Mahajan:

So I think that has to be acknowledged. And that is what... Because there was a way of doing it at least. And we use this way of doing it.

Tariq Thachil:

Sure.

Kanika Mahajan:

But of course it's not foolproof and it's not the only way that one can use. One could use other [inaudible 01:00:37] if one can get the preferences.

Tariq Thachil:

Sure, sure, sure. Thanks so much.

Naveen Bharathi:

You go next, and this will be the last question.

Chris Klaniecki:

Thank you so much. I recognize we're at time already. I'll be quick. Thank you so much. It's been fascinating. The question that I have, I recognize, is informed by a fair amount of ignorance. So I apologize in advance. You've already spoken a little bit on how this data might inform policy changes. I'm curious to know your perspective on how any initiatives for better wage equity in India might be politicized by the Indian government. I know how it tends to play in the United States, but I'm not sure how it compares and contrasts to the Indian context.

Kanika Mahajan:

So you mean in... Sorry, initiatives regarding what? Initiatives regarding equal pay and-

Chris Klaniecki:

Taking like... Yeah, exactly. The findings from this project and putting it towards informing policy changes to reach a better gender wage equity.

Kanika Mahajan:

So I think the first implication that I could see is getting rid of gender requests. You have to look at the Indian government job portals, you really have to look at them: just check the Delhi government job portal, or look at the NCS job portal, that's the National Career Service in India. You'll find that they actually have an explicit gender request, and you can just take a sample of these job ads and you'll find a lot of these job ads want men. So we actually even looked at one of these job portals and, interestingly, almost 40% of the demand was for men on these job portals, which is much higher than, say, what employers are asking for on the private sector job portal, only because they're much more aware that they should not be doing something like this on a private job portal, given the kind of employers which are posting here. But on government job portals, clearly people use this since it is available.

I'm going to use this to give out my preference for a particular gender. And I think that is a clear policy implication. You clearly see that having a gender request is leading to gender wage inequity in the labor markets. It's also reinforcing some of the beliefs that may be existing, which are stereotypical, and you can just keep perpetuating them if you continue to allow such explicit references to be posted on job portals or, for that matter, on any advertising platform. So I think that is one clear policy implication which I take from the results of this paper: these need to go away. Now there are, of course, many other things. I think the question is, can the governments really do anything about it?

I'm not very sure. We already have a lot of legislation, but we know that it goes nowhere. We already have so many acts under which there should be equity on various dimensions, but, you know, enforcement is an issue. So at least what can be enforced is the simple rule of not allowing any sort of explicit reference to gender, at least when an ad is being posted. Of course, that doesn't guarantee anything, but I think eventually things do improve when stereotypes get shed in the long run, but we have to give them a chance to be shed. Let me just stop there.

Naveen Bharathi:

Thanks Kanika. I think with that parting comment, it's good to end the seminar. Thanks everyone for attending the seminar, and next week we have Swagato Ganguly, who'll be talking on India's foreign policy. Thanks a lot. Have a nice day ahead.
