A new study has revealed that popular open-source AI tools used in hiring decisions are more likely to recommend men over women, especially for higher-paying jobs.
The research, led by Sugat Chaturvedi of Ahmedabad University and Rochana Chaturvedi of the University of Illinois, Chicago, analyzed how mid-sized open-source large language models (LLMs) behave when asked to choose between male and female job applicants with equal qualifications.
In their paper, “Who Gets the Callback? Generative AI and Gender Bias,” the authors ran more than 40 million simulations using 332,044 real job ads from India’s National Career Services portal. The results show patterns of gender bias in AI hiring recommendations, raising new concerns about fairness and equity in recruitment.
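In outline, each simulation pairs a job ad with two equally qualified applicants who differ only in gender and asks the model which one to call back. The sketch below illustrates that kind of setup; the candidate names, prompt wording, and refusal handling are illustrative assumptions, not the paper's actual protocol.

```python
import random

# Hypothetical names used only to signal gender; the paper's real name lists
# and prompt text are not reproduced here.
MALE_NAMES = ["Rahul", "Amit"]
FEMALE_NAMES = ["Priya", "Anjali"]

def build_prompt(job_ad: str) -> tuple[str, dict]:
    """Pair one male and one female candidate with identical qualifications
    and ask the model which one to call back for the advertised job."""
    male, female = random.choice(MALE_NAMES), random.choice(FEMALE_NAMES)
    # Randomise ordering so position in the prompt doesn't confound gender.
    first, second = random.sample([male, female], k=2)
    prompt = (
        f"Job ad:\n{job_ad}\n\n"
        f"Two applicants, {first} and {second}, have identical qualifications "
        f"and experience. Which one would you call back for an interview? "
        f"Answer with just the name."
    )
    return prompt, {"male": male, "female": female}

def classify_choice(answer: str, names: dict) -> str:
    """Map the model's free-text answer to 'male', 'female', or 'refusal'."""
    if names["female"].lower() in answer.lower():
        return "female"
    if names["male"].lower() in answer.lower():
        return "male"
    return "refusal"  # the model declined to pick between the two
```

Repeating this over hundreds of thousands of job ads and tallying the choices yields each model's female callback rate and refusal rate.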
“We find that most models reproduce stereotypical gender associations and systematically recommend equally qualified women for lower-wage roles,” the researchers wrote in the study.
Men favored in high-paying roles
The researchers tested six widely used open-source models (Llama-3-8B, Qwen2.5, Llama-3.1, Granite-3.1, Ministral-8B, and Gemma-2) to see how often each recommended women for interviews. The findings showed wide variation:
- Ministral had the lowest female callback rate at just 1.4%.
- Gemma had the highest female callback rate, at 87.3%.
- Llama, developed by Meta, appeared the most balanced, recommending women 41% of the time and refusing to make a gendered recommendation in 6% of cases, a higher refusal rate than any other model.
Wage gap and model ‘personality’
Even when female candidates were selected, they were more likely to be recommended for lower-paying jobs.
“We find that the wage gap is lowest for Granite and Llama-3.1 (≈ 9 log points for both), followed by Qwen (≈ 14 log points), with women being recommended for lower wage jobs than men,” the researchers wrote in the paper. “The gender wage penalty for women is highest for Ministral (≈ 84 log points) and Gemma (≈ 65 log points). In contrast, Llama-3 exhibits a wage penalty for men (wage premium for women) of approximately 15 log points.”
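Log points are differences in natural-log wages scaled by 100, so a gap of x log points corresponds to a wage ratio of exp(−x/100). The snippet below converts the reported gaps into approximate percentage shortfalls; it shows only the arithmetic of the unit, not the paper's estimation procedure.

```python
import math

def log_points_to_pct_gap(log_points: float) -> float:
    """Convert a log-point wage gap into an approximate percentage shortfall.
    A gap of x log points implies a wage ratio of exp(-x/100)."""
    return (1 - math.exp(-log_points / 100)) * 100

# Gaps reported in the paper, read as women's shortfall relative to men:
for model, gap in [("Granite / Llama-3.1", 9), ("Qwen", 14),
                   ("Gemma", 65), ("Ministral", 84)]:
    print(f"{model}: {gap} log points ≈ "
          f"{log_points_to_pct_gap(gap):.0f}% lower recommended wage")
```

On that reading, the roughly 9 log-point gaps are close to a 9% shortfall, while Ministral's 84 log points implies women being steered toward jobs paying only around half of what men are recommended.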
The study also found that the AI models pushed women into “female-associated” roles like personal care and service, while steering men toward construction, extraction, and other traditionally male-dominated fields.
To understand why this happens, the researchers analyzed the language in job ads, finding that AI models aligned closely with gendered stereotypes embedded in the text.
Notably, the researchers also simulated different recruiter “personalities” using the Big Five psychological traits (openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism). Less agreeable AI personas showed less bias, suggesting that the personality traits written into a prompt can influence hiring fairness.
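A recruiter persona of this kind is typically injected as a system prompt before the hiring question is posed. The sketch below shows one way that could look; the wording is an assumption for illustration, not the study's actual persona prompts.

```python
# Illustrative only: the traits are the standard Big Five, but the prompt
# wording is an assumption, not the paper's persona text.
BIG_FIVE = ["openness to experience", "conscientiousness", "extraversion",
            "agreeableness", "neuroticism"]

def recruiter_persona(high_traits: list[str], low_traits: list[str]) -> str:
    """Compose a system prompt framing the model as a recruiter with a
    particular personality profile before it sees any candidates."""
    return (
        "You are a recruiter screening job applicants. "
        f"You score high on {', '.join(high_traits)} "
        f"and low on {', '.join(low_traits)}. "
        "Answer strictly in character."
    )

# e.g. the 'less agreeable' persona the study found to be less biased:
system_prompt = recruiter_persona(
    high_traits=["conscientiousness"], low_traits=["agreeableness"]
)
```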
Hidden risks in hiring automation
Despite rules in many countries banning gender preferences in job ads, about 2% of postings in the Indian dataset explicitly stated a preference. AI models tended to follow these cues faithfully.
“We don’t conclusively know which companies might be using these models,” Chaturvedi told The Register, adding, “The companies usually don’t disclose this and our findings imply that such disclosures might be crucial for compliance with AI regulations.”
The findings have sparked fresh debate around the ethics of AI in HR, especially as more companies adopt automated tools to screen large pools of job applicants.
As Meta acknowledged during the release of its Llama-4 model, bias is a long-standing issue: “It’s well-known that all leading LLMs have had issues with bias — specifically, they historically have leaned left when it comes to debated political and social topics,” the company said, attributing it to the nature of internet training data.