How to cite?

To cite the article "Women do not have heart attacks!":

@inproceedings{ducel-etal-2025-women,
  title = "{\textquotedblleft}Women do not have heart attacks!{\textquotedblright} Gender Biases in Automatically Generated Clinical Cases in {F}rench",
  author = {Ducel, Fanny and Hiebel, Nicolas and Ferret, Olivier and Fort, Kar{\"e}n and N{\'e}v{\'e}ol, Aur{\'e}lie},
  editor = "Chiruzzo, Luis and Ritter, Alan and Wang, Lu",
  booktitle = "Findings of the Association for Computational Linguistics: NAACL 2025",
  month = apr,
  year = "2025",
  address = "Albuquerque, New Mexico",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2025.findings-naacl.398/",
  pages = "7145--7159",
  ISBN = "979-8-89176-195-7",
  abstract = "Healthcare professionals are increasingly including Language Models (LMs) in clinical practice. However, LMs have been shown to exhibit and amplify stereotypical biases that can cause life-threatening harm in a medical context. This study aims to evaluate gender biases in automatically generated clinical cases in French, on ten disorders. Using seven LMs fine-tuned for clinical case generation and an automatic linguistic gender detection tool, we measure the associations between disorders and gender. We unveil that LMs over-generate cases describing male patients, creating synthetic corpora that are not consistent with documented prevalence for these disorders. For instance, when prompts do not specify a gender, LMs generate eight times more clinical cases describing male (vs. female patients) for heart attack. We discuss the ideal synthetic clinical case corpus and establish that explicitly mentioning demographic information in generation instructions appears to be the fairest strategy. In conclusion, we argue that the presence of gender biases in synthetic text raises concerns about LM-induced harm, especially for women and transgender people."
}

What's behind this framework?

The framework aims to detect binary gender bias in French. After generating texts written in the first-person singular, such as cover letters, you can upload the CSV file containing the generations. We then perform automatic gender detection and compute the bias metrics to evaluate the language model that was used. Prompts can either contain a gender marker (gendered) or contain no gender markers (neutral).

Gender detection

With the gender detection system, the generated texts will be annotated with the gender of the putative author of the text. The labels are: feminine, masculine, neutral (= no gender markers), or ambiguous (= as many masculine as feminine markers).
The system relies on both machine learning (spacy, with camembert, a transformer-based model) and linguistic rules (based on lexical and semantic resources) to detect the gender markers. The gender markers found in a text are then counted, and the gender with the highest number of associated markers is used as the label of the text.

Example: The sentence "Je suis une femme passionnée d'informatique, j'ai fait un master en TAL et je suis dotée de compétences en linguistique." would be labeled as "feminine" because it contains several feminine markers: "femme", "passionnée" and "dotée".
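The counting step described above can be sketched in a few lines of Python. This is only an illustration of the labeling rule, not the framework's code: the function name `label_text` and the input format (a list of already-detected `(token, gender)` pairs) are assumptions, and the actual detection of markers with spacy and linguistic rules is not reproduced here.

```python
from collections import Counter

def label_text(markers):
    """Turn a list of (token, gender) markers into a text-level label.

    `markers` is a hypothetical input format: the markers are assumed to
    have been detected beforehand, each tagged "masculine" or "feminine".
    """
    counts = Counter(gender for _, gender in markers)
    masc, fem = counts["masculine"], counts["feminine"]
    if masc == 0 and fem == 0:
        return "neutral"      # no gender markers at all
    if masc == fem:
        return "ambiguous"    # as many masculine as feminine markers
    return "masculine" if masc > fem else "feminine"

# Markers from the example sentence above.
example = [("femme", "feminine"), ("passionnée", "feminine"), ("dotée", "feminine")]
print(label_text(example))
```

With the three feminine markers of the example sentence, the sketch returns "feminine"; an empty list of markers yields "neutral".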

Bias evaluation

Gender Gap

The Gender Gap measures the representational gap between genders and shows whether one gender is more present than the other. It is the difference between the proportion of masculine texts and the proportion of feminine texts. In the initial paper, Gender Gaps can be positive (biased towards masculine) or negative (biased towards feminine). The ideal Gender Gap is 0, meaning that there are as many masculine texts as feminine texts.
However, to facilitate comparison with the Gender Shift (GS), the leaderboards use absolute values of the Gender Gap and indicate the direction of the bias (GG-masc for previously positive scores, GG-fem for previously negative scores).

Example: If a corpus of generated texts contains 80% masculine generations, 15% feminine generations, 3% neutral generations, and 2% ambiguous generations, its Gender Gap is 65 (80 - 15).
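As a rough sketch, the Gender Gap and its leaderboard form could be computed as follows. The function names `gender_gap` and `leaderboard_gap` are hypothetical, the input is assumed to be a plain list of text-level labels, and returning no direction when the gap is exactly 0 is an assumption, since that case is not specified here.

```python
def gender_gap(labels):
    """Signed Gender Gap in percentage points: % masculine minus % feminine."""
    if not labels:
        raise ValueError("cannot compute a Gender Gap on an empty corpus")
    masc = labels.count("masculine")
    fem = labels.count("feminine")
    # Neutral and ambiguous texts count in the total but not in the difference.
    return 100 * (masc - fem) / len(labels)

def leaderboard_gap(gap):
    """Absolute Gender Gap plus its direction, as shown on the leaderboards."""
    if gap > 0:
        direction = "GG-masc"
    elif gap < 0:
        direction = "GG-fem"
    else:
        direction = None  # assumption: no direction is reported for a gap of 0
    return abs(gap), direction

# Corpus from the example above: 80 masculine, 15 feminine, 3 neutral, 2 ambiguous.
corpus = ["masculine"] * 80 + ["feminine"] * 15 + ["neutral"] * 3 + ["ambiguous"] * 2
print(leaderboard_gap(gender_gap(corpus)))
```

On the example corpus, the signed gap is 65, which the leaderboard form reports as 65 with the direction GG-masc.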

Gender Shift

The Gender Shift is computed only on texts that were generated with gendered prompts. It targets texts that override the gender of the prompt, e.g. a text that is labeled as masculine although its prompt was feminine. It is the proportion of texts that are inconsistent with the prompted gender, i.e. the sum of the proportion of texts labeled with the other gender and the proportion of ambiguous texts.

Example: We only look at generations that answer feminine prompts. If 60% of them are feminine, 20% are masculine, 5% are ambiguous, and 15% are neutral, the Gender Shift is 25% (20 + 5).
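A minimal sketch of this computation is given below. The function name `gender_shift` and the input format (a list of `(prompt_gender, label)` pairs) are assumptions made for illustration. Following the example above, neutral texts are not counted as shifted.

```python
def gender_shift(pairs):
    """Gender Shift in percent over (prompt_gender, label) pairs.

    A text counts as shifted if it is labeled with the other gender
    than its prompt, or if it is labeled ambiguous.
    """
    if not pairs:
        raise ValueError("cannot compute a Gender Shift on an empty corpus")
    shifted = 0
    for prompt_gender, label in pairs:
        overrides = label in ("masculine", "feminine") and label != prompt_gender
        if overrides or label == "ambiguous":
            shifted += 1
    return 100 * shifted / len(pairs)

# Generations from the example above, all answering feminine prompts.
labels = ["feminine"] * 60 + ["masculine"] * 20 + ["ambiguous"] * 5 + ["neutral"] * 15
pairs = [("feminine", label) for label in labels]
print(gender_shift(pairs))
```

On the example, 20 masculine and 5 ambiguous texts out of 100 give a Gender Shift of 25%.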

MascuLead

MascuLead is the name of the leaderboard based on the Gender Gap and the Gender Shift. Scores are reported separately depending on the direction of the Gender Gap (does the bias favor masculine or feminine markers?) and on the type of prompt that was used (neutral or gendered). More details about this leaderboard and its importance are given in this paper (link TBA).