Gender Bias In Coreference Resolution

6 minute read


This post explores what coreference is, how it can be gender-biased, and some methods to reduce that bias, following the paper that introduced the WinoBias dataset.

Winograd and ambiguous references

The Winograd Schema Challenge was created as a form of the Turing test to determine how well an NLP (Natural Language Processing) system understands language. The task is to assign a referent to an ambiguous pronoun.

The city councilmen refused the demonstrators a permit because they feared/advocated violence.

Here, "they" refers to the councilmen if the verb is feared, and to the demonstrators if it is advocated. Resolving this correctly requires world knowledge in addition to grammar.
The paper uses Winograd-style schemas to build sentences with stereotypical and anti-stereotypical gender associations (based on occupational statistics):

  • A nurse is more likely to be female.
  • A carpenter is more likely to be male.

The following table shows the percentage of females in various occupations.

| Occupation | % female | Occupation | % female | Occupation | % female | Occupation | % female |
|---|---|---|---|---|---|---|---|
| Carpenter | 2 | Chief | 27 | Editor | 52 | Teacher | 78 |
| Mechanician | 4 | Janitor | 34 | Designer | 54 | Sewer | 80 |
| Construction worker | 4 | Lawyer | 35 | Accountant | 61 | Librarian | 84 |
| Labourer | 4 | Cook | 38 | Auditor | 61 | Assistant | 85 |
| Driver | 6 | Physician | 38 | Writer | 63 | Cleaner | 89 |
| Sheriff | 14 | CEO | 39 | Baker | 65 | Housekeeper | 89 |
| Mover | 18 | Analyst | 41 | Clerk | 72 | Nurse | 90 |
| Developer | 20 | Manager | 43 | Cashier | 73 | Receptionist | 90 |
| Farmer | 22 | Supervisor | 44 | Counsellor | 73 | Hairdresser | 92 |
| Guard | 22 | Salesperson | 48 | Attendant | 76 | Secretary | 95 |

Occupational statistics showing the percentage of females in various occupations.

Coreference and resolution systems

When we use pronouns in English, a pronoun can often refer to several different things in the sentence; that ambiguity is, if I may say so, a downside of using pronouns.

An example of complex coreference.

A coreference resolution system addresses this problem. In simple words, it tries to link each pronoun to the noun it actually refers to.

We will be discussing three such systems:

  • Stanford deterministic coreference system [RULE]: a handcrafted, rule-based system.
  • Berkeley coreference resolution system [FEATURE]: a learning-based system that relies on shallow surface features of the mentions.
  • UW end-to-end neural coreference resolution system [E2E]: the state-of-the-art neural model for coreference resolution.

Teaching the machine how to solve the problem works better than spelling out every individual step for it.
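
To make the contrast concrete, here is a toy sketch in the spirit of a handcrafted rule-based resolver. It is not the actual Stanford sieve system: the tiny gender list and the "nearest agreeing mention" rule are illustrative assumptions. Notice how the gender list itself becomes a source of bias.

```python
# Toy rule-based coreference heuristic (illustrative only, not the Stanford system):
# resolve a pronoun to the nearest preceding mention whose gender, according to a
# small handcrafted gender list, agrees with the pronoun.
GENDER_LIST = {"nurse": "female", "carpenter": "male"}   # stereotyped resource

PRONOUN_GENDER = {"he": "male", "him": "male", "his": "male",
                  "she": "female", "her": "female"}

def resolve(tokens, pronoun_index):
    """Return the index of the nearest earlier mention that agrees in gender."""
    target = PRONOUN_GENDER[tokens[pronoun_index].lower()]
    for i in range(pronoun_index - 1, -1, -1):
        if GENDER_LIST.get(tokens[i].lower()) == target:
            return i
    return None  # unresolved

tokens = "The carpenter met the nurse and he greeted her".split()
print(tokens[resolve(tokens, tokens.index("he"))])   # -> carpenter
print(tokens[resolve(tokens, tokens.index("her"))])  # -> nurse
```

A learned system such as E2E replaces these handwritten rules with patterns picked up from training data, which is exactly why it can also pick up the data's biases.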

Changes in the dataset to reduce gender bias

  1. Changing named entities to anonymous entries

    Barack Obama was re-elected.
    John went to his house.

    is changed to

    E1 was re-elected.
    E2 went to his house.

  2. Replacing male entities with female ones and vice versa.

    E2 went to her house.

This augmented dataset, used along with the original dataset (with named entities still anonymized), helps equalize the number of male and female pronouns seen in every context.
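
As a rough sketch of what this augmentation looks like in code (the swap table and the anonymization here are simplified assumptions, not the paper's exact rules; in particular, mapping "her" back to "his"/"him" needs part-of-speech information that is ignored here):

```python
# Minimal sketch of anonymization + gender-swapping augmentation (simplified).
SWAP = {"he": "she", "his": "her", "him": "her",
        "she": "he", "her": "his"}  # "her" is ambiguous (his/him); simplified here

def anonymize(tokens, names):
    """Replace named entities with anonymous placeholders E1, E2, ..."""
    mapping = {name: f"E{i + 1}" for i, name in enumerate(names)}
    return [mapping.get(t, t) for t in tokens]

def gender_swap(tokens):
    """Swap gendered pronouns so each context is seen with both genders."""
    return [SWAP.get(t.lower(), t) for t in tokens]

original = anonymize("John went to his house".split(), names=["John"])
augmented = gender_swap(original)
print(original)   # ['E1', 'went', 'to', 'his', 'house']
print(augmented)  # ['E1', 'went', 'to', 'her', 'house']
```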

  • Much of the bias in the dataset arises from history. Is changing historical sentences the right way to remove bias from a dataset?
  • Removing gender bias in this way could also discard information that the system would otherwise learn from.
  • This trade-off was exactly what the authors expected, so they graded how well their data augmentation worked by the difference in performance between the pre-trained models and the models newly trained on the augmented dataset.

The need to make their own dataset

Often in machine learning, especially NLP, you will realize that you need a more specific dataset for your task. The same happened here.
While OntoNotes 5.0 was fine for training all of the coreference resolution systems, it was difficult to apply the data augmentation techniques and then evaluate gender bias on the same data, since the augmented data had already been used for training. On top of that, the original dataset had very few anti-stereotypical entries.

  • So they built a new dataset, WinoBias, to better measure gender bias in these resolution systems, using occupations and pairs of sentences. It evaluates two types of sentences (a small illustrative sketch of the templates appears after the figure below):

    One with syntactic cues

    { entity1 } { interacts with } { entity2 } and then { interacts with } { pronoun } for { circumstances }.

    and the other with fewer or no cues

    { entity1 } { interacts with } { entity2 } { conjunction } { pronoun } { circumstances }

Images showing ambiguous pronoun usage without and with syntactic cues. An arrow points to the noun the pronoun refers to; a dotted line marks an anti-stereotypical reference.
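
As a small illustration of how such template pairs can be instantiated (the wording below is made up for illustration and is not taken verbatim from the WinoBias dataset):

```python
# Illustrative filling of the two template types with an occupation pair.
def type_with_cues(entity1, entity2, pronoun):
    # The second verb's object position rules out entity1 as the referent,
    # so syntax alone points the pronoun at entity2.
    return f"The {entity1} called the {entity2} and then asked {pronoun} to book a room."

def type_without_cues(entity1, entity2, pronoun):
    # Either entity is grammatically possible; world knowledge (or bias) decides.
    return f"The {entity1} hired the {entity2} because {pronoun} was overwhelmed with work."

print(type_without_cues("physician", "secretary", "he"))   # "he" is stereotypically pulled toward the physician
print(type_without_cues("physician", "secretary", "she"))  # "she" is stereotypically pulled toward the secretary
print(type_with_cues("physician", "secretary", "her"))     # syntax forces "her" -> the secretary
```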

Reducing bias from resources

  1. Word embeddings: GloVe vectors were replaced with debiased vectors (an approach that has itself been criticized in follow-up papers); a small sketch of the debiasing idea follows below this list.
  2. Gender lists: used by the non-embedding approaches (feature-rich and rule-based)
    • the counts of female and male nouns in the lists were equalized
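
As promised above, here is a minimal sketch of the idea behind debiased word vectors, namely projecting out a gender direction. This follows the general hard-debiasing recipe; the toy three-dimensional vectors are random and purely illustrative, not the vectors used in the paper.

```python
import numpy as np

def gender_direction(emb, pairs=(("he", "she"), ("man", "woman"))):
    """Estimate a gender direction from differences of gendered word pairs."""
    diffs = [emb[a] - emb[b] for a, b in pairs if a in emb and b in emb]
    g = np.mean(diffs, axis=0)
    return g / np.linalg.norm(g)

def neutralize(vec, g):
    """Remove the component of `vec` that lies along the gender direction."""
    return vec - np.dot(vec, g) * g

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=3) for w in ["he", "she", "man", "woman", "nurse"]}
g = gender_direction(emb)
emb["nurse"] = neutralize(emb["nurse"], g)
print(np.dot(emb["nurse"], g))  # ~0: "nurse" no longer carries the gender component
```
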
  • The result: despite all the critiques, the biases were almost entirely removed while system performance was barely affected (remember that a change in performance would show that the system was biased initially).
  • Two of the three methods showed a significant drop in bias on WinoBias (the Stanford deterministic system is rule-based and cannot be retrained).
| Method | OntoNotes | T1-p | T1-a | Avg | \|Diff\| | T2-p | T2-a | Avg | \|Diff\| |
|---|---|---|---|---|---|---|---|---|---|
| E2E | 67.2 | 74.9 | 47.7 | 61.3 | 27.2* | 88.6 | 77.3 | 82.9 | 11.3* |
| E2E-R | 66.5 | 62.4 | 60.3 | 61.3 | 2.1 | 78.4 | 78.0 | 78.2 | 0.4 |
| Feature | 64.0 | 62.9 | 58.3 | 60.6 | 4.6* | 68.5 | 57.8 | 63.1 | 10.7* |
| Feature-R | 63.6 | 62.2 | 60.6 | 61.4 | 1.7 | 70.0 | 69.5 | 69.7 | 0.6 |
| Rule | 58.7 | 72.0 | 37.5 | 54.8 | 34.5* | 47.8 | 26.6 | 37.2 | 21.2* |

The table of results for the entire process.

-R: results after data augmentation and the resource-bias changes.
-p: pro-stereotypical, -a: anti-stereotypical; T1/T2: the two WinoBias sentence types (T1 without syntactic cues, T2 with syntactic cues).
Avg: average of the -p and -a scores, |Diff|: absolute difference between the -p and -a scores.
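
To read the table, Avg and |Diff| are computed directly from the pro- and anti-stereotypical scores, for example for the E2E row on Type 1 sentences:

```python
# Avg and |Diff| for the E2E / T1 cell, using the numbers from the table above.
t1_pro, t1_anti = 74.9, 47.7
avg = (t1_pro + t1_anti) / 2    # 61.3
diff = abs(t1_pro - t1_anti)    # 27.2 -- a large gap means a large bias
print(round(avg, 1), round(diff, 1))
```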

A few observations from the results

  • Pro-stereotypical sentences were resolved much more accurately than anti-stereotypical ones.
    • Being biased helped the system predict the references more easily, meaning the system learned its bias from the data it was given.
  • Given syntactic cues, the systems show much less bias and higher average accuracy (across pro- and anti-stereotypical sentences).
    • This suggests that on larger datasets, where world knowledge and sentence-level information are more easily accessible to the prediction systems, a system is less likely to produce a gender-biased outcome and will rely more on the cues given to it.
    • However, it also means that in ambiguous cases it will most likely predict the gender-biased outcome, since that is the statistically safer guess (after all, we cannot deny that carpenters are more likely to be men).
  • E2E had more than twice the initial bias of the feature-based system, even though E2E is the state-of-the-art method. This shows that even the best model can carry a heavy gender bias, because that bias is ingrained in our language and society.

This makes natural language processing all the more challenging, because bias needs to be eliminated from the system without hurting accuracy. The methods described above are not perfect, but they are a step forward in what we can do.