Using Rubrix to improve a misogyny detection model

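Hi everyone! In this article I will show how we used Rubrix to improve our training data and, with it, the performance of an existing deep learning model for Automatic Misogyny Identification (AMI): an NLP model able to detect potential misogyny in a given text. The topic sees breakthroughs every year, with shared tasks and new ideas bringing the performance of these models closer and closer to that of human moderation teams.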

Besides Rubrix, we used our open-source library biome.text to train the model through a simple workflow.

Our goal

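I was preparing my bachelor's thesis on automatic misogyny identification, with data from several shared tasks and competitions on the topic. In that setting we wanted to use Rubrix to go beyond the classic linear workflow common in deep learning: collect some data, preprocess it, train a model and start making inferences. Rubrix breaks that pattern: it lets us retrain and fine-tune with a human-in-the-loop approach and add new data in later training rounds.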

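We want to discuss this in more detail in this article. It is not meant to be a tutorial, even though it includes some code snippets that reproduce our process and give similar results (we simplify some aspects to keep it light); it is more of a story. We focus on the process and on what we learned along the way.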

Background

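The thesis focuses on Spanish, and misogyny datasets of Spanish text are very scarce. That was the first difficulty: finding datasets to start training models. Fortunately, there is a very active shared-task community working on this problem that covers many non-English languages. We started with data from IberEval 2018, a shared task that provided a collection of tweets analyzed by experts and classified into 5 misogyny categories and 2 targets. We trained our first model on about three thousand annotated instances.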

Later, the EXIST task at IberLEF 2021 caught our attention. We saw it as a perfect opportunity to apply our misogyny detection work, for two reasons:

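To test our misogyny detection approach on a different subdomain and on different data.
To extend our dataset of about three thousand instances.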

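We described our participation in the EXIST task in a post on our blog. After submitting our runs, we shifted our focus to integrating the new data into our corpus. The classification schemes differed: IberEval 2018 used 5 categories and 2 targets, whereas IberLEF 2021 had no target field and its categories were different.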

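With a classic deep learning approach, our journey would have ended here, with two different, incompatible datasets. Thanks to Rubrix, it was only the beginning.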

Merging data from different sources

The thesis is built around the IberEval 2018 corpus. Its misogyny classification was defined by the organizers and uses the following fields:

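Misogyny: a binary field stating whether the tweet is sexist.
Misogyny category: states whether the text contains misogynous behavior and of which type. There are 5 possible categories: stereotype and objectification, dominance, derailing, sexual harassment and threats of violence, and discredit.
Misogyny target: active, if the misogynous behavior targets a specific woman or group of women; passive, if it targets many potential receivers, or even women as a gender.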

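After taking part in EXIST at IberLEF 2021, we decided to include more training data in the thesis. Initially we wanted to scrape Twitter and retrieve misogynous tweets manually. However, using data that had already been annotated by experts, and that the NLP community also uses, seemed the right thing to do. The only problem was the labeling system, which was different: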

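Misogyny: the same binary field as in IberEval 2018.
Misogyny category: 5 different categories: ideological and inequality, stereotyping and dominance, objectification, sexual violence, and misogyny and non-sexual violence. This labeling system has no misogyny target.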
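The mismatch between the two labeling systems can be made concrete with a small sketch. The IberEval 2018 category names follow the labels used in the training pipeline in the appendix; the EXIST identifiers, the `active`/`passive` strings, the `Record` class and the `fits_ibereval` helper are illustrative assumptions for this example, not the spellings or code used in the original files.

```python
from dataclasses import dataclass
from typing import List, Optional

# IberEval 2018 scheme: 5 categories plus a target.
IBEREVAL_CATEGORIES = {
    "stereotype", "dominance", "derailing", "sexual_harassment", "discredit",
}
IBEREVAL_TARGETS = {"active", "passive"}  # assumed spelling

# EXIST (IberLEF 2021) scheme: 5 different categories, no target.
# These identifiers are assumed for illustration.
EXIST_CATEGORIES = {
    "ideological_inequality",
    "stereotyping_dominance",
    "objectification",
    "sexual_violence",
    "misogyny_non_sexual_violence",
}


@dataclass
class Record:
    """Hypothetical container for one annotated tweet."""
    text: str
    sexist: bool
    categories: List[str]
    target: Optional[str] = None


def fits_ibereval(record: Record) -> bool:
    """Return True if the record is usable as-is under the IberEval 2018 scheme."""
    if not record.sexist:
        # Non-sexist records carry no category and no target.
        return not record.categories and record.target is None
    return (
        len(record.categories) > 0
        and set(record.categories) <= IBEREVAL_CATEGORIES
        and record.target in IBEREVAL_TARGETS
    )


exist_record = Record("...", sexist=True, categories=["objectification"])
ibereval_record = Record("...", sexist=True, categories=["dominance"], target="active")

print(fits_ibereval(exist_record))     # False: unknown category, missing target
print(fits_ibereval(ibereval_record))  # True
```

An EXIST record fails on two counts (a category outside the IberEval set and no target), which is why the data had to be re-annotated rather than renamed.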

At this point Rubrix became an essential tool for our work. It let us explore both corpora and run an annotation phase to adapt the EXIST data to the IberEval 2018 standard.

Annotating as a single annotator

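Our goal was to adapt the data from the EXIST shared task (IberLEF 2021) to the IberEval 2018 standard. So, as a single annotator, we uploaded the IberLEF 2021 dataset and started exploring its predictions and annotations.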

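If you want to upload the EXIST data yourself, follow the snippet below (it reads the labeled Spanish file from the Temis repository):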

    import pandas as pd
    import rubrix as rb

    # Labeled Spanish EXIST file from the Temis repository
    annotation_df = pd.read_csv(
        "https://raw.githubusercontent.com/ignacioct/Temis/main/datasets/IberLEF%202021/Spanish/EXIST2021_test_labeled_spanish.csv"
    )

    # One multi-label text classification record per tweet; the original EXIST
    # labels are kept as metadata for reference while annotating
    records = []
    for index, row in annotation_df.iterrows():
        item = rb.TextClassificationRecord(
            id=index,
            inputs={"text": row["text"]},
            multi_label=True,
            metadata={
                "task1_annotation": row["task1"],
                "task2_annotation": row["task2"],
            },
        )
        records.append(item)

    rb.log(
        records=records,
        name="single_annotation",
        tags={"project": "misogyny", "annotator": "ignacio"},
    )

Once the dataset is logged into Rubrix, we can start annotating in the UI. Here is a quick recap of how it is done:

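Open Rubrix in your browser. If you are running it locally, it is usually at http://localhost:6900.
Select the single_annotation dataset.
In the top right corner, switch on Annotation mode.
Start choosing the categories you think fit the input text. If you do not speak Spanish, do not worry! 15 instances will not have much impact on the final model, and you can still learn how to annotate.
For each instance, you can annotate it by clicking the categories, discard the record (if you think it does not fit the problem domain), or leave it unannotated.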
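To check progress after an annotation session, the dataset can be loaded back and the annotated records counted. The sketch below works on a toy pandas DataFrame standing in for the result of rb.load("single_annotation"); it assumes only an `annotation` column that is empty for records not yet annotated. The `annotation_progress` helper is hypothetical and not part of Rubrix.

```python
import pandas as pd


def annotation_progress(df: pd.DataFrame) -> dict:
    """Count annotated and pending records in a loaded dataset."""
    annotated = int(df["annotation"].notna().sum())
    total = len(df)
    return {"annotated": annotated, "pending": total - annotated, "total": total}


# Toy stand-in for a loaded dataset; the labels are illustrative.
toy = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "annotation": [["dominance"], None, ["non-sexist"]],
    }
)

print(annotation_progress(toy))
# {'annotated': 2, 'pending': 1, 'total': 3}
```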

Annotating as a team

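We put together a team of 5 annotators who worked for a week converting instances from the EXIST standard to the IberEval 2018 standard. To do that, we needed a way to merge several annotations of the same instance into one while preserving the will of the majority, and this is where inter-annotator agreement (IAA) came in. There are many kinds of IAA, some rule-based and others based on statistics.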
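As an example of the statistical kind, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance. This is not the measure we used (our system was rule-based); the sketch below is a plain-Python illustration for two annotators and single-label data, with made-up labels.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items (one label each)."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same, non-empty set of items")

    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label independently.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


annotator_1 = ["sexist", "sexist", "non-sexist", "non-sexist"]
annotator_2 = ["sexist", "non-sexist", "non-sexist", "non-sexist"]
print(cohen_kappa(annotator_1, annotator_2))  # 0.5
```

Here the annotators agree on 3 of 4 items (0.75) while chance agreement is 0.5, giving a kappa of 0.5. With five annotators and multi-label data, a measure designed for that setting would be needed instead.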

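Here is a simplification of our IAA as a system of rules:

For an instance to be annotated with a category, at least two annotators must agree on it.
If there is consensus on a sexist category but other annotators consider that the instance contains no sexism, the instance is discarded.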
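The rules above can be sketched as a small function over the five annotators' label lists. The `merge_annotations` name and the result strings are illustrative, and the `"0"` value for "no sexism" is an assumption taken from the simplified merging code later in this post.

```python
from collections import Counter
from typing import List, Optional

NON_SEXIST = "0"  # assumed label for "no sexism"


def merge_annotations(
    annotations: List[Optional[List[str]]], min_votes: int = 2
) -> str:
    """Merge the labels of several annotators for one instance.

    Each entry is the list of categories chosen by one annotator,
    or None if that annotator left the instance unannotated.
    """
    votes = Counter(
        label
        for labels in annotations
        if labels is not None
        for label in labels
    )
    if not votes:
        return "non-annotated"

    top_label, top_count = votes.most_common(1)[0]
    # Rule 1: at least `min_votes` annotators agree on the category.
    # Rule 2: any "no sexism" vote blocks the consensus.
    if top_count >= min_votes and NON_SEXIST not in votes:
        return top_label
    return "no-consensus"


print(merge_annotations([["dominance"], ["dominance"], None, None, None]))
# dominance
print(merge_annotations([["dominance"], ["dominance"], ["0"], None, None]))
# no-consensus
print(merge_annotations([None, None, None, None, None]))
# non-annotated
```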

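Our annotation team was made up of Amélie, Leire, Javier, Víctor and Ignacio. The next cell downloads the raw annotations made by our annotators (the unannotated version is the one downloaded in the previous section). After that, we will retrieve these annotated datasets from Rubrix with the load command.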

If you want to explore all the datasets, code and resources used throughout the thesis, you can find them on the Temis GitHub page. Come and say hi!

    # Raw annotations from each of the five annotators
    annotation_1_df = pd.read_json('https://raw.githubusercontent.com/ignacioct/Temis/main/datasets/Annotation/temis_retraining_1.json')
    annotation_2_df = pd.read_json('https://raw.githubusercontent.com/ignacioct/Temis/main/datasets/Annotation/temis_retraining_2.json')
    annotation_3_df = pd.read_json('https://raw.githubusercontent.com/ignacioct/Temis/main/datasets/Annotation/temis_retraining_3.json')
    annotation_4_df = pd.read_json('https://raw.githubusercontent.com/ignacioct/Temis/main/datasets/Annotation/temis_retraining_4.json')
    annotation_5_df = pd.read_json('https://raw.githubusercontent.com/ignacioct/Temis/main/datasets/Annotation/temis_retraining_5.json')

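Now let's log this information into Rubrix. We show how to log one of the datasets; simply repeat the process, changing the dataset name, so that each one is logged separately and each annotator knows which dataset to annotate.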

    # Log the first annotator's dataset, keeping the existing annotation
    # and the original EXIST labels as metadata
    records = []
    for index, row in annotation_1_df.iterrows():
        item = rb.TextClassificationRecord(
            id=index,
            inputs={"text": row["text"]},
            annotation=row["annotation"],
            annotation_agent="annotator 1",
            multi_label=True,
            metadata={
                "task1_annotation": row["task1"],
                "task2_annotation": row["task2"],
            },
        )
        records.append(item)

    rb.log(
        records=records,
        name="annotation_misogyny_1",
        tags={"project": "misogyny", "annotator": "annotator 1"},
    )

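One thing to keep in mind is that we have simplified the problem for the sake of this write-up. You can find more information here on how the labels annotated by our annotators are divided into two subcategories.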

After logging and exploring, we can load these datasets back from Rubrix.

    # Index every dataset by record id and sort it, so that row i refers to
    # the same tweet in all five DataFrames
    annotation_1 = rb.load("annotation_misogyny_1").set_index("id").sort_index()
    annotation_2 = rb.load("annotation_misogyny_2").set_index("id").sort_index()
    annotation_3 = rb.load("annotation_misogyny_3").set_index("id").sort_index()
    annotation_4 = rb.load("annotation_misogyny_4").set_index("id").sort_index()
    annotation_5 = rb.load("annotation_misogyny_5").set_index("id").sort_index()

rb.load() returns a pandas DataFrame. We will use pandas to merge our annotations into a single dataset.

    # Counter tallies how many times each category was chosen
    from collections import Counter

    rows = []

    # Iterate through the datasets; all of them have the same length and order
    for i in range(len(annotation_1)):
        # Categories chosen by each annotator: a list of labels, or None if
        # the annotator left the record without annotation
        per_annotator = [
            annotation_1.iloc[i]["annotation"],
            annotation_2.iloc[i]["annotation"],
            annotation_3.iloc[i]["annotation"],
            annotation_4.iloc[i]["annotation"],
            annotation_5.iloc[i]["annotation"],
        ]

        # Flatten into a single list of labels, skipping missing annotations
        annotated_categories = [
            label
            for labels in per_annotator
            if labels is not None
            for label in labels
        ]
        counted_annotations = Counter(annotated_categories)

        if not annotated_categories:
            # Nobody annotated this record
            merged_annotation = "non-annotated"
        else:
            # The most voted category needs at least two votes, and no annotator
            # may have chosen "0" (taken here to be the non-sexist label)
            top_category = max(counted_annotations, key=counted_annotations.get)
            if counted_annotations[top_category] >= 2 and "0" not in counted_annotations:
                merged_annotation = top_category
            else:
                merged_annotation = "no-consensus"

        # Everything except the annotation is identical across annotators, so
        # we read the id and text from the first annotator's dataset
        rows.append(
            {
                "id": annotation_1.index[i],
                "text": annotation_1.iloc[i]["inputs"]["text"],
                "annotation": merged_annotation,
                "annotation_agent": "Recognai Team",
            }
        )

    # DataFrame.append was removed in pandas 2.0, so we build the frame from the rows
    annotation_final = pd.DataFrame(
        rows, columns=["id", "text", "annotation", "annotation_agent"]
    )

The corpus we obtained

After our annotation sessions, we obtained 517 new instances, which we added to the thesis corpus. Of these, 332 were annotated as non-sexist and 184 as sexist.

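We followed a multi-label annotation approach, so each instance can have more than one misogyny category. For example, texts containing sexual harassment often also contain some form of dominance or objectification, and we wanted to cover those cases. This is our category distribution.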
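A distribution of this kind can be computed by counting every label of every instance. The sketch below uses a handful of made-up annotations, not our corpus, and the `category_distribution` helper is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def category_distribution(annotations: List[List[str]]) -> Dict[str, int]:
    """Count how many instances carry each category (multi-label aware)."""
    counts = Counter(label for labels in annotations for label in labels)
    # Most frequent categories first.
    return dict(counts.most_common())


# Made-up annotations for illustration only.
toy_annotations = [
    ["sexual_harassment", "dominance"],
    ["stereotype"],
    ["dominance"],
    ["non-sexist"],
]

distribution = category_distribution(toy_annotations)
print(distribution)
# {'dominance': 2, 'sexual_harassment': 1, 'stereotype': 1, 'non-sexist': 1}
```

Because an instance can carry several categories, the counts add up to more than the number of instances (5 labels over 4 instances here).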

Finally, we also classified the target of each instance. We did not allow multi-label annotation here: a sexist text cannot be both active and passive.

Conclusions

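Thanks to this annotation session, we were able to add 517 new instances to our corpus, which improved the performance of our misogyny detection model. The model was later published as a RESTful API that application developers and users can query for predictions and build moderation pipelines around.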

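Beyond improving the model's performance, we wanted to explore a development lifecycle with Rubrix in which models improve over time, with people acting as the humans in the loop: analyzing the output of our first model, looking for its weaknesses and trying to strengthen them. We see data science as an iterative process in which monitoring the models you obtain and improving them iteratively is key.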

We invite you to try Rubrix and join the conversation! Check out our GitHub page and discussion forum to share ideas, ask questions, or just say hi.

Appendix

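Below are some of the steps we kept in the background while writing this guide. If you want to reproduce everything we did, including training the model and a few extras, the cells below let you do it. Feel free to change anything and try new things, and let us know in our GitHub forum if you have any questions or find anything interesting.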

Dependencies and installation

In this guide we provide only minimal code for our use case. To fully reproduce our process, you first need to install Rubrix, biome.text and pandas. We also import them.

    %pip install -U git+https://github.com/recognai/biome-text
    %pip install rubrix
    %pip install pandas
    exit(0)  # Force restart of the runtime

    from biome.text import *
    import pandas as pd
    import rubrix as rb

Training our first model

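To reproduce a simplified version of the first trained model (before annotation), you can run the following cells. We have already searched for a good enough configuration, so you can skip that step.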

Let's start by loading the datasets:

    # Loading the datasets
    training_ds = Dataset.from_csv('https://raw.githubusercontent.com/ignacioct/Temis/main/datasets/IberEval%202018/training_full_df.csv', index=False)
    test_ds = Dataset.from_csv('https://raw.githubusercontent.com/ignacioct/Temis/main/datasets/IberEval%202018/test_df.csv', index=False)

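Creating an NLP pipeline with biome.text is quick and convenient! We ran a hyperparameter optimization (HPO) process in the background to find suitable hyperparameters for this domain, so let's use them to create our first AMI model. Note that the pipeline uses BETO, a Spanish Transformer model, as its feature extractor, with a text classification head on top.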

    pipeline_dict = {
        "name": "AMI_first_model",
        "features": {
            "transformers": {
                "model_name": "dccuchile/bert-base-spanish-wwm-cased",  # BETO model
                "trainable": True,
                "max_length": 280,  # As we are working with data from Twitter, this is our max length
            }
        },
        "head": {
            "type": "TextClassification",
            # These are the possible misogyny categories.
            "labels": [
                "sexual_harassment",
                "dominance",
                "discredit",
                "stereotype",
                "derailing",
                "non-sexist",
            ],
            "pooler": {
                "type": "lstm",
                "num_layers": 1,
                "hidden_size": 256,
                "bidirectional": True,
            },
        },
    }

    pl = Pipeline.from_config(pipeline_dict)

    # Optimizer settings found by the HPO run
    trainer_config = TrainerConfiguration(
        optimizer={
            "type": "adamw",
            "lr": 0.000023636840436059507,
            "weight_decay": 0.01438297700463013,
        },
        batch_size=8,
        max_epochs=10,
    )

    trainer = Trainer(
        pipeline=pl,
        train_dataset=training_ds,
        valid_dataset=test_ds,
        trainer_config=trainer_config,
    )

    trainer.fit()

Once trainer.fit() finishes, the training results and the resulting model will be in the output folder.

We can now make some predictions and see how the model performs.

    pl.predict("Las mujeres no deberían tener derecho a voto")
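To read such a prediction, it helps to keep only the categories above a probability threshold. The sketch below assumes the prediction is a dictionary with parallel `labels` and `probabilities` lists; that shape, the `labels_above` helper and the numbers in the example are assumptions for illustration, so check them against the actual output of pl.predict.

```python
from typing import Dict, List, Tuple


def labels_above(
    prediction: Dict[str, List], threshold: float = 0.5
) -> List[Tuple[str, float]]:
    """Return (label, probability) pairs at or above the threshold, highest first."""
    pairs = zip(prediction["labels"], prediction["probabilities"])
    selected = [(label, prob) for label, prob in pairs if prob >= threshold]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Made-up prediction in the assumed output shape.
example_prediction = {
    "labels": ["dominance", "stereotype", "non-sexist"],
    "probabilities": [0.62, 0.55, 0.08],
}

print(labels_above(example_prediction))
# [('dominance', 0.62), ('stereotype', 0.55)]
```

A threshold fits the multi-label setting described earlier, where one text may belong to several categories at once.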
