TLDR

Have you ever wondered how to train an algorithm to produce a condensed text from an original document? In this article, we first discuss the state-of-the-art mechanisms behind abstractive summarization. Then, we show how we fine-tune models on our article database to automatically produce headings.

NB: This article is about the overall implementation of a Transformer model [1] and its different components rather than a coding walkthrough.

At Unify, we own and manage various press websites. To provide our audience with engaging content in a competitive market, we launched several NLP initiatives (e.g. content recommendations) to take advantage of the advances in this field. Our studies have revealed that readers generally prefer shorter articles. In response to these insights, we developed a summarization model based on article headings (chapô in French) to offer a short format to our readers. However, the state-of-the-art models are pre-trained on English, not French, which prevents us from using them directly.

There are two types of summarization models: extractive and abstractive. The first studies and models used the extractive method, which turns summarization into a classification problem: the summary is built by selecting sentences from the original content, with the model predicting whether each sentence should appear in the summary. This method produces a concise, grammatically correct text that accurately reflects the wording of the article. However, a summary should not plagiarize the original article by reusing its sentences, so we turned to the second category of models: the abstractive ones.

We decided to test two state-of-the-art models for French corpora: mT5 and BARThez. The mT5 [2] is the multilingual version of the Text-to-Text Transfer Transformer (T5) [3], published by Google in 2020. We also tried BARThez [5], the French version of the BART model [4], developed at Polytechnique Paris. Both models use the classical Transformer architecture with encoders and decoders. Let's present the differences:

Both models have an encoder that learns high-dimensional representations with bidirectional information (to predict a masked token, information flows from both the left and the right context). The decoder has access to that representation and to all of the encoder's hidden states in order to predict the next token, as shown in the figure below.

Figure: BARThez and mT5 architecture
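To make this encoder-decoder flow concrete, here is a minimal sketch of loading one of these pre-trained checkpoints with the Hugging Face transformers library and generating a summary. The checkpoint names ("moussaKam/barthez", "google/mt5-small") and the generation settings are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal sketch: load a pre-trained seq2seq checkpoint and generate a summary.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "moussaKam/barthez"  # or "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

article = "Texte de l'article à résumer..."
inputs = tokenizer(article, max_length=512, truncation=True, return_tensors="pt")

# The decoder generates the summary token by token, attending to the encoder states.
summary_ids = model.generate(
    **inputs,
    max_length=64,       # headings are short, so the output length is capped
    num_beams=4,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```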

The mentioned models are pre-trained and need to be fine-tuned on a specific task, in our case summarization. Nevertheless, we first need appropriate metrics and a baseline. To evaluate our models, we chose two metrics: ROUGE, which counts n-gram overlaps between the generated summary and the reference, and BERTScore, which compares them through contextual embeddings.
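As a rough illustration, both metrics can be computed with the Hugging Face `evaluate` library (an assumption here; the original evaluation code may differ):

```python
# Sketch of the evaluation step: ROUGE and BERTScore on predicted vs. reference headings.
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["résumé généré par le modèle"]
references = ["chapô écrit par le journaliste"]

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="fr")

print(rouge_scores["rouge1"])                            # ROUGE-1 score
print(sum(bert_scores["f1"]) / len(bert_scores["f1"]))   # mean BERTScore F1
```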

At Unify, we have a large amount of data. To fine-tune our models, we selected 27K articles: 21K for the training sample, 2,684 for validation, and 2,711 for the test sample. First, we computed the Lead baseline, which simply takes the N first sentences of a text as the summary. We chose N=1, following the BARThez paper, and obtained a ROUGE-1 score of 21.72 and a BERTScore of 67.45 on the test set.
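The Lead baseline is simple enough to sketch in a few lines; the sentence splitting on punctuation below is a simplification for illustration:

```python
# Lead baseline: the candidate summary is the first N sentences of the article (N=1 here).
import re

def lead_baseline(article: str, n: int = 1) -> str:
    """Return the first n sentences of the article as the summary."""
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    return " ".join(sentences[:n])

print(lead_baseline("Première phrase de l'article. Deuxième phrase. Troisième."))
# -> "Première phrase de l'article."
```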

As the two pre-trained models did not perform well out of the box on our data (article content as input, heading as target), we fine-tuned them on 8 Nvidia Tesla K80 GPUs with 12 GB of memory each, in order to obtain models adapted to our own articles. Because of memory constraints, we could only fine-tune the small version of mT5 and the base version of BARThez.

Furthermore, train, validation, and test PyTorch dataloaders were created to split our data into batches of size 32 (a limit imposed by GPU memory). For each epoch, the model is trained on the training batches and then evaluated on the validation set, as sketched below.
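The following condensed sketch shows the shape of that fine-tuning loop. It assumes `model` and `tokenizer` from the loading sketch above, and that `train_dataset` / `val_dataset` yield dicts of input_ids, attention_mask and labels; the optimizer settings are illustrative, not our exact hyperparameters.

```python
# Condensed fine-tuning loop: one training pass and one validation pass per epoch.
import torch
from torch.utils.data import DataLoader

BATCH_SIZE = 32  # limited by GPU memory
num_epochs = 5

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(num_epochs):
    # Training pass: cross-entropy loss between generated and reference heading tokens.
    model.train()
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Validation pass: monitor the loss to detect overfitting.
    model.eval()
    with torch.no_grad():
        val_loss = sum(
            model(**{k: v.to(device) for k, v in batch.items()}).loss.item()
            for batch in val_loader
        ) / len(val_loader)
    print(f"epoch {epoch}: validation loss = {val_loss:.4f}")
```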

After training our models for approximately 6 hours over 5 epochs, we obtained the best results with the BARThez model. Ultimately, to avoid overfitting, we kept the model trained for 3 epochs. The losses are shown below:

Training and validation losses per epoch (BARThez)

All the results after training are shown here:

Final results
Example

In terms of performance, we achieve good results compared to those reported by the authors of the BARThez paper [5]: our metric values are close to theirs. Moreover, the summaries predicted by the model are grammatically correct and include the essential elements needed to understand the content. However, the main business problem lies in the length of the summaries used as references to train the models: article headings are short, typically two or three sentences, so the generated summaries are also short!

Finally, as long as we don't have summaries written explicitly by journalists, our approach will be limited by the lack of labeled data. As a next step, we could try a translation approach in order to use the English state-of-the-art models: translate the articles from French to English with an API such as DeepL's before summarization, then translate the predicted English summaries back to French. We think this could yield longer, coherent summaries.
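A rough sketch of this translate-summarize-translate idea is shown below. The `translate` helper is hypothetical (it could wrap a service such as DeepL), and the English summarizer checkpoint is only illustrative.

```python
# Sketch of the round-trip pipeline: FR article -> EN -> English summarizer -> FR summary.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def translate(text: str, source: str, target: str) -> str:
    """Hypothetical wrapper around a translation API (e.g. DeepL)."""
    raise NotImplementedError

def summarize_french_article(article_fr: str) -> str:
    article_en = translate(article_fr, source="fr", target="en")
    summary_en = summarizer(article_en, max_length=130, min_length=30)[0]["summary_text"]
    return translate(summary_en, source="en", target="fr")
```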

References:

[1] A. Vaswani et al., "Attention Is All You Need", 2017.
[2] L. Xue et al., "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer", 2020.
[3] C. Raffel et al., "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", 2020.
[4] M. Lewis et al., "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension", 2019.
[5] M. Kamal Eddine, A. J.-P. Tixier, M. Vazirgiannis, "BARThez: a Skilled Pretrained French Sequence-to-Sequence Model", 2020.

