diff --git a/TraiNMT_abridged.ipynb b/TraiNMT_abridged.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..7e7aa4b845db2a1b0fe07b47dd97a67fd2fb8ae4 --- /dev/null +++ b/TraiNMT_abridged.ipynb @@ -0,0 +1,762 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + }, + "accelerator": "GPU", + "gpuClass": "standard" + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Training your own neural machine translation system\n", + "\n", + "## Or, how to constrain an MT tool to literary fiction\n", + "\n", + "<br>\n", + "\n", + "Abridged version of the workbook prepared for the colloquium:\n", + "\n", + "*Traduction littéraire et intelligence artificielle : théorie, pratique, création*\n", + "\n", + "21 October 2022 — Paris, France\n", + "\n", + "<br>\n", + "\n", + "Damien Hansen\n", + "\n", + "Centre Interdisciplinaire de Recherche en Traduction et en Interprétation (Université de Liège, Belgium)\n", + "\n", + "Laboratoire d'Informatique de Grenoble (Université Grenoble Alpes, France)\n", + "\n", + "<br>\n", + "\n", + "This content is distributed under the CC BY-SA 4.0 license.\n", + "\n", + "As long as the original work is duly credited and any shared work is distributed under the same license, you are free to:\n", + "- share — copy, distribute, and transmit the material in any medium and format;\n", + "- adapt — remix, transform, and build upon the material\n", + "for any purpose, even commercially."
+ ], + "metadata": { + "id": "X03ZBj1mAJAd" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Step 1 — Installation" + ], + "metadata": { + "id": "TfL4sFfOAP8V" + } + }, + { + "cell_type": "markdown", + "source": [ + "Installing the required tools:\n", + "- `OpusFilter` (corpus download);\n", + "- `fast-mosestokenizer` (tokenization);\n", + "- `SentencePiece` (subword segmentation);\n", + "- `OpenNMT-py` (system training);\n", + "- `sacreBLEU` (evaluation)." + ], + "metadata": { + "id": "AHoIaZ01orNz" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "KAmADpz1OVD0" + }, + "outputs": [], + "source": [ + "%%bash\n", + "rm -rf sample_data\n", + "pip install opusfilter opennmt-py==2.3.0\n", + "mkdir datasets subword tools output output/{log,models,tensor,translations,vocab}" + ] + }, + { + "cell_type": "markdown", + "source": [ + "Manual installation of the fast-mosestokenizer and SentencePiece modules for use in Colab:" + ], + "metadata": { + "id": "XB-k15MJoJVx" + } + }, + { + "cell_type": "code", + "source": [ + "%%bash\n", + "apt-get install -y libgoogle-perftools-dev\n", + "apt-get install -y libglib2.0-dev\n", + "cd tools\n", + "git clone https://github.com/google/sentencepiece.git\n", + "cd sentencepiece\n", + "mkdir build\n", + "cd build\n", + "cmake ../\n", + "make -j $(nproc)\n", + "make install\n", + "ldconfig -v\n", + "cd ../../\n", + "git clone https://github.com/mingruimingrui/fast-mosestokenizer.git\n", + "cd fast-mosestokenizer\n", + "git clone https://code.googlesource.com/re2\n", + "cd re2\n", + "make\n", + "make install\n", + "cd ../\n", + "mkdir build\n", + "cd build\n", + "cmake ../\n", + "make install\n", + "cd ../../../" + ], + "metadata": { + "id": "7hE6jcjz8Yu-" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Downloading a few corpora commonly used for the English-French pair with the [OpusFilter](https://github.com/Helsinki-NLP/OpusFilter) module (~15–20 minutes):\n", + "\n", + "<details>\n", + "<summary>Note</summary>\n", + "The module throws an error for the raw version of the <i>GlobalVoices</i> and <i>News-Commentary</i> corpora, so we use a different version here, which takes slightly longer to download.\n", + "</details>" + ], + "metadata": { + "id": "UjKo3gBAqY_V" + } + }, + { + "cell_type": "code", + "source": [ + "with open('corpus.yaml', 'w', encoding='utf-8') as config:\n", + " config.write('''common:\n", + "\n", + " output_directory: datasets\n", + "\n", + "steps:\n", + "\n", + " - type: opus_read\n", + " parameters:\n", + " corpus_name: Books\n", + " source_language: en\n", + " target_language: fr\n", + " release: v1\n", + " preprocessing: raw\n", + " src_output: books_raw.en\n", + " tgt_output: books_raw.fr\n", + " suppress_prompts: true\n", + "\n", + " - type: opus_read\n", + " parameters:\n", + " corpus_name: Europarl\n", + " source_language: en\n", + " target_language: fr\n", + " release: v8\n", + " preprocessing: raw\n", + " src_output: europarl_raw.en\n", + " tgt_output: europarl_raw.fr\n", + " suppress_prompts: true\n", + "\n", + " - type: opus_read\n", + " parameters:\n", + " corpus_name: GlobalVoices\n", + " source_language: en\n", + " target_language: fr\n", + " release: v2018q4\n", + " preprocessing: xml\n", + " src_output: globalvoices_raw.en\n", + " tgt_output: globalvoices_raw.fr\n", + " suppress_prompts: true\n", + "\n", + " - type: opus_read\n", + " parameters:\n", + " corpus_name: News-Commentary\n", + " source_language: en\n", + " target_language: fr\n", + " release: v16\n", + " preprocessing: xml\n", + " src_output: news_raw.en\n", + " tgt_output: news_raw.fr\n", + " suppress_prompts: true\n", + "\n", + " - type: opus_read\n", + " parameters:\n", + " corpus_name: TED2020\n", + " source_language: en\n", + " target_language: fr\n", + " release: v1\n", + " preprocessing: raw\n", + " src_output: ted_raw.en\n", + " 
tgt_output: ted_raw.fr\n", + " suppress_prompts: true''')\n", + "config.close()\n", + "\n", + "!opusfilter corpus.yaml\n", + "!rm ./datasets/*.gz ./datasets/*.zip" + ], + "metadata": { + "id": "-JIzslVIrtad" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 2 — Data preparation" + ], + "metadata": { + "id": "SduJfP1m5-xS" + } + }, + { + "cell_type": "markdown", + "source": [ + "Tokenization and normalization with the [fast-mosestokenizer](https://github.com/mingruimingrui/fast-mosestokenizer) module:" + ], + "metadata": { + "id": "3khJ7E3QquaF" + } + }, + { + "cell_type": "code", + "source": [ + "%%bash\n", + "for file in datasets/*_raw.en ;\\\n", + "do filename=$(basename $file _raw.en) ;\\\n", + "mosestokenizer -N en < datasets/${filename}_raw.en > datasets/${filename}.en ;\\\n", + "done\n", + "\n", + "for file in datasets/*_raw.fr ;\\\n", + "do filename=$(basename $file _raw.fr) ;\\\n", + "mosestokenizer -N fr < datasets/${filename}_raw.fr > datasets/${filename}.fr ;\\\n", + "done\n", + "\n", + "rm datasets/*_raw.{en,fr}" + ], + "metadata": { + "id": "5D5wO_JeqbV_" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Creating a subword segmentation model with [SentencePiece](https://github.com/google/sentencepiece) (~15 minutes):" + ], + "metadata": { + "id": "OKRpaCPrr5Jr" + } + }, + { + "cell_type": "code", + "source": [ + "%%bash\n", + "spm_train \\\n", + " --input=datasets/books.en,datasets/europarl.en,datasets/globalvoices.en,datasets/news.en,datasets/ted.en\\\n", + " --model_prefix=./subword/unigram_en\\\n", + " --vocab_size=16000\\\n", + " --character_coverage=1.0\\\n", + " --model_type=unigram\n", + "\n", + "spm_train \\\n", + " --input=datasets/books.fr,datasets/europarl.fr,datasets/globalvoices.fr,datasets/news.fr,datasets/ted.fr\\\n", + " --model_prefix=./subword/unigram_fr\\\n", + " --vocab_size=16000\\\n", + " --character_coverage=1.0\\\n", + " 
--model_type=unigram" + ], + "metadata": { + "id": "oDwBTeqo83lL" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Creating the training, validation, and test sub-corpora:" + ], + "metadata": { + "id": "a29V4dAqwh-Z" + } + }, + { + "cell_type": "code", + "source": [ + "%%bash\n", + "head -n -11000 datasets/books.en > datasets/trn.en; truncate -s -1 datasets/trn.en\n", + "head -n -11000 datasets/books.fr > datasets/trn.fr; truncate -s -1 datasets/trn.fr\n", + "\n", + "tail -n 11000 datasets/books.en > datasets/val.en ; truncate -s -1 datasets/val.en\n", + "tail -n 11000 datasets/books.fr > datasets/val.fr ; truncate -s -1 datasets/val.fr\n", + "\n", + "tail -n 1000 datasets/val.en > datasets/tra.en ; head -n -1000 datasets/val.en > temp.txt ; mv temp.txt datasets/val.en\n", + "tail -n 1000 datasets/val.fr > datasets/tra.fr ; head -n -1000 datasets/val.fr > temp.txt ; mv temp.txt datasets/val.fr" + ], + "metadata": { + "id": "QPBW3Ushw1vM" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 3 — Training" + ], + "metadata": { + "id": "amt6QopgzP6F" + } + }, + { + "cell_type": "markdown", + "source": [ + "Configuration file for the generic MT model:" + ], + "metadata": { + "id": "FAZgO6wLzRhF" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rd4eo4ZY2vnZ" + }, + "outputs": [], + "source": [ + "with open('config_train.yaml', 'w', encoding='utf-8') as config:\n", + " config.write('''# Data output:\n", + "overwrite: false\n", + "save_data: ./output/vocab/voc\n", + "src_vocab: ./output/vocab/voc.vocab.src\n", + "tgt_vocab: ./output/vocab/voc.vocab.tgt\n", + "\n", + "# Training corpora:\n", + "data:\n", + " europarl:\n", + " path_src: ./datasets/europarl.en\n", + " path_tgt: ./datasets/europarl.fr\n", + " transforms: [filtertoolong, sentencepiece]\n", + " weight: 4\n", + " globalvoices:\n", + " path_src: 
./datasets/globalvoices.en\n", + " path_tgt: ./datasets/globalvoices.fr\n", + " transforms: [filtertoolong, sentencepiece]\n", + " weight: 1\n", + " news:\n", + " path_src: ./datasets/news.en\n", + " path_tgt: ./datasets/news.fr\n", + " transforms: [filtertoolong, sentencepiece]\n", + " weight: 1\n", + " ted:\n", + " path_src: ./datasets/ted.en\n", + " path_tgt: ./datasets/ted.fr\n", + " transforms: [filtertoolong, sentencepiece]\n", + " weight: 2\n", + " valid:\n", + " path_src: ./datasets/val.en\n", + " path_tgt: ./datasets/val.fr\n", + " transforms: [filtertoolong, sentencepiece]\n", + "src_seq_length: 200\n", + "tgt_seq_length: 200\n", + "skip_empty_level: silent\n", + "src_subword_model: ./subword/unigram_en.model\n", + "tgt_subword_model: ./subword/unigram_fr.model\n", + "src_subword_vocab: ./subword/unigram_en.vocab\n", + "tgt_subword_vocab: ./subword/unigram_fr.vocab\n", + "src_subword_alpha: 0.5\n", + "tgt_subword_alpha: 0.5\n", + "\n", + "# Training parameters:\n", + "batch_type: \"tokens\"\n", + "batch_size: 4096\n", + "valid_batch_size: 16\n", + "batch_size_multiple: 1\n", + "max_generator_batches: 0\n", + "accum_count: [3]\n", + "accum_steps: [0]\n", + "train_steps: 100000\n", + "valid_steps: 10000\n", + "report_every: 500\n", + "save_checkpoint_steps: 10000\n", + "queue_size: 10000\n", + "bucket_size: 32768\n", + "\n", + "# Optimization\n", + "model_dtype: \"fp32\"\n", + "optim: \"adam\"\n", + "learning_rate: 2\n", + "warmup_steps: 8000\n", + "decay_method: \"noam\"\n", + "average_decay: 0.0005\n", + "adam_beta2: 0.998\n", + "max_grad_norm: 0\n", + "label_smoothing: 0.1\n", + "param_init: 0\n", + "param_init_glorot: true\n", + "normalization: \"tokens\"\n", + "\n", + "# Model\n", + "encoder_type: transformer\n", + "decoder_type: transformer\n", + "enc_layers: 6\n", + "dec_layers: 6\n", + "heads: 8\n", + "rnn_size: 512\n", + "word_vec_size: 512\n", + "transformer_ff: 2048\n", + "dropout_steps: [0]\n", + "dropout: [0.1]\n", + "attention_dropout: 
[0.1]\n", + "position_encoding: true\n", + "\n", + "# Model output:\n", + "save_model: ./output/models/train\n", + "\n", + "# Logs:\n", + "log_file: ./output/log/train\n", + "tensorboard: true\n", + "tensorboard_log_dir: ./output/tensor/train\n", + "\n", + "# GPU settings:\n", + "world_size: 1\n", + "gpu_ranks: [0]\n", + "\n", + "# Reproducibility:\n", + "seed: 123''')\n", + "config.close()" + ] + }, + { + "cell_type": "markdown", + "source": [ + "Configuration file for the fine-tuned MT model:" + ], + "metadata": { + "id": "wgn6piww1vns" + } + }, + { + "cell_type": "code", + "source": [ + "with open('config_tuned.yaml', 'w', encoding='utf-8') as config:\n", + " config.write('''# Data output:\n", + "overwrite: false\n", + "save_data: ./output/vocab/voc\n", + "src_vocab: ./output/vocab/voc.vocab.src\n", + "tgt_vocab: ./output/vocab/voc.vocab.tgt\n", + "\n", + "# Training corpora:\n", + "data:\n", + " europarl:\n", + " path_src: ./datasets/europarl.en\n", + " path_tgt: ./datasets/europarl.fr\n", + " transforms: [filtertoolong, sentencepiece]\n", + " weight: 1\n", + " globalvoices:\n", + " path_src: ./datasets/globalvoices.en\n", + " path_tgt: ./datasets/globalvoices.fr\n", + " transforms: [filtertoolong, sentencepiece]\n", + " weight: 1\n", + " news:\n", + " path_src: ./datasets/news.en\n", + " path_tgt: ./datasets/news.fr\n", + " transforms: [filtertoolong, sentencepiece]\n", + " weight: 1\n", + " ted:\n", + " path_src: ./datasets/ted.en\n", + " path_tgt: ./datasets/ted.fr\n", + " transforms: [filtertoolong, sentencepiece]\n", + " weight: 1\n", + " books:\n", + " path_src: ./datasets/trn.en\n", + " path_tgt: ./datasets/trn.fr\n", + " transforms: [filtertoolong, sentencepiece]\n", + " weight: 5\n", + " valid:\n", + " path_src: ./datasets/val.en\n", + " path_tgt: ./datasets/val.fr\n", + " transforms: [filtertoolong, sentencepiece]\n", + "src_seq_length: 200\n", + "tgt_seq_length: 200\n", + "skip_empty_level: silent\n", + "src_subword_model: 
./subword/unigram_en.model\n", + "tgt_subword_model: ./subword/unigram_fr.model\n", + "src_subword_vocab: ./subword/unigram_en.vocab\n", + "tgt_subword_vocab: ./subword/unigram_fr.vocab\n", + "src_subword_alpha: 0.5\n", + "tgt_subword_alpha: 0.5\n", + "\n", + "# Training parameters:\n", + "batch_type: \"tokens\"\n", + "batch_size: 4096\n", + "valid_batch_size: 16\n", + "batch_size_multiple: 1\n", + "max_generator_batches: 0\n", + "accum_count: [3]\n", + "accum_steps: [0]\n", + "train_steps: 150000\n", + "valid_steps: 5000\n", + "report_every: 100\n", + "save_checkpoint_steps: 5000\n", + "queue_size: 10000\n", + "bucket_size: 32768\n", + "train_from: ./output/models/train_step_100000.pt\n", + "\n", + "# Optimization\n", + "model_dtype: \"fp32\"\n", + "optim: \"adam\"\n", + "learning_rate: 2\n", + "warmup_steps: 8000\n", + "decay_method: \"noam\"\n", + "average_decay: 0.0005\n", + "adam_beta2: 0.998\n", + "max_grad_norm: 0\n", + "label_smoothing: 0.1\n", + "param_init: 0\n", + "param_init_glorot: true\n", + "normalization: \"tokens\"\n", + "\n", + "# Model\n", + "encoder_type: transformer\n", + "decoder_type: transformer\n", + "enc_layers: 6\n", + "dec_layers: 6\n", + "heads: 8\n", + "rnn_size: 512\n", + "word_vec_size: 512\n", + "transformer_ff: 2048\n", + "dropout_steps: [0]\n", + "dropout: [0.1]\n", + "attention_dropout: [0.1]\n", + "position_encoding: true\n", + "\n", + "# Model output:\n", + "save_model: ./output/models/tuned\n", + "\n", + "# Logs:\n", + "log_file: ./output/log/tuned\n", + "tensorboard: true\n", + "tensorboard_log_dir: ./output/tensor/tuned\n", + "\n", + "# GPU settings:\n", + "world_size: 1\n", + "gpu_ranks: [0]\n", + "\n", + "# Reproducibility:\n", + "seed: 123''')\n", + "config.close()" + ], + "metadata": { + "id": "HupO64dt2vMj" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Creating the vocabulary files used by [OpenNMT](https://github.com/OpenNMT/OpenNMT-py):" + ], 
"metadata": { + "id": "27pl8wJQ3wJu" + } + }, + { + "cell_type": "code", + "source": [ + "!onmt_build_vocab --config config_tuned.yaml --n_sample -1" + ], + "metadata": { + "id": "Zhx5JWK_324G" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Training the generic system:" + ], + "metadata": { + "id": "cpnxR0YS-a7_" + } + }, + { + "cell_type": "code", + "source": [ + "!onmt_train --config config_train.yaml" + ], + "metadata": { + "id": "IrfWAbIt5S0e" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Training the fine-tuned system:" + ], + "metadata": { + "id": "aLRi6ZaAAHLj" + } + }, + { + "cell_type": "code", + "source": [ + "!onmt_train --config config_tuned.yaml" + ], + "metadata": { + "id": "XGPaZbuF5SIB" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 4 — Translation and evaluation" + ], + "metadata": { + "id": "dlyT9dWGAud3" + } + }, + { + "cell_type": "markdown", + "source": [ + "Subword segmentation of the text submitted to the system:" + ], + "metadata": { + "id": "rcprLVCa2AzX" + } + }, + { + "cell_type": "code", + "source": [ + "%%bash\n", + "spm_encode \\\n", + " --model=subword/unigram_en.model \\\n", + " --output_format=piece \\\n", + " < datasets/tra.en \\\n", + " > datasets/tra_sub.en" + ], + "metadata": { + "id": "DujCQvZM1XJH" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Translating with each saved checkpoint of our two systems:" + ], + "metadata": { + "id": "OcCRM_Lk2P-o" + } + }, + { + "cell_type": "code", + "source": [ + "%%bash\n", + "for checkpoint in output/models/*.pt ;\\\n", + "do filename=$(basename $checkpoint .pt) ;\\\n", + "echo \"# Translating checkpoint\" ${filename} ;\\\n", + "onmt_translate \\\n", + " --verbose \\\n", + " --model $checkpoint \\\n", + " --src datasets/tra_sub.en \\\n", + " --output output/translations/${filename}_sub.txt 
;\\\n", + "done" + ], + "metadata": { + "id": "id3FeyUE2mq1" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Detokenizing the translations produced by our systems:" + ], + "metadata": { + "id": "uph7ub_H3kff" + } + }, + { + "cell_type": "code", + "source": [ + "%%bash\n", + "for file in output/translations/*_sub.txt ;\\\n", + "do filename=$(basename $file _sub.txt) ;\\\n", + "spm_decode \\\n", + "\t--model=subword/unigram_fr.model \\\n", + "\t--input_format=piece \\\n", + "\t< output/translations/${filename%.*}_sub.txt \\\n", + "\t> output/translations/${filename%.*}_tok.txt ;\\\n", + "done\n", + "\n", + "for file in output/translations/*_tok.txt ;\\\n", + "do filename=$(basename $file _tok.txt) ;\\\n", + "mosestokenizer -D fr \\\n", + "\t< output/translations/${filename%.*}_tok.txt \\\n", + "\t> output/translations/${filename%.*}.txt ;\\\n", + "done\n", + "\n", + "rm output/translations/*sub.txt\n", + "rm output/translations/*tok.txt" + ], + "metadata": { + "id": "3d5z-018BiyX" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Evaluation and selection of the best model with [sacreBLEU](https://github.com/mjpost/sacrebleu):" + ], + "metadata": { + "id": "lBwMRO4j3325" + } + }, + { + "cell_type": "code", + "source": [ + "%%bash\n", + "sacrebleu datasets/tra.fr \\\n", + "\t--input output/translations/*.txt \\\n", + "\t--language-pair en-fr \\\n", + "\t--metrics bleu chrf ter \\\n", + "\t--chrf-word-order 2 \\\n", + "\t--tokenize 13a \\\n", + "\t--width 2 \\\n", + "\t--format text" + ], + "metadata": { + "id": "efbvsvZwBPBA" + }, + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file
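
The `head`/`tail` arithmetic used in Step 2 to carve the Books corpus into training, validation, and test sets can be sketched in Python as follows. This is a minimal illustration only; `split_books` is a hypothetical helper written for this note, not part of the notebook, and it works on in-memory sentence pairs rather than files.

```python
def split_books(pairs):
    """Reproduce the Step 2 split: the last 11,000 pairs are held out;
    of those, the final 1,000 become the test set ("tra") and the
    remaining 10,000 the validation set ("val")."""
    trn = pairs[:-11000]       # head -n -11000
    held_out = pairs[-11000:]  # tail -n 11000
    val = held_out[:-1000]     # head -n -1000
    tra = held_out[-1000:]     # tail -n 1000
    return trn, val, tra

# Synthetic example with 20,000 sentence pairs.
pairs = [(f"en {i}", f"fr {i}") for i in range(20000)]
trn, val, tra = split_books(pairs)
print(len(trn), len(val), len(tra))  # prints: 9000 10000 1000
```

The validation and test sets come from the literary (Books) corpus on purpose: the fine-tuned model is meant to be evaluated on the fiction domain it is being adapted to.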