%% Cell type:markdown id: tags:
# Load a dataset and train your model
Once you have labeled your data on the Memri platform, you can use it to train your model in [this Google Colab notebook](https://colab.research.google.com/drive/189JJ2gLHAtxlmzc5XI3HhB9_VE3fT6DT)*.
In this guide you will:
1. Load a labeled dataset from the POD
2. Train a distilRoBERTa text classifier model on a labeled dataset
3. Upload a trained model to use in a plugin for a data app
> * If you are unfamiliar with Google Colab notebooks, have a look at [this quick intro.](https://colab.research.google.com/)
* Make sure to run the cells below one by one, in the correct order, to avoid errors!
* In this guide we help you connect your own personal data from your Memri POD; alternatively, you can use the [Tweet eval emoji](https://huggingface.co/datasets/tweet_eval#source-data) dataset, which is available from 🤗 [Hugging Face](https://huggingface.co/docs/datasets/index.html).
* If you don't wish to use your personal data, or you don't want to spend time training a model, you can simply use our [sentiment-plugin](https://gitlab.memri.io/koenvanderveen/sentiment-plugin/-/packages/6), which uses a pre-trained model from 🤗 Hugging Face. Just paste the plugin repo address at the project set-up step on the Memri platform, and skip this process.
%% Cell type:markdown id: tags:
## Setup
%% Cell type:code id: tags:
```
from IPython.display import clear_output
!pip install pandas transformers torch git+https://gitlab.memri.io/memri/pymemri.git@v0.0.29
clear_output()
print("Installed")
```
%% Cell type:markdown id: tags:
1. Import the libraries needed to train your model
> * Make sure to run the installation step above first to avoid errors!
%% Cell type:code id: tags:
```
import os
import random
import textwrap
import pandas as pd
import torch
import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers.utils import logging
from pymemri.data.itembase import Edge, Item
from pymemri.data.schema import Dataset, Message, CategoricalLabel
from pymemri.data.oauth import OauthFlow
from pymemri.data.loader import write_model_to_package_registry
from pymemri.pod.client import PodClient
from getpass import getpass
transformers.utils.logging.set_verbosity_error()
os.environ["WANDB_DISABLED"] = "true"
```
%% Cell type:markdown id: tags:
## 1. Load your dataset from the POD
%% Cell type:markdown id: tags:
1. Run the cell
2. Copy your Dataset Name, Login Key and Password Key from your app.memri.io screen, and paste them below as prompted to load your connected dataset from your POD.
%% Cell type:code id: tags:
```
### *Define your pod url here*, this is the one for dev.app.memri.io ###
pod_url = "https://dev.pod.memri.io"
### *Define your dataset here* ###
dataset_name = input("dataset_name:") if "dataset_name" not in locals() else dataset_name
### *Define your login key here* ###
owner_key = getpass("owner key:") if "owner_key" not in locals() else owner_key
### *Define your password key here* ###
database_key = getpass("database_key:") if "database_key" not in locals() else database_key
```
%% Cell type:markdown id: tags:
2. Connect to your POD to load your data
%% Cell type:code id: tags:
```
# Connect to pod
client = PodClient(
    url=pod_url,
    owner_key=owner_key,
    database_key=database_key,
)
client.add_to_schema(CategoricalLabel, Message, Dataset, OauthFlow);
```
%% Cell type:markdown id: tags:
3. Download and inspect the dataset
> * All entries in the dataset can be found via the Dataset.entry edge
%% Cell type:code id: tags:
```
dataset = client.get_dataset(dataset_name)
num_entries = len(dataset.entry)
print(f"number of items in the dataset: {num_entries}")
```
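%% Cell type:markdown id: tags:
> * Optionally, you can peek at the first entry through the same `Dataset.entry` edge. The cell below is a minimal sketch; the exact item type and fields you see depend on how your dataset was labeled.
%% Cell type:code id: tags:
```
# Optional sanity check: look at the first entry of the dataset.
# The fields shown depend on your dataset's schema.
if num_entries > 0:
    first_entry = dataset.entry[0]
    print(first_entry)
```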
%% Cell type:markdown id: tags:
4. Export the dataset to a format compatible with Python and inspect it in a table
> * In pymemri, the `Dataset` class can format your dataset to different datatypes using the `Dataset.to` method; here we will use Pandas.
> * The columns of the dataset are inferred automatically. If you want to use custom columns, you can use the `columns` argument. See the [dataset documentation](https://docs.memri.io/component-architectures/plugins/datasets/) for more info.
%% Cell type:code id: tags:
```
data = dataset.to("pandas")
data.head()
```
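%% Cell type:markdown id: tags:
> * As an illustration of the `columns` argument, the cell below is a minimal sketch. The column names are assumptions and should match the columns of your own dataset; the label column used later in this notebook is `annotation.labelValue`.
%% Cell type:code id: tags:
```
# Minimal sketch of exporting with custom columns.
# The column names below are assumptions; replace them with the columns of your own dataset.
custom_columns = ["annotation.labelValue"]
data_subset = dataset.to("pandas", columns=custom_columns)
data_subset.head()
```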
%% Cell type:markdown id: tags:
## 2. Fine-tune a model
%% Cell type:markdown id: tags:
1. Configure the distilRoBERTa model on your dataset
> The 🤗 Transformers library contains all the code needed for training; you only need to define a torch `Dataset` that holds your data and handles tokenization.
%% Cell type:code id: tags:
```
# Hyperparameters
model_name = "distilroberta-base"
batch_size = 32
learning_rate = 1e-3


class TransformerDataset(torch.utils.data.Dataset):
    def __init__(self, data: pd.DataFrame, tokenizer: transformers.PreTrainedTokenizerBase):
        self.data = data
        self.label2idx, self.idx2label = self.get_label_map()
        self.num_labels = len(self.label2idx)
        self.tokenizer = tokenizer

    def tokenize(self, message, label=None):
        tokenized = self.tokenizer(message, padding="max_length", truncation=True)
        if label is not None:
            tokenized["label"] = self.label2idx[label]
        return tokenized

    def get_label_map(self):
        # Map each unique label to an integer index, and keep the reverse mapping.
        unique_labels = self.data["annotation.labelValue"].unique()
        return {l: i for i, l in enumerate(unique_labels)}, {i: l for i, l in enumerate(unique_labels)}

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Get the row from self.data, and skip the first column (id).
        return self.tokenize(*self.data.iloc[idx][1:])


tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = TransformerDataset(data, tokenizer)
```
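%% Cell type:markdown id: tags:
> * Optionally, you can sanity-check the dataset wrapper before training. The cell below is a minimal sketch that prints the number of labels and the keys of one tokenized item.
%% Cell type:code id: tags:
```
# Optional sanity check: inspect the number of labels and one tokenized example.
print("number of labels:", dataset.num_labels)
example = dataset[0]
print("keys of a tokenized item:", list(example.keys()))
```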
%% Cell type:markdown id: tags:
2. Train and fine-tune the model
> * The 🤗 Transformers library provides all the code needed to train a RoBERTa model. Read their [tutorial on fine-tuning models](https://huggingface.co/docs/transformers/training)
* We use the `Trainer` class, as it handles all training, monitoring, and integration with [Weights & Biases](https://wandb.ai/site)
%% Cell type:code id: tags:
```
# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=dataset.num_labels,
    id2label=dataset.idx2label
)

# To increase training speed, we will freeze all layers except the classifier head.
for param in model.base_model.parameters():
    param.requires_grad = False

training_args = transformers.TrainingArguments(
    "twitter-emoji-trainer",
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    logging_steps=1,
    optim="adamw_torch"
)

trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset
)

logging.set_verbosity(40)
trainer.train()
```
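%% Cell type:markdown id: tags:
> * Before uploading, you can optionally try the fine-tuned model on a piece of text. The cell below is a minimal sketch; the example message is made up, so replace it with text that resembles your own data.
%% Cell type:code id: tags:
```
# Optional: run the fine-tuned model on a single example message.
# The example text below is only an illustration; replace it with your own.
example_message = "Had a great day at the beach!"

inputs = tokenizer(example_message, return_tensors="pt", truncation=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    logits = model(**inputs).logits

predicted_idx = logits.argmax(dim=-1).item()
print("predicted label:", dataset.idx2label[predicted_idx])
```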
%% Cell type:markdown id: tags:
## 3. Upload your model to a data app plugin
%% Cell type:markdown id: tags:
Now that your model is trained, it will be uploaded to your new GitLab project.
1. Run the cell
2. Copy and paste the GitLab project name from your screen on app.memri.io
> * To avoid errors, make sure your GitLab project does not have any full stops in the name/URL
%% Cell type:code id: tags:
```
project_name = input("project name:") if "project_name" not in locals() else project_name
write_model_to_package_registry(model, project_name=project_name, client=client)
```
%% Cell type:markdown id: tags:
That's it! 🎉
You have trained an ML model and made it accessible via the package registry, ready to be used in your data app.
Check out the next step to see how to [build a plugin and deploy a data app](https://docs.memri.io/tutorials/build_a_sentiment_analysis_app/#deploy-your-data-app/).