
Code that runs in Colab, as of May 12.

  • Presumably a bug appeared when Colab upgraded its Python version.

  • The bug could be resolved by following the discussion among experienced users on the GitHub issues page.

  • A BERT model released by SK TBrain.

  • The model is specialized for Korean; its reliability is checked here by predicting example reviews and measuring accuracy.

  • This task performs binary classification for sentiment analysis of the Naver movie review data and measures the model's accuracy.

Data & Model Download

  • Naver movie review data (NSMC)

  • KoBERT package

  • transformers installation

Install the required libraries

!pip install mxnet
!pip install gluonnlp==0.8.0
!pip install tqdm pandas
!pip install sentencepiece
!pip install transformers
!pip install torch

# kobert
!pip install 'git+https://github.com/SKTBrain/KoBERT.git#egg=kobert_tokenizer&subdirectory=kobert_hf'
# NSMC
!git clone https://github.com/e9t/nsmc.git

Libraries

import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

import numpy as np
import pandas as pd
import re
import random
from tqdm import tqdm, tqdm_notebook

# This part can easily throw errors
import gluonnlp as nlp

# ★ Import the model and tokenizer via Hugging Face
from kobert_tokenizer import KoBERTTokenizer
from transformers import BertModel

from transformers import AdamW
from transformers.optimization import get_cosine_schedule_with_warmup
/usr/local/lib/python3.10/dist-packages/mxnet/optimizer/optimizer.py:163: UserWarning: WARNING: New optimizer gluonnlp.optimizer.lamb.LAMB is overriding existing optimizer mxnet.optimizer.optimizer.LAMB
  warnings.warn('WARNING: New optimizer %s.%s is overriding '
seed = 2021
deterministic = True

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# Read the data
train = pd.read_csv("./nsmc/ratings_train.txt", sep='\t')
test = pd.read_csv("./nsmc/ratings_test.txt", sep='\t')

# Shrink the data (to see training results quickly)
# train = train.iloc[:20000, :]
# test = test.iloc[:5000, :]

# Drop missing values
train = train.dropna()
test = test.dropna()
print('train shape:',train.shape)
print('test shape:',test.shape)
print()

# Check available GPUs
n_devices = torch.cuda.device_count()
print('device count:', n_devices)
print()

# Use the GPU if available
device = torch.device("cuda:0" if torch.cuda.is_available() else 'cpu')

# ★ Load the BERT model and vocabulary via Hugging Face
tokenizer = KoBERTTokenizer.from_pretrained('skt/kobert-base-v1')
bertmodel = BertModel.from_pretrained('skt/kobert-base-v1', return_dict=False)
vocab = nlp.vocab.BERTVocab.from_sentencepiece(tokenizer.vocab_file, padding_token='[PAD]')

print(bertmodel)
print('-'*100)
print(vocab)
print("Current device:", device)
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'XLNetTokenizer'. 
The class this function is called from is 'KoBERTTokenizer'.
train shape: (149995, 3)
test shape: (49997, 3)

device count: 1

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(8002, 768, padding_idx=1)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)
----------------------------------------------------------------------------------------------------
Vocab(size=8002, unk="[UNK]", reserved="['[CLS]', '[SEP]', '[MASK]', '[PAD]']")
Current device: cuda:0
display(train[:5])
         id                                           document  label
0   9976970                                아 더빙.. 진짜 짜증나네요 목소리      0
1   3819312                  흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나      1
2  10265843                                  너무재밓었다그래서보는것을추천한다      0
3   9045019                      교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정      0
4   6483659  사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...      1

Data Preprocessing

To build the input data, the DataFrame is converted into a list of the following form.

[[document(1), label(1)], [document(2), label(2)], …, [document(n), label(n)]]

# DataFrame -> List

dataset_train = []
for docu, label in zip(train.document, train.label):
  if '"' in docu:
    docu = re.sub('"', '', docu)
  dataset_train.append([docu, str(label)])

dataset_test = []
for docu, label in zip(test.document, test.label):
  if '"' in docu:
    docu = re.sub('"', '', docu)
  dataset_test.append([docu, str(label)])


print(dataset_train[:3])
[['아 더빙.. 진짜 짜증나네요 목소리', '0'], ['흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나', '1'], ['너무재밓었다그래서보는것을추천한다', '0']]

To turn this into KoBERT input data, the following additional preprocessing steps are applied; a short example follows the list below.

  • Tokenization

  • Integer encoding

  • Padding
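
For reference, a minimal sketch of what tokenization and integer encoding look like on a single review, assuming the tokenizer loaded above; padding is handled later by BERTSentenceTransform.

# (Sketch) tokenization and integer encoding on one review
sample = "아 더빙.. 진짜 짜증나네요 목소리"
tokens = tokenizer.tokenize(sample)                 # subword tokens
token_ids = tokenizer.convert_tokens_to_ids(tokens) # integer encoding
print(tokens)
print(token_ids)
# BERTSentenceTransform below additionally adds [CLS]/[SEP] and pads with
# vocab[vocab.padding_token] up to max_seq_length.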

class BERTSentenceTransform:
    def __init__(self, tokenizer, max_seq_length,vocab, pad=True, pair=True):
        self._tokenizer = tokenizer
        self._max_seq_length = max_seq_length
        self._pad = pad
        self._pair = pair
        self._vocab = vocab 

    def __call__(self, line):
        # convert to unicode
        text_a = line[0]
        if self._pair:
            assert len(line) == 2
            text_b = line[1]

        ##### Modified here: use the Hugging Face tokenizer's tokenize() #####
        tokens_a = self._tokenizer.tokenize(text_a)
        tokens_b = None

        if self._pair:
            tokens_b = self._tokenizer.tokenize(text_b)

        if tokens_b:
            self._truncate_seq_pair(tokens_a, tokens_b,
                                    self._max_seq_length - 3)
        else:
            if len(tokens_a) > self._max_seq_length - 2:
                tokens_a = tokens_a[0:(self._max_seq_length - 2)]
        vocab = self._vocab
        tokens = []
        tokens.append(vocab.cls_token)
        tokens.extend(tokens_a)
        tokens.append(vocab.sep_token)
        segment_ids = [0] * len(tokens)

        if tokens_b:
            tokens.extend(tokens_b)
            tokens.append(vocab.sep_token)
            segment_ids.extend([1] * (len(tokens) - len(segment_ids)))

        input_ids = self._tokenizer.convert_tokens_to_ids(tokens)

        # The valid length of sentences. Only real  tokens are attended to.
        valid_length = len(input_ids)

        if self._pad:
            # Zero-pad up to the sequence length.
            padding_length = self._max_seq_length - valid_length
            # use padding tokens for the rest
            input_ids.extend([vocab[vocab.padding_token]] * padding_length)
            segment_ids.extend([0] * padding_length)

        return np.array(input_ids, dtype='int32'), np.array(valid_length, dtype='int32'),\
            np.array(segment_ids, dtype='int32')



class BERTDataset(Dataset):
    def __init__(self, dataset, sent_idx, label_idx, bert_tokenizer, vocab, max_len, pad, pair):
        transform = BERTSentenceTransform(bert_tokenizer, max_seq_length=max_len, vocab=vocab, pad=pad, pair=pair)

        self.sentences = [transform([i[sent_idx]]) for i in dataset]
        self.labels = [np.int32(i[label_idx]) for i in dataset]

    def __getitem__(self, i):
        return (self.sentences[i] + (self.labels[i], ))

    def __len__(self):
        return (len(self.labels))

Preprocessing for defining the user parameters

# Length of the longest review (used as the maximum sequence length)
max_len = 0
for doc in train.document:
  if max_len < len(doc):
    max_len = len(doc)

Tokenize the data using the transform defined above.

train = BERTDataset(dataset_train, 0, 1, tokenizer, vocab, max_len, True, False) # input: [sentence, sentiment label]
test = BERTDataset(dataset_test, 0, 1, tokenizer, vocab, max_len, True, False) # input: [sentence, sentiment label]
# The first ([0]) item of the training data
print(train[0][0]) # padded token-id sequence
print(train[0][1]) # valid length
print(train[0][2]) # segment-id (token type) sequence, fed to the model together with the token ids
print(train[0][3]) # ground-truth label
print('-'*50)

print(train[0]) # the full tuple
[   2 3093 1698 6456   54   54 4368 4396 7316 5655 5703 2073    3    1
    1    1    1    1    1    1    1    1    1    1    1    1    1    1
    1    1    1    1    1    1    1    1    1    1    1    1    1    1
    1    1    1    1    1    1    1    1    1    1    1    1    1    1
    1    1    1    1    1    1    1    1    1    1    1    1    1    1
    1    1    1    1    1    1    1    1    1    1    1    1    1    1
    1    1    1    1    1    1    1    1    1    1    1    1    1    1
    1    1    1    1    1    1    1    1    1    1    1    1    1    1
    1    1    1    1    1    1    1    1    1    1    1    1    1    1
    1    1    1    1    1    1    1    1    1    1    1    1    1    1
    1    1    1    1    1    1]
13
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
0
--------------------------------------------------
(array([   2, 3093, 1698, 6456,   54,   54, 4368, 4396, 7316, 5655, 5703,
       2073,    3,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    1], dtype=int32), array(13, dtype=int32), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32), 0)

Convert to the Torch input data format

batch_size = 64

train_dataloader = torch.utils.data.DataLoader(train, batch_size=batch_size)
test_dataloader = torch.utils.data.DataLoader(test, batch_size=batch_size)

print(len(train_dataloader), len(test_dataloader)) # number of iterations per epoch (= len(train) / batch_size)
2344 782
for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(train_dataloader):
  print(token_ids)
  print(valid_length)
  print(segment_ids)
  break
tensor([[   2, 3093, 1698,  ...,    1,    1,    1],
        [   2,  517, 7989,  ...,    1,    1,    1],
        [   2, 1458, 7191,  ...,    1,    1,    1],
        ...,
        [   2, 2485, 6821,  ...,    1,    1,    1],
        [   2, 4841, 6333,  ...,    1,    1,    1],
        [   2, 2355, 5842,  ...,    1,    1,    1]], dtype=torch.int32)
tensor([13, 25, 14, 22, 46, 33, 15, 67, 13, 36, 16, 32, 29, 31, 34, 13, 45, 23,
        32, 24, 26, 10, 81, 16, 11, 35, 19,  9,  6, 27, 30, 23, 12, 15, 13, 13,
        23, 19, 14, 12, 11, 52, 17, 27, 55, 17, 77, 38, 13, 97, 32, 30, 17, 37,
         4,  5,  3, 65,  7, 11, 14, 19, 24, 56], dtype=torch.int32)
tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]], dtype=torch.int32)

KoBERT

Use the num_classes argument to set the number of output classes for the classification task.

class KoBERTClassifier(nn.Module):
  def __init__(self, bert, hidden_size=768, num_classes=2, dr_rate=None, params=None):
    super(KoBERTClassifier, self).__init__()

    self.bert = bert
    self.dr_rate = dr_rate

    self.classifier = nn.Linear(hidden_size, num_classes)
    if dr_rate:
      self.dropout = nn.Dropout(p=dr_rate)
    
  def gen_attention_mask(self, token_ids, valid_length):
    attention_mask = torch.zeros_like(token_ids)
    for i, v in enumerate(valid_length):
      attention_mask[i][:v] = 1
    return attention_mask.float()

  def forward(self, token_ids, valid_length, segment_ids):
    attention_mask = self.gen_attention_mask(token_ids, valid_length)

    _, pooler = self.bert(input_ids=token_ids, token_type_ids=segment_ids.long(), attention_mask=attention_mask.float().to(token_ids.device))
    out = self.dropout(pooler) if self.dr_rate else pooler  # pass the pooler output straight through when dr_rate is None
    out = self.classifier(out)
    return out

User-defined parameters

# Setting parameters
warmup_ratio = 0.1
num_epochs = 2 # for testing (training gets cut off midway when RAM runs out..)
max_grad_norm = 1
log_interval = 200
learning_rate =  4e-5

Load the pretrained model and set up the optimizer, loss function, scheduler, and so on.

# Load the BERT model
model = KoBERTClassifier(bertmodel, dr_rate=0.3).to(device)

# Set up the optimizer and scheduler
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]

optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate)
loss_fn = nn.CrossEntropyLoss()

t_total = len(train_dataloader) * num_epochs
warmup_step = int(t_total * warmup_ratio)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=warmup_step, num_training_steps=t_total)

# Function for measuring accuracy
def calc_accuracy(X,Y):
    max_vals, max_indices = torch.max(X, 1)
    train_acc = (max_indices == Y).sum().data.cpu().numpy()/max_indices.size()[0]
    return train_acc
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
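
Before the full training loop, a quick sanity check on one batch (a minimal sketch using the model and train_dataloader defined above) confirms that the classifier returns one pair of logits per review.

# (Sketch) one forward pass to verify output shapes before training
token_ids, valid_length, segment_ids, label = next(iter(train_dataloader))
with torch.no_grad():
  out = model(token_ids.long().to(device), valid_length, segment_ids.long().to(device))
print(out.shape)  # expected: torch.Size([64, 2])
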
eval_acc = 0
for e in range(num_epochs):
  train_acc = 0.0
  test_acc = 0.0

  model.train()
  for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm(train_dataloader)):
    optimizer.zero_grad()
    token_ids = token_ids.long().to(device)
    segment_ids = segment_ids.long().to(device)

    valid_length = valid_length
    label = label.long().to(device) # ground-truth label
    out = model(token_ids, valid_length, segment_ids) # prediction

    loss = loss_fn(out, label)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

    optimizer.step()
    scheduler.step() # Update learning rate schedule
    train_acc += calc_accuracy(out, label)
    if batch_id % log_interval == 0:
            print("epoch {} batch id {} loss {} train acc {}".format(e+1, batch_id+1, loss.data.cpu().numpy(), train_acc / (batch_id+1)))
  print("epoch {} train acc {}".format(e+1, train_acc / (batch_id+1)))

  # Evaluation
  with torch.no_grad():
    model.eval()
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm(test_dataloader)):
      token_ids = token_ids.long().to(device)
      segment_ids = segment_ids.long().to(device)
      valid_length = valid_length

      label = label.long().to(device)
      out = model(token_ids, valid_length, segment_ids)

      test_acc += calc_accuracy(out, label)
    
    total_test_acc = test_acc / (batch_id+1)
    print("epoch {} test acc {}".format(e+1, total_test_acc))

  # Save the model if test accuracy improved
  if eval_acc < total_test_acc:
    eval_acc = total_test_acc
    torch.save(model.state_dict(), f'./model_state_dict.pt')
  0%|          | 1/2344 [00:03<2:34:50,  3.97s/it]
epoch 1 batch id 1 loss 0.7078469395637512 train acc 0.53125
  9%|▊         | 201/2344 [04:37<49:26,  1.38s/it]
epoch 1 batch id 201 loss 0.39718449115753174 train acc 0.644589552238806
 17%|█▋        | 401/2344 [09:15<45:05,  1.39s/it]
epoch 1 batch id 401 loss 0.36952823400497437 train acc 0.7321150249376559
 26%|██▌       | 601/2344 [13:55<40:54,  1.41s/it]
epoch 1 batch id 601 loss 0.4212072789669037 train acc 0.7703826955074875
 34%|███▍      | 801/2344 [18:35<36:24,  1.42s/it]
epoch 1 batch id 801 loss 0.4127542972564697 train acc 0.7919202559300874
 43%|████▎     | 1001/2344 [23:13<30:57,  1.38s/it]
epoch 1 batch id 1001 loss 0.3550129234790802 train acc 0.8056006493506493
 51%|█████     | 1201/2344 [27:50<26:22,  1.38s/it]
epoch 1 batch id 1201 loss 0.31423938274383545 train acc 0.8160777477102414
 60%|█████▉    | 1401/2344 [32:27<21:46,  1.39s/it]
epoch 1 batch id 1401 loss 0.4369466006755829 train acc 0.8232624018558172
 68%|██████▊   | 1601/2344 [37:04<17:06,  1.38s/it]
epoch 1 batch id 1601 loss 0.2754068970680237 train acc 0.8300183479075578
 77%|███████▋  | 1801/2344 [41:41<12:33,  1.39s/it]
epoch 1 batch id 1801 loss 0.25950542092323303 train acc 0.8355167268184343
 85%|████████▌ | 2001/2344 [46:17<07:54,  1.38s/it]
epoch 1 batch id 2001 loss 0.3744203448295593 train acc 0.8403142178910544
 94%|█████████▍| 2201/2344 [50:54<03:17,  1.38s/it]
epoch 1 batch id 2201 loss 0.27376240491867065 train acc 0.844573489323035
100%|██████████| 2344/2344 [54:11<00:00,  1.39s/it]
epoch 1 train acc 0.8473601575521867
100%|██████████| 782/782 [06:15<00:00,  2.08it/s]
epoch 1 test acc 0.8929443119220932
  0%|          | 1/2344 [00:01<54:43,  1.40s/it]
epoch 2 batch id 1 loss 0.48298001289367676 train acc 0.796875
  9%|▊         | 201/2344 [04:38<49:29,  1.39s/it]
epoch 2 batch id 201 loss 0.18480078876018524 train acc 0.8889148009950248
 17%|█▋        | 401/2344 [09:15<44:52,  1.39s/it]
epoch 2 batch id 401 loss 0.23000317811965942 train acc 0.8945994389027432
 26%|██▌       | 601/2344 [13:51<39:59,  1.38s/it]
epoch 2 batch id 601 loss 0.324882835149765 train acc 0.9013103161397671
 34%|███▍      | 801/2344 [18:28<35:36,  1.38s/it]
epoch 2 batch id 801 loss 0.2671748399734497 train acc 0.9058793695380774
 43%|████▎     | 1001/2344 [23:05<30:55,  1.38s/it]
epoch 2 batch id 1001 loss 0.2897033989429474 train acc 0.9084040959040959
 51%|█████     | 1201/2344 [27:41<26:19,  1.38s/it]
epoch 2 batch id 1201 loss 0.12562845647335052 train acc 0.9112458368026645
 60%|█████▉    | 1401/2344 [32:18<21:43,  1.38s/it]
epoch 2 batch id 1401 loss 0.14803026616573334 train acc 0.9137892576730906
 68%|██████▊   | 1601/2344 [36:55<17:09,  1.39s/it]
epoch 2 batch id 1601 loss 0.222313791513443 train acc 0.9152287632729544
 77%|███████▋  | 1801/2344 [41:32<12:31,  1.38s/it]
epoch 2 batch id 1801 loss 0.12680453062057495 train acc 0.9166782343142699
 85%|████████▌ | 2001/2344 [46:09<07:55,  1.39s/it]
epoch 2 batch id 2001 loss 0.28915905952453613 train acc 0.9180722138930535
 94%|█████████▍| 2201/2344 [50:46<03:18,  1.38s/it]
epoch 2 batch id 2201 loss 0.15833142399787903 train acc 0.9191773625624716
100%|██████████| 2344/2344 [54:03<00:00,  1.38s/it]
epoch 2 train acc 0.9200489932236685
100%|██████████| 782/782 [06:16<00:00,  2.08it/s]
epoch 2 test acc 0.9000359654731458
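
The loop above keeps the checkpoint with the best test accuracy in ./model_state_dict.pt. If the runtime restarts, a minimal sketch for restoring it (assuming the same KoBERTClassifier definition and the bertmodel loaded earlier):

# (Sketch) restore the best checkpoint saved during training
model = KoBERTClassifier(bertmodel, dr_rate=0.3).to(device)
model.load_state_dict(torch.load('./model_state_dict.pt', map_location=device))
model.eval()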

Prediction on the test dataset

def predict(model, sentence):
  global tokenizer, vocab, max_len, device

  data = [[sentence, '0']] # input format: [sentence, dummy label]
  
  tokenized_data = BERTDataset(data, 0, 1, tokenizer, vocab, max_len, True, False)
  tensor_data = torch.utils.data.DataLoader(tokenized_data)
  
  with torch.no_grad():
    model.eval()
    token_ids, valid_len, segment_ids, label = next(iter(tensor_data))
    
    token_ids = token_ids.long().to(device)
    segment_ids = segment_ids.long().to(device)
    
    out = model(token_ids, valid_len, segment_ids)
    softmax_out = nn.functional.softmax(out, dim=1)

  return softmax_out
test = pd.read_csv("./nsmc/ratings_test.txt", sep='\t')
for sentence in test.document:
  sentence = '애매 하지만 낫밷'  # override with a sample review and predict just once
  softmax_out = predict(model, sentence)
  pred_label = softmax_out.argmax(dim=1)
  break
softmax_out
tensor([[0.0669, 0.9331]], device='cuda:0')
pred_label
tensor([1], device='cuda:0')
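
Since NSMC labels negative reviews as 0 and positive reviews as 1, the softmax output can be mapped to a readable label. A small usage sketch on top of predict (predict_label is a hypothetical helper, not defined above):

# (Sketch) hypothetical helper mapping predict() output to a readable label
def predict_label(model, sentence):
  softmax_out = predict(model, sentence)
  idx = softmax_out.argmax(dim=1).item()
  label = 'positive' if idx == 1 else 'negative'
  return label, softmax_out[0][idx].item()

print(predict_label(model, '애매 하지만 낫밷'))  # e.g. ('positive', 0.9331)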
