Introduction to the Special Tokens in the BERT Deep Learning Model

Author: 狼烟四起 · 2023-09-25 15:18

Summary: BERT Part 3: The Tokenizer

BERT Part 3: The Tokenizer
In the previous articles in this series, we introduced the context of BERT (Bidirectional Encoder Representations from Transformers) and reviewed its architecture and training procedures. In this third and final installment, we will delve into one of the most critical components of BERT: the tokenizer.
The BERT tokenizer is the first step in preparing text for the model: it transforms raw text into a numerical representation that the model can understand, and in doing so plays a crucial role in determining the model’s performance. Here, we will cover the following topics:

  1. Tokenization overview
    Tokenization is the process of breaking text into smaller, meaningful units, or tokens; these tokens typically correspond to words or punctuation marks in the text (the short sketch below shows what this looks like in practice). BERT, however, goes beyond this traditional approach and adds two further ingredients: BERT-specific special tokens and segment indicators.
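As a quick illustration, here is a minimal sketch, assuming the Hugging Face transformers package is installed and using the bert-base-uncased checkpoint purely as an example (any BERT checkpoint behaves similarly):

```python
# A minimal sketch of plain BERT tokenization, assuming the Hugging Face
# transformers package and the bert-base-uncased checkpoint.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Prints a list of lower-cased tokens; words missing from the vocabulary
# are split into '##'-prefixed WordPiece units.
print(tokenizer.tokenize("Tokenizers turn raw text into subword units."))
```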
  2. Special tokens in BERT
    BERT reserves a handful of special tokens in its vocabulary that never come from the raw text itself but are essential for structuring the input sequence the model sees. These special tokens include:
    [CLS] (Classification) - The first token of every input sequence. Its final hidden state serves as the aggregate representation of the whole sequence and is what classification heads are attached to.
    [SEP] (Separator) - The separator token that marks the end of each sentence/segment within the input text, letting the model tell where one segment stops and the next begins.
    [PAD] (Padding) - The padding token used to bring all sequences in a batch to the same length. Padded positions carry no content and are excluded via the attention mask, so they do not influence the representations of the real tokens. The sketch after this list prints these tokens and their vocabulary IDs.
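The tokenizer object exposes these special tokens and their integer IDs directly. A minimal sketch, again assuming bert-base-uncased (the IDs in the comments are the ones that checkpoint uses):

```python
# A minimal sketch of BERT's special tokens, assuming bert-base-uncased.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.cls_token, tokenizer.cls_token_id)  # [CLS] 101
print(tokenizer.sep_token, tokenizer.sep_token_id)  # [SEP] 102
print(tokenizer.pad_token, tokenizer.pad_token_id)  # [PAD] 0
```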
  3. Segment indicators
    For sentence-pair inputs, BERT arranges the text as a [CLS] token, the first sentence, a [SEP] token, the second sentence, and a final [SEP] token. In addition, every position carries a segment (token type) ID of 0 or 1 recording which sentence it belongs to. These segment indicators help BERT understand where each sentence starts and ends within the input sequence, as the sketch below shows.
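A minimal sketch of the segment IDs for a sentence pair, again with bert-base-uncased (the example sentences are our own placeholders):

```python
# A minimal sketch of segment (token type) IDs for a sentence pair,
# assuming bert-base-uncased.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("How old are you?", "I am six years old.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'how', 'old', 'are', 'you', '?', '[SEP]',
#  'i', 'am', 'six', 'years', 'old', '.', '[SEP]']
print(encoded["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```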
  4. Implementing the BERT tokenizer
    The BERT tokenizer is most conveniently used through Hugging Face’s Transformers library (formerly released as pytorch-transformers and, before that, pytorch-pretrained-bert). The library provides ready-made tokenizers such as BertTokenizer and BertTokenizerFast that can be loaded from pretrained checkpoints or retrained on your own corpus. For maximum control and flexibility, you can also build a custom WordPiece tokenizer from scratch. Both routes are sketched below.
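A rough sketch of both routes, assuming a recent version of the transformers library; the tiny placeholder corpus and the my-bert-tokenizer output directory are our own illustrative names, not from the article:

```python
# A minimal sketch: load a ready-made BERT tokenizer, or derive a new
# WordPiece vocabulary from your own corpus.
from transformers import BertTokenizerFast

# Route 1: the tokenizer that ships with a pretrained checkpoint.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Route 2: keep the same tokenization rules, but learn a new vocabulary
# from an in-domain corpus (a tiny placeholder corpus here).
corpus = ["example in-domain sentence one", "example in-domain sentence two"]
custom_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=8000)
custom_tokenizer.save_pretrained("my-bert-tokenizer")
```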
  5. Using the BERT tokenizer for encoding text
    Once you have initialized a BERT tokenizer, you can use it to encode your text into a numerical representation. The process involves the following steps (a combined sketch follows the list):
  • Tokenization - The text is cleaned and split into words and punctuation marks, and the special tokens ([CLS], [SEP], and, where padding is needed, [PAD]) are added around the result.
  • WordPiece tokenization - Each word is further split into sub-word units so that rare or out-of-vocabulary words can still be represented. This uses a WordPiece vocabulary learned from a large corpus of text (such as Wikipedia).
  • Segmentation - Each token is assigned a segment (token type) ID that records which sentence of the input it belongs to.
  • Numerical representation - Each token is replaced by its corresponding integer ID from the vocabulary, and an attention mask marks real tokens versus padding, resulting in a numerical representation of the input text that can be fed into a model like BERT for training or inference.
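Putting these steps together, a minimal encoding sketch (again with bert-base-uncased; the two example sentences are placeholders of our own) might look like this:

```python
# A minimal sketch of encoding a small batch, assuming bert-base-uncased.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["Tokenization is fun.", "BERT uses WordPiece tokenization."],
    padding=True,  # pad the shorter sequence with [PAD] so lengths match
)

print(batch["input_ids"])       # integer IDs; every sequence starts with [CLS]'s ID
print(batch["attention_mask"])  # 1 for real tokens, 0 for [PAD] positions
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0]))
```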
  6. Conclusion
    In this third and final article on BERT, we covered the concept of tokenization and its importance in preparing text for deep learning models like BERT. We also introduced the special tokens BERT uses to structure its input sequences, along with the segment indicators that distinguish sentence pairs. Finally, we walked through how to set up a BERT tokenizer with Hugging Face’s Transformers library and use it to encode text into numerical representations for training or inference.