textbox.data.utils¶
- textbox.data.utils.attribute2idx(text, token2idx)[source]¶
Transform attribute tokens to indices.
- Parameters
text (List[List[List[str]]] or List[List[List[List[str]]]]) – list of attribute data, consisting of multiple groups.
token2idx (dict) – map token to index.
- Returns
idx (List[List[List[int]]] or List[List[List[List[int]]]]): attribute index.
length (None or List[List[int]]): sequence length.
- Return type
tuple
- textbox.data.utils.build_attribute_vocab(text)[source]¶
Build an attribute vocabulary from a list of attribute data.
- Parameters
text (List[List[List[str]]] or List[List[List[List[str]]]]) – list of attribute data, consisting of multiple groups.
- Returns
idx2token (dict): map index to token.
token2idx (dict): map token to index.
- Return type
tuple
- textbox.data.utils.build_vocab(text, max_vocab_size, special_token_list)[source]¶
Build a vocabulary from a list of text data.
- Parameters
text (List[List[List[str]]] or List[List[List[List[str]]]]) – list of text data, consisting of multiple groups.
max_vocab_size (int) – max size of vocabulary.
special_token_list (List[str]) – list of special tokens.
- Returns
idx2token (dict): map index to token.
token2idx (dict): map token to index.
max_vocab_size (int): updated max size of vocabulary.
- Return type
tuple
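The behavior described above can be sketched with a minimal, self-contained re-implementation. This is illustrative only, not the library's actual code; `build_vocab_sketch` is a hypothetical name, and the exact frequency-counting and tie-breaking rules of the real function may differ:

```python
from collections import Counter

def build_vocab_sketch(text, max_vocab_size, special_token_list):
    # Count tokens across all groups of tokenized sentences.
    counter = Counter(tok for group in text for sent in group for tok in sent)
    # Reserve slots for special tokens, then keep the most frequent tokens.
    tokens = special_token_list + [
        tok for tok, _ in counter.most_common(max_vocab_size - len(special_token_list))
    ]
    idx2token = dict(enumerate(tokens))
    token2idx = {tok: i for i, tok in idx2token.items()}
    # The returned max_vocab_size is updated to the actual vocabulary size.
    return idx2token, token2idx, len(tokens)
```

Special tokens occupy the lowest indices, so a padding token conventionally maps to index 0.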
- textbox.data.utils.data_preparation(config, save=False)[source]¶
Call dataloader_construct() to create the corresponding dataloaders.
- Parameters
config (Config) – An instance object of Config, used to record parameter information.
save (bool, optional) – If True, call save_datasets() to save the split datasets. Defaults to False.
- Returns
train_data (AbstractDataLoader): The dataloader for training.
valid_data (AbstractDataLoader): The dataloader for validation.
test_data (AbstractDataLoader): The dataloader for testing.
- Return type
tuple
- textbox.data.utils.dataloader_construct(name, config, dataset, batch_size=1, shuffle=False, drop_last=True, DDP=False)[source]¶
Get the correct dataloader class by calling get_dataloader(), then construct the dataloader.
- Parameters
name (str) – The stage of the dataloader. It can only take two values: ‘train’ or ‘evaluation’.
config (Config) – An instance object of Config, used to record parameter information.
dataset (Dataset or list of Dataset) – The split dataset for constructing the dataloader.
batch_size (int, optional) – The batch size of the dataloader. Defaults to 1.
shuffle (bool, optional) – Whether the dataloader will be shuffled after each epoch. Defaults to False.
drop_last (bool, optional) – Whether the dataloader will drop the last incomplete batch. Defaults to True.
DDP (bool, optional) – Whether the dataloader will be distributed across different GPUs. Defaults to False.
- Returns
Constructed dataloader in split dataset.
- Return type
AbstractDataLoader or list of AbstractDataLoader
- textbox.data.utils.get_dataloader(config)[source]¶
Return a dataloader class according to config and split_strategy.
- Parameters
config (Config) – An instance object of Config, used to record parameter information.
- Returns
The dataloader class that meets the requirements of config and split_strategy.
- Return type
type
- textbox.data.utils.get_dataset(config)[source]¶
Create a dataset according to config['model'] and config['MODEL_TYPE'].
- Parameters
config (Config) – An instance object of Config, used to record parameter information.
- Returns
Constructed dataset.
- Return type
Dataset
- textbox.data.utils.load_data(dataset_path, tokenize_strategy, max_length, language, multi_sentence, max_num)[source]¶
Load one dataset split (train, valid, or test). This is designed for the single-sentence format.
- Parameters
dataset_path (str) – path of dataset dir.
tokenize_strategy (str) – strategy of tokenizer.
max_length (int) – max length of sequence.
language (str) – language of text.
multi_sentence (bool) – whether to split text into sentence level.
max_num (int) – max number of sequences.
- Returns
the text list loaded from dataset path.
- Return type
List[List[str]]
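A minimal sketch of the loading loop described above, omitting the tokenize_strategy, language, and multi_sentence handling for brevity. Illustrative only: `load_data_sketch` is a hypothetical name, and the demo file name `train.src` is an assumption, not a path mandated by the library:

```python
import os
import tempfile

def load_data_sketch(dataset_path, max_length, max_num):
    # Read at most max_num lines, split each line on whitespace,
    # and truncate each token sequence to max_length tokens.
    text = []
    with open(dataset_path, "r", encoding="utf-8") as fin:
        for line in fin:
            if len(text) >= max_num:
                break
            tokens = line.strip().split()
            if tokens:
                text.append(tokens[:max_length])
    return text

# Demo with a temporary file standing in for a dataset split.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "train.src")
    with open(path, "w", encoding="utf-8") as f:
        f.write("the quick brown fox\njumps over the lazy dog\n")
    loaded = load_data_sketch(path, max_length=3, max_num=2)
```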
- textbox.data.utils.pad_sequence(idx, length, padding_idx, num=None)[source]¶
Pad a batch of word index data so that all sequences have equal length.
- Parameters
idx (List[List[int]] or List[List[List[int]]]) – word index
length (List[int] or List[List[int]]) – sequence length
padding_idx (int) – the index of padding token
num (List[int]) – sequence number
- Returns
idx (List[List[int]] or List[List[List[int]]]): padded word index.
length (List[int] or List[List[int]]): sequence length.
num (List[int]): sequence number.
- Return type
tuple
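For the flat List[List[int]] case, the padding step can be sketched as follows. Illustrative only: `pad_sequence_sketch` is a hypothetical name, and the real function also handles the nested sentence-level case and the num argument:

```python
def pad_sequence_sketch(idx, length, padding_idx):
    # Pad every sequence in the batch with padding_idx so that
    # all sequences reach the length of the longest one.
    max_len = max(length)
    padded = [seq + [padding_idx] * (max_len - len(seq)) for seq in idx]
    return padded, length
```

The original (unpadded) lengths are returned alongside the padded batch so that downstream code can mask out the padding positions.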
- textbox.data.utils.text2idx(text, token2idx, tokenize_strategy)[source]¶
Transform text to indices and add sos and eos token indices.
- Parameters
text (List[List[List[str]]] or List[List[List[List[str]]]]) – list of text data, consisting of multiple groups.
token2idx (dict) – map token to index
tokenize_strategy (str) – strategy of tokenizer.
- Returns
idx (List[List[List[int]]] or List[List[List[List[int]]]]): word index.
length (List[List[int]] or List[List[List[int]]]): sequence length.
num (None or List[List[int]]): sequence number.
- Return type
tuple
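The index mapping for the non-nested case can be sketched as below. Illustrative only: `text2idx_sketch` and the explicit `sos_idx`/`eos_idx`/`unk_idx` parameters are assumptions for the sketch (the real function derives these from token2idx and tokenize_strategy):

```python
def text2idx_sketch(text, token2idx, sos_idx, eos_idx, unk_idx):
    # Map each token to its index and wrap the sequence with the
    # sos/eos indices; unknown tokens fall back to unk_idx.
    idx, length = [], []
    for group in text:
        group_idx, group_len = [], []
        for sent in group:
            ids = [sos_idx] + [token2idx.get(t, unk_idx) for t in sent] + [eos_idx]
            group_idx.append(ids)
            group_len.append(len(ids))
        idx.append(group_idx)
        length.append(group_len)
    return idx, length
```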
- textbox.data.utils.tokenize(text, tokenize_strategy, language, multi_sentence)[source]¶
Tokenize text data.
- Parameters
text (str) – text data.
tokenize_strategy (str) – strategy of tokenizer.
language (str) – language of text.
multi_sentence (bool) – whether to split text into sentence level.
- Returns
the tokenized text data.
- Return type
List[str]
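A toy version of the tokenization step, assuming whitespace tokenization and a naive period-based sentence split. Illustrative only: `tokenize_sketch` is a hypothetical name, and the real function dispatches on tokenize_strategy and language rather than always splitting on whitespace:

```python
def tokenize_sketch(text, multi_sentence=False):
    # With multi_sentence, split into sentences first, then tokenize
    # each sentence, yielding a list of token lists instead.
    if multi_sentence:
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        return [s.split() for s in sentences]
    return text.split()
```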