textbox.data.utils

textbox.data.utils.attribute2idx(text, token2idx)[source]

Transform attribute tokens into indices.

Parameters
  • text (List[List[List[str]]] or List[List[List[List[str]]]]) – list of attribute data, consisting of multiple groups.

  • token2idx (dict) – map token to index

Returns

  • idx (List[List[List[int]]] or List[List[List[List[int]]]]): attribute index.

  • length (None or List[List[int]]): sequence length.

Return type

tuple
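
Example (a minimal sketch; the attribute values, vocabulary, and expected output are hypothetical illustrations of the types documented above):

>>> from textbox.data.utils import attribute2idx
>>> token2idx = {'pos': 0, 'neg': 1, 'long': 2, 'short': 3}  # hypothetical vocabulary
>>> text = [[['pos', 'long'], ['neg', 'short']]]  # one group with two examples
>>> idx, length = attribute2idx(text, token2idx)
>>> # expected: idx == [[[0, 2], [1, 3]]] (an assumption from the mapping above)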

textbox.data.utils.build_attribute_vocab(text)[source]

Build an attribute vocabulary from a list of attribute data.

Parameters

text (List[List[List[str]]] or List[List[List[List[str]]]]) – list of attribute data, consisting of multiple groups.

Returns

  • idx2token (dict): map index to token.

  • token2idx (dict): map token to index.

Return type

tuple
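
Example (a minimal sketch with hypothetical attribute data):

>>> from textbox.data.utils import build_attribute_vocab
>>> text = [[['pos', 'long'], ['neg', 'short']]]  # one group of attribute lists
>>> idx2token, token2idx = build_attribute_vocab(text)
>>> # token2idx maps each attribute token (e.g. 'pos') to an integer index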

textbox.data.utils.build_vocab(text, max_vocab_size, special_token_list)[source]

Build a vocabulary from a list of text data.

Parameters
  • text (List[List[List[str]]] or List[List[List[List[str]]]]) – list of text data, consisting of multiple groups.

  • max_vocab_size (int) – max size of vocabulary.

  • special_token_list (List[str]) – list of special tokens.

Returns

  • idx2token (dict): map index to token.

  • token2idx (dict): map token to index.

  • max_vocab_size (int): updated max size of vocabulary.

Return type

tuple
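
Example (a minimal sketch; the special tokens shown are illustrative, since the actual list is supplied by the caller):

>>> from textbox.data.utils import build_vocab
>>> text = [[['hello', 'world'], ['hello', 'there']]]  # one group of tokenized sentences
>>> specials = ['<pad>', '<unk>', '<sos>', '<eos>']  # assumed special tokens
>>> idx2token, token2idx, vocab_size = build_vocab(text, 1000, specials)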

textbox.data.utils.construct_quick_test_dataset(dataset_path)[source]
textbox.data.utils.data_preparation(config, save=False)[source]

Call dataloader_construct() to create the corresponding dataloaders.

Parameters
  • config (Config) – An instance object of Config, used to record parameter information.

  • save (bool, optional) – If True, it will call save_datasets() to save the split datasets. Defaults to False.

Returns

  • train_data (AbstractDataLoader): The dataloader for training.

  • valid_data (AbstractDataLoader): The dataloader for validation.

  • test_data (AbstractDataLoader): The dataloader for testing.

Return type

tuple
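
Example (a minimal sketch; the Config import path and constructor arguments are assumptions and may differ by version):

>>> from textbox.config import Config
>>> from textbox.data.utils import data_preparation
>>> config = Config(model='RNN', dataset='COCO')  # placeholder model/dataset names
>>> train_data, valid_data, test_data = data_preparation(config, save=False)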

textbox.data.utils.dataloader_construct(name, config, dataset, batch_size=1, shuffle=False, drop_last=True, DDP=False)[source]

Get the correct dataloader class by calling get_dataloader(), then use it to construct the dataloader.

Parameters
  • name (str) – The stage of dataloader. It can only take two values: ‘train’ or ‘evaluation’.

  • config (Config) – An instance object of Config, used to record parameter information.

  • dataset (Dataset or list of Dataset) – The split dataset(s) used to construct the dataloader.

  • batch_size (int, optional) – The batch_size of the dataloader. Defaults to 1.

  • shuffle (bool, optional) – Whether the data will be reshuffled after each epoch. Defaults to False.

  • drop_last (bool, optional) – Whether the dataloader will drop the last incomplete batch. Defaults to True.

  • DDP (bool, optional) – Whether the dataloader will be distributed across different GPUs. Defaults to False.

Returns

The constructed dataloader, or a list of dataloaders when a list of datasets is given.

Return type

AbstractDataLoader or list of AbstractDataLoader
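
Example (a minimal sketch, assuming config and train_dataset were produced by the functions above):

>>> from textbox.data.utils import dataloader_construct
>>> train_data = dataloader_construct(
...     name='train', config=config, dataset=train_dataset,
...     batch_size=64, shuffle=True, drop_last=True, DDP=False)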

textbox.data.utils.deconstruct_quick_test_dataset(dataset_path)[source]
textbox.data.utils.get_dataloader(config)[source]

Return a dataloader class according to config and split_strategy.

Parameters

config (Config) – An instance object of Config, used to record parameter information.

Returns

The dataloader class that meets the requirements in config and split_strategy.

Return type

type
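
Note that this returns a class rather than an instance; dataloader_construct() is the usual caller. A minimal sketch:

>>> from textbox.data.utils import get_dataloader
>>> dataloader_class = get_dataloader(config)  # a subclass of AbstractDataLoader
>>> # dataloader_construct() instantiates this class on each split dataset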

textbox.data.utils.get_dataset(config)[source]

Create a dataset according to config['model'] and config['MODEL_TYPE'].

Parameters

config (Config) – An instance object of Config, used to record parameter information.

Returns

Constructed dataset.

Return type

Dataset
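
Example (a minimal sketch, assuming config carries the 'model' and 'MODEL_TYPE' keys mentioned above):

>>> from textbox.data.utils import get_dataset
>>> dataset = get_dataset(config)  # class chosen from config['model'] and config['MODEL_TYPE']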

textbox.data.utils.load_data(dataset_path, tokenize_strategy, max_length, language, multi_sentence, max_num)[source]

Load a dataset from its splits (train, valid, test). This is designed for the single-sentence format.

Parameters
  • dataset_path (str) – path of the dataset directory.

  • tokenize_strategy (str) – strategy of the tokenizer.

  • max_length (int) – max length of a sequence.

  • language (str) – language of the text.

  • multi_sentence (bool) – whether to split the text into sentences.

  • max_num (int) – max number of sequences.

Returns

The text list loaded from the dataset path.

Return type

List[List[str]]
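
Example (a hedged sketch; the directory layout, strategy name, and language value are illustrative assumptions):

>>> from textbox.data.utils import load_data
>>> text = load_data('dataset/COCO', tokenize_strategy='by_space',
...                  max_length=40, language='English',
...                  multi_sentence=False, max_num=1)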

textbox.data.utils.pad_sequence(idx, length, padding_idx, num=None)[source]

Pad a batch of word index data so that all sequences have equal length.

Parameters
  • idx (List[List[int]] or List[List[List[int]]]) – word index

  • length (List[int] or List[List[int]]) – sequence length

  • padding_idx (int) – the index of padding token

  • num (List[int], optional) – sequence number. Defaults to None.

Returns

  • idx (List[List[int]] or List[List[List[int]]]): padded word index.

  • length (List[int] or List[List[int]]): sequence length.

  • num (List[int]): sequence number.

Return type

tuple
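
Example (a minimal sketch with two sequences of unequal length; the padded output is an assumption that padding extends each sequence to the batch maximum):

>>> from textbox.data.utils import pad_sequence
>>> idx = [[5, 6], [5, 6, 7, 8]]  # word indices for two sequences
>>> length = [2, 4]               # their original lengths
>>> idx, length, num = pad_sequence(idx, length, padding_idx=0)
>>> # expected: idx == [[5, 6, 0, 0], [5, 6, 7, 8]]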

textbox.data.utils.text2idx(text, token2idx, tokenize_strategy)[source]

Transform text into indices and add the sos and eos token indices.

Parameters
  • text (List[List[List[str]]] or List[List[List[List[str]]]]) – list of text data, consisting of multiple groups.

  • token2idx (dict) – map token to index

  • tokenize_strategy (str) – strategy of tokenizer.

Returns

  • idx (List[List[List[int]]] or List[List[List[List[int]]]]): word index.

  • length (List[List[int]] or List[List[List[int]]]): sequence length.

  • num (None or List[List[int]]): sequence number.

Return type

tuple
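
Example (a minimal sketch with a hypothetical vocabulary; the strategy name is illustrative):

>>> from textbox.data.utils import text2idx
>>> token2idx = {'<sos>': 2, '<eos>': 3, 'hello': 4, 'world': 5}  # hypothetical vocabulary
>>> text = [[['hello', 'world']]]  # one group with one tokenized sentence
>>> idx, length, num = text2idx(text, token2idx, tokenize_strategy='by_space')
>>> # expected: each sequence is wrapped with the sos/eos indices (2 ... 3)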

textbox.data.utils.tokenize(text, tokenize_strategy, language, multi_sentence)[source]

Tokenize text data.

Parameters
  • text (str) – text data.

  • tokenize_strategy (str) – strategy of tokenizer.

  • language (str) – language of text.

  • multi_sentence (bool) – whether to split the text into sentences.

Returns

the tokenized text data.

Return type

List[str]
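
Example (a minimal sketch; 'by_space' is an illustrative strategy name, and the exact accepted values are defined elsewhere in the library):

>>> from textbox.data.utils import tokenize
>>> tokens = tokenize('hello world', tokenize_strategy='by_space',
...                   language='English', multi_sentence=False)
>>> # expected: ['hello', 'world'] under whitespace tokenization (an assumption)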