textbox.data.utils

textbox.data.utils.attribute2idx(text, token2idx)[source]

Transform attribute tokens into indices.

Parameters
  • text (List[List[List[str]]] or List[List[List[List[str]]]]) – list of attribute data, consisting of multiple groups.

  • token2idx (dict) – map token to index

Returns

  • idx (List[List[List[int]]] or List[List[List[List[int]]]]): attribute index.

  • length (None or List[List[int]]): sequence length.

Return type

tuple
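
Example (a minimal sketch; the attribute values, vocabulary, and expected output are hypothetical illustrations of the types documented above):

>>> from textbox.data.utils import attribute2idx
>>> token2idx = {'pos': 0, 'neg': 1, 'long': 2, 'short': 3}  # hypothetical vocabulary
>>> text = [[['pos', 'long'], ['neg', 'short']]]  # one group with two examples
>>> idx, length = attribute2idx(text, token2idx)
>>> # expected: idx == [[[0, 2], [1, 3]]] (an assumption from the mapping above)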

textbox.data.utils.build_attribute_vocab(text)[source]

Build an attribute vocabulary from a list of attribute data.

Parameters

text (List[List[List[str]]] or List[List[List[List[str]]]]) – list of attribute data, consisting of multiple groups.

Returns

  • idx2token (dict): map index to token.

  • token2idx (dict): map token to index.

Return type

tuple
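
Example (a minimal sketch with hypothetical attribute data):

>>> from textbox.data.utils import build_attribute_vocab
>>> text = [[['pos', 'long'], ['neg', 'short']]]  # one group of attribute lists
>>> idx2token, token2idx = build_attribute_vocab(text)
>>> # token2idx maps each attribute token (e.g. 'pos') to an integer index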

textbox.data.utils.build_vocab(text, max_vocab_size, special_token_list)[source]

Build a vocabulary from a list of text data.

Parameters
  • text (List[List[List[str]]] or List[List[List[List[str]]]]) – list of text data, consisting of multiple groups.

  • max_vocab_size (int) – max size of vocabulary.

  • special_token_list (List[str]) – list of special tokens.

Returns

  • idx2token (dict): map index to token.

  • token2idx (dict): map token to index.

  • max_vocab_size (int): updated max size of vocabulary.

Return type

tuple
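
Example (a minimal sketch; the special tokens shown are illustrative, since the actual list is supplied by the caller):

>>> from textbox.data.utils import build_vocab
>>> text = [[['hello', 'world'], ['hello', 'there']]]  # one group of tokenized sentences
>>> specials = ['<pad>', '<unk>', '<sos>', '<eos>']  # assumed special tokens
>>> idx2token, token2idx, vocab_size = build_vocab(text, 1000, specials)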

textbox.data.utils.construct_quick_test_dataset(dataset_path)[source]
textbox.data.utils.data_preparation(config, save=False)[source]

Call dataloader_construct() to create the corresponding dataloaders.

Parameters
  • config (Config) – An instance object of Config, used to record parameter information.

  • save (bool, optional) – If True, it will call save_datasets() to save the split datasets. Defaults to False.

Returns

  • train_data (AbstractDataLoader): The dataloader for training.

  • valid_data (AbstractDataLoader): The dataloader for validation.

  • test_data (AbstractDataLoader): The dataloader for testing.

Return type

tuple
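
Example (a minimal sketch; the Config import path and constructor arguments are assumptions and may differ by version):

>>> from textbox.config import Config
>>> from textbox.data.utils import data_preparation
>>> config = Config(model='RNN', dataset='COCO')  # placeholder model/dataset names
>>> train_data, valid_data, test_data = data_preparation(config, save=False)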

textbox.data.utils.dataloader_construct(name, config, dataset, batch_size=1, shuffle=False, drop_last=True, DDP=False)[source]

Get the correct dataloader class by calling get_dataloader(), then use it to construct the dataloader.

Parameters
  • name (str) – The stage of dataloader. It can only take two values: ‘train’ or ‘evaluation’.

  • config (Config) – An instance object of Config, used to record parameter information.

  • dataset (Dataset or list of Dataset) – The split dataset(s) used to construct the dataloader.

  • batch_size (int, optional) – The batch_size of the dataloader. Defaults to 1.

  • shuffle (bool, optional) – Whether the data will be reshuffled after each epoch. Defaults to False.

  • drop_last (bool, optional) – Whether the dataloader will drop the last incomplete batch. Defaults to True.

  • DDP (bool, optional) – Whether the dataloader will be distributed across different GPUs. Defaults to False.

Returns

The constructed dataloader, or a list of dataloaders when a list of datasets is given.

Return type

AbstractDataLoader or list of AbstractDataLoader
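
Example (a minimal sketch, assuming config and train_dataset were produced by the functions above):

>>> from textbox.data.utils import dataloader_construct
>>> train_data = dataloader_construct(
...     name='train', config=config, dataset=train_dataset,
...     batch_size=64, shuffle=True, drop_last=True, DDP=False)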

textbox.data.utils.deconstruct_quick_test_dataset(dataset_path)[source]
textbox.data.utils.get_dataloader(config)[source]

Return a dataloader class according to config and split_strategy.

Parameters

config (Config) – An instance object of Config, used to record parameter information.

Returns

The dataloader class that meets the requirements in config and split_strategy.

Return type

type
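
Note that this returns a class rather than an instance; dataloader_construct() is the usual caller. A minimal sketch:

>>> from textbox.data.utils import get_dataloader
>>> dataloader_class = get_dataloader(config)  # a subclass of AbstractDataLoader
>>> # dataloader_construct() instantiates this class on each split dataset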

textbox.data.utils.get_dataset(config)[source]

Create a dataset according to config['model'] and config['MODEL_TYPE'].

Parameters

config (Config) – An instance object of Config, used to record parameter information.

Returns

Constructed dataset.

Return type

Dataset
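
Example (a minimal sketch, assuming config carries the 'model' and 'MODEL_TYPE' keys mentioned above):

>>> from textbox.data.utils import get_dataset
>>> dataset = get_dataset(config)  # class chosen from config['model'] and config['MODEL_TYPE']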

textbox.data.utils.load_data(dataset_path, tokenize_strategy, max_length, language, multi_sentence, max_num)[source]

Load a dataset from its splits (train, valid, test). This is designed for the single-sentence format.

Parameters
  • dataset_path (str) – path of the dataset directory.

  • tokenize_strategy (str) – strategy of the tokenizer.

  • max_length (int) – max length of a sequence.

  • language (str) – language of the text.

  • multi_sentence (bool) – whether to split the text into sentences.

  • max_num (int) – max number of sequences.

Returns

The text list loaded from the dataset path.

Return type

List[List[str]]
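
Example (a hedged sketch; the directory layout, strategy name, and language value are illustrative assumptions):

>>> from textbox.data.utils import load_data
>>> text = load_data('dataset/COCO', tokenize_strategy='by_space',
...                  max_length=40, language='English',
...                  multi_sentence=False, max_num=1)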

textbox.data.utils.pad_sequence(idx, length, padding_idx, num=None)[source]

Pad a batch of word index data so that all sequences have equal length.

Parameters
  • idx (List[List[int]] or List[List[List[int]]]) – word index

  • length (List[int] or List[List[int]]) – sequence length

  • padding_idx (int) – the index of padding token

  • num (List[int], optional) – sequence number. Defaults to None.

Returns

  • idx (List[List[int]] or List[List[List[int]]]): padded word index.

  • length (List[int] or List[List[int]]): sequence length.

  • num (List[int]): sequence number.

Return type

tuple
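
Example (a minimal sketch with two sequences of unequal length; the padded output is an assumption that padding extends each sequence to the batch maximum):

>>> from textbox.data.utils import pad_sequence
>>> idx = [[5, 6], [5, 6, 7, 8]]  # word indices for two sequences
>>> length = [2, 4]               # their original lengths
>>> idx, length, num = pad_sequence(idx, length, padding_idx=0)
>>> # expected: idx == [[5, 6, 0, 0], [5, 6, 7, 8]]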

textbox.data.utils.text2idx(text, token2idx, tokenize_strategy)[source]

Transform text into indices and add the sos and eos token indices.

Parameters
  • text (List[List[List[str]]] or List[List[List[List[str]]]]) – list of text data, consisting of multiple groups.

  • token2idx (dict) – map token to index

  • tokenize_strategy (str) – strategy of tokenizer.

Returns

  • idx (List[List[List[int]]] or List[List[List[List[int]]]]): word index.

  • length (List[List[int]] or List[List[List[int]]]): sequence length.

  • num (None or List[List[int]]): sequence number.

Return type

tuple
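
Example (a minimal sketch with a hypothetical vocabulary; the strategy name is illustrative):

>>> from textbox.data.utils import text2idx
>>> token2idx = {'<sos>': 2, '<eos>': 3, 'hello': 4, 'world': 5}  # hypothetical vocabulary
>>> text = [[['hello', 'world']]]  # one group with one tokenized sentence
>>> idx, length, num = text2idx(text, token2idx, tokenize_strategy='by_space')
>>> # expected: each sequence is wrapped with the sos/eos indices (2 ... 3)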

textbox.data.utils.tokenize(text, tokenize_strategy, language, multi_sentence)[source]

Tokenize text data.

Parameters
  • text (str) – text data.

  • tokenize_strategy (str) – strategy of tokenizer.

  • language (str) – language of text.

  • multi_sentence (bool) – whether to split the text into sentences.

Returns

the tokenized text data.

Return type

List[str]
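
Example (a minimal sketch; 'by_space' is an illustrative strategy name, and the exact accepted values are defined elsewhere in the library):

>>> from textbox.data.utils import tokenize
>>> tokens = tokenize('hello world', tokenize_strategy='by_space',
...                   language='English', multi_sentence=False)
>>> # expected: ['hello', 'world'] under whitespace tokenization (an assumption)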