quanda.utils.tokenization module
Utils for tokenization of HuggingFace datasets.
- quanda.utils.tokenization.resolve_tokenizer(tokenizer_cfg: dict) → Tuple[Any, int]
Resolve a tokenizer config to (tokenizer, pad_token_id).
The returned tokenizer exposes HF's __call__(text, padding, truncation, max_length) -> {"input_ids", "attention_mask"}.
Supported backends:
  - backend: hf with name (HF tokenizer repo): returns the AutoTokenizer directly.
  - backend: tiktoken with encoding (default gpt2): returns a _TikTokenHFAdapter with the same interface.
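The call contract described above can be illustrated with a small stand-in. This is only a sketch of the interface, not the library's _TikTokenHFAdapter: the class name ToyHFStyleAdapter, its constructor arguments, and the word-length "encoder" are all hypothetical and chosen so the example runs without any tokenizer dependency.

```python
from typing import Callable, Dict, List, Sequence, Union


class ToyHFStyleAdapter:
    """Hypothetical adapter: wraps a plain encode(str) -> list[int] function
    behind an HF-style __call__ that returns input_ids and attention_mask."""

    def __init__(self, encode: Callable[[str], Sequence[int]], pad_token_id: int = 0):
        self.encode = encode
        self.pad_token_id = pad_token_id

    def __call__(
        self,
        text: Union[str, List[str]],
        padding: Union[bool, str] = False,
        truncation: bool = False,
        max_length: Union[int, None] = None,
    ) -> Dict[str, list]:
        single = isinstance(text, str)
        texts = [text] if single else list(text)
        ids = [list(self.encode(t)) for t in texts]

        # Cut each sequence to max_length when truncation is requested.
        if truncation and max_length is not None:
            ids = [seq[:max_length] for seq in ids]
        masks = [[1] * len(seq) for seq in ids]

        # Pad either to max_length or to the longest sequence in the batch.
        if padding == "max_length" and max_length is not None:
            target = max_length
        elif padding:
            target = max((len(seq) for seq in ids), default=0)
        else:
            target = None
        if target is not None:
            for seq, mask in zip(ids, masks):
                n_pad = max(target - len(seq), 0)
                seq.extend([self.pad_token_id] * n_pad)
                mask.extend([0] * n_pad)

        if single:
            return {"input_ids": ids[0], "attention_mask": masks[0]}
        return {"input_ids": ids, "attention_mask": masks}


# Toy encoder (assumption for illustration): each word maps to its length.
toy_tokenizer = ToyHFStyleAdapter(lambda s: [len(w) for w in s.split()], pad_token_id=0)
batch = toy_tokenizer(
    ["hello world", "a bc def ghij"],
    padding="max_length",
    truncation=True,
    max_length=3,
)
```

Keeping the same call shape for both backends means downstream code can pass either tokenizer around without branching on the backend.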
- quanda.utils.tokenization.tokenize_dataset(hf_dataset: Dataset, tokenizer_cfg: dict) → Dataset
Tokenize an HF dataset for transformer models.
- Parameters:
hf_dataset (datasets.Dataset) – Raw HuggingFace dataset.
tokenizer_cfg (dict) – Keys: name, text_fields, max_length, label_field.
- Returns:
Tokenized dataset formatted as torch tensors.
- Return type:
datasets.Dataset
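As a rough illustration of the tokenizer_cfg argument, the sketch below builds a config with the four documented keys and checks that none is missing. The value types and the example values (the repo name, column names, and length) are assumptions, and the helper missing_keys is hypothetical rather than part of quanda. The actual call is left commented out so the snippet runs without the library installed.

```python
# Keys listed in the docstring for tokenizer_cfg.
DOCUMENTED_KEYS = ("name", "text_fields", "max_length", "label_field")

# Illustrative values only; the expected types are assumptions.
tokenizer_cfg = {
    "name": "bert-base-uncased",  # assumed: an HF tokenizer repo id
    "text_fields": ["text"],      # assumed: names of the text columns
    "max_length": 128,            # assumed: maximum sequence length in tokens
    "label_field": "label",       # assumed: name of the label column
}


def missing_keys(cfg: dict) -> list:
    """Hypothetical helper: return documented keys absent from cfg."""
    return [key for key in DOCUMENTED_KEYS if key not in cfg]


# With quanda and datasets installed, the documented call would look like:
# from quanda.utils.tokenization import tokenize_dataset
# tokenized = tokenize_dataset(hf_dataset, tokenizer_cfg)
```

Checking the config before tokenizing surfaces a misspelled or missing key early instead of partway through processing the dataset.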