quanda.utils.tokenization module

Utils for tokenization of HuggingFace datasets.

quanda.utils.tokenization.resolve_tokenizer(tokenizer_cfg: dict) → Tuple[Any, int]

Resolve a tokenizer config to (tokenizer, pad_token_id).

The returned tokenizer exposes the Hugging Face call interface, __call__(text, padding, truncation, max_length) -> {"input_ids", "attention_mask"}. Supported backends:

  • backend: hf with name (HF tokenizer repo) — returns the AutoTokenizer directly.

  • backend: tiktoken with encoding (default gpt2) — returns a _TikTokenHFAdapter with the same interface.
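
The shared call interface described above can be illustrated with a minimal sketch. The class below is a hypothetical stand-in, not quanda's _TikTokenHFAdapter: it uses a toy whitespace encoder so that it runs without any dependencies, and the name ToyHFStyleTokenizer, the fixed pad id of 0, and the padding rules are assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Union


class ToyHFStyleTokenizer:
    """Hypothetical stand-in for an HF-style tokenizer; not quanda's adapter."""

    pad_token_id = 0  # assumed pad id for this sketch

    def __init__(self) -> None:
        self._vocab: Dict[str, int] = {}

    def _encode(self, text: str) -> List[int]:
        # Toy whitespace encoder: ids start at 1 so that 0 stays free for padding.
        return [self._vocab.setdefault(tok, len(self._vocab) + 1) for tok in text.split()]

    def __call__(
        self,
        text: Union[str, List[str]],
        padding: Union[bool, str] = False,
        truncation: bool = False,
        max_length: Optional[int] = None,
    ) -> Dict[str, list]:
        single = isinstance(text, str)
        texts = [text] if single else list(text)
        ids = [self._encode(t) for t in texts]

        # Cut sequences down to max_length when truncation is requested.
        if truncation and max_length is not None:
            ids = [seq[:max_length] for seq in ids]

        # Pad to max_length, or to the longest sequence in the batch.
        if padding == "max_length" and max_length is not None:
            target = max_length
        elif padding:
            target = max((len(seq) for seq in ids), default=0)
        else:
            target = None

        masks = [[1] * len(seq) for seq in ids]
        if target is not None:
            masks = [m + [0] * (target - len(m)) for m in masks]
            ids = [seq + [self.pad_token_id] * (target - len(seq)) for seq in ids]

        if single:
            return {"input_ids": ids[0], "attention_mask": masks[0]}
        return {"input_ids": ids, "attention_mask": masks}


tok = ToyHFStyleTokenizer()
batch = tok(["a b c d", "a b"], padding="max_length", truncation=True, max_length=3)
```

Here batch["input_ids"] holds one truncated and one padded sequence of length 3, and batch["attention_mask"] marks the padded position with 0. Any object that honours the same call signature and return keys could, under these assumptions, be used wherever the resolved tokenizer is expected.
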

quanda.utils.tokenization.tokenize_dataset(hf_dataset: Dataset, tokenizer_cfg: dict) → Dataset

Tokenize an HF dataset for transformer models.

Parameters:
  • hf_dataset (datasets.Dataset) – Raw HuggingFace dataset.

  • tokenizer_cfg (dict) – Keys: name, text_fields, max_length, label_field.

Returns:

Tokenized dataset formatted as torch tensors.

Return type:

datasets.Dataset
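
As an illustration, a tokenizer_cfg built from the four documented keys might look like the sketch below. Every value is a placeholder chosen for this example: the tokenizer repo name, the column names, the length limit, and the assumption that text_fields takes a list are not confirmed by the documentation above.

```python
# Hypothetical config: the keys come from the parameter description,
# the values are illustrative placeholders only.
tokenizer_cfg = {
    "name": "bert-base-uncased",  # assumed HF tokenizer repo
    "text_fields": ["text"],      # assumed: list of text column names
    "max_length": 128,            # assumed sequence length limit
    "label_field": "label",       # assumed label column name
}
```

With quanda and datasets installed, a dict of this shape would be passed as tokenize_dataset(hf_dataset, tokenizer_cfg), where hf_dataset is a datasets.Dataset containing the named columns.
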