
Pre-Tokenized Dataset from Hugging Face¤

Pre-Tokenized Dataset from Hugging Face.

PreTokenizedDataset should work with any of the following tokenized datasets (a short usage sketch follows the list):

- NeelNanda/pile-small-tokenized-2b
- NeelNanda/pile-tokenized-10b
- NeelNanda/openwebtext-tokenized-9b
- NeelNanda/c4-tokenized-2b
- NeelNanda/code-tokenized
- NeelNanda/c4-code-tokenized-2b
- NeelNanda/pile-old-tokenized-2b
- alancooney/sae-monology-pile-uncopyrighted-tokenizer-gpt2
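A minimal usage sketch, assuming the import path matches the source location shown below and that the `SourceDataset` base class exposes the wrapped (shuffled, preprocessed) Hugging Face dataset as a `dataset` attribute; the iteration step is illustrative rather than the canonical API.

```python
from sparse_autoencoder.source_data.pretokenized_dataset import PreTokenizedDataset

# Stream a pre-tokenized dataset from Hugging Face (no full download by default).
# `context_size` must not exceed the stored prompt length of the chosen dataset.
source_data = PreTokenizedDataset(
    dataset_path="NeelNanda/c4-code-tokenized-2b",
    context_size=256,
)

# Assumption: `.dataset` is the underlying Hugging Face (iterable) dataset after
# shuffling and preprocessing; each item is a dict with an "input_ids" list of
# exactly `context_size` token ids.
first_item = next(iter(source_data.dataset))
print(len(first_item["input_ids"]))  # expected: 256
```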

PreTokenizedDataset ¤

Bases: SourceDataset[dict]

General Pre-Tokenized Dataset from Hugging Face.

Can be used for various datasets available on Hugging Face.

Source code in sparse_autoencoder/source_data/pretokenized_dataset.py
@final
class PreTokenizedDataset(SourceDataset[dict]):
    """General Pre-Tokenized Dataset from Hugging Face.

    Can be used for various datasets available on Hugging Face.
    """

    def preprocess(
        self,
        source_batch: dict,
        *,
        context_size: int,
    ) -> TokenizedPrompts:
        """Preprocess a batch of prompts.

        The method splits each pre-tokenized item based on the context size.

        Args:
            source_batch: A batch of source data.
            context_size: The context size to use for tokenized prompts.

        Returns:
            Tokenized prompts.

        Raises:
            ValueError: If the context size is larger than the tokenized prompt size.
        """
        tokenized_prompts: list[list[int]] = source_batch[self._dataset_column_name]

        # Check the context size is not too large
        if context_size > len(tokenized_prompts[0]):
            error_message = (
                f"The context size ({context_size}) is larger than the "
                f"tokenized prompt size ({len(tokenized_prompts[0])})."
            )
            raise ValueError(error_message)

        # Chunk each tokenized prompt into blocks of context_size,
        # discarding the last block if too small.
        context_size_prompts = []
        for encoding in tokenized_prompts:
            chunks = [
                encoding[i : i + context_size]
                for i in range(0, len(encoding), context_size)
                if len(encoding[i : i + context_size]) == context_size
            ]
            context_size_prompts.extend(chunks)

        return {"input_ids": context_size_prompts}

    @validate_call
    def __init__(
        self,
        dataset_path: str,
        context_size: PositiveInt = 256,
        buffer_size: PositiveInt = 1000,
        dataset_dir: str | None = None,
        dataset_files: str | Sequence[str] | Mapping[str, str | Sequence[str]] | None = None,
        dataset_split: str = "train",
        dataset_column_name: str = "input_ids",
        preprocess_batch_size: PositiveInt = 1000,
        *,
        pre_download: bool = False,
    ):
        """Initialize a pre-tokenized dataset from Hugging Face.

        Args:
            dataset_path: The path to the dataset on Hugging Face (e.g.
                `alancooney/sae-monology-pile-uncopyrighted-tokenizer-gpt2`).
            context_size: The context size for tokenized prompts.
            buffer_size: The buffer size to use when shuffling the dataset when streaming. When
                streaming a dataset, this just pre-downloads at least `buffer_size` items and then
                shuffles just that buffer. Note that the generated activations should also be
                shuffled before training the sparse autoencoder, so a large buffer may not be
                strictly necessary here. Note also that this is the number of items in the dataset
                (e.g. number of prompts) and is typically significantly less than the number of
                tokenized prompts once the preprocessing function has been applied.
            dataset_dir: Defining the `data_dir` of the dataset configuration.
            dataset_files: Path(s) to source data file(s).
            dataset_split: Dataset split (e.g. `train`).
            dataset_column_name: The column name for the tokenized prompts.
            preprocess_batch_size: The batch size to use just for preprocessing the dataset (e.g.
                tokenizing prompts).
            pre_download: Whether to pre-download the whole dataset.
        """
        super().__init__(
            buffer_size=buffer_size,
            context_size=context_size,
            dataset_dir=dataset_dir,
            dataset_files=dataset_files,
            dataset_path=dataset_path,
            dataset_split=dataset_split,
            dataset_column_name=dataset_column_name,
            pre_download=pre_download,
            preprocess_batch_size=preprocess_batch_size,
        )

__init__(dataset_path, context_size=256, buffer_size=1000, dataset_dir=None, dataset_files=None, dataset_split='train', dataset_column_name='input_ids', preprocess_batch_size=1000, *, pre_download=False) ¤

Initialize a pre-tokenized dataset from Hugging Face.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset_path` | `str` | The path to the dataset on Hugging Face (e.g. `alancooney/sae-monology-pile-uncopyrighted-tokenizer-gpt2`). | *required* |
| `context_size` | `PositiveInt` | The context size for tokenized prompts. | `256` |
| `buffer_size` | `PositiveInt` | The buffer size to use when shuffling the dataset when streaming. When streaming a dataset, this just pre-downloads at least `buffer_size` items and then shuffles just that buffer. Note that the generated activations should also be shuffled before training the sparse autoencoder, so a large buffer may not be strictly necessary here. Note also that this is the number of items in the dataset (e.g. number of prompts) and is typically significantly less than the number of tokenized prompts once the preprocessing function has been applied. | `1000` |
| `dataset_dir` | `str \| None` | The `data_dir` of the dataset configuration. | `None` |
| `dataset_files` | `str \| Sequence[str] \| Mapping[str, str \| Sequence[str]] \| None` | Path(s) to source data file(s). | `None` |
| `dataset_split` | `str` | Dataset split (e.g. `train`). | `'train'` |
| `dataset_column_name` | `str` | The column name for the tokenized prompts. | `'input_ids'` |
| `preprocess_batch_size` | `PositiveInt` | The batch size to use just for preprocessing the dataset (e.g. tokenizing prompts). | `1000` |
| `pre_download` | `bool` | Whether to pre-download the whole dataset. | `False` |
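To make the streaming-related parameters concrete, here is a hedged sketch of two configurations. Only constructor arguments documented above are used; the dataset names come from the list at the top of this page, and the specific values are illustrative.

```python
from sparse_autoencoder.source_data.pretokenized_dataset import PreTokenizedDataset

# Streaming configuration: at least `buffer_size` items are pre-downloaded and
# shuffled within that window. A modest buffer is often enough, since generated
# activations are typically shuffled again before training the autoencoder.
streamed = PreTokenizedDataset(
    dataset_path="NeelNanda/pile-tokenized-10b",
    context_size=256,
    buffer_size=1000,
    dataset_split="train",
)

# Pre-downloaded configuration: `pre_download=True` fetches the whole dataset up
# front, trading disk space and start-up time for faster subsequent iteration.
downloaded = PreTokenizedDataset(
    dataset_path="NeelNanda/c4-tokenized-2b",
    context_size=128,  # must not exceed the dataset's stored prompt length
    pre_download=True,
)
```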
Source code in sparse_autoencoder/source_data/pretokenized_dataset.py
@validate_call
def __init__(
    self,
    dataset_path: str,
    context_size: PositiveInt = 256,
    buffer_size: PositiveInt = 1000,
    dataset_dir: str | None = None,
    dataset_files: str | Sequence[str] | Mapping[str, str | Sequence[str]] | None = None,
    dataset_split: str = "train",
    dataset_column_name: str = "input_ids",
    preprocess_batch_size: PositiveInt = 1000,
    *,
    pre_download: bool = False,
):
    """Initialize a pre-tokenized dataset from Hugging Face.

    Args:
        dataset_path: The path to the dataset on Hugging Face (e.g.
            `alancooney/sae-monology-pile-uncopyrighted-tokenizer-gpt2`).
        context_size: The context size for tokenized prompts.
        buffer_size: The buffer size to use when shuffling the dataset when streaming. When
            streaming a dataset, this just pre-downloads at least `buffer_size` items and then
            shuffles just that buffer. Note that the generated activations should also be
            shuffled before training the sparse autoencoder, so a large buffer may not be
            strictly necessary here. Note also that this is the number of items in the dataset
            (e.g. number of prompts) and is typically significantly less than the number of
            tokenized prompts once the preprocessing function has been applied.
        dataset_dir: Defining the `data_dir` of the dataset configuration.
        dataset_files: Path(s) to source data file(s).
        dataset_split: Dataset split (e.g. `train`).
        dataset_column_name: The column name for the tokenized prompts.
        preprocess_batch_size: The batch size to use just for preprocessing the dataset (e.g.
            tokenizing prompts).
        pre_download: Whether to pre-download the whole dataset.
    """
    super().__init__(
        buffer_size=buffer_size,
        context_size=context_size,
        dataset_dir=dataset_dir,
        dataset_files=dataset_files,
        dataset_path=dataset_path,
        dataset_split=dataset_split,
        dataset_column_name=dataset_column_name,
        pre_download=pre_download,
        preprocess_batch_size=preprocess_batch_size,
    )

preprocess(source_batch, *, context_size) ¤

Preprocess a batch of prompts.

The method splits each pre-tokenized item based on the context size.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `source_batch` | `dict` | A batch of source data. | *required* |
| `context_size` | `int` | The context size to use for tokenized prompts. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `TokenizedPrompts` | Tokenized prompts. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the context size is larger than the tokenized prompt size. |

Source code in sparse_autoencoder/source_data/pretokenized_dataset.py
def preprocess(
    self,
    source_batch: dict,
    *,
    context_size: int,
) -> TokenizedPrompts:
    """Preprocess a batch of prompts.

    The method splits each pre-tokenized item based on the context size.

    Args:
        source_batch: A batch of source data.
        context_size: The context size to use for tokenized prompts.

    Returns:
        Tokenized prompts.

    Raises:
        ValueError: If the context size is larger than the tokenized prompt size.
    """
    tokenized_prompts: list[list[int]] = source_batch[self._dataset_column_name]

    # Check the context size is not too large
    if context_size > len(tokenized_prompts[0]):
        error_message = (
            f"The context size ({context_size}) is larger than the "
            f"tokenized prompt size ({len(tokenized_prompts[0])})."
        )
        raise ValueError(error_message)

    # Chunk each tokenized prompt into blocks of context_size,
    # discarding the last block if too small.
    context_size_prompts = []
    for encoding in tokenized_prompts:
        chunks = [
            encoding[i : i + context_size]
            for i in range(0, len(encoding), context_size)
            if len(encoding[i : i + context_size]) == context_size
        ]
        context_size_prompts.extend(chunks)

    return {"input_ids": context_size_prompts}