
Abstract tokenized prompts dataset class¤

Abstract tokenized prompts dataset class.

HuggingFaceDatasetItem = TypeVar('HuggingFaceDatasetItem', bound=Any) module-attribute ¤

Hugging Face dataset item TypedDict.

When extending SourceDataset you should create a TypedDict that matches the structure of each dataset item in the underlying Hugging Face dataset.

Example

With the Uncopyrighted Pile this should be a typed dict with text and meta properties.

class PileUncopyrightedSourceDataBatch(TypedDict):
    text: list[str]
    meta: list[dict[str, dict[str, str]]]

TokenizedPrompt = list[int] module-attribute ¤

A tokenized prompt.

SourceDataset ¤

Bases: ABC, Generic[HuggingFaceDatasetItem]

Abstract source dataset.

Source dataset that is used to generate the activations dataset (by running forward passes of the source model with this data). It should contain prompts that have been tokenized with no padding tokens (apart from an optional single first padding token). This enables efficient generation of the activations dataset.

Wraps a Hugging Face IterableDataset.
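
A rough sketch of what a concrete subclass might look like is shown below. The dataset path, tokenizer wiring, and non-overlapping chunking strategy are assumptions for illustration only; MyTextDataset and MyTextSourceDataBatch are hypothetical names, not part of this module.

from typing import TypedDict

from transformers import PreTrainedTokenizerBase

from sparse_autoencoder.source_data.abstract_dataset import (
    SourceDataset,
    TokenizedPrompt,
    TokenizedPrompts,
)


class MyTextSourceDataBatch(TypedDict):
    """Hypothetical batch structure for a plain-text Hugging Face dataset."""

    text: list[str]


class MyTextDataset(SourceDataset[MyTextSourceDataBatch]):
    """Hypothetical concrete source dataset over a plain-text corpus."""

    tokenizer: PreTrainedTokenizerBase

    def __init__(
        self,
        tokenizer: PreTrainedTokenizerBase,
        dataset_path: str = "monology/pile-uncopyrighted",  # Assumed dataset path
        dataset_split: str = "train",
        context_size: int = 250,
        *,
        pre_download: bool = False,
    ):
        self.tokenizer = tokenizer
        super().__init__(
            dataset_path=dataset_path,
            dataset_split=dataset_split,
            context_size=context_size,
            pre_download=pre_download,
        )

    def preprocess(
        self,
        source_batch: MyTextSourceDataBatch,
        *,
        context_size: int,
    ) -> TokenizedPrompts:
        # Tokenize without padding, then split each tokenized text into
        # non-overlapping chunks of exactly `context_size` tokens (dropping any
        # remainder), so no padding tokens are ever needed.
        tokenized = self.tokenizer(source_batch["text"])
        prompts: list[TokenizedPrompt] = []
        for token_ids in tokenized["input_ids"]:
            for start in range(0, len(token_ids) - context_size + 1, context_size):
                prompts.append(token_ids[start : start + context_size])
        return {"input_ids": prompts}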

Source code in sparse_autoencoder/source_data/abstract_dataset.py
class SourceDataset(ABC, Generic[HuggingFaceDatasetItem]):
    """Abstract source dataset.

    Source dataset that is used to generate the activations dataset (by running forward passes of
    the source model with this data). It should contain prompts that have been tokenized with no
    padding tokens (apart from an optional single first padding token). This enables efficient
    generation of the activations dataset.

    Wraps a Hugging Face IterableDataset.
    """

    context_size: int
    """Number of tokens in the context window.

    The paper *Towards Monosemanticity: Decomposing Language Models With Dictionary Learning* used
    a context size of 250.
    """

    dataset: Dataset | IterableDataset
    """Underlying HuggingFace Dataset.

    Warning:
        Hugging Face `Dataset` objects are confusingly not the same as PyTorch `Dataset` objects.
    """

    _dataset_column_name: str
    """Dataset column name for the prompts."""

    @abstractmethod
    def preprocess(
        self,
        source_batch: HuggingFaceDatasetItem,
        *,
        context_size: int,
    ) -> TokenizedPrompts:
        """Preprocess function.

        Takes a `preprocess_batch_size` ($m$) batch of source data (which may e.g. include string
        prompts), and returns a dict with a single key of `input_ids` and a value of an arbitrary
        length list ($n$) of tokenized prompts. Note that $m$ does not have to be equal to $n$.

        Applied to the dataset with the [Hugging Face
        Dataset](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Dataset.map)
        `map` function.

        Warning:
            The returned tokenized prompts should not have any padding tokens (apart from an
            optional single first padding token).

        Args:
            source_batch: A batch of source data. For example, with The Pile dataset this would be a
                dict including the key "text" with a value of a list of strings (not yet tokenized).
            context_size: The context size to use when returning a list of tokenized prompts.
                *Towards Monosemanticity: Decomposing Language Models With Dictionary Learning* used
                a context size of 250.

        Returns:
            Tokenized prompts.
        """

    @abstractmethod
    @validate_call
    def __init__(
        self,
        dataset_path: str,
        dataset_split: str,
        context_size: PositiveInt,
        buffer_size: PositiveInt = 1000,
        dataset_dir: str | None = None,
        dataset_files: str | Sequence[str] | Mapping[str, str | Sequence[str]] | None = None,
        dataset_column_name: str = "input_ids",
        n_processes_preprocessing: PositiveInt | None = None,
        preprocess_batch_size: PositiveInt = 1000,
        *,
        pre_download: bool = False,
    ):
        """Initialise the dataset.

        Loads the dataset with streaming from Hugging Face, then adds preprocessing and shuffling
        to the underlying Hugging Face `IterableDataset`.

        Args:
            dataset_path: The path to the dataset on Hugging Face.
            dataset_split: Dataset split (e.g. `train`).
            context_size: The context size to use when returning a list of tokenized prompts.
                *Towards Monosemanticity: Decomposing Language Models With Dictionary Learning* used
                a context size of 250.
            buffer_size: The buffer size to use when shuffling the dataset when streaming. When
                streaming a dataset, this just pre-downloads at least `buffer_size` items and then
                shuffles just that buffer. Note that the generated activations should also be
                shuffled before training the sparse autoencoder, so a large buffer may not be
                strictly necessary here. Note also that this is the number of items in the dataset
                (e.g. number of prompts) and is typically significantly less than the number of
                tokenized prompts once the preprocessing function has been applied.
            dataset_dir: Defining the `data_dir` of the dataset configuration.
            dataset_files: Path(s) to source data file(s).
            dataset_column_name: The column name for the prompts.
            n_processes_preprocessing: The number of processes to use for preprocessing.
            preprocess_batch_size: The batch size to use just for preprocessing the dataset (e.g.
                tokenizing prompts).
            pre_download: Whether to pre-download the whole dataset.

        Raises:
            TypeError: If the loaded dataset is not a Hugging Face `Dataset` or `IterableDataset`.
        """
        self.context_size = context_size
        self._dataset_column_name = dataset_column_name

        # Load the dataset
        should_stream = not pre_download
        dataset = load_dataset(
            dataset_path,
            streaming=should_stream,
            split=dataset_split,
            data_dir=dataset_dir,
            data_files=dataset_files,
            verification_mode=VerificationMode.NO_CHECKS,  # As it fails when data_files is set
        )

        # Setup preprocessing (we remove all columns except for input ids)
        remove_columns: list[str] = list(next(iter(dataset)).keys())
        if "input_ids" in remove_columns:
            remove_columns.remove("input_ids")

        if pre_download:
            if not isinstance(dataset, Dataset):
                error_message = (
                    f"Expected Hugging Face dataset to be a Dataset when pre-downloading, but got "
                    f"{type(dataset)}."
                )
                raise TypeError(error_message)

            # Download the whole dataset
            mapped_dataset = dataset.map(
                self.preprocess,
                batched=True,
                batch_size=preprocess_batch_size,
                fn_kwargs={"context_size": context_size},
                remove_columns=remove_columns,
                num_proc=n_processes_preprocessing,
            )
            self.dataset = mapped_dataset.shuffle()
        else:
            # Setup approximate shuffling. As the dataset is streamed, this just pre-downloads at
            # least `buffer_size` items and then shuffles just that buffer.
            # https://huggingface.co/docs/datasets/v2.14.5/stream#shuffle
            if not isinstance(dataset, IterableDataset):
                error_message = (
                    f"Expected Hugging Face dataset to be an IterableDataset when streaming, but "
                    f"got {type(dataset)}."
                )
                raise TypeError(error_message)

            mapped_dataset = dataset.map(
                self.preprocess,
                batched=True,
                batch_size=preprocess_batch_size,
                fn_kwargs={"context_size": context_size},
                remove_columns=remove_columns,
            )
            self.dataset = mapped_dataset.shuffle(buffer_size=buffer_size)  # type: ignore

    @final
    def __iter__(self) -> Any:  # noqa: ANN401
        """Iterate Dunder Method.

        Enables direct access to :attr:`dataset` with e.g. `for` loops.
        """
        return self.dataset.__iter__()

    @final
    def get_dataloader(
        self, batch_size: int, num_workers: NonNegativeInt = 0
    ) -> DataLoader[TorchTokenizedPrompts]:
        """Get a PyTorch DataLoader.

        Args:
            batch_size: The batch size to use.
            num_workers: Number of CPU workers.

        Returns:
            PyTorch DataLoader.
        """
        torch_dataset: TorchDataset[TorchTokenizedPrompts] = self.dataset.with_format("torch")  # type: ignore

        return DataLoader[TorchTokenizedPrompts](
            torch_dataset,
            batch_size=batch_size,
            # Shuffle is most efficiently done with the `shuffle` method on the dataset itself, not
            # here.
            shuffle=False,
            num_workers=num_workers,
        )

context_size: int = context_size instance-attribute ¤

Number of tokens in the context window.

The paper Towards Monosemanticity: Decomposing Language Models With Dictionary Learning used a context size of 250.

dataset: Dataset | IterableDataset instance-attribute ¤

Underlying HuggingFace Dataset.

Warning

Hugging Face Dataset objects are confusingly not the same as PyTorch Dataset objects.

__init__(dataset_path, dataset_split, context_size, buffer_size=1000, dataset_dir=None, dataset_files=None, dataset_column_name='input_ids', n_processes_preprocessing=None, preprocess_batch_size=1000, *, pre_download=False) abstractmethod ¤

Initialise the dataset.

Loads the dataset with streaming from Hugging Face, then adds preprocessing and shuffling to the underlying Hugging Face IterableDataset.

Parameters:

dataset_path (str, required): The path to the dataset on Hugging Face.
dataset_split (str, required): Dataset split (e.g. train).
context_size (PositiveInt, required): The context size to use when returning a list of tokenized prompts. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning used a context size of 250.
buffer_size (PositiveInt, default 1000): The buffer size to use when shuffling the dataset when streaming. When streaming a dataset, this just pre-downloads at least buffer_size items and then shuffles just that buffer. Note that the generated activations should also be shuffled before training the sparse autoencoder, so a large buffer may not be strictly necessary here. Note also that this is the number of items in the dataset (e.g. number of prompts) and is typically significantly less than the number of tokenized prompts once the preprocessing function has been applied.
dataset_dir (str | None, default None): Defining the data_dir of the dataset configuration.
dataset_files (str | Sequence[str] | Mapping[str, str | Sequence[str]] | None, default None): Path(s) to source data file(s).
dataset_column_name (str, default 'input_ids'): The column name for the prompts.
n_processes_preprocessing (PositiveInt | None, default None): The number of processes to use for preprocessing.
preprocess_batch_size (PositiveInt, default 1000): The batch size to use just for preprocessing the dataset (e.g. tokenizing prompts).
pre_download (bool, default False): Whether to pre-download the whole dataset.

Raises:

TypeError: If the loaded dataset is not a Hugging Face Dataset or IterableDataset.
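
A minimal usage sketch, assuming the hypothetical MyTextDataset subclass from the sketch above and an arbitrary choice of tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Default behaviour: stream the dataset, pre-downloading only `buffer_size`
# items for approximate shuffling.
source_data = MyTextDataset(tokenizer=tokenizer, context_size=250)

# Alternatively, download and preprocess the whole dataset up front.
# source_data = MyTextDataset(tokenizer=tokenizer, context_size=250, pre_download=True)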

Source code in sparse_autoencoder/source_data/abstract_dataset.py
@abstractmethod
@validate_call
def __init__(
    self,
    dataset_path: str,
    dataset_split: str,
    context_size: PositiveInt,
    buffer_size: PositiveInt = 1000,
    dataset_dir: str | None = None,
    dataset_files: str | Sequence[str] | Mapping[str, str | Sequence[str]] | None = None,
    dataset_column_name: str = "input_ids",
    n_processes_preprocessing: PositiveInt | None = None,
    preprocess_batch_size: PositiveInt = 1000,
    *,
    pre_download: bool = False,
):
    """Initialise the dataset.

    Loads the dataset with streaming from Hugging Face, then adds preprocessing and shuffling
    to the underlying Hugging Face `IterableDataset`.

    Args:
        dataset_path: The path to the dataset on Hugging Face.
        dataset_split: Dataset split (e.g. `train`).
        context_size: The context size to use when returning a list of tokenized prompts.
            *Towards Monosemanticity: Decomposing Language Models With Dictionary Learning* used
            a context size of 250.
        buffer_size: The buffer size to use when shuffling the dataset when streaming. When
            streaming a dataset, this just pre-downloads at least `buffer_size` items and then
            shuffles just that buffer. Note that the generated activations should also be
            shuffled before training the sparse autoencoder, so a large buffer may not be
            strictly necessary here. Note also that this is the number of items in the dataset
            (e.g. number of prompts) and is typically significantly less than the number of
            tokenized prompts once the preprocessing function has been applied.
        dataset_dir: Defining the `data_dir` of the dataset configuration.
        dataset_files: Path(s) to source data file(s).
        dataset_column_name: The column name for the prompts.
        n_processes_preprocessing: The number of processes to use for preprocessing.
        preprocess_batch_size: The batch size to use just for preprocessing the dataset (e.g.
            tokenizing prompts).
        pre_download: Whether to pre-download the whole dataset.

    Raises:
        TypeError: If the loaded dataset is not a Hugging Face `Dataset` or `IterableDataset`.
    """
    self.context_size = context_size
    self._dataset_column_name = dataset_column_name

    # Load the dataset
    should_stream = not pre_download
    dataset = load_dataset(
        dataset_path,
        streaming=should_stream,
        split=dataset_split,
        data_dir=dataset_dir,
        data_files=dataset_files,
        verification_mode=VerificationMode.NO_CHECKS,  # As it fails when data_files is set
    )

    # Setup preprocessing (we remove all columns except for input ids)
    remove_columns: list[str] = list(next(iter(dataset)).keys())
    if "input_ids" in remove_columns:
        remove_columns.remove("input_ids")

    if pre_download:
        if not isinstance(dataset, Dataset):
            error_message = (
                f"Expected Hugging Face dataset to be a Dataset when pre-downloading, but got "
                f"{type(dataset)}."
            )
            raise TypeError(error_message)

        # Download the whole dataset
        mapped_dataset = dataset.map(
            self.preprocess,
            batched=True,
            batch_size=preprocess_batch_size,
            fn_kwargs={"context_size": context_size},
            remove_columns=remove_columns,
            num_proc=n_processes_preprocessing,
        )
        self.dataset = mapped_dataset.shuffle()
    else:
        # Setup approximate shuffling. As the dataset is streamed, this just pre-downloads at
        # least `buffer_size` items and then shuffles just that buffer.
        # https://huggingface.co/docs/datasets/v2.14.5/stream#shuffle
        if not isinstance(dataset, IterableDataset):
            error_message = (
                f"Expected Hugging Face dataset to be an IterableDataset when streaming, but "
                f"got {type(dataset)}."
            )
            raise TypeError(error_message)

        mapped_dataset = dataset.map(
            self.preprocess,
            batched=True,
            batch_size=preprocess_batch_size,
            fn_kwargs={"context_size": context_size},
            remove_columns=remove_columns,
        )
        self.dataset = mapped_dataset.shuffle(buffer_size=buffer_size)  # type: ignore

__iter__() ¤

Iterate Dunder Method.

Enables direct access to the dataset attribute with e.g. for loops.
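
For example, reusing the hypothetical source_data instance from the usage sketch above:

# Each item is a dict with an `input_ids` list of `context_size` token ids.
for item in source_data:
    tokenized_prompt = item["input_ids"]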

Source code in sparse_autoencoder/source_data/abstract_dataset.py
@final
def __iter__(self) -> Any:  # noqa: ANN401
    """Iterate Dunder Method.

    Enables direct access to :attr:`dataset` with e.g. `for` loops.
    """
    return self.dataset.__iter__()

get_dataloader(batch_size, num_workers=0) ¤

Get a PyTorch DataLoader.

Parameters:

batch_size (int, required): The batch size to use.
num_workers (NonNegativeInt, default 0): Number of CPU workers.

Returns:

DataLoader[TorchTokenizedPrompts]: PyTorch DataLoader.
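
A rough usage sketch (the batch size and worker count are arbitrary, and source_data is the hypothetical instance from the earlier sketch):

dataloader = source_data.get_dataloader(batch_size=16, num_workers=4)

for batch in dataloader:
    # `input_ids` is an integer tensor of shape [batch, position].
    input_ids = batch["input_ids"]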

Source code in sparse_autoencoder/source_data/abstract_dataset.py
@final
def get_dataloader(
    self, batch_size: int, num_workers: NonNegativeInt = 0
) -> DataLoader[TorchTokenizedPrompts]:
    """Get a PyTorch DataLoader.

    Args:
        batch_size: The batch size to use.
        num_workers: Number of CPU workers.

    Returns:
        PyTorch DataLoader.
    """
    torch_dataset: TorchDataset[TorchTokenizedPrompts] = self.dataset.with_format("torch")  # type: ignore

    return DataLoader[TorchTokenizedPrompts](
        torch_dataset,
        batch_size=batch_size,
        # Shuffle is most efficiently done with the `shuffle` method on the dataset itself, not
        # here.
        shuffle=False,
        num_workers=num_workers,
    )

preprocess(source_batch, *, context_size) abstractmethod ¤

Preprocess function.

Takes a preprocess_batch_size (m) batch of source data (which may e.g. include string prompts), and returns a dict with a single key of input_ids and a value of an arbitrary length list (n) of tokenized prompts. Note that m does not have to be equal to n.

Applied to the dataset with the Hugging Face Dataset map function.

Warning

The returned tokenized prompts should not have any padding tokens (apart from an optional single first padding token).

Parameters:

source_batch (HuggingFaceDatasetItem, required): A batch of source data. For example, with The Pile dataset this would be a dict including the key "text" with a value of a list of strings (not yet tokenized).
context_size (int, required): The context size to use when returning a list of tokenized prompts. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning used a context size of 250.

Returns:

TokenizedPrompts: Tokenized prompts.
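
As a hypothetical illustration of why m need not equal n (assuming the non-overlapping chunking strategy used in the subclass sketch above):

# Suppose a preprocess batch contains m = 2 raw texts that tokenize to 600 and
# 900 tokens respectively. With context_size = 250 and non-overlapping chunks,
# the first text yields 2 prompts and the second yields 3, so preprocess
# returns n = 5 tokenized prompts of exactly 250 tokens each:
# {"input_ids": [prompt_1, prompt_2, prompt_3, prompt_4, prompt_5]}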

Source code in sparse_autoencoder/source_data/abstract_dataset.py
@abstractmethod
def preprocess(
    self,
    source_batch: HuggingFaceDatasetItem,
    *,
    context_size: int,
) -> TokenizedPrompts:
    """Preprocess function.

    Takes a `preprocess_batch_size` ($m$) batch of source data (which may e.g. include string
    prompts), and returns a dict with a single key of `input_ids` and a value of an arbitrary
    length list ($n$) of tokenized prompts. Note that $m$ does not have to be equal to $n$.

    Applied to the dataset with the [Hugging Face
    Dataset](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Dataset.map)
    `map` function.

    Warning:
        The returned tokenized prompts should not have any padding tokens (apart from an
        optional single first padding token).

    Args:
        source_batch: A batch of source data. For example, with The Pile dataset this would be a
            dict including the key "text" with a value of a list of strings (not yet tokenized).
        context_size: The context size to use when returning a list of tokenized prompts.
            *Towards Monosemanticity: Decomposing Language Models With Dictionary Learning* used
            a context size of 250.

    Returns:
        Tokenized prompts.
    """

TokenizedPrompts ¤

Bases: TypedDict

Tokenized prompts.

Source code in sparse_autoencoder/source_data/abstract_dataset.py
class TokenizedPrompts(TypedDict):
    """Tokenized prompts."""

    input_ids: list[TokenizedPrompt]

TorchTokenizedPrompts ¤

Bases: TypedDict

Tokenized prompts prepared for PyTorch.

Source code in sparse_autoencoder/source_data/abstract_dataset.py
class TorchTokenizedPrompts(TypedDict):
    """Tokenized prompts prepared for PyTorch."""

    input_ids: Int[Tensor, Axis.names(Axis.SOURCE_DATA_BATCH, Axis.POSITION)]