Mock dataset
For use with tests and simple examples.
ConsecutiveIntHuggingFaceDataset
Bases: IterableDataset
Consecutive integers Hugging Face dataset for testing.
Creates a dataset where the first item is [0,1,2...], and the second item is [1,2,3...] and so on.
Source code in sparse_autoencoder/source_data/mock_dataset.py
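The consecutive-integer pattern described above can be sketched in plain Python. This is a minimal illustration only: the hypothetical `consecutive_rows` helper is not part of the library, whose class yields tokenized-prompt dictionaries backed by tensors rather than bare lists.

```python
def consecutive_rows(n_items: int, context_size: int) -> list[list[int]]:
    """Build rows where row i is [i, i+1, ..., i + context_size - 1]."""
    return [list(range(i, i + context_size)) for i in range(n_items)]

rows = consecutive_rows(3, 4)
# rows[0] == [0, 1, 2, 3], rows[1] == [1, 2, 3, 4], rows[2] == [2, 3, 4, 5]
```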
__getitem__(index)
Get an item by index.
Source code in sparse_autoencoder/source_data/mock_dataset.py
__init__(context_size, vocab_size=50000, n_items=10000)
Initialize the mock HF dataset.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
context_size | int | The number of tokens in the context window. | required |
vocab_size | int | The size of the vocabulary to use. | 50000 |
n_items | int | The number of items in the dataset. | 10000 |
Raises:

Type | Description |
---|---|
ValueError | If more items are requested than can be created with the vocab size (given that each item is a unique list of consecutive integers). |
Source code in sparse_autoencoder/source_data/mock_dataset.py
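The ValueError above follows from uniqueness: item i is the run [i, ..., i + context_size - 1], so the largest token id needed is (n_items - 1) + (context_size - 1), and it must fit inside the vocabulary. The hypothetical `check_capacity` sketch below illustrates this bound; the library's exact condition may differ.

```python
def check_capacity(n_items: int, context_size: int, vocab_size: int = 50_000) -> None:
    """Raise if n_items unique consecutive runs cannot fit in the vocabulary.

    Hypothetical sketch; the library's exact check may differ.
    """
    # The last item spans [n_items - 1, ..., n_items - 1 + context_size - 1].
    largest_token_id = (n_items - 1) + (context_size - 1)
    if largest_token_id >= vocab_size:
        raise ValueError(
            "More items requested than can be created with this vocab size."
        )
```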
__iter__()
Initialize the iterator.
Returns:

Type | Description |
---|---|
Iterator | Iterator over the dataset. |
Source code in sparse_autoencoder/source_data/mock_dataset.py
__len__()
Return the number of items in the dataset.
Source code in sparse_autoencoder/source_data/mock_dataset.py
__next__()
Return the next item in the dataset.
Returns:

Name | Type | Description |
---|---|---|
TokenizedPrompts | TokenizedPrompts \| TorchTokenizedPrompts | The next item in the dataset. |

Raises:

Type | Description |
---|---|
StopIteration | If the end of the dataset is reached. |
Source code in sparse_autoencoder/source_data/mock_dataset.py
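Together, `__iter__`, `__next__`, and `__len__` implement Python's standard iterator protocol. The self-contained stand-in below is illustrative only: the real class subclasses IterableDataset and yields TokenizedPrompts, while this sketch yields plain dictionaries.

```python
class TinyConsecutiveDataset:
    """Minimal sketch of the iterator protocol used above (not the library class)."""

    def __init__(self, context_size: int, n_items: int) -> None:
        self.context_size = context_size
        self.n_items = n_items
        self._index = 0

    def __iter__(self):
        self._index = 0  # Reset so the dataset can be iterated more than once
        return self

    def __next__(self) -> dict:
        if self._index >= self.n_items:
            raise StopIteration  # End of the dataset reached
        item = {"input_ids": list(range(self._index, self._index + self.context_size))}
        self._index += 1
        return item

    def __len__(self) -> int:
        return self.n_items

items = list(TinyConsecutiveDataset(context_size=4, n_items=3))
# 3 items; items[1]["input_ids"] == [1, 2, 3, 4]
```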
create_data(n_items, context_size)
Create the data.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
n_items | int | The number of items in the dataset. | required |
context_size | int | The number of tokens in the context window. | required |

Returns:

Type | Description |
---|---|
Int[Tensor, 'items context_size'] | The generated data. |
Source code in sparse_autoencoder/source_data/mock_dataset.py
with_format(type)
Return the dataset with the specified format (mirroring the Hugging Face with_format API).
Source code in sparse_autoencoder/source_data/mock_dataset.py
MockDataset
Bases: SourceDataset[TokenizedPrompts]
Mock dataset for testing.
For use with tests and simple examples.
Source code in sparse_autoencoder/source_data/mock_dataset.py
__init__(context_size=250, buffer_size=1000, preprocess_batch_size=1000, dataset_path='dummy', dataset_split='train')
Initialize the mock dataset.
Example:

>>> data = MockDataset()
>>> first_item = next(iter(data))
>>> len(first_item["input_ids"])
250
Parameters:

Name | Type | Description | Default |
---|---|---|---|
context_size | PositiveInt | The context size to use when returning a list of tokenized prompts. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning used a context size of 250. | 250 |
buffer_size | PositiveInt | The buffer size to use when shuffling the dataset. As the dataset is streamed, this just pre-downloads at least buffer_size items and then shuffles that buffer. | 1000 |
preprocess_batch_size | PositiveInt | The batch size to use just for preprocessing the dataset (e.g. tokenizing prompts). | 1000 |
dataset_path | str | The path to the dataset on Hugging Face. | 'dummy' |
dataset_split | str | Dataset split (e.g. 'train'). | 'train' |
Source code in sparse_autoencoder/source_data/mock_dataset.py
preprocess(source_batch, *, context_size)
Preprocess a batch of prompts.
Source code in sparse_autoencoder/source_data/mock_dataset.py