Pre-Tokenized Dataset from Hugging Face
PreTokenizedDataset should work with any of the following tokenized datasets:

- NeelNanda/pile-small-tokenized-2b
- NeelNanda/pile-tokenized-10b
- NeelNanda/openwebtext-tokenized-9b
- NeelNanda/c4-tokenized-2b
- NeelNanda/code-tokenized
- NeelNanda/c4-code-tokenized-2b
- NeelNanda/pile-old-tokenized-2b
- alancooney/sae-monology-pile-uncopyrighted-tokenizer-gpt2
PreTokenizedDataset
Bases: SourceDataset[dict]
General Pre-Tokenized Dataset from Hugging Face.
Can be used for various datasets available on Hugging Face.
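For reference, a minimal construction sketch is shown below. The top-level import and direct iteration are assumptions about the package layout; if they differ in your installed version, import from `sparse_autoencoder.source_data.pretokenized_dataset` and iterate the wrapped dataset instead.

```python
# Minimal sketch (assumed import path and iteration behaviour): stream a
# pre-tokenized dataset and inspect one preprocessed item.
from sparse_autoencoder import PreTokenizedDataset

source_data = PreTokenizedDataset(
    dataset_path="alancooney/sae-monology-pile-uncopyrighted-tokenizer-gpt2",
    context_size=256,
)

# Each preprocessed item holds `context_size` token ids under the configured column.
first_item = next(iter(source_data))
print(len(first_item["input_ids"]))  # 256
```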
Source code in sparse_autoencoder/source_data/pretokenized_dataset.py
`__init__(dataset_path, context_size=256, buffer_size=1000, dataset_dir=None, dataset_files=None, dataset_split='train', dataset_column_name='input_ids', preprocess_batch_size=1000, *, pre_download=False)`
Initialize a pre-tokenized dataset from Hugging Face.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`dataset_path` | `str` | The path to the dataset on Hugging Face (e.g. `alancooney/sae-monology-pile-uncopyrighted-tokenizer-gpt2`). | required |
`context_size` | `PositiveInt` | The context size for tokenized prompts. | `256` |
`buffer_size` | `PositiveInt` | The buffer size to use when shuffling the dataset while streaming. When streaming, at least this many items are pre-downloaded and shuffling happens within that buffer. | `1000` |
`dataset_dir` | `str \| None` | Defining the `data_dir` of the dataset configuration. | `None` |
`dataset_files` | `str \| Sequence[str] \| Mapping[str, str \| Sequence[str]] \| None` | Path(s) to source data file(s). | `None` |
`dataset_split` | `str` | Dataset split (e.g. `train`). | `'train'` |
`dataset_column_name` | `str` | The column name for the tokenized prompts. | `'input_ids'` |
`preprocess_batch_size` | `PositiveInt` | The batch size to use just for preprocessing the dataset (e.g. tokenizing prompts). | `1000` |
`pre_download` | `bool` | Whether to pre-download the whole dataset. | `False` |
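As a sketch, the call below overrides the streaming-related defaults. Parameter names follow the table above; the dataset path and the `tokens` column name are illustrative assumptions, so check the dataset card for the actual column that holds the token ids.

```python
# Sketch: constructor call with non-default arguments. The dataset path and
# `dataset_column_name="tokens"` are assumptions for illustration only.
from sparse_autoencoder.source_data.pretokenized_dataset import PreTokenizedDataset

dataset = PreTokenizedDataset(
    dataset_path="NeelNanda/c4-code-tokenized-2b",
    context_size=128,              # shorter prompts than the default of 256
    buffer_size=2_000,             # larger shuffle buffer while streaming
    dataset_split="train",
    dataset_column_name="tokens",  # column holding the pre-tokenized prompts
    preprocess_batch_size=1_000,
    pre_download=False,            # keep streaming rather than downloading everything
)
```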
Source code in sparse_autoencoder/source_data/pretokenized_dataset.py
`preprocess(source_batch, *, context_size)`
Preprocess a batch of prompts.
The method splits each pre-tokenized item based on the context size.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`source_batch` | `dict` | A batch of source data. | required |
`context_size` | `int` | The context size to use for tokenized prompts. | required |

Returns:

Type | Description |
---|---|
`TokenizedPrompts` | Tokenized prompts. |

Raises:

Type | Description |
---|---|
`ValueError` | If the context size is larger than the tokenized prompt size. |
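The splitting behaviour can be illustrated with a small, self-contained sketch; it mirrors the documented behaviour (split into full `context_size` chunks, error if a prompt is shorter than the context size) rather than reproducing the library's implementation.

```python
# Conceptual sketch of `preprocess`: split each pre-tokenized prompt into
# non-overlapping chunks of `context_size` tokens, dropping any partial remainder
# (the remainder handling is an assumption for illustration).
source_batch = {"input_ids": [list(range(10)), list(range(7))]}
context_size = 4

tokenized_prompts: list[list[int]] = []
for tokens in source_batch["input_ids"]:
    if len(tokens) < context_size:
        raise ValueError("Context size is larger than the tokenized prompt size.")
    tokenized_prompts.extend(
        tokens[i : i + context_size]
        for i in range(0, len(tokens) - context_size + 1, context_size)
    )

print(tokenized_prompts)
# [[0, 1, 2, 3], [4, 5, 6, 7], [0, 1, 2, 3]]
```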
Source code in sparse_autoencoder/source_data/pretokenized_dataset.py