PyTorch usually builds data pipelines with two utility classes: Dataset and DataLoader.
Dataset defines the contents of a dataset. It behaves like a list: it has a definite length, and its elements can be retrieved by index.
DataLoader defines how a dataset is loaded in batches. It is an iterable object that implements the __iter__ method and yields one batch of data per iteration.
DataLoader controls the batch size, the way elements are sampled into a batch, and the way samples are collated into the input form required by the model, and it can read data with multiple worker processes.
In most cases, users only need to implement the __len__ and __getitem__ methods of Dataset to build their own dataset and load it with the default data pipeline.
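As a minimal illustration of this point (a sketch with made-up data, not code from this book), a custom Dataset only needs __len__ and __getitem__, and DataLoader takes care of batching:

import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    """A toy dataset: feature i maps to label i*i."""
    def __init__(self, n=10):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, index):
        return (torch.tensor([index], dtype=torch.float),
                torch.tensor([index * index], dtype=torch.float))

ds = SquaresDataset()
dl = DataLoader(ds, batch_size=4, shuffle=True)
for features, labels in dl:
    print(features.shape, labels.shape)  # torch.Size([4, 1]) torch.Size([4, 1])
    break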
1, Steps to obtain a batch of data
Let us consider the steps required to obtain a batch of data from a dataset.
(Assume the features and labels of the dataset are expressed as tensors X and Y, so the dataset can be written as (X,Y), and assume the batch size is m.)
- First, determine the length n of the dataset.
The result is something like: n = 1000.
- Then, sample m numbers (the batch size) from the range 0 to n-1.
Assuming m=4, the result is a list, something like: indices = [1,4,8,9]
- Then, fetch the elements at those m indices from the dataset.
The result is a list of tuples, something like: samples = [(X[1],Y[1]),(X[4],Y[4]),(X[8],Y[8]),(X[9],Y[9])]
- Finally, collate the result into two tensors as the output, as sketched in the code below.
The result is a pair of tensors batch = (features,labels),
where features = torch.stack([X[1],X[4],X[8],X[9]]) and labels = torch.stack([Y[1],Y[4],Y[8],Y[9]]).
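The four steps can also be carried out by hand. The following sketch (with made-up tensors X and Y, not code from this book) mirrors what DataLoader does internally:

import torch

# Made-up dataset: 1000 samples, 8 features each, binary labels
n = 1000
X = torch.randn(n, 8)
Y = torch.randint(0, 2, (n,))

m = 4                                            # batch size
indices = torch.randperm(n)[:m].tolist()         # step 2: sample m indices from 0..n-1
samples = [(X[i], Y[i]) for i in indices]        # step 3: fetch elements by index
features = torch.stack([x for x, _ in samples])  # step 4: collate into two tensors
labels = torch.stack([y for _, y in samples])
print(features.shape, labels.shape)              # torch.Size([4, 8]) torch.Size([4])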
2, Division of labor between Dataset and DataLoader
The first step above, determining the length of the dataset, is implemented by the __len__ method of Dataset.
In the second step, the way m numbers are sampled from the range 0 to n-1 is specified by the sampler and batch_sampler parameters of DataLoader.
The sampler parameter specifies how single elements are sampled and generally does not need to be set by the user. By default, random sampling is used when the DataLoader parameter shuffle=True, and sequential sampling is used when shuffle=False.
The batch_sampler parameter collects the sampled indices into batch-sized lists and generally does not need to be set by the user either. With the default method, when the dataset length is not divisible by the batch size, the last incomplete batch is dropped if the DataLoader parameter drop_last=True and kept if drop_last=False.
The core logic of the third step, fetching elements of the dataset by index, is implemented by the __getitem__ method of Dataset.
The logic of the fourth step is specified by the collate_fn parameter of DataLoader, which normally does not need to be set by the user.
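To make this mapping concrete, the sketch below (an illustration with made-up data, not the DataLoader source code) wires the same four steps together with torch.utils.data.RandomSampler, BatchSampler and a hand-written collate function:

import torch
from torch.utils.data import TensorDataset, RandomSampler, BatchSampler

X, Y = torch.randn(10, 3), torch.arange(10)      # made-up data
ds = TensorDataset(X, Y)                         # provides __len__ and __getitem__

sampler = RandomSampler(ds)                      # step 2: sample single indices
batch_sampler = BatchSampler(sampler, batch_size=4, drop_last=False)  # group indices into batches

def simple_collate(samples):                     # step 4: collate a list of (x, y) tuples
    features = torch.stack([x for x, _ in samples])
    labels = torch.stack([y for _, y in samples])
    return features, labels

for indices in batch_sampler:
    batch = simple_collate([ds[i] for i in indices])  # step 3: index into the dataset
    print(batch[0].shape, batch[1].shape)
    break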
3, The main interfaces of Dataset and DataLoader
The following is pseudocode for the core interface logic of Dataset and DataLoader; it is not fully consistent with the actual source code.
import torch

class Dataset(object):
    def __init__(self):
        pass

    def __len__(self):
        raise NotImplementedError

    def __getitem__(self, index):
        raise NotImplementedError


class DataLoader(object):
    def __init__(self, dataset, batch_size, collate_fn, shuffle=True, drop_last=False):
        self.dataset = dataset
        self.collate_fn = collate_fn
        self.sampler = torch.utils.data.RandomSampler if shuffle else \
            torch.utils.data.SequentialSampler
        self.batch_sampler = torch.utils.data.BatchSampler
        self.sample_iter = iter(self.batch_sampler(
            self.sampler(range(len(dataset))),
            batch_size=batch_size, drop_last=drop_last))

    def __iter__(self):
        return self

    def __next__(self):
        indices = next(self.sample_iter)
        batch = self.collate_fn([self.dataset[i] for i in indices])
        return batch
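A quick usage sketch of the simplified classes above (illustrative only; the real torch.utils.data.DataLoader is used in the rest of this chapter):

class ToyDataset(Dataset):
    def __len__(self):
        return 10
    def __getitem__(self, index):
        return torch.tensor(float(index)), torch.tensor(float(index) * 2)

def collate(samples):
    # stack the features and the labels of a batch separately
    return (torch.stack([x for x, _ in samples]),
            torch.stack([y for _, y in samples]))

dl = DataLoader(ToyDataset(), batch_size=4, collate_fn=collate, drop_last=True)
for features, labels in dl:
    print(features, labels)  # two batches of 4; the last 2 samples are dropped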
Commonly used ways to create datasets with Dataset are:
- Use torch.utils.data.TensorDataset to create a dataset from Tensors (numpy arrays and Pandas DataFrames need to be converted to Tensors first).
- Use torchvision.datasets.ImageFolder to create an image dataset from an image directory.
- Inherit torch.utils.data.Dataset to create a custom dataset.
In addition, you can also:
- Use torch.utils.data.random_split to split a dataset into multiple parts, which is often used to split training, validation and test sets.
- Use the addition operator (+) of Dataset to merge multiple datasets into one.
1, Create a dataset from Tensors
import numpy as np
import torch
from torch.utils.data import TensorDataset,Dataset,DataLoader,random_split
# Create a data set based on Tensor
from sklearn import datasets
iris = datasets.load_iris()
ds_iris = TensorDataset(torch.tensor(iris.data),torch.tensor(iris.target))
# Split into training set and validation set
n_train = int(len(ds_iris)*0.8)
n_valid = len(ds_iris)-n_train
ds_train,ds_valid = random_split(ds_iris,[n_train,n_valid])
print(type(ds_iris))
print(type(ds_train))
# Use DataLoader to load data sets
dl_train,dl_valid = DataLoader(ds_train,batch_size = 8),DataLoader(ds_valid,batch_size = 8)
for features,labels in dl_train:
    print(features,labels)
    break
# Demonstrate the combined effect of the addition operator (`+`)
ds_data = ds_train + ds_valid
print('len(ds_train) =',len(ds_train))
print('len(ds_valid) =',len(ds_valid))
print('len(ds_train+ds_valid) =',len(ds_data))
print(type(ds_data))
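As a small follow-up check (using only the objects created above), the pieces returned by random_split and by the + operator are themselves dataset-like, so they support indexing and len() just like the original dataset:

# Subsets and concatenated datasets behave like lists as well
features, label = ds_train[0]
print(features.shape, label)
print(len(ds_data), len(ds_train) + len(ds_valid))  # both should be 150 for the iris data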
2, Create an image dataset from an image directory
import numpy as np
import torch
from torch.utils.data import DataLoader
from torchvision import transforms,datasets
#Demonstrate some common image enhancement operations
from PIL import Image
img = Image.open('./data/cat.jpeg')
img
# Random vertical flip
transforms.RandomVerticalFlip()(img)
#Random rotation
transforms.RandomRotation(45)(img)
# Define image enhancement operations
transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(),  # random horizontal flip
    transforms.RandomVerticalFlip(),    # random vertical flip
    transforms.RandomRotation(45),      # random rotation within 45 degrees
    transforms.ToTensor()               # convert to tensor
])

transform_valid = transforms.Compose([
    transforms.ToTensor()
])
# Create a data set based on the picture directory
ds_train = datasets.ImageFolder("./data/cifar2/train/",
            transform = transform_train,target_transform = lambda t:torch.tensor([t]).float())
ds_valid = datasets.ImageFolder("./data/cifar2/test/",
            transform = transform_valid,target_transform = lambda t:torch.tensor([t]).float())
print(ds_train.class_to_idx)
{'0_airplane': 0, '1_automobile': 1}
# Use DataLoader to load data sets
dl_train = DataLoader(ds_train, batch_size = 50, shuffle = True, num_workers=3)
dl_valid = DataLoader(ds_valid, batch_size = 50, shuffle = True, num_workers=3)
for features,labels in dl_train:
    print(features.shape)
    print(labels.shape)
    break
torch.Size([50, 3, 32, 32])
torch.Size([50, 1])
3, Create a custom dataset
Next, we create a custom dataset for the IMDb text classification task by inheriting the Dataset class.
The general idea is as follows: first, tokenize the training-set text and build a vocabulary from it, then convert the training and test texts into sequences of token ids.
The tokenized training and test data are then split into multiple files, one file per sample.
Finally, we can build the Dataset by reading the sample content corresponding to each index from the list of file names.
import numpy as np
import pandas as pd
from collections import OrderedDict
import re,string
MAX_WORDS = 10000 # Only consider the most frequent 10000 words
MAX_LEN = 200 # Each sample is padded or truncated to 200 words
BATCH_SIZE = 20
train_data_path ='data/imdb/train.tsv'
test_data_path ='data/imdb/test.tsv'
train_token_path ='data/imdb/train_token.tsv'
test_token_path ='data/imdb/test_token.tsv'
train_samples_path ='data/imdb/train_samples/'
test_samples_path ='data/imdb/test_samples/'
First, we build a vocabulary that keeps only the MAX_WORDS most frequent words (two ids are reserved for the unknown and padding tokens).
## Build the vocabulary
word_count_dict = {}

# Clean the text
def clean_text(text):
    lowercase = text.lower().replace("\n"," ")
    stripped_html = re.sub('<br />', '', lowercase)
    cleaned_punctuation = re.sub('[%s]'%re.escape(string.punctuation), '', stripped_html)
    return cleaned_punctuation

with open(train_data_path,"r",encoding ='utf-8') as f:
    for line in f:
        label,text = line.split("\t")
        cleaned_text = clean_text(text)
        for word in cleaned_text.split(" "):
            word_count_dict[word] = word_count_dict.get(word,0)+1

df_word_dict = pd.DataFrame(pd.Series(word_count_dict,name = "count"))
df_word_dict = df_word_dict.sort_values(by = "count",ascending=False)
df_word_dict = df_word_dict[0:MAX_WORDS-2]  # keep the MAX_WORDS-2 most frequent words
df_word_dict["word_id"] = range(2,MAX_WORDS)  # ids 0 and 1 are reserved for the unknown word <unknown> and padding <padding> respectively
word_id_dict = df_word_dict["word_id"].to_dict()
df_word_dict.head(10)
Then we use the vocabulary to convert the text into sequences of token ids.
# Convert text to token ids
# Pad or truncate a token list to a fixed length
def pad(data_list,pad_length):
    padded_list = data_list.copy()
    if len(data_list) > pad_length:
        padded_list = data_list[-pad_length:]
    if len(data_list) < pad_length:
        padded_list = [1]*(pad_length-len(data_list))+data_list
    return padded_list

def text_to_token(text_file,token_file):
    with open(text_file,"r",encoding ='utf-8') as fin,\
        open(token_file,"w",encoding ='utf-8') as fout:
        for line in fin:
            label,text = line.split("\t")
            cleaned_text = clean_text(text)
            word_token_list = [word_id_dict.get(word, 0) for word in cleaned_text.split(" ")]
            pad_list = pad(word_token_list,MAX_LEN)
            out_line = label+"\t"+" ".join([str(x) for x in pad_list])
            fout.write(out_line+"\n")
text_to_token(train_data_path,train_token_path)
text_to_token(test_data_path,test_token_path)
Next, the tokenized text is split into individual sample files, each file storing one sample.
# Split sample
import os
if not os.path.exists(train_samples_path):
    os.mkdir(train_samples_path)
if not os.path.exists(test_samples_path):
    os.mkdir(test_samples_path)

def split_samples(token_path,samples_dir):
    with open(token_path,"r",encoding ='utf-8') as fin:
        i = 0
        for line in fin:
            with open(samples_dir+"%d.txt"%i,"w",encoding = "utf-8") as fout:
                fout.write(line)
            i = i+1
split_samples(train_token_path,train_samples_path)
split_samples(test_token_path,test_samples_path)
print(os.listdir(train_samples_path)[0:100])
['11303.txt', '3644.txt', '19987.txt', '18441.txt', '5235.txt', '17772.txt', '1053.txt', '13514.txt', ' 8711.txt', '15165.txt', '7422.txt', '8077.txt', '15603.txt', '7344.txt', '1735.txt', '13272.txt', '9369. txt', '18327.txt', '5553.txt', '17014.txt', '4895.txt', '11465.txt', '3122.txt', '19039.txt', '5547.txt' , '18333.txt', '17000.txt', '4881.txt', '2228.txt', '11471.txt', '3136.txt', '4659.txt', '15617.txt', ' 8063.txt', '7350.txt', '12178.txt', '1721.txt', '13266.txt', '14509.txt', '6728.txt', '1047.txt', '13500. txt', '15171.txt', '8705.txt', '7436.txt', '16478.txt', '11317.txt', '3650.txt', '19993.txt', '10009.txt' , '5221.txt', '18455.txt', '17766.txt', '3888.txt', '6700.txt', '14247.txt', '9433.txt', '13528.txt', ' 12636.txt', '15159.txt', '16450.txt', '4117.txt', '19763.txt', '3678.txt', '17996.txt', '2566.txt', '10021. txt', '5209.txt', '17028.txt', '2200.txt', '10747.txt', '11459.txt', '16336.txt', '4671.txt', '19005.txt' , '7378.txt', '12150.txt', '1709.txt', '6066.txt', '14521.txt ', '9355.txt', '12144.txt', '289.txt', '6072.txt', '9341.txt', '14535.txt', '2214.txt', '10753.txt', '16322.txt', '19011.txt', '4665.txt', '16444.txt', '19777.txt', '4103.txt', '17982.txt', '2572.txt', '10035 .txt', '18469.txt', '6714.txt', '9427.txt']
Everything is ready; we can now create the Dataset, which reads each sample's content from the corresponding file in the file name list.
import os

class imdbDataset(Dataset):
    def __init__(self,samples_dir):
        self.samples_dir = samples_dir
        self.samples_paths = os.listdir(samples_dir)

    def __len__(self):
        return len(self.samples_paths)

    def __getitem__(self,index):
        path = self.samples_dir + self.samples_paths[index]
        with open(path,"r",encoding = "utf-8") as f:
            line = f.readline()
            label,tokens = line.split("\t")
            label = torch.tensor([float(label)],dtype = torch.float)
            feature = torch.tensor([int(x) for x in tokens.split(" ")],dtype = torch.long)
            return (feature,label)
ds_train = imdbDataset(train_samples_path)
ds_test = imdbDataset(test_samples_path)
print(len(ds_train))
print(len(ds_test))
20000
5000
dl_train = DataLoader(ds_train,batch_size = BATCH_SIZE,shuffle = True,num_workers=4)
dl_test = DataLoader(ds_test,batch_size = BATCH_SIZE,num_workers=4)
for features,labels in dl_train:
    print(features)
    print(labels)
    break
tensor([[ 1, 1, 1, ..., 29, 8, 8],
[13, 11, 247, ..., 0, 0, 8],
[8587, 555, 12, ..., 3, 0, 8],
...,
[1, 1, 1, ..., 2, 0, 8],
[618, 62, 25, ..., 20, 204, 8],
[1, 1, 1, ..., 71, 85, 8]])
tensor([[1.],
[0.],
[0.],
[1.],
[0.],
[1.],
[0.],
[1.],
[1.],
[1.],
[0.],
[0.],
[0.],
[1.],
[0.],
[1.],
[1.],
[1.],
[0.],
[1.]])
Finally, we build a model to test whether the data pipeline works.
import torch
from torch import nn
import importlib
from torchkeras import Model,summary
class Net(Model):

    def __init__(self):
        super(Net, self).__init__()

        # With padding_idx set, the padding token is always mapped to a zero vector during training
        self.embedding = nn.Embedding(num_embeddings = MAX_WORDS,embedding_dim = 3,padding_idx = 1)

        self.conv = nn.Sequential()
        self.conv.add_module("conv_1",nn.Conv1d(in_channels = 3,out_channels = 16,kernel_size = 5))
        self.conv.add_module("pool_1",nn.MaxPool1d(kernel_size = 2))
        self.conv.add_module("relu_1",nn.ReLU())
        self.conv.add_module("conv_2",nn.Conv1d(in_channels = 16,out_channels = 128,kernel_size = 2))
        self.conv.add_module("pool_2",nn.MaxPool1d(kernel_size = 2))
        self.conv.add_module("relu_2",nn.ReLU())

        self.dense = nn.Sequential()
        self.dense.add_module("flatten",nn.Flatten())
        self.dense.add_module("linear",nn.Linear(6144,1))
        self.dense.add_module("sigmoid",nn.Sigmoid())

    def forward(self,x):
        x = self.embedding(x).transpose(1,2)
        x = self.conv(x)
        y = self.dense(x)
        return y
model = Net()
print(model)
model.summary(input_shape = (200,),input_dtype = torch.LongTensor)
Net(
(embedding): Embedding(10000, 3, padding_idx=1)
(conv): Sequential(
(conv_1): Conv1d(3, 16, kernel_size=(5,), stride=(1,))
(pool_1): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(relu_1): ReLU()
(conv_2): Conv1d(16, 128, kernel_size=(2,), stride=(1,))
(pool_2): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(relu_2): ReLU()
)
(dense): Sequential(
(flatten): Flatten()
(linear): Linear(in_features=6144, out_features=1, bias=True)
(sigmoid): Sigmoid()
)
)
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
         Embedding-1               [-1, 200, 3]          30,000
            Conv1d-2              [-1, 16, 196]             256
         MaxPool1d-3               [-1, 16, 98]               0
              ReLU-4               [-1, 16, 98]               0
            Conv1d-5              [-1, 128, 97]           4,224
         MaxPool1d-6              [-1, 128, 48]               0
              ReLU-7              [-1, 128, 48]               0
           Flatten-8                 [-1, 6144]               0
            Linear-9                    [-1, 1]           6,145
          Sigmoid-10                    [-1, 1]               0
================================================================
Total params: 40,625
Trainable params: 40,625
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.000763
Forward/backward pass size (MB): 0.287796
Params size (MB): 0.154972
Estimated Total Size (MB): 0.443531
----------------------------------------------------------------
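As a side note on where the in_features=6144 of the linear layer comes from, the shapes in the summary can be reproduced by hand (a small worked calculation, not part of the original notebook): each Conv1d with kernel_size k and stride 1 shortens the sequence by k-1, and each MaxPool1d with kernel_size 2 halves it (rounding down).

seq_len = 200
seq_len = seq_len - (5 - 1)   # conv_1, kernel_size=5  -> 196
seq_len = seq_len // 2        # pool_1                 -> 98
seq_len = seq_len - (2 - 1)   # conv_2, kernel_size=2  -> 97
seq_len = seq_len // 2        # pool_2                 -> 48
print(128 * seq_len)          # 128 channels * 48 = 6144, the Flatten output size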
# Compile model
def accuracy(y_pred,y_true):
    y_pred = torch.where(y_pred>0.5,torch.ones_like(y_pred,dtype = torch.float32),
                         torch.zeros_like(y_pred,dtype = torch.float32))
    acc = torch.mean(1-torch.abs(y_true-y_pred))
    return acc

model.compile(loss_func = nn.BCELoss(),optimizer = torch.optim.Adagrad(model.parameters(),lr = 0.02),
              metrics_dict={"accuracy":accuracy})
# Training model
dfhistory = model.fit(10,dl_train,dl_val=dl_test,log_step_freq=200)
Start Training ...

================================================================================2020-07-11 23:21:53
{'step': 200, 'loss': 0.956, 'accuracy': 0.521}
{'step': 400, 'loss': 0.823, 'accuracy': 0.53}
{'step': 600, 'loss': 0.774, 'accuracy': 0.545}
{'step': 800, 'loss': 0.747, 'accuracy': 0.56}
{'step': 1000, 'loss': 0.726, 'accuracy': 0.572}

+-------+-------+----------+----------+--------------+
| epoch |  loss | accuracy | val_loss | val_accuracy |
+-------+-------+----------+----------+--------------+
|   1   | 0.726 |  0.572   |  0.661   |    0.613     |
+-------+-------+----------+----------+--------------+

================================================================================2020-07-11 23:22:20
{'step': 200, 'loss': 0.605, 'accuracy': 0.668}
{'step': 400, 'loss': 0.602, 'accuracy': 0.674}
{'step': 600, 'loss': 0.592, 'accuracy': 0.681}
{'step': 800, 'loss': 0.584, 'accuracy': 0.687}
{'step': 1000, 'loss': 0.575, 'accuracy': 0.696}

+-------+-------+----------+----------+--------------+
| epoch |  loss | accuracy | val_loss | val_accuracy |
+-------+-------+----------+----------+--------------+
|   2   | 0.575 |  0.696   |  0.553   |    0.716     |
+-------+-------+----------+----------+--------------+

================================================================================2020-07-11 23:25:53
{'step': 200, 'loss': 0.294, 'accuracy': 0.877}
{'step': 400, 'loss': 0.299, 'accuracy': 0.875}
{'step': 600, 'loss': 0.298, 'accuracy': 0.875}
{'step': 800, 'loss': 0.296, 'accuracy': 0.876}
{'step': 1000, 'loss': 0.298, 'accuracy': 0.875}

+-------+-------+----------+----------+--------------+
| epoch |  loss | accuracy | val_loss | val_accuracy |
+-------+-------+----------+----------+--------------+
|   10  | 0.298 |  0.875   |  0.464   |    0.795     |
+-------+-------+----------+----------+--------------+

================================================================================2020-07-11 23:26:19
Finished Training...
As noted earlier, DataLoader controls the batch size, the way elements are sampled into a batch, and the way samples are collated into the input form required by the model, and it can read data with multiple worker processes.
The function signature of DataLoader is as follows.
DataLoader(
dataset,
batch_size=1,
shuffle=False,
sampler=None,
batch_sampler=None,
num_workers=0,
collate_fn=None,
pin_memory=False,
drop_last=False,
timeout=0,
worker_init_fn=None,
multiprocessing_context=None,
)
Under normal circumstances, we only need to configure five parameters: dataset, batch_size, shuffle, num_workers and drop_last; the other parameters can keep their default values.
Besides the torch.utils.data.Dataset covered earlier, DataLoader can also load another kind of dataset, torch.utils.data.IterableDataset.
Unlike Dataset, which behaves like a list, IterableDataset behaves like an iterator. It is more complex and is used less often.
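For completeness, here is a minimal IterableDataset sketch (an illustration with made-up data, not code from this book): instead of __len__ and __getitem__, it only implements __iter__, so samples are produced as a stream and DataLoader batches them as they arrive.

import torch
from torch.utils.data import IterableDataset, DataLoader

class RangeStream(IterableDataset):
    """Yields the numbers 0..n-1 one at a time."""
    def __init__(self, n):
        super().__init__()
        self.n = n
    def __iter__(self):
        return iter(torch.arange(self.n))

dl_stream = DataLoader(RangeStream(10), batch_size=4)
for batch in dl_stream:
    print(batch)  # tensor([0, 1, 2, 3]), then tensor([4, 5, 6, 7]), then tensor([8, 9])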
- dataset: the dataset.
- batch_size: batch size.
- shuffle: whether to shuffle the data.
- sampler: sampling function for single elements; generally does not need to be set (a sketch of setting it follows after this list).
- batch_sampler: batch sampling function; generally does not need to be set.
- num_workers: number of subprocesses used to read data.
- collate_fn: function that collates a batch of samples.
- pin_memory: whether to use pinned (page-locked) memory. The default is False; pinned memory is not swapped out to virtual memory (disk), and copying from pinned memory to the GPU is faster.
- drop_last: whether to drop the last batch when its number of samples is less than batch_size.
- timeout: maximum time to wait when loading a batch; generally does not need to be set.
- worker_init_fn: initialization function for the dataset in each worker, often used with IterableDataset; rarely used.
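When the defaults are not enough, parameters such as sampler can be overridden. Below is a hedged sketch (with made-up class weights, not code from this book) that uses torch.utils.data.WeightedRandomSampler to oversample a rare class; note that passing a custom sampler is mutually exclusive with shuffle=True.

import torch
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

# Made-up imbalanced data: 90 samples of class 0 and 10 samples of class 1
labels = torch.cat([torch.zeros(90), torch.ones(10)])
features = torch.randn(100, 3)
ds_imb = TensorDataset(features, labels)

# Give each sample a weight inversely proportional to its class frequency
weights = labels * 8.0 + 1.0   # weight 1 for class 0, weight 9 for class 1
sampler = WeightedRandomSampler(weights, num_samples=len(ds_imb), replacement=True)

dl_imb = DataLoader(ds_imb, batch_size=20, sampler=sampler)  # sampler replaces shuffle
for x, y in dl_imb:
    print(y.mean())  # roughly 0.5 on average, since class 1 is oversampled
    break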
#Build input data pipeline
ds = TensorDataset(torch.arange(1,50))
dl = DataLoader(ds,
batch_size = 10,
shuffle = True,
num_workers=2,
drop_last = True)
# Iterate over the data
for batch, in dl:
    print(batch)
tensor([43, 44, 21, 36, 9, 5, 28, 16, 20, 14])
tensor([23, 49, 35, 38, 2, 34, 45, 18, 15, 40])
tensor([26, 6, 27, 39, 8, 4, 24, 19, 32, 17])
tensor([ 1, 29, 11, 47, 12, 22, 48, 42, 10, 7])