Build an efficient input data pipeline for deep learning

Problem

Developing the input pipeline in a machine learning project is usually time-consuming and tedious, and it can take longer than building the model itself. When dealing with massive datasets containing thousands or millions of files, the input data pipeline can be either a game changer or a bottleneck, depending on its architecture.

When datasets are too large to fit in RAM, a Python generator-based approach can become a significant barrier to training compute-intensive models on GPUs: a GPU sitting idle while it waits for data slows down training considerably.

This is where TensorFlow's TFRecord format comes in handy.

In this lesson, we will learn how to use TensorFlow's tf.data API and the TFRecord format to create efficient input pipelines for images and text.

TFRecord

In TFRecord, data is stored as binary records in a sequence of protocol buffers. TFRecord files are extremely read-efficient and take up little hard drive space. TFRecord is a format for storing data samples sequentially, where each sample contains a number of features.

We specify our features as a dictionary inside the 'write record' method. In this example we have image data, the location of the object in the image, and the colour of the object, and we also want to store the shape of the image data.

TFRecord supports a fixed set of datatypes for features. For our features, we use a byte list and a float list.

Now we fill the features with the actual data. We start with numeric data, which is quite handy: we flatten it ('.ravel()') and pass it to the appropriate feature constructor. You might wonder why we store image data as floats. This is a design choice (see below for its consequences): by saving the image data with colour values already in the 0-1 range, we can feed it straight into training. You will notice there are several places suitable for data conversions; if you stored the image as uint8 here, you could convert it later in the feeding pipeline.

The last thing we need is a way to close the writer and ensure that everything is flushed to disc. We add a close method for this (as an aside, you could adapt it to work with Python's 'with' statement).
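To make the above concrete, here is a minimal sketch of such a writer. The feature names ("image", "shape", "bbox", "colour"), the ImageWriter class, and the write_record/close signatures are illustrative assumptions, not part of any particular library.

import numpy as np
import tensorflow as tf

def _bytes_feature(value):
    # Wrap a single bytes value in a TFRecord byte-list feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(values):
    # Wrap an iterable of numbers in a TFRecord float-list feature.
    return tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))

class ImageWriter:
    def __init__(self, path):
        self._writer = tf.io.TFRecordWriter(path)

    def write_record(self, image, bbox, colour):
        # Flatten the image and store it as floats already in the 0-1 range,
        # as discussed above; storing uint8 and converting later is the alternative.
        features = {
            "image": _float_feature(image.ravel()),
            "shape": _float_feature(image.shape),
            "bbox": _float_feature(bbox),
            "colour": _bytes_feature(colour.encode("utf-8")),
        }
        example = tf.train.Example(features=tf.train.Features(feature=features))
        self._writer.write(example.SerializeToString())

    def close(self):
        # Flush and close so everything actually lands on disc.
        self._writer.close()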

That's all. We will return to one point later: we currently do not write validation data separately from training data. You might expect a simple 'split dataset' method to use afterwards, but tf.data has none. This is understandable given that tf.data is designed to operate on billions of records, and splitting billions of records in a principled way is hard.

TFRecord eliminates the need to read each sample file from disc on every epoch.

The efficiency of TFRecord comes at a cost: you need to write a fair amount of fairly involved code to produce and read the record files.
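To illustrate the point, this is roughly what reading the records back looks like by hand with tf.data; the feature schema mirrors the illustrative writer sketched above and is an assumption, not a fixed format.

import tensorflow as tf

feature_description = {
    "image": tf.io.VarLenFeature(tf.float32),
    "shape": tf.io.FixedLenFeature([3], tf.float32),
    "bbox": tf.io.FixedLenFeature([4], tf.float32),
    "colour": tf.io.FixedLenFeature([], tf.string),
}

def parse_record(serialized):
    # Parse one serialised example and restore the original image shape.
    parsed = tf.io.parse_single_example(serialized, feature_description)
    shape = tf.cast(parsed["shape"], tf.int32)
    image = tf.reshape(tf.sparse.to_dense(parsed["image"]), shape)
    return image, parsed["colour"]

dataset = tf.data.TFRecordDataset(["train.tfrecord"]).map(parse_record)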

Datum solves this problem by handling TFRecord files for us.

Datum

Datum is a TensorFlow-based framework for building an efficient, fast input pipeline. It is designed to create and manage TFRecord datasets with almost no complicated programming.

Datum can read and write TFRecord datasets and construct a fast input pipeline for single-GPU or distributed training with only a few lines of code.

Datum constructs a fast input pipeline using tf.data and TFRecord.

TFData

The Dataset API lets you build an asynchronous, highly efficient data pipeline that keeps your GPU from starving for data. It reads data from disc (images or text), applies transformations, builds batches, and sends them to the GPU. Naive data pipelines force the GPU to wait for the CPU to load the data, which hurts performance.
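As a rough sketch of such a pipeline (decode_image, the file pattern, the image size, and the batch size are placeholders, not prescribed values):

import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

def decode_image(path):
    # Read and decode one image file, then scale it to the 0-1 range.
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(image, (224, 224)) / 255.0

paths = tf.data.Dataset.list_files("images/*.jpg")
pipeline = (
    paths
    .map(decode_image, num_parallel_calls=AUTOTUNE)  # transform samples in parallel on the CPU
    .batch(32)                                       # build batches
    .prefetch(AUTOTUNE)                              # overlap CPU preprocessing with GPU training
)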

Installation

Datum can be installed from PyPI with:

pip install datum

TFRecord export

If you don't use Datum, writing/exporting data in tfrecord format can get quite involved.

Datum makes it easy to export datasets to tfrecord format. It comes with a few predefined problem types that let you generate a dataset with a few lines of code, without getting into the inner workings of tfrecord and serialisation.

Import TFRWriteConfigs to define the Datum configuration for writing/exporting data to tfrecord.

from datum.configs import TFRWriteConfigs

Define the split information in the configs; the split names are important for Datum to automatically identify each split's data.

write_configs = TFRWriteConfigs()
write_configs.splits = {
    "train": {
        "num_examples": <num of train examples in the dataset>
    },
    "val": {
        "num_examples": <num of validation examples in the dataset>
    },
}

To convert a dataset, import the export API and the problem type.

Different datasets serve different purposes. An image classification dataset with only a class label, for example, cannot be used for image segmentation or detection. To facilitate conversion, Datum distinguishes between problem types for classification, detection, and segmentation tasks.

from datum.export.export import export_to_tfrecord
from datum.problem.types import IMAGE_CLF

Suppose we want to build a tfrecord dataset for an image classification task; the problem type for that is IMAGE_CLF.

Convert the dataset to tfrecord format

export_to_tfrecord(input_path, output_path, IMAGE_CLF, write_configs)

Datum will convert the dataset and store the result. The output directory contains the tfrecord files along with dataset metadata files. The exported tfrecord files can easily be loaded as tf.data using the Datum load API.

Load the data as tf.data.Dataset.

Import the load API to read the tfrecord dataset back as tf.data.

from datum.reader import load

To load the dataset, simply pass the output path from the previous export step.

dataset = load(<path to tfrecord files folder>)
train_dataset = dataset.train_fn('train', shuffle=True)
val_dataset = dataset.val_fn('val', shuffle=False)

Examples in the dataset can be augmented before being fed into the model. It's easy to preprocess and post-process samples in the dataset using pre_batching_callback and post_batching_callback.

pre_batching_callback: using this callback, examples can be processed individually before batching.

post_batching_callback: using this callback, examples can be processed after batching; they are processed as a whole batch (a sketch of such a callback follows the augmentation example below).

Suppose we want to augment the dataset; this can be achieved using the following pre_batching_callback:

import tensorflow as tf

IMG_SIZE = (224, 224)  # assumed target size; adjust for your model

def augment_image(example):
    # Resize and randomly flip the image, then put it back into the example dict.
    image = tf.image.resize(example["image"], IMG_SIZE)
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    example.update({"image": image})
    return example

dataset_configs = dataset.dataset_configs
dataset_configs.pre_batching_callback = lambda example: augment_image(example)
train_dataset = dataset.train_fn('train', shuffle=True)
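Similarly, a post_batching_callback could normalise whole batches at once. The sketch below assumes the callback receives the same example dictionary as above, now with a leading batch axis on each tensor, and that the images are stored as uint8 values in the 0-255 range.

def normalize_batch(example):
    # Scale the entire batch of images to the 0-1 range in one operation.
    image = tf.cast(example["image"], tf.float32) / 255.0
    example.update({"image": image})
    return example

dataset_configs.post_batching_callback = lambda example: normalize_batch(example)
train_dataset = dataset.train_fn('train', shuffle=True)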

Put Datum to the test.

This notebook demonstrates how to use Datum for transfer learning with an EfficientNet-B0 model on an image classification project.

Play with the Transfer Learning with Datum notebook to speed up your input pipeline.
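As a hedged sketch of how the loaded datasets might plug into such a model (it assumes a callback, not shown, has already mapped each example to an (image, label) pair, and that NUM_CLASSES matches your dataset):

import tensorflow as tf

NUM_CLASSES = 10  # assumption: replace with your dataset's class count

base = tf.keras.applications.EfficientNetB0(include_top=False, pooling="avg", weights="imagenet")
base.trainable = False  # freeze the pretrained backbone for transfer learning

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_dataset, validation_data=val_dataset, epochs=5)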
