<> Preface

I recently set up a new conda environment with TensorFlow 2.0 (Beta). TF 2.0 changes a lot; for example, Session has been removed. For someone like me who is used to building a graph first and then executing it with a Session, this is bewildering: I have no clue how to run things in graph style in 2.0 (I only just discovered that tf.compat keeps the old-version API). So, for anyone still shivering in front of the new TF, I highly recommend keras.

Today I planned to run an image task with TF 2.0. The first step is reading the data, but the dataset here is 11 GB, so I decided to convert it to TFRecord first, for convenience later.

<> Introduction

TFRecord is the unified binary file format that TensorFlow provides and recommends for storing data; in theory it can hold data in any format.

| type   | value                  |
| ------ | ---------------------- |
| uint64 | length                 |
| uint32 | masked_crc32_of_length |
| byte   | data[length]           |
| uint32 | masked_crc32_of_data   |

As shown above, each record in the file is composed of the data length, a checksum of the length, the data itself, and a checksum of the data.
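To make the framing concrete, here is a minimal sketch (my own, not from the original post) that walks a TFRecord file record by record using only the layout above; it assumes little-endian integers and skips CRC verification:

```python
import struct

def iter_tfrecord_raw(path):
    # Yield the raw payload of each record, following the framing above.
    # Assumption: integers are little-endian; CRCs are read but not verified.
    with open(path, 'rb') as f:
        while True:
            header = f.read(8)            # uint64 length
            if not header:
                break                     # end of file
            length, = struct.unpack('<Q', header)
            f.read(4)                     # uint32 masked_crc32_of_length (skipped)
            data = f.read(length)         # byte data[length]
            f.read(4)                     # uint32 masked_crc32_of_data (skipped)
            yield data
```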

A TFRecord file holds a series of Examples; an Example is a message body under the protocol buffer (protobuf) protocol.
For instance, the Example I use here looks like this:
```python
exam = tf.train.Example(
    features=tf.train.Features(
        feature={
            'name':  tf.train.Feature(bytes_list=tf.train.BytesList(value=[splits[-1].encode('utf-8')])),
            'shape': tf.train.Feature(int64_list=tf.train.Int64List(value=[img.shape[0], img.shape[1], img.shape[2]])),
            'data':  tf.train.Feature(bytes_list=tf.train.BytesList(value=[bytes(img.numpy())])),
        }
    )
)
```
As you can see, an Example message body contains one Features, and a Features is composed of many features. Each feature is a map, i.e. a key-value pair, where the key is a string and the value is a Feature message body whose value can be one of 3 types:

* BytesList
* FloatList
* Int64List

Note that all of them take the form of lists, even when holding a single value; see the sketch below.
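As a quick illustration (these helper names are mine, not from any library), wrapping a single scalar into each of the three types looks like this:

```python
import tensorflow as tf

# Hypothetical helpers: each wraps one scalar in the required list type.
def bytes_feature(value):   # value: bytes
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def float_feature(value):   # value: float
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def int64_feature(value):   # value: int
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
```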
<> How to create a TFRecord file

As we saw above, a TFRecord file is composed of a series of Examples, and each Example can represent one record of data.

In TensorFlow 2.0 Beta, the API for writing TFRecord files is tf.io.TFRecordWriter(filename, options=None). The second parameter controls the output configuration of the file; we can ignore it here. The first parameter is the name of the file to save, and the call returns a Writer instance.

With the Writer in hand, we can repeatedly call writer.write(example) to output our Examples to the file. Note that this function accepts a string, so we must first serialize the example to a string, namely:

```python
writer.write(example.SerializeToString())
```

Once all the examples have been written to the file, call writer.close() to close it.

Example:
```python
writer = tf.io.TFRecordWriter(file_name)
for item in file_list:              # item = .\data\xx(label)\xxx.jpg
    splits = item.split('\\')
    label = splits[2]
    img = tf.io.read_file(item)
    img = tf.image.decode_jpeg(img)
    exam = tf.train.Example(
        features=tf.train.Features(
            feature={
                'name':  tf.train.Feature(bytes_list=tf.train.BytesList(value=[splits[-1].encode('utf-8')])),
                'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
                'shape': tf.train.Feature(int64_list=tf.train.Int64List(value=[img.shape[0], img.shape[1], img.shape[2]])),
                'data':  tf.train.Feature(bytes_list=tf.train.BytesList(value=[bytes(img.numpy())])),
            }
        )
    )
    writer.write(exam.SerializeToString())
writer.close()
```
Because TensorFlow 2.0 defaults to eager mode, img here is an EagerTensor and needs to be converted to numpy.
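As a side note, a variation of my own rather than the original code: tf.io.TFRecordWriter can also be used as a context manager, so the file is closed automatically:

```python
# The with-statement calls writer.close() for us.
with tf.io.TFRecordWriter(file_name) as writer:
    for item in file_list:
        exam = make_example(item)  # hypothetical helper building the tf.train.Example as above
        writer.write(exam.SerializeToString())
```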

<> How to read a TFRecord file

In the old versions we could use tf.TFRecordReader(), but I couldn't find it in 2.0, so we use tf.data.TFRecordDataset(filename) instead. The call gives us a Dataset (tf.data.Dataset), which, read literally, is where all the Examples we wrote now live.
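(As far as I know, TFRecordDataset also accepts a list of filenames, so several record files can be read as a single dataset:)

```python
reader = tf.data.TFRecordDataset(['part_0.tfrecord', 'part_1.tfrecord'])  # hypothetical filenames
```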

Remember that when writing we serialized every example, so to recover the original examples we need to parse the serialized strings we wrote earlier.
The function tf.io.parse_single_example(example_proto, feature_description) parses a single example.

A few words about this function:

The first parameter is the string to be parsed. The key is the second parameter, which specifies the format used to parse the example. To parse correctly, it must correspond to the example we wrote.
For instance, if the example we wrote was:
```python
exam = tf.train.Example(
    features=tf.train.Features(
        feature={
            'name':  tf.train.Feature(bytes_list=tf.train.BytesList(value=[splits[-1].encode('utf-8')])),
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
            'shape': tf.train.Feature(int64_list=tf.train.Int64List(value=[img.shape[0], img.shape[1], img.shape[2]])),
            'data':  tf.train.Feature(bytes_list=tf.train.BytesList(value=[bytes(img.numpy())])),
        }
    )
)
```
then the parameter we need to specify is:
```python
feature_description = {
    'name':  tf.io.FixedLenFeature([], tf.string, default_value='Nan'),
    'label': tf.io.FixedLenFeature([], tf.int64, default_value=-1),  # the default value is up to you
    'shape': tf.io.FixedLenFeature([3], tf.int64),
    'data':  tf.io.FixedLenFeature([], tf.string),
}
```
You can see that each entry corresponds to a feature in the example above. (The keys in feature_description don't strictly have to match: because a default_value is supplied, renaming name to id would still parse without error, with the missing feature falling back to its default.)

OK, we can now parse a single example, but a Dataset contains many examples. No problem: TensorFlow's Dataset provides Dataset.map(func), which takes a mapping rule and applies it to every entry in the dataset, much like Python's built-in map function.
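For intuition, a toy comparison of my own (made-up values):

```python
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3])
ds = ds.map(lambda x: x * 2)  # like Python's map, applied lazily to each element
for x in ds:
    print(x.numpy())          # 2, 4, 6
```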

So we can pass our mapping function to Dataset.map(func) to parse all the examples:
```python
reader = tf.data.TFRecordDataset(file_name)   # open a TFRecord file

feature_description = {
    'name':  tf.io.FixedLenFeature([], tf.string, default_value='Nan'),
    'label': tf.io.FixedLenFeature([], tf.int64, default_value=-1),
    'shape': tf.io.FixedLenFeature([3], tf.int64),
    'data':  tf.io.FixedLenFeature([], tf.string),
}

def _parse_function(exam_proto):   # mapping function, parses one example
    return tf.io.parse_single_example(exam_proto, feature_description)

reader = reader.map(_parse_function)
```
To read the results, we can use a for loop:
```python
import numpy as np

for row in reader.take(10):   # take only the first 10 records
    # (or use `for row in reader:` to enumerate all examples)
    print(row['name'])
    print(np.frombuffer(row['data'].numpy(), dtype=np.uint8))
    # reshape if you want to restore the 3-D array
```
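Expanding on the reshape comment above, a small sketch of my own that restores an image to its original 3-D shape using the stored shape feature:

```python
for row in reader.take(1):
    flat = np.frombuffer(row['data'].numpy(), dtype=np.uint8)
    img = flat.reshape(row['shape'].numpy())  # e.g. (480, 640, 3)
    print(img.shape)
```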
But we can do more. Dataset provides many other methods, such as batch, shuffle, repeat... you can explore more on the official website (at some point, I'm not sure when, the TF official site suddenly became reachable without a VPN).

We can do something like this:
```python
import numpy as np
import tensorflow as tf

reader = tf.data.TFRecordDataset(file_name)

feature_description = {
    'name':  tf.io.FixedLenFeature([], tf.string, default_value='Nan'),
    'label': tf.io.FixedLenFeature([], tf.int64, default_value=-1),
    'shape': tf.io.FixedLenFeature([3], tf.int64),
    'data':  tf.io.FixedLenFeature([], tf.string),
}

def _parse_function(exam_proto):
    return tf.io.parse_single_example(exam_proto, feature_description)

reader = reader.repeat(1)                  # read the data 1 time, i.e. 1 epoch
reader = reader.shuffle(buffer_size=2000)  # randomly shuffle data within the buffer
reader = reader.map(_parse_function)       # parse the data
batch = reader.batch(batch_size=10)        # group every 10 records into a batch, producing a new Dataset

shape = []
batch_data_x, batch_data_y = np.array([]), np.array([])
for item in batch.take(1):                 # test: take only 1 batch
    shape = item['shape'][0].numpy()
    for data in item['data']:              # one item is one batch
        img_data = np.frombuffer(data.numpy(), dtype=np.uint8)
        batch_data_x = np.append(batch_data_x, img_data)
    for label in item['label']:
        batch_data_y = np.append(batch_data_y, label.numpy())

batch_data_x = batch_data_x.reshape([-1, shape[0], shape[1], shape[2]])
print(batch_data_x.shape, batch_data_y.shape)  # = (10, 480, 640, 3) (10,)
                                               # when my images are 480*640*3
```
This makes it very convenient to read out each batch of data and then train on it, and so on.
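Since the preface recommended keras, here is a closing sketch of my own (with a hypothetical compiled model; feature names as defined above) showing how the decoding can live inside the pipeline and feed model.fit directly:

```python
import tensorflow as tf

def _to_image_and_label(exam_proto):
    parsed = tf.io.parse_single_example(exam_proto, feature_description)
    img = tf.io.decode_raw(parsed['data'], tf.uint8)  # raw bytes -> uint8 tensor
    img = tf.reshape(img, parsed['shape'])            # restore the 3-D shape
    img = tf.cast(img, tf.float32) / 255.0            # scale to [0, 1]
    return img, parsed['label']

ds = (tf.data.TFRecordDataset(file_name)
      .map(_to_image_and_label)
      .shuffle(2000)
      .batch(10))

model.fit(ds, epochs=1)  # `model` is a hypothetical compiled tf.keras model
```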
