Keras generator seems to have a memory leak (TF 1.3, Keras 2.0.9) - Python

I'm training a large image dataset (3,608,930 images, 244 x 244) with InceptionResNetV2 or Xception, using fit_generator with either a generator or a Sequence.

I use multi_gpu_model with 5 GPUs, so I make batch_size 45 per GPU. Batch memory is 244 * 244 * 3 * 4 bytes * 45 * 5 = about 153 MB.

As training goes on, memory monotonically increases by about 15 MB per batch, as shown below.

Why does memory monotonically increase by 15 MB (about 10% of the 153 MB batch) every batch? 15 MB * 55,522 steps = about 821 GB, so I can't train on the full dataset. But when I use datagen.flow_from_directory, memory doesn't increase. Why? Does my generator have a problem? Any ideas are welcome.

On each batch, memory increases by about 15 MB:

734/55522 [..............................] - ETA: 53:06:05 - loss: 1.7416 - categorical_accuracy: 0.6141   67215272
735/55522 [..............................] - ETA: 53:06:01 - loss: 1.7416 - categorical_accuracy: 0.6140   67233632
736/55522 [..............................] - ETA: 53:05:58 - loss: 1.7408 - categorical_accuracy: 0.6142   67253048
737/55522 [..............................] - ETA: 53:05:56 - loss: 1.7399 - categorical_accuracy: 0.6144   67268764
738/55522 [..............................] - ETA: 53:05:52 - loss: 1.7397 - categorical_accuracy: 0.6145   67285804
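For comparison, the flow_from_directory path that does not leak looks roughly like this (the directory path here is a placeholder, not my real setup):

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1. / 255)
# 'data/train' is a hypothetical directory with one subfolder per class
trainFlow = datagen.flow_from_directory('data/train',
                                        target_size=(244, 244),
                                        batch_size=batch_size,
                                        class_mode='categorical')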

Below is the fit_generator call, with a callback that logs ru_maxrss after every batch.

import resource
from keras.callbacks import Callback

class MemoryCallback(Callback):
    def on_batch_end(self, batch, logs=None):
        # ru_maxrss is the process's peak resident set size (KB on Linux)
        print("  ", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

parallel_model.fit_generator(
    generator=myGenerator(trainIndex),
    steps_per_epoch=int(len(trainIndex) / batch_size),
    epochs=5,
    verbose=1,
    callbacks=[cSVLogger, checkpointer, MemoryCallback()],  # instantiate the callback
    validation_data=myTestGenerator(testIndex),
    validation_steps=int(len(testIndex) / batch_size),
    class_weight=classWeight,
    initial_epoch=0,
    max_queue_size=1,
    shuffle=False,
    workers=1,
)

Below is my generator. inputValue is the list of preloaded images.

import numpy as np
from keras.preprocessing.image import img_to_array

@threadsafe_generator
def myGenerator(trainIndex):
    ckValue = int(len(trainIndex) / batch_size)  # steps per epoch
    while 1:
        for idx in range(ckValue):
            returnA = []  # images for this batch
            returnB = []  # one-hot labels for this batch

            for y in trainIndex[idx * batch_size:(idx + 1) * batch_size]:
                # img_to_array copies the image into a new float array;
                # dividing by 255 scales pixel values to [0, 1]
                returnA.append(img_to_array(inputValue[y]) / 255)

                # Build a one-hot vector for this sample's class
                categoryOne = [0] * len(word2IntClassValue)
                categoryOne[word2IntClassValue[lableValue[y]]] = 1
                returnB.append(categoryOne)

            yield np.array(returnA), np.array(returnB)
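The Sequence variant mentioned at the top wraps the same batching logic in keras.utils.Sequence; a minimal sketch (reusing the same globals, not my exact code) would be:

from keras.utils import Sequence

class MySequence(Sequence):
    # Minimal sketch only: reuses the question's globals (inputValue,
    # lableValue, word2IntClassValue, batch_size)
    def __init__(self, trainIndex):
        self.trainIndex = trainIndex

    def __len__(self):
        return int(len(self.trainIndex) / batch_size)

    def __getitem__(self, idx):
        batch = self.trainIndex[idx * batch_size:(idx + 1) * batch_size]
        x = np.array([img_to_array(inputValue[y]) / 255 for y in batch])
        y = np.zeros((len(batch), len(word2IntClassValue)))
        for i, b in enumerate(batch):
            y[i, word2IntClassValue[lableValue[b]]] = 1
        return x, y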
Asked Oct 17 '21 12:10
linetor

2 Answers:

I assume the MemoryCallback reports the RAM utilization on this machine. Do you actually run out of memory when you train with the whole dataset, or do you estimate that you will crash due to the increase?

At first glance I don't see anything in your generator that could cause a memory leak. Keep in mind that garbage collection can be an expensive process and might be delayed if your system has lots of memory. If you don't experience any out-of-memory issues, don't worry about it, as the GC is likely to clean up the memory at some point. Alternatively, you can try calling gc.collect() manually. Most of the time this practice is discouraged, but in a few cases it can be helpful.
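For instance, a minimal sketch of manual collection hooked into a callback (the every-100-batches interval is an arbitrary choice of mine, not a recommendation):

import gc
from keras.callbacks import Callback

class GcCallback(Callback):
    # Hypothetical helper: force a full collection every 100 batches;
    # gc.collect() has a CPU cost, so avoid calling it on every batch
    def on_batch_end(self, batch, logs=None):
        if batch % 100 == 0:
            gc.collect()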

If you have more time to invest in this, you can also rewrite the generator to avoid allocating so much memory every time. This can be done by preallocating the memory outside the loop (assuming fixed batch sizes, which might not be the case; it is still possible with extra effort). I also assume that the pasted code is a simplified snippet and that your actual implementation handles concurrency safely.
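As an illustration of the preallocation idea (assuming fixed batch sizes and the question's globals; the 244 x 244 x 3 shape comes from the post), the generator could reuse two buffers instead of building fresh lists:

import numpy as np
from keras.preprocessing.image import img_to_array

@threadsafe_generator
def myGenerator(trainIndex):
    ckValue = int(len(trainIndex) / batch_size)
    # Allocate the batch buffers once, then overwrite them in place
    imgs = np.zeros((batch_size, 244, 244, 3), dtype=np.float32)
    labels = np.zeros((batch_size, len(word2IntClassValue)), dtype=np.float32)
    while 1:
        for idx in range(ckValue):
            labels[:] = 0  # clear the previous batch's one-hot labels
            for i, y in enumerate(trainIndex[idx * batch_size:(idx + 1) * batch_size]):
                imgs[i] = img_to_array(inputValue[y]) / 255
                labels[i, word2IntClassValue[lableValue[y]]] = 1
            # Note: the same arrays are yielded every time, so this only
            # works if the consumer finishes with a batch before the next
            # one is written (keep max_queue_size small)
            yield imgs, labels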

1
Answered Nov 06 '17 at 14:28
datumbox

Thanks for your comment.

But actually I did crash due to the memory problem. I use an AWS p2.8xlarge. fit_generator starts at 80 GB, but as training goes on, at batch 2974/7152 memory is at 378 GB. And gc is not helpful in this situation.

But I'll try your suggestion of preallocating memory. Thanks.

1
Answered Nov 06 '17 at 14:50
linetor