Parallel Python Community Forums

Python Forums => Parallel Python Forum => Topic started by: daniel_victoria on April 07, 2009, 05:29:24 AM



Title: can't pickle PySwigObject objects
Post by: daniel_victoria on April 07, 2009, 05:29:24 AM
Hi,

I'm new to programming, especially parallel coding. I wrote a program that takes a time series of images and, for each image pixel, calculates a Fourier transform and returns images of the amplitude and phase angle. The non-parallel (serial) code works OK. To achieve this I defined a function that takes a series of images (a list of gdal objects) and also the position of the image window I'll analyse. This way I can split the images into smaller spatial chunks.
The function call is something like this:
fourrier([ list of image objects], xi, yi, xsize, ysize, fourrier_terms, filter_function)
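
A minimal sketch of the per-pixel transform described above, assuming the window has already been read into a NumPy array of shape (time, rows, cols). The name `amplitude_phase` and the `n_terms` parameter are hypothetical; this is not the poster's actual `fourrier` function, and no filter function is applied.

```python
import numpy as np


def amplitude_phase(stack, n_terms):
    """Return amplitude and phase of the first n_terms harmonics per pixel.

    stack: array of shape (time, rows, cols), one layer per image date.
    """
    # Real-input FFT along the time axis; keep only the first n_terms harmonics.
    coeffs = np.fft.rfft(stack, axis=0)[:n_terms]
    amplitude = np.abs(coeffs)
    phase = np.angle(coeffs)  # radians, in (-pi, pi]
    return amplitude, phase


# Tiny synthetic example: 8 dates, 2x2 pixels, every pixel constant at 2.0.
stack = np.full((8, 2, 2), 2.0)
amplitude, phase = amplitude_phase(stack, n_terms=3)
```

For a constant series all the energy lands in harmonic 0 (the unnormalized sum over time), and the higher harmonics are zero.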

My idea was to send each of these spatial chunks to a parallel job but, when I try to run the code I get:

Code:
Traceback (most recent call last):
  File "E:\python\harmonicos_pp.py", line 298, in <module>
    main()
  File "E:\python\harmonicos_pp.py", line 285, in main
    todos_dados = recortar(imagens, args)
  File "E:\python\harmonicos_pp.py", line 143, in recortar
    jobs = [job_server.submit(fourrier, (tuple(imagens)+tuple(bloco)), modules=('numpy','gdal',)) for bloco in partes]
  File "C:\Python25\Lib\site-packages\pp.py", line 407, in submit
    sargs = pickle.dumps(args)
  File "C:\Python25\lib\copy_reg.py", line 69, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle PySwigObject objects

Any idea what is going on? Is it because the images in the image list are proxies of <Swig Object of Type 'GDALDatasetShadow'>? Would converting the images to NumPy arrays help?

Thanks
Daniel

PS - I can attach the code if needed. It's just that, being a self-taught and sloppy programmer, I'm a bit embarrassed by my coding style :o :-\


Title: Re: can't pickle PySwigObject objects
Post by: krb on April 07, 2009, 06:02:07 AM
I had a similar problem; in my case I was trying to pass an unpicklable object via pp's submit routine.
It looks like your PySwigObject cannot be pickled, so yes, use a picklable object instead.
I know that numpy arrays can be pickled, so try using them instead.
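
A quick way to check arguments before handing them to submit is to try pickling them yourself. This is only a sketch: `is_picklable` is a hypothetical helper, and a `threading.Lock` stands in for the SWIG proxy because it also wraps a low-level resource that pickle refuses to serialize.

```python
import pickle
import threading


def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
    except Exception:
        # pickle raises TypeError or PicklingError depending on the object.
        return False
    return True


plain_data = {"xi": 0, "yi": 0, "values": [1.0, 2.0, 3.0]}
low_level = threading.Lock()  # stand-in for a SWIG proxy object

checks = {
    "plain_data": is_picklable(plain_data),
    "low_level": is_picklable(low_level),
}
```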


Title: Re: can't pickle PySwigObject objects
Post by: daniel_victoria on April 07, 2009, 06:40:51 AM
Krb,

Thanks for the reply. I'll try to modify the code so I can pass numpy arrays instead of Gdal objects to the function. I'm just afraid that, by converting the Gdal object to a numpy array, the entire time series will be loaded into memory, since I'd have to read the file instead of just pointing to it. Or am I messing up somewhere (very likely)?

I'll have to figure out how to do that without loading the whole array into memory.
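
One possible way around this, sketched under assumptions: pass each job the file paths (plain strings) plus the window coordinates, and let the job open the images and read only its own window. The sketch assumes the GDAL Python bindings import as `from osgeo import gdal` and that `Dataset.ReadAsArray` accepts x/y offsets and sizes. `read_window`, `read_series_window` and the `opener` hook are hypothetical names; the hook exists only so the function can be exercised without GDAL installed.

```python
def read_window(path, xi, yi, xsize, ysize, opener=None):
    """Open one image by path and read only the requested window."""
    if opener is None:
        from osgeo import gdal  # imported lazily, inside the worker
        opener = gdal.Open
    dataset = opener(path)
    # Only this xsize-by-ysize block is read from disk into memory.
    return dataset.ReadAsArray(xi, yi, xsize, ysize)


def read_series_window(paths, xi, yi, xsize, ysize, opener=None):
    """Read the same window from every image in the time series."""
    return [read_window(p, xi, yi, xsize, ysize, opener) for p in paths]
```

Because the arguments are just strings and integers, they pickle without trouble. On a cluster the paths would also have to be reachable from every machine.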


Title: Re: can't pickle PySwigObject objects
Post by: krb on April 07, 2009, 07:57:01 AM
Just curious: what is your gdal object?
Is it a file pointer (file handle) type?


Title: Re: can't pickle PySwigObject objects
Post by: daniel_victoria on April 07, 2009, 09:42:27 AM
Yes, the gdal object is a pointer to a GeoTIFF image (a satellite image). Along with the data in the image itself, there is information about georeferencing, pixel size, projection, etc.

The data I need is accessed through a function that will return an array of the image data (Ex. image.ReadAsArray())

I was able to modify the code so I'm now giving the numpy arrays to the function instead of the file handles but, in order to do that, I have to read all images into memory :o At least the parallelization is working ;D

Would it be possible to pickle.dump() the data behind the GDAL object file handle to a temp file and then use this temp file as input to the function? Would that free up some memory by storing the data on my HD?

Thanks
Daniel


Title: Re: can't pickle PySwigObject objects
Post by: daniel_victoria on April 07, 2009, 10:44:15 AM
Answering myself, and also anyone who happens to bump into the same issue.

I managed to keep the memory use low and learned a few new tricks in python.

To keep memory use low I thought: why keep a Python data object in memory if I can pickle.dump() it and use a temporary file? So first I tried using pickle.dump and passing the job_server the file handle. That way, memory use would remain low. But a file handle can't be pickled...

So, instead of passing the job_server a file handle, I just passed it a string with the name of the file. Since the file was a pickle dump of a numpy array, all the function had to do was unpickle it and do its thing.

So now I can dump chunks of my images to files and later load each chunk separately, keeping memory use low.
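
A minimal sketch of that approach, assuming each chunk is a NumPy array. The function names, the (23, 20, 20) chunk shape and the `mean` placeholder are illustrative only; the poster's actual code is not shown in the thread.

```python
import os
import pickle
import tempfile

import numpy as np


def dump_chunk(chunk, directory):
    """Pickle one array chunk to a new file and return the file's path."""
    fd, path = tempfile.mkstemp(suffix=".pkl", dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(chunk, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return path


def process_chunk(path):
    """Worker: receives only a path string, loads the chunk, and uses it."""
    with open(path, "rb") as handle:
        chunk = pickle.load(handle)
    return chunk.mean(axis=0)  # placeholder for the real per-pixel analysis


with tempfile.TemporaryDirectory() as workdir:
    chunk = np.arange(23 * 20 * 20, dtype=np.float64).reshape(23, 20, 20)
    chunk_path = dump_chunk(chunk, workdir)
    del chunk  # the data now lives on disk, not in this process
    result = process_chunk(chunk_path)
```

The worker's only argument is a string, which pickles fine, unlike an open file handle.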

This has been a good day for a self-taught, sloppy Python programmer ;D :D

Krb, thanks for your help


Title: Re: can't pickle PySwigObject objects
Post by: daniel_victoria on April 07, 2009, 12:12:06 PM
oops...
Spoke too soon.

What I'm trying to do is distribute the image processing across several jobs but, since my machine is not multi-core, I thought about using a cluster approach with several computers in the lab. The thing is that, to distribute the processing, I also have to send the data to the different machines, and just that step will consume more time than I first thought. So the bad news is that my cluster probably won't work the way I planned, unless I split the images into hundreds of pieces.
The good news is that, if I ever get a multicore processor, the parallelization is working :)
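
A rough back-of-the-envelope check of that trade-off. All numbers here are hypothetical (1000 MB of chunks, a 100 Mbit/s lab network, 4 machines), and the model is deliberately crude: it ignores protocol overhead and assumes perfect speed-up of the computation.

```python
def transfer_seconds(size_mb, link_mbit_per_s):
    """Ideal time to move size_mb megabytes over a link, ignoring overhead."""
    return size_mb * 8 / link_mbit_per_s


def worth_distributing(compute_s, size_mb, link_mbit_per_s, n_workers):
    """True if ideal parallel compute plus transfer beats running locally."""
    distributed = compute_s / n_workers + transfer_seconds(size_mb, link_mbit_per_s)
    return distributed < compute_s


seconds = transfer_seconds(1000, 100)              # 80.0 seconds just to ship the data
fast_job = worth_distributing(60, 1000, 100, 4)    # False: 15 + 80 > 60
slow_job = worth_distributing(600, 1000, 100, 4)   # True: 150 + 80 < 600
```

Under these assumptions, distributing only pays off when the computation is long compared with the time needed to move the data.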



Title: Re: can't pickle PySwigObject objects
Post by: krb on April 08, 2009, 03:18:31 AM
Why dump the pickle to a file and pass it?
pp does that already; I mean (I think) it uses 'dumps' and then passes that data around to the servers (local or on the grid).

So if you convert your gdal data into a numpy array, there is no need for the file-based approach.
I am, of course, assuming that the gdal-to-numpy conversion consumes most of the memory. So, whether you write it to a file or send it to a pp server, does it matter?


Title: Re: can't pickle PySwigObject objects
Post by: daniel_victoria on April 08, 2009, 04:48:39 AM
Maybe I'm doing something wrong but, when I convert the Gdal object to a numpy array, the entire array has to be held in memory, right? That would easily consume some gigabytes of RAM, since I'm working with a time series of images. By dumping a 3D array to a file, I'm storing it on the HD, not in RAM. At least that was my reasoning.

The problem is that originally I was sending a list of objects to the job_server and then the function called by each job would compose the 3d array. I changed the function and now I pass it a 3d numpy array but, in order to do that, I must read all gdal objects to compose the array. And there goes my RAM... So that is why I'm dumping the array on disk, to keep memory usage low.

I guess I'll mess around with the code and see if I can change it to something more sane. I tried passing a file name and that works IF I keep the processes on the same computer. On a cluster, that's a no-no, since the file name is sent to another machine but the file is not.


Title: Re: can't pickle PySwigObject objects
Post by: krb on April 08, 2009, 07:57:46 AM
OK, I'll break it down into steps and see if my understanding of your program is the same as yours.
  1. you read in a gdal object
  2. convert the gdal data, which is on disk, to a numpy array, which is in memory
  3. write the numpy data to a pickle file
  4. pass the pickle file to pp_server

OK, so if the above is correct, then what I am suggesting is:
  1. you read in a gdal object
  2. convert the gdal data, which is on disk, to a numpy array, which is in memory
  3. pass the numpy array to pp_server
  4. delete the numpy array just after you submit the job, to keep your memory usage to a minimum

So you submit a job and delete the numpy array. When pp is done sending the numpy array it will no longer refer to it, so Python will release the memory associated with it.

As you submit jobs sequentially, you will be dealing with one chunk of numpy array at a time (pp may still be referencing it, depending on how wide your pp server is).
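
The submit-then-delete idea can be sketched as a loop that never keeps more than a few chunks alive. This sketch swaps in the standard library's `concurrent.futures` for pp, and uses a thread pool only so that it runs anywhere. `make_chunks`, `analyse` and `run_bounded` are hypothetical names, and the lists stand in for freshly read arrays.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def make_chunks(n_chunks, chunk_len):
    """Generator: builds one chunk at a time instead of a full list."""
    for i in range(n_chunks):
        yield [float(i)] * chunk_len  # stand-in for a freshly read array


def analyse(chunk):
    return sum(chunk)  # placeholder for the real computation


def run_bounded(chunks, max_in_flight=2):
    """Submit chunks one by one, keeping at most max_in_flight pending."""
    results = []
    pending = set()
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for chunk in chunks:
            if len(pending) >= max_in_flight:
                # Block until at least one job finishes before reading more data.
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                results.extend(f.result() for f in done)
            pending.add(pool.submit(analyse, chunk))
            del chunk  # drop our own reference right after submitting
        results.extend(f.result() for f in pending)
    return sorted(results)


totals = run_bounded(make_chunks(n_chunks=5, chunk_len=10))
```

Here `max_in_flight` plays the role of "how wide your pp server is": it bounds how many chunks can be referenced by queued or running jobs at once.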


Title: Re: can't pickle PySwigObject objects
Post by: daniel_victoria on April 08, 2009, 03:22:59 PM
You understood the program correctly. There is just one thing missing. Each gdal object is about 200 MB and I work with a time series of them; that is, I have 23 gdal objects. So, to convert that to a numpy array I would need at least 200 MB * 23 of RAM (4600 MB)! That is why I chose to read the gdal objects in chunks and dump them to files. So each chunk is an array of smaller size (for instance: 23,20,20). Then I can pass each chunk to a pp job_server.
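
A sketch of how such spatial chunks could be enumerated, producing the (xi, yi, xsize, ysize) windows used in the function call from the first post. `iter_blocks` is a hypothetical helper and the 50 x 30 image size is made up for the example.

```python
def iter_blocks(width, height, block):
    """Yield (xi, yi, xsize, ysize) windows that tile a width x height image."""
    for yi in range(0, height, block):
        ysize = min(block, height - yi)  # last row of blocks may be shorter
        for xi in range(0, width, block):
            xsize = min(block, width - xi)  # last column may be narrower
            yield xi, yi, xsize, ysize


blocks = list(iter_blocks(width=50, height=30, block=20))
```

Edge blocks are clipped to the image, so the windows cover every pixel exactly once.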

I guess my problem is not so much the computation but the size of the dataset!
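
The arithmetic behind that, as a small sketch. The 23 images of about 200 MB come from the post above; the 8 bytes per value (float64) for a chunk is an assumption, since the thread does not state the array dtype.

```python
from math import prod


def array_bytes(shape, bytes_per_value):
    """Memory needed for a dense array of the given shape."""
    return prod(shape) * bytes_per_value


# Figures from the thread: 23 images of about 200 MB each.
full_series_mb = 23 * 200                    # 4600 MB
# One (23, 20, 20) chunk, assuming 8-byte float64 values.
chunk_bytes = array_bytes((23, 20, 20), 8)   # 73600 bytes, roughly 72 KiB
```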


Title: Re: can't pickle PySwigObject objects
Post by: krb on April 09, 2009, 02:47:10 AM
>> That is why I chose to read the gdal objects in chunks and dump them to files. So each chunk is an array of smaller size (for instance: 23,20,20)

So once you have dumped your gdal object to a file, you don't consume a lot of memory, right?
I think that, instead of dumping it to a file, you can send it to pp_server (try writing everything to a cStringIO file-like object and then passing its value).
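
A minimal sketch of that in-memory route. `cStringIO` is a Python 2 module; the sketch assumes Python 3, where `io.BytesIO` is the closest equivalent. `to_payload` and `from_payload` are hypothetical names, and the nested list stands in for a chunk of image data.

```python
import io
import pickle


def to_payload(obj):
    """Pickle obj into an in-memory buffer and return the raw bytes."""
    buffer = io.BytesIO()
    pickle.dump(obj, buffer, protocol=pickle.HIGHEST_PROTOCOL)
    return buffer.getvalue()


def from_payload(payload):
    """Worker side: rebuild the object from the bytes it was sent."""
    return pickle.load(io.BytesIO(payload))


chunk = [[float(r * 3 + c) for c in range(3)] for r in range(2)]
payload = to_payload(chunk)
restored = from_payload(payload)
```

`pickle.dumps(obj)` produces the same kind of byte string in one call, which is what the traceback in the first post shows pp doing with the job arguments.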

And since you submit jobs sequentially, you cannot be consuming a whole lot of memory.

I did something similar for my current project; the only difference is that my objects are not 200 MB each, but I have a lot of them, about 3000-20000.