Parallel Python
Author Topic: Problem with max file open  (Read 15235 times)
micerinos
« on: January 17, 2009, 08:51:13 AM »

Hi,

I'm using pp to run a huge number of simulations (a genetic algorithm), and in the middle of a run, after a couple of days, it either hangs the machine or breaks with this message:

Traceback (most recent call last):
  File "iorga.py", line 75, in <module>
  File "/home/jorge/dev/schoner/dftgenalg/ga_mod.py", line 506, in evolve
  File "iorga.py", line 17, in applyFitnessFunction
  File "/home/jorge/dev/schoner/dftgenalg/dftparallel.py", line 68, in run
  File "/home/jorge/dev/schoner/dftgenalg/dftparallel.py", line 24, in __init__
  File "/var/lib/python-support/python2.5/pp.py", line 282, in __init__
  File "/var/lib/python-support/python2.5/pp.py", line 437, in set_ncpus
  File "/var/lib/python-support/python2.5/pp.py", line 121, in __init__
  File "/usr/lib/python2.5/popen2.py", line 189, in popen3
  File "/usr/lib/python2.5/popen2.py", line 52, in __init__
OSError: [Errno 24] Too many open files
Exception exceptions.AttributeError: AttributeError("Popen3 instance has no attribute 'pid'",) in <bound method Popen3.__del__ of <popen2.Popen3 instance at 0x2bf8dd0>> ignored

Finally, I discovered that the problem is that when the server is destroyed (I do this because of a problem with ctypes that otherwise hangs the program with a mysterious double free), the pipes are not closed.
In another thread I read a suggestion to add self.r.close() and self.w.close() to the destroy method, which Vitalii said he would consider for inclusion.
In the code of pp.py there is a method, PPTransport.close(), which does close both pipes, but the destroy method never calls it, so the pipes exist forever. I found that changing this method solves my problem, so I wanted to share it. Is there a better way to accomplish this? The new code is:

 def destroy(self):
        """Kills ppworkers and closes open files"""
        self.__exiting = True
        self.__queue_lock.acquire()
        self.__queue = []
        self.__queue_lock.release()

        for worker in self.__workers:
            worker.t.exiting = True
            if sys.platform.startswith("win"):
                os.popen('TASKKILL /PID '+str(worker.pid)+' /F')
            else:
                try:
                    worker.t.close() # This is the new line
                    os.kill(worker.pid, 9)
                    os.waitpid(worker.pid, 0)
                except:
                    pass
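
For anyone who wants to check whether descriptors are still leaking after a change like this, a minimal helper can count the open file descriptors of the running process between generations. This is only a sketch, it is Linux-specific (it reads /proc/self/fd), and it is not part of pp itself:

Code:
import os
import resource

def open_fd_count():
    """Number of file descriptors currently open in this process (Linux only)."""
    return len(os.listdir("/proc/self/fd"))

# compare the current count against the process limit
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print "open fds:", open_fd_count(), "of soft limit", soft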

Thanks a lot for this nice program.




var
« Reply #1 on: December 02, 2009, 08:11:59 PM »

I also had this issue, though the workaround above was not viable because the product release requires the client to install pp via easy_install or similar. I ended up writing an emulation of the above code within the product, taking in the job server and shutting it down cleanly. Note that the Windows portion has been removed:

Code:
   def __jobServerDestroy(js):
        """
        Destroy the parallel python job server
        NOTE: works around a bug in the pp code that leaves FIFO pipes open
        """
        # access job server methods for shutting down cleanly
        js._Server__exiting = True
        js._Server__queue_lock.acquire()
        js._Server__queue = []
        js._Server__queue_lock.release()

        for worker in js._Server__workers:
            worker.t.exiting = True
            try:
                # add worker close()
                worker.t.close()
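                # note: os.kill with signal 0 does not terminate the worker; it only checks that the pid exists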
                os.kill(worker.pid, 0)
                os.waitpid(worker.pid, 0)
            except:
                # something nasty happened
                pass

Edit: This code seems to hang on Debian 6 (squeeze) with Python 2.6.6; removing this code and using pp 1.6.1 does seem to kill the worker processes.
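
Given that, one option is a small guard that applies the manual cleanup only on pp releases that still leak the pipes. This is just a sketch: it assumes pp's module-level version string (pp.version in the copies I have looked at) and the __jobServerDestroy helper above:

Code:
import pp

def shutdown_job_server(js):
    # pp.version is a plain string such as "1.6.1"
    major_minor = tuple(int(part) for part in pp.version.split(".")[:2])
    if major_minor >= (1, 6):
        js.destroy()             # recent releases close worker pipes themselves
    else:
        __jobServerDestroy(js)   # fall back to the manual cleanup above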
« Last Edit: July 12, 2011, 04:25:49 PM by var »
Vitalii
Global Moderator
« Reply #2 on: May 10, 2010, 01:29:03 AM »

(fixed in PP 1.6 RC4 or later)

JoeCM
« Reply #3 on: March 04, 2011, 07:10:14 AM »

Hi,
  I'm using the latest version and am getting the error as described above. The error messages read:

  File "process_vrs.py", line 91, in <module>
  File "/data/joe/pythonfiles/vrs/cube_processing.py", line 495, in basic_processing
  File "/data/joe/pythonfiles/vrs/cube_processing.py", line 328, in fix_bad_pixels
  File "/usr/local/lib/python2.6/dist-packages/pp.py", line 340, in __init__
  File "/usr/local/lib/python2.6/dist-packages/pp.py", line 503, in set_ncpus
  File "/usr/local/lib/python2.6/dist-packages/pp.py", line 143, in __init__
  File "/usr/local/lib/python2.6/dist-packages/pp.py", line 150, in start
  File "/usr/lib/python2.6/subprocess.py", line 633, in __init__
  File "/usr/lib/python2.6/subprocess.py", line 1039, in _execute_child
OSError: [Errno 24] Too many open files
coomteng
« Reply #4 on: March 19, 2011, 11:51:17 AM »

I am using 1.6.1 and still get IOError: [Errno 24] Too many open files. After destroying the job_server, the resources are not released and the number of open files keeps increasing. Only the remote server has this issue.
« Last Edit: March 20, 2011, 03:18:40 AM by coomteng »
Vitalii
Global Moderator
« Reply #5 on: April 12, 2011, 12:03:06 AM »

A small program which reproduces the issue quickly and deterministically would be helpful.

maahnman
« Reply #6 on: September 27, 2011, 02:39:50 AM »

I have the same problem with pp 1.6.1: the number of sockets keeps increasing until it hits the too-many-files limit. When I run this code with one client with 2 workers, it stops after approximately 140 iterations. At that point ls -l /proc/PID/fd shows more than 1000 open files.

Code:
import pp
import numpy
import time


def get_val():
    K = numpy.random.rand()
    time.sleep(0.1)
    return K

for j in numpy.arange(1000):
    print j
    job_server = pp.Server(0, ppservers=("*",), secret="test")

    jobs = []
    for i in xrange(2):
        jobs.append(job_server.submit(get_val, (), (), ("numpy", "time",)))
    print j, "all submitted"

    for i, job in enumerate(jobs):
        print j, i, job()
    job_server.destroy()
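
One way to avoid hitting the limit with a reproducer like this (a sketch that sidesteps the leak rather than fixing it) is to create the Server once and reuse the same instance for every iteration:

Code:
import pp
import numpy
import time

def get_val():
    K = numpy.random.rand()
    time.sleep(0.1)
    return K

# create the job server once instead of once per iteration
job_server = pp.Server(0, ppservers=("*",), secret="test")

for j in numpy.arange(1000):
    jobs = [job_server.submit(get_val, (), (), ("numpy", "time",)) for i in xrange(2)]
    for i, job in enumerate(jobs):
        print j, i, job()

job_server.destroy()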
derevo
« Reply #7 on: November 25, 2011, 06:42:57 PM »

Hi guys, I also have this problem. I am using pp 1.6.1.

My destroy method looks like this:

Code:
    def destroy(self):
        """Kills ppworkers and closes open files"""
        self._exiting = True
        self.__queue_lock.acquire()
        self.__queue = []
        self.__queue_lock.release()

        for worker in self.__workers:
            try:
                worker.t.close()
                if sys.platform.startswith("win"):
                    os.popen('TASKKILL /PID '+str(worker.pid)+' /F')
                else:
                    os.kill(worker.pid, 9)
                    os.waitpid(worker.pid, 0)
            except:
                pass


After a few hours of running, I see this on the workers:

#lsof -i | grep 35000 | wc -l
4753

All of these connections do not die until I restart ppserver:

Code:
python    27913   root  387u  IPv4 1032035      0t0  TCP domU-12-31-39-09-F8-1E.compute-1.internal:51788->domU-12-31-39-09-F8-1E.compute-1.internal:35000 (CLOSE_WAIT)
python    27913   root  388u  IPv4 1032037      0t0  TCP domU-12-31-39-09-F8-1E.compute-1.internal:44711->domU-12-31-39-0A-24-3D.compute-1.internal:35000 (CLOSE_WAIT)
python    27913   root  389u  IPv4 1032038      0t0  TCP domU-12-31-39-09-F8-1E.compute-1.internal:51790->domU-12-31-39-09-F8-1E.compute-1.internal:35000 (CLOSE_WAIT)
python    27913   root  390u  IPv4 1032040      0t0  TCP domU-12-31-39-09-F8-1E.compute-1.internal:55138->ip-10-242-229-254.ec2.internal:35000 (CLOSE_WAIT)
python    27913   root  391u  IPv4 1032041      0t0  TCP domU-12-31-39-09-F8-1E.compute-1.internal:44714->domU-12-31-39-0A-24-3D.compute-1.internal:35000 (CLOSE_WAIT)
python    27913   root  392u  IPv4 1032042      0t0  TCP domU-12-31-39-09-F8-1E.compute-1.internal:51793->domU-12-31-39-09-F8-1E.compute-1.internal:35000 (CLOSE_WAIT)
python    27913   root  393u  IPv4 1032044      0t0  TCP domU-12-31-39-09-F8-1E.compute-1.internal:55141->ip-10-242-229-254.ec2.internal:35000 (CLOSE_WAIT)


I need help with this.

Thanks a lot.
siberiancat
« Reply #8 on: March 26, 2012, 12:54:50 PM »

I understand better what's happening. My jobs are long lived, and the hardcoded 10 sec timeout setting in pptransport.py is too small.

As a result, the connection is dropped after 10 sec, the job still (!) runs to completion thus blocking the new incoming requests, the attempt to write back the result ends in an exception, and the socket sits in CLOSE_WAIT. Incoming requests keep piling up, waiting and timing out, too, adding to the sockets in CLOSE_WAIT.

Granted, there is a FAQ entry suggesting to change the variable. It is a workaround, but I got the pp module installed via the standard Ubuntu way, and hacking the source in /usr/share/sharedpy just does not feel right.
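
If the constant is read from pptransport's module globals at call time, one way to avoid editing the installed file is to monkey-patch it at program startup. This is only a sketch: SOCKET_TIMEOUT below is a placeholder, the actual name of the hardcoded timeout has to be checked in the installed pptransport.py, and if the timeout is applied on the ppserver side the patch would have to go into whatever script launches ppserver instead:

Code:
import pptransport

# Placeholder name: substitute whatever the 10-second timeout constant
# is actually called in your installed pptransport.py.
pptransport.SOCKET_TIMEOUT = 3600  # seconds; set well above the longest job

import pp
job_server = pp.Server(ppservers=("*",))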

I suggest making this setting a programmable parameter. The behavior when the timeout occurs and the connection is dropped also does not seem very graceful to me.

Great code though. Very simple to set up and use.
« Last Edit: March 26, 2012, 03:42:35 PM by siberiancat »