|
Title: Job Server stops sending jobs Post by: mickvdv on February 01, 2010, 01:40:03 PM Hi
I am buidling a prime finder for my PP cluster. And got into a problem; after a while the jobserver stops sending jobs and just keeps on waiting: mission.server.wait() While all the clients in the cluster are idle and dont get jobs. My script looks something like this: Code: loadLastPosition() from vars import servers server = pp.Server(ncpus=0, ppservers=servers, secret="XXXXXXXX") while not lastNumber == endNumber: loadJobs(1000) server.wait() saveCurrentPosition() mission.server.print_stats() print "Done" I get the idea that this occurs when some clients quit. Or a network connection (at client side) fails,.. Is there a fix for this? Thnx for the help Mickvdv EDIT: The jobserver has no local workers, ncpus = 0 Title: Re: Job Server stops sending jobs Post by: mickvdv on February 04, 2010, 08:09:49 AM I stil get this problem. This is the the program i wrote:
Code: #! /usr/bin/python # To change this template, choose Tools | Templates # and open the template in the editor. #! /usr/bin/python # To change this template, choose Tools | Templates # and open the template in the editor. import pp, time, sys, cPickle, thread, stats from millerrabin import isPrime from millerrabin import test from millerrabin import toBinary import logging def lowpriority(): """ Set the priority of the process to below-normal.""" import sys try: sys.getwindowsversion() except: isWindows = False else: isWindows = True if isWindows: # Based on: # "Recipe 496767: Set Process Priority In Windows" on ActiveState # http://code.activestate.com/recipes/496767/ import win32api,win32process,win32con pid = win32api.GetCurrentProcessId() handle = win32api.OpenProcess(win32con.PROCESS_ALL_ACCESS, True, pid) win32process.SetPriorityClass(handle, win32process.IDLE_PRIORITY_CLASS) else: import os os.nice(1) def highpriority(): """ Set the priority of the process to below-normal.""" import sys try: sys.getwindowsversion() except: isWindows = False else: isWindows = True if isWindows: # Based on: # "Recipe 496767: Set Process Priority In Windows" on ActiveState # http://code.activestate.com/recipes/496767/ import win32api,win32process,win32con pid = win32api.GetCurrentProcessId() handle = win32api.OpenProcess(win32con.PROCESS_ALL_ACCESS, True, pid) win32process.SetPriorityClass(handle, win32process.HIGH_PRIORITY_CLASS) else: import os os.nice(20) lowpriority() #import wmi #comp = wmi.WMI() def dictToTags(dictionary): output = str() for key in dictionary: output += "<"+str(key)+">"+str(dictionary[key])+"</"+str(key)+">" return output def getFreeMemory(comp): freeMem = int() for os in comp.Win32_OperatingSystem(): freeMem = float(os.FreePhysicalMemory) totalMem = float(os.TotalVisibleMemorySize) return int(float(freeMem / totalMem)*100) def tagsToDict(tags): # output = dict() while len(tags) > 0: #print tags key = tags[1:][:tags.find(">")-1] # DATA CORRUPTION TEST if (tags.find(">") == -1): return "DATAERROR" #print "KEY = "+key tmp = tags[tags.find(">")+1:] #print "TMP = "+tmp if (tmp.find("</"+key+">") == -1): return "DATAERROR" #print "TMP= "+ tmp #print "TMP FIND = "+str(tmp.find("</"+key+">")) value = tmp[:tmp.find("</"+key+">")] #print "VALUE = "+value #print "VALUE= "+ value try: # FOR USAGE IN THE KEY RECOVERY output[int(key)] = int(value) except: output[key] = value #print output tags = tags[tags.find("</"+key+">")+len("</"+key+">"):] return output def prime(start, stop, jobNumber): results = list() for x in range(start, stop): try: if isPrime(x): results.append(x) #self.outputMsg("FOUND A PRIME " + str(x)) except ValueError: pass return (results, jobNumber) class Control: def __init__(self): #self.work = dict() self.chunkNumber = 0 def startServer(self, servers): outputMsg("INITIALIZING SERVER") self.server = pp.Server(ncpus=1, ppservers=servers, secret="33qrtp4") outputMsg("SERVER STARTED") #def outputMsg(self, msg): # print +str(round(time.clock())) + "\tCONTROL:\t"+str(msg) def loadWork(self, WorkStart, WorkStop): #self.outputMsg("LOADING WORK INTO QUEUE") x = WorkStart while x < WorkStop: if x + self.chunkSize > WorkStop: y = WorkStop else: y = x + self.chunkSize #print "X="+str(x)+"Y="+str(y)+"chunkNumber="+str(chunkNumber) self.server.submit(prime, (x, y, self.chunkNumber), (isPrime, test, toBinary), ("random",), callback=dumpResults) #self.workDescription[self.chunkNumber] = x self.chunkNumber += 1 # DEBUG print [x, y] if not y == self.chunkSize+x: break else: x = y #print "chunkNumer="+str(self.chunkNumber) # self.outputMsg("LOADING DONE") def outputMsg(msg): txt = time.strftime("%H:%M@") + str(round(time.clock()))+"\t" +str(msg) f = open("messages.log", "a") f.write(txt+"\n") f.close() print txt def addMoreWork(number, mission): # startPoint = mission.lastPosition endPoint = startPoint + (mission.chunkSize*number) if endPoint > mission.endPosition: endPoint = mission.endPosition mission.loadWork(startPoint, endPoint) #outputMsg("startPoint="+str(startPoint)+" endPoint="+str(endPoint)) mission.lastPosition = endPoint def dumpResults(args): resultsDump, jobNumber = args resultsDump.sort() #print resultsDump #outputMsg('Start dumping the results') f = open("results/"+str(jobNumber)+".result", "w") #outputMsg('Opened') outputStr = str() for getal in resultsDump: outputStr += str(getal)+"\n" f.write(outputStr) f.close() # CLEAN UP THE MEMORY del outputStr del resultsDump #del mission.work[jobNumber] from vars import servers # RESETTING THE LOGFILE f = open("debug.log", "w") f.close() # LOAD THE "MISSION" mission = Control() # CONFIGURE THE MISSION mission.lastPosition = 13296000000 mission.endPosition = 1000000000000 mission.chunkSize = 20000 mission.numberOfChunks = (mission.endPosition - mission.lastPosition)/mission.chunkSize outputMsg("Number of Chunks "+str(mission.numberOfChunks)) # LOADING STATS mission.stats = stats.Statistics() mission.stats.loadStats() # TRY TO RESUME,... try: mission.lastPosition = cPickle.load(open('lastPosition.p', 'rb')) mission.chunkNumber = cPickle.load(open('chunkNumber.p', 'rb')) except: pass outputMsg("starting at "+str(mission.chunkNumber)+" total of "+str(mission.numberOfChunks)+" chunks") # UPDATING STATS mission.stats.endPosition = mission.endPosition mission.stats.chunkSize = mission.chunkSize mission.stats.currentPosition = mission.lastPosition # START THE SERVER #print servers mission.startServer(servers) #n = 0 # START RUNNING: numberToLoad = 2000 mission.stats.chunksLoaded = numberToLoad highpriority() try: while not mission.lastPosition == mission.endPosition: startTime = round(time.clock()) # LOAD NIEUW WORK outputMsg("DEBUG: LOADING WORK") thread.start_new_thread(addMoreWork,(numberToLoad, mission)) #f = open("aapje", "r") outputMsg("New jobs loaded! "+str(numberToLoad)) time.sleep(5) try: mission.server.print_stats() except: pass # OUTPUTTING THE STATISTICS outputMsg("DEBUG: STARTED OUTPUTTING THE STATISTICS") mission.stats.generateStats() mission.stats.printStats() mission.stats.outputStats() outputMsg("DEBUG: OUTPUT DONE, STARTED WAITING") # WAIT UNTIL THE JOB IS DONE mission.server.wait() outputMsg("DEBUG: WAITING DONE, STARTED CLEANING MEMORY") #del mission.work #mission.work = dict() outputMsg("DEBUG: CLEARING DONE!") # GENERATE D TIME dTime = round(time.clock()) - startTime # SAVE THE POSITIONS! cPickle.dump(mission.lastPosition, open('lastPosition.p', 'wb')) cPickle.dump(mission.chunkNumber, open('chunkNumber.p', 'wb')) outputMsg("DEBUG: SAVING DONE, GENERATE THE STATISTICS") # SAVING STATISTICS mission.stats.addTime(dTime) mission.stats.addData(mission.server.get_stats()) mission.stats.currentPosition = mission.lastPosition mission.stats.saveStats() outputMsg("DEBUG: STATISTICS DONE!") # Check the free memory #freeMem = getFreeMemory(comp) freeMem = "???" outputMsg("at "+str(mission.chunkNumber)+", "+str((float(mission.lastPosition)/mission.endPosition) * 100)+"% done, mem "+str(freeMem)+"%") except Exception, e: f = open("error.log", "a") f.write(str(e)+"\n") f.close() outputMsg(str(e)) outputMsg("DONE!!") While trying to fix this problem i changed ncpus back to 1. And i still get this error. What is wrong with my code, or is there some bug in the pp module? Gtz Mick Title: Re: Job Server stops sending jobs Post by: mickvdv on February 04, 2010, 11:35:13 AM Hi
I guess this happens when a client doesn't quit correctly (for example by shutting his pc down); I accidentally hit the reset button on one of my pc's and 1 minute later (i put the jobs to load on 2000), at the end of the 2000 jobs. The server fails; all the clients stop working and it keeps on "wait() ing". I hope you can fix this in the module, because my project kindda relies on PP. And it will take for ages if i must restart the server script every couple of hours. Also there is no way for me to secure that ll clients exit "correctly". Because it runs on i think 20 computers, of which 4 are in my house. And i one guy decides to end the ppserver.py script all the other computers are idle. EDIT: I now can confirm that this problem is caused by a non-correct shutdown. I reproduced it by unplugging my network cable on one of the clients and then shutdown ppserver, reconnect the cable and restart ppserver. The pc doesn't get any jobs, and soon after this event all the clients stop getting jobs: The server is waiting for a job to be finished at a client who isn't connected anymore. Is there a way to get the server to check if the client is still online. For example by some sort of ping system? And if the client is disconnected send the job out to another client. I ppserver is build for usage in clusters. But it is also a very usefull module for usage in distributed computing over several computers / network on the internet. Could this please be fixed? I am glad to help anyone with it, but i am not the best programmer in the world, and cant write a fix for this by myself. Thnx everyone. Title: Re: Job Server stops sending jobs Post by: mickvdv on February 05, 2010, 07:45:51 AM Hi
I wrote a simple piece of code that reproduces this problem Code: #! /usr/bin/python # Mick van der Vegt # mickvdv@gmail.com # This reproduces the bug i reported: # http://www.parallelpython.com/component/option,com_smf/Itemid,29/topic,419.0 import pp, logging pp._RECONNECT_WAIT_TIME = 60 N = 1000001 def reverse(l): """Calculates a sum""" l.reverse() return l # change this to your remote servers. servers = ( ) job_server = pp.Server(ncpus=0, loglevel=logging.DEBUG, ppservers=servers, proto=2) jobs = dict() print "Loading jobs" for x in range(0, 100): jobs[x] = job_server.submit(reverse, (range(N), )) job_server.print_stats() print "Loading done! Start waiting untill jobs are finished" # Here , unplug the network cable from 1 of the remote workers, after a while you will see the message "Task ... failed due of broken network connection, rescheduling" # The task gets reloaded and executed but after all the jobs are finished, the server keeps on waiting. And the clients are idle job_server.wait() print "Finished!" job_server.print_stats() Maybe this will help solving it Grtz mick Title: Re: Job Server stops sending jobs Post by: Vitalii on February 06, 2010, 08:53:18 AM Here is the exact code I ran:
Code: (python) #!/usr/bin/python Initially there was as server at localhost, in the middle of execution I kill the server, then bring it back. Here is the log: Code: (apache) python hangs.py As you can see client completes. Could you please attach you log for the case when it hangs? Title: Re: Job Server stops sending jobs Post by: mickvdv on February 06, 2010, 03:09:52 PM Does this also happen when you disconnect the network cable?
Code: Loading jobs 2010-02-07 00:05:48,634 DEBUG Creating server instance (pp-1.5.7) 2010-02-07 00:05:48,644 INFO pp local server started with 0 workers 2010-02-07 00:05:48,638 DEBUG Creating Rworker id=home.schutijser.com:60000 persistent=True 2010-02-07 00:05:48,638 DEBUG Creating Rworker id=home.schutijser.com:60002 persistent=True 2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60000 persistent=True 2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60004 persistent=True 2010-02-07 00:05:48,642 DEBUG Creating Rworker id=77.109.74.145:60001 persistent=True 2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60006 persistent=True 2010-02-07 00:05:48,644 DEBUG Creating Rworker id=94.209.26.156:60001 persistent=True 2010-02-07 00:05:48,644 DEBUG Creating Rworker id=vulpix.physgaming.nl:60010 persistent=True 2010-02-07 00:05:48,644 DEBUG Creating Rworker id=83.82.231.248:60003 persistent=True 2010-02-07 00:05:48,642 DEBUG Creating Rworker id=77.109.74.145:60002 persistent=True 2010-02-07 00:05:48,638 DEBUG Creating Rworker id=home.schutijser.com:60001 persistent=True 2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60002 persistent=True 2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60007 persistent=True 2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60008 persistent=True 2010-02-07 00:05:48,644 DEBUG Creating Rworker id=94.209.26.156:60000 persistent=True 2010-02-07 00:05:48,644 DEBUG Creating Rworker id=83.82.231.248:60001 persistent=True 2010-02-07 00:05:48,644 DEBUG Creating Rworker id=83.82.231.248:60002 persistent=True 2010-02-07 00:05:48,644 DEBUG Creating Rworker id=192.168.0.101:60000 persistent=True 2010-02-07 00:05:48,644 DEBUG Creating Rworker id=192.168.0.201:60000 persistent=True 2010-02-07 00:05:48,642 DEBUG Creating Rworker id=77.109.74.145:60000 persistent=True 2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60001 persistent=True 2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60003 persistent=True 2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60009 persistent=True 2010-02-07 00:05:48,644 DEBUG Creating Rworker id=94.209.29.155:60000 persistent=True 2010-02-07 00:05:48,644 DEBUG Creating Rworker id=192.168.0.6:60000 persistent=True 2010-02-07 00:05:48,644 DEBUG Creating Rworker id=192.168.0.202:60000 persistent=True 2010-02-07 00:05:48,644 DEBUG Creating Rworker id=192.168.0.203:60000 persistent=True 2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60005 persistent=True 2010-02-07 00:05:48,644 DEBUG Creating Rworker id=83.82.231.248:60000 persistent=True 2010-02-07 00:05:48,644 DEBUG Creating Rworker id=192.168.0.5:60000 persistent=True 2010-02-07 00:05:48,644 DEBUG Creating Rworker id=192.168.0.200:60000 persistent=True 2010-02-07 00:05:48,786 DEBUG Task 0 submited, function='reverse' 2010-02-07 00:05:48,838 DEBUG Creating Rworker id=192.168.0.101:60000 persistent=True 2010-02-07 00:05:49,055 DEBUG Task 1 submited, function='reverse' 2010-02-07 00:05:49,173 DEBUG Task 2 submited, function='reverse' 2010-02-07 00:05:49,201 DEBUG Creating Rworker id=192.168.0.101:60000 persistent=True 2010-02-07 00:05:49,201 DEBUG Task (remote) 0 started 2010-02-07 00:05:49,816 DEBUG Creating Rworker id=vulpix.physgaming.nl:60000 persistent=True 2010-02-07 00:05:49,963 DEBUG Task 8 submited, function='reverse' 2010-02-07 00:05:49,963 DEBUG Creating Rworker id=77.109.74.145:60000 persistent=True 2010-02-07 00:05:49,963 DEBUG Creating Rworker id=192.168.0.101:60000 persistent=True 2010-02-07 00:05:50,085 DEBUG Creating Rworker id=192.168.0.200:60000 persistent=True 2010-02-07 00:05:50,085 DEBUG Failed to reconnect with (host=192.168.0.6, port=60000), will try again in 60 s 2010-02-07 00:05:50,085 DEBUG Task 9 submited, function='reverse' 2010-02-07 00:05:49,994 DEBUG Task (remote) 1 started 2010-02-07 00:05:50,213 DEBUG Task 10 submited, function='reverse' 2010-02-07 00:05:50,213 DEBUG Failed to reconnect with (host=192.168.0.5, port=60000), will try again in 60 s 2010-02-07 00:05:50,365 DEBUG Task 11 submited, function='reverse' 2010-02-07 00:05:50,401 DEBUG Creating Rworker id=192.168.0.101:60000 persistent=True 2010-02-07 00:05:50,502 DEBUG Task 12 submited, function='reverse' 2010-02-07 00:06:07,647 DEBUG Task (remote) 32 started job count | % of all jobs | job time sum | time per job | job server 4 | 12.12 | 5.0620 | 1.265500 | 192.168.0.200:60000 4 | 12.12 | 4.5780 | 1.144500 | 77.109.74.145:60000 4 | 12.12 | 0.0000 | 0.000000 | vulpix.physgaming.nl:60000 21 | 63.64 | 28.8920 | 1.375810 | 192.168.0.101:60000 Time elapsed since server creation 16.8490002155 Loading done! Start waiting untill jobs are finished 2010-02-07 00:06:07,880 DEBUG Task (remote) 29 ended 2010-02-07 00:06:07,881 DEBUG Task (remote) 33 started 2010-02-07 00:06:08,802 DEBUG Task (remote) 32 ended 2010-02-07 00:06:08,805 DEBUG Task (remote) 31 ended 2010-02-07 00:06:08,805 DEBUG Task (remote) 34 started 2010-02-07 00:06:08,808 DEBUG Task (remote) 35 started 2010-02-07 00:06:09,719 DEBUG Failed to reconnect with (host=83.82.231.248, port=60003), will try again in 60 s 2010-02-07 00:06:09,719 DEBUG Failed to reconnect with (host=home.schutijser.com, port=60002), will try again in 60 s 2010-02-07 00:06:09,719 DEBUG Failed to reconnect with (host=94.209.26.156, port=60001), will try again in 60 s 2010-02-07 00:06:09,727 DEBUG Task (remote) 33 ended 2010-02-07 00:06:09,730 DEBUG Task (remote) 36 started 2010-02-07 00:06:09,743 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60006), will try again in 60 s 2010-02-07 00:06:09,743 DEBUG Failed to reconnect with (host=77.109.74.145, port=60002), will try again in 60 s 2010-02-07 00:06:09,743 DEBUG Failed to reconnect with (host=home.schutijser.com, port=60000), will try again in 60 s 2010-02-07 00:06:09,743 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60004), will try again in 60 s 2010-02-07 00:06:09,743 DEBUG Failed to reconnect with (host=77.109.74.145, port=60001), will try again in 60 s 2010-02-07 00:06:09,743 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60010), will try again in 60 s 2010-02-07 00:06:09,799 DEBUG Task (remote) 30 ended 2010-02-07 00:06:09,806 DEBUG Task (remote) 37 started 2010-02-07 00:06:09,819 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60007), will try again in 60 s 2010-02-07 00:06:09,828 DEBUG Task (remote) 27 ended 2010-02-07 00:06:09,828 DEBUG Task (remote) 38 started 2010-02-07 00:06:09,848 DEBUG Failed to reconnect with (host=home.schutijser.com, port=60001), will try again in 60 s 2010-02-07 00:06:09,848 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60008), will try again in 60 s 2010-02-07 00:06:09,848 DEBUG Failed to reconnect with (host=83.82.231.248, port=60001), will try again in 60 s 2010-02-07 00:06:09,848 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60002), will try again in 60 s 2010-02-07 00:06:09,848 DEBUG Failed to reconnect with (host=94.209.26.156, port=60000), will try again in 60 s 2010-02-07 00:06:09,848 DEBUG Failed to reconnect with (host=83.82.231.248, port=60002), will try again in 60 s 2010-02-07 00:06:09,918 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60003), will try again in 60 s 2010-02-07 00:06:09,918 DEBUG Failed to reconnect with (host=192.168.0.201, port=60000), will try again in 60 s 2010-02-07 00:06:09,918 DEBUG Failed to reconnect with (host=192.168.0.203, port=60000), will try again in 60 s 2010-02-07 00:06:09,918 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60009), will try again in 60 s 2010-02-07 00:06:09,918 DEBUG Failed to reconnect with (host=94.209.29.155, port=60000), will try again in 60 s 2010-02-07 00:06:09,947 DEBUG Failed to reconnect with (host=192.168.0.202, port=60000), will try again in 60 s 2010-02-07 00:06:09,947 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60005), will try again in 60 s 2010-02-07 00:06:10,019 DEBUG Failed to reconnect with (host=83.82.231.248, port=60000), will try again in 60 s 2010-02-07 00:06:10,760 DEBUG Task (remote) 35 ended 2010-02-07 00:06:10,762 DEBUG Task (remote) 39 started 2010-02-07 00:06:10,776 DEBUG Task (remote) 34 ended 2010-02-07 00:06:10,786 DEBUG Task (remote) 40 started 2010-02-07 00:06:11,746 DEBUG Task (remote) 37 ended 2010-02-07 00:06:11,749 DEBUG Task (remote) 41 started 2010-02-07 00:06:11,852 DEBUG Task (remote) 36 ended 2010-02-07 00:06:11,855 DEBUG Task (remote) 42 started 2010-02-07 00:06:12,571 DEBUG Task (remote) 39 ended 2010-02-07 00:06:12,573 DEBUG Task (remote) 43 started 2010-02-07 00:06:12,782 DEBUG Task (remote) 40 ended 2010-02-07 00:06:12,786 DEBUG Task (remote) 44 started 2010-02-07 00:06:13,479 DEBUG Task (remote) 41 ended 2010-02-07 00:06:13,480 DEBUG Task (remote) 45 started 2010-02-07 00:06:13,678 DEBUG Task (remote) 42 ended 2010-02-07 00:06:13,681 DEBUG Task (remote) 46 started 2010-02-07 00:06:36,496 DEBUG Task (remote) 97 started 2010-02-07 00:06:36,542 DEBUG Task (remote) 95 ended 2010-02-07 00:06:36,546 DEBUG Task (remote) 98 started 2010-02-07 00:06:36,989 DEBUG Task (remote) 92 ended 2010-02-07 00:06:36,993 DEBUG Task (remote) 99 started 2010-02-07 00:06:37,470 DEBUG Task (remote) 93 ended 2010-02-07 00:06:39,384 DEBUG Task (remote) 98 ended 2010-02-07 00:06:39,515 DEBUG Task (remote) 97 ended 2010-02-07 00:06:39,529 DEBUG Task (remote) 99 ended 2010-02-07 00:06:39,529 DEBUG Task (remote) 51 ended 2010-02-07 00:06:51,092 DEBUG Failed to reconnect with (host=192.168.0.6, port=60000), will try again in 60 s 2010-02-07 00:06:51,144 DEBUG Task (remote) 96 ended 2010-02-07 00:06:51,372 DEBUG Failed to reconnect with (host=192.168.0.5, port=60000), will try again in 60 s 2010-02-07 00:06:52,221 DEBUG Task (remote) 94 ended 2010-02-07 00:06:52,595 DEBUG Task (remote) 91 ended 2010-02-07 00:06:56,309 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60001), will try again in 60 s Btw: I am running on python 2.6.4 maybe that makes a difference. Soon i will try the same code with python 2.5. Grtz Mick P.S. i modified the log a little bit because it was over 20000 characters. Title: Re: Job Server stops sending jobs Post by: mickvdv on February 09, 2010, 01:45:39 PM Is still get the problem at py2.5 (job server side). I guess that the module is the problem.
Could this please be fixed because i have to reboot the jobserver every couple of hours and when it happens at night lots of "free" computing hours get lost. P.S.: this only happens when i disconnect my network cable, when i kill the process everything goes fine. I guess that when i disconnect the cable the socket doesn't get closed correctly. Mick Title: Re: Job Server stops sending jobs Post by: Vitalii on February 10, 2010, 03:12:26 AM I was able to reproduce it, but not sure about the cause yet.
So far I've seen it only to happens on program exit. In this case workaround is simple, call Code: (python) os._exit(0) Title: Re: Job Server stops sending jobs Post by: mickvdv on February 10, 2010, 08:36:43 AM That is good news:D I hope you solve it soon:D
Where exactly do i have to put os._exit(0) in my program. At the point where i call the callback function? Or when i return the results? Thnx for the help Mick Title: Re: Job Server stops sending jobs Post by: mickvdv on February 10, 2010, 08:52:23 AM I got an idea, isn't there some way to build in a TTL (time to live) for a job. So that realy slow clients and lost connections and other failures dont slow down the project? For instance if i run ppserver on my desktop, at low priority while encoding movies. My pc gets very slow and jobs take minutes to complete, if i set the ttl on 60 seconds that doesnt matter anymore.
Maybe you could build somethink like that into pp. Grtz Mick Title: Re: Job Server stops sending jobs Post by: Vitalii on February 10, 2010, 09:59:19 PM Mick,
if you do not change pp._RECONNECT_WAIT_TIME value, does it ever hang for more then 5 sec after the completion? Title: Re: Job Server stops sending jobs Post by: mickvdv on February 11, 2010, 08:26:47 AM Vitalli
It also happens to me when i put a "#" before pp._RECONNECT_WAIT_TIME = 60. Do you have any luck finding the bug? Gtrz Mick Title: Re: Job Server stops sending jobs Post by: Vitalii on February 11, 2010, 11:44:12 PM This is a combination of how tcp socket works and how threads are created by thread module.
If network cable is disconnected tcp socket hangs the thread it is in and that thread can survive the exit of the main thread. The good news is that in PP 1.6 it won't be an issue anymore. Would you be interested to test the Release Candidate for PP 1.6? Title: Re: Job Server stops sending jobs Post by: mickvdv on February 12, 2010, 12:59:56 AM Yes, shure:D When does it get released?
Title: Re: Job Server stops sending jobs Post by: Vitalii on February 14, 2010, 12:50:57 AM Probably by the end of this month.
Title: Re: Job Server stops sending jobs Post by: Vitalii on February 18, 2010, 10:21:01 PM The issue is reported is solved in PP 1.6:
http://www.parallelpython.com/component/option,com_smf/Itemid,29/topic,424.0 |