Parallel Python Community Forums

Python Forums => Parallel Python Forum => Topic started by: mickvdv on February 01, 2010, 01:40:03 PM



Title: Job Server stops sending jobs
Post by: mickvdv on February 01, 2010, 01:40:03 PM
Hi

I am buidling a prime finder for my PP cluster. And got into a problem; after a while the jobserver stops sending jobs and just keeps on waiting:

mission.server.wait()

While all the clients in the cluster are idle and dont get jobs.

My script looks something like this:

Code:
loadLastPosition()
from vars import servers

server = pp.Server(ncpus=0, ppservers=servers, secret="XXXXXXXX")

while not lastNumber == endNumber:
     loadJobs(1000)
     server.wait()

     saveCurrentPosition()
     mission.server.print_stats()

print "Done"

I get the idea that this occurs when some clients quit. Or a network connection (at client side) fails,..

Is there a fix for this?

Thnx for the help

Mickvdv

EDIT: The jobserver has no local workers, ncpus = 0




Title: Re: Job Server stops sending jobs
Post by: mickvdv on February 04, 2010, 08:09:49 AM
I stil get this problem. This is the the program i wrote:

Code:
#! /usr/bin/python

# To change this template, choose Tools | Templates
# and open the template in the editor.

#! /usr/bin/python

# To change this template, choose Tools | Templates
# and open the template in the editor.

import pp, time, sys, cPickle, thread, stats
from millerrabin import isPrime
from millerrabin import test
from millerrabin import toBinary
import logging



def lowpriority():
    """ Set the priority of the process to below-normal."""

    import sys
    try:
        sys.getwindowsversion()
    except:
        isWindows = False
    else:
        isWindows = True

    if isWindows:
        # Based on:
        #   "Recipe 496767: Set Process Priority In Windows" on ActiveState
        #   http://code.activestate.com/recipes/496767/
        import win32api,win32process,win32con

        pid = win32api.GetCurrentProcessId()
        handle = win32api.OpenProcess(win32con.PROCESS_ALL_ACCESS, True, pid)
        win32process.SetPriorityClass(handle, win32process.IDLE_PRIORITY_CLASS)
    else:
        import os

        os.nice(1)

def highpriority():
    """ Set the priority of the process to below-normal."""

    import sys
    try:
        sys.getwindowsversion()
    except:
        isWindows = False
    else:
        isWindows = True

    if isWindows:
        # Based on:
        #   "Recipe 496767: Set Process Priority In Windows" on ActiveState
        #   http://code.activestate.com/recipes/496767/
        import win32api,win32process,win32con

        pid = win32api.GetCurrentProcessId()
        handle = win32api.OpenProcess(win32con.PROCESS_ALL_ACCESS, True, pid)
        win32process.SetPriorityClass(handle, win32process.HIGH_PRIORITY_CLASS)
    else:
        import os

        os.nice(20)


lowpriority()

#import wmi
#comp = wmi.WMI()

def dictToTags(dictionary):
    output = str()
    for key in dictionary:
        output += "<"+str(key)+">"+str(dictionary[key])+"</"+str(key)+">"
    return output

def getFreeMemory(comp):

    freeMem = int()
    for os in comp.Win32_OperatingSystem():
        freeMem = float(os.FreePhysicalMemory)
        totalMem = float(os.TotalVisibleMemorySize)
    return int(float(freeMem / totalMem)*100)

def tagsToDict(tags):
    #
    output = dict()

    while len(tags) > 0:
        #print tags
        key = tags[1:][:tags.find(">")-1]
        # DATA CORRUPTION TEST
        if (tags.find(">") == -1):
            return  "DATAERROR"
        #print "KEY = "+key
        tmp = tags[tags.find(">")+1:]
        #print "TMP = "+tmp

        if (tmp.find("</"+key+">") == -1):
           return "DATAERROR"

        #print "TMP= "+ tmp
        #print "TMP FIND = "+str(tmp.find("</"+key+">"))
        value = tmp[:tmp.find("</"+key+">")]
        #print "VALUE = "+value
        #print "VALUE= "+ value

        try:

            # FOR USAGE IN THE KEY RECOVERY
            output[int(key)] = int(value)
        except:
            output[key] = value
        #print output
        tags = tags[tags.find("</"+key+">")+len("</"+key+">"):]
    return output

def prime(start, stop, jobNumber):
    results = list()
    for x in range(start, stop):
        try:
            if isPrime(x):
                results.append(x)
                        #self.outputMsg("FOUND A PRIME " + str(x))
        except ValueError:
            pass
    return (results, jobNumber)




class Control:
    def __init__(self):
        #self.work = dict()
        self.chunkNumber = 0

    def startServer(self, servers):
        outputMsg("INITIALIZING SERVER")
        self.server = pp.Server(ncpus=1, ppservers=servers, secret="33qrtp4")
        outputMsg("SERVER STARTED")

    #def outputMsg(self, msg):
    #    print +str(round(time.clock())) + "\tCONTROL:\t"+str(msg)

    def loadWork(self, WorkStart, WorkStop):
        #self.outputMsg("LOADING WORK INTO QUEUE")
        x = WorkStart
        while x < WorkStop:
            if x + self.chunkSize > WorkStop:
                y = WorkStop
            else:
                y = x + self.chunkSize
            #print "X="+str(x)+"Y="+str(y)+"chunkNumber="+str(chunkNumber)
            self.server.submit(prime, (x, y, self.chunkNumber), (isPrime, test, toBinary), ("random",), callback=dumpResults)

            #self.workDescription[self.chunkNumber] = x
            self.chunkNumber += 1
            # DEBUG print [x, y]

            if not y == self.chunkSize+x:
                break
            else:
                x = y
       #print "chunkNumer="+str(self.chunkNumber)
       # self.outputMsg("LOADING DONE")
def outputMsg(msg):
    txt = time.strftime("%H:%M@") + str(round(time.clock()))+"\t" +str(msg)
    f = open("messages.log", "a")
    f.write(txt+"\n")
    f.close()
    print txt


def addMoreWork(number, mission):
    #
    startPoint = mission.lastPosition
    endPoint = startPoint + (mission.chunkSize*number)
    if endPoint > mission.endPosition:
        endPoint = mission.endPosition

    mission.loadWork(startPoint, endPoint)
    #outputMsg("startPoint="+str(startPoint)+" endPoint="+str(endPoint))
    mission.lastPosition = endPoint

def dumpResults(args):
    resultsDump, jobNumber = args
    resultsDump.sort()
    #print resultsDump
    #outputMsg('Start dumping the results')
    f = open("results/"+str(jobNumber)+".result", "w")
    #outputMsg('Opened')
    outputStr = str()
    for getal in resultsDump:
        outputStr += str(getal)+"\n"
    f.write(outputStr)
    f.close()

    # CLEAN UP THE MEMORY
    del outputStr
    del resultsDump
    #del mission.work[jobNumber]


from vars import servers

# RESETTING THE LOGFILE
f = open("debug.log", "w")
f.close()
# LOAD THE "MISSION"
mission = Control()


# CONFIGURE THE MISSION
mission.lastPosition = 13296000000
mission.endPosition = 1000000000000
mission.chunkSize = 20000
mission.numberOfChunks = (mission.endPosition - mission.lastPosition)/mission.chunkSize
outputMsg("Number of Chunks "+str(mission.numberOfChunks))

# LOADING STATS
mission.stats = stats.Statistics()
mission.stats.loadStats()



# TRY TO RESUME,...
try:
    mission.lastPosition = cPickle.load(open('lastPosition.p', 'rb'))
    mission.chunkNumber = cPickle.load(open('chunkNumber.p', 'rb'))
except:
    pass

outputMsg("starting at "+str(mission.chunkNumber)+" total of "+str(mission.numberOfChunks)+" chunks")

# UPDATING STATS
mission.stats.endPosition = mission.endPosition
mission.stats.chunkSize = mission.chunkSize
mission.stats.currentPosition = mission.lastPosition

# START THE SERVER
#print servers
mission.startServer(servers)

#n = 0
# START RUNNING:
numberToLoad = 2000
mission.stats.chunksLoaded = numberToLoad
highpriority()
try:
    while not mission.lastPosition == mission.endPosition:
       
        startTime = round(time.clock())
        # LOAD NIEUW WORK
        outputMsg("DEBUG: LOADING WORK")
        thread.start_new_thread(addMoreWork,(numberToLoad, mission))

        #f = open("aapje", "r")
        outputMsg("New jobs loaded! "+str(numberToLoad))
        time.sleep(5)

        try:
            mission.server.print_stats()
        except:
            pass

        # OUTPUTTING THE STATISTICS
        outputMsg("DEBUG: STARTED OUTPUTTING THE STATISTICS")
        mission.stats.generateStats()
        mission.stats.printStats()
        mission.stats.outputStats()


        outputMsg("DEBUG: OUTPUT DONE, STARTED WAITING")
        # WAIT UNTIL THE JOB IS DONE
        mission.server.wait()
        outputMsg("DEBUG: WAITING DONE, STARTED CLEANING MEMORY")
        #del mission.work
        #mission.work = dict()
        outputMsg("DEBUG: CLEARING DONE!")

        # GENERATE D TIME
        dTime = round(time.clock()) - startTime
        # SAVE THE POSITIONS!

        cPickle.dump(mission.lastPosition, open('lastPosition.p', 'wb'))
        cPickle.dump(mission.chunkNumber, open('chunkNumber.p', 'wb'))

        outputMsg("DEBUG: SAVING DONE, GENERATE THE STATISTICS")
        # SAVING STATISTICS
        mission.stats.addTime(dTime)
        mission.stats.addData(mission.server.get_stats())
        mission.stats.currentPosition = mission.lastPosition
        mission.stats.saveStats()
        outputMsg("DEBUG: STATISTICS DONE!")

        # Check the free memory
        #freeMem = getFreeMemory(comp)
        freeMem = "???"

        outputMsg("at "+str(mission.chunkNumber)+", "+str((float(mission.lastPosition)/mission.endPosition) * 100)+"% done, mem "+str(freeMem)+"%")

   
except Exception, e:
    f = open("error.log", "a")
    f.write(str(e)+"\n")
    f.close()
    outputMsg(str(e))

outputMsg("DONE!!")

While trying to fix this problem i changed ncpus back to 1. And i still get this error.

What is wrong with my code, or is there some bug in the pp module?

Gtz Mick


Title: Re: Job Server stops sending jobs
Post by: mickvdv on February 04, 2010, 11:35:13 AM
Hi

I guess this happens when a client doesn't quit correctly (for example by shutting his pc down); I accidentally hit the reset button on one of my pc's and 1 minute later (i put the jobs to load on 2000), at the end of the 2000 jobs. The server fails; all the clients stop working and it keeps on "wait() ing".

I hope you can fix this in the module, because my project kindda relies on PP. And it will take for ages if i must restart the server script every couple of hours.

Also there is no way for me to secure that ll clients exit "correctly". Because it runs on i think 20 computers, of which 4 are in my house. And i one guy decides to end the ppserver.py script all the other computers are idle.

EDIT:

I now can confirm that this problem is caused by a non-correct shutdown. I reproduced it by unplugging my network cable on one of the clients and then shutdown ppserver, reconnect the cable and restart ppserver. The pc doesn't get any jobs, and soon after this event all the clients stop getting jobs: The server is waiting for a job to be finished at a client who isn't connected anymore.

Is there a way to get the server to check if the client is still online. For example by some sort of ping system? And if the client is disconnected send the job out to another client.

I ppserver is build for usage in clusters. But it is also a very usefull module for usage in distributed computing over several computers / network on the internet. Could this please be fixed? I am glad to help anyone with it, but i am not the best programmer in the world, and cant write a fix for this by myself.

Thnx everyone.


Title: Re: Job Server stops sending jobs
Post by: mickvdv on February 05, 2010, 07:45:51 AM
Hi

I wrote a simple piece of code that reproduces this problem

Code:
#! /usr/bin/python

# Mick van der Vegt
# mickvdv@gmail.com
# This reproduces the bug i reported:
# http://www.parallelpython.com/component/option,com_smf/Itemid,29/topic,419.0
import pp, logging

pp._RECONNECT_WAIT_TIME = 60

N = 1000001

def reverse(l):
    """Calculates a sum"""
    l.reverse()
    return l

# change this to your remote servers.
servers = ( )

job_server = pp.Server(ncpus=0, loglevel=logging.DEBUG, ppservers=servers, proto=2)
jobs = dict()

print "Loading jobs"

for x in range(0, 100):
    jobs[x] = job_server.submit(reverse, (range(N), ))

job_server.print_stats()
print "Loading done! Start waiting untill jobs are finished"

# Here , unplug the network cable from 1 of the remote workers, after a while you will see the message "Task ... failed due of broken network connection, rescheduling"
# The task gets reloaded and executed but after all the jobs are finished, the server keeps on waiting. And the clients are idle

job_server.wait()
print "Finished!"

job_server.print_stats()

Maybe this will help solving it

Grtz mick


Title: Re: Job Server stops sending jobs
Post by: Vitalii on February 06, 2010, 08:53:18 AM
Here is the exact code I ran:
Code: (python)
#!/usr/bin/python               
 
# Mick van der Vegt
# mickvdv@gmail.com
# This reproduces the bug i reported:
# http://www.parallelpython.com/component/option,com_smf/Itemid,29/topic,419.0
import pp, logging
 
 
N = 1000001
 
def reverse(l):
   """Calculates a sum"""
   l.reverse()
   return l
 
# change this to your remote servers.
servers = ("127.0.0.1",)
 
job_server = pp.Server(ncpus=0, loglevel=logging.DEBUG, ppservers=servers, proto=2)
jobs = dict()
 
print "Loading jobs"
 
for x in range(0, 20):
   jobs[x] = job_server.submit(reverse, (range(N), ))
 
job_server.print_stats()
print "Loading done! Start waiting untill jobs are finished"
 
# Here , unplug the network cable from 1 of the remote workers, after a while you will see the message "Task ... failed due of broken network connection, rescheduling"
# The task gets reloaded and executed but after all the jobs are finished, the server keeps on waiting. And the clients are idle
 
job_server.wait()
print "Finished!"
 
job_server.print_stats()

Initially there was as server at localhost, in the middle of execution I kill the server, then bring it back.
Here is the log:
Code: (apache)
python hangs.py
00232323232                        
Loading jobs                      
2010-02-06 08:43:33,699 DEBUG Task 0 submited, function='reverse'
DEBUG:com.parallelpython:Task 0 submited, function='reverse'    
2010-02-06 08:43:33,769 DEBUG Task 1 submited, function='reverse'
DEBUG:com.parallelpython:Task 1 submited, function='reverse'    
2010-02-06 08:43:33,835 DEBUG Task 2 submited, function='reverse'
DEBUG:com.parallelpython:Task 2 submited, function='reverse'    
2010-02-06 08:43:33,899 DEBUG Task 3 submited, function='reverse'
DEBUG:com.parallelpython:Task 3 submited, function='reverse'    
2010-02-06 08:43:33,911 DEBUG Task (remote) 0 started            
DEBUG:com.parallelpython:Task (remote) 0 started                
2010-02-06 08:43:33,970 DEBUG Task 4 submited, function='reverse'
DEBUG:com.parallelpython:Task 4 submited, function='reverse'    
2010-02-06 08:43:34,002 DEBUG Task (remote) 1 started            
DEBUG:com.parallelpython:Task (remote) 1 started                
2010-02-06 08:43:34,052 DEBUG Task 5 submited, function='reverse'
DEBUG:com.parallelpython:Task 5 submited, function='reverse'    
2010-02-06 08:43:34,187 DEBUG Task (remote) 2 started            
DEBUG:com.parallelpython:Task (remote) 2 started                
2010-02-06 08:43:34,195 DEBUG Task 6 submited, function='reverse'
DEBUG:com.parallelpython:Task 6 submited, function='reverse'    
2010-02-06 08:43:34,230 DEBUG Task (remote) 3 started            
DEBUG:com.parallelpython:Task (remote) 3 started                
2010-02-06 08:43:34,286 DEBUG Task 7 submited, function='reverse'
DEBUG:com.parallelpython:Task 7 submited, function='reverse'    
2010-02-06 08:43:34,375 DEBUG Task 8 submited, function='reverse'
DEBUG:com.parallelpython:Task 8 submited, function='reverse'    
2010-02-06 08:43:34,523 DEBUG Task 9 submited, function='reverse'
DEBUG:com.parallelpython:Task 9 submited, function='reverse'    
2010-02-06 08:43:34,550 DEBUG Task (remote) 4 started            
DEBUG:com.parallelpython:Task (remote) 4 started                
2010-02-06 08:43:34,559 DEBUG Task (remote) 0 ended              
DEBUG:com.parallelpython:Task (remote) 0 ended                  
2010-02-06 08:43:34,549 DEBUG Task (remote) 1 ended              
DEBUG:com.parallelpython:Task (remote) 1 ended                  
2010-02-06 08:43:34,706 DEBUG Task (remote) 5 started            
DEBUG:com.parallelpython:Task (remote) 5 started                
2010-02-06 08:43:34,718 DEBUG Task 10 submited, function='reverse'
DEBUG:com.parallelpython:Task 10 submited, function='reverse'    
2010-02-06 08:43:34,828 DEBUG Task 11 submited, function='reverse'
DEBUG:com.parallelpython:Task 11 submited, function='reverse'    
2010-02-06 08:43:34,970 DEBUG Task 12 submited, function='reverse'
2010-02-06 08:43:34,997 DEBUG Task (remote) 2 ended              
DEBUG:com.parallelpython:Task (remote) 2 ended                    
2010-02-06 08:43:35,008 DEBUG Task (remote) 3 ended              
DEBUG:com.parallelpython:Task 12 submited, function='reverse'    
DEBUG:com.parallelpython:Task (remote) 3 ended                    
2010-02-06 08:43:35,008 DEBUG Task (remote) 6 started            
DEBUG:com.parallelpython:Task (remote) 6 started                  
2010-02-06 08:43:35,113 DEBUG Task (remote) 7 started            
DEBUG:com.parallelpython:Task (remote) 7 started                  
2010-02-06 08:43:35,146 DEBUG Task 13 submited, function='reverse'
DEBUG:com.parallelpython:Task 13 submited, function='reverse'    
2010-02-06 08:43:35,210 DEBUG Task 14 submited, function='reverse'
DEBUG:com.parallelpython:Task 14 submited, function='reverse'    
2010-02-06 08:43:35,364 DEBUG Task (remote) 4 ended              
DEBUG:com.parallelpython:Task (remote) 4 ended                    
2010-02-06 08:43:35,365 DEBUG Task 15 submited, function='reverse'
DEBUG:com.parallelpython:Task 15 submited, function='reverse'    
2010-02-06 08:43:35,442 DEBUG Task (remote) 8 started            
DEBUG:com.parallelpython:Task (remote) 8 started                  
2010-02-06 08:43:35,460 DEBUG Task 16 submited, function='reverse'
DEBUG:com.parallelpython:Task 16 submited, function='reverse'    
2010-02-06 08:43:35,519 DEBUG Task (remote) 5 ended              
DEBUG:com.parallelpython:Task (remote) 5 ended                    
2010-02-06 08:43:35,519 DEBUG Task (remote) 9 started            
DEBUG:com.parallelpython:Task (remote) 9 started                  
2010-02-06 08:43:35,581 DEBUG Task 17 submited, function='reverse'
DEBUG:com.parallelpython:Task 17 submited, function='reverse'    
2010-02-06 08:43:35,659 DEBUG Task 18 submited, function='reverse'
DEBUG:com.parallelpython:Task 18 submited, function='reverse'    
2010-02-06 08:43:35,735 DEBUG Task 19 submited, function='reverse'
DEBUG:com.parallelpython:Task 19 submited, function='reverse'    
Job execution statistics:                                        
2010-02-06 08:43:35,790 DEBUG Task (remote) 6 ended              
DEBUG:com.parallelpython:Task (remote) 6 ended                    
job count | % of all jobs | job time sum | time per job | job server
       11 |        100.00 |       3.2256 |     0.293237 | 127.0.0.1:60000
Time elapsed since server creation 2.11736416817                          
 
Loading done! Start waiting untill jobs are finished
2010-02-06 08:43:35,833 DEBUG Task (remote) 10 started
DEBUG:com.parallelpython:Task (remote) 10 started    
2010-02-06 08:43:35,914 DEBUG Task (remote) 7 ended  
DEBUG:com.parallelpython:Task (remote) 7 ended        
2010-02-06 08:43:35,914 DEBUG Task (remote) 11 started
DEBUG:com.parallelpython:Task (remote) 11 started    
2010-02-06 08:43:36,111 DEBUG Task (remote) 8 ended  
DEBUG:com.parallelpython:Task (remote) 8 ended        
2010-02-06 08:43:36,117 DEBUG Task (remote) 12 started
DEBUG:com.parallelpython:Task (remote) 12 started    
2010-02-06 08:43:36,227 DEBUG Task (remote) 9 ended  
DEBUG:com.parallelpython:Task (remote) 9 ended        
2010-02-06 08:43:36,228 DEBUG Task (remote) 13 started
DEBUG:com.parallelpython:Task (remote) 13 started    
2010-02-06 08:43:36,420 DEBUG Task (remote) 10 ended  
DEBUG:com.parallelpython:Task (remote) 10 ended      
2010-02-06 08:43:36,420 DEBUG Task (remote) 14 started
DEBUG:com.parallelpython:Task (remote) 14 started    
2010-02-06 08:43:36,599 DEBUG Task (remote) 11 ended  
DEBUG:com.parallelpython:Task (remote) 11 ended      
2010-02-06 08:43:36,642 DEBUG Task (remote) 15 started
DEBUG:com.parallelpython:Task (remote) 15 started    
2010-02-06 08:43:36,756 DEBUG Task (remote) 12 ended  
DEBUG:com.parallelpython:Task (remote) 12 ended      
2010-02-06 08:43:36,792 DEBUG Task (remote) 16 started
DEBUG:com.parallelpython:Task (remote) 16 started    
2010-02-06 08:43:36,923 DEBUG Task (remote) 13 ended  
DEBUG:com.parallelpython:Task (remote) 13 ended      
2010-02-06 08:43:36,923 DEBUG Task (remote) 17 started
DEBUG:com.parallelpython:Task (remote) 17 started    
2010-02-06 08:43:36,996 DEBUG Task (remote) 14 ended  
DEBUG:com.parallelpython:Task (remote) 14 ended      
2010-02-06 08:43:36,996 DEBUG Task (remote) 18 started
DEBUG:com.parallelpython:Task (remote) 18 started    
2010-02-06 08:43:37,244 DEBUG Task (remote) 15 ended  
DEBUG:com.parallelpython:Task (remote) 15 ended      
2010-02-06 08:43:37,276 DEBUG Task (remote) 19 started
DEBUG:com.parallelpython:Task (remote) 19 started    
2010-02-06 08:43:37,288 DEBUG Task 17 failed due to broken network connection - rescheduling
DEBUG:com.parallelpython:Task 17 failed due to broken network connection - rescheduling    
2010-02-06 08:43:37,288 DEBUG Task 17 inserted                                              
DEBUG:com.parallelpython:Task 17 inserted                                                  
2010-02-06 08:43:37,288 DEBUG Task 18 failed due to broken network connection - rescheduling
DEBUG:com.parallelpython:Task 18 failed due to broken network connection - rescheduling    
2010-02-06 08:43:37,289 DEBUG Task 18 inserted                                              
DEBUG:com.parallelpython:Task 18 inserted                                                  
2010-02-06 08:43:37,289 DEBUG Task 16 failed due to broken network connection - rescheduling
DEBUG:com.parallelpython:Task 16 failed due to broken network connection - rescheduling    
2010-02-06 08:43:37,289 DEBUG Task 16 inserted                                              
DEBUG:com.parallelpython:Task 16 inserted                                                  
2010-02-06 08:43:37,291 DEBUG Task 19 failed due to broken network connection - rescheduling
DEBUG:com.parallelpython:Task 19 failed due to broken network connection - rescheduling    
2010-02-06 08:43:37,291 DEBUG Task 19 inserted                                              
DEBUG:com.parallelpython:Task 19 inserted                                                  
2010-02-06 08:43:42,334 DEBUG Task (remote) 17 started                                      
DEBUG:com.parallelpython:Task (remote) 17 started                                          
2010-02-06 08:43:42,343 DEBUG Task (remote) 16 started                                      
DEBUG:com.parallelpython:Task (remote) 16 started                                          
2010-02-06 08:43:42,343 DEBUG Task (remote) 18 started                                      
DEBUG:com.parallelpython:Task (remote) 18 started                                          
2010-02-06 08:43:42,345 DEBUG Task (remote) 19 started                                      
DEBUG:com.parallelpython:Task (remote) 19 started                                          
2010-02-06 08:43:42,709 DEBUG Task (remote) 17 ended                                        
DEBUG:com.parallelpython:Task (remote) 17 ended                                            
2010-02-06 08:43:42,758 DEBUG Task (remote) 16 ended                                        
DEBUG:com.parallelpython:Task (remote) 16 ended                                            
2010-02-06 08:43:43,018 DEBUG Task (remote) 18 ended                                        
DEBUG:com.parallelpython:Task (remote) 18 ended                                            
2010-02-06 08:43:43,054 DEBUG Task (remote) 19 ended                                        
DEBUG:com.parallelpython:Task (remote) 19 ended                                            
Finished!                                                                                  
Job execution statistics:                                                                  
job count | % of all jobs | job time sum | time per job | job server                      
       24 |        100.00 |       0.0000 |     0.000000 | 127.0.0.1:60000                  
Time elapsed since server creation 9.41980099678                                            
WARNING: statistics provided above is not accurate due to job rescheduling      
 

As you can see client completes.

Could you please attach you log for the case when it hangs?


Title: Re: Job Server stops sending jobs
Post by: mickvdv on February 06, 2010, 03:09:52 PM
Does this also happen when you disconnect the network cable?

Code:
Loading jobs
2010-02-07 00:05:48,634 DEBUG Creating server instance (pp-1.5.7)
2010-02-07 00:05:48,644 INFO pp local server started with 0 workers
2010-02-07 00:05:48,638 DEBUG Creating Rworker id=home.schutijser.com:60000 persistent=True
2010-02-07 00:05:48,638 DEBUG Creating Rworker id=home.schutijser.com:60002 persistent=True
2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60000 persistent=True
2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60004 persistent=True
2010-02-07 00:05:48,642 DEBUG Creating Rworker id=77.109.74.145:60001 persistent=True
2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60006 persistent=True
2010-02-07 00:05:48,644 DEBUG Creating Rworker id=94.209.26.156:60001 persistent=True
2010-02-07 00:05:48,644 DEBUG Creating Rworker id=vulpix.physgaming.nl:60010 persistent=True
2010-02-07 00:05:48,644 DEBUG Creating Rworker id=83.82.231.248:60003 persistent=True
2010-02-07 00:05:48,642 DEBUG Creating Rworker id=77.109.74.145:60002 persistent=True
2010-02-07 00:05:48,638 DEBUG Creating Rworker id=home.schutijser.com:60001 persistent=True
2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60002 persistent=True
2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60007 persistent=True
2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60008 persistent=True
2010-02-07 00:05:48,644 DEBUG Creating Rworker id=94.209.26.156:60000 persistent=True
2010-02-07 00:05:48,644 DEBUG Creating Rworker id=83.82.231.248:60001 persistent=True
2010-02-07 00:05:48,644 DEBUG Creating Rworker id=83.82.231.248:60002 persistent=True
2010-02-07 00:05:48,644 DEBUG Creating Rworker id=192.168.0.101:60000 persistent=True
2010-02-07 00:05:48,644 DEBUG Creating Rworker id=192.168.0.201:60000 persistent=True
2010-02-07 00:05:48,642 DEBUG Creating Rworker id=77.109.74.145:60000 persistent=True
2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60001 persistent=True
2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60003 persistent=True
2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60009 persistent=True
2010-02-07 00:05:48,644 DEBUG Creating Rworker id=94.209.29.155:60000 persistent=True
2010-02-07 00:05:48,644 DEBUG Creating Rworker id=192.168.0.6:60000 persistent=True
2010-02-07 00:05:48,644 DEBUG Creating Rworker id=192.168.0.202:60000 persistent=True
2010-02-07 00:05:48,644 DEBUG Creating Rworker id=192.168.0.203:60000 persistent=True
2010-02-07 00:05:48,642 DEBUG Creating Rworker id=vulpix.physgaming.nl:60005 persistent=True
2010-02-07 00:05:48,644 DEBUG Creating Rworker id=83.82.231.248:60000 persistent=True
2010-02-07 00:05:48,644 DEBUG Creating Rworker id=192.168.0.5:60000 persistent=True
2010-02-07 00:05:48,644 DEBUG Creating Rworker id=192.168.0.200:60000 persistent=True
2010-02-07 00:05:48,786 DEBUG Task 0 submited, function='reverse'
2010-02-07 00:05:48,838 DEBUG Creating Rworker id=192.168.0.101:60000 persistent=True
2010-02-07 00:05:49,055 DEBUG Task 1 submited, function='reverse'
2010-02-07 00:05:49,173 DEBUG Task 2 submited, function='reverse'
2010-02-07 00:05:49,201 DEBUG Creating Rworker id=192.168.0.101:60000 persistent=True
2010-02-07 00:05:49,201 DEBUG Task (remote) 0 started
2010-02-07 00:05:49,816 DEBUG Creating Rworker id=vulpix.physgaming.nl:60000 persistent=True
2010-02-07 00:05:49,963 DEBUG Task 8 submited, function='reverse'
2010-02-07 00:05:49,963 DEBUG Creating Rworker id=77.109.74.145:60000 persistent=True
2010-02-07 00:05:49,963 DEBUG Creating Rworker id=192.168.0.101:60000 persistent=True
2010-02-07 00:05:50,085 DEBUG Creating Rworker id=192.168.0.200:60000 persistent=True
2010-02-07 00:05:50,085 DEBUG Failed to reconnect with (host=192.168.0.6, port=60000), will try again in 60 s
2010-02-07 00:05:50,085 DEBUG Task 9 submited, function='reverse'
2010-02-07 00:05:49,994 DEBUG Task (remote) 1 started
2010-02-07 00:05:50,213 DEBUG Task 10 submited, function='reverse'
2010-02-07 00:05:50,213 DEBUG Failed to reconnect with (host=192.168.0.5, port=60000), will try again in 60 s
2010-02-07 00:05:50,365 DEBUG Task 11 submited, function='reverse'
2010-02-07 00:05:50,401 DEBUG Creating Rworker id=192.168.0.101:60000 persistent=True
2010-02-07 00:05:50,502 DEBUG Task 12 submited, function='reverse'
2010-02-07 00:06:07,647 DEBUG Task (remote) 32 started
 job count | % of all jobs | job time sum | time per job | job server
         4 |         12.12 |       5.0620 |     1.265500 | 192.168.0.200:60000
         4 |         12.12 |       4.5780 |     1.144500 | 77.109.74.145:60000
         4 |         12.12 |       0.0000 |     0.000000 | vulpix.physgaming.nl:60000
        21 |         63.64 |      28.8920 |     1.375810 | 192.168.0.101:60000
Time elapsed since server creation 16.8490002155

Loading done! Start waiting untill jobs are finished
2010-02-07 00:06:07,880 DEBUG Task (remote) 29 ended
2010-02-07 00:06:07,881 DEBUG Task (remote) 33 started
2010-02-07 00:06:08,802 DEBUG Task (remote) 32 ended
2010-02-07 00:06:08,805 DEBUG Task (remote) 31 ended
2010-02-07 00:06:08,805 DEBUG Task (remote) 34 started
2010-02-07 00:06:08,808 DEBUG Task (remote) 35 started
2010-02-07 00:06:09,719 DEBUG Failed to reconnect with (host=83.82.231.248, port=60003), will try again in 60 s
2010-02-07 00:06:09,719 DEBUG Failed to reconnect with (host=home.schutijser.com, port=60002), will try again in 60 s
2010-02-07 00:06:09,719 DEBUG Failed to reconnect with (host=94.209.26.156, port=60001), will try again in 60 s
2010-02-07 00:06:09,727 DEBUG Task (remote) 33 ended
2010-02-07 00:06:09,730 DEBUG Task (remote) 36 started
2010-02-07 00:06:09,743 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60006), will try again in 60 s
2010-02-07 00:06:09,743 DEBUG Failed to reconnect with (host=77.109.74.145, port=60002), will try again in 60 s
2010-02-07 00:06:09,743 DEBUG Failed to reconnect with (host=home.schutijser.com, port=60000), will try again in 60 s
2010-02-07 00:06:09,743 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60004), will try again in 60 s
2010-02-07 00:06:09,743 DEBUG Failed to reconnect with (host=77.109.74.145, port=60001), will try again in 60 s
2010-02-07 00:06:09,743 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60010), will try again in 60 s
2010-02-07 00:06:09,799 DEBUG Task (remote) 30 ended
2010-02-07 00:06:09,806 DEBUG Task (remote) 37 started
2010-02-07 00:06:09,819 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60007), will try again in 60 s
2010-02-07 00:06:09,828 DEBUG Task (remote) 27 ended
2010-02-07 00:06:09,828 DEBUG Task (remote) 38 started
2010-02-07 00:06:09,848 DEBUG Failed to reconnect with (host=home.schutijser.com, port=60001), will try again in 60 s
2010-02-07 00:06:09,848 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60008), will try again in 60 s
2010-02-07 00:06:09,848 DEBUG Failed to reconnect with (host=83.82.231.248, port=60001), will try again in 60 s
2010-02-07 00:06:09,848 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60002), will try again in 60 s
2010-02-07 00:06:09,848 DEBUG Failed to reconnect with (host=94.209.26.156, port=60000), will try again in 60 s
2010-02-07 00:06:09,848 DEBUG Failed to reconnect with (host=83.82.231.248, port=60002), will try again in 60 s
2010-02-07 00:06:09,918 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60003), will try again in 60 s
2010-02-07 00:06:09,918 DEBUG Failed to reconnect with (host=192.168.0.201, port=60000), will try again in 60 s
2010-02-07 00:06:09,918 DEBUG Failed to reconnect with (host=192.168.0.203, port=60000), will try again in 60 s
2010-02-07 00:06:09,918 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60009), will try again in 60 s
2010-02-07 00:06:09,918 DEBUG Failed to reconnect with (host=94.209.29.155, port=60000), will try again in 60 s
2010-02-07 00:06:09,947 DEBUG Failed to reconnect with (host=192.168.0.202, port=60000), will try again in 60 s
2010-02-07 00:06:09,947 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60005), will try again in 60 s
2010-02-07 00:06:10,019 DEBUG Failed to reconnect with (host=83.82.231.248, port=60000), will try again in 60 s
2010-02-07 00:06:10,760 DEBUG Task (remote) 35 ended
2010-02-07 00:06:10,762 DEBUG Task (remote) 39 started
2010-02-07 00:06:10,776 DEBUG Task (remote) 34 ended
2010-02-07 00:06:10,786 DEBUG Task (remote) 40 started
2010-02-07 00:06:11,746 DEBUG Task (remote) 37 ended
2010-02-07 00:06:11,749 DEBUG Task (remote) 41 started
2010-02-07 00:06:11,852 DEBUG Task (remote) 36 ended
2010-02-07 00:06:11,855 DEBUG Task (remote) 42 started
2010-02-07 00:06:12,571 DEBUG Task (remote) 39 ended
2010-02-07 00:06:12,573 DEBUG Task (remote) 43 started
2010-02-07 00:06:12,782 DEBUG Task (remote) 40 ended
2010-02-07 00:06:12,786 DEBUG Task (remote) 44 started
2010-02-07 00:06:13,479 DEBUG Task (remote) 41 ended
2010-02-07 00:06:13,480 DEBUG Task (remote) 45 started
2010-02-07 00:06:13,678 DEBUG Task (remote) 42 ended
2010-02-07 00:06:13,681 DEBUG Task (remote) 46 started
2010-02-07 00:06:36,496 DEBUG Task (remote) 97 started
2010-02-07 00:06:36,542 DEBUG Task (remote) 95 ended
2010-02-07 00:06:36,546 DEBUG Task (remote) 98 started
2010-02-07 00:06:36,989 DEBUG Task (remote) 92 ended
2010-02-07 00:06:36,993 DEBUG Task (remote) 99 started
2010-02-07 00:06:37,470 DEBUG Task (remote) 93 ended
2010-02-07 00:06:39,384 DEBUG Task (remote) 98 ended
2010-02-07 00:06:39,515 DEBUG Task (remote) 97 ended
2010-02-07 00:06:39,529 DEBUG Task (remote) 99 ended
2010-02-07 00:06:39,529 DEBUG Task (remote) 51 ended
2010-02-07 00:06:51,092 DEBUG Failed to reconnect with (host=192.168.0.6, port=60000), will try again in 60 s
2010-02-07 00:06:51,144 DEBUG Task (remote) 96 ended
2010-02-07 00:06:51,372 DEBUG Failed to reconnect with (host=192.168.0.5, port=60000), will try again in 60 s
2010-02-07 00:06:52,221 DEBUG Task (remote) 94 ended
2010-02-07 00:06:52,595 DEBUG Task (remote) 91 ended
2010-02-07 00:06:56,309 DEBUG Failed to reconnect with (host=vulpix.physgaming.nl, port=60001), will try again in 60 s

Btw: I am running on python 2.6.4 maybe that makes a difference. Soon i will try the same code with python 2.5.

Grtz Mick

P.S. i modified the log a little bit because it was over 20000 characters.


Title: Re: Job Server stops sending jobs
Post by: mickvdv on February 09, 2010, 01:45:39 PM
Is still get the problem at py2.5 (job server side). I guess that the module is the problem.

Could this please be fixed because i have to reboot the jobserver every couple of hours and when it happens at night lots of "free" computing hours get lost.

P.S.: this only happens when i disconnect my network cable, when i kill the process everything goes fine. I guess that when i disconnect the cable the socket doesn't get closed correctly.

Mick


Title: Re: Job Server stops sending jobs
Post by: Vitalii on February 10, 2010, 03:12:26 AM
I was able to reproduce it, but not sure about the cause yet.
So far I've seen it only to happens on program exit.
In this case workaround is simple, call
Code: (python)
os._exit(0)
at the very end of your program.


Title: Re: Job Server stops sending jobs
Post by: mickvdv on February 10, 2010, 08:36:43 AM
That is good news:D I hope you solve it soon:D

Where exactly do i have to put os._exit(0) in my program.

At the point where i call the callback function? Or when i return the results?

Thnx for the help

Mick


Title: Re: Job Server stops sending jobs
Post by: mickvdv on February 10, 2010, 08:52:23 AM
I got an idea, isn't there some way to build in a TTL (time to live) for a job. So that realy slow clients and lost connections and other failures dont slow down the project? For instance if i run ppserver on my desktop, at low priority while encoding movies. My pc gets very slow and jobs take minutes to complete, if i set the ttl on 60 seconds that doesnt matter anymore.

Maybe you could build somethink like that into pp.

Grtz

Mick


Title: Re: Job Server stops sending jobs
Post by: Vitalii on February 10, 2010, 09:59:19 PM
Mick,

if you do not change pp._RECONNECT_WAIT_TIME value, does it ever hang for more then 5 sec after the completion?


Title: Re: Job Server stops sending jobs
Post by: mickvdv on February 11, 2010, 08:26:47 AM
Vitalli

It also happens to me when i put a "#" before pp._RECONNECT_WAIT_TIME = 60. Do you have any luck finding the bug?

Gtrz Mick


Title: Re: Job Server stops sending jobs
Post by: Vitalii on February 11, 2010, 11:44:12 PM
This is a combination of how tcp socket works and how threads are created by thread module.
If network cable is disconnected tcp socket hangs the thread it is in and that thread can survive the exit of the main thread.
The good news is that in PP 1.6 it won't be an issue anymore.
Would you be interested to test the Release Candidate for PP 1.6?


Title: Re: Job Server stops sending jobs
Post by: mickvdv on February 12, 2010, 12:59:56 AM
Yes, shure:D When does it get released?


Title: Re: Job Server stops sending jobs
Post by: Vitalii on February 14, 2010, 12:50:57 AM
Probably by the end of this month.


Title: Re: Job Server stops sending jobs
Post by: Vitalii on February 18, 2010, 10:21:01 PM
The issue is reported is solved in PP 1.6:
http://www.parallelpython.com/component/option,com_smf/Itemid,29/topic,424.0