www.BinaryAlchemy.de :: View topic - Server crashing with the same error over and over again.
Server crashing with the same error over and over again.

 

kevinyaya



Joined: 21 May 2014
Posts: 4

Posted: Wed Jul 02, 2014 10:14 am    Post subject: Server crashing with the same error over and over again.

Hi there (we are running RR v7.0.02 on Windows),
Since the 24th of June, we have had a recurring problem on our Royal Render server, which seems odd since it did not occur before and we haven't changed anything in the configuration since. This is what the errors look like in _ERROR_SERVER.txt:
[DISTRISERVER01]
06.24 15:16.39| CRF Critical Failure - EXCEPTION: The thread tried to read from or write to a virtual address for which it does not have the appropriate access. (Possible NULL Pointer) Address:971c61 Flags:0;; () rrServer console L 7.0.02+6.02.42 64bit []

[...]

[DISTRISERVER01]
07.02 09:55.16| CRF Critical Failure - EXCEPTION: The thread tried to read from or write to a virtual address for which it does not have the appropriate access. (Possible NULL Pointer) Address:861c61 Flags:0;; () rrServer console L 7.0.02+6.02.42 64bit []

[DISTRISERVER01]
07.02 10:54.50| CRF Critical Failure - EXCEPTION: The thread tried to read from or write to a virtual address for which it does not have the appropriate access. (Possible NULL Pointer) Address:471c61 Flags:0;; () rrServer console L 7.0.02+6.02.42 64bit []

[DISTRISERVER01]
07.02 11:25.26| CRF Critical Failure - EXCEPTION: The thread tried to read from or write to a virtual address for which it does not have the appropriate access. (Possible NULL Pointer) Address:271c61 Flags:0;; () rrServer console L 7.0.02+6.02.42 64bit []



And this is what it looks like in the file server_DISTRISERVER01.txt:
07.02 10:54.50| CRF Critical Failure - EXCEPTION: The thread tried to read from or write to a virtual address for which it does not have the appropriate access. (Possible NULL Pointer) Address:471c61 Flags:0;; () rrServer console L 7.0.02+6.02.42 64bit []
07.02 10:54.50| _rrServerCheckThread Status: 105592708
07.02 10:54.50| _rrServerFtpThread Status: 0
07.02 10:54.50| _rrServerWatchThread Status: 1
07.02 10:54.50| _rrServerEmailThread Status: 0
07.02 10:54.50| rrServerTCP RENDER04 Client ClientSingle pos:9004
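
A minimal sketch for pulling the crash lines out of a log such as _ERROR_SERVER.txt, assuming every crash line has the same layout as the excerpts above ("MM.DD HH:MM.SS| CRF Critical Failure ... Address:<hex>"). The function names `parse_crashes` and `address_suffixes` are hypothetical helpers, not part of Royal Render. In the excerpts quoted here the crash addresses all end in the same digits (1c61); that could be consistent with the crash happening at the same place in the code each time, but that is only a guess without the source.

```python
import re
from collections import Counter

# Layout assumed from the excerpts in this thread:
# "MM.DD HH:MM.SS| CRF Critical Failure ... Address:<hex> ..."
CRASH_RE = re.compile(
    r"^(?P<date>\d{2}\.\d{2}) (?P<time>\d{2}:\d{2}\.\d{2})\| CRF Critical Failure"
    r".*?Address:(?P<address>[0-9a-fA-F]+)"
)


def parse_crashes(text):
    """Return a list of (date, time, address) tuples, one per crash line."""
    crashes = []
    for line in text.splitlines():
        match = CRASH_RE.match(line.strip())
        if match:
            crashes.append((match["date"], match["time"], match["address"]))
    return crashes


def address_suffixes(crashes, digits=4):
    """Count how often each trailing part of the crash address occurs."""
    return Counter(address[-digits:] for _, _, address in crashes)
```

To use it, read the log file into a string, pass it to `parse_crashes`, and hand the result to `address_suffixes` to see whether the crashes cluster on one address suffix.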

I found a similar error in this post http://www.binaryalchemy.de/forum/viewtopic.php?p=10033&sid=4249b4eb2a8257ac686035f8f3a9b55b
but this was happening on the client side if I understood correctly. Our problem comes from the distriserver.

There are a few things that make debugging difficult:
- the error is kind of cryptic (some stack issue?) and we don't really know how to obtain more information. Is there a way to get more useful debugging info?

- The user that submits the job is Administrator, so I don't think it has to do with access rights, but I might be wrong. I can't tell whether it's the DISTRISERVER that has a virtual address access problem or whether it could be the client mentioned in the rrServerTCP line.

- I don't understand the last line: why does it point to a particular client, in this case RENDER04? When I open the individual logs, the mentioned client is generally not the last one to have communicated with the distriserver.

- that rrServerTCP status: does it mean that the error comes from TCP communication or not?

- I don't understand the rrServerCheckThread status code, which is always different.

- The jobs also disappear (I know jobs.db is only written every 30 minutes, so it makes sense), but it makes it hard to debug.

- I know there are some z_day files, so can it be that if we switch something to its z_day backup, things will be up and working again?

- I'm not sure if it's possible to open the jobs.db file, or also the .stdb files the client machines are spitting out before the crash.

- can we provide more information to help solve the problem?

schoenberger
Site Admin


Joined: 02 Mar 2005
Posts: 3785

Posted: Wed Jul 02, 2014 10:51 am

Hi

Please update to the latest version; there have been two fixes for server crashes.
If it still occurs, then it is possible to debug the issue to find the bug.


>Is there a way to get more useful debugging info?
Only with the 32bit executable, but even that 32bit error message is unusable without the source code.
It does make the bug easier to fix, though. We will add better source error messages to the 64bit version soon.

>I don't understand the last line
> that rrServerTCP
It is printing the current status of all 5 server threads, not the thread that crashed.

>I don't understand the rrServerCheckThread status code
You need the source code to understand that.

>- The jobs also disappear
If possible, the rrServer tries to save the database when there is an issue. But this is not always possible.


>I know there are some z_day files
I would say that only deleting the jobs.db after you have shut down the rrServer solves the issue.
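
A minimal sketch of a cautious variant of that last step: instead of deleting jobs.db outright, it renames the file to a timestamped backup so it can be put back later. Moving the file aside rather than deleting it is a substitution made for illustration, not something described in this thread. The function name `move_jobs_db_aside` and the path passed to it are hypothetical, and the rrServer must already be shut down, which the code does not check.

```python
import shutil
import time
from pathlib import Path


def move_jobs_db_aside(db_path):
    """Rename jobs.db to a timestamped backup so the server starts without it.

    Assumption: the rrServer has already been shut down (not checked here).
    Returns the backup path, or None if the file does not exist.
    """
    db_path = Path(db_path)
    if not db_path.is_file():
        return None
    stamp = time.strftime("%Y%m%d_%H%M%S")
    backup_path = db_path.with_name(f"{db_path.name}.{stamp}.bak")
    shutil.move(str(db_path), str(backup_path))
    return backup_path
```

Call it with the full path to jobs.db on the server; the location of that file is not given in this thread, so it has to be looked up in the local installation.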
_________________
Holger Schönberger
Binary Alchemy - digital materialization

kevinyaya



Joined: 21 May 2014
Posts: 4

Posted: Wed Jul 02, 2014 11:26 am

Still no luck...
We updated to 7.0.05, restarted the Royal Render service, and submitted a new job, limited to one client (RENDER01) to keep things simple. The only things that did not get updated are:
a) rrSubmit.py files on the users' machines.
b) rrSubmitNuke.py on the users' machines.
c) some .cfg files in render_apps/_config/

The job does not seem to get past the submit process, since there is nothing in the RENDER01 log, and the machine mentioned in the crash is the artist's (COMPO03):

07.02 14:19.25| CRF Critical Failure - EXCEPTION: The thread tried to read from or write to a virtual address for which it does not have the appropriate access. (Possible NULL Pointer) Address:861c61 Flags:0;; () rrServer console L 7.0.05 64bit []
07.02 14:19.25| _rrServerCheckThread Status: 7300-105907200 {7HZ}
07.02 14:19.25| _rrServerFtpThread Status: 0
07.02 14:19.25| _rrServerWatchThread Status: 1
07.02 14:19.25| _rrServerEmailThread Status: 0
07.02 14:19.25| rrServerTCP COMPO03 Control ControlListRequest pos:9004
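
A minimal sketch for comparing the status lines that follow each crash message, assuming they always use the "MM.DD HH:MM.SS| <thread name> <status>" layout seen in this thread. `parse_snapshot` and `changed_threads` are hypothetical helper names. As explained earlier in the thread, these lines show the current state of all server threads, not the thread that crashed, so a difference between two snapshots is a hint at most.

```python
import re

# Timestamp prefix assumed from the log excerpts in this thread.
LINE_RE = re.compile(r"^\d{2}\.\d{2} \d{2}:\d{2}\.\d{2}\| (?P<body>.*)$")


def parse_snapshot(lines):
    """Map each thread name to the status text printed after a crash."""
    snapshot = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        body = match["body"]
        if body.startswith("CRF "):
            continue  # the crash message itself is not a thread status
        name, _, rest = body.partition(" ")
        if rest.startswith("Status:"):
            rest = rest[len("Status:"):].strip()
        snapshot[name] = rest
    return snapshot


def changed_threads(first, second):
    """Return the sorted thread names whose status differs between two snapshots."""
    names = set(first) | set(second)
    return sorted(name for name in names if first.get(name) != second.get(name))
```

Feeding it the two snapshots quoted in this thread would show which thread statuses changed between the 7.0.02 crash and the 7.0.05 crash.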

What's next? We're kind of out of ideas...

schoenberger
Site Admin


Joined: 02 Mar 2005
Posts: 3785

Posted: Wed Jul 02, 2014 12:25 pm

I am checking the new error message now.
If it does not work, we can do a remote session to fix the issue.

schoenberger
Site Admin


Joined: 02 Mar 2005
Posts: 3785

Posted: Wed Jul 02, 2014 12:28 pm

OK, the cause is an image file that the rrServer tries to convert into preview images.
Your colleague contacted us via the rrSupport website as well.
I will continue there to arrange a remote session to find which image is broken.
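
A rough sketch of a first pass for narrowing down a broken image, assuming the rendered frames sit in a folder that can be scanned. It only flags files that are empty or whose first bytes do not match the usual signature for their extension. The function name `suspicious_images` and the list of formats are assumptions for illustration, and this is not how the rrServer itself checks images. A file that passes this check can still be damaged further in.

```python
from pathlib import Path

# Leading "magic" bytes for a few common image formats, keyed by extension.
MAGIC_BYTES = {
    ".png": [b"\x89PNG\r\n\x1a\n"],
    ".jpg": [b"\xff\xd8\xff"],
    ".jpeg": [b"\xff\xd8\xff"],
    ".tif": [b"II*\x00", b"MM\x00*", b"II+\x00", b"MM\x00+"],
    ".tiff": [b"II*\x00", b"MM\x00*", b"II+\x00", b"MM\x00+"],
    ".exr": [b"\x76\x2f\x31\x01"],
}


def suspicious_images(folder):
    """Return (path, reason) pairs for files that are empty or have an unexpected header."""
    findings = []
    for path in sorted(Path(folder).rglob("*")):
        signatures = MAGIC_BYTES.get(path.suffix.lower())
        if signatures is None or not path.is_file():
            continue  # not a file type this sketch knows about
        if path.stat().st_size == 0:
            findings.append((path, "empty file"))
            continue
        with path.open("rb") as handle:
            header = handle.read(8)
        if not any(header.startswith(sig) for sig in signatures):
            findings.append((path, "unexpected header"))
    return findings
```

Running it on the output folder of the job that triggers the crash gives a short list of files to inspect by hand first.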

kevinyaya



Joined: 21 May 2014
Posts: 4

Posted: Wed Jul 02, 2014 12:50 pm

OK cool. Also, it is not 100% sure yet, but we have tried submitting jobs from different machines and fiddling with them (disabling them, killing them, etc.), and it seems that the crashes happen only when jobs are submitted from one of the users' machines. From the other machines, it seems fine. Again, we're still testing to confirm that this happens 100% of the time from the same machine.

schoenberger
Site Admin


Joined: 02 Mar 2005
Posts: 3785

Posted: Wed Jul 02, 2014 12:55 pm

The crash is caused by an image file, not the machine that submits the job.

kevinyaya



Joined: 21 May 2014
Posts: 4

Posted: Wed Jul 02, 2014 12:59 pm

OK, strange, because if we launch the same job for the same images from different machines, only one machine manages to crash the server... But let's set up this remote session with my colleague and we'll be able to discuss further.
All times are GMT + 1 Hour