SageTV Community

SageTV Community (http://forums.sagetv.com/forums/index.php)
-   SageTV Customizations (http://forums.sagetv.com/forums/forumdisplay.php?f=27)
-   -   Plugin: Sage Job Queue (SJQ) v3 (http://forums.sagetv.com/forums/showthread.php?t=45174)

gveres 11-17-2009 07:30 AM

Quote:

Originally Posted by Slugger (Post 392640)
655 didn't fix this issue, directly. The issue here was that you were running task client v2.1.5.556 with a v3.0.0 server. The HTTP 500 errors come about because a 2.x client can't talk with a 3.x server. Of course, upgrading the task client to a 3.x version would immediately fix the problem. :)


Thanks for noticing that. I guess I had a little too much wine with dinner to remember I put the new V3 client in another directory. :blush:

Greg

Slugger 11-17-2009 10:10 AM

Quote:

Originally Posted by nyplayer (Post 392651)
Slugger,

Nothing is processing.. its just hung.

Code:

Mon Nov 16 23:57:39 PST 2009: SJQ Application Dump
        QueueLoader thread is alive...
                java.lang.Thread.sleep(Native Method)
                com.google.code.sagetvaddons.sjq.server.MediaFileQueueLoader.run(MediaFileQueueLoader.java:101)
                java.lang.Thread.run(Unknown Source)
        ClientMonitor thread is alive...
                java.lang.Thread.sleep(Native Method)
                com.google.code.sagetvaddons.sjq.server.ClientMonitor.run(ClientMonitor.java:35)
                java.lang.Thread.run(Unknown Source)
        InteralTaskClient thread is alive...
                java.lang.Thread.sleep(Native Method)
                com.google.code.sagetvaddons.sjq.server.InternalTaskClient.run(InternalTaskClient.java:59)
                java.lang.Thread.run(Unknown Source)
        Most recent LogCleaner thread is dead (this is normal behaviour)!
        Most recent FileCleaner thread is dead (this is normal behaviour)!


What does the queue look like when it stops processing? Are there other tasks stuck in the STARTED or RUNNING state? This dump shows that no threads have died off, which is a good thing.

I saw this scenario happen yesterday: my task client crashed (due to an unrelated bug), and when it crashed it left some jobs stuck in the STARTED and RUNNING states. That held up the rest of the queue b/c internal tasks will not run if there are external tasks pending or running for the same object. Furthermore, I discovered that if an internal task is at the front of the queue but can't run b/c there are external tasks pending for it, that task will hold up the rest of the tasks of the same type behind it in the queue. This particular behaviour is not as designed and presents itself as a queue that appears to be stuck/hung, but it probably won't be fixed since the easy fix introduces a high probability of an infinite loop in the queue processor. The queue is actually being processed, and the internal tasks will continue to run once the external tasks are cleaned up.

If this describes what you're seeing then start by upgrading to task client 655. If it's not then I need a visual/description of what the state of your queue is when it stops processing and then I'll have to go add a lot more logging to the internal task client to see why it stops processing. Is it just internal tasks (tasks starting with '_') that stop processing and external tasks continue to process or does everything in the queue stop? If the latter, did you check to see if your task client had crashed? If so, then upgrade to 655.

nyplayer 11-17-2009 10:27 AM

Slugger,

That is exactly what happens... you describe it to a tee. I will upgrade today.

Slugger 11-17-2009 11:06 PM

Nasty JRE Bug
 
I'm running Sun JRE 1.6.0_17 and I've encountered a JRE bug that seems to block on an attempt to read the size of a disk. It's a local disk that I'm trying to read, so it's definitely not network related. One of my test blocks uses the FreeSpacePercentage test. Tonight I discovered that my queue had stopped processing and, when investigating, I ended up with this dump from SJQ:

Code:

Tue Nov 17 22:49:04 EST 2009: SJQ Application Dump
        QueueLoader thread is alive...
                java.io.WinNTFileSystem.getSpace0(Native Method)
                java.io.WinNTFileSystem.getSpace(Unknown Source)
                java.io.File.getTotalSpace(Unknown Source)
                com.google.code.sagetvaddons.sjq.server.FieldFreeSpacePercentage.run(FieldFreeSpacePercentage.java:43)
                com.google.code.sagetvaddons.sjq.server.RulesParser$Evaluator.internalTest(RulesParser.java:220)
                com.google.code.sagetvaddons.sjq.server.RulesParser$Evaluator.runTest(RulesParser.java:135)
                com.google.code.sagetvaddons.sjq.server.RulesParser$Evaluator.evalTest(RulesParser.java:122)
                com.google.code.sagetvaddons.sjq.server.RulesParser$Evaluator.<init>(RulesParser.java:80)
                com.google.code.sagetvaddons.sjq.server.RulesParser$Evaluator.evalTest(RulesParser.java:111)
                com.google.code.sagetvaddons.sjq.server.RulesParser$Evaluator.<init>(RulesParser.java:78)
                com.google.code.sagetvaddons.sjq.server.RulesParser$Evaluator.<init>(RulesParser.java:101)
                com.google.code.sagetvaddons.sjq.server.RulesParser.loadQ(RulesParser.java:258)
                com.google.code.sagetvaddons.sjq.server.RulesParser.<init>(RulesParser.java:232)
                com.google.code.sagetvaddons.sjq.server.MediaFileQueueLoader.run(MediaFileQueueLoader.java:71)
                java.lang.Thread.run(Unknown Source)

My CPU was not pegged, nor even busy (0-1%), my recordings kept recording, and I was even able to watch TV on my HD100 - I didn't even notice there was a problem until I saw that some of tonight's recordings hadn't been commercial flagged. So I took this dump, re-ran it a bunch of times, and discovered that the QueueLoader thread was definitely stuck in the java.io.WinNTFileSystem.getSpace0() native method. Given it doesn't appear to be stressing my system, I'm guessing the method has blocked on something.

Either way, I could not kill SJQ by stopping the app; trying to recycle the app in Jetty simply caused a second copy to start up (the original app's threads would not unblock and die off gracefully). So then I tried to stop and restart Sage, and Sage itself would not stop. Trying to kill off the SageTV process from the task manager eventually locked up the server's desktop to the point where I had to give it a hard reboot.

The issue definitely seems to be isolated to Windows (given it's blocking in a native method). The scenario that seems to be the cause is that some of the queue loader threads (namely the System Message queue loader) just happened to be scheduled to run at exactly the top and bottom of every hour. This was a random coincidence, as the last time I started SJQ happened to be at exactly the top or bottom of an hour. So tests were being run at exactly the same time Sage was creating new files, one of my tests was checking the free space on a drive, and it appears that a "perfect storm" of events occurred to cause this scenario. Unfortunately, I forgot to update the SJQ thread dumper to include the system message queue loader thread in the output, so I can't verify that this was exactly the situation, but based on the log output I know the System Message Queue Loader started at exactly 22:30:00 and never completed. And a recording also ended at exactly 22:30:00. So something happened, but as I said, everything else continued to work fine until I tried to stop SJQ, Jetty, then SageTV itself.

I document this only as a warning. SJQ had been running on this top/bottom-of-hour schedule for most of the day (10+ hours) before this happened, so it takes some special set of circumstances to cause the JRE native method to block/hang/loop/fail or whatever it's doing (or so it seems). Since there is no mention anywhere of the java.io.File.getTotalSpace() method ever blocking, and it certainly blocked here for at least 8 mins before I stopped tracking it, I'm fairly certain this is a JRE bug. But rather than trying to track down the JRE bug and get it fixed (not really something I want to pursue), my course of action is to defend against the scenario and ensure that no queue loader thread (i.e. any thread that can possibly run the FreeSpacePercentage test in SJQ) starts at exactly the same time a recording is scheduled to end or start.

Given the very specific and special set of circumstances that appear to be required to hit this bug, I'm not exactly rushing to fix it, but some kind of defensive fix will be added into SJQ before v3 is moved out of beta.
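As a rough illustration of that kind of defensive fix, a scheduler can offset a periodic thread's wake-up time so it never fires exactly on a half-hour boundary, when recordings typically start and stop. This is only a hypothetical sketch (the class and method names are mine, not SJQ's):

```java
// Hypothetical sketch, not SJQ code: compute a sleep delay so a periodic
// task wakes up offsetMs past the next half-hour boundary instead of
// exactly on it, when recordings are most likely starting/stopping.
public class OffsetScheduler {
    private static final long HALF_HOUR_MS = 30L * 60L * 1000L;

    // Milliseconds to sleep from nowMs until the next offset run time.
    static long delayUntilNextRun(long nowMs, long offsetMs) {
        long sinceBoundary = nowMs % HALF_HOUR_MS;
        long untilBoundary = (HALF_HOUR_MS - sinceBoundary) % HALF_HOUR_MS;
        return untilBoundary + offsetMs;
    }

    public static void main(String[] args) {
        // Starting exactly on a boundary: wait only the offset (5s here).
        System.out.println(delayUntilNextRun(0L, 5000L)); // prints 5000
    }
}
```

A loader thread sleeping for delayUntilNextRun(System.currentTimeMillis(), 5000L) would then always run five seconds after the half hour, avoiding the 22:30:00 collision described above.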

EDIT: Issue 93 is where this will be fixed in SJQ.

kevine 11-18-2009 03:51 AM

SJQ is reporting comskip as failing, but the jobs actually work. What should I look for to figure out why the client is reporting a failure? I have 3 clients running; 1 is running on the server. All use UNC paths to the server. I'm new to this thread and SJQ, so let me know if you want debug mode output. Thanks.

Slugger 11-18-2009 09:50 AM

Quote:

Originally Posted by kevine (Post 392836)
SJQ is reporting comskip as failing, but the jobs actually work. What should I look for to figure out why the client is reporting a failure? I have 3 clients running; 1 is running on the server. All use UNC paths to the server. I'm new to this thread and SJQ, so let me know if you want debug mode output. Thanks.

Use the :RETURNCODE option in the client config to tell SJQ that comskip returns 1 on success instead of the expected 0.
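For example, a comskip task entry using this option might look something like the following. This is only a sketch: the paths are illustrative and the exact :RETURNCODE syntax should be checked against the docs; the block format follows the client config examples that appear later in this thread.

Code:

COMSKIP {
  :CPU "LOW"
  :RETURNCODE 1
  "C:/comskip/comskip.exe \"%c%\""
}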

jsin 11-18-2009 01:53 PM

Quote:

Originally Posted by Slugger (Post 392874)
Use the :RETURNCODE option in the client config to tell SJQ that comskip returns 1 on success instead of the expected 0.

Um, it's probably worth noting this behaviour of comskip in the docs. One other thing I wished for in the v3 docs: instead of just crossing out deprecated functions in the table, perhaps list an alternate SageTV API equivalent. I used a couple of those functions in v2 and it took a bit of effort to figure out which SageTV API call I needed to use.

Still, by far, you have the best set of documents I've come across in the SageTV community. :)

SJQ is a very cool program, Slugger. Thanks a lot!
:D

jsin 11-18-2009 02:02 PM

Request for less verbose logs
 
The V3 log seems a lot more chatty than the V2 logs. Many, many lines of "SysMsg needs to be processed...", "Processing a system message...", and whatnot. It makes it harder to see the actually useful messages. Is there a way to make it less verbose?

Thanks.

Slugger 11-18-2009 08:15 PM

Quote:

Originally Posted by jsin (Post 392929)
Um, it's probably worth noting this behaviour of comskip in the docs. One other thing I wished for in the v3 docs: instead of just crossing out deprecated functions in the table, perhaps list an alternate SageTV API equivalent. I used a couple of those functions in v2 and it took a bit of effort to figure out which SageTV API call I needed to use.

Still, by far, you have the best set of documents I've come across in the SageTV community. :)

SJQ is a very cool program, Slugger. Thanks a lot!
:D

You might have to scroll to the right, depending on your screen resolution, but there is a column on the far right side that shows the alternatives for those tests removed in SJQv3. On top of that, in the paragraph following the table, there is a link to the SageTV API, which is where all those tests for SJQv3 come from.

Slugger 11-18-2009 08:16 PM

Quote:

Originally Posted by jsin (Post 392931)
The V3 log seems a lot more chatty than the V2 logs. Many, many lines of "SysMsg needs to be processed...", "Processing a system message...", and whatnot. It makes it harder to see the actually useful messages. Is there a way to make it less verbose?

Thanks.

Modify sjq.log4j.properties in your SageTV root install directory and restart SJQ. The default logging will be turned down once SJQv3 moves out of beta (for now I need the debugging on relatively high while tracking things down).
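For anyone who wants to keep the useful INFO messages but lose the system message chatter, log4j also supports per-logger levels, so a single line along these lines in sjq.log4j.properties might work. The logger name below is a guess based on the package names in this thread's stack traces; check the properties file for the actual category names.

Code:

# Hypothetical: turn down only the chatty system-message logger,
# leaving the root SJQ logging level untouched.
log4j.logger.com.google.code.sagetvaddons.sjq.server.SystemMessageQueueLoader=WARN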

jsin 11-18-2009 08:47 PM

Quote:

Originally Posted by Slugger (Post 392995)
You might have to scroll to the right, depending on your screen resolution, but there is a column on the far right side that shows the alternatives for those tests removed in SJQv3. On top of that, in the paragraph following the table, there is a link to the SageTV API, which is where all those tests for SJQv3 come from.

Oh, you're right! Scrolling to the right does show the replacement functions! My apologies for not noticing that earlier.

Yes, I did use the link to the SageTV API and perused the functions to figure out which ones I needed. Thanks for providing that.

I have to say, SJQ turned out to be a lot more flexible than "Comskip Monitor", which I used for over a year w/o any problems.

Thanks!!!

jsin 11-18-2009 08:52 PM

Quote:

Originally Posted by Slugger (Post 392996)
Modify sjq.log4j.properties in your SageTV root install directory and restart SJQ. The default logging will be turned down once SJQv3 moves out of beta (for now I need the debugging on relatively high while tracking things down).

I thought of that, but I wasn't quite sure how to modify it, as the "INFO" level gives some stuff I want to see as well, so changing it to something like "FATAL" or "ERROR" didn't seem appropriate.

I figured you want a lot of logging for debugging during the beta stage; even so, I just didn't think it was necessary to have tons of "Processing a system message" lines every minute. :)

It's not a problem; I can wait until v3 comes out of beta.

Again, thanks for providing a cool program!

Slugger 11-21-2009 10:38 AM

:MAXPROCS and :MAX settings replaced by :RESOURCES
 
Starting with snapshot 658, task parallelism is now controlled by the new :RESOURCES client option. Instead of controlling task parallelism at the task level, it is now controlled globally at the client level. This option replaces the :MAXPROCS and :MAX options.

Each task client that connects to an SJQ server has exactly 100 "resources" available to it (resources are simply an abstract commodity). When configuring a client, you tell it how many resources each task needs. A client will run tasks in parallel as long as it has the resources available to run a given task in the queue. All the rules about :RESOURCES are available here.

Some important notes:

Old client configs will work, but will use a default value of 100 for :RESOURCES (which means that all tasks will run serially on the client). In other words, you probably will want to explicitly set the :RESOURCES value for your tasks after upgrading.

Negative values are not allowed and will be reset to 100 if used.

A client will only ever have at most 100 resources free, so to temporarily prevent a client from running a task, set its required resources to a value greater than 100.

Though negative values aren't allowed, zero (0) is allowed. This is the way to tell a client that it can run an infinite number of a given task simultaneously.

Code:

SDCOMSKIP {
  :CPU "LOW"
  :RESOURCES 33
  "C:/comskip/comskip.exe \"%c%\""
}

HDCOMSKIP {
  :CPU "LOW"
  :RESOURCES 66
  "C:/comskip/comskip.exe \"%c%\""
}

METADATA {
  :CPU "LOW"
  :RESOURCES 1
  "java -jar C:/bmt/bmt.jar blah \"%c%\""
}

This example client config describes a client that can run 3 SDCOMSKIP tasks at once, or 1 HDCOMSKIP and 1 SDCOMSKIP, and so on. Note that under this scheme you gain better control: an HDCOMSKIP will not be able to run if more than 1 SDCOMSKIP is already running (two SDCOMSKIPs use 66 resources, leaving only 34, which is less than the 66 an HDCOMSKIP requires). With the new global resources setting, the possibilities are almost endless and one should be able to control task parallelism much more easily.
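As noted above, since a client never has more than 100 resources free, a task can be temporarily disabled on a client by requiring more than 100. For example (hypothetical task name and path):

Code:

DVDRIP {
  :CPU "LOW"
  :RESOURCES 101
  "C:/tools/dvdrip.exe \"%c%\""
}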

Slugger 11-22-2009 08:10 PM

Snapshot 659 includes the implementation of the internal _ALERT task, which allows the triggering of SageAlert alerts via SJQ through a ruleset:

Code:

if [FreeSpacePercentage < 10] {
  :ALERTLVL "WARN"
  :ALERTSUBJ "Recording space is low!"
  :ALERTTXT "Available recording space is below 10%!  Time for a new hard drive! :)"
  _ALERT
}

This particular example is perhaps a bad one, because SageAlert itself already monitors disk space as a native alert type, but nonetheless it shows how one can trigger alerts via SJQ (I'm actually quite interested to hear about use cases if people find uses for this feature).

Also see the previously implemented DONEALERT and FAILALERT task options that allow you to trigger alerts when a task completes successfully or fails.
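For instance, a task might combine these with its command along the lines of the sketch below. The option value here is a guess modeled on the other task options in this thread; see the docs for the exact :DONEALERT/:FAILALERT syntax.

Code:

COMSKIP {
  :CPU "LOW"
  :FAILALERT "TRUE"
  "C:/comskip/comskip.exe \"%c%\""
}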

With the project issue ticket queue completely empty, snapshot 659 is currently feature complete for SJQv3! All current testers are asked to upgrade to 659, put it to the test, and report back any issues ASAP. This would also be the time to come up with any other feature requests for v3. Barring any major issues, I'd really like to wrap up v3 and move it out of beta before Christmas, as I'm looking to start my next SageTV project shortly after the holidays (mid Jan) and would like to have SJQv3 wrapped up before the holiday season gets under way.

jakfech 11-23-2009 10:48 AM

SJQ3 is very stable
 
I am running 655 and will upgrade to the latest today. Overall I've found my system quite reliable, and I can't thank you enough for the work you've put into this. I had a problem with my clients shutting down (they run as Windows services) due to the JSON bug fixed in 655, but since I've upgraded my clients have been stable. I will report back if I have a problem with them.

The only major problem I'm currently having is with the server, and maybe this has already been addressed. I've found that SJQ will hang if a recording-based task is queued and I delete that recording before SJQ starts the task on a client. The task stays in "Started" forever. I am, however, able to delete the task successfully.

I know Java and would debug this further but I've just been so short on time.

kevine 11-23-2009 11:12 AM

Quote:

Originally Posted by jakfech (Post 393694)
The only major problem I'm currently having is with the server, and maybe this has already been addressed. I've found that SJQ will hang if a recording-based task is queued and I delete that recording before SJQ starts the task on a client. The task stays in "Started" forever. I am, however, able to delete the task successfully.

I have this exact same problem, too. However, I also have a problem where the server says a task is running but the client is using no CPU time whatsoever (hung). After killing it, the resources come back and another job gets queued. Currently I am only running comskip, but I thought I'd mention it. If it happens again, I will provide the log. Eventually I caught on and used the MAXRUNRATIO parameter.

It could have had something to do with how many comskips I had running, though. You see, I configured my file cleaner incorrectly and ended up deleting ALL of my comskip files! :soapbox: I really don't recommend doing this... I deleted all 650+ of them. So I was running 21 instances of comskip concurrently for 24 hours across my WHS and 3 other machines. However, my lowest concurrent count was my laptop at 3, and it had the same problem. I will let you know if I can reproduce any of the problems.

Oh, and I second the praise for the work you have put in. Thank you very much. This makes things much easier to configure.

Slugger 11-23-2009 11:13 AM

Quote:

Originally Posted by jakfech (Post 393694)
I am running 655 and will upgrade to the latest today. Overall I've found my system quite reliable, and I can't thank you enough for the work you've put into this. I had a problem with my clients shutting down (they run as Windows services) due to the JSON bug fixed in 655, but since I've upgraded my clients have been stable. I will report back if I have a problem with them.

The only major problem I'm currently having is with the server, and maybe this has already been addressed. I've found that SJQ will hang if a recording-based task is queued and I delete that recording before SJQ starts the task on a client. The task stays in "Started" forever. I am, however, able to delete the task successfully.

I know Java and would debug this further but I've just been so short on time.

Can you report back if you're still seeing this issue with the latest server and task client versions (server 659 and task client 655)?

Slugger 11-23-2009 11:19 AM

Tasks get stuck in the STARTED state if and only if the task client fails to report back that it started the task successfully, and that would only happen if the task client crashed after receiving a task from the server. Basically, if a task is stuck in the STARTED state then the first thing to check is whether the task client stopped running on the host that was assigned the task. If so, then it's a task client bug that needs to be tracked down.

It could also be a server side bug if the task client update produced a 500 error on the server, but I'd likely have seen 500 errors server side by now.

Please upgrade to server 659 and task client 655, and gather as many details as you can if you're still seeing these problems. If nothing else, a step-by-step guide on how to reproduce the scenario would be most beneficial.

kevine 11-23-2009 11:46 AM

Quote:

Originally Posted by Slugger (Post 393703)
It could also be a server side bug if the task client update produced a 500 error on the server, but I'd likely have seen 500 errors server side by now.

Here is one from this morning. I don't know if this helps but I can tell you it was an unattended problem because I was not working on anything at 5am. This client log happens to be on the server too.

Code:

Date: Mon Nov 23 05:05:38 EST 2009

HTTP ERROR 500

Problem accessing /sjq/SJQ. Reason:

    INTERNAL_SERVER_ERROR
Caused by:

java.lang.NullPointerException
at com.google.code.sagetvaddons.sjq.server.SJQServlet.doPost(SJQServlet.java:186)
at com.google.code.sagetvaddons.sjq.server.SJQServlet.doGet(SJQServlet.java:93)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:389)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.handler.rewrite.RewriteHandler.handle(RewriteHandler.java:230)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:322)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)

Here is the server log from the same time period:
Code:

2009-11-23 05:03:57,538 WARN  [SJQServlet]: Checking client '192.168.1.121': 4 resources free.
2009-11-23 05:03:57,538 WARN  [SJQServlet]: Client '192.168.1.121' cannot pop a task of type 'COMSKIP': Requires 16 resources, but only 4 available.
2009-11-23 05:03:57,538 WARN  [SJQServlet]: Client '192.168.1.121' cannot pop a task of type 'MCOMSKIP': Requires 16 resources, but only 4 available.
2009-11-23 05:03:57,632 INFO  [DataStore]: Flushed 1 log message(s) in 94ms
2009-11-23 05:04:00,304 INFO  [ClientParser]: MAPPING: D:\tv => \\elms-server\SageTV
2009-11-23 05:04:00,304 INFO  [ClientParser]: MAPPING: D:\videos => \\elms-server\videos
2009-11-23 05:04:00,304 INFO  [ClientParser]: MAPPING: D:\music => \\elms-server\music
2009-11-23 05:04:00,304 WARN  [SJQServlet]: Checking client '192.168.1.119': 0 resources free.
2009-11-23 05:04:00,304 WARN  [SJQServlet]: Client '192.168.1.119' cannot pop a task of type 'COMSKIP': Requires 25 resources, but only 0 available.
2009-11-23 05:04:00,304 WARN  [SJQServlet]: Client '192.168.1.119' cannot pop a task of type 'MCOMSKIP': Requires 25 resources, but only 0 available.
2009-11-23 05:04:45,304 INFO  [DataStore]: Flushed 5121 log message(s) in 407ms
2009-11-23 05:04:48,804 INFO  [ClientParser]: MAPPING: D:\tv => \\elms-server\SageTV
2009-11-23 05:04:48,804 INFO  [ClientParser]: MAPPING: D:\videos => \\elms-server\videos
2009-11-23 05:04:48,804 INFO  [ClientParser]: MAPPING: D:\music => \\elms-server\music
2009-11-23 05:04:48,804 WARN  [SJQServlet]: Checking client '192.168.1.101': 4 resources free.
2009-11-23 05:04:48,804 WARN  [SJQServlet]: Client '192.168.1.101' cannot pop a task of type 'COMSKIP': Requires 12 resources, but only 4 available.
2009-11-23 05:04:48,804 WARN  [SJQServlet]: Client '192.168.1.101' cannot pop a task of type 'MCOMSKIP': Requires 12 resources, but only 4 available.
2009-11-23 05:05:28,897 INFO  [DataStore]: Flushed 5121 log message(s) in 359ms
2009-11-23 05:05:38,366 INFO  [ClientParser]: MAPPING: D:\tv => \\elms-server\SageTV
2009-11-23 05:05:38,366 INFO  [ClientParser]: MAPPING: D:\videos => \\elms-server\videos
2009-11-23 05:05:38,366 INFO  [ClientParser]: MAPPING: D:\music => \\elms-server\music
2009-11-23 05:05:38,366 WARN  [SJQServlet]: Checking client '192.168.1.121': 20 resources free.
2009-11-23 05:05:38,632 INFO  [DataStore]: Flushed 56 log message(s) in 266ms
2009-11-23 05:05:39,225 INFO  [DataStore]: Flushed 580 log message(s) in 1156ms
2009-11-23 05:05:39,460 INFO  [DataStore]: Flushed 7 log message(s) in 1391ms


Slugger 11-23-2009 12:54 PM

Not sure why there's an NPE there, but snapshot 660 adds a bunch of logging to hopefully help track it down. Please upgrade to server 660 and paste the logs in next time you see this error.


All times are GMT -6.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2019, vBulletin Solutions Inc.
Copyright 2003-2005 SageTV, LLC. All rights reserved.