Problem or Use Case
The safe assumption about the reliability of nodes on a network is that they aren't reliable. This is especially true for the internet, and not untrue for local networks either, especially wifi, where interference, walking too far from the access point, and people tripping the power (or tripping over the power cable) can kill the network or individual nodes.
Crowdrender doesn't yet have a complete solution for checking whether nodes are still online and responding. It will detect if a node is late reporting that it has received a streamed edit, and will not render using that node until it has reported that it is synchronised, so some work in this direction is already done. However, if a long render job, especially an animation, were launched and one or more nodes went down, this would likely hang the system. We need a way to avoid this, and this is where better node management comes in.
Proposed feature
Nodes would be assumed to be unreliable: any command given to a node is expected to have failed unless there is evidence otherwise. This assumption, when properly implemented, will ensure that we can take appropriate action to mitigate render stalls and performance loss. For example, if a node doesn't respond in time to a call to render, we can give its job to another node and blacklist it for the rest of the job. This means that going to bed with a large render job in progress isn't quite so nerve-racking :)
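
To make the idea concrete, here is a rough Python sketch of the "failed unless proven otherwise" rule. It is not Crowdrender code: `dispatch_frames`, `try_render` and `timeout_s` are hypothetical names invented for illustration, and `try_render` stands in for whatever call would actually send a render command and wait for the node's confirmation.

```python
from collections import deque


def dispatch_frames(frames, nodes, try_render, timeout_s=30.0):
    """Assign frames to nodes, treating a node as failed unless it confirms.

    try_render(node, frame, timeout_s) is a hypothetical callable that must
    return True only if the node confirmed the render within timeout_s.
    Returns (results, blacklist), where results maps frame -> node.
    """
    available = deque(nodes)   # nodes still trusted for this job
    pending = deque(frames)    # frames not yet rendered
    blacklist = set()          # nodes excluded for the rest of the job
    results = {}

    while pending:
        if not available:
            raise RuntimeError(f"no nodes left; unrendered frames: {list(pending)}")
        frame = pending.popleft()
        node = available.popleft()
        if try_render(node, frame, timeout_s):
            results[frame] = node
            available.append(node)     # confirmed: rotate to the back of the queue
        else:
            blacklist.add(node)        # no evidence of success: drop the node
            pending.appendleft(frame)  # give the same frame to the next node
    return results, blacklist


# Simulated job: node "b" never answers in time.
def fake_try_render(node, frame, timeout_s):
    return node != "b"


results, blacklist = dispatch_frames([1, 2, 3, 4], ["a", "b", "c"], fake_try_render)
print(results)    # {1: 'a', 2: 'c', 3: 'a', 4: 'c'}
print(blacklist)  # {'b'}
```

The sketch is deliberately sequential to keep the rule visible. A real implementation would dispatch to nodes in parallel and would need to decide what counts as "evidence" (an acknowledgement, a heartbeat, a finished tile), which is left open here.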