Restart configuration for processes

This document explains how to manage Python processes such as daemons, web apps, or any other code when errors occur. It aims at a generic back-off configuration for ECS and supervisor processes.

Problem Description

Currently we prevent processes from stopping. This has caused incidents in which the connection to the DB was broken and several daemons kept retrying their actions without being able to connect. The fix was to restart the daemon, which refreshed the connection and made it work again.

Re-creating the process when it fails for unknown reasons can solve some scenarios:

Configuration refresh: Reloading all configuration from the infrastructure may solve a problem caused by a configuration change that a long-running process never picked up.
Connection refresh: Closing and reopening connections when something fails may fix the issue if the connection itself was broken.

Background

In old services, processes run with supervisor; in new ones, they run on AWS ECS.

ECS tasks: If tasks for an ECS service repeatedly fail to enter the RUNNING state (progressing directly from PENDING to STOPPED), then the time between subsequent restart attempts is incrementally increased up to a maximum of 15 minutes. Tasks that fail immediately due to command errors do not trigger the throttle or the service event message.

Supervisor tasks: Because of their configuration and deployment process, supervisor tasks do not go down. Configuration example:

[program:daemon_recon_client_funds_consumer]
command=/srv/bos/scripts/celery/%(program_name)s.sh
user=root
numprocs=3
process_name=%(program_name)s-%(process_num)s
autostart=false
autorestart=true
redirect_stderr=False
stopsignal=INT
stdout_logfile=/var/log/supervisor/%(program_name)s-%(process_num)s.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
stdout_capture_maxbytes=5MB
stderr_logfile=/var/log/supervisor/%(program_name)s-%(process_num)s.err.log
stderr_logfile_maxbytes=5MB
stderr_logfile_backups=10
stderr_capture_maxbytes=5MB
stopwaitsecs=120
killasgroup=true

Solution

We want to let exceptions propagate.

import logging

try:
    main()
except Exception:
    # Log through the logging pipeline first, then re-raise unchanged.
    logging.exception('Something went wrong')
    raise

It is important to log the error before letting the daemon go down, because the infrastructure is sometimes unable to parse the log output of a raw, unhandled exception. So log the exception and re-raise it.
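The log-then-re-raise pattern can be wrapped in a small helper. This is a minimal sketch, not the project's actual entry point: `run`, `failing_main`, and the logger name `daemon` are hypothetical names chosen for illustration.

```python
import logging

logger = logging.getLogger("daemon")  # hypothetical logger name


def run(main):
    """Run main(), log any unexpected exception, then let it propagate."""
    try:
        main()
    except Exception:
        # Log with the traceback through the logging pipeline first, so the
        # error reaches the logs in the configured format.
        logger.exception("Unhandled error, the process will go down")
        # A bare raise keeps the original traceback. If nothing else catches
        # it, the interpreter exits with a non-zero status.
        raise


def failing_main():
    # Hypothetical entry point that simulates a broken DB connection.
    raise ConnectionError("database connection lost")
```

Calling `run(failing_main)` at the bottom of a daemon script would log the traceback and then terminate the process, which is the point where supervisor or ECS takes over.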

You may want to restart the daemon only on certain exceptions, either because you know exactly which exceptions call for a restart or because the cost of restarting is too high. Evaluate this for each process and decide whether you need it.

Don't always let the process go down. Handle as many exceptions as you can when you know that letting the exception propagate will not solve the problem.

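try:
    main()
except ConcurrencyException:
    # Expected and transient: a restart would not help.
    logging.warning('Cannot acquire lock, we will try again')
except ValidationError:
    # A restart cannot fix a bad message.
    logging.exception('The message is invalid')
except Exception:
    # Unknown error: log it and let the process go down.
    logging.exception('Something went wrong')
    raise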

This is only an example; each process needs its own set of handled exceptions.
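For a daemon that works through a batch of messages, the same idea can be applied per message, so that one recoverable failure does not abort the whole batch. The sketch below is illustrative only: the two exception classes are placeholders standing in for the real ones, and `process_all` and `handle` are hypothetical names.

```python
import logging

logger = logging.getLogger("daemon")  # hypothetical logger name


class ConcurrencyException(Exception):
    """Placeholder: the lock is held by another worker."""


class ValidationError(Exception):
    """Placeholder: the message cannot be processed."""


def process_all(messages, handle):
    """Process each message and return how many succeeded."""
    done = 0
    for message in messages:
        try:
            handle(message)
            done += 1
        except ConcurrencyException:
            # Transient: a restart would not help, skip this message for now.
            logger.warning("Cannot acquire lock for %r, skipping", message)
        except ValidationError:
            # A restart cannot fix a bad message.
            logger.exception("Message %r is invalid, dropping it", message)
        except Exception:
            # Unknown error: log it and let the process go down.
            logger.exception("Unexpected error on %r, going down", message)
            raise
    return done
```

Only the final `except` branch ends the process; the first two keep the loop going.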

Supervisor config

This is a bit trickier: if we let processes restart indefinitely, resource consumption during start-up is high. So we need supervisor to stop retrying at some point. Two settings control this:

startretries: The number of failed start attempts supervisor allows before giving up and marking the process as FATAL.
startsecs: The number of seconds a process must stay up after start-up to be considered RUNNING.

Both need to be set. Otherwise startsecs keeps its default of 1 second, so a process is considered RUNNING almost as soon as it starts, and startretries is not taken into account if the process fails after that. More info in the Supervisor docs.

Example:

[program:daemon_transaction_monitoring_sending_request]
command=/srv/bos/scripts/celery/%(program_name)s.sh
user=root
numprocs=1
process_name=%(program_name)s-%(process_num)s
autostart=false
autorestart=true
startsecs=120
startretries=5
redirect_stderr=False
stopsignal=INT
stdout_logfile=/var/log/supervisor/%(program_name)s-%(process_num)s.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
stdout_capture_maxbytes=5MB
stderr_logfile=/var/log/supervisor/%(program_name)s-%(process_num)s.err.log
stderr_logfile_maxbytes=5MB
stderr_logfile_backups=10
stderr_capture_maxbytes=5MB
stopwaitsecs=120
killasgroup=true
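To make the interaction between startsecs and startretries concrete, here is a deliberately simplified Python model of one start cycle. It is not supervisor's code: `final_state` is a hypothetical helper, the model ignores the delay between retries, and it assumes that startretries counts retries after the first attempt, which may not match supervisor's exact counting.

```python
def final_state(run_durations, startsecs, startretries):
    """Simplified model of how supervisor classifies a starting process.

    run_durations holds the seconds each successive start attempt survives
    before exiting (float("inf") means the attempt never exits).
    """
    failures = 0
    for duration in run_durations:
        if duration >= startsecs:
            # The attempt stayed up long enough: the start succeeded.
            return "RUNNING"
        failures += 1
        if failures > startretries:
            # Assumption: one initial attempt plus startretries retries.
            return "FATAL"
    # Still retrying when the observed attempts ran out.
    return "BACKOFF"


# A process that always crashes after 10 seconds:
crashing = [10] * 20
print(final_state(crashing, startsecs=120, startretries=5))  # FATAL
print(final_state(crashing, startsecs=1, startretries=5))    # RUNNING
```

With a small startsecs the crashing process counts as started on its first attempt, so the retry limit never applies and autorestart keeps bringing it back.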

ECS config

Tasks that fail immediately due to command errors do not trigger the throttle or the service event message, so you should avoid raising an error by default in ECS.

Instead, restart the task only when you know a restart will certainly fix the issue.
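One way to express this in code is to invert the default: keep running on unknown errors and go down only on an explicit list of errors that a fresh process is known to fix. This is a sketch under assumptions: `RESTART_FIXES`, `run_forever`, and `step` are hypothetical names, and treating `ConnectionError` as restartable is an example choice, not a rule.

```python
import logging

logger = logging.getLogger("ecs_task")  # hypothetical logger name

# Assumption: errors for which a fresh process (new connections, reloaded
# configuration) is known to help. Adjust per service.
RESTART_FIXES = (ConnectionError,)


def run_forever(step, max_iterations=None):
    """Call step() repeatedly; only errors in RESTART_FIXES end the process."""
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        iteration += 1
        try:
            step()
        except RESTART_FIXES:
            logger.exception("A restart is expected to fix this, going down")
            raise
        except Exception:
            # Unknown error: log it and keep the task alive.
            logger.exception("A restart would not help, continuing")
    return iteration
```

The `max_iterations` parameter exists only to make the loop testable; a real task would leave it at `None`.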

Alternatives

Leave the process stopped.
- PRO: You can check what is actually happening and start the process again once it is fixed.
- CON: Requires manual steps.
Continue with a broad except and don't let the process go down.
- PRO: The processes are simpler, as developers don't have to worry about how the back-off system works in the infrastructure.
- CON: You can get locked in an infinite failure loop and never recover if you don't have active monitoring (we currently don't).
Celery periodic task as an infinite loop with two tasks running.
- PRO: Celery already manages the exceptions for you.
- CON: You have to work around Celery by using locks, continuously queueing tasks, and running two tasks that consume from the queue, to avoid flooding memory with tasks in the queue.
Celery alternative solutions.
- PRO: Celery already manages the exceptions for you.
- CON: Celery is based on queue consumption and Celery periodic tasks are cron based. Celery in this version is not the proper tool for this, so you will always need to work around the default behavior.

Caveats

Some processes handle errors by themselves, like gunicorn or Celery tasks. We cannot control how the process goes down in these cases, because the framework is in charge of it.

Sometimes, because you are looping over objects or for performance reasons, you don't want your process to crash, as it would leave the work half finished.

Restarting processes indefinitely can be dangerous in terms of resource consumption. Some services, like BOS, use a lot of resources and run lots of queries during start-up. A process stuck in an infinite restart loop can therefore bring the machine down or cost a lot of money.

Operation

N/A

Security Impact

N/A

Performance Impact

Starting processes can be expensive, either because of AWS data consumption or because the process initializes resources on start-up, so consider the following in your configuration where possible:

Start delay: Leave the process stopped for some time to see whether time alone fixes the issue.
Exponential back-off: Same as above, but waiting longer on each iteration.
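Both options can be described by one schedule of waiting times. The sketch below is an illustration, not an existing infrastructure feature: `backoff_delays` and `start_with_backoff` are hypothetical helpers, and the 900-second cap is an assumption that mirrors the 15-minute ECS maximum mentioned in the Background section.

```python
import time


def backoff_delays(attempts, base=1.0, factor=2.0, cap=900.0):
    """Return the wait in seconds after each failed start attempt.

    factor=1.0 gives a constant start delay; factor > 1 gives exponential
    back-off. cap bounds the longest wait.
    """
    return [min(cap, base * factor ** n) for n in range(attempts)]


def start_with_backoff(start, attempts, sleep=time.sleep, **backoff):
    """Call start() until it succeeds, waiting longer after each failure."""
    last_error = None
    for delay in backoff_delays(attempts, **backoff):
        try:
            return start()
        except Exception as exc:
            last_error = exc
            sleep(delay)
    raise RuntimeError("gave up after %d attempts" % attempts) from last_error


print(backoff_delays(5))               # [1.0, 2.0, 4.0, 8.0, 16.0]
print(backoff_delays(3, factor=1.0))   # [1.0, 1.0, 1.0]
```

The `sleep` parameter is injectable so the schedule can be exercised in tests without actually waiting.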

Developer Impact

Developers will need to know how the infrastructure works and what this back-off system means for their development.

Developers will need to define which exceptions must be handled and which should restart the process. This can add extra code or extra design effort when creating new processes.

Data Consumer Impact

N/A

Deployment

This is mainly an infrastructure recommendation, so SRE teams should be aware of it, enforce it, and standardize it. The infrastructure should also match the back-off system described here.

Dependencies

N/A
