celeryconfig.py - Task distribution via Celery

Warning

TraP now runs in parallel on a single multi-core machine by default, using the standard multiprocessing functionality. As such, you should only use Celery if you want to distribute a single job over multiple machines.

Celery provides a mechanism for distributing tasks over a cluster of compute machines by means of an “asynchronous task queue”. This means that users submit jobs to a centralised queueing system (a “broker”), and then one or more worker processes collect and process each job from the queue sequentially, returning the results to the original submitter.

Celery is a flexible but complex system, and the details of its configuration fall outside the scope of this document. The user is, instead, referred to the Celery documentation. Here, we provide only a brief overview.

If you would like to take advantage of the task distribution system, you will need to set up a broker and one or more workers which will process tasks from it. There are a number of different brokers available, each with their own pros and cons: RabbitMQ is a fine default choice.
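For example, on many Linux systems RabbitMQ can be installed from the system package manager and started with something like the following (the exact commands depend on your distribution and on how RabbitMQ was installed):

% sudo rabbitmq-server -detached
% sudo rabbitmqctl status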

Workers can be started by using the celery worker option to the trap-manage.py script. Indeed, trap-manage.py provides a convenient way of interfacing with a variety of Celery subcommands: try trap-manage.py celery -h for information.
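For example, assuming the standard Celery status subcommand is passed through by trap-manage.py in the same way, you can list the workers currently connected to the broker with:

% trap-manage.py celery status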

When you start a worker, you will need to configure it to connect to an appropriate broker. If you are using the trap-manage.py script, you can configure the worker through the file celeryconfig.py in your project folder: set the BROKER_URL variable appropriately. Note that if you are running the broker and a worker on the same host with a standard configuration, the default value should be fine.
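For example, to connect a worker to a RabbitMQ broker running on a different machine, you might set something along these lines in celeryconfig.py (the username, password and hostname here are placeholders for your own broker's details):

BROKER_URL = CELERY_RESULT_BACKEND = 'amqp://myuser:mypassword@broker.example.com:5672//'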

Note that a single broker and set of workers can be used by multiple different pipeline users. If running on a shared system, it is likely sensible to regard the broker and workers as a “system service” that all users can access, rather than having each user try to run their own Celery system.

Note also that a worker loads all the necessary code to perform its tasks into memory when it is initialized. If the code on disk changes after this point (for example, if a bug is fixed in the TraP installation), the worker will continue executing the old version of the code until it is stopped and restarted. If, for example, you are using a “daily build” of the TraP code, you will need to restart your workers after each build to ensure they stay up-to-date.

Finally, always bear in mind that it is possible to disable the whole task distribution system and run the pipeline in a single process. This is simpler to set up, and likely simpler to debug in the event of problems, but keep in mind that a running broker is still required. To enable this mode, simply edit your celeryconfig.py file and ensure it contains the (uncommented) line:

CELERY_ALWAYS_EAGER = CELERY_EAGER_PROPAGATES_EXCEPTIONS = True

Run Celery workers

If you want to parallelize TraP operations using Celery, you need to run a separate Celery worker. This worker will receive jobs from a broker, so it is assumed that you installed and started a broker during the installation step. Start a Celery worker by running:

% trap-manage.py celery worker

If you want to increase the log level, add --loglevel=info (or even --loglevel=debug) to the command; an example is given below. If you don't want to use a Celery worker (that is, to run the pipeline in serial mode), uncomment this line in the celeryconfig.py file in your pipeline directory:

#CELERY_ALWAYS_EAGER = CELERY_EAGER_PROPAGATES_EXCEPTIONS = True

Note that a running broker is still required.
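For example, a worker with more verbose logging than the default can be started with:

% trap-manage.py celery worker --loglevel=info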

Celery Configuration File

The management script may be used to start a Celery worker. The worker is configured using the file celeryconfig.py in the project directory. The default contents of this file are:

# TraP Celery Configuration

# This file uses the standard Celery configuration system.
# Please refer to the URL below for full documentation:
# http://docs.celeryproject.org/en/latest/configuration.html

# Uncomment the below for local use; that is, bypassing the task distribution
# system and running all tasks in serial in a single process. No broker or
# workers are required.
#CELERY_ALWAYS_EAGER = CELERY_EAGER_PROPAGATES_EXCEPTIONS = True

# Prevents issues with using a separate threading.Thread in addition to Celery.
CELERYD_FORCE_EXECV = True

# Otherwise, configure the broker to which workers should connect and to which
# they will return results. This must be started independently of the
# pipeline.
BROKER_URL = CELERY_RESULT_BACKEND = 'amqp://guest@localhost//'

# This is used when you run a worker.
CELERY_IMPORTS = ("tkp.distribute.celery.tasks", )

Note that this file is Python code, and will be parsed as such. In fact, it is a fully-fledged Celery configuration file, and the reader is referred to the main Celery documentation for a complete reference. Here, we highlight just the important parameters defined in the default configuration.

Note the line:

#CELERY_ALWAYS_EAGER = CELERY_EAGER_PROPAGATES_EXCEPTIONS = True

By uncommenting this line (removing the initial #), the pipeline is forced to run in serial mode. That is, tasks are executed sequentially by a single Python process; no broker and no workers are required. This will likely reduce performance significantly, but it makes the system simpler and easier to debug in the event of problems.
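To illustrate what the setting does, the following minimal sketch uses a toy add task (not part of TraP, and assuming the same pre-4.0 style Celery setting names used above): with CELERY_ALWAYS_EAGER enabled, calling .delay() runs the task immediately in the calling process rather than submitting it to a broker.

from celery import Celery

app = Celery('eager_sketch')
app.conf.CELERY_ALWAYS_EAGER = True
app.conf.CELERY_EAGER_PROPAGATES_EXCEPTIONS = True  # re-raise task errors locally

@app.task
def add(x, y):
    return x + y

result = add.delay(2, 3)  # executed inline; no broker or worker is contacted
print(result.get())       # prints 5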

The line:

BROKER_URL = CELERY_RESULT_BACKEND = 'amqp://guest@localhost//'

specifies the URL of the Celery broker, which is also the location to which workers will return their results. Various different types of broker are available (see our introduction to Celery for suggestions), and they must be configured and started independently of the pipeline: the appropriate URL to use will therefore depend on the configuration chosen for your local system.
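For instance, if you chose to run a Redis broker instead of RabbitMQ, the equivalent setting would look something like the following (host, port and database number are placeholders, and Redis support for Celery must be installed):

BROKER_URL = CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'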

The other parameters in the file – CELERY_IMPORTS and CELERYD_FORCE_EXECV – should be left set to their default values.