.. _celery-intro:

++++++++++++++++++++++++++++++++++++++++++++++++++
``celeryconfig.py`` - Task distribution via Celery
++++++++++++++++++++++++++++++++++++++++++++++++++

.. warning::

   TRAP now runs in parallel on a single multi-core machine by default,
   using the standard multiprocessing functionality. As such, you should
   only use Celery if you want to distribute a single job over multiple
   machines.

`Celery <http://www.celeryproject.org/>`_ provides a mechanism for distributing
tasks over a cluster of compute machines by means of an "asynchronous task
queue". This means that users submit jobs to a centralised queueing system (a
"broker"), and then one or more worker processes collect and process each job
from the queue sequentially, returning the results to the original submitter.

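
The broker-and-workers pattern described above can be sketched with nothing but
the Python standard library. This is an illustration of the concept only, not
of Celery's API: jobs go onto a shared queue (the "broker"), worker threads
consume them, and results come back to the submitter.

```python
import queue
import threading

# The "broker": a shared queue of jobs, plus a queue for results.
jobs = queue.Queue()
results = queue.Queue()

def worker():
    """Consume jobs from the queue until a None sentinel arrives."""
    while True:
        task = jobs.get()
        if task is None:          # sentinel: shut this worker down
            break
        func, arg = task
        results.put(func(arg))    # "return" the result to the submitter
        jobs.task_done()

# Start two workers; Celery workers would be separate processes or machines.
threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()

# The submitter enqueues jobs rather than calling the function directly.
for n in range(5):
    jobs.put((lambda x: x * x, n))

jobs.join()                       # wait until every job has been processed
for _ in threads:
    jobs.put(None)                # stop the workers
for t in threads:
    t.join()

squares = sorted(results.get() for _ in range(5))
print(squares)                    # [0, 1, 4, 9, 16]
```

Celery adds persistence, multi-machine transport, and result backends on top of
this basic shape, but the division of labour is the same.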
Celery is a flexible but complex system, and the details of its configuration
fall outside the scope of this document. The user is instead referred to the
`Celery documentation <http://docs.celeryproject.org/>`_. Here, we provide
only a brief explanation.

If you would like to take advantage of the task distribution system, you will
need to set up a broker and one or more workers which will process tasks from
it. There are a number of `different brokers available
<http://docs.celeryproject.org/en/latest/getting-started/brokers/>`_, each
with its own pros and cons: `RabbitMQ <http://www.rabbitmq.com/>`_ is a fine
default choice.

Workers can be started by using the ``celery worker`` subcommand of the
``trap-manage.py`` script. Indeed, ``trap-manage.py`` provides a
convenient way of interfacing with a variety of Celery subcommands: try
``trap-manage.py celery -h`` for more information.

When you start a worker, you will need to configure it to connect to an
appropriate broker. If you are using the ``trap-manage.py`` script, you can
configure the worker through the :ref:`celeryconfig.py <celeryconfig_py>` file
in your project folder: set the ``BROKER_URL`` variable appropriately. Note
that if you are running the broker and a worker on the same host with a
standard configuration, the default value should be fine.

Note that a single broker and set of workers can be used by multiple different
pipeline users. If running on a shared system, it is likely sensible to
regard the broker and workers as a "system service" that all users can access,
rather than having each user try to run their own Celery system.

Note also that a worker loads all the code necessary to perform its
tasks into memory when it is initialized. If the code on disk changes after
this point (for example, if a bug is fixed in the TraP installation), the
worker *will continue executing the old version of the code* until it is
stopped and restarted. If, for example, you are using a "daily build" of the
TraP code, you will need to restart your workers after each build to ensure
they stay up-to-date.

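
This stale-code behaviour is ordinary Python module semantics, and can be
demonstrated with the standard library alone. The module name ``tasks`` below
is hypothetical, not part of TraP:

```python
import importlib.util
import pathlib
import sys
import tempfile

# Demonstrates why a worker must be restarted to pick up code changes:
# Python loads module code into memory at import time, and the in-memory
# version keeps running however much the file on disk changes.

sys.dont_write_bytecode = True   # keep the demo free of stale .pyc caches

with tempfile.TemporaryDirectory() as d:
    mod_path = pathlib.Path(d) / "tasks.py"
    mod_path.write_text("def run():\n    return 'old behaviour'\n")

    # Import the module, as a worker does at startup.
    spec = importlib.util.spec_from_file_location("tasks", mod_path)
    tasks = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(tasks)
    first = tasks.run()              # 'old behaviour'

    # A bug fix lands on disk while the "worker" keeps running...
    mod_path.write_text("def run():\n    return 'new behaviour'\n")
    second = tasks.run()             # still 'old behaviour'!

    # Only re-executing the module (cf. restarting the worker) picks it up.
    spec.loader.exec_module(tasks)
    third = tasks.run()              # 'new behaviour'

print(first, second, third)
```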

Finally, always bear in mind that it is possible to disable the whole task
distribution system and run the pipeline in a single process. This is simpler
to set up, and likely simpler to debug in the event of problems, but keep in
mind that a running broker is still required. To enable this mode, simply edit
your ``celeryconfig.py`` file and ensure it contains the (uncommented) line::

   CELERY_ALWAYS_EAGER = CELERY_EAGER_PROPAGATES_EXCEPTIONS = True
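
Because ``celeryconfig.py`` is plain Python, you can check which settings a
given file actually defines by executing it with the standard library's
``runpy``. This is a minimal sketch: the one-line configuration used here is
hypothetical, not the full TraP template.

```python
import pathlib
import runpy
import tempfile

# Hypothetical one-line celeryconfig.py enabling serial ("eager") mode;
# the real TraP template contains more settings than this.
config_text = "CELERY_ALWAYS_EAGER = CELERY_EAGER_PROPAGATES_EXCEPTIONS = True\n"

with tempfile.TemporaryDirectory() as d:
    path = pathlib.Path(d) / "celeryconfig.py"
    path.write_text(config_text)
    # A Celery config file is ordinary Python: runpy executes it and
    # returns the resulting namespace as a dict of settings.
    settings = runpy.run_path(str(path))

print(settings["CELERY_ALWAYS_EAGER"])   # True
```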

Run Celery workers
==================

If you want to parallelize TraP operations using Celery, you need to run a
separate Celery worker. This worker will receive jobs from a broker, so it is
assumed you installed and started a broker in the installation step. Start a
Celery worker by running::

   % trap-manage.py celery worker

If you want to increase the log level, add ``--loglevel=info`` or maybe even
``--loglevel=debug`` to the command. If you don't want to use a Celery worker
(that is, to run the pipeline in serial mode), uncomment this line in the
``celeryconfig.py`` file in your pipeline directory::

   #CELERY_ALWAYS_EAGER = CELERY_EAGER_PROPAGATES_EXCEPTIONS = True

Note that a running broker is still required.

.. _celeryconfig_py:

Celery Configuration File
=========================

The ``trap-manage.py`` management script may be used to start a
:ref:`Celery <celery-intro>` worker. The worker is configured using the file
``celeryconfig.py`` in the project directory. The default contents of this
file are:

.. literalinclude:: /../tkp/config/project_template/celeryconfig.py

Note that this file is Python code, and will be parsed as such. In fact, it is
a fully-fledged Celery configuration file, and the reader is referred to the
`main Celery documentation
<http://docs.celeryproject.org/en/latest/configuration.html>`_ for a complete
reference. Here, we highlight just the important parameters defined in the
default configuration.

Note the line::

   #CELERY_ALWAYS_EAGER = CELERY_EAGER_PROPAGATES_EXCEPTIONS = True

By uncommenting this line (removing the initial ``#``), the pipeline is forced
to run in serial mode. That is, tasks are executed sequentially by a single
Python process. No broker and no workers are required. This will likely have a
significant impact on performance, but makes the system simpler and easier to
debug in the event of problems.

The line::

   BROKER_URL = CELERY_RESULT_BACKEND = 'amqp://guest@localhost//'

specifies the URL of the Celery broker, which is also the location to which
workers will return their results. Various different types of broker are
available (see our :ref:`introduction to Celery <celery-intro>` for
suggestions), and they must be configured and started independently of the
pipeline: the appropriate URL to use will therefore depend on the
configuration chosen for your local system.

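
As a sketch of how such a URL decomposes, the default ``amqp`` URL above can
be picked apart with the standard library. With an AMQP transport, the
trailing ``//`` in the path encodes the RabbitMQ default virtual host ``/``:

```python
from urllib.parse import urlparse

# Decompose the default broker URL from celeryconfig.py; the path
# component ('//' here) encodes the virtual host "/".
url = urlparse('amqp://guest@localhost//')

print(url.scheme)    # 'amqp'  -> transport protocol
print(url.username)  # 'guest' -> broker login (no password set)
print(url.hostname)  # 'localhost'
print(url.path)      # '//'
```

If your broker runs on a different host, or with different credentials, the
corresponding pieces of the URL are what you change in ``BROKER_URL``.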
The other parameters in the file -- ``CELERY_IMPORTS`` and
``CELERYD_HIJACK_ROOT_LOGGER`` -- should be left set to their default values.