.. _resource_control:
.. _resources:

===================
Resource Management
===================

There are several types of resources that Lifeblood manages:

* machine resources such as CPU cores, memory and devices
* global resources (NOT IMPLEMENTED YET) such as licenses available in total

When a task invocation is produced, a certain amount of resources is associated with it.
For example, one particular Karma render may require a minimum of 16 GB of RAM, one Karma license,
all the CPU cores a worker can give (but at least 4), and as many GPUs as the machine has.

Machine Resources
=================

Machine resources belong to a certain "hardware" (or machine).
A single machine can run multiple workers (usually through worker pools).

.. admonition:: Reminder

    | :ref:`Worker ` is the component of Lifeblood that actually does the work.
    | **One** worker can do only **one** invocation job at the same time.
    | Multiple workers can be (and usually are) run on the same machine to efficiently use all available resources.
    | :ref:`Worker pool` is yet another component of Lifeblood, responsible for managing workers on the same machine - that's what you actually launch on the machine(s) that you want to load with work.

    Please note, Lifeblood's **Worker Pool** concept is **radically different from Deadline's concept of worker pools** and means a completely different thing, so please don't let a familiar word confuse you.

There are 2 fundamental types of machine resources:

* numerical resources
* device resources

.. _numerical_resources:

Numerical Resources
-------------------

Numerical machine resources are just that: named values associated with a machine.
The number of CPU cores, the amount of RAM, local disk space - each of those is a numerical resource.
But a resource does not have to correspond to anything meaningful: "foo" is a numerical resource if you need it to be.
The number of cords coming out of the back of the server, the number of buttons on it, the number of sheep you counted while trying to fall asleep next to it - all of those can be numerical resources if you need them.

Currently the following resource types are defined:

* CPU cores
* RAM

Those resources are only special in that a worker can detect their values without them being defined in the config file.
In all other senses they are no different from the number of hamsters that supply power to the machine.

You will probably not need to define your own resources - CPU cores and RAM are usually enough for most tasks - but in case you need to, you can do that.

A job definition generated by a node has some resource requirements.
For numerical resources it contains 2 values: the **minimum** and the **preferred** amount of a resource.
The logic is the following: pick a worker that can provide the **minimum** value of a resource, but actually take up to the **preferred** amount.

If a machine runs several workers (the default case), those workers share the available resources.
So when one worker is assigned a job with some resources, all other workers on the same machine will have fewer resources available
(the amount assigned to that job is subtracted from the resources available to all other workers on the same machine).

So if there are 2 workers running on a machine that has 64 GB of RAM, and one takes a job that needs 48 GB of RAM,
then the second worker won't be able to take any job that requires more than 16 GB of RAM until the first worker is done.
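To make the **minimum** / **preferred** logic and resource sharing a bit more concrete, here is a small sketch of that behaviour in plain Python.
This is only an illustration: ``ResourceRequirement``, ``amount_to_assign`` and the resource pool dictionary are hypothetical names made up for this example, not Lifeblood's actual code or API.

.. code-block:: python

    # Illustrative sketch only: these names are hypothetical and do not
    # represent Lifeblood's internal implementation.
    from dataclasses import dataclass


    @dataclass
    class ResourceRequirement:
        name: str         # e.g. "cpu_count" or "ram"
        minimum: float    # a worker must have at least this much available
        preferred: float  # the job would like up to this much


    def amount_to_assign(req: ResourceRequirement, available: float) -> float:
        """Return how much of a numerical resource the job gets on this worker."""
        if available < req.minimum:
            raise ValueError(f"worker cannot satisfy the minimum '{req.name}' requirement")
        # take up to the preferred amount, but no more than is available,
        # and never less than the minimum
        return max(req.minimum, min(req.preferred, available))


    # workers on the same machine draw from one shared pool of resources
    machine_pool = {"cpu_count": 64.0, "ram": 64.0 * 2**30}

    ram_req = ResourceRequirement("ram", minimum=48 * 2**30, preferred=48 * 2**30)
    assigned = amount_to_assign(ram_req, machine_pool["ram"])
    machine_pool["ram"] -= assigned
    # every other worker on this machine now sees only 16 GB of RAM available
    print(machine_pool["ram"] / 2**30)  # -> 16.0

The example below walks through the same selection logic with concrete numbers.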
.. admonition:: Example

    A Karma node generates a job that requires:

    * CPU cores: minimum 2, preferably 64
    * RAM: minimum 4 GB, preferably 0

    The scheduler selects a worker that currently has 16 cores and 20 GB of RAM available,
    as this worker satisfies the minimum requirements of 2 cores and 4 GB of RAM.
    The scheduler will now assign all 16 cores to the job (the job preferred 64, but the worker only has 16)
    and assign 4 GB of RAM to the job.

    The worker is now busy with the job, but another free worker running on the **same machine**
    now has 0 CPU cores and 16 GB of RAM available, and can still be assigned some other jobs.

.. note::
    A job **is assigned** resources, but by default workers **WILL NOT ENFORCE** that resource usage.
    An invocation that was assigned 16 GB of RAM may end up actually taking 1 GB, or 100 GB of RAM.
    There are no **default** instruments to impose resource limits on running invocations, as that is very system-dependent;
    however, there is a mechanism to implement such enforcement.

.. _device_resources:

Device Resources
----------------

Devices are a special kind of machine resource.
Unlike numerical resources, each device a machine defines is unique and special.
Each device has its own numerical resources, and each device has a predefined type.

The scheduler declares device type definitions.
A device type definition defines what numerical resources a device of that type has.

.. admonition:: Example

    A device type 'gpu' may have such resources defined as:

    - memory size
    - supported OpenCL version
    - supported CUDA Capability version

A worker declares what devices it has; since devices are arbitrary, it's up to the user to define them in the worker configuration.
A worker, for example, may define 3 devices of type 'gpu': "GeForce GVFTX 999080", "gpu1" and "sheep".
Why "sheep"? Because these names can be arbitrary - they can be whatever makes more sense for the use case.

A job definition generated by a node can have some device requirements.
A device requirement consists of the type of device needed, and optionally some **minimum** requirements on that device's resources.
There are no **preferred** requirements for device resources.
The scheduler will find a worker that has a device of the specified type with the required resources, and it will assign **the whole** device to that job.
Once again, a device is assigned to a job **as a whole**: unlike numerical resources, parts of a device cannot be assigned to multiple jobs.
So if a job requires a device of type 'gpu' with 1 GB of memory, and a worker has a GPU with 12 GB - the job will get that whole device assigned.

.. note::
    So far we have been using the 'gpu' device type as an example, but devices are not tied to anything real:
    'gpu' is just one possible device type, and so are 'network interface', 'modem' or 'public web address' -
    it is up to you and your needs to fill those names with purpose.
    But let's return to the gpu example, as it is probably the most useful one for most cases.

A job running on a worker can get the list of devices assigned to it through the ``lifeblood_connection`` api.
But even if a node knows how to deal with some kind of devices, it is still unclear how to match abstract device names to actual hardware, or other entities.
For that, a device can have **tags**.

For example, all Houdini nodes can use devices of type 'gpu', but how can a job know that the device name 'gpu1a' corresponds to a GeForce 2080,
while the device name 'gpu2' corresponds to a second gpu that is some AMD model?
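As a rough sketch of how a job could make use of such tags, consider the snippet below.
The data shape is hypothetical and is **not** the actual ``lifeblood_connection`` api; the ``houdini_ocl`` tag used here is explained just below.

.. code-block:: python

    # Illustrative sketch only: the data shape is hypothetical,
    # this is NOT the actual lifeblood_connection api.

    # a job could see its assigned devices as abstract names plus their tags:
    assigned_devices = [
        {"type": "gpu", "name": "gpu1", "tags": {"houdini_ocl": "GPU::0", "karma_dev": "0/1"}},
    ]


    def find_tag(devices, dev_type, tag):
        """Return the value of a tag on the first assigned device of the given type."""
        for dev in devices:
            if dev["type"] == dev_type and tag in dev["tags"]:
                return dev["tags"][tag]
        return None


    # the abstract name 'gpu1' means nothing to the render process itself,
    # but the tag value carries the information the process actually needs:
    print(find_tag(assigned_devices, "gpu", "houdini_ocl"))  # -> "GPU::0"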
Houdini deals with OpenCL devices in its own unique way, and to deal with that, jobs created by Houdini nodes will look for a specific ``houdini_ocl`` tag on a device.
That tag has the form of ``<device type>:<vendor>:<device number>``, and it will help the job to configure the Houdini process it starts to use the correct GPU for OpenCL, in terms specific to Houdini only.

To read which node can use what type of devices and what tags it expects, refer to each node's manual, like, for example, :ref:`hip driver renderer ` or :ref:`karma `.

How to define GPU devices for your workers
++++++++++++++++++++++++++++++++++++++++++

By default, Lifeblood Scheduler's config comes with a gpu definition that has just 3 resources:

- memory size
- supported OpenCL version
- supported CUDA Capability version

(you can expand this definition in the scheduler's config file)

To define the gpus of a machine, locate and edit its Lifeblood worker's :ref:`config `.

Here is an example worker configuration defining 2 gpu devices:

.. code-block:: toml

    [devices.gpu.gpu1.resources]
    mem = "4G"
    opencl_ver = 3.0
    cuda_cc = 5.0

    [devices.gpu.gpu1.tags]
    houdini_ocl = "GPU::0"
    karma_dev = "0/1"

    [devices.gpu.gpu2.resources]
    mem = "1G"
    opencl_ver = 1.0

    [devices.gpu.gpu2.tags]
    houdini_ocl = "GPU:Intel(R) Corporation"

The worker config above defines that the machine has 2 devices of type 'gpu', named 'gpu1' and 'gpu2'.
The first has 4 GB of memory, the second has 1 GB.
The first supports OpenCL 3.0 and CUDA compute capability 5.0, the second - OpenCL 1.0 and CUDA 0.0 (meaning no CUDA support).

In the tags sections you can see that 'gpu1' has ``houdini_ocl = "GPU::0"`` to identify this device for Houdini nodes,
and ``karma_dev = "0/1"`` to identify it for the Karma node specifically
(for details about how to set those tags, look in the corresponding nodes' manual pages).
The second device has only the Houdini OpenCL tag ``houdini_ocl = "GPU:Intel(R) Corporation"``, as Karma does not support this gpu (it's an integrated one).
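Finally, to tie this configuration back to the device requirement rules described earlier, here is an illustrative sketch of how a scheduler could decide which of the two devices above satisfies a job's device requirement.
The structures and the ``parse_mem`` helper are hypothetical, not Lifeblood's actual code, and the interpretation of ``"4G"``-style values as gigabytes is an assumption for this example.

.. code-block:: python

    # Illustrative sketch only: hypothetical structures, not Lifeblood's actual code.

    def parse_mem(value: str) -> int:
        """Parse strings like '4G' or '512M' into bytes (assumed interpretation of the config values)."""
        suffixes = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}
        if value and value[-1].upper() in suffixes:
            return int(float(value[:-1]) * suffixes[value[-1].upper()])
        return int(value)


    # the two devices from the worker config above, as the scheduler could see them
    devices = {
        "gpu1": {"mem": parse_mem("4G"), "opencl_ver": 3.0, "cuda_cc": 5.0},
        "gpu2": {"mem": parse_mem("1G"), "opencl_ver": 1.0, "cuda_cc": 0.0},
    }

    # a job requires a device of type 'gpu' with at least 2 GB of memory and OpenCL 2.0
    requirement = {"mem": 2 * 2**30, "opencl_ver": 2.0}

    matching = [
        name for name, res in devices.items()
        if all(res.get(key, 0) >= minimum for key, minimum in requirement.items())
    ]
    print(matching)  # -> ['gpu1']; the whole 'gpu1' device would be assigned to the job

Only 'gpu1' meets both minimums, and as described above, it would then be assigned to the job as a whole device, regardless of how much of its memory the job actually uses.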