Data aggregation and calculation in Postgres

This proposal shows how we can move expensive calculations that require multiple queries to multiple models away from the Python code and into Postgres. We will use calculating balances as an example to show how this would be done.

Problem Description

We have a number of models that all contain similar information, but are separate tables and do not have a consistent data shape. Examples: assets and recon matches both describe movements of money but require different queries to see the resulting movement, and the trade models all have a common set of data but are in different tables and must be queried separately using the ORM.

If we need to perform calculations, we have to query each model and do the calculations in Python. This leads to code that is hard to refactor and does many queries to get a simple thing such as a balance.

For example, BalancesService has an API that requires we calculate client balances for one currency at a time. Because there is no central data source for the trade information, all the queries to the models are written to look up amounts for single currencies and work out the balances. This is very slow, and for some clients can result in over 1000 queries just to get their balances. Refactoring this to support multiple currencies would be a lot of work, and then if we wanted to extend to get balances for multiple clients, it would mean an even more abstract API or multiple code paths for bulk and single clients, which would either duplicate code or become difficult to read). It would be very easy to make a mistake and make it run slower on large datasets without realising it. This is already the case on the client screen where we have clients with many currency accounts.

Solution

When we have similar data read from multiple models (example: calculating a balance from SpotDeal, Drawdown, and FXSettlement) we should have a view in the database that combines the data into a single resultset which can be queried and filtered as if it was a single model in the Python layer. This moves the responsibility for the calculations from Python to Postgres. The view's data can be read with an unmanaged model, and the existing models will remain for creating and updating data.

Example code

As an example, this is how it works for calculate_balance_pending_trades_credit on BalancesService

The current code for calculating the pending trades credit looks like this:

    def calculate_balance_pending_trades_credit(cls, client_pk, currency_symbol):
        from trades.models import Drawdown, SpotDeal, FXSettlement

        spots_amount_not_funded_in = SpotDeal.objects.for_client_pk(client_pk).with_sell_currency_symbol(
            currency_symbol).overdue_pending_to_be_funded().values_list('amount_not_funded_in', flat=True)
        drawdowns_amount_not_funded_in = Drawdown.objects.for_client_pk(client_pk).with_sell_currency_symbol(
            currency_symbol).overdue_pending_to_be_funded().values_list('amount_not_funded_in', flat=True)
        fxs_amount_not_funded_in = FXSettlement.objects.for_client_pk(client_pk).with_sell_currency_symbol(
            currency_symbol).overdue_pending_to_be_funded().values_list('amount_not_funded_in', flat=True)

        deals_amount_not_funded_in = chain(spots_amount_not_funded_in,
                                           drawdowns_amount_not_funded_in,
                                           fxs_amount_not_funded_in)
        balance_credit = utils.four_decimals(sum(deals_amount_not_funded_in))
        return balance_credit

We can replace the three queries to SpotDeal, Drawdown, and FXSettlement with a database view that keeps the data the same shape across the models, and lets us query and run calculations.

The view for the pending overdue could look like this (just for example, columns and data sources might include more data from more sources, or differ in an actual implementation):


CREATE OR REPLACE VIEW pending_overdue_trade_data AS (
    SELECT "trade_type", "id", "client_id", "debit_currency_id", "credit_currency_id", "debit", "credit", "deleted", "due_date" FROM (

      SELECT
        'drawdown' as "trade_type",
        "trades_drawdown"."id" as "id",
        "trades_drawdown"."client_id",
        "trades_drawdown"."amount_not_funded_out" as "debit",
        "trades_drawdown"."amount_not_funded_in" as "credit",
        "trades_forwarddeal"."buy_currency_id" as "debit_currency_id",
        "trades_forwarddeal"."sell_currency_id" as "credit_currency_id",
        "trades_drawdown"."final_expected_clearing_date" as "due_date",
        "trades_drawdown"."deleted"
      FROM
        "trades_drawdown"
      INNER JOIN "trades_forwarddeal" ON ( "trades_drawdown"."forward_deal_id" = "trades_forwarddeal"."id" )

      UNION ALL

      SELECT
        'spotdeal' as "trade_type",
        "trades_spotdeal"."id" as "id",
        "trades_spotdeal"."client_id",
        "trades_spotdeal"."amount_not_funded_out" as "debit",
        "trades_spotdeal"."amount_not_funded_in" as "credit",
        "trades_spotdeal"."buy_currency_id" as "debit_currency_id",
        "trades_spotdeal"."sell_currency_id" as "credit_currency_id",
        "trades_spotdeal"."final_expected_clearing_date" as "due_date",
        "trades_spotdeal"."deleted"
      FROM
        "trades_spotdeal"

      UNION ALL

      SELECT
        'fxsettlement' as "trade_type",
        "trades_fxsettlement"."id" as "id",
        "trades_fxsettlement"."client_id",
        "trades_fxsettlement"."amount_not_funded_out" as "debit",
        "trades_fxsettlement"."amount_not_funded_in" as "credit",
        "trades_fxsettlement"."currency_out_id" as "debit_currency_id",
        "trades_fxsettlement"."currency_in_id" as "credit_currency_id",
        "trades_fxsettlement"."value_date" as "due_date",
        "trades_fxsettlement"."deleted"
      FROM
        "trades_fxsettlement"

    ) trades

    WHERE ( "due_date" < NOW() AND "deleted" IS NULL )
);

This view was written by running the queries that the ORM generated for the data we need to get for the three models, assigning common aliases to each of the columns we need, and using a UNION to bring them all back in the same resultset.

Now we need to do the calculation for the balances, so we create an unmanaged model to use the data from the view.

class PendingOverdueTrades(models.Model):
    id = models.IntegerField(primary_key=True)
    trade_type = models.CharField(max_length=64)
    client = models.ForeignKey(Client, related_name='client_pending_balance')
    debit = models.DecimalField(default=Decimal('0'), max_digits=19, decimal_places=2)
    credit = models.DecimalField(default=Decimal('0'), max_digits=19, decimal_places=2)
    debit_currency = models.ForeignKey(Currency, related_name='debit_currency')
    credit_currency = models.ForeignKey(Currency, related_name='credit_currency')
    due_date = models.DateTimeField()
    deleted = models.DateTimeField()

    class Meta:
        managed = False
        db_table = 'pending_overdue_trade_data'

Now we can query that view to get back all the pending debit balances for a client using the unmanaged model and Django's annotations. This is the result on the obfuscated database from 2020/06/15

>>> def get_debit_balances(client_id):
        return list(
            Currency.objects.filter(
                debit_currency__client_id=client_id).annotate(
                debit=Sum('debit_currency__debit'))
        )
>>> for c in get_debit_balances(3):
        print c.symbol
        print c.debit

GBP
1.0000
EUR
4385.3700
USD
0.0000
NOK
0.0000
HUF
0.0000
KES
0.0000

That returned a Currency object annotated with the debit balance for every currency that appeared in the view for the client. But is it fast? Running that function call with %timeit:

>>> %timeit get_debit_balances(3)
100 loops, best of 3: 2.65 ms per loop

Yes. The actual execution time of the query was 1.6ms in postgres.

This example is a simple one to show how we can have the database calculate granular figures have higher level grouping defined in the ORM, but if it makes sense the database could do more calculations and the Python model could just select the results. This should be decided on a case by case basis, depending on what makes most sense for the data.

Testing

Testing these views uses the same test structure we have for the current code, it's just going to select the data from Postgres instead of calculating it in Python.

Alternatives

Refactoring the current code to perform more efficient queries.

We cannot create the UNIONs we need using the Django ORM, so the most we could do is to fetch figures for multiple currencies and avoid looping round currencies to get individual balances. It would also require work if we wanted to get balances for multiple clients and keep the performance. It would be impossible to match the performance of using views in the database.

Store calculated data statically

We already do this for some models, but implementing it in the Python layer has concurrency issues and is already causing problems in production.

For the balances example used in this RFC, we would need a ledger system where every movement that would change a balance would need to be an entry in the ledger and the balance could be stored statically. This is a more "correct" solution but would require an immense amount of work to implement and test, and with the current code in BOS it is much more likely to go wrong. There are also concurrency and consistency challenges with this solution that the current state of BOS code and testing would make difficult to implement.

Caveats

When we change trade models, these views will not update automatically. However, we have so much python code where field names are hard coded, and so much code where we perform queries on multiple hard coded model types, that the amount of work to update how balances are calculated in the database view would be around the same.

Operation

N/A

Security Impact

N/A

Performance Impact

The balance calculations could easily be done on demand with the current volume of trades we do, and would be many times faster than the current Python implementations.

Other areas of BOS where we currently have to re-implement a UNION and do calculations on the app server side could be speeded up by many times too, and save RAM and CPU power on the app servers.

Developer Impact

Developers who are not comfortable with writing SQL by hand would need to spend some time practising, and make sure they're writing queries that are fast and efficient.

Data Consumer Impact

No impact yet, as this will be the first implementation of this pattern.

Deployment

Views can be deployed as raw SQL migrations.

Dependencies

N/A

References

N/A