Python Logging

This RFC is a collection of best practices around logging in services written in Python.

Problem Description

Services configure logging in slightly different ways, and different log formats are difficult to process centrally, and this will get even more complex with the growing number of microservices. Log collation and processing requires structured (JSON) logging. Currently, we are investing time in implementing and testing logging in each microservice.

Logging also affects integration with Sentry. Sometimes there is too much noise coming from Sentry so there needs to be a better way to filter patterns of information but at the same time ensure that relevant events are always captured.

Ideally, common logging should be a library that can be just imported into a Python service, one that provides the default configuration, and produces the same log format in all services.

Scope

New-ish Ebury services only, no retrofitting of monoliths
Python 3.8+

Best Practices (Solution)

Do add logging to your code :) ...

... avoid print statements! Some of your print statements can be very useful in the long term but should take the form of a well-behaved debug log. Try to hit the right balance:

DO:

logging may be the single way to understand what happened in production, therefore include INFO logs as "milestones" for every critical path in the application
DEBUG logs are useful in troubleshooting, ideally having a DEBUG log for each if branch in the code. DEBUG logs are rarely enabled in production due to performance concerns, however using error capturing such as Sentry, they will be buffered in memory and sent along with an ERROR event as breadcrumbs
avoid logging at module scope because imports will emit log records and logging may not be configured yet

DON'T:

make too much noise, e.g. don't log in each cycle of a loop but log a summary instead, and consider if they are concise enough, avoid dumping huge lists or dictionaries and JSON objects into debug logs

Choose the appropriate log level ...

DEBUG: Detailed information, for diagnosing problems
INFO: Confirm things are working as expected
WARNING: Something unexpected happened, or indicative of some problem but the software is still working as expected
ERROR: More serious problem, the software is not able to perform some function.
CRITICAL: A serious error, the program itself may be unable to continue running.

Log exceptions ...

... using logger.exception(...) which will add traceback information. A structured logger will transform the trace information as needed.

Use structured (JSON) logs in production...

... so that they can be easily processed in an ELK stack, log records from different sources can be collated based on log record fields. In a production environment we rely on automated log processing, and tooling that work well with JSON logs. In a local development environment, however, it is useful to have unstructured, pretty-printed logs on the console, therefore the format should be configurable based on the environment.

Standard Context Keys

In order to keep the number of context keys and Kibana filters manageable, it is useful to define the key names in a common logging library. An example from Sherlock:

from log_filters import EBURY_ID, LOG_TAG

LOG_TAG_VALUE = 'payment_screening_service'

logger.info(
        'searched_entity_in_lxnx',
        LOG_TAG=LOG_TAG_VALUE,
        predefined_search_name=predefined_search_name,
        EBURY_ID=entity_dict['ebury_id'],
        number_of_hits=entity_screening_result.number_of_hits,
        aml_result=entity_screening_result.aml_status_code,
    )
...
logger.info(
        'created_payment_screening_db_entries',
        LOG_TAG=LOG_TAG_VALUE, 
        EBURY_ID=payload.payment_id, 
        pk=screening_request.pk
    )

Structured logs from other libraries

Third party libraries will use the standard logging module. Those log records must be also brought to the same format (structlog provides hacks to achieve this).

Log to stdout

Each running process should write its log event stream, unbuffered, to stdout. [The Twelve-Factor App]

An advantage with writing logs to stdout is that the infrastructure (ECS, Kuberentes) is responsible of the log propagation (e.g. ELK, EFK). Similiar to environment variables, we don’t know how they are actually injected, but the apps only know to read from the environment.

Should some startup or wrapper scripts print non-structured logs to stdout, those can be filtered at service level:

Have a common library for this ...

... so that all projects can re-use the implementation. Include the library in the cookiecutter templates so all new projects benefit of logging best practices. Add detailed usage information, patterns and antipatterns in the README of the library.

Set up logging as the first step ...

... in the service's main/start script, just in case something that is imported should log something, and that must be already emitted in the right format. Initialiaze sentry_sdk at the same time.

Use a dedicated logger per (bigger) modules ...

... instead of the root logger everywhere (Python loggers form a hierarchy). This may give more flexibility about what, how and where is logged, e.g. one module's logs to a file on DEBUG level, while others to console on INFO level, etc.

Required log record fields...

To be decided.

Suggested log record fields...

Field	Description
Timestamp	in ISO-8601 format (including time-zone info)
Log level
Message	Text intended for human consumption. Not recommended to use these in log processing (with complicated regex expressions), as they are subject to change based on the mood of a developer :). See fields below intended for use in log processing.
entity ID*	the unique ID of an affected entity being processed
entity state*	state of an entity being processed, esp. when there is a finite state machine defined
request_id and correlation_id	To see related events across all application modules and across multiple applications. Instead of manually adding these fields, see [Log correlation] and [Distributed tracing]
status or error code*

*Naming should be specific, for instance beneficiary_uid instead of something generic like entity_id, use standard context keys if already defined.

The type of the value must be also consistent to make it possible for Elastic search to create an index for the field. Stick to the smallest possible type (e.g. don't use string for an integer status_code).

For example, uvicorn would automatically generate this log record:

{
    "event": "10.13.76.122:36790 - \"GET /api/v1.0/client/beneficiaries?client_identifier=EBPCLI14017&items_per_page=1&page=1&order=asc HTTP/1.1\" 200",
    "level": "info",
    "logger": "uvicorn.access",
    "timestamp": "2021-09-07T13:49:16.844523Z"
}

... and this could be extended with the following fields:

{
    ...
    "method": "GET",
    "endpoint": "/api/v1.0/client/beneficiaries",
    "aml_id": "EBPCLI14017",
    "status_code": 200
}

... where the keys should be predefied literals in the logging library (note the aml_id vs. client_identifier). The IP address should be probably masked out in the original event field.

Nice to have fields:

Field	Purpose
Logger name	it provides one aspect of filtering log records
source code pathname + lineno	For troubleshooting. Only set this field when DEBUG logs are enabled?
Sentry event ID	The correlation ID of an event reported to Sentry. So that it is easier to find the relevant logs during troubleshooting.

Personal Identifiable Information

We should not log any sensitive or personal information without having an approval from The Data Protection Officer (even in debug logs). IPs, passwords, emails, addresses, bank information should be either masked or removed.

Ways to avoid logging sensitive information:

TL;DR:

use structured logs with standard context keys predefined in the library
disable with_locals and send_default_pii in the Sentry SDK, and implement event/breadcrumb filters

The above in more detail:

Use structured logs. In an unstructured log, it is easier for sensitive data to slip through filtering. For example: write log.info('serving request', IP_ADDR=ip_addr) instead of log.info('serving request from IP %s', ip_addr).
Implement logging filters that automatically remove, mask out, or obfuscate fields known to contain potentially sensitive data.
Use predefined context keys: if a predefined key is not yet available, then evaluate the risks before adding a new key. If necessary, implement log filters for the new key.
Some data modeling libraries support secret data types, and they can ensure that values are not added to logs by mistake. See [pydantic], for example.
Following the above will ensure that Sentry won't pick up PII from logs, however, it can still do so from other sources: (1) local variables in stack traces -> disable the with_locals config; (2) default Sentry integrations may add sensitive data (e.g. IP addresses in uvicorn logs) -> implement event/breadcrumb filters to remove or obfuscate the data.
Finally, not strictly related to logging but data stored in distributed logs (Kafka) or other forms of data stores are assumed to be encrypted and protected from unauthorized access.

Sentry Scope for Breadcrumbs

Workers process requests/messages/events in a loop. sentry_sdk provides out-of-the-box solutions for most integrations to ensure that only information relevant to the current event is presented for an event and information from previous cycles doesn't pollute the breadcrumbs presented. For our own solutions (such as a simple synchronous event processing loop) we need to manage the Sentry scope manually to achieve this. (Otherwise, by default, sentry_sdk will attach up to 100 of the last breadcrumbs in the current scope including parent scopes.)

Illustration:

for event in event_stream:
    with sentry_sdk.push_scope():
        log.debug('Processing event', XYZ_ENTITY=event.payload.uid)
        add_bread_crumb(some fields of event)
        do_the_event_processing(event)

Without the push_scope context per event, if for example the 3rd cycle of the loop failed and Sentry picked up an event (error log or exception), then it would attach the breadcrumbs (including debug logs) also from cycle 1 and 2 and that would be confusing.

Logging Antipatterns

Do not embed variable information in the log message

Instead of:

logger.info('ID %s not found', ebury_id)

write (structlog example):

from log_filters import EBURY_ID

logger.info('not_found', EBURY_ID=ebury_id)

Metrics

Logs don't substitute metrics.

Way of monitoring	Pros	Cons
Metrics	Information is available fairly quickly	When scraping fails, samples are lost
Logs	Information from different services can be collated	Slow to analyse

Alternatives

[ebury-logging] has already implemented useful patterns and ideas, however, it is heavily Django-oriented. We would also like to avoid maintaining code for different Python versions, as the new projects would all be written in Python 3. It is suggested that we take the knowledge accumulated in ebury-logging, and selectively re-implement the functionality in Python 3 that the new projects will really use.

structlog offers a great experience in both local development and production (can produce both pretty logs and JSON logs). Their plugin system (processors) allows great customization and we already have integrations with Sentry, ElasticAPM.

...

Caveats

Operation

A common library implementing many of the best practices discussed in this RFC. Expectation on the behaviour of the library from any service must be implemented in the form of unit tests with a description about where the requirement came from.

The library is intended for internal access.

Security Impact

Any third party packages used should go through regular security procedures in Ebury (e.g. packages screened on Ebury Pypi?).

Performance Impact

Specific packages (e.g. structlog vs. python-json-logger) to be compared for performance.

Developer Impact

Developers should not spend time on implementing and testing logging, just import and use the library as a service.

Data Consumer Impact

It is suggested to use a common logging library in (fairly) new services only. Changing logging in legacy code carries the risk of unintended affect on other data consumers.

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search