ISO Migration - Inbound MVP - File reading using Autoclient

Reference Documents

Reference Document Location
COP001 ISO Migration - Inbound MVP PRD
COP002 Tech analysis
COP003 Diagrams
COP004 Swift-Gateway vs ACBalancer analysis
COP005 Swift-Gateway and ACBalancer Tradeoffs
COP006 Autoclient docs
COP007 File examples
COP008 Master RFC Draft - MVP0

Glossary

  • FXS: FX Suite
  • FIN files: Files currently used by SWIFT to transport MT messages (text-based, proprietary encoded)
  • IA files: New standard of files proposed by SWIFT to transport MX messages (XML-based, open standard), also known as InterAct files.
  • SW-GW: Alias for Swift Gateway
  • Inbound: refers to the direction of traffic, meaning that this RFC only considers the files that Ebury receives from the Swift network
  • EMP: Ebury Mass Payments
  • LAU: Local Authentication, a method SWIFT uses to digitally sign their files

Problem Description

ISO 20022 is an ISO standard for data exchange between financial institutions that is being implemented in various financial market agents and clients that we use today. Today our database, our systems and our processes are based on SWIFT MT messages, and they must be migrated to the new SWIFT ISO 20022 CBPR+ format, also known as the MX format. Adhering to this new format is not a straightforward change.

This document presents the first component that will be developed within the new architecture to meet the business requirements, focusing on the new incoming flow.

Narrowing down the scope

The first step to integrate with the SWIFT network is to develop a Gateway responsible for consuming the files delivered by the network to Ebury.

In this RFC we’ll be detailing a component of the incoming flow called Swift Gateway.

What Swift Gateway IS: A pass-through, low-level interface between Ebury and SWIFT, allowing connection between the 2 players using different technologies (SWIFT uses SFTP, Ebury uses Kafka).

What Swift Gateway IS NOT: A domain connector or transformer of any kind. It will not apply filters or any type of business logic on the incoming files.

Background

Part of the background below was already explained in the Master RFC, but we'll repeat some parts here to make this RFC more consistent.

Our communication with the SWIFT network is done by exchanging files, using SWIFT's own file server called “Autoclient”. Ebury accesses the Autoclient servers through the SFTP protocol. Each Autoclient server is physically located inside Ebury premises (not in a cloud environment), and we currently have 3 of them.

For redundancy, there are multiple instances of the SWIFT Autoclient software running on different physical servers. Each SWIFT Autoclient retrieves the same files from SWIFT. The gateway is responsible for retrieving a file and deleting the replicas from all instances of the SWIFT Autoclient.

Files coming from SWIFT Network

The SWIFT network delivers FIN files (which package MT messages) to Ebury, through a service called Autoclient. The current stack used at Ebury for the INCOMING flow is served by 1 legacy application, FX Suite (aka FXS), which is able to read these .FIN files and the SWIFT MT messages inside the FIN files.

For each file that FX Suite currently reads, it uploads a copy to S3 and saves a copy in its own database. Since FX Suite has several daemons that read from the same directories, a locking mechanism was implemented in Redis to avoid 2 daemons consuming the same file.

FX Suite is also responsible for deleting the files that are replicated in the extra servers, so if it consumes a file from server A, it will delete the file from servers B and C.

Files coming from SWIFT Network

The SWIFT ISO 20022 migration project, in the context of Incoming Funds, proposes 2 changes:

  1. The messages will be changed from MT (text-based, semantically poor) to MX (XML-based, ISO 20022 standard, semantically rich);
  2. The transport mechanism of these messages will be IA files (InterAct files), and not FIN files anymore;

InterAct files do not transport MX messages directly. The ISO messages are embedded in an envelope (called DataPDU) that contains information about the source and destination financial institutions of the message, such as the BIC code.

This DataPDU envelope is SWIFT-specific, and unrelated to the ISO 20022 format.

Clash of standards

The DataPDU is important to us because there are business rules based on the BIC code information and possibly on other fields that SWIFT makes available, and ISO doesn't.

Solution

An overview of the whole architecture can be seen in the Master RFC Draft document.

Ingress File Traffic

The SWIFT network will deliver the InterAct files to the same Autoclient servers that receive FIN files today, using the SFTP protocol, and the files will be replicated across all three servers we have:

Clash of standards

The SFTP protocol does not provide a notification mechanism, hence we need to poll the SFTP servers to detect new files.

When a new file has been detected, the service will

  1. download the file from the SFTP server;
  2. upload the file to S3 (for backup/auditing);
  3. store metadata about the file (path, size, hash, s3 bucket and key) for auditing purposes;
  4. delete the file from all 3 Autoclient servers;
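
The acquisition steps above can be sketched as follows. This is a minimal illustration, not the final implementation: the server names, the `incoming/` key prefix, the `/reception/` path and the function names are all hypothetical, and the directory listings are passed in as plain data so that the sketch stays independent of any SFTP library.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMetadata:
    """Audit record for one acquired file (field names are illustrative)."""
    path: str
    size: int
    sha256: str
    s3_bucket: str
    s3_key: str


def plan_acquisitions(listings: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each .ia filename to every server that holds a replica of it.

    The file is downloaded once from any of those servers and later
    deleted from all of them. Non-.ia files (e.g. .fin) are ignored.
    """
    plan: dict[str, list[str]] = {}
    for server, filenames in sorted(listings.items()):
        for name in filenames:
            if name.lower().endswith(".ia"):
                plan.setdefault(name, []).append(server)
    return plan


def build_metadata(path: str, content: bytes, bucket: str,
                   key_prefix: str = "incoming/") -> FileMetadata:
    """Compute the metadata stored for auditing purposes."""
    filename = path.rsplit("/", 1)[-1]
    return FileMetadata(
        path=path,
        size=len(content),
        sha256=hashlib.sha256(content).hexdigest(),
        s3_bucket=bucket,
        s3_key=key_prefix + filename,
    )
```

Keeping the list of replica holders per file makes the final step (deleting the file from all 3 Autoclient servers) a simple loop over the plan entry.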

Acquire the file

The InterAct files are considered private resources for Swift Gateway, given the fact that they're just bundles of DataPDUs, and of no interest to downstream systems.

After acquiring the file, the service will extract the DataPDU messages and notify downstream systems about each new DataPDU.

Each part inside an InterAct file consists of:

  1. Binary header with 31 bytes: a prefix byte, the length encoded in 6 ASCII-character decimal digits, and 24 bytes of LAU signature;
  2. UTF-8 encoded XML payload which is the DataPDU itself;
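
A minimal sketch of how such a file could be split into its parts is shown below. Two details are assumptions that must be checked against the XMLv2 specification: that the prefix byte is 0x1F, and that the 6-digit length counts the 24-byte signature plus the XML payload. The names `Part`, `split_parts` and `MalformedInterActFile` are hypothetical.

```python
from dataclasses import dataclass

PREFIX = 0x1F            # assumption: value of the prefix byte
LENGTH_DIGITS = 6        # length field: 6 ASCII decimal digits
SIGNATURE_SIZE = 24      # LAU signature: 24 bytes
HEADER_SIZE = 1 + LENGTH_DIGITS + SIGNATURE_SIZE  # 31 bytes in total


class MalformedInterActFile(ValueError):
    """Raised when the file does not follow the expected layout."""


@dataclass(frozen=True)
class Part:
    index: int
    signature: bytes
    payload: bytes  # the UTF-8 encoded DataPDU


def split_parts(data: bytes) -> list[Part]:
    """Break the raw file content into (signature, DataPDU) parts."""
    parts: list[Part] = []
    offset = 0
    while offset < len(data):
        header = data[offset:offset + HEADER_SIZE]
        if len(header) < HEADER_SIZE or header[0] != PREFIX:
            raise MalformedInterActFile(f"bad header at offset {offset}")
        digits = header[1:1 + LENGTH_DIGITS]
        if not digits.isdigit():
            raise MalformedInterActFile(f"bad length field at offset {offset}")
        # Assumption: the length covers the signature and the payload.
        payload_size = int(digits.decode("ascii")) - SIGNATURE_SIZE
        start = offset + HEADER_SIZE
        end = start + payload_size
        if payload_size < 0 or end > len(data):
            raise MalformedInterActFile(f"truncated part at offset {offset}")
        parts.append(Part(len(parts), header[1 + LENGTH_DIGITS:], data[start:end]))
        offset = end
    return parts
```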

Hexdump of an InterAct file

At this moment the service will:

  1. Break the file into DataPDU blocks;
  2. Validate the consistency of each DataPDU by checking size, LAU signature and checking if the DataPDU is a syntactically valid XML document;
  3. Calculate the unique but deterministic key (i.e. filename + DataPDU index in the file) for each DataPDU. This is necessary for deduplication in the consumers due to the at-least-once notification strategy;
  4. Store the DataPDU in the database along with the generated key;
  5. Publish the DataPDU to Kafka.
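
The validation and key-generation steps could look like the sketch below. The LAU scheme used here (HMAC-SHA256 over the DataPDU, truncated to 128 bits and Base64 encoded to 24 characters) is an assumption to be confirmed against the SWIFT documentation, and the `filename#index` key layout is only one possible encoding of "filename + DataPDU index". The standard library parser is used for brevity; a hardened parser is advisable for untrusted input.

```python
import base64
import hashlib
import hmac
import xml.etree.ElementTree as ET


def expected_lau(payload: bytes, lau_key: bytes) -> bytes:
    """Assumed LAU scheme: HMAC-SHA256, first 16 bytes, Base64 (24 chars)."""
    digest = hmac.new(lau_key, payload, hashlib.sha256).digest()
    return base64.b64encode(digest[:16])


def is_valid_pdu(payload: bytes, signature: bytes, lau_key: bytes) -> bool:
    """Check the signature and that the payload is well-formed XML."""
    if not hmac.compare_digest(expected_lau(payload, lau_key), signature):
        return False
    try:
        ET.fromstring(payload)  # syntactic check only, no schema validation
    except ET.ParseError:
        return False
    return True


def pdu_key(filename: str, index: int) -> str:
    """Deterministic key: the same file and position always give the same key."""
    return f"{filename}#{index}"


class Deduplicator:
    """Consumer-side helper that drops keys that were already processed."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def first_time(self, key: str) -> bool:
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

Because the key depends only on the filename and the position inside the file, reprocessing the same file after a crash produces the same keys, which is what lets consumers discard duplicates under at-least-once delivery.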

Breaking the file into DataPDUs

Note: as different banks will switch to the ISO format at different times, there will be a period when we receive both FIN files and InterAct files. Swift Gateway will only download and remove InterAct files; it will leave the FIN files on the Autoclient servers, letting the existing FXS service handle them. Similarly, FXS only moves FIN files. The files can be distinguished by filename extensions (.fin / .ia).

Egress File Traffic

To send files to the Swift network, Swift Gateway

  1. receives DataPDU XML messages in a Kafka topic along with a unique request ID and optionally some labels - both the unique ID and the labels will be used in responses about the request;
  2. calculates the InterAct header signature and LAU signature;
  3. constructs the InterAct file content by adding the 31-byte binary header in front of the XML content;
  4. generates a timestamp-based filename;
  5. generates the LAU file content;
  6. selects one of the AutoClient servers - using round robin;
  7. inserts a record about this file upload operation in the DB, storing the
    • request id (unique)
    • filename (unique)
    • labels (blob)
    • AutoClient server selected
    • the DataPDU message read from Kafka
    • state: NEW
  8. [state==NEW] uploads the LAU file to the emission folder of the selected server - Swift will only read/move this file when it finds the corresponding .ia file, hence this operation doesn't need to be atomic;
  9. [state==NEW] uploads the InterAct file to the selected AutoClient server - writes with a .tmp file extension in the emission folder;
  10. if [state==NEW] then it
    • locks the row (SELECT ... FOR UPDATE) - we explain later why
    • changes the state to MOVING FILE
    • renames the InterAct file from .tmp extension to .ia extension - the (kind of) atomic file rename operation can avoid race conditions with the Swift-side reading process
    • updates the file-upload operation DB record with state UPLOADED
  11. if [state==UPLOADED] then it
    • uploads the file with the same filename to an S3 bucket - for backup/auditing
    • updates the file-upload operation DB record with state IN S3 BUCKET (commits and releases the locked row)
  12. commits the Kafka message.
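
The in-memory steps of this list (signature, file content, filename, server selection) can be sketched as below. Several details are assumptions rather than facts from the specification: the 0x1F prefix byte, a length field that counts signature plus payload, the HMAC-SHA256-based signature, the filename pattern and the server names. The separate LAU file is not covered here.

```python
import base64
import hashlib
import hmac
import itertools
from datetime import datetime, timezone

MAX_LENGTH = 999_999  # largest value that fits in 6 decimal digits


def build_interact_part(xml_payload: bytes, lau_key: bytes) -> bytes:
    """Prepend the 31-byte header to the DataPDU (layout is an assumption)."""
    digest = hmac.new(lau_key, xml_payload, hashlib.sha256).digest()
    signature = base64.b64encode(digest[:16])  # 24 ASCII characters
    length = len(signature) + len(xml_payload)
    if length > MAX_LENGTH:
        raise ValueError("DataPDU too large for the 6-digit length field")
    return bytes([0x1F]) + f"{length:06d}".encode("ascii") + signature + xml_payload


def make_filename(now: datetime, sequence: int) -> str:
    """Timestamp-based filename; the exact pattern is hypothetical."""
    return f"{now:%Y%m%dT%H%M%S%f}_{sequence:04d}.ia"


# Round-robin selection over hypothetical server names.
servers = itertools.cycle(["ac1", "ac2", "ac3"])

if __name__ == "__main__":
    part = build_interact_part(b"<DataPDU/>", b"demo-key")
    print(len(part), make_filename(datetime.now(timezone.utc), 1), next(servers))
```

All of these values are cheap to recompute, but once the DB record exists the stored filename and server should be used instead of fresh ones.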

Summary of state transitions: NEW > MOVING FILE > UPLOADED > IN S3 BUCKET
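
The state transitions can be encoded explicitly, so that an unexpected jump is rejected instead of silently accepted. The sketch below is illustrative only; the enum, the `advance` helper and the action names returned by `next_action` are hypothetical.

```python
from enum import Enum


class UploadState(Enum):
    NEW = "NEW"
    MOVING_FILE = "MOVING FILE"
    UPLOADED = "UPLOADED"
    IN_S3_BUCKET = "IN S3 BUCKET"


# Only the forward transitions listed in the summary are allowed.
ALLOWED = {
    UploadState.NEW: {UploadState.MOVING_FILE},
    UploadState.MOVING_FILE: {UploadState.UPLOADED},
    UploadState.UPLOADED: {UploadState.IN_S3_BUCKET},
    UploadState.IN_S3_BUCKET: set(),
}


def advance(current: UploadState, target: UploadState) -> UploadState:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} > {target.value}")
    return target


def next_action(state: UploadState) -> str:
    """What a worker does when it finds a record in the given state."""
    return {
        UploadState.NEW: "upload-and-rename",
        # A record stuck in MOVING FILE means the rename outcome is unknown.
        UploadState.MOVING_FILE: "raise-alert",
        UploadState.UPLOADED: "archive-to-s3",
        UploadState.IN_S3_BUCKET: "commit-kafka-message",
    }[state]
```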

The described solution can scale with multiple parallel upload processes as long as the input Kafka topic is partitioned.

Error Handling

The upload process might fail at any step. When that happens the worker either restarts, reconsumes the same Kafka message, and performs the entire upload process again, or leaves the upload operation DB record in an unfinished state, raises an alert, commits the current Kafka message and consumes the next one. All steps can be handled either in an idempotent manner or orchestrated by tracking the state.

Step | Idempotency Strategy
Signature calculation, constructing file contents, generating filename, selecting AutoClient server | recalculates every time - cheap in-memory operation
Inserting DB record about the file upload | the unique constraint guarantees that we cannot do that twice; should that exception happen, we read back the DB record and use the values stored there (i.e. filename, selected AutoClient server, upload state); should the file-upload operation state we read back be MOVING FILE, we raise an alert as this needs investigation (see Incident Handling), and commit the Kafka message
Uploading LAU and the .tmp InterAct file | if the file exists, overwrite it or remove it and upload it again - the file might be incomplete
Renaming the InterAct file from .tmp to the .ia filename extension | we only get to this step when the upload operation was found in NEW state
Upload to S3 | we can upload the same content with the same key repeatedly - this should happen very rarely so we don't worry about the cost
Commit Kafka message | if it was successfully committed before, we won't see the same Kafka message again

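
The "insert once, otherwise read back" strategy for the DB record can be illustrated with the sketch below. SQLite is used only to keep the example self-contained; the table and column names are hypothetical, and the production database and its locking (SELECT ... FOR UPDATE) are not modelled here.

```python
import sqlite3

SCHEMA = """
CREATE TABLE file_upload (
    request_id TEXT PRIMARY KEY,
    filename   TEXT NOT NULL UNIQUE,
    server     TEXT NOT NULL,
    state      TEXT NOT NULL
)
"""


def register_upload(db: sqlite3.Connection, request_id: str,
                    filename: str, server: str) -> tuple:
    """Insert the record once; on redelivery return the stored values."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO file_upload VALUES (?, ?, ?, 'NEW')",
                (request_id, filename, server),
            )
    except sqlite3.IntegrityError:
        pass  # unique constraint hit: fall through and read the record back
    row = db.execute(
        "SELECT filename, server, state FROM file_upload WHERE request_id = ?",
        (request_id,),
    ).fetchone()
    if row is None:
        # The filename collided with a different request.
        raise RuntimeError("filename already used by another request")
    return row
```

On a redelivered Kafka message the freshly generated filename and server are discarded in favour of the stored ones, so the retry continues the original upload instead of starting a second one.
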
Incident Handling

We want to avoid sending the same file twice instead of relying on idempotency on the SWIFT side, for the following reasons:

  • The lack of a file in emission folder could mean (1) we never sent it or (2) Swift has already processed the file and removed it.
  • The communication pattern via FTP is complicated.
  • There is no real protocol-level contract, and the specs are vague.
  • In the end, the account may be debited twice.
  • We can't safely rely on "already seen" error response codes to assume the request has indeed been processed before (what exactly would that set of error codes be?).
  • In case we receive an "already seen" response code, before we treat that as an error, for how long should we wait to get an ACK about the previous upload attempt? Are ACKs reliable, will we always receive them?

When there is an error during the renaming of the InterAct file from .tmp to .ia or the saving of the upload state to DB right after that, then

  1. an engineer needs to check the logs to establish what exactly happened: if we can conclude that the FTP mv command was successful and only a subsequent step failed, update the DB record to state UPLOADED (with a DB migration);
  2. if the fate of the FTP mv command cannot be established by inspecting the logs, we need to reach out to the banking partner or SWIFT to check whether they received the file:
    • if they did not receive it, update the upload state to NEW (with a DB migration);
    • if they received it, update the upload state to UPLOADED (with a DB migration) - this way the upload to S3 will still be performed.

At start-up, the workers query the DB and execute (continue) any operations in NEW or UPLOADED state. There can be multiple concurrent workers, and this is why we lock the upload operation DB row.

Handling ACKs and NACKs

After a successful file upload we can receive positive and negative acknowledgements from the SWIFT network. They will typically embed the original message (or at least the important references that allow the sender to recognise them) and, in case of a rejection, an optional reason code and additional information. Instead of the usual Message XML element inside /DataPDU/Header there will be one of TransmissionReport, DeliveryNotification, DeliveryReport, MessageStatus, etc. (see XMLv2 specs). SwiftGW doesn't need to read and interpret this information; it only needs to forward the DataPDU as is to the next downstream service. The chosen strategy today is to keep all types of DataPDUs in the same, single Kafka topic. The only reason SwiftGW might want to know the message type is to produce metrics.
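
For metrics only, the kind of a DataPDU could be derived from the first element inside /DataPDU/Header, as in the sketch below. It assumes that Header holds exactly one such element and it ignores XML namespaces on purpose; the function name and the "unknown" fallback label are hypothetical.

```python
import xml.etree.ElementTree as ET


def _local_name(tag: str) -> str:
    """Strip the '{namespace}' part that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def header_kind(datapdu: bytes) -> str:
    """Return e.g. 'Message' or 'DeliveryNotification' for use as a metric label."""
    root = ET.fromstring(datapdu)
    for child in root:
        if _local_name(child.tag) == "Header":
            for item in child:
                return _local_name(item.tag)  # first element inside Header
    return "unknown"
```

The DataPDU itself is still forwarded unchanged; the returned label would only feed a counter.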

Handling error files

Error files are in response to an InterAct file that the receiving side could not read or understand, failed some integrity checks, etc. The error file uses the same filename as the original file with the added .err file extension. SwiftGW saves the original request ID, labels and the generated filename when uploading an InterAct file, so using the filename, it can find the original request and communicate all that information back to the requestor in a Kafka topic. In most cases the content of the error file is unstructured text. SwiftGW handles the content as opaque payload.

SwiftGW must not read and remove *.fin.err and *.fin.err.lau files from the AutoClient servers, as the old FXS flow still needs those.
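
A filename-based routing rule that respects this separation could look like the following sketch. The labels returned by `classify` are hypothetical, and the case-insensitive comparison is an assumption; anything that is not an .ia or .ia.err file is left on the server untouched.

```python
def classify(filename: str) -> str:
    """Decide what SwiftGW does with a file found on an AutoClient server."""
    name = filename.lower()
    if name.endswith(".ia"):
        return "interact"
    if name.endswith(".ia.err"):
        return "interact-error"
    # .fin, .fin.err, .fin.err.lau and anything else belong to other flows.
    return "ignore"


def original_filename(error_filename: str) -> str:
    """Recover the uploaded filename from '<original>.err'."""
    if classify(error_filename) != "interact-error":
        raise ValueError(f"not an InterAct error file: {error_filename}")
    return error_filename[: -len(".err")]
```

The recovered filename is the lookup key for the stored request ID and labels.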

Receiving ACK, NACK and error files is normal operation from a SwiftGW point of view and must not trigger an alert. Raising alerts for these, where needed, is the responsibility of the downstream services.

DB Garbage Collection

The data in the DB serves only operational purposes. Files sent and received are uploaded to S3 for backup purposes. This means we can keep the DB footprint small by regularly deleting old records from the DB (CronJob). Suggested retention period: 7 days. This implies that any incident (where messages may need to be reprocessed) must be solved within 7 days, which is reasonable for payment instructions and processing of financial records.

Service Ownership

New Service | Service Name | Owner
Yes | SWIFT Gateway | COP Team

Alternatives

An alternative would be to implement the new features in FX Suite, but this is a legacy application that Ebury wants to decommission.

Another alternative would be to implement the changes in AC Balancer, but AC Balancer only handles the OUTGOING legacy flow and even though it's a newer service written in Python 3, it's strongly coupled with the FX Suite legacy stack, which we also want to isolate in order to make the decommissioning process easier in the future. The pros and cons of using AC Balancer were described in the tradeoffs analysis document.

The implementation in a new service is also motivated by the separation of responsibility, since we are dealing with another nature of files and with the idea of creating a gateway with SWIFT and not with Autoclient (even if at the moment we are using Autoclient).

  • We have considered using Benthos, however we didn't find it fit for this project for the following reasons:
    • its sftp input produced erratic behaviour (permanent disconnection after even a single connection error), and none of the Benthos healthcheck endpoints or metrics correlated with the situation, so we could not even use those with K8S probes to at least restart the service, and we don't know a way to add a reliable heartbeat mechanism such as the one suggested for the problem with Paramiko
    • our use-case is not a typical one: dealing with file replicas on 3 physical FTP servers (the typical use-case would be to have one endpoint/interface and let the cloud infrastructure provide high-availability), hence not many other Benthos users out there are expected to face the same kind of problems we have. A cloud solution from Swift is being tested with smaller participants at the time of writing this, and we don't expect that the AutoClient server solution for Ebury will be replaced with a cloud-based, API solution any time soon. (If there is appetite, maybe we can look at Benthos again when that happens!)
    • to be able to (1) understand how exactly Benthos is supposed to work under the hood, (2) fix bugs we find, (3) add more metrics and logs where we today have no clue what's happening inside a standard Benthos input/output plugin (e.g. what S3 request URL is being used that fails), or to be able to define reliable K8S probes, we need to dig into Benthos code, but few people in Ebury know Go. Ebury doesn't have a support agreement for this product.
    • more details here and here
  • Kafka Connect SFTP source and sink connectors
    • the main reason we discarded the idea at the very beginning of the project is that we had concerns that a file wouldn't fit into the Kafka message size limit; however, since then we have learned that with Avro encoding we can specify the compression type, which typically results in a payload roughly two orders of magnitude smaller, giving us roughly a 100-200 MB max raw file size that we could still put into a Kafka message.
    • we expect the same problems around handling file replicas from the three FTP servers as with Benthos

Caveats

  • FX Suite currently contains code to consume IA files in order to generate some specific metrics - this can lead to concurrency issues between the new Swift Gateway and FX Suite, if both are reading from the same SFTP folder in Autoclient

    • Initial analysis indicates that this part of the FXS code is isolated into a specific daemon running on ECS, and we can set this daemon to run on ZERO machines, effectively eliminating the concurrency problem, without changing any of the FX Suite code;
    • If this is not possible for any other reason, we can add a switch or feature flag in FX Suite to enable or disable this code in production. This change shouldn’t be big, but it will be an important part of the rollout plan;
    • FX Suite also removes the IA file that has been read from Autoclient;
  • Python code that connects to SFTP servers usually relies on a library called Paramiko. Paramiko is known to be unreliable, especially regarding timeouts, and in some of our legacy applications we had to develop workarounds that brute-force timeouts using operating system signals to prevent the process from freezing indefinitely. Occasional, unexpected lock-ups caused by blocking function calls into third-party libraries can be mitigated in a generic way by the use of heartbeat files and K8S probes (examples used in existing services here, here and here).

  • This solution needs a reliable way to raise alerts. Using Prometheus's poll strategy is not reliable, as the process may crash before the metrics have been scraped. This is a generic problem for which Ebury has not yet found a best practice at the time of writing, and it is outside the scope of this blueprint.
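
The heartbeat-file idea mentioned in the Paramiko caveat can be sketched as follows: the worker touches a file after every successful polling cycle, and a probe command fails once the file becomes too old, so that Kubernetes restarts a frozen pod. The function names and the 30-second threshold used in the usage example are hypothetical and are not taken from the existing services.

```python
import time
from pathlib import Path
from typing import Optional


def beat(path: Path) -> None:
    """Called by the worker after each completed polling cycle."""
    path.touch()  # creates the file or refreshes its modification time


def is_alive(path: Path, max_age_seconds: float,
             now: Optional[float] = None) -> bool:
    """Called by the probe: False if the heartbeat is missing or stale."""
    try:
        mtime = path.stat().st_mtime
    except FileNotFoundError:
        return False
    current = time.time() if now is None else now
    return (current - mtime) <= max_age_seconds
```

A liveness probe could run a small script that exits with a non-zero status when `is_alive(Path("/tmp/heartbeat"), 30)` is false; the threshold must be larger than the longest legitimate polling cycle.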

Operation

The service will be deployed in a private network (Kubernetes cluster) and it will communicate through SFTP with Autoclient and Kafka topics with other Ebury systems.

In the initial days of deployment the service will be monitored by the COP team, but over time as operations begin to flow, it will be handed over to the Support team, with the proper alerts in place and runbooks defined.

This service is expected to run under Ebury Core infrastructure, and MUST NOT run under Ebury Mass Payments infrastructure, as EMP does not have any Autoclient instance.

Security Impact

No impact is expected on the existing Alliance Lite 2 service from SWIFT.

The files received from SWIFT will contain PI data related to clients, and therefore MUST NOT be leaked through logs, for example.

All received files are in a special kind of XML format that SWIFT calls XMLv2: standard XML files prefixed with a binary header containing integrity information about the file. Our service will validate that the file is consistent before processing it and passing it through to other Ebury services.

Since the file contains only XML, we’ve checked with security (SP-19179) and there’s no need to scan them for viruses.

Points of attention:

  • XML files are a possible attack vector, thus the service will need to defend itself against possibly malicious XML files;
    • This attack vector can be mitigated by using proper libraries and protecting the code according to the notes below:
    • The number of files and expected average size is not clear yet, but all files will be stored internally into a S3 bucket, so we need to plan capacity and long-term storage accordingly, avoiding a possible DoS attack (voluntary or involuntary) due to the amount of files exchanged;
    • No caching will be involved in parsing the files;
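
One common hardening step against malicious XML is to reject any document that carries a DOCTYPE declaration, since entity-expansion and external-entity attacks depend on it. The sketch below shows the idea with the standard library's expat bindings; it is an illustration under that assumption, not a complete defence, and a maintained hardened parser may be preferable in production. The names `ForbiddenXml` and `check_xml_without_dtd` are hypothetical.

```python
from xml.parsers import expat


class ForbiddenXml(ValueError):
    """Raised for XML that is well-formed so far but uses a banned feature."""


def check_xml_without_dtd(payload: bytes) -> None:
    """Parse the payload, refusing any DOCTYPE declaration.

    Raises ForbiddenXml if a DOCTYPE is present and expat.ExpatError if the
    document is not well-formed. Returns None when the payload is acceptable.
    """
    parser = expat.ParserCreate()

    def forbid_doctype(name, system_id, public_id, has_internal_subset):
        raise ForbiddenXml("DOCTYPE declarations are not accepted")

    # The handler fires before any entity declared in the DTD can be used.
    parser.StartDoctypeDeclHandler = forbid_doctype
    parser.Parse(payload, True)
```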

For each processed file we’ll store metadata about it, and one of the important fields is a hash (SHA-256) of the file, which we can use retrospectively to check for modifications and inconsistencies.

Connections to the Autoclient servers are done using login+password, both stored in Vault.

Performance Impact

In the context of Incoming Funds, the most similar workload that we have today is the one from FX Suite that handles MT940 and MT942 messages.

At the risk of over-simplification, the workflow of an MT940 is similar to a camt.053, and the workflow of an MT942 is similar to camt.054 and camt.052.

That means that for each new entry in one of Ebury accounts we expect to receive a file that contains a camt.054 (or camt.052, depending on the bank), and for each account Ebury has we expect to receive one camt.053 daily.

In a period of 24 hours (October 10th, 2023) we had 2.400+ MT 940 messages and 12.500+ MT 942 messages, by checking Kibana logs. This is within a typical daily volume if we check the last 30 days.

MT940 in Kibana logs

MT942 in Kibana logs

There’s also a useful dashboard showing the current flow of MT942 messages. Focus on the "Received MT942" panel, as it shows how many messages we received and had to analyse.

MT942 in Kibana dashboard

We expect about the same daily number of files for camt.054, camt.053 and camt.052 as described above, as the workflows are similar, which translates roughly to:

  • 12.000+ daily messages combining camt.054 and camt.052;
  • 1.500+ daily messages of camt.053;

We also expect an increase in size, as the legacy MT message format is more compact than the ISO 20022 XML format. Without actual production data there’s no way to have a real value, but we can estimate from the samples below:

  • 1 MT942 file with a single entry has 500 bytes;
  • 1 InterAct file, containing 1 camt.054 message with a single entry has 13.000 bytes;
  • The camt.052 message is similar to camt.054 by ISO standard;

Taking only these 2 files into consideration we have these estimations:

Scenario | Rationale | Final Size
MT942 | 12.000 messages/day * 500 Bytes/message | 5.7 MB/day
Camt.054 + Camt.052 | 12.000 messages/day * 13.000 Bytes/message | 148.8 MB/day

That’s a 26x increase in size.
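
The arithmetic behind the table can be reproduced with the short calculation below. It shows that the "MB" figures in the table correspond to binary megabytes (1 MB = 1024 * 1024 bytes); the function name is hypothetical.

```python
BYTES_PER_MB = 1024 * 1024  # the table values match binary megabytes


def daily_volume_mb(messages_per_day: int, bytes_per_message: int) -> float:
    """Daily volume in MB for a given message count and message size."""
    return messages_per_day * bytes_per_message / BYTES_PER_MB


mt942_mb = daily_volume_mb(12_000, 500)      # about 5.7 MB/day
camt_mb = daily_volume_mb(12_000, 13_000)    # about 148.8 MB/day
increase = camt_mb / mt942_mb                # 13.000 / 500 = 26

if __name__ == "__main__":
    print(f"MT942: {mt942_mb:.1f} MB/day")
    print(f"camt.054 + camt.052: {camt_mb:.1f} MB/day")
    print(f"increase: {increase:.0f}x")
```

Since the message count is the same in both scenarios, the increase is simply the ratio of the two sample sizes.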

There’s also a seasonal component on these messages, as historically at the end of the year the volume of Incoming Funds messages increases a lot. Unfortunately we don’t have hard statistics regarding the volume, as we’re storing this information only in Kibana, but this is common knowledge inside Money Flows and Operations teams, as they need to deal with larger volumes of fund reconciliations.

In order to analyse our scaling needs for future stages, the following metrics are planned to be collected:

  • Number of files received daily;
  • Total file processing time (From receiving to sending them to Kafka and archiving them);
  • File size;
  • Idle time;
  • Bandwidth / Download speed;

Data Contracts

The IA files are received in this format:

Swift Data Contract

The contracts will be created following the structure established by ISO 20022 itself. We will publish topics with the complete message (.ia) and topics with the respective DataPDUs.

The InterAct XMLv2 files are defined in Appendix B of AutoClient 1.4.0 for Alliance Lite 2 User Guide.

Inside the InterAct XMLv2 files, there will be ISO 20022 messages.

Swift Gateway will be a producer of the following Kafka topics:

  • events.swift.incoming-file
    • Will contain messages notifying that new IA files are ready for consumption;
    • Each message will contain metadata already described in the proposed solution;
    • The messages will NOT transport the whole file, only a reference to the S3 location;
    • There is no need for infinite retention;
  • events.swift.incoming-pdu
    • Will contain messages notifying about DataPDU inside each IA file;
    • Each message will contain the whole DataPDU;
    • The DataPDU is limited by SWIFT standard to 999.999 bytes, fitting inside a Kafka message;
    • If needed, we can activate compression in Avro schemas, reducing the payload;
    • There is no need for infinite retention;
  • events.swift.incoming-error-file
    • Contains:
      • error file content (unstructured text)
      • original request ID
      • original request labels

Swift Gateway will be a consumer of the following Kafka topics:

  • events.swift.outgoing-pdu
    • Requests to turn DataPDUs into InterAct files and upload to AutoClient
    • The requests contain:
      • DataPDU payload
      • request ID
      • labels
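
To make the shape of the produced events more concrete, the sketch below models the two main payloads as dataclasses. The field names are hypothetical, and JSON is used here only to keep the example self-contained; it stands in for the Avro schemas mentioned in this document.

```python
import json
from dataclasses import asdict, dataclass

MAX_DATAPDU_BYTES = 999_999  # DataPDU size limit stated in this document


@dataclass(frozen=True)
class IncomingFileEvent:
    """events.swift.incoming-file: a reference to the file, not its content."""
    filename: str
    size: int
    sha256: str
    s3_bucket: str
    s3_key: str


@dataclass(frozen=True)
class IncomingPduEvent:
    """events.swift.incoming-pdu: carries the whole DataPDU."""
    key: str       # deterministic key, e.g. filename + DataPDU index
    datapdu: str   # the DataPDU XML document


def encode_file_event(event: IncomingFileEvent) -> bytes:
    return json.dumps(asdict(event)).encode("utf-8")


def encode_pdu_event(event: IncomingPduEvent) -> bytes:
    """Serialise the event, refusing DataPDUs above the size limit."""
    if len(event.datapdu.encode("utf-8")) > MAX_DATAPDU_BYTES:
        raise ValueError("DataPDU exceeds the 999.999 byte limit")
    return json.dumps(asdict(event)).encode("utf-8")
```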

Data Sources

SWIFT network is the main data source, providing files through Autoclient server (via SFTP protocol).

Deployment

The service will be deployed in our Kubernetes cluster, meaning it MUST support concurrency, as there’s expected overlap between the current pod and the previous pod in a deployment.

This is described in the Developers Handbook and it’s a reasonable non-functional requirement.

Dependencies

No dependencies on external systems at the moment.



Based on RFC Template Version 1.1