Salesforce - Prometheus integration

Salesforce currently has no strong alerting mechanism beyond a few scattered emails, so we are working to unify our alerts with the rest of the company.

To do this, we are following the guidelines in the Monitoring platform RFC to integrate with Prometheus for data gathering and alert routing.

Problem Description

With upcoming projects such as the integration with the CRR service (Salesforce - CRR service integration) or with Jumio, how we track data, monitor it and, especially, alert when necessary is becoming critical for the business to anticipate errors and problems.

Background

We have two objects in Salesforce that help the development team monitor the platform:

Apex_Debug_Log__c: the main purpose of this object is to track errors. It has the following fields: Is_Urgent__c, Message__c, Apex_Class__c, Method__c, Record_Id__c, Stack_Trace__c and Type__c. When a log is urgent, an email with the details is sent to the development team. We also have a dashboard with a gauge showing the errors in the last 7 days: Salesforce monitoring dashboard
API_Log__c: the main purpose of this object is to track all communication with external services. It has the following fields: Body__c, End_Point__c, Method__c, Source__c, Status_Code__c, Status_Message__c and Type__c. We use it only to track the request itself, so in case of error an Apex_Debug_Log__c record is created as well.

However, monitoring this data in Salesforce and sending a few emails directly to the development team does not meet the quality standards of the department, so we need to take this to the next level.

Solution

We will not go into detail on how Prometheus manages data and integrates with other services like Kibana, Grafana, Nagios or Victorops, because that is covered in its own RFC. Instead, we describe how Salesforce will enable Prometheus to gather the relevant data.

Ideally we would use a Prometheus client library to process the metrics. However, no library exists for Apex and creating one from scratch could be a huge amount of work, so we prefer to create a simpler client that supports only the metric types we need today, Counters and Gauges, and to extend it later with histograms or summaries (including quantiles) if needed. For this basic client we are creating a new object, PrometheusCounter, with an increase method.
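As an illustration only, the behaviour expected from such a counter can be sketched in Python (the real client would be written in Apex and would persist its values in Salesforce). The field names, the label handling and the in-memory storage below are assumptions made for the sketch, not the actual design of PrometheusCounter; only the object name and the increase method come from this RFC.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# A label set is stored as a sorted tuple of (name, value) pairs so that
# the same labels always map to the same counter value.
LabelKey = Tuple[Tuple[str, str], ...]


@dataclass
class PrometheusCounter:
    """Hypothetical in-memory counter: one monotonic value per label set."""

    name: str
    help_text: str
    values: Dict[LabelKey, float] = field(default_factory=dict)

    def increase(self, amount: float = 1, **labels: str) -> float:
        """Add `amount` to the value for `labels` and return the new value."""
        if amount < 0:
            # Counters only go up; a value that can decrease is a gauge.
            raise ValueError("a counter can only be increased")
        key = tuple(sorted(labels.items()))
        self.values[key] = self.values.get(key, 0) + amount
        return self.values[key]


requests_total = PrometheusCounter(
    "crr_http_requests_total", "The total number of HTTP requests."
)
requests_total.increase(method="post", code="200")
requests_total.increase(method="post", code="200")
requests_total.increase(method="post", code="400")
```

Keeping one value per label combination is what later allows a separate line per status code in the exposed metrics.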

Once the basic client has been created, we will create a /metrics endpoint which, using that client, will collect all the PrometheusCounters in the system, transform them into the Prometheus format and return the result.

Which data is returned in the Prometheus format, and how it is used, will depend on each project. As an example, using the CRR integration as reference, this is what we will return:
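The transformation step can be sketched as follows, again in Python for illustration rather than Apex. The function names and the shape of the input (a list of label/value pairs) are assumptions; the output follows the Prometheus text exposition format, where each metric is introduced by a HELP and a TYPE line that carry the metric's own name.

```python
from typing import Dict, Iterable, List, Tuple

# One sample: the label set and the current value (hypothetical shape).
Sample = Tuple[Dict[str, str], float]


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_value(value: float) -> str:
    """Print whole numbers without a decimal point, others as floats."""
    return str(int(value)) if float(value).is_integer() else repr(float(value))


def render_metric(
    name: str, help_text: str, metric_type: str, samples: Iterable[Sample]
) -> str:
    """Render one metric family in the Prometheus text exposition format."""
    lines: List[str] = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        label_text = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
        )
        prefix = f"{name}{{{label_text}}}" if labels else name
        lines.append(f"{prefix} {format_value(value)}")
    return "\n".join(lines) + "\n"


body = render_metric(
    "crr_http_requests_total",
    "The total number of HTTP requests.",
    "counter",
    [({"method": "post", "code": "200"}, 1027), ({"method": "post", "code": "400"}, 3)],
)
```

The sketch omits the optional trailing timestamp on each sample line; when it is left out, Prometheus uses the scrape time.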

# HELP crr_http_requests_total The total number of HTTP requests.
# TYPE crr_http_requests_total counter
crr_http_requests_total{method="post",code="200", path="/lightning/r/Account/id"} 1027 1395066363000
crr_http_requests_total{method="post",code="400", path="/lightning/r/Case/id"}    3 1395066363000
crr_http_requests_total{method="post",code="any_other_code", path="/lightning/r/Account/id"}    1 1395066363000

# HELP crr_http_requests_duration_milliseconds The last HTTP request duration.
# TYPE crr_http_requests_duration_milliseconds gauge
crr_http_requests_duration_milliseconds{method="post",code="200", path="/lightning/r/Account/id"} 14657 1395066363000

Note that any_other_code is only a placeholder: we are not aggregating the remaining responses under it, but creating a separate line for every distinct status code.

For total requests, every time we receive a response to an external callout we increase the relevant counter, so that we can later return the counter value and the timestamp for all requests by status code. This allows Prometheus to compare the number of errors within a time frame and raise an alert if that number exceeds a specific threshold.

For request duration we take a similar approach, this time using Gauges to track the duration of the last request. Prometheus can then raise an alert if the average over a fixed time frame reaches a threshold.

To make this concrete with the CRR example, we could define a time frame of 30 minutes and a threshold of 5: if we receive more than 5 errors in that period, an error being any response other than 200, we would raise an alert and devops/support would receive an email.
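The duration check can be illustrated with a small Python sketch. In practice this evaluation happens inside Prometheus (PromQL offers avg_over_time for this kind of query), so the code below only mirrors the logic; the sample values, the window and the 10,000 ms threshold are invented for the example.

```python
from typing import Iterable, Optional, Tuple

# One scraped sample: (timestamp in seconds, duration in milliseconds).
Sample = Tuple[float, float]


def average_over_window(
    samples: Iterable[Sample], now: float, window_seconds: float
) -> Optional[float]:
    """Mean of the samples scraped in the interval (now - window, now]."""
    recent = [value for ts, value in samples if now - window_seconds < ts <= now]
    if not recent:
        return None  # no data in the window, nothing to evaluate
    return sum(recent) / len(recent)


def duration_alert(
    samples: Iterable[Sample], now: float, window_seconds: float, threshold_ms: float
) -> bool:
    """True when the windowed average reaches the threshold."""
    average = average_over_window(samples, now, window_seconds)
    return average is not None and average >= threshold_ms


# Hypothetical scraped gauge values.
scraped = [(0, 100.0), (900, 12000.0), (1800, 14657.0)]
alert = duration_alert(scraped, now=1800, window_seconds=1800, threshold_ms=10000)
```

Because the gauge only holds the duration of the last request, the average is taken over the values seen at scrape time, not over every request made.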
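The rule for the CRR example can be sketched in Python to show the arithmetic. Prometheus would evaluate this itself from the scraped counters; the snapshot dictionaries and function names here are hypothetical, and the reset handling is a simplification of what Prometheus does for counters.

```python
from typing import Dict


def error_increase(start: Dict[str, int], end: Dict[str, int]) -> int:
    """Growth of all non-200 counters between two snapshots keyed by status code."""
    total = 0
    for code, end_value in end.items():
        if code == "200":
            continue  # only responses other than 200 count as errors
        start_value = start.get(code, 0)
        if end_value >= start_value:
            total += end_value - start_value
        else:
            # The counter went backwards, so assume it was reset to zero.
            total += end_value
    return total


def should_alert(start: Dict[str, int], end: Dict[str, int], threshold: int = 5) -> bool:
    """Alert when more than `threshold` errors occurred in the time frame."""
    return error_increase(start, end) > threshold


# Counter values 30 minutes ago and now (illustrative numbers).
window_start = {"200": 1000, "400": 3}
window_end = {"200": 1027, "400": 7, "500": 2}
alert = should_alert(window_start, window_end)
```

Here the 400 counter grew by 4 and a new 500 counter appeared with 2, giving 6 errors in the window, which is over the threshold of 5.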

Alternatives

Other alternatives were reviewed, such as calling Nagios or Victorops directly without going through Prometheus. However, devops is making a great effort to unify monitoring and alerting into a centralised service, which also gives us many benefits, as explained in its own RFC, so the solution was quickly agreed.

Caveats

Operation

This is a tool for internal use: only Salesforce developers, devops and the support team should have control here.

Security Impact

We are using OAuth 2.0 Client Credentials Flow for Server-to-Server Integration.

We have a custom Docker image that performs the authentication and caches the access token, which is then reused across multiple API calls.
The environment variables for the Prometheus staging and production servers are defined in ansible-playbook-prometheus.
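The caching behaviour described above can be sketched as follows. This is not the code in the Docker image: the class name, the fixed lifetime and the injected fetch function are assumptions made so the example runs without a network call. In a real setup, fetch_token would perform the client credentials request against Salesforce using the clientId and clientSecret from the environment.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Hypothetical cache that reuses an access token until it expires."""

    def __init__(
        self,
        fetch_token: Callable[[], str],
        ttl_seconds: float = 900.0,  # assumed lifetime, not a Salesforce value
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_token = fetch_token
        self._ttl = ttl_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached token, fetching a new one only when needed."""
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token = self._fetch_token()
            self._expires_at = now + self._ttl
        return self._token

    def invalidate(self) -> None:
        """Drop the token, for example after the API rejects it."""
        self._token = None


# Usage with a stand-in fetcher and a controllable clock.
fetch_count = []
fake_now = [0.0]


def fake_fetch() -> str:
    fetch_count.append(1)
    return f"token-{len(fetch_count)}"


cache = TokenCache(fake_fetch, ttl_seconds=10, clock=lambda: fake_now[0])
first = cache.get()
second = cache.get()  # served from the cache, no second fetch
```

Calling invalidate when a request is rejected lets the next call obtain a fresh token, which also helps during secret rotation.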

Secret Rotation

It is best practice to rotate our clientId and clientSecret periodically to keep the integration secure. To make sure that existing integrations don't break, generate staged consumer details and share them with the connected app integrations. When ready, apply the new consumer details.

Connected App

Id	Name	Description	Contact Email
	Prometheus	Prometheus is Ebury monitoring and alerting tool.	platform-core@ebury.com

Salesforce Service User

Id	Username	Email	Profile
00520000003DkiZAAS	integration@ebury.com	salesforce@ebury.com	System API

The DVO team recommended using TCP port 19100. However, this is not feasible because we cannot manage ports in Salesforce, so we will use port 443. We agreed that a well-defined authentication mechanism is enough.

Performance Impact

Prometheus will call the /metrics endpoint every 15 seconds, which means 5,760 requests per day. We currently have 1,039,600 daily requests available in Salesforce; this allowance is tied to the number of users on the platform, so it will keep increasing as we add users. The scrapes would therefore account for only about 0.55% of the allowance. Given that we usually consume between 25% and 35% of our daily requests, this does not cause any problem.
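The figures above can be checked with a few lines of Python. The function names are only for illustration; the inputs are the scrape interval and the daily allowance quoted in this section.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_scrapes(interval_seconds: int) -> int:
    """Number of scrapes per day for a fixed scrape interval."""
    return SECONDS_PER_DAY // interval_seconds


def share_of_allowance(requests: int, daily_allowance: int) -> float:
    """Percentage of the daily API allowance consumed by `requests`."""
    return 100.0 * requests / daily_allowance


scrapes = daily_scrapes(15)
percentage = share_of_allowance(scrapes, 1_039_600)
print(f"{scrapes} scrapes per day, {percentage:.2f}% of the allowance")
```

Even added to the upper end of the usual 25% to 35% consumption, the scrapes leave the total well below the daily limit.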

Developer Impact

No impact at all, apart from the benefit of being properly notified when something happens.

Data Consumer Impact

Deployment

The deployment can be done in two phases: first deploy the library and the endpoint in Salesforce, even if they are not yet in use, and later configure Prometheus to consume the endpoint.

These two steps can of course be done in the same deployment, but Prometheus must never be configured before the Salesforce deployment.

Dependencies

We have no dependencies for the development or deployment of the Salesforce service but, for the activation, we have to coordinate with devops to stay aligned with the development in Prometheus.

References

Monitoring platform - Ebury blueprints

Prometheus Lib epic: ODT-182

Prometheus integration epic: ODT-183

Postman collection In Ebury Tech

Secret Rotation
