Jira Data Ingestion

The data mastered in Jira needs to be consumed by key processes at Ebury. This document covers the mechanisms we'll use going forward to extract such information.

Reference Documents

Reference	Document Location
Hevo Doc	Hevo Doc
SCD2	Slowly changing dimension
DBT	Data Build Tool
Data Lake To Data Warehouse	Data Lake to Data Warehouse

Problem Description

Despite Jira is the centralised data repository related to Operations team's activities, it does not offer all the needed performance metrics. The same scenario happens in Tech where ProcessLabs has been used to cover this need, pulling Jira data and providing its own data mart with additional capabilities to support custom metrics, visualisations and combine the data with other data to enrich it (for example bob).

Currently, information about team activities is scattered across different JIRA projects and individual tickets, making it difficult to get a clear overview of team performance and identify areas for improvement.

The purpose of this project is to extract all possible raw data from JIRA and store it in a data warehouse so that it can be easily accessed and analyzed. By doing so, we aim to track important metrics related to the operations team's performance, such as ticket resolution time, team workload, and efficiency, and identify areas for improvement.

Without a centralized repository for this information, it can be challenging to get a clear overview of the team's performance and identify areas for improvement. This can lead to inefficiencies, missed opportunities for process improvement, and ultimately decreased productivity and effectiveness of the operations team.

Therefore, it is essential to extract and store this data in a centralized location to provide the necessary visibility into team activities and help identify areas for improvement.

Additionally, while Jira Query Language (JQL) may provide some KPIs, it can be challenging to get a comprehensive view of team performance across different projects and tickets. By extracting all possible raw data from Jira and combining it with other data sources, we can gain more insights and identify areas for improvement.

Background

JIRA is a popular tool used for tracking issues and project management. We need to extract all possible raw data from JIRA in order to have it available in the data warehouse for better tracking of metrics.

Solution

The proposed solution for the problem described is to extract data from Jira and store it in a data warehouse. This will enable us to track metrics from all teams in currently going on in Jira and a comprehensive view of the activities taking place.

This solution, which is not longer-term when time constraints are evaluated, is a tactical solution. The data is not currently being captured in the data lake, but it is the longer-term strategy to push data into the data lake for subsequent extraction. This is also included in the Alternatives part.

The tactical decision to use Hevo has been made because of the immediate need for this data and the availability of an already proven and approved technology and relationship. The Data Engineering strategy has yet to be defined but would detail our longer term approach to data ingestion and the terms with which associated technology selections should be made.

HEVO will get data and send it to BigQuery after that a Data Build Tool (DBT) process will then be capable of transforming the raw information.

The workflow will involve:

Communication with Jira will be carried out using HEVO
HEVO connects with BigQuery and brings raw information
Deciding what information we need to transform using DBT from the raw information extracted
Developing DBT models which transform raw information
Information transfers will be monitored by HEVO
Alerting will be carried out by HEVO and an email will be sent to the team when an error occurs

The Architectural Design:

The Jira Cloud Entity Relationship Diagram:

To understand the data model of Jira Cloud, we use the Entity Relationship Diagram (ERD), which shows the data entities and their attributes, as well as any relationships between them. As we have the ERD, the next step is to load it into the data warehouse using DBT models. This involves creating tables in the data warehouse, such as scd2 and historical, and ensuring that the data is properly structured and formatted.

To model the data from Jira, we focus on specific tables such as project, issue, issue_type, issue_worklog, board, status, component, board_sprint, and comment. Using DBT, we then clean and normalize the data, and perform any necessary calculations or aggregations. To trace changes in the data over time, we use historical tables, and we use scd2 modeling to understand how the data has changed. Also, the Data Lake to Data Warehouse pattern has been added as a reference.

Incomplete Data

Hevo Jira Connector documentation states that JIRA's REST APIs do not support identifying deleted issues, which means that the solution may result in incomplete data if issues are deleted in JIRA. It is important to consider this limitation and implement appropriate measures to handle deleted issues in the source system.

In addition, it requires a new DCS (Data Cron Services) needs to be created to calculate and track deleted issues. This solution will involve using Airflow to periodically check for deletions and update the data accordingly. Specifically, the DCS will run once a day to check for any deletions and update the dataset to include this information.

Dimensional Modeling

Base Layer

We want SCD2 (Slowly changing dimension type 2) tables for each object with an is_last_version flag to be able to identify the last version of each unique id.
We want each table to have an is_deleted flag. So, we can correctly identify deleted Jira issues.
We want the Jira data to be updated hourly.
We want to understand the relations between all the base tables and point it out here.
We would like to have a definition of each field in the base tables. So, we can understand the meaning behind it.
Field names should be snake_case and have certain consistency. Follow the naming convention guide.
This layer will be handled as DIO scope.

Core Layer

Given the simplicity of the Jira data, the low criticality of the current reporting dependencies (internal MI) and that we are just starting to work with this data, we will not be building a specific star schema dimensional model for Jira as part of the MVP (Minimum Viable Product). This is something that we may consider down the line.
We will also not be incorporating the Jira data to the current EBI Live dimensional model.
We will be creating obt tables for each business workflow. Examples of obt tables we have already created (in testing):
- obt_jira_onboarding_issues
- obt_jira_reassessment_issues
These obt tables will select the relevant columns and filters for each business workflow. They will also enrich the Jira data with fields that will be required for downstream reporting.
The obt tables will include a set of tests to ensure the accuracy of the data and to flag any inconsistencies in manual data entry in the system
These obt tables will then be used downstream in the relevant domain marts to build specific business metrics
This layer will be handled as Analytics Engineering scope.

The Jira Cloud ERD:

The solution rationale is:

Low resource impact (compared to other options).
Splits out the resource dependencies to make it quicker to implement.
Achievable within the timelines provided by the project (as soon as).
An enabler for further data extraction reducing future implementation costs and aligns with Ebury's architecture goals. We can expand on this solution to meet short / medium / long term requirements.
Tech-debt that is easier to move (compared to other options).

Service Ownership

New Service	Service Name	Service Owner
Yes*	Hevo - Jira Extractor	DIO Team
Yes*	DCS - Jira Deletion Calculator	DIO Team

* This service will use an existing third party tool but we will be adding a new data pipeline to that tool.

Alternatives

We have alternatives for gathering data from Jira.

Extract from Jira’s API or Webhook via and pushing it into Kafka alternative involves extracting information from JIRA using Kafka as the data pipeline. The process would include working with Kafka, creating topics, configuring Kafka Connect, and developing a process to obtain information and push it into Kafka. The information that will be collected is defined in the previous stage. The data would be stored in S3 and Google Cloud Storage for auditability purposes, and alerting would be carried out by HEVO, Airflow, Prometheus or custom processes in Cloud Run,Google Cloud Monitoring or AWS depending on the implemented process for obtaining information from JIRA API.

Pros:

Using Kafka as the data pipeline provides a scalable and fault-tolerant solution. Kafka's publish/subscribe model ensures that the data is delivered to all subscribed consumers, which can improve data consistency. Data can be easily processed in real-time using stream processing frameworks like Apache Flink or Apache Spark Streaming.

Cons:

Implementing Kafka requires a significant learning curve and expertise, which may add additional complexity to the project. Setting up a Kafka cluster can be time-consuming and costly. The Kafka-based solution may not be necessary for the scale of the project, and it may be more efficient to use a simpler solution such as HEVO or JIRA's API.

We chose Extract from JIRA via HEVO because it is a simpler and more cost-effective solution that requires less technical expertise and allows for faster implementation. It also allows for easier data validation and monitoring, and provides built-in alerting and error handling. However, it has limitations in terms of flexibility and scalability compared to Extract from JIRA via Kafka, which could provide more flexibility and scalability in handling large volumes of data and integrating with a wider range of data sources. It also requires a more complex and costly infrastructure, and involves a steeper learning curve and greater technical expertise.

Caveats

Some caveats to consider for the proposed solution:

Data quality: The success of this project depends on the quality of data that is extracted from JIRA. It is important to ensure that the data being extracted is complete, accurate, and up-to-date. If the data quality is poor, it may lead to incorrect metrics being tracked and analyzed.
Scalability: As the amount of data being extracted from JIRA grows, it is important to ensure that the solution can scale accordingly. The infrastructure should be designed to handle a larger volume of data without any degradation in performance.
Cost: The solution proposed involves using cloud services like BigQuery and Kafka, which may come with a significant cost. It is important to consider the cost implications of the solution, and ensure that it fits within the budget of the organization. Additionally, In Hevo as the usage of the JIRA project and the number of data sources increases, the cost of the solution may increase proportionally. It is important to monitor usage and cost closely to ensure that afford the solution over the long term. The monthly usage limit in Hevo is 50M(the number of Events), and should be considered for future projects.
Security: It is important to ensure that the solution is designed with security in mind, and that appropriate measures are taken to protect sensitive data. Access to the data should be restricted to authorized users, and appropriate security controls should be put in place to prevent data breaches.

Operation

The proposed solution will be run on a cloud-based infrastructure with automated deployment processes. The DIO team will be in charge of operating and maintaining the service. The service will be intended for internal use, accessible only within the company's network. Access to the service will be restricted to authorized personnel, with different levels of access based on role and responsibilities. The roles involved in operating the service include the DIO team, SRE team and relevant stakeholders(Data team). The Service Level Agreement (SLA) will be established and communicated to all parties involved to ensure smooth operations and efficient issue resolution.

Security Impact

The data that is handled in this service is Ebury employee activity data. The data that all will be stored in our Google Cloud infrastructure in a BQ and consumed by the Data Team to produce the relevant reports required by the stakeholders.

The solution should be designed and implemented with security in mind, and all relevant security measures should be put in place to minimize any potential impact on the system. The relevant security team should also be involved in the design and implementation process to ensure that all necessary security measures are in place.

For this MVP, the following security requirements: - Access control: GCP accesses managed by Jira Service Management Platform where Security and DIO team approvals are mandatory. - Data encryption: Data must be encrypted both at rest and in transit to avoid exposing sensitive information. Hevo data encryption policy here. (To be defined by Security) - API authentication: Managed through automationadmin@ebury.com service account to guarantee a secured API implementation to prevent unathorised access, manipulation or deletion of data. - Vendor security: The security posture of Hevo assessed and confirmed that their security measures align with Ebury's standards. - Success criteria: MVP will cover Ops data on a first instance and then also Tech data will be loaded.

Performance Impact

The data extraction every 30 minutes and load every 15 through HEVO, DBT models runs once every hour.

Data Contracts

The compliance related information is going to be moved to Jira (new data source). There is a number of working groups where all stakeholders and consumers agree on the migration steps and strategy.

Data Sources

The data will be extracted using Hevo. The data will be transformed in payloads for It is expected the data to land in Big Query.

Deployment

CAB Policy here

Google Cloud Composer (Airflow) - dcs-jira-deletion-calculator

Dependencies

No dependencies. This work can start immediately.

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search