Written by Marc Suesser and Surya Penumatcha
The IBM Data Science Experience (DSX) offers a wealth of functionality to any software developer, especially those interested in data science. An important part of that functionality is the ability to use Notebooks, which are a convenient and intuitive way to compartmentalize different segments of a code base.
The IBM Watson Data Platform (WDP) Integration team manages system verification defects for various services and utilizes GitHub’s “issues” feature to keep track of each defect’s status, details, and assignments. Currently, there is no way for us to quantify the team’s activity each week. How many defects are being opened, closed, and worked on each week for each service? How severe are those defects?
In this article, we’ll explore how to track and present a high-level overview of a specific development team’s weekly activity. The goal is to collect raw data from GitHub, organize and sort it, then visualize it in a meaningful way. After analyzing organized data about a service’s defects, an observer can make educated inferences about the team and code base. This program was developed during the active development of Watson Machine Learning (WML), which will be reflected in the example charts shown below.
Gathering the information
The program begins in a Python notebook that retrieves defect information from GitHub, parses the data, and stores it in three separate Db2 Warehouse on Cloud tables (one for each service the integration team is assigned to).
Pushing the data from tuples to data frames, and an example of one Db2 Warehouse on Cloud table update
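The parsing step can be sketched in a few lines. This is an illustrative version only: `issues_to_rows`, `rows_to_frame`, and the `sev` label-prefix convention are our own names here, not the team's actual notebook code or schema.

```python
import pandas as pd

# Assumption: severity is encoded as a GitHub label starting with "sev"
# (e.g. "sev1", "sev2"). Adjust the prefix to match your labeling scheme.
SEVERITY_PREFIX = "sev"

def issues_to_rows(issues):
    """Flatten raw GitHub issue JSON objects into defect tuples."""
    rows = []
    for issue in issues:
        labels = [label["name"] for label in issue.get("labels", [])]
        # Pick the first severity-style label; fall back to "unlabeled".
        severity = next((l for l in labels if l.startswith(SEVERITY_PREFIX)),
                        "unlabeled")
        rows.append((issue["number"], issue["title"], severity,
                     issue["created_at"], issue.get("closed_at")))
    return rows

def rows_to_frame(rows):
    """Load defect tuples into a DataFrame ready to be written to Db2."""
    return pd.DataFrame(
        rows,
        columns=["NUMBER", "TITLE", "SEVERITY", "CREATED_AT", "CLOSED_AT"])
```

From there, the DataFrame can be written to a Db2 Warehouse on Cloud table (for example with an `ibm_db` connection), one table per service.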
Thanks to DSX, we’re able to schedule this notebook to run every hour, so the tables stay up to date with minimal maintenance. Click the clock icon inside a DSX notebook to set up a scheduled job:
Once you schedule a notebook, you’ll see it in your scheduled jobs. These jobs are displayed in the project’s “Overview” section:
Cleaning and visualizing the information
Once all defect data has been aggregated, a separate notebook pulls this information out, sorts it by severity, and charts the information we want to see.
Thanks to the GitHub REST API, the creation and closure dates for each defect are available to us. With these, we are able to create burn down charts. Below, you’ll see an example of the charts for one of the three services we are analyzing. The x-axis contains one tick for each workweek, and we plot three data points per tick: the number of defects opened, the number closed, and the number that remain open at the conclusion of the week (the backlog). Alongside the burn down chart, the program generates a graph displaying the count of backlog defects per severity level.
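The three data points per week can be computed from just the creation and closure dates. A minimal sketch, where `weekly_burndown` and its list-of-date-pairs input format are assumptions of ours rather than the actual notebook's code:

```python
from datetime import date, timedelta

def weekly_burndown(defects, start, weeks):
    """Count defects opened, closed, and still open (backlog) per week.

    `defects` is a list of (created_date, closed_date_or_None) pairs;
    weeks are 7-day windows beginning at `start`.
    """
    points = []
    for w in range(weeks):
        week_start = start + timedelta(weeks=w)
        week_end = week_start + timedelta(days=7)
        opened = sum(1 for c, _ in defects if week_start <= c < week_end)
        closed = sum(1 for _, x in defects if x and week_start <= x < week_end)
        # Backlog: created before the week ended, and not yet closed by then.
        backlog = sum(1 for c, x in defects
                      if c < week_end and (x is None or x >= week_end))
        points.append((opened, closed, backlog))
    return points
```

Each resulting (opened, closed, backlog) tuple corresponds to one tick on the chart; matplotlib's `plt.plot` can then draw one line per series.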
The entire notebook is pushed to GitHub (another handy feature DSX provides). Our web server then pulls the notebook directly from our GitHub repository and neatly displays it in a web page.
Once this is complete, the sorted data is sent (via Db2 Warehouse on Cloud tables) to our external web page, where we use it for other useful tasks, like generating graphs and tables that outline specific details of each defect.
Under the hood
There’s a lot of cross-platform communication going on behind the scenes. Information needs to be extracted, parsed, externally stored, transferred, and re-parsed. Here’s a graphic to show the flow of these two programs:
1. Make a request from the Node.js server to get the pipeline information of all the issues on the ZenHub boards.*
2. Sort the pipeline data and put the information in a Db2 Warehouse on Cloud table.*
3. Get a list of issues from GitHub and parse the relevant information (issue created date, label, severity, and more) into tables.
4. Push the unsorted GitHub data to a Db2 Warehouse on Cloud table.
5. Get the defect data from the Db2 Warehouse on Cloud table, sort it, and generate the burn down table. Plot the burn down data on a graph using matplotlib.
6. Save the burn down data points to a Db2 Warehouse on Cloud table.
7. On the web page, retrieve the burn down data and defect tables from Db2 Warehouse on Cloud. When a user makes a request on the front end, deliver this information.
8. Push the notebook to GitHub as a Gist. The Gist link is embedded in the web page, so whenever the notebook is updated on GitHub, the web page automatically displays the new version.
9. Additional visualizations are made on the web page once the data has been retrieved.
* Steps one and two would be merged with step three if not for some issues we ran into regarding interactions between IBM’s firewall and the different services we were making API requests from.
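The GitHub retrieval in step three can be sketched with the REST API's issues endpoint, which pages through open and closed issues alike. This is an illustrative version: `fetch_all_issues` and the injectable `get` parameter are our own names, not part of the team's actual notebook, and your token and repository details would differ.

```python
import requests

GITHUB_API = "https://api.github.com/repos/{owner}/{repo}/issues"

def fetch_all_issues(owner, repo, token, get=requests.get):
    """Page through the GitHub issues endpoint until an empty page is returned.

    `state=all` returns both open and closed issues; `get` is injectable so
    the pagination logic can be exercised without hitting the network.
    """
    issues, page = [], 1
    while True:
        resp = get(GITHUB_API.format(owner=owner, repo=repo),
                   headers={"Authorization": f"token {token}"},
                   params={"state": "all", "per_page": 100, "page": page})
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        issues.extend(batch)
        page += 1
    return issues
```

The ZenHub request in step one is analogous, but goes to ZenHub's own REST API with its authentication token header, from the Node.js server as noted above.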
IBM Data Science Experience has proven to be an incredibly useful resource for this project’s development, and it will continue to provide user-friendly, partially automated maintenance.