
Controlling reports for a small to midsize company

Original situation: a company of 30-40 employees had doubled in size, and the services it provided were becoming more complex. Management could no longer afford a purely intuitive approach; decision making needed to be backed by data. When reports were required, it usually took the finance and back-office colleagues a few days to prepare them. Whenever managers spotted calculation mistakes in the results, the whole exercise had to be redone with the same effort. Evaluating and closing a financial year took over three months.

Solution: together with the Customer, we started by whiteboarding the architecture, from the data sources to be integrated down to the first reporting requirements. We identified the connections between the data silos of the various source systems and the transformations needed to obtain clean, verified data for the reports. I laid the foundations of the data load and transformation processes, the data warehouse and the data models for the reports (including many of the complex measures used). I worked together with a representative of the management, who specified the report requirements and did the report design and editing.
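As an illustration of the transformation layer, here is a minimal T-SQL sketch of a cleansing step between staging and the warehouse; the table and column names (stg.Invoice, dwh.FactInvoice and so on) are hypothetical and only stand in for the Customer's actual model:

    -- Hypothetical staging-to-warehouse cleansing step: trim text fields, validate
    -- dates and amounts, and keep only the latest version of each invoice.
    WITH cleaned AS (
        SELECT
            InvoiceId,
            LTRIM(RTRIM(CustomerCode))             AS CustomerCode,
            TRY_CONVERT(date, InvoiceDate)         AS InvoiceDate,  -- NULL if not a valid date
            TRY_CONVERT(decimal(18, 2), NetAmount) AS NetAmount,
            ROW_NUMBER() OVER (PARTITION BY InvoiceId
                               ORDER BY LoadTimestamp DESC) AS rn   -- deduplicate on InvoiceId
        FROM stg.Invoice
    )
    INSERT INTO dwh.FactInvoice (InvoiceId, CustomerCode, InvoiceDate, NetAmount)
    SELECT InvoiceId, CustomerCode, InvoiceDate, NetAmount
    FROM cleaned
    WHERE rn = 1
      AND InvoiceDate IS NOT NULL;  -- rows failing validation are handled separately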

Technologies used:

  • Data load: SQL Server Integration Services and SQL Server Linked Server objects for on-prem data sources, Azure Data Factory for online sources (e.g. Dynamics 365), with OData queries where necessary to limit the number of columns and rows transferred (see the sketch after this list).
  • Transformations and data warehouse: TSQL and Microsoft SQL Server
  • Reporting and data models: PowerBI including several complex measures implemented in DAX.
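For the on-prem part, a minimal sketch (hypothetical linked server and table names, assuming the source is another SQL Server) of pulling only the needed columns and rows through a linked server, with the filtering done on the source side; the OData loads in Azure Data Factory follow the same idea using the $select and $filter query options:

    -- Hypothetical linked-server extract: the inner query is executed on the source
    -- system, so only the required columns and the recent rows cross the network.
    INSERT INTO stg.ProjectTimesheet (EmployeeId, ProjectCode, WorkDate, Hours)
    SELECT EmployeeId, ProjectCode, WorkDate, Hours
    FROM OPENQUERY(ERP, '
        SELECT EmployeeId, ProjectCode, WorkDate, Hours
        FROM dbo.Timesheet
        WHERE WorkDate >= DATEADD(day, -7, CAST(GETDATE() AS date))
    ');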

Results: company management now has fresh data about the state of the business every hour, in the form of visually appealing reports. Year-end closing and evaluation is done within one month. The BI system has become a continuously evolving tool, as deeper insights keep inspiring new questions.

Research data analysis and visualization

Task: a wildlife observation project involving GPS tracking of animal movement required identifying movement patterns and their extent. The results will help find better ways to cope with the damage the animals cause to crop fields and to coordinate hunting activities better. (On one hand, agricultural damage caused by wildlife amounts to millions of euros per year in Hungary; on the other hand, various requirements and regulations affect hunting.)

Solution: two observation methods were used, each with different data processing requirements and solutions.

  1. Visual observation carried out simultaneously by about a hundred hunting associations across a broad area to assess the number of animals. Their input was collected, and a brief PDF overview of the results specific to each hunting association's area was produced, with diagrams of basic descriptive statistics. The solution was implemented so that data processing and report production will require minimal effort in the future. I consolidated the input data in a Microsoft SQL Server database and then put together a report template in SQL Server Reporting Services according to the specifications given; the template was then filled with each hunting association's own data (see the first sketch after this list).
  2. GPS tracking of individual animals. The objective was to get a realistic view of the animals' movement activity and home range. The statistical methods to be used were identified by the research leader. I loaded the raw data into a Microsoft SQL Server database, carried out some transformations, then executed the statistical calculations in R (adehabitatHR package) and generated simple map outputs (shapefile objects). The R functions were called directly from SQL Server, with the datasets coming from SQL queries (see the second sketch after this list). I imported the results into Excel 3D Maps to create an animation and into QGIS for more sophisticated map visualizations.
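For the first method, a minimal sketch of the kind of parameterised dataset query that can sit behind such a Reporting Services template; the dbo.Observation table, its columns and the @AssociationId parameter are hypothetical:

    -- Hypothetical dataset query for the SSRS template: each hunting association's
    -- report is produced by running the same template with its own @AssociationId.
    SELECT
        SpeciesName,
        COUNT(*)                        AS ObservationCount,
        SUM(AnimalCount)                AS TotalAnimals,
        AVG(CAST(AnimalCount AS float)) AS AvgAnimalsPerObservation
    FROM dbo.Observation
    WHERE AssociationId = @AssociationId
    GROUP BY SpeciesName
    ORDER BY SpeciesName;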
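For the second method, a minimal sketch of one way to call R directly from SQL Server, via sp_execute_external_script (this assumes Machine Learning Services with the sp and adehabitatHR packages installed on the server; the dbo.GpsFix table and its columns are hypothetical):

    -- Hypothetical home-range calculation: the query result is handed to R,
    -- a 95% minimum convex polygon is computed per animal with adehabitatHR::mcp,
    -- and the id/area table is returned to SQL Server as a result set.
    EXEC sp_execute_external_script
        @language = N'R',
        @script = N'
            library(sp)
            library(adehabitatHR)
            InputDataSet$animal_id <- factor(InputDataSet$animal_id)
            # mcp() expects a SpatialPointsDataFrame with a single id column;
            # coordinates are assumed to be in a projected CRS (metres).
            pts <- SpatialPointsDataFrame(
                coords = InputDataSet[, c("x", "y")],
                data   = InputDataSet[, "animal_id", drop = FALSE])
            hr <- mcp(pts, percent = 95)
            OutputDataSet <- as.data.frame(hr)   # columns: id, area (hectares)
        ',
        @input_data_1 = N'SELECT animal_id, x, y FROM dbo.GpsFix'
    WITH RESULT SETS ((animal_id nvarchar(50), area_ha float));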

Performance optimization – Tabular Analysis Services

Task: there is a limited timeframe for a large Tabular Analysis Services data model to be processed after the overnight data warehouse load finishes and before the daily work starts. The data model is about 50 GB in size, with around 100 million rows in the larger tables.

Solution: I established parallelism through table partitioning to make maximal use of the hardware resources in the available time and achieve the best processing time possible. I used TMSL (Tabular Model Scripting Language) statements for the detailed coordination of the Tabular processing. To observe the results of the optimization iterations, I put together a small report showing the processing activities over time and the load on the resources. For this, I used Extended Events in SQL Server Analysis Services to gather log information about what was happening when, and Windows Performance Monitor log data on resource load. The output was aligned and presented in a PowerBI report. Processing sessions were controlled by a PowerShell script.
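As one building block of that report, a minimal sketch of shredding the Analysis Services Extended Events trace file with the database engine, so the events can be loaded into PowerBI next to the Performance Monitor counters; the file path and the TextData field name are assumptions:

    -- Hypothetical query over the SSAS .xel trace: one row per event with its
    -- timestamp, ready to be aligned with the Performance Monitor log in PowerBI.
    SELECT
        x.object_name                                                          AS event_name,
        c.ed.value('(event/@timestamp)[1]', 'datetime2')                       AS event_time_utc,
        c.ed.value('(event/data[@name="TextData"]/value)[1]', 'nvarchar(max)') AS event_text
    FROM sys.fn_xe_file_target_read_file(
             N'D:\Traces\TabularProcessing*.xel', NULL, NULL, NULL) AS x
    CROSS APPLY (SELECT CAST(x.event_data AS xml) AS ed) AS c
    ORDER BY event_time_utc;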

Result: the processing time shrank from over 4 hours to about 1 hour and 15 minutes.

Performance optimization – ETL

Task: importing data from the company's various systems (finance, project timesheeting) into the data warehouse sometimes took 4 hours, which made it impossible to run reports as often as the management required and also put pressure on some source systems, causing them to respond slowly. This was a relatively small database, around 1 GB in size.

Solution: I reviewed the SQL Server Integration Services (SSIS) data import logic and found that it imported all data from the beginning of history on every load; that a large number of columns were imported for each table and never used, accounting for more than half of the throughput; and that the transformation steps were programmed with row-by-row logic (SQL cursors). I changed the process to import incrementally where possible, transfer only the necessary columns, and use set-based processing logic (what SQL was invented for…). I also moved as much of the processing logic as possible from SSIS to TSQL, as it scales much better and makes better use of the hardware resources.
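A minimal sketch of the incremental, set-based pattern described above; the watermark table and the staging/warehouse names are hypothetical:

    -- Hypothetical incremental load: only rows changed since the last successful
    -- load are read from staging, and they are applied in one set-based MERGE
    -- instead of a row-by-row cursor.
    DECLARE @LastLoaded datetime2 =
        (SELECT LastLoadedAt FROM etl.Watermark WHERE TableName = 'FactTimesheet');

    MERGE dwh.FactTimesheet AS tgt
    USING (
        SELECT TimesheetId, EmployeeId, ProjectCode, WorkDate, Hours
        FROM stg.Timesheet
        WHERE ModifiedAt > @LastLoaded        -- incremental filter
    ) AS src
        ON tgt.TimesheetId = src.TimesheetId
    WHEN MATCHED THEN
        UPDATE SET EmployeeId  = src.EmployeeId,
                   ProjectCode = src.ProjectCode,
                   WorkDate    = src.WorkDate,
                   Hours       = src.Hours
    WHEN NOT MATCHED THEN
        INSERT (TimesheetId, EmployeeId, ProjectCode, WorkDate, Hours)
        VALUES (src.TimesheetId, src.EmployeeId, src.ProjectCode, src.WorkDate, src.Hours);

    UPDATE etl.Watermark
    SET LastLoadedAt = SYSDATETIME()
    WHERE TableName = 'FactTimesheet';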

Result: the data load time went down to about 4 minutes, and no further performance impact was noticed on the source systems.