Thorsten Papenbrock (HPI) - Data Profiling with HPI-Valid

One of the major tasks of data engineers is to automatically organize data in a meaningful way, so that datasets become useful for a variety of applications in the “smart world”, artificial intelligence, and analytics of all kinds. One key aspect of data profiling is the discovery of Unique Column Combinations (UCCs). UCCs identify entities in datasets and support different data-organizing activities. Until recently, UCCs could only be discovered for moderately sized datasets, and only with considerable runtime effort. For large datasets, UCCs were not discoverable at all due to runtime and memory limitations. Researchers at HPI have developed a novel UCC discovery algorithm (“HPIValid”) that reduces the runtime of UCC discovery by orders of magnitude and at the same time reduces the memory footprint in comparison to state-of-the-art algorithms. Across different moderately sized datasets, HPIValid ran 5 to 100 times faster while using only 5-20% of the memory on average, with decreasing efficiency on larger datasets.
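To make the notion of a UCC concrete, here is a minimal sketch of a naive, brute-force search for minimal unique column combinations in a tiny table. This is not HPIValid and does not reflect how HPIValid works internally; the function names (`is_unique`, `minimal_uccs`) and the sample rows are hypothetical and chosen only for illustration. The exhaustive search over all column subsets grows exponentially with the number of columns, which hints at why UCC discovery on large datasets is hard.

```python
from itertools import combinations


def is_unique(rows, columns):
    """Return True if no two rows share the same values in `columns`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uccs(rows, column_names):
    """Brute-force search for minimal unique column combinations.

    Hypothetical helper for illustration only; it enumerates column
    subsets by increasing size and is exponential in the column count.
    """
    found = []
    for size in range(1, len(column_names) + 1):
        for combo in combinations(column_names, size):
            # A superset of a known UCC is unique but not minimal: skip it.
            if any(set(ucc) <= set(combo) for ucc in found):
                continue
            if is_unique(rows, combo):
                found.append(combo)
    return found


# Made-up sample data: no single column identifies a row on its own.
rows = [
    {"first": "Ada", "last": "Lovelace", "city": "London"},
    {"first": "Ada", "last": "Byron", "city": "London"},
    {"first": "Alan", "last": "Turing", "city": "London"},
    {"first": "Alan", "last": "Byron", "city": "Potsdam"},
]

result = minimal_uccs(rows, ["first", "last", "city"])
print(result)
```

In this sample, every single column contains duplicates, so the minimal UCCs are pairs of columns; the full three-column combination is skipped because it is a superset of a combination that is already unique.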

Dr. Thorsten Papenbrock is a senior researcher and lecturer at the Hasso Plattner Institute at the University of Potsdam. He received his M.Sc. in IT-Systems Engineering in 2014 and his Ph.D. in Information Systems in 2017. At HPI, he currently heads the distributed computing group. His research focuses on the development of efficient and scalable systems for complex data management and analytics tasks, such as data cleaning, time series analysis, and data profiling. In this context, he is particularly interested in energy-saving and environmentally friendly software solutions.

clean-IT: Towards Sustainable Digital Technologies
