- 00:00Welcome back to the course Big Data Analytics, to the subarea of density-based hierarchical clustering.
- 00:06Last time we covered hierarchical clustering algorithms in general and agglomerative clustering algorithms in particular.
- 00:13Today we're talking about a special algorithm in the area of density-based clustering
- 00:18and how to transfer these methods to the area of hierarchical clustering.
- 00:22Perhaps first of all the motivation again for why we're looking at hierarchical clustering.
- 00:27Here are two examples: on the left side you see hierarchical, nested cluster structures,
- 00:33and on the right side you can see cluster structures with different densities.
- 00:38With the methods presented today, we want to identify all the clusters shown here.
- 00:45So we want to recognize both structures that are nested in each other as well as areas with different densities as clusters.
- 00:53With this, we want to avoid a fundamental disadvantage of density-based clustering methods, namely the up-front selection of parameters.
- 01:04By setting specific parameters such as epsilon and MinPoints in the DBSCAN algorithm,
- 01:10we have implicitly committed ourselves to which of these cluster structures we want to find.
- 01:15If we set epsilon very, very small and thereby select only very dense areas, then we find only the inner, very dense structures,
- 01:26while we cannot recognize the large blue structures in the left picture.
- 01:31If we make epsilon very large, then we find the blue structures, but then we can no longer identify the structures inside.
- 01:40In the right picture, too, we would have to decide with one parameter choice, and that's exactly what density-based hierarchical clustering is about.
- 01:49So with parameter settings that are as flexible as possible, we want to find arbitrary cluster structures, and in particular hierarchically nested structures.
- 02:00So here again for the density-based clustering:
- 02:02In this sense, we want to handle local density adaptively.
- 02:07Where we have very dense objects, here in the example clusters C1 and C2,
- 02:14we want to apply stricter parameters in these two areas, i.e. use a smaller epsilon,
- 02:22while in other areas, here in the entire cluster C, we want to relax the parameter and allow larger areas, i.e. larger neighborhoods,
- 02:34in order to handle both the hierarchies and the different densities at the same time.
- 02:40That's the simple idea,
- 02:43we don't change the algorithm much at this point, but we use the original definition of DBSCAN
- 02:49and just say: we don't have a fixed epsilon, we only have a maximum epsilon that we are willing to tolerate.
- 02:58Between a very small epsilon and this maximum epsilon we can adapt to the local region, so we use the smaller epsilon 2, down here, for C1 and C2,
- 03:13while for the whole cluster we use epsilon 1 and thereby also capture the hierarchical nesting of these three cluster structures.
- 03:22How does it work? Well, the algorithm I'd like you to meet today
- 03:27is called OPTICS, Ordering Points To Identify The Clustering Structure,
- 03:31and the basic idea is simply to be very lazy and always take the shortest route.
- 03:38So we remember where we currently are in our data space, and we try to take steps that are as lazy as possible.
- 03:46So we don't jump to a large epsilon next, but always try to proceed with as small an epsilon as possible,
- 03:56and that is what the algorithm calls a particularly small reachability distance.
- 04:02This idea can easily be illustrated by this picture:
- 04:07Let's start at any point.
- 04:09Of course, if we're here, it doesn't make sense to choose a particularly large epsilon; all we have to do is identify the smallest possible step to the neighboring object,
- 04:23so it's enough for us to choose a small epsilon in this neighborhood, so that we can go from one object to the next
- 04:32and accordingly traverse all these objects with very small steps.
- 04:37As you can see, the step size is slowly increasing, and we're working our way through this cluster with ever-increasing steps.
- 04:47Thus we already fulfill the transitive closure property of DBSCAN's density-reachability, but now without the limitation of a fixed epsilon.
- 04:55We adapt again and again, until we've completed this entire cluster here,
- 05:00and then, as you see, all of a sudden we need a bigger step to the next cluster.
- 05:07In this case, you have to extend the epsilon and allow a larger neighborhood.
- 05:13Once you've arrived on the other side, in the second cluster, you can again proceed in small steps to the following objects.
- 05:25And as you can see, within the two cluster structures you can work with small epsilon neighborhoods,
- 05:34while between the two cluster structures you must choose a significantly larger epsilon.
- 05:39And this is exactly what the algorithm tries to do intelligently, by sorting the points, hence the name "Ordering Points":
- 05:48it maintains a sorted processing sequence, so that the shortest jump from a core object to the next neighboring object can always be performed.
- 05:58The output of the algorithm is also based on this sorted list of objects.
- 06:04At the end we get back a sorted list of points, which also visualizes the cluster structures,
- 06:13with two dimensions: the so-called core distance and the so-called reachability distance.
- 06:18First of all, I would like to introduce these two to you.
- 06:20You already know core objects from the DBSCAN approach; here, the core distance
- 06:28is the smallest possible distance such that the object O becomes a core object.
- 06:34In DBSCAN we previously had the so-called core property, by which an object, seen in binary terms, either was a core object or was not.
- 06:43Here, the question is how small I can choose my epsilon so that the object we're currently looking at, namely the object O, becomes a core object.
- 06:53We still set the parameter MinPoints and a parameter epsilon, which however only bounds the maximum extent of the epsilon.
- 07:03So we are adaptive: we search for the smallest possible radius such that this one object becomes a core object.
- 07:12And if there is no such radius smaller than epsilon, then we say the core distance is undefined and mark this object's core distance with a question mark.
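Written compactly, following the standard OPTICS notation from the literature (my transcription, not the slide verbatim):

```latex
\text{core-dist}_{\varepsilon,\mathit{MinPts}}(o) =
\begin{cases}
  \text{undefined} & \text{if } |N_\varepsilon(o)| < \mathit{MinPts},\\
  \mathit{MinPts}\text{-distance}(o) & \text{otherwise,}
\end{cases}
```

where N_epsilon(o) is the epsilon-neighborhood of o and MinPts-distance(o) is the distance from o to its MinPts-th nearest neighbor.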
- 07:22You can imagine the reachability distance like this: it is the distance I need to get from a core object O to another object P.
- 07:33The formula below looks complicated, but it says no more than the following:
- 07:38If the object P - now shown here on the right side - lies within our core-distance radius, i.e. within the red sphere,
- 07:50then I simply choose the core distance of O; I can get from O to P by setting my epsilon radius to the core distance.
- 07:59If the object is further away, as here in the example the object Q, then the red radius is not enough for us,
- 08:08so we have to take the actual distance between O and Q as the reachability distance.
- 08:16So think of it visually: I always take the smallest possible step.
- 08:20For P, one small step is enough for me to get from O to P,
- 08:26but the distance between O and Q is so large that I need to expand my epsilon, and accordingly the reachability distance here must be chosen larger.
- 08:35And the third case we see in the definition: if the distance between O and P is greater than our given maximum epsilon,
- 08:45then we say the reachability distance is undefined.
- 08:49These are the three cases we consider for the reachability distance.
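As a formula, again in the usual OPTICS notation (my transcription, not the slide verbatim):

```latex
\text{reach-dist}_{\varepsilon,\mathit{MinPts}}(p, o) =
\begin{cases}
  \text{undefined} & \text{if } |N_\varepsilon(o)| < \mathit{MinPts},\\
  \max\bigl(\text{core-dist}_{\varepsilon,\mathit{MinPts}}(o),\ \mathit{dist}(o, p)\bigr) & \text{otherwise.}
\end{cases}
```

The max expression covers the first two spoken cases: if p lies within the core-distance radius of o, then dist(o, p) is at most the core distance and the core distance is returned; if p lies further out, like Q, the actual distance dist(o, p) wins.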
- 08:53Let's now take a closer look at the algorithm and how it exploits these two definitions.
- 08:58Here is an overview of the algorithm.
- 09:03I don't want to go through the individual steps of the algorithm in this pseudocode now; you can look at them again separately.
- 09:10The important thing for me is that you understand the concept behind it.
- 09:14What does this algorithm do?
- 09:15The algorithm extracts individual objects from a database and inserts them into a sorted control list.
- 09:25That's exactly the order I just showed you, always with the principle: first the neighboring objects, and later the more distant objects.
- 09:35Thus the control list is sorted according to the reachability distance.
- 09:41The objects that have the smallest reachability distance are at the front of my control list.
- 09:48So the objects that lie closest to our current cluster structures
- 09:52are considered first and then transferred into the so-called cluster file;
- 10:00this is our final clustering result, also sorted according to the processing sequence.
- 10:08And here you can see the basic idea:
- 10:11We extract the neighboring objects from the database, using the neighborhood query as we know it from DBSCAN,
- 10:18sort these objects by reachability distance, and always select the object with the smallest reachability distance.
- 10:26These objects are then transferred to the cluster file; a small sketch of this loop follows below.
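To make the pseudocode concrete, here is a minimal Python sketch of this main loop. The names (region_query, eps_max, min_pts) and the heap-based control list are my own illustration of the concept, not the lecture's code, and the brute-force neighborhood query is kept deliberately simple:

```python
import heapq
import math

def region_query(data, i, eps):
    """Indices of all points within distance eps of point i (its eps-neighborhood)."""
    return [j for j, p in enumerate(data) if math.dist(data[i], p) <= eps]

def core_distance(data, i, neighbors, min_pts):
    """Smallest radius that makes point i a core object, or None (undefined)."""
    if len(neighbors) < min_pts:
        return None
    dists = sorted(math.dist(data[i], data[j]) for j in neighbors)
    return dists[min_pts - 1]  # distance to the min_pts-th nearest neighbor

def optics(data, eps_max, min_pts):
    reach = [None] * len(data)      # reachability distances (None = undefined)
    processed = [False] * len(data)
    ordering = []                   # the sorted output: the processing sequence

    def expand(i, seeds):
        """Push i's unprocessed neighbors into the control list (a min-heap)."""
        neighbors = region_query(data, i, eps_max)
        cd = core_distance(data, i, neighbors, min_pts)
        if cd is None:              # i is no core object within eps_max
            return
        for j in neighbors:
            if processed[j]:
                continue
            r = max(cd, math.dist(data[i], data[j]))
            if reach[j] is None or r < reach[j]:
                reach[j] = r
                heapq.heappush(seeds, (r, j))

    for start in range(len(data)):
        if processed[start]:
            continue
        processed[start] = True
        ordering.append(start)      # reach[start] stays undefined, like object 1
        seeds = []
        expand(start, seeds)
        while seeds:                # always take the smallest reachability next
            r, i = heapq.heappop(seeds)
            if processed[i]:
                continue            # stale heap entry, already handled
            processed[i] = True
            ordering.append(i)
            expand(i, seeds)
    return ordering, reach
```

Calling, for example, optics(points, eps_max=5.0, min_pts=3) on a small list of 2D tuples returns the processing sequence and the reachability values; plotting the reachability values in that order yields exactly the kind of bar diagram discussed next.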
- 10:30Let's take a look at the whole algorithm using an example. Here is a more complex dataset:
- 10:38On the right side you can see the data set in two-dimensional space, and on the left side you can see the final result of the OPTICS algorithm.
- 10:49The important thing here is to picture the processing sequence.
- 10:53On the right we can see, numbered, how the objects are processed; we start with an arbitrary object, and object number 1 is considered first.
- 11:04Since we don't have any other objects in the control list yet, the reachability distance is of course undefined for this object,
- 11:13because we don't yet have a starting point from which we could reach object 1.
- 11:18Thus the first bar here, for object 1, is undefined.
- 11:24The core distance, however, can be determined: you choose the smallest radius here such that this object 1 becomes a core object.
- 11:33We see that object 1 is very far away from other objects, so we need a correspondingly large radius - I'll just sketch it very briefly here -
- 11:44so that this one object becomes a core object.
- 11:49After we've determined the core distance and the reachability distance for this object, we also have the neighbors of this object.
- 12:01These are objects 2, 3, 4, and these objects now go into our control list, sorted by reachability distance.
- 12:10That means: which object is closest to object 1, so that we can take the smallest possible step to the nearest object?
- 12:21Visually, that is obviously object 2 here, and we determine the reachability distance and the core distance for this object 2.
- 12:31We now see in our diagram that object 2 has a reachability distance that is equal to the core distance from before, that is, this value here,
- 12:41and that's simply because object 2 lies within the core-distance radius of object 1, exactly according to the definition we just saw.
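With made-up numbers, purely for illustration (these values do not come from the slide): suppose core-dist(o1) = 2.0, dist(o1, o2) = 1.5 and dist(o1, o3) = 4.0. The definition then gives

```latex
\text{reach-dist}(o_2, o_1) = \max(2.0,\ 1.5) = 2.0, \qquad
\text{reach-dist}(o_3, o_1) = \max(2.0,\ 4.0) = 4.0,
```

so for the nearby object the core distance dominates, while for the distant object the actual distance is used.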
- 12:51Now we pick up more objects, and we consider object 3 next.
- 12:58This also has a very high reachability distance, because we have to jump from object 1 to object 3 over a fairly large distance.
- 13:09But we see - and this is the important thing in this diagram - that the core distance for object 3 is very, very small.
- 13:26Why is that the case?
- 13:27Because object 3 is located in a very dense area where there are a lot of objects.
- 13:34Here it is sufficient to select a small radius and, accordingly, to work through this cluster in small steps.
- 13:42So we see that in this cluster a small radius is sufficient for us to process all these objects according to the classic DBSCAN definition.
- 13:53And this is exactly the property we now see in this plateau here.
- 13:58All objects within this cluster now lie in this area, with very small reachability distances and very small core distances.
- 14:07So this is where we collect the objects of cluster C1; all these objects we collect step by step,
- 14:16and at some point we have picked up all these objects and must now find the nearest object again.
- 14:23This is now object 16 in our example.
- 14:26So now we jump out of this cluster to object 16 and already see visually in the result:
- 14:33object 16 has a high reachability distance and, because it is so far away from all other objects, also a high core distance.
- 14:41We keep collecting and get to object 17.
- 14:44This also has a high reachability distance and, because it is so far away, also a high core distance.
- 14:50Only at object 18 do we jump into a cluster again and see the same effect as before:
- 14:58a high reachability distance, which is the indicator for the start of a cluster, and a low core distance.
- 15:05And here, too, we pick up all the objects, which corresponds to cluster 2.
- 15:09The algorithm goes on like this, and we see that we keep jumping into regions where clustered objects are located; the third valley here is cluster 3.
- 15:25And at the end, we see a number of objects that no longer form dense clusters and accordingly have high core distances and high reachability distances,
- 15:37and that is accordingly the end of the algorithm.
- 15:40The algorithm has now collected all objects, and the core distances and reachability distances are output accordingly.
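If you want to reproduce such a reachability plot yourself, scikit-learn ships an OPTICS implementation; the following sketch uses a made-up blob dataset and illustrative parameter values, not the lecture's data:

```python
# A sketch of producing a reachability plot with scikit-learn's OPTICS.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Three dense blobs plus some scattered points, similar in spirit to the example.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
rng = np.random.default_rng(0)
X = np.vstack([X, rng.uniform(X.min(), X.max(), size=(20, 2))])

optics = OPTICS(min_samples=5)  # max_eps defaults to infinity
optics.fit(X)

# Reachability distances in processing order form the reachability plot.
reach = optics.reachability_[optics.ordering_]
reach[np.isinf(reach)] = np.nan  # undefined reachability (the "?" in the lecture)

plt.bar(range(len(reach)), reach, width=1.0)
plt.xlabel("processing order")
plt.ylabel("reachability distance")
plt.show()
```

The valleys in the resulting bar chart correspond to the dense blobs, exactly as in the diagrams discussed above.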
- 15:47From this example alone you can see how flexibly we can now adjust the epsilon.
- 15:53We have not yet found hierarchical cluster structures here.
- 15:56Apart from the entire database forming one large cluster, the point so far was only to convey the intuition of how the algorithm works.
- 16:07But we can also apply the whole algorithm to hierarchically nested structures and then see these hierarchies visually in the result again.
- 16:16And you can see it here.
- 16:18In the left picture you can see one large cluster containing three dense clusters.
- 16:25These three dense clusters - marked here - are exactly the three valleys you see in this picture.
- 16:34The entire cluster outlined here then corresponds to the whole valley, as shown above.
- 16:44At this point you can also see how to recover the hierarchical cluster structures with a classic DBSCAN approach.
- 16:54If we now decide in this picture to set the epsilon of DBSCAN to this value, we find three cluster structures.
- 17:03With epsilon 1 we find three cluster structures, with epsilon 2 we find one large cluster structure.
- 17:11And that makes us flexible and allows us to extract, in a post-processing step, exactly the cluster structures we want, simply through different settings of epsilon; a small sketch of this follows below.
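As a sketch of this post-processing idea: scikit-learn offers cluster_optics_dbscan, which cuts a finished OPTICS result at a given epsilon. It reuses the fitted `optics` estimator from the sketch above, and the two eps values are illustrative stand-ins for the epsilon 1 and epsilon 2 shown on the slide:

```python
# Cutting one OPTICS result at two different epsilon values.
from sklearn.cluster import cluster_optics_dbscan

# Small cut: separates the dense sub-clusters (like epsilon 1).
labels_fine = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=0.5,
)

# Large cut: merges them into one big cluster (like epsilon 2).
labels_coarse = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=2.0,
)
```

The key point is that the expensive OPTICS run happens once; both clusterings are then cheap cuts through the same reachability plot.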
- 17:22The right picture is a little more complex; here you can see many different structures.
- 17:28Maybe we should start with the reachability plot, so that we can derive the interpretation from it.
- 17:34First we see a valley structure here, which corresponds to a first cluster.
- 17:41This cluster is of course very, very sparsely populated and accordingly has very high reachability distances,
- 17:50while further back in this diagram - here - there are cluster structures that are very, very densely packed.
- 17:59And these can be seen in this area here.
- 18:03There are a few objects that also belong to this cluster structure; you can see individual objects next to it, I'll mark them briefly.
- 18:15These also belong to the large cluster structure, so if I adjust the epsilon here accordingly, I get the entire cluster.
- 18:23And every other valley you see here corresponds exactly to one cluster structure.
- 18:30But one can also extract several cluster structures at once, as shown here now.
- 18:37And at this level we would now get three small clusters and a slightly larger cluster containing all the objects in this area.
- 18:53Here you can really see how to extract the interpretation of hierarchical clusters from the reachability plot; a sketch of an automated extraction follows below.
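One way to automate this valley extraction, rather than reading the plot by eye, is scikit-learn's xi method, which detects steep dips in the reachability plot. This is a different extraction heuristic than cutting at a fixed epsilon; the sketch again reuses the fitted `optics` estimator from above, and xi=0.05 is simply the library default:

```python
# Automatic, steepness-based valley extraction from the reachability plot.
from sklearn.cluster import cluster_optics_xi

labels, clusters = cluster_optics_xi(
    reachability=optics.reachability_,
    predecessor=optics.predecessor_,
    ordering=optics.ordering_,
    min_samples=5,
    xi=0.05,
)
# `clusters` holds [start, end] positions within the ordering; nested
# intervals correspond to the nested valleys of the reachability plot.
```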
- 19:02Finally, perhaps a brief summary of the methods of hierarchical clustering:
- 19:12The obvious benefit is that we no longer have to specify the number of clusters in advance.
- 19:20Both agglomerative and density-based hierarchical clustering algorithms enable you to find cluster structures without a prior specification of the number of clusters.
- 19:33They are very robust in terms of parameters, especially regarding the choice of epsilon in OPTICS,
- 19:41while the earlier approaches to setting epsilon in DBSCAN were rather heuristic in nature.
- 19:49The big advantage, which is certainly needed in many applications, is that one has the complete hierarchy of all clusters
- 19:57and can decide at the end which cluster structures to process, and how.
- 20:02It is not necessary to decide in advance how dense the structures must be.
- 20:07Visualization, both with the dendrograms of agglomerative clustering and with the reachability plots of OPTICS,
- 20:14is very intuitive and can also be used to illustrate these cluster structures.
- 20:20A - let's call it - partitioning, flat clustering can later be extracted from such dendrograms or from the reachability plots.
- 20:31Of course, this is at the same time a disadvantage, because the user has to extract this final clustering themselves.
- 20:38And another disadvantage you have to be aware of is the scalability of such methods.
- 20:43While previous methods such as k-means or DBSCAN scaled very efficiently with the database size and needed only very few accesses to our data,
- 20:57here we see quadratic to sometimes cubic running times for the standard methods of agglomerative clustering,
- 21:06while OPTICS without index support likewise scales quadratically.
- 21:11So you have to be aware that the benefits we see up here on this slide are of course accompanied by corresponding efficiency disadvantages.
- 21:20Thank you so much, and I'll see you in the next class.