- 00:00Welcome back to the course Big Data Analytics, to the subarea of density-based hierarchical clustering.
- 00:06Last time we covered hierarchical clustering algorithms in general and agglomerative clustering algorithms in particular.
- 00:13Today we're talking about a special algorithm in the area of density-based clustering
- 00:18and how to transfer these methods to the area of hierarchical clustering.
- 00:22Perhaps first of all the motivation again for why we're looking at hierarchical clustering.
- 00:27Here are two examples: on the left side you see hierarchical, nested cluster structures,
- 00:33and on the right side you can see cluster structures with different densities.
- 00:38With the methods presented today, we want to identify all the clusters shown here.
- 00:45So we want to recognize both structures that are nested in each other as well as areas with different densities as clusters.
- 00:53With this, we want to avoid a fundamental disadvantage of density-based clustering methods, namely the up-front selection of parameters.
- 01:04By setting specific parameters such as epsilon and MinPoints in the DBSCAN algorithm,
- 01:10we have implicitly committed ourselves to which of these cluster structures we want to find.
- 01:15If we set epsilon very, very small and thereby select only very dense areas, then we find only the inner, very dense structures,
- 01:26while we cannot recognize the large blue structures in the left picture.
- 01:31If we make epsilon very large, then we find the blue structures, but then we can no longer identify the structures inside.
- 01:40In the right picture, too, we would have to decide with one parameter choice, and that's exactly what density-based hierarchical clustering is about.
- 01:49So with parameter settings that are as flexible as possible, we want to find arbitrary cluster structures, and in particular hierarchically nested structures.
- 02:00So here again for the density-based clustering:
- 02:02In this sense, we want to handle local density adaptively.
- 02:07Where we have very dense objects, here in the example clusters C1 and C2,
- 02:14we want to apply stricter parameters in these two areas, i.e. use a smaller epsilon,
- 02:22while in other areas, here in the entire cluster C, we want to relax the parameter and allow larger areas, i.e. larger neighborhoods,
- 02:34in order to handle both the hierarchies and the different densities at the same time.
- 02:40That's the simple idea,
- 02:43we don't change the algorithm much at this point, but we use the original definition of DBSCAN
- 02:49and just say: we don't have a fixed epsilon, we only have a maximum epsilon that we are willing to tolerate.
- 02:58Between a very small epsilon and this maximum epsilon we can adapt to the local region, so we use the smaller epsilon 2, down here, for C1 and C2,
- 03:13while for the whole cluster we use epsilon 1 and thereby also capture the hierarchical nesting of these three cluster structures.
- 03:22How does it work? Well, the algorithm I'd like you to meet today
- 03:27is called OPTICS, Ordering Points To Identify The Clustering Structure,
- 03:31and the basic idea is simply to be very lazy and always take the shortest route.
- 03:38So we remember where we currently are in our data space, and we try to take steps that are as lazy as possible.
- 03:46So we don't jump to a large epsilon next, but always try to proceed with as small an epsilon as possible,
- 03:56and that is what the algorithm calls a particularly small reachability distance.
- 04:02This idea can easily be illustrated by this picture:
- 04:07Let's start at any point.
- 04:09Of course, if we're here, it doesn't make sense to choose a particularly large epsilon; all we have to do is identify the smallest possible step to the neighboring object,
- 04:23so it's enough for us to choose a small epsilon in this neighborhood, so that we can go from one object to the next
- 04:32and accordingly traverse all these objects with very small steps.
- 04:37As you can see, the step size is slowly increasing, and we're working our way through this cluster with ever-increasing steps.
- 04:47Thus we already fulfill the transitive closure property of DBSCAN's density-reachability, but now without the limitation of a fixed epsilon.
- 04:55We adapt again and again, until we've completed this entire cluster here,
- 05:00and then, as you see, all of a sudden we need a bigger step to the next cluster.
- 05:07In this case, you have to extend the epsilon and allow a larger neighborhood.
- 05:13Once you've arrived on the other side, in the second cluster, you can again proceed in small steps to the following objects.
- 05:25And as you can see, within the two cluster structures you can work with small epsilon neighborhoods,
- 05:34while between the two cluster structures you must choose a significantly larger epsilon.
- 05:39And this is exactly what the algorithm tries to do intelligently, by sorting the points, hence the name "Ordering Points":
- 05:48it maintains a sorted processing sequence, so that the shortest jump from a core object to the next neighboring object can always be performed.
- 05:58The output of the algorithm is also based on this sorted list of objects.
- 06:04At the end we get back a sorted list of points, which also visualizes the cluster structures,
- 06:13with two dimensions: the so-called core distance and the so-called reachability distance.
- 06:18First of all, I would like to introduce these two to you.
- 06:20You already know core objects from the DBSCAN approach; here, the core distance
- 06:28is the smallest possible distance such that the object O becomes a core object.
- 06:34In DBSCAN we previously had the so-called core property, by which an object, seen in binary terms, either was a core object or was not.
- 06:43Here, the question is how small I can choose my epsilon so that the object we're currently looking at, namely the object O, becomes a core object.
- 06:53We still set the parameter MinPoints and a parameter epsilon, which however only bounds the maximum extent of the epsilon.
- 07:03So we are adaptive: we search for the smallest possible radius such that this one object becomes a core object.
- 07:12And if there is no such radius smaller than epsilon, then we say the core distance is undefined and mark this object's core distance with a question mark.
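Written compactly, following the standard OPTICS notation from the literature (my transcription, not the slide verbatim):

```latex
\text{core-dist}_{\varepsilon,\mathit{MinPts}}(o) =
\begin{cases}
  \text{undefined} & \text{if } |N_\varepsilon(o)| < \mathit{MinPts},\\
  \mathit{MinPts}\text{-distance}(o) & \text{otherwise,}
\end{cases}
```

where N_epsilon(o) is the epsilon-neighborhood of o and MinPts-distance(o) is the distance from o to its MinPts-th nearest neighbor.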
- 07:22You can imagine the reachability distance like this: it is the distance I need to get from a core object O to another object P.
- 07:33The formula below looks complicated, but it says no more than the following:
- 07:38If the object P - now shown here on the right side - lies within our core-distance radius, i.e. within the red sphere,
- 07:50then I simply choose the core distance of O; I can get from O to P by setting my epsilon radius to the core distance.
- 07:59If the object is further away, as here in the example the object Q, then the red radius is not enough for us,
- 08:08so we have to take the actual distance between O and Q as the reachability distance.
- 08:16So think of it visually: I always take the smallest possible step.
- 08:20For P, one small step is enough for me to get from O to P,
- 08:26but the distance between O and Q is so large that I need to expand my epsilon, and accordingly the reachability distance here must be chosen larger.
- 08:35And the third case we see in the definition: if the distance between O and P is greater than our given maximum epsilon,
- 08:45then we say the reachability distance is undefined.
- 08:49These are the three cases we consider for the reachability distance.
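As a formula, again in the usual OPTICS notation (my transcription, not the slide verbatim):

```latex
\text{reach-dist}_{\varepsilon,\mathit{MinPts}}(p, o) =
\begin{cases}
  \text{undefined} & \text{if } |N_\varepsilon(o)| < \mathit{MinPts},\\
  \max\bigl(\text{core-dist}_{\varepsilon,\mathit{MinPts}}(o),\ \mathit{dist}(o, p)\bigr) & \text{otherwise.}
\end{cases}
```

The max expression covers the first two spoken cases: if p lies within the core-distance radius of o, then dist(o, p) is at most the core distance and the core distance is returned; if p lies further out, like Q, the actual distance dist(o, p) wins.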
- 08:53Let's now take a closer look at the algorithm and how it exploits these two definitions.
- 08:58Here is an overview of the algorithm.
- 09:03I don't want to go through the individual steps of the algorithm in this pseudocode now; you can look at them again separately.
- 09:10The important thing for me is that you understand the concept behind it.
- 09:14What does this algorithm do?
- 09:15The algorithm extracts individual objects from a database and inserts them into a sorted control list.
- 09:25That's exactly the order I just showed you, always with the principle: first the neighboring objects, and later the more distant objects.
- 09:35Thus the control list is sorted according to the reachability distance.
- 09:41The objects that have the smallest reachability distance are at the front of my control list.
- 09:48So the objects that lie closest to our current cluster structures
- 09:52are considered first and then transferred into the so-called cluster file;
- 10:00this is our final clustering result, also sorted according to the processing sequence.
- 10:08And here you can see the basic idea:
- 10:11We extract the neighboring objects from the database, using the neighborhood query as we know it from DBSCAN,
- 10:18sort these objects by reachability distance, and always select the object with the smallest reachability distance.
- 10:26These objects are then transferred to the cluster file; a small sketch of this loop follows below.
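To make the pseudocode concrete, here is a minimal Python sketch of this main loop. The names (region_query, eps_max, min_pts) and the heap-based control list are my own illustration of the concept, not the lecture's code, and the brute-force neighborhood query is kept deliberately simple:

```python
import heapq
import math

def region_query(data, i, eps):
    """Indices of all points within distance eps of point i (its eps-neighborhood)."""
    return [j for j, p in enumerate(data) if math.dist(data[i], p) <= eps]

def core_distance(data, i, neighbors, min_pts):
    """Smallest radius that makes point i a core object, or None (undefined)."""
    if len(neighbors) < min_pts:
        return None
    dists = sorted(math.dist(data[i], data[j]) for j in neighbors)
    return dists[min_pts - 1]  # distance to the min_pts-th nearest neighbor

def optics(data, eps_max, min_pts):
    reach = [None] * len(data)      # reachability distances (None = undefined)
    processed = [False] * len(data)
    ordering = []                   # the sorted output: the processing sequence

    def expand(i, seeds):
        """Push i's unprocessed neighbors into the control list (a min-heap)."""
        neighbors = region_query(data, i, eps_max)
        cd = core_distance(data, i, neighbors, min_pts)
        if cd is None:              # i is no core object within eps_max
            return
        for j in neighbors:
            if processed[j]:
                continue
            r = max(cd, math.dist(data[i], data[j]))
            if reach[j] is None or r < reach[j]:
                reach[j] = r
                heapq.heappush(seeds, (r, j))

    for start in range(len(data)):
        if processed[start]:
            continue
        processed[start] = True
        ordering.append(start)      # reach[start] stays undefined, like object 1
        seeds = []
        expand(start, seeds)
        while seeds:                # always take the smallest reachability next
            r, i = heapq.heappop(seeds)
            if processed[i]:
                continue            # stale heap entry, already handled
            processed[i] = True
            ordering.append(i)
            expand(i, seeds)
    return ordering, reach
```

Calling, for example, optics(points, eps_max=5.0, min_pts=3) on a small list of 2D tuples returns the processing sequence and the reachability values; plotting the reachability values in that order yields exactly the kind of bar diagram discussed next.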
- 10:30Let's take a look at the whole algorithm using an example. Here is a more complex dataset:
- 10:38On the right side you can see the data set in two-dimensional space, and on the left side you can see the final result of the OPTICS algorithm.
- 10:49The important thing here is to picture the processing sequence.
- 10:53On the right we can see, numbered, how the objects are processed; we start with an arbitrary object, and object number 1 is considered first.
- 11:04Since we don't have any other objects in the control list yet, the reachability distance is of course undefined for this object,
- 11:13because we don't yet have a starting point from which we could reach object 1.
- 11:18Thus the first bar here, for object 1, is undefined.
- 11:24The core distance, however, can be determined: you choose the smallest radius here such that this object 1 becomes a core object.
- 11:33We see that object 1 is very far away from other objects, so we need a correspondingly large radius - I'll just sketch it very briefly here -
- 11:44so that this one object becomes a core object.
- 11:49After we've determined the core distance and the reachability distance for this object, we also have the neighbors of this object.
- 12:01These are objects 2, 3, 4, and these objects now go into our control list, sorted by reachability distance.
- 12:10That means: which object is closest to object 1, so that we can take the smallest possible step to the nearest object?
- 12:21Visually, that is obviously object 2 here, and we determine the reachability distance and the core distance for this object 2.
- 12:31We now see in our diagram that object 2 has a reachability distance that is equal to the core distance from before, that is, this value here,
- 12:41and that's simply because object 2 lies within the core-distance radius of object 1, exactly according to the definition we just saw.
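With made-up numbers, purely for illustration (these values do not come from the slide): suppose core-dist(o1) = 2.0, dist(o1, o2) = 1.5 and dist(o1, o3) = 4.0. The definition then gives

```latex
\text{reach-dist}(o_2, o_1) = \max(2.0,\ 1.5) = 2.0, \qquad
\text{reach-dist}(o_3, o_1) = \max(2.0,\ 4.0) = 4.0,
```

so for the nearby object the core distance dominates, while for the distant object the actual distance is used.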
- 12:51Now we pick up more objects, and we consider object 3 next.
- 12:58This also has a very high reachability distance, because we have to jump from object 1 to object 3 over a fairly large distance.
- 13:09But we see - and this is the important thing in this diagram - that the core distance for object 3 is very, very small.
- 13:26Why is that the case?
- 13:27Because object 3 is located in a very dense area where there are a lot of objects.
- 13:34Here it is sufficient to select a small radius and, accordingly, to work through this cluster in small steps.
- 13:42So we see that in this cluster a small radius is sufficient for us to process all these objects according to the classic DBSCAN definition.
- 13:53And this is exactly the property we now see in this plateau here.
- 13:58All objects within this cluster now lie in this area, with very small reachability distances and very small core distances.
- 14:07So this is where we collect the objects of cluster C1; all these objects we collect step by step,
- 14:16and at some point we have picked up all these objects and must now find the nearest object again.
- 14:23This is now object 16 in our example.
- 14:26So now we jump out of this cluster to object 16 and already see visually in the result:
- 14:33object 16 has a high reachability distance and, because it is so far away from all other objects, also a high core distance.
- 14:41We keep collecting and get to object 17.
- 14:44This also has a high reachability distance and, because it is so far away, also a high core distance.
- 14:50Only at object 18 do we jump into a cluster again and see the same effect as before:
- 14:58a high reachability distance, which is the indicator for the start of a cluster, and a low core distance.
- 15:05And here, too, we pick up all the objects, which corresponds to cluster 2.
- 15:09The algorithm goes on like this, and we see that we keep jumping into regions where clustered objects are located; the third valley here is cluster 3.
- 15:25And at the end, we see a number of objects that no longer form dense clusters and accordingly have high core distances and high reachability distances,
- 15:37and that is accordingly the end of the algorithm.
- 15:40The algorithm has now collected all objects, and the core distances and reachability distances are output accordingly.
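If you want to reproduce such a reachability plot yourself, scikit-learn ships an OPTICS implementation; the following sketch uses a made-up blob dataset and illustrative parameter values, not the lecture's data:

```python
# A sketch of producing a reachability plot with scikit-learn's OPTICS.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Three dense blobs plus some scattered points, similar in spirit to the example.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
rng = np.random.default_rng(0)
X = np.vstack([X, rng.uniform(X.min(), X.max(), size=(20, 2))])

optics = OPTICS(min_samples=5)  # max_eps defaults to infinity
optics.fit(X)

# Reachability distances in processing order form the reachability plot.
reach = optics.reachability_[optics.ordering_]
reach[np.isinf(reach)] = np.nan  # undefined reachability (the "?" in the lecture)

plt.bar(range(len(reach)), reach, width=1.0)
plt.xlabel("processing order")
plt.ylabel("reachability distance")
plt.show()
```

The valleys in the resulting bar chart correspond to the dense blobs, exactly as in the diagrams discussed above.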
- 15:47From this example alone you can see how flexibly we can now adjust the epsilon.
- 15:53We have not yet found hierarchical cluster structures here.
- 15:56Apart from the entire database forming one large cluster, the point so far was only to convey the intuition of how the algorithm works.
- 16:07But we can also apply the whole algorithm to hierarchically nested structures and then see these hierarchies visually in the result again.
- 16:16And you can see it here.
- 16:18In the left picture you can see one large cluster containing three dense clusters.
- 16:25These three dense clusters - marked here - are exactly the three valleys you see in this picture.
- 16:34The entire cluster outlined here then corresponds to the whole valley, as shown above.
- 16:44At this point you can also see how to recover the hierarchical cluster structures with a classic DBSCAN approach.
- 16:54If we now decide in this picture to set the epsilon of DBSCAN to this value, we find three cluster structures.
- 17:03With epsilon 1 we find three cluster structures, with epsilon 2 we find one large cluster structure.
- 17:11And that makes us flexible and allows us to extract, in a post-processing step, exactly the cluster structures we want, simply through different settings of epsilon; a small sketch of this follows below.
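As a sketch of this post-processing idea: scikit-learn offers cluster_optics_dbscan, which cuts a finished OPTICS result at a given epsilon. It reuses the fitted `optics` estimator from the sketch above, and the two eps values are illustrative stand-ins for the epsilon 1 and epsilon 2 shown on the slide:

```python
# Cutting one OPTICS result at two different epsilon values.
from sklearn.cluster import cluster_optics_dbscan

# Small cut: separates the dense sub-clusters (like epsilon 1).
labels_fine = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=0.5,
)

# Large cut: merges them into one big cluster (like epsilon 2).
labels_coarse = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=2.0,
)
```

The key point is that the expensive OPTICS run happens once; both clusterings are then cheap cuts through the same reachability plot.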
- 17:22The right picture is a little more complex; here you can see many different structures.
- 17:28Maybe we should start with the reachability plot, so that we can derive the interpretation from it.
- 17:34First we see a valley structure here, which corresponds to a first cluster.
- 17:41This cluster is of course very, very sparsely populated and accordingly has very high reachability distances,
- 17:50while further back in this diagram - here - there are cluster structures that are very, very densely packed.
- 17:59And these can be seen in this area here.
- 18:03There are a few objects that also belong to this cluster structure; you can see individual objects next to it, I'll mark them briefly.
- 18:15These also belong to the large cluster structure, so if I adjust the epsilon here accordingly, I get the entire cluster.
- 18:23And every other valley you see here corresponds exactly to one cluster structure.
- 18:30But one can also extract several cluster structures at once, as shown here now.
- 18:37And at this level we would now get three small clusters and a slightly larger cluster containing all the objects in this area.
- 18:53Here you can really see how to extract the interpretation of hierarchical clusters from the reachability plot; a sketch of an automated extraction follows below.
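One way to automate this valley extraction, rather than reading the plot by eye, is scikit-learn's xi method, which detects steep dips in the reachability plot. This is a different extraction heuristic than cutting at a fixed epsilon; the sketch again reuses the fitted `optics` estimator from above, and xi=0.05 is simply the library default:

```python
# Automatic, steepness-based valley extraction from the reachability plot.
from sklearn.cluster import cluster_optics_xi

labels, clusters = cluster_optics_xi(
    reachability=optics.reachability_,
    predecessor=optics.predecessor_,
    ordering=optics.ordering_,
    min_samples=5,
    xi=0.05,
)
# `clusters` holds [start, end] positions within the ordering; nested
# intervals correspond to the nested valleys of the reachability plot.
```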
- 19:02Finally, perhaps a brief summary of the methods of hierarchical clustering:
- 19:12The obvious benefit is that we no longer have to specify the number of clusters in advance.
- 19:20Both agglomerative and density-based hierarchical clustering algorithms enable you to find cluster structures without a prior specification of the number of clusters.
- 19:33They are very robust in terms of parameters, especially regarding the choice of epsilon in OPTICS,
- 19:41while the earlier approaches to setting epsilon in DBSCAN were rather heuristic in nature.
- 19:49The big advantage, which is certainly needed in many applications, is that one has the complete hierarchy of all clusters
- 19:57and can decide at the end which cluster structures to process, and how.
- 20:02It is not necessary to decide in advance how dense the structures must be.
- 20:07Visualization, both with the dendrograms of agglomerative clustering and with the reachability plots of OPTICS,
- 20:14is very intuitive and can also be used to illustrate these cluster structures.
- 20:20A - let's call it - partitioning, flat clustering can later be extracted from such dendrograms or from the reachability plots.
- 20:31Of course, this is at the same time a disadvantage, because the user has to extract this final clustering themselves.
- 20:38And another disadvantage you have to be aware of is the scalability of such methods.
- 20:43While previous methods such as k-means or DBSCAN scaled very efficiently with the database size and needed only very few accesses to our data,
- 20:57here we see quadratic to sometimes cubic running times for the standard methods of agglomerative clustering,
- 21:06while OPTICS without index support likewise scales quadratically.
- 21:11So you have to be aware that the benefits we see up here on this slide are of course accompanied by corresponding efficiency disadvantages.
- 21:20Thank you so much, and I'll see you in the next class.