- 00:00Welcome back to the course Big Data Analytics.
- 00:02We are still in the clustering chapter, and today I would like to introduce another clustering paradigm,
- 00:09the density-based clustering paradigm.
- 00:12We previously discussed the partitioning clustering algorithms,
- 00:17and now the question is: what are the disadvantages of partitioning algorithms, and how can a new paradigm solve them?
- 00:25Let's start with the motivation:
- 00:27What happens when the data are shaped like the first examples from our clustering chapter,
- 00:35with arbitrarily shaped clusters, or cluster structures with many outliers or a lot of noise in the data -
- 00:44what does a partitioning clustering result look like?
- 00:47Here we see a few examples.
- 00:50In the left picture we first have these snake-shaped, arbitrarily formed clusters, and we run a K-Means clustering here with four clusters.
- 01:01It is obvious that the cluster structures that we as humans would intuitively recognize
- 01:05are not recognized by the clustering algorithm; instead, cluster structures are cut apart, or two cluster structures are merged into one cluster.
- 01:14In general it is not a good idea to cluster arbitrarily shaped data with the K-Means algorithm, especially because K-Means assumes convex cluster structures.
- 01:27In the second example, in the middle, we see cluster structures of different sizes:
- 01:34a very large cluster in the middle, and three slightly smaller clusters at the edges.
- 01:40Here, too, K-Means tries to find similarly sized clusters, and you see
- 01:47that the middle cluster is cut apart and its pieces are distributed among the other clusters.
- 01:52In the third case, on the right side, we see cluster structures with a lot of noise; the points up here in particular shift this cluster center to the top left,
- 02:05and as a consequence other, arbitrarily shaped cluster structures, like the one down here, are cut apart as well.
- 02:13There is a cluster here that is erroneously merged with a second one,
- 02:18and overall, all these anomalies scattered around the data space, the outliers, are assigned to one of the four clusters.
- 02:28We don't want that either.
- 02:29So, in summary, we want to find arbitrarily shaped clusters, separate differently sized clusters and separate clusters from noise - that is, not force outliers into a cluster.
- 02:44And this general idea is called density-based clustering.
- 02:49The idea is to separate dense regions, with especially many objects, from sparsely populated regions.
- 02:58In the first example, for instance, we have this serpentine cluster structure,
- 03:02which is a tightly packed cluster, and it gets separated from the other cluster structures because there are no objects in between.
- 03:10There are various density-based approaches in textbooks and in the existing literature.
- 03:17I'll show you an example here, the so-called DBSCAN algorithm.
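As a small aside, not from the lecture itself: a minimal sketch of this contrast, assuming scikit-learn is available; the data set and parameter values are illustrative only.

```python
# Minimal sketch: K-Means vs. DBSCAN on non-convex, moon-shaped data.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# K-Means cuts each crescent apart; DBSCAN recovers both moons
# and labels stray points as -1 (noise).
print(set(kmeans_labels), set(dbscan_labels))
```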
- 03:21Let's look at the first definition of the density-based clustering concept. The point here is to formalize our intuitive notion of density.
- 03:33What did we see before? Intuitively, we saw that many points lie in a certain region;
- 03:39we as humans took that in at a glance, and now it is a matter of formalizing this density,
- 03:45and we can do that by placing around every data point - so here is a data point -
- 03:52an epsilon-neighborhood with radius epsilon, and counting how many objects lie in this epsilon-neighborhood.
- 04:01Here we count four objects in this epsilon-neighborhood, and the point itself is of course also in the epsilon-neighborhood,
- 04:08and now we can simply say that these five points make up the density.
- 04:15And that is exactly our first definition.
- 04:18We now need the neighboring points within an epsilon radius, and this set, the epsilon-neighborhood with radius epsilon around a query point Q,
- 04:30we determine from our database: it consists of all objects P in our database whose distance to Q is less than or equal to epsilon, and we select them all.
- 04:41As a second parameter in addition to this radius Epsilon we can now introduce the MinPoints parameter,
- 04:48which is the minimum number of points that we expect for a dense region.
- 04:54With these two parameters defined, we call an object Q a core object if its neighborhood with radius epsilon contains at least MinPoints objects.
- 05:08So the neighborhood contains at least MinPoints objects.
- 05:13And it is exactly these core objects that represent our cluster structures.
- 05:18If we have an area with many densely packed objects, then we gather these objects together and can thus build up the cluster structure.
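These two definitions translate almost directly into code. Here is a small sketch, assuming Euclidean distance and NumPy; the function names are my own, not from the lecture.

```python
import numpy as np

def eps_neighborhood(db: np.ndarray, q: np.ndarray, eps: float) -> np.ndarray:
    """Indices of all objects P in db with dist(P, Q) <= eps; Q itself is included."""
    dists = np.linalg.norm(db - q, axis=1)
    return np.where(dists <= eps)[0]

def is_core(db: np.ndarray, q: np.ndarray, eps: float, min_pts: int) -> bool:
    """Q is a core object iff its epsilon-neighborhood contains >= MinPoints objects."""
    return len(eps_neighborhood(db, q, eps)) >= min_pts
```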
- 05:28Let us illustrate a little, in the one-dimensional case, what happens in density-based clustering.
- 05:35If we have a one-dimensional space, plot the density over it, and have a few objects at a certain point,
- 05:46then we see more and more objects in one place, and again more objects in a second place.
- 05:53What we do now is cut these mountains in the graph at a certain threshold, MinPoints, and say:
- 06:04at MinPoints and above we have core objects, and below MinPoints we do not.
- 06:13And what should the data distribution look like then?
- 06:15A particularly large number of objects are concentrated here, while fewer objects lie in these other places.
- 06:25So one could really picture it like a mountain range: above a certain level those are core objects, and below that level they are not.
- 06:36And the density-based clustering approach does the same, and you can imagine it that way,
- 06:42like a hike across a mountain range where you are not allowed to descend below a certain level, namely MinPoints.
- 06:50You must always stay above the MinPoints level, and everything you can reach
- 06:55on this mountain hike above that contour line
- 06:59is one cluster.
- 07:01So the objects that lie here would be the first cluster, and the objects that lie here on this mountain would be the second cluster.
- 07:12And this is exactly what DBSCAN formalizes with the following three definitions,
- 07:17namely direct density-reachability (directly density-reachable), density-reachability, and density-connectedness.
- 07:27What's the meaning of this?
- 07:27For direct density-reachability we have, first of all, an object P that lies in the neighborhood of another object Q, and the object Q is a core object.
- 07:39This means that the object Q has a certain density, at least MinPoints objects, and P lies directly within this neighborhood.
- 07:50Graphically we see this object Q and the object P, and we now call P directly density-reachable from Q.
- 08:01The direction is particularly important here: Q is the core object, and from this core object I can directly density-reach another object.
- 08:08Density-reachability is then simply the transitive closure of direct density-reachability.
- 08:17What's the meaning of this?
- 08:17We start at an object Q and repeatedly step to directly density-reachable objects.
- 08:24So we go to this middle object first and then proceed to object P.
- 08:31That means, of course, that this object in the middle, let's call it R, must also be a core object.
- 08:39Because when we form this transitive closure, we always walk - figuratively speaking - along this contour line on a mountain,
- 08:49and we must never descend below this contour line; we must always step onto core objects.
- 08:57So we always walk in fixed steps, with a maximum step size of epsilon, and collect all the core objects.
- 09:06And that is the definition of density-reachability.
- 09:11The last definition is that of density-connectedness: P and Q are density-connected with respect to the two parameters epsilon and MinPoints
- 09:23if there is some point O from which both P and Q are density-reachable.
- 09:33So two objects P and Q are density-connected if there is a third object O from which one can walk transitive paths to P and to Q.
- 09:51And this is a particularly important property here, because we can now say that every cluster consists of density-connected objects.
- 10:02And it is exactly these density-connected objects that we collect.
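One compact way to picture density-reachability in code is a breadth-first expansion in which only core objects may expand further. The sketch below reuses eps_neighborhood and is_core from the sketch above and is, again, illustrative only.

```python
def density_reachable_from(db, q_idx, eps, min_pts):
    """Indices of all objects density-reachable from db[q_idx] (q itself included)."""
    reached, frontier = set(), [q_idx]
    while frontier:
        i = frontier.pop()
        if i in reached:
            continue
        reached.add(i)
        if is_core(db, db[i], eps, min_pts):  # only core objects expand the closure
            frontier.extend(eps_neighborhood(db, db[i], eps))
    return reached

# P and Q are density-connected iff some object O density-reaches both:
# reachable_from_o = density_reachable_from(db, o_idx, eps, min_pts)
# connected = p_idx in reachable_from_o and q_idx in reachable_from_o
```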
- 10:06Given this last property, it suffices for our clustering algorithm
- 10:11to find one object of the cluster, namely a core object - in our example the core object O - from which we can reach all other objects in this cluster.
- 10:22And that's exactly what the DBSCAN algorithm does, I'll show you on the next slide,
- 10:28namely it simply looks for one object of a cluster and then builds the entire transitive closure, the maximal set of all density-reachable objects.
- 10:38And then the algorithm defines it as a cluster.
- 10:42Two properties are important for the cluster definition. The first is maximality: from this one object we collect all density-reachable objects,
- 10:55so from one object we collect every density-reachable object within this cluster.
- 10:59This is the maximality in the density-based cluster definition; the second property is connectivity:
- 11:06every object in such a cluster must be density-reachable from, or density-connected to, all other objects within that cluster.
- 11:15And the algorithm then continues to distinguish between different groups of objects,
- 11:22namely the core objects, which we have just defined: objects whose neighborhood contains at least MinPoints objects.
- 11:30We also distinguish border objects: objects that are density-reachable from a core object.
- 11:39So we have a core object here, and via density-reachability we can get to this border object.
- 11:46This border object itself is not a core object.
- 11:50And the third category are noise objects.
- 11:53These are not density-reachable from any cluster and are not core objects themselves.
- 12:00And that would be our third group here, namely the noise objects.
- 12:05The final clustering we get is a set of density-based clusters according to this definition
- 12:14and an additional set N of noise objects that do not belong to any of these clusters.
- 12:20So here we see the first algorithm that distinguishes between noise and cluster structures.
- 12:27Furthermore, it is the first algorithm that does not require a parameter for the number of clusters.
- 12:34We have only determined in advance how dense a certain region must be, via the parameter Epsilon and via the parameter MinPoints.
- 12:41We don't tell the algorithm in advance, how many clusters we're looking for.
- 12:46The algorithm then automatically finds a certain number of density-based clusters that are all maximal and connected, and we do not have to worry about that number in advance.
- 12:58Let's look at the whole algorithm. DBSCAN stands for "Density-Based Spatial Clustering of Applications with Noise",
- 13:10so the special application area, data points with noise in them, is right there in the name,
- 13:18and it is based on the basic idea that I have just illustrated,
- 13:23namely that each object in a density-based cluster is density-reachable from any of its core objects.
- 13:32So every object in a density-based cluster is density-reachable from any core object.
- 13:39And the algorithm exploits exactly this arbitrariness of the core object: we only need to find one such object in our database, so we simply look at each object once
- 13:52and test, in the first line of the algorithm, whether this object has already been visited.
- 14:03When we find an object that has not yet been considered, we test the core-object property.
- 14:12We form an epsilon-neighborhood around this object and count how many objects lie in the neighborhood,
- 14:17and if it is a core object, i.e., if the neighborhood contains at least MinPoints objects,
- 14:23then we collect all the objects, starting from O, that are density-reachable from O.
- 14:29And exactly these two lines implement the property stated above:
- 14:36as soon as we find a core object, we essentially collect all the objects of its cluster,
- 14:42because we know that all of the cluster's objects must be density-reachable.
- 14:47If the object has not yet been visited and is not a core object, then we assign it to the group of noise objects.
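Putting the pieces together, here is a compact runnable sketch of the loop just described, assuming Euclidean distance and NumPy; the UNVISITED/NOISE markers are my own convention, not from the original pseudocode.

```python
import numpy as np

UNVISITED, NOISE = -2, -1  # illustrative label markers

def dbscan(db: np.ndarray, eps: float, min_pts: int) -> np.ndarray:
    labels = np.full(len(db), UNVISITED)
    cluster_id = 0
    for o in range(len(db)):
        if labels[o] != UNVISITED:          # already visited
            continue
        neighbors = np.where(np.linalg.norm(db - db[o], axis=1) <= eps)[0]
        if len(neighbors) < min_pts:        # not a core object: noise, for now
            labels[o] = NOISE
            continue
        labels[o] = cluster_id              # core object found: start a new cluster
        seeds = list(neighbors)
        while seeds:                        # expand the transitive closure
            p = seeds.pop()
            if labels[p] == NOISE:          # border object, reclaimed from noise
                labels[p] = cluster_id
            if labels[p] != UNVISITED:
                continue
            labels[p] = cluster_id
            p_nbrs = np.where(np.linalg.norm(db - db[p], axis=1) <= eps)[0]
            if len(p_nbrs) >= min_pts:      # only core objects expand further
                seeds.extend(p_nbrs)
        cluster_id += 1
    return labels                           # NOISE (-1) marks the noise set
```

Note how the `labels[p] == NOISE` branch already anticipates the question discussed next: a point first labeled as noise can later be reclaimed as a border object of a cluster.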
- 14:57Now you can ask yourself: once we have assigned an object to the noise group, does this object always remain a noise object?
- 15:10Or could this first assignment to the noise group have been premature, so that it has to be revised later?
- 15:18And the answer is yes, such an object can be reassigned: the algorithm continues its run
- 15:27at a different place and may encounter another object that has not yet been visited,
- 15:32and if this object is a core object, it may collect exactly our noise object and assign it to its cluster.
- 15:40And here is an interesting question: you may want to consider whether this assignment of objects is actually unique,
- 15:49whether each object is assigned to exactly one cluster or whether there are ambiguities in this algorithm.
- 15:57There is indeed one point where you have to be careful, and I would ask you
- 16:05to think for yourself: what objects could those be that may be assigned to multiple clusters?
- 16:11Let us now look at the algorithm in an example.
- 16:18So how does the DBSCAN algorithm work in this little example?
- 16:23We now have Epsilon = 2 and MinPoints = 3, so we need at least three objects in the epsilon-neighborhood.
- 16:29We start at an arbitrary object and count how many neighbors this object has.
- 16:35Here we see three objects in the neighborhood, so this is a core object, and the algorithm starts here.
- 16:40The algorithm now collects objects; we see the transitive closure, all the objects that are density-reachable from the first object.
- 16:51At the end we have collected all of them; there are no more objects in any neighborhood by which we could expand the transitive closure.
- 17:01This makes it the first cluster.
- 17:03So in the first step we found this cluster here, and now we just move to the next object in the clustering algorithm.
- 17:13The next object is not a core object, so it is assigned to the noise group.
- 17:22We continue, start at another object, and collect the second cluster.
- 17:27You can see very nicely that the transitive closure then stops automatically:
- 17:31again we have no more objects to follow, and we have thus found the second cluster.
- 17:38So in the first pass we found this cluster via this core object,
- 17:43then identified this noise object, and then, starting from here, found this cluster.
- 17:48The beauty of it is that we do not have to prescribe a specific processing order for the database,
- 17:57because you know as soon as you find a core object, you can set up the complete cluster.
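As a usage note for the dbscan sketch above, with made-up toy points (not the actual data from the slide), using the slide's parameters eps = 2 and MinPoints = 3:

```python
import numpy as np

# Hypothetical toy data: two dense blobs and one isolated point.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],      # blob -> cluster 0
              [10, 10],                             # isolated point -> noise
              [20, 0], [21, 0], [20, 1], [21, 1]])  # blob -> cluster 1

print(dbscan(X, eps=2.0, min_pts=3))
# expected: [ 0  0  0  0 -1  1  1  1  1]
```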
- 18:02Those are the advantages of the DBSCAN algorithm. Now we need to think about the drawbacks.
- 18:10I just said we don't have to specify the number of clusters.
- 18:14This is a big advantage of this algorithm, yet we still have parameters.
- 18:18The parameters are Epsilon and MinPoints, so we first need to think about the minimum density we assume for a cluster.
- 18:28The idea is to consider what minimum density we want to encode with these parameters, and how to detect this minimum density heuristically from the data.
- 18:40And a heuristic for this is the so-called nearest neighbor distance, so the distance to the nearest neighbor.
- 18:47Here, however, we do not take the distance to the first nearest neighbor, but the distance to the fourth-nearest neighbor.
- 18:54We determine this 4-nearest-neighbor distance and can then define an epsilon radius accordingly, one that finds this cluster with MinPoints = 4.
- 19:08So imagine you want to identify this structure as a cluster: how would you have to set epsilon?
- 19:14You would have to use at least this radius around this object, so that it covers all four objects; that is, you take the 4-nearest-neighbor distance.
- 19:24This 4-nearest-neighbor distance is now determined for each individual point in your data set.
- 19:31You see, quite naturally, that in this area, where the objects are more densely packed,
- 19:36the 4-nearest-neighbor distance is much smaller than in this area.
- 19:42And if you do that for every object in your database, you'll get a diagram like this here.
- 19:48On the x-axis you simply see all objects plotted, and on the y-axis the 3-nearest-neighbor distance,
- 19:57and you see that certain objects have a very high 3-nearest-neighbor distance,
- 20:02while here there is a flat valley of many objects that have a low distance.
- 20:08Exactly these objects are the ones we want in clusters, and we exploit this very first bend here
- 20:14and define the distance at this first bend point, the 3-nearest-neighbor distance there, as the epsilon distance.
- 20:21All objects over here can then only be border objects or noise objects.
- 20:28With this heuristic we can very nicely set Epsilon.
- 20:30Nevertheless, we still have to set MinPoints, and the heuristic simply says: take twice the dimension minus 1 as MinPoints and create the k-nearest-neighbor diagram for that k.
- 20:45Based on this k-nearest-neighbor diagram, we then choose the epsilon distance.
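As a sketch of this heuristic, assuming scikit-learn is available, one can compute the sorted k-nearest-neighbor distances like this; reading off the knee is still left to the user.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sorted_k_distances(X: np.ndarray, k: int) -> np.ndarray:
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    # ask for k + 1 neighbors, because each point is its own neighbor at distance 0
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    return np.sort(dists[:, k])[::-1]

# e.g. for 2-dimensional data with the 2*dim - 1 rule: min_pts = 3;
# plot sorted_k_distances(X, k=3) and choose eps at the first bend.
```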
- 20:50In this example, highlighted in yellow, we can thus identify the four cluster structures very nicely: these clusters, this arbitrarily shaped cluster, and this small cluster here,
- 21:06while all the noise objects out there have a correspondingly higher 3-nearest-neighbor distance and do not end up in a cluster.
- 21:15So much for the parameter selection of DBSCAN.
- 21:19Now you can ask yourself whether this always works, whether with DBSCAN we always find such a clean bend in this diagram.
- 21:27I have brought you an example: on the left side you see the data distribution, and on the right the 3-nearest-neighbor diagram.
- 21:34Here you can see that there is no bend.
- 21:36So in the 3-nearest-neighbor diagram there is no place where we could cut off.
- 21:42And we also see the reason for it on the left side: there are simply clusters with very low density, such as cluster A, which sits up here in the diagram.
- 21:52Then there's a lot of noise out here, too, which is also up here in the diagram.
- 21:57But there is also a continuous transition from these sparsely populated regions: there are cluster structures like B that are still meaningful.
- 22:07Or the outer cluster D, which looks similar in this k-nearest-neighbor plot, so noise here cannot be distinguished from cluster structure there.
- 22:19And the example is deliberately constructed so that cluster structures are hierarchically nested in each other, like the clusters G1, G2, G3, which lie inside a larger cluster G.
- 22:29Or this cluster F, which is located in the large cluster E.
- 22:33With such a hierarchical cluster structure you get gradually decreasing k-nearest-neighbor distances and simply no clear cut-off criterion.
- 22:44If we were to introduce such a criterion anyway, i.e., if we cut off at some point,
- 22:51say at this point here, then we simply get a clustering of the objects that lie below this epsilon parameter.
- 23:06In this example these would be D1, D2, G1, G2, G3, the elements back here;
- 23:16these are our clusters, which we also identify as density-based clusters.
- 23:20All other objects are reported as noise accordingly.
- 23:25And you see a problem with that, too, which takes us straight to the next lesson, namely hierarchical cluster structures.
- 23:33These cannot be found with the density-based clustering approach as I have presented it here;
- 23:38you always have to commit to one epsilon value, and with this epsilon value you can only capture one hierarchical level of the clustering.
- 23:47To conclude this unit, let us discuss the advantages and disadvantages of density-based clustering.
- 23:57The advantages are, of course, that you can find clusters of any shape.
- 24:00We have thus overcome the restriction of convex clusters, so we can find arbitrarily shaped cluster structures.
- 24:09It is not necessary to set the number of clusters beforehand.
- 24:12We can thus also distinguish clusters from noise: we have an extra group of noise objects that are not assigned to any cluster.
- 24:22And we have the support of spatial index structures, which in our case means that if we have an object
- 24:36and an epsilon-neighborhood, we want to determine the number of objects in this epsilon-neighborhood as efficiently as possible.
- 24:47For this there are, particularly in the database literature, very nice index structures
- 24:51which can process exactly this kind of query more efficiently, so that we can also reduce the runtime of the algorithm.
- 24:59Without such index structures the DBSCAN approach typically has quadratic cost,
- 25:05because for every single object the neighborhood has to be determined from the database,
- 25:10and in the worst case, to find the neighbors, we have to look at all the other objects.
- 25:15This gives us the quadratic cost.
- 25:17With spatial index structures this can be reduced accordingly.
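As a sketch of such an acceleration, here is a range query via SciPy's KD-tree (my choice of index structure; the lecture does not name a specific one). Whether it helps in practice depends on the dimensionality and distribution of the data.

```python
import numpy as np
from scipy.spatial import cKDTree

def eps_neighborhoods_indexed(X: np.ndarray, eps: float) -> list:
    """For each point, the indices of all points within distance eps (self included)."""
    tree = cKDTree(X)                       # build the spatial index once
    return tree.query_ball_point(X, r=eps)  # one range query per object
```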
- 25:21But that also means you have to think about how to search efficiently for such epsilon-neighborhoods in the database.
- 25:31Another disadvantage, besides this complexity, which is certainly a drawback of the algorithm,
- 25:38is certainly the number of parameters and the setting of these parameters.
- 25:43Epsilon and MinPoints are certainly not as intuitive to set as the K in K-Means, because these two parameters are interdependent.
- 25:52We have seen the heuristic: by fixing one of these parameters we can derive the other.
- 25:57But it is still just a heuristic, and you, as the user of this algorithm, must be able to handle these two parameters sensibly.
- 26:05It must also be said clearly that the algorithm reacts very sensitively to these parameters.
- 26:13If the parameters are set incorrectly on some data sets, it can happen that completely nonsensical cluster structures, or perhaps no cluster structures at all, are identified,
- 26:22or, in the other extreme case, the entire data set emerges as one cluster.
- 26:27As I said, we will look at these parameter-setting difficulties, and also at hierarchical clustering algorithms, in the next class.
- 26:34Thank you so much and see you next time.