This video belongs to the openHPI course Künstliche Intelligenz und Maschinelles Lernen in der Praxis.
- 00:00 In the last video we looked at a few ways to get data.
- 00:05 Now we want to use a very simple, illustrative example to see what this can look like for web crawling.
- 00:12 We start, as usual, by importing the libraries we will need.
- 00:19 One of them is selenium, a standard library for setting up so-called web drivers
- 00:26 that can mimic user behavior on the web. And, as we typically do again and again, we use pandas
- 00:34 to manage our data and to save it afterwards.
- 00:39 Now that we have imported these, we can start working in the browser.
- 00:44 What we want to do is reproduce our own behavior, for example on the openHPI course page.
- 00:52 For example, when we are on the course page, we might want to collect the different titles.
- 00:58 There are about a hundred courses.
- 01:01 We would like to collect the different titles in order to evaluate them,
- 01:04 or, for example, to train an AI model that sorts them into categories such as programming or artificial
- 01:10 intelligence. For that, we want to have this data collected.
- 01:16 But we don't want to click through and type out each one individually.
- 01:20 What you can do very nicely here is use a web crawling approach and program this behavior,
- 01:27 so that a program collects the data for us and we don't have to do it ourselves. What you need to know in order to crawl
- 01:34 a website like openHPI and its course overview
- 01:40 is that it consists of different components, and the one relevant for web crawling is the HTML component.
- 01:47 That means that what we see here is built on a so-called HTML structure.
- 01:53 We can parse these HTML structures and extract the data from them.
- 01:58 How you get to this structure depends on the browser you are using.
- 02:03 I'm using Google Chrome here.
- 02:04 Google Chrome offers a nicely interactive way to access this HTML structure.
- 02:11 I right-click on the page and click on "Inspect".
- 02:15 Then the developer tools open.
- 02:17 That means I can now see the HTML structure on the right, and on
- 02:24 the left I can see which part of the page it corresponds to.
- 02:26 If I now want to collect these titles,
- 02:30 I can search this HTML structure for the various titles, and I'll simply start.
- 02:35 I search the HTML structure for "HPI Academy"
- 02:41 and see that a match was found.
- 02:46 This means I have found the location of "HPI Academy".
- 02:52 I can also see this because, on the left of the page, a blue box has been placed around the title.
- 02:56 That's exactly the position I want.
- 02:59 Now you need a little experience with web crawling, because what I see next is that there is a structure above it,
- 03:07 namely a so-called div element that has a class called course title.
- 03:11 Course title sounds like something that all these different titles have in common.
- 03:16 That is, it's an element, or a description, with which I can very easily tell the program I want to develop
- 03:24 that it should search for this course title.
- 03:27 And if we take that term and search for it in the HTML structure, we see
- 03:33 that we always land exactly on the titles.
- 03:36 That means we have found a common feature that we can now use ideally
- 03:41 to collect the various course titles.
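The idea of finding every title through a shared class can be tried offline before any browser automation is involved. The following is a minimal sketch using only Python's standard library. The class name `course-title`, the sample HTML, and the `TitleCollector` helper are assumptions made for illustration; the real openHPI markup may be named and nested differently.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text inside every <div> that carries a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # > 0 while we are inside a matching div
        self.buffer = []    # text fragments of the div currently being read
        self.titles = []    # finished titles, in document order

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            # A nested div inside a matching div: track it so we know when to stop.
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Normalize whitespace and keep non-empty titles only.
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.titles.append(text)

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


# Hypothetical stand-in for the course overview page.
SAMPLE_HTML = """
<ul>
  <li><div class="course-title"><a href="#">HPI Academy</a></div>
      <div class="course-teacher">n/a</div></li>
  <li><div class="course-title highlighted"><a href="#">Semantic Web Technologies</a></div></li>
</ul>
"""

collector = TitleCollector("course-title")
collector.feed(SAMPLE_HTML)
titles = collector.titles
print(titles)
```

Only the divs whose class list contains `course-title` contribute text, which mirrors the search done by hand in the developer tools.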
- 03:44 Now we know everything we need from the browser to write our program, and
- 03:49 we can implement it. Selenium, for example,
- 03:53 is ideal for this, because with selenium you can build web agents very nicely.
- 03:59 When I run this now, it takes a moment because it has to load first, and then
- 04:06 a new tab opens in Google Chrome. But this is a special tab, because
- 04:12 I can operate it normally, but I can also control it through my program.
- 04:17 And now comes the exciting part. Because I now know that these titles are somehow
- 04:22 inside this div with the course title class, plus a bit more structure that I won't go into in detail,
- 04:29 I can collect these different titles automatically. So when I run this now,
- 04:33 our program goes over all the different titles that were on the course page and collects them.
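The notebook code itself is not shown in the transcript, so the following is only a sketch of how such a selenium step might look. The URL `https://open.hpi.de/courses`, the selector `div.course-title`, and the function names are assumptions; the real page likely needs a more specific selector, since the speaker mentions "a bit more structure" around the titles.

```python
def collect_titles(driver, css_selector="div.course-title"):
    """Return the visible text of every element matching css_selector.

    `driver` is a selenium WebDriver that has already loaded the page.
    The default selector is an assumption, not the confirmed openHPI markup.
    """
    # "css selector" is the locator-strategy string that selenium
    # also exposes as By.CSS_SELECTOR.
    elements = driver.find_elements("css selector", css_selector)
    texts = (element.text.strip() for element in elements)
    return [text for text in texts if text]


def crawl_course_titles(url="https://open.hpi.de/courses"):
    """Open a Chrome window under program control and collect the titles."""
    # Imported here so that collect_titles can be tried without a browser.
    from selenium import webdriver

    driver = webdriver.Chrome()  # requires a local Chrome installation
    try:
        driver.get(url)
        return collect_titles(driver)
    finally:
        driver.quit()  # always close the controlled browser window
```

Calling `crawl_course_titles()` would open the controlled browser window described in the video and return the titles as a plain Python list.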
- 04:39 We can now put them into a data frame.
- 04:42 And we see that we have, for example, Startup Talks at HPI or the HPI Academy,
- 04:47 or, at the bottom, Semantic Web Technologies. And we also see that we have
- 04:53 collected 101 of these titles, and that really didn't take long.
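Putting the collected titles into a data frame could look roughly like this. The list below holds only two sample titles mentioned in the video rather than the full 101, and the column name `title` is an assumption.

```python
import pandas as pd

# Two titles mentioned in the video, standing in for the full crawl result.
titles = ["HPI Academy", "Semantic Web Technologies"]

# One row per course, with a single column holding the title.
df = pd.DataFrame({"title": titles})

print(df)
print(len(df))  # number of collected titles
```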
- 04:57 The number doesn't matter; it could be 10,000, and the program would run just as easily.
- 05:03 What you would have to consider, of course, is that other things can come up: you may
- 05:06 have to log in first, or there may be navigation across several pages.
- 05:11 So this is a very simple example, but basically this is how it works.
- 05:17 This is how you can implement web crawling.
- 05:20 Right, and now that we have collected the data once,
- 05:22 only the last step is missing, and it is one we use again and again:
- 05:27 we use pandas to simply save this data as a CSV file.
- 05:32 And with that, we have collected our first 101 course titles via web crawling.
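The saving step might be sketched as follows. The file name `course_titles.csv`, the column name `title`, and the helper functions are assumptions; the example writes into a temporary directory so that it leaves no files behind in the working directory.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_titles(titles, path):
    """Write the titles to a CSV file with a single 'title' column."""
    df = pd.DataFrame({"title": titles})
    df.to_csv(path, index=False)  # index=False keeps the row numbers out of the file
    return df


def load_titles(path):
    """Read the titles back from the CSV file as a plain list."""
    return pd.read_csv(path)["title"].tolist()


# Sample data standing in for the crawl result.
out_path = Path(tempfile.mkdtemp()) / "course_titles.csv"
save_titles(["HPI Academy", "Semantic Web Technologies"], out_path)
restored = load_titles(out_path)
print(restored)
```

Reading the file back is a quick check that the titles survive the round trip unchanged.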
- 05:37 In the same way, you could now also collect the authors or the course descriptions, for example.
- 05:42 That means you could extend this as you like.
- 05:46 For now, this should be enough as an illustrative example of how to use web crawling
- 05:50 to obtain data from the web.
About this video
- On GitHub, we have collected all materials for the practical units and prepared them for you.