If you are a web developer in any domain, at least once in your career you have probably been approached by someone, be it your manager or the marketing team, asking the magical question “DO YOU KNOW WEB SCRAPING?”. If you are just starting your career, you might not even know that such a concept exists.
I would definitely say this is the most techie thing that any non-tech person can know about. And yes, this is a very useful and important topic regardless of your domain of development. So, in this article, we will be discussing web scraping and building a simple application to scrape a sample website using Puppeteer.
Note: This tutorial is meant for developers and techies only. If you are not a developer and still need to use web scraping, there are numerous tools out there that can do the job for you. One such tool that I recommend is ScrapingBee; it is a professional, simple, and feature-rich tool for scraping the web. Do give it a try.
For a brief description of web scraping, do check out this article on What is Web Scraping and How to Use It?
What is web scraping?
To put it simply, web scraping is the process of extracting information from a web page programmatically. What do we mean by extracting information? For example, suppose you visit an e-commerce website and want to store all the products and their prices, say, to analyze the data for your own e-commerce website. One way is to go through all the products manually and record every value one by one, if you are a time traveler from 1982.
Enter web scraping. For the above use case, we can write a script or use a tool to extract all the required information. If you are a non-technical person, you can go with tools like ScrapingBee. But if you are a developer, you will be amazed at how easy this process is, especially using a library like Puppeteer.
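To make the idea concrete, here is a toy sketch of “extraction” in plain Node.js. The HTML string and the regex are made up for illustration only; real pages should be parsed with a DOM-aware tool, which is exactly what Puppeteer gives us below.

```javascript
// Toy illustration only: pulling structured data out of markup with a regex.
// Real-world scraping should use a DOM-aware tool like Puppeteer instead.
const html = `
  <li><span class="name">Widget</span><span class="price">£10.00</span></li>
  <li><span class="name">Gadget</span><span class="price">£25.50</span></li>
`;

const products = [...html.matchAll(
  /<span class="name">(.*?)<\/span><span class="price">(.*?)<\/span>/g
)].map(([, name, price]) => ({ name, price }));

console.log(products);
// → [ { name: 'Widget', price: '£10.00' }, { name: 'Gadget', price: '£25.50' } ]
```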
When to use web scraping?
There are numerous scenarios where web scraping comes in handy, some of them are as follows.
- Price Comparison & Competition Monitoring
- Data Analysis and Data Science
- Lead Generation for Marketing
Legal obligations
Yep, before we scrape our brains out, we need to be aware of which data can be scraped and which cannot. That is, some data is legally available to use and some is not. For instance, you cannot scrape a site’s premium content and host it free of cost on your own blog or website; I mean, that’s insane. So be sure that you are scraping legal content.
Without any delay, let’s dive into coding.
Simple web scraping application.
In this section, we will develop a simple application to scrape product details such as the price and product name from the website books.toscrape.com.
Note: As mentioned in the Legal obligations section, the above-mentioned site does not place any restrictions on its content, so we are using it. If you want to use the following technique on any other website, proceed with caution.
First, install the two packages we need: Puppeteer for scraping and monk for writing to MongoDB.

```
npm i puppeteer
npm i monk
```
We will be using only one file, index.js. Obviously, it goes without saying that your folder structure might be more complex depending on the requirements and the content you are scraping.
Step 1: Import all the required libraries and replace mongo_db_connection_link with your MongoDB connection string.
price-data is the collection name; feel free to rename it if needed.
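A sketch of this step, assuming puppeteer and monk are installed; mongo_db_connection_link stays a placeholder for your own connection string (for example, localhost:27017/scraping):

```javascript
// index.js — Step 1 sketch: imports and database setup.
const puppeteer = require('puppeteer');
const monk = require('monk');

// Replace mongo_db_connection_link with your MongoDB connection string.
const db = monk('mongo_db_connection_link');

// price-data is the collection the scraped values will land in.
const collection = db.get('price-data');
```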
Step 2: Now, we will create a Puppeteer object to launch our headless browser. What is a headless browser, you ask? Well, to put it simply, a headless browser is a browser with no user interface.
The way Puppeteer works is that it launches a headless browser (Chromium) and executes all the code in that browser against the given website.
Setting the headless option to false will launch a Chromium browser instance so you can see what is going on in real time.
Setting the devtools option to true allows us to inspect and debug our code in the launched browser instance (we need to set the headless option to false to use this feature).
Note: We can configure Puppeteer to use Firefox Nightly; kindly refer to this link. Also, we need not install Chromium separately, as it will be installed along with Puppeteer.
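A minimal sketch of the launch step. launchBrowser is a hypothetical helper name, and the require sits inside the function only so this fragment stands alone; in index.js it belongs at the top with the other imports.

```javascript
// Step 2 sketch: launch Chromium via Puppeteer.
async function launchBrowser() {
  const puppeteer = require('puppeteer');
  return puppeteer.launch({
    headless: false, // show the browser window while we develop
    devtools: true,  // auto-open DevTools; this forces headless to false
  });
}
```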
Step 3: In this step, we will open a page and navigate it to the mentioned link for scraping.
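Sketched out, with openPage as a hypothetical helper taking the browser object from the previous step:

```javascript
// Step 3 sketch: open a new tab and go to the site we are scraping.
async function openPage(browser) {
  const page = await browser.newPage();
  // 'domcontentloaded' is enough here, since the page is static HTML.
  await page.goto('https://books.toscrape.com', { waitUntil: 'domcontentloaded' });
  return page;
}
```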
Step 4: This is the step we have been waiting for: let us start scraping. First, let’s understand the DOM structure of the page and see where and how the product name and price are located.
Inspecting the page, we can see that the product name is inside an <a> tag, which is inside an <h3> tag, and the product price is inside a <p> tag.
So, scraping all the values within these two tags will give us the product names and the respective prices.
Based on the above observations, our code should look like this.
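A sketch of the scraping step. scrapeProducts is a hypothetical helper; the selectors assume books.toscrape.com’s current markup, where each product sits inside an element with the class product_pod and the full title lives in the <a> tag’s title attribute.

```javascript
// Step 4 sketch: wait for the product grid, then pull names and prices.
async function scrapeProducts(page) {
  // Skip everything until the first .product_pod element is in the DOM.
  await page.waitForSelector('.product_pod');

  // $$eval runs the callback in the page over every matching element.
  const product_name = await page.$$eval('.product_pod h3 > a', (links) =>
    links.map((a) => a.getAttribute('title'))
  );
  const product_price = await page.$$eval('.product_pod .price_color', (tags) =>
    tags.map((p) => p.textContent)
  );

  return { product_name, product_price };
}
```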
- First, we skip all the DOM elements up to the class product_pod using the waitForSelector function. When we inspect and analyze the page, we can see that all the <p> and <h3> tags before this class do not contain any product information, so it is efficient and makes sense to skip all the unwanted DOM elements.
- The page.$$eval() function runs the given callback against every element matching the selector and returns the result. Using that, we can perform different operations on the matched elements.
- product_name will contain all the product names. We get this by accessing the contents of the <a> tags, which are inside the <h3> tags.
- In the same way, product_price contains the price of each product. In our case, the product prices are inside <p> tags with the class price_color, which in turn sit inside <div> tags with the class product_price.
Step 5: In this step, we will insert the scraped values into the database.
The above code is a simple insert call from the monk library that inserts the data into the price-data collection.
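A sketch of this step, pairing each name with its price before the insert; toDocuments and saveProducts are hypothetical helper names.

```javascript
// Step 5 sketch: shape the scraped arrays into documents and insert them.
function toDocuments(product_name, product_price) {
  return product_name.map((name, i) => ({ name, price: product_price[i] }));
}

async function saveProducts(collection, product_name, product_price) {
  // monk's collection.insert accepts an array of documents.
  await collection.insert(toDocuments(product_name, product_price));
}
```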
After scraping is complete, we need to close the browser. This lets the node session terminate and also closes the browser window if the headless option is set to false.
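Putting the five steps together, the whole index.js might look roughly like this (a sketch under the same assumptions as above; the requires sit inside main only so the fragment stands alone, and mongo_db_connection_link is still a placeholder for your connection string):

```javascript
// Full sketch: scrape books.toscrape.com and store the results in MongoDB.
async function main() {
  const puppeteer = require('puppeteer');
  const monk = require('monk');

  const db = monk('mongo_db_connection_link'); // your connection string here
  const collection = db.get('price-data');

  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://books.toscrape.com');

  await page.waitForSelector('.product_pod');
  const names = await page.$$eval('.product_pod h3 > a', (links) =>
    links.map((a) => a.getAttribute('title'))
  );
  const prices = await page.$$eval('.product_pod .price_color', (tags) =>
    tags.map((p) => p.textContent)
  );

  await collection.insert(names.map((name, i) => ({ name, price: prices[i] })));

  await db.close();      // let node exit cleanly
  await browser.close(); // terminate the Chromium instance
}

main().catch(console.error);
```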
Running the application:
We can run our application with the command node index.js. After the node session is done, we can see that the scraped data has been inserted into our database.
Most of the time, we need to debug our code, either to track down an error or simply to understand how the code works. For this, set the devtools option to true and the headless option to false in Step 2, and place a debugger statement (wrapped in page.evaluate) at the start of your scraping logic.
Also, note that the debug code should be placed after the page is initialized (after the newPage function).
Now, if you run the application again, the Chromium browser opens and the debugger stops the execution at the start so that we can debug the code.
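The debug snippet boils down to an evaluated debugger statement (a sketch; pauseForDebug is a hypothetical helper name). With devtools set to true, the opened DevTools window pauses on it as soon as it runs.

```javascript
// Debug sketch: pause execution in the opened DevTools window.
// Must run after the page exists (i.e. after browser.newPage()).
async function pauseForDebug(page) {
  await page.evaluate(() => {
    debugger; // execution stops here while DevTools is open
  });
}
```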
Note: If you don’t want the browser to close after debugging, comment out the line await browser.close().
Web scraping is a very useful topic regardless of the domain you are working in. You might scrape the web, but using this concept you can do much more; one of the use cases in my daily work is testing front-end code using Puppeteer.
That’s it! Happy coding.
Github link: https://github.com/kishork2120/pupeteer-tutorial.
Don’t forget to try ScrapingBee.
If you love this article do check out my other article on Data generation with Falso.