A Beginner's Guide To Web Scraping With Python

Python is one of the most popular and versatile programming languages available. It is used across many industries for web development, machine learning, and data science. Given its widespread use, particularly in in-demand and interrelated fields such as machine learning and big data, it is not surprising that Python has overtaken Java in several programming language popularity rankings. In this article, you will learn about web scraping with Python.

What Is Web Scraping With Python?

Web scraping is a technique for automatically extracting information from websites. Programs that scrape the web usually simulate human browsing, either by implementing the low-level Hypertext Transfer Protocol (HTTP) directly or by embedding a full-fledged web browser, such as Internet Explorer, Google Chrome, or Mozilla Firefox.

For example, Beautiful Soup (installed as the bs4 package, short for Beautiful Soup 4, the current major version) is a Python library used for extracting data from HTML and XML files. It works with your favorite parser to provide ways to navigate, search, and modify the parse tree. Because of these capabilities, it reduces the amount of work programmers would otherwise need to do manually.

Why Scrape the Web?

Today, more and more businesses publish data on the internet, including product, customer, pricing, and supplier details. Companies (in the telemarketing industry, for example) scrape this data from websites for competitive intelligence and strategic positioning. Whether companies are doing this legally is another question, as these activities are difficult to track, especially when you throw machine learning and AI into the mix.

Inspect the Data Source

Before you scrape a website, it helps to understand how the site is structured, because you need that knowledge to locate and extract the relevant information.

Decipher Information in the URLs

A lot of information is contained in the URL you are going to scrape, and understanding how URLs work will make the scraping process much easier.

https://www.indeed.co.in/jobs?q=&l=bangalore

There are two main parts of a URL:

  • The base URL represents the path to the resource on the website. In the above example, the base URL is https://www.indeed.co.in/jobs.
  • The query parameters represent additional information passed to the page. In the above example, the query string is ?q=&l=bangalore, which sets two parameters: q (left empty) and l (set to bangalore).
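
The two parts described above can be separated programmatically. The sketch below uses Python's standard urllib.parse module on the example URL; the variable names (parts, base_url, params) are illustrative choices, not something the original tutorial defines.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.indeed.co.in/jobs?q=&l=bangalore"

# Split the URL into scheme, host, path, and query string.
parts = urlsplit(url)

# Rebuild the base URL from everything before the "?".
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"

# Parse the query string into a dict; keep_blank_values keeps the empty "q".
params = parse_qs(parts.query, keep_blank_values=True)

print(base_url)  # https://www.indeed.co.in/jobs
print(params)    # {'q': [''], 'l': ['bangalore']}
```

Changing the value of l (the location) or filling in q (the search term) is how you would point the same scraper at a different results page.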

Inspecting the Site Using Developer Tools

Next, you need to understand the page structure of the website so that you can pick out the HTML elements that hold the data you want.

DevTools is a set of web developer tools built directly into the Google Chrome browser. These tools help you inspect and edit pages and diagnose problems quickly, which ultimately enables you to build better websites faster.

In Chrome, you can open the developer tools through the menu by navigating to View → Developer → Developer Tools. You can also open them by right-clicking on the page and selecting the Inspect option, or by using a keyboard shortcut.

Libraries Required for Web Scraping With Python

Python offers several libraries for each step of web scraping. In this guide, we will use two Python modules for scraping data:

  • Requests: A Python library that can be used to fetch URLs. (Older tutorials use urllib2 for this step, but that module exists only in Python 2.)
  • Beautiful Soup: A Python package used for pulling information from web pages. It creates parse trees that help extract data easily.

Scraping a Web Page Using Beautiful Soup

We are going to start scraping the data from a Wikipedia page.

Below is the website link:

https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India

Let’s do this step by step:

  • Importing Necessary Libraries

Let's import the requests and bs4 modules. The requests module enables us to load a web page into Python so that we can analyze it.
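
A minimal sketch of this step might look like the following. The helper name fetch_page, the HEADERS value, and the 10-second timeout are assumptions made for illustration; they are not taken from the original tutorial.

```python
import requests
from bs4 import BeautifulSoup  # used in the next step to parse the HTML

URL = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"

# Assumption: a descriptive User-Agent; some sites reject requests without one.
HEADERS = {"User-Agent": "python-scraping-tutorial/0.1 (educational example)"}


def fetch_page(url, timeout=10):
    """Download a page and return the requests.Response object."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx status codes
    return response
```

Calling response = fetch_page(URL) performs the download; the function is only defined here so that nothing touches the network until you ask it to.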

  • Make the “Soup”

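The response.text attribute holds the body of the HTTP response, decoded into a string of HTML.

Note: You should specify the parser that Beautiful Soup uses to parse your text. Python's built-in parser is selected by passing html.parser as the second argument. If you omit it, Beautiful Soup picks the best parser installed on your system and issues a warning.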
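
The sketch below shows one way to make the soup. So that it runs without a network connection, it parses a small hypothetical HTML string (sample_html) instead of a live page; with a real response you would pass response.text in its place.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text, so the example runs offline.
sample_html = """
<html>
  <head><title>List of state and union territory capitals in India</title></head>
  <body><p>Sample page body.</p></body>
</html>
"""

# The second argument names the parser; "html.parser" ships with Python.
soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns the parse tree as an indented string.
print(soup.prettify())
```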

Beautiful Soup enables us to navigate through data present on the web page. Let’s try some commands to see how it works.

  • soup.<tag>.string

The command above returns the string within the provided tag.  
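
As a small illustration of this pattern, the sketch below reads the text of the title and h1 tags. The HTML string is a made-up sample, not the real Wikipedia page.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration.
sample_html = (
    "<html><head><title>State capitals</title></head>"
    "<body><h1>Capitals</h1></body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

# soup.<tag> returns the first matching tag; .string returns its text.
title_text = soup.title.string
heading_text = soup.h1.string

print(title_text)    # State capitals
print(heading_text)  # Capitals
```

Note that .string returns None when a tag contains more than one child, so get_text() is often the safer choice for tags with nested markup.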

  • Find all the links within <a> tags
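
One way to collect the links is with find_all(), as sketched below on a made-up HTML fragment. Passing href=True skips any a tags that have no href attribute.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration.
sample_html = """
<p>
  <a href="/wiki/Mumbai">Mumbai</a>
  <a href="/wiki/Chennai">Chennai</a>
  <a name="anchor-without-href">No link</a>
</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag; index a tag like a dict to read attributes.
links = [a["href"] for a in soup.find_all("a", href=True)]

print(links)  # ['/wiki/Mumbai', '/wiki/Chennai']
```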

  • Find Elements by Class Name

If you want to retrieve data by a particular class name, identifying the right class is critical. The find() method returns the first element that matches a given tag name or class, while find_all() returns every match.

In Chrome, you can check the class name by right-clicking on the required table of the web page and navigating to “Inspect”.
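
A sketch of a class-based lookup follows. The class name wikitable is an assumption about what Inspect would show for the table you want; check the actual page before relying on it.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; "wikitable" is an assumed class name.
sample_html = """
<table class="infobox"><tr><td>Not this one</td></tr></table>
<table class="wikitable sortable"><tr><td>Target table</td></tr></table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# class_ (with a trailing underscore) avoids clashing with Python's "class" keyword.
# It matches any element whose class list contains "wikitable".
table = soup.find("table", class_="wikitable")

print(table.get_text(strip=True))  # Target table
```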

  • Find Elements By ID

In an HTML page, every element can have an id attribute, and each id should be unique within the page. You can begin to parse your page by selecting a specific element by its ID.
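
A minimal sketch of an ID lookup is shown below. The id value capitals and the surrounding markup are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the id "capitals" is hypothetical.
sample_html = """
<div id="intro">Introduction</div>
<div id="capitals">Capital cities</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# find(id=...) returns the first element with that id, or None if absent.
element = soup.find(id="capitals")
missing = soup.find(id="does-not-exist")

print(element.get_text())  # Capital cities
print(missing)             # None
```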

Putting Our Web Scraper Together

Let's apply everything you have learned above to scrape the information about states and their respective capitals from the Wikipedia page.

First, you need to extract the information from the table. To do that, iterate through each row (tr), assign each cell (td) of the row to a variable, and add it to a list.

Let’s look at the code.
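
The following is a sketch of that loop, not the tutorial's original code. It runs on a small made-up table with two columns; the real Wikipedia table has more columns and a different layout, so the cell indexes and the wikitable class name are assumptions you would need to adjust.

```python
from bs4 import BeautifulSoup

# Made-up two-column table standing in for the Wikipedia table.
sample_html = """
<table class="wikitable">
  <tr><th>State</th><th>Capital</th></tr>
  <tr><td>Karnataka</td><td>Bengaluru</td></tr>
  <tr><td>Tamil Nadu</td><td>Chennai</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", class_="wikitable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) < 2:
        continue  # skip the header row, which uses <th> instead of <td>
    state = cells[0].get_text(strip=True)
    capital = cells[1].get_text(strip=True)
    rows.append((state, capital))

print(rows)  # [('Karnataka', 'Bengaluru'), ('Tamil Nadu', 'Chennai')]
```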

Now, let's convert the list to a pandas DataFrame.
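
A sketch of the conversion, assuming pandas is installed and that the scraped data is a list of (state, capital) tuples. The two rows and the column names here are illustrative.

```python
import pandas as pd

# Illustrative rows in the shape produced by a row-by-row table scrape.
rows = [("Karnataka", "Bengaluru"), ("Tamil Nadu", "Chennai")]

# Each tuple becomes one row; columns= supplies the header names.
df = pd.DataFrame(rows, columns=["State", "Capital"])

print(df)
```

From here, df.to_csv("capitals.csv", index=False) would save the result to a file if you want to keep it.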

Conclusion

Our Python web scraping tutorial covered some of the basics of scraping data from the web using Python, requests, and Beautiful Soup. We also went through the full web scraping process from start to finish.

If you have any questions, please feel free to ask them in our comments section, and our experts will answer them promptly. 

About the Author

Aryan Gupta

Aryan is a tech enthusiast who likes to stay updated about trending technologies of today. He is passionate about all things technology, a keen researcher, and writes to inspire. Aside from technology, he is an active football player and a keen enthusiast of the game.
