A Beginner's Guide To Web Scraping With Python

Python is one of the most popular and versatile programming languages available. It is used across many industries for web development, machine learning, and data science. Given its widespread use, particularly in such in-demand (and interrelated) fields as machine learning and big data, it's not surprising that some popularity rankings now place Python ahead of Java. This article introduces web scraping using Python.

It covers the following topics:

  • What is web scraping using Python?
  • Why scrape the web?
  • Inspect the data source
    • Decipher the information in the URLs
    • Inspecting the sites using developer tools
  • Libraries required for web scraping
  • Scraping the web using Beautiful Soup
    • Importing the necessary libraries
    • Make the soup
  • Navigating the data structure
    • soup.<tag>.string
    • Find all the links within <a> tag
    • Find elements by class name
    • Find elements by ID
  • Putting together our web scraper

What is Web Scraping Using Python?

Web scraping is a technique for automatically extracting information from websites. Software programs that scrape the web usually simulate human exploration of the web by either implementing the Hypertext Transfer Protocol (HTTP) at a low level or embedding a full-fledged web browser, such as Internet Explorer, Google Chrome, or Mozilla Firefox.

For example, Beautiful Soup (the current version is Beautiful Soup 4, imported as bs4) is a Python library used for extracting data from HTML and XML files. It works with your favorite parser to provide ways to navigate, search, and modify the parse tree. Because of these capabilities, it reduces the amount of work programmers would otherwise have to do manually.

Why Scrape the Web?

Today, more and more businesses publish data on the internet. This information includes product, customer, pricing, and supplier details. Companies (in the telemarketing industry, for example) scrape this data from websites for competitive intelligence and strategic positioning purposes. Whether or not companies are doing this legally is another question, as these activities are difficult to track, especially when you throw machine learning and AI into the mix.

Inspect the Data Source

Before learning how to scrape a website, it’s good to know more about the website’s structure, which is necessary to extract relevant information.

Decipher Information in the URLs

A lot of information is contained in the URL you are going to scrape, and understanding how URLs work will make the scraping process much easier.

https://www.indeed.co.in/jobs?q=&l=bangalore

There are two main parts of a URL:

  • The base URL identifies the site and the path of the page on it. In the above example, the base URL is https://www.indeed.co.in/jobs.
  • The query parameters pass additional information to the page as key=value pairs after the ? character. In the above example, the query string is ?q=&l=bangalore, which carries two parameters: q (empty here) and l (set to bangalore).
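
The split described above can be reproduced with Python's standard library. This is a minimal sketch using urllib.parse; the variable names are chosen here for illustration only.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.indeed.co.in/jobs?q=&l=bangalore"
parts = urlsplit(url)

# Base URL: scheme + host + path, without the query string.
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"

# keep_blank_values=True keeps the empty "q" parameter instead of dropping it.
params = parse_qs(parts.query, keep_blank_values=True)

print(base_url)  # https://www.indeed.co.in/jobs
print(params)    # {'q': [''], 'l': ['bangalore']}
```

Changing the value of l in the query string is usually all it takes to request the same page for a different location.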

Inspecting the Site Using Developer Tools

Next, you’ll need to understand the page structure of the website and identify the HTML elements that hold the data you need.

DevTools is a set of web developer tools built directly into the Google Chrome browser. These tools let you inspect and edit pages and diagnose problems quickly, which ultimately enables you to make better websites faster.

In Chrome, you can open the developer tools through the menu by navigating to View → Developer → Developer Tools. You can also open them by right-clicking on the page and selecting the Inspect option, or by using a keyboard shortcut (Ctrl+Shift+I on Windows and Linux, Cmd+Option+I on macOS).

Libraries Required for Web Scraping

Python often offers several libraries for the same task. In this guide, we will be using two different Python modules for scraping data:

  • Requests: A Python library that can be used to fetch URLs. (The older urllib2 module exists only in Python 2; its Python 3 counterpart in the standard library is urllib.request.)
  • Beautiful Soup: A Python package used for pulling information from web pages. It creates parse trees that help extract data easily.

Scraping a Web Page Using Beautiful Soup

We are going to start scraping the data from a Wikipedia page.

Below is the website link:

https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India

Let’s do this step by step:

  • Importing Necessary Libraries

Let's import the requests and bs4 modules. The requests module enables us to load a web page into Python so that we can analyze it.

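A minimal sketch of this step is shown below. It assumes both packages are installed (for example with pip install requests beautifulsoup4). The helper name fetch_page and the 10-second timeout are illustrative choices, not part of either library.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"


def fetch_page(url, timeout=10):
    """Download a page and return the requests.Response object.

    fetch_page is a hypothetical helper written for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response


# Usage (needs network access):
# response = fetch_page(URL)
```

The call is left commented out so the block can be pasted and imported without touching the network.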

  • Make the “Soup”

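As a rough sketch, making the soup means passing the page's HTML and a parser name to BeautifulSoup. The short html string below is a made-up stand-in for response.text so that the example runs without network access.

```python
from bs4 import BeautifulSoup

# Stand-in for response.text (sample markup invented for this sketch).
html = """
<html>
  <head><title>Sample page</title></head>
  <body><p>Hello, soup!</p></body>
</html>
"""

# "html.parser" selects Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the parse tree as an indented string.
print(soup.prettify())
```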

The response.text attribute holds the body of the response (the page's HTML) as a readable string.

Note: You should name the specific parser that Beautiful Soup uses to parse your text. Python's built-in parser is selected by passing "html.parser".

Navigating the Data Structure

Beautiful Soup enables us to navigate through data present on the web page. Let’s try some commands to see how it works.

  • soup.<tag>.string

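A small sketch of this access pattern, run against made-up sample markup rather than the live Wikipedia page:

```python
from bs4 import BeautifulSoup

# Sample markup invented for this sketch.
html = (
    "<html><head><title>Sample page</title></head>"
    "<body><h1>Capitals</h1><p>One <b>bold</b> word</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)  # Sample page
print(soup.h1.string)     # Capitals

# <p> has several children (text, <b>, text), so .string is None here.
print(soup.p.string)      # None
```

When a tag contains nested tags, get_text() is usually the better choice because it joins all the text inside the tag.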

The expression soup.<tag>.string returns the string inside the first tag with that name.

  • Find all the links within <a> tags

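One way to do this is with find_all("a"), which returns every <a> tag in the document. The sketch below uses made-up sample markup; the third tag has no href attribute, to show why filtering can be useful.

```python
from bs4 import BeautifulSoup

# Sample markup invented for this sketch.
html = """
<body>
  <a href="/wiki/India">India</a>
  <a href="/wiki/Delhi">Delhi</a>
  <a name="top">No link target</a>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# .get("href") returns None when the attribute is missing.
all_hrefs = [a.get("href") for a in soup.find_all("a")]
print(all_hrefs)  # ['/wiki/India', '/wiki/Delhi', None]

# href=True keeps only the <a> tags that actually have an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)  # ['/wiki/India', '/wiki/Delhi']
```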

  • Find Elements by Class Name

If you want to retrieve data by a particular class name, identifying the right class is critical. The find() method returns the first element that matches a given tag name or class, and find_all() returns every match.

In Chrome, you can check the class name by right-clicking on the required table of the web page and navigating to “Inspect”.

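A hedged sketch of a class lookup follows. The class name wikitable is an assumption about how the target table is marked up, so check it in the Inspect panel first. The keyword is spelled class_ because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Sample markup invented for this sketch; "wikitable" is an assumed class name.
html = """
<table class="plain"><tr><td>ignored</td></tr></table>
<table class="wikitable sortable"><tr><td>wanted</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches if "wikitable" is any one of the element's classes.
table = soup.find("table", class_="wikitable")

print(table["class"])   # ['wikitable', 'sortable']
print(table.td.string)  # wanted
```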

  • Find Elements By ID

In an HTML page, every element can have an id attribute, and each id value should be unique within the page. You can begin to parse your page by selecting a specific element by its ID.

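A minimal sketch of an ID lookup, again on made-up markup. The id values intro and content are placeholders, not names taken from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Sample markup invented for this sketch.
html = '<div id="intro">Welcome</div><div id="content"><p>Main text</p></div>'
soup = BeautifulSoup(html, "html.parser")

content = soup.find(id="content")
print(content.p.string)  # Main text

# find() returns None when nothing matches, so check before using the result.
missing = soup.find(id="does-not-exist")
print(missing)  # None
```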

Putting Our Web Scraper Together

Let's apply everything you have learned above to scrape the information about states and their respective capitals from the Wikipedia page.

First, you need to extract the information from the table. For that, iterate through each row (tr), assign each cell (td) of the row to a variable, and append it to a list.

Let’s look at the code.

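The loop can be sketched as follows. The two-column table below is a simplified stand-in for the Wikipedia table, which has more columns, so the cell positions (0 and 1) and the wikitable class are assumptions to adjust after inspecting the real page.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the real table (assumed layout: state, capital).
html = """
<table class="wikitable">
  <tr><th>State</th><th>Capital</th></tr>
  <tr><td>Karnataka</td><td>Bengaluru</td></tr>
  <tr><td>Kerala</td><td>Thiruvananthapuram</td></tr>
  <tr><td>Tamil Nadu</td><td>Chennai</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) < 2:
        continue  # skip the header row, which has <th> cells only
    state = cells[0].get_text(strip=True)
    capital = cells[1].get_text(strip=True)
    rows.append([state, capital])

print(rows)
```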

Now, let’s convert the list to the DataFrame.

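A short sketch of the conversion, assuming pandas is installed. The rows list is written out here so the block runs on its own, and the column names State and Capital are illustrative.

```python
import pandas as pd

# List of [state, capital] pairs, as produced by the row loop.
rows = [
    ["Karnataka", "Bengaluru"],
    ["Kerala", "Thiruvananthapuram"],
    ["Tamil Nadu", "Chennai"],
]

# Each inner list becomes one row; columns names the two fields.
df = pd.DataFrame(rows, columns=["State", "Capital"])

print(df)
```

From here the data can be filtered, sorted, or saved with the usual pandas tools, for example df.to_csv("capitals.csv", index=False).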

Conclusion

Our Python web scraping tutorial covered some of the basics of scraping data from the web using Python, requests, and Beautiful Soup. We also went through the full web scraping process from start to finish.

If you have any questions, please feel free to ask them in our comments section, and our experts will answer them promptly. 

About the Author

Aryan Gupta

Aryan is a tech enthusiast who likes to stay updated about trending technologies of today. He is passionate about all things technology, a keen researcher, and writes to inspire. Aside from technology, he is an active football player and a keen enthusiast of the game.
