Scraping Open Data from the Web with BeautifulSoup


During the first week of the Research Data and Digital Scholarship Data Jam 2021, we discussed "Sourcing the Data" by "Scraping Open Data from the Web". 

Here are the workshop materials, including the slides and the Python code exercises.

Prerequisites 

The “Web Scraping with BeautifulSoup” workshop assumes that attendees have some knowledge of HTML/CSS and Python. The required software includes Jupyter Notebook and Python packages such as pip, sys, urllib.request, and bs4. 

Web scraping, or crawling, is the process of fetching data from a third-party website by downloading and parsing its HTML code. Here we use the Python urllib.request and Requests libraries, which let us download a web page, and then the Python BeautifulSoup library to extract and parse the relevant parts of the page's HTML or XML. 
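
The whole workflow fits in a few lines. Here is a minimal sketch of the two steps, download then parse (the URL is the Wikipedia page used later in this workshop):

# Step 1: download the page
import requests
page = requests.get('https://en.wikipedia.org/wiki/Fruit_preserves')

# Step 2: parse the HTML and pull out one piece of it
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.title.get_text())   # the page title as plain text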

Farming the HTML tags  

The secret to scraping a webpage lies in the ingredients: the web page being scraped, the browser's inspect developer tool, the tags and the tag branch of the exact section of the page being scraped, and finally, the Python script. 

Structure of a regular web page 

Before we can do web scraping, we need to understand the structure of the web page we're working with and then extract parts of that structure. 

<html>

  <head>
    <title>
    </title>
  </head>

  <body>
    <p>
    </p>
  </body>

</html>
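
BeautifulSoup turns this tree into nested Python objects, so each tag in the skeleton above can be reached by name. A minimal sketch (the HTML string is made up for illustration):

from bs4 import BeautifulSoup

html = '<html><head><title>A page</title></head><body><p>Hello</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.get_text())   # 'A page' -- the <title> inside <head>
print(soup.body.p.get_text())  # 'Hello'  -- the first <p> inside <body>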

 

Python Demo 

  1. Download the latest versions of Python and Anaconda3 for your operating system.  

  2. Open the Anaconda Navigator and select Jupyter Notebook. 

  3. Navigate to the directory where you want to hold your files. Create or upload a notebook, marked by the .ipynb file extension. 

  4. Install the BeautifulSoup4 package, bs4. 
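
One common way to install it is with pip, either from a terminal or directly from a notebook cell; a minimal sketch:

# Install BeautifulSoup4 from a Jupyter Notebook cell (the ! prefix runs a shell command)
!pip install beautifulsoup4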

# Importing Python libraries
import bs4 as BeautifulSoup
import urllib.request  

 

Scenario 1: I want to know more about storing fruits for the winter. We will use Wikipedia to collect the data.

For this workshop we will be using the Wikipedia web page on Fruit preserves: https://en.wikipedia.org/wiki/Fruit_preserves

  1. Inspect the webpage by right-clicking on the required data and selecting Inspect, or by pressing the F12 key.

[Screenshot: the Wikipedia Fruit preserves page with the inspect developer tool open on the fruit jam content]
  2. Get the link 

# Fetching the content from the Wikipedia URL
get_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Fruit_preserves')

read_page = get_data.read()
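
urlopen returns an HTTPResponse object, so it can be worth confirming the request succeeded before reading it; a minimal sketch:

# Check the HTTP status code before working with the page (200 means OK)
print(get_data.status)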
  3. Parse the text data 

# Parsing the Wikipedia URL content and storing the page text
parse_page = BeautifulSoup.BeautifulSoup(read_page,'html.parser')

# Returning all the <p> paragraph tags
paragraphs = parse_page.find_all('p')
page_content = ''
  4. Add text to string as is 

# Looping through each of the paragraphs and adding them to the variable
for p in paragraphs:  
    page_content += p.text
    
    # Make the paragraph tags readable
    print(p.prettify())
  5. Add text to string for paragraphs 

# Looping through each paragraph's text and adding it to the variable

for p in paragraphs:  
    page_content += p.text
    print(p.get_text())
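
After the loop, page_content holds the article text as a single string. A minimal sketch of saving it to a file (the filename is an assumption, not part of the original exercise):

# Write the collected paragraph text to a plain-text file
with open('fruit_preserves.txt', 'w', encoding='utf-8') as out_file:
    out_file.write(page_content)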
Scenario 2: I want to collect all the European Paintings with fruits in them from the Philadelphia Museum of Art artwork collections. 

 

  1. Understand the webpage data.

  2. Import the libraries and fetch the webpage 

import requests
from bs4 import BeautifulSoup

page = requests.get('https://philamuseum.org/collections/results.html?searchTxt=fruit&searchNameID=&searchClassID=&searchOrigin=&searchDeptID=5&keySearch2=&page=1')

soup = BeautifulSoup(page.text, 'html.parser')
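
Before selecting tags, it can help to confirm that the request succeeded; a minimal sketch using the Requests API:

# Stop early with an error if the page did not load successfully
page.raise_for_status()
print(page.status_code)   # 200 means OK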
  3. Selecting the HTML tags 

# Locating the HTML tags for the hyperlinks
art = soup.find(class_='pinch')
art_objects = art.find_all('a')

for artwork in art_objects:
    links = artwork.get('href')
    print(links)

# Collect full links by adding the site's base URL in front of each relative href
for artwork in art_objects:
    links = 'https://philamuseum.org' + artwork.get('href')
    print(links)
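
Because find_all('a') returns full tag objects, the visible link text can be pulled out alongside each href; a small sketch of the same loop:

# Print each artwork's link text next to its relative URL
for artwork in art_objects:
    print(artwork.get_text(strip=True), artwork.get('href'))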
  4. Export to CSV (Comma Separated Value) table

import csv
import requests
from bs4 import BeautifulSoup

page = requests.get('https://philamuseum.org/collections/results.html?searchTxt=fruit&searchNameID=&searchClassID=&searchOrigin=&searchDeptID=5&keySearch2=&page=1')

soup = BeautifulSoup(page.text, 'html.parser')

art = soup.find(class_='pinch')
art_objects = art.find_all('a')

# Open a csv file to write the output in
with open('pma.csv', 'w', newline='') as csv_file:
    f = csv.writer(csv_file)

    for artwork in art_objects:
        links = 'https://philamuseum.org' + artwork.get('href')

        # Insert each iteration's output into the csv file
        f.writerow([links])
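
A small extension of the same pattern can also write a header row and store the link text next to each URL (the filename and column names here are assumptions for illustration):

# Write a two-column CSV: link text and full URL, with a header row
with open('pma_with_titles.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['title', 'url'])

    for artwork in art_objects:
        writer.writerow([artwork.get_text(strip=True),
                         'https://philamuseum.org' + artwork.get('href')])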
 

Why Scrape Data? 

  • Open Data Collection from obsolete or expired websites 

  • Open Data Access in the absence of an API (Application Programming Interface) 

  • Automated real-time data harvesting 

  • Data Aggregation 

  • Data Monitoring 

Challenges 

  • Effective web scraping requires script customization, error handling, data cleaning, and results storage, which makes the process time- and resource-intensive. 

  • Websites are dynamic and always in development. When a website's content changes, the tags your script selects may change too. This is especially important to consider in the case of real-time data scraping. 

  • Python package updates can sometimes result in unstable scripts. 

  • It is extremely important to throttle your requests or cap them at a specific rate so that you do not overload the server (a small sketch follows this list). 
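
A minimal sketch of one way to throttle requests by pausing between them (the two-second delay and the URL list are assumptions for illustration, not a Penn Libraries requirement):

import time
import requests

urls = ['https://example.org/page1', 'https://example.org/page2']

for url in urls:
    page = requests.get(url)
    print(page.status_code, url)
    # Pause between requests so the server is not flooded
    time.sleep(2)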

Spoonfuls of Data Help the Servers NOT Go Down 

Here are resources for learning more about the University of Pennsylvania Libraries' policy on server usage:

Alternatives 

  • Scrapy – Framework for large-scale web scraping 

  • Selenium – Browser automation library, useful for scraping JavaScript-rendered pages 

  • rvest – R package for multi-page web scraping 

  • API (Application Programming Interface) – Direct data requests from the source 

  • DOM (Document Object Model) parsing 

  • Pattern matching text – Using regex (regular expressions) for selecting text 

  • Manual copy-pasting 

Resources 

[Slide: "Data Jam 2021 Web Scraping with BeautifulSoup", with the Penn Libraries logo and the Research Data and Digital Scholarship Data Jam logo (a cat in a jam jar), licensed CC BY 4.0]