Scraping Open Data from the Web with BeautifulSoup


During the first week of the Research Data and Digital Scholarship Data Jam 2021, we discussed "Sourcing the Data" by "Scraping Open Data from the Web". 

Here are the workshop materials, including the slides and the Python code exercises.

Prerequisites 

The “Web Scraping with BeautifulSoup” workshop assumes that attendees have some knowledge of HTML/CSS and Python. The required software includes Jupyter Notebook and Python packages such as pip, sys, urllib.request, and bs4. 

Web scraping, or crawling, is the process of fetching data from a third-party website by downloading and parsing its HTML code. Here we use the Python urllib.request and Requests libraries, which let us download a web page, and then the Python BeautifulSoup library to extract and parse the relevant parts of the page's HTML or XML. 
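
The whole workflow fits in a few lines. Here is a minimal sketch of the two steps, download then parse (the URL is the Wikipedia page used later in this workshop):

# Step 1: download the page
import requests
page = requests.get('https://en.wikipedia.org/wiki/Fruit_preserves')

# Step 2: parse the HTML and pull out one piece of it
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.title.get_text())   # the page title as plain text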

Farming the HTML tags  

The secret to scraping a webpage lies in the ingredients: the web page being scraped, the browser's inspect developer tool, the tags and the tag branch of the exact section of the page being scraped, and finally, the Python script. 

Structure of a regular web page 

Before we can do web scraping, we need to understand the structure of the web page we're working with and then extract parts of that structure. 

<html>

  <head>
    <title>
    </title>
  </head>

  <body>
    <p>
    </p>
  </body>

</html>
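
BeautifulSoup turns this tree into nested Python objects, so each tag in the skeleton above can be reached by name. A minimal sketch (the HTML string is made up for illustration):

from bs4 import BeautifulSoup

html = '<html><head><title>A page</title></head><body><p>Hello</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.get_text())   # 'A page' -- the <title> inside <head>
print(soup.body.p.get_text())  # 'Hello'  -- the first <p> inside <body>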

 

Python Demo 

  1. Download the latest versions of Python and Anaconda3 for your operating system.  

  2. Open the Anaconda Navigator and select Jupyter Notebook. 

  3. Navigate to the directory where you want to hold your files. Create or upload a notebook, marked by the .ipynb file extension. 

  4. Install the BeautifulSoup4 package, bs4. 
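
One common way to install it is with pip, either from a terminal or directly from a notebook cell; a minimal sketch:

# Install BeautifulSoup4 from a Jupyter Notebook cell (the ! prefix runs a shell command)
!pip install beautifulsoup4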

# Importing Python libraries
import bs4 as BeautifulSoup
import urllib.request  

 

Scenario 1: I want to know more about storing fruits for the winter. We will use Wikipedia to collect the data.

For this workshop we will be using the Wikipedia web page on Fruit preserves: https://en.wikipedia.org/wiki/Fruit_preserves

  1. Inspect the webpage by right-clicking on the required data and selecting Inspect, or by pressing the F12 key.

[Screenshot: the Wikipedia Fruit preserves page with the inspect developer tool open on the fruit jam content]
  2. Get the link 

# Fetching the content from the Wikipedia URL
get_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Fruit_preserves')

read_page = get_data.read()
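
urlopen returns an HTTPResponse object, so it can be worth confirming the request succeeded before reading it; a minimal sketch:

# Check the HTTP status code before working with the page (200 means OK)
print(get_data.status)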
  3. Parse the text data 

# Parsing the Wikipedia URL content and storing the page text
parse_page = BeautifulSoup.BeautifulSoup(read_page,'html.parser')

# Returning all the <p> paragraph tags
paragraphs = parse_page.find_all('p')
page_content = ''
  4. Add text to string as is 

# Looping through each of the paragraphs and adding them to the variable
for p in paragraphs:  
    page_content += p.text
    
    # Make the paragraph tags readable
    print(p.prettify())
  5. Add text to string for paragraphs 

# Looping through each paragraph's text and adding it to the variable

for p in paragraphs:  
    page_content += p.text
    print(p.get_text())
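
After the loop, page_content holds the article text as a single string. A minimal sketch of saving it to a file (the filename is an assumption, not part of the original exercise):

# Write the collected paragraph text to a plain-text file
with open('fruit_preserves.txt', 'w', encoding='utf-8') as out_file:
    out_file.write(page_content)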
Scenario 2: I want to collect all the European Paintings with fruits in them from the Philadelphia Museum of Art artwork collections. 

 

  1. Understand the webpage data.

  2. Import the libraries and fetch the webpage 

import requests
from bs4 import BeautifulSoup

page = requests.get('https://philamuseum.org/collections/results.html?searchTxt=fruit&searchNameID=&searchClassID=&searchOrigin=&searchDeptID=5&keySearch2=&page=1')

soup = BeautifulSoup(page.text, 'html.parser')
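
Before selecting tags, it can help to confirm that the request succeeded; a minimal sketch using the Requests API:

# Stop early with an error if the page did not load successfully
page.raise_for_status()
print(page.status_code)   # 200 means OK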
  3. Selecting the HTML tags 

# Locating the HTML tags for the hyperlinks
art = soup.find(class_='pinch')
art_objects = art.find_all('a')

for artwork in art_objects:
    links = artwork.get('href')
    print(links)

# Collect full links by adding the site's base URL in front of each relative href
for artwork in art_objects:
    links = 'https://philamuseum.org' + artwork.get('href')
    print(links)
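
Because find_all('a') returns full tag objects, the visible link text can be pulled out alongside each href; a small sketch of the same loop:

# Print each artwork's link text next to its relative URL
for artwork in art_objects:
    print(artwork.get_text(strip=True), artwork.get('href'))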
  4. Export to CSV (Comma Separated Value) table

import csv
import requests
from bs4 import BeautifulSoup

page = requests.get('https://philamuseum.org/collections/results.html?searchTxt=fruit&searchNameID=&searchClassID=&searchOrigin=&searchDeptID=5&keySearch2=&page=1')

soup = BeautifulSoup(page.text, 'html.parser')

art = soup.find(class_='pinch')
art_objects = art.find_all('a')

# Open a csv file to write the output in
with open('pma.csv', 'w', newline='') as csv_file:
    f = csv.writer(csv_file)

    for artwork in art_objects:
        links = 'https://philamuseum.org' + artwork.get('href')

        # Insert each iteration's output into the csv file
        f.writerow([links])
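
A small extension of the same pattern can also write a header row and store the link text next to each URL (the filename and column names here are assumptions for illustration):

# Write a two-column CSV: link text and full URL, with a header row
with open('pma_with_titles.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['title', 'url'])

    for artwork in art_objects:
        writer.writerow([artwork.get_text(strip=True),
                         'https://philamuseum.org' + artwork.get('href')])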
 

Why Scrape Data? 

  • Open Data Collection from obsolete or expired websites 

  • Open Data Access in the absence of an API (Application Programming Interface) 

  • Automated real-time data harvesting 

  • Data Aggregation 

  • Data Monitoring 

Challenges 

  • Effective web scraping requires script customization, error handling, data cleaning, and results storage, which makes the process time- and resource-intensive. 

  • Websites are dynamic and always in development. When a website's content changes, the tags your script selects may change too. This is especially important to consider in the case of real-time data scraping. 

  • Python package updates can sometimes result in unstable scripts. 

  • It is extremely important to throttle your requests or cap them at a specific rate so that you do not overload the server (a small sketch follows this list). 
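
A minimal sketch of one way to throttle requests by pausing between them (the two-second delay and the URL list are assumptions for illustration, not a Penn Libraries requirement):

import time
import requests

urls = ['https://example.org/page1', 'https://example.org/page2']

for url in urls:
    page = requests.get(url)
    print(page.status_code, url)
    # Pause between requests so the server is not flooded
    time.sleep(2)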

Spoonfuls of Data Help the Servers NOT Go Down 

Here are resources for learning more about the University of Pennsylvania Libraries' policy on server usage:

Alternatives 

  • Scrapy – Framework for large-scale web scraping 

  • Selenium – Browser automation library, useful for scraping JavaScript-rendered pages 

  • rvest – R package for multi-page web scraping 

  • API (Application Programming Interface) – Direct data requests from the source 

  • DOM (Document Object Model) parsing 

  • Pattern matching text – Using regex (regular expressions) for selecting text 

  • Manual copy-pasting 

Resources 

[Slide: "Data Jam 2021 Web Scraping with BeautifulSoup", with the Penn Libraries logo and the Research Data and Digital Scholarship Data Jam logo (a cat in a jam jar), licensed CC BY 4.0]