Web Scraping Trump Lies (Using BeautifulSoup)

Beautiful Soup Methods & Attributes

You can apply these two methods to either the initial soup object or a Tag object (such as first_result):

  • find(): searches for the first matching tag, and returns a Tag object
  • find_all(): searches for all matching tags, and returns a ResultSet object (which you can treat like a list of Tags)

You can extract information from a Tag object (such as first_result) using these two attributes:

  • text: extracts the text of a Tag, and returns a string
  • contents: extracts the children of a Tag, and returns a list of Tags and strings

And of course, there are many more methods and attributes available to you, which are described in the Beautiful Soup documentation.
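For instance, here's a minimal, self-contained sketch of all four tools on a made-up snippet (the snippet and the commented output are illustrative, not taken from the article):

from bs4 import BeautifulSoup

# a tiny, made-up snippet just for illustration
demo = BeautifulSoup('<p><strong>Hello </strong>world <a href="http://example.com">link</a></p>', 'html.parser')

demo.find('strong')        # first matching tag -> Tag object: <strong>Hello </strong>
demo.find_all('a')         # all matching tags -> ResultSet (list-like): [<a ...>link</a>]
demo.find('strong').text   # text of a Tag -> string: 'Hello '
demo.find('p').contents    # children of a Tag -> list: [<strong>Hello </strong>, 'world ', <a ...>link</a>]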

Examining the New York Times Article

The article presents each lie inline: the date in bold, followed by the lie itself as a quotation, followed by the writer's brief explanation as a link in parentheses.

When converting this into a dataset, you can think of each lie as a "record" with four fields:

  1. The date of the lie.
  2. The lie itself (as a quotation).
  3. The writer's brief explanation of why it was a lie.
  4. The URL of an article that substantiates the claim that it was a lie.

In [1]:
import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('span', attrs={'class':'short-desc'})

records = []
for result in results:
    # slice off the trailing \xa0 escape sequence and add the year
    date = result.find('strong').text[0:-1] + ', 2017'
    # the second child is the lie itself; remove the curly quotes and trailing space
    lie = result.contents[1][1:-2]
    # remove the parentheses around the explanation
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append([date, lie, explanation, url])

import pandas as pd
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])
df['date'] = pd.to_datetime(df['date'])
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')
df = pd.read_csv('trump_lies.csv')
df.head()
Out[1]:
date lie explanation url
0 2017-01-21 I wasn't a fan of Iraq. I didn't want to go in... He was for an invasion before he was against it. https://www.buzzfeed.com/andrewkaczynski/in-20...
1 2017-01-21 A reporter for Time magazine — and I have been... Trump was on the cover 11 times and Nixon appe... http://nation.time.com/2013/11/06/10-things-yo...
2 2017-01-23 Between 3 million and 5 million illegal votes ... There's no evidence of illegal voting. https://www.nytimes.com/2017/01/23/us/politics...
3 2017-01-25 Now, the audience was the biggest ever. But th... Official aerial photos show Obama's 2009 inaug... https://www.nytimes.com/2017/01/21/us/politics...
4 2017-01-25 Take a look at the Pew reports (which show vot... The report never mentioned voter fraud. https://www.nytimes.com/2017/01/24/us/politics...

Let's Dive into Details

Installing Requests

In [2]:
!pip install requests
Requirement already satisfied: requests in /home/bahar/anaconda3/lib/python3.7/site-packages (2.24.0)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /home/bahar/anaconda3/lib/python3.7/site-packages (from requests) (1.25.9)
Requirement already satisfied: idna<3,>=2.5 in /home/bahar/anaconda3/lib/python3.7/site-packages (from requests) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /home/bahar/anaconda3/lib/python3.7/site-packages (from requests) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /home/bahar/anaconda3/lib/python3.7/site-packages (from requests) (2020.6.20)

Reading the Web Page into Python

In [3]:
import requests

r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

Printing the HTML

  • The response object (r) has a text attribute, which contains the same HTML code you would see by viewing the page source in a browser
In [4]:
print(r.text[0:500])
<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page

Installing Beautiful Soup

In [5]:
!pip install beautifulsoup4
Requirement already satisfied: beautifulsoup4 in /home/bahar/anaconda3/lib/python3.7/site-packages (4.9.1)
Requirement already satisfied: soupsieve>1.2 in /home/bahar/anaconda3/lib/python3.7/site-packages (from beautifulsoup4) (2.0.1)

Parsing the HTML using Beautiful Soup

Note that html.parser is the parser included with the Python standard library; Beautiful Soup also supports third-party parsers such as lxml and html5lib, which differ in speed and in how they handle malformed HTML (see "Differences between parsers" in the Beautiful Soup documentation).
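As a small illustration of why the parser choice matters, here's a hedged sketch comparing how two parsers handle the same malformed fragment (lxml must be installed separately, and the exact output can vary by version):

from bs4 import BeautifulSoup

broken = '<a></p>'  # deliberately malformed HTML

# the standard library parser leaves the fragment roughly as-is
print(BeautifulSoup(broken, 'html.parser'))  # <a></a>

# lxml (a third-party parser) wraps it in the missing html/body structure
print(BeautifulSoup(broken, 'lxml'))         # <html><body><a></a></body></html>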

In [6]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'html.parser')

Collecting all of the records

You might have noticed that each record has the following format:

<span class="short-desc"><strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span></span>
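To make the structure concrete, here's a minimal sketch that parses one made-up record in this format and pulls out the four fields (LIE, EXPLANATION, and the URL below are placeholders, not data from the article):

from bs4 import BeautifulSoup

# a made-up record in the same format as the article's records
snippet = ('<span class="short-desc"><strong>Jan. 1 </strong>'
           '\u201cLIE.\u201d '
           '<span class="short-truth"><a href="http://example.com">(EXPLANATION.)</a></span></span>')

record = BeautifulSoup(snippet, 'html.parser').find('span', attrs={'class': 'short-desc'})
record.find('strong').text    # 'Jan. 1 '             -> the DATE
record.contents[1]            # '“LIE.” '             -> the LIE
record.find('a')['href']      # 'http://example.com'  -> the URL
record.find('a').text         # '(EXPLANATION.)'      -> the EXPLANATION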

In [7]:
results = soup.find_all('span', attrs={'class':'short-desc'})

Length of the Results

In [8]:
len(results)
Out[8]:
180

The First Three Results

In [9]:
results[0:3]
Out[9]:
[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_blank">(There's no evidence of illegal voting.)</a></span></span>]

The Last Result

In [10]:
results[-1]
Out[10]:
<span class="short-desc"><strong>Nov. 11 </strong>“I'd rather have him  – you know, work with him on the Ukraine than standing and arguing about whether or not  – because that whole thing was set up by the Democrats.” <span class="short-truth"><a href="https://www.nytimes.com/interactive/2017/12/10/us/politics/trump-and-russia.html" target="_blank">(There is no evidence that Democrats "set up" Russian interference in the election.)</a></span></span>

First Result

In [11]:
first_result = results[0]

first_result
Out[11]:
<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

Extracting the Date

In [12]:
first_result.find('strong')
Out[12]:
<strong>Jan. 21 </strong>
In [13]:
first_result.contents[0]
Out[13]:
<strong>Jan. 21 </strong>

Extracting the Date (the text between the opening and closing tags)

What is \xa0? You don't actually need to know this, but it's an "escape sequence" that represents the &nbsp; (non-breaking space) character in the HTML source.

However, you do need to know that an escape sequence represents a single character within a string. Let's slice it off from the end of the string:

In [14]:
first_result.find('strong').text
Out[14]:
'Jan. 21\xa0'
In [15]:
first_result.contents[0].text
Out[15]:
'Jan. 21\xa0'

Removing Escape Sequence

In [16]:
first_result.find('strong').text[0:-1]
Out[16]:
'Jan. 21'
In [17]:
first_result.contents[0].text[0:-1]
Out[17]:
'Jan. 21'
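Slicing by position works here because the date always ends with exactly one \xa0. As an alternative sketch (not the approach used above), str.strip() also removes \xa0, since Python treats it as whitespace:

# alternative: str.strip() treats \xa0 as whitespace and removes it
first_result.find('strong').text.strip()   # 'Jan. 21'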

Adding Year

In [18]:
first_result.find('strong').text[0:-1] + ', 2017'
Out[18]:
'Jan. 21, 2017'
In [19]:
first_result.contents[0].text[0:-1] + ', 2017'
Out[19]:
'Jan. 21, 2017'

Extracting the Contents (Lie)

In [20]:
first_result.contents[1]
Out[20]:
"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

Removing the Curly Quotation Marks & the Extra Space

In [21]:
first_result.contents[1][1:-2]
Out[21]:
"I wasn't a fan of Iraq. I didn't want to go into Iraq."

Extracting the URL & the Explanation

In [22]:
first_result.contents[2]
Out[22]:
<span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>

Extracting the URL

In [23]:
first_result.find('a')['href']
Out[23]:
'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'

Extracting the Text

In [24]:
first_result.find('a').text[1:-1]
Out[24]:
'He was for an invasion before he was against it.'

Building the dataset

Now that we've figured out how to extract the four components of first_result, we can create a loop to repeat this process on all 180 results. We'll store the output in a list of lists called records:

In [25]:
records = []
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append([date, lie, explanation, url])
    

print("Record Length: ", len(records))
print(records[0:3])
Record Length:  180
[['Jan. 21, 2017', "I wasn't a fan of Iraq. I didn't want to go into Iraq.", 'He was for an invasion before he was against it.', 'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'], ['Jan. 21, 2017', 'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.', 'Trump was on the cover 11 times and Nixon appeared 55 times.', 'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'], ['Jan. 23, 2017', 'Between 3 million and 5 million illegal votes caused me to lose the popular vote.', "There's no evidence of illegal voting.", 'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html']]

Creating a Dataframe

In [26]:
import pandas as pd

df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])

df
Out[26]:
date lie explanation url
0 Jan. 21, 2017 I wasn't a fan of Iraq. I didn't want to go in... He was for an invasion before he was against it. https://www.buzzfeed.com/andrewkaczynski/in-20...
1 Jan. 21, 2017 A reporter for Time magazine — and I have been... Trump was on the cover 11 times and Nixon appe... http://nation.time.com/2013/11/06/10-things-yo...
2 Jan. 23, 2017 Between 3 million and 5 million illegal votes ... There's no evidence of illegal voting. https://www.nytimes.com/2017/01/23/us/politics...
3 Jan. 25, 2017 Now, the audience was the biggest ever. But th... Official aerial photos show Obama's 2009 inaug... https://www.nytimes.com/2017/01/21/us/politics...
4 Jan. 25, 2017 Take a look at the Pew reports (which show vot... The report never mentioned voter fraud. https://www.nytimes.com/2017/01/24/us/politics...
... ... ... ... ...
175 Oct. 25, 2017 We have trade deficits with almost everybody. We have trade surpluses with more than 100 cou... https://www.bea.gov/newsreleases/international...
176 Oct. 27, 2017 Wacky & totally unhinged Tom Steyer, who has b... Steyer has financially supported many winning ... https://www.opensecrets.org/donor-lookup/resul...
177 Nov. 1, 2017 Again, we're the highest-taxed nation, just ab... We're not. http://www.politifact.com/truth-o-meter/statem...
178 Nov. 7, 2017 When you look at the city with the strongest g... Several other cities, including New York and L... http://www.politifact.com/truth-o-meter/statem...
179 Nov. 11, 2017 I'd rather have him – you know, work with him... There is no evidence that Democrats "set up" R... https://www.nytimes.com/interactive/2017/12/10...

180 rows × 4 columns

Converting Datetime

In [27]:
df['date'] = pd.to_datetime(df['date'])

df.head()
Out[27]:
date lie explanation url
0 2017-01-21 I wasn't a fan of Iraq. I didn't want to go in... He was for an invasion before he was against it. https://www.buzzfeed.com/andrewkaczynski/in-20...
1 2017-01-21 A reporter for Time magazine — and I have been... Trump was on the cover 11 times and Nixon appe... http://nation.time.com/2013/11/06/10-things-yo...
2 2017-01-23 Between 3 million and 5 million illegal votes ... There's no evidence of illegal voting. https://www.nytimes.com/2017/01/23/us/politics...
3 2017-01-25 Now, the audience was the biggest ever. But th... Official aerial photos show Obama's 2009 inaug... https://www.nytimes.com/2017/01/21/us/politics...
4 2017-01-25 Take a look at the Pew reports (which show vot... The report never mentioned voter fraud. https://www.nytimes.com/2017/01/24/us/politics...

Exporting the Dataset to a CSV File

In [28]:
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')

Reading CSV File

In [29]:
df = pd.read_csv('trump_lies.csv')

df.head()
Out[29]:
date lie explanation url
0 2017-01-21 I wasn't a fan of Iraq. I didn't want to go in... He was for an invasion before he was against it. https://www.buzzfeed.com/andrewkaczynski/in-20...
1 2017-01-21 A reporter for Time magazine — and I have been... Trump was on the cover 11 times and Nixon appe... http://nation.time.com/2013/11/06/10-things-yo...
2 2017-01-23 Between 3 million and 5 million illegal votes ... There's no evidence of illegal voting. https://www.nytimes.com/2017/01/23/us/politics...
3 2017-01-25 Now, the audience was the biggest ever. But th... Official aerial photos show Obama's 2009 inaug... https://www.nytimes.com/2017/01/21/us/politics...
4 2017-01-25 Take a look at the Pew reports (which show vot... The report never mentioned voter fraud. https://www.nytimes.com/2017/01/24/us/politics...

Appendix A: Web scraping advice

  • Web scraping works best with static, well-structured web pages. Dynamic or interactive content on a web page is often not accessible through the HTML source, which makes scraping it much harder!
  • Web scraping is a "fragile" approach for building a dataset. The HTML on a page you are scraping can change at any time, which may cause your scraper to stop working.
  • If you can download the data you need from a website, or if the website provides an API with data access, those approaches are preferable to scraping since they are easier to implement and less likely to break.
  • If you are scraping many pages from the same website in rapid succession, it's best to insert delays in your code so that you don't overwhelm the website with requests. If the website decides you are causing a problem, it can block your IP address (which may affect everyone in your building!).
  • Before scraping a website, you should review its robots.txt file (also known as the Robots exclusion standard) to check whether you are "allowed" to scrape the site. (Here is the robots.txt file for nytimes.com.) A minimal sketch of both the delay and the robots.txt check follows this list.
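Here's that sketch, assuming a hypothetical list of pages to fetch (urllib.robotparser is part of the Python standard library):

import time
import urllib.robotparser
import requests

# check robots.txt before scraping
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.nytimes.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html'))

# insert a delay between successive requests to the same site
urls = ['https://example.com/page1', 'https://example.com/page2']  # hypothetical URLs
for url in urls:
    r = requests.get(url)
    time.sleep(1)  # one second between requests; adjust as appropriate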

Appendix B: Web scraping resources

Appendix C: Alternative syntax for Beautiful Soup

It's worth noting that Beautiful Soup actually offers multiple ways to express the same command. I tend to use the most verbose option, since I think it makes the code more readable, but it's useful to be able to recognize the alternative syntax since you might see it used elsewhere.

For example, you can search for a tag by accessing it like an attribute:

In [30]:
# search for a tag by name
first_result.find('strong')

# shorter alternative: access it like an attribute
first_result.strong
Out[30]:
<strong>Jan. 21 </strong>

You can also search for multiple tags a few different ways:

In [31]:
# search for multiple tags by name and attribute
results = soup.find_all('span', attrs={'class':'short-desc'})

# shorter alternative: if you don't specify a method, it's assumed to be find_all()
results = soup('span', attrs={'class':'short-desc'})

# even shorter alternative: you can specify the attribute as if it's a parameter
results = soup('span', class_='short-desc')
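Beautiful Soup also supports CSS selectors through the select() method, which is yet another equivalent way to express the same search:

# another alternative: CSS selector syntax via select()
results = soup.select('span.short-desc')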

For more details, check out the Beautiful Soup documentation.