Web Scraping Trump Lies (Using BeautifulSoup)

Beautiful Soup Methods & Attributes

You can apply these two methods to either the initial soup object or a Tag object (such as first_result):

  • find(): searches for the first matching tag, and returns a Tag object
  • find_all(): searches for all matching tags, and returns a ResultSet object (which you can treat like a list of Tags)

You can extract information from a Tag object (such as first_result) using these two attributes:

  • text: extracts the text of a Tag, and returns a string
  • contents: extracts the children of a Tag, and returns a list of Tags and strings

And of course, there are many more methods and attributes available to you, which are described in the Beautiful Soup documentation.
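For instance, here's a minimal, self-contained sketch of all four tools on a made-up snippet (the snippet and the commented output are illustrative, not taken from the article):

from bs4 import BeautifulSoup

# a tiny, made-up snippet just for illustration
demo = BeautifulSoup('<p><strong>Hello </strong>world <a href="http://example.com">link</a></p>', 'html.parser')

demo.find('strong')        # first matching tag -> Tag object: <strong>Hello </strong>
demo.find_all('a')         # all matching tags -> ResultSet (list-like): [<a ...>link</a>]
demo.find('strong').text   # text of a Tag -> string: 'Hello '
demo.find('p').contents    # children of a Tag -> list: [<strong>Hello </strong>, 'world ', <a ...>link</a>]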

Examining the New York Times Article

The article presents each lie inline: the date in bold, followed by the lie itself as a quotation, followed by the writer's brief explanation as a link in parentheses.

When converting this into a dataset, you can think of each lie as a "record" with four fields:

  1. The date of the lie.
  2. The lie itself (as a quotation).
  3. The writer's brief explanation of why it was a lie.
  4. The URL of an article that substantiates the claim that it was a lie.

In [1]:
import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('span', attrs={'class':'short-desc'})

records = []
for result in results:
    # slice off the trailing \xa0 escape sequence and add the year
    date = result.find('strong').text[0:-1] + ', 2017'
    # the second child is the lie itself; remove the curly quotes and trailing space
    lie = result.contents[1][1:-2]
    # remove the parentheses around the explanation
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append([date, lie, explanation, url])

import pandas as pd
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])
df['date'] = pd.to_datetime(df['date'])
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')
df = pd.read_csv('trump_lies.csv')
df.head()
Out[1]:
date lie explanation url
0 2017-01-21 I wasn't a fan of Iraq. I didn't want to go in... He was for an invasion before he was against it. https://www.buzzfeed.com/andrewkaczynski/in-20...
1 2017-01-21 A reporter for Time magazine — and I have been... Trump was on the cover 11 times and Nixon appe... http://nation.time.com/2013/11/06/10-things-yo...
2 2017-01-23 Between 3 million and 5 million illegal votes ... There's no evidence of illegal voting. https://www.nytimes.com/2017/01/23/us/politics...
3 2017-01-25 Now, the audience was the biggest ever. But th... Official aerial photos show Obama's 2009 inaug... https://www.nytimes.com/2017/01/21/us/politics...
4 2017-01-25 Take a look at the Pew reports (which show vot... The report never mentioned voter fraud. https://www.nytimes.com/2017/01/24/us/politics...

Let's Dive into Details

Installing Requests

In [2]:
!pip install requests
Requirement already satisfied: requests in /home/bahar/anaconda3/lib/python3.7/site-packages (2.24.0)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /home/bahar/anaconda3/lib/python3.7/site-packages (from requests) (1.25.9)
Requirement already satisfied: idna<3,>=2.5 in /home/bahar/anaconda3/lib/python3.7/site-packages (from requests) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /home/bahar/anaconda3/lib/python3.7/site-packages (from requests) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /home/bahar/anaconda3/lib/python3.7/site-packages (from requests) (2020.6.20)

Reading the Web Page into Python

In [3]:
import requests

r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

Printing the HTML

  • The response object (r) has a text attribute, which contains the same HTML code you would see by viewing the page source in a browser
In [4]:
print(r.text[0:500])
<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page

Installing Beautiful Soup

In [5]:
!pip install beautifulsoup4
Requirement already satisfied: beautifulsoup4 in /home/bahar/anaconda3/lib/python3.7/site-packages (4.9.1)
Requirement already satisfied: soupsieve>1.2 in /home/bahar/anaconda3/lib/python3.7/site-packages (from beautifulsoup4) (2.0.1)

Parsing the HTML using Beautiful Soup

Note that html.parser is the parser included with the Python standard library; Beautiful Soup also supports third-party parsers such as lxml and html5lib, which differ in speed and in how they handle malformed HTML (see "Differences between parsers" in the Beautiful Soup documentation).
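As a small illustration of why the parser choice matters, here's a hedged sketch comparing how two parsers handle the same malformed fragment (lxml must be installed separately, and the exact output can vary by version):

from bs4 import BeautifulSoup

broken = '<a></p>'  # deliberately malformed HTML

# the standard library parser leaves the fragment roughly as-is
print(BeautifulSoup(broken, 'html.parser'))  # <a></a>

# lxml (a third-party parser) wraps it in the missing html/body structure
print(BeautifulSoup(broken, 'lxml'))         # <html><body><a></a></body></html>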

In [6]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'html.parser')

Collecting all of the records

You might have noticed that each record has the following format:

<span class="short-desc"><strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span></span>
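To make the structure concrete, here's a minimal sketch that parses one made-up record in this format and pulls out the four fields (LIE, EXPLANATION, and the URL below are placeholders, not data from the article):

from bs4 import BeautifulSoup

# a made-up record in the same format as the article's records
snippet = ('<span class="short-desc"><strong>Jan. 1 </strong>'
           '\u201cLIE.\u201d '
           '<span class="short-truth"><a href="http://example.com">(EXPLANATION.)</a></span></span>')

record = BeautifulSoup(snippet, 'html.parser').find('span', attrs={'class': 'short-desc'})
record.find('strong').text    # 'Jan. 1 '             -> the DATE
record.contents[1]            # '“LIE.” '             -> the LIE
record.find('a')['href']      # 'http://example.com'  -> the URL
record.find('a').text         # '(EXPLANATION.)'      -> the EXPLANATION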

In [7]:
results = soup.find_all('span', attrs={'class':'short-desc'})

Length of the Results

In [8]:
len(results)
Out[8]:
180

The First Three Results

In [9]:
results[0:3]
Out[9]:
[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_blank">(There's no evidence of illegal voting.)</a></span></span>]

The Last Result

In [10]:
results[-1]
Out[10]:
<span class="short-desc"><strong>Nov. 11 </strong>“I'd rather have him  – you know, work with him on the Ukraine than standing and arguing about whether or not  – because that whole thing was set up by the Democrats.” <span class="short-truth"><a href="https://www.nytimes.com/interactive/2017/12/10/us/politics/trump-and-russia.html" target="_blank">(There is no evidence that Democrats "set up" Russian interference in the election.)</a></span></span>

First Result

In [11]:
first_result = results[0]

first_result
Out[11]:
<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

Extracting the Date

In [12]:
first_result.find('strong')
Out[12]:
<strong>Jan. 21 </strong>
In [13]:
first_result.contents[0]
Out[13]:
<strong>Jan. 21 </strong>

Extracting the Date (the text between the opening and closing tags)

What is \xa0? You don't actually need to know this, but it's an "escape sequence" that represents the &nbsp; (non-breaking space) character in the HTML source.

However, you do need to know that an escape sequence represents a single character within a string. Let's slice it off from the end of the string:

In [14]:
first_result.find('strong').text
Out[14]:
'Jan. 21\xa0'
In [15]:
first_result.contents[0].text
Out[15]:
'Jan. 21\xa0'

Removing Escape Sequence

In [16]:
first_result.find('strong').text[0:-1]
Out[16]:
'Jan. 21'
In [17]:
first_result.contents[0].text[0:-1]
Out[17]:
'Jan. 21'
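Slicing by position works here because the date always ends with exactly one \xa0. As an alternative sketch (not the approach used above), str.strip() also removes \xa0, since Python treats it as whitespace:

# alternative: str.strip() treats \xa0 as whitespace and removes it
first_result.find('strong').text.strip()   # 'Jan. 21'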

Adding Year

In [18]:
first_result.find('strong').text[0:-1] + ', 2017'
Out[18]:
'Jan. 21, 2017'
In [19]:
first_result.contents[0].text[0:-1] + ', 2017'
Out[19]:
'Jan. 21, 2017'

Extracting the Contents (Lie)

In [20]:
first_result.contents[1]
Out[20]:
"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

Removing the Curly Quotation Marks & the Extra Space

In [21]:
first_result.contents[1][1:-2]
Out[21]:
"I wasn't a fan of Iraq. I didn't want to go into Iraq."

Extracting the URL & the Explanation

In [22]:
first_result.contents[2]
Out[22]:
<span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>

Extracting the URL

In [23]:
first_result.find('a')['href']
Out[23]:
'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'

Extracting the Text

In [24]:
first_result.find('a').text[1:-1]
Out[24]:
'He was for an invasion before he was against it.'

Building the dataset

Now that we've figured out how to extract the four components of first_result, we can create a loop to repeat this process on all 180 results. We'll store the output in a list of lists called records:

In [25]:
records = []
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append([date, lie, explanation, url])
    

print("Record Length: ", len(records))
print(records[0:3])
Record Length:  180
[['Jan. 21, 2017', "I wasn't a fan of Iraq. I didn't want to go into Iraq.", 'He was for an invasion before he was against it.', 'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'], ['Jan. 21, 2017', 'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.', 'Trump was on the cover 11 times and Nixon appeared 55 times.', 'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'], ['Jan. 23, 2017', 'Between 3 million and 5 million illegal votes caused me to lose the popular vote.', "There's no evidence of illegal voting.", 'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html']]

Creating a Dataframe

In [26]:
import pandas as pd

df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])

df
Out[26]:
date lie explanation url
0 Jan. 21, 2017 I wasn't a fan of Iraq. I didn't want to go in... He was for an invasion before he was against it. https://www.buzzfeed.com/andrewkaczynski/in-20...
1 Jan. 21, 2017 A reporter for Time magazine — and I have been... Trump was on the cover 11 times and Nixon appe... http://nation.time.com/2013/11/06/10-things-yo...
2 Jan. 23, 2017 Between 3 million and 5 million illegal votes ... There's no evidence of illegal voting. https://www.nytimes.com/2017/01/23/us/politics...
3 Jan. 25, 2017 Now, the audience was the biggest ever. But th... Official aerial photos show Obama's 2009 inaug... https://www.nytimes.com/2017/01/21/us/politics...
4 Jan. 25, 2017 Take a look at the Pew reports (which show vot... The report never mentioned voter fraud. https://www.nytimes.com/2017/01/24/us/politics...
... ... ... ... ...
175 Oct. 25, 2017 We have trade deficits with almost everybody. We have trade surpluses with more than 100 cou... https://www.bea.gov/newsreleases/international...
176 Oct. 27, 2017 Wacky & totally unhinged Tom Steyer, who has b... Steyer has financially supported many winning ... https://www.opensecrets.org/donor-lookup/resul...
177 Nov. 1, 2017 Again, we're the highest-taxed nation, just ab... We're not. http://www.politifact.com/truth-o-meter/statem...
178 Nov. 7, 2017 When you look at the city with the strongest g... Several other cities, including New York and L... http://www.politifact.com/truth-o-meter/statem...
179 Nov. 11, 2017 I'd rather have him – you know, work with him... There is no evidence that Democrats "set up" R... https://www.nytimes.com/interactive/2017/12/10...

180 rows × 4 columns

Converting Datetime

In [27]:
df['date'] = pd.to_datetime(df['date'])

df.head()
Out[27]:
date lie explanation url
0 2017-01-21 I wasn't a fan of Iraq. I didn't want to go in... He was for an invasion before he was against it. https://www.buzzfeed.com/andrewkaczynski/in-20...
1 2017-01-21 A reporter for Time magazine — and I have been... Trump was on the cover 11 times and Nixon appe... http://nation.time.com/2013/11/06/10-things-yo...
2 2017-01-23 Between 3 million and 5 million illegal votes ... There's no evidence of illegal voting. https://www.nytimes.com/2017/01/23/us/politics...
3 2017-01-25 Now, the audience was the biggest ever. But th... Official aerial photos show Obama's 2009 inaug... https://www.nytimes.com/2017/01/21/us/politics...
4 2017-01-25 Take a look at the Pew reports (which show vot... The report never mentioned voter fraud. https://www.nytimes.com/2017/01/24/us/politics...

Exporting the Dataset to a CSV File

In [28]:
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')

Reading CSV File

In [29]:
df = pd.read_csv('trump_lies.csv')

df.head()
Out[29]:
date lie explanation url
0 2017-01-21 I wasn't a fan of Iraq. I didn't want to go in... He was for an invasion before he was against it. https://www.buzzfeed.com/andrewkaczynski/in-20...
1 2017-01-21 A reporter for Time magazine — and I have been... Trump was on the cover 11 times and Nixon appe... http://nation.time.com/2013/11/06/10-things-yo...
2 2017-01-23 Between 3 million and 5 million illegal votes ... There's no evidence of illegal voting. https://www.nytimes.com/2017/01/23/us/politics...
3 2017-01-25 Now, the audience was the biggest ever. But th... Official aerial photos show Obama's 2009 inaug... https://www.nytimes.com/2017/01/21/us/politics...
4 2017-01-25 Take a look at the Pew reports (which show vot... The report never mentioned voter fraud. https://www.nytimes.com/2017/01/24/us/politics...

Appendix A: Web scraping advice

  • Web scraping works best with static, well-structured web pages. Dynamic or interactive content on a web page is often not accessible through the HTML source, which makes scraping it much harder!
  • Web scraping is a "fragile" approach for building a dataset. The HTML on a page you are scraping can change at any time, which may cause your scraper to stop working.
  • If you can download the data you need from a website, or if the website provides an API with data access, those approaches are preferable to scraping since they are easier to implement and less likely to break.
  • If you are scraping many pages from the same website in rapid succession, it's best to insert delays in your code so that you don't overwhelm the website with requests. If the website decides you are causing a problem, it can block your IP address (which may affect everyone in your building!).
  • Before scraping a website, you should review its robots.txt file (also known as the Robots exclusion standard) to check whether you are "allowed" to scrape the site. (Here is the robots.txt file for nytimes.com.) A minimal sketch of both the delay and the robots.txt check follows this list.
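Here's that sketch, assuming a hypothetical list of pages to fetch (urllib.robotparser is part of the Python standard library):

import time
import urllib.robotparser
import requests

# check robots.txt before scraping
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.nytimes.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html'))

# insert a delay between successive requests to the same site
urls = ['https://example.com/page1', 'https://example.com/page2']  # hypothetical URLs
for url in urls:
    r = requests.get(url)
    time.sleep(1)  # one second between requests; adjust as appropriate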

Appendix B: Web scraping resources

Appendix C: Alternative syntax for Beautiful Soup

It's worth noting that Beautiful Soup actually offers multiple ways to express the same command. I tend to use the most verbose option, since I think it makes the code more readable, but it's useful to be able to recognize the alternative syntax since you might see it used elsewhere.

For example, you can search for a tag by accessing it like an attribute:

In [30]:
# search for a tag by name
first_result.find('strong')

# shorter alternative: access it like an attribute
first_result.strong
Out[30]:
<strong>Jan. 21 </strong>

You can also search for multiple tags a few different ways:

In [31]:
# search for multiple tags by name and attribute
results = soup.find_all('span', attrs={'class':'short-desc'})

# shorter alternative: if you don't specify a method, it's assumed to be find_all()
results = soup('span', attrs={'class':'short-desc'})

# even shorter alternative: you can specify the attribute as if it's a parameter
results = soup('span', class_='short-desc')
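Beautiful Soup also supports CSS selectors through the select() method, which is yet another equivalent way to express the same search:

# another alternative: CSS selector syntax via select()
results = soup.select('span.short-desc')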

For more details, check out the Beautiful Soup documentation.