You can apply these two methods to either the initial soup object or a Tag object (such as first_result):

- find(): searches for the first matching tag, and returns a Tag object
- find_all(): searches for all matching tags, and returns a ResultSet object (which you can treat like a list of Tags)

You can extract information from a Tag object (such as first_result) using these two attributes:

- text: extracts the text of a Tag, and returns a string
- contents: extracts the children of a Tag, and returns a list of Tags and strings

And of course, there are many more methods and attributes available to you, which are described in the Beautiful Soup documentation.
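As a quick sketch of how these behave, here they are applied to a small hand-written HTML snippet (not the Times page):

```python
from bs4 import BeautifulSoup

# A tiny document with two <p> tags, the second containing a child <b> tag
html = '<div><p class="a">first</p><p class="b">second <b>bold</b></p></div>'
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.find('p')       # first matching tag -> a Tag object
all_p = soup.find_all('p')     # all matching tags -> a ResultSet

print(first_p.text)            # 'first'
print(len(all_p))              # 2
# contents returns the children: a string and a Tag, in document order
print(soup.find('p', class_='b').contents)
```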
Here's the way the article presented the information:
When converting this into a dataset, you can think of each lie as a "record" with four fields:
import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('span', attrs={'class':'short-desc'})
records = []
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append([date, lie, explanation, url])
import pandas as pd
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])
df['date'] = pd.to_datetime(df['date'])
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')
df = pd.read_csv('trump_lies.csv')
df.head()
!pip install requests
import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')
The response object (r) has a text attribute, which contains the same HTML code you would see if you viewed the page's source in your browser:
print(r.text[0:500])
!pip install beautifulsoup4
Note that html.parser is the parser included with the Python standard library; Beautiful Soup can also use third-party parsers (such as lxml) if they are installed.
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
You might have noticed that each record has the following format:
<span class="short-desc"><strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span></span>
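To see how contents splits each record into three pieces, here's a sketch using one hand-written record in that format (the date, lie text, URL, and explanation below are placeholders, not real data from the article):

```python
from bs4 import BeautifulSoup

# One fabricated record matching the structure shown above
html = ('<span class="short-desc"><strong>Jan. 21\xa0</strong>'
        '"LIE TEXT." '
        '<span class="short-truth"><a href="http://example.com/URL">'
        '(EXPLANATION)</a></span></span>')
record = BeautifulSoup(html, 'html.parser').find('span', class_='short-desc')

print(record.contents[0])  # child 0: the <strong> Tag holding the date
print(record.contents[1])  # child 1: the lie, a plain string between the tags
print(record.contents[2])  # child 2: the inner <span> wrapping the link
```

This is why the extraction code can reach the date via find('strong'), the lie via contents[1], and the explanation and URL via find('a').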
results = soup.find_all('span', attrs={'class':'short-desc'})
len(results)
results[0:3]
results[-1]
first_result = results[0]
first_result
first_result.find('strong')
first_result.contents[0]
What is \xa0? You don't actually need to know this, but it's an "escape sequence" that represents the non-breaking space character (&nbsp;) we saw earlier in the HTML source.

However, you do need to know that an escape sequence represents a single character within a string. Let's slice it off from the end of the string:
first_result.find('strong').text
first_result.contents[0].text
first_result.find('strong').text[0:-1]
first_result.contents[0].text[0:-1]
first_result.find('strong').text[0:-1] + ', 2017'
first_result.contents[0].text[0:-1] + ', 2017'
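The slicing above works because \xa0, despite being four characters to type, is a single character in the string, so dropping one character from the end removes exactly it. A standalone sketch:

```python
# '\xa0' is one character (a non-breaking space), so [0:-1] removes it
date_str = 'Jan. 21\xa0'

print(len('\xa0'))                 # 1 -- an escape sequence is one character
print(date_str[0:-1] + ', 2017')   # 'Jan. 21, 2017'
```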
first_result.contents[1]
first_result.contents[1][1:-2]
first_result.contents[2]
first_result.find('a')['href']
first_result.find('a').text[1:-1]
Now that we've figured out how to extract the four components of first_result, we can create a loop to repeat this process on all 116 results. We'll store the output in a list of lists called records:
records = []
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append([date, lie, explanation, url])
print("Record Length: ", len(records))
print(records[0:3])
import pandas as pd
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])
df
df['date'] = pd.to_datetime(df['date'])
df.head()
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')
df = pd.read_csv('trump_lies.csv')
df.head()
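One thing to watch: after the CSV round trip, the date column comes back as plain strings rather than datetimes, since CSV files don't store dtypes. read_csv's parse_dates parameter restores the datetime dtype on load. Here's a sketch using an in-memory buffer (rather than the trump_lies.csv file) so it runs on its own:

```python
import io
import pandas as pd

# A tiny stand-in frame with the same date format as the tutorial's data
df = pd.DataFrame({'date': ['Jan. 21, 2017', 'Jan. 23, 2017'],
                   'lie': ['a', 'b']})
df['date'] = pd.to_datetime(df['date'])

buf = io.StringIO()
df.to_csv(buf, index=False)

buf.seek(0)
reloaded = pd.read_csv(buf)                        # 'date' is plain strings
buf.seek(0)
reparsed = pd.read_csv(buf, parse_dates=['date'])  # 'date' is datetime64 again

print(reloaded['date'].dtype, reparsed['date'].dtype)
```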
It's worth noting that Beautiful Soup actually offers multiple ways to express the same command. I tend to use the most verbose option, since I think it makes the code readable, but it's useful to be able to recognize the alternative syntax since you might see it used elsewhere.
For example, you can search for a tag by accessing it like an attribute:
# search for a tag by name
first_result.find('strong')
# shorter alternative: access it like an attribute
first_result.strong
You can also search for multiple tags a few different ways:
# search for multiple tags by name and attribute
results = soup.find_all('span', attrs={'class':'short-desc'})
# shorter alternative: if you don't specify a method, it's assumed to be find_all()
results = soup('span', attrs={'class':'short-desc'})
# even shorter alternative: you can specify the attribute as if it's a parameter
results = soup('span', class_='short-desc')
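Beautiful Soup also supports CSS selectors through the select() method, which expresses the same tag-plus-class search in yet another way. A sketch on a small hand-written snippet:

```python
from bs4 import BeautifulSoup

html = ('<span class="short-desc">one</span>'
        '<span class="other">skip</span>'
        '<span class="short-desc">two</span>')
soup = BeautifulSoup(html, 'html.parser')

# CSS selector: tag name plus class, equivalent to the find_all() call above
results = soup.select('span.short-desc')
print([r.text for r in results])  # ['one', 'two']
```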
For more details, check out the Beautiful Soup documentation.