Scraping Data with Beautiful Soup Issues

Issue

I am working on scraping the countries of astronauts from this website: https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch%20order. I am using BeautifulSoup to perform this task, but I’m having some issues. Here is my code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

data = []

url = 'https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch%20order'

r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
tags = soup.find_all('div', class_ ='astronaut_index__content container--xl mxa f fr fw aifs pl15 pr15 pt0')

for item in tags:
    name = item.select_one('bau astronaut_cell__title bold mr05')
    country = item.select_one('mouseover__contents rel py05 px075 bau caps small ac').get_text(strip = True)
    data.append([name,country])
    
df = pd.DataFrame(data)

df

df is returning an empty list. Not sure what is going on. When I take the code out of the for loop, it can’t seem to find the select_one function. Function should be coming from bs4 – not sure why that’s not working. Also, is there a repeatable pattern for web scraping that I’m missing? Seems like it’s a different beast every time I try to tackle these kinds of problems.

Any help would be appreciated! Thank you!

Solution

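The page’s data is generated dynamically by JavaScript, so the HTML that requests downloads does not contain the astronaut list, and BeautifulSoup cannot parse what is not there. One option is to pair BeautifulSoup with a browser automation tool such as Selenium, which renders the page before the HTML is handed to BeautifulSoup. Note also that select_one expects a CSS selector, so each class name needs a leading dot, as in the script below.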
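
One way to check whether a page is rendered by JavaScript before reaching for Selenium is to look for the target class in the raw HTML that requests receives. The sketch below uses only the standard library. The names ClassCollector and classes_in_html, and the two sample HTML strings, are made up for illustration and are not taken from the site.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every class name that appears in a piece of HTML."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for attr_name, value in attrs:
            if attr_name == "class" and value:
                # A class attribute may hold several space-separated names
                self.classes.update(value.split())


def classes_in_html(html):
    """Return the set of class names present in the given HTML text."""
    collector = ClassCollector()
    collector.feed(html)
    return collector.classes


# Hypothetical samples: an empty shell as a JavaScript-driven site might serve it,
# and the same page after a browser has rendered it.
raw_html = '<div id="root"></div><script src="app.js"></script>'
rendered_html = '<div id="root"><div class="astronaut_cell x">...</div></div>'

print("astronaut_cell" in classes_in_html(raw_html))       # False
print("astronaut_cell" in classes_in_html(rendered_html))  # True
```

In practice you would pass r.text from the requests call in the question. If the class you are after is missing there but visible in the browser's inspector, the content is most likely being rendered client-side.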

Script:

from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time

data = []

url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'

# Selenium 4 expects the driver path wrapped in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
driver.get(url)
time.sleep(5)  # give the JavaScript time to render the astronaut list

# Parse the rendered page, then close the browser
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

# Each astronaut row carries both classes: astronaut_cell and x
tags = soup.select('.astronaut_cell.x')

for item in tags:
    # In a CSS selector every class name needs a leading dot
    name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
    country = item.select_one('.mouseover__contents.rel.py05.px075.bau.caps.small.ac')
    # select_one returns None when nothing matches, so guard before get_text()
    if country:
        country = country.get_text()

    data.append([name, country])

cols = ['name', 'country']
df = pd.DataFrame(data, columns=cols)

print(df)

Output:

                 name                   country
0       Bess, Cameron  United States of America
1          Bess, Lane  United States of America
2          Dick, Evan  United States of America
3       Taylor, Dylan  United States of America
4    Strahan, Michael  United States of America
..                ...                       ...
295     Jones, Thomas  United States of America
296      Sega, Ronald  United States of America
297     Usachov, Yury                    Russia
298   Fettman, Martin  United States of America
299       Wolf, David  United States of America

[300 rows x 2 columns]
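
The selectors in the script differ from the ones in the question in one detail: select_one takes a CSS selector, so a class attribute value such as 'bau astronaut_cell__title bold mr05' has to become '.bau.astronaut_cell__title.bold.mr05'. Without the dots, each word is read as a tag name and nothing matches. A small helper can do the conversion. The name class_attr_to_selector is hypothetical and not part of bs4.

```python
def class_attr_to_selector(class_attr):
    """Turn a class attribute value copied from the inspector into a CSS selector.

    'bau bold mr05' -> '.bau.bold.mr05' (an element carrying all three classes).
    """
    names = class_attr.split()
    if not names:
        raise ValueError("class attribute is empty")
    return "".join("." + name for name in names)


# Class string taken from the question's code
print(class_attr_to_selector("bau astronaut_cell__title bold mr05"))
# .bau.astronaut_cell__title.bold.mr05
```

The result can be passed straight to item.select_one(...). Class names containing characters such as ':' would need CSS escaping, which this sketch does not handle.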

Answered By – F.Hoque

Answer Checked By – Jay B. (AngularFixing Admin)
