Webscraping with Selenium in Python

Issue

I am trying to scrape the list of DAOs from messari.io, but I am having trouble because I get the following errors:

DeprecationWarning: executable_path has been deprecated, please pass in a Service object


driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)

DevTools listening on ws://127.0.0.1:56691/devtools/browser/b4609671-5e6e-4d25-b09e-4116b3dde4bf
[0525/100030.252:INFO:CONSOLE(1)] "enabling sentry error tracker", source: https://messari.io/static/js/main.977a4794.chunk.js (1)
[0525/100030.951:INFO:CONSOLE(2)] "Unable to refresh token: Login required", source: https://messari.io/static/js/23.778d04d0.chunk.js (2)
[0525/100031.065:INFO:CONSOLE(2)] "


88b           d88                                                            88
888b         d888                                                            ""
88'8b       d8'88
88 '8b     d8' 88   ,adPPYba,  ,adPPYba,  ,adPPYba,  ,adPPYYba,  8b,dPPYba,  88
88  '8b   d8'  88  a8P_____88  I8[    ""  I8[    ""  ""     'Y8  88P'   "Y8  88
88   '8b d8'   88  8PP"""""""   '"Y8ba,    '"Y8ba,   ,adPPPPP88  88          88
88    '888'    88  "8b,   ,aa  aa    ]8I  aa    ]8I  88,    ,88  88          88
88     '8'     88   '"Ybbd8"'  '"YbbdP"'  '"YbbdP"'  '"8bbdP"Y8  88          88


", source: https://messari.io/static/js/23.778d04d0.chunk.js (2)
[0525/100031.069:INFO:CONSOLE(2)] "Interested in a CHALLENGE? Check out: https://messari.io/quiz", source: https://messari.io/static/js/23.778d04d0.chunk.js (2)
Traceback (most recent call last):
  File "c:/Users/Student/webScrape/scraper.py", line 21, in <module>
    matches = WebDriverWait(driver, 10).until(
  File "C:\Users\Student\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\support\wait.py", line 89, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
Backtrace:
        Ordinal0 [0x0096B8F3+2406643]
        Ordinal0 [0x008FAF31+1945393]
        Ordinal0 [0x007EC748+837448]
        Ordinal0 [0x008192E0+1020640]
        Ordinal0 [0x0081957B+1021307]
        Ordinal0 [0x00846372+1205106]
        Ordinal0 [0x008342C4+1131204]
        Ordinal0 [0x00844682+1197698]
        Ordinal0 [0x00834096+1130646]
        Ordinal0 [0x0080E636+976438]
        Ordinal0 [0x0080F546+980294]
        GetHandleVerifier [0x00BD9612+2498066]
        GetHandleVerifier [0x00BCC920+2445600]
        GetHandleVerifier [0x00A04F2A+579370]
        GetHandleVerifier [0x00A03D36+574774]
        Ordinal0 [0x00901C0B+1973259]
        Ordinal0 [0x00906688+1992328]
        Ordinal0 [0x00906775+1992565]
        Ordinal0 [0x0090F8D1+2029777]
        BaseThreadInitThunk [0x777BFA29+25]
        RtlGetAppContainerNamedObjectPath [0x77B77A7E+286]
        RtlGetAppContainerNamedObjectPath [0x77B77A4E+238]

I know there is an API for messari.io, but I am almost certain it is only for their assets and not their list of DAOs. I tried using Selenium since it is a dynamic page but I am still having trouble. Here is my code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests

url = 'https://messari.io/governor/daos'

DRIVER_PATH = 'PATH_TO_DRIVER_ON_MY_PC'
options = Options()
options.headless = True
options.add_argument("--window-size=1920, 1200")

# s = Service('PATH_TO_DRIVER_ON_MY_PC')
driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get('https://messari.io/governor/daos')

try:
    matches = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "td")))
    # for match in matches:
    #     print(match.text)

finally:
    driver.quit()

Update: I fixed the executable_path warning, but I am still getting the same TimeoutException. When I run it without headless mode, I also get the following message:

DevTools listening on ws://127.0.0.1:57773/devtools/browser/4450b78d-3a9f-401a-b39c-2c716ecad924
[9628:20616:0525/102300.840:ERROR:device_event_log_impl.cc(214)] [10:23:00.840] USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)
[9628:20616:0525/102300.841:ERROR:device_event_log_impl.cc(214)] [10:23:00.841] USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)

Based on similar questions, I assume this part is a hardware message that I shouldn't worry about, because one of the two lines disappeared when I unplugged my mouse.

Solution

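This page doesn't use <td> to display the list of DAOs.
It uses <div> elements (styled with CSS) to lay it out like a table, so the wait for <td> never succeeds and ends in TimeoutException.

It keeps the name of each DAO in an <h4>.

At least, it uses <div> and <h4> in my Firefox on a laptop with Linux.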
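If you are not sure which tags the rendered page uses, you can count the tag names in the HTML that Selenium sees. This is only a minimal sketch using the standard library; `TagCounter` and `count_tags` are hypothetical helper names, not part of Selenium.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times each opening tag appears in an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lowercase
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name -> number of opening tags."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


if __name__ == "__main__":
    # Small hand-written sample; with Selenium you would pass driver.page_source
    sample = "<div><h4>Aave</h4><h4>Lido</h4></div>"
    counts = count_tags(sample)
    print(counts["td"], counts["h4"])
```

With a running driver you could call count_tags(driver.page_source) after the page has loaded. If the count for td is 0, then waiting for <td> can never succeed.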


Full working code (tested on Linux Mint, Python 3.8, Selenium 4.x, Chrome 101.x)

I used the module webdriver_manager, so it automatically downloads a matching driver whenever Linux installs a newer version of Chrome.

To get all <h4> elements I have to use find_elements() (with s in the word elements) or presence_of_all_elements_located(). The versions without s/all return only the first matching element.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from webdriver_manager.chrome import ChromeDriverManager

url = 'https://messari.io/governor/daos'

options = Options()
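# options.headless is deprecated in newer Selenium 4 releases; the argument works in all of them
options.add_argument("--headless")
# no space after the comma, otherwise Chrome may not parse the size
options.add_argument("--window-size=1920,1200")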

driver = webdriver.Chrome(options=options, service=Service(ChromeDriverManager().install()))

driver.get(url)

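try:
    # Wait up to 10 seconds until at least one <h4> is present, then get all of them
    matches = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.TAG_NAME, "h4")))

    # Alternative without waiting:
    #matches = driver.find_elements(By.TAG_NAME, "h4")

    # Skip <h4> elements that have no visible text
    for match in matches:
        if match.text:
            print(match.text)
finally:
    # Always close the browser, even if the wait times out
    driver.quit()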

Result:

Fei
Rook
Cosmos
Stargate Finance
Aave
Treasure DAO
DODO
Radicle
Goldfinch
Merit Circle
EPNS
Perpetual Protocol
Gitcoin
SuperRare
Indexed
Doodles
Rome DAO
Badger
Paraswap
Unlock
Terra
Shapeshift
Lobis
Pool Together
The Graph
Yearn Finance
Ampleforth
Alpaca Finance
Balancer
Gro Protocol
Sismo DAO
BeethovenX
ENS
Lido
Alchemist

EDIT:

To get all values, you may have to scroll the page so that JavaScript adds new items.

Other answers use a while-loop with execute_script(), which runs JavaScript code to scroll to the bottom and get the current page height. If the height differs from the value before the scroll, you have to scroll again. If the height is the same, you have reached the end of the page and can get all items.
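The scrolling loop could look roughly like the sketch below. I didn't test it on this page, so treat it as an assumption: the function name `scroll_to_bottom`, the pause length and the `max_rounds` limit are my own choices, and it assumes the page scrolls the main window (if the list sits in its own scrollable <div>, the JavaScript has to target that element instead).

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll down until the page height stops growing.

    Returns the number of scrolls that made the page taller.
    `driver` is anything with Selenium's execute_script() method.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")

    for rounds in range(max_rounds):
        # Jump to the current bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Give JavaScript time to load and add new items
        time.sleep(pause)

        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Height did not change, so this should be the end of the page
            return rounds
        last_height = new_height

    # Safety limit reached; the page may still have more items
    return max_rounds
```

In the code above you would call scroll_to_bottom(driver) after driver.get(url) and before collecting the <h4> elements. The max_rounds limit is only there so the loop cannot run forever on a page that keeps growing.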

Answered By – furas

Answer Checked By – Candace Johnson (AngularFixing Volunteer)
