How to scrape dynamic pages with Python and Selenium
Getting the HTML source of a web page is easy with Python. With requests it can be done in one line.
import requests

url = "http://example.com/"
page_source = requests.get(url).text
This works well as long as the content of the page is static. Some dynamic pages, however, show no useful information until their scripts have run. The solution is to execute the scripts first and then scrape the result.
First you will need to install a WebDriver: basically an API to a browser. Each major browser has its own WebDriver; this example uses Firefox's. You can install the WebDriver manually, but then additional configuration and manual updates are necessary. It is easier to use Webdriver Manager, which takes care of installing the WebDriver and keeping it up to date.
You will need Selenium to talk to the WebDriver. Selenium provides its own Python bindings, which are used in this example.
Install the Python libraries:
python -m pip install selenium webdriver-manager
Use this code to execute the scripts and get the resulting html.
from selenium import webdriver
from selenium.webdriver.firefox.service import Service as FirefoxService
from webdriver_manager.firefox import GeckoDriverManager

url = "http://example.com/"

driver = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()))
driver.get(url)
page_source = driver.page_source
driver.quit()
The page source can then be parsed just like that of a static web page.
Python's standard library has its own parser, html.parser.
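As a minimal sketch of how parsing the scraped source might look with the standard library, the snippet below subclasses html.parser.HTMLParser to collect the text of every h1 heading. The hard-coded page_source string is a stand-in for the source returned by the driver above; the TitleExtractor class name is my own choice for illustration.

```python
from html.parser import HTMLParser

# Stand-in for the page_source obtained from the WebDriver above.
page_source = "<html><body><h1>Example Domain</h1><p>Some text.</p></body></html>"

class TitleExtractor(HTMLParser):
    """Collects the text inside every <h1> tag."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headings.append(data)

parser = TitleExtractor()
parser.feed(page_source)
print(parser.headings)  # → ['Example Domain']
```

The same feed() call works unchanged on the page_source produced by Selenium, since it is just a string of HTML.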
Another popular Python parser is Beautiful Soup, for which I recommend this tutorial on pythonprogramming.net.