How to scrape dynamic pages with Python and Selenium
Getting the HTML source of a web page is easy with Python. With requests it can be done in one line.
import requests
url = "http://example.com/"
page_source = requests.get(url).text
This works well as long as the content of the page is static. Some dynamic pages, however, show no useful information until their scripts have run. The solution is to execute the scripts first and then scrape the result.
Tools
First you will need to install a WebDriver: essentially an API to a browser. Each major browser has its own WebDriver; this example uses Firefox's. You can install the WebDriver manually, but then additional configuration and manual updates are necessary. It is easier to use Webdriver Manager, which installs the WebDriver and keeps it up to date.
You will need Selenium to talk to the WebDriver. Selenium has its own Python wrapper, which is used in this example.
Solution
Install the Python libraries.
python -m pip install selenium webdriver-manager
Use this code to execute the scripts and get the resulting HTML.
from selenium import webdriver
from selenium.webdriver.firefox.service import Service as FirefoxService
from webdriver_manager.firefox import GeckoDriverManager
url = "http://example.com/"
# Download the Firefox WebDriver if necessary, then start the browser
driver = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()))
driver.get(url)  # load the page and run its scripts
page_source = driver.page_source  # the HTML after script execution
driver.quit()
Parsing
The page source can then be parsed like that of a static web page. Python's standard library has its own parser, html.parser.
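As a small illustration, here is a sketch that uses html.parser to collect all link targets from a page source; the LinkParser class and the sample HTML are my own, not from the original.

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag in the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

page_source = '<html><body><a href="http://example.com/">More</a></body></html>'
parser = LinkParser()
parser.feed(page_source)
print(parser.links)  # ['http://example.com/']
```

The same feed() call works on the page_source obtained from Selenium above.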
Another popular Python parser is Beautiful Soup, for which I recommend this tutorial on pythonprogramming.net.