Getting the HTML source of a webpage is easy with Python. With requests it takes a single line.

import requests

url = "http://example.com/"
page_source = requests.get(url).text

This works well as long as the content of the page is static. Some dynamic pages show no useful information until their scripts have run. The solution is to execute the scripts first and then scrape the result.

Tools

First you will need to install a WebDriver: essentially an API to a browser. Each major browser has its own WebDriver; in this example I use Firefox's. You can install the WebDriver manually, but that requires additional configuration and manual updates. It is easier to use Webdriver Manager, which takes care of installing the WebDriver and keeping it up to date.

You will need Selenium to talk to the WebDriver. Selenium provides its own Python bindings, which are used in this example.

Solution

Install the Python libraries.

python -m pip install selenium webdriver-manager

Use this code to execute the scripts and get the resulting HTML.

from selenium import webdriver
from selenium.webdriver.firefox.service import Service as FirefoxService
from webdriver_manager.firefox import GeckoDriverManager

url = "http://example.com/"
driver = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()))
driver.get(url)
page_source = driver.page_source
driver.quit()

Parsing

The page source can then be parsed like that of a static web page. Python's standard library includes its own parser, html.parser. Another popular Python parser is Beautiful Soup, for which I recommend this tutorial on pythonprogramming.net.
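As a minimal sketch of the standard-library route, the following subclass of html.parser's HTMLParser collects the text of all h1 tags. The inline HTML string stands in for the page_source obtained above; the class name HeadingParser is my own choice for illustration.

```python
from html.parser import HTMLParser

class HeadingParser(HTMLParser):
    """Collects the text content of every <h1> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False   # are we currently inside an <h1>?
        self.headings = []   # text collected so far

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headings.append(data)

# Stand-in for the page_source scraped with Selenium above.
page_source = "<html><body><h1>Example Domain</h1><p>Some text.</p></body></html>"

parser = HeadingParser()
parser.feed(page_source)
print(parser.headings)  # ['Example Domain']
```

For anything beyond simple extraction like this, Beautiful Soup is usually more convenient, since it builds a searchable tree instead of firing callbacks.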

tags: Data   Python