URLparse defines a standard interface to break URL strings up in components (addressing scheme, network location, path etc.). E.g:
from urlparse import urlparse, urlunparse urlparse("https://www.example.com/connect/login") ParseResult(scheme='https', netloc='www.example.com', path='/connect/login', params='', query='', fragment='')
urlparse.urlunparse() function is supposed to reconstruct the components of a URL returned by the
urlparse() function back to form of the original URL. However, URLs do not survive the round-trip through
urlunparse(urlparse(url)). Python sees
////example.com as a URL with no hostname or scheme therefore considers it a path
output = urlparse("////evil.com") print(output) ParseResult(scheme='', netloc='', path='//evil.com', params='', query='', fragment='')
As you see, the two slashes are removed and it is marked as a relative-path URL but the issue arises when we reconstruct the URL using
urlunparse() function - the URL is treated as an absolute path.
output = urlunparse(output) # from previous output print(output) "//evil.com" urlparse(output) # let's deconstruct the new output ParseResult(scheme='', netloc='evil.com', path='', params='', query='', fragment='')
We went from a path to a hostname as you can see above after reconstructing the input. Depending on its usage across the code, this inconsistency in parsing URLs could result in a security issue which was indeed demonstrated on Reddit suffering from an open redirect vulnerability caused by this unexpected behavior.
I reported this issue to Python but it wasn’t resolved and the ticket remains open since then