Python module urlparse security inconsistency

Yassine Aboukir · March 3, 2015

The urlparse module defines a standard interface for breaking URL strings up into components (addressing scheme, network location, path, etc.). For example:

from urlparse import urlparse, urlunparse 
urlparse("https://www.example.com/connect/login")
ParseResult(scheme='https', netloc='www.example.com', path='/connect/login', params='', query='', fragment='')

The urlunparse() function is supposed to reassemble the components returned by urlparse() into the original URL. However, some URLs do not survive the round trip urlunparse(urlparse(url)). Python sees ////evil.com as a URL with no scheme or hostname and therefore treats it as the path //evil.com:

output = urlparse("////evil.com")
print(output)
ParseResult(scheme='', netloc='', path='//evil.com', params='', query='', fragment='')

As you can see, the first two slashes are removed and the rest is treated as a path with no hostname. The issue arises when we reconstruct the URL with urlunparse(): the rebuilt string starts with two slashes, so it now reads as a scheme-relative URL whose hostname is evil.com.

output = urlunparse(output)  # rebuild from the previous ParseResult
print(output)
//evil.com
urlparse(output)  # parse the rebuilt URL again
ParseResult(scheme='', netloc='evil.com', path='', params='', query='', fragment='')
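
The same check can be wrapped in a small helper. This is only a sketch: the function name round_trip_changes_host is made up for illustration, and the import fallback is there so it runs on both Python 2 and Python 3. Whether it reports a change for ////evil.com depends on the interpreter version, so try it on the version you actually deploy.

```python
try:
    from urllib.parse import urlparse, urlunparse  # Python 3
except ImportError:
    from urlparse import urlparse, urlunparse  # Python 2


def round_trip_changes_host(url):
    """Hypothetical helper: True if parse -> unparse -> parse changes netloc."""
    first = urlparse(url)
    # Rebuild the URL from its parts, then parse the rebuilt string again.
    second = urlparse(urlunparse(first))
    return first.netloc != second.netloc


if __name__ == "__main__":
    for candidate in ("https://www.example.com/connect/login", "////evil.com"):
        print(candidate, round_trip_changes_host(candidate))
```

An ordinary absolute URL such as the first candidate keeps its hostname across the round trip, so the helper returns False for it.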

After reconstructing the input, we went from a path to a hostname. Depending on how the result is used in the code, this parsing inconsistency can become a security issue. This was demonstrated on Reddit, which suffered from an open redirect vulnerability caused by this unexpected behavior:

https://github.com/reddit/reddit/commit/689a9554e60c6e403528b5d62a072ac4a072a20e
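
One way an application might guard against this class of bug is to validate the raw redirect target as well as the parsed one. The sketch below is an assumption about what such a check could look like, not the fix Reddit applied; the name is_safe_relative_redirect is hypothetical, and the rule of accepting only paths that start with a single slash is deliberately strict.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def is_safe_relative_redirect(url):
    """Hypothetical helper: accept only same-site paths such as /connect/login."""
    # Check the raw string first: it must start with exactly one slash.
    # Anything starting with "//" may be read by a browser as a link to
    # another host, whatever urlparse() makes of it.
    if not url.startswith("/") or url.startswith("//"):
        return False
    # Assumption: backslashes are rejected too, since some browsers
    # treat them like forward slashes.
    if "\\" in url:
        return False
    parts = urlparse(url)
    # A scheme or network location means the target is not a plain path.
    if parts.scheme or parts.netloc:
        return False
    # A parsed path starting with "//" is the case that can turn into a
    # hostname once the URL is rebuilt.
    return not parts.path.startswith("//")


if __name__ == "__main__":
    for candidate in ("/connect/login", "////evil.com", "//evil.com"):
        print(candidate, is_safe_relative_redirect(candidate))
```

The point of checking the raw string and the parsed path separately is that neither check relies on urlparse() and urlunparse() agreeing with each other.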

I reported this issue to the Python bug tracker, but it has not been resolved and the ticket has remained open since then:

https://bugs.python.org/issue23505
