Security inconsistency in Python's urlparse module

Yassine Aboukir · March 3, 2015

The urlparse module defines a standard interface for breaking URL strings up into components (addressing scheme, network location, path, etc.). For example:

from urlparse import urlparse, urlunparse 
ParseResult(scheme='https', netloc='', path='/connect/login', params='', query='', fragment='')
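To make the component breakdown concrete, here is a minimal sketch. It assumes Python 3, where the same functions live in urllib.parse, and it uses a made-up example.com URL that is not taken from the original post.

```python
# Python 3 location of the functions from the Python 2 urlparse module.
from urllib.parse import urlparse

# Hypothetical URL, chosen only for illustration.
parts = urlparse("https://example.com/connect/login?next=/home#top")

print(parts.scheme)    # https
print(parts.netloc)    # example.com
print(parts.path)      # /connect/login
print(parts.query)     # next=/home
print(parts.fragment)  # top
```

Each attribute of the returned ParseResult corresponds to one of the components named above.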

The urlparse.urlunparse() function is supposed to reassemble the components returned by urlparse() into the original URL. However, not every URL survives the round trip urlunparse(urlparse(url)). Python sees //// as a URL with no scheme and no hostname, and therefore treats the remainder, //, as a path:

output = urlparse("////")
ParseResult(scheme='', netloc='', path='//', params='', query='', fragment='')

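As you can see, the first two slashes are removed and the result is marked as a path-only (relative) URL. The issue arises when we reconstruct the URL with the urlunparse() function: the rebuilt string starts with //, so parsing it again treats what follows as a network location instead of a path.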

output = urlunparse(output) # from previous output
urlparse(output) # let's deconstruct the new output
ParseResult(scheme='', netloc='', path='', params='', query='', fragment='')
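The same round trip can be traced with a small helper. This is a sketch under stated assumptions: it targets Python 3's urllib.parse, the helper names round_trip and survives_round_trip are hypothetical, and example.com is a placeholder host. The rebuilt string and the second parse may differ between Python versions, so only the first parse is annotated with its result.

```python
from urllib.parse import urlparse, urlunparse

def round_trip(url):
    """Return (first_parse, rebuilt_string, second_parse) for url."""
    first = urlparse(url)
    rebuilt = urlunparse(first)
    second = urlparse(rebuilt)
    return first, rebuilt, second

def survives_round_trip(url):
    """True if parsing the rebuilt URL yields the same components again."""
    first, _, second = round_trip(url)
    return first == second

# Placeholder host appended to the four slashes from the example above.
first, rebuilt, second = round_trip("////example.com")

print(repr(first.netloc), repr(first.path))    # '' '//example.com'
print(repr(rebuilt))                           # version dependent
print(repr(second.netloc), repr(second.path))  # version dependent
print(survives_round_trip("////example.com"))  # False where the inconsistency is present
```

If the second parse reports a non-empty netloc, the path from the first parse has turned into a hostname, which is the inconsistency described here.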

As you can see above, after reconstructing the input we went from a path to a hostname. Depending on how the result is used across the code, this inconsistency in URL parsing can become a security issue, as was demonstrated on Reddit, which suffered from an open redirect vulnerability caused by this unexpected behavior.
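One defensive way to handle redirect targets in light of this is to check both the first parse and the parse of the rebuilt string, and to reject paths that begin with two slashes. The sketch below assumes Python 3's urllib.parse. The helper name is_local_redirect is hypothetical, it is not the fix Reddit applied, and it is not an exhaustive validator.

```python
from urllib.parse import urlparse, urlunparse

def is_local_redirect(target):
    """Hypothetical helper: accept only same-site absolute paths."""
    try:
        first = urlparse(target)
        # Re-parse the rebuilt string, since that is where the
        # path can turn into a hostname.
        second = urlparse(urlunparse(first))
    except ValueError:
        # urlparse raises ValueError on some malformed inputs.
        return False

    for parsed in (first, second):
        if parsed.scheme or parsed.netloc:
            return False

    # Require a leading slash, but refuse a leading double slash, which
    # would read as a network location once the URL is rebuilt.
    return first.path.startswith("/") and not first.path.startswith("//")

print(is_local_redirect("/connect/login"))       # True
print(is_local_redirect("////example.com"))      # False
print(is_local_redirect("//example.com"))        # False
print(is_local_redirect("https://example.com"))  # False
```

The final check on first.path does not depend on how urlunparse() rebuilds the string, so the four-slash input is rejected either way.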

I reported this issue to the Python developers, but it wasn’t resolved and the ticket has remained open since then.
