Extracting a list of URLs from a PDF can be difficult. One of the tools that makes it a lot easier is pdf-parser by Didier Stevens.

For now we’ll focus on pdf-parser for extracting URLs in PDFs. This is a powerful tool with many uses, the full scope of which is beyond this brief tutorial.

Once downloaded, the syntax for our use of the command is as follows:

pdf-parser.py -O --filter filename

This will output all of the raw data streams, so it is useful to grep that output for a pattern using a regular expression (regex), for example:

grep -Po "(http://|https://|ftp://|ftps://|rdp://|ssh://)(.+?)/(.+?)/"

Here we’re looking to output matches to any of the following patterns:

  • http://*/*/
  • https://*/*/
  • ftp://*/*/
  • ftps://*/*/
  • rdp://*/*/
  • ssh://*/*/
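To see what the pattern actually captures, you can try it outside of grep first. Here is a quick sketch in Python, whose re syntax is compatible with this particular pattern; the sample stream text below is made up for illustration:

```python
import re

# The same pattern passed to grep -P; each (.+?) is a lazy match
# that stops at the next "/".
URL_PATTERN = re.compile(
    r"(http://|https://|ftp://|ftps://|rdp://|ssh://)(.+?)/(.+?)/"
)

# Made-up example of decoded stream content containing a URI.
sample = "stream ... /URI (http://example.com/malicious/payload.exe) ... endstream"

for match in URL_PATTERN.finditer(sample):
    print(match.group(0))  # http://example.com/malicious/
```

Note that the match stops at the second "/" after the scheme, which is exactly the truncation behavior described below.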

If we pipe (|) the pdf-parser command to the grep command we should be able to filter down the noise of the object stream and output only what we want to see.

pdf-parser.py -O --filter filename | grep -Po "(http://|https://|ftp://|ftps://|rdp://|ssh://)(.+?)/(.+?)/"
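Under the hood this pipeline is just two steps: dump the decoded streams, then filter them with a regex. A rough Python sketch of the same flow, assuming pdf-parser.py is executable and on your PATH (the file name suspect.pdf is a placeholder):

```python
import re
import subprocess

# Same pattern as the grep command, applied to raw bytes.
URL_RE = re.compile(rb"(http://|https://|ftp://|ftps://|rdp://|ssh://)(.+?)/(.+?)/")

def extract_urls(raw_streams: bytes) -> list:
    """Return every URL-like match found in the decoded stream data."""
    return [m.group(0) for m in URL_RE.finditer(raw_streams)]

def dump_streams(pdf_path: str) -> bytes:
    """Run pdf-parser.py -O --filter on the file and capture its output.

    Assumes pdf-parser.py is on the PATH; adjust the command if you
    invoke it as 'python pdf-parser.py' instead.
    """
    result = subprocess.run(
        ["pdf-parser.py", "-O", "--filter", pdf_path],
        capture_output=True,
        check=True,
    )
    return result.stdout

if __name__ == "__main__":
    for url in extract_urls(dump_streams("suspect.pdf")):
        print(url.decode(errors="replace"))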

The output will show any match found in the stream where a URL is present, up to the second “/” after the scheme. So, for example, if there’s a URL in the PDF that looks like this:

http://example.com/malicious/payload.exe

what you would see in the output would be:

http://example.com/malicious/
You can see more or less of the URI by changing the regex match values (refer to my previous post on regular expressions if you want/need assistance).

This technique is a very quick way to find possibly malicious URLs in PDF documents without needing to open them and expose yourself to any harm. It will most likely not be applicable in every situation, but it is an example of just one more tool that can be used to determine malicious intent.

Depending on the document being analyzed it may be necessary to play with the pattern, allowing more or less filtering (taking away or adding lazy .+? groups).
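As an illustration of that tuning, each lazy (.+?)/ group consumes one more path segment. A small sketch with a made-up URL shows the difference:

```python
import re

# Hypothetical URL for illustration only.
url = "https://cdn.example.net/a/b/c/payload.bin"

# Stop at the second "/" after the scheme (the pattern used in this post):
two_segments = re.compile(r"(https?://)(.+?)/(.+?)/")
# Add one more lazy group to capture an extra path segment:
three_segments = re.compile(r"(https?://)(.+?)/(.+?)/(.+?)/")

print(two_segments.search(url).group(0))    # https://cdn.example.net/a/
print(three_segments.search(url).group(0))  # https://cdn.example.net/a/b/
```

Removing a group works the same way in reverse, cutting the match off one “/” earlier.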

Next time I’ll review a cool use of another one of Didier’s tools, oledump, for extracting macros from Microsoft documents.