Monday, February 17, 2014

Daily Blog #239: Sunday Funday 2/16/14 Winner!

Hello Reader,
      It would appear that PDF metadata that many of you are familiar with! This was one of those tough weeks where I got multiple answers I really liked that answered the question in different ways. When I looked at them again though I thought this weeks winning answer stood out from both its depth and experience. I'll be sharing all the answers with David Dym to make sure metadiver grabs all the metadata you all have identified, while keeping those who wanted to be anonymous, anonymous. So let's get down to business and declare a winner for this week's Sunday Funday!

The Challenge:
Based on your past knowledge or testing:
1. What metadata can be present within a PDF document?
2. What effects what metadata will be present within a PDF document?

The Winning Answer
Dan Pullega @4n6k

Much can be gleaned from PDF metadata. Of course, there are the standard fields that will provide you with relatively common metadata...but depending on the program you use to create the PDF, there could be much, much more.

Let's start with the most common metadata. Most PDFs will have embedded metadata showing the PDF version, creation date, creation program, document ID, and last modified date. There are definitely more, but I have left them out as they wouldn't be very useful in an investigation (e.g. page count). The use for the aforementioned metadata is fairly obvious, but I will explain nonetheless.
The more obvious PDF metadata entries are:
- Creation program: program used to create the PDF (was it through desktop software, was it scanned, etc.)
- Creator/Author/Producer: Username or full name of the PDF's author OR further details on the program used to create the PDF (is it a previous employer?)
- Title: the title of the PDF that usually provides an outdated name for the document; good for identifying previous employer documents or documents that have been converted from one format into a PDF (e.g. SecretBluePrint.eps or oldCompanyFinances.doc shows up in the 'title' metadata entry)
Those are the easy ones. But what about the more overlooked metadata? Like I mentioned before, the program used to create or modify the PDF may have a huge impact on what information you are given. With that, let's look into it.

First, timestamps. We know that file metadata could potentially serve as a better indicator of when a given document was created. If the PDF has been transferred across various volumes and systems -- and we would like to find the origin of the document -- the creation date in the file metadata is going to be more reliable than the file system creation date (as it will have been updated with the copies/moves).

The metadata 'creation' date will [usually] preserve the REAL date of the file's creation. That is, if the PDF has been transferred across various volumes and systems, the 'creation' date in the file's metadata is going to give us a better idea of when the document was initially created.

The 'modified' date can be used in a similar way. We might even be able to tell how many programs through which the PDF was modified/saved. Say we have a PDF created using Adobe InDesign. If we were to open this PDF, modify it, and then save it as a new file using 'Save As...' in a program like Acrobat, we would see that the 'creation' date is still unchanged, but the 'modified' date had been updated (file system creation dates will tell us differently). Pretty standard stuff. Even if the PDF is saved using 'Save As...' (essentially creating a new file altogether with an updated file system creation time) AND it is moved from one system to the next, we will still have a genuine 'creation' date. Not only that, but we will have a metadata 'modified' date AND a new file system creation time to work with. Correlation among file metadata and file system timestamps are beyond the scope of this answer, but you get the point; 'creation' and 'modified' metadata dates are powerful and can be used creatively.
Also, with many PDF timestamps, we will be able to see a timezone offset. For example, a creation timestamp could be 2013:02:22 11:21:34-06:00. We now have a potential indication that the program that produced this PDF was set in Mountain time.

I mentioned that we might be able to determine if a PDF was created and modified through more than one program. As a quick side note, and if we really wanted to dive into the PDF analysis, we could take a look at some of the other telling metadata. The example above suggested creating a PDF in InDesign, opening it up in Acrobat, modifying it, and then saving it as a new file. When this happens, some of the metadata in the new file (like the 'modified' time) is updated while all of the InDesign metadata stays intact. However, there is a significant difference this time around: the 'XMP Toolkit' metadata value is different. Adobe implements their XMP Toolkit in all of their applications and plugins. They even open sourced it, so other programs can use it (and many do). The point is, "the XMPCore component of this toolkit is what allows for the creation and modification of the metadata that follows the XMP Data Model" (more here and here). So we have two PDFs, but the metadata for each was manipulated by two different versions of Adobe's XMP Core.
InDesign used "Adobe XMP Core 5.3-c011 66.145661, 2012/02/06-14:56:27" and...
Acrobat used "Adobe XMP Core 5.4-c005 78.147326, 2012/08/23-13:03:03"
But why is this important? Well, we can now more accurately pinpoint the program used to create the PDF. Sure, we will likely already have a metadata entry that tells us the 'Creation Program, but consider the above example; that tool (InDesign) may have been used to initially create the PDF, but it was NOT used to open, modify, and save a new version of it (Acrobat did that). Let's keep this in mind as we explain some other interesting metadata...
Remember: the amount of metadata that a program uses when creating files is limitless. XMP is built on XML, so any metadata tags can be defined. Let's take a real-world example of how powerful PDF metadata can be when created from certain programs. Download Trustwave's Global Security Report PDF from 2013. Run it in exiftool. What do you see? That's right, the "History" metadata fields will show you not only that the document was saved 497 times, but it will also show you the exact times that is was saved, the program used to save it each time, and the Document Instance ID for each save (less exciting).
While you have that open, take a look at the creation date (2013:02:22 11:21:34-06:00) and modify date (2013:05:09 10:47:39-07:00). The modify date is much later, but the last "History" save on the file was 2013:02:22 11:18:06-06:00. What's up with that? This is because the PDF was modified in a different program; one newer than InDesign CS5.5. How do I know this? Well, look at the XMP Core version. The XMP Core version used for InDesign CS5.5 is "Adobe XMP Core 5.4-c005 78.147326, 2012/08/23-13:03:03." I just so happen to have a PDF created with InDesign CS6 and that PDF uses "Adobe XMP Core 5.3-c011 66.145661, 2012/02/06-14:56:27." How can it be that CS5.5 is using a later XMP Core version than CS6?! Because another program was used to modify the CS5.5 PDF after the last save. On 2013:05:09 10:47:39-07:00 (the modify date), some program (let's just say it's Acrobat to satisfy my example from before) modified the PDF. The XMP Core version shown in the metadata is NOT from CS5.5.
Also from the 'History' metadata, we can tell that the creation date is actually "2012:12:29 11:20:49-06:00." and NOT "2013:02:22 11:21:34-06:00." My guess is that InDesign was keeping track of the saves, but when it came down to exporting the PDF, it tacked on the export date as the "Create Date" (as the last 'History' save of the file is 3 minutes before the alleged "Create Date").

If we really wanted to, we could use another metadata field (the PDF version) to further pinpoint the program used. If the PDF version is 1.7, we could look for programs on a suspect computer that save PDFs to version 1.7 by default. Believe it or not, many programs still save PDFs as version 1.4, 1.5, and 1.6.

After all of this, I think it's safe to say that PDF metadata can be pretty valuable. You just need to know what's available to you and how to interpret it.