Tuesday, September 13, 2016

Analyze thousands of PDFs, extract information and generate a tabular report

Through the Car Library project website, I found ExifTool, free software by Phil Harvey that can analyze thousands of PDFs, extract information (miscellaneous data and metadata) and generate a tabular report that is easy to use in LibreOffice Calc or MS Excel.
  1. Download the software
  2. Install it
    • Extract exiftool(-k).exe from the zip
    • Rename the file to exiftool.exe
    • Copy it to c:/windows/
  3. Start > Run > type cmd then Enter
  4. Copy and paste this command:
    exiftool -csv -r -Encrypt -Info -Root -Linearized  -All -ext pdf -m -t c:\collection > report.csv
    • c:\collection contains the PDFs
    • report.csv is generated at the root of the user folder
    • Information extracted: file name, folder name, file size, number of pages, metadata
    • I still need to find out how to get: native PDF or scanned PDF; if scanned, OCR or not; if scanned, scan quality; PDF/A (yes/no). There is the software PDF-Analyzer Pro 5.0 by Ingo Schmoekel, but I didn't buy it.
It is possible to do the same for other types of files (pictures,…).
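For anyone who prefers to launch the command in step 4 from a script, here is a minimal Python sketch. It assumes exiftool.exe is on the PATH (as after step 2) and simply reuses the options from the command above; the function names build_exiftool_command and run_report, and the ext parameter used to switch to another file type, are my own illustrative choices and not part of ExifTool.

```python
import subprocess


def build_exiftool_command(collection, ext="pdf"):
    """Return the argument list for the ExifTool call shown in step 4.

    The function name and the ext parameter are hypothetical helpers;
    the options themselves are copied from the original command.
    """
    return [
        "exiftool",
        "-csv",        # CSV output, one row per file
        "-r",          # recurse into subfolders
        "-Encrypt", "-Info", "-Root", "-Linearized",
        "-All",
        "-ext", ext,   # only process files with this extension
        "-m",          # ignore minor errors and warnings
        "-t",          # kept as in the original command
        str(collection),
    ]


def run_report(collection, report_path="report.csv", ext="pdf"):
    """Run ExifTool and write its standard output to report_path."""
    command = build_exiftool_command(collection, ext)
    with open(report_path, "wb") as report:
        # Same effect as the "> report.csv" redirection in cmd.
        completed = subprocess.run(command, stdout=report)
    return completed.returncode
```

Calling run_report(r"c:\collection") should reproduce the command line above; passing ext="jpg" would, under the same assumptions, point it at pictures instead.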
Method sent to Harry. Applied to his 294 GB collection, it analyzed 18420 files in 2212 folders. Processing took 3 hours and generated a CSV file of 52 MB.
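File and folder counts like these can be checked directly from the report. The sketch below is only an illustration: it assumes report.csv is comma-separated, UTF-8 encoded and has a SourceFile column holding each file's path, which is what ExifTool's -csv output normally starts with. The function name summarize_report is my own.

```python
import csv
from pathlib import PureWindowsPath


def summarize_report(report_path):
    """Count the files and distinct folders listed in an ExifTool CSV report."""
    folders = set()
    file_count = 0
    with open(report_path, newline="", encoding="utf-8", errors="replace") as handle:
        for row in csv.DictReader(handle):
            source = row.get("SourceFile")  # assumed column name
            if not source:
                continue  # skip rows without a file path
            file_count += 1
            # PureWindowsPath accepts both "/" and "\" as separators.
            folders.add(PureWindowsPath(source).parent)
    return file_count, len(folders)
```

The report is read row by row, so even a CSV of several tens of MB should not need to fit in memory at once.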
