PDF to HTML Validation with Python

James Kellas

on 28 June 2013

Transcript of PDF to HTML Validation with Python

ImageMagick
How is a human able to recognize that the previous two images were reasonably similar?

How can a computer do it better?
Error pages match 99%
Successful for images, text, graphs, formulae, etc
Blank pages match 100%
Not so good with title pages
Smaller amount to compare
Error pages match < 90%
Still having issues with title pages
Error pages match < 90%
Works w/ text, images, graphs, formulae, etc.
Errors on title pages match < 90%
PDF to HTML Validation
with Python
Analyzing the Problem
Testing Rule #1
Poppler Project for PDFs
Create Images
Image Comparison Tools
Red Apples to Red Apples
Resize and Crop
Browser Incompatibility
Time to Think Outside the Box
Finally, comparing!
Getting there...
Look Ma, no background!
Always compare apples to apples!
Selenium for browsers
Python Imaging Library (PIL)
Beer Goggles to the Rescue!
Font Issues
Qt WebKit
Add an alpha channel
Replace white pixels w/ transparent pixels
Compare PDF to HTML
Remove alpha pixels
Compare final product to black image
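
The five steps above can be read as a masking-and-diff routine. What follows is a minimal sketch of one such reading using Pillow (the maintained fork of PIL), not the presenter's actual code. The function names, the near-white threshold of 250, and the choice to score only pixels that are non-background in at least one image are all assumptions made for illustration.

```python
from PIL import Image, ImageChops


def knock_out_white(img, threshold=250):
    """Return an RGBA copy in which near-white pixels are fully transparent.

    The threshold is an assumed value; the slides do not give one.
    """
    rgba = img.convert("RGBA")  # add an alpha channel
    px = rgba.load()
    width, height = rgba.size
    for y in range(height):
        for x in range(width):
            r, g, b, _ = px[x, y]
            if min(r, g, b) >= threshold:
                px[x, y] = (r, g, b, 0)  # white becomes transparent
    return rgba


def match_percent(pdf_img, html_img, threshold=250):
    """Percentage of non-background pixels that are identical in both images."""
    if html_img.size != pdf_img.size:
        html_img = html_img.resize(pdf_img.size)
    a = knock_out_white(pdf_img, threshold)
    b = knock_out_white(html_img, threshold)
    # Identical pixels come out black (0, 0, 0) in the difference image.
    diff = ImageChops.difference(a.convert("RGB"), b.convert("RGB"))
    pa, pb, pd = a.load(), b.load(), diff.load()
    content = same = 0
    width, height = a.size
    for y in range(height):
        for x in range(width):
            if pa[x, y][3] == 0 and pb[x, y][3] == 0:
                continue  # transparent in both: background, so ignore it
            content += 1
            if pd[x, y] == (0, 0, 0):
                same += 1
    # Two blank pages have no content pixels and count as a perfect match.
    return 100.0 if content == 0 else 100.0 * same / content
```

Ignoring the shared background keeps large white margins from inflating the score, which may be why a broken page can drop below 90% instead of matching at 99%.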
PIL > ImageMagick
Linux != Windows/Mac
Verify conversion from PDF to HTML is reasonably similar
Needs to scale: 50,000 books at 500 pages each (25 million pages)
Must be automated to keep costs down
Convert PDF and HTML to the same file type to objectively compare them
PNG images were the answer!
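
For the PDF half of that conversion, one possible route is Poppler's `pdftoppm` command-line tool. The slides name the Poppler project but not a specific tool, so its use here is an assumption, as are the helper names and the 100 DPI default. This sketch only builds and runs the command; the HTML side would need a browser screenshot step, which is not shown.

```python
import subprocess


def pdftoppm_command(pdf_path, out_prefix, dpi=100, first=None, last=None):
    """Build a pdftoppm command that renders PDF pages to PNG files.

    Hypothetical helper: the DPI default and argument names are illustrative.
    """
    cmd = ["pdftoppm", "-png", "-r", str(dpi)]
    if first is not None:
        cmd += ["-f", str(first)]  # first page to render
    if last is not None:
        cmd += ["-l", str(last)]  # last page to render
    return cmd + [pdf_path, out_prefix]


def render_pdf(pdf_path, out_prefix, **kwargs):
    """Run pdftoppm; requires Poppler's utilities to be installed and on PATH."""
    subprocess.run(pdftoppm_command(pdf_path, out_prefix, **kwargs), check=True)
```

Rendering both sides at the same resolution keeps the later pixel comparison meaningful.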
Broken HTML
Comparison Diff