PDF to HTML conversion issues #72

Closed
botzill opened this issue Jun 14, 2017 · 5 comments · Fixed by #348 or #357

@botzill

botzill commented Jun 14, 2017

Hi,

I'm trying to convert a simple PDF to HTML using:
pdf2txt.py test.pdf -t html -o test.html

Here is the test PDF file:
test.pdf

and here is the output HTML:
screen shot 2017-06-14 at 8 17 58 pm

HTML source:

<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:595px; height:842px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:118px; width:448px; height:45px;"><span style="font-family: ; font-size:16px">The Portable Document Format (PDF) is the world’s leading language for describing <br>the printed page</span><span style="font-family: ; font-size:15px"> <br></span><span style="font-family: ; font-size:15px">	<br></span></div><span style="position:absolute; border: black 1px solid; left:72px; top:121px; width:445px; height:13px;"></span>
<span style="position:absolute; border: black 1px solid; left:72px; top:135px; width:86px; height:13px;"></span>
<div style="position:absolute; top:0px;">Page: <a href="#1">1</a></div>
</body></html>

Now, the problem is that the width of the line is computed incorrectly, which makes the text wrap differently than in the original document. This can lead to something like this:
screen shot 2017-06-14 at 8 21 36 pm

Is there a fix for this issue? If not, can you point me to where to look so that I can make a PR with the fix?

I think this tool could be helpful for what we need, in which case we would be happy to contribute to it.

Thx a lot!

@botzill
Author

botzill commented Jun 20, 2017

I did some investigating into this issue and found out why I got these results. According to the PDF specification:

User Space
To avoid the device-dependent effects of specifying objects in device space, PDF
defines a device-independent coordinate system that always bears the same relationship
to the current page, regardless of the output device on which printing or
displaying occurs. This device-independent coordinate system is called user
space.
The user space coordinate system is initialized to a default state for each page of a
document. The CropBox entry in the page dictionary specifies the rectangle of
user space corresponding to the visible area of the intended output medium (display
window or printed page). The positive x axis extends horizontally to the
right and the positive y axis vertically upward, as in standard mathematical practice
(subject to alteration by the Rotate entry in the page dictionary). The length
of a unit along both the x and y axes is set by the UserUnit entry (PDF 1.6) in the
page dictionary (see Table 3.27). If that entry is not present or supported, the default
value of 1⁄72 inch is used. This coordinate system is called default user space.

So, in my case I'm using PDF-1.3 and there is no UserUnit defined, which means that the default unit is 1/72 inch (1pt), but the output HTML uses px. I did some tests with pt instead and the output is better.

Is there any reason why it was implemented like this?
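
To make the unit mismatch concrete, here is a minimal sketch of the arithmetic. It assumes the CSS definition of 1px as 1/96 inch and the default PDF user space unit of 1/72 inch. The helper names `pt_to_css_px` and `css_box` are hypothetical and are not part of pdfminer.

```python
# Assumptions: CSS defines 1px as 1/96 in; default PDF user space is 1/72 in.
CSS_PX_PER_INCH = 96
PDF_UNITS_PER_INCH = 72


def pt_to_css_px(value_pt: float) -> float:
    """Convert a length in PDF default user space units (pt) to CSS px."""
    return value_pt * CSS_PX_PER_INCH / PDF_UNITS_PER_INCH


def css_box(left: float, top: float, width: float, height: float,
            unit: str = "pt") -> str:
    """Build the positional part of a style attribute (hypothetical helper).

    With unit="pt" the PDF numbers are emitted unchanged; with unit="px"
    they are scaled so the box keeps the same physical size.
    """
    values = (left, top, width, height)
    if unit == "px":
        values = tuple(pt_to_css_px(v) for v in values)
    return "left:{:g}{u}; top:{:g}{u}; width:{:g}{u}; height:{:g}{u};".format(
        *values, u=unit
    )


if __name__ == "__main__":
    # The text box from the HTML above, expressed in both units.
    print(css_box(72, 118, 448, 45, unit="pt"))
    print(css_box(72, 118, 448, 45, unit="px"))
```

Writing the PDF numbers straight into px, as in the HTML above, renders the page at 72/96 = 75% of its intended physical size.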

@pietermarsman
Member

pietermarsman commented Oct 25, 2019

You could use the exact layout mode (using --layoutmode exact). This does not group character elements into larger structures, but places them at their precise locations.
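
For reference, the same conversion can be sketched from Python instead of the command line. This assumes pdfminer.six is installed and that `extract_text_to_fp` in `pdfminer.high_level` accepts the `output_type` and `layoutmode` keyword arguments. The wrapper name `convert_pdf_to_html` is hypothetical.

```python
def convert_pdf_to_html(pdf_path: str, html_path: str,
                        layoutmode: str = "exact") -> None:
    """Hypothetical wrapper around pdfminer.six's HTML output.

    Roughly mirrors:
        pdf2txt.py <pdf_path> -t html --layoutmode exact -o <html_path>
    """
    # Imported lazily so this module loads even without pdfminer.six.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    with open(pdf_path, "rb") as inf, open(html_path, "wb") as outfp:
        extract_text_to_fp(
            inf,
            outfp,
            output_type="html",
            laparams=LAParams(),    # default layout analysis parameters
            layoutmode=layoutmode,  # "normal", "exact" or "loose"
        )


if __name__ == "__main__":
    convert_pdf_to_html("test.pdf", "test.html")
```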

Schermafbeelding 2019-10-25 om 20 56 59

I did some testing and found that using --layoutmode normal could work even better if all the bounding boxes use points (i.e. 1/72 of an inch) and all the font-sizes are in px (1 px is a barely visible line). I also needed to change the font-family from VZWISY+Georgia to Georgia.

Schermafbeelding 2019-10-25 om 21 06 36

I'm not sure if this is a coincidence or not.

@pietermarsman
Member

This issue is partially fixed by #348.

The core of this issue is that the width of the characters is not computed correctly. This causes a mismatch between the bounding boxes and the actual text. This problem is solved by #348.

There is also a side issue: the font name in test.pdf is b'VZWISY+Georgia', and the browser cannot interpret this as Georgia. I've created #349 for this.
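
As an illustration of the font-name side issue, here is a minimal sketch of stripping the subset tag. It relies on the PDF convention that a subset font's name starts with six uppercase letters followed by a plus sign. The function name `strip_subset_tag` is hypothetical and this is not necessarily how #349 is resolved.

```python
import re
from typing import Union

# A subset font name is prefixed with a tag of exactly six uppercase
# letters followed by "+", e.g. "VZWISY+Georgia".
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def strip_subset_tag(fontname: Union[str, bytes]) -> str:
    """Return the font name without a leading subset tag, if any."""
    if isinstance(fontname, bytes):
        # Assumption: decode bytes as Latin-1, which never fails.
        fontname = fontname.decode("latin-1")
    return _SUBSET_TAG.sub("", fontname)


if __name__ == "__main__":
    print(strip_subset_tag(b"VZWISY+Georgia"))
```

Names that do not carry a six-letter tag pass through unchanged, so the function is safe to apply to every font name.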

@pietermarsman
Member

pietermarsman commented Jan 9, 2020

@botzill do you have time for a review? Could you check out #317 and #357?

@pietermarsman
Member

FYI, it currently looks like this if you have the Georgia font:

Screenshot from 2020-01-16 22-35-24
