PDF to HTML conversion issues #72

botzill · 2017-06-14T17:26:30Z

Hi,

I'm trying to convert a simple PDF to HTML using:
pdf2txt.py test.pdf -t html -o test.html

Here is the test PDF file:
test.pdf

and here is the output html:

html source:

<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:595px; height:842px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:118px; width:448px; height:45px;"><span style="font-family: ; font-size:16px">The Portable Document Format (PDF) is the world’s leading language for describing <br>the printed page</span><span style="font-family: ; font-size:15px"> <br></span><span style="font-family: ; font-size:15px">	<br></span></div><span style="position:absolute; border: black 1px solid; left:72px; top:121px; width:445px; height:13px;"></span>
<span style="position:absolute; border: black 1px solid; left:72px; top:135px; width:86px; height:13px;"></span>
<div style="position:absolute; top:0px;">Page: <a href="#1">1</a></div>
</body></html>

Now, the problem is that the width of the line is computed incorrectly, which makes the text wrap differently than in the original document. This can lead to something like this:

Is there a fix for this issue? If not can you guide me where to look so that I can make a PR with the fix?

I think this tool may be helpful for what we need and in this case we can contribute to it.

Thx a lot!

botzill · 2017-06-20T12:53:33Z

I did some investigations about this issue and found out why I had such results. According to PDF specifications:

User Space
To avoid the device-dependent effects of specifying objects in device space, PDF
defines a device-independent coordinate system that always bears the same relationship
to the current page, regardless of the output device on which printing or
displaying occurs. This device-independent coordinate system is called user
space.
The user space coordinate system is initialized to a default state for each page of a
document. The CropBox entry in the page dictionary specifies the rectangle of
user space corresponding to the visible area of the intended output medium (display
window or printed page). The positive x axis extends horizontally to the
right and the positive y axis vertically upward, as in standard mathematical practice
(subject to alteration by the Rotate entry in the page dictionary). The length
of a unit along both the x and y axes is set by the UserUnit entry (PDF 1.6) in the
page dictionary (see Table 3.27). If that entry is not present or supported, the default
value of 1⁄72 inch is used. This coordinate system is called default user space.

So, in my case I'm using PDF-1.3 and there is no UserUnit defined, which means the default unit is 1/72 inch (1 pt), but the output HTML uses px. I did some tests with pt instead and the output is better.
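To make the unit mismatch concrete, here is a minimal sketch of the arithmetic, assuming the CSS convention of 96 px per inch. It is not pdfminer code; the function names `user_space_to_css_px` and `css_length_pt` are hypothetical and only illustrate the two options (convert to px, or emit pt directly).

```python
# Hypothetical helpers, not part of pdfminer.
CSS_PX_PER_INCH = 96.0     # CSS defines 1px as 1/96 inch
PDF_UNITS_PER_INCH = 72.0  # default user space: 1 unit = 1/72 inch


def user_space_to_css_px(value, user_unit=1.0):
    """Convert a length in PDF user space units to CSS pixels.

    user_unit is the page's UserUnit entry (a multiple of 1/72 inch);
    it defaults to 1.0 when the entry is absent.
    """
    return value * user_unit * CSS_PX_PER_INCH / PDF_UNITS_PER_INCH


def css_length_pt(value, user_unit=1.0):
    """Format a user space length as a CSS length in points."""
    return f"{value * user_unit:g}pt"


# The page in the HTML above is 595 units wide (A4 at 1/72 inch per unit).
page_width_px = user_space_to_css_px(595)   # about 793.33 px, not 595 px
page_width_css = css_length_pt(595)         # "595pt"
```

Writing `width:595px` for a 595 pt wide page therefore makes the box about 25% narrower than intended, which would be consistent with the different wrapping.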

Is there any reason why it was implemented like this?

pietermarsman · 2019-10-25T19:09:33Z

You could use the exact layout mode (using --layoutmode exact). This does not group character elements into larger structures, but puts them on their precise location.

I did some testing and found that using --layoutmode normal could work even better if all the bounding boxes use points (i.e. 1/72 of an inch) and all the font-sizes are in px (1 px is a barely visible line). I also needed to change the font-family from VZWISY+Georgia to Georgia.
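For reference, the same option can be set from Python. This is a sketch that assumes a pdfminer.six version whose `extract_text_to_fp` accepts a `layoutmode` keyword; the wrapper name `convert_exact` and the file names are placeholders.

```python
def convert_exact(pdf_path, html_path):
    """Write pdf_path as HTML using the "exact" layout mode."""
    # Imported here so this snippet loads even where pdfminer.six is missing.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    # The HTML converter encodes its output, so the target is opened in binary mode.
    with open(pdf_path, "rb") as pdf_file, open(html_path, "wb") as html_file:
        extract_text_to_fp(
            pdf_file,
            html_file,
            output_type="html",
            layoutmode="exact",   # counterpart of --layoutmode exact
            laparams=LAParams(),  # default layout analysis parameters
        )


# Usage (placeholder file names):
# convert_exact("test.pdf", "test.html")
```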

I'm not sure if this is a coincidence or not.

pietermarsman · 2019-12-30T17:09:08Z

This issue is partially fixed by #348.

The core of this issue is that the width of the characters is not computed correctly. This causes a mismatch between the bounding boxes and the actual text. This problem is solved by #348.

There is also a side issue: the font name from the test.pdf is b'VZWISY+Georgia', and this cannot be interpreted by the browser as being Georgia. I've created #349 for this.
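The prefix is the subset tag that PDF adds to the name of an embedded font subset: six uppercase letters followed by a plus sign. A minimal sketch of stripping it could look like the following; the helper name `strip_subset_tag` is hypothetical and this is not necessarily how the linked change handles it.

```python
import re

# Subset tag: exactly six uppercase letters followed by "+", e.g. "VZWISY+".
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def strip_subset_tag(fontname):
    """Return the font name without a leading subset tag, as str."""
    if isinstance(fontname, bytes):
        # Assumption: latin-1 is used only because it can decode any byte.
        fontname = fontname.decode("latin-1")
    return _SUBSET_TAG.sub("", fontname)


family = strip_subset_tag(b"VZWISY+Georgia")  # "Georgia"
```

Names without a tag pass through unchanged, so the helper can be applied to every font name before it is written into `font-family`.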

pietermarsman · 2020-01-09T19:53:44Z

@botzill do you have time for a review? Could you check out #317 and #357?

pietermarsman · 2020-01-16T21:39:20Z

FYI, it currently looks like this if you have the Georgia font:

pietermarsman added the type: bug label Oct 13, 2019

pietermarsman added type: question and removed type: bug labels Oct 25, 2019

pietermarsman mentioned this issue Oct 25, 2019

Use points instead of pixels to outline html page, keep using pixels for font-size. This improves allignment but I don't know why #317

Closed

5 tasks

This was referenced Dec 30, 2019

Fix bug in computing character bounding box #348

Merged

The font name from an embedded font contains a strange prefix #349

Closed

pietermarsman added type: bug and removed type: question labels Dec 30, 2019

pietermarsman mentioned this issue Jan 9, 2020

Fix font name by removing subset tag #357

Merged

5 tasks

pietermarsman added the component:converter label Jan 14, 2020

pietermarsman closed this as completed in #348 Jan 16, 2020
