PDF to HTML conversion issues #72
Comments
I investigated this issue and found out why I was getting these results. According to the PDF specification:
So, in my case I'm using PDF-1.3 and there is no UserUnit defined, which means that the default unit is 1pt, but the output HTML uses px. I did some tests and the output is better. Is there any reason why it was implemented like this?
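To make the unit mismatch concrete, here is a minimal sketch of the conversion. It assumes the CSS definitions 1in = 96px = 72pt and a default PDF user space unit of 1/72 inch (no UserUnit). The helper name `pt_to_px` is hypothetical and is not part of pdfminer.

```python
# Assumption: CSS absolute units, where 1in = 96px and 1in = 72pt.
CSS_PX_PER_INCH = 96
PT_PER_INCH = 72


def pt_to_px(points: float) -> float:
    """Convert a length in PDF points to CSS pixels (hypothetical helper)."""
    # Multiply before dividing so common page sizes come out exact.
    return points * CSS_PX_PER_INCH / PT_PER_INCH


# A US Letter page is 612 x 792 pt in default PDF user space.
page_width_pt, page_height_pt = 612, 792
print(pt_to_px(page_width_pt), pt_to_px(page_height_pt))  # 816.0 1056.0
```

So if coordinates measured in pt are written out unchanged with a px suffix, everything is rendered at 75% of its intended size, unless the values are either scaled by 96/72 or emitted with a pt suffix.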
You could use the exact layout mode (using I did some testing and found that using I'm not sure if this is a coincidence or not.
This issue is partially fixed by #348. The core of this issue is that the weight of the characters is not computed correctly. This causes a mismatch between the bounding boxes and the actual text. This problem is solved by #348. There is also a side issue: the font name from the
Hi,
I'm trying to convert a simple PDF to HTML using:
pdf2txt.py test.pdf -t html -o test.html
Here is the test PDF file:
test.pdf
and here is the output html:

html source:
Now, the problem is that the width of the line is incorrectly computed, making it wrap differently than in the original document. This can lead to something like this:
Is there a fix for this issue? If not, can you guide me where to look so that I can make a PR with the fix?
I think this tool may be helpful for what we need and in this case we can contribute to it.
Thx a lot!