(URGENT) PDF accessibility: words containing underscores being broken into multiple words in PDF content text #870

Closed
ronaldtse opened this issue Dec 1, 2022 · 9 comments
@ronaldtse
Contributor

This is from ISO 10303-50 (https://github.com/metanorma/iso-10303-detached-docs/tree/main/sources/iso-10303-50):

Screen Shot 2022-12-01 at 12 42 23 PM

Screen Shot 2022-12-01 at 12 42 48 PM

As far as I remember, we never had this problem before, so it seems like a regression.

@Intelligent2013 can you add tests against this case?

@ronaldtse ronaldtse added the bug label Dec 1, 2022
@ronaldtse ronaldtse moved this to 🆕 New in Metanorma Dec 1, 2022
@ronaldtse ronaldtse moved this from 🆕 New to 🌋 Urgent in Metanorma Dec 1, 2022
@Intelligent2013 Intelligent2013 moved this from 🌋 Urgent to 🏗 In progress in Metanorma Dec 1, 2022
@Intelligent2013
Contributor

@ronaldtse Apache FOP doesn't automatically allow a line break after the underscore character, so there may be cases where the text overflows:
image

Therefore, the XSLT adds a zero-width space after each _.
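
To make the mechanism concrete, here is a minimal Python sketch of the same idea (the real transformation is done in XSLT, which is not shown in this thread). The function names are hypothetical and only illustrate inserting U+200B after each underscore so that a renderer gets a line-break opportunity:

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE: invisible, but permits a line break


def add_break_opportunities(text: str) -> str:
    """Insert a zero-width space after every underscore (hypothetical helper)."""
    return text.replace("_", "_" + ZWSP)


def strip_break_opportunities(text: str) -> str:
    """Undo the insertion, e.g. before comparing or copying text."""
    return text.replace(ZWSP, "")


sample = "make_elementary_space(es_integers)"
marked = add_break_opportunities(sample)

# Splitting on the zero-width space shows where a line break is now allowed.
print(marked.split(ZWSP))
```

The visible text is unchanged, but the string now carries extra invisible characters, which is why text extraction and comparison tools can be affected.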

In the Apache FOP Intermediate Format, the text the_integers is represented as a sequence of two text elements:

<font family="Noto Sans Mono"/>
<font family="Courier New"/>
<text x="13200" y="22132" next-is-space="true" foi:struct-ref="a12">the_</text>
<font family="Noto Sans Mono"/>
<font family="Courier New"/>
<text x="39600" y="22132" next-is-space="true" foi:struct-ref="a12">integers </text>

Each text element has next-is-space="true". It looks like Apache FOP adds it for the zero-width space, and it is treated as a space in copy-paste and in the comparison feature: the_ integers
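
The effect can be sketched in Python. This is an assumption about how a naive consumer might interpret next-is-space, not Apache FOP's or Acrobat's actual logic. The wrapper element `line` is hypothetical, and the foi:struct-ref attribute is omitted so the fragment parses without a namespace declaration:

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the Intermediate Format excerpt above.
# <line> is a hypothetical wrapper; foi:struct-ref is left out for brevity.
FRAGMENT = """
<line>
  <text x="13200" y="22132" next-is-space="true">the_</text>
  <text x="39600" y="22132" next-is-space="true">integers </text>
</line>
"""


def extract(fragment: str, honor_next_is_space: bool = True) -> str:
    """Concatenate <text> runs, optionally adding a space after flagged runs."""
    parts = []
    for el in ET.fromstring(fragment).iter("text"):
        run = el.text or ""
        parts.append(run)
        flagged = el.get("next-is-space") == "true"
        # Only add a space if the run does not already end with one.
        if honor_next_is_space and flagged and not run.endswith(" "):
            parts.append(" ")
    return "".join(parts)


print(repr(extract(FRAGMENT)))                             # flag honored
print(repr(extract(FRAGMENT, honor_next_is_space=False)))  # flag ignored
```

Honoring the flag on the first run splits the identifier in two, while ignoring it keeps the_integers intact.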

I'll think about a workaround. Maybe we need to omit next-is-space for the zero-width space (except when the text is the last on the line).

Intelligent2013 added a commit to metanorma/mn2pdf that referenced this issue Dec 2, 2022
Intelligent2013 added a commit to metanorma/mn2pdf that referenced this issue Dec 2, 2022
@Intelligent2013
Contributor

The zero-width space has been removed from the middle of words:
image

Next step: remove it from the end of the line.

@Intelligent2013
Contributor

Actually, there are no spaces in the generated PDF:

  • the copied text doesn't contain a space: the_integers : elementary_space := make_elementary_space(es_integers);
  • there is no next-is-space="true" on the text space(es_ in the Apache FOP Intermediate Format XML:
<text x="290400" y="22132" foi:struct-ref="7d">make_</text>
<font family="Noto Sans Mono"/>
<font family="Courier New"/>
<text x="323400" y="22132" foi:struct-ref="7d">elementary_</text>
<font family="Noto Sans Mono"/>
<font family="Courier New"/>
<text x="396000" y="22132" foi:struct-ref="7d">space(es_</text>
<font family="Noto Sans Mono"/>
<font family="Courier New"/>
<text x="0" y="35332" next-is-space="true" foi:struct-ref="7d">integers);</text>

I think Acrobat adds space for line break in the comparison feature. To check this assumption, I created two Word documents with the same text but different font sizes.
The 24pt text doesn't fit on one line. There is no space between tex and the t on the next line, but Acrobat replaces the line break with a space:
image
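
As a rough illustration of this assumption (a sketch only; how Acrobat actually extracts text is not known here), the Python snippet below groups the text runs from the Intermediate Format excerpt above by their y coordinate and joins the resulting lines with or without a separator. The function name join_lines is hypothetical:

```python
from itertools import groupby

# (y, text) pairs taken from the Intermediate Format excerpt above.
RUNS = [
    (22132, "make_"),
    (22132, "elementary_"),
    (22132, "space(es_"),
    (35332, "integers);"),
]


def join_lines(runs, line_separator: str) -> str:
    """Concatenate runs sharing a y coordinate, then join lines with a separator."""
    lines = [
        "".join(text for _, text in group)
        for _, group in groupby(runs, key=lambda run: run[0])
    ]
    return line_separator.join(lines)


print(join_lines(RUNS, " "))  # a tool that turns each line break into a space
print(join_lines(RUNS, ""))   # a tool that ignores line breaks
```

Under this assumption, a space appears in the compared text exactly where the line wraps, even though the PDF content itself contains no space there.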

@Intelligent2013
Contributor

Issue partially fixed in mn2pdf v1.52 (https://github.com/metanorma/mn2pdf/releases/tag/v1.52):

Zero-width space removed from middle of words:
image

@Intelligent2013 Intelligent2013 moved this from 🏗 In progress to ✅ Done in Metanorma Dec 3, 2022
@ronaldtse
Contributor Author

I think Acrobat adds space for line break in the comparison feature.

Is there a way to get rid of this problem?

@Intelligent2013
Contributor

Is there a way to get rid of this problem?

At the moment I don't know of a solution. I've asked here: https://community.adobe.com/t5/acrobat-discussions/redundant-space-character-in-the-compare-result/td-p/13393538

@ronaldtse
Contributor Author

Thanks @Intelligent2013 !

@ronaldtse
Contributor Author

Conclusion: This is an Adobe Acrobat "Compare PDFs" bug; it cannot be addressed in Metanorma.

@ronaldtse
Contributor Author

@stuartgalt : Does the PDF group (ISO/TC 171/SC 2) have any spec for Compare PDFs?
