Add text extraction based on ToUnicode cmap #314

dkaluza · 2024-08-20T10:44:31Z

Fixes #125.

* add new dictionary test

Heinenen

Some small things I would change, but overall good work!

Heinenen · 2024-08-21T21:01:02Z

src/document.rs

-                "Identity-H" => "?Identity-H Unimplemented?".to_string(), // Unimplemented
-                _ => String::from_utf8_lossy(bytes).to_string(),
-            }
+            info!("Decoding text with {:#?}", encoding);


What do you think of lowering the log level to debug here?
#260 complains about this particular instance.

Heinenen · 2024-08-21T21:18:43Z

src/document.rs

-    pub fn encode_text(encoding: Option<&str>, text: &str) -> Vec<u8> {
+    pub fn encode_text(encoding: Option<&Encoding>, text: &str) -> Vec<u8> {
        if let Some(encoding) = encoding {
-            match encoding {
-                "StandardEncoding" => string_to_bytes(encodings::STANDARD_ENCODING, text),
-                "MacRomanEncoding" => string_to_bytes(encodings::MAC_ROMAN_ENCODING, text),
-                "MacExpertEncoding" => string_to_bytes(encodings::MAC_EXPERT_ENCODING, text),
-                "WinAnsiEncoding" => string_to_bytes(encodings::WIN_ANSI_ENCODING, text),
-                "PDFDocEncoding" => string_to_bytes(encodings::PDF_DOC_ENCODING, text),
-                "UniGB-UCS2-H" | "UniGB−UTF16−H" => encodings::encode_utf16_be(text).to_vec(),
-                "Identity-H" => vec![], // Unimplemented
-                _ => text.as_bytes().to_vec(),
-            }
+            encoding.string_to_bytes(text)
        } else {
-            string_to_bytes(encodings::STANDARD_ENCODING, text)
+            string_to_bytes(&encodings::STANDARD_ENCODING, text)
        }
    }
 }


I think it would make sense to let encode_text and decode_text take &Encoding instead of an Option<&Encoding> and let the caller handle any fallback behavior.

Also, if I understand the spec correctly, StandardEncoding is not a good general-purpose fallback; it should only be used when a Type 1 font is used.

I agree. According to the spec, StandardEncoding is a fallback only for Type 1 fonts, and only in certain circumstances, so it is rarely used. I left it in to maintain the current behavior.

Anyway, it will be cleaned up if we remove the Option.
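
For illustration, here is a minimal sketch of the interface discussed in this thread: the encoding function takes `&Encoding`, and the caller decides what a missing encoding means. The `Encoding` variants, `ascii_table`, and `encode_for_font` are hypothetical stand-ins and are not taken from the PR; only the idea of dropping the `Option` from `encode_text` comes from the review.

```rust
/// Hypothetical stand-in for the PR's `Encoding` type; variants are illustrative.
#[derive(Debug)]
enum Encoding {
    /// Single-byte table: index = byte value, entry = code unit (if assigned).
    OneByte([Option<u16>; 256]),
    /// Text stored as big-endian UTF-16.
    Utf16Be,
}

impl Encoding {
    /// Encode `text`; characters the table cannot represent are skipped.
    fn encode(&self, text: &str) -> Vec<u8> {
        match self {
            Encoding::OneByte(table) => text
                .chars()
                .filter_map(|c| {
                    table
                        .iter()
                        .position(|&cp| cp.map(u32::from) == Some(c as u32))
                })
                .map(|i| i as u8)
                .collect(),
            Encoding::Utf16Be => text.encode_utf16().flat_map(u16::to_be_bytes).collect(),
        }
    }
}

/// Takes `&Encoding` rather than `Option<&Encoding>`: no hidden default.
fn encode_text(encoding: &Encoding, text: &str) -> Vec<u8> {
    encoding.encode(text)
}

/// Caller-side policy (assumed here): without an encoding, report `None`
/// instead of silently falling back to StandardEncoding.
fn encode_for_font(font_encoding: Option<&Encoding>, text: &str) -> Option<Vec<u8>> {
    let encoding = font_encoding?;
    Some(encode_text(encoding, text))
}

/// Toy table for the sketch: ASCII maps to itself, everything else is unassigned.
fn ascii_table() -> [Option<u16>; 256] {
    let mut table = [None; 256];
    for b in 0u16..128 {
        table[b as usize] = Some(b);
    }
    table
}
```

With this shape, a caller that really wants a default can still pick one explicitly, but the choice is visible at the call site rather than buried in the encoder.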

Heinenen · 2024-08-21T21:37:46Z

src/encodings/mappings.rs

@@ -1,7 +1,7 @@
 use super::glyphnames::Glyph;

-/// MacRomanEncoding
-pub const MAC_ROMAN_ENCODING: [Option<u16>; 256] = [
+pub type ByteToGlyphMap = [Option<u16>; 256];


I'm not a huge fan of the type name. What do you think of CharacterEncoding, CodedCharacterSet, or CharacterMap?
CharacterEncoding is mostly correct and closest to the spec; CodedCharacterSet is (according to Wikipedia) the correct term, cf. https://en.wikipedia.org/wiki/Character_encoding#Terminology

CodedCharacterSet sounds good in my opinion, changing!
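
As a rough sketch of what the renamed alias describes: a 256-entry table indexed by byte value. The alias shape comes from the diff above; the assumption that each entry is a UTF-16 code unit, as well as the `bytes_to_string` and `toy_set` helpers, are illustrative and not confirmed by this excerpt.

```rust
/// Maps each byte value (the array index) to a code unit, or `None`
/// when the byte is unassigned in this character set.
pub type CodedCharacterSet = [Option<u16>; 256];

/// Decode bytes through the table, dropping unassigned bytes.
/// Assumes entries are UTF-16 code units.
pub fn bytes_to_string(set: &CodedCharacterSet, bytes: &[u8]) -> String {
    let units: Vec<u16> = bytes.iter().filter_map(|&b| set[b as usize]).collect();
    String::from_utf16_lossy(&units)
}

/// Toy table for illustration only: printable ASCII plus one Latin-1 letter.
pub fn toy_set() -> CodedCharacterSet {
    let mut set = [None; 256];
    for b in 0x20u16..0x7f {
        set[b as usize] = Some(b);
    }
    set[0xe9] = Some(0x00e9); // é
    set
}
```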

Heinenen · 2024-08-21T21:42:16Z

src/encodings/mod.rs

+                    .collect();
+                Ok(UTF_16BE.decode(&utf16_str).0.to_string())
+            }
+            _ => Err(Error::ContentDecode),


I'd rather remove the catch-all pattern. Adding an enum variant in the future then leads to a compile error, and the author of that change can think about the appropriate handling.

Good catch, changed the code to avoid catch-all.
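
To make the reasoning concrete, here is a small sketch of a match with no catch-all arm. The `Encoding` variants and `DecodeError` are hypothetical and simplified, not the PR's actual types. If a variant is later added to the enum, this `match` stops compiling until the new case is handled explicitly.

```rust
#[derive(Debug, PartialEq)]
enum DecodeError {
    InvalidData,
    Unsupported,
}

/// Hypothetical, simplified encoding enum for the sketch.
enum Encoding {
    Latin1,
    Utf16Be,
    Named(&'static str),
}

fn decode(encoding: &Encoding, bytes: &[u8]) -> Result<String, DecodeError> {
    // Every variant is listed; there is deliberately no `_ =>` arm.
    match encoding {
        // Each Latin-1 byte is the Unicode code point of the same value.
        Encoding::Latin1 => Ok(bytes.iter().map(|&b| b as char).collect()),
        Encoding::Utf16Be => {
            // A trailing odd byte, if any, is ignored by `chunks_exact`.
            let units: Vec<u16> = bytes
                .chunks_exact(2)
                .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
                .collect();
            String::from_utf16(&units).map_err(|_| DecodeError::InvalidData)
        }
        // Unsupported names are still rejected, but by an explicit arm.
        Encoding::Named(_) => Err(DecodeError::Unsupported),
    }
}
```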

Heinenen · 2024-08-21T21:43:41Z

src/encodings/mod.rs

+            _ => {
+                warn!("Unknown encoding used to encode text {self:?}");
+                text.as_bytes().to_vec()
+            }


Also remove this one. Same reasoning as above.

Heinenen · 2024-08-21T22:17:37Z

src/encodings/mod.rs

+            Self::OneByteEncoding(map) => string_to_bytes(map, text),
+            Self::SimpleEncoding(name) if ["UniGB-UCS2-H", "UniGB-UTF16-H"].contains(name) => encode_utf16_be(text),
+            Self::UnicodeMapEncoding(_unicode_map) => {
+                //maybe only possible if the unicode map is an identity?


This should be "// maybe ..." (with a space after the slashes).

dkaluza · 2024-08-22T06:58:15Z

Thanks for the review!
Addressed mentioned issues.

Changing the decode/encode interface required some changes to the extract and replace text functions in parser_aux.
I added warnings there to inform the user that results might not be as expected when the encoding is None.

Marinus Enzinger and others added 8 commits May 21, 2024 10:34

Add ToUnicode CMap text decoding

26d8380

* add new dictionary test

Temporarily switch default parser to pom

a2e27ea

Refactor common cmap parse structures for nom implementation

fadea10

Add unicode tests

763f991

Merge branch 'master' into unicode-cmap

07cbc2a

Add nom parser for ToUnicode font key

be9e863

Try to use ToUnicode for text extraction without encoding.

73bd890

Merge branch 'master' into unicode-cmap

795e696

williamdes requested a review from Heinenen August 20, 2024 10:51

dkaluza added 2 commits August 20, 2024 13:15

Fix clippy passing unit type warnings in nom parser

06cecc2

Add load unicode async test

29f83e2

Heinenen requested changes Aug 21, 2024

Remove option form encode/decode functions delegating error handling …

9b39561

…on user

Heinenen approved these changes Aug 22, 2024

J-F-Liu merged commit 5859443 into J-F-Liu:master Aug 23, 2024
8 checks passed

dkaluza deleted the unicode-cmap branch August 24, 2024 11:46

Heinenen mentioned this pull request Sep 4, 2024

ToUnicode CMap error with 0.34.0 #319

Closed