Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add text extraction based on ToUnicode cmap #314

Merged
merged 11 commits into from
Aug 23, 2024

Conversation

dkaluza
Copy link
Contributor

@dkaluza dkaluza commented Aug 20, 2024

Fixes #125.

@williamdes williamdes requested a review from Heinenen August 20, 2024 10:51
Copy link
Collaborator

@Heinenen Heinenen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small things I would change, but overall good work!

src/document.rs Outdated
"Identity-H" => "?Identity-H Unimplemented?".to_string(), // Unimplemented
_ => String::from_utf8_lossy(bytes).to_string(),
}
info!("Decoding text with {:#?}", encoding);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of lowering the log level to debug here?
#260 complains about this particular instance.

src/document.rs Outdated
Comment on lines 591 to 588
pub fn encode_text(encoding: Option<&str>, text: &str) -> Vec<u8> {
pub fn encode_text(encoding: Option<&Encoding>, text: &str) -> Vec<u8> {
if let Some(encoding) = encoding {
match encoding {
"StandardEncoding" => string_to_bytes(encodings::STANDARD_ENCODING, text),
"MacRomanEncoding" => string_to_bytes(encodings::MAC_ROMAN_ENCODING, text),
"MacExpertEncoding" => string_to_bytes(encodings::MAC_EXPERT_ENCODING, text),
"WinAnsiEncoding" => string_to_bytes(encodings::WIN_ANSI_ENCODING, text),
"PDFDocEncoding" => string_to_bytes(encodings::PDF_DOC_ENCODING, text),
"UniGB-UCS2-H" | "UniGB−UTF16−H" => encodings::encode_utf16_be(text).to_vec(),
"Identity-H" => vec![], // Unimplemented
_ => text.as_bytes().to_vec(),
}
encoding.string_to_bytes(text)
} else {
string_to_bytes(encodings::STANDARD_ENCODING, text)
string_to_bytes(&encodings::STANDARD_ENCODING, text)
}
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to let encode_text and decode_text take &Encoding instead of an Option<&Encoding> and let the caller handle any fallback behavior.

Also, if I understand the spec correctly, StandardEncoding is not really a good default that should be used as fallback, but should only be used when a Type 1 font is used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, according to spec the StandardEncoding is a fallback only for Type 1 fonts in only certain circumstances, and is rarely used. By leaving it I wanted to maintain current behavior.

Anyway it will be cleaned up if we remove the Option.

@@ -1,7 +1,7 @@
use super::glyphnames::Glyph;

/// MacRomanEncoding
pub const MAC_ROMAN_ENCODING: [Option<u16>; 256] = [
pub type ByteToGlyphMap = [Option<u16>; 256];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a huge fan of the type name, what do you think of CharacterEncoding, CodedCharacterSet, or CharacterMap?
CharacterEncoding is mostly correct and closest to the spec, CodedCharacterSet is (according to Wikipedia) the correct term, cf. https://en.wikipedia.org/wiki/Character_encoding#Terminology

Copy link
Contributor Author

@dkaluza dkaluza Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CodedCharacterSet sounds good in my opinion, changing!

.collect();
Ok(UTF_16BE.decode(&utf16_str).0.to_string())
}
_ => Err(Error::ContentDecode),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather remove the catch-all pattern. Adding an enum variant in the future then leads to a compile error, and the author of that change can think about the appropriate handling.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, changed the code to avoid catch-all.

Comment on lines 72 to 75
_ => {
warn!("Unknown encoding used to encode text {self:?}");
text.as_bytes().to_vec()
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also remove this one. Same reasoning as above.

Self::OneByteEncoding(map) => string_to_bytes(map, text),
Self::SimpleEncoding(name) if ["UniGB-UCS2-H", "UniGB-UTF16-H"].contains(name) => encode_utf16_be(text),
Self::UnicodeMapEncoding(_unicode_map) => {
//maybe only possible if the unicode map is an identity?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be // maybe ... (with space)

@dkaluza
Copy link
Contributor Author

dkaluza commented Aug 22, 2024

Thanks for the review!
Addressed mentioned issues.

Change of the decode/encode interface resulted in some changes in parser_aux extract and replace text.
Added warnings there to inform the user that results might not be as expected in situations with None encoding.

@J-F-Liu J-F-Liu merged commit 5859443 into J-F-Liu:master Aug 23, 2024
8 checks passed
@dkaluza dkaluza deleted the unicode-cmap branch August 24, 2024 11:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement decoding of Unicode characters
3 participants