-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add text extraction based on ToUnicode cmap #314
Conversation
* add new dictionary test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small things I would change, but overall good work!
src/document.rs
Outdated
"Identity-H" => "?Identity-H Unimplemented?".to_string(), // Unimplemented | ||
_ => String::from_utf8_lossy(bytes).to_string(), | ||
} | ||
info!("Decoding text with {:#?}", encoding); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of lowering the log level to debug here?
#260 complains about this particular instance.
src/document.rs
Outdated
pub fn encode_text(encoding: Option<&str>, text: &str) -> Vec<u8> { | ||
pub fn encode_text(encoding: Option<&Encoding>, text: &str) -> Vec<u8> { | ||
if let Some(encoding) = encoding { | ||
match encoding { | ||
"StandardEncoding" => string_to_bytes(encodings::STANDARD_ENCODING, text), | ||
"MacRomanEncoding" => string_to_bytes(encodings::MAC_ROMAN_ENCODING, text), | ||
"MacExpertEncoding" => string_to_bytes(encodings::MAC_EXPERT_ENCODING, text), | ||
"WinAnsiEncoding" => string_to_bytes(encodings::WIN_ANSI_ENCODING, text), | ||
"PDFDocEncoding" => string_to_bytes(encodings::PDF_DOC_ENCODING, text), | ||
"UniGB-UCS2-H" | "UniGB−UTF16−H" => encodings::encode_utf16_be(text).to_vec(), | ||
"Identity-H" => vec![], // Unimplemented | ||
_ => text.as_bytes().to_vec(), | ||
} | ||
encoding.string_to_bytes(text) | ||
} else { | ||
string_to_bytes(encodings::STANDARD_ENCODING, text) | ||
string_to_bytes(&encodings::STANDARD_ENCODING, text) | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would make sense to let encode_text
and decode_text
take &Encoding
instead of an Option<&Encoding>
and let the caller handle any fallback behavior.
Also, if I understand the spec correctly, StandardEncoding is not really a good default that should be used as fallback, but should only be used when a Type 1 font is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, according to spec the StandardEncoding is a fallback only for Type 1 fonts in only certain circumstances, and is rarely used. By leaving it I wanted to maintain current behavior.
Anyway it will be cleaned up if we remove the Option.
src/encodings/mappings.rs
Outdated
@@ -1,7 +1,7 @@ | |||
use super::glyphnames::Glyph; | |||
|
|||
/// MacRomanEncoding | |||
pub const MAC_ROMAN_ENCODING: [Option<u16>; 256] = [ | |||
pub type ByteToGlyphMap = [Option<u16>; 256]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a huge fan of the type name, what do you think of CharacterEncoding
, CodedCharacterSet
, or CharacterMap
?
CharacterEncoding
is mostly correct and closest to the spec, CodedCharacterSet
is (according to Wikipedia) the correct term, cf. https://en.wikipedia.org/wiki/Character_encoding#Terminology
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CodedCharacterSet sounds good in my opinion, changing!
src/encodings/mod.rs
Outdated
.collect(); | ||
Ok(UTF_16BE.decode(&utf16_str).0.to_string()) | ||
} | ||
_ => Err(Error::ContentDecode), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather remove the catch-all pattern. Adding an enum variant in the future then leads to a compile error, and the author of that change can think about the appropriate handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, changed the code to avoid catch-all.
src/encodings/mod.rs
Outdated
_ => { | ||
warn!("Unknown encoding used to encode text {self:?}"); | ||
text.as_bytes().to_vec() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also remove this one. Same reasoning as above.
src/encodings/mod.rs
Outdated
Self::OneByteEncoding(map) => string_to_bytes(map, text), | ||
Self::SimpleEncoding(name) if ["UniGB-UCS2-H", "UniGB-UTF16-H"].contains(name) => encode_utf16_be(text), | ||
Self::UnicodeMapEncoding(_unicode_map) => { | ||
//maybe only possible if the unicode map is an identity? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be // maybe ...
(with space)
Thanks for the review! Change of the decode/encode interface resulted in some changes in parser_aux extract and replace text. |
Fixes #125.