Feature Request - Remove junk Characters from Text #23

Closed
looneyapache opened this issue Apr 12, 2022 · 9 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@looneyapache

Hi,

Your plugin works great!

Is there any possibility of a feature that removes junk characters from a paragraph or pasted text, leaving only the periods (".") where they are?

Thank you!

@Benature
Owner

Sorry, I don't quite understand what junk characters are. Can you give an example, such as "the input is blablabla and the expected output is blablabla"?

@looneyapache
Author

Sure.

The following is the input text:

The quick *~~~Brown f%%ox jumps) right~ over$ the lazy! dog).

Expected output (clean):

The quick brown fox jumps right over the lazy dog.
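
For illustration, here is a minimal Python sketch of this kind of cleanup. It is not the plugin's code. The rule for what counts as "junk" (anything other than ASCII letters, digits, whitespace, and periods) and the name `remove_junk` are assumptions made for this example.

```python
import re

# Assumption: "junk" is any character that is not an ASCII letter, a digit,
# whitespace, or a period. This rule is illustrative only.
JUNK = re.compile(r"[^A-Za-z0-9\s.]")

def remove_junk(text: str) -> str:
    """Delete junk characters, then collapse runs of spaces and tabs."""
    cleaned = JUNK.sub("", text)
    return re.sub(r"[ \t]+", " ", cleaned).strip()

sample = "The quick *~~~Brown f%%ox jumps) right~ over$ the lazy! dog)."
print(remove_junk(sample))
# prints: The quick Brown fox jumps right over the lazy dog.
```

Note that this sketch leaves the capital "B" in "Brown" untouched. The lowercase "brown" in the expected output would need a separate case rule. An ASCII-only character class would also strip accented and non-Latin letters, so a real implementation would probably need a configurable character set.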

@Benature
Owner

Oh no, what happened to the text? 😂

I think this feature is easy to implement, but I wonder in what circumstances you would encounter such messy text.

@looneyapache
Author

It typically happens when I use OCR on old (hand-typed) documents, corrupted PDFs, or corrupted old dBase/FoxPro memo files. I think it's mainly because old documents are yellowed and smudged and hard to scan, so the OCR inserts characters on its own :)

@looneyapache
Author

Here is one example I picked from Google. Some of my documents closely resemble it:
[Image: bitonal-doc]

@Benature added the enhancement and good first issue labels on Jul 15, 2022
@Benature
Owner

The pictures you provided are illegible even to me. What does the copied text look like? My OCR result is below:

The preser.t rerort is Oric of 玉 numbr wiich dr:prr4lrex duri上i anJ 1945 fpr the Frreign poncnic 永ministration Lmembevs:st t.

the unitedi St:tes Tarift' Cornissint:. Orine to the desire of thr 我 、” Econoniy Aininistration to obt ir this matcri.1. !.xs prompt1y as pces1.o, the reports yere Yot revievvd by the Trri:: Connissien. A11 st:tenont.s o1 fagt or opinion in tese renorts CI &ttributithlp t; the. irciyilei Etaef nembers tho prraredi th.em. Th.:Y. 1l,洲以:rieinlt:itsed f conf idential u: of Goverrnent xgencivs, ut .•r(〉 noR brin.: Hdpnniis with the consent oi the For(imr. Eocnnic iuiri:.istrtior:.

If the copied text is similar to this, I don't think simply deleting junk characters will produce the expected text.
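
To make the limitation concrete, here is a small Python sketch that applies a symbol-stripping rule to a fragment of the OCR result above. The rule and the name `strip_symbols` are hypothetical and used only for this illustration.

```python
import re

def strip_symbols(text: str) -> str:
    # Hypothetical rule: keep only ASCII letters, digits, whitespace, and periods.
    return re.sub(r"[^A-Za-z0-9\s.]", "", text)

# Fragment copied from the OCR result quoted in this comment.
ocr_fragment = "the unitedi St:tes Tarift' Cornissint:."
print(strip_symbols(ocr_fragment))
# prints: the unitedi Sttes Tarift Cornissint.
```

The stray symbols are gone, but the misrecognized letters remain, so the words are still misspelled. Deleting characters cannot restore letters that the OCR read incorrectly or dropped.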

@looneyapache
Author

looneyapache commented Aug 23, 2022

Thank you for responding! :)

I am a newbie, and I posted the same issue on the Obsidian forum to ask for help ( https://forum.obsidian.md/t/replace-all-asterisks-in-a-given-file/35238/25?u=looney.apache ).

The solution was not as elegant as yours, but it works for me (at least for now).

I paste the text to be scrubbed into https://textcleaner.net/ and get back the cleaned text I need.
Also, not all of my OCR results are in such bad shape. Most of them are good and only need some scrubbing to be useful.

I value your assistance. Thank you!

@Benature
Owner

The webpage is fantastic; it supports a lot of options. The only fly in the ointment is that it cannot be used inside Obsidian.

Although I could probably refer to the code of textcleaner, I'm not sure whether there would be copyright issues.

But I don't think adding so many options to Obsidian is a good idea, since they would take up a lot of space. Such a dilemma 😂

@looneyapache
Author

True!
Also, I've been using "Obsidian Text Format" and I've discovered that it's fantastic at fixing broken paragraphs.

Unquestionably one of the GOOD plugins.
