Can't reach anything under noscript tag #1105
Comments
Try `cheerio.load(pageContent, { xmlMode: true });` |
I'm seeing this too. This should not have been closed. |
Enabling xml mode fixes this, otherwise parse5 will always strip edit: See #1105 (comment) for how to fix this. |
Right, but it's still an issue, right? It should stay open until it gets resolved. |
Try to match the noscript tag itself, get the html by calling .html() and then load that html with cheerio again. That way you'll be able to match any element under the noscript tag. |
This is definitely still a bug with cheerio, as cheerio is essentially a browser without JavaScript support. The noscript tag is intended to provide content for browsers that do not support javascript (which would include search engines, web crawlers, and web scrapers). "xmlMode: true" has several other side effects (see documentation) which can cause most pages, and especially those in question to fail to parse. |
Quickly looking through, it seems like https://github.com/cheeriojs/cheerio/blob/208bce1ee8ed921dbd0fc2988644fd3a68bf8bd1/lib/parse.js needs to be updated to turn I'm not sure if there are security implications of doing that though. |
|
The `scriptingEnabled` flag was added to parse5 in version 5.0.0 ([parse5: ParserOptions](https://github.com/inikulin/parse5/blob/master/packages/parse5/docs/options/parser-options.md)). With `scriptingEnabled=true`, parse5 parses the contents of `<noscript>` tags as raw text. With `scriptingEnabled=false`, it parses the contents of `<noscript>` tags as HTML. The latter is the preferred default behavior for cheerio, as we do not want to execute the JavaScript but do want to view the page as a scripts-disabled browser would. See cheeriojs#1105 for discussion on this issue.
This is the first google result for "cheerio noscript". If you are attempting to parse html inside a noscript tag with cheerio, see the doc page linked below. Set the
|
https://rishi.app/blog/parsing-noscript-elements-using-cheerio-in-node-js/ describes another solution. |
I've got `cheerio.load(pageContent, { decodeEntities: false })`, but when I try to match an element that's under a `noscript` tag it doesn't work. I thought setting `decodeEntities` to `false` would allow this. How do I match that element?