(URGENT) Cannot fetch from "W3C PR-math-19980224" #38

Open
manuelfuenmayor opened this issue Nov 9, 2023 · 11 comments
Labels
bug Something isn't working

Comments

@manuelfuenmayor

I'm trying to auto-fetch the reference of this document: https://www.w3.org/TR/1998/PR-math-19980224/,
with no success:

$ bundle exec relaton fetch "W3C PR-math-19980224"
[relaton-w3c] (W3C PR-math-19980224) Fetching from Relaton repository ...
[relaton-w3c] Downloaded index from https://raw.githubusercontent.com/relaton/relaton-data-w3c/main/index1.zip
[relaton-w3c] (W3C PR-math-19980224) Not found.
No matching bibliographic entry found

@ronaldtse
Contributor

I've also tried these:

[relaton-w3c] (W3C TR PR-math-19980224) Fetching from Relaton repository ...
[relaton-w3c] (W3C TR PR-math-19980224) Not found.
No matching bibliographic entry found
[relaton-w3c] (W3C TR math-19980224) Fetching from Relaton repository ...
[relaton-w3c] (W3C TR math-19980224) Not found.
No matching bibliographic entry found

@ronaldtse ronaldtse added the bug Something isn't working label Nov 9, 2023
@ronaldtse ronaldtse changed the title Cannot fetch from "W3C PR-math-19980224" (URGENT) Cannot fetch from "W3C PR-math-19980224" Nov 9, 2023
@andrew2net
Contributor

andrew2net commented Sep 29, 2024

@ronaldtse there is no W3C PR-math-19980224 document in the dataset. It appears only as the target of an obsoletes relation in the record of the document that replaced it. Do we need obsoleted docs in our data repo?

<REC rdf:about="https://www.w3.org/TR/1998/REC-MathML-19980407/">
  <dc:date>1998-04-07</dc:date>
  <dc:title>Mathematical Markup Language (MathML) 1.0 Specification</dc:title>
  <doc:obsoletes rdf:resource="https://www.w3.org/TR/1998/PR-math-19980224"/>
  <doc:versionOf rdf:resource="https://www.w3.org/TR/REC-MathML/"/>
  <editor rdf:parseType="Resource">
    <contact:fullName>Patrick D F Ion</contact:fullName>
  </editor>
  <editor rdf:parseType="Resource">
    <contact:fullName>Robert R Miner</contact:fullName>
  </editor>
  <org:deliveredBy rdf:parseType="Resource">
    <contact:homePage rdf:resource="https://www.w3.org/Math/"/>
  </org:deliveredBy>
  <mat:hasErrata rdf:resource="https://www.w3.org/MarkUp/mathml101-updates/errata.html"/>
</REC>

@ronaldtse
Contributor

Yes, we need all documents in the dataset, including obsoleted ones. A document being obsolete doesn't mean others have stopped citing it.

@andrew2net
Contributor

Then we need to scrape the documents missing from tr.rdf from the www.w3.org website. We don't know the full list of missing documents, but we can check whether the target of each relation is present in tr.rdf and, if it isn't, get it from www.w3.org.
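
The check described here could look roughly like the following Python sketch. It is illustrative only, not relaton-w3c code: the namespace URIs are assumptions to verify against the real tr.rdf header, and `normalize` and `missing_obsoleted` are hypothetical helper names. The sample reuses the MathML record quoted earlier in this thread, where the obsoleted URL has no trailing slash although the published URL does, so URLs are compared without it.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed for this sketch; check them against the real tr.rdf.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DOC = "http://www.w3.org/2000/10/swap/pim/doc#"

SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:doc="{DOC}"
         xmlns="http://www.w3.org/2001/02pd/rec54#">
  <REC rdf:about="https://www.w3.org/TR/1998/REC-MathML-19980407/">
    <doc:obsoletes rdf:resource="https://www.w3.org/TR/1998/PR-math-19980224"/>
    <doc:versionOf rdf:resource="https://www.w3.org/TR/REC-MathML/"/>
  </REC>
</rdf:RDF>"""


def normalize(url):
    """Compare URLs without a trailing slash; the data mixes both forms."""
    return url.rstrip("/")


def missing_obsoleted(rdf_xml):
    """Return obsoleted-document URLs that have no record of their own."""
    root = ET.fromstring(rdf_xml)
    # Every top-level element is one document record identified by rdf:about.
    known = {normalize(rec.get(f"{{{RDF}}}about", "")) for rec in root}
    targets = {
        normalize(rel.get(f"{{{RDF}}}resource", ""))
        for rel in root.iter(f"{{{DOC}}}obsoletes")
    }
    return sorted(targets - known)


print(missing_obsoleted(SAMPLE))
```

The returned list is the set of pages that would have to be fetched from www.w3.org.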

@andrew2net
Contributor

andrew2net commented Sep 30, 2024

@ronaldtse I see what happened. The most recent tr.rdf file doesn't include obsoleted docs, so we created an archive repo, https://github.com/relaton/w3c-tr-archive. But that repo hasn't been updated for two years. Where can we get the previous tr.rdf files?

UPD: found the issue with the link to the archives.
UPD2: unfortunately, the archive starts in 2002, so it doesn't have docs obsoleted before that year. The only option is to fetch all docs from the tr.rdf file and the archives, then check all relations and scrape the missing docs from the www.w3.org site.

@ronaldtse
Contributor

@andrew2net yes you are correct. That's likely the only way we can get the full archive.

@andrew2net
Contributor

@ronaldtse wouldn't it be better to scrape all W3C documents from https://www.w3.org/TR/?
Each document has a history page.

In the history, we can find obsoleted docs, for example:
https://www.w3.org/standards/history/REC-MathML/

@ronaldtse
Contributor

@andrew2net then let's scrape that, but I don't think the details are as complete as in the RDF file. So we have to combine the two?

@andrew2net
Contributor

@ronaldtse you are right. RDF data has more details.

We need to do the following:

  • Download all archives.
  • Because each archive file mostly repeats the previous version and processing all of them takes too much time, merge all the files into one w3c.rdf file, replacing old document records with newer ones.
  • Update the W3C data fetcher to merge the current tr.rdf into w3c.rdf daily.

Once that is done, we will have as much as the archives can give us. The w3c.rdf file will only be missing documents obsoleted before the first available archive was created. The documents that obsolete those missing docs will still be in w3c.rdf, so we can check every obsoletes relation against w3c.rdf and fetch the target from the W3C website if it isn't there.
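
The merge step in the list above might be sketched as follows. This is a minimal Python illustration, not the actual fetcher: `merge_snapshots` and `snapshot` are hypothetical names, the demo records use made-up example.org URLs and an illustrative `title` element, and the RDF namespace URI is an assumption. Snapshots are passed oldest first, and a record whose `rdf:about` URL appears again in a later snapshot replaces the earlier one.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"  # assumed namespace URI
ABOUT = f"{{{RDF}}}about"


def merge_snapshots(snapshots):
    """Merge RDF snapshots given oldest first; later records replace earlier ones."""
    merged = {}
    for xml_text in snapshots:
        for record in ET.fromstring(xml_text):
            url = record.get(ABOUT)
            if url:
                # Key without the trailing slash so both URL forms collide.
                merged[url.rstrip("/")] = record
    root = ET.Element(f"{{{RDF}}}RDF")
    root.extend(list(merged.values()))
    return root


def snapshot(*records):
    """Build a tiny RDF document for the demo; element names are illustrative."""
    body = "".join(
        f'<REC rdf:about="{url}"><title>{title}</title></REC>'
        for url, title in records
    )
    return f'<rdf:RDF xmlns:rdf="{RDF}">{body}</rdf:RDF>'


old = snapshot(("https://example.org/TR/a/", "A (draft)"),
               ("https://example.org/TR/b/", "B"))
new = snapshot(("https://example.org/TR/a/", "A (final)"))

merged = merge_snapshots([old, new])
print([(rec.get(ABOUT), rec.findtext("title")) for rec in merged])
```

Under the same assumptions, the daily update would be the same call with the existing w3c.rdf as the first snapshot and the current tr.rdf as the second.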

@andrew2net
Contributor

@ronaldtse I fetched all the tr.rdf archives from web.archive.org and merged them into one file. The merged file has 9k+ documents (we had 6k+ before), but W3C PR-math-19980224 is still not present. So we need to get such docs from the W3C website.

andrew2net added a commit that referenced this issue Oct 11, 2024
andrew2net added a commit to relaton/relaton-data-w3c that referenced this issue Oct 11, 2024
andrew2net added a commit to relaton/relaton-data-w3c that referenced this issue Oct 14, 2024
@andrew2net
Contributor

@ronaldtse I'm trying to get all the obsoleted documents that are missing from the RDF archive from the w3c.org website, but the pages have inconsistent layouts. I adapted the scraper to handle some layouts, but there appear to be many more of them, and some layouts conflict with each other. I don't know how much time it will take to cover all the variants, and I want to clarify whether I should continue investing time in this problem.
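
One way to keep the open-ended layout problem measurable is to try an ordered list of layout parsers and record every page none of them understands. The Python sketch below only illustrates that idea: both layouts are invented for the example and do not describe real W3C pages, and all function names are hypothetical. Putting the most specific parser first is one way to handle layouts that conflict.

```python
import re


# Both layouts below are invented for illustration; real W3C pages differ.
def parse_h1_layout(html):
    """Hypothetical layout A: title in <h1>, date in the following <h2>."""
    m = re.search(
        r"<h1[^>]*>(?P<title>.*?)</h1>.*?<h2[^>]*>(?P<date>.*?)</h2>", html, re.S
    )
    return m.groupdict() if m else None


def parse_title_layout(html):
    """Hypothetical layout B: only a <title> element is usable."""
    m = re.search(r"<title>(?P<title>.*?)</title>", html, re.S)
    return {"title": m["title"], "date": None} if m else None


PARSERS = [parse_h1_layout, parse_title_layout]  # most specific first


def scrape(pages):
    """Try each parser in order; collect URLs that no parser understands."""
    parsed, unmatched = {}, []
    for url, html in pages.items():
        for parser in PARSERS:
            result = parser(html)
            if result:
                parsed[url] = {**result, "layout": parser.__name__}
                break
        else:
            unmatched.append(url)
    return parsed, unmatched


pages = {
    "https://example.org/a": "<h1>Doc A</h1><h2>24 February 1998</h2>",
    "https://example.org/b": "<html><title>Doc B</title></html>",
    "https://example.org/c": "<p>no recognizable structure</p>",
}
parsed, unmatched = scrape(pages)
print(f"{len(parsed)}/{len(pages)} pages parsed; unmatched: {unmatched}")
```

The size of the unmatched list after each new parser gives a rough measure of how much layout coverage is still missing.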

3 participants