Validation of large datasets with SHACL #1914

Open
abrokenjester opened this issue Feb 11, 2020 · 14 comments
Labels
📶 enhancement issue is a new feature or improvement 📦 SHACL affects the SHACL validator

Comments

@abrokenjester
Contributor

abrokenjester commented Feb 11, 2020

As a SHACL user, I want to validate an existing large dataset using a SHACL model, so that I can report possible problems without requiring them to be fixed immediately.

The existing SHACL Sail implementation is optimized for validating updates to the data, and works on a transactional basis. It assumes the underlying store is fully valid, and aborts any transaction that tries to change the data in a way that makes it non-compliant. While this is great for some use cases, in other scenarios a more flexible approach towards validation is needed.

Requirements:

  • the implementation must support validating an existing large dataset, accessible via a Repository, against a SHACL model supplied at runtime
  • the implementation must support producing a validation report in a manner that can be easily processed by the user

(Followup from discussion in #1900 )

@abrokenjester abrokenjester added 📶 enhancement issue is a new feature or improvement 📦 SHACL affects the SHACL validator labels Feb 11, 2020
@hmottestad
Contributor

Btw. This is the current support: https://rdf4j.org/documentation/programming/shacl/#performance

Essentially you disable validation, add your data and your shapes, enable validation and call the revalidate() method which returns a validation report.

Under the hood this works by retrieving a list of targets for each shape, keeping the list in memory, sorting it, and passing it one target at a time through the validation engine. So you need enough memory to hold each shape’s list of targets. You don’t need to keep all those lists in memory at the same time unless you have parallel validation or caching enabled.
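
As a rough illustration of the per-shape loop described above, here is a minimal sketch in plain Java. It is not the ShaclSail code: the `Shape` class, the string-based nodes, and the `validate` method are hypothetical stand-ins chosen only to show why memory use is bounded by the largest single target list when shapes are processed one after another.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in types for illustration; none of these are RDF4J classes.
final class BulkValidationSketch {

    /** A shape reduced to a name, a target selector, and a per-target check. */
    static final class Shape {
        final String name;
        final Predicate<String> isTarget;
        final Predicate<String> conforms;

        Shape(String name, Predicate<String> isTarget, Predicate<String> conforms) {
            this.name = name;
            this.isTarget = isTarget;
            this.conforms = conforms;
        }
    }

    /** Returns one "shape: node" line per target that fails its shape. */
    static List<String> validate(Iterable<String> nodes, List<Shape> shapes) {
        List<String> report = new ArrayList<>();
        for (Shape shape : shapes) {
            // Collect this shape's targets; only this one list is held at a time.
            List<String> targets = new ArrayList<>();
            for (String node : nodes) {
                if (shape.isTarget.test(node)) {
                    targets.add(node);
                }
            }
            // Sort the list, then pass the targets through the check one by one.
            Collections.sort(targets);
            for (String target : targets) {
                if (!shape.conforms.test(target)) {
                    report.add(shape.name + ": " + target);
                }
            }
            // The list becomes unreachable here, before the next shape starts.
        }
        return report;
    }
}
```

Running shapes in parallel would keep several such lists alive at once, which matches the memory caveat above.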

Revalidate is meant to be used in two scenarios:

  • the data to validate doesn’t fit in memory
  • the shapes in an existing ShaclSail have changed, and the data needs to be revalidated against the changed shapes

If your data fits in memory, then simply adding your data to an empty ShaclSail is faster, because it will do some optimizations under the hood and use SPARQL queries for some shapes.

@abrokenjester
Contributor Author

I wonder if we can cover most of the use case I have in mind by just providing some convenience wrapping around the existing ShaclSail.

@hmottestad
Contributor

Were you thinking about this working against a remote repo like a SPARQL endpoint?

@abrokenjester
Contributor Author

abrokenjester commented Feb 12, 2020

Not necessarily - though ideally against any Repository (which would include a SPARQL endpoint).

The main problem I see with this is that the ShaclSail is, well, a Sail - stacked on top of another Sail - rather than on top of a Repository. We can't easily apply it to different data sets because by its very design it assumes a single dataset (the underlying Sail) is its sole source of truth. Even with some wrapping trick where we disable default validation and only execute it on demand, we'd still need to have a ShaclSail wrapped around our data store beforehand.

It's still worth exploring that option, and I'll do a bit of a spike to see what it could look like, but if we would go completely the other way - a completely new implementation using simple SPARQL querying - have you any sort of gut feel for how much effort there would be in that? Initially I'm more interested in getting something functional and relatively complete than in pure performance.

@hmottestad
Contributor

Hacking something together to get the ShaclSail to run against a repository should be doable. Remember that within the ShaclSail it just runs against a SailConnection (sometimes a NotifyingSailConnection, but I think that would be easy to get rid of). The vital part to figure out is if you can bridge a RepositoryConnection to a SailConnection with some sort of wrapper, then we can look at injecting that into the ShaclSail.

A brand new SHACL implementation that supports any sort of repo would probably end up being a SHACL -> SPARQL converter.

What would be nice then is if we developed a new Abstract Syntax Tree (e.g. a Java model) of SHACL, and then a second model that simplifies each part into three subcomponents: target, path, rule. Both of these would be useful for the ShaclSail too. From that we would implement two "generators" that generate either SPARQL or the current change-aware plans.
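
To make the target / path / rule split concrete, here is a minimal sketch of what a SPARQL "generator" could emit for a single `sh:minCount` constraint. The class and method names are hypothetical and not part of RDF4J, and the sketch simplifies SHACL semantics: it matches the target class with a plain `a` pattern and ignores `rdfs:subClassOf` and property paths other than a single predicate.

```java
// Hypothetical generator for illustration; not an RDF4J class.
final class MinCountToSparqlSketch {

    /**
     * Builds a query that selects focus nodes violating sh:minCount.
     * target = targetClassIri, path = pathIri, rule = minCount.
     */
    static String violationsQuery(String targetClassIri, String pathIri, int minCount) {
        return String.join("\n",
                "SELECT ?focus WHERE {",
                // target: every instance of the class is a focus node
                "  ?focus a <" + targetClassIri + "> .",
                // path: OPTIONAL keeps focus nodes that have no value at all
                "  OPTIONAL { ?focus <" + pathIri + "> ?value . }",
                "}",
                "GROUP BY ?focus",
                // rule: too few bound values means a violation
                "HAVING (COUNT(?value) < " + minCount + ")");
    }
}
```

A query like this can run entirely on the remote endpoint, so only the violating focus nodes need to travel back to the validator.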

The only SHACL implementation that uses SPARQL that I'm aware of is the one in Stardog. I don't think it's infeasible to implement one, but I would probably plan for 3-6 months of work to be honest.

@jimkont

jimkont commented Feb 18, 2020

RDFUnit supports running SHACL on most RDF sources (including a remote SPARQL endpoint), but it is based on Apache Jena.

@hmottestad
Contributor

@jimkont does it generate SPARQL queries, or does it need to retrieve data from the SPARQL endpoint to validate it locally?

@jimkont

jimkont commented Feb 18, 2020 via email

@hmottestad
Contributor

Also a nice ontology for SHACL: https://github.com/AKSW/RDFUnit/blob/ecd1b3d709bef723ad79cbd115c026a0e5fd3dd2/rdfunit-commons/src/main/resources/org/aksw/rdfunit/vocabularies/shacl.ttl

Will probably want to use something like this when we start rewriting our SHACL AST implementation.

@jimkont

jimkont commented Feb 18, 2020 via email

@hmottestad
Contributor

I've started work on #2083, which is the first step towards both full SHACL support with the current approach and support for other SHACL implementations, such as a SPARQL-based approach that would allow faster validation of data stored in a remote store. We should also consider whether it's time to start using the query plan nodes and implementations from the SPARQL engine instead of the custom ones in SHACL. That would be nice to test out as part of #2083.

hmottestad added a commit to HASMAC-AS/rdf4j that referenced this issue Mar 26, 2021
… queries from SHACL shapes

Signed-off-by: Håvard Ottestad <[email protected]>
@hmottestad
Contributor

I have now started working on generating SPARQL queries from SHACL Shapes. See: #2963

3 participants