Validation of large datasets with SHACL #1914
Btw, this is the current support: https://rdf4j.org/documentation/programming/shacl/#performance

Essentially you disable validation, add your data and your shapes, enable validation, and call the revalidate() method, which returns a validation report. Under the hood this works by retrieving a list of targets for each shape, keeping the list in memory, sorting it, and passing it one target at a time through the validation engine. So you need enough memory to keep each shape's list of targets in memory. You don't need to keep all those lists in memory at the same time unless you have parallel validation or caching enabled. Revalidate is meant to be used in two scenarios:
If your data fits in memory, then simply adding your data to an empty ShaclSail is faster, because it will do some optimizations under the hood and use SPARQL queries for some shapes.
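To make that flow concrete, here is a minimal sketch of bulk loading and then revalidating with a ShaclSail, based on my reading of the linked docs. The disableValidation()/enableValidation() calls and ShaclSailConnection.revalidate() reflect the API described there; exact names have shifted between RDF4J releases, so treat this as illustrative rather than definitive:

```java
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.repository.sail.SailRepositoryConnection;
import org.eclipse.rdf4j.sail.memory.MemoryStore;
import org.eclipse.rdf4j.sail.shacl.ShaclSail;
import org.eclipse.rdf4j.sail.shacl.ShaclSailConnection;
import org.eclipse.rdf4j.sail.shacl.results.ValidationReport;

public class BulkValidationSketch {
    public static void main(String[] args) {
        ShaclSail shaclSail = new ShaclSail(new MemoryStore());
        SailRepository repo = new SailRepository(shaclSail);

        // 1. Load shapes and data with validation switched off
        //    (method names as per the docs of the time; newer releases
        //    use per-transaction settings instead).
        shaclSail.disableValidation();
        try (SailRepositoryConnection conn = repo.getConnection()) {
            conn.begin();
            // conn.add(shapesModel, RDF4J.SHACL_SHAPE_GRAPH); // your shapes
            // conn.add(dataModel);                            // your data
            conn.commit();
        }

        // 2. Switch validation back on and validate everything in one pass.
        shaclSail.enableValidation();
        try (SailRepositoryConnection conn = repo.getConnection()) {
            conn.begin();
            ValidationReport report =
                ((ShaclSailConnection) conn.getSailConnection()).revalidate();
            conn.commit();
            System.out.println("conforms: " + report.conforms());
        }

        repo.shutDown();
    }
}
```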
I wonder if we can cover most of the use case I have in mind by just providing some convenience wrapping around the existing ShaclSail.
Were you thinking about this working against a remote repo like a SPARQL endpoint?
Not necessarily - though ideally against any Repository (which would include a SPARQL endpoint). The main problem I see with this is that the ShaclSail is, well, a Sail - stacked on top of another Sail - rather than on top of a Repository. We can't easily apply it to different datasets, because by its very design it assumes a single dataset (the underlying Sail) is its sole source of truth. Even with some wrapping trick where we disable default validation and then only execute it on demand, we'd still need to have a ShaclSail wrapped around our data store beforehand. It's still worth exploring that option, and I'll do a bit of a spike to see what it could look like. But if we were to go completely the other way - a completely new implementation using plain SPARQL querying - do you have any sort of gut feel for how much effort that would take? Initially I'm more interested in getting something functional and relatively complete than in pure performance.
Hacking something together to get the ShaclSail to run against a repository should be doable. Remember that within the ShaclSail everything just runs against a SailConnection (sometimes a NotifyingSailConnection, but I think that would be easy to get rid of). The vital part to figure out is whether you can bridge a RepositoryConnection to a SailConnection with some sort of wrapper; then we can look at injecting that into the ShaclSail.

A brand new SHACL implementation that supports any sort of repo would probably end up being a SHACL -> SPARQL converter. What would be nice then is if we developed a new Abstract Syntax Tree (e.g. a Java model) of SHACL, and then a second model that simplifies each part into the three subcomponents: target, path, rule. Both of these would be useful for the ShaclSail too. From those we would implement two "generators" that produce either SPARQL or the current change-aware plans. The only SHACL implementation that uses SPARQL that I'm aware of is the one in Stardog. I don't think it's infeasible to implement one, but I would probably plan for 3-6 months of work, to be honest.
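To make the SHACL -> SPARQL direction concrete, here is a hypothetical sketch of compiling the target/path/rule decomposition for one simple shape (sh:targetClass + sh:path + sh:minCount) into a query that returns violating focus nodes. The `SimpleShape` class and its method are illustrative, not existing RDF4J API:

```java
/**
 * Hypothetical sketch of a SHACL -> SPARQL "generator" for one simple
 * constraint shape (sh:targetClass + sh:path + sh:minCount).
 */
public final class SimpleShape {

    private final String targetClass; // IRI from sh:targetClass (the "target")
    private final String path;        // IRI of a single predicate path (the "path")
    private final int minCount;       // value of sh:minCount (the "rule")

    public SimpleShape(String targetClass, String path, int minCount) {
        this.targetClass = targetClass;
        this.path = path;
        this.minCount = minCount;
    }

    /** Compiles the shape into a SPARQL query selecting violating focus nodes. */
    public String toSparqlViolationQuery() {
        return String.join("\n",
                "SELECT ?focus WHERE {",
                "  ?focus a <" + targetClass + "> .",            // target selection
                "  OPTIONAL { ?focus <" + path + "> ?value . }", // follow the path
                "}",
                "GROUP BY ?focus",
                "HAVING (COUNT(?value) < " + minCount + ")");    // apply the rule
    }
}
```

A query like this could then be evaluated against any RepositoryConnection, including a SPARQLRepository pointing at a remote endpoint, with each returned ?focus binding reported as a violation.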
RDFUnit supports running SHACL on most RDF sources (including a remote SPARQL endpoint), but is based on Apache Jena.
@jimkont does it generate SPARQL queries, or does it need to retrieve data from the SPARQL endpoint to validate it locally? |
It generates SPARQL queries.
Found this file here with the SPARQL templates: https://github.com/AKSW/RDFUnit/blob/ecd1b3d709bef723ad79cbd115c026a0e5fd3dd2/rdfunit-commons/src/main/resources/org/aksw/rdfunit/vocabularies/shacl-core.ttl
Also a nice ontology for SHACL: https://github.com/AKSW/RDFUnit/blob/ecd1b3d709bef723ad79cbd115c026a0e5fd3dd2/rdfunit-commons/src/main/resources/org/aksw/rdfunit/vocabularies/shacl.ttl We'll probably want to use something like this when we start rewriting our SHACL AST implementation.
Note that `shacl-core.ttl` contains a few non-standard SHACL-SPARQL constructs, e.g. for `sh:in`, as well as a few optimization extensions for adjusting the queries depending on the constraint parameters. IIRC, the ontology is a copy from the W3C.
I've started work on #2083, which is the first step towards allowing both full SHACL support with the current approach and support for other SHACL implementations, such as a SPARQL-based approach that would allow faster validation of data stored in a remote store. We should also consider whether it's time to start using the query plan nodes and implementations from the SPARQL engine instead of the custom ones in SHACL. That would be nice to test out as part of #2083.
I have now started working on generating SPARQL queries from SHACL shapes. See: #2963
As a SHACL user, I want to validate an existing large dataset using a SHACL model, so that I can report possible problems without requiring them to be fixed immediately.
The existing SHACL Sail implementation is optimized for validating updates to the data and works on a transactional basis. It assumes the underlying store is fully valid, and aborts any transaction that tries to change the data in a way that would make it non-compliant. While this is great for some use cases, other scenarios need a more flexible approach to validation.
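As an illustration of that transactional behavior, here is a minimal sketch assuming the RDF4J 3.x API, where commit() runs validation and a failed commit surfaces a ShaclSailValidationException carrying the report (per the ShaclSail documentation; details may differ between releases):

```java
import org.eclipse.rdf4j.model.Model;
import org.eclipse.rdf4j.repository.RepositoryException;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.repository.sail.SailRepositoryConnection;
import org.eclipse.rdf4j.sail.memory.MemoryStore;
import org.eclipse.rdf4j.sail.shacl.ShaclSail;
import org.eclipse.rdf4j.sail.shacl.ShaclSailValidationException;

public class TransactionalValidationSketch {
    public static void main(String[] args) {
        SailRepository repo = new SailRepository(new ShaclSail(new MemoryStore()));
        try (SailRepositoryConnection conn = repo.getConnection()) {
            conn.begin();
            // conn.add(...); // a change that violates a loaded shape
            try {
                conn.commit(); // validation runs here, on commit
            } catch (RepositoryException e) {
                if (e.getCause() instanceof ShaclSailValidationException) {
                    // the transaction is aborted; inspect the validation report
                    Model report = ((ShaclSailValidationException) e.getCause())
                            .validationReportAsModel();
                }
            }
        }
        repo.shutDown();
    }
}
```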
Requirements:
(Follow-up from the discussion in #1900)