Validation of large datasets with SHACL #1914

abrokenjester · 2020-02-11T23:58:25Z

As a SHACL user, I want to validate an existing large dataset using a SHACL model, so that I can report possible problems without requiring them to be fixed immediately.

The existing SHACL Sail implementation is optimized for validating updates to the data, and works on a transactional basis. It assumes the underlying store is fully valid, and aborts any transaction that tries to change the data in a way that makes it non-compliant. While this is great for some use cases, in other scenarios a more flexible approach towards validation is needed.

Requirements:

- the implementation must support validating an existing large dataset, accessible via a Repository, against a SHACL model supplied at runtime
- the implementation must support producing a validation report in a manner that can be easily processed by the user

(Followup from discussion in #1900 )

hmottestad · 2020-02-12T07:16:33Z

Btw, this is the current support: https://rdf4j.org/documentation/programming/shacl/#performance

Essentially you disable validation, add your data and your shapes, enable validation and call the revalidate() method which returns a validation report.

Under the hood this works by retrieving a list of targets for each shape, keeping the list in memory, sorting it, and passing it one target at a time through the validation engine. So you need enough memory to keep, for every shape, its list of targets in memory. You don’t need to keep all those lists in memory at the same time unless you have parallel validation or caching enabled.
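To make the memory behaviour concrete, here is a minimal sketch of that loop in plain Java. It is not the ShaclSail code: `Shape`, `revalidate` and `demoReport` are hypothetical stand-ins, and the "dataset" is just an in-memory map. It only shows the idea of fetching one shape's targets, sorting them, and checking them one at a time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class RevalidateSketch {

    /** Hypothetical stand-in for a shape: a name, a target lookup, and a per-target check. */
    public static final class Shape {
        final String name;
        final Supplier<List<String>> targets; // fetches the focus nodes for this shape
        final Predicate<String> conforms;     // validates a single focus node

        public Shape(String name, Supplier<List<String>> targets, Predicate<String> conforms) {
            this.name = name;
            this.targets = targets;
            this.conforms = conforms;
        }
    }

    /** Returns, per shape, the sorted list of targets that failed validation. */
    public static Map<String, List<String>> revalidate(List<Shape> shapes) {
        Map<String, List<String>> report = new LinkedHashMap<>();
        for (Shape shape : shapes) {
            // Only the current shape's target list is held in memory here.
            List<String> targets = new ArrayList<>(shape.targets.get());
            Collections.sort(targets);
            List<String> violations = new ArrayList<>();
            for (String target : targets) {
                // One target at a time goes through the (toy) validation check.
                if (!shape.conforms.test(target)) {
                    violations.add(target);
                }
            }
            if (!violations.isEmpty()) {
                report.put(shape.name, violations);
            }
        }
        return report;
    }

    /** Builds a tiny made-up dataset and one made-up shape, then validates it. */
    public static Map<String, List<String>> demoReport() {
        Map<String, Integer> ages = new HashMap<>();
        ages.put("ex:carol", -5);
        ages.put("ex:alice", 30);
        ages.put("ex:bob", 200);

        Shape ageRange = new Shape(
                "AgeRange",
                () -> new ArrayList<>(ages.keySet()),
                node -> ages.get(node) >= 0 && ages.get(node) <= 150);

        List<Shape> shapes = new ArrayList<>();
        shapes.add(ageRange);
        return revalidate(shapes);
    }

    public static void main(String[] args) {
        System.out.println(demoReport());
    }
}
```

Because the lists are built and dropped shape by shape, peak memory in this sketch is bounded by the largest single target list, which matches the description above for the non-parallel, non-cached case.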

Revalidate is meant to be used in two scenarios:

- data to validate doesn’t fit in memory
- changes to the shapes in an existing ShaclSail, revalidate changed shapes

If your data fits in memory then simply adding your data to an empty ShaclSail is faster because it will do some optimizations under the hood and use SPARQL queries for some shapes.

abrokenjester · 2020-02-12T08:44:19Z

I wonder if we can cover most of the use case I have in mind by just providing some convenience wrapping around the existing ShaclSail.

hmottestad · 2020-02-12T16:57:43Z

Were you thinking about this working against a remote repo like a SPARQL endpoint?

abrokenjester · 2020-02-12T22:58:33Z

Not necessarily - though ideally against any Repository (which would include a SPARQL endpoint).

The main problem I see with this is that the ShaclSail is, well, a Sail - stacked on top of another Sail - rather than on top of a Repository. We can't easily apply it to different data sets because by its very design it assumes a single dataset (the underlying Sail) is its sole source of truth. Even with some wrapping trick where we disable default validation and then only execute it on demand, we'd still need to have a ShaclSail wrapped around our data store beforehand.

It's still worth exploring that option, and I'll do a bit of a spike to see what it could look like, but if we would go completely the other way - a completely new implementation using simple SPARQL querying - have you any sort of gut feel for how much effort there would be in that? Initially I'm more interested in getting something functional and relatively complete than in pure performance.

hmottestad · 2020-02-15T15:30:19Z

Hacking something together to get the ShaclSail to run against a repository should be doable. Remember that within the ShaclSail it just runs against a SailConnection (sometimes a NotifyingSailConnection, but I think that would be easy to get rid of). The vital part to figure out is if you can bridge a RepositoryConnection to a SailConnection with some sort of wrapper, then we can look at injecting that into the ShaclSail.

A brand new SHACL implementation that supports any sort of repo would probably end up being a SHACL -> SPARQL converter.

What would be nice then is if we developed a new Abstract Syntax Tree (e.g. a Java model) of SHACL. Then a second model that simplifies each part into the three subcomponents: target, path, rule. Both of these would be useful for the ShaclSail too. From that we would implement two "generators" that generate either SPARQL or the current change-aware plans.
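As a rough illustration of the target/path/rule split feeding a SPARQL generator, here is a minimal sketch in plain Java. Everything in it is an assumption for illustration: `MinCountShape` and `toSparql` are hypothetical names, only a `sh:targetClass` plus `sh:minCount` style constraint is covered, and subclass-based targets are ignored.

```java
public class ShaclToSparqlSketch {

    /** Hypothetical simplified shape made of the three parts: target, path, rule. */
    public static final class MinCountShape {
        final String targetClass; // target: IRI of the class whose instances are focus nodes
        final String path;        // path: IRI of the predicate followed from each focus node
        final int minCount;       // rule: minimum number of distinct values required

        public MinCountShape(String targetClass, String path, int minCount) {
            this.targetClass = targetClass;
            this.path = path;
            this.minCount = minCount;
        }
    }

    /**
     * Generates a query whose results are the focus nodes violating the rule.
     * OPTIONAL keeps focus nodes with no value at all, so their count is 0.
     * Assumption: targets are matched by direct rdf:type only, no subclass handling.
     */
    public static String toSparql(MinCountShape shape) {
        return "SELECT ?focus WHERE {\n"
                + "  ?focus a <" + shape.targetClass + "> .\n"
                + "  OPTIONAL { ?focus <" + shape.path + "> ?value }\n"
                + "}\n"
                + "GROUP BY ?focus\n"
                + "HAVING (COUNT(DISTINCT ?value) < " + shape.minCount + ")";
    }

    public static void main(String[] args) {
        // Example IRIs are made up for the demo.
        MinCountShape shape = new MinCountShape(
                "http://example.org/Person", "http://example.org/name", 1);
        System.out.println(toSparql(shape));
    }
}
```

A second generator over the same simplified model could, in principle, emit the change-aware plans instead of a query string; that part is not sketched here.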

The only SHACL implementation that uses SPARQL that I'm aware of is the one in Stardog. I don't think it's infeasible to implement one, but I would probably plan for 3-6 months of work to be honest.

jimkont · 2020-02-18T10:58:01Z

RDFUnit supports running SHACL on most RDF sources (including a remote SPARQL endpoint), but it is based on Apache Jena.

hmottestad · 2020-02-18T19:17:53Z

@jimkont does it generate SPARQL queries, or does it need to retrieve data from the SPARQL endpoint to validate it locally?

jimkont · 2020-02-18T19:25:22Z

It generates SPARQL queries.

hmottestad · 2020-02-18T19:41:26Z

Found this file here with the SPARQL templates: https://github.com/AKSW/RDFUnit/blob/ecd1b3d709bef723ad79cbd115c026a0e5fd3dd2/rdfunit-commons/src/main/resources/org/aksw/rdfunit/vocabularies/shacl-core.ttl

hmottestad · 2020-02-18T19:47:59Z

Also a nice ontology for SHACL: https://github.com/AKSW/RDFUnit/blob/ecd1b3d709bef723ad79cbd115c026a0e5fd3dd2/rdfunit-commons/src/main/resources/org/aksw/rdfunit/vocabularies/shacl.ttl

Will probably want to use something like this when we start rewriting our SHACL AST implementation.

jimkont · 2020-02-18T20:05:52Z

Note that `shacl-core.ttl` contains a few non-standard SHACL-SPARQL constructs, i.e. for `sh:in`, as well as a few optimization extensions for adjusting the queries depending on the constraint parameters. IIRC, the ontology is a copy from the W3C.

hmottestad · 2020-02-18T20:27:20Z

https://www.w3.org/ns/shacl.ttl

hmottestad · 2020-04-16T12:14:19Z

I've started work on #2083 which is the first step to allow for both full SHACL support with the current approach, as well as support for other SHACL implementations like a SPARQL based approach which would allow for faster validation of data stored in a remote store. We should also consider if it's time to start using the query plan nodes and implementations from the SPARQL engine instead of the custom ones in SHACL. Would be nice to test out as part of #2083.

hmottestad · 2021-04-03T14:36:37Z

I have now started working on generating SPARQL queries from SHACL Shapes. See: #2963

abrokenjester added the "📶 enhancement" (issue is a new feature or improvement) and "📦 SHACL" (affects the SHACL validator) labels Feb 11, 2020

abrokenjester mentioned this issue Apr 2, 2020

Allow SAIL to inspect/process unparsed query at prepareQuery stage #99

Closed

hmottestad added a commit to HASMAC-AS/rdf4j that referenced this issue Mar 26, 2021

eclipse-rdf4jGH-1914 start on initial framework for generating SPARQL…

9d5cf6a

… queries from SHACL shapes Signed-off-by: Håvard Ottestad <[email protected]>

abrokenjester added this to RDF4J Planning Jan 3, 2023

github-project-automation bot moved this to 📋 Backlog in RDF4J Planning Jan 3, 2023

VladimirAlexiev mentioned this issue Jan 28, 2025

which SHACL validators to try? Sveino/Inst4CIM-KG#95

Open
