Goal
We’d like a rigorous model of the execution of a complex query. Ideally, it would be a structured document that can be serialized to formats including JSON and XML, both for automated analysis and for visualization by human users.
If “q” is a query we’ve executed against complex heterogeneous sources, it would be good to be able to do something like this to understand which data sources were analyzed, and what algorithms and assumptions were incorporated:
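(A sketch of the intent; the Query class and sparql_query string are shown in Appendix A, and serialization goes through the underlying prov document.)

q = Query ()
q.query_sparql (sparql_query)
print (q.provenance)                            # human-readable PROV-N summary
print (q.provenance.document.serialize ())      # PROV-JSON for automated analysis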
This post is an overview of a technical approach to enable this using W3C standards in a Python environment.
Motivation
We answer complicated science questions by integrating heterogeneous data sources. We link semantic web graph data, results from REST service calls, data from CSV files, and other sources to produce query results.
But once a query result is produced, how is the user to know what it means or how to value it? How is the user to know which sources were used to generate the result? Which assumptions and algorithms were incorporated in the process? How long did the query’s constituent activities each take?
These questions broadly fall into the realm of provenance.
W3C PROV
The World Wide Web Consortium created the PROV standard to describe these sorts of concepts. It defines abstractions and how they interrelate to generically describe provenance for a wide range of generative processes. The standard also defines a variety of output formats so that provenance data produced with PROV can be widely consumed.
For our purposes, it is convenient to interact with PROV from a Python environment. The prov module supports creating, serializing, and visualizing PROV documents.
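A minimal sketch of the parts of the prov API used here (the namespace URL and identifiers are illustrative):

import prov.model as prov

doc = prov.ProvDocument ()
doc.set_default_namespace ('http://example.org/prov/')
doc.add_namespace ('enviro', 'http://example.org/prov/enviro')
doc.activity ('enviro', '2017-03-25T11:56:12', '2017-03-25T11:56:13')
doc.entity ('data', ((prov.PROV_TYPE, 'enviro:data'), ('src', 'exposure')))
print (doc.get_provn ())     # PROV-N text
print (doc.serialize ())     # PROV-JSON; format='xml' is also supported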
Usage
We want a framework in which developers can query data sources without thinking about provenance, yet have it captured reliably and thoroughly. Developers should only think in terms of the domain content of their queries. They should get a structured document describing the details of the algorithms used and the data sources consulted.
To do that, we provide a Query class with methods for accessing heterogeneous resources in a data lake environment. Each data access method is annotated with a provenance decorator. The provenance decorator inspects each method invocation at runtime, recording facts about the invocation including arguments, data source names, times, and so on. Readers familiar with aspect-oriented programming will note that this decorator works much like an aspect.
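The full decorator appears in Appendix A; stripped to its shape, it looks roughly like this (the branches that inspect each method and record entities are elided):

def provenance ():
    def provenance_aspect (function):
        def wrapper (*args, **kwargs):
            start = args[0].provenance.get_time ()   # args[0] is the Query instance
            result = function (*args, **kwargs)
            end = args[0].provenance.get_time ()
            # ... inspect function.__name__, args, and kwargs, then record
            # activities and entities on args[0].provenance ...
            return result
        return wrapper
    return provenance_aspect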
It is also possible to inspect invocation arguments in detail. One use, in our prototype, is to parse SPARQL queries, identify the namespaces they reference, and record those in our provenance context. We can also record the entire text of the SPARQL query in the provenance object, though this is omitted below for brevity.
Usage in the prototype looks like this:
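(Reproduced from the test scenario in Appendix A; sparql_query is an ordinary SPARQL string.)

q = Query ()
q.get_exposure (start='1985-04-12', end='2017-04-12')
q.query_sparql (sparql_query)
print (q.provenance)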
Importantly, the developer using the query interface does nothing to record provenance data.
The resulting provenance document (see the sample output following the code in Appendix A) could be used to generate more user-friendly feedback explaining the query execution process.
Next, we’ll describe the approach a bit further.
Approach
Prov Abstractions
First we create abstractions for interacting with the prov library. QueryResponseProvenance wraps a prov document object. This lets us make some simplifying assumptions about how we’ll use provenance. AbstractEntity plays a similar role with respect to prov entities.
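Used directly, the wrappers look something like this (the namespace and entity names are illustrative):

p = QueryResponseProvenance (default_ns='http://example.org/prov/', namespaces=['enviro'])
p.add_algorithm ('enviro', p.get_time ())
e = AbstractEntity ('enviro:data', 'http://example.org/prov/enviro')
e.add_attribute ('src', 'exposure')
p.add_entity ('enviro:data', e.to_tuple ())
print (p)    # renders the document as PROV-N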
DataLake
Next, we create a DataLakeProvenance class derived from QueryResponseProvenance. Its main role is to examine specific kinds of data interactions and characterize them in the provenance document. For the initial implementation, it provides a method to parse SPARQL queries and log their characteristics, including the RDF IRIs of queried resources. One could imagine extending it to provide similar semantics for other query settings such as SQL, or HTTP paths, query strings, and POST parameters.
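For example, given a source_map of known IRI fragments (the values here are taken from the appendix), a call like the following records one entity per recognized data source:

p = DataLakeProvenance (default_ns='http://example.org/prov/', namespaces=['medbiochem'])
p.parse_sparql (
    "PREFIX ctd: <http://chem2bio2rdf.org/ctd/resource/>\nSELECT ?s WHERE { ?s ?p ?o }",
    source_map = { '<http://chem2bio2rdf.org/ctd/' : 'c2b2r.ctd' },
    type = "medbiochem:data")
print (p)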
GreenTranslatorProvenance
This class derives from DataLakeProvenance and principally parameterizes it with information specific to this use case. In particular, it specifies the namespaces for the data sources and algorithms used in Green Translator.
Decorator
The Python decoration capability lets us define a function that can be used to augment the execution of other functions. We define a provenance decorator which inspects the name of called functions and conditionally records facts about their invocation.
Query
The Query class defines methods for accessing data sources within the translator data lake. One executes a SPARQL query and another invokes a smartAPI REST service. In this prototype, the accessors are for demonstration purposes and do not have a full set of real world parameters.
Each accessor method is annotated with the provenance decorator function. This means that, transparent to developers who invoke the methods, provenance information is recorded about the interaction within the context of the query.
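Concretely, each accessor is just annotated (excerpted from Appendix A):

class Query (object):
    def __init__(self):
        self.provenance = GreenTranslatorProvenance ()

    @provenance()
    def query_sparql (self, query, service=blazegraph):
        service.setQuery (query)
        service.setReturnFormat (JSON)
        return service.query().convert ()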
Next
While clearly a prototype, this is one way to track query execution that:
- Provides granular logging of data sources and algorithms used
- Outputs a structured document in XML, RDF, JSON, and other formats
- Enables visualization of the process graph
- Is transparent to developers using the query interface
- Works seamlessly in a notebook environment
It’s unclear that the specific syntactic choices in this prototype are ideal. In particular, the use of a Python decorator does substantially abstract the details of provenance away from the Query class, but it also makes the decorator specific to the Query class’s implementation, and nothing structural prevents us from simply including the provenance code in the Query class. With the decorator approach, though, it is easier to swap in a different provenance implementation, or to add logging, without bloating the Query logic. So that’s one potentially durable justification.
If W3C PROV does what we need, we’ll need to think a bit about which classes of entities, activities, and other items to represent, and how to represent them. Perhaps there is an existing ontology we can use or extend.
Evaluation
Big questions for folks who have expressed a general desire for this sort of thing:
- Does this do what we need?
- Is there a substantially better approach?
Appendix A – Code
""" This part is generic provenance infrastructure """
import prov.model as prov
from datetime import datetime
from datetime import date
class QueryResponseProvenance(object):
""" Abstract W3C PROV model of the provenance of components of a complex query. """
def __init__(self, default_ns, namespaces=[]):
""" Specify a default namespace and associated sub namespaces """
self.document = prov.ProvDocument ()
self.default_ns = default_ns
self.document.set_default_namespace (self.default_ns)
self.namespaces = namespaces
self.subspaces = {}
for namespace in self.namespaces:
self.subspaces[namespace] = self.add_namespace (self.default_ns, namespace)
def add_namespace (self, root, qualifier):
subspace = "{0}{1}".format (root, qualifier)
self.document.add_namespace (qualifier, subspace)
return subspace
def add_entity (self, name, tuples):
self.document.entity (name, tuples)
    def add_data_source (self, name, entity):
        self.add_entity (name, entity.to_tuple ())
def add_algorithm (self, name, start, end=None):
assert name in self.namespaces, "Name must be in list of namespaces"
if end:
self.document.activity (name, start, end)
else:
self.document.activity (name, start)
def get_time (self):
return datetime.now ().strftime ("%Y-%m-%dT%H:%M:%S")
def __str__(self):
return self.__repr__()
def __repr__(self):
return self.document.get_provn ()
class AbstractEntity(object):
def __init__(self, type, namespace, attributes=[]):
self.attributes = []
self.namespace = namespace
self.attr_keys = {}
self.add_attribute (prov.PROV_TYPE, type)
for a in attributes:
assert len(a) == 2, "Attribute components must be len==2 arrays"
self.add_attribute (a[0], a[1])
def add_attribute (self, iri, value):
key = '{0}@{1}'.format (iri, value)
if not key in self.attr_keys:
self.attributes.append ((iri, value))
self.attr_keys[key] = 1
def to_tuple (self):
return tuple(self.attributes)
import re
""" DataLake Support - encapsulate provenance behaviors specific to
specific classes of data resources """
class DataLakeProvenance(QueryResponseProvenance):
PREFIX = re.compile ('^prefix ', re.I)
def __init__(self, default_ns, namespaces=[]):
QueryResponseProvenance.__init__(self, default_ns, namespaces)
    def parse_sparql (self, text, source_map, type="data"):
        """ Scan a SPARQL query for PREFIX declarations and record the referenced
            namespaces as provenance entities, mapping known IRIs to named sources. """
        sources = {}
        for line in text.split ('\n'):
            line = ' '.join (line.strip ().split ())
            match = self.PREFIX.match (line)
            if match is None:
                continue
            parts = line.split (' ')
            if len(parts) >= 3:
                iri = parts[2].strip ()
                e = None
                for k, v in source_map.items ():
                    if k in iri:
                        if v in sources:
                            e = sources[v]
                        else:
                            e = AbstractEntity (type, '{0}{1}'.format (self.default_ns, v))
                            sources[v] = e
                        e.add_attribute ('src', v)
                if e is None:
                    # Unrecognized namespace: record the IRI itself as the source.
                    e = AbstractEntity (type, 'http://purl.data.org/')
                    e.add_attribute ('src', iri)
                    self.add_entity ('data', e.to_tuple ())
        for k, v in sources.items ():
            self.add_entity ('data', v.to_tuple ())
""" And now we get really specific about one data lake, the green translator. """
class GreenTranslatorProvenance (DataLakeProvenance):
def __init__(self):
DataLakeProvenance.__init__(
self,
default_ns = 'http://purl.translator.org/prov/',
namespaces = [
'clinical', 'enviro', 'medbiochem', # data sources
'exposure.pm25-ozone', 'clinical.med.prescribed', 'blazegraph' # algorithms / assumptions
])
def parse_sparql (self, query):
super(GreenTranslatorProvenance, self).parse_sparql (
query,
source_map = {
'<http://chem2bio2rdf.org/ctd/' : 'c2b2r.ctd',
'GO_' : 'GO',
'<http://chem2bio2rdf.org/drugbank' : 'c2b2r.drugbank',
'monarch' : 'monarch'
},
type="medbiochem:data")
"""
Far from the only possible approach, this treats the collection of provenance like an aspect:
separate in implementation from the operations it decorates, and woven in at runtime via
Python's decorator functionality.
"""
def provenance ():
    """ Decorator that records provenance facts about each data access call. """
    def provenance_aspect (function):
        def wrapper (*args, **kwargs):
            # args[0] is the Query instance carrying the provenance context.
            start = args[0].provenance.get_time ()
            result = function (*args, **kwargs)
            end = args[0].provenance.get_time ()
            if 'patient' in function.__name__ or 'clinical' in function.__name__:
                datasource = 'clinical'
                args[0].provenance.add_algorithm ('clinical.med.prescribed', start, end)
                e = AbstractEntity ('clinical:data', '{0}{1}'.format ('http://purl.clinical.org/', datasource))
                e.add_attribute ('src', datasource)
                args[0].provenance.add_entity ('clinical:data', e.to_tuple ())
            elif 'get_exposure' in function.__name__:
                args[0].provenance.add_algorithm ('exposure.pm25-ozone', start, end)
                e = AbstractEntity ('enviro:data', 'http://purl.exposure.org/exposure')
                e.add_attribute ('enviro:src', 'exposure')
                args[0].provenance.add_entity ('enviro:data', e.to_tuple ())
                e = AbstractEntity ('enviro:exposure', 'http://purl.exposure.org/call')
                e.add_attribute ('enviro:call', '{0}({1}{2})'.format (function.__name__, args[1:], kwargs))
                args[0].provenance.add_entity ('enviro:call', e.to_tuple ())
            elif 'query_sparql' in function.__name__:
                args[0].provenance.add_algorithm ('blazegraph', start, end)
                args[0].provenance.parse_sparql (args[1])
                if False:
                    # Disabled for brevity: record the full SPARQL query text as well.
                    e = AbstractEntity ('medbiochem', 'http://purl.medbiochem.org/query')
                    e.add_attribute ('medbiochem:query', '{0}'.format (args[1]))
                    args[0].provenance.add_entity ('query', e.to_tuple ())
            return result
        return wrapper
    return provenance_aspect
"""
Here's a very rough sample of a hypothetical interface for querying a variety of data sources.
With respect to the person whose focus is on querying data, this is still essentially infrastructure.
"""
from string import Template
from SPARQLWrapper import SPARQLWrapper2, JSON
import requests
import urllib2
proxy = urllib2.ProxyHandler({'http': 'gateway.ad.renci.org:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
blazegraph_uri = "http://stars-blazegraph.renci.org/bigdata/sparql"
blazegraph = SPARQLWrapper2 (blazegraph_uri)
class Query (object):
def __init__(self):
self.provenance = GreenTranslatorProvenance ()
    @provenance()
    def get_exposure (self, start, end):
        """ Demonstration accessor: invoke the exposures smartAPI REST service. """
        return requests.post (
            "https://exposures.renci.org/v1/getExposureScore",
            data = {
                "stime": start,
                "etime": end,
                "exposure": "pm25",
                "loc": "35.720278,-79.176389,Sa,35.731944,-78.852778,Su",
                "tres": "string",
                "tscore": "string"
            }).json ()
@provenance()
def query_sparql (self, query, service=blazegraph):
service.setQuery (query)
service.setReturnFormat (JSON)
return service.query().convert ()
def plot (self):
self.provenance.document.plot ()
""" TEST
Finally, here's a usage scenario. This is what it should look like for a developer to use
the Translator framework to query resources and get a provenance model with no explicit effort.
"""
sparql_query = """
PREFIX db_resource: <http://chem2bio2rdf.org/drugbank/resource/>
PREFIX ctd_chem_disease: <http://chem2bio2rdf.org/ctd/resource/ctd_chem_disease/>
PREFIX biordf: <http://bio2rdf.org/>
PREFIX ctd: <http://chem2bio2rdf.org/ctd/resource/>
SELECT DISTINCT ?chem_disease ?meshid ?compound ?drug
WHERE {
?chem_disease ctd:diseaseid ?meshid .
?chem_disease ctd:cid ?compound .
?drug db_resource:CID ?compound
}"""
q = Query ()
q.get_exposure (start='1985-04-12', end='2017-04-12')
q.query_sparql (sparql_query)
print (q.provenance)
Output: The default output format is PROV-N, a simple hierarchical notation:
document
default <http://purl.translator.org/prov/>
prefix exposure.pm25-ozone <http://purl.translator.org/prov/exposure.pm25-ozone>
prefix clinical <http://purl.translator.org/prov/clinical>
prefix medbiochem <http://purl.translator.org/prov/medbiochem>
prefix blazegraph <http://purl.translator.org/prov/blazegraph>
prefix enviro <http://purl.translator.org/prov/enviro>
prefix clinical.med.prescribed <http://purl.translator.org/prov/clinical.med.prescribed>
activity(exposure.pm25-ozone, 2017-03-25T11:56:12, 2017-03-25T11:56:13)
entity(enviro:data, [prov:type="enviro:data", enviro:src="exposure"])
entity(enviro:call, [prov:type="enviro:exposure", enviro:call="get_exposure((){'start': '1985-04-12', 'end': '2017-04-12'})"])
activity(blazegraph, 2017-03-25T11:56:13, 2017-03-25T11:56:14)
entity(data, [prov:type="medbiochem:data", src="<http://bio2rdf.org/>"])
entity(data, [prov:type="medbiochem:data", src="c2b2r.drugbank"])
entity(data, [prov:type="medbiochem:data", src="c2b2r.ctd"])
endDocument
And here’s the JSON version:
{
"activity": {
"blazegraph": {
"prov:endTime": "2017-03-25T11:34:51",
"prov:startTime": "2017-03-25T11:34:48"
},
"exposure.pm25-ozone": {
"prov:endTime": "2017-03-25T11:34:48",
"prov:startTime": "2017-03-25T11:34:48"
}
},
"entity": {
"data": [
{
"prov:type": "medbiochem:data",
"src": "<http://bio2rdf.org/>"
},
{
"prov:type": "medbiochem:data",
"src": "c2b2r.drugbank"
},
{
"prov:type": "medbiochem:data",
"src": "c2b2r.ctd"
}
],
"enviro:call": {
"enviro:call": "get_exposure((){'start': '1985-04-12', 'end': '2017-04-12'})",
"prov:type": "enviro:exposure"
},
"enviro:data": {
"enviro:src": "exposure",
"prov:type": "enviro:data"
}
},
"prefix": {
"blazegraph": "http://purl.translator.org/prov/blazegraph",
"clinical": "http://purl.translator.org/prov/clinical",
"clinical.med.prescribed": "http://purl.translator.org/prov/clinical.med.prescribed",
"default": "http://purl.translator.org/prov/",
"enviro": "http://purl.translator.org/prov/enviro",
"exposure.pm25-ozone": "http://purl.translator.org/prov/exposure.pm25-ozone",
"medbiochem": "http://purl.translator.org/prov/medbiochem"
}
}