pyRdfa

RDFa 1.1 parser, also referred to as an “RDFa Distiller”. It is deployed, via a CGI front-end, on the U{W3C RDFa 1.1 Distiller page<http://www.w3.org/2012/pyRdfa/>}.

For details on RDFa, the reader should consult the U{RDFa Core 1.1<http://www.w3.org/TR/rdfa-core/>}, U{XHTML+RDFa1.1<http://www.w3.org/TR/2010/xhtml-rdfa>}, and the U{RDFa 1.1 Lite<http://www.w3.org/TR/rdfa-lite/>} documents. The U{RDFa 1.1 Primer<http://www.w3.org/TR/owl2-primer/>} may also prove helpful.

This package can also be downloaded U{from GitHub<https://github.com/RDFLib/pyrdfa3>}. The distribution also includes the CGI front-end and a separate utility script to be run locally.

Note that this package is an updated version of a U{previous RDFa distiller<http://www.w3.org/2007/08/pyRdfa>} that was developed for RDFa 1.0. Although it reuses large portions of that code, it has been quite thoroughly rewritten, hence put in a completely different project. (The version numbering has been continued, though, to avoid any kind of misunderstanding. This version has version numbers "3.0.0" or higher.)

(Simple) Usage

From a Python file, expecting a Turtle output::

    from pyRdfa import pyRdfa
    print(pyRdfa().rdf_from_source('filename'))

Other output formats are also possible. E.g., to produce RDF/XML output, one could use::

    from pyRdfa import pyRdfa
    print(pyRdfa().rdf_from_source('filename', outputFormat='pretty-xml'))

It is also possible to embed RDFa processing in an application. E.g., using::

    from pyRdfa import pyRdfa
    graph = pyRdfa().graph_from_source('filename')

returns an RDFLib.Graph object instead of a serialization thereof. See the description of the L{pyRdfa class<pyRdfa>} for details on further possible entry points.

There is also, as part of this module, a L{separate entry for CGI calls<processURI>}.

Return (serialization) formats

The package relies on RDFLib. By default, it therefore relies on the serializers coming with the local RDFLib distribution. However, there have been some issues with serializers of older RDFLib releases; also, some output formats, like JSON-LD, are not (yet) part of the standard RDFLib distribution. A companion package, called pyRdfaExtras, is part of the download, and it includes some of those extra serializers. The extra format (not part of the RDFLib core) is U{JSON-LD<http://json-ld.org/spec/latest/json-ld-syntax/>}, whose 'key' is 'json', when used in the 'parse' method of an RDFLib graph.

(Note in 2018: the bugs that required pyRdfaExtras are gone in current RDFLib versions, and the JSON-LD serializer and parser can be U{downloaded from GitHub<https://github.com/RDFLib/rdflib-jsonld>} (or installed via pip). This means that pyRdfaExtras is imported only when running older (i.e., 2.X.X) RDFLib versions and can be safely ignored these days.)

Options

The package also implements some optional features that are not part of the RDFa recommendations. At the moment these are:

  • possibility for plain literals to be normalized in terms of white spaces. Default: false. (The RDFa specification requires keeping the white spaces and leaves it to applications to normalize them, if needed)
  • inclusion of embedded RDF: Turtle content may be enclosed in a C{script} element and typed as C{text/turtle}, U{defined by the RDF Working Group<http://www.w3.org/TR/turtle/>}. Alternatively, some XML dialects (e.g., SVG) allow the usage of RDF/XML as part of their core content to define metadata in RDF. For both of these cases pyRdfa parses this serialized RDF content and adds the resulting triples to the output Graph. Default: true.
  • extra, built-in transformers are executed on the DOM tree prior to RDFa processing (see below). These transformers can be provided by the end user.

Options are collected in an instance of the L{Options} class and may be passed to the processing functions as an extra argument. E.g., to allow the inclusion of embedded content::

    from pyRdfa import pyRdfa
    from pyRdfa.options import Options
    options = Options(embedded_rdf=True)
    print(pyRdfa(options=options).rdf_from_source('filename'))

See the description of the L{Options} class for the details.

Host Languages

RDFa 1.1 Core is defined for generic XML; there are specific documents to describe how the generic specification is applied to XHTML and HTML5.

pyRdfa makes an automatic switch among these based on the content type of the source as returned by an HTTP request. The following are the possible host languages:

  • if the content type is C{text/html}, the content is HTML5
  • if the content type is C{application/xhtml+xml} I{and} the right DTD is used, the content is XHTML1
  • if the content type is C{application/xhtml+xml} and no or an unknown DTD is used, the content is XHTML5
  • if the content type is C{application/svg+xml}, the content is SVG
  • if the content type is C{application/atom+xml}, the content is Atom
  • if the content type is C{application/xml} or C{application/xxx+xml} (but 'xxx' is not 'atom' or 'svg'), the content is XML

If local files are used, pyRdfa makes a guess on the content type based on the file name suffix: C{.html} is for HTML5, C{.xhtml} for XHTML1, C{.svg} for SVG, anything else is considered to be general XML. Finally, the content type may be set by the caller when initializing the L{pyRdfa class<pyRdfa>}.
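The selection rules above can be condensed into a small lookup. What follows is only a sketch under stated assumptions: the function name C{guess_host_language}, the string labels, the C{has_rdfa_dtd} flag and the C{None} result for unlisted media types are hypothetical, and C{application/atom+xml} is labelled Atom on the basis of its media type. The actual tables are the ones in the L{host} module.

```python
import os

# Hypothetical labels for this sketch; pyRdfa's own constants live in its host module.
SUFFIX_TO_HOST = {".html": "HTML5", ".xhtml": "XHTML1", ".svg": "SVG"}


def guess_host_language(content_type=None, file_name=None, has_rdfa_dtd=False):
    """Return a host language label for a content type or a local file name."""
    if content_type is None:
        # Local file: guess from the file name suffix; anything else is generic XML.
        suffix = os.path.splitext(file_name or "")[1].lower()
        return SUFFIX_TO_HOST.get(suffix, "XML")

    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "text/html":
        return "HTML5"
    if media_type == "application/xhtml+xml":
        return "XHTML1" if has_rdfa_dtd else "XHTML5"
    if media_type == "application/svg+xml":
        return "SVG"
    if media_type == "application/atom+xml":
        return "Atom"
    if media_type == "application/xml" or (
        media_type.startswith("application/") and media_type.endswith("+xml")
    ):
        return "XML"
    # Assumption of this sketch: media types not covered by the list yield None.
    return None


print(guess_host_language("text/html; charset=utf-8"))
print(guess_host_language(file_name="example.svg"))
```

Keeping the suffix table separate from the media type checks mirrors the two situations described in the text: a source fetched over HTTP and a local file.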

Beyond the differences described in the RDFa specification, the main difference is the parser used to parse the source. In the case of HTML5, pyRdfa uses an U{HTML5 parser<http://code.google.com/p/html5lib/>}; for all other cases the simple XML parser, part of the core Python environment, is used. This may be significant in the case of erroneous sources: indeed, the HTML5 parser may make adjustments to the DOM tree before handing it over to the distiller. Furthermore, SVG is also recognized as a type that allows embedded RDF in the form of RDF/XML.

See the variables in the L{host} module if a new host language is added to the system. The current host language information is available for transformers via the option argument, too, and can be used to control the effect of the transformer.

Vocabularies

RDFa 1.1 has the notion of vocabulary files (using the C{@vocab} attribute) that may be used to expand the generated RDF graph. Expansion is based on some very simple RDF Schema and OWL statements on sub-properties, sub-classes, and equivalences.

pyRdfa implements this feature, although it does not do this by default. The extra C{vocab_expansion} parameter should be used for this extra step, for example::

    from pyRdfa import pyRdfa
    from pyRdfa.options import Options
    options = Options(vocab_expansion=True)
    print(pyRdfa(options=options).rdf_from_source('filename'))

The triples in the vocabulary files themselves (i.e., the small ontology in RDF Schema and OWL) are removed from the result, leaving the inferred property and type relationships only (in addition to the “core” RDF content).
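To make the effect of the expansion concrete, here is a toy sketch that works on plain tuples instead of an RDFLib graph. It is not pyRdfa's implementation: the function C{expand}, the abbreviated term names and the sample data are all hypothetical, and only sub-property, sub-class and equivalence statements are considered.

```python
RDF_TYPE = "rdf:type"
SUB_PROPERTY = "rdfs:subPropertyOf"
SUB_CLASS = "rdfs:subClassOf"
EQ_PROPERTY = "owl:equivalentProperty"
EQ_CLASS = "owl:equivalentClass"


def expand(data, vocabulary):
    """Return the data triples plus the inferred ones; vocabulary triples are left out."""
    super_props, super_classes = {}, {}
    for s, p, o in vocabulary:
        if p in (SUB_PROPERTY, EQ_PROPERTY):
            super_props.setdefault(s, set()).add(o)
        if p == EQ_PROPERTY:
            # Equivalence works in both directions.
            super_props.setdefault(o, set()).add(s)
        if p in (SUB_CLASS, EQ_CLASS):
            super_classes.setdefault(s, set()).add(o)
        if p == EQ_CLASS:
            super_classes.setdefault(o, set()).add(s)

    result = set(data)
    pending = list(result)
    while pending:
        s, p, o = pending.pop()
        # A statement with a sub-property also holds for each super-property.
        inferred = {(s, sup, o) for sup in super_props.get(p, ())}
        if p == RDF_TYPE:
            # An instance of a sub-class is also an instance of each super-class.
            inferred |= {(s, RDF_TYPE, sup) for sup in super_classes.get(o, ())}
        for triple in inferred - result:
            result.add(triple)
            pending.append(triple)
    return result


vocabulary = {
    ("ex:author", SUB_PROPERTY, "dc:creator"),
    ("ex:Book", SUB_CLASS, "ex:Work"),
}
data = {("doc", "ex:author", "Ann"), ("doc", RDF_TYPE, "ex:Book")}
for triple in sorted(expand(data, vocabulary)):
    print(triple)
```

Newly inferred triples are queued again, so chains of sub-properties or sub-classes are followed until nothing new appears.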

Vocabulary caching

By default, pyRdfa uses a caching mechanism instead of fetching the vocabulary files each time their URI is met as a C{@vocab} attribute value. (This behavior can be switched off by setting the C{vocab_cache} option to false.)

Caching happens in a file system directory. The directory itself is determined by the platform the tool is used on, namely:

  • On Windows, it is the C{pyRdfa-cache} subdirectory of the C{%APPDATA%} environment variable
  • On MacOS, it is the C{~/Library/Application Support/pyRdfa-cache}
  • Otherwise, it is the C{~/.pyRdfa-cache}

This automatic choice can be overridden by the C{PyRdfaCacheDir} environment variable.
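The directory selection described above could be sketched as follows. The function name C{cache_directory} and its two parameters are hypothetical (they exist only to make the sketch testable), and falling back to the home directory when C{%APPDATA%} is not set is an assumption of this sketch, not documented behavior.

```python
import os
import sys

CACHE_DIR_VAR = "PyRdfaCacheDir"  # environment variable named in the text


def cache_directory(environ=None, platform=None):
    """Return the vocabulary cache directory for the given environment and platform."""
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform
    home = os.path.expanduser("~")

    # The environment variable overrides the platform-specific default.
    if environ.get(CACHE_DIR_VAR):
        return environ[CACHE_DIR_VAR]
    if platform.startswith("win"):
        # Assumption: use the home directory if %APPDATA% is missing.
        return os.path.join(environ.get("APPDATA", home), "pyRdfa-cache")
    if platform == "darwin":
        return os.path.join(home, "Library", "Application Support", "pyRdfa-cache")
    return os.path.join(home, ".pyRdfa-cache")


print(cache_directory())
```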

Caching can be set to be read-only, i.e., the setup might generate the cache files off-line instead of letting the tool write its own cache when operating, e.g., as a service on the Web. This can be achieved by making the cache directory read-only.

If the directories are neither readable nor writable, the vocabulary files are retrieved via HTTP every time they are hit. This may slow down processing, so it is advisable to avoid such a setup.

The cache includes a separate index file and a file for each vocabulary file. Cache control is based upon the C{EXPIRES} field of a vocabulary file’s HTTP return header: when first seen, this date is stored in the index file and controls whether the cache has to be renewed or not. If the HTTP return header does not have this entry, the date is artificially set to the current date plus one day.
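The expiration rule can be illustrated with the standard library alone. This is a sketch, not the package's code: the names C{expiration_date} and C{needs_refresh} are hypothetical, headers are assumed to arrive as a plain dictionary, and treating an unparsable date like a missing one is an assumption made here.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def expiration_date(headers, now=None):
    """Return the date after which a cached vocabulary should be fetched again."""
    now = now or datetime.now(timezone.utc)
    # HTTP header names are case-insensitive.
    lowered = {key.lower(): value for key, value in headers.items()}
    value = lowered.get("expires")
    if value:
        try:
            expires = parsedate_to_datetime(value)
            if expires.tzinfo is None:
                expires = expires.replace(tzinfo=timezone.utc)
            return expires
        except (TypeError, ValueError):
            # Assumption: an unparsable date is handled like a missing header.
            pass
    # No usable entry: the current date plus one day.
    return now + timedelta(days=1)


def needs_refresh(expires, now=None):
    """True once the stored expiration date has passed."""
    now = now or datetime.now(timezone.utc)
    return now >= expires


print(expiration_date({"Expires": "Wed, 21 Oct 2015 07:28:00 GMT"}))
```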

(The cache files themselves are dumped and loaded using U{Python’s built-in cPickle package<http://docs.python.org/release/2.7/library/pickle.html#module-cPickle>}. These are binary files. Care should be taken if they are managed by CVS: they must be declared as binary files when adding them to the repository.)

RDFa 1.1 vs. RDFa 1.0

Unfortunately, RDFa 1.1 is I{not} fully backward compatible with RDFa 1.0, meaning that, in a few cases, the triples generated from an RDFa 1.1 source are not the same as for RDFa 1.0. (See the separate U{section in the RDFa 1.1 specification<http://www.w3.org/TR/rdfa-core/#major-differences-with-rdfa-syntax-1.0>} for some further details.)

This distiller’s default behavior is RDFa 1.1. However, if the source includes, in the top element of the file (e.g., the C{html} element), a C{@version} attribute whose value contains the C{RDFa 1.0} string, then the distiller switches to an RDFa 1.0 mode. (Although the C{@version} attribute is not required in RDFa 1.0, it is fairly commonly used.) Similarly, if the RDFa 1.0 DTD is used in the XHTML source, it will be taken into account (a very frequent setup is that an XHTML file is defined with that DTD and is served as text/html; pyRdfa will consider that file as XHTML5, i.e., parse it with the HTML5 parser, but interpret the RDFa attributes under the RDFa 1.0 rules).
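The version switch can be pictured with a few lines based on C{xml.dom.minidom}. This is a sketch only: C{detect_rdfa_version} is a hypothetical name, and recognizing the RDFa 1.0 DTD by looking for C{RDFa 1.0} in the DOCTYPE public identifier is an assumption of this sketch rather than the package's actual test.

```python
import xml.dom.minidom


def detect_rdfa_version(source, default="1.1"):
    """Return "1.0" if the document asks for RDFa 1.0, else the default version."""
    dom = xml.dom.minidom.parseString(source)

    # The @version attribute of the top element takes effect if it mentions RDFa 1.0.
    version = dom.documentElement.getAttribute("version")
    if "RDFa 1.0" in version:
        return "1.0"

    # Assumption: the RDFa 1.0 DTD is recognized through its public identifier.
    doctype = dom.doctype
    if doctype is not None and doctype.publicId and "RDFa 1.0" in doctype.publicId:
        return "1.0"
    return default


print(detect_rdfa_version('<html version="XHTML+RDFa 1.0"><head/></html>'))
print(detect_rdfa_version("<html><head/></html>"))
```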

Transformers

The package uses the concept of 'transformers': the parsed DOM tree is possibly transformed I{before} performing the real RDFa processing. This transformer structure makes it possible to add additional 'services' without distorting the core code of RDFa processing.

A transformer is a function with three arguments:

  • C{node}: a DOM node for the top level element of the DOM tree
  • C{options}: the current L{Options} instance
  • C{state}: the current L{ExecutionContext} instance, corresponding to the top level DOM Tree element

The function may perform any type of change on the DOM tree; the typical behavior is to add or remove attributes on specific elements. Some transformations are included in the package and can be used as examples; see the L{transform} module of the distribution. These are:

  • The C{@name} attribute of the C{meta} element is copied into a C{@property} attribute of the same element
  • Interpreting the 'openid' references in the header. See L{transform.OpenID} for further details.
  • Implementing the Dublin Core dialect to include DC statements from the header. See L{transform.DublinCore} for further details.

The user of the package may add these transformers to an L{Options} instance. Here is a possible usage with the “openid” transformer added to the call::

    from pyRdfa import pyRdfa
    from pyRdfa.options import Options
    from pyRdfa.transform.OpenID import OpenID_transform
    options = Options(transformers=[OpenID_transform])
    print(pyRdfa(options=options).rdf_from_source('filename'))
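A user-defined transformer only needs the three-argument signature described above. The sketch below imitates the first built-in transformation (copying C{@name} into C{@property} on C{meta} elements) with C{xml.dom.minidom}; the function name C{copy_meta_name} is hypothetical, the choice to leave an existing C{@property} untouched is an assumption, and the C{options} and C{state} arguments are simply ignored here.

```python
import xml.dom.minidom


def copy_meta_name(node, options, state):
    """Transformer sketch: copy @name into @property on every meta element."""
    for meta in node.getElementsByTagName("meta"):
        # Assumption: an already present @property is not overwritten.
        if meta.hasAttribute("name") and not meta.hasAttribute("property"):
            meta.setAttribute("property", meta.getAttribute("name"))


source = '<html><head><meta name="dc:title" content="Example"/></head></html>'
dom = xml.dom.minidom.parseString(source)
# Outside of pyRdfa there is no Options or ExecutionContext instance at hand.
copy_meta_name(dom.documentElement, None, None)
print(dom.documentElement.toxml())
```

Such a function would be passed in the C{transformers} list of an L{Options} instance, in the same way as C{OpenID_transform} in the example above.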

@summary: RDFa parser (distiller)
@requires: Python version 2.7 or Python 3.8 or up
@requires: U{RDFLib<http://rdflib.net>}; version 3.X is preferred.
@requires: U{html5lib<http://code.google.com/p/html5lib/>} for the HTML5 parsing (note that versions 1.0b1 and 1.0b2 should be avoided, as they may lead to Unicode encoding problems)
@requires: U{httpheader<http://deron.meranda.us/python/httpheader/>}; however, a small modification had to be made to the original file, so for this reason, and to make distribution easier, this module (single file) is added to the package.
@organization: U{World Wide Web Consortium<http://www.w3.org>}
@author: U{Ivan Herman<http://www.w3.org/People/Ivan/>}
@license: This software is available for use under the U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}

@var builtInTransformers: List of built-in transformers that are to be run regardless, because they are part of the RDFa spec
@var CACHE_DIR_VAR: Environment variable used to define cache directories for RDFa vocabularies in case the default setting does not work or is not appropriate.
@var rdfa_current_version: Current "official" version of RDFa that this package implements by default. This can be changed at the invocation of the package
@var uri_schemes: List of registered (or widely used) URI schemes; used for warnings...

# -*- coding: utf-8 -*-
"""
RDFa 1.1 parser, also referred to as an “RDFa Distiller”. It is
deployed, via a CGI front-end, on the U{W3C RDFa 1.1 Distiller page<http://www.w3.org/2012/pyRdfa/>}.

For details on RDFa, the reader should consult the U{RDFa Core 1.1<http://www.w3.org/TR/rdfa-core/>}, U{XHTML+RDFa1.1<http://www.w3.org/TR/2010/xhtml-rdfa>}, and the U{RDFa 1.1 Lite<http://www.w3.org/TR/rdfa-lite/>} documents.
The U{RDFa 1.1 Primer<http://www.w3.org/TR/owl2-primer/>} may also prove helpful.

This package can also be downloaded U{from GitHub<https://github.com/RDFLib/pyrdfa3>}. The
distribution also includes the CGI front-end and a separate utility script to be run locally.

Note that this package is an updated version of a U{previous RDFa distiller<http://www.w3.org/2007/08/pyRdfa>} that was developed
for RDFa 1.0. Although it reuses large portions of that code, it has been quite thoroughly rewritten, hence put in a completely
different project. (The version numbering has been continued, though, to avoid any kind of misunderstanding. This version has version numbers "3.0.0" or higher.)

(Simple) Usage
==============
From a Python file, expecting a Turtle output::
 from pyRdfa import pyRdfa
 print(pyRdfa().rdf_from_source('filename'))
Other output formats are also possible. E.g., to produce RDF/XML output, one could use::
 from pyRdfa import pyRdfa
 print(pyRdfa().rdf_from_source('filename', outputFormat='pretty-xml'))
It is also possible to embed RDFa processing in an application. E.g., using::
 from pyRdfa import pyRdfa
 graph = pyRdfa().graph_from_source('filename')
returns an RDFLib.Graph object instead of a serialization thereof. See the description of the
L{pyRdfa class<pyRdfa.pyRdfa>} for details on further possible entry points.

There is also, as part of this module, a L{separate entry for CGI calls<processURI>}.

Return (serialization) formats
------------------------------

The package relies on RDFLib. By default, it therefore relies on the serializers coming with the local RDFLib distribution. However, there have been some issues with serializers of older RDFLib releases; also, some output formats, like JSON-LD, are not (yet) part of the standard RDFLib distribution. A companion package, called pyRdfaExtras, is part of the download, and it includes some of those extra serializers. The extra format (not part of the RDFLib core) is U{JSON-LD<http://json-ld.org/spec/latest/json-ld-syntax/>}, whose 'key' is 'json', when used in the 'parse' method of an RDFLib graph.

(Note in 2018: the bugs that required pyRdfaExtras are gone in current RDFLib versions, and the JSON-LD serializer and parser can be U{downloaded from GitHub<https://github.com/RDFLib/rdflib-jsonld>} (or installed via pip). This means that pyRdfaExtras is imported only when running older (i.e., 2.X.X) RDFLib versions and can be safely ignored these days.)

Options
=======

The package also implements some optional features that are not part of the RDFa recommendations. At the moment these are:

 - possibility for plain literals to be normalized in terms of white spaces. Default: false. (The RDFa specification requires keeping the white spaces and leaves it to applications to normalize them, if needed)
 - inclusion of embedded RDF: Turtle content may be enclosed in a C{script} element and typed as C{text/turtle}, U{defined by the RDF Working Group<http://www.w3.org/TR/turtle/>}. Alternatively, some XML dialects (e.g., SVG) allow the usage of RDF/XML as part of their core content to define metadata in RDF. For both of these cases pyRdfa parses this serialized RDF content and adds the resulting triples to the output Graph. Default: true.
 - extra, built-in transformers are executed on the DOM tree prior to RDFa processing (see below). These transformers can be provided by the end user.

Options are collected in an instance of the L{Options} class and may be passed to the processing functions as an extra argument. E.g., to allow the inclusion of embedded content::
 from pyRdfa.options import Options
 options = Options(embedded_rdf=True)
 print(pyRdfa(options=options).rdf_from_source('filename'))

See the description of the L{Options} class for the details.


Host Languages
==============

RDFa 1.1 Core is defined for generic XML; there are specific documents to describe how the generic specification is applied to
XHTML and HTML5.

pyRdfa makes an automatic switch among these based on the content type of the source as returned by an HTTP request. The following are the
possible host languages:
 - if the content type is C{text/html}, the content is HTML5
 - if the content type is C{application/xhtml+xml} I{and} the right DTD is used, the content is XHTML1
 - if the content type is C{application/xhtml+xml} and no or an unknown DTD is used, the content is XHTML5
 - if the content type is C{application/svg+xml}, the content is SVG
 - if the content type is C{application/atom+xml}, the content is Atom
 - if the content type is C{application/xml} or C{application/xxx+xml} (but 'xxx' is not 'atom' or 'svg'), the content is XML

If local files are used, pyRdfa makes a guess on the content type based on the file name suffix: C{.html} is for HTML5, C{.xhtml} for XHTML1, C{.svg} for SVG, anything else is considered to be general XML. Finally, the content type may be set by the caller when initializing the L{pyRdfa class<pyRdfa.pyRdfa>}.

Beyond the differences described in the RDFa specification, the main difference is the parser used to parse the source. In the case of HTML5, pyRdfa uses an U{HTML5 parser<http://code.google.com/p/html5lib/>}; for all other cases the simple XML parser, part of the core Python environment, is used. This may be significant in the case of erroneous sources: indeed, the HTML5 parser may make adjustments to
the DOM tree before handing it over to the distiller. Furthermore, SVG is also recognized as a type that allows embedded RDF in the form of RDF/XML.

See the variables in the L{host} module if a new host language is added to the system. The current host language information is available for transformers via the option argument, too, and can be used to control the effect of the transformer.

Vocabularies
============

RDFa 1.1 has the notion of vocabulary files (using the C{@vocab} attribute) that may be used to expand the generated RDF graph. Expansion is based on some very simple RDF Schema and OWL statements on sub-properties, sub-classes, and equivalences.

pyRdfa implements this feature, although it does not do this by default. The extra C{vocab_expansion} parameter should be used for this extra step, for example::
 from pyRdfa.options import Options
 options = Options(vocab_expansion=True)
 print(pyRdfa(options=options).rdf_from_source('filename'))

The triples in the vocabulary files themselves (i.e., the small ontology in RDF Schema and OWL) are removed from the result, leaving the inferred property and type relationships only (in addition to the “core” RDF content).

Vocabulary caching
------------------

By default, pyRdfa uses a caching mechanism instead of fetching the vocabulary files each time their URI is met as a C{@vocab} attribute value. (This behavior can be switched off by setting the C{vocab_cache} option to false.)

Caching happens in a file system directory. The directory itself is determined by the platform the tool is used on, namely:
 - On Windows, it is the C{pyRdfa-cache} subdirectory of the C{%APPDATA%} environment variable
 - On MacOS, it is the C{~/Library/Application Support/pyRdfa-cache}
 - Otherwise, it is the C{~/.pyRdfa-cache}

This automatic choice can be overridden by the C{PyRdfaCacheDir} environment variable.

Caching can be set to be read-only, i.e., the setup might generate the cache files off-line instead of letting the tool write its own cache when operating, e.g., as a service on the Web. This can be achieved by making the cache directory read-only.

If the directories are neither readable nor writable, the vocabulary files are retrieved via HTTP every time they are hit. This may slow down processing, so it is advisable to avoid such a setup.

The cache includes a separate index file and a file for each vocabulary file. Cache control is based upon the C{EXPIRES} field of a vocabulary file’s HTTP return header: when first seen, this date is stored in the index file and controls whether the cache has to be renewed or not. If the HTTP return header does not have this entry, the date is artificially set to the current date plus one day.

(The cache files themselves are dumped and loaded using U{Python’s built-in cPickle package<http://docs.python.org/release/2.7/library/pickle.html#module-cPickle>}. These are binary files. Care should be taken if they are managed by CVS: they must be declared as binary files when adding them to the repository.)

RDFa 1.1 vs. RDFa 1.0
=====================

Unfortunately, RDFa 1.1 is I{not} fully backward compatible with RDFa 1.0, meaning that, in a few cases, the triples generated from an RDFa 1.1 source are not the same as for RDFa 1.0. (See the separate U{section in the RDFa 1.1 specification<http://www.w3.org/TR/rdfa-core/#major-differences-with-rdfa-syntax-1.0>} for some further details.)

This distiller’s default behavior is RDFa 1.1. However, if the source includes, in the top element of the file (e.g., the C{html} element), a C{@version} attribute whose value contains the C{RDFa 1.0} string, then the distiller switches to an RDFa 1.0 mode. (Although the C{@version} attribute is not required in RDFa 1.0, it is fairly commonly used.) Similarly, if the RDFa 1.0 DTD is used in the XHTML source, it will be taken into account (a very frequent setup is that an XHTML file is defined with that DTD and is served as text/html; pyRdfa will consider that file as XHTML5, i.e., parse it with the HTML5 parser, but interpret the RDFa attributes under the RDFa 1.0 rules).

Transformers
============

The package uses the concept of 'transformers': the parsed DOM tree is possibly
transformed I{before} performing the real RDFa processing. This transformer structure makes it possible to
add additional 'services' without distorting the core code of RDFa processing.

A transformer is a function with three arguments:

 - C{node}: a DOM node for the top level element of the DOM tree
 - C{options}: the current L{Options} instance
 - C{state}: the current L{ExecutionContext} instance, corresponding to the top level DOM Tree element

The function may perform any type of change on the DOM tree; the typical behavior is to add or remove attributes on specific elements. Some transformations are included in the package and can be used as examples; see the L{transform} module of the distribution. These are:

 - The C{@name} attribute of the C{meta} element is copied into a C{@property} attribute of the same element
 - Interpreting the 'openid' references in the header. See L{transform.OpenID} for further details.
 - Implementing the Dublin Core dialect to include DC statements from the header. See L{transform.DublinCore} for further details.

The user of the package may add these transformers to an L{Options} instance. Here is a possible usage with the “openid” transformer added to the call::
 from pyRdfa.options import Options
 from pyRdfa.transform.OpenID import OpenID_transform
 options = Options(transformers=[OpenID_transform])
 print(pyRdfa(options=options).rdf_from_source('filename'))


@summary: RDFa parser (distiller)
@requires: Python version 2.7 or Python 3.8 or up
@requires: U{RDFLib<http://rdflib.net>}; version 3.X is preferred.
@requires: U{html5lib<http://code.google.com/p/html5lib/>} for the HTML5 parsing (note that versions 1.0b1 and 1.0b2 should be avoided, as they may lead to Unicode encoding problems)
@requires: U{httpheader<http://deron.meranda.us/python/httpheader/>}; however, a small modification had to be made to the original file, so for this reason, and to make distribution easier, this module (single file) is added to the package.
@organization: U{World Wide Web Consortium<http://www.w3.org>}
@author: U{Ivan Herman<http://www.w3.org/People/Ivan/>}
@license: This software is available for use under the
U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}

@var builtInTransformers: List of built-in transformers that are to be run regardless, because they are part of the RDFa spec
@var CACHE_DIR_VAR: Environment variable used to define cache directories for RDFa vocabularies in case the default setting does not work or is not appropriate.
@var rdfa_current_version: Current "official" version of RDFa that this package implements by default. This can be changed at the invocation of the package
@var uri_schemes: List of registered (or widely used) URI schemes; used for warnings...
"""

__version__ = "3.6.0"
__author__ =  'Ivan Herman and prrvchr'
__contact__ = 'prrvchr@gmail.com'
__license__ = 'W3C® SOFTWARE NOTICE AND LICENSE, http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231'

name = "pyRdfa3"

import sys

from io import StringIO, IOBase

import os
import xml.dom.minidom
from urllib.parse import urlparse

import rdflib
from rdflib import URIRef
from rdflib import Literal
from rdflib import BNode
from rdflib import Namespace
from rdflib import RDF as ns_rdf
from rdflib import RDFS as ns_rdfs
from rdflib import Graph

# Namespace, in the RDFLib sense, for the rdfa vocabulary
ns_rdfa = Namespace("http://www.w3.org/ns/rdfa#")

from .extras.httpheader import acceptable_content_type, content_type
from .transform.prototype import handle_prototypes

# Vocabulary terms for vocab reporting
RDFA_VOCAB = ns_rdfa["usesVocabulary"]

# Namespace, in the RDFLib sense, for the XSD Datatypes
ns_xsd = Namespace('http://www.w3.org/2001/XMLSchema#')

# Namespace, in the RDFLib sense, for the distiller vocabulary, used as part of the processor graph
ns_distill = Namespace("http://www.w3.org/2007/08/pyRdfa/vocab#")

debug = False

#########################################################################################################

# Exception/error handling. Essentially, all the different exceptions are re-packaged into
# separate exception classes, to allow for an easier management on the user level

class RDFaError(Exception):
    """Superclass of exceptions representing error conditions defined by the RDFa 1.1 specification.
    It does not add any new functionality to the
    Exception class."""
    def __init__(self, msg):
        self.msg = msg
        # Pass the message on so that str(exception) is not empty
        Exception.__init__(self, msg)

class FailedSource(RDFaError):
    """Raised when the original source cannot be accessed. It does not add any new functionality to the
    Exception class."""
    def __init__(self, msg, http_code = None):
        self.msg = msg
        self.http_code = http_code
        RDFaError.__init__(self, msg)

class HTTPError(RDFaError):
    """Raised when HTTP problems are detected. It does not add any new functionality to the
    Exception class."""
    def __init__(self, http_msg, http_code):
        self.msg = http_msg
        self.http_code = http_code
        RDFaError.__init__(self, http_msg)

class ProcessingError(RDFaError):
    """Error found during processing. It does not add any new functionality to the
    Exception class."""
    pass

class pyRdfaError(Exception):
    """Superclass of exceptions representing error conditions outside the RDFa 1.1 specification."""
    pass

# Error and Warning RDFS classes
RDFA_Error =                ns_rdfa["Error"]
240RDFA_Warning =              ns_rdfa["Warning"]
241RDFA_Info =                 ns_rdfa["Information"]
242NonConformantMarkup =       ns_rdfa["DocumentError"]
243UnresolvablePrefix =        ns_rdfa["UnresolvedCURIE"]
244UnresolvableReference =     ns_rdfa["UnresolvedCURIE"]
245UnresolvableTerm =          ns_rdfa["UnresolvedTerm"]
246VocabReferenceError =       ns_rdfa["VocabReferenceError"]
247PrefixRedefinitionWarning = ns_rdfa["PrefixRedefinition"]
248
249FileReferenceError =        ns_distill["FileReferenceError"]
250HTError =                   ns_distill["HTTPError"]
251IncorrectPrefixDefinition = ns_distill["IncorrectPrefixDefinition"]
252IncorrectBlankNodeUsage =   ns_distill["IncorrectBlankNodeUsage"]
253IncorrectLiteral =          ns_distill["IncorrectLiteral"]
254
255# Error message texts
256err_no_blank_node = "Blank node in %s position is not allowed; ignored"
257
258err_redefining_URI_as_prefix = "'%s' a registered or an otherwise used URI scheme, but is defined as a prefix here; is this a mistake? (see, eg, http://en.wikipedia.org/wiki/URI_scheme or http://www.iana.org/assignments/uri-schemes.html for further information for most of the URI schemes)"
err_xmlns_deprecated =         "The usage of 'xmlns' for prefix definition is deprecated; please use the 'prefix' attribute instead (definition for '%s')"
err_bnode_local_prefix =       "The '_' local CURIE prefix is reserved for blank nodes, and cannot be defined as a prefix"
err_col_local_prefix =         "The character ':' is not valid in a CURIE Prefix, and cannot be used in a prefix definition (definition for '%s')"
err_missing_URI_prefix =       "Missing URI in prefix declaration for '%s' (in '%s')"
err_invalid_prefix =           "Invalid prefix declaration '%s' (in '%s')"
err_no_default_prefix =        "Default prefix cannot be changed (in '%s')"
err_prefix_and_xmlns =         "@prefix setting for '%s' overrides the 'xmlns:%s' setting; may be a source of problems if the same file is run through RDFa 1.0"
err_non_ncname_prefix =        "Non NCNAME '%s' in prefix definition (in '%s'); ignored"
err_absolute_reference =       "CURIE Reference part contains an authority part: %s (in '%s'); ignored"
err_query_reference =          "CURIE Reference query part contains an unauthorized character: %s (in '%s'); ignored"
err_fragment_reference =       "CURIE Reference fragment part contains an unauthorized character: %s (in '%s'); ignored"
err_lang =                     "There is a problem with language setting; either both xml:lang and lang used on an element with different values, or, for (X)HTML5, only xml:lang is used."
err_URI_scheme =               "Unusual URI scheme used in <%s>; could this be a mistake, e.g., resulting from using an undefined CURIE prefix or an incorrect CURIE?"
err_illegal_safe_CURIE =       "Illegal safe CURIE: %s; ignored"
err_no_CURIE_in_safe_CURIE =   "Safe CURIE is used, but the value does not correspond to a defined CURIE: [%s]; ignored"
err_undefined_terms =          "'%s' is used as a term, but has not been defined as such; ignored"
err_non_legal_CURIE_ref =      "Relative URI is not allowed in this position (or not a legal CURIE reference) '%s'; ignored"
err_undefined_CURIE =          "Undefined CURIE: '%s'; ignored"
err_prefix_redefinition =      "Prefix '%s' (defined in the initial RDFa context or in an ancestor) is redefined"

err_unusual_char_in_URI =      "Unusual character in uri: %s; possible error?"

#############################################################################################

from .state import ExecutionContext
from .parse import parse_one_node
from .options import Options
from .transform import top_about, empty_safe_curie, vocab_for_role
from .utils import URIOpener
from .host import HostLanguage, MediaTypes, preferred_suffixes, content_to_host_language

# Environment variable used to characterize cache directories for RDFa vocabulary files.
CACHE_DIR_VAR = "PyRdfaCacheDir"

# Current "official" version of RDFa that this package implements. This can be changed at the invocation of the package
rdfa_current_version = "1.1"

# I removed schemes that would not appear as a prefix anyway, like iris.beep
# http://en.wikipedia.org/wiki/URI_scheme seems to be a good source of information
# as well as http://www.iana.org/assignments/uri-schemes.html
# There are some overlaps here, but better too many than not enough...

# This comes from wikipedia
registered_iana_schemes = [
    "aaa", "aaas", "acap", "cap", "cid", "crid", "data", "dav", "dict", "did", "dns", "fax", "file", "ftp", "geo", "go",
    "gopher", "h323", "http", "https", "iax", "icap", "im", "imap", "info", "ipp", "iris", "ldap", "lsid",
    "mailto", "mid", "modem", "msrp", "msrps", "mtqp", "mupdate", "news", "nfs", "nntp", "opaquelocktoken",
    "pop", "pres", "prospero", "rtsp", "rsync", "service", "shttp", "sieve", "sip", "sips", "sms", "snmp", "soap", "tag",
    "tel", "telnet", "tftp", "thismessage", "tn3270", "tip", "tv", "urn", "vemmi", "wais", "ws", "wss", "xmpp"
]

# This comes from wikipedia, too
unofficial_common = [
    "about", "adiumxtra", "aim", "apt", "afp", "aw", "bitcoin", "bolo", "callto", "chrome", "coap",
    "content", "cvs", "doi", "ed2k", "facetime", "feed", "finger", "fish", "git", "gg",
    "gizmoproject", "gtalk", "irc", "ircs", "irc6", "itms", "jar", "javascript",
    "keyparc", "lastfm", "ldaps", "magnet", "maps", "market", "message", "mms",
    "msnim", "mumble", "mvn", "notes", "palm", "paparazzi", "psync", "rmi",
    "secondlife", "sgn", "skype", "spotify", "ssh", "sftp", "smb", "soldat",
    "steam", "svn", "teamspeak", "things", "udb", "unreal", "ut2004",
    "ventrilo", "view-source", "webcal", "wtai", "wyciwyg", "xfire", "xri", "ymsgr"
]

# These come from the IANA page
historical_iana_schemes = [
    "fax", "mailserver", "modem", "pack", "prospero", "snews", "videotex", "wais"
]

provisional_iana_schemes = [
    "afs", "dtn", "dvb", "icon", "ipn", "jms", "oid", "rsync", "ni"
]

other_used_schemes = [
    "hdl", "isbn", "issn", "mstp", "rtmp", "rtspu", "stp"
]

uri_schemes = registered_iana_schemes + unofficial_common + historical_iana_schemes + provisional_iana_schemes + other_used_schemes
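
# Illustrative sketch only, not part of the original module: one way a scheme list such as
# the one above could be used to flag the "unusual URI scheme" case that err_URI_scheme
# reports. The helper name and its short default list are assumptions made for this
# example; the check actually performed by the package is not shown in this part of the file.
from urllib.parse import urlparse as _sketch_urlparse

def _sketch_has_unusual_scheme(uri, known_schemes=("http", "https", "mailto", "urn")):
    """Return True if C{uri} has a scheme part that is not listed in C{known_schemes}."""
    scheme = _sketch_urlparse(uri).scheme.lower()
    # An empty scheme means a relative reference, which is not "unusual" in this sense
    return scheme != "" and scheme not in known_schemes

# A leftover CURIE such as "foaf:name" parses with the scheme "foaf" and would be flagged,
# whereas "http://example.org/" or a relative path would not.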

# List of built-in transformers that are to be run regardless, because they are part of the RDFa spec
builtInTransformers = [
    empty_safe_curie, top_about, vocab_for_role
]

#########################################################################################################
class pyRdfa:
    """Main processing class for the distiller

    @ivar options: an instance of the L{Options} class
    @ivar media_type: the preferred default media type, possibly set at initialization
    @ivar base: the base value, possibly set at initialization
    @ivar http_status: HTTP Status, to be returned when the package is used via a CGI entry. Initially set to 200, may be modified by exception handlers
    """
    def __init__(self, options = None, base = "", media_type = "", rdfa_version = None):
        """
        @keyword options: Options for the distiller
        @type options: L{Options}
        @keyword base: URI for the default "base" value (usually the URI of the file to be processed)
        @keyword media_type: explicit setting of the preferred media type (a.k.a. content type) of the RDFa source
        @keyword rdfa_version: the RDFa version that should be used. If not set, the value of the global L{rdfa_current_version} variable is used
        """
        self.http_status = 200

        self.base = base
        if base == "":
            self.required_base = None
        else:
            self.required_base = base
        self.charset = None

        # predefined content type
        self.media_type = media_type

        if options is None:
            self.options = Options()
        else:
            self.options = options

        if media_type != "":
            self.options.set_host_language(self.media_type)

        # None means that no version was requested explicitly; graph_from_DOM later
        # stores the version that the processing state has settled on
        self.rdfa_version = rdfa_version

    def _get_input(self, name):
        """
        Guess whether "name" is a URI or a file name, open the source accordingly, and return
        a file-like object. If "name" is not a string, it is returned unchanged (it is then
        supposed to be a file-like object already).

        If the media type has not been set explicitly at initialization of this instance,
        the method also sets the media_type based on the HTTP GET response or the suffix of the file. See
        L{host.preferred_suffixes} for the suffix to media type mapping.

        @param name: identifier of the input source
        @type name: string or a file-like object
        @return: a file-like object if opening "name" is possible and successful, "name" otherwise
        """

        isstring = isinstance(name, str)

        try:
            if isstring:
                # check if this is a URI, i.e., if there is a valid 'scheme' part;
                # otherwise it is considered to be a simple file
                if urlparse(name)[0] != "":
                    url_request = URIOpener(name, {}, self.options.certifi_verify)
                    self.base   = url_request.location
                    if self.media_type == "":
                        if url_request.content_type in content_to_host_language:
                            self.media_type = url_request.content_type
                        else:
                            self.media_type = MediaTypes.xml
                        self.options.set_host_language(self.media_type)
                    self.charset = url_request.charset
                    if self.required_base is None:
                        self.required_base = name
                    return url_request.data
                else:
                    # Creating a File URI for this thing
                    if self.required_base is None:
                        self.required_base = "file://" + os.path.join(os.getcwd(), name)
                    if self.media_type == "":
                        self.media_type = MediaTypes.xml
                        # see if the default should be overwritten
                        for suffix in preferred_suffixes:
                            if name.endswith(suffix):
                                self.media_type = preferred_suffixes[suffix]
                                self.charset = 'utf-8'
                                break
                        self.options.set_host_language(self.media_type)
                    return open(name)
            else:
                return name
        except (HTTPError, RDFaError):
            # these are handled by the caller; re-raise them unchanged
            raise
        except Exception as ex:
            # anything else means that the source could not be opened
            raise FailedSource(ex)

    @staticmethod
    def _validate_output_format(outputFormat):
        """
        Restrict the output format to a fixed list of known values. Malicious callers could
        otherwise try to create XSS style issues by passing an illegal output format.

        @param outputFormat: the requested serialization format
        @return: C{outputFormat} if it is one of the accepted values, "turtle" otherwise
        """
        # protection against possible malicious URL call
        if outputFormat not in ["turtle", "n3", "xml", "pretty-xml", "nt", "json-ld"]:
            outputFormat = "turtle"
        return outputFormat

    ####################################################################################################################
    # Externally used methods
    #
    def graph_from_DOM(self, dom, graph = None, pgraph = None):
        """
        Extract the RDF Graph from a DOM tree. This is where the real processing happens. All other methods get down to this
        one, eventually (e.g., after opening a URI and parsing it into a DOM).
        @param dom: a DOM Node element, the top level entry node for the whole tree (i.e., the C{dom.documentElement} is used to initiate processing down the node hierarchy)
        @keyword graph: an RDF Graph (if None, then a new one is created)
        @type graph: rdflib Graph instance.
        @keyword pgraph: an RDF Graph to hold (possibly) the processor graph content. If None, and the error/warning triples are to be generated, they will be added to the returned graph. Otherwise they are stored in this graph.
        @type pgraph: rdflib Graph instance
        @return: an RDF Graph
        @rtype: rdflib Graph instance
        """
        def copyGraph(tog, fromg):
            for t in fromg:
                tog.add(t)
            for k, ns in fromg.namespaces():
                tog.bind(k, ns)

        if graph is None:
            # Create the RDF Graph that will contain the return triples...
            graph = Graph()

        # this will collect the content, the 'default graph', as called in the RDFa spec
        default_graph = Graph()

        # get the DOM tree
        topElement = dom.documentElement

        # Create the initial state. This takes care of things
        # like base, top level namespace settings, etc.
        state = ExecutionContext(topElement, default_graph, base=self.required_base if self.required_base is not None else "", options=self.options, rdfa_version=self.rdfa_version)

        # Perform the external and built-in transformations on the DOM tree.
        for trans in self.options.transformers + builtInTransformers:
            trans(topElement, self.options, state)

        # This may have changed if the state setting detected an explicit version information:
        self.rdfa_version = state.rdfa_version

        # The top level subject starts with the current document; this
        # is used by the recursion.
        # This function is the real workhorse.
        parse_one_node(topElement, default_graph, None, state, [])

        # Massage the output graph in terms of rdfa:Pattern and rdfa:copy
        handle_prototypes(default_graph)

        # If the RDFS expansion has to be made, here is the place...
        if self.options.vocab_expansion:
            from .rdfs.process import process_rdfa_sem
            process_rdfa_sem(default_graph, self.options)

        # Experimental feature: nothing for now, this is kept as a placeholder
        if self.options.experimental_features:
            pass

        # What should be returned depends on the way the options have been set up
        if self.options.output_default_graph:
            copyGraph(graph, default_graph)
            if self.options.output_processor_graph:
                if pgraph is not None:
                    copyGraph(pgraph, self.options.processor_graph.graph)
                else:
                    copyGraph(graph, self.options.processor_graph.graph)
        elif self.options.output_processor_graph:
            if pgraph is not None:
                copyGraph(pgraph, self.options.processor_graph.graph)
            else:
                copyGraph(graph, self.options.processor_graph.graph)

        # this is necessary if several DOM trees are handled in a row...
        self.options.reset_processor_graph()

        return graph

    def graph_from_source(self, name, graph = None, rdfOutput = False, pgraph = None):
        """
        Extract an RDF graph from an RDFa source. The source is parsed, the RDF extracted, and the resulting RDF graph is
        returned. This is a front-end to the L{pyRdfa.graph_from_DOM} method.

        @param name: a URI, a file name, or a file-like object
        @param graph: rdflib Graph instance. If None, a new one is created.
        @param pgraph: rdflib Graph instance for the processor graph. If None, and the error/warning triples are to be generated, they will be added to the returned graph. Otherwise they are stored in this graph.
        @param rdfOutput: whether runtime exceptions should be turned into RDF and returned as part of the processor graph
        @return: an RDF Graph
        @rtype: rdflib Graph instance
        """
        def copyErrors(tog, options):
            if tog is None:
                tog = Graph()
            if options.output_processor_graph:
                for t in options.processor_graph.graph:
                    tog.add(t)
                    if pgraph is not None: pgraph.add(t)
                for k, ns in options.processor_graph.graph.namespaces():
                    tog.bind(k, ns)
                    if pgraph is not None: pgraph.bind(k, ns)
            options.reset_processor_graph()
            return tog

        isstring = isinstance(name, str)

        try:
            # First, open the source... Possible HTTP errors are returned as error triples
            stream = None
            try:
                stream = self._get_input(name)
            except FailedSource as ex:
                self.http_status = 400
                if not rdfOutput : raise Exception(ex.msg)
                err = self.options.add_error(ex.msg, FileReferenceError, name)
                self.options.processor_graph.add_http_context(err, 400)
                return copyErrors(graph, self.options)
            except HTTPError as ex:
                self.http_status = ex.http_code
                if not rdfOutput : raise Exception(ex.msg)
                err = self.options.add_error("HTTP Error: %s (%s)" % (ex.http_code, ex.msg), HTError, name)
                self.options.processor_graph.add_http_context(err, ex.http_code)
                return copyErrors(graph, self.options)
            except RDFaError as ex:
                self.http_status = 500
                # Something nasty happened:-(
                if not rdfOutput : raise Exception(ex.msg)
                err = self.options.add_error(str(ex.msg), context = name)
                self.options.processor_graph.add_http_context(err, 500)
                return copyErrors(graph, self.options)
            except Exception as ex:
                self.http_status = 500
                # Something nasty happened:-(
                if not rdfOutput : raise
                err = self.options.add_error(str(ex), context = name)
                self.options.processor_graph.add_http_context(err, 500)
                return copyErrors(graph, self.options)

            dom = None
            try:
                msg = ""
                parser = None
                if self.options.host_language == HostLanguage.html5:
                    import warnings
                    warnings.filterwarnings("ignore", category=DeprecationWarning)
                    from html5lib import HTMLParser, treebuilders
                    parser = HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
                    if self.charset:
                        # This means the HTTP header has provided a charset, or the
                        # file is a local file that is supposed to be utf-8
                        #
                        # 2020-01-20, Ivan Herman
                        #   for some reason the python3 version ran into a problem with this html5lib call:
                        #   the override_encoding argument was not accepted.
                        # dom = parser.parse(stream, override_encoding=self.charset)
                        dom = parser.parse(stream)
                    else:
                        # No charset set. The html5lib parser tries to sniff into
                        # the file to find a meta header for the charset; if that
                        # works, fine, otherwise it falls back on window-...
                        dom = parser.parse(stream)

                    try:
                        if isstring:
                            stream.close()
                            stream = self._get_input(name)
                        else:
                            stream.seek(0)
                        from .host import adjust_html_version
                        self.rdfa_version = adjust_html_version(stream, self.rdfa_version)
                    except:
                        # if anything goes wrong, it is not really important; rdfa version stays what it was...
                        pass

                else:
                    from .host import adjust_xhtml_and_version
                    if isinstance(stream, IOBase):
                        parse = xml.dom.minidom.parse
                    else:
                        parse = xml.dom.minidom.parseString
                    dom = parse(stream)
                    adjusted_host_language, version = adjust_xhtml_and_version(dom, self.options.host_language, self.rdfa_version)
                    self.options.host_language = adjusted_host_language
                    self.rdfa_version = version
            except ImportError:
                msg = "HTML5 parser not available. Try installing html5lib <http://code.google.com/p/html5lib>"
                raise ImportError(msg)
            except Exception:
                e = sys.exc_info()[1]
                # These are various parsing exceptions. Per spec, this is a case when
                # error triples MUST be returned, i.e., the usage of rdfOutput (which switches between an HTML formatted
                # return page or a graph with error triples) does not apply
                err = self.options.add_error(str(e), context = name)
                self.http_status = 400
                self.options.processor_graph.add_http_context(err, 400)
                return copyErrors(graph, self.options)

            # If we got here, we have a DOM tree to operate on...
            return self.graph_from_DOM(dom, graph, pgraph)
        except Exception:
            # Something nasty happened during the generation of the graph...
            (a, b, c) = sys.exc_info()
            sys.excepthook(a, b, c)
            if isinstance(b, ImportError):
                self.http_status = None
            else:
                self.http_status = 500
            if not rdfOutput : raise b
            err = self.options.add_error(str(b), context = name)
            self.options.processor_graph.add_http_context(err, 500)
            return copyErrors(graph, self.options)

    def rdf_from_sources(self, names, outputFormat = "turtle", rdfOutput = False):
        """
        Extract an RDF graph from a list of RDFa sources and serialize them in one graph. The sources are parsed, the RDF
        extracted, and serialization is done in the specified format.
        @param names: list of sources, each can be a URI, a file name, or a file-like object
        @keyword outputFormat: serialization format. Can be one of "turtle", "n3", "xml", "pretty-xml", "nt", or "json-ld"; any other value is replaced by "turtle". "turtle" and "n3", and "xml" and "pretty-xml", are synonyms, respectively. Note that the JSON-LD serialization works with RDFLib 3.* only.
        @keyword rdfOutput: controls what happens in case an exception is raised. If the value is False, the caller is responsible for handling it; otherwise a graph is returned with an error message included in the processor graph
        @type rdfOutput: boolean
        @return: a serialized RDF Graph
        @rtype: string
        """
        # protection against possible malicious URL call
        outputFormat = pyRdfa._validate_output_format(outputFormat)

        # This is better because it gives access to the various, non-standard serializations
        # If it does not work because the extras are not installed, fall back to the standard
        # rdflib distribution...
        graph = Graph()

        # graph.bind("xsd", Namespace('http://www.w3.org/2001/XMLSchema#'))
        # the value of rdfOutput determines the reaction on exceptions...
        for name in names:
            self.graph_from_source(name, graph, rdfOutput)

        # Depending on the RDFLib release, serialize() returns either bytes or a string;
        # bytes are decoded so that a string is returned in both cases
        serialized = graph.serialize(format=outputFormat)
        if isinstance(serialized, bytes):
            serialized = serialized.decode('utf-8')
        return serialized


    def rdf_from_source(self, name, outputFormat = "turtle", rdfOutput = False):
        """
        Extract an RDF graph from an RDFa source and serialize it in one graph. The source is parsed, the RDF
        extracted, and serialization is done in the specified format.
        @param name: a URI, a file name, or a file-like object
        @keyword outputFormat: serialization format. Can be one of "turtle", "n3", "xml", "pretty-xml", "nt", or "json-ld"; any other value is replaced by "turtle". "turtle" and "n3", and "xml" and "pretty-xml", are synonyms, respectively. Note that the JSON-LD serialization works with RDFLib 3.* only.
        @keyword rdfOutput: controls what happens in case an exception is raised. If the value is False, the caller is responsible for handling it; otherwise a graph is returned with an error message included in the processor graph
        @type rdfOutput: boolean
        @return: a serialized RDF Graph
        @rtype: string
        """
        return self.rdf_from_sources([name], outputFormat, rdfOutput)
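
# Illustrative sketch only, not part of the original module: the suffix-based media type
# guess that pyRdfa._get_input applies to local files, reduced to a standalone function.
# The function name and the suffix table are assumptions made for this example; the
# package itself uses host.preferred_suffixes as the table and MediaTypes.xml as the default.
def _sketch_media_type_from_suffix(name, suffixes=None, default="application/xml"):
    """Return the media type of the first suffix matching C{name}, or C{default}."""
    if suffixes is None:
        suffixes = {
            ".html":  "text/html",
            ".xhtml": "application/xhtml+xml",
            ".svg":   "application/svg+xml",
        }
    for suffix, media_type in suffixes.items():
        if name.endswith(suffix):
            return media_type
    # no suffix matched: keep the generic XML default, as _get_input does
    return default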

################################################# CGI Entry point
def processURI(uri, outputFormat, form={}):
    """Standard processing of an RDFa URI, with the options provided in a form; used as an entry point from a CGI call.

    The call accepts extra form options (i.e., HTTP GET options) as follows:

     - C{graph=[output|processor|output,processor|processor,output]} specifying which graphs are returned. Default: C{output}
     - C{space_preserve=[true|false]} : whether white space in plain literals is preserved rather than normalized. Default: C{true}
     - C{rdfa_version} provides the RDFa version that should be used for distilling. The string should be of the form "1.0" or "1.1". Default is the highest version the current package implements, currently "1.1"
     - C{host_language=[xhtml,html,svg,atom,xml]} : the host language. Used when files are uploaded or text is added verbatim, otherwise the HTTP return header should be used. Default C{xml}
     - C{embedded_rdf=[true|false]} : whether embedded turtle or RDF/XML content should be added to the output graph. Default: C{false}
     - C{vocab_expansion=[true|false]} : whether the vocabularies should be expanded through the restricted RDFS entailment. Default: C{false}
     - C{vocab_cache=[true|false]} : whether vocab caching should be performed or whether it should be ignored and vocabulary files should be picked up every time. Default: C{true}
     - C{vocab_cache_report=[true|false]} : whether vocab caching details should be reported. Default: C{false}
     - C{vocab_cache_refresh=[true|false]} : whether vocab caches have to be regenerated every time. Default: C{false}
     - C{rdfa_lite=[true|false]} : whether warnings should be generated for non RDFa Lite attribute usage. Default: C{false}
     - C{certifi_verify=[true|false]} : whether the SSL certificate needs to be verified. Default: C{true}

    @param uri: URI to access. Note that the C{text:} and C{uploaded:} fake URI values are treated separately; the former is for textual input (in which case a StringIO is used to get the data) and the latter is for an uploaded file, where the form gives access to the file directly.
    @param outputFormat: serialization format, as defined by the package. Currently "turtle", "n3", "xml", "pretty-xml", "nt", or "json-ld". Default is "turtle", also used if any other string is given.
    @param form: extra call options (from the CGI call) to set up the local options
    @type form: cgi FieldStorage instance
    @return: serialized graph
    @rtype: string
    """
    def _get_option(param, compare_value, default):
        param_old = param.replace('_', '-')
        if param in list(form.keys()):
            val = form.getfirst(param).lower()
            return val == compare_value
        elif param_old in list(form.keys()):
            # this is to ensure the old style parameters are still valid...
            # in the old days I used '-' in the parameters, the standard favours '_'
            val = form.getfirst(param_old).lower()
            return val == compare_value
        else:
            return default

    if uri == "uploaded:":
        stream = form["uploaded"].file
        base = ""
    elif uri == "text:":
        stream = StringIO(form.getfirst("text"))
        base = ""
    else:
        stream = uri
        base = uri

    if "rdfa_version" in list(form.keys()):
        rdfa_version = form.getfirst("rdfa_version")
    else:
        rdfa_version = None

    # working through the possible options
    # Host language: HTML, XHTML, SVG, Atom, or XML
    # Note that these options should be used for the upload and inline version only in case of a form
    # for real uris the returned content type should be used
    if "host_language" in list(form.keys()):
        if form.getfirst("host_language").lower() == "xhtml":
            media_type = MediaTypes.xhtml
        elif form.getfirst("host_language").lower() == "html":
            media_type = MediaTypes.html
        elif form.getfirst("host_language").lower() == "svg":
            media_type = MediaTypes.svg
        elif form.getfirst("host_language").lower() == "atom":
            media_type = MediaTypes.atom
        else:
            media_type = MediaTypes.xml
    else:
        media_type = ""

    transformers = []

    check_lite = "rdfa_lite" in list(form.keys()) and form.getfirst("rdfa_lite").lower() == "true"

    # The code below is left for backward compatibility only. In fact, these options are not exposed any more,
    # they are not really in use
    from .transform.metaname import meta_transform
    from .transform.OpenID import OpenID_transform
    from .transform.DublinCore import DC_transform

    if "extras" in list(form.keys()) and form.getfirst("extras").lower() == "true":
        for t in [OpenID_transform, DC_transform, meta_transform]:
            transformers.append(t)
    else:
        if "extra-meta" in list(form.keys()) and form.getfirst("extra-meta").lower() == "true":
            transformers.append(meta_transform)
        if "extra-openid" in list(form.keys()) and form.getfirst("extra-openid").lower() == "true":
            transformers.append(OpenID_transform)
        if "extra-dc" in list(form.keys()) and form.getfirst("extra-dc").lower() == "true":
            transformers.append(DC_transform)

    output_default_graph = True
    output_processor_graph = False
    # Note that I use the 'graph' and the 'rdfagraph' form keys here. Reason is that
    # I used 'graph' in the previous versions, including the RDFa 1.0 processor,
    # so if I removed that altogether that would create backward incompatibilities
    # On the other hand, the RDFa 1.1 doc clearly refers to 'rdfagraph' as the standard
    # key.
    a = None
    if "graph" in list(form.keys()):
        a = form.getfirst("graph").lower()
    elif "rdfagraph" in list(form.keys()):
        a = form.getfirst("rdfagraph").lower()
    if a is not None:
        if a == "processor":
            output_default_graph = False
            output_processor_graph = True
        elif a == "processor,output" or a == "output,processor":
            output_processor_graph = True

    embedded_rdf =        _get_option( "embedded_rdf", "true", False)
    space_preserve =      _get_option( "space_preserve", "true", True)
    vocab_cache =         _get_option( "vocab_cache", "true", True)
    vocab_cache_report =  _get_option( "vocab_cache_report", "true", False)
    refresh_vocab_cache = _get_option( "vocab_cache_refresh", "true", False)
    vocab_expansion =     _get_option( "vocab_expansion", "true", False)
    certifi_verify =      _get_option( "certifi_verify", "true", True)
    if vocab_cache_report:
        output_processor_graph = True

    options = Options(output_default_graph   = output_default_graph,
                      output_processor_graph = output_processor_graph,
                      space_preserve         = space_preserve,
                      transformers           = transformers,
                      vocab_cache            = vocab_cache,
                      vocab_cache_report     = vocab_cache_report,
                      refresh_vocab_cache    = refresh_vocab_cache,
                      vocab_expansion        = vocab_expansion,
                      embedded_rdf           = embedded_rdf,
                      check_lite             = check_lite,
                      certifi_verify         = certifi_verify)

841    processor = pyRdfa(options = options, base = base, media_type = media_type, rdfa_version = rdfa_version)
842
    # Decide the output format; the issue is what should happen in case of a top level error like an inaccessibility of
    # the html source: should a graph be returned or an HTML page with an error message?

    # decide whether HTML or RDF should be sent.
    htmlOutput = False
    #if 'HTTP_ACCEPT' in os.environ:
    #    acc = os.environ['HTTP_ACCEPT']
    #    possibilities = ['text/html',
    #                     'application/rdf+xml',
    #                     'text/turtle; charset=utf-8',
    #                     'application/json',
    #                     'application/ld+json',
    #                     'text/rdf+n3']
    #
    #    # this nice module does content negotiation and returns the preferred format
    #    sg = acceptable_content_type(acc, possibilities)
    #    htmlOutput = (sg != None and sg[0] == content_type('text/html'))
    #    os.environ['rdfaerror'] = 'true'

    # The 'forceRDFOutput' form field used below is really for testing purposes only; it is an
    # unpublished flag to force RDF output no matter what
    import html
    try:
        # any value outside the list of known formats is turned into "turtle" here
        outputFormat = pyRdfa._validate_output_format(outputFormat)
        if outputFormat == "n3":
            retval = 'Content-Type: text/rdf+n3; charset=utf-8\n'
        elif outputFormat == "nt" or outputFormat == "turtle":
            retval = 'Content-Type: text/turtle; charset=utf-8\n'
        elif outputFormat == "json-ld" or outputFormat == "json":
            retval = 'Content-Type: application/ld+json; charset=utf-8\n'
        else:
            retval = 'Content-Type: application/rdf+xml; charset=utf-8\n'
        graph = processor.rdf_from_source(stream, outputFormat, rdfOutput = ("forceRDFOutput" in form) or not htmlOutput)
        retval += '\n'
        retval += graph
        return retval
    except HTTPError as h:
        retval = 'Content-type: text/html; charset=utf-8\nStatus: %s \n\n' % h.http_code
        retval += "<html>\n"
        retval += "<head>\n"
        retval += "<title>HTTP Error in distilling RDFa content</title>\n"
        retval += "</head><body>\n"
        retval += "<h1>HTTP Error in distilling RDFa content</h1>\n"
        # the message may echo parts of the request, so it is escaped like the URI
        retval += "<p>HTTP Error: %s (%s)</p>\n" % (h.http_code, html.escape(str(h.msg)))
        retval += "<p>On URI: <code>'%s'</code></p>\n" % html.escape(uri)
        retval +="</body>\n"
        retval +="</html>\n"
        return retval
    except Exception as value:
        # This branch should occur only if an exception is really raised, i.e., if it is not turned
        # into a graph value.
        import traceback

        retval = 'Content-type: text/html; charset=utf-8\nStatus: %s\n\n' % processor.http_status
        retval += "<html>\n"
        retval += "<head>\n"
        retval += "<title>Exception in RDFa processing</title>\n"
        retval += "</head><body>\n"
        retval += "<h1>Exception in distilling RDFa</h1>\n"
        retval += "<pre>\n"
        # the traceback and the exception text may contain user input, hence the escaping
        retval += html.escape(traceback.format_exc())
        retval +="</pre>\n"
        retval +="<pre>%s</pre>\n" % html.escape(str(value))
        retval +="<h1>Distiller request details</h1>\n"
        retval +="<dl>\n"
        if uri == "text:" and "text" in form and form["text"].value is not None and len(form["text"].value.strip()) != 0:
            retval +="<dt>Text input:</dt><dd>%s</dd>\n" % html.escape(form["text"].value).replace('\n','<br/>')
        elif uri == "uploaded:":
            retval +="<dt>Uploaded file</dt>\n"
        else:
            retval +="<dt>URI received:</dt><dd><code>'%s'</code></dd>\n" % html.escape(uri)
        if "host_language" in form:
            retval +="<dt>Media Type:</dt><dd>%s</dd>\n" % html.escape(media_type)
        if "graph" in form:
            retval +="<dt>Requested graphs:</dt><dd>%s</dd>\n" % html.escape(form.getfirst("graph").lower())
        else:
            retval +="<dt>Requested graphs:</dt><dd>default</dd>\n"
        retval +="<dt>Output serialization format:</dt><dd> %s</dd>\n" % outputFormat
        if "space_preserve" in form:
            retval +="<dt>Space preserve:</dt><dd> %s</dd>\n" % html.escape(form["space_preserve"].value)
        retval +="</dl>\n"
        retval +="</body>\n"
        retval +="</html>\n"
        return retval
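The way the handler above picks a response header can be summarised in a small, self-contained sketch. The names `ALLOWED_FORMATS`, `CONTENT_TYPES`, `validate_output_format` and `content_type_for` are hypothetical and not part of pyRdfa; only the list of formats and the header values are taken from the listing.

```python
# Illustrative sketch only: none of these names exist in pyRdfa.
ALLOWED_FORMATS = ("turtle", "n3", "xml", "pretty-xml", "nt", "json-ld")

# Formats with a dedicated media type; everything else is sent as RDF/XML.
CONTENT_TYPES = {
    "n3": "text/rdf+n3",
    "nt": "text/turtle",
    "turtle": "text/turtle",
    "json-ld": "application/ld+json",
}


def validate_output_format(output_format):
    """Fall back to Turtle for any value outside the list of known formats."""
    return output_format if output_format in ALLOWED_FORMATS else "turtle"


def content_type_for(output_format):
    """Return the CGI Content-Type header line for the requested format."""
    fmt = validate_output_format(output_format)
    media_type = CONTENT_TYPES.get(fmt, "application/rdf+xml")
    return "Content-Type: %s; charset=utf-8\n" % media_type


print(content_type_for("json-ld"), end="")
print(content_type_for("<script>"), end="")  # unknown value, served as Turtle
```

Validating before building the header means that a request parameter never reaches the response unchecked, which is the concern the `_validate_output_format` docstring raises.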
name = 'pyRdfa3'
ns_rdfa = Namespace('http://www.w3.org/ns/rdfa#')
RDFA_VOCAB = rdflib.term.URIRef('http://www.w3.org/ns/rdfa#usesVocabulary')
ns_xsd = Namespace('http://www.w3.org/2001/XMLSchema#')
ns_distill = Namespace('http://www.w3.org/2007/08/pyRdfa/vocab#')
debug = False
class RDFaError(builtins.Exception):
class RDFaError(Exception):
    """Superclass of the exceptions representing error conditions defined by the RDFa 1.1 specification.
    It does not add any new functionality to the
    Exception class."""
    def __init__(self, msg):
        self.msg = msg
        # pass the message on, so that str() of the exception is not empty
        Exception.__init__(self, msg)

Superclass of the exceptions representing error conditions defined by the RDFa 1.1 specification. It does not add any new functionality to the Exception class.

RDFaError(msg)
msg
Inherited Members
builtins.BaseException
with_traceback
args
class FailedSource(RDFaError):
class FailedSource(RDFaError):
    """Raised when the original source cannot be accessed. It does not add any new functionality to the
    Exception class."""
    def __init__(self, msg, http_code = None):
        self.msg = msg
        self.http_code = http_code
        RDFaError.__init__(self, msg)

Raised when the original source cannot be accessed. It does not add any new functionality to the Exception class.

FailedSource(msg, http_code=None)
msg
http_code
Inherited Members
builtins.BaseException
with_traceback
args
class HTTPError(RDFaError):
class HTTPError(RDFaError):
    """Raised when HTTP problems are detected. It does not add any new functionality to the
    Exception class."""
    def __init__(self, http_msg, http_code):
        self.msg = http_msg
        self.http_code = http_code
        RDFaError.__init__(self, http_msg)

Raised when HTTP problems are detected. It does not add any new functionality to the Exception class.

HTTPError(http_msg, http_code)
msg
http_code
Inherited Members
builtins.BaseException
with_traceback
args
class ProcessingError(RDFaError):
class ProcessingError(RDFaError):
    """Error found during processing. It does not add any new functionality to the
    Exception class."""
    pass

Error found during processing. It does not add any new functionality to the Exception class.

Inherited Members
RDFaError
RDFaError
msg
builtins.BaseException
with_traceback
args
class pyRdfaError(builtins.Exception):
class pyRdfaError(Exception):
    """Superclass of the exceptions representing error conditions outside the RDFa 1.1 specification."""
    pass

Superclass of the exceptions representing error conditions outside the RDFa 1.1 specification.

Inherited Members
builtins.Exception
Exception
builtins.BaseException
with_traceback
args
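Because `HTTPError`, `FailedSource` and `ProcessingError` all derive from `RDFaError`, a caller can order its `except` clauses from the most specific class to the most general one. The sketch below uses stand-in classes modelled on the listings above, so that it runs without pyRdfa installed; `describe` is a hypothetical helper and not part of the package.

```python
# Stand-in classes modelled on the listings above; in real code, import
# the exception classes from the pyRdfa package instead.
class RDFaError(Exception):
    def __init__(self, msg):
        self.msg = msg
        super().__init__(msg)


class HTTPError(RDFaError):
    def __init__(self, http_msg, http_code):
        self.http_code = http_code
        super().__init__(http_msg)


class ProcessingError(RDFaError):
    pass


def describe(exc):
    """Hypothetical helper: turn an exception into a short status line."""
    try:
        raise exc
    except HTTPError as err:      # most specific class first
        return "HTTP %s: %s" % (err.http_code, err.msg)
    except RDFaError as err:      # any other error defined by the RDFa spec
        return "RDFa error: %s" % err.msg
    except Exception as err:      # everything else
        return "other: %s" % err


print(describe(HTTPError("Not Found", 404)))
print(describe(ProcessingError("bad markup")))
print(describe(ValueError("unrelated")))
```

Placing the `RDFaError` clause before the `HTTPError` one would hide the HTTP status code, since the first matching clause wins.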
RDFA_Error = rdflib.term.URIRef('http://www.w3.org/ns/rdfa#Error')
RDFA_Warning = rdflib.term.URIRef('http://www.w3.org/ns/rdfa#Warning')
RDFA_Info = rdflib.term.URIRef('http://www.w3.org/ns/rdfa#Information')
NonConformantMarkup = rdflib.term.URIRef('http://www.w3.org/ns/rdfa#DocumentError')
UnresolvablePrefix = rdflib.term.URIRef('http://www.w3.org/ns/rdfa#UnresolvedCURIE')
UnresolvableReference = rdflib.term.URIRef('http://www.w3.org/ns/rdfa#UnresolvedCURIE')
UnresolvableTerm = rdflib.term.URIRef('http://www.w3.org/ns/rdfa#UnresolvedTerm')
VocabReferenceError = rdflib.term.URIRef('http://www.w3.org/ns/rdfa#VocabReferenceError')
PrefixRedefinitionWarning = rdflib.term.URIRef('http://www.w3.org/ns/rdfa#PrefixRedefinition')
FileReferenceError = rdflib.term.URIRef('http://www.w3.org/2007/08/pyRdfa/vocab#FileReferenceError')
HTError = rdflib.term.URIRef('http://www.w3.org/2007/08/pyRdfa/vocab#HTTPError')
IncorrectPrefixDefinition = rdflib.term.URIRef('http://www.w3.org/2007/08/pyRdfa/vocab#IncorrectPrefixDefinition')
IncorrectBlankNodeUsage = rdflib.term.URIRef('http://www.w3.org/2007/08/pyRdfa/vocab#IncorrectBlankNodeUsage')
IncorrectLiteral = rdflib.term.URIRef('http://www.w3.org/2007/08/pyRdfa/vocab#IncorrectLiteral')
err_no_blank_node = 'Blank node in %s position is not allowed; ignored'
err_redefining_URI_as_prefix = "'%s' is a registered or otherwise used URI scheme, but is defined as a prefix here; is this a mistake? (see, e.g., http://en.wikipedia.org/wiki/URI_scheme or http://www.iana.org/assignments/uri-schemes.html for further information on most of the URI schemes)"
err_xmlns_deprecated = "The usage of 'xmlns' for prefix definition is deprecated; please use the 'prefix' attribute instead (definition for '%s')"
err_bnode_local_prefix = "The '_' local CURIE prefix is reserved for blank nodes, and cannot be defined as a prefix"
err_col_local_prefix = "The character ':' is not valid in a CURIE Prefix, and cannot be used in a prefix definition (definition for '%s')"
err_missing_URI_prefix = "Missing URI in prefix declaration for '%s' (in '%s')"
err_invalid_prefix = "Invalid prefix declaration '%s' (in '%s')"
err_no_default_prefix = "Default prefix cannot be changed (in '%s')"
err_prefix_and_xmlns = "@prefix setting for '%s' overrides the 'xmlns:%s' setting; this may be a source of problems if the same file is run through RDFa 1.0"
err_non_ncname_prefix = "Non NCNAME '%s' in prefix definition (in '%s'); ignored"
err_absolute_reference = "CURIE Reference part contains an authority part: %s (in '%s'); ignored"
err_query_reference = "CURIE Reference query part contains an unauthorized character: %s (in '%s'); ignored"
err_fragment_reference = "CURIE Reference fragment part contains an unauthorized character: %s (in '%s'); ignored"
err_lang = 'There is a problem with language setting; either both xml:lang and lang used on an element with different values, or, for (X)HTML5, only xml:lang is used.'
err_URI_scheme = 'Unusual URI scheme used in <%s>; could that be a mistake, e.g., resulting from using an undefined CURIE prefix or an incorrect CURIE?'
err_illegal_safe_CURIE = 'Illegal safe CURIE: %s; ignored'
err_no_CURIE_in_safe_CURIE = 'Safe CURIE is used, but the value does not correspond to a defined CURIE: [%s]; ignored'
err_undefined_terms = "'%s' is used as a term, but has not been defined as such; ignored"
err_undefined_CURIE = "Undefined CURIE: '%s'; ignored"
err_prefix_redefinition = "Prefix '%s' (defined in the initial RDFa context or in an ancestor) is redefined"
err_unusual_char_in_URI = 'Unusual character in uri: %s; possible error?'
CACHE_DIR_VAR = 'PyRdfaCacheDir'
rdfa_current_version = '1.1'
registered_iana_schemes = ['aaa', 'aaas', 'acap', 'cap', 'cid', 'crid', 'data', 'dav', 'dict', 'did', 'dns', 'fax', 'file', 'ftp', 'geo', 'go', 'gopher', 'h323', 'http', 'https', 'iax', 'icap', 'im', 'imap', 'info', 'ipp', 'iris', 'ldap', 'lsid', 'mailto', 'mid', 'modem', 'msrp', 'msrps', 'mtqp', 'mupdate', 'news', 'nfs', 'nntp', 'opaquelocktoken', 'pop', 'pres', 'prospero', 'rstp', 'rsync', 'service', 'shttp', 'sieve', 'sip', 'sips', 'sms', 'snmp', 'soap', 'tag', 'tel', 'telnet', 'tftp', 'thismessage', 'tn3270', 'tip', 'tv', 'urn', 'vemmi', 'wais', 'ws', 'wss', 'xmpp']
unofficial_common = ['about', 'adiumxtra', 'aim', 'apt', 'afp', 'aw', 'bitcoin', 'bolo', 'callto', 'chrome', 'coap', 'content', 'cvs', 'doi', 'ed2k', 'facetime', 'feed', 'finger', 'fish', 'git', 'gg', 'gizmoproject', 'gtalk', 'irc', 'ircs', 'irc6', 'itms', 'jar', 'javascript', 'keyparc', 'lastfm', 'ldaps', 'magnet', 'maps', 'market', 'message', 'mms', 'msnim', 'mumble', 'mvn', 'notes', 'palm', 'paparazzi', 'psync', 'rmi', 'secondlife', 'sgn', 'skype', 'spotify', 'ssh', 'sftp', 'smb', 'soldat', 'steam', 'svn', 'teamspeak', 'things', 'udb', 'unreal', 'ut2004', 'ventrillo', 'view-source', 'webcal', 'wtai', 'wyciwyg', 'xfire', 'xri', 'ymsgr']
historical_iana_schemes = ['fax', 'mailserver', 'modem', 'pack', 'prospero', 'snews', 'videotex', 'wais']
provisional_iana_schemes = ['afs', 'dtn', 'dvb', 'icon', 'ipn', 'jms', 'oid', 'rsync', 'ni']
other_used_schemes = ['hdl', 'isbn', 'issn', 'mstp', 'rtmp', 'rtspu', 'stp']
uri_schemes = registered_iana_schemes + unofficial_common + historical_iana_schemes + provisional_iana_schemes + other_used_schemes
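The wording of `err_redefining_URI_as_prefix` suggests that a list of known URI schemes serves to warn authors who declare a prefix that is also a URI scheme. A minimal sketch of such a check follows, under stated assumptions: `check_prefix`, `WARNING_TEMPLATE` and the shortened scheme list are hypothetical, and how the package itself consults `uri_schemes` is not shown in this section.

```python
# Illustrative sketch; not the package's own code.
# A short excerpt stands in for the full `uri_schemes` list.
uri_schemes_excerpt = ["http", "https", "ftp", "mailto", "urn", "doi"]

# Hypothetical message template, loosely modelled on err_redefining_URI_as_prefix.
WARNING_TEMPLATE = "'%s' is a known URI scheme, but is defined as a prefix here"


def check_prefix(prefix, schemes=uri_schemes_excerpt):
    """Return a warning if `prefix` is also a URI scheme, None otherwise."""
    if prefix in schemes:
        return WARNING_TEMPLATE % prefix
    return None


print(check_prefix("doi"))   # a prefix that collides with a scheme
print(check_prefix("foaf"))  # an ordinary prefix, no warning
```

Such a warning is useful because a value like `doi:10.1000/182` is ambiguous once `doi` is declared as a prefix: it could be read as a CURIE or as an absolute URI.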
builtInTransformers = [empty_safe_curie, top_about, vocab_for_role]
class pyRdfa:
class pyRdfa:
    """Main processing class for the distiller

    @ivar options: an instance of the L{Options} class
    @ivar media_type: the preferred default media type, possibly set at initialization
    @ivar base: the base value, possibly set at initialization
    @ivar http_status: HTTP Status, to be returned when the package is used via a CGI entry. Initially set to 200, may be modified by exception handlers
    """
    def __init__(self, options = None, base = "", media_type = "", rdfa_version = None):
        """
        @keyword options: Options for the distiller
        @type options: L{Options}
        @keyword base: URI for the default "base" value (usually the URI of the file to be processed)
        @keyword media_type: explicit setting of the preferred media type (a.k.a. content type) of the RDFa source
        @keyword rdfa_version: the RDFa version that should be used. If not set, the value of the global L{rdfa_current_version} variable is used
        """
        self.http_status = 200

        self.base = base
        if base == "":
            self.required_base = None
        else:
            self.required_base = base
        self.charset = None

        # predefined content type
        self.media_type = media_type

        if options is None:
            self.options = Options()
        else:
            self.options = options

        if media_type != "":
            self.options.set_host_language(self.media_type)

        # May be None at this point; graph_from_DOM replaces it with the version held by the execution context
        self.rdfa_version = rdfa_version

    def _get_input(self, name):
        """
        Try to guess whether "name" is a URI or a string (for a file), then try to open this source accordingly,
        returning a file-like object. If name is none of these, the input argument is returned (it should,
        supposedly, be a file-like object already).

        If the media type has not been set explicitly at initialization of this instance,
        the method also sets the media_type based on the HTTP GET response or the suffix of the file. See
        L{host.preferred_suffixes} for the suffix to media type mapping.

        @param name: identifier of the input source
        @type name: string or a file-like object
        @return: a file-like object if opening "name" is possible and successful, "name" otherwise
        """

        isstring = isinstance(name, str)

        try:
            if isstring:
                # check if this is a URI, i.e., if there is a valid 'scheme' part
                # otherwise it is considered to be a simple file
                if urlparse(name)[0] != "":
                    url_request = URIOpener(name, {}, self.options.certifi_verify)
                    self.base = url_request.location
                    if self.media_type == "":
                        if url_request.content_type in content_to_host_language:
                            self.media_type = url_request.content_type
                        else:
                            self.media_type = MediaTypes.xml
                        self.options.set_host_language(self.media_type)
                    self.charset = url_request.charset
                    if self.required_base is None:
                        self.required_base = name
                    return url_request.data
                else:
                    # Creating a File URI for this thing
                    if self.required_base is None:
                        self.required_base = "file://" + os.path.join(os.getcwd(),name)
                    if self.media_type == "":
                        self.media_type = MediaTypes.xml
                        # see if the default should be overwritten
                        for suffix in preferred_suffixes:
                            if name.endswith(suffix):
                                self.media_type = preferred_suffixes[suffix]
                                self.charset = 'utf-8'
                                break
                        self.options.set_host_language(self.media_type)
                    return open(name)
            else:
                return name
        except HTTPError:
            # passed on unchanged
            raise
        except RDFaError:
            # passed on unchanged
            raise
        except Exception as ex:
            # anything else means that the source could not be opened
            raise FailedSource(ex)

    @staticmethod
    def _validate_output_format(outputFormat):
        """
        Malicious actors may create XSS-style issues by using an illegal output format, so any value
        outside the list of known formats is replaced by "turtle".
        """
        # protection against possible malicious URL call
        if outputFormat not in ["turtle", "n3", "xml", "pretty-xml", "nt", "json-ld"]:
            outputFormat = "turtle"
        return outputFormat

    ####################################################################################################################
    # Externally used methods
    #
    def graph_from_DOM(self, dom, graph = None, pgraph = None):
        """
        Extract the RDF Graph from a DOM tree. This is where the real processing happens. All other methods get down to this
        one, eventually (e.g., after opening a URI and parsing it into a DOM).
        @param dom: a DOM Node element, the top level entry node for the whole tree (i.e., the C{dom.documentElement} is used to initiate processing down the node hierarchy)
        @keyword graph: an RDF Graph (if None, then a new one is created)
        @type graph: rdflib Graph instance
        @keyword pgraph: an RDF Graph to hold (possibly) the processor graph content. If None, and the error/warning triples are to be generated, they will be added to the returned graph. Otherwise they are stored in this graph.
        @type pgraph: rdflib Graph instance
        @return: an RDF Graph
        @rtype: rdflib Graph instance
        """
        def copyGraph(tog, fromg):
            for t in fromg:
                tog.add(t)
            for k,ns in fromg.namespaces():
                tog.bind(k,ns)

        if graph is None:
            # Create the RDF Graph that will contain the returned triples...
            graph = Graph()

        # this will collect the content, the 'default graph', as called in the RDFa spec
        default_graph = Graph()

        # get the top element of the DOM tree
        topElement = dom.documentElement

        # Create the initial state. This takes care of things
        # like base, top level namespace settings, etc.
        state = ExecutionContext(topElement, default_graph, base=self.required_base if self.required_base is not None else "", options=self.options, rdfa_version=self.rdfa_version)

        # Perform the built-in and external transformations on the HTML tree.
        for trans in self.options.transformers + builtInTransformers:
            trans(topElement, self.options, state)

        # This may have changed if the state setting detected an explicit version information:
        self.rdfa_version = state.rdfa_version

        # The top level subject starts with the current document; this
        # is used by the recursion
        # this function is the real workhorse
        parse_one_node(topElement, default_graph, None, state, [])

        # Massage the output graph in terms of rdfa:Pattern and rdfa:copy
        handle_prototypes(default_graph)

        # If the RDFS expansion has to be made, here is the place...
        if self.options.vocab_expansion:
            from .rdfs.process import process_rdfa_sem
            process_rdfa_sem(default_graph, self.options)

        # Experimental feature: nothing for now, this is kept as a placeholder
        if self.options.experimental_features:
            pass

        # What should be returned depends on the way the options have been set up
        if self.options.output_default_graph:
            copyGraph(graph, default_graph)
        if self.options.output_processor_graph:
            # the processor graph triples go to pgraph if it is provided, to the returned graph otherwise
            # (the test must be "is not None": an empty rdflib Graph is falsy)
            copyGraph(pgraph if pgraph is not None else graph, self.options.processor_graph.graph)

        # this is necessary if several DOM trees are handled in a row...
        self.options.reset_processor_graph()

        return graph

    def graph_from_source(self, name, graph = None, rdfOutput = False, pgraph = None):
        """
        Extract an RDF graph from an RDFa source. The source is parsed, the RDF extracted, and the RDFa Graph is
        returned. This is a front-end to the L{pyRdfa.graph_from_DOM} method.

        @param name: a URI, a file name, or a file-like object
        @param graph: rdflib Graph instance. If None, a new one is created.
        @param pgraph: rdflib Graph instance for the processor graph. If None, and the error/warning triples are to be generated, they will be added to the returned graph. Otherwise they are stored in this graph.
        @param rdfOutput: whether runtime exceptions should be turned into RDF and returned as part of the processor graph
        @return: an RDF Graph
        @rtype: rdflib Graph instance
        """
        def copyErrors(tog, options):
            if tog is None:
                tog = Graph()
            if options.output_processor_graph:
                for t in options.processor_graph.graph:
                    tog.add(t)
                    if pgraph is not None:
                        pgraph.add(t)
                for k,ns in options.processor_graph.graph.namespaces():
                    tog.bind(k,ns)
                    if pgraph is not None:
                        pgraph.bind(k,ns)
            options.reset_processor_graph()
            return tog

        isstring = isinstance(name, str)

        try:
            # First, open the source... Possible HTTP errors are returned as error triples
            stream = None
            try:
                stream = self._get_input(name)
            except FailedSource as ex:
                self.http_status = 400
                if not rdfOutput:
                    raise Exception(ex.msg)
                err = self.options.add_error(ex.msg, FileReferenceError, name)
                self.options.processor_graph.add_http_context(err, 400)
                return copyErrors(graph, self.options)
            except HTTPError as ex:
                self.http_status = ex.http_code
                if not rdfOutput:
                    raise Exception(ex.msg)
                err = self.options.add_error("HTTP Error: %s (%s)" % (ex.http_code, ex.msg), HTError, name)
                self.options.processor_graph.add_http_context(err, ex.http_code)
                return copyErrors(graph, self.options)
            except RDFaError as ex:
                self.http_status = 500
                # Something nasty happened:-(
                if not rdfOutput:
                    raise Exception(ex.msg)
                err = self.options.add_error(str(ex.msg), context = name)
                self.options.processor_graph.add_http_context(err, 500)
                return copyErrors(graph, self.options)
            except Exception as ex:
                self.http_status = 500
                # Something nasty happened:-(
                if not rdfOutput:
                    raise
                err = self.options.add_error(str(ex), context = name)
                self.options.processor_graph.add_http_context(err, 500)
                return copyErrors(graph, self.options)

            dom = None
            try:
                if self.options.host_language == HostLanguage.html5:
                    import warnings
                    warnings.filterwarnings("ignore", category=DeprecationWarning)
                    from html5lib import HTMLParser, treebuilders
                    parser = HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
                    if self.charset:
                        # This means the HTTP header has provided a charset, or the
                        # file is a local file when we suppose it to be a utf-8
                        #
                        # 2020-01-20, Ivan Herman
                        #   for some reasons the python3 version ran into a problem with this html5lib call
                        #   the override_encoding argument was not accepted.
                        # dom = parser.parse(stream, override_encoding=self.charset)
                        dom = parser.parse(stream)
                    else:
                        # No charset set. The html5lib parser tries to sniff into
                        # the file to find a meta header for the charset; if that
                        # works, fine, otherwise it falls back on window-...
                        dom = parser.parse(stream)

                    try:
                        if isstring:
                            stream.close()
                            stream = self._get_input(name)
                        else:
                            stream.seek(0)
                        from .host import adjust_html_version
                        self.rdfa_version = adjust_html_version(stream, self.rdfa_version)
                    except Exception:
                        # if anything goes wrong, it is not really important; rdfa version stays what it was...
                        pass

                else:
                    from .host import adjust_xhtml_and_version
                    if isinstance(stream, IOBase):
                        parse = xml.dom.minidom.parse
                    else:
                        parse = xml.dom.minidom.parseString
                    dom = parse(stream)
                    adjusted_host_language, version = adjust_xhtml_and_version(dom, self.options.host_language, self.rdfa_version)
                    self.options.host_language = adjusted_host_language
                    self.rdfa_version = version
            except ImportError:
                msg = "HTML5 parser not available. Try installing html5lib <http://code.google.com/p/html5lib>"
                raise ImportError(msg)
            except Exception as e:
                # These are various parsing exceptions. Per spec, this is a case when
                # error triples MUST be returned, i.e., the usage of rdfOutput (which switches between an HTML formatted
                # return page or a graph with error triples) does not apply
                err = self.options.add_error(str(e), context = name)
                self.http_status = 400
                self.options.processor_graph.add_http_context(err, 400)
                return copyErrors(graph, self.options)

            # If we got here, we have a DOM tree to operate on...
            return self.graph_from_DOM(dom, graph, pgraph)
        except Exception as ex:
            # Something nasty happened during the generation of the graph...
            sys.excepthook(*sys.exc_info())
            if isinstance(ex, ImportError):
                self.http_status = None
            else:
                self.http_status = 500
            if not rdfOutput:
                raise
            err = self.options.add_error(str(ex), context = name)
            self.options.processor_graph.add_http_context(err, 500)
            return copyErrors(graph, self.options)

    def rdf_from_sources(self, names, outputFormat = "turtle", rdfOutput = False):
        """
        Extract an RDF graph from a list of RDFa sources and serialize them in one graph. The sources are parsed, the RDF
        extracted, and serialization is done in the specified format.
        @param names: list of sources, each can be a URI, a file name, or a file-like object
        @keyword outputFormat: serialization format. Can be one of "turtle", "n3", "xml", "pretty-xml", "nt", or "json-ld"; any other value is replaced by "turtle" (see L{pyRdfa._validate_output_format}). Note that the JSON-LD serialization works with RDFLib 3.* only.
        @keyword rdfOutput: controls what happens in case an exception is raised. If the value is False, the caller is responsible for handling it; otherwise a graph is returned with an error message included in the processor graph
        @type rdfOutput: boolean
        @return: a serialized RDF Graph
        @rtype: string
        """
        # protection against possible malicious URL call
        outputFormat = pyRdfa._validate_output_format(outputFormat)

        # This is better because it gives access to the various, non-standard serializations
        # If it does not work because the extras are not installed, fall back to the standard
        # rdflib distribution...
        graph = Graph()

        # graph.bind("xsd", Namespace('http://www.w3.org/2001/XMLSchema#'))
        # the value of rdfOutput determines the reaction on exceptions...
        for name in names:
            self.graph_from_source(name, graph, rdfOutput)

        # RDFLib releases before 6.0 return bytes from serialize(); later releases return a string
        serialized = graph.serialize(format=outputFormat)
        if isinstance(serialized, bytes):
            serialized = serialized.decode('utf-8')
        return serialized


    def rdf_from_source(self, name, outputFormat = "turtle", rdfOutput = False):
        """
        Extract an RDF graph from an RDFa source and serialize it in one graph. The source is parsed, the RDF
        extracted, and serialization is done in the specified format.
        @param name: a URI, a file name, or a file-like object
        @keyword outputFormat: serialization format. Can be one of "turtle", "n3", "xml", "pretty-xml", "nt", or "json-ld"; any other value is replaced by "turtle" (see L{pyRdfa._validate_output_format}). Note that the JSON-LD serialization works with RDFLib 3.* only.
        @keyword rdfOutput: controls what happens in case an exception is raised. If the value is False, the caller is responsible for handling it; otherwise a graph is returned with an error message included in the processor graph
        @type rdfOutput: boolean
        @return: a serialized RDF Graph
        @rtype: string
        """
        return self.rdf_from_sources([name], outputFormat, rdfOutput)

Main processing class for the distiller

@ivar options: an instance of the L{Options} class
@ivar media_type: the preferred default media type, possibly set at initialization
@ivar base: the base value, possibly set at initialization
@ivar http_status: HTTP Status, to be returned when the package is used via a CGI entry. Initially set to 200, may be modified by exception handlers

pyRdfa(options=None, base='', media_type='', rdfa_version=None)
    def __init__(self, options = None, base = "", media_type = "", rdfa_version = None):
        """
        @keyword options: Options for the distiller
        @type options: L{Options}
        @keyword base: URI for the default "base" value (usually the URI of the file to be processed)
        @keyword media_type: explicit setting of the preferred media type (a.k.a. content type) of the RDFa source
        @keyword rdfa_version: the RDFa version that should be used. If not set, the value of the global L{rdfa_current_version} variable is used
        """
        self.http_status = 200

        self.base = base
        # an empty base means that the caller does not impose a base value
        self.required_base = base if base != "" else None
        self.charset = None

        # predefined content type
        self.media_type = media_type

        if options is None:
            self.options = Options()
        else:
            self.options = options

        if media_type != "":
            self.options.set_host_language(self.media_type)

        # kept as given (possibly None); graph_from_DOM passes it on to the ExecutionContext
        self.rdfa_version = rdfa_version

@keyword options: Options for the distiller
@type options: L{Options}
@keyword base: URI for the default "base" value (usually the URI of the file to be processed)
@keyword media_type: explicit setting of the preferred media type (a.k.a. content type) of the RDFa source
@keyword rdfa_version: the RDFa version that should be used. If not set, the value of the global L{rdfa_current_version} variable is used

http_status
base
charset
media_type
def graph_from_DOM(self, dom, graph=None, pgraph=None):
    def graph_from_DOM(self, dom, graph = None, pgraph = None):
        """
        Extract the RDF Graph from a DOM tree. This is where the real processing happens. All other methods get down to this
        one, eventually (e.g., after opening a URI and parsing it into a DOM).
        @param dom: a DOM Node element, the top level entry node for the whole tree (i.e., the C{dom.documentElement} is used to initiate processing down the node hierarchy)
        @keyword graph: an RDF Graph (if None, then a new one is created)
        @type graph: rdflib Graph instance.
        @keyword pgraph: an RDF Graph to hold (possibly) the processor graph content. If None, and the error/warning triples are to be generated, they will be added to the returned graph. Otherwise they are stored in this graph.
        @type pgraph: rdflib Graph instance
        @return: an RDF Graph
        @rtype: rdflib Graph instance
        """
        def copyGraph(tog, fromg):
            # copy both the triples and the namespace bindings
            for t in fromg:
                tog.add(t)
            for k,ns in fromg.namespaces():
                tog.bind(k,ns)

        if graph is None:
            # Create the RDF Graph, that will contain the return triples...
            graph = Graph()

        # this will collect the content, the 'default graph', as called in the RDFa spec
        default_graph = Graph()

        # get the DOM tree
        topElement = dom.documentElement

        # Create the initial state. This takes care of things
        # like base, top level namespace settings, etc.
        state = ExecutionContext(topElement, default_graph, base=self.required_base if self.required_base is not None else "", options=self.options, rdfa_version=self.rdfa_version)

        # Perform the built-in and external transformations on the HTML tree.
        for trans in self.options.transformers + builtInTransformers:
            trans(topElement, self.options, state)

        # This may have changed if the state setting detected an explicit version information:
        self.rdfa_version = state.rdfa_version

        # The top level subject starts with the current document; this
        # is used by the recursion
        # this function is the real workhorse
        parse_one_node(topElement, default_graph, None, state, [])

        # Massage the output graph in term of rdfa:Pattern and rdfa:copy
        handle_prototypes(default_graph)

        # If the RDFS expansion has to be made, here is the place...
        if self.options.vocab_expansion:
            from .rdfs.process import process_rdfa_sem
            process_rdfa_sem(default_graph, self.options)

        # Experimental feature: nothing for now, this is kept as a placeholder
        if self.options.experimental_features:
            pass

        # What should be returned depends on the way the options have been set up
        if self.options.output_default_graph:
            copyGraph(graph, default_graph)
        if self.options.output_processor_graph:
            # the processor graph goes to pgraph if one was provided, to the returned graph otherwise
            target = pgraph if pgraph is not None else graph
            copyGraph(target, self.options.processor_graph.graph)

        # this is necessary if several DOM trees are handled in a row...
        self.options.reset_processor_graph()

        return graph

Extract the RDF Graph from a DOM tree. This is where the real processing happens. All other methods get down to this one, eventually (e.g., after opening a URI and parsing it into a DOM).

@param dom: a DOM Node element, the top level entry node for the whole tree (i.e., the C{dom.documentElement} is used to initiate processing down the node hierarchy)
@keyword graph: an RDF Graph (if None, then a new one is created)
@type graph: rdflib Graph instance.
@keyword pgraph: an RDF Graph to hold (possibly) the processor graph content. If None, and the error/warning triples are to be generated, they will be added to the returned graph. Otherwise they are stored in this graph.
@type pgraph: rdflib Graph instance
@return: an RDF Graph
@rtype: rdflib Graph instance
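
The last step of C{graph_from_DOM} decides where the extracted triples and the processor (error/warning) triples end up. The following is a minimal, library-free sketch of that routing only. It assumes plain Python sets as stand-ins for RDFLib graphs; the function name C{route_graphs} and the tuple-shaped triples are hypothetical and are not part of the package.

```python
def route_graphs(default_triples, processor_triples,
                 output_default_graph=True, output_processor_graph=False,
                 graph=None, pgraph=None):
    """Mimic the output routing of graph_from_DOM with sets instead of graphs.

    All names here are illustrative assumptions, not pyRdfa API.
    """
    if graph is None:
        graph = set()
    if output_default_graph:
        # the extracted content always goes to the returned graph
        graph |= default_triples
    if output_processor_graph:
        # errors/warnings go to pgraph when given, to the returned graph otherwise
        target = pgraph if pgraph is not None else graph
        target |= processor_triples
    return graph


content = {("doc", "dc:title", "Example")}
errors = {("err1", "rdf:type", "rdfa:Error")}

merged = route_graphs(content, errors, output_processor_graph=True)
separate_errors = set()
clean = route_graphs(content, errors, output_processor_graph=True,
                     pgraph=separate_errors)
```

In the first call the error triples are mixed into the returned set; in the second they are kept apart in C{separate_errors}, which is the situation the C{pgraph} keyword is meant for.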

def graph_from_source(self, name, graph=None, rdfOutput=False, pgraph=None):
    def graph_from_source(self, name, graph = None, rdfOutput = False, pgraph = None):
        """
        Extract an RDF graph from an RDFa source. The source is parsed, the RDF extracted, and the RDFa Graph is
        returned. This is a front-end to the L{pyRdfa.graph_from_DOM} method.

        @param name: a URI, a file name, or a file-like object
        @param graph: rdflib Graph instance. If None, a new one is created.
        @param pgraph: rdflib Graph instance for the processor graph. If None, and the error/warning triples are to be generated, they will be added to the returned graph. Otherwise they are stored in this graph.
        @param rdfOutput: whether runtime exceptions should be turned into RDF and returned as part of the processor graph
        @return: an RDF Graph
        @rtype: rdflib Graph instance
        """
        def copyErrors(tog, options):
            # copy the error triples and namespace bindings of the processor graph
            if tog is None:
                tog = Graph()
            if options.output_processor_graph:
                for t in options.processor_graph.graph:
                    tog.add(t)
                    if pgraph is not None:
                        pgraph.add(t)
                for k,ns in options.processor_graph.graph.namespaces():
                    tog.bind(k,ns)
                    if pgraph is not None:
                        pgraph.bind(k,ns)
            options.reset_processor_graph()
            return tog

        isstring = isinstance(name, str)

        try:
            # First, open the source... Possible HTTP errors are returned as error triples
            stream = None
            try:
                stream = self._get_input(name)
            except FailedSource as ex:
                self.http_status = 400
                if not rdfOutput:
                    raise Exception(ex.msg)
                err = self.options.add_error(ex.msg, FileReferenceError, name)
                self.options.processor_graph.add_http_context(err, 400)
                return copyErrors(graph, self.options)
            except HTTPError as ex:
                self.http_status = ex.http_code
                if not rdfOutput:
                    raise Exception(ex.msg)
                err = self.options.add_error("HTTP Error: %s (%s)" % (ex.http_code, ex.msg), HTError, name)
                self.options.processor_graph.add_http_context(err, ex.http_code)
                return copyErrors(graph, self.options)
            except RDFaError as ex:
                self.http_status = 500
                # Something nasty happened:-(
                if not rdfOutput:
                    raise Exception(ex.msg)
                err = self.options.add_error(str(ex.msg), context = name)
                self.options.processor_graph.add_http_context(err, 500)
                return copyErrors(graph, self.options)
            except Exception as ex:
                self.http_status = 500
                # Something nasty happened:-(
                if not rdfOutput:
                    raise ex
                err = self.options.add_error(str(ex), context = name)
                self.options.processor_graph.add_http_context(err, 500)
                return copyErrors(graph, self.options)

            dom = None
            try:
                msg = ""
                parser = None
                if self.options.host_language == HostLanguage.html5:
                    import warnings
                    warnings.filterwarnings("ignore", category=DeprecationWarning)
                    from html5lib import HTMLParser, treebuilders
                    parser = HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
                    if self.charset:
                        # This means the HTTP header has provided a charset, or the
                        # file is a local file when we suppose it to be a utf-8
                        #
                        # 2020-01-20, Ivan Herman
                        #   for some reasons the python3 version ran into a problem with this html5lib call
                        #   the override_encoding argument was not accepted.
                        # dom = parser.parse(stream, override_encoding=self.charset)
                        dom = parser.parse(stream)
                    else:
                        # No charset set. The HTMLLib parser tries to sniff into
                        # the file to find a meta header for the charset; if that
                        # works, fine, otherwise it falls back on window-...
                        dom = parser.parse(stream)

                    try:
                        # rewind (or reopen) the source to look for version information
                        if isstring:
                            stream.close()
                            stream = self._get_input(name)
                        else:
                            stream.seek(0)
                        from .host import adjust_html_version
                        self.rdfa_version = adjust_html_version(stream, self.rdfa_version)
                    except Exception:
                        # if anything goes wrong, it is not really important; rdfa version stays what it was...
                        pass

                else:
                    from .host import adjust_xhtml_and_version
                    if isinstance(stream, IOBase):
                        parse = xml.dom.minidom.parse
                    else:
                        parse = xml.dom.minidom.parseString
                    dom = parse(stream)
                    adjusted_host_language, version = adjust_xhtml_and_version(dom, self.options.host_language, self.rdfa_version)
                    self.options.host_language = adjusted_host_language
                    self.rdfa_version = version
            except ImportError:
                msg = "HTML5 parser not available. Try installing html5lib <http://code.google.com/p/html5lib>"
                raise ImportError(msg)
            except Exception as e:
                # These are various parsing exceptions. Per spec, this is a case when
                # error triples MUST be returned, i.e., the usage of rdfOutput (which switches between an HTML formatted
                # return page or a graph with error triples) does not apply
                err = self.options.add_error(str(e), context = name)
                self.http_status = 400
                self.options.processor_graph.add_http_context(err, 400)
                return copyErrors(graph, self.options)

            # If we got here, we have a DOM tree to operate on...
            return self.graph_from_DOM(dom, graph, pgraph)
        except Exception:
            # Something nasty happened during the generation of the graph...
            (a,b,c) = sys.exc_info()
            sys.excepthook(a,b,c)
            if isinstance(b, ImportError):
                self.http_status = None
            else:
                self.http_status = 500
            if not rdfOutput:
                raise b
            err = self.options.add_error(str(b), context = name)
            self.options.processor_graph.add_http_context(err, 500)
            return copyErrors(graph, self.options)

Extract an RDF graph from an RDFa source. The source is parsed, the RDF extracted, and the RDFa Graph is returned. This is a front-end to the L{pyRdfa.graph_from_DOM} method.

@param name: a URI, a file name, or a file-like object
@param graph: rdflib Graph instance. If None, a new one is created.
@param pgraph: rdflib Graph instance for the processor graph. If None, and the error/warning triples are to be generated, they will be added to the returned graph. Otherwise they are stored in this graph.
@param rdfOutput: whether runtime exceptions should be turned into RDF and returned as part of the processor graph
@return: an RDF Graph
@rtype: rdflib Graph instance
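
The role of C{rdfOutput} can be illustrated without the package itself. The sketch below is an assumption-laden simplification: C{ErrorLog} is a hypothetical stand-in for the processor graph, C{load_source} and C{failing_opener} are invented names, and the mapping of C{FileNotFoundError} to status 400 merely imitates how the method treats a failed source. It shows the two possible reactions to a failure: re-raising to the caller, or recording the error together with an HTTP status.

```python
class ErrorLog:
    """Hypothetical stand-in for the processor graph: it only collects entries."""

    def __init__(self):
        self.entries = []

    def add_error(self, message, http_status):
        self.entries.append((http_status, message))


def load_source(opener, name, log, rdf_output=False):
    """Open `name` with `opener`; react to failures according to `rdf_output`.

    Returns a (stream, http_status) pair; stream is None if an error was recorded.
    """
    try:
        return opener(name), 200
    except Exception as ex:
        # a missing source is reported as 400, anything else as 500 (illustrative choice)
        status = 400 if isinstance(ex, FileNotFoundError) else 500
        if not rdf_output:
            # the caller is responsible for handling the exception
            raise
        log.add_error(str(ex), status)
        return None, status


def failing_opener(name):
    raise FileNotFoundError("no such source: " + name)


log = ErrorLog()
stream, status = load_source(failing_opener, "missing.html", log, rdf_output=True)
```

With C{rdf_output=True} the failure ends up in C{log.entries}; with the default value the same call would raise C{FileNotFoundError} instead.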

def rdf_from_sources(self, names, outputFormat='turtle', rdfOutput=False):
    def rdf_from_sources(self, names, outputFormat = "turtle", rdfOutput = False):
        """
        Extract an RDF graph from a list of RDFa sources and serialize them in one graph. The sources are parsed, the RDF
        extracted, and serialization is done in the specified format.
        @param names: list of sources, each can be a URI, a file name, or a file-like object
        @keyword outputFormat: serialization format. Can be one of "turtle", "n3", "xml", "pretty-xml", "nt", "json" or "json-ld". "turtle" and "n3", "xml" and "pretty-xml", and "json" and "json-ld" are synonyms, respectively. Note that the JSON-LD serialization works with RDFLib 3.* only.
        @keyword rdfOutput: controls what happens in case an exception is raised. If the value is False, the caller is responsible for handling it; otherwise a graph is returned with an error message included in the processor graph
        @type rdfOutput: boolean
        @return: a serialized RDF Graph
        @rtype: string
        """
        # protection against possible malicious URL call
        outputFormat = pyRdfa._validate_output_format(outputFormat)

        # This is better because it gives access to the various, non-standard serializations
        # If it does not work because the extras are not installed, fall back to the standard
        # RDFLib distribution...
        graph = Graph()

        # graph.bind("xsd", Namespace('http://www.w3.org/2001/XMLSchema#'))
        # the value of rdfOutput determines the reaction on exceptions...
        for name in names:
            self.graph_from_source(name, graph, rdfOutput)

        # serialize() may return bytes (older RDFLib releases) or str (newer ones)
        serialized = graph.serialize(format=outputFormat)
        if isinstance(serialized, bytes):
            return serialized.decode('utf-8')
        return serialized

Extract an RDF graph from a list of RDFa sources and serialize them in one graph. The sources are parsed, the RDF extracted, and serialization is done in the specified format.

@param names: list of sources, each can be a URI, a file name, or a file-like object
@keyword outputFormat: serialization format. Can be one of "turtle", "n3", "xml", "pretty-xml", "nt", "json" or "json-ld". "turtle" and "n3", "xml" and "pretty-xml", and "json" and "json-ld" are synonyms, respectively. Note that the JSON-LD serialization works with RDFLib 3.* only.
@keyword rdfOutput: controls what happens in case an exception is raised. If the value is False, the caller is responsible for handling it; otherwise a graph is returned with an error message included in the processor graph
@type rdfOutput: boolean
@return: a serialized RDF Graph
@rtype: string
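
Two details of this method lend themselves to a small illustration: the requested format is checked before use, and the result is always returned as a string. The sketch below is not the package's C{_validate_output_format}; C{KNOWN_FORMATS}, C{normalize_format} and C{as_text} are hypothetical names. It assumes that an unknown format falls back to "turtle", and that the serializer may return either C{bytes} or C{str} depending on the installed RDFLib release.

```python
# Format keys taken from the documentation above; the tuple itself is an assumption
KNOWN_FORMATS = ("turtle", "n3", "xml", "pretty-xml", "nt", "json", "json-ld")


def normalize_format(requested, default="turtle"):
    """Return `requested` if it is a known format key, `default` otherwise."""
    return requested if requested in KNOWN_FORMATS else default


def as_text(serialized, encoding="utf-8"):
    """Return the serialization as str, whether the serializer produced bytes or str."""
    if isinstance(serialized, bytes):
        return serialized.decode(encoding)
    return serialized


fmt = normalize_format("something-unexpected")
text = as_text(b"<http://example.org/a> <http://example.org/b> <http://example.org/c> .")
```

Checking against a fixed list of keys keeps an arbitrary string coming from a URL from reaching the serializer lookup, which is the concern behind the "malicious URL call" comment in the listing.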

def rdf_from_source(self, name, outputFormat='turtle', rdfOutput=False):
    def rdf_from_source(self, name, outputFormat = "turtle", rdfOutput = False):
        """
        Extract an RDF graph from an RDFa source and serialize it in one graph. The source is parsed, the RDF
        extracted, and serialization is done in the specified format.
        @param name: a URI, a file name, or a file-like object
        @keyword outputFormat: serialization format. Can be one of "turtle", "n3", "xml", "pretty-xml", "nt", or "json-ld". "turtle" and "n3", and "xml" and "pretty-xml", are synonyms, respectively. Note that the JSON-LD serialization works with RDFLib 3.* only.
        @keyword rdfOutput: controls what happens in case an exception is raised. If the value is False, the caller is responsible for handling it; otherwise a graph is returned with an error message included in the processor graph
        @type rdfOutput: boolean
        @return: a serialized RDF Graph
        @rtype: string
        """
        return self.rdf_from_sources([name], outputFormat, rdfOutput)

Extract an RDF graph from an RDFa source and serialize it in one graph. The source is parsed, the RDF extracted, and serialization is done in the specified format.

@param name: a URI, a file name, or a file-like object
@keyword outputFormat: serialization format. Can be one of "turtle", "n3", "xml", "pretty-xml", "nt", or "json-ld". "turtle" and "n3", and "xml" and "pretty-xml", are synonyms, respectively. Note that the JSON-LD serialization works with RDFLib 3.* only.
@keyword rdfOutput: controls what happens in case an exception is raised. If the value is False, the caller is responsible for handling it; otherwise a graph is returned with an error message included in the processor graph
@type rdfOutput: boolean
@return: a serialized RDF Graph
@rtype: string

def processURI(uri, outputFormat, form={}):
def processURI(uri, outputFormat, form={}):
    """The standard processing of an RDFa URI, with options given in a form; used as an entry point from a CGI call.

    The call accepts extra form options (i.e., HTTP GET options) as follows:

     - C{graph=[output|processor|output,processor|processor,output]} specifying which graphs are returned. Default: C{output}
     - C{space_preserve=[true|false]} : whether white space in plain literals is preserved (C{true}) or normalized (C{false}). Default: C{true}
     - C{rdfa_version} provides the RDFa version that should be used for distilling. The string should be of the form "1.0" or "1.1". Default is the highest version the current package implements, currently "1.1"
     - C{host_language=[xhtml,html,svg,atom,xml]} : the host language. Used when files are uploaded or text is added verbatim, otherwise the HTTP return header should be used. Any other value is treated as C{xml}
     - C{embedded_rdf=[true|false]} : whether embedded turtle or RDF/XML content should be added to the output graph. Default: C{false}
     - C{vocab_expansion=[true|false]} : whether the vocabularies should be expanded through the restricted RDFS entailment. Default: C{false}
     - C{vocab_cache=[true|false]} : whether vocab caching should be performed or whether it should be ignored and vocabulary files should be picked up every time. Default: C{true}
     - C{vocab_cache_report=[true|false]} : whether vocab caching details should be reported. Default: C{false}
     - C{vocab_cache_refresh=[true|false]} : whether vocab caches have to be regenerated every time. Default: C{false}
     - C{rdfa_lite=[true|false]} : whether warnings should be generated for non RDFa Lite attribute usage. Default: C{false}
     - C{certifi_verify=[true|false]} : whether the SSL certificate needs to be verified. Default: C{true}

    @param uri: URI to access. Note that the C{text:} and C{uploaded:} fake URI values are treated separately; the former is for textual input (in which case a StringIO is used to get the data) and the latter is for uploaded file, where the form gives access to the file directly.
    @param outputFormat: serialization format, as defined by the package. Currently "xml", "turtle", "nt", or "json". Default is "turtle", also used if any other string is given.
    @param form: extra call options (from the CGI call) to set up the local options
    @type form: cgi FieldStorage instance
    @return: serialized graph
    @rtype: string
    """
    def _get_option(param, compare_value, default):
        param_old = param.replace('_', '-')
        if param in list(form.keys()):
            val = form.getfirst(param).lower()
            return val == compare_value
        elif param_old in list(form.keys()):
            # this is to ensure the old style parameters are still valid...
            # in the old days I used '-' in the parameters, the standard favours '_'
            val = form.getfirst(param_old).lower()
            return val == compare_value
        else:
            return default

    if uri == "uploaded:":
        stream = form["uploaded"].file
        base = ""
    elif uri == "text:":
        stream = StringIO(form.getfirst("text"))
        base = ""
    else:
        stream = uri
        base = uri

    if "rdfa_version" in list(form.keys()):
        rdfa_version = form.getfirst("rdfa_version")
    else:
        rdfa_version = None

    # working through the possible options
    # Host language: HTML, XHTML, or XML
    # Note that these options should be used for the upload and inline version only in case of a form
    # for real uris the returned content type should be used
    if "host_language" in list(form.keys()):
        host_language = form.getfirst("host_language").lower()
        if host_language == "xhtml":
            media_type = MediaTypes.xhtml
        elif host_language == "html":
            media_type = MediaTypes.html
        elif host_language == "svg":
            media_type = MediaTypes.svg
        elif host_language == "atom":
            media_type = MediaTypes.atom
        else:
            media_type = MediaTypes.xml
    else:
        media_type = ""

    transformers = []

    check_lite = "rdfa_lite" in list(form.keys()) and form.getfirst("rdfa_lite").lower() == "true"

    # The code below is left for backward compatibility only. In fact, these options are not exposed any more,
    # they are not really in use
    from .transform.metaname import meta_transform
    from .transform.OpenID import OpenID_transform
    from .transform.DublinCore import DC_transform

    if "extras" in list(form.keys()) and form.getfirst("extras").lower() == "true":
        for t in [OpenID_transform, DC_transform, meta_transform]:
            transformers.append(t)
    else:
        if "extra-meta" in list(form.keys()) and form.getfirst("extra-meta").lower() == "true":
            transformers.append(meta_transform)
        if "extra-openid" in list(form.keys()) and form.getfirst("extra-openid").lower() == "true":
            transformers.append(OpenID_transform)
        if "extra-dc" in list(form.keys()) and form.getfirst("extra-dc").lower() == "true":
            transformers.append(DC_transform)

    output_default_graph = True
    output_processor_graph = False
    # Note that I use the 'graph' and the 'rdfagraph' form keys here. Reason is that
    # I used 'graph' in the previous versions, including the RDFa 1.0 processor,
    # so if I removed that altogether that would create backward incompatibilities
    # On the other hand, the RDFa 1.1 doc clearly refers to 'rdfagraph' as the standard
    # key.
    a = None
    if "graph" in list(form.keys()):
        a = form.getfirst("graph").lower()
    elif "rdfagraph" in list(form.keys()):
        a = form.getfirst("rdfagraph").lower()
    if a is not None:
        if a == "processor":
            output_default_graph = False
            output_processor_graph = True
        elif a == "processor,output" or a == "output,processor":
            output_processor_graph = True

    embedded_rdf =        _get_option( "embedded_rdf", "true", False)
    space_preserve =      _get_option( "space_preserve", "true", True)
    vocab_cache =         _get_option( "vocab_cache", "true", True)
    vocab_cache_report =  _get_option( "vocab_cache_report", "true", False)
    refresh_vocab_cache = _get_option( "vocab_cache_refresh", "true", False)
    vocab_expansion =     _get_option( "vocab_expansion", "true", False)
    certifi_verify =      _get_option( "certifi_verify", "true", True)
    if vocab_cache_report:
        output_processor_graph = True

    options = Options(output_default_graph   = output_default_graph,
                      output_processor_graph = output_processor_graph,
                      space_preserve         = space_preserve,
                      transformers           = transformers,
                      vocab_cache            = vocab_cache,
                      vocab_cache_report     = vocab_cache_report,
                      refresh_vocab_cache    = refresh_vocab_cache,
                      vocab_expansion        = vocab_expansion,
                      embedded_rdf           = embedded_rdf,
                      check_lite             = check_lite,
                      certifi_verify         = certifi_verify)

    processor = pyRdfa(options = options, base = base, media_type = media_type, rdfa_version = rdfa_version)

    # Decide the output format; the issue is what should happen in case of a top level error like an inaccessibility of
    # the html source: should a graph be returned or an HTML page with an error message?

    # decide whether HTML or RDF should be sent.
    htmlOutput = False
    #if 'HTTP_ACCEPT' in os.environ:
    #    acc = os.environ['HTTP_ACCEPT']
    #    possibilities = ['text/html',
    #                     'application/rdf+xml',
    #                     'text/turtle; charset=utf-8',
    #                     'application/json',
    #                     'application/ld+json',
    #                     'text/rdf+n3']
    #
    #    # this nice module does content negotiation and returns the preferred format
    #    sg = acceptable_content_type(acc, possibilities)
    #    htmlOutput = (sg != None and sg[0] == content_type('text/html'))
    #    os.environ['rdfaerror'] = 'true'

    import html
    try:
        outputFormat = pyRdfa._validate_output_format(outputFormat)
        if outputFormat == "n3":
            retval = 'Content-Type: text/rdf+n3; charset=utf-8\n'
        elif outputFormat == "nt" or outputFormat == "turtle":
            retval = 'Content-Type: text/turtle; charset=utf-8\n'
        elif outputFormat == "json-ld" or outputFormat == "json":
            retval = 'Content-Type: application/ld+json; charset=utf-8\n'
        else:
            retval = 'Content-Type: application/rdf+xml; charset=utf-8\n'
        # "forceRDFOutput" is really for testing purposes only, it is an unpublished flag to force RDF output no
        # matter what
        graph = processor.rdf_from_source(stream, outputFormat, rdfOutput = ("forceRDFOutput" in list(form.keys())) or not htmlOutput)
        retval += '\n'
        retval += graph
        return retval
    except HTTPError as h:
        retval = 'Content-type: text/html; charset=utf-8\nStatus: %s \n\n' % h.http_code
        retval += "<html>\n"
        retval += "<head>\n"
        retval += "<title>HTTP Error in distilling RDFa content</title>\n"
        retval += "</head><body>\n"
        retval += "<h1>HTTP Error in distilling RDFa content</h1>\n"
        # values coming from the request or the remote server are escaped before being put into the page
        retval += "<p>HTTP Error: %s (%s)</p>\n" % (h.http_code, html.escape(str(h.msg)))
        retval += "<p>On URI: <code>'%s'</code></p>\n" % html.escape(uri)
        retval +="</body>\n"
        retval +="</html>\n"
        return retval
    except Exception as value:
        # This branch should occur only if an exception is really raised, ie, if it is not turned
        # into a graph value.
        import traceback

        retval = 'Content-type: text/html; charset=utf-8\nStatus: %s\n\n' % processor.http_status
        retval += "<html>\n"
        retval += "<head>\n"
        retval += "<title>Exception in RDFa processing</title>\n"
        retval += "</head><body>\n"
        retval += "<h1>Exception in distilling RDFa</h1>\n"
        retval += "<pre>\n"
        strio  = StringIO()
        traceback.print_exc(file=strio)
        retval += html.escape(strio.getvalue())
        retval +="</pre>\n"
        retval +="<pre>%s</pre>\n" % html.escape(str(value))
        retval +="<h1>Distiller request details</h1>\n"
        retval +="<dl>\n"
        if uri == "text:" and "text" in form and form["text"].value is not None and len(form["text"].value.strip()) != 0:
            retval +="<dt>Text input:</dt><dd>%s</dd>\n" % html.escape(form["text"].value).replace('\n','<br/>')
        elif uri == "uploaded:":
            retval +="<dt>Uploaded file</dt>\n"
        else:
            retval +="<dt>URI received:</dt><dd><code>'%s'</code></dd>\n" % html.escape(uri)
        if "host_language" in list(form.keys()):
            retval +="<dt>Media Type:</dt><dd>%s</dd>\n" % html.escape(media_type)
        if "graph" in list(form.keys()):
            retval +="<dt>Requested graphs:</dt><dd>%s</dd>\n" % html.escape(form.getfirst("graph").lower())
        else:
            retval +="<dt>Requested graphs:</dt><dd>default</dd>\n"
        retval +="<dt>Output serialization format:</dt><dd> %s</dd>\n" % html.escape(str(outputFormat))
        if "space_preserve" in form:
            retval +="<dt>Space preserve:</dt><dd> %s</dd>\n" % html.escape(form["space_preserve"].value)
        retval +="</dl>\n"
        retval +="</body>\n"
        retval +="</html>\n"
        return retval

Standard processing of an RDFa URI, with the options taken from a form; used as the entry point for a CGI call.

The call accepts extra form options (i.e., HTTP GET options) as follows:

  • C{graph=[output|processor|output,processor|processor,output]} specifying which graphs are returned. Default: C{output}
  • C{space_preserve=[true|false]} : whether white space in plain literals is preserved rather than normalized. Default: C{false}
  • C{rdfa_version} : the RDFa version that should be used for distilling. The string should be of the form "1.0" or "1.1". Default: the highest version the current package implements, currently "1.1"
  • C{host_language=[xhtml,html,xml]} : the host language. Used when files are uploaded or text is added verbatim, otherwise the HTTP return header should be used. Default C{xml}
  • C{embedded_rdf=[true|false]} : whether embedded turtle or RDF/XML content should be added to the output graph. Default: C{false}
  • C{vocab_expansion=[true|false]} : whether the vocabularies should be expanded through the restricted RDFS entailment. Default: C{false}
  • C{vocab_cache=[true|false]} : whether vocab caching should be performed or whether it should be ignored and vocabulary files should be picked up every time. Default: C{false}
  • C{vocab_cache_report=[true|false]} : whether vocab caching details should be reported. Default: C{false}
  • C{vocab_cache_bypass=[true|false]} : whether vocab caches have to be regenerated every time. Default: C{false}
  • C{rdfa_lite=[true|false]} : whether warnings should be generated when attributes outside RDFa Lite are used. Default: C{false}
  • C{certifi_verify=[true|false]} : whether the SSL certificate needs to be verified. Default: C{true}
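
The option list above can be read as a table of names and defaults. The following is a minimal sketch of how a GET query string could be turned into such a table; it is not the code used by the package. The helper C{read_options} and the dictionary C{BOOLEAN_DEFAULTS} are hypothetical names introduced for illustration, and the case-insensitive comparison with "true" is an assumption.

```python
from urllib.parse import parse_qs

# Defaults copied from the option list above. The dictionary itself is
# illustrative and is not part of pyRdfa.
BOOLEAN_DEFAULTS = {
    "space_preserve": False,
    "embedded_rdf": False,
    "vocab_expansion": False,
    "vocab_cache": False,
    "vocab_cache_report": False,
    "vocab_cache_bypass": False,
    "rdfa_lite": False,
    "certifi_verify": True,
}


def read_options(query_string):
    """Hypothetical helper: turn a GET query string into an options dict."""
    fields = parse_qs(query_string)
    options = {}
    for name, default in BOOLEAN_DEFAULTS.items():
        if name in fields:
            # Assumption: anything other than "true" counts as false.
            options[name] = fields[name][0].strip().lower() == "true"
        else:
            options[name] = default
    # 'graph' is a comma-separated list; the order of the entries is irrelevant.
    graph = fields.get("graph", ["output"])[0].lower()
    options["graphs"] = sorted(g.strip() for g in graph.split(",") if g.strip())
    # Documented default for the host language is "xml".
    options["host_language"] = fields.get("host_language", ["xml"])[0].lower()
    return options


if __name__ == "__main__":
    print(read_options("graph=processor,output&vocab_expansion=true"))
```

Options that are absent from the query string keep their documented defaults, so an empty query string yields the default configuration.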

@param uri: URI to access. Note that the C{text:} and C{uploaded:} fake URI values are treated separately; the former is for textual input (in which case a StringIO is used to get the data) and the latter is for an uploaded file, where the form gives access to the file directly.
@param outputFormat: serialization format, as defined by the package. Currently "xml", "turtle", "nt", or "json". Default is "turtle", also used if any other string is given.
@param form: extra call options (from the CGI call) to set up the local options
@type form: cgi FieldStorage instance
@return: serialized graph
@rtype: string
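
The two parameter rules described above (the fallback to "turtle" and the special treatment of the two fake URI values) can be sketched as follows. This is only an illustration under stated assumptions: C{normalize_output_format} and C{classify_input} are hypothetical helpers that do not exist in the package, and the case-insensitive matching of format names is an assumption.

```python
# Format names as listed in the parameter description above.
KNOWN_FORMATS = ("xml", "turtle", "nt", "json")


def normalize_output_format(output_format):
    """Hypothetical helper: fall back to "turtle" for any unknown value."""
    if isinstance(output_format, str) and output_format.lower() in KNOWN_FORMATS:
        return output_format.lower()
    return "turtle"


def classify_input(uri):
    """Hypothetical helper: tell the two fake URI values apart from a real URI."""
    if uri == "text:":
        return "text"      # data is taken from the form's text field
    if uri == "uploaded:":
        return "upload"    # data is taken from the file uploaded with the form
    return "remote"        # anything else is treated as a URI to access


if __name__ == "__main__":
    print(normalize_output_format("n3"), classify_input("text:"))
```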