pyRdfa
RDFa 1.1 parser, also referred to as an “RDFa Distiller”. It is deployed, via a CGI front-end, on the U{W3C RDFa 1.1 Distiller page<http://www.w3.org/2012/pyRdfa/>}.
For details on RDFa, the reader should consult the U{RDFa Core 1.1<http://www.w3.org/TR/rdfa-core/>}, U{XHTML+RDFa 1.1<http://www.w3.org/TR/2010/xhtml-rdfa>}, and the U{RDFa 1.1 Lite<http://www.w3.org/TR/rdfa-lite/>} documents. The U{RDFa 1.1 Primer<http://www.w3.org/TR/rdfa-primer/>} may also prove helpful.
This package can also be downloaded U{from GitHub<https://github.com/RDFLib/pyrdfa3>}. The distribution also includes the CGI front-end and a separate utility script to be run locally.
Note that this package is an updated version of a U{previous RDFa distiller<http://www.w3.org/2007/08/pyRdfa>} that was developed for RDFa 1.0. Although it reuses large portions of that code, it has been quite thoroughly rewritten and was therefore put in a completely different project. (The version numbering has been continued, though, to avoid any misunderstanding. This version has version numbers "3.0.0" or higher.)
(Simple) Usage
From a Python file, expecting Turtle output::
    from pyRdfa import pyRdfa
    print(pyRdfa().rdf_from_source('filename'))
Other output formats are also possible. E.g., to produce RDF/XML output, one could use::
    from pyRdfa import pyRdfa
    print(pyRdfa().rdf_from_source('filename', outputFormat='pretty-xml'))
It is also possible to embed RDFa processing in an application. E.g., using::
    from pyRdfa import pyRdfa
    graph = pyRdfa().graph_from_source('filename')
returns an RDFLib.Graph object instead of a serialization thereof. See the description of the L{pyRdfa class<pyRdfa>} for further details on the possible entry points.
There is also, as part of this module, a L{separate entry for CGI calls<processURI>}.
Return (serialization) formats
The package relies on RDFLib. By default, it therefore relies on the serializers coming with the local RDFLib distribution. However, there have been some issues with the serializers of older RDFLib releases; also, some output formats, like JSON-LD, are not (yet) part of the standard RDFLib distribution. A companion package, called pyRdfaExtras, is part of the download, and it includes some of those extra serializers. The extra format (not part of the RDFLib core) is U{JSON-LD<http://json-ld.org/spec/latest/json-ld-syntax/>}, whose 'key' is 'json', when used in the 'parse' method of an RDFLib graph.
(Note in 2018: the bugs that required pyRdfaExtras are gone in recent RDFLib versions, and the JSON-LD serializer and parser can be U{downloaded from GitHub<https://github.com/RDFLib/rdflib-jsonld>} (or installed via pip). This means that pyRdfaExtras is imported only when running older (i.e., 2.X.X) RDFLib versions and can be safely ignored these days.)
Options
The package also implements some optional features that are not part of the RDFa recommendations. At the moment these are:
- possibility for plain literals to be normalized in terms of white spaces. Default: false. (The RDFa specification requires keeping the white spaces and leaves it to applications to normalize them, if needed.)
- inclusion of embedded RDF: Turtle content may be enclosed in a C{script} element and typed as C{text/turtle}, U{defined by the RDF Working Group<http://www.w3.org/TR/turtle/>}. Alternatively, some XML dialects (e.g., SVG) allow the usage of RDF/XML as part of their core content to define metadata in RDF. In both cases pyRdfa parses this serialized RDF content and adds the resulting triples to the output Graph. Default: true.
- extra, built-in transformers are executed on the DOM tree prior to RDFa processing (see below). These transformers can be provided by the end user.
Options are collected in an instance of the L{Options} class and may be passed to the processing functions as an extra argument. E.g., to allow the inclusion of embedded content::
    from pyRdfa import pyRdfa
    from pyRdfa.options import Options
    options = Options(embedded_rdf=True)
    print(pyRdfa(options=options).rdf_from_source('filename'))
See the description of the L{Options} class for the details.
Host Languages
RDFa 1.1 Core is defined for generic XML; there are specific documents to describe how the generic specification is applied to XHTML and HTML5.
pyRdfa makes an automatic switch among these based on the content type of the source as returned by an HTTP request. The following are the possible host languages:
- if the content type is C{text/html}, the content is HTML5
- if the content type is C{application/xhtml+xml} I{and} the right DTD is used, the content is XHTML1
- if the content type is C{application/xhtml+xml} and no or an unknown DTD is used, the content is XHTML5
- if the content type is C{application/svg+xml}, the content is SVG
- if the content type is C{application/atom+xml}, the content is Atom
- if the content type is C{application/xml} or C{application/xxx+xml} (but 'xxx' is not 'atom' or 'svg'), the content is XML
If local files are used, pyRdfa makes a guess on the content type based on the file name suffix: C{.html} is for HTML5, C{.xhtml} for XHTML1, C{.svg} for SVG, anything else is considered to be general XML. Finally, the content type may be set by the caller when initializing the L{pyRdfa class<pyRdfa>}.
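The selection rules above can be sketched as two small lookup functions. This is only an illustration: the label constants and the function names C{host_from_content_type} and C{host_from_filename} are hypothetical and are not part of the package (the real definitions live in the L{host} module), and the sketch assumes that the Atom media type maps to an Atom host language.

```python
import os

# Hypothetical labels for illustration; the package's host module defines the real ones.
HTML5, XHTML1, XHTML5, SVG, ATOM, XML = "html5", "xhtml1", "xhtml5", "svg", "atom", "xml"


def host_from_content_type(content_type, rdfa_dtd=False):
    """Map a media type to a host language label, following the list above.

    rdfa_dtd tells whether the source declares the expected XHTML+RDFa DTD.
    Returns None for media types that are not covered by the list.
    """
    # Drop parameters such as "; charset=utf-8" before comparing
    ct = content_type.split(";")[0].strip().lower()
    if ct == "text/html":
        return HTML5
    if ct == "application/xhtml+xml":
        return XHTML1 if rdfa_dtd else XHTML5
    if ct == "application/svg+xml":
        return SVG
    if ct == "application/atom+xml":
        return ATOM
    if ct == "application/xml" or (ct.startswith("application/") and ct.endswith("+xml")):
        return XML
    return None


def host_from_filename(path):
    """Guess the host language of a local file from its suffix."""
    suffix = os.path.splitext(path)[1].lower()
    return {".html": HTML5, ".xhtml": XHTML1, ".svg": SVG}.get(suffix, XML)
```

For example, C{host_from_content_type("text/html; charset=utf-8")} yields the HTML5 label, while a local file with an unknown suffix falls back to general XML.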
Beyond the differences described in the RDFa specification, the main difference is the parser used to parse the source. In the case of HTML5, pyRdfa uses an U{HTML5 parser<http://code.google.com/p/html5lib/>}; for all other cases the simple XML parser, part of the core Python environment, is used. This may be significant in the case of erroneous sources: indeed, the HTML5 parser may do adjustments on the DOM tree before handing it over to the distiller. Furthermore, SVG is also recognized as a type that allows embedded RDF in the form of RDF/XML.
See the variables in the L{host} module if a new host language is added to the system. The current host language information is available for transformers via the option argument, too, and can be used to control the effect of the transformer.
Vocabularies
RDFa 1.1 has the notion of vocabulary files (using the C{@vocab} attribute) that may be used to expand the generated RDF graph. Expansion is based on some very simple RDF Schema and OWL statements on sub-properties and sub-classes, and equivalences.
pyRdfa implements this feature, although it does not do this by default. The extra C{vocab_expansion} parameter should be used for this extra step, for example::
    from pyRdfa import pyRdfa
    from pyRdfa.options import Options
    options = Options(vocab_expansion=True)
    print(pyRdfa(options=options).rdf_from_source('filename'))
The triples in the vocabulary files themselves (i.e., the small ontology in RDF Schema and OWL) are removed from the result, leaving only the inferred property and type relationships (in addition to the “core” RDF content).
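To make the idea of expansion concrete, here is a toy sketch that works on plain Python tuples rather than on an RDFLib graph. It is not the package's implementation: the function name C{expand} and the string constants are hypothetical, and only the sub-property and sub-class rules are covered (equivalences are left out for brevity).

```python
# Hypothetical, simplified stand-ins for the RDFS/RDF terms
SUBPROP = "rdfs:subPropertyOf"
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def expand(triples, vocabulary):
    """Return the data triples plus the triples inferred from the vocabulary.

    The vocabulary triples themselves are not part of the result.
    """
    result = set(triples)
    changed = True
    # Repeat until a full pass adds nothing new (handles chains of sub-classes)
    while changed:
        changed = False
        for (vs, vp, vo) in vocabulary:
            for (s, p, o) in list(result):
                if vp == SUBPROP and p == vs:
                    new = (s, vo, o)        # s p o, p subPropertyOf q  =>  s q o
                elif vp == SUBCLASS and p == TYPE and o == vs:
                    new = (s, TYPE, vo)     # s a C, C subClassOf D  =>  s a D
                else:
                    continue
                if new not in result:
                    result.add(new)
                    changed = True
    return result


vocabulary = {
    ("ex:author", SUBPROP, "dc:creator"),
    ("ex:Novel", SUBCLASS, "ex:Book"),
    ("ex:Book", SUBCLASS, "ex:Work"),
}
data = {("doc", "ex:author", "Ivan"), ("doc", TYPE, "ex:Novel")}
expanded = expand(data, vocabulary)
```

Here C{expanded} holds the two original triples plus the inferred C{dc:creator} statement and the two inferred types, and none of the vocabulary triples.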
Vocabulary caching
By default, pyRdfa uses a caching mechanism instead of fetching the vocabulary files each time their URI is met as a C{@vocab} attribute value. (This behavior can be switched off by setting the C{vocab_cache} option to false.)
Caching happens in a file system directory. The directory itself is determined by the platform the tool is used on, namely:
- On Windows, it is the C{pyRdfa-cache} subdirectory of the directory named by the C{%APPDATA%} environment variable
- On MacOS, it is the C{~/Library/Application Support/pyRdfa-cache}
- Otherwise, it is the C{~/.pyRdfa-cache}
This automatic choice can be overridden by the C{PyRdfaCacheDir} environment variable.
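The directory choice described above might look roughly like the following sketch. The function name C{cache_directory} and its two parameters are hypothetical, introduced only so that the logic can be tried without touching the real environment; the package's own code may differ in detail.

```python
import os
import sys

# Name of the overriding environment variable, as given in the text
CACHE_DIR_VAR = "PyRdfaCacheDir"


def cache_directory(platform=None, environ=None):
    """Pick the vocabulary cache directory for the given platform.

    platform and environ default to sys.platform and os.environ.
    """
    platform = sys.platform if platform is None else platform
    environ = os.environ if environ is None else environ

    # An explicit setting always wins over the platform default
    override = environ.get(CACHE_DIR_VAR)
    if override:
        return override

    home = os.path.expanduser("~")
    if platform.startswith("win") and "APPDATA" in environ:
        return os.path.join(environ["APPDATA"], "pyRdfa-cache")
    if platform == "darwin":
        return os.path.join(home, "Library", "Application Support", "pyRdfa-cache")
    return os.path.join(home, ".pyRdfa-cache")
```

Calling C{cache_directory()} without arguments inspects the running system; passing a platform string and a plain dictionary makes the choice reproducible.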
Caching can be set to be read-only, i.e., the setup might generate the cache files off-line instead of letting the tool write its own cache when operating, e.g., as a service on the Web. This can be achieved by making the cache directory read-only.
If the directories are neither readable nor writable, the vocabulary files are retrieved via HTTP every time they are hit. This may slow down processing; it is advisable to avoid such a setup for the package.
The cache includes a separate index file and a file for each vocabulary file. Cache control is based upon the C{EXPIRES} header of a vocabulary file’s HTTP return header: when first seen, this data is stored in the index file and controls whether the cache has to be renewed or not. If the HTTP return header does not have this entry, the date is artificially set to the current date plus one day.
(The cache files themselves are dumped and loaded using U{Python’s built-in cPickle package<http://docs.python.org/release/2.7/library/pickle.html#module-cPickle>}. These are binary files. Care should be taken if they are managed by CVS: they must be declared as binary files when adding them to the repository.)
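The expiration rule can be illustrated with a short sketch based on the standard library. The helper names C{expiration_date} and C{needs_refresh} are hypothetical, and the headers are assumed to be available as a plain dictionary keyed by C{Expires}; the package's own bookkeeping in the index file is not shown.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def expiration_date(headers, now=None):
    """Return the expiration date for a vocabulary file.

    Falls back to "now plus one day" when the Expires header is missing
    or cannot be parsed.
    """
    now = now or datetime.now(timezone.utc)
    value = headers.get("Expires")
    if value:
        try:
            expires = parsedate_to_datetime(value)
            if expires.tzinfo is None:
                # Treat dates without a zone as UTC so they stay comparable
                expires = expires.replace(tzinfo=timezone.utc)
            return expires
        except (TypeError, ValueError):
            # Older Python versions raise TypeError, newer ones ValueError
            pass
    return now + timedelta(days=1)


def needs_refresh(expires, now=None):
    """Tell whether a cached vocabulary file has to be fetched again."""
    now = now or datetime.now(timezone.utc)
    return now >= expires
```

A server answering with C{Expires: 0}, a common way of disabling caching, ends up with the one-day default in this sketch; whether the package treats that case the same way is not stated in the text.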
RDFa 1.1 vs. RDFa 1.0
Unfortunately, RDFa 1.1 is I{not} fully backward compatible with RDFa 1.0, meaning that, in a few cases, the triples generated from an RDFa 1.1 source are not the same as for RDFa 1.0. (See the separate U{section in the RDFa 1.1 specification<http://www.w3.org/TR/rdfa-core/#major-differences-with-rdfa-syntax-1.0>} for some further details.)
This distiller’s default behavior is RDFa 1.1. However, if the source includes, in the top element of the file (e.g., the C{html} element) a C{@version} attribute whose value contains the C{RDFa 1.0} string, then the distiller switches to an RDFa 1.0 mode. (Although the C{@version} attribute is not required in RDFa 1.0, it is fairly commonly used.) Similarly, if the RDFa 1.0 DTD is used in the XHTML source, it will be taken into account (a very frequent setup is that an XHTML file is defined with that DTD and is served as text/html; pyRdfa will consider that file as XHTML5, i.e., parse it with the HTML5 parser, but interpret the RDFa attributes under the RDFa 1.0 rules).
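The version switch can be approximated as follows. This is a sketch under assumptions, not the distiller's own code: the function name C{rdfa_version_of} is hypothetical, the source is assumed to be well-formed XML that C{xml.dom.minidom} can parse, and the DTD check simply looks for the C{RDFa 1.0} string in the public identifier of the document type.

```python
import xml.dom.minidom


def rdfa_version_of(source, default="1.1"):
    """Guess the RDFa version to apply to an XML source string."""
    dom = xml.dom.minidom.parseString(source)

    # 1. A @version attribute on the top element mentioning RDFa 1.0
    version = dom.documentElement.getAttribute("version")
    if "RDFa 1.0" in version:
        return "1.0"

    # 2. A document type declaration referring to the RDFa 1.0 DTD
    doctype = dom.doctype
    if doctype is not None and doctype.publicId and "RDFa 1.0" in doctype.publicId:
        return "1.0"

    return default
```

A source starting with C{<html version="XHTML+RDFa 1.0">} would be handled under the RDFa 1.0 rules, whereas a source without the attribute and without the DTD keeps the default.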
Transformers
The package uses the concept of 'transformers': the parsed DOM tree is possibly transformed I{before} performing the real RDFa processing. This transformer structure makes it possible to add additional 'services' without distorting the core code of RDFa processing.
A transformer is a function with three arguments:
- C{node}: a DOM node for the top level element of the DOM tree
- C{options}: the current L{Options} instance
- C{state}: the current L{ExecutionContext} instance, corresponding to the top level DOM Tree element
The function may perform any type of change on the DOM tree; the typical behavior is to add or remove attributes on specific elements. Some transformations are included in the package and can be used as examples; see the L{transform} module of the distribution. These are:
- The C{@name} attribute of the C{meta} element is copied into a C{@property} attribute of the same element
- Interpreting the 'openid' references in the header. See L{transform.OpenID} for further details.
- Implementing the Dublin Core dialect to include DC statements from the header. See L{transform.DublinCore} for further details.
The user of the package may add these transformers to the L{Options} instance. Here is a possible usage with the “openid” transformer added to the call::
    from pyRdfa import pyRdfa
    from pyRdfa.options import Options
    from pyRdfa.transform.OpenID import OpenID_transform
    options = Options(transformers=[OpenID_transform])
    print(pyRdfa(options=options).rdf_from_source('filename'))
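A user-defined transformer follows the same three-argument shape. The sketch below mimics the first built-in example (copying C{@name} into C{@property} on C{meta} elements); the function name C{meta_name_transform} is hypothetical, and it is exercised here directly on a minidom tree, with C{None} standing in for the C{options} and C{state} arguments that the distiller would normally supply.

```python
import xml.dom.minidom


def meta_name_transform(node, options, state):
    """Copy @name into @property on meta elements that have no @property yet.

    node is the top level DOM element; options and state are accepted to
    match the transformer signature but are not used in this sketch.
    """
    for meta in node.getElementsByTagName("meta"):
        if meta.hasAttribute("name") and not meta.hasAttribute("property"):
            meta.setAttribute("property", meta.getAttribute("name"))


source = (
    '<html><head>'
    '<meta name="dc:title" content="Example"/>'
    '<meta name="ignored" property="dc:creator" content="Someone"/>'
    '</head></html>'
)
dom = xml.dom.minidom.parseString(source)
meta_name_transform(dom.documentElement, None, None)
metas = dom.getElementsByTagName("meta")
```

After the call, the first C{meta} element carries C{property="dc:title"}, while the second keeps its existing C{@property}. Such a function would be handed to the distiller through the C{transformers} list of L{Options}, in the same way as C{OpenID_transform} above.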
@summary: RDFa parser (distiller)
@requires: Python 3.8 or higher.
@requires: U{requests<https://pypi.org/project/requests/2.32.3/>}; version 2.32.3 or higher.
@requires: U{rdflib<https://pypi.org/project/rdflib/7.0.0/>}; version 7.0.0 or higher.
@requires: U{html5lib<https://pypi.org/project/html5lib/1.1/>}; version 1.1 or higher.
@organization: U{World Wide Web Consortium<http://www.w3.org>}
@author: U{Ivan Herman<http://www.w3.org/People/Ivan/>}
@license: This software is available for use under the
U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}
@var builtInTransformers: List of built-in transformers that are to be run regardless, because they are part of the RDFa spec
@var CACHE_DIR_VAR: Environment variable used to define cache directories for RDFa vocabularies in case the default setting does not work or is not appropriate.
@var rdfa_current_version: Current "official" version of RDFa that this package implements by default. This can be changed at the invocation of the package
@var uri_schemes: List of registered (or widely used) URI schemes; used for warnings...
# -*- coding: utf-8 -*-

__version__ = "3.6.3"
__author__ = 'Ivan Herman and prrvchr'
__contact__ = 'prrvchr@gmail.com'
__license__ = 'W3C® SOFTWARE NOTICE AND LICENSE, http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231'

name = "pyRdfa3"

import sys

from io import StringIO, IOBase

import os
import xml.dom.minidom
from urllib.parse import urlparse

import rdflib
from rdflib import URIRef
from rdflib import Literal
from rdflib import BNode
from rdflib import Namespace
from rdflib import RDF as ns_rdf
from rdflib import RDFS as ns_rdfs
from rdflib import Graph

# Namespace, in the RDFLib sense, for the rdfa vocabulary
ns_rdfa = Namespace("http://www.w3.org/ns/rdfa#")

from .extras.httpheader import acceptable_content_type, content_type
from .transform.prototype import handle_prototypes

# Vocabulary terms for vocab reporting
RDFA_VOCAB = ns_rdfa["usesVocabulary"]

# Namespace, in the RDFLib sense, for the XSD Datatypes
ns_xsd = Namespace('http://www.w3.org/2001/XMLSchema#')

# Namespace, in the RDFLib sense, for the distiller vocabulary, used as part of the processor graph
ns_distill = Namespace("http://www.w3.org/2007/08/pyRdfa/vocab#")

debug = False

#########################################################################################################

# Exception/error handling. Essentially, all the different exceptions are re-packaged into
# separate exception class, to allow for an easier management on the user level

class RDFaError(Exception):
    """Superclass exceptions representing error conditions defined by the RDFa 1.1 specification.
    It does not add any new functionality to the
    Exception class."""
    def __init__(self, msg):
        self.msg = msg
        Exception.__init__(self)

class FailedSource(RDFaError):
    """Raised when the original source cannot be accessed.
It does not add any new functionality to the 215 Exception class.""" 216 def __init__(self, msg, http_code = None): 217 self.msg = msg 218 self.http_code = http_code 219 RDFaError.__init__(self, msg) 220 221class HTTPError(RDFaError): 222 """Raised when HTTP problems are detected. It does not add any new functionality to the 223 Exception class.""" 224 def __init__(self, http_msg, http_code): 225 self.msg = http_msg 226 self.http_code = http_code 227 RDFaError.__init__(self,http_msg) 228 229class ProcessingError(RDFaError): 230 """Error found during processing. It does not add any new functionality to the 231 Exception class.""" 232 pass 233 234class pyRdfaError(Exception): 235 """Superclass exceptions representing error conditions outside the RDFa 1.1 specification.""" 236 pass 237 238# Error and Warning RDFS classes 239RDFA_Error = ns_rdfa["Error"] 240RDFA_Warning = ns_rdfa["Warning"] 241RDFA_Info = ns_rdfa["Information"] 242NonConformantMarkup = ns_rdfa["DocumentError"] 243UnresolvablePrefix = ns_rdfa["UnresolvedCURIE"] 244UnresolvableReference = ns_rdfa["UnresolvedCURIE"] 245UnresolvableTerm = ns_rdfa["UnresolvedTerm"] 246VocabReferenceError = ns_rdfa["VocabReferenceError"] 247PrefixRedefinitionWarning = ns_rdfa["PrefixRedefinition"] 248 249FileReferenceError = ns_distill["FileReferenceError"] 250HTError = ns_distill["HTTPError"] 251IncorrectPrefixDefinition = ns_distill["IncorrectPrefixDefinition"] 252IncorrectBlankNodeUsage = ns_distill["IncorrectBlankNodeUsage"] 253IncorrectLiteral = ns_distill["IncorrectLiteral"] 254 255# Error message texts 256err_no_blank_node = "Blank node in %s position is not allowed; ignored" 257 258err_redefining_URI_as_prefix = "'%s' a registered or an otherwise used URI scheme, but is defined as a prefix here; is this a mistake? 
(see, eg, http://en.wikipedia.org/wiki/URI_scheme or http://www.iana.org/assignments/uri-schemes.html for further information for most of the URI schemes)" 259err_xmlns_deprecated = "The usage of 'xmlns' for prefix definition is deprecated; please use the 'prefix' attribute instead (definition for '%s')" 260err_bnode_local_prefix = "The '_' local CURIE prefix is reserved for blank nodes, and cannot be defined as a prefix" 261err_col_local_prefix = "The character ':' is not valid in a CURIE Prefix, and cannot be used in a prefix definition (definition for '%s')" 262err_missing_URI_prefix = "Missing URI in prefix declaration for '%s' (in '%s')" 263err_invalid_prefix = "Invalid prefix declaration '%s' (in '%s')" 264err_no_default_prefix = "Default prefix cannot be changed (in '%s')" 265err_prefix_and_xmlns = "@prefix setting for '%s' overrides the 'xmlns:%s' setting; may be a source of problem if same file is run through RDFa 1.0" 266err_non_ncname_prefix = "Non NCNAME '%s' in prefix definition (in '%s'); ignored" 267err_absolute_reference = "CURIE Reference part contains an authority part: %s (in '%s'); ignored" 268err_query_reference = "CURIE Reference query part contains an unauthorized character: %s (in '%s'); ignored" 269err_fragment_reference = "CURIE Reference fragment part contains an unauthorized character: %s (in '%s'); ignored" 270err_lang = "There is a problem with language setting; either both xml:lang and lang used on an element with different values, or, for (X)HTML5, only xml:lang is used." 271err_URI_scheme = "Unusual URI scheme used in <%s>; may that be a mistake, e.g., resulting from using an undefined CURIE prefix or an incorrect CURIE?" 
272err_illegal_safe_CURIE = "Illegal safe CURIE: %s; ignored" 273err_no_CURIE_in_safe_CURIE = "Safe CURIE is used, but the value does not correspond to a defined CURIE: [%s]; ignored" 274err_undefined_terms = "'%s' is used as a term, but has not been defined as such; ignored" 275err_non_legal_CURIE_ref = "Relative URI is not allowed in this position (or not a legal CURIE reference) '%s'; ignored" 276err_undefined_CURIE = "Undefined CURIE: '%s'; ignored" 277err_prefix_redefinition = "Prefix '%s' (defined in the initial RDFa context or in an ancestor) is redefined" 278 279err_unusual_char_in_URI = "Unusual character in uri: %s; possible error?" 280 281############################################################################################# 282 283from .state import ExecutionContext 284from .parse import parse_one_node 285from .options import Options 286from .transform import top_about, empty_safe_curie, vocab_for_role 287from .utils import URIOpener 288from .host import HostLanguage, MediaTypes, preferred_suffixes, content_to_host_language 289 290# Environment variable used to characterize cache directories for RDFa vocabulary files. 291CACHE_DIR_VAR = "PyRdfaCacheDir" 292 293# current "official" version of RDFa that this package implements. This can be changed at the invocation of the package 294rdfa_current_version = "1.1" 295 296# I removed schemes that would not appear as a prefix anyway, like iris.beep 297# http://en.wikipedia.org/wiki/URI_scheme seems to be a good source of information 298# as well as http://www.iana.org/assignments/uri-schemes.html 299# There are some overlaps here, but better more than not enough... 

# This comes from wikipedia
registered_iana_schemes = [
    "aaa","aaas","acap","cap","cid","crid","data","dav","dict","did","dns","fax","file", "ftp","geo","go",
    "gopher","h323","http","https","iax","icap","im","imap","info","ipp","iris","ldap", "lsid",
    "mailto","mid","modem","msrp","msrps", "mtqp", "mupdate","news","nfs","nntp","opaquelocktoken",
    "pop","pres", "prospero","rstp","rsync", "service","shttp","sieve","sip","sips", "sms", "snmp", "soap", "tag",
    "tel","telnet", "tftp", "thismessage","tn3270","tip","tv","urn","vemmi","wais","ws", "wss", "xmpp"
]

# This comes from wikipedia, too
unofficial_common = [
    "about", "adiumxtra", "aim", "apt", "afp", "aw", "bitcoin", "bolo", "callto", "chrome", "coap",
    "content", "cvs", "doi", "ed2k", "facetime", "feed", "finger", "fish", "git", "gg",
    "gizmoproject", "gtalk", "irc", "ircs", "irc6", "itms", "jar", "javascript",
    "keyparc", "lastfm", "ldaps", "magnet", "maps", "market", "message", "mms",
    "msnim", "mumble", "mvn", "notes", "palm", "paparazzi", "psync", "rmi",
    "secondlife", "sgn", "skype", "spotify", "ssh", "sftp", "smb", "soldat",
    "steam", "svn", "teamspeak", "things", "udb", "unreal", "ut2004",
    "ventrillo", "view-source", "webcal", "wtai", "wyciwyg", "xfire", "xri", "ymsgr"
]

# These come from the IANA page
historical_iana_schemes = [
    "fax", "mailserver", "modem", "pack", "prospero", "snews", "videotex", "wais"
]

provisional_iana_schemes = [
    "afs", "dtn", "dvb", "icon", "ipn", "jms", "oid", "rsync", "ni"
]

other_used_schemes = [
    "hdl", "isbn", "issn", "mstp", "rtmp", "rtspu", "stp"
]

uri_schemes = registered_iana_schemes + unofficial_common + historical_iana_schemes + provisional_iana_schemes + other_used_schemes

# List of built-in transformers that are to be run regardless, because they are part of the RDFa spec
builtInTransformers = [
    empty_safe_curie, top_about, vocab_for_role
]

#########################################################################################################
class pyRdfa:
    """Main processing class for the distiller

    @ivar options: an instance of the L{Options} class
    @ivar media_type: the preferred default media type, possibly set at initialization
    @ivar base: the base value, possibly set at initialization
    @ivar http_status: HTTP Status, to be returned when the package is used via a CGI entry. Initially set to 200, may be modified by exception handlers
    """
    def __init__(self, options = None, base = "", media_type = "", rdfa_version = None):
        """
        @keyword options: Options for the distiller
        @type options: L{Options}
        @keyword base: URI for the default "base" value (usually the URI of the file to be processed)
        @keyword media_type: explicit setting of the preferred media type (a.k.a. content type) of the RDFa source
        @keyword rdfa_version: the RDFa version that should be used. If not set, the value of the global L{rdfa_current_version} variable is used
        """
        self.http_status = 200

        self.base = base
        if base == "":
            self.required_base = None
        else:
            self.required_base = base
        self.charset = None

        # predefined content type
        self.media_type = media_type

        if options == None:
            self.options = Options()
        else:
            self.options = options

        if media_type != "":
            self.options.set_host_language(self.media_type)

        if rdfa_version is not None:
            self.rdfa_version = rdfa_version
        else:
            self.rdfa_version = None

    def _get_input(self, name):
        """
        Tries to guess whether "name" is a URI or a string (for a file); it then tries to open this source accordingly,
        returning a file-like object. If name is neither of these, it returns the input argument (which should
        be, supposedly, a file-like object already).

        If the media type has not been set explicitly at initialization of this instance,
        the method also sets the media_type based on the HTTP GET response or the suffix of the file. See
        L{host.preferred_suffixes} for the suffix to media type mapping.

        @param name: identifier of the input source
        @type name: string or a file-like object
        @return: a file-like object if opening "name" is possible and successful, "name" otherwise
        """

        isstring = isinstance(name, str)

        try:
            if isstring:
                # check if this is a URI, ie, if there is a valid 'scheme' part
                # otherwise it is considered to be a simple file
                if urlparse(name)[0] != "":
                    url_request = URIOpener(name, {}, self.options.certifi_verify)
                    self.base = url_request.location
                    if self.media_type == "":
                        if url_request.content_type in content_to_host_language:
                            self.media_type = url_request.content_type
                        else:
                            self.media_type = MediaTypes.xml
                        self.options.set_host_language(self.media_type)
                    self.charset = url_request.charset
                    if self.required_base == None:
                        self.required_base = name
                    return url_request.data
                else:
                    # Creating a File URI for this thing
                    if self.required_base == None:
                        self.required_base = "file://" + os.path.join(os.getcwd(),name)
                    if self.media_type == "":
                        self.media_type = MediaTypes.xml
                        # see if the default should be overwritten
                        for suffix in preferred_suffixes:
                            if name.endswith(suffix):
                                self.media_type = preferred_suffixes[suffix]
                                self.charset = 'utf-8'
                                break
                        self.options.set_host_language(self.media_type)
                    return open(name)
            else:
                return name
        except HTTPError:
            raise sys.exc_info()[1]
        except RDFaError as e:
            raise e
        except:
            _type, value, _traceback = sys.exc_info()
            raise FailedSource(value)

    @staticmethod
    def _validate_output_format(outputFormat):
        """
        Malicious actors may create XSS style issues by using an illegal output format... better be careful
        """
        # protection against possible malicious URL call
        if outputFormat not in ["turtle", "n3", "xml", "pretty-xml", "nt", "json-ld"]:
            outputFormat = "turtle"
        return outputFormat

    ####################################################################################################################
    # Externally used methods
    #
    def graph_from_DOM(self, dom, graph = None, pgraph = None):
        """
        Extract the RDF Graph from a DOM tree. This is where the real processing happens. All other methods get down to this
        one, eventually (e.g., after opening a URI and parsing it into a DOM).
        @param dom: a DOM Node element, the top level entry node for the whole tree (i.e., the C{dom.documentElement} is used to initiate processing down the node hierarchy)
        @keyword graph: an RDF Graph (if None, then a new one is created)
        @type graph: rdflib Graph instance.
        @keyword pgraph: an RDF Graph to hold (possibly) the processor graph content. If None, and the error/warning triples are to be generated, they will be added to the returned graph. Otherwise they are stored in this graph.
        @type pgraph: rdflib Graph instance
        @return: an RDF Graph
        @rtype: rdflib Graph instance
        """
        def copyGraph(tog, fromg):
            for t in fromg:
                tog.add(t)
            for k,ns in fromg.namespaces():
                tog.bind(k,ns)

        if graph == None:
            # Create the RDF Graph, that will contain the return triples...
            graph = Graph()

        # this will collect the content, the 'default graph', as called in the RDFa spec
        default_graph = Graph()

        # get the DOM tree
        topElement = dom.documentElement

        # Create the initial state. This takes care of things
        # like base, top level namespace settings, etc.
        state = ExecutionContext(topElement, default_graph, base=self.required_base if self.required_base != None else "", options=self.options, rdfa_version=self.rdfa_version)

        # Perform the built-in and external transformations on the HTML tree.
        for trans in self.options.transformers + builtInTransformers:
            trans(topElement, self.options, state)

        # This may have changed if the state setting detected an explicit version information:
        self.rdfa_version = state.rdfa_version

        # The top level subject starts with the current document; this
        # is used by the recursion
        # this function is the real workhorse
        parse_one_node(topElement, default_graph, None, state, [])

        # Massage the output graph in terms of rdfa:Pattern and rdfa:copy
        handle_prototypes(default_graph)

        # If the RDFS expansion has to be made, here is the place...
        if self.options.vocab_expansion:
            from .rdfs.process import process_rdfa_sem
            process_rdfa_sem(default_graph, self.options)

        # Experimental feature: nothing for now, this is kept as a placeholder
        if self.options.experimental_features:
            pass

        # What should be returned depends on the way the options have been set up
        if self.options.output_default_graph:
            copyGraph(graph, default_graph)
            if self.options.output_processor_graph:
                if pgraph != None:
                    copyGraph(pgraph, self.options.processor_graph.graph)
                else:
                    copyGraph(graph, self.options.processor_graph.graph)
        elif self.options.output_processor_graph:
            if pgraph != None:
                copyGraph(pgraph, self.options.processor_graph.graph)
            else:
                copyGraph(graph, self.options.processor_graph.graph)

        # this is necessary if several DOM trees are handled in a row...
        self.options.reset_processor_graph()

        return graph

    def graph_from_source(self, name, graph = None, rdfOutput = False, pgraph = None):
        """
        Extract an RDF graph from an RDFa source. The source is parsed, the RDF extracted, and the RDFa Graph is
        returned. This is a front-end to the L{pyRdfa.graph_from_DOM} method.

        @param name: a URI, a file name, or a file-like object
        @param graph: rdflib Graph instance. If None, a new one is created.
        @param pgraph: rdflib Graph instance for the processor graph. If None, and the error/warning triples are to be generated, they will be added to the returned graph. Otherwise they are stored in this graph.
        @param rdfOutput: whether runtime exceptions should be turned into RDF and returned as part of the processor graph
        @return: an RDF Graph
        @rtype: rdflib Graph instance
        """
        def copyErrors(tog, options):
            if tog == None:
                tog = Graph()
            if options.output_processor_graph:
                for t in options.processor_graph.graph:
                    tog.add(t)
                    if pgraph != None : pgraph.add(t)
                for k,ns in options.processor_graph.graph.namespaces():
                    tog.bind(k,ns)
                    if pgraph != None : pgraph.bind(k,ns)
            options.reset_processor_graph()
            return tog

        isstring = isinstance(name, str)

        try:
            # First, open the source... Possible HTTP errors are returned as error triples
            stream = None
            try:
                stream = self._get_input(name)
            except FailedSource as ex:
                _f = sys.exc_info()[1]
                self.http_status = 400
                if not rdfOutput : raise Exception(ex.msg)
                err = self.options.add_error(ex.msg, FileReferenceError, name)
                self.options.processor_graph.add_http_context(err, 400)
                return copyErrors(graph, self.options)
            except HTTPError as ex:
                h = sys.exc_info()[1]
                self.http_status = h.http_code
                if not rdfOutput : raise Exception(ex.msg)
                err = self.options.add_error("HTTP Error: %s (%s)" % (h.http_code,h.msg), HTError, name)
                self.options.processor_graph.add_http_context(err, h.http_code)
                return copyErrors(graph, self.options)
            except RDFaError as ex:
                e = sys.exc_info()[1]
                self.http_status = 500
                # Something nasty happened:-(
                if not rdfOutput : raise Exception(ex.msg)
                err = self.options.add_error(str(ex.msg), context = name)
                self.options.processor_graph.add_http_context(err, 500)
                return copyErrors(graph, self.options)
            except Exception as ex:
                e = sys.exc_info()[1]
                self.http_status = 500
                # Something nasty happened:-(
                if not rdfOutput : raise ex
                err = self.options.add_error(str(e), context = name)
                self.options.processor_graph.add_http_context(err, 500)
                return copyErrors(graph, self.options)

            dom = None
            try:
                msg = ""
                parser = None
                if self.options.host_language == HostLanguage.html5:
                    import warnings
                    warnings.filterwarnings("ignore", category=DeprecationWarning)
                    from html5lib import HTMLParser, treebuilders
                    parser = HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
                    if self.charset:
                        # This means the HTTP header has provided a charset, or the
                        # file is a local file when we suppose it to be a utf-8
                        #
                        # 2020-01-20, Ivan Herman
                        # for some reasons the python3 version ran into a problem with this html5lib call
                        # the override_encoding argument was not accepted.
                        # dom = parser.parse(stream, override_encoding=self.charset)
                        dom = parser.parse(stream)
                    else:
                        # No charset set. The HTMLLib parser tries to sniff into
                        # the file to find a meta header for the charset; if that
                        # works, fine, otherwise it falls back on window-...
                        dom = parser.parse(stream)

                    try:
                        if isstring:
                            stream.close()
                            stream = self._get_input(name)
                        else:
                            stream.seek(0)
                        from .host import adjust_html_version
                        self.rdfa_version = adjust_html_version(stream, self.rdfa_version)
                    except:
                        # if anything goes wrong, it is not really important; rdfa version stays what it was...
                        pass

                else:
                    from .host import adjust_xhtml_and_version
                    if isinstance(stream, IOBase):
                        parse = xml.dom.minidom.parse
                    else:
                        parse = xml.dom.minidom.parseString
                    dom = parse(stream)
                    adjusted_host_language, version = adjust_xhtml_and_version(dom, self.options.host_language, self.rdfa_version)
                    self.options.host_language = adjusted_host_language
                    self.rdfa_version = version
            except ImportError:
                msg = "HTML5 parser not available. Try installing html5lib <http://code.google.com/p/html5lib>"
                raise ImportError(msg)
            except Exception:
                e = sys.exc_info()[1]
                # These are various parsing exceptions. Per spec, this is a case when
                # error triples MUST be returned, ie, the usage of rdfOutput (which switches between an HTML formatted
                # return page or a graph with error triples) does not apply
                err = self.options.add_error(str(e), context = name)
                self.http_status = 400
                self.options.processor_graph.add_http_context(err, 400)
                return copyErrors(graph, self.options)

            # If we got here, we have a DOM tree to operate on...
            return self.graph_from_DOM(dom, graph, pgraph)
        except Exception:
            # Something nasty happened during the generation of the graph...
            (a,b,c) = sys.exc_info()
            sys.excepthook(a,b,c)
            if isinstance(b, ImportError):
                self.http_status = None
            else:
                self.http_status = 500
            if not rdfOutput : raise b
            err = self.options.add_error(str(b), context = name)
            self.options.processor_graph.add_http_context(err, 500)
            return copyErrors(graph, self.options)

    def rdf_from_sources(self, names, outputFormat = "turtle", rdfOutput = False):
        """
        Extract an RDF graph from a list of RDFa sources and serialize them in one graph. The sources are parsed, the RDF
        extracted, and serialization is done in the specified format.
        @param names: list of sources, each can be a URI, a file name, or a file-like object
        @keyword outputFormat: serialization format. Can be one of "turtle", "n3", "xml", "pretty-xml", "nt", "json" or "json-ld". "turtle" and "n3", "xml" and "pretty-xml", and "json" and "json-ld" are synonyms, respectively. Note that the JSON-LD serialization works with RDFLib 3.* only.
        @keyword rdfOutput: controls what happens in case an exception is raised. If the value is False, the caller is responsible for handling it; otherwise a graph is returned with an error message included in the processor graph
        @type rdfOutput: boolean
        @return: a serialized RDF Graph
        @rtype: string
        """
        # protection against possible malicious URL call
        outputFormat = pyRdfa._validate_output_format(outputFormat)

        # This is better because it gives access to the various, non-standard serializations
        # If it does not work because the extras are not installed, fall back to the standard
        # rdflib distribution...
        graph = Graph()

        # graph.bind("xsd", Namespace('http://www.w3.org/2001/XMLSchema#'))
        # the value of rdfOutput determines the reaction on exceptions...
        for name in names:
            self.graph_from_source(name, graph, rdfOutput)

        # Stupid difference between python2 and python3...
        return str(graph.serialize(format=outputFormat), encoding='utf-8')


    def rdf_from_source(self, name, outputFormat = "turtle", rdfOutput = False):
        """
        Extract an RDF graph from an RDFa source and serialize it in one graph. The source is parsed, the RDF
        extracted, and serialization is done in the specified format.
        @param name: a URI, a file name, or a file-like object
        @keyword outputFormat: serialization format. Can be one of "turtle", "n3", "xml", "pretty-xml", "nt", or "json-ld". "turtle" and "n3", or "xml" and "pretty-xml" are synonyms, respectively. Note that the JSON-LD serialization works with RDFLib 3.* only.
        @keyword rdfOutput: controls what happens in case an exception is raised. If the value is False, the caller is responsible for handling it; otherwise a graph is returned with an error message included in the processor graph
        @type rdfOutput: boolean
        @return: a serialized RDF Graph
        @rtype: string
        """
        return self.rdf_from_sources([name], outputFormat, rdfOutput)

################################################# CGI Entry point
def processURI(uri, outputFormat, form={}):
    """The standard processing of an RDFa URI, with options given in a form; used as an entry point from a CGI call.

    The call accepts extra form options (i.e., HTTP GET options) as follows:

     - C{graph=[output|processor|output,processor|processor,output]} specifying which graphs are returned. Default: C{output}
     - C{space_preserve=[true|false]} : whether white space in plain literals is preserved rather than normalized. Default: C{true}
     - C{rdfa_version} provides the RDFa version that should be used for distilling. The string should be of the form "1.0" or "1.1". Default is the highest version the current package implements, currently "1.1"
     - C{host_language=[xhtml,html,xml]} : the host language. Used when files are uploaded or text is added verbatim, otherwise the HTTP return header should be used. Default C{xml}
     - C{embedded_rdf=[true|false]} : whether embedded turtle or RDF/XML content should be added to the output graph. Default: C{false}
     - C{vocab_expansion=[true|false]} : whether the vocabularies should be expanded through the restricted RDFS entailment. Default: C{false}
     - C{vocab_cache=[true|false]} : whether vocab caching should be performed or whether it should be ignored and vocabulary files should be picked up every time. Default: C{true}
     - C{vocab_cache_report=[true|false]} : whether vocab caching details should be reported. Default: C{false}
     - C{vocab_cache_refresh=[true|false]} : whether vocab caches have to be regenerated every time. Default: C{false}
     - C{rdfa_lite=[true|false]} : whether warnings should be generated for non RDFa Lite attribute usage. Default: C{false}
     - C{certifi_verify=[true|false]} : whether the SSL certificate needs to be verified. Default: C{true}

    @param uri: URI to access. Note that the C{text:} and C{uploaded:} fake URI values are treated separately; the former is for textual input (in which case a StringIO is used to get the data) and the latter is for an uploaded file, where the form gives access to the file directly.
    @param outputFormat: serialization format, as defined by the package. Currently "xml", "turtle", "nt", or "json". Default is "turtle", also used if any other string is given.
    @param form: extra call options (from the CGI call) to set up the local options
    @type form: cgi FieldStorage instance
    @return: serialized graph
    @rtype: string
    """
    def _get_option(param, compare_value, default):
        param_old = param.replace('_', '-')
        if param in list(form.keys()):
            val = form.getfirst(param).lower()
            return val == compare_value
        elif param_old in list(form.keys()):
            # this is to ensure the old style parameters are still valid...
            # in the old days I used '-' in the parameters, the standard favours '_'
            val = form.getfirst(param_old).lower()
            return val == compare_value
        else:
            return default

    if uri == "uploaded:":
        stream = form["uploaded"].file
        base = ""
    elif uri == "text:":
        stream = StringIO(form.getfirst("text"))
        base = ""
    else:
        stream = uri
        base = uri

    if "rdfa_version" in list(form.keys()):
        rdfa_version = form.getfirst("rdfa_version")
    else:
        rdfa_version = None

    # working through the possible options
    # Host language: HTML, XHTML, or XML
    # Note that these options should be used for the upload and inline version only in case of a form
    # for real uris the returned content type should be used
    if "host_language" in list(form.keys()):
        if form.getfirst("host_language").lower() == "xhtml":
            media_type = MediaTypes.xhtml
        elif form.getfirst("host_language").lower() == "html":
            media_type = MediaTypes.html
        elif form.getfirst("host_language").lower() == "svg":
            media_type = MediaTypes.svg
        elif form.getfirst("host_language").lower() == "atom":
            media_type = MediaTypes.atom
        else:
            media_type = MediaTypes.xml
    else:
        media_type = ""

    transformers = []

    check_lite = "rdfa_lite" in list(form.keys()) and form.getfirst("rdfa_lite").lower() == "true"

    # The code below is left for backward compatibility only. In fact, these options are not exposed any more,
    # they are not really in use
    from .transform.metaname import meta_transform
    from .transform.OpenID import OpenID_transform
    from .transform.DublinCore import DC_transform

    if "extras" in list(form.keys()) and form.getfirst("extras").lower() == "true":
        for t in [OpenID_transform, DC_transform, meta_transform]:
            transformers.append(t)
    else:
        if "extra-meta" in list(form.keys()) and form.getfirst("extra-meta").lower() == "true":
            transformers.append(meta_transform)
        if "extra-openid" in list(form.keys()) and form.getfirst("extra-openid").lower() == "true":
            transformers.append(OpenID_transform)
        if "extra-dc" in list(form.keys()) and form.getfirst("extra-dc").lower() == "true":
            transformers.append(DC_transform)

    output_default_graph = True
    output_processor_graph = False
    # Note that I use the 'graph' and the 'rdfagraph' form keys here. Reason is that
    # I used 'graph' in the previous versions, including the RDFa 1.0 processor,
    # so if I removed that altogether that would create backward incompatibilities
    # On the other hand, the RDFa 1.1 doc clearly refers to 'rdfagraph' as the standard
    # key.
    a = None
    if "graph" in list(form.keys()):
        a = form.getfirst("graph").lower()
    elif "rdfagraph" in list(form.keys()):
        a = form.getfirst("rdfagraph").lower()
    if a != None:
        if a == "processor":
            output_default_graph = False
            output_processor_graph = True
        elif a == "processor,output" or a == "output,processor":
            output_processor_graph = True

    embedded_rdf = _get_option( "embedded_rdf", "true", False)
    space_preserve = _get_option( "space_preserve", "true", True)
    vocab_cache = _get_option( "vocab_cache", "true", True)
    vocab_cache_report = _get_option( "vocab_cache_report", "true", False)
    refresh_vocab_cache = _get_option( "vocab_cache_refresh", "true", False)
    vocab_expansion = _get_option( "vocab_expansion", "true", False)
    certifi_verify = _get_option( "certifi_verify", "true", True)
    if vocab_cache_report:
        output_processor_graph = True

    options = Options(output_default_graph = output_default_graph,
                      output_processor_graph = output_processor_graph,
                      space_preserve = space_preserve,
                      transformers = transformers,
                      vocab_cache = vocab_cache,
                      vocab_cache_report = vocab_cache_report,
                      refresh_vocab_cache = refresh_vocab_cache,
                      vocab_expansion = vocab_expansion,
                      embedded_rdf = embedded_rdf,
                      check_lite = check_lite,
                      certifi_verify = certifi_verify)

    processor = pyRdfa(options = options, base = base, media_type = media_type, rdfa_version = rdfa_version)

    # Decide the output format; the issue is what should happen in case of a top level error like an inaccessibility of
    # the html source: should a graph be returned or an HTML page with an error message?

    # decide whether HTML or RDF should be sent.
    htmlOutput = False
    #if 'HTTP_ACCEPT' in os.environ:
    #    acc = os.environ['HTTP_ACCEPT']
    #    possibilities = ['text/html',
    #                     'application/rdf+xml',
    #                     'text/turtle; charset=utf-8',
    #                     'application/json',
    #                     'application/ld+json',
    #                     'text/rdf+n3']
    #
    #    # this nice module does content negotiation and returns the preferred format
    #    sg = acceptable_content_type(acc, possibilities)
    #    htmlOutput = (sg != None and sg[0] == content_type('text/html'))
    #    os.environ['rdfaerror'] = 'true'

    # This is really for testing purposes only, it is an unpublished flag to force RDF output no
    # matter what
    import html
    try:
        outputFormat = pyRdfa._validate_output_format(outputFormat)
        if outputFormat == "n3":
            retval = 'Content-Type: text/rdf+n3; charset=utf-8\n'
        elif outputFormat == "nt" or outputFormat == "turtle":
            retval = 'Content-Type: text/turtle; charset=utf-8\n'
        elif outputFormat == "json-ld" or outputFormat == "json":
            retval = 'Content-Type: application/ld+json; charset=utf-8\n'
        else:
            retval = 'Content-Type: application/rdf+xml; charset=utf-8\n'
        graph = processor.rdf_from_source(stream, outputFormat, rdfOutput = ("forceRDFOutput" in list(form.keys())) or not htmlOutput)
        retval += '\n'
        retval += graph
        return retval
    except HTTPError:
        _type, h, _traceback = sys.exc_info()

        retval = 'Content-type: text/html; charset=utf-8\nStatus: %s \n\n' % h.http_code
        retval += "<html>\n"
        retval += "<head>\n"
        retval += "<title>HTTP Error in distilling RDFa content</title>\n"
        retval += "</head><body>\n"
        retval += "<h1>HTTP Error in distilling RDFa content</h1>\n"
        retval += "<p>HTTP Error: %s (%s)</p>\n" % (h.http_code, h.msg)
        retval += "<p>On URI: <code>'%s'</code></p>\n" % html.escape(uri)
        retval +="</body>\n"
        retval +="</html>\n"
        return retval
    except:
        # This branch should occur only if an exception is really raised, ie, if it is not turned
        # into a graph value.
        _type, value, _traceback = sys.exc_info()

        import traceback

        retval = 'Content-type: text/html; charset=utf-8\nStatus: %s\n\n' % processor.http_status
        retval += "<html>\n"
        retval += "<head>\n"
        retval += "<title>Exception in RDFa processing</title>\n"
        retval += "</head><body>\n"
        retval += "<h1>Exception in distilling RDFa</h1>\n"
        retval += "<pre>\n"
        strio = StringIO()
        traceback.print_exc(file=strio)
        retval += strio.getvalue()
        retval +="</pre>\n"
        retval +="<pre>%s</pre>\n" % value
        retval +="<h1>Distiller request details</h1>\n"
        retval +="<dl>\n"
        if uri == "text:" and "text" in form and form["text"].value != None and len(form["text"].value.strip()) != 0:
            retval +="<dt>Text input:</dt><dd>%s</dd>\n" % html.escape(form["text"].value).replace('\n','<br/>')
        elif uri == "uploaded:":
            retval +="<dt>Uploaded file</dt>\n"
        else:
            retval +="<dt>URI received:</dt><dd><code>'%s'</code></dd>\n" % html.escape(uri)
        if "host_language" in list(form.keys()):
            retval +="<dt>Media Type:</dt><dd>%s</dd>\n" % html.escape(media_type)
        if "graph" in list(form.keys()):
            retval +="<dt>Requested graphs:</dt><dd>%s</dd>\n" % html.escape(form.getfirst("graph").lower())
        else:
            retval +="<dt>Requested graphs:</dt><dd>default</dd>\n"
        retval +="<dt>Output serialization format:</dt><dd> %s</dd>\n" % outputFormat
        if "space_preserve" in form : retval +="<dt>Space preserve:</dt><dd> %s</dd>\n" % html.escape(form["space_preserve"].value)
        retval +="</dl>\n"
        retval +="</body>\n"
        retval +="</html>\n"
        return retval
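The format handling at the end of processURI can be read as two steps: whitelist the requested format, then pick a Content-Type header. The following is a minimal stand-alone sketch of that logic, not the package's own code; `ALLOWED_FORMATS`, `validate_output_format`, and `content_type_for` are illustrative names that only mirror the listing above. One consequence that seems worth noting: "json" is not in the whitelist, so it falls back to "turtle" before the header is chosen.

```python
# Stand-alone sketch mirroring the listing above; all names here are illustrative.
ALLOWED_FORMATS = ["turtle", "n3", "xml", "pretty-xml", "nt", "json-ld"]


def validate_output_format(output_format):
    """Fall back to Turtle for anything outside the whitelist."""
    return output_format if output_format in ALLOWED_FORMATS else "turtle"


def content_type_for(output_format):
    """Return the Content-Type value chosen for a (validated) format."""
    fmt = validate_output_format(output_format)
    if fmt == "n3":
        return "text/rdf+n3; charset=utf-8"
    if fmt in ("nt", "turtle"):
        return "text/turtle; charset=utf-8"
    if fmt == "json-ld":
        return "application/ld+json; charset=utf-8"
    return "application/rdf+xml; charset=utf-8"


for requested in ["json-ld", "pretty-xml", "json", "<script>"]:
    print(requested, "->", validate_output_format(requested), "|", content_type_for(requested))
```

Callers who want JSON-LD output should therefore pass "json-ld" rather than "json", at least as far as this listing shows.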
class RDFaError(Exception):
    """Superclass of exceptions representing error conditions defined by the RDFa 1.1 specification.
    It does not add any new functionality to the
    Exception class."""
    def __init__(self, msg):
        self.msg = msg
        Exception.__init__(self)
Superclass of exceptions representing error conditions defined by the RDFa 1.1 specification. It does not add any new functionality to the Exception class.
Inherited Members
- builtins.BaseException
- with_traceback
- args
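Because the constructor stores the message in `msg` but calls `Exception.__init__` without arguments, `str()` on such an exception yields an empty string. A short sketch of the consequence, using a local stand-in class copied from the source shown above rather than an import from the package:

```python
class RDFaError(Exception):
    # Local stand-in that copies the constructor shown above.
    def __init__(self, msg):
        self.msg = msg
        Exception.__init__(self)


try:
    raise RDFaError("Undefined CURIE: 'foo:bar'; ignored")
except RDFaError as err:
    text_via_str = str(err)  # empty: no arguments reached Exception.__init__
    text_via_msg = err.msg   # the message as passed to the constructor
    print(repr(text_via_str), repr(text_via_msg))
```

When reporting these errors, reading the `msg` attribute is thus the safer choice.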
class FailedSource(RDFaError):
    """Raised when the original source cannot be accessed. It does not add any new functionality to the
    Exception class."""
    def __init__(self, msg, http_code = None):
        self.msg = msg
        self.http_code = http_code
        RDFaError.__init__(self, msg)
Raised when the original source cannot be accessed. It does not add any new functionality to the Exception class.
Inherited Members
- builtins.BaseException
- with_traceback
- args
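In the module source, `_get_input` wraps any unexpected failure as `FailedSource(value)`, where `value` is the original exception object. The sketch below imitates that pattern with local stand-in classes; `open_local` is a hypothetical helper written for this illustration, and it catches `Exception` where the original uses a bare `except:`.

```python
import os
import sys
import tempfile


class RDFaError(Exception):
    # Local stand-in copying the constructor from the source.
    def __init__(self, msg):
        self.msg = msg
        Exception.__init__(self)


class FailedSource(RDFaError):
    # Local stand-in copying the constructor from the source.
    def __init__(self, msg, http_code=None):
        self.msg = msg
        self.http_code = http_code
        RDFaError.__init__(self, msg)


def open_local(name):
    """Hypothetical helper: wrap a failure to open `name` the way _get_input does."""
    try:
        return open(name)
    except Exception:
        _type, value, _traceback = sys.exc_info()
        raise FailedSource(value)


with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "no-such-file.html")
    try:
        open_local(missing)
    except FailedSource as ex:
        wrapped = ex.msg     # the original exception object, not a string
        code = ex.http_code  # None unless the raiser supplied one
        print(type(wrapped).__name__, code)
```

Under these assumptions, `msg` may hold an exception instance, so callers that need text should convert it with `str()`.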
class HTTPError(RDFaError):
    """Raised when HTTP problems are detected. It does not add any new functionality to the
    Exception class."""
    def __init__(self, http_msg, http_code):
        self.msg = http_msg
        self.http_code = http_code
        RDFaError.__init__(self,http_msg)
Raised when HTTP problems are detected. It does not add any new functionality to the Exception class.
Inherited Members
- builtins.BaseException
- with_traceback
- args
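While opening a source, `graph_from_source` records a different `http_status` for each exception type: 400 for FailedSource, the carried code for HTTPError, and 500 for anything else. The sketch below condenses that mapping into one function; `status_for` is a hypothetical helper and the classes are local stand-ins. The order of the `except` clauses matters, because both FailedSource and HTTPError are subclasses of RDFaError.

```python
class RDFaError(Exception):
    # Local stand-in copying the constructor from the source.
    def __init__(self, msg):
        self.msg = msg
        Exception.__init__(self)


class FailedSource(RDFaError):
    def __init__(self, msg, http_code=None):
        self.msg = msg
        self.http_code = http_code
        RDFaError.__init__(self, msg)


class HTTPError(RDFaError):
    def __init__(self, http_msg, http_code):
        self.msg = http_msg
        self.http_code = http_code
        RDFaError.__init__(self, http_msg)


def status_for(exc):
    """Hypothetical helper: the status recorded when `exc` is raised on opening a source."""
    try:
        raise exc
    except FailedSource:
        return 400
    except HTTPError as h:
        return h.http_code
    except RDFaError:
        return 500
    except Exception:
        return 500


for sample in [HTTPError("Not Found", 404), FailedSource("no such file"), ValueError("boom")]:
    print(type(sample).__name__, "->", status_for(sample))
```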
class ProcessingError(RDFaError):
    """Error found during processing. It does not add any new functionality to the
    Exception class."""
    pass
Error found during processing. It does not add any new functionality to the Exception class.
class pyRdfaError(Exception):
    """Superclass of exceptions representing error conditions outside the RDFa 1.1 specification."""
    pass
Superclass of exceptions representing error conditions outside the RDFa 1.1 specification.
Inherited Members
- builtins.Exception
- Exception
- builtins.BaseException
- with_traceback
- args
class pyRdfa:
    """Main processing class for the distiller

    @ivar options: an instance of the L{Options} class
    @ivar media_type: the preferred default media type, possibly set at initialization
    @ivar base: the base value, possibly set at initialization
    @ivar http_status: HTTP Status, to be returned when the package is used via a CGI entry. Initially set to 200, may be modified by exception handlers
    """
    def __init__(self, options = None, base = "", media_type = "", rdfa_version = None):
        """
        @keyword options: Options for the distiller
        @type options: L{Options}
        @keyword base: URI for the default "base" value (usually the URI of the file to be processed)
        @keyword media_type: explicit setting of the preferred media type (a.k.a. content type) of the RDFa source
        @keyword rdfa_version: the RDFa version that should be used. If not set, the value of the global L{rdfa_current_version} variable is used
        """
        self.http_status = 200

        self.base = base
        if base == "":
            self.required_base = None
        else:
            self.required_base = base
        self.charset = None

        # predefined content type
        self.media_type = media_type

        if options == None:
            self.options = Options()
        else:
            self.options = options

        if media_type != "":
            self.options.set_host_language(self.media_type)

        if rdfa_version is not None:
            self.rdfa_version = rdfa_version
        else:
            self.rdfa_version = None

    def _get_input(self, name):
        """
        Trying to guess whether "name" is a URI or a string (for a file); it then tries to open this source accordingly,
        returning a file-like object. If name is none of these, it returns the input argument (that should
        be, supposedly, a file-like object already).

        If the media type has not been set explicitly at initialization of this instance,
        the method also sets the media_type based on the HTTP GET response or the suffix of the file. See
        L{host.preferred_suffixes} for the suffix to media type mapping.

        @param name: identifier of the input source
        @type name: string or a file-like object
        @return: a file like object if opening "name" is possible and successful, "name" otherwise
        """

        isstring = isinstance(name, str)

        try:
            if isstring:
                # check if this is a URI, ie, if there is a valid 'scheme' part
                # otherwise it is considered to be a simple file
                if urlparse(name)[0] != "":
                    url_request = URIOpener(name, {}, self.options.certifi_verify)
                    self.base = url_request.location
                    if self.media_type == "":
                        if url_request.content_type in content_to_host_language:
                            self.media_type = url_request.content_type
                        else:
                            self.media_type = MediaTypes.xml
                        self.options.set_host_language(self.media_type)
                    self.charset = url_request.charset
                    if self.required_base == None:
                        self.required_base = name
                    return url_request.data
                else:
                    # Creating a File URI for this thing
                    if self.required_base == None:
                        self.required_base = "file://" + os.path.join(os.getcwd(),name)
                    if self.media_type == "":
                        self.media_type = MediaTypes.xml
                        # see if the default should be overwritten
                        for suffix in preferred_suffixes:
                            if name.endswith(suffix):
                                self.media_type = preferred_suffixes[suffix]
                                self.charset = 'utf-8'
                                break
                        self.options.set_host_language(self.media_type)
                    return open(name)
            else:
                return name
        except HTTPError:
            raise sys.exc_info()[1]
        except RDFaError as e:
            raise e
        except:
            _type, value, _traceback = sys.exc_info()
            raise FailedSource(value)

    @staticmethod
    def _validate_output_format(outputFormat):
        """
        Malicious actors may create XSS style issues by using an illegal output format... better be careful
        """
        # protection against possible malicious URL call
        if outputFormat not in ["turtle", "n3", "xml", "pretty-xml", "nt", "json-ld"]:
            outputFormat = "turtle"
        return outputFormat

    ####################################################################################################################
    # Externally used methods
    #
    def graph_from_DOM(self, dom, graph = None, pgraph = None):
        """
        Extract the RDF Graph from a DOM tree. This is where the real processing happens. All other methods get down to this
        one, eventually (e.g., after opening a URI and parsing it into a DOM).
        @param dom: a DOM Node element, the top level entry node for the whole tree (i.e., the C{dom.documentElement} is used to initiate processing down the node hierarchy)
        @keyword graph: an RDF Graph (if None, then a new one is created)
        @type graph: rdflib Graph instance.
        @keyword pgraph: an RDF Graph to hold (possibly) the processor graph content. If None, and the error/warning triples are to be generated, they will be added to the returned graph. Otherwise they are stored in this graph.
        @type pgraph: rdflib Graph instance
        @return: an RDF Graph
        @rtype: rdflib Graph instance
        """
        def copyGraph(tog, fromg):
            for t in fromg:
                tog.add(t)
            for k,ns in fromg.namespaces():
                tog.bind(k,ns)

        if graph == None:
            # Create the RDF Graph, that will contain the return triples...
            graph = Graph()

        # this will collect the content, the 'default graph', as called in the RDFa spec
        default_graph = Graph()

        # get the DOM tree
        topElement = dom.documentElement

        # Create the initial state. This takes care of things
        # like base, top level namespace settings, etc.
        state = ExecutionContext(topElement, default_graph, base=self.required_base if self.required_base != None else "", options=self.options, rdfa_version=self.rdfa_version)

        # Perform the built-in and external transformations on the HTML tree.
        for trans in self.options.transformers + builtInTransformers:
            trans(topElement, self.options, state)

        # This may have changed if the state setting detected an explicit version information:
        self.rdfa_version = state.rdfa_version

        # The top level subject starts with the current document; this
        # is used by the recursion
        # this function is the real workhorse
        parse_one_node(topElement, default_graph, None, state, [])

        # Massage the output graph in term of rdfa:Pattern and rdfa:copy
        handle_prototypes(default_graph)

        # If the RDFS expansion has to be made, here is the place...
        if self.options.vocab_expansion:
            from .rdfs.process import process_rdfa_sem
            process_rdfa_sem(default_graph, self.options)

        # Experimental feature: nothing for now, this is kept as a placeholder
        if self.options.experimental_features:
            pass

        # What should be returned depends on the way the options have been set up
        if self.options.output_default_graph:
            copyGraph(graph, default_graph)
            if self.options.output_processor_graph:
                if pgraph != None:
                    copyGraph(pgraph, self.options.processor_graph.graph)
                else:
                    copyGraph(graph, self.options.processor_graph.graph)
        elif self.options.output_processor_graph:
            if pgraph != None:
                copyGraph(pgraph, self.options.processor_graph.graph)
            else:
                copyGraph(graph, self.options.processor_graph.graph)

        # this is necessary if several DOM trees are handled in a row...
        self.options.reset_processor_graph()

        return graph

    def graph_from_source(self, name, graph = None, rdfOutput = False, pgraph = None):
        """
        Extract an RDF graph from an RDFa source. The source is parsed, the RDF extracted, and the RDFa Graph is
        returned. This is a front-end to the L{pyRdfa.graph_from_DOM} method.

        @param name: a URI, a file name, or a file-like object
        @param graph: rdflib Graph instance. If None, a new one is created.
        @param pgraph: rdflib Graph instance for the processor graph. If None, and the error/warning triples are to be generated, they will be added to the returned graph. Otherwise they are stored in this graph.
        @param rdfOutput: whether runtime exceptions should be turned into RDF and returned as part of the processor graph
        @return: an RDF Graph
        @rtype: rdflib Graph instance
        """
        def copyErrors(tog, options):
            if tog == None:
                tog = Graph()
            if options.output_processor_graph:
                for t in options.processor_graph.graph:
                    tog.add(t)
                    if pgraph != None : pgraph.add(t)
                for k,ns in options.processor_graph.graph.namespaces():
                    tog.bind(k,ns)
                    if pgraph != None : pgraph.bind(k,ns)
            options.reset_processor_graph()
            return tog

        isstring = isinstance(name, str)

        try:
            # First, open the source... Possible HTTP errors are returned as error triples
            stream = None
            try:
                stream = self._get_input(name)
            except FailedSource as ex:
                _f = sys.exc_info()[1]
                self.http_status = 400
                if not rdfOutput : raise Exception(ex.msg)
                err = self.options.add_error(ex.msg, FileReferenceError, name)
                self.options.processor_graph.add_http_context(err, 400)
                return copyErrors(graph, self.options)
            except HTTPError as ex:
                h = sys.exc_info()[1]
                self.http_status = h.http_code
                if not rdfOutput : raise Exception(ex.msg)
                err = self.options.add_error("HTTP Error: %s (%s)" % (h.http_code,h.msg), HTError, name)
                self.options.processor_graph.add_http_context(err, h.http_code)
                return copyErrors(graph, self.options)
            except RDFaError as ex:
                e = sys.exc_info()[1]
                self.http_status = 500
                # Something nasty happened:-(
                if not rdfOutput : raise Exception(ex.msg)
                err = self.options.add_error(str(ex.msg), context = name)
                self.options.processor_graph.add_http_context(err, 500)
                return copyErrors(graph, self.options)
            except Exception as ex:
                e = sys.exc_info()[1]
                self.http_status = 500
                # Something nasty happened:-(
                if not rdfOutput : raise ex
                err = self.options.add_error(str(e), context = name)
                self.options.processor_graph.add_http_context(err, 500)
                return copyErrors(graph, self.options)

            dom = None
            try:
                msg = ""
                parser = None
                if self.options.host_language == HostLanguage.html5:
                    import warnings
                    warnings.filterwarnings("ignore", category=DeprecationWarning)
                    from html5lib import HTMLParser, treebuilders
                    parser = HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
                    if self.charset:
                        # This means the HTTP header has provided a charset, or the
                        # file is a local file when we suppose it to be a utf-8
                        #
                        # 2020-01-20, Ivan Herman
                        # for some reasons the python3 version ran into a problem with this html5lib call
                        # the override_encoding argument was not accepted.
                        # dom = parser.parse(stream, override_encoding=self.charset)
                        dom = parser.parse(stream)
                    else:
                        # No charset set. The HTMLLib parser tries to sniff into
                        # the file to find a meta header for the charset; if that
                        # works, fine, otherwise it falls back on window-...
                        dom = parser.parse(stream)

                    try:
                        if isstring:
                            stream.close()
                            stream = self._get_input(name)
                        else:
                            stream.seek(0)
                        from .host import adjust_html_version
                        self.rdfa_version = adjust_html_version(stream, self.rdfa_version)
                    except:
                        # if anything goes wrong, it is not really important; rdfa version stays what it was...
                        pass

                else:
                    from .host import adjust_xhtml_and_version
                    if isinstance(stream, IOBase):
                        parse = xml.dom.minidom.parse
                    else:
                        parse = xml.dom.minidom.parseString
                    dom = parse(stream)
                    adjusted_host_language, version = adjust_xhtml_and_version(dom, self.options.host_language, self.rdfa_version)
                    self.options.host_language = adjusted_host_language
                    self.rdfa_version = version
            except ImportError:
                msg = "HTML5 parser not available. Try installing html5lib <http://code.google.com/p/html5lib>"
                raise ImportError(msg)
            except Exception:
                e = sys.exc_info()[1]
                # These are various parsing exceptions. Per spec, this is a case when
                # error triples MUST be returned, ie, the usage of rdfOutput (which switches between an HTML formatted
                # return page or a graph with error triples) does not apply
                err = self.options.add_error(str(e), context = name)
                self.http_status = 400
                self.options.processor_graph.add_http_context(err, 400)
                return copyErrors(graph, self.options)

            # If we got here, we have a DOM tree to operate on...
            return self.graph_from_DOM(dom, graph, pgraph)
        except Exception:
            # Something nasty happened during the generation of the graph...
            (a,b,c) = sys.exc_info()
            sys.excepthook(a,b,c)
            if isinstance(b, ImportError):
                self.http_status = None
            else:
                self.http_status = 500
            if not rdfOutput : raise b
            err = self.options.add_error(str(b), context = name)
            self.options.processor_graph.add_http_context(err, 500)
            return copyErrors(graph, self.options)

    def rdf_from_sources(self, names, outputFormat = "turtle", rdfOutput = False):
        """
        Extract an RDF graph from a list of RDFa sources and serialize them in one graph. The sources are parsed, the RDF
        extracted, and serialization is done in the specified format.
        @param names: list of sources, each can be a URI, a file name, or a file-like object
        @keyword outputFormat: serialization format. Can be one of "turtle", "n3", "xml", "pretty-xml", "nt", or "json-ld"; any other value (including "json") falls back to "turtle", see L{_validate_output_format}. "turtle" and "n3", and "xml" and "pretty-xml", are synonyms, respectively. Note that the JSON-LD serialization works with RDFLib 3.* only.
        @keyword rdfOutput: controls what happens in case an exception is raised. If the value is False, the caller is responsible for handling it; otherwise a graph is returned with an error message included in the processor graph
        @type rdfOutput: boolean
        @return: a serialized RDF Graph
        @rtype: string
        """
        # protection against possible malicious URL call
        outputFormat = pyRdfa._validate_output_format(outputFormat)

        # This is better because it gives access to the various, non-standard serializations
        # If it does not work because the extra are not installed, fall back to the standard
        # rdflib distribution...
        graph = Graph()

        # graph.bind("xsd", Namespace('http://www.w3.org/2001/XMLSchema#'))
        # the value of rdfOutput determines the reaction on exceptions...
        for name in names:
            self.graph_from_source(name, graph, rdfOutput)

        # Stupid difference between python2 and python3...
        return str(graph.serialize(format=outputFormat), encoding='utf-8')


    def rdf_from_source(self, name, outputFormat = "turtle", rdfOutput = False):
        """
        Extract an RDF graph from an RDFa source and serialize it in one graph. The source is parsed, the RDF
        extracted, and serialization is done in the specified format.
        @param name: a URI, a file name, or a file-like object
        @keyword outputFormat: serialization format. Can be one of "turtle", "n3", "xml", "pretty-xml", "nt", or "json-ld"; any other value falls back to "turtle", see L{_validate_output_format}. "turtle" and "n3", and "xml" and "pretty-xml", are synonyms, respectively. Note that the JSON-LD serialization works with RDFLib 3.* only.
        @keyword rdfOutput: controls what happens in case an exception is raised. If the value is False, the caller is responsible for handling it; otherwise a graph is returned with an error message included in the processor graph
        @type rdfOutput: boolean
        @return: a serialized RDF Graph
        @rtype: string
        """
        return self.rdf_from_sources([name], outputFormat, rdfOutput)
Main processing class for the distiller
@ivar options: an instance of the L{Options} class
@ivar media_type: the preferred default media type, possibly set at initialization
@ivar base: the base value, possibly set at initialization
@ivar http_status: HTTP Status, to be returned when the package is used via a CGI entry. Initially set to 200, may be modified by exception handlers
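When C{media_type} and C{base} are not set at initialization, they are filled in while the source is opened. The stand-alone sketch below mirrors that decision as it appears in C{_get_input}: a string with a URI scheme is treated as a URI, any other string as a local file whose suffix may override the XML default, and anything else is passed through as a file-like object. The C{SUFFIX_TO_MEDIA_TYPE} table, the media type strings, and the C{classify_source} helper are illustrative assumptions and not part of pyRdfa; the real mapping lives in L{host.preferred_suffixes}.

```python
import io
import os
from urllib.parse import urlparse

# Hypothetical suffix table; the real one is host.preferred_suffixes.
SUFFIX_TO_MEDIA_TYPE = {
    ".html": "text/html",
    ".xhtml": "application/xhtml+xml",
    ".svg": "image/svg+xml",
}
DEFAULT_MEDIA_TYPE = "application/xml"  # assumed stand-in for MediaTypes.xml


def classify_source(name, media_type=""):
    """Return (kind, media_type, base) for a source, loosely following _get_input."""
    if not isinstance(name, str):
        # Assumed to be a file-like object already; nothing to guess.
        return "stream", media_type, None
    if urlparse(name)[0] != "":
        # A non-empty scheme means a URI. The real code would go on to
        # inspect the Content-Type of the HTTP response; this sketch does not.
        return "uri", media_type, name
    # Plain file name: build a file URI and guess the type from the suffix,
    # but only if no media type was given explicitly.
    base = "file://" + os.path.join(os.getcwd(), name)
    if media_type == "":
        media_type = DEFAULT_MEDIA_TYPE
        for suffix, guessed in SUFFIX_TO_MEDIA_TYPE.items():
            if name.endswith(suffix):
                media_type = guessed
                break
    return "file", media_type, base


print(classify_source("page.html"))
print(classify_source("http://example.org/doc"))
print(classify_source(io.StringIO("<html/>"), "text/html"))
```

An explicitly supplied media type always wins in this sketch, which matches the guard on C{self.media_type == ""} in the listing.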
def __init__(self, options = None, base = "", media_type = "", rdfa_version = None)
@keyword options: Options for the distiller
@type options: L{Options}
@keyword base: URI for the default "base" value (usually the URI of the file to be processed)
@keyword media_type: explicit setting of the preferred media type (a.k.a. content type) of the RDFa source
@keyword rdfa_version: the RDFa version that should be used. If not set, the value of the global L{rdfa_current_version} variable is used
def graph_from_DOM(self, dom, graph = None, pgraph = None)
Extract the RDF Graph from a DOM tree. This is where the real processing happens. All other methods get down to this one, eventually (e.g., after opening a URI and parsing it into a DOM).

@param dom: a DOM Node element, the top level entry node for the whole tree (i.e., the C{dom.documentElement} is used to initiate processing down the node hierarchy)
@keyword graph: an RDF Graph (if None, then a new one is created)
@type graph: rdflib Graph instance
@keyword pgraph: an RDF Graph to hold (possibly) the processor graph content. If None, and the error/warning triples are to be generated, they will be added to the returned graph. Otherwise they are stored in this graph.
@type pgraph: rdflib Graph instance
@return: an RDF Graph
@rtype: rdflib Graph instance
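The interplay between the output options and the optional C{pgraph} can be sketched with plain Python sets standing in for RDFLib graphs. This is only an illustration of the branching described above, assuming the two option flags behave as in the source listing; C{route_triples} and its parameters are hypothetical names, not part of the package.

```python
def route_triples(default_triples, processor_triples,
                  output_default_graph=True, output_processor_graph=False,
                  use_pgraph=False):
    """Return (graph, pgraph) as sets, following the branching in graph_from_DOM."""
    graph, pgraph = set(), set()
    if output_default_graph:
        # Default graph content always goes to the returned graph.
        graph.update(default_triples)
    if output_processor_graph:
        # Processor (error/warning) triples go to pgraph when one is supplied,
        # otherwise they are merged into the returned graph.
        target = pgraph if use_pgraph else graph
        target.update(processor_triples)
    return graph, pgraph


data = {("ex:doc", "dc:title", "Example")}
errors = {("_:e1", "rdf:type", "rdfa:Error")}

print(route_triples(data, errors))
print(route_triples(data, errors, output_processor_graph=True, use_pgraph=True))
```

The sketch collapses the duplicated if/elif branches of the original into a single test, which is equivalent for the four flag combinations.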
def graph_from_source(self, name, graph = None, rdfOutput = False, pgraph = None)
Extract an RDF graph from an RDFa source. The source is parsed, the RDF extracted, and the RDFa Graph is returned. This is a front-end to the L{pyRdfa.graph_from_DOM} method.
@param name: a URI, a file name, or a file-like object
@param graph: rdflib Graph instance. If None, a new one is created.
@param pgraph: rdflib Graph instance for the processor graph. If None, and the error/warning triples are to be generated, they will be added to the returned graph. Otherwise they are stored in this graph.
@param rdfOutput: whether runtime exceptions should be turned into RDF and returned as part of the processor graph
@return: an RDF Graph
@rtype: rdflib Graph instance
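The effect of C{rdfOutput} on failures while opening the source can be illustrated without the package. In the sketch below, the exception classes are minimal stand-ins (assumptions, not the real pyRdfa classes), and C{open_source} is a hypothetical helper: it maps a failure to an HTTP status roughly as the source listing does (400 for an unreadable source, the server's code for an HTTP error, 500 otherwise) and either re-raises or returns the error message, depending on the flag. Unlike the original, it re-raises the caught exception itself rather than wrapping it in a plain C{Exception}.

```python
class FailedSource(Exception):
    """Stand-in for pyRdfa's FailedSource (hypothetical minimal version)."""


class HTTPError(Exception):
    """Stand-in carrying an HTTP status code."""
    def __init__(self, msg, http_code):
        super().__init__(msg)
        self.http_code = http_code


def open_source(opener, rdf_output=False):
    """Return (http_status, stream_or_None, error_messages)."""
    try:
        return 200, opener(), []
    except FailedSource as ex:
        status, error = 400, ex
    except HTTPError as ex:
        status, error = ex.http_code, ex
    except Exception as ex:
        status, error = 500, ex
    if not rdf_output:
        # The caller asked to handle exceptions itself.
        raise error
    # Otherwise the message would become an error triple in the processor graph.
    return status, None, [str(error)]


def missing_file():
    raise FailedSource("no such file")


print(open_source(missing_file, rdf_output=True))
```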
def rdf_from_sources(self, names, outputFormat = "turtle", rdfOutput = False)
Extract an RDF graph from a list of RDFa sources and serialize them in one graph. The sources are parsed, the RDF extracted, and serialization is done in the specified format.

@param names: list of sources, each can be a URI, a file name, or a file-like object
@keyword outputFormat: serialization format. Can be one of "turtle", "n3", "xml", "pretty-xml", "nt", or "json-ld"; any other value (including "json") falls back to "turtle", see L{_validate_output_format}. "turtle" and "n3", and "xml" and "pretty-xml", are synonyms, respectively. Note that the JSON-LD serialization works with RDFLib 3.* only.
@keyword rdfOutput: controls what happens in case an exception is raised. If the value is False, the caller is responsible for handling it; otherwise a graph is returned with an error message included in the processor graph
@type rdfOutput: boolean
@return: a serialized RDF Graph
@rtype: string
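The fallback applied to C{outputFormat} is a plain whitelist check. A minimal stand-alone sketch of the same idea follows; the list of accepted names is taken from the source listing of C{_validate_output_format}, while the function and constant names here are illustrative.

```python
ALLOWED_FORMATS = ("turtle", "n3", "xml", "pretty-xml", "nt", "json-ld")


def validate_output_format(output_format):
    """Return output_format if it is whitelisted, otherwise "turtle"."""
    # Anything unexpected (a typo, or a crafted value coming from a URL
    # parameter) is replaced by the default instead of being passed on.
    return output_format if output_format in ALLOWED_FORMATS else "turtle"


for requested in ("pretty-xml", "json", "<script>"):
    print(requested, "->", validate_output_format(requested))
```

Silently substituting the default keeps a CGI front-end from echoing an arbitrary string back to the client, at the cost of hiding typos in the requested format.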
def rdf_from_source(self, name, outputFormat = "turtle", rdfOutput = False)
Extract an RDF graph from an RDFa source and serialize it in one graph. The source is parsed, the RDF extracted, and serialization is done in the specified format.

@param name: a URI, a file name, or a file-like object
@keyword outputFormat: serialization format. Can be one of "turtle", "n3", "xml", "pretty-xml", "nt", or "json-ld"; any other value falls back to "turtle", see L{_validate_output_format}. "turtle" and "n3", and "xml" and "pretty-xml", are synonyms, respectively. Note that the JSON-LD serialization works with RDFLib 3.* only.
@keyword rdfOutput: controls what happens in case an exception is raised. If the value is False, the caller is responsible for handling it; otherwise a graph is returned with an error message included in the processor graph
@type rdfOutput: boolean
@return: a serialized RDF Graph
@rtype: string
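The return value is documented as a string, and the listing converts the serializer's result with C{str(..., encoding='utf-8')}, which assumes bytes. Depending on the installed RDFLib release, a serializer may hand back either bytes or text, so a caller that wants to be defensive could normalize the result as sketched below. C{as_text} is a hypothetical helper, not part of pyRdfa or RDFLib.

```python
def as_text(serialized, encoding="utf-8"):
    """Return serializer output as str, whether it arrived as bytes or str."""
    if isinstance(serialized, bytes):
        return serialized.decode(encoding)
    return serialized


print(as_text(b"<a> <b> <c> ."))
print(as_text("<a> <b> <c> ."))
```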
def processURI(uri, outputFormat, form={}):
    """The standard processing of an RDFa URI, with options in a form; used as an entry point from a CGI call.

    The call accepts extra form options (i.e., HTTP GET options) as follows:

     - C{graph=[output|processor|output,processor|processor,output]} specifying which graphs are returned. Default: C{output}
     - C{space_preserve=[true|false]} controls whether white space in plain literals is preserved or normalized. Default: C{true}
     - C{rdfa_version} provides the RDFa version that should be used for distilling. The string should be of the form "1.0" or "1.1". Default is the highest version the current package implements, currently "1.1"
     - C{host_language=[xhtml|html|svg|atom|xml]} : the host language. Used when files are uploaded or text is added verbatim, otherwise the HTTP return header should be used. Default C{xml}
     - C{embedded_rdf=[true|false]} : whether embedded turtle or RDF/XML content should be added to the output graph. Default: C{false}
     - C{vocab_expansion=[true|false]} : whether the vocabularies should be expanded through the restricted RDFS entailment. Default: C{false}
     - C{vocab_cache=[true|false]} : whether vocab caching should be performed or whether it should be ignored and vocabulary files should be picked up every time. Default: C{true}
     - C{vocab_cache_report=[true|false]} : whether vocab caching details should be reported. Default: C{false}
     - C{vocab_cache_refresh=[true|false]} : whether vocab caches have to be regenerated every time. Default: C{false}
     - C{rdfa_lite=[true|false]} : whether warnings should be generated for non RDFa Lite attribute usage. Default: C{false}
     - C{certifi_verify=[true|false]} : whether the SSL certificate needs to be verified. Default: C{true}

    @param uri: URI to access. Note that the C{text:} and C{uploaded:} fake URI values are treated separately; the former is for textual input (in which case a StringIO is used to get the data) and the latter is for uploaded file, where the form gives access to the file directly.
    @param outputFormat: serialization format, as defined by the package. Currently "xml", "turtle", "nt", or "json". Default is "turtle", also used if any other string is given.
    @param form: extra call options (from the CGI call) to set up the local options
    @type form: cgi FieldStorage instance
    @return: serialized graph
    @rtype: string
    """
    def _get_option(param, compare_value, default):
        param_old = param.replace('_', '-')
        if param in list(form.keys()):
            val = form.getfirst(param).lower()
            return val == compare_value
        elif param_old in list(form.keys()):
            # this is to ensure the old style parameters are still valid...
            # in the old days I used '-' in the parameters, the standard favours '_'
            val = form.getfirst(param_old).lower()
            return val == compare_value
        else:
            return default

    if uri == "uploaded:":
        stream = form["uploaded"].file
        base = ""
    elif uri == "text:":
        stream = StringIO(form.getfirst("text"))
        base = ""
    else:
        stream = uri
        base = uri

    if "rdfa_version" in list(form.keys()):
        rdfa_version = form.getfirst("rdfa_version")
    else:
        rdfa_version = None

    # working through the possible options
    # Host language: HTML, XHTML, or XML
    # Note that these options should be used for the upload and inline version only in case of a form
    # for real uris the returned content type should be used
    if "host_language" in list(form.keys()):
        if form.getfirst("host_language").lower() == "xhtml":
            media_type = MediaTypes.xhtml
        elif form.getfirst("host_language").lower() == "html":
            media_type = MediaTypes.html
        elif form.getfirst("host_language").lower() == "svg":
            media_type = MediaTypes.svg
        elif form.getfirst("host_language").lower() == "atom":
            media_type = MediaTypes.atom
        else:
            media_type = MediaTypes.xml
    else:
        media_type = ""

    transformers = []

    check_lite = "rdfa_lite" in list(form.keys()) and form.getfirst("rdfa_lite").lower() == "true"

    # The code below is left for backward compatibility only. In fact, these options are not exposed any more,
    # they are not really in use
    from .transform.metaname import meta_transform
    from .transform.OpenID import OpenID_transform
    from .transform.DublinCore import DC_transform

    if "extras" in list(form.keys()) and form.getfirst("extras").lower() == "true":
        for t in [OpenID_transform, DC_transform, meta_transform]:
            transformers.append(t)
    else:
        if "extra-meta" in list(form.keys()) and form.getfirst("extra-meta").lower() == "true":
            transformers.append(meta_transform)
        if "extra-openid" in list(form.keys()) and form.getfirst("extra-openid").lower() == "true":
            transformers.append(OpenID_transform)
        if "extra-dc" in list(form.keys()) and form.getfirst("extra-dc").lower() == "true":
            transformers.append(DC_transform)

    output_default_graph = True
    output_processor_graph = False
    # Note that I use the 'graph' and the 'rdfagraph' form keys here. Reason is that
    # I used 'graph' in the previous versions, including the RDFa 1.0 processor,
    # so if I removed that altogether that would create backward incompatibilities
    # On the other hand, the RDFa 1.1 doc clearly refers to 'rdfagraph' as the standard
    # key.
    a = None
    if "graph" in list(form.keys()):
        a = form.getfirst("graph").lower()
    elif "rdfagraph" in list(form.keys()):
        a = form.getfirst("rdfagraph").lower()
    if a != None:
        if a == "processor":
            output_default_graph = False
            output_processor_graph = True
        elif a == "processor,output" or a == "output,processor":
            output_processor_graph = True

    embedded_rdf = _get_option("embedded_rdf", "true", False)
    space_preserve = _get_option("space_preserve", "true", True)
    vocab_cache = _get_option("vocab_cache", "true", True)
    vocab_cache_report = _get_option("vocab_cache_report", "true", False)
    refresh_vocab_cache = _get_option("vocab_cache_refresh", "true", False)
    vocab_expansion = _get_option("vocab_expansion", "true", False)
    certifi_verify = _get_option("certifi_verify", "true", True)
    if vocab_cache_report:
        output_processor_graph = True

    options = Options(output_default_graph = output_default_graph,
                      output_processor_graph = output_processor_graph,
                      space_preserve = space_preserve,
                      transformers = transformers,
                      vocab_cache = vocab_cache,
                      vocab_cache_report = vocab_cache_report,
                      refresh_vocab_cache = refresh_vocab_cache,
                      vocab_expansion = vocab_expansion,
                      embedded_rdf = embedded_rdf,
                      check_lite = check_lite,
                      certifi_verify = certifi_verify)

    processor = pyRdfa(options = options, base = base, media_type = media_type, rdfa_version = rdfa_version)

    # Decide the output format; the issue is what should happen in case of a top level error like an inaccessibility of
    # the html source: should a graph be returned or an HTML page with an error message?

    # decide whether HTML or RDF should be sent.
848 htmlOutput = False 849 #if 'HTTP_ACCEPT' in os.environ: 850 # acc = os.environ['HTTP_ACCEPT'] 851 # possibilities = ['text/html', 852 # 'application/rdf+xml', 853 # 'text/turtle; charset=utf-8', 854 # 'application/json', 855 # 'application/ld+json', 856 # 'text/rdf+n3'] 857 # 858 # # this nice module does content negotiation and returns the preferred format 859 # sg = acceptable_content_type(acc, possibilities) 860 # htmlOutput = (sg != None and sg[0] == content_type('text/html')) 861 # os.environ['rdfaerror'] = 'true' 862 863 # This is really for testing purposes only, it is an unpublished flag to force RDF output no 864 # matter what 865 import html 866 try: 867 outputFormat = pyRdfa._validate_output_format(outputFormat); 868 if outputFormat == "n3": 869 retval = 'Content-Type: text/rdf+n3; charset=utf-8\n' 870 elif outputFormat == "nt" or outputFormat == "turtle": 871 retval = 'Content-Type: text/turtle; charset=utf-8\n' 872 elif outputFormat == "json-ld" or outputFormat == "json": 873 retval = 'Content-Type: application/ld+json; charset=utf-8\n' 874 else: 875 retval = 'Content-Type: application/rdf+xml; charset=utf-8\n' 876 graph = processor.rdf_from_source(stream, outputFormat, rdfOutput = ("forceRDFOutput" in list(form.keys())) or not htmlOutput) 877 retval += '\n' 878 retval += graph 879 return retval 880 except HTTPError: 881 _type, h, _traceback = sys.exc_info() 882 883 retval = 'Content-type: text/html; charset=utf-8\nStatus: %s \n\n' % h.http_code 884 retval += "<html>\n" 885 retval += "<head>\n" 886 retval += "<title>HTTP Error in distilling RDFa content</title>\n" 887 retval += "</head><body>\n" 888 retval += "<h1>HTTP Error in distilling RDFa content</h1>\n" 889 retval += "<p>HTTP Error: %s (%s)</p>\n" % (h.http_code, h.msg) 890 retval += "<p>On URI: <code>'%s'</code></p>\n" % html.escape(uri) 891 retval +="</body>\n" 892 retval +="</html>\n" 893 return retval 894 except: 895 # This branch should occur only if an exception is really raised, ie, if 
it is not turned 896 # into a graph value. 897 _type, value, _traceback = sys.exc_info() 898 899 import traceback 900 901 retval = 'Content-type: text/html; charset=utf-8\nStatus: %s\n\n' % processor.http_status 902 retval += "<html>\n" 903 retval += "<head>\n" 904 retval += "<title>Exception in RDFa processing</title>\n" 905 retval += "</head><body>\n" 906 retval += "<h1>Exception in distilling RDFa</h1>\n" 907 retval += "<pre>\n" 908 strio = StringIO() 909 traceback.print_exc(file=strio) 910 retval += strio.getvalue() 911 retval +="</pre>\n" 912 retval +="<pre>%s</pre>\n" % value 913 retval +="<h1>Distiller request details</h1>\n" 914 retval +="<dl>\n" 915 if uri == "text:" and "text" in form and form["text"].value != None and len(form["text"].value.strip()) != 0: 916 retval +="<dt>Text input:</dt><dd>%s</dd>\n" % html.escape(form["text"].value).replace('\n','<br/>') 917 elif uri == "uploaded:": 918 retval +="<dt>Uploaded file</dt>\n" 919 else: 920 retval +="<dt>URI received:</dt><dd><code>'%s'</code></dd>\n" % html.escape(uri) 921 if "host_language" in list(form.keys()): 922 retval +="<dt>Media Type:</dt><dd>%s</dd>\n" % html.escape(media_type) 923 if "graph" in list(form.keys()): 924 retval +="<dt>Requested graphs:</dt><dd>%s</dd>\n" % html.escape(form.getfirst("graph").lower()) 925 else: 926 retval +="<dt>Requested graphs:</dt><dd>default</dd>\n" 927 retval +="<dt>Output serialization format:</dt><dd> %s</dd>\n" % outputFormat 928 if "space_preserve" in form : retval +="<dt>Space preserve:</dt><dd> %s</dd>\n" % html.escape(form["space_preserve"].value) 929 retval +="</dl>\n" 930 retval +="</body>\n" 931 retval +="</html>\n" 932 return retval
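The choice of the CGI C{Content-Type} header in the listing above can be read in isolation. The following is a minimal sketch of that mapping only; the helper name C{content_type_header} is hypothetical and not part of the package, and the format name is assumed to have been validated already.

```python
def content_type_header(output_format):
    """Return the CGI Content-Type line for an already validated format name.

    Hypothetical helper: it mirrors the if/elif chain of the listing, nothing more.
    """
    if output_format == "n3":
        return "Content-Type: text/rdf+n3; charset=utf-8\n"
    if output_format in ("nt", "turtle"):
        return "Content-Type: text/turtle; charset=utf-8\n"
    if output_format in ("json-ld", "json"):
        return "Content-Type: application/ld+json; charset=utf-8\n"
    # every other value falls through to RDF/XML
    return "Content-Type: application/rdf+xml; charset=utf-8\n"


if __name__ == "__main__":
    print(content_type_header("turtle"), end="")
```

Note that, as in the listing, N-Triples output is labelled with the Turtle media type.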
Standard processing of an RDFa URI with options supplied in a form; used as an entry point from a CGI call.
The call accepts extra form options (i.e., HTTP GET options) as follows:
- C{graph=[output|processor|output,processor|processor,output]} specifying which graphs are returned. Default: C{output}
- C{space_preserve=[true|false]} : whether white space in plain literals is preserved rather than normalized. Default: C{true}
- C{rdfa_version} provides the RDFa version that should be used for distilling. The string should be of the form "1.0" or "1.1". Default is the highest version the current package implements, currently "1.1"
- C{host_language=[xhtml,html,svg,atom,xml]} : the host language. Used when files are uploaded or text is added verbatim, otherwise the HTTP return header should be used. Any other value is treated as C{xml}
- C{embedded_rdf=[true|false]} : whether embedded turtle or RDF/XML content should be added to the output graph. Default: C{false}
- C{vocab_expansion=[true|false]} : whether the vocabularies should be expanded through the restricted RDFS entailment. Default: C{false}
- C{vocab_cache=[true|false]} : whether vocab caching should be performed or whether it should be ignored and vocabulary files should be picked up every time. Default: C{true}
- C{vocab_cache_report=[true|false]} : whether vocab caching details should be reported. Default: C{false}
- C{vocab_cache_refresh=[true|false]} : whether vocab caches have to be regenerated every time. Default: C{false}
- C{rdfa_lite=[true|false]} : whether warnings should be generated for non RDFa Lite attribute usage. Default: C{false}
- C{certifi_verify=[true|false]} : whether the SSL certificate needs to be verified. Default: C{true}
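The boolean options above are read by lower-casing the submitted value and comparing it with C{true}; the older hyphenated spelling of a key (e.g. C{embedded-rdf}) is accepted as a fallback. A minimal sketch of that lookup follows. C{FakeForm} is a hypothetical stand-in for a cgi FieldStorage instance that mimics only C{getfirst}, and C{get_option} is an illustrative name, not a function of the package.

```python
class FakeForm(dict):
    """Hypothetical stand-in for cgi.FieldStorage; only getfirst() is mimicked."""

    def getfirst(self, key, default=None):
        return self.get(key, default)


def get_option(form, param, compare_value, default):
    """Return True/False for a submitted option, or `default` if it is absent.

    The current underscore spelling is tried first, then the old hyphenated one.
    """
    param_old = param.replace("_", "-")
    for key in (param, param_old):
        if key in form:
            return form.getfirst(key).lower() == compare_value
    return default


if __name__ == "__main__":
    form = FakeForm({"embedded-rdf": "TRUE", "vocab_cache": "false"})
    print(get_option(form, "embedded_rdf", "true", False))    # old-style key is honoured
    print(get_option(form, "vocab_cache", "true", True))      # explicit value wins over the default
    print(get_option(form, "space_preserve", "true", True))   # absent, so the default is used
```

Any value other than C{true} (in any letter case) counts as false once the key is present.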
@param uri: URI to access. Note that the C{text:} and C{uploaded:} fake URI values are treated separately; the former is for textual input (in which case a StringIO is used to get the data) and the latter is for an uploaded file, where the form gives access to the file directly.
@param outputFormat: serialization format, as defined by the package. Currently "xml", "turtle", "nt", or "json". Default is "turtle", also used if any other string is given.
@param form: extra call options (from the CGI call) to set up the local options
@type form: cgi FieldStorage instance
@return: serialized graph
@rtype: string
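The treatment of the two fake URI values described for C{uri} can be sketched as below. This is an illustration only: C{select_input} and its keyword arguments are hypothetical, standing in for what the function reads from the form, and the returned pair corresponds to the input stream and the base URI handed to the processor.

```python
from io import StringIO


def select_input(uri, text=None, uploaded_file=None):
    """Return a (stream, base) pair for the given URI value.

    Hypothetical helper: `text` and `uploaded_file` stand for the form's
    "text" value and the uploaded file object, respectively.
    """
    if uri == "uploaded:":
        # the form gives direct access to the uploaded file; there is no base URI
        return uploaded_file, ""
    if uri == "text:":
        # verbatim text is wrapped in a StringIO so it can be read like a file
        return StringIO(text), ""
    # a real URI serves both as the source to fetch and as the base
    return uri, uri


if __name__ == "__main__":
    stream, base = select_input("text:", text="<p property='dc:title'>Hi</p>")
    print(repr(base), stream.read())
```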