LLD Exploitation
Introduction
This best practice describes how to exploit Linguistic Linked Data resources. The suggested steps for exploitation comprise:
- searching for and discovering relevant resources
- verifying the license of the dataset
- navigating to the distribution of the data (download or SPARQL endpoint)
- extracting the part of the data that is relevant for a particular purpose or application
Use Case
Let us consider the example of a company developing sentiment analysis and opinion mining software. It has a working system for English and wants to port the system to also support German. To do so, the company wants to find a corpus annotated at the sentiment level and extract from it a first seed lexicon of German subjective expressions with their polarity (positive, negative, or neutral).
Method
In order to exploit Linguistic Linked Data resources, the above-mentioned methodology can be implemented as follows:
- Search and discovery: relevant linguistic resources can be discovered using LingHub, which has been developed by the LIDER project.
- Licensing: once a relevant dataset has been found in LingHub, clicking on its link leads to a page containing all the metadata about the resource, including its license.
- Distribution: from the metadata page in LingHub, one can either download the dataset or discover where the SPARQL endpoint of the data is.
- Extraction: using W3C standards, in particular the SPARQL RDF query language, one can extract the portion of the data needed for a particular purpose.
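As a sketch of the extraction step, a SPARQL query can be issued over HTTP as specified by the W3C SPARQL 1.1 Protocol. The endpoint URL below is a hypothetical placeholder (substitute the endpoint found on a resource's LingHub metadata page), and the sketch only builds the request rather than sending it, since any particular endpoint may be offline:

```python
import urllib.request
from urllib.parse import urlencode

# Hypothetical endpoint for illustration; substitute the endpoint URL
# found on the resource's LingHub metadata page.
ENDPOINT = "http://example.org/sparql"
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

def sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a GET request for a SPARQL query, asking for JSON results.

    The 'query' parameter and the application/sparql-results+json media
    type are defined by the W3C SPARQL 1.1 Protocol and Query Results
    JSON Format specifications.
    """
    url = endpoint + "?" + urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )

req = sparql_request(ENDPOINT, QUERY)
# When the endpoint is reachable, urllib.request.urlopen(req) returns
# the JSON result set.
```

Requesting JSON via the Accept header (rather than an endpoint-specific format parameter) keeps the request portable across standards-compliant SPARQL endpoints.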
If the LIDER guidelines are followed when publishing a resource and providing its metadata, and if the resource is registered at Metashare, CLARIN VO, LRE Map or DataHub, LingHub will crawl and index it with the appropriate metadata. Further, if the de facto standards and vocabularies recommended by LIDER are followed, the same extraction patterns can be used to extract data from different datasets.
Use Case Revisited
Our company looking for a German lexicon would follow the above sketched methodology as follows:
- Search and discovery: the company would enter the query "sentiment corpus German" into LingHub and reach the following page: http://linghub.lider-project.eu/search/?property=&query=sentiment+corpus+german. It would get two results. Clicking, for instance, on the usage review dataset, it would reach the following page: http://linghub.lider-project.eu/datahub/usage-review-corpus#Nedfa753871df4052a5e6074d9389e901.
- Licensing: it would check the license (http://opendatacommons.org/licenses/by/1.0) and see that it is compatible with its purposes.
- Distribution: the company would see from the metadata page of the usage review dataset that a download is available at http://data.lider-project.eu/usage/usage.nt.gz and that a SPARQL endpoint is available at http://data.lider-project.eu/usage/sparql.
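If the download route is taken, the gzipped N-Triples dump can be processed with the Python standard library alone. The parser below is a deliberately naive sketch (production code should use an RDF library such as rdflib, which handles escaping and literals correctly), and the sample triple is an invented stand-in for the dump's contents:

```python
import gzip
import io

def read_ntriples(raw_gz: bytes):
    """Yield (subject, predicate, object) from a gzipped N-Triples dump.

    Naive line-based split for illustration only; it assumes one triple
    per line and does not handle escaped characters inside literals.
    """
    with gzip.open(io.BytesIO(raw_gz), "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # Subject and predicate never contain spaces; the object may.
            s, p, o = line.rstrip(" .").split(" ", 2)
            yield s, p, o

# Tiny in-memory sample standing in for usage.nt.gz; the real file would
# be fetched from the download URL above, if still available.
sample = gzip.compress(
    b'<http://example.org/p1> '
    b'<http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#anchorOf> '
    b'"sehr gut" .\n'
)
triples = list(read_ntriples(sample))
```

Streaming the dump through gzip.open avoids decompressing the whole file into memory, which matters for corpus-sized downloads.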
- Extraction: Using the following SPARQL query
PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>
PREFIX marl: <http://www.gsi.dit.upm.es/ontologies/marl/ns#>

SELECT ?string ?polarity WHERE {
  ?phrase nif:anchorOf ?string ;
          nif:lang <http://www.lexvo.org/page/iso639-3/deu> ;
          marl:hasPolarity ?polarity .
}
the company would obtain a seed lexicon of subjective phrases with their polarity as a result.
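A sketch of consuming such a result set: the payload below is an invented sample in the W3C SPARQL 1.1 Query Results JSON format (the phrases and polarity URIs are illustrative, not taken from the actual corpus), reduced to a phrase-to-polarity dictionary that could serve as the seed lexicon:

```python
import json

# Hypothetical results payload in the SPARQL 1.1 Query Results JSON
# format, standing in for what the endpoint would return for the query
# above.
results_json = """{
  "head": {"vars": ["string", "polarity"]},
  "results": {"bindings": [
    {"string": {"type": "literal", "value": "sehr gut"},
     "polarity": {"type": "uri",
                  "value": "http://www.gsi.dit.upm.es/ontologies/marl/ns#Positive"}},
    {"string": {"type": "literal", "value": "leider"},
     "polarity": {"type": "uri",
                  "value": "http://www.gsi.dit.upm.es/ontologies/marl/ns#Negative"}}
  ]}
}"""

def seed_lexicon(payload: str) -> dict:
    """Map each subjective phrase to its polarity (the URI's local name)."""
    data = json.loads(payload)
    lexicon = {}
    for binding in data["results"]["bindings"]:
        phrase = binding["string"]["value"]
        polarity = binding["polarity"]["value"].rsplit("#", 1)[-1]
        lexicon[phrase] = polarity
    return lexicon

lexicon = seed_lexicon(results_json)
```

Keeping only the local name of the polarity URI yields the compact positive/negative/neutral labels the sentiment system would consume.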