<Janos> http://dl.dropbox.com/u/21690634/Quantifying%20RDF%20data%20sets.pdf
<HarryH> 412 623 is me - Harry Hochheiser Pittsburgh
<mscottm> Brian Lowe: Developer on VIVO project, Stella Mitchell also works as developer / ontology on VIVO
<mscottm> Harry Hochheiser - University of Pittsburgh, interested in HCLS
<ram> Ram from Metaome - We have a life science search engine called DistilBio (distilbio.com)
<scribe> scribe: Jun
<mscottm> Chimezie Ogbuji - Cleveland Clinic, Case Western, Recently started a startup
Scott: introducing Janos' talk: it's important to be able to tell RDF datasets apart by more than their content, licenses, etc
<mscottm> VIVO - scientific research network ontology
Janos: one of the members of CTSA Connect graduate programme, to connect two major ontologies, VIVO and ***, to connect clinical sciences data
<chimezie> yes, I do
Slide 1: a lot of further work. this just presents a start
slide 2
Janos: Semantic Web is based on RDF, a graph-based data model
<mscottm> CTSA Connect: http://www.ctsaconnect.org/about-us
Janos: more flexible than relational DBs by allowing parallel edges
slide 3
Janos: a paper submitted to the Triple Challenge 2010
... they did some quantification of datasets, looking into the internal structure of the data
... drew on some of the approaches of this paper
... took a look at the datasets of the challenge, and did some structural analysis and others
slide 4
Janos: a basic Python library to parse N-Triples. it's a memory-based approach, and does some processing. based on PyPy
... PyPy for just-in-time compiling, to speed up the processing
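A minimal sketch of the kind of in-memory N-Triples parsing described above. The library from the talk is not shown in these minutes, so the regular expression and the function name `parse_ntriples` are assumptions for illustration; the pattern covers only a simplified subset of the N-Triples grammar.

```python
import re

# Simplified term pattern: IRI, blank node, or literal with optional
# datatype / language tag. This is an assumption for illustration and does
# not cover every escape or edge case in the N-Triples grammar.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)'
LINE = re.compile(r'^\s*' + TERM + r'\s+' + TERM + r'\s+' + TERM + r'\s*\.\s*$')

def parse_ntriples(lines):
    """Return a list of (subject, predicate, object) string tuples held in memory."""
    triples = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue  # skip blank lines and comments
        match = LINE.match(stripped)
        if match is None:
            raise ValueError("unparseable line: " + stripped)
        triples.append(match.groups())
    return triples

# Hypothetical sample data (example.org IRIs are placeholders).
sample = [
    '<http://example.org/a> <http://example.org/knows> <http://example.org/b> .',
    '<http://example.org/a> <http://example.org/name> "Alice"@en .',
    '# a comment',
]
triples = parse_ntriples(sample)
```

Holding every triple in a Python list matches the memory-based approach mentioned in the talk, at the cost of memory use that grows with the size of the file.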
<Amit> conference is full! cannot join by voice
Janos: just some basic statistical analysis, then started to do some pattern matching analysis. not by using a SPARQL endpoint
... each file is treated as its own graph. didn't use Named Graphs
Q: on scalability
Janos: largest one is LinkedCT
... 28 million triples. took 30% of 64 GB of memory
... SPARQL 1.1 might offer better performance
slide 5
Janos: started with some basic counts
slide 6
Janos: do some simple fraction calculations
... e.g., how many literals in your triples
... how many literals are unique?
... how many objects are unique?
... structure measurement, by taking out the typing sort of information and literals
... subject/object coverage, more pointing or more pointed to?
... more concrete examples to follow
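The fraction calculations listed above could be sketched as follows. The exact definitions used in the talk are not recorded in these minutes, so every formula here is an assumption: literals are detected by a leading double quote, and "coverage" is taken as the share of distinct resource nodes that appear as a subject or as an object.

```python
def basic_fractions(triples):
    """Compute simple ratios over (s, p, o) string tuples.

    All definitions are illustrative assumptions, not the talk's own.
    """
    total = len(triples)
    if total == 0:
        raise ValueError("no triples to analyse")
    objects = [o for (_, _, o) in triples]
    literals = [o for o in objects if o.startswith('"')]
    subjects = {s for (s, _, _) in triples}
    resource_objects = {o for o in objects if not o.startswith('"')}
    nodes = subjects | resource_objects  # distinct non-literal nodes
    return {
        "literal_fraction": len(literals) / total,
        "unique_literal_fraction": len(set(literals)) / len(literals) if literals else 0.0,
        "unique_object_fraction": len(set(objects)) / total,
        "subject_coverage": len(subjects) / len(nodes),
        "object_coverage": len(resource_objects) / len(nodes),
    }

# Hypothetical toy data with shortened identifiers.
triples = [
    ("<a>", "<knows>", "<b>"),
    ("<a>", "<name>", '"Alice"'),
    ("<b>", "<name>", '"Alice"'),
    ("<c>", "<knows>", "<b>"),
]
metrics = basic_fractions(triples)
```

In this toy graph every node appears as a subject but only one is ever pointed to, which is the "more pointing or more pointed to" contrast the coverage figures are meant to expose.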
slide 7
<mscottm> scribenick: Jun
Janos: computed it against a couple of LOD datasets, 4 of the LODD: DailyMed, LinkedCT, DrugbankRDF, RxNorm
... BioGrid database: an open access DB on Protein and Genetic Interactions
... BioPAX: pathways in BioPAX format
... BioGrid can be downloaded in OWL format
... VIVO: NIH-funded project for scientific networking
... got N-Triples for the VIVO dataset
... going through them in descending order of number of triples
slide 8
Janos: top subjects, top classes, predicates, etc
... gives you a good idea of how people use ontologies
... LinkedCT: 40% are literals, objects have 80% repetition
... three dominant classes
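Counting top subjects, predicates and classes, as on this slide, can be sketched with the standard library. The function name `top_counts` and the toy data are hypothetical; classes are assumed to be the objects of `rdf:type` triples.

```python
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def top_counts(triples, k=3):
    """Return the k most frequent subjects, predicates and classes."""
    subjects = Counter(s for (s, _, _) in triples)
    predicates = Counter(p for (_, p, _) in triples)
    # Assumption: a "class" is anything used as the object of rdf:type.
    classes = Counter(o for (_, p, o) in triples if p == RDF_TYPE)
    return {
        "subjects": subjects.most_common(k),
        "predicates": predicates.most_common(k),
        "classes": classes.most_common(k),
    }

# Hypothetical toy data with shortened identifiers.
triples = [
    ("<p1>", RDF_TYPE, "<Person>"),
    ("<p2>", RDF_TYPE, "<Person>"),
    ("<d1>", RDF_TYPE, "<Document>"),
    ("<p1>", "<authorOf>", "<d1>"),
    ("<p1>", "<name>", '"Ann"'),
]
top = top_counts(triples)
```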
Michael: have you done this analysis on the GO ontology?
Janos: not yet
Michael: expecting more diverse coverage
Janos: would be interesting to look at
slide 9
Janos: BioGrid in BioPAX
... 50 MB in OWL but 40 million triples in N-Triples format
... again, subject, object coverage, and top classes. they are not LOD yet
... get a good sense of what's actually in the content
slide 10
Janos: RxNorm
... only 6 classes. pretty small
... quite a lot of literals. structure data is higher than in other datasets
Q: do you see big structural differences between these datasets?
Janos: TBD
slide 11
Janos: 1.2 million triples
... data about publications, such as Authorship, Person ...
... publication is the dominant data source there. pretty good subject/object coverage
slide 12
Janos: it has a lot of links to outside datasets, so has a much higher object coverage
slide 13
Janos: top predicate: owl:sameAs. again has a lot of links to outside datasets
mscottm: any idea about how one type of metric could be more useful than another, or searching for others?
slide 14
Janos: there are a lot of tools for graph vis and analysis, but not so good with RDF data
slide 15
Janos: the twist is to allow multiple paths between 2 nodes
slide 16
Janos: there are ways to collapse the parallel edges, or put RDF into XML, in order to use some graph analysis tools
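One way to collapse parallel edges, sketched under assumptions: each (subject, object) pair becomes a single edge that carries the set of predicates linking the two nodes, so a tool limited to simple graphs can still use the predicate count as a weight. The minutes do not say which method the talk used; the function name and toy data are hypothetical.

```python
from collections import defaultdict

def collapse_parallel_edges(triples):
    """Map each (subject, object) pair to the sorted list of predicates linking them."""
    edges = defaultdict(set)
    for s, p, o in triples:
        if o.startswith('"'):
            continue  # in this sketch literals are attribute values, not graph nodes
        edges[(s, o)].add(p)
    return {pair: sorted(preds) for pair, preds in edges.items()}

# Hypothetical toy data: two parallel edges between <a> and <b>.
triples = [
    ("<a>", "<knows>", "<b>"),
    ("<a>", "<worksWith>", "<b>"),
    ("<b>", "<knows>", "<c>"),
    ("<a>", "<name>", '"Alice"'),
]
simple_edges = collapse_parallel_edges(triples)
weights = {pair: len(preds) for pair, preds in simple_edges.items()}
```

Keeping the predicate list on the collapsed edge means the original multigraph can still be recovered, which a plain weight alone would not allow.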
slide 17
Janos: show some examples
... get co-authors that are only members of a site, to get a smaller co-author network
slide 18
Janos: do some basic graph analysis using Mathematica
... basic in-degrees, out-degrees, histograms, one/two degree separation etc
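The talk used Mathematica for this step; as a swapped-in illustration, in-degree and out-degree histograms can be sketched in plain Python. The function name and the edge list are hypothetical, and edges are assumed to be (source, target) pairs between resource nodes.

```python
from collections import Counter

def degree_histograms(edges):
    """Return (in_hist, out_hist): how many nodes have each in/out-degree."""
    out_degree = Counter()
    in_degree = Counter()
    nodes = set()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
        nodes.update((source, target))
    # Counter lookups return 0 for nodes with no edges in that direction.
    in_hist = Counter(in_degree[n] for n in nodes)
    out_hist = Counter(out_degree[n] for n in nodes)
    return dict(in_hist), dict(out_hist)

# Hypothetical toy data: one hub pointing at three other nodes.
edges = [("<a>", "<b>"), ("<a>", "<c>"), ("<a>", "<d>")]
in_hist, out_hist = degree_histograms(edges)
```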
<mscottm> Nice!
slide 19
Janos: Gephi doesn't support parallel edges. you have to do some pre-processing
slide 20
Janos: some links
<michael> thanks, janos, i need to drop off
Eric: any further analysis on some of the results, like the social network?
<mscottm, I have to leave for another meeting>
<mattgamble> First how do you work out which metrics are useful?
<egombocz> Our Knowledge Explorer also provides metrics for weighing of connections in several ways
<mscottm> Chime - would you please jot your comment/question into IRC? I received an urgent call exactly when you started.. :(
<chimezie> My question was whether he had considered using rdflib (https://github.com/RDFLib)
<mscottm> CTSA Connect - ISF - Integrated Semantic Framework: core is combining VIVO ontology and eagle-i ontology
<HarryH> Thanks , Janos - very interesting!
<ram> Thanks Janos
<Stella> thanks all, bye
<mscottm> bye all
Present: Tony, +1.510.705.aaaa, tlebo, +1.631.444.aabb, +46.7.08.13.aacc, Scott_Marshall, +46.7.08.13.aadd, Chimezie, +1.412.623.aaee, +1.206.732.aaff, +1.857.250.aagg, EricP
Scribe: Jun
Date: 21 May 2012
Minutes URL (guessed by scribe.perl): http://www.w3.org/2012/05/21-HCLS-minutes.html