SV_MEETING_TITLE -- 21 May 2012

<Janos> http://dl.dropbox.com/u/21690634/Quantifying%20RDF%20data%20sets.pdf

<HarryH> 412 623 is me - Harry Hochheiser Pittsburgh

<mscottm> Brian Lowe: Developer on VIVO project, Stella Mitchell also works as developer / ontology on VIVO

<mscottm> Harry Hochheiser - University of Pittsburgh, interested in HCLS

<ram> Ram from Metaome - We have a life science search engine called DistilBio (distilbio.com)

<scribe> scribe: Jun

<mscottm> Chimezie Ogbuji - Cleveland Clinic, Case Western, Recently started a startup

Scott: introducing Janos' talk: it's important to be able to differentiate RDF datasets by more than just their content, licenses, etc.

<mscottm> VIVO - scientific research network ontology

Janos: one of the members of CTSA Connect graduate programme, to connect two major ontologies, VIVO and ***, to connect clinical sciences data

<chimezie> yes, I do

Slide 1: a lot of further work. this just presents a start

slide 2

Janos: Semantic Web is based on RDF, a graph-based data model

<mscottm> CTSA Connect: http://www.ctsaconnect.org/about-us

Janos: more flexible than relational DBs by allowing parallel edges

slide 3

Janos: a paper submitted to the Triple Challenge 2010
... they did some quantification of datasets, looking into the internal structure of the data
... drew on some of the approaches in this paper
... took a look at the datasets of the challenge, and did some structural and other analysis

slide 4

Janos: a basic Python library to parse N-Triples. it's a memory-based approach that does some processing. based on PyPy
... PyPy for just-in-time compilation, to speed up the processing

<Amit> conference is full! cannot join by voice

Janos: just some basic statistical analysis, then started to do some pattern-matching analysis. not using a SPARQL endpoint
... each file is treated as its own graph. didn't use Named Graphs
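
A minimal sketch of the kind of in-memory N-Triples parsing described above. This is not Janos' library: the names `parse_ntriples` and `load_graphs` are hypothetical, and the regular expression is a simplification that assumes one triple per line and does not cover the full N-Triples grammar. It only shows how each file can be read into its own in-memory graph without Named Graphs.

```python
import re

# Simplified term pattern (assumption): an IRI, a blank node, or a literal
# with an optional language tag or datatype. Not the full N-Triples grammar.
_TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
_TRIPLE = re.compile(r'^\s*' + _TERM + r'\s+(<[^>]*>)\s+' + _TERM + r'\s*\.\s*$')


def parse_ntriples(lines):
    """Return a list of (subject, predicate, object) string tuples."""
    triples = []
    for line in lines:
        if not line.strip() or line.lstrip().startswith('#'):
            continue  # skip blank lines and comments
        match = _TRIPLE.match(line)
        if match is None:
            raise ValueError('not a valid N-Triples line: %r' % line)
        triples.append(match.groups())
    return triples


def load_graphs(paths):
    """Treat each file as its own graph, keyed by file path."""
    graphs = {}
    for path in paths:
        with open(path, encoding='utf-8') as handle:
            graphs[path] = parse_ntriples(handle)
    return graphs
```

Holding every triple as a tuple of strings keeps the sketch short, but it is memory-hungry, which fits the scalability question that follows.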

Q: on scalability

Janos: largest one is LinkedCT
... 28 million triples. took 30% of 64 GB of memory
... SPARQL 1.1 might promise better performance

slide 5

Janos: started with some basic counts

slide 6

Janos: did some simple fraction calculations
... e.g., how many literals are in your triples?
... how many literals are unique?
... how many objects are unique?
... structure measurement, by taking out typing information and literals
... subject/object coverage: more pointing or more pointed to?
... more concrete examples to follow
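
The fractions listed above could be computed roughly as follows. The minutes do not give exact definitions, so every formula here is an assumption, as is the function name `dataset_fractions`: "structure" is taken to be the share of triples that are neither `rdf:type` statements nor literal-valued, and subject/object coverage is taken to be the share of nodes in those structural triples that appear as a subject or as an object.

```python
RDF_TYPE = '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>'


def is_literal(term):
    """N-Triples literals start with a double quote."""
    return term.startswith('"')


def dataset_fractions(triples):
    """Compute simple fractions over (subject, predicate, object) tuples.

    All definitions are illustrative assumptions, not the talk's formulas.
    """
    total = len(triples)
    if total == 0:
        raise ValueError('empty dataset')
    objects = [o for _, _, o in triples]
    literals = [o for o in objects if is_literal(o)]
    # Structural triples: drop typing statements and literal-valued triples.
    structural = [(s, p, o) for s, p, o in triples
                  if p != RDF_TYPE and not is_literal(o)]
    subjects = {s for s, _, _ in structural}
    targets = {o for _, _, o in structural}
    nodes = subjects | targets
    return {
        'literal_fraction': len(literals) / total,
        'unique_literal_fraction':
            len(set(literals)) / len(literals) if literals else 0.0,
        'unique_object_fraction': len(set(objects)) / total,
        'structure_fraction': len(structural) / total,
        'subject_coverage': len(subjects) / len(nodes) if nodes else 0.0,
        'object_coverage': len(targets) / len(nodes) if nodes else 0.0,
    }
```

Under these assumed definitions, a high subject coverage with a low object coverage would mean that most nodes point at others rather than being pointed to.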

slide 7

<mscottm> scribenick: Jun

Janos: computed it against a couple of LOD datasets: 4 from LODD (DailyMed, LinkedCT, DrugbankRDF, RxNorm)
... BioGrid database: an open access DB on Protein and Genetic Interactions
... BioPAX: pathways in BioPAX format
... BioGrid can be downloaded in OWL format
... VIVO: NIH-funded project for scientific networking
... got N-Triples for the VIVO dataset
... going through them by number of triples, descending

slide 8

Janos: top subjects, top classes, predicates, etc
... give you a good idea of how people use ontologies
... LinkedCT: 40% are literals, objects have 80% repetition
... three dominant classes

Michael: have you done this analysis on the GO ontology?

Janos: not yet

Michael: expecting more diverse coverage

Janos: would be interesting to look at

slide 9

Janos: BioGrid in BioPAX
... 50 MB in OWL but 40 million triples in N-Triples format
... again, subject, object coverage, and top classes. they are not LOD yet
... get a good sense of what's actually in the content

slide 10

Janos: RxNorm
... only 6 classes. pretty small
... quite a lot of literals. structure data is higher than in other datasets

Q: do you see big structural differences between these datasets?

Janos: TBD

slide 11

Janos: 1.2 million triples
... data about publications, such as Authorship, Person ...
... publication is dominant data source there. pretty good subject/object coverage

slide 12

Janos: it has a lot of links to outside datasets, and has a much higher object coverage

slide 13

Janos: top predicate: owl:sameAs. again has a lot of links to outside datasets

mscottm: any idea about how one type of metric could be more useful than another, or searching for others?

slide 14

Janos: there are a lot of tools for graph vis and analysis, but not so good with RDF data

slide 15

Janos: the twist is to allow multiple paths between 2 nodes

slide 16

Janos: there are ways to collapse the parallel edges, or put RDF into XML, in order to use some graph analysis tools
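
One way to collapse parallel edges, sketched under assumptions: each (subject, object) pair becomes a single edge whose weight is the number of triples linking the pair, and literal-valued triples are dropped. The function name `collapse_parallel_edges` and the weighting scheme are illustrative and are not taken from the slides.

```python
from collections import Counter


def collapse_parallel_edges(triples):
    """Map each (subject, object) pair to the number of triples linking it.

    Predicates are discarded, so several parallel RDF edges between two
    nodes become one weighted edge that simple-graph tools can load.
    """
    weights = Counter()
    for subject, _predicate, obj in triples:
        if obj.startswith('"'):
            continue  # literals are values, not nodes, in this sketch
        weights[(subject, obj)] += 1
    return dict(weights)
```

The cost of this pre-processing is that the predicate labels are lost; only the number of links between two nodes survives.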

slide 17

Janos: show some examples
... get co-authors that are only members of a site, to get a smaller co-author network

slide 18

Janos: do some basic graph analysis using Mathematica
... basic in-degrees, out-degrees, histograms, one/two degree separation etc
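
The talk used Mathematica for this step. As a rough stand-in, the same in-degree and out-degree histograms can be sketched in Python, assuming the graph is already available as a list of (source, target) pairs. The function name `degree_histograms` is hypothetical.

```python
from collections import Counter


def degree_histograms(edges):
    """Return (in_hist, out_hist) for a list of (source, target) edges.

    Each histogram maps a degree to the number of nodes with that degree,
    counting nodes with degree 0 as well.
    """
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    nodes = set(out_degree) | set(in_degree)
    in_hist = Counter(in_degree.get(node, 0) for node in nodes)
    out_hist = Counter(out_degree.get(node, 0) for node in nodes)
    return dict(in_hist), dict(out_hist)
```

For a co-author network, the out-degree histogram then shows how many people have one co-author, how many have two, and so on.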

<mscottm> Nice!

slide 19

Janos: Gephi doesn't support parallel edges. you have to do some pre-processing

slide 20

Janos: some links

<michael> thanks, janos, i need to drop off

Eric: any further analysis on some of the results, like the social network?

<mscottm, I have to leave for another meeting>

<mattgamble> First how do you work out which metrics are useful?

<egombocz> Our Knowledge Explorer also provides metrics for weighing of connections in several ways

<mscottm> Chime - would you please jot your comment/question into IRC? I received an urgent call exactly when you started.. :(

<chimezie> My question was whether he had considered using rdflib (https://github.com/RDFLib)

<mscottm> CTSA Connect - ISF - Integrated Semantic Framework: core is combining VIVO ontology and eagle-i ontology

<HarryH> Thanks , Janos - very interesting!

<ram> Thanks Janos

<Stella> thanks all, bye

<mscottm> bye all
