See also: IRC log
<trackbot> Date: 15 December 2009
<Ashok> Do we have telcons on Dec 22 and 27 ?
<mhausenblas> no, Ashok ;)
<ericP> slackers
<Ashok> Thanks, Michael!
<mhausenblas> scribenick: cygri
<MacTed> MacTed = Ted Thibodeau
<MacTed> correct
<angela_UNITN> aacc is me
<angela_UNITN> aadd is heiko
<HeikoStoermer> right
PROPOSAL: Accept the minutes of the 8 December 2009 telecon,
http://www.w3.org/2009/12/08-RDB2RDF-minutes.html
<whalb> +1
<Marcelo> +1
+1
<soeren> +1
RESOLUTION: Accept the minutes of the 8 December 2009 telecon
Use Case planning
mhausenblas:
http://www.w3.org/2001/sw/rdb2rdf/wiki/Use_Cases_and_Requirements
... invite ppl to add their use cases
Ashok: format? HTML or only wiki?
mhausenblas: initially collaborate on the wiki, then turn into proper WG Note with help of EricP
Soeren: present use cases as database schemas?
mhausenblas: rather keep it on user level, e.g., "we have a web shop..."
or "combine crm system with web shop"
for now, it's structured brainstorming
number of use cases we're aiming at?
EricP: a size that we can manage
http://www.w3.org/2001/sw/rdb2rdf/wiki/images/c/cf/Okkam.pdf
Heiko Stoermer is presenting
work is part of OKKAM, EU project
ENS -- Entity Naming System
thanks mhausenblas!
slide 2
slide 3
ENS provides services for re-use of identifiers
several public services
ID search, ID creation, ID management (alternative IDs), create+update profiles of entities
scalable architecture
access through SOAP services, REST is coming
web frontends
slide 4
benefits from using ENS
heiko: easily retrieve all data attached to the same ID
thx ericP!
scribe: maintain metadata about entities
... profile updates based on popularity
... application in business intelligence
... integrate data across systems
... potentially get links to stuff outside on the web for free
... e.g. other people talking about your product (SAP use case)
slide 5
heiko: architecture
... storage
... lifecycle, e.g. ageing, merging, splitting of IDs
... entity matching (queries)
... access management: no mining queries ("give me all XYZ")
... access APIs
slide 6
heiko: scalability
... storage has distributed index, and distributed entity store, both clustered
... replication+sharding
... solr
... ENS Core does life cycle etc, also clustered
slide 7
heiko: currently also working on offline processing
... batch processing, deduplication, data quality assessment etc
slide 8
heiko: under development for 2 years, version 2 coming
... now at 7.5M records, system scales to 50M
... want to be at 50M records and capability of 500M at project end 06/2010
slide 9
heiko: entity repository = ID + attached entity description
slide 10
heiko: challenges
... no defined fixed schema, just vocabularies
... we don't define vocabularies
... users specify name-value pairs
... matching afterwards is difficult
... users can use whatever vocab they want, "professor" instead
of "person", we must deal with that
slide 11
heiko: internal representation: XML documents with name-value pairs describing the entities
... and alternative identifiers
... can be interpreted as linked data style sameAs
... e.g. dbpedia URI
... API call for retrieving the canonical OKKAM ID for an alternative identifier
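A minimal sketch of the lookup described on this slide, assuming an in-memory dictionary in place of the ENS. The profile layout, the function name canonical_id, and the example dbpedia URI are hypothetical and are not the OKKAM API; only the canonical URI is taken from the example given later in this call.

```python
from typing import Optional

# Canonical ID copied from the example posted later in the call.
CANONICAL = "http://www.okkam.org/entity/ok5f23a5ce-a683-4c4d-ae73-b78cdc17aec1"

# Hypothetical profile store: free-form name-value pairs plus
# alternative identifiers (which could be read as sameAs-style links).
profiles = {
    CANONICAL: {
        "attributes": [("name", "Example Entity"), ("type", "professor")],
        "alternative_ids": {"http://dbpedia.org/resource/Example_Entity"},
    }
}

def canonical_id(identifier: str) -> Optional[str]:
    """Return the canonical ID for a canonical or alternative identifier."""
    for okkam_id, profile in profiles.items():
        if identifier == okkam_id or identifier in profile["alternative_ids"]:
            return okkam_id
    return None  # unknown identifier
```

A caller holding only the dbpedia URI would pass it to canonical_id and store the returned URI alongside its own record.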
slide 12
heiko: current content of the repo
... wikipedia, geonames, manually created
... total 7.5M entities
... currently adding DBLP
... no restriction w.r.t. types of entities, we can manage everything
slide 13
heiko: entity ID search
... user submits key-value pairs as query
... query must be matched against profiles
... result is canonical identifier
... skip slide 14
slide 15
heiko: 2 phase process in search
... 1. entity search, 2. refined entity matching
... entity search is for recall, pull out everything that is relevant, that's fast
... refined matching then to increase precision, can be more expensive
... return match or no match
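The two phases can be sketched as follows. This is an illustration under stated assumptions, not the ENS implementation: the profiles, the IDs, the value-overlap filter, the exact-match scoring, and the 0.75 threshold are all invented for the example.

```python
# Toy profiles; IDs and attributes are made up for illustration.
PROFILES = {
    "id:ibm-company": {"name": "IBM", "type": "company", "city": "Armonk"},
    "id:ibm-band": {"name": "IBM", "type": "band"},
    "id:oracle-company": {"name": "Oracle", "type": "company"},
}

def entity_search(query):
    """Phase 1 (recall): cheap filter that keeps every profile
    sharing at least one value with the query."""
    wanted = {v.lower() for v in query.values()}
    return [pid for pid, prof in PROFILES.items()
            if wanted & {v.lower() for v in prof.values()}]

def refined_match(query, candidates, threshold=0.75):
    """Phase 2 (precision): score each candidate by the fraction of
    query pairs it matches; return the best ID, or None for no match."""
    best_id, best_score = None, 0.0
    for pid in candidates:
        prof = PROFILES[pid]
        hits = sum(1 for k, v in query.items()
                   if prof.get(k, "").lower() == v.lower())
        score = hits / len(query)
        if score > best_score:
            best_id, best_score = pid, score
    return best_id if best_score >= threshold else None

def search(query):
    """Run both phases: broad candidate retrieval, then strict matching."""
    return refined_match(query, entity_search(query))
```

Here the first phase deliberately over-retrieves (a query for IBM the company also pulls in the band and Oracle), and only the second phase decides between a match and no match.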
slide 16
heiko: bridging to database integration
... expose two DBs as two knowledge bases (graph)
... typical approach for integration: owl:sameAs between records in diff DBs
slide 17
heiko: owl:sameAs has strong semantics, you forget where the data came from
... (slide 18) better: use same ID everywhere
... OKKAM ID as "mediator" in the middle
... without undesirable consequences of sameAs
slide 19
heiko: you can give local identifiers and then connect them to OKKAM ID
... then you can merge based on the ID, with desired semantic rules
slide 20
heiko: a database alignment project with okkam
... client has bunch of databases
... want unified view
... convert them all to RDF
... use ENS to align
... so entities are linked without having to merge the graphs
slide 22
heiko: in RDB you have PKs, so unique ID is often a number
... in RDF you need a URI
... ENS is the thing that can enable stepping from the RDB world to the RDF world
... afterwards, coreference is syntactically evident
... so okkam provides mapping between local ID and global OKKAM ID
... DERI has sig.ma application
... you can give it an okkam ID and it will give view on all data out there that uses the ID
<Souri> +q
Q&A
ericP: similar to Shared Names project? Concept Wiki?
heiko: they do life science IDs, we do all domains
... they are vertical app
ericP: different proteins are sometimes the same, sometimes not considered the same
... predicated similarity?
heiko: frequently raised point... up to which point is X still the same when you start replacing all its parts?
... we don't deal with that kind of semantics
... what's the same or not is in your knowledge base
... if you describe things differently from me, if we need insulation, we will have two different entities
ericP: when I do SPARQL queries, should engine be aware of OKKAM?
heiko: no SPARQL interface yet
Souri: q related to goal of this WG... how do you do mapping in the DBs?
heiko: that's up to mapping infrastructure. we just provide a URI. ENS is not a mapping layer between DB and RDF. ENS is ID management
Souri: do you hand an ID to the user, "build your DB using this"? or does user give all his IDs to the ENS?
heiko: can do two things. first, whenever I create an entity, ENS assigns it an ID. when someone else wants to talk about same entity, ID is already there in the ENS
... second, we already have distributed data. you give data to the ENS, it gives you an ID (existing or newly created). repeat for different data sources, you get same ID
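A rough sketch of the second mode (give data, get back an existing or newly created ID), assuming a toy stand-in for the ENS. The class ToyENS, its resolve method, the example.org URI prefix, and the exact-match normalization are hypothetical; the real service performs entity matching rather than exact key comparison.

```python
import uuid

class ToyENS:
    """Hypothetical stand-in for the ENS: treats two descriptions as the
    same entity when their normalized name-value pairs are identical."""

    def __init__(self):
        self._ids = {}

    def resolve(self, description: dict) -> str:
        # Normalize keys and values so trivial differences do not matter.
        key = frozenset((k.lower(), str(v).strip().lower())
                        for k, v in description.items())
        if key not in self._ids:
            # Unknown entity: mint a new ID (made-up URI prefix).
            self._ids[key] = f"http://example.org/entity/{uuid.uuid4()}"
        return self._ids[key]

ens = ToyENS()

# Two data sources with different local primary keys for the same company.
crm = {17: {"name": "Oracle", "type": "company"}}
shop = {"C-204": {"name": "oracle ", "type": "Company"}}

# Map each local key to the shared global ID.
crm_ids = {pk: ens.resolve(desc) for pk, desc in crm.items()}
shop_ids = {pk: ens.resolve(desc) for pk, desc in shop.items()}
```

After both sources are resolved, crm_ids[17] and shop_ids["C-204"] hold the same URI, so the records can be linked by ID without merging the sources.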
Ashok: are okkam IDs URIs? what's the structure of the URI?
heiko: yes, they are URIs
<HeikoStoermer> http://www.okkam.org/entity/ok5f23a5ce-a683-4c4d-ae73-b78cdc17aec1
heiko: that's an okkam ID
... it's a UUID
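For illustration, a small sketch that pulls the UUID out of a URI shaped like the example above. The assumption that every OKKAM ID is the prefix "ok" followed by a UUID under /entity/ is inferred from this single example only, and the function name parse_okkam_id is hypothetical.

```python
import uuid
from urllib.parse import urlparse

def parse_okkam_id(uri: str) -> uuid.UUID:
    """Extract the UUID from a URI shaped like the example in the minutes.
    Raises ValueError if the URI does not follow that (assumed) shape."""
    parsed = urlparse(uri)
    prefix = "/entity/ok"  # assumed layout, inferred from one example
    if parsed.netloc != "www.okkam.org" or not parsed.path.startswith(prefix):
        raise ValueError(f"not an OKKAM-style entity URI: {uri}")
    # uuid.UUID itself raises ValueError on a malformed UUID string.
    return uuid.UUID(parsed.path[len(prefix):])

example = "http://www.okkam.org/entity/ok5f23a5ce-a683-4c4d-ae73-b78cdc17aec1"
entity_uuid = parse_okkam_id(example)
```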
<angela_UNITN> you can aggregate data by okkamID using sig.ma for example
<angela_UNITN> http://sig.ma/search?q=http://www.okkam.org/entity/ok5f23a5ce-a683-4c4d-ae73-b78cdc17aec1
jsequeda: let's say i have legacy DB about companies with PKs. so I would map my PKs to okkam IDs?
heiko: yes you want to have the okkam ID somewhere in your data, because then it's stable
... either do it entity by entity, or use batch processor where you send the data to the ENS
... privacy issues of course
jsequeda: how much disambiguation do you do? how tell apart oracle the company and oracle the DB?
heiko: if you just have a string, we can do nothing for you. need more info in your record
... sometimes can fall back on global popularity. IBM the company vs IBM the band
... in practice, today: build a slightly more elaborate description of your entity; do it one by one; send query to ENS
... real examples from use case partners have sufficient detail
... structure of query: simplest is bag of words; more complex is key value pairs; easy to pull that from a DB and that helps us a great deal
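A minimal sketch of pulling both query forms from a database row, using an in-memory SQLite table. The table, its columns, and the helper row_to_queries are invented for the example; how the ENS actually accepts these queries is not specified in the discussion.

```python
import sqlite3

# Hypothetical legacy table; schema and data are made up.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by name
conn.execute(
    "CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT, city TEXT, homepage TEXT)"
)
conn.execute("INSERT INTO company VALUES (1, 'Oracle', 'Redwood Shores', NULL)")

def row_to_queries(row, key_column="id"):
    """Build a key-value query and a bag-of-words query from one row.
    The local primary key and NULL columns are left out."""
    pairs = {k: str(row[k]) for k in row.keys()
             if k != key_column and row[k] is not None}
    bag = " ".join(pairs.values())
    return bag, pairs

row = conn.execute("SELECT * FROM company WHERE id = 1").fetchone()
bag_of_words, key_value = row_to_queries(row)
```

The key-value form keeps the column names, which is the extra structure heiko says helps matching; the bag of words drops them.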
mhausenblas: further questions on the mailing list
<ericP> +1
mhausenblas: no telecon on december 22nd and 29th
<Marcelo> +1
PROPOSAL: reconvene jan 5th
<mhausenblas> http://www.w3.org/2001/sw/rdb2rdf/wiki/ScribeList
next scribe is Souri
microsoft patent ... apparently does not come from SQL Server team but perhaps Live Search
<jsequeda> Email on the New York Semantic Web mailing list
<jsequeda> Actually it's not a patent yet, just an application. The USPTO is looking at ways to improve discovery of prior art, and has a pilot program where you can participate in the examination process. So if you know of prior art, post it here:
<jsequeda> http://www.peertopatent.org/
<MacTed> there is a date that "prior art" must exist before, associated with the patent ... but I forget whether that's the "submission date" or something else
<Souri> Oracle has a paper in VLDB 2005
<mhausenblas> [adjourned]
Present: Seema, +43.316.876.aaaa, +1.562.249.aabb, +039046188aacc, +39.046.188.aadd, Ashok_Malhotra, Souri, EricP, mhausenblas, MacTed, soeren, cygri, whalb, [IPcaller], angela_UNITN, HeikoStoermer, jsequeda, +44.131.208.aaee, hhalpin, Orri
Regrets: Ben_Szekely, Nuno, Ahmed
Agenda: http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2009Dec/0008.html
Scribe: cygri
Date: 15 Dec 2009
Minutes: http://www.w3.org/2009/12/15-RDB2RDF-minutes.html