Graphs Design 6.1/Crawler Example
This is a more detailed walkthrough of the Shared Web Crawler and Archiving Web Crawler use cases, showing several ways they can be addressed using Graphs Design 6.1.
The Scenario
Craig's computer system is doing RDF Web crawling. It has a list of URLs from which it will fetch RDF content. It will parse that content and save the resulting RDF Graphs. It will then make available the information it gathered, and some metadata about how the information was gathered. That information will be obtained and used by Dave's machine.
Some more details:
- This morning, Craig's machine was erased — there are no old crawl results.
- Today Craig's machine only tried to dereference three URLs, as follows.
- Dereference of http://alice.example.org/page1 at 2012-04-02T16:07:01 returned application/turtle, "<a> <b> 1".
- Dereference of http://alice.example.org/page2 at 2012-04-02T16:07:02 returned application/turtle, "<a> <b> 2".
- Dereference of http://alice.example.org/page3 at 2012-04-02T16:07:03 failed with error code 404 Not Found.
- Craig's machine exposes this information in a TriG file at http://craig.example.org/crawl/2012-04-02
If we adopt Graphs Design 6.1, there are still many ways to address this scenario. Each section below presents one of these ways; they all use Design 6.1.
These examples do not include as much metadata as one would probably like. In particular, clients probably SHOULD pay attention to cache management headers like Last-Modified, Expires, ETag, and Cache-Control. Hopefully the examples are detailed enough to show how one could include such header information.
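As a rough sketch of how such header information might be attached, the fragment below borrows the eg:DereferenceOperation pattern used by the examples in the following sections and adds one property per response header. The eg: namespace IRI, the four property names (eg:lastModified, eg:expires, eg:etag, eg:cacheControl), and all header values are assumptions made up for illustration; nothing in the scenario specifies them.

```trig
# Hypothetical namespace IRI; the examples on this page leave eg: undeclared.
@prefix eg: <http://example.org/crawl#> .
@prefix xs: <http://www.w3.org/2001/XMLSchema#> .

_:g1 { <a> <b> 1 }

{
  [ a eg:DereferenceOperation;
    eg:source <http://alice.example.org/page1>;
    eg:atTime "2012-04-02T16:07:01"^^xs:dateTime;
    eg:result _:g1;
    eg:status 200;
    # Assumed properties, one per cache-related response header.
    # The values are invented; the scenario does not give any.
    eg:lastModified "2012-04-01T09:00:00"^^xs:dateTime;
    eg:expires "2012-04-03T16:07:01"^^xs:dateTime;
    eg:etag "\"abc123\"";
    eg:cacheControl "max-age=86400";
  ].
}
```

A consumer such as Dave's machine could then compare eg:expires or eg:etag against its own copy before deciding whether the crawl result is still worth using.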
Blank Nodes Referring to RDF Graphs
With this approach, we refer to the RDF graphs using blank node labels.
_:g1 { <a> <b> 1 }
_:g2 { <a> <b> 2 }
{
  _:g1 a rdf:Graph.   # this says that _:g1 names the graph itself
  _:g2 a rdf:Graph.   # ditto for _:g2
  [ a eg:DereferenceOperation;
    eg:source <http://alice.example.org/page1>;
    eg:atTime "2012-04-02T16:07:01"^^xs:dateTime;
    eg:result _:g1;
    eg:status 200;
  ].
  [ a eg:DereferenceOperation;
    eg:source <http://alice.example.org/page2>;
    eg:atTime "2012-04-02T16:07:02"^^xs:dateTime;
    eg:result _:g2;
    eg:status 200;
  ].
  [ a eg:DereferenceOperation;
    eg:source <http://alice.example.org/page3>;
    eg:atTime "2012-04-02T16:07:03"^^xs:dateTime;
    eg:status 404;
  ].
}
Pros:
- Semantically simple; the blank node "_:g1" refers to the RDF graph that was parsed from the content obtained from page1.
Cons:
- Uses blank nodes, whose labels are local to this document, so nothing outside the dataset can refer to these graphs
Use the Original URLs as Graph Labels
With this approach, the graph tags are the URLs used to obtain those graphs.
<http://alice.example.org/page1> { <a> <b> 1 }
<http://alice.example.org/page2> { <a> <b> 2 }
{
  [ a eg:DereferenceOperation;
    eg:source <http://alice.example.org/page1>;
    eg:atTime "2012-04-02T16:07:01"^^xs:dateTime;
    eg:resultTagged <http://alice.example.org/page1>;
    eg:status 200;
  ].
  [ a eg:DereferenceOperation;
    eg:source <http://alice.example.org/page2>;
    eg:atTime "2012-04-02T16:07:02"^^xs:dateTime;
    eg:resultTagged <http://alice.example.org/page2>;
    eg:status 200;
  ].
  [ a eg:DereferenceOperation;
    eg:source <http://alice.example.org/page3>;
    eg:atTime "2012-04-02T16:07:03"^^xs:dateTime;
    eg:status 404;
  ].
}
Pros:
- For simple uses, the default graph can be ignored
Cons:
- If there are multiple dereferences of one URL, the result tag and the source will have to be different for some of them
- Legitimate, accurate crawler results will logically conflict (different graphs for the same label) if a source changes between the crawls.
- Multiple datasets from different crawls (where some sources have changed contents) can't be properly combined without application logic
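To make the last two points concrete, suppose a later crawl finds that page1 has changed. That later crawl is not part of the scenario: the date, the new content `<a> <b> 3`, and the eg: namespace IRI below are all invented for illustration. Its dataset might look like this sketch:

```trig
# Hypothetical dataset from a later crawl; the date and the changed
# content are invented and do not come from the scenario.
@prefix eg: <http://example.org/crawl#> .   # assumed namespace IRI
@prefix xs: <http://www.w3.org/2001/XMLSchema#> .

<http://alice.example.org/page1> { <a> <b> 3 }

{
  [ a eg:DereferenceOperation;
    eg:source <http://alice.example.org/page1>;
    eg:atTime "2012-04-03T16:07:01"^^xs:dateTime;
    eg:resultTagged <http://alice.example.org/page1>;
    eg:status 200;
  ].
}
```

This dataset and the 2012-04-02 one use the same label for two different graphs. A consumer that simply merged them by label, for instance by loading both into one quad store, would likely end up with a graph containing both `<a> <b> 1` and `<a> <b> 3`, which neither crawl actually observed.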
Snapshot URLs as Graph Labels
With this approach, the graph tags are URLs minted by the crawler, which can be dereferenced to obtain the graphs.
<http://craig.example.org/snap/e02cce51a67d8ca63f5d2ced5c5068b996ab6026> { <a> <b> 1 }
<http://craig.example.org/snap/93a60479a59194657189180397825328d70e8916> { <a> <b> 2 }
{
  [ a eg:DereferenceOperation;
    eg:source <http://alice.example.org/page1>;
    eg:atTime "2012-04-02T16:07:01"^^xs:dateTime;
    eg:resultAt <http://craig.example.org/snap/e02cce51a67d8ca63f5d2ced5c5068b996ab6026>;
    eg:status 200;
  ].
  [ a eg:DereferenceOperation;
    eg:source <http://alice.example.org/page2>;
    eg:atTime "2012-04-02T16:07:02"^^xs:dateTime;
    eg:resultAt <http://craig.example.org/snap/93a60479a59194657189180397825328d70e8916>;
    eg:status 200;
  ].
  [ a eg:DereferenceOperation;
    eg:source <http://alice.example.org/page3>;
    eg:atTime "2012-04-02T16:07:03"^^xs:dateTime;
    eg:status 404;
  ].
}
Pros:
- Can handle multiple retrievals
- Clients may be able to just store the graph URLs and dereference them only when needed
Cons:
- ???
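The "multiple retrievals" point can be sketched as follows. The fragment records two dereferences of page1, each pointing at its own snapshot. Only the first retrieval comes from the scenario; the second timestamp, the second snapshot URL (a placeholder, not a real identifier), the changed content, and the eg: namespace IRI are assumptions for illustration.

```trig
# Sketch of two retrievals of the same source. The second retrieval is
# hypothetical: its time, snapshot URL, and content are invented.
@prefix eg: <http://example.org/crawl#> .   # assumed namespace IRI
@prefix xs: <http://www.w3.org/2001/XMLSchema#> .

<http://craig.example.org/snap/e02cce51a67d8ca63f5d2ced5c5068b996ab6026> { <a> <b> 1 }
<http://craig.example.org/snap/hypothetical-later-snapshot> { <a> <b> 3 }

{
  [ a eg:DereferenceOperation;
    eg:source <http://alice.example.org/page1>;
    eg:atTime "2012-04-02T16:07:01"^^xs:dateTime;
    eg:resultAt <http://craig.example.org/snap/e02cce51a67d8ca63f5d2ced5c5068b996ab6026>;
    eg:status 200;
  ].
  [ a eg:DereferenceOperation;
    eg:source <http://alice.example.org/page1>;
    eg:atTime "2012-04-03T16:07:01"^^xs:dateTime;
    eg:resultAt <http://craig.example.org/snap/hypothetical-later-snapshot>;
    eg:status 200;
  ].
}
```

Because each retrieval gets its own label, the two graphs never share a name, so datasets from different crawls can sit side by side without the conflict that arises when the source URL is reused as the label.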