XML Processing Model WG -- 10 Feb 2016

We accept the draft agenda, https://www.w3.org/XML/XProc/2016/01/27-minutes.html

Henry: that agenda should include links to the draft

<scribe> ACTION: add link to the draft [recorded in http://www.w3.org/2016/02/10-xproc-irc]

<alexmilowski> https://github.com/xproc/notes/tree/master/dataflows

Alex: good to spend time on the major conceptual issues

Do we accept the meeting minutes from Jan 27th ?

https://www.w3.org/XML/XProc/2016/01/27-minutes.html

no objections heard, accepted

Do we accept the meeting minutes from Feb 3rd ?

https://www.w3.org/XML/XProc/2016/02/03-minutes

Henry: before tomorrow we need to have the link to the new doc

Alex: we have included link

Henry: no one will see it until tomorrow

Alex: will send email out to the list with link of draft proposal

https://github.com/xproc/notes/tree/master/dataflows

Jim: has sent agenda for today to the list

pause to read Henry's comments

https://github.com/ndw/xproc-notes/blob/master/design/ht-notes.html

Henry: I would like to put further discussions about the syntax to one side
... With all respect to the WG, we are not up for this job ...
... We need more wo/manpower

Alex: agree we need more people

<alexmilowski> http://noflojs.org/documentation/fbp/

<alexmilowski> http://www.nextflow.io/example1.html

Alex: goes on about some examples of dataflow languages
... NoFlo is a JavaScript type thing - it's a dataflow language that processes inputs/outputs

Jim: what about google dataflow

Alex: I looked at Google Dataflow, it's an API

Jim: it's an important distinction, API vs actual dataflow language, but I mention it for awareness of dataflow in general

Alex: There are lots of people working in data pipelining but there is no way to say 'here is my data pipeline'

Liam: that is (and gives me) an interesting insight - there are two kinds of data flow language
... In the first approach everything is done with dataflow, the second is where you have 'islands' that are steps

Henry: The way I understood, 'we now use right arrow operator'

Alex: the steps continue to be a blackbox
... Gives examples of blackbox steps

<liam> [ pervasive dataflow language example - https://en.wikipedia.org/wiki/Lucid_%28programming_language%29 ]

Alex: if you look at the science workflows they are all dataflows
... mentions Spark, pig in data science workflows
... when I was teaching, my students started to draw data pipelines

Henry: XProc v1 is pretty good, but in the longer term doing anything other than linear flows is unreasonably hard

Alex: We have two fundamental problems, we focused on XML

Liam: I raised that in my reply to Henry's mail

Alex: The other part about xproc is we focused on steps
... and everything became 'steps'
... the design centre revolved around steps

Henry: I agree with the first part- not the rest

Alex: The syntax of the language focused on the steps

Henry: maybe we are saying the same thing
... the hard part is managing more than linear flow

Liam: We spent longer in XQuery because of XQuery scripting
... describes experience with XQuery scripting, spent almost three years
... we should avoid that kind of route

https://lists.w3.org/Archives/Public/public-xml-processing-model-wg/2016Feb/0008.html

Alex: it should be easy, I want a tool to parse an instance of xproc v2 and draw dataflow graph
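A minimal Python sketch of the kind of tool Alex describes, assuming a pipeline has already been parsed into (source, target) connections. The step names and the `to_dot` helper are hypothetical and not part of any XProc draft; the output is Graphviz DOT text that a renderer could draw.

```python
from collections import OrderedDict


def to_dot(edges):
    """Render (source, target) connections as Graphviz DOT text."""
    # Collect nodes in first-seen order so the output is deterministic.
    nodes = OrderedDict()
    for src, dst in edges:
        nodes.setdefault(src, None)
        nodes.setdefault(dst, None)
    lines = ["digraph pipeline {"]
    for name in nodes:
        lines.append(f'  "{name}";')
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical non-linear flow: one output feeds two steps that meet again.
edges = [
    ("$in", "xinclude"),
    ("xinclude", "xslt"),
    ("xinclude", "validate"),
    ("xslt", "$out"),
    ("validate", "$out"),
]
dot = to_dot(edges)
print(dot)
```

If a pipeline cannot be reduced to an edge list like this, that is itself a hint about where the language departs from a plain dataflow graph.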

Henry: The data flow graph is a candidate - abstract semantics
... something that is less detail
... We need something that is more specific than the nice pictures of the xproc v1 spec
... We only designed for the simple case
... we got blindsided because we were focused on the simple case
... (pls <deity/> do we have to do more UML diagrams)
... Layering is going to be absolutely crucial - one of the things we need to get right is that the language that is the interface to the engine is a lot lower level than any sane human would use

Jim: thinks that xproc v1 was working with abstract syntax tree

Henry: It is implicitly there, the way we talked about xproc syntax ... there are a lot of defaulting rules
... XProc's own language is its own compilation target - it's worth not assuming the authoring language is the same thing

Alex: we have operators now and we connect things by operators

Henry goes to the whiteboard

scribe: (then comes back)

Henry: We get to the heart, if it were my choice we should stop this discussion
... The complexity that kills us is the non linear definition

Alex: I do not disagree with you

Henry: I want to see the syntax we will serialise

Alex: We have lots of use cases, we need to get more use cases (non xml, non epub) eg. from machine learning pipelines

<liam> [ graphical pipeline editor with real example - http://3.bp.blogspot.com/-codtjRuVW5I/T4qFXWkFMVI/AAAAAAAAAeo/EdEfvkVeun8/s1600/node4-exp.png ]

Alex: We need to be able to draw a graph as dataflow
... and so here are the ways you can write this down

Henry: That is the next step, take one of the examples and do that
... I want to see what the problem is

Alex: We have a lot of experiences with pipelines, especially on the working group
... What we do not have is enough use cases
... I know where to get more
... I would like to keep our next steps grounded, we take the use case, we push it through our new syntax and see if it works - we cannot go back to first principles
... can a tool spit out a diagram, if it cannot then why ?
... if it's because it's a difficult pipeline (strange, weird) fine

Liam: the reason this works (diagram) is that it is all single layer
... Description is at the right level

Henry now at the whiteboard

Henry: I would like to lay out my thinking

Alex: one point of consensus, we need better use cases ...

<scribe> ACTION: re collate list of use cases [recorded in http://www.w3.org/2016/02/10-xproc-irc]

scribe distracted by getting Norm to join hangout

break for a few mins

<scribe> Scribe: Alex

<scribe> ScribeNick: alexmilowski

The real chair has joined us via a video call.

Henry is at the board …

Henry: like to rethink what the basic elements of the flow ...

<Norm> I'm listening. Heading to other room to check coffee

Henry: Build up the inventory of functionality that we need to map onto the language
... starts with pipes ...
... take the notion of pipeline and make it richer and simpler at the same time
... what naturally flows together but also more at the architecture that actually runs
... reserve the possibility of more explicit layering in the design
... that there is a language not designed for human consumption and there are layers over that...
... pipes that are very detailed and not really hand crafted
... kinds of things: xdm objects, xml documents, other documents, metadata …
... the notion of scope does not fit well with the notion of data flow (e.g. variables versus the data that flows)
... where does the XPath static context come from … what does a variable resolve against in the data flow world?
... maybe contexts flow along with data
... gauges? instrumentation on the pipeline that gives you insight on what is inside the pipe … sampling
... viewport .. you can look inside
... hindsight, we need to treat sets of documents (collections) as first class objects in the architecture … the need to move collections around

Alex: side note, the data science perspective is that you want to process a collection as if you have the whole thing but in actuality, you only have a subset that you can process on a particular computation node

Henry: there are two different related points … thinking about collections as something in the flow.
... difference between compile time sets and runtime sets
... example: splitting index terms by an alphabet (e.g. one output set for each letter)
... at runtime a step can create an output stream for each letter

Alex: the runtime one is like map/reduce
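A small Python sketch of the runtime-set example, assuming index terms arrive as plain strings. The `split_by_initial` helper and the sample terms are illustrative only; the point is that the set of output streams is discovered from the data, not declared up front.

```python
from collections import defaultdict


def split_by_initial(terms):
    """Route each index term to an output stream keyed by its first letter."""
    streams = defaultdict(list)
    for term in terms:
        if not term:
            continue  # skip empty terms; they have no initial letter
        streams[term[0].upper()].append(term)
    return dict(streams)


terms = ["pipeline", "port", "step", "sink", "source", "viewport"]
streams = split_by_initial(terms)

# Which letters get a stream is only known once the data has been seen.
for letter in sorted(streams):
    print(letter, streams[letter])
```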

Henry: steps have signatures, … still important.
... they vary on three dimensions: item type, arity, and style
... item type: what comes down the pipe
... arity - how many on the input/output
... style - events, sets, callbacks, etc.
... every interface that data flows through has signatures
... sinks have to deal with multiplexing, steps do not have to deal with that
... sinks and sources do the 'adaption'
... the source has to do the assembly
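One way to picture the three dimensions of a signature, as a hedged Python sketch. The `Signature` class, the `Style` values and the rule that interfaces connect only when all three dimensions match are assumptions made for illustration; the meeting did not settle on any of them.

```python
from dataclasses import dataclass
from enum import Enum


class Style(Enum):
    EVENTS = "events"
    SETS = "sets"
    CALLBACKS = "callbacks"


@dataclass(frozen=True)
class Signature:
    item_type: str  # what comes down the pipe, e.g. "xml-document"
    arity: int      # how many on this input or output
    style: Style    # how the items are delivered


def compatible(output: Signature, input_: Signature) -> bool:
    """Assumed rule: connect directly only when all three dimensions match."""
    return output == input_


xslt_result = Signature("xml-document", 1, Style.SETS)
xinclude_source = Signature("xml-document", 1, Style.SETS)
event_consumer = Signature("xml-document", 1, Style.EVENTS)

print(compatible(xslt_result, xinclude_source))
print(compatible(xslt_result, event_consumer))
```

Under this reading, a mismatch is where a source or sink would have to do the adaption.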

Alex: In xproc v1, you have to put something in the middle
... In xproc vnext you get that built into the language (do the meet)
... easier to specify in xproc vnext
... when you conceptualise dataflow process and go do it in the language and extract the graph, they are not going to be the same
... the question is 'how are they different'

<Norm> Norm: I don't think meets were especially difficult in XProc1, you didn't need a step, but you did have to name all the steps that were meeting and build the p:pipe list for them.

Henry: Steps get configured at authoring time
... We spent a while trying to get params and options right

Norm: Maps gave us a different and less obscure story to tell about params

scribe buffer full

<Norm> ScribeNick: Norm

Henry: I want to reconsider the whole question of step configuration; it's easy in the simple case in XProc1, it's when you want to do something sophisticated where you fall off the complexity cliff.
... Getting this into the language is going to be tricky.

Alex: What part of configuration isn't covered by maps?

Henry: That's just saying AVTs can represent everything. That's true, but...
... Dependencies and ordering are a big part of the problem. How do you get at what you put on the right hand side of the arrows.

Alex: Config params for XSLT is easy

Henry: I disagree, it's about what you can *use* to construct those parameters that's important.
... What is the XPath context? How do you specify the XPath context for variable references in the expressions which compute the map?

Alex: We have that mechanism now.

Henry: It's full of defaulting.

Some further discussion of maps and parameters. Room for additional understanding/discussion.

Jim: Is it time to move on?

Alex: There are lots of good points here, we need to consider them
... We need to take configuration as you've described it and break it down into different silos.

Henry: Think of a data flow language and now tell me what are the dependencies by which (and where do expressions sit) in the data flow.
... Steps/sinks/sources/adapters/pipes. Where does configuration fit in there? It doesn't have an obvious place to stand.

Alex: Right now that's an implicit input.

Henry: There's a bunch of defaulting going on that needs to be made explicit at this level.

Alex: The story we tell today makes the story more complicated because steps have variables and other implicit inputs

Henry: We're in violent agreement.

Jim: There are lots of languages that escape to Java properties and lots of specs stop at that point.
... I think this is worth having an action.

Liam: It might be worth a few minutes on my responses to Henry.

<jfuller> scribeNick: jfuller

https://lists.w3.org/Archives/Public/public-xml-processing-model-wg/2016Feb/0008.html

Go through Liam's response to Henry

Liam: Has said we need more resources, which I think is true, but when I look at successful languages they were done by 1 or 2
... If we produce something 1% as successful as Perl, that would be not a bad thing
... XQuery has lots of orthogonality ...

(Alex: we never want to go back to parameter entities ... ever again)

Liam: We could get some review with programming language design people
... Adoption of a language is about writing in a syntax
... The biggest question - is how we know it succeeded ... what is the measure of success

Jim: 1000x more users is my metric

Alex: I would like to see more data science users

<alexmilowski> Replace X with J globally!

Norm: I think we want to make a language that is less xml centric but more data agnostic but we do not need to drop the X

Alex: heritage works against us

<Norm> Naming things is bloody difficult. You come up with a better name, I'll sign on.

Liam: I care that vnext is not v1.1

<Norm> W3C DFL

Jim: What other WG groups would have a dataflow

some discussion about where dataflow may emerge in other W3C WG.

Alex: Relying on xquery makes a lot of sense and we can easily shoehorn in json
... it's more xpath

Henry: You cannot use xpath unless you specify how you embed it

Alex: xpath is possibly too much for what we need ...

<scribe> ACTION: discuss about xpath vs xquery and minimal size of exp language [recorded in http://www.w3.org/2016/02/10-xproc-irc]

<scribe> ACTION: identify metric of success [recorded in http://www.w3.org/2016/02/10-xproc-irc]

Norm: tomorrow we say 100x more users, all in agreement ?

Jfuller: none heard

https://github.com/xproc/notes/tree/master/dataflows

take a 10 min break

back from break

Alex will take us through doc

Alex: Will email ebnf to the list

Norm: Might be premature to go through EBNF

Alex: The spec does not describe the story as it needs to describe

Norm: Story will change

Alex has problem with definition of step chain

Henry: I find usage of bind slightly awkward
... means exact opposite of what I was thinking

Alex: as we find these things we should be more permissive

Henry: I dislike having it on the left hand side
... What I would like is $in => source@xslt
... [$in,$style]=>[source,stylesheet]@xslt

Alex: discussing output binding

Norm: if we chain xslt to something else ...

Alex: we are talking about >>

<Norm> Output bindings may be defined at the end of a step chain with the ">>" operator.

Henry: You can dump an output into a variable and then stuff that variable into any number of inputs
... but... are things more symmetrical if you have a document or a seq of docs, dumped into a var ... you may stuff them into an input

Alex: you use the append operator and bind the output to a var

we use >> to mean 'append'
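A rough Python model of the binding behaviour discussed here, assuming `>>` appends to a named variable and that a variable can be read by any number of inputs. The `Bindings` class and its method names are hypothetical; the actual semantics of `>>` were still under discussion.

```python
class Bindings:
    """Named variables, each holding a sequence of documents."""

    def __init__(self):
        self._vars = {}

    def append(self, name, docs):
        """Model `... >> $name`: append docs to the variable's sequence."""
        self._vars.setdefault(name, []).extend(docs)

    def read(self, name):
        """Model `$name` used as an input: return a copy of the sequence."""
        return list(self._vars.get(name, []))


env = Bindings()
env.append("included", ["<doc id='1'/>"])
env.append("included", ["<doc id='2'/>"])

# The same variable can be stuffed into any number of inputs.
first_input = env.read("included")
second_input = env.read("included")
print(first_input, second_input)
```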

<Norm> https://github.com/xproc/notes/issues/5

jfuller: brings up the issue of default context
... we are going to have a problem with $1 in xpath

Liam: version could be distracting

jfuller: As long as we can explain it then I think we are ok

<alexmilowski> [$source,$manifest] → { … }

Henry: add an example source frag
... if we were embedding in python it would be easier

(in response to block expr)

Henry: Overloading of square bracket is an issue

<ht_home> Henry likes Norm's musing, so moves it to the record: [$source,$manifest] -> [source,manifest]{ if ($source/*/@version) ... }

Alex: we left off append operators in the early discussion of step chains on purpose.

jfuller: Do we need to illustrate options now ?

Henry: It's about the flow, stupid ...

Liam: I do not like the unnamed input binding form

["document.xml","style.xsl"] -> xslt()

Norm: They will be all the same in xproc

<ht_home> So why aren't we just using python syntax:

Norm: In the middle of a chain you have to use the anonymous forms

<ht_home> -> xslt(source=$1, stylesheet="style.xsl")

Henry: why not allow pythonic approach

Alex: Norm has a solution to this ....

<Norm> ... -> [$source,$secondary]@xslt(...)

<Norm> Oops. ... -> [source,secondary]@xslt(...)

Henry: Why are we making a distinction ?

Liam: Can you write a function that returns a step ?

Henry: The last point I made on the whiteboard was about 'reflection'

discussion about dynamically creating steps

Liam: I think I understand the last example before Inputs section
... It's really hard to handle messages from XSLT

Norm: in xproc vnext we might define a new output port for xslt messages

Liam: It's unclear

<Norm> Norm: We don't have many steps that have two outputs which in the general case feed into steps with two inputs

picking up where we left off at Inputs

Alex: We build up what a port is bound to

Jim: new issues are at

https://github.com/xproc/notes/issues

jfuller: older issues at
... Norm helpfully elects to migrate them

Norm: for presentation - we know xinclude step does not take a sequence
... Maybe think about automatically accepting sequences
... but that would be confusing for existing users

Alex: some feedback I got from a non xml person, suggested that we make our examples more neutral and less xml centric
... We need to think about all other data formats

Jfuller: we already identified that right ?
... we should define minimal flow language, this just means we can push this out or in when we spec it out

Alex: same feeling about literals
... obvious things that are missing are triples (json-ld, turtle?)
... some things like SPARQL are easier

Norm: we are going to have to say something about extensibility of data model ... need to describe how AVT work
... The only rational description of the data constructor is what is between the delimiters
... and then you interpret that sequence as html, etc

Alex: we may want to go further than that for other media types

Alex and Norm discuss the finer points of data constructor

Liam: what happens here if we do AVT inside data like json

<Norm> https://github.com/xproc/notes/issues/11

Alex: the syntax here is incomplete
... we need to carefully craft a story ...

Henry: we remove it for now,

Alex: we will skip it

Henry: It's not a USP of this proposal

Alex: we know we need literals, we need to do AVT, etc

<liam> [ my quasiquoting proposal from 2013 - https://lists.w3.org/Archives/Member/w3c-xsl-query/2013Nov/0017.html (there was also a version with user-specified delimiters, more like here-documents) ]

Norm: Take it out of the linear flow and put into appendix

jfuller: now onto output bindings

Alex: essentially 'where you are sending the output'
... Using this generally to send things to output ports by reference
... depends on how it is predeclared and I wonder how it will be received

Norm: I am ok with right hand usage (for now)

Alex: $included is effectively being declared, but it is implicit

<Norm> Mmmm. ... >> [$out=result, $chunks=secondary]

jfuller: we are just giving a primary output a name

<Norm> 3 + 4 >> $sum7

Alex: it has nice semantics, in that you can build independent flows

(like my previous 'meet' example)

Alex: we know that ordering is an issue

jfuller: do we have sort ?

Norm: we have an RFP for p:sort

Henry: Sorting is useful and we should cater for the 80% case

Alex: step declarations are rough at the moment

<Norm> $in → [source]{ if ($source/*/@version eq "v1.0")

<Norm> then [$source,"crummy.xsl"] → xslt() ≫ @1

<Norm> else [$source,"better.xsl"] → xslt() ≫ @1 }

<Norm> ≫ $out

<alexmilowski> https://github.com/xproc/notes/blob/master/dataflows/examples/database-import.xpl

now discussing block expressions

Henry: At the flow level we have variables

jfuller: was 'if then' intentional vs 'if then else' ?

Alex: it was intentional

Liam: then there is ambiguity

Liam reminds us of the dangling else problem
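The ambiguity can be shown with a small Python sketch, assuming a tiny conditional tree where a missing else branch produces nothing (`None`). The tuple encoding and the `evaluate` helper are illustrative only and do not reflect the proposal's grammar.

```python
def evaluate(node, env):
    """Evaluate a tiny conditional tree: ("if", cond, then, else) or a leaf."""
    if isinstance(node, tuple) and node[0] == "if":
        _, cond, then_branch, else_branch = node
        chosen = then_branch if env[cond] else else_branch
        return evaluate(chosen, env)
    return node  # leaf: a result name, or None for "produces nothing"


# Source text: if a then if b then x else y
inner_binding = ("if", "a", ("if", "b", "x", "y"), None)  # else belongs to inner if
outer_binding = ("if", "a", ("if", "b", "x", None), "y")  # else belongs to outer if

env = {"a": False, "b": True}
print(evaluate(inner_binding, env))
print(evaluate(outer_binding, env))
```

The same source text gives different results depending on which `if` owns the `else`, which is why an optional else needs an explicit rule or explicit delimiters.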

Norm: we can have a step that produces nothing ... we can finesse it if we need to

Alex: We need conditionals
... We already have variables, do we need let binding ?
... Projections can be a step

Norm: I think it's an antipattern
... We should have a step for each kind

Alex: projections is a nice to have
... iteration is a necessary piece
... replacement is a nice to have

Norm: I think we need viewport

Alex: Do folks use this ?

Norm: I absolutely use it

Alex: tee is neat probably a nice to have

jfuller: raises implicit iteration

Liam: is there order implied ?

Norm: it is a mapping operator

Henry: It must output them in that order

Liam: the term iteration implies ordering

Henry: do we answer this question in xproc v1

Norm: We probably do, but we wanted to allow parallelism

Henry: xproc v1, it says 'each document in turn'
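Ordered output and parallel execution are not in conflict, as this Python sketch suggests. It assumes each document can be processed independently; `slow_step` is a made-up stand-in for a step, with delays chosen so that later documents finish first.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def slow_step(doc):
    """Stand-in for a step; later documents finish sooner."""
    index, name = doc
    time.sleep(0.05 * (3 - index))
    return name.upper()


docs = list(enumerate(["x1.xml", "x2.xml", "x3.xml"]))

with ThreadPoolExecutor(max_workers=3) as pool:
    # Executor.map yields results in input order, not completion order.
    results = list(pool.map(slow_step, docs))

print(results)
```

A mapping operator could make the same promise: results come out in input order, while the work itself may run in any order.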

<Norm> |-

<Norm> -|

Alex points out some meta thoughts for lang design

Murray: I do not quite understand the Tee example

<Norm> $in -> xinclude() >> $temp

<Norm> $temp -> "included.xml"

<Norm> [$temp,"stylesheet.xsl"] -> xslt()

<Norm> >> $out

Alex attempts to explain

<Norm> $temp >> "included.xml"

<Norm> $in -> xinclude() >> $temp

<Norm> $temp >> "included.xml"

<Norm> [$temp,"stylesheet.xsl"] -> xslt()

<Norm> >> $out

Murray: you can use Tee as a manifold
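The corrected example above can be read as the following Python sketch, in which one intermediate result feeds two consumers. The `xinclude` and `xslt` functions are hypothetical stand-ins that only tag their input, and the `stored` dict stands in for writing to "included.xml".

```python
def xinclude(doc):
    """Hypothetical stand-in for the xinclude step."""
    return doc + "+included"


def xslt(doc, stylesheet):
    """Hypothetical stand-in for the xslt step."""
    return f"{doc}|{stylesheet}"


stored = {}  # stands in for writing to "included.xml"


def run(doc_in):
    temp = xinclude(doc_in)              # $in -> xinclude() >> $temp
    stored["included.xml"] = temp        # $temp >> "included.xml"
    return xslt(temp, "stylesheet.xsl")  # [$temp,"stylesheet.xsl"] -> xslt() >> $out


out = run("doc")
print(stored["included.xml"], out)
```

Seen this way, the tee is just a variable read twice, which is the manifold reading.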

Henry: Murray is right

jfuller: Murray's arguments convince me

Norm: I think we should remove it

Alex: If you have a problem with that then we have a problem with iteration
... They are all operators

<ht_home> ("x1.xml","x2.xml") { | xinclude() }

Henry: If I could do that, then I would be happy

<Norm> ("x1.xml", "x2.xml") -> { foreach xinclude() }

<alexmilowski> [$in,”chop.xq”] → xquery() ! { [$1,”chunk.xsl”] → xslt() → @1 } → $out

<Norm> [$in,”chop.xq”] → xquery() ! { [$1,”chunk.xsl”] → xslt() >> @1 } >> $out

<Norm> [$in,”chop.xq”] → xquery() → { foreach [$1,”chunk.xsl”] → xslt() >> @1 } >> $out

<alexmilowski> [$in,”chop.xq”] → xquery() ! xinclude() ≫ $out

<Norm> [$in,”chop.xq”] → xquery() → { if ($1) then $1 → xslt() >> @1 else () } >> $out

Alex: block expressions nest

Norm: curly brackets ought to be lexically scoped

moving on

Alex: we need a way to name flows, reuse them

Henry: Literal strings as inputs or outputs translates to GET/PUT what about string variable references

Alex: You can call your xproc with an option

Norm: externally bound yes, but what about when I want to choose

Henry: Norm is solving a simpler problem

Alex: use let

<Norm> xs:int($1/*/@version) >> $version

Alex and Henry discussing let

<Norm> let $imode := "string" "doc.xml" -> xslt(initial-mode=$imode)

<Norm> () -> { let $imode := "string" "doc.xml" -> xslt(initial-mode=$imode) }

<Norm> () -> { let $imode := "string" { "doc.xml" -> xslt(initial-mode=$imode) } |

<Norm> () -> { let $imode := "string" { "doc.xml" -> xslt(initial-mode=$imode) }

<Norm> () -> { let $imode := "string" { "doc.xml" -> xslt(initial-mode=$imode) } }

<alexmilowski> let $imode := "string" { "doc.xml" -> xslt(initial-mode=$imode) }

<Norm> $imode := "string" { load("doc.xml") -> xslt(initial-mode=$imode) }

Norm: I cannot live with quoted string doing one thing and quoted string doing another thing
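One way to keep the two meanings apart is to mark resource references explicitly, sketched here in Python. The `Read` and `Write` wrappers and the file names are hypothetical; the GET/PUT wording follows Henry's earlier remark about literal strings as inputs or outputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """A name to be dereferenced and read from."""
    href: str


@dataclass(frozen=True)
class Write:
    """A name to be written to."""
    href: str


def describe(binding):
    """Say what a binding means; a bare str stays an ordinary string value."""
    if isinstance(binding, Read):
        return f"GET {binding.href}"
    if isinstance(binding, Write):
        return f"PUT {binding.href}"
    return f"string {binding!r}"


inputs = ["string", Read("doc.xml")]
outputs = [Write("errors.xml")]
for binding in inputs + outputs:
    print(describe(binding))
```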

file://doc.xml

<Norm> $imode := "string" { `doc.xml` -> xslt(initial-mode=$imode) }

<Norm> $imode := "string" { <doc.xml> -> xslt(initial-mode=$imode) }

<ht_home> HST likes ^ and v, one for read-from and the other for write-to

<ht_home> [Read that as up-arrow and down-arrow]

<ht_home> And, crucially, we should _definitely_ support <$schemaFileName

<ht_home> Changing my mind again. Use postfix > for read and prefix > as write, so we have

Norm: We will take the proposal plus comments
... And that's how we will move forward

Alex: The other thing we need to do tomorrow, is what Henry suggested

<ht_home> e.g. [$1,$schemaFileName>] -> xmlschema() >> [@1,>$errorReportFilename]

Alex: We need people's use cases
... What you were trying to do

Henry: and it will be fascinating to see how they go about diagramming them

(just catching up)

Any objection to the current proposal in the context of today's understanding ?

https://github.com/xproc/notes/tree/master/dataflows

none heard

tomorrow we meet

http://www.xmlprague.cz/day1-2016/

at 10:00

Murray: Why the over reliance on $

Henry: This touches upon something I mentioned this morning the idea of typing
... and something we have to come back to, less concerned with naming at the moment

Murray: I want things to flow into a pipe
... is essentially a manifold

jim is back as well

<Norm> Crucially, $in is a source from which 0 or more documents can be read. It might be a constructed document or sequence of documents or it might be the output of some other step

@outputfromfoo -> xinclude() ->

$in <- @outputfromfoo

(from Murray)

["foo.xml"] -> xinclude -> @outputfromfoo

$in <- @outputfromfoo

<alexmilowski> We should contact these people: http://www.xml-project.com/morganaxproc/

<Norm> Yes

- DRAFT -

XML Processing Model WG

Meeting 287, 10 Feb 2016
