Meeting minutes
Model Execution API
anssik: Discuss proposed improvements to the existing execution API
ningxin_hu: current discussion is around Ping's first comment
today's execution API requires the user to provide an output buffer at execution time, so they need to know its shape and allocate a buffer of the right size
… Ping identified that as an ergonomic issue
… also in some models shape is dynamic, not known beforehand, as identified by Rama
… need to address these two issues in the current execution API
… also Chai proposed how to simplify the execution interface, under discussion currently
… Chai, other perspectives?
Chai: this is a long thread, but I'm glad we had this discussion
… to add to what Ningxin described, the original ask is very specific, but later in the discussion we touch on other related topics
… this is becoming an API discussion, all interlinked
… supporting dynamic input shapes and output buffers is reasonable, we should support that
… we're almost in agreement with respect to these points
… if we look at the last replies in this issue, we can conclude we're on the same page
… related to the topic of how to simplify, I think Kenneth has raised many good points around why we need the compilation step
… he raises good discussion points; it is good for perf to have a compilation step, the app has more control
… but we should be able to collapse compilation and execution into one
… another related topic is whether we want to support eager execution in the future
… Ningxin told me at one point we had that discussion in this group, I wasn't in this group at that time
… if we want to do that, it should be natural
… to support eager we shouldn't need to change everything
… simplifying compilation and execution will help make the API more amenable to eager
anssik: can we split other issues out from this one? e.g. for eager execution we do have a past issue
Chai: good to have all the discussion in context in this issue
ningxin_hu: not sure about quantized support?
Chai: we have one issue for that, let's lean on it; there isn't specifically an issue for adding quantization support to the API
… can we open a new issue or piggyback on the existing one?
ningxin_hu: today's spec has some kinds of quantization support, I suggest we comment on that issue
… to see if we should remove quantization from Operand
Chai: fine either way
ningxin_hu: float32 and scalar are represented by Operand, that might be confusing
Chai: that'd be a separate issue
Ping: I'll go back to issue #87 and review the feedback
Rama: I just looked at #87, it seems execution of a subgraph has not been discussed yet?
anssik: has subgraph execution been discussed yet?
Chai: I need more information to understand this, in the new API compilation is immutable
… if you want to compile a subgraph, it creates a separate compilation
… compiling part of the graph only should be already solved, as a byproduct of where we arrive now
<ningxin_hu> Here are my comments regarding subgraph execution in the polyfill PR review: https://github.com/webmachinelearning/webnn-polyfill/pull/1#issuecomment-689939624
https://github.com/webmachinelearning/webnn/issues/87
Ping: current compilation is static, cannot be changed; my question is, what if people want to execute a subgraph, extract the feature vector, execute somewhere in the middle before your softmax
… either the user has to execute the whole graph, or you allow people to execute the subgraph
… I think this is normal to have this situation
… or sometimes, you repeatedly feed a layer
… how can the API be able to handle this type of use cases?
Chai: thanks, this is more clear now
… can you explain the use case a bit more?
… are you thinking of transfer learning?
Ping: let's say MobileNet is not good for classification as is, but people use transfer learning
… you need to be able to execute toward a node that's not the output of the model
… is that clear or no?
Chai: I understand this now
ningxin_hu: actually, you raised this issue in the polyfill PR review, I had some comments there
<ningxin_hu> https://github.com/webmachinelearning/webnn-polyfill/pull/1#issuecomment-689939624
ningxin_hu: to my understanding you can create a subgraph directly
… e.g. in the LeNet example, before matmul layers you can create the graph
… that's the feature extractor you can create and reduce weight there
… perhaps that satisfies the requirement?
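[Editor's note: Ningxin's suggestion above can be sketched with a toy graph builder. All names here are hypothetical, not the actual WebNN API: because the developer constructs the graph op by op, a "feature extractor" is simply a graph whose declared output is an intermediate node.]

```javascript
// Toy builder, not the real WebNN API, illustrating that compiling
// toward an intermediate node yields a smaller graph (the feature
// extractor) than compiling the full model.
class ToyBuilder {
  constructor() { this.ops = []; }
  input(name) { return { name, deps: [] }; }
  op(kind, ...deps) {
    const node = { name: `${kind}_${this.ops.length}`, deps };
    this.ops.push(node);
    return node;
  }
  // "Compile" only the nodes reachable from the requested output.
  compile(output) {
    const needed = new Set();
    const visit = (n) => {
      if (needed.has(n.name)) return;
      needed.add(n.name);
      n.deps.forEach(visit);
    };
    visit(output);
    return { outputName: output.name, nodeCount: needed.size };
  }
}

const b = new ToyBuilder();
const x = b.input('x');
const conv = b.op('conv2d', x);
const pool = b.op('maxPool2d', conv);
const features = b.op('reshape', pool);   // intermediate node
const logits = b.op('matmul', features);  // full-model output

// Compiling with `features` as the output yields the feature extractor
// without touching the final matmul layer.
const extractor = b.compile(features);
const full = b.compile(logits);
console.log(extractor.nodeCount, full.nodeCount); // prints "4 5"
```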
Ping: that could be the resolution?
… many users of the models do not have an ability to create another model, they just take a pre-trained model and use it as is, fine-tune its feature vector
… this scenario should be solved by execution API itself, not by creating a new model
… if we can support subgraph execution, there are other cases where people want to execute just one layer
… fine-tuning is widely used scenario
Chai: I think this is the same request as eager execution
… the requirement is very similar to making the API support eager mode
Ping: to me they are different
… in eager I'm executing much faster, I'm not dynamically creating a graph, I have a pre-trained model, I want to compile it to make it faster
… that's part of the origin model
ningxin_hu: my understanding, as illustrated in the example, is that the developer has the flexibility to create the graph as they want
… the whole topology is available, it is up to the developer to create any model and compile and execute it
… do you want to pick some intermediate nodes to get their outputs?
Ping: like you described, but more flexible way to define input and output nodes
… WebNN does not necessarily need that since we are use case-driven in the API design
… as a JS dev, I'd want to execute a part of a pre-trained model
Chai: you want the ability to execute only part of the pre-trained model? If so, I think the latest changes discussed in this issue address this
Rama: the interface is sufficient, but can you specify intermediate values as inputs and outputs?
Chai: at some point we discussed optional output argument on the execute method
Rama: the API signature does not change, but can we assume the outputs can be intermediate values? That impacts implementations
Rama: the underlying implementation must do something smart if it needs to stop somewhere in the middle
… e.g. remove unnecessary nodes before execution
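[Editor's note: Rama's point about removing unnecessary nodes can be sketched as follows; the data structures and function name are illustrative, not part of the WebNN spec. Walking backwards from the requested outputs yields the minimal execution plan.]

```javascript
// Toy compiled graph: each node lists its dependencies by name.
const graph = {
  nodes: {
    conv:    { deps: ['input'] },
    pool:    { deps: ['conv'] },
    fc:      { deps: ['pool'] },
    softmax: { deps: ['fc'] },
  },
};

// Keep only the nodes reachable backwards from the requested outputs;
// everything else can be dropped from the execution plan.
function prune(graph, outputs) {
  const keep = new Set();
  const visit = (name) => {
    if (keep.has(name) || !graph.nodes[name]) return;
    keep.add(name);
    graph.nodes[name].deps.forEach(visit);
  };
  outputs.forEach(visit);
  return [...keep];
}

// Asking for the intermediate 'pool' output drops fc and softmax.
console.log(prune(graph, ['pool']));    // prints [ 'pool', 'conv' ]
console.log(prune(graph, ['softmax'])); // all four nodes
```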
Ping: one question regarding output shape
… now compilation is done prior to execution; without knowing the subgraph, how do we satisfy the new execution plan(?)
ningxin_hu: the compute method can return the result with an output dictionary that has dimensions and buffer, you can specify the input dimensions and shape
<Chai> afk
Ping: is it true that every time I ask, you compile?
ningxin_hu: in today's spec, an Operand can have a negative value in one dimension to say it's not specified
… when you compute, you specify the input dimensions in a concrete shape, and compute will infer the output shape and return it to you
Ping: how we usually handle that is that compilation is part of the execution
… compilation is cached
… also the shape does not need to be set beforehand
… the main concern was that there are a lot of pre-steps needed prior to execution, which may not be known; you also need to find out what the output shape is
… because the shape is dynamic and it is tedious for the user to do that
ningxin_hu: exactly, you're discussing today's execution API and we came up with a solution to that issue
… no need to know the shape of the output beforehand
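[Editor's note: the behaviour Ningxin describes can be sketched as follows, using a hypothetical helper rather than the spec's actual interface. A dimension declared as -1 is resolved by the concrete input shape at compute time, and compute returns the inferred output dimensions alongside the buffer.]

```javascript
// Hypothetical sketch: a matmul whose first input was declared with a
// dynamic dimension (-1). The concrete shape arrives at compute time,
// and the result carries both the inferred dimensions and the buffer.
function computeMatmul(declaredAShape, a, aShape, bShape) {
  // Check the concrete shape against the declared one (-1 matches anything).
  declaredAShape.forEach((d, i) => {
    if (d !== -1 && d !== aShape[i]) throw new Error('shape mismatch');
  });
  // Infer the output shape: [M, K] x [K, N] -> [M, N].
  if (aShape[1] !== bShape[0]) throw new Error('inner dimensions differ');
  const dimensions = [aShape[0], bShape[1]];
  const buffer = new Float32Array(dimensions[0] * dimensions[1]);
  // (Actual multiplication omitted; the point is the returned shape.)
  return { dimensions, buffer };
}

// Declared as [-1, 3]: the batch size is unknown until compute time.
const out = computeMatmul([-1, 3], new Float32Array(6), [2, 3], [3, 4]);
console.log(out.dimensions); // prints [ 2, 4 ]
```

So the caller never allocates the output buffer up front; they read the inferred shape off the result instead.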
Packing operations for gemm / matmul
anssik: Discuss optional packing ops and related optimization opportunities
Packing operations for gemm / matmul #86
Rama: the issue is the use of constant operands to operations like GEMM, matmul, where there is an opportunity to transform the layout as an optimization step
… question was, should this be explicitly exposed through the API?
<Chai> back now
Rama: my position is better left as an implementation detail
<ningxin_hu> +1
Chai: I agree with Rama
… most of the issues re packing have been addressed
… process of packing is very hardware specific
Chai: I'll respond on the issue to comment on the group's position
Fingerprinting
anssik: Discuss possible fingerprinting vectors and mitigations
"an efficient matmul implementation can be fingerprinted to determine hardware capabilities."
ningxin_hu: some Intel hw was mentioned, so I can follow up from that perspective
… another comment, Kenneth mentions 8bit multiplication, this is related to our quant design
… related to our quantization operator design, as discussed in packing, we can hide this in implementation
… let me follow up from Intel hardware perspective
anssik: any other comments on the fingerprinting issue?
[none heard]
WebNN polyfill and samples
anssik: Continue discuss review feedback and suggestions for the foundational implementation and LeNet sample
Add the foundation implementation #1
ningxin_hu: addressed the comments raised by Ping for the WebNN polyfill
… also created separate issues for a couple of them
… Node.js support was a topic in the workshop, last week enabled the polyfill for Node.js running Mocha tests
… update to the LeNet example, added a Table of the LeNet topology
TAG review
https://github.com/webmachinelearning/webnn/blob/master/explainer.md
anssik: TAG review would depend on a more complete explainer https://github.com/webmachinelearning/webnn/blob/master/explainer.md
Chai: the explainer would likely need some code snippets to explain the API, but given we're changing the API a little, it's better to land those API changes first and then update the explainer
anssik: Sangwhan is a good person to review our explainer PRs
ningxin_hu: proposal to not block the polyfill and example PRs on spec changes
… then revise them based on the new API design
<Chai> that works
anssik: any concerns with that proposal from Ningxin?
<Chai> i can help with the explainer
[no concerns]
PROPOSED RESOLUTION: Land WebNN polyfill and samples PRs when existing review comments have been addressed, do not block on in-flight spec PRs and API design discussion
Resolution: Land WebNN polyfill and samples PRs when existing review comments have been addressed, do not block on in-flight spec PRs and API design discussion