HTML Speech Incubator Group Teleconference -- 02 Jun 2011

<burn_> trackbot, start telcon

<trackbot> Date: 02 June 2011

<Robert> can you hear me?

<burn_> Scribe: Michael_Johnston

<burn_> ScribeNick: Michael

<burn_> Agenda: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0006.html

burn: start with review of face to face minutes, will review again next week
... comments on minutes

updated final report document

burn: comments on draft at this point?

all: silence

agreed upon design decisions

additional issues to add to list of issues

michael: does the move to have the emma document in the dom remove the impetus for a json variant of emma?

bjorn: have a simple javascript api for accessing the most common elements, so don't need a json variant of emma; for details can access the emma object

milan: need to do xml parsing?

bodell: will be much the same as other http requests that return xml, don't need to parse

milan: are mobile devices a problem, given the verbosity of xml?

bodell: no

dan: is there pressure from this group to build a json version of emma?

all: agreement: no push for a json version of emma
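
To illustrate the kind of convenience access discussed in the emma exchange above, here is a minimal sketch that pulls the most common fields out of an EMMA document. The sample document, the `top_result` helper and its field names are hypothetical and are not the API the group is designing; only the EMMA namespace and the `emma:tokens` and `emma:confidence` attributes come from the EMMA 1.0 specification.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Hypothetical recognition result; the utterance and values are invented.
SAMPLE_EMMA = """<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="int1" emma:confidence="0.75" emma:tokens="flights to boston">
    <destination>Boston</destination>
  </emma:interpretation>
</emma:emma>"""


def top_result(emma_xml: str) -> dict:
    """Return the utterance text and confidence of the first interpretation."""
    root = ET.fromstring(emma_xml)
    interp = root.find(f"{{{EMMA_NS}}}interpretation")
    if interp is None:
        raise ValueError("no emma:interpretation element found")
    return {
        "utterance": interp.get(f"{{{EMMA_NS}}}tokens"),
        "confidence": float(interp.get(f"{{{EMMA_NS}}}confidence")),
    }


result = top_result(SAMPLE_EMMA)
```

An app that needs more than these two fields would still walk the full document, which matches the "for details can access emma object" position.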

burn: any other issues to add to list for discussion

markup binding

bjorn: no feedback from chrome team yet

bodell: keep html binding lightweight: js constructor, simple "for" mechanism, small work to define; if don't want it then remove the element
... should not mess up js api

olli: problem with the "for" attribute is what it can point to and which elements can be used as a target; doesn't quite work with contenteditable, an important use case
... clarifies issue: need to make clear which elements can be targets and what the semantics are
... also contenteditable areas

michael: have to define semantics when target is e.g. a drop down or radio button

olli: may be new kinds of elements also

bodell: assumption would be to bind to any element, but they would not all have to work
... some browsers would want to handle more input types

olli: reco would be an element in the dom; what is the benefit of the reco element
... if "for" is not used?

bodell: google desire to have an element with a microphone click api

bjorn: have proposed several things along the way,
... most important aspect is to have an element you can click to start speaking without the pop up or info bar

olli: not clear what the element gives

robert: follow up with chrome folks

bjorn: still waiting on that
... do agree that html element discussion does not block the js api discussion

olli: issue may get solved along the way

burn: need to see concrete proposal to make decision

crucial decisions partially discussed

<marc> http://www.w3.org/2005/Incubator/htmlspeech/2011/05/f2fminutes201105.html

burn: will go through each ...

bjorn: audio capture topic is dealt with, should be default way, if there is an audio capture api will deal with then

burn: audio codecs mandatory

robert: even the IP status around speex is unclear
... are the only reasonable answers pcm and mulaw, despite their flaws?

bjorn: flac, high bandwidth

<bringert> FLAC

<bringert> http://en.wikipedia.org/wiki/Free_Lossless_Audio_Codec
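
For a rough sense of the bandwidth at stake in the codec discussion, the raw bitrate of the two uncompressed candidates can be worked out directly. The sample rates and bit depths below (16 kHz 16-bit pcm, 8 kHz 8-bit mulaw) are illustrative assumptions, not values stated on the call.

```python
def bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Raw bitrate of an uncompressed audio stream in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


# Assumed wideband linear pcm: 16 kHz, 16 bits per sample, mono.
pcm_16k = bitrate_kbps(16000, 16)

# Assumed telephony-style mulaw: 8 kHz, 8 bits per sample, mono.
mulaw_8k = bitrate_kbps(8000, 8)
```

Under these assumptions pcm costs 256 kbps and mulaw 64 kbps; a lossless codec such as flac would sit below the pcm figure by an amount that depends on the signal.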

milan: speex is in ietf draft on how to package in rtp

<Milan> http://tools.ietf.org/html/draft-ietf-avt-rtp-speex-07

burn: rtp way to send it does not mean there are ip issues

bjorn: need to require some codecs or cant be interoperable

milan: problem sounds similar

bodell: rfc does not have a patent policy

burn: if something is necessary to implement the spec, and it is encumbered with IP, need to make that clear

bjorn: need protocol for interoperability

milan: protocol for RTC

burn: opus, codecs from two organizations, trying to blend, not clear if IP issues are being resolved, making container
... can use either one if you have permission
... don't have an answer yet, really need one, industry wide problem, may not be ours to solve, return to this

<mbodell> See http://lists.xiph.org/pipermail/speex-dev/2003-November/000753.html for some similar discussion on patent of speex

robert: will follow up re: speex again

milan: impact on protocol team if need to negotiate codec

marc: speex is not good enough for tts

<marc> ogg vorbis

<marc> and FLAC

burn: few names as candidates flac, ogg vorbis, speex, pcm

bjorn: already use flac in launched clients

olli: ogg vorbis is core html audio

<smaug> not core HTML audio. Some browsers just happen to support it

burn: candidates to consider flac, ogg vorbis, speex, pcm

do we support audio streaming and how?

burn: think we expect streaming, less clarity on how

milan: sending audio on regular time intervals as it is collected or generated
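
A minimal sketch of what "sending audio on regular time intervals" could look like on the capture side: the buffer is cut into fixed-duration frames that are handed to the transport one at a time. The 20 ms frame length, the 16 kHz 16-bit format and the `frames` helper are assumptions for illustration; the group has not chosen any of them.

```python
from typing import Iterator


def frames(pcm: bytes, sample_rate_hz: int = 16000,
           bytes_per_sample: int = 2, frame_ms: int = 20) -> Iterator[bytes]:
    """Yield consecutive fixed-duration frames; the last one may be shorter."""
    frame_bytes = sample_rate_hz * frame_ms // 1000 * bytes_per_sample
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


# One second of silence stands in for captured audio.
one_second = bytes(16000 * 2)
chunks = list(frames(one_second))
```

With these assumed values each frame is 640 bytes and one second of audio becomes 50 frames, so the receiver sees data every 20 ms rather than once at the end of the utterance.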

bjorn: discussed how to get events while capturing
... how it is done is a protocol question

burn: asr may begin before the user is
... finished speaking, result before engine comes

milan: without regularly timed packets, won't get events at a regular interval

bjorn: latency is what is app observable

bodell: having multiple events is not a big problem
... data in events can deal with timing

milan: if app is realtime, "five seconds ago got this event" is a problem

bjorn: agree, what we need is low latency, not sure what we can require, part of being a good implementation

burn: market takes care of product requirements

robert: fair to say that standard should not have inherent limitations
... 50 ms or so is the threshold

bjorn: protocol design should not make it impossible to achieve low latency event delivery

marc: audio streaming in the tts case?
... send audio while still rendering the rest of a long utterance

bodell: tts is generally fast enough that this is not a problem

marc: if tts has to process all text before returning audio, could be a problem,
... wants to make sure that what we create here does not prevent an implementation doing this

bjorn: up to engine whether it starts to synthesize

marc: wav format, header has filesize, makes proper streaming difficult

bjorn: protocol should make it possible for the tts to be streamed and start playing before
... synthesis is complete
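
The wav concern comes from the RIFF header, which stores the chunk sizes before any audio data, while a streaming synthesizer does not yet know how much audio it will produce. The sketch below builds the standard 44-byte pcm header and, when the length is unknown, fills both size fields with 0xFFFFFFFF. That placeholder is a common workaround assumed here for illustration; it is not something the group agreed on, and not every player accepts it.

```python
import struct
from typing import Optional

UNKNOWN_SIZE = 0xFFFFFFFF  # assumed placeholder for "length not known yet"


def wav_header(sample_rate_hz: int = 16000, channels: int = 1,
               bits_per_sample: int = 16,
               data_bytes: Optional[int] = None) -> bytes:
    """Build a 44-byte pcm wav header; data_bytes=None means streaming."""
    byte_rate = sample_rate_hz * channels * bits_per_sample // 8
    block_align = channels * bits_per_sample // 8
    if data_bytes is None:
        riff_size = data_size = UNKNOWN_SIZE
    else:
        data_size = data_bytes
        riff_size = 36 + data_bytes  # everything after the 8-byte RIFF preamble
    return (
        b"RIFF" + struct.pack("<I", riff_size) + b"WAVE"
        + b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels, sample_rate_hz,
                                byte_rate, block_align, bits_per_sample)
        + b"data" + struct.pack("<I", data_size)
    )


streaming_header = wav_header()
complete_header = wav_header(data_bytes=3200)
```

A container or framing without an up-front length avoids the problem altogether, which is one reason the protocol, rather than the file format, is the natural place to solve streaming.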

burn: issue of supporting format coming back in video
... and playing the audio

bjorn: should not require playing audio from video

robert: api should not prevent this

burn: video with three audio tracks, how does the api select?

robert: our proposal separated capture api from reco, could support different kinds of capture

burn: protocol design should not preclude streaming of video codecs

raj: why specify video?

robert: if codec can be packetized in real time should be ok

burn: the protocol should not inhibit the transmission of codecs that have similar requirements to audio?

What is meant by "start of speech", "end of speech", and endpointing in general? How do transmission delays affect the definitions and what we want in terms of APIs?

bodell: issue of latency impacting times

bjorn: agreed UA being basis for the clock

burn: don't have requirements for timing info from server

bodell: tts case?

bjorn: seems reasonable for server to include timing info

robert: could do offset from start

burn: something that UA can convert into UA local timestamp
... different ways to achieve that
... doesn't say what is made available in the api
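
One way to read the "offset from start" suggestion, sketched under assumptions: the server reports each event as a millisecond offset from the start of the audio it received, and the UA adds that offset to its own clock reading taken when capture began. The `ServerEvent` type, the event name and the millisecond units are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServerEvent:
    name: str          # hypothetical event name
    offset_ms: float   # offset from the start of the audio stream


def to_local_timestamp_ms(capture_start_local_ms: float, event: ServerEvent) -> float:
    """Convert a server-relative offset into a UA-local timestamp."""
    return capture_start_local_ms + event.offset_ms


capture_start = 1_000_000.0                     # UA clock when capture began
speech_start = ServerEvent("speechstart", 350.0)
local_ts = to_local_timestamp_ms(capture_start, speech_start)
```

Because only offsets cross the wire, the two clocks never need to be synchronized and transmission delay does not shift the reported time.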

bodell: many different times, when the utterance starts, etc.
... when received

marc: impact on order that events are received

milan: will UA generate these events when using remote service

bodell: may assume energy detector gives you end of speech, before reco gives end of speech, hard to guarantee order

milan: start of energy is different than start of speech
... hard to write web app if get two start of speech events

bodell: different events
... was fixed order for the non-continuous case

charles: could arrange fixed order delivery, even if times inside do not reflect this

bodell: not practical to hold events and put them in the desired order

burn: energy detector gets end of sound, then will get actual end of speech with better timing info, either get two or throw away better info

marc: don't want to override better info from remote service

burn: front end is for optimization so don't have to send all the audio

bodell: events could be in different orders
... not convinced in having standard order

milan: UA only has sound start, sound end
... avoid duplication

bodell: already have different event names

robert: in the names need to make clear that some events are from the energy
... detector and others are from speech reco

milan: source of events

bodell: unmake statement about specific ordering

milan: new statement that the events the user agent can insert are energy related events

marc: and probably capture start and end

charles: seems strong since speech service might or might not be remote

burn: removed ordering
... energy detector can only generate sound start stop
... speech service can only deliver the speech start stop

charles: if not order, can delivery be guaranteed?

burn: how to guarantee it?

milan: as long as have single source for events

michael: (need a blackboard for this)

bodell: solved by removing required ordering
... allows all the use cases
... also works with continuous case
... thought had solved the issue

burn: but start before end?
... can get end without having seen a start

milan: reluctant to give up the ordering, if have single source for each type of event

burn: agreed speech service can only generate one, can't guarantee that they won't cross in time

milan: use remote speech service as the canonical

bodell: easiest to understand cross for end, UA would raise both events in the order they occurred

milan: it is possible to impose an ordering
... pros and cons, flexibility, or predictability for the web app developer

bjorn: events from the same source should be in the same order
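
The closing position, that events from the same source keep their order while no order is imposed across sources, can be sketched as a small dispatcher. It assumes each source numbers its own events from zero; the `PerSourceOrderer` class, the sequence numbers and the event names are all hypothetical, and nothing here was agreed on the call.

```python
from collections import defaultdict


class PerSourceOrderer:
    """Release events in sequence order per source, with no cross-source order."""

    def __init__(self):
        self._next_seq = defaultdict(int)   # next expected number per source
        self._held = defaultdict(dict)      # early arrivals waiting per source

    def push(self, source: str, seq: int, name: str) -> list:
        """Accept one event and return the events that can now be delivered."""
        self._held[source][seq] = name
        ready = []
        while self._next_seq[source] in self._held[source]:
            ready.append((source, self._held[source].pop(self._next_seq[source])))
            self._next_seq[source] += 1
        return ready


orderer = PerSourceOrderer()
arrivals = [
    ("energy", 0, "soundstart"),
    ("speech", 1, "speechend"),    # arrives before its own start event
    ("energy", 1, "soundend"),
    ("speech", 0, "speechstart"),
]
delivered = []
for source, seq, name in arrivals:
    delivered += orderer.push(source, seq, name)
```

In this run the energy detector's events pass straight through, and the early "speechend" is held only until the "speechstart" from the same source arrives, so a start is never delivered after its end.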
